NeuralSP: Neural network based Speech Processing

How to install

# Set path to CUDA, NCCL
CUDAROOT=/usr/local/cuda
NCCL_ROOT=/usr/local/nccl

export CPATH=$NCCL_ROOT/include:$CPATH
export LD_LIBRARY_PATH=$NCCL_ROOT/lib/:$CUDAROOT/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=$NCCL_ROOT/lib/:$LIBRARY_PATH
export CUDA_HOME=$CUDAROOT
export CUDA_PATH=$CUDAROOT
export CPATH=$CUDA_PATH/include:$CPATH  # for warp-rnnt

# Install miniconda, python libraries, and other tools
cd tools
make KALDI=/path/to/kaldi

Key features

Corpus

  • ASR

    • AISHELL-1
    • AMI
    • CSJ
    • Librispeech
    • Switchboard (+ Fisher)
    • TEDLIUM2/TEDLIUM3
    • TIMIT
    • WSJ
  • LM

    • Penn Tree Bank
    • WikiText2

Front-end

  • Frame stacking
  • Sequence summary network [link]
  • SpecAugment [link]
  • Adaptive SpecAugment [link]
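
The SpecAugment item above refers to masking random frequency bands and time steps of the input features during training. Below is a minimal sketch (frequency and time masking only, without time warping); it is illustrative and not NeuralSP's actual implementation, and the default mask widths are placeholder values.

import torch

def spec_augment(x, num_freq_masks=2, max_freq_width=27,
                 num_time_masks=2, max_time_width=100):
    # x: log-mel features of shape (n_frames, n_mels); widths are illustrative
    x = x.clone()
    n_frames, n_mels = x.shape
    for _ in range(num_freq_masks):
        width = int(torch.randint(0, max_freq_width + 1, (1,)))
        start = int(torch.randint(0, max(1, n_mels - width), (1,)))
        x[:, start:start + width] = 0.0      # zero out a band of mel channels
    for _ in range(num_time_masks):
        width = int(torch.randint(0, max_time_width + 1, (1,)))
        start = int(torch.randint(0, max(1, n_frames - width), (1,)))
        x[start:start + width, :] = 0.0      # zero out a span of frames
    return x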

Encoder

  • RNN encoder
    • (CNN-)BLSTM, (CNN-)LSTM, (CNN-)BLGRU, (CNN-)LGRU
    • Latency-controlled BRNN [link]
    • Random state passing (RSP) [link]
  • Transformer encoder [link]
    • Chunk hopping mechanism [link]
    • Relative positional encoding [link]
    • Causal mask
  • Conformer encoder [link]
  • Time-depth separable (TDS) convolution encoder [link] [link]
  • Gated CNN encoder (GLU) [link]
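
As a rough picture of the RNN encoder family above, here is a minimal BLSTM encoder with frame skipping for subsampling. This is a sketch only: class and parameter names are hypothetical, and NeuralSP's encoders additionally support CNN front-ends, latency control, chunked processing, and so on.

import torch
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    # Hypothetical illustrative module, not NeuralSP's encoder class.
    def __init__(self, input_dim=80, hidden_dim=512, num_layers=4, subsample=4):
        super().__init__()
        self.subsample = subsample
        self.blstm = nn.LSTM(input_dim, hidden_dim, num_layers,
                             batch_first=True, bidirectional=True)

    def forward(self, x):
        # Frame skipping: keep every `subsample`-th frame to shorten the sequence
        # (frame stacking would concatenate neighbouring frames instead).
        x = x[:, ::self.subsample, :]
        out, _ = self.blstm(x)               # (batch, n_frames/subsample, 2*hidden_dim)
        return out

enc = BLSTMEncoder()
feats = torch.randn(2, 300, 80)              # two utterances, 300 frames, 80 mels
print(enc(feats).shape)                      # torch.Size([2, 75, 1024])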

Connectionist Temporal Classification (CTC) decoder

  • Beam search
  • Shallow fusion
  • Forced alignment
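
For orientation, the simplest CTC decoding strategy is best-path (greedy) decoding: take the arg-max label at each frame, collapse repeats, and drop blanks. The beam search and shallow fusion listed above extend this by keeping multiple prefixes and adding LM scores. A minimal sketch (not NeuralSP's decoder API):

import torch

def ctc_greedy_decode(log_probs, blank=0):
    # log_probs: (n_frames, vocab) CTC output for one utterance
    best = log_probs.argmax(dim=-1).tolist()     # most likely label per frame
    hyp, prev = [], blank
    for p in best:
        if p != blank and p != prev:             # collapse repeats, skip blanks
            hyp.append(p)
        prev = p
    return hyp

log_probs = torch.randn(50, 30).log_softmax(-1)  # 50 frames, 30-symbol vocab
print(ctc_greedy_decode(log_probs))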

RNN-Transducer (RNN-T) decoder [link]

  • Beam search
  • Shallow fusion
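
An RNN-T scores every (time, label) grid point by combining encoder states with prediction-network states in a joint network. Below is a minimal sketch of such a joint network with hypothetical dimensions; it is not NeuralSP's module layout.

import torch
import torch.nn as nn

class RNNTJoint(nn.Module):
    # Illustrative joint network: additive combination followed by a projection.
    def __init__(self, enc_dim=512, pred_dim=512, joint_dim=512, vocab=30):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, joint_dim)
        self.w_pred = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab)

    def forward(self, h_enc, h_pred):
        # h_enc: (batch, T, enc_dim), h_pred: (batch, U, pred_dim)
        z = torch.tanh(self.w_enc(h_enc).unsqueeze(2) +   # (batch, T, 1, joint_dim)
                       self.w_pred(h_pred).unsqueeze(1))  # (batch, 1, U, joint_dim)
        return self.out(z)                                # (batch, T, U, vocab)

joint = RNNTJoint()
print(joint(torch.randn(2, 75, 512), torch.randn(2, 20, 512)).shape)
# torch.Size([2, 75, 20, 30])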

Attention-based decoder

  • RNN decoder
    • Shallow fusion
    • Cold fusion [link]
    • Deep fusion [link]
    • Forward-backward attention decoding [link]
    • Ensemble decoding
  • Attention type
    • location-based
    • content-based
    • dot-product
    • GMM attention
  • Streaming RNN decoder specific
    • Hard monotonic attention [link]
    • Monotonic chunkwise attention (MoChA) [link]
    • Delay constrained training (DeCoT) [link]
    • Minimum latency training (MinLT) [link]
    • CTC-synchronous training (CTC-ST) [link]
  • Transformer decoder [link]
  • Streaming Transformer decoder specific
    • Monotonic Multihead Attention [link] [link]
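
As a reference point for the attention types above, here is a minimal sketch of content-based (additive) attention over encoder states, as used in LAS-style RNN decoders; location-based attention would additionally feed the previous attention weights into the score. Names and dimensions are illustrative, not NeuralSP's API.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    # Illustrative content-based attention (Bahdanau-style).
    def __init__(self, enc_dim=512, dec_dim=512, attn_dim=256):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim)
        self.w_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, enc_out, dec_state):
        # enc_out: (batch, T, enc_dim), dec_state: (batch, dec_dim)
        score = self.v(torch.tanh(self.w_enc(enc_out) +
                                  self.w_dec(dec_state).unsqueeze(1)))  # (batch, T, 1)
        weights = torch.softmax(score.squeeze(-1), dim=-1)              # (batch, T)
        context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)   # (batch, enc_dim)
        return context, weights

attn = AdditiveAttention()
ctx, w = attn(torch.randn(2, 75, 512), torch.randn(2, 512))
print(ctx.shape, w.shape)   # torch.Size([2, 512]) torch.Size([2, 75])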

Language model (LM)

  • RNNLM (recurrent neural network language model)
  • Gated convolutional LM [link]
  • Transformer LM
  • Transformer-XL LM [link]
  • Adaptive softmax [link]
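
A minimal RNNLM sketch (embedding, LSTM, projection to the vocabulary) with hypothetical dimensions; NeuralSP's LMs additionally support features such as caching and adaptive softmax listed above.

import torch
import torch.nn as nn

class RNNLM(nn.Module):
    # Illustrative recurrent LM, not NeuralSP's LM class.
    def __init__(self, vocab=10000, emb_dim=512, hidden_dim=1024, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) token ids; state carries over between chunks
        out, state = self.lstm(self.embed(tokens), state)
        return self.proj(out), state          # logits: (batch, seq_len, vocab)

lm = RNNLM()
logits, _ = lm(torch.randint(0, 10000, (2, 20)))
print(logits.shape)                            # torch.Size([2, 20, 10000])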

Output units

  • Phoneme
  • Grapheme
  • Wordpiece (BPE, sentencepiece)
  • Word
  • Word-char mix
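
Wordpiece units are typically built with sentencepiece from the training transcripts. A minimal sketch follows; the file names and vocabulary size are placeholders, not NeuralSP's recipe defaults.

import sentencepiece as spm

# Train a BPE model on the training transcripts (one sentence per line);
# 'train_text.txt' and the vocabulary size are placeholder choices.
spm.SentencePieceTrainer.train(
    input='train_text.txt', model_prefix='bpe1k',
    vocab_size=1000, model_type='bpe')

sp = spm.SentencePieceProcessor(model_file='bpe1k.model')
print(sp.encode('switchboard conversational speech', out_type=str))
# prints the wordpiece segmentation, e.g. ['▁switch', 'board', ...]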

Multi-task learning (MTL)

Multi-task learning (MTL) with different units is supported to alleviate data sparseness.

  • Hybrid CTC/attention [link]
  • Hierarchical Attention (e.g., word attention + character attention) [link]
  • Hierarchical CTC (e.g., word CTC + character CTC) [link]
  • Hierarchical CTC+Attention (e.g., word attention + character CTC) [link]
  • Forward-backward attention [link]
  • LM objective
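
For example, the hybrid CTC/attention objective above combines a CTC loss on the encoder outputs with the attention decoder's cross-entropy, weighted by a tunable factor. A minimal sketch, with a hypothetical ctc_weight value:

import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, input_lens, att_logits,
                              targets, target_lens, ctc_weight=0.3, blank=0):
    # CTC branch: ctc_log_probs is (T, batch, vocab); targets is (batch, U),
    # padded with `blank` beyond each entry of target_lens.
    loss_ctc = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens,
                          blank=blank, zero_infinity=True)
    # Attention branch: token-level cross-entropy with padded positions masked out.
    vocab = att_logits.size(-1)
    mask = (torch.arange(targets.size(1), device=targets.device)[None, :]
            < target_lens[:, None])
    ce = F.cross_entropy(att_logits.reshape(-1, vocab), targets.reshape(-1),
                         reduction='none')
    ce = (ce * mask.reshape(-1)).sum() / target_lens.sum()
    # Weighted interpolation of the two objectives (ctc_weight is illustrative).
    return ctc_weight * loss_ctc + (1 - ctc_weight) * ce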

ASR Performance

AISHELL-1 (CER)

Model         | dev | test
Conformer LAS | 4.1 | 4.5
Transformer   | 5.0 | 5.4
Streaming MMA | 5.5 | 6.1

CSJ (WER)

Model          | eval1 | eval2 | eval3
Conformer LAS  | 5.7   | 4.4   | 4.9
BLSTM LAS      | 6.5   | 5.1   | 5.6
LC-BLSTM MoChA | 7.4   | 5.6   | 6.4

Switchboard 300h (WER)

Model     | SWB | CH
BLSTM LAS | 9.1 | 18.8

Switchboard+Fisher 2000h (WER)

Model     | SWB | CH
BLSTM LAS | 7.8 | 13.8

Librispeech (WER)

Model          | dev-clean | dev-other | test-clean | test-other
Conformer LAS  | 2.0       | 4.8       | 2.1        | 5.2
Transformer    | 2.1       | 5.3       | 2.4        | 5.7
BLSTM LAS      | 2.5       | 7.2       | 2.6        | 7.5
BLSTM RNN-T    | 2.9       | 8.5       | 3.2        | 9.0
UniLSTM RNN-T  | 3.7       | 11.7      | 4.0        | 11.6
UniLSTM MoChA  | 4.1       | 11.0      | 4.2        | 11.2
LC-BLSTM RNN-T | 3.3       | 9.8       | 3.5        | 10.2
LC-BLSTM MoChA | 3.3       | 8.8       | 3.5        | 9.1
Streaming MMA  | 2.5       | 6.9       | 2.7        | 7.1

TEDLIUM2 (WER)

Model          | dev  | test
Conformer LAS  | 7.1  | 7.1
BLSTM LAS      | 8.1  | 7.5
LC-BLSTM RNN-T | 8.0  | 7.7
LC-BLSTM MoChA | 10.3 | 8.6
UniLSTM RNN-T  | 10.7 | 10.7
UniLSTM MoChA  | 13.5 | 11.6

WSJ (WER)

Model     | test_dev93 | test_eval92
BLSTM LAS | 8.8        | 6.2

LM Performance

Penn Tree Bank (PPL)

Model       | valid | test
RNNLM       | 87.99 | 86.06
+ cache=100 | 79.58 | 79.12
+ cache=500 | 77.36 | 76.94

WikiText2 (PPL)

Model        | valid  | test
RNNLM        | 104.53 | 98.73
+ cache=100  | 90.86  | 85.87
+ cache=2000 | 76.10  | 72.77

Reference

Dependency

