
wav2letter++: Facebook’s fast open-source speech recognition system


Presentation Transcript


  1. wav2letter++: Facebook’s fast open-source speech recognition system • Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve • Vitaliy Liptchinsky, Ronan Collobert • Facebook AI Research Several slides borrowed from Ronan Collobert, who was not hurt in the process

  2. Agenda • Research • Automatic Speech Recognition: overview and how it works • Acoustic model architectures • Training: ASG loss vs CTC loss (criteria) • Language models, decoding • Toolkit • Overview, design • Flashlight • Benchmarks • Word Error Rate • Training/decoding speed

  3. Research

  4. automatic speech recognition • Classical pipeline (diagram): features → acoustic model → phonetic dictionary • Example: ðə kæt sæt → the cat sat

  5. end-to-end speech recognition • Pipeline (diagram): features → acoustic model → decoder + language model → the cat sat • End-to-end training questions: Can we make it simple? Can we make it differentiable? Better but scalable? Can we train them?

  6. end-to-end speech recognition • Pipeline recap (diagram): features → acoustic model → decoder + language model → the cat sat

  7. features • Trainable front-end: train the features too • Approximates log-mel filterbanks at initialization • Trained with the rest of the network • “Learning filterbanks from raw speech for phone recognition”, Zeghidour et al., ICASSP 2018 • “End-to-end speech recognition from the raw waveform”, Zeghidour et al., Interspeech 2018

  8. end-to-end speech recognition • Pipeline recap (diagram): features → acoustic model → decoder + language model → the cat sat

  9. acoustic model • How it works • A neural network scores each letter for every input frame (diagram: one network column per frame) • Duration model: take the max-scoring letter per frame, then collapse repetitions, e.g. |||the|caat|||ssaattt| → |the|cat|sat| • Let’s remember that | stands for silence
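
A minimal sketch of that collapse step (plain C++, not the wav2letter++ API; collapseRepetitions is a hypothetical name):

    #include <iostream>
    #include <string>

    // Minimal sketch of the collapse step (not the wav2letter++ API):
    // drop consecutive duplicate letters, so a frame-level output like
    // "|||the|caat|||ssaattt|" becomes "|the|cat|sat|" (| is silence).
    std::string collapseRepetitions(const std::string& frames) {
        std::string out;
        for (char c : frames) {
            if (out.empty() || out.back() != c) {
                out.push_back(c);
            }
        }
        return out;
    }

    int main() {
        std::cout << collapseRepetitions("|||the|caat|||ssaattt|") << "\n";
        // prints: |the|cat|sat|
    }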

  10. acoustic model • Make it simple • Gated convnet block: features → 1D ConvNet → GLU → Dropout • Gated Linear Units (GLU): • Address the vanishing gradient problem • Successful application to NLP problems • Gates: h(X) = (X∗W + b) ⊗ σ(X∗V + c), where ⊗ is the element-wise product between matrices • “Language Modeling with Gated Convolutional Networks”, Dauphin et al., ICML, 2017 • “Letter-Based Speech Recognition with Gated ConvNets”, Liptchinsky et al., arXiv 2017
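
A minimal sketch of the GLU gating (my helper, not the Flashlight module; it assumes the preceding convolution outputs 2×C channels per time step):

    #include <cmath>
    #include <vector>

    // Sketch of GLU gating (Dauphin et al., 2017): the first C channels
    // are the linear path (X*W + b); the last C, squashed through a
    // sigmoid, are the gates sigmoid(X*V + c) applied element-wise.
    std::vector<float> glu(const std::vector<float>& convOut) {
        size_t half = convOut.size() / 2; // assumes an even channel count
        std::vector<float> out(half);
        for (size_t i = 0; i < half; ++i) {
            float gate = 1.0f / (1.0f + std::exp(-convOut[half + i]));
            out[i] = convOut[i] * gate; // (X*W + b) ⊗ σ(X*V + c)
        }
        return out;
    }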

  11. acoustic model • Architecture and a few tricks • A stack of gated convnet blocks: • Gated convnet block: kernel width 13, channels 40 → 200, dropout 0.2 • Gated convnet block: kernel width 14, channels 200 → 220, dropout 0.214 • … • Gated convnet block: kernel width 29, channels 826 → 908, dropout 0.59 • Linear layer 908 → 908, dropout 0.59 • Linear layer 908 → 30 • For each consecutive convolutional layer: increase kernel width, increase channels, increase dropout • Overall network receptive field of ~2.2 seconds, i.e. 2.2 seconds of audio contribute to one output character • Motivation: more modeling capacity and regularization towards the output layers • “Fully Convolutional Speech Recognition”, Zeghidour et al., arXiv, 2019
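
As a back-of-the-envelope check (my sketch, not from the slides): assuming stride-1 convolutions, the receptive field grows as 1 + Σ(kernel_i − 1) frames; the kernel widths and the 10 ms frame stride below are illustrative, not the exact recipe.

    #include <iostream>
    #include <vector>

    // Receptive field of a stack of stride-1 1D convolutions:
    // 1 + sum(kernel_i - 1) input frames per output step.
    int receptiveField(const std::vector<int>& kernelWidths) {
        int rf = 1;
        for (int k : kernelWidths) {
            rf += k - 1;
        }
        return rf;
    }

    int main() {
        // With a typical 10 ms feature frame stride, F frames span F/100
        // seconds, so ~220 frames would match the quoted ~2.2 seconds.
        std::cout << receptiveField({13, 14, 15, 17, 19, 21, 23, 25, 27, 29})
                  << " frames\n";
    }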

  12. end-to-end speech recognition • Pipeline recap (diagram): features → acoustic model → decoder + language model → the cat sat

  13. language model • How it works • Statistical (n-gram) language models estimate the probability distribution of a sequence of words, i.e. a 3-gram language model generates P(w1, …, wN) = ∏i P(wi | wi−2, wi−1) • Feed-forward neural network models generate the probability of the next word given a sequence of words, i.e. P(wi | wi−n+1, …, wi−1) • Character-based language models • Do acoustic models learn language modeling? • Anecdotally, acoustic models were observed to output the more probable word sequence instead of the actually spoken one in noisy audio segments • Thus, more regularization and more capacity at the output layers of acoustic models
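
A toy illustration of the n-gram idea (hypothetical TrigramLM struct, not KenLM or the wav2letter++ LM interface):

    #include <map>
    #include <string>
    #include <tuple>

    // Toy 3-gram lookup: scores are log-probabilities keyed by the
    // two-word context plus the next word. A real n-gram LM adds
    // smoothing and back-off to lower-order models instead of a flat
    // penalty for unseen trigrams.
    struct TrigramLM {
        std::map<std::tuple<std::string, std::string, std::string>, float> logProb;
        float unseenPenalty = -10.0f; // crude stand-in for back-off

        // log P(w | w2 w1): probability of w given the two preceding words
        float score(const std::string& w2, const std::string& w1,
                    const std::string& w) const {
            auto it = logProb.find(std::make_tuple(w2, w1, w));
            return it != logProb.end() ? it->second : unseenPenalty;
        }
    };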

  14. language model • Architecture of feed-forward neural network language models • Word embeddings • Gated Linear Units to the rescue! • Hierarchical Softmax: makes outputting probabilities for all words in the dictionary tractable for large vocabularies • “Language Modeling with Gated Convolutional Networks”, Dauphin et al., ICML, 2017
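
A sketch of the two-level factorization behind hierarchical softmax (helper names are mine, not the paper's code; assumes each word belongs to exactly one cluster):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Two-level hierarchical softmax: P(w | h) = P(cluster(w) | h) *
    // P(w | cluster(w), h), so only |clusters| + |words in the cluster|
    // logits are evaluated instead of a full vocabulary-sized softmax.
    float logSoftmaxAt(const std::vector<float>& logits, size_t idx) {
        float maxLogit = *std::max_element(logits.begin(), logits.end());
        float sum = 0.0f;
        for (float l : logits) sum += std::exp(l - maxLogit);
        return logits[idx] - maxLogit - std::log(sum);
    }

    float hierarchicalLogProb(const std::vector<float>& clusterLogits,
                              const std::vector<float>& wordLogits,
                              size_t cluster, size_t wordInCluster) {
        return logSoftmaxAt(clusterLogits, cluster) +    // log P(c | h)
               logSoftmaxAt(wordLogits, wordInCluster);  // log P(w | c, h)
    }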

  15. end-to-end speech recognition • Pipeline recap (diagram): features → acoustic model → decoder + language model → the cat sat

  16. acoustic model • Training: ASG loss (criterion) • ASG stands for Auto Segmentation • Segmentation problem: say “cab” is the target (the letter vocabulary is {a, b, c}) • Over 4 frames, it can be written caab, ccab, cabb, etc. • A graph over the per-frame letter scores from the neural network, plus unnormalized transition scores, assigns a score to each segmentation (diagram: one network column per frame) • “Wav2Letter: an End-to-End ConvNet-based Speech Recognition System”, Collobert et al., arXiv, 2016
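
A small sketch that enumerates those segmentations (plain C++, not the wav2letter++ criterion code):

    #include <iostream>
    #include <string>

    // Sketch of the ASG segmentation space: enumerate every way to
    // stretch a target over T frames by repeating letters. ASG scores
    // each such path with per-frame letter scores plus unnormalized
    // transition scores.
    void expand(const std::string& target, size_t pos, size_t framesLeft,
                std::string& prefix) {
        if (pos == target.size()) {
            if (framesLeft == 0) std::cout << prefix << "\n";
            return;
        }
        size_t minForRest = target.size() - pos - 1; // later letters need >= 1 frame each
        for (size_t reps = 1; reps + minForRest <= framesLeft; ++reps) {
            prefix.append(reps, target[pos]);
            expand(target, pos + 1, framesLeft - reps, prefix);
            prefix.resize(prefix.size() - reps);
        }
    }

    int main() {
        std::string prefix;
        expand("cab", 0, 4, prefix); // prints cabb, caab, ccab
    }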

  17. acoustic model • Training: CTC vs ASG • CTC stands for Connectionist Temporal Classification • Extensively used in Speech Recognition and Optical Character Recognition • Has a blank label ø • Handles letter repetitions: letøter • Handles garbage frames • (diagram: CTC vs ASG alignment graphs) • “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks”, A. Graves et al., ICML, 2006 • “Letter-Based Speech Recognition with Gated ConvNets”, Liptchinsky et al., arXiv 2017
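
A minimal sketch of CTC's collapse rule (my illustration, with '0' standing in for the blank ø):

    #include <iostream>
    #include <string>

    // CTC collapse: remove adjacent duplicates first, then drop the
    // blank. The blank keeps genuine double letters apart: "let0ter"
    // decodes to "letter", while "lettter" collapses to "leter".
    std::string ctcCollapse(const std::string& frames, char blank = '0') {
        std::string out;
        char prev = '\0';
        for (char c : frames) {
            if (c != prev && c != blank) {
                out.push_back(c);
            }
            prev = c;
        }
        return out;
    }

    int main() {
        std::cout << ctcCollapse("lleet0tter") << "\n"; // prints: letter
    }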

  18. end-to-end speech recognition • Pipeline recap (diagram): features → acoustic model → decoder + language model → the cat sat

  19. decoder • How it works • Inputs: acoustic model A, lexicon L (a trie over letters, e.g. prefix “the cat|sa”), word language model G (e.g. words “the cat|sat”) • Beam search, constrained to a fixed beam size • Bookkeeping of (A, L, G) positions • At each time step: • For each previous hypothesis (A, L, G, score) • Add new hypotheses constrained to L • If a word is emitted, add its score from G • Merge new hypotheses leading to the same (L, G) states • For details on the differentiable decoder please check the paper in the footnote • “A Fully Differentiable Beam Search Decoder”, Collobert et al., arXiv, 2019
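
A heavily simplified beam search sketch (lexicon constraints and LM/state merging omitted, so this is not the wav2letter++ decoder; it only keeps the top-scoring token sequences per frame):

    #include <algorithm>
    #include <string>
    #include <vector>

    // A hypothesis: a token string plus its accumulated log score.
    struct Hyp {
        std::string tokens;
        float score = 0.0f;
    };

    // logProbs: one vector of per-token log-probabilities per frame
    // (each frame vector must have alphabet.size() entries).
    std::vector<Hyp> beamSearch(const std::vector<std::vector<float>>& logProbs,
                                const std::string& alphabet, size_t beamSize) {
        std::vector<Hyp> beam{Hyp{}};
        for (const auto& frame : logProbs) {
            std::vector<Hyp> next;
            for (const auto& hyp : beam) {
                // extend every hypothesis with every token
                for (size_t t = 0; t < alphabet.size(); ++t) {
                    next.push_back({hyp.tokens + alphabet[t],
                                    hyp.score + frame[t]});
                }
            }
            // keep only the beamSize best-scoring hypotheses
            std::sort(next.begin(), next.end(),
                      [](const Hyp& a, const Hyp& b) { return a.score > b.score; });
            if (next.size() > beamSize) next.resize(beamSize);
            beam = std::move(next);
        }
        return beam;
    }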

  20. Toolkit

  21. wav2letter++ design • The stack (top to bottom): • Recipes: WSJ, LibriSpeech… • wav2letter++ executables: Train, Test, Decode • Flashlight NN lib: autograd, modules, serialization, training; criteria: CTC, ASG, Seq2seq • ArrayFire tensor library • Accelerator packages: CuDNN, NNPACK (GPU (NVIDIA) and CPU) • Communication libs: NCCL, MPI collectives (GPU (NVIDIA) and CPU) • Why C++? • It’s fast • It’s fast • Type safety/static typing • It’s fast • Why ArrayFire? • JIT compilation • Portability: supports CUDA, CPU, OpenCL • “wav2letter++: The Fastest Open-source Speech Recognition System”, Pratap et al., ICASSP, 2019

  22. PReLU implementation • TensorFlow/Keras: pos and neg are each evaluated, then added (intermediate tensors are materialized)

    # excerpt from Keras' PReLU layer; assumes the surrounding class plus
    # `from keras import backend as K` and
    # `from tensorflow.python.ops import math_ops`
    def call(self, inputs, mask=None):
        pos = K.relu(inputs)
        if K.backend() == 'theano':
            neg = (
                K.pattern_broadcast(self.alpha, self.param_broadcast) *
                (inputs - math_ops.abs(inputs)) * 0.5)
        else:
            neg = -self.alpha * K.relu(-inputs)
        return pos + neg

  23. PReLU implementation • ArrayFire: NOT evaluated there, the JIT avoids intermediate copies • With a JIT it works on CPU and GPU

    Variable PReLU::forward(const Variable &input) {
        auto mask = input >= 0.0;
        return (input * mask) + (input * !mask * tileAs(m_parameters[0], input));
    }

  24. gfor and batchFunc, batched over the input • gfor: parallel (vectorized) loop • batchFunc: execute a function on a batch, in parallel

    // Apply a Hamming window to all speech frames in parallel
    coefs = 0.54 - 0.46 * af::cos(2 * M_PI * af::iota(N, 1) / (N - 1));
    af::array multiplyOp(const af::array& a, const af::array& b) {
        return a * b;
    }
    af::batchFunc(coefs, input, multiplyOp);
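
The slide shows the batchFunc route; here is a hedged sketch of the gfor route for the same windowing (N, numFrames, and the random input are illustrative values of mine):

    #include <arrayfire.h>

    // gfor variant: all numFrames iterations execute as a single
    // vectorized launch on CPU or GPU, one N-sample frame per column.
    void windowAllFrames() {
        const int N = 512, numFrames = 1000;
        af::array coefs =
            0.54 - 0.46 * af::cos(2 * af::Pi * af::iota(N, 1) / (N - 1));
        af::array input = af::randu(N, numFrames);
        af::array windowed(N, numFrames);
        gfor (af::seq i, numFrames) {
            windowed(af::span, i) = input(af::span, i) * coefs;
        }
    }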

  25. Flashlight Neural Network library • Flashlight • From the creators of Torch • Entirely written in C++ • JIT compilation • CPU and GPU backends • https://github.com/facebookresearch/flashlight

  26. Benchmarks

  27. Word Error Rate (WER) • How it is computed: Levenshtein distance between the transcription produced by the ASR system and the reference, at the word level • Examples: • REF: the cat sat on the mat • HYP: the cat sat mat • WER: 33%, 2 deletions • REF: the cat sat on the mat • HYP: the bat sat on at the mat • WER: 33%, 1 substitution, 1 insertion
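
A minimal WER sketch (plain C++, not the wav2letter++ implementation): word-level edit distance divided by the reference length; both slide examples come out to 2/6 ≈ 33%.

    #include <algorithm>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    std::vector<std::string> words(const std::string& s) {
        std::istringstream in(s);
        std::vector<std::string> out;
        std::string w;
        while (in >> w) out.push_back(w);
        return out;
    }

    double wer(const std::string& ref, const std::string& hyp) {
        auto r = words(ref), h = words(hyp);
        // d[i][j] = edit distance between first i ref words, first j hyp words
        std::vector<std::vector<size_t>> d(r.size() + 1,
                                           std::vector<size_t>(h.size() + 1));
        for (size_t i = 0; i <= r.size(); ++i) d[i][0] = i;
        for (size_t j = 0; j <= h.size(); ++j) d[0][j] = j;
        for (size_t i = 1; i <= r.size(); ++i) {
            for (size_t j = 1; j <= h.size(); ++j) {
                size_t sub = d[i - 1][j - 1] + (r[i - 1] == h[j - 1] ? 0 : 1);
                d[i][j] = std::min({sub, d[i - 1][j] + 1, d[i][j - 1] + 1});
            }
        }
        return static_cast<double>(d[r.size()][h.size()]) / r.size();
    }

    int main() {
        // both slide examples give 2/6 = 0.33
        std::cout << wer("the cat sat on the mat", "the cat sat mat") << "\n";
        std::cout << wer("the cat sat on the mat", "the bat sat on at the mat") << "\n";
    }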

  28. Fully Convolutional ASR • Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert • (results table: all results in Word Error Rate) • [1] Deep Recurrent Neural Networks for Acoustic Modelling, Chan and Lane • [2] Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, Amodei et al. • [3] Towards better decoding and language model integration in sequence to sequence models, Chorowski and Jaitly

  29. ASR toolkits

  30. benchmark: training epoch time • 8-GPU nodes (Tesla V100), 100 Gbps InfiniBand • CTC training; Kaldi: LF-MMI • (chart, log scale: 30M-parameter model with 2 convolutions + 5 bi-LSTMs, and 100M-parameter model with 19 convolutions)

  31. benchmark: training epoch time (continued) • 8-GPU nodes (Tesla V100), 100 Gbps InfiniBand • CTC training; Kaldi: LF-MMI • (chart: 100M-parameter model with 19 convolutions)

  32. benchmark: decoding • Same pre-computed emissions for all frameworks • LibriSpeech dev-clean, 4-gram LM • (chart; one compared framework does not support an n-gram LM)
