
Decoding Techniques for Large Vocabulary Speech Recognition

This chapter discusses various decoding techniques for continuous speech recognition using Hidden Markov Models (HMM), including time-synchronous decoding, beam pruning, N-Best decoding, and best-first decoding. The ideal decoder should be efficient, accurate, scalable, and versatile. A hybrid approach combining time-synchronous and stack decoding is also explored.



Presentation Transcript


  1. The Use of Context in Large Vocabulary Speech Recognition, Chapter 4: Decoding. Julian James Odell, March 1995. Dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy. Presenter: Ting-Wei Hsu

  2. Introduction

  3. Commands in HTK • A tree lexicon constructs the tree network. • Inputs: unknown utterance, word network, HMMs. • HVite can perform: 1. forced alignment (N-best), 2. lattice rescoring, and 3. recognition of direct audio input.

  4. Ch4. Decoding • This chapter describes several decoding techniques suitable for recognition of continuous speech using HMMs. • It is concerned with the use of cross-word context dependent acoustic models and long span language models. • Ideal decoder • 4.2 Time-Synchronous decoding • 4.2.1 Token passing • 4.2.2 Beam pruning • 4.2.3 N-Best decoding • 4.2.4 Limitations • 4.2.5 Back-Off implementation • 4.3 Best First Decoding • 4.3.1 A* Decoding • 4.3.2 The stack decoder for speech recognition • 4.4 A Hybrid approach: stack decoder + time-synchronous

  5. Ch4. Decoding (cont.) 4.1 Requirements • Ideal decoder: it should find the most likely grammatical hypothesis for an unknown utterance, combining • acoustic model likelihood and • language model likelihood. • To keep recognition computationally tractable, it is necessary to share computation between the common portions of different hypotheses.

  6. Ch4. Decoding (cont.) 4.1 Requirements (cont.) • The ideal decoder would have the following characteristics • Efficiency: ensure that the system does not lag behind the speaker. • Accuracy: find the most likely grammatical sequence of words for each utterance. • Scalability: increasing the recognition vocabulary can reduce the error rate if the extra words are recognized correctly. • Versatility: allow a variety of constraints and knowledge sources to be incorporated directly into the search without compromising its efficiency. • e.g. N-gram language models + cross-word context dependent models

  7. 4.2 Time-Synchronous decoding • Time-synchronous searches are essentially simple breadth-first searches. • Example: a simple isolated word recognizer with a four word vocabulary (AND, BILL, BIT, BEN) composed of monophone models. (figure: word-level network)

  8. 4.2 Time-Synchronous decoding (cont.) • 4.2.1 Token passing • Token: a single movable structure that holds the likelihood of each partial path together with the traceback information. • Token passing: using the Viterbi algorithm to find the most likely state sequence, as in the sketch below. (figure: token-passing example)
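To make the mechanics concrete, here is a minimal token-passing sketch in Python. It is illustrative rather than HTK's implementation: the log emission scores `obs_scores` and log transition scores `trans` are assumed to be precomputed, and the traceback simply records state indices.

```python
import math

# A minimal token-passing sketch (illustrative; not HTK's implementation).
class Token:
    def __init__(self, logprob=-math.inf, history=()):
        self.logprob = logprob   # log likelihood of the partial path
        self.history = history   # traceback information (visited labels)

def token_pass(obs_scores, trans):
    """Viterbi search via token passing.
    obs_scores[t][s]: log emission score of state s at frame t (assumed given).
    trans[s_prev][s]: log transition score (assumed given)."""
    n_states = len(trans)
    tokens = [Token() for _ in range(n_states)]
    tokens[0] = Token(0.0, (0,))                 # search starts in state 0
    for frame in obs_scores:
        new_tokens = [Token() for _ in range(n_states)]
        for s, tok in enumerate(tokens):
            if tok.logprob == -math.inf:
                continue                         # inactive token
            for s2 in range(n_states):
                score = tok.logprob + trans[s][s2] + frame[s2]
                if score > new_tokens[s2].logprob:   # keep only the best token per state
                    new_tokens[s2] = Token(score, tok.history + (s2,))
        tokens = new_tokens
    return max(tokens, key=lambda t: t.logprob)  # most likely complete path
```

Because only the single best token survives in each state at each frame, the cost per frame is bounded by the number of transitions, which is what makes the search time-synchronous and shareable across hypotheses.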

  9. 4.2 Time-Synchronous decoding (cont.) • 4.2.1 Token passing • Adding a language model: when a token leaves the final state of a word, the word-level language model score is added to it, as sketched below. (figure: word-level network with language model scores)
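A sketch of that word-boundary step, reusing the hypothetical Token class from the previous sketch; `lm_logprob` is assumed to be a precomputed language model log probability (e.g. a bigram score).

```python
def exit_word(token, word, lm_logprob):
    """Apply the language model at a word end: add the log language model
    score for entering `word` and record the word in the traceback.
    Reuses the Token class from the token-passing sketch above."""
    return Token(token.logprob + lm_logprob, token.history + (word,))

# e.g. a token leaving the end of BEN and entering BIT under a bigram model:
# new_tok = exit_word(tok, "BIT", bigram_logprob("BEN", "BIT"))
```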

  10. 4.2 Time-Synchronous decoding (cont.) • 4.2.2 Beam Pruning • Reduces the total computational requirements by reducing the size of the search space being actively considered. • Unlike traditional breadth-first search, beam search only expands nodes that are likely to succeed at each level, making it a local search (see the sketch below). (figure: pruned network over the words BIT, BILL, AND)
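A minimal beam-pruning sketch, again reusing the Token class from the token-passing example; the beam width is an illustrative log-domain value, not a tuned one.

```python
import math

def beam_prune(tokens, beam_width=200.0):
    """Deactivate tokens whose log likelihood falls more than `beam_width`
    below the best active token at this frame."""
    best = max(tok.logprob for tok in tokens)
    for tok in tokens:
        if tok.logprob < best - beam_width:
            tok.logprob = -math.inf     # outside the beam: pruned
    return tokens
```

Calling `beam_prune(tokens)` once per frame inside the token-passing loop keeps only locally promising paths, trading a small risk of search error for a large reduction in computation.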

  11. 4.2 Time-Synchronous decoding (cont.) • 4.2.3 N-Best decoding • The token passing implementation of the Viterbi algorithm can be extended to perform N-best recognition by storing more than one token in each state. • With N tokens in each state, the decoder will find the N most likely hypotheses; keeping at most one token per distinct previous word is known as the word pair approximation (see the sketch below). (figure: the top 2 tokens are kept in each of the states s1, s2, s3)
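A sketch of the per-state token merge for N-best decoding. It assumes tokens carry word labels in their traceback (as produced by the `exit_word` sketch earlier), and uses the last traceback entry as an illustrative previous-word key for the word pair approximation.

```python
import heapq

def merge_nbest(candidates, n=2):
    """Keep the N best tokens entering one state (top-2, as in the slide's
    figure). Under the word pair approximation, at most one token per
    distinct previous word is retained."""
    best_per_word = {}
    for tok in candidates:
        prev = tok.history[-1] if tok.history else None  # illustrative key
        if prev not in best_per_word or tok.logprob > best_per_word[prev].logprob:
            best_per_word[prev] = tok
    return heapq.nlargest(n, best_per_word.values(), key=lambda t: t.logprob)
```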

  12. 4.2 Time-Synchronous decoding (cont.) • 4.2.4 Limitations • The N-Best decoding method is only suited to medium vocabulary systems using bigram language models and word-internal context dependent models. (figure: phone-level network)

  13. 4.2 Time-Synchronous decoding (cont.) • 4.2.4 Limitations (cont.) • Extension: trigram or longer span language models. • The network includes multiple copies of each word to ensure that every transition between words has a unique two word history; each of the 4 words needs one copy per predecessor word, so the 4 word instances expand to 4² = 16. (figure: word-level network expanded from Fig 4.2)

  14. 4.2 Time-Synchronous decoding (cont.) • 4.2.5 Back-Off Implementation • When a required n-gram has not been observed, a back-off language model falls back to a lower order estimate scaled by a back-off weight, as sketched below. • Ex: 1. BEN BIT 2. BEN AND
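A Katz-style back-off lookup sketch. Here `lm` is a hypothetical dict-based model holding log probabilities and log back-off weights, not HTK's actual format; the structure only illustrates the back-off chain trigram → bigram → unigram.

```python
import math

def backoff_logprob(lm, w1, w2, w3):
    """P(w3 | w1, w2): use the trigram if it was observed, otherwise back
    off to the bigram scaled by a back-off weight, then to the unigram."""
    if (w1, w2, w3) in lm["tri"]:
        return lm["tri"][(w1, w2, w3)]
    bow12 = lm["bow"].get((w1, w2), 0.0)          # log back-off weight
    if (w2, w3) in lm["bi"]:
        return bow12 + lm["bi"][(w2, w3)]
    bow2 = lm["bow"].get((w2,), 0.0)
    return bow12 + bow2 + lm["uni"].get(w3, -math.inf)

# e.g. scoring the slide's examples after some word w1:
# backoff_logprob(lm, w1, "BEN", "BIT"); backoff_logprob(lm, w1, "BEN", "AND")
```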

  15. 4.3 Best First decoding • Starts by extending the most likely partial path, and only later extends less likely hypotheses if extensions to the best path become unpromising. • The lookahead method gives best first decoders their main advantage. • Implemented with the A* search. • Not time-synchronous; it proceeds more like a depth-first search. • Admissibility: a guarantee of finding an optimal solution.

  16. 4.3 Best First decoding (cont.) • A* search • f(n) = g(n) + h(n): evaluation function of node n • g(n): score from the root node to node n, i.e. the decoded partial path score • h*(n): the exact best score from node n to the goal • h(n): an estimate of the score from node n to the goal, the heuristic function • Admissibility: h(n) >= h*(n); since the scores here are log likelihoods being maximized, the heuristic must never underestimate the best achievable completion score. • Partial paths are stored on a stack which is sorted in likelihood order (see the sketch below).
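A generic A* stack-decoder sketch. The callbacks `expand`, `heuristic`, and `is_goal` are assumptions, not part of the dissertation's decoder: `expand(node)` yields (child, step_score) pairs, `heuristic(node)` estimates the best remaining score, and `is_goal(node)` tests for a complete hypothesis. Scores are log likelihoods to be maximized, so f-values are negated for Python's min-heap.

```python
import heapq
import math

def a_star(root, expand, heuristic, is_goal):
    """Best-first search over partial paths kept on a stack (priority
    queue) sorted by f = g + h. With an admissible heuristic, the first
    goal popped is optimal."""
    stack = [(-heuristic(root), 0.0, 0, root)]    # (-f, g, tiebreak, node)
    counter = 1                                    # tiebreak avoids comparing nodes
    while stack:
        neg_f, g, _, node = heapq.heappop(stack)   # most promising partial path
        if is_goal(node):
            return node, g
        for child, step_score in expand(node):
            g2 = g + step_score                    # extended partial path score
            heapq.heappush(stack, (-(g2 + heuristic(child)), g2, counter, child))
            counter += 1
    return None, -math.inf
```

The quality of `heuristic` (the lookahead) determines how narrowly the search focuses; a tight but still admissible estimate is what gives best-first decoders their advantage.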

  17. 4.4 A Hybrid Approach • Time-synchronous + stack decoder

  18. Next Step • Linear Network or Tree Network • One-Pass Tree-Copy Search • Word Graph (lattice) • N-Best rescoring • A* search

  19. Conclusion • Decoding with HVite: given an unknown utterance, a word network, and HMMs, HVite can perform 1. forced alignment (N-best), 2. lattice rescoring, and 3. recognition of direct audio input.
