
Decoding Techniques for Large Vocabulary Speech Recognition

This chapter discusses various decoding techniques for continuous speech recognition using Hidden Markov Models (HMM), including time-synchronous decoding, beam pruning, N-Best decoding, and best-first decoding. The ideal decoder should be efficient, accurate, scalable, and versatile. A hybrid approach combining time-synchronous and stack decoding is also explored.



Presentation Transcript


  1. The Use of Context in Large Vocabulary Speech Recognition, Chapter 4: Decoding. Julian James Odell, March 1995. Dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy. Presenter: Ting-Wei Hsu

  2. Introduction

  3. Commands in HTK • A tree lexicon constructs the tree network. • Inputs: unknown utterance, word network, HMMs. • HVite can perform: 1. forced alignment (N-best), 2. lattice rescoring, and 3. recognition of direct audio input.

  4. Ch4. Decoding • This chapter describes several decoding techniques suitable for recognition of continuous speech using HMMs. • It is concerned with the use of cross-word context dependent acoustic models and long span language models. • Ideal decoder • 4.2 Time-Synchronous decoding • 4.2.1 Token passing • 4.2.2 Beam pruning • 4.2.3 N-Best decoding • 4.2.4 Limitations • 4.2.5 Back-Off implementation • 4.3 Best First Decoding • 4.3.1 A* Decoding • 4.3.2 The stack decoder for speech recognition • 4.4 A Hybrid approach: stack decoder + time-synchronous

  5. Ch4. Decoding (cont.) 4.1 Requirements • Ideal decoder: it should find the most likely grammatical hypothesis for an unknown utterance, combining • acoustic model likelihood and • language model likelihood. • To keep recognition computationally tractable, it is necessary to share computation between the common portions of different hypotheses.

  6. Ch4. Decoding (cont.) 4.1 Requirements (cont.) • The ideal decoder would have the following characteristics • Efficiency: ensure that the system does not lag behind the speaker. • Accuracy: find the most likely grammatical sequence of words for each utterance. • Scalability: increasing the recognition vocabulary can reduce the error rate if the extra words are recognized correctly. • Versatility: allow a variety of constraints and knowledge sources to be incorporated directly into the search without compromising its efficiency. • e.g. N-gram language models + cross-word context dependent models

  7. 4.2 Time-Synchronous decoding • Time-synchronous searches are essentially simple breadth-first searches. • Example: a simple isolated word recognizer with a four word vocabulary (AND, BILL, BIT, BEN) composed of monophone models. (figure: word-level network)

  8. 4.2 Time-Synchronous decoding (cont.) • 4.2.1 Token passing • Token: a single movable structure that holds the likelihood of each partial path together with the traceback information. • Token passing: using the Viterbi algorithm to find the most likely state sequence, as in the sketch below. (figure: token-passing example)
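To make the mechanics concrete, here is a minimal token-passing sketch in Python. It is illustrative rather than HTK's implementation: the log emission scores `obs_scores` and log transition scores `trans` are assumed to be precomputed, and the traceback simply records state indices.

```python
import math

# A minimal token-passing sketch (illustrative; not HTK's implementation).
class Token:
    def __init__(self, logprob=-math.inf, history=()):
        self.logprob = logprob   # log likelihood of the partial path
        self.history = history   # traceback information (visited labels)

def token_pass(obs_scores, trans):
    """Viterbi search via token passing.
    obs_scores[t][s]: log emission score of state s at frame t (assumed given).
    trans[s_prev][s]: log transition score (assumed given)."""
    n_states = len(trans)
    tokens = [Token() for _ in range(n_states)]
    tokens[0] = Token(0.0, (0,))                 # search starts in state 0
    for frame in obs_scores:
        new_tokens = [Token() for _ in range(n_states)]
        for s, tok in enumerate(tokens):
            if tok.logprob == -math.inf:
                continue                         # inactive token
            for s2 in range(n_states):
                score = tok.logprob + trans[s][s2] + frame[s2]
                if score > new_tokens[s2].logprob:   # keep only the best token per state
                    new_tokens[s2] = Token(score, tok.history + (s2,))
        tokens = new_tokens
    return max(tokens, key=lambda t: t.logprob)  # most likely complete path
```

Because only the single best token survives in each state at each frame, the cost per frame is bounded by the number of transitions, which is what makes the search time-synchronous and shareable across hypotheses.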

  9. 4.2 Time-Synchronous decoding (cont.) • 4.2.1 Token passing • Adding a language model: when a token leaves the final state of a word, the word-level language model score is added to it, as sketched below. (figure: word-level network with language model scores)
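A sketch of that word-boundary step, reusing the hypothetical Token class from the previous sketch; `lm_logprob` is assumed to be a precomputed language model log probability (e.g. a bigram score).

```python
def exit_word(token, word, lm_logprob):
    """Apply the language model at a word end: add the log language model
    score for entering `word` and record the word in the traceback.
    Reuses the Token class from the token-passing sketch above."""
    return Token(token.logprob + lm_logprob, token.history + (word,))

# e.g. a token leaving the end of BEN and entering BIT under a bigram model:
# new_tok = exit_word(tok, "BIT", bigram_logprob("BEN", "BIT"))
```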

  10. 4.2 Time-Synchronous decoding (cont.) • 4.2.2 Beam Pruning • Reduces the total computational requirements by reducing the size of the search space being actively considered. • Unlike traditional breadth-first search, beam search only expands nodes that are likely to succeed at each level, making it a local search (see the sketch below). (figure: pruned network over the words BIT, BILL, AND)
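A minimal beam-pruning sketch, again reusing the Token class from the token-passing example; the beam width is an illustrative log-domain value, not a tuned one.

```python
import math

def beam_prune(tokens, beam_width=200.0):
    """Deactivate tokens whose log likelihood falls more than `beam_width`
    below the best active token at this frame."""
    best = max(tok.logprob for tok in tokens)
    for tok in tokens:
        if tok.logprob < best - beam_width:
            tok.logprob = -math.inf     # outside the beam: pruned
    return tokens
```

Calling `beam_prune(tokens)` once per frame inside the token-passing loop keeps only locally promising paths, trading a small risk of search error for a large reduction in computation.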

  11. 4.2 Time-Synchronous decoding (cont.) • 4.2.3 N-Best decoding • The token passing implementation of the Viterbi algorithm can be extended to perform N-best recognition by storing more than one token in each state. • With N tokens in each state, the decoder will find the N most likely hypotheses; keeping at most one token per distinct previous word is known as the word pair approximation (see the sketch below). (figure: the top 2 tokens are kept in each of the states s1, s2, s3)
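A sketch of the per-state token merge for N-best decoding. It assumes tokens carry word labels in their traceback (as produced by the `exit_word` sketch earlier), and uses the last traceback entry as an illustrative previous-word key for the word pair approximation.

```python
import heapq

def merge_nbest(candidates, n=2):
    """Keep the N best tokens entering one state (top-2, as in the slide's
    figure). Under the word pair approximation, at most one token per
    distinct previous word is retained."""
    best_per_word = {}
    for tok in candidates:
        prev = tok.history[-1] if tok.history else None  # illustrative key
        if prev not in best_per_word or tok.logprob > best_per_word[prev].logprob:
            best_per_word[prev] = tok
    return heapq.nlargest(n, best_per_word.values(), key=lambda t: t.logprob)
```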

  12. 4.2 Time-Synchronous decoding (cont.) • 4.2.4 Limitations • The N-Best decoding method is only suited to medium vocabulary systems using bigram language models and word-internal context dependent models. (figure: phone-level network)

  13. 4.2 Time-Synchronous decoding (cont.) • 4.2.4 Limitations (cont.) • Extension: trigram or longer span language models. • The network includes multiple copies of each word to ensure that every transition between words has a unique two word history; each of the 4 words needs one copy per predecessor word, so the 4 word instances expand to 4² = 16. (figure: word-level network expanded from Fig 4.2)

  14. 4.2 Time-Synchronous decoding (cont.) • 4.2.5 Back-Off Implementation • When a required n-gram has not been observed, a back-off language model falls back to a lower order estimate scaled by a back-off weight, as sketched below. • Ex: 1. BEN BIT 2. BEN AND
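A Katz-style back-off lookup sketch. Here `lm` is a hypothetical dict-based model holding log probabilities and log back-off weights, not HTK's actual format; the structure only illustrates the back-off chain trigram → bigram → unigram.

```python
import math

def backoff_logprob(lm, w1, w2, w3):
    """P(w3 | w1, w2): use the trigram if it was observed, otherwise back
    off to the bigram scaled by a back-off weight, then to the unigram."""
    if (w1, w2, w3) in lm["tri"]:
        return lm["tri"][(w1, w2, w3)]
    bow12 = lm["bow"].get((w1, w2), 0.0)          # log back-off weight
    if (w2, w3) in lm["bi"]:
        return bow12 + lm["bi"][(w2, w3)]
    bow2 = lm["bow"].get((w2,), 0.0)
    return bow12 + bow2 + lm["uni"].get(w3, -math.inf)

# e.g. scoring the slide's examples after some word w1:
# backoff_logprob(lm, w1, "BEN", "BIT"); backoff_logprob(lm, w1, "BEN", "AND")
```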

  15. 4.3 Best First decoding • Starts by extending the most likely partial path, and only later extends less likely hypotheses if extensions to the best path become unpromising. • The lookahead method gives best first decoders their main advantage. • Implemented with the A* search. • Not time-synchronous; it proceeds more like a depth-first search. • Admissibility: a guarantee of finding an optimal solution.

  16. 4.3 Best First decoding (cont.) • A* search • f(n) = g(n) + h(n): evaluation function of node n • g(n): score from the root node to node n, i.e. the decoded partial path score • h*(n): the exact best score from node n to the goal • h(n): an estimate of the score from node n to the goal, the heuristic function • Admissibility: h(n) >= h*(n); since the scores here are log likelihoods being maximized, the heuristic must never underestimate the best achievable completion score. • Partial paths are stored on a stack which is sorted in likelihood order (see the sketch below).
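A generic A* stack-decoder sketch. The callbacks `expand`, `heuristic`, and `is_goal` are assumptions, not part of the dissertation's decoder: `expand(node)` yields (child, step_score) pairs, `heuristic(node)` estimates the best remaining score, and `is_goal(node)` tests for a complete hypothesis. Scores are log likelihoods to be maximized, so f-values are negated for Python's min-heap.

```python
import heapq
import math

def a_star(root, expand, heuristic, is_goal):
    """Best-first search over partial paths kept on a stack (priority
    queue) sorted by f = g + h. With an admissible heuristic, the first
    goal popped is optimal."""
    stack = [(-heuristic(root), 0.0, 0, root)]    # (-f, g, tiebreak, node)
    counter = 1                                    # tiebreak avoids comparing nodes
    while stack:
        neg_f, g, _, node = heapq.heappop(stack)   # most promising partial path
        if is_goal(node):
            return node, g
        for child, step_score in expand(node):
            g2 = g + step_score                    # extended partial path score
            heapq.heappush(stack, (-(g2 + heuristic(child)), g2, counter, child))
            counter += 1
    return None, -math.inf
```

The quality of `heuristic` (the lookahead) determines how narrowly the search focuses; a tight but still admissible estimate is what gives best-first decoders their advantage.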

  17. 4.4 A Hybrid Approach • Time-synchronous + stack decoder

  18. Next Step • Linear Network or Tree Network • One-Pass Tree-Copy Search • Word Graph (lattice) • N-Best rescoring • A* search

  19. Conclusion • Decoding with HVite: given an unknown utterance, a word network, and HMMs, HVite can perform 1. forced alignment (N-best), 2. lattice rescoring, and 3. recognition of direct audio input.
