
The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Presentation Transcript


  1. The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech Frank Seide IEEE Transactions on Speech and Audio Processing 2005 Presented by Shih-Hung 2005/09/29

  2. Outline • Introduction • Review of (M+1)-gram Viterbi Decoding with a Reentrant Tree • Virtual Hypothesis Copies on the Word Level • Virtual Hypothesis Copies on the Sub-word Level • Virtual Hypothesis Copies for Long-Range Acoustic Lookahead (optional) • Experimental Results • Conclusion

  3. Introduction

  4. Introduction • For decoding in LVCSR, the most widely used algorithm is a time-synchronous Viterbi decoder that uses a tree-organized pronunciation lexicon with word-conditioned tree copies. • The search space is organized as a reentrant network, which is a composition of the state-level network (lexical tree) and the linguistic (M+1)-gram network. • i.e. a distinct instance ("copy") of each HMM state in the lexical tree is needed for every linguistic state (M-word history). • In practice, this copying is done on demand in conjunction with beam pruning.
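
A minimal sketch of this copy-on-demand bookkeeping, in Python; the class and function names are illustrative, not from the paper:

```python
# Illustrative bookkeeping for word-conditioned lexical-tree copies
# (names such as TreeCopy are ours): a distinct set of active HMM states
# is kept per linguistic state, i.e. per M-word history, and a copy is
# instantiated only when a path actually enters it.

class TreeCopy:
    """Active states of one lexical-tree instance for one history."""
    def __init__(self):
        self.score = {}          # tree state s -> best path log-probability
        self.back_pointer = {}   # tree state s -> time the tree was entered

active_copies = {}               # M-word history (tuple of words) -> TreeCopy

def get_copy(history):
    """Create the tree copy for this history on demand."""
    return active_copies.setdefault(tuple(history), TreeCopy())

def beam_prune(beam_width):
    """Drop states (and whole copies) far below the global best score."""
    scores = [q for c in active_copies.values() for q in c.score.values()]
    if not scores:
        return
    threshold = max(scores) - beam_width
    for history in list(active_copies):
        copy = active_copies[history]
        copy.score = {s: q for s, q in copy.score.items() if q >= threshold}
        if not copy.score:               # copy dies out under pruning
            del active_copies[history]
```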

  5. Introduction

  6. Introduction

  7. Introduction • One observes that hypotheses for the same word generated from different tree copies are often identical, i.e. there is redundant computation. • Can we exploit this redundancy and modify the algorithm such that word hypotheses are shared across multiple linguistic states?

  8. Introduction • A successful approach to this is the two-pass algorithm by Ney and Aubert: it first generates a word lattice using the "word-pair approximation", and then searches for the best path through this lattice with the full language model. • Computation is reduced by sharing word hypotheses among two-word histories that end with the same word. • An alternative approach is start-time conditioned search, which uses non-reentrant tree copies conditioned on the start time of the tree. Here, word hypotheses are shared across all possible linguistic states during word-level recombination.

  9. Introduction

  10. Introduction

  11. Introduction • In this paper, we propose a single-pass reentrant-network (M+1)-gram decoder that uses three novel approaches aimed at eliminating redundant copies of the search space. • 1. State copies are conditioned on the phonetic history rather than the linguistic history: the phone-history approximation (PHA), analogous to the word-pair approximation (WPA). • 2. Path hypotheses at word boundaries are saved at every frame in a data structure similar to a word lattice. To apply the (M+1)-gram at a word end, the needed linguistic path-hypothesis copies are recovered on the fly, similarly to lattice rescoring. We call the recovered copies virtual hypothesis copies (VHC).

  12. Introduction • 3. For a further reduction of redundancy, multiple instances of the same context-dependent phone occurring in the same phonetic history are also dynamically replaced by a single instance. Incomplete path hypotheses at phoneme boundaries are temporarily saved in the lattice-like structure as well. To apply the tree lexicon, the CD-phone instances associated with tree nodes are recovered on the fly (phone-level VHC).

  13. Review of (M+1)-gram Viterbi decoding with a reentrant tree (writing h for the M-word linguistic history) • Q_h(t, s) := probability of the best path up to time t that ends in state s of the lexical tree for history h • B_h(t, s) := time of the latest transition into the tree root on the best path up to time t that ends in state s of the lexical tree for history h ("back-pointer") • H(h; t) := probability that the acoustic observation vectors o(1)…o(t) are generated by a word/state sequence that ends with the M words of h at time t

  14. Review of (M+1)-gram Viterbi decoding with a reentrant tree • The dynamic-programming equations for the word-history conditioned (M+1)-gram search are as follows: Within-word recombination (s > 0)
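
The recursion itself appeared only as a figure on this slide; a hedged reconstruction in standard Ney-style notation, consistent with the definitions on slide 13, is:

```latex
% Within-word recombination (s > 0): the best predecessor state s' is
% chosen per history h, and the back-pointer (the time the tree was
% entered) is propagated along the winning transition.
\begin{align*}
Q_h(t,s) &= \max_{s'} \bigl\{ p\bigl(o(t), s \mid s'\bigr)\, Q_h(t-1, s') \bigr\} \\
B_h(t,s) &= B_h\bigl(t-1,\, s^{\mathrm{opt}}_h(t,s)\bigr), \qquad
s^{\mathrm{opt}}_h(t,s) = \operatorname*{arg\,max}_{s'} \bigl\{ p\bigl(o(t), s \mid s'\bigr)\, Q_h(t-1, s') \bigr\}
\end{align*}
```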

  15. Review of (M+1)-gram Viterbi decoding with a reentrant tree Word-boundary equation:
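
The equation was likewise shown only as a figure; a hedged reconstruction of the usual (M+1)-gram word-boundary recombination, in the notation of slide 13, is:

```latex
% Word-boundary recombination for the new history h = (w_1, ..., w_M),
% where w_M is the word just ended, S_{w_M} is its terminal state in the
% lexical tree, and the (M+1)-gram probability is applied while
% maximizing over the dropped word w_0. The tree copy for h is then
% (re)started at its root s = 0.
\begin{align*}
H(h; t) &= \max_{w_0} \bigl\{ p(w_M \mid w_0, w_1, \ldots, w_{M-1})\,
           Q_{(w_0, \ldots, w_{M-1})}\bigl(t, S_{w_M}\bigr) \bigr\} \\
Q_h(t, s{=}0) &= H(h; t), \qquad B_h(t, 0) = t
\end{align*}
```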

  16. Virtual hypothesis copies on the word level A. How it works B. Word hypotheses C. Word-Boundary assumption and Phonetic-History approximation D. Virtual hypothesis copies: redundancy of Q_h(t, s) E. Choosing F. Collapsed hypothesis copies G. Word-boundary equations H. Collapsed (M+1)-gram search: Summary I. Beam pruning J. Language model lookahead

  17. How it works • The optimal start time of a word depends on its history. The same word in different histories may have different optimal start times; this is the reason for copying. However, we observed that start times are often identical, in particular if the histories are acoustically similar. If, for two linguistic histories h and h', we obtain the same optimal start time, then we have computed too much.

  18. How it works • It would only have been necessary to perform the state-level Viterbi recursion for one of the two histories. This is because:
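
The omitted step can be reconstructed, with hedging, as follows (writing q(τ, t, s) for the history-independent within-word score from the tree root at time τ to state s at time t):

```latex
% If two histories h and h' lead to the same optimal entry time
% tau = B_h(t,s) = B_{h'}(t,s), the path score factorizes into the
% boundary score H(.; tau) and the history-independent within-word score
% q(tau, t, s), so the second copy is recoverable by rescaling:
\begin{align*}
Q_h(t,s) = H(h;\tau)\, q(\tau, t, s), \qquad
Q_{h'}(t,s) = H(h';\tau)\, q(\tau, t, s)
\;\;\Longrightarrow\;\;
Q_{h'}(t,s) = \frac{H(h';\tau)}{H(h;\tau)}\, Q_h(t,s)
\end{align*}
```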

  19. How it works • We are now ready to introduce our method of virtual hypothesis copying (word level). The method consists of: • 1. predicting the sets of histories for which the optimal start times are going to be identical; this information is already needed when a path enters a new word; • 2. performing state-level Viterbi processing only for one copy per set; • 3. for all other copies, recovering their accumulated path probabilities. Thus, on the state level, all but one copy per set are neither stored nor computed; we call them "virtual".
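
A minimal sketch of this three-step bookkeeping; the names and data structures are ours, not the paper's:

```python
# Illustrative flow of word-level virtual hypothesis copies: real
# state-level Viterbi work is done for one "physical" copy per predicted
# history class; the remaining histories stay "virtual", and only their
# boundary scores are kept so their word scores can be recovered later.

physical_copies = {}   # history class -> tree copy that does real Viterbi work
boundary_scores = {}   # (history, entry time) -> H(history; entry time)

def enter_tree(history, history_class, t, score_H):
    """Steps 1 and 2: a path with `history` enters the tree root at time t."""
    boundary_scores[(history, t)] = score_H          # kept for later recovery
    physical_copies.setdefault(history_class, {"entered_at": t, "states": {}})

def recover_virtual_score(history, entry_time, within_word_score):
    """Step 3: accumulated score of a virtual copy, recovered at a word end."""
    return boundary_scores[(history, entry_time)] * within_word_score
```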

  20. How it works • The art is to reliably predict these sets of histories that will lead to identical optimal start times. An exact prediction is impossible. • We propose a heuristic, the phone-history approximation (PHA). • The PHA assumes that a word's optimal boundary depends only on the last N phones of the history.

  21. How it works Regular bigram search Virtual hypothesis copies Speech Lab NTNU 2005

  22. Word hypotheses p(O|w)

  23. Word-Boundary assumption and Phonetic-History approximation

  24. Word-Boundary assumption and Phonetic-History approximation • Intuitively, the optimal word boundaries should not depend on the linguistic state, but rather on the phonetic context at the boundary. • Words that end similarly should lead to the same boundary. • Thus, we propose a phonetically motivated history-class definition, the phone-history approximation (PHA): a word's optimal start time depends on the word and its N-phone history.
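
A small illustrative sketch of the PHA grouping, assuming a hypothetical `pronounce` lookup from words to phone sequences:

```python
# Hedged sketch of the phone-history approximation (PHA): histories are
# grouped by their last N phones, and one tree copy can be shared per
# group. `pronounce`, mapping a word to its phone sequence, is assumed.

def pha_class(history_words, pronounce, n_phones=2):
    """Map an M-word history to its PHA class: the last N phones."""
    phones = [p for word in history_words for p in pronounce(word)]
    return tuple(phones[-n_phones:])

# Example: with pronounce = lambda w: {"the": ["dh", "ax"], "a": ["ax"]}[w],
# the one-word histories ("the",) and ("a",) both fall into the class
# ("ax",) for N = 1 and can therefore share a single tree copy.
```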

  25. Virtual hypothesis copies: redundancy of Q_h(t, s)

  26. Virtual hypothesis copies: redundancy of Q_h(t, s)

  27. Choosing

  28. Collapsed hypothesis copies • The most probable hypothesis is only known when the end of the word is reached, which is too late to reduce computation.

  29. Collapsed hypothesis copies

  30. Word-boundary equations

  31. Collapsed (M+1)-gram search: Summary

  32. Language model lookahead • M-gram lookahead aims at using language knowledge as early as possible in the lexical tree by pushing partial M-gram scores toward the tree root.
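
A minimal sketch of this idea, assuming a precomputed map from tree nodes to the words reachable below them; `words_below` and `lm_prob` are illustrative, not the paper's API:

```python
import math

# Hedged sketch of language-model lookahead: every lexical-tree node is
# assigned the best LM probability of any word still reachable below it,
# so partial LM scores can be applied near the tree root and sharpen
# beam pruning.

def lm_lookahead(words_below, lm_prob):
    """node id -> log of the best LM probability among the words below it."""
    return {node: max(math.log(lm_prob(w)) for w in words)
            for node, words in words_below.items()}

# During search, a path moving from a node to its child typically adds the
# score difference lookahead[child] - lookahead[parent]; the exact
# (M+1)-gram probability then replaces the lookahead score at the word end.
```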

  33. Virtual hypothesis copies on the sub-word level • In the word-level method, the state-level search can be interpreted as a "word-lattice generator" with (M+1)-gram "lattice rescoring" applied on the fly; search-space reduction was achieved by sharing tree copies among multiple histories. • We now apply the same idea at the subword level: the state-level search becomes a kind of "subword generator", subword hypotheses are incrementally matched against the lexical tree (frame-synchronously), and (M+1)-gram lattice rescoring is applied as before.
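
A rough sketch of the phone-level sharing, assuming a simple keyed pool of context-dependent phone instances (names are illustrative):

```python
# Hedged sketch of phone-level sharing: one acoustic instance is kept per
# (context-dependent phone, phonetic history) pair, and every lexical-tree
# node that refers to that CD phone in that phonetic context is served by
# the same instance; the node-level copies stay "virtual" and are
# recovered at phone boundaries, analogously to word-level VHCs.

phone_instances = {}   # (cd_phone, phonetic_history) -> shared instance

def instance_for(cd_phone, phonetic_history):
    """Return the single shared instance, creating it on demand."""
    key = (cd_phone, tuple(phonetic_history))
    return phone_instances.setdefault(key, {"states": {}, "entered_at": None})
```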

  34. Virtual hypothesis copies on the sub-word level

  35. Virtual hypothesis copies on the sub-word level

  36. Virtual hypothesis copies on the sub-word level

  37. Experimental setup • The Philips LVCSR system is based on continuous-mixture HMMs. • MFCC features. • Unigram lookahead. • Corpora for Mandarin: • MAT-2000, PCD, National Hi-Tech Project 863 • Corpora for English: • trained on WSJ0+1 • tested on the 1994 ARPA NAB task

  38. Experimental result

  39. Experimental result

  40. Experimental result

  41. Experimental result

  42. Experimental result

  43. Experimental result

  44. Experimental result

  45. Experimental result

  46. Conclusion • We have presented a novel time-synchronous LVCSR Viterbi decoder for Mandarin, based on the concept of virtual hypothesis copies (VHC). • At no loss of accuracy, the number of active states was reduced by 60-80% for Chinese and by 40-50% for American English.
