Decoding Techniques for Automatic Speech Recognition

  1. Decoding Techniques for Automatic Speech Recognition Florian Metze Interactive Systems Laboratories

  2. Outline • Decoding in ASR • Search Problem • Evaluation Problem • Viterbi Algorithm • Tree Search • Re-Entry • Recombination ESSLLI 2002, Trento

  3. The ASR problem: argmaxW p(W|x) • Two major knowledge sources • Acoustic Model: p(x|W) • Language Model: P(W) • Bayes: p(W|x) P(x) = p(x|W) P(W) • Search problem: argmaxW p(x|W) P(W) • p(x|W) consists of Hidden Markov Models: • Dictionary defines state sequence: "hello" = /hh eh l ow/ • Full model: concatenation of states (i.e. sounds) ESSLLI 2002, Trento
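
Below is a minimal, illustrative sketch (not from the slides) of this search objective in Python; acoustic_logprob and lm_logprob are hypothetical placeholders for the HMM acoustic score log p(x|W) and the language model score log P(W).

```python
import math

def decode(x, candidate_word_sequences, acoustic_logprob, lm_logprob):
    """Brute-force argmax_W p(x|W) * P(W) over an explicit candidate list,
    done in the log domain so the scores can simply be added."""
    best_W, best_score = None, -math.inf
    for W in candidate_word_sequences:
        score = acoustic_logprob(x, W) + lm_logprob(W)
        if score > best_score:
            best_W, best_score = W, score
    return best_W
```

Enumerating every candidate word sequence is of course intractable for real vocabularies; the remaining slides are about organizing this argmax efficiently.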

  4. Target Function / Measure • %WER = minimum editing distance between reference and hypothesis • Example: • REF: the quick brown fox jumps * over • HYP: * quick brown fox jump is over • Errors: D S I • WER = 3/7 = 43% • Different measure from max p(W|x)! ESSLLI 2002, Trento
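
A minimal sketch (not from the slides) of %WER as a word-level minimum edit distance, computed by dynamic programming; applied to the REF/HYP pair above it counts the three errors (one deletion, one substitution, one insertion).

```python
def word_errors(ref, hyp):
    """Minimum word-level edit distance between reference and hypothesis.
    Returns the error count; %WER divides it by the reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])   # substitution or match
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)]

# The slide's example: one deletion ("the"), one substitution ("jumps" -> "jump"),
# one insertion ("is"), i.e. 3 word errors.
print(word_errors("the quick brown fox jumps over",
                  "quick brown fox jump is over"))   # -> 3
```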

  5. A simpler problem: Evaluation • So far we have: • Dictionary: "hello" = /hh eh l ow/ … • Acoustic Model: phh(x), peh(x), pl(x), pow(x) … • Language Model: P("hello world") • State sequence: /hh eh l ow w er l d/ • Given W and x: alignment needed! (figure: frames of x aligned to / hh eh l ow /) ESSLLI 2002, Trento

  7. The Viterbi Algorithm • Beam search from left to right • Resulting alignment is the best match given the state models ps(x) and the observations x (figure: search trellis over / hh eh l ow /) ESSLLI 2002, Trento

  8. The Viterbi Algorithm (cont‘d) • Evaluation problem: ~ Dynamic Time Warping • Best alignment for given W, x, and ps(x) by locally adding scores (= -log p) for states and transitions (figure: trellis over / hh eh l ow /) ESSLLI 2002, Trento
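
The following sketch, assuming a T x S table of -log p scores (one row per frame, one column per state of a sequence such as /hh eh l ow/), illustrates the Viterbi alignment described on this and the previous slide; beam pruning is omitted for brevity.

```python
import math

def viterbi_align(neg_log_probs):
    """Align T frames to a fixed left-to-right state sequence.
    neg_log_probs[t][s] = -log p_s(x_t); allowed moves per frame are
    'stay in state s' or 'advance to state s+1' (simple HMM topology).
    Assumes at least as many frames as states."""
    T, S = len(neg_log_probs), len(neg_log_probs[0])
    INF = math.inf
    cost = [[INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    cost[0][0] = neg_log_probs[0][0]              # must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = cost[t - 1][s]
            move = cost[t - 1][s - 1] if s > 0 else INF
            prev, best = (s, stay) if stay <= move else (s - 1, move)
            cost[t][s] = best + neg_log_probs[t][s]
            back[t][s] = prev
    path, s = [], S - 1                           # must end in the last state
    for t in range(T - 1, -1, -1):
        path.append(s)
        s = back[t][s]
    return list(reversed(path))                   # state index for every frame
```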

  9. Pronunciation Prefix Trees (PPT) • Tree Representation of the Search Dictionary • Very compact → fast! • Viterbi Algorithm also works for trees • BROADWAY: B R OA D W EY • BROADLY: B R OA D L IE • BUT: B AH T ESSLLI 2002, Trento
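
A minimal sketch of such a pronunciation prefix tree, built from the slide's three example pronunciations; BROADWAY and BROADLY share the B R OA D branch, which is what makes the tree compact.

```python
class Node:
    def __init__(self, phone=None):
        self.phone = phone
        self.children = {}     # phone -> Node
        self.word = None       # set at the node where a word ends

def build_ppt(dictionary):
    """Build a prefix tree from a word -> phone-sequence dictionary."""
    root = Node()
    for word, phones in dictionary.items():
        node = root
        for phone in phones:
            node = node.children.setdefault(phone, Node(phone))
        node.word = word
    return root

ppt = build_ppt({
    "BROADWAY": ["B", "R", "OA", "D", "W", "EY"],
    "BROADLY":  ["B", "R", "OA", "D", "L", "IE"],
    "BUT":      ["B", "AH", "T"],
})
# BROADWAY and BROADLY share the prefix B R OA D -> one shared branch
```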

  10. Viterbi Search for PPTs • A PPT is traversed in a time-synchronous way • Apply Viterbi Algorithm on • state level (sub-phonemic units: –b –m –e) → constrained by HMM topology • phone level → constrained by PPT • What do we do when we reach the end of a word? ESSLLI 2002, Trento

  11. Re-Entrant PPTs for continuous speech • Isolated word recognition: • Search terminates in the leaves of the PPT • Decoding of word sequences: • Re-enter the PPT and store the Viterbi path using a backpointer table ESSLLI 2002, Trento
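
A minimal sketch of a backpointer table for re-entry, with assumed field names: whenever the search reaches a PPT leaf, an entry recording the word, its end frame, its score and the index of its predecessor entry is appended; tracing back from the best final entry recovers the word sequence.

```python
class Backpointer:
    def __init__(self, word, end_frame, score, predecessor):
        self.word = word                 # word hypothesis ending here
        self.end_frame = end_frame       # frame at which the word ends
        self.score = score               # accumulated Viterbi path score
        self.predecessor = predecessor   # index of the previous entry (None at start)

def trace_back(table, best_index):
    """Recover the best word sequence by following predecessor indices."""
    words, i = [], best_index
    while i is not None:
        words.append(table[i].word)
        i = table[i].predecessor
    return list(reversed(words))
```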

  12. Problem: Branching Factor • Imagine a sequence of 3 words with a 10k vocabulary • 10k ^ 3 = 1000G paths (potentially) • Not everything will be expanded, of course • Viterbi approximation → path recombination: • With a trigram LM, P(Candy | "hi I am") = P(Candy | "hello I am"): only the last two words matter, so the two paths can be recombined ESSLLI 2002, Trento

  13. Path Recombination • At time t: Path1 = w1 … wN with score s1, Path2 = v1 … vM with score s2 • where s1 = p(x1 … xt | w1 … wN) · Πi P(wi | wi-1 wi-2) and s2 = p(x1 … xt | v1 … vM) · Πi P(vi | vi-1 vi-2) • In the end, we‘re only interested in the best path! ESSLLI 2002, Trento
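
A minimal sketch of this recombination under a trigram LM, assuming each active path is a (word sequence, score) pair with higher scores being better and the scores below being made up: paths whose last n-1 words agree receive identical LM extensions, so only the best of them needs to survive.

```python
def recombine(paths, n=3):
    """Keep only the best-scoring path per LM history (last n-1 words)."""
    best = {}
    for words, score in paths:
        history = tuple(words[-(n - 1):])          # e.g. ("I", "am")
        if history not in best or score > best[history][1]:
            best[history] = (words, score)
    return list(best.values())

# The slide's example: "hi I am" and "hello I am" share the history ("I", "am"),
# so P(Candy | history) is the same for both and only the better path survives.
paths = [(["hi", "I", "am"], -12.3), (["hello", "I", "am"], -14.1)]
print(recombine(paths))   # -> [(['hi', 'I', 'am'], -12.3)]
```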

  14. Path Recombination (cont‘d) • To expand the search space into a new root: • Pick the path with the best score so far (Viterbi approximation) • Initialize scores and backpointers for the root node according to the best predecessor word • Store the left-context model information with the last phone from the predecessor (context-dependent acoustic models: /s ih t/ → /l ih p/) ESSLLI 2002, Trento

  15. Problem with Re-Entry • For a correct use of the Viterbi algorithm, the choice of the best path must include the score for the transition from the predecessor word to the successor word • The word identity is not known at the root level, so the choice of the best predecessor cannot be made at this point ESSLLI 2002, Trento

  16. Consequences • Wrong predecessor words → language model information is available only at the leaf level • Wrong word boundaries • The starting point for the successor word is determined without any language model information • Incomplete linguistic information • Open (loose) pruning thresholds are needed for beam search ESSLLI 2002, Trento

  17. Three-Pass search strategy • Search on a tree-organized lexicon (PPT) • Aggressive path recombination at word ends • Use linguistic information only approximately • Generate a list of starting words for each frame • Search on a flat-organized lexicon • Fix the word segmentation from the first pass • Full use of language model (often needs a third pass) ESSLLI 2002, Trento

  18. Three-Pass Decoder: Results • Q4g system with cache for acoustic scores: • 4000 acoustic models trained on BN+ESST • 40k Vocabulary • Test on “readBN” data ESSLLI 2002, Trento

  19. One-Pass Decoder: Motivation • The efficient use of all available knowledge sources as early as possible should result in faster decoding • Use the same engine to decode along: • Statistical n-gram language models with arbitrary n • Context-free grammars (CFG) • Word-graphs ESSLLI 2002, Trento

  20. Linguistic states • Linguistic state, examples: • n-1 word history for statistical n-gram LM • Grammar state for CFGs • (lattice node,word history) for word-graphs • To fully use the linguistic knowledge source, the linguistic state has to be kept during decoding • Path recombination has to be delayed until the word identity is known ESSLLI 2002, Trento

  21. Linguistic context assignment • Key idea: establish a linguistic polymorphism for each node of the PPT • Maintain a list of linguistically morphed instances in each node • Each instance stores its own backpointer and scores for each state of the underlying HMM with respect to the linguistic state of that instance ESSLLI 2002, Trento
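
A minimal sketch of the data structure described here, with assumed field names: each PPT node keeps a dictionary of instances keyed by the linguistic state (e.g. the two-word history), and each instance carries its own per-HMM-state scores and backpointer.

```python
class Instance:
    def __init__(self, lct, n_hmm_states):
        self.lct = lct                                  # linguistic context/state
        self.scores = [float("inf")] * n_hmm_states     # per-HMM-state scores
        self.backpointer = None                         # best predecessor word end

class PPTNode:
    def __init__(self, phone):
        self.phone = phone
        self.children = {}                              # phone -> PPTNode
        self.instances = {}                             # lct -> Instance

    def get_instance(self, lct, n_hmm_states=3):
        """Allocate an instance for this linguistic context on demand."""
        if lct not in self.instances:
            self.instances[lct] = Instance(lct, n_hmm_states)
        return self.instances[lct]
```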

  22. PPT with linguistically morphed instances • (figure: PPT with branches B R OA D → W EY / L IE and B → AH T) • Typically: 3-gram LM, i.e. P(W) = Πi P(wi | Wi) • P(wi | Wi) = P(broadway | "bullets over") ESSLLI 2002, Trento

  23. Language Model Lookahead • Since the linguistic state is known, the complete LM information P(W) can be applied to the instances, given the possible successor words for that node of the PPT • Let lct = linguistic context/state of instance i in node n, and path(w) = the path of word w in the PPT • π(n, lct) = max over { w : node n ∈ path(w) } of P(w | lct), i.e. the best word reachable below n (a minimum when working with -log scores) • score(i) = p(x1 … xt | w1 … wN) · P(wN-1 | …) · π(n, lct) ESSLLI 2002, Trento
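
A minimal sketch of computing π(n, lct) over the PPT in the probability domain (the maximum over reachable words, equivalently the minimum of -log LM scores); lm_prob(word, lct) is a hypothetical LM interface and the nodes are assumed to have the children/word fields of the prefix-tree sketch above.

```python
def lm_lookahead(node, lct, lm_prob, table=None):
    """Fill table[node] = pi(node, lct) by post-order traversal of the PPT."""
    if table is None:
        table = {}
    best = 0.0
    if node.word is not None:                     # a word ends at this node
        best = lm_prob(node.word, lct)
    for child in node.children.values():          # best word reachable below
        best = max(best, lm_lookahead(child, lct, lm_prob, table))
    table[node] = best                            # pi(node, lct)
    return best
```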

  24. LM Lookahead (cont‘d) • When the word becomes unique, the exact LM score is already incorporated and no explicit word transition needs to be computed • The LM lookahead scores π will be updated on demand, based on a compressed PPT ("smearing" of LM scores) • Tighter pruning thresholds can be used since the language model information is not delayed anymore ESSLLI 2002, Trento

  25. Early Path Recombination • Path recombination can be performed as soon as the word becomes unique, which is usually a few nodes before reaching the leaf. This reduces the number of unique linguistic contexts and instances • This is particularly effective for cross-word models due to the fan-out of the right-context models ESSLLI 2002, Trento
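
A minimal sketch (an assumption about how this could be detected, not the decoder's actual code, reusing the Node class from the prefix-tree sketch) of finding where a word becomes unique: if exactly one word is reachable below a node, the word identity, and hence its exact LM score, is already known there.

```python
def mark_unique_words(node):
    """Return the set of words reachable below `node` and mark nodes where
    only one word remains: there the word identity is already unique and
    paths can be recombined before the leaf is reached."""
    words = {node.word} if node.word is not None else set()
    for child in node.children.values():
        words |= mark_unique_words(child)
    node.unique_word = next(iter(words)) if len(words) == 1 else None
    return words
```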

  26. One-pass Decoder: Summary • One-pass decoder based on • A single copy of the tree with dynamically allocated instances • Early path recombination • Full language model lookahead • Linguistic knowledge sources • Statistical n-grams with n > 3 possible • Context-free grammars ESSLLI 2002, Trento

  27. Results ESSLLI 2002, Trento

  28. Remarks on speed-up • Speed-up ranges from a factor of almost 3 for the readBN task to 1.4 for the meeting data • Speed-up depends strongly on matched domain conditions • The decoder profits from sharp language models • LM lookahead is less effective for weak language models due to unmatched conditions ESSLLI 2002, Trento

  29. Memory usage: Q4g ESSLLI 2002, Trento

  30. Summary • Decoding is time- and memory-consuming • Search errors occur when beams are too tight (trade-off) or the Viterbi assumption is violated • State of the art: one-pass decoder • Tree structure for efficiency • Linguistically morphed instances of nodes and leaves • Other approaches exist (stack decoding, a-posteriori decoding, ...) ESSLLI 2002, Trento
