Speech Recognition (Part 2)

  1. Speech Recognition (Part 2) T. J. Hazen MIT Computer Science and Artificial Intelligence Laboratory

  2. Lecture Overview • Probabilistic framework • Pronunciation modeling • Language modeling • Finite state transducers • Search • System demonstrations (time permitting)

  3. Probabilistic Framework • Speech recognition is typically performed using a probabilistic modeling approach • Goal is to find the most likely string of words, W, given the acoustic observations, A: W* = argmax over W of P(W | A) • The expression is rewritten using Bayes’ Rule: P(W | A) = P(A | W) P(W) / P(A), and since P(A) does not depend on W, W* = argmax over W of P(A | W) P(W)

  4. Probabilistic Framework • Words are represented as a sequence of phonetic units • Using phonetic units, U, the expression expands to: W* = argmax over W (and U) of P(A | U) P(U | W) P(W), where P(A | U) is the acoustic model, P(U | W) the pronunciation model, and P(W) the language model • Pronunciation and language models provide constraint • Pronunciation and language models are encoded in a lexical network • Search must efficiently find the most likely U and W
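To make the factorization concrete, here is a minimal sketch (not from the lecture) of scoring competing hypotheses by summing log probabilities from the acoustic, pronunciation, and language models; all tables and numbers below are illustrative assumptions.

```python
import math

# Illustrative log-probability tables; a real recognizer computes these from
# its acoustic, pronunciation, and language models.
log_p_acoustic = {"b ah dx er": -12.0, "b ah tcl t er": -14.5}    # log P(A | U)
log_p_pronunciation = {("b ah dx er", "butter"): math.log(0.3),   # log P(U | W)
                       ("b ah tcl t er", "butter"): math.log(0.7)}
log_p_language = {"butter": math.log(0.001)}                      # log P(W)

def score(units, words):
    """Total log score log P(A|U) + log P(U|W) + log P(W) for one hypothesis."""
    return (log_p_acoustic[units]
            + log_p_pronunciation[(units, words)]
            + log_p_language[words])

hypotheses = [("b ah dx er", "butter"), ("b ah tcl t er", "butter")]
best = max(hypotheses, key=lambda h: score(*h))
print(best, score(*best))   # the flapped realization wins with these numbers
```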

  5. Phonemes • Phonemes are the basic linguistic units used to construct morphemes, words and sentences. • Phonemes represent unique canonical acoustic sounds • When constructing words, changing a single phoneme changes the word. • Example phonemic mappings: • pin → /p ih n/ • thought → /th ao t/ • saves → /s ey v z/ • English spelling is not (exactly) phonemic • Pronunciation cannot always be determined from spelling • Homophones have the same phonemes but different spellings • two vs. to vs. too, bear vs. bare, queue vs. cue, etc. • Same spelling can have different pronunciations • read, record, either, etc.

  6. Phonemic Units and Classes • Vowels: aa : pot, er : bert, ae : bat, ey : bait, ah : but, ih : bit, ao : bought, iy : beat, aw : bout, ow : boat, ax : about, oy : boy, ay : buy, uh : book, eh : bet, uw : boot • Semivowels: l : light, w : wet, r : right, y : yet • Nasals: m : might, n : night, ng : sing • Stops: p : pay, b : bay, t : tea, d : day, k : key, g : go • Fricatives: s : sue, f : fee, z : zoo, v : vee, sh : shoe, th : thesis, zh : azure, dh : that, hh : hat • Affricates: ch : chew, jh : Joe

  7. Phones • Phones (or phonetic units) are used to represent the actual acoustic realization of phonemes. • Examples: • Stops contain a closure and a release: /t/ → [tcl t], /k/ → [kcl k] • The /t/ and /d/ phonemes can be flapped: utter /ah t er/ → [ah dx er], udder /ah d er/ → [ah dx er] • Vowels can be fronted: Tuesday /t uw z d ey/ → [tcl t ux z d ey]

  8. Enhanced Phoneme Labels • Stops: p : pay, b : bay, t : tea, d : day, k : key, g : go • Stops w/ optional release: pd : tap, bd : tab, td : pat, dd : bad, kd : pack, gd : dog • Unaspirated stops: p- : speed, t- : steep, k- : ski • Stops w/ optional flap: tf : batter, df : badder • Retroflexed stops: tr : tree, dr : drop • Special sequences: nt : interview, tq en : Clinton

  9. Example Phonemic Baseform File • <hangup> : _h1 + (special noise model symbol; + repeats the previous symbol) • <noise> : _n1 + • <uh> : ah_fp (special filled-pause vowel) • <um> : ah_fp m • adder : ae df er • atlanta : ( ae | ax ) td l ae nt ax (alternate pronunciations in parentheses) • either : ( iy | ay ) th er • laptop : l ae pd t aa pd • northwest : n ao r th w eh s td • speech : s p- iy ch • temperature : t eh m p ( r ? ax | er ax ? ) ch er (? marks optional phonemes) • trenton : tr r eh n tq en
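To illustrate how such a baseform might be expanded into concrete pronunciations, here is a small sketch assuming the notation shown above: ( a | b ) marks alternatives and ? marks an optional phoneme. It is not the actual lexicon tooling.

```python
def tokenize(baseform):
    return baseform.split()

def parse_seq(tokens, i):
    """Parse a sequence until ')' or '|' or end; return (expansions, next index)."""
    exps = [[]]                      # list of phoneme-symbol lists
    while i < len(tokens) and tokens[i] not in (")", "|"):
        if tokens[i] == "(":
            unit_exps, i = parse_alts(tokens, i + 1)
        else:
            unit_exps, i = [[tokens[i]]], i + 1
        if i < len(tokens) and tokens[i] == "?":   # '?' makes the previous unit optional
            unit_exps, i = unit_exps + [[]], i + 1
        exps = [e + u for e in exps for u in unit_exps]
    return exps, i

def parse_alts(tokens, i):
    """Parse '|'-separated alternatives up to the matching ')'."""
    alts = []
    while True:
        exps, i = parse_seq(tokens, i)
        alts.extend(exps)
        if i < len(tokens) and tokens[i] == "|":
            i += 1
            continue
        break
    assert tokens[i] == ")"
    return alts, i + 1

def expand_baseform(baseform):
    exps, _ = parse_seq(tokenize(baseform), 0)
    return [" ".join(e) for e in exps]

print(expand_baseform("( iy | ay ) th er"))
# -> ['iy th er', 'ay th er']
print(expand_baseform("t eh m p ( r ? ax | er ax ? ) ch er"))
# -> 4 pronunciations, starting with 't eh m p r ax ch er'
```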

  10. Applying Phonological Rules • Multiple phonetic realizations of a phoneme can be generated by applying phonological rules. • Example: butter : b ah tf er • This can be realized phonetically with a standard /t/ as: bcl b ah tcl t er • or with a flapped /t/ as: bcl b ah dx er • Phonological rewrite rules can be used to generate this: butter : bcl b ah ( tcl t | dx ) er

  11. Example Phonological Rules • Rule format: {left context} phoneme {right context} => phonetic realization • Example rule for /t/ deletion (“destination”): {s} t {ax ix} => [tcl t]; • Example rule for palatalization of /s/ (“miss you”): {} s {y} => s | sh;
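Below is a minimal sketch of applying rules of this general form to a phoneme string to enumerate alternative phonetic realizations; the rule contexts and outputs (a flapping rule plus the palatalization rule above) are illustrative assumptions, not the system's actual rule set or rule engine, and the stop-closure rule for /b/ is omitted for brevity.

```python
from itertools import product

# Each rule: (left context set, phoneme, right context set, alternative realizations).
# An alternative is a tuple of phones; an empty context set matches anything.
RULES = [
    (set(), "t", {"ax", "ix", "er"}, [("tcl", "t"), ("dx",)]),   # optional flapping of /t/
    (set(), "s", {"y"}, [("s",), ("sh",)]),                      # palatalization of /s/
]

def realizations(phonemes):
    """Expand a phoneme sequence into all phone sequences licensed by RULES."""
    per_position = []
    for i, p in enumerate(phonemes):
        left = phonemes[i - 1] if i > 0 else None
        right = phonemes[i + 1] if i + 1 < len(phonemes) else None
        alts = [(p,)]                      # default: the phoneme surfaces unchanged
        for lc, ph, rc, outs in RULES:
            if ph == p and (not lc or left in lc) and (not rc or right in rc):
                alts = outs
                break
        per_position.append(alts)
    # Cross product over positions gives every alternative realization
    return [" ".join(sum(combo, ())) for combo in product(*per_position)]

print(realizations(["b", "ah", "t", "er"]))
# -> ['b ah tcl t er', 'b ah dx er']
```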

  12. Contractions and Reductions • Examples of contractions: • what’s → what is • isn’t → is not • won’t → will not • i’d → i would | i had • today’s → today is | today’s • Examples of multi-word reductions: • gimme → give me • gonna → going to • ave → avenue • ‘bout → about • d’y’ave → do you have • Contracted and reduced forms are entered in the lexical dictionary

  13. Language Modeling • A language model constrains hypothesized word sequences • A finite state grammar (FSG) example: a word network accepting sentences such as “what is the weather in boston” or “tell me the forecast for baltimore” • Probabilities can be added to arcs for additional constraint • FSGs work well when users stay within the grammar… • …but FSGs can’t cover everything that might be spoken.

  14. N-gram Language Modeling • An n-gram model is a statistical language model • Predicts the current word based on the previous n-1 words • Trigram model expression: P(wn | wn-2, wn-1) • Examples: P(boston | arriving in), P(seventeenth | tuesday march) • An n-gram model allows any sequence of words… • …but prefers sequences common in training data.

  15. N-gram Model Smoothing • For a bigram model, what if a word pair never occurs in the training data, i.e., count(wn-1, wn) = 0? • To avoid sparse training data problems, we can use an interpolated bigram: P(wn | wn-1) = λ Pml(wn | wn-1) + (1 − λ) Pml(wn) • One method for determining the interpolation weight is to trust the bigram estimate more for frequently observed contexts, e.g., λ = count(wn-1) / (count(wn-1) + K) for some constant K
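A minimal sketch of the interpolated bigram on a toy corpus; the corpus and the count-based choice of λ below are illustrative assumptions.

```python
from collections import Counter

sentences = [
    "<s> what is the weather in boston </s>",
    "<s> what is the forecast for baltimore </s>",
    "<s> tell me the weather in boston </s>",
]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    words = s.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

total_words = sum(unigrams.values())
K = 2.0   # smoothing constant controlling trust in the bigram estimate

def p_interp(word, prev):
    """Interpolated bigram: lambda * P_ml(word|prev) + (1 - lambda) * P_ml(word)."""
    p_uni = unigrams[word] / total_words
    if unigrams[prev] == 0:
        return p_uni                      # unseen context: back off to the unigram
    lam = unigrams[prev] / (unigrams[prev] + K)
    p_bi = bigrams[(prev, word)] / unigrams[prev]
    return lam * p_bi + (1 - lam) * p_uni

print(p_interp("weather", "the"))   # seen bigram: dominated by the bigram estimate
print(p_interp("forecast", "in"))   # unseen bigram: falls back toward the unigram
```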

  16. Class N-gram Language Modeling • Class n-gram models can also help with sparse data problems • Class trigram expression: P(wn | wn-2, wn-1) ≈ P(class(wn) | class(wn-2), class(wn-1)) × P(wn | class(wn)) • Example: P(seventeenth | tuesday march) ≈ P(NTH | WEEKDAY MONTH) × P(seventeenth | NTH)
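A small sketch of how the class trigram factorization combines its two terms; the class assignments come from the slide's example, but the probability values are made-up numbers for illustration.

```python
word_class = {"tuesday": "WEEKDAY", "march": "MONTH", "seventeenth": "NTH"}

# P(class(wn) | class(wn-2), class(wn-1)) -- assumed values for illustration
p_class = {("WEEKDAY", "MONTH", "NTH"): 0.30}

# P(wn | class(wn)) -- assumed values for illustration
p_word_given_class = {("seventeenth", "NTH"): 0.03}

def p_class_trigram(w2, w1, w):
    c2, c1, c = word_class[w2], word_class[w1], word_class[w]
    return p_class[(c2, c1, c)] * p_word_given_class[(w, c)]

print(p_class_trigram("tuesday", "march", "seventeenth"))   # 0.30 * 0.03 = 0.009
```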

  17. Multi-Word N-gram Units • Common multi-word units can be treated as a single unit within an N-gram language model • Common uses of compound units: • Common multi-word phrases: • thank_you , good_bye , excuse_me • Multi word sequences that act as a single semantic unit: • new_york , labor_day , wind_speed • Letter sequences or initials: • j_f_k , t_w_a , washington_d_c
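A small sketch, assuming a hand-picked list of multi-word units, of how word sequences could be rewritten into single tokens before n-gram training.

```python
# Illustrative unit list; a real system would derive this from its domain vocabulary.
MULTIWORD_UNITS = {("thank", "you"): "thank_you",
                   ("new", "york"): "new_york",
                   ("labor", "day"): "labor_day"}

def join_multiword_units(words):
    """Greedily replace known two-word sequences with their compound token."""
    out, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in MULTIWORD_UNITS:
            out.append(MULTIWORD_UNITS[pair])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

print(join_multiword_units("thank you i live in new york".split()))
# -> ['thank_you', 'i', 'live', 'in', 'new_york']
```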

  18. Finite-State Transducer (FST) Motivation • Most speech recognition constraints and results can be represented as finite-state automata: • Language models (e.g., n-grams and word networks) • Lexicons • Phonological rules • N-best lists • Word graphs • Recognition paths • Common representation and algorithms desirable • Consistency • Powerful algorithms can be employed throughout system • Flexibility to combine or factor in unforeseen ways

  19. What is an FST? • An FST defines a weighted relationship between regular languages • A generalization of the classic finite-state acceptor (FSA) • One initial state • One or more final states • Transitions between states are labeled input : output / weight • input requires an input symbol to match • output indicates the symbol to output when the transition is taken • epsilon (ε) consumes no input or produces no output • weight is the cost (e.g., −log probability) of taking the transition
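The following sketch shows one way such an FST could be represented and traversed in code; the toy lexicon arcs and weights are illustrative assumptions, not the lecture's toolkit.

```python
# A tiny weighted FST: arcs are (input, output, weight, next_state); EPS consumes
# no input / emits no output. State 0 is the initial state.
EPS = "<eps>"

arcs = {
    0: [("t", EPS, 0.0, 1)],
    1: [("uw", "two", 0.5, 3), ("uw", "to", 1.2, 3), ("iy", "tea", 0.7, 3)],
    3: [],
}
final_states = {3}

def transduce(inputs):
    """Return (output sequence, total weight) of the cheapest path accepting inputs."""
    best = None
    def walk(state, i, out, w):
        nonlocal best
        if i == len(inputs) and state in final_states:
            if best is None or w < best[1]:
                best = (out, w)
        for sym_in, sym_out, weight, nxt in arcs.get(state, []):
            emitted = [] if sym_out == EPS else [sym_out]
            if sym_in == EPS:
                walk(nxt, i, out + emitted, w + weight)
            elif i < len(inputs) and sym_in == inputs[i]:
                walk(nxt, i + 1, out + emitted, w + weight)
    walk(0, 0, [], 0.0)
    return best

print(transduce(["t", "uw"]))   # (['two'], 0.5) -- the lower-weight path wins
```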

  20. FST Example: Lexicon • The lexicon maps /phonemes/ to ‘words’ • Words can share parts of their pronunciations • Sharing at the beginning is beneficial to recognition speed because pruning can prune many words at once

  21. FST Composition • Composition (∘) combines two FSTs to produce a single FST that performs both mappings in a single step • (words → /phonemes/) ∘ (/phonemes/ → [phones]) = (words → [phones])
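Here is a sketch of composition for epsilon-free FSTs (real lexicon FSTs use epsilon arcs, which need extra bookkeeping omitted here); it composes a toy phonological FST with a toy mapping to acoustic-model labels, both of which are illustrative assumptions.

```python
# Each FST: dict state -> list of arcs (input, output, weight, next_state),
# with state 0 initial; finals is a set of states.

def compose(a, a_finals, b, b_finals):
    """Compose FST a (X -> Y) with FST b (Y -> Z) into a single FST (X -> Z)."""
    start = (0, 0)
    arcs, finals = {}, set()
    stack, seen = [start], {start}
    while stack:
        state = stack.pop()
        sa, sb = state
        out_arcs = []
        for in_a, out_a, w_a, next_a in a.get(sa, []):
            for in_b, out_b, w_b, next_b in b.get(sb, []):
                if out_a == in_b:                    # a's output must match b's input
                    nxt = (next_a, next_b)
                    out_arcs.append((in_a, out_b, w_a + w_b, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
        arcs[state] = out_arcs
        if sa in a_finals and sb in b_finals:
            finals.add(state)
    return arcs, finals

# P: phonemes -> phones, with the optional /t/ flapping alternative
P = {0: [("b", "bcl_b", 0.0, 1)],
     1: [("ah", "ah", 0.0, 2)],
     2: [("t", "tcl_t", 0.7, 3), ("t", "dx", 0.3, 3)],
     3: [("er", "er", 0.0, 4)]}
P_finals = {4}

# C: phones -> acoustic model labels (label names are made up for illustration)
C = {0: [("bcl_b", "b_model", 0.0, 0), ("ah", "ah_model", 0.0, 0),
         ("tcl_t", "t_model", 0.0, 0), ("dx", "dx_model", 0.0, 0),
         ("er", "er_model", 0.0, 0)]}
C_finals = {0}

PC, PC_finals = compose(P, P_finals, C, C_finals)
print(PC[(0, 0)])   # [('b', 'b_model', 0.0, (1, 0))]
print(PC[(2, 0)])   # both realizations of /t/ survive, with their weights
```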

  22. FST Optimization Example • Example: a letter-to-word lexicon

  23. FST Optimization Example: Determinization • Determinization turns the lexicon into a tree • Words share a common prefix
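The prefix sharing that determinization produces can be illustrated by building a letter-to-word prefix tree; the word list below is an assumption, and this is an analogy rather than a full FST determinization algorithm.

```python
# Building a letter-to-word prefix tree: words that share a prefix share the same
# initial arcs, which is the effect determinization has on the lexicon FST.
words = ["jim", "jill", "bill"]

next_state = 0
arcs = {}      # (state, letter) -> next state
word_at = {}   # final state -> word

for word in words:
    state = 0
    for letter in word:
        if (state, letter) not in arcs:
            next_state += 1
            arcs[(state, letter)] = next_state
        state = arcs[(state, letter)]
    word_at[state] = word

print(len(set(arcs.values())) + 1, "states")   # fewer states than total letters: 'ji' is shared
print(word_at)
```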

  24. FST Optimization Example: Minimization • Minimization enables sharing at the ends of words

  25. A Cascaded FST Recognizer • The recognizer is built from a cascade of FSTs: • G : Language Model → multi-word units • M : Multi-word Mapping → canonical words • R : Reductions Model → spoken words • L : Lexical Model → phonemic units • P : Phonological Model → phonetic units • C : CD Model Mapping → acoustic model labels • Together the cascade encodes the language model, the pronunciation model, and the mapping to acoustic model labels

  26. A Cascaded FST Recognizer (example) • G : Language Model → multi-word units: give me new_york_city • M : Multi-word Mapping → canonical words: give me new york city • R : Reductions Model → spoken words: gimme new york city • L : Lexical Model → phonemic units: g ih m iy n uw y ao r kd s ih tf iy • P : Phonological Model → phonetic units: gcl g ih m iy n uw y ao r kcl s ih dx iy • C : CD Model Mapping → acoustic model labels

  27. Search • Once again, the probabilistic expression is: W* = argmax over W (and U) of P(A | U) P(U | W) P(W) • Pronunciation and language models are encoded in the lexical FST • Search must efficiently find the most likely U and W

  28. Viterbi Search • Viterbi search: a time-synchronous breadth-first search • (Figure: a search trellis with lexical nodes h#, m, a, r, z on the vertical axis and time frames t0–t8 on the horizontal axis)
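A minimal sketch of time-synchronous Viterbi search over a chain of lexical nodes like the trellis in the slide; the node set (h#, m, a, r, z) matches the figure, while the per-frame acoustic scores are made-up numbers rather than real model outputs.

```python
import math

nodes = ["h#", "m", "a", "r", "z", "h#"]

def successors(i):
    """Self-loop plus advance-to-next-node transitions."""
    return [i] if i == len(nodes) - 1 else [i, i + 1]

frame_scores = [   # frame_scores[t][i] = log score of node i at frame t (made up)
    [-1, -9, -9, -9, -9, -9],
    [-8, -1, -9, -9, -9, -9],
    [-8, -2, -1, -9, -9, -9],
    [-9, -8, -1, -2, -9, -9],
    [-9, -9, -8, -1, -9, -9],
    [-9, -9, -9, -2, -1, -8],
    [-9, -9, -9, -9, -1, -2],
    [-9, -9, -9, -9, -8, -1],
]

best = [-math.inf] * len(nodes)
best[0] = frame_scores[0][0]           # paths must start in the initial silence node
backptrs = []

for t in range(1, len(frame_scores)):  # breadth-first: all active nodes updated per frame
    new_best = [-math.inf] * len(nodes)
    ptr = [None] * len(nodes)
    for i, score in enumerate(best):
        if score == -math.inf:
            continue
        for j in successors(i):
            candidate = score + frame_scores[t][j]
            if candidate > new_best[j]:
                new_best[j], ptr[j] = candidate, i
    best = new_best
    backptrs.append(ptr)

# Backtrace from the final silence node to recover the best node sequence
path = [len(nodes) - 1]
for ptr in reversed(backptrs):
    path.append(ptr[path[-1]])
path.reverse()
print([nodes[i] for i in path])   # ['h#', 'm', 'a', 'a', 'r', 'z', 'z', 'h#']
```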

  29. Viterbi Search Pruning • Search efficiency can be improved with pruning • Score-based: don’t extend low-scoring hypotheses • Count-based: extend only a fixed number of hypotheses • (Figure: the same trellis with pruned nodes marked ×)
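A small sketch of the two pruning strategies applied to one frame's set of active hypotheses; the hypothesis scores are illustrative log values.

```python
hypotheses = {"h#": -42.0, "m": -18.5, "a": -17.9, "r": -25.0, "z": -31.0}

def score_prune(hyps, beam_width):
    """Score-based pruning: keep hypotheses within beam_width of the best score."""
    best = max(hyps.values())
    return {h: s for h, s in hyps.items() if s >= best - beam_width}

def count_prune(hyps, beam_size):
    """Count-based pruning: keep only the beam_size highest-scoring hypotheses."""
    ranked = sorted(hyps.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:beam_size])

print(score_prune(hypotheses, beam_width=10.0))   # drops h# and z
print(count_prune(hypotheses, beam_size=3))       # keeps a, m, r
```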

  30. Search Pruning Example • Count-based pruning can effectively reduce search • Example: fix the beam size (count) and vary the beam width (score)

  31. N-best Computation with Backwards A* Search • A backwards A* search can be used to find the N-best paths • The Viterbi backtrace is used as the future estimate for path scores • (Figure: the search trellis from the Viterbi example)

  32. Street Address Recognition • Street address recognition is difficult: 6.2M unique street, city, state pairs in the US (283K unique words), a high confusion rate among similar street names, and a very large search space for recognition • Commercial solution → directed dialogue: breaks the problem into a set of smaller recognition tasks; simple for first-time users, but tedious with repeated use • Example directed dialogue: • C: Main menu. Please say one of the following: • C: “directions”, “restaurants”, “gas stations”, or “more options”. • H: Directions. • C: Okay. Directions. What state are you going to? • H: Massachusetts. • C: Okay. Massachusetts. What city are you going to? • H: Cambridge. • C: Okay. Cambridge. What is the street address? • H: 32 Vassar Street. • C: Okay. 32 Vassar Street in Cambridge, Massachusetts. • C: From your current location, continue straight on…

  33. Street Address Recognition • Research goal → mixed-initiative dialogue: more difficult to predict what users will say, but far more natural for repeat or expert users • Recognition approach: dynamically adapt the recognition vocabulary • 3 recognition passes over one utterance (see the sketch below): • 1st pass: detect the state and activate relevant cities • 2nd pass: detect cities and activate relevant streets • 3rd pass: recognize the full street address • Example mixed-initiative dialogue: • C: How can I help you? • H: I need directions to 32 Vassar Street in Cambridge, Mass.
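A schematic sketch of the three-pass approach; recognize(), cities_in(), and streets_in() are hypothetical placeholders standing in for the recognizer and a geographic database, not real APIs.

```python
def recognize(utterance, vocabulary):
    """Hypothetical placeholder: run recognition restricted to `vocabulary`."""
    raise NotImplementedError

def cities_in(state):
    """Hypothetical placeholder: cities of `state` from a geographic database."""
    raise NotImplementedError

def streets_in(city):
    """Hypothetical placeholder: streets of `city` from a geographic database."""
    raise NotImplementedError

def recognize_street_address(utterance, all_states):
    # 1st pass: detect the state and activate the relevant cities
    state = recognize(utterance, vocabulary=all_states)
    active_cities = cities_in(state)

    # 2nd pass: detect the city and activate the relevant streets
    city = recognize(utterance, vocabulary=active_cities)
    active_streets = streets_in(city)

    # 3rd pass: recognize the full street address over the narrowed vocabulary
    return recognize(utterance, vocabulary=active_streets | {city, state})
```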

  34. Dynamic Vocabulary Recognition
