Speech Recognition (Part 2)

  1. Speech Recognition (Part 2) T. J. Hazen MIT Computer Science and Artificial Intelligence Laboratory

  2. Lecture Overview • Probabilistic framework • Pronunciation modeling • Language modeling • Finite state transducers • Search • System demonstrations (time permitting)

  3. Probabilistic Framework • Speech recognition is typically performed using a probabilistic modeling approach • Goal is to find the most likely string of words, W, given the acoustic observations, A: W* = argmax over W of P(W | A) • The expression is rewritten using Bayes’ Rule: P(W | A) = P(A | W) P(W) / P(A), and since P(A) does not depend on W, W* = argmax over W of P(A | W) P(W)

  4. Probabilistic Framework • Words are represented as a sequence of phonetic units • Using phonetic units, U, the expression expands to: W* = argmax over W (and U) of P(A | U) P(U | W) P(W), where P(A | U) is the acoustic model, P(U | W) the pronunciation model, and P(W) the language model • Pronunciation and language models provide constraint • Pronunciation and language models are encoded in a lexical network • Search must efficiently find the most likely U and W
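To make the factorization concrete, here is a minimal sketch (not from the lecture) of scoring competing hypotheses by summing log probabilities from the acoustic, pronunciation, and language models; all tables and numbers below are illustrative assumptions.

```python
import math

# Illustrative log-probability tables; a real recognizer computes these from
# its acoustic, pronunciation, and language models.
log_p_acoustic = {"b ah dx er": -12.0, "b ah tcl t er": -14.5}    # log P(A | U)
log_p_pronunciation = {("b ah dx er", "butter"): math.log(0.3),   # log P(U | W)
                       ("b ah tcl t er", "butter"): math.log(0.7)}
log_p_language = {"butter": math.log(0.001)}                      # log P(W)

def score(units, words):
    """Total log score log P(A|U) + log P(U|W) + log P(W) for one hypothesis."""
    return (log_p_acoustic[units]
            + log_p_pronunciation[(units, words)]
            + log_p_language[words])

hypotheses = [("b ah dx er", "butter"), ("b ah tcl t er", "butter")]
best = max(hypotheses, key=lambda h: score(*h))
print(best, score(*best))   # the flapped realization wins with these numbers
```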

  5. Phonemes • Phonemes are the basic linguistic units used to construct morphemes, words and sentences. • Phonemes represent unique canonical acoustic sounds • When constructing words, changing a single phoneme changes the word. • Example phonemic mappings: • pin → /p ih n/ • thought → /th ao t/ • saves → /s ey v z/ • English spelling is not (exactly) phonemic • Pronunciation cannot always be determined from spelling • Homophones have the same phonemes but different spellings • two vs. to vs. too, bear vs. bare, queue vs. cue, etc. • Same spelling can have different pronunciations • read, record, either, etc.

  6. Phonemic Units and Classes • Vowels: aa : pot, er : bert, ae : bat, ey : bait, ah : but, ih : bit, ao : bought, iy : beat, aw : bout, ow : boat, ax : about, oy : boy, ay : buy, uh : book, eh : bet, uw : boot • Semivowels: l : light, w : wet, r : right, y : yet • Nasals: m : might, n : night, ng : sing • Stops: p : pay, b : bay, t : tea, d : day, k : key, g : go • Fricatives: s : sue, f : fee, z : zoo, v : vee, sh : shoe, th : thesis, zh : azure, dh : that, hh : hat • Affricates: ch : chew, jh : Joe

  7. Phones • Phones (or phonetic units) are used to represent the actual acoustic realization of phonemes. • Examples: • Stops contain a closure and a release: /t/ → [tcl t], /k/ → [kcl k] • The /t/ and /d/ phonemes can be flapped: utter /ah t er/ → [ah dx er], udder /ah d er/ → [ah dx er] • Vowels can be fronted: Tuesday /t uw z d ey/ → [tcl t ux z d ey]

  8. Enhanced Phoneme Labels • Stops: p : pay, b : bay, t : tea, d : day, k : key, g : go • Stops w/ optional release: pd : tap, bd : tab, td : pat, dd : bad, kd : pack, gd : dog • Unaspirated stops: p- : speed, t- : steep, k- : ski • Stops w/ optional flap: tf : batter, df : badder • Retroflexed stops: tr : tree, dr : drop • Special sequences: nt : interview, tq en : Clinton

  9. Example Phonemic Baseform File • <hangup> : _h1 + (special noise model symbol; + repeats the previous symbol) • <noise> : _n1 + • <uh> : ah_fp (special filled-pause vowel) • <um> : ah_fp m • adder : ae df er • atlanta : ( ae | ax ) td l ae nt ax (alternate pronunciations in parentheses) • either : ( iy | ay ) th er • laptop : l ae pd t aa pd • northwest : n ao r th w eh s td • speech : s p- iy ch • temperature : t eh m p ( r ? ax | er ax ? ) ch er (? marks optional phonemes) • trenton : tr r eh n tq en
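To illustrate how such a baseform might be expanded into concrete pronunciations, here is a small sketch assuming the notation shown above: ( a | b ) marks alternatives and ? marks an optional phoneme. It is not the actual lexicon tooling.

```python
def tokenize(baseform):
    return baseform.split()

def parse_seq(tokens, i):
    """Parse a sequence until ')' or '|' or end; return (expansions, next index)."""
    exps = [[]]                      # list of phoneme-symbol lists
    while i < len(tokens) and tokens[i] not in (")", "|"):
        if tokens[i] == "(":
            unit_exps, i = parse_alts(tokens, i + 1)
        else:
            unit_exps, i = [[tokens[i]]], i + 1
        if i < len(tokens) and tokens[i] == "?":   # '?' makes the previous unit optional
            unit_exps, i = unit_exps + [[]], i + 1
        exps = [e + u for e in exps for u in unit_exps]
    return exps, i

def parse_alts(tokens, i):
    """Parse '|'-separated alternatives up to the matching ')'."""
    alts = []
    while True:
        exps, i = parse_seq(tokens, i)
        alts.extend(exps)
        if i < len(tokens) and tokens[i] == "|":
            i += 1
            continue
        break
    assert tokens[i] == ")"
    return alts, i + 1

def expand_baseform(baseform):
    exps, _ = parse_seq(tokenize(baseform), 0)
    return [" ".join(e) for e in exps]

print(expand_baseform("( iy | ay ) th er"))
# -> ['iy th er', 'ay th er']
print(expand_baseform("t eh m p ( r ? ax | er ax ? ) ch er"))
# -> 4 pronunciations, starting with 't eh m p r ax ch er'
```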

  10. Applying Phonological Rules • Multiple phonetic realizations of a phoneme can be generated by applying phonological rules. • Example: butter : b ah tf er • This can be realized phonetically with a standard /t/ as: bcl b ah tcl t er • or with a flapped /t/ as: bcl b ah dx er • Phonological rewrite rules can be used to generate this: butter : bcl b ah ( tcl t | dx ) er

  11. Example Phonological Rules • Rule format: {left context} phoneme {right context} => phonetic realization • Example rule for /t/ deletion (“destination”): {s} t {ax ix} => [tcl t]; • Example rule for palatalization of /s/ (“miss you”): {} s {y} => s | sh;
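Below is a minimal sketch of applying rules of this general form to a phoneme string to enumerate alternative phonetic realizations; the rule contexts and outputs (a flapping rule plus the palatalization rule above) are illustrative assumptions, not the system's actual rule set or rule engine, and the stop-closure rule for /b/ is omitted for brevity.

```python
from itertools import product

# Each rule: (left context set, phoneme, right context set, alternative realizations).
# An alternative is a tuple of phones; an empty context set matches anything.
RULES = [
    (set(), "t", {"ax", "ix", "er"}, [("tcl", "t"), ("dx",)]),   # optional flapping of /t/
    (set(), "s", {"y"}, [("s",), ("sh",)]),                      # palatalization of /s/
]

def realizations(phonemes):
    """Expand a phoneme sequence into all phone sequences licensed by RULES."""
    per_position = []
    for i, p in enumerate(phonemes):
        left = phonemes[i - 1] if i > 0 else None
        right = phonemes[i + 1] if i + 1 < len(phonemes) else None
        alts = [(p,)]                      # default: the phoneme surfaces unchanged
        for lc, ph, rc, outs in RULES:
            if ph == p and (not lc or left in lc) and (not rc or right in rc):
                alts = outs
                break
        per_position.append(alts)
    # Cross product over positions gives every alternative realization
    return [" ".join(sum(combo, ())) for combo in product(*per_position)]

print(realizations(["b", "ah", "t", "er"]))
# -> ['b ah tcl t er', 'b ah dx er']
```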

  12. Contractions and Reductions • Examples of contractions: • what’s → what is • isn’t → is not • won’t → will not • i’d → i would | i had • today’s → today is | today’s • Examples of multi-word reductions: • gimme → give me • gonna → going to • ave → avenue • ‘bout → about • d’y’ave → do you have • Contracted and reduced forms are entered in the lexical dictionary

  13. Language Modeling • A language model constrains hypothesized word sequences • A finite state grammar (FSG) example: a word network accepting sentences such as “what is the weather in boston” or “tell me the forecast for baltimore” • Probabilities can be added to arcs for additional constraint • FSGs work well when users stay within the grammar… • …but FSGs can’t cover everything that might be spoken.

  14. N-gram Language Modeling • An n-gram model is a statistical language model • Predicts the current word based on the previous n-1 words • Trigram model expression: P(wn | wn-2, wn-1) • Examples: P(boston | arriving in), P(seventeenth | tuesday march) • An n-gram model allows any sequence of words… • …but prefers sequences common in training data.

  15. N-gram Model Smoothing • For a bigram model, what if a word pair never occurs in the training data, i.e., count(wn-1, wn) = 0? • To avoid sparse training data problems, we can use an interpolated bigram: P(wn | wn-1) = λ Pml(wn | wn-1) + (1 − λ) Pml(wn) • One method for determining the interpolation weight is to trust the bigram estimate more for frequently observed contexts, e.g., λ = count(wn-1) / (count(wn-1) + K) for some constant K
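A minimal sketch of the interpolated bigram on a toy corpus; the corpus and the count-based choice of λ below are illustrative assumptions.

```python
from collections import Counter

sentences = [
    "<s> what is the weather in boston </s>",
    "<s> what is the forecast for baltimore </s>",
    "<s> tell me the weather in boston </s>",
]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    words = s.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

total_words = sum(unigrams.values())
K = 2.0   # smoothing constant controlling trust in the bigram estimate

def p_interp(word, prev):
    """Interpolated bigram: lambda * P_ml(word|prev) + (1 - lambda) * P_ml(word)."""
    p_uni = unigrams[word] / total_words
    if unigrams[prev] == 0:
        return p_uni                      # unseen context: back off to the unigram
    lam = unigrams[prev] / (unigrams[prev] + K)
    p_bi = bigrams[(prev, word)] / unigrams[prev]
    return lam * p_bi + (1 - lam) * p_uni

print(p_interp("weather", "the"))   # seen bigram: dominated by the bigram estimate
print(p_interp("forecast", "in"))   # unseen bigram: falls back toward the unigram
```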

  16. Class N-gram Language Modeling • Class n-gram models can also help with sparse data problems • Class trigram expression: P(wn | wn-2, wn-1) ≈ P(class(wn) | class(wn-2), class(wn-1)) × P(wn | class(wn)) • Example: P(seventeenth | tuesday march) ≈ P(NTH | WEEKDAY MONTH) × P(seventeenth | NTH)
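A small sketch of how the class trigram factorization combines its two terms; the class assignments come from the slide's example, but the probability values are made-up numbers for illustration.

```python
word_class = {"tuesday": "WEEKDAY", "march": "MONTH", "seventeenth": "NTH"}

# P(class(wn) | class(wn-2), class(wn-1)) -- assumed values for illustration
p_class = {("WEEKDAY", "MONTH", "NTH"): 0.30}

# P(wn | class(wn)) -- assumed values for illustration
p_word_given_class = {("seventeenth", "NTH"): 0.03}

def p_class_trigram(w2, w1, w):
    c2, c1, c = word_class[w2], word_class[w1], word_class[w]
    return p_class[(c2, c1, c)] * p_word_given_class[(w, c)]

print(p_class_trigram("tuesday", "march", "seventeenth"))   # 0.30 * 0.03 = 0.009
```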

  17. Multi-Word N-gram Units • Common multi-word units can be treated as a single unit within an N-gram language model • Common uses of compound units: • Common multi-word phrases: • thank_you , good_bye , excuse_me • Multi word sequences that act as a single semantic unit: • new_york , labor_day , wind_speed • Letter sequences or initials: • j_f_k , t_w_a , washington_d_c
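A small sketch, assuming a hand-picked list of multi-word units, of how word sequences could be rewritten into single tokens before n-gram training.

```python
# Illustrative unit list; a real system would derive this from its domain vocabulary.
MULTIWORD_UNITS = {("thank", "you"): "thank_you",
                   ("new", "york"): "new_york",
                   ("labor", "day"): "labor_day"}

def join_multiword_units(words):
    """Greedily replace known two-word sequences with their compound token."""
    out, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in MULTIWORD_UNITS:
            out.append(MULTIWORD_UNITS[pair])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

print(join_multiword_units("thank you i live in new york".split()))
# -> ['thank_you', 'i', 'live', 'in', 'new_york']
```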

  18. Finite-State Transducer (FST) Motivation • Most speech recognition constraints and results can be represented as finite-state automata: • Language models (e.g., n-grams and word networks) • Lexicons • Phonological rules • N-best lists • Word graphs • Recognition paths • Common representation and algorithms desirable • Consistency • Powerful algorithms can be employed throughout system • Flexibility to combine or factor in unforeseen ways

  19. What is an FST? • An FST defines a weighted relationship between regular languages • A generalization of the classic finite-state acceptor (FSA) • One initial state • One or more final states • Transitions between states are labeled input : output / weight • input requires an input symbol to match • output indicates the symbol to output when the transition is taken • epsilon (ε) consumes no input or produces no output • weight is the cost (e.g., −log probability) of taking the transition
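The following sketch shows one way such an FST could be represented and traversed in code; the toy lexicon arcs and weights are illustrative assumptions, not the lecture's toolkit.

```python
# A tiny weighted FST: arcs are (input, output, weight, next_state); EPS consumes
# no input / emits no output. State 0 is the initial state.
EPS = "<eps>"

arcs = {
    0: [("t", EPS, 0.0, 1)],
    1: [("uw", "two", 0.5, 3), ("uw", "to", 1.2, 3), ("iy", "tea", 0.7, 3)],
    3: [],
}
final_states = {3}

def transduce(inputs):
    """Return (output sequence, total weight) of the cheapest path accepting inputs."""
    best = None
    def walk(state, i, out, w):
        nonlocal best
        if i == len(inputs) and state in final_states:
            if best is None or w < best[1]:
                best = (out, w)
        for sym_in, sym_out, weight, nxt in arcs.get(state, []):
            emitted = [] if sym_out == EPS else [sym_out]
            if sym_in == EPS:
                walk(nxt, i, out + emitted, w + weight)
            elif i < len(inputs) and sym_in == inputs[i]:
                walk(nxt, i + 1, out + emitted, w + weight)
    walk(0, 0, [], 0.0)
    return best

print(transduce(["t", "uw"]))   # (['two'], 0.5) -- the lower-weight path wins
```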

  20. FST Example: Lexicon • The lexicon maps /phonemes/ to ‘words’ • Words can share parts of their pronunciations • Sharing at the beginning is beneficial to recognition speed because pruning can prune many words at once

  21. FST Composition • Composition (∘) combines two FSTs to produce a single FST that performs both mappings in a single step • (words → /phonemes/) ∘ (/phonemes/ → [phones]) = (words → [phones])
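Here is a sketch of composition for epsilon-free FSTs (real lexicon FSTs use epsilon arcs, which need extra bookkeeping omitted here); it composes a toy phonological FST with a toy mapping to acoustic-model labels, both of which are illustrative assumptions.

```python
# Each FST: dict state -> list of arcs (input, output, weight, next_state),
# with state 0 initial; finals is a set of states.

def compose(a, a_finals, b, b_finals):
    """Compose FST a (X -> Y) with FST b (Y -> Z) into a single FST (X -> Z)."""
    start = (0, 0)
    arcs, finals = {}, set()
    stack, seen = [start], {start}
    while stack:
        state = stack.pop()
        sa, sb = state
        out_arcs = []
        for in_a, out_a, w_a, next_a in a.get(sa, []):
            for in_b, out_b, w_b, next_b in b.get(sb, []):
                if out_a == in_b:                    # a's output must match b's input
                    nxt = (next_a, next_b)
                    out_arcs.append((in_a, out_b, w_a + w_b, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
        arcs[state] = out_arcs
        if sa in a_finals and sb in b_finals:
            finals.add(state)
    return arcs, finals

# P: phonemes -> phones, with the optional /t/ flapping alternative
P = {0: [("b", "bcl_b", 0.0, 1)],
     1: [("ah", "ah", 0.0, 2)],
     2: [("t", "tcl_t", 0.7, 3), ("t", "dx", 0.3, 3)],
     3: [("er", "er", 0.0, 4)]}
P_finals = {4}

# C: phones -> acoustic model labels (label names are made up for illustration)
C = {0: [("bcl_b", "b_model", 0.0, 0), ("ah", "ah_model", 0.0, 0),
         ("tcl_t", "t_model", 0.0, 0), ("dx", "dx_model", 0.0, 0),
         ("er", "er_model", 0.0, 0)]}
C_finals = {0}

PC, PC_finals = compose(P, P_finals, C, C_finals)
print(PC[(0, 0)])   # [('b', 'b_model', 0.0, (1, 0))]
print(PC[(2, 0)])   # both realizations of /t/ survive, with their weights
```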

  22. FST Optimization Example • Example: a letter-to-word lexicon

  23. FST Optimization Example: Determinization • Determinization turns the lexicon into a tree • Words share a common prefix
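The prefix sharing that determinization produces can be illustrated by building a letter-to-word prefix tree; the word list below is an assumption, and this is an analogy rather than a full FST determinization algorithm.

```python
# Building a letter-to-word prefix tree: words that share a prefix share the same
# initial arcs, which is the effect determinization has on the lexicon FST.
words = ["jim", "jill", "bill"]

next_state = 0
arcs = {}      # (state, letter) -> next state
word_at = {}   # final state -> word

for word in words:
    state = 0
    for letter in word:
        if (state, letter) not in arcs:
            next_state += 1
            arcs[(state, letter)] = next_state
        state = arcs[(state, letter)]
    word_at[state] = word

print(len(set(arcs.values())) + 1, "states")   # fewer states than total letters: 'ji' is shared
print(word_at)
```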

  24. FST Optimization Example: Minimization • Minimization enables sharing at the ends of words

  25. A Cascaded FST Recognizer • The recognizer is built from a cascade of FSTs: • G : Language Model → multi-word units • M : Multi-word Mapping → canonical words • R : Reductions Model → spoken words • L : Lexical Model → phonemic units • P : Phonological Model → phonetic units • C : CD Model Mapping → acoustic model labels • Together the cascade encodes the language model, the pronunciation model, and the mapping to acoustic model labels

  26. A Cascaded FST Recognizer (example) • G : Language Model → multi-word units: give me new_york_city • M : Multi-word Mapping → canonical words: give me new york city • R : Reductions Model → spoken words: gimme new york city • L : Lexical Model → phonemic units: g ih m iy n uw y ao r kd s ih tf iy • P : Phonological Model → phonetic units: gcl g ih m iy n uw y ao r kcl s ih dx iy • C : CD Model Mapping → acoustic model labels

  27. Search • Once again, the probabilistic expression is: W* = argmax over W (and U) of P(A | U) P(U | W) P(W) • Pronunciation and language models are encoded in the lexical FST • Search must efficiently find the most likely U and W

  28. Viterbi Search • Viterbi search: a time-synchronous breadth-first search • (Figure: a search trellis with lexical nodes h#, m, a, r, z on the vertical axis and time frames t0–t8 on the horizontal axis)
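A minimal sketch of time-synchronous Viterbi search over a chain of lexical nodes like the trellis in the slide; the node set (h#, m, a, r, z) matches the figure, while the per-frame acoustic scores are made-up numbers rather than real model outputs.

```python
import math

nodes = ["h#", "m", "a", "r", "z", "h#"]

def successors(i):
    """Self-loop plus advance-to-next-node transitions."""
    return [i] if i == len(nodes) - 1 else [i, i + 1]

frame_scores = [   # frame_scores[t][i] = log score of node i at frame t (made up)
    [-1, -9, -9, -9, -9, -9],
    [-8, -1, -9, -9, -9, -9],
    [-8, -2, -1, -9, -9, -9],
    [-9, -8, -1, -2, -9, -9],
    [-9, -9, -8, -1, -9, -9],
    [-9, -9, -9, -2, -1, -8],
    [-9, -9, -9, -9, -1, -2],
    [-9, -9, -9, -9, -8, -1],
]

best = [-math.inf] * len(nodes)
best[0] = frame_scores[0][0]           # paths must start in the initial silence node
backptrs = []

for t in range(1, len(frame_scores)):  # breadth-first: all active nodes updated per frame
    new_best = [-math.inf] * len(nodes)
    ptr = [None] * len(nodes)
    for i, score in enumerate(best):
        if score == -math.inf:
            continue
        for j in successors(i):
            candidate = score + frame_scores[t][j]
            if candidate > new_best[j]:
                new_best[j], ptr[j] = candidate, i
    best = new_best
    backptrs.append(ptr)

# Backtrace from the final silence node to recover the best node sequence
path = [len(nodes) - 1]
for ptr in reversed(backptrs):
    path.append(ptr[path[-1]])
path.reverse()
print([nodes[i] for i in path])   # ['h#', 'm', 'a', 'a', 'r', 'z', 'z', 'h#']
```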

  29. Viterbi Search Pruning • Search efficiency can be improved with pruning • Score-based: don’t extend low-scoring hypotheses • Count-based: extend only a fixed number of hypotheses • (Figure: the same trellis with pruned nodes marked ×)
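A small sketch of the two pruning strategies applied to one frame's set of active hypotheses; the hypothesis scores are illustrative log values.

```python
hypotheses = {"h#": -42.0, "m": -18.5, "a": -17.9, "r": -25.0, "z": -31.0}

def score_prune(hyps, beam_width):
    """Score-based pruning: keep hypotheses within beam_width of the best score."""
    best = max(hyps.values())
    return {h: s for h, s in hyps.items() if s >= best - beam_width}

def count_prune(hyps, beam_size):
    """Count-based pruning: keep only the beam_size highest-scoring hypotheses."""
    ranked = sorted(hyps.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:beam_size])

print(score_prune(hypotheses, beam_width=10.0))   # drops h# and z
print(count_prune(hypotheses, beam_size=3))       # keeps a, m, r
```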

  30. Search Pruning Example • Count-based pruning can effectively reduce search • Example: fix the beam size (count) and vary the beam width (score)

  31. N-best Computation with Backwards A* Search • A backwards A* search can be used to find the N-best paths • The Viterbi backtrace is used as the future estimate for path scores • (Figure: the search trellis from the Viterbi example)

  32. Street Address Recognition • Street address recognition is difficult: 6.2M unique street, city, state pairs in the US (283K unique words), a high confusion rate among similar street names, and a very large search space for recognition • Commercial solution → directed dialogue: breaks the problem into a set of smaller recognition tasks; simple for first-time users, but tedious with repeated use • Example directed dialogue: • C: Main menu. Please say one of the following: • C: “directions”, “restaurants”, “gas stations”, or “more options”. • H: Directions. • C: Okay. Directions. What state are you going to? • H: Massachusetts. • C: Okay. Massachusetts. What city are you going to? • H: Cambridge. • C: Okay. Cambridge. What is the street address? • H: 32 Vassar Street. • C: Okay. 32 Vassar Street in Cambridge, Massachusetts. • C: From your current location, continue straight on…

  33. Street Address Recognition • Research goal → mixed-initiative dialogue: more difficult to predict what users will say, but far more natural for repeat or expert users • Recognition approach: dynamically adapt the recognition vocabulary • 3 recognition passes over one utterance (see the sketch below): • 1st pass: detect the state and activate relevant cities • 2nd pass: detect cities and activate relevant streets • 3rd pass: recognize the full street address • Example mixed-initiative dialogue: • C: How can I help you? • H: I need directions to 32 Vassar Street in Cambridge, Mass.
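A schematic sketch of the three-pass approach; recognize(), cities_in(), and streets_in() are hypothetical placeholders standing in for the recognizer and a geographic database, not real APIs.

```python
def recognize(utterance, vocabulary):
    """Hypothetical placeholder: run recognition restricted to `vocabulary`."""
    raise NotImplementedError

def cities_in(state):
    """Hypothetical placeholder: cities of `state` from a geographic database."""
    raise NotImplementedError

def streets_in(city):
    """Hypothetical placeholder: streets of `city` from a geographic database."""
    raise NotImplementedError

def recognize_street_address(utterance, all_states):
    # 1st pass: detect the state and activate the relevant cities
    state = recognize(utterance, vocabulary=all_states)
    active_cities = cities_in(state)

    # 2nd pass: detect the city and activate the relevant streets
    city = recognize(utterance, vocabulary=active_cities)
    active_streets = streets_in(city)

    # 3rd pass: recognize the full street address over the narrowed vocabulary
    return recognize(utterance, vocabulary=active_streets | {city, state})
```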

  34. Dynamic Vocabulary Recognition
