

1. CS 552/652 Speech Recognition with Hidden Markov Models Winter 2011 Oregon Health & Science University Center for Spoken Language Understanding John-Paul Hosom Lecture 17, March 7: Search Strategies for Improved Performance, Part II

2. Next Topics: Improving Performance of an HMM
• Search Strategies for Improved Performance
  • Null States
  • Beam Search
  • Grammar Search
  • Tree Search
  • Token Passing
  • “On-Line” Processing
  • Balancing Insertion/Deletion Errors
  • Detecting Out of Vocabulary Words
  • Stack Decoder (A*) Search
  • Word Lattice or Word Graph
  • Grammar Search, Part II
  • Weighted Finite State Transducer Overview
• Acoustic-Model Strategies for Improved Performance
  • Semi-Continuous HMMs
  • Clustering
  • Cloning
  • Pause Models

3. Search Strategies: Balancing Insertion/Deletion Errors
• As noted on Lecture 9 slide 11, phoneme durations tend to have a Gamma distribution, which even a three-state HMM does not model well.
• As a result of this discrepancy between speech and HMMs, HMMs generate more insertion errors than deletion errors (because an HMM is a priori more likely to generate several short phonemes than one long phoneme).
• One issue in HMM implementation is how to improve performance by balancing insertion and deletion errors.
[Figure: probability of being in a phoneme vs. time (frames), comparing the observed phoneme duration distribution with a 3-state HMM]
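To see why the mismatch arises, here is a minimal sketch comparing the duration distribution implied by a single self-looping HMM state (which is geometric) against a Gamma distribution. The parameter values (self-loop of 0.8, Gamma shape 4 and scale 2) are illustrative assumptions, not numbers from the lecture.

```python
import math

def geometric_duration(n, self_loop=0.8):
    """P(staying exactly n frames in one HMM state) = a^(n-1) * (1-a)."""
    return (self_loop ** (n - 1)) * (1.0 - self_loop)

def gamma_duration(n, shape=4.0, scale=2.0):
    """Gamma pdf evaluated at n frames (hypothetical shape/scale)."""
    return (n ** (shape - 1) * math.exp(-n / scale)) / (math.gamma(shape) * scale ** shape)

for n in range(1, 16):
    print(f"{n:2d} frames: HMM state {geometric_duration(n):.3f}   gamma {gamma_duration(n):.3f}")
```

The geometric distribution is largest at a duration of one frame and decays monotonically, while the Gamma distribution peaks at an intermediate duration; this is why the HMM is biased toward several short phonemes rather than one long one.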

4. Search Strategies: Balancing Insertion/Deletion Errors
• Plot insertion errors and deletion errors using an ROC curve:
[Figure: ROC curve of insertion rate vs. deletion rate, each ranging from 0% to 100%]
• The lowest total error (total_error = insertion_error + deletion_error) usually occurs when insertion_error = deletion_error. This is called the equal-error rate (EER).
• Objective: using a priori information, make the average insertion error rate equal the average deletion error rate.

5. Search Strategies: Balancing Insertion/Deletion Errors
• Insertion Penalty:
• Add a constant penalty to δ whenever a word is exited
• A large penalty yields more deletion errors; a small penalty yields more insertion errors
• Using a corpus of data not used in training or testing (a “development” or “cross-validation” set) that is similar to the data that will be seen in testing, find a penalty value that yields the EER, e.g. with a sweep like the sketch below.
• Can’t use testing data, because we’re computing a parameter of the overall HMM model
• Can’t use training data, because the other model parameters were trained on this data set, and so this set yields the lowest possible error, not realistic error.
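A minimal sketch of the tuning loop just described. The `recognize` and `count_errors` callables are hypothetical stand-ins for a real decoder and scoring tool, supplied by the caller; they are not lecture code.

```python
def tune_insertion_penalty(dev_set, penalties, recognize, count_errors):
    """Sweep word-exit penalties on a development set and return the one
    whose insertion and deletion error counts are closest (the EER point)."""
    best_penalty, best_gap = None, float("inf")
    for penalty in penalties:
        insertions, deletions = 0, 0
        for utterance, reference in dev_set:
            hypothesis = recognize(utterance, word_exit_penalty=penalty)
            ins, dels = count_errors(hypothesis, reference)
            insertions += ins
            deletions += dels
        gap = abs(insertions - deletions)   # zero gap = equal-error rate
        if gap < best_gap:
            best_penalty, best_gap = penalty, gap
    return best_penalty
```

In log-domain decoding, a larger (more negative) penalty added at each word exit suppresses insertions and produces more deletions, so the sweep trades one error type against the other.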

6. Search Strategies: Detecting Out of Vocabulary Words
• Garbage Modeling:
• Add a new state to the HMM, “G”.
• At each time frame, rank-order the observation probabilities
• Set bG(ot) = the Nth observation probability (N=1 is the most likely observation)
• Add a new “word” to the vocabulary, called “garbage”: add “garbage” to the list of words; if there is a grammar, “garbage” may occur in between any other words.
• If N is optimally chosen, the “garbage” word will be recognized most often when there is an out-of-vocabulary word or noise in the signal, and will be recognized least often otherwise.
• The value for N depends on the match between the data and the HMM.
[Figure: garbage state G, whose observation probability is the Nth-best ordinary observation probability]
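A minimal sketch of the garbage observation probability just described (the function name is an assumption for illustration):

```python
def garbage_prob(obs_probs, N):
    """b_G(o_t): the Nth most likely ordinary observation probability
    at this frame (N=1 selects the most likely)."""
    return sorted(obs_probs, reverse=True)[N - 1]

# Example: with b_j(o_t) = [0.40, 0.05, 0.25, 0.10, 0.20] and N = 3,
# garbage_prob(...) returns 0.20.
```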

7. Search Strategies: Detecting Out of Vocabulary Words
• Phone Modeling:
• Add a new word to the HMM, “OOV”. This word consists of one or more phonemes in any sequence:
[Figure: “OOV” modeled as a loop over all N phonemes (ae, b, …, z), entered through an OOV penalty]
• This word will, by definition, score as good as or better than any other word in the vocabulary.
• Apply a “language model” weight or an “OOV penalty” to discourage transitioning into this word, allowing in-vocabulary words to be recognized. Find the optimal value for this weight or penalty using development data.
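A minimal sketch of the phone-loop “OOV” word (the data-structure shape and names are assumptions for illustration, not lecture code): any of the N phonemes may follow any other, and a log-domain penalty is paid on entering the word. The syllable-loop variant on the next slide is identical, with syllable HMMs in place of phoneme HMMs.

```python
import math

def build_oov_word(phoneme_hmms, oov_log_penalty=-20.0):
    """Phone-loop OOV word: uniform transitions among all phonemes, with a
    log-domain entry penalty that discourages hypothesizing OOV too often."""
    loop_log_prob = math.log(1.0 / len(phoneme_hmms))
    return {
        "name": "OOV",
        "units": phoneme_hmms,                 # e.g. ["aa", "ae", ..., "z"]
        "entry_log_penalty": oov_log_penalty,  # tuned on development data
        "loop_log_prob": loop_log_prob,        # any phoneme -> any phoneme
    }
```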

8. Search Strategies: Detecting Out of Vocabulary Words
• Syllable Modeling:
• Add a new word to the HMM, “OOV”. This word consists of one or more syllables in any sequence:
[Figure: “OOV” modeled as a loop over N syllables, entered through an OOV penalty]
• An OOV penalty is still required, but this OOV word will not recognize an impossible phoneme sequence (e.g. “z p iy k”) instead of the correct (but, due to chance or noise, lower-scoring) sequence (e.g. “s p iy k”).

9. Search Strategies: Stack Decoder (A*) Search
• One method of finding the N-best sentence-level outputs, instead of “the best” sentence output.
• Useful in large-vocabulary ASR systems where recognition is performed in (at least) two passes (real-time recognition not possible):
• First pass: perform fast recognition over the entire vocabulary using a “simple” HMM, and generate multiple recognition hypotheses
• Second pass: perform slow but more accurate recognition only on the output from the first pass, using a more sophisticated HMM
• A* search is a method from symbolic reasoning (AI)
• It uses heuristics to find the N best sentences; if the heuristics are good, then the best result will be the same as the Viterbi result.

10. Search Strategies: Stack Decoder (A*) Search
• Description and Example (from Jurafsky & Martin, 2000)
• Recognize the phrase “If music be the food of love”
(1) Start with NULL as the root of the sentence tree, and set n to 0
(2) Determine every possible word starting at time t=1, adding to the stack the words (now partial sentences), their end times, and scores (e.g. log probabilities, so closer to zero is better), with a link to the NULL partial sentence.
[Stack after step 2: NULL (end=0, score=INF) linked to If (end=100, score=33), Alice (end=280, score=25), Every (end=310, score=40), In (end=90, score=250)]

11. Search Strategies: Stack Decoder (A*) Search
(3) Pop the partial sentence with the highest score, P, off the stack (keeping the word, time, score, and link information for future use)
(4) If P is a complete sentence (end time of the last word in the partial sentence = T), then (a) output the sentence by following links to all previous words in the sentence, (b) increment n, and (c) if n == N, stop; otherwise, go to (3)
(5) Determine every possible word starting at time t = (end time of the last word in P), adding to the stack the new words, their end times, and scores, with a link to the last word in P.
(6) Go to step (3)
A sketch of this loop is given below.
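A minimal sketch of steps (1)–(6). The `extend_hypothesis` helper, which proposes scored words starting where the partial sentence ends, is an assumed stand-in for the acoustic and language-model scoring; the sketch also uses the simple cumulative score rather than the (b + w) + e heuristic introduced two slides below.

```python
import heapq

def stack_decode(T, N, extend_hypothesis):
    """Return up to the N best complete sentences for an utterance of T frames.
    Scores are log probabilities (closer to zero is better); Python's heapq
    is a min-heap, so scores are negated to pop the best hypothesis first."""
    stack = [(0.0, 0, [])]               # NULL hypothesis: score 0, end time 0
    results = []
    while stack and len(results) < N:
        neg_score, end, words = heapq.heappop(stack)       # step (3)
        if end == T:                                       # step (4): complete
            results.append((words, -neg_score))
            continue
        for word, new_end, word_score in extend_hypothesis(end):  # step (5)
            heapq.heappush(stack, (neg_score - word_score, new_end, words + [word]))
    return results                                         # loop back: step (6)
```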

12. Search Strategies: Stack Decoder (A*) Search
• and the next step:
[Stack tree after popping “If” and extending it with music (end=290, score=-34), muscle (end=350, score=-36), and messy (end=360, score=-40); the other hypotheses remain: Alice (end=280, score=-25), If (end=100, score=-33), was (end=310, score=-35), Every (end=310, score=-40), wants (end=360, score=-41), In (end=90, score=-250), walls (end=370, score=-321); NULL (end=0, score=INF) at the root]

13. Search Strategies: Stack Decoder (A*) Search
• A* search is a “time-asynchronous” search
• The score is an evaluation of how good a partial sentence is
• Possible formula for the score:
score = p(o_1 o_2 … o_t | w_1 w_2 … w_n) · p(w_1 w_2 … w_n)
where t is the end time of the partial sentence and n is the number of words currently in the partial sentence.
• However, this results in lower scores for longer utterances. Also, the score should reflect how good we think the final sentence will be when we get to the end.
• Better formula for the score:
score = (b + w) + e
b = value of (b + w) from the previous word in the partial sentence
w = best score for the current word (given a fixed begin time)
e = score for the best path to the end of the sentence

14. Search Strategies: Stack Decoder (A*) Search
• b is easy enough to compute… what about w and e?
• One method is to do the A* search backward in time after first doing a Viterbi search forward in time (this is called forward-backward search)
• Then, the “end” of the sentence is the beginning of the utterance, so e (the score for the best path to the end of the sentence) is the δ value at the end of the previous best word, and w is the difference between the δ value at the end of the current word and the δ value at the last frame of the previous best word when going to the current word. (See next slide)
• Now we need to keep all δ values in memory, not just those for time t–1 as when doing (forward) Viterbi search.
• A* is slow compared to standard Viterbi search: it must perform two passes, maintain a sorted list, and try all word transitions. However, it provides valuable information.

15. Search Strategies: Stack Decoder (A*) Search
As in Lecture 13, we can define the score, Ŝ, for word w(n) from frame s to frame e, where (a) w(n) is modeled by HMM λw(n), (b) aij = πj when t is the first frame of data, and (c) the state qs-1 is the best state transitioning into w(n)’s first state at time s:

$$\hat{S}(s,e \mid w(n)) = \max_{q_s \dots q_e} \prod_{t=s}^{e} a_{q_{t-1}q_t}\, b_{q_t}(o_t)$$

If we’re in the log domain to avoid underflow errors, this equals

$$\log \hat{S}(s,e \mid w(n)) = \max_{q_s \dots q_e} \sum_{t=s}^{e} \left[\log a_{q_{t-1}q_t} + \log b_{q_t}(o_t)\right]$$

which is equal to

$$\log \hat{S}(s,e \mid w(n)) = \max_{q_e} \delta_{q_e}(e) - \delta_{q_{s-1}}(s-1)$$

where qe are the possible ending states of w(n) and δqs-1(s-1) is the delta value of the state leading into the best first state of w(n) at time s (determined from the backtrace ψ).
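A minimal sketch of this word-score computation, assuming the full matrix of log δ values has been kept from the forward Viterbi pass (the array layout and names are assumptions, not lecture code):

```python
def word_log_score(delta, final_states, s, e, prev_state):
    """log S-hat for a word spanning frames s..e (frames are 1-indexed;
    delta[t-1][q] holds log delta for state q at frame t): the best delta
    over the word's final states at frame e, minus the delta of the state
    feeding the word's first state at frame s-1 (0.0 if s is frame 1)."""
    end_score = max(delta[e - 1][q] for q in final_states)
    start_score = 0.0 if s == 1 else delta[s - 2][prev_state]
    return end_score - start_score
```

This matches the arithmetic on the next few slides, where each word contributes a difference of two δ values and a word starting at frame 1 contributes a 0.0 start term.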

16. Search Strategies: Stack Decoder (A*) Search
• Example with “yes” and “no” and 9 frames of output:
[Trellis figure: log(δ) scores (closer to zero is better) for the states of “yes” (y, e, s) and “no” (n, o) at frames t=1 through T=9; arrows show the ψ back-pointers chosen by Viterbi and the additional transitions considered by A*, including those from “o”]
When doing A*, at each word beginning, look not only at ψ (the best transition into this word), but at all possible transitions into this word!

17. Search Strategies: Stack Decoder (A*) Search
[Trellis figure repeated: log(δ) scores for states y, e, s, n, o at frames t=1 through T=9, with arrows from “o” marked]
Stack contents (word begin:end, with the score terms summed):
yes 3:5: -2.28 - 4.15 - 3.33 = -9.76
yes 6:9: 0 - 2.28 - 5.79 = -8.07
yes 1:3: -4.15 + (-6.9) + 0.0 = -11.05
no 4:5: -2.28 - 1.87 - 3.92 = -8.07
NULL
no 8:9: 0 - 3.77 - 9.67 = -13.44
no 1:3: -4.15 + (-3.92) + 0.0 = -8.07
1st best output: no (1-3), no (4-5), yes (6-9), score = -8.07

18. Search Strategies: Stack Decoder (A*) Search
[Trellis figure repeated]
yes 1:2: -6.43 + (-INF) + 0.0 = -INF
yes 3:5: -2.28 - 4.15 - 3.33 = -9.76
yes 6:9: 0 - 2.28 - 5.79 = -8.07
no 1:2: -6.43 + (-3.33) + 0.0 = -9.76
NULL
yes 1:3: -4.15 + (-6.9) + 0.0 = -11.05
no 8:9: 0 - 3.77 - 9.67 = -13.44
no 4:5: -2.28 - 1.87 - 3.92 = -8.07
2nd best output: no (1-2), yes (3-5), yes (6-9), score = -9.76

19. Search Strategies: Stack Decoder (A*) Search
[Trellis figure repeated]
yes 1:2: -6.43 + (-INF) + 0.0 = -INF
yes 3:5: -2.28 - 4.15 - 3.33 = -9.76
yes 6:9: 0 - 2.28 - 5.79 = -8.07
NULL
yes 1:3: -4.15 + (-6.9) + 0.0 = -11.05
no 8:9: 0 - 3.77 - 9.67 = -13.44
no 4:5: -2.28 - 1.87 - 3.92 = -8.07
3rd best output: yes (1-3), no (4-5), yes (6-9), score = -11.05

20. Search Strategies: Stack Decoder (A*) Search
[Trellis figure repeated]
yes 1:2: -6.43 + (-INF) + 0.0 = -INF
yes 3:5: -2.28 - 4.15 - 3.33 = -9.76
yes 6:9: 0 - 2.28 - 5.79 = -8.07
NULL
no 8:9: 0 - 3.77 - 9.67 = -13.44
etc.

21. Search Strategies: Word Lattice or Word Graph The computation of an N-best list is time-consuming, and it’s often not necessary to rank-order the best word sequences, only to have some idea of what the likely words are. Then, further processing can narrow down this list, using more information, to a single word sequence. To address this problem, we can construct a word lattice or a word graph: (image from http://www.visnet-noe.org/pdf/M_Saraclar.pdf)

22. Search Strategies: Word Lattice or Word Graph
• The procedure (as discussed in Ney and Aubert, 1994 or Aubert and Ney, 1995) is as follows:
• At each time t during the Viterbi search, consider all active (i.e. within the beam) word pairs (or single words) that end at time t. Retain only those word pairs or words that have sufficient probability.
• For each word pair or word that survives the pruning, keep track of the time at which the word begins (the second word, if a word pair), and the score for the (final) word. The score for a word is the δ value at time t minus the δ value at the word begin time.
• When done searching, construct the word graph by outputting all words and their scores.
• This procedure is extremely simple, has low computational complexity, and is effective.
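A minimal sketch of the per-frame bookkeeping this procedure adds to the Viterbi beam search (the data layout and names are assumptions for illustration):

```python
def record_word_arcs(t, ending_words, graph, min_score):
    """ending_words: (word, begin_time, end_score, begin_score) tuples for
    words whose final state is inside the beam at frame t; end_score and
    begin_score are the log-domain delta values at the word's end and at
    the word's begin time. Each surviving word becomes a lattice arc."""
    for word, begin, end_score, begin_score in ending_words:
        score = end_score - begin_score          # the word's own log score
        if score >= min_score:                   # prune low-probability words
            graph.append((word, begin, t, score))  # arc: word, begin, end, score
```

Calling this at every frame t during the search, and keeping the arcs that survive pruning, yields the word graph directly when the search finishes.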

23. Grammar-Based Search
• Aside from the lexical tree with one-pass Viterbi search, the other common framework is a grammar-based Viterbi search, used when the task belongs to a restricted domain (e.g. finding lost luggage at an airline) or in the second pass of a large-vocabulary system.
• Three standards have been developed for grammar-based search:
1. XML (developed by the WWW Consortium, or W3C, with contributions by Bell Labs, Nuance, Comverse, ScanSoft, IBM, Hewlett-Packard, Cisco, Microsoft)
2. ABNF (Augmented BNF, developed by W3C)
3. SALT (developed by Microsoft, supported by Cisco Systems, Comverse, Intel, Philips, ScanSoft)
• The ABNF form will be summarized here, because it’s the simplest to develop from the programmer’s point of view and is mappable to/from XML.
• Based on the documentation “Speech Recognition Grammar Specification Version 1.0”, http://www.w3.org/TR/speech-grammar

24. Grammar-Based Search
• ABNF is composed mostly of context-free grammar rules.
• A rule definition begins with a rule name, is delimited by “=” between the rule name and the rule expansion, and ends with a “;”
• A rule name within a rule expansion may be local (prefixed with ‘$’) or external (specified via a uniform resource identifier, or URI).
• Rule expansion symbols:
• Dollar sign ('$') and angle brackets ('<' and '>') when needed mark rule references (references to rule names)
• Parentheses ('(' and ')') may enclose any component of a rule expansion
• Vertical bar ('|') delimits alternatives
• Forward slashes ('/' and '/') delimit any weights on alternatives
• Angle brackets ('<' and '>') delimit any repeat operator
• Square brackets ('[' and ']') delimit any optional expansion
• Curly brackets ('{' and '}') delimit any tag
• Exclamation point ('!') prefixes any language identifier

25. Grammar-Based Search
• Rule Examples:
$date = ([$weekday] $month $day) | $weekday | the $day ;
$day = first | second | third | … | thirty_first ;
$time = $hour $minute [$ampm] ;
$favoriteFoods = /5.5/ ice cream | /3.2/ hot dogs | /0.2/ lima beans ;
$creditcard = ($digit) <16> ;
$UStelephone1 = ($digit)<7-10> ;
$UStelephone2 = ($digit <7>) $digit <0-3 /0.4/> ;
$international = ($digit)<9-> ;
$localRule = $<GlobalGrammarURI.gram#rule2> | $<GlobalGrammarURI.gram#rule7> ;
$rule = $word {this is a tag; it does not affect word recognition} ;
$yes = yes | oui!fr | hai!ja ;
Notes: in $favoriteFoods, /5.5/ etc. are weights (not necessarily probabilities); in $UStelephone2, /0.4/ gives a 40% probability of recurrence; tags may be used to affect subsequent semantic processing after words are identified; a language specification (e.g. !fr, !ja) determines the expected pronunciation of a word.

26. Grammar-Based Search
• A set of grammar rules is contained in a grammar document.
• Example document declarations:
#ABNF 1.0 ISO-8859-1; (header, identified by #ABNF)
language en-US; (language keyword followed by a valid language identifier)
mode voice; (mode is either ‘voice’ or ‘dtmf’)
root $rootRule; (name of the top-level rule)
base <http://www.cslu.ogi.edu/recog/base_path>; (relative URIs are relative to this base)
lexicon <C:\users\hosom\recog\generalpurpose.lexicon>;
lexicon <http://www.cslu.ogi.edu/recog/otherwords.lexicon>; (location(s) of one or more lexicons, i.e. pronunciation dictionaries)
// this is a comment;
/** this is another comment; grammar rules go after this line **/

27. Grammar-Based Search
• A rule in one document can reference one or more rules in other documents via URI specification. So, an entire grammar may be composed of numerous files (documents), with one file for a specific task (e.g. time or date).
• Tokens (terminal symbols used in rules) may take a number of forms:
form → example
single unquoted token → yes
single non-alphabetic unquoted token → 3
single quoted token with white space → “New Years Day”
single quoted token, no white space → “yes”
three tokens separated by white space → december thirty first

28. Weighted Finite State Transducers (WFST)
• Another approach to the search issue is to transform the model from a Hidden Markov Model, where observations are generated at each state, to a Weighted Finite State Transducer (WFST) model, where observations are generated at state transitions.
• This simple change does not affect theoretical capability: anything that can be done with a WFST can be done with HMMs. However, the WFST framework enables an elegant way to combine word probabilities, alternate pronunciations, and context-dependent phonetic states into a single network.
• In a transducer, an input symbol is mapped to an output symbol. In a WFST, these mappings can be assigned weights that affect how the input is mapped to the output (e.g. transition probabilities).
(notes and figures from Mohri, Pereira, Riley, “Speech Recognition with Weighted Finite State Transducers”, 2008)

29. Weighted Finite State Transducers (WFST)
• Example: [WFST example figure from Mohri et al. not reproduced in the transcript]
• Several WFSTs can be “composed” to combine different levels of information. So, a grammar can be composed with a lexicon, which is composed with phoneme models, to produce a single state network.
• In “determinization” and “minimization”, the network is restructured so that each state has at most one outgoing transition per input label and null states are removed; then the network is compacted.
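A minimal sketch of composition in the tropical semiring, where weights are –log probabilities and add along a path. The epsilon-free arc format (src, input, output, weight, dst) is an assumption for illustration, not the API of any WFST library.

```python
from collections import defaultdict

def compose(arcs_a, arcs_b, start_a, start_b):
    """Compose transducer A with transducer B: follow an A-arc and a B-arc
    together whenever A's output label equals B's input label; weights add.
    States of the result are (A-state, B-state) pairs."""
    b_from = defaultdict(list)
    for src, inp, out, w, dst in arcs_b:
        b_from[(src, inp)].append((out, w, dst))
    result, seen = [], {(start_a, start_b)}
    stack = [(start_a, start_b)]
    while stack:
        sa, sb = stack.pop()
        for src, inp, out, w, dst in arcs_a:
            if src != sa:
                continue
            for out2, w2, dst2 in b_from[(sb, out)]:   # match A-out to B-in
                result.append(((sa, sb), inp, out2, w + w2, (dst, dst2)))
                if (dst, dst2) not in seen:
                    seen.add((dst, dst2))
                    stack.append((dst, dst2))
    return result
```

Composing a lexicon (phonemes → words) with a grammar (words → words, carrying LM weights) in this way yields a single network from phonemes to weighted word sequences, which determinization and minimization then compact.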

30. Weighted Finite State Transducers (WFST)
[Figures: an example grammar WFST and an example lexicon WFST]

31. Weighted Finite State Transducers (WFST)
[Figures: the network after composition, after determinization, and after minimization]

32. Weighted Finite State Transducers (WFST)
• Phonemes can be represented using context-dependent triphones and three states per phoneme, using the same approach of composition, determinization, and minimization.
• The WFST can be applied at an even lower level, e.g. specifying the weights of particular components of the Gaussian Mixture Model.
• Durations can be modeled with a Semi-Markov Model (no self-loops), or by adding self-loops to the states generated by the WFST.
• Recognition can be done very quickly and efficiently by on-the-fly composition of a triphone-level WFST and acoustic models. This approach allows one to construct only the states that are needed, when they are needed, instead of a full language model composed with a full lexicon composed with a full mapping from phonemes to triphones. (See “Lecture 7: Finite State Transducers, Language Models, and Speech Recognition Search” by Mark Hasegawa-Johnson, 2005.)

33. Weighted Finite State Transducers (WFST)
• Sample results on the North American Business News (NAB) task (40,000-word vocabulary) using one-pass recognition:
• Acoustic model with 7,208 HMM states and up to 12 components per GMM
• Context-independent phoneme to context-dependent triphone transducer with 1,525 states and 80,225 transitions
• Lexicon with 40,000 words and 1.056 pronunciations per word
• Trigram LM with 3,926,010 transitions (representing all unigrams, 22% of bigrams, and 19% of trigrams)
• With accuracy fixed at 83%, speed was as follows: [speed table from the original slide not reproduced in the transcript]

34. Weighted Finite State Transducers (WFST)
Comparing “standard” HMMs with WFSTs:
[Figure: a word network over the words bill, jill, jim, read, wrote, fled and their phonemes (b, ih, l, jh, iy, m, r, eh, d, ow, t, f, NULL), with transition weights such as 0.405, 0.805, 1.098, 2.284, shown both as a standard HMM and as a WFST]
where transition “probabilities” are in the –log() domain, and do not need to sum to 1.0 because they include LM information.
