
Speech Recognition and Understanding


Presentation Transcript


  1. Speech Recognition and Understanding Alex Acero Microsoft Research Thanks to Mazin Rahim (AT&T)

  2. “A Vision into the 21st Century”

  3. Milestones in Speech Recognition [timeline figure, 1962-2003] • Small vocabulary, acoustic phonetics-based (isolated words): filter-bank analysis, time normalization, dynamic programming • Medium vocabulary, template-based (connected digits, continuous speech): LPC analysis, clustering algorithms, level building • Large vocabulary, statistical-based (connected words, continuous speech): hidden Markov models, stochastic language modeling • Large vocabulary; syntax, semantics (continuous speech, speech understanding): stochastic language understanding, finite-state machines, statistical learning • Very large vocabulary; semantics, multimodal dialog (spoken dialog, multiple modalities): concatenative synthesis, machine learning, mixed-initiative dialog

  4. Multimodal System Technology Components [block diagram] • Inputs: speech, pen gesture, visual • ASR (Automatic Speech Recognition): speech → words • SLU (Spoken Language Understanding): words → meaning • DM (Dialog Management): meaning → action, driven by data and rules • SLG (Spoken Language Generation): action → words • TTS (Text-to-Speech Synthesis): words → speech

  5. Voice-enabled System Technology Components [block diagram] • The same pipeline with speech as the only modality: ASR → SLU → DM → SLG → TTS

  6. Automatic Speech Recognition • Goal: convert a speech signal into a text message accurately and efficiently, independent of the device, speaker, or environment • Applications: accessibility; eyes-busy, hands-busy settings (automobile, doctors, etc.); call centers for customer care; dictation

  7. Basic Formulation • The basic equation of speech recognition is W* = argmax_W P(W|X) = argmax_W P(X|W) P(W) • X = X1, X2, …, Xn is the acoustic observation • W = w1, w2, …, wm is the word sequence • P(X|W) is the acoustic model • P(W) is the language model

  8. Speech Recognition Process [block diagram] • Input speech → Feature Extraction → Pattern Classification (decoding, search) → recognized text (“Hello World”) with Confidence Scoring (e.g., 0.9, 0.8) • Knowledge sources used in decoding: Acoustic Model, Word Lexicon, Language Model

  9. Feature Extraction • Goal: extract robust features relevant for ASR • Method: spectral analysis • Result: a feature vector every 10ms • Challenges: robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects)

  10. Spectral Analysis • Female speech (/aa/, pitch of 200Hz) • Fourier transform of a 30ms Hamming window • x[n]: time signal; X[k]: Fourier transform
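To make the windowing step concrete, here is a minimal sketch of the analysis on this slide: a 30ms Hamming window applied before the Fourier transform. The sample rate, synthetic signal, and FFT size are assumptions for illustration; the slide's /aa/ example uses a real recording.

```python
import numpy as np

fs = 16000                                   # sample rate (Hz), assumed
t = np.arange(int(0.030 * fs)) / fs          # one 30ms analysis frame
# Synthetic stand-in for voiced speech: harmonics of a 200Hz pitch.
x = sum(np.cos(2 * np.pi * 200 * k * t) for k in range(1, 20))

w = np.hamming(len(x))                       # 30ms Hamming window
X = np.fft.rfft(x * w, n=512)                # X[k]: short-time Fourier transform
log_mag = 20 * np.log10(np.abs(X) + 1e-10)   # log-magnitude spectrum in dB
```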

  11. Spectrograms • Short-time Fourier transform • Pitch and formant structure

  12. Feature Extraction Process [flow diagram] • Segmentation (blocking into frames of M samples with shift N) → preemphasis filtering → windowing → spectral analysis and cepstral analysis • Derived measurements: energy, zero-crossing rate, pitch, formants • Post-processing: equalization (bias removal or normalization), noise removal, quantization • Temporal derivatives: delta cepstrum and delta^2 cepstrum
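The temporal-derivative stage at the end of this pipeline is easy to show in code. Below is a sketch that appends delta and delta^2 (acceleration) coefficients to a matrix of cepstral vectors using the standard regression formula; the window half-length M=2 is a common choice, not something the slide specifies.

```python
import numpy as np

def add_deltas(cepstra: np.ndarray, M: int = 2) -> np.ndarray:
    """Append first and second temporal derivatives to (frames x dims) cepstra."""
    def delta(feats):
        # Pad by repeating edge frames so every frame has M neighbors.
        padded = np.pad(feats, ((M, M), (0, 0)), mode="edge")
        num = sum(m * (padded[M + m:len(feats) + M + m]
                       - padded[M - m:len(feats) + M - m])
                  for m in range(1, M + 1))
        return num / (2 * sum(m * m for m in range(1, M + 1)))
    d1 = delta(cepstra)          # delta cepstrum
    d2 = delta(d1)               # delta^2 (acceleration) cepstrum
    return np.hstack([cepstra, d1, d2])
```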

  13. Robust Speech Recognition • A mismatch in the speech signal between the training phase and testing phase results in performance degradation • The mismatch can be attacked at three levels, in both training and testing: signal (enhancement), features (normalization), and model (adaptation)

  14. Noise and Channel Distortion • y(t) = [s(t) + n(t)] * h(t): speech s(t) is corrupted by additive noise n(t), then convolved with the channel h(t) to give the distorted speech y(t) • [Figure: Fourier transforms of the clean and distorted signals, annotated 50dB and 5dB]
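A toy simulation of this distortion model, with placeholder signals: white noise stands in for the speech and the noise, and a short impulse response stands in for the channel.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)          # clean speech stand-in, 1s at 16kHz
n = 0.1 * rng.standard_normal(16000)    # additive noise n(t) at ~20dB SNR
h = np.array([1.0, 0.5, 0.25])          # short channel impulse response h(t)
y = np.convolve(s + n, h, mode="full")  # distorted speech y(t) = [s+n] * h
```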

  15. Speaker Variations • Vocal tract length varies from 15-20cm • Longer vocal tracts => lower frequency content • Maximum Likelihood Speaker Normalization: warp the frequency axis of the signal

  16. Acoustic Modeling • Goal: map acoustic features into distinct subword units, such as phones, syllables, words, etc. • Hidden Markov Model (HMM): spectral properties modeled by a parametric random process; a collection of HMMs is associated with each subword unit; HMMs are also assigned for modeling extraneous events • Advantages: powerful statistical method for a wide range of data and conditions; highly reliable for recognizing speech

  17. Discrete-Time Markov Process • Example: the daily movement of the Dow Jones Industrial Average (up, down, unchanged), modeled as a discrete-time first-order Markov chain

  18. Hidden Markov Models

  19. Example • Observation: (up, up, down, up, unchanged, up) • Is it a bull market? bear market? • P(X|bull) = 0.7·0.7·0.1·0.7·0.2·0.7·[0.5·(0.6)^5] = 1.867×10^-4 • P(X|bear) = 0.1·0.1·0.6·0.1·0.3·0.1·[0.2·(0.3)^5] = 8.748×10^-9 • P(X|steady) = 0.3·0.3·0.3·0.3·0.4·0.3·[0.3·(0.5)^5] = 9.1125×10^-6 • It's 20 times more likely that we are in a bull market than a steady market! • How about the state sequence (bull, bull, bear, bull, steady, bull)? P = (0.7·0.7·0.6·0.7·0.4·0.7)·(0.5·0.6·0.2·0.5·0.2·0.4) = 1.382976×10^-4
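The arithmetic above can be checked with a few lines of code. The emission and transition values below are read off the factors in the slide's products (e.g., the bull state emits "up" with probability 0.7 and self-loops with probability 0.6); the bracketed term in each product is the initial probability times five self-loops.

```python
obs = ["up", "up", "down", "up", "unchanged", "up"]
emit = {"bull":   {"up": 0.7, "down": 0.1, "unchanged": 0.2},
        "bear":   {"up": 0.1, "down": 0.6, "unchanged": 0.3},
        "steady": {"up": 0.3, "down": 0.3, "unchanged": 0.4}}
init = {"bull": 0.5, "bear": 0.2, "steady": 0.3}   # initial probabilities
loop = {"bull": 0.6, "bear": 0.3, "steady": 0.5}   # self-transition probabilities

for state in ("bull", "bear", "steady"):
    p = init[state] * loop[state] ** 5             # stay in one state for 6 days
    for o in obs:
        p *= emit[state][o]
    print(state, p)  # bull 1.867e-04, bear 8.748e-09, steady 9.1125e-06
```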

  20. Basic Problems in HMMs • Given acoustic observation X and model λ: • Evaluation: compute P(X|λ) • Decoding: choose the optimal state sequence • Re-estimation: adjust λ to maximize P(X|λ)

  21. Evaluation: Forward-Backward Algorithm [trellis diagram over states s1…sN and times t-2…t+1] • Forward: α_t(j) = [Σ_i α_{t-1}(i) a_ij] b_j(X_t), the probability of observing X_1…X_t and ending in state j • Backward: β_t(i) = Σ_j a_ij b_j(X_{t+1}) β_{t+1}(j), the probability of observing X_{t+1}…X_T given state i at time t
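A compact sketch of the forward pass described above: alpha[t, j] accumulates P(X_1…X_t, state_t = j | λ). The shapes are generic placeholders (N states, T frames), not values from the slides.

```python
import numpy as np

def forward(pi, A, B):
    """pi: (N,) initial probs; A: (N, N) transitions a_ij;
    B: (T, N) emission likelihoods b_j(X_t)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]  # induction: sum over predecessors i
    return alpha, alpha[-1].sum()             # termination: P(X | model)
```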

  22. Decoding: Viterbi Algorithm • Finds the optimal alignment between observations X and states S [trellis diagram] • Step 1: Initialization: V_1(j) = π_j b_j(x_1), B_1(j) = 0, j = 1,…,N • Step 2: Iterations: for t = 2,…,T and j = 1,…,N: V_t(j) = max_i [V_{t-1}(i) a_ij] b_j(x_t), B_t(j) = argmax_i [V_{t-1}(i) a_ij] • Step 3: Backtracking: the optimal score is max_i V_T(i); the final state is s_T = argmax_i V_T(i); the optimal path (s_1, s_2, …, s_T) follows s_t = B_{t+1}(s_{t+1}), t = T-1,…,1
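The same trellis computed with max in place of sum gives the Viterbi decoder. Below is a sketch following the three steps above, with B[t, j] standing for b_j(x_t); in practice this is done in the log domain to avoid underflow.

```python
import numpy as np

def viterbi(pi, A, B):
    """pi: (N,) initial probs; A: (N, N) transitions; B: (T, N) emissions."""
    T, N = B.shape
    V = np.zeros((T, N))          # V_t(j): best score ending in state j
    back = np.zeros((T, N), int)  # B_t(j): best predecessor state
    V[0] = pi * B[0]              # Step 1: initialization
    for t in range(1, T):         # Step 2: iterations
        scores = V[t - 1][:, None] * A        # scores[i, j] = V_{t-1}(i) a_ij
        back[t] = scores.argmax(axis=0)
        V[t] = scores.max(axis=0) * B[t]
    path = [int(V[-1].argmax())]  # Step 3: backtracking from the final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return V[-1].max(), path[::-1]
```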

  23. Re-estimation: Baum-Welch Algorithm • Find λ = (A, B, π) that maximizes P(X|λ) • No closed-form solution => EM algorithm • Start with an old parameter value λ' • Obtain a new parameter λ that maximizes the auxiliary function Q(λ', λ) • EM guarantees the likelihood does not decrease

  24. Continuous Densities • Output distribution is a mixture of Gaussians: b_j(x) = Σ_k c_jk N(x; μ_jk, Σ_jk) • Posterior probabilities: γ_t(j,k), the probability of occupying state j with mixture component k at time t, and ξ_t(i,j), the probability of being in state i at time t-1 and state j at time t • Re-estimation formulae: each parameter (a_ij, c_jk, μ_jk, Σ_jk) is updated as a ratio of expected counts accumulated from these posteriors
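As a sketch of the re-estimation pattern: given the posteriors γ_t(j,k) and the feature vectors, the mixture weights and means are ratios of expected counts. The array shapes here are assumptions; variance and transition updates follow the same pattern using ξ_t(i,j).

```python
import numpy as np

def reestimate_weights_means(gamma, X):
    """gamma: (T, N, K) posteriors gamma_t(j, k); X: (T, d) feature vectors."""
    occ = gamma.sum(axis=0)                           # expected counts per (j, k)
    c = occ / occ.sum(axis=1, keepdims=True)          # mixture weights c_jk
    mu = np.einsum("tjk,td->jkd", gamma, X) / occ[..., None]  # means mu_jk
    return c, mu
```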

  25. EM Training Procedure [loop diagram] • Input speech database + old HMM model → estimate posteriors γ_t(j,k), ξ_t(i,j) (E-step) → maximize parameters a_ij, c_jk, μ_jk, Σ_jk (M-step) → updated HMM model → repeat

  26. Design Issues • Continuous vs. discrete HMM • Whole-word vs. subword (phone) units • Number of states, number of Gaussians • Ergodic vs. Bakis (left-to-right) topology • Context-dependent vs. context-independent

  27. Training with Continuous Speech • No segmentation is needed • Word HMMs are concatenated with silence models into a composed HMM (e.g., /sil/ one /sil/ three /sil/)

  28. Context Variability in Speech • At word/sentence level: Mr. Wright should write to Ms. Wright right away about his Ford or four door Honda. • At phone level: /iy/ in the words peat and wheel [spectrograms] • Triphones capture coarticulation and phonetic context

  29. Context-Dependent Models • The triphone IY(P, CH) captures coarticulation and phonetic context • Stress also matters: Italy vs. Italian

  30. Clustering similar triphones • /iy/ with two different left contexts /r/ and /w/ • Similar effects on /iy/ • Cluster those triphones together

  31. Clustering with decision trees

  32. Other Variability in Speech • Style: discrete vs. continuous speech; read vs. spontaneous; slow vs. fast • Speaker: speaker-independent, speaker-dependent, speaker-adapted • Environment: additive noise (cocktail party effect), telephone channel

  33. Acoustic Adaptation • Model adaptation is needed when test conditions are mismatched, or to tune the models to a given speaker • Maximum a Posteriori (MAP): adds a prior for the parameters λ • Maximum Likelihood Linear Regression (MLLR): transforms the mean vectors (see the sketch below); more than one MLLR transform can be used • Speaker Adaptive Training (SAT) applies MLLR to the training data as well
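A minimal sketch of the MLLR mean update mentioned above: every Gaussian mean is passed through a shared affine transform, μ' = Aμ + b. Estimating A and b by maximizing the likelihood of the adaptation data is the real work and is omitted here.

```python
import numpy as np

def apply_mllr(means, A, b):
    """means: (num_gaussians, d) mean vectors; A: (d, d); b: (d,).
    Returns the adapted means A @ mu + b for every Gaussian."""
    return means @ A.T + b
```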

  34. MAP vs. MLLR [figure: error rate vs. amount of adaptation data] • The speaker-dependent reference system is trained with 1000 sentences

  35. Discriminative Training • Maximum likelihood training: parameters are estimated from the true classes alone • Discriminative training: maximize discrimination between classes • Discriminative feature transformation: maximize the ratio of inter-class to intra-class difference, done at the state level; e.g., Linear Discriminant Analysis (LDA) • Discriminative model training: maximize the posterior probability using the correct class and competing classes; e.g., Maximum Mutual Information (MMI), Minimum Classification Error (MCE), Minimum Phone Error (MPE)

  36. Word Lexicon • Goal: map legal phone sequences into words according to phonotactic rules: David /d/ /ey/ /v/ /ih/ /d/ • Multiple pronunciations: several words may have more than one pronunciation: Data /d/ /ae/ /t/ /ax/ or /d/ /ey/ /t/ /ax/ • Challenges: How do you generate a word lexicon automatically? How do you add new variant dialects and word pronunciations?

  37. The Lexicon • An entry per word (> 100K words for dictation) • Multiple pronunciations (tomato) • Done by hand or with letter-to-sound rules (LTS) • LTS rules can be automatically trained with decision trees (CART) • less than 8% errors, but proper nouns are hard!

  38. Language Model • Goal: model “acceptable” spoken phrases, constrained by task syntax • Rule-based: deterministic, knowledge-driven grammars • Statistical: compute estimates of word probabilities (N-gram, class-based, CFG) • Example rule: flying from $city to $city on $date, covering “flying from Newark to Boston tomorrow” [grammar fragment with rule probabilities 0.4 and 0.6]

  39. Formal grammars

  40. Chomsky Grammar Hierarchy

  41. N-grams • Trigram estimation (maximum likelihood, from counts): P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / C(w_{i-2}, w_{i-1})

  42. Understanding Bigrams • Training data: • “John read her book” • “I read a different book” • “John read a book by Mark” • Bigram probabilities are relative frequencies, e.g., P(read | John) = C(John read) / C(John) = 2/2 • But we have a problem here: any bigram unseen in training gets probability zero (see the sketch below)
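The problem shows up immediately if we estimate the bigrams by counting, as in this sketch over the three training sentences above (with sentence-boundary markers added):

```python
from collections import Counter

sents = [["john", "read", "her", "book"],
         ["i", "read", "a", "different", "book"],
         ["john", "read", "a", "book", "by", "mark"]]

unigrams, bigrams = Counter(), Counter()
for s in sents:
    toks = ["<s>"] + s + ["</s>"]
    unigrams.update(toks[:-1])                # history counts C(w)
    bigrams.update(zip(toks[:-1], toks[1:]))  # pair counts C(prev, w)

def p(w, prev):
    """Maximum-likelihood P(w | prev); prev must occur in training."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p("read", "john"))  # 1.0: "john" is always followed by "read"
print(p("book", "mark"))  # 0.0: unseen bigram gets zero probability
```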

  43. N-gram Smoothing • Data sparseness: in millions of words, more than 50% of trigrams occur only once • We cannot assign P(w_i | w_{i-2}, w_{i-1}) = 0 • Solution: assign a non-zero probability to every unseen n-gram by lowering the probability mass of seen n-grams

  44. Perplexity • The cross-entropy of a language model on a word sequence W is H(W) = -(1/N) log2 P(w_1, w_2, …, w_N) • Its perplexity, PP(W) = 2^H(W), measures the complexity of a language model (the geometric mean of the branching factor)
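In code, with placeholder per-word model probabilities for a short test sequence:

```python
import math

# P(w_i | history) for each test word, as a language model would assign;
# these values are placeholders, not from the slides.
word_probs = [0.1, 0.25, 0.05, 0.2]

H = -sum(math.log2(p) for p in word_probs) / len(word_probs)  # cross-entropy
PP = 2 ** H                                                   # perplexity
print(f"cross-entropy {H:.2f} bits, perplexity {PP:.1f}")
```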

  45. Perplexity • The digit recognition task (TIDIGITS) has 10 words, PP=10, and a 0.2% error rate • The Airline Travel Information System (ATIS) has 2000 words, PP=20, and a 2.5% error rate • The Wall Street Journal task has 5000 words, PP=130 with a bigram, and a 5% error rate • In general, lower perplexity => lower error rate, but perplexity does not take acoustic confusability into account: the E-set (B, C, D, E, G, P, T) has PP=7 and a 5% error rate

  46. N-gram Smoothing • The deleted interpolation algorithm estimates interpolation weights λ that maximize the probability of a held-out data set (see the sketch below) • We can also map all out-of-vocabulary words to an unknown-word token • Other backoff and smoothing algorithms are possible: Katz, Kneser-Ney, Good-Turing, class n-grams
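A sketch of the interpolation step, mixing bigram, unigram, and uniform estimates; the weights here are placeholders for the values deleted interpolation would learn on held-out data (e.g., by EM).

```python
def smoothed_bigram(w, prev, p_bigram, p_unigram, vocab_size,
                    lambdas=(0.6, 0.3, 0.1)):
    """Interpolated P(w | prev). p_bigram and p_unigram are maximum-
    likelihood estimators; the lambdas must sum to one."""
    l2, l1, l0 = lambdas
    return l2 * p_bigram(w, prev) + l1 * p_unigram(w) + l0 / vocab_size
```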

  47. Adaptive Language Models • Cache Language Models • Topic Language Models • Maximum Entropy Language Models

  48. Bigram Perplexity • Trained on 500 million words and tested on Encarta Encyclopedia

  49. OOV Rate • OOV rate measured on Encarta Encyclopedia. Trained on 500 million words.

  50. WSJ Results • Perplexity and word error rate on the 60,000-word Wall Street Journal continuous speech recognition task • Unigrams, bigrams, and trigrams were trained from 260 million words • Smoothing: Kneser-Ney
