
Language Models For Speech Recognition


Presentation Transcript


  1. Language Models For Speech Recognition

  2. Speech Recognition • A = a1, a2, …, aT : sequence of acoustic observation vectors • Find the word sequence W = w1, w2, …, wn such that: W* = argmax_W P(W | A) = argmax_W P(A | W) P(W) • The task of the language model is to make available to the recognizer adequate estimates of the word-sequence probabilities P(W)

  3. Language Models

  4. N-gram models • Make the Markov assumption that only the prior local context – the last (N-1) words – affects the next word: P(wi | w1, …, wi-1) ≈ P(wi | wi-N+1, …, wi-1) • N=3 trigrams: P(wi | wi-2, wi-1) • N=2 bigrams: P(wi | wi-1) • N=1 unigrams: P(wi)

  5. Parameter estimation • Maximum Likelihood Estimator: relative frequencies of events in the training data • N=3 trigrams: P(w3 | w1, w2) = C(w1 w2 w3) / C(w1 w2) • N=2 bigrams: P(w2 | w1) = C(w1 w2) / C(w1) • N=1 unigrams: P(w1) = C(w1) / N • This will assign zero probability to unseen events
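
A minimal sketch of maximum-likelihood n-gram estimation in Python (the toy corpus, function name and tokenization are illustrative, not from the slides):

```python
from collections import Counter

def mle_ngram_probs(tokens, n):
    """Relative-frequency (ML) estimates: P(w_n | w_1..w_{n-1}) = C(w_1..w_n) / C(w_1..w_{n-1})."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if n == 1:
        total = sum(ngrams.values())
        return {g: c / total for g, c in ngrams.items()}
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    # Any n-gram absent from `ngrams` implicitly gets probability zero -- the sparseness problem.
    return {g: c / contexts[g[:-1]] for g, c in ngrams.items()}

corpus = "the dog on the hill barked at the dog".split()
print(mle_ngram_probs(corpus, 2)[("the", "dog")])   # 2/3: "the" occurs 3 times as a context, "the dog" twice
```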

  6. Number of Parameters • For a vocabulary of size V, a 1-gram model has V - 1 independent parameters • A 2-gram model has V^2 - 1 independent parameters • In general, an n-gram model has V^n - 1 independent parameters • Typical values for a moderate-size vocabulary of 20,000 words: about 2×10^4 (1-gram), 4×10^8 (2-gram) and 8×10^12 (3-gram)

  7. Number of Parameters • |V| = 60,000 words, N = 35M words of training text (Eleftherotypia daily newspaper) • In a typical training text, roughly 80% of trigrams occur only once • Good-Turing estimate: ML estimates will be zero for 37.5% of the 3-grams and for 11% of the 2-grams

  8. Problems • Data sparseness: we do not have enough data to train the model parameters • Solutions • Smoothing techniques: accurately estimate probabilities in the presence of sparse data • Good-Turing, Jelinek-Mercer (linear interpolation), Katz (backing-off) • Build compact models: they have fewer parameters to train and thus require less data • equivalence classification of words (e.g. grammatical categories (noun, verb, adjective, preposition), semantic labels (city, name, date))

  9. Smoothing • Make distributions more uniform • Redistribute probability mass from higher to lower probabilities

  10. Additive Smoothing • For each n-gram that occurs r times, pretend that it occurs r+1 times • e.g. bigrams: P_add(wi | wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + V)
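
A hedged sketch of add-one (Laplace) smoothing for bigrams, following the formula above (function name and the toy corpus are illustrative):

```python
from collections import Counter

def add_one_bigram_prob(tokens, w_prev, w, vocab_size):
    """Additive smoothing: (C(w_prev, w) + 1) / (C(w_prev) + V) -- every bigram gets a nonzero count."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    return (bigrams[(w_prev, w)] + 1) / (contexts[w_prev] + vocab_size)

corpus = "the dog on the hill barked at the dog".split()
V = len(set(corpus))
print(add_one_bigram_prob(corpus, "the", "cat", V))   # unseen bigram, but nonzero probability
```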

  11. Good-Turing Smoothing • For any n-gram that occurs r times, pretend that it occurs r* times, where r* = (r+1) n_{r+1} / n_r and n_r is the number of n-grams which occur exactly r times • To convert this count to a probability we just normalize: P_GT = r* / N, where N is the total number of observed n-grams • Total probability of unseen n-grams: n_1 / N
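
A small sketch of the Good-Turing count adjustment described above (the fallback for counts with no higher neighbour is a simplification; real implementations smooth the n_r values first):

```python
from collections import Counter

def good_turing(ngram_counts):
    """r* = (r + 1) * n_{r+1} / n_r, where n_r is the number of n-grams seen exactly r times."""
    n_r = Counter(ngram_counts.values())              # count-of-counts
    total = sum(ngram_counts.values())
    adjusted = {}
    for gram, r in ngram_counts.items():
        if n_r.get(r + 1):
            adjusted[gram] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[gram] = r                        # simplification: keep the raw count
    p_unseen = n_r[1] / total                         # total mass reserved for unseen n-grams
    probs = {gram: r_star / total for gram, r_star in adjusted.items()}
    return probs, p_unseen

counts = Counter({("the", "dog"): 2, ("dog", "on"): 1, ("on", "the"): 1})
print(good_turing(counts))
```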

  12. Example

  13. Jelinek-Mercer Smoothing (linear interpolation) • Intuitively: interpolate a higher-order model with a lower-order model • P_interp(wi | wi-1) = λ P_ML(wi | wi-1) + (1 - λ) P_interp(wi), with 0 ≤ λ ≤ 1 • Given fixed P_ML, it is possible to search efficiently for the λ that maximizes the probability of some data (held out from the data used to estimate P_ML) using the Baum-Welch algorithm
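
A sketch of the interpolation and of fitting a single weight λ on held-out data with EM, one way to realize the Baum-Welch search mentioned above (all names are illustrative):

```python
def interp_prob(w_prev, w, p_ml_bigram, p_unigram, lam):
    """P_interp(w | w_prev) = lambda * P_ML(w | w_prev) + (1 - lambda) * P(w)."""
    return lam * p_ml_bigram.get((w_prev, w), 0.0) + (1 - lam) * p_unigram.get(w, 0.0)

def estimate_lambda(heldout_bigrams, p_ml_bigram, p_unigram, iters=20):
    """EM for the mixture weight: the E-step computes the posterior that the higher-order
    component generated each held-out word, the M-step averages those posteriors."""
    lam = 0.5
    for _ in range(iters):
        posterior_sum = 0.0
        for w_prev, w in heldout_bigrams:
            hi = lam * p_ml_bigram.get((w_prev, w), 0.0)
            lo = (1 - lam) * p_unigram.get(w, 0.0)
            if hi + lo > 0:
                posterior_sum += hi / (hi + lo)
        lam = posterior_sum / len(heldout_bigrams)
    return lam
```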

  14. Katz Smoothing (backing-off) • For those events which have been observed in the training data we assume some reliable (discounted) estimate of the probability • For the remaining unseen events we back off to some less specific (lower-order) distribution • The back-off weight α is chosen so that the total probability sums to 1
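
A back-off sketch along these lines, assuming some already-discounted bigram estimates (how the discounting is done, e.g. with Good-Turing, is orthogonal to the back-off step shown here):

```python
def backoff_weights(discounted_bigram, p_unigram, vocab):
    """alpha(w_prev): the probability mass left over after the seen bigrams,
    divided by the unigram mass of the unseen continuations, so each context sums to 1."""
    alpha = {}
    for w_prev in vocab:
        seen = {w for w in vocab if (w_prev, w) in discounted_bigram}
        left_over = 1.0 - sum(discounted_bigram[(w_prev, w)] for w in seen)
        unseen_mass = sum(p_unigram[w] for w in vocab if w not in seen)
        alpha[w_prev] = left_over / unseen_mass if unseen_mass > 0 else 0.0
    return alpha

def katz_bigram_prob(w_prev, w, discounted_bigram, p_unigram, alpha):
    """Use the reliable (discounted) estimate if the bigram was seen, otherwise back off."""
    if (w_prev, w) in discounted_bigram:
        return discounted_bigram[(w_prev, w)]
    return alpha[w_prev] * p_unigram[w]
```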

  15. Witten-Bell Smoothing • Model the probability of new events by estimating how often a new (previously unseen) word type is encountered as we proceed through the training corpus • The new-event probability is estimated as T / (N + T), where T is the number of distinct word types observed (after the given history) and N is the number of tokens observed
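
A minimal sketch of the interpolated form of this idea, in which the new-event mass T/(N+T) is handed to a lower-order model (the interpolated formulation and all names are my assumptions, not taken from the slide):

```python
from collections import Counter, defaultdict

def witten_bell_bigram(tokens):
    """Interpolated Witten-Bell: P(w | w_prev) = (C(w_prev, w) + T * P_lower(w)) / (N + T),
    with T = distinct word types seen after w_prev and N = tokens seen after w_prev."""
    followers = defaultdict(Counter)
    for w_prev, w in zip(tokens, tokens[1:]):
        followers[w_prev][w] += 1

    def prob(w_prev, w, p_lower):
        c = followers[w_prev]
        n, t = sum(c.values()), len(c)
        if n + t == 0:                      # unseen context: fall back entirely to the lower-order model
            return p_lower(w)
        return (c[w] + t * p_lower(w)) / (n + t)

    return prob
```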

  16. Absolute Discounting • Subtract a constant D (0 < D < 1) from each nonzero count and give the freed probability mass to the lower-order distribution: P_abs(wi | wi-1) = max(C(wi-1 wi) - D, 0) / C(wi-1) + λ(wi-1) P(wi)

  17. Kneser-Ney • Absolute discounting in which the lower-order distribution is not proportional to the number of occurrences of a word, but to the number of different words that it follows (i.e. its number of distinct preceding contexts)
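
A compact sketch of interpolated Kneser-Ney for bigrams that combines the absolute-discounting step from the previous slide with the continuation counts described here (the discount value and all names are illustrative):

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """P_KN(w | w_prev) = max(C(w_prev, w) - D, 0) / C(w_prev) + lambda(w_prev) * P_cont(w),
    where P_cont(w) is proportional to the number of different words that w follows."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    followers = defaultdict(set)            # distinct continuations of each context
    preceders = defaultdict(set)            # distinct words each word follows
    for w_prev, w in bigrams:
        followers[w_prev].add(w)
        preceders[w].add(w_prev)
    num_bigram_types = len(bigrams)
    p_cont = {w: len(prevs) / num_bigram_types for w, prevs in preceders.items()}

    def prob(w_prev, w):
        n_ctx = context_counts[w_prev]
        if n_ctx == 0:                      # unseen context: continuation probability only
            return p_cont.get(w, 0.0)
        lam = discount * len(followers[w_prev]) / n_ctx
        return max(bigrams[(w_prev, w)] - discount, 0) / n_ctx + lam * p_cont.get(w, 0.0)

    return prob

prob = kneser_ney_bigram("the dog on the hill barked at the dog".split())
print(prob("the", "dog"), prob("the", "cat"))   # seen vs. unseen bigram, both nonzero
```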

  18. Modified Kneser-Ney • Instead of a single discount D, uses separate discounts D1, D2 and D3+ for n-grams with one, two, and three or more counts (Chen & Goodman)

  19. Measuring Model Quality • Consider the language as an information source L, which emits a sequence of symbols wi from a finite alphabet (the vocabulary) • The quality of a language model M can be judged by its cross entropy with regard to the distribution PT(x) of some hitherto unseen text T: H(PT; M) = - Σ_x PT(x) log2 PM(x), in practice approximated by -(1/n) log2 PM(w1 … wn) • Intuitively speaking, cross entropy is the entropy of T as "perceived" by the model M

  20. Perplexity • Perplexity: PP(T; M) = 2^H(PT; M) • In a language with perplexity X, every word can be followed (on average) by X different words with equal probabilities
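
A small sketch computing cross entropy and perplexity on a test text, given any smoothed conditional model (the function names and the uniform toy model are placeholders):

```python
import math

def cross_entropy_and_perplexity(test_tokens, model_prob):
    """H = -(1/n) * sum_i log2 P_M(w_i | history_i); perplexity = 2^H.
    `model_prob(history, w)` must return a nonzero probability (i.e. a smoothed model)."""
    log_sum = 0.0
    for i, w in enumerate(test_tokens):
        log_sum += math.log2(model_prob(tuple(test_tokens[:i]), w))
    h = -log_sum / len(test_tokens)
    return h, 2 ** h

# a uniform model over a 1,000-word vocabulary has perplexity 1000
h, pp = cross_entropy_and_perplexity("any test text".split(), lambda hist, w: 1 / 1000)
print(h, pp)   # ~9.97 bits, ~1000
```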

  21. Elements of Information Theory • Entropy • Mutual Information (and pointwise mutual information) • Kullback-Leibler (KL) divergence
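
For reference, the standard definitions of the three quantities listed (base-2 logarithms, matching the bit-based entropy used above):

```latex
H(X) = -\sum_{x} p(x)\,\log_2 p(x)
\qquad
I(X;Y) = \sum_{x,y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)}

\mathrm{PMI}(x,y) = \log_2 \frac{p(x,y)}{p(x)\,p(y)}
\qquad
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x)\,\log_2 \frac{p(x)}{q(x)}
```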

  22. The Greek Language • Highly inflectional language • A Greek vocabulary of 220K words is needed in order to achieve 99.6% lexical coverage

  23. Perplexity

  24. Experimental Results

  25. Hit Rate

  26. Class-based Models • Some words are similar to other words in their meaning and syntactic function • Group words into classes • Fewer parameters • Better estimates

  27. Class-based n-gram models • Suppose that we partition the vocabulary into G classes • This model produces text by first generating a string of classes g1, g2, …, gn and then converting them into the words wi, i = 1, 2, …, n with probability p(wi | gi): P(wi | wi-n+1 … wi-1) = p(wi | gi) p(gi | gi-n+1 … gi-1) • An n-gram model has V^n - 1 independent parameters (216×10^12 for trigrams with V = 60,000) • A class-based model has G^n - 1 + V - G parameters (about 10^9): G^n - 1 for an n-gram model over the "vocabulary" of G classes, plus V - G of the form p(wi | gi)
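
A toy sketch of the class-based factorization p(wi | gi) · p(gi | gi-1) for bigrams (the class names, mapping and probability values are invented for illustration):

```python
def class_bigram_prob(w_prev, w, word2class, p_class_bigram, p_word_given_class):
    """Class-based bigram: P(w | w_prev) = p(g(w) | g(w_prev)) * p(w | g(w))."""
    g_prev, g = word2class[w_prev], word2class[w]
    return p_class_bigram.get((g_prev, g), 0.0) * p_word_given_class.get((w, g), 0.0)

# toy example with two classes
word2class = {"dog": "NOUN", "cat": "NOUN", "barked": "VERB"}
p_class_bigram = {("NOUN", "VERB"): 0.4}
p_word_given_class = {("barked", "VERB"): 0.1}
print(class_bigram_prob("dog", "barked", word2class, p_class_bigram, p_word_given_class))  # 0.4 * 0.1 = 0.04
```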

  28. Relation to n-grams

  29. Defining Classes • Manually • Use part-of-speech labels assigned by linguistic experts or by a tagger • Use stem information • Automatically • Cluster words as part of an optimization procedure, e.g. maximize the log-likelihood of the text

  30. Agglomerative Clustering • Bottom-up clustering • Start with a separate cluster for each word • Merge the pair of clusters for which the loss in average mutual information (MI) is least
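
A naive sketch of the bottom-up procedure: recompute the average mutual information of adjacent classes for every candidate merge and keep the merge that loses the least (real implementations, e.g. Brown clustering, update the AMI incrementally instead of recomputing it from scratch):

```python
import math
from collections import Counter
from itertools import combinations

def average_mutual_information(tokens, word2class):
    """AMI of adjacent classes: sum over class bigrams of p(g1, g2) * log2(p(g1, g2) / (p(g1) * p(g2)))."""
    pairs = Counter((word2class[a], word2class[b]) for a, b in zip(tokens, tokens[1:]))
    n = sum(pairs.values())
    left, right = Counter(), Counter()
    for (g1, g2), c in pairs.items():
        left[g1] += c
        right[g2] += c
    return sum((c / n) * math.log2((c / n) / ((left[g1] / n) * (right[g2] / n)))
               for (g1, g2), c in pairs.items())

def agglomerative_clustering(tokens, num_classes):
    """Start with one cluster per word type; greedily merge the pair with the smallest AMI loss."""
    word2class = {w: w for w in set(tokens)}
    while len(set(word2class.values())) > num_classes:
        best = None
        for g1, g2 in combinations(sorted(set(word2class.values())), 2):
            trial = {w: (g1 if g == g2 else g) for w, g in word2class.items()}
            ami = average_mutual_information(tokens, trial)
            if best is None or ami > best[0]:
                best = (ami, trial)
        word2class = best[1]
    return word2class

print(agglomerative_clustering("the dog barked the cat barked the dog ran".split(), 3))
```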

  31. Example • Syntactical classes • verbs, past tense: άναψαν, επέλεξαν, κατέλαβαν, πλήρωσαν, πυροβόλησαν • nouns, neuter: άλογο, δόντι, δέντρο, έντομο, παιδί, ρολόι, σώμα • adjectives, masculine: δημοκρατικός, δημόσιος, ειδικός, εμπορικός, επίσημος • Semantic classes • last names: βαρδινογιάννης, γεννηματάς, λοβέρδος, ράλλης • countries: βραζιλία, βρετανία, γαλλία, γερμανία, δανία • numerals: δέκατο, δεύτερο, έβδομο, εικοστό, έκτο, ένατο, όγδοο • Some not so well defined classes • ανακριβής, αναμεταδίδει, διαφημίσουν, κομήτες, προμήθευε • εξίσωση, έτρωγαν, και, μαλαισία, νηπιαγωγών, φεβρουάριος

  32. Stem-based Classes • άγνωστ: άγνωστος, άγνωστου, άγνωστο, άγνωστον, άγνωστοι, άγνωστους, άγνωστη, άγνωστης, άγνωστες, άγνωστα • βλέπ: βλέπω, βλέπεις, βλέπει, βλέπουμε, βλέπετε, βλέπουν • εκτελ: εκτελεί, εκτελούν, εκτελούσε, εκτελούσαν, εκτελείται, εκτελούνται • εξοχικ: εξοχικό, εξοχικά, εξοχική, εξοχικής, εξοχικές • ιστορικ: ιστορικός, ιστορικού, ιστορικό, ιστορικοί, ιστορικών, ιστορικούς, ιστορική, ιστορικής, ιστορικές, ιστορικά • καθηγητ: καθηγητής, καθηγητή, καθηγητές, καθηγητών • μαχητικ: μαχητικός, μαχητικού, μαχητικό, μαχητικών, μαχητική, μαχητικής, μαχητικά

  33. Experimental Results

  34. Example • Interpolate class-based and word-based models

  35. Experimental Results

  36. Hit Rate

  37. Experimental Results

  37. Where do we go from here? • Use syntactic information, e.g. long-distance dependencies: "The dog on the hill barked" • Constraints
