
Language Model


Presentation Transcript


  1. Language Model

  2. Language Model Major role: language models help a speech recognizer figure out how likely a word sequence is, independent of the acoustics. Many candidate sequences can be eliminated, and more plausible words can be given higher probabilities.

  3. LM This lets the recognizer make the right guess when two different sentences sound the same. For example: • It’s fun to recognize speech? • It’s fun to wreck a nice beach?

  4. LM The Bayesian rule: Ŵ = argmax_W P(W|A) = argmax_W P(A|W) P(W) / P(A) = argmax_W P(A|W) P(W), where A is the acoustic observation. To maximize this, we look today at P(W), the language model.
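A minimal sketch (not from the slides; all numbers are made up for illustration) of how this decision rule is applied: acoustic and language-model probabilities are combined in the log domain and the best-scoring hypothesis wins, so a strong P(W) can overrule a slightly better acoustic match.

```python
import math

# Two acoustically similar hypotheses with hypothetical scores:
# p_acoustic stands in for P(A|W), p_lm for P(W).
candidates = {
    "it's fun to recognize speech": {"p_acoustic": 1.0e-8, "p_lm": 1e-6},
    "it's fun to wreck a nice beach": {"p_acoustic": 1.2e-8, "p_lm": 1e-10},
}

def log_score(hyp):
    s = candidates[hyp]
    # argmax_W P(A|W) * P(W), computed as a sum of logs for numerical stability
    return math.log(s["p_acoustic"]) + math.log(s["p_lm"])

print(max(candidates, key=log_score))  # the LM term favours "recognize speech"
```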

  5. LM The ultimate goal is that a speech recognizer performs as well as a human being. In psychology a lot of research has been done, for example: • The *eel was on the shoe • The *eel was on the car People are capable of adjusting to the right context (hearing heel or wheel), which • removes ambiguities • limits the possible words There are already very good language models for dedicated applications (e.g. medical, where there is a lot of standardization).

  6. Classification Language models used in speech recognition can be classified into the following categories: • Uniform models: the probability that a word occurs is 1/V, where V is the size of the vocabulary • Finite state machines • Grammar models: they use context-free grammars • Stochastic models: they determine the probability of a word based on its preceding words (e.g. n-grams)
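A minimal sketch (illustrative only, with made-up counts) contrasting the first and last categories: both models expose the same interface P(word | history), but the uniform model ignores the history while the stochastic model conditions on the preceding word.

```python
# Uniform model: every word in the vocabulary is equally likely, P = 1/V.
vocabulary = {"it's", "fun", "to", "recognize", "speech", "wreck", "a", "nice", "beach"}

def uniform_prob(word, history):
    return 1 / len(vocabulary) if word in vocabulary else 0.0

# Stochastic (bigram) model: the probability depends on the preceding word.
bigram_counts = {("recognize", "speech"): 9, ("recognize", "beach"): 1}

def bigram_prob(word, history):
    prev = history[-1]
    total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    return bigram_counts.get((prev, word), 0) / total if total else 0.0

print(uniform_prob("speech", ["recognize"]))  # 1/9, regardless of context
print(bigram_prob("speech", ["recognize"]))   # 0.9, context matters
```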

  7. CFG A grammar is defined by G = (V, T, P, S), where: V is the set of all non-terminal symbols, T is the set of all terminal symbols, P is the set of productions (rewrite rules), and S is a special symbol called the start symbol. Example of rules: S -> NP VP, VP -> VERB NP, NP -> NOUN, NP -> NAME, NOUN -> speech, NAME -> Julie | Ethan, VERB -> loves | chases
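A minimal sketch (plain Python, no parsing library assumed) of the toy grammar above, used here to generate a random sentence from the start symbol S.

```python
import random

# Toy grammar from the slide: each non-terminal maps to its alternative expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["VERB", "NP"]],
    "NP": [["NOUN"], ["NAME"]],
    "NOUN": [["speech"]],
    "NAME": [["Julie"], ["Ethan"]],
    "VERB": [["loves"], ["chases"]],
}

def generate(symbol="S"):
    """Expand a symbol top-down; anything without a rule is a terminal."""
    if symbol not in GRAMMAR:
        return [symbol]
    expansion = random.choice(GRAMMAR[symbol])
    return [word for part in expansion for word in generate(part)]

print(" ".join(generate()))  # e.g. "Julie loves speech"
```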

  8. CFG Parsing • Bottom-up: you start with the input sentence and try to reach the start symbol. • Top-down: you start with the start symbol and try to reach the input sentence by applying the appropriate rules; left recursion (A -> Aa) is a problem here. Advantage of bottom-up parsing: it starts from the actual input words, e.g. “What is the weather forecast for this afternoon?” A lot of parsing algorithms are available from computer science. Problem: people don’t follow the rules of grammar strictly, especially in spoken language, and creating a grammar that covers all these constructions is infeasible.
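A minimal sketch (illustrative, not the algorithm from the slides) of a top-down recognizer for the same toy grammar; it works here because the grammar contains no left recursion.

```python
# Same toy grammar as in the previous sketch.
GRAMMAR = {
    "S": [["NP", "VP"]], "VP": [["VERB", "NP"]], "NP": [["NOUN"], ["NAME"]],
    "NOUN": [["speech"]], "NAME": [["Julie"], ["Ethan"]], "VERB": [["loves"], ["chases"]],
}

def derives(symbols, words):
    """True if the symbol list can derive exactly the word list, expanding top-down."""
    if not symbols:
        return not words                    # both exhausted -> success
    head, rest = symbols[0], symbols[1:]
    if head not in GRAMMAR:                 # terminal: must match the next input word
        return bool(words) and words[0] == head and derives(rest, words[1:])
    # non-terminal: try every alternative expansion
    return any(derives(list(alt) + rest, words) for alt in GRAMMAR[head])

print(derives(["S"], "Julie loves speech".split()))  # True
print(derives(["S"], "speech Julie loves".split()))  # False
```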

  9. Probabilistic CFG A mixture between formal language theory and probabilistic models is the PCFG. If there are m rules for the same left-hand-side non-terminal A, i.e. A -> α_1, ..., A -> α_m, then the probability of rule j is estimated as P(A -> α_j) = C(A -> α_j) / Σ_{i=1..m} C(A -> α_i), where C denotes the number of times each rule is used.
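A minimal sketch of this maximum-likelihood estimate, using made-up rule counts: each rule's count is divided by the total count of all rules sharing its left-hand side.

```python
from collections import defaultdict

# Hypothetical counts C(A -> alpha) collected from a parsed corpus.
rule_counts = {
    ("NP", ("NOUN",)): 30,
    ("NP", ("NAME",)): 10,
    ("VP", ("VERB", "NP")): 25,
}

# P(A -> alpha_j) = C(A -> alpha_j) / sum_i C(A -> alpha_i)
lhs_totals = defaultdict(int)
for (lhs, _), count in rule_counts.items():
    lhs_totals[lhs] += count

rule_probs = {rule: count / lhs_totals[rule[0]] for rule, count in rule_counts.items()}
print(rule_probs[("NP", ("NOUN",))])  # 30 / 40 = 0.75
```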

  10. Stochastic language models In formal language theory P(W) can be regarded as 1 if the word sequence is accepted and as 0 if it is rejected. N-grams: the probability that w_i will follow, given that the word sequence w_1, ..., w_{i-1} was presented previously, i.e. P(w_i | w_1, ..., w_{i-1}).

  11. N-grams Unigram: P(w_i) Bigram: P(w_i | w_{i-1}) Trigram: P(w_i | w_{i-2}, w_{i-1})
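A minimal sketch (with a tiny made-up corpus) of maximum-likelihood bigram estimation, P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1}); unigram and trigram estimates work the same way with shorter or longer histories.

```python
from collections import Counter

# Tiny illustrative corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> it's fun to recognize speech </s>",
    "<s> it's fun to recognize images </s>",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def p_bigram(word, prev):
    # Maximum-likelihood estimate: P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_bigram("speech", "recognize"))  # 0.5: "recognize" is followed by "speech" once out of twice
```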

  12. N-gram example To calculate this probability, we need to compute both the number of times “am” is preceded by “I”, C(I am), and the number of times “here” is preceded by “I am”, C(I am here): P(here | I am) = C(I am here) / C(I am). All four sound the same; the right decision can only be made by the language model.

  13. Training Training is done on very large training sets with millions of words. Still, a lot of legal word sequences will not be encountered during training. Because it is infeasible to train on every possible sequence of words, P(W) will be zero for some legal sequences.

  14. Training Solutions to overcome this problem: • A practical approach is to assume that this probability depends only on an equivalence class, for example by grouping all nouns into one equivalence class. • A technique called smoothing adjusts very low and very high probabilities, so that 0 and 1 no longer occur.
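A minimal sketch of one such smoothing technique, add-one (Laplace) smoothing, chosen here only as an illustration: every bigram count is incremented by one, so unseen but legal bigrams no longer receive probability zero.

```python
from collections import Counter

corpus = ["<s> John read her book </s>", "<s> I read a different book </s>"]
V = 9  # vocabulary size of this toy corpus (including <s> and </s>), counted by hand

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def p_mle(word, prev):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def p_add_one(word, prev):
    # Add one to every bigram count and V to the denominator to keep probabilities normalized.
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(p_mle("her", "read"))       # 0.5, bigram seen in training
print(p_mle("John", "read"))      # 0.0, never seen
print(p_add_one("John", "read"))  # ~0.09, small but no longer zero
```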

  15. Evaluation The most common metric for a LM is the word recognition error rate, but this requires a complete speech recognition system. Another method is known as perplexity.

  16. Perplexity Encode the text W using -log2 P(W) bits. Then the cross-entropy H(W) is H(W) = -(1/N) log2 P(W), where N is the length of the text in words. The perplexity is then defined as PP(W) = 2^H(W).
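A minimal sketch of these two formulas as functions of the text probability P(W) and the text length N; the numbers fed in are made up for illustration.

```python
import math

def cross_entropy(p_w, n):
    # H(W) = -(1/N) * log2 P(W), with N the number of words in the text
    return -math.log2(p_w) / n

def perplexity(p_w, n):
    # PP(W) = 2 ** H(W)
    return 2 ** cross_entropy(p_w, n)

# Illustrative: a 10-word text to which the model assigns probability 1e-6.
print(round(cross_entropy(1e-6, 10), 2))  # ~1.99 bits per word
print(round(perplexity(1e-6, 10), 2))     # ~3.98
```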

  17. Example Training set: • John read her book • I read a different book • John read a book by Mulan

  18. Example These bigram probabilities help us estimate the probability of the sentence as: P(John read a book) = P(John|<s>) P(read|John) P(a|read) P(book|a) P(</s>|book) = 2/3 · 1 · 2/3 · 1/2 · 2/3 ≈ 0.148. Then the cross-entropy is -(1/4) log2(0.148) = 0.689, so the perplexity = 2^0.689 = 1.61. Comparison: Wall Street Journal text (5,000-word vocabulary) has a bigram perplexity of 128.
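A minimal sketch reproducing this worked example end to end: bigram probabilities are estimated from the three training sentences and combined into the sentence probability, cross-entropy, and perplexity.

```python
import math
from collections import Counter

training = [
    "<s> John read her book </s>",
    "<s> I read a different book </s>",
    "<s> John read a book by Mulan </s>",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in training:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def p(word, prev):
    # Maximum-likelihood bigram estimate
    return bigram_counts[(prev, word)] / unigram_counts[prev]

sentence = ["<s>", "John", "read", "a", "book", "</s>"]
p_w = math.prod(p(w, prev) for prev, w in zip(sentence, sentence[1:]))
n = 4  # number of words, excluding the sentence markers

print(round(p_w, 3))                         # 0.148
print(round(-math.log2(p_w) / n, 3))         # 0.689 (cross-entropy in bits per word)
print(round(2 ** (-math.log2(p_w) / n), 2))  # 1.61  (perplexity)
```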

  19. Evaluation A high perplexity means that the number of words branching from a previous word is larger on average. A low perplexity does not guarantee good performance: for example, the set B, C, D, E, G, P, T has a perplexity of 7, but perplexity does not take acoustic confusability into account, and these letters are easily confused acoustically.
