This guide explores the fundamentals of language models (LMs), their training methods, and evaluation metrics like entropy and perplexity. Learn how n-gram models facilitate word prediction and the significance of smoothing techniques, including Good-Turing and Kneser-Ney. Discover real-world applications of LMs in speech recognition, as well as understanding phoneme sequences. Gain insights into the interplay between language modeling, cryptography, and information theory to enhance your comprehension of communication in AI systems.
Roadmap (for next two classes) • Review LMs • What are they? • How (and where) are they used? • How are they trained? • Evaluation metrics • Entropy • Perplexity • Smoothing • Good-Turing • Backoff and Interpolation • Absolute Discounting • Kneser-Ney
What is a language model? • Gives the probability of a message – language viewed as transmitted signals of information (Claude Shannon, Information Theory) • Lots of ties to cryptography and information theory • We most often use n-gram models
Applications • What word sequence (English) does this phoneme sequence correspond to: AY D L AY K T UW R EH K AH N AY S B IY CH • Goal of LM: P(“I’d like to recognize speech”) > P(“I’d like to wreck a nice beach”)
Why n-gram LMs? • We could just count how often each full sentence occurs… • …but language is too productive – infinitely many combinations! • Break the sentence down by word – predict each word given its history • We could just count words in their full context… • …but even those contexts get too sparse • So condition on only the last n − 1 words of history: an n-gram model
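As an illustration of the last bullet, here is a minimal sketch of a count-based bigram model (n = 2) on a toy two-sentence corpus; the corpus and the `bigram_prob` helper are placeholders invented for this example, not part of the original slides.

```python
from collections import Counter

# Toy corpus; in practice you would train on millions of sentences.
corpus = [
    "<s> i'd like to recognize speech </s>",
    "<s> i'd like to eat lunch </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Relative-frequency (MLE) estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("like", "to"))        # 1.0 in this tiny corpus
print(bigram_prob("to", "recognize"))   # 0.5
```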
Entropy and perplexity • Entropy measures the information content in a distribution == the uncertainty • If I can predict the next word before it comes, there’s no information content • Zero uncertainty means the signal has zero information • How many bits of additional information do I need to guess the next symbol? • Perplexity is the average branching factor • If message has zero information, then branching factor is 1 • If message needs one bit, branching factor is 2 • If message needs two bits, branching factor is 4 • Entropy and perplexity measure the same thing (uncertainty / information content) with different scales
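A quick numerical check of the bits-to-branching-factor correspondence above; a toy sketch, not tied to any particular corpus.

```python
import math

def entropy(dist):
    """Entropy in bits of a discrete distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def perplexity(dist):
    """Perplexity = 2**entropy, the average branching factor."""
    return 2 ** entropy(dist)

print(perplexity({"a": 1.0}))                                     # 1.0 (zero information)
print(perplexity({"a": 0.5, "b": 0.5}))                           # 2.0 (one bit)
print(perplexity({"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}))   # 4.0 (two bits)
```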
Entropy of a distribution • Start with a distribution p over events in the event space • Entropy measures the minimum number of bits necessary, on average, to encode a message, assuming the message has distribution p • Key notion – you can use shorter codes for more common messages • (If you’ve heard of Huffman coding, this is the idea behind it…)
Computing Entropy • H(p) = −Σx p(x) · log2 p(x) • p(x): expected fraction of occurrences of symbol x • −log2 p(x): ideal code length for this symbol
Entropy example • What binary code would I use to represent these symbols? • Sample: c a b a a → counts a: 3, b: 1, c: 1 • Give the most frequent symbol the shortest code, e.g. a = 0, b = 10, c = 11
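A small sketch of this example, treating the sample `cabaa` as the whole corpus and computing the ideal code length of each symbol and the entropy of the empirical distribution.

```python
import math
from collections import Counter

sample = "cabaa"
counts = Counter(sample)                     # a: 3, b: 1, c: 1
total = sum(counts.values())
probs = {sym: c / total for sym, c in counts.items()}

# Ideal code length per symbol is -log2 p(x); entropy is its expectation.
for sym, p in sorted(probs.items()):
    print(f"{sym}: p = {p:.2f}, ideal code length = {-math.log2(p):.2f} bits")

h = -sum(p * math.log2(p) for p in probs.values())
print(f"entropy = {h:.3f} bits per symbol")   # about 1.371
```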
Perplexity • Perplexity is just 2^H • If entropy H measures the # of bits per symbol… • …just exponentiate to get the branching factor
The Train/Test Split and Entropy • Before, we were computing H(p) using the true distribution p • This scores how well we’re doing only if we know the true distribution • In practice we estimate parameters on training data and evaluate on test data
Cross entropy • Estimate a distribution on the training corpus; see how well it predicts the testing corpus • Let • q be the distribution we learned from training data • w1 … wn be the test data • Then the cross entropy of the test data given the training data is: H(w1…wn; q) = −(1/n) Σi log2 q(wi) • This is the negative average log-probability • It is also the average number of bits required to encode each test-data symbol using our learned distribution
Cross entropy, formally • True distribution p, assumed distribution q • We wrote the codebook using q, but encode messages drawn from p: H(p, q) = −Σx p(x) log2 q(x) • Let p̃ be the count-based (empirical) distribution of the test data; then the cross entropy on the test data is −Σx p̃(x) log2 q(x)
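A toy check that the two formulations agree – the per-token average and the empirical-distribution sum. The learned distribution q and the test sequence are made-up values for illustration.

```python
import math
from collections import Counter

q = {"a": 0.5, "b": 0.3, "c": 0.2}   # learned (assumed) distribution
test = list("aabca")                 # toy test data

# Form 1: negative average log2-probability over the test tokens.
per_token = -sum(math.log2(q[w]) for w in test) / len(test)

# Form 2: -sum over symbols of p_tilde(x) * log2 q(x), where p_tilde is the
# count-based (empirical) distribution of the test data.
counts = Counter(test)
p_tilde = {x: c / len(test) for x, c in counts.items()}
empirical = -sum(p_tilde[x] * math.log2(q[x]) for x in p_tilde)

print(per_token, empirical)   # the two forms agree
```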
Language model perplexity • Recipe: • Train a language model on training data • Get negative log-probabilities of the test data and compute the average • Exponentiate! • Perplexity correlates rather well with: • Speech recognition error rates • MT quality metrics • Perplexities for word-based LMs are normally between, say, 50 and 1000 • You need to drop perplexity by a significant fraction (not an absolute amount) to make a visible impact
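The recipe above, sketched end to end for a unigram model on a made-up train/test split; the corpora and resulting numbers are purely illustrative.

```python
import math
from collections import Counter

train = "the cat sat on the mat the cat ate".split()
test = "the cat sat on the mat".split()

# Step 1: train a language model (relative-frequency unigram estimates).
counts = Counter(train)
total = sum(counts.values())
def q(word):
    return counts[word] / total

# Step 2: negative average log2-probability of the test data (cross entropy).
cross_entropy = -sum(math.log2(q(w)) for w in test) / len(test)

# Step 3: exponentiate to get perplexity.
perplexity = 2 ** cross_entropy
print(f"cross entropy = {cross_entropy:.3f} bits, perplexity = {perplexity:.2f}")
```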
Tasks • You are given parameters θ • You want to produce data that conforms to this distribution • This is simulation or data generation
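A minimal sketch of the simulation task, assuming a single biased-coin parameter θ; the value 0.7 is an illustrative choice, not from the slides.

```python
import random

theta = 0.7        # parameter: P(heads)
random.seed(0)     # for reproducibility of the toy example

# Simulation / data generation: draw samples that conform to the distribution.
flips = "".join("H" if random.random() < theta else "T" for _ in range(13))
print(flips)
```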
Tasks • You are given parameters θ • And observations • HHHHTHTTHHHHH • You need to answer: “How likely is this data according to the model?” • This is evaluating the likelihood function
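Evaluating the likelihood function for that observation sequence, again assuming θ = 0.7 for illustration.

```python
theta = 0.7              # assumed model parameter P(heads)
data = "HHHHTHTTHHHHH"   # 10 heads, 3 tails

# Likelihood of the observed sequence: product of the per-flip probabilities.
likelihood = 1.0
for flip in data:
    likelihood *= theta if flip == "H" else (1 - theta)

print(likelihood)        # equals theta**10 * (1 - theta)**3 for this sequence
```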
Tasks • You get observations: • HHTHTTHTHTHHTHTHTTHTHT • You need to find a set of parameters θ • This is parameter estimation
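Parameter estimation for the same coin setup; for a coin the maximum-likelihood estimate turns out to be the relative frequency of heads (see the next slides).

```python
data = "HHTHTTHTHTHHTHTHTTHTHT"

# Parameter estimation: pick theta from the observations.
# For a coin, the maximum-likelihood estimate is the relative frequency of heads.
theta_hat = data.count("H") / len(data)
print(theta_hat)   # 0.5 for this sequence (11 heads out of 22 flips)
```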
Parameter estimation • We keep talking about things like P(X | θ), a distribution with parameters θ • How do we estimate the parameters? • What’s the likelihood of these parameters?
Parameter estimation techniques • Often use the Relative Frequency Estimate • For certain distributions… • “how likely is it that I get k heads when I flip n times” (binomial distributions) • “how likely is it that I get five 6s when I roll five dice” (multinomial distributions) • …the Relative Frequency Estimate = Maximum Likelihood Estimate (MLE) • This is the set of parameters under which the data has maximum likelihood (another max!) • Formalizes your intuition from the prior slide
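A quick numerical check that the relative frequency is the maximum-likelihood estimate, using a grid search over θ for an illustrative binomial setting (7 heads in 10 flips).

```python
import math

n, k = 10, 7   # 7 heads in 10 flips (illustrative numbers)

def binomial_log_likelihood(theta):
    # Log-likelihood of k heads in n flips, dropping the constant binomial coefficient.
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

# Grid search over theta: the likelihood peaks at the relative frequency k/n.
best = max((t / 1000 for t in range(1, 1000)), key=binomial_log_likelihood)
print(best)   # 0.7 == k / n
```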
Maximum Likelihood has problems :/ • Remember: P(wi | wi−1) = count(wi−1, wi) / count(wi−1) • Two problems: • What happens if count(wi−1, wi) = 0? • We assign zero probability to an event… • Even worse, what if count(wi−1) = 0? • Division by zero is undefined!
Smoothing • Main goal: prevent zero numerators (zero probabilities) and zero denominators (division by zero) • Make a “sharp” distribution (where some outcomes have large probabilities and others have zero probability) “smoother” • The smoothest distribution is the uniform distribution • Constraint: the result should still be a distribution
Smoothing techniques • Add one (Laplace) • This can help, but it generally doesn’t do a good job of estimating what’s going on
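A small sketch of add-one smoothing over a toy vocabulary, showing how unseen outcomes move from zero probability to 1/(N+K); the data and vocabulary are made up for illustration.

```python
from collections import Counter

vocab = ["a", "b", "c", "d"]
observed = ["a", "a", "a", "b"]        # "c" and "d" were never seen
counts = Counter(observed)
N, K = len(observed), len(vocab)

def p_mle(x):
    return counts[x] / N               # zero for unseen outcomes

def p_laplace(x):
    return (counts[x] + 1) / (N + K)   # add one to every count

for x in vocab:
    print(x, p_mle(x), round(p_laplace(x), 3))
# Unseen outcomes now get probability 1/(N+K) instead of zero.
```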
Mixtures / interpolation • Say I have two distributions p and q • Pick any number λ between 0 and 1 • Then r(x) = λ·p(x) + (1 − λ)·q(x) is a distribution • Two things to show: • (a) Sums to one: Σx r(x) = λ·Σx p(x) + (1 − λ)·Σx q(x) = λ + (1 − λ) = 1 • (b) All values are ≥ 0: • p(x) ≥ 0 and q(x) ≥ 0 because they’re distributions • λ ≥ 0 and 1 − λ ≥ 0 since 0 ≤ λ ≤ 1 • So each term, and hence the sum, is non-negative, and we’re done
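A toy check that a mixture of two distributions is itself a distribution; the distributions and λ are made-up values.

```python
p = {"a": 0.7, "b": 0.2, "c": 0.1}
q = {"a": 0.1, "b": 0.3, "c": 0.6}
lam = 0.4                      # any value between 0 and 1

mix = {x: lam * p[x] + (1 - lam) * q[x] for x in p}
print(mix)
print(sum(mix.values()))       # ~1.0 (up to floating-point rounding)
```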
Laplace as a mixture • Say we have K possible outcomes and N total observations. Laplace says: • P(x) = (count(x) + 1) / (N + K) = (N / (N + K)) · P_MLE(x) + (K / (N + K)) · (1 / K) • Laplace is a mixture between MLE and uniform! • The mixture weight is determined by N and K
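A quick numerical check of this identity on the same toy counts as in the add-one sketch above (N = 4 observations, K = 4 outcomes).

```python
from collections import Counter

observed = ["a", "a", "a", "b"]
vocab = ["a", "b", "c", "d"]
counts = Counter(observed)
N, K = len(observed), len(vocab)
lam = N / (N + K)                        # mixture weight set by N and K

for x in vocab:
    laplace = (counts[x] + 1) / (N + K)
    mixture = lam * (counts[x] / N) + (1 - lam) * (1 / K)
    print(x, round(laplace, 3), round(mixture, 3))   # the two columns are identical
```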
BERP Corpus Bigrams • Original bigram probabilities
BERP Smoothed Bigrams • Smoothed bigram probabilities from the BERP corpus