This guide explores the fundamentals of language models (LMs), their training methods, and evaluation metrics like entropy and perplexity. Learn how n-gram models facilitate word prediction and the significance of smoothing techniques, including Good-Turing and Kneser-Ney. Discover real-world applications of LMs in speech recognition, as well as understanding phoneme sequences. Gain insights into the interplay between language modeling, cryptography, and information theory to enhance your comprehension of communication in AI systems.
Roadmap (for next two classes) • Review LMs • What are they? • How (and where) are they used? • How are they trained? • Evaluation metrics • Entropy • Perplexity • Smoothing • Good-Turing • Backoff and Interpolation • Absolute Discounting • Kneser-Ney
What is a language model? • Gives the probability of a message – language viewed as transmitted signals of information (Claude Shannon, Information Theory) • Lots of ties to cryptography and information theory • We most often use n-gram models
Applications • What word sequence (English) does this phoneme sequence correspond to: AY D L AY K T UW R EH K AH N AY S B IY CH • Goal of LM: P(“I’d like to recognize speech”) > P(“I’d like to wreck a nice beach”)
Why n-gram LMs? • We could just count how often each full sentence occurs… • …but language is too productive – infinitely many combinations! • Break the sentence down by word – predict each word given its history • We could just count words in their full context… • …but even those contexts get too sparse • So condition on only the last n − 1 words of history: an n-gram model
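As an illustration of the last bullet, here is a minimal sketch of a count-based bigram model (n = 2) on a toy two-sentence corpus; the corpus and the `bigram_prob` helper are placeholders invented for this example, not part of the original slides.

```python
from collections import Counter

# Toy corpus; in practice you would train on millions of sentences.
corpus = [
    "<s> i'd like to recognize speech </s>",
    "<s> i'd like to eat lunch </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Relative-frequency (MLE) estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("like", "to"))        # 1.0 in this tiny corpus
print(bigram_prob("to", "recognize"))   # 0.5
```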
Entropy and perplexity • Entropy measures the information content in a distribution == the uncertainty • If I can predict the next word before it comes, there’s no information content • Zero uncertainty means the signal has zero information • How many bits of additional information do I need to guess the next symbol? • Perplexity is the average branching factor • If message has zero information, then branching factor is 1 • If message needs one bit, branching factor is 2 • If message needs two bits, branching factor is 4 • Entropy and perplexity measure the same thing (uncertainty / information content) with different scales
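A quick numerical check of the bits-to-branching-factor correspondence above; a toy sketch, not tied to any particular corpus.

```python
import math

def entropy(dist):
    """Entropy in bits of a discrete distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def perplexity(dist):
    """Perplexity = 2**entropy, the average branching factor."""
    return 2 ** entropy(dist)

print(perplexity({"a": 1.0}))                                     # 1.0 (zero information)
print(perplexity({"a": 0.5, "b": 0.5}))                           # 2.0 (one bit)
print(perplexity({"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}))   # 4.0 (two bits)
```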
Entropy of a distribution • Start with a distribution p over events in the event space • Entropy measures the minimum number of bits necessary, on average, to encode a message, assuming the message has distribution p • Key notion – you can use shorter codes for more common messages • (If you’ve heard of Huffman coding, this is the idea behind it…)
Computing Entropy • H(p) = −Σx p(x) · log2 p(x) • p(x): expected fraction of occurrences of symbol x • −log2 p(x): ideal code length for this symbol
Entropy example • What binary code would I use to represent these symbols? • Sample: c a b a a → counts a: 3, b: 1, c: 1 • Give the most frequent symbol the shortest code, e.g. a = 0, b = 10, c = 11
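A small sketch of this example, treating the sample `cabaa` as the whole corpus and computing the ideal code length of each symbol and the entropy of the empirical distribution.

```python
import math
from collections import Counter

sample = "cabaa"
counts = Counter(sample)                     # a: 3, b: 1, c: 1
total = sum(counts.values())
probs = {sym: c / total for sym, c in counts.items()}

# Ideal code length per symbol is -log2 p(x); entropy is its expectation.
for sym, p in sorted(probs.items()):
    print(f"{sym}: p = {p:.2f}, ideal code length = {-math.log2(p):.2f} bits")

h = -sum(p * math.log2(p) for p in probs.values())
print(f"entropy = {h:.3f} bits per symbol")   # about 1.371
```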
Perplexity • Perplexity is just 2^H • If entropy H measures the # of bits per symbol… • …just exponentiate to get the branching factor
The Train/Test Split and Entropy • Before, we were computing H(p) using the true distribution p • This scores how well we’re doing only if we know the true distribution • In practice we estimate parameters on training data and evaluate on test data
Cross entropy • Estimate a distribution on the training corpus; see how well it predicts the testing corpus • Let • q be the distribution we learned from training data • w1 … wn be the test data • Then the cross entropy of the test data given the training data is: H(w1…wn; q) = −(1/n) Σi log2 q(wi) • This is the negative average log-probability • It is also the average number of bits required to encode each test-data symbol using our learned distribution
Cross entropy, formally • True distribution p, assumed distribution q • We wrote the codebook using q, but encode messages drawn from p: H(p, q) = −Σx p(x) log2 q(x) • Let p̃ be the count-based (empirical) distribution of the test data; then the cross entropy on the test data is −Σx p̃(x) log2 q(x)
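A toy check that the two formulations agree – the per-token average and the empirical-distribution sum. The learned distribution q and the test sequence are made-up values for illustration.

```python
import math
from collections import Counter

q = {"a": 0.5, "b": 0.3, "c": 0.2}   # learned (assumed) distribution
test = list("aabca")                 # toy test data

# Form 1: negative average log2-probability over the test tokens.
per_token = -sum(math.log2(q[w]) for w in test) / len(test)

# Form 2: -sum over symbols of p_tilde(x) * log2 q(x), where p_tilde is the
# count-based (empirical) distribution of the test data.
counts = Counter(test)
p_tilde = {x: c / len(test) for x, c in counts.items()}
empirical = -sum(p_tilde[x] * math.log2(q[x]) for x in p_tilde)

print(per_token, empirical)   # the two forms agree
```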
Language model perplexity • Recipe: • Train a language model on training data • Get negative log-probabilities of the test data and compute the average • Exponentiate! • Perplexity correlates rather well with: • Speech recognition error rates • MT quality metrics • Perplexities for word-based LMs are normally between, say, 50 and 1000 • You need to drop perplexity by a significant fraction (not an absolute amount) to make a visible impact
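The recipe above, sketched end to end for a unigram model on a made-up train/test split; the corpora and resulting numbers are purely illustrative.

```python
import math
from collections import Counter

train = "the cat sat on the mat the cat ate".split()
test = "the cat sat on the mat".split()

# Step 1: train a language model (relative-frequency unigram estimates).
counts = Counter(train)
total = sum(counts.values())
def q(word):
    return counts[word] / total

# Step 2: negative average log2-probability of the test data (cross entropy).
cross_entropy = -sum(math.log2(q(w)) for w in test) / len(test)

# Step 3: exponentiate to get perplexity.
perplexity = 2 ** cross_entropy
print(f"cross entropy = {cross_entropy:.3f} bits, perplexity = {perplexity:.2f}")
```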
Tasks • You are given parameters θ • You want to produce data that conforms to this distribution • This is simulation or data generation
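A minimal sketch of the simulation task, assuming a single biased-coin parameter θ; the value 0.7 is an illustrative choice, not from the slides.

```python
import random

theta = 0.7        # parameter: P(heads)
random.seed(0)     # for reproducibility of the toy example

# Simulation / data generation: draw samples that conform to the distribution.
flips = "".join("H" if random.random() < theta else "T" for _ in range(13))
print(flips)
```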
Tasks • You are given parameters θ • And observations • HHHHTHTTHHHHH • You need to answer: “How likely is this data according to the model?” • This is evaluating the likelihood function
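Evaluating the likelihood function for that observation sequence, again assuming θ = 0.7 for illustration.

```python
theta = 0.7              # assumed model parameter P(heads)
data = "HHHHTHTTHHHHH"   # 10 heads, 3 tails

# Likelihood of the observed sequence: product of the per-flip probabilities.
likelihood = 1.0
for flip in data:
    likelihood *= theta if flip == "H" else (1 - theta)

print(likelihood)        # equals theta**10 * (1 - theta)**3 for this sequence
```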
Tasks • You get observations: • HHTHTTHTHTHHTHTHTTHTHT • You need to find a set of parameters θ • This is parameter estimation
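Parameter estimation for the same coin setup; for a coin the maximum-likelihood estimate turns out to be the relative frequency of heads (see the next slides).

```python
data = "HHTHTTHTHTHHTHTHTTHTHT"

# Parameter estimation: pick theta from the observations.
# For a coin, the maximum-likelihood estimate is the relative frequency of heads.
theta_hat = data.count("H") / len(data)
print(theta_hat)   # 0.5 for this sequence (11 heads out of 22 flips)
```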
Parameter estimation • We keep talking about things like P(X | θ), a distribution with parameters θ • How do we estimate the parameters? • What’s the likelihood of these parameters?
Parameter estimation techniques • Often use the Relative Frequency Estimate • For certain distributions… • “how likely is it that I get k heads when I flip n times” (binomial distributions) • “how likely is it that I get five 6s when I roll five dice” (multinomial distributions) • …the Relative Frequency Estimate = Maximum Likelihood Estimate (MLE) • This is the set of parameters under which the data has maximum likelihood (another max!) • Formalizes your intuition from the prior slide
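A quick numerical check that the relative frequency is the maximum-likelihood estimate, using a grid search over θ for an illustrative binomial setting (7 heads in 10 flips).

```python
import math

n, k = 10, 7   # 7 heads in 10 flips (illustrative numbers)

def binomial_log_likelihood(theta):
    # Log-likelihood of k heads in n flips, dropping the constant binomial coefficient.
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

# Grid search over theta: the likelihood peaks at the relative frequency k/n.
best = max((t / 1000 for t in range(1, 1000)), key=binomial_log_likelihood)
print(best)   # 0.7 == k / n
```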
Maximum Likelihood has problems :/ • Remember: P(wi | wi−1) = count(wi−1, wi) / count(wi−1) • Two problems: • What happens if count(wi−1, wi) = 0? • We assign zero probability to an event… • Even worse, what if count(wi−1) = 0? • Division by zero is undefined!
Smoothing • Main goal: prevent zero numerators (zero probabilities) and zero denominators (division by zero) • Make a “sharp” distribution (where some outcomes have large probabilities and others have zero probability) “smoother” • The smoothest distribution is the uniform distribution • Constraint: the result should still be a distribution
Smoothing techniques • Add one (Laplace) • This can help, but it generally doesn’t do a good job of estimating what’s going on
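A small sketch of add-one smoothing over a toy vocabulary, showing how unseen outcomes move from zero probability to 1/(N+K); the data and vocabulary are made up for illustration.

```python
from collections import Counter

vocab = ["a", "b", "c", "d"]
observed = ["a", "a", "a", "b"]        # "c" and "d" were never seen
counts = Counter(observed)
N, K = len(observed), len(vocab)

def p_mle(x):
    return counts[x] / N               # zero for unseen outcomes

def p_laplace(x):
    return (counts[x] + 1) / (N + K)   # add one to every count

for x in vocab:
    print(x, p_mle(x), round(p_laplace(x), 3))
# Unseen outcomes now get probability 1/(N+K) instead of zero.
```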
Mixtures / interpolation • Say I have two distributions p and q • Pick any number λ between 0 and 1 • Then r(x) = λ·p(x) + (1 − λ)·q(x) is a distribution • Two things to show: • (a) Sums to one: Σx r(x) = λ·Σx p(x) + (1 − λ)·Σx q(x) = λ + (1 − λ) = 1 • (b) All values are ≥ 0: • p(x) ≥ 0 and q(x) ≥ 0 because they’re distributions • λ ≥ 0 and 1 − λ ≥ 0 since 0 ≤ λ ≤ 1 • So each term, and hence the sum, is non-negative, and we’re done
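A toy check that a mixture of two distributions is itself a distribution; the distributions and λ are made-up values.

```python
p = {"a": 0.7, "b": 0.2, "c": 0.1}
q = {"a": 0.1, "b": 0.3, "c": 0.6}
lam = 0.4                      # any value between 0 and 1

mix = {x: lam * p[x] + (1 - lam) * q[x] for x in p}
print(mix)
print(sum(mix.values()))       # ~1.0 (up to floating-point rounding)
```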
Laplace as a mixture • Say we have K possible outcomes and N total observations. Laplace says: • P(x) = (count(x) + 1) / (N + K) = (N / (N + K)) · P_MLE(x) + (K / (N + K)) · (1 / K) • Laplace is a mixture between MLE and uniform! • The mixture weight is determined by N and K
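A quick numerical check of this identity on the same toy counts as in the add-one sketch above (N = 4 observations, K = 4 outcomes).

```python
from collections import Counter

observed = ["a", "a", "a", "b"]
vocab = ["a", "b", "c", "d"]
counts = Counter(observed)
N, K = len(observed), len(vocab)
lam = N / (N + K)                        # mixture weight set by N and K

for x in vocab:
    laplace = (counts[x] + 1) / (N + K)
    mixture = lam * (counts[x] / N) + (1 - lam) * (1 / K)
    print(x, round(laplace, 3), round(mixture, 3))   # the two columns are identical
```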
BERP Corpus Bigrams • Original bigram probabilities
BERP Smoothed Bigrams • Smoothed bigram probabilities from the BERP corpus