
LING / C SC 439/539 Statistical Natural Language Processing



  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 13, 2/25/2013

  2. Recommended reading • Jurafsky & Martin • Chapter 4: N-gram language models and smoothing

  3. Outline • Generative probabilistic models • Language models • More smoothing • Evaluating language models • Probability and grammaticality • Programming assignment #3

  4. Generative probabilistic models • A generative probabilistic model is a model that defines a probability distribution over the outcomes of the random variables it represents • Let s be a string that a model generates • ∑s p(s) = 1.0 • Some examples of generative models: • Naïve Bayes • Language models • Hidden Markov Models • Probabilistic Context-Free Grammars

  5. Structure in generative models • When we specify independencies and conditional independencies in the probability distribution of a generative model, we are making assumptions about the statistical distribution of the data • Such assumptions may not actually be true! • Structured models can be viewed as: • Generating strings in a particular manner • Imposing a particular structure upon strings • Regardless of whether or not the strings, as a natural phenomenon, were actually generated through our model

  6. Graphical models • Generative models are often visualized as graphical models • Graphical model: shows the probability relationships in a set of random variables • Bayesian networks are an example of a graphical model • (But not all graphical models are Bayes nets or generative; will see these later)

  7. Naïve Bayes viewed as a generative model • To generate C and X1, X2, …, Xn: • Generate the class C • Then generate each Xi conditional upon the value of the class C • [Graphical model: class node C with an arrow to each of X1, X2, …, Xn]
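As a concrete picture of this generative story, here is a minimal sketch that samples (C, X1, …, Xn) in exactly that order; the class names and probability tables are made up for illustration and are not from the slides.

```python
import random

# Hypothetical parameters: p(C) and p(X_i = 1 | C) for a two-class, two-feature model.
p_class = {"spam": 0.4, "ham": 0.6}
p_feature_given_class = [
    {"spam": 0.8, "ham": 0.1},  # p(X1 = 1 | C)
    {"spam": 0.3, "ham": 0.5},  # p(X2 = 1 | C)
]

def sample_class():
    """Step 1: generate the class C from p(C)."""
    r, total = random.random(), 0.0
    for c, p in p_class.items():
        total += p
        if r < total:
            return c
    return c  # guard against floating-point rounding

def generate():
    """Step 2: generate each X_i conditioned only on the class C."""
    c = sample_class()
    xs = [1 if random.random() < p_xi[c] else 0 for p_xi in p_feature_given_class]
    return c, xs

print(generate())  # e.g. ('ham', [0, 1])
```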

  8. Common questions for generative models • Given a generative model, and a training corpus, how do we: 1. Estimate the parameters of the model from data? 2. Calculate the probability of a string that the model generates? 3. Find the most likely string that the model generates? 4. Perform classification?

  9. Answers to questions, for Naïve Bayes • Estimate the parameters of the model from data? • Count p(C) and p(X|C) for all X and C, then smooth • Calculate the probability of a string that the model generates? • Multiply factors in Naïve Bayes equation • Find the most likely string that the model generates? • (“String” = a set of features with particular values) • Select the class and feature values that maximize joint probability • Perform classification? • Select highest-probability class for a particular set of features
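A rough sketch of points 1 and 4 (parameter estimation and classification). The data format, the variable names, and the add-one smoothing choice are illustrative assumptions, not something the slide specifies.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Estimate counts for p(C) and p(X | C) from (feature_list, label) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    for features, label in examples:
        class_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
    return class_counts, feature_counts

def log_joint(features, c, class_counts, feature_counts, num_feature_values):
    """log p(C = c) + sum_i log p(X_i | C = c), with add-one smoothing."""
    score = math.log(class_counts[c] / sum(class_counts.values()))
    denom = sum(feature_counts[c].values()) + num_feature_values
    for f in features:
        score += math.log((feature_counts[c][f] + 1) / denom)
    return score

def classify(features, class_counts, feature_counts, num_feature_values):
    """Pick the class with the highest (smoothed) joint probability."""
    return max(class_counts,
               key=lambda c: log_joint(features, c, class_counts, feature_counts, num_feature_values))

examples = [(["meat", "dress"], "gossip"), (["meat", "recipe"], "cooking")]
cc, fc = train(examples)
print(classify(["meat", "dress"], cc, fc, num_feature_values=3))  # gossip
```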

  10. Outline • Generative probabilistic models • Language models • More smoothing • Evaluating language models • Probability and grammaticality • Programming assignment #3

  11. Language models • We would like to assign a probability to a sequence of words (or other types of units) • Let W = w1, w2, …, wn • Interpret this as a sequence; this is not (just) a joint distribution of N random variables • What is p(W)?

  12. Applications of language models • Machine translation: what's the most likely translation? • Que hambre tengo yo → What hunger have I / Hungry I am so / I am so hungry / Have I that hunger • Speech recognition: what's the most likely word sequence? • Recognize speech • Wreck a nice beach • Handwriting recognition • Spelling correction • POS tagging

  13. POS tagging is similar • POS tagging of a sentence: What is the most likely tag sequence? • NN VB DT NNS • NN NN DT NNS • Let T = t1, t2, …, tn • What is P(T)? • POS tag model: like a language model, but defined over POS tags

  14. Language modeling • Language modeling is the specific task of predicting the next word in a sequence • Given a sequence of n-1 words, what is the most likely next word? • argmax over wn of p(wn | w1, w2, …, wn-1)
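With the Markov assumption introduced a few slides below, this argmax reduces to scoring a short context. A small sketch using a bigram context; the counts are made up.

```python
from collections import Counter

def predict_next(prev_word, bigram_counts):
    """argmax_w p(w | prev_word) estimated from bigram counts:
    the most frequent word observed after prev_word."""
    continuations = Counter({w2: c for (w1, w2), c in bigram_counts.items()
                             if w1 == prev_word})
    if not continuations:
        return None  # unseen context: a real system would smooth or back off
    return continuations.most_common(1)[0][0]

bigram_counts = Counter({("meat", "dress"): 3, ("meat", "pie"): 1, ("her", "meat"): 2})
print(predict_next("meat", bigram_counts))  # dress
```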

  15. Language modeling software • CMU-Cambridge Statistical Language Modeling toolkit http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html • SRI Language Modeling toolkit http://www.speech.sri.com/projects/srilm/

  16. Calculate the probability of a sentence W • Example: • Let W = Lady Gaga wore her meat dress today • Use a corpus to estimate probability • p(W) = count(W) / # of sentences in corpus • [Photo: Lady Gaga's meat dress — http://assets.nydailynews.com/polopoly_fs/1.441573!/img/httpImage/image.jpg]

  17. Problem: sparse data • Zero probability sentence • Brown Corpus (~50,000 sentences) does not contain the sentence “Lady Gaga wore her meat dress today” • Even Google does not find this sentence. • However, it’s a perfectly fine sentence • There must be something wrong with our probability estimation method

  18. Intuition: count shorter sequences • Although we don’t see “Lady Gaga wore her meat dress today” in a corpus, we can find substrings: • Lady • Lady Gaga • wore • her meat dress • Lady Gaga wore her meat dress • Lady Gaga wore • wore her • dress today

  19. p(W): apply chain rule • W = w1, …, wn • p(w1, …, wn) = p(w1, ..., wn-1) * p(wn | w1, ..., wn-1) • p(w1, …, wn) = p(w1) * p(w2 | w1) * p(w3 | w1, w2) * … * p(wn | w1, ..., wn-1)

  20. Estimate probabilities from corpus • Let C = “count” • p(w1) = C(w1) / # of words in corpus • p(w2 | w1) = C(w1, w2) / C(w1) • p(w3 | w1, w2) = C(w1, w2, w3) / C(w1, w2) • p(wn | w1, ..., wn-1) = C(w1, ..., wn) / C(w1,..., wn-1)
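These are relative-frequency (maximum likelihood) estimates read directly off corpus counts. A minimal sketch over a toy corpus; the corpus itself is made up.

```python
from collections import Counter

corpus = "lady gaga wore her meat dress today she likes her meat dress".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_unigram(w):
    # p(w1) = C(w1) / # of words in corpus
    return unigram_counts[w] / len(corpus)

def p_bigram(w2, w1):
    # p(w2 | w1) = C(w1, w2) / C(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_unigram("meat"))          # 2/12
print(p_bigram("dress", "meat"))  # 2/2 = 1.0
```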

  21. This isn’t any easier • By applying the chain rule, we reduce the calculation to include counts for short sequences, such as C(w1) and C(w1, w2) • But long sequences remain, such as C(w1, ..., wn), which was in the original computation we wanted to perform! p(W) = C(W) / # of sentences in corpus = C(w1, ..., wn) / # of sentences in corpus

  22. Solution: Markov assumption • Markov assumption: limited context • The previous N items matter in the determination of the current item, rather than the entire history • Example: current word depends on previous word • Let p( wn | w1, ..., wn-1 ) = p( wn | wn-1 ) • Under this model: p( today | Lady Gaga wore her meat dress ) = p( today | dress )

  23. Markov assumption is an example of conditional independence • In an Nth-order Markov model, N is the amount of previous context • Applied to language models: • The current word is conditionally dependent upon the previous N words, but conditionally independent of all words previous to those • p( wi | w1, ..., wi-1) = p( wi | wi-N, ..., wi-1)

  24. 0th-order language model • Also called a unigram model • Let p( wn | w1, ..., wn-1) = p( wn ) • Zero context generation • Each word is generated independently of others • As a graphical model (all variables are independent): [nodes w1, w2, …, wn with no edges]

  25. 1st-order language model • Also called a bigram model • Let p( wn | w1, ..., wn-1) = p( wn | wn-1 ) • As a graphical model: [chain w1 → … → wn-1 → wn]

  26. 2nd-order language model • Also called a trigram model • Let p( wn | w1, ..., wn-1) = p( wn | wn-2, wn-1 ) • As a graphical model: [each wn has incoming edges from wn-2 and wn-1]

  27. Initial items in sequence • In an Nth-order model, the first N elements have fewer than N predecessors, so they can only be conditioned on the elements generated so far • For example, this doesn’t make sense: • Under a trigram model: p(w0) = p( w0 | w-2, w-1 ) • Trigram model, correctly: • p(w0) = p( w0 ) • p(w1) = p( w1 | w0 ) • p(w2) = p( w2 | w0, w1 )

  28. p(W) under the different language models • Unigram model: p(Lady Gaga wore her meat dress today) = p(Lady) * p(Gaga) * p(wore) * p(her) * p(meat) * p(dress) * p(today) • Bigram model: p(Lady Gaga wore her meat dress today) = p(Lady) * p(Gaga|Lady) * p(wore|Gaga) * p(her|wore) * p(meat|her) * p(dress|meat) * p(today|dress) • Trigram model: p(Lady Gaga wore her meat dress today) = p(Lady) * p(Gaga|Lady) * p(wore|Lady Gaga) * p(her|Gaga wore) * p(meat|wore her) * p(dress|her meat) * p(today|meat dress)
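These factorizations can be produced mechanically for any order. This sketch just lists the (context, word) factors an order-N model multiplies together; the sentence is the one from the slide.

```python
def ngram_factors(words, order):
    """List the (context, word) factors an order-N Markov model multiplies:
    order=0 -> unigram, order=1 -> bigram, order=2 -> trigram."""
    return [(tuple(words[max(0, i - order):i]), w) for i, w in enumerate(words)]

sent = "Lady Gaga wore her meat dress today".split()
for context, w in ngram_factors(sent, order=2):
    print(f"p({w} | {' '.join(context)})" if context else f"p({w})")
# p(Lady)  p(Gaga | Lady)  p(wore | Lady Gaga)  p(her | Gaga wore)  ...
```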

  29. Summary of language models • Nth-order Markov assumption: • p( wi | w1,...,wi-1) = p( wi | wi-N, ..., wi-1) • Count occurrences of length-N word sequences (N-grams) in a corpus • Model for joint probability of sequence:

  30. p(w1, …, wn) = p(w1) * p(w2 | w1) * p(w3 | w1, w2) * … * p(wn | w1, ..., wn-1) • Example, 1st-order model: • p(w1, …, wn) = p(w1) * p(w2 | w1) * p(w3 | w2) * … * p(wn | wn-1)

  31. Toy example of a language model • Probability distributions in a bigram language model, counted from a training corpus. Note that each conditional distribution sums to 1: ∑w p(w | context) = 1.0 • p(a) = 1.0 • p(b|a) = 0.7, p(c|a) = 0.3 • p(d|b) = 1.0 • p(e|c) = 0.6, p(f|c) = 0.4 • A language model imposes a probability distribution over all strings it generates. This model generates {abd, ace, acf}: p(a,b,d) = 1.0 * 0.7 * 1.0 = 0.7 p(a,c,e) = 1.0 * 0.3 * 0.6 = 0.18 p(a,c,f) = 1.0 * 0.3 * 0.4 = 0.12 ∑W p(W) = 1.0
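The toy model's numbers can be checked directly; a small sketch reproducing the three strings and their probabilities.

```python
# Bigram probabilities from the toy model above; p(a) plays the role of the start probability.
p_start = {"a": 1.0}
p_bigram = {("a", "b"): 0.7, ("a", "c"): 0.3, ("b", "d"): 1.0, ("c", "e"): 0.6, ("c", "f"): 0.4}

def p_string(s):
    """Multiply the start probability and the bigram probabilities along the string."""
    prob = p_start[s[0]]
    for prev, cur in zip(s, s[1:]):
        prob *= p_bigram[(prev, cur)]
    return prob

for s in ["abd", "ace", "acf"]:
    print(s, p_string(s))                               # 0.7, 0.18, 0.12
print(sum(p_string(s) for s in ["abd", "ace", "acf"]))  # 1.0 (up to floating point)
```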

  32. Initial items in sequence: alternative • Add sentence boundary markers to training corpus. Example: <s> <s> Lady Gaga is rich . <s> <s> She likes to wear meat . <s> <s> But she does not eat meat . <s> <s> • Generation: begin with first word conditional upon context consisting entirely of sentence boundary markers. Example, trigram model: p(w0) = p( w0 | <s> <s> ) p(w1) = p( w1 | <s> w0 ) p(w2) = p( w2 | w0, w1 )
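A sketch of the padding step for a trigram model, as described above; whether to also predict an end-of-sentence marker is left out here.

```python
def pad_and_extract(sentences, order=2):
    """Prefix each sentence with `order` boundary markers and list the
    (context, word) events a trigram model is trained on / generates from."""
    events = []
    for sent in sentences:
        padded = ["<s>"] * order + sent
        for i in range(order, len(padded)):
            events.append((tuple(padded[i - order:i]), padded[i]))
    return events

sents = [["Lady", "Gaga", "is", "rich", "."],
         ["She", "likes", "to", "wear", "meat", "."]]
for context, w in pad_and_extract(sents)[:3]:
    print(f"p({w} | {' '.join(context)})")
# p(Lady | <s> <s>)  p(Gaga | <s> Lady)  p(is | Lady Gaga)
```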

  33. Trade-offs in choice of N • Higher N: • Longer units, better approximation of a language • Sparse data problem is worse • Lower N: • Shorter units, worse approximation of a language • Sparse data problem is not as bad

  34. N-gram approximations of English • We can create a Markov (i.e., N-gram) approximation of English by randomly generating sequences according to a language model • As N grows, looks more like the original language • This was realized a long time ago: • Claude Shannon, 1948 • Invented information theory • Frederick Damerau • Ph.D., Yale linguistics, 1966 • Empirical Investigation of Statistically Generated Sentences

  35. Shannon: character N-gram approximations of English (though he uses “N” to mean “N-1”) • [Example outputs omitted: 0th-order, 1st-order, and 2nd-order character models]

  36. Damerau: 0th-order word model (lines show grammatical sequences)

  37. Damerau: 5th-order word model (lines show grammatical sequences)

  38. English can be “similar” to a Markov process: the style of actual patent claims

  39. Need to deal with sparse data • Data sparsity grows with higher N • Many possible N-grams are non-existent in corpora, even for small N • “Lady Gaga wore her rutabaga dress today” • Google count of “her rutabaga dress” is zero

  40. Zero counts cause problems for language models • If any term in the probability equation is zero, the probability of the entire sequence is zero • Toy example: bigram model p(w0, w1, w2, w3) = p(w0) * p(w1|w0) * p(w2|w1) * p(w3|w2) = 0.01 * 0.3 * 0 * 0.04 = 0 • Need to smooth
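The effect is easy to see in code: a single zero factor wipes out the whole product. A minimal illustration with the made-up numbers from the slide.

```python
import math

# p(w0), p(w1|w0), p(w2|w1), p(w3|w2) -- one unseen bigram gives a zero factor
factors = [0.01, 0.3, 0.0, 0.04]

print(math.prod(factors))  # 0.0: the whole sentence gets probability zero
```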

  41. Outline • Generative probabilistic models • Language models • More smoothing • Evaluating language models • Probability and grammaticality • Programming assignment #3

  42. Smoothing methods to be covered • Previously: • Add-one smoothing • Deleted estimation • Good-Turing smoothing • Today: • Witten-Bell smoothing • Backoff smoothing • Interpolated backoff

  43. How do we treat novel N-grams? • Simple methods that assign equal probability to all zero-count N-grams: • Add-one smoothing • Deleted estimation • Good-Turing smoothing • Assign differing probability to zero-count N-grams: • Witten-Bell smoothing • Backoff smoothing • Interpolated backoff

  44. 4. Witten-Bell smoothing • Key idea: a zero-frequency N-gram is an event that hasn’t happened yet • If p(wi | wi-k, …, wi-1) = 0, then the estimate pWB(wi | wi-k, …, wi-1) is higher if wi-k, …, wi-1 occurs with many different wi • Called “diversity smoothing”

  45. Witten-Bell smoothing • If p(wi | wi-k, …, wi-1) = 0, then the estimate pWB(wi | wi-k, …, wi-1) is higher if wi-k, …, wi-1 occurs with many different wi • Example: compare these two cases • p(C|A,B) = 0 and ABA, ABB, ABD, ABE, ABF have nonzero counts • p(Z|X,Y) = 0 and XYA, XYB have nonzero counts • We would expect that the smoothed estimate of p(C|A,B) should be higher than the smoothed estimate of p(Z|X,Y)

  46. Witten-Bell smoothing for bigrams • Let’s smooth the bigram estimate p(wi | wi-1) • T(wi-1) is the number of different words (types) that occur to the right of wi-1 • N(wi-1) is the number of all word occurrences (tokens) to the right of wi-1 • If c(wi-1, wi) = 0: pWB(wi | wi-1) = T(wi-1) / ( N(wi-1) + T(wi-1) ), i.e., # of types of bigrams starting with wi-1, divided by (# of tokens of wi-1 + # of types of bigrams starting with wi-1)

  47. Witten-Bell Smoothing • Unsmoothed: p(wi | wi-1) = c(wi-1, wi) / c(wi-1) • Smoothed: • If c(wi-1, wi) = 0: pWB(wi | wi-1) = T(wi-1) / ( N(wi-1) + T(wi-1) ), the mass reserved for unseen continuations of wi-1 • If c(wi-1, wi) > 0: pWB(wi | wi-1) = c(wi-1, wi) / ( N(wi-1) + T(wi-1) ) • Takes probability mass away from non-zero-count items
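A sketch of Witten-Bell estimation for bigrams. The formula for seen bigrams follows the slide above; splitting the reserved mass evenly over the Z unseen continuation types, and the uniform fallback for a never-seen context, are assumptions in the spirit of the Jurafsky & Martin presentation rather than something the slides state.

```python
from collections import Counter, defaultdict

def witten_bell_bigram(bigram_counts, vocab):
    """Return p(w | prev) smoothed with Witten-Bell.
    T(prev) = # of distinct continuation types, N(prev) = # of continuation tokens."""
    cont = defaultdict(Counter)
    for (w1, w2), c in bigram_counts.items():
        cont[w1][w2] += c

    def p(w, prev):
        T = len(cont[prev])
        N = sum(cont[prev].values())
        if T == 0:
            return 1.0 / len(vocab)   # context never seen: uniform fallback (assumption)
        c = cont[prev][w]
        if c > 0:
            return c / (N + T)        # discounted estimate for a seen bigram
        Z = len(vocab) - T            # number of unseen continuation types
        return T / (Z * (N + T))      # reserved mass T/(N+T), split evenly over unseen types
    return p

vocab = {"the", "baby", "zygote", "see"}
counts = Counter({("the", "baby"): 3, ("the", "see"): 1})
p = witten_bell_bigram(counts, vocab)
print(p("baby", "the"), p("zygote", "the"))  # 0.5 and ~0.167
print(sum(p(w, "the") for w in vocab))       # sums to 1.0 over the vocabulary
```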

  48. 5. Backoff smoothing • Consider p(zygote | see the) vs. p(baby | see the) • Suppose these trigrams both have zero counts: see the baby, see the zygote • And we have that: • Unigram: p(baby) > p(zygote) • Bigram: p(the baby) > p(the zygote) • Trigram: we would expect that p(see the baby) > p(see the zygote), i.e., p(baby | see the) > p(zygote | see the)

  49. Backoff smoothing • Hold out probability mass for novel events • But divide up unevenly, in proportion to the backoff probability • Unlike add-one, deleted estimation, Good-Turing • For p(Z|X, Y), the backoff probability is p(Z|Y) • For p(Z|Y), the backoff probability is p(Z)

  50. Backoff smoothing: details • For p(Z|X, Y), the backoff probability is p(Z|Y) • Novel events are types Z that were never observed after X,Y • For p(Z|Y), the backoff probability is p(Z) • Novel events are types Z that were never observed after Y • Even if Z was never observed after X,Y, it may have been observed after the shorter, more frequent context Y. • Then p(Z|Y) can be estimated without further backoff. If not, we back off further to p(Z). • For p(Z), the backoff probability for novel Z can be assigned using other methods
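As a rough illustration of falling back to shorter contexts, here is a sketch in the style of "stupid backoff": a fixed weight rather than the properly reserved probability mass described above, so the scores are not true probabilities, and the Katz-style redistribution the slides imply is more involved. The counts are made up.

```python
from collections import Counter

def backoff_score(w, context, tri, bi, uni, total_tokens, alpha=0.4):
    """Score p~(w | x, y): use the trigram estimate if seen, otherwise back off
    to the bigram, then to the unigram, discounting by alpha at each step."""
    x, y = context
    if tri[(x, y, w)] > 0:
        return tri[(x, y, w)] / bi[(x, y)]
    if bi[(y, w)] > 0:
        return alpha * bi[(y, w)] / uni[y]
    return alpha * alpha * uni[w] / total_tokens

uni = Counter({"see": 3, "the": 5, "baby": 2, "zygote": 1})
bi = Counter({("see", "the"): 2, ("the", "baby"): 2})
tri = Counter()  # neither "see the baby" nor "see the zygote" was observed
total = sum(uni.values())
print(backoff_score("baby", ("see", "the"), tri, bi, uni, total))    # backs off to p(baby | the)
print(backoff_score("zygote", ("see", "the"), tri, bi, uni, total))  # backs off to p(zygote)
```

This reproduces the intuition from slide 48: both trigrams are unseen, but the backed-off score still prefers baby over zygote.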
