
Language Modeling with N-Grams



  1. Language Modeling with N-Grams

  2. Language Modeling • A Language Model is a probabilistic model that allows us to compute the probability of a sentence. • Let w1:n denote the word sequence w1w2…wn. • What is the probability P(w1:n)?

  3. Why Language Modeling? • Determine which sequence of words is more likely • Predicting the next word given the previous words • Shannon game: • guessing the next letter given previous letters. • Applications in • Speech Recognition • Machine Translation • Context sensitive spelling check

  4. Language Modeling in Speech Recognition • Some sequences of words sound alike, but not all of them are good English sentences. • I went to a party • Eye went two a bar tea • Rudolph the red nose reindeer. • Rudolph the Red knows rain, dear. • Rudolph the Red Nose reigned here.

  5. Language Modeling in Machine Translation • Given a French sentence • On voit Jon à la télévision • And several possible English translations: • Jon appeared in TV. • In Jon appeared TV. • Jon appeared on TV. • Which one is more likely to be correct?

  6. Context Sensitive Spelling • Which is most probable? • … I think they’re okay … • … I think there okay … • … I think their okay … • Which is most probable? • … by the way, are they’re likely to … • … by the way, are there likely to … • … by the way, are their likely to …

  7. Axioms of Probability Theory • Suppose P(.) is a probability function, then 1. for any event E, 0 ≤ P(E) ≤ 1. 2. P(S) = 1, where S is the sample space. 3. for any two mutually exclusive events E1 and E2, P(E1 ∪ E2) = P(E1) + P(E2) • Any function that satisfies the above three axioms is a probability function.

  8. Properties of Probability 1. P(¬E) = 1 – P(E) 2. If E1 and E2 are logically equivalent, then P(E1) = P(E2). • E1: Not all philosophers are more than six feet tall. • E2: Some philosopher is not more than six feet tall. Then P(E1) = P(E2). 3. P(E1, E2) ≤ P(E1).

  9. Conditional Probability • The probability of an event may change after knowing another event. The probability of A given B is denoted by P(A|B). • Example • P(W = space): the probability that a randomly selected word from an English text is ‘space’ • P(W = space | W’ = outer): the probability of ‘space’ given that the previous word is ‘outer’

  10. Chain Rule and Bayes Theorem • Chain Rule: P(A, B) = P(A)P(B|A) • Bayes Theorem: If P(E2) > 0, then P(E1|E2) = P(E2|E1)P(E1)/P(E2). This can be derived from the definition of conditional probability.
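
A one-line derivation (not spelled out on the slide): applying the chain rule in both orders gives P(E1, E2) = P(E1)P(E2|E1) = P(E2)P(E1|E2); dividing both sides by P(E2) > 0 yields P(E1|E2) = P(E2|E1)P(E1)/P(E2).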

  11. The n-gram Language Model Using the Chain Rule: P(A, B) = P(A)P(B|A) P(w1:n) = P(w1:n-1)P(wn|w1:n-1) = P(w1:n-2)P(wn-1|w1:n-2)P(wn|w1:n-1) = P(w1:n-3)P(wn-2|w1:n-3)P(wn-1|w1:n-2)P(wn|w1:n-1) = P(w1)P(w2|w1)P(w3|w1:2)P(w4|w1:3) … P(wn-1|w1:n-2)P(wn|w1:n-1) Can we compute P(w1:n) in the reverse order?

  12. Markov Assumption • w1:n-1 is called the history of wn • Sue swallowed the large green ______. • The statistics for the complete history are very sparse. • Markov Assumption: only the closest N − 1 words are relevant: P(wn|w1:n-1) ≈ P(wn|wn-N+1:n-1) • Bigram (N = 2): only the previous one word matters • Trigram (N = 3): only the previous two words matter • Therefore P(w1:n) ≈ ∏k=1..n P(wk|wk-N+1:k-1)

  13. Examples: • Without Markov Assumption: • P(I went to a party) = ? • With Markov Assumption (n=3) • P(I went to a party) = ? • With Markov Assumption (n=2) • P(I went to a party) = ? • What does n=1 mean?
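
A possible expansion of these probabilities, ignoring sentence-boundary markers such as <s> and </s>:
• Without the Markov assumption: P(I) P(went|I) P(to|I went) P(a|I went to) P(party|I went to a)
• Trigram (n = 3): P(I) P(went|I) P(to|I went) P(a|went to) P(party|to a)
• Bigram (n = 2): P(I) P(went|I) P(to|went) P(a|to) P(party|a)
• n = 1 is the unigram model: P(I) P(went) P(to) P(a) P(party), i.e., every word is treated as independent of its history.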

  14. Parameters in N-gram Models • Suppose there are 20,000 words • very conservative assumption • Parameters • Bigram: 20,000 × 19,999 ≈ 400 million • Trigram: 20,000² × 19,999 ≈ 8 trillion • 4-gram: 20,000³ × 19,999 ≈ 1.6 × 10¹⁷ • Reliability vs. Relevance • as N increases, the n-gram model becomes more relevant, but less reliable.

  15. Estimation of Probability • P(wn | w1:n-1) = P(w1:n)/P(w1:n-1) • Probabilities (subjective/objective) exist independent of data. • However, probabilities have to be estimated from data. • Maximum Likelihood Estimation • PMLE(wn | w1:n-1) = C(w1:n)/C(w1:n-1)
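
A minimal sketch of maximum likelihood estimation for a bigram model (N = 2), assuming a tokenized training corpus; the function and variable names are illustrative, not from the slides:

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Count bigrams and their one-word histories, so that
    P_MLE(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})."""
    bigram_counts = Counter()
    history_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            bigram_counts[(prev, curr)] += 1
            history_counts[prev] += 1
    return bigram_counts, history_counts

def p_mle(w, history, bigram_counts, history_counts):
    """MLE conditional probability; unseen bigrams get 0.0,
    which is exactly the weakness discussed on the next slides."""
    if history_counts[history] == 0:
        return 0.0
    return bigram_counts[(history, w)] / history_counts[history]

# Toy corpus from slide 16: <s> a b a b </s>
bigrams, histories = train_bigram_mle([["a", "b", "a", "b"]])
print(p_mle("a", "b", bigrams, histories))  # 0.5
print(p_mle("b", "a", bigrams, histories))  # 1.0
```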

  16. Maximum Likelihood Estimation • MLE assigns the highest probability to the training data. • Example: • training corpus: <s> a b a b </s> • MLE: P(a|b) = ½, P(b|a) = 1, P(a|<s>) = 1, P(</s>|b) = ½, P(corpus) = 1/2. • MLE is not suitable for NLP • MLE assigns 0 probability to unseen events. • One experiment showed that, after training on 1.5M words, 23% of trigrams in new text had not been seen before.

  17. How to Estimate • p(z | xy) = ? • Suppose our training data includes … xya … … xyd … … xyd … but never xyz • Should we conclude p(a | xy) = 1/3? p(d | xy) = 2/3? p(z | xy) = 0/3? • NO! Absence of xyz might just be bad luck.

  18. Smoothing the Estimates • Should we conclude • p(a | xy) = 1/3? reduce this • p(d | xy) = 2/3? reduce this • p(z | xy) = 0/3? increase this • Discount the positive counts somewhat • Reallocate that probability to the zeroes

  19. Especially if the denominator is small … • 1/3 probably too high, 100/300 probably about right • Especially if the numerator is small … • 1/300 probably too high, 100/30000 probably about right

  20. Dealing with 0 Probability • Back-off • If the frequency count of the N-gram is 0, use the (N−1)-gram instead • Smoothing • Mix the MLE estimate with another probability distribution that is guaranteed not to give 0 probability.
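
A minimal sketch of the back-off idea for bigrams, reusing the count tables from the MLE sketch above (names are illustrative; note that without discounting the seen n-grams this is not a properly normalized distribution, which is what schemes such as Katz back-off address):

```python
def p_backoff(w, history, bigram_counts, history_counts, unigram_counts, n_tokens):
    """If the bigram (history, w) was seen, use its MLE estimate;
    otherwise back off to the unigram MLE estimate of w."""
    if bigram_counts.get((history, w), 0) > 0:
        return bigram_counts[(history, w)] / history_counts[history]
    return unigram_counts.get(w, 0) / n_tokens
```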

  21. Example count tables • Unigram counts (438,699 tokens in total), e.g.: DNS 298, DNS/WINS 2, dns1.isp.net 1, dnsadmin.exe 2, DNSName 1, DNSServer 1, do 384, NT 3313, pertinent 2, pervasiveness 1, Ph33r 3, phase 24, phased 1, phone 60, Phonebook 23, phrase 9, phrases 2, physical 123, PhysicalDisk 1, … • Bigram counts of words following “do” (C(do) = 384), e.g.: anything 2, approach 1, for 5, have 5, I 1, If 4, Link 1, list 1, no 1, not 97, Novell 1, offer 1, workitem 1, you 7, your 1, … • Trigram counts of words following “they do” (C(they, do) = 22): . 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5 Courtesy of Patrick Pantel

  22. MLE estimates from the count tables • C(they, do, not) = 7, C(they, do) = 22, C(do, not) = 97, C(do) = 384 • PMLE(not | they, do) = 7/22 = 0.318 • PMLE(not | do) = 97/384 = 0.253 • PMLE(offer | they, do) = 0/22 = 0 • PMLE(have | they, do) = 2/22 = 0.091 Courtesy of Patrick Pantel

  23. Add-One Smoothing • V is the number of types we might see • the vocabulary size (unique words) • Add-One Smoothing (+1): P+1(wn | wn-N+1:n-1) = (C(wn-N+1:n) + 1) / (C(wn-N+1:n-1) + V) • Too much mass is reserved for 0-frequency N-grams • the value “1” added to every N-gram count is arbitrarily picked Courtesy of Patrick Pantel

  24. Add-One example • Vocabulary size V = 10,543; C(they, do) = 22 • P+1(not | they, do) = (7 + 1)/(22 + 10,543) ≈ 0.00076 • P+1(offer | they, do) = (0 + 1)/(22 + 10,543) ≈ 0.000095 • P+1(have | they, do) = (2 + 1)/(22 + 10,543) ≈ 0.00028 • Compare with PMLE(not | they, do) = 0.318: a large amount of probability mass has been shifted to the unseen words. Courtesy of Patrick Pantel
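
A minimal sketch of add-one smoothing, reusing the bigram count tables from the MLE sketch above (names are illustrative):

```python
def p_add_one(w, history, bigram_counts, history_counts, vocab_size):
    """Add-one (Laplace) smoothing:
    P_+1(w | history) = (C(history, w) + 1) / (C(history) + V)."""
    return ((bigram_counts.get((history, w), 0) + 1)
            / (history_counts.get(history, 0) + vocab_size))

# With the slide's counts C(they do, not) = 7, C(they do) = 22 and V = 10,543,
# the same formula gives (7 + 1) / (22 + 10543) ≈ 0.00076.
```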

  25. Witten-Bell Discounting (unigrams) • T is the number of types in the training corpus (T < V) • N is the number of tokens in the training corpus • Idea: use the count of first-time events to estimate unseen events • each of the T types was seen for the first time exactly once, so a “new word” event occurred T times Courtesy of Patrick Pantel

  26. Witten-Bell Discounting (unigrams) • Total mass reserved for all 0-frequency words: T / (N + T) • Where does this mass come from? Each seen word is discounted from C(w)/N to C(w)/(N + T) • Z = number of 0-frequency words = V – T • each of the Z unseen words receives T / (Z (N + T)) Courtesy of Patrick Pantel

  27. Witten-Bell Discounting (N-grams) • Condition T, N and Z on the N-gram context • the unseen N-gram estimate is specific to a word history (context) h • T(h) is the number of N-gram types with the given context • N(h) is the number of N-gram tokens with the given context • Z(h) is the number of 0-frequency N-grams with the given context Courtesy of Patrick Pantel

  28. Context-conditioned counts from the tables • N(they, do) = 22, T(they, do) = 9 • N(do) = 384, T(do) = 81 • N() = 438,699, T() = 10,543 Courtesy of Patrick Pantel

  29. Witten-Bell Discounting (N-grams) • For N-grams with non-zero frequency: PWB(wn | h) = C(h, wn) / (N(h) + T(h)) • Mass reserved for 0-frequency N-grams with context h: T(h) / (N(h) + T(h)) • For 0-frequency N-grams: PWB(wn | h) = T(h) / (Z(h) (N(h) + T(h))) Courtesy of Patrick Pantel

  30. Witten-Bell example • For the context “they do”: T = 9, N = 22, Z = V − T = 10,543 − 9 = 10,534 • PWB(not | they, do) = 7/(22 + 9) ≈ 0.226 • PWB(have | they, do) = 2/(22 + 9) ≈ 0.065 • PWB(offer | they, do) = 9/(10,534 × (22 + 9)) ≈ 0.000028 • Total N-gram types in the corpus: unigrams 10,543, bigrams 114,707, trigrams 256,844 Courtesy of Patrick Pantel
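
A minimal sketch of Witten-Bell discounting for bigrams, using the formulas above together with the bigram count tables from the MLE sketch; the names and the choice Z(h) = V − T(h) are assumptions made for illustration:

```python
def p_witten_bell(w, history, bigram_counts, history_counts, vocab_size):
    """P_WB(w | h) = C(h, w) / (N(h) + T(h)) if C(h, w) > 0,
    otherwise T(h) / (Z(h) * (N(h) + T(h))) with Z(h) = V - T(h)."""
    n_h = history_counts.get(history, 0)                 # N(h): tokens with this context
    t_h = sum(1 for (h, _), c in bigram_counts.items()   # T(h): distinct continuations seen
              if h == history and c > 0)
    if n_h + t_h == 0:
        return 1.0 / vocab_size                          # unseen context: fall back to uniform
    c = bigram_counts.get((history, w), 0)
    if c > 0:
        return c / (n_h + t_h)
    z_h = vocab_size - t_h                               # Z(h): unseen continuations
    return t_h / (z_h * (n_h + t_h))
```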

  31. Good-Turing Estimation • Adjusted count: fr = (r + 1) Nr+1 / Nr • where • r = C(w1, …, wn) • Nr = the number of n-grams that occurred r times • This should only be used when r is small.

  32. Example • Corpus: a b a b • Observed bigrams: • b a: 1 • a b: 2 • N0 = 2, N1 = 1, N2 = 1, N = 3 • Adjusted counts: • f0 = N1/N0 = 0.5 • f1 = 2 N2/N1 = 2 • f2 would require N3, which is 0 here; this is one reason the estimate is only used for small r.
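
A minimal sketch of the Good-Turing adjusted counts for this toy corpus (variable names are illustrative; practical implementations also smooth the Nr values themselves):

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts, num_unseen):
    """f_r = (r + 1) * N_{r+1} / N_r, where N_r is the number of n-gram
    types seen exactly r times and N_0 = num_unseen."""
    n_r = Counter(ngram_counts.values())
    n_r[0] = num_unseen
    adjusted = {}
    for r in sorted(n_r):
        if n_r.get(r + 1, 0) == 0:
            continue  # undefined when N_{r+1} = 0; only small r is used in practice
        adjusted[r] = (r + 1) * n_r[r + 1] / n_r[r]
    return adjusted

# Toy corpus "a b a b": bigrams {a b: 2, b a: 1}; 2 of the 4 possible bigrams are unseen
counts = Counter({("a", "b"): 2, ("b", "a"): 1})
print(good_turing_adjusted_counts(counts, num_unseen=2))  # {0: 0.5, 1: 2.0}
```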

  33. Backing off • Estimate the probability with a linear combination of lower-order estimates, which are less likely to be 0. • Simple linear interpolation: Pinterp(wn | wn-2, wn-1) = λ1 PMLE(wn | wn-2, wn-1) + λ2 PMLE(wn | wn-1) + λ3 PMLE(wn), where λ1 + λ2 + λ3 = 1
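
A minimal sketch of simple linear interpolation over trigram, bigram, and unigram MLE estimators (the estimator arguments and the λ values are illustrative; in practice the λ weights are tuned on held-out data):

```python
def p_interpolated(w, w1, w2, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """P_interp(w | w1, w2) = l1*P_MLE(w | w1, w2) + l2*P_MLE(w | w2) + l3*P_MLE(w).
    The lambdas must sum to 1 for the result to be a proper distribution."""
    l1, l2, l3 = lambdas
    return l1 * p_tri(w, w1, w2) + l2 * p_bi(w, w2) + l3 * p_uni(w)
```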

  34. Evaluation of Language Models • Best method: • Use the language model in an application, e.g., spelling check, machine translation, speech recognition, … • Perplexity: the language model that assigns a higher probability to the test data is better; perplexity is the inverse of that probability, normalized by the number of words, so a lower perplexity is better.
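
A minimal sketch of computing perplexity on test data, assuming a model exposed as a function p_next(w, history) that returns a smoothed conditional probability (any zero probability makes the perplexity infinite):

```python
import math

def perplexity(sentences, p_next):
    """Perplexity = exp(-(1/M) * sum of log P(w_k | history)) over the M
    predicted tokens of the test data; lower is better."""
    log_prob, m = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(tokens)):
            log_prob += math.log(p_next(tokens[i], tokens[:i]))
            m += 1
    return math.exp(-log_prob / m)
```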
