Presentation Transcript


  1. Natural Language Processing • 096260  An Empirical Study of Smoothing Techniques for Language Modeling. Stanley F. Chen, Joshua Goodman. Computer Science Group, Harvard University, Cambridge, Massachusetts. July 24, 1998. TR-10-98

  2. Language Model A language model gives the probability of any word appearing as the next word in a text, given the words that precede it.

  3. MLE ( maximum likelihood estimate)
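
The formula on this slide did not survive extraction. The standard maximum-likelihood estimate it refers to, written in LaTeX for the general n-gram case (bigram: n = 2), is:

    p_{ML}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i})}{\sum_{w_i} c(w_{i-n+1}^{i})}

i.e., the count of the full n-gram divided by the total count of its history.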

  4. MLE (bigram) JOHN READS MOBY DICK MARY READS A DIFFERENT BOOK SHE READS A BOOK BY CHER This example was taken from NLP Lunch Tutorial: Smoothing by Bill MacCartney (2005), Stanford University http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
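
The probabilities shown on the slide are not in the transcript; recomputing them from the three training sentences (with sentence-boundary markers <s> and </s>) gives, for instance:

    p(\text{JOHN} \mid \langle s\rangle) = \tfrac{1}{3}, \quad p(\text{READS} \mid \text{JOHN}) = 1, \quad p(\text{A} \mid \text{READS}) = \tfrac{2}{3}, \quad p(\text{BOOK} \mid \text{A}) = \tfrac{1}{2}, \quad p(\langle /s\rangle \mid \text{BOOK}) = \tfrac{1}{2}

so p(<s> JOHN READS A BOOK </s>) = 1/3 · 1 · 2/3 · 1/2 · 1/2 = 1/18 ≈ 0.06.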

  5. MLE (bigram) JOHN READS MOBY DICK MARY READS A DIFFERENT BOOK SHE READS A BOOK BY CHER This example was taken from NLP Lunch Tutorial: Smoothing by Bill MacCartney (2005), Stanford University http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf

  6. Problem with MLE “time heals all the wounds” If “all the wounds” never appeared in the training data, the MLE probability of the whole sentence will be 0.
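
A minimal Python sketch (not part of the slides; function and variable names are illustrative only) of an MLE bigram model over the toy corpus above makes the problem concrete: a single unseen bigram drives the probability of the whole sentence to zero.

    from collections import Counter

    def train_bigram_mle(sentences):
        """Count unigrams and bigrams (with <s>/</s> sentence markers)."""
        unigrams, bigrams = Counter(), Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            unigrams.update(tokens[:-1])            # histories: every token except the final </s>
            bigrams.update(zip(tokens[:-1], tokens[1:]))
        return unigrams, bigrams

    def sentence_prob(sent, unigrams, bigrams):
        """MLE probability: product over bigrams of c(w_prev, w) / c(w_prev)."""
        tokens = ["<s>"] + sent.split() + ["</s>"]
        prob = 1.0
        for prev, cur in zip(tokens[:-1], tokens[1:]):
            count = bigrams[(prev, cur)]
            if count == 0:
                return 0.0                          # one unseen bigram zeroes the whole sentence
            prob *= count / unigrams[prev]
        return prob

    corpus = ["JOHN READS MOBY DICK",
              "MARY READS A DIFFERENT BOOK",
              "SHE READS A BOOK BY CHER"]
    uni, bi = train_bigram_mle(corpus)
    print(sentence_prob("JOHN READS A BOOK", uni, bi))   # 1/18, about 0.056
    print(sentence_prob("CHER READS A BOOK", uni, bi))   # 0.0: the bigram (<s>, CHER) was never seen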

  7. MLE (bigram) JOHN READS MOBY DICK MARY READS A DIFFERENT BOOK SHE READS A BOOK BY CHER This example was taken from NLP Lunch Tutorial: Smoothing by Bill MacCartney (2005), Stanford University http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf

  8. Smoothing Take probability mass from “rich” (frequently seen) n-grams and give it to “poor” (rare or unseen) n-grams.

  9. Types of smoothing • Interpolation • Back-off

  10. Jelinek and Mercer (1980) Smoothing, as presented by Brown et al. (1992) – (3-gram) The interpolation weight λ may depend on the preceding words (the history)!
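
The slide's equation is missing from the transcript. The Jelinek-Mercer interpolated model it describes, as given in the Chen-Goodman report, is the recursive interpolation

    p_{interp}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}} \, p_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}}) \, p_{interp}(w_i \mid w_{i-n+2}^{i-1})

where the weight \lambda_{w_{i-n+1}^{i-1}} is indexed by the history, which is what “λ may depend on the preceding words” means.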

  11. Witten-Bell Smoothing • An instance of Jelinek-Mercer smoothing. • Definition: the number of distinct words that follow a given history in the training data. • In a 3-gram model this means: how many different words occurred after the words “Yossi eat” in the training data.
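
The definition missing from the slide is the count of distinct continuations of the history,

    N_{1+}(w_{i-n+1}^{i-1}\,\bullet) = |\{ w_i : c(w_{i-n+1}^{i-1} w_i) > 0 \}|

and Witten-Bell sets the interpolation weight from it:

    1 - \lambda_{w_{i-n+1}^{i-1}} = \frac{N_{1+}(w_{i-n+1}^{i-1}\,\bullet)}{N_{1+}(w_{i-n+1}^{i-1}\,\bullet) + \sum_{w_i} c(w_{i-n+1}^{i})}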

  12. Witten-Bell Smoothing This is actually recursive. If we look at the 3-gram model:
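
As a reconstruction of the missing formula, the recursion written out for the 3-gram case is

    p_{WB}(w_i \mid w_{i-2} w_{i-1}) = \lambda_{w_{i-2} w_{i-1}} \, p_{ML}(w_i \mid w_{i-2} w_{i-1}) + (1 - \lambda_{w_{i-2} w_{i-1}}) \, p_{WB}(w_i \mid w_{i-1})

with the same interpolation applied again to p_{WB}(w_i | w_{i-1}), bottoming out at the unigram model.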

  13. Absolute discounting • Like Jelinek-Mercer, involves interpolation of higher- and lower-order models. • However, instead of multiplying the higher-order maximum-likelihood distribution by a factor λ, the higher-order distribution is created by subtracting a fixed discount D ≤ 1 from each nonzero count.
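
The slide's formula is not in the transcript; the absolute-discounting model from the report is

    p_{abs}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + (1 - \lambda_{w_{i-n+1}^{i-1}}) \, p_{abs}(w_i \mid w_{i-n+2}^{i-1})

where, so that the distribution sums to one,

    1 - \lambda_{w_{i-n+1}^{i-1}} = \frac{D}{\sum_{w_i} c(w_{i-n+1}^{i})} \, N_{1+}(w_{i-n+1}^{i-1}\,\bullet)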

  14. Absolute discounting Ney, Essen, and Kneser (1994) suggest setting D through deleted estimation on the training data. They arrive at the estimate:
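
The missing estimate, in terms of the total numbers n_1 and n_2 of n-grams with exactly one and exactly two counts in the training data, is

    D = \frac{n_1}{n_1 + 2 n_2}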

  15. Kneser-Ney Smoothing “San Francisco” is quite common; “Apple Francisco” is not. However, because “San Francisco” is so common, the unigram count of “Francisco” is high, which makes “Francisco” look too “rich” in the lower-order model:

  16. Kneser-Ney Smoothing (2-gram) • Define:
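
The definitions on this slide did not survive extraction. In the interpolated bigram form, Kneser-Ney keeps absolute discounting for the higher-order term but builds the unigram distribution from the number of distinct contexts each word follows:

    p_{KN}(w_i \mid w_{i-1}) = \frac{\max\{c(w_{i-1} w_i) - D, 0\}}{\sum_{w_i} c(w_{i-1} w_i)} + \frac{D}{\sum_{w_i} c(w_{i-1} w_i)} \, N_{1+}(w_{i-1}\,\bullet) \, p_{KN}(w_i)

    p_{KN}(w_i) = \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\,\bullet)}, \qquad N_{1+}(\bullet\, w_i) = |\{ w_{i-1} : c(w_{i-1} w_i) > 0 \}|

so “Francisco”, which follows almost only “San”, gets a small lower-order probability despite its high unigram count.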

  17. Kneser-Ney Smoothing (n-gram) • Define:
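
For the general n-gram case the same structure is applied recursively:

    p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + \frac{D}{\sum_{w_i} c(w_{i-n+1}^{i})} \, N_{1+}(w_{i-n+1}^{i-1}\,\bullet) \, p_{KN}(w_i \mid w_{i-n+2}^{i-1})

with the lower-order distributions estimated from continuation counts N_{1+}(\bullet\, w_{i-n+2}^{i}) rather than raw counts.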

  18. Modified Kneser-Ney Smoothing Why always give the same discount? Let’s use a different discount depending on the n-gram count, as defined below.
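
Concretely, the single discount D is replaced by a count-dependent discount

    D(c) = \begin{cases} 0 & \text{if } c = 0 \\ D_1 & \text{if } c = 1 \\ D_2 & \text{if } c = 2 \\ D_{3+} & \text{if } c \ge 3 \end{cases}

so that n-grams seen once, twice, or three or more times are discounted by different amounts; the estimates for D_1, D_2, D_{3+} follow on the next slide.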

  19. Modified Kneser-Ney Smoothing
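
The slide's formulas are missing; the discount estimates proposed by Chen and Goodman, in terms of the numbers n_1, ..., n_4 of n-grams with exactly one to four counts, are

    Y = \frac{n_1}{n_1 + 2 n_2}, \qquad D_1 = 1 - 2Y\frac{n_2}{n_1}, \qquad D_2 = 2 - 3Y\frac{n_3}{n_2}, \qquad D_{3+} = 3 - 4Y\frac{n_4}{n_3}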

  20. Experimental setup The following smoothing methods were checked: On the following corpora:

  21. Experimental setup • Each piece of held-out data was chosen to be 2,500 sentences, or roughly 50,000 words.

  22. Experimental setup • Smoothing methods are evaluated through their cross-entropy on test data. • The baseline was Jelinek-Mercer with a single λ.
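
The cross-entropy measure referred to here is, for test data T containing W_T words,

    H_p(T) = -\frac{1}{W_T} \log_2 p(T)

so lower cross-entropy (equivalently, lower perplexity 2^{H_p(T)}) means a better model.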

  23. Results

  24. Results

  25. Results

  26. Results • Cross-entropy decreases as the training set grows. • The entropies of the different corpora differ. • Trigram models outperform bigram models (on large training sets). • Witten-Bell backoff performs poorly. • Interpolated models are superior to back-off models.

  27. Discussion Modified Kneser-Ney Smoothing is the best!

  28. Discussion “Whenever data sparsity is an issue, smoothing can help performance, and data sparsity is almost always an issue in statistical modeling.” – Stanley F. Chen, Joshua Goodman
