Presentation Transcript


  1. Natural Language Processing • 096260  An Empirical Study of Smoothing Techniques for Language Modeling. Stanley F. Chen, Joshua Goodman. Computer Science Group, Harvard University, Cambridge, Massachusetts. July 24, 1998. TR-10-98

  2. Language Model A language model gives the probability of any word appearing as the next word in a text, given the words that precede it.

  3. MLE ( maximum likelihood estimate)
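
The formula on this slide did not survive extraction. The standard maximum-likelihood estimate it refers to, written in LaTeX for the general n-gram case (bigram: n = 2), is:

    p_{ML}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i})}{\sum_{w_i} c(w_{i-n+1}^{i})}

i.e., the count of the full n-gram divided by the total count of its history.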

  4. MLE (bigram) JOHN READS MOBY DICK MARY READS A DIFFERENT BOOK SHE READS A BOOK BY CHER This example was taken from NLP Lunch Tutorial: Smoothing by Bill MacCartney (2005), Stanford University http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
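
The probabilities shown on the slide are not in the transcript; recomputing them from the three training sentences (with sentence-boundary markers <s> and </s>) gives, for instance:

    p(\text{JOHN} \mid \langle s\rangle) = \tfrac{1}{3}, \quad p(\text{READS} \mid \text{JOHN}) = 1, \quad p(\text{A} \mid \text{READS}) = \tfrac{2}{3}, \quad p(\text{BOOK} \mid \text{A}) = \tfrac{1}{2}, \quad p(\langle /s\rangle \mid \text{BOOK}) = \tfrac{1}{2}

so p(<s> JOHN READS A BOOK </s>) = 1/3 · 1 · 2/3 · 1/2 · 1/2 = 1/18 ≈ 0.06.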

  5. MLE (bigram) JOHN READS MOBY DICK MARY READS A DIFFERENT BOOK SHE READS A BOOK BY CHER This example was taken from NLP Lunch Tutorial: Smoothing by Bill MacCartney (2005), Stanford University http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf

  6. Problem with MLE “time heals all the wounds” If “all the wounds” never appeared in the training data, the MLE probability of the whole sentence will be 0.
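
A minimal Python sketch (not part of the slides; function and variable names are illustrative only) of an MLE bigram model over the toy corpus above makes the problem concrete: a single unseen bigram drives the probability of the whole sentence to zero.

    from collections import Counter

    def train_bigram_mle(sentences):
        """Count unigrams and bigrams (with <s>/</s> sentence markers)."""
        unigrams, bigrams = Counter(), Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            unigrams.update(tokens[:-1])            # histories: every token except the final </s>
            bigrams.update(zip(tokens[:-1], tokens[1:]))
        return unigrams, bigrams

    def sentence_prob(sent, unigrams, bigrams):
        """MLE probability: product over bigrams of c(w_prev, w) / c(w_prev)."""
        tokens = ["<s>"] + sent.split() + ["</s>"]
        prob = 1.0
        for prev, cur in zip(tokens[:-1], tokens[1:]):
            count = bigrams[(prev, cur)]
            if count == 0:
                return 0.0                          # one unseen bigram zeroes the whole sentence
            prob *= count / unigrams[prev]
        return prob

    corpus = ["JOHN READS MOBY DICK",
              "MARY READS A DIFFERENT BOOK",
              "SHE READS A BOOK BY CHER"]
    uni, bi = train_bigram_mle(corpus)
    print(sentence_prob("JOHN READS A BOOK", uni, bi))   # 1/18, about 0.056
    print(sentence_prob("CHER READS A BOOK", uni, bi))   # 0.0: the bigram (<s>, CHER) was never seen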

  7. MLE (bigram) JOHN READS MOBY DICK MARY READS A DIFFERENT BOOK SHE READS A BOOK BY CHER This example was taken from NLP Lunch Tutorial: Smoothing by Bill MacCartney (2005), Stanford University http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf

  8. Smoothing Take probability mass from “rich” (frequently seen) n-grams and give it to “poor” (rare or unseen) n-grams.

  9. Types of smoothing • Interpolation • Back-off

  10. Jelinek and Mercer (1980) Smoothing, as presented by Brown et al. (1992) – (3-gram) The interpolation weight λ may depend on the preceding words (the history)!
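
The slide's equation is missing from the transcript. The Jelinek-Mercer interpolated model it describes, as given in the Chen-Goodman report, is the recursive interpolation

    p_{interp}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}} \, p_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}}) \, p_{interp}(w_i \mid w_{i-n+2}^{i-1})

where the weight \lambda_{w_{i-n+1}^{i-1}} is indexed by the history, which is what “λ may depend on the preceding words” means.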

  11. Witten-Bell Smoothing • An instance of Jelinek-Mercer smoothing. • Definition: the number of distinct words that follow a given history in the training data. • In a 3-gram model this means: how many different words occurred after the words “Yossi eat” in the training data.
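
The definition missing from the slide is the count of distinct continuations of the history,

    N_{1+}(w_{i-n+1}^{i-1}\,\bullet) = |\{ w_i : c(w_{i-n+1}^{i-1} w_i) > 0 \}|

and Witten-Bell sets the interpolation weight from it:

    1 - \lambda_{w_{i-n+1}^{i-1}} = \frac{N_{1+}(w_{i-n+1}^{i-1}\,\bullet)}{N_{1+}(w_{i-n+1}^{i-1}\,\bullet) + \sum_{w_i} c(w_{i-n+1}^{i})}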

  12. Witten-Bell Smoothing This is actually recursive. If we look at the 3-gram model:
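
As a reconstruction of the missing formula, the recursion written out for the 3-gram case is

    p_{WB}(w_i \mid w_{i-2} w_{i-1}) = \lambda_{w_{i-2} w_{i-1}} \, p_{ML}(w_i \mid w_{i-2} w_{i-1}) + (1 - \lambda_{w_{i-2} w_{i-1}}) \, p_{WB}(w_i \mid w_{i-1})

with the same interpolation applied again to p_{WB}(w_i | w_{i-1}), bottoming out at the unigram model.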

  13. Absolute discounting • Like Jelinek-Mercer, involves interpolation of higher- and lower-order models. • However, instead of multiplying the higher-order maximum-likelihood distribution by a factor λ, the higher-order distribution is created by subtracting a fixed discount D ≤ 1 from each nonzero count.
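
The slide's formula is not in the transcript; the absolute-discounting model from the report is

    p_{abs}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + (1 - \lambda_{w_{i-n+1}^{i-1}}) \, p_{abs}(w_i \mid w_{i-n+2}^{i-1})

where, so that the distribution sums to one,

    1 - \lambda_{w_{i-n+1}^{i-1}} = \frac{D}{\sum_{w_i} c(w_{i-n+1}^{i})} \, N_{1+}(w_{i-n+1}^{i-1}\,\bullet)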

  14. Absolute discounting Ney, Essen, and Kneser (1994) suggest setting D through deleted estimation on the training data. They arrive at the estimate:
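
The missing estimate, in terms of the total numbers n_1 and n_2 of n-grams with exactly one and exactly two counts in the training data, is

    D = \frac{n_1}{n_1 + 2 n_2}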

  15. Kneser-Ney Smoothing “San Francisco” is quite common; “Apple Francisco” is not. However, because “San Francisco” is so common, the unigram count of “Francisco” is high, which makes “Francisco” look too “rich” in the lower-order model:

  16. Kneser-Ney Smoothing (2-gram) • Define:
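
The definitions on this slide did not survive extraction. In the interpolated bigram form, Kneser-Ney keeps absolute discounting for the higher-order term but builds the unigram distribution from the number of distinct contexts each word follows:

    p_{KN}(w_i \mid w_{i-1}) = \frac{\max\{c(w_{i-1} w_i) - D, 0\}}{\sum_{w_i} c(w_{i-1} w_i)} + \frac{D}{\sum_{w_i} c(w_{i-1} w_i)} \, N_{1+}(w_{i-1}\,\bullet) \, p_{KN}(w_i)

    p_{KN}(w_i) = \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\,\bullet)}, \qquad N_{1+}(\bullet\, w_i) = |\{ w_{i-1} : c(w_{i-1} w_i) > 0 \}|

so “Francisco”, which follows almost only “San”, gets a small lower-order probability despite its high unigram count.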

  17. Kneser-Ney Smoothing (n-gram) • Define:
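
For the general n-gram case the same structure is applied recursively:

    p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + \frac{D}{\sum_{w_i} c(w_{i-n+1}^{i})} \, N_{1+}(w_{i-n+1}^{i-1}\,\bullet) \, p_{KN}(w_i \mid w_{i-n+2}^{i-1})

with the lower-order distributions estimated from continuation counts N_{1+}(\bullet\, w_{i-n+2}^{i}) rather than raw counts.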

  18. Modified Kneser-Ney Smoothing Why always give the same discount? Let’s use a different discount depending on the n-gram count, as defined below.
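
Concretely, the single discount D is replaced by a count-dependent discount

    D(c) = \begin{cases} 0 & \text{if } c = 0 \\ D_1 & \text{if } c = 1 \\ D_2 & \text{if } c = 2 \\ D_{3+} & \text{if } c \ge 3 \end{cases}

so that n-grams seen once, twice, or three or more times are discounted by different amounts; the estimates for D_1, D_2, D_{3+} follow on the next slide.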

  19. Modified Kneser-Ney Smoothing
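
The slide's formulas are missing; the discount estimates proposed by Chen and Goodman, in terms of the numbers n_1, ..., n_4 of n-grams with exactly one to four counts, are

    Y = \frac{n_1}{n_1 + 2 n_2}, \qquad D_1 = 1 - 2Y\frac{n_2}{n_1}, \qquad D_2 = 2 - 3Y\frac{n_3}{n_2}, \qquad D_{3+} = 3 - 4Y\frac{n_4}{n_3}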

  20. Experimental setup The following smoothing methods were checked: On the following corpora:

  21. Experimental setup • Each piece of held-out data was chosen to be 2,500 sentences, or roughly 50,000 words.

  22. Experimental setup • Smoothing methods are evaluated through their cross-entropy on test data. • The baseline was Jelinek-Mercer with a single λ.
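
The cross-entropy measure referred to here is, for test data T containing W_T words,

    H_p(T) = -\frac{1}{W_T} \log_2 p(T)

so lower cross-entropy (equivalently, lower perplexity 2^{H_p(T)}) means a better model.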

  23. Results

  24. Results

  25. Results

  26. Results • Cross-entropy decreases as the training set grows. • The entropies of the different corpora differ. • Trigram models outperform bigram models (on large training sets). • Witten-Bell backoff performs poorly. • Interpolated models are superior to back-off models.

  27. Discussion Modified Kneser-Ney Smoothing is the best!

  28. Discussion “Whenever data sparsity is an issue, smoothing can help performance, and data sparsity is almost always an issue in statistical modeling.” – Stanley F. Chen, Joshua Goodman
