
CS 479, section 1: Natural Language Processing



Presentation Transcript


  1. This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License. CS 479, section 1: Natural Language Processing. Lecture #13: Language Model Smoothing: Discounting. Thanks to Dan Klein of UC Berkeley for many of the materials used in this lecture.

  2. Announcements • Project #1, Part 1 • Early: today • Due: Friday • The “under construction” bar on the schedule will be moved forward today.

  3. Objectives • Get comfortable with the process of factoring and smoothing a joint model of a familiar object: text! • Look closely at discounting techniques for smoothing language models.

  4. Review: Language Models • Is a Language Model (i.e., Markov Model) a joint or conditional distribution? Over what? • Is a local model a joint or conditional distribution? Over what? • How are the Markov model and the local model related?

  5. Review: Smoothing • What is smoothing? • Name two ways to accomplish smoothing.

  6. Discounting • Starting point: Maximum Likelihood Estimator (MLE) • a.k.a. the Empirical Estimator • Key discounting problem: • What count c*(w,h) should we use for an event that occurred c(w,h) times in N samples? • Call c*(w,h) “r*” for short. • Corollary problem: • What probability p* = r*/N* should we use for an event that occurred with probability p = r/N?
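For reference, the maximum-likelihood (empirical) estimator that discounting starts from can be written as follows; the notation c(w,h) is from the slide, and expressing c(h) as a sum over the vocabulary is an assumed convention rather than something shown in the transcript:

$$ P_{\text{MLE}}(w \mid h) = \frac{c(w,h)}{c(h)}, \qquad c(h) = \sum_{w'} c(w',h) $$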

  7. Discounting • [Figure: observed counts r vs. re-estimated counts r* for word types such as allegations, attack, man, outcome, reports, claims, and request]

  8. Zipf • Rank word types by token frequency • Zipf’s Law • Frequency is inversely proportional to rank • Most word types are rare
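Stated as a formula (one common form of Zipf's law; the exponent on the rank is often taken to be roughly 1 but varies by corpus):

$$ f(w) \propto \frac{1}{\text{rank}(w)} $$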

  9. Notation • N: number of samples (tokens or n-grams) in the training data • V: vocabulary size (number of word types) • N0: number of types not observed in the training data • N1+: number of types observed at least once (so V = N1+ + N0) • c(w,h), abbreviated r: observed count • c*(w,h), abbreviated r*: re-estimated (discounted) count

  10. Discounting: Add-One Estimation • Idea: pretend we saw every n-gram once more than we actually did [Laplace] • And for unigrams: • Observed count r > 0 → re-estimated count 0 < r* < r • V = N1+ + N0 • Reserves V/(N+V) for extra events • Observed: N1+/(N+V) is distributed back to seen n-grams • “Unobserved”: N0/(N+V) is reserved for unseen words (nearly all of the vocab!) • Actually tells us not only how much to hold out, but where to put it (evenly) • Works astonishingly poorly in practice
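The add-one (Laplace) estimates the slide alludes to, written out in a standard form consistent with its use of N and V (the exact formulas on the original slide are not preserved in this transcript):

$$ P_{\text{add-1}}(w \mid h) = \frac{c(w,h) + 1}{c(h) + V}, \qquad P_{\text{add-1}}(w) = \frac{c(w) + 1}{N + V} $$

In the unigram case, the V added pseudo-counts account for exactly the V/(N+V) of probability mass that the slide says is reserved for extra events.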

  11. Discounting: Add-Epsilon • Quick fix: add some small ε instead of 1 [Lidstone, Jeffreys] • Slightly better, holds out less mass • Still a bad idea
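In the same notation, the add-epsilon estimator is (again a standard form, assumed rather than copied from the slide):

$$ P_{\text{add-}\varepsilon}(w \mid h) = \frac{c(w,h) + \varepsilon}{c(h) + \varepsilon V} $$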

  12. Discounting: Add-Epsilon • Let ε = μ · (1/V) • A uniform prior over the vocabulary, from a Bayesian point of view
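Substituting ε = μ · (1/V), so that μ = εV is the total pseudo-count mass, gives the Bayesian reading of add-epsilon; this rewriting is a reconstruction of the missing slide formula, not a copy of it:

$$ P(w \mid h) = \frac{c(w,h) + \mu \cdot \tfrac{1}{V}}{c(h) + \mu}, \qquad \mu = \varepsilon V $$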

  13. Discounting: Unigram Prior • Better to assume a unigram prior • Where the unigram distribution is itself empirical or smoothed:
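A common way to write the unigram-prior estimator described here, with μ a tunable pseudo-count mass and P̂(w) the (empirical or smoothed) unigram distribution; the slide's own formula is not preserved in the transcript:

$$ P_{\text{uni-prior}}(w \mid h) = \frac{c(w,h) + \mu\,\hat{P}(w)}{c(h) + \mu} $$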

  14. How Much Mass to Withhold? • Remember the key discounting problem: • What count r* should we use for an event that occurred r times in N samples? • For the rich (frequent events), r is too big • For the poor (rare events), r is too small (often 0)

  15. Idea: Use Held-out Data • [Figure: two equal-sized samples of N words (or n-grams), containing tokens such as “the”, “interest”, “rates”, and “Federal”] [Jelinek and Mercer]

  16. Idea: Use Held-out Data • [Same figure, highlighting how often the word types seen r times in sample #1 actually occur in sample #2] [Jelinek and Mercer]

  17. Discounting: Held-out Estimation • Summary: • Get another N samples (sample #1) • Collect the set of types occurring r times (in sample #1) • Get another N samples (sample #2) • What is the average frequency of those types in sample #2? • e.g., in our example, doubletons occurred on average 1.5 times • Use that average as r* • r = 2 → r* = 1.5 • In practice, much better than add-one, etc. • Main issue: requires a large (equal-sized) amount of held-out data
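A minimal Python sketch of the held-out procedure summarized above (the function and variable names are mine; it assumes two equal-sized, already-tokenized samples and returns the re-estimated count r* for each observed count r):

```python
from collections import Counter

def held_out_counts(sample1, sample2):
    """Map each count r observed in sample1 to its held-out re-estimate r*:
    the average frequency in sample2 of the types that occurred exactly
    r times in sample1."""
    c1, c2 = Counter(sample1), Counter(sample2)
    # Group word types by how many times they occurred in sample #1.
    types_by_r = {}
    for w, r in c1.items():
        types_by_r.setdefault(r, []).append(w)
    # r* = average count in sample #2 of the types seen r times in sample #1.
    return {r: sum(c2[w] for w in ws) / len(ws) for r, ws in types_by_r.items()}

# Example: if the doubletons of sample #1 occur on average 1.5 times in
# sample #2, then held_out_counts(sample1, sample2)[2] == 1.5.
```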

  18. What’s Next • Upcoming lectures: • More sophisticated discounting strategies • Overview of Speech Recognition • The value of language models
