Language Modeling

Presentation Transcript


  1. Language Modeling

  2. Roadmap (for next two classes) • Review LMs • What are they? • How (and where) are they used? • How are they trained? • Evaluation metrics • Entropy • Perplexity • Smoothing • Good-Turing • Backoff and Interpolation • Absolute Discounting • Kneser-Ney

  3. What is a language model? • Assigns a probability to a sequence of words, treating language as transmitted signals of information (Claude Shannon, Information Theory) • Lots of ties to Cryptography and Information Theory • We most often use n-gram models

  4. Applications

  5–7. Applications • What word sequence (English) does this phoneme sequence correspond to: AY D L AY K T UW R EH K AH N AY S B IY CH • Goal of LM: P(“I’d like to recognize speech”) > P(“I’d like to wreck a nice beach”)

  8. Why n-gram LMs? • We could just count how often a sentence occurs… • …but language is too productive – infinite combos! • Break down by word – predict each word given its history • We could just count words in context… • …but even contexts get too sparse. • So just use the last n − 1 words of history: an n-gram model
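To make the history-truncation idea concrete, here is a minimal sketch of an MLE bigram model. The helper name train_bigram and the toy corpus are my own illustration (not from the slides), and the corpus is assumed to be already tokenized:

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency (MLE) from tokenized sentences."""
    bigram_counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]            # sentence-boundary markers
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[prev][word] += 1
    # Normalize each history's counts into a conditional distribution
    return {
        prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
        for prev, ctr in bigram_counts.items()
    }

corpus = [["i", "like", "speech"], ["i", "like", "nice", "beaches"]]
model = train_bigram(corpus)
print(model["i"])      # {'like': 1.0}              P(like | i) = 1 under MLE
print(model["like"])   # {'speech': 0.5, 'nice': 0.5}
```

The truncated history is what keeps the table of counts manageable: we only need statistics per (previous word, next word) pair, not per whole sentence.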

  9. Language Model Evaluation Metrics

  10–16. Entropy and perplexity • Entropy measures the information content in a distribution == the uncertainty • If I can predict the next word before it comes, there’s no information content • Zero uncertainty means the signal has zero information • How many bits of additional information do I need to guess the next symbol? • Perplexity is the average branching factor • If the message has zero information, then the branching factor is 1 • If the message needs one bit, the branching factor is 2 • If the message needs two bits, the branching factor is 4 • Entropy and perplexity measure the same thing (uncertainty / information content) with different scales
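For reference, the standard formulas behind these bullets (not transcribed from the slides): entropy in bits, and perplexity as its exponentiation, which reproduces the branching factors above.

```latex
H(p) = -\sum_{x} p(x)\,\log_2 p(x)
\qquad
\mathrm{PP}(p) = 2^{H(p)}
% 0 bits  ->  PP = 2^0 = 1
% 1 bit   ->  PP = 2^1 = 2
% 2 bits  ->  PP = 2^2 = 4
```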

  17–19. Information in a fair coin flip
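The worked computation (a standard result, not recovered from the transcript): a fair coin carries exactly one bit per flip.

```latex
H(\text{fair coin}) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit},
\qquad \mathrm{PP} = 2^{1} = 2
```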

  20–23. Information in a single fair die
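Likewise for the die (standard result, not from the transcript): six equally likely outcomes need log2 6 bits, and the perplexity recovers the branching factor of 6.

```latex
H(\text{fair die}) = -\sum_{i=1}^{6}\tfrac{1}{6}\log_2\tfrac{1}{6} = \log_2 6 \approx 2.585 \text{ bits},
\qquad \mathrm{PP} = 2^{\log_2 6} = 6
```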

  24. Information in sum of two dice?
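One way to answer the slide's question numerically (my own sketch, not from the deck): enumerate the 36 equally likely outcomes, collect the distribution over sums, and compute its entropy. It comes out to roughly 3.27 bits, less than the 2 × log2 6 ≈ 5.17 bits needed for the two dice reported individually, because summing throws information away.

```python
from collections import Counter
from math import log2

# Distribution of the sum of two fair dice: 36 equally likely (i, j) pairs.
sums = Counter(i + j for i in range(1, 7) for j in range(1, 7))
p = {s: c / 36 for s, c in sums.items()}

entropy = -sum(prob * log2(prob) for prob in p.values())
print(f"H(sum of two dice) = {entropy:.3f} bits")   # ~3.274 bits
print(f"perplexity = {2 ** entropy:.2f}")           # ~9.68 (average branching factor)
```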

  25–26. Entropy of a distribution • Start with a distribution p over events x in the event space • Entropy measures the minimum number of bits necessary to encode a message, assuming that it is drawn from the distribution p • Key notion – you can use shorter codes for more common messages • (If you’ve heard of Huffman coding, here it is…)

  27–29. Computing Entropy • H(p) = -Σ_x p(x) log2 p(x) • Here -log2 p(x) is the ideal code length for this symbol, and p(x) weights it by the expected occurrences of x

  30–31. Entropy example • What binary code would I use to represent these? • Sample: cabaa (counts a/b/c = 3/1/1)
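Working the example with my own arithmetic (following the slide's counts of 3/1/1): the MLE probabilities are p(a) = 3/5 and p(b) = p(c) = 1/5, so the ideal code lengths are about 0.74 bits for a and 2.32 bits for b and c. A concrete prefix code such as a→0, b→10, c→11 averages 0.6·1 + 0.4·2 = 1.4 bits per symbol, close to the entropy lower bound:

```latex
\hat{p}(a) = \tfrac{3}{5},\quad \hat{p}(b) = \hat{p}(c) = \tfrac{1}{5}
\qquad
H = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{1}{5}
  \approx 0.6\,(0.737) + 0.4\,(2.322) \approx 1.37 \text{ bits/symbol}
```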

  32. Perplexity • Just 2^H • If entropy H measures the # of bits per symbol • Just exponentiate to get the branching factor: PP = 2^H

  33. BIG SWITCH: Cross entropy and Language model perplexity

  34. The Train/Test Split and Entropy • Before, we were computing H(p) directly • This scores how well we’re doing only if we know the true distribution p • In practice we don’t: we estimate parameters on training data and evaluate on test data

  35. Cross entropy • Estimate a distribution on the training corpus; see how well it predicts the testing corpus • Let • q be the distribution we learned from the training data • w_1 … w_N be the test data • Then the cross entropy of the test data given the training data is: H = -(1/N) Σ_i log2 q(w_i) • This is the negative average log probability • It is also the average number of bits required to encode each test-data symbol using our learned distribution

  36. Cross entropy, formally • True distribution p, assumed distribution q • We wrote the codebook using q, but we encode messages drawn from p: H(p, q) = -Σ_x p(x) log2 q(x) • Let p̃ be the count-based (empirical) distribution of the test data; then H(p̃, q) is exactly the per-symbol cross entropy from the previous slide

  37. Language model perplexity • Recipe: • Train a language model on training data • Get the negative log probs of the test data, compute the average • Exponentiate! • Perplexity correlates rather well with: • Speech recognition error rates • MT quality metrics • LM perplexities for word-based models are normally between, say, 50 and 1000 • You need to drop perplexity by a significant fraction (not an absolute amount) to make a visible impact
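A minimal sketch of that recipe (my own helper names, not from the slides). For brevity it uses an add-one-smoothed unigram model rather than the n-gram models discussed above, and it assumes a closed vocabulary so every test word gets nonzero probability:

```python
from collections import Counter
from math import log2

def perplexity(train_tokens, test_tokens):
    """Train an add-one-smoothed unigram model on train_tokens, then
    exponentiate the average negative log2-probability of test_tokens."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)   # closed-vocabulary assumption
    total, V = len(train_tokens), len(vocab)

    def prob(w):
        return (counts[w] + 1) / (total + V)       # Laplace smoothing

    avg_neg_logprob = -sum(log2(prob(w)) for w in test_tokens) / len(test_tokens)
    return 2 ** avg_neg_logprob                    # perplexity = 2^(cross entropy)

train = "i like speech and i like beaches".split()
test = "i like nice speech".split()
print(perplexity(train, test))
```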

  38. Parameter estimation and smoothing

  39. Tasks • You are given parameters θ • You want to produce data that conforms to the distribution they define • This is simulation, or data generation

  40. Tasks • You are given parameters θ • And observations: HHHHTHTTHHHHH • You need to answer: “How likely is this data according to the model?” • This is evaluating the likelihood function

  41. Tasks • You are given observations: HHTHTTHTHTHHTHTHTTHTHT • You need to find a set of parameters θ that explains them • This is parameter estimation

  42. Parameter estimation • We keep talking about things like P(w_i | w_{i-1}) as a distribution with parameters • How do we estimate those parameters? • What’s the likelihood of these parameters given the data?

  43. Parameter estimation techniques • Often use the Relative Frequency Estimate • For certain distributions… • “how likely is it that I get k heads when I flip n times” (Binomial distributions) • “how likely is it that I get five 6s when I roll five dice” (Multinomial distributions) • …Relative Frequency = Maximum Likelihood Estimate (MLE) • This is the set of parameters under which the observed data has maximum likelihood (another max!) • Formalizes your intuition from the prior slide
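As a one-line illustration of "relative frequency = MLE" (a standard derivation, not from the slides): for n independent coin flips with k heads, maximizing the log likelihood lands exactly on the relative frequency.

```latex
L(\theta) = \binom{n}{k}\,\theta^{k}(1-\theta)^{n-k}
\qquad
\frac{d}{d\theta}\log L(\theta) = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0
\;\;\Rightarrow\;\;
\hat{\theta}_{\mathrm{MLE}} = \frac{k}{n}
```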

  44. Maximum Likelihood has problems :/ • Remember: P_MLE(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1}) • Two problems: • What happens if c(w_{i-1} w_i) = 0? • We assign zero probability to an event… • Even worse, what if c(w_{i-1}) = 0? • Divide by zero is undefined!

  45. Smoothing • Main goal: prevent zero numerators (zero probabilities) and zero denominators (divide-by-zero errors) • Make a “sharp” distribution (where some outputs have large probabilities and others have zero probability) “smoother” • The smoothest distribution is the uniform distribution • Constraint: the result should still be a probability distribution (sums to 1)

  46. Smoothing techniques • Add one (Laplace) • This can help, but it generally doesn’t do a good job of estimating what’s going on
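For bigrams, the standard add-one estimate (with V the vocabulary size) is the usual formulation, given here for reference rather than transcribed from the slide:

```latex
P_{\mathrm{Laplace}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\,w_i) + 1}{c(w_{i-1}) + V}
```

Adding 1 to every possible continuation fixes both zero problems from slide 44, but when V is large it shifts a great deal of probability mass onto unseen events, which is why it often estimates poorly.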

  47. Mixtures / interpolation • Say I have two distributions p and q over the same outcomes • Pick any number λ between 0 and 1 • Then λp + (1 − λ)q is also a distribution • Two things to show: • (a) It sums to one: Σ_x [λ p(x) + (1 − λ) q(x)] = λ · 1 + (1 − λ) · 1 = 1 • (b) All values are ≥ 0: p(x) ≥ 0 and q(x) ≥ 0 because they’re distributions, and λ ≥ 0 and 1 − λ ≥ 0 since 0 ≤ λ ≤ 1 • So every term in the sum is non-negative, and we’re done
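A sketch of how this mixture idea becomes interpolation smoothing for LMs (my own function names; the weight lam would normally be tuned on held-out data, and the bigram term is a rough MLE that ignores sentence boundaries):

```python
from collections import Counter

def interpolated_bigram(train_tokens, lam=0.7):
    """P(w | prev) = lam * P_MLE(w | prev) + (1 - lam) * P_MLE(w):
    a mixture of a bigram distribution and a unigram distribution."""
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    total = len(train_tokens)

    def prob(w, prev):
        p_uni = unigrams[w] / total
        p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
        return lam * p_bi + (1 - lam) * p_uni

    return prob

p = interpolated_bigram("i like speech and i like beaches".split())
print(p("like", "i"))      # high: the bigram "i like" was seen in training
print(p("beaches", "i"))   # nonzero even though "i beaches" never occurred
```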

  48. Laplace as a mixture • Say we have K outcomes and N total observations. Laplace says: P(x) = (c(x) + 1) / (N + K) • Rewrite: P(x) = [N / (N + K)] · c(x)/N + [K / (N + K)] · 1/K • Laplace is a mixture between MLE and uniform! • The mixture weight N / (N + K) is determined by N and K
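A quick numeric check of that identity, using made-up counts of my own:

```python
# Check: (c + 1) / (N + K) == lam * (c / N) + (1 - lam) * (1 / K), with lam = N / (N + K)
counts = {"a": 3, "b": 1, "c": 1, "d": 0}      # K = 4 outcomes, N = 5 observations
N, K = sum(counts.values()), len(counts)
lam = N / (N + K)
for x, c in counts.items():
    laplace = (c + 1) / (N + K)
    mixture = lam * (c / N) + (1 - lam) * (1 / K)
    assert abs(laplace - mixture) < 1e-12, (x, laplace, mixture)
print("Laplace == mixture of MLE and uniform, with weight", lam)
```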

  49. BERP Corpus Bigrams • Original bigram probabilities

  50. BERP Smoothed Bigrams • Smoothed bigram probabilities from the BERP corpus
