
6. N-GRAMs


Presentation Transcript


  1. 6. N-GRAMs Pusan National University Artificial Intelligence Laboratory, 최성자 (Seongja Choi)

  2. Word Prediction “I’d like to make a collect …” Likely next words: call, telephone, or person-to-person • Spelling error detection • Augmentative communication • Context-sensitive spelling error correction

  3. Language Model • Language Model (LM): a statistical model of word sequences • n-gram: use the previous n-1 words to predict the next word

  4. Applications • Context-sensitive spelling error detection and correction: “He is trying to fine out.” (fine → find) “The design an construction will take a year.” (an → and) • Machine translation

  5. Counting Words in Corpora • Corpora (on-line text collections) • Which words to count • What we are going to count • Where we are going to find the things to count

  6. Brown Corpus • 1 million words • 500 texts • Varied genres (newspaper, novels, non-fiction, academic, etc.) • Assembled at Brown University in 1963-64 • The first large on-line text collection used in corpus-based NLP research

  7. Issues in Word Counting • Punctuation symbols (. , ? !) • Capitalization (“He” vs. “he”, “Bush” vs. “bush”) • Inflected forms (“cat” vs. “cats”) • Wordform: cat, cats, eat, eats, ate, eating, eaten • Lemma (Stem): cat, eat

  8. Types vs. Tokens • Tokens (N): total number of running words • Types (B): number of distinct words in a corpus (the size of the vocabulary) Example: “They picnicked by the pool, then lay back on the grass and looked at the stars.” - 16 word tokens, 14 word types (not counting punctuation) ※ Here “types” means wordform types, not lemma types, and punctuation marks will generally be counted as words

  9. How Many Words in English? • Shakespeare’s complete works • 884,647 wordform tokens • 29,066 wordform types • Brown Corpus • 1 million wordform tokens • 61,805 wordform types • 37,851 lemma types

  10. Simple (Unsmoothed) N-grams • Task: Estimating the probability of a word • First attempt: • Suppose there is no corpus available • Use uniform distribution • Assume: • word types = V (e.g., 100,000)
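The equation behind this first attempt was not captured in the transcript; a hedged reconstruction, assuming a vocabulary of V word types with every word equally likely:

P(w) = \frac{1}{V} \quad \text{(e.g., } V = 100{,}000 \Rightarrow P(w) = 0.00001\text{)}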

  11. Simple (Unsmoothed) N-grams • Task: Estimating the probability of a word • Second attempt: • Suppose there is a corpus • Assume: • word tokens = N • # times w appears in corpus = C(w)
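The slide's equation is likewise missing; the standard relative-frequency estimate under these assumptions is

P(w) = \frac{C(w)}{N}

where N is the total number of word tokens and C(w) the number of times w appears in the corpus.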

  12. Simple (Unsmoothed) N-grams • Task: Estimating the probability of a word • Third attempt: • Suppose there is a corpus • Assume a word depends on its n-1 previous words
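The decomposition this attempt builds on (a hedged reconstruction, using the chain rule of probability over a word sequence w_1 ... w_n):

P(w_1^n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})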

  13. Simple (Unsmoothed) N-grams

  14. Simple (Unsmoothed) N-grams • n-gram approximation: • w_k only depends on its previous n-1 words
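Written out, the approximation the slide states in words is

P(w_k \mid w_1^{k-1}) \approx P(w_k \mid w_{k-n+1}^{k-1})

For the bigram case (n = 2) this reduces to P(w_k \mid w_1^{k-1}) \approx P(w_k \mid w_{k-1}).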

  15. Bigram Approximation • Example: P(I want to eat British food) = P(I|<s>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) <s>: a special word meaning “start of sentence”
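A small Python sketch of this bigram computation; the probability values in the table are illustrative placeholders, not figures from the slides:

```python
# Bigram approximation of P(I want to eat British food).
# The probabilities below are ILLUSTRATIVE placeholders, not values from the slides.
bigram_prob = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.001,
    ("British", "food"): 0.60,
}

def sentence_prob(words, bigrams):
    """Multiply P(w_k | w_{k-1}) over the sentence, starting from <s>."""
    prob = 1.0
    prev = "<s>"
    for w in words:
        prob *= bigrams[(prev, w)]
        prev = w
    return prob

print(sentence_prob("I want to eat British food".split(), bigram_prob))
```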

  16. Note on Practical Problem • Multiplying many probabilities results in a very small number and can cause numerical underflow • Use log probabilities (logprobs) in the actual computation instead
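Continuing the sketch above, the same computation in log space: long sentences sum log probabilities instead of multiplying many small numbers.

```python
import math

def sentence_logprob(words, bigrams):
    """Sum log P(w_k | w_{k-1}); avoids the underflow the raw product can hit."""
    logprob = 0.0
    prev = "<s>"
    for w in words:
        logprob += math.log(bigrams[(prev, w)])
        prev = w
    return logprob

# Reuses the illustrative bigram_prob table from the previous sketch.
lp = sentence_logprob("I want to eat British food".split(), bigram_prob)
print(lp, math.exp(lp))  # exp(logprob) recovers the probability when it is not too small
```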

  17. Estimating N-gram Probability • Maximum Likelihood Estimate (MLE)
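The MLE formula itself did not survive extraction; for bigrams it is the relative frequency

P_{\mathrm{MLE}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)} = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}

and, for general n-grams, P(w_n \mid w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1}).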

  18. Estimating Bigram Probability • Example: • C(to eat) = 860 • C(to) = 3256
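Plugging these counts into the MLE formula above:

P(\text{eat} \mid \text{to}) = \frac{C(\text{to eat})}{C(\text{to})} = \frac{860}{3256} \approx 0.26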

  19. Two Important Facts • The accuracy of n-gram models increases as we increase the value of N • N-gram models depend very strongly on their training corpus (in particular its genre and its size in words)

  20. Smoothing • Any particular training corpus is finite • Sparse data problem: many perfectly acceptable n-grams never appear in the training data • These zero counts must be dealt with, since they would otherwise give zero probability

  21. Smoothing • Smoothing • Reevaluating zero probability n-grams and assigning them non-zero probability • Also called Discounting • Lowering non-zero n-gram counts in order to assign some probability mass to the zero n-grams

  22. Add-One Smoothing for Bigram
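The equation on this slide was lost in extraction; the standard add-one (Laplace) estimate for bigrams, with V the vocabulary size (number of word types), is

P_{\text{add-1}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}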

  23. Things Seen Once • Use the count of things seen once to help estimate the count of things never seen

  24. Witten-Bell Discounting

  25. Witten-Bell Discounting for Bigram

  26. Witten-Bell Discounting for Bigram
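The Witten-Bell equations on the preceding slides were not captured; a hedged reconstruction of the usual bigram form, where T(w_{n-1}) is the number of distinct word types observed after w_{n-1}, Z(w_{n-1}) the number of types never observed after it, and C(w_{n-1}) the total count of bigrams starting with w_{n-1}:

P^{*}(w_n \mid w_{n-1}) =
\begin{cases}
\dfrac{C(w_{n-1} w_n)}{C(w_{n-1}) + T(w_{n-1})} & \text{if } C(w_{n-1} w_n) > 0 \\[2ex]
\dfrac{T(w_{n-1})}{Z(w_{n-1})\,\bigl(C(w_{n-1}) + T(w_{n-1})\bigr)} & \text{if } C(w_{n-1} w_n) = 0
\end{cases}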

  27. Seen Counts vs. Unseen Counts

  28. Good-Turing Discounting for Bigram
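The Good-Turing equation was likewise lost; the standard re-estimated count, where N_c is the number of bigrams that occur exactly c times, is

c^{*} = (c + 1)\,\frac{N_{c+1}}{N_c}

with the total probability mass reserved for unseen bigrams equal to N_1 / N.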

  29. Backoff

  30. Backoff
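The equations on the two Backoff slides did not survive; a hedged sketch of the standard Katz-style trigram backoff, where \tilde{P} is a discounted estimate and \alpha a normalizing backoff weight:

P_{\mathrm{backoff}}(w_n \mid w_{n-2} w_{n-1}) =
\begin{cases}
\tilde{P}(w_n \mid w_{n-2} w_{n-1}) & \text{if } C(w_{n-2} w_{n-1} w_n) > 0 \\
\alpha(w_{n-2} w_{n-1})\, P_{\mathrm{backoff}}(w_n \mid w_{n-1}) & \text{otherwise}
\end{cases}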

  31. Entropy • Measure of uncertainty • Used to evaluate quality of n-gram models (how well a language model matches a given language) • Entropy H(X) of a random variable X: • Measured in bits • Number of bits to encode information in the optimal coding scheme
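The definition the slide points to, for a random variable X ranging over a set \chi:

H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x)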

  32. Example 1

  33. Example 2

  34. Perplexity
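The perplexity formula on this slide was not captured; it is standardly defined from the per-word entropy H of a word sequence W = w_1 ... w_N as

\mathrm{PP}(W) = 2^{H(W)} = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}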

  35. Entropy of a Sequence
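A reconstruction of the per-word quantity this slide typically defines, for sequences of length n drawn from a language L:

\frac{1}{n} H(w_1^n) = -\frac{1}{n} \sum_{w_1^n \in L} p(w_1^n) \log p(w_1^n)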

  36. Entropy of a Language
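And the entropy of the language itself, reconstructed as the limit of that per-word rate:

H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1^n) = -\lim_{n \to \infty} \frac{1}{n} \sum_{w_1^n \in L} p(w_1^n) \log p(w_1^n)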

  37. Cross Entropy • Used for comparing two language models • p: Actual probability distribution that generated some data • m: A model of p (approximation to p) • Cross entropy of m on p:
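Reconstructed in its standard form:

H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{w_1^n \in L} p(w_1^n) \log m(w_1^n)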

  38. Cross Entropy • By the Shannon-McMillan-Breiman theorem: • Property of cross entropy: • The difference between H(p,m) and H(p) is a measure of how accurate model m is • The more accurate a model, the lower its cross-entropy
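The equation the theorem licenses (reconstructed), which lets the cross entropy be estimated from a single sufficiently long sample rather than a sum over all sequences:

H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log m(w_1^n)

together with the property H(p) \le H(p, m), with equality only when m = p.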
