Language Modeling


Presentation Transcript


  1. Language Modeling "Anytime a linguist leaves the group, the recognition rate goes up." (Fred Jelinek)

  2. Word Prediction in Application Domains • Guessing the next word/letter • Once upon a time there was ……. • C’era una volta …. (Italian: “Once upon a time ….”) • Domains: speech modeling, augmentative communication systems (for disabled persons), T9

  3. Word Prediction for Spelling (each example below contains a deliberate error that a language model can help detect) • Andranno a trovarlo alla sua cassa domani. • Se andrei al mare sarei abbronzato. • Vado a spiaggia. • Hopefully, all with continue smoothly in my absence. • Can they lave him my message? • I need to notified the bank of this problem.

  4. Probabilities • Prior probability that the training data D will be observed: P(D) • Prior probability of h, P(h); it may include any prior knowledge that h is the correct hypothesis • P(D|h), probability of observing data D given a world where hypothesis h holds • P(h|D), probability that h holds given the data D, i.e. the posterior probability of h, because it reflects our confidence that h holds after we have seen the data D

  5. The Bayes Rule (Theorem)
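In the notation of slide 4, Bayes' rule reads

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

i.e. the posterior of h is the likelihood of the data under h times the prior of h, normalized by the prior probability of the data.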

  6. Maximum A Posteriori Hypothesis and Maximum Likelihood
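In the same notation, the two hypotheses are conventionally written as

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h), \qquad h_{ML} = \arg\max_{h \in H} P(D \mid h)$$

The maximum likelihood hypothesis coincides with the MAP hypothesis when all priors P(h) are assumed equal.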

  7. Bayes Optimal Classifier • Motivation: 3 hypotheses with posterior probabilities 0.4, 0.3 and 0.3. Thus, the first one is the MAP hypothesis. (!) BUT: • (A problem) Suppose a new instance is classified positive by the first hypothesis but negative by the other two. So the probability that the new instance is positive is 0.4, as opposed to 0.6 for the negative classification. Yet the MAP hypothesis is the 0.4 one! • Solution: The most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.

  8. Bayes Optimal Classifier • Classification of a new instance as the most probable class, combining all hypotheses • The Bayes optimal classifier is written out below
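With classes v_j in V and hypotheses h_i in H, the Bayes optimal classification is

$$v_{OB} = \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$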

  9. Naïve Bayes Classifier • Bayes optimal classifier as the starting point • Naïve version: assume the attributes are conditionally independent given the class (below)
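A standard way to write the naïve version, for classes v_j in V and attribute values a_1, …, a_n of the new instance:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$$

The product replaces the joint probability P(a_1, …, a_n | v_j), which is exactly the conditional-independence assumption.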

  10. m-estimate of probability
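The m-estimate commonly used with the Naïve Bayes probabilities is

$$\hat{P} = \frac{n_c + m\,p}{n + m}$$

where n is the number of training examples of the class, n_c the number of those with the attribute value in question, p a prior estimate of the probability, and m the equivalent sample size that weights the prior.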

  11. Tagging • P (tag = Noun | word = saw) = ?

  12. Use a corpus to estimate these probabilities: the Language Model
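For example, the probability from slide 11 can be estimated by relative frequency in a tagged corpus (the maximum-likelihood estimate):

$$P(\text{tag}=\text{Noun} \mid \text{word}=\textit{saw}) \approx \frac{C(\textit{saw} \text{ tagged as Noun})}{C(\textit{saw})}$$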

  13. N-gram Model • The N-th word is predicted by the previous N-1 words. • What is a word? • Token, word-form, lemma, m-tag, …

  14. N-gram approximation models
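The approximation behind the model: the chain rule is truncated so that each word is conditioned only on the previous N-1 words,

$$P(w_1^n) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1}) \;\approx\; \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$$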

  15. bi-gram and tri-gram models • N=2 (bi) and N=3 (tri), written out below
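$$N=2:\quad P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1}) \qquad\qquad N=3:\quad P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-2}\,w_{k-1})$$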

  16. Counting n-grams
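A minimal counting sketch in Python (the function name and the toy sentence are illustrative, not from the slides); the last line is the relative-frequency bigram estimate C(w_{n-1} w_n) / C(w_{n-1}):

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "<s> today is a beautiful day . </s>".split()
unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)

# Relative-frequency (maximum likelihood) bigram estimate:
# P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
p_beautiful_given_a = bigrams[("a", "beautiful")] / unigrams[("a",)]
```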

  17. The Language Model Allows us to Calculate Sentence Probs • P( Today is a beautiful day . ) = P( Today | <Start>) * P (is | Today) * P( a | is) * P(beautiful|a) * P(day| beautiful) * P(. | day) * P(<End>| .) Work in log space !
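A sketch of the log-space computation; bigram_prob is a hypothetical function returning P(current | previous) from the language model:

```python
import math

def sentence_logprob(words, bigram_prob):
    """Sum log probabilities instead of multiplying raw probabilities,
    which avoids floating-point underflow on long sentences."""
    tokens = ["<Start>"] + list(words) + ["<End>"]
    return sum(math.log(bigram_prob(prev, cur))
               for prev, cur in zip(tokens, tokens[1:]))

# e.g. sentence_logprob("Today is a beautiful day .".split(), bigram_prob)
```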

  18. Unseen n-grams and Smoothing • Discounting (several types; one example below) • Backoff • Deleted Interpolation
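As one concrete example of discounting, add-one (Laplace) smoothing reserves probability mass for unseen bigrams; the sketch below assumes the count tables from the counting sketch above:

```python
def laplace_bigram_prob(prev, cur, bigrams, unigrams, vocab_size):
    """Add-one (Laplace) smoothing, one simple discounting scheme:
    every bigram count is incremented by 1, and the denominator grows
    by the vocabulary size so the distribution still sums to one."""
    return (bigrams[(prev, cur)] + 1) / (unigrams[(prev,)] + vocab_size)
```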

  19. Deleted Interpolation
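For a trigram model, the interpolated estimate mixes trigram, bigram, and unigram probabilities:

$$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_3 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_1 P(w_n), \qquad \sum_i \lambda_i = 1$$

The λs are estimated on held-out ("deleted") data, e.g. with the EM procedure of slides 27-28.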

  20. Searching For the Best Tagging • The words W_1 … W_8 form the columns of a lattice; each column holds that word's candidate tags (t_1_1, t_2_1, t_3_1, …). • Use Viterbi search to find the best path through the lattice.
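A minimal Viterbi sketch over such a lattice; the tables log_init, log_trans, and log_emit are hypothetical log-probability dictionaries estimated from a tagged corpus (they are not defined on the slides):

```python
def viterbi(words, tags, log_init, log_trans, log_emit):
    """Minimal Viterbi sketch for the tagging lattice.
    log_init[t]     : log P(t starts the sentence)
    log_trans[p][t] : log P(t | previous tag p)
    log_emit[t][w]  : log P(word w | tag t)"""
    # best[i][t] = score of the best tag path ending in tag t at position i
    best = [{t: log_init[t] + log_emit[t].get(words[0], float("-inf"))
             for t in tags}]
    back = [{}]
    for i, w in enumerate(words[1:], start=1):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(((p, best[i - 1][p] + log_trans[p][t]) for p in tags),
                              key=lambda x: x[1])
            best[i][t] = score + log_emit[t].get(w, float("-inf"))
            back[i][t] = prev
    # Trace the highest-scoring path backwards through the lattice
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```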

  21. Cross Entropy • Entropy from the point of view of a user who has misinterpreted the source distribution to be q rather than p [cross entropy is an upper bound on the entropy]
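For a true distribution p and a model distribution q:

$$H(p, q) = -\sum_{x} p(x) \log q(x) \;\ge\; H(p) = -\sum_{x} p(x) \log p(x)$$

with equality exactly when q = p.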

  22. Cross Entropy as a Quality Measure • Two models, therefore two upper bounds on the entropy • The more accurate model is the one with the lower cross entropy

  23. Imagine that y was generated with either model A or model B. Then:

  24. Cont. Proof of convergence of the EM algorithm

  25. Expectation-Maximization (EM) Algorithm • Consider a problem in which the data D is a set of instances generated by a probability distribution that is a mixture of k distinct Normal distributions (assuming equal variances) • The hypothesis is therefore defined by the vector of the means of the k distributions

  26. Expectation-Maximization Algorithm • Step 1 (E-step): Calculate the expected value of the hidden variables (which distribution generated each instance), assuming that the current hypothesis holds • Step 2 (M-step): Calculate a new maximum-likelihood hypothesis assuming that the expected values are the true values, then make the new hypothesis the current one • Step 3: Go to Step 1 (sketched below)
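A sketch of this loop for the setting of slide 25 (k one-dimensional Gaussians with equal, known variance). For simplicity the mixing weights are taken to be uniform, which is an extra assumption not stated on the slides:

```python
import math
import random

def em_gaussian_mixture(data, k, sigma=1.0, iters=50):
    """EM for a mixture of k one-dimensional Gaussians with equal, known
    variance sigma^2; only the means are estimated (mixing weights are
    assumed uniform here, a simplification)."""
    means = random.sample(list(data), k)      # initial hypothesis: k random points
    for _ in range(iters):
        # E-step: expected membership of every point in every component,
        # assuming the current means are correct
        resp = []
        for x in data:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in means]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: new maximum-likelihood means, treating the expected
        # memberships as if they were the true (hidden) assignments
        means = [sum(r[j] * x for r, x in zip(resp, data)) / sum(r[j] for r in resp)
                 for j in range(k)]
    return means
```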

  27. If we find a λ′ such that …, we need to maximize A with respect to λ′, under the constraint that all the λs sum up to one. Use Lagrange multipliers.

  28. The EM Algorithm • Can be generalized analogously to more λs

  29. Measuring success rates • Recall = (#correct answers)/(#total possible answers) • Precision = (#correct answers)/(#answers) • Fallout = (#incorrect answers)/(#of spurious facts in the text) • F-measure = [(b^2+1)*P*R]/(b^2*P+R) • If b > 1, R is favored; if b < 1, P is favored.
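A small sketch of these measures (the function name and the example numbers are illustrative):

```python
def f_measure(correct, answered, possible, b=1.0):
    """Precision, recall and F_b as defined on slide 29.
    correct:  number of correct answers produced
    answered: total number of answers produced
    possible: total number of possible (gold) answers"""
    p = correct / answered
    r = correct / possible
    f = (b ** 2 + 1) * p * r / (b ** 2 * p + r)
    return p, r, f

# 8 correct answers out of 10 produced, 16 possible:
# precision = 0.8, recall = 0.5, F_1 = 2 * 0.8 * 0.5 / (0.8 + 0.5) ≈ 0.615
print(f_measure(8, 10, 16))
```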

  30. Chunking as Tagging • Even certain parsing problems can be solved via tagging • E.g.: • ((A B) C ((D F) G)) • BIA tags: A/B B/A C/I D/B F/A G/A
