
Maximum Entropy: Modeling, Decoding, Training





  1. Maximum Entropy: Modeling, Decoding, Training Advanced Statistical Methods in NLP Ling 572 February 2, 2012

  2. Roadmap • MaxEnt: • Recap • Modeling: • Computing expectations • Constraints in the model • Decoding • HW #5 • MaxEnt (cont’d) • Training

  3. Maximum Entropy Principle: Summary • Among all probability distributions p in P that satisfy the set of constraints, select p* that maximizes the entropy: H(p) = -Σx,y p̃(x) p(y|x) log p(y|x) • Questions: • 1) How do we model the constraints? • 2) How do we select the distribution?

  4. Example II: MT (Berger, 1996) • What if we find out that the translator uses dans or en 30% of the time? • Constraint: p(dans) + p(en) = 3/10 • Now what is the maxent model? • p(dans) = p(en) = 3/20 • p(à) = p(au cours de) = p(pendant) = 7/30 • What if we also know the translator picks à or dans 50% of the time? • Add a new constraint: p(à) + p(dans) = 0.5 • Now what is the maxent model? • Not intuitively obvious…
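The second maxent model is not obvious by inspection, but it can be computed numerically. Here is a minimal sketch (not part of the original slides) that maximizes entropy under both constraints with scipy; the variable names are illustrative:

    import numpy as np
    from scipy.optimize import minimize

    # Five candidate translations of "in" from Berger's example
    words = ["dans", "en", "a", "au cours de", "pendant"]

    def neg_entropy(p):
        # Negative entropy: minimizing this maximizes H(p) = -sum p log p
        return np.sum(p * np.log(p))

    constraints = [
        {"type": "eq", "fun": lambda p: p.sum() - 1.0},      # probabilities sum to 1
        {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(dans) + p(en) = 3/10
        {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},  # p(dans) + p(a) = 1/2
    ]

    p0 = np.full(5, 0.2)  # start from the uniform distribution
    res = minimize(neg_entropy, p0, bounds=[(1e-9, 1)] * 5, constraints=constraints)
    print(dict(zip(words, res.x.round(4))))

The solver illustrates the point of the slide: with both constraints active, the maxent distribution is no longer a simple even split.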

  5. Feature Functions • A feature function is a binary-valued indicator function fj: X × Y → {0, 1} • In text classification, j refers to a specific (feature, class) pair, s.t. fj fires when the feature is present and y is that class: • fj(x,y) = 1 if y = “guns” and x includes “rifle”; 0 otherwise
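A one-function sketch of such an indicator (hypothetical names; x is taken here to be the set of terms in a document):

    def make_feature(term, label):
        # f_j(x, y): fires iff the document x contains `term`
        # and the candidate class y equals `label`
        def f(x, y):
            return 1 if y == label and term in x else 0
        return f

    f_j = make_feature("rifle", "guns")
    print(f_j({"rifle", "ammo"}, "guns"))    # 1
    print(f_j({"rifle", "ammo"}, "sports"))  # 0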

  6. Empirical Expectation: Example • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t3 t4 • x4 c3 t1 t3 • Raw counts of each (feature, class) pair:
        c1  c2  c3
    t1   1   1   1
    t2   1   0   0
    t3   2   0   1
    t4   1   1   0
Example due to F. Xia

  7. Empirical Expectation: Example • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t3 t4 • x4 c3 t1 t3 • Empirical distribution (raw counts divided by N = 4):
        c1   c2   c3
    t1  1/4  1/4  1/4
    t2  1/4   0    0
    t3  1/2   0   1/4
    t4  1/4  1/4   0
Example due to F. Xia

  8. Calculating Empirical Expectation • Build previous table • Collect a set of training samples of size N • For each instance x in the training data: • y = true label of x • For each feature t in x: • empirical_expectation[t][y] += 1/N
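In code, the loop above might look like this minimal sketch (the data encoding is assumed, not from the slides):

    from collections import defaultdict

    # Toy training data from the example: (true label, features present)
    data = [("c1", ["t1", "t2", "t3"]),
            ("c2", ["t1", "t4"]),
            ("c1", ["t3", "t4"]),
            ("c3", ["t1", "t3"])]

    N = len(data)
    empirical = defaultdict(float)  # (feature, class) -> empirical expectation
    for y, feats in data:           # y = true label of x
        for t in feats:
            empirical[(t, y)] += 1.0 / N

    print(empirical[("t3", "c1")])  # 0.5, matching the table above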

  9. Model Expectation (Detail) • Ep fj = Σx p̃(x) Σy p(y|x) fj(x,y) • The expected count of fj when x is drawn from the empirical distribution and y from the model’s conditional distribution p(y|x)

  10. Model Expectation: Example • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t4 • x4 c3 t1 t3 • “Raw” counts of each (feature, class) pair:
        c1  c2  c3
    t1   1   1   1
    t2   1   0   0
    t3   1   0   1
    t4   1   1   0
Example due to F. Xia

  11. Model Expectation: Example • Let P(y|xi) = 1/3 for every y • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t4 • x4 c3 t1 t3 • “Raw” counts, now weighting each feature occurrence by P(y|x) = 1/3, identical for every class y:
        c1   c2   c3
    t1   1    1    1
    t2  1/3  1/3  1/3
    t3  2/3  2/3  2/3
    t4  2/3  2/3  2/3
Example due to F. Xia

  14. Model Expectation: Example • Let P(y|xi) = 1/3 • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t4 • x4 c3 t1 t3 • Model expectation (weighted counts divided by N = 4, identical for every class y):
        c1    c2    c3
    t1  1/4   1/4   1/4
    t2  1/12  1/12  1/12
    t3  1/6   1/6   1/6
    t4  1/6   1/6   1/6
Example due to F. Xia


  17. Calculating Model Expectation • Build the previous table • Collect a set of training samples of size N • For each instance x in the training data: • Compute P(y|x) for each y in Y • For each feature t in x: • For each y in Y: • model_expectation[t][y] += (1/N) * P(y|x)
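The same loop as a runnable sketch (the uniform P(y|x) = 1/3 stands in for a trained model):

    from collections import defaultdict

    # Toy data from the example; note x3 carries only t4 here
    data = [("c1", ["t1", "t2", "t3"]),
            ("c2", ["t1", "t4"]),
            ("c1", ["t4"]),
            ("c3", ["t1", "t3"])]
    classes = ["c1", "c2", "c3"]
    N = len(data)

    def p_y_given_x(y, feats):
        # Stand-in for the trained model's conditional distribution
        return 1.0 / len(classes)

    model_expectation = defaultdict(float)
    for _, feats in data:
        for t in feats:
            for y in classes:
                model_expectation[(t, y)] += (1.0 / N) * p_y_given_x(y, feats)

    print(model_expectation[("t1", "c1")])  # ~0.25 = 1/4, matching the table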

  19. Comparing Expectations • Empirical expectation: Ep̃ fj = Σx,y p̃(x,y) fj(x,y) • Model expectation: Ep fj = Σx,y p̃(x) p(y|x) fj(x,y)

  23. Incorporating Constraints • Maximum entropy models: • Model known constraints • Otherwise, apply maximum entropy (minimal commitment) • Discriminative models: maximize conditional likelihood • What are our constraints? • Our model must be consistent with the training data • So, model expectation = empirical expectation: Ep fj = Ep̃ fj for all j

  24. Conditional Likelihood • Given data (X, Y), the conditional likelihood is a function of the parameters λ: log L(λ) = Σ(x,y) log pλ(y|x) • Maximizing entropy subject to the constraints is equivalent to maximizing this conditional likelihood
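A minimal sketch of evaluating that objective, assuming feature functions and weights as in the earlier examples (names illustrative, not from the slides):

    import math

    def conditional_log_likelihood(data, classes, features, weights):
        # data: list of (true label, x) pairs; features: list of f_j(x, y)
        # functions; weights: matching list of lambda_j values
        ll = 0.0
        for y_true, x in data:
            scores = {y: sum(w * f(x, y) for f, w in zip(features, weights))
                      for y in classes}
            log_z = math.log(sum(math.exp(s) for s in scores.values()))
            ll += scores[y_true] - log_z  # log p(y_true | x)
        return ll

Training searches for the λ that maximize this quantity.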

  26. Constraints • Adding constraints: • Makes the model more consistent with the training data • Moves it away from the simplest maximum entropy solution • Makes the model less uniform • Lowers entropy • Increases likelihood

  29. The Modeling Problem • Goal: find p* s.t. p* = argmaxp H(p) subject to • P = {p | Ep fj = dj, j = 1, …, k}, where dj is the empirical expectation Ep̃ fj • Maximize H(p) • subject to the constraints: Ep fj = dj for j = 1, …, k

  32. Maximizing H(p) • Problem: it is hard to analytically compute the max of H(p) • Approach: • Convert to an alternate form that is easier to optimize and whose optimum is also an optimum of H(p) • Technically, employ Lagrange multipliers: • Find multipliers λ that minimize the Lagrangian • The solution minimizing the new form will maximize H(p)

  33. Solving w/ Lagrange Multipliers • Minimize A(p) • Set A′(p) = 0, and solve
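The slide's algebra did not survive transcription; a standard reconstruction (notation assumed, in the spirit of Berger et al., 1996) is:

    A(p) = -H(p) - \sum_j \lambda_j \Big( \sum_{x,y} \tilde{p}(x)\,p(y|x)\,f_j(x,y) - d_j \Big)
           - \sum_x \mu_x \Big( \sum_y p(y|x) - 1 \Big)

Setting \partial A / \partial p(y|x) = 0 for each (x, y):

    \tilde{p}(x)\big(\log p(y|x) + 1\big) = \tilde{p}(x) \sum_j \lambda_j f_j(x,y) + \mu_x

so log p(y|x) is linear in the fj, which forces the exponential form:

    p(y|x) = \frac{1}{Z(x)} \exp\Big( \sum_j \lambda_j f_j(x,y) \Big)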

  35. The Modeling Problem • Goal: find p* s.t. p* = argmaxp H(p) subject to • P = {p | Ep fj = dj, j = 1, …, k} • Now what? • Are there p’s that satisfy these constraints? • Does p* exist? • Is p* unique? • What is the form of p*? • How can we compute it?

  38. p*: Existence, Form, & Uniqueness • P = {p | Ep fj = Ep̃ fj, j = 1, …, k} • Q: the set of models of log-linear (exponential) form • Theorem 1 of (Ratnaparkhi, 1997) shows that: • If p* ∈ P ∩ Q, then p* = argmaxp∈P H(p), and p* is unique

  42. p* • Two forms: • By optimization: p*(y|x) = (1/Z(x)) exp(Σj λj fj(x,y)) • By constraint: p*(y|x) = π Πj αj^fj(x,y) • Equivalent: π = 1/Z; λj = ln αj
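The equivalence is a one-line check (same notation as above):

    \pi \prod_j \alpha_j^{f_j(x,y)}
      = \pi \exp\Big( \sum_j f_j(x,y) \ln \alpha_j \Big)
      = \frac{1}{Z} \exp\Big( \sum_j \lambda_j f_j(x,y) \Big),
      \quad \pi = \frac{1}{Z},\ \lambda_j = \ln \alpha_j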

  45. The Model: Summary • Goal: find p* s.t. p* = argmaxp H(p) subject to • P = {p | Ep fj = Ep̃ fj, j = 1, …, k} • p*: • Is unique • Maximizes the conditional likelihood • Is of the form p*(y|x) = (1/Z(x)) exp(Σj λj fj(x,y))

  46. Decoding

  47. Decoding • p(y|x) = (1/Z(x)) exp(Σj λj fj(x,y)), where Z(x) = Σy′ exp(Σj λj fj(x,y′)) is the normalization term

  50. Decoding • Given a trained model with the λj’s: • Z = 0 • For each y in Y: • sum = 0; # initialize, or set to a default weight • For each t in x: • sum += weight for (t, y)
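The transcript breaks off at this point (the deck runs to 61 slides); a minimal completion of the decoding loop, assuming the usual exponentiate-and-normalize finish, is:

    import math

    def decode(x, classes, weights):
        # weights: dict mapping (feature, class) -> lambda_j
        # x: the features present in the instance
        scores = {}
        for y in classes:
            s = 0.0  # initialize (or set to a default weight)
            for t in x:
                s += weights.get((t, y), 0.0)  # weight for (t, y)
            scores[y] = math.exp(s)
        Z = sum(scores.values())  # normalization term Z(x)
        return {y: scores[y] / Z for y in classes}

The predicted class is then the y with the highest p(y|x).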
