Maximum Entropy: Modeling, Decoding, Training

Presentation Transcript

  1. Maximum Entropy: Modeling, Decoding, Training • Advanced Statistical Methods in NLP • Ling 572 • February 2, 2012

  2. Roadmap • MaxEnt: • Recap • Modeling: • Computing expectations • Constraints in the model • Decoding • HW #5 • MaxEnt (cont’d) • Training

  3. Maximum Entropy Principle: Summary • Among all probability distributions p in P that satisfy the set of constraints, select the p* that maximizes the entropy H(p) • Questions: • 1) How do we model the constraints? • 2) How do we select the distribution p*?

  4. Example II: MT (Berger, 1996) • What if we find out that the translator uses dans or en 30% of the time? • Constraint: p(dans) + p(en) = 3/10 • Now what is the maxent model? • p(dans) = p(en) = 3/20 • p(à) = p(au cours de) = p(pendant) = 7/30 • What if we also know the translator picks à or dans 50% of the time? • Add new constraint: p(à) + p(dans) = 0.5 • Now what is the maxent model? • Not intuitively obvious…
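Since the answer under the second constraint is not obvious by inspection, it helps to solve the constrained problem numerically. Below is a minimal sketch (mine, not from the slides) that maximizes H(p) over the five candidate translations subject to the two slide constraints using scipy; the variable names and the SLSQP solver choice are assumptions.

```python
# Numerically solve Berger's MT example: maximize H(p) subject to
# p(dans)+p(en)=3/10 and p(à)+p(dans)=1/2. Sketch only; names are mine.
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    # Negated entropy, so minimizing it maximizes H(p).
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},         # sums to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 3.0 / 10},  # p(dans)+p(en)
    {"type": "eq", "fun": lambda p: p[2] + p[0] - 1.0 / 2},   # p(à)+p(dans)
]

p0 = np.full(5, 0.2)  # start from the uniform distribution
res = minimize(neg_entropy, p0, method="SLSQP",
               bounds=[(1e-9, 1.0)] * 5, constraints=constraints)
for w, p in zip(words, res.x):
    print(f"p({w}) = {p:.4f}")
```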

  5. Feature Functions • A feature function is a binary-valued indicator function. • In text classification, j refers to a specific (feature, class) pair, such that fj(x,y) fires when the feature is present in x and y is that class: • fj(x,y) = 1 if y = “guns” and x includes “rifle”; 0 otherwise
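A minimal sketch of the slide's indicator feature in code; representing the document x as a set of tokens is my assumption.

```python
def f_guns_rifle(x, y):
    """1 if the class is "guns" and the document contains "rifle", else 0."""
    return 1 if y == "guns" and "rifle" in x else 0

print(f_guns_rifle({"the", "rifle", "jammed"}, "guns"))    # 1
print(f_guns_rifle({"the", "rifle", "jammed"}, "sports"))  # 0
```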

  6. Empirical Expectation: Example • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t3 t4 • x4 c3 t1 t3 • Raw counts of (feature, class) pairs: t1: c1=1, c2=1, c3=1; t2: c1=1; t3: c1=2, c3=1; t4: c1=1, c2=1 • Example due to F. Xia

  7. Empirical Expectation: Example • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t3 t4 • x4 c3 t1 t3 • Empirical distribution (raw counts divided by N=4): t1: c1=1/4, c2=1/4, c3=1/4; t2: c1=1/4; t3: c1=1/2, c3=1/4; t4: c1=1/4, c2=1/4 • Example due to F. Xia

  8. Calculating Empirical Expectation • Build previous table • Collect a set of training samples of size N • For each instance x in the training data: • y = true label of x • For each feature t in x: • empirical_expectation[t][y] += 1/N
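A runnable version of this pseudocode (a sketch of mine, using the training data from slide 6; the list-of-(label, tokens) data layout is an assumption):

```python
from collections import defaultdict

data = [("c1", ["t1", "t2", "t3"]),
        ("c2", ["t1", "t4"]),
        ("c1", ["t3", "t4"]),
        ("c3", ["t1", "t3"])]

N = len(data)
empirical_expectation = defaultdict(float)
for y, x in data:          # y = true label of instance x
    for t in x:            # each feature present in x
        empirical_expectation[(t, y)] += 1.0 / N

print(empirical_expectation[("t3", "c1")])  # 2 of 4 instances: 0.5
```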

  9. Model Expectation (Detail) • E_p(fj) = Σ_x p̃(x) Σ_y p(y|x) fj(x,y) ≈ (1/N) Σ_i Σ_y p(y|xi) fj(xi,y)

  10. Model Expectation: Example • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t4 • x4 c3 t1 t3 • “Raw” counts • Example due to F. Xia

  11. Model Expectation: Example • Let P(y|xi)=1/3 • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t4 • x4 c3 t1 t3 • “Raw” counts • Example due to F. Xia

  14. Model Expectation: Example • Let P(y|xi)=1/3 • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t4 • x4 c3 t1 t3 • Model expectation (each class gets count(t)/(3N), N=4): t1: 3/12 = 1/4 each; t2: 1/12 each; t3: 1/6 each; t4: 1/6 each • Example due to F. Xia

  17. Calculating Model Expectation • Build previous table • Collect a set of training samples of size N • For each instance x in the training data: • Compute P(y|x) for each y in Y • For each feature t in x: • For each y in Y: • model_expectation[t][y] += 1/N*P(y|x)
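A runnable sketch of this pseudocode (mine), using the training data from slides 10-14 and the placeholder model P(y|x) = 1/3 from slide 11:

```python
from collections import defaultdict

data = [("c1", ["t1", "t2", "t3"]),
        ("c2", ["t1", "t4"]),
        ("c1", ["t4"]),
        ("c3", ["t1", "t3"])]
classes = ["c1", "c2", "c3"]
N = len(data)

def p_y_given_x(y, x):
    return 1.0 / 3  # placeholder; a trained MaxEnt model would go here

model_expectation = defaultdict(float)
for _, x in data:              # the gold label is not used here
    for t in x:
        for y in classes:
            model_expectation[(t, y)] += (1.0 / N) * p_y_given_x(y, x)

print(model_expectation[("t1", "c1")])  # 3 instances have t1: 3*(1/4)*(1/3) = 0.25
```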

  19. Comparing Expectations • Empirical Expectation: E_p̃(fj) = Σ_{x,y} p̃(x,y) fj(x,y) = (1/N) Σ_i fj(xi, yi) • Model Expectation: E_p(fj) = Σ_x p̃(x) Σ_y p(y|x) fj(x,y) = (1/N) Σ_i Σ_y p(y|xi) fj(xi, y)

  23. Incorporating Constraints • Maximum entropy models: • Model known constraints • Otherwise, apply maximum entropy (minimal commitment) • Discriminative models: maximize conditional likelihood • What are our constraints? • Our model must be consistent with the training data • So, model expectation = empirical expectation

  24. Conditional Likelihood • Given data (X,Y), the conditional likelihood is a function of the parameters λ
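A sketch of the quantity the slide refers to, written out (the standard conditional log-likelihood for a log-linear model; the notation is mine):

```latex
% Conditional (log-)likelihood of training data (x_1,y_1),...,(x_N,y_N)
L(\lambda) = \sum_{i=1}^{N} \log p_\lambda(y_i \mid x_i),
\qquad
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)}
  \exp\Bigl(\sum_{j} \lambda_j f_j(x, y)\Bigr)
```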

  26. Constraints • Make model more consistent with training data • Move away from simplest maximum entropy • Make model less uniform • Lower entropy • Increase likelihood

  29. The Modeling Problem • Goal: Find p* s.t. p* = argmax_p H(p) subject to • P = {p | E_p(fj) = dj, j = 1,…,k} • Maximize H(p) • subject to • Constraints: E_p(fj) = dj (= E_p̃(fj)), j = 1,…,k
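Written out in full (the standard statement of the constrained problem; the notation is mine):

```latex
% The primal maxent problem
p^* = \arg\max_{p \in P} H(p),
\qquad
H(p) = -\sum_{x,y} \tilde p(x)\, p(y \mid x) \log p(y \mid x)

\text{subject to} \quad
E_p(f_j) = E_{\tilde p}(f_j) \;\; (j = 1, \dots, k),
\qquad
\sum_{y} p(y \mid x) = 1 \;\; \text{for all } x
```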

  32. Maximizing H(p) • Problem: Hard to analytically compute the max of H(p) • Approach: • Convert to an alternate form that is easier to optimize and whose optimum is also an optimum for H(p) • Technically, employ Lagrange multipliers • Find multipliers λ that minimize the Lagrangian • The solution minimizing the new form will maximize H(p)

  33. Solving with Lagrange Multipliers • Minimize A(p) • Set A′(p) = 0 and solve
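A sketch of the algebra behind the slide (the slides minimize a negated form; up to sign conventions this is the standard argument, and the notation is mine):

```latex
% Lagrangian: entropy, one multiplier per constraint, and one
% multiplier per x for normalization
A(p) = H(p) + \sum_{j=1}^{k} \lambda_j \bigl(E_p(f_j) - E_{\tilde p}(f_j)\bigr)
            + \sum_{x} \mu_x \Bigl(\sum_{y} p(y \mid x) - 1\Bigr)

% Setting \partial A / \partial p(y \mid x) = 0 and solving yields the
% log-linear form
p^*(y \mid x) = \frac{1}{Z(x)}
  \exp\Bigl(\sum_{j=1}^{k} \lambda_j f_j(x, y)\Bigr),
\qquad
Z(x) = \sum_{y'} \exp\Bigl(\sum_{j=1}^{k} \lambda_j f_j(x, y')\Bigr)
```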

  35. The Modeling Problem • Goal: Find p* s.t. p* = argmax_p H(p) subject to • P = {p | E_p(fj) = dj, j = 1,…,k} • Now what? • Are there p’s that satisfy these constraints? • Does p* exist? • Is p* unique? • What is the form of p*? • How can we compute it?

  38. p*: Existence, Form, & Uniqueness • P = {p | E_p(fj) = dj, j = 1,…,k} • Q = the set of distributions of log-linear (exponential) form • Theorem 1 of (Ratnaparkhi, 1997) shows that: • If p* ∈ P ∩ Q, then p* = argmax_{p ∈ P} H(p) and p* is unique

  42. p* • Two forms: • By optimization: p*(y|x) = (1/Z) exp(Σj λj fj(x,y)) • By constraint: p*(y|x) = π Πj αj^fj(x,y) • Equivalent: π = 1/Z; λj = ln αj
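A tiny numeric check of the slide's equivalence (toy numbers, mine; the common π = 1/Z factor cancels, so only the products are compared):

```python
import math

# With lambda_j = ln(alpha_j), the product form prod(alpha_j ** f_j)
# equals the exponential form exp(sum(lambda_j * f_j)).
alphas = [2.0, 0.5, 3.0]
f = [1, 0, 1]                                   # feature values f_j(x, y)
lambdas = [math.log(a) for a in alphas]

prod_form = math.prod(a ** fj for a, fj in zip(alphas, f))
exp_form = math.exp(sum(l * fj for l, fj in zip(lambdas, f)))
print(prod_form, exp_form)                      # both 6.0
```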

  45. The Model: Summary • Goal: Find p* s.t. p* = argmax_p H(p) subject to • P = {p | E_p(fj) = dj, j = 1,…,k} • p*: • Is unique • Maximizes the conditional likelihood of the training data • Is of the form p*(y|x) = (1/Z(x)) exp(Σj λj fj(x,y))

  46. Decoding

  47. Decoding • p(y|x) = (1/Z(x)) exp(Σj λj fj(x,y)), where Z(x) = Σ_y′ exp(Σj λj fj(x,y′)) is the normalization term

  50. Decoding • Given a trained model with weights λj • Z = 0 • For each y in Y: • sum = 0; # Initialize or set to default_weight • For each t in x: • sum += weight for (t,y)
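The transcript breaks off mid-procedure here, so in the runnable sketch below the exponentiation, normalization, and final argmax are my completion of the loop; the toy weights are made up for illustration.

```python
import math
from collections import defaultdict

# weights maps (feature, class) pairs to trained lambdas (toy values).
weights = defaultdict(float, {("t1", "c1"): 0.5, ("t3", "c1"): 1.2,
                              ("t1", "c2"): 0.8, ("t4", "c2"): 0.3})
classes = ["c1", "c2", "c3"]

def decode(x):
    Z = 0.0
    score = {}
    for y in classes:
        s = sum(weights[(t, y)] for t in x)   # sum += weight for (t, y)
        score[y] = math.exp(s)
        Z += score[y]
    probs = {y: v / Z for y, v in score.items()}
    return max(probs, key=probs.get), probs   # most probable class

print(decode(["t1", "t3"]))                   # picks "c1"
```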