# Maximum Entropy: Modeling, Decoding, Training

1. Maximum Entropy: Modeling, Decoding, Training • Advanced Statistical Methods in NLP • Ling 572 • February 2, 2012

2. Roadmap • MaxEnt: • Recap • Modeling: • Computing expectations • Constraints in the model • Decoding • HW #5 • MaxEnt (cont’d) • Training

3. Maximum Entropy Principle: Summary • Among all probability distributions p in P that satisfy the set of constraints, select the p* that maximizes the entropy H(p) • Questions: • 1) How do we model the constraints? • 2) How do we select the distribution?

4. Example II: MT (Berger, 1996) • What if we find out that the translator uses dans or en 30% of the time? • Constraint: p(dans) + p(en) = 3/10 • Now what is the maxent model? • p(dans) = p(en) = 3/20 • p(à) = p(au cours de) = p(pendant) = 7/30 • What if we also know the translator picks à or dans 50% of the time? • Add a new constraint: p(à) + p(dans) = 0.5 • Now what is the maxent model? • Not intuitively obvious…
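The slides leave the two-constraint answer open. As a sketch (the fixed-point iteration below is an illustrative solution method, not from the slides): the maxent solution has the product form p(w) = π·α^f1(w)·β^f2(w), where f1 fires on {dans, en} and f2 fires on {à, dans}, and we can solve for π, α, β numerically.

```python
# Numerically solve the maxent model for Berger's two-constraint MT
# example (a sketch; the iteration scheme is illustrative, not part of
# the original slides). The five candidate translations are
# {dans, en, a (à), au cours de, pendant}.

def solve_berger(iters=2000):
    # p(w) = pi * alpha^f1(w) * beta^f2(w); f1 fires on {dans, en},
    # f2 fires on {a, dans}. Iterate each equation to its fixed point.
    pi, alpha, beta = 0.2, 1.0, 1.0
    for _ in range(iters):
        alpha = 0.3 / (pi * (beta + 1.0))    # enforce p(dans)+p(en)=0.3
        beta = 0.5 / (pi * (alpha + 1.0))    # enforce p(a)+p(dans)=0.5
        pi = 1.0 / (alpha * beta + alpha + beta + 2.0)  # normalize
    return {
        "dans": pi * alpha * beta,  # both features fire
        "en": pi * alpha,           # only f1 fires
        "a": pi * beta,             # only f2 fires ("a" stands for à)
        "au cours de": pi,
        "pendant": pi,
    }

p = solve_berger()
```

Both constraints hold in the result, yet the individual probabilities are not obvious in advance, which is exactly the slide's point.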

5. Feature Functions • A feature function is a binary-valued indicator function • In text classification, j refers to a specific (feature, class) pair, s.t. the feature is present when y is that class: • fj(x,y) = 1 if y = “guns” and x includes “rifle”; 0 otherwise
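The indicator feature from the slide can be sketched directly (the function name is ours; x is assumed to be a collection of tokens):

```python
# A binary feature function for text classification, mirroring the
# slide's example: fires when the class is "guns" and the document
# contains the word "rifle".

def f_guns_rifle(x, y):
    """Return 1 if y == "guns" and "rifle" appears among the tokens of x."""
    return 1 if y == "guns" and "rifle" in x else 0
```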

6. Empirical Expectation: Example • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t3 t4 • x4 c3 t1 t3 • Raw counts of (feature, class) pairs:

|    | c1 | c2 | c3 |
|----|----|----|----|
| t1 | 1  | 1  | 1  |
| t2 | 1  | 0  | 0  |
| t3 | 2  | 0  | 1  |
| t4 | 1  | 1  | 0  |

Example due to F. Xia

7. Empirical Expectation: Example • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t3 t4 • x4 c3 t1 t3 • Empirical distribution (counts / N, with N = 4):

|    | c1  | c2  | c3  |
|----|-----|-----|-----|
| t1 | 1/4 | 1/4 | 1/4 |
| t2 | 1/4 | 0   | 0   |
| t3 | 1/2 | 0   | 1/4 |
| t4 | 1/4 | 1/4 | 0   |

Example due to F. Xia

8. Calculating Empirical Expectation • Build previous table • Collect a set of training samples of size N • For each instance x in the training data: • y = true label of x • For each feature t in x: • empirical_expectation[t][y] += 1/N
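The pseudocode above can be sketched in Python on the slides' toy data set:

```python
# Compute empirical expectations for the toy training set in the slides
# (4 instances; features t1..t4, classes c1..c3). Follows the pseudocode:
# empirical_expectation[t][y] += 1/N for each feature t in each instance.

from collections import defaultdict

training_data = [
    ("c1", ["t1", "t2", "t3"]),  # x1
    ("c2", ["t1", "t4"]),        # x2
    ("c1", ["t3", "t4"]),        # x3
    ("c3", ["t1", "t3"]),        # x4
]

def empirical_expectation(data):
    N = len(data)
    exp = defaultdict(float)
    for y, feats in data:   # y = true label of x
        for t in feats:     # each feature present in x
            exp[(t, y)] += 1.0 / N
    return exp

emp = empirical_expectation(training_data)
```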

9. Model Expectation (Detail) • Ep[fj] = Σx p̃(x) Σy P(y|x) fj(x,y)

10. Model Expectation: Example • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t4 • x4 c3 t1 t3 • “Raw” counts of (feature, class) pairs:

|    | c1 | c2 | c3 |
|----|----|----|----|
| t1 | 1  | 1  | 1  |
| t2 | 1  | 0  | 0  |
| t3 | 1  | 0  | 1  |
| t4 | 1  | 1  | 0  |

Example due to F. Xia

11. Model Expectation: Example • Let P(y|xi) = 1/3 • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t4 • x4 c3 t1 t3 • “Raw” counts • Example due to F. Xia

14. Model Expectation: Example • Let P(y|xi) = 1/3 • Training data: • x1 c1 t1 t2 t3 • x2 c2 t1 t4 • x3 c1 t4 • x4 c3 t1 t3 • Model expectation (each occurrence of a feature contributes (1/N)·P(y|x) = 1/12, for every class y):

|    | c1   | c2   | c3   |
|----|------|------|------|
| t1 | 3/12 | 3/12 | 3/12 |
| t2 | 1/12 | 1/12 | 1/12 |
| t3 | 2/12 | 2/12 | 2/12 |
| t4 | 2/12 | 2/12 | 2/12 |

Example due to F. Xia

17. Calculating Model Expectation • Build previous table • Collect a set of training samples of size N • For each instance x in the training data: • Compute P(y|x) for each y in Y • For each feature t in x: • For each y in Y: • model_expectation[t][y] += 1/N*P(y|x)
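The model-expectation pseudocode above can be sketched in Python, using the slides' deliberately simple model P(y|x) = 1/3 for every class:

```python
# Compute model expectations for the toy data of slides 10-14, assuming
# the uniform model P(y|x) = 1/3 as in the slides. Follows the pseudocode:
# for each instance x, each feature t in x, and each class y,
# model_expectation[t][y] += (1/N) * P(y|x).

from collections import defaultdict

training_data = [
    ["t1", "t2", "t3"],  # x1
    ["t1", "t4"],        # x2
    ["t4"],              # x3
    ["t1", "t3"],        # x4
]
classes = ["c1", "c2", "c3"]

def model_expectation(data, classes, p_y_given_x):
    N = len(data)
    exp = defaultdict(float)
    for feats in data:
        for t in feats:
            for y in classes:
                exp[(t, y)] += (1.0 / N) * p_y_given_x
    return exp

mod = model_expectation(training_data, classes, 1.0 / 3.0)
```

With a real model, `p_y_given_x` would of course vary with x and y; the constant here just reproduces the slides' worked example.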

19. Comparing Expectations • Empirical Expectation: Ep̃[fj] = Σx,y p̃(x,y) fj(x,y) • Model Expectation: Ep[fj] = Σx p̃(x) Σy P(y|x) fj(x,y)

23. Incorporating Constraints • Maximum entropy models: • Model known constraints • Otherwise, apply maximum entropy (minimal commitment) • Discriminative models: maximize conditional likelihood • What are our constraints? • Our model must be consistent with the training data • So, model expectation = empirical expectation

24. Conditional Likelihood • Given data (X,Y), the conditional likelihood is a function of the parameters λ: • CL(λ) = Πi P(yi|xi; λ), or equivalently log CL(λ) = Σi log P(yi|xi; λ)

26. Constraints • Make model more consistent with training data • Move away from simplest maximum entropy • Make model less uniform • Lower entropy • Increase likelihood

29. The Modeling Problem • Goal: Find p* s.t. p* = argmaxp H(p) • Maximize H(p) • subject to the constraints: • P = {p | Ep[fj] = dj, j = 1, …, k} • where dj = Ep̃[fj] is the empirical expectation

32. Maximizing H(p) • Problem: Hard to analytically compute the max of H(p) • Approach: • Convert to an alternate form that is easier to optimize and whose optimum is also an optimum of H(p) • Technically, employ Lagrange multipliers: • Find multipliers λ that minimize the Lagrangian • The solution minimizing the new form will maximize H(p)

33. Solving w/ Lagrange Multipliers • Minimize A(p), the Lagrangian combining -H(p) with the constraint terms • Set A′(p) = 0, and solve for p
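A sketch of the stationarity condition (the standard derivation for the conditional model, not reproduced verbatim from the slides; one common sign convention for the multipliers is used):

```latex
% Conditional entropy: H(p) = -\sum_{x,y}\tilde{p}(x)\,p(y|x)\log p(y|x)
\begin{align*}
A(p) &= -H(p) - \sum_j \lambda_j\big(E_p[f_j] - d_j\big)
        - \sum_x \mu_x \Big(\sum_y p(y|x) - 1\Big) \\
\frac{\partial A}{\partial p(y|x)} &=
  \tilde{p}(x)\big(1 + \log p(y|x)\big)
  - \sum_j \lambda_j\, \tilde{p}(x)\, f_j(x,y) - \mu_x = 0 \\
\Rightarrow\quad p(y|x) &= \frac{1}{Z(x)}
  \exp\Big(\sum_j \lambda_j f_j(x,y)\Big)
\end{align*}
```

The x-dependent constants from the normalization multipliers μx are absorbed into Z(x), yielding the exponential (log-linear) form used in the rest of the deck.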

35. The Modeling Problem • Goal: Find p* s.t. p* = argmaxp H(p) • subject to P = {p | Ep[fj] = dj, j = 1, …, k} • Now what? • Are there p’s that satisfy these constraints? • Does p* exist? • Is p* unique? • What is the form of p*? • How can we compute it?

38. p*: Existence, Form, & Uniqueness • P = {p | Ep[fj] = dj, j = 1, …, k} • Q = the set of models of exponential (log-linear) form • Theorem 1 of (Ratnaparkhi, 1997) shows that: • If p* ∈ P ∩ Q, then p* = argmaxp∈P H(p) and p* is unique

42. p* • Two forms: • By optimization: p*(y|x) = (1/Z(x)) exp(Σj λj fj(x,y)) • By constraint: p*(y|x) = π Πj αj^fj(x,y) • Equivalent, with: π = 1/Z; λj = ln αj
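The equivalence of the two parameterizations can be checked numerically (the feature vectors and α values below are arbitrary illustrative numbers, not from the slides):

```python
# Numerically check that the two forms of p* coincide when pi = 1/Z and
# lambda_j = ln(alpha_j). Feature values and alphas are arbitrary.

import math

alphas = [2.0, 0.5, 1.7]                 # alpha_j > 0, arbitrary
lambdas = [math.log(a) for a in alphas]  # lambda_j = ln(alpha_j)

# Candidate outputs y, each with a binary feature vector f(x, y)
feats = {"y1": [1, 0, 1], "y2": [0, 1, 1], "y3": [0, 0, 0]}

# Optimization form: p(y|x) = (1/Z) exp(sum_j lambda_j f_j(x, y))
scores = {y: math.exp(sum(l * f for l, f in zip(lambdas, fv)))
          for y, fv in feats.items()}
Z = sum(scores.values())
p_opt = {y: s / Z for y, s in scores.items()}

# Constraint form: p(y|x) = pi * prod_j alpha_j^{f_j(x, y)}, with pi = 1/Z
pi = 1.0 / Z
p_con = {y: pi * math.prod(a ** f for a, f in zip(alphas, fv))
         for y, fv in feats.items()}
```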

45. The Model: Summary • Goal: Find p* s.t. p* = argmaxp H(p) • subject to P = {p | Ep[fj] = dj, j = 1, …, k} • p*: • Is unique • Maximizes conditional likelihood • Is of the form p*(y|x) = (1/Z(x)) exp(Σj λj fj(x,y))

46. Decoding

47. Decoding • p(y|x) = (1/Z(x)) exp(Σj λj fj(x,y)), where Z is the normalization term

50. Decoding • Given a trained model with λjs: • Z = 0 • For each y in Y: • sum = 0; # Initialize or set to default_weight • For each t in x: • sum += weight for (t,y) • result[y] = exp(sum); Z += result[y] • For each y in Y: • p(y|x) = result[y] / Z
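The decoding loop above can be sketched as a short Python function (the weight table is a made-up illustration, not trained values):

```python
# A minimal maxent decoder following the slides' pseudocode: sum the
# lambda weights of the active (feature, class) pairs, exponentiate,
# and normalize by Z. The weights below are hypothetical.

import math

weights = {  # lambda_j for (feature, class) pairs; illustrative numbers
    ("t1", "c1"): 0.5, ("t1", "c2"): -0.2,
    ("t3", "c1"): 1.0, ("t3", "c3"): 0.3,
}
classes = ["c1", "c2", "c3"]

def decode(x, classes, weights):
    """Return p(y|x) for each class y; x is a list of active features."""
    scores = {}
    for y in classes:
        s = sum(weights.get((t, y), 0.0) for t in x)  # sum of lambdas
        scores[y] = math.exp(s)
    Z = sum(scores.values())  # normalization term
    return {y: v / Z for y, v in scores.items()}

p = decode(["t1", "t3"], classes, weights)
```

Missing (feature, class) pairs default to weight 0, i.e. they contribute a factor of 1 to the score.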