Maximum Entropy: Modeling, Decoding, Training


**Maximum Entropy: Modeling, Decoding, Training**

Advanced Statistical Methods in NLP, Ling 572, February 2, 2012

**Roadmap**

• MaxEnt:
  • Recap
  • Modeling:
    • Computing expectations
    • Constraints in the model
  • Decoding
• HW #5
• MaxEnt (cont'd):
  • Training

**Maximum Entropy Principle: Summary**

• Among all probability distributions p in P that satisfy the set of constraints, select the p* that maximizes the entropy H(p):
  p* = argmax_{p ∈ P} H(p)
• Questions:
  1) How do we model the constraints?
  2) How do we select the distribution?

**Example II: MT (Berger, 1996)**

• What if we find out that the translator uses dans or en 30% of the time?
• Constraint: p(dans) + p(en) = 3/10
• Now what is the maxent model?
  • p(dans) = p(en) = 3/20
  • p(à) = p(au cours de) = p(pendant) = 7/30
• What if we also know that the translator picks à or dans 50% of the time?
• Add a new constraint: p(à) + p(dans) = 1/2
• Now what is the maxent model?
  • Not intuitively obvious…

**Feature Functions**

• A feature function is a binary-valued indicator function.
• In text classification, j refers to a specific (feature, class) pair such that the feature is present when y is that class:
  f_j(x, y) = 1 if y = "guns" and x includes "rifle", 0 otherwise

**Empirical Expectation: Example**

• Training data:
  x1 c1 t1 t2 t3
  x2 c2 t1 t4
  x3 c1 t3 t4
  x4 c3 t1 t3
• [Slide tables: raw counts and the empirical distribution]
• Example due to F. Xia

**Calculating Empirical Expectation**

• To build the previous table:
  • Collect a set of training samples of size N
  • For each instance x in the training data:
    • y = true label of x
    • For each feature t in x:
      • empirical_expectation[t][y] += 1/N

**Model Expectation: Example**

• Let P(y|x_i) = 1/3 for every y
• Training data:
  x1 c1 t1 t2 t3
  x2 c2 t1 t4
  x3 c1 t4
  x4 c3 t1 t3
• [Slide tables: "raw" counts and the model expectation]
• Example due to F. Xia

**Calculating Model Expectation**

• To build the previous table:
  • Collect a set of training samples of size N
  • For each instance x in the training data:
    • Compute P(y|x) for each y in Y
    • For each feature t in x:
      • For each y in Y:
        • model_expectation[t][y] += (1/N) * P(y|x)

**Comparing Expectations**

• Empirical expectation: E_p̃[f_j] = Σ_{x,y} p̃(x, y) f_j(x, y)
• Model expectation: E_p[f_j] = Σ_x p̃(x) Σ_y p(y|x) f_j(x, y)

**Incorporating Constraints**

• Maximum entropy models:
  • Model known constraints
  • Otherwise, apply maximum entropy (minimal commitment)
• Discriminative models: maximize conditional likelihood
• What are our constraints?
  • Our model must be consistent with the training data
  • So, model expectation = empirical expectation: E_p[f_j] = E_p̃[f_j]
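The two expectation loops above can be sketched in Python. The toy data is the four-instance set from the model-expectation slides, and the uniform P(y|x) = 1/3 is the slides' simplifying assumption; a trained model would supply real conditional probabilities instead.

```python
from collections import defaultdict

# Toy training data from the slides: (instance, true label, features).
data = [
    ("x1", "c1", ["t1", "t2", "t3"]),
    ("x2", "c2", ["t1", "t4"]),
    ("x3", "c1", ["t4"]),
    ("x4", "c3", ["t1", "t3"]),
]
labels = ["c1", "c2", "c3"]
N = len(data)

# Empirical expectation: count (feature, true-label) pairs, scaled by 1/N.
empirical = defaultdict(float)
for _, y, feats in data:
    for t in feats:
        empirical[(t, y)] += 1.0 / N

# Model expectation: weight every (feature, label) pair by P(y|x).
# Here P(y|x) = 1/3 (uniform), the slides' assumption.
def p_y_given_x(y, x):
    return 1.0 / len(labels)

model = defaultdict(float)
for x, _, feats in data:
    for t in feats:
        for y in labels:
            model[(t, y)] += (1.0 / N) * p_y_given_x(y, x)

print(empirical[("t1", "c1")])  # 0.25 (t1 occurs once with c1, N = 4)
print(model[("t4", "c1")])      # 2 instances contain t4: 2 * (1/4) * (1/3) ≈ 0.1667
```

Note that both tables sum to the same total (average number of active features per instance), but the mass is distributed differently: the empirical table puts it on true labels only, while the uniform model spreads it across all classes.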
**Conditional Likelihood**

• Given data (X, Y), the conditional likelihood is a function of the parameters λ:
  L(λ) = Σ_i log p(y_i | x_i; λ)

**Constraints**

• Constraints make the model more consistent with the training data
• They move the model away from the simplest maximum entropy solution:
  • the model becomes less uniform
  • entropy decreases
  • likelihood increases

**The Modeling Problem**

• Goal: find p* such that
  p* = argmax_p H(p)
  subject to the constraints P = {p | E_p[f_j] = d_j, j = 1, …, k}

**Maximizing H(p)**

• Problem: it is hard to analytically compute the maximum of H(p) under constraints
• Approach:
  • Convert to an alternate form that is easier to optimize and whose optimum is also an optimum for H(p)
  • Technically, employ Lagrange multipliers:
    • Find multipliers λ that minimize the Lagrangian
    • The solution minimizing the new form will maximize H(p)

**Solving with Lagrange Multipliers**

• Minimize A(p)
• Set A′(p) = 0 and solve

**The Modeling Problem (cont'd)**

• Now what?
  • Are there p's that satisfy these constraints?
  • Does p* exist?
  • Is p* unique?
  • What is the form of p*?
  • How can we compute it?
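Before turning to those questions, the entropy-maximization goal itself can be checked numerically on the Berger MT example from earlier: among distributions satisfying p(dans) + p(en) = 3/10, the maxent solution splits the mass evenly within each group. A minimal sketch (not from the slides; the alternative distributions are invented for comparison):

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), with 0 log 0 treated as 0."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

# Maxent solution under the single constraint p(dans) + p(en) = 3/10:
# split 3/10 evenly over {dans, en} and the remaining 7/10 evenly
# over the other three translations.
maxent = {"dans": 3/20, "en": 3/20,
          "à": 7/30, "au cours de": 7/30, "pendant": 7/30}

# Invented alternatives that satisfy the same constraint but are less uniform.
alternatives = [
    {"dans": 0.25, "en": 0.05, "à": 7/30, "au cours de": 7/30, "pendant": 7/30},
    {"dans": 3/20, "en": 3/20, "à": 0.5, "au cours de": 0.1, "pendant": 0.1},
]

h_star = entropy(maxent)
for q in alternatives:
    assert abs(q["dans"] + q["en"] - 0.3) < 1e-9  # constraint still holds
    assert entropy(q) < h_star                    # the even split has higher entropy
```

This only spot-checks a few candidates; the slides' point is that with a single constraint the answer is intuitive, whereas after adding p(à) + p(dans) = 1/2 the maximizer is no longer obvious and we need the machinery that follows.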
**p*: Existence, Form, & Uniqueness**

• P = {p | E_p[f_j] = E_p̃[f_j], j = 1, …, k}
• Theorem 1 of (Ratnaparkhi, 1997) shows that if p* lies in P and has the exponential (log-linear) form, then p* = argmax_{p ∈ P} H(p), and p* is unique

**p***

• Two equivalent forms, one from the optimization and one from the constraints:
  • Product form: p(y|x) = π Π_j α_j^{f_j(x,y)}
  • Exponential form: p(y|x) = (1/Z) exp(Σ_j λ_j f_j(x, y))
• Equivalent with π = 1/Z and λ_j = ln α_j

**The Model: Summary**

• Goal: find p* such that p* = argmax_p H(p), subject to P = {p | E_p[f_j] = d_j, j = 1, …, k}
• p*:
  • is unique
  • maximizes conditional likelihood
  • is of the form p*(y|x) = (1/Z(x)) exp(Σ_j λ_j f_j(x, y))

**Decoding**

• p(y|x) = (1/Z(x)) exp(Σ_j λ_j f_j(x, y)), where Z is the normalization term
• Given a trained model with its λ_j's:
  • Z = 0
  • For each y in Y:
    • sum = 0  # initialize, or set to default_weight
    • For each t in x:
      • sum += weight for (t, y)
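The decoding pseudocode above breaks off after accumulating feature weights. A minimal sketch of the complete procedure, assuming the remaining steps exponentiate each class score and normalize by Z; the features, classes, and weight values below are invented for illustration:

```python
import math

# Hypothetical trained weights lambda_(t,y); these values are made up.
weights = {
    ("rifle", "guns"): 2.0, ("rifle", "politics"): -0.5,
    ("vote", "guns"): -1.0, ("vote", "politics"): 1.0,
}
labels = ["guns", "politics"]

def decode(x):
    """Return p(y|x) = exp(sum_j lambda_j f_j(x, y)) / Z(x)."""
    scores = {}
    for y in labels:
        # sum the weights of the features active in x for this class
        s = sum(weights.get((t, y), 0.0) for t in x)
        scores[y] = math.exp(s)
    Z = sum(scores.values())  # normalization term
    return {y: scores[y] / Z for y in labels}

p = decode(["rifle", "vote"])  # a document containing "rifle" and "vote"
best = max(p, key=p.get)       # pick the highest-probability class
```

If only the best class is needed, the exponentiation and division by Z can be skipped, since they do not change the argmax over the raw scores.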