Maximum Entropy: Modeling, Decoding, Training

Maximum Entropy: Modeling, Decoding, Training • Advanced Statistical Methods in NLP • Ling 572 • February 2, 2012

Roadmap • MaxEnt: • Recap • Modeling: • Computing expectations • Constraints in the model • Decoding • HW #5 • MaxEnt (cont’d) • Training

Maximum Entropy Principle: Summary • Among all probability distributions p in P that satisfy the set of constraints, select the p* that maximizes the entropy: $H(p) = -\sum_{x} p(x) \log p(x)$ • Questions: • 1) How do we model the constraints? • 2) How can we select the distribution?

Example II: MT (Berger, 1996) • What if we find out that the translator uses dans or en 30% of the time? • Constraint: p(dans)+p(en)=3/10 • Now what is the maxent model? • p(dans)=p(en)=3/20 • p(à)=p(au cours de)=p(pendant)=7/30 • What if we also know the translator picks à or dans 50% of the time? • Add new constraint: p(à)+p(dans)=0.5 • Now what is the maxent model? • Not intuitively obvious…
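The two-constraint case can be solved numerically. The sketch below uses iterative proportional fitting, which is a swapped-in technique and not necessarily the method used in the lecture: starting from the uniform distribution, it repeatedly rescales the probabilities so that each constraint holds in turn. The function name `maxent_by_ipf` and the iteration count are assumptions made for illustration.

```python
def maxent_by_ipf(outcomes, constraints, iterations=200):
    """Approximate the maxent distribution under constraints of the form
    sum(p[o] for o in subset) == target, via iterative proportional fitting."""
    # Start from the uniform distribution (maximum entropy with no constraints).
    p = {o: 1.0 / len(outcomes) for o in outcomes}
    for _ in range(iterations):
        for subset, target in constraints:
            mass = sum(p[o] for o in subset)
            for o in outcomes:
                if o in subset:
                    p[o] *= target / mass                    # scale subset to target
                else:
                    p[o] *= (1.0 - target) / (1.0 - mass)    # scale the rest
    return p

outcomes = ["dans", "en", "à", "au cours de", "pendant"]

# One constraint: p(dans) + p(en) = 3/10
p_one = maxent_by_ipf(outcomes, [({"dans", "en"}, 0.3)])

# Two constraints: additionally p(à) + p(dans) = 0.5
p_two = maxent_by_ipf(outcomes, [({"dans", "en"}, 0.3), ({"à", "dans"}, 0.5)])

for o in outcomes:
    print(f"{o:12s} {p_one[o]:.4f} {p_two[o]:.4f}")
```

With only the first constraint, the result matches the slide (3/20 for dans and en, 7/30 for the rest). With both constraints the probabilities are no longer simple fractions, which is why the answer is not intuitively obvious.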

Feature Functions • A feature function is a binary-valued indicator function: $f_j : X \times Y \to \{0, 1\}$ • In text classification, j refers to a specific (feature, class) pair such that the feature is present when y is the class. • $f_j(x, y) = 1$ if y = “guns” and x includes “rifle”; $f_j(x, y) = 0$ otherwise
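As a minimal sketch, the indicator on this slide can be written as a closure over one (feature, class) pair. The helper name `make_feature` and the sample document are hypothetical; x is assumed to be a set of words.

```python
def make_feature(word, label):
    """Build f_j for one (feature, class) pair:
    f_j(x, y) = 1 if y == label and word occurs in x, else 0."""
    def f(x, y):
        return 1 if (y == label and word in x) else 0
    return f

# The slide's example: y = "guns" and x includes "rifle".
f_rifle_guns = make_feature("rifle", "guns")

doc = {"the", "rifle", "was", "sold"}      # hypothetical document as a word set
print(f_rifle_guns(doc, "guns"))           # 1
print(f_rifle_guns(doc, "sports"))         # 0
print(f_rifle_guns({"ball"}, "guns"))      # 0
```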

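Empirical Expectation: Example
• Training data:
    x1 c1 t1 t2 t3
    x2 c2 t1 t4
    x3 c1 t3 t4
    x4 c3 t1 t3
• Raw counts of each (feature, class) pair, computed from the training data:
          c1   c2   c3
    t1    1    1    1
    t2    1    0    0
    t3    2    0    1
    t4    1    1    0
• Example due to F. Xia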

Empirical Expectation: Example
• Training data:
    x1 c1 t1 t2 t3
    x2 c2 t1 t4
    x3 c1 t3 t4
    x4 c3 t1 t3
• Empirical distribution (raw counts divided by N = 4):
          c1    c2    c3
    t1    1/4   1/4   1/4
    t2    1/4   0     0
    t3    1/2   0     1/4
    t4    1/4   1/4   0
• Example due to F. Xia

Calculating Empirical Expectation
• Build the previous table:
• Collect a set of training samples of size N
• For each instance x in the training data:
    y = true label of x
    For each feature t in x:
        empirical_expectation[t][y] += 1/N
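The loop above can be sketched in Python using the four training instances from the example slides. The data layout (a list of (label, features) pairs) and the use of exact fractions are choices made for illustration only.

```python
from collections import defaultdict
from fractions import Fraction

# Training data from the example: (true label, features present)
training_data = [
    ("c1", ["t1", "t2", "t3"]),  # x1
    ("c2", ["t1", "t4"]),        # x2
    ("c1", ["t3", "t4"]),        # x3
    ("c3", ["t1", "t3"]),        # x4
]

def empirical_expectation(data):
    """Accumulate 1/N for every (feature, true label) pair seen in the data."""
    n = len(data)
    expectation = defaultdict(Fraction)  # missing entries default to 0
    for label, features in data:
        for t in features:
            expectation[(t, label)] += Fraction(1, n)
    return dict(expectation)

emp = empirical_expectation(training_data)
for (t, y), value in sorted(emp.items()):
    print(t, y, value)
```

Pairs that never co-occur in the training data, such as (t2, c2), simply have no entry, which corresponds to an empirical expectation of 0.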

Model Expectation (Detail)

Model Expectation: Example
• Let P(y|xi) = 1/3
• Training data:
    x1 c1 t1 t2 t3
    x2 c2 t1 t4
    x3 c1 t4
    x4 c3 t1 t3
• “Raw” counts (each instance containing t contributes P(y|xi) = 1/3 to every class):
          c1    c2    c3
    t1    1     1     1
    t2    1/3   1/3   1/3
    t3    2/3   2/3   2/3
    t4    2/3   2/3   2/3
• Example due to F. Xia

Model Expectation: Example
• Let P(y|xi) = 1/3
• Training data: as on the previous slide
• Model expectation (“raw” counts divided by N = 4):
          c1     c2     c3
    t1    1/4    1/4    1/4
    t2    1/12   1/12   1/12
    t3    1/6    1/6    1/6
    t4    1/6    1/6    1/6
• Example due to F. Xia

Calculating Model Expectation
• Build the previous table:
• Collect a set of training samples of size N
• For each instance x in the training data:
    Compute P(y|x) for each y in Y
    For each feature t in x:
        For each y in Y:
            model_expectation[t][y] += 1/N * P(y|x)
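A minimal Python sketch of this loop follows, using the training data from the model expectation example and the uniform model P(y|x) = 1/3. The `prob` callback (mapping an instance's features to a dict of class probabilities) is a hypothetical interface chosen for illustration.

```python
from collections import defaultdict
from fractions import Fraction

classes = ["c1", "c2", "c3"]

# Training data from the model expectation example: (true label, features)
model_data = [
    ("c1", ["t1", "t2", "t3"]),  # x1
    ("c2", ["t1", "t4"]),        # x2
    ("c1", ["t4"]),              # x3
    ("c3", ["t1", "t3"]),        # x4
]

def uniform_prob(features):
    """Assumed model for the example: P(y|x) = 1/3 for every class."""
    return {y: Fraction(1, len(classes)) for y in classes}

def model_expectation(data, classes, prob):
    """Accumulate P(y|x)/N for every feature in x and every class y.
    The true label is ignored: only the model's distribution matters."""
    n = len(data)
    expectation = defaultdict(Fraction)
    for _, features in data:
        p = prob(features)
        for t in features:
            for y in classes:
                expectation[(t, y)] += Fraction(1, n) * p[y]
    return dict(expectation)

mod = model_expectation(model_data, classes, uniform_prob)
for (t, y), value in sorted(mod.items()):
    print(t, y, value)
```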

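Comparing Expectations • Empirical Expectation: $E_{\tilde{p}}[f_j] = \sum_{x,y} \tilde{p}(x, y)\, f_j(x, y) = \frac{1}{N} \sum_{i=1}^{N} f_j(x_i, y_i)$ • Model Expectation: $E_{p}[f_j] = \sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_j(x, y) = \frac{1}{N} \sum_{i=1}^{N} \sum_{y \in Y} p(y \mid x_i)\, f_j(x_i, y)$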

Incorporating Constraints • Maximum entropy models: • Model known constraints • Otherwise, apply maximum entropy (minimal commitment) • Discriminative models: maximize conditional likelihood • What are our constraints? • Our model must be consistent with training data • So, model expectation = empirical expectation

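Conditional Likelihood • Given data (X,Y), the conditional likelihood is a function of the parameters λ: $L(\lambda) = \prod_{i=1}^{N} p_\lambda(y_i \mid x_i)$, or in log form $\log L(\lambda) = \sum_{i=1}^{N} \log p_\lambda(y_i \mid x_i)$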
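To make the dependence on λ concrete, the sketch below evaluates the conditional log-likelihood of a small log-linear model, assuming the standard form p(y|x) ∝ exp(sum of weights for the (feature, class) pairs active in x). The weight dictionary, function names, and the single nonzero weight are hypothetical.

```python
import math

classes = ["c1", "c2", "c3"]
data = [
    ("c1", ["t1", "t2", "t3"]),
    ("c2", ["t1", "t4"]),
    ("c1", ["t3", "t4"]),
    ("c3", ["t1", "t3"]),
]

def conditional_prob(weights, features, classes):
    """p(y|x) under an assumed log-linear model; absent weights count as 0."""
    scores = {y: sum(weights.get((t, y), 0.0) for t in features) for y in classes}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

def conditional_log_likelihood(weights, data, classes):
    """Sum of log p(y_i | x_i) over the training instances."""
    return sum(
        math.log(conditional_prob(weights, features, classes)[label])
        for label, features in data
    )

ll_zero = conditional_log_likelihood({}, data, classes)                  # uniform model
ll_one = conditional_log_likelihood({("t2", "c1"): 1.0}, data, classes)  # one weight set
print(ll_zero, ll_one)
```

With all weights at zero the model is uniform, so the log-likelihood is 4·log(1/3); raising the weight for (t2, c1), a pair that only occurs with its true label in this data, increases it.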

Constraints • Make model more consistent with training data • Move away from simplest maximum entropy • Make model less uniform • Lower entropy • Increase likelihood
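The entropy claim can be checked on the translation example from earlier in the deck: the uniform distribution over the five French phrases versus the maxent distribution under the constraint p(dans)+p(en)=3/10. This is only an illustrative sketch; the `entropy` helper is not part of the slides, and natural logarithms are an arbitrary choice.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Order: dans, en, à, au cours de, pendant
uniform = [1 / 5] * 5
constrained = [3 / 20, 3 / 20, 7 / 30, 7 / 30, 7 / 30]

print(entropy(uniform))      # equals log(5), the maximum for five outcomes
print(entropy(constrained))  # lower: the constraint makes the model less uniform
```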

The Modeling Problem • Goal: Find p* s.t. $p^* = \arg\max_{p \in P} H(p)$ • $P = \{p \mid E_p[f_j] = E_{\tilde{p}}[f_j] = d_j,\ j \in \{1, \dots, k\}\}$ • Equivalently: maximize H(p) subject to the constraints $E_p[f_j] = d_j$, $j \in \{1, \dots, k\}$

Maximizing H(p) • Problem: Hard to analytically compute the max of H(p) • Approach: • Convert to an alternate form that is easier to optimize and for which the optimum is also an optimum for H(p) • Technically, employ Lagrange multipliers • Find multipliers λ that minimize the Lagrangian • Solution minimizing the new form will maximize H(p)

Solving w/ Lagrange Multipliers • Minimize A(p) • Set A′(p) = 0 and solve
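The slide's own definition of A(p) did not survive in this transcript, so the following is a sketch of one common way to carry out the step, written for an unconditional distribution p(x). The multiplier μ for the normalization constraint and the sign convention are assumptions; the conditional case p(y|x) follows the same pattern with a separate normalizer for each x.

```latex
\begin{aligned}
\Lambda(p, \lambda, \mu)
  &= -\sum_{x} p(x) \log p(x)
     + \sum_{j=1}^{k} \lambda_j \Big( \sum_{x} p(x) f_j(x) - d_j \Big)
     + \mu \Big( \sum_{x} p(x) - 1 \Big) \\
\frac{\partial \Lambda}{\partial p(x)}
  &= -\log p(x) - 1 + \sum_{j=1}^{k} \lambda_j f_j(x) + \mu = 0 \\
p(x)
  &= e^{\mu - 1} \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big)
   = \frac{1}{Z} \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big),
  \qquad
  Z = \sum_{x'} \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x') \Big)
\end{aligned}
```

The last line uses the normalization constraint to identify $e^{\mu - 1}$ with $1/Z$, which gives the exponential (log-linear) form of the solution.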

The Modeling Problem • Goal: Find p* s.t. $p^* = \arg\max_{p \in P} H(p)$ • $P = \{p \mid E_p[f_j] = E_{\tilde{p}}[f_j] = d_j,\ j \in \{1, \dots, k\}\}$ • Now what? • Are there p’s that satisfy these constraints? • Does p* exist? • Is p* unique? • What is the form of p*? • How can we compute it?

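p*: Existence, Form, & Uniqueness • $P = \{p \mid E_p[f_j] = E_{\tilde{p}}[f_j],\ j \in \{1, \dots, k\}\}$ • $Q = \{p \mid p(y \mid x) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(x, y)},\ \alpha_j > 0\}$ • Theorem 1 of (Ratnaparkhi, 1997) shows that: • If $p^* \in P \cap Q$, then $p^* = \arg\max_{p \in P} H(p)$ and p* is unique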

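p* • Two forms: • By optimization: $p(y \mid x) = \frac{1}{Z} \exp\big(\sum_{j} \lambda_j f_j(x, y)\big)$ • By constraint: $p(y \mid x) = \pi \prod_{j} \alpha_j^{f_j(x, y)}$ • Equivalent: π = 1/Z; λj = ln αj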
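A quick numerical check of the stated equivalence, assuming the two forms are the exponential form (1/Z)·exp(Σ λj fj) and the product form π·Π αj^fj. The weights and feature vectors below are made up for illustration; each row holds the values fj(x, y) for one candidate class y.

```python
import math

def p_exponential(lambdas, feature_vectors):
    """(1/Z) * exp(sum_j lambda_j * f_j) for each candidate class."""
    scores = [math.exp(sum(l * f for l, f in zip(lambdas, fv)))
              for fv in feature_vectors]
    z = sum(scores)
    return [s / z for s in scores]

def p_product(alphas, feature_vectors):
    """pi * prod_j alpha_j ** f_j for each candidate class, with pi = 1/Z."""
    scores = [math.prod(a ** f for a, f in zip(alphas, fv))
              for fv in feature_vectors]
    pi = 1.0 / sum(scores)
    return [pi * s for s in scores]

lambdas = [0.7, -1.2, 0.3]                  # hypothetical weights
alphas = [math.exp(l) for l in lambdas]     # alpha_j = exp(lambda_j)
feature_vectors = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]  # one row per class y

p_exp = p_exponential(lambdas, feature_vectors)
p_prod = p_product(alphas, feature_vectors)
print(p_exp)
print(p_prod)
```

Both calls produce the same distribution up to floating-point rounding, since exp(λj·fj) = αj^fj whenever λj = ln αj.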

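The Model: Summary • Goal: Find p* s.t. $p^* = \arg\max_{p \in P} H(p)$ • $P = \{p \mid E_p[f_j] = E_{\tilde{p}}[f_j] = d_j,\ j \in \{1, \dots, k\}\}$ • p*: • Is unique • Maximizes conditional likelihood • Is of the form $p^*(y \mid x) = \frac{1}{Z} \exp\big(\sum_{j} \lambda_j f_j(x, y)\big)$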

Decoding

Decoding • $p(y \mid x) = \frac{1}{Z} \exp\big(\sum_{j} \lambda_j f_j(x, y)\big)$, where Z is the normalization term

Decoding
• Given a trained model with weights λi:
    Z = 0
    For each y in Y:
        sum = 0  # Initialize or set to default_weight
        For each t in x:
            sum += weight for (t, y)
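The pseudocode stops after accumulating the per-class sum. The sketch below completes it under the assumption that the model has the standard log-linear form, p(y|x) = exp(sum)/Z: exponentiate each sum, add it to Z, then normalize. The function name `decode`, the weight dictionary, and the example weights are hypothetical.

```python
import math

def decode(weights, features, classes, default_weight=0.0):
    """Return (best class, P(y|x) for all y) for one instance.
    weights maps (feature, class) pairs to lambda values; missing pairs count as 0."""
    z = 0.0
    unnormalized = {}
    for y in classes:
        total = default_weight              # initialize or set to default weight
        for t in features:
            total += weights.get((t, y), 0.0)
        unnormalized[y] = math.exp(total)   # assumed step: exponentiate the sum
        z += unnormalized[y]                # assumed step: accumulate Z
    probs = {y: v / z for y, v in unnormalized.items()}
    best = max(probs, key=probs.get)
    return best, probs

# Hypothetical trained weights for a two-class text classifier
weights = {("rifle", "guns"): 2.0, ("ball", "sports"): 1.5}
best, probs = decode(weights, ["rifle", "sold"], ["guns", "sports"])
print(best, probs)
```

Because Z is the same for every class, the argmax could be taken over the raw sums alone; normalization is only needed when the probabilities themselves are wanted.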
