
A Framework For Tuning Posterior Entropy


Presentation Transcript


  1. A Framework For Tuning Posterior Entropy
  Rajhans Samdani, joint work with Ming-Wei Chang (Microsoft Research) and Dan Roth, University of Illinois at Urbana-Champaign
  Workshop on Inferning, ICML 2012, Edinburgh

  2. Inference: Predicting Structures
  • Predict the output variable y from the space of allowed outputs Y, given input variable x, using parameters (weight vector) w
  • E.g.:
    • predict POS tags given a sentence,
    • predict word alignments given sentences in two different languages,
    • predict the entity-relation structure from a document
  • Prediction expressed as y* = argmax_{y ∈ Y} P(y | x; w) (a toy sketch of this step follows below)
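
A minimal sketch of this prediction step for a log-linear model, assuming a toy output space, feature function, and weight vector (all illustrative, not from the talk):

    # Toy sketch of y* = argmax_{y in Y} P(y | x; w) for a log-linear model.
    # Since P(y | x; w) is proportional to exp(w . f(x, y)), the argmax over y
    # can skip the normalization constant and just compare scores w . f(x, y).
    def predict(x, Y, w, features):
        def score(y):
            return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())
        return max(Y, key=score)

    # Illustrative usage: tag one word as NOUN or VERB from its suffix.
    def features(x, y):
        return {("suffix=" + x[-3:], y): 1.0}

    w = {("suffix=ing", "VERB"): 1.5, ("suffix=ing", "NOUN"): -0.5}
    print(predict("running", ["NOUN", "VERB"], w, features))  # -> VERB

Real structured predictors enumerate Y implicitly (e.g., with Viterbi-style dynamic programming) rather than listing it as above.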

  3. Learning: Weakly Supervised Learning
  • Labeled data is scarce and difficult to obtain
  • A lot of work on learning with a small amount of labeled data
  • The Expectation Maximization (EM) algorithm is the de facto standard
  • More recently: significant work on injecting weak supervision or domain knowledge into EM via constraints
    • Constraint-Driven Learning (CoDL; Chang et al., 07)
    • Posterior Regularization (PR; Ganchev et al., 10)

  4. Learning Using EM: a Quick Primer (Neal and Hinton, 99)
  • Given unlabeled data x, estimate w; hidden: y
  • for t = 1 … T do
    • E-step: infer a posterior distribution q over y:
      q_t(y) = argmin_q KL(q(y), P(y | x; w_t)) = P(y | x; w_t), the conditional (posterior) distribution of y given w
    • M-step: estimate the parameters w w.r.t. q:
      w_{t+1} = argmax_w E_q log P(x, y; w)
  • The E-step is an inference step, and the M-step learns w.r.t. the distribution inferred (a runnable toy sketch follows below)
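
A compact, runnable sketch of this E-step/M-step loop on a toy model, a mixture of two biased coins, where the hidden coin z plays the role of the hidden y (the model, names, and initialization here are illustrative assumptions, not the structured models from the talk):

    # EM for a mixture of two biased coins: each data point is the number of
    # heads h out of n flips of a hidden coin z; w = (pi, p0, p1), where pi is
    # the probability of picking coin 0 and p0, p1 are the coins' head biases.
    def em_two_coins(heads, n, T=50):
        pi, p0, p1 = 0.4, 0.3, 0.7          # arbitrary initialization of w
        for _ in range(T):
            # E-step: q(z = 0 | h; w), the posterior over the hidden coin
            qs = []
            for h in heads:
                l0 = pi * p0**h * (1 - p0)**(n - h)
                l1 = (1 - pi) * p1**h * (1 - p1)**(n - h)
                qs.append(l0 / (l0 + l1))
            # M-step: argmax_w E_q log P(x, z; w) has a closed form here
            pi = sum(qs) / len(qs)
            p0 = sum(q * h for q, h in zip(qs, heads)) / (n * sum(qs))
            p1 = sum((1 - q) * h for q, h in zip(qs, heads)) / (n * (len(qs) - sum(qs)))
        return pi, p0, p1

    # Hard EM would instead round each q to 0 or 1 in the E-step.
    print(em_two_coins([1, 2, 1, 8, 9, 7], n=10))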

  5. Different EM Variations
  • Hard EM changes the E-step of the EM algorithm
  • Which version to use: EM (PR) vs. hard EM (CoDL) (Spitkovsky et al., 10) (Pedro's talk)? Or is there something better out there?
  • OUR CONTRIBUTION: a unified framework for EM algorithms, Unified EM (UEM) (Samdani et al., 12)
    • A framework which explicitly provides a handle on the entropy of the distribution inferred during the E-step
    • Includes existing EM algorithms
    • Picks the most suitable EM algorithm in a simple, adaptive, and principled way

  6. Outline
  • Background: Expectation Maximization (EM)
  • Unified Expectation Maximization (UEM)
    • Motivation
    • Formulation and mathematical intuition
  • Experiments

  7. Different Versions of EM
  • EM / Posterior Regularization (Ganchev et al., 10)
    • E-step: argmin_q KL(q(y), P(y | x; w_t)) s.t. E_q[U y] ≤ b
    • M-step: argmax_w E_q log P(x, y; w)
  • Hard EM / Constraint-Driven Learning (Chang et al., 07)
    • E-step: y* = argmax_y P(y | x; w) s.t. U y ≤ b
    • M-step: argmax_w E_q log P(x, y; w)
  • It is not clear which version to use!

  8. Motivation: Unified Expectation Maximization (UEM)
  • EM (PR) and hard EM (CoDL) differ mostly in the entropy of the posterior distribution
  • UEM tunes the entropy of the posterior distribution q and is parameterized by a single parameter γ

  9. Unified EM (UEM)
  • EM (PR) minimizes the KL divergence KL(q, P(y | x; w)), where
    KL(q, p) = Σ_y q(y) log q(y) − q(y) log p(y)
  • UEM changes the E-step of standard EM and minimizes a modified KL divergence KL(q, P(y | x; w); γ), where
    KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)
  • This changes the entropy of the posterior: different γ values → different EM algorithms (a short derivation follows below)
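
For γ > 0 and no constraints, this modified E-step has a simple closed form. A short derivation, added here for intuition (the standard Lagrangian argument over the probability simplex, not taken from the slides):

    \min_{q} \; \sum_y \gamma\, q(y) \log q(y) - q(y) \log p(y)
    \quad \text{s.t.} \quad \sum_y q(y) = 1

    \frac{\partial}{\partial q(y)}: \;\; \gamma \left( \log q(y) + 1 \right) - \log p(y) + \lambda = 0
    \;\; \Longrightarrow \;\; q(y) \propto p(y)^{1/\gamma}

So γ = 1 recovers q = p (standard EM), γ → 0⁺ concentrates q on the mode of p (hard EM), and γ → ∞ flattens q toward uniform: exactly the entropy knob described above.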

  10. Effect of Changing γ
  KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)
  [Figure: the posterior q obtained from an original distribution p for γ = ∞ (uniform), γ = 1 (q identical to p), γ = 0, and γ = −1 (mass concentrated on the mode of p); a numeric illustration follows.]
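
A quick numeric illustration of this picture, using the closed form q ∝ p^{1/γ} for γ > 0 (the distribution p below is made up for illustration):

    # Temper a toy distribution p by q(y) ∝ p(y)^(1/γ) and watch its shape:
    # large γ flattens q toward uniform; γ = 1 leaves q = p; γ → 0+ peaks q.
    p = [0.5, 0.3, 0.15, 0.05]

    for gamma in [5.0, 1.0, 0.01]:
        unnorm = [pi ** (1.0 / gamma) for pi in p]
        z = sum(unnorm)
        print(gamma, [round(u / z, 3) for u in unnorm])
    # 5.0  -> [0.301, 0.272, 0.237, 0.19]  (flattened toward uniform)
    # 1.0  -> [0.5, 0.3, 0.15, 0.05]       (q = p)
    # 0.01 -> [1.0, 0.0, 0.0, 0.0]         (all mass on the argmax: hard EM)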

  11. Unifying Existing EM Algorithms
  KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)
  • Changing γ essentially changes the "hardness" of inference
  [Figure: the γ axis running from −1 through 0 and 1 toward ∞. Without constraints, γ = 0 gives hard EM and γ = 1 gives EM; with constraints, γ = −1 gives CoDL and γ = 1 gives PR. Deterministic annealing (Smith and Eisner, 04; Hofmann, 99) corresponds to varying γ along this axis.]

  12. Outline
  • Setting up the problem
  • Introduction to Unified Expectation Maximization
  • Experiments
    • POS tagging
    • Entity-relation extraction
    • Word alignment

  13. Experiments: Exploring the Role of γ
  • Test if changing the inference step by tuning γ helps improve performance over baselines (a minimal tuning sketch follows this list)
  • Compare against:
    • Posterior Regularization (PR), which corresponds to γ = 1.0
    • Constraint-Driven Learning (CoDL), which corresponds to γ = −1
  • Study the relation between the quality of initialization and γ (i.e., the "hardness" of inference)
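
One simple way to realize this tuning in code, sketched below; train_fn and eval_fn are hypothetical stand-ins for the paper's training and evaluation procedures, and the grid of γ values is an assumption:

    # Hypothetical sketch: pick γ by grid search against a development set.
    # train_fn(data, gamma) would run EM with the KL(q, p; γ) E-step;
    # eval_fn(model, data) would return a task metric (accuracy, F1, ...).
    def tune_gamma(train_fn, eval_fn, train_data, dev_data,
                   gammas=(-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0)):
        best_gamma, best_score = None, float("-inf")
        for gamma in gammas:
            score = eval_fn(train_fn(train_data, gamma), dev_data)
            if score > best_score:
                best_gamma, best_score = gamma, score
        return best_gamma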

  14. Unsupervised POS Tagging
  • Model as a first-order HMM
  • Try varying qualities of initialization:
    • Uniform initialization: initialize all states with equal probability
    • Supervised initialization: initialize with parameters trained on varying amounts of labeled data
  • Test the "conventional wisdom" that hard EM does well with a good initialization while EM does better with a weak initialization

  15. Unsupervised POS Tagging: Different EM Instantiations
  [Figure: performance relative to EM, plotted against γ (with EM and hard EM marked), for uniform initialization and for supervised initialization with 5, 10, 20, and 40-80 labeled examples.]

  16. Experiments: Entity-Relation Extraction
  • Example: "Dole's wife, Elizabeth, is a resident of N.C." with entities E1 (Dole), E2 (Elizabeth), E3 (N.C.) and relations R12 (between E1 and E2) and R23 (between E2 and E3)
  • Extract entity types (e.g. Loc, Org, Per) and relation types (e.g. Lives-in, Org-based-in, Killed) between pairs of entities
  • Add constraints:
    • Type constraints between entities and relations
    • Expected-count constraints to regularize the counts of the 'None' relation
  • Semi-supervised learning with a small amount of labeled data

  17. Results on Relations
  [Figure: macro-F1 scores vs. % of labeled data; UEM is statistically significantly better than PR.]

  18. Experiments: Word Alignment
  • Word alignment from a language S to a language T; we try En-Fr and En-Es pairs
  • We use an HMM-based model with agreement constraints for word alignment
  • PR with agreement constraints is known to give HUGE improvements over the HMM (Ganchev et al., 08; Graca et al., 08)
  • We use our efficient algorithm to decompose the E-step into individual HMMs

  19. Word Alignment: EN-FR with 10k Unlabeled Data
  [Figure: alignment error rates.]

  20. Word Alignment: EN-FR
  [Figure: alignment error rates.]

  21. Word Alignment: FR-EN
  [Figure: alignment error rates.]

  22. Word Alignment: EN-ES
  [Figure: alignment error rates.]

  23. Word Alignment: ES-EN
  [Figure: alignment error rates.]

  24. Experiments Summary
  • In different settings, different baselines work better:
    • Entity-relation extraction: CoDL does better than PR
    • Word alignment: PR does better than CoDL
    • Unsupervised POS tagging: depends on the initialization
  • UEM allows us to choose the best algorithm in all of these cases
  • The best version of EM: a new version with 0 < γ < 1

  25. Unified EM: Summary
  • UEM: a unified framework for EM algorithms which tunes the entropy of the posterior via a single parameter γ
  • γ adaptively changes the entropy of the posterior based on the data, initialization, and constraints
  • Experimentally: the best γ corresponds to neither EM (PR) nor hard EM (CoDL) and is found through the UEM framework
  • Shows the role of inference in learning: the learned parameters seem to be sensitive to the entropy of the inferred posterior
  • Open question: what is actually going on? How is the entropy of the E-step actually changing the learned model?
  Questions?
