
Learning Approximate Inference Policies for Fast Prediction

Presentation Transcript


  1. Learning Approximate Inference Policies for Fast Prediction. Jason Eisner, ICML “Inferning” Workshop, June 2012

  2. Beware: Bayesians in Roadway. A Bayesian is the person who writes down the function you wish you could optimize.

  3. [Figure: a web of interrelated linguistic variables: lexicon (word types), semantics, sentences, discourse context, resources, inflection, cognates, transliteration, abbreviation, neologism, language evolution, entailment, correlation, tokens, translation, alignment, editing, quotation, speech, misspellings/typos, formatting, entanglement, annotation.] To recover variables, model and exploit their correlations.

  4.–5. Motivating Tasks [build slides] • Structured prediction (e.g., for NLP problems): parsing (trees), machine translation (word strings), word variants (letter strings, phylogenies, grids) • Unsupervised learning via Bayesian generative models: given a few verb conjugation tables and a lot of text, find/organize/impute all verb conjugation tables of the language; given some facts and a lot of text, discover more facts through information extraction and reasoning

  6. Current Methods • Dynamic programming: exact but slow • Approximate inference in graphical models: are the approximations any good? May use dynamic programming as a subroutine (structured BP) • Sequential classification

  7. Speed-Accuracy Tradeoffs • Inference requires lots of computation • Is some computation going to waste? Sometimes the best prediction is overdetermined, and quick ad hoc methods sometimes work: how to respond? • Is some computation actively harmful? In approximate inference, passing a message can hurt; it is frustrating to simplify the model just to fix this • Want to keep improving our models! But then we need good fast approximate inference • Choose approximations automatically, tuned to the data distribution & loss function: “trainable hacks” are more robust

  8. This talk is about “trainable hacks” [Diagram: training data → prediction device (suitable for domain), with likelihood as the feedback signal]

  9. This talk is about “trainable hacks” [Diagram: as before, but the feedback signal is now loss + runtime instead of likelihood]

  10. Bayesian Decision Theory [Diagram: data distribution + loss → prediction rule → optimized parameters of the prediction rule] • What prediction rule? (approximate inference + beyond) • What loss function? (can include runtime) • How to optimize? (backprop, RL, …) • What data distribution? (may have to impute)
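Slide 10's four questions fit a standard decision-theoretic template. As a point of reference (the notation here is mine, not from the slides), the ideal prediction rule and the training objective for an approximate one would be:

```latex
% Bayes-optimal prediction: minimize expected loss under the posterior
\hat{y}(x) \;=\; \operatorname*{argmin}_{\tilde{y}}\;
    \mathbb{E}_{y \sim p(y \mid x)}\bigl[\,L(y, \tilde{y})\,\bigr]

% Training: pick the parameters \theta of the actual (approximate)
% prediction rule f_\theta to minimize expected loss over the data
% distribution \mathcal{D}
\theta^{*} \;=\; \operatorname*{argmin}_{\theta}\;
    \mathbb{E}_{(x,\, y^{*}) \sim \mathcal{D}}
    \bigl[\,L\bigl(y^{*}, f_{\theta}(x)\bigr)\,\bigr]
```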

  11. This talk is about “trainable hacks” [Diagram: a probabilistic domain model plus partial data yield complete training data; the prediction device (suitable for domain) gets loss + runtime as its feedback signal]

  12. Part 1: Your favorite approximate inference algorithm is a trainable hack

  13. General CRFs: Unrestricted model structure [Figure: a loopy factor graph over labels Y1–Y4 and observations X1–X3] Add edges to model the conditional distribution well. But exact inference is intractable, so use loopy sum-product or max-product BP.
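To make slide 13 concrete, here is a minimal sketch of loopy sum-product BP on a toy cycle. The graph, potentials, and flood schedule are illustrative assumptions, not code from the talk:

```python
import numpy as np

# Toy pairwise model: three binary variables on a cycle (hence "loopy").
edges = [("Y1", "Y2"), ("Y2", "Y3"), ("Y3", "Y1")]
unary = {"Y1": np.array([1.0, 2.0]),      # phi_v(y): made-up potentials
         "Y2": np.array([1.0, 1.0]),
         "Y3": np.array([3.0, 1.0])}
pair = {e: np.array([[2.0, 1.0],          # psi_uv[y_u, y_v]: favors agreement
                     [1.0, 2.0]]) for e in edges}

def psi(u, v):
    """Pairwise potential oriented as [value of u, value of v]."""
    return pair[(u, v)] if (u, v) in pair else pair[(v, u)].T

# One message per directed edge, initialized uniform.
msg = {(u, v): np.ones(2) for (a, b) in edges for (u, v) in [(a, b), (b, a)]}

def belief(v, exclude=None):
    """Unary potential times all incoming messages (optionally minus one)."""
    b = unary[v].copy()
    for (s, t), m in msg.items():
        if t == v and s != exclude:
            b = b * m
    return b

for _ in range(50):  # synchronous "flood" updates; convergence not guaranteed
    new = {(u, v): psi(u, v).T @ belief(u, exclude=v) for (u, v) in msg}
    msg = {k: m / m.sum() for k, m in new.items()}  # normalize for stability
    # (max-product BP: replace the matrix-vector sum with an elementwise max)

marginals = {v: belief(v) / belief(v).sum() for v in unary}
print(marginals)  # approximate posterior marginals for each variable
```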

  14. General CRFs: Unrestricted model structure [Figure: posterior tag marginals for “The cat sat on the mat .”: The → DT .9, NN .05, …; cat → NN .8, JJ .1, …; sat → VBD .7, VB .1, …; on → IN .9, NN .01, …; the → DT .9, NN .05, …; mat → NN .4, JJ .3, …; . → . .99, , .001, …] Inference: compute properties of the posterior distribution.

  15. General CRFs: Unrestricted model structure [Figure: decoded tag sequence DT NN VBD IN DT NN . for “The cat sat on the mat .”] Decoding: coming up with predictions from the results of inference.
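Slides 14–15 together amount to minimum-Bayes-risk decoding under Hamming loss: once inference has produced per-position marginals, decoding just takes the best tag at each position. A tiny sketch, reusing the figure's own numbers:

```python
# Per-position posterior tag marginals, echoing slide 14's figure.
marginals = [{"DT": 0.90, "NN": 0.05}, {"NN": 0.80, "JJ": 0.10},
             {"VBD": 0.70, "VB": 0.10}, {"IN": 0.90, "NN": 0.01},
             {"DT": 0.90, "NN": 0.05}, {"NN": 0.40, "JJ": 0.30},
             {".": 0.99, ",": 0.001}]

# Decoding for Hamming loss: highest-marginal tag at each position.
prediction = [max(m, key=m.get) for m in marginals]
print(prediction)  # ['DT', 'NN', 'VBD', 'IN', 'DT', 'NN', '.']
```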

  16. General CRFs: Unrestricted model structure • One uses CRFs with several approximations: approximate inference, approximate decoding, a mis-specified model structure, MAP training (vs. Bayesian). (These could be present in linear-chain CRFs as well.) • So why are we still maximizing data likelihood? Our system is really more like a Bayes-inspired neural network that makes predictions.

  17. Train directly to minimize task loss (Stoyanov, Ropson, & Eisner 2011; Stoyanov & Eisner 2012) [Diagram: a black-box decision function parameterized by ϴ: x → (approx.) inference → p(y|x) → (approx.) decoding → ŷ, evaluated by L(y*, ŷ)] • Adjust ϴ to (locally) minimize training loss • E.g., via back-propagation (+ annealing) • “Empirical Risk Minimization under Approximations” (ERMA)
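The ERMA recipe in one toy example: below, inference is written as a differentiable function of ϴ (brute-force enumeration of a 3-variable chain, standing in for the unrolled message-passing updates that Stoyanov, Ropson & Eisner actually back-propagate through), and gradient descent runs on the task loss itself rather than on likelihood. All model details here are invented for illustration:

```python
import itertools
import torch

theta_unary = torch.zeros(3, 2, requires_grad=True)  # per-variable scores
theta_pair = torch.zeros(2, 2, requires_grad=True)   # shared edge scores
gold = torch.tensor([1.0, 1.0, 0.0])                 # gold labels y*

def marginals():
    """P(Y_i = 1) for each of 3 binary chain variables, by enumeration."""
    configs = list(itertools.product([0, 1], repeat=3))
    scores = torch.stack([
        theta_unary[0, y0] + theta_unary[1, y1] + theta_unary[2, y2]
        + theta_pair[y0, y1] + theta_pair[y1, y2]
        for (y0, y1, y2) in configs])
    probs = torch.softmax(scores, dim=0)             # exact posterior
    return torch.stack([
        sum(p for p, y in zip(probs, configs) if y[i] == 1)
        for i in range(3)])

opt = torch.optim.SGD([theta_unary, theta_pair], lr=1.0)
for _ in range(200):
    opt.zero_grad()
    loss = ((marginals() - gold) ** 2).mean()  # task loss (MSE), not likelihood
    loss.backward()                            # backprop through inference
    opt.step()

print(marginals().detach())  # marginals pushed toward [1, 1, 0]
```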

  18.–21. Optimization Criteria [equation slides contrasting the MLE objective with loss-aware alternatives; the formulas did not survive in this transcript]
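The transcript lost the formulas, but the contrast the ERMA papers draw is roughly the following (my reconstruction, not the slides' exact notation):

```latex
% Maximum likelihood: fit the model's conditional distribution
\theta_{\mathrm{MLE}} \;=\; \operatorname*{argmax}_{\theta}
    \sum_{(x,\, y^{*})} \log p_{\theta}(y^{*} \mid x)

% ERMA: evaluate the whole approximate pipeline f_\theta
% (approximate inference + decoding) by the task loss it incurs
\theta_{\mathrm{ERMA}} \;=\; \operatorname*{argmin}_{\theta}
    \sum_{(x,\, y^{*})} L\bigl(y^{*},\, f_{\theta}(x)\bigr)
```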

  22. Experimental Results • 3 NLP problems; also synthetic data • We show that: general CRFs work better when they match dependencies in the data; minimum-risk training results in more accurate models • ERMA software package available at www.clsp.jhu.edu/~ves/software

  23. ERMA software package http://www.clsp.jhu.edu/~ves/software • Includes syntax for describing general CRFs. • Supports sum-product and max-product BP. • Can optimize several commonly used loss functions: MSE, Accuracy, F-score. • The package is generic: little effort to model new problems; about 1–3 days to express each problem in our formalism.
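To back-propagate, the discrete losses on slide 23 need differentiable surrogates. A common trick is to score the posterior marginals directly, as sketched below with invented numbers (ERMA additionally anneals such soft losses toward the true discrete ones; the exact forms here are assumptions, not the package's internals):

```python
import torch

p = torch.tensor([0.9, 0.2, 0.7], requires_grad=True)  # predicted marginals
y = torch.tensor([1.0, 0.0, 1.0])                      # gold binary labels

mse = ((p - y) ** 2).mean()                    # MSE is already differentiable
soft_acc = (p * y + (1 - p) * (1 - y)).mean()  # expected accuracy
tp = (p * y).sum()                             # expected true positives
fp = (p * (1 - y)).sum()                       # expected false positives
fn = ((1 - p) * y).sum()                       # expected false negatives
soft_f1 = 2 * tp / (2 * tp + fp + fn)          # soft F-score
(1 - soft_f1).backward()                       # any of these can be trained on
```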

  24.–27. Modeling Congressional Votes [build slides] Examples from the ConVote corpus [Thomas et al., 2006], each floor-debate speech labeled with its vote (Yea): “First, I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary, not just for the underlying bill…” “Had it not been for the heroic actions of the passengers of United flight 93 who forced the plane down over Pennsylvania, congress's ability to serve…”

  28.–30. Modeling Congressional Votes [build slides] • Predict representative votes (Y/N) based on debates. [Figure: each vote variable Y/N is linked to the text of that representative's speech; “context” edges connect pairs of vote variables, e.g., between speakers who refer to one another.] An example from the ConVote corpus [Thomas et al., 2006].
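A rough sketch of the structure slides 28–30 build up. The feature, names, and scores below are placeholders for illustration, not the ERMA formalism:

```python
# One binary vote variable per speaker; a unary factor scores the vote
# against that speaker's own speech; a pairwise "context" factor links two
# speakers' votes, e.g., when one refers to the other in debate.
speeches = {
    "A": "First, I want to commend the gentleman from Wisconsin ...",
    "B": "Had it not been for the heroic actions of the passengers ...",
}
mentions = [("A", "B")]  # context edges between speakers (assumed known)

def text_factor(speech):
    """Toy unary scores (Nay, Yea) from a trivial keyword feature."""
    yea_words = sum(w in speech.lower() for w in ("commend", "heroic"))
    return (1.0, 1.0 + yea_words)

unary_factors = {s: text_factor(text) for s, text in speeches.items()}
pairwise_factors = {e: [[2.0, 1.0], [1.0, 2.0]] for e in mentions}  # agree
print(unary_factors, pairwise_factors)
```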

  31.–35. Modeling Congressional Votes [model-construction figures and results tables did not survive in this transcript] *Boldfaced results are significantly better than all others (p < 0.05)

  36.–37. Information Extraction from Semi-Structured Text CMU Seminar Announcement Corpus [Freitag, 2000]. Example announcement: “What: Special Seminar Who: Prof. Klaus Sutner, Computer Science Department, Stevens Institute of Technology Topic: "Teaching Automata Theory by Computer" Date: 12-Nov-93 Time: 12:00 pm Place: WeH 4623 Host: Dana Scott (Asst: Rebecca Clark x8-6737) ABSTRACT: We will demonstrate the system "automata" that implements finite state machines… After the lecture, Prof. Sutner will be glad to demonstrate and discuss the use of MathLink and his "automata" package” [The second slide highlights the fields to extract: speaker, start time, location; note that the speaker is mentioned in two distant places.]

  38. Skip-Chain CRF for Info Extraction Extract speaker, location, stime, and etime from seminar announcement emails. [Figure: a linear chain over tokens “Who: Prof. Klaus Sutner … Prof. Sutner will” with tags O/S, plus long-distance “skip” edges joining repeated tokens such as the two occurrences of “Sutner”.] Skip-chain CRF [Sutton and McCallum, 2005; Finkel et al., 2005]; CMU Seminar Announcement Corpus [Freitag, 2000].
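The defining move of the skip-chain model is easy to state in code: besides the usual chain edges, add an edge between repeated occurrences of the same token so that label evidence can flow between distant mentions. A minimal sketch; the capitalized-identical-token heuristic is an assumption:

```python
# Tokens echoing slide 38's figure.
tokens = ["Who:", "Prof.", "Klaus", "Sutner", "...", "Prof.", "Sutner", "will"]

chain_edges = [(i, i + 1) for i in range(len(tokens) - 1)]
skip_edges = [(i, j)
              for i in range(len(tokens)) for j in range(i + 1, len(tokens))
              if tokens[i] == tokens[j] and tokens[i][:1].isupper()]
print(skip_edges)  # [(1, 5), (3, 6)]: Prof.-Prof. and Sutner-Sutner
```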

  39.–42. Semi-Structured Information Extraction [results tables did not survive in this transcript] *Boldfaced results are significantly better than all others (p < 0.05).

  43.–46. Collective Multi-Label Classification [build slides] Reuters Corpus Version 2 [Lewis et al, 2004]; collective models per [Ghamrawi and McCallum, 2005; Finley and Joachims, 2008]. Example document with candidate labels Oil, Libya, Sports: “The collapse of crude oil supplies from Libya has not only lifted petroleum prices, but added a big premium to oil delivered promptly. Before protests began in February against Muammer Gaddafi, the price of benchmark European crude for imminent delivery was $1 a barrel less than supplies to be delivered a year later. …”
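The collective model on these slides amounts to a CRF over one binary variable per label, with pairwise factors capturing label correlations (Oil and Libya co-occur; Sports rarely joins them). A toy sketch with invented weights; with a small label set, exact decoding by enumeration is feasible:

```python
import itertools

labels = ["Oil", "Libya", "Sports"]
unary = {"Oil": 2.0, "Libya": 1.5, "Sports": -1.0}       # text-based scores
pair_bonus = {("Oil", "Libya"): 1.0, ("Oil", "Sports"): -0.5,
              ("Libya", "Sports"): -0.5}                 # label correlations

def score(active):
    """Score a set of active labels: unary terms plus pairwise bonuses."""
    s = sum(unary[l] for l in active)
    s += sum(b for (a, c), b in pair_bonus.items()
             if a in active and c in active)
    return s

best = max((set(c) for n in range(len(labels) + 1)
            for c in itertools.combinations(labels, n)), key=score)
print(best)  # {'Oil', 'Libya'}
```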

  47.–49. Multi-Label Classification [results tables did not survive in this transcript]
