Statistical NLP Lecture 18: Bayesian grammar induction & machine translation

Roger Levy, Department of Linguistics, UCSD. Thanks to Percy Liang, Noah Smith, and Dan Klein for slides.

Plan • Recent developments in Bayesian unsupervised grammar induction • Nonparametric grammars • Non-conjugate priors • A bit about machine translation

Nonparametric grammars • Motivation: • How many symbols should a grammar have? • Really an open question • “Let the data have a say”

Hierarchical Dirichlet Process PCFG • Start with the standard Bayesian picture: (Liang et al., 2007)

Grammar representation • Liang et al. use Chomsky normal form (CNF) grammars • A CNF grammar has no ε-productions and only has rules of the form • X → Y Z [binary rewrite] • X → a [unary terminal production] [Figure: two examples, labeled "CNF" and "not CNF"]
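The CNF restriction above can be checked mechanically. The following is a minimal sketch, not code from Liang et al.; the rule encoding (pairs of a left-hand side and a tuple of right-hand-side symbols), the function names, and the toy grammars are all assumptions made for illustration.

```python
def is_cnf_rule(lhs, rhs, nonterminals):
    """True if lhs -> rhs is a binary rewrite (X -> Y Z) or a
    unary terminal production (X -> a)."""
    if lhs not in nonterminals:
        return False
    if len(rhs) == 2:
        # Binary rewrite: both children must be nonterminals.
        return all(sym in nonterminals for sym in rhs)
    if len(rhs) == 1:
        # Unary production: the single child must be a terminal.
        return rhs[0] not in nonterminals
    # Empty right-hand sides (epsilon) and longer rules are excluded.
    return False


def is_cnf(rules, nonterminals):
    return all(is_cnf_rule(lhs, rhs, nonterminals) for lhs, rhs in rules)


# Hypothetical toy grammars, for illustration only.
NONTERMINALS = {"S", "NP", "VP", "V", "Det", "N"}
cnf_rules = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("VP", ("V", "NP")),
    ("Det", ("the",)),
    ("N", ("dog",)),
    ("V", ("saw",)),
]
non_cnf_rules = [
    ("VP", ("V", "NP", "NP")),  # ternary rewrite
    ("NP", ("N",)),             # unary rewrite to a nonterminal
    ("N", ()),                  # epsilon production
]

print(is_cnf(cnf_rules, NONTERMINALS))
print(is_cnf(non_cnf_rules, NONTERMINALS))
```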

HDP-PCFG defined • Each grammar has a top-level distribution β over (non-terminal) symbols • This distribution is drawn from a Dirichlet process prior (stick-breaking construction; Sethuraman, 1994) • So really there are infinitely many nonterminals • Each nonterminal symbol has: • an emission distribution • a binary rule distribution • and a distribution over what type of rule to use
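To make the stick-breaking construction concrete, here is a minimal sketch of a draw of the top-level symbol weights. The concentration parameter `alpha`, the truncation level, and the function name are assumptions for illustration; truncating to a finite number of symbols is a simplification, since the full construction has infinitely many.

```python
import random


def stick_breaking(alpha, truncation, rng):
    """Truncated stick-breaking draw: weights over the first
    `truncation` symbols, summing to 1."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for _ in range(truncation - 1):
        # Break off a Beta(1, alpha) fraction of what is left.
        frac = rng.betavariate(1.0, alpha)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    # Truncation: the leftover mass goes to the last symbol.
    weights.append(remaining)
    return weights


rng = random.Random(0)
beta = stick_breaking(alpha=1.0, truncation=10, rng=rng)
print([round(w, 3) for w in beta])
```

Smaller values of `alpha` tend to put most of the mass on the first few symbols, which is how the prior expresses a preference for grammars with few active nonterminals.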

HDP-PCFG defined

The prior over symbols • The Dirichlet Process controls expectations about symbol distributions

Binary rewrite rules

Inference • Variational Bayes • The tractable distribution is factored into data, top-level symbol, and rewrite components

Results • Simple synthetic grammar (all rule probs equal): • Successfully recovers sparse symbol structure (standard ML-PCFG fails)

Results on treebank parsing • Binarize the Penn Treebank and erase category labels • Try to recover label structure, and then parse sentences with the resulting grammar [Results figure not recovered; surviving label: ML estimation]

Dependency grammar induction & other priors • We’ll now cover work by Noah Smith and colleagues on unsupervised dependency grammar induction • Highlight on: non-conjugate priors • What types of priors are interesting to use?

Klein & Manning dependency recap

Klein and Manning’s DMV • Probabilistic, unlexicalized dependency grammar over part-of-speech sequences, designed for unsupervised learning (Klein and Manning, 2004). • Left and right arguments are independent; two states to handle valence. [Figure: example dependency structure over the tags $, Vpast, Nsing, Det, Prep, Adj]

Aside: Visual Notation • G: maximized over • Yt: integrated out • Xt: observed [Figure: graphical model with labels G, T, Yt, Xt]

EM for Maximum Likelihood Estimation • E step: calculate exact posterior given current grammar • M step: calculate best grammar, assuming current posterior [Figure: graphical model with labels G, T, Yt, Xt]

Convenient Change of Variable [Figure: two graphical models side by side; the first has labels G, T, Yt, Xt, the second has labels G, T, E, Ft,e, Xt]

EM (Algorithmic View) • E step: calculate derivation event posteriors given grammar • M step: calculate best grammar using event posteriors [Figure: graphical model with labels G, T, E, Ft,e, Xt]

Maximum a Posteriori (MAP) Estimation • The data are not the only source of information about the grammar. • Robustness: the grammar should not have many zeroes. Smooth. • This can be accomplished by putting a prior U on the grammar (Chen, 1995; Eisner, 2001, inter alia). • The most computationally convenient prior is a Dirichlet, with α > 1.

MAP EM (Algorithmic View) • E step: calculate derivation event posteriors given grammar • M step: calculate best grammar using event posteriors [Figure: graphical model with labels U, G, T, E, Ft,e, Xt]
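For a single multinomial in the grammar, the M step has a closed form once the E step has produced expected event counts. The sketch below assumes a symmetric Dirichlet prior with parameter `alpha` (the slides' smoothing hyperparameter); the function name and the toy counts are hypothetical. With `alpha = 1` it reduces to the maximum likelihood M step; with `alpha > 1` it gives the MAP update, which adds `alpha - 1` pseudo-counts to every event and so keeps probabilities away from zero.

```python
def m_step(expected_counts, alpha=1.0):
    """Re-estimate one multinomial from expected event counts.

    expected_counts: dict mapping each event to its expected count
    under the current posterior (the output of the E step).
    alpha: symmetric Dirichlet parameter; alpha = 1 gives the ML
    estimate, alpha > 1 gives the MAP estimate.
    """
    pseudo = alpha - 1.0
    total = sum(expected_counts.values()) + pseudo * len(expected_counts)
    return {event: (count + pseudo) / total
            for event, count in expected_counts.items()}


# Hypothetical expected counts for the dependents of one head tag.
counts = {"Nsing": 3.0, "Nplural": 1.0, "Det": 0.0}
ml_estimate = m_step(counts)
map_estimate = m_step(counts, alpha=1.1)
print(ml_estimate)
print(map_estimate)
```

In the ML estimate the unseen event `Det` gets probability exactly zero; in the MAP estimate it gets a small positive probability.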

Experimental Results: EM and MAP EM • Evaluation of learned grammar on a parsing task (unseen test data). • Initialization and, for MAP, smoothing hyperparameter “u” need to be chosen. • Can do this with unlabeled dev data (modulo infinite cross-ent), • or labeled (shown in blue). Smith (2006, ch. 8)

Structural Bias and Annealing • Simple idea: use soft structural constraints to encourage structures that are more plausible. • This affects the E step only. The final grammar takes the same form as usual. • Here: “favor short dependencies.” • Annealing: gradually shift this bias over time. [Figure: graphical model with labels U, G, T, Yt, B, Xt]

Algorithmic Issues • Structural bias score for a tree needs to factor in such a way that dynamic programming algorithms are still efficient. • Equivalently, g and b, taken together, factor into local features. • Idea explored here: string distance between a word and its parent is penalized geometrically.
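One way to read the geometric penalty is as a per-dependency factor exp(δ · |i − j|), where i and j are the string positions of a word and its parent. The sketch below assumes that form; the parameter name `delta`, the head-array encoding, and the example trees are assumptions for illustration rather than details given on the slide. Because the tree score is a product of per-dependency factors, it factors locally in the way dynamic programming requires.

```python
import math


def edge_bias(child, head, delta):
    """Local factor for one dependency: geometric in string distance."""
    return math.exp(delta * abs(child - head))


def structural_bias(heads, delta):
    """Bias score of a whole tree as a product of local edge factors.

    heads[k] is the 1-based position of the parent of word k + 1,
    or 0 if that word is the root (the root edge is not penalized).
    """
    score = 1.0
    for child, head in enumerate(heads, start=1):
        if head != 0:
            score *= edge_bias(child, head, delta)
    return score


# Two hypothetical trees over a three-word sentence.
short_tree = [2, 0, 2]  # words 1 and 3 both attach to word 2
long_tree = [3, 0, 2]   # word 1 attaches to word 3 instead

print(structural_bias(short_tree, delta=-0.5))
print(structural_bias(long_tree, delta=-0.5))
```

A negative `delta` favors short dependencies and `delta = 0` removes the bias; annealing would correspond to changing `delta` gradually across training iterations.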

Experimental Results: Structural Bias & Annealing • Labeled dev data used to pick • Initialization • Hyperparameter • Structural bias strength (for SB) • Annealing schedule (for SA) Smith (2006, ch. 8)

Correlating Grammar Events • Observation by Blei and Lafferty (2006), regarding topic models: • A multinomial over states that gives high probability to some states is likely to give high probability to other, correlated states. • For us: a class that favors one type of dependents is likely to favor similar types of dependents. • If Vpast favors Nsing as a subject, it might also favor Nplural. • In general, certain classes are likely to have correlated child distributions. • Can we build a grammar-prior that encodes (and learns) these tendencies?

Logistic Normal Distribution over Multinomials • Given: mean vector μ, covariance matrix Σ • Draw a vector η from Normal(η; μ, Σ). • Apply softmax: $p_i = \exp(\eta_i) / \sum_j \exp(\eta_j)$
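The two steps above (draw η from a multivariate normal, then apply softmax) can be sketched with the standard library alone. The Cholesky-based sampler, the function names, and the example μ and Σ are assumptions for illustration; a numerical library would normally supply the multivariate normal draw.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular L with L * L^T = matrix (positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def softmax(eta):
    """Map a real vector to a point on the probability simplex."""
    m = max(eta)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in eta]
    z = sum(exps)
    return [e / z for e in exps]


def sample_logistic_normal(mu, sigma, rng):
    """Draw eta ~ Normal(mu, sigma), then return softmax(eta)."""
    lower = cholesky(sigma)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    eta = [mu[i] + sum(lower[i][k] * z[k] for k in range(i + 1))
           for i in range(len(mu))]
    return softmax(eta)


# Hypothetical example: events 0 and 1 are positively correlated.
rng = random.Random(0)
mu = [0.0, 0.0, 0.0]
sigma = [[1.0, 0.5, 0.0],
         [0.5, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
p = sample_logistic_normal(mu, sigma, rng)
print([round(x, 3) for x in p])
```

Unlike a Dirichlet, the off-diagonal entries of Σ let the prior say that two events tend to receive high probability together.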

Logistic Normal Distributions [Figure: softmax maps η onto the simplex with axes p1 and p2; surviving labels: p1 = 0, p1 = 1, p1 = p2 = 0.5, μ = [0.4 0.6]]

Logistic Normal Distributions [Figure: distribution over the simplex with axes p1 and p2, determined by μ, Σ]

Logistic Normal Grammar [Figure, built up over four slides: vectors η1, η2, η3, ..., ηn; a softmax applied to each; the results labeled g]

Learning a Logistic Normal Grammar • We use variational EM as before to achieve Empirical Bayes; the result is a learned μ and Σ corresponding to each multinomial distribution in the grammar. • Variational model for G also has a logistic normal form. • Cohen et al. (2009) exploit tricks from Blei and Lafferty (2006), as well as the dynamic programming trick for trees/derivation events used previously.

Experimental Results: EB • Single initializer. • MAP hyperparameter value is fixed at 1.1. • LN covariance matrix is 1 on the diagonal and 0.5 for tag pairs within the same “family” (thirteen, designed to be language-independent). Cohen, Gimpel, and Smith (NIPS 2008) Cohen and Smith (NAACL-HLT 2009)
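The covariance structure described above can be written down directly. This is a minimal sketch: the tag names and family assignments are hypothetical, and setting the covariance to 0 for tags in different families is an assumption, since the slide only specifies the diagonal and the within-family entries.

```python
def family_covariance(tags, family_of, within=0.5):
    """Covariance with 1 on the diagonal and `within` for distinct
    tags in the same family; 0 elsewhere (an assumption)."""
    cov = []
    for a in tags:
        row = []
        for b in tags:
            if a == b:
                row.append(1.0)
            elif family_of[a] == family_of[b]:
                row.append(within)
            else:
                row.append(0.0)
        cov.append(row)
    return cov


# Hypothetical tags and families, for illustration only.
tags = ["Vpast", "Vpresent", "Nsing", "Nplural"]
family_of = {"Vpast": "verb", "Vpresent": "verb",
             "Nsing": "noun", "Nplural": "noun"}
cov = family_covariance(tags, family_of)
for row in cov:
    print(row)
```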

Shared Logistic Normals • Logistic normal softly ties grammar event probabilities within the same distribution. • What about across distributions? • If Vpast is likely to have a noun argument, so is Vpresent. • In general, certain classes are likely to have correlated parent distributions. • We can capture this by combining draws from logistic normal distributions.
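A simplified sketch of "combining draws": each multinomial averages the η vectors of the components it is assigned to and then applies softmax, so distributions that share a component are pulled toward each other. Averaging whole vectors, the component names, and the fixed η values (stand-ins for Normal draws) are assumptions for illustration, not the exact construction of Cohen and Smith (2009).

```python
import math


def softmax(eta):
    m = max(eta)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in eta]
    z = sum(exps)
    return [e / z for e in exps]


def shared_logistic_normal(components, membership):
    """Average each distribution's component vectors, then softmax.

    components: dict mapping a component name to its eta vector.
    membership: dict mapping a distribution name to the list of
    component names it combines.
    """
    result = {}
    for dist, names in membership.items():
        dim = len(components[names[0]])
        avg = [sum(components[n][i] for n in names) / len(names)
               for i in range(dim)]
        result[dist] = softmax(avg)
    return result


# Hypothetical setup: events are dependents [noun, adverb, verb].
components = {
    "verb_shared": [1.0, 0.0, -1.0],   # shared by all verb tags
    "Vpast_own": [0.0, 0.5, 0.0],
    "Vpresent_own": [0.0, -0.5, 0.0],
}
membership = {
    "Vpast": ["verb_shared", "Vpast_own"],
    "Vpresent": ["verb_shared", "Vpresent_own"],
}
dists = shared_logistic_normal(components, membership)
for name, probs in dists.items():
    print(name, [round(p, 3) for p in probs])
```

Because both verb tags include `verb_shared`, both end up favoring the noun dependent, while their own components still let them differ elsewhere.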

Shared Logistic Normal Distributions [Figure, built up over five slides: vectors η1, η2, η3, ..., ηn; an "average & softmax" step applied to combinations of them; the results labeled g]

What to Tie? • All verb tags share components for all six distributions (left children, right children, and stopping in each direction in each state). • All noun tags share components for all six distributions (left children, right children, and stopping in each direction in each state). • (Clearly, many more ideas to try!)

Experimental Results: EB • Single initializer. • MAP hyperparameter value is fixed at 1.1. • Tag families used for logistic normal and shared logistic normal models. • Verb-as-parent distributions, noun-as-parent distributions each tied in shared logistic normal models. Cohen and Smith (NAACL-HLT 2009)

Bayesian grammar induction summary • This is an exciting (though technical and computationally complex) area! • Nonparametric models’ ability to scale model complexity with data complexity is attractive • Since likelihood clearly won’t guide us to the right grammars, exploring a wider variety of priors is also attractive • Open issue: nonparametric models constrain what types of priors can be used

Machine translation • Shifting gears…

Machine Translation: Examples

Machine Translation
• Madame la présidente, votre présidence de cette institution a été marquante.
  - Mrs Fontaine, your presidency of this institution has been outstanding.
  - Madam President, president of this house has been discoveries.
  - Madam President, your presidency of this institution has been impressive.
• Je vais maintenant m'exprimer brièvement en irlandais.
  - I shall now speak briefly in Irish.
  - I will now speak briefly in Ireland.
  - I will now speak briefly in Irish.
• Nous trouvons en vous un président tel que nous le souhaitions.
  - We think that you are the type of president that we want.
  - We are in you a president as the wanted.
  - We are in you a president as we the wanted.

History • 1950’s: Intensive research activity in MT • 1960’s: Direct word-for-word replacement • 1966 (ALPAC): NRC Report on MT • Conclusion: MT no longer worthy of serious scientific investigation. • 1966-1975: ‘Recovery period’ • 1975-1985: Resurgence (Europe, Japan) • 1985-present: Gradual Resurgence (US) http://ourworld.compuserve.com/homepages/WJHutchins/MTS-93.htm
