
  1. Statistical NLP Winter 2009 Lecture 17: Unsupervised Learning IV (Tree-structured learning) Roger Levy [thanks to Jason Eisner and Mike Frank for some slides]

  2. Inferring tree structure from raw text • We’ve looked at a few types of unsupervised inference of linguistic structure: • Words from unsegmented utterances • Semantic word classes from word bags (=documents) • Syntactic word classes from word sequences • But a hallmark of language is recursive structure • To a first approximation, this is trees • How might we learn tree structures from raw text?

  3. Is this the right problem? • Before proceeding…are we tackling the right problem? • We’ve already seen how the structures giving high likelihood to raw text may not be the structures we want • Also, it’s fairly clear that kids don’t learn language structure from linguistic input alone… • Real learning is cross-modal, e.g. utterances paired with scenes: words “blue rings” / objects: rings, Big Bird; words “and green rings” / objects: rings, Big Bird; words “and yellow rings” / objects: rings, Big Bird; words “Big Bird! Do you want to hold the rings?” / objects: Big Bird • So why study unsupervised grammar induction from raw text?

  4. Why induce grammars from raw text? • I don’t think this is a trivial question to answer. But there are some answers: • Practical: if unsupervised grammar induction worked well, it’d save a whole lot of annotation effort • Practical: we need tree structure to get compositional semantics to work • Scientific: unsupervised grammar induction may help us place an upper bound on how much grammatical knowledge must be innate • Theoretical: the models and algorithms that make unsupervised grammar induction work well may help us with other, related tasks (e.g., machine translation)

  5. Plan of today • We’ll start with a look at an early but influential clustering-based method (Clark, 2001) • We’ll then talk about EM for PCFGs • Then we’ll talk about the next influential work (setting the bar for all modern work), Klein & Manning (2004)

  6. Distributional clustering for syntax • How do we evaluate unsupervised grammar induction? • Basically like we evaluate supervised parsing: through bracketing accuracy • But this means that we don’t have to focus on inducing a grammar • We can focus on inducing constituents in sentences instead • This contrast has been called structure search versus parameter search

  7. Distributional clustering for syntax (II) • What makes a good constituent? • Old insight from Zellig Harris (Chomsky’s teacher): something can appear in a consistent set of contexts Sarah saw a dog standing near a tree. Sarah saw a big dog standing near a tree. Sarah saw me standing near a tree. Sarah saw three posts standing near a tree. • Therefore we could try to characterize the space of possible constituents by the contexts they appear in • We could then do clustering in this space

  8. Distributional clustering for syntax (III) • Clark, 2001: k-means clustering of common tag sequences occurring in context distributions • Applied to 12 million words from the British National Corpus • The results were promising; for example, one cluster groups determiner-initial sequences such as “the dog”, “the big dog”, “the very big dog”, “the dog near the tree”
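To make the clustering idea concrete, here is a minimal, hypothetical sketch of the approach (not Clark's actual system): represent each frequent tag sequence by its distribution over (left tag, right tag) contexts and run k-means over those vectors. The function name, data format, and thresholds are invented for illustration, and scikit-learn and NumPy are assumed.

```python
from collections import Counter, defaultdict

import numpy as np
from sklearn.cluster import KMeans

def cluster_tag_sequences(tagged_sents, max_len=3, n_clusters=10, min_count=50):
    """Cluster frequent tag sequences by their (left tag, right tag) context
    distributions, in the spirit of distributional clustering for syntax."""
    contexts = defaultdict(Counter)
    for tags in tagged_sents:                       # each sentence: a list of POS tags
        padded = ['<s>'] + list(tags) + ['</s>']
        for i in range(1, len(padded) - 1):         # start of candidate sequence
            for j in range(i + 1, min(i + max_len, len(padded) - 1) + 1):  # end (exclusive)
                contexts[tuple(padded[i:j])][(padded[i - 1], padded[j])] += 1
    frequent = {seq: c for seq, c in contexts.items() if sum(c.values()) >= min_count}
    vocab = sorted({ctx for c in frequent.values() for ctx in c})
    X = np.array([[c[ctx] for ctx in vocab] for c in frequent.values()], dtype=float)
    X /= X.sum(axis=1, keepdims=True)               # normalize rows to context distributions
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    return dict(zip(frequent, labels))              # tag sequence -> cluster id
```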

  9. Distributional clustering for syntax (IV) • But…there was a problem: • Clustering picks up both constituents and things that can’t be constituents (distituents), e.g. a cluster containing “dog the”, “big dog the”, “dog big the”, “dog and big” • The method is good at distinguishing constituent types but bad at weeding out distituents

  10. Distributional clustering for syntax (V) • Clark 2001’s solution: “with real constituents, there is high mutual information between the symbol occurring before the putative constituent and the symbol after…” • Filter out distituents on the basis of low mutual information • (controlling for distance between symbols) [figure: mutual information distributions for constituents vs. distituents]
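As a rough illustration of the filter (my own sketch, not Clark's code), the mutual information between the symbol before and the symbol after a candidate sequence can be estimated from joint context counts; candidates with low MI would be treated as distituents.

```python
import math
from collections import Counter

def context_mutual_information(context_pairs):
    """Estimate the mutual information between the left and right context
    symbols of a candidate constituent, from a list of (left, right) pairs
    observed in a corpus. Low MI suggests a distituent."""
    n = len(context_pairs)
    joint = Counter(context_pairs)
    left = Counter(l for l, _ in context_pairs)
    right = Counter(r for _, r in context_pairs)
    mi = 0.0
    for (l, r), c in joint.items():
        p_lr, p_l, p_r = c / n, left[l] / n, right[r] / n
        mi += p_lr * math.log2(p_lr / (p_l * p_r))
    return mi
```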

  11. Distributional clustering for syntax (VI) • The final step: train an SCFG using vanilla maximum-likelihood estimation • (hand-label the clusters to get interpretability) [table: ten most frequent rules expanding “NP”]

  12. Structure vs. parameter search • This did pretty well • But the clustering-based structure-search approach is throwing out a hugely important type of information • (did you see what it is?) • …bracketing consistency! In “Sarah gave the big dog the most enthusiastic hug.”, two overlapping spans can’t both be constituents at once • Parameter-estimation approaches automatically incorporate this type of information • This motivated the work of Klein & Manning (2002)

  13. Interlude • Before we discuss later work: how do we do parameter search for PCFGs? • The EM algorithm is key • But as with HMM learning, there are exponentially many structures to consider • We need dynamic programming (but a different sort than for HMMs)

  14. Review: EM for HMMs • For the M-step, the fundamental quantity we need is the expected counts of state-to-state transitions • αi(t): probability of getting from the beginning to state si at time t • aij bij(ot): probability of transitioning from state si at time t to state sj at time t+1, emitting ot • βj(t+1): probability of getting from state sj at time t+1 to the end
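Spelled out (in the arc-emission notation the slide uses, where the emission accompanies the transition; normalization by the total observation probability is left implicit on the slide), the expected count of the transition from state si to state sj is:

```latex
\[
\mathbb{E}\big[\mathrm{count}(s_i \to s_j)\big]
  \;=\; \sum_{t} \frac{\alpha_i(t)\, a_{ij}\, b_{ij}(o_t)\, \beta_j(t+1)}
                      {P(\mathbf{o})}
\]
```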

  15. EM for PCFGs • In PCFGs, for the M-step, the fundamental quantity is the expected counts of various rule rewrites • This requires something new, the outside probability, alongside the familiar inside probability
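In the notation used below (β for inside and α for outside, as in Manning & Schütze), the expected count of a binary rewrite in a sentence w1…wn works out to the following; this is the quantity the next slides build up to:

```latex
\[
\mathbb{E}\big[\mathrm{count}(X \to Y\,Z)\big]
  \;=\; \sum_{0 \le i < j < k \le n}
        \frac{\alpha_X(i,k)\; p(X \to Y\,Z)\; \beta_Y(i,j)\; \beta_Z(j,k)}
             {\beta_S(0,n)}
\]
```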

  16. Inside & outside probabilities • In Manning & Schütze (1999), here’s the picture: you’re interested in a node Nj spanning part of the sentence • The inside probability of Nj is the probability of the words under the node; the outside probability of Nj is the probability of the node together with all the words outside its span

  17. Compute β Bottom-Up by CKY • Example tree for “time flies like an arrow”: [S [NP time] [VP [V flies] [PP [P like] [NP [Det an] [N arrow]]]]] • p(this tree | S) = p(S → NP VP | S) * p(NP → time | NP) * p(VP → V PP | VP) * p(V → flies | V) * … • βVP(1,5) = p(flies like an arrow | VP) = … • βS(0,5) = p(time flies like an arrow | S) = βNP(0,1) * βVP(1,5) * p(S → NP VP | S) + …

  18. Compute β Bottom-Up by CKY • Toy grammar (rule weights): 1 S → NP VP; 6 S → Vst NP; 2 S → S PP; 1 VP → V NP; 2 VP → VP PP; 1 NP → Det N; 2 NP → NP PP; 3 NP → NP NP; 0 PP → P NP

  19. Compute β Bottom-Up by CKY • Same grammar with weights read as probabilities 2^-weight: 2^-1 S → NP VP; 2^-6 S → Vst NP; 2^-2 S → S PP; 2^-1 VP → V NP; 2^-2 VP → VP PP; 2^-1 NP → Det N; 2^-2 NP → NP PP; 2^-3 NP → NP NP; 2^-0 PP → P NP • [chart: one S derivation over the whole sentence, with probability 2^-22]

  20. Compute β Bottom-Up by CKY • (same grammar and probabilities as the previous slide) • [chart: another S derivation over the whole sentence, with probability 2^-27]

  21. The Efficient Version: Add as we go • (same toy grammar as before, with probabilities 2^-weight)

  22. The Efficient Version: Add as we go • βS(0,5) += βS(0,2) * βPP(2,5) * p(S → S PP | S) • Each chart cell accumulates the probabilities of all derivations found so far, e.g. the S cell over the whole sentence holds 2^-22 + 2^-27

  23. Compute β probs bottom-up (CKY) • (need some initialization up here for the width-1 case) • for width := 2 to n (* build smallest first *) for i := 0 to n-width (* start *) let k := i + width (* end *) for j := i+1 to k-1 (* middle *) for all grammar rules X → Y Z: βX(i,k) += p(X → Y Z | X) * βY(i,j) * βZ(j,k) • what if you changed + to max?
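A direct transcription of this pseudocode into runnable Python, as a sketch under assumed dictionary formats for the grammar (not a production parser); the grammar is assumed to be in Chomsky normal form.

```python
from collections import defaultdict

def inside_pass(words, lexical, binary):
    """Bottom-up inside (beta) pass by CKY.
    lexical: {(X, word): p(X -> word)}   (assumed format)
    binary:  {(X, Y, Z): p(X -> Y Z)}    (assumed format)
    Returns beta with beta[(i, k)][X] = p(words[i:k] | X)."""
    n = len(words)
    beta = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):                    # width-1 initialization
        for (X, word), p in lexical.items():
            if word == w:
                beta[(i, i + 1)][X] += p
    for width in range(2, n + 1):                    # build smallest spans first
        for i in range(0, n - width + 1):            # start
            k = i + width                            # end
            for j in range(i + 1, k):                # middle
                for (X, Y, Z), p in binary.items():
                    b_y, b_z = beta[(i, j)][Y], beta[(j, k)][Z]
                    if b_y and b_z:
                        beta[(i, k)][X] += p * b_y * b_z
    return beta
```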

  24. Inside & Outside Probabilities • Example tree for “time flies like an arrow today”: [S [NP time] [VP [V flies] [PP [P like] [NP [Det an] [N arrow]]]] [NP today]] • αVP(1,5) = p(time VP today | S) (“outside” the VP) • βVP(1,5) = p(flies like an arrow | VP) (“inside” the VP) • αVP(1,5) * βVP(1,5) = p(time [VP flies like an arrow] today | S)

  25. Inside & Outside Probabilities • αVP(1,5) = p(time VP today | S); βVP(1,5) = p(flies like an arrow | VP) • αVP(1,5) * βVP(1,5) = p(time flies like an arrow today & VP(1,5) | S) • Dividing by βS(0,6) = p(time flies like an arrow today | S) gives p(VP(1,5) | time flies like an arrow today, S)

  26. Inside & Outside Probabilities • So αVP(1,5) * βVP(1,5) / βS(0,6) is the probability that there is a VP here, given all of the observed data (words) • Strictly analogous to forward-backward in the finite-state case! [figure: HMM trellis alongside the parse tree]

  27. Inside & Outside Probabilities • αVP(1,5) = p(time VP today | S); βV(1,2) = p(flies | V); βPP(2,5) = p(like an arrow | PP) • So αVP(1,5) * βV(1,2) * βPP(2,5) / βS(0,6) is the probability that there is VP → V PP here, given all of the observed data (words) … or is it?

  28. Inside & Outside Probabilities • So αVP(1,5) * p(VP → V PP) * βV(1,2) * βPP(2,5) / βS(0,6) is the probability that there is VP → V PP here (at 1-2-5), given all of the observed data (words) • Sum this probability over all position triples like (1,2,5) to get the expected count c(VP → V PP); reestimate the PCFG!
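Putting the pieces together for the M-step: a sketch that assumes the inside table from the earlier sketch and an outside table like the one sketched after slide 36; the names and formats are illustrative.

```python
from collections import defaultdict

def expected_binary_rule_counts(words, binary, alpha, beta, start='S'):
    """Expected counts of X -> Y Z rewrites in one sentence, summing the
    per-position quantities alpha_X(i,k) * p(X -> Y Z) * beta_Y(i,j) * beta_Z(j,k)
    and dividing by the total sentence probability."""
    n = len(words)
    total = beta[(0, n)][start]                      # p(sentence | S)
    counts = defaultdict(float)
    for (X, Y, Z), p in binary.items():
        for i in range(0, n - 1):
            for k in range(i + 2, n + 1):
                for j in range(i + 1, k):
                    contrib = alpha[(i, k)][X] * p * beta[(i, j)][Y] * beta[(j, k)][Z]
                    if contrib:
                        counts[(X, Y, Z)] += contrib / total
    return counts
```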

  29. Compute β probs bottom-up • Summary: βVP(1,5) += p(V PP | VP) * βV(1,2) * βPP(2,5) • (inside of 1,5) += (rule prob) * (inside of 1,2) * (inside of 2,5) • i.e. p(flies like an arrow | VP) += p(V PP | VP) * p(flies | V) * p(like an arrow | PP)

  30. Compute α probs top-down (uses β probs as well) • Summary: αV(1,2) += p(V PP | VP) * αVP(1,5) * βPP(2,5) • (outside of 1,2) += (rule prob) * (outside of 1,5) * (inside of 2,5) • i.e. αV(1,2) = p(time V like an arrow today | S) += p(time VP today | S) * p(V PP | VP) * p(like an arrow | PP)

  31. Compute α probs top-down (uses β probs as well) • Summary: αPP(2,5) += p(V PP | VP) * αVP(1,5) * βV(1,2) • (outside of 2,5) += (rule prob) * (outside of 1,5) * (inside of 1,2) • i.e. αPP(2,5) = p(time flies PP today | S) += p(time VP today | S) * p(V PP | VP) * p(flies | V)

  32. Compute β probs bottom-up • When you build VP(1,5) from VP(1,2) and PP(2,5) during CKY, increment βVP(1,5) by p(VP → VP PP) * βVP(1,2) * βPP(2,5) • Why? βVP(1,5) is the total probability of all derivations p(flies like an arrow | VP), and we just found another. (See earlier slide of CKY chart.)

  33. Compute β probs bottom-up (CKY) • for width := 2 to n (* build smallest first *) for i := 0 to n-width (* start *) let k := i + width (* end *) for j := i+1 to k-1 (* middle *) for all grammar rules X → Y Z: βX(i,k) += p(X → Y Z) * βY(i,j) * βZ(j,k)

  34. Compute α probs top-down (reverse CKY) • for width := n downto 2 (* “unbuild” biggest first *) for i := 0 to n-width (* start *) let k := i + width (* end *) for j := i+1 to k-1 (* middle *) for all grammar rules X → Y Z: αY(i,j) += ??? ; αZ(j,k) += ???

  35. Compute α probs top-down • After computing β during CKY, revisit constituents in reverse order (i.e., bigger constituents first) • When you “unbuild” VP(1,5) from VP(1,2) and PP(2,5): increment αVP(1,2) by αVP(1,5) * p(VP → VP PP) * βPP(2,5), and increment αPP(2,5) by αVP(1,5) * p(VP → VP PP) * βVP(1,2) • βVP(1,2) and βPP(2,5) were already computed on the bottom-up pass • αVP(1,2) is the total probability of all ways to generate VP(1,2) and all the outside words

  36. Compute α probs top-down (reverse CKY) • for width := n downto 2 (* “unbuild” biggest first *) for i := 0 to n-width (* start *) let k := i + width (* end *) for j := i+1 to k-1 (* middle *) for all grammar rules X → Y Z: αY(i,j) += αX(i,k) * p(X → Y Z) * βZ(j,k); αZ(j,k) += αX(i,k) * p(X → Y Z) * βY(i,j)
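The same recursion in runnable form, again as a sketch with assumed formats, using the inside table from the earlier sketch.

```python
from collections import defaultdict

def outside_pass(words, binary, beta, start='S'):
    """Top-down outside (alpha) pass, 'unbuilding' the biggest spans first.
    Returns alpha with alpha[(i, k)][X] = outside probability of X over (i, k)."""
    n = len(words)
    alpha = defaultdict(lambda: defaultdict(float))
    alpha[(0, n)][start] = 1.0                       # root span: nothing outside it
    for width in range(n, 1, -1):                    # unbuild biggest first
        for i in range(0, n - width + 1):            # start
            k = i + width                            # end
            for j in range(i + 1, k):                # middle
                for (X, Y, Z), p in binary.items():
                    a_x = alpha[(i, k)][X]
                    if a_x:
                        alpha[(i, j)][Y] += a_x * p * beta[(j, k)][Z]
                        alpha[(j, k)][Z] += a_x * p * beta[(i, j)][Y]
    return alpha
```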

  37. Back to the actual history • Early approaches: Pereira & Schabes (1992), Carroll & Charniak (1994) • Initialize a simple PCFG using some “good guesses” • Use EM to re-estimate • Horrible failure

  38. EM on a constituent-context model • Klein & Manning (2002) used a generative model that constructed sentences through bracketings • (1) choose a bracketing of a sentence • (2) for each span in the bracketing, (independently) generate a yield and a context • This is horribly deficient, but it worked well!
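In the usual notation (mine, not the slide's), the constituent-context model scores a sentence S together with a bracketing B by generating, for every span (i, j), its yield y_ij and context c_ij conditioned on whether the span is a constituent (b_ij = 1) or a distituent (b_ij = 0):

```latex
\[
P(S, B) \;=\; P(B)\; \prod_{(i,j)} P\big(y_{ij} \mid b_{ij}\big)\, P\big(c_{ij} \mid b_{ij}\big)
\]
```

The deficiency arises because every span's yield is generated independently, so the same words are generated many times over.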

  39. Adding a dependency component • Klein & Manning (2004) added a dependency component, which wasn’t deficient – a nice generative model! • The dependency component introduced valence – the idea that a governor’s dependent-seeking behavior changes once it has a dependent

  40. Constituent-context + dependency • Some really nifty factorized-inference techniques allowed these models to be combined • The results are very good

  41. More recent developments • Improved parameter search (Smith, 2006) • Bayesian methods (Liang et al., 2007) • Unsupervised all-subtrees parsing (Bod 2007)
