
Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution

This study explores the potential of using the web as a training set for resolving structural ambiguity in natural language processing tasks. The approach utilizes n-gram association models, web-derived surface features, and paraphrases for improved performance.



Presentation Transcript


  1. Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution • Preslav Nakov and Marti Hearst • Computer Science Division and SIMS, University of California, Berkeley • Supported by NSF DBI-0317510 and a gift from Genentech

  2. Motivation • Huge datasets trump sophisticated algorithms. • “Scaling to Very Very Large Corpora for Natural Language Disambiguation”, ACL 2001 (Banko & Brill, 2001) • Task: spelling correction • Raw text as “training data” • Log-linear improvement even up to a billion words • Getting more data is better than fine-tuning algorithms. • How can this be generalized to other problems?

  3. Web as a Baseline • “Web as a baseline” (Lapata & Keller, 04; 05): applied simple n-gram models to: • machine translation candidate selection • article generation • noun compound interpretation • noun compound bracketing • adjective ordering • spelling correction • countability detection • prepositional phrase attachment • All unsupervised • Findings: • Sometimes rival best supervised approaches. • => Web n-grams should be used as a baseline. • [Legend from the slide’s results table: each result is marked either “significantly better than the best supervised algorithm” or “not significantly different from the best supervised algorithm”.]

  4. Our Contribution • Potential of these ideas is not yet fully realized • We introduce new features: • paraphrases • surface features • Applied to structural ambiguity problems • Data sparseness: need statistics for every possible word and word combination • Problems (unsupervised): • Noun compound bracketing: state-of-the-art results (Nakov & Hearst, 2005) • PP attachment and NP coordination: this work

  5. Task 1: Prepositional Phrase Attachment

  6. PP attachment • (a) Peter spent millions of dollars. (noun: the PP combines with the NP to form another NP) • (b) Peter spent time with his family. (verb: the PP is an indirect object of the verb) • quadruple: (v, n1, p, n2) • (a) (spent, millions, of, dollars) • (b) (spent, time, with, family) • Human performance: • quadruple only: 88% • whole sentence: 93%

  7. Related Work • Supervised: • (Brill & Resnik, 94): transformation-based learning, WordNet classes, P=82% • (Ratnaparkhi & al., 94): ME, word classes (MI), P=81.6% • (Collins & Brooks, 95): back-off, P=84.5% • (Stetina & Makoto, 97): decision trees, WordNet, P=88.1% • (Toutanova & al., 04): morphology, syntax, WordNet, P=87.5% • (Olteanu & Moldovan, 05): in context, parser, FrameNet, Web, SVM, P=92.85% • Unsupervised: • (Hindle & Rooth, 93): partially parsed corpus, lexical associations over subsets of (v,n1,p), P=80%, R=80% • (Ratnaparkhi, 98): POS-tagged corpus, unambiguous cases for (v,n1,p), (n1,p,n2), classifier: P=81.9% • (Pantel & Lin, 00): collocation database, dependency parser, large corpus (125M words), P=84.3% (unsupervised state of the art) • [Slide callout: “Ratnaparkhi dataset”, marking the results reported on that dataset.]

  8. Related Work: Web • Unsupervised (except where noted): • (Volk, 00): Altavista, NEAR operator, German, compare Pr(p|n1) vs. Pr(p|v), P=75%, R=58% • (Volk, 01): Altavista, NEAR operator, German, inflected queries, Pr(p,n2|n1) vs. Pr(p,n2|v), P=75%, R=85% • (Calvo & Gelbukh, 03): exact phrases, Spanish, P=91.97%, R=89.5% • (Lapata & Keller, 05): Web n-grams, English, Ratnaparkhi dataset, P in the low 70s • (Olteanu & Moldovan, 05): supervised, English, in context, parser, FrameNet, Web counts, SVM, P=92.85%

  9. PP-attachment: Our Approach • Unsupervised • (v,n1,p,n2) quadruples, Ratnaparkhi test set • Google and MSN Search • Exact phrase queries • Inflections: WordNet 2.0 • Adding determiners where appropriate • Models: • n-gram association models • Web-derived surface features • paraphrases

  10. Probabilities: Estimation • Using page hits as a proxy for n-gram counts • Pr(w1|w2) = #(w1,w2) / #(w2) • #(w2) word frequency; query for “w2” • #(w1,w2) bigram frequency; query for “w1 w2” • Pr(w1,w2|w3) = #(w1,w2,w3) / #(w3)
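A minimal Python sketch of this estimation, assuming a hypothetical page_hits helper that returns the hit count for an exact-phrase query against a search engine (Google or MSN Search in the talk); it is a stub, not a real API.

```python
def page_hits(phrase: str) -> int:
    """Hit count for the exact-phrase query "phrase" (hypothetical stub)."""
    raise NotImplementedError("plug a search-engine API in here")

def cond_prob(phrase: str, condition: str) -> float:
    """Estimate Pr(rest | condition) = #(phrase) / #(condition),
    using page hits as a proxy for n-gram counts."""
    denom = page_hits(condition)
    return page_hits(phrase) / denom if denom else 0.0

# Pr(with | spaghetti)        ~ cond_prob("spaghetti with", "spaghetti")
# Pr(with, sauce | spaghetti) ~ cond_prob("spaghetti with sauce", "spaghetti")
```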

  11. N-gram models • (i) Pr(p|n1) vs. Pr(p|v) • (ii) Pr(p,n2|n1) vs. Pr(p,n2|v) • I eat/v spaghetti/n1 with/p a fork/n2. • I eat/v spaghetti/n1 with/p sauce/n2. • Pr or # (frequency) • smoothing as in (Hindle & Rooth, 93) • back-off from (ii) to (i) • N-grams are unreliable if n1 or n2 is a pronoun. • MSN Search: no rounding of n-gram estimates
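A sketch of how the decision could be made with these models, reusing the hypothetical cond_prob helper from the previous sketch; the back-off condition used here (both trigram scores are zero) is a simplification of the smoothing and back-off described above.

```python
def pp_attach(v: str, n1: str, p: str, n2: str) -> str:
    # Model (ii): Pr(p, n2 | n1) vs. Pr(p, n2 | v).
    noun_score = cond_prob(f"{n1} {p} {n2}", n1)
    verb_score = cond_prob(f"{v} {p} {n2}", v)
    if noun_score == 0.0 and verb_score == 0.0:
        # Back off to model (i): Pr(p | n1) vs. Pr(p | v).
        noun_score = cond_prob(f"{n1} {p}", n1)
        verb_score = cond_prob(f"{v} {p}", v)
    if noun_score > verb_score:
        return "noun"
    if verb_score > noun_score:
        return "verb"
    return "undecided"

# pp_attach("eat", "spaghetti", "with", "sauce")  # expected: "noun"
# pp_attach("eat", "spaghetti", "with", "fork")   # expected: "verb"
```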

  12. Web-derived Surface Features • Example features, with (precision P, recall R): • open the door / with a key → verb (100.00%, 0.13%) • open the door (with a key) → verb (73.58%, 2.44%) • open the door – with a key → verb (68.18%, 2.03%) • open the door , with a key → verb (58.44%, 7.09%) • eat Spaghetti with sauce → noun (100.00%, 0.14%) • eat ? spaghetti with sauce → noun (83.33%, 0.55%) • eat , spaghetti with sauce → noun (65.77%, 5.11%) • eat : spaghetti with sauce → noun (64.71%, 1.57%) • The verb-side and noun-side feature counts are each summed and then compared. • Summing achieves high precision, low recall.
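A sketch of the vote over surface features of this kind, again using the hypothetical page_hits stub. In practice search engines ignore punctuation in queries, so such patterns have to be matched in the returned result snippets; the sketch glosses over that and simply sums raw counts on each side.

```python
def surface_vote(v: str, n1: str, p: str, n2: str) -> str:
    # Punctuation separating n1 from the PP suggests verb attachment.
    verb_patterns = [
        f"{v} {n1} ({p} {n2})",
        f"{v} {n1} - {p} {n2}",
        f"{v} {n1} , {p} {n2}",
    ]
    # Punctuation separating v from "n1 p n2" suggests noun attachment.
    noun_patterns = [
        f"{v} , {n1} {p} {n2}",
        f"{v} : {n1} {p} {n2}",
    ]
    verb_sum = sum(page_hits(q) for q in verb_patterns)
    noun_sum = sum(page_hits(q) for q in noun_patterns)
    if verb_sum > noun_sum:
        return "verb"
    if noun_sum > verb_sum:
        return "noun"
    return "undecided"
```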

  13. Paraphrases • Source pattern: v n1 p n2 • v n2 n1 (noun) • v p n2 n1 (verb) • p n2 * v n1 (verb) • n1 p n2 v (noun) • v PRONOUN p n2 (verb) • BE n1 p n2 (noun)

  14. Paraphrases: pattern (1) • v n1 p n2 → v n2 n1 (noun) • Can we turn “n1 p n2” into a noun compound “n2 n1”? • meet/v demands/n1 from/p customers/n2 → meet/v the customer/n2 demands/n1 • Problem: ditransitive verbs like give • gave/v an apple/n1 to/p him/n2 → gave/v him/n2 an apple/n1 • Solution: • no determiner before n1 • a determiner before n2 is required • the preposition cannot be to

  15. Paraphrases: pattern (2) • v n1 p n2 → v p n2 n1 (verb) • If “p n2” is an indirect object of v, then it could be switched with the direct object n1. • had/v a program/n1 in/p place/n2 → had/v in/p place/n2 a program/n1 • A determiner before n1 is required to prevent “n2 n1” from forming a noun compound.

  16. Paraphrases: pattern (3) • v n1 p n2 → p n2 * v n1 (verb) • “*” indicates a wildcard position (up to three intervening words are allowed) • Looks for appositions, where the PP has moved in front of the verb, e.g. • I gave/v an apple/n1 to/p him/n2 → to/p him/n2 I gave/v an apple/n1

  17. Paraphrases: pattern (4) • v n1 p n2 → n1 p n2 v (noun) • Looks for appositions, where “n1 p n2” has moved in front of v • shaken/v confidence/n1 in/p markets/n2 → confidence/n1 in/p markets/n2 shaken/v

  18. Paraphrases: pattern (5) • v n1 p n2 → v PRONOUN p n2 (verb) • n1 is a pronoun → verb attachment (Hindle & Rooth, 93) • Pattern (5) substitutes n1 with a dative pronoun (him or her), e.g. • put/v a client/n1 at/p odds/n2 → put/v him/PRONOUN at/p odds/n2

  19. Paraphrases: pattern (6) • v n1 p n2 → BE n1 p n2 (noun) • BE is typically used with a noun attachment • Pattern (6) substitutes v with a form of to be (is or are), e.g. • eat/v spaghetti/n1 with/p sauce/n2 → is/BE spaghetti/n1 with/p sauce/n2
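A sketch of how a few of these paraphrase patterns could be turned into web queries and votes: patterns (1), (2), and (6) are shown; the determiners are hard-coded, and the inflection handling, the pattern-specific constraints (e.g. that p cannot be to), and the remaining patterns are omitted. page_hits is the hypothetical hit counter from the earlier sketch.

```python
def paraphrase_votes(v: str, n1: str, p: str, n2: str) -> dict:
    votes = {"noun": 0, "verb": 0}
    # (1) v n1 p n2 -> v n2 n1 (noun), e.g. "meet the customer demands"
    if page_hits(f"{v} the {n2} {n1}"):
        votes["noun"] += 1
    # (2) v n1 p n2 -> v p n2 n1 (verb), e.g. "had in place a program"
    if page_hits(f"{v} {p} {n2} a {n1}"):
        votes["verb"] += 1
    # (6) v n1 p n2 -> BE n1 p n2 (noun), e.g. "is spaghetti with sauce"
    if page_hits(f"is {n1} {p} {n2}") or page_hits(f"are {n1} {p} {n2}"):
        votes["noun"] += 1
    return votes

# paraphrase_votes("eat", "spaghetti", "with", "sauce")
# -> more "noun" votes expected
```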

  20. Evaluation • Ratnaparkhi dataset: 3097 test examples, e.g.: • prepare dinner for family → V • shipped crabs from province → V • n1 or n2 is a bare determiner: 149 examples (a problem for unsupervised methods), e.g.: • left chairmanship of the → N • is the of kind → N • acquire securities for an → N • special symbols (%, /, &, etc.): 230 examples (a problem for Web queries), e.g.: • buy % for 10 → V • beat S&P-down from % → V • is 43%-owned by firm → N

  21. Results • [Results table; figures are for prepositions other than of (of indicates noun attachment).] • Smoothing is not needed on the Web. • Models in bold are combined in a majority vote. • Simpler than, but not significantly different from, 84.3% (Pantel & Lin, 00). • Checking directly for...

  22. Task 2: Coordination

  23. Coordination & Problems • (Modified) real sentence: • The Department of Chronic Diseases and Health Promotion leads and strengthens global efforts to prevent and control chronic diseases or disabilities and to promote health and quality of life. • Problems: • boundaries: words, constituents, clauses, etc. • interactions with PPs: [health and [quality of life]] vs. [[health and quality] of life] • or meaning and: chronic diseases or disabilities • ellipsis

  24. NC coordination: ellipsis • Ellipsis • car and truck production • means car production and truck production • No ellipsis • president and chief executive • All-way coordination • Securities and Exchange Commission

  25. NC Coordination: ellipsis • Quadruple (n1,c,n2,h) • Penn Treebank annotations: • ellipsis: (NP car/NN and/CC truck/NN production/NN) • no ellipsis: (NP (NP president/NN) and/CC (NP chief/NN executive/NN)) • all-way: can be annotated either way • This is a problem a parser must deal with. Collins’ parser always predicts ellipsis, but other parsers (e.g. Charniak’s) try to solve it.

  26. Related Work • (Resnik, 99): similarity of form and meaning, conceptual association, decision tree, P=80%, R=100% • (Rus & al., 02): deterministic, rule-based bracketing in context, P=87.42%, R=71.05% • (Chantree & al., 05): distributional similarities from BNC, Sketch Engine (freqs., object/modifier etc.), P=80.3%, R=53.8% • (Goldberg, 99): different problem (n1,p,n2,c,n3), adapts Ratnaparkhi (99) algorithm, P=72%, R=100%

  27. N-gram models (n1,c,n2,h) • (i) #(n1,h) vs. #(n2,h) • (ii) #(n1,h) vs. #(n1,c,n2)

  28. Surface Features • [Table of web-derived surface features with their precision and recall, analogous to the PP-attachment features; as before, the counts on each side are summed and then compared.]

  29. Paraphrases • Source pattern: n1 c n2 h • n2 c n1 h (ellipsis) • n2 h c n1 (NO ellipsis) • n1 h c n2 h (ellipsis) • n2 h c n1 h (ellipsis)

  30. Paraphrases: Pattern (1) • n1 c n2 h → n2 c n1 h (ellipsis) • Switch the places of n1 and n2 • bar/n1 and/c pie/n2 graph/h → pie/n2 and/c bar/n1 graph/h

  31. Paraphrases: Pattern (2) • n1 c n2 h → n2 h c n1 (NO ellipsis) • Switch the places of n1 and “n2 h” • president/n1 and/c chief/n2 executive/h → chief/n2 executive/h and/c president/n1

  32. Paraphrases: Pattern (3) • n1 c n2 h → n1 h c n2 h (ellipsis) • Insert the elided head h • bar/n1 and/c pie/n2 graph/h → bar/n1 graph/h and/c pie/n2 graph/h

  33. Paraphrases: Pattern (4) • n1 c n2 h → n2 h c n1 h (ellipsis) • Insert the elided head h, but also switch n1 and n2 • bar/n1 and/c pie/n2 graph/h → pie/n2 graph/h and/c bar/n1 graph/h
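A sketch of the coordination paraphrase tests, patterns (1) to (4): each reworded phrase that gets web hits votes for ellipsis or no ellipsis, and the two sides can then be compared. page_hits is the same hypothetical exact-phrase hit counter as in the earlier sketches.

```python
def coordination_paraphrase_votes(n1: str, c: str, n2: str, h: str) -> dict:
    votes = {"ellipsis": 0, "no ellipsis": 0}
    # (1) n2 c n1 h   (ellipsis),    e.g. "pie and bar graph"
    if page_hits(f"{n2} {c} {n1} {h}"):
        votes["ellipsis"] += 1
    # (2) n2 h c n1   (no ellipsis), e.g. "chief executive and president"
    if page_hits(f"{n2} {h} {c} {n1}"):
        votes["no ellipsis"] += 1
    # (3) n1 h c n2 h (ellipsis),    e.g. "bar graph and pie graph"
    if page_hits(f"{n1} {h} {c} {n2} {h}"):
        votes["ellipsis"] += 1
    # (4) n2 h c n1 h (ellipsis),    e.g. "pie graph and bar graph"
    if page_hits(f"{n2} {h} {c} {n1} {h}"):
        votes["ellipsis"] += 1
    return votes
```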

  34. (Rus & al., 02) Heuristics • Heuristic 1: no ellipsis • n1 = n2 • milk/n1 and/c milk/n2 products/h • Heuristic 4: no ellipsis • n1 and n2 are modified by an adjective • Heuristic 5: ellipsis • only n1 is modified by an adjective • Heuristic 6: no ellipsis • only n2 is modified by an adjective • (In our adaptation, we check for a determiner rather than an adjective.)

  35. Number Agreement • Introduced by Resnik (93) • (a) n1 & n2 agree in number, but n1 & h do not → ellipsis • (b) n1 & n2 do not agree, but n1 & h do → no ellipsis • (c) otherwise → leave undecided
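A sketch of this number-agreement test; is_plural is a hypothetical placeholder for a morphological number check (e.g. via a POS tagger or a WordNet lookup).

```python
def is_plural(noun: str) -> bool:
    """Hypothetical morphological number check."""
    raise NotImplementedError("plug in a tagger or WordNet lookup here")

def number_agreement(n1: str, n2: str, h: str) -> str:
    n1_n2_agree = is_plural(n1) == is_plural(n2)
    n1_h_agree = is_plural(n1) == is_plural(h)
    if n1_n2_agree and not n1_h_agree:
        return "ellipsis"       # case (a)
    if not n1_n2_agree and n1_h_agree:
        return "no ellipsis"    # case (b)
    return "undecided"          # case (c)
```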

  36. Results • 428 examples from the Penn Treebank • [Results table.] • Model (ii) is bad: it compares a bigram to a trigram. • Models in bold are combined in a majority vote. • Comparable to other researchers’ results (but there is no standard dataset).

  37. Conclusions & Future Work • Tapping the potential of very large corpora for unsupervised algorithms • Go beyond n-grams • Surface features • Paraphrases • Results competitive with best unsupervised • Results can rival supervised algorithms’ • Future Work • other NLP tasks • better evidence combination There should be even more exciting features on the Web!

  38. The End Thank you!
