
Acquiring entailment pairs across languages and domains: A data analysis


Presentation Transcript


  1. Acquiring entailment pairs across languages and domains: A data analysis Manaal Faruqui, Dept. of Computer Science & Engineering, IIT Kharagpur; Sebastian Padó, Institut für Computerlinguistik, Universität Heidelberg

  2. Textual Entailment • A Premise P entails a Hypothesis H if a human reading P can infer that H is most likely true (Dagan et al. 2004) • [+] (P): I have won Rs. 5000 in a lottery today! (H): I made a huge profit today. • [−] (P): Victor, the parrot kept shrieking “Water, water”. (H): Thirsty Jaguar procures water for Bulgarian zoo.

  3. Recognizing Textual Entailment • Variety of approaches • Alignment/matching (Monz & de Rijke 2001, MacCartney et al. 2006) • Transformations (Bar-Haim et al. 2007, Harmeling 2009) • Logics-based (MacCartney & Manning 2008, Bos & Markert 2005) • Many systems have a supervised learning component • Optimize model parameters • Training requires positive/negative entailment pairs • Little available: RTE Challenges create ~1000 pairs per year • Creating manually tagged training data is expensive • Wanted: Automatic extraction of entailment pairs

  4. A heuristic for extracting entailment pairs • The most prominent idea: Take advantage of document structure (Burger and Ferro 2005) • In newswire articles, the title is often an abbreviated version of the first sentence: first sentence entails title • Title: “Sainsbury's reports record Christmas sales” First sentence: “J Sainsbury has posted its best ever Christmas sales, with strong demand for homeware and electrical goods driving up trading over the festive period.” (Guardian, Jan 12 2011)
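A minimal sketch of this heuristic, assuming a simplified corpus layout (one plain-text article per file, headline on the first line); the actual corpora ship in their own formats, and the `extract_candidate_pairs` helper and naive sentence splitting here are illustrative only:

```python
import os

def extract_candidate_pairs(corpus_dir):
    """Yield (premise, hypothesis) candidates: first sentence as P, title as H."""
    for name in os.listdir(corpus_dir):
        with open(os.path.join(corpus_dir, name), encoding="utf-8") as f:
            lines = [line.strip() for line in f if line.strip()]
        if len(lines) < 2:
            continue
        title, body = lines[0], " ".join(lines[1:])
        # Naive sentence split; a real pipeline needs proper sentence
        # boundary detection (its errors surface as the "Err" class later).
        first_sentence = body.split(". ")[0]
        yield first_sentence, title  # (P, H)
```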

  5. Previous work – Our questions • Burger and Ferro (2005) • 50% of title–first sentence pairs show entailment • SVM identifies documents with 77% accuracy • Hickl et al. (2006) • remove pairs that “do not share an entity (or NP)”: 92% accuracy • Not a lot of detail available • Our questions: • Does this work across languages? • Does this work across sources (genres)?

  6. Our Agenda • Extract headline–first sentence pairs from newswire • Experiment 1: Different languages (English, German, Hindi) • Experiment 2: Different sources (German newspapers) • Filtering of entailment pairs (motivated by Hickl et al.; sketched below) • Remove sentences that do not share a noun, and questions • Manual annotation of entailment pair candidates • Identify phenomena that break entailment • Classification w.r.t. entailment by a logistic regression model • Analyse usefulness of predictors
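A sketch of the filtering step, using NLTK's English tokenizer and POS tagger as stand-ins (an assumption: the German and Hindi experiments would need language-specific tools, and `keep_pair` is a hypothetical helper name):

```python
import nltk  # assumes the punkt and averaged-perceptron-tagger data are installed

def content_nouns(sentence):
    """Lowercased nouns of a sentence (Penn Treebank tags NN, NNS, NNP, NNPS)."""
    tokens = nltk.word_tokenize(sentence)
    return {w.lower() for w, tag in nltk.pos_tag(tokens) if tag.startswith("NN")}

def keep_pair(premise, hypothesis):
    """Drop questions and pairs whose P and H share no noun."""
    if hypothesis.rstrip().endswith("?"):
        return False
    return bool(content_nouns(premise) & content_nouns(hypothesis))
```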

  7. Step 3: Manual annotation (I) • A fine-grained annotation scheme (8 classes) • Main improvement: Subdivision of “No” class into five subclasses for entailment-breaking phenomena • “No-par(tial)”: When P “almost” entails H, but P misses one crucial bit of information (P): Gaza will soon get its first American fast food outlet (H): KFC to open restaurant in Gaza • “No-pre”: Comprehension of P presupposes H (P): In this manner, he hopes to increase industrial growth. (H): Bush ordered tax rates on import to be reduced.

  8. A fine-grained annotation scheme • “No-con”: Direct contradiction between P and H. (P): How the biological clock works is still unknown. (H): Light regulates the biological clock. • “No-emb(ed)”: Some type of embedding (e.g. a modal verb) breaks the entailment (P): A gambling amendment is expected to be submitted to the state Supreme Court (H): Gaming petition goes before court • “No-oth(er)”: All cases without a more specific category (P): Victor, the parrot kept shrieking “Water, water”. (H): Thirsty Jaguar procures water for Bulgarian zoo.

  9. Ill-formed sentence pairs • “Err”: Due to errors in sentence boundary detection • “Ill”: Some titles are not single grammatical sentences and cannot be interpreted sensibly as a hypothesis (H): Research Alert: Mexico Upped, Chile cut.

  10. Logistic regression modeling • Logistic regression models predict a binary response variable y based on a set of predictors x: p(y = 1 | x) = 1 / (1 + e^−(β0 + β·x)) • Train on annotated data (lump all classes into “yes”/“no”) • Analysis step 1: Compute coefficients β of predictors • Significance • e^β can be read as an odds ratio: the multiplicative change in the odds of y = 1 when a predictor x increases by one unit • Analysis step 2: Test how well predictors generalize • Apply models trained on corpus 1 to predict entailment for corpus 2
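A minimal sketch of the modeling and odds analysis with scikit-learn on toy data; the paper's actual implementation is not specified here, and the feature names and synthetic labels are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in data: four predictors per pair (word overlap, noun match,
# log num words, punctuation) and binary entailment labels.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + 0.1 * rng.standard_normal(200) > 0.5).astype(int)

model = LogisticRegression().fit(X, y)

# Analysis step 1: e^beta is the multiplicative change in the odds of
# entailment when the predictor increases by one unit.
names = ["word_overlap", "noun_match", "log_num_words", "punctuation"]
for name, beta in zip(names, model.coef_[0]):
    print(f"{name}: odds ratio e^beta = {np.exp(beta):.2f}")

# Analysis step 2: a model trained on corpus 1 is applied to corpus 2
# features (X2) via model.predict_proba(X2)[:, 1].
```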

  11. Predictors • Four (hopefully language-independent) predictors • Weighted word overlap: a tf-idf (informativity-based) weighting scheme to compute the word overlap between P and H • Hypothesis: high word overlap → higher chance of entailment

  12. Predictors • Strict noun match: Precision-focused boolean predictor: true if all H nouns are present in P • Hypothesis: strict noun match → higher chance of entailment • Log num words: (logarithmized) length of the article • Hypothesis: longer article → lower chance of entailment • Punctuation: presence of a colon, full stop, or hyphen in the title, an indicator of titles that cannot be interpreted as hypotheses • Hypothesis: punctuation → lower chance of entailment
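One plausible implementation of the four predictors; the exact tf-idf weighting is not spelled out in the slides, so `weighted_overlap` below uses a simple idf-normalized variant, and the `idf` dictionary is assumed to be precomputed from document frequencies:

```python
import math

def weighted_overlap(premise_tokens, hyp_tokens, idf):
    """idf-weighted word overlap between P and H (idf: dict word -> weight)."""
    shared = set(premise_tokens) & set(hyp_tokens)
    total = sum(idf.get(w, 0.0) for w in set(hyp_tokens))
    return sum(idf.get(w, 0.0) for w in shared) / total if total else 0.0

def strict_noun_match(premise_nouns, hyp_nouns):
    """True only if every noun of H also occurs in P."""
    return set(hyp_nouns) <= set(premise_nouns)

def log_num_words(article_tokens):
    """Logarithmized article length."""
    return math.log(max(len(article_tokens), 1))

def title_punctuation(title):
    """True if the title contains a colon, full stop, or hyphen."""
    return any(c in title for c in ":.-")
```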

  13. Exp 1, Analysis by Language: Annotation • English & German: Reuters RCV2 (politics/economy) • Hindi: EMILLE Corpus (politics) • Reasonable number of entailing pairs (“yes”) • More than 50% (Burger/Ferro) but less than 92% (Hickl) • German headlines often not simple sentences (“ill”) • Many Hindi “other” cases: 1st sentence less “to the point” • Embeddings, presuppositions, contradictions very rare

  14. Exp 1, Analysis by Language: Predictors • Odds of predictors trained on different corpora: • Highly significant for all three languages: Word overlap and punctuation • Hypotheses validated by the data • Generally insignificant: Article length (too noisy) and noun match (too strict)

  15. Exp 1, Analysis by Language: Accuracy • Application of L1 models to predict L2 data • Goal is a clean dataset (high precision) • Evaluation: Set recall of the “yes” class to 30%, compare precision (one possible realization is sketched below) • Precision for all models over 90% • Only minor losses when applying models across languages • Overall: Languages and predictors behave similarly
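One way to realize "set recall of the yes class to 30%, compare precision": sweep the model's score threshold, pick the cutoff closest to the target recall, and read off precision there (a sketch; the paper's exact procedure may differ):

```python
import numpy as np

def precision_at_recall(y_true, scores, target_recall=0.30):
    """Sort pairs by model score, find the cutoff whose recall on the
    'yes' class is closest to target_recall, and return precision there."""
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)
    recall = tp / y.sum()
    precision = tp / (np.arange(len(y)) + 1)
    k = int(np.argmin(np.abs(recall - target_recall)))
    return float(precision[k])
```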

  16. Exp 2, Analysis by Source: Annotation • German newspapers: Reuters, Stuttgarter Zeitung, Die Zeit • Stuttgarter Zeitung (StuttZ): newswire-like, but less consistent “house style”, more coverage of regional and local events • Die Zeit: “high-brow” weekly (culture, science, sociopolitics) • StuttZ, Die Zeit: fewer entailment pairs (“yes”) • Die Zeit: many ill-formed and unrelated (“no-oth”) pairs • “Intellectual style”

  17. Exp 2, Analysis by Source: Predictors • Odds of predictors trained on different corpora: • Fairly similar picture to Exp 1 • word overlap, punctuation highly significant • log num words, noun match not significant

  18. Exp 2, Analysis by Source: Accuracy • Results much worse than in Exp 1 • Precision > 90% only for Reuters; StuttZ 84%, Die Zeit < 50%! • Generally larger losses for application across sources (5%) • Reflects “more difficult” distribution of Die Zeit (only?) • Generalization across sources seems to be more difficult than across languages

  19. Lexical analysis of the sources • How domain-specific are the three sources? • KL divergence between their unigram distribution and a domain-unspecific reference corpus (Ciaramita and Baroni 2006), sketched below • Higher KL divergence = more specific • Stand-in for the reference corpus: deWaC (Baroni et al. 2009) • Reuters most specific, Die Zeit least specific
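A sketch of the KL-divergence computation over unigram distributions; the smoothing scheme is not given in the slides, so add-alpha smoothing on the reference side is an assumption:

```python
from collections import Counter
import math

def kl_divergence(corpus_tokens, reference_tokens, alpha=1.0):
    """KL(corpus || reference) over unigram distributions, with add-alpha
    smoothing on the reference so it never assigns zero probability."""
    p, q = Counter(corpus_tokens), Counter(reference_tokens)
    vocab = set(p) | set(q)
    n_p, n_q = sum(p.values()), sum(q.values())
    kl = 0.0
    for w in vocab:
        pw = p[w] / n_p
        qw = (q[w] + alpha) / (n_q + alpha * len(vocab))
        if pw > 0:
            kl += pw * math.log(pw / qw)
    return kl
```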

  20. Summary: The “newswire heuristic” • A prominent heuristic to obtain entailment pairs: Combine the title of a newspaper article with its first sentence • We applied it to three languages and three sources • Annotation + analysis by logistic regression model • Main results: • Entailment breakers: ill-formed titles, unrelatedness • Generalization across languages works well • Generalization across sources does not work well

  21. Why does the newswire heuristic work? • Reuters articles have a consistent style • Reuters articles come from a specific domain • These two properties are shared by similar news agency outlets in other languages… • …but not necessarily by other types of newspapers!

  22. The take-home message • Unless you want to extract entailment pairs from Reuters, look for another heuristic • Extraction from Wikipedia (edits)? • Committee of RTE systems? • Generation? • At the end of the day, the question is what entailment phenomena you want to collect instances of • There is no such thing as a representative sample of entailment pairs

  23. Thank you!
