
Extracting a Lexical Entailment Rule-base from Wikipedia Eyal Shnarch, Libby Barak, Ido Dagan


Presentation Transcript


  1. Extracting a Lexical Entailment Rule-base from Wikipedia Eyal Shnarch, Libby Barak, Ido Dagan Bar Ilan University

  2. Entailment - What is it and what is it good for? • Question Answering: "Which luxury cars are produced in Britain?" • Information Retrieval: "The Beatles"

  3. Lexical Entailment • Lexical Entailment rules model such lexical relations • Part of the Textual Entailment paradigm – a generic framework for semantic inference • Encompasses a variety of relations: • Synonymy: Hypertension → Elevated blood-pressure • IS-A: Jim Carrey → actor • Predicates: Crime and Punishment → Fyodor Dostoyevsky • Reference: Abbey Road → The Beatles

  4. What was done so far? • Lexical database, made for computational consumption, an NLP resource - WordNet • Costly, requires experts and many years of development (since 1985) • Distributional similarity • Country and State share similar contexts • But so do Nurse and Doctor, Bear and Tiger - low precision • Patterns: • NP1 such as NP2 luxury car such as Jaguar • NP1 and other NP2 dogs and other domestic pets • Low coverage, mainly IS-A patterns
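To make the pattern-based approach concrete, here is a minimal, illustrative sketch (not part of the original work) that matches the two surface patterns above with plain regular expressions; a real system would match parsed noun phrases rather than raw word spans.

```python
import re

# Hearst-style surface patterns; the NP matching below is deliberately naive.
PATTERNS = [
    # "NP2 such as NP1"  =>  NP1 IS-A NP2
    re.compile(r"(?P<hyper>\w+(?: \w+)?) such as (?P<hypo>\w+)"),
    # "NP1 and other NP2"  =>  NP1 IS-A NP2
    re.compile(r"(?P<hypo>\w+) and other (?P<hyper>\w+(?: \w+)?)"),
]

def extract_isa(text):
    """Return (hyponym, hypernym) pairs found by the surface patterns."""
    pairs = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            pairs.append((m.group("hypo"), m.group("hyper")))
    return pairs

print(extract_isa("luxury cars such as Jaguar are produced in Britain"))
# [('Jaguar', 'luxury cars')]
```

The low coverage noted on the slide is visible even in this toy: only sentences that happen to use one of these constructions yield any rule at all.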

  5. Our approach – Utilize Definitions • Pen: an instrument for writing or drawing with ink. • pen is-an instrument • pen used for writing / drawing • ink is part of pen • Source of definitions: • Dictionary: describes language terms, slow growth • Encyclopedia: contains knowledge, proper names, events, concepts, rapidly growing • We chose Wikipedia • Very dynamic, constantly growing and updated • Covers a vast range of domains • Gaining popularity in research - AAAI 2008 workshop

  6. Extraction Types • Be-Complement • a noun in the position of a complement of the verb 'be' • All-Nouns • all nouns in the definition • Nouns in different positions have a different likelihood of being entailed by the title
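As a rough illustration of these two extraction types (a sketch only, using spaCy rather than the parser used in the original work), the snippet below reads a definition sentence and produces Be-Complement and All-Nouns candidate rules for a given title:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_rules(title, definition):
    """Extract candidate entailment rules (title -> noun) from one definition sentence."""
    doc = nlp(definition)
    rules = []
    for tok in doc:
        # Be-Complement: a noun that is the complement of the verb "be"
        # (spaCy marks this with the "attr" dependency).
        if tok.dep_ == "attr" and tok.head.lemma_ == "be" and tok.pos_ in ("NOUN", "PROPN"):
            rules.append((title, tok.text, "Be-Comp"))
        # All-Nouns: every noun in the definition is a candidate RHS.
        if tok.pos_ in ("NOUN", "PROPN"):
            rules.append((title, tok.text, "All-N"))
    return rules

print(extract_rules("Pen", "A pen is an instrument for writing or drawing with ink."))
# includes ('Pen', 'instrument', 'Be-Comp') and All-N rules such as ('Pen', 'ink', 'All-N')
```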

  7. Ranking All-Nouns Rules • [Parse-tree example: a syntactic path connecting the title to a noun through "film ... directed by ...", via the dependencies subj, vrel, by-subj, pcomp-n] • The likelihood of entailment depends greatly on the syntactic path connecting the title and the noun • Path in a parsed tree • An unsupervised entailment likelihood score for a syntactic path p within a definition • Split Def-N into Def-N_top and Def-N_bot • Indicative for rule reliability - the precision of Def-N_top rules is much higher than that of Def-N_bot
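The exact unsupervised path score is not reproduced in this transcript, so the sketch below uses a stand-in definition: a path is scored by the fraction of its extracted rules that are independently supported (for example, also produced by another extraction type), and All-Nouns rules are then split into Def-N_top / Def-N_bot around the median path score. Function and variable names are illustrative.

```python
from collections import defaultdict

def rank_paths(extractions, supported):
    """extractions: list of (path, lhs, rhs); supported: set of (lhs, rhs) pairs
    that have independent evidence. Returns a score per path in [0, 1]."""
    per_path = defaultdict(list)
    for path, lhs, rhs in extractions:
        per_path[path].append((lhs, rhs) in supported)
    return {p: sum(flags) / len(flags) for p, flags in per_path.items()}

def split_def_n(extractions, scores):
    """Split All-Nouns rules into Def-N_top / Def-N_bot by the median path score."""
    median = sorted(scores.values())[len(scores) // 2]
    top = [(lhs, rhs) for p, lhs, rhs in extractions if scores[p] >= median]
    bot = [(lhs, rhs) for p, lhs, rhs in extractions if scores[p] < median]
    return top, bot
```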

  8. Extraction Types • Redirect • a Wikipedia redirect gives an alternative name for the title (typically a synonym) • Parenthesis • the parenthesized part of a title names a category the title belongs to • Link • a term in the definition that links to another article
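As an example of how light-weight these extraction types can be, here is a tiny sketch of the Parenthesis type: the parenthesized part of a title is taken as the entailed term. The title used in the example is hypothetical.

```python
import re

def parenthesis_rule(title):
    """If the title has a parenthesized disambiguator, emit a rule LHS -> RHS."""
    m = re.match(r"^(?P<lhs>.+?) \((?P<rhs>[^)]+)\)$", title)
    return (m.group("lhs"), m.group("rhs"), "Parenthesis") if m else None

print(parenthesis_rule("Graceland (album)"))  # ('Graceland', 'album', 'Parenthesis')
```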

  9. Ranking Rules by Supervised Learning

  10. Ranking Rules by Supervised Learning • An alternative approach for deciding which rules to select out of all extracted rules • Each rule is represented by: • 6 binary features: one for each extraction type • 2 binary features: one for each side of the rule, indicating whether it is a named entity (NE) • 2 numerical features: the co-occurrence count of the rule's sides & the number of times the rule was extracted • 1 numeric feature: the path score for the Def-N extraction type • A manually annotated set was used to train SVMlight • Varied the J parameter to obtain different recall-precision tradeoffs Extraction Types
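A hedged sketch of this ranking step: the feature layout below mirrors the 6+2+2+1 description on the slide, but the field names are invented for illustration, and scikit-learn's class_weight stands in for SVMlight's J (cost-factor) parameter, which the original work used.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative set of six extraction types (All-N split into top/bot to reach six).
EXT_TYPES = ["Be-Comp", "All-N-top", "All-N-bot", "Redirect", "Parenthesis", "Link"]

def featurize(rule):
    """rule: dict with the (assumed) fields used below."""
    feats = [1.0 if t in rule["extraction_types"] else 0.0 for t in EXT_TYPES]  # 6 binary
    feats += [float(rule["lhs_is_ne"]), float(rule["rhs_is_ne"])]               # 2 binary (NE)
    feats += [rule["cooccurrence"], rule["times_extracted"]]                    # 2 numeric
    feats += [rule["path_score"]]                                               # 1 numeric
    return feats

def train(rules, labels, pos_weight=2.0):
    """Train a linear SVM; pos_weight plays the role of SVMlight's J parameter."""
    X = np.array([featurize(r) for r in rules])
    y = np.array(labels)
    clf = SVC(kernel="linear", class_weight={1: pos_weight, 0: 1.0})
    clf.fit(X, y)
    return clf
```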

  11. Results and Evaluation • The obtained knowledge base includes: • About 10 million rules • For comparison: Snow's extension to WordNet includes 400,000 relations • More than 2.4 million distinct RHSs • 18% of the rules were extracted by more than one extraction type • Mostly named entities and specific concepts, as expected from an encyclopedia • Two evaluation types: • Rule-based: rule correctness relative to human judgment • Inside a real application: the utility of the extracted rules for lexical expansion in keyword-based text categorization Results & Evaluations

  12. Rule-base Evaluation • Randomly sampled 830 rules and annotated them for correctness • Inter-annotator agreement reached a Kappa of 0.7 • Precision: the percentage of correct rules • Estimated # of correct rules: the number of rules annotated as correct, multiplied by the sampling proportion Results & Evaluations
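For clarity, the two sampling-based quantities amount to the following arithmetic (the count of correct rules below is a placeholder, not the paper's figure):

```python
sample_size = 830            # annotated rules, randomly sampled (from the slide)
correct_in_sample = 600      # placeholder count of rules judged correct
total_rules = 10_000_000     # approximate size of the extracted rule base

precision = correct_in_sample / sample_size
# Correct count scaled up by the sampling proportion (total / sample).
est_correct_rules = correct_in_sample * (total_rules / sample_size)
print(f"precision={precision:.2f}, estimated correct rules={est_correct_rules:,.0f}")
```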

  13. Supervised Learning Evaluation • 5-fold cross validation on the annotated sample • Although it considers additional information, performance is almost identical to considering only the extraction types • Further research is needed to improve our current feature set and classification performance Results & Evaluations
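A minimal sketch of that evaluation, assuming a feature matrix X and labels y built as in the ranking sketch above (scikit-learn again standing in for SVMlight):

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def evaluate(X, y):
    """5-fold cross-validated precision of the rule classifier on the annotated sample."""
    clf = SVC(kernel="linear")
    scores = cross_val_score(clf, X, y, cv=5, scoring="precision")
    return scores.mean()
```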

  14. Text Categorization Evaluation • Represent a category by a feature vector of its characteristic terms • The characteristic terms should entail the category name • Compare the term-based feature vector of a classified document with the feature vectors of all categories • Assign the document to the category which yields the highest cosine similarity score (single-class classification) • 20 Newsgroups collection • Baselines: no expansion, WordNet, WikiBL, [Snow] • Also evaluated the union of Wikipedia and WordNet Results & Evaluations
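A small sketch of this categorization scheme and of how entailment rules expand a category's term set (names and vector construction are illustrative; the actual experiment used the 20 Newsgroups categories):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two term vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def classify(doc_vec, category_vecs):
    """Assign the document to the most cosine-similar category (single-class)."""
    return max(category_vecs, key=lambda c: cosine(doc_vec, category_vecs[c]))

def expand_category(seed_terms, rules):
    """Add every LHS term that entails one of the category's seed terms (rule: lhs -> rhs)."""
    return set(seed_terms) | {lhs for lhs, rhs in rules if rhs in seed_terms}
```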

  15. Text Categorization Evaluation Results & Evaluations

  16. Promising Directions for Future Work • Learning semantic relations in addition to taxonomic relations (hyponymy, synonymy) • Fine-grained typing of lexical entailment relations is important for inference Conclusions & Future Work

  17. Promising Directions for Future Work • Natural Types, naturally phrased entities: • 56,000 terms entail Album • 31,000 terms entail Politician • 11,000 terms entail Footballer • 20,000 terms entail Actor • 15,000 terms entail Actress • 4,000 terms entail American Actor Conclusions & Future Work

  18. Conclusions • The first large-scale rule base aimed at covering lexical entailment (LE) • Learns an ontology, which is important knowledge for reasoning systems (one of the conclusions of the first 3 RTE benchmarks) • Automatically extracts lexical entailment rules from an unstructured source • Comparable results, on a real NLP task, to a costly manually crafted resource such as WordNet Thank You Conclusions & Future Work

  19. Inference System • t: Strong sales were shown for Abbey Road in 1969. • grammar rule (passive to active): Abbey Road showed strong sales in 1969. • lexical entailment rule (Abbey Road → The Beatles): The Beatles showed strong sales in 1969. • lexico-syntactic rule (show strong sales → gain commercial success): h: The Beatles gained commercial success in 1969. Textual Entailment
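To show how such a chain can be mechanized, here is a toy sketch in which each rule rewrites the current sentence until the hypothesis h is reached; plain string substitution stands in for the real grammar and lexico-syntactic machinery:

```python
def apply_chain(text, rules):
    """Apply each (lhs, rhs) rewrite in order, collecting the intermediate sentences."""
    steps = [text]
    for lhs, rhs in rules:
        if lhs in steps[-1]:
            steps.append(steps[-1].replace(lhs, rhs))
    return steps

rules = [
    ("Strong sales were shown for Abbey Road", "Abbey Road showed strong sales"),  # passive -> active
    ("Abbey Road", "The Beatles"),                                                 # lexical entailment
    ("showed strong sales", "gained commercial success"),                          # lexico-syntactic
]
for step in apply_chain("Strong sales were shown for Abbey Road in 1969.", rules):
    print(step)
# final step: "The Beatles gained commercial success in 1969."
```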
