Open Information Extraction from the Web
Oren Etzioni

Presentation Transcript


  1. Open Information Extraction from the Web (Oren Etzioni)

  2. KnowItAll Project (2003…): Rob Bart, Janara Christensen, Tony Fader, Tom Lin, Alan Ritter, Michael Schmitz, Dr. Niranjan Balasubramanian, Dr. Stephen Soderland, Prof. Mausam, Prof. Dan Weld. PhD alumni: Michele Banko, Prof. Michael Cafarella, Prof. Doug Downey, Ana-Maria Popescu, Stefan Schoenmackers, and Prof. Alex Yates. Funding: DARPA, IARPA, NSF, ONR, Google.

  3. Outline
  • A “scruffy” view of Machine Reading
  • Open IE (overview, progress, new demo)
  • Critique of Open IE
  • Future work: Open, Open IE

  4. I. Machine Reading (Etzioni, AAAI ’06)
  • “MR is an exploratory, open-ended, serendipitous process”
  • “In contrast with many NLP tasks, MR is inherently unsupervised”
  • “Very large scale”
  • “Forming Generalizations based on extracted assertions”
  • No Ontology… Ontology Free!

  5. Lessons from DB/KR Research
  • Declarative KR is expensive & difficult
  • Formal semantics is at odds with:
    • Broad scope
    • Distributed authorship
  • KBs are brittle: “can only be used for tasks whose knowledge needs have been anticipated in advance” (Halevy, IJCAI ’03)
  • A fortiori, for KBs extracted from text!

  6. Machine Reading at Web Scale
  • A “universal ontology” is impossible
  • Global consistency is like world peace
  • Micro-ontologies: scale? Interconnections?
  • Ontological “glass ceiling”:
    • Limited vocabulary
    • Pre-determined predicates
    • Swamped by reading at scale!

  7. II. Open vs. Traditional IE: How is Open IE Possible?

  8. Semantic Tractability Hypothesis
  • There is an easy-to-understand subset of English
  • Relations/arguments characterized syntactically (Banko, ACL ’08; Fader, EMNLP ’11; Etzioni, IJCAI ’11); see the sketch below
  • The characterization is compact and domain independent
  • Covers 85% of binary, verb-based relations
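
ReVerb's published form of this syntactic characterization restricts a relation phrase to the part-of-speech pattern V | V P | V W* P: a verb, optionally followed by a run of nouns/adjectives/adverbs and a closing preposition or particle. A minimal sketch of that constraint over Penn Treebank tags; the tag-to-class helper is my simplification, not the paper's exact implementation:

```python
import re

# ReVerb-style relation-phrase constraint (Fader et al., EMNLP '11):
#     V | V P | V W* P
# V = verb, W = noun/adjective/adverb/pronoun/determiner,
# P = preposition, particle, or infinitive marker.

def tag_class(tag):
    """Map a Penn Treebank POS tag to a one-letter class (simplified)."""
    if tag.startswith("VB"):
        return "V"
    if tag in ("IN", "RP", "TO"):
        return "P"
    if tag.startswith(("NN", "JJ", "RB", "PRP", "DT")):
        return "W"
    return "O"  # anything else breaks the pattern

def is_relation_phrase(pos_tags):
    """True iff the tag sequence matches the V | V P | V W* P pattern."""
    classes = "".join(tag_class(t) for t in pos_tags)
    return re.fullmatch(r"V+W*P?", classes) is not None

print(is_relation_phrase(["VBD", "VBN", "IN"]))  # "was born in" -> True
print(is_relation_phrase(["NN", "IN"]))          # "capital of"  -> False (no verb)
```

The real extractor pairs this syntactic filter with a lexical constraint (a relation phrase must occur with many distinct argument pairs), which is what keeps the characterization compact yet high-precision.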

  9. Sample of Extracted Relations / Sample Relation Phrases [figure slide]

  10. Number of Relations [figure slide]

  11. TextRunner (2007)
  • First Web-scale Open IE system
  • Distant supervision + CRF models of relations: (Arg1, Relation phrase, Arg2)
  • 1,000,000,000 distinct extractions

  12. Relation Extraction from the Web [figure slide]

  13. Open IE (2012)
  • Example: “After beating the Heat, the Celtics are now the ‘top dog’ in the NBA.” → (the Celtics, beat, the Heat)
  • Example: “If he wins 5 key states, Romney will be president” → extraction kept with the counterfactual context “if he wins 5 key states”
  • Open-source ReVerb extractor
  • Synonym detection
  • Parser-based Ollie extractor (Mausam, EMNLP ’12): verbs → nouns and more; analyzes context (beliefs, counterfactuals; see the sketch below)
  • Sophistication of IE is a major focus
  • But what about entities, types, ontologies?
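
To make the context point concrete, here is a toy illustration of attaching condition/attribution context to a tuple instead of asserting it as fact. The field names and cue lists are my own simplification, not Ollie's actual representation (Ollie uses dependency parses, not string matching):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Extraction:
    arg1: str
    rel: str
    arg2: str
    condition: Optional[str] = None    # e.g. a counterfactual "if ..." clause
    attribution: Optional[str] = None  # e.g. a belief/report cue

def attach_context(extraction, sentence):
    """Toy context check: flag conditional and belief contexts rather than
    asserting the tuple as a fact about the world."""
    lowered = sentence.lower()
    if lowered.startswith("if "):
        extraction.condition = sentence.split(",")[0]
    for cue in ("believe", "believed", "claims", "said that"):
        if cue in lowered:
            extraction.attribution = cue
    return extraction

e = Extraction("Romney", "will be", "president")
print(attach_context(e, "If he wins 5 key states, Romney will be president"))
```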

  14. Towards “Ontologized” Open IE
  • Link arguments to Freebase, when possible! (Lin, AKBC ’12)
  • Associate types with args
  • No Noun Phrase Left Behind (Lin, EMNLP ’12)

  15. System Architecture
  • Input: Web corpus
  • Extractor (relation-independent extraction) → raw tuples, e.g., (XYZ Corp.; acquired; Go Inc.), (XYZ; buyout of; Go Inc.), (Albert Einstein; born in; Ulm), (Einstein; was born in; Ulm), (Einstein Bros.; sell; bagels), (oranges; contain; Vitamin C)
  • Assessor (synonyms, confidence): XYZ Corp. = XYZ; Albert Einstein = Einstein ≠ Einstein Bros. → merged extractions with support counts: Acquire(XYZ Corp., Go Inc.) [7], BornIn(Albert Einstein, Ulm) [5], Sell(Einstein Bros., bagels) [1], Contain(oranges, Vitamin C) [1]
  • Query processor: index in Lucene; link entities
  • DEMO
  (A sketch of the assessor step follows.)
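
A minimal sketch of the assessor step, using the slide's own examples: merge synonymous strings and count how many raw tuples support each merged extraction. The hand-written synonym table is a stand-in for the learned synonym model (Resolver, later slides):

```python
from collections import Counter

# Toy synonym table; the real assessor learns these equivalences
# (and must NOT merge "Einstein Bros." with "Albert Einstein").
SYNONYMS = {
    "XYZ": "XYZ Corp.",
    "Einstein": "Albert Einstein",
    "buyout of": "acquired",
    "born in": "was born in",
}

def assess(raw_tuples):
    """Merge synonymous tuples; support count serves as a confidence signal."""
    merged = Counter()
    for arg1, rel, arg2 in raw_tuples:
        key = tuple(SYNONYMS.get(x, x) for x in (arg1, rel, arg2))
        merged[key] += 1
    return merged

raw = [
    ("XYZ Corp.", "acquired", "Go Inc."),
    ("XYZ", "buyout of", "Go Inc."),
    ("Albert Einstein", "born in", "Ulm"),
    ("Einstein", "was born in", "Ulm"),
    ("Einstein Bros.", "sell", "bagels"),
]
for tup, support in assess(raw).most_common():
    print(support, tup)
```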

  16. III. Critique of Open IE
  • Lack of formal ontology/vocabulary
  • Inconsistent extractions
  • Can it support reasoning?
  • What’s the point of Open IE?

  17. Perspectives on Open IE
  • “Search Needs a Shakeup” (Etzioni, Nature ’11)
  • Textual resources
  • Reasoning over extractions

  18. A. New Paradigm for Search
  “Moving Up the Information Food Chain” (Etzioni, AAAI ’96)
  • Retrieval → Extraction
  • Snippets, docs → Entities, Relations
  • Keyword queries → Questions
  • List of docs → Answers
  Essential for smartphones! (Siri meets Watson)

  19. Case Study over Yelp Reviews
  • Map the review corpus to (attribute, value) pairs: (sushi = fresh), (parking = free)
  • Natural-language queries: “Where’s the best sushi in Seattle?”
  • Sort results via sentiment analysis: exquisite > very good > so-so
  (A toy version follows.)
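
A toy version of the pipeline just described: pull (attribute, value) pairs out of review sentences and rank the values on a sentiment scale. Both the regex and the scale are illustrative assumptions, far cruder than RevMiner's actual models:

```python
import re

# Hand-built sentiment scale, echoing the slide's "exquisite > very good > so-so".
SENTIMENT = {"so-so": 0, "good": 1, "very good": 2, "exquisite": 3}

def attribute_values(review):
    """Crude pattern: '<attribute> was/is <value>'."""
    return re.findall(r"(\w+) (?:was|is) ([\w -]+?)(?:[.,]|$)", review.lower())

reviews = [
    "The sushi was exquisite.",
    "Sushi is very good, parking is free.",
    "The sushi was so-so.",
]
pairs = [p for r in reviews for p in attribute_values(r)]
sushi_values = [v for attr, v in pairs if attr == "sushi"]
# Best-first ranking by sentiment strength:
print(sorted(sushi_values, key=lambda v: SENTIMENT.get(v, 0), reverse=True))
```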

  20. RevMiner: Extractive Interface to 400K Yelp Reviews (Huang, UIST ’12). revminer.com

  21. B. Public Textual Resources (Leveraging Open IE)
  Example rule: (police investigate X) → (police charge Y)
  • 94M Rel-grams: n-grams, but over relations in text (Balasubramanian, AKBC ’12)
  • 600K relation phrases (Fader, EMNLP ’11)
  • Relation meta-data:
    • 50K domain/range for relations (Ritter, ACL ’10)
    • 10K functional relations (Lin, EMNLP ’10)
  • 30K learned Horn clauses (Schoenmackers, EMNLP ’10)
  • CLEAN (Berant, ACL ’12)
  • 10M entailment rules (coming soon): precision double that of DIRT
  See openie.cs.washington.edu

  22. C. Reasoning over Extractions
  Over 1,000,000,000 extractions:
  • Identify synonyms (Yates & Etzioni, JAIR ’09)
  • Linear-time 1st-order Horn-clause inference (Schoenmackers, EMNLP ’08)
  • Transitive inference (Berant, ACL ’11)
  • Learn argument types via a generative model (Ritter, ACL ’10)

  23. Unsupervised, probabilistic model for identifying synonyms
  • P(Bill Clinton = President Clinton): count shared (relation, arg2) pairs (sketched below)
  • P(acquired = bought): for relations, count shared (arg1, arg2) pairs
  • Functions, mutual recursion
  • Next step: unify with …
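
A sketch of the shared-property intuition: two entity strings are likely synonyms if they co-occur with many of the same (relation, arg2) pairs. Plain Jaccard overlap here is a stand-in for the actual probabilistic model; relation synonymy is the mirror image, counting shared (arg1, arg2) pairs:

```python
def properties(entity, extractions):
    """The (relation, arg2) pairs observed with this entity as arg1."""
    return {(rel, arg2) for arg1, rel, arg2 in extractions if arg1 == entity}

def synonym_score(e1, e2, extractions):
    """Jaccard overlap of shared properties; stand-in for P(e1 = e2)."""
    p1, p2 = properties(e1, extractions), properties(e2, extractions)
    if not p1 or not p2:
        return 0.0
    return len(p1 & p2) / len(p1 | p2)

extractions = [
    ("Bill Clinton", "was elected", "president"),
    ("Bill Clinton", "vetoed", "the bill"),
    ("President Clinton", "was elected", "president"),
    ("President Clinton", "vetoed", "the bill"),
    ("Einstein Bros.", "sells", "bagels"),
]
print(synonym_score("Bill Clinton", "President Clinton", extractions))  # 1.0
print(synonym_score("Bill Clinton", "Einstein Bros.", extractions))    # 0.0
```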

  24. Scalable Textual Inference
  Desiderata for inference:
  • In text → probabilistic inference
  • On the Web → linear in |Corpus|
  Argument distributions of textual relations:
  • Inference provably linear
  • Empirically linear!

  25. Inference Scalability for Holmes

  26. Extractions → Domain/Range
  • Much previous work (Resnik, Pantel, etc.)
  • Utilize generative topic models:
    • Extractions of R → document
    • Domain/range of R → topics

  27. Relations as Documents
  TextRunner extractions, grouped by relation:
  • born_in: (Sergey Brin, Moscow), (Bill Gates, Seattle), (Einstein, March), (Sergey Brin, 1973), (Einstein, Ulm)
  • headquartered_in: (Microsoft, Redmond), (Google, Mountain View)
  • founded_in: (Google, 1998), (Microsoft, Albuquerque), (Microsoft, 1973)

  28. Generative Story [LinkLDA, Erosheva et al., 2004]
  • For each relation (e.g., X born_in Y), randomly pick a distribution over types: P(Topic1 | born_in) = 0.5, P(Topic2 | born_in) = 0.3, …
  • For each extraction, pick a type (topic) for arg1 and for arg2, from two separate sets of type distributions
  • Then pick the arguments based on those types, e.g., Person born_in Location generating “Sergey Brin born_in Moscow”
  (A sampling sketch follows.)
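
An ancestral-sampling sketch of that generative story, treating each relation as a "document" whose argument fillers are tokens and whose topics act as types. Vocabulary sizes and hyperparameters are toy values of my choosing, and arguments come out as integer token ids rather than strings:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V1, V2 = 5, 100, 100          # topics (types), arg1 vocab, arg2 vocab
alpha, beta = 0.1, 0.01          # Dirichlet hyperparameters

phi1 = rng.dirichlet(beta * np.ones(V1), size=K)  # per-topic arg1 dists
phi2 = rng.dirichlet(beta * np.ones(V2), size=K)  # per-topic arg2 dists

def generate_relation(n_extractions):
    """Sample one relation's extractions, e.g. born_in as a 'document'."""
    theta = rng.dirichlet(alpha * np.ones(K))   # type mix for this relation
    tuples = []
    for _ in range(n_extractions):
        z1 = rng.choice(K, p=theta)             # type of arg1 (e.g. Person)
        z2 = rng.choice(K, p=theta)             # type of arg2 (e.g. Location)
        a1 = rng.choice(V1, p=phi1[z1])         # arg1 token id
        a2 = rng.choice(V2, p=phi2[z2])         # arg2 token id
        tuples.append((a1, a2))
    return theta, tuples

theta, tuples = generate_relation(10)
print(theta.round(2), tuples[:3])
```

Inference then runs in the opposite direction: given the observed tuples, recover each relation's theta, which is exactly the learned domain/range on the next slide.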

  29. Examples of Learned Domain/Range
  • elect(Country, Person)
  • predict(Expert, Event)
  • download(People, Software)
  • invest(People, Assets)
  • was_born_in(Person, Location OR Date)

  30. Summary: Trajectory of Open IE [figure slide]. openie.cs.washington.edu

  31. IV. Future: Open, Open IE
  • Open input: ingest tuples from any source, as (Tuple, Source, Confidence); sketched below
  • Linked open output:
    • Extractions → Linked Open Data (LOD) cloud
    • Relation normalization
    • Use LOD best practices
  • Specialized reasoners
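
A minimal shape for that open-input record. Field names and the example values are illustrative, following only the (Tuple, Source, Confidence) triple named on the slide:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OpenTuple:
    """An ingested assertion that always carries provenance and confidence."""
    arg1: str
    rel: str
    arg2: str
    source: str        # producing system/corpus, e.g. "ReVerb", "NELL"
    confidence: float  # producer-reported score in [0, 1]

kb = [
    OpenTuple("Einstein", "was born in", "Ulm", "ReVerb", 0.93),
    OpenTuple("Einstein", "was born in", "Ulm", "DBpedia", 1.0),
]
# Agreement across independent sources can raise combined confidence.
print(len({(t.arg1, t.rel, t.arg2) for t in kb}), "distinct assertion(s)")
```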

  32. Conclusions
  • Ontology is not necessary for reasoning
  • Open IE is “gracefully” ontologized
  • Open IE is boosting text analysis
  • LOD has distribution & scale (but not text) = opportunity
  Thank you

  33. Qs
  • Why open?
  • What’s next?
  • Dimensions for analyzing systems
  • What’s worked, what’s failed? (lessons)
  • What can we learn from Watson?
  • What can we learn from DB/KR? (Alon)

  34. Questions
  • What extraction mechanism is used?
  • What corpus?
  • What input knowledge?
  • Role for people/manual labeling?
  • Form of the extracted knowledge?
  • Size/scope of extracted knowledge?
  • What reasoning is done?
  • Most unique aspect?
  • Biggest challenge?

  35. Scalability Notes
  • Interoperability, distributed authorship vs. a monolithic system
  • Open IE meets RDF (sketched below):
    • Need URIs for predicates. How to obtain them?
    • What about errors in mapping to URIs?
    • Ambiguity? Uncertainty?
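
A sketch of the Open-IE-meets-RDF step using rdflib, minting URIs for argument and relation strings under a hypothetical namespace. The naive string-normalizing scheme here is precisely what the slide's questions warn about: it ignores ambiguity and mapping errors entirely:

```python
from rdflib import Graph, Namespace

# Hypothetical namespace for minted Open IE predicates and entities.
OIE = Namespace("http://example.org/openie/")

def to_uri(phrase):
    """Naive URI minting by normalizing the surface string. This conflates
    ambiguous strings (every "Einstein" gets one URI), the slide's concern."""
    return OIE[phrase.strip().lower().replace(" ", "_")]

g = Graph()
g.add((to_uri("Albert Einstein"), to_uri("was born in"), to_uri("Ulm")))
print(g.serialize(format="turtle"))
```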

  36. Reasoning
  • NELL: inter-class constraints to generate negative examples

  37. Dimensions of Scalability
  • Corpus size
  • Syntactic coverage over text
  • Semantic coverage over text
  • Time, belief, n-ary relations, etc.
  • Number of entities, relations
  • Ability to reason
  • How much CPU?
  • How much manual effort?
  • Bounding, ceiling effect, ontological glass ceiling

  38. Examples of Limiting Assumptions
  • NELL: “apple” has a single meaning
    • Single atom per entity
    • Global computation to add an entity
    • Can’t be sure
  • LOD:
    • Best practices
    • sameAs links

  39. Risks for a Scalable System
  • Limited semantics, reasoning
  • No reasoning…

  40. LOD triples in Aug 2011: 31,634,213,770

  41. The following statement appears in the last paragraph of the W3C Linked Library Data Group Final Report: “… Linked Data follows an open-world assumption: the assumption that data cannot generally be assumed to be complete and that, in principle, more data may become available for any given entity.”

  42. [figure slide]

  43. Entity Linking an Extraction Corpus
  Example: “Einstein quit his job at the patent office” (8 source sentences), with candidate links US Patent Office (1,281 inlinks), EU Patent Office (168), Japan Patent Office (56), Swiss Patent Office (101), Patent (4,620).
  1. String Match: obtain candidates and measure string similarity (exact string match = best match); also consider known aliases, alternate capitalization, edit distance, word overlap, substring/superstring, potential abbreviations.
  2. Prominence Prior: # of links in Wikipedia to that entity’s article.
  3. Context Match: cosine similarity between the “document” of the extraction’s source sentences (“Einstein quit his job at the patent office to become a professor.” “In 1909, Einstein quit his job at the patent office.” “Einstein quit his job at the patent office where he worked.” …) and Wikipedia article texts.
  • Link Score is a function of (String Match Score, Prominence Prior Score, Context Match Score), e.g., String Match Score × ln(Prominence Prior Score) × Context Match Score (sketched below).
  • Link Ambiguity compares the 2nd-top link score to the top link score.
  • Collective linking vs. one extraction at a time.
  • Speed: a 2.53 GHz computer links 15 million text arguments in ~3 days (60+ per second).
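
A sketch of that example scoring function. The inlink counts are the slide's; the string-match and context-match values are made-up stand-ins chosen so that, as on the slide, Swiss Patent Office wins on a very high context match despite a modest prominence prior:

```python
import math

# entity: (Wikipedia inlinks from the slide, string match, context match)
CANDIDATES = {
    "US Patent Office":    (1281, 0.6, 0.2),
    "EU Patent Office":    (168,  0.6, 0.1),
    "Japan Patent Office": (56,   0.6, 0.1),
    "Swiss Patent Office": (101,  0.6, 0.9),  # "very high" context on slide
    "Patent":              (4620, 0.4, 0.3),
}

def link_score(inlinks, string_match, context_match):
    """Example combination from the slide: string x ln(prominence) x context."""
    return string_match * math.log(inlinks) * context_match

ranked = sorted(CANDIDATES, key=lambda e: link_score(*CANDIDATES[e]), reverse=True)
top, second = ranked[0], ranked[1]
# Link ambiguity: how close the runner-up is to the winner.
ambiguity = link_score(*CANDIDATES[second]) / link_score(*CANDIDATES[top])
print(f"top link: {top}  (ambiguity = {ambiguity:.2f})")
```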

  44. Q/A with Linked Extractions
  Leverages KBs by linking textual arguments to entities found in the knowledge base.
  • Ambiguous entities: “I need to learn about Titanic the ship for my homework.” Linking separates the ship (“The Titanic set sail from Southampton”, “RMS Titanic weighed about 26 kt”, “The Titanic was built for safety and comfort”, “The Titanic sank in 12,460 feet of water”, “The Titanic sank in 1912”, “Titanic was built in Belfast”, 1,902 more …) from the film (“Titanic earned more than $1 billion worldwide”, “The Titanic was released in 1998”, “Titanic represents the state-of-the-art in special effects”, 3,761 more …)
  • Typed search: “Which sports originated in China?” Raw matches (“Noodles originated in China”, “Printmaking originated in China”, “Soy Beans originated in China”, “Taoism originated in China”, 534 more …) are filtered by the Freebase type Sports to Wushu, Ping Pong, Golf, Soccer, Karate, Dragon Boating (14 more …)
  • Linked resources: Freebase aliases, e.g., Dragon Boating → “Dragon Boat Racing”, Ping Pong → “Table Tennis”

  45. Linked Extractions Support Reasoning
  In addition to question answering, linking can also benefit:
  • Functions [Ritter et al., 2008; Lin et al., 2010]
  • Other relation properties [Popescu, 2007; Lin et al., CSK 2010]
  • Inference [Schoenmackers et al., 2008; Berant et al., 2011]
  • Knowledge-base population [Dredze et al., 2010]
  • Concept-level annotations [Christensen and Pasca, 2012]
  • … basically anything using the output of extraction
  Other Web-based text containing entities (e.g., query logs) can also be linked to enable new experiences.

  46. Challenges
  • Single-sentence extraction:
    • He believed the plan will work
    • John Glenn was the first American in space
    • Obama was elected President in 2008.
    • American president Barack Obama asserted…
  • ??
