
Machine Reading of Web Text


Presentation Transcript


  1. Machine Reading of Web Text Oren Etzioni Turing Center University of Washington http://turing.cs.washington.edu

  2. Rorschach Test

  3. Rorschach Test for CS

  4. Moore’s Law?

  5. Storage Capacity?

  6. Number of Web Pages?

  7. Number of Facebook Users?

  8. Turing Center Foci • Scale MT to 49,000,000 language pairs • 2,500,000-word translation graph • P(V → F → C)? • PanImages • Accumulate knowledge from the Web • A new paradigm for Web Search

  9. Outline • A New Paradigm for Search • Open Information Extraction • Tractable Inference • Conclusions

  10. Web Search in 2020? • Type key words into a search box? • Social or “human powered” Search? • The Semantic Web? • What about our technology exponentials? “The best way to predict the future is to invent it!”

  11. Intelligent Search Instead of merely retrieving Web pages, read ‘em! Machine Reading = Information Extraction (IE) + tractable inference • IE(sentence) = who did what? • speaker(Alon Halevy, UW) • Inference = uncover implicit information • Will Alon visit Seattle?

  12. Application: Information Fusion • What kills bacteria? • What west coast, nano-technology companies are hiring? • Compare Obama’s “buzz” versus Hillary’s? • What is a quiet, inexpensive, 4-star hotel in Vancouver?

  13. Opinion Mining • Opine (Popescu & Etzioni, EMNLP ’05) • IE(product reviews) • Informative • Abundant, but varied • Textual • Summarize reviews without any prior knowledge of product category

  14. But “Reading” the Web is Tough • Traditional IE is narrow • IE has been applied to small, homogeneous corpora • No parser achieves high accuracy • No named-entity taggers • No supervised learning How about semi-supervised learning?

  15. Semi-Supervised Learning • Few hand-labeled examples per concept! • ⇒ Limit on the number of concepts • ⇒ Concepts are pre-specified • ⇒ Problematic for the Web • Alternative: self-supervised learning • Learner discovers concepts on the fly • Learner automatically labels examples

  16. 2. Open IE = Self-supervised IE (Banko, Cafarella, Soderland, et al., IJCAI ’07)

  17. Extractor Overview (Banko & Etzioni, ’08) • Use a simple model of relationships in English to label extractions • Bootstrap a general model of relationships in English sentences, encoded as a CRF • Decompose each sentence into one or more (NP1, VP, NP2) “chunks” • Use CRF model to retain relevant parts of each NP and VP. The extractor is relation-independent!

  18. TextRunner Extraction • Extract a triple representing a binary relation (Arg1, Relation, Arg2) from a sentence: “Internet powerhouse, EBay, was originally founded by Pierre Omidyar.” → (EBay, founded by, Pierre Omidyar)

  19. Numerous Extraction Challenges • Drop non-essential info: “was originally founded by” → “founded by” • Retain key distinctions: “EBay founded by Pierre” ≠ “EBay founded Pierre” • Non-verb relationships: “George Bush, president of the U.S…” • Synonymy & aliasing: Albert Einstein = Einstein ≠ Einstein Bros.
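To make the extraction and normalization steps on the last two slides concrete, here is a minimal, illustrative sketch: given a sentence already chunked into (NP1, VP, NP2), it keeps only the essential words of the relation phrase and emits a triple. The stop-word list and the hand-supplied chunks are assumptions for illustration; TextRunner itself learns a CRF to decide which words to keep or drop.

```python
# Toy sketch of the TextRunner-style step: given an (NP1, VP, NP2) chunk,
# keep only the essential words of the relation phrase and emit a triple.
# The stop-word list and the hand-supplied chunks are illustrative only;
# TextRunner itself learns a CRF to decide which words to keep or drop.

NON_ESSENTIAL = {"was", "were", "is", "are", "originally", "also", "just"}

def extract_triple(np1, vp, np2):
    """Return (Arg1, Relation, Arg2), dropping non-essential words from the VP."""
    relation = " ".join(w for w in vp.split() if w.lower() not in NON_ESSENTIAL)
    return (np1, relation, np2)

# "Internet powerhouse, EBay, was originally founded by Pierre Omidyar."
print(extract_triple("EBay", "was originally founded by", "Pierre Omidyar"))
# -> ('EBay', 'founded by', 'Pierre Omidyar')
```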

  20. TextRunner (Web’s 1st Open IE system) • Self-Supervised Learner: automatically labels example extractions & learns an extractor • Single-Pass Extractor: single pass over corpus, identifying extractions in each sentence • Query Processor: indexes extractions; enables queries at interactive speeds

  21. TextRunner Demo

  22. Sample of 9 Million Web Pages • Triples: 11.3 million • With well-formed relation: 9.3 million • With well-formed entities: 7.8 million • Abstract facts, e.g., (fruit, contain, vitamins): 6.8 million, 79.2% correct • Concrete facts, e.g., (Oppenheimer, taught at, Berkeley): 1.0 million, 88.1% correct

  23. 3. Tractable Inference Much textual information is implicit • Entity and predicate resolution • Probability of correctness • Composing facts to draw conclusions

  24. I. Entity Resolution Resolver (Yates & Etzioni, HLT ’07): determines synonymy based on relations found by TextRunner (cf. Pantel & Lin ‘01) • (X, born in, 1941) (M, born in, 1941) • (X, citizen of, US) (M, citizen of, US) • (X, friend of, Joe) (M, friend of, Mary) P(X = M) ~ shared relations

  25. Relation Synonymy (1, R, 2) (2, R, 4) (4, R, 8) etc. vs. (1, R’, 2) (2, R’, 4) (4, R’, 8) etc. • P(R = R’) ~ shared argument pairs • Unsupervised probabilistic model • O(N log N) algorithm run on millions of docs
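A simplified sketch of the synonymy intuition on these two slides: score a pair of strings by how many (relation, other-argument) contexts they share among the extracted triples. Resolver itself uses an unsupervised probabilistic model and an O(N log N) algorithm; this toy version only counts overlap, and the triples are the illustrative ones from the slide.

```python
# Simplified sketch of Resolver's intuition: two strings are more likely to be
# synonyms the more (relation, other-argument) contexts they share among the
# extracted triples. The real system uses an unsupervised probabilistic model;
# here we just count shared contexts.

def contexts(entity, triples):
    """All (relation, other_arg, position) contexts an entity appears in."""
    ctx = set()
    for arg1, rel, arg2 in triples:
        if arg1 == entity:
            ctx.add((rel, arg2, "arg1"))
        if arg2 == entity:
            ctx.add((rel, arg1, "arg2"))
    return ctx

def synonymy_score(x, y, triples):
    """Number of shared contexts; higher suggests x and y co-refer."""
    return len(contexts(x, triples) & contexts(y, triples))

triples = [
    ("X", "born in", "1941"), ("M", "born in", "1941"),
    ("X", "citizen of", "US"), ("M", "citizen of", "US"),
    ("X", "friend of", "Joe"), ("M", "friend of", "Mary"),
]
print(synonymy_score("X", "M", triples))  # -> 2 shared contexts
```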

  26. II. Probability of Correctness How likely is an extraction to be correct? Factors to consider include: • Authoritativeness of source • Confidence in extraction method • Number of independent extractions

  27. Counting Extractions Lexico-syntactic patterns (Hearst ’92): “…cities such as Seattle, Boston, and…” Turney’s PMI-IR, ACL ’02: • PMI ~ co-occurrence frequency → # results • # results → confidence in class membership
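A hedged sketch of the hit-count scoring idea: PMI-IR estimates how strongly an instance co-occurs with a class discriminator phrase (e.g., “cities such as …”) from search-engine result counts. The hit counts below are invented for illustration; a real system would issue the queries to a search engine.

```python
# Sketch of Turney-style PMI-IR scoring: the association between an instance
# and a class discriminator phrase is estimated from search-engine hit counts.
# The hit counts below are invented; a real system would query a search engine.

def pmi_ir(hits_discriminator_and_instance, hits_instance):
    """PMI-IR score ~ P(discriminator | instance), estimated from hit counts."""
    if hits_instance == 0:
        return 0.0
    return hits_discriminator_and_instance / hits_instance

# Does "Seattle" behave like a city? Compare hits for
# '"cities such as Seattle"' against hits for '"Seattle"'.
print(pmi_ir(hits_discriminator_and_instance=120_000, hits_instance=50_000_000))
```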

  28. Formal Problem Statement If an extraction x appears k times in a set of n distinct sentences, each suggesting that x belongs to C, what is the probability that x ∈ C? C is a class (“cities”) or a relation (“mayor of”) Note: we only count distinct sentences!

  29. Combinatorial Model (“Urns”) Odds increase exponentially with k, but decrease exponentially with n. See Downey et al.’s IJCAI ’05 paper for formal details.
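The following sketch captures the qualitative behavior (posterior odds rising with k but falling with n) using a simplified two-rate binomial model; the per-sentence rates and the prior are invented for illustration, and the actual Urns model in Downey et al. is a richer combinatorial model than this.

```python
# Simplified sketch of the Urns intuition: an extraction x was seen k times in
# n distinct sentences. Model correct and incorrect extractions as recurring at
# different per-sentence rates and apply Bayes' rule. The rates and prior are
# illustrative; the real Urns model (Downey et al., IJCAI '05) is richer.
from math import comb

def p_correct(k, n, p_true=0.05, p_false=0.005, prior=0.5):
    """P(x in C | k hits out of n sentences) under a two-rate binomial model."""
    like_true = comb(n, k) * p_true**k * (1 - p_true)**(n - k)
    like_false = comb(n, k) * p_false**k * (1 - p_false)**(n - k)
    return prior * like_true / (prior * like_true + (1 - prior) * like_false)

print(p_correct(k=5, n=100))    # many hits in a small sample: likely correct
print(p_correct(k=5, n=10000))  # same hits in a huge sample: probability drops
```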

  30. Performance (15x Improvement) Self-supervised, domain-independent method

  31. Urns Limited on “Sparse” Facts • Frequently extracted facts, e.g., (Michael Bloomberg, New York City), tend to be correct • Sparse facts, e.g., (Dave Shaver, Pickerington) and (Ronald McDonald, McDonaldland), are a mixture of correct and incorrect

  32. Language Models to the Rescue (Downey, Schoenmackers, Etzioni, ACL ’07) Instead of only lexico-syntactic patterns, leverage all contexts of a particular entity Statistical ‘type check’: does Pickerington “behave” like a city? does Shaver “behave” like a mayor? Language model = HMM (built once per corpus) • Project string to point in 20-dimensional space • Measure proximity of Pickerington to Seattle, Boston, etc.

  33. III. Compositional Inference (work in progress, Schoenmackers, Etzioni, Weld) Implicit information (2 + 2 = 4) • TextRunner: (Turing, born in, London) • WordNet: (London, part of, England) • Rule: ‘born in’ is transitive through ‘part of’ • Conclusion: (Turing, born in, England) • Mechanism: MLN instantiated on the fly • Rules: learned from corpus (future work) • Inference Demo
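A toy sketch of the composition step above: apply the hand-written rule “‘born in’ is transitive through ‘part of’” to the slide’s triples by forward chaining to a fixpoint. The real mechanism instantiates a Markov Logic Network on the fly and attaches probabilities; this sketch is purely deterministic.

```python
# Toy sketch of compositional inference: chain the rule
# "if (X, born in, Y) and (Y, part of, Z) then (X, born in, Z)"
# over a small fact set until no new facts are derived.

facts = {
    ("Turing", "born in", "London"),    # from TextRunner
    ("London", "part of", "England"),   # from WordNet
}

def apply_transitive_through(facts, rel="born in", via="part of"):
    """If (X, rel, Y) and (Y, via, Z), conclude (X, rel, Z). Iterate to fixpoint."""
    derived = set(facts)
    while True:
        new = {(x, rel, z)
               for (x, r1, y) in derived if r1 == rel
               for (y2, r2, z) in derived if r2 == via and y2 == y}
        if new <= derived:
            return derived
        derived |= new

print(apply_transitive_through(facts) - facts)
# -> {('Turing', 'born in', 'England')}
```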

  34. KnowItAll Family Tree Mulder ’01, WebKB ’99, PMI-IR ’01, KnowItAll ’04, Opine ’05, BE ’05, Urns, Woodward ’06, KnowItNow ’05, Resolver ’07, TextRunner ’07, REALM ’07, Inference ’08

  35. KnowItAll Team • Michele Banko • Michael Cafarella • Doug Downey • Alan Ritter • Dr. Stephen Soderland • Stefan Schoenmackers • Prof. Dan Weld • Mausam • Alumni: Dr. Ana-Maria Popescu, Dr. Alex Yates, and others.

  36. Related Work • Sekine’s “pre-emptive IE” • Powerset • Textual Entailment • AAAI ’07 Symposium on “Machine Reading” • Growing body of work on IE from the Web

  37. 4. Conclusions Imagine search systems that operate over a (more) semantic space • Key words, documents → extractions • TF-IDF, PageRank → relational models • Web pages, hyperlinks → entities, relations Reading the Web → new Search Paradigm

  38. Thank you

  39. Machine Reading = unsupervised understanding of text Much is implicit → tractable inference is key!

  40. HMM in more detail Training: seek to maximize the probability of the corpus w given latent states t using EM. [HMM diagram: hidden states t_i, t_{i+1}, …, t_{i+4} emit the words w_i, w_{i+1}, …, w_{i+4}, e.g., the phrase “cities such as Los Angeles”]

  41. Using the HMM at Query Time • Given a set of extractions (Arg1, Rln, Arg2) • Seeds = most frequent Args for Rln • Distribution over t is read from the HMM • Compute KL divergence via f(arg, seeds) • For each extraction, average f over Arg1 & Arg2 • Sort “sparse” extractions in ascending order
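A small sketch of this query-time ranking, assuming each argument’s distribution over HMM latent states has already been computed from the corpus (the distributions and entity names below are invented): extractions whose arguments’ state distributions are close, in KL divergence, to those of the seed arguments sort first.

```python
# Sketch of the query-time "type check": rank extractions by how far each
# argument's distribution over HMM latent states is from the distribution of
# the seed arguments for that relation and argument position. The state
# distributions are invented; in the real system they come from an HMM
# trained once over the corpus.
from math import log

def kl(p, q, eps=1e-9):
    """KL divergence D(p || q) over a shared, ordered list of latent states."""
    return sum(pi * log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def rank_sparse(extractions, arg_dists, seed_arg1, seed_arg2):
    """Sort (Arg1, Rel, Arg2) triples by average KL of their args to the seeds."""
    def score(triple):
        a1, _, a2 = triple
        return (kl(seed_arg1, arg_dists[a1]) + kl(seed_arg2, arg_dists[a2])) / 2
    return sorted(extractions, key=score)

seed_arg1 = [0.6, 0.3, 0.1]   # aggregate distribution of "mayor-like" seed args
seed_arg2 = [0.7, 0.2, 0.1]   # aggregate distribution of "city-like" seed args
arg_dists = {
    "Michael Bloomberg": [0.6, 0.3, 0.1], "New York City": [0.7, 0.2, 0.1],
    "Dave Shaver": [0.2, 0.3, 0.5], "Pickerington": [0.5, 0.3, 0.2],
}
print(rank_sparse([("Dave Shaver", "mayor of", "Pickerington"),
                   ("Michael Bloomberg", "mayor of", "New York City")],
                  arg_dists, seed_arg1, seed_arg2))
```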

  42. Language Modeling & Open IE • Self-supervised • Illuminating phrases → full context • Handles sparse extractions

  43. Focus: Open IE on Web Text • Challenges: difficult, ungrammatical sentences; unreliable information; heterogeneous corpus • Advantages: “semantically tractable” sentences; redundancy; search engines

  44. II. Probability of Correctness How likely is an extraction to be correct? Distributional Hypothesis: “words that occur in the same contexts tend to have similar meanings” KnowItAll Hypothesis: extractions that occur in the same informative contexts more frequently are more likely to be correct.

  45. Argument “Type Checking” via HMM Relation’s arguments are “typed”: (Person, Mayor Of, City) Training: model the distribution of Person & City contexts in the corpus (Distributional Hypothesis) Query time: rank sparse triples by how well each argument’s context distribution matches that of its type
