
Information Extraction from Wikipedia: Moving Down the Long Tail


Presentation Transcript


  1. Information Extraction from Wikipedia: Moving Down the Long Tail Fei Wu, Raphael Hoffmann, Daniel S. Weld Department of Computer Science & Engineering University of Washington Seattle, WA, USA Intelligence in Wikipedia: Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel, Stef Schoenmackers & Michael Skinner

  2. Motivating Vision
  • Next-Generation Search = Information Extraction + Ontology + Inference
  • Example query: Which performing artists were born in Chicago?
  • Evidence scattered across articles: “Bob was born in Northwestern Memorial Hospital.” … “Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago …” … “Bob Black is an active actor who was selected as this year’s …”

  3. Next-Generation Search
  • Information Extraction: <Bob, Born-In, NMH>, <Bob Black, ISA, actor>, <NMH, in Chicago> …
  • Ontology: Actor ISA Performing Artist …
  • Inference: Born-In(A) ^ PartOf(A,B) => Born-In(B) …
  • Source sentences: “Bob was born in Northwestern Memorial Hospital.” … “Bob Black is an active actor who …” … “Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago …”

  4. Wikipedia – Bootstrap for the Web • Goal: search over the Web • Now: search over Wikipedia • Comprehensive • High-quality • (Semi-)Structured data

  5. Infoboxes
  • Infoboxes are designed to present summary information about an article’s subject, so that similar subjects have a uniform look in a common format
  • An infobox is a generalization of a taxobox (from “taxonomy”), which summarizes information for an organism or group of organisms

  6. Infobox examples: basic infobox; taxobox – plant species

  7. More examples: Infobox People – Actor; Infobox – Convention Center

  8. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  9. Kylin: Autonomously Semantifying Wikipedia
  • Totally autonomous, with no additional human effort
  • Forms its training dataset from existing infoboxes
  • Extracts semantic relations from Wikipedia articles
  • Kylin: a mythical hooved Chinese chimerical creature that is said to appear in conjunction with the arrival of a sage. (Wikipedia)

  10. Kylin
  • A prototype self-supervised machine-learning system
  • Looks for classes of pages with similar infoboxes
  • Determines their common attributes
  • Creates training examples from them

  11. Infobox Generation

  12. Preprocessor: Schema Refinement
  • Free editing leads to schema drift:
  • Duplicate templates: U.S. County (1428), US County (574), Counties (50), County (19)
  • Low usage of some attributes
  • Duplicate attributes: “Census Yr”, “Census Estimate Yr”, “Census Est.”, “Census Year”
  • Kylin’s refinement: strict name matching, keeping only attributes that occur in at least 15% of a class’s instances (a sketch follows below)
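The refinement step can be pictured with a short sketch. The following Python is purely illustrative, not Kylin's actual code: the function name and data layout are hypothetical stand-ins; only the strict-match grouping and the 15% attribute threshold come from the slide.

```python
# Minimal sketch of schema refinement: group articles by strict template-name
# match, then keep only attributes used in >= 15% of a class's instances.
# Duplicate template/attribute variants like "U.S. County" vs. "US County"
# are a noted source of drift that this simple strict-match scheme does not merge.
from collections import Counter, defaultdict

def refine_schema(instances, min_attr_frac=0.15):
    """instances: iterable of (template_name, {attr: value}) pairs."""
    by_class = defaultdict(list)
    for name, attrs in instances:          # strict name match
        by_class[name].append(attrs)
    schemas = {}
    for name, insts in by_class.items():
        counts = Counter(a for attrs in insts for a in attrs)
        # Drop rarely used attributes.
        schemas[name] = {a for a, c in counts.items()
                         if c / len(insts) >= min_attr_frac}
    return schemas
```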

  13. Preprocessor: Training Dataset Construction
  • Match each infobox attribute’s value against the article’s sentences to label training examples (a sketch follows below), e.g.:
  “Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812. Its county seat is Clearfield. 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water. As of 2005, the population density was 28.2/km².”
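A minimal sketch of how such training sentences might be harvested, assuming the heuristic the talk describes: a sentence containing an infobox attribute's value becomes a positive example for that attribute. `label_sentences` and the data shapes are illustrative, not Kylin's API; note how the heuristic also produces noisy matches, which motivates the bagging and relabeling on the next slides.

```python
# Hedged sketch: label each sentence with the attributes whose infobox
# values it contains (simple substring match, for illustration only).
def label_sentences(article_sentences, infobox):
    examples = []
    for sent in article_sentences:
        for attr, value in infobox.items():
            if value and value in sent:
                examples.append((sent, attr))   # positive training example
    return examples

sents = ["Clearfield County was created in 1804 from parts of Huntingdon "
         "and Lycoming Counties.",
         "Its county seat is Clearfield."]
infobox = {"formed": "1804", "seat": "Clearfield"}
# Note the noisy match: "Clearfield" also appears inside "Clearfield County",
# so the first sentence is wrongly labeled as a "seat" example too.
print(label_sentences(sents, infobox))
```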

  14. Classifier
  • Document classifier: based on lists and categories; fast; precision 98.5%, recall 68.8%
  • Sentence classifier: predicts which attribute values are contained in a given sentence
  • Uses a maximum-entropy model
  • To cope with the noisy, incomplete training dataset, Kylin applies bagging (a sketch follows below)
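A sketch of that sentence classifier in scikit-learn terms, purely illustrative: maximum entropy corresponds to multinomial logistic regression, and `BaggingClassifier` supplies the bagging. The toy sentences and the subsampling settings are assumptions, not Kylin's configuration.

```python
# Illustrative stand-in for the maxent sentence classifier with bagging.
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "Its county seat is Clearfield.", "The county seat is Bedford.",
    "The seat of government is Easton.", "Its seat is Mercer.",
    "2,972 km2 of it is land.", "The county has 1,147 mi2 of land.",
    "17 km2 of it (0.56%) is water.", "About 7 mi2 is water.",
]
labels = ["seat", "seat", "seat", "seat",
          "land_area", "land_area", "water_area", "water_area"]

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    # Subsampling without replacement keeps every estimator multi-class
    # on this tiny toy dataset; bagging dampens label noise.
    BaggingClassifier(LogisticRegression(max_iter=1000),
                      n_estimators=10, max_samples=0.8, bootstrap=False),
)
clf.fit(sentences, labels)
print(clf.predict(["Its county seat is Smithfield."]))
```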

  15. CRF Extractor (Conditional Random Fields model)
  • Attribute-value extraction as sequential data labeling
  • One CRF model per attribute, trained independently (a sketch follows below)
  • Relabeling filters false-negative training examples, e.g. “2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.” Preprocessor label: Water_area. Classifier labels: Water_area, Land_area.
  • Though Kylin is successful on popular classes, its performance decreases on sparse classes where there is insufficient training data.
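A toy sketch of per-attribute sequence labeling, using sklearn-crfsuite as a stand-in for Kylin's CRF implementation; the features, BIO label scheme, and hyperparameters are illustrative assumptions.

```python
# Toy per-attribute CRF: label the tokens that form the water_area value.
import sklearn_crfsuite

def feats(tokens, i):
    w = tokens[i]
    return {"word": w.lower(),
            "is_digit": w.replace(",", "").isdigit(),
            "prev": tokens[i - 1].lower() if i else "<s>"}

sent = "2,972 km2 of it is land and 17 km2 of it is water".split()
X = [[feats(sent, i) for i in range(len(sent))]]
# BIO labels for the water_area attribute only; one CRF per attribute.
y = [["O", "O", "O", "O", "O", "O", "O",
      "B-water", "I-water", "O", "O", "O", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```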

  16. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  17. Long Tail 1: Sparse Infobox Classes
  • Kylin performs well on popular classes: precision from the mid-70% to high-90% range; recall from the low-50% to mid-90% range
  • Kylin flounders on sparse classes with little training data: e.g., on the “US County” class Kylin reaches 97.3% precision and 95.9% recall, while many classes like “Irish Newspaper” contain very few infobox-containing articles

  18. Long Tail 2: Incomplete Articles
  • Desired information is missing from Wikipedia: among 1.8 million pages (Wikipedia, July 2007), many are short articles and almost 800,000 (44.2%) are marked as stub pages, indicating that much needed information is missing.

  19. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  20. Shrinkage
  • Attempt to improve Kylin’s performance using shrinkage
  • Shrinkage is used when training an extractor for an instance-sparse infobox class, by aggregating data from its parent and children classes

  21. Shrinkage [McCallum et al., ICML98]
  [Diagram: infobox class hierarchy, person (1201) > performer (44) > actor (8738) and comedian (106), with the related birth-place attributes (.birth_place, .location, .birthplace, .cityofbirth, .origin) aligned across the classes]

  22. Shrinkage
  • KOG (Kylin Ontology Generator) [Wu & Weld, WWW08]
  [Diagram: the same class hierarchy, person (1201) > performer (44) > actor (8738) and comedian (106), with the attribute mappings (.birth_place, .location, .birthplace, .cityofbirth, .origin) produced by KOG; a sketch of pooling training data across the hierarchy follows below]
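A hedged sketch of the pooling step: training data for a sparse class's attribute is augmented with down-weighted examples from the attributes KOG aligned in its parent and child classes. The mapping table, weights, and data layout here are illustrative, not the paper's actual values.

```python
# Illustrative shrinkage: pool training examples across the ontology.
ATTR_MAP = {  # (class, attr) -> aligned (class, attr) pairs from KOG
    ("performer", "location"): [("person", "birth_place"),
                                ("actor", "birthplace"),
                                ("comedian", "birth_place")],
}

def shrunk_training_set(cls, attr, data, w_own=1.0, w_rel=0.5):
    """data maps (class, attr) -> list of labeled sentences.
    Returns (example, weight) pairs pooled from related classes."""
    pooled = [(ex, w_own) for ex in data.get((cls, attr), [])]
    for rel in ATTR_MAP.get((cls, attr), []):
        pooled += [(ex, w_rel) for ex in data.get(rel, [])]
    return pooled

data = {("performer", "location"): ["Madonna was born in Bay City ..."],
        ("person", "birth_place"): ["Turing was born in Maida Vale ..."]}
print(shrunk_training_set("performer", "location", data))
```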

  23. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  24. Retraining (TextRunner@UW)
  • Complementary to shrinkage: harvest extra training data from the broader Web
  • Key challenge: identifying relevant sentences in the sea of Web data
  • Example: “Andrew Murray was born in Scotland in 1828 …” yields <Andrew Murray, was born in, Scotland> and <Andrew Murray, was born in, 1828>

  25. Retraining
  • Query TextRunner for sentences relevant to a known Kylin tuple, e.g. t = <Ada Cambridge, location, “St Germans, Norfolk, England”> (a sketch of the matching follows below)
  • r1 = <Ada Cambridge, was born in, England>: “Ada Cambridge was born in England in 1844 and moved to Australia with her curate husband in 1870.”
  • r2 = <Ada Cambridge, was born in, “Norfolk, England”>: “Ada Cambridge was born in Norfolk, England, in 1844.”
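A minimal sketch of the matching step under a simple assumed heuristic: accept a TextRunner triple's sentence as extra training data when its subject matches the article title and its object overlaps the known infobox value. The function name and overlap test are our assumptions, not the system's actual logic.

```python
# Illustrative selection of retraining sentences from TextRunner output.
def select_retraining_sentences(known_tuple, web_triples):
    subj, _attr, value = known_tuple
    keep = []
    for (s, _pred, obj), sentence in web_triples:
        # Subject must match; object must overlap the known value.
        if s == subj and (obj in value or value in obj):
            keep.append(sentence)
    return keep

t = ("Ada Cambridge", "location", "St Germans, Norfolk, England")
triples = [(("Ada Cambridge", "was born in", "England"),
            "Ada Cambridge was born in England in 1844."),
           (("Ada Cambridge", "was born in", "Norfolk, England"),
            "Ada Cambridge was born in Norfolk, England, in 1844.")]
print(select_retraining_sentences(t, triples))
```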

  26. Effect of Shrinkage & Retraining

  27. Effect of Shrinkage & Retraining: 1755% improvement for a sparse class; 13.7% improvement for a popular class

  28. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  29. Extraction from the Web
  • Idea: apply Kylin extractors trained on Wikipedia to general Web pages
  • Challenge: maintain high precision; general Web pages are noisy, and many describe multiple objects
  • Key: retrieve relevant sentences
  • Procedure: generate a set of search-engine queries; retrieve the top-k pages from Google; weight the extractions from these pages

  30. Choosing Queries
  • Example: get the birth date attribute for the article titled “Andrew Murray (minister)”
  • “andrew murray”
  • “andrew murray” birth date (the attribute name)
  • “andrew murray” was born in (a predicate from TextRunner)
  • … (a sketch of the generator follows below)
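A sketch of the query generator; `make_queries` and the predicate list are illustrative stand-ins, but the three query forms mirror the slide: the quoted title alone, the title plus the attribute name, and the title plus TextRunner predicates.

```python
# Illustrative generation of web search queries for one attribute.
def make_queries(title, attribute, textrunner_predicates):
    queries = ['"%s"' % title,
               '"%s" %s' % (title, attribute.replace("_", " "))]
    queries += ['"%s" %s' % (title, pred) for pred in textrunner_predicates]
    return queries

print(make_queries("andrew murray", "birth_date", ["was born in"]))
# ['"andrew murray"', '"andrew murray" birth date',
#  '"andrew murray" was born in']
```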

  31. Weighting Extractions
  • Which extractions are more relevant? Features:
  • the number of sentences between the extracted sentence and the closest occurrence of the title (“andrew murray”)
  • the rank of the page on Google’s result list
  • Kylin’s extractor confidence
  • (A sketch of combining these follows below.)
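One plausible way to combine the three features is a weighted linear score, sketched below. The transforms and weights are illustrative assumptions; the actual combination would have been tuned on held-out data.

```python
# Illustrative linear combination of the three relevance features.
import math

def score(sent_dist, google_rank, extractor_conf, w=(0.4, 0.2, 0.4)):
    # Closer to the title mention and higher-ranked pages score better.
    f_dist = 1.0 / (1 + sent_dist)
    f_rank = 1.0 / math.log2(2 + google_rank)
    return w[0] * f_dist + w[1] * f_rank + w[2] * extractor_conf

print(score(sent_dist=0, google_rank=0, extractor_conf=0.9))   # strong
print(score(sent_dist=8, google_rank=40, extractor_conf=0.9))  # weaker
```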

  32. Web Extraction Experiment
  • Extractor confidence alone performs poorly
  • A weighted combination of the features performs best

  33. Combining Wikipedia & Web: recall benefit from shrinkage / retraining …

  34. Combining Wikipedia & Web: benefit from shrinkage + retraining + Web extraction

  35. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  36. Problem
  • Information extraction is imprecise
  • Wikipedians don’t want 90% precision
  • How to improve precision? People!

  37. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  38. Intelligence in Wikipedia
  • What is IWP? A project/system that aims to combine IE (information extraction) with CCC (communal content creation)

  39. Information Extraction
  • Examples: Zoominfo.com, Flipdog.com, Citeseer, Google
  • Advantage: autonomy
  • Disadvantage: expensive

  40. IE System Contributors
  • Contributors in this room?
  • Wikipedia
  • IE systems: Citeseer, Rexa, DBlife

  41. Communal Content Creation
  • Examples: Wikipedia, eBay, Netflix
  • Advantage: more accurate than IE
  • Disadvantage: bootstrapping, incentives, and management

  42. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  43. Virtuous Cycle

  44. Contributing as a Non-Primary Task
  • Encourage contributions without annoying or abusing readers
  • Compared 5 different interfaces

  45. Results
  • Contribution rate: 1.6% → 13%
  • 90% of positive labels were correct

  46. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  47. IWP and Shrinkage, Retraining, and Extracting from the Web
  • Shrinkage improves IWP’s precision and recall
  • Retraining improves the robustness of IWP’s extractors
  • Web extraction further helps IWP’s performance

  48. Multi-Lingual Extraction
  • Idea: further leverage the virtuous feedback cycle
  • Use IE methods to add or update missing information by copying it from one language to another
  • Use CCC to validate and improve the updates
  • Example (Spanish vs. English infoboxes): Nombre = “Jerry Seinfeld” and Name = “Jerry Seinfeld”; Cónyuge = “Jessica Sklar” and Spouse = “Jessica Seinfeld” (the same person under different names; a sketch follows below)
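A toy sketch of the cross-language step: copy a value that is missing from one language's infobox, and flag disagreements (like Sklar vs. Seinfeld) for communal review. The attribute map and conflict logic are illustrative assumptions, not the system's actual alignment method.

```python
# Illustrative cross-language infobox alignment (Spanish -> English).
ES_TO_EN = {"Nombre": "Name", "Cónyuge": "Spouse"}

def align(es_infobox, en_infobox):
    copied, conflicts = {}, []
    for es_attr, en_attr in ES_TO_EN.items():
        es_val = es_infobox.get(es_attr)
        en_val = en_infobox.get(en_attr)
        if es_val and not en_val:
            copied[en_attr] = es_val                     # fill the gap
        elif es_val and en_val and es_val != en_val:
            conflicts.append((en_attr, es_val, en_val))  # ask a human (CCC)
    return copied, conflicts

es = {"Nombre": "Jerry Seinfeld", "Cónyuge": "Jessica Sklar"}
en = {"Name": "Jerry Seinfeld", "Spouse": "Jessica Seinfeld"}
print(align(es, en))
```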

  49. Summary
  • Kylin’s initial performance on sparse classes is unacceptable
  • Methods for increasing recall: shrinkage, retraining, and extraction from the Web
