
Information Extraction from Wikipedia: Moving Down the Long Tail


Presentation Transcript


  1. Information Extraction from Wikipedia: Moving Down the Long Tail Fei Wu, Raphael Hoffmann, Daniel S. Weld Department of Computer Science & Engineering University of Washington Seattle, WA, USA Intelligence in Wikipedia: Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel, Stef Schoenmackers & Michael Skinner

  2. Motivating Vision
  • Next-Generation Search = Information Extraction + Ontology + Inference
  • Example query: Which performing artists were born in Chicago?
  • Evidence scattered across articles: “Bob was born in Northwestern Memorial Hospital.” … “Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago …” … “Bob Black is an active actor who was selected as this year’s …”

  3. Next-Generation Search
  • Information Extraction: <Bob, Born-In, NMH>, <Bob Black, ISA, actor>, <NMH, in Chicago> …
  • Ontology: Actor ISA Performing Artist …
  • Inference: Born-In(A) ^ PartOf(A,B) => Born-In(B) …
  • Source sentences: “Bob was born in Northwestern Memorial Hospital.” … “Bob Black is an active actor who …” … “Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago …”

  4. Wikipedia – Bootstrap for the Web • Goal: search over the Web • Now: search over Wikipedia • Comprehensive • High-quality • (Semi-)Structured data

  5. Infoboxes
  • Infoboxes are designed to present summary information about an article’s subject, so that similar subjects have a uniform look in a common format
  • An infobox is a generalization of a taxobox (from “taxonomy”), which summarizes information for an organism or group of organisms

  6. Infobox examples: basic infobox; taxobox – plant species

  7. More examples: Infobox People – Actor; Infobox – Convention Center

  8. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  9. Kylin: Autonomously Semantifying Wikipedia
  • Totally autonomous, with no additional human effort
  • Forms its training dataset from existing infoboxes
  • Extracts semantic relations from Wikipedia articles
  • Kylin: a mythical hooved Chinese chimerical creature that is said to appear in conjunction with the arrival of a sage. (Wikipedia)

  10. Kylin
  • A prototype self-supervised machine-learning system
  • Looks for classes of pages with similar infoboxes
  • Determines their common attributes
  • Creates training examples from them

  11. Infobox Generation

  12. Preprocessor: Schema Refinement
  • Free editing leads to schema drift:
  • Duplicate templates: U.S. County (1428), US County (574), Counties (50), County (19)
  • Low usage of some attributes
  • Duplicate attributes: “Census Yr”, “Census Estimate Yr”, “Census Est.”, “Census Year”
  • Kylin’s refinement: strict name matching, keeping only attributes that occur in at least 15% of a class’s instances (a sketch follows below)
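The refinement step can be pictured with a short sketch. The following Python is purely illustrative, not Kylin's actual code: the function name and data layout are hypothetical stand-ins; only the strict-match grouping and the 15% attribute threshold come from the slide.

```python
# Minimal sketch of schema refinement: group articles by strict template-name
# match, then keep only attributes used in >= 15% of a class's instances.
# Duplicate template/attribute variants like "U.S. County" vs. "US County"
# are a noted source of drift that this simple strict-match scheme does not merge.
from collections import Counter, defaultdict

def refine_schema(instances, min_attr_frac=0.15):
    """instances: iterable of (template_name, {attr: value}) pairs."""
    by_class = defaultdict(list)
    for name, attrs in instances:          # strict name match
        by_class[name].append(attrs)
    schemas = {}
    for name, insts in by_class.items():
        counts = Counter(a for attrs in insts for a in attrs)
        # Drop rarely used attributes.
        schemas[name] = {a for a, c in counts.items()
                         if c / len(insts) >= min_attr_frac}
    return schemas
```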

  13. Preprocessor: Training Dataset Construction
  • Match each infobox attribute’s value against the article’s sentences to label training examples (a sketch follows below), e.g.:
  “Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812. Its county seat is Clearfield. 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water. As of 2005, the population density was 28.2/km².”
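A minimal sketch of how such training sentences might be harvested, assuming the heuristic the talk describes: a sentence containing an infobox attribute's value becomes a positive example for that attribute. `label_sentences` and the data shapes are illustrative, not Kylin's API; note how the heuristic also produces noisy matches, which motivates the bagging and relabeling on the next slides.

```python
# Hedged sketch: label each sentence with the attributes whose infobox
# values it contains (simple substring match, for illustration only).
def label_sentences(article_sentences, infobox):
    examples = []
    for sent in article_sentences:
        for attr, value in infobox.items():
            if value and value in sent:
                examples.append((sent, attr))   # positive training example
    return examples

sents = ["Clearfield County was created in 1804 from parts of Huntingdon "
         "and Lycoming Counties.",
         "Its county seat is Clearfield."]
infobox = {"formed": "1804", "seat": "Clearfield"}
# Note the noisy match: "Clearfield" also appears inside "Clearfield County",
# so the first sentence is wrongly labeled as a "seat" example too.
print(label_sentences(sents, infobox))
```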

  14. Classifier
  • Document classifier: based on lists and categories; fast; precision 98.5%, recall 68.8%
  • Sentence classifier: predicts which attribute values are contained in a given sentence
  • Uses a maximum-entropy model
  • To cope with the noisy, incomplete training dataset, Kylin applies bagging (a sketch follows below)
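A sketch of that sentence classifier in scikit-learn terms, purely illustrative: maximum entropy corresponds to multinomial logistic regression, and `BaggingClassifier` supplies the bagging. The toy sentences and the subsampling settings are assumptions, not Kylin's configuration.

```python
# Illustrative stand-in for the maxent sentence classifier with bagging.
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "Its county seat is Clearfield.", "The county seat is Bedford.",
    "The seat of government is Easton.", "Its seat is Mercer.",
    "2,972 km2 of it is land.", "The county has 1,147 mi2 of land.",
    "17 km2 of it (0.56%) is water.", "About 7 mi2 is water.",
]
labels = ["seat", "seat", "seat", "seat",
          "land_area", "land_area", "water_area", "water_area"]

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    # Subsampling without replacement keeps every estimator multi-class
    # on this tiny toy dataset; bagging dampens label noise.
    BaggingClassifier(LogisticRegression(max_iter=1000),
                      n_estimators=10, max_samples=0.8, bootstrap=False),
)
clf.fit(sentences, labels)
print(clf.predict(["Its county seat is Smithfield."]))
```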

  15. CRF Extractor (Conditional Random Fields model)
  • Attribute-value extraction as sequential data labeling
  • One CRF model per attribute, trained independently (a sketch follows below)
  • Relabeling filters false-negative training examples, e.g. “2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.” Preprocessor label: Water_area. Classifier labels: Water_area, Land_area.
  • Though Kylin is successful on popular classes, its performance decreases on sparse classes where there is insufficient training data.
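A toy sketch of per-attribute sequence labeling, using sklearn-crfsuite as a stand-in for Kylin's CRF implementation; the features, BIO label scheme, and hyperparameters are illustrative assumptions.

```python
# Toy per-attribute CRF: label the tokens that form the water_area value.
import sklearn_crfsuite

def feats(tokens, i):
    w = tokens[i]
    return {"word": w.lower(),
            "is_digit": w.replace(",", "").isdigit(),
            "prev": tokens[i - 1].lower() if i else "<s>"}

sent = "2,972 km2 of it is land and 17 km2 of it is water".split()
X = [[feats(sent, i) for i in range(len(sent))]]
# BIO labels for the water_area attribute only; one CRF per attribute.
y = [["O", "O", "O", "O", "O", "O", "O",
      "B-water", "I-water", "O", "O", "O", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```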

  16. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  17. Long Tail 1: Sparse Infobox Classes
  • Kylin performs well on popular classes: precision from the mid-70% to high-90% range; recall from the low-50% to mid-90% range
  • Kylin flounders on sparse classes with little training data: e.g., on the “US County” class Kylin reaches 97.3% precision and 95.9% recall, while many classes like “Irish Newspaper” contain very few infobox-containing articles

  18. Long Tail 2: Incomplete Articles
  • Desired information is missing from Wikipedia: among 1.8 million pages (Wikipedia, July 2007), many are short articles and almost 800,000 (44.2%) are marked as stub pages, indicating that much needed information is missing.

  19. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  20. Shrinkage
  • Attempt to improve Kylin’s performance using shrinkage
  • Shrinkage is used when training an extractor for an instance-sparse infobox class, by aggregating data from its parent and children classes

  21. Shrinkage [McCallum et al., ICML98]
  [Diagram: infobox class hierarchy, person (1201) > performer (44) > actor (8738) and comedian (106), with the related birth-place attributes (.birth_place, .location, .birthplace, .cityofbirth, .origin) aligned across the classes]

  22. Shrinkage
  • KOG (Kylin Ontology Generator) [Wu & Weld, WWW08]
  [Diagram: the same class hierarchy, person (1201) > performer (44) > actor (8738) and comedian (106), with the attribute mappings (.birth_place, .location, .birthplace, .cityofbirth, .origin) produced by KOG; a sketch of pooling training data across the hierarchy follows below]
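A hedged sketch of the pooling step: training data for a sparse class's attribute is augmented with down-weighted examples from the attributes KOG aligned in its parent and child classes. The mapping table, weights, and data layout here are illustrative, not the paper's actual values.

```python
# Illustrative shrinkage: pool training examples across the ontology.
ATTR_MAP = {  # (class, attr) -> aligned (class, attr) pairs from KOG
    ("performer", "location"): [("person", "birth_place"),
                                ("actor", "birthplace"),
                                ("comedian", "birth_place")],
}

def shrunk_training_set(cls, attr, data, w_own=1.0, w_rel=0.5):
    """data maps (class, attr) -> list of labeled sentences.
    Returns (example, weight) pairs pooled from related classes."""
    pooled = [(ex, w_own) for ex in data.get((cls, attr), [])]
    for rel in ATTR_MAP.get((cls, attr), []):
        pooled += [(ex, w_rel) for ex in data.get(rel, [])]
    return pooled

data = {("performer", "location"): ["Madonna was born in Bay City ..."],
        ("person", "birth_place"): ["Turing was born in Maida Vale ..."]}
print(shrunk_training_set("performer", "location", data))
```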

  23. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  24. Retraining (TextRunner@UW)
  • Complementary to shrinkage: harvest extra training data from the broader Web
  • Key challenge: identifying relevant sentences in the sea of Web data
  • Example: “Andrew Murray was born in Scotland in 1828 …” yields <Andrew Murray, was born in, Scotland> and <Andrew Murray, was born in, 1828>

  25. Retraining
  • Query TextRunner for sentences relevant to a known Kylin tuple, e.g. t = <Ada Cambridge, location, “St Germans, Norfolk, England”> (a sketch of the matching follows below)
  • r1 = <Ada Cambridge, was born in, England>: “Ada Cambridge was born in England in 1844 and moved to Australia with her curate husband in 1870.”
  • r2 = <Ada Cambridge, was born in, “Norfolk, England”>: “Ada Cambridge was born in Norfolk, England, in 1844.”
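A minimal sketch of the matching step under a simple assumed heuristic: accept a TextRunner triple's sentence as extra training data when its subject matches the article title and its object overlaps the known infobox value. The function name and overlap test are our assumptions, not the system's actual logic.

```python
# Illustrative selection of retraining sentences from TextRunner output.
def select_retraining_sentences(known_tuple, web_triples):
    subj, _attr, value = known_tuple
    keep = []
    for (s, _pred, obj), sentence in web_triples:
        # Subject must match; object must overlap the known value.
        if s == subj and (obj in value or value in obj):
            keep.append(sentence)
    return keep

t = ("Ada Cambridge", "location", "St Germans, Norfolk, England")
triples = [(("Ada Cambridge", "was born in", "England"),
            "Ada Cambridge was born in England in 1844."),
           (("Ada Cambridge", "was born in", "Norfolk, England"),
            "Ada Cambridge was born in Norfolk, England, in 1844.")]
print(select_retraining_sentences(t, triples))
```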

  26. Effect of Shrinkage & Retraining

  27. Effect of Shrinkage & Retraining: 1755% improvement for a sparse class; 13.7% improvement for a popular class

  28. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  29. Extraction from the Web
  • Idea: apply Kylin extractors trained on Wikipedia to general Web pages
  • Challenge: maintain high precision; general Web pages are noisy, and many describe multiple objects
  • Key: retrieve relevant sentences
  • Procedure: generate a set of search-engine queries; retrieve the top-k pages from Google; weight the extractions from these pages

  30. Choosing Queries
  • Example: get the birth date attribute for the article titled “Andrew Murray (minister)”
  • “andrew murray”
  • “andrew murray” birth date (the attribute name)
  • “andrew murray” was born in (a predicate from TextRunner)
  • … (a sketch of the generator follows below)
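A sketch of the query generator; `make_queries` and the predicate list are illustrative stand-ins, but the three query forms mirror the slide: the quoted title alone, the title plus the attribute name, and the title plus TextRunner predicates.

```python
# Illustrative generation of web search queries for one attribute.
def make_queries(title, attribute, textrunner_predicates):
    queries = ['"%s"' % title,
               '"%s" %s' % (title, attribute.replace("_", " "))]
    queries += ['"%s" %s' % (title, pred) for pred in textrunner_predicates]
    return queries

print(make_queries("andrew murray", "birth_date", ["was born in"]))
# ['"andrew murray"', '"andrew murray" birth date',
#  '"andrew murray" was born in']
```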

  31. Weighting Extractions
  • Which extractions are more relevant? Features:
  • the number of sentences between the extracted sentence and the closest occurrence of the title (“andrew murray”)
  • the rank of the page on Google’s result list
  • Kylin’s extractor confidence
  • (A sketch of combining these follows below.)
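One plausible way to combine the three features is a weighted linear score, sketched below. The transforms and weights are illustrative assumptions; the actual combination would have been tuned on held-out data.

```python
# Illustrative linear combination of the three relevance features.
import math

def score(sent_dist, google_rank, extractor_conf, w=(0.4, 0.2, 0.4)):
    # Closer to the title mention and higher-ranked pages score better.
    f_dist = 1.0 / (1 + sent_dist)
    f_rank = 1.0 / math.log2(2 + google_rank)
    return w[0] * f_dist + w[1] * f_rank + w[2] * extractor_conf

print(score(sent_dist=0, google_rank=0, extractor_conf=0.9))   # strong
print(score(sent_dist=8, google_rank=40, extractor_conf=0.9))  # weaker
```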

  32. Web Extraction Experiment
  • Extractor confidence alone performs poorly
  • A weighted combination of the features performs best

  33. Combining Wikipedia & Web: recall benefit from shrinkage / retraining …

  34. Combining Wikipedia & Web: benefit from shrinkage + retraining + Web extraction

  35. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  36. Problem
  • Information extraction is imprecise
  • Wikipedians don’t want 90% precision
  • How to improve precision? People!

  37. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  38. Intelligence in Wikipedia
  • What is IWP? A project/system that aims to combine IE (information extraction) with CCC (communal content creation)

  39. Information Extraction
  • Examples: Zoominfo.com, Flipdog.com, Citeseer, Google
  • Advantage: autonomy
  • Disadvantage: expensive

  40. IE System Contributors
  • Contributors in this room?
  • Wikipedia
  • IE systems: Citeseer, Rexa, DBlife

  41. Communal Content Creation
  • Examples: Wikipedia, eBay, Netflix
  • Advantage: more accurate than IE
  • Disadvantage: bootstrapping, incentives, and management

  42. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  43. Virtuous Cycle

  44. Contributing as a Non-Primary Task
  • Encourage contributions without annoying or abusing readers
  • Compared 5 different interfaces

  45. Results
  • Contribution rate: 1.6% → 13%
  • 90% of positive labels were correct

  46. Outline • Background: Kylin Extraction • Long-Tail Challenges • Sparse infobox classes • Incomplete articles • Moving Down the Long Tail • Shrinkage • Retraining • Extracting from the Web • Problem with Information Extraction • IWP (Intelligence in Wikipedia) • CCC and IE • Virtuous Cycle • IWP (Shrinkage, Retraining and Extracting from the Web) • Multilingual Extraction • Summary

  47. IWP and Shrinkage, Retraining, and Extracting from the Web
  • Shrinkage improves IWP’s precision and recall
  • Retraining improves the robustness of IWP’s extractors
  • Web extraction further helps IWP’s performance

  48. Multi-Lingual Extraction
  • Idea: further leverage the virtuous feedback cycle
  • Use IE methods to add or update missing information by copying it from one language to another
  • Use CCC to validate and improve the updates
  • Example (Spanish vs. English infoboxes): Nombre = “Jerry Seinfeld” and Name = “Jerry Seinfeld”; Cónyuge = “Jessica Sklar” and Spouse = “Jessica Seinfeld” (the same person under different names; a sketch follows below)
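A toy sketch of the cross-language step: copy a value that is missing from one language's infobox, and flag disagreements (like Sklar vs. Seinfeld) for communal review. The attribute map and conflict logic are illustrative assumptions, not the system's actual alignment method.

```python
# Illustrative cross-language infobox alignment (Spanish -> English).
ES_TO_EN = {"Nombre": "Name", "Cónyuge": "Spouse"}

def align(es_infobox, en_infobox):
    copied, conflicts = {}, []
    for es_attr, en_attr in ES_TO_EN.items():
        es_val = es_infobox.get(es_attr)
        en_val = en_infobox.get(en_attr)
        if es_val and not en_val:
            copied[en_attr] = es_val                     # fill the gap
        elif es_val and en_val and es_val != en_val:
            conflicts.append((en_attr, es_val, en_val))  # ask a human (CCC)
    return copied, conflicts

es = {"Nombre": "Jerry Seinfeld", "Cónyuge": "Jessica Sklar"}
en = {"Name": "Jerry Seinfeld", "Spouse": "Jessica Seinfeld"}
print(align(es, en))
```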

  49. Summary
  • Kylin’s initial performance on sparse classes is unacceptable
  • Methods for increasing recall: shrinkage, retraining, and extraction from the Web
