This document explores advanced methods in information extraction (IE), focusing on bootstrapping techniques that utilize unlabeled data. It outlines various strategies, including semi-supervised and meta-bootstrapping algorithms, to enhance the extraction of named entities and relationships from unrestricted text. The study discusses commercial applications such as resume processing and job postings, highlighting the effectiveness and challenges of supervised and unsupervised learning in diverse domain applications. It emphasizes the importance of context and redundancy in achieving accurate information extraction.
Bootstrapping Information Extraction with Unlabeled Data
Rayid Ghani, Accenture Technology Labs
Rosie Jones, Carnegie Mellon University & Overture
(With contributions from Tom Mitchell and Ellen Riloff)
What is Information Extraction?
• Analyze unrestricted text in order to extract pre-specified types of events, entities, or relationships
• Recent Commercial Applications
  • Database of job postings extracted from corporate web pages (flipdog.com)
  • Extracting specific fields from resumes to populate HR databases (mohomine.com)
  • Information integration (fetch.com)
  • Shopping portals
IE Approaches
• Hand-Constructed Rules
• Supervised Learning
  • Still costly to train and port to new domains
    • 3-6 months to port to a new domain (Cardie 98)
    • 20,000 words to learn named entity extraction (Seymore et al 99)
    • 7,000 labeled examples to learn MUC extraction rules (Soderland 99)
• Semi-Supervised Learning
Semi-Supervised Approaches
• Several algorithms proposed for different tasks (semantic tagging, text categorization) and tested on different corpora
  • Expectation-Maximization, Co-Training, CoBoost, Meta-Bootstrapping, Co-EM, etc.
• Goal: systematically analyze and test
  • the assumptions underlying the algorithms
  • the effectiveness of the algorithms on a common set of problems and a common corpus
Tasks
• Extract noun phrases belonging to the following semantic classes:
  • Locations
  • Organizations
  • People
Aren’t you missing the obvious?
• Acquire lists of proper nouns
  • Locations: countries, states, cities
  • Organizations: online databases
  • People: names
• Named entity extraction?
  • But not all instances are proper nouns
  • *by the river*, *customer*, *client*
Use context to disambiguate
• A lot of NPs are unambiguous
  • “The corporation”
• A lot of contexts are also unambiguous
  • Subsidiary of <NP>
• But as always, there are exceptions… and a LOT of them in this case
  • customer, John Hancock, Washington
Bootstrapping Approaches
• Utilize redundancy in text
  • Noun phrases: New York, China, place we met last time
  • Contexts: Located in <X>, Traveled to <X>
• Learn two models
  • Use NPs to label contexts
  • Use contexts to label NPs
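The two redundant views can be made concrete with a small sketch (a hypothetical helper, not from the original system): each occurrence of a noun phrase yields the NP itself and its surrounding context, with the NP slot replaced by <X>.

```python
def make_views(sentence: str, np: str) -> tuple[str, str]:
    """Split one NP occurrence into its two views: the NP string and
    its context pattern, with the NP slot replaced by <X>."""
    context = sentence.replace(np, "<X>", 1)
    return np, context

# "traveled to australia" yields the NP "australia" and the context
# "traveled to <X>"; each view can then be used to label the other.
np, ctx = make_views("traveled to australia", "australia")
```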
Interesting Dimensions for Bootstrapping Algorithms
• Incremental vs. Iterative
• Symmetric vs. Asymmetric
• Probabilistic vs. Heuristic
Algorithms for Bootstrapping
• Meta-Bootstrapping (Riloff & Jones, 1999)
  • Incremental, Asymmetric, Heuristic
• Co-Training (Blum & Mitchell, 1998)
  • Incremental, Symmetric, Probabilistic(?)
• Co-EM (Nigam & Ghani, 2000)
  • Iterative, Symmetric, Probabilistic
• Baselines
  • Seed-labeling: label all NPs that match the seeds
  • Head-labeling: label all NPs whose head matches the seeds
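The two baselines can be sketched in a few lines (a simplified illustration; names and the exact matching rules are assumptions, with the head of an NP approximated as its last word):

```python
def seed_label(nps, seeds):
    """Seed-labeling baseline: label only NPs whose full string matches a seed."""
    return {np for np in nps if np.lower() in seeds}

def head_label(nps, seeds):
    """Head-labeling baseline: label NPs whose head noun (approximated here
    as the last word) matches a seed."""
    return {np for np in nps if np.lower().split()[-1] in seeds}

seeds = {"china", "france"}
nps = ["France", "southern China", "the dog"]
# seed_label finds only "France"; head_label also picks up "southern China".
```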
Data Set
• ~4,200 corporate web pages (WebKB project at CMU)
• Test data marked up manually by labeling every NP as one or more of the following semantic categories: location, organization, person, none
• Preprocessed (parsed) to generate NPs and extraction patterns using AutoSlog (Riloff, 1996)
Seeds
• Location: australia, canada, china, england, france, germany, united states, switzerland, mexico, japan
• People: customer, customers, subscriber, people, users, shareholders, individuals, clients, leader, director
• Organizations: inc, praxair, company, companies, marine group, xerox, arco, timberlands, puretec, halter, rayonier
Intuition Behind Bootstrapping

Noun Phrases          Contexts
the dog               <X> ran away
australia             travelled to <X>
france                <X> is beautiful
the canary islands
Co-Training (Blum & Mitchell, 1998)
• Incremental, symmetric, probabilistic
• Initialize with positive and negative NP seeds
• Use NPs to label all contexts
• Add the n top-scoring contexts for both the positive and negative class
• Use the new contexts to label all NPs
• Add the n top-scoring NPs for both the positive and negative class
• Loop
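The loop above can be sketched as follows. This is a deliberately simplified, single-class sketch: it scores candidates by raw co-occurrence counts rather than by the trained classifiers of the original Co-Training paper, and omits the negative class.

```python
from collections import Counter

def cotrain_step(pairs, labeled_nps, labeled_ctxs, n=1):
    """One incremental round: score unlabeled contexts by how many
    labeled-NP occurrences they appear with, keep the top n, then grow
    the NP set the same way using the enlarged context set."""
    ctx_scores = Counter(c for np, c in pairs
                         if np in labeled_nps and c not in labeled_ctxs)
    labeled_ctxs = labeled_ctxs | {c for c, _ in ctx_scores.most_common(n)}
    np_scores = Counter(np for np, c in pairs
                        if c in labeled_ctxs and np not in labeled_nps)
    labeled_nps = labeled_nps | {np for np, _ in np_scores.most_common(n)}
    return labeled_nps, labeled_ctxs

pairs = [("australia", "travelled to <X>"), ("france", "travelled to <X>"),
         ("france", "<X> is beautiful"), ("the dog", "<X> ran away")]
nps, ctxs = cotrain_step(pairs, {"australia"}, set())
# "travelled to <X>" is learned from the seed "australia",
# and in turn labels "france"; "the dog" is left unlabeled.
```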
Co-EM (Nigam & Ghani, 2000)
• Iterative, Symmetric, Probabilistic
• Similar to Co-Training
• Probabilistically labels and adds all NPs and contexts to the labeled set
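The key difference from Co-Training is that nothing is hard-labeled: every NP and every context carries a class probability at every iteration. A minimal sketch of one pass (the averaging scheme and the 0.5 prior are simplifying assumptions, not the exact estimator of the original paper):

```python
from collections import defaultdict

def coem_pass(pairs, np_probs, seeds):
    """One co-EM iteration: estimate P(class | context) as the average
    P(class | NP) over co-occurring NPs, then re-estimate P(class | NP)
    the same way. All items keep soft labels; seed NPs stay clamped at 1."""
    by_ctx = defaultdict(list)
    for np, c in pairs:
        by_ctx[c].append(np_probs.get(np, 0.5))   # 0.5 = uninformed prior
    ctx_probs = {c: sum(v) / len(v) for c, v in by_ctx.items()}
    by_np = defaultdict(list)
    for np, c in pairs:
        by_np[np].append(ctx_probs[c])
    new_np = {np: sum(v) / len(v) for np, v in by_np.items()}
    new_np.update({s: 1.0 for s in seeds})        # keep seeds fixed
    return new_np, ctx_probs

pairs = [("australia", "travelled to <X>"), ("france", "travelled to <X>"),
         ("france", "<X> is beautiful"), ("the dog", "<X> ran away")]
np_probs, ctx_probs = coem_pass(pairs, {"australia": 1.0}, {"australia"})
# "france" ends up more probable than "the dog" because it shares a
# context with the seed, but neither gets a hard 0/1 label.
```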
Meta-Bootstrapping (Riloff & Jones, 1999)
• Incremental, Asymmetric, Heuristic
• Two-level process
  • NPs are used to score contexts according to co-occurrence frequency and diversity
  • After the first level, all contexts are discarded and only the best NPs are retained
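The frequency-and-diversity scoring of contexts can be sketched with an RlogF-style metric in the spirit of this line of work (the exact formula used in the system may differ; this is an illustrative assumption): a context scores high when many (F) of the N distinct NPs it extracts are already known category members.

```python
import math

def rlogf(extracted_nps, lexicon):
    """RlogF-style score for one context: F = distinct known category
    members among its extractions, N = distinct extractions overall,
    score = (F / N) * log2(F)."""
    nps = set(extracted_nps)
    f = len(nps & lexicon)
    if f == 0:
        return 0.0
    return (f / len(nps)) * math.log2(f)

lexicon = {"australia", "france", "china"}
# A context extracting two known locations outscores one extracting
# a single known location alongside an unknown NP.
```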
Common Assumptions
• Seeds
• Seed density in the corpus
• Head-labeling accuracy
• Syntactic-semantic agreement
• Redundancy
• Feature sets are redundant and sufficient
• Labeling disagreement
Feature Set Ambiguity
• Feature sets: NPs and contexts
• If the feature sets were redundantly sufficient, either of them alone would be enough to correctly classify the instance
• Calculate the ambiguity for each feature set
  • Washington, Went to <<X>>, Visit <<X>>
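One straightforward way to operationalize this measurement (an illustrative sketch; the deck does not specify the exact formula) is the fraction of distinct feature values that occur with more than one semantic class in the hand-labeled data:

```python
def feature_ambiguity(labeled):
    """Fraction of distinct feature values (NPs or contexts) that are
    observed with more than one semantic class in the labeled data."""
    classes = {}
    for feat, cls in labeled:
        classes.setdefault(feat, set()).add(cls)
    return sum(len(s) > 1 for s in classes.values()) / len(classes)

labeled = [("washington", "location"), ("washington", "person"),
           ("france", "location"), ("the dog", "none")]
# "washington" carries two classes, so 1 of 3 distinct NPs is ambiguous.
```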
NP Ambiguity: 2%
Labeling Disagreement
• Agreement among human labelers
• Same set of instances but different levels of information:
  • NP only
  • Context only
  • NP and context
  • NP, context, and the entire sentence from the corpus
Labeling Disagreement
• 90.5% agreement when NP, context, and sentence are given
• 88.5% when the sentence is not given
Results Comparing Bootstrapping Algorithms
• Meta-Bootstrapping, Co-Training, Co-EM
• Locations, Organizations, People
[Charts: results for Co-EM, MetaBoot, and Co-Training on each of the three classes]
More Results
• Bootstrapping outperforms both baselines
• The improvement is less pronounced for the “people” class
• Ambiguous classes don’t benefit as much from bootstrapping?
Why does Co-EM work well?
• Co-EM outperforms Meta-Bootstrapping & Co-Training
• Co-EM is probabilistic and does not make hard classifications
  • Reflective of the ambiguity among classes
Summary
• Starting with 10 seed words, extract NPs matching specific semantic classes using Meta-Bootstrapping, Co-Training, and Co-EM
• Probabilistic bootstrapping with redundant feature sets is effective – even for ambiguous classes
• Co-EM performs robustly even when the underlying assumptions are violated
Ongoing Work
• Varying initial seed size and type
• Collecting the training corpus automatically (from the Web)
• Incorporating the user in the loop (active learning)