Bootstrapping Information Extraction with Unlabeled Data

  1. Bootstrapping Information Extraction with Unlabeled Data • Rayid Ghani, Accenture Technology Labs • Rosie Jones, Carnegie Mellon University & Overture • (With contributions from Tom Mitchell and Ellen Riloff)

  2. What is Information Extraction? • Analyze unrestricted text in order to extract pre-specified types of events, entities or relationships • Recent Commercial Applications • Database of Job Postings extracted from corporate web pages (flipdog.com) • Extracting specific fields from resumes to populate HR databases (mohomine.com) • Information Integration (fetch.com) • Shopping Portals

  3. IE Approaches • Hand-Constructed Rules • Supervised Learning • Still costly to train and port to new domains • 3-6 months to port to new domain (Cardie 98) • 20,000 words to learn named entity extraction (Seymore et al 99) • 7000 labeled examples to learn MUC extraction rules (Soderland 99) • Semi-Supervised Learning

  4. Semi-Supervised Approaches • Several algorithms proposed for different tasks (semantic tagging, text categorization) and tested on different corpora • Expectation-Maximization, Co-Training, CoBoost, Meta-Bootstrapping, Co-EM, etc. • Goal: • Systematically analyze and test • The Assumptions underlying the algorithms • The Effectiveness of the algorithms on a common set of problems and corpus

  5. Tasks • Extract Noun Phrases belonging to the following semantic classes • Locations • Organizations • People

  6. Aren’t you missing the obvious? • Acquire lists of proper nouns • Locations: countries, states, cities • Organizations: online databases • People: names • Named Entity Extraction? • But not all instances are proper nouns • *by the river*, *customer*, *client*

  7. Use context to disambiguate • A lot of NPs are unambiguous • “The corporation” • A lot of contexts are also unambiguous • Subsidiary of <NP> • But as always, there are exceptions… and a LOT of them in this case • customer, John Hancock, Washington

  8. Bootstrapping Approaches • Utilize Redundancy in Text • Noun-Phrases • New York, China, place we met last time • Contexts • Located in <X>, Traveled to <X> • Learn two models • Use NPs to label Contexts • Use Contexts to label NPs
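
A minimal sketch of the data structures behind this two-view setup (the toy pairs and variable names are illustrative, not from the talk's corpus): every extracted mention is kept as an (NP, context) pair, so counting which contexts each NP fills, and which NPs fill each context, is what lets one view label the other.

```python
from collections import defaultdict

# Toy (NP, context) pairs of the kind a pattern extractor would emit;
# "<X>" marks the slot the noun phrase filled. Examples are illustrative.
pairs = [
    ("new york",               "located in <X>"),
    ("china",                  "located in <X>"),
    ("china",                  "traveled to <X>"),
    ("place we met last time", "traveled to <X>"),
]

# Two redundant views of the same mentions.
contexts_of_np = defaultdict(lambda: defaultdict(int))  # NP  -> context -> count
nps_of_context = defaultdict(lambda: defaultdict(int))  # ctx -> NP      -> count

for np_phrase, ctx in pairs:
    contexts_of_np[np_phrase][ctx] += 1
    nps_of_context[ctx][np_phrase] += 1

# If "new york" is a seed location, every context it fills becomes weak
# evidence for the location class, and those contexts in turn implicate
# other NPs such as "china" -- the redundancy bootstrapping exploits.
print(dict(contexts_of_np["new york"]))        # {'located in <X>': 1}
print(dict(nps_of_context["located in <X>"]))  # {'new york': 1, 'china': 1}
```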

  9. Interesting Dimensions for Bootstrapping Algorithms • Incremental vs. Iterative • Symmetric vs. Asymmetric • Probabilistic vs. Heuristic

  10. Algorithms for Bootstrapping • Meta-Bootstrapping (Riloff & Jones, 1999) • Incremental, Asymmetric, Heuristic • Co-Training (Blum & Mitchell, 1998) • Incremental, Symmetric, Probabilistic(?) • Co-EM (Nigam & Ghani, 2000) • Iterative, Symmetric, Probabilistic • Baselines • Seed-Labeling: label all NPs that match the seeds • Head-Labeling: label all NPs whose head matches the seeds
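
The two baselines are simple enough to sketch directly, using the location seeds from slide 12. Note that the helper below approximates an NP's head as its last token, which is an assumption of this sketch rather than the talk's actual preprocessing.

```python
LOCATION_SEEDS = {"australia", "canada", "china", "england", "france",
                  "germany", "united states", "switzerland", "mexico", "japan"}

def seed_label(np_phrase, seeds):
    """Seed-Labeling baseline: the whole NP must match a seed."""
    return np_phrase.lower() in seeds

def head_label(np_phrase, seeds):
    """Head-Labeling baseline: only the NP's head must match a seed.
    The head is approximated here as the phrase's last token."""
    return np_phrase.lower().split()[-1] in seeds

print(seed_label("northern china", LOCATION_SEEDS))  # False: no exact match
print(head_label("northern china", LOCATION_SEEDS))  # True: head "china" is a seed
```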

  11. Data Set • ~4200 corporate web pages (WebKB project at CMU) • Test data marked up manually by labeling every NP as one or more of the following semantic categories: • location, organization, person, none • Preprocessed (parsed) to generate NPs and extraction patterns using AutoSlog (Riloff, 1996)

  12. Seeds • Location: australia, canada, china, england, france, germany, united states, switzerland, mexico, japan • People: customer, customers, subscriber, people, users, shareholders, individuals, clients, leader, director • Organizations: inc, praxair, company, companies, marine group, xerox, arco, timberlands, puretec, halter, ravonier

  13. Intuition Behind Bootstrapping • Noun Phrases: the dog, australia, france, the canary islands • Contexts: <X> ran away, travelled to <X>, <X> is beautiful
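
A toy illustration of this intuition (a sketch: the NP/context pairings below are invented for the example, and no scoring or thresholding is shown). Seed NPs promote the contexts they fill, and those contexts in turn promote new NPs.

```python
# The examples above as (NP, context) pairs, plus two location seeds.
pairs = [
    ("the dog",            "<X> ran away"),
    ("australia",          "travelled to <X>"),
    ("france",             "travelled to <X>"),
    ("france",             "<X> is beautiful"),
    ("the canary islands", "<X> is beautiful"),
]
seeds = {"australia", "france"}

# Step 1: contexts filled by seed NPs become candidate location contexts.
location_contexts = {ctx for np_phrase, ctx in pairs if np_phrase in seeds}

# Step 2: NPs that fill those contexts become candidate locations themselves.
new_locations = {np_phrase for np_phrase, ctx in pairs
                 if ctx in location_contexts} - seeds

print(location_contexts)  # {'travelled to <X>', '<X> is beautiful'}
print(new_locations)      # {'the canary islands'}
```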

  14. Co-Training (Blum & Mitchell, 98) • Incremental, symmetric, probabilistic • Initialize with positive and negative NP seeds • Use NPs to label all contexts • Add the n top-scoring contexts for both the positive and negative class • Use the new contexts to label all NPs • Add the n top-scoring NPs for both the positive and negative class • Loop
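
A schematic version of this incremental loop. The scoring is a placeholder (an item's score is the mean label of its already-labeled partners in the other view); the authors' actual classifiers and scoring differ.

```python
from collections import defaultdict

def cotrain(pairs, pos_seeds, neg_seeds, n=1, iterations=3):
    """Schematic co-training loop (a sketch, not the talk's exact method).

    Each round: the labeled NPs score all unlabeled contexts and the n most
    positive / n most negative contexts are added; then the labeled contexts
    do the same for NPs. Labels are 1.0 (positive) or 0.0 (negative).
    """
    ctxs_of, nps_of = defaultdict(set), defaultdict(set)
    for np_phrase, ctx in pairs:
        ctxs_of[np_phrase].add(ctx)
        nps_of[ctx].add(np_phrase)

    np_labels = {s: 1.0 for s in pos_seeds}
    np_labels.update({s: 0.0 for s in neg_seeds})
    ctx_labels = {}

    def grow(target, partners, source):
        # Placeholder score: mean label of an item's labeled partners.
        scores = {}
        for item, ps in partners.items():
            known = [source[p] for p in ps if p in source]
            if item not in target and known:
                scores[item] = sum(known) / len(known)
        ranked = sorted(scores, key=scores.get)
        for item in ranked[-n:]:            # n highest-scoring -> positive
            target[item] = 1.0
        for item in ranked[:n]:             # n lowest-scoring -> negative
            target.setdefault(item, 0.0)

    for _ in range(iterations):
        grow(ctx_labels, nps_of, np_labels)   # use NPs to label contexts
        grow(np_labels, ctxs_of, ctx_labels)  # use contexts to label NPs
    return np_labels, ctx_labels
```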

  15. Co-EM (Nigam & Ghani, 2000) • Iterative, Symmetric, Probabilistic • Similar to Co-Training • Probabilistically labels and adds all NPs and contexts to the labeled set
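
A sketch of how the co-EM variant differs from the loop above: instead of committing only the top n items, every context and every non-seed NP receives a fresh soft label on each round. The weighted averaging used here is a simplification, not the exact model.

```python
from collections import defaultdict

def co_em(pairs, seed_prob, iterations=10):
    """Schematic co-EM pass (a simplified sketch).

    A context's probability is the co-occurrence-weighted average of its
    NPs' probabilities, and vice versa. Seed NPs keep their given
    probabilities; all other NPs start at 0.5.
    """
    count = defaultdict(int)
    for np_phrase, ctx in pairs:
        count[(np_phrase, ctx)] += 1
    all_nps = {np_phrase for np_phrase, _ in pairs}
    all_ctxs = {ctx for _, ctx in pairs}

    np_prob = {p: seed_prob.get(p, 0.5) for p in all_nps}
    ctx_prob = {}

    for _ in range(iterations):
        for ctx in all_ctxs:   # relabel every context from the NP view
            w = [(count[(p, ctx)], np_prob[p]) for p in all_nps if (p, ctx) in count]
            ctx_prob[ctx] = sum(c * v for c, v in w) / sum(c for c, _ in w)
        for p in all_nps:      # relabel every non-seed NP from the context view
            if p in seed_prob:
                continue
            w = [(count[(p, ctx)], ctx_prob[ctx]) for ctx in all_ctxs if (p, ctx) in count]
            np_prob[p] = sum(c * v for c, v in w) / sum(c for c, _ in w)
    return np_prob, ctx_prob
```

Because nothing is ever committed irrevocably, an ambiguous NP can keep an intermediate probability instead of being forced into one class early; slide 28 credits this soft labeling for co-EM's robustness.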

  16. Meta-Bootstrapping (Riloff & Jones, 99) • Incremental, Asymmetric, Heuristic • Two-level process • NPs are used to score contexts according to co-occurrence frequency and diversity • After the first level, all contexts are discarded and only the best NPs are retained
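
A sketch of the kind of frequency-and-diversity scoring described here, modeled on an RlogF-style metric; the exact formula and the second level (retaining only the best NPs) are simplified away.

```python
import math
from collections import defaultdict

def score_contexts(pairs, category_nps):
    """Score contexts by how many distinct category NPs they extract.

    Modeled on an RlogF-style metric: score = (F / N) * log2(F), where
    F = number of distinct category NPs the context extracted and
    N = number of distinct NPs it extracted overall.
    `category_nps` is a set of NPs currently believed to be in the category.
    """
    extracted = defaultdict(set)
    for np_phrase, ctx in pairs:
        extracted[ctx].add(np_phrase)

    scores = {}
    for ctx, nps in extracted.items():
        f = len(nps & category_nps)
        n = len(nps)
        scores[ctx] = (f / n) * math.log2(f) if f > 0 else 0.0
    return scores
```

Note that a context whose only match is a single category NP scores zero here, which reflects the preference for diverse evidence that the slide mentions.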

  17. Common Assumptions • Seeds • Seed Density in the corpus • Head-labeling Accuracy • Syntactic-Semantic Agreement • Redundancy • Feature Sets are redundant and sufficient • Labeling disagreement

  18. Feature Set Ambiguity • Feature Sets: NPs and Contexts • If the feature sets were redundantly sufficient, either one alone would be enough to correctly classify an instance • Calculate the ambiguity for each feature set • Example: Washington, Went to <X>, Visit <X>
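
One way to make the measure reported on the next two slides concrete (a sketch; the talk's exact counting may differ): a feature value is ambiguous if it co-occurs with instances of more than one semantic class.

```python
from collections import defaultdict

def view_ambiguity(instances, view):
    """Fraction of distinct feature values in one view that co-occur with
    more than one semantic class.

    instances : iterable of (np, context, class_label) triples
    view      : 0 for the NP view, 1 for the context view
    """
    classes_seen = defaultdict(set)
    for triple in instances:
        classes_seen[triple[view]].add(triple[2])
    ambiguous = sum(1 for cls in classes_seen.values() if len(cls) > 1)
    return ambiguous / len(classes_seen)

# Tiny hand-made example (hypothetical labels, echoing the slide's examples):
data = [
    ("washington",  "visit <X>",   "location"),
    ("washington",  "<X> retired", "person"),
    ("china",       "visit <X>",   "location"),
    ("our clients", "visit <X>",   "person"),
]
print(view_ambiguity(data, view=0))  # 1/3: only "washington" is ambiguous
print(view_ambiguity(data, view=1))  # 1/2: "visit <X>" extracts two classes
```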

  19. NP Ambiguity: 2%

  20. Context Ambiguity: 36%

  21. Labeling Disagreement • Agreement among human labelers • Same set of instances but different levels of information • NP only • Context Only • NP and Context • NP, Context and the entire sentence from the corpus

  22. Labeling Disagreement • 90.5% agreement when NP, context and sentence are given • 88.5% when sentence is not given
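
The figures above read as simple percent agreement between annotators; a minimal sketch of that computation (an assumption about how the numbers were derived, not a description of the study's protocol):

```python
def percent_agreement(label_pairs):
    """Percent of instances on which two annotators assign the same label.
    label_pairs: iterable of (label_from_annotator_a, label_from_annotator_b)."""
    labeled = list(label_pairs)
    agree = sum(1 for a, b in labeled if a == b)
    return 100.0 * agree / len(labeled)

# Hypothetical example, not the study's data:
print(percent_agreement([("location", "location"),
                         ("person", "person"),
                         ("person", "organization"),
                         ("none", "none")]))  # 75.0
```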

  23. Results Comparing Bootstrapping Algorithms • Meta-Bootstrapping, Co-Training, Co-EM • Locations, Organizations, People

  24.–26. [Results charts comparing Co-EM, Co-Training, and Meta-Bootstrapping on the three classes]

  27. More Results • Bootstrapping outperforms both baselines • Improvement is less pronounced for “people” class • Ambiguous classes don’t benefit as much from bootstrapping?

  28. Why does Co-EM work well? • Co-EM outperforms Meta-Bootstrapping & Co-Training • Co-EM is probabilistic and does not make hard classifications • Its soft labels reflect the ambiguity among the classes

  29. Summary • Starting with 10 seed words, extract NPs matching specific semantic classes using MetaBootstrapping, Co-Training, Co-EM • Probabilistic Bootstrapping with redundant feature sets is effective – even for ambiguous classes • Co-EM performs robustly even when the underlying assumptions are violated

  30. Ongoing Work • Varying initial seed size and type • Collecting Training Corpus automatically (from the Web) • Incorporating the user in the loop (Active Learning)
