Suggestions to Improve the Flexibility and Adaptivity of Information Extraction

Presentation Transcript


  1. Suggestions to Improve the Flexibility and Adaptivity of Information Extraction
     Irene M. Cramer
     Supervisor: Prof. Dr. D. Klakow
     Lehrstuhl für Sprachsignalverarbeitung, Saarland University
     IGK Colloquium - Winter 04/05

  2. Outline
     • Information Extraction
     • Some comments on IE
     • An IE example system: FASTUS
     • The challenge
     • Answers to the challenge
     • The possible method
     • Some case studies
     • Dissertation roadmap

  3. Information Extraction
     • Problem:
       • Huge amount of textual information available
       • Who is able to read and analyze it?

  4. Information Extraction
     Solution: IE
     • Find relevant information
     • Analyze relevant information automatically
     • Structure relevant information

  5. Information Extraction
     • Input: specification of relevant information → templates and documents
     • Output: set of instantiated templates → e.g. stored in a database

  6. Information Extraction
     • Evaluation: precision/recall, F-measure
     • Applications:
       • Text Classification
       • Text Mining
       • Text Summarization
       • Question Answering
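
The evaluation measures named on this slide can be illustrated with a small sketch. It assumes that extracted items and the gold standard are both available as sets of (instance, class) pairs; the function name `precision_recall_f` and the sample data are hypothetical and not taken from the talk.

```python
def precision_recall_f(extracted, gold, beta=1.0):
    """Return (precision, recall, F_beta) for extracted items against a gold set."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    # F_beta is the weighted harmonic mean of precision and recall.
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical toy data: four gold cities, three extracted items, one of them wrong.
gold = {("Berlin", "city"), ("Paris", "city"), ("Hamburg", "city"), ("München", "city")}
extracted = {("Berlin", "city"), ("Paris", "city"), ("Europa", "city")}

p, r, f1 = precision_recall_f(extracted, gold)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

With `beta=1.0` the function computes the balanced F1; a larger `beta` weights recall more heavily.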

  7. IE example system: FASTUS
     • FASTUS (= Finite State Automaton-based Text Understanding System)
     • MUC IE system
     • Extraction of information from unstructured text
     • No real text understanding!

  8. IE example system: FASTUS
     • Series of cascaded finite-state automata
     • Basically, 3 steps:
       • Recognize phrases
         • Complex words (multi-words, proper names)
         • Simple phrases
         • Complex phrases
       • Recognize patterns
       • Merge bits of information found
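
The three steps above can be mimicked in a few lines. This is a toy sketch using regular expressions as stand-ins for finite-state automata; it is not FASTUS itself, and the name heuristic, the two patterns, and the template slots (`person`, `position`, `organization`, `predecessor`) are all assumptions made for illustration.

```python
import re

# Step 1 (phrases): mark runs of capitalized words as one complex word.
# Assumption: a crude heuristic, not a real proper-name recognizer.
NAME = re.compile(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)+)\b")

# Step 2 (patterns): hypothetical patterns that run on the output of step 1.
PATTERNS = [
    re.compile(r"\[NAME (?P<person>[^\]]+)\] was appointed (?P<position>\w+)"
               r" of \[NAME (?P<organization>[^\]]+)\]"),
    re.compile(r"\[NAME (?P<person>[^\]]+)\] succeeds \[NAME (?P<predecessor>[^\]]+)\]"),
]


def mark_names(text):
    """Rewrite each multi-word name as a single bracketed unit."""
    return NAME.sub(lambda m: "[NAME " + m.group(1) + "]", text)


def match_patterns(marked_text):
    """Return one partial template (a dict) per pattern match."""
    return [m.groupdict() for rx in PATTERNS for m in rx.finditer(marked_text)]


def merge(partials):
    """Step 3 (merge): combine partial templates that describe the same person."""
    merged = {}
    for partial in partials:
        merged.setdefault(partial["person"], {}).update(partial)
    return list(merged.values())


text = ("John Smith was appointed president of Acme Motors. "
        "John Smith succeeds Mary Jones.")
templates = merge(match_patterns(mark_names(text)))
print(templates)
```

Each stage only sees the output of the previous one, which is the point of the cascade: the pattern stage never has to deal with raw multi-word names.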

  9. Information Extraction
     • Limitation:
       • Someone has to build the templates, which is time-consuming.
       • Thus, the templates are normally static.
       • What about adaptation to a new domain?

  10. The Challenge
     • To be more flexible (and to support open-domain QA):
       • Have many more patterns than in a typical IE system
       • Base work on an (already existing) QA ontology
       • Learn the patterns automatically!?

  11. The Method – Constraints
     • We are looking for common entities (such as MUC named entities) …
     • … and also for exceptional ones (book titles, sports, occupations, etc.)
     • No annotated corpora
     • No hand-crafted rules
     • Thus, we will have to start with almost nothing → unsupervised or semi-supervised learning

  12. The Method – Bootstrapping
     • “… a process where a simple system activates a more complicated system …” (http://en.wikipedia.org)
     • “… a complex system emerges by starting simply and, bit by bit, developing more complex capabilities on top of the simpler ones …” (http://en.wikipedia.org)

  13. The Method – Bootstrapping
     • Start with a seed
     • Learn
     • Evaluate what has been learned
     • Add the evaluated items to the seed
     • Restart with the new seed

  14. Excursus: Bootstrapping for WSD (Yarowsky 1995)
     • Start with a small set of contexts for a given word (e.g. plant)
     • Determine log-likelihood values from a small annotated corpus
     • Rank the contexts by their log-likelihood values

  15. Excursus: Bootstrapping for WSD
     • Look for the word (plant) and its context in the corpus
     • Assign a sense (sense1 or sense2) on the basis of the best applicable log-likelihood ratio
     • Find new context words that co-occur with a known context often enough
     Example:
     • target: plant
     • known context: species
     • co-occurrence: animal

  16. Excursus: Bootstrapping for WSD
     • Calculate the log-likelihood ratios of this new context
     • Add them to the list
     • Note: smoothing is useful
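
The ranking and sense-assignment steps of this excursus can be sketched as a tiny decision list. This is a simplified illustration rather than Yarowsky's full algorithm: it assumes exactly two senses, single context words as features, and additive smoothing with a hypothetical constant `alpha`; the seed data are invented.

```python
import math
from collections import Counter


def build_decision_list(labeled_contexts, alpha=0.1):
    """Rank context words by the absolute smoothed log-likelihood ratio.

    labeled_contexts: iterable of (context_word, sense) pairs,
    where sense is "sense1" or "sense2".
    """
    counts = {"sense1": Counter(), "sense2": Counter()}
    for word, sense in labeled_contexts:
        counts[sense][word] += 1
    rules = []
    for word in set(counts["sense1"]) | set(counts["sense2"]):
        # Smoothing (alpha) keeps the ratio finite when one count is zero.
        llr = math.log((counts["sense1"][word] + alpha) /
                       (counts["sense2"][word] + alpha))
        # A ratio of exactly 0 falls to "sense2" here; that tie-break is arbitrary.
        rules.append((abs(llr), word, "sense1" if llr > 0 else "sense2"))
    rules.sort(reverse=True)  # strongest evidence first
    return rules


def assign_sense(context_words, rules):
    """Apply the single best-ranked rule whose word occurs in the context."""
    for _strength, word, sense in rules:
        if word in context_words:
            return sense
    return None  # no known context word: leave the occurrence unlabeled


# Hypothetical seed: "species" signals the living-plant sense (sense1),
# "factory" the industrial-plant sense (sense2), "growth" is ambiguous.
seed = [("species", "sense1"), ("species", "sense1"),
        ("factory", "sense2"), ("factory", "sense2"), ("factory", "sense2"),
        ("growth", "sense1"), ("growth", "sense2")]
rules = build_decision_list(seed)
print(assign_sense({"rare", "species"}, rules))
```

In a bootstrapping run, occurrences labeled this way would supply new context words (such as "animal" co-occurring with "species"), whose ratios are then added to the list.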

  17. The Method – Bootstrapping
     What does this mean for Information Extraction?
     • Start with a small number of instances (and/or patterns)
     • Learn patterns from them
     • Evaluate the patterns
     • Add new patterns to the pattern set
     • Derive more instances from these new patterns
     • Evaluate the new instances
     • Add new instances to the instance set
     • Restart with the enlarged instance (or pattern) set
     This is an iterative process. There are basically two “nested” bootstrapping loops.
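
The loop above can be sketched on a toy in-memory corpus. Everything concrete here is an assumption made for illustration: a pattern is taken to be the two words to the left of an instance, patterns are accepted when they match at least `min_pattern_hits` known instances, and instances are accepted when at least `min_instance_support` distinct patterns find them. The talk leaves these choices open, and the sketch alternates the two steps in one loop rather than nesting them.

```python
from collections import Counter, defaultdict


def find_patterns(corpus, instances):
    """Count two-word left contexts in which known instances occur."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(2, len(tokens)):
            if tokens[i] in instances:
                counts[(tokens[i - 2], tokens[i - 1])] += 1
    return counts


def find_instances(corpus, patterns):
    """Map each word that follows a known pattern to the patterns that found it."""
    support = defaultdict(set)
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(2, len(tokens)):
            context = (tokens[i - 2], tokens[i - 1])
            if context in patterns:
                support[tokens[i]].add(context)
    return support


def bootstrap(corpus, seed_instances, min_pattern_hits=2,
              min_instance_support=2, max_rounds=5):
    instances, patterns = set(seed_instances), set()
    for _ in range(max_rounds):
        # Learn and evaluate patterns, then add the accepted ones.
        hits = find_patterns(corpus, instances)
        new_patterns = {p for p, n in hits.items() if n >= min_pattern_hits} - patterns
        patterns |= new_patterns
        # Derive and evaluate instances, then add the accepted ones.
        support = find_instances(corpus, patterns)
        new_instances = {w for w, ctx in support.items()
                         if len(ctx) >= min_instance_support} - instances
        instances |= new_instances
        if not new_patterns and not new_instances:
            break  # nothing new was learned: the process has converged
    return instances, patterns


# Hypothetical corpus; "Europa" is a non-city that one pattern also matches.
CORPUS = [
    "Hotels in Berlin are cheap", "Hotels in Paris are expensive",
    "Hotels in Hamburg are full", "Hotels in Europa are varied",
    "flights to Berlin leave daily", "flights to Paris leave daily",
    "flights to Hamburg leave daily", "I love Berlin very much",
]
cities, learned_patterns = bootstrap(CORPUS, {"Berlin", "Paris"})
print(sorted(cities), sorted(learned_patterns))
```

Requiring support from two distinct patterns is what keeps "Europa" out in this toy run; with a threshold of one it would be accepted, which is the kind of drift the evaluation steps are meant to limit.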

  18. The Method – Bootstrapping
     Some fundamental problems:
     • How to evaluate the patterns and the instances?
     • Add all instances (patterns) to the instance (pattern) set?
     • Start with instances, with patterns, or even with both?
     • By the way, what is a pattern?
     • What about convergence of the algorithm?
     • What about corpus size?

  19. Some Case Studies: Corpus and Method
     • Corpus: web and WSJ
     • Apply the algorithm described, but choose patterns/instances manually

  20. Some Case Studies: City
     • Start with one instance: “Berlin” → patterns:
       • Hotels in
       • und Umgebung (German: “and surroundings”)
     • Search for “Hotels in *”:
       • Paris
       • München
       • Hamburg
       • etc.
       • but also: Europa, Mecklenburg-Vorpommern
     • Now, restart the web search with the new instances to get new patterns
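
The wildcard search in this case study can be approximated offline. The sketch below assumes the search results are already available as a list of text snippets (the snippets are invented) and that a pattern contains exactly one `*`; the helper names `wildcard_to_regex` and `harvest` are hypothetical.

```python
import re
from collections import Counter


def wildcard_to_regex(pattern):
    """Turn a pattern with one '*' into a regex capturing a single word."""
    prefix, _, suffix = pattern.partition("*")
    # Assumption: an instance is one token of letters, digits, or hyphens,
    # so multi-word instances are not captured.
    return re.compile(re.escape(prefix) + r"([\w-]+)" + re.escape(suffix))


def harvest(pattern, snippets):
    """Count the candidate instances that fill the wildcard slot."""
    rx = wildcard_to_regex(pattern)
    return Counter(m.group(1) for snippet in snippets for m in rx.finditer(snippet))


# Hypothetical snippets standing in for web search results.
snippets = [
    "Hotels in Paris und Umgebung",
    "Günstige Hotels in München",
    "Hotels in Hamburg buchen",
    "Hotels in Europa",
    "Hotels in Mecklenburg-Vorpommern",
    "Hotels in Paris",
]
counts = harvest("Hotels in *", snippets)
print(counts.most_common())
```

The same helper applies to suffix patterns such as `"*'s job"`. As in the case study, the harvested candidates still include non-cities like "Europa", so a separate filtering criterion is needed.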

  21. Some Case Studies: Professions
     • Start with one instance: “lawyer” → patterns:
       • lawyer’s job
       • hire a lawyer
     • Search for “*’s job”:
       • forester
       • therapist
       • reporter
       • etc.
       • but also: employee, John …
     • Now, restart the web search with the new instances to get new patterns

  22. Some Case Studies: Problems
     • Patterns match a lot of different instance types → need criteria to choose good patterns
     • Instances could be multi-word expressions → need criteria that determine “instance boundaries”
     • Even if the patterns are good, the instances found could be wrong → need criteria to decide about instances

  23. Some Case Studies: List Search
     • Start the web search with 5 instances at a time → instances:
       • tennis
       • football
       • ballet
       • sailing
       • baseball
     • Get lists with lots of additional instances all at once

  24. Some Case Studies: Problems
     • Only works on the web!
     • For some instance types it doesn’t work at all!
     • Choosing the 5 instances is hard → too similar or too different = no lists found
     • Finding the actual list within a web page

  25. Roadmap
     • Decide about bootstrapping and implement it
     • Run for MUC Named Entities
     • Run for “simple”, “one word” classes (e.g. sports, occupations)
     • Run for “difficult” classes (e.g. book titles, movies)
     • Run for different classes at the same time

  26. Literature survey
     There are some publications which address either bootstrapping or flexible IE:
     • E. Riloff, R. Jones (1999): Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping.
     • E. Agichtein, L. Gravano (2000): Snowball: Extracting Relations from Large Plain-Text Collections.
     • R. Yangarber, et al. (2000): Automatic Acquisition of Domain Knowledge for Information Extraction.
     • O. Etzioni, et al. (2004): Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison.
     • D. Yarowsky (1995): Unsupervised Word Sense Disambiguation Rivaling Supervised Methods.
     • S. Abney (2002): Bootstrapping.
     • S. Abney (2004): Understanding the Yarowsky Algorithm.

  27. Thank you!