Suggestions to Improve the Flexibility and Adaptivity of Information Extraction
Irene M. Cramer
Supervisor: Prof. Dr. D. Klakow
Lehrstuhl für Sprachsignalverarbeitung, Saarland University
IGK Colloquium - Winter 04/05

Outline
• Information Extraction
• Some comments on IE
• An IE example system: FASTUS
• The challenge
• Answers to the challenge
• The possible method
• Some case studies
• Dissertation roadmap

Information Extraction
• Problem:
  • A huge amount of textual information is available
  • Who is able to read and analyze it?

Information Extraction
Solution: IE
• Find relevant information
• Analyze relevant information automatically
• Structure relevant information

Information Extraction
• Input: a specification of the relevant information (templates) and a set of documents
• Output: a set of instantiated templates, e.g. to be stored in a database

Information Extraction
• Evaluation: precision/recall, F-measure
• Applications:
  • Text Classification
  • Text Mining
  • Text Summarization
  • Question Answering

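The evaluation measures named above can be made concrete with a small sketch. It assumes that extracted slot fills and the gold standard are both available as sets of (slot, value) pairs; the function name and the example data are hypothetical and not taken from the talk.

```python
def precision_recall_f(extracted, gold, beta=1.0):
    """Score extracted slot fills against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    # Precision: share of extracted items that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: share of gold items that were found.
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: weighted harmonic mean; beta=1 weights both equally.
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Made-up example: four gold cities, three extracted, one of them wrong.
gold = {("city", "Berlin"), ("city", "Paris"),
        ("city", "Hamburg"), ("city", "München")}
extracted = {("city", "Berlin"), ("city", "Paris"), ("city", "Europa")}
p, r, f = precision_recall_f(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

With these made-up sets, two of three extracted items are correct and two of four gold items are found, so precision is 2/3 and recall is 1/2.
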
IE example system: FASTUS
• FASTUS (= Finite State Automaton-based Text Understanding System), a MUC IE system
• Extraction of information from unstructured text
• No real text understanding!

IE example system: FASTUS
• Series of cascaded finite-state automata
• Basically, 3 steps:
  • Recognize phrases
    • Complex words (multi-words, proper names)
    • Simple phrases
    • Complex phrases
  • Recognize patterns
  • Merge the bits of information found

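The three steps can be illustrated with a toy cascade. This is only a sketch under strong assumptions: regular expressions stand in for the finite-state automata, and the two extraction patterns, the slot names, and the example sentence are invented for illustration. It is not the FASTUS grammar.

```python
import re

# Step 1 (phrases): a multi-word proper name becomes one token.
PROPER_NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

# Step 2 (patterns): hypothetical domain patterns and their slot names.
PATTERNS = [
    (re.compile(r"(\w+) works at (\w+)"), ("person", "employer")),
    (re.compile(r"(\w+) lives in (\w+)"), ("person", "city")),
]

def recognize_phrases(text):
    return PROPER_NAME.sub(lambda m: m.group(0).replace(" ", "_"), text)

def recognize_patterns(text):
    records = []
    for regex, slots in PATTERNS:
        for match in regex.finditer(text):
            records.append(dict(zip(slots, match.groups())))
    return records

def merge(records, key="person"):
    # Step 3 (merge): combine partial records that share the same key.
    merged = {}
    for record in records:
        merged.setdefault(record[key], {}).update(record)
    return list(merged.values())

def extract(text):
    # Each stage consumes the output of the previous one (the cascade).
    return merge(recognize_patterns(recognize_phrases(text)))

text = "John Smith works at Acme Corp. John Smith lives in Berlin."
templates = extract(text)
print(templates)
```

The point of the cascade is that the pattern stage can stay simple because the phrase stage has already turned "John Smith" into a single token.
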
Information Extraction
• Limitation:
  • Someone has to build the templates, which is time-consuming.
  • Thus, the templates are normally static.
  • What about adaptation to a new domain …?

The Challenge
• To be more flexible (and to support open-domain QA):
  • Have many more patterns than in a typical IE system
  • Base the work on an (already existing) QA ontology
  • Learn the patterns automatically!?

The Method – Constraints
• We are looking for common entities (such as the MUC Named Entities) …
• … and also for exceptional ones (book titles, sports, occupations etc.)
• No annotated corpora
• No hand-crafted rules
• Thus, we will have to start with almost nothing → unsupervised or semi-supervised learning

The Method – Bootstrapping
• “… a process where a simple system activates a more complicated system …” (http://en.wikipedia.org)
• “… a complex system emerges by starting simply and, bit by bit, developing more complex capabilities on top of the simpler ones …” (http://en.wikipedia.org)

The Method – Bootstrapping
• Start with a seed
• Learn
• Evaluate what was learned
• Add the evaluated items to the seed
• Restart with the new seed

Excursus: Bootstrapping for WSD (Yarowsky 1995)
• Start with a small set of contexts for a given word (e.g. plant)
• Determine log-likelihood values from a small annotated corpus
• Rank the contexts by their log-likelihood values

Excursus: Bootstrapping for WSD
• Look for the word (plant) and its context in the corpus
• Assign a sense (sense1 or sense2) on the basis of the best applicable log-likelihood ratio
• Find new context words that co-occur with a known context often enough
Example:
• target: plant
• known context: species
• co-occurrence: animal

Excursus: Bootstrapping for WSD
• Calculate the log-likelihood ratios of this new context
• Add them to the list
• Note: smoothing is useful

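The excursus can be sketched in code as follows. This is a simplified illustration, not Yarowsky's full algorithm: it assumes contexts are plain sets of words, uses add-alpha smoothing with an arbitrary `alpha`, labels a context once and never revises it, and applies no confidence threshold. All data and function names are made up.

```python
import math
from collections import Counter

def build_decision_list(labelled, alpha=0.1):
    """Rank context words by smoothed |log-likelihood ratio| of the two senses."""
    counts = {"sense1": Counter(), "sense2": Counter()}
    for words, sense in labelled:
        counts[sense].update(set(words))
    rules = []
    for word in set(counts["sense1"]) | set(counts["sense2"]):
        # alpha avoids division by zero for words seen with one sense only.
        llr = math.log((counts["sense1"][word] + alpha) /
                       (counts["sense2"][word] + alpha))
        if llr != 0:
            rules.append((abs(llr), word, "sense1" if llr > 0 else "sense2"))
    return sorted(rules, reverse=True)

def classify(context, rules):
    """Assign the sense of the strongest rule that applies, or None."""
    for _strength, word, sense in rules:
        if word in context:
            return sense
    return None

def bootstrap(seed, unlabelled, rounds=5):
    labelled, remaining = list(seed), list(unlabelled)
    rules = build_decision_list(labelled)
    for _ in range(rounds):
        tagged = [(ctx, classify(ctx, rules)) for ctx in remaining]
        newly = [(ctx, sense) for ctx, sense in tagged if sense is not None]
        if not newly:
            break
        labelled.extend(newly)
        remaining = [ctx for ctx, sense in tagged if sense is None]
        # Words from newly labelled contexts get their own ratios.
        rules = build_decision_list(labelled)
    return rules

# Made-up contexts for "plant": sense1 = living thing, sense2 = factory.
SEED = [({"species", "animal"}, "sense1"), ({"species", "leaf"}, "sense1"),
        ({"factory", "workers"}, "sense2"), ({"workers", "manufacturing"}, "sense2")]
UNLABELLED = [{"species", "protected"}, {"protected", "habitat"}, {"factory", "closed"}]

rules = bootstrap(SEED, UNLABELLED)
print(classify({"habitat"}, rules))
```

In this toy run "habitat" is unknown to the seed rules; it only becomes usable after "protected" has been learned from a context that also contains "species".
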
The Method – Bootstrapping
What does this mean for Information Extraction?
• Start with a small number of instances (and/or patterns)
• Learn patterns from them
• Evaluate the patterns
• Add the new patterns to the pattern set
• Derive more instances from these new patterns
• Evaluate the new instances
• Add the new instances to the instance set
• Restart with the enlarged instance (or pattern) set
This is an iterative process. There are basically two “nested” bootstrapping loops.

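A minimal sketch of this loop, under assumptions the talk leaves open: a pattern is taken to be the two tokens to the left or right of an instance, a pattern is accepted when it matches at least `min_support` known instances, and a candidate instance is accepted when at least `min_patterns` accepted patterns extract it. The thresholds, the toy corpus, and all names are hypothetical, and the two loops are written as one alternating loop for brevity.

```python
from collections import defaultdict

def index_contexts(corpus):
    """Map each two-token left/right context to the tokens it surrounds."""
    fillers = defaultdict(set)
    for sentence in corpus:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if i >= 2:
                fillers[("L", tokens[i - 2], tokens[i - 1])].add(token)
            if i + 2 < len(tokens):
                fillers[("R", tokens[i + 1], tokens[i + 2])].add(token)
    return fillers

def bootstrap(corpus, seeds, min_support=2, min_patterns=2, max_rounds=10):
    fillers = index_contexts(corpus)
    instances, patterns = set(seeds), set()
    for _ in range(max_rounds):
        # Learn and evaluate patterns: keep those matching enough known instances.
        new_patterns = {p for p, found in fillers.items()
                        if len(found & instances) >= min_support} - patterns
        patterns |= new_patterns
        # Derive and evaluate instances: count how many patterns extract each one.
        votes = defaultdict(int)
        for pattern in patterns:
            for candidate in fillers[pattern]:
                votes[candidate] += 1
        new_instances = {c for c, n in votes.items() if n >= min_patterns} - instances
        instances |= new_instances
        if not new_patterns and not new_instances:
            break  # nothing new was learned, so the process has converged
    return instances, patterns

CORPUS = [
    "hotels in Berlin are cheap", "hotels in Paris are expensive",
    "hotels in Hamburg are full", "hotels in Europe are varied",
    "Berlin city centre map", "Paris city centre map",
    "Hamburg city centre map", "Munich city centre map",
    "flights to Munich are late", "flights to Berlin are late",
    "flights to Hamburg are late",
]
cities, learned = bootstrap(CORPUS, {"Berlin", "Paris"})
print(sorted(cities))
```

On this toy corpus "Hamburg" is found in the first round, which in turn lets the "flights to" context qualify as a pattern and brings in "Munich" in the second round. "Europe" is supported by only one pattern and stays out, which illustrates why some form of instance evaluation is needed.
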
The Method – Bootstrapping
Some fundamental problems:
• How to evaluate the patterns and the instances?
• Add all instances (patterns) to the instance (pattern) set?
• Start with instances, with patterns, or even with both?
• By the way, what is a pattern?
• What about convergence of the algorithm?
• What about corpus size?

Some Case Studies: Corpus and Method
• Corpus: web and WSJ
• Apply the algorithm described, but choose patterns/instances manually

Some Case Studies: City
• Start with one instance: “Berlin”
• Patterns:
  • Hotels in
  • und Umgebung (German: “and surroundings”)
• Search for “Hotels in *”:
  • Paris
  • München
  • Hamburg
  • etc.
  • but also: Europa, Mecklenburg-Vorpommern
• Now, restart the web search with the new instances to get new patterns

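A wildcard search such as “Hotels in *” can be imitated offline with a regular expression. The sketch below assumes that the wildcard stands for a single capitalized word (possibly hyphenated) and that the search hits are available as plain strings; the snippets are invented stand-ins for web results, and the helper names are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a pattern with exactly one '*' into a regex with one capture group."""
    if pattern.count("*") != 1:
        raise ValueError("pattern must contain exactly one '*'")
    left, right = (re.escape(part) for part in pattern.split("*"))
    # Assumption: an instance is one capitalized, possibly hyphenated word.
    return re.compile(left + r"([A-ZÄÖÜ][\w-]*)" + right)

def find_instances(pattern, snippets):
    regex = wildcard_to_regex(pattern)
    found = []
    for snippet in snippets:
        for match in regex.finditer(snippet):
            if match.group(1) not in found:
                found.append(match.group(1))  # keep first-seen order
    return found

# Invented snippets standing in for web search hits.
SNIPPETS = [
    "Hotels in Paris from 49 euros",
    "Cheap Hotels in München and more",
    "Hotels in Mecklenburg-Vorpommern",
    "Hotels in Europa",
    "hotels in the city",
]
print(find_instances("Hotels in *", SNIPPETS))
```

As in the case study, the pattern also returns a region and a continent, so matching alone does not decide whether a candidate is a city.
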
Some Case Studies: Professions
• Start with one instance: “lawyer”
• Patterns:
  • lawyer’s job
  • hire a lawyer
• Search for “*’s job”:
  • forester
  • therapist
  • reporter
  • etc.
  • but also: employee, John …
• Now, restart the web search with the new instances to get new patterns

Some Case Studies: Problems
• Patterns match a lot of different instance types → possible criteria to choose good patterns
• Instances could be multi-words → criteria that determine “instance boundaries”
• Even if the patterns are good, the instances found could be wrong ones → criteria to decide about instances

Some Case Studies: List Search
• Start a web search with 5 instances at a time
• Instances:
  • tennis
  • football
  • ballet
  • sailing
  • baseball
• Get lists with lots of additional instances all at once

Some Case Studies: Problems
• Only works on the web!
• For some instance types it doesn’t work at all!
• Choosing the 5 instances: too similar or too different = no lists found
• Finding the actual list in the web page

Roadmap
• Decide on a bootstrapping approach and implement it
• Run for MUC Named Entities
• Run for “simple”, “one word” classes (e.g. sports, occupations)
• Run for “difficult” classes (e.g. book titles, movies)
• Run for different classes at the same time

Literature survey
There are some publications which address either bootstrapping or flexible IE:
• E. Riloff, R. Jones (1999): Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping.
• E. Agichtein, L. Gravano (2000): Snowball: Extracting Relations from Large Plain-Text Collections.
• R. Yangarber, et al. (2000): Automatic Acquisition of Domain Knowledge for Information Extraction.
• O. Etzioni, et al. (2004): Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison.
• D. Yarowsky (1995): Unsupervised Word Sense Disambiguation Rivaling Supervised Methods.
• S. Abney (2002): Bootstrapping.
• S. Abney (2004): Understanding the Yarowsky Algorithm.

Thank you!