Automatic Discovery of Scenario-Level Patterns for Information Extraction Roman Yangarber Ralph Grishman Pasi Tapanainen Silja Huttunen
Outline • Information Extraction: background • Problems in IE • Prior Work: Machine Learning for IE • Discover patterns from raw text • Experimental results • Current work
Quick Overview • What is Information Extraction? • Definition: • finding facts about a specified class of events from free text • filling a table in a database (slots in a template) • Events: instances of relations, with many arguments
Example: Management Succession • George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA.
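From this sentence, the extraction task is to fill one database record. A sketch of the filled template (the field names are illustrative, loosely modeled on the MUC-6 management-succession template, not the exact MUC-6 slot labels):

```python
# Illustrative filled template for the example sentence; the field
# names are hypothetical, in the spirit of the MUC-6 succession task.
succession_event = {
    "person_in":    "George Garrick",
    "position":     "chief executive officer",
    "organization": "Nielsen Marketing Research, USA",
    "person_out":   None,  # no outgoing officer is mentioned
}
```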
System Architecture: Proteus • Pipeline: Input Text → Lexical Analysis → Name Recognition → Partial Syntax → Scenario Patterns → Reference Resolution → Discourse Analyzer → Output Generation → Extracted Information • The analysis stages through scenario patterns operate sentence by sentence; the later stages operate over the whole discourse
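As a rough sketch, the pipeline can be written as a chain of stage functions over a document (every function name below is a hypothetical placeholder for the corresponding Proteus module, not real code from the system):

```python
# A minimal sketch of the Proteus pipeline; each stage function is a
# hypothetical stand-in named after a box in the architecture slide.

def run_proteus(input_text):
    # Sentence-level stages
    tokens = lexical_analysis(input_text)   # tokenization, lexicon lookup
    tokens = name_recognition(tokens)       # tag person/organization/location names
    groups = partial_syntax(tokens)         # noun-group / verb-group chunking
    events = scenario_patterns(groups)      # match task-specific event patterns
    # Discourse-level stages
    events = reference_resolution(events)   # link pronouns and name aliases
    merged = discourse_analyzer(events)     # merge partial event descriptions
    return output_generation(merged)        # produce the extracted templates
```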
Problems • Customization • Performance
Problems: Customization To customize a system for a new extraction task, we have to develop • new patterns for new types of events • word classes for the domain • inference rules This can be a large job requiring skilled labor • expense of customization limits uses of extraction
Problems: Performance • Performance on event IE is limited • On MUC tasks, typical top performance is recall < 55%, precision < 75% • Errors propagate through multiple phases: • name recognition errors • syntax analysis errors • missing patterns • reference resolution errors • complex inference required
Missing Patterns • As with many language phenomena: • a few common patterns • a large number of rare patterns • Rare patterns do not surface sufficiently often in a limited corpus • Missing patterns make customization expensive and limit performance • Finding good patterns is necessary to improve customization and performance • [Figure: pattern frequency vs. rank curve]
Prior Research • build patterns from examples • Yangarber ‘97 • generalize from multiple examples: annotated text • Crystal, Whisk (Soderland), Rapier (Califf) • active learning: reduce annotation • Soderland ‘99, Califf ‘99 • learning from corpus with relevance judgements • Riloff ‘96, ‘99 • co-learning/bootstrapping • Brin ‘98, Agichtein ‘00
Our Goals • Minimize manual labor required to construct pattern bases for new domain • un-annotated text • un-classified text • un-supervised learning • Use very large corpora -- larger than we could ever tag manually -- to improve coverage of patterns
Principle: Pattern Density • If we have relevance judgements for documents in a corpus, for the given task,then the patterns which are much more frequent in relevant documents will generally be good patterns • Riloff (1996) finds patterns related to terrorist attacks
Principle: Duality • Duality between patterns and documents: • relevant documents are strong indicators of good patterns • good patterns are strong indicators of relevant documents
Outline of Procedure • Initial query: a small set of seed patterns which partially characterize the topic of interest • Retrieve documents containing the seed patterns: the "relevant documents" • Rank patterns in the relevant documents by frequency in relevant docs vs. overall frequency • Add the top-ranked pattern to the seed pattern set • Repeat
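A minimal sketch of the whole loop, under simplifying assumptions: each document is represented as the set of candidate patterns it contains, relevance is treated as binary rather than graded, and the ranking metric is the Riloff-style score from the "Scoring Patterns" slide later in this deck:

```python
# Sketch of the unsupervised discovery loop. Assumes each document is
# a set of candidate patterns; binary relevance for simplicity.
from math import log

def discover(documents, seeds, iterations=80):
    accepted = set(seeds)
    for _ in range(iterations):
        # Retrieve: documents matching any accepted pattern are "relevant"
        relevant = [d for d in documents if accepted & d]

        # Rank: frequency in relevant docs vs. overall frequency
        def score(p):
            h = sum(1 for d in documents if p in d)   # overall doc frequency
            hr = sum(1 for d in relevant if p in d)   # frequency in relevant docs
            return (hr / h) * log(hr) if hr else 0.0

        candidates = {p for d in relevant for p in d} - accepted
        if not candidates:
            break
        # Add the single top-ranked pattern, then repeat
        accepted.add(max(candidates, key=score))
    return accepted
```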
#1: pick seed pattern Seed: < person retires >
#2: retrieve relevant documents Seed: < person retires > Fred retired. ... Harry was named president. Maki retired. ... Yuki was named president. Relevant documents Other documents
#3: pick new pattern Seed: < person retires > < person was named president > appears in several relevant documents (top-ranked by the Riloff metric) Fred retired. ... Harry was named president. Maki retired. ... Yuki was named president.
#4: add new pattern to pattern set Pattern set: < person retires > < person was named president >
Pre-processing • For each document, find and classify names: • { person | location | organization | … } • Parse the document • (regularize passives, relative clauses, etc.) • For each clause, collect a candidate pattern: a tuple of the heads of • [ subject, verb, direct object, object/subject complement, locative and temporal modifiers, … ]
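A rough modern approximation of this step, using spaCy's dependency parser in place of the original Proteus parser (spaCy is a substitution for illustration; the original system did its own partial parsing and passive/relative-clause regularization, which this sketch omits, along with the extra complement and modifier slots):

```python
# Sketch: collect one (subject, verb, object) candidate tuple per clause.
import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_patterns(text):
    tuples = []
    for sent in nlp(text).sents:
        for verb in (t for t in sent if t.pos_ == "VERB"):
            subj = next((c for c in verb.children
                         if c.dep_ in ("nsubj", "nsubjpass")), None)
            obj = next((c for c in verb.children if c.dep_ == "dobj"), None)
            if subj is not None:
                # In the real system, names would first be replaced by
                # their class (person, organization, ...).
                tuples.append((subj.lemma_, verb.lemma_,
                               obj.lemma_ if obj else None))
    return tuples
```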
Experiment • Task: Management succession (as MUC-6) • Source: Wall Street Journal • Training corpus: ~ 6,000 articles • Test corpus: • 100 documents: MUC-6 formal training • + 150 documents judged manually
Experiment: two seed patterns • v-appoint = { appoint, elect, promote, name } • v-resign = { resign, depart, quit, step-down } • Run discovery procedure for 80 iterations
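In the notation of the earlier slides, the seeds might be written as follows (a sketch: the subject/object class constraints are my reading of the pattern forms shown earlier, such as < person retires >, not the system's actual pattern syntax):

```python
# The two seed verb classes from the slide, as simple pattern constraints.
# The class-slot layout is illustrative only.
V_APPOINT = {"appoint", "elect", "promote", "name"}
V_RESIGN = {"resign", "depart", "quit", "step-down"}

SEEDS = (
    {("company", v, "person") for v in V_APPOINT}   # < company appoints person >
    | {("person", v, None) for v in V_RESIGN}       # < person resigns >
)

# These could be handed to the discovery loop sketched earlier:
# patterns = discover(documents, SEEDS, iterations=80)
```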
Evaluation • Look at discovered patterns • new patterns, missed in manual training • Document filtering • Slot filling
Evaluation: new patterns • Not found in manual training
Evaluation: Text Filtering • How effective are discovered patterns at selecting relevant documents? • IR-style • documents matching at least one pattern
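Measured IR-style, a document counts as selected if it matches at least one discovered pattern; recall and precision are then computed against the manual relevance judgements (a straightforward transcription of that setup, not code from the paper):

```python
def text_filtering_eval(selected_docs, relevant_docs):
    """selected_docs: docs matching >= 1 pattern; relevant_docs: gold judgements."""
    hits = len(selected_docs & relevant_docs)
    recall = hits / len(relevant_docs)
    precision = hits / len(selected_docs) if selected_docs else 0.0
    return recall, precision
```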
Evaluation: Slot filling • How effective are patterns within a complete IE system? • MUC-style IE on MUC-6 corpora • Caveat
Conclusion: Automatic discovery • Performance comparable to human (4-week) development • From un-annotated text: allows us to take advantage of very large corpora • redundancy • duality • Will likely help wider use of IE
Good Patterns • U - universe of all documents • R - set of relevant documents • H = H(p) - set of documents where pattern p matched • Density criterion: |H ∩ R| / |H| > |R| / |U| (relevant documents are denser among the documents p matches than in the corpus overall)
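In code, the criterion is a one-line test over the three document sets (a direct transcription of the definitions above):

```python
def is_good(H, R, U):
    """H: docs matched by pattern p; R: relevant docs; U: all docs (as sets).
    Good pattern: relevant docs are denser in H than in the corpus overall."""
    return len(H & R) / len(H) > len(R) / len(U)
```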
Graded Relevance • Documents matching seed patterns considered 100% relevant • Discovered patterns are considered less certain • Documents containing them are considered partially relevant
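The slide does not give the exact combination rule; one simple realization (an assumption, for illustration only) is to let a document inherit the confidence of the strongest pattern it contains:

```python
def doc_relevance(doc_patterns, confidence):
    """confidence: pattern -> score in [0, 1]; seed patterns map to 1.0.
    Hypothetical rule: a document is as relevant as its best pattern."""
    return max((confidence.get(p, 0.0) for p in doc_patterns), default=0.0)
```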
Scoring Patterns • Score(p) = (document frequency in relevant documents / overall document frequency) × log(document frequency in relevant documents) • i.e., Score(p) = (|H ∩ R| / |H|) · log |H ∩ R| • (metrics similar to those used in Riloff '96)
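Written out with the sets from the "Good Patterns" slide (my reconstruction of the flattened slide text; the log factor follows the relevance-rate × log-frequency shape of Riloff '96's metric):

```python
from math import log

def pattern_score(H, R):
    """H: docs matched by pattern p; R: relevant docs (as sets)."""
    hr = len(H & R)       # doc frequency in relevant documents
    rate = hr / len(H)    # ... divided by overall doc frequency
    return rate * log(hr) if hr else 0.0
```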