Automatic Discovery of Scenario-Level Patterns for Information Extraction Roman Yangarber Ralph Grishman Pasi Tapanainen Silja Huttunen
Outline • Information Extraction: background • Problems in IE • Prior Work: Machine Learning for IE • Discover patterns from raw text • Experimental results • Current work
Quick Overview • What is Information Extraction? • Definition: • finding facts about a specified class of events from free text • filling a table in a database (slots in a template) • Events: instances of relations, with many arguments
Example: Management Succession • George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA.
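From this sentence, the extraction task is to fill one database record. A sketch of the filled template (the field names are illustrative, loosely modeled on the MUC-6 management-succession template, not the exact MUC-6 slot labels):

```python
# Illustrative filled template for the example sentence; the field
# names are hypothetical, in the spirit of the MUC-6 succession task.
succession_event = {
    "person_in":    "George Garrick",
    "position":     "chief executive officer",
    "organization": "Nielsen Marketing Research, USA",
    "person_out":   None,  # no outgoing officer is mentioned
}
```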
System Architecture: Proteus • Pipeline: Input Text → Lexical Analysis → Name Recognition → Partial Syntax → Scenario Patterns → Reference Resolution → Discourse Analyzer → Output Generation → Extracted Information • The analysis stages through scenario patterns operate sentence by sentence; the later stages operate over the whole discourse
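As a rough sketch, the pipeline can be written as a chain of stage functions over a document (every function name below is a hypothetical placeholder for the corresponding Proteus module, not real code from the system):

```python
# A minimal sketch of the Proteus pipeline; each stage function is a
# hypothetical stand-in named after a box in the architecture slide.

def run_proteus(input_text):
    # Sentence-level stages
    tokens = lexical_analysis(input_text)   # tokenization, lexicon lookup
    tokens = name_recognition(tokens)       # tag person/organization/location names
    groups = partial_syntax(tokens)         # noun-group / verb-group chunking
    events = scenario_patterns(groups)      # match task-specific event patterns
    # Discourse-level stages
    events = reference_resolution(events)   # link pronouns and name aliases
    merged = discourse_analyzer(events)     # merge partial event descriptions
    return output_generation(merged)        # produce the extracted templates
```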
Problems • Customization • Performance
Problems: Customization To customize a system for a new extraction task, we have to develop • new patterns for new types of events • word classes for the domain • inference rules This can be a large job requiring skilled labor • expense of customization limits uses of extraction
Problems: Performance • Performance on event IE is limited • On MUC tasks, typical top performance is recall < 55%, precision < 75% • Errors propagate through multiple phases: • name recognition errors • syntax analysis errors • missing patterns • reference resolution errors • complex inference required
Missing Patterns • As with many language phenomena: • a few common patterns • a large number of rare patterns • Rare patterns do not surface sufficiently often in a limited corpus • Missing patterns make customization expensive and limit performance • Finding good patterns is necessary to improve customization and performance • [Figure: pattern frequency vs. rank curve]
Prior Research • build patterns from examples • Yangarber ‘97 • generalize from multiple examples: annotated text • Crystal, Whisk (Soderland), Rapier (Califf) • active learning: reduce annotation • Soderland ‘99, Califf ‘99 • learning from corpus with relevance judgements • Riloff ‘96, ‘99 • co-learning/bootstrapping • Brin ‘98, Agichtein ‘00
Our Goals • Minimize manual labor required to construct pattern bases for new domain • un-annotated text • un-classified text • un-supervised learning • Use very large corpora -- larger than we could ever tag manually -- to improve coverage of patterns
Principle: Pattern Density • If we have relevance judgements for documents in a corpus, for the given task,then the patterns which are much more frequent in relevant documents will generally be good patterns • Riloff (1996) finds patterns related to terrorist attacks
Principle: Duality • Duality between patterns and documents: • relevant documents are strong indicators of good patterns • good patterns are strong indicators of relevant documents
Outline of Procedure • Initial query: a small set of seed patterns which partially characterize the topic of interest • Retrieve documents containing the seed patterns: the "relevant documents" • Rank patterns in the relevant documents by frequency in relevant docs vs. overall frequency • Add the top-ranked pattern to the seed pattern set • Repeat
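A minimal sketch of the whole loop, under simplifying assumptions: each document is represented as the set of candidate patterns it contains, relevance is treated as binary rather than graded, and the ranking metric is the Riloff-style score from the "Scoring Patterns" slide later in this deck:

```python
# Sketch of the unsupervised discovery loop. Assumes each document is
# a set of candidate patterns; binary relevance for simplicity.
from math import log

def discover(documents, seeds, iterations=80):
    accepted = set(seeds)
    for _ in range(iterations):
        # Retrieve: documents matching any accepted pattern are "relevant"
        relevant = [d for d in documents if accepted & d]

        # Rank: frequency in relevant docs vs. overall frequency
        def score(p):
            h = sum(1 for d in documents if p in d)   # overall doc frequency
            hr = sum(1 for d in relevant if p in d)   # frequency in relevant docs
            return (hr / h) * log(hr) if hr else 0.0

        candidates = {p for d in relevant for p in d} - accepted
        if not candidates:
            break
        # Add the single top-ranked pattern, then repeat
        accepted.add(max(candidates, key=score))
    return accepted
```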
#1: pick seed pattern Seed: < person retires >
#2: retrieve relevant documents Seed: < person retires > Fred retired. ... Harry was named president. Maki retired. ... Yuki was named president. Relevant documents Other documents
#3: pick new pattern Seed: < person retires > < person was named president > appears in several relevant documents (top-ranked by the Riloff metric) Fred retired. ... Harry was named president. Maki retired. ... Yuki was named president.
#4: add new pattern to pattern set Pattern set: < person retires > < person was named president >
Pre-processing • For each document, find and classify names: • { person | location | organization | … } • Parse the document • (regularize passives, relative clauses, etc.) • For each clause, collect a candidate pattern: a tuple of the heads of • [ subject, verb, direct object, object/subject complement, locative and temporal modifiers, … ]
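A rough modern approximation of this step, using spaCy's dependency parser in place of the original Proteus parser (spaCy is a substitution for illustration; the original system did its own partial parsing and passive/relative-clause regularization, which this sketch omits, along with the extra complement and modifier slots):

```python
# Sketch: collect one (subject, verb, object) candidate tuple per clause.
import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_patterns(text):
    tuples = []
    for sent in nlp(text).sents:
        for verb in (t for t in sent if t.pos_ == "VERB"):
            subj = next((c for c in verb.children
                         if c.dep_ in ("nsubj", "nsubjpass")), None)
            obj = next((c for c in verb.children if c.dep_ == "dobj"), None)
            if subj is not None:
                # In the real system, names would first be replaced by
                # their class (person, organization, ...).
                tuples.append((subj.lemma_, verb.lemma_,
                               obj.lemma_ if obj else None))
    return tuples
```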
Experiment • Task: Management succession (as MUC-6) • Source: Wall Street Journal • Training corpus: ~ 6,000 articles • Test corpus: • 100 documents: MUC-6 formal training • + 150 documents judged manually
Experiment: two seed patterns • v-appoint = { appoint, elect, promote, name } • v-resign = { resign, depart, quit, step-down } • Run discovery procedure for 80 iterations
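In the notation of the earlier slides, the seeds might be written as follows (a sketch: the subject/object class constraints are my reading of the pattern forms shown earlier, such as < person retires >, not the system's actual pattern syntax):

```python
# The two seed verb classes from the slide, as simple pattern constraints.
# The class-slot layout is illustrative only.
V_APPOINT = {"appoint", "elect", "promote", "name"}
V_RESIGN = {"resign", "depart", "quit", "step-down"}

SEEDS = (
    {("company", v, "person") for v in V_APPOINT}   # < company appoints person >
    | {("person", v, None) for v in V_RESIGN}       # < person resigns >
)

# These could be handed to the discovery loop sketched earlier:
# patterns = discover(documents, SEEDS, iterations=80)
```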
Evaluation • Look at discovered patterns • new patterns, missed in manual training • Document filtering • Slot filling
Evaluation: new patterns • Not found in manual training
Evaluation: Text Filtering • How effective are discovered patterns at selecting relevant documents? • IR-style • documents matching at least one pattern
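Measured IR-style, a document counts as selected if it matches at least one discovered pattern; recall and precision are then computed against the manual relevance judgements (a straightforward transcription of that setup, not code from the paper):

```python
def text_filtering_eval(selected_docs, relevant_docs):
    """selected_docs: docs matching >= 1 pattern; relevant_docs: gold judgements."""
    hits = len(selected_docs & relevant_docs)
    recall = hits / len(relevant_docs)
    precision = hits / len(selected_docs) if selected_docs else 0.0
    return recall, precision
```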
Evaluation: Slot filling • How effective are patterns within a complete IE system? • MUC-style IE on MUC-6 corpora • Caveat
Conclusion: Automatic discovery • Performance comparable to human (4-week) development • From un-annotated text: allows us to take advantage of very large corpora • redundancy • duality • Will likely help wider use of IE
Good Patterns • U - universe of all documents • R - set of relevant documents • H = H(p) - set of documents where pattern p matched • Density criterion: |H ∩ R| / |H| > |R| / |U| (relevant documents are denser among the documents p matches than in the corpus overall)
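In code, the criterion is a one-line test over the three document sets (a direct transcription of the definitions above):

```python
def is_good(H, R, U):
    """H: docs matched by pattern p; R: relevant docs; U: all docs (as sets).
    Good pattern: relevant docs are denser in H than in the corpus overall."""
    return len(H & R) / len(H) > len(R) / len(U)
```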
Graded Relevance • Documents matching seed patterns considered 100% relevant • Discovered patterns are considered less certain • Documents containing them are considered partially relevant
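The slide does not give the exact combination rule; one simple realization (an assumption, for illustration only) is to let a document inherit the confidence of the strongest pattern it contains:

```python
def doc_relevance(doc_patterns, confidence):
    """confidence: pattern -> score in [0, 1]; seed patterns map to 1.0.
    Hypothetical rule: a document is as relevant as its best pattern."""
    return max((confidence.get(p, 0.0) for p in doc_patterns), default=0.0)
```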
Scoring Patterns • Score(p) = (document frequency in relevant documents / overall document frequency) × log(document frequency in relevant documents) • i.e., Score(p) = (|H ∩ R| / |H|) · log |H ∩ R| • (metrics similar to those used in Riloff '96)
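Written out with the sets from the "Good Patterns" slide (my reconstruction of the flattened slide text; the log factor follows the relevance-rate × log-frequency shape of Riloff '96's metric):

```python
from math import log

def pattern_score(H, R):
    """H: docs matched by pattern p; R: relevant docs (as sets)."""
    hr = len(H & R)       # doc frequency in relevant documents
    rate = hr / len(H)    # ... divided by overall doc frequency
    return rate * log(hr) if hr else 0.0
```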