
Automatic Discovery of Scenario-Level Patterns for Information Extraction

Roman Yangarber, Ralph Grishman, Pasi Tapanainen, Silja Huttunen. Outline: Information Extraction background; problems in IE; prior work on machine learning for IE; discovering patterns from raw text; experimental results.


Presentation Transcript


  1. Automatic Discovery of Scenario-Level Patterns for Information Extraction Roman Yangarber Ralph Grishman Pasi Tapanainen Silja Huttunen

  2. Outline • Information Extraction: background • Problems in IE • Prior Work: Machine Learning for IE • Discover patterns from raw text • Experimental results • Current work

  3. Quick Overview • What is Information Extraction? • Definition: • finding facts about a specified class of events from free text • filling a table in a database (slots in a template) • Events: instances in relations, with many arguments

  4. Example: Management Succession • George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA.


  6. System Architecture: Proteus • Input Text → Lexical Analysis → Name Recognition → Partial Syntax → Scenario Patterns → Reference Resolution → Discourse Analyzer → Output Generation → Extracted Information • Stages are grouped into sentence-level and discourse-level processing


  8. Problems • Customization • Performance

  9. Problems: Customization • To customize a system for a new extraction task, we have to develop • new patterns for new types of events • word classes for the domain • inference rules • This can be a large job requiring skilled labor • expense of customization limits uses of extraction

  10. Problems: Performance • Performance on event IE is limited • On MUC tasks, typical top performance is recall < 55%, precision < 75% • Errors propagate through multiple phases: • name recognition errors • syntax analysis errors • missing patterns • reference resolution errors • complex inference required

  11. Missing Patterns • As with many language phenomena • a few common patterns • a large number of rare patterns • Rare patterns do not surface sufficiently often in a limited corpus • Missing patterns make customization expensive and limit performance • Finding good patterns is necessary to improve customization and performance • [Figure: pattern frequency vs. rank]

  12. Prior Research • build patterns from examples • Yangarber ‘97 • generalize from multiple examples: annotated text • Crystal, Whisk (Soderland), Rapier (Califf) • active learning: reduce annotation • Soderland ‘99, Califf ‘99 • learning from corpus with relevance judgements • Riloff ‘96, ‘99 • co-learning/bootstrapping • Brin ‘98, Agichtein ‘00

  13. Our Goals • Minimize manual labor required to construct pattern bases for new domain • un-annotated text • un-classified text • un-supervised learning • Use very large corpora -- larger than we could ever tag manually -- to improve coverage of patterns

  14. Principle: Pattern Density • If we have relevance judgements for documents in a corpus, for the given task, then the patterns which are much more frequent in relevant documents will generally be good patterns • Riloff (1996) finds patterns related to terrorist attacks

  15. Principle: Duality • Duality between patterns and documents: • relevant documents are strong indicators of good patterns • good patterns are strong indicators of relevant documents

  16. Outline of Procedure • Initial query: a small set of seed patterns which partially characterize the topic of interest • Retrieve documents containing seed patterns: “relevant documents” • Rank patterns in relevant documents by frequency in relevant docs vs. overall frequency • Add top-ranked pattern to seed pattern set • Repeat

  17. #1: pick seed pattern Seed: < person retires >

  18. #2: retrieve relevant documents • Seed: < person retires > • The corpus splits into relevant documents and other documents • Relevant documents: • Fred retired. ... Harry was named president. • Maki retired. ... Yuki was named president.

  19. #3: pick new pattern • Seed: < person retires > • < person was named president > appears in several relevant documents (top-ranked by Riloff metric) • Fred retired. ... Harry was named president. • Maki retired. ... Yuki was named president.

  20. #4: add new pattern to pattern set Pattern set: < person retires > < person was named president >
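Steps #1 to #4 can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: each document is assumed to be already reduced to a set of pattern strings, relevance is treated as binary (the deck later mentions graded relevance), and the score is a Riloff-style ratio weighted by a base-2 logarithm. The names `docs`, `score`, and `discover`, and the toy corpus, are hypothetical.

```python
import math

# Toy corpus: each document is reduced to the set of patterns found in it.
# Document ids and pattern strings are made up for illustration.
docs = {
    "d1": {"person retires", "person was named president", "person says"},
    "d2": {"person retires", "person was named president"},
    "d3": {"company reports profit"},
    "d4": {"company reports profit", "person says"},
}


def score(pattern, docs, relevant):
    """Assumed Riloff-style score: (relevant hits / all hits) * log2(relevant hits)."""
    hits = {d for d, pats in docs.items() if pattern in pats}
    rel_hits = hits & relevant
    if not rel_hits:
        return 0.0
    return len(rel_hits) / len(hits) * math.log2(len(rel_hits))


def discover(docs, seeds, iterations):
    """Grow the pattern set by one top-ranked pattern per iteration."""
    accepted = list(seeds)
    for _ in range(iterations):
        current = set(accepted)
        # Retrieve documents that contain at least one accepted pattern.
        relevant = {d for d, pats in docs.items() if pats & current}
        # Candidates: patterns seen in relevant documents, not yet accepted.
        candidates = {p for d in relevant for p in docs[d]} - current
        if not candidates:
            break
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda p: score(p, docs, relevant))
        if score(best, docs, relevant) <= 0.0:
            break  # nothing left that is concentrated in relevant documents
        accepted.append(best)
    return accepted


print(discover(docs, ["person retires"], iterations=5))
# prints ['person retires', 'person was named president']
```

In this toy run, "person says" is rejected because it occurs as often outside the relevant documents as inside them, while "person was named president" occurs only in documents retrieved by the seed.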

  21. Pre-processing • For each document, find and classify names: • { person | location | organization | … } • Parse document • (regularize passive, relative clauses, etc.) • For each clause, collect a candidate pattern: a tuple of the heads of • [ subject, verb, direct object, object/subject complement, locative and temporal modifiers, … ]
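The candidate tuple described on this slide could be represented roughly as follows. This is a sketch under assumptions: the clause is taken to be already parsed and regularized into a dictionary of head words, modifiers are omitted for brevity, and recognized names are replaced by their class label. `Candidate`, `generalize`, `candidate_pattern`, and the dictionary keys are hypothetical names, not part of the Proteus system.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    """Heads of the main clause constituents (modifiers omitted for brevity)."""
    subject: Optional[str]
    verb: str
    obj: Optional[str]
    complement: Optional[str]


def generalize(head, name_classes):
    """Replace a recognized name by its class label; keep other heads as-is."""
    if head is None:
        return None
    return name_classes.get(head, head)


def candidate_pattern(clause, name_classes):
    """Build a candidate tuple from an already-parsed, regularized clause."""
    return Candidate(
        subject=generalize(clause.get("subject"), name_classes),
        verb=clause["verb"],
        obj=generalize(clause.get("object"), name_classes),
        complement=generalize(clause.get("complement"), name_classes),
    )


# "Harry was named president", with the passive regularized to active form.
clause = {"verb": "name", "object": "Harry", "complement": "president"}
print(candidate_pattern(clause, {"Harry": "person"}))
```

Regularizing the passive first means "Harry was named president" and "the board named Harry president" yield the same verb and object heads, so both count toward the same candidate.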

  22. Experiment • Task: Management succession (as MUC-6) • Source: Wall Street Journal • Training corpus: ~ 6,000 articles • Test corpus: • 100 documents: MUC-6 formal training • + 150 documents judged manually

  23. Experiment: two seed patterns • v-appoint = { appoint, elect, promote, name } • v-resign = { resign, depart, quit, step-down } • Run discovery procedure for 80 iterations

  24. Evaluation • Look at discovered patterns • new patterns, missed in manual training • Document filtering • Slot filling

  25. Discovered patterns

  26. Evaluation: new patterns • Not found in manual training

  27. Evaluation: Text Filtering • How effective are discovered patterns at selecting relevant documents? • IR-style • documents matching at least one pattern
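The filtering evaluation described here amounts to selecting every document that matches at least one pattern and comparing the selection with the relevance judgements. The sketch below assumes documents are already reduced to pattern sets; `filter_metrics` and the toy data are hypothetical.

```python
def filter_metrics(docs, patterns, relevant):
    """Precision and recall of 'matches at least one pattern' as a document filter."""
    selected = {d for d, pats in docs.items() if pats & patterns}
    true_pos = len(selected & relevant)
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data: document ids, pattern strings, and judgements are made up.
docs = {
    "d1": {"person retires", "person says"},
    "d2": {"person retires"},
    "d3": {"company reports profit"},
    "d4": {"company reports profit", "person says"},
}
judged_relevant = {"d1", "d2"}

precision, recall = filter_metrics(docs, {"person retires", "person says"}, judged_relevant)
print(precision, recall)
```

Here the overly general pattern "person says" pulls in d4, so recall stays at 1.0 while precision drops to 2/3.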

  28. Evaluation: Slot filling • How effective are patterns within a complete IE system? • MUC-style IE on MUC-6 corpora • Caveat


  31. Conclusion: Automatic discovery • Performance comparable to human (4-week development) • From un-annotated text: allows us to take advantage of very large corpora • redundancy • duality • Will likely help wider use of IE

  32. Good Patterns • U: universe of all documents • R: set of relevant documents • H = H(p): set of documents where pattern p matched • Density criterion:
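The formula for the density criterion is not preserved in this transcript. One statement consistent with the definitions of U, R, and H above, offered as an assumption rather than a quotation of the slide, is that a good pattern matches relevant documents at a rate well above the base rate of relevance in the corpus:

```latex
\frac{|H \cap R|}{|H|} \gg \frac{|R|}{|U|}
```

For example, if 2% of the corpus is relevant and 80% of the documents matched by p are relevant, p is far denser in R than chance would predict.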

  33. Graded Relevance • Documents matching seed patterns considered 100% relevant • Discovered patterns are considered less certain • Documents containing them are considered partially relevant

  34. Scoring Patterns • document frequency in relevant documents / overall document frequency • document frequency in relevant documents • (metrics similar to those used in Riloff-96)
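The two quantities on this slide can be combined in the style of Riloff's ranking metric. The exact form used in the talk is not preserved here, so the following is an assumption, written in terms of H = H(p) and R from the Good Patterns slide, with a base-2 logarithm chosen for the worked example:

```latex
\mathrm{Score}(p) = \frac{|H \cap R|}{|H|} \cdot \log_2 |H \cap R|
```

As a worked example, a pattern that matches 10 documents, 8 of them relevant, scores

```latex
\frac{8}{10} \cdot \log_2 8 = 0.8 \cdot 3 = 2.4
```

The ratio rewards patterns concentrated in relevant documents, and the logarithmic factor favors patterns that also occur often enough to be useful.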
