LING / C SC 439/539 Statistical Natural Language Processing

  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 22 • 4/8/2013

  2. Recommended reading • Jurafsky & Martin Chapter 22, Information Extraction • Ellen Riloff. 1996. Automatically generating extraction patterns from untagged text. Proceedings of AAAI. • Roman Yangarber et al. 2000. Automatic acquisition of domain knowledge for information extraction. Proceedings of COLING. • Roman Yangarber. 2003. Counter-training in discovery of semantic patterns. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL). • William Gale and Kenneth Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1). • Regina Barzilay and Kathleen McKeown. 2001. Extracting paraphrases from a parallel corpus. Proceedings of ACL/EACL.

  3. Outline • Review: information extraction and paraphrases • Semi-supervised learning of rules for information extraction • Sentence alignment • Semi-supervised learning of paraphrases

  4. Pipeline (sequence of steps) for pattern-based extraction: Lexical Analysis → Name Recognition → (Partial) Syntax → Rel’n/Event Patterns → Reference Resolution → Output Generation

  5. Application of pattern fills in a database entry
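
As a toy illustration of what filling a database entry looks like, the sketch below uses a hypothetical regular-expression pattern to populate two slots of a MUC-style template. The pattern and the slot choices are illustrative stand-ins, not part of any actual MUC system.

```python
import re

# Hypothetical pattern: "<perp> detonated a <instrument>"
pattern = re.compile(r"(?P<perp>[A-Z ]+?) DETONATED A (?P<instrument>[A-Z]+)")

text = ("UNIDENTIFIED INDIVIDUALS DETONATED A BOMB THAT DAMAGED THE "
        "WINDOWS OF THE NATIONAL VANGUARD OFFICES")

# A pattern match contributes slot values to the template record.
entry = {"INCIDENT: TYPE": "BOMBING"}
m = pattern.search(text)
if m:
    entry["PERP: INDIVIDUAL ID"] = m.group("perp").strip()
    entry["INCIDENT: INSTRUMENT ID"] = m.group("instrument")
print(entry)
# {'INCIDENT: TYPE': 'BOMBING',
#  'PERP: INDIVIDUAL ID': 'UNIDENTIFIED INDIVIDUALS',
#  'INCIDENT: INSTRUMENT ID': 'BOMB'}
```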

  6. Paraphrase relations • Similar linguistic patterns: “person was appointed as post of company” and “company named person to post” • Results from: • Different words • named, appointed, selected, chosen, promoted, … • Different syntactic constructions • IBM named Fred president • IBM announced the appointment of Fred as president • Fred, who was named president by IBM

  7. MUC templates and gold standard answers

  TST4-MUC4-0010
  SANTIAGO, 31 JUL 88 (EL MERCURIO) -- [TEXT] RANCAGUA -- THE NATIONAL VANGUARD OFFICES IN THIS CITY WERE ATTACKED ON 29 JULY AT 2220. UNIDENTIFIED INDIVIDUALS DETONATED A BOMB THAT DAMAGED THE WINDOWS OF THE NATIONAL VANGUARD OFFICES AND THOSE OF THE NEIGHBORING HOUSES.

  0.  MESSAGE: ID                   TST4-MUC4-0010
  1.  MESSAGE: TEMPLATE             1
  2.  INCIDENT: DATE                29 JUL 88
  3.  INCIDENT: LOCATION            CHILE: RANCAGUA (CITY)
  4.  INCIDENT: TYPE                BOMBING
  5.  INCIDENT: STAGE OF EXECUTION  ACCOMPLISHED
  6.  INCIDENT: INSTRUMENT ID       "BOMB"
  7.  INCIDENT: INSTRUMENT TYPE     BOMB: "BOMB"
  8.  PERP: INCIDENT CATEGORY       -
  9.  PERP: INDIVIDUAL ID           "UNIDENTIFIED INDIVIDUALS"
  10. PERP: ORGANIZATION ID         -
  11. PERP: ORGANIZATION CONFIDENCE -

  8. MUC templates and gold standard answers (continued)

  12. PHYS TGT: ID                  "NATIONAL VANGUARD OFFICES"
                                    "NEIGHBORING HOUSES" / "HOUSES"
  13. PHYS TGT: TYPE                ORGANIZATION OFFICE / COMMERCIAL / OTHER: "NATIONAL VANGUARD OFFICES"
                                    CIVILIAN RESIDENCE: "NEIGHBORING HOUSES" / "HOUSES"
  14. PHYS TGT: NUMBER              PLURAL: "NATIONAL VANGUARD OFFICES"
                                    PLURAL: "NEIGHBORING HOUSES" / "HOUSES"
  15. PHYS TGT: FOREIGN NATION      -
  16. PHYS TGT: EFFECT OF INCIDENT  SOME DAMAGE: "NATIONAL VANGUARD OFFICES"
                                    SOME DAMAGE: "NEIGHBORING HOUSES" / "HOUSES"
  17. PHYS TGT: TOTAL NUMBER        -

  9. Machine learning for information extraction • Want to learn IE patterns and paraphrases • Annotation of corpora is very expensive • MUC and other corpora have limited linguistic coverage • Patterns would need to be highly domain-specific • Medicine, terrorism, business, news, law, etc. • Therefore would need an annotated corpus for each domain • Solution: use semi-supervised learning

  10. Outline • Review: information extraction and paraphrases • Semi-supervised learning of rules for information extraction • Sentence alignment • Semi-supervised learning of paraphrases

  11. Semi-supervised learning of IE patterns • Intuition: if we collect documents D_R relevant to the scenario, patterns relevant to the scenario will occur more frequently in D_R than in the language as a whole

  12. Riloff 1996 • AutoSlog: an earlier supervised system that began with knowledge of target strings • Strings in the text are already annotated as: victim, perpetrator, target, instrument • Want to discover extraction patterns for these strings • Apply sentence analyzer • CIRCUS (Lehnert 1991) • Finds subject, verb, direct object, prepositional phrases • See what patterns occur with annotated strings
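
As a rough sketch of this step (not Riloff's actual implementation), the code below instantiates a heuristic pattern template from a shallow syntactic analysis of a clause, given the role of the annotated target string. The clause dictionary and its field names are hypothetical stand-ins for CIRCUS output, and only two of AutoSlog's thirteen templates are shown.

```python
def propose_pattern(clause, target):
    """Instantiate a heuristic extraction-pattern template for the
    annotated string `target`, given a shallow parse of its clause.
    `clause` is a hypothetical dict with 'subject', 'verb', 'dobj',
    and 'voice' fields."""
    if target == clause.get("subject") and clause.get("voice") == "passive":
        # Template "<subj> passive-verb", e.g. "the mayor was killed"
        return f"<subj> was {clause['verb']}"
    if target == clause.get("dobj") and clause.get("voice") == "active":
        # Template "active-verb <dobj>", e.g. "terrorists bombed the embassy"
        return f"{clause['verb']} <dobj>"
    return None

clause = {"subject": "the mayor", "verb": "killed", "voice": "passive"}
print(propose_pattern(clause, "the mayor"))   # <subj> was killed
```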

  13. Patterns found

  14. Semi-supervised pattern extraction: AutoSlog-TS • Begin with a corpus of relevant documents and irrelevant documents • Construct patterns for all noun phrases in all documents • Using the 13 pattern templates from the previous slide • Compare frequencies of patterns in relevant vs. irrelevant documents

  15. Finding relevant patterns • Relevance rate for pattern_i = P(relevant doc | doc contains pattern_i) • Also consider raw frequency of patterns • Some relevant patterns, such as “X was killed”, occur in both relevant and irrelevant texts • Rank patterns by: relevance rate × log2(frequency)
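
As a minimal sketch of this scoring, assuming patterns have already been matched against documents: the code below uses the number of documents containing a pattern as the frequency term, a simplification of the raw counts in Riloff (1996), and all names are illustrative.

```python
import math
from collections import Counter

def rank_patterns(doc_patterns, relevant_ids, min_freq=2):
    """Rank patterns by relevance_rate * log2(frequency), AutoSlog-TS style.
    doc_patterns: dict mapping doc id -> set of patterns found in that doc.
    relevant_ids: set of doc ids known (or assumed) to be relevant."""
    total, rel = Counter(), Counter()
    for doc_id, patterns in doc_patterns.items():
        for p in patterns:
            total[p] += 1
            if doc_id in relevant_ids:
                rel[p] += 1
    scores = {}
    for p, freq in total.items():
        if freq < min_freq:              # discard patterns occurring only once
            continue
        relevance_rate = rel[p] / freq   # P(relevant doc | doc contains p)
        scores[p] = relevance_rate * math.log2(freq)
    return sorted(scores.items(), key=lambda kv: -kv[1])

docs = {"d1": {"<X> was killed", "<X> said"},
        "d2": {"<X> was killed"},
        "d3": {"<X> said"}}
print(rank_patterns(docs, relevant_ids={"d1", "d2"}))
# [('<X> was killed', 1.0), ('<X> said', 0.5)]
```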

  16. Apply to terrorism data • MUC-4 training set: 1500 documents, about 50% relevant • AutoSlog-TS generated 32,345 unique extraction patterns • Discarded patterns only occurring once; 11,225 remaining patterns • Rank patterns • By the formula (on previous slide) • Manual filtering by human (!!!) • Started with 1,970 patterns, kept 210 • Evaluation: compared by hand against 100 documents in test set

  17. (before manual filtering)

  18. Problems with AutoSlog-TS • Assumes you already know which documents are relevant and irrelevant • Manual review of extracted patterns by a human (kept 210 out of 1970) • How many should we choose?

  19. Yangarber et al. 2000: ExDisco (“semi-unsupervised”) • Begin with seed extraction patterns written by hand • Use these seed patterns to identify relevant documents • Construct new patterns for all documents • Rank patterns by their correlation with document relevance • Add the highest-ranking pattern to the pattern set • Apply patterns to corpus, and repeat process
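
A minimal sketch of this bootstrapping loop, assuming the corpus has already been reduced to candidate patterns per document; the score used here (relevance rate weighted by log frequency) is a simplified stand-in for the paper's pattern-relevance correlation measure.

```python
import math

def exdisco(doc_patterns, seed_patterns, max_patterns=100):
    """Bootstrapping loop in the style of ExDisco (Yangarber et al. 2000).
    doc_patterns: dict mapping doc id -> set of candidate patterns."""
    accepted = set(seed_patterns)
    all_patterns = set().union(*doc_patterns.values())
    while len(accepted) < max_patterns:
        # 1. Documents containing any accepted pattern count as relevant.
        relevant = {d for d, pats in doc_patterns.items() if pats & accepted}
        # 2. Score each remaining candidate by its correlation with relevance.
        best, best_score = None, 0.0
        for p in all_patterns - accepted:
            docs_with_p = [d for d, pats in doc_patterns.items() if p in pats]
            rel_rate = sum(d in relevant for d in docs_with_p) / len(docs_with_p)
            score = rel_rate * math.log2(len(docs_with_p) + 1)
            if score > best_score:
                best, best_score = p, score
        if best is None:   # no candidate correlates with relevance; stop
            break
        # 3. Accept the single highest-ranking pattern, then repeat.
        accepted.add(best)
    return accepted
```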

  20. Experiments • Data: • MUC-6 (news reports) • Topic: negotiation of labor disputes and corporate management succession • Compared performance of: • Seed patterns only • Top 100 extracted by ExDisco • Patterns manually developed by computational linguists for 1 month on MUC data • (and others)

  21. Performance on test data

               Recall   Precision   F-measure
      Seed       27        74         39.58
      ExDisco    52        72         60.16
      Manual     47        70         56.40

  22. Yangarber 2003 • Problem: how do you know how many patterns to keep? • Earlier system: kept the top 100, an arbitrary number • Lower-ranked patterns tend not to be domain-specific

  23. Counter-training • Basic idea (see paper for details): • Identify patterns for multiple scenarios simultaneously • Begin with seed patterns for each scenario, grow the pattern sets incrementally • Automatic stopping: stop adding patterns to a scenario once they are more common in other scenarios
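
A minimal sketch of the stopping test, under the simplifying assumption that each scenario's learner assigns every candidate pattern a relevance score (as in the ExDisco sketch above); a scenario's learner stops growing once no remaining candidate passes.

```python
def accept_pattern(pattern, scenario, scores):
    """Counter-training acceptance test (simplified): keep a pattern for
    `scenario` only if it scores strictly higher there than in every
    competing scenario. scores[s][p] is pattern p's score in scenario s."""
    own = scores[scenario].get(pattern, 0.0)
    rivals = (scores[s].get(pattern, 0.0) for s in scores if s != scenario)
    return own > max(rivals, default=0.0)

scores = {"terrorism":  {"<X> was kidnapped": 0.9, "<X> said": 0.3},
          "management": {"<X> was kidnapped": 0.1, "<X> said": 0.4}}
print(accept_pattern("<X> was kidnapped", "terrorism", scores))  # True
print(accept_pattern("<X> said", "terrorism", scores))           # False
```

Generic patterns like “<X> said” score well in several scenarios at once, so the test rejects them automatically instead of relying on a hand-picked cutoff such as “top 100”.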

  24. Outline • Review: information extraction and paraphrases • Semi-supervised learning of rules for information extraction • Sentence alignment • Semi-supervised learning of paraphrases

  25. Sentence alignment • Used as a preprocessing step by the paraphrase induction algorithm • Though most often used in machine translation • Previous explanation of MT: • Begin with a sentence-aligned corpus • Then estimate word alignments • Develop a translation model from the alignments
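
A simplified, length-based sketch in the spirit of Gale & Church (1993): dynamic programming over sentence lengths. The real algorithm scores beads with a probabilistic length-ratio model and also allows 2-1, 1-2, and 2-2 beads; the absolute-difference cost and the skip penalty below are ad hoc stand-ins.

```python
def align_sentences(src_lens, tgt_lens, skip_cost=10.0):
    """Align two sentence sequences by their (e.g. character) lengths,
    allowing 1-1, 1-0, and 0-1 beads only. Returns the DP path of
    (i, j) states from the start of both texts to the end."""
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:   # 1-1 bead: pair sentences i and j
                c = cost[i][j] + abs(src_lens[i] - tgt_lens[j])
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n:             # 1-0 bead: source sentence left unmatched
                c = cost[i][j] + skip_cost
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j)
            if j < m:             # 0-1 bead: target sentence left unmatched
                c = cost[i][j] + skip_cost
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j)
    # Follow backpointers from (n, m) to recover the alignment path.
    path, ij = [], (n, m)
    while ij != (0, 0):
        path.append(ij)
        ij = back[ij[0]][ij[1]]
    return list(reversed(path))
```

For two translations of the same text, sentence lengths track each other closely, so this mostly recovers 1-1 pairings, which is exactly what the paraphrase induction step needs.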

  26. Review: minimum edit distance • Find the minimum-cost sequence of operations needed to transform one string into another, e.g. intention → execution • Fill in a table with total edit cost: • Cost of 1 for insertions/deletions; cost of 2 for substitutions • Follow backpointers to recover the sequence of edit operations (figure: http://conspectus-timgluz1conspectus.dotcloud.com/_images/lec1_levenshteinAlgo.png)
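
A runnable version of the table-filling algorithm with the costs stated above (1 for insertion/deletion, 2 for substitution):

```python
def min_edit_distance(source, target):
    """Minimum edit distance with insert/delete cost 1 and
    substitute cost 2 (Levenshtein with 2-cost substitution)."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of transforming source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                # i deletions
    for j in range(1, m + 1):
        dist[0][j] = j                # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            dist[i][j] = min(dist[i - 1][j] + 1,        # delete
                             dist[i][j - 1] + 1,        # insert
                             dist[i - 1][j - 1] + sub)  # substitute / match
    return dist[n][m]

print(min_edit_distance("intention", "execution"))  # 8
```

To recover the actual edit sequence rather than just the cost, store a backpointer alongside each cell recording which of the three moves produced its minimum, then trace from the bottom-right cell back to the origin.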

  27. (figure: worked edit-distance table: http://conspectus-timgluz1conspectus.dotcloud.com/_images/lec1_editDistance.png)

  28. Outline • Review: information extraction and paraphrases • Semi-supervised learning of rules for information extraction • Sentence alignment • Semi-supervised learning of paraphrases

  29. Paraphrases • Phrases that mean roughly the same thing • <PER> was killed • <PER> died • <PER> kicked the bucket • According to linguists (Halliday 1985; de Beaugrande and Dressler 1981), paraphrases retain “approximate conceptual equivalence”

  30. Acquisition of paraphrases • Lexical resources • Hand-built • E.g. WordNet: limited in scope, doesn’t include phrasal or syntactically-based paraphrases • Unsupervised acquisition using parallel corpora • Barzilay & McKeown 2001: multiple English translations of foreign novels • Shinyama et al. 2002: multiple news articles about the same subject

  31. Barzilay & McKeown 2001

  32. Data • Based on literary texts by foreign authors • 11 English translations in total, across 3 different books: • Madame Bovary (Flaubert) • Fairy Tales (Andersen) • Twenty Thousand Leagues Under the Sea (Verne)
