Natural Language Processing Information Extraction

Natural Language Processing: Information Extraction. Meeting 15, Oct 18, 2012. Rodney Nielsen. Most of these slides were adapted from James Martin.

Presentation Transcript


  1. Natural Language Processing: Information Extraction • Meeting 15, Oct 18, 2012 • Rodney Nielsen • Most of these slides were adapted from James Martin

  2. Statistical Sequence Labeling

  3. Typical Features • Given a small sliding window around target • Features extracted from the window • Current word token • Previous/next N word tokens • Current word POS • Previous/next N POS tags • Previous N chunk labels • Capitalization information • ...
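
The feature list on this slide can be sketched in code. This is a minimal illustration, not the course's implementation: the function name `window_features`, the feature keys, and the `<PAD>` marker are assumptions, and the POS tags in the example are hand-assigned.

```python
def window_features(tokens, pos_tags, prev_chunk_labels, i, n=2):
    """Build a feature dict for token i from a +/- n window.

    prev_chunk_labels holds the labels already assigned to tokens 0..i-1;
    labels to the right are not known yet when tagging left to right.
    """
    pad = "<PAD>"  # hypothetical marker for positions outside the sentence
    feats = {
        "word": tokens[i].lower(),
        "pos": pos_tags[i],
        "is_capitalized": tokens[i][:1].isupper(),
        "is_all_caps": tokens[i].isupper(),
    }
    for k in range(1, n + 1):
        left, right = i - k, i + k
        feats[f"word-{k}"] = tokens[left].lower() if left >= 0 else pad
        feats[f"pos-{k}"] = pos_tags[left] if left >= 0 else pad
        feats[f"chunk-{k}"] = prev_chunk_labels[left] if left >= 0 else pad
        feats[f"word+{k}"] = tokens[right].lower() if right < len(tokens) else pad
        feats[f"pos+{k}"] = pos_tags[right] if right < len(tokens) else pad
    return feats


tokens = ["United", "Airlines", "said", "Friday"]
pos_tags = ["NNP", "NNPS", "VBD", "NNP"]  # hand-assigned for illustration
feats = window_features(tokens, pos_tags, ["B-NP"], i=1, n=1)
```

Each token yields one such dict, which a sequence classifier can consume after one-hot encoding.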

  4. Today • Information Extraction • Entities • Relations • Bio- case study

  5. Information Extraction • Partial parsing and chunking for shallow semantics • Named Entity Recognition (NER) (people, instruments, locations, businesses, etc.) • Event detection • Relation extraction

  6. Information Extraction • Generally newswire text • Useful for many things • But the real interest/money is in specialized domains • Bioinformatics • Electronic medical records • Stock market analysis • Intelligence analysis • Social media

  7. Example App

  8. Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

  14. NER • Find and classify all the named entities • What’s a named entity? • A reference to an entity via the mention of its name • Colorado Rockies • This is a subset of the possible mentions... • Rockies, the team, it, they... • Find: identify the exact span of text • Classify: categorize the entity referenced

  15. NER Approaches • Two basic approaches, as with partial parsing and chunking (plus hybrids) • Rule-based (regular expressions) • Lists of names • Patterns to match things that look like names • Patterns to match the environments that names tend to occur in • ML-based • Get annotated training data • Extract features • Train systems to replicate the annotation
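
A minimal sketch of the rule-based approach, combining a name list with patterns for environments that names occur in. The gazetteer entries, the two regular expressions, and the labels `ORG`, `LOC`, and `PER` are toy assumptions chosen to fit the airline example, not rules from the lecture.

```python
import re

# Toy list of known names (a real gazetteer would be far larger)
GAZETTEER = {"United Airlines": "ORG", "American Airlines": "ORG", "Chicago": "LOC"}

# Toy environment patterns: group 1 is the name, the rest is its context
CONTEXT_PATTERNS = [
    (re.compile(r"\bspokesman ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"), "PER"),
    (re.compile(r"\ba unit of ([A-Z][A-Za-z]+)"), "ORG"),
]


def rule_based_ner(text):
    """Return sorted (start, end, label, mention) tuples found by the rules."""
    found = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), m.end(), label, m.group()))
    for pattern, label in CONTEXT_PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(1), m.end(1), label, m.group(1)))
    return sorted(found)


text = ("American Airlines, a unit of AMR, immediately matched the move, "
        "spokesman Tim Wagner said.")
mentions = rule_based_ner(text)
```

Here the list finds "American Airlines", while "AMR" and "Tim Wagner" are found only through their surrounding context.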

  16. ML Approach

  17. Encoding for Sequence Labeling • Same IOB encoding as for chunking • For N classes we have 2*N+1 tags • I and B for each class and O for outside all classes • Each token in a text gets a tag.
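
The IOB scheme can be sketched as follows. The helper names and the span format (token start, exclusive token end, class) are assumptions for illustration; the tokenization and entity spans are hand-assigned.

```python
def iob_tagset(classes):
    """All tags for N classes: B- and I- per class, plus O (2*N + 1 tags)."""
    tags = ["O"]
    for c in classes:
        tags += [f"B-{c}", f"I-{c}"]
    return tags


def spans_to_iob(n_tokens, spans):
    """Turn (start, end_exclusive, label) token spans into one tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", ",",
          "immediately", "matched", "the", "move", ",", "spokesman",
          "Tim", "Wagner", "said", "."]
spans = [(0, 2, "ORG"), (6, 7, "ORG"), (14, 16, "PER")]
tags = spans_to_iob(len(tokens), spans)
```

With the three classes PER, ORG, and LOC, `iob_tagset` yields 7 tags, matching the 2*N+1 count on the slide.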

  18. NER Features

  19. NER as Sequence Labeling

  20. NER Evaluation • Recall that it is not wise to evaluate chunkers at the tag level. Why? • The most frequent class, O, gives a very high baseline • Use P/R/F at the entity level. • If some entities are more important than others: • Weight them differently in the evaluation
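
A small sketch contrasting tag-level accuracy with entity-level P/R/F. It assumes exact-match scoring (span boundaries and class must both agree), which is one common convention; the function names and the example tags are made up for illustration.

```python
def iob_to_spans(tags):
    """Decode IOB tags into a set of (start, end_exclusive, label) spans."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final entity
        # An open entity ends at O, at any B-, or at an I- of another class
        if start is not None and (
                tag == "O" or tag.startswith("B-") or tag[2:] != label):
            spans.add((start, i, label))
            start, label = None, None
        if tag != "O" and start is None:
            start, label = i, tag[2:]
    return spans


def entity_prf(gold_tags, pred_tags):
    """Exact-match precision, recall, and F1 over whole entities."""
    gold, pred = iob_to_spans(gold_tags), iob_to_spans(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def tag_accuracy(gold_tags, pred_tags):
    return sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)


gold = ["B-ORG", "I-ORG", "O", "O", "B-PER", "I-PER", "O"]
pred = ["B-ORG", "O", "O", "O", "B-PER", "I-PER", "O"]
scores = entity_prf(gold, pred)
accuracy = tag_accuracy(gold, pred)
```

One wrong tag out of seven leaves tag accuracy at 6/7, but the truncated ORG span counts as both a miss and a false alarm, so entity-level precision, recall, and F1 all drop to 0.5.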

  21. Relations • Relations between entities are also useful

  22. Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York

  23. Relation Types • As with NEs, the list is application specific • For generic news texts:

  24. Relations • Relation = tuple • Somewhat like a database
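
The "relation = tuple" view can be sketched with a small in-memory table. The relation names `PART-OF` and `SPOKESMAN-FOR` are illustrative assumptions (the slide's own relation inventory is not in this transcript), and the tuples are one reading of the airline passage.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    rel_type: str
    arg1: str
    arg2: str


# Hand-built tuples for illustration, as rows in a database table
relations = {
    Relation("PART-OF", "American Airlines", "AMR"),
    Relation("PART-OF", "United", "UAL"),
    Relation("SPOKESMAN-FOR", "Tim Wagner", "American Airlines"),
}


def query(relations, rel_type=None, arg1=None, arg2=None):
    """Select tuples matching every field that is not None."""
    return sorted(
        r for r in relations
        if (rel_type is None or r.rel_type == rel_type)
        and (arg1 is None or r.arg1 == arg1)
        and (arg2 is None or r.arg2 == arg2)
    )


subsidiaries = query(relations, rel_type="PART-OF")
```

Extracted tuples can then be filtered by type or argument, much like rows selected from a database.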

  25. Relation Analysis • Two aspects • Relation detection • Relation identification / classification • Two reasons • Reduce training time for relation classification: fewer pairs • Use different feature-sets for each task

  26. Relation Analysis • Within sentence relation analysis
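
The detect-then-classify split can be sketched as a loop over entity pairs within one sentence. The two stages would normally be trained classifiers; `toy_detect` and `toy_classify` below are hypothetical rule-based stand-ins, and the entity format (text, type, token start, exclusive token end) is an assumption.

```python
from itertools import combinations


def find_relations(entities, detect, classify):
    """entities: mentions from ONE sentence, in textual order."""
    relations = []
    for e1, e2 in combinations(entities, 2):
        if detect(e1, e2):                      # stage 1: is there any relation?
            label = classify(e1, e2)            # stage 2: which relation is it?
            relations.append((label, e1[0], e2[0]))
    return relations


def toy_detect(e1, e2, max_gap=5):
    # Stand-in for a binary classifier: only nearby mentions are candidates
    return e2[2] - e1[3] <= max_gap


def toy_classify(e1, e2):
    # Stand-in for a multi-class classifier: decide from the entity types
    table = {("ORG", "ORG"): "PART-OF", ("ORG", "PER"): "HAS-SPOKESMAN"}
    return table.get((e1[1], e2[1]), "OTHER")


# (text, type, token start, token end) for the sentence on slide 22
entities = [("American Airlines", "ORG", 0, 2),
            ("AMR", "ORG", 6, 7),
            ("Tim Wagner", "PER", 14, 16)]
found = find_relations(entities, toy_detect, toy_classify)
```

Of the three candidate pairs, only one passes detection, so the classifier runs once. Note that the two stages look at different evidence (distance versus types), mirroring the slide's point about separate feature sets.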

  27. Features • Three categories of Features • Entity features • Local context features • Syntax features

  28. Features • Entity Features • Their types • Concatenation of the types • Headwords of the entities • George Washington Bridge • Words in the entities • Committee on Foreign Relations • Bridge to Terabithia
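
A sketch of the entity features for a candidate pair. Headword finding normally relies on a parse; the preposition heuristic in `headword` is an assumption that happens to work on the slide's three examples, and the feature keys are made up.

```python
PREPOSITIONS = {"of", "on", "to", "for", "in", "at"}  # small illustrative list


def headword(mention):
    """Heuristic head: the word before the first preposition, else the last word."""
    tokens = mention.split()
    for i, tok in enumerate(tokens):
        if i > 0 and tok.lower() in PREPOSITIONS:
            return tokens[i - 1]
    return tokens[-1]


def entity_features(mention1, type1, mention2, type2):
    feats = {
        "type1": type1,
        "type2": type2,
        "type_pair": f"{type1}-{type2}",       # concatenation of the types
        "head1": headword(mention1),
        "head2": headword(mention2),
    }
    for w in mention1.split():                  # words in the entities
        feats[f"m1_word={w.lower()}"] = True
    for w in mention2.split():
        feats[f"m2_word={w.lower()}"] = True
    return feats


feats = entity_features("American Airlines", "ORG", "AMR", "ORG")
```

The examples on the slide show why the head is not always the last word: "George Washington Bridge" is headed by "Bridge", but so is "Bridge to Terabithia", and "Committee on Foreign Relations" is headed by "Committee".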

  29. Features • Local Context Features • Bag of words to the left, right or between entities • +/- 1, 2, 3 • Roger lives in Denton, just outside of Dallas • Roger lives in denial, just outside of reality
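
A sketch of the local context features, assuming token-index spans and made-up feature keys. It treats the second argument as "Denton" or "denial" and the third as "Dallas" or "reality", which is one reading of the slide's example pair.

```python
def context_features(tokens, span1, span2, window=3):
    """Bag of words between two mentions plus positional words around them.

    span1 precedes span2; spans are (start, end_exclusive) token indices.
    """
    feats = {}
    for w in tokens[span1[1]:span2[0]]:
        feats[f"between={w.lower()}"] = True
    for k in range(1, window + 1):
        left = span1[0] - k               # k-th word left of the first mention
        right = span2[1] - 1 + k          # k-th word right of the second mention
        if left >= 0:
            feats[f"left-{k}={tokens[left].lower()}"] = True
        if right < len(tokens):
            feats[f"right+{k}={tokens[right].lower()}"] = True
    return feats


s1 = "Roger lives in Denton , just outside of Dallas".split()
s2 = "Roger lives in denial , just outside of reality".split()
f1 = context_features(s1, (3, 4), (8, 9))
f2 = context_features(s2, (3, 4), (8, 9))
```

Both sentences produce identical context features, so context alone cannot tell the geographic relation from the figurative one; the entity features have to do that work.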

  30. Features • Syntax Features • Constituent path connecting the entities • Base syntactic chunk sequence between entities • Dependency path
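
Two of the syntax features can be sketched given the output of a chunker and a dependency parser. The chunk tags and head indices below are hand-assigned for illustration rather than produced by a real parser, and the path notation (`<` for moving up to a head, `>` for moving down to a dependent) is an assumption. The constituent path is omitted because it needs a full parse tree.

```python
def chunk_sequence_between(chunk_tags, end1, start2):
    """Base chunk types that start between two mentions."""
    return [t[2:] for t in chunk_tags[end1:start2] if t.startswith("B-")]


def path_to_root(heads, i):
    """Token indices from i up to the root; heads[k] == -1 marks the root."""
    path = [i]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path


def dependency_path(tokens, heads, i, j):
    """Words on the tree path from token i to token j."""
    up, down = path_to_root(heads, i), path_to_root(heads, j)
    common = next(n for n in up if n in down)        # lowest common ancestor
    up_part = up[:up.index(common)]
    down_part = down[:down.index(common)][::-1]
    pieces = ([tokens[n] + " <" for n in up_part] + [tokens[common]]
              + ["> " + tokens[n] for n in down_part])
    return " ".join(pieces)


tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR"]
chunks = ["B-NP", "I-NP", "O", "B-NP", "I-NP", "B-PP", "B-NP"]
heads = [1, -1, 1, 4, 1, 4, 5]   # hand-assigned head index for each token
chunk_feature = chunk_sequence_between(chunks, 2, 6)
path_feature = dependency_path(tokens, heads, 1, 6)
```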

  31. Example • American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

  32. Case Study: Bioinformatic NLP • BioInformatics • Very important • Practitioners care about the technology • They have problems they’re trying to solve • Lots and lots of text available • Lots of interesting problems

  33. Lots of Text

  34. Problem Areas • Mainly variants of NER and relation analysis • NER • Detecting and classifying named entities • And also normalization • Mapping that named entity to a particular entity in some external database or ontology • President Obama → Barack Hussein Obama II • 10/18/12 → 18 October 2012 • Relation analysis • How various biological entities interact
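
The two normalization examples on this slide can be sketched as a lookup plus a date parse. The alias table is a toy stand-in for an external database or ontology, and the assumption that dates arrive in US month/day/two-digit-year order comes from the slide's example only.

```python
from datetime import datetime

# Toy alias table standing in for an external database or ontology
ALIASES = {
    "president obama": "Barack Hussein Obama II",
    "barack obama": "Barack Hussein Obama II",
}


def normalize_entity(mention):
    """Map a mention to its canonical entry, or None if it is unknown."""
    return ALIASES.get(mention.strip().lower())


def normalize_date(mention, fmt="%m/%d/%y"):
    """Rewrite a date mention as 'day Month year' (month name via strftime)."""
    d = datetime.strptime(mention, fmt)
    return f"{d.day} {d.strftime('%B')} {d.year}"


person = normalize_entity("President Obama")
date = normalize_date("10/18/12")
```

In the bio domain the same lookup step would map variant gene names to one identifier, which is much harder in practice because the variation in names is far wider than a small alias table can capture.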

  35. Bio NER • Large number of fairly specific types • Wide (really quite insane) variation in the naming of entities • Gene names • White, insulin, BRCA1, ether a go-go, breast cancer associated 1, etc.

  36. Bio NER Types

  37. Bio Relations • Combination of IE and SRL-style relation analysis

  38. Bioinformatic IE • Much work in NLP is concerned with portability and generality • How can we get systems trained on one genre/domain to work well on a different one • Biologists don’t seem to care much about this... • They’re happy if you build an effective specific system to solve their specific problem • This is true of most practical domains

  39. Project Presentation • Shibamouli Lahiri • Dialogue-based Software Development

  40. Student Questions • What's the difference between information extraction and information retrieval? • Should the use of regular expressions be considered one of the IE techniques? • Can you go over the concept of unsupervised and semi-supervised machine learning approaches?

  41. Student Questions • I was wondering about the high F-measures obtained by statistical slot-filling approaches. Is there reason to believe that at least some of these scores are inflated (or biased), or simply reflect overfitting to the training data? • In particular, the CMU seminar announcement dataset is not that large, and neither is the TIMEBANK dataset. • Another potential problem is with the individual documents. As the authors point out, the documents are somewhat small and homogeneous, so is there reason to believe that real-life statistical slot-fillers may not perform as well?

  42. Student Questions • Why is the standard evaluation process for a Named Entity model per entity and not per token? • In IOB encoding how does the model determine which entities end and which entities have a continuation?
