Natural Language Processing Information Extraction

Natural Language Processing: Information Extraction • Meeting 15, Oct 18, 2012 • Rodney Nielsen • Most of these slides were adapted from James Martin

Statistical Sequence Labeling

Typical Features • Given a small sliding window around target • Features extracted from the window • Current word token • Previous/next N word tokens • Current word POS • Previous/next N POS tags • Previous N chunk labels • Capitalization information • ...
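A minimal sketch of the windowed feature extraction listed above. The function name, feature names, and padding symbols are assumptions for illustration, not part of the original slides. Note that only previous chunk labels are used, since labels to the right have not been predicted yet when tagging left to right.

```python
def window_features(tokens, pos_tags, chunk_labels, i, n=2):
    """Features for token i from a +/- n window (feature names are hypothetical)."""
    feats = {
        "word": tokens[i].lower(),
        "pos": pos_tags[i],
        "is_capitalized": tokens[i][:1].isupper(),
    }
    for k in range(1, n + 1):
        prev_i, next_i = i - k, i + k
        # Previous/next word tokens and POS tags, padded at the sentence edges.
        feats[f"word-{k}"] = tokens[prev_i].lower() if prev_i >= 0 else "<s>"
        feats[f"word+{k}"] = tokens[next_i].lower() if next_i < len(tokens) else "</s>"
        feats[f"pos-{k}"] = pos_tags[prev_i] if prev_i >= 0 else "<s>"
        feats[f"pos+{k}"] = pos_tags[next_i] if next_i < len(tokens) else "</s>"
        # Only previous chunk labels: later ones are not yet known.
        feats[f"chunk-{k}"] = chunk_labels[prev_i] if prev_i >= 0 else "<s>"
    return feats


tokens = ["United", "Airlines", "said", "Friday"]
pos_tags = ["NNP", "NNPS", "VBD", "NNP"]
chunk_labels = ["B-NP", "I-NP", "B-VP", "B-NP"]
feats = window_features(tokens, pos_tags, chunk_labels, 1, n=1)
```

Each token's feature dictionary would then be handed to whatever sequence classifier is being trained.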

Today • Information Extraction • Entities • Relations • Bio- case study

Information Extraction • Partial parsing and chunking for shallow semantics • Named Entity Recognition (NER) (people, instruments, locations, businesses, etc.) • Event detection • Relation extraction

Information Extraction • Generally newswire text • Useful for many things • But the real interest/money is in specialized domains • Bioinformatics • Electronic medical records • Stock market analysis • Intelligence analysis • Social media

Example App

Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

NER • Find and classify all the named entities • What’s a named entity? • A reference to an entity via the mention of its name • Colorado Rockies • This is a subset of the possible mentions... • Rockies, the team, it, they... • Find: identify the exact span of text • Classify: categorize the entity referenced

NER Approaches • Two basic approaches, as with partial parsing and chunking (plus hybrids) • Rule-based (regular expressions) • Lists of names • Patterns to match things that look like names • Patterns to match environments that names tend to occur in • ML-based • Get annotated training data • Extract features • Train systems to replicate the annotation
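A rough sketch of the rule-based side, combining a list of names with a pattern for an environment names tend to occur in. The gazetteer contents, the ORG/PER labels, and the "spokesman" pattern are assumptions chosen to fit the example sentence; a real rule set would be far larger.

```python
import re

# Hypothetical list of names (gazetteer).
GAZETTEER = {"United Airlines", "American Airlines", "AMR", "UAL"}

# Environment pattern: a capitalized name following "spokesman" and variants.
SPOKESMAN = re.compile(
    r"\bspokes(?:man|woman|person)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
)


def rule_based_ner(text):
    """Return a set of (mention, type) pairs found by the two rule types."""
    found = set()
    for name in GAZETTEER:
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.add((name, "ORG"))
    for match in SPOKESMAN.finditer(text):
        found.add((match.group(1), "PER"))
    return found


text = ("American Airlines, a unit of AMR, immediately matched the move, "
        "spokesman Tim Wagner said.")
entities = rule_based_ner(text)
```

On this sentence the sketch finds American Airlines and AMR from the list and Tim Wagner from the context pattern.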

ML Approach

Encoding for Sequence Labeling • Same IOB encoding as for chunking • For N classes we have 2*N+1 tags • I and B for each class and O for outside all classes • Each token in a text gets a tag.
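To make the encoding concrete, here is a small sketch that builds the 2*N+1 tag set and converts entity spans into per-token IOB tags. The span format (start, exclusive end, label) and the function names are assumptions for illustration.

```python
def iob_tagset(classes):
    """One O tag plus a B- and an I- tag per class: 2*N+1 tags in total."""
    tags = ["O"]
    for c in classes:
        tags += [f"B-{c}", f"I-{c}"]
    return tags


def spans_to_iob(n_tokens, spans):
    """spans: list of (start, end_exclusive, label) over token indices."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", ",",
          "spokesman", "Tim", "Wagner", "said", "."]
spans = [(0, 2, "ORG"), (6, 7, "ORG"), (9, 11, "PER")]
tags = spans_to_iob(len(tokens), spans)
```

An entity ends wherever the next tag is not an I- tag of the same class, which is why a B- tag is needed to separate two adjacent entities of the same type.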

NER Features

NER as Sequence Labeling

NER Evaluation • Recall that it is not wise to evaluate chunkers at the tag level. Why? • The most frequent class, O, gives a very high baseline • Use P/R/F at the entity level. • If some entities are more important than others: • Weight them differently in the evaluation
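A sketch of entity-level scoring, assuming exact-match credit: a predicted entity counts only if both its span and its type match a gold entity. The decoding of ill-formed tag sequences (an I- tag with no preceding B-) is a lenient assumption made here, not something the slides specify.

```python
def iob_to_spans(tags):
    """Decode IOB tags into a set of (start, end_exclusive, label) entities."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final entity
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def entity_prf(gold_tags, pred_tags):
    """Precision, recall, and F1 over exactly matching entities."""
    gold, pred = iob_to_spans(gold_tags), iob_to_spans(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = ["B-ORG", "I-ORG", "O", "O", "B-PER", "I-PER", "O"]
pred = ["B-ORG", "O",     "O", "O", "B-PER", "I-PER", "O"]
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
precision, recall, f1 = entity_prf(gold, pred)
```

Here a single wrong tag leaves token accuracy at 6/7, yet one of the two entities is missed, so entity-level precision, recall, and F1 are all 0.5.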

Relations • Relations between entities are also useful

Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

Relation Types • As with NEs, the list is application specific • For generic news texts:

Relations • Relation = tuple • Somewhat like a database
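One way to picture the tuple view is as rows in a small table that can be queried. The relation type names below (PART-OF, SPOKESMAN-FOR) are hypothetical labels picked for the airline example, not a standard inventory.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    rel_type: str
    arg1: str
    arg2: str


# Relations one might extract from the airline story (labels are assumptions).
relations = [
    Relation("PART-OF", "American Airlines", "AMR"),
    Relation("PART-OF", "United", "UAL"),
    Relation("SPOKESMAN-FOR", "Tim Wagner", "American Airlines"),
]

# A database-style query: which parent does each unit belong to?
parents = {r.arg1: r.arg2 for r in relations if r.rel_type == "PART-OF"}
```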

Relation Analysis • Two aspects • Relation detection • Relation identification / classification • Two reasons • Reduce training time for relation classification: fewer pairs • Use different feature-sets for each task
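The two-stage idea can be sketched as a pipeline in which a detector first filters candidate entity pairs and a classifier then labels only the survivors. The `toy_detect` and `toy_classify` functions and the relation labels are placeholder assumptions standing in for trained models.

```python
from itertools import combinations


def candidate_pairs(entities):
    """All unordered entity pairs within one sentence."""
    return list(combinations(entities, 2))


def analyze_relations(entities, detect, classify):
    """Stage 1 filters pairs; stage 2 labels only the pairs that survive."""
    related = [pair for pair in candidate_pairs(entities) if detect(pair)]
    return [(classify(pair), pair) for pair in related]


entities = [("American Airlines", "ORG"), ("AMR", "ORG"), ("Tim Wagner", "PER")]


def toy_detect(pair):
    # Stand-in for a trained detector: pretend AMR and Tim Wagner are unrelated.
    return pair != (entities[1], entities[2])


def toy_classify(pair):
    # Stand-in for a trained classifier, keyed only on the entity types.
    types = {pair[0][1], pair[1][1]}
    return "PART-OF" if types == {"ORG"} else "AFFILIATED-WITH"


results = analyze_relations(entities, toy_detect, toy_classify)
```

Because the classifier only sees pairs the detector accepts, it trains and runs on fewer pairs, and each stage is free to use its own feature set.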

Relation Analysis • Within sentence relation analysis

Features • Three categories of Features • Entity features • Local context features • Syntax features

Features • Entity Features • Their types • Concatenation of the types • Headwords of the entities • George Washington Bridge • Words in the entities • Committee on Foreign Relations • Bridge to Terabithia

Features • Local Context Features • Bag of words to the left, right or between entities • +/- 1, 2, 3 • Roger lives in Denton, just outside of Dallas • Roger lives in denial, just outside of reality
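A sketch of entity and local context features for one entity pair. The headword rule (last token before the first preposition) is a crude heuristic assumed here for illustration, and the feature names and window size are likewise assumptions.

```python
PREPOSITIONS = {"of", "on", "to", "in", "for"}  # tiny illustrative list


def headword(entity_tokens):
    """Crude heuristic: the last token before the first preposition."""
    for i, tok in enumerate(entity_tokens):
        if tok.lower() in PREPOSITIONS and i > 0:
            return entity_tokens[i - 1]
    return entity_tokens[-1]


def relation_features(tokens, e1, e2, window=2):
    """e1, e2: (start, end_exclusive, type) token spans, with e1 before e2."""
    (s1, end1, type1), (s2, end2, type2) = e1, e2
    feats = {
        "type1": type1,
        "type2": type2,
        "types": type1 + "-" + type2,          # concatenation of the types
        "head1": headword(tokens[s1:end1]),
        "head2": headword(tokens[s2:end2]),
    }
    # Bags of words to the left of e1, between the entities, and right of e2.
    for w in tokens[max(0, s1 - window):s1]:
        feats["left=" + w.lower()] = 1
    for w in tokens[end1:s2]:
        feats["between=" + w.lower()] = 1
    for w in tokens[end2:end2 + window]:
        feats["right=" + w.lower()] = 1
    return feats


tokens = "Roger lives in Denton , just outside of Dallas".split()
feats = relation_features(tokens, (0, 1, "PER"), (3, 4, "LOC"))
```

The "denial" variant of the sentence would produce the same context words around "Roger", which suggests why the entity types are needed alongside the bag-of-words features.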

Features • Syntax Features • Constituent path connecting the entities • Base syntactic chunk sequence between entities • Dependency path

Example • American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

Case Study: Bioinformatic NLP • Bioinformatics • Very important • Practitioners care about the technology • They have problems they’re trying to solve • Lots and lots of text available • Lots of interesting problems

Lots of Text

Problem Areas • Mainly variants of NER and relation analysis • NER • Detecting and classifying named entities • And also normalization • Mapping that named entity to a particular entity in some external database or ontology • President Obama → Barack Hussein Obama II • 10/18/12 → 18 October 2012 • Relation analysis • How various biological entities interact
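Normalization can be sketched as a lookup from mentions to canonical forms. The alias table below is a hypothetical stand-in for an external database or ontology, and the date function assumes US-style month/day/two-digit-year input, as in the slide's example.

```python
from datetime import datetime

# Hypothetical alias table standing in for an external database or ontology.
ALIASES = {
    "President Obama": "Barack Hussein Obama II",
    "Obama": "Barack Hussein Obama II",
}

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def normalize_name(mention):
    """Map a mention to its canonical entry, or None if it is unknown."""
    return ALIASES.get(mention)


def normalize_date(mention):
    """Assumes month/day/two-digit-year order, e.g. 10/18/12."""
    d = datetime.strptime(mention, "%m/%d/%y")
    return f"{d.day} {MONTHS[d.month - 1]} {d.year}"


canonical_name = normalize_name("President Obama")
canonical_date = normalize_date("10/18/12")
```

In the biomedical setting the same idea would map a gene mention to an identifier in a gene database, where the wide variation in naming makes a simple dictionary lookup much less reliable.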

Bio NER • Large number of fairly specific types • Wide (really quite insane) variation in the naming of entities • Gene names • White, insulin, BRCA1, ether a go-go, breast cancer associated 1, etc.

Bio NER Types

Bio Relations • Combination of IE and SRL-style relation analysis

Bioinformatic IE • Much work in NLP is concerned with portability and generality • How can we get systems trained on one genre/domain to work well on a different one? • Biologists don’t seem to care much about this... • They’re happy if you build an effective specific system to solve their specific problem • This is true of most practical domains

Project Presentation • Shibamouli Lahiri • Dialogue-based Software Development

Student Questions • What's the difference between information extraction and information retrieval? • Should the use of regular expressions be considered one of the IE techniques? • Can you go over the concept of unsupervised and semi-supervised machine learning approaches?

Student Questions • I was wondering about the high F-measure obtained by statistical slot-filling approaches. Don't you think there is reason to believe that at least some of these measures are inflated (or biased) or are simply overfitting the training data? • In particular, the CMU seminar announcement dataset is not that large, and neither is the TIMEBANK dataset. • Another potential problem is with the individual documents. As the authors point out, the documents are somewhat small and homogeneous, so is there reason to believe that real-life statistical slot-fillers may not perform as well?

Student Questions • Why is the standard evaluation process for a Named Entity model per entity and not per token? • In IOB encoding how does the model determine which entities end and which entities have a continuation?
