I256 Applied Natural Language Processing Fall 2009

I256 Applied Natural Language ProcessingFall 2009 Lecture 13 Information Extraction (1) Barbara Rosario

Today • Another project proposal • Classification recap • Information Extraction (1) • Midterm evaluations

Classifying at Different Granularities • Text Categorization: • Classify an entire document • Information Extraction (IE): • Identify and classify small units within documents • Named Entity Extraction (NE): • A subset of IE • Identify and classify proper names • People, locations, organizations

NAME TITLE ORGANIZATION What is Information Extraction? As a task: Filling slots in a database from unstructured text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is [..]. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Adapted from slide by William Cohen

Information Extraction Extract structured data, such as tables, from unstructured text Task: October 14, 2002, 4:00 a.m. PT For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is [..]. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… IE NAME TITLE ORGANIZATION Bill GatesCEOMicrosoft Bill VeghteVPMicrosoft Richard StallmanfounderFree Soft.. Adapted from slide by William Cohen

Information Extraction As a familyof techniques: Information Extraction = segmentation + classification + association October 14, 2002, 4:00 a.m. PT For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation aka “named entity extraction/detection” Adapted from slide by William Cohen

Information Extraction A familyof techniques: Information Extraction = segmentation + classification + association October 14, 2002, 4:00 a.m. PT For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation Adapted from slide by William Cohen

Information Extraction A familyof techniques: Information Extraction = segmentation + classification+ association October 14, 2002, 4:00 a.m. PT For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation Adapted from slide by William Cohen

Text paragraphs without formatting Grammatical sentencesand some formatting & links Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR. Non-grammatical snippets,rich formatting & links Tables Landscape of IE Tasks:Degree of Formatting

Landscape of IE Tasks:Intended Breadth of Coverage Web site specific Genre specific Wide, non-specific Formatting Layout Language Amazon.com Book Pages Resumes University Names Adapted from slide by William Cohen

Regular set Closed set U.S. phone numbers U.S. states Phone: (413) 545-1323 He was born in Alabama… The CALD main office can be reached at 412-268-1299 The big Wyoming sky… Ambiguous patterns,needing context andmany sources of evidence Complex pattern U.S. postal addresses Person names University of Arkansas P.O. Box 140 Hope, AR 71802 …was among the six houses sold by Hope Feldman that year. Pawel Opalinski, SoftwareEngineer at WhizBang Labs. Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210 Landscape of IE Tasks:Complexity Adapted from slide by William Cohen

Applications • Information Extraction’s applications • Extract structured data out of electronically-available scientific literature, especially in the domain of biology and medicine • Legal documents • Business intelligence • Resume harvesting • Media analysis • Sentiment detection • Patent search • Email scanning

Information Extraction Architecture

Main tasks • Named Entity Recognition • Relation Extraction • Relations like subject are syntactic, relations like person, location, agent or message are semantic

Single entity Named entity recognition Binary relationship Relation Extraction N-ary record Person: Jack Welch Relation: Person-Title Person: Jack Welch Title: CEO Relation: Succession Company: General Electric Title: CEO Out: Jack Welsh In: Jeffrey Immelt Person: Jeffrey Immelt Relation: Company-Location Company: General Electric Location: Connecticut Location: Connecticut Landscape of IE Tasks Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt. Adapted from slide by William Cohen

Identification of NE Named entity recognition • The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities. • Named entities: definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates etc.. • Can be broken down into two sub-tasks: • identifying the boundaries of the NE (segmentation) • identifying its type (classification) Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Named entity recognition • The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities. • Named entities: definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates etc.. • Can be broken down into two sub-tasks: • identifying the boundaries of the NE (segmentation) • identifying its type (classification) Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt. Classification of the NE

Methods • Look up each word in an appropriate list of names. For example, for locations, we could use a gazetteer, or geographical dictionary, such as the Alexandria Gazetteer or the Getty Gazetteer. • error-prone; case distinctions may help, but these are not always present. Location Detection by Simple Lookup for a News Story:

Ambiguity • Many named entity terms are ambiguous. • May and North are likely to be parts of named entities for DATE and LOCATION, respectively, but could both be part of a PERSON • Christian Dior looks like a PERSON but is more likely to be of type ORGANIZATION. • A term like Yankee will be ordinary modifier in some contexts, but will be marked as an entity of type ORGANIZATION in the phrase Yankee infielders. • Further challenges: • Multi-word names like Stanford University • Names that contain other names such as Cecil H. Green Library and Escondido Village Conference Service Center. • In named entity recognition, therefore, we need to be able to identify the beginning and end of multi-token sequences chunking

Chunking • Chunking useful for entity recognition • Segment and label multi-token sequences • Each of these larger boxes is called a chunk

Chunking • The CoNLL 2000 corpus contains 270k words of Wall Street Journal text, annotated with part-of-speech tags and chunk tags. Three chunk types in CoNLL 2000: NP chunks VP chunks PP chunks

Chunking • More info in Section 7.2 NLTK book • We may cover this later during the course but for now, just remember that: • You probably need to do chunking if you do a named entity recognition task • And perhaps also parse-tree-based features:

Path Features From Dan Kein’s CS 288 slides (UC Berkeley)

NLTK NE classifier • NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk(). • If we set the parameter binary=True, then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE (geo-political entity).

NE methods • Next class

Relation Extraction • Once named entities have been identified in a text, we then want to extract the relations that exist between them • Typically relations between specified types of named entity. • Example: look for all triples of the form (X, α, Y), where X and Y are named entities of the required types, and α is the string of words that intervenes between X and Y.

Relation Extraction • Example: RE: hard-code patterns that contain strings that express the relation that we are looking for. • Example: look for all triples of the form (X, α, Y), where X and Y are named entities of the required types, and α is the string of words that intervenes between X and Y.

Relation Extraction • nltk.sem.extract_rels • Problem with hand-crafted patterns? • Machine-learning systems which typically attempt to learn ‘relation patterns’ automatically from a training corpus. • Next week we’ll see some machine learning methods to tackle this

Single entity Named entity recognition Binary relationship Relation Extraction Person: Jack Welch Relation: Person-Title Person: Jack Welch Title: CEO Person: Jeffrey Immelt Relation: Company-Location Company: General Electric Location: Connecticut Location: Connecticut Semantic Parsers Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt. N-ary record Semantic Parsers Relation: Succession Company: General Electric Title: CEO Out: Jack Welsh In: Jeffrey Immelt Adapted from slide by William Cohen

PropBank / FrameNet FrameNet: roles shared between verbs PropBank: each verb has it’s own roles PropBank more used, because it’s layered over the treebank (and so has greater coverage, plus parses) Note: some linguistic theories postulate even fewer roles than FrameNet (e.g. 5-20 total: agent, patient, instrument, etc.) From Dan Kein’s CS 288 slides (UC Berkeley)

PropBank Example From Dan Kein’s CS 288 slides (UC Berkeley)

Message Understanding Conference (MUC) • DARPA funded significant efforts in IE in the early to mid 1990’s. • Message Understanding Conference (MUC) was an annual event/competition where results were presented. • Focused on extracting information from news articles: • Terrorist events • Industrial joint ventures • Company management changes • Information extraction of particular interest to the intelligence community (CIA, NSA).

Message Understanding Conference (MUC) • Named entity • Person, Organization, Location • Co-reference • Clinton President Bill Clinton • Template element • Perpetrator, Target • Template relation • Incident • Multilingual

MUC Typical Text Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month

MUC Typical Text Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone SportsTaiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. TIE-UP-1 MUC template Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$200000000 Example of IE from FASTUS (1993)

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. ACTIVITY-1 MUC template Activity: PRODUCTION Company: “Bridgestone Sports Taiwan Co.” Product: “iron and ‘metal wood’ clubs” Start Date: DURING: January 1990 TIE-UP-1 MUC template Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$200000000 Example of IE: FASTUS(1993)

Three generations of IE systems • Hand-Built Systems – Knowledge Engineering [1980s– ] • Rules written by hand • Require experts who understand both the systems and the domain • Iterative guess-test-tweak-repeat cycle • Automatic, Trainable Rule-Extraction Systems [1990s– ] • Rules discovered automatically using predefined templates, using automated rule learners • Statistical Models [1997 – ] • Use machine learning to learn which features indicate boundaries and types of entities. • Learning usually supervised; may be partially unsupervised

Successors to MUC • CoNNL: Conference on Computational Natural Language Learning • Different topics each year • 2002, 2003: Language-independent NER • 2004: Semantic Role recognition • 2001: Identify clauses in text • 2000: Chunking boundaries • http://cnts.uia.ac.be/conll2003/ (also conll2004, conll2002…) • ACE: Automated Content Extraction • Entity Detection and Tracking • Sponsored byNIST • http://wave.ldc.upenn.edu/Projects/ACE/ • Several others recently • See http://cnts.uia.ac.be/conll2003/ner/

Word POS Chunk EntityType CoNNL-2003 • Goal: identify boundaries and types of named entities • People, Organizations, Locations, Misc. • Experiment with incorporating external resources (Gazeteers) and unlabeled data • Data: 4 pieces of info for each term

Summary of Results • 16 systems participated • Machine Learning Techniques • Combinations of Maximum Entropy Models (5) + Hidden Markov Models (4) + Winnow/Perceptron (4) • Others used once were Support Vector Machines, Conditional Random Fields, Transformation-Based learning, AdaBoost, and memory-based learning • Combining techniques often worked well • Features • Choice of features is at least as important as ML method • Top-scoring systems used many types • No one feature stands out as essential (other than words) Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003

Evaluating IE Accuracy • Always evaluate performance on independent, manually-annotated test data not used during system development. • Measure for each test document: • Total number of correct extractions in the solution template: N • Total number of slot/value pairs extracted by the system: E • Number of extracted slot/value pairs that are correct (i.e. in the solution template): C • Compute average value of metrics adapted from IR: • Recall = C/N • Precision = C/E • F-Measure = Harmonic mean of recall and precision

Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003

Use of External Information • Improvement from using Gazeteers vs. unlabeled data nearly equal • Gazeteers less useful for German than English (higher quality) • Note: standard methods to understand the impact of some features (for a given algorithm); remove them and see error reductions

Precision, Recall, and F-Scores * * * * * * Not significantly different Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003

Combining Results • What happens if we combine the results of all of the systems? • Used a majority-vote of 5 systems for each set • English: F = 90.30 (14% error reduction of best system) • German: F = 74.17 (6% error reduction of best system)

IE Tools • Research tools • Gate • http://gate.ac.uk/ • MinorThird • http://minorthird.sourceforge.net/ • Alembic (only NE tagging) • http://www.mitre.org/tech/alembic-workbench/ • Commercial • ?? I don’t know which ones work well

Resources • Not checked but from http://nlp.stanford.edu/links/statnlp.html Semantic Parsers • ASSERT • PropBank semantic roles (and opinions, etc.) by Sameer Pradhan. • Shalmaneser • FrameNet-based by Katrin Erk. • Tree Kernels in SVMlight by Alessandro Moschitti. • A general package, but it has particularly been used for SRL. Named Entity Recognition • Stanford Named Entity Recognizer • A Java Conditional Random Field sequence model with trained models for Named Entity Recognition. Java. GPL. By Jenny Finkel. • LingPipe • Tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co. • YamCha • SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)

I256 Applied Natural Language Processing Fall 2009