
Language Processing for Information Extraction


Presentation Transcript


  1. Language Processing for Information Extraction Ralph Weischedel 1 July 2005

  2. An ACE View of Information Extraction

  3. What is information extraction for? • (Semi-)Automatic population of a database from language sources • Social network analysis • Organizational structure & leadership • Expert systems/reasoning systems • Powering the semantic web • Improved translation, summarization, and detection algorithms • Already being used by the above technologies • Name tagging • Parsing • Expected future use by the above technologies • Entity normalization & disambiguation (against a database) • Essential elements of information (entities, relations, events)

  4. Extracting Names (“Named Entity Recognition”) American and Iraqi forces have captured a former member of Saddam Hussein's government and two of the dictator's relatives. The relatives, Abdullah Maher Abdul Rashid and his cousin Marwan Taher Abdul Rashid, were seized on March 8 in Tikrit, Mr. Hussein's hometown. Marwan Taher Abdul Rashid worked as a bodyguard for Mr. Hussein. Abdullah Maher Abdul Rashid is a brother-in-law of Mr. Hussein's son Qusay, who was killed in a battle with American soldiers in July 2003. The third man arrested, Omar Hassan Chiad, was an official in Mr. Hussein's government and was caught by American soldiers. (Highlighted types: Person; GPE = Geopolitical Entity.) At this level there is no connection among the strings; each name mention is unrelated to the others.

  5. Extracting Entities-1 (only showing type Person) American and Iraqi forces have captured a former member of Saddam Hussein's government and two of the dictator's relatives. The relatives, Abdullah Maher Abdul Rashid and his cousin Marwan Taher Abdul Rashid, were seized on March 8 in Tikrit, Mr. Hussein's hometown. Marwan Taher Abdul Rashid worked as a bodyguard for Mr. Hussein. Abdullah Maher Abdul Rashid is a brother-in-law of Mr. Hussein's son Qusay, who was killed in a battle with American soldiers in July 2003. The third man arrested, Omar Hassan Chiad, was an official in Mr. Hussein's government and was caught by American soldiers.

  6. Extracting Entities-2 (only showing type Person) American and Iraqi forces have captured a former member of Saddam Hussein's government and two of the dictator's relatives. The relatives, Abdullah Maher Abdul Rashid and his cousin Marwan Taher Abdul Rashid, were seized on March 8 in Tikrit, Mr. Hussein's hometown. Marwan Taher Abdul Rashid worked as a bodyguard for Mr. Hussein. Abdullah Maher Abdul Rashid is a brother-in-law of Mr. Hussein's son Qusay, who was killed in a battle with American soldiers in July 2003. The third man arrested, Omar Hassan Chiad, was an official in Mr. Hussein's government and was caught by American soldiers.

  7. Extracting Relations (only showing some examples) American and Iraqi forces have captured a former member of Saddam Hussein's government and two of the dictator's relatives. The relatives, Abdullah Maher Abdul Rashid and his cousin Marwan Taher Abdul Rashid, were seized on March 8 in Tikrit, Mr. Hussein's hometown. Marwan Taher Abdul Rashid worked as a bodyguard for Mr. Hussein. Abdullah Maher Abdul Rashid is a brother-in-law of Mr. Hussein's son Qusay, who was killed in a battle with American soldiers in July 2003. The third man arrested, Omar Hassan Chiad, was an official in Mr. Hussein's government and was caught by American soldiers. Entities, NOT strings

  8. Extracting Events (only showing some examples) American and Iraqi forces have captured a former member of Saddam Hussein's government and two of the dictator's relatives. The relatives, Abdullah Maher Abdul Rashid and his cousin Marwan Taher Abdul Rashid, were seized on March 8 in Tikrit, … Qusay, who was killed in a battle with American soldiers in July 2003. Entities, NOT strings
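
The “Entities, NOT strings” point above is easy to state as a data model. Below is a minimal sketch (names, types, and offsets are illustrative, not the actual ACE format): relations and events take entity objects as arguments, and each entity groups the mentions that name tagging alone would leave unconnected.

```python
from dataclasses import dataclass, field

@dataclass
class Mention:
    text: str              # surface string, e.g. "Mr. Hussein"
    start: int             # character offsets into the document
    end: int

@dataclass
class Entity:
    entity_id: str         # e.g. "E1"
    etype: str             # e.g. "Person", "GPE"
    mentions: list = field(default_factory=list)

@dataclass
class Relation:
    rtype: str             # e.g. "employment", "family"
    arg1: Entity           # arguments are entities, NOT strings
    arg2: Entity

@dataclass
class Event:
    event_type: str                                  # e.g. "arrest", "die"
    arguments: dict = field(default_factory=dict)    # role -> Entity

# "Marwan Taher Abdul Rashid worked as a bodyguard for Mr. Hussein."
# (offsets illustrative)
hussein = Entity("E1", "Person", [Mention("Saddam Hussein", 57, 71),
                                  Mention("Mr. Hussein", 408, 419)])
marwan = Entity("E2", "Person", [Mention("Marwan Taher Abdul Rashid", 254, 279)])
employment = Relation("employment", arg1=marwan, arg2=hussein)
```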

  9. BBN FactBrowser Example [Screenshot: a query, the retrieved documents, and the document text, alongside the entities mentioned, entity descriptions, relations involving an entity, and the evidence for each relation.]

  10. Levels of Linguistic Analysis [Figure: sentences about Slobodan Milosevic and Milos Milosavljevic analyzed at three levels: name finding (Person, ORG, GPE tags), parsing (S, NP, VP, PP, SBAR constituents), and co-reference. The resulting records: Person: Slobodan Milosevic, Position: president, Organization: Yugoslavia; Person: Milos Milosavljevic, Position: president, Organization: Association of Yugoslav Banks; Person: Milos Milosavljevic, Position: general director, Organization: JugoBanka.]

  11. A Typical Approach • Handcrafted rules (patterns) typically used in COTS products • Trained methods now commonly used in research • For names and non-nested entity mentions, tagging models can be used • For example: Person-Name-Start, Organization-Name-Continue, Not-A-Name • Coreference (assembling mentions into entities) requires combining non-local information • Greedy strategies are common, processing each mention against all entities seen so far (sketched below); graph-partitioning algorithms are also being explored • Relations and events are typically predicted using classifier models based on contextual features • For “the Bush ranch in Texas”, features would include “ranch” as a Facility word, “Texas” as a GPE, and “in” as the linking word [Diagram “Information Extraction”: Mention Detection → Mentions → Co-Reference → Entities, Events → Relation Extraction → Relations]
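
To make the greedy coreference strategy concrete, here is a minimal sketch; the compatibility test is a toy last-word heuristic standing in for a trained model.

```python
# Greedy coreference: process mentions left to right, comparing each one
# against all entities built so far; attach it to the first compatible
# entity, otherwise start a new entity.

def compatible(mention, entity_mentions):
    # Toy heuristic standing in for a trained model: link mentions that
    # share their final word ("Saddam Hussein" ~ "Mr. Hussein").
    return any(mention.split()[-1] == m.split()[-1] for m in entity_mentions)

def greedy_coreference(mentions):
    entities = []                        # each entity is a list of mentions
    for mention in mentions:
        for entity in entities:          # compare against all entities so far
            if compatible(mention, entity):
                entity.append(mention)
                break
        else:
            entities.append([mention])   # no match: start a new entity
    return entities

print(greedy_coreference(
    ["Saddam Hussein", "Abdullah Maher Abdul Rashid", "Mr. Hussein"]))
# [['Saddam Hussein', 'Mr. Hussein'], ['Abdullah Maher Abdul Rashid']]
```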

  12. Evaluation in ACE • Names were scored (in MUC) as a detection task • Hits, misses, false alarms • Scoring entities requires finding the optimal mapping between the entities (sets of mentions) output by the system and those in the answer key • Systems receive partial credit for mapped entities, based on the value of the correctly predicted mentions • Relation and event scoring in turn depend on the entity mapping • The ACE value measure runs much lower than the MUC F-measure
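
The MUC-style detection scoring in the first bullet can be written down directly (a sketch over (string, type) pairs; the ACE entity mapping and value computation in the later bullets are considerably more involved):

```python
# Detection scoring: hits, misses, and false alarms over (name, type) pairs.

def score_detection(system: set, reference: set):
    hits = system & reference
    misses = reference - system          # in the key but not in the output
    false_alarms = system - reference    # in the output but not in the key
    precision = len(hits) / len(system) if system else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

ref = {("Saddam Hussein", "Person"), ("Tikrit", "GPE"), ("Qusay", "Person")}
sys_out = {("Saddam Hussein", "Person"), ("Tikrit", "Person")}  # wrong type
print(score_detection(sys_out, ref))     # (0.5, 0.333..., 0.4)
```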

  13. Increasing Target Complexity [Chart: human performance vs. system performance as target complexity increases.] • More complex tasks depend increasingly on meaning and context • Knowledge representation issues matter more • More inference is required • Harder for humans to define consistently • Harder for systems to recognize

  14. Reliable vs. Challenge Cases • High accuracy for machines only when information is local • Examples of local information • UN Secretary General Kofi Annan • Lincoln was assassinated in 1865. • Examples of non-local information • A band of unknown gunmen opened fire on a St. Petersburg businessman while he was returning home with his family, killing the man and his son and wounding his wife and teenage daughter. • Whose wife? • Whose teenage daughter? • Who was the target? Who got in the way of the intended victim(s)? • High accuracy for machines (and humans) • for explicitly stated information • not for implicit information

  15. Gauging Human Performance • Scoring meaning elements is much harder than scoring string detection • Different DB schemes may call for encoding the same fact as an entity attribute or as a relation • Precisely defining the target classes is difficult • “Country” seems like an easy concept, but what about “Palestine”, “Europe”, or “Kurdistan”? • Is “X shot and killed Y” one event, or two? • Human inter-annotator agreement rates, as reported recently by the LDC (Linguistic Data Consortium) • Entities: 88% • Relations: 66%

  16. Current Performance • Best system scores from the ACE-2004 evaluation

  17. Combining Speech Recognition and Extraction • Scores of the system that performed best on ASR input (~8% WER) in the ACE-2004 evaluation, compared with that system’s scores on clean transcripts • The loss in extraction value is more than proportional to the WER, since • systems have to get multiple items correct • there is no punctuation • Tighter integration than passing a single recognition hypothesis should improve scores on ASR significantly

  18. Observations • Trend toward learning algorithms that • Account for structure • Not just word sequences • Many challenges • Accounting for far more of document content • Not just a pre-defined inventory of entity/relation/event types • Speech/OCR output • ACE evaluation scheduled for fall, 2005

  19. BBN’s Statistical Learning Approach • Name finding – IdentiFinder™ • Relation extraction – SIFT • Question answering • Semi-supervised learning • OntoBank

  20. Named Entity (NE) Extraction Task: identify every name mention of locations, persons, and organizations. [Diagram: training sentences plus answers feed a training program that produces NE models; an extractor applies the models to new text, marking the names in “The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic.”] • Up to 1996 – no learning approach competitive with hand-built rules • Since 1997 – statistical approaches (BBN, NYU, MITRE) achieve state-of-the-art performance • Since 2003 – ability to bring up a new language in <1 month via the statistical approach • Since 2004 – substantial reduction in training data requirements

  21. A Hidden Markov Model Structure of the model: • Bigram transition probabilities between categories • One language model for each category, plus one for “other” (not-a-name) • The categories are learned from training data
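
A minimal decoding sketch for this model structure (an IdentiFinder-style HMM, heavily simplified: the real model conditions each word on the previous word within a category, whereas the placeholder probabilities here are unigram and purely illustrative):

```python
import math

STATES = ["PERSON", "LOCATION", "OTHER"]

# Bigram transition probabilities p(next category | current category).
TRANS = {s: {t: 1 / 3 for t in STATES} for s in STATES}

# One "language model" per category (illustrative placeholder numbers).
EMIT = {
    "PERSON":   {"michael": 0.5, "rose": 0.3, "the": 0.0001},
    "LOCATION": {"bosnia": 0.6, "pale": 0.3, "the": 0.0001},
    "OTHER":    {"the": 0.2, "in": 0.2, "troops": 0.1},
}

def logp(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(words):
    """Most likely category sequence under the HMM, by dynamic programming."""
    # best[s] = (log-prob of the best path ending in state s, that path)
    best = {s: (logp(1 / len(STATES)) + logp(EMIT[s].get(words[0], 1e-6)), [s])
            for s in STATES}
    for w in words[1:]:
        best = {t: max(((lp + logp(TRANS[s][t]) + logp(EMIT[t].get(w, 1e-6)),
                         path + [t])
                        for s, (lp, path) in best.items()),
                       key=lambda x: x[0])
                for t in STATES}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "troops", "in", "bosnia"]))
# -> ['OTHER', 'OTHER', 'OTHER', 'LOCATION']
```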

  22. IdentiFinder Status • Trained in multiple languages • English, Chinese, Arabic, Hindi, ... • Delivered to many other research sites that are collaborating with BBN • Deployed in BBN products • AudioIndexer™

  23. Extraction via Lexicalized Probabilistic CFGs • Prior to 1990 – accuracy for non-statistical parsers around 65% • By 1995 – parsing de-emphasized in information extraction • Since 1995 – statistical parsers (IBM, UPenn, Brown, and BBN) achieve ~90% accuracy; parsing re-introduced into information extraction • 1997 – first learning approach to relation extraction with state-of-the-art performance [Diagram: training sentences plus answers feed a training program that produces models; speech passes through speech recognition to text, and the parser produces trees, e.g. for “Nawaz Sharif, who led Pakistan, was ousted October 12 by Pervez Musharraf, Pakistani Army General.”]

  24. The Sentential Model • Search criterion: find the augmented tree M such that p(M | W) is maximized • Since p(W) is constant, this is equivalent to searching for the M that maximizes the joint probability p(M, W) • Model this probability as the product of the probabilities of generating each element in the tree
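
Written out (a standard derivation consistent with the slide, where M is the augmented tree, W the word sequence, and e ranges over the elements of the tree):

```latex
M^{*} = \arg\max_{M} p(M \mid W)
      = \arg\max_{M} \frac{p(M, W)}{p(W)}
      = \arg\max_{M} p(M, W),
\qquad
p(M, W) \approx \prod_{e \in M} p\bigl(e \mid \mathrm{history}(e)\bigr).
```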

  25. Example of Generating a Parse Tree [Figure: step-by-step top-down generation of the parse tree for “Nawaz Sharif, who led Pakistan, was ousted October 12 by Pervez Musharraf, Pakistani Army General.”, headed by “was”/“ousted”, with NP, SBAR, WHNP, VP, and PP constituents generated in turn.]

  26. Shallow Semantic Annotation [Figure: “Nance, who is also a paid consultant to ABC News, said ...”, annotated with a person, a coreferent person-descriptor, an organization, and an employee relation.] • Training data consists ONLY of • Named entities (as in MUC NE) • Descriptor phrases (for MUC TE) • Descriptor references (for MUC TE) • Relations/events to be extracted (for MUC TR)

  27. Integrating Syntactic Knowledge • The key idea is to exploit the Penn Treebank • Train the sentence-level model on syntactic trees from the Treebank • For each sentence in the semantically annotated corpus • Parse the sentence, constraining the search to parses that are consistent with the semantics • Augment the syntactic parse with the semantic structure • The result is a corpus annotated both semantically and syntactically (a toy sketch follows below)
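
Here is the toy sketch: the candidate parses, the consistency test, and the semantic spans are all stand-ins for the Treebank-trained parser and the real semantic annotation, but the filter-then-overlay logic is the idea the slide describes.

```python
def parse_nbest(sentence):
    # Stand-in for the parser's ranked candidates, as word-index spans.
    # The first parse crosses the description span (2, 5); the second
    # contains it cleanly, so only the second survives the constraint.
    return [
        [(0, 5), (0, 1), (1, 4), (4, 5)],
        [(0, 5), (0, 1), (1, 2), (2, 5)],
    ]

def consistent(parse, semantic_spans):
    # Keep a parse only if no constituent crosses a semantic span.
    def crosses(a, b):
        return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]
    return not any(crosses(c, s) for c in parse for s in semantic_spans)

def augment(parse, labels):
    # Overlay semantic labels onto the matching syntactic constituents.
    return [(span, labels.get(span)) for span in parse]

sentence = "Nance is a paid consultant"
labels = {(0, 1): "person", (2, 5): "per-desc"}   # hypothetical spans
for parse in parse_nbest(sentence):
    if consistent(parse, labels):
        print(augment(parse, labels))   # semantically + syntactically annotated
        break
```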

  28. Augmented Semantic Tree s Semantic label Syntax label per/np vp per-desc-of/sbar-lnk per-desc-ptr/sbar per-desc-ptr/vp per-desc-r/np emp-of/pp-lnk org-ptr/pp per-r/np whnp advp per-desc/np org-r/np per/nnp , wp vbz rb det vbn per-desc/nn to org-c/nnp org/nnp , vbd Nance , who is also a paid consultant to ABC News , said ...

  29. Top-Down Generation Example • A head constituent for the S, in this case a VP. • Pre-modifier constituents for the S. In this case, there is only one: an NP (“Nawaz Sharif, who led Pakistan”) • A head part-of-speech tag for the NP, in this case NNP (the POS tag for “Sharif”). • A head word for the NP, in this case “Sharif”. • Word features for the head word of the NP, in this case capitalized. • A head constituent for the NP, in this case an NP (“Nawaz Sharif”) • Pre-modifier constituents for the NP. In this case, there are none. • Post-modifier constituents for the NP. First a comma, then an SBAR structure, and then a second comma are each generated in turn.

  30. A Use for Extraction Technology: Question Answering

  31. An Application in Question Answering • Factoid questions • When was Mozart born? • Amassing extended answers from multiple documents (and multiple languages) • Who was Mozart? • What is Aum Shinrikyo? • What is sarin?

  32. Application Overview [Screenshot: the user enters a question; a question profile is used to rank features; the user selects facts to insert into the report.]

  33. Answering Definitional Questions • Select phrases by feature • Linguistic features • Appositives • Copula constructions • Surface structure patterns (handcrafted or learned) • Propositions • Semantic features from information extraction • Co-reference within document • Relations • Rank phrases, called kernel facts, via information retrieval • Uses context provided by the user in the document the user is writing • Remove redundancy • Cut off at target length of answer
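
A toy sketch of the last three steps (ranking, redundancy removal, and the length cutoff); the “question profile” here is just a bag of context words, standing in for the information-retrieval ranking:

```python
def rank_kernel_facts(facts, profile_words, target_length=200):
    # Rank candidate phrases by word overlap with the question profile.
    scored = sorted(facts,
                    key=lambda f: len(set(f.lower().split()) & profile_words),
                    reverse=True)
    answer, seen_words, length = [], set(), 0
    for fact in scored:
        words = set(fact.lower().split())
        # Remove redundancy: skip facts mostly covered by earlier ones.
        if len(words - seen_words) < len(words) / 2:
            continue
        if length + len(fact) > target_length:    # cut off at target length
            break
        answer.append(fact)
        seen_words |= words
        length += len(fact)
    return answer

profile = {"sarin", "nerve", "gas", "attack", "tokyo"}
facts = ["sarin is a nerve gas",
         "sarin is a nerve agent developed in the 1930s",
         "the Tokyo subway attack used sarin"]
print(rank_kernel_facts(facts, profile))
# ['sarin is a nerve gas', 'the Tokyo subway attack used sarin']
```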

  34. Approach • Trained, language-independent algorithms for core NLP problems, e.g., • passage retrieval, • name tagging, • parsing and • co-reference

  35. Answering Definitional Questions [Architecture diagram: question classification (supported by a question treebank) and document retrieval feed linguistic processing & extraction of kernel facts (name tagging and annotation, parsing, proposition finding, co-reference, relation extraction, and surface-structure matching against hand-crafted and learned patterns); kernel fact ranking combines the question profile with a background model; redundancy removal yields the list of responses. The legend distinguishes new components from linguistically motivated components of SERIF.]

  36. Example Chinese Facts • Copula (linking verb): • 盖茨 是 微软 首席执行官 • Bill Gates IS the CEO of Microsoft • Descriptions from appositives: • 英国 首相 布莱尔 • British Prime Minister Blair • 美国 总统 布什 • U.S. President Bush • Proposition: • 布莱尔 出任 党魁 • Blair became the head of the party. • 切尼 当选 美利坚合众国 副总统 • Cheney was elected vice president of the U.S.A. • Relation: • The relation between Annan and the UN is ROLE/MANAGEMENT

  37. New Ideas • Semi-supervised learning • Automatic discovery of language structure • OntoBank • A new level of semantic modeling • Annotate all explicitly stated predicate-argument relations • Every major predicate/term (verb, noun, adjective, name) disambiguated & mapped to an ontology • Terms that a reasoning system or other analytic tool can use, NOT uninterpreted strings of text • Co-references resolved (pronouns, noun references, and names linked together)

  38. Breaking the Barrier of Supervised Training Data for Extraction • Basic idea • Document sources themselves provide vast amounts of data about language • Automatically learn language properties from the sources • Example: word classes • Initially, every word is in its own class • Build classes by merging the pair of classes with the highest mutual information (sketched below) • Potential • Reduce the required supervised training • Magnify the effect of supervised training • An automatic way to smooth discriminative models from word-based features to back-off classes • Automatic adaptation • Case study: name extraction
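
A brute-force sketch of the merging loop in the example above, in the spirit of Brown et al.: at each step, merge the pair of classes whose merge best preserves the average mutual information between adjacent classes (real implementations are vastly more efficient).

```python
import math
from collections import Counter
from itertools import combinations

def avg_mutual_information(bigrams, cluster_of):
    # Average MI between the classes of adjacent words in the corpus.
    n = len(bigrams)
    pair = Counter((cluster_of[a], cluster_of[b]) for a, b in bigrams)
    left, right = Counter(), Counter()
    for (c1, c2), k in pair.items():
        left[c1] += k
        right[c2] += k
    return sum((k / n) * math.log((k / n) / ((left[c1] / n) * (right[c2] / n)))
               for (c1, c2), k in pair.items())

def greedy_merge(text, target_classes):
    words = text.split()
    bigrams = list(zip(words, words[1:]))
    clusters = {w: w for w in set(words)}  # every word starts in its own class
    while len(set(clusters.values())) > target_classes:
        # Try every pair of classes; keep the merge with the highest
        # resulting average mutual information (i.e., the smallest loss).
        best = max(combinations(sorted(set(clusters.values())), 2),
                   key=lambda p: avg_mutual_information(
                       bigrams,
                       {w: (p[0] if c in p else c)
                        for w, c in clusters.items()}))
        clusters = {w: (best[0] if c in best else c)
                    for w, c in clusters.items()}
    return clusters

print(greedy_merge("the cat sat on the mat the dog sat on the rug", 4))
```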

  39. New Training Method for Extraction • Combine • Discriminative learning algorithms • Perceptron • Small supervised training with massive amounts of unsupervised training • Clustering of 100M words of WSJ • Active learning • To maximize the impact of early training data • Effect • 25% reduction in error compared to a well-trained HMM • 1/8th the supervised training (~3 hours) to achieve an F of 90 [Chart: name extraction performance vs. amount of annotation; annotation rate: 5,000 words/hour.]
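
A minimal sketch of why the combination helps (toy clusters and data, and a plain rather than averaged perceptron): the cluster feature lets a mistake-driven update on one name generalize to unseen names in the same cluster.

```python
from collections import defaultdict

# Hypothetical output of unsupervised word clustering.
CLUSTER = {"hussein": "C7", "karadzic": "C7", "rose": "C7",
           "troops": "C2", "the": "C1", "mr.": "C5"}

def features(word):
    return [("word", word), ("cluster", CLUSTER.get(word, "C?"))]

def train(examples, epochs=10):
    w = defaultdict(float)
    for _ in range(epochs):
        for word, is_name in examples:
            score = sum(w[f] for f in features(word))
            pred = score > 0
            if pred != is_name:                    # mistake-driven update
                for f in features(word):
                    w[f] += 1 if is_name else -1
    return w

weights = train([("hussein", True), ("the", False), ("troops", False)])
# "karadzic" was never seen in training, but shares cluster C7 with
# "hussein", so the cluster feature pushes it toward "name".
print(sum(weights[f] for f in features("karadzic")) > 0)   # True
```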

  40. Apply Clustering • Hypothesis: • Words occurring in similar contexts refer to similar things • Criterion: • Cluster words with similar distributions over their following words • Minimize the average future entropy of a bigram class model (similar to Brown et al., 1990) [Table: example clusters from 100M words of WSJ.]
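
In the class-based bigram model of Brown et al., each word is generated from its class, so the clustering C is chosen to make class bigrams as predictive as possible; maximizing the average mutual information between adjacent classes is equivalent to minimizing the entropy criterion above:

```latex
p(w_i \mid w_{i-1}) = p\bigl(C(w_i) \mid C(w_{i-1})\bigr)\, p\bigl(w_i \mid C(w_i)\bigr),
\qquad
C^{*} = \arg\max_{C} \sum_{c_1, c_2} p(c_1, c_2) \log \frac{p(c_1, c_2)}{p(c_1)\, p(c_2)}.
```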

  41. Impact of Discriminative Training Perceptron alone worse than HMM in this test.

  42. Impact of Unsupervised Clustering Using clusters with perceptron better than HMM in this test.

  43. Impact of Active Learning Combination of discriminative models + clusters + active learning best in this test.

  44. Arabic Results • 25% reduction in error • 5-fold reduction in training to achieve the maximum performance of the HMM

  45. Annotation Tool • Tool under development incorporating active learning with discriminative models and clustering • An initial keyword-based annotation phase • User supplies a list of examples • System selects sentences containing those words for annotation • Repeated active learning phases • System trains a model on all sentences annotated so far • This discriminative model combines word cluster features with the training examples. • System selects a new batch of sentences for annotation that contain instances that the model is maximally unsure about • Tool automatically implements test-on-train measures of annotation consistency
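
A minimal sketch of the selection step described above; train() and confidence() are hypothetical stand-ins for the discriminative tagger and its scoring.

```python
def select_batch(annotated, pool, batch_size, train, confidence):
    model = train(annotated)                 # retrain on all annotations so far
    # Lowest-confidence sentences are the most informative to annotate next.
    return sorted(pool, key=lambda s: confidence(model, s))[:batch_size]

# Toy usage: the "model" is just the set of words seen tagged as names,
# and "confidence" is the margin of a trivial keyword scorer.
def toy_train(annotated):
    return {w for sent, tags in annotated for w, t in zip(sent, tags) if t}

def toy_confidence(model, sentence):
    hits = sum(w in model for w in sentence)
    return abs(hits - len(sentence) / 2)     # small margin = unsure

annotated = [(["mr", "hussein"], [False, True])]
pool = [["hussein", "said"], ["the", "troops"], ["qusay", "hussein"]]
print(select_batch(annotated, pool, 1, toy_train, toy_confidence))
# [['hussein', 'said']]
```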

  46. Example [Screenshot of the annotation tool: a sentence in context, automatically tagged; the annotation classes; the list of sentences to review; the list of completed sentences; and the command to retrain the model and select new sentences via active learning.]

  47. Conclusions • New result • Combines • Discriminative models • Unsupervised word clustering • Active learning • Reduces the error rate in name extraction by 25% in English and Arabic • Reduces the training requirement by a factor of 5-8 in English and Arabic • Integrated into a new annotation tool • Integrated into the SERIF extraction engine, which • Extracts entities and relations • Performs entity disambiguation/database normalization

  48. OntoBank: Toward a New Generation of Representation & Understanding

  49. Technical Challenge: Representation & Modeling • Current levels of representation: • Words as features • Sequences as features • Words and/or syntax as features • Desired level -- Logical structure (ontology-based logic) as features

  RELATION     | ACTOR        | OBJECT    | TIME
  LEADS        | Sharif-1     | Pakistan  | past-time
  OUST         | Musharraf-23 | Sharif-1  | 10/12/99
  RANK         | Musharraf-23 | General   | 10/12/99
  NATIONALITY  | Musharraf-23 | Pakistani |

  (Vugraph from Weischedel, Marcus, and Hovy)

  50. Models, Resources & Applications
