Toward Unified Models of Information Extraction and Data Mining

Andrew McCallum
Information Extraction and Synthesis Laboratory, Computer Science Department, University of Massachusetts Amherst
Joint work with Aron Culotta, Charles Sutton, Ben Wellner, Khashayar Rohanimanesh, Wei Li

Goal: Improving our ability to mine actionable knowledge from unstructured text.

Pages Containing the Phrase “high tech job openings”

Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1

A Portal for Job Openings

Job Openings: Category = High Tech, Keyword = Java, Location = U.S.

Data Mining the Extracted Job Information

IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents, several millennia old:
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries

What is “Information Extraction”
As a family of techniques: Information Extraction = segmentation + classification + association + clustering
Example article: October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation



What is “Information Extraction”
As a family of techniques: Information Extraction = segmentation + classification + association + clustering
Applied to the article above, the four steps yield these records:
NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..

Larger Context
Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)

Problem: Combined in serial juxtaposition, IE and knowledge discovery (KD) are unaware of each other's weaknesses and opportunities. KD begins from a populated DB, unaware of where the data came from or of its inherent uncertainties. IE is unaware of emerging patterns and regularities in the DB. The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

Solution: IE passes uncertainty info forward to data mining, and data mining passes emerging patterns back to IE.
Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)

Solution: Unified Model
Document collection → Spider → Filter → Probabilistic Model covering both IE (Segment, Classify, Associate, Cluster) and Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)
Discriminatively-trained undirected graphical models:
• Conditional Random Fields [Lafferty, McCallum, Pereira]
• Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]
Complex inference and learning: just what we researchers like to sink our teeth into!

Outline
• The need for unified IE and DM.
• Review of Conditional Random Fields for IE.
• Preliminary steps toward unification:
  • Joint Co-reference Resolution (Graph Partitioning)
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Segmentation and Co-ref (Iterated Conditional Samples)
• Conclusions

Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
Graphical model / finite state model: a chain of states ... S_{t-1} → S_t → S_{t+1} ... linked by transitions, with each state S_t emitting one observation O_t.
Generates: a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8.
Parameters, for all states S = {s1, s2, …}:
• Start state probabilities: P(s_t)
• Transition probabilities: P(s_t | s_{t-1})
• Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: Maximize probability of training observations (w/ prior).

IE with Hidden Markov Models
Given a sequence of observations: Yesterday Rich Caruana spoke this example sentence.
and a trained HMM with states: person name, location name, background
Find the most likely state sequence (Viterbi): Yesterday [Rich Caruana] spoke this example sentence.
Any words said to be generated by the designated “person name” state are extracted as a person name:
Person name: Rich Caruana
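The decoding step above can be sketched in a few lines of Python. This is a minimal illustration, not the system from the talk: the two-state model, all probabilities, and the `floor` used for unseen words are made-up assumptions chosen only so that the example sentence decodes as on the slide.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Most likely state sequence under an HMM, computed in log space.

    `floor` is a hypothetical stand-in for smoothing: words a state never
    emitted get this tiny probability instead of zero.
    """
    def log(p):
        return math.log(max(p, floor))

    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Toy model (all numbers are illustrative assumptions).
states = ["background", "person"]
start_p = {"background": 0.9, "person": 0.1}
trans_p = {
    "background": {"background": 0.8, "person": 0.2},
    "person": {"background": 0.5, "person": 0.5},
}
emit_p = {
    "background": {w: 0.2 for w in ["Yesterday", "spoke", "this", "example", "sentence"]},
    "person": {"Rich": 0.5, "Caruana": 0.5},
}

tokens = "Yesterday Rich Caruana spoke this example sentence".split()
path = viterbi(tokens, states, start_p, trans_p, emit_p)
person_name = " ".join(tok for tok, st in zip(tokens, path) if st == "person")
print(person_name)
```

Words labeled with the "person" state are joined to form the extracted field, which mirrors the last step on the slide.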

We Want More than an Atomic View of Words
Would like a richer representation of text: many arbitrary, overlapping features of the words.
• identity of word
• ends in “-ski”
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• last person name was female
• next two words are “and Associates”
Example: for the observation O_t = “Wisniewski”, the features “is Wisniewski”, “part of noun phrase” and “ends in -ski” all fire at once.

Problems with Richer Representation and a Joint Model
These arbitrary features are not independent:
• Multiple levels of granularity (chars, words, phrases)
• Multiple dependent modalities (words, formatting, layout)
• Past & future
Two choices:
• Ignore the dependencies. This causes “over-counting” of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
• Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!

Conditional Sequence Models
• We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o).
  • Can examine features, but not responsible for generating them.
  • Don’t have to explicitly model their dependencies.
  • Don’t “waste modeling effort” trying to generate what we are given at test time anyway.

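From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
Joint: the HMM chain of states ... S_{t-1}, S_t, S_{t+1} ... over observations ... O_{t-1}, O_t, O_{t+1} ..., modeling P(s,o).
Conditional: the same chain of states, conditioned on the observations, modeling P(s|o). (A super-special case of Conditional Random Fields.)
Set parameters by maximum likelihood, using an optimization method on the gradient of the likelihood (dL).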
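For reference, the two models being contrasted are usually written as follows. This is the standard textbook form, reconstructed from the HMM parameters defined earlier and the common linear-chain CRF definition rather than copied from the slide; the symbols $\lambda_k$ (weights), $f_k$ (feature functions) and $Z(\mathbf{o})$ (normalizer) are assumed notation, and $P(s_1 \mid s_0)$ stands for the start-state probability.

```latex
% Joint model (HMM): generates both states and observations
P(\mathbf{s}, \mathbf{o}) \;=\; \prod_{t=1}^{T} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)

% Conditional model (linear-chain CRF): scores states given observations
P(\mathbf{s} \mid \mathbf{o}) \;=\; \frac{1}{Z(\mathbf{o})}
  \prod_{t=1}^{T} \exp\!\Big( \sum_{k} \lambda_k \, f_k(s_t, s_{t-1}, \mathbf{o}, t) \Big),
\qquad
Z(\mathbf{o}) \;=\; \sum_{\mathbf{s}'} \prod_{t=1}^{T}
  \exp\!\Big( \sum_{k} \lambda_k \, f_k(s'_t, s'_{t-1}, \mathbf{o}, t) \Big)
```

Choosing $f_k$ to be indicator functions of (state, previous state) and (state, word) pairs, with $\lambda_k$ set to the corresponding log-probabilities, recovers the HMM's conditional distribution, which is one way to read "super-special case".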

Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
1. FSM special case: linear chain among unknowns S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}, all conditioned on O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}; parameters tied across time steps.
2. In general: CRFs = "Conditionally-trained Markov Network", with arbitrary structure among unknowns.
3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: parameters tied across hits from SQL-like queries ("clique templates").

Training CRFs
Gradient of the log-likelihood = (feature count using correct labels) - (feature count using predicted labels) - (smoothing penalty)
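Written out, the three terms take the following commonly used form. This is a sketch under assumptions: the smoothing penalty is taken to be a zero-mean Gaussian prior with variance $\sigma^2$, the "predicted labels" count is written as an expectation under the current model, and the indices $i$ (training sequence) and $k$ (feature) are notation introduced here.

```latex
% Penalized conditional log-likelihood over training pairs (s^{(i)}, o^{(i)})
L(\Lambda) \;=\; \sum_{i} \log P_\Lambda\big(\mathbf{s}^{(i)} \mid \mathbf{o}^{(i)}\big)
  \;-\; \sum_{k} \frac{\lambda_k^{2}}{2\sigma^{2}}

% Gradient with respect to one weight
\frac{\partial L}{\partial \lambda_k} \;=\;
  \underbrace{\sum_{i} \sum_{t} f_k\big(s^{(i)}_t, s^{(i)}_{t-1}, \mathbf{o}^{(i)}, t\big)}_{\text{feature count using correct labels}}
  \;-\;
  \underbrace{\sum_{i} \sum_{\mathbf{s}'} P_\Lambda\big(\mathbf{s}' \mid \mathbf{o}^{(i)}\big)
     \sum_{t} f_k\big(s'_t, s'_{t-1}, \mathbf{o}^{(i)}, t\big)}_{\text{feature count using predicted labels}}
  \;-\;
  \underbrace{\frac{\lambda_k}{\sigma^{2}}}_{\text{smoothing penalty}}
```

The gradient vanishes when the model's expected feature counts match the observed ones, up to the pull of the prior toward zero.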

Linear-chain CRFs vs. HMMs
• Comparable computational efficiency for inference
• Features may be arbitrary functions of any or all observations
• Parameters need not fully specify generation of observations; can require less training data
• Easy to incorporate domain knowledge

Main Point #1 Conditional probability sequence models give great flexibility regarding features used, and have efficient dynamic-programming-based algorithms for inference.
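As an illustration of the dynamic-programming claim, the sketch below computes the log normalizer log Z(o) of a linear-chain model with the forward recursion and checks it against brute-force enumeration. The scoring function `toy_score`, its weights, and the token list are invented for the example; only the recursion itself is the standard technique.

```python
import itertools
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition_forward(n_steps, states, score):
    """log Z(o) via the forward recursion: O(n_steps * |states|^2) work.

    score(t, prev, cur) returns the log-potential at position t;
    prev is None at t = 0.
    """
    alpha = {s: score(0, None, s) for s in states}
    for t in range(1, n_steps):
        alpha = {
            cur: logsumexp([alpha[prev] + score(t, prev, cur) for prev in states])
            for cur in states
        }
    return logsumexp(list(alpha.values()))

def log_partition_brute_force(n_steps, states, score):
    """Same quantity by summing over all |states|^n_steps label sequences."""
    totals = []
    for seq in itertools.product(states, repeat=n_steps):
        total = score(0, None, seq[0])
        for t in range(1, n_steps):
            total += score(t, seq[t - 1], seq[t])
        totals.append(total)
    return logsumexp(totals)

# Hypothetical features and weights, for illustration only.
tokens = ["Yesterday", "Rich", "Caruana", "spoke"]
STATES = ["background", "person"]

def toy_score(t, prev, cur):
    word = tokens[t]
    total = 0.0
    if word[0].isupper() and cur == "person":
        total += 1.5   # capitalized words lean toward "person"
    if t == 0 and cur == "person":
        total -= 2.0   # sentence-initial capitalization is weak evidence
    if prev == "person" and cur == "person":
        total += 0.5   # names tend to span several tokens
    if not word[0].isupper() and cur == "background":
        total += 1.0
    return total

log_z_forward = log_partition_forward(len(tokens), STATES, toy_score)
log_z_brute = log_partition_brute_force(len(tokens), STATES, toy_score)
print(log_z_forward, log_z_brute)
```

Note that `toy_score` looks at arbitrary properties of the observed words without modeling how those words are generated, which is the flexibility the slide refers to.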

Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
      :             :        Production of Milk and Milkfat 2/
      :   Number    :-------------------------------------------------------
 Year :     of      :   Per Milk Cow    :  Percentage   :       Total
      :Milk Cows 1/ :-------------------: of Fat in All :------------------
      :             :  Milk   : Milkfat : Milk Produced :  Milk   : Milkfat
--------------------------------------------------------------------------------
      : 1,000 Head      --- Pounds ---       Percent        Million Pounds
      :
 1993 :    9,589      15,704      575         3.66        150,582   5,514.4
 1994 :    9,500      16,175      592         3.66        153,664   5,623.7
 1995 :    9,461      16,451      602         3.66        155,644   5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
100+ documents from www.fedstats.gov
A CRF assigns one label to each line of a document such as the milk production report above.
Labels (12 in all):
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ...
Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
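A rough sketch of how the per-line features listed on this slide might be computed. The exact definitions used in the paper are not given here, so each one below is an assumption: percentages are taken over non-whitespace characters, and "aligns with prev." is approximated by checking whether any run of two or more spaces starts at the same column in both lines. The helper names `gap_columns`, `line_features` and `make_row` are hypothetical.

```python
import re

def gap_columns(line):
    """Columns where a run of two or more spaces begins (a crude proxy for column gaps)."""
    return {m.start() for m in re.finditer(r" {2,}", line)}

def line_features(line, prev_line=""):
    """Layout features for one line of text; definitions are illustrative assumptions."""
    chars = [c for c in line if not c.isspace()]
    n = max(len(chars), 1)  # avoid dividing by zero on blank lines
    return {
        "pct_digit": sum(c.isdigit() for c in chars) / n,
        "pct_alpha": sum(c.isalpha() for c in chars) / n,
        "indented": line[:1] in (" ", "\t"),
        "five_plus_spaces": re.search(r" {5,}", line) is not None,
        "aligns_with_prev": bool(gap_columns(line) & gap_columns(prev_line)),
    }

def make_row(year, cows, milk, fat):
    """Build a fixed-width data row so that columns line up across rows."""
    return f"{year} :{cows:>10}{milk:>12}{fat:>9}"

prose = "Marketings totaled 154 billion pounds, 1 percent above 1994."
row_1993 = make_row("1993", "9,589", "15,704", "575")
row_1994 = make_row("1994", "9,500", "16,175", "592")

prose_feats = line_features(prose)
row_feats = line_features(row_1994, prev_line=row_1993)
print(prose_feats)
print(row_feats)
```

A data row comes out digit-heavy, wide-gapped and aligned with its predecessor, while the prose line is mostly alphabetic with no wide gaps; the conjunction features on the slide would then combine such values across neighboring lines.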

Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
                          Line labels, percent correct   Table segments, F1
HMM                       65 %                           64 %
Stateless MaxEnt          85 %                           -
CRF w/out conjunctions    52 %                           68 %
CRF                       95 %                           92 %
Δ error                   85 %                           77 %

IE from Research Papers [McCallum et al '99]

IE from Research Papers
                                                                       Field-level F1
Hidden Markov Models (HMMs) [Seymore, McCallum, Rosenfeld, 1999]       75.6
Support Vector Machines (SVMs) [Han, Giles, et al, 2003]               89.7
Conditional Random Fields (CRFs) [Peng, McCallum, 2004]                93.9
Δ error: 40%
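The "Δ error" figure can be checked with a short calculation, assuming it denotes the relative reduction in error (100 minus F1) from the best previous method to the CRF. That reading is an inference from the numbers, not something the slide states; the function name is hypothetical.

```python
def relative_error_reduction(baseline_pct, improved_pct):
    """Percentage of the baseline's error (100 - score) removed by the improved system."""
    baseline_error = 100.0 - baseline_pct
    improved_error = 100.0 - improved_pct
    return 100.0 * (baseline_error - improved_error) / baseline_error

# Field-level F1 from the slide: SVM 89.7, CRF 93.9.
svm_to_crf = relative_error_reduction(89.7, 93.9)
# For comparison, measured against the HMM (75.6) instead.
hmm_to_crf = relative_error_reduction(75.6, 93.9)
print(round(svm_to_crf, 1), round(hmm_to_crf, 1))
```

Under this reading the reduction from SVM to CRF comes out at roughly 40 percent, which is consistent with the slide taking the SVM result as its baseline.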

Main Point #2
Conditional Random Fields were more accurate in practice than a generative model
... on a research paper extraction task,
... and others, including
- a table extraction task
- noun phrase segmentation
- named entity extraction
- …


IE in Context
Pipeline: Document collection → Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search and Data mining (Prediction, Outlier detection, Decision support)
Supporting steps: Create ontology; Label training data; Train extraction models

Coreference Resolution
AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty"
Input: News article, with named-entity "mentions" tagged:
Today Secretary of State Colin Powell met with . . . he . . . Condoleezza Rice . . . Mr Powell . . . she . . . Powell . . . President Bush . . . Rice . . . Bush . . .
Output: Number of entities, N = 3
#1: Secretary of State Colin Powell, he, Mr. Powell, Powell
#2: Condoleezza Rice, she, Rice
#3: President Bush, Bush

Inside the Traditional Solution
Pair-wise Affinity Metric between Mention (3) and Mention (4) (". . . Powell . . ." and ". . . Mr Powell . . ."): coreferent, Y/N?
Fires?  Feature                                          Weight
N       Two words in common                               29
Y       One word in common                                13
Y       "Normalized" mentions are string identical        39
Y       Capitalized word in common                        17
Y       > 50% character tri-gram overlap                  19
N       < 25% character tri-gram overlap                 -34
Y       In same sentence                                   9
Y       Within two sentences                               8
N       Further than 3 sentences apart                    -1
Y       "Hobbs Distance" < 3                              11
N       Number of entities in between two mentions = 0    12
N       Number of entities in between two mentions > 4    -3
Y       Font matches                                       1
Y       Default                                          -19
OVERALL SCORE = 98 > threshold = 0
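The overall score on this slide is the sum of the weights of the features that fire. A minimal check in Python, with the weights and Y/N values copied from the slide (the variable names are introduced here):

```python
# (feature description, weight, fires?) as listed on the slide
FEATURES = [
    ("Two words in common", 29, False),
    ("One word in common", 13, True),
    ('"Normalized" mentions are string identical', 39, True),
    ("Capitalized word in common", 17, True),
    ("> 50% character tri-gram overlap", 19, True),
    ("< 25% character tri-gram overlap", -34, False),
    ("In same sentence", 9, True),
    ("Within two sentences", 8, True),
    ("Further than 3 sentences apart", -1, False),
    ('"Hobbs Distance" < 3', 11, True),
    ("Number of entities in between two mentions = 0", 12, False),
    ("Number of entities in between two mentions > 4", -3, False),
    ("Font matches", 1, True),
    ("Default", -19, True),
]
THRESHOLD = 0

# Only features that fire contribute their weight.
overall_score = sum(weight for _, weight, fires in FEATURES if fires)
merge = overall_score > THRESHOLD
print(overall_score, merge)
```

The always-on "Default" feature acts as a bias term: with no other evidence the pair scores -19 and is not merged.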

The Problem
Pair-wise merging decisions are being made independently from each other:
• Mr Powell and Powell: affinity = 98, decision Y
• the two edges to she: affinity = -104, decision N, and affinity = 11, decision Y
Two Y decisions and one N among three mentions cannot all hold at once. The decisions should be made in relational dependence with each other.
Affinity measures are noisy and imperfect.

A Generative Model Solution [Russell 2001], [Pasula et al 2002]
(Applied to citation matching, and object correspondence in vision)
Model variables: N entities, each with an id; per-mention variables include context words, id, surname, distance, fonts, gender, age, . . .
Issues:
1) Generative model makes it difficult to use complex features.
2) Number of entities is hard-coded into the model structure, but we are supposed to predict the number of entities! Thus we must modify model structure during inference, using MCMC.

A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003]
Make pair-wise merging decisions in dependent relation to each other by
- calculating a joint prob.
- including all edge weights
- adding dependence on consistent triangles.
Example: mentions Mr Powell, Powell and she, with a Y/N variable on each of the three edges; edge weights 45 (Mr Powell, Powell), -30 and 11 (the two edges to she).

A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003]
Labeling 1: edge 45 (Mr Powell, Powell) = N, contributing -(45); edge -30 = N, contributing -(-30); edge 11 = Y, contributing +(11). Total: -4

A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003]
Labeling 2: edge 45 (Mr Powell, Powell) = Y, contributing +(45); edge -30 = N, contributing -(-30); edge 11 = Y, contributing +(11). The triangle is inconsistent, so the total is -infinity

A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003]
Labeling 3: edge 45 (Mr Powell, Powell) = Y, contributing +(45); edge -30 = N, contributing -(-30); edge 11 = N, contributing -(11). Total: 64
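The three labelings shown on these slides can be reproduced by scoring every Y/N assignment on the triangle. The sketch below assumes, as the totals suggest, that an edge contributes +weight when labeled Y and -weight when labeled N, and that a labeling with exactly two Y edges (an inconsistent triangle) scores minus infinity. Which of the two edges to "she" carries -30 and which carries 11 is a guess; the totals do not depend on it.

```python
import itertools

NEG_INF = float("-inf")

# Edge weights from the slides; assignment of -30 and 11 to pairs is assumed.
WEIGHTS = {
    ("Mr Powell", "Powell"): 45,
    ("Powell", "she"): -30,
    ("Mr Powell", "she"): 11,
}
EDGES = list(WEIGHTS)

def labeling_score(labels):
    """labels maps each edge to True (Y, coreferent) or False (N)."""
    # In a triangle, two Y edges force the third by transitivity,
    # so exactly two Y labels is the only inconsistent case.
    if sum(labels.values()) == 2:
        return NEG_INF
    return sum(w if labels[e] else -w for e, w in WEIGHTS.items())

def make_labels(a, b, c):
    return dict(zip(EDGES, (a, b, c)))

score_nny = labeling_score(make_labels(False, False, True))   # slide total -4
score_yny = labeling_score(make_labels(True, False, True))    # inconsistent
score_ynn = labeling_score(make_labels(True, False, False))   # slide total 64

all_scores = {
    combo: labeling_score(make_labels(*combo))
    for combo in itertools.product([True, False], repeat=3)
}
best_combo = max(all_scores, key=all_scores.get)
print(score_nny, score_yny, score_ynn, best_combo)
```

Among the eight labelings, the highest-scoring consistent one merges Mr Powell with Powell and keeps "she" separate, matching the last slide in the sequence.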

Inference in these MRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
Example graph: mentions Mr Powell, Powell, Condoleezza Rice and she, fully connected, with edge weights 45, -106, -30, -134, 11 and 10.

Inference in these MRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
On the graph with mentions Mr Powell, Powell, Condoleezza Rice and she (edge weights 45, -106, -30, -134, 11, 10), one candidate partition scores -22.

Inference in these MRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
On the graph with mentions Mr Powell, Powell, Condoleezza Rice and she (edge weights 45, -106, -30, -134, 11, 10), a better partition scores 314.
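For a graph this small, the partition scores can be checked by enumerating all 15 partitions. The sketch assumes a partition's score adds the weight of every edge inside a block and subtracts the weight of every edge cut between blocks. The slides do not say which pair each weight belongs to; the assignment below is an inference chosen because it reproduces both totals (-22 and 314), so treat it as hypothetical.

```python
# Edge weights; the pairing of weights with mention pairs is an assumption.
WEIGHTS = {
    frozenset(["Mr Powell", "Powell"]): 45,
    frozenset(["Powell", "she"]): -30,
    frozenset(["Mr Powell", "she"]): 11,
    frozenset(["Condoleezza Rice", "she"]): 10,
    frozenset(["Mr Powell", "Condoleezza Rice"]): -134,
    frozenset(["Powell", "Condoleezza Rice"]): -106,
}
NODES = ["Mr Powell", "Powell", "Condoleezza Rice", "she"]

def partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part

def partition_score(blocks):
    """Sum of +w for edges inside a block and -w for edges cut by the partition."""
    block_of = {node: i for i, block in enumerate(blocks) for node in block}
    total = 0
    for pair, w in WEIGHTS.items():
        a, b = tuple(pair)
        total += w if block_of[a] == block_of[b] else -w
    return total

all_partitions = list(partitions(NODES))
best = max(all_partitions, key=partition_score)
best_score = partition_score(best)
poor_score = partition_score([["Mr Powell", "Condoleezza Rice", "she"], ["Powell"]])
print(len(all_partitions), best, best_score, poor_score)
```

Exhaustive enumeration grows with the Bell numbers, so it only serves as a check here; larger graphs need the approximate partitioning algorithms the slide cites.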

Markov Random Fields for Co-reference
• Train edge weight function by maximum likelihood.
  • Can approximate the gradient by Gibbs sampling, or by stochastic gradient ascent, e.g. voted perceptron.
  • Given labeled training data in which partitions are given, learn an affinity measure for which partitioning will reproduce those partitions.
• Interested in better algorithms for graph partitioning.
  • Standard algorithms (e.g. Fiduccia-Mattheyses) do not apply with negative edge weights.
  • The action is in the interplay between positive and negative edges.
  • Currently using a modified version of "Correlation Clustering" [Bansal, Blum, Chawla, 2002], a very simple greedy algorithm.
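To give a feel for how a greedy approach copes with mixed positive and negative edges, here is a generic agglomerative sketch: start with singletons and repeatedly merge the two clusters with the largest positive total weight between them. This is not the modified Correlation Clustering procedure used in the talk, whose details are not given; it is a simple stand-in, and the pairing of weights with mention pairs is assumed.

```python
import itertools

# Edge weights; which pair carries which weight is an assumption.
WEIGHTS = {
    frozenset(["Mr Powell", "Powell"]): 45,
    frozenset(["Powell", "she"]): -30,
    frozenset(["Mr Powell", "she"]): 11,
    frozenset(["Condoleezza Rice", "she"]): 10,
    frozenset(["Mr Powell", "Condoleezza Rice"]): -134,
    frozenset(["Powell", "Condoleezza Rice"]): -106,
}
NODES = ["Mr Powell", "Powell", "Condoleezza Rice", "she"]

def weight(a, b):
    """Affinity between two mentions; missing edges count as 0."""
    return WEIGHTS.get(frozenset([a, b]), 0)

def greedy_partition(nodes):
    """Merge the cluster pair with the largest positive linkage until none remains."""
    clusters = [frozenset([n]) for n in nodes]
    while True:
        best_gain, best_pair = 0, None
        for a, b in itertools.combinations(clusters, 2):
            # Total weight of all edges running between the two clusters.
            gain = sum(weight(x, y) for x in a for y in b)
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            return clusters
        a, b = best_pair
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]

clusters = greedy_partition(NODES)
print(clusters)
```

On this example the positive edge of weight 11 to "she" is outvoted once Mr Powell and Powell are merged, because their combined linkage to "she" is 11 + (-30) = -19, so "she" ends up with Condoleezza Rice instead.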

Co-reference Experimental Results [McCallum & Wellner, 2003]
Proper noun co-reference, among nouns having coreferents
DARPA ACE broadcast news transcripts, 117 stories
                            MUC-style F1
Single-link threshold       91.65 %
Best prev match [Morton]    90.98 %
MRFs                        93.96 %
Δ error = 28%
DARPA MUC-6 newswire article corpus, 30 stories
                            MUC-style F1
Single-link threshold       60.83 %
Best prev match [Morton]    88.83 %
MRFs                        91.59 %
Δ error = 24%
