
Toward Unified Models of Information Extraction and Data Mining

Toward Unified Models of Information Extraction and Data Mining. Andrew McCallum Information Extraction and Synthesis Laboratory Computer Science Department University of Massachusetts Amherst Joint work with Aron Culotta, Wei Li, Khashayar Rohanimanesh, Charles Sutton, Ben Wellner.


Presentation Transcript


  1. Toward Unified Models of Information Extraction and Data Mining Andrew McCallum Information Extraction and Synthesis Laboratory Computer Science Department University of Massachusetts Amherst Joint work with Aron Culotta, Wei Li, Khashayar Rohanimanesh, Charles Sutton, Ben Wellner

  2. Goal: Improving our ability to mine actionable knowledge from unstructured text.

  3. Larger Context [pipeline diagram] Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)

  4. Problem: Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities. KD begins from a populated DB, unaware of where the data came from or its inherent uncertainties. IE is unaware of emerging patterns and regularities in the DB. The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

  5. Solution: [pipeline diagram, as before, with two feedback links added] Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support). IE passes Uncertainty Info forward to Data Mining, and Data Mining passes Emerging Patterns back to IE.

  6. Solution: Unified Model. Replace the IE → Database → Data Mining pipeline with a single probabilistic model from document collection to actionable knowledge. Discriminatively-trained undirected graphical models: Conditional Random Fields [Lafferty, McCallum, Pereira]; Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]. Complex inference and learning: just what we researchers like to sink our teeth into!

  7. Outline • The need for unified IE and DM. • Review of Conditional Random Fields for IE. • Preliminary steps toward unification: Joint Co-reference Resolution (Graph Partitioning); Joint Labeling of Cascaded Sequences (Belief Propagation); Joint Segmentation and Co-ref (Iterated Conditional Samples). • Conclusions

  8. Hidden Markov Models HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, … [diagram: graphical / finite-state model with state transitions s(t-1) → s(t) → s(t+1) and observations o1 … o8] Generates: a state sequence and an observation sequence. Parameters, for all states S = {s1, s2, …}: start state probabilities P(s1); transition probabilities P(st | st-1); observation (emission) probabilities P(ot | st), usually a multinomial over an atomic, fixed alphabet. Training: maximize the probability of the training observations (with a prior).
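The decoding side of the model above can be sketched in a few lines. This is a minimal Viterbi decoder for a toy HMM; the two-state weather model and all its probabilities are invented for illustration and are not from the talk.

```python
# Minimal sketch (not the talk's code): Viterbi decoding for a toy HMM
# with hypothetical start/transition/emission probabilities.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for an observation sequence."""
    # best[t][s] = probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Hypothetical two-state weather model emitting activity observations.
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

print(viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p))
# → ['Sunny', 'Rainy', 'Rainy']
```

Training (the "maximize probability of training observations" step) would fit the three parameter tables by counting (supervised) or Baum-Welch (unsupervised); decoding stays the same.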

  9. From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001] Joint (HMM): P(s, o) = ∏t P(st | st-1) P(ot | st). Conditional: P(s | o) = (1/Z(o)) ∏t Φ(st-1, st, o, t), where Φ(st-1, st, o, t) = exp(Σk λk fk(st-1, st, o, t)). (A super-special case of Conditional Random Fields.) Set the parameters λ by maximum likelihood, using an optimization method on the gradient of the log-likelihood L.
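To make the conditional formula concrete, here is a brute-force sketch of P(s | o) for a tiny linear-chain model. The BIO-style tag set and the feature weights are invented for illustration; a real CRF would compute Z(o) with forward-backward rather than by enumerating all sequences.

```python
# Sketch, not the authors' implementation: P(s|o) for a toy linear-chain CRF.
import math
from itertools import product

def log_potential(prev_s, s, obs, t, weights):
    """Sum of weighted features firing on the transition (prev_s -> s) at position t."""
    score = weights.get(("trans", prev_s, s), 0.0)
    score += weights.get(("emit", s, obs[t]), 0.0)
    return score

def prob(states, obs, seq, weights):
    """P(seq | obs) by brute-force normalization (fine for tiny examples)."""
    def score(candidate):
        return math.exp(sum(
            log_potential(candidate[t - 1] if t else None, candidate[t], obs, t, weights)
            for t in range(len(obs))))
    z = sum(score(c) for c in product(states, repeat=len(obs)))  # Z(o)
    return score(seq) / z

states = ["B", "I", "O"]                 # hypothetical tag set
weights = {("emit", "B", "Smith"): 2.0,  # invented feature weights
           ("emit", "O", "the"): 1.5,
           ("trans", "B", "I"): 1.0}
obs = ["the", "Smith"]
print(round(prob(states, obs, ("O", "B"), weights), 3))
# → 0.529
```

Note what the conditional form buys us: the features fk may inspect the entire observation sequence o at every position, which is exactly what the generative HMM's per-state emission model cannot do.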

  10. Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
         :            :        Production of Milk and Milkfat 2/
         :   Number   :-------------------------------------------------------
  Year   :     of     :    Per Milk Cow     :  Percentage   :      Total
         :Milk Cows 1/:---------------------: of Fat in All :------------------
         :            :   Milk   :  Milkfat : Milk Produced :  Milk  : Milkfat
--------------------------------------------------------------------------------
         : 1,000 Head     --- Pounds ---        Percent       Million Pounds
  1993   :    9,589      15,704       575        3.66        150,582    5,514.4
  1994   :    9,500      16,175       592        3.66        153,664    5,623.7
  1995   :    9,461      16,451       602        3.66        155,644    5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

  11. Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR] 100+ documents from www.fedstats.gov (the milk-production report on the previous slide is one example). Labels, assigned per line by a CRF (12 in all): • Non-Table • Table Title • Table Header • Table Data Row • Table Section Data Row • Table Footnote • ... Features: • Percentage of digit chars • Percentage of alpha chars • Indented • Contains 5+ consecutive spaces • Whitespace in this line aligns with the previous line • ... • Conjunctions of all previous features, at time offsets {0,0}, {-1,0}, {0,1}, {1,2}.
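The features listed above are simple per-line surface statistics. A hypothetical sketch of how such features might be computed follows; the function name, the alignment threshold, and the return format are our own choices, not taken from the paper.

```python
# Illustrative sketch of the per-line features listed on this slide
# (percent digit chars, indentation, long space runs, alignment).

def line_features(line, prev_line=""):
    chars = [c for c in line if not c.isspace()]
    feats = {}
    if chars:
        feats["pct_digit"] = sum(c.isdigit() for c in chars) / len(chars)
        feats["pct_alpha"] = sum(c.isalpha() for c in chars) / len(chars)
    feats["indented"] = line.startswith(" ")
    feats["five_spaces"] = "     " in line.strip("\n")   # 5+ consecutive spaces
    # crude alignment check: space columns shared with the previous line
    ws = {i for i, c in enumerate(line) if c == " "}
    prev_ws = {i for i, c in enumerate(prev_line) if c == " "}
    feats["aligns_with_prev"] = len(ws & prev_ws) > 5    # threshold is invented
    return feats

f = line_features("   1993  :   9,589   15,704    575", "   Year  :   of ...")
print(f["pct_digit"] > 0.5, f["indented"])
# → True True
```

Conjunctions of these features at the listed time offsets would then be generated mechanically before training the CRF.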

  12. Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Method                   Line labels (% correct)   Table segments (F1)
HMM                              65%                     64%
Stateless MaxEnt                 85%                      -
CRF w/out conjunctions           52%                     68%
CRF                              95%                     92%

Error reduction, CRF vs. HMM: 85% on line labels, 77% on table segments.

  13. IE from Research Papers [McCallum et al ‘99]

  14. IE from Research Papers Field-level F1:
Hidden Markov Models (HMMs): 75.6 [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs): 89.7 [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs): 93.9 [Peng, McCallum, 2004]
Error reduction (CRF vs. SVM): 40%.

  15. Main Point #2 Conditional Random Fields were more accurate in practice than a generative model on a research paper extraction task, and on others, including: a table extraction task, noun phrase segmentation, named entity extraction, …

  16. Outline • The need for unified IE and DM. • Review of Conditional Random Fields for IE. • Preliminary steps toward unification: Joint Labeling of Cascaded Sequences (Belief Propagation), Charles Sutton; Joint Co-reference Resolution (Graph Partitioning), Aron Culotta; Joint Labeling for Semi-Supervision (Graph Partitioning), Wei Li; Joint Segmentation and Co-ref (Iterated Conditional Samples), Andrew McCallum

  17. 1. Jointly labeling cascaded sequences: Factorial CRFs [diagram: stacked chains over the English words, with cascaded layers for part-of-speech, noun-phrase boundaries, and named-entity tags] Joint prediction of part-of-speech and noun-phrase boundaries in newswire reaches equivalent accuracy with only 50% of the training data. Inference: tree reparameterization. [Sutton, Khashayar, McCallum, ICML 2004] [Wainwright et al, 2002]

  18. 1b. Jointly labeling distant mentions: Skip-chain CRFs [diagram: a linear chain with a long-distance "skip" edge connecting the two mentions of Green in "… Mr. Ted Green said today …" and "… Mary saw Green at …"] 14% reduction in error on the most-repeated field in email seminar announcements. Inference: tree reparameterization. [Sutton, McCallum, 2004] [Wainwright et al, 2002]

  19. 2. Joint co-reference among all pairs: Affinity Matrix CRF [diagram: mentions ". . . Mr Powell . . .", ". . . Powell . . .", and ". . . she . . ." connected by Y/N coreference decisions with affinity scores 45, -99, and 11] 25% reduction in error on co-reference of proper nouns in newswire. Inference: correlational clustering graph partitioning. [McCallum, Wellner, IJCAI WS 2003] [Bansal, Blum, Chawla, 2002]
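The partitioning step above can be sketched with a simple greedy heuristic over the slide's affinity scores (45, -99, 11): merge clusters while doing so increases total intra-cluster affinity. This is a hedged illustration of the correlational-clustering objective, not the exact algorithm of Bansal et al.

```python
# Greedy correlational-clustering sketch over pairwise affinity scores.
from itertools import combinations

def greedy_partition(mentions, affinity):
    """Merge clusters while the merge increases total intra-cluster affinity."""
    clusters = [{m} for m in mentions]
    improved = True
    while improved:
        improved = False
        best_gain, best_pair = 0, None
        for a, b in combinations(range(len(clusters)), 2):
            # affinity is stored once per unordered pair; look up either direction
            gain = sum(affinity[(m, n)] if (m, n) in affinity else affinity[(n, m)]
                       for m in clusters[a] for n in clusters[b])
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair:
            a, b = best_pair
            clusters[a] |= clusters[b]
            del clusters[b]
            improved = True
    return clusters

affinity = {("Mr Powell", "Powell"): 45,   # scores from the slide
            ("Mr Powell", "she"): -99,
            ("Powell", "she"): 11}
print(greedy_partition(["Mr Powell", "Powell", "she"], affinity))
```

Merging "Mr Powell" and "Powell" gains +45; pulling in "she" would gain -99 + 11 = -88, so the negative edge correctly keeps her out even though one edge to "she" is positive. That trade-off is the point of partitioning jointly over all pairs rather than thresholding each pair independently.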

  20. 3. Joint Labeling for Semi-Supervision: Affinity Matrix CRF with prototypes [diagram: labeled prototypes y1, y2 and instances x1, x2, x3 connected by Y/N decisions with affinity scores 45, -99, and 11] 50% reduction in error on document classification with labeled and unlabeled data. Inference: correlational clustering graph partitioning. [Li, McCallum, 2003] [Bansal, Blum, Chawla, 2002]

  21. 4. Joint segmentation and co-reference Extraction from and matching of research paper citations. [diagram: two citation strings, "Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990." and "Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.", linked through segmentations s over observations o, citation attributes c, co-reference decisions y, a prototype p, database field values, and world knowledge] 35% reduction in co-reference error by using segmentation uncertainty. 6-14% reduction in segmentation error by using co-reference. Inference: a variant of Iterated Conditional Modes. [Wellner, McCallum, Peng, Hay, UAI 2004] see also [Marthi, Milch, Russell, 2003] [Besag, 1986]

  22. To Charles

  23. Citation Segmentation and Coreference Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , B. Laurel (ed) , Addison-Wesley , 1990 . Brenda Laurel . Interface Agents: Metaphors with Character , in Laurel , The Art of Human-Computer Interface Design , 355-366 , 1990 .

  24. Citation Segmentation and Coreference Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , B. Laurel (ed) , Addison-Wesley , 1990 . Brenda Laurel . Interface Agents: Metaphors with Character , in Laurel , The Art of Human-Computer Interface Design , 355-366 , 1990 . • Segment citation fields

  25. Citation Segmentation and Coreference Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , B. Laurel (ed) , Addison-Wesley , 1990 . Y/N Brenda Laurel . Interface Agents: Metaphors with Character , in Laurel , The Art of Human-Computer Interface Design , 355-366 , 1990 . • Segment citation fields • Resolve coreferent papers

  26. Incorrect Segmentation Hurts Coreference Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , B. Laurel (ed) , Addison-Wesley , 1990 . ? Brenda Laurel . Interface Agents: Metaphors with Character , in Laurel , The Art of Human-Computer Interface Design , 355-366 , 1990 .

  27. Incorrect Segmentation Hurts Coreference Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , B. Laurel (ed) , Addison-Wesley , 1990 . ? Brenda Laurel . Interface Agents: Metaphors with Character , in Laurel , The Art of Human-Computer Interface Design , 355-366 , 1990 . Solution: Perform segmentation and coreference jointly. Use segmentation uncertainty to improve coreference and use coreference to improve segmentation.

  28. Segmentation + Coreference Model [diagram: observed citation o, with CRF segmentation variables s above it]

  29. Segmentation + Coreference Model [diagram: as before, adding citation attributes c above the CRF segmentation s of the observed citation o]

  30. Segmentation + Coreference Model [diagram: the (observed citation o, CRF segmentation s, citation attributes c) structure replicated for each of three citations]

  31. Segmentation + Coreference Model [diagram: as before, with pairwise coreference variables y connecting the citation attributes c of each pair of citations]

  32. Such a highly connected graph makes exact inference intractable, so…

  33. Approximate Inference 1 • Loopy Belief Propagation: messages passed between nodes. [diagram: nodes v1…v6 with messages such as m1(v2), m2(v3), m2(v1), m3(v2) passed along edges]

  34. Approximate Inference 1 • Loopy Belief Propagation: messages passed between nodes. • Generalized Belief Propagation: messages passed between regions of nodes. [diagram: nodes v1…v9 grouped into overlapping regions] Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with region size!
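The message being passed is just the sum-product update. Below is a sketch on an invented 3-node binary chain (where loopy BP is exact); the potentials are made up for illustration.

```python
# Sketch of the sum-product message update used in (loopy) BP.

def send_message(i, j, node_pot, edge_pot, messages, neighbors, nvals=2):
    """m_{i->j}(x_j) = sum_{x_i} phi_i(x_i) psi_ij(x_i,x_j) prod_{k in N(i)\\j} m_{k->i}(x_i)"""
    out = []
    for xj in range(nvals):
        total = 0.0
        for xi in range(nvals):
            incoming = 1.0
            for k in neighbors[i]:
                if k != j:
                    incoming *= messages[(k, i)][xi]
            total += node_pot[i][xi] * edge_pot[(i, j)][xi][xj] * incoming
        out.append(total)
    z = sum(out)                      # normalize for numerical stability
    return [v / z for v in out]

# chain v0 - v1 - v2 over binary variables
neighbors = {0: [1], 1: [0, 2], 2: [1]}
node_pot = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.5, 0.5]}
attract = [[0.8, 0.2], [0.2, 0.8]]    # edge potential favoring agreement
edge_pot = {e: attract for e in [(0, 1), (1, 0), (1, 2), (2, 1)]}
messages = {(i, j): [1.0, 1.0] for i in neighbors for j in neighbors[i]}

for _ in range(5):                    # a few sweeps; exact on a tree
    for (i, j) in list(messages):
        messages[(i, j)] = send_message(i, j, node_pot, edge_pot, messages, neighbors)

belief1 = [node_pot[1][x] * messages[(0, 1)][x] * messages[(2, 1)][x] for x in range(2)]
z = sum(belief1)
print([round(b / z, 3) for b in belief1])
# → [0.74, 0.26]
```

Each message here is a table over one binary variable; in Generalized BP the tables range over joint assignments to a region, which is exactly why their size blows up with region size.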

  35. Approximate Inference 2 • Iterated Conditional Modes (ICM) [Besag 1986]: v6(i+1) = argmax over v6 of P(v6 | v \ v6), all other variables held constant. [diagram: nodes v1…v6; shaded nodes held constant while v6 is updated]

  36. Approximate Inference 2 • Iterated Conditional Modes (ICM) [Besag 1986]: v5(j+1) = argmax over v5 of P(v5 | v \ v5). [diagram: as before, now updating v5 with the rest held constant]

  37. Approximate Inference 2 • Iterated Conditional Modes (ICM) [Besag 1986]: v4(k+1) = argmax over v4 of P(v4 | v \ v4). [diagram: as before, now updating v4 with the rest held constant] But greedy, and easily falls into local minima.
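The ICM loop above is just coordinate ascent: set each variable to its best value given all the others, and repeat until nothing changes. A minimal sketch on an invented toy objective (the variables, values, and scores are ours, not from the talk):

```python
# Sketch of Iterated Conditional Modes on a toy pairwise objective.

def icm(values, var_ids, score, max_sweeps=10):
    """Coordinate ascent: argmax each variable in turn, holding the rest fixed."""
    assign = {v: values[0] for v in var_ids}
    for _ in range(max_sweeps):
        changed = False
        for v in var_ids:
            best = max(values, key=lambda val: score({**assign, v: val}))
            if best != assign[v]:
                assign[v] = best
                changed = True
        if not changed:          # local optimum reached
            break
    return assign

# toy objective: v1 and v2 want to agree, v3 wants to be 1, v1 prefers 1
def score(a):
    return ((3 if a["v1"] == a["v2"] else 0)
            + (2 if a["v3"] == 1 else 0)
            + (1 if a["v1"] == 1 else 0))

print(icm([0, 1], ["v1", "v2", "v3"], score))
# → {'v1': 0, 'v2': 0, 'v3': 1}
```

Note the result scores 5, while v1 = v2 = v3 = 1 scores 6: from an all-zeros start, no single-variable change improves the objective, so ICM stops short. That is exactly the "greedy, easily falls into local minima" problem the slide raises.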

  38. Approximate Inference 2 • Iterated Conditional Modes (ICM) [Besag 1986]: v4(k+1) = argmax over v4 of P(v4 | v \ v4). • Iterated Conditional Sampling (ICS) (our proposal; related work?): instead of passing only the argmax, pass a sample of the highest-scoring values of P(v4 | v \ v4), i.e. an N-best list (the top N values). Can use a "generalized" version of this, doing exact inference on a region of several nodes at once. Here, a "message" grows only linearly with region size and N!

  39. Sample = N-best List from CRF Segmentation Do exact inference over the linear-chain segmentation regions (observations o, segmentations s), then pass an N-best list of segmentations to the coreference stage (citation attributes c, pairwise coreference variables y, prototype p).

  40. Sample = N-best List from Viterbi [diagram: citation attributes c, parameterized by the N-best lists from the segmentations s of observations o, feeding the pairwise coreference variables y]

  41. Sample = N-best List from Viterbi When calculating similarity with another citation, there is more opportunity to find the correct, matching fields.

  42. Results on 4 Sections of CiteSeer Citations [chart: coreference F1 performance] • Average error reduction is 35%. • "Optimal" makes best use of the N-best list by using true labels. • This indicates that even more improvement can be obtained.

  43. Conclusions • Conditional Random Fields combine the benefits of • Conditional probability models (arbitrary features) • Markov models (for sequences or other relations) • Success in • Factorial finite state models • Coreference analysis • Semi-supervised Learning • Segmentation uncertainty aiding coreference • Future work: • Structure learning. • Further tight integration of IE and Data Mining • Application to Social Network Analysis.

  44. End of Talk

  45. Application Project:

  46. Application Project: [diagram: Research Paper entities connected by a Cites relation]

  47. Application Project: [diagram: entity-relation graph over Person, Research Paper, Grant, Conference, University, and Groups, with Expertise and Cites relations]

  48. Software Infrastructure MALLET: Machine Learning for Language Toolkit • ~60k lines of Java • Document classification, information extraction, clustering, co-reference, POS tagging, shallow parsing, relational classification, … • Many ML basics in a common, convenient framework: naïve Bayes, MaxEnt, Boosting, SVMs, Dirichlets, Conjugate Gradient • Advanced ML algorithms: Conditional Random Fields, Maximum Margin Markov Networks, BFGS, Expectation Propagation, Tree-Reparameterization, … • Unlike other toolkits (e.g. Weka), MALLET scales to millions of features and 100k's of training examples, as needed for NLP. Released as Open Source Software. http://mallet.cs.umass.edu In use at UMass, MIT, CMU, UPenn, …

  49. End of Talk
