Presentation Transcript


  1. Interactive Information Extraction and Social Network Analysis. Andrew McCallum, Information Extraction and Synthesis Laboratory, UMass Amherst

  2. Motivation • Capture the confidence of records in an extracted database • Alert data mining to possible errors in the database

  3. Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]. Finite-state lattice. [Figure: lattice of FSM states (PERSON, TITLE, ORG, OTHER, ...) over the input sequence "said Arden Bement NSF Director ...", with output labels y_t-1 ... y_t+3 above observations x_t-1 ... x_t+3]

  4. Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]. Constrained forward-backward. [Figure: the same lattice, with the forward-backward sum restricted to paths consistent with the extracted field]
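To make the constrained forward-backward idea concrete, here is a minimal sketch, not the paper's code: the confidence of an extracted field is the mass of lattice paths consistent with that field, i.e. a constrained partition function divided by the unconstrained one. The representation (an `init` vector plus `trans` matrices of log potentials) and all function names are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def log_forward(init, trans):
    # init: (S,) log potentials for states at t=0;
    # trans[t]: (S, S) log potentials for moving from state i at t to j at t+1.
    alpha = init
    for M in trans:
        alpha = logsumexp(alpha[:, None] + M, axis=0)
    return logsumexp(alpha)  # log Z

def field_confidence(init, trans, field):
    # field: {position: label} for the extracted field being scored.
    # Constrain the lattice so only paths agreeing with the field survive,
    # then divide the constrained mass by the total mass.
    c_init, c_trans = init.copy(), [M.copy() for M in trans]
    for t, s in field.items():
        mask = np.full(init.shape, -np.inf)
        mask[s] = 0.0
        if t == 0:
            c_init += mask
        else:
            c_trans[t - 1] += mask[None, :]  # constrain the state entered at t
    return np.exp(log_forward(c_init, c_trans) - log_forward(init, trans))
```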

  5. Forward-Backward Confidence Estimation improves the accuracy/coverage trade-off. [Plot: accuracy vs. coverage for our forward-backward confidence, traditional token-wise confidence, no use of confidence, and the optimal ordering]

  6. Application of Confidence Estimation • Interactive information extraction: to correct predictions, direct the user to the least confident field

  7. Interactive Information Extraction • IE algorithm calculates confidence scores • UI uses confidence scores to alert the user to possible errors • IE algorithm takes corrections into account and propagates the correction to other fields

  8. User Correction • The user corrects a field, e.g. dragging "Stanley" to the First Name field. [Figure: tokens x1...x5 = "Charles Stanley 100 Charles Street" with labels y1...y5 drawn from First Name, Last Name, Address Line]

  9. Remove Paths • The user's correction (e.g. dragging "Stanley" to the First Name field) removes all lattice paths that disagree with it. [Figure: the same example, with disallowed label paths pruned]

  10. Constrained Viterbi • The Viterbi algorithm is constrained to pass through the designated state. • An adjacent field changed too: correction propagation. [Figure: tokens "Charles Stanley 100 Charles Street" (x1...x5) with the updated labels y1...y5]
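A hedged sketch of constrained Viterbi in the same illustrative representation as above (not the original implementation): the decoder is forced through the user-corrected states, and because the best path is recomputed globally, adjacent fields can change too, which is the correction-propagation effect.

```python
import numpy as np

def constrained_viterbi(init, trans, fixed):
    # init: (S,) log potentials at t=0; trans[t]: (S, S) log potentials;
    # fixed: {position: state} chosen by the user's correction.
    S, NEG = init.shape[0], -np.inf

    def mask(t, scores):
        if t in fixed:                       # force the designated state
            keep = scores[fixed[t]]
            scores = np.full(S, NEG)
            scores[fixed[t]] = keep
        return scores

    delta, back = mask(0, init.copy()), []
    for t, M in enumerate(trans, start=1):
        cand = delta[:, None] + M            # score of entering each state
        back.append(cand.argmax(axis=0))
        delta = mask(t, cand.max(axis=0))
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]                        # globally best constrained path
```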

  11. Constrained Viterbi • After fixing the least confident field, constrained Viterbi automatically reduces error by another 23%. • Recent work reduces annotation effort further by simplifying annotation to multiple choice. [Figure: multiple-choice options A) and B)]

  12. Labeling for Extraction vs. Labeling for Classification. Labeling for extraction: [example announcement: "Seminar: How to Organize your Life, by Jane Smith, Stevenson & Smith. Mezzanine Level, Papadapoulos Sq. 3:30 pm Thursday March 31. In this seminar we will learn how to use CALO to..."] requires click, drag, adjust, label; click, drag, adjust, label; ... Painful: difficult even for paid labelers, and needs complex tools. Labeling for classification: the same announcement gets one label among "Seminar announcement", "Todo request", "Other". Easy: often found in user interfaces, e.g. CALO IRIS, Apple Mail. The aim: user feedback "in the wild" as labeling.

  13. Multiple-choice Annotation for Learning Extractors "in the wild" [Culotta, McCallum 2005]. Task: information extraction. Fields: NAME, COMPANY, ADDRESS (and others). Example string: "Jane Smith , Stevenson & Smith , Mezzanine Level, Papadopoulos Sq." The interface presents the top hypothesized segmentations [figure: three candidate segmentations of the string, differing in where the field boundaries fall]; the user corrects labels, not segmentations.

  14. Multiple-choice Annotation for Learning Extractors "in the wild" [Culotta, McCallum 2005]. Task: information extraction. Fields: NAME, COMPANY, ADDRESS (and others). The interface presents the top hypothesized segmentations of "Jane Smith , Stevenson & Smith , Mezzanine Level, Papadopoulos Sq." [figure: the same three candidates]; the user corrects labels, not segmentations.

  15. Multiple-choice Annotation for Learning Extractors "in the wild" [Culotta, McCallum 2005]. Task: information extraction. Fields: NAME, COMPANY, ADDRESS (and others). The interface presents the top hypothesized segmentations of "Jane Smith , Stevenson & Smith , Mezzanine Level, Papadopoulos Sq." [figure: the three candidate segmentations]; the user corrects labels, not segmentations. Result: a 29% reduction in the user actions needed to train.

  16. Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum, 2005]. Target task: emailed seminar-announcement entities over email English words. Only 60k words of training data: too little labeled training data. [Example announcement: "GRAND CHALLENGES FOR MACHINE LEARNING. Jaime Carbonell, School of Computer Science, Carnegie Mellon University. 3:30 pm, 7500 Wean Hall. Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on."]

  17. Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum, 2005]. Train on a "related" task with more data: newswire named entities over newswire English words, 200k words of training data. [Example: "CRICKET - MILLNS SIGNS FOR BOLAND. CAPE TOWN 1996-08-22. South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional."]

  18. Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum, 2005]. At test time, label email with newswire NEs... [Figure: newswire named-entities layer over email English words]

  19. Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum, 2005]. ...then use these labels as features for the final task. [Figure: emailed seminar-announcement entities layer on top of newswire named entities, over email English words]

  20. Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum, 2005]. Use joint inference at test time over all three layers: seminar announcement entities, newswire named entities, English words. An alternative to hierarchical Bayes: needn't know anything about the parameterization of the subtask. Accuracy: no transfer < cascaded transfer < joint-inference transfer.

  21. A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance. Andrew McCallum, Kedar Bellare, Fernando Pereira. Thanks to Charles Sutton, Xuerui Wang and Mikhail Bilenko for helpful discussions.

  22. String Edit Distance • Distance between sequences x and y: • “cost” of lowest-cost sequence of edit operations that transform string x into y.

  23. String Edit Distance • Distance between sequences x and y: "cost" of the lowest-cost sequence of edit operations that transform string x into y. • Applications • Database Record Deduplication: "Apex International Hotel Grassmarket Street" vs. "Apex Internat'l Grasmarket Street". Are these records duplicates of the same hotel?

  24. String Edit Distance • Distance between sequences x and y: • "cost" of lowest-cost sequence of edit operations that transform string x into y. • Applications • Database Record Deduplication • Biological Sequences: AGCTCTTACGATAGAGGACTCCAGA / AGGTCTTACCAAAGAGGACTTCAGA

  25. String Edit Distance • Distance between sequences x and y: • "cost" of lowest-cost sequence of edit operations that transform string x into y. • Applications • Database Record Deduplication • Biological Sequences • Machine Translation: "Il a acheté une pomme" / "He bought an apple"

  26. String Edit Distance • Distance between sequences x and y: • "cost" of lowest-cost sequence of edit operations that transform string x into y. • Applications • Database Record Deduplication • Biological Sequences • Machine Translation • Textual Entailment: "He bought a new car last night" / "He purchased a brand new automobile yesterday evening"

  27. Levenshtein Distance [1966]. Edit operations: copy (copy a character from x to y, cost 0); insert (insert a character into y, cost 1); delete (delete a character from x, cost 1); subst (substitute one character for another, cost 1). Align two strings: x1 = "William W. Cohon", x2 = "Willleam Cohen". Lowest-cost alignment: W i l l i a m _ W . _ C o h o n against W i l l l e a m _ C o h e n, using 11 copies (cost 0 each) plus 2 substitutions, 1 insert, and 3 deletes (cost 1 each). Total cost = 6 = Levenshtein distance.

  28. Levenshtein Distance. Edit operations as before: copy (cost 0); insert, delete, subst (cost 1 each). Dynamic program: D(i,j) = score of the best alignment from x1...xi to y1...yj, computed as
  D(i,j) = min{ D(i-1,j-1) + 1(xi ≠ yj),  D(i-1,j) + 1,  D(i,j-1) + 1 }
  (copy/subst, delete, insert respectively); the total cost in the final cell is the distance. [Table: the DP matrix for "William" vs. "Willleam", with the minimal-cost path highlighted]
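The recurrence above translates directly into a few lines of code; a minimal sketch (not from the talk), checked against the slide's worked example:

```python
def levenshtein(x, y):
    # D[i][j] = cost of the best alignment of x[:i] with y[:j];
    # copy costs 0, insert/delete/subst each cost 1.
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        D[i][0] = i                      # delete everything
    for j in range(1, len(y) + 1):
        D[0][j] = j                      # insert everything
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # copy or subst
                D[i - 1][j] + 1,                           # delete
                D[i][j - 1] + 1,                           # insert
            )
    return D[len(x)][len(y)]

assert levenshtein("William W. Cohon", "Willleam Cohen") == 6  # slide 27's total
```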

  29. Levenshtein Distance with Markov Dependencies. Edit-operation costs now depend on the previous operation (a cost for each operation after a copy, insert, delete, or subst), so that, e.g., a repeated delete is cheaper than a first delete. Learn these costs from training data. The dynamic-programming table becomes 3D: D(i, j, previous operation). [Table: the per-operation cost matrix; figure: the DP matrix with subst/copy/delete/insert transitions]
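A hedged sketch of the 3D dynamic program. The cost table here is a hypothetical stand-in; the talk's point is that these context-dependent costs are learned from training data.

```python
from math import inf

OPS = ('copy', 'subst', 'delete', 'insert')

def markov_edit_distance(x, y, cost):
    # cost[prev][op]: cost of op when the previous operation was prev,
    # e.g. cost['delete']['delete'] < cost['copy']['delete'] makes a
    # repeated delete cheaper, as on the slide.
    # D[i][j][op] = best cost aligning x[:i] with y[:j], ending with op.
    D = [[dict.fromkeys(OPS, inf) for _ in range(len(y) + 1)]
         for _ in range(len(x) + 1)]
    D[0][0]['copy'] = 0.0  # dummy start: the first op "follows a copy"
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for prev, c in D[i][j].items():
                if c == inf:
                    continue
                if i < len(x) and j < len(y):
                    op = 'copy' if x[i] == y[j] else 'subst'
                    step = c + cost[prev][op]
                    D[i + 1][j + 1][op] = min(D[i + 1][j + 1][op], step)
                if i < len(x):
                    step = c + cost[prev]['delete']
                    D[i + 1][j]['delete'] = min(D[i + 1][j]['delete'], step)
                if j < len(y):
                    step = c + cost[prev]['insert']
                    D[i][j + 1]['insert'] = min(D[i][j + 1]['insert'], step)
    return min(D[len(x)][len(y)].values())
```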

  30. Ristad & Yianilos (1997). Essentially a pair-HMM, generating an edit/state/alignment sequence and two strings. [Figure: x1 = "William W. Cohon" aligned to x2 = "Willleam Cohen"; each step records a.i1 (position in string 1), a.e (the edit operation), and a.i2 (position in string 2)]. The match score is the incomplete-data likelihood: a sum over all alignments consistent with x1 and x2 (the complete-data likelihood scores one particular alignment). Given a training set of matching string pairs, this is the objective function. Learn via EM. Expectation step: calculate the likelihood of alignment paths. Maximization step: make those paths more likely.

  31. Ristad & Yianilos Regrets • Limited features of the input strings: examine only a single character pair at a time; difficult to use upcoming string context, lexicons, ... Example: "Senator John Green" vs. "John Green". • Limited edit operations: difficult to generate arbitrary jumps in both strings. Example: "UMass" vs. "University of Massachusetts". • Trained only on positive match data: doesn't include information-rich "near misses". Example: "ACM SIGIR" ≠ "ACM SIGCHI". So, consider a model trained by conditional probability.

  32. Conditional Probability (Sequence) Models • We prefer a model trained to maximize a conditional probability rather than a joint probability: P(y|x) instead of P(y,x). • It can examine features without being responsible for generating them, and doesn't have to explicitly model their dependencies.

  33. From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]. Joint: an HMM factors P(y,x) over the chain y_t-1, y_t, y_t+1, ... and observations x_t-1, x_t, x_t+1, ... Linear-chain conditional model (a super-special case of Conditional Random Fields):
  p(y|x) = (1/Z(x)) ∏_t exp( Σ_k λ_k f_k(y_t-1, y_t, x, t) )
  Set parameters by maximum likelihood, using an optimization method on L. Wide-spread interest, positive experimental results in many applications: noun phrase and named entity extraction [HLT'03], [CoNLL'03]; protein structure prediction [ICML'04]; IE from bioinformatics text [Bioinformatics '04]; Asian word segmentation [COLING'04], [ACL'04]; IE from research papers [HLT'04]; object classification in images [CVPR '04].
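A sketch of the conditional model, with an assumed representation: the feature sums Σ_k λ_k f_k(...) are precomputed from x into log potentials, which is not necessarily how any particular implementation stores them.

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_prob(init, trans, y):
    # init[s]      = sum_k lambda_k f_k(start, s, x, 0)
    # trans[t][i,j] = sum_k lambda_k f_k(i, j, x, t+1)
    # y: the label sequence as ints, len(y) == len(trans) + 1.
    path = init[y[0]] + sum(M[y[t], y[t + 1]] for t, M in enumerate(trans))
    alpha = init
    for M in trans:                               # forward recursion for Z(x)
        alpha = logsumexp(alpha[:, None] + M, axis=0)
    return path - logsumexp(alpha)                # log p(y|x)
```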

  34. CRF String Edit Distance. Want to train from a set of string pairs, each labeled one of {match, non-match}:
  match      "William W. Cohon"  /  "Willlleam Cohen"
  non-match  "Bruce D'Ambrosio"  /  "Bruce Croft"
  match      "Tommi Jaakkola"    /  "Tommi Jakola"
  match      "Stuart Russell"    /  "Stuart Russel"
  non-match  "Tom Deitterich"    /  "Tom Dean"
  Instead of the joint complete-data likelihood, optimize the conditional complete-data likelihood. [Figure: the alignment of x1 and x2 with its edit-operation sequence, as on slide 30]

  35. CRF String Edit Distance FSM. [Figure: a four-state FSM over the edit operations subst, copy, delete, insert]

  36. CRF String Edit Distance FSM: the conditional incomplete-data likelihood sums over alignments. [Figure: from a Start state, two fully connected branches of subst/copy/delete/insert states: a match branch (m = 1) and a non-match branch (m = 0)]
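A hedged sketch of how the two-branch FSM turns alignment sums into a match probability. Here `log_phi` is a hypothetical per-operation scoring function standing in for the learned feature weights of one branch; the dynamic program is the sum (logsumexp) analogue of the edit-distance min.

```python
import numpy as np

def log_alignment_sum(x, y, log_phi):
    # Sum over all alignments of x and y in one FSM branch, in log space;
    # log_phi(op, i, j) scores a single edit operation at positions (i, j).
    A = np.full((len(x) + 1, len(y) + 1), -np.inf)
    A[0, 0] = 0.0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if A[i, j] == -np.inf:
                continue
            if i < len(x) and j < len(y):
                op = 'copy' if x[i] == y[j] else 'subst'
                A[i + 1, j + 1] = np.logaddexp(A[i + 1, j + 1],
                                               A[i, j] + log_phi(op, i, j))
            if i < len(x):
                A[i + 1, j] = np.logaddexp(A[i + 1, j],
                                           A[i, j] + log_phi('delete', i, j))
            if j < len(y):
                A[i, j + 1] = np.logaddexp(A[i, j + 1],
                                           A[i, j] + log_phi('insert', i, j))
    return A[len(x), len(y)]

def p_match(x, y, log_phi_match, log_phi_nonmatch):
    # p(m=1 | x, y): mass of alignments in the match branch over total mass.
    zm = log_alignment_sum(x, y, log_phi_match)
    zn = log_alignment_sum(x, y, log_phi_nonmatch)
    return float(np.exp(zm - np.logaddexp(zm, zn)))
```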

  37. CRF String Edit Distance FSM. x1 = "Tommi Jaakkola", x2 = "Tommi Jakola". Probability summed over all alignments in match states: 0.8. Probability summed over all alignments in non-match states: 0.2.

  38. CRF String Edit Distance FSM. x1 = "Tom Dietterich", x2 = "Tom Dean". Probability summed over all alignments in match states: 0.1. Probability summed over all alignments in non-match states: 0.9.

  39. Parameter Estimation. Given a training set of string pairs and match/non-match labels, the objective function is the incomplete log likelihood; the complete log likelihood additionally includes the hidden alignments. Expectation Maximization: • E-step: estimate the distribution over alignments using the current parameters. • M-step: change parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS). This is "conditional EM", but it avoids the complexities of [Jebara 1998] because there is no need to solve the M-step in closed form.
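The slide's formulas did not survive the transcript; below is a hedged reconstruction in standard notation, with latent alignment a, parameters Λ, and label m_i for the pair (x_1^(i), x_2^(i)). The Gaussian form of the penalty is an assumption.

```latex
% Incomplete-data conditional log likelihood (sums out the alignment):
\mathcal{L}(\Lambda)
  = \sum_i \log p_\Lambda\!\left(m_i \mid x_1^{(i)}, x_2^{(i)}\right)
  = \sum_i \log \sum_a p_\Lambda\!\left(m_i, a \mid x_1^{(i)}, x_2^{(i)}\right)

% E-step: q_i(a) = p_{\Lambda_{\text{old}}}\!\left(a \mid m_i, x_1^{(i)}, x_2^{(i)}\right)
% M-step: maximize the expected complete (penalized) log likelihood
\sum_i \sum_a q_i(a)\, \log p_\Lambda\!\left(m_i, a \mid x_1^{(i)}, x_2^{(i)}\right)
  \;-\; \frac{\lVert \Lambda \rVert^2}{2\sigma^2}
```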

  40. Efficient Training • The dynamic programming table is 3D: with |x1| = |x2| = 100 and |S| = 12 states, that is 100 × 100 × 12 = 120,000 entries. • Use beam search during the E-step [Pal, Sutton, McCallum 2005]. • Unlike completely observed CRFs, the objective function is not convex: initialize parameters not at zero, but so as to yield a reasonable initial edit distance.

  41. What Alignments are Learned? x1 = "Tommi Jaakkola", x2 = "Tommi Jakola". [Figure: the learned alignment in the match branch, aligning T o m m i J a a k k o l a with T o m m i J a k o l a]

  42. What Alignments are Learned? x1 = "Bruce Croft", x2 = "Tom Dean". [Figure: the learned alignment in the non-match branch, aligning B r u c e C r o f t with T o m D e a n]

  43. What Alignments are Learned? x1 = "Jaime Carbonell", x2 = "Jamie Callan". [Figure: the learned alignment, aligning J a i m e C a r b o n e l l with J a m i e C a l l a n]

  44. Summary of Advantages • Arbitrary features of the input strings: examine past and future context; use lexicons, WordNet. • Extremely flexible edit operations: a single operation may make arbitrary jumps in both strings, of a size determined by input features. • Discriminative training: maximize the ability to predict match vs. non-match.

  45. Experimental Results: Data Sets • Restaurant name, restaurant address: 864 records, 112 matches. E.g. "Abe's Bar & Grill, E. Main St" vs. "Abe's Grill, East Main Street". • People names, UIS DB generator, synthetic noise. E.g. "John Smith" vs. "Snith, John". • CiteSeer citations, in four sections: Reason, Face, Reinforce, Constraint. E.g. "Rusell & Norvig, 'Artificial Intelligence: A Modern...'" vs. "Russell & Norvig, 'Artificial Intelligence: An Intro...'".

  46. Experimental Results: Features • same, different • same-alphabetic, different-alphabetic • same-numeric, different-numeric • punctuation1, punctuation2 • alphabet-mismatch, numeric-mismatch • end-of-1, end-of-2 • same-next-character, different-next-character

  47. Experimental Results: Edit Operations • insert, delete, substitute/copy • swap-two-characters • skip-word-if-in-lexicon • skip-parenthesized-words • skip-any-word • substitute-word-pairs-in-translation-lexicon • skip-word-if-present-in-other-string

  48. Experimental Results [Bilenko & Mooney 2003]. F1 (harmonic mean of precision and recall):

  Distance metric   Rest. name  Rest. address  CiteSeer-Reason  Face   Reinf  Constraint
  Levenshtein       0.290       0.686          0.927            0.952  0.893  0.924
  Learned Leven.    0.354       0.712          0.938            0.966  0.907  0.941
  Vector            0.365       0.380          0.897            0.922  0.903  0.923
  Learned Vector    0.433       0.532          0.924            0.875  0.808  0.913

  49. Experimental Results [Bilenko & Mooney 2003]. F1 (harmonic mean of precision and recall), now including the CRF edit distance:

  Distance metric    Rest. name  Rest. address  CiteSeer-Reason  Face   Reinf  Constraint
  Levenshtein        0.290       0.686          0.927            0.952  0.893  0.924
  Learned Leven.     0.354       0.712          0.938            0.966  0.907  0.941
  Vector             0.365       0.380          0.897            0.922  0.903  0.923
  Learned Vector     0.433       0.532          0.924            0.875  0.808  0.913
  CRF Edit Distance  0.448       0.783          0.964            0.918  0.917  0.976

  50. Experimental Results. Data set: person names, with word-order noise added.
  Without skip-word-if-present-in-other-string: F1 = 0.856
  With skip-word-if-present-in-other-string:    F1 = 0.981
