Record Linkage and String Distance Metrics in Data Integration

Distance functions and IE - 3 William W. Cohen CALD

Announcements
• No meeting this Wed, March 24
• Thurs, March 25: talk from Carlos Guestrin on max-margin Markov nets
  • Newell-Simon Hall 1507 at 9:30am
  • no wait! Make that Wean Hall 4625
• Writeups:
  • today: “distance metrics for text” (three papers)

Record linkage: definition
• Record linkage: determine if pairs of data records describe the same entity
  • I.e., find record pairs that are co-referent
  • Entities: usually people (or organizations or…)
  • Data records: names, addresses, job titles, birth dates, …
• Main applications:
  • Joining two heterogeneous relations
  • Removing duplicates from a single relation
  • Storing results of information extraction in a database, or answering queries that involve information extracted from different places
• Key step: measuring similarity of two strings
  • TFIDF metric (WHIRL)
  • Edit distance (Monge-Elkan)

The data integration problem

Levenshtein distance - example
• distance(“William Cohen”, “Willliam Cohon”)
• [Alignment table with rows s, t, op, and cost; the gap falls in s; cell contents not recovered]

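Computing Levenshtein distance
$$
D(i,j) = \min
\begin{cases}
D(i-1,j-1) + d(s_i,t_j) & \text{// subst/copy} \\
D(i-1,j) + 1 & \text{// insert} \\
D(i,j-1) + 1 & \text{// delete}
\end{cases}
$$
The last cell of the table, D(|s|,|t|), = D(s,t).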
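The recurrence above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the lecture: the function name `levenshtein` is hypothetical, d(s_i, t_j) is assumed to be 0 for a copy and 1 for a substitution, and only two rows of the D table are kept in memory.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance computed with the D(i, j) recurrence."""
    # prev holds row i-1 of the table; row 0 is D(0, j) = j.
    prev = list(range(len(t) + 1))
    for i, si in enumerate(s, start=1):
        curr = [i]  # D(i, 0) = i
        for j, tj in enumerate(t, start=1):
            d = 0 if si == tj else 1  # assumed d(si, tj): copy 0, substitution 1
            curr.append(min(prev[j - 1] + d,   # D(i-1, j-1) + d(si, tj)
                            prev[j] + 1,       # D(i-1, j) + 1
                            curr[j - 1] + 1))  # D(i, j-1) + 1
        prev = curr
    return prev[-1]  # D(|s|, |t|)


print(levenshtein("William Cohen", "Willliam Cohon"))  # 2: one extra "l", one e -> o
```

Filling the table touches every (i, j) pair once, which is where the quadratic cost of edit-distance metrics comes from.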

Smith-Waterman distance
      c  o  h  e  n  d  o  r  f
  m   0  0  0  0  0  0  0  0  0
  c   1  0  0  0  0  0  0  0  0
  c   0  0  0  0  0  0  0  0  0
  o   0  2  1  0  0  0  2  1  0
  h   0  1  4  3  2  1  1  1  0
  n   0  0  3  3  5  4  3  2  1
  s   0  0  2  2  4  4  3  2  1
  k   0  0  1  1  3  3  3  2  1
  i   0  0  0  0  2  2  2  2  1
dist=5
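A Smith-Waterman table like the one above can be filled with a max-based recurrence in which no cell drops below zero, so the best score anywhere in the table is the score of the best local alignment. The sketch below is illustrative only: the function name and the scoring values (match +2, mismatch -1, gap -1) are assumptions, so its scores need not agree with the numbers in the slide's table.

```python
def smith_waterman(s: str, t: str, match: int = 2, mismatch: int = -1, gap: int = 1) -> int:
    """Best local-alignment score between s and t (scores are assumed values)."""
    best = 0
    prev = [0] * (len(t) + 1)  # row i-1 of the table; row 0 is all zeros
    for si in s:
        curr = [0]  # column 0 is all zeros
        for j, tj in enumerate(t, start=1):
            d = match if si == tj else mismatch
            score = max(0,                  # start a fresh local alignment
                        prev[j - 1] + d,    # subst/copy
                        prev[j] - gap,      # gap in t
                        curr[j - 1] - gap)  # gap in s
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman("mccohnski", "cohendorf"))
```

Because every cell is clamped at zero, unrelated prefixes and suffixes ("mc…ski", "…dorf") do not penalize the shared core "coh(e)n".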

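Affine gap distances - 3
Without affine gaps:
$$
D(i,j) = \max
\begin{cases}
D(i-1,j-1) + d(s_i,t_j) & \text{// subst/copy} \\
D(i-1,j) - 1 & \text{// insert} \\
D(i,j-1) - 1 & \text{// delete}
\end{cases}
$$
With affine gaps:
$$
D(i,j) = \max
\begin{cases}
D(i-1,j-1) + d(s_i,t_j) \\
IS(i-1,j-1) + d(s_i,t_j) \\
IT(i-1,j-1) + d(s_i,t_j)
\end{cases}
$$
$$
IS(i,j) = \max
\begin{cases}
D(i-1,j) - A \\
IS(i-1,j) - B
\end{cases}
$$
IS(i,j): best score in which s_i is aligned with a ‘gap’
$$
IT(i,j) = \max
\begin{cases}
D(i,j-1) - A \\
IT(i,j-1) - B
\end{cases}
$$
IT(i,j): best score in which t_j is aligned with a ‘gap’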
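The three-table recurrence can be sketched as follows. Several things here are assumptions rather than content from the slides: A is read as the cost of opening a gap and B as the cost of extending one, the boundary rows and columns are initialized for a global (whole-string) alignment, and the match/mismatch scores and the function name `affine_gap_score` are illustrative.

```python
NEG_INF = float("-inf")


def affine_gap_score(s: str, t: str, match: float = 2.0, mismatch: float = -1.0,
                     A: float = 2.0, B: float = 0.5) -> float:
    """Global alignment score using the D / IS / IT recurrences (assumed costs)."""
    n, m = len(s), len(t)
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    IS = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    IT = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):  # leading characters of s aligned with one long gap
        IS[i][0] = -A - (i - 1) * B
    for j in range(1, m + 1):  # leading characters of t aligned with one long gap
        IT[0][j] = -A - (j - 1) * B
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            # s_i aligned with t_j, whatever state the previous column pair was in
            D[i][j] = max(D[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1]) + d
            # s_i aligned with a gap: open a gap (-A) or extend one (-B)
            IS[i][j] = max(D[i - 1][j] - A, IS[i - 1][j] - B)
            # t_j aligned with a gap: open a gap (-A) or extend one (-B)
            IT[i][j] = max(D[i][j - 1] - A, IT[i][j - 1] - B)
    return max(D[n][m], IS[n][m], IT[n][m])


print(affine_gap_score("abcd", "ad"))  # 1.5: two matches (+4), one gap of length 2 (-2.5)
```

With B smaller than A, one long gap is cheaper than several short ones, which suits strings where a whole token (a middle name, a title) is missing from one side.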


Inference in WHIRL
• Explode p(X1,X2,X3): find all DB tuples <p,a1,a2,a3> for p and bind Xi to ai.
• Constrain X~Y: if X is bound to a and Y is unbound,
  • find DB column C to which Y should be bound
  • pick a term t in X, find proper inverted index for t in C, and bind Y to something in that index
• Keep track of t’s used previously, and don’t allow Y to contain one.

String distance metrics so far...
• Term-based (e.g. TF/IDF as in WHIRL)
  • Distance depends on set of words contained in both s and t – so sensitive to spelling errors.
  • Usually weight words to account for “importance”
  • Fast comparison: O(n log n) for |s|+|t|=n
• Edit-distance metrics
  • Distance is the cost of the shortest sequence of edit commands that transforms s to t.
  • No notion of word importance
  • More expensive: O(n^2)
• Other metrics
  • Jaro metric & variants
  • Monge-Elkan’s recursive string matching
  • etc?
• Which metrics work best, for which problems?
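To make the term-based bullet concrete, here is a small TF-IDF cosine sketch. It is not WHIRL's implementation: the weighting log(tf + 1) · log(1 + N/df), the whitespace tokenizer, and the helper names `tfidf_vectors` and `cosine` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(strings):
    """Return one unit-length {token: weight} dict per input string."""
    docs = [s.lower().split() for s in strings]
    n = len(docs)
    # Document frequency: the number of strings each token appears in.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting; rarer tokens get larger weights ("importance").
        vec = {w: math.log(c + 1) * math.log(1 + n / df[w]) for w, c in tf.items()}
        norm = math.sqrt(sum(x * x for x in vec.values()))
        vectors.append({w: x / norm for w, x in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(x * v[w] for w, x in u.items() if w in v)


names = ["William Cohen", "William W. Cohen", "Willliam Cohon", "Carlos Guestrin"]
vecs = tfidf_vectors(names)
print(cosine(vecs[0], vecs[1]))  # positive: both tokens of the first name are shared
print(cosine(vecs[0], vecs[2]))  # 0: the misspelled tokens never match exactly
```

The second comparison shows the sensitivity to spelling errors noted above: two strings that are two edits apart share no tokens, so their term-based similarity is zero.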

Jaro metric
• Jaro metric is (apparently) tuned for personal names:
  • Given (s,t), define c to be common in s,t if s_i = c, t_j = c, and |i-j| < min(|s|,|t|)/2.
  • Define c,d to be a transposition if c,d are common and c,d appear in different orders in s and t.
  • Jaro(s,t) = average of #common/|s|, #common/|t|, and (#common - 0.5·#transpositions)/#common
• Variant: weight errors early in string more heavily
• Fast to compute
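A minimal sketch of the Jaro computation follows. It uses the matching window from the slide, |i-j| < min(|s|,|t|)/2, and takes the third term as (#common - 0.5·#transpositions)/#common, which is the form in the usual statement of the metric. The greedy left-to-right matching and the function name `jaro` are assumptions of this sketch.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if not s or not t:
        return 0.0
    h = min(len(s), len(t)) / 2  # window: characters match only if |i - j| < h
    t_used = [False] * len(t)
    s_common = []
    for i, c in enumerate(s):
        for j, d in enumerate(t):
            if not t_used[j] and c == d and abs(i - j) < h:
                t_used[j] = True  # each character of t is matched at most once
                s_common.append(c)
                break
    if not s_common:
        return 0.0
    t_common = [d for j, d in enumerate(t) if t_used[j]]
    # Common characters that appear in a different order in s and t.
    out_of_order = sum(a != b for a, b in zip(s_common, t_common))
    m = len(s_common)
    return (m / len(s) + m / len(t) + (m - 0.5 * out_of_order) / m) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444
```

In the example all six characters are common and the pair T/H appears in swapped order, giving (1 + 1 + 5/6) / 3.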

Jaro metric

Winkler-Jaro metric


So which metric should you use?
SecondString (Cohen, Ravikumar, Fienberg):
• Java toolkit of string-matching methods from AI, Statistics, IR and DB communities
• Tools for evaluating performance on test data
• Exploratory tool for adding, testing, combining string distances
  • e.g. SecondString implements a generic “Winkler rescorer” which can rescale any distance function with range of [0,1]
• URL – http://secondstring.sourceforge.net
• Distribution also includes several sample matching problems.
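The idea of a generic rescorer can be illustrated in Python. This is not SecondString's API (SecondString is a Java toolkit); the function `winkler_rescore` is hypothetical, and the prefix scale of 0.1 and prefix cap of 4 are the commonly used Jaro-Winkler constants, assumed here rather than taken from the slides.

```python
def winkler_rescore(score: float, s: str, t: str,
                    prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Boost a similarity score in [0, 1] for strings that share a prefix."""
    p = 0  # length of the common prefix, capped at max_prefix
    for a, b in zip(s, t):
        if a != b or p == max_prefix:
            break
        p += 1
    # Move the score toward 1.0 by a fraction that grows with the shared prefix.
    return score + p * prefix_scale * (1.0 - score)


# Rescoring a base similarity of 17/18 for two strings sharing the prefix "MAR".
print(round(winkler_rescore(17 / 18, "MARTHA", "MARHTA"), 4))  # 0.9611
```

Because the adjustment only needs a score in [0,1] and the two strings, it can wrap any similarity function with that range, not just Jaro.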

SecondString distance functions
• Edit-distance like:
  • Levenshtein – unit costs
  • untuned Smith-Waterman
  • Monge-Elkan (tuned Smith-Waterman)
  • Jaro and Jaro-Winkler
  • Less ad hoc Jaro variants
• Term-based
  • TFIDF
  • Jaccard distance

SecondString distance functions
• Edit-distance like:
  • Levenshtein – unit costs
  • untuned Smith-Waterman
  • Monge-Elkan (tuned Smith-Waterman)
  • Jaro and Jaro-Winkler

Results - Edit Distances
• Monge-Elkan is the best on average.

Edit distances

SecondString distance functions
• Term-based, for sets of terms S and T:
  • TFIDF distance
  • Jaccard distance: |S ∩ T| / |S ∪ T|
  • Language models: construct P_S and P_T and use [formula not recovered]

SecondString distance functions
• Term-based, for sets of terms S and T:
  • TFIDF distance
  • Jaccard distance
  • Jensen-Shannon distance
    • smoothing toward union of S,T reduces cost of disagreeing on common terms
    • unsmoothed P_S, Dirichlet smoothing, Jelinek-Mercer
  • “Simplified Fellegi-Sunter”
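Two of the term-based measures listed above can be sketched briefly. The Jaccard function uses the standard definition |S ∩ T| / |S ∪ T|. The Jensen-Shannon function is an assumption-laden illustration: it builds unsmoothed unigram models P_S and P_T from token counts and returns the base-2 Jensen-Shannon divergence (0 for identical models, 1 for disjoint ones); the smoothing options named on the slide are not shown, and both function names are hypothetical.

```python
import math
from collections import Counter


def jaccard(S, T) -> float:
    """|S ∩ T| / |S ∪ T| over the sets of tokens in S and T."""
    S, T = set(S), set(T)
    if not S and not T:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(S & T) / len(S | T)


def jensen_shannon(S, T) -> float:
    """Base-2 JS divergence between unsmoothed unigram models of two non-empty token lists."""
    cs, ct = Counter(S), Counter(T)
    ns, nt = sum(cs.values()), sum(ct.values())
    div = 0.0
    for w in set(cs) | set(ct):
        p, q = cs[w] / ns, ct[w] / nt  # P_S(w) and P_T(w)
        mid = (p + q) / 2              # the averaged model
        if p > 0:
            div += 0.5 * p * math.log2(p / mid)
        if q > 0:
            div += 0.5 * q * math.log2(q / mid)
    return div


a = "william w cohen".split()
b = "william cohen".split()
print(jaccard(a, b))         # 2 shared tokens out of 3 distinct tokens
print(jensen_shannon(a, b))  # small but nonzero: the models differ only on "w"
```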

Results – Token Distances

SecondString distance functions
• Hybrid term-based & edit-distance based:
  • Monge-Elkan’s “recursive matching scheme”, segmenting strings at token boundaries (rather than separators like commas)
  • SoftTFIDF
    • Like TFIDF but consider not just tokens in both S and T, but tokens in S “close to” something in T (“close to” relative to some distance metric)
    • Downweight close tokens slightly
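The token-level recursive matching scheme can be sketched as: split both strings into tokens, score each token of s against its best-matching token of t with a secondary similarity, and average those best scores. In this sketch the secondary similarity is `difflib.SequenceMatcher.ratio()` from the Python standard library, swapped in for brevity; it is not the tuned Smith-Waterman score that Monge and Elkan used, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def token_sim(a: str, b: str) -> float:
    """Stand-in secondary similarity in [0, 1] (an assumption, not Monge-Elkan's)."""
    return SequenceMatcher(None, a, b).ratio()


def recursive_match(s: str, t: str, sim=token_sim) -> float:
    """Average, over tokens of s, of the best similarity to any token of t."""
    A, B = s.split(), t.split()
    if not A or not B:
        return 0.0
    return sum(max(sim(a, b) for b in B) for a in A) / len(A)


print(recursive_match("william cohen", "cohen william"))   # 1.0: token order is ignored
print(recursive_match("william cohen", "willliam cohon"))  # high despite both misspellings
```

The score is not symmetric in general, since it averages over the tokens of the first string only. Unlike a purely term-based measure, misspelled tokens still contribute because they are compared character by character.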

Results – Hybrid distances

Results - Overall

Prospective test on two clustering tasks

An anomalous dataset

An anomalous dataset: census

An anomalous dataset: census Why?
