Learn about the concept of record linkage to determine if data records describe the same entity, applications of measuring similarity between strings, and various string distance metrics for data integration, including TFIDF and edit distance. Explore examples and tools like SecondString for evaluating and combining string distances.
Distance functions and IE - 3 William W. Cohen CALD
Announcements • No meeting this Wed March 24 • March 25 Thurs – talk from Carlos Guestrin on max-margin Markov nets • Newell-Simon Hall 1507 at 9:30am • no wait! – make that Wean Hall 4625 • Writeups: • today: “distance metrics for text” – three papers
Record linkage: definition • Record linkage: determine if pairs of data records describe the same entity • I.e., find record pairs that are co-referent • Entities: usually people (or organizations or…) • Data records: names, addresses, job titles, birth dates, … • Main applications: • Joining two heterogeneous relations • Removing duplicates from a single relation • Storing results of information extraction in a database, or answering queries that involve information extracted from different places • Key step: measuring similarity of two strings • TFIDF metric (WHIRL) • Edit distance (Monge-Elkan)
Levenshtein distance - example • distance(“William Cohen”, “Willliam Cohon”) • [Figure: alignment of s and t with a gap; rows: s, t, op, cost]
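Computing Levenshtein distance
$$D(i,j) = \min \begin{cases} D(i-1,j-1) + d(s_i,t_j) & \text{// subst/copy} \\ D(i-1,j) + 1 & \text{// insert} \\ D(i,j-1) + 1 & \text{// delete} \end{cases}$$
The final entry gives the distance: $D(|s|,|t|) = D(s,t)$.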
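The recurrence can be sketched in a few lines of Python. This is a minimal illustration, assuming unit costs (d(si,tj) is 0 for a copy and 1 for a substitution); the function name `levenshtein` is chosen here for illustration and is not from the slides.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance, filled in with the D(i,j) = min(...) recurrence."""
    n, m = len(s), len(t)
    # D[i][j] = distance between the first i chars of s and the first j chars of t.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # reaching the empty prefix of t takes i unit-cost edits
    for j in range(1, m + 1):
        D[0][j] = j  # reaching the empty prefix of s takes j unit-cost edits
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1  # copy is free, substitution costs 1
            D[i][j] = min(
                D[i - 1][j - 1] + d,  # subst/copy
                D[i - 1][j] + 1,      # insert
                D[i][j - 1] + 1,      # delete
            )
    return D[n][m]


print(levenshtein("William Cohen", "Willliam Cohon"))  # prints 2
```

The two edits in the example pair are the extra "l" and the substitution of "e" by "o".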
Smith-Waterman distance
      c  o  h  e  n  d  o  r  f
  m   0  0  0  0  0  0  0  0  0
  c   1  0  0  0  0  0  0  0  0
  c   0  0  0  0  0  0  0  0  0
  o   0  2  1  0  0  0  2  1  0
  h   0  1  4  3  2  1  1  1  0
  n   0  0  3  3  5  4  3  2  1
  s   0  0  2  2  4  4  3  2  1
  k   0  0  1  1  3  3  3  2  1
  i   0  0  0  0  2  2  2  2  1
dist=5
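A local-alignment score of this kind can be sketched as below. The scoring parameters are assumptions made for illustration (match +2, mismatch -1, gap -1), as is the extra "0" option that lets an alignment start over, which is the standard Smith-Waterman device; the numbers it produces are therefore not expected to reproduce the table above.

```python
def smith_waterman(s: str, t: str, match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between any substring of s and any substring of t."""
    n, m = len(s), len(t)
    # H[i][j] = best score of an alignment ending at s[i-1], t[j-1] (never below 0).
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start over
                H[i - 1][j - 1] + d,  # subst/copy
                H[i - 1][j] + gap,    # gap in t
                H[i][j - 1] + gap,    # gap in s
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman("mccohn", "cohen"))  # prints 7
```

With these assumed scores, "cohn" aligns to "cohen" with four matches (+8) and one gap (-1), and the unmatched prefix "mc" costs nothing because the alignment is local.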
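Affine gap distances - 3
Unit gap cost:
$$D(i,j) = \max \begin{cases} D(i-1,j-1) + d(s_i,t_j) & \text{// subst/copy} \\ D(i-1,j) - 1 & \text{// insert} \\ D(i,j-1) - 1 & \text{// delete} \end{cases}$$
Affine gap cost:
$$D(i,j) = \max \begin{cases} D(i-1,j-1) + d(s_i,t_j) \\ IS(i-1,j-1) + d(s_i,t_j) \\ IT(i-1,j-1) + d(s_i,t_j) \end{cases}$$
$$IS(i,j) = \max \begin{cases} D(i-1,j) - A \\ IS(i-1,j) - B \end{cases}$$
Best score in which $s_i$ is aligned with a ‘gap’
$$IT(i,j) = \max \begin{cases} D(i,j-1) - A \\ IT(i,j-1) - B \end{cases}$$
Best score in which $t_j$ is aligned with a ‘gap’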
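The three-matrix recurrence can be sketched as follows. This is an illustrative global-alignment variant, not a definitive implementation: the values A = 2 (cost to open a gap), B = 0.5 (cost to extend one), and the match/mismatch scores for d(si,tj) are assumptions, and unreachable cells are initialised to minus infinity.

```python
NEG_INF = float("-inf")


def affine_gap_score(s: str, t: str, A: float = 2.0, B: float = 0.5,
                     match: float = 2.0, mismatch: float = -1.0) -> float:
    """Global alignment score where a gap of length k costs A + (k - 1) * B."""
    n, m = len(s), len(t)
    # D: s[i-1] aligned with t[j-1]; IS: s[i-1] aligned with a gap; IT: t[j-1] aligned with a gap.
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    IS = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    IT = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:  # open a new gap (A) or extend an existing one (B)
                IS[i][j] = max(D[i - 1][j] - A, IS[i - 1][j] - B)
            if j > 0:
                IT[i][j] = max(D[i][j - 1] - A, IT[i][j - 1] - B)
            if i > 0 and j > 0:
                d = match if s[i - 1] == t[j - 1] else mismatch
                D[i][j] = max(D[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1]) + d
    return max(D[n][m], IS[n][m], IT[n][m])


print(affine_gap_score("abcd", "ad"))  # prints 1.5
```

In the example, "a" and "d" match (+4) and "bc" is skipped as one gap of length 2 (A + B = 2.5), so one long gap is cheaper than two separate gaps would be.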
Inference in WHIRL • Explode p(X1,X2,X3): find all DB tuples <p,a1,a2,a3> for p and bind Xi to ai. • Constrain X~Y: if X is bound to a and Y is unbound, • find DB column C to which Y should be bound • pick a term t in X, find proper inverted index for t in C, and bind Y to something in that index • Keep track of t’s used previously, and don’t allow Y to contain one.
String distance metrics so far... • Term-based (e.g. TF/IDF as in WHIRL) • Distance depends on set of words contained in both s and t – so it is sensitive to spelling errors. • Usually weight words to account for “importance” • Fast comparison: O(n log n) for |s|+|t|=n • Edit-distance metrics • Distance is shortest sequence of edit commands that transform s to t. • No notion of word importance • More expensive: O(n^2) • Other metrics • Jaro metric & variants • Monge-Elkan’s recursive string matching • etc? • Which metrics work best, for which problems?
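To make the term-based side of this comparison concrete, here is a small sketch of a TF-IDF cosine similarity over whitespace tokens. The weighting log(1 + tf) * log(1 + N/df) is one common choice and is an assumption here, not necessarily the exact WHIRL weighting; the helper names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length {token: weight} vector per document in docs."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {tok: math.log(1 + c) * math.log(1 + n / df[tok]) for tok, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({tok: w / norm for tok, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors; only shared tokens contribute."""
    return sum(w * v[tok] for tok, w in u.items() if tok in v)


v = tfidf_vectors(["William Cohen", "William W. Cohen", "Willliam Cohon"])
print(cosine(v[0], v[1]))  # between 0 and 1: two of three tokens are shared
print(cosine(v[0], v[2]))  # 0: the misspelled tokens share nothing with the originals
```

The second comparison shows the sensitivity to spelling errors: an edit-distance metric would treat the same pair as nearly identical.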
Jaro metric • Jaro metric is (apparently) tuned for personal names: • Given (s,t), define c to be common in s,t if si=c, tj=c, and |i-j|<min(|s|,|t|)/2. • Define c,d to be a transposition if c,d are common and c,d appear in different orders in s and t. • Jaro(s,t) = average of #common/|s|, #common/|t|, and 1 - 0.5#transpositions/#common • Variant: weight errors early in string more heavily • Fast to compute
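A sketch of the metric in Python follows. It uses the window from the slide, |i-j| < min(|s|,|t|)/2, and takes the third term as 1 - 0.5·#transpositions/#common, the usual form of the Jaro score, so that transpositions lower the similarity. The greedy left-to-right matching of common characters is an assumption of this sketch.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if not s or not t:
        return 0.0
    window = min(len(s), len(t)) / 2
    used = [False] * len(t)  # which characters of t are already matched
    s_common = []
    for i, c in enumerate(s):
        for j, d in enumerate(t):
            # c is "common" if an unused equal character lies within the window.
            if not used[j] and c == d and abs(i - j) < window:
                used[j] = True
                s_common.append(c)
                break
    t_common = [t[j] for j in range(len(t)) if used[j]]
    common = len(s_common)
    if common == 0:
        return 0.0
    # Common characters that appear in a different order in s and t.
    transpositions = sum(a != b for a, b in zip(s_common, t_common))
    return (common / len(s) + common / len(t)
            + (common - 0.5 * transpositions) / common) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))  # prints 0.9444
```

In the example all six characters are common and "T"/"H" are swapped, giving (1 + 1 + 5/6) / 3.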
So which metric should you use? SecondString (Cohen, Ravikumar, Fienberg): • Java toolkit of string-matching methods from AI, Statistics, IR and DB communities • Tools for evaluating performance on test data • Exploratory tool for adding, testing, combining string distances • e.g. SecondString implements a generic “Winkler rescorer” which can rescale any distance function with range of [0,1] • URL – http://secondstring.sourceforge.net • Distribution also includes several sample matching problems.
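The idea of a generic rescorer can be illustrated in Python as a function that wraps any [0,1] similarity. This sketch uses the usual Jaro-Winkler adjustment (boost by common-prefix length, up to 4 characters, with scale 0.1); the name `winkler_rescore` and its parameters are hypothetical and do not describe the SecondString API.

```python
from typing import Callable

Similarity = Callable[[str, str], float]


def winkler_rescore(base: Similarity, prefix_scale: float = 0.1,
                    max_prefix: int = 4) -> Similarity:
    """Wrap a [0,1] similarity so that strings sharing a prefix score higher."""
    def rescored(s: str, t: str) -> float:
        sim = base(s, t)
        prefix = 0
        for a, b in zip(s[:max_prefix], t[:max_prefix]):
            if a != b:
                break
            prefix += 1
        # Move sim toward 1 by a fraction that grows with the shared prefix.
        return sim + prefix * prefix_scale * (1.0 - sim)
    return rescored


# Usage with a stand-in base similarity that always returns 0.5.
rescored = winkler_rescore(lambda s, t: 0.5)
print(rescored("xbc", "abc"))  # prints 0.5 (no shared prefix, score unchanged)
```

Because the adjustment only interpolates between the base score and 1, the result stays in [0,1] for any base function with that range.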
SecondString distance functions • Edit-distance like: • Levenshtein – unit costs • untuned Smith-Waterman • Monge-Elkan (tuned Smith-Waterman) • Jaro and Jaro-Winkler • Less ad hoc Jaro variants • Term-based • TFIDF • Jaccard distance
SecondString distance functions • Edit-distance like: • Levenshtein – unit costs • untuned Smith-Waterman • Monge-Elkan (tuned Smith-Waterman) • Jaro and Jaro-Winkler
Results - Edit Distances Monge-Elkan is the best on average....
SecondString distance functions • Term-based, for sets of terms S and T: • TFIDF distance • Jaccard distance: |S ∩ T| / |S ∪ T| • Language models: construct PS and PT and use …
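For token sets S and T, the Jaccard score is the size of the intersection divided by the size of the union. A minimal sketch, assuming lowercased whitespace tokenization and, as a convention of this sketch only, a score of 1.0 when both strings are empty:

```python
def jaccard(s: str, t: str) -> float:
    """|S ∩ T| / |S ∪ T| over the sets of whitespace-separated tokens."""
    S, T = set(s.lower().split()), set(t.lower().split())
    if not S and not T:
        return 1.0  # assumption: two empty strings count as identical
    return len(S & T) / len(S | T)


print(jaccard("William W. Cohen", "William Cohen"))  # 2 shared tokens out of 3
```

Unlike TFIDF, every token counts equally here, so a shared rare surname is worth no more than a shared common word.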
SecondString distance functions • Term-based, for sets of terms S and T: • TFIDF distance • Jaccard distance • Jensen-Shannon distance • smoothing toward union of S,T reduces cost of disagreeing on common terms • unsmoothed PS, Dirichlet smoothing, Jelinek-Mercer • “Simplified Fellegi-Sunter”
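As an illustration of the language-model view, the sketch below computes the Jensen-Shannon divergence between unsmoothed token distributions PS and PT, that is, the average KL divergence of each distribution from their even mixture. Natural logarithms and whitespace tokenization are assumptions here, and the Dirichlet and Jelinek-Mercer smoothing options are not shown.

```python
import math
from collections import Counter


def token_distribution(s: str) -> dict:
    """Unsmoothed maximum-likelihood distribution over the tokens of s."""
    counts = Counter(s.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def jensen_shannon(p: dict, q: dict) -> float:
    """0.5 * KL(p || m) + 0.5 * KL(q || m), where m = (p + q) / 2."""
    def kl_to_mixture(a: dict, b: dict) -> float:
        # m(tok) > 0 whenever a(tok) > 0, so the logarithm is always defined.
        return sum(pa * math.log(pa / (0.5 * pa + 0.5 * b.get(tok, 0.0)))
                   for tok, pa in a.items())
    return 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)


ps = token_distribution("william cohen")
pt = token_distribution("willliam cohon")
print(jensen_shannon(ps, pt))  # log 2, the maximum: no tokens are shared
```

The value is 0 for identical distributions and log 2 for distributions with disjoint support, so smaller means more similar.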
SecondString distance functions • Hybrid term-based & edit-distance based: • Monge-Elkan’s “recursive matching scheme”, segmenting strings at token boundaries (rather than separators like commas) • SoftTFIDF • Like TFIDF but consider not just tokens in both S and T, but tokens in S “close to” something in T (“close to” relative to some distance metric) • Downweight close tokens slightly
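The SoftTFIDF idea can be sketched as follows. For each token w in S, the closest token in T is found under a secondary similarity; if that similarity exceeds a threshold, the product of the two token weights is added, scaled by the similarity (the slight downweighting of close tokens). Several things here are assumptions for illustration: token weights are passed in as unit-length dictionaries, the threshold value is arbitrary, and Python's `difflib.SequenceMatcher.ratio()` stands in for the secondary metric.

```python
from difflib import SequenceMatcher


def token_sim(a: str, b: str) -> float:
    """Stand-in secondary similarity in [0, 1] (an assumption of this sketch)."""
    return SequenceMatcher(None, a, b).ratio()


def soft_tfidf(u: dict, v: dict, theta: float = 0.7) -> float:
    """u, v map token -> unit-normalized weight for the strings S and T."""
    score = 0.0
    for w, weight_w in u.items():
        # Closest token in T to w, together with its similarity.
        best_tok, best_sim = max(((x, token_sim(w, x)) for x in v),
                                 key=lambda pair: pair[1], default=(None, 0.0))
        if best_sim > theta:
            score += weight_w * v[best_tok] * best_sim
    return score


u = {"william": 0.6, "cohen": 0.8}
v = {"willliam": 0.6, "cohon": 0.8}
print(round(soft_tfidf(u, v), 3))  # prints 0.848
```

Plain TFIDF would score this pair 0 because no token matches exactly; raising `theta` toward 1 recovers that behaviour.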