Date: 2011/12/26 Source: Dustin Lange et al. (CIKM’11) Advisor: Jia-ling, Koh

Frequency-aware Similarity Measures
Date: 2011/12/26
Source: Dustin Lange et al. (CIKM’11)
Advisor: Jia-ling, Koh
Speaker: JiunJia, Chiou

Outline
• Introduction
• Composing similarity
• Exploiting frequencies
• Partitioning strategies
• Experiment
• Conclusion

Introduction
• The paper proposes a novel comparison method that partitions the data using value frequency information and then automatically determines similarity measures for each individual partition.
• Compared record pairs are partitioned according to the frequencies of their attribute values. For example: Partition 1 contains all pairs with rare names, Partition 2 all pairs with medium-frequency names, and Partition 3 all pairs with frequent names.

Introduction
Motivation:
• Schufa is a credit rating agency that stores data on about 66 million citizens, reported in turn by banks, insurance agencies, etc. Queries about the rating of an individual must be answered as precisely as possible.
• To ensure the quality of the data, it is necessary to detect and fuse duplicates.

Introduction
• Why is Arnold Schwarzenegger always a duplicate? In a person table of U.S. citizens, this is a very rare name. If we find several Arnold Schwarzeneggers in it, it is very likely that these are duplicates.
• The authors argue that, for rows with rare names, address and date-of-birth similarity are less important than for rows with frequent names.
• Compared attributes: person's name, birth date, address.

Introduction
• Determining the similarity (or distance) of two records in a database is a well-known but challenging problem.
• The problem comprises two main difficulties:
1. Typos, outdated values, and sloppy data or query entries. Remedy: devising sophisticated similarity measures.
2. The amount of data might be very large, prohibiting exhaustive comparisons. Remedy: efficient algorithms and indexes that avoid comparing each entry with all other entries.

Composing Similarity
• Base Similarity Measures
Define Sim_p(r1, r2) with Sim_p : (R × R) → [0, 1] ⊂ ℝ, each responsible for calculating the similarity of a specific attribute p of the compared records r1 and r2 from a set R of records.
Ex:
SimName: Jaro-Winkler distance
SimBirthDate: relative distance
SimAddress: Euclidean distance
Base measures can also test for equality (e.g., for email addresses) or compare boolean values (e.g., for gender).

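Jaro-Winkler distance
Jaro distance $d_j$ for strings $s_1$ and $s_2$: $d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$, with $d_j = 0$ if $m = 0$.
• m: the number of matching characters.
• t: half the number of transpositions.
Jaro-Winkler distance $d_w$: $d_w = d_j + \ell \, p \, (1 - d_j)$
• $d_j$: the Jaro distance for strings $s_1$ and $s_2$.
• $\ell$: the length of the common prefix at the start of the strings, up to a maximum of 4 characters.
• p: a constant scaling factor. p should not exceed 0.25, otherwise the distance can become larger than 1. The standard value for this constant in Winkler's work is p = 0.1.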

Jaro-Winkler distance: examples
Example 1: s1 = MARTHA, s2 = MARHTA
m = 6, |s1| = 6, |s2| = 6
t = 2/2 = 1 (H/T and T/H)
dj = (1/3)(6/6 + 6/6 + (6 − 1)/6) = 0.944
Standard weight p = 0.1, common prefix MAR, so prefix length ℓ = 3
dw = 0.944 + (3 × 0.1 × (1 − 0.944)) = 0.961
Example 2: s1 = DWAYNE, s2 = DUANE
m = 4, |s1| = 6, |s2| = 5
t = 0
dj = (1/3)(4/6 + 4/5 + (4 − 0)/4) = 0.822
Standard weight p = 0.1, common prefix D, so prefix length ℓ = 1
dw = 0.822 + (1 × 0.1 × (1 − 0.822)) = 0.84
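The two worked examples can be reproduced with a short Python sketch of the standard Jaro and Jaro-Winkler definitions. The function names and the matching-window handling are illustrative choices made here, not code from the paper.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity of two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters match only if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that appear in a different order.
    seq1 = [c for c, ok in zip(s1, matched1) if ok]
    seq2 = [c for c, ok in zip(s2, matched2) if ok]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro-Winkler: boost the Jaro score by the common prefix length (capped)."""
    dj = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))
print(round(jaro_winkler("DWAYNE", "DUANE"), 3))
```

Rounded to three and two decimals respectively, the two calls agree with the slide values 0.961 and 0.84.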

Composing Similarity
• Composition of Base Similarity Measures
Integrate the base similarity measures into an overall judgement by calculating the overall similarity of two records. This is treated as a classification problem:
• The classes are isSimilar and isDissimilar.
• The features are the results of the base similarity measures.
• To derive a general model, employ machine learning techniques; supervised learning methods require enough training data. Examples: logistic regression, decision trees, SVM.
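As a minimal sketch of the composition step, the snippet below combines base similarity values with a logistic function and maps the result to the two classes. The weights, bias, and attribute names are hypothetical values chosen for illustration; in the approach described above they would be learned from training data.

```python
import math

# Hypothetical, hand-picked parameters (assumption); normally learned from labeled pairs.
WEIGHTS = {"name": 4.0, "birth_date": 3.0, "address": 2.0}
BIAS = -5.0


def compose_similarity(base_sims: dict, weights: dict = WEIGHTS, bias: float = BIAS) -> float:
    """Logistic combination of base similarity values, each in [0, 1]."""
    z = bias + sum(weights[attr] * value for attr, value in base_sims.items())
    return 1.0 / (1.0 + math.exp(-z))


def classify(base_sims: dict, threshold: float = 0.5) -> str:
    """Map the composite similarity to one of the two classes."""
    return "isSimilar" if compose_similarity(base_sims) >= threshold else "isDissimilar"


pair = {"name": 0.96, "birth_date": 1.0, "address": 0.4}
print(classify(pair))
```

A decision tree or SVM could take the place of the logistic function; only the mapping from the feature vector to a class would change.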

[Figure: illustrations of logistic regression, SVM (support vector machine), and decision tree]

Exploiting frequencies
• Frequency Function
Determine the value frequencies of the selected attributes for two compared records. Define a frequency function f : R × R → ℕ (here over FirstName and LastName).
Goal: partition the data according to the name frequencies.
• Several data quality problems must be handled:
1. Swapping of first and last name.
2. Typos (e.g., Arnold, Arnnold).
3. Combining two attributes (e.g., Schwarzenegger is more distinguishing than Arnold).
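One possible frequency function over name values is sketched below. It is an illustration under stated assumptions, not the paper's definition: records are dictionaries with hypothetical `first` and `last` keys, the name key is order-insensitive so that swapped first and last names count together, and the pair frequency is taken as the minimum of the two record frequencies. Typos are not handled.

```python
from collections import Counter


def name_key(record: dict) -> tuple:
    # Sorting the two name parts makes the key insensitive to swapped
    # first/last names (an assumption made for this sketch).
    return tuple(sorted((record["first"].lower(), record["last"].lower())))


def build_frequency_table(records: list) -> Counter:
    """Count how often each (first, last) name combination occurs."""
    return Counter(name_key(r) for r in records)


def pair_frequency(r1: dict, r2: dict, table: Counter) -> int:
    """f(r1, r2): here the rarer of the two name frequencies (assumption)."""
    return min(table[name_key(r1)], table[name_key(r2)])


people = [
    {"first": "Arnold", "last": "Schwarzenegger"},
    {"first": "Schwarzenegger", "last": "Arnold"},  # swapped entry
    {"first": "John", "last": "Smith"},
    {"first": "John", "last": "Smith"},
    {"first": "John", "last": "Smith"},
]
table = build_frequency_table(people)
print(pair_frequency(people[0], people[2], table))
```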

[Figure: example FirstName and LastName frequency tables (e.g., FirstName: Josh: 3, Kevin: 1, Jack: 5; LastName: Powell: 1, Johnson: 0, Wills: 5)]

Exploiting frequencies
• Frequency-enriched Models
One way to exploit frequency distributions is to alter the models learned with the machine learning techniques:
1. Manually add rules to the models. Ex. for logistic regression: "if the frequency of the name value is below 10, then increase the weight of the name similarity by 10% and appropriately decrease the weights of the other similarity functions". Drawback: manually defining such rules is cumbersome and error-prone.
2. Integrate the frequencies directly into the machine learning models. (M denotes the maximum frequency in the data set.)

Partitioning strategies
• Partition the compared record pairs into n partitions using the determined frequencies.
• Number of partitions:
Too many partitions yield small partitions (e.g., frequencies 0 to 10): risk of overfitting.
Too few partitions yield large partitions (e.g., frequencies 0 to 100): too coarse for discovering frequency-specific differences.

Partitioning strategies
• Define partitions:
• The entire frequency space is divided into non-overlapping, continuous partitions by a set of thresholds $\theta_0 < \theta_1 < \dots < \theta_n$ with $\theta_0 = 0$ and $\theta_n = M + 1$, where M is the maximum frequency in the data set.
• Partitions are defined as frequency ranges $I_i = [\theta_i, \theta_{i+1})$ for $0 \le i < n$.
• A partition covers a set of record pairs. A record pair $(r_1, r_2)$ falls into a partition $[\theta_i, \theta_{i+1})$ iff the frequency function value for this pair lies in the partition's range: $\theta_i \le f(r_1, r_2) < \theta_{i+1}$.
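The membership rule for half-open frequency ranges can be sketched with a binary search over the sorted threshold list. The function name and the example thresholds (with an assumed maximum frequency M = 500) are illustrative.

```python
from bisect import bisect_right


def partition_index(frequency: int, thresholds: list) -> int:
    """Return i such that thresholds[i] <= frequency < thresholds[i + 1].

    `thresholds` must be strictly increasing, start at 0, and end at M + 1.
    """
    if not thresholds[0] <= frequency < thresholds[-1]:
        raise ValueError("frequency lies outside the partitioned range")
    return bisect_right(thresholds, frequency) - 1


# Example with an assumed maximum frequency M = 500:
# three partitions [0, 10), [10, 100), [100, 501).
thresholds = [0, 10, 100, 501]
print([partition_index(f, thresholds) for f in (0, 9, 10, 99, 100, 500)])
```

Because the upper bound of each range is exclusive, a frequency equal to a threshold always belongs to the partition that starts at that threshold.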

Partitioning strategies
• Random partitioning: randomly pick several thresholds θi ∈ {0, …, M + 1}. The number of thresholds in each partitioning is also chosen randomly, with a maximum of 20 partitions in one partitioning.
• Equi-depth partitioning: divide the frequency space into e partitions, e ∈ {2, …, 20}, such that each partition contains the same number of tuples from the original data set R. (The slide's illustration shows a range from 1 to 20 partitions, with e = 9.)
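A rough sketch of equi-depth threshold selection follows, assuming the input is simply a list of frequency values, one per tuple. Thresholds are read off the sorted values at equally spaced ranks; when many tuples share the same frequency, partitions can only be approximately equal in size, and duplicate cut points are dropped.

```python
def equi_depth_thresholds(frequencies: list, e: int) -> list:
    """Pick thresholds so that each of (up to) e partitions holds about the
    same number of tuples. Returns [0, ..., M + 1]."""
    if not frequencies or e < 1:
        raise ValueError("need at least one frequency value and e >= 1")
    values = sorted(frequencies)
    thresholds = [0]
    for k in range(1, e):
        cut = values[k * len(values) // e]
        # Skip cut points that would create an empty or duplicate range.
        if cut > thresholds[-1]:
            thresholds.append(cut)
    thresholds.append(values[-1] + 1)  # M + 1 closes the last partition
    return thresholds


print(equi_depth_thresholds(list(range(1, 101)), 4))
```

With the frequencies 1 to 100 and e = 4, each of the four resulting ranges covers 25 values.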

Partitioning strategies
• Greedy partitioning: define a list of threshold candidates C = {θ0, …, θn} by dividing the frequency space into segments with the same number of tuples (similar to equi-depth partitioning, but with a fixed, large e = 50).
Process:
1. Learn a partition for the first candidate thresholds: [θ0, θ1).
2. Learn a second partition that extends the current partition by moving its upper threshold to the next threshold candidate: [θ0, θ2).
3. … [θ0, θ3), …
• Compare both partitions (current and extended) using the F-measure.

Partitioning strategies
• Greedy partitioning (continued):
• If the extended partition achieves better performance, the process is repeated for the next threshold candidate.
• If not, the smaller partition is kept and a new partition is started at its upper threshold; another iteration starts with this new partition.
• This process is repeated until all threshold candidates have been processed.
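The greedy procedure can be sketched as a single pass over the threshold candidates. The `evaluate(lower, upper)` callback is a hypothetical stand-in for "learn a similarity function on the pairs with frequency in [lower, upper) and return its F-measure"; the scores in the usage example are made up for illustration.

```python
def greedy_partitioning(candidates: list, evaluate) -> list:
    """Return the chosen thresholds, given sorted candidates [θ0, ..., θn].

    `evaluate(lower, upper)` is assumed to return the F-measure achieved on
    the partition [lower, upper).
    """
    if len(candidates) < 2:
        raise ValueError("need at least two threshold candidates")
    thresholds = [candidates[0]]
    lower, upper = candidates[0], candidates[1]
    best = evaluate(lower, upper)
    for nxt in candidates[2:]:
        extended = evaluate(lower, nxt)
        if extended > best:
            # The extended partition performs better: keep growing it.
            upper, best = nxt, extended
        else:
            # Keep the smaller partition and start a new one at its upper bound.
            thresholds.append(upper)
            lower, upper = upper, nxt
            best = evaluate(lower, upper)
    thresholds.append(upper)
    return thresholds


# Made-up F-measures for illustration only.
scores = {(0, 1): 0.71, (0, 2): 0.83, (0, 3): 0.69,
          (2, 3): 0.60, (2, 4): 0.70, (2, 5): 0.80}
print(greedy_partitioning([0, 1, 2, 3, 4, 5], lambda lo, hi: scores[(lo, hi)]))
```

In this toy run the first partition stops growing at 2 because extending it to 3 lowers the score, and the second partition then grows to the last candidate.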

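[Figure: greedy partitioning example. Partitions are compared by F-measure, F = 2PR / (P + R), computed from predicted vs. actual labels.]
• P = 5/8 = 0.625, R = 5/6 = 0.83, F = 0.71
• P = 10/13 = 0.77, R = 10/11 = 0.91, F = 0.83
• P = 10/15 = 0.67, R = 10/14 = 0.71, F = 0.69
Candidate ranges shown: [0, 1), [0, 2), [0, 3), then [2, 3), [2, 4), [2, 5).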
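The precision, recall, and F-measure values in this example follow from the standard definitions. A small sketch, with counts of true positives, false positives, and false negatives chosen to match the fractions shown:

```python
def precision_recall_f(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F-measure (harmonic mean of the two)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# P = 5/8 and R = 5/6 correspond to tp = 5, fp = 3, fn = 1, and so on.
for tp, fp, fn in [(5, 3, 1), (10, 3, 1), (10, 5, 4)]:
    p, r, f = precision_recall_f(tp, fp, fn)
    print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```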

Partitioning strategies
• Genetic Partitioning Algorithm
• Initialization: create an initial population consisting of several random partitionings, created as described above for the random partitioning approach.
• Growth: learn one composite similarity function for each partition in the current set of partitionings.
• Selection: for each partition, determine the maximum F-measure that can be achieved by choosing an appropriate threshold for the similarity function. Rank the partitionings by weighted F-measure and select the top five.
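The selection step might look like the sketch below. It assumes that the weighted F-measure of a partitioning weights each partition's F-measure by its number of record pairs; the slide does not state the weighting, so this is an assumption, and the data layout (a dictionary from a partitioning label to per-partition results) is hypothetical.

```python
def weighted_f_measure(partition_results: list) -> float:
    """partition_results: list of (num_pairs, f_measure), one per partition.
    Weights each F-measure by partition size (assumed weighting)."""
    total = sum(n for n, _ in partition_results)
    if total == 0:
        return 0.0
    return sum(n * f for n, f in partition_results) / total


def select_top(partitionings: dict, k: int = 5) -> list:
    """Return the labels of the k partitionings with highest weighted F-measure."""
    ranked = sorted(partitionings,
                    key=lambda label: weighted_f_measure(partitionings[label]),
                    reverse=True)
    return ranked[:k]


population = {
    "A": [(10, 0.5), (30, 0.9)],   # weighted F = 0.80
    "B": [(20, 0.7), (20, 0.7)],   # weighted F = 0.70
    "C": [(40, 0.85)],             # weighted F = 0.85
}
print(select_top(population, k=2))
```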

Partitioning strategies
• Reproduction: build pairs of the selected best individuals and combine them to create new individuals.
a) Recombination: first create the union of the thresholds of both partitionings. For each threshold, randomly decide whether to keep it in the resulting partitioning or not. Both decisions have equal chances.
b) Mutation: randomly decide whether to add another new (also randomly picked) threshold and whether to delete a (randomly picked) threshold from the current threshold list.
• Define a minimum partition size (this value is set to 20 record pairs). Randomly created partitionings with too small partitions are discarded.
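Recombination, mutation, and the minimum-size check can be sketched as follows. Several details are assumptions made here: both parents share the outer thresholds 0 and M + 1, which are always kept; the add and delete decisions in mutation each use probability 0.5; and all pair frequencies lie inside the threshold range.

```python
import random
from bisect import bisect_right


def recombine(parent_a: list, parent_b: list, rng: random.Random) -> list:
    """Union the inner thresholds of both parents, keep each with probability 0.5."""
    lower, upper = parent_a[0], parent_a[-1]  # assumed identical in both parents
    inner = sorted((set(parent_a) | set(parent_b)) - {lower, upper})
    kept = [t for t in inner if rng.random() < 0.5]
    return [lower] + kept + [upper]


def mutate(thresholds: list, rng: random.Random) -> list:
    """Randomly add one new inner threshold and/or delete one existing one."""
    lower, upper = thresholds[0], thresholds[-1]
    inner = set(thresholds[1:-1])
    if rng.random() < 0.5 and upper - lower > 1:
        inner.add(rng.randrange(lower + 1, upper))
    if rng.random() < 0.5 and inner:
        inner.discard(rng.choice(sorted(inner)))
    return [lower] + sorted(inner) + [upper]


def partition_sizes_ok(thresholds: list, pair_frequencies: list, min_pairs: int = 20) -> bool:
    """Check that every partition holds at least `min_pairs` record pairs."""
    counts = [0] * (len(thresholds) - 1)
    for f in pair_frequencies:
        counts[bisect_right(thresholds, f) - 1] += 1
    return all(c >= min_pairs for c in counts)


rng = random.Random(0)
child = mutate(recombine([0, 3, 10], [0, 5, 7, 10], rng), rng)
print(child)
```

A child that fails `partition_sizes_ok` would simply be discarded, as described for randomly created partitionings.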

[Figure: reproduction example with thresholds θ0, θ1, θ2, θ3. Partitionings from the top 5: [0, 1), [1, 3), [3, 4) and [0, 2), [2, 4), [4, 5)]

Partitioning strategies
• Termination:
• The resulting partitionings are evaluated and added to the set of evaluated partitionings.
• The selection and reproduction phases are repeated until a certain number of iterations is reached or until no significant improvement can be measured.
• A minimum F-measure improvement of 0.001 after 5 iterations is required.

Experiment
Evaluation on Schufa Data Set
• The data set consists of two parts: a person data set and a query data set.
• Record pairs are built in the form (query, correct result) or (query, incorrect result).

Experiment
Evaluation on DBLP Data Set (bibliographic database for computer science)
Paper pairs fall into four categories:
(1) Two papers from the same author.
(2) Two papers from the same author with different name aliases.
(3) Two papers from different authors with the same name.
(4) Two papers from different authors with different names.
For each paper pair, the matching task is to decide whether the two papers were written by the same author.

Conclusion
• The paper introduces a novel approach for improving composite similarity measures.
• Divide a data set consisting of record pairs into partitions according to the frequencies of selected attributes.
• Learn optimal similarity measures for each partition.
• Experiments on different real-world data sets showed that partitioning the data can improve learning results and that genetic partitioning performs better than several other partitioning strategies.

Thank you for listening!
