De-duplication systems learn a function that classifies pairs of records in a large data set D as duplicates or not, using a labeled training set. Because D is large, choosing a representative set of pairs to label is hard. The ALIAS de-duplicator applies Active Learning, iteratively selecting the pairs whose labels would be most informative. It measures uncertainty using committees of classifiers and improves performance over random selection while requiring significantly less training data.
Interactive Deduplication using Active Learning. Sunita Sarawagi and Anuradha Bhamidipaty. Presented by Doug Downey.
Active Learning for de-duplication
• De-duplication systems try to learn a function f : D × D → {duplicate, non-duplicate}, where D is the data set.
• f is learned using a labeled training set Lp of pairs.
• Normally, D is large, so many sets Lp are possible, and choosing a representative & useful Lp is hard.
• Instead of a fixed set Lp, in Active Learning the learner interactively chooses pairs from D × D to be labeled and added to Lp.
The ALIAS de-duplicator
• Input:
• Set Dp of pairs of data records represented as feature vectors (features might include edit distance, soundex, etc.).
• Initial set Lp of some elements of Dp labeled as duplicates or non-duplicates.
• Set T = Lp. Loop until user satisfaction:
• Train classifier C using T.
• Use C to choose a set S of instances from Dp for labeling.
• Get labels for S from the user, and set T = T ∪ S.
A minimal sketch of this loop follows the list.
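This is a minimal Python sketch of the ALIAS loop, assuming 0/1 labels, a scikit-learn-style classifier, and a hypothetical `oracle` callable standing in for the interactive user. The single-classifier uncertainty used for selection here is a stand-in for the committee machinery described below.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def alias_loop(Dp, Lp, oracle, rounds=20, n=1):
    """Dp: array of feature vectors, one per candidate pair.
    Lp: initial labeled pairs, a list of (features, label) tuples.
    oracle: hypothetical callable returning the user's 0/1 label for a pair."""
    T = list(Lp)
    for _ in range(rounds):                     # "until user satisfaction"
        X = np.array([x for x, _ in T])
        y = np.array([lbl for _, lbl in T])
        C = DecisionTreeClassifier().fit(X, y)
        # Stand-in for committee-based selection: pick the n pairs whose
        # predicted P(duplicate) is closest to 0.5.
        p = C.predict_proba(Dp)[:, 1]
        S = np.argsort(np.abs(p - 0.5))[:n]
        T += [(Dp[i], oracle(Dp[i])) for i in S]
    return C
```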
Active Learning
• How do we choose the set S of instances to label?
• Idea: choose the most uncertain instances.
• Suppose +'s and –'s can be separated by some point on a line, and assume the probability of + grows linearly between a labeled – example r and a labeled + example b.
• The midpoint m between r and b is then:
• maximally uncertain (P(+) = 0.5), and
• also the point that reduces our "confusion region" the most.
• So choose m! (See the sketch below.)
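In one dimension this amounts to binary search. A short illustration (my sketch, not code from the paper) with a hypothetical `label_of` oracle:

```python
def locate_boundary(label_of, r, b, tol=1e-3):
    """Find the +/- boundary between a known '-' at r and a known '+' at b.
    Each query of the midpoint m, the maximally uncertain point where
    P(+) = 0.5, halves the confusion region [r, b], so only
    O(log((b - r) / tol)) labels are needed instead of a linear scan."""
    while b - r > tol:
        m = (r + b) / 2.0
        if label_of(m) == '+':
            b = m          # the boundary lies below m
        else:
            r = m          # the boundary lies above m
    return (r + b) / 2.0
```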
Measuring Uncertainty with Committees
• Train a committee of several slightly different versions of a classifier.
• Uncertainty(x) ∝ entropy of the committee's predictions on x.
• Form committees by:
• randomizing model parameters,
• partitioning training data, or
• partitioning attributes.
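A minimal sketch of committee-based uncertainty, assuming 0/1 labels. Here the committee is built by bootstrap-resampling the training data (a variant of the data-partitioning option above), and uncertainty is the binary entropy of the committee vote:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def committee_uncertainty(X_train, y_train, X_pool, n_members=5, seed=0):
    """Vote-entropy uncertainty for each instance in X_pool, using a
    committee of trees trained on bootstrap resamples of the training data."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X_train), len(X_train))  # bootstrap sample
        c = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(c.predict(X_pool))
    p = np.array(votes).mean(axis=0)        # fraction voting "duplicate"
    p = np.clip(p, 1e-9, 1 - 1e-9)          # avoid log(0)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # binary entropy
```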
Representativeness of an Instance
• We need informative instances, not just uncertain ones.
• Solution: sample n of the kn most uncertain instances, weighted by uncertainty (sketch below).
• k = 1: no sampling (pure uncertainty selection).
• kn = all data: full sampling.
• Why not use information gain? Estimating it would require retraining the classifier for each candidate label, which is too expensive.
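A sketch of the weighted sampling step, assuming an array of uncertainty scores over the unlabeled pool:

```python
import numpy as np

def sample_representative(uncertainty, n, k, seed=0):
    """Sample n instances from the k*n most uncertain ones, with
    probability proportional to uncertainty.  k = 1 reduces to pure
    uncertainty selection; k*n = pool size approaches full sampling."""
    rng = np.random.default_rng(seed)
    top = np.argsort(uncertainty)[::-1][:k * n]   # indices of the k*n most uncertain
    w = uncertainty[top] / uncertainty[top].sum()
    return rng.choice(top, size=n, replace=False, p=w)
```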
Evaluation – Different Classifiers
• Decision Trees & Naïve Bayes: committees of 5 via parameter randomization.
• SVMs: uncertainty = distance from the separator.
• Start with one duplicate and one non-duplicate, add a new training example each round (n = 1), partial sampling (k = 5).
• Similarity functions: 3-gram match, % overlapping words, approximate edit distance, special handling of numbers/nulls.
• Data sets:
• Bibliography: 32,131 citation pairs from Citeseer, 0.5% duplicates.
• Address: 44,850 pairs, 0.25% duplicates.
A sketch of such pair-level features follows.
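This sketch builds a feature vector along the lines of the similarity functions listed; the exact definitions in the paper may differ, and `difflib`'s `ratio` stands in for an approximate edit-distance score:

```python
import difflib

def three_grams(s):
    """Set of character 3-grams of a string."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def pair_features(a, b):
    """Similarity feature vector for two records given as strings."""
    ga, gb = three_grams(a.lower()), three_grams(b.lower())
    gram_match = len(ga & gb) / max(1, len(ga | gb))        # 3-gram overlap
    wa, wb = set(a.lower().split()), set(b.lower().split())
    word_overlap = len(wa & wb) / max(1, len(wa | wb))      # % overlapping words
    edit_sim = difflib.SequenceMatcher(None, a, b).ratio()  # approx. edit similarity
    return [gram_match, word_overlap, edit_sim]
```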
Conclusions
• Active Learning improves performance over random selection, using two orders of magnitude less training data.
• Note: the gain is not due just to the change in the +/– mix of training examples.
• In these experiments, Decision Trees outperformed SVMs and Naïve Bayes.