Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, by Peter Christen. Presented by Joseph Park.
Introduction • “Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several databases” • Also known as: • Record or data linkage • Entity resolution • Object identification • Field matching
Aims & Challenges • Three tasks: • Schema matching • Data matching • Data fusion • Challenges: • Lack of unique entity identifiers, and poor data quality • Computational complexity • Lack of training data (e.g. gold standards) • Privacy and confidentiality (health informatics & data mining)
Overview of Data Matching • Five major steps: • Data pre-processing • Indexing • Record pair comparison • Classification • Evaluation
Data Pre-processing • Remove unwanted characters and words • Expand abbreviations and correct misspellings • Segment attributes into well-defined and consistent output attributes • Verify the correctness of attribute values
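The first two pre-processing steps can be sketched in a few lines of Python. This is only an illustration: the abbreviation table, the list of unwanted words, and the function name `clean_value` are assumptions made for the example, not part of the presentation.

```python
import re

# Hypothetical lookup tables; a real system would use curated, domain-specific ones.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}
UNWANTED_WORDS = {"mr", "mrs", "ms"}

def clean_value(value: str) -> str:
    """Lowercase, strip punctuation, drop unwanted words, expand abbreviations."""
    value = value.lower()
    value = re.sub(r"[^a-z0-9\s]", " ", value)       # remove unwanted characters
    tokens = [t for t in value.split() if t not in UNWANTED_WORDS]
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]  # expand abbreviations
    return " ".join(tokens)

cleaned = clean_value("Mr. John Smith, 42 Main St.")
print(cleaned)  # john smith 42 main street
```

Segmentation and verification (the last two steps) usually need reference data such as address gazetteers, so they are left out of this sketch.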
Indexing • Reduces computational complexity • Generates candidate record pairs • Common technique: blocking
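A minimal sketch of blocking for deduplicating a single list of records. The blocking key used here (first letter of the surname plus the postcode) and the record fields are assumptions chosen for illustration; only records sharing a key are paired.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> str:
    # Assumed key: first letter of surname + postcode.
    return record["surname"][:1].lower() + record["postcode"]

def candidate_pairs(records):
    """Group records by blocking key and pair up records within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

records = [
    {"id": 1, "surname": "Smith", "postcode": "2600"},
    {"id": 2, "surname": "Smyth", "postcode": "2600"},
    {"id": 3, "surname": "Jones", "postcode": "2600"},
    {"id": 4, "surname": "Smith", "postcode": "2913"},
]
pairs = candidate_pairs(records)
print(pairs)  # {(1, 2)}
```

Without blocking, four records would give six pairs to compare; here only one candidate pair remains. The trade-off is that a true match with an error in the blocking key (for example a mistyped postcode) is never compared.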
Record Pair Comparison Comparison vector – vector of numerical similarity values
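As a rough sketch, a comparison vector can be built by applying one similarity function per attribute. The attribute names and the choice of comparators are assumptions for this example; `difflib.SequenceMatcher` from the Python standard library stands in for a dedicated string comparator.

```python
from difflib import SequenceMatcher

def string_sim(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def exact_sim(a: str, b: str) -> float:
    """1.0 if the values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0

# Assumed attributes and the comparator applied to each.
COMPARATORS = {
    "given_name": string_sim,
    "surname": string_sim,
    "postcode": exact_sim,
}

def comparison_vector(rec_a: dict, rec_b: dict) -> list:
    return [sim(rec_a[attr], rec_b[attr]) for attr, sim in COMPARATORS.items()]

rec_a = {"given_name": "peter", "surname": "smith", "postcode": "2600"}
rec_b = {"given_name": "peter", "surname": "smyth", "postcode": "2601"}
vector = comparison_vector(rec_a, rec_b)
```

Here the first element is 1.0 (identical given names), the second is a value strictly between 0 and 1, and the third is 0.0 (different postcodes).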
Jaro and Winkler String Comparison • Jaro: • Combines edit distance and q-gram based comparison • Winkler: • Increases Jaro similarity for up to four agreeing initial characters
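The two comparators can be sketched as follows, using the commonly published formulation: Jaro counts common characters within a sliding window and then penalises transpositions, and Winkler adds a bonus for an agreeing prefix. The prefix scaling factor of 0.1 is the conventional default, assumed here rather than taken from the slides.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as common only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: common characters that appear in a different order.
    common1 = [c for c, m in zip(s1, matched1) if m]
    common2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(common1, common2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    # Boost the similarity for up to four agreeing initial characters.
    return sim + prefix * prefix_scale * (1.0 - sim)

print(round(jaro("martha", "marhta"), 4))          # 0.9444
print(round(jaro_winkler("martha", "marhta"), 4))  # 0.9611
```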
Record Pair Classification • Two-class or three-class classification: • Match or non-match • Match or non-match or potential match (requires clerical review) • Supervised and unsupervised • Active learning
Unsupervised Classification • Threshold-based classification • Probabilistic classification • Cost-based classification • Rule-based classification • Clustering-based classification
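The simplest of these, threshold-based classification, can be sketched by summing the comparison vector and comparing the total against two thresholds. The threshold values below are arbitrary assumptions for a three-attribute vector; in practice they would be tuned on the data.

```python
def classify(vector, lower: float = 1.5, upper: float = 2.5) -> str:
    """Three-class threshold classification of a comparison vector."""
    total = sum(vector)
    if total >= upper:
        return "match"
    if total <= lower:
        return "non-match"
    return "potential match"   # left for clerical review

print(classify([1.0, 0.9, 1.0]))  # match
print(classify([1.0, 0.8, 0.0]))  # potential match
print(classify([0.2, 0.1, 0.0]))  # non-match
```

Setting `lower` equal to `upper` collapses this into a two-class (match or non-match) classifier.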
Probabilistic Classification • Three-class based • Different weights assigned to different attributes • Newcombe & Kennedy: weights based on attribute cardinalities • Comparison vectors use binary (agree/disagree) comparisons • Attributes are assumed to be conditionally independent
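One common way to turn these ideas into numbers is the log-likelihood weighting associated with Fellegi and Sunter, which the slide does not name; it is used here as an assumed illustration. Each attribute gets an m-probability (agreement given a true match) and a u-probability (agreement given a true non-match), and conditional independence lets the per-attribute weights simply be added. All probability values below are invented for the example.

```python
import math

# Hypothetical (m, u) probabilities per attribute; illustrative values only.
# m = P(attribute agrees | true match), u = P(attribute agrees | true non-match)
M_U = {
    "surname": (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "postcode": (0.85, 0.10),
}

def match_weight(agreements: dict) -> float:
    """Sum per-attribute log2 weights for a binary comparison vector."""
    total = 0.0
    for attr, agrees in agreements.items():
        m, u = M_U[attr]
        if agrees:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total

all_agree = match_weight({"surname": True, "given_name": True, "postcode": True})
all_differ = match_weight({"surname": False, "given_name": False, "postcode": False})
```

An attribute with many distinct values, such as surname, has a small u-probability and therefore a larger agreement weight than a low-cardinality attribute. The summed weight is then compared against a lower and an upper threshold to give the three classes.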
Active Learning • Trains a model with a small set of seed data • Classifies comparison vectors not in the training set as matches or non-matches • Asks users for help on the vectors that are most difficult to classify • Adds the manually classified vectors to the training data set • Trains the next, improved, classification model • Repeats until a stopping criterion is met
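The loop above can be sketched with a deliberately simple model. Everything here is an assumption made for illustration: the model is a nearest-centroid classifier, "most difficult" means the vector whose distances to the two class centroids are closest to equal, the stopping criterion is a fixed query budget, and `oracle` is a hypothetical stand-in for the human reviewer. The seed data must contain at least one match and one non-match.

```python
def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def train(training):
    """Return (match centroid, non-match centroid) from labelled vectors."""
    c_match = centroid([v for v, is_match in training if is_match])
    c_non = centroid([v for v, is_match in training if not is_match])
    return c_match, c_non

def active_learning(seed, pool, oracle, budget=3):
    """seed: list of (vector, is_match); pool: unlabelled comparison vectors."""
    training, pool = list(seed), list(pool)
    for _ in range(budget):                 # stopping criterion: query budget
        if not pool:
            break
        c_match, c_non = train(training)
        # Most difficult vector: nearly equidistant from both centroids.
        hardest = min(pool, key=lambda v: abs(dist(v, c_match) - dist(v, c_non)))
        pool.remove(hardest)
        training.append((hardest, oracle(hardest)))   # ask the user
    c_match, c_non = train(training)        # final, improved model
    return lambda v: dist(v, c_match) <= dist(v, c_non)

seed = [([1.0, 1.0], True), ([0.0, 0.0], False)]
pool = [[0.9, 0.8], [0.1, 0.2], [0.6, 0.5], [0.4, 0.5]]
oracle = lambda v: sum(v) > 1.0             # hypothetical reviewer
model = active_learning(seed, pool, oracle, budget=2)
```

With this data the two queries go to the borderline vectors `[0.6, 0.5]` and `[0.4, 0.5]`, which is the point of the technique: manual effort is spent where the model is least certain.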