Data Matching

Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection
By Peter Christen
Presented by Joseph Park

Introduction
• “Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several databases”
• Also known as:
  • Record or data linkage
  • Entity resolution
  • Object identification
  • Field matching

Aims & Challenges
• Three tasks:
  • Schema matching
  • Data matching
  • Data fusion
• Challenges:
  • Lack of unique entity identifiers and poor data quality
  • Computational complexity
  • Lack of training data (e.g. gold standards)
  • Privacy and confidentiality (health informatics & data mining)

Overview of Data Matching
• Five major steps:
  • Data pre-processing
  • Indexing
  • Record pair comparison
  • Classification
  • Evaluation

Diagram

Data Pre-processing
• Remove unwanted characters and words
• Expand abbreviations and correct misspellings
• Segment attributes into well-defined and consistent output attributes
• Verify the correctness of attribute values
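The first three pre-processing steps can be sketched in a few lines of Python. This is a minimal illustration, not the method from the book: the lookup tables `ABBREVIATIONS` and `UNWANTED_WORDS`, the helper names, and the "last token is the surname" rule are all assumptions made for the example. Verifying attribute values against reference data is left out.

```python
import re

# Hypothetical lookup tables; a real system would use much larger ones.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}
UNWANTED_WORDS = {"mr", "mrs", "ms"}


def clean_value(value: str) -> str:
    """Lowercase, remove unwanted characters and words, expand abbreviations."""
    value = value.lower()
    value = re.sub(r"[^a-z0-9\s]", " ", value)  # punctuation becomes a space
    tokens = [t for t in value.split() if t not in UNWANTED_WORDS]
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(tokens)


def segment_name(full_name: str) -> dict:
    """Split a name into output attributes (assumes the last token is the surname)."""
    tokens = clean_value(full_name).split()
    if not tokens:
        return {"given_name": "", "surname": ""}
    return {"given_name": " ".join(tokens[:-1]), "surname": tokens[-1]}


print(clean_value("42 Main St."))
print(segment_name("Mr. John  A. Smith"))
```

The same cleaning function should be applied to every database being matched, so that equal values end up with equal representations.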

Example of Data Pre-processing

Indexing
• Reduces computational complexity
• Generates candidate record pairs
• Common technique: blocking
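A minimal sketch of blocking for deduplicating a single database is shown here. Only records that share a blocking key become candidate pairs, which avoids comparing every record with every other record. The key definition (first three letters of the surname plus the postcode) and the field names are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    # Assumed key: first three letters of the surname plus the postcode.
    return record["surname"][:3].lower() + record["postcode"]


def candidate_pairs(records: dict) -> set:
    """Group record ids by blocking key and pair up ids within each block."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    "a1": {"surname": "Smith", "postcode": "2600"},
    "a2": {"surname": "Smithe", "postcode": "2600"},
    "a3": {"surname": "Miller", "postcode": "2600"},
}
print(candidate_pairs(records))
```

The trade-off is that a true match whose records fall into different blocks (for example because of a typo in the first letters of the surname) is never compared, so the choice of key matters.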

Example of Blocking

Record Pair Comparison
• Comparison vector: a vector of numerical similarity values, one per compared attribute
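As a rough sketch, a comparison vector can be built by applying one similarity function per attribute to a candidate record pair. The attribute names and the choice of comparators are assumptions for this example; `difflib.SequenceMatcher` from the Python standard library stands in for whichever approximate string comparator is actually used.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1]; missing values score 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def exact_sim(a: str, b: str) -> float:
    """1.0 for identical non-empty values, otherwise 0.0."""
    return 1.0 if a and a == b else 0.0


# Assumed attributes and comparators, in the order they appear in the vector.
COMPARATORS = [
    ("given_name", string_sim),
    ("surname", string_sim),
    ("postcode", exact_sim),
]


def comparison_vector(rec_a: dict, rec_b: dict) -> list:
    return [sim(rec_a.get(f, ""), rec_b.get(f, "")) for f, sim in COMPARATORS]


rec_a = {"given_name": "john", "surname": "smith", "postcode": "2600"}
rec_b = {"given_name": "john", "surname": "smyth", "postcode": "2601"}
print(comparison_vector(rec_a, rec_b))
```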

Example of Record Pair Comparison

Jaro and Winkler String Comparison
• Jaro:
  • Combines edit distance and q-gram based comparison
• Winkler:
  • Increases Jaro similarity for up to four agreeing initial chars
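The two comparators can be sketched as follows, using the commonly published definitions: Jaro counts characters that agree within a sliding window and penalises transpositions, and Winkler boosts the result for up to four agreeing initial characters. The scaling factor `p=0.1` is the usual default and is an assumption here, since the slides do not state it.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):  # at most four initial characters
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


print(round(jaro("martha", "marhta"), 3))
print(round(jaro_winkler("martha", "marhta"), 3))
```

For the pair "martha" / "marhta", all six characters match with one transposition, and the shared prefix "mar" lifts the Winkler score above the plain Jaro score.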

Record Pair Classification
• Two-class or three-class classification:
  • Match or non-match
  • Match or non-match or potential match (requires clerical review)
• Supervised and unsupervised
• Active learning

Example of Record Pair Classification

Unsupervised Classification
• Threshold-based classification
• Probabilistic classification
• Cost-based classification
• Rule-based classification
• Clustering-based classification
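The simplest of these, threshold-based classification, might look like the sketch below: the similarity values of a comparison vector are summed and the total is compared against two thresholds, giving the three classes match, non-match, and potential match. The threshold values are arbitrary assumptions for a three-attribute vector; setting `lower` equal to `upper` gives a two-class classifier.

```python
def classify(vector: list, lower: float = 1.5, upper: float = 2.5) -> str:
    """Three-class threshold classification on the summed similarities."""
    total = sum(vector)
    if total >= upper:
        return "match"
    if total <= lower:
        return "non-match"
    return "potential match"  # left for clerical review


print(classify([1.0, 0.9, 1.0]))
print(classify([1.0, 0.8, 0.0]))
print(classify([0.2, 0.1, 0.0]))
```

Summing treats every attribute as equally important, which is the main limitation that the probabilistic approach addresses with per-attribute weights.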

Probabilistic Classification
• Three-class based
• Different weights assigned to different attributes
• Newcombe & Kennedy: cardinalities
• Comparison vectors, binary comparison
• Conditionally independent attributes assumed
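A hedged sketch of how such attribute weights are commonly computed, assuming the standard Fellegi-Sunter style formulation: each attribute has an m-probability (agreement given a true match) and a u-probability (agreement given a true non-match), an agreement contributes log2(m/u), a disagreement contributes log2((1-m)/(1-u)), and conditional independence lets the per-attribute weights be summed. All probabilities and thresholds below are invented for illustration.

```python
import math

# Hypothetical (m, u) probabilities per attribute.
M_U = {
    "given_name": (0.95, 0.01),
    "surname": (0.95, 0.005),
    "postcode": (0.90, 0.02),
}


def match_weight(agreements: dict) -> float:
    """Sum per-attribute weights for a binary comparison vector."""
    total = 0.0
    for attr, agrees in agreements.items():
        m, u = M_U[attr]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify_weight(weight: float, lower: float = 0.0, upper: float = 10.0) -> str:
    """Three-class decision using two (assumed) thresholds."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "potential match"


mixed = {"given_name": False, "surname": True, "postcode": True}
w = match_weight(mixed)
print(round(w, 2), classify_weight(w))
```

Attributes with a low u-probability, such as a surname with many possible values, carry a larger agreement weight than attributes where chance agreement is common.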

Formulae

Example of Probabilistic Classification

Active Learning
• Trains a model with a small set of seed data
• Classifies comparison vectors not in the training set as matches or non-matches
• Asks users for help on the vectors that are most difficult to classify
• Adds the manually classified vectors to the training data set
• Trains the next, improved classification model
• Repeats until a stopping criterion is met
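The loop above can be sketched with a deliberately simple stand-in model: a single threshold on the summed similarities, placed halfway between the mean score of labelled matches and labelled non-matches. The model, the `oracle` callback that represents the human reviewer, and the stopping rule (a fixed number of rounds or an empty pool) are all assumptions for illustration; the seed data must contain at least one match and one non-match.

```python
def train(labelled: list) -> float:
    """Fit the toy model: a threshold midway between the two class means."""
    match_scores = [sum(v) for v, is_match in labelled if is_match]
    non_scores = [sum(v) for v, is_match in labelled if not is_match]
    mean_match = sum(match_scores) / len(match_scores)
    mean_non = sum(non_scores) / len(non_scores)
    return (mean_match + mean_non) / 2


def active_learning(seed, unlabelled, oracle, batch_size=2, max_rounds=5):
    labelled = list(seed)
    pool = list(unlabelled)
    threshold = train(labelled)
    for _ in range(max_rounds):
        if not pool:
            break
        # Vectors closest to the threshold are the hardest to classify.
        pool.sort(key=lambda v: abs(sum(v) - threshold))
        hardest, pool = pool[:batch_size], pool[batch_size:]
        # Ask the user (oracle) and add the answers to the training data.
        labelled += [(v, oracle(v)) for v in hardest]
        threshold = train(labelled)  # next, improved model
    return threshold, labelled


seed = [([1.0, 1.0, 1.0], True), ([0.0, 0.0, 0.0], False)]
pool = [[0.9, 0.8, 1.0], [0.5, 0.5, 0.5], [0.1, 0.2, 0.0], [0.6, 0.7, 0.4]]
reviewer = lambda v: sum(v) >= 1.6  # hypothetical stand-in for a human
threshold, labelled = active_learning(seed, pool, reviewer, max_rounds=1)
print(threshold, len(labelled))
```

Only the vectors nearest the decision boundary are sent for manual review, which keeps the labelling effort small compared with labelling a random sample.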
