Record Linkage in a Distributed Environment

Literature Review

Contents • Record linkage • Runtime reduction techniques • Blocking • Canopies • Sorted Neighborhood • Shift to parallel computing • Research directions

Record Linkage Problem • Determining if pairs of records refer to the same entity • E.g. Distinguishing between data belonging to… Yipeng, the NUS student and Yipeng, the son of PM Lee

Record Linkage Applications • Dedup two lists: O(M*N) comparisons • Dedup single list: O(N^2) comparisons
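
The two costs above can be checked by counting candidate pairs directly. The sketch below is only an illustration: the function names are hypothetical, and it counts unordered pairs for the single-list case, which is N(N-1)/2 and therefore still grows as O(N^2).

```python
from itertools import combinations, product


def count_pairs_two_lists(a, b):
    """Candidate pairs when linking list a (size M) against list b (size N): M * N."""
    return sum(1 for _ in product(a, b))


def count_pairs_single_list(records):
    """Unordered pairs when deduplicating one list of size N: N * (N - 1) / 2."""
    return sum(1 for _ in combinations(records, 2))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # Growing N by 10x grows the pair count by roughly 100x.
        print(n, count_pairs_single_list(range(n)))
```

Every one of these pairs would need a full record comparison, which is why the later slides look for ways to avoid enumerating them all.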

Dealing with Large Data • Pairwise comparison increasingly expensive • Blocking techniques • Reduce the search space • [Figure: example names Amanda, Amanda, David, Daniel]
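
A minimal sketch of blocking, using the four names from the slide. It assumes the blocking key is the first letter of the name, which is an illustrative choice and not necessarily the key used in the talk; `block_by_key` and `candidate_pairs` are hypothetical helper names.

```python
from collections import defaultdict
from itertools import combinations


def first_letter(name):
    """Assumed blocking key: the upper-cased first character."""
    return name[0].upper()


def block_by_key(records, key=first_letter):
    """Group records into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks


def candidate_pairs(records, key=first_letter):
    """Only records inside the same block are paired for comparison."""
    pairs = []
    for block in block_by_key(records, key).values():
        pairs.extend(combinations(block, 2))
    return pairs


names = ["Amanda", "Amanda", "David", "Daniel"]
pairs = candidate_pairs(names)
# Full pairwise comparison of 4 records needs 6 pairs; blocking leaves 2.
```

The saving comes at a price: two records for the same entity that land in different blocks (for example because of a typo in the first letter) are never compared.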

Canopies
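
The slide gives only the title, so the following is a sketch of canopy clustering as it is commonly described: a cheap distance and two thresholds (loose and tight) produce overlapping groups, and the expensive comparison is then run only within each group. The bigram Jaccard distance, the threshold values, and picking the first remaining record as the center (instead of a random one) are all assumptions made for illustration.

```python
def bigrams(s):
    """Set of character bigrams of a lower-cased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def cheap_distance(a, b):
    """Jaccard distance on character bigrams (an assumed cheap metric)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 1.0 - len(x & y) / len(x | y)


def canopies(records, loose=0.8, tight=0.4):
    """Build overlapping canopies; requires tight <= loose."""
    if tight > loose:
        raise ValueError("tight threshold must not exceed loose threshold")
    remaining = list(range(len(records)))  # indices still eligible as centers
    result = []
    while remaining:
        center = records[remaining[0]]
        # Everything within the loose threshold joins this canopy.
        result.append([r for r in records if cheap_distance(center, r) <= loose])
        # Everything within the tight threshold can no longer be a center.
        remaining = [i for i in remaining
                     if cheap_distance(center, records[i]) > tight]
    return result


groups = canopies(["Amanda", "Amanda", "David", "Daniel"])
```

Unlike plain blocking, a record may fall into several canopies, which makes the method more tolerant of a poor key at the cost of some repeated comparisons.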

Sorted Neighborhood • Comparison window: 2w−1
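
A sketch of the sorted neighborhood idea: sort the records on a key, then compare each record only with its neighbors in the sorted order. Here each record is paired with the w−1 records that follow it; since it is also reached from the w−1 records before it, its neighborhood spans 2w−1 positions, which is one possible reading of the figure on the slide. The function name and the default sort key are assumptions.

```python
def sorted_neighborhood_pairs(records, w, key=lambda r: r):
    """Pair each record with the next w - 1 records in sorted order."""
    if w < 2:
        raise ValueError("window size w must be at least 2")
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        # Look ahead at most w - 1 positions, stopping at the end of the list.
        for j in range(i + 1, min(i + w, len(ordered))):
            pairs.append((rec, ordered[j]))
    return pairs


names = ["David", "Amanda", "Daniel", "Amanda"]
sn_pairs = sorted_neighborhood_pairs(names, w=2)
```

The number of comparisons grows roughly as N*w instead of N^2, but matches are missed when the sort key places two records for the same entity more than w−1 positions apart.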

Dealing with Large Data • Pairwise comparison increasingly expensive • Blocking techniques • Reduce the search space • Limitations • Single node computation • Localized data source • Conflicting in function • [Figure: example names Amanda, Amanda, David, Daniel]

Shift to Parallel Computing • Multi-node computation • Data source flexibility • Complementary to blocking methods • Frontrunners: P-Febrl (P. Christen, 2003), P-Swoosh (H. Kawai, 2006), Parallel Linkage (H. Kim, 2007)

Parallel Record Linkage Contributions • Peter Christen • Parallelized Febrl with MPI • Linear speedup, but did not scale up well • Hideki Kawai • Designed P-Swoosh in a simulated environment • Match-based parallelism • 2x speedup with use of domain knowledge

Parallel Record Linkage Contributions • Hung-sik Kim, Dongwon Lee • Explored parallel record linkage for different input cases in MATLAB • Consistent Speedup • Not validated with very large datasets

MapReduce and Hadoop • Handles system-level concerns… • E.g. data distribution, fault tolerance, dynamic load balancing, portability and scalability • Convenient model for scaling record linkage • Better scale-up on pairwise comparisons (T. Elsayed, 2008) • Runtime increased linearly with dataset size (R. Vernica, 2010)
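
To show why the model is convenient for record linkage, the sketch below simulates the map, shuffle, and reduce steps in plain Python. It does not use Hadoop or any of its APIs; the function names, the first-letter blocking key, and the case-insensitive match rule are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(records, blocking_key):
    """Map: emit (blocking key, record) for every input record."""
    for rec in records:
        yield blocking_key(rec), rec


def shuffle(mapped):
    """Shuffle: group emitted records by key, as the framework would."""
    groups = defaultdict(list)
    for key, rec in mapped:
        groups[key].append(rec)
    return groups


def reduce_phase(values, is_match):
    """Reduce: pairwise comparison restricted to one block."""
    for a, b in combinations(values, 2):
        if is_match(a, b):
            yield a, b


def link(records, blocking_key, is_match):
    """Run the three steps in sequence and collect the matched pairs."""
    matches = []
    for values in shuffle(map_phase(records, blocking_key)).values():
        matches.extend(reduce_phase(values, is_match))
    return matches


people = [("Amanda", 1), ("amanda", 2), ("David", 3), ("Daniel", 4)]
matches = link(
    people,
    blocking_key=lambda r: r[0][0].lower(),
    is_match=lambda a, b: a[0].lower() == b[0].lower(),
)
```

Each reduce call touches only one block, so blocks can be processed on different nodes independently, while the framework takes care of distribution and fault tolerance.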

Research Directions • Tailoring Hadoop for record linkage problems • E.g. Bin packing blocks of different sizes • Experimenting with different problem types • E.g. Bipartite data centers • Adapting existing parallel clustering algorithms onto the MapReduce model
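
One way to picture the bin-packing direction: blocks of different sizes generate very different numbers of comparisons, so assigning them to reducers naively leaves some nodes idle. The sketch below uses a greedy "largest block first, least-loaded reducer next" heuristic. This is an assumed strategy for illustration, not a method described in the talk, and `assign_blocks` is a hypothetical name.

```python
import heapq


def pair_count(n):
    """Comparisons needed inside a block of n records."""
    return n * (n - 1) // 2


def assign_blocks(block_sizes, num_reducers):
    """Greedily assign blocks to reducers, balancing pairwise-comparison load."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}
    # Place the most expensive blocks first.
    ordered = sorted(block_sizes.items(),
                     key=lambda kv: pair_count(kv[1]), reverse=True)
    for name, size in ordered:
        load, reducer = heapq.heappop(heap)  # least-loaded reducer so far
        assignment[reducer].append(name)
        heapq.heappush(heap, (load + pair_count(size), reducer))
    loads = {r: sum(pair_count(block_sizes[b]) for b in blocks)
             for r, blocks in assignment.items()}
    return assignment, loads


assignment, loads = assign_blocks({"a": 10, "b": 6, "c": 5, "d": 4}, 2)
```

In this example a single large block still dominates one reducer, which hints at why oversized blocks may need to be split rather than only packed.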

Conclusions • Parallelism is a step in the right direction • Complementary to existing approaches • Consistent with the object orientation • But… • Parallel design and implementation is difficult • MapReduce is a viable solution
