This literature review delves into record linkage methods in distributed computing contexts, highlighting critical concepts such as blocking, canopies, sorted neighborhood, and the transition to parallel computing. It discusses the challenges of pairwise comparisons in large datasets and various runtime reduction techniques, while assessing notable contributions in parallel record linkage frameworks like P-Febrl and P-Swoosh. The explored research directions suggest potential adaptations of Hadoop for record linkage problems, advocating for the seamless integration of parallelism to enhance efficiency and scalability.
Record Linkage in a Distributed Environment: Literature Review
Contents • Record linkage • Runtime reduction techniques • Blocking • Canopies • Sorted Neighborhood • Shift to parallel computing • Research directions
Record Linkage Problem • Determining if pairs of records refer to the same entity • E.g. Distinguishing between data belonging to… Yipeng, the NUS student and Yipeng, the son of PM Lee
Record Linkage Applications • Dedup Two Lists: O(M*N) • Dedup Single List: O(N²)
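To make the two comparison counts concrete, here is a minimal Python sketch that enumerates the candidate pairs for each case. The function names and the sample names are illustrative assumptions, not part of the original slides.

```python
from itertools import combinations, product

def dedup_single_list_pairs(records):
    """All unordered pairs within one list: N*(N-1)/2 comparisons, i.e. O(N^2)."""
    return list(combinations(records, 2))

def link_two_lists_pairs(left, right):
    """All cross pairs between two lists: M*N comparisons."""
    return list(product(left, right))

# Hypothetical sample data for illustration only.
names = ["Amanda", "Amanda", "David", "Daniel"]
single = dedup_single_list_pairs(names)
cross = link_two_lists_pairs(names[:2], names[2:])

print(len(single))  # number of pairs within the single list of 4 records
print(len(cross))   # number of pairs between two lists of 2 records each
```

With 4 records the single-list case already needs 6 comparisons; the count grows quadratically, which is what motivates the runtime reduction techniques that follow.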
Dealing with Large Data • Pairwise comparison increasingly expensive • Blocking techniques • Reduce the search space • (Figure: example records Amanda, Amanda, David, Daniel)
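A minimal sketch of how blocking reduces the search space, assuming the blocking key is simply the first letter of a name. The key choice and the helper names `block_by_key` and `candidate_pairs` are assumptions for illustration; real systems typically use phonetic codes or other domain-specific keys.

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(records, key):
    """Group records into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    return blocks

def candidate_pairs(blocks):
    """Only records inside the same block are compared with each other."""
    for members in blocks.values():
        yield from combinations(members, 2)

# Hypothetical blocking key: first letter of the name.
names = ["Amanda", "David", "Amanda", "Daniel"]
blocks = block_by_key(names, key=lambda name: name[0])
pairs = list(candidate_pairs(blocks))

print(pairs)
```

Here the four records fall into two blocks of two, so only 2 candidate pairs remain instead of the 6 needed for a full pairwise comparison. The trade-off is that true matches whose keys differ are never compared.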
Sorted Neighborhood • Comparison Window: 2w−1
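The sorted neighborhood idea can be sketched as follows, assuming the common formulation in which records are sorted by a key and each record is compared with the w−1 records that precede it in sort order. One possible reading of the slide's 2w−1 is that each record then has w−1 neighbors on either side plus itself, but the slide does not spell this out, so treat that as an assumption. The function name and sorting key are also illustrative.

```python
def sorted_neighborhood_pairs(records, key, w):
    """Sort by key, then pair each record with the w-1 records just before it."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, record in enumerate(ordered):
        # Look back at most w-1 positions in the sorted order.
        for j in range(max(0, i - (w - 1)), i):
            pairs.append((ordered[j], record))
    return pairs

# Hypothetical sample data; the sorting key is the lowercased name.
names = ["David", "Amanda", "Daniel", "Amanda"]
pairs = sorted_neighborhood_pairs(names, key=str.lower, w=2)

print(pairs)
```

With w=2 only adjacent records in the sorted order are compared, so the number of comparisons grows roughly as N*(w−1) rather than quadratically.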
Dealing with Large Data • Pairwise comparison increasingly expensive • Blocking techniques • Reduce the search space • Limitations • Single-node computation • Localized data source • Conflicting in function • (Figure: example records Amanda, Amanda, David, Daniel)
Shift to Parallel Computing • Multi-node computation • Data source flexibility • Complementary to blocking methods • Frontrunners: • P-Febrl (P Christen 2003) • P-Swoosh (H Kawai 2006) • Parallel Linkage (H Kim 2007)
Parallel Record Linkage Contributions • Peter Christen • Parallelized Febrl with MPI • Linear speedup but did not scale up well • Hideki Kawai • Designed P-Swoosh in a simulated environment • Match-based parallelism • 2x speedup with use of domain knowledge
Parallel Record Linkage Contributions • Hung-sik Kim, Dongwon Lee • Explored parallel record linkage for different input cases in MATLAB • Consistent Speedup • Not validated with very large datasets
MapReduce and Hadoop • Handles system-level concerns… • E.g. data distribution, fault tolerance, dynamic load balancing, portability and scalability • Convenient model for scaling record linkage • Better scaleup on pairwise comparisons (T Elsayed 2008) • Runtime increased linearly with dataset size (R Vernica 2010)
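To illustrate why MapReduce is a convenient fit, here is an in-memory Python simulation of the map, shuffle, and reduce steps applied to blocked record linkage. This is not Hadoop code and uses no Hadoop API; the function names, the first-letter blocking key, and the exact-equality match rule are all assumptions chosen to keep the sketch short.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(record):
    """Map: emit (blocking key, record). Hypothetical key: lowercased first letter."""
    yield record[0].lower(), record

def shuffle(mapped):
    """Shuffle: group values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(key, records):
    """Reduce: compare pairs within one block. Placeholder rule: exact equality."""
    for a, b in combinations(records, 2):
        if a == b:
            yield (a, b)

# Hypothetical sample data.
records = ["Amanda", "David", "Amanda", "Daniel"]
mapped = [kv for record in records for kv in map_phase(record)]
matches = [m for key, recs in shuffle(mapped).items() for m in reduce_phase(key, recs)]

print(matches)
```

Each reduce call only sees one block, so blocks can be processed on different nodes independently, while the framework handles distribution and fault tolerance.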
Research Directions • Tailoring Hadoop for record linkage problems • E.g. Bin packing blocks of different sizes • Experimenting with different problem types • E.g. Bipartite data centers • Adapting existing parallel clustering algorithms onto the MapReduce model
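One way the bin packing direction might look in practice is sketched below: blocks of different sizes are assigned to workers so that the pairwise comparison load is roughly balanced. The slides only name bin packing; the greedy "largest cost first, least-loaded worker" heuristic, the cost model n*(n−1)/2, and all names here are assumptions.

```python
import heapq

def pairwise_cost(n):
    """Comparisons needed inside a block of n records."""
    return n * (n - 1) // 2

def assign_blocks(block_sizes, num_workers):
    """Greedy heuristic: place the costliest block on the least-loaded worker."""
    heap = [(0, worker) for worker in range(num_workers)]  # (load, worker id)
    heapq.heapify(heap)
    assignment = {worker: [] for worker in range(num_workers)}
    by_cost = sorted(block_sizes.items(),
                     key=lambda item: pairwise_cost(item[1]), reverse=True)
    for name, size in by_cost:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(name)
        heapq.heappush(heap, (load + pairwise_cost(size), worker))
    loads = {worker: sum(pairwise_cost(block_sizes[b]) for b in blocks)
             for worker, blocks in assignment.items()}
    return assignment, loads

# Hypothetical block sizes (records per block).
block_sizes = {"A": 10, "B": 6, "C": 5, "D": 4}
assignment, loads = assign_blocks(block_sizes, num_workers=2)

print(assignment)
print(loads)
```

Because cost grows quadratically with block size, a single large block can dominate a worker's load, which is why balancing by comparison count rather than by record count seems the more natural choice here.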
Conclusions • Parallelism is a step in the right direction • Complementary to existing approaches • Consistent with the object orientation • But… • Parallel design and implementation are difficult • MapReduce is a viable solution