
Record Linkage in a Distributed Environment

Huang Yipeng, Wing group meeting, 11 March 2011. Record linkage: determining whether pairs of personal records refer to the same entity, e.g. distinguishing between data belonging to <Yipeng, author of this presentation> and <Yipeng, son of PM Lee>.


Presentation Transcript


  1. Record Linkage in a Distributed Environment. Huang Yipeng, Wing group meeting, 11 March 2011

  2. Record Linkage • Determining if pairs of personal records refer to the same entity • E.g. distinguishing between data belonging to <Yipeng, author of this presentation> and <Yipeng, son of PM Lee>

  3. The Distributed Environment • Why? • Dealing with large data • Limitation of blocking • Advantages • Parallel computation • Data source flexibility • Complementary to blocking methods [Figure labels: O(nC2); four records named Amanda]
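
The O(nC2) label refers to the number of unordered record pairs, nC2 = n(n-1)/2. A minimal sketch of how that count behaves with and without blocking follows; the function names and block sizes are illustrative assumptions, not taken from the slides.

```python
from math import comb


def total_pairs(n: int) -> int:
    """Unordered record pairs among n records: nC2 = n(n-1)/2."""
    return comb(n, 2)


def blocked_pairs(block_sizes: list[int]) -> int:
    """Pairs left when only records inside the same block are compared."""
    return sum(comb(size, 2) for size in block_sizes)


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the effect of one dominant block.
    print(total_pairs(10_000))                   # compare everything
    print(blocked_pairs([100] * 100))            # 100 even blocks
    print(blocked_pairs([9_010] + [10] * 99))    # one very large block
```

Blocking removes most comparisons when blocks are even, but a single large block keeps the cost close to the all-pairs figure, which is one way to read "limitation of blocking".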

  4. The Distributed Environment • MapReduce • Distributed environment for large data sets • Hadoop • Open source implementation of MapReduce • Convenient model for scaling Record Linkage • Shields users from system-level concerns

  5. Research Problem • A disconnect between the generic parallel framework and the specific Record Linkage problem • The goal: tailor Hadoop for Record Linkage tasks

  6. Outline Introduction Related Work Methodology Evaluation Conclusion

  7. Related Work • Record Linkage Literature • Blocking techniques • Parallel Record Linkage Literature • P-Febrl (P Christen 2003) • P-Swoosh (H Kawai 2006) • Parallel Linkage (H Kim 2007) • Hadoop Literature • Evaluation Metrics • Pairwise comparisons (T Elsayed 2008)

  8. Outline Introduction Related Work Methodology Evaluation Conclusion

  9. MapReduce Workflow [Figure; labelled component: Partitioner]

  10. Implementation • Map • Purpose: parallelism, data manipulation, blocking • Reads lines of input and outputs <key, value> pairs • Reduce • Purpose: parallelism, Record Linkage ops • Records with the same <key> go to the same Reduce() • Outputs linkage results
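
The Map and Reduce roles above can be sketched as a single-process simulation in plain Python. This is not Hadoop code: the blocking key (first field, lowercased), the matcher `same_details`, and the sample lines are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input lines: name, surname, year.
SAMPLE_LINES = [
    "amanda, lee, 1980",
    "Amanda, lee, 1980",
    "amanda, tan, 1975",
    "yipeng, huang, 1985",
]


def map_record(line: str) -> tuple[str, str]:
    """Map: emit <key, value>, where the key acts as the blocking key."""
    key = line.split(",")[0].strip().lower()
    return key, line


def same_details(a: str, b: str) -> bool:
    """Toy matcher: records match if every field after the key agrees."""
    def rest(line: str) -> list[str]:
        return [f.strip().lower() for f in line.split(",")[1:]]
    return rest(a) == rest(b)


def reduce_block(records: list[str], is_match) -> list[tuple[str, str]]:
    """Reduce: compare every pair of records that share a key."""
    return [(a, b) for a, b in combinations(records, 2) if is_match(a, b)]


def run(lines: list[str], is_match) -> list[tuple[str, str]]:
    groups = defaultdict(list)
    for line in lines:                      # map phase
        key, value = map_record(line)
        groups[key].append(value)           # shuffle: group values by key
    results = []
    for records in groups.values():         # reduce phase, one call per key
        results.extend(reduce_block(records, is_match))
    return results


if __name__ == "__main__":
    for pair in run(SAMPLE_LINES, same_details):
        print(pair)
```

Because all records sharing a key reach the same reduce call, the cost of that call grows with the number of pairs in the block, not with the number of records.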

  11. Hash Partitioner • Default implementation: Hash(Key) mod N • Good for uniformly distributed data but not for skewed distributions [Figure labels: 5416986 comparisons vs. 210 comparisons]
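
A small sketch of why Hash(Key) mod N struggles with skew, assuming workload is measured in pairwise comparisons. The two figures on the slide are consistent with blocks of 3292 and 21 records, since C(3292, 2) = 5416986 and C(21, 2) = 210; whether those were the actual block sizes is not stated. `zlib.crc32` is a stand-in hash chosen for determinism, not Hadoop's hash, and the key names are hypothetical.

```python
import zlib
from math import comb


def hash_partition(key: str, n_nodes: int) -> int:
    """Hash(Key) mod N, with crc32 standing in for the real hash function."""
    return zlib.crc32(key.encode("utf-8")) % n_nodes


def node_workloads(block_sizes: dict[str, int], n_nodes: int) -> list[int]:
    """Pairwise comparisons each node performs when whole keys are hashed to nodes."""
    loads = [0] * n_nodes
    for key, size in block_sizes.items():
        loads[hash_partition(key, n_nodes)] += comb(size, 2)
    return loads


if __name__ == "__main__":
    # Hypothetical keys; sizes chosen to reproduce the two counts on the slide.
    blocks = {"common_name": 3292, "rare_name": 21}
    print(node_workloads(blocks, n_nodes=4))
```

However the keys hash, one node always carries the entire large block, so the reduce phase finishes only when that node does.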

  12. Record Linkage Partitioner • Goal: have all nodes finish the reduce phase at the same time • Attain a better runtime while retaining the same level of accuracy

  13. Domain principles • Counting pairwise comparisons gives a more accurate picture of the true computational workload • The distribution of names tends to follow a power law in many countries (D Zanette 2001), (S Miyazima 2000)
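
To illustrate the first principle, the sketch below compares the share of work held by the largest block when work is measured in records versus pairwise comparisons. The Zipf-like form (size proportional to 1/rank) and the numbers are assumptions used only to produce a power-law-shaped example.

```python
from math import comb


def zipf_block_sizes(n_blocks: int, largest: int, exponent: float = 1.0) -> list[int]:
    """Block sizes where the block at rank r holds about largest / r**exponent records."""
    return [max(1, round(largest / rank ** exponent)) for rank in range(1, n_blocks + 1)]


def share_of_largest(values: list[int]) -> float:
    """Fraction of the total contributed by the single largest value."""
    return max(values) / sum(values)


# Hypothetical distribution: 100 names, the most common shared by 1000 records.
SIZES = zipf_block_sizes(n_blocks=100, largest=1000)
PAIRS = [comb(size, 2) for size in SIZES]

if __name__ == "__main__":
    print(f"largest block, share of records:     {share_of_largest(SIZES):.2f}")
    print(f"largest block, share of comparisons: {share_of_largest(PAIRS):.2f}")
```

Since comparisons grow quadratically with block size, the largest block's share of comparisons is much higher than its share of records, so balancing by record count alone would underestimate the skew.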

  14. Record Linkage Workflow • Round 1: Range partition based on comparison workload • Round 2: Merge lost comparisons from Round 1 • Round 3: Remove cross duplicates

  15. Round 1 • 1. Calculate the average comparison workload over N nodes • 2. Check whether a record will exceed the average; if yes, divide it by the minimum number of nodes needed to drop below the average • 3. Assign records to nodes and update the average comparison workload to reflect lost comparisons, if any • 4. Recurse until the comparison load can be evenly distributed among nodes [Figure labels: Input, Distribution, Map Phase]
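
One possible reading of the four steps, sketched in Python. It assumes that "a record" means all records sharing one key, that splitting a key means cutting it into near-equal chunks, and that chunks are then placed on the least-loaded node; that greedy placement is a swapped-in technique, not necessarily the range partitioning the slides describe. All function names are hypothetical.

```python
import heapq
from math import comb


def split_evenly(size: int, k: int) -> list[int]:
    """Cut a block of `size` records into k near-equal chunks."""
    base, extra = divmod(size, k)
    return [base + 1] * extra + [base] * (k - extra)


def balance_chunks(block_sizes: dict[str, int], n_nodes: int) -> list[tuple[str, int]]:
    """Split oversized blocks until no chunk exceeds the average comparison load."""
    chunks = list(block_sizes.items())
    while True:
        # Step 1 (and step 3's update): average workload over the current chunks.
        avg = sum(comb(size, 2) for _, size in chunks) / n_nodes
        if all(comb(size, 2) <= avg for _, size in chunks):
            return chunks
        next_chunks = []
        for key, size in chunks:
            if comb(size, 2) <= avg:
                next_chunks.append((key, size))
                continue
            # Step 2: smallest k whose largest chunk, ceil(size / k), fits the average.
            k = 2
            while comb(-(-size // k), 2) > avg:
                k += 1
            next_chunks.extend((key, part) for part in split_evenly(size, k))
        chunks = next_chunks  # Step 4: repeat with the reduced workload.


def assign(chunks: list[tuple[str, int]], n_nodes: int) -> dict[int, list[tuple[str, int]]]:
    """Place chunks on nodes, largest first, always onto the least-loaded node."""
    heap = [(0, node) for node in range(n_nodes)]
    heapq.heapify(heap)
    placement = {node: [] for node in range(n_nodes)}
    for key, size in sorted(chunks, key=lambda c: c[1], reverse=True):
        load, node = heapq.heappop(heap)
        placement[node].append((key, size))
        heapq.heappush(heap, (load + comb(size, 2), node))
    return placement


def lost_comparisons(block_sizes: dict[str, int], chunks: list[tuple[str, int]]) -> int:
    """Comparisons no longer performed because a block was split across chunks."""
    before = sum(comb(size, 2) for size in block_sizes.values())
    after = sum(comb(size, 2) for _, size in chunks)
    return before - after
```

Splitting a block removes every comparison between its chunks, which is why the average has to be recomputed after each split and why a later round is needed to recover the lost comparisons.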

  16. Round 2 [Figure; labels: List, X, B]

  17. Round 2 • Only acts on lost comparisons • Because the input is indistinct, a third round of deduplication may be needed
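
A sketch of what "lost comparisons" could look like, assuming they are the pairs whose two records ended up in different chunks of the same block. The `dedupe` helper shows one reading of the third round, treating (a, b) and (b, a) as the same pair; the slides do not spell out either step, so both functions are illustrative.

```python
from itertools import combinations, product


def lost_pairs(chunks: list[list[str]]):
    """Yield pairs whose records landed in different chunks of one block."""
    for left, right in combinations(chunks, 2):
        yield from product(left, right)


def dedupe(pairs) -> set[tuple[str, str]]:
    """Collapse repeated pairs, ignoring the order of the two record ids."""
    return {tuple(sorted(pair)) for pair in pairs}


if __name__ == "__main__":
    # Hypothetical block of four records split into two chunks in Round 1.
    chunks = [["r1", "r2"], ["r3", "r4"]]
    print(list(lost_pairs(chunks)))
    print(dedupe([("r1", "r3"), ("r3", "r1")]))
```

For a block of four records split two and two, Round 1 covers one pair per chunk and the cross-chunk step covers the remaining four, which together give all six pairs.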

  18. Outline Introduction Related Work Methodology Evaluation Conclusion

  19. Performance Metrics • Performance evaluation in absolute runtime, speedup & scaleup on a shared cluster • “It’s what users care about” • Representative of real operations
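
The slides do not define speedup or scaleup; the sketch below uses the definitions common in the parallel database literature, which is an assumption about what was measured. The runtimes in the usage example are made up.

```python
def speedup(time_one_node: float, time_n_nodes: float) -> float:
    """Same data set, more nodes: runtime on 1 node / runtime on n nodes.
    Linear speedup on n nodes would give a value of n."""
    return time_one_node / time_n_nodes


def scaleup(time_base: float, time_scaled: float) -> float:
    """Data and nodes grown together: runtime of the base job on 1 node /
    runtime of the n-times-larger job on n nodes. Ideal value is 1.0."""
    return time_base / time_scaled


if __name__ == "__main__":
    # Hypothetical runtimes in seconds.
    print(speedup(time_one_node=100.0, time_n_nodes=25.0))
    print(scaleup(time_base=100.0, time_scaled=125.0))
```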

  20. Input Records • Example: <rec-359705-org, talyor, swift, 5, canterbury crescent, , cooks hill, 4122, , 19090518, 38, 07 34366927, 6174819, 9> • 10 million records, 0.9 million original, 0.1 million duplicate • Up to 9 duplicates per record • 1 modification per field, 1 modification per record • Duplicates follow a Poisson distribution
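
A minimal sketch of reading the example record, assuming fields are comma-separated and empty fields are kept. The `rec-<number>-dup-<k>` form for duplicate ids is an assumption about the generator's naming; only the `-org` form appears on the slide.

```python
import re

SAMPLE = ("rec-359705-org, talyor, swift, 5, canterbury crescent, , cooks hill, "
          "4122, , 19090518, 38, 07 34366927, 6174819, 9")

# Assumed id layout: rec-<entity number>-org, or rec-<entity number>-dup-<k>.
REC_ID = re.compile(r"^rec-(\d+)-(org|dup-\d+)$")


def parse_record(line: str) -> list[str]:
    """Split a record into fields, keeping empty fields as empty strings."""
    return [field.strip() for field in line.split(",")]


def entity_number(rec_id: str) -> str:
    """Return the entity number shared by an original and its duplicates."""
    match = REC_ID.match(rec_id)
    if match is None:
        raise ValueError(f"unexpected record id: {rec_id!r}")
    return match.group(1)


if __name__ == "__main__":
    fields = parse_record(SAMPLE)
    print(len(fields), fields[0], entity_number(fields[0]))
```

If the assumed id layout holds, two records refer to the same entity exactly when their entity numbers agree, which gives a ground truth for checking linkage accuracy.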

  21. Data sets • Synthetic data produced with the Febrl data generator • Artificially skewed distribution

  22. Utilization [Figure]

  23. Utilization [Figure]

  24. Utilization [Figure; labels: A, B, C]

  25. Utilization [Figure; labels: A, B]

  26. Round 2 [Figure; labels: J1, J2, ?, J3, J4, J5, J6] • Node utilization: 50-100%

  27. Results so far… [Figure]

  28. Results so far… • RL Workflow runtime • Similar to Hash-based runtime on small datasets • Better as the size of the dataset grows

  29. Conclusion • Parallelism is a step in the right direction for record linkage • Complementary to existing approaches • Hadoop can be tailored for Record Linkage tasks • The “Record Linkage” Partitioner / Workflow is just one example of possible improvements
