Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets

Hector Garcia-Molina, Stanford University. Work with: Omar Benjelloun, Qi Su, Jennifer Widom, Tyson Condie, Johnson Gong, Nicolas Pombourcq, David Menestrina, Steven Whang.


Presentation Transcript


  1. Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets. Hector Garcia-Molina, Stanford University. Work with: Omar Benjelloun, Qi Su, Jennifer Widom, Tyson Condie, Johnson Gong, Nicolas Pombourcq, David Menestrina, Steven Whang

  2. Entity Resolution • e1 = [N: a, A: b, CC#: c, Ph: e] • e2 = [N: a, Exp: d, Ph: e]

  3. Applications • comparison shopping • mailing lists • classified ads • customer files • counter-terrorism • e1 = [N: a, A: b, CC#: c, Ph: e] • e2 = [N: a, Exp: d, Ph: e]

  4. Outline • Why is ER challenging? • How is ER done? • Some ER work at Stanford • Confidences

  5. Challenges (1) • No keys! • Value matching: “Kaddafi”, “Qaddafi”, “Kadafi”, “Kaddaffi”... • Record matching: [Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777] vs. [Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212]
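The value-matching bullet can be illustrated with a small sketch. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; the talk does not name a specific measure, and the `similar` helper and its 0.8 threshold are assumptions made for this example only.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Return True if two strings are close enough to count as the same value.

    The 0.8 threshold is an arbitrary choice for this sketch, not a value
    taken from the talk.
    """
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold


spellings = ["Kaddafi", "Qaddafi", "Kadafi", "Kaddaffi"]

# Compare every spelling from the slide against the first one.
for s in spellings[1:]:
    print(s, similar(spellings[0], s))

# An unrelated name falls well below the threshold.
print("Smith", similar("Kaddafi", "Smith"))
```

A fixed threshold is the simplest possible policy; a real system would likely tune it per attribute, since names, addresses, and phone numbers tolerate different kinds of noise.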

  6. Challenges (2) • Merging records: [Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777] + [Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212, Zp: 94305] => [Nm: Tom, Nm: Thomas, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777, Zp: 94305]
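One simple way to picture the merge on this slide is a union of attribute values. The sketch below assumes records are dictionaries mapping an attribute name to a set of values; that representation and the `merge` helper are illustrative choices, not something the slides prescribe.

```python
from collections import defaultdict

# Assumed representation: attribute name -> set of observed values.
Record = dict[str, set[str]]


def merge(r: Record, s: Record) -> Record:
    """Union-style merge: keep every value seen for every attribute."""
    merged: Record = defaultdict(set)
    for record in (r, s):
        for attr, values in record.items():
            merged[attr] |= values
    return dict(merged)


tom = {
    "Nm": {"Tom"},
    "Ad": {"123 Main St"},
    "Ph": {"(650) 555-1212", "(650) 777-7777"},
}
thomas = {
    "Nm": {"Thomas"},
    "Ad": {"132 Main St"},
    "Ph": {"(650) 555-1212"},
    "Zp": {"94305"},
}

merged = merge(tom, thomas)
for attr in sorted(merged):
    print(attr, sorted(merged[attr]))
```

Note that a plain union keeps both addresses, whereas the merged record on the slide keeps only "123 Main St". A merge function that reproduces the slide would need an extra rule for resolving conflicting values.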

  7. Challenges (3) • Chaining: input records [Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K]; [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM]; [Nm: Thomas, Ad: 123 Maim, Oc: lawyer] • intermediate merge: [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer] • final merge: [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]

  8. Challenges (4) • Un-merging: [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K] • Too young to make 500K at IBM!!

  9. Challenges (5) • Confidences in data: [Nm: Tom (0.9), Ad: 123 Main St (1.0), Ph: (650) 555-1212 (0.6), Ph: (650) 777-7777 (0.8)] (0.8) • In value matching, match rules, merge: conf = ?

  10. Taxonomy • Pairwise snaps vs. clustering • De-duplication vs. fidelity enhancement • Schema differences • Relationships • Exact vs. approximate • Generic vs application specific • Confidences

  11. Schema Differences • [Name: Tom, Address: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777] vs. [FirstName: Tom, StreetName: Main St, StreetNumber: 123, Tel: (650) 777-7777]

  12. Pair-Wise Snaps vs. Clustering (diagram; node labels: r1 to r10 and s7 to s10)

  13. De-Duplication vs. Fidelity Enhancement B S R S N

  14. Relationships (diagram: records r1, r2, r5, r7 connected by edges labeled father, brother, business, business)

  15. Using Relationships (diagram: papers p1, p2, p5, p7 and authors a1 to a5; annotation: same??)

  16. Exact vs Approximate (diagram: products are split into categories, and each category is resolved separately) • cameras => ER => resolved cameras • CDs => ER => resolved CDs • books => ER => resolved books • ...

  17. Exact vs Approximate (diagram: a list of terrorists is sorted by age; the record [Widom, 30] is matched only against ages 25-35)
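The "sort by age, match against ages 25-35" idea on this slide can be sketched as a window lookup over a sorted list. The watch-list entries, the `candidates` helper, and the window of 5 years are all made up for illustration; only the idea of restricting comparisons to nearby ages comes from the slide.

```python
from bisect import bisect_left, bisect_right

# (name, age) pairs; names and ages are invented for this sketch.
watch_list = [("E", 48), ("B", 27), ("D", 35), ("A", 22), ("C", 31)]
watch_list.sort(key=lambda rec: rec[1])      # sort by age
ages = [age for _, age in watch_list]


def candidates(age: int, window: int = 5):
    """Return only the records whose age lies within +/- window years."""
    lo = bisect_left(ages, age - window)
    hi = bisect_right(ages, age + window)
    return watch_list[lo:hi]


# A 30-year-old is compared only against records aged 25 to 35.
print(candidates(30))
```

The trade-off is the one the slide title names: fewer comparisons, at the price of missing a true match whose recorded age is off by more than the window.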

  18. Generic vs Application Specific • Match function M(r, s) • Merge function <r, s> => t
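The generic approach treats the match and merge functions as black boxes supplied by the application. A minimal sketch of that separation follows; the type aliases, `try_merge`, and the `same_phone` and `union` plug-ins are hypothetical names chosen for this example, not an API from the talk.

```python
from typing import Callable, Optional

# Assumed representation: a record is a frozenset of (attribute, value) pairs.
Record = frozenset
MatchFn = Callable[[Record, Record], bool]   # M(r, s)
MergeFn = Callable[[Record, Record], Record]  # <r, s> => t


def try_merge(r: Record, s: Record, match: MatchFn, merge: MergeFn) -> Optional[Record]:
    """Generic step: the engine only ever calls the two black boxes."""
    return merge(r, s) if match(r, s) else None


# One possible application-specific plug-in: a shared phone number is a match.
def same_phone(r: Record, s: Record) -> bool:
    phones_r = {v for a, v in r if a == "Ph"}
    phones_s = {v for a, v in s if a == "Ph"}
    return bool(phones_r & phones_s)


def union(r: Record, s: Record) -> Record:
    return r | s


e1 = frozenset({("N", "a"), ("A", "b"), ("CC#", "c"), ("Ph", "e")})
e2 = frozenset({("N", "a"), ("Exp", "d"), ("Ph", "e")})

print(sorted(try_merge(e1, e2, same_phone, union)))
```

Swapping in a different `match` or `merge` changes the application without touching the generic step.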

  19. Taxonomy • Pairwise snaps vs. clustering • De-duplication vs. fidelity enhancement • Schema differences • Relationships • Exact vs. approximate • Generic vs application specific • Confidences

  20. Outline • Why is ER challenging? • How is ER done? • Some ER work at Stanford • Confidences

  21. Taxonomy • Pairwise snaps vs. clustering • De-duplication vs. fidelity enhancement • Schema differences (No) • Relationships (No) • Exact vs. approximate • Generic vs application specific • Confidences (... later on)

  22. Model • r3 = [Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K] • r1 = [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM] • r2 = [Nm: Thomas, Ad: 123 Maim, Oc: lawyer] • M(r1, r2): r4 = <r1, r2> = [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer] • M(r4, r3): <r4, r3> = [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]

  23. Correct Answer (diagram; node labels: r1 to r6 and s7 to s10) • ER(R) = all derivable records, minus “dominated” records
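The "minus dominated records" step can be sketched as follows. The slide does not define domination, so this example assumes the simplest reading: a record is dominated when its attribute-value pairs are a proper subset of another record's pairs. The `remove_dominated` helper is a hypothetical name.

```python
def remove_dominated(records):
    """Drop every record whose pairs are a proper subset of another record's."""
    records = set(records)
    return {r for r in records if not any(r < s for s in records)}


# Records as frozensets of (attribute, value) pairs (an assumed representation).
r1 = frozenset({("a", 1), ("b", 2)})
r2 = frozenset({("a", 1), ("c", 4), ("e", 5)})
r12 = r1 | r2          # merged record containing everything in r1 and r2
r4 = frozenset({("a", 7), ("e", 5), ("f", 6)})

print(remove_dominated([r1, r2, r12, r4]) == {r12, r4})   # True
```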

  24. Question • What is the best sequence of match and merge calls that gives us the right answer?

  25. Brute Force Algorithm • Input R: • r1 = [a:1, b:2] • r2 = [a:1, c:4, e:5] • r3 = [b:2, c:4, f:6] • r4 = [a:7, e:5, f:6]

  26. Brute Force Algorithm • Input R: • r1 = [a:1, b:2] • r2 = [a:1, c:4, e:5] • r3 = [b:2, c:4, f:6] • r4 = [a:7, e:5, f:6] • Match all pairs: • r1 = [a:1, b:2] • r2 = [a:1, c:4, e:5] • r3 = [b:2, c:4, f:6] • r4 = [a:7, e:5, f:6] • r12 = [a:1, b:2, c:4, e:5]

  27. Brute Force Algorithm • Match all pairs: • r1 = [a:1, b:2] • r2 = [a:1, c:4, e:5] • r3 = [b:2, c:4, f:6] • r4 = [a:7, e:5, f:6] • r12 = [a:1, b:2, c:4, e:5] • Repeat: • r1 = [a:1, b:2] • r2 = [a:1, c:4, e:5] • r3 = [b:2, c:4, f:6] • r4 = [a:7, e:5, f:6] • r12 = [a:1, b:2, c:4, e:5] • r123 = [a:1, b:2, c:4, e:5, f:6]
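The brute-force loop on these slides (match all pairs, add the merges, repeat) can be sketched in a few lines. The slides do not state the match rule, so the one below is a hypothetical rule picked only because it reproduces the records shown: two records match when they agree on attribute `a`, or on both `b` and `c`. The merge is assumed to be a plain union.

```python
from itertools import combinations


def rec(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())


def match(r, s):
    # Hypothetical rule (not stated on the slides): two records match when
    # they agree on attribute "a", or on both "b" and "c".
    shared = {attr for attr, _ in r & s}
    return "a" in shared or {"b", "c"} <= shared


def merge(r, s):
    # Hypothetical merge: keep every attribute-value pair of both records.
    return r | s


def derive_all(records):
    """Match all pairs, add the merges, and repeat until nothing new appears."""
    derived = set(records)
    while True:
        new = {merge(r, s) for r, s in combinations(derived, 2) if match(r, s)}
        if new <= derived:
            return derived
        derived |= new


r1, r2 = rec(a=1, b=2), rec(a=1, c=4, e=5)
r3, r4 = rec(b=2, c=4, f=6), rec(a=7, e=5, f=6)

derived = derive_all([r1, r2, r3, r4])
r12, r123 = r1 | r2, r1 | r2 | r3
print(len(derived))                      # 6 records: r1, r2, r3, r4, r12, r123
print(r12 in derived, r123 in derived)   # True True
```

Every round compares all pairs again, including pairs already compared, which is the redundancy the two questions on the following slides point at.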

  28. Question # 1 Can we delete r1, r2?

  29. Question # 2 Can we avoid comparisons?

  30. ICAR Properties • Idempotence: • M(r1, r1) = true; <r1, r1> = r1 • Commutativity: • M(r1, r2) = M(r2, r1) • <r1, r2> = <r2, r1> • Associativity • <r1, <r2, r3>> = <<r1, r2>, r3>

  31. More Properties • Representativity • If <r1, r2> = r3, then for any r4 such that M(r1, r4) is true, we also have M(r3, r4) = true. (diagram labels: r1, r2, r3, r4)
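The four properties (idempotence, commutativity, associativity, representativity) can be spot-checked on a sample of records. The sketch below is a simplified test harness, not a proof: `satisfies_icar` is a hypothetical helper, it only examines the records it is given, and it checks associativity on all triples regardless of whether the records match.

```python
from itertools import product


def satisfies_icar(sample, match, merge):
    """Spot-check the ICAR properties on every combination in the sample."""
    for r in sample:
        # Idempotence: a record matches itself and merges to itself.
        if not (match(r, r) and merge(r, r) == r):
            return False
    for r, s in product(sample, repeat=2):
        # Commutativity of match and (for matching records) of merge.
        if match(r, s) != match(s, r):
            return False
        if match(r, s) and merge(r, s) != merge(s, r):
            return False
    for r, s, t in product(sample, repeat=3):
        # Associativity (simplified: checked even for non-matching records).
        if merge(r, merge(s, t)) != merge(merge(r, s), t):
            return False
        # Representativity: whatever matched r still matches <r, s>.
        if match(r, s) and match(r, t) and not match(merge(r, s), t):
            return False
    return True


def shares_a_pair(r, s):
    return bool(r & s)          # match if any attribute-value pair is shared


def union(r, s):
    return r | s


def keep_left(r, s):
    return r                    # deliberately non-commutative merge


sample = [
    frozenset({("a", 1), ("b", 2)}),
    frozenset({("a", 1), ("c", 4), ("e", 5)}),
    frozenset({("b", 2), ("c", 4), ("f", 6)}),
    frozenset({("a", 7), ("e", 5), ("f", 6)}),
]

print(satisfies_icar(sample, shares_a_pair, union))      # True
print(satisfies_icar(sample, shares_a_pair, keep_left))  # False
```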

  32. ICAR Properties => Efficiency • Commutativity • Idempotence • Associativity • Representativity • Can discard records • ER result independent of processing order

  33. Swoosh Algorithms • Record Swoosh • Merges records as soon as they match • Optimal in terms of record comparisons • Feature Swoosh • Remembers values seen for each feature • Avoids redundant value comparisons
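The "merges records as soon as they match" bullet can be sketched as a two-set loop: one list of records still to process and one list of records known not to match each other. This is a minimal illustration of that idea, not the authors' implementation; the match rule (agree on `a`, or on both `b` and `c`) and the union merge are hypothetical and assume the ICAR properties hold.

```python
def rec(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())


def match(r, s):
    # Hypothetical rule: agree on "a", or on both "b" and "c".
    shared = {attr for attr, _ in r & s}
    return "a" in shared or {"b", "c"} <= shared


def merge(r, s):
    return r | s


def record_swoosh(records):
    todo = list(records)   # records still to be processed
    resolved = []          # records that do not match one another
    while todo:
        r = todo.pop()
        buddy = next((s for s in resolved if match(r, s)), None)
        if buddy is None:
            resolved.append(r)
        else:
            # Merge immediately; the two source records are discarded and
            # the merged record goes back in to be compared again.
            resolved.remove(buddy)
            todo.append(merge(r, buddy))
    return resolved


r1, r2 = rec(a=1, b=2), rec(a=1, c=4, e=5)
r3, r4 = rec(b=2, c=4, f=6), rec(a=7, e=5, f=6)

result = record_swoosh([r1, r2, r3, r4])
print(set(result) == {r1 | r2 | r3, r4})   # True
```

Because source records are dropped as soon as they merge, no dominated records need to be filtered out at the end, and each new record is compared only against the current resolved list.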

  34. Swoosh Performance

  35. If ICAR Properties Do Not Hold? r12: [Joe Sr., 123 Main, Ph: 123, DL:X] r23: [Joe Jr., 123 Main, Ph: 123, DL:Y] r3: [Joe Jr., 123 Main, DL:Y] r1: [Joe Sr., 123 Main, DL:X] r2: [Joe, 123 Main, Ph:123]

  36. If ICAR Properties Do Not Hold? r12: [Joe Sr., 123 Main, Ph: 123, DL:X] r23: [Joe Jr., 123 Main, Ph: 123, DL:Y] r3: [Joe Jr., 123 Main, DL:Y] r1: [Joe Sr., 123 Main, DL:X] r2: [Joe, 123 Main, Ph:123] Full Answer: ER(R) = {r12, r23, r1, r2, r3} Minus Dominated: ER(R) = {r12, r23}

  37. If ICAR Properties Do Not Hold? r12: [Joe Sr., 123 Main, Ph: 123, DL:X] r23: [Joe Jr., 123 Main, Ph: 123, DL:Y] r3: [Joe Jr., 123 Main, DL:Y] r1: [Joe Sr., 123 Main, DL:X] r2: [Joe, 123 Main, Ph:123] Full Answer: ER(R) = {r12, r23, r1, r2, r3} Minus Dominated: ER(R) = {r12, r23} R-Swoosh Yields: ER(R) = {r12, r3} or {r1, r23}
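The order dependence on this slide can be reproduced with a small sketch. The match rule here is an assumption chosen to mimic the example: two records match when they share an address and together carry at most one driver's-license value. Under that rule r1 and r3 each match r2, but once r2 is merged with one of them the result no longer matches the other, so representativity fails. The merge is a plain union, so the merged records keep both name values rather than the single name shown on the slide.

```python
def values(record, attr):
    return {v for a, v in record if a == attr}


def match(r, s):
    # Assumed rule: same address, and no conflicting license numbers.
    same_address = bool(values(r, "Ad") & values(s, "Ad"))
    return same_address and len(values(r | s, "DL")) <= 1


def merge(r, s):
    return r | s


def swoosh_in_order(records):
    """Merge-as-you-go loop that processes records front to back."""
    todo, resolved = list(records), []
    while todo:
        r = todo.pop(0)
        buddy = next((s for s in resolved if match(r, s)), None)
        if buddy is None:
            resolved.append(r)
        else:
            resolved.remove(buddy)
            todo.append(merge(r, buddy))
    return set(resolved)


r1 = frozenset({("Nm", "Joe Sr."), ("Ad", "123 Main"), ("DL", "X")})
r2 = frozenset({("Nm", "Joe"), ("Ad", "123 Main"), ("Ph", "123")})
r3 = frozenset({("Nm", "Joe Jr."), ("Ad", "123 Main"), ("DL", "Y")})
r12, r23 = r1 | r2, r2 | r3

print(swoosh_in_order([r1, r2, r3]) == {r12, r3})   # True
print(swoosh_in_order([r3, r2, r1]) == {r1, r23})   # True
```

Neither run returns {r12, r23}, the answer listed on the slide after removing dominated records, because r2 is consumed by whichever merge happens first.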

  38. Swoosh Without ICAR Properties

  39. Distributed Swoosh (diagram: processors P1, P2, P3; input records r1, r2, r3, r4, r5, r6, ...)

  40. Distributed Swoosh (diagram: processors P1, P2, P3, each holding an overlapping subset of the records) • {r1, r3, r4, r6, ...} • {r1, r2, r4, r5, ...} • {r2, r3, r5, r6, ...}
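The three record groups on this slide are consistent with a simple scheme: split the records into three buckets and give each processor a pair of buckets, so that any two records meet on at least one processor. The sketch below reconstructs that scheme from the slide's layout; it is an inference, and the `assign` helper is a hypothetical name, so the distribution strategy actually used may differ.

```python
from itertools import combinations


def assign(records, n_buckets=3):
    """Round-robin the records into buckets, then pair up the buckets."""
    buckets = [records[i::n_buckets] for i in range(n_buckets)]
    # One processor per pair of buckets, so any two records share a processor.
    return [set(a) | set(b) for a, b in combinations(buckets, 2)]


records = ["r1", "r2", "r3", "r4", "r5", "r6"]
processors = assign(records)
for p in processors:
    print(sorted(p))

# Every pair of records is co-located on at least one processor.
covered = all(
    any({r, s} <= p for p in processors)
    for r, s in combinations(records, 2)
)
print(covered)   # True
```

The cost of this coverage is replication: each record is stored on two of the three processors.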

  41. DSwoosh Performance

  42. Outline • Why is ER challenging? • How is ER done? • Some ER work at Stanford • Confidences

  43. Conclusion • ER is an old and important problem • Our approach: generic • Confidences: challenging; two ways to tame them: thresholds and packages

  44. Thanks.
