
Anchor Points Algorithms for Hamming and Edit Distance. Foto Afrati (National Technical University of Athens), Anish Das Sarma (ClearList Inc.), Anand Rajaraman (Cambrian Ventures), Pokey Rule (Stanford University), Semih Salihoglu (Stanford University)


Presentation Transcript


  1. Anchor Points Algorithms for Hamming and Edit Distance • Foto Afrati (National Technical University of Athens) • Anish Das Sarma (ClearList Inc.) • Anand Rajaraman (Cambrian Ventures) • Pokey Rule (Stanford University) • Semih Salihoglu (Stanford University) • Jeff Ullman (Stanford University)

  2. Fuzzy Joins • Input: a set of records R • Output: all pairs <$rec_i$, $rec_j$> such that dist($rec_i$, $rec_j$) ≤ d • Example applications: entity resolution, clustering, collaborative filtering
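The input/output definition above can be sketched as a brute-force join. This is only an illustration of the problem statement, not an algorithm from the talk: the names `hamming` and `naive_fuzzy_join` are hypothetical, records are assumed to be equal-length strings, and Hamming distance is used as the example `dist`.

```python
from itertools import combinations

def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))

def naive_fuzzy_join(records, d, dist=hamming):
    """Return every unordered pair of distinct records at distance <= d."""
    unique = sorted(set(records))
    return {(r, s) for r, s in combinations(unique, 2) if dist(r, s) <= d}

# Illustrative data: only 0000/0001 and 0111/1111 are within distance 1.
pairs = naive_fuzzy_join(["0000", "0001", "0111", "1111"], d=1)
```

Comparing all pairs on one machine is quadratic in |R|, which is the cost the one-round MapReduce algorithms in the following slides try to spread across reducers.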

  3. Two Specific Distance Measures • 1. Hamming Distance • Input: bit strings R of length n • 2. Edit Distance • Input: strings R of length n over alphabet A
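The slide names edit distance without defining it. The sketch below assumes the variant that counts only single-character insertions and deletions (suggested by the insertion-1 and deletion-1 codes later in the deck); if substitutions also count as one edit, the standard Levenshtein recurrence would apply instead. The function name `indel_distance` is hypothetical.

```python
def indel_distance(x: str, y: str) -> int:
    """Edit distance counting only insertions and deletions (an assumption).

    Under this definition the distance equals len(x) + len(y) - 2 * LCS(x, y),
    where LCS is the length of a longest common subsequence.
    """
    # lcs[j] holds the LCS length of the processed prefix of x and y[:j].
    lcs = [0] * (len(y) + 1)
    for a in x:
        prev_diag = 0  # value of lcs[j - 1] from the previous row
        for j, b in enumerate(y, start=1):
            prev_row = lcs[j]
            lcs[j] = prev_diag + 1 if a == b else max(lcs[j], lcs[j - 1])
            prev_diag = prev_row
    return len(x) + len(y) - 2 * lcs[-1]
```

For example, `indel_distance("abc", "abd")` is 2 under this definition (delete `c`, insert `d`), whereas a substitution-based definition would give 1.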

  4. Fuzzy Joins in One-Round MapReduce • (figure: a Map phase followed by a Reduce phase) • Two cost measures: per-reducer memory cost and communication cost

  5. Communication Cost vs Per-reducer Memory • (plot: communication against per-reducer memory for |R| = $2^n$, with points labeled Ball-Hashing, Grouping (naïve), Anchor Points, and Splitting)

  6. Outline • Anchor Points Algorithm • Covering Code • Explicit Construction of Hamming Distance Covering Codes • Explicit Construction of Edit Distance Covering Codes

  7. Outline • Anchor Points Algorithm • Covering Code • Explicit Construction of Hamming Distance Codes • Explicit Construction of Edit Distance Codes

  8. Covering Code • Given a set R of strings of length n and a radius k • Definition: C is an <n, k> covering code if for each s ∈ R there is a c ∈ C such that dist(c, s) ≤ k

  9. Example Covering Code • Example: Hamming distance, n=5, k=2 • (figure: the strings of R lie between the two code words 11111 and 00000)
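The covering-code definition can be checked by brute force for small n. The sketch below assumes R is the set of all bit strings of length n and uses Hamming distance; `is_covering_code` is a hypothetical helper name.

```python
from itertools import product

def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

def is_covering_code(code, n, k):
    """True if every n-bit string is within Hamming distance k of a code word."""
    return all(
        any(hamming(s, c) <= k for c in code)
        for s in map("".join, product("01", repeat=n))
    )

# The two-word code from the example covers all 5-bit strings with radius 2:
# any 5-bit string has at most two 1s or at most two 0s.
example_ok = is_covering_code({"00000", "11111"}, n=5, k=2)
```

Dropping either code word, or shrinking the radius to 1, makes the check fail, which is why both code words and k=2 are needed in this example.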

  10. Anchor Points Algorithm (1) • Let C be an <n, k> covering code (e.g., n=5, k=2) • One reducer for each code word (e.g., reducers $r_{00000}$ and $r_{11111}$) • Map each string s to all code words at distance ≤ k + d/2 (e.g., d=2 gives 2 + 2/2 = 3)

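  11. Anchor Points Algorithm (2) • Triangle inequality: for strings u, v with dist(u, v) ≤ d, take a string w with dist(u, w) ≤ d/2 and dist(w, v) ≤ d/2, and a code word c with dist(w, c) ≤ k • Then dist(u, c) ≤ k + d/2 and dist(v, c) ≤ k + d/2, so u and v meet at the reducer for c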
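The map and reduce steps of the two slides above can be simulated on one machine. This is a minimal sketch under stated assumptions, not the authors' implementation: R is taken to be all 5-bit strings, the radius uses ⌈d/2⌉ so that odd d is also handled (the slides write d/2), and duplicate pairs found at several reducers are removed by collecting results in a set. The name `anchor_points_join` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product

def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

def anchor_points_join(records, code, k, d):
    """Simulate the one-round join with one reducer per code word."""
    radius = k + (d + 1) // 2  # k + ceil(d/2); equals k + d/2 for even d
    # Map: send each string to the reducer of every nearby code word.
    reducers = defaultdict(list)
    for s in records:
        for c in code:
            if hamming(s, c) <= radius:
                reducers[c].append(s)
    # Reduce: compare the strings that met at the same reducer.
    result = set()
    for strings in reducers.values():
        for u, v in combinations(sorted(strings), 2):
            if hamming(u, v) <= d:
                result.add((u, v))
    return result

all_strings = ["".join(bits) for bits in product("01", repeat=5)]
found = anchor_points_join(all_strings, ["00000", "11111"], k=2, d=2)
# Brute-force reference: every pair within distance 2.
expected = {(u, v) for u, v in combinations(all_strings, 2) if hamming(u, v) <= 2}
```

On this input `found` and `expected` agree, as the triangle-inequality argument predicts: each close pair shares at least one reducer.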

  12. Cost of Anchor Points Algorithm • The reducer for code word c receives every string within distance k + d/2 of c • B(n, r): size of the ball of radius r • Per-reducer memory: B(n, k + d/2) • Communication: |C| · B(n, k + d/2)
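For Hamming distance the ball size has a closed form, B(n, r) = Σ_{i ≤ r} C(n, i), so the two costs can be computed directly. The sketch below assumes R is all $2^n$ bit strings and d is even; the function names are hypothetical.

```python
from math import comb

def ball_size(n, r):
    """B(n, r): number of n-bit strings within Hamming distance r of a fixed string."""
    return sum(comb(n, i) for i in range(min(n, r) + 1))

def anchor_points_cost(n, k, d, code_size):
    """Return (per-reducer memory, communication) as given on the slide."""
    per_reducer = ball_size(n, k + d // 2)
    return per_reducer, code_size * per_reducer

# Running example: n=5, k=2, d=2 with the two-word code {00000, 11111}.
memory, communication = anchor_points_cost(n=5, k=2, d=2, code_size=2)
```

In the running example each reducer holds B(5, 3) = 26 of the 32 strings, and 52 strings are communicated in total.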

  13. Communication Cost vs Per-reducer Memory • (plot as on slide 5, with Anchor Points marked at k=0, k=1, k=2, and k=n)

  14. Outline • Anchor Points Algorithm • Covering Code • Explicit Construction of Hamming Distance Codes • Explicit Construction of Edit Distance Codes

  15. Some Known Hamming Distance Codes • A perfect <n, k> code (i.e., the smallest possible) has size $2^n / B(n, k)$; example: Hamming codes • For any k, codes of size $n 2^n / B(n, k)$ are known to exist, but they are not perfect • Problem: no explicit construction
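As a concrete instance of a perfect code, the binary Hamming code of length n = 7 is a <7, 1> covering code with $2^7 / B(7, 1)$ = 128 / 8 = 16 code words. The sketch below builds it from the standard parity-check description (a word is a code word when the XOR of the 1-based positions of its 1-bits is zero) and checks both properties by brute force; the helper names are hypothetical.

```python
from functools import reduce
from itertools import product
from operator import xor

def hamming_code(r):
    """Binary Hamming code of length n = 2**r - 1, as a list of bit tuples."""
    n = 2**r - 1
    return [
        w for w in product((0, 1), repeat=n)
        # Syndrome: XOR of the positions (1-based) that hold a 1.
        if reduce(xor, (i for i, bit in enumerate(w, start=1) if bit), 0) == 0
    ]

def distance(u, v):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(u, v))

n = 7
code = hamming_code(3)
# Perfect size: |C| * B(n, 1) == 2**n, with B(n, 1) = n + 1.
is_perfect_size = len(code) * (n + 1) == 2**n
# Covering radius 1: every 7-bit word is within distance 1 of a code word.
covers_all = all(
    any(distance(w, c) <= 1 for c in code)
    for w in product((0, 1), repeat=n)
)
```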

  16. Cross Product Method (Explicit HD <n, k> Codes) • Start with an <n/t, k/t> code D • Let C = D × D × … × D (t times) • Claim: C is an <n, k> covering code • Proof: split s = $s_1 s_2 \dots s_t$ into t blocks of length n/t; each block $s_i$ is within distance k/t of some $d_i$ ∈ D, so c = $d_1 d_2 \dots d_t$ satisfies dist(s, c) ≤ t · (k/t) = k

  17. Example of Cross Product Method • n=10, k=4, t=2, so use a <5, 2> code D • D = {00000, 11111} • C = {00000-00000, 00000-11111, 11111-00000, 11111-11111} • 1100011100 = 11000-11100 is within 2+2 = 4 of 00000-11111 • 1110000001 = 11100-00001 is within 2+1 = 3 of 11111-00000
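The cross product construction and the example above can be checked by brute force. A minimal sketch, assuming Hamming distance over all 10-bit strings; `cross_product_code` and `covering_radius` are hypothetical names.

```python
from itertools import product

def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

def cross_product_code(base_code, t):
    """C = D x D x ... x D (t times), with code words concatenated."""
    return ["".join(parts) for parts in product(base_code, repeat=t)]

def covering_radius(code, n):
    """Largest distance from any n-bit string to its nearest code word."""
    return max(
        min(hamming(s, c) for c in code)
        for s in map("".join, product("01", repeat=n))
    )

D = ["00000", "11111"]          # the <5, 2> code from the example
C = cross_product_code(D, t=2)  # four code words of length 10
radius = covering_radius(C, n=10)
```

Here `radius` comes out as 4, so C is a <10, 4> covering code; the two strings on the slide sit at distance 4 and 3 from their nearest code words.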

  18. Size of Cross Product Codes • Assume D is perfect (e.g., a Hamming code) • Cross product code: $|C| = |D|^t = 2^n / B(n/t, k/t)^t$ • Perfect <n, k> code: $2^n / B(n, k)$ • Example: k=2, t=2 • For large n and small t, the two sizes are asymptotically the same
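The formulas on this slide did not survive extraction, so the following derivation is a reconstruction from the definitions on slides 15 and 16 rather than the authors' own wording. It assumes D is a perfect <n/t, k/t> code and works the k=2, t=2 case, where D has radius 1 and B(m, 1) = m + 1.

```latex
\begin{align*}
|C| &= |D|^{t}
     = \left(\frac{2^{n/t}}{B(n/t,\,k/t)}\right)^{t}
     = \frac{2^{n}}{B(n/t,\,k/t)^{t}}, \\[4pt]
\text{for } k=2,\ t=2:\quad
|C| &= \frac{2^{n}}{\left(\tfrac{n}{2}+1\right)^{2}}
     = \frac{4\cdot 2^{n}}{(n+2)^{2}}, \\[4pt]
\text{perfect } \langle n, 2\rangle \text{ code:}\quad
\frac{2^{n}}{B(n,2)} &= \frac{2^{n}}{1+n+\binom{n}{2}}
     = \frac{2\cdot 2^{n}}{n^{2}+n+2}, \\[4pt]
\frac{|C|}{2^{n}/B(n,2)} &= \frac{2\,(n^{2}+n+2)}{(n+2)^{2}}
     \;\longrightarrow\; 2 \quad (n \to \infty).
\end{align*}
```

Both sizes are therefore $\Theta(2^n / n^2)$; under these assumptions the cross product code is larger by a constant factor that approaches 2.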

  19. Outline • Anchor Points Algorithm • Covering Code • Explicit Construction of Hamming Distance Covering Codes • Explicit Construction of Edit Distance Covering Codes

  20. Edit Distance Fuzzy Joins • Input: strings of length n over alphabet A (i.e., $|A|^n$ strings) • The covering-code algorithm works in the same way: • If C is an <n, k> edit distance code • Send s to all code words at distance ≤ k + d/2

  21. Differences with Hamming Distance • The length of code words might be different • E.g. 1 insertion, |c| = n+1: insertion-1 code • E.g. 1 deletion, |c| = n-1: deletion-1 code • Different code words might have different ball sizes • No known perfect codes or explicit construction • (figure: strings of length n, e.g. aaaa…a and aaba…a, and code words of length n+1, e.g. aaaaa…a and ababa…a)

  22. Insertion-1 Codes • Let n=5, |A|=a=4; code words are of length 6 • Letters as integers from 0 to a-1: e.g. 0230, … • Let $s_i$ be the ith digit of s • sum(s) = $\sum_i s_i \cdot (|s| - i + 1)$, i.e. each digit weighted by its position counted from the right • score(s) = sum(s) mod (n+1)(a-1) (e.g., the modulus is 6·3 = 18) • R = any a-1 consecutive residues: • e.g. {0,1,2}, {12,13,14}, {16,17,0} • C = {003000, 303000, 003001, 003002, 200000, …} • |C| is a factor of a worse than the best possible

  23. Example: s=23010, sum(s)=24, score(s)=6 • Strings obtained from s by one insertion: 230100, 230010, 203010, 230010, 023010, 323010, 230310, 230130, 233010, 233010 • (figure: each of these strings is marked X or Y)
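A minimal sketch of the insertion-1 construction for n=5 and a=4. It rests on two assumptions. First, sum(s) weights each digit by its position counted from the right end, a reading chosen because it reproduces sum(23010) = 24 and places the listed code words in residues {12, 13, 14}. Second, a string of length n+1 is taken to be a code word exactly when its score lies in R. All function names are hypothetical, and digits are encoded as characters, so the sketch only handles a ≤ 10.

```python
from itertools import product

def weighted_sum(s):
    """Digits of s weighted by 1-based position from the right (assumed form)."""
    return sum(int(ch) * (len(s) - i) for i, ch in enumerate(s))

def score(s, n, a):
    """sum(s) modulo (n + 1)(a - 1)."""
    return weighted_sum(s) % ((n + 1) * (a - 1))

def one_insertions(s, a):
    """All strings obtained from s by inserting a single letter."""
    return {s[:p] + str(x) + s[p:] for p in range(len(s) + 1) for x in range(a)}

def is_insertion1_code(residues, n, a):
    """True if every string of length n reaches a code word by one insertion."""
    letters = "".join(str(x) for x in range(a))
    return all(
        any(score(w, n, a) in residues for w in one_insertions(s, a))
        for s in map("".join, product(letters, repeat=n))
    )

n, a = 5, 4
residues = {12, 13, 14}
listed = ["003000", "303000", "003001", "003002", "200000"]
listed_scores = [score(w, n, a) for w in listed]
covers = is_insertion1_code(residues, n, a)
```

Under these assumptions the brute-force check succeeds for every window of a-1 = 3 consecutive residues. The reason is that inserting a 0 or an a-1 at successive positions moves the sum through a full cycle of 18 in steps of at most 3, so no window of 3 residues can be skipped.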

  24. Edit Distance Codes

  25. Summary • Fuzzy Joins for Hamming and Edit Distance in One-round MR • Anchor Points Algorithm • Covering Codes • Flexible parallelism • Better communication cost than naive • Explicit construction of Hamming distance covering codes • Explicit Construction of Edit distance covering codes

  26. Open Questions • Fuzzy Joins in MR • What is the minimum communication for a given per-reducer memory for one-round MR algorithms? • The answer is known only for Hamming distance 1 • How about multi-round MR algorithms? • Covering Codes • Are there smaller codes? • Can we construct smaller codes explicitly? • What is the size of the smallest codes?

  27. Related Work • Fuzzy Joins in MR • Fuzzy Joins Using MapReduce, Afrati et al., ICDE 2012 • Document Similarity Self-Join with MapReduce, Baraglia et al., ICDM 2010 • Efficient Parallel Set-Similarity Joins Using MapReduce, Vernica et al., SIGMOD 2010 • Efficient Similarity Joins for Near Duplicate Detection, Xiao et al., WWW 2008 • Covering Codes • Covering Codes, G. Cohen et al. • On Asymmetric Coverings and Covering Numbers, Applegate et al., Comb. Designs 2003 • Asymmetric Binary Covering Codes, Cooper et al., Comb. Theory 2002

  28. Questions?
