Anchor Points Algorithms for Hamming and Edit Distance
Foto Afrati (National Technical University of Athens)
Anish Das Sarma (ClearList Inc.)
Anand Rajaraman (Cambrian Ventures)
Pokey Rule (Stanford University)
Semih Salihoglu (Stanford University)
Jeff Ullman (Stanford University)

Fuzzy Joins
• Input: set of records R
• Output: pairs <rec_i, rec_j> such that dist(rec_i, rec_j) ≤ d
• Example applications: entity resolution, clustering, collaborative filtering

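As a baseline for what the join must return, the definition above can be evaluated by brute force. This is a minimal sketch, assuming Hamming distance on equal-length bit strings and input small enough to fit in memory; the names `hamming` and `fuzzy_join` are illustrative and do not come from the slides.

```python
from itertools import combinations


def hamming(u: str, v: str) -> int:
    """Number of positions where two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(u, v))


def fuzzy_join(records, d, dist=hamming):
    """Return every unordered pair of records at distance <= d (quadratic time)."""
    return [(u, v) for u, v in combinations(records, 2) if dist(u, v) <= d]


if __name__ == "__main__":
    print(fuzzy_join(["00000", "00001", "00011", "11111"], d=1))
```

The quadratic scan is only a reference point for checking other algorithms on small inputs.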
Two Specific Distance Measures
• 1. Hamming Distance
  • Input: set R of bit strings of length n
• 2. Edit Distance
  • Input: set R of strings of length n over alphabet A

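The two measures can be sketched as follows. The slides do not say which edit operations are allowed; this sketch assumes insertions and deletions only (so a substitution costs 2), which matches the later insertion-1 and deletion-1 codes but remains an assumption. The function names are illustrative.

```python
def hamming_distance(u: str, v: str) -> int:
    """Number of differing positions; defined only for equal lengths."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(u, v))


def edit_distance(s: str, t: str) -> int:
    """Insertion/deletion-only edit distance (an assumption, see lead-in).

    With only insertions and deletions, the distance equals
    len(s) + len(t) - 2 * LCS(s, t), where LCS is the length of the
    longest common subsequence.
    """
    prev = [0] * (len(t) + 1)  # LCS row for the previous prefix of s
    for a in s:
        cur = [0]
        for j, b in enumerate(t, 1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return len(s) + len(t) - 2 * prev[-1]


if __name__ == "__main__":
    print(hamming_distance("11000", "00000"))
    print(edit_distance("abc", "adc"))
```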
Fuzzy Joins in One-Round MapReduce
• [Figure: a Map phase followed by a Reduce phase, annotated with the two cost measures: communication cost and per-reducer memory cost]

Communication Cost vs Per-Reducer Memory
• [Figure: communication cost plotted against per-reducer memory for Ball-Hashing, Splitting, Anchor Points, and Grouping (naïve), with |R| = 2^n]

Outline
• Anchor Points Algorithm
• Covering Code
• Explicit Construction of Hamming Distance Covering Codes
• Explicit Construction of Edit Distance Covering Codes

Covering Code
• Given: a set R of strings of length n, and a radius k
• Definition: C is an <n, k> covering code if for each s ∈ R there is a c ∈ C such that dist(c, s) ≤ k

Example Covering Code
• Example: Hamming distance, n = 5, k = 2
• C = {00000, 11111}: every 5-bit string is within distance 2 of 00000 or 11111

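The covering-code definition can be checked exhaustively for small n. This is a minimal sketch, assuming Hamming distance and R equal to all n-bit strings; `is_covering_code` is an illustrative helper name, not something defined in the slides.

```python
from itertools import product


def hamming(u: str, v: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def is_covering_code(code, n: int, k: int) -> bool:
    """True if every n-bit string lies within distance k of some code word."""
    all_strings = ("".join(bits) for bits in product("01", repeat=n))
    return all(any(hamming(s, c) <= k for c in code) for s in all_strings)


if __name__ == "__main__":
    print(is_covering_code({"00000", "11111"}, n=5, k=2))
    print(is_covering_code({"00000", "11111"}, n=5, k=1))
```

For n = 5 and k = 2 the two-word code {00000, 11111} passes, because a string with at most two 1s is close to 00000 and any other string is close to 11111.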
Anchor Points Algorithm (1)
• Let C be an <n, k> covering code (e.g., n = 5, k = 2)
• Reduce: one reducer for each code word (e.g., r_00000 and r_11111)
• Map: send each string s to every code word at distance ≤ k + d/2 (e.g., d = 2 => 2 + 2/2 = 3)

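Anchor Points Algorithm (2)
• Triangle inequality: let u and v be strings with dist(u, v) ≤ d
• Pick a string w with dist(u, w) ≤ d/2 and dist(w, v) ≤ d/2
• Some code word c has dist(w, c) ≤ k
• Then dist(u, c) ≤ k + d/2 and dist(v, c) ≤ k + d/2, so u and v meet at the reducer for c
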
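The map and reduce steps can be simulated in a single process to see the triangle-inequality argument at work. This is a sketch under stated assumptions: Hamming distance, an even threshold d, a code that covers every n-bit string, and a dictionary standing in for the shuffle. The function names are illustrative and not from the slides.

```python
from collections import defaultdict
from itertools import combinations, product


def hamming(u: str, v: str) -> int:
    return sum(a != b for a, b in zip(u, v))


def anchor_points_join(records, code, k: int, d: int):
    """Simulate one MapReduce round; returns the set of pairs at distance <= d."""
    radius = k + d // 2  # assumes d is even

    # Map: send each record to every code word within k + d/2.
    reducers = defaultdict(list)
    for s in records:
        for c in code:
            if hamming(s, c) <= radius:
                reducers[c].append(s)

    # Reduce: compare all pairs that arrived at the same code word.
    # A pair may meet at several reducers, so a set removes duplicates.
    result = set()
    for group in reducers.values():
        for u, v in combinations(group, 2):
            if hamming(u, v) <= d:
                result.add((min(u, v), max(u, v)))
    return result


def brute_force_join(records, d: int):
    return {(min(u, v), max(u, v))
            for u, v in combinations(records, 2) if hamming(u, v) <= d}


if __name__ == "__main__":
    R = ["".join(bits) for bits in product("01", repeat=5)]
    pairs = anchor_points_join(R, ["00000", "11111"], k=2, d=2)
    print(len(pairs), pairs == brute_force_join(R, 2))
```

Deduplicating with a set is the simplest choice for a sketch; a real job would need a rule for which reducer emits a pair that meets at more than one code word.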
Cost of Anchor Points Algorithm
• [Figure: the reducer for code word c receives the strings within distance k + d/2 of c]
• B(n, r): size of the ball of radius r
• Per-reducer memory: B(n, k + d/2)
• Communication: |C| · B(n, k + d/2)

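For Hamming distance the ball size is a sum of binomial coefficients, so both costs are easy to evaluate. This is a minimal sketch, assuming R is the set of all n-bit strings and d is even; `ball_size` and `anchor_points_cost` are illustrative names.

```python
from math import comb


def ball_size(n: int, r: int) -> int:
    """B(n, r): number of n-bit strings within Hamming distance r of a fixed string."""
    return sum(comb(n, i) for i in range(min(r, n) + 1))


def anchor_points_cost(n: int, k: int, d: int, code_size: int):
    """Return (per-reducer memory, communication) as given on the slide."""
    memory = ball_size(n, k + d // 2)  # assumes d is even
    return memory, code_size * memory


if __name__ == "__main__":
    # The running example: n = 5, k = 2, d = 2, C = {00000, 11111}.
    print(anchor_points_cost(n=5, k=2, d=2, code_size=2))
```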
Communication Cost vs Per-Reducer Memory
• [Figure: communication cost plotted against per-reducer memory for Ball-Hashing, Splitting, Anchor Points, and Grouping (naïve), with |R| = 2^n; points on the Anchor Points curve are labeled k = 0, k = 1, k = 2, …, k = n]

Some Known Hamming Distance Codes
• Perfect <n, k> code (i.e., smallest possible): size 2^n / B(n, k), e.g., Hamming codes
• For any k: codes of size n · 2^n / B(n, k) exist => not perfect
• Problem: no explicit construction

Cross Product Method (Explicit HD <n, k> Codes)
• Start with an <n/t, k/t> code D
• Let C = D x D x … x D (t times)
• Claim: C is an <n, k> covering code
• Proof: split s = s1 s2 s3 … st into t pieces of length n/t; each piece si is within k/t of some di ∈ D, so c = d1 d2 d3 … dt satisfies dist(s, c) ≤ t · (k/t) = k

Example of Cross Product Method
• n = 10, k = 4, t = 2 => use a <5, 2> code D
• D = {00000, 11111}
• C = {00000--00000, 00000--11111, 11111--00000, 11111--11111}
• 1100011100 = 11000--11100 is within 2 + 2 = 4 of 00000--11111
• 1110000001 = 11100--00001 is within 2 + 1 = 3 of 11111--00000

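The cross product construction and its covering claim can be checked directly on the n = 10, k = 4, t = 2 example. This is a minimal sketch, assuming Hamming distance; `cross_product_code` and `nearest_code_word` are illustrative names.

```python
from itertools import product


def hamming(u: str, v: str) -> int:
    return sum(a != b for a, b in zip(u, v))


def cross_product_code(D, t: int):
    """C = D x D x ... x D (t times), as concatenated strings."""
    return {"".join(parts) for parts in product(sorted(D), repeat=t)}


def nearest_code_word(s: str, code):
    """Closest code word to s; ties broken lexicographically."""
    return min(code, key=lambda c: (hamming(s, c), c))


if __name__ == "__main__":
    C = cross_product_code({"00000", "11111"}, t=2)
    for s in ("1100011100", "1110000001"):
        c = nearest_code_word(s, C)
        print(s, c, hamming(s, c))
    # Exhaustive check that C is a <10, 4> covering code.
    strings = ("".join(bits) for bits in product("01", repeat=10))
    print(all(hamming(s, nearest_code_word(s, C)) <= 4 for s in strings))
```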
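Size of Cross Product Codes: |D|^t
• Assume D is perfect (e.g., a Hamming code), so |C| = |D|^t = 2^n / B(n/t, k/t)^t
• Perfect <n, k> code: 2^n / B(n, k)
• Example: k = 2, t = 2: 2^n / (n/2 + 1)^2 vs 2^n / (1 + n + n(n-1)/2)
• For large n, small t => same asymptotic size
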
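The size comparison for k = 2, t = 2 can be worked out numerically. This sketch assumes D is a binary Hamming code of length m = 2^r - 1 (a perfect <m, 1> code of size 2^m / (m + 1)) and n = 2m, and compares |D|^2 with the sphere-covering bound 2^n / B(n, 2); a perfect <n, 2> code of that size need not exist. The name `size_ratio` is illustrative.

```python
from fractions import Fraction
from math import comb


def ball_size(n: int, r: int) -> int:
    """B(n, r) for Hamming distance."""
    return sum(comb(n, i) for i in range(min(r, n) + 1))


def size_ratio(r: int) -> Fraction:
    """|D x D| divided by 2^n / B(n, 2), for D the Hamming code of length 2^r - 1."""
    m = 2 ** r - 1                      # length of each half
    n = 2 * m                           # total string length
    cross = Fraction(2 ** m, m + 1) ** 2
    bound = Fraction(2 ** n, ball_size(n, 2))
    return cross / bound


if __name__ == "__main__":
    for r in (3, 4, 6, 10):
        print(r, float(size_ratio(r)))
```

The ratio simplifies to B(n, 2) / (m + 1)^2 = (2m^2 + m + 1) / (m + 1)^2, which stays below 2 and approaches it as m grows, so under these assumptions the cross product code is within a constant factor of the bound.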
Edit Distance Fuzzy Joins
• Input: strings of length n over alphabet A (i.e., |A|^n strings)
• Covering codes algorithm works in the same way:
  • If C is an <n, k> edit distance code
  • Send s to all code words at distance ≤ k + d/2

Differences with Hamming Distance
• Length of code words might be different
  • E.g. 1 insertion, |c| = n+1 => insertion-1 code
  • E.g. 1 deletion, |c| = n-1 => deletion-1 code
• Different code words might have different ball sizes
• No known perfect codes or explicit construction
• [Figure: the code word aaaaa…a (length n+1) covers only aaaa…a (length n), while ababa…a (length n+1) covers aaba…a, abba…a, baba…a, abaa…a, … (all of length n)]

Insertion-1 Codes
• Let n = 5, |A| = a = 4; code words are of length 6
• Letters as integers from 0 to (a-1): e.g. 0230, 1124, …
• Let s_i be the ith digit of s, with positions numbered from the right starting at 1
• sum(s) = Σ_i i · s_i
• score(s) = sum(s) % ((n+1)(a-1)) (e.g., modulus 6 * 3 = 18)
• R = any a-1 consecutive residues, e.g. {0,1,2}, {12,13,14}, {16,17,0}
• C = strings of length n+1 whose score is in R, e.g. {003000, 303000, 003001, 003002, 200000, …}
• |C|: **a factor of a worse than the best possible**

Example: s = 23010, sum(s) = 24, score(s) = 6
• Single-insertion candidates shown: 230100, 230010, 203010, 230010, 023010, 323010, 230310, 230130, 233010, 233010
• [Figure: each candidate is marked X or Y]

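The insertion-1 construction can be checked by enumeration for the n = 5, a = 4 example. This sketch assumes digit positions are weighted from the right starting at 1, which is the reading that reproduces sum(23010) = 24 and places the listed code words in the residue set {12, 13, 14}. All function names are illustrative.

```python
from itertools import product


def weighted_sum(s: str) -> int:
    """Sum of digit * position, positions counted from the right starting at 1."""
    return sum(int(ch) * (len(s) - i) for i, ch in enumerate(s))


def in_code(c: str, n: int, a: int, residues) -> bool:
    """Code word test: length n+1 and score in the chosen residue set."""
    return len(c) == n + 1 and weighted_sum(c) % ((n + 1) * (a - 1)) in residues


def single_insertions(s: str, a: int):
    """All strings obtained from s by inserting one letter from 0..a-1."""
    return {s[:p] + str(x) + s[p:] for p in range(len(s) + 1) for x in range(a)}


def covering_code_words(s: str, n: int, a: int, residues):
    """Code words reachable from s by exactly one insertion."""
    return sorted(c for c in single_insertions(s, a) if in_code(c, n, a, residues))


if __name__ == "__main__":
    n, a = 5, 4
    print(weighted_sum("23010"), weighted_sum("23010") % ((n + 1) * (a - 1)))
    print(covering_code_words("23010", n, a, {12, 13, 14}))
    # Every length-5 string should reach at least one code word.
    strings = ("".join(d) for d in product("0123", repeat=n))
    print(all(covering_code_words(s, n, a, {12, 13, 14}) for s in strings))
```

Any window of a-1 consecutive residues works because inserting a 0 or an a-1 at successive positions changes the sum in steps of at most a-1 across a full cycle of the modulus.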
Summary
• Fuzzy joins for Hamming and edit distance in one-round MR
• Anchor Points Algorithm
• Covering codes
• Flexible parallelism
• Better communication cost than naïve
• Explicit construction of Hamming distance covering codes
• Explicit construction of edit distance covering codes

Open Questions
• Fuzzy Joins in MR
  • Minimum communication for a given per-reducer memory for one-round MR algorithms?
  • The answer is known only for Hamming distance 1
  • How about multi-round MR algorithms?
• Covering Codes
  • Are there smaller codes?
  • Can we construct smaller codes explicitly?
  • What is the size of the smallest codes?

Related Work
• Fuzzy Joins in MR
  • Fuzzy Joins Using MapReduce, Afrati et al., ICDE 2012
  • Document Similarity Self-Join with MapReduce, Baraglia et al., ICDM 2010
  • Efficient Parallel Set-Similarity Joins Using MapReduce, Vernica et al., SIGMOD 2010
  • Efficient Similarity Joins for Near Duplicate Detection, Xiao et al., WWW 2008
• Covering Codes
  • Covering Codes, Cohen et al.
  • On Asymmetric Coverings and Covering Numbers, Applegate et al., Comb. Designs 2003
  • Asymmetric Binary Covering Codes, Cooper et al., Comb. Theory 2002