HmSearch: An Efficient Hamming Distance Query Processing Algorithm

Xiaoyang Zhang1, Jianbin Qin1, Wei Wang1, Yifang Sun1, Jiaheng Lu2. 1 University of New South Wales, Australia; 2 Renmin University of China, China

Motivation • Identify near-duplicate webpages • Chemical data • Both map into binary codes (e.g. simhash), where similar objects get similar codes: 0012345679ABCDEF and 1012345679ABCDEF are similar; 012345679ABCDEF0 and 012345679ABCDEF1 are similar

More Applications • Iris recognition • Image retrieval • C2LSH

Outline • Problem Definition • Framework • HmSearch • Partitioning Scheme • Signature Generation • Enhanced Filtering • Hierarchical Filtering and Verification • Dimension Rearrangement • Conclusion • Experiment

Hamming Distance Query • Hamming distance: the number of positions at which the corresponding symbols differ for two equal-length vectors. Example: q = ABCD, v = ACCD, hd(q, v) = 1 • Hamming distance query: given a database V of vectors, a query vector Q (all vectors have the same dimensionality N) and a Hamming distance threshold k, find all vi in V such that hd(vi, Q) <= k
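The definition above can be sketched as a brute-force baseline. This is only an illustration of the problem statement, not the HmSearch algorithm; the function names and the toy database are assumptions made for the example.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return sum(x != y for x, y in zip(a, b))


def hamming_query(database, q, k):
    """Linear scan: return every v in the database with hd(v, q) <= k."""
    return [v for v in database if hamming_distance(v, q) <= k]


# Toy database (hypothetical data); the query is the slide's q = ABCD.
database = ["ACCD", "ABCD", "XYZW", "ABDC"]
result = hamming_query(database, "ABCD", 1)
print(result)
```

A linear scan touches every vector for every query, which is the cost the partitioning and signature techniques on the following slides aim to avoid.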

Basic Idea • General framework: • We can answer k=1 queries efficiently (shown later) • So we transform a larger-k problem into several small k=1 problems by partitioning • We filter by looking at each partition • We verify at the end • Example: split q and v into a left and a right part. If hd(q, v) <= 1, then hd(qleft, vleft) = 0 or hd(qright, vright) = 0, so for k=1 candidates can be filtered by looking for a part that is the same in q and v

Framework (flow diagram) • Data side: Dimension Rearrangement, then Partitioning (general partitioning scheme), then Generating Signatures (1-variants and 1-deletion variants), then Indexing, which produces the Index • Query side: Partitioning, then Generating Signatures, then Filtering against the Index (Candidates0), then Enhanced Filtering (Candidates1), then Hierarchical Filtering and Verification, which produces the Results

Partitioning • Lower bound for a partition strategy: given q and v such that hd(q, v) <= k, if the N dimensions are divided into κ parts, there should be at least m = κ − ⌊k/(k'+1)⌋ partitions such that hd(qpart, vpart) <= k' • In our algorithm, we choose κ = ⌊(k+3)/2⌋ • When k = 0 or 1: m = 1, hd = 0 • When k >= 2: hd <= 1, with m = 1 when k is even and m = 2 when k is odd
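The partition-level filter can be sketched as follows. The sketch assumes κ = ⌊(k+3)/2⌋ contiguous partitions of near-equal length, which is consistent with the m values listed on the slide (m = 1 for even k, m = 2 for odd k, with at most one error per partition). The helper names `num_partitions`, `split`, `partition_errors` and `basic_filter` are hypothetical.

```python
from itertools import product


def num_partitions(k):
    """Assumed partition count: kappa = floor((k + 3) / 2)."""
    return (k + 3) // 2


def split(vec, parts):
    """Split vec into `parts` contiguous chunks of near-equal length."""
    n = len(vec)
    bounds = [i * n // parts for i in range(parts + 1)]
    return [vec[bounds[i]:bounds[i + 1]] for i in range(parts)]


def partition_errors(q, v, k):
    """Hamming distance between q and v inside each partition."""
    kappa = num_partitions(k)
    return [sum(a != b for a, b in zip(qp, vp))
            for qp, vp in zip(split(q, kappa), split(v, kappa))]


def basic_filter(q, v, k):
    """True if v survives the partition-level filter for threshold k."""
    errors = partition_errors(q, v, k)
    if k <= 1:
        # m = 1, hd = 0: some partition must match exactly.
        return sum(e == 0 for e in errors) >= 1
    m = 1 if k % 2 == 0 else 2
    return sum(e <= 1 for e in errors) >= m


# All binary vectors of length 6 that survive the filter for k = 2.
q = (0,) * 6
survivors = [v for v in product((0, 1), repeat=6) if basic_filter(q, v, 2)]
```

The filter never drops a true result, but it may keep vectors whose full distance exceeds k, which is why a verification step follows.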

Signature Generation • 1-variants: substitute each dimension with each domain value, one at a time (plus the vector itself). Example: v=[1, 2, 3] and Σ (domain) =[1, 2, 3], 1-val(v)=[1, 2, 3], [2, 2, 3], [3, 2, 3], [1, 1, 3], [1, 3, 3], [1, 2, 1], [1, 2, 2] • 1-deletion-variants: substitute each dimension with '#', one at a time. Example: v=[1, 2, 3], 1-del-val(v)=[#, 2, 3], [1, #, 3], [1, 2, #] • Either we index all 1-val(v) and, when q comes in, search q in the index • Or we index all 1-del-val(v) and, when q comes in, generate 1-del-val(q) and search all 1-del-val(q) in the index
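Both signature types can be sketched in a few lines. The example below builds an index over 1-deletion-variants and answers a k=1 query against it; the dictionary-of-sets index layout and the function names are assumptions for illustration, not the paper's data structures.

```python
from collections import defaultdict

WILDCARD = "#"


def one_variants(v, domain):
    """v itself plus every vector obtained by changing one dimension."""
    out = {tuple(v)}
    for i in range(len(v)):
        for s in domain:
            out.add(tuple(v[:i]) + (s,) + tuple(v[i + 1:]))
    return out


def one_deletion_variants(v):
    """Every vector obtained by replacing one dimension with '#'."""
    return {tuple(v[:i]) + (WILDCARD,) + tuple(v[i + 1:])
            for i in range(len(v))}


def build_deletion_index(vectors):
    """Map each 1-deletion-variant to the ids of the vectors producing it."""
    index = defaultdict(set)
    for vid, v in enumerate(vectors):
        for sig in one_deletion_variants(v):
            index[sig].add(vid)
    return index


def query_k1(index, q):
    """Ids of indexed vectors within Hamming distance 1 of q."""
    hits = set()
    for sig in one_deletion_variants(q):
        hits |= index.get(sig, set())
    return hits


vectors = [(1, 2, 3), (1, 3, 3), (3, 2, 1)]  # hypothetical data
index = build_deletion_index(vectors)
print(query_k1(index, (1, 1, 3)))
```

Two vectors share a 1-deletion-variant exactly when they differ in at most one position, so the lookup needs no further check for k=1. The trade-off between the two schemes is index size against query work: 1-variants put all the expansion on the index side, while 1-deletion-variants expand both the data and the query but do not depend on the domain size.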

Enhanced Filter (Even) • Based on the formula before: when k (k>=1) is even, m = 1 • However, we find that if k (k>=1) is even, v is qualified only in two situations: 1) m=1, where hd(vpart, qpart)=0 2) m=2, where hd(vpart, qpart)<=1 • Example: if k = 2, based on the formula before, a v with m=1 and hd(vpart, qpart)=1 passes the filter, so this v becomes a false positive • Using the enhanced filter, neither situation applies, so v is filtered

Enhanced Filter (Odd) • Based on the formula before: when k (k>=1) is odd, m = 2 • However, we find that if k (k>=1) is odd, v is qualified only in two situations: 1) m=2, where hd(vpart, qpart)<=1 and at least one of them = 0 2) m=3, where hd(vpart, qpart)<=1 • Example: if k = 3, based on the formula before, a v with m=2 and hd(vpart, qpart)=1 in both partitions passes the filter, so this v becomes a false positive • Using the enhanced filter, neither situation applies, so v is filtered
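The even and odd rules can be combined into one predicate over the per-partition Hamming distances. This is a minimal sketch that reads m as "at least m partitions" and assumes κ = ⌊(k+3)/2⌋ partitions; `enhanced_filter` and `has_false_negative` are hypothetical names.

```python
from itertools import product


def enhanced_filter(errors, k):
    """errors: per-partition Hamming distances between q and v.

    Returns True if v may still satisfy hd(q, v) <= k.
    """
    zeros = sum(e == 0 for e in errors)
    at_most_one = sum(e <= 1 for e in errors)
    if k % 2 == 0:
        # Even k: an exact-match partition, or two partitions with <= 1 error.
        return zeros >= 1 or at_most_one >= 2
    # Odd k: two partitions with <= 1 error, one of them exact,
    # or three partitions with <= 1 error.
    return (at_most_one >= 2 and zeros >= 1) or at_most_one >= 3


def has_false_negative(k):
    """Exhaustively check whether the filter rejects any true result."""
    kappa = (k + 3) // 2  # assumed number of partitions
    return any(sum(e) <= k and not enhanced_filter(e, k)
               for e in product(range(k + 1), repeat=kappa))


# The two slide examples: both pass the basic filter but are rejected here.
print(enhanced_filter([1, 2], 2))     # k = 2, one partition with 1 error
print(enhanced_filter([1, 1, 2], 3))  # k = 3, two partitions with 1 error
```

The exhaustive check is a quick way to convince oneself that tightening the filter does not lose any true result for small k.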

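Hierarchical Filtering and Verification • Example: v= [5, 0, 3, 6], q= [5, 2, 2, 5], |Σ|=8, k=1 • Comparing dimension by dimension takes 4 comparisons to calculate hd(v,q)=3 • We can use binary operations to do hierarchical filtering and verification: XOR the bit slices of v and q one significant bit at a time and OR the resulting diffs together • 3rd significant bit: 1010 XOR 1001 = 0011, which already has two 1s, so hd(v, q)>=2 and v is filtered • Moreover, even if k=4, continuing gives the exact distance: 2nd significant bit 0011 XOR 0110 = 0101, OR gives 0111; 1st significant bit 1001 XOR 1001 = 0000, OR gives 0111, so hd(v,q)=3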

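Hierarchical Filtering and Verification • Example: v= [5, 0, 3, 6], q= [5, 2, 3, 5], k=1; cumdiff starts at 0000 and accumulates each diff with OR • 3rd significant bit: 1010 XOR 1011 = 0001, cumdiff = 0001, number of 1s in cumdiff = 1 <= 1, continue • 2nd significant bit: 0011 XOR 0110 = 0101, cumdiff = 0101, number of 1s in cumdiff = 2 > 1, filtered • The 1st significant bit slices (1001 and 1001) are never examined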
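The cumulative XOR/OR procedure can be sketched as below. The sketch packs one bit of every dimension into a Python integer and starts from the least significant bit, matching the order of the two examples; in a real index the bit slices of the data vectors would presumably be precomputed. `bit_slice` and `hierarchical_hd` are hypothetical names.

```python
def bit_slice(vec, bit):
    """Pack bit `bit` (0 = least significant) of every dimension into one int."""
    word = 0
    for x in vec:
        word = (word << 1) | ((x >> bit) & 1)
    return word


def hierarchical_hd(v, q, bits, k):
    """Return hd(v, q) if it is <= k, otherwise None (filtered early)."""
    cumdiff = 0
    for bit in range(bits):
        # A dimension differs if it differs in any bit, hence the OR.
        cumdiff |= bit_slice(v, bit) ^ bit_slice(q, bit)
        if bin(cumdiff).count("1") > k:
            return None  # already more than k differing dimensions
    return bin(cumdiff).count("1")


v = [5, 0, 3, 6]
print(hierarchical_hd(v, [5, 2, 2, 5], bits=3, k=1))  # filtered
print(hierarchical_hd(v, [5, 2, 2, 5], bits=3, k=4))  # exact distance
print(hierarchical_hd(v, [5, 2, 3, 5], bits=3, k=1))  # filtered
```

Because the number of 1s in cumdiff can only grow, the loop can stop as soon as the count exceeds k, and the final count is the exact Hamming distance when no early exit happens.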

Impact of Data Skewness • Given k=2, then m = 1 and k'=1 • We propose to rearrange the dimension order and partition lengths to improve performance

Original order: all vectors are qualified
      Partition1 | Partition2
Dim   1  2  3    | 4  5  6
q     1  1  1    | 1  0  0
v1    1  1  1    | 0  0  0
v2    0  0  0    | 2  0  0
v3    2  0  2    | 0  0  0
v4    3  0  0    | 0  0  0

Rearranged order: only v1 is qualified
      Partition1 | Partition2
Dim   1  2  5    | 4  3  6
q     1  1  0    | 1  1  0
v1    1  1  0    | 0  1  0
v2    0  0  0    | 2  0  0
v3    2  0  0    | 0  2  0
v4    3  0  0    | 0  0  0

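Greedy Dimension Rearrangement • MaxFreq is the maximum frequency of any value in each dimension; for a partition, it is the maximum frequency of any combination of values on its dimensions • Our goal: minimize the global MaxFreq

Original order (MaxFreq for partition: Partition1 = 1, Partition2 = 3)
         Partition1 | Partition2
Dim      1  2  3    | 4  5  6
v1       1  1  1    | 0  0  0
v2       0  0  0    | 2  0  0
v3       2  0  2    | 0  0  0
v4       3  0  0    | 0  0  0
MaxFreq  1  3  2    | 3  4  4

Rearranged order (MaxFreq for partition: Partition1 = 1, Partition2 = 1), which achieves the goal
         Partition1 | Partition2
Dim      5  1  2    | 6  3  4
v1       0  1  1    | 0  1  0
v2       0  0  0    | 0  0  2
v3       0  2  0    | 0  2  0
v4       0  3  0    | 0  0  0
MaxFreq  4  1  3    | 4  2  3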
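One possible greedy heuristic for this goal is sketched below. It is an illustrative assumption, not necessarily the procedure used by the authors: it repeatedly picks the open partition with the highest MaxFreq and gives it the unassigned dimension that lowers that MaxFreq the most. Dimensions are 0-indexed in the code, and `max_freq` and `greedy_rearrange` are hypothetical names.

```python
from collections import Counter
from math import ceil


def max_freq(vectors, dims):
    """Largest number of vectors sharing the same values on `dims`."""
    counts = Counter(tuple(v[d] for d in dims) for v in vectors)
    return max(counts.values())


def greedy_rearrange(vectors, num_parts):
    """Assign dimensions to partitions, trying to keep every MaxFreq low."""
    n_dims = len(vectors[0])
    capacity = ceil(n_dims / num_parts)
    parts = [[] for _ in range(num_parts)]
    unassigned = list(range(n_dims))
    while unassigned:
        open_parts = [p for p in parts if len(p) < capacity]
        # Work on the partition that is currently the least selective.
        target = max(open_parts, key=lambda p: max_freq(vectors, p))
        # Add the dimension that makes it the most selective.
        best = min(unassigned, key=lambda d: max_freq(vectors, target + [d]))
        target.append(best)
        unassigned.remove(best)
    return parts


# The four data vectors from the slide (v1..v4).
vectors = [
    (1, 1, 1, 0, 0, 0),
    (0, 0, 0, 2, 0, 0),
    (2, 0, 2, 0, 0, 0),
    (3, 0, 0, 0, 0, 0),
]
original = [[0, 1, 2], [3, 4, 5]]
rearranged = greedy_rearrange(vectors, 2)
print(max(max_freq(vectors, p) for p in original))
print(max(max_freq(vectors, p) for p in rearranged))
```

On this toy data the heuristic lowers the global MaxFreq from 3 to 1, so no partition value is shared by more than one vector and each index lookup returns fewer candidates.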

Conclusion • General Partition Scheme • 1-variants and 1-deletion-variants • Techniques that help boost the performance • Enhanced Filtering • Hierarchical Filtering and Verification • Dimension Rearrangement

Experiment Settings • Environment • Intel Xeon X3330 2.664GHz CPU, 4GB RAM • Debian 5.0.6 • AMD Opteron™ 8378 2.4GHz CPU, 96GB RAM (for PubChem) • Ubuntu/Linaro 4.6.4-1ubuntu5 • All compiled with GCC 4.1.2 with -O3 • Dataset

Experiment Settings • Terms • EF, Enhanced Filtering • HB, Hierarchical Binary Filter • RD, Rearranging Dimensions • Our algorithms • HSD, HSV, our proposed algorithms, the former using 1-deletion-variants as signatures and the latter using 1-variants as signatures • HSD-nEB, HSV-nEB, variations that remove EF and HB • HSD-nB, HSV-nB, variations that remove HB • HSD-nR, HSV-nR, variations that remove RD • Baseline algorithm • ScanCount (Li et al., ICDE08) • State-of-the-art algorithms • Google (Manku et al., WWW07) • HEngine (Liu et al., ICDE11)

Query time HSV has the best performance

Candidate Size HSV has the smallest candidate size

Effect of EF and HB EF and HB help improve the performance

Effect of RD RD boosts the performance on the PubChem data

Index Size HSV and HSD have a larger index size

Thank you
