Three Essential Techniques for Similar Documents

Three Essential Techniques for Similar Documents • Shingling : convert documents, emails, etc., to sets. • Minhashing : convert large sets to short signatures, while preserving similarity. • Locality-sensitive hashing : focus on pairs of signatures likely to be similar.

Candidate pairs : those pairs of signatures that we need to test for similarity. Locality- sensitive Hashing Minhash- ing The set of strings of length k that appear in the document Signatures : short integer vectors that represent the sets, and reflect their similarity The Big Picture Shingling Docu- ment

Minhashing

Four Types of Rows • Given columns C1 and C2, rows may be classified as: C1 C2 a 1 1 b 1 0 c 0 1 d 0 0 • Also, a = # rows of type a , etc. • Note Sim (C1, C2) = a /(a +b +c ).

Min-hashing • Define a hash function h as follows: • Permute the rows of the matrix randomly • Important: same permutation for all the vectors! • Let C be a column (= a document) • h(C) = the number of the first (in the permuted order) row in which column C has 1

Minhash Signatures • Use several (e.g., 100) independent hash functions to create a signature.

Input matrix Signature matrix M 1 4 1 0 1 0 2 1 2 1 1 0 0 1 3 2 2 1 4 1 0 1 0 1 7 1 1 2 1 2 0 1 0 1 6 3 0 1 0 1 2 6 5 7 1 0 1 0 4 5 1 0 1 0 Minhashing Example

Surprising Property • The probability (over all permutations of the rows) that h (C1) = h (C2) is the same as Sim (C1, C2). • Both are a /(a +b +c )! • Why? • Look down the permuted columns C1 and C2 until we see a 1. • If it’s a type-a row, then h (C1) = h (C2). If a type-b or type-c row, then not.

Similarity for Signatures • The similarity of signatures is the fraction of the hash functions in which they agree. • Because of the minhash property, the similarity of columns is the same as the expected similarity of their signatures.

Input matrix Signature matrix M 1 4 1 0 1 0 2 1 2 1 1 0 0 1 3 2 2 1 4 1 0 1 0 1 7 1 1 2 1 2 0 1 0 1 6 3 0 1 0 1 2 6 5 7 1 0 1 0 4 5 1 0 1 0 Min Hashing – Example Similarities: 1-3 2-4 1-2 3-4 Col/Col 0.75 0.75 0 0 Sig/Sig 0.67 1.00 0 0

Implementation – (1) • Suppose 1 billion rows. • Hard to pick a random permutation from 1…billion. • Representing a random permutation requires 1 billion entries. • Accessing rows in permuted order leads to thrashing.

Implementation – (2) • Simulate the effect of random permutation by a random hash function • A good approximation to permuting rows: pick 100 (?) hash functions. • h1 , h2 ,… • For rows r and s, if hi (r ) < hi (s), then r appears before s in permutation i. • We will use the same name for the hash function and the corresponding min-hash function

Implementation –(3) • For each column c and each hash function hi , keep a “slot” M (i, c ). • Initialize to infinity • Intent: M (i, c ) will become the smallest value of hi (r ) for which column c has 1 in row r. • I.e., hi (r ) gives order of rows for i th permuation. • Sort the input matrix so it is ordered by rows • So can iterate by reading rows sequentially from disk

Implementation – (4) for each row r for each column c if c has 1 in row r for each hash function hido ifhi (r ) is a smaller value than M (i, c ) then M (i, c ) := hi (r );

Example Sig1 Sig2 h(1) = 1 1 - g(1) = 3 3 - Row C1 C2 1 1 0 2 0 1 3 1 1 4 1 0 5 0 1 h(2) = 2 1 2 g(2) = 0 3 0 h(3) = 3 1 2 g(3) = 2 2 0 h(4) = 4 1 2 g(4) = 4 2 0 h(x) = x mod 5 g(x) = 2x+1 mod 5 h(5) = 0 1 0 g(5) = 1 2 0

Implementation – (5) • Often, data is given by column, not row. • E.g., columns = documents, rows = shingles. • If so, sort matrix once so it is by row. • And always compute hi (r ) only once for each row.

Locality-Sensitive Hashing

Finding Similar Pairs • Suppose we have, in main memory, data representing a large number of objects. • May be the objects themselves . • May be signatures as in minhashing. • We want to compare each to each, finding those pairs that are sufficiently similar.

Checking All Pairs is Hard • While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns. • Example: 106 columns implies 5*1011 column-comparisons. • At 1 microsecond/comparison: 6 days.

Locality-Sensitive Hashing • General idea: Use a function f(x,y) that tells whether or not x and y is a candidate pair : a pair of elements whose similarity must be evaluated. • For minhash matrices: Hash columns to many buckets, and make elements of the same bucket candidate pairs.

Candidate Generation From Minhash Signatures • Pick a similarity threshold s, • e..g., s=0.8 • Goal: Find documents with Jaccard similarity at least s. • Columns c and d are a candidate pair if their signatures agree in at least fraction s of the rows. • I.e., M (i, c ) = M (i, d ) for at least fraction s values of i.

LSH for Minhash Signatures • Big idea: hash columns of signature matrix M several times. • Arrange that (only) similar columns are likely to hash to the same bucket. • Candidate pairs are those that hash at least once to the same bucket.

Partition Into Bands r rows per band b bands One signature Matrix M

Columns 2 and 6 are probably identical. Columns 6 and 7 are surely different. Buckets Matrix M b bands r rows

Partition into Bands – (2) • Divide matrix M into b bands of r rows. • For each band, hash its portion of each column to a hash table with k buckets. • Make k as large as possible. • Candidatecolumn pairs are those columns that hash to the same bucket for ≥ 1 band. • Tune b and r to catch most similar pairs, but few nonsimilar pairs.

Simplifying Assumption • There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band. • Hereafter, we assume that “same bucket” means “identical in that band.” • Assumption needed only to simplify analysis, not for correctness of algorithm

Example of Bands • 100 min-hash signatures/document • Suppose 100,000 columns • Therefore, signatures take 40Mb. • 5,000,000,000 pairs of signatures can take a while to compare. • Let’s choose b=20, r=5 • 20 bands, 5 signatures per band • Goal: find pairs of documents that are at least 80% similar

Suppose C1, C2 are 80% Similar • Probability C1, C2 identical in one particular band: (0.8)5 = 0.328. • Probability C1, C2 are not similar in any of the 20 bands: (1-0.328)20 = .00035 . • i.e., about 1/3000th of the 80%-similar column pairs are false negatives. • We would find 99.965% pairs of truly similar documents

Suppose C1, C2 Only 30% Similar • Probability C1, C2 identical in any one particular band: (0.3)5 = 0.00243 . • Probability C1, C2 identical in ≥ 1 of 20 bands: ≤ 20 * 0.0043 = 0.0486 . • In other words, approximately 4.86% pairs of docs with similarity 30% end up becoming candidate pairs • False positives

LSH Involves a Tradeoff • Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives. • Example: if we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up.

Probability = 1 if s > t No chance if s < t Analysis of LSH – What We Want Probability of sharing a bucket t Similarity s of two sets

What One Band of One Row Gives You Remember: probability of equal hash-values = similarity Probability of sharing a bucket t Similarity s of two sets

b Bands, r rows/band

No bands identical At least one band identical 1 - ( 1 - sr )b t ~ (1/b)1/r Some row of a band unequal All rows of a band are equal What b Bands of r Rows Gives You Probability of sharing a bucket t Similarity s of two sets

Example: b = 20; r = 5

LSH Summary • Tune to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures. • Check in main memory that candidate pairs really do have similar signatures. • Optional: In another pass through data, check that the remaining candidate pairs really represent similar sets .

Three Essential Techniques for Similar Documents • Shingling : convert documents, emails, etc., to sets. • Minhashing : convert large sets to short signatures, while preserving similarity. • Locality-sensitive hashing : focus on pairs of signatures likely to be similar.

Key Idea Behind LSH • Given documents (i.e., shingle sets) D1 and D2 • If we can find a hash function h such that: • if sim(D1,D2) is high, then with high probability h(D1) = h(D2) • if sim(D1,D2) is low, then with high probability h(D1) !=h(D2) • Then we could hash documents into buckets, and expect that “most” pairs of near duplicate documents would hash into the same bucket • Compare pairs of docs in each bucket to see if they are really near-duplicates

Minhashing • Clearly, the hash function depends on the similarity metric • Not all similarity metrics have a suitable hash function • Fortunately, there is a suitable hash function for Jaccard similarity • Min-hashing

Theory of LSH

Theory of LSH • We have used LSH to find similar documents • In reality, columns in large sparse matrices with high Jaccard similarity • e.g., customer/item purchase histories • Can we use LSH for other distance measures? • e.g., Euclidean distances, Cosine distance • Let’s generalize what we’ve learned!

Families of Hash Functions • A “hash function” is any function that takes two elements and says whether or not they are “equal” (really, are candidates for similarity checking). • Shorthand: h(x) = h(y) means “h says x and y are equal.” • A family of hash functions is any set of functions as in (1). • An example of a family of hash functions • For min-hash signatures, we got a minhash function for each permutation of rows • A (large) set of related hash functions generated by some mechanism • We should be able to efficiently pick a hash function at random from such a family

LS Families of Hash Functions • Suppose we have a space S of points with a distance measure d. • A family H of hash functions is said to be (d1,d2,p1,p2)-sensitive if for any x and y in S : • If d(x,y) < d1, then prob. over all h in H, that h(x) = h(y) is at least p1. • If d(x,y) > d2, then prob. over all h in H, that h(x) = h(y) is at most p2.

LS Families: Illustration High probability; at least p1 Low probability; at most p2 ??? d2 d1

Example: LS Family • Let S = sets, d = Jaccard distance, H is formed from the minhash functions for all permutations. • Then Prob[h(x)=h(y)] = 1-d(x,y). • Restates theorem about Jaccard similarity and minhashing in terms of Jaccard distance rather than similarities.

Then probability that minhash values agree is > 2/3 If distance < 1/3 (so similarity > 2/3) Example: LS Family – (2) • Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.

Three Essential Techniques for Similar Documents