Three Essential Techniques for Similar Documents
740 likes | 767 Vues
Learn shingling, minhashing, and locality-sensitive hashing to compare and analyze similar documents efficiently. Understand how to convert documents to sets, create short signatures, and focus on likely similar pairs.
Three Essential Techniques for Similar Documents
E N D
Presentation Transcript
Three Essential Techniques for Similar Documents • Shingling : convert documents, emails, etc., to sets. • Minhashing : convert large sets to short signatures, while preserving similarity. • Locality-sensitive hashing : focus on pairs of signatures likely to be similar.
Candidate pairs : those pairs of signatures that we need to test for similarity. Locality- sensitive Hashing Minhash- ing The set of strings of length k that appear in the doc- ument Signatures : short integer vectors that represent the sets, and reflect their similarity The Big Picture Shingling Docu- ment
Four Types of Rows • Given columns C1 and C2, rows may be classified as: C1 C2 a 1 1 b 1 0 c 0 1 d 0 0 • Also, a = # rows of type a , etc. • Note Sim (C1, C2) = a /(a +b +c ).
Min-hashing • Define a hash function h as follows: • Permute the rows of the matrix randomly • Important: same permutation for all the vectors! • Let C be a column (= a document) • h(C) = the number of the first (in the permuted order) row in which column C has 1
Minhash Signatures • Use several (e.g., 100) independent hash functions to create a signature.
Input matrix Signature matrix M 1 4 1 0 1 0 2 1 2 1 1 0 0 1 3 2 2 1 4 1 0 1 0 1 7 1 1 2 1 2 0 1 0 1 6 3 0 1 0 1 2 6 5 7 1 0 1 0 4 5 1 0 1 0 Minhashing Example
Surprising Property • The probability (over all permutations of the rows) that h (C1) = h (C2) is the same as Sim (C1, C2). • Both are a /(a +b +c )! • Why? • Look down the permuted columns C1 and C2 until we see a 1. • If it’s a type-a row, then h (C1) = h (C2). If a type-b or type-c row, then not.
Similarity for Signatures • The similarity of signatures is the fraction of the hash functions in which they agree. • Because of the minhash property, the similarity of columns is the same as the expected similarity of their signatures.
Input matrix Signature matrix M 1 4 1 0 1 0 2 1 2 1 1 0 0 1 3 2 2 1 4 1 0 1 0 1 7 1 1 2 1 2 0 1 0 1 6 3 0 1 0 1 2 6 5 7 1 0 1 0 4 5 1 0 1 0 Min Hashing – Example Similarities: 1-3 2-4 1-2 3-4 Col/Col 0.75 0.75 0 0 Sig/Sig 0.67 1.00 0 0
Implementation – (1) • Suppose 1 billion rows. • Hard to pick a random permutation from 1…billion. • Representing a random permutation requires 1 billion entries. • Accessing rows in permuted order leads to thrashing.
Implementation – (2) • Simulate the effect of random permutation by a random hash function • A good approximation to permuting rows: pick 100 (?) hash functions. • h1 , h2 ,… • For rows r and s, if hi (r ) < hi (s), then r appears before s in permutation i. • We will use the same name for the hash function and the corresponding min-hash function
Implementation –(3) • For each column c and each hash function hi , keep a “slot” M (i, c ). • Initialize to infinity • Intent: M (i, c ) will become the smallest value of hi (r ) for which column c has 1 in row r. • I.e., hi (r ) gives order of rows for i th permuation. • Sort the input matrix so it is ordered by rows • So can iterate by reading rows sequentially from disk
Implementation – (4) for each row r for each column c if c has 1 in row r for each hash function hido ifhi (r ) is a smaller value than M (i, c ) then M (i, c ) := hi (r );
Example Sig1 Sig2 h(1) = 1 1 - g(1) = 3 3 - Row C1 C2 1 1 0 2 0 1 3 1 1 4 1 0 5 0 1 h(2) = 2 1 2 g(2) = 0 3 0 h(3) = 3 1 2 g(3) = 2 2 0 h(4) = 4 1 2 g(4) = 4 2 0 h(x) = x mod 5 g(x) = 2x+1 mod 5 h(5) = 0 1 0 g(5) = 1 2 0
Implementation – (5) • Often, data is given by column, not row. • E.g., columns = documents, rows = shingles. • If so, sort matrix once so it is by row. • And always compute hi (r ) only once for each row.
Finding Similar Pairs • Suppose we have, in main memory, data representing a large number of objects. • May be the objects themselves . • May be signatures as in minhashing. • We want to compare each to each, finding those pairs that are sufficiently similar.
Checking All Pairs is Hard • While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns. • Example: 106 columns implies 5*1011 column-comparisons. • At 1 microsecond/comparison: 6 days.
Locality-Sensitive Hashing • General idea: Use a function f(x,y) that tells whether or not x and y is a candidate pair : a pair of elements whose similarity must be evaluated. • For minhash matrices: Hash columns to many buckets, and make elements of the same bucket candidate pairs.
Candidate pairs : those pairs of signatures that we need to test for similarity. Locality- sensitive Hashing Minhash- ing The set of strings of length k that appear in the doc- ument Signatures : short integer vectors that represent the sets, and reflect their similarity The Big Picture Shingling Docu- ment
Candidate Generation From Minhash Signatures • Pick a similarity threshold s, • e..g., s=0.8 • Goal: Find documents with Jaccard similarity at least s. • Columns c and d are a candidate pair if their signatures agree in at least fraction s of the rows. • I.e., M (i, c ) = M (i, d ) for at least fraction s values of i.
LSH for Minhash Signatures • Big idea: hash columns of signature matrix M several times. • Arrange that (only) similar columns are likely to hash to the same bucket. • Candidate pairs are those that hash at least once to the same bucket.
Partition Into Bands r rows per band b bands One signature Matrix M
Columns 2 and 6 are probably identical. Columns 6 and 7 are surely different. Buckets Matrix M b bands r rows
Partition into Bands – (2) • Divide matrix M into b bands of r rows. • For each band, hash its portion of each column to a hash table with k buckets. • Make k as large as possible. • Candidatecolumn pairs are those columns that hash to the same bucket for ≥ 1 band. • Tune b and r to catch most similar pairs, but few nonsimilar pairs.
Simplifying Assumption • There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band. • Hereafter, we assume that “same bucket” means “identical in that band.” • Assumption needed only to simplify analysis, not for correctness of algorithm
Example of Bands • 100 min-hash signatures/document • Suppose 100,000 columns • Therefore, signatures take 40Mb. • 5,000,000,000 pairs of signatures can take a while to compare. • Let’s choose b=20, r=5 • 20 bands, 5 signatures per band • Goal: find pairs of documents that are at least 80% similar
Suppose C1, C2 are 80% Similar • Probability C1, C2 identical in one particular band: (0.8)5 = 0.328. • Probability C1, C2 are not similar in any of the 20 bands: (1-0.328)20 = .00035 . • i.e., about 1/3000th of the 80%-similar column pairs are false negatives. • We would find 99.965% pairs of truly similar documents
Suppose C1, C2 Only 30% Similar • Probability C1, C2 identical in any one particular band: (0.3)5 = 0.00243 . • Probability C1, C2 identical in ≥ 1 of 20 bands: ≤ 20 * 0.0043 = 0.0486 . • In other words, approximately 4.86% pairs of docs with similarity 30% end up becoming candidate pairs • False positives
LSH Involves a Tradeoff • Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives. • Example: if we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up.
Probability = 1 if s > t No chance if s < t Analysis of LSH – What We Want Probability of sharing a bucket t Similarity s of two sets
What One Band of One Row Gives You Remember: probability of equal hash-values = similarity Probability of sharing a bucket t Similarity s of two sets
No bands identical At least one band identical 1 - ( 1 - sr )b t ~ (1/b)1/r Some row of a band unequal All rows of a band are equal What b Bands of r Rows Gives You Probability of sharing a bucket t Similarity s of two sets
LSH Summary • Tune to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures. • Check in main memory that candidate pairs really do have similar signatures. • Optional: In another pass through data, check that the remaining candidate pairs really represent similar sets .
Three Essential Techniques for Similar Documents • Shingling : convert documents, emails, etc., to sets. • Minhashing : convert large sets to short signatures, while preserving similarity. • Locality-sensitive hashing : focus on pairs of signatures likely to be similar.
Candidate pairs : those pairs of signatures that we need to test for similarity. Locality- sensitive Hashing Minhash- ing The set of strings of length k that appear in the doc- ument Signatures : short integer vectors that represent the sets, and reflect their similarity The Big Picture Shingling Docu- ment
Key Idea Behind LSH • Given documents (i.e., shingle sets) D1 and D2 • If we can find a hash function h such that: • if sim(D1,D2) is high, then with high probability h(D1) = h(D2) • if sim(D1,D2) is low, then with high probability h(D1) !=h(D2) • Then we could hash documents into buckets, and expect that “most” pairs of near duplicate documents would hash into the same bucket • Compare pairs of docs in each bucket to see if they are really near-duplicates
Minhashing • Clearly, the hash function depends on the similarity metric • Not all similarity metrics have a suitable hash function • Fortunately, there is a suitable hash function for Jaccard similarity • Min-hashing
Theory of LSH • We have used LSH to find similar documents • In reality, columns in large sparse matrices with high Jaccard similarity • e.g., customer/item purchase histories • Can we use LSH for other distance measures? • e.g., Euclidean distances, Cosine distance • Let’s generalize what we’ve learned!
Families of Hash Functions • A “hash function” is any function that takes two elements and says whether or not they are “equal” (really, are candidates for similarity checking). • Shorthand: h(x) = h(y) means “h says x and y are equal.” • A family of hash functions is any set of functions as in (1). • An example of a family of hash functions • For min-hash signatures, we got a minhash function for each permutation of rows • A (large) set of related hash functions generated by some mechanism • We should be able to efficiently pick a hash function at random from such a family
LS Families of Hash Functions • Suppose we have a space S of points with a distance measure d. • A family H of hash functions is said to be (d1,d2,p1,p2)-sensitive if for any x and y in S : • If d(x,y) < d1, then prob. over all h in H, that h(x) = h(y) is at least p1. • If d(x,y) > d2, then prob. over all h in H, that h(x) = h(y) is at most p2.
LS Families: Illustration High probability; at least p1 Low probability; at most p2 ??? d2 d1
Example: LS Family • Let S = sets, d = Jaccard distance, H is formed from the minhash functions for all permutations. • Then Prob[h(x)=h(y)] = 1-d(x,y). • Restates theorem about Jaccard similarity and minhashing in terms of Jaccard distance rather than similarities.
Then probability that minhash values agree is > 2/3 If distance < 1/3 (so similarity > 2/3) Example: LS Family – (2) • Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.