1 / 74

Three Essential Techniques for Similar Documents

Learn shingling, minhashing, and locality-sensitive hashing to compare and analyze similar documents efficiently. Understand how to convert documents to sets, create short signatures, and focus on likely similar pairs.

rachaelc
Télécharger la présentation

Three Essential Techniques for Similar Documents

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Three Essential Techniques for Similar Documents • Shingling : convert documents, emails, etc., to sets. • Minhashing : convert large sets to short signatures, while preserving similarity. • Locality-sensitive hashing : focus on pairs of signatures likely to be similar.

  2. Candidate pairs : those pairs of signatures that we need to test for similarity. Locality- sensitive Hashing Minhash- ing The set of strings of length k that appear in the doc- ument Signatures : short integer vectors that represent the sets, and reflect their similarity The Big Picture Shingling Docu- ment

  3. Minhashing

  4. Four Types of Rows • Given columns C1 and C2, rows may be classified as: C1 C2 a 1 1 b 1 0 c 0 1 d 0 0 • Also, a = # rows of type a , etc. • Note Sim (C1, C2) = a /(a +b +c ).

  5. Min-hashing • Define a hash function h as follows: • Permute the rows of the matrix randomly • Important: same permutation for all the vectors! • Let C be a column (= a document) • h(C) = the number of the first (in the permuted order) row in which column C has 1

  6. Minhash Signatures • Use several (e.g., 100) independent hash functions to create a signature.

  7. Input matrix Signature matrix M 1 4 1 0 1 0 2 1 2 1 1 0 0 1 3 2 2 1 4 1 0 1 0 1 7 1 1 2 1 2 0 1 0 1 6 3 0 1 0 1 2 6 5 7 1 0 1 0 4 5 1 0 1 0 Minhashing Example

  8. Surprising Property • The probability (over all permutations of the rows) that h (C1) = h (C2) is the same as Sim (C1, C2). • Both are a /(a +b +c )! • Why? • Look down the permuted columns C1 and C2 until we see a 1. • If it’s a type-a row, then h (C1) = h (C2). If a type-b or type-c row, then not.

  9. Similarity for Signatures • The similarity of signatures is the fraction of the hash functions in which they agree. • Because of the minhash property, the similarity of columns is the same as the expected similarity of their signatures.

  10. Input matrix Signature matrix M 1 4 1 0 1 0 2 1 2 1 1 0 0 1 3 2 2 1 4 1 0 1 0 1 7 1 1 2 1 2 0 1 0 1 6 3 0 1 0 1 2 6 5 7 1 0 1 0 4 5 1 0 1 0 Min Hashing – Example Similarities: 1-3 2-4 1-2 3-4 Col/Col 0.75 0.75 0 0 Sig/Sig 0.67 1.00 0 0

  11. Implementation – (1) • Suppose 1 billion rows. • Hard to pick a random permutation from 1…billion. • Representing a random permutation requires 1 billion entries. • Accessing rows in permuted order leads to thrashing.

  12. Implementation – (2) • Simulate the effect of random permutation by a random hash function • A good approximation to permuting rows: pick 100 (?) hash functions. • h1 , h2 ,… • For rows r and s, if hi (r ) < hi (s), then r appears before s in permutation i. • We will use the same name for the hash function and the corresponding min-hash function

  13. Implementation –(3) • For each column c and each hash function hi , keep a “slot” M (i, c ). • Initialize to infinity • Intent: M (i, c ) will become the smallest value of hi (r ) for which column c has 1 in row r. • I.e., hi (r ) gives order of rows for i th permuation. • Sort the input matrix so it is ordered by rows • So can iterate by reading rows sequentially from disk

  14. Implementation – (4) for each row r for each column c if c has 1 in row r for each hash function hido ifhi (r ) is a smaller value than M (i, c ) then M (i, c ) := hi (r );

  15. Example Sig1 Sig2 h(1) = 1 1 - g(1) = 3 3 - Row C1 C2 1 1 0 2 0 1 3 1 1 4 1 0 5 0 1 h(2) = 2 1 2 g(2) = 0 3 0 h(3) = 3 1 2 g(3) = 2 2 0 h(4) = 4 1 2 g(4) = 4 2 0 h(x) = x mod 5 g(x) = 2x+1 mod 5 h(5) = 0 1 0 g(5) = 1 2 0

  16. Implementation – (5) • Often, data is given by column, not row. • E.g., columns = documents, rows = shingles. • If so, sort matrix once so it is by row. • And always compute hi (r ) only once for each row.

  17. Locality-Sensitive Hashing

  18. Finding Similar Pairs • Suppose we have, in main memory, data representing a large number of objects. • May be the objects themselves . • May be signatures as in minhashing. • We want to compare each to each, finding those pairs that are sufficiently similar.

  19. Checking All Pairs is Hard • While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns. • Example: 106 columns implies 5*1011 column-comparisons. • At 1 microsecond/comparison: 6 days.

  20. Locality-Sensitive Hashing • General idea: Use a function f(x,y) that tells whether or not x and y is a candidate pair : a pair of elements whose similarity must be evaluated. • For minhash matrices: Hash columns to many buckets, and make elements of the same bucket candidate pairs.

  21. Candidate pairs : those pairs of signatures that we need to test for similarity. Locality- sensitive Hashing Minhash- ing The set of strings of length k that appear in the doc- ument Signatures : short integer vectors that represent the sets, and reflect their similarity The Big Picture Shingling Docu- ment

  22. Candidate Generation From Minhash Signatures • Pick a similarity threshold s, • e..g., s=0.8 • Goal: Find documents with Jaccard similarity at least s. • Columns c and d are a candidate pair if their signatures agree in at least fraction s of the rows. • I.e., M (i, c ) = M (i, d ) for at least fraction s values of i.

  23. LSH for Minhash Signatures • Big idea: hash columns of signature matrix M several times. • Arrange that (only) similar columns are likely to hash to the same bucket. • Candidate pairs are those that hash at least once to the same bucket.

  24. Partition Into Bands r rows per band b bands One signature Matrix M

  25. Columns 2 and 6 are probably identical. Columns 6 and 7 are surely different. Buckets Matrix M b bands r rows

  26. Partition into Bands – (2) • Divide matrix M into b bands of r rows. • For each band, hash its portion of each column to a hash table with k buckets. • Make k as large as possible. • Candidatecolumn pairs are those columns that hash to the same bucket for ≥ 1 band. • Tune b and r to catch most similar pairs, but few nonsimilar pairs.

  27. Simplifying Assumption • There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band. • Hereafter, we assume that “same bucket” means “identical in that band.” • Assumption needed only to simplify analysis, not for correctness of algorithm

  28. Example of Bands • 100 min-hash signatures/document • Suppose 100,000 columns • Therefore, signatures take 40Mb. • 5,000,000,000 pairs of signatures can take a while to compare. • Let’s choose b=20, r=5 • 20 bands, 5 signatures per band • Goal: find pairs of documents that are at least 80% similar

  29. Suppose C1, C2 are 80% Similar • Probability C1, C2 identical in one particular band: (0.8)5 = 0.328. • Probability C1, C2 are not similar in any of the 20 bands: (1-0.328)20 = .00035 . • i.e., about 1/3000th of the 80%-similar column pairs are false negatives. • We would find 99.965% pairs of truly similar documents

  30. Suppose C1, C2 Only 30% Similar • Probability C1, C2 identical in any one particular band: (0.3)5 = 0.00243 . • Probability C1, C2 identical in ≥ 1 of 20 bands: ≤ 20 * 0.0043 = 0.0486 . • In other words, approximately 4.86% pairs of docs with similarity 30% end up becoming candidate pairs • False positives

  31. LSH Involves a Tradeoff • Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives. • Example: if we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up.

  32. Probability = 1 if s > t No chance if s < t Analysis of LSH – What We Want Probability of sharing a bucket t Similarity s of two sets

  33. What One Band of One Row Gives You Remember: probability of equal hash-values = similarity Probability of sharing a bucket t Similarity s of two sets

  34. b Bands, r rows/band

  35. No bands identical At least one band identical 1 - ( 1 - sr )b t ~ (1/b)1/r Some row of a band unequal All rows of a band are equal What b Bands of r Rows Gives You Probability of sharing a bucket t Similarity s of two sets

  36. Example: b = 20; r = 5

  37. LSH Summary • Tune to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures. • Check in main memory that candidate pairs really do have similar signatures. • Optional: In another pass through data, check that the remaining candidate pairs really represent similar sets .

  38. Three Essential Techniques for Similar Documents • Shingling : convert documents, emails, etc., to sets. • Minhashing : convert large sets to short signatures, while preserving similarity. • Locality-sensitive hashing : focus on pairs of signatures likely to be similar.

  39. Candidate pairs : those pairs of signatures that we need to test for similarity. Locality- sensitive Hashing Minhash- ing The set of strings of length k that appear in the doc- ument Signatures : short integer vectors that represent the sets, and reflect their similarity The Big Picture Shingling Docu- ment

  40. Key Idea Behind LSH • Given documents (i.e., shingle sets) D1 and D2 • If we can find a hash function h such that: • if sim(D1,D2) is high, then with high probability h(D1) = h(D2) • if sim(D1,D2) is low, then with high probability h(D1) !=h(D2) • Then we could hash documents into buckets, and expect that “most” pairs of near duplicate documents would hash into the same bucket • Compare pairs of docs in each bucket to see if they are really near-duplicates

  41. Minhashing • Clearly, the hash function depends on the similarity metric • Not all similarity metrics have a suitable hash function • Fortunately, there is a suitable hash function for Jaccard similarity • Min-hashing

  42. Theory of LSH

  43. Theory of LSH • We have used LSH to find similar documents • In reality, columns in large sparse matrices with high Jaccard similarity • e.g., customer/item purchase histories • Can we use LSH for other distance measures? • e.g., Euclidean distances, Cosine distance • Let’s generalize what we’ve learned!

  44. Families of Hash Functions • A “hash function” is any function that takes two elements and says whether or not they are “equal” (really, are candidates for similarity checking). • Shorthand: h(x) = h(y) means “h says x and y are equal.” • A family of hash functions is any set of functions as in (1). • An example of a family of hash functions • For min-hash signatures, we got a minhash function for each permutation of rows • A (large) set of related hash functions generated by some mechanism • We should be able to efficiently pick a hash function at random from such a family

  45. LS Families of Hash Functions • Suppose we have a space S of points with a distance measure d. • A family H of hash functions is said to be (d1,d2,p1,p2)-sensitive if for any x and y in S : • If d(x,y) < d1, then prob. over all h in H, that h(x) = h(y) is at least p1. • If d(x,y) > d2, then prob. over all h in H, that h(x) = h(y) is at most p2.

  46. LS Families: Illustration High probability; at least p1 Low probability; at most p2 ??? d2 d1

  47. Example: LS Family • Let S = sets, d = Jaccard distance, H is formed from the minhash functions for all permutations. • Then Prob[h(x)=h(y)] = 1-d(x,y). • Restates theorem about Jaccard similarity and minhashing in terms of Jaccard distance rather than similarities.

  48. Then probability that minhash values agree is > 2/3 If distance < 1/3 (so similarity > 2/3) Example: LS Family – (2) • Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.

More Related