1 / 10

CS276A Text Information Retrieval, Mining, and Exploitation

CS276A Text Information Retrieval, Mining, and Exploitation. Supplemental Min-wise Hashing Slides [Brod97,Brod98] (Adapted from Rajeev Motwani’s CS361A slides). Set Similarity. Set Similarity (Jaccard measure)

Télécharger la présentation

CS276A Text Information Retrieval, Mining, and Exploitation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS276AText Information Retrieval, Mining, and Exploitation Supplemental Min-wise Hashing Slides [Brod97,Brod98] (Adapted from Rajeev Motwani’s CS361A slides)

  2. Set Similarity • Set Similarity (Jaccard measure) • View sets as columns of a matrix; one row for each element in the universe. aij = 1 indicates presence of item i in set j • Example C1C2 0 1 1 0 1 1 simJ(C1,C2) = 2/5 = 0.4 0 0 1 1 0 1

  3. Identifying Similar Sets? • Signature Idea • Hash columns Ci to signature sig(Ci) • simJ(Ci,Cj) approximated by simH(sig(Ci),sig(Cj)) • Naïve Approach • Sample P rows uniformly at random • Define sig(Ci) as P bits of Ci in sample • Problem • sparsity  could easily miss interesting part of columns • sample would get only 0’s in columns

  4. Key Observation • For columns Ci, Cj, four types of rows Ci Cj A 1 1 B 1 0 C 0 1 D 0 0 • Overload notation: A = # of rows of type A • Claim

  5. Min Hashing • Randomly permute rows • Hash h(Ci) = index of first row with 1 in column Ci • Suprising Property • Why? • Both are A/(A+B+C) • Look down columns Ci, Cj until first non-Type-D row • h(Ci) = h(Cj)  type A row

  6. Min-Hash Signatures • Pick – P random row permutations • MinHash Signature sig(C) = list of P indexes of first rows with 1 in column C • Similarity of signatures • Let simH(sig(Ci),sig(Cj)) = fraction of permutations where MinHash values agree • Observe E[simH(sig(Ci),sig(Cj))] = simJ(Ci,Cj)

  7. Example Signatures S1 S2 S3 Perm 1 = (12345) 1 2 1 Perm 2 = (54321) 4 5 4 Perm 3 = (34512) 3 5 4 C1 C2 C3 R1 1 0 1 R2 0 1 1 R3 1 0 0 R4 1 0 1 R5 0 1 0 Similarities 1-2 1-3 2-3 Col-Col 0.00 0.50 0.25 Sig-Sig 0.00 0.67 0.00 Exercise: What similarities do we get if we also picked a min-hash for Perm 4=(21453) ?

  8. Implementation Trick • Permuting rows even once is prohibitive • Row Hashing • Pick P hash functions hk: {1,…,n}{1,…,O(n)} • Ordering under hk gives random row permutation • One-pass Implementation • For each Ci and hk, keep “slot” for min-hash value • Initialize all slot(Ci,hk) to infinity • Scan rows in arbitrary order looking for 1’s • Suppose row Rj has 1 in column Ci • For each hk, • if hk(j) < slot(Ci,hk), then slot(Ci,hk)  hk(j)

  9. Example C1 C2 R1 1 0 R2 0 1 R3 1 1 R4 1 0 R5 0 1 C1 slotsC2 slots h(1) = 1 1 - g(1) = 3 3 - h(2) = 2 1 2 g(2) = 0 3 0 h(3) = 3 1 2 g(3) = 2 2 0 h(4) = 4 1 2 g(4) = 4 2 0 h(x) = x mod 5 g(x) = 2x+1 mod 5 h(5) = 0 1 0 g(5) = 1 2 0

  10. Comparing Signatures • Signature Matrix S • Rows = Hash Functions • Columns = Columns • Entries = Signatures • Compute – Pair-wise similarity of signature columns

More Related