
Near-duplicates detection

Comparison of the two algorithms seen in class, by Romain Colle. In the first pass through the data, both algorithms compute a signature for each document and perform LSH on these signatures.





Presentation Transcript


  1. Near-duplicates detection: comparison of the two algorithms seen in class (Romain Colle)

  2. Description of algorithms
     • First pass through the data: both algorithms compute a signature for each document and perform LSH (locality-sensitive hashing) on these signatures.
     • Second pass through the data: the candidate duplicate pairs found are verified for relevance using Jaccard similarity.
     • Algorithm SH uses shingles + MinHashing to compute the signatures.
     • Algorithm SK uses sketches of projections on random hyperplanes to compute the signatures.
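The two-pass scheme on this slide can be sketched as follows. This is a minimal illustration, not the implementation used in the presentation: the function names, the 4-word shingle size, the number of hash functions, the band size, the 0.5 similarity threshold, and the use of ±1 hyperplane components for the SK-style sketch are all assumptions made for the example.

```python
import hashlib
import random
from collections import Counter, defaultdict
from itertools import combinations

P = (1 << 61) - 1  # large prime modulus for the MinHash hash family

def shingles(text, k=4):
    """Set of k-word shingles of a document (k=4 is an assumed value)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Jaccard similarity of two sets, used in the verification pass."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def stable_hash(s):
    """Deterministic 64-bit hash of a string (built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

def make_hash_params(n, seed=0):
    """Coefficients (a, b) for n hash functions h -> (a*h + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(n)]

def minhash_signature(shingle_set, params):
    """SH-style signature: per hash function, the minimum value over all shingles."""
    hashes = [stable_hash(s) for s in shingle_set]
    return tuple(min((a * h + b) % P for h in hashes) for a, b in params)

def hyperplane_sketch(text, n_bits=64):
    """SK-style signature: one bit per pseudo-random hyperplane (sign of the projection)."""
    counts = Counter(text.lower().split())
    bits = []
    for i in range(n_bits):
        # Component of hyperplane i along word w is +1 or -1 (an assumed simplification).
        proj = sum(c * (1 if stable_hash(f"{i}:{w}") & 1 else -1)
                   for w, c in counts.items())
        bits.append(1 if proj >= 0 else 0)
    return tuple(bits)

def lsh_candidates(signatures, band_size):
    """First pass: documents that agree on any whole band become a candidate pair."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for start in range(0, len(sig), band_size):
            buckets[(start, sig[start:start + band_size])].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def near_duplicates(docs, threshold=0.5, n_hashes=20, band_size=2):
    """Both passes with SH-style signatures: LSH candidates, then Jaccard verification."""
    params = make_hash_params(n_hashes)
    sets = {d: shingles(t) for d, t in docs.items()}
    sigs = {d: minhash_signature(s, params) for d, s in sets.items()}
    candidates = lsh_candidates(sigs, band_size)
    return {(a, b) for a, b in candidates if jaccard(sets[a], sets[b]) >= threshold}

if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy dog",
        "c": "completely different text about something else entirely here",
    }
    print(near_duplicates(docs))
```

Swapping `minhash_signature` for `hyperplane_sketch` in the first pass would give an SK-style variant; the second pass stays the same because verification relies on Jaccard similarity in both cases.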

  3. Experimentation method
     • Run both algorithms on the data set (WebBase), and compute precision.
     • Remove the duplicate pairs found from the data set.
     • Generate and insert a large number of (near-)duplicate documents (~10% of the data set).
     • Run both algorithms on the new data set, and compute precision and recall.
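The precision and recall figures mentioned above can be computed from sets of document pairs. The sketch below assumes that the pairs reported by an algorithm and the known true duplicate pairs (for example, the ones inserted into the modified data set) are both available as sets of ID pairs; the function names, the document IDs, and the convention of returning 1.0 for an empty denominator are illustrative choices, not taken from the slides.

```python
def normalize(pairs):
    """Treat (a, b) and (b, a) as the same unordered pair."""
    return {tuple(sorted(p)) for p in pairs}

def precision(found, relevant):
    """Fraction of reported pairs that are true duplicates."""
    found, relevant = normalize(found), normalize(relevant)
    return len(found & relevant) / len(found) if found else 1.0

def recall(found, relevant):
    """Fraction of true duplicate pairs that were reported."""
    found, relevant = normalize(found), normalize(relevant)
    return len(found & relevant) / len(relevant) if relevant else 1.0

if __name__ == "__main__":
    # Hypothetical IDs: three pairs reported, four pairs known to be duplicates.
    found = {("d1", "d2"), ("d3", "d4"), ("d5", "d6")}
    known = {("d2", "d1"), ("d3", "d4"), ("d7", "d8"), ("d9", "d10")}
    print(precision(found, known), recall(found, known))
```

Recall needs the full set of true duplicates, which is presumably why it is only reported on the modified data set, where the inserted duplicates are known.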

  4. Results (original data set)

  5. Results (modified dataset)

  6. Conclusion
     • Algorithm SK rocks!
     • However, it is computationally more expensive.
     • Tradeoff between speed and recall/precision (given that algorithm SH performs quite well).
