Detecting Near-Duplicates for Web Crawling

Presentation By: Fernando Arreola • Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma

Outline • De-duplication • Goal of the Paper • Why is De-duplication Important? • Algorithm • Experiment • Related Work • Tying it Back to Lecture • Paper Evaluation • Questions

De-duplication • The process of eliminating near-duplicate web documents in a generic crawl • Challenge of near-duplicates: • Identifying exact duplicates is easy • Use checksums • How do we identify near-duplicates? • Near-duplicates are identical in content but differ in small areas • Ads, counters, and timestamps
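To illustrate why checksums only solve the exact-duplicate case, here is a minimal sketch. The two page strings are hypothetical, and SHA-256 is just one possible checksum; the slides do not name a specific one.

```python
import hashlib


def checksum(page_text: str) -> str:
    """Return the SHA-256 hex digest of the page text (assumed checksum choice)."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


# Hypothetical pages: same article, different visitor counter.
page_a = "Simhash explained. Visitors today: 1041"
page_b = "Simhash explained. Visitors today: 1042"

# An exact copy always produces the same checksum.
exact_duplicate_caught = checksum(page_a) == checksum(page_a)

# A one-character change produces a different checksum,
# so the near-duplicate goes undetected.
near_duplicate_caught = checksum(page_a) == checksum(page_b)

print(exact_duplicate_caught, near_duplicate_caught)
```

Because a checksum changes completely when any byte changes, it cannot express "almost the same", which is the gap simhash fingerprints are meant to fill.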

Goal of the Paper • Present a near-duplicate detection system that improves web crawling • The near-duplicate detection system includes: • Simhash technique • Technique used to transform a web page into an f-bit fingerprint • Solution to the Hamming Distance Problem • Given an f-bit fingerprint, find all fingerprints in a given collection that differ from it in at most k bit positions

Why is De-duplication Important? • Elimination of near duplicates: • Saves network bandwidth • Do not have to crawl content if similar to previously crawled content • Reduces storage cost • Do not have to store in local repository if similar to previously crawled content • Improves quality of search indexes • Local repository used for building search indexes not polluted by near-duplicates

Algorithm: Simhash Technique • Convert the web page to a set of features • Using Information Retrieval techniques • e.g. tokenization, phrase detection • Give a weight to each feature • Hash each feature into an f-bit value • Maintain an f-dimensional vector • All dimension values start at 0 • Update the f-dimensional vector with the weight of each feature • If the i-th bit of the hash value is zero -> subtract the weight of the feature from the i-th vector value • If the i-th bit of the hash value is one -> add the weight of the feature to the i-th vector value • The vector will have positive and negative components • The sign (+/-) of each component gives the corresponding bit of the fingerprint (positive -> 1, negative -> 0)

Algorithm: Simhash Technique (cont.) • Very simple example • One web page • Web-page text: “Simhash Technique” • Reduced to two features • “Simhash” -> weight = 2 • “Technique” -> weight = 4 • Hash features to 4 bits • “Simhash” -> 1101 • “Technique” -> 0110

Algorithm: Simhash Technique (cont.) • Start vector with all zeroes 0 0 0 0

Algorithm: Simhash Technique (cont.) • Apply “Simhash” feature (weight = 2, hash = 1101) • Position 1: bit 1 -> 0 + 2 = 2 • Position 2: bit 1 -> 0 + 2 = 2 • Position 3: bit 0 -> 0 - 2 = -2 • Position 4: bit 1 -> 0 + 2 = 2 • Vector after update: 2 2 -2 2

Algorithm: Simhash Technique (cont.) • Apply “Technique” feature (weight = 4, hash = 0110) • Position 1: bit 0 -> 2 - 4 = -2 • Position 2: bit 1 -> 2 + 4 = 6 • Position 3: bit 1 -> -2 + 4 = 2 • Position 4: bit 0 -> 2 - 4 = -2 • Vector after update: -2 6 2 -2

Algorithm: Simhash Technique (cont.) • Final vector: -2 6 2 -2 • Sign of vector values is -,+,+,- • Final 4-bit fingerprint = 0110
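The worked example above can be reproduced with a short sketch. The function name `simhash` and its input layout are illustrative assumptions. The feature hashes are taken directly from the slides rather than computed by a real hash function. Treating a component of exactly zero as bit 0 is also an assumption, since the slides do not cover that case.

```python
from typing import Dict, List, Tuple


def simhash(features: Dict[str, Tuple[int, int]], f: int) -> Tuple[List[int], int]:
    """Compute the simhash vector and fingerprint.

    `features` maps a feature name to (weight, f-bit hash value).
    """
    vector = [0] * f  # every dimension starts at 0
    for weight, hash_value in features.values():
        for i in range(f):
            # Position 0 is the leftmost (most significant) bit of the hash.
            bit = (hash_value >> (f - 1 - i)) & 1
            # Bit 1 adds the weight, bit 0 subtracts it.
            vector[i] += weight if bit else -weight

    fingerprint = 0
    for component in vector:
        # Positive component -> 1; zero or negative -> 0 (zero case is an assumption).
        fingerprint = (fingerprint << 1) | (1 if component > 0 else 0)
    return vector, fingerprint


# Weights and 4-bit hashes from the slide example.
features = {"Simhash": (2, 0b1101), "Technique": (4, 0b0110)}
vector, fingerprint = simhash(features, f=4)
print(vector, format(fingerprint, "04b"))  # [-2, 6, 2, -2] 0110
```

Higher-weight features dominate the sign of each component, which is why “Technique” (weight 4) determines every bit of this fingerprint.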

Algorithm: Solution to Hamming Distance Problem • Problem: Given an f-bit fingerprint (F), find all fingerprints in a given collection that differ from it in at most k bit positions • Solution: • Create tables containing the fingerprints • Each table i has a permutation (π_i) and a small integer (p_i) associated with it • Apply the permutation associated with each table to its fingerprints • Sort the tables • Store the tables in the main memory of a set of machines • Iterate through the tables in parallel • Find all permuted fingerprints whose top p_i bits match the top p_i bits of π_i(F) • For the fingerprints that matched, check whether they differ from π_i(F) in at most k bits

Algorithm: Solution to Hamming Distance Problem (cont.) • Simple example • F = 0100 1101 • k = 3 • Have a collection of 8 fingerprints • Create two tables

Algorithm: Solution to Hamming Distance Problem (cont.) • [Figure: the two tables of permuted fingerprints, each sorted]

Algorithm: Solution to Hamming Distance Problem (cont.) • F = 0100 1101 • Permuted query fingerprints, one per table: π(F) = 1101 0100 and π(F) = 0101 0011 • Match!

Algorithm: Solution to Hamming Distance Problem (cont.) • With k = 3, only the fingerprint in the first table is a near-duplicate of fingerprint F
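The table-based lookup can be sketched as follows. Several choices here are assumptions made for illustration, not details from the slides: the permutations are bit rotations, the sketch builds k + 1 tables (so at least one block of p bits must be unchanged in any fingerprint within distance k), the 8 fingerprints in `collection` are made up, and binary search from `bisect` stands in for the probe into each sorted table.

```python
from bisect import bisect_left, bisect_right
from typing import List, Set, Tuple

F_BITS = 8  # fingerprint width, as in the slide example


def rotate_left(x: int, r: int, f: int = F_BITS) -> int:
    """Rotate an f-bit value left by r positions (the assumed permutation)."""
    r %= f
    return ((x << r) | (x >> (f - r))) & ((1 << f) - 1)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def build_tables(collection: List[int], k: int, f: int = F_BITS) -> List[Tuple[int, int, List[int]]]:
    """Build k + 1 sorted tables; table i rotates block i to the top p bits."""
    p = f // (k + 1)
    tables = []
    for i in range(k + 1):
        r = i * p
        tables.append((r, p, sorted(rotate_left(fp, r, f) for fp in collection)))
    return tables


def query(tables: List[Tuple[int, int, List[int]]], fingerprint: int, k: int, f: int = F_BITS) -> Set[int]:
    """Return all stored fingerprints within Hamming distance k of `fingerprint`."""
    matches = set()
    for r, p, table in tables:
        q = rotate_left(fingerprint, r, f)
        shift = f - p
        low_key = (q >> shift) << shift           # smallest value sharing q's top p bits
        high_key = low_key | ((1 << shift) - 1)   # largest value sharing q's top p bits
        start = bisect_left(table, low_key)
        end = bisect_right(table, high_key)
        for candidate in table[start:end]:
            # Permuting both values does not change their Hamming distance.
            if hamming(candidate, q) <= k:
                matches.add(rotate_left(candidate, f - r, f))  # undo the rotation
    return matches


# Hypothetical collection of 8 fingerprints; F and k come from the slides.
collection = [0b01001111, 0b11001001, 0b10110010, 0b01110000,
              0b00001101, 0b11100011, 0b00110010, 0b01011010]
tables = build_tables(collection, k=3)
near = query(tables, 0b01001101, k=3)
print(sorted(format(fp, "08b") for fp in near))
```

Each table narrows the search to fingerprints that share a prefix with the permuted query, so only a small slice of the collection needs the full bit-by-bit comparison.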

Algorithm: Compression of Tables • 1. Store the first fingerprint in a block (1024 bytes) • 2. XOR the current fingerprint with the previous one • 3. Append to the block the Huffman code for the position of the most significant 1 bit of the XOR result • 4. Append to the block the bits to the right of the most significant 1 bit • 5. Repeat steps 2-4 until the block is full • Comparing to the query fingerprint • Use the last fingerprint in each block as its key, and perform interpolation search over the keys to find and decompress the appropriate block
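The encoding steps can be sketched as below, with several simplifications that are assumptions of this example. A fixed 6-bit position code replaces the Huffman code, the bit stream is held as a string of '0'/'1' characters, the 1024-byte block limit is not enforced, and the input is assumed to be sorted with no repeated fingerprints. The sample fingerprints are made up.

```python
from typing import List

F_BITS = 64    # fingerprint width
POS_BITS = 6   # fixed-width stand-in for the Huffman code (covers positions 0..63)


def encode_block(fingerprints: List[int]) -> str:
    """Encode sorted, distinct fingerprints into a '0'/'1' string."""
    bits = format(fingerprints[0], f"0{F_BITS}b")  # first fingerprint stored whole
    for prev, cur in zip(fingerprints, fingerprints[1:]):
        x = prev ^ cur
        h = x.bit_length() - 1                     # position of the most significant 1 bit
        bits += format(h, f"0{POS_BITS}b")         # position code (Huffman in the slides)
        if h:
            bits += format(x & ((1 << h) - 1), f"0{h}b")  # bits to the right of that 1 bit
    return bits


def decode_block(bits: str) -> List[int]:
    """Invert encode_block."""
    out = [int(bits[:F_BITS], 2)]
    i = F_BITS
    while i < len(bits):
        h = int(bits[i:i + POS_BITS], 2)
        i += POS_BITS
        low = int(bits[i:i + h], 2) if h else 0
        i += h
        out.append(out[-1] ^ ((1 << h) | low))     # rebuild the XOR, then the fingerprint
    return out


# Hypothetical sorted fingerprints that share their high-order bits.
fingerprints = [0x0F0F0F0F0F0F0F00, 0x0F0F0F0F0F0F0F13,
                0x0F0F0F0F0F0F1A2B, 0x0F0F0F0F0F10FFFF]
encoded = encode_block(fingerprints)
decoded = decode_block(encoded)
print(len(encoded), "bits instead of", F_BITS * len(fingerprints))
```

Neighbouring fingerprints in a sorted table share their leading bits, so the XOR has a long run of leading zeros and only the short tail needs to be stored.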

Algorithm: Extending to Batch Queries • Problem: Want to get near-duplicates for a batch of query fingerprints, not just one • Solution: • Use Google File System (GFS) and MapReduce • Create two files • File F has the collection of fingerprints • File Q has the query fingerprints • Store the files in GFS • GFS breaks up the files into chunks • Use MapReduce to solve the Hamming Distance Problem for each chunk of F against all queries in Q • MapReduce allows a task to be created per chunk • Iterate through the chunks in parallel • Each task outputs the near-duplicates it found • Produce a sorted file from the output of each task • Remove duplicates if necessary
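The per-chunk structure can be imitated on one machine as a rough sketch. This is not GFS or MapReduce: a Python list split into slices stands in for the chunked file F, a thread pool stands in for the per-chunk tasks, and each task does a plain linear scan. All names and data below are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def split_into_chunks(items: List[int], chunk_size: int) -> List[List[int]]:
    """Stand-in for the file system breaking file F into chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def map_task(chunk: List[int], queries: List[int], k: int) -> List[Tuple[int, int]]:
    """One task: compare every fingerprint in the chunk with every query."""
    return [(q, fp) for fp in chunk for q in queries if hamming(q, fp) <= k]


def batch_near_duplicates(collection: List[int], queries: List[int],
                          k: int, chunk_size: int = 3) -> List[Tuple[int, int]]:
    """Run one task per chunk in parallel, then sort and de-duplicate the output."""
    chunks = split_into_chunks(collection, chunk_size)
    with ThreadPoolExecutor() as pool:
        outputs = pool.map(lambda chunk: map_task(chunk, queries, k), chunks)
        merged = [pair for output in outputs for pair in output]
    return sorted(set(merged))


# Hypothetical 8-bit data: file F (collection) and file Q (queries).
collection = [0b01001111, 0b11001001, 0b10110010, 0b01110000,
              0b00001101, 0b11100011, 0b00110010, 0b01011010]
queries = [0b01001101, 0b10110011]
pairs = batch_near_duplicates(collection, queries, k=3)
for q, fp in pairs:
    print(format(q, "08b"), "~", format(fp, "08b"))
```

Because each chunk is processed independently against the full query set, the work scales out by adding tasks, and the final sort and de-duplication step merges their outputs.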

Experiment: Parameters • 8 billion web pages used • k = 1 to 10 • Manually tagged pairs as follows: • True positives • Pairs that differ slightly • False positives • Radically different pairs • Unknown • Pairs that could not be evaluated

Experiment: Results • Accuracy • Low k value -> a lot of false negatives • High k value -> a lot of false positives • Best value -> k = 3 • 75% of near-duplicates are reported • 75% of reported cases are true positives • Running Time • Solution to Hamming Distance Problem: O(log(p)) • Batch Query + Compression: • 32 GB file & 200 tasks -> runs in under 100 seconds

Related Work • Clustering related documents • Detect near-duplicates to show related pages • Data extraction • Determine schema of similar pages to obtain information • Plagiarism • Detect pages that have borrowed from each other • Spam • Detect spam before user receives it

Tying it Back to Lecture • Similarities • Indicated importance of de-duplication to save crawler resources • Brief summary of several uses for near-duplicate detection • Differences • Lecture focus: • Breadth-first look at algorithms for near-duplicate detection • Paper focus: • In-depth look at the simhash and Hamming Distance algorithms • Includes how to implement them and how effective they are

Paper Evaluation: Pros • Thorough step-by-step explanation of the algorithm implementation • Thorough explanation of how the conclusions were reached • Included a brief description of how to improve the simhash + Hamming Distance algorithm • Categorize web pages before running simhash, create an algorithm to remove ads or timestamps, etc.

Paper Evaluation: Cons • No comparison • How much more effective or faster is it than other algorithms? • By how much did it improve the crawler? • Limited batch queries to a specific technology • Implementation required use of GFS • An approach not restricted to a particular technology might be more widely applicable

Any Questions?
