This paper explores the prevalence of page replication on the web, finding that over 48% of pages have copies. It defines various forms of replication, such as simple copying, aliases, and symbolic links. The challenges posed by subgraph isomorphism and slight differences in content are also discussed. The authors propose an efficient algorithm for detecting similar web collections based on a computable similarity measure using fingerprinting techniques. Applications include improving web crawling efficiency and reducing bandwidth consumption.
Finding Replicated Web Collections Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina
Statistics (Preview) More than 48% of pages have copies!
Reasons for replication • Actual replication: simple copying or mirroring • Apparent replication: aliases (multiple site names), symbolic links, multiple mount points
Challenges • Subgraph isomorphism is NP-complete • Hundreds of millions of pages • Slight differences between copies
Outline • Definitions • Web graph, collection • Identical collection • Similar collection • Algorithm • Applications • Results
Web graph • Node: web page • Edge: link between pages • Node label: page content (excluding links)
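A minimal sketch (not from the slides) of this web-graph model; the Page type and field names below are my own illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str                                           # node identity
    content: str                                       # node label: page text, links excluded
    out_links: set[str] = field(default_factory=set)   # edges: URLs this page links to

# A web graph is then simply a mapping from URL to Page.
web_graph: dict[str, Page] = {}
```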
Identical web collection • Collection: induced subgraph of the web graph • Identical collection: a one-to-one mapping between equi-sized collections that preserves page content and links
Collection similarity • Coincides with intuitively similar collections • Computable similarity measure
Collection similarity • Page content
Page content similarity • Fingerprint-based approach (chunking) • Shingles [Broder et al., 1997] • Sentence [Brin et al., 1995] • Word [Shivakumar et al., 1995] • Many interesting issues • Threshold value • Iceberg query
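A rough sketch of shingle-style chunk fingerprinting in the spirit of Broder et al.; the window size, hash choice, and Jaccard comparison below are illustrative assumptions, not the paper's exact parameters:

```python
import hashlib

def shingle_fingerprints(text: str, k: int = 5) -> set[int]:
    """Hash every k-word window (shingle) of the page text."""
    words = text.split()
    return {
        int(hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()[:16], 16)
        for i in range(max(len(words) - k + 1, 0))
    }

def page_similarity(a: str, b: str, k: int = 5) -> float:
    """Jaccard overlap of the two pages' shingle sets."""
    fa, fb = shingle_fingerprints(a, k), shingle_fingerprints(b, k)
    return len(fa & fb) / len(fa | fb) if (fa or fb) else 1.0
```

Two pages would then be declared similar when page_similarity exceeds a chosen threshold value.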
Collection similarity • Link structure • Size • Size vs. Cardinality
[Figure: two collections Ra and Rb with cross links between their pages]
Essential property • Ls: # of pages in Ra that link into Rb • Ld: # of pages in Rb that are linked to from Ra • For an exact copy: |Ra| = Ls = Ld = |Rb|
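A sketch (not the authors' code) of how this property could be checked for two candidate collections, given their page sets and the set of links between pages:

```python
def is_exact_copy(ra_pages: set[str], rb_pages: set[str],
                  links: set[tuple[str, str]]) -> bool:
    """True iff |Ra| = Ls = Ld = |Rb| for the cross links from Ra into Rb."""
    cross = {(s, d) for (s, d) in links if s in ra_pages and d in rb_pages}
    ls = len({s for s, _ in cross})   # pages in Ra linking into Rb
    ld = len({d for _, d in cross})   # pages in Rb linked to from Ra
    return len(ra_pages) == ls == ld == len(rb_pages)
```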
Algorithm • Based on the property identified above • Input: set of pages collected from the web • Output: set of similar collections • Complexity: O(n log n)
Algorithm • Step 1: Similar page identification (iceberg query)
web pages → Step 1 → (Rid, Pid) table:
Rid   Pid
1     10375
1     38950
1     14545
2     1026
2     18633
• 25 million pages • Fingerprint computation: 44 hours • Replicated page computation: 10 hours
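A simplified sketch of Step 1: pages sharing at least a threshold number of chunk fingerprints are reported as similar pairs, which can then be grouped into the (Rid, Pid) table above. The threshold and pairing rule are my own simplification of the iceberg query:

```python
from collections import defaultdict
from itertools import combinations

def similar_page_pairs(page_fps: dict[int, set[int]], threshold: int = 3):
    """Return page-id pairs that share >= threshold fingerprints."""
    by_fp = defaultdict(set)                 # fingerprint -> pages containing it
    for pid, fps in page_fps.items():
        for fp in fps:
            by_fp[fp].add(pid)
    shared = defaultdict(int)                # (page, page) -> shared-fingerprint count
    for pages in by_fp.values():
        for a, b in combinations(sorted(pages), 2):
            shared[(a, b)] += 1
    return [pair for pair, count in shared.items() if count >= threshold]
```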
Algorithm • Step 2: Link structure check
R1 (the Step 1 Rid/Pid table) is joined with R2, a copy of R1, through the Link table of page-to-page links:
Link: Pid → Pid
1 → 2
1 → 3
2 → 6
2 → 10
Group by (R1.Rid, R2.Rid): |Ra| = |R1|, Ls = Count(R1.Rid), Ld = Count(R2.Rid), |Rb| = |R2|
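A sketch of the Step 2 group-by, assuming the Step 1 output is available as a page-to-Rid mapping and the links as (source, destination) page-id pairs; the function and variable names are mine:

```python
from collections import defaultdict

def link_structure_rows(rid_of: dict[int, int], links: list[tuple[int, int]]):
    """For every (Rid1, Rid2) pair, count distinct linking and linked pages (Ls, Ld)."""
    src_pages = defaultdict(set)   # (Rid1, Rid2) -> distinct source pages in R1
    dst_pages = defaultdict(set)   # (Rid1, Rid2) -> distinct destination pages in R2
    for src, dst in links:
        if src in rid_of and dst in rid_of:
            key = (rid_of[src], rid_of[dst])
            src_pages[key].add(src)
            dst_pages[key].add(dst)
    return [(r1, r2, len(src_pages[(r1, r2)]), len(dst_pages[(r1, r2)]))
            for (r1, r2) in src_pages]
```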
Algorithm • Step 3:
S = {}
For every (|Ra|, Ls, Ld, |Rb|) from Step 2:
  If |Ra| = Ls = Ld = |Rb|:
    S = S ∪ {<Ra, Rb>}
Union-Find(S)
• Steps 2-3: 10 hours
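A sketch of Step 3 under the same assumptions: pairs satisfying |Ra| = Ls = Ld = |Rb| are merged with union-find into larger replicated collections. The group_size input (pages per Rid) would come from Step 1:

```python
from collections import defaultdict

def find(parent: dict, x):
    """Find the root of x with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def merge_collections(rows, group_size: dict[int, int]):
    """rows: (Rid1, Rid2, Ls, Ld) tuples from Step 2."""
    parent = {rid: rid for rid in group_size}
    for r1, r2, ls, ld in rows:
        if group_size[r1] == ls == ld == group_size[r2]:
            parent[find(parent, r1)] = find(parent, r2)   # union the two groups
    clusters = defaultdict(list)
    for rid in group_size:
        clusters[find(parent, rid)].append(rid)
    return list(clusters.values())   # each cluster is one similar collection
```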
Experiment • 25 widely replicated collections (cardinality: 5-10 copies, size: 50-1000 pages), about 35,000 pages in total, plus 15,000 random pages • Result: 180 collections detected • 149 "good" collections • 31 "problem" collections
Applications • Web crawling & archiving • Save network bandwidth • Save disk storage
Application (web crawling) • Replicas among crawled pages: 48% before, 13% with our technique • Pipeline: initial crawl → offline copy detection → replication info → second crawl
Related work • Collection similarity • Altavista [Bharat et al., 1999] • Page similarity • COPS [Brin et al., 1995]: sentence • SCAM [Shivakumar et al., 1995]: word • Altavista [Broder et al., 1997]: shingle
Summary • Computable similarity measure • Efficient replication-detection algorithm • Application to real-world problems