
Efficient Detection of Web Page Replication: Algorithms and Applications

This paper explores the prevalence of page replication on the web, finding that over 48% of pages have copies. It defines various forms of replication, such as simple copying, aliases, and symbolic links. The challenges posed by subgraph isomorphism and slight differences in content are also discussed. The authors propose an efficient algorithm for detecting similar web collections based on a computable similarity measure using fingerprinting techniques. Applications include improving web crawling efficiency and reducing bandwidth consumption.


Presentation Transcript


  1. Finding Replicated Web Collections Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina

  2. Replication is common!

  3. Statistics (Preview) More than 48% of pages have copies!

  4. Reasons for replication • Actual replication: simple copying or mirroring • Apparent replication: aliases (multiple site names), symbolic links, multiple mount points

  5. Challenges • Subgraph isomorphism: NP-complete • Hundreds of millions of pages • Slight differences between copies

  6. Outline • Definitions • Web graph, collection • Identical collection • Similar collection • Algorithm • Applications • Results

  7. Web graph • Node: web page • Edge: link between pages • Node label: page content (excluding links)
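A minimal sketch of this model (the names WebGraph, add_page, and add_link are illustrative, not from the talk): the graph is held as content labels plus adjacency sets.

```python
from collections import defaultdict

# Minimal web-graph model from the slide: nodes are pages, edges are
# links, and each node is labeled with its content minus the links.
class WebGraph:
    def __init__(self):
        self.content = {}              # node label: page id -> text without links
        self.links = defaultdict(set)  # edges: page id -> set of linked page ids

    def add_page(self, pid, text_without_links):
        self.content[pid] = text_without_links

    def add_link(self, src, dst):
        self.links[src].add(dst)
```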

  8. Identical web collection • Collection: induced subgraph of the web graph • Identical collections: a one-to-one mapping between two equi-size collections that preserves page content and links
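A sketch of this definition, reusing the WebGraph above (is_identical and the mapping f are illustrative names): given a proposed one-to-one mapping f from collection A to an equi-size collection B, check that mapped pages carry identical content and that every link inside A maps to a link inside B.

```python
def is_identical(graph, coll_a, coll_b, f):
    """Check the identical-collection definition under a mapping f: A -> B."""
    set_a, set_b = set(coll_a), set(coll_b)
    if len(set_a) != len(set_b):
        return False                                  # must be equi-size
    for page in set_a:
        if graph.content[page] != graph.content[f[page]]:
            return False                              # node labels must match
        internal = graph.links[page] & set_a          # induced-subgraph edges
        mapped = {f[q] for q in internal}
        if mapped != graph.links[f[page]] & set_b:
            return False                              # links must correspond
    return True
```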

  9. Collection similarity • Coincides with intuitively similar collections • Computable similarity measure

  10. Collection similarity • Page content 

  11. Page content similarity • Fingerprint-based approach (chunking) • Shingle [Broder et al., 1997] • Sentence [Brin et al., 1995] • Word [Shivakumar et al., 1995] • Many interesting issues • Threshold value • Iceberg query
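A hedged sketch of the chunking idea (the window size w and the mod-m sampling rule are illustrative parameter choices, not the talk's exact settings): hash every w-word shingle, keep a deterministic sample as the page's fingerprints, and compare pages by fingerprint overlap.

```python
import hashlib

def shingle_fingerprints(text, w=8, m=4):
    """Hash each w-word window; keep hashes that are 0 mod m as the sample."""
    words = text.split()
    prints = set()
    for i in range(len(words) - w + 1):
        chunk = " ".join(words[i:i + w]).encode("utf-8")
        h = int(hashlib.md5(chunk).hexdigest(), 16)
        if h % m == 0:                 # deterministic subsample of shingles
            prints.add(h)
    return prints

def page_similarity(fp_a, fp_b):
    """Resemblance estimate: Jaccard overlap of the sampled fingerprints."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```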

  12. Collection similarity • Link structure 

  13. Collection similarity • Size

  14. Collection similarity • Size vs. Cardinality

  15. Growth strategy

  16. Essential property • |Ra| = Ls = Ld = |Rb| • Ls: # of pages linked from • Ld: # of pages linked to (figure: each copy a1, a2, a3 in cluster Ra links to a distinct copy b1, b2, b3 in cluster Rb)

  17. Essential property • In general |Ra| ≥ Ls = Ld ≤ |Rb| • Ls: # of pages linked from • Ld: # of pages linked to (figure: clusters Ra and Rb where some copies are not linked, so the equalities fail)
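A sketch of the property test (assuming adjacency sets like those in the WebGraph sketch earlier; the function name is illustrative): given two clusters of page copies, count the distinct link sources Ls and distinct link targets Ld among the links running from Ra into Rb, then test the equality chain.

```python
def essential_property_holds(links, ra, rb):
    """True iff |Ra| = Ls = Ld = |Rb| for the links from cluster ra into rb."""
    rb_set = set(rb)
    cross = [(s, d) for s in ra for d in links.get(s, ()) if d in rb_set]
    ls = len({s for s, _ in cross})    # Ls: distinct pages in Ra linking out
    ld = len({d for _, d in cross})    # Ld: distinct pages in Rb linked to
    return len(set(ra)) == ls == ld == len(rb_set)
```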

  18. Algorithm • Based on the property we identified • Input: set of pages collected from web • Output: set of similar collections • Complexity: O(n log n)

  19. Algorithm • Step 1: similar page identification (iceberg query) • Output: a table assigning each replica id (Rid) its page ids (Pid), e.g.:
Rid  Pid
1    10375
1    38950
1    14545
2    1026
2    18633
• On 25 million web pages: fingerprint computation 44 hours, replicated page computation 10 hours
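A much-simplified sketch of Step 1 (thresholds and names are illustrative; the per-page fingerprints could come from a chunking scheme like the shingle sketch earlier): invert the page-to-fingerprint map, count shared fingerprints per page pair, and keep only pairs above a threshold, which is the iceberg-query pattern of aggregating a huge space and retaining the few heavy groups.

```python
from collections import defaultdict

def similar_page_pairs(fingerprints_by_page, min_shared=10):
    """fingerprints_by_page: pid -> set of fingerprint hashes."""
    pages_by_print = defaultdict(set)
    for pid, prints in fingerprints_by_page.items():
        for fp in prints:
            pages_by_print[fp].add(pid)
    shared = defaultdict(int)          # (pid_a, pid_b) -> # common fingerprints
    for pids in pages_by_print.values():
        ordered = sorted(pids)         # only count pairs sharing this print
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                shared[(ordered[i], ordered[j])] += 1
    # Iceberg condition: keep only the pairs above the threshold.
    return [pair for pair, n in shared.items() if n >= min_shared]
```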

  20. Algorithm • Step 2: link structure check • Join the Step 1 table (Rid, Pid) twice, as R1 and as R2 (copy of R1), against a Link table of (source Pid, destination Pid) pairs, e.g.:
Pid  Pid
1    2
1    3
2    6
2    10
• Group by (R1.Rid, R2.Rid) and compute Ra = |R1|, Ls = Count(R1.Rid), Ld = Count(R2.Rid), Rb = |R2|
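A sketch of this group-by step, a Python stand-in for what the slide expresses relationally (function and variable names are illustrative): join each link against the cluster assignment and, per cluster pair, collect distinct sources and targets.

```python
from collections import defaultdict

def link_structure_counts(links, rid_of):
    """links: (src_pid, dst_pid) pairs; rid_of: pid -> cluster id (Rid)."""
    sources = defaultdict(set)
    targets = defaultdict(set)
    for src, dst in links:
        if src in rid_of and dst in rid_of:
            key = (rid_of[src], rid_of[dst])   # group by (R1.Rid, R2.Rid)
            sources[key].add(src)              # distinct sources feed Ls
            targets[key].add(dst)              # distinct targets feed Ld
    return {k: (len(sources[k]), len(targets[k])) for k in sources}
```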

  21. Algorithm • Step 3:
S = {}
for every (|Ra|, Ls, Ld, |Rb|) from Step 2:
    if |Ra| = Ls = Ld = |Rb|:
        S = S U {<Ra, Rb>}
Union-Find(S)
• Steps 2-3: 10 hours
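A sketch of Step 3 with union-find (names are illustrative; pair_counts is the output shape of the Step 2 sketch above): cluster pairs that satisfy the equality test are unioned, so chains of merged clusters end up as one replicated collection.

```python
def find(parent, x):
    """Return the representative of x's set, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]          # path compression
        x = parent[x]
    return x

def merge_collections(cluster_sizes, pair_counts):
    """cluster_sizes: Rid -> |R|; pair_counts: (Ra, Rb) -> (Ls, Ld)."""
    parent = {rid: rid for rid in cluster_sizes}
    for (ra, rb), (ls, ld) in pair_counts.items():
        if cluster_sizes[ra] == ls == ld == cluster_sizes[rb]:
            parent[find(parent, ra)] = find(parent, rb)   # union the pair
    return parent   # find(parent, rid) yields each cluster's collection id
```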

  22. Experiment • 25 widely replicated collections (cardinality: 5-10 copies; size: 50-1000 pages) => total: 35,000 pages, plus 15,000 random pages • Result: 180 collections • 149 “good” collections • 31 “problem” collections

  23. Results

  24. Applications • Web crawling & archiving • Save network bandwidth • Save disk storage

  25. Application (web crawling) • Before (initial crawl): 48% of pages were replicas • With our technique (second crawl): 13% (figure: initial crawl → crawled pages → offline copy detection → replication info → second crawl)

  26. Applications (web search)

  27. Related work • Collection similarity • Altavista [Bharat et al., 1999] • Page similarity • COPS [Brin et al., 1995]: sentence • SCAM [Shivakumar et al., 1995]: word • Altavista [Broder et al., 1997]: shingle

  28. Summary • Computable similarity measure • Efficient replication-detection algorithm • Application to real-world problems
