This paper explores the prevalence of page replication on the web, finding that over 48% of pages have copies. It defines various forms of replication, such as simple copying, aliases, and symbolic links. The challenges posed by subgraph isomorphism and slight differences in content are also discussed. The authors propose an efficient algorithm for detecting similar web collections based on a computable similarity measure using fingerprinting techniques. Applications include improving web crawling efficiency and reducing bandwidth consumption.
Finding Replicated Web Collections Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina
Statistics (Preview) More than 48% of pages have copies!
Reasons for replication • Actual replication: simple copying or mirroring • Apparent replication: aliases (multiple site names), symbolic links, multiple mount points
Challenges • Subgraph isomorphism is NP-complete • Hundreds of millions of pages • Slight differences between copies
Outline • Definitions • Web graph, collection • Identical collection • Similar collection • Algorithm • Applications • Results
Web graph • Node: web page • Edge: link between pages • Node label: page content (excluding links)
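A minimal sketch (not from the slides) of this web-graph model; the Page type and field names below are my own illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str                                           # node identity
    content: str                                       # node label: page text, links excluded
    out_links: set[str] = field(default_factory=set)   # edges: URLs this page links to

# A web graph is then simply a mapping from URL to Page.
web_graph: dict[str, Page] = {}
```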
Identical web collection • Collection: induced subgraph of the web graph • Identical collection: a one-to-one mapping between equi-sized collections that preserves page content and links
Collection similarity • Coincides with intuitively similar collections • Computable similarity measure
Collection similarity • Page content
Page content similarity • Fingerprint-based approach (chunking) • Shingles [Broder et al., 1997] • Sentence [Brin et al., 1995] • Word [Shivakumar et al., 1995] • Many interesting issues • Threshold value • Iceberg query
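A rough sketch of shingle-style chunk fingerprinting in the spirit of Broder et al.; the window size, hash choice, and Jaccard comparison below are illustrative assumptions, not the paper's exact parameters:

```python
import hashlib

def shingle_fingerprints(text: str, k: int = 5) -> set[int]:
    """Hash every k-word window (shingle) of the page text."""
    words = text.split()
    return {
        int(hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()[:16], 16)
        for i in range(max(len(words) - k + 1, 0))
    }

def page_similarity(a: str, b: str, k: int = 5) -> float:
    """Jaccard overlap of the two pages' shingle sets."""
    fa, fb = shingle_fingerprints(a, k), shingle_fingerprints(b, k)
    return len(fa & fb) / len(fa | fb) if (fa or fb) else 1.0
```

Two pages would then be declared similar when page_similarity exceeds a chosen threshold value.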
Collection similarity • Link structure • Size • Size vs. Cardinality
[Figure: two collections Ra and Rb with cross links between their pages]
Essential property • Ls: # of pages in Ra that link into Rb • Ld: # of pages in Rb that are linked to from Ra • For an exact copy: |Ra| = Ls = Ld = |Rb|
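A sketch (not the authors' code) of how this property could be checked for two candidate collections, given their page sets and the set of links between pages:

```python
def is_exact_copy(ra_pages: set[str], rb_pages: set[str],
                  links: set[tuple[str, str]]) -> bool:
    """True iff |Ra| = Ls = Ld = |Rb| for the cross links from Ra into Rb."""
    cross = {(s, d) for (s, d) in links if s in ra_pages and d in rb_pages}
    ls = len({s for s, _ in cross})   # pages in Ra linking into Rb
    ld = len({d for _, d in cross})   # pages in Rb linked to from Ra
    return len(ra_pages) == ls == ld == len(rb_pages)
```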
Algorithm • Based on the property identified above • Input: set of pages collected from the web • Output: set of similar collections • Complexity: O(n log n)
Algorithm • Step 1: Similar page identification (iceberg query)
web pages → Step 1 → (Rid, Pid) table:
Rid   Pid
1     10375
1     38950
1     14545
2     1026
2     18633
• 25 million pages • Fingerprint computation: 44 hours • Replicated page computation: 10 hours
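A simplified sketch of Step 1: pages sharing at least a threshold number of chunk fingerprints are reported as similar pairs, which can then be grouped into the (Rid, Pid) table above. The threshold and pairing rule are my own simplification of the iceberg query:

```python
from collections import defaultdict
from itertools import combinations

def similar_page_pairs(page_fps: dict[int, set[int]], threshold: int = 3):
    """Return page-id pairs that share >= threshold fingerprints."""
    by_fp = defaultdict(set)                 # fingerprint -> pages containing it
    for pid, fps in page_fps.items():
        for fp in fps:
            by_fp[fp].add(pid)
    shared = defaultdict(int)                # (page, page) -> shared-fingerprint count
    for pages in by_fp.values():
        for a, b in combinations(sorted(pages), 2):
            shared[(a, b)] += 1
    return [pair for pair, count in shared.items() if count >= threshold]
```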
Algorithm • Step 2: Link structure check
R1 (the Step 1 Rid/Pid table) is joined with R2, a copy of R1, through the Link table of page-to-page links:
Link: Pid → Pid
1 → 2
1 → 3
2 → 6
2 → 10
Group by (R1.Rid, R2.Rid): |Ra| = |R1|, Ls = Count(R1.Rid), Ld = Count(R2.Rid), |Rb| = |R2|
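A sketch of the Step 2 group-by, assuming the Step 1 output is available as a page-to-Rid mapping and the links as (source, destination) page-id pairs; the function and variable names are mine:

```python
from collections import defaultdict

def link_structure_rows(rid_of: dict[int, int], links: list[tuple[int, int]]):
    """For every (Rid1, Rid2) pair, count distinct linking and linked pages (Ls, Ld)."""
    src_pages = defaultdict(set)   # (Rid1, Rid2) -> distinct source pages in R1
    dst_pages = defaultdict(set)   # (Rid1, Rid2) -> distinct destination pages in R2
    for src, dst in links:
        if src in rid_of and dst in rid_of:
            key = (rid_of[src], rid_of[dst])
            src_pages[key].add(src)
            dst_pages[key].add(dst)
    return [(r1, r2, len(src_pages[(r1, r2)]), len(dst_pages[(r1, r2)]))
            for (r1, r2) in src_pages]
```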
Algorithm • Step 3:
S = {}
For every (|Ra|, Ls, Ld, |Rb|) from Step 2:
  If |Ra| = Ls = Ld = |Rb|:
    S = S ∪ {<Ra, Rb>}
Union-Find(S)
• Steps 2-3: 10 hours
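A sketch of Step 3 under the same assumptions: pairs satisfying |Ra| = Ls = Ld = |Rb| are merged with union-find into larger replicated collections. The group_size input (pages per Rid) would come from Step 1:

```python
from collections import defaultdict

def find(parent: dict, x):
    """Find the root of x with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def merge_collections(rows, group_size: dict[int, int]):
    """rows: (Rid1, Rid2, Ls, Ld) tuples from Step 2."""
    parent = {rid: rid for rid in group_size}
    for r1, r2, ls, ld in rows:
        if group_size[r1] == ls == ld == group_size[r2]:
            parent[find(parent, r1)] = find(parent, r2)   # union the two groups
    clusters = defaultdict(list)
    for rid in group_size:
        clusters[find(parent, rid)].append(rid)
    return list(clusters.values())   # each cluster is one similar collection
```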
Experiment • 25 widely replicated collections (cardinality: 5-10 copies, size: 50-1000 pages), about 35,000 pages in total, plus 15,000 random pages • Result: 180 collections detected • 149 "good" collections • 31 "problem" collections
Applications • Web crawling & archiving • Save network bandwidth • Save disk storage
Application (web crawling) • Replicas among crawled pages: 48% before, 13% with our technique • Pipeline: initial crawl → offline copy detection → replication info → second crawl
Related work • Collection similarity • Altavista [Bharat et al., 1999] • Page similarity • COPS [Brin et al., 1995]: sentence • SCAM [Shivakumar et al., 1995]: word • Altavista [Broder et al., 1997]: shingle
Summary • Computable similarity measure • Efficient replication-detection algorithm • Application to real-world problems