
Temporal Shingling for Version Identification in Web Archives

Ralf Schenkel. Temporal Shingling for Version Identification in Web Archives.


Presentation Transcript


  1. Ralf Schenkel: Temporal Shingling for Version Identification in Web Archives

  2. Web Archiving
     Diagram components: Web, Crawler, Scheduler, Archive, Version Consolidation
     • Limited recrawl frequency (politeness, crawler load)
     • Limited number of snapshots per page
     • Version consolidation not a primary issue
     ECIR 2010, Milton Keynes, UK

  3. Human-Assisted Harvesting: EverLast [Anand et al., JCDL09]
     Diagram components: Web, several Clients (Browser, Proxy), Archive, Version Consolidation
     • Main difficulties:
       • Huge number of snapshots for popular pages
       • Identify important snapshots to reduce archive size

  4. Main Topic: Extract Timeline of a Page
     • Input: series of timestamped snapshots S(p)
     • Extract timeline T(p): times where an "important" change occurred, i.e., where the content at time t is significantly different from the existing content

  5. Main Topic: Extract Timeline of a Page
     • Input: series of timestamped snapshots S(p)
     • Extract timeline T(p): times where an "important" change occurred, i.e., where the content at time t is significantly different from the existing content
     Figure (timeline of p): two rows of timestamps, 09:02, 09:06, 09:08, 09:14, 09:18 and 09:00, 09:04, 09:10, 09:12, 09:16
     Are all updates equally important?

  6-8. Example 1: Advertisements

  9. Example 2: User Comments
     General approach: exploit that transient changes (with a short lifetime on the page) are usually not important

  10. Towards an Algorithm
      • Quantify the similarity s(p_i, p_j) ∈ [0, 1] of two snapshots of page p at times t_i and t_j
      • Divergence of two snapshots of page p: [formula not preserved in the transcript]
      • Significant change in content if divergence > α
      • Choice of threshold α depends on the application
        • Archives may want to keep even small corrections
        • Sometimes not interested in transient changes (ads)

  11. Algorithm for Timeline Extraction
      Figure: snapshots S(p) every two minutes from 09:00 to 09:18; comparisons labelled "divergence > α" and "divergence ≤ α"; resulting timeline T(p): 09:00, 09:02, 09:06, 09:08, 09:10, 09:16
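
The figure above suggests a single pass over the snapshots with a threshold test. A minimal Python sketch of such a pass follows. It rests on two assumptions that the transcript does not confirm: divergence is taken as 1 minus similarity, and each snapshot is compared against the most recent snapshot already kept in the timeline. The names `extract_timeline` and `word_jaccard` are hypothetical, and `word_jaccard` is only a stand-in for a real similarity measure.

```python
from typing import Callable, List, Sequence, Tuple

Snapshot = Tuple[int, str]  # (timestamp in seconds, page text)


def word_jaccard(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]: Jaccard overlap of the word sets."""
    wa, wb = set(a.split()), set(b.split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def extract_timeline(
    snapshots: Sequence[Snapshot],
    similarity: Callable[[str, str], float],
    alpha: float,
) -> List[Snapshot]:
    """Keep a snapshot when it diverges by more than alpha from the last kept one.

    Assumption: divergence = 1 - similarity, compared against the last
    snapshot in the timeline (not against the immediately preceding one).
    """
    timeline: List[Snapshot] = []
    for snap in snapshots:
        if not timeline:
            timeline.append(snap)  # the first snapshot always starts the timeline
            continue
        divergence = 1.0 - similarity(timeline[-1][1], snap[1])
        if divergence > alpha:
            timeline.append(snap)
    return timeline


demo = [(0, "a b c d"), (120, "a b c d"), (240, "a b c e"), (360, "x y z w")]
print([t for t, _ in extract_timeline(demo, word_jaccard, alpha=0.5)])
```

A larger α keeps fewer snapshots, which matches the remark on the previous slide that the threshold depends on the application.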

  12. Measuring Page Similarity: Shingling [Broder et al., Computer Networks 29, 1997]
      Shingles SH(p) of a page p: set of word-level k-grams
      Example text: "I was in a hotel recently. In the lobby was one of those signs that consists of a black board with lots of holes in it and a set of plastic letters."
      3-shingles: <I,was,in> <was,in,a> <in,a,hotel> <a,hotel,recently> …
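
As an illustration, word-level k-shingles and the Jaccard overlap of two shingle sets can be sketched in a few lines of Python. The tokenisation (lowercasing, splitting on non-word characters) is an assumption of this sketch, and the function names are hypothetical.

```python
import re
from typing import Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of word-level k-grams of the text.

    Assumption: words are lowercased and punctuation is dropped.
    """
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Similarity in [0, 1]: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


first = shingles("I was in a hotel recently.", k=3)
second = shingles("I was in a hotel yesterday.", k=3)
print(sorted(first))
print(jaccard(first, second))
```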

  13. Measuring Page Similarity: SpotSigs [Theobald et al., SIGIR 2008]
      Consider special kinds of shingles: c-grams (w/o stopwords) in distance d to a stopword antecedent
      Example text: "I was in a hotel recently. In the lobby was one of those signs that consists of a black board with lots of holes in it and a set of plastic letters."
      Example: c=2, d=1 (antecedents highlighted in red on the slide)
      <was,hotel,recently> <a,hotel,recently> <the,lobby,one> <was,one,signs> …
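
The example signatures can be reproduced with a small Python sketch. The stopword list and antecedent set below are hypothetical and chosen only to fit the example sentence; the reading of d as "take every d-th non-stopword after the antecedent" is likewise an assumption of this sketch rather than the definition from the SpotSigs paper.

```python
import re
from typing import FrozenSet, Set, Tuple

# Hypothetical lists, just large enough for the example sentence.
STOPWORDS: FrozenSet[str] = frozenset(
    {"i", "was", "in", "a", "the", "of", "those", "that", "it", "and", "with"}
)
ANTECEDENTS: FrozenSet[str] = frozenset({"a", "the", "was"})


def spot_signatures(
    text: str,
    c: int = 2,
    d: int = 1,
    antecedents: FrozenSet[str] = ANTECEDENTS,
    stopwords: FrozenSet[str] = STOPWORDS,
) -> Set[Tuple[str, ...]]:
    """For every antecedent occurrence, chain c non-stopwords at distance d."""
    words = re.findall(r"\w+", text.lower())
    signatures: Set[Tuple[str, ...]] = set()
    for i, word in enumerate(words):
        if word not in antecedents:
            continue
        # Non-stopwords that follow this antecedent occurrence.
        content = [w for w in words[i + 1:] if w not in stopwords]
        chain = content[d - 1::d][:c]  # every d-th content word, c of them
        if len(chain) == c:
            signatures.add((word, *chain))
    return signatures


TEXT = (
    "I was in a hotel recently. In the lobby was one of those signs that "
    "consists of a black board with lots of holes in it and a set of plastic letters."
)
print(sorted(spot_signatures(TEXT, c=2, d=1)))
```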

  14. Towards Temporal Shingles
      Importance of content is related to the time it stays on the page
      • More weight for long-lasting shingles
      • Less weight for transient shingles

  15. Temporal Shingling
      Lifetime of a shingle: time between first occurrence and disappearance
      Figure: snapshots every two minutes from 09:00 to 09:18, with one example shingle of unlimited lifetime and one with a lifetime of 8 min
      Lifetime-aware set of shingles: [formula not preserved in the transcript]
      Lifetime-aware snapshot similarity: [formula not preserved in the transcript]
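
Since the two formulas did not survive in the transcript, the following Python sketch shows only one plausible reading, and every detail of it is an assumption: a shingle's lifetime runs from its first occurrence to the first later snapshot that lacks it (unlimited if it never disappears), the lifetime-aware shingle set keeps shingles whose lifetime is at least θ, and similarity is the Jaccard overlap of those filtered sets. The slides also speak of weighting shingles, which this hard filter does not model. All function names are hypothetical.

```python
import math
from typing import Dict, Hashable, Sequence, Set, Tuple

Series = Sequence[Tuple[int, Set[Hashable]]]  # (timestamp in seconds, shingle set)


def shingle_lifetimes(series: Series) -> Dict[Hashable, float]:
    """Lifetime per shingle: first occurrence until first disappearance.

    Assumption: a shingle that never disappears has unlimited lifetime,
    and a shingle that reappears keeps the lifetime of its first run.
    """
    first_seen: Dict[Hashable, int] = {}
    lifetime: Dict[Hashable, float] = {}
    for timestamp, current in series:
        for shingle in current:
            first_seen.setdefault(shingle, timestamp)
        for shingle, start in first_seen.items():
            if shingle not in current and shingle not in lifetime:
                lifetime[shingle] = timestamp - start
    for shingle in first_seen:
        lifetime.setdefault(shingle, math.inf)
    return lifetime


def temporal_similarity(
    a: Set[Hashable], b: Set[Hashable], lifetime: Dict[Hashable, float], theta: float
) -> float:
    """Jaccard overlap restricted to shingles with lifetime >= theta."""
    long_a = {s for s in a if lifetime[s] >= theta}
    long_b = {s for s in b if lifetime[s] >= theta}
    if not long_a and not long_b:
        return 1.0
    return len(long_a & long_b) / len(long_a | long_b)


series = [
    (0, {"news", "ad1"}),
    (120, {"news", "ad2"}),
    (240, {"news", "ad2"}),
    (360, {"news", "story"}),
]
life = shingle_lifetimes(series)
print(temporal_similarity(series[0][1], series[1][1], life, theta=1200))
```

In this toy series the two advertisement shingles are short-lived, so swapping one for the other no longer lowers the similarity of the first two snapshots.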

  16. Experimental Evaluation
      • No standard benchmark available
      • Collected snapshots of the index page of 30 news portals every two minutes for a week
        • 5000 snapshots of each page
        • Pages are updated at high frequency and contain transient noise (advertisements, comments, …)
      • Ground truth GT(p) of important changes of page p is difficult to get (much more difficult than deciding if two pages are near-duplicates!)

  17. Automated Ground Truth from RSS Feed
      • For each news site, read the corresponding RSS feed every two minutes (it contains links to news articles)
      • Normalize links, remove links outside the site
      • Drop any links not available in the snapshots
      • Initial assumption: the timestamp of a new article in the RSS feed corresponds to an important change on the site (often wrong, and not all feeds have timestamps)
      • Use approximation: a new version starts whenever
        • a link from the RSS feed appears on the page for the first time
        • a link from the RSS feed disappears from the page
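
The approximation in the last bullet can be sketched in Python as follows. The sketch assumes that links have already been normalized and filtered as described, that each snapshot is given as a set of links, and that the first snapshot always starts a version; the function name and input layout are hypothetical.

```python
from typing import List, Optional, Sequence, Set, Tuple


def ground_truth_versions(
    page_links: Sequence[Tuple[int, Set[str]]],  # (timestamp, links on the page)
    feed_links: Set[str],                        # links collected from the RSS feed
) -> List[int]:
    """Return the timestamps at which a new version starts.

    A new version starts when a feed link appears on the page for the
    first time or when a feed link disappears from the page.
    Assumption: the first snapshot always starts a version.
    """
    versions: List[int] = []
    seen: Set[str] = set()
    previous: Optional[Set[str]] = None
    for timestamp, links in page_links:
        current = links & feed_links  # ignore links that are not in the feed
        if previous is None:
            versions.append(timestamp)
        else:
            appeared_first_time = current - seen
            disappeared = previous - current
            if appeared_first_time or disappeared:
                versions.append(timestamp)
        seen |= current
        previous = current
    return versions


feed = {"/a", "/b", "/c"}
pages = [
    (0, {"/a", "/x"}),
    (120, {"/a", "/y"}),   # only a non-feed link changed
    (240, {"/a", "/b"}),   # /b appears for the first time
    (360, {"/b"}),         # /a disappears
    (480, {"/a", "/b"}),   # /a reappears, but not for the first time
]
print(ground_truth_versions(pages, feed))
```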

  18. Number of versions per page
      Average 234.8 versions per page

  19. Lifetime of versions
      Average lifetime 2500 seconds, median much lower

  20. Lifetime of versions on heise.de

  21. Quality Metrics for Timeline T(p) [variant of the freshness measure from Cho et al., TODS 28(4), 2003]
      • Binary measure for correctness of the version of p in the timeline at time t
      • Coverage for a page p
      • Coverage for a set P of pages

  22. Space Overhead for Timeline T(p)
      • Measures the size ratio of the timeline compared to the ground truth

  23. Example: Coverage and Overhead
      Timeline T(p): 09:00, 09:02, 09:06, 09:08, 09:10, 09:16
      Ground truth GT(p): 09:00, 09:04, 09:10, 09:12, 09:16
      Snapshots S(p): every two minutes from 09:00 to 09:18
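
The metric formulas are not preserved in the transcript, so the Python sketch below is only one possible reading, with the assumptions stated plainly: the timeline counts as correct at time t if its most recent entry is no older than the most recent ground-truth change, coverage is the fraction of snapshot times at which it is correct (sampled at snapshot times rather than integrated over time), and overhead is the number of timeline entries divided by the number of ground-truth versions. The function names are hypothetical; the data is taken from the example slide, in minutes after 09:00.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def latest_at_or_before(times: Sequence[int], t: int) -> Optional[int]:
    """Most recent entry of a sorted list that is not later than t."""
    i = bisect_right(times, t)
    return times[i - 1] if i else None


def coverage(
    snapshot_times: Sequence[int], timeline: Sequence[int], ground_truth: Sequence[int]
) -> float:
    """Fraction of snapshot times at which the timeline holds the current version.

    Assumption: correct at t means the last timeline entry is at or after
    the last ground-truth change.
    """
    correct = 0
    for t in snapshot_times:
        kept = latest_at_or_before(timeline, t)
        true_change = latest_at_or_before(ground_truth, t)
        if true_change is None or (kept is not None and kept >= true_change):
            correct += 1
    return correct / len(snapshot_times)


def overhead(timeline: Sequence[int], ground_truth: Sequence[int]) -> float:
    """Assumption: size ratio measured as number of entries."""
    return len(timeline) / len(ground_truth)


S = list(range(0, 20, 2))      # 09:00 ... 09:18
T = [0, 2, 6, 8, 10, 16]       # timeline from the example
GT = [0, 4, 10, 12, 16]        # ground truth from the example
print(coverage(S, T, GT), overhead(T, GT))
```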

  24. Standard Parameters
      • Shingling: k=4
      • SpotSigs:
        • List of antecedents from the original paper (for English)
        • Parameters from the original paper: c=3, d=2
      • Lifetime parameter for temporal methods: θ=1200s
      • Evaluated 50 divergence thresholds α from 0 to 1

  25. Results with Standard Parameters

  26. Varying Lifetime Thresholds: Shingling

  27. Varying Lifetime Thresholds: SpotSigs

  28. Partial Shingling (θ=1200)
      Reduce storage requirements and runtime of the algorithm by considering random subsets of shingles
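
The slide does not say how the random subsets are drawn. One common way to do it, shown here purely as an assumption, is hash-based sampling: a shingle is kept if a hash of its text falls below a fixed fraction of the hash range, so the same shingle is kept or dropped consistently in every snapshot. The function name and the `rate` parameter are hypothetical.

```python
import hashlib
from typing import Set, Tuple

Shingle = Tuple[str, ...]


def sample_shingles(shingle_set: Set[Shingle], rate: float = 0.25) -> Set[Shingle]:
    """Keep roughly a `rate` fraction of shingles, chosen by a stable hash.

    Assumption: hash-based sampling, so that the decision for a shingle
    is identical across all snapshots of a page.
    """
    threshold = int(rate * 2**64)
    kept: Set[Shingle] = set()
    for shingle in shingle_set:
        digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
        # Interpret the first 8 bytes of the digest as an unsigned 64-bit number.
        if int.from_bytes(digest[:8], "big") < threshold:
            kept.add(shingle)
    return kept


a = {("i", "was", "in"), ("was", "in", "a"), ("in", "a", "hotel")}
b = {("in", "a", "hotel"), ("a", "hotel", "recently")}
print(len(sample_shingles(a | b, rate=0.5)))
```

Because the decision depends only on the shingle itself, similarities computed on the sampled sets compare like with like, at the cost of a noisier estimate for small pages.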

  29. Summary
      • Introduced the problem of timeline extraction for a Web page
      • Introduced temporal variants of Shingling and SpotSigs to reduce the influence of transient content
      • Temporal shingling yields archive coverage similar to standard shingling with only 50% of the archive size
      Thank you!
