Lazy Preservation: Reconstructing Websites from the Web Infrastructure

Presentation Transcript


  1. Lazy Preservation: Reconstructing Websites from the Web Infrastructure. Frank McCown. Advisor: Michael L. Nelson. Old Dominion University, Computer Science Department, Norfolk, Virginia, USA. Dissertation Defense, October 19, 2007

  2. Outline • Motivation • Lazy preservation and the Web Infrastructure • Web repositories • Responses to 10 research questions • Contributions and Future Work

  3. Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

  4. Preservation: Fortress Model 5 easy steps for preservation: • Get a lot of $ • Buy a lot of disks, machines, tapes, etc. • Hire an army of staff • Load a small amount of data • “Look upon my archive ye Mighty, and despair!” Slide from: http://www.cs.odu.edu/~mln/pubs/differently.ppt Image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

  5. …I was doing a little “maintenance” on one of my sites and accidentally deleted my entire database of about 30 articles. After I finished berating myself for being so stupid, I realized that my hosting company would have a backup, so I sent an email asking them to restore the database. Their reply stated that backups were “coming soon”…OUCH!

  6. Web Infrastructure

  7. Lazy Preservation • How much preservation can be had for free? (Little to no effort for web producer/publisher before website is lost) • High-coverage preservation of works of unknown importance • Built atop unreliable, distributed members which cannot be controlled • Usually limited to crawlable web

  8. Dissertation Objective To demonstrate the feasibility of using the WI as a preservation service – lazy preservation – and to evaluate how effectively this previously unexplored service can be utilized for reconstructing lost websites.

  9. Research Questions (Dissertation p. 3) • What types of resources are typically stored in the WI search engine caches, and how up-to-date are the caches? • How successful is the WI at preserving short-lived web content? • How much overlap is there with what is found in search engine caches and the Internet Archive? • What interfaces are necessary for a member of the WI (a web repository) to be used in website reconstruction? • How does a web-repository crawler work, and how can it reconstruct a lost website from the WI?

  10. Research Questions cont. • What types of websites do people lose, and how successful have they been recovering them from the WI? • How completely can websites be reconstructed from the WI? • What website attributes contribute to the success of website reconstruction? • Which members of the WI are the most helpful for website reconstruction? • What methods can be used to recover the server-side components of websites from the WI?

  11. WI Preliminaries: Web Repositories

  12. How much of the Web is indexed? Internet Archive? Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)

  13. Cached Image

  14. Cached PDF: http://www.fda.gov/cder/about/whatwedo/testtube.pdf [Figure: the canonical version alongside the MSN, Yahoo, and Google cached versions]

  15. Types of Web Repositories • Depth of holdings • Flat – only maintain last version of resource crawled • Deep – maintain multiple versions, each with a timestamp • Access to holdings • Dark – no outside access to resources • Light – minimal access restrictions
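
A minimal sketch of the flat/deep distinction (illustrative Python; the data structures and example values are assumptions, not from the dissertation): a flat repository keeps only the last crawled copy per URI, while a deep repository keeps a timestamped series of versions.

```python
from datetime import date

# Flat repository: one (latest) copy per URI; re-crawls overwrite.
flat_holdings = {
    "http://example.com/page.html": b"<html>latest crawl</html>",
}

# Deep repository: every crawled version, keyed by URI and crawl date.
deep_holdings = {
    "http://example.com/page.html": {
        date(2005, 6, 1): b"<html>first crawl</html>",
        date(2005, 9, 1): b"<html>later crawl</html>",
    },
}

def store_flat(holdings, uri, resource):
    holdings[uri] = resource  # previous version is lost

def store_deep(holdings, uri, crawl_date, resource):
    holdings.setdefault(uri, {})[crawl_date] = resource  # versions kept
```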

  16. Accessing the WI • Screen-scraping the web user interface (WUI) • Application programming interface (API) • WUIs and APIs do not always produce the same responses; the APIs may be pulling from smaller indexes [1] [1] McCown & Nelson, Agreeing to Disagree: Search Engines and their Public Interfaces, JCDL 2007
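
As a hedged illustration of the screen-scraping route, the sketch below requests a cached copy through a search engine's WUI. The cache: query operator shown is Google's historical syntax; the result still has to be parsed out of the returned HTML, and the URL pattern, headers, and function name here are assumptions of this sketch.

```python
import urllib.parse
import urllib.request

def fetch_cached_copy(uri: str) -> bytes:
    """Request a search engine's cached copy of `uri` via its WUI.

    Illustrative only: real WUI scraping must parse the result HTML,
    respect rate limits, and track each engine's changing page layout.
    """
    query = urllib.parse.quote(f"cache:{uri}")
    request = urllib.request.Request(
        f"https://www.google.com/search?q={query}",
        headers={"User-Agent": "warrick-like-crawler/0.1"},  # assumed UA
    )
    with urllib.request.urlopen(request) as response:
        return response.read()  # raw HTML of the results/cache page
```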

  17. Research Questions 1-3: Characterizing the WI • Experiment 1: Observe the WI finding and caching new web content as it decays (is removed from the live site) • Experiment 2: Examine the contents of the WI by randomly sampling URLs

  18. Timeline of Web Resource

  19. Web Caching Experiment • May – Sept 2005 • Create 4 websites composed of HTML, PDFs, and images • http://www.owenbrau.com/ • http://www.cs.odu.edu/~fmccown/lazy/ • http://www.cs.odu.edu/~jsmit/ • http://www.cs.odu.edu/~mln/lazp/ • Remove pages each day • Query Google, MSN, and Yahoo (GMY) every day using unique identifiers. McCown et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.

  20. Observations • Internet Archive found nothing • Google was the most useful web repository from a preservation perspective • Quick to find new content • Consistent access to cached content • Lost content reappeared in cache long after it was removed • Images are slow to be cached, and duplicate images are not cached

  21. Experiment: Sample Search Engine Caches • Feb 2007 • Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo • Randomly selected 1 result from the first 100 • Downloaded the resource and its cached page • Checked for overlap with the Internet Archive. McCown and Nelson, Characterization of Search Engine Caches, Archiving 2007.

  22. Distribution of Top Level Domains

  23. Cached Resource Size Distributions [Figure: per-repository size distributions, with apparent cache size limits at 215 KB, 976 KB, 977 KB, and 1 MB]

  24. Cache Freshness and Staleness [Timeline: a copy is fresh when crawled and cached, becomes stale once the resource changes on the web server, and is fresh again after being re-crawled and cached] Staleness = max(0, Last-Modified HTTP header – cached date)
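
The staleness formula on the slide translates directly into code; a minimal sketch (the function name and example dates are illustrative):

```python
from datetime import datetime, timedelta

def staleness(last_modified: datetime, cached_date: datetime) -> timedelta:
    """Staleness = max(0, Last-Modified HTTP header - cached date).

    A cached copy is stale when the live resource changed after the
    copy was cached; otherwise staleness is zero (the copy is fresh).
    """
    return max(timedelta(0), last_modified - cached_date)

# Example: the resource changed 3 days after the engine cached it,
# so the cached copy is 3 days stale.
print(staleness(datetime(2007, 2, 10), datetime(2007, 2, 7)))  # 3 days, 0:00:00
```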

  25. Cache Staleness • 46% of resources had a Last-Modified header • 71% also had a cached date • 16% were at least 1 day stale

  26. Overlap with Internet Archive

  27. Overlap with Internet Archive On average, 46% of URLs from the search engines were also archived in the Internet Archive

  28. Research Question 4 of 10: Repository Interfaces Minimum interface requirement: “What resource r do you have stored for the URI u?” r ← getResource(u)

  29. Deep Repositories “What resource r do you have stored for the URI u at datestamp d?” r ← getResource(u, d)

  30. Lister Queries “What resources R do you have stored from the site s?” R ← getAllUris(s)

  31. Other Interface Commands • Get list of dates D stored for URI u: D ← getResourceList(u) • Get crawl date d for URI u: d ← getCrawlDate(u)
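
Slides 28-31 together define a small repository interface. A sketch of it as an abstract class (the method names follow the slides; the class structure, type hints, and the renaming of the overloaded getResource(u, d) to get_resource_at are assumptions of this sketch):

```python
from abc import ABC, abstractmethod
from datetime import date

class WebRepository(ABC):
    """The repository interface of slides 28-31, sketched as a class."""

    @abstractmethod
    def get_resource(self, uri: str) -> bytes:
        """Minimum requirement: resource r stored for URI u."""

    def get_resource_at(self, uri: str, datestamp: date) -> bytes:
        """Deep repositories only: r stored for URI u at datestamp d."""
        raise NotImplementedError("flat repositories hold one version")

    @abstractmethod
    def get_all_uris(self, site: str) -> list[str]:
        """Lister query: all resources R stored from site s."""

    def get_resource_list(self, uri: str) -> list[date]:
        """Dates D stored for URI u (deep repositories only)."""
        raise NotImplementedError("flat repositories hold one version")

    @abstractmethod
    def get_crawl_date(self, uri: str) -> date:
        """Crawl date d for URI u."""
```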

  32. Research Question 5 of 10: Web-Repository Crawling

  33. Web-repository Crawler

  34. Warrick • Written in Perl • First version completed in Sept 2005 • Made available to the public in Jan 2006 • Run as a command-line program: warrick.pl --recursive --debug --output-file log.txt http://foo.edu/~joe/ • Or online using the Brass queuing system: http://warrick.cs.odu.edu/
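
A minimal sketch of the crawl loop a Warrick-like web-repository crawler performs: start from a seed URI, ask each repository for the resource, and follow the links found in recovered pages. The control flow is paraphrased; the helper names, the None-on-miss convention, and the repository-preference policy are assumptions (a real crawler would also restrict the frontier to the target site and prefer the most recently crawled copy).

```python
from urllib.parse import urljoin

def reconstruct(seed_uri, repositories, extract_links):
    """Recover a lost site by crawling web repositories instead of the
    (dead) live site. Repositories follow the interface sketched after
    slide 31; here get_resource() is assumed to return None on a miss,
    and extract_links() is assumed to pull hrefs out of HTML bytes."""
    frontier = [seed_uri]   # URIs still to try
    recovered = {}          # uri -> recovered resource bytes
    while frontier:
        uri = frontier.pop()
        if uri in recovered:
            continue
        for repo in repositories:
            resource = repo.get_resource(uri)
            if resource is None:
                continue    # this repository does not hold the URI
            recovered[uri] = resource
            # Follow links in the recovered page; a real crawler would
            # filter the frontier down to the target site only.
            for link in extract_links(resource):
                frontier.append(urljoin(uri, link))
            break           # stop after the first repository hit
    return recovered
```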

  35. Research Question 6 of 10: Warrick Usage

  36. [Figure: Warrick usage results; average of 38.2%]

  37. Research Questions 7 and 8: Reconstruction Effectiveness • Problem with usage data: difficult to determine how successful reconstructions actually are • Brass tells Warrick to recover all resources, even if not part of the “current” website • When were websites actually lost? • Were URLs spelled correctly? Spam? • Need the actual website to compare against the reconstruction, especially to determine which factors affect a website’s recoverability

  38. Measuring the Difference • Apply a recovery vector (r_c, r_m, r_a) to each resource, flagging it as changed, missing, or added • Compute a difference vector for the entire website from the per-resource recovery vectors
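
A worked sketch of the aggregation, assuming each recovery vector's components are 0/1 flags and that the site-level difference vector simply averages them (the dissertation defines the exact normalization; the averaging here is an illustrative choice):

```python
def difference_vector(recovery_vectors):
    """Aggregate per-resource recovery vectors (r_c, r_m, r_a) into a
    site-level difference vector. Each component flags the resource as
    changed, missing, or added (0 or 1). Averaging is illustrative;
    the dissertation defines the exact normalization."""
    n = len(recovery_vectors)
    changed = sum(v[0] for v in recovery_vectors) / n
    missing = sum(v[1] for v in recovery_vectors) / n
    added = sum(v[2] for v in recovery_vectors) / n
    return (changed, missing, added)

# Six resources: three recovered identically, two changed, one missing.
print(difference_vector(
    [(0, 0, 0), (0, 0, 0), (0, 0, 0), (1, 0, 0), (1, 0, 0), (0, 1, 0)]
))  # approximately (0.33, 0.17, 0.0)
```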

  39. Reconstruction Diagram [Figure: identical 50%, changed 33%, missing 17%, added 20%]
