
Factors Affecting Website Reconstruction from the Web Infrastructure


Presentation Transcript


  1. Factors Affecting Website Reconstruction from the Web Infrastructure Frank McCown, Norou Diawara, and Michael L. Nelson Old Dominion University, Computer Science Department, Norfolk, Virginia, USA JCDL 2007, Vancouver, BC, June 20, 2007

  2. Outline • Web-repository crawling with Warrick • How successful is a reconstruction? • Reconstruction experiment • Significant findings

  3. Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg • Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg • Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

  4. Crawling the Crawlers

  5. Cached Image

  6. Cached PDF http://www.fda.gov/cder/about/whatwedo/testtube.pdf [figure: canonical version alongside the MSN, Yahoo, and Google cached versions]

  7. • McCown et al., Brass: A Queueing Manager for Warrick, IWAW 2007. • McCown et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM/IEEE JCDL 2007. • McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006. • McCown et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006. Available at http://warrick.cs.odu.edu/

  8. Measuring the Difference • Apply a recovery vector (r_c, r_m, r_a) to each resource: changed, missing, added • Compute the difference vector for the website from the per-resource recovery vectors

  9. Some Difference Vectors D = (changed, missing, added) • (0,0,0) – perfect recovery • (1,0,0) – all resources are recovered but changed • (0,1,0) – all resources are lost • (0,0,1) – all recovered resources are at new URIs
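
A minimal sketch of the aggregation step, assuming each resource has already been classified as identical, changed, missing, or added by the reconstruction (the function name and the simple normalization by total resource count are illustrative assumptions, not the paper's exact definitions):

    # Hypothetical helper: aggregate per-resource classifications into a
    # website difference vector D = (changed, missing, added).
    # Normalizing every component by the same total is a simplification.
    def difference_vector(classifications):
        n = len(classifications)
        changed = sum(c == "changed" for c in classifications) / n
        missing = sum(c == "missing" for c in classifications) / n
        added   = sum(c == "added"   for c in classifications) / n
        return (changed, missing, added)

    # 10 resources: 6 identical, 2 changed, 1 missing, 1 added
    D = difference_vector(["identical"] * 6 + ["changed"] * 2
                          + ["missing"] + ["added"])
    print(D)  # (0.2, 0.1, 0.1)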

  10. How Much Change is a Bad Thing? [figure: lost vs. recovered resources]

  11. How Much Change is a Bad Thing? [figure: lost vs. recovered resources, continued]

  12. Assigning Penalties • Penalty adjustment (P_c, P_m, P_a) • Apply to each resource, or to the difference vector
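
A sketch of one way the penalty triple could be applied; the slide only names (P_c, P_m, P_a), so the element-wise weighting below is an assumption:

    # Hypothetical element-wise penalty adjustment of a difference vector.
    def penalize(difference, penalties):
        return tuple(d * p for d, p in zip(difference, penalties))

    # e.g., treat a changed resource as half as bad as a missing one
    print(penalize((0.2, 0.1, 0.1), (0.5, 1.0, 0.25)))  # (0.1, 0.1, 0.025)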

  13. Defining Success • success = 1 − d_m, on a scale from 0 (less successful) to 1 (more successful) • Equivalent to the percentage of recovered resources
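
Continuing the hypothetical helpers above, the success measure is then a single subtraction:

    # success = 1 - d_m, where d_m is the missing component of
    # D = (changed, missing, added); equivalently, the fraction of
    # original resources recovered in some form.
    def success(difference):
        return 1.0 - difference[1]

    print(success((0.2, 0.1, 0.1)))  # 0.9 -> 90% of resources recovered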

  14. Reconstruction Experiment • 300 websites chosen randomly from Open Directory Project (dmoz.org) • Crawled and reconstructed each website every week for 14 weeks • Examined change rates, age, decay, growth, recoverability

  15. Success of website recovery each week. *On average, we recovered 61% of a website on any given week.

  16. Recovery of Textual Resources

  17. Recovery by TLD

  18. Birth and Decay

  19. Recovery of HTML Resources

  20. Recovery by Age

  21. Statistics for Repositories

  22. Which Factors Are Significant? • External backlinks • Internal backlinks • Google’s PageRank • Hops from root page • Path depth • MIME type • Query string params • Age • Resource birth rate • TLD • Website size • Size of resources

  23. Mild Correlations • Hops and website size (0.428), path depth (0.388) • Age and # of query params (−0.318) • External links and PageRank (0.339), website size (0.301), hops (0.320)

  24. Regression Analysis • No surprises: all variables are significant, but the overall model explains only about half of the observations • Three most significant variables: PageRank, hops, and age (R-squared = 0.1496)
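
A hedged sketch of fitting such a model with statsmodels; the data below is synthetic and the variable set is trimmed to the three factors named on the slide, so this reproduces the setup, not the paper's numbers:

    import numpy as np
    import statsmodels.api as sm

    # Synthetic stand-in data: one row per resource.
    rng = np.random.default_rng(0)
    n = 500
    X = np.column_stack([
        rng.uniform(0, 10, n),    # Google's PageRank (toolbar scale)
        rng.integers(0, 8, n),    # hops from root page
        rng.uniform(0, 52, n),    # age in weeks
    ])
    y = (rng.random(n) < 0.61).astype(float)  # 1 if recovered, else 0

    model = sm.OLS(y, sm.add_constant(X)).fit()
    print(model.rsquared)  # cf. the talk's R-squared of 0.1496
    print(model.pvalues)   # per-variable significance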

  25. Conclusions • Most of the sampled websites were relatively stable • One third of the websites never lost a single resource • Half of the websites never added any new resources • The typical website can expect to get back 61% of its resources if it were lost today (77% textual, 42% images and 32% other) • How to improve recovery from WI? Improve PageRank, decrease number of hops to resources, create stable URLs

  26. Thank You Sorry, Dad… You lost me in the first two minutes. Frank McCown fmccown@cs.odu.edu http://www.cs.odu.edu/~fmccown/

  27. Injecting Server Components into Crawlable Pages [diagram: server components are erasure-coded into blocks embedded in HTML pages; recovering at least m blocks reconstructs the components]
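
As a toy illustration of the m-of-n recovery property, here is a single-parity erasure code (n = m + 1, so any m blocks suffice); the actual work presumably uses a more general erasure code, so treat this purely as a sketch of the idea:

    from functools import reduce

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def encode(data_blocks):
        # m equal-length data blocks -> m + 1 blocks (data + XOR parity)
        return data_blocks + [reduce(xor, data_blocks)]

    def decode(blocks):
        # blocks: m + 1 entries, at most one replaced by None (lost).
        # XOR of the surviving m blocks rebuilds the missing one.
        if None in blocks:
            i = blocks.index(None)
            blocks[i] = reduce(xor, (b for b in blocks if b is not None))
        return blocks[:-1]  # drop the parity block

    data = [b"AAAA", b"BBBB", b"CCCC"]  # m = 3 equal-length blocks
    stored = encode(data)               # 4 blocks injected into pages
    stored[1] = None                    # one block was not recovered
    print(decode(stored))               # [b'AAAA', b'BBBB', b'CCCC']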

  28. [diagram] Web Server → Web Infrastructure • Recoverable: static files (HTML files, PDFs, images, style sheets, JavaScript, etc.) • Not recoverable: server config, Perl scripts, dynamic pages, database
