
Capturing the Web

Using Archive-It to Preserve Digital Content. Michael Silver (msilver@farpan.net). Netspeed 2011, Friday October 21, 2011, Calgary, Alberta.


Presentation Transcript


  1. Capturing the Web: Using Archive-It to Preserve Digital Content • Michael Silver, msilver@farpan.net • Netspeed 2011, Friday October 21, 2011, Calgary, Alberta

  2. This Session Will • Describe the challenges and goals of the Heritage Community Foundation archiving project at U of A Libraries • Introduce the Archive-It subscription service • Discuss some example use cases • Share some lessons I learned from the HCF project

  3. The Project

  4. Heritage Community Foundation The Heritage Community Foundation was a charitable trust committed to connecting people with heritage. Among its activities, the HCF developed the Alberta Online Encyclopedia – a collection of over 80 websites relating to Alberta’s historical, natural, cultural, scientific and technological heritage.

  5. Heritage Community Foundation The Heritage Community Foundation ceased operations on June 30, 2009 after ten years of activity. The Alberta Online Encyclopedia – a collection of over 80 websites relating to Alberta’s historical, natural, cultural, scientific and technological heritage – was gifted to the University of Alberta Libraries on June 1, 2009.

  6. HCF Problem Areas • JavaScript-based content display (URLs constructed in client-side code) • Multimedia (Flash, audio and video) • Form-based access to content • Older operating system and application software • Compromised systems
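
To illustrate the first problem area, here is a minimal Python sketch of why a link-following crawler misses URLs that are constructed in client-side code. The page markup, the `LinkExtractor` class and the `extract_links` helper are all hypothetical and are not taken from the HCF sites or from Archive-It; the sketch only assumes a crawler that reads `href` and `src` attributes from the HTML it downloads and does not execute scripts.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href/src attribute values, as a simple crawler might."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


# Hypothetical page: one plain link, one image URL assembled by a script.
PAGE = """
<html><body>
  <a href="/alberta/en/history.html">History</a>
  <script>
    var id = 42;
    document.getElementById("pic").innerHTML =
      "<img src='/images/" + id + ".jpg'>";
  </script>
  <div id="pic"></div>
</body></html>
"""


def extract_links(html):
    """Return the links visible in the markup without running any script."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


print(extract_links(PAGE))  # ['/alberta/en/history.html']
```

The image URL never appears in the result because it only exists after a browser runs the script, which is why such content needs a code change or an explicit seed before it can be archived.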

  7. Changing Nature of the Web • Web sites change, both in terms of content and technology • Most sites contained some code that would not function correctly using current versions of PHP or MySQL

  8. Code Updates • A review of the code on a selected site indicated it would take three years to bring the code up to the U of A Libraries’ standards! • Some sites were already broken in their live versions. One simple change restored access to literally thousands of images!

  9. Bottom Line • Content at risk in terms of both access and preservation • Resources not available to review existing websites and bring them up to current standards and software packages

  10. Answer • Use Archive-It to preserve digital content

  11. Archive-It

  12. Archive-It • Subscription service offered by the Internet Archive • Uses open-source software developed by the Internet Archive (Heritrix crawler and archive access projects) • Provides additional web-based control of crawl configurations • Provides storage and bandwidth

  13. Cost • Annual subscription provides a budget based on different usage measures (collections, seeds, documents, data). • Beware of sticker-shock! The changing nature of Web technologies requires constant updates to and maintenance of the software. • Contact Archive-It staff to explore pricing options.

  14. HCF Project Plan • Identify issues with sites using an in-house crawler built using Perl • Identify code changes that prevent archiving or display of content • Implement changes on sites • Test and verify

  15. Using Archive-It

  16. Archive-It • Subscription service provided by the Internet Archive • Similar to the Wayback Machine, with one key difference: • The Wayback Machine crawls the web in general • Archive-It allows subscribers to select sites, set update schedules and describe the content

  17. Archive-It Collections • A subscription provides a specific number of active collections • Active collections will be automatically crawled and updated according to user-defined selection • Inactive or dormant collections are still available for viewing but are not actively crawled

  18. Seed URLs • Each subscription limited to a number of active seed URLs • Seed URLs provide the starting points for crawls • System will identify and follow all links that are in-scope

  19. Scope • The scope for a collection is determined by the interaction between the seed URLs and explicit scope settings • Initial scope determined by seed URL • Scope can be manually adjusted to include or exclude files based on host, file extension or other user-entered rules

  20. Budgets • In addition to the limits on active collections and seeds, Archive-It budgets resource usage: • Number of archived documents • Total archived data

  21. Document Budget • Each archived file counts as a document. A single Web page will have multiple files due to images, external style sheets or JavaScript files. • Every crawl of a file counts against the document budget regardless of whether the file has changed or not.

  22. Data Budget • The data budget is measured by the amount of data crawled. • Files which have not changed since the previous crawl do not count against the data budget – but they still count against the document budget!
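
The difference between the two budgets can be sketched with a little arithmetic in Python. The `CrawledFile` record, the `budget_usage` function, the URLs and the file sizes are all made up for illustration; the sketch only encodes the two rules stated on the slides (every crawled file counts as a document, only changed files count as data) and is not Archive-It's actual accounting.

```python
from dataclasses import dataclass


@dataclass
class CrawledFile:
    url: str
    size_bytes: int
    changed: bool  # False if identical to the previous capture


def budget_usage(files):
    """Return (documents, data_bytes) charged for one crawl."""
    documents = len(files)  # every crawled file counts, changed or not
    data_bytes = sum(f.size_bytes for f in files if f.changed)
    return documents, data_bytes


# Hypothetical second crawl of a one-page site: only the HTML changed.
crawl = [
    CrawledFile("http://example.org/index.html", 20_000, True),
    CrawledFile("http://example.org/style.css", 5_000, False),
    CrawledFile("http://example.org/logo.png", 50_000, False),
]

print(budget_usage(crawl))  # (3, 20000)
```

Frequent crawls of a mostly static site therefore use up the document budget much faster than the data budget.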

  23. Test Crawls • Important to use them to control the use of document and data budgets! • Especially important in identifying crawler traps • Significant improvements in functionality have been made to the test crawl feature
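
As a rough illustration of what to look for in the results of a test crawl, the Python sketch below flags URLs that look like crawler traps. The `looks_like_trap` function and its thresholds are assumptions chosen for this example, not a feature of Archive-It; the heuristic simply treats very deep paths or path segments that repeat many times as suspicious.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_trap(url, max_depth=10, max_repeats=2):
    """Heuristic: flag very deep paths or heavily repeated path segments."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    counts = Counter(segments)
    return any(n > max_repeats for n in counts.values())


# Hypothetical URLs from a test crawl report.
urls = [
    "http://example.org/alberta/en/index.html",
    "http://example.org/a/b/a/b/a/b/page.html",
]
for url in urls:
    print(url, looks_like_trap(url))
```

A heuristic like this will produce false positives and miss other kinds of traps (for example, endless calendar pages driven by query strings), so flagged URLs still need a manual look before scope rules are changed.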

  24. Web Preservation

  25. Naïve concept of content preservation • Related to static items • Focused on maintaining packaged, standalone items (book, manuscript, electronic files) • Often related to physical deterioration or obsolescence

  26. Web Challenges • Dynamic content • Content rarely provided as a package • Often relies on underlying technology not easily replicated in a static package (e.g., database-driven websites)

  27. Preservation process • Preservation generally consists of a series of steps • Selection • Acquisition • Organization • Availability • Preservation

  28. Selection – Seed URL • Seed URL selection • What site or part of a site is targeted? • Example: The seed URL http://www.abheritage.ca/alberta/en/index.html will not capture the French version at http://www.abheritage.ca/alberta/fr/index.html
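
The seed URL example above can be sketched in Python. The sketch assumes that the default scope is everything on the seed's host under the seed's directory; that is a simplification for illustration, and the actual Archive-It scoping rules may differ. The `seed_prefix` and `in_scope` helpers are hypothetical names.

```python
from urllib.parse import urlsplit


def seed_prefix(seed_url):
    """Return (host, directory) derived from a seed URL."""
    parts = urlsplit(seed_url)
    # Keep the path up to and including the last slash.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return parts.netloc.lower(), directory


def in_scope(url, seed_url):
    """True if url is on the seed's host and under the seed's directory."""
    host, directory = seed_prefix(seed_url)
    parts = urlsplit(url)
    return parts.netloc.lower() == host and parts.path.startswith(directory)


SEED = "http://www.abheritage.ca/alberta/en/index.html"

print(in_scope("http://www.abheritage.ca/alberta/en/index.html", SEED))  # True
print(in_scope("http://www.abheritage.ca/alberta/fr/index.html", SEED))  # False
```

Under this assumption, capturing both languages means either adding the French page as a second seed or choosing a seed higher up the path.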

  29. Selection by file type • Scope by file type • What files within that site are of interest? • If only PDF files or videos are desired, settings should be adjusted in the scope or seed screen. • Example: the Alberta Education Curriculum Collection is interested in PDF files only.

  30. Selection by rule • Scope by rule • What other content may be needed? • Some shared CSS files were missed in crawls. Adding a rule to include any CSS files fixed the problem.
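
The file-type and rule-based adjustments can be sketched as a small filter in Python. The `keep` function, its parameters and the example.org URLs are hypothetical, and the sketch does not reproduce Archive-It's rule syntax. It assumes a base scope decision has already been made for each URL and shows two adjustments: restricting a collection to certain extensions, and pulling in files by extension even when they fall outside the base scope.

```python
import posixpath
from urllib.parse import urlsplit


def extension(url):
    """Return the lower-cased file extension of the URL path, e.g. '.pdf'."""
    return posixpath.splitext(urlsplit(url).path)[1].lower()


def keep(url, base_in_scope, include_exts=(), only_exts=None):
    """Apply extension rules on top of a base scope decision."""
    ext = extension(url)
    if only_exts is not None:
        # Restrictive rule: keep only the listed file types.
        return base_in_scope and ext in only_exts
    # Inclusive rule: also keep listed file types from outside the base scope.
    return base_in_scope or ext in include_exts


# A PDF-only collection drops the HTML page but keeps the PDF.
print(keep("http://example.org/docs/report.pdf", True, only_exts={".pdf"}))  # True
print(keep("http://example.org/docs/index.html", True, only_exts={".pdf"}))  # False

# A shared style sheet outside the seed path is kept by a CSS include rule.
print(keep("http://example.org/shared/site.css", False, include_exts={".css"}))  # True
```

In a real crawl the HTML pages still have to be fetched so that links to the PDF files can be discovered; the filter here only models which files end up in the collection.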

  31. Example Seed URL: www.abheritage.ca/alberta/en/index.html
