Capturing the web The Swedish experience www.kb.se/kw3 Kulturarw³
Content • The Archive • priorities • storage • what we save • Development • IIPC • Tools, format • conclusion • Background • Kulturarw3 • goals • strategy • Sweden on the net? • Harvesting • Software • Finding links • problem • Statistics • What have we got?
Background • Legal deposit, 1661 • Latest revision 1993 • Only electronic documents in fixed form • CD-ROM, diskettes • New law • 1 July 2002: exception from the personal privacy law • First Swedish web newspaper lost • Printed newspapers since 1645 • Kulturarw3 started 1996 • Still waiting for new legal deposit law
Goals • All web pages in Sweden • pictures, video etc. • .se and other top-level domains • Electronic journals
Strategy: two choices • Select what is important: How to know what will be considered important in the future? Labour intensive • Everything, using automatic software: Gets everything (well, not really). Less labour intensive
Strategy • Take snapshots of the Swedish web a few times each year • Gets “all” • Needs less labour • Computer memory is cheap • However, large volumes make quality control difficult • Selective harvesting: about 150 newspapers every day • In the future: events, e.g. elections, with as little human intervention as possible.
Sweden on the web? http://www.kb.se/kbstart.htm Only the domain part is relevant • .se • .nu (Niue), popular in Sweden: “nu” means “now” in Swedish • Others if the server is geographically located in Sweden • Language?
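The scope rule on this slide can be sketched in a few lines of Python. This is only an illustration, not the Kulturarw3 implementation: the function name `in_scope` and the `server_in_sweden` flag are hypothetical, and how a server's geographic location is actually determined is left to the caller.

```python
from urllib.parse import urlsplit

# Top-level domains treated as Swedish on the slide.
SWEDISH_TLDS = {"se", "nu"}


def in_scope(url, server_in_sweden=False):
    """Return True if the URL counts as part of the Swedish web.

    Only the domain part of the URL is inspected. `server_in_sweden`
    is a hypothetical flag supplied by the caller for hosts under
    other top-level domains.
    """
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return tld in SWEDISH_TLDS or server_in_sweden
```

With this sketch, `in_scope("http://www.kb.se/kbstart.htm")` is in scope because of the `.se` domain alone; the path `/kbstart.htm` plays no part in the decision.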
Harvesting software • A harvester (crawler, spider) collects web pages by automatically following links and saving pages • Open-source harvester: Heritrix • Main developer: Internet Archive (IA) • Written in Java. Active community. • Designed for archiving, not indexing. • Earlier: Modified version of Combine • From NetLab, Lund University. • Important! Indexing isn't archiving and archiving isn't indexing! • Also collects pictures, sound etc.
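The idea of "following links and saving pages" can be illustrated with a toy harvester. This is a minimal sketch and not how Heritrix works internally. The names `LinkExtractor`, `extract_links` and `crawl` are made up for the example, and the `fetch` callable is an assumption that stands in for real HTTP access, so the sketch runs without a network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href> and <img src> attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(seed, fetch, max_pages=100):
    """Breadth-first harvest starting at `seed`.

    `fetch(url)` must return the page text, or None if unavailable.
    Returns a dict mapping each saved URL to its content.
    """
    seen = {seed}
    queue = deque([seed])
    saved = {}
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        saved[url] = body  # "archive" the page
        for link in extract_links(url, body):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return saved
```

A production harvester adds politeness delays, robots.txt handling, scope rules and archive-format output, all of which this sketch leaves out.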
Problems • …or challenges if you are an optimist… • Scripts • Interactive pages • Password protected • Video/streaming material • Social sites
Statistics – what did we get? Bulk crawls (everything Swedish) • First sweep, 1997, only .se: 6.8 million files, 160 GB data • A sweep 2007-2008, .se and other TLDs: 270 million files, 11500 GB data
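To put the two sweeps side by side, the figures above can be turned into growth factors and average file sizes. The short calculation below uses only the numbers on the slide; treating 1 GB as 10^9 bytes is an assumption, and the helper name `average_file_size` is made up for the example.

```python
GB = 10**9  # assumption: decimal gigabytes

# (files, bytes) taken from the slide
sweep_1997 = (6.8e6, 160 * GB)
sweep_2007 = (270e6, 11500 * GB)


def average_file_size(sweep):
    """Average size in bytes of one harvested file."""
    files, total_bytes = sweep
    return total_bytes / files


file_growth = sweep_2007[0] / sweep_1997[0]   # roughly 40 times more files
data_growth = sweep_2007[1] / sweep_1997[1]   # roughly 72 times more data

print(f"files grew {file_growth:.1f}x, data grew {data_growth:.1f}x")
print(f"average file 1997: {average_file_size(sweep_1997) / 1000:.1f} kB")
print(f"average file 2007-2008: {average_file_size(sweep_2007) / 1000:.1f} kB")
```

Data volume grew faster than the file count, so the average harvested file became larger between the two sweeps.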
Statistics – what did we get? • Periodicals (newspapers) • Started June 2002 • 88 million URLs • 4.0 TB • About 40 000 URLs every day
More statistics Bulk (everything Swedish) • 823 100 web servers (including inlines) • 651 700 “Swedish”: .se 50%, .nu 21%, others 29% • 1549 different MIME types found. • HTML about 50% • text/html + image/gif + image/jpeg + application/pdf + text/plain about 97% of the documents. • A lot of garbage, misspellings etc.
Trends • HTML: stable, 50-60%. Increasing lately • JPEG: increasing, 11% (-97), 27% (-05) • GIF: decreasing, 23% (-97), 11% (-05) • PDF: increasing, 9th to 4th position
Accessing the archive • First priority is to access the archive using traditional web technologies • Surf, in “space” and time • Free text search • NB: not using traditional library methods (cataloguing etc.)
Development • International Internet Preservation Consortium (IIPC) • Started by the Internet Archive and the national libraries of: Sweden, Norway, Finland, Denmark, Iceland, UK, France, Italy, Canada, Australia and USA (LoC). Now many more • Develop common standards, tools and methods for web archiving. • Raise awareness
Development, standards • Archiving formats • Earlier formats • MIME (Multipurpose Internet Mail Extensions) • ARC • NedLib • WARC (Web ARChive file format) • File format for saving web material: each web page is one record in a WARC file. A record contains metadata and content • ISO 28500.
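The "metadata plus content" structure of a WARC record can be illustrated with a small sketch. It writes a simplified `resource` record with a handful of the named header fields from the WARC format; it is not a complete ISO 28500 implementation, and the function name `build_warc_record` is made up for the example. For real archives a maintained WARC library is the safer choice.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, payload, content_type, when=None):
    """Build one simplified WARC 'resource' record as bytes.

    Layout: version line, named header fields (the metadata),
    an empty line, the content block, then two CRLFs.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n"
    head += "".join(f"{name}: {value}\r\n" for name, value in headers)
    head += "\r\n"  # empty line separates metadata from content
    return head.encode("utf-8") + payload + b"\r\n\r\n"


record = build_warc_record("http://www.kb.se/", b"<html></html>", "text/html")
```

Several such records written one after another form a WARC file, which is how one file can hold many harvested pages.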
Development, Tools • Tools • Harvesting: Heritrix • Designed for archiving (NOT a modified indexer) • Open source: Java, Linux etc. • Supported by IIPC • Mainly developed by Internet Archive with contributions • Supports ARC and MIME; WARC support is being added • Surfing tools • New Wayback Machine • WERA - surf with time line • WAXToolbar – support when using new WM • NutchWAX • Free text search (with time line) • Curator tool • Possible for a non-technician to do collection and quality control
Advice • Use open standards, open source → IIPC • Get users of the archive • Think big. Hundreds of terabytes, billions of files • Accept that what you do is a best effort
Conclusion • The web is constantly changing, so archiving needs continuous development. • Possible to get a reasonable picture of the web. But never complete! • Do something now
Questions? Comments?
Links • IIPC: www.netpreserve.org • Kulturarw3: www.kb.se/kw3 • Internet Archive: www.archive.org