
Access and Analytics to the UK Web Archive

Lewis Crawford, Web Archive Technical Lead, The British Library. This talk covers the background of the UK Web Archive, traditional access methods to web archives, and full text search for resource discovery.


Presentation Transcript


  1. Access and Analytics to the UK Web Archive
     Lewis Crawford, Web Archive Technical Lead, The British Library

  2. Introduction
     This talk will cover:
     • Background of the UK Web Archive
     • Traditional access methods to Web Archives
     • Full text search for resource discovery
     • Problems of scale – needles and haystacks

  3. Web Archiving: the basics
     • What
       • Selecting, capturing, storing, preserving and managing access to snapshots of websites over time
     • How
       • Use crawler software to download websites automatically
       • Selective or domain archiving
       • Provide access in a Web Archive
     • When
       • Since the mid-1990s
     • Who
       • Heritage and memory organisations, eg (IIPC)
       • University libraries
       • Not-for-profit and commercial organisations, eg Internet Archive
       • Individual researchers
     • Why
       • Global information resource
       • Artefact of cultural and technology change
       • Representative sample of the web: historical and sociological data that may not be found elsewhere
       • Part of national digital heritage - legal requirements

  4. UK Web Archive:

  5. Web archive as historical documents

  6. Multimedia based content

  7. 3D visualisation wall

  8. Full text search

  9. N-gram visualisation

  10. N-gram visualisation

  11. Media based results

  12. Semantic analysis

  13. Scale: needle and haystack
      • Google: “seen 1 trillion unique URLs”
      • More than a billion new pages are added to the web every day
      • The UK web domain
        • 9 million .uk domain names registered in December 2010
        • ~1 million using other domain names
        • Growing at 11% - 14% per year
        • 40% estimated to be in scope for Legal Deposit
        • Estimated ~110TB for each UK domain crawl
      • UK Web Archive
        • ~10,000 websites collected since 2004
        • ~40,000 instances
      [Figure: Subject hierarchy visualisation]

  14. The value of the haystacks – content visualisation

  15. Big Data analytics
      • Java Map/Reduce to use Tika to extract text and generate XML files for Solr ingest
      • Hive & Pig for ad hoc query analysis
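The slide describes a Java Map/Reduce job; the following is only a minimal Python sketch of the same idea, so that it runs without Hadoop or Tika installed. The `extract_text` helper is a hypothetical stand-in for the Tika step, and the field names `id` and `content` are assumptions, not the archive's actual Solr schema. The output uses Solr's standard `<add><doc><field name="...">` XML update message.

```python
import xml.etree.ElementTree as ET


def extract_text(raw_bytes):
    """Stand-in for the Tika extraction step (hypothetical helper).

    A real job would hand the archived payload to Apache Tika; here the
    bytes are simply decoded so the sketch runs without dependencies.
    """
    return raw_bytes.decode("utf-8", errors="replace").strip()


def map_record(url, raw_bytes):
    """Map step: turn one archived resource into a dict of Solr fields."""
    # Field names "id" and "content" are illustrative assumptions.
    return {"id": url, "content": extract_text(raw_bytes)}


def to_solr_add_xml(docs):
    """Serialise field dicts into a single Solr <add> update message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


# Two made-up records standing in for resources read from (w)arc files.
records = [
    ("http://example.org/a", b"Hello web archive"),
    ("http://example.org/b", b"Needles & haystacks"),
]
solr_xml = to_solr_add_xml(map_record(url, body) for url, body in records)
print(solr_xml)
```

Building the XML with a library rather than string concatenation means characters such as `&` in the extracted text are escaped automatically, which matters when the input is arbitrary web content.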

  16. Search indexing process
      [Architecture diagram. Labels: WCT crawlers (generate (w)arcs); meta database (insert meta information); Hadoop, node 1 to node 50 (generate XML files); XML media store, XML image store, XML document store; Solr dedicated indexers (DIH indexes new XML); replication to Solr dedicated search servers; document meta service (retrieve (w)arcs and meta information); web access.]

  17. Tag cloud analysis – General Election 2005
      • Special collection: 2005 general election
      • 147 websites archived during and immediately after the UK general election campaign of 2005
      • Tag clouds (or weighted lists) generated for websites belonging to key political parties
      • Shows the most frequently used words in the websites during the 2005 election campaign
      • Special collection for the 2010 general election now available
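A tag cloud of this kind is essentially a weighted word list. The sketch below shows one plausible way to build it, assuming simple lowercase tokenisation, a tiny illustrative stop-word list, and linear scaling of counts to font sizes. None of these choices are taken from the talk, and the sample text is made up.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "for", "is", "on", "our", "we"}


def weighted_list(text, top_n=5):
    """Return the top_n (word, count) pairs, ignoring stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


def font_sizes(weights, min_px=12, max_px=36):
    """Scale counts linearly to font sizes, as a tag cloud would."""
    if not weights:
        return {}
    highest = max(count for _, count in weights)
    lowest = min(count for _, count in weights)
    span = (highest - lowest) or 1  # avoid division by zero
    return {
        word: min_px + (count - lowest) * (max_px - min_px) // span
        for word, count in weights
    }


# Made-up sample text standing in for the text of a party website.
sample = "Tax and schools. Schools, hospitals and tax. Schools for the future."
weights = weighted_list(sample)
print(weights)
print(font_sizes(weights))
```

The most frequent word gets the largest font and the least frequent the smallest, which is what makes the dominant campaign vocabulary stand out visually.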

  18. The value of the haystacks – postcode-based access

  19. Legend: 1: Blue · 2-5: Green · 5+: Purple · 50+: Yellow · 100+: Red
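The legend can be read as a set of count buckets. The sketch below is one possible reading, not the archive's implementation: the slide does not say what is counted (presumably archived websites per postcode), and "2-5" and "5+" overlap at 5, so the bucket edges here are assumptions (5 is treated as Green, 6-49 as Purple, 50-99 as Yellow).

```python
def legend_colour(count):
    """Map a count to a legend colour, under assumed bucket edges."""
    if count < 1:
        return None      # nothing to plot
    if count == 1:
        return "Blue"
    if count <= 5:
        return "Green"   # assumed: 2-5 inclusive
    if count < 50:
        return "Purple"  # assumed: 6-49
    if count < 100:
        return "Yellow"  # assumed: 50-99
    return "Red"         # assumed: 100 and above


for n in (1, 3, 20, 75, 250):
    print(n, legend_colour(n))
```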

  20. Questions? Thank you.
      • http://www.webarchive.org.uk
      • lewis.crawford@bl.uk
      • @relephantdata
