Building a National Collection of the Historical UK Web for scholarly use

Helen Hockx-Yu, Head of Web Archiving, British Library. IIPC General Assembly, Paris, May 2014

Scholarly interaction with web archives (1) • Archive-driven • Initiated by archival institutions • Aimed at understanding scholarly requirements and improving archival practice • Scholar-driven • Initiated by scholars with research interests related to web archiving or archived web material, including many “unknown” scholars • A number of active research groups emerging: Netlab, WebArt and DMI, IHR, OII, ODU… • Attention from the Web Science community • Project-based • Varying scale, scope and funding sources • Developing web archiving or discipline-specific solutions • Researchers and archiving institutions work as partners

Scholarly interaction with web archives (2) • Phase 1: Building collections • Scholars’ involvement in scoping collections, selecting and describing websites relevant to their research interests • Creation of specific (narrow) topical collections, e.g. “Religion, politics and law since 2005” in the UK Web Archive • Phase 2: Formulating research questions • Brainstorming sessions, workshops, etc. • Shift of focus to web archives in their entirety • Lack of awareness & baseline knowledge • Time- and resource-consuming • Challenging: you don’t know what you don’t know

Scholarly interaction: the “go-to” state • Independent use of web archives • Meet common scholarly requirements, support scholarly workflow • Baseline knowledge is self-explanatory, e.g. scope of the archive, its coverage and lacunae, how it was collected, and how a particular website was crawled • Clear interfaces and jargon-free descriptions in alignment with scholarly requirements • Open access • Including provision of downloadable derived or secondary datasets, e.g. http://data.webarchive.org.uk/opendata/ • Publication of work citing web archives

Selective archiving since 2003 • Permission-based • Open UK Web Archive http://www.webarchive.org.uk/ukwa/ • ~14,000 websites, ~64,000 instances • URL and full-text search • Curated collections • Many websites no longer available on the live web

6th April 2013… • Legal Deposit Libraries (Non-Print Works) Regulations 2013 • Extension of existing legal framework • Systematic collection of UK’s published output for heritage & preservation • By 6 UK Legal Deposit Libraries

JISC UK Web Domain dataset (1996-2014) • Collaboration between the Internet Archive (IA), the Joint Information Systems Committee (JISC) and the British Library • Extracted copies of UK websites from the Internet Archive’s collection • 1st tranche: 1996 – 2010, 30TB, 2.5 billion URLs • 2nd tranche: 2010 – April 2013, 27.5TB, 1.5 billion URLs (estimated) • Research agreement between JISC and IA, upholding IA’s Terms of Use • Access via IA’s Wayback Machine • Allows replication / extraction of derivative or secondary datasets • BL hosts the dataset on behalf of JISC

Completed work • Analytical Access to the Domain Dark Archive Project • Use cases & experimental UI • Demonstrating the Value of the UK Web Domain Dataset for Social Science Research • Analysis of link graph • Paper accepted for WebSci’14: Mapping the UK Webspace: Fifteen Years of British Universities on the Web • MA thesis by Jules Mataly: The Three Truths of Margaret Thatcher: Creating and Analysing • Secondary datasets under open licence • Format profile, Geoindex, Host Link Graph

Exploring Host Link Graph Courtesy of Peter Webster, Rainer Simon and Jules Mataly

Visualising links (to and from bl.uk) • Interactive version • How it is done
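A host link graph of the kind explored above can be sketched as follows. This is a minimal illustration, not the method used to produce the published Host Link Graph dataset: it assumes page-level links are available as (year, source URL, target URL) rows, and the function names `host_of`, `host_link_graph` and `links_for_host`, as well as the sample rows, are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def host_link_graph(page_links):
    """Collapse (year, source URL, target URL) rows into host-level edge counts."""
    edges = Counter()
    for year, source_url, target_url in page_links:
        source, target = host_of(source_url), host_of(target_url)
        # Skip unparseable URLs and links that stay within one host.
        if source and target and source != target:
            edges[(year, source, target)] += 1
    return edges


def links_for_host(edges, host):
    """Split the edges touching `host` into inbound and outbound link counts."""
    inbound = Counter()
    outbound = Counter()
    for (year, source, target), count in edges.items():
        if target == host:
            inbound[source] += count
        if source == host:
            outbound[target] += count
    return inbound, outbound


# Hypothetical sample rows, invented for illustration only.
sample = [
    (2004, "http://www.example.ac.uk/library/", "http://www.bl.uk/catalogues/"),
    (2004, "http://www.example.ac.uk/history/", "http://www.bl.uk/"),
    (2004, "http://www.bl.uk/about/", "http://www.example.org.uk/partners"),
    (2004, "http://www.bl.uk/about/", "http://www.bl.uk/contact/"),
]
edges = host_link_graph(sample)
inbound, outbound = links_for_host(edges, "bl.uk")
```

Aggregating to hosts rather than pages keeps the graph small enough to visualise, at the cost of hiding which pages carry the links. Treating `www.bl.uk` and `bl.uk` as one host is a simplifying assumption of this sketch.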

Big UK Domain Data for Arts and Humanities • Funded by the UK Arts and Humanities Research Council as one of the 21 “Big Data” projects • Collaboration between the Institute of Historical Research, Oxford Internet Institute, British Library and Aarhus University • Develop theoretical and methodological framework for the study of web archives • Build on AADDA: researchers and the BL co-produce access tools • A major study of the history of UK web space from 1996 to 2013 + sub-projects covering a range of disciplines • Also an online training course and peer-reviewed journal articles

New projects and initiatives • ALEXANDRIA: Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives • 5-year project funded by the European Research Council • Develop new models and algorithms for retrieval, exploration, and analytics of web archives • Collaborate on common issues, e.g. publication dates versus crawl dates • RESAW, a Research Infrastructure for the Study of Archived Web Materials • Currently a coordinated, self-organising, and self-financing open network • Preparing application for EU’s Horizon 2020 framework

Benefits • Helps researchers understand the value of web archives and explore new ways of using these for scholarly research • Allows BL to obtain hands-on experience with indexing and processing large scale web archive datasets • Analytics and visualisations can be applied to our own Legal Deposit collection • Acts as test-bed for research and development projects • Enables BL to participate in various UK, European and international projects • Helps curators understand characteristics of large scale digital corpora • Improves the way we collect and store web archives

Some Issues • Ownership • Data quality • Different formats, ARC and WARC • Partially de-duplicated • Context • No crawl log or information on data caps applied during crawl time • No detailed information on extraction mechanism • More general issues related to analytical access • Scepticism or suspicion about hidden algorithms behind analysis • Biases in data and how data collection decisions lead to variances in outputs • Need to manage expectations, analysis and visualisation as finished products and first steps • Ethical and privacy issues
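The “partially de-duplicated” point matters for analysis because capture counts change depending on whether identical captures have been collapsed. A minimal sketch of digest-based de-duplication follows, assuming each capture is available as a (URL, timestamp, payload) record. The `Capture` type, the function names and the sample records are hypothetical, and this does not describe how the dataset itself was de-duplicated.

```python
import hashlib
from collections import namedtuple

# Hypothetical record type: one capture of one URL at one crawl time.
Capture = namedtuple("Capture", "url timestamp payload")


def payload_digest(payload):
    """Return a SHA-1 hex digest of the captured bytes."""
    return hashlib.sha1(payload).hexdigest()


def deduplicate(captures):
    """Keep the earliest capture of each (URL, digest) pair.

    Returns the retained captures and the number of dropped duplicates.
    """
    seen = set()
    unique = []
    duplicates = 0
    for capture in sorted(captures, key=lambda c: c.timestamp):
        key = (capture.url, payload_digest(capture.payload))
        if key in seen:
            duplicates += 1
            continue
        seen.add(key)
        unique.append(capture)
    return unique, duplicates


# Hypothetical sample captures, invented for illustration only.
captures = [
    Capture("http://example.org.uk/", "20050101000000", b"<html>v1</html>"),
    Capture("http://example.org.uk/", "20060101000000", b"<html>v1</html>"),
    Capture("http://example.org.uk/", "20070101000000", b"<html>v2</html>"),
]
unique, duplicates = deduplicate(captures)
```

In a corpus that is only partially de-duplicated, a page that never changed may be counted once in some crawls and many times in others, so counts over time should be interpreted with that in mind.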

Thank you! Questions? Getting in touch: • Twitter: @ukwebarchive • Email: web-archivist@bl.uk • UK Web Archive: http://www.webarchive.org.uk
