
A Semantic Knowledge Base for the UK Government Web Archive Tom Storrar & Claire Newing



Presentation Transcript


  1. Applying records management processes principles to the open government record A Semantic Knowledge Base for the UK Government Web Archive Tom Storrar & Claire Newing

  2. Overview • The National Archives’ Digital Strategy: • An overview of the SKB project, including: • The Problem • The Solution • Next Steps

  3. Introducing the UK Government Web Archive • More than 18,000 crawls of over 3,000 websites from 1996 to 2014 • Approximately 90 TB of data, 3.5 billion resources • More than 875,000 ARC files • More than 20 million pageviews and 2-3 million visits per month

  4. Who are our users and what do they want? • User surveys on website: all banners and index pages • Established that UKGWA is regularly visited by a great variety of users. • The biggest area for dissatisfaction was found to be the existing search functions. • We constructed user stories so we could test the improvements.

  5. Full Text Search – its limitations Our full text search is very useful and very much used, but is • limited by how the live sites were at crawl time • noisy as it contains much duplicate or near-duplicate material • reliant on keyword matching • most useful when combined with specialist knowledge

  6. Semantic Search – What it allows • Aim was to improve access to information in the UKGWA by providing far richer information about what it contains • The semantic web is a start to tackling a limitation of the web • Becomes a dataset in its own right • Borrows from and contributes to the web • Technology is open and machine-readable. APIs allow the data to be easily queried and integrated with other services • Awarded to a consortium led by Ontotext AD, the University of Sheffield and System Simulation

  7. UKGWA: a good candidate for semantic search? • Each resource already has a persistent HTTP URI • UKGWA is both limited and diverse • Generic and domain-specific meanings can be attributed to otherwise loose terms, e.g: • Facts can be modelled and refined to show the linkages between entities and how they change over time • 2010 general election was opportunity to demonstrate concept
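The idea on slide 7 of modelling facts so that they show how linkages between entities change over time can be sketched roughly as below. This is only an illustration, not the SKB data model: the `Fact` class, the `holdsRole` predicate and the `holders` function are hypothetical names, and the two sample facts are general public knowledge about the 2010 change of government, not data taken from the knowledge base.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """A subject-predicate-object statement with a validity interval (hypothetical model)."""
    subject: str
    predicate: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still valid


# Illustrative sample facts only; not extracted from the SKB.
FACTS = [
    Fact("Gordon Brown", "holdsRole", "Prime Minister", date(2007, 6, 27), date(2010, 5, 11)),
    Fact("David Cameron", "holdsRole", "Prime Minister", date(2010, 5, 11)),
]


def holders(role: str, on: date, facts: List[Fact] = FACTS) -> List[str]:
    """Return the subjects holding `role` on the given date (interval end is exclusive)."""
    return [
        f.subject
        for f in facts
        if f.predicate == "holdsRole"
        and f.obj == role
        and f.valid_from <= on
        and (f.valid_to is None or on < f.valid_to)
    ]
```

With facts scoped in this way, a loose term such as "Prime Minister" on a page crawled in April 2010 can be resolved to a different person than the same term on a page crawled in June 2010.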

  8. Making UKGWA semantic – How? Image: Ontotext AD, University of Sheffield and System Simulation.

  9. What we learned and next steps • We will deliver it as an internal system to develop further • It’s not AI! 60-70% annotation accuracy is not bad at this scale! • The concept can be difficult to explain, and even harder for those unfamiliar with computer science to use (SPARQL etc.):

       prefix skb: <http://proton.semanticweb.org/skb-ont#>
       prefix xsd: <http://www.w3.org/2001/XMLSchema#>
       select distinct ?URL ?title
       where {
         ?page <http://ordi.ontotext.com/sar#hasFeature> ?doc_feature .
         ?doc_feature <http://ordi.ontotext.com/sar#hasValue> ?URL .
         ?doc_feature <http://ordi.ontotext.com/sar#hasKey> "WEBARCHIVEURL" .
         ?page <http://proton.semanticweb.org/2006/05/protont#title> ?title .
         FILTER regex(str(?title), "Foot and Mouth", "i") .
         FILTER regex(str(?title), "Prime Minister", "i") .
         # The object of this last pattern is missing from the transcript; ?date is an assumed placeholder.
         ?page <http://proton.semanticweb.org/2006/05/protont#hasDate> ?date .
       }

     • So, integrating the system with other services is a must.
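One way to read the point on slide 9 about integration is that another service could build the SPARQL on the user's behalf. The sketch below is a minimal illustration under that assumption: `build_title_query` and `sparql_string` are hypothetical helper names, the predicate URIs are copied from the query on the slide, and the incomplete `hasDate` pattern is left out. Each phrase is passed to `regex` as a pattern, as in the original query, so regular-expression metacharacters are not neutralised.

```python
from typing import Iterable

# Predicate namespaces copied from the query shown on slide 9.
SAR = "http://ordi.ontotext.com/sar#"
PROTONT = "http://proton.semanticweb.org/2006/05/protont#"


def sparql_string(text: str) -> str:
    """Quote text as a SPARQL string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_title_query(phrases: Iterable[str]) -> str:
    """Build a query for archived pages whose title matches every phrase, ignoring case."""
    phrase_list = list(phrases)
    if not phrase_list:
        raise ValueError("at least one phrase is required")
    # One FILTER per phrase, so all phrases must match the same title.
    filters = "\n".join(
        f'  FILTER regex(str(?title), {sparql_string(p)}, "i") .' for p in phrase_list
    )
    return (
        "select distinct ?URL ?title where {\n"
        f"  ?page <{SAR}hasFeature> ?doc_feature .\n"
        f"  ?doc_feature <{SAR}hasValue> ?URL .\n"
        f'  ?doc_feature <{SAR}hasKey> "WEBARCHIVEURL" .\n'
        f"  ?page <{PROTONT}title> ?title .\n"
        f"{filters}\n"
        "}"
    )
```

Calling `build_title_query(["Foot and Mouth", "Prime Minister"])` yields a query with the same patterns and filters as the one on the slide, which a search form or other front end could then send to whatever SPARQL endpoint the system exposes.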

  10. Applying records management processes principles to the open government record Any Questions? Contact us: webarchive@nationalarchives.gsi.gov.uk Visit: nationalarchives.gov.uk/webarchive
