Building and Publishing Social-Science Datasets with RDF and Linked Data


Presentation Transcript


  1. Building and Publishing Social-Science Datasets with RDF and Linked Data Kevin Feeney, Trinity College Dublin For Concordance in Macrohistorical Datasets – Santa Fe Institute, May 6, 2015

  2. Overview Part 1: Representing complex entities • Knowledge Graphs • The Semantic Web • Linked Data Part 2: Seshat Knowledge Model Part 3: Dacura dataset curation architecture

  3. Representing Data • Tables of one form or another remain the predominant means of representing and storing machine-readable data. • RDBMSs (relational database management systems), Excel • Not a good way of representing knowledge • Minimal expressivity, both structurally and semantically • Extremely basic support for capturing relationships between entities

  4. Typical Web Publishing Architecture • Structurally rigid and expensive to change • Much of the data-structure has to be supplied by the application that consumes the data. [Diagram: web browser → web server → application code → SQL → relational database; the data record, its structure and its local identifiers stay fixed, static and hidden behind non-standard, brittle app code built on pre-web technology.]

  5. Knowledge Graphs • Labelled directed graphs • Nodes represent entities • Edges represent relationships between entities • A graph can be considered as a set of triples: subject – predicate – object

  6. Why Graphs Just a better way of representing knowledge • Simple low-level representation: no loss in expressivity. • Links between entities are manifest in the structure – everything is linked into a single explicit knowledge space. [Diagram: the Roman Principate has a Population Value with dacura:minValue 400000 and dacura:maxValue 550000, seshat:validFrom 100BCE, seshat:validTo 5BCE and prov:source Tacitus.]
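
A minimal sketch of how the population example above could be written as subject–predicate–object triples, using Python with the rdflib library (rdflib is an illustrative tool choice, and the example.org namespace URIs are placeholders; neither is specified in the slides):

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import XSD

    # Placeholder namespaces; the real Seshat/Dacura URIs are not given in the slides.
    SESHAT = Namespace("http://example.org/seshat#")
    DACURA = Namespace("http://example.org/dacura#")
    PROV = Namespace("http://www.w3.org/ns/prov#")

    g = Graph()
    g.bind("seshat", SESHAT)
    g.bind("dacura", DACURA)
    g.bind("prov", PROV)

    polity = SESHAT.RomanPrincipate
    population = URIRef("http://example.org/seshat#RomanPrincipatePopulation")

    # Every statement in the graph is a (subject, predicate, object) triple.
    g.add((polity, SESHAT.hasPopulation, population))
    g.add((population, DACURA.minValue, Literal(400000, datatype=XSD.integer)))
    g.add((population, DACURA.maxValue, Literal(550000, datatype=XSD.integer)))
    g.add((population, SESHAT.validFrom, Literal("100BCE")))
    g.add((population, SESHAT.validTo, Literal("5BCE")))
    g.add((population, PROV.source, Literal("Tacitus")))  # property name follows the slide's label

    print(g.serialize(format="turtle"))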

  7. Graphs & Machines Machines like Graphs • Great expressive power in specifying domain knowledge in machine-interpretable, decidable ways, • Allowing extremely rich descriptions of domains: http://www.w3.org/TR/owl-guide/wine.rdf • Amenable to automated reasoning Much greater structural flexibility • Expressive ability gives you power to change whole graph structure in one go seshat:polityrdf:typefoaf:person

  8. Mature Open Standards • RDF / RDFS: Resource Description Framework & Schema • W3C standard (2004) for meta-data and knowledge modelling • Represents facts and fact structure • Graph-based, supports a referential semantics • OWL: Web Ontology Language • W3C standard (2004): rich language for describing relationships between classes and their properties • SPARQL • W3C standard (2009): query language for graph data.
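
A small sketch of querying graph data with SPARQL through rdflib, reusing the placeholder namespaces from the earlier example (the data and URIs are illustrative):

    from rdflib import Graph

    g = Graph()
    g.parse(data="""
    @prefix seshat: <http://example.org/seshat#> .
    @prefix dacura: <http://example.org/dacura#> .

    seshat:RomanPrincipate seshat:hasPopulation seshat:RomanPrincipatePopulation .
    seshat:RomanPrincipatePopulation dacura:minValue 400000 ;
                                     dacura:maxValue 550000 .
    """, format="turtle")

    # SPARQL query: find every polity with a recorded minimum population.
    results = g.query("""
        PREFIX seshat: <http://example.org/seshat#>
        PREFIX dacura: <http://example.org/dacura#>
        SELECT ?polity ?min WHERE {
            ?polity seshat:hasPopulation ?pop .
            ?pop dacura:minValue ?min .
        }
    """)

    for polity, minimum in results:
        print(polity, minimum)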

  9. Ontologies & Semantic Web • Ontologies are graphs which express, rich, structured machine-readable descriptions of a particular domain or theme. • The semantic web is the interlinking of ontologies into a universal knowledge graph. • Self-describing: the meaning of the data, the relationships between entities, and whatever aspects of the context are considered important can be embedded in the graph itself. • Widespread adoption in large-scale scientific knowledge projects • Gene ontology: http://geneontology.org/ • Biomedical: http://www.obofoundry.org/ • Climate ontology: https://cds.nccs.nasa.gov/tools-services/ontology/

  10. Linked Data • Using web model to publish and link raw structured data • Web model: • There is a universal space of information • Documents have global identifiers (addresses) in this space • Documents can reference (link) to other documents • Raw structured data: • Numbers, facts, assertions • Ordered, related and labelled • With a well-defined semantics Definition based on http://www.cambridgesemantics.com/semantic-university/introduction-to-linked-data
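
A sketch of the web model applied to raw data: a client dereferences a global identifier and receives machine-readable RDF back (rdflib is an illustrative choice, and the example assumes DBpedia still serves RDF for this resource via content negotiation):

    from rdflib import Graph, URIRef

    tacitus = URIRef("http://dbpedia.org/resource/Tacitus")

    g = Graph()
    # Dereference the identifier over HTTP; the response is structured data
    # about the resource, including links to further resources.
    g.parse(tacitus)

    # Print the outgoing (predicate, object) links for this entity.
    for predicate, obj in g.predicate_objects(subject=tacitus):
        print(predicate, obj)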

  11. Linked Data Advantages • Publish data and vocabulary into a unified information space (the web of data) • Data sets • Can be linked • Can be distributed across web • Existing data-sets to link to • Raw data is machine readable • Vocabularies • Common vocabularies defined • Vocabulary specification languages • Standard web data query language (SPARQL) • Supports semantic and schema query • Separation of data from human-centric presentation

  12. Web of Documents Vs. Web of Data [Diagram: web-of-documents pages contrasted with a web-of-data graph in which seshat:RomanPrincipate is rdf:type seshat:Polity and seshat:hasPopulation 500000 with prov:source dbpedia:Tacitus, which is rdf:type dbpedia-owl:Writer.]

  13. Linked Data Data-sets, 2013 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

  14. Linked Data Data-sets, 2013 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

  15. State of the Art: Semantic Web • 20+ year scientific research program – formalised, characterised and specified much of the problem domain • Significant commercial activity in scalable triple-stores, ontology engineering tools. • Focus is on publishing data, not collecting it • Mostly for scientific projects with lots of resources • Little integration into practical commercial publishing pipelines • Very limited support for controlling structure of the data – given the flexibility of graph structure, this is a big problem. • Very limited support for management of linked datasets by non-knowledge engineers.

  16. Part 2 Seshat Knowledge Model

  17. Meta-Model – Class Hierarchy [Class hierarchy diagram: top-level classes Organization, Event and Territory, related by ExistsWithin [start, end] and Controls [start, end], with value types Coordinates [(long, lat), …] and Duration [start, end]. Territories: NGA (Natural Geographic Area), FFA (Free Form Area), City. Organizations: Political Authority, Polity, Sub-Polity, Quasi-Polity, Interest Group, Religious System. Events: Battle, War.]

  18. Expressing Uncertainty and Temporal Bounds [Diagram: an entity's hasPopulation property points to a Population Value node typed as both Temporally Bounded Value and Value Range, with dacura:minValue 50000, dacura:maxValue 60000, dacura:validFrom 100BCE and dacura:validTo 80BCE.]
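
A sketch of how such a temporally bounded value range might be encoded, again with rdflib; the class and property names follow the slide's labels, while the namespace URIs and the entity are placeholders:

    from rdflib import BNode, Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    SESHAT = Namespace("http://example.org/seshat#")   # placeholder URI
    DACURA = Namespace("http://example.org/dacura#")   # placeholder URI

    g = Graph()
    entity = SESHAT.SomePolity      # hypothetical entity
    value = BNode()                 # the population value node

    g.add((entity, SESHAT.hasPopulation, value))
    # The value node is typed both as a value range and as temporally bounded.
    g.add((value, RDF.type, DACURA.ValueRange))
    g.add((value, RDF.type, DACURA.TemporallyBoundedValue))
    g.add((value, DACURA.minValue, Literal(50000, datatype=XSD.integer)))
    g.add((value, DACURA.maxValue, Literal(60000, datatype=XSD.integer)))
    g.add((value, DACURA.validFrom, Literal("100BCE")))
    g.add((value, DACURA.validTo, Literal("80BCE")))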

  19. Expressing Uncertainty and Temporal Bounds (2) [Diagram: an entity's seshat:hasCapital property points to a Capital City node typed as both Temporally Bounded Value and Uncertain Value List, with dacura:possibleValue Rome, dacura:possibleValue Byzantium, dacura:validFrom 400CE and dacura:validTo 420CE.]
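
The uncertain-value pattern can be sketched in the same way: contradictory candidate capitals are both kept as possible values on one temporally bounded node (names follow the slide; URIs and the entity are placeholders):

    from rdflib import BNode, Graph, Literal, Namespace
    from rdflib.namespace import RDF

    SESHAT = Namespace("http://example.org/seshat#")   # placeholder URI
    DACURA = Namespace("http://example.org/dacura#")   # placeholder URI

    g = Graph()
    entity = SESHAT.SomePolity      # hypothetical entity
    capital = BNode()               # the capital-city value node

    g.add((entity, SESHAT.hasCapital, capital))
    g.add((capital, RDF.type, DACURA.UncertainValueList))
    g.add((capital, RDF.type, DACURA.TemporallyBoundedValue))
    # Both attested values are retained rather than forcing a premature choice.
    g.add((capital, DACURA.possibleValue, Literal("Rome")))
    g.add((capital, DACURA.possibleValue, Literal("Byzantium")))
    g.add((capital, DACURA.validFrom, Literal("400CE")))
    g.add((capital, DACURA.validTo, Literal("420CE")))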

  20. Named Graphs – Provenance & Annotations [Diagram: each geo-temporally scoped value graph (V) is paired with a provenance graph (P) and an annotation graph (A).]
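
A sketch of this partitioning using rdflib's Dataset, which keeps value, provenance and annotation statements in separate, individually addressable named graphs (graph names and URIs are placeholders):

    from rdflib import Dataset, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")               # placeholder URIs throughout
    PROV = Namespace("http://www.w3.org/ns/prov#")

    ds = Dataset()

    # V: the geo-temporally scoped value graph.
    values = ds.graph(URIRef("http://example.org/graphs/values"))
    values.add((EX.RomanPrincipate, EX.hasPopulation, Literal(500000)))

    # P: the provenance graph recording where the value came from.
    provenance = ds.graph(URIRef("http://example.org/graphs/provenance"))
    provenance.add((EX.PopulationValue, PROV.wasDerivedFrom, EX.Tacitus))

    # A: the annotation graph for editorial notes, disagreement flags, etc.
    annotations = ds.graph(URIRef("http://example.org/graphs/annotations"))
    annotations.add((EX.PopulationValue, EX.comment, Literal("Estimate disputed")))

    # Serialise with the graph names preserved (TriG).
    print(ds.serialize(format="trig"))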

  21. Provenance • Goal is to capture, record and make searchable all provenance and processing information associated with every dataset entry, all the way back to the original artefacts… • W3C PROV ontology http://www.w3.org/TR/prov-overview/ [Diagram: a Population Value prov:wasDerivedFrom the historian Tacitus and prov:wasGeneratedBy an NGA Coding activity involving a seshat:expert (Ed) and a seshat:RA (Joe), with prov:wasInformedBy relating the activities involved.]
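
A sketch of that provenance chain expressed with W3C PROV-O terms in rdflib (the individuals follow the slide's example; their URIs, and the use of prov:wasAssociatedWith for the agents, are illustrative assumptions):

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    PROV = Namespace("http://www.w3.org/ns/prov#")
    EX = Namespace("http://example.org/seshat#")    # placeholder URI

    g = Graph()
    g.bind("prov", PROV)

    # The coded population value is an entity derived from the original source.
    g.add((EX.PopulationValue, RDF.type, PROV.Entity))
    g.add((EX.PopulationValue, PROV.wasDerivedFrom, EX.Tacitus))

    # It was generated by an NGA coding activity carried out by an expert and an RA.
    g.add((EX.PopulationValue, PROV.wasGeneratedBy, EX.NGACoding))
    g.add((EX.NGACoding, RDF.type, PROV.Activity))
    g.add((EX.NGACoding, PROV.wasAssociatedWith, EX.Ed))   # seshat:expert
    g.add((EX.NGACoding, PROV.wasAssociatedWith, EX.Joe))  # seshat:RA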

  22. Part 3 Dacura & Seshat Dataset Curation Architecture

  23. Funding & Consortia • Aligned: 3 year €4m EU Horizon 2020– 3 years http://aligned-project.eu • ADAPT: €30m Irish Computer Science Research Centre http://adaptcentre.ie/ • Axial-Age Religions and the Z-Curve of Human Egalitarianism

  24. Knowledge modelling – where we are • Linked Data and W3C standards provide a solid foundation on which complex scientific datasets can be constructed. • The web of data provides access to lots of useful information that we can exploit in building rich, useful datasets. • The regular web provides access to large numbers of potential volunteers and vast quantities of data, mostly expressed in natural language. • Machine learning, information retrieval and automated knowledge extraction keep getting better and can be usefully deployed to save labour in multiple aspects of the process. • Still ‘not good enough’ at best: we still need humans in the loop. • Not quite good enough is a very big problem at scale: the “firehose” effect. • Basic shortcomings • Everything is targeted at knowledge engineers – very little towards dataset managers • Very little support for dataset life-cycle management.

  25. What do we want to do? [Diagram: electronic archives, a community of experts & volunteers, and data consumers provide collective intelligence and feedback to the Seshat Databank, which delivers high-quality data back to data consumers.] “improve the extraction of collective intelligence from electronic archives, research communities and data consumers to improve the quality of published data”

  26. Goals • Develop a rich knowledge model for Seshat using Linked Data technologies both to exploit opportunities to incorporate data from third parties and to encourage reuse in turn by third parties. • Develop tools to enable the Seshat editors to manage the state of their datasets over time. • Develop a process model to support dataset compilation • Deploy automation wherever possible in the data-collection pipeline to minimise the requirement for human effort and maximise the rate at which high quality knowledge accretes.

  27. DaCura: Generic Data-Quality Workflow Goal is to minimise work requirements from expert users (domain expert, architect) and to ensure data-quality in different dimensions at different steps in the process.

  28. Candidate Generation • Harvesters (human or machine) inspect sources and extract relevant entities • The process is inherently error-prone: different harvesters may disagree on what entities are present in the sources. • We cannot trust extracted entities automatically – hence they are considered as ‘Candidates’ [Diagram: Harvester 1 and Harvester 2 extract overlapping but differing entity sets (E1–E4) from Source A and Source B.]

  29. From Candidates to Reports • Editors (human or machine) take all of the candidates for any source and produce “reports” • Candidate Graph may contain many contradictory accounts of what a source contains • Reports are a single interpretation of what a source contains in a structured form [Diagram: an editor reduces the harvesters’ candidate graph to one report graph per source (Report A, Report B).]

  30. From Reports to Interpretations • Entity recognition – different reports may refer to the same underlying entity • Property flattening – there may be multiple, contradictory values for entity properties in reports; these must be flattened into an interpretation that is consistent with the real universe. [Diagram: entities in the report graph are resolved into a resolved entity graph, and their properties (P1, P2) are flattened into a single interpretation graph.]
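
A rough, hypothetical sketch in plain Python (not the actual Dacura code) of what property flattening over resolved entities could look like, collapsing contradictory report values into a single interpreted value per property:

    from collections import Counter, defaultdict

    # Hypothetical (entity, property, value) statements after entity resolution.
    reports = [
        ("E1", "hasPopulation", 500000),
        ("E2", "hasCapital", "Rome"),        # report A
        ("E2", "hasCapital", "Rome"),        # report B agrees
        ("E2", "hasCapital", "Byzantium"),   # report C disagrees
        ("E4", "hasPopulation", 400000),
    ]

    def flatten(statements):
        """Pick one value per (entity, property); here the most frequently
        reported value wins. A real pipeline might instead keep all values
        as an uncertain value list for an editor to resolve."""
        grouped = defaultdict(list)
        for entity, prop, value in statements:
            grouped[(entity, prop)].append(value)
        return {key: Counter(vals).most_common(1)[0][0] for key, vals in grouped.items()}

    print(flatten(reports))
    # {('E1', 'hasPopulation'): 500000, ('E2', 'hasCapital'): 'Rome', ('E4', 'hasPopulation'): 400000}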

  31. DaCura Harvesting Tools

  32. Publishing USPV dataset as linked data • Dataset published at: http://dacura.scss.tcd.ie/pv • Publication: Publishing Social Sciences Datasets as Linked Data: a Political Violence Case Study, ENRICH workshop, SIGIR 2013

  33. Migrating Seshat to a Graph Current situation: wiki, flat structure • Good choice for initial trials of data capture • Not manageable over time Phase 1: Develop Knowledge Model and process integration • Use knowledge model to provide data validation for Seshat variables. • Develop scraper code which will create dumps in spreadsheets. Phase 2: Incorporate Dacura tools into Seshat wiki • User Interfaces which prevent errors, capture provenance • Visualisations, Linked Data Publishing

  34. Phase 1 Schema Definition & Validation • Schema Definition • Vocabulary selection • Formal Modelling • Import wiki data into Dacura • Semantic Mapping • Automated validation of wiki data against schema • Generation of warnings for editors • Publication via SPARQL endpoint • Wiki remains authoritative source [Diagram: records are automatically harvested from the Seshat wiki into the Dacura Seshat triple store, checked against the Seshat schema by automated quality analysis, which sends warnings to the Seshat editor, and published via a SPARQL endpoint.]
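
One way to picture the automated quality analysis is as SPARQL checks run over the harvested triples, emitting warnings for records that break the schema's expectations; the check below is purely illustrative, not the actual Dacura implementation, and reuses the placeholder namespaces from earlier:

    from rdflib import Graph

    g = Graph()
    # A harvested record with an inconsistent population range (min > max).
    g.parse(data="""
    @prefix seshat: <http://example.org/seshat#> .
    @prefix dacura: <http://example.org/dacura#> .

    seshat:RomanPrincipate seshat:hasPopulation seshat:Pop1 .
    seshat:Pop1 dacura:minValue 550000 ;
                dacura:maxValue 400000 .
    """, format="turtle")

    warnings = g.query("""
        PREFIX dacura: <http://example.org/dacura#>
        SELECT ?value ?min ?max WHERE {
            ?value dacura:minValue ?min ;
                   dacura:maxValue ?max .
            FILTER (?min > ?max)
        }
    """)

    for value, minimum, maximum in warnings:
        print(f"Warning: {value} has minValue {minimum} greater than maxValue {maximum}")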

  35. [Architecture diagram: Seshat contributors enter data into the wiki, supported by data-quality controls, data syntax validation and the Seshat code book; the Seshat administrator runs the data validation and export tools, producing TSV data dump files; data extraction, TSV-to-RDF conversion, data transformations and links to other datasets feed an RDF triple store governed by the Seshat schema and data knowledge models, with linked-data publication; Seshat analysts and data consumers use the Seshat data web pages, time-series pre-processing and analysis, data visualisations and data export; errors and suggested corrections go back to the Seshat editor, and knowledge engineers convert feedback into the knowledge model.]

  36. Where we are • Demo • The End!
