Building and Publishing Social-Science Datasets with RDF and Linked Data


Presentation Transcript


  1. Building and Publishing Social-Science Datasets with RDF and Linked Data Kevin Feeney, Trinity College Dublin For Concordance in Macrohistorical Datasets – Santa Fe Institute, May 6, 2015

  2. Overview Part 1: Representing complex entities • Knowledge Graphs • The Semantic Web • Linked Data Part 2: Seshat Knowledge Model Part 3: Dacura dataset curation architecture

  3. Representing Data • Tables of one form or another remain the predominant means of representing and storing machine-readable data. • RDBMSs (relational database management systems), Excel • Not a good way of representing knowledge • Minimal expressivity, both structurally and semantically • Extremely basic support for capturing relationships between entities

  4. Typical Web Publishing Architecture • Structurally rigid and expensive to change • Much of the data-structure has to be supplied by the application that consumes the data. [Diagram: web browser → web server → application code → SQL → relational database; the data record, its structure and its local identifiers stay fixed, static and hidden behind non-standard, brittle app code built on pre-web technology.]

  5. Knowledge Graphs • Labelled directed graphs • Nodes represent entities • Edges represent relationships between entities • A graph can be considered as a set of triples: subject – predicate – object

  6. Why Graphs Just a better way of representing knowledge • Simple low-level representation: no loss in expressivity. • Links between entities are manifest in the structure – everything is linked into a single explicit knowledge space. [Diagram: the Roman Principate has a Population Value with dacura:minValue 400000 and dacura:maxValue 550000, seshat:validFrom 100BCE, seshat:validTo 5BCE and prov:source Tacitus.]
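
A minimal sketch of how the population example above could be written as subject–predicate–object triples, using Python with the rdflib library (rdflib is an illustrative tool choice, and the example.org namespace URIs are placeholders; neither is specified in the slides):

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import XSD

    # Placeholder namespaces; the real Seshat/Dacura URIs are not given in the slides.
    SESHAT = Namespace("http://example.org/seshat#")
    DACURA = Namespace("http://example.org/dacura#")
    PROV = Namespace("http://www.w3.org/ns/prov#")

    g = Graph()
    g.bind("seshat", SESHAT)
    g.bind("dacura", DACURA)
    g.bind("prov", PROV)

    polity = SESHAT.RomanPrincipate
    population = URIRef("http://example.org/seshat#RomanPrincipatePopulation")

    # Every statement in the graph is a (subject, predicate, object) triple.
    g.add((polity, SESHAT.hasPopulation, population))
    g.add((population, DACURA.minValue, Literal(400000, datatype=XSD.integer)))
    g.add((population, DACURA.maxValue, Literal(550000, datatype=XSD.integer)))
    g.add((population, SESHAT.validFrom, Literal("100BCE")))
    g.add((population, SESHAT.validTo, Literal("5BCE")))
    g.add((population, PROV.source, Literal("Tacitus")))  # property name follows the slide's label

    print(g.serialize(format="turtle"))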

  7. Graphs & Machines Machines like Graphs • Great expressive power in specifying domain knowledge in machine-interpretable, decidable ways, • Allowing extremely rich descriptions of domains: http://www.w3.org/TR/owl-guide/wine.rdf • Amenable to automated reasoning Much greater structural flexibility • Expressive ability gives you power to change whole graph structure in one go seshat:polityrdf:typefoaf:person

  8. Mature Open Standards • RDF / RDFS: Resource Description Framework & Schema • W3C standard (2004) for meta-data and knowledge modelling • Represents facts and fact structure • Graph-based, supports a referential semantics • OWL: Web Ontology Language • W3C standard (2004): rich language for describing relationships between classes and their properties • SPARQL • W3C standard (2009): query language for graph data.
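
A small sketch of querying graph data with SPARQL through rdflib, reusing the placeholder namespaces from the earlier example (the data and URIs are illustrative):

    from rdflib import Graph

    g = Graph()
    g.parse(data="""
    @prefix seshat: <http://example.org/seshat#> .
    @prefix dacura: <http://example.org/dacura#> .

    seshat:RomanPrincipate seshat:hasPopulation seshat:RomanPrincipatePopulation .
    seshat:RomanPrincipatePopulation dacura:minValue 400000 ;
                                     dacura:maxValue 550000 .
    """, format="turtle")

    # SPARQL query: find every polity with a recorded minimum population.
    results = g.query("""
        PREFIX seshat: <http://example.org/seshat#>
        PREFIX dacura: <http://example.org/dacura#>
        SELECT ?polity ?min WHERE {
            ?polity seshat:hasPopulation ?pop .
            ?pop dacura:minValue ?min .
        }
    """)

    for polity, minimum in results:
        print(polity, minimum)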

  9. Ontologies & Semantic Web • Ontologies are graphs which express, rich, structured machine-readable descriptions of a particular domain or theme. • The semantic web is the interlinking of ontologies into a universal knowledge graph. • Self-describing: the meaning of the data, the relationships between entities, and whatever aspects of the context are considered important can be embedded in the graph itself. • Widespread adoption in large-scale scientific knowledge projects • Gene ontology: http://geneontology.org/ • Biomedical: http://www.obofoundry.org/ • Climate ontology: https://cds.nccs.nasa.gov/tools-services/ontology/

  10. Linked Data • Using web model to publish and link raw structured data • Web model: • There is a universal space of information • Documents have global identifiers (addresses) in this space • Documents can reference (link) to other documents • Raw structured data: • Numbers, facts, assertions • Ordered, related and labelled • With a well-defined semantics Definition based on http://www.cambridgesemantics.com/semantic-university/introduction-to-linked-data
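
A sketch of the web model applied to raw data: a client dereferences a global identifier and receives machine-readable RDF back (rdflib is an illustrative choice, and the example assumes DBpedia still serves RDF for this resource via content negotiation):

    from rdflib import Graph, URIRef

    tacitus = URIRef("http://dbpedia.org/resource/Tacitus")

    g = Graph()
    # Dereference the identifier over HTTP; the response is structured data
    # about the resource, including links to further resources.
    g.parse(tacitus)

    # Print the outgoing (predicate, object) links for this entity.
    for predicate, obj in g.predicate_objects(subject=tacitus):
        print(predicate, obj)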

  11. Linked Data Advantages • Publish data and vocabulary into a unified information space (the web of data) • Data sets • Can be linked • Can be distributed across web • Existing data-sets to link to • Raw data is machine readable • Vocabularies • Common vocabularies defined • Vocabulary specification languages • Standard web data query language (SPARQL) • Supports semantic and schema query • Separation of data from human-centric presentation

  12. Web of Documents Vs. Web of Data [Diagram: web-of-documents pages contrasted with a web-of-data graph in which seshat:RomanPrincipate is rdf:type seshat:Polity and seshat:hasPopulation 500000 with prov:source dbpedia:Tacitus, which is rdf:type dbpedia-owl:Writer.]

  13. Linked Data Data-sets, 2013 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

  14. Linked Data Data-sets, 2013 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

  15. State of the Art: Semantic Web • 20+ year scientific research program – formalised, characterised and specified much of the problem domain • Significant commercial activity in scalable triple-stores, ontology engineering tools. • Focus is on publishing data, not collecting it • Mostly for scientific projects with lots of resources • Little integration into practical commercial publishing pipelines • Very limited support for controlling structure of the data – given the flexibility of graph structure, this is a big problem. • Very limited support for management of linked datasets by non-knowledge engineers.

  16. Part 2 Seshat Knowledge Model

  17. Meta-Model – Class Hierarchy [Class hierarchy diagram: top-level classes Organization, Event and Territory, related by ExistsWithin [start, end] and Controls [start, end], with value types Coordinates [(long, lat), …] and Duration [start, end]. Territories: NGA (Natural Geographic Area), FFA (Free Form Area), City. Organizations: Political Authority, Polity, Sub-Polity, Quasi-Polity, Interest Group, Religious System. Events: Battle, War.]

  18. Expressing Uncertainty and Temporal Bounds [Diagram: an entity's hasPopulation property points to a Population Value node typed as both Temporally Bounded Value and Value Range, with dacura:minValue 50000, dacura:maxValue 60000, dacura:validFrom 100BCE and dacura:validTo 80BCE.]
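
A sketch of how such a temporally bounded value range might be encoded, again with rdflib; the class and property names follow the slide's labels, while the namespace URIs and the entity are placeholders:

    from rdflib import BNode, Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    SESHAT = Namespace("http://example.org/seshat#")   # placeholder URI
    DACURA = Namespace("http://example.org/dacura#")   # placeholder URI

    g = Graph()
    entity = SESHAT.SomePolity      # hypothetical entity
    value = BNode()                 # the population value node

    g.add((entity, SESHAT.hasPopulation, value))
    # The value node is typed both as a value range and as temporally bounded.
    g.add((value, RDF.type, DACURA.ValueRange))
    g.add((value, RDF.type, DACURA.TemporallyBoundedValue))
    g.add((value, DACURA.minValue, Literal(50000, datatype=XSD.integer)))
    g.add((value, DACURA.maxValue, Literal(60000, datatype=XSD.integer)))
    g.add((value, DACURA.validFrom, Literal("100BCE")))
    g.add((value, DACURA.validTo, Literal("80BCE")))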

  19. Expressing Uncertainty and Temporal Bounds (2) [Diagram: an entity's seshat:hasCapital property points to a Capital City node typed as both Temporally Bounded Value and Uncertain Value List, with dacura:possibleValue Rome, dacura:possibleValue Byzantium, dacura:validFrom 400CE and dacura:validTo 420CE.]
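
The uncertain-value pattern can be sketched in the same way: contradictory candidate capitals are both kept as possible values on one temporally bounded node (names follow the slide; URIs and the entity are placeholders):

    from rdflib import BNode, Graph, Literal, Namespace
    from rdflib.namespace import RDF

    SESHAT = Namespace("http://example.org/seshat#")   # placeholder URI
    DACURA = Namespace("http://example.org/dacura#")   # placeholder URI

    g = Graph()
    entity = SESHAT.SomePolity      # hypothetical entity
    capital = BNode()               # the capital-city value node

    g.add((entity, SESHAT.hasCapital, capital))
    g.add((capital, RDF.type, DACURA.UncertainValueList))
    g.add((capital, RDF.type, DACURA.TemporallyBoundedValue))
    # Both attested values are retained rather than forcing a premature choice.
    g.add((capital, DACURA.possibleValue, Literal("Rome")))
    g.add((capital, DACURA.possibleValue, Literal("Byzantium")))
    g.add((capital, DACURA.validFrom, Literal("400CE")))
    g.add((capital, DACURA.validTo, Literal("420CE")))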

  20. Named Graphs – Provenance & Annotations [Diagram: each geo-temporally scoped value graph (V) is paired with a provenance graph (P) and an annotation graph (A).]
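
A sketch of this partitioning using rdflib's Dataset, which keeps value, provenance and annotation statements in separate, individually addressable named graphs (graph names and URIs are placeholders):

    from rdflib import Dataset, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")               # placeholder URIs throughout
    PROV = Namespace("http://www.w3.org/ns/prov#")

    ds = Dataset()

    # V: the geo-temporally scoped value graph.
    values = ds.graph(URIRef("http://example.org/graphs/values"))
    values.add((EX.RomanPrincipate, EX.hasPopulation, Literal(500000)))

    # P: the provenance graph recording where the value came from.
    provenance = ds.graph(URIRef("http://example.org/graphs/provenance"))
    provenance.add((EX.PopulationValue, PROV.wasDerivedFrom, EX.Tacitus))

    # A: the annotation graph for editorial notes, disagreement flags, etc.
    annotations = ds.graph(URIRef("http://example.org/graphs/annotations"))
    annotations.add((EX.PopulationValue, EX.comment, Literal("Estimate disputed")))

    # Serialise with the graph names preserved (TriG).
    print(ds.serialize(format="trig"))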

  21. Provenance • Goal is to capture, record and make searchable all provenance and processing information associated with every dataset entry, all the way back to the original artefacts… • W3C PROV ontology http://www.w3.org/TR/prov-overview/ [Diagram: a Population Value prov:wasDerivedFrom the historian Tacitus and prov:wasGeneratedBy an NGA Coding activity involving a seshat:expert (Ed) and a seshat:RA (Joe), with prov:wasInformedBy relating the activities involved.]
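
A sketch of that provenance chain expressed with W3C PROV-O terms in rdflib (the individuals follow the slide's example; their URIs, and the use of prov:wasAssociatedWith for the agents, are illustrative assumptions):

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    PROV = Namespace("http://www.w3.org/ns/prov#")
    EX = Namespace("http://example.org/seshat#")    # placeholder URI

    g = Graph()
    g.bind("prov", PROV)

    # The coded population value is an entity derived from the original source.
    g.add((EX.PopulationValue, RDF.type, PROV.Entity))
    g.add((EX.PopulationValue, PROV.wasDerivedFrom, EX.Tacitus))

    # It was generated by an NGA coding activity carried out by an expert and an RA.
    g.add((EX.PopulationValue, PROV.wasGeneratedBy, EX.NGACoding))
    g.add((EX.NGACoding, RDF.type, PROV.Activity))
    g.add((EX.NGACoding, PROV.wasAssociatedWith, EX.Ed))   # seshat:expert
    g.add((EX.NGACoding, PROV.wasAssociatedWith, EX.Joe))  # seshat:RA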

  22. Part 3 Dacura & Seshat Dataset Curation Architecture

  23. Funding & Consortia • Aligned: 3 year €4m EU Horizon 2020– 3 years http://aligned-project.eu • ADAPT: €30m Irish Computer Science Research Centre http://adaptcentre.ie/ • Axial-Age Religions and the Z-Curve of Human Egalitarianism

  24. Knowledge modelling – where we are • Linked Data and W3C standards provide a solid foundation on which complex scientific datasets can be constructed. • The web of data provides access to lots of useful information that we can exploit in building rich, useful datasets. • The regular web provides access to large numbers of potential volunteers and vast quantities of data, mostly expressed in natural language. • Machine learning, information retrieval and automated knowledge extraction keep getting better and can be usefully deployed to save labour in multiple aspects of the process. • Still ‘not good enough’ at best: we still need humans in the loop. • Not quite good enough is a very big problem at scale: the “firehose” effect. • Basic shortcomings • Everything is targeted at knowledge engineers – very little towards dataset managers • Very little support for dataset life-cycle management.

  25. What do we want to do? [Diagram: electronic archives, a community of experts & volunteers, and data consumers provide collective intelligence and feedback to the Seshat Databank, which delivers high-quality data back to data consumers.] “improve the extraction of collective intelligence from electronic archives, research communities and data consumers to improve the quality of published data”

  26. Goals • Develop a rich knowledge model for Seshat using Linked Data technologies both to exploit opportunities to incorporate data from third parties and to encourage reuse in turn by third parties. • Develop tools to enable the Seshat editors to manage the state of their datasets over time. • Develop a process model to support dataset compilation • Deploy automation wherever possible in the data-collection pipeline to minimise the requirement for human effort and maximise the rate at which high quality knowledge accretes.

  27. DaCura: Generic Data-Quality Workflow Goal is to minimise work requirements from expert users (domain expert, architect) and to ensure data-quality in different dimensions at different steps in the process.

  28. Candidate Generation • Harvesters (human or machine) inspect sources and extract relevant entities • The process is inherently error-prone: different harvesters may disagree on what entities are present in the sources. • We cannot trust extracted entities automatically – hence they are considered as ‘Candidates’ [Diagram: Harvester 1 and Harvester 2 extract overlapping but differing entity sets (E1–E4) from Source A and Source B.]

  29. From Candidates to Reports • Editors (human or machine) take all of the candidates for any source and produce “reports” • Candidate Graph may contain many contradictory accounts of what a source contains • Reports are a single interpretation of what a source contains in a structured form [Diagram: an editor reduces the harvesters’ candidate graph to one report graph per source (Report A, Report B).]

  30. From Reports to Interpretations • Entity recognition – different reports may refer to the same underlying entity • Property flattening – there may be multiple, contradictory values for entity properties in reports; these must be flattened into an interpretation that is consistent with the real universe. [Diagram: entities in the report graph are resolved into a resolved entity graph, and their properties (P1, P2) are flattened into a single interpretation graph.]
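
A rough, hypothetical sketch in plain Python (not the actual Dacura code) of what property flattening over resolved entities could look like, collapsing contradictory report values into a single interpreted value per property:

    from collections import Counter, defaultdict

    # Hypothetical (entity, property, value) statements after entity resolution.
    reports = [
        ("E1", "hasPopulation", 500000),
        ("E2", "hasCapital", "Rome"),        # report A
        ("E2", "hasCapital", "Rome"),        # report B agrees
        ("E2", "hasCapital", "Byzantium"),   # report C disagrees
        ("E4", "hasPopulation", 400000),
    ]

    def flatten(statements):
        """Pick one value per (entity, property); here the most frequently
        reported value wins. A real pipeline might instead keep all values
        as an uncertain value list for an editor to resolve."""
        grouped = defaultdict(list)
        for entity, prop, value in statements:
            grouped[(entity, prop)].append(value)
        return {key: Counter(vals).most_common(1)[0][0] for key, vals in grouped.items()}

    print(flatten(reports))
    # {('E1', 'hasPopulation'): 500000, ('E2', 'hasCapital'): 'Rome', ('E4', 'hasPopulation'): 400000}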

  31. DaCura Harvesting Tools

  32. Publishing USPV dataset as linked data • Dataset published at: http://dacura.scss.tcd.ie/pv • Publication: Publishing Social Sciences Datasets as Linked Data: a Political Violence Case Study, ENRICH workshop, SIGIR 2013

  33. Migrating Seshat to a Graph Current situation: wiki, flat structure • Good choice for initial trials of data capture • Not manageable over time Phase 1: Develop Knowledge Model and process integration • Use knowledge model to provide data validation for Seshat variables. • Develop scraper code which will create dumps in spreadsheets. Phase 2: Incorporate Dacura tools into Seshat wiki • User Interfaces which prevent errors, capture provenance • Visualisations, Linked Data Publishing

  34. Phase 1 Schema Definition & Validation • Schema Definition • Vocabulary selection • Formal Modelling • Import wiki data into Dacura • Semantic Mapping • Automated validation of wiki data against schema • Generation of warnings for editors • Publication via SPARQL endpoint • Wiki remains authoritative source [Diagram: records are automatically harvested from the Seshat wiki into the Dacura Seshat triple store, checked against the Seshat schema by automated quality analysis, which sends warnings to the Seshat editor, and published via a SPARQL endpoint.]
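
One way to picture the automated quality analysis is as SPARQL checks run over the harvested triples, emitting warnings for records that break the schema's expectations; the check below is purely illustrative, not the actual Dacura implementation, and reuses the placeholder namespaces from earlier:

    from rdflib import Graph

    g = Graph()
    # A harvested record with an inconsistent population range (min > max).
    g.parse(data="""
    @prefix seshat: <http://example.org/seshat#> .
    @prefix dacura: <http://example.org/dacura#> .

    seshat:RomanPrincipate seshat:hasPopulation seshat:Pop1 .
    seshat:Pop1 dacura:minValue 550000 ;
                dacura:maxValue 400000 .
    """, format="turtle")

    warnings = g.query("""
        PREFIX dacura: <http://example.org/dacura#>
        SELECT ?value ?min ?max WHERE {
            ?value dacura:minValue ?min ;
                   dacura:maxValue ?max .
            FILTER (?min > ?max)
        }
    """)

    for value, minimum, maximum in warnings:
        print(f"Warning: {value} has minValue {minimum} greater than maxValue {maximum}")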

  35. [Architecture diagram: Seshat contributors enter data into the wiki, supported by data-quality controls, data syntax validation and the Seshat code book; the Seshat administrator runs the data validation and export tools, producing TSV data dump files; data extraction, TSV-to-RDF conversion, data transformations and links to other datasets feed an RDF triple store governed by the Seshat schema and data knowledge models, with linked-data publication; Seshat analysts and data consumers use the Seshat data web pages, time-series pre-processing and analysis, data visualisations and data export; errors and suggested corrections go back to the Seshat editor, and knowledge engineers convert feedback into the knowledge model.]

  36. Where we are • Demo • The End!
