University of Florida's dchecker: Software for ensuring semantic data integrity

Nicholas Rejack, MS¹, Christopher P. Barnes¹, Michael Conlon, PhD²
nrejack@ufl.edu, cpb@ufl.edu, mconlon@ufl.edu
¹Clinical and Translational Science Informatics and Technology, University of Florida
²Clinical and Translational Science Institute, University of Florida

Introduction
Upkeep of data integrity requires constant vigilance. The problem of ensuring data integrity in a DBMS is well understood, but this is not the case for semantic web triple data.

UF VIVO: a researcher networking application
The University of Florida has implemented VIVO (http://vivo.ufl.edu), a semantic web application for researcher networking. Although VIVO rejects malformed RDF/XML that does not pass validation, it places relatively few restrictions on the data it will accept.

UF's dchecker software
Dchecker is a Python script that runs a set of associated SPARQL queries daily. Queries can be added indefinitely to expand its capabilities. Examples of the data constraints checked (sketches of what such checks might look like appear below, after the Conclusion):
- Referential integrity: links between authors and their publications must be valid; positions must be linked to people and organizations.
- Domain integrity: numeric identifiers must consist only of digits.
- Semantic integrity: unique identifiers must not be duplicated across people, or repeated on a single person.
- Data restrictions integrity: restricted data must not be exposed; people within sensitive organizations must be protected from public display.

Types of data integrity
Unique identifiers must truly be unique per individual, a property defined to hold a book title must not contain chapter headings, people must not also be classed as organizations, and so forth. Ensuring data is semantically well defined supports downstream processes such as automated semantic reasoning.

Fig. 1: example SPARQL queries run from the command line.

Conclusion
Unlike DBMS systems, where data can be normalized across many tables, semantic data about a particular subject collects on one URI. Future semantic data checking should consider the totality of facts collected on a URI to ensure semantic correctness: people should not carry facts that belong to books, such as page numbers. Future expansion of dchecker should work around some constraints of the SPARQL query language by running multiple queries in parallel and comparing their results. Including the problematic entries themselves in the dchecker report would make data repair easier, and rules for automated data correction would enable hands-free cleanup.

Fig. 2: sample data quality report. Note the non-zero entries, representing errors that need to be corrected.
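The poster does not reproduce dchecker's actual queries, so the following is a minimal sketch of what two of the constraint checks described above might look like. The VIVO and FOAF terms used here (core:Authorship, core:linkedAuthor, foaf:Person, foaf:Organization) are plausible examples rather than the production vocabulary, and the queries deliberately avoid SPARQL 1.1 features; the real dchecker queries may differ.

# Hedged sketch of two dchecker-style constraint queries. Each query returns
# the offending URIs; any result rows indicate an integrity violation.

# Referential integrity (assumed predicates): authorship nodes with no linked author.
REFERENTIAL_CHECK = """
PREFIX core: <http://vivoweb.org/ontology/core#>
SELECT DISTINCT ?authorship
WHERE {
    ?authorship a core:Authorship .
    OPTIONAL { ?authorship core:linkedAuthor ?author . }
    FILTER (!bound(?author))
}
"""

# Semantic integrity: individuals typed as both a person and an organization.
SEMANTIC_CHECK = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?individual
WHERE {
    ?individual a foaf:Person .
    ?individual a foaf:Organization .
}
"""

# Named checks, so a report can label each count.
CHECKS = {
    "authorship without linked author": REFERENTIAL_CHECK,
    "person also typed as organization": SEMANTIC_CHECK,
}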
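A small runner in the same spirit could execute each query from the CHECKS dictionary above against the VIVO SPARQL endpoint and tally the results into a daily report, where non-zero counts mark errors to be corrected (as in Fig. 2). The endpoint URL is a placeholder and SPARQLWrapper is used only for illustration; the poster does not describe dchecker's actual query mechanism.

from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint; the real UF VIVO SPARQL endpoint address may differ.
ENDPOINT = "http://vivo.ufl.edu/sparql"

def run_check(endpoint, query):
    """Run one constraint query and return the list of offending result rows."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return results["results"]["bindings"]

def daily_report(endpoint, checks):
    """Print one line per check: its name and the number of violations found."""
    for name, query in checks.items():
        violations = run_check(endpoint, query)
        # A non-zero count flags data that needs to be repaired.
        print("%-45s %d" % (name, len(violations)))

if __name__ == "__main__":
    daily_report(ENDPOINT, CHECKS)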