RDF Triple Stores. Nipun Bhatia Department of Computer Science. Stanford University. Contents . Introduction Different Architectures Implications An Example : Jena SDB Evaluations Evaluations using LUBM/DBPedia Open Research Issues
RDF Triple Stores Nipun Bhatia Department of Computer Science. Stanford University
Contents • Introduction • Different Architectures • Implications • An Example: Jena SDB • Evaluations • Evaluations using LUBM/DBpedia • Open Research Issues • Which RDF store to choose for a particular application? • Possible system diagram for Phenotype Annotations.
Introduction • What is an RDF store? A system that provides persistent storage of, and access to, RDF graphs. • Potential application areas: plenty! Backend for Protege, BioPortal, Phenotype Annotations.
Different Architectures • Based on their implementation, RDF stores can be divided into 3 broad categories: In-Memory, Native, and Non-Native Non-Memory. • In-Memory: the RDF graph is stored as triples in main memory. E.g. storing an RDF graph using the Jena API or Sesame API. • Native: persistent storage systems with their own database implementation. E.g. Sesame Native, Virtuoso, AllegroGraph, Oracle 11g. • Non-Native Non-Memory: persistent storage systems set up to run on third-party databases. E.g. Jena SDB.
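As an illustration of the in-memory category, here is a minimal Python sketch of a store that keeps the whole graph as a set of triples and answers simple pattern queries. It is not the Jena or Sesame API; the class name `InMemoryTripleStore`, its methods, and the `ex:` terms are assumptions made up for this example.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class InMemoryTripleStore:
    """Toy in-memory store: the whole graph is a Python set of triples."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        """Add one (subject, predicate, object) triple; duplicates are ignored."""
        self._triples.add((s, p, o))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        for t in self._triples:
            if ((s is None or t[0] == s)
                    and (p is None or t[1] == p)
                    and (o is None or t[2] == o)):
                yield t


# Hypothetical data, loosely in the style of a university benchmark.
store = InMemoryTripleStore()
store.add("ex:alice", "rdf:type", "ex:Student")
store.add("ex:alice", "ex:memberOf", "ex:Dept0")
store.add("ex:bob", "rdf:type", "ex:Professor")

# Who is a student? Subject is left unbound.
students = sorted(store.match(p="rdf:type", o="ex:Student"))
```

Everything here is lost when the process exits, and every query scans the full set, which is why the persistent categories above exist.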
Implications • Scalability • Different query languages supported to varying degrees. • Sesame: SeRQL; Oracle 11g: its own query language. • Different levels of inferencing. • Sesame supports RDFS inference; AllegroGraph: RDFS++; Oracle 11g: RDFS++, OWL Prime. • Lack of interoperability and portability. • More pronounced in native stores.
Jena SDB • SDB is basically a Java loader. • Multiple stores supported: MySQL, PostgreSQL, Oracle, DB2. • Takes incoming triples and breaks them down into components ready for the database. • Multiple layouts. • Integration with the Joseki server. • SPARQL supported. (Non) Interest Declaration: I was previously an intern at HP Labs with the Jena team.
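The idea of breaking incoming triples into components for a relational database can be sketched roughly as follows. This is not SDB's actual schema or loader: the `nodes` and `triples` tables, the helper names, and the use of SQLite (standing in for MySQL, PostgreSQL and the like) are all assumptions chosen to keep the example self-contained.

```python
import sqlite3

# SQLite stands in for the third-party database in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, lex TEXT UNIQUE NOT NULL);
    CREATE TABLE triples (
        s INTEGER NOT NULL REFERENCES nodes(id),
        p INTEGER NOT NULL REFERENCES nodes(id),
        o INTEGER NOT NULL REFERENCES nodes(id),
        PRIMARY KEY (s, p, o)
    );
""")


def node_id(conn, lex):
    """Return the id for a node's lexical form, inserting it if new."""
    conn.execute("INSERT OR IGNORE INTO nodes (lex) VALUES (?)", (lex,))
    row = conn.execute("SELECT id FROM nodes WHERE lex = ?", (lex,)).fetchone()
    return row[0]


def load_triple(conn, s, p, o):
    """Break a triple into its three components and store their ids."""
    ids = tuple(node_id(conn, term) for term in (s, p, o))
    conn.execute(
        "INSERT OR IGNORE INTO triples (s, p, o) VALUES (?, ?, ?)", ids)


# Hypothetical input triples.
load_triple(conn, "ex:alice", "rdf:type", "ex:Student")
load_triple(conn, "ex:bob", "rdf:type", "ex:Student")

# A triple pattern becomes a join back through the node table.
rows = conn.execute("""
    SELECT sn.lex FROM triples t
    JOIN nodes sn ON sn.id = t.s
    JOIN nodes pn ON pn.id = t.p
    JOIN nodes obj ON obj.id = t.o
    WHERE pn.lex = 'rdf:type' AND obj.lex = 'ex:Student'
    ORDER BY sn.lex
""").fetchall()
```

Storing each term once and referring to it by id keeps the triple table narrow, at the cost of joins at query time.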
Evaluations • Third-party evaluations for Sesame, Jena SDB, Virtuoso • Oracle 11g: the company's own evaluations • Methodology • LUBM: Lehigh University Benchmark • DBpedia • Multiple queries • Load times
Evaluations • DBpedia: database of structured information extracted from Wikipedia. Information about places, persons, music albums and films [2] • LUBM: synthetically generated RDF data containing universities, departments, students etc. [1] • Dataset size: • Dataset 1: 15,472,624 triples; 2.1 GB • Dataset 2: LUBM 50 (2.75 million) and LUBM 1000 (55.09 million) • 3 queries
Results – Query 1 • Simple select query – 2 variables
Query 2 • Unconstrained Select Query – only predicate was specified.
Query 3 • Complex Query – Uses filter
Observations • Native Stores perform better than systems using third party stores. • Optimizations are possible • Each of the systems uses different database layouts. • Virtuoso – OGPS,POGS,PSOG,SOPG • SDB – SPO,GSPO • Hashing on SDB is very bad.
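To show why the choice of index layout matters, here is a small Python sketch of permuted triple indexes: the same data indexed in SPO order and in POS order, so that a query can pick the index whose leading keys are bound. It does not reproduce the internal structures of Virtuoso or SDB; `build_index`, the nested-dictionary representation, and the sample data are assumptions for illustration only.

```python
from collections import defaultdict

# Position of each component inside an (s, p, o) tuple.
POSITION = {"s": 0, "p": 1, "o": 2}


def build_index(triples, order):
    """Index triples as nested dicts keyed in the given order, e.g. 'pos'."""
    index = defaultdict(lambda: defaultdict(set))
    for t in triples:
        first, second, third = (t[POSITION[k]] for k in order)
        index[first][second].add(third)
    return index


# Hypothetical data.
triples = [
    ("ex:alice", "rdf:type", "ex:Student"),
    ("ex:bob", "rdf:type", "ex:Student"),
    ("ex:alice", "ex:memberOf", "ex:Dept0"),
]

spo = build_index(triples, "spo")  # good when the subject is bound
pos = build_index(triples, "pos")  # good when predicate and object are bound

# "Which subjects have rdf:type ex:Student?" is two lookups in POS order,
# whereas the SPO index alone would require visiting every subject.
students = sorted(pos["rdf:type"]["ex:Student"])
alice_types = spo["ex:alice"]["rdf:type"]
```

Each extra ordering speeds up one family of query patterns but adds to load time and storage, which is one plausible reason stores settle on different sets of layouts.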
Open Research Issues • Inferencing [4] • Present common implementations: • Make a number of small queries to propagate the effects of rule firing. • Each of these queries creates an interaction with the database. • Not very efficient. • Approaches • Snapshot the contents of the database-backed model into RAM for the duration of processing by the inference engine. • Perform inferencing in-stream. • Precompute the inference closure of the ontology, analyze the incoming data streams, and add triples to them based on the inference closure. • Assumes rigid separation of the RDF data (A-box) and the ontology data (T-box). • Even this may not work for very large ontologies such as biomedical ontologies.
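The in-stream approach can be sketched in Python as follows, restricted to a single rule (subclass reasoning over rdf:type) to keep it short. This is only an illustration under that assumption, not the pipeline from [4]; the function names and the tiny T-box are hypothetical.

```python
def subclass_closure(direct):
    """Precompute, for each class, the set of all its superclasses.

    `direct` maps a class to its direct superclasses (the T-box).
    """
    closure = {}
    for cls in direct:
        seen = set()
        stack = list(direct[cls])
        while stack:
            sup = stack.pop()
            if sup not in seen:
                seen.add(sup)
                stack.extend(direct.get(sup, ()))
        closure[cls] = seen
    return closure


def infer_in_stream(triples, closure):
    """Yield each incoming A-box triple plus the rdf:type triples it entails."""
    for s, p, o in triples:
        yield (s, p, o)
        if p == "rdf:type":
            # Only a lookup in the precomputed closure: no database round trip.
            for sup in sorted(closure.get(o, ())):
                yield (s, "rdf:type", sup)


# Hypothetical ontology (T-box) and instance data (A-box).
tbox = {"ex:GradStudent": {"ex:Student"}, "ex:Student": {"ex:Person"}}
closure = subclass_closure(tbox)

abox = [
    ("ex:alice", "rdf:type", "ex:GradStudent"),
    ("ex:alice", "ex:memberOf", "ex:Dept0"),
]
expanded = list(infer_in_stream(abox, closure))
```

The closure is computed once, before loading, which is exactly where the A-box/T-box separation assumption comes in: if the ontology changes mid-stream, triples already written are not revisited.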
Open Research Issues • Query Optimization • Third-party stores undo any optimization done at the API level. • The better performance of native stores points in that direction. • Some work exists on optimizing SPARQL queries for in-memory stores.
Which RDF store to choose for an app? • Frequency of loads that the application would perform. • Single scaling factor and linear load times. • Level of inferencing. • Support for which query language. W3C recommendations. • Special system needs. E.g. AllegroGraph needs a 64-bit processor.
[Figure: possible system diagram for Phenotype Annotations. Components shown: Phenotype Annotations; set of ontologies required for Phenotype Annotations, e.g. PATO, Fly etc.; Jena API; Inferencing; Jena Model; SDB; MySQL / Virtuoso.]
References • [1] http://esw.w3.org/topic/RdfStoreBenchmarking • [2] http://www4.wiwiss.fu-berlin.de/benchmarks-200801/ • [3] Kurt Rohloff et al.: An Evaluation of Triple-Store Technologies for Large Data Stores. Comparing Sesame, Jena and AllegroGraph. 2007 • [4] N. Bhatia, A. Seaborne: 'Ingestion pipeline for RDF'