A Path-based Relational RDF Database

A. Matono, T. Amagasa, M. Yoshikawa, S. Uemura • ADC 2005 • SNU IDB Lab. • Hyewon Lim • January 9th, 2009

Contents • Introduction • An Overview of RDF • Related Work and the Differences with Our Work • Path-based Approach for Storing RDF Data in Relational Databases • Performance Evaluation • Conclusions

Introduction (1/8) • Quality and quantity of metadata • Semantic Web makes it possible to perform high-level processes • Reasoning, deduction, semantic searches • Metadata • Described by Resource Description Framework (RDF) • RDF describes data and their semantics

Introduction (2/8) • The specification defines an RDF model and RDF syntax • RDF model • Statements describe a relationship between a pair of terms • A set of statements • Represent metadata whose structure is a directed graph

Introduction (3/8) • RDF is commonly used as a format to describe various types of metadata • Typical usage: describing large-scale metadata • Wordnet (35MB), Gene Ontology (365MB), Open Directory Project (2GB) • To handle such data efficiently, RDF DBs that can manage massive RDF data are essential

Introduction (4/8) • One naïve approach is to use XML DBs • Any RDF data can be serialized as XML data • This approach is impractical • The semantic structure of RDF data differs from the syntactic structure of its XML serialization • The semantics cannot be stored in XML DBs

Introduction (5/8) • Another way: utilize relational DBs or Berkeley DB • Several RDF DBs have been proposed • Such conventional RDF DBs can be classified into two groups • 1. The relational schema is designed based on the RDF schema • Cannot handle RDF data that have no accompanying RDF schema • 2. RDF DBs store RDF data as triples

Introduction (6/8) • Problems of processing large RDF data using conventional RDF databases • Ability to handle RDF schema • Queries that use RDF schema information are an important class of RDF queries • The second group does not make any distinction between schema information and instance data • The first group can process such queries • Poor performance in processing path queries • Need to perform one join operation per path step

Introduction (7/8) • We propose a path-based relational RDF DB • The relational schema is designed to be independent of RDF schema information, and • Designed to make the distinction between schema information and instance data • Can handle schemaless RDF data as well as RDF data with schema • Extract all reachable path expressions for each resource, and store them • To improve performance for path queries • No need to perform a join operation per path step

Introduction (8/8) • Steps • Classify every statement into categories according to the type of predicate • Construct subgraphs for each category • Store the subgraphs into distinct relational tables • Apply appropriate techniques for representing the semantics of each subgraph • Limit the structure of each subgraph to a DAG

An Overview of RDF (1/4) • RDF • A foundation for representing and manipulating metadata on Web resources • Usable as long as the location of a Web resource is identifiable in terms of a URI • Statements represent binary relationships between two distinct (or identical) resources • RDF data are modeled as a directed graph • Nodes and arcs represent resources and relationships • Example: “This paper is authored by Akiyoshi MATONO.” • [Figure: node www.matono.net/paper connected by an arc labeled authored to the node “Akiyoshi MATONO”]

An Overview of RDF (2/4) • RDF Schema • A specification for defining schematic information of RDF data • We can define: • Classes (rdfs:Class) as types of resources • Properties of a class (rdf:Property) • Domains (rdfs:domain) and ranges (rdfs:range) of the properties • Inheritance relationships (rdfs:subClassOf, rdfs:subPropertyOf) among classes or properties • Types (rdf:type)

An Overview of RDF (3/4) • Using RDF and RDF Schema, we can represent complex information

An Overview of RDF (4/4) • Classifying RDF data • Large size • Wordnet, ODP, and Gene Ontology • Created mainly for systematic organization of data resources • Do not contain cycles • Simple structure • Small size • RSS, FOAF, and Dublin Core • Used as metadata of images or Web pages

Related Work and the Differences with Our Work (1/3) • Several RDF DBs have been proposed • Most of them use relational DBs or Berkeley DB as their underlying data storage • Approaches using RDB • Flat approach: stores statements flatly in a single relational table • Schema approach: creates relational tables for the classes and properties defined in the RDF schema information, storing resources according to their classes/properties • Approach using BDB • Hash approach: creates three hash tables • Keys: subjects, predicates, objects

Related Work and the Differences with Our Work (2/3) • Problems of the conventional approaches • Flat and hash approaches • Difficult to perform schema queries • They do not make any distinction between schema information and resource descriptions • Schema approach • Able to process queries about RDF schema • Cannot handle RDF data without RDF schema information • The relational schema is designed based on the RDF schema • Costly to maintain under schema evolution • None of the three approaches is sufficiently capable of processing path-based queries

Related Work and the Differences with Our Work (3/3) • In conventional RDF databases, • statement-based queries can be processed efficiently • RDF data is decomposed into a large number of statements • When processing a path-based query • Require a number of join operations according to the steps in the path expression
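
The cost described on this slide can be illustrated with a minimal sketch. It assumes a hypothetical flat table `statement(subject, predicate, object)` and made-up sample data, and uses Python's built-in `sqlite3` rather than any system named in the talk. A two-step path (something painted, then its title) already needs one self-join; each further step would add another.

```python
import sqlite3

def titles_painted_flat(conn):
    """Follow the path #paints then #title over a flat statement table.

    Each path step beyond the first costs one self-join on the table.
    """
    rows = conn.execute(
        """
        SELECT s2.object
        FROM statement AS s1
        JOIN statement AS s2 ON s1.object = s2.subject
        WHERE s1.predicate = '#paints' AND s2.predicate = '#title'
        """
    ).fetchall()
    return [row[0] for row in rows]

# Hypothetical table layout and sample statements, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statement (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO statement VALUES (?, ?, ?)",
    [
        ("#artist1", "#paints", "#painting1"),
        ("#painting1", "#title", "Sunrise"),
        ("#artist1", "#name", "Alice"),
    ],
)
flat_titles = titles_painted_flat(conn)
print(flat_titles)
```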

Path-based Approach for Storing RDF Data in Relational Databases - Subgraph extraction from RDF graph (1/2) • When storing RDF data, the system • Parses the RDF data • Generates its own RDF graph • Decomposes the graph into five subgraphs according to the type of predicate • Class Inheritance (CI) graphs – rdfs:subClassOf • Property Inheritance (PI) graphs – rdfs:subPropertyOf • Type (T) graphs – rdf:type • Domain-Range (DR) graphs – rdfs:domain, rdfs:range • Generic (G) graphs – all other predicates
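
The decomposition step can be sketched as a lookup on the predicate of each statement. This is only an illustration of the classification rule listed on the slide; the function name `decompose`, the tuple representation of statements, and the sample data are assumptions, not the authors' implementation.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Predicate -> subgraph name, following the five categories on the slide.
PREDICATE_TO_GRAPH = {
    RDFS + "subClassOf": "CI",
    RDFS + "subPropertyOf": "PI",
    RDF + "type": "T",
    RDFS + "domain": "DR",
    RDFS + "range": "DR",
}

def decompose(statements):
    """Split (subject, predicate, object) triples into the five subgraphs.

    Any predicate not listed above falls into the generic graph G.
    """
    graphs = {"CI": [], "PI": [], "T": [], "DR": [], "G": []}
    for subject, predicate, obj in statements:
        graphs[PREDICATE_TO_GRAPH.get(predicate, "G")].append(
            (subject, predicate, obj)
        )
    return graphs

# Hypothetical sample statements.
sample = [
    ("#Painter", RDFS + "subClassOf", "#Artist"),
    ("#paints", RDFS + "domain", "#Painter"),
    ("#artist1", RDF + "type", "#Painter"),
    ("#artist1", "#paints", "#painting1"),
]
subgraphs = decompose(sample)
print({name: len(arcs) for name, arcs in subgraphs.items()})
```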

Path-based Approach for Storing RDF Data in Relational Databases - Subgraph extraction from RDF graph (2/2) • Advantages of dividing an RDF graph • Store RDF data into distinct relational tables • Design the relational schema to be independent of RDF schema information • Structures of the resulting subgraphs are less complex than the original RDF graph • Opportunities to apply a suitable representation technique to each subgraph by considering its structure

Path-based Approach for Storing RDF Data in Relational Databases - Path expressions (1/3) • Most queries of RDF data • Queries to detect subgraphs matching a given graph • Queries to detect a set of nodes which can be reached via given path expressions • These queries are represented in path expressions • Storage based on path expressions • Decrease in the number of join operations

Path-based Approach for Storing RDF Data in Relational Databases - Path expressions (2/3) • Path expressions are stored not for the entire RDF graph • Only for graph G, to which path-based queries are frequently posed • Graphs CI and PI should be stored by a scheme that can detect ancestor-descendant relationships • Queries for RDF data use path expressions consisting of arcs • Store arc paths in a relational table

Path-based Approach for Storing RDF Data in Relational Databases - Path expressions (3/3) • Arc path • Given a DAG $g$ with node set $V(g)$ and arc set $E(g)$ • A finite sequence of arcs • $(v_0, v_1), (v_1, v_2), \ldots, (v_{k-2}, v_{k-1}), (v_{k-1}, v_k)$ • The path expression of the arc path, where $l$ gives the label of an arc • $l(v_0, v_1), l(v_1, v_2), \ldots, l(v_{k-2}, v_{k-1}), l(v_{k-1}, v_k)$ • Absolute arc path • An arc path whose source node is a root
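
The definitions above can be made concrete with a short sketch that enumerates, for every node of a small DAG, the path expressions of all absolute arc paths ending at that node. It assumes that a root is a node with no incoming arcs and that the graph is acyclic; the function name and the sample arcs are hypothetical.

```python
from collections import defaultdict

def absolute_path_expressions(arcs):
    """Map each node to the label sequences of absolute arc paths reaching it.

    arcs: iterable of (source, label, target) triples forming a DAG.
    Assumption: a root is a node without incoming arcs.
    """
    outgoing = defaultdict(list)
    nodes, targets = set(), set()
    for source, label, target in arcs:
        outgoing[source].append((label, target))
        nodes.update((source, target))
        targets.add(target)
    roots = nodes - targets

    expressions = defaultdict(set)

    def walk(node, labels):
        # Extend the current absolute arc path by every outgoing arc.
        for label, nxt in outgoing.get(node, []):
            extended = labels + (label,)
            expressions[nxt].add(extended)
            walk(nxt, extended)

    for root in roots:
        walk(root, ())
    return dict(expressions)

# Hypothetical generic graph: two roots reach the same painting.
sample_arcs = [
    ("#artist1", "#paints", "#painting1"),
    ("#museum1", "#exhibits", "#painting1"),
    ("#painting1", "#title", "Sunrise"),
]
path_exprs = absolute_path_expressions(sample_arcs)
for node in sorted(path_exprs):
    print(node, sorted(path_exprs[node]))
```

Because every node carries all of its absolute path expressions, a node reachable along many different paths is stored once per path, which is the price paid for avoiding joins at query time.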

Path-based Approach for Storing RDF Data in Relational Databases - Extended interval numbering scheme for DAGs (1/2) • Interval numbering scheme • Detect ancestor-descendant relationships between two nodes in a tree • We use it to detect inheritance relationships between classes or properties • Extend the scheme to apply it to DAGs

Path-based Approach for Storing RDF Data in Relational Databases - Extended interval numbering scheme for DAGs (2/2) • The relationship between two nodes can be verified by a subsumption test • v is an ancestor of u iff pre(v) < pre(u) ∧ post(u) < post(v) • v is a parent of u iff v is an ancestor of u and depth(u) - depth(v) = 1 • [Figure: two examples of nodes v and u labeled with triples; v: (2, 5, 1) and (5, 4, 2); u: (6, 3, 3) and (4, 1, 3)]
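
As a minimal sketch of the subsumption test, the code below numbers a plain tree with (pre, post, depth) labels by depth-first traversal and then applies the two conditions from the slide. It covers only the tree case; the extension to DAGs is not reproduced here, and the class names and helper functions are assumptions made for illustration.

```python
def number_tree(children, root):
    """Assign a (pre, post, depth) label to every node of a tree.

    children: dict mapping a node to the list of its child nodes.
    pre is the preorder rank, post the postorder rank, depth starts at 1.
    """
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(node, depth):
        counters["pre"] += 1
        pre = counters["pre"]
        for child in children.get(node, []):
            visit(child, depth + 1)
        counters["post"] += 1
        labels[node] = (pre, counters["post"], depth)

    visit(root, 1)
    return labels

def is_ancestor(v, u):
    """v is an ancestor of u iff pre(v) < pre(u) and post(u) < post(v)."""
    return v[0] < u[0] and u[1] < v[1]

def is_parent(v, u):
    """v is a parent of u iff v is an ancestor of u one level above it."""
    return is_ancestor(v, u) and u[2] - v[2] == 1

# Hypothetical class hierarchy.
hierarchy = {"Resource": ["Artist", "Artifact"], "Artist": ["Painter"]}
labels = number_tree(hierarchy, "Resource")
print(labels)
```

With these labels, an inheritance check becomes a comparison of two stored tuples instead of a traversal of the hierarchy.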

Path-based Approach for Storing RDF Data in Relational Databases - Proposed relational schema (1/2) • Designed relational schema for storing RDF data based on the subgraphs

Path-based Approach for Storing RDF Data in Relational Databases - Proposed relational schema (2/2) • Storage example of the RDF data

Path-based Approach for Storing RDF Data in Relational Databases - Query Processing • Examples • Find the title of something painted by someone: SELECT r.resourceName FROM path AS p, resource AS r WHERE p.pathID = r.pathID AND p.pathexp = '#title<#paints' • Find the names of the classes whose direct superclass is http://www.w3.org/2000/01/rdf-schema#Resources: SELECT c1.className FROM class AS c, class AS c1 WHERE c.pre < c1.pre AND c.post > c1.post AND c.depth = c1.depth - 1 AND c.className = 'http://www.w3.org/2000/01/rdf-schema#Resources'
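
The first example query can be tried end to end with a small sketch. Only the table and column names used in the query (`path.pathID`, `path.pathexp`, `resource.resourceName`, `resource.pathID`) come from the slide; the column types, the sample rows, and the use of Python's `sqlite3` instead of PostgreSQL are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal layout: only the columns referenced by the query.
conn.executescript(
    """
    CREATE TABLE path (pathID INTEGER PRIMARY KEY, pathexp TEXT);
    CREATE TABLE resource (resourceName TEXT, pathID INTEGER);
    """
)
# Hypothetical rows: each resource points at one of its path expressions.
conn.executemany(
    "INSERT INTO path VALUES (?, ?)",
    [(1, "#paints"), (2, "#title<#paints")],
)
conn.executemany(
    "INSERT INTO resource VALUES (?, ?)",
    [("#painting1", 1), ("Sunrise", 2)],
)

# The whole path is matched as one string, so the number of joins
# does not grow with the number of path steps.
rows = conn.execute(
    """
    SELECT r.resourceName
    FROM path AS p, resource AS r
    WHERE p.pathID = r.pathID AND p.pathexp = ?
    """,
    ("#title<#paints",),
).fetchall()
path_titles = [row[0] for row in rows]
print(path_titles)
```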

Performance Evaluation • Compared the processing time of our approach with that of Jena2 • Jena2: based on the flat approach • Could not evaluate the performance of schema-based queries • No RDF data with schema information available on the Web is large enough to be used in our experiments • Environment • Athlon 1.4 GHz CPU, 1GB memory, Gentoo Linux 1.4, PostgreSQL 7.4.3

Performance Evaluation - Schema-based Queries (1/3) • Basic schema queries • Find immediate children (or parents) of a given class (or property) • Find inheritance relationships between two given classes (or properties) • Find the classes that are the domain (or range) of a given property • Querying the meta-schema • Find all resources, that is, instances of “rdfs:Resource” • Find all literals

Performance Evaluation - Schema-based Queries (2/3) • Querying type information • Find the set of instances of a given class • Find the set of statements using a given property • When the above queries are processed, there are two cases: • The answer is obtained by a single access to data storage, or by multiple accesses

Performance Evaluation - Schema-based Queries (3/3) • The ability of each approach for schema-based queries • Our approach is efficient because of the interval numbering scheme • In meta-schema queries, if the RDF graph includes many nodes reachable by multiple paths, redundancy increases

Performance Evaluation - Path-based Queries (1/2) • Dataset requirements • Sufficient size to see scalability • The G graph of the data does not contain any cycles • The G graph of the data contains long absolute path expressions • We use the Gene Ontology

Performance Evaluation - Path-based Queries (2/2) • Experiment results

Conclusions • We can handle schemaless RDF data • We can process schema-based queries using the interval numbering scheme • For path-based queries • Achieved high performance • To reduce the number of join operations, we stored RDF data based on path expressions • Future work • Investigate query-processing techniques • Query language, query transformation, and query optimization for RDF data
