Efficient SQL-Based RDF Querying: A Comprehensive Approach

An Efficient SQL-based RDF Querying Scheme
Eugene Inseok Chong, Souripriya Das, George Eadon, Jagannathan Srinivasan
New England Development Center, Oracle

Talk Outline • Introduction • Functionality • Design and Implementation • Performance • Conclusions and Future Work

Introduction

RDF (Resource Description Framework)
• RDF is a W3C standard for describing resources on the web
• Uniform Resource Identifiers (URIs) are used to identify resources
  • Example: http://www.oracle.com/people#John
• RDF triples are used to make statements about a resource
  • Format: (subject predicate object)
  • Example: (:John :brotherOf :Mary)
  • Represents a directed, labeled edge in an RDF graph: an edge labeled :brotherOf from :John to :Mary

RDF Data and Graph Example
• Family Data:
  (:John :brotherOf :Mary)
  (:Mary :parentOf :Matt)
  (:John :name "John")
  (:Mary :name "Mary")
  (:Matt :name "Matt")
• Graph: nodes :John, :Mary, and :Matt, connected by the :brotherOf and :parentOf edges, each with a :name edge to its literal ("John", "Mary", "Matt")

RDF Querying Problem
• Given
  • RDF graphs: the data set to be searched
  • Graph pattern: containing a set of variables
• Find
  • Matching subgraphs
• Return
  • Sets of variable bindings, where each set corresponds to a matching subgraph

RDF Query Example
• Family Data:
  (:John :brotherOf :Mary)
  (:Mary :parentOf :Matt)
  (:John :name "John")
  (:Mary :name "Mary")
  (:Matt :name "Matt")
• Graph Pattern (names of Mary's brothers):
  (?x :brotherOf ?y)
  (?y :name "Mary")
  (?x :name ?n)
• Variable Bindings: x = :John, y = :Mary, n = "John"
• Matching Subgraph:
  (:John :brotherOf :Mary)
  (:Mary :name "Mary")
  (:John :name "John")
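The matching in this example can be sketched in a few lines of Python. This is a naive nested-loop matcher for illustration only, not the SQL-based scheme the talk goes on to describe. The names `match_triple` and `match_pattern` are hypothetical, and any term starting with `?` is assumed to be a variable.

```python
# Family data and graph pattern from the slide; literals are plain strings.
FAMILY = [
    (":John", ":brotherOf", ":Mary"),
    (":Mary", ":parentOf", ":Matt"),
    (":John", ":name", "John"),
    (":Mary", ":name", "Mary"),
    (":Matt", ":name", "Matt"),
]
PATTERN = [
    ("?x", ":brotherOf", "?y"),
    ("?y", ":name", "Mary"),
    ("?x", ":name", "?n"),
]


def match_triple(pattern, triple, binding):
    """Return binding extended so that pattern equals triple, or None on conflict."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(p_term[1:], t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended


def match_pattern(patterns, triples):
    """Return one binding dict per matching subgraph."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            ext
            for b in bindings
            for t in triples
            if (ext := match_triple(pattern, t, b)) is not None
        ]
    return bindings


RESULT = match_pattern(PATTERN, FAMILY)
print(RESULT)
```

The single binding found corresponds to the matching subgraph listed on the slide.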

RDF Storage Issues
• Need to store RDF <subject, predicate, object> triples, where the individual components can be URIs, blank nodes, or literals
• Namespaces used in URIs could be long
• Multiple triples describe a resource, resulting in repetition of (possibly long) URIs
• Different representations are possible for a literal occurring in multiple triples
  • e.g. 120, 120.0, 12.0e+1, 1.20e+2
• RDF graph may include schema triples
  • e.g. (:brotherOf rdfs:domain :Male)

RDF Querying Issues in SQL
• Support specification of graph pattern-based SQL queries
• Occurrence of the same variable in multiple triples of a graph pattern: processing requires self-joins
  • e.g. (?x :brotherOf ?y) (?y :name "Mary") (?x :name ?n)
• Query processing (e.g. for filter conditions, ORDER BY) requires datatype-specific comparison semantics
  • Schema Triple: (:age rdfs:range xsd:int)
  • Graph Pattern: (?x :age ?a)
  • Filter Condition: a > 60
  • ORDER BY: a DESCENDING
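A small Python sketch of why the comparison semantics matter: if the values bound to ?a are compared as strings, both the filter and the ordering come out wrong. The sample ages and the helper `typed_value` are hypothetical; only the xsd:int range comes from the slide.

```python
# Hypothetical (lexical value, datatype) pairs bound to ?a.
AGES = [("9", "xsd:int"), ("60", "xsd:int"), ("100", "xsd:int"), ("61", "xsd:int")]


def typed_value(lexical, datatype):
    """Convert a lexical form to a Python value using its declared datatype."""
    if datatype == "xsd:int":
        return int(lexical)
    return lexical  # other datatypes are left as strings in this sketch


# String comparison: "9" > "60" is true and "100" > "60" is false.
string_filter = [v for v, _ in AGES if v > "60"]

# Datatype-aware comparison: a > 60, ORDER BY a DESCENDING.
typed_filter = sorted(
    (typed_value(v, t) for v, t in AGES if typed_value(v, t) > 60),
    reverse=True,
)

print(string_filter)
print(typed_filter)
```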

RDF Querying Issues: Inference
• Query processing may involve inferencing
• Example:
  • Data: (:Jim :brotherOf :John) (:John :fatherOf :Mary)
  • Graph Pattern: (?x :uncleOf ?y)
  • Result without inference: empty
  • Rule: (?x :brotherOf ?y) (?y :fatherOf ?z) → (?x :uncleOf ?z)
  • Inferred data: (:Jim :uncleOf :Mary)
  • Result with inference: x = :Jim, y = :Mary
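The effect of the rule can be illustrated with a minimal Python sketch that applies it once to the example data. The function names are hypothetical, and the rule is hard-coded rather than parsed from a rulebase, so this only shows the idea of inferring triples before matching.

```python
DATA = {
    (":Jim", ":brotherOf", ":John"),
    (":John", ":fatherOf", ":Mary"),
}


def apply_uncle_rule(triples):
    """Apply (?x :brotherOf ?y) (?y :fatherOf ?z) -> (?x :uncleOf ?z) once."""
    inferred = set()
    for x, p1, y in triples:
        if p1 != ":brotherOf":
            continue
        for y2, p2, z in triples:
            if p2 == ":fatherOf" and y2 == y:
                inferred.add((x, ":uncleOf", z))
    return inferred - triples  # keep only triples that are actually new


def query_uncles(triples):
    """Evaluate the graph pattern (?x :uncleOf ?y)."""
    return [{"x": s, "y": o} for s, p, o in sorted(triples) if p == ":uncleOf"]


without_inference = query_uncles(DATA)
with_inference = query_uncles(DATA | apply_uncle_rule(DATA))
print(without_inference)
print(with_inference)
```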

RDF Querying Approach
• General Approach
  • Create a new (declarative, SQL-like) query language
  • e.g.: RQL, SeRQL, TRIPLE, N3, Versa, SPARQL, RDQL, RDFQL, SquishQL, RSQL, etc.
• SQL-based Approach
  • Introduces a SQL table function, RDF_MATCH, that uses a SPARQL-like graph pattern to express RDF queries
• Benefits of SQL-based Approach
  • Leverages all the powerful constructs in SQL (e.g., SELECT / FROM / WHERE, ORDER BY, GROUP BY, aggregates, joins) to process graph query results
  • RDF queries can easily be combined with conventional queries on database tables, thereby avoiding staging

Embedding RDF Query in SQL
• SELECT … FROM …, TABLE ( <RDF query> ) t, … WHERE …;
• The RDF query is expressed as an RDF_MATCH table function invocation
• Use of the RDF_MATCH table function allows embedding a graph query in a SQL query

Functionality

RDF_MATCH Table Function
• Input parameters
  RDF_MATCH (
    Pattern,     -- graph pattern
    Models,      -- data (set of RDF graphs)
    RuleBases,   -- rules (0 or more rulebases)
    Aliases      -- list of prefixes for namespaces
  )
• Returns a set of columns containing variable bindings
  • A variable matching a URI is returned as a single VARCHAR2 column with the same name (e.g. x for ?x)
  • A variable matching a literal is returned as a pair of VARCHAR2 columns: one with the name (e.g. x for ?x) and one with the type (x$type for ?x)

RDF_MATCH Example
• Example: student reviewers less than 25 years old
  SELECT t.r reviewer, t.c conf, t.a age
  FROM TABLE ( RDF_MATCH (
         '(?r rdf:type :Student) (?r :reviewerOf ?c) (?r :age ?a)',
         RDFModels('reviewers'),
         NULL,
         RDFAliases(…)) ) t
  WHERE t.a < 25;

Specifying Rules
• RDFS rulebase: pre-loaded
• Can add user-defined rules
• Rule: "Chairperson of Conference is also a reviewer"
  ('rb',                       -- rulebase name
   'ChairpersonRule',          -- rule name
   '(?r :ChairpersonOf ?c)',   -- antecedents
   NULL,                       -- filter condition
   NULL,                       -- aliases
   '(?r :ReviewerOf ?c)')      -- consequents

RDF_MATCH Example with Rulebase
• Query: find reviewers of conferences
  SELECT t.r reviewer
  FROM TABLE ( RDF_MATCH (
         '(?r :ReviewerOf ?c)',
         RDFModels('reviewers'),
         RDFRules('rb'),
         NULL) ) t;
• Data: (:Mary :ChairpersonOf :IDBC2005)
• Inferred data: (:Mary :ReviewerOf :IDBC2005)

Design & Implementation

RDF Data Storage
• Triples data is stored, after normalization, in two tables
  • UriMap (UriID, UriValue, …) contains the mapping of URIs, blank nodes, and literals to internal identifiers
  • IdTriples (ModelID, SubjectID, PropertyID, ObjectID, …) contains the triple information encoded as three identifiers
• Multiple representations of literals: the first occurrence is treated as canonical, and the rest are mapped to the canonical representation
  • e.g. with 120 as the canonical form: 120.0 → 120, 1.20e+2 → 120, 12.0e+1 → 120
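A rough sketch of this two-table layout, using Python's built-in `sqlite3` in place of the Oracle database the talk describes. Only the table and column names come from the slide. The helpers `term_id` and `add_triple`, the `:height` property, and the use of `Decimal` to detect equivalent numeric literals are assumptions made for illustration.

```python
import sqlite3
from decimal import Decimal, InvalidOperation

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UriMap (UriID INTEGER PRIMARY KEY, UriValue TEXT UNIQUE);
CREATE TABLE IdTriples (ModelID INTEGER, SubjectID INTEGER,
                        PropertyID INTEGER, ObjectID INTEGER);
""")

canonical_ids = {}  # numeric value -> UriID of its first-seen representation


def term_id(term):
    """Return the internal ID for a term, mapping numeric literals to the canonical one."""
    try:
        key = Decimal(term)  # assumption: anything Decimal can parse is a numeric literal
    except InvalidOperation:
        key = None
    if key is not None and key in canonical_ids:
        return canonical_ids[key]
    row = conn.execute(
        "SELECT UriID FROM UriMap WHERE UriValue = ?", (term,)
    ).fetchone()
    if row is None:
        uri_id = conn.execute(
            "INSERT INTO UriMap (UriValue) VALUES (?)", (term,)
        ).lastrowid
    else:
        uri_id = row[0]
    if key is not None:
        canonical_ids[key] = uri_id
    return uri_id


def add_triple(model_id, subject, prop, obj):
    """Store one triple as three internal identifiers."""
    conn.execute(
        "INSERT INTO IdTriples VALUES (?, ?, ?, ?)",
        (model_id, term_id(subject), term_id(prop), term_id(obj)),
    )


add_triple(1, ":John", ":height", "120")
add_triple(1, ":Mary", ":height", "12.0e+1")

object_ids = [r[0] for r in conn.execute("SELECT ObjectID FROM IdTriples")]
stored_values = [
    r[0] for r in conn.execute("SELECT UriValue FROM UriMap ORDER BY UriID")
]
print(object_ids, stored_values)
```

Both triples end up pointing at the same ObjectID, and only the first-seen form of the literal is stored in UriMap.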

RDF_MATCH Query Processing
• Substitute aliases with namespaces in the search pattern
• Convert URIs and literals to internal IDs
• Generate query
  • Generate self-join query based on matching variables
  • Generate SQL subqueries for the rulebases component (if any)
  • Generate the join result by joining internal IDs with the UriMap table
  • Use model IDs to restrict the IdTriples table
• Compile and execute the generated query
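The query-generation steps can be sketched as follows, again using Python's `sqlite3` as a stand-in database. This is an illustrative approximation, not the actual RDF_MATCH implementation: alias substitution and rulebase subqueries are omitted, and `build_store`, `lookup_id`, and `generate_sql` are hypothetical names.

```python
import sqlite3

FAMILY = [
    (":John", ":brotherOf", ":Mary"),
    (":Mary", ":parentOf", ":Matt"),
    (":John", ":name", "John"),
    (":Mary", ":name", "Mary"),
    (":Matt", ":name", "Matt"),
]
PATTERN = [("?x", ":brotherOf", "?y"), ("?y", ":name", "Mary"), ("?x", ":name", "?n")]


def lookup_id(conn, term):
    """Convert a URI or literal to its internal ID (None if unknown)."""
    row = conn.execute(
        "SELECT UriID FROM UriMap WHERE UriValue = ?", (term,)
    ).fetchone()
    return row[0] if row else None


def build_store(triples, model_id=1):
    """Load triples into UriMap / IdTriples tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE UriMap (UriID INTEGER PRIMARY KEY, UriValue TEXT UNIQUE);"
        "CREATE TABLE IdTriples (ModelID INTEGER, SubjectID INTEGER,"
        " PropertyID INTEGER, ObjectID INTEGER);"
    )
    for triple in triples:
        ids = []
        for term in triple:
            conn.execute("INSERT OR IGNORE INTO UriMap (UriValue) VALUES (?)", (term,))
            ids.append(lookup_id(conn, term))
        conn.execute("INSERT INTO IdTriples VALUES (?, ?, ?, ?)", (model_id, *ids))
    return conn


def generate_sql(conn, pattern, projection, model_id=1):
    """Build a self-join over IdTriples with one table alias per pattern triple."""
    columns = ("SubjectID", "PropertyID", "ObjectID")
    first_use, tables, where, params = {}, [], [], []
    for i, triple in enumerate(pattern):
        alias = f"t{i}"
        tables.append(f"IdTriples {alias}")
        where.append(f"{alias}.ModelID = ?")  # restrict to the model
        params.append(model_id)
        for term, column in zip(triple, columns):
            ref = f"{alias}.{column}"
            if not term.startswith("?"):
                where.append(f"{ref} = ?")  # constant: compare internal IDs
                params.append(lookup_id(conn, term))
            elif term in first_use:
                where.append(f"{ref} = {first_use[term]}")  # repeated variable: join
            else:
                first_use[term] = ref
    select = []
    for var in projection:  # join with UriMap only for projected variables
        name = var[1:]
        tables.append(f"UriMap m_{name}")
        where.append(f"m_{name}.UriID = {first_use[var]}")
        select.append(f"m_{name}.UriValue AS {name}")
    sql = (
        f"SELECT {', '.join(select)} FROM {', '.join(tables)} "
        f"WHERE {' AND '.join(where)}"
    )
    return sql, params


conn = build_store(FAMILY)
sql, params = generate_sql(conn, PATTERN, ["?x", "?y", "?n"])
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Three pattern triples produce three IdTriples aliases, and each repeated variable becomes an equality predicate between aliases. Projecting fewer variables produces fewer UriMap joins.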

Optimization: Table Function Rewrite
• TableRewriteSQL( )
  • Takes the RDF query (specified via arguments) as input
  • Generates a SQL string
• Substitute the table function call with the generated SQL string
• Reparse and execute the resulting query
• Advantages
  • Avoids the execution-time overhead (linear in the number of result rows) associated with the table function infrastructure
  • Leverages SQL optimizer capabilities to optimize the resulting query (including filter condition pushdown)

Optimization: Materialized Join Views
• Generic materialized join views (MJVs)
  • Subject-Subject, Object-Subject, …
• Subject-property matrix MJVs (SPMJVs)
  • Custom, workload-based (e.g., frequent search patterns)
• Example: select student name, university, and age
  • Select r, u, a …… '(?r rdf:type :Student) (?r :enrolledAt ?u) (?r :age ?a)' ……
  • SPMJV: < Student enrolledAt age >
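The idea behind a subject-property matrix can be sketched with `sqlite3`. SQLite has no materialized views, so the sketch precomputes the join into an ordinary table. The sample data, the flat `triples` table (not ID-encoded), and the name `spmjv_student` are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (":Ann", "rdf:type", ":Student"),
        (":Ann", ":enrolledAt", ":UnivA"),
        (":Ann", ":age", "22"),
        (":Bob", "rdf:type", ":Student"),
        (":Bob", ":enrolledAt", ":UnivB"),
        (":Bob", ":age", "24"),
        (":Cy", "rdf:type", ":Professor"),
        (":Cy", ":age", "50"),
    ],
)

# Pay for the self-joins once: one row per student, one column per property.
conn.execute("""
CREATE TABLE spmjv_student AS
SELECT t.s AS r, e.o AS u, a.o AS a
FROM triples t
JOIN triples e ON e.s = t.s AND e.p = ':enrolledAt'
JOIN triples a ON a.s = t.s AND a.p = ':age'
WHERE t.p = 'rdf:type' AND t.o = ':Student'
""")

# The three-triple pattern is now answered without any join.
rows = conn.execute("SELECT r, u, a FROM spmjv_student ORDER BY r").fetchall()
print(rows)
```

A real materialized view would also have to be kept in sync as triples change, which this sketch does not attempt.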

Performance

Dataset
• WordNet: lexical database for the English language
• UniProt: large scale (80 million triples)
  • Protein and annotation data

Experiments • Varying number of triples in search pattern • Varying filter conditions • Varying projection list • Large-scale RDF data • Subject-property MJVs

Varying Number of Triples • ‘(?a wn:hyponymOf ?b) (?b wn:hyponymOf ?c) ….. • Increasing number of self-joins

Varying Projection List • ‘(?c0 wn:wordForm ?word) (?c0 wn:wordForm ?syn1) (?c1 wn:wordForm ?syn1) …. (5 triples) • Benefit of the projection list optimization • Eliminate joins with UriMap table for variables not referenced outside of RDF_MATCH

Large-Scale RDF Data
• UniProt: 10M, 20M, 40M, 80M triples
• 6 example queries given with UniProt
• Number of matches remains constant as dataset size changes (results limited using ROWNUM)

UniProt Sample Queries
• Q1: Display the ranges of transmembrane regions
  • Pattern: 6 triples, 5 vars; Projection: 3 vars; Result limit: 15000 rows
• Q2: List proteins with publications by authors with matching names
  • Pattern: 5 triples, 5 vars, 1 LIKE pred.; Projection: 3 vars; Result limit: 10 rows
• Q3: Count the number of times a publication by a specific author is cited
  • Pattern: 3 triples, 2 vars; Projection: 0 vars; Result limit: 32 rows
• Q4: List resources that are related to proteins annotated with a specific keyword
  • Pattern: 3 triples, 2 vars; Projection: 1 var; Result limit: 3000 rows
• Q5: List genes associated with human diseases
  • Pattern: 7 triples, 5 vars; Projection: 3 vars; Result limit: 750 rows
• Q6: List recently modified entries
  • Pattern: 2 triples, 2 vars, 1 range pred.; Projection: 2 vars; Result limit: 8000 rows

RDF_MATCH Performance Scalability: Query Response Times
Dataset          Q1      Q2       Q3       Q4      Q5      Q6
10M triples      0.86    < 0.01   < 0.01   0.03    0.18    0.46
20M triples      0.95    < 0.01   < 0.01   0.03    0.19    0.47
40M triples      0.96    < 0.01   < 0.01   0.03    0.18    0.47
80M triples      1.03    < 0.01   < 0.01   0.03    0.20    0.49
Maximum [?]      0.054   0.002    0.002    0.011   0.065   0.07

Conclusions

Conclusions and Future Work
• SQL-based RDF querying scheme
  • RDF_MATCH table function
  • Supports graph pattern-based queries on RDF data with RDFS and user-defined rules
• Efficient execution
  • Table function rewrite
  • Materialized join views: generic and subject-property
  • Rule indexes
• Future work
  • OPTIONAL support (outer join)
  • Provenance support
