Efficient SQL-Based RDF Querying: A Comprehensive Approach
This presentation outlines an approach to querying RDF (Resource Description Framework) data from within SQL. It covers the need for efficient storage and retrieval of RDF triples, the issues that arise when RDF is queried through SQL, and a SQL table function, RDF_MATCH, that accepts RDF graph patterns. Because the graph query is embedded in an ordinary SQL statement, constructs such as SELECT, joins, and ORDER BY can filter, order, and combine its results with other tables. The presentation discusses functionality, design, implementation, performance, and future work.
Presentation Transcript
An Efficient SQL-based RDF Querying Scheme • Eugene Inseok Chong, Souripriya Das, George Eadon, Jagannathan Srinivasan • New England Development Center, Oracle
Talk Outline • Introduction • Functionality • Design and Implementation • Performance • Conclusions and Future Work
RDF (Resource Description Framework) • RDF is a W3C Standard for describing resources on the web • Uniform Resource Identifiers (URIs) are used to identify resources • Example: http://www.oracle.com/people#John • RDF triples are used to make statements about a resource • Format: (subject predicate object) • Example: (:John :brotherOf :Mary) • Represents a directed, labeled edge in an RDF graph: :John --:brotherOf--> :Mary
RDF Data and Graph Example • Family Data: (:John :brotherOf :Mary) (:Mary :parentOf :Matt) (:John :name “John”) (:Mary :name “Mary”) (:Matt :name “Matt”) • [Figure: the family data drawn as an RDF graph with nodes :John, :Mary, :Matt, edges :brotherOf and :parentOf, and :name edges to the literals John, Mary, Matt]
RDF Querying Problem • Given • RDF graphs: the data set to be searched • Graph Pattern: containing a set of variables • Find • Matching Subgraphs • Return • Sets of variable bindings: where each set corresponds to a Matching Subgraph
RDF Query Example • Family Data: (:John :brotherOf :Mary) (:Mary :parentOf :Matt) (:John :name “John”) (:Mary :name “Mary”) (:Matt :name “Matt”) • Graph Pattern (names of Mary’s brothers): (?x :brotherOf ?y) (?y :name “Mary”) (?x :name ?n) • Variable Bindings: x = :John, y = :Mary, n = “John” • Matching Subgraph: (:John :brotherOf :Mary) (:Mary :name “Mary”) (:John :name “John”) • [Figure: the family RDF graph with the matching subgraph marked]
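The matching step on this slide can be sketched in a few lines of Python. This is a naive nested-loop matcher written for illustration only, not the scheme the talk describes; the function name `match_pattern` and the tuple representation of triples are assumptions.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def match_pattern(data: List[Triple], pattern: List[Triple]) -> List[Binding]:
    """Return every set of variable bindings under which all pattern triples occur in data."""
    bindings: List[Binding] = [{}]  # start from a single empty binding
    for pat in pattern:
        extended: List[Binding] = []
        for b in bindings:
            for triple in data:
                candidate = dict(b)
                ok = True
                for p_term, d_term in zip(pat, triple):
                    if p_term.startswith("?"):
                        # Variable: bind it, or check it against an earlier binding.
                        if candidate.setdefault(p_term[1:], d_term) != d_term:
                            ok = False
                            break
                    elif p_term != d_term:
                        # Constant: must match the data term exactly.
                        ok = False
                        break
                if ok:
                    extended.append(candidate)
        bindings = extended
    return bindings


family = [
    (":John", ":brotherOf", ":Mary"),
    (":Mary", ":parentOf", ":Matt"),
    (":John", ":name", '"John"'),
    (":Mary", ":name", '"Mary"'),
    (":Matt", ":name", '"Matt"'),
]
marys_brothers = [
    ("?x", ":brotherOf", "?y"),
    ("?y", ":name", '"Mary"'),
    ("?x", ":name", "?n"),
]
result = match_pattern(family, marys_brothers)
print(result)  # [{'x': ':John', 'y': ':Mary', 'n': '"John"'}]
```

Each binding in the result corresponds to one matching subgraph, which is the "sets of variable bindings" output described in the problem statement.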
RDF Storage Issues • Need to store RDF <subject, predicate, object> triples where the individual components can be URIs, blank nodes, or literals • Namespaces used in URIs could be long • Multiple triples describe a resource resulting in repetition of (possibly long) URIs • Different representations possible for a literal occurring in multiple triples • e.g. 120 120.0 12.0e+1 1.20e+2 • RDF graph may include schema triples • e.g. (:brotherOf rdfs:domain :Male)
RDF Querying Issues in SQL • Support specification of graph pattern-based SQL query • Occurrence of same variables in multiple triples of graph pattern: processing requires self-join • e.g. (?x :brotherOf ?y) (?y :name “Mary”) (?x :name ?n) • Query processing (e.g., for filter conditions, ORDER BY) requires datatype-specific comparison semantics • Schema Triple: (:age rdfs:range xsd:int) • Graph Pattern: (?x :age ?a) • Filter Condition: a > 60 • ORDER BY: a DESCENDING
RDF Querying Issues: Inference • Query processing may involve inferencing • Example: Data: (:Jim :brotherOf :John) (:John :fatherOf :Mary) • Graph Pattern: (?x :uncleOf ?y) • Result without the rule: Empty • Rule: (?x :brotherOf ?y) (?y :fatherOf ?z) => (?x :uncleOf ?z) • Inferred data: (:Jim :uncleOf :Mary) • Result with the rule: x = :Jim, y = :Mary
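Both issues can be reproduced with a small experiment. The sketch below uses Python's built-in sqlite3 module and a hypothetical three-column `triples` table (not the storage layout used later in the talk). The age values are made up for the demonstration.

```python
import sqlite3

# Hypothetical single-table layout (s, p, o), used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (":John", ":brotherOf", ":Mary"),
        (":Mary", ":parentOf", ":Matt"),
        (":John", ":name", "John"),
        (":Mary", ":name", "Mary"),
        (":Matt", ":name", "Matt"),
        (":John", ":age", "61"),   # made-up ages for the comparison demo
        (":Mary", ":age", "100"),
        (":Matt", ":age", "7"),
    ],
)

# Self-join: one table alias per pattern triple; a variable shared between
# triples (?x, ?y) becomes an equality condition between aliases.
brothers = conn.execute(
    """
    SELECT t1.s AS x, t1.o AS y, t3.o AS n
    FROM triples t1, triples t2, triples t3
    WHERE t1.p = ':brotherOf'
      AND t2.p = ':name' AND t2.o = 'Mary'
      AND t3.p = ':name'
      AND t2.s = t1.o   -- ?y occurs in triples 1 and 2
      AND t3.s = t1.s   -- ?x occurs in triples 1 and 3
    """
).fetchall()

# Literals stored as text compare character by character, so '100' < '60'.
as_text = [r[0] for r in conn.execute(
    "SELECT s FROM triples WHERE p = ':age' AND o > '60' ORDER BY o DESC")]

# Casting according to the declared range (xsd:int) gives numeric semantics.
as_int = [r[0] for r in conn.execute(
    "SELECT s FROM triples WHERE p = ':age' AND CAST(o AS INTEGER) > 60 "
    "ORDER BY CAST(o AS INTEGER) DESC")]

print(brothers)  # [(':John', ':Mary', 'John')]
print(as_text)   # [':Matt', ':John']  (wrong for integers)
print(as_int)    # [':Mary', ':John']
```

The text comparison both drops the value 100 and admits the value 7, which is why the filter and the ORDER BY need to know the datatype of the literal.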
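The uncle example can be traced with a small forward-chaining sketch. It applies one rule in a single pass and is only meant to show why the query is empty before inference and non-empty after it; the helper names `match` and `infer` are assumptions, and this is not how the talk's system evaluates rulebases.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def match(data: List[Triple], pattern: List[Triple]) -> List[Dict[str, str]]:
    """Naive matcher: extend the bindings one pattern triple at a time."""
    bindings: List[Dict[str, str]] = [{}]
    for pat in pattern:
        extended = []
        for b in bindings:
            for triple in data:
                cand = dict(b)
                # A variable must bind consistently; a constant must be equal.
                if all(
                    cand.setdefault(p[1:], d) == d if p.startswith("?") else p == d
                    for p, d in zip(pat, triple)
                ):
                    extended.append(cand)
        bindings = extended
    return bindings


def infer(data: List[Triple], antecedents: List[Triple], consequent: Triple) -> List[Triple]:
    """Apply one rule once: instantiate the consequent for every antecedent match."""
    inferred: List[Triple] = []
    for b in match(data, antecedents):
        t = tuple(b[x[1:]] if x.startswith("?") else x for x in consequent)
        if t not in data and t not in inferred:
            inferred.append(t)
    return inferred


data = [(":Jim", ":brotherOf", ":John"), (":John", ":fatherOf", ":Mary")]
query = [("?x", ":uncleOf", "?y")]

before = match(data, query)
new_triples = infer(
    data,
    antecedents=[("?x", ":brotherOf", "?y"), ("?y", ":fatherOf", "?z")],
    consequent=("?x", ":uncleOf", "?z"),
)
after = match(data + new_triples, query)

print(before)       # []
print(new_triples)  # [(':Jim', ':uncleOf', ':Mary')]
print(after)        # [{'x': ':Jim', 'y': ':Mary'}]
```

A rule whose consequent feeds another rule would need repeated passes until no new triples appear; the single pass here is enough for this example.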
RDF Querying Approach • General Approach • Create a new (declarative, SQL-like) query language • e.g.: RQL, SeRQL, TRIPLE, N3, Versa, SPARQL, RDQL, RDFQL, SquishQL, RSQL, etc. • SQL-based Approach • Introduces a SQL table function, RDF_MATCH, that uses a SPARQL-like graph pattern to express RDF queries • Benefits of SQL-based Approach • Leverages all the powerful constructs in SQL (e.g., SELECT / FROM / WHERE, ORDER BY, GROUP BY, aggregates, joins) to process graph query results • RDF queries can easily be combined with conventional queries on database tables, thereby avoiding staging
Embedding RDF Query in SQL • SELECT … FROM …, TABLE ( <RDF Query> ) t, … WHERE …; • <RDF Query> is expressed as an RDF_MATCH table function invocation • Use of the RDF_MATCH table function allows embedding a graph query in a SQL query
RDF_MATCH Table Function • Input parameters: RDF_MATCH ( Pattern, Models, RuleBases, Aliases ) • Pattern: graph pattern • Models: data (set of RDF graphs) • RuleBases: rules (0 or more rulebases) • Aliases: list of prefixes for namespaces • Returns a set of columns containing variable bindings • Variable matching URI returned as single VARCHAR2 column with the same name (e.g. x for ?x) • Variable matching literal returned as a pair of VARCHAR2 columns with a name (e.g. x for ?x) and the type (x$type for ?x)
RDF_MATCH Example • Example: student reviewers less than 25 years old • SELECT t.r reviewer, t.c conf, t.a age FROM TABLE ( RDF_MATCH ( '(?r rdf:type :Student) (?r :reviewerOf ?c) (?r :age ?a)', RDFModels('reviewers'), NULL, RDFAliases(…)) ) t WHERE t.a < 25;
Specifying Rules • RDFS rulebase: pre-loaded • Can add user-defined rules • Rule: “Chairperson of Conference is also a reviewer” • ( 'rb' [rulebase name], 'ChairpersonRule' [rule name], '(?r :ChairpersonOf ?c)' [antecedents], NULL [filter condition], NULL [aliases], '(?r :ReviewerOf ?c)' [consequents] )
RDF_MATCH Example with rulebase • Query: Find reviewers of conferences • SELECT t.r reviewer FROM TABLE( RDF_MATCH( '(?r :ReviewerOf ?c)', RDFModels('reviewers'), RDFRules('rb'), NULL)) t; • Data: (:Mary :ChairpersonOf :IDBC2005) • Inferred data: (:Mary :ReviewerOf :IDBC2005)
RDF Data Storage • Triples data stored after normalization in two tables • UriMap (UriID, UriValue,…) contains mapping of (URIs, blank nodes, literals) to internal identifiers • IdTriples (ModelID, SubjectID, PropertyID, ObjectID,…) contains the triple information encoded as three identifiers • Multiple representations of literals: the first occurrence is treated as canonical, the rest are mapped to the canonical representation • e.g. 120.0 120 1.20e+2 12.0e+1
RDF_MATCH Query Processing • Substitute aliases with namespaces in search pattern • Convert URIs and literals to internal IDs • Generate Query • Generate self-join query based on matching variables • Generate SQL subqueries for rulebases component (if any) • Generate the join result by joining internal IDs with UriMap table • Use model IDs to restrict IdTriples table • Compile and Execute the generated query
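A minimal in-memory sketch of this normalization follows. The table and column names follow the slide, but the class `TripleStore`, its methods, and the use of `float()` to recognize equivalent numeric literals are assumptions made for the illustration; the slides do not say how equivalent representations are detected.

```python
from typing import Dict, List, Tuple


class TripleStore:
    """Toy stand-in for the UriMap and IdTriples tables."""

    def __init__(self) -> None:
        self.uri_map: Dict[int, str] = {}   # UriID -> UriValue (canonical text)
        self._ids: Dict[tuple, int] = {}    # lookup key -> UriID
        # (ModelID, SubjectID, PropertyID, ObjectID)
        self.id_triples: List[Tuple[int, int, int, int]] = []

    @staticmethod
    def _key(value: str) -> tuple:
        # Assumption: values that parse as numbers are compared by numeric value,
        # so 120, 120.0 and 1.20e+2 share one key. Everything else is compared as text.
        try:
            return ("num", float(value))
        except ValueError:
            return ("str", value)

    def intern(self, value: str) -> int:
        """Return the ID for value, allocating one on first occurrence."""
        key = self._key(value)
        if key not in self._ids:
            new_id = len(self.uri_map) + 1
            self._ids[key] = new_id
            self.uri_map[new_id] = value  # the first occurrence becomes canonical
        return self._ids[key]

    def add(self, model_id: int, s: str, p: str, o: str) -> None:
        self.id_triples.append((model_id, self.intern(s), self.intern(p), self.intern(o)))


store = TripleStore()
store.add(1, ":item1", ":weight", "120")
store.add(1, ":item2", ":weight", "1.20e+2")
store.add(1, ":item3", ":weight", "120.0")

print(store.id_triples)  # [(1, 1, 2, 3), (1, 4, 2, 3), (1, 5, 2, 3)]
print(store.uri_map[3])  # 120
```

The long predicate URI and the literal are each stored once, and all three triples point at the same object ID, so an equality join on ObjectID finds them regardless of how the literal was written.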
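The query-generation steps can be sketched as a small generator that emits a self-join over IdTriples and joins the bound IDs back to UriMap. This is an illustration under several assumptions: it runs against SQLite rather than Oracle, the function names `generate_sql` and `lookup_id` are hypothetical, alias substitution and rulebase subqueries are omitted, and variable names are assumed to be plain identifiers.

```python
import sqlite3


def generate_sql(pattern, model_ids, lookup_id):
    """Build a self-join query for a list of (s, p, o) terms; variables start with '?'."""
    from_parts, where, first_seen = [], [], {}
    models = ", ".join(str(int(m)) for m in model_ids)
    for i, triple in enumerate(pattern):
        alias = f"t{i}"  # one IdTriples alias per pattern triple
        from_parts.append(f"IdTriples {alias}")
        where.append(f"{alias}.ModelID IN ({models})")  # restrict to the chosen models
        for col, term in zip(("SubjectID", "PropertyID", "ObjectID"), triple):
            ref = f"{alias}.{col}"
            if not term.startswith("?"):
                where.append(f"{ref} = {int(lookup_id(term))}")  # constant -> internal ID
            elif term in first_seen:
                where.append(f"{ref} = {first_seen[term]}")      # repeated variable -> join
            else:
                first_seen[term] = ref
    select = []
    for var, ref in first_seen.items():
        u = f"u_{var[1:]}"
        from_parts.append(f"UriMap {u}")  # map each bound ID back to its value
        where.append(f"{u}.UriID = {ref}")
        select.append(f"{u}.UriValue AS {var[1:]}")
    return (f"SELECT {', '.join(select)} FROM {', '.join(from_parts)} "
            f"WHERE {' AND '.join(where)}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UriMap (UriID INTEGER PRIMARY KEY, UriValue TEXT)")
conn.execute("CREATE TABLE IdTriples (ModelID INT, SubjectID INT, PropertyID INT, ObjectID INT)")
values = [":John", ":Mary", ":Matt", ":brotherOf", ":parentOf", ":name", "John", "Mary", "Matt"]
conn.executemany("INSERT INTO UriMap VALUES (?, ?)", list(enumerate(values, start=1)))
conn.executemany(
    "INSERT INTO IdTriples VALUES (1, ?, ?, ?)",
    [(1, 4, 2), (2, 5, 3), (1, 6, 7), (2, 6, 8), (3, 6, 9)],
)


def lookup_id(value):
    row = conn.execute("SELECT UriID FROM UriMap WHERE UriValue = ?", (value,)).fetchone()
    return row[0] if row else -1  # -1 is never allocated, so it matches nothing


pattern = [("?x", ":brotherOf", "?y"), ("?y", ":name", "Mary"), ("?x", ":name", "?n")]
sql = generate_sql(pattern, [1], lookup_id)
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)  # [(':John', ':Mary', 'John')]
```

Resolving constants to IDs before the query is built keeps every join condition an integer comparison; the text values are only touched in the final join with UriMap.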
Optimization: Table Function Rewrite • TableRewriteSQL( ) • Takes RDF Query (specified via arguments) as input • generates a SQL string • Substitute the table function call with the generated SQL string • Reparse and execute the resulting query • Advantages • Avoid execution-time overhead (linear in number of result rows) associated with table function infrastructure • Leverage SQL optimizer capabilities to optimize the resulting query (including filter condition pushdown)
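At the level of query text, the substitution step might look like the sketch below. The real rewrite takes place inside the database's compilation of the query; the function `rewrite_table_function` is a hypothetical name, and the parenthesis scan assumes that parentheses inside string literals are balanced, as they are in graph patterns.

```python
import re


def rewrite_table_function(query: str, generated_sql: str) -> str:
    """Replace the first TABLE(RDF_MATCH(...)) call with an inline subquery."""
    m = re.search(r"TABLE\s*\(\s*RDF_MATCH\s*\(", query, flags=re.IGNORECASE)
    if m is None:
        return query  # nothing to rewrite
    depth, i = 2, m.end()  # two parentheses are open right after the match
    while i < len(query) and depth > 0:
        if query[i] == "(":
            depth += 1
        elif query[i] == ")":
            depth -= 1
        i += 1
    if depth != 0:
        raise ValueError("unbalanced parentheses in table function call")
    return query[:m.start()] + "(" + generated_sql + ")" + query[i:]


original = ("SELECT t.r FROM TABLE(RDF_MATCH('(?r :age ?a)', "
            "RDFModels('reviewers'), NULL, NULL)) t WHERE t.a < 25")
generated = "SELECT 1 AS r, 2 AS a"  # placeholder for the generated self-join query
rewritten = rewrite_table_function(original, generated)
print(rewritten)  # SELECT t.r FROM (SELECT 1 AS r, 2 AS a) t WHERE t.a < 25
```

Once the call is replaced by a plain subquery, the outer predicate (here `t.a < 25`) is visible to the SQL optimizer together with the generated joins, which is what makes filter pushdown possible.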
Optimization: Materialized Join Views • Generic Materialized Join views (MJVs) • Subject-Subject, Object-Subject, … • Subject-property matrix MJVs (SPMJVs) • custom, workload based (e.g., frequent search patterns) Example: Select student name, university, and age • Select r, u, a …… ‘(?r rdf:type :Student) (?r :enrolledAt ?u) (?r :age ?a)’ …… • SPMJV: < Student enrolledAt age >
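The idea behind a subject-property matrix view can be illustrated with SQLite. SQLite has no materialized views, so a plain table built with CREATE TABLE … AS stands in for one here (it is not refreshed when the data changes). The `triples` layout, the table name `student_spmjv`, and the sample data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (":Ann", "rdf:type", ":Student"),
        (":Ann", ":enrolledAt", ":UnivA"),
        (":Ann", ":age", "22"),
        (":Bob", "rdf:type", ":Student"),
        (":Bob", ":enrolledAt", ":UnivB"),
        (":Bob", ":age", "27"),
        (":Cy", "rdf:type", ":Professor"),
        (":Cy", ":age", "50"),
    ],
)

# Precompute the three-way self-join once: one row per student,
# one column per property in the frequent search pattern.
conn.execute(
    """
    CREATE TABLE student_spmjv AS
    SELECT t1.s AS subject, t2.o AS enrolledAt, t3.o AS age
    FROM triples t1
    JOIN triples t2 ON t2.s = t1.s AND t2.p = ':enrolledAt'
    JOIN triples t3 ON t3.s = t1.s AND t3.p = ':age'
    WHERE t1.p = 'rdf:type' AND t1.o = ':Student'
    """
)

# The pattern (?r rdf:type :Student) (?r :enrolledAt ?u) (?r :age ?a)
# is now answered by a single-table scan with no joins.
rows = conn.execute(
    "SELECT subject, enrolledAt, age FROM student_spmjv ORDER BY subject"
).fetchall()
print(rows)  # [(':Ann', ':UnivA', '22'), (':Bob', ':UnivB', '27')]
```

The trade-off is the usual one for materialized views: extra storage and maintenance cost in exchange for removing self-joins from queries that match the precomputed pattern.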
Dataset • WordNet : lexical database for English language • UniProt : large scale (80 million triples) • Protein and annotation data
Experiments • Varying number of triples in search pattern • Varying filter conditions • Varying projection list • Large-scale RDF data • Subject-property MJVs
Varying Number of Triples • ‘(?a wn:hyponymOf ?b) (?b wn:hyponymOf ?c) ….. • Increasing number of self-joins
Varying Projection List • ‘(?c0 wn:wordForm ?word) (?c0 wn:wordForm ?syn1) (?c1 wn:wordForm ?syn1) …. (5 triples) • Benefit of the projection list optimization • Eliminate joins with UriMap table for variables not referenced outside of RDF_MATCH
Large-Scale RDF Data • UniProt – 10M, 20M, 40M, 80M triples • 6 example queries given with UniProt • Number of matches remains constant as dataset size changes (ROWNUM)
UniProt Sample Queries
| Query | Description | Pattern | Projection | Result limit |
|---|---|---|---|---|
| Q1 | Display the ranges of transmembrane regions | 6 triples, 5 vars | 3 vars | 15000 rows |
| Q2 | List proteins with publications by authors with matching names | 5 triples, 5 vars, 1 LIKE pred. | 3 vars | 10 rows |
| Q3 | Count the number of times a publication by a specific author is cited | 3 triples, 2 vars | 0 vars | 32 rows |
| Q4 | List resources that are related to proteins annotated with a specific keyword | 3 triples, 2 vars | 1 var | 3000 rows |
| Q5 | List genes associated with human diseases | 7 triples, 5 vars | 3 vars | 750 rows |
| Q6 | List recently modified entries | 2 triples, 2 vars, 1 range pred. | 2 vars | 8000 rows |
RDF_MATCH Performance Scalability: Query Response Times
| Dataset size | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 |
|---|---|---|---|---|---|---|
| 10 M Triples | 0.86 | < 0.01 | < 0.01 | 0.03 | 0.18 | 0.46 |
| 20 M Triples | 0.95 | < 0.01 | < 0.01 | 0.03 | 0.19 | 0.47 |
| 40 M Triples | 0.96 | < 0.01 | < 0.01 | 0.03 | 0.18 | 0.47 |
| 80 M Triples | 1.03 | < 0.01 | < 0.01 | 0.03 | 0.20 | 0.49 |
| Maximum | 0.054 | 0.002 | 0.002 | 0.011 | 0.065 | 0.07 |
Conclusions and Future Work • SQL-based RDF querying scheme • RDF_MATCH table function • Supports graph-pattern based query on RDF data with RDFS and user-defined rules • Efficient Execution • Table Function Rewrite • Materialized Join Views: Generic and Subject-Property • Rule Indexes • Future work • OPTIONAL support – outer-join • Provenance support