Optimized Index Structures for Querying RDF from the Web

Andreas Harth, Stefan Decker andreas.harth@deri.org 3rd Latin American Web Congress

Contents • Scenario • Index Structure • Query Processing • Implementation • Evaluation • Summary

Scenario • Data collected from the Web • Much instance data, little ontology data • Storage of large amounts of data with mostly unknown schemas; fast retrieval is essential

Example RDF with Context

Notation 3 • #s #p #o . syntax for RDF • E.g. @prefix foaf: <http://xmlns.com/foaf/0.1/> . <http://decker.cn/stefan/> foaf:name "Stefan Decker" . • N3 is an extension of the RDF data model with quotation of graphs and universally quantified variables • Able to express queries with ql:select and ql:where predicates

Example Query • Get all triples where the predicate is foaf:name and the object is "Stefan Decker" @prefix foaf: <http://xmlns.com/foaf/0.1/> . @prefix yars: <http://sw.deri.org/2004/06/yars#> . @prefix ql: <http://www.w3.org/2004/12/ql#> . <> ql:where { ?s foaf:name "Stefan Decker" . } .
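The query above can be read as a quad pattern in which the subject and context are variables. A minimal sketch in Python, assuming quads are plain tuples and `None` marks a variable; the sample data, the example.org URIs, and the `match` helper are illustrative and not part of the presented system:

```python
# Illustrative data: (subject, predicate, object, context) tuples.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_HOMEPAGE = "http://xmlns.com/foaf/0.1/homepage"

quads = [
    ("http://decker.cn/stefan/", FOAF_NAME, "Stefan Decker", "http://example.org/ctx1"),
    ("http://example.org/andreas", FOAF_NAME, "Andreas Harth", "http://example.org/ctx1"),
    ("http://decker.cn/stefan/", FOAF_HOMEPAGE, "http://decker.cn/stefan/", "http://example.org/ctx2"),
]


def match(pattern, data):
    """Return every quad that agrees with the pattern; None acts as a variable."""
    return [
        quad for quad in data
        if all(p is None or p == value for p, value in zip(pattern, quad))
    ]


# Predicate and object are bound, subject and context are variables.
result = match((None, FOAF_NAME, "Stefan Decker", None), quads)
print(result)
```

This helper scans every quad, so its cost grows linearly with the data set. The index structures on the following slides exist to avoid exactly that scan.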

Contents • Scenario • Index Structure • Query Processing • Implementation • Evaluation • Summary

Indexes • Lexicon: store mappings from literal values and resources to object IDs and vice versa • Quad Index: store quads (s, p, o, c)

Object Identifiers • OIDs help to save space (only a 64-bit OID needs to be stored instead of the whole resource/literal each time)
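The lexicon and OID idea can be sketched as a two-way dictionary. This is a minimal in-memory illustration, assuming sequentially assigned integer OIDs; the `Lexicon` class, its method names, and the example context URI are hypothetical, and the slides do not say how OIDs are actually allocated:

```python
import struct


class Lexicon:
    """Two-way mapping between RDF terms (strings) and integer OIDs."""

    def __init__(self):
        self._term_to_oid = {}
        self._oid_to_term = []  # list position == OID

    def get_oid(self, term):
        """Return the OID of a term, assigning the next free one if it is new."""
        oid = self._term_to_oid.get(term)
        if oid is None:
            oid = len(self._oid_to_term)
            self._term_to_oid[term] = oid
            self._oid_to_term.append(term)
        return oid

    def get_term(self, oid):
        """Translate an OID back to the original string."""
        return self._oid_to_term[oid]


lex = Lexicon()
quad = (
    "http://decker.cn/stefan/",
    "http://xmlns.com/foaf/0.1/name",
    "Stefan Decker",
    "http://example.org/ctx",  # hypothetical context URI
)
encoded = tuple(lex.get_oid(term) for term in quad)
decoded = tuple(lex.get_term(oid) for oid in encoded)

# A 64-bit OID occupies 8 bytes regardless of how long the term is.
packed = struct.pack(">Q", encoded[2])
print(encoded, len(packed))
```

Each quad then costs four fixed-size integers in the quad index, while the full strings are stored once in the lexicon.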

Keyword Index • Simple search UIs require keyword searches • Inverted index on literals
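An inverted index on literals maps each token to the literals that contain it. The sketch below is one simple way to do this, assuming lowercase word tokens and conjunctive (AND) search; the `KeywordIndex` class, the tokenizer, and the sample OIDs are illustrative choices, not details taken from the slides:

```python
import re
from collections import defaultdict


def tokenize(literal):
    """Split a literal into lowercase word tokens."""
    return re.findall(r"\w+", literal.lower())


class KeywordIndex:
    """Inverted index: token -> set of OIDs of literals containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, oid, literal):
        for token in tokenize(literal):
            self._postings[token].add(oid)

    def search(self, query):
        """Return OIDs of literals that contain every token of the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        posting_lists = [self._postings.get(token, set()) for token in tokens]
        return set.intersection(*posting_lists)


index = KeywordIndex()
index.add(11, "Stefan Decker")
index.add(12, "Andreas Harth")
index.add(13, "Stefan Example")  # made-up literal for illustration

print(index.search("stefan"))
print(index.search("Stefan Decker"))
```

The OIDs returned by `search` can then be used as bound object values in a quad lookup.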

Quad Access Patterns • We want to be able to answer any pattern over (s, p, o, c), with each position either bound or variable, without performing a join • In total, 2*2*2*2 = 16 combinations
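The count of 16 follows from each of the four positions being either bound or a variable. A short enumeration in Python, where "?" stands for a variable position (the notation is an illustrative choice):

```python
from itertools import product

POSITIONS = ("s", "p", "o", "c")

# Each mask says, per position, whether the value is bound (True) or a variable.
patterns = [
    tuple(name if bound else "?" for name, bound in zip(POSITIONS, mask))
    for mask in product((False, True), repeat=len(POSITIONS))
]

for pattern in patterns:
    print(pattern)
print(len(patterns))
```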

Recap: B+-Trees • Underlying storage technique based on B+-trees • One property of B+-trees: range lookups/prefix lookups • (key, value) pairs with fast retrieval on given (partial) key
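The prefix lookup property can be illustrated without a real B+-tree. The sketch below substitutes a sorted list with binary search, which offers the same ordered range scan in memory; the `SortedIndex` class is a stand-in for illustration and not the storage layer used in the prototype:

```python
import bisect


class SortedIndex:
    """Stand-in for a B+-tree: keys kept sorted, range scan via binary search."""

    def __init__(self, keys):
        self._keys = sorted(keys)

    def prefix_lookup(self, prefix):
        """Return all keys that start with the given (partial) key."""
        prefix = tuple(prefix)
        # A tuple sorts before any longer tuple it is a prefix of, so
        # bisect_left lands on the first key carrying this prefix, if any.
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key in self._keys[start:]:
            if key[:len(prefix)] != prefix:
                break  # sorted order: no later key can match
            matches.append(key)
        return matches


index = SortedIndex([
    (2, 11, 5, 7),
    (2, 11, 6, 7),
    (2, 12, 5, 7),
    (1, 11, 5, 7),
    (3, 1, 1, 1),
])
print(index.prefix_lookup((2, 11)))
```

Because matching keys are contiguous in sorted order, the lookup costs one search plus a scan over the results only.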

Complete Index on Quads • Given prefix lookup capabilities, only 6 indexes are needed to cover all access patterns
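A pattern can be answered by prefix lookup when its bound positions form a prefix of an index ordering. Fewer than six orderings cannot suffice: there are six ways to bind exactly two of the four positions, and each ordering has only one two-position prefix. The check below uses one possible set of six orderings, which is an assumption for illustration (the slides name only POCS explicitly):

```python
from itertools import combinations

# One possible choice of six orderings (an assumption, see above).
ORDERINGS = ["spoc", "pocs", "ocsp", "cspo", "cpso", "ospc"]


def covering_index(bound):
    """Return an ordering whose leading positions are exactly the bound ones."""
    bound = set(bound)
    for order in ORDERINGS:
        if set(order[:len(bound)]) == bound:
            return order
    return None


# All 16 subsets of bound positions, from none bound to all four bound.
all_patterns = [
    frozenset(combo) for size in range(5) for combo in combinations("spoc", size)
]

for pattern in all_patterns:
    label = "".join(sorted(pattern)) or "(none bound)"
    print(label, "->", covering_index(pattern))
```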

Occurrence Counts • Queries for in-degree/out-degree of a node in a graph • Also: statistics for join reordering • Store occurrence counts for quad patterns directly in index
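One way to picture occurrence counts stored alongside the index is to keep a counter for every key prefix. This is a rough in-memory sketch, assuming each quad is added only once and keys follow SPOC order; the `CountingIndex` class is hypothetical and does not describe how the counts are laid out in the actual index:

```python
from collections import Counter


class CountingIndex:
    """Keeps, for every key prefix, the number of stored keys that share it."""

    def __init__(self):
        self._counts = Counter()

    def add(self, key):
        # Update the count of every prefix, from the empty one to the full key.
        for length in range(len(key) + 1):
            self._counts[key[:length]] += 1

    def count(self, prefix):
        """Number of stored keys starting with the prefix, without a scan."""
        return self._counts[tuple(prefix)]


spoc = CountingIndex()  # keys ordered subject, predicate, object, context
spoc.add((1, 2, 3, 9))
spoc.add((1, 2, 4, 9))
spoc.add((5, 2, 1, 9))

out_degree_of_1 = spoc.count((1,))  # quads with subject OID 1
print(out_degree_of_1, spoc.count((1, 2)), spoc.count(()))
```

The in-degree of a node would be read the same way from an index whose keys start with the object. The same counts can serve as cardinality estimates when reordering joins.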

Physical Access Plans • Access pattern (?s, foaf:name, "Stefan Decker", ?c) • Translate string values to OIDs (2, 11) • Determine index (POCS) • Construct key (2:11:*:*) • Perform prefix lookup on POCS with key 2:11:*:* • Translate result back to string values
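The steps of the access plan can be traced end to end in a few lines. The sketch below reuses the OIDs from the slide (2 for foaf:name, 11 for "Stefan Decker"); all other OIDs, the example.org terms, and the sorted-list stand-in for the POCS B+-tree are assumptions made for illustration:

```python
import bisect

# Lexicon: OIDs 2 and 11 are from the slide, the rest are made up.
oid_of = {
    "foaf:name": 2,
    "foaf:homepage": 3,
    "<http://decker.cn/stefan/>": 5,
    "<http://example.org/andreas>": 6,
    "<http://example.org/ctx>": 7,
    '"Stefan Decker"': 11,
    '"Andreas Harth"': 12,
}
term_of = {oid: term for term, oid in oid_of.items()}

# Index name -> sorted keys; POCS keys are (predicate, object, context, subject).
indexes = {"pocs": sorted([(2, 11, 7, 5), (2, 12, 7, 6), (3, 5, 7, 5)])}


def evaluate(pattern):
    """Run the access plan for a pattern dict; None marks a variable."""
    # 1. Translate bound string values to OIDs.
    bound = {pos: oid_of[t] for pos, t in pattern.items() if t is not None}
    # 2. Determine an index whose leading positions are the bound ones.
    order = next(o for o in indexes if set(o[:len(bound)]) == set(bound))
    # 3. Construct the key prefix, e.g. (2, 11) for 2:11:*:*.
    prefix = tuple(bound[pos] for pos in order[:len(bound)])
    # 4. Perform the prefix lookup.
    keys = indexes[order]
    start = bisect.bisect_left(keys, prefix)
    hits = []
    for key in keys[start:]:
        if key[:len(prefix)] != prefix:
            break
        hits.append(key)
    # 5. Translate back to strings, reordered as (s, p, o, c).
    results = []
    for key in hits:
        by_pos = dict(zip(order, key))
        results.append(tuple(term_of[by_pos[pos]] for pos in "spoc"))
    return results


answer = evaluate({"s": None, "p": "foaf:name", "o": '"Stefan Decker"', "c": None})
print(answer)
```

Only the POCS index is populated here, so patterns that need a different ordering are outside the scope of this sketch.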

Prototype Implementation • Java, JDBM, Apache Tomcat • HTTP interface: GET/POST for querying, PUT for adding data, DELETE for removing data
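The mapping of HTTP methods to operations can be sketched from the client side. The slides give only the methods, so the base URL, the per-context path, the `q` query parameter, and the `text/plain` content type below are all hypothetical placeholders. The code only builds request objects and sends nothing:

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8080/store/"  # hypothetical endpoint


def build_requests(context, n3_data, n3_query):
    """Build one request per operation; paths and parameters are assumptions."""
    target = BASE_URL + context
    headers = {"Content-Type": "text/plain"}  # assumed media type
    return {
        # PUT adds data, DELETE removes data.
        "add": Request(target, data=n3_data.encode("utf-8"),
                       headers=headers, method="PUT"),
        "remove": Request(target, method="DELETE"),
        # GET or POST submits a query.
        "query_get": Request(BASE_URL + "?" + urlencode({"q": n3_query}),
                             method="GET"),
        "query_post": Request(BASE_URL, data=n3_query.encode("utf-8"),
                              headers=headers, method="POST"),
    }


requests = build_requests(
    "ctx1",
    '<http://decker.cn/stefan/> <http://xmlns.com/foaf/0.1/name> "Stefan Decker" .',
    '<> <http://www.w3.org/2004/12/ql#where> { ?s ?p "Stefan Decker" . } .',
)
for name, request in requests.items():
    print(name, request.get_method(), request.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would send it, provided a server with a matching interface is running.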

Index Construction Performance - Lehigh Univ(20)

Query Performance Evaluation • Query 1: ?x rdf:type univ:UndergradStudent • Query 2: ?x ?p "UndergraduateStudent0" • Query 3: <http://www.Univ965.edu> ?p ?o • Query 4: ?x univ:worksFor ?y

Summary • Complete index on RDF quads to minimize joins • Keyword-based searches • Extensive statistical information • High-performance metadata repository • Core storage backend and query-processing technology for SWSE (Semantic Web Search Engine) • Used in projects at, e.g., the University of Karlsruhe (RDFReactor, RDF2Go)
