Optimized Index Structures for Querying RDF from the Web
Optimized Index Structures for Querying RDF from the Web • Andreas Harth, Stefan Decker (andreas.harth@deri.org) • 3rd Latin American Web Congress
Contents • Scenario • Index Structure • Query Processing • Implementation • Evaluation • Summary
Scenario • Data collected from the Web • Much instance data, little ontology data • Large amounts of data with mostly unknown schemas must be stored; fast retrieval is essential
Notation 3 • #s #p #o . syntax for RDF • E.g. @prefix foaf: <http://xmlns.com/foaf/0.1/> . <http://decker.cn/stefan/> foaf:name "Stefan Decker" . • N3 is an extension of the RDF data model with quotation of graphs and universally quantified variables • Queries can be expressed with the ql:select and ql:where predicates
Example Query • Get all triples where the predicate is foaf:name and the object is "Stefan Decker" • @prefix foaf: <http://xmlns.com/foaf/0.1/> . @prefix yars: <http://sw.deri.org/2004/06/yars#> . @prefix ql: <http://www.w3.org/2004/12/ql#> . <> ql:where { ?s foaf:name "Stefan Decker" . } .
Indexes • Lexicon: store mappings from literal values and resources to object IDs and vice versa • Quad Index: store quads (s, p, o, c)
Object Identifiers • OIDs save space: the indexes store a 64-bit OID instead of repeating the whole resource or literal
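A minimal sketch of the lexicon idea, to make the two directions of the mapping concrete. The class name Lexicon, the sequential OID assignment, and the example context URI are assumptions made for illustration; the slides do not say how OIDs are allocated or stored.

```python
import struct


class Lexicon:
    """Bidirectional mapping between RDF terms (strings) and integer OIDs."""

    def __init__(self):
        self._term_to_oid = {}
        self._oid_to_term = []

    def oid(self, term):
        """Return the OID for a term, assigning the next free one if it is new."""
        if term not in self._term_to_oid:
            self._term_to_oid[term] = len(self._oid_to_term)
            self._oid_to_term.append(term)
        return self._term_to_oid[term]

    def term(self, oid):
        """Return the term that was registered under the given OID."""
        return self._oid_to_term[oid]


lexicon = Lexicon()
# Hypothetical quad; the context URI is made up for this example.
quad = (
    "<http://decker.cn/stefan/>",
    "<http://xmlns.com/foaf/0.1/name>",
    '"Stefan Decker"',
    "<http://example.org/context>",
)
encoded = tuple(lexicon.oid(t) for t in quad)
decoded = tuple(lexicon.term(o) for o in encoded)

# Four unsigned 64-bit integers: a fixed-width key regardless of term length.
packed = struct.pack(">4Q", *encoded)
```

With this encoding a quad always occupies 32 bytes in an index, while the full strings are stored once in the lexicon.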
Keyword Index • A simple search UI requires keyword search • Inverted index on literals
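An inverted index on literals can be sketched as a map from token to the set of literal OIDs that contain it. The tokenizer, the AND semantics for multi-word queries, and the OID values below are assumptions for illustration, not details taken from the slides.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(literals):
    """literals maps OID -> literal string; the result maps token -> set of OIDs."""
    index = defaultdict(set)
    for oid, text in literals.items():
        for token in tokenize(text):
            index[token].add(oid)
    return index


def search(index, query):
    """Return the OIDs of literals that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical literal OIDs.
literals = {11: "Stefan Decker", 12: "Andreas Harth", 13: "Stefan Example"}
index = build_inverted_index(literals)
hits = search(index, "stefan decker")
```

The OIDs returned by the keyword search can then be used as bound object values in quad lookups.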
Quad Access Patterns • Goal: answer any pattern over (s, p, o, c) without performing a join • Each of the four positions is either bound or a variable, so in total 2*2*2*2 = 16 access patterns
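The count of 16 follows from each position being either bound or a variable. A short enumeration, with variables written as ?s, ?p, ?o, ?c (a notation chosen here for illustration):

```python
from itertools import product

POSITIONS = ("s", "p", "o", "c")


def access_patterns():
    """Yield every bound/variable combination over the four quad positions."""
    for mask in product((False, True), repeat=len(POSITIONS)):
        yield tuple(
            name if bound else "?" + name
            for name, bound in zip(POSITIONS, mask)
        )


patterns = list(access_patterns())
for pattern in patterns:
    print(pattern)
```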
Recap: B+-Trees • Underlying storage technique based on B+-trees • One property of B+-trees: range lookups/prefix lookups • (key, value) pairs with fast retrieval on given (partial) key
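The prefix lookup property can be illustrated without a real B+-tree: a sorted list of keys stands in for the tree's ordered leaf level, and a binary search finds where the matching range starts. This is a sketch of the access pattern only; the function name and the OID values are made up.

```python
from bisect import bisect_left


def prefix_lookup(sorted_keys, prefix):
    """Return all keys in sorted_keys (tuples) that start with the tuple prefix."""
    # A shorter tuple sorts before any longer tuple it is a prefix of,
    # so bisect_left lands on the first candidate match.
    start = bisect_left(sorted_keys, prefix)
    results = []
    for key in sorted_keys[start:]:
        if key[: len(prefix)] != prefix:
            break  # keys are sorted, so the matching range has ended
        results.append(key)
    return results


keys = sorted([(2, 11, 5, 7), (2, 11, 6, 7), (2, 12, 5, 7), (3, 11, 5, 7)])
matches = prefix_lookup(keys, (2, 11))
```

Because matching keys are contiguous in sort order, the lookup costs one search plus a scan over the results, which is the behaviour the index design relies on.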
Complete Index on Quads • Given prefix lookup capabilities, only 6 indexes are needed to cover all 16 access patterns
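The claim can be checked mechanically: a pattern is answerable by prefix lookup if its bound positions form the leading positions of some index ordering. The six orderings below are one possible covering set chosen for this sketch; the slides name only POCS explicitly, so the other five are assumptions. Six is also a lower bound, since each ordering has exactly one two-position prefix and there are six ways to bind two of the four positions.

```python
from itertools import combinations

# One possible set of six orderings (an assumption; only POCS appears in the slides).
INDEXES = ["SPOC", "POCS", "OCSP", "CSPO", "CPSO", "OSPC"]


def covering_index(bound, indexes=INDEXES):
    """Return an ordering whose leading positions are exactly the bound set, or None."""
    for order in indexes:
        if set(order[: len(bound)]) == set(bound):
            return order
    return None


# All 16 subsets of bound positions, from nothing bound to everything bound.
all_patterns = [frozenset(c) for r in range(5) for c in combinations("SPOC", r)]
coverage = {"".join(sorted(p)): covering_index(p) for p in all_patterns}
```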
Occurrence Counts • Queries for in-degree/out-degree of a node in a graph • Also: statistics for join reordering • Store occurrence counts for quad patterns directly in index
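One way to keep such counts is to increment a counter for every key prefix when a quad is inserted, so that a count query becomes a single lookup. The slides do not describe how counts are laid out in the index; the CountingIndex class, the assumption that each quad is inserted only once, and the OID values are all illustrative.

```python
from collections import Counter


def reorder(quad, order):
    """Rearrange an (s, p, o, c) quad into the key order of an index such as 'POCS'."""
    by_name = dict(zip("SPOC", quad))
    return tuple(by_name[name] for name in order)


class CountingIndex:
    """Keeps an occurrence count for every key prefix of one index ordering."""

    def __init__(self, order):
        self.order = order
        self.counts = Counter()

    def add(self, quad):
        # Assumes the quad has not been added before (no duplicate check).
        key = reorder(quad, self.order)
        for n in range(len(key) + 1):
            self.counts[key[:n]] += 1

    def count(self, prefix):
        """Number of stored quads whose key starts with the given prefix."""
        return self.counts[tuple(prefix)]


spoc, ocsp, pocs = CountingIndex("SPOC"), CountingIndex("OCSP"), CountingIndex("POCS")
# Hypothetical quads as (s, p, o, c) OIDs.
for quad in [(1, 2, 11, 7), (1, 3, 12, 7), (4, 2, 11, 7)]:
    for index in (spoc, ocsp, pocs):
        index.add(quad)

out_degree = spoc.count((1,))   # quads with subject 1
in_degree = ocsp.count((11,))   # quads with object 11
```

The same counts can serve as selectivity estimates when reordering joins, for example by evaluating the pattern with the smallest count first.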
Physical Access Plans • Access pattern (?s, foaf:name, "Stefan Decker", ?c) • Translate the bound string values to OIDs (2, 11) • Determine the index whose key starts with the bound positions (POCS) • Construct the key (2:11:*:*) • Perform a prefix lookup on POCS with key 2:11:*:* • Translate the result back to string values
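The steps of this plan can be sketched end to end for the one pattern shown. The OIDs 2 and 11 follow the slide; every other OID, the context URI, the second quad, and the evaluate function are hypothetical, and a sorted list again stands in for the B+-tree.

```python
from bisect import bisect_left

# Lexicon: OIDs 2 and 11 are taken from the slide, the rest are made up.
TERM_TO_OID = {
    "<http://decker.cn/stefan/>": 1,
    "foaf:name": 2,
    "foaf:homepage": 3,
    "<http://example.org/context>": 7,
    '"Stefan Decker"': 11,
    "<http://decker.cn/>": 12,
}
OID_TO_TERM = {oid: term for term, oid in TERM_TO_OID.items()}

# POCS index: keys are (predicate, object, context, subject) OIDs, kept sorted.
POCS = sorted([(2, 11, 7, 1), (3, 12, 7, 1)])


def evaluate(pattern):
    """Evaluate an (s, p, o, c) pattern; None marks a variable."""
    s, p, o, c = pattern
    if s is not None or c is not None or p is None or o is None:
        raise ValueError("this sketch only handles patterns of the form (?s, p, o, ?c)")
    # Steps 1-3: translate to OIDs and build the key prefix for POCS.
    prefix = (TERM_TO_OID[p], TERM_TO_OID[o])
    # Step 4: prefix lookup.
    start = bisect_left(POCS, prefix)
    results = []
    for key in POCS[start:]:
        if key[:2] != prefix:
            break
        p_id, o_id, c_id, s_id = key
        # Step 5: translate back to strings, in (s, p, o, c) order.
        results.append(tuple(OID_TO_TERM[i] for i in (s_id, p_id, o_id, c_id)))
    return results


answers = evaluate((None, "foaf:name", '"Stefan Decker"', None))
```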
Prototype Implementation • Java, JDBM, Apache Tomcat • HTTP interface: GET/POST for querying, PUT for adding data, DELETE for removing data
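A sketch of how a client might map these operations onto HTTP methods, using only the Python standard library. The base URL, the resource path, and the text/n3 content type are assumptions; the slides state only which method is used for which operation. The requests are constructed but not sent.

```python
from urllib.request import Request

# Hypothetical endpoint; the real deployment URL and paths are not given in the slides.
BASE_URL = "http://localhost:8080/yars/"

N3_DATA = (
    "@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n"
    '<http://decker.cn/stefan/> foaf:name "Stefan Decker" .\n'
)
N3_QUERY = (
    "@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n"
    "@prefix ql: <http://www.w3.org/2004/12/ql#> .\n"
    '<> ql:where { ?s foaf:name "Stefan Decker" . } .\n'
)
HEADERS = {"Content-Type": "text/n3"}  # assumed media type

# PUT adds data, POST submits a query, DELETE removes data.
add_request = Request(BASE_URL + "stefan", data=N3_DATA.encode("utf-8"),
                      headers=HEADERS, method="PUT")
query_request = Request(BASE_URL, data=N3_QUERY.encode("utf-8"),
                        headers=HEADERS, method="POST")
delete_request = Request(BASE_URL + "stefan", method="DELETE")

# Sending would be done with urllib.request.urlopen(add_request), and so on.
```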
Query Performance Evaluation • Queries: • 1: ?x rdf:type univ:UndergradStudent • 2: ?x ?p "UndergraduateStudent0" • 3: <http://www.Univ965.edu> ?p ?o • 4: ?x univ:worksFor ?y
Summary • Complete index on RDF quads to minimize joins • Keyword-based searches • Extensive statistical information • High-performance metadata repository • Core storage backend and query-processing technology for SWSE (Semantic Web Search Engine) • Used in projects at e.g. University of Karlsruhe (RDFReactor, RDF2Go)