gStore: Answering SPARQL Queries Via Subgraph Matching

Lei Zou1, Jinghui Mo1, Lei Chen2, M. Tamer Özsu3, Dongyan Zhao1 (1Peking University, 2Hong Kong University of Science and Technology, 3University of Waterloo)

Outline • Background & Related Work • Overview of gStore • Encoding Technique • VS*-tree & Query Algorithm • Experiments • Conclusions

Semantic Web “Semantic Web Technologies” is a collection of standard technologies to realize a Web of Data.

RDF Data Model URI Literals URI

RDF Graph Literal Vertex Entity Vertex

SPARQL Queries SPARQL Query: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> “1809-02-12”. ?m <DiedOnDate> “1865-04-15”. } Query Graph

Subgraph Match vs. SPARQL Queries

Naïve Triple Store SPARQL Query: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> “1809-02-12”. ?m <DiedOnDate> “1865-04-15”. } Too many Self-Joins SQL: Select T3.Object From T as T1, T as T2, T as T3 Where T1.Predict=“BornOnDate” and T1.Object=“1809-02-12” and T2.Predict=“DiedOnDate” and T2.Object=“1865-04-15” and T3.Predict=“hasName” and T1.Subject = T2.Subject and T2.Subject = T3.Subject
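The self-join pattern on this slide can be tried on a toy table. The following is a minimal sketch using Python's built-in sqlite3 module; the table name T and the column names follow the slide, while the subject identifier "y:Abraham_Lincoln" is a made-up value for illustration. The query projects the object of the hasName triple so that the name is returned.

```python
import sqlite3

# Hypothetical one-table triple store T(Subject, Predict, Object).
# The subject identifier "y:Abraham_Lincoln" is invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (Subject TEXT, Predict TEXT, Object TEXT)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [
        ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
        ("y:Abraham_Lincoln", "BornOnDate", "1809-02-12"),
        ("y:Abraham_Lincoln", "DiedOnDate", "1865-04-15"),
    ],
)

# One alias of T per triple pattern: three patterns, hence two self-joins.
rows = conn.execute(
    """
    SELECT T3.Object
    FROM T AS T1, T AS T2, T AS T3
    WHERE T1.Predict = 'BornOnDate' AND T1.Object = '1809-02-12'
      AND T2.Predict = 'DiedOnDate' AND T2.Object = '1865-04-15'
      AND T3.Predict = 'hasName'
      AND T1.Subject = T2.Subject AND T2.Subject = T3.Subject
    """
).fetchall()
print(rows)
```

Every additional triple pattern in the SPARQL query adds another alias of T and another join condition, which is the cost the slide refers to.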

Existing Solutions Three categories of solutions are proposed to speed up query processing: 1. Property Table: Jena [K. Wilkinson et al. SWDB 03], … 2. Vertically Partitioned Solution: SW-store [D. J. Abadi et al. VLDB 07], … 3. Exhaustive Indexing: RDF-3x [T. Neumann et al. VLDB 08], Hexastore [C. Weiss et al. VLDB 08], …

Existing Solutions-Property Table SPARQL Query: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> “1809-02-12”. ?m <DiedOnDate> “1865-04-15”. } Reducing # of join steps SQL: Select People.hasName from People where People.BornOnDate = “1809-02-12” and People.DiedOnDate = “1865-04-15”.

Existing Solutions-Vertically Partitioned Solution Fast Merge Join

Existing Solutions- Exhaustive-Indexing Range query & Merge Join Each SPARQL query statement can be translated into one “range query”. SPARQL Query: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> “1809-02-12”. ?m <DiedOnDate> “1865-04-15”. }

Some Limitations • Difficult to handle “wildcard queries”. • Difficult to handle updates.

Intuition of gStore Finding Matches over a Large Graph is not a trivial task.

Preliminaries Literal Vertex Entity Vertex

Preliminaries • RDF graph

Preliminaries • Query Graph

Preliminaries • match

Preliminaries • Problem definition

Storage Schema in gStore Encoding all neighbors of a vertex into a “bit-string”, called a signature.

Encoding Technique (1) • |eSig(e).e| = M. • We employ m different string hash functions Hi (i = 1, ..., m). • For each hash function Hi, we set the (Hi(eLabel) MOD M)-th bit in eSig(e).e to ‘1’. • Encoding eSig(e).n works the same way: • |eSig(e).n| = N • n different hash functions
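The label-encoding rule above can be sketched as follows. This is a minimal illustration, not the hash family used by gStore: the helper names make_hash and encode_label are hypothetical, and seeding SHA-1 to obtain m different hash functions is an assumption made only to keep the example deterministic.

```python
import hashlib

def make_hash(seed: int):
    """Return one string hash function H_i (seeded SHA-1 is an illustrative assumption)."""
    def h(s: str) -> int:
        digest = hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return h

def encode_label(label: str, M: int, m: int) -> int:
    """Build an M-bit string by setting bit (H_i(label) MOD M) for i = 1..m."""
    sig = 0
    for i in range(1, m + 1):
        sig |= 1 << (make_hash(i)(label) % M)
    return sig

# M = 12 and m = 3 are arbitrary example values.
sig = encode_label("hasName", M=12, m=3)
print(format(sig, "012b"))
```

At most m bits are set per label; fewer are set when two hash functions collide on the same position.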

Encoding Technique (2) [Figure: the literal “Abraham Lincoln” is split into 3-grams (“Abr”, “bra”, “rah”, “aha”, …) that are hashed into a bit string. The signatures of the adjacent edges (hasName, “Abraham Lincoln”), (BornOnDate, “1809-02-12”), (DiedOnDate, “1865-04-15”) and (DiedIn, “y:Washington_D.c”) are combined with bitwise OR into one vertex signature.]
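The 3-gram hashing and the OR step on this slide can be sketched as below. The function names, the bit lengths (M = 12, N = 16), the number of hash functions (k = 2) and the seeded SHA-1 hash are all assumptions chosen for illustration; only the n-gram splitting, the per-edge encoding and the bitwise OR come from the slides. The final containment check reflects how such signatures are normally used as a filter: a data vertex can only match a query vertex if every bit set in the query signature is also set in the data signature.

```python
import hashlib

def _hash(seed: int, s: str) -> int:
    # Stand-in hash family (assumption): SHA-1 of "seed:string".
    digest = hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def set_bits(items, length: int, k: int) -> int:
    """Hash every item with k functions into a bit string of the given length."""
    sig = 0
    for item in items:
        for seed in range(k):
            sig |= 1 << (_hash(seed, item) % length)
    return sig

def ngrams(text: str, n: int = 3):
    """All contiguous substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def edge_signature(label: str, neighbor: str, M: int = 12, N: int = 16, k: int = 2) -> int:
    """Concatenate the label part (M bits) with the neighbor part (N bits)."""
    label_part = set_bits([label], M, k)
    neighbor_part = set_bits(ngrams(neighbor), N, k)
    return (label_part << N) | neighbor_part

def vertex_signature(edges) -> int:
    """Bitwise OR of the signatures of all adjacent edges."""
    sig = 0
    for label, neighbor in edges:
        sig |= edge_signature(label, neighbor)
    return sig

lincoln = vertex_signature([
    ("hasName", "Abraham Lincoln"),
    ("BornOnDate", "1809-02-12"),
    ("DiedOnDate", "1865-04-15"),
    ("DiedIn", "y:Washington_D.c"),
])
query = vertex_signature([
    ("BornOnDate", "1809-02-12"),
    ("DiedOnDate", "1865-04-15"),
])

# Filter step: the data vertex stays a candidate only if it covers the query bits.
may_match = (query & lincoln) == query
print(ngrams("Abraham Lincoln")[:4], may_match)
```

The check can produce false positives because different strings may hash to the same bits, so surviving candidates still need to be verified against the actual graph. It never rejects a true match, since OR-ing more edges into a signature only adds bits.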

Encoding Technique (3)

Outline • Background & Related Work • Overview of gStore • Encoding Technique • VS-tree & Query Algorithm • Experiments • Conclusions

A Straightforward Solution (1) u2 u1 L1 L2

A Straightforward Solution (2) L1 L2 Large Join Space!

VS-tree

VS-Tree query definition

Pruning Technique Reduced Join Space! u2 u1 10010

Query Algorithm-Top-Down

Optimized method • Too many super edges • Which level to start search • No brute-force enumeration

VS*-Tree Insert • The insertion criterion in the VS-tree depends only on the Hamming distance between the signature of u and that of the node in the VS-tree. • The criterion in the VS∗-tree depends on both the node signatures and the structure of G∗.

Updates- Insertion in G*

Updates- Insertion in VS*-tree

VS*-Tree split • The B+1 entries of the node are partitioned into two new nodes, where B is the maximal fanout of a node in the VS∗-tree. • 1. We find the two entries with the maximal Hamming distance between them and use them as the two seed nodes. • 2. We associate each remaining entry with the nearest seed node, according to Equation 1.
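The two split steps above can be sketched as follows. Equation 1 is not reproduced in this transcript, so the sketch uses the plain Hamming distance between signatures as a stand-in for it; that substitution, the function names and the example signatures are all assumptions.

```python
from itertools import combinations

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")

def split_node(entries):
    """Split a list of B+1 signatures into two groups around the two farthest seeds."""
    # Step 1: the pair with the maximal Hamming distance becomes the two seeds.
    i, j = max(
        combinations(range(len(entries)), 2),
        key=lambda p: hamming(entries[p[0]], entries[p[1]]),
    )
    left, right = [entries[i]], [entries[j]]
    # Step 2: every remaining entry joins the nearer seed (ties go to the first seed).
    for k, sig in enumerate(entries):
        if k in (i, j):
            continue
        if hamming(sig, entries[i]) <= hamming(sig, entries[j]):
            left.append(sig)
        else:
            right.append(sig)
    return left, right

# Five 4-bit signatures, i.e. B = 4 in this toy example.
left, right = split_node([0b0000, 0b0001, 0b1111, 0b1110, 0b0011])
print([format(s, "04b") for s in left], [format(s, "04b") for s in right])
```

The seed search compares all pairs, which is quadratic in the fanout; that is acceptable here because a node holds at most B+1 entries.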

VS*-Tree deletion • Similar to split • If some node d has fewer than b entries, where b is the minimal fanout of a node in the VS∗-tree, then d is deleted and its entries are reinserted into the VS∗-tree.

Updates- Deletion in VS*-tree To be deleted

Which Level To Begin • We introduce the concept of the “pruning power” of GI with regard to Q∗, denoted as P(Q∗, GI).

Estimate P(Q*,GI)

Finding Valid Child States • We propose a DFS strategy to find all valid child states of J. • The DFS over G∗ starts from some vertex vi.

Datasets

Offline Performance

Exact Queries

Wildcard Queries
