G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs
Contributions • G-SPARQL language • Pattern matching • Reachability • Hybrid execution engine • Graph topology in main memory • Graph data in relational database • Algebraic transformation • Operators • Optimizations • Experimental evaluation
1. G-SPARQL Query Language • Extends a subset of SPARQL • Based on triple pattern: (subject, predicate, object) • Sub-graph matching patterns on • Graph structure • Node attribute • Edge attribute • Reachability patterns on • Path • Shortest path
G-SPARQL Pattern Matching • Node attribute • ?Person @officeNumber “518” • Edge attribute • ?E @Role “Programmer” • Structural • ?Person worksAt Microsoft • ?Person ?E(worksAt) Microsoft
G-SPARQL Reachability • Path • Subject ??PathVar Object • Shortest path • Subject ?*PathVar Object • Path filters • Path length • All edges • All nodes
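The pattern forms listed on the two slides above can be told apart from the shape of the predicate alone. The following is a minimal Python sketch, assuming each pattern arrives as a whitespace-separated "subject predicate object" string; the function names and the category labels are illustrative and are not part of G-SPARQL itself.

```python
import re

# Predicate of the form ?E(worksAt): a relationship bound to an edge variable.
EDGE_VAR = re.compile(r"^(\?\w+)\((\w+)\)$")


def split_pattern(pattern):
    """Split 'subject predicate object.' into its three parts."""
    subject, predicate, obj = pattern.strip().rstrip(".").split(None, 2)
    return subject, predicate, obj


def classify(patterns):
    """Label each triple pattern with the kind of match it expresses."""
    triples = [split_pattern(p) for p in patterns]

    # First pass: collect edge variables, so that '?E @Role ...' can be
    # recognised as an edge attribute rather than a node attribute.
    edge_vars = set()
    for _, predicate, _ in triples:
        match = EDGE_VAR.match(predicate)
        if match:
            edge_vars.add(match.group(1))

    kinds = []
    for subject, predicate, _ in triples:
        if predicate.startswith("??"):
            kinds.append("path")
        elif predicate.startswith("?*"):
            kinds.append("shortest path")
        elif predicate.startswith("@"):
            kinds.append("edge attribute" if subject in edge_vars else "node attribute")
        else:
            kinds.append("structural")
    return kinds
```

For example, `classify(['?Person ?E(worksAt) Microsoft', '?E @Role "Programmer"'])` labels the first pattern as structural and the second as an edge attribute, because `?E` was bound to an edge in the first pattern.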
Example: G-SPARQL Query • SELECT ?L1 ?L2 • WHERE { ?X ??P ?Y. ?X @Label ?L1. ?Y @Label ?L2. ?X @Age ?Age1. ?Y @Age ?Age2. ?X Affiliated UNSW. ?Y ?E(Affiliated) Microsoft. ?X LivesIn Sydney. ?E @Title "Researcher". FILTER(?Age1 >= 40). FILTER(?Age2 >= 40). FILTERPATH( Length( ??P, <= 3) ). • }
Outline • G-SPARQL language • Pattern matching • Reachability • Hybrid execution engine • Graph topology in main memory • Graph data in relational database • Algebraic transformation • Operators • Optimizations • Experimental evaluation
2. Hybrid Execution Engine • Reachability queries • Main memory algorithms • Example: BFS and Dijkstra’s algorithm • Pattern matching queries • Relational database • Indexing • Example: B-tree • Query optimizations • Example: selectivity estimation and join ordering • Recursive queries • Not efficient: large intermediate results and multiple joins
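The reachability side of the engine can be pictured as a plain breadth-first search over an in-memory adjacency structure. The sketch below is an assumption-laden illustration, not the engine's code: the adjacency dictionary, the function name `shortest_path`, and the `max_len` parameter (standing in for a path-length filter such as `Length(??P, <= 3)`) are all hypothetical. Edges are treated as unweighted; a weighted graph would call for Dijkstra's algorithm instead.

```python
from collections import deque


def shortest_path(adj, source, target, max_len=None):
    """BFS over an adjacency dict.

    Returns the node list of a shortest path from source to target,
    or None if no path exists within max_len edges.
    """
    parent = {source: None}   # also serves as the visited set
    depth = {source: 0}       # number of edges from source
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        if max_len is not None and depth[node] == max_len:
            continue  # do not expand beyond the length bound
        for nxt in adj.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return None
```

Because BFS reaches every node by a shortest route first, a non-None result with `max_len=3` also answers the plain path pattern "is there some path of at most three edges".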
Graph Representation • [Figure: graph representation. Names shown in the figure: established, Node Label, age, office, location, keyword, type, authorOf, affiliated, published, citedBy, country, order, month, title, know, supervise]
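One way to picture "graph data in a relational database, topology in main memory" is sketched below in Python. Everything specific here is an assumption made for illustration: `sqlite3` stands in for the relational engine, the layout (one two-column table per attribute, one three-column table per relationship) is a guess at a plausible schema rather than the one in the figure, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed layout: one table per node attribute, keyed by node id.
    CREATE TABLE label (node_id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE age   (node_id INTEGER PRIMARY KEY, value INTEGER);
    -- Assumed layout: one table per relationship, one row per edge.
    CREATE TABLE affiliated (edge_id INTEGER PRIMARY KEY, src INTEGER, dst INTEGER);
    -- Edge attributes are keyed by edge id.
    CREATE TABLE title (edge_id INTEGER PRIMARY KEY, value TEXT);
    CREATE INDEX idx_age_value ON age(value);
""")
# Invented sample data.
conn.executemany("INSERT INTO label VALUES (?, ?)", [(1, "Alice"), (2, "Bob"), (3, "UNSW")])
conn.executemany("INSERT INTO age VALUES (?, ?)", [(1, 45), (2, 30)])
conn.executemany("INSERT INTO affiliated VALUES (?, ?, ?)", [(10, 1, 3), (11, 2, 3)])
conn.executemany("INSERT INTO title VALUES (?, ?)", [(10, "Researcher"), (11, "Programmer")])


def people_affiliated(conn, org_label, min_age):
    """Pattern matching in SQL: labels of nodes affiliated with org_label, aged >= min_age."""
    rows = conn.execute(
        """
        SELECT l.value
        FROM affiliated a
        JOIN label l ON l.node_id = a.src
        JOIN age   g ON g.node_id = a.src
        JOIN label o ON o.node_id = a.dst
        WHERE o.value = ? AND g.value >= ?
        ORDER BY l.value
        """,
        (org_label, min_age),
    ).fetchall()
    return [row[0] for row in rows]


def load_topology(conn):
    """Copy only the edge endpoints into a main-memory adjacency dict."""
    adj = {}
    for src, dst in conn.execute("SELECT src, dst FROM affiliated ORDER BY edge_id"):
        adj.setdefault(src, []).append(dst)
    return adj
```

The attribute values stay in the database, where indexes and the query optimizer can work on them, while `load_topology` keeps only node ids and edges in memory for traversal.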
Hybrid Execution Engine: interfaces • [Figure: a G-SPARQL query goes in; traversal operations and SQL commands come out]
3. Intermediate Language & Compilation • [Figure: G-SPARQL query -> front-end compilation (Step 1) -> algebraic query plan -> back-end compilation (Step 2) -> physical execution plan (SQL commands and traversal operations)]
Intermediate Language • Objective • Generate query plan and chop it • Reachability part -> main-memory algorithms on topology • Pattern matching part -> relational database • Optimizations • Features • Independent of execution engine and graph representation • Algebraic query plan
G-SPARQL Algebra • Variant of “Tuple Algebra” • Algebra details • Data: tuples • Sets of nodes, edges, paths. • Operators • Relational: select, project, join • Graph specific: node and edge attributes, adjacency • Path operators
[Figure: G-SPARQL operators grouped as relational and not relational]
Front-end Compilation (Step 1) • Input • G-SPARQL query • Output • Algebraic query plan • Technique • Map • from triple patterns • To G-SPARQL operators • Use inference rules
Front-end Compilation: Optimizations • Objective • Delay execution of traversal operations • Technique • Order triple patterns, based on restrictiveness • Heuristics • Triple pattern P1 is more restrictive than P2 • P1 has fewer path variables than P2 • P1 has fewer variables than P2 • P1’s variables have more filter statements than P2’s variables
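The three heuristics above can be read as a sort key. The Python sketch below assumes they are applied in the order listed (fewer path variables first, then fewer variables, then more filters); the slide does not say how the heuristics are combined, so that lexicographic ordering, the function names, and the `filter_counts` mapping are all assumptions.

```python
import re

# Matches ?X, ??P and ?*P style variables.
VAR = re.compile(r"\?[?*]?\w+")


def restrictiveness_key(pattern, filter_counts):
    """Smaller key = more restrictive pattern (assumed lexicographic order)."""
    tokens = VAR.findall(pattern)
    path_vars = [t for t in tokens if t.startswith(("??", "?*"))]
    variables = set(tokens)
    # filter_counts maps a variable name to the number of FILTER
    # statements that mention it (hypothetical input).
    filters = sum(filter_counts.get(v, 0) for v in variables)
    return (len(path_vars), len(variables), -filters)


def order_patterns(patterns, filter_counts):
    """Most restrictive patterns first, so traversal patterns run last."""
    return sorted(patterns, key=lambda p: restrictiveness_key(p, filter_counts))
```

With this key, a pattern containing a path variable such as `?X ??P ?Y` always sorts after the attribute and structural patterns, which matches the stated objective of delaying traversal operations.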
Back-end Compilation (Step 2) • Input • G-SPARQL algebraic plan • Output • SQL commands • Traversal operations • Technique • Substitute G-SPARQL relational operators with SPJ (select/project/join) • Traverse • Bottom up • Stop when reaching root or reaching non-relational operator • Transform relational algebra to SQL commands • Send non-relational commands to main memory algorithms
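The bottom-up traversal described above can be sketched as a post-order walk that closes off each all-relational subtree as soon as the climb meets a non-relational operator or the root. This Python sketch is illustrative only: the `Op` class, the operator names, and the `remaining` list (relational operators that sit above a traversal operator, whose handling the slide does not describe) are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    relational: bool
    children: list = field(default_factory=list)


def split_plan(root):
    """Split a plan into SQL fragment roots, traversal operators, and the rest.

    fragments:  roots of maximal subtrees made only of relational operators
    traversals: non-relational operators, for the main-memory algorithms
    remaining:  relational operators above a traversal (handling not specified)
    """
    fragments, traversals, remaining = [], [], []

    def visit(op):
        # Post-order: children first, so the walk proceeds bottom up.
        flags = [visit(child) for child in op.children]
        if op.relational and all(flags):
            return True  # still inside an all-relational subtree; keep climbing
        # The climb stops here: close the relational subtrees below op.
        for child, is_relational in zip(op.children, flags):
            if is_relational:
                fragments.append(child.name)
        (remaining if op.relational else traversals).append(op.name)
        return False

    if visit(root):
        fragments.append(root.name)  # reached the root without leaving SPJ
    return fragments, traversals, remaining
```

Each name in `fragments` marks a subtree that would be translated to one SQL command, while each name in `traversals` would be dispatched to the in-memory graph algorithms.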
Back-end Compilation: Optimizations • Optimize a fragment of query plan • Before generating SQL command • All operators are Select/Project/Join • Apply standard techniques • For example pushing selection
Example: G-SPARQL Query • SELECT ?L1 ?L2 • WHERE { ?X ??P ?Y. ?X @label ?L1. ?Y @label ?L2. ?X @age ?Age1. ?Y @age ?Age2. ?X affiliated UNSW. ?Y ?E(affiliated) Microsoft. ?X livesIn Sydney. ?E @title "Researcher". FILTER(?Age1 >= 40). FILTER(?Age2 >= 40). • }
4. Experimental Evaluation • Objective • Show that the hybrid approach is worthwhile • Good performance from combining the DBMS with main memory topology • Data sets • Real ACM bibliographic network • Synthetic graphs • See technical report
Experimental Environment • Workload • Created Q1 … Q12 • Process • Compare to Neo4j (non-optimized, optimized) • Environment • Implementation • Main memory algorithms in C++ • IBM DB2 • PC Server
Conclusions • G-SPARQL Language • Expresses pattern matching and reachability queries on attributed graphs • Hybrid engine • Graph topology in main memory • Graph data in database • Compilation into algebraic plan • Operators and optimizations • Evaluation • Real and synthetic datasets • Good performance • Leveraging database engine and main memory topology