G-SPARQL is a hybrid query language designed for efficient querying of large attributed graphs, integrating components of SPARQL with advanced pattern matching and reachability capabilities. By combining graph topology stored in memory with relational databases, G-SPARQL supports complex queries involving node and edge attributes. It utilizes a hybrid execution engine that optimizes performance through algebraic transformations and efficient traversal algorithms. Experimental evaluations demonstrate its effectiveness compared to existing systems like Neo4J, making it a powerful tool for social and bibliographical network analysis.
G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs
Contributions • G-SPARQL language • Pattern matching • Reachability • Hybrid execution engine • Graph topology in main memory • Graph data in relational database • Algebraic transformation • Operators • Optimizations • Experimental evaluation
1. G-SPARQL Query Language • Extends a subset of SPARQL • Based on triple pattern: (subject, predicate, object) • Sub-graph matching patterns on • Graph structure • Node attribute • Edge attribute • Reachability patterns on • Path • Shortest path
G-SPARQL Pattern Matching • Node attribute • ?Person @officeNumber “518” • Edge attribute • ?E @Role “Programmer” • Structural • ?Person worksAt Microsoft • ?Person ?E(worksAt) Microsoft
G-SPARQL Reachability • Path • Subject ??PathVar Object • Shortest path • Subject ?*PathVar Object • Path filters • Path length • All edges • All nodes
Example: G-SPARQL Query • SELECT ?L1 ?L2 • WHERE { ?X ??P ?Y. ?X @Label ?L1. ?Y @Label ?L2. ?X @Age ?Age1. ?Y @Age ?Age2. ?X Affiliated UNSW. ?Y ?E(Affiliated) Microsoft. ?X LivesIn Sydney. ?E @Title "Researcher". FILTER(?Age1 >= 40). FILTER(?Age2 >= 40). FILTERPATH( Length( ??P, <= 3) ). • }
Outline • G-SPARQL language • Pattern matching • Reachability • Hybrid execution engine • Graph topology in main memory • Graph data in relational database • Algebraic transformation • Operators • Optimizations • Experimental evaluation
2. Hybrid Execution Engine • Reachability queries • Main memory algorithms • Example: BFS and Dijkstra’s algorithm • Pattern matching queries • Relational database • Indexing • Example: B-tree • Query optimizations • Example: selectivity estimation and join ordering • Recursive queries • Not efficient: large intermediate results and multiple joins
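As a rough illustration of the main-memory side, the sketch below runs a breadth-first search over an adjacency-list topology, with an optional bound on path length like the one FILTERPATH expresses in the example query. The function name `bfs_shortest_path`, the `topology` dictionary, and the node names are assumptions made for this sketch. The slides say only that the engine's main memory algorithms are written in C++, so this Python version is not the authors' code.

```python
from collections import deque


def bfs_shortest_path(adjacency, source, target, max_length=None):
    """Return a shortest path (list of node ids) from source to target,
    or None if target is not reachable within max_length edges."""
    parents = {source: None}          # also serves as the visited set
    frontier = deque([(source, 0)])   # (node, number of edges from source)
    while frontier:
        node, depth = frontier.popleft()
        if node == target:
            # Walk the parent links back to the source to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        if max_length is not None and depth == max_length:
            continue                  # do not expand beyond the length bound
        for neighbor in adjacency.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                frontier.append((neighbor, depth + 1))
    return None


# Hypothetical topology: only node ids and edges live in memory.
topology = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": ["dave"],
    "dave": ["erin"],
}

within_three = bfs_shortest_path(topology, "alice", "dave", max_length=3)
too_far = bfs_shortest_path(topology, "alice", "erin", max_length=3)
```

BFS is enough here because every edge counts as one hop. With weighted edges, Dijkstra's algorithm would take its place.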
Graph Representation • [Figure: graph representation. Labels recoverable from the figure: established, Node Label, age, office, location, keyword, type, authorOf, affiliated, published, citedBy, country, order, month, title, know, supervise]
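One way to picture "graph data in relational database" is sketched below with SQLite. The layout (one table per attribute, one table per relationship), the table and column names, and the sample rows are all assumptions made for illustration. The slides do not spell out the schema, and the evaluated system runs on IBM DB2, not SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed layout: one table per node attribute, one table per relationship.
conn.executescript("""
    CREATE TABLE label (node_id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE age (node_id INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE affiliated (edge_id INTEGER PRIMARY KEY, src INTEGER, dst INTEGER);
    CREATE INDEX affiliated_dst ON affiliated (dst);
""")

# Made-up sample data.
conn.executemany("INSERT INTO label VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob"), (3, "UNSW")])
conn.executemany("INSERT INTO age VALUES (?, ?)", [(1, 45), (2, 30)])
conn.executemany("INSERT INTO affiliated VALUES (?, ?, ?)",
                 [(10, 1, 3), (11, 2, 3)])

# Roughly: ?X affiliated UNSW. ?X @age ?Age1. ?X @label ?L1. FILTER(?Age1 >= 40).
rows = conn.execute("""
    SELECT person.value
    FROM affiliated
    JOIN label AS org ON org.node_id = affiliated.dst
    JOIN label AS person ON person.node_id = affiliated.src
    JOIN age ON age.node_id = affiliated.src
    WHERE org.value = 'UNSW' AND age.value >= 40
""").fetchall()
```

Each triple pattern turns into one join, which is the kind of work a relational optimizer (indexes, selectivity estimation, join ordering) is built for.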
Hybrid Execution Engine: interfaces • [Figure: the engine takes a G-SPARQL query and issues SQL commands and traversal operations]
3. Intermediate Language & Compilation • [Figure: G-SPARQL query -> Step 1: front-end compilation -> algebraic query plan -> Step 2: back-end compilation -> physical execution plan (SQL commands and traversal operations)]
Intermediate Language • Objective • Generate query plan and chop it • Reachability part -> main-memory algorithms on topology • Pattern matching part -> relational database • Optimizations • Features • Independent of execution engine and graph representation • Algebraic query plan
G-SPARQL Algebra • Variant of “Tuple Algebra” • Algebra details • Data: tuples • Sets of nodes, edges, paths • Operators • Relational: select, project, join • Graph specific: node and edge attributes, adjacency • Path operators
[Figure: G-SPARQL operators, grouped into relational and non-relational]
Front-end Compilation (Step 1) • Input • G-SPARQL query • Output • Algebraic query plan • Technique • Map from triple patterns to G-SPARQL operators • Use inference rules
Front-end Compilation: Optimizations • Objective • Delay execution of traversal operations • Technique • Order triple patterns, based on restrictiveness • Heuristics • Triple pattern P1 is more restrictive than P2 • P1 has fewer path variables than P2 • P1 has fewer variables than P2 • P1’s variables have more filter statements than P2’s variables
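The restrictiveness heuristics above can be sketched as a sort key. This is a minimal sketch that assumes the three rules are applied in the order listed, as successive tie-breakers; the slides do not say how the rules are combined. The helper names and the tuple encoding of triple patterns are hypothetical.

```python
def is_variable(term):
    # G-SPARQL variables start with "?"; path variables start with "??" or "?*".
    return term.startswith("?")


def is_path_variable(term):
    return term.startswith("??") or term.startswith("?*")


def variable_name(term):
    # "?E(affiliated)" binds the edge variable ?E.
    return term.split("(")[0]


def restrictiveness_key(pattern, filter_counts):
    """Smaller key = more restrictive pattern = evaluated earlier."""
    variables = [variable_name(t) for t in pattern if is_variable(t)]
    path_vars = sum(1 for v in variables if is_path_variable(v))
    filters = sum(filter_counts.get(v, 0) for v in variables)
    # Fewer path variables, then fewer variables, then more filters.
    return (path_vars, len(variables), -filters)


def order_patterns(patterns, filter_counts):
    return sorted(patterns, key=lambda p: restrictiveness_key(p, filter_counts))


# A few triple patterns from the example query, as (subject, predicate, object).
patterns = [
    ("?X", "??P", "?Y"),
    ("?X", "@label", "?L1"),
    ("?X", "@age", "?Age1"),
    ("?X", "affiliated", "UNSW"),
]
# FILTER(?Age1 >= 40) contributes one filter on ?Age1.
filter_counts = {"?Age1": 1}

ordered = order_patterns(patterns, filter_counts)
```

With this key the path pattern `?X ??P ?Y` sorts last, which matches the stated objective of delaying traversal operations.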
Back-end Compilation (Step 2) • Input • G-SPARQL algebraic plan • Output • SQL commands • Traversal operations • Technique • Substitute G-SPARQL relational operators with SPJ (select, project, join) • Traverse • Bottom up • Stop when reaching root or reaching non-relational operator • Transform relational algebra to SQL commands • Send non-relational commands to main memory algorithms
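The bottom-up traversal can be sketched as a post-order walk that cuts the plan wherever a non-relational operator appears. The `Op` class, the `split_plan` function, and the third result list (relational operators that sit above a traversal operator) are choices made for this sketch, not structures described in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    relational: bool
    children: list = field(default_factory=list)


def split_plan(root):
    """Return (sql_fragments, traversal_ops, above_traversal).

    sql_fragments: roots of maximal subtrees made only of relational operators.
    traversal_ops: non-relational operators, destined for main-memory algorithms.
    above_traversal: relational operators that depend on a traversal result.
    """
    sql_fragments, traversal_ops, above_traversal = [], [], []

    def visit(op):
        # Post-order: children are classified before their parent (bottom up).
        child_pure = [visit(child) for child in op.children]
        pure = op.relational and all(child_pure)
        if not pure:
            # The walk stops growing a fragment here, so every fully
            # relational child subtree becomes one SQL fragment.
            for child, ok in zip(op.children, child_pure):
                if ok:
                    sql_fragments.append(child)
            (above_traversal if op.relational else traversal_ops).append(op)
        return pure

    if visit(root):
        sql_fragments.append(root)  # reached the root: whole plan is relational
    return sql_fragments, traversal_ops, above_traversal


# Hypothetical plan for part of the example query.
age_filter = Op("select age >= 40", True, [Op("node attribute age", True)])
affiliated = Op("join affiliated", True, [age_filter])
path = Op("path ??P", False, [affiliated])
root = Op("project ?L1 ?L2", True, [path])

sql_fragments, traversal_ops, above_traversal = split_plan(root)
```

Here the join subtree would be translated to a single SQL command, while the path operator is handed to the main-memory algorithms.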
Back-end Compilation: Optimizations • Optimize a fragment of query plan • Before generating SQL command • All operators are Select/Project/Join • Apply standard techniques • For example pushing selection
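Pushing selection, the standard technique named above, can be illustrated on a toy plan. This is a minimal sketch with join conditions left out and with unique column names assumed. The classes `Table`, `Join`, and `Select` and the function `push_selection` are hypothetical and are not taken from the G-SPARQL implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str
    columns: frozenset


@dataclass(frozen=True)
class Join:
    left: object
    right: object


@dataclass(frozen=True)
class Select:
    column: str      # the single column the predicate reads
    predicate: str   # predicate text, kept opaque in this sketch
    child: object


def columns_of(node):
    if isinstance(node, Table):
        return node.columns
    if isinstance(node, Join):
        return columns_of(node.left) | columns_of(node.right)
    return columns_of(node.child)


def push_selection(node):
    """Move each selection below joins, toward the input that supplies its column."""
    if isinstance(node, Table):
        return node
    if isinstance(node, Join):
        return Join(push_selection(node.left), push_selection(node.right))
    child = push_selection(node.child)
    if isinstance(child, Join):
        if node.column in columns_of(child.left):
            pushed = push_selection(Select(node.column, node.predicate, child.left))
            return Join(pushed, child.right)
        if node.column in columns_of(child.right):
            pushed = push_selection(Select(node.column, node.predicate, child.right))
            return Join(child.left, pushed)
    return Select(node.column, node.predicate, child)


age = Table("age", frozenset({"age_node", "age_value"}))
affiliated = Table("affiliated", frozenset({"src", "dst"}))

plan = Select("age_value", "age_value >= 40", Join(affiliated, age))
optimized = push_selection(plan)
```

After the rewrite the filter runs on the `age` table before the join, so fewer rows reach the join.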
Example: G-SPARQL Query • SELECT ?L1 ?L2 • WHERE { ?X ??P ?Y. ?X @label ?L1. ?Y @label ?L2. ?X @age ?Age1. ?Y @age ?Age2. ?X affiliated UNSW. ?Y ?E(affiliated) Microsoft. ?X livesIn Sydney. ?E @title "Researcher". FILTER(?Age1 >= 40). FILTER(?Age2 >= 40). • }
4. Experimental Evaluation • Objective • Validate the hybrid design • Good performance from combining the DBMS with the main memory topology • Data sets • Real ACM bibliographic network • Synthetic graphs • See technical report
Experimental Environment • Workload • Created Q1 … Q12 • Process • Compare to Neo4J (non-optimized, optimized) • Environment • Implementation • Main memory algorithms in C++ • IBM DB2 • PC Server
Conclusions • G-SPARQL Language • Expresses pattern matching and reachability queries on attributed graphs • Hybrid engine • Graph topology in main memory • Graph data in database • Compilation into algebraic plan • Operators and optimizations • Evaluation • Real and synthetic datasets • Good performance • Leveraging database engine and main memory topology