This paper discusses the enhancement of tree pattern queries for graph-structured data by integrating logical operators. It addresses the challenges of querying graph data, particularly within social and biological network analyses. By introducing Generalized Tree Pattern Queries (GTPQ), the authors present innovative solutions to common query-related issues such as satisfiability, containment, and equivalence. The incorporation of Boolean logic allows for a more expressive query syntax that can capture complex relationships, facilitating precise data retrieval from graph databases like DBLP.
VLDB 2012 ADDING LOGICAL OPERATORS TO TREE PATTERN QUERIES ON GRAPH STRUCTURED DATA Authors: Qiang Zeng, Xiaorui Jiang, and Hai Zhuge The Speaker: Hai Zhuge Key Lab of Intelligent Information Processing Chinese Academy of Sciences
Query on Graph • Example: Query on DBLP XML document • Get A’s conference papers published from 2000 to 2010 and co-authored with B • Get conference papers of either A or B published from 2000 to 2010 • Get A’s conference papers published from 2000 to 2010 that are not co-authored with B
Graph pattern matching (i.e., subgraph matching): given a data graph G and a pattern query q, identify the “subgraphs” of G that match q under isomorphic semantics. [Figure: query node u1 and data nodes v1, v2, …, v3; query edges are matched by edges or by paths]
Graph pattern matching is a building block of many graph queries, which are key to many applications • Social/biological network analysis • Program analysis • Information retrieval
GTPQ: Generalized Tree Pattern Query • Applications need more powerful semantics • Incorporating Boolean logic into patterns • Each node is associated with a distinct propositional variable • In addition to attribute predicates, each non-leaf node has a structural predicate fs, a propositional formula over the variables of its children • Applications often need only some of the nodes • Allow a portion of the query nodes to be output nodes (full-fledged evaluation) • Example: twig query with root paper (u1) and children author1 (u2), author2 (u3), and title • author1 or author2: fs(u1) = p_u2 ∨ p_u3 • author1 but not with author2: fs(u1) = p_u2 ∧ ¬p_u3 • Output the title only
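The two structural predicates on this slide can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `StructPred`, `either_author`, `first_without_second`, `paper_matches`, and the toy `papers` data are hypothetical, and each paper is reduced to its set of authors instead of a real graph.

```python
from typing import Callable, Dict, Set

# A structural predicate maps truth values of the child variables to a bool.
StructPred = Callable[[Dict[str, bool]], bool]

# fs(u1) = p_u2 OR p_u3: papers by author1 or author2.
either_author: StructPred = lambda p: p["u2"] or p["u3"]

# fs(u1) = p_u2 AND NOT p_u3: papers by author1 that author2 did not co-author.
first_without_second: StructPred = lambda p: p["u2"] and not p["u3"]


def paper_matches(authors: Set[str], fs: StructPred) -> bool:
    """Evaluate fs on one paper; u2 stands for author "A", u3 for author "B"."""
    assignment = {"u2": "A" in authors, "u3": "B" in authors}
    return fs(assignment)


# Hypothetical toy data: paper id -> set of author names.
papers = {"paper1": {"A", "B"}, "paper2": {"A"}, "paper3": {"B"}, "paper4": {"C"}}

either = sorted(t for t, a in papers.items() if paper_matches(a, either_author))
only_a = sorted(t for t, a in papers.items() if paper_matches(a, first_without_second))
```

Here each child variable is true exactly when the paper has a matching author, so swapping the predicate changes the query semantics without changing the tree shape.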
Previous Approaches • On tree structures: unsuitable for graphs • Node encoding schemes do not carry over to graphs • Some extensions are also limited to tree structures • Minimization has been studied • On graph structures • Time and space costs are high • On graph pattern matching • No disjunction or negation operations • On query results • Most approaches return the complete result • Applications often request only a portion of the query as the result
Fundamental Problems • Satisfiability • There is a data graph G on which the answer to the query, Q(G), is not empty • Containment, Equivalence • Q(G) ⊆ Q’(G), Q(G) = Q’(G) • Based on homomorphism • Minimization • Find an equivalent query with the minimal number of nodes
Contributions • Proposed a new class of tree pattern queries over graph-structured data: GTPQ • Proposed an approach to improve the efficiency of TPQ evaluation • A graph representation of intermediate results • A pruning approach for evaluating query patterns over graphs • Investigated fundamental problems • Satisfiability, containment, equivalence and minimization • Developed the evaluation algorithm GTEA
Complexity analysis • Satisfiability: a GTPQ is satisfiable if there is a data graph on which the answer to the query is non-empty • Satisfiable iff the attribute predicate and the complete structural predicate of the root are both satisfiable • NP-complete • Containment • Q1 is contained in Q2 iff there is a homomorphism from Q2 to Q1 • Co-NP-hard • Minimization • Remove all redundant query nodes • Case 1: those semantically contained by some others (containment problem) • Case 2: unsatisfiable subqueries (satisfiability problem) • Determining whether a query is minimal: NP-hard
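To make the satisfiability condition concrete, the sketch below checks a single propositional structural predicate by trying every truth assignment. It is an illustration under simplifying assumptions, not the procedure from the paper: the helper `is_satisfiable` and the variable names are hypothetical, attribute predicates are ignored, and nested subqueries (the "complete" structural predicate) are not expanded. The exponential enumeration is in line with the NP-completeness stated on the slide.

```python
from itertools import product
from typing import Callable, Dict, Sequence


def is_satisfiable(variables: Sequence[str],
                   formula: Callable[[Dict[str, bool]], bool]) -> bool:
    """Brute-force check: try all 2^k assignments of the child variables."""
    for values in product((False, True), repeat=len(variables)):
        if formula(dict(zip(variables, values))):
            return True
    return False


# p_u2 AND NOT p_u3 has a satisfying assignment (u2 true, u3 false).
ok = is_satisfiable(["u2", "u3"], lambda p: p["u2"] and not p["u3"])

# p_u2 AND NOT p_u2 is a contradiction, so such a query node can never match.
contradiction = is_satisfiable(["u2"], lambda p: p["u2"] and not p["u2"])
```

A subquery whose predicate is a contradiction corresponds to Case 2 of minimization: it can be removed because it never contributes a match.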
Existing Approaches for Conjunctive TPQ • Reachability index + structural joins • Structural joins: decompose the pattern into smaller and simpler substructures • Binary SJoins (RJoin, ICDE’08, TKDE 2011): use 2-hop to find the reachability pairs • Complete bipartite SJoins (HGJoin, VLDB’08): use an interval index to find the reachability pairs • Pipelined joins on trees + naïve on non-trees (VLDB’05, 12): use “pools” • Path/TwigStack • Path/TwigStackD
Existing Approaches for Conjunctive TPQ • The index size is typically large • In particular, #index(RJoin) = Ω(n²) • They produce large amounts of intermediate results • selectivity(query) << selectivity(substructures) • TwigStackD introduces a pre-filtering process, but it needs to scan the whole data graph • TPQ with negation and disjunction? • Decompose the pattern into a set of conjunctive TPQs and perform joins (again producing many redundant intermediate results) • Full-fledged evaluation? • Projection
GTEA: Evaluation algorithm • Applying existing algorithms to process GTPQ • Large amounts of intermediate results • Not efficient for full-fledged evaluation • They first find the results of the whole pattern and then perform projection • The decomposition-based approach has rather low performance • It has to decompose a query into several conjunctive sub-queries • Structural-join problems • Our approach • Stage 1: bottom-up and top-down pruning • Stage 2: construct the Maximal Matching Graph (MMG) • Stage 3: enumerate results via a graph traversal on the MMG
GTEA: Evaluation algorithm • Two-round pruning • Bottom-up: downward structural constraints • Top-down: upward structural constraints • Basic operation: use 3-hop to determine the reachability between two sets of nodes • Key idea: exploit the shared reachability using a substructure • Other reachability index structures can also be used
GTEA: Evaluation algorithm • Two-round pruning (continued) • Process a set of edges holistically
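The bottom-up round can be sketched as follows. This is a simplified illustration and not GTEA itself: it answers reachability with a plain breadth-first search instead of a 3-hop index, it covers only the downward constraints, and the names `reachable`, `bottom_up`, and the toy graph are hypothetical.

```python
from collections import deque


def reachable(graph, src):
    """Return every node reachable from src by a path of length >= 1."""
    seen, queue = set(), deque(graph.get(src, ()))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph.get(node, ()))
    return seen


def bottom_up(graph, labels, query, root):
    """Compute, for each query node, the data nodes that satisfy its downward constraints.

    query maps a query node to (label, children, fs), where fs takes a dict
    {child: bool} and returns whether the structural predicate holds.
    """
    matches = {}

    def visit(u):
        label, children, fs = query[u]
        for c in children:              # children first (post-order)
            visit(c)
        matches[u] = set()
        for v, lab in labels.items():
            if lab != label:            # attribute predicate: label equality only
                continue
            desc = reachable(graph, v)
            # p_c is true iff v reaches some surviving match of child c
            assignment = {c: bool(desc & matches[c]) for c in children}
            if fs(assignment):
                matches[u].add(v)

    visit(root)
    return matches


# Toy data graph: three papers pointing at two author nodes.
graph = {"p1": ["A", "B"], "p2": ["A"], "p3": ["B"]}
labels = {"p1": "paper", "p2": "paper", "p3": "paper", "A": "author1", "B": "author2"}
query = {
    "u1": ("paper", ["u2", "u3"], lambda p: p["u2"] and not p["u3"]),
    "u2": ("author1", [], lambda p: True),
    "u3": ("author2", [], lambda p: True),
}
result = bottom_up(graph, labels, query, "u1")
```

In this toy run only `p2` survives for `u1`, because `p1` also reaches an `author2` node and `p3` reaches no `author1` node.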
GTEA: Evaluation algorithm • Maximal Matching Graph (MMG) • Represents intermediate results • Compared with tuple form • Smaller space complexity • Easier to derive final results [Figure: matches of query nodes u1, u3 to data nodes v1, w1, w3 in tuple form and as an MMG]
GTEA: Evaluation algorithm • Maximal Matching Graph (MMG), continued • Similar ideas are also used in several other studies for representing the final results (able to reduce the query complexity)
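The space argument for a graph representation over tuple form can be illustrated with a small sketch. The dictionary layout, the function names `edge_count` and `enumerate_tuples`, and the data are assumptions made for illustration; the actual MMG structure in the paper may differ.

```python
from itertools import product

# Hypothetical graph-style result: one root match linked to the matches
# of each child query node (u2 and u3).
mmg = {"v1": {"u2": ["a1", "a2", "a3"], "u3": ["b1", "b2", "b3", "b4"]}}


def edge_count(mmg):
    """Number of match edges stored in the graph representation."""
    return sum(len(ms) for links in mmg.values() for ms in links.values())


def enumerate_tuples(mmg):
    """Expand the graph representation into full result tuples."""
    out = []
    for root, links in mmg.items():
        children = sorted(links)
        for combo in product(*(links[c] for c in children)):
            out.append((root, *combo))
    return out


stored_edges = edge_count(mmg)            # 3 + 4 edges
result_tuples = enumerate_tuples(mmg)     # 3 * 4 tuples
```

The graph form stores a number of edges that grows with the sum of the child match counts, while the tuple form grows with their product, which is why enumeration is deferred to the last stage.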
GTEA: Evaluation algorithm • Maximal Matching Graph (MMG), continued • Optimized for non-output nodes [Figure: a GTPQ, its prime subtree (2nd pruning), and its shrunk prime subtree (MMG), with output nodes marked]
GTEA: Experimental study • Datasets • arXiv data: 9562 nodes and 28120 edges • XMark data: 0.64M to 5.17M nodes, 0.77M to 6.20M edges • Algorithms • For tree-structured data: TwigStack, Twig2Stack • For graph-structured data: TwigStackD, HGJoin, GTEA • Experiments • The efficiency and scalability for processing conjunctive queries • The expected I/O costs • The impact of adding negation and disjunction on performance • The effectiveness of the pruning process
GTEA: Experimental study • Better even for conjunctive queries • MMG approach is effective
GTEA: Experimental study • The size of intermediate results is small
GTEA: Experimental study • Optimization for non-output results • The performance gap is significantly widened especially when the query has negation operations
Summary • Explore a new tree pattern matching query with Boolean logic on graph-structured data • Structural predicate, output nodes • Analyze computational complexities of four problems for static global optimization • Satisfiability, containment and equivalence, minimization • The first study on these problems • Propose an algorithm GTEA • Pruning approach using 3-hop • Optimization for non-output nodes • Maximal matching graph
Future Work • Query over Semantic Link Network • Different from RDF • Real-world applications • New conditions and requirements • Relational rules: parentOf, fatherOf ∨ motherOf; childOf, sonOf ∨ daughterOf [Figure: a simple Semantic Link Network] H. Zhuge, The Knowledge Grid, World Scientific Publishing Co., Singapore, 2012. 2nd Edition
Incorporating the Semantic Space H.Zhuge, The Knowledge Grid, World Scientific Publishing Co., Singapore, 2012. 2nd Edition
Problems • System • Interface • Application • Automatically generating semantic link networks • Semantics • Understand query and patterns [Figure: query graph and data graph, annotated “Irrelevant to size” and “Semantics?”]
References on Semantic Link Network (concerning AI and databases) • H. Zhuge, The Knowledge Grid, World Scientific Publishing Co., Singapore, 2012. 2nd Edition. Chapter 2: The Semantic Link Network • H. Zhuge, The Web Resource Space Model, Springer, 2008. • H. Zhuge, Semantic linking through spaces for cyber-physical-socio intelligence: A methodology, Artificial Intelligence, 175(2011)988-1019. • H. Zhuge, Communities and Emerging Semantics in Semantic Link Network: Discovery and Learning, IEEE Transactions on Knowledge and Data Engineering, vol.21, no.6, 2009, pp. 785-799. • H. Zhuge, Interactive Semantics, Artificial Intelligence, 174(2010)190-204.