VLDB 2012 COMMENTS ON ‘STACK-BASED ALGORITHMS FOR PATTERN MATCHING ON DAGS’

Authors: Qiang Zeng and Hai Zhuge
Speaker: Hai Zhuge
Key Lab of Intelligent Information Processing, Chinese Academy of Sciences

Query on Graph: Pattern Matching Query
Graph, Tree:
• XML, RDF
• Ontology
• Social Network
Basic Approaches: Pre-computing
• Structure (e.g., transitive closure)
• Index

Graph pattern matching (i.e., subgraph matching)
Given a data graph G and a pattern query q, identify ‘subgraphs’ that match q in isomorphic semantics.
[Figure: DBLP example with nodes u1, v1, v2, …, v3; connections labelled "edge" and "path"]
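The definition above can be made concrete with a brute-force reference matcher. This is only a sketch under assumptions that are not in the slides: every pattern edge is read as a directed path in the data graph (the ancestor-descendant reading used by the stack-based algorithms), injectivity is not enforced, and the function names and toy data are hypothetical. It is far too slow for real data, but it is handy as ground truth when checking whether a faster algorithm finds all solutions.

```python
from itertools import product


def descendants(graph, start):
    """Return all nodes reachable from `start` via one or more edges."""
    seen, todo = set(), list(graph.get(start, ()))
    while todo:
        node = todo.pop()
        if node not in seen:
            seen.add(node)
            todo.extend(graph.get(node, ()))
    return seen


def match_pattern(graph, labels, pattern_labels, pattern_edges):
    """Enumerate every binding of pattern nodes to data nodes by exhaustive search.

    graph: dict data node -> list of successors (hypothetical adjacency format)
    labels: dict data node -> label
    pattern_labels: dict pattern node -> required label
    pattern_edges: list of (parent, child) pattern nodes; each edge must be
        matched by a directed path in the data graph (assumed semantics)
    """
    reach = {v: descendants(graph, v) for v in labels}
    names = list(pattern_labels)
    candidates = [[v for v in labels if labels[v] == pattern_labels[q]] for q in names]
    matches = []
    for combo in product(*candidates):
        binding = dict(zip(names, combo))
        if all(binding[c] in reach[binding[p]] for p, c in pattern_edges):
            matches.append(binding)
    return matches


# Hypothetical toy data: a1 -> c1 -> e1 and a2 -> e1.
GRAPH = {"a1": ["c1"], "c1": ["e1"], "a2": ["e1"]}
LABELS = {"a1": "A", "a2": "A", "c1": "C", "e1": "E"}
PATH_QUERY = ({"A": "A", "C": "C", "E": "E"}, [("A", "C"), ("C", "E")])
```

Calling match_pattern(GRAPH, LABELS, *PATH_QUERY) binds A to a1 only, because a2 reaches e1 but never reaches a C node.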

Graph pattern matching is a building block of many graph queries in real-world applications
• Social/biological network analysis
• Program analysis
• Information retrieval
Example queries on a DBLP graph:
• Find the authors of the paper entitled “A History and Evaluation of System R”
• Find whom Selinger has collaborated with
• Find 2011 Grand Slam Winners in the area of databases

Approaches
• Reachability index + Structural joins
• Structural joins: decompose the pattern into smaller, simpler substructures
• Binary SJoins (RJoin, ICDE’08, TKDE 2011): use 2-hop to find the reachability pairs
[Figure: RJoin applied to a pattern query]
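How a 2-hop labeling answers reachability, and how that feeds a binary structural join, can be sketched as follows. In a 2-hop labeling each node u carries two sets, L_out(u) and L_in(u), and u reaches v when the two sets share a hop center. The labels below are hand-written for a hypothetical four-node graph (1 -> 2, 2 -> 3, 2 -> 4) with node 2 as the only center; constructing a good labeling is a separate problem not shown here, and the function names are illustrative rather than taken from RJoin.

```python
def reaches(l_out, l_in, u, v):
    """2-hop test: u reaches v if some hop center lies in both L_out(u) and L_in(v)."""
    return u != v and not l_out[u].isdisjoint(l_in[v])


def reachability_pairs(l_out, l_in, sources, targets):
    """Binary structural join for one pattern edge: all (u, v) with u reaching v."""
    return [(u, v) for u in sources for v in targets if reaches(l_out, l_in, u, v)]


# Hand-written labels for the hypothetical graph 1 -> 2, 2 -> 3, 2 -> 4.
# Node 2 is the single hop center: it is in L_out of everything that reaches it
# (and of itself) and in L_in of everything it reaches (and of itself).
L_OUT = {1: {2}, 2: {2}, 3: set(), 4: set()}
L_IN = {1: set(), 2: {2}, 3: {2}, 4: {2}}
```

For example, reachability_pairs(L_OUT, L_IN, [1, 2], [3, 4]) pairs both sources with both targets, while reaches(L_OUT, L_IN, 3, 4) is false because L_out(3) is empty.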

Approaches
• Reachability index + Structural joins
• Complete Bipartite SJoins (HGJoin, VLDB’08): use an interval index to find the reachability pairs
[Figure: HGJoin applied to a pattern query]
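An interval index for reachability on a DAG can be sketched as follows. The construction is the classical idea of numbering a spanning forest in post-order and letting each node inherit the intervals of its successors; whether HGJoin uses exactly this variant is an assumption, and the names are hypothetical. The sketch does not merge or prune subsumed intervals, which a real index would do.

```python
def build_interval_index(graph, nodes):
    """Assign post-order numbers and interval lists to the nodes of a DAG.

    graph: dict node -> list of successors (must be acyclic)
    Returns (post, intervals); u reaches v iff post[v] lies in an interval of u.
    """
    post, intervals = {}, {}
    counter = 0

    def visit(u):
        nonlocal counter
        low = counter + 1              # smallest number handed out inside u's DFS subtree
        inherited = []
        for v in graph.get(u, ()):
            if v not in post:          # tree edge: v is first reached from u
                visit(v)
            inherited.extend(intervals[v])   # tree or non-tree edge: inherit v's intervals
        counter += 1
        post[u] = counter
        intervals[u] = inherited + [(low, counter)]

    for u in nodes:
        if u not in post:
            visit(u)
    return post, intervals


def reaches(post, intervals, u, v):
    """True if v is a proper descendant of u in the DAG."""
    return u != v and any(lo <= post[v] <= hi for lo, hi in intervals[u])


# Hypothetical diamond-shaped DAG: a -> b -> d and a -> c -> d.
DAG = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
POST, INTERVALS = build_interval_index(DAG, ["a", "b", "c", "d"])
```

In the diamond, c reaches d only through the non-tree edge c -> d, which is why c carries d's interval in addition to its own.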

Stack-based approach [VLDB05]
• Reachability index + Holistic joins: PathStackD, TwigStackD
• Holistic joins: an extensively used technique in processing tree-structured data [PathStack]
• Pipelined joins using linked stacks: time- and space-efficient
[Figure: PathStack example with query nodes A, C, E, stacks SA, SC, SE, and data nodes a1, a2, c1 to c4, e1, e2]

Stack-based approach [VLDB05]: PathStack
• Nodes are considered for insertion into their respective stacks in pre-order
• Nodes in a stack are maintained in root-to-leaf order
• Solutions shown in the example: (e2, c4, a2), (e2, c2, a2), (e1, c2, a2)
[Figure: PathStack example with query nodes A, C, E, stacks SA, SC, SE, and data nodes a1, a2, c1 to c4, e1, e2]
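The linked-stack idea behind PathStack on tree data can be sketched as follows. This is a simplified illustration, not the published pseudocode: it assumes every data node carries a (start, end) interval from a pre-order numbering of the tree, one sorted stream per query node, and a path query given root first. The function name and the toy tree are hypothetical.

```python
def path_stack(streams):
    """Evaluate a path query A//.../Z over tree data with linked stacks.

    streams: one list per query node, root first; each element is
        (start, end, name) and each list is sorted by start (pre-order).
    Returns solutions as tuples of names in root-to-leaf order.
    """
    n = len(streams)
    cursors = [0] * n
    stacks = [[] for _ in range(n)]   # entries: (element, index of parent-stack top at push time)
    results = []

    def expand(level, upto):
        """Yield partial solutions ending at stack entries 0..upto of `level`."""
        for idx in range(upto + 1):
            (_, _, name), parent_top = stacks[level][idx]
            if level == 0:
                yield (name,)
            else:
                for prefix in expand(level - 1, parent_top):
                    yield prefix + (name,)

    while True:
        active = [i for i in range(n) if cursors[i] < len(streams[i])]
        if not active:
            break
        # Next node in global pre-order: the stream head with the smallest start.
        q = min(active, key=lambda i: streams[i][cursors[i]][0])
        elem = streams[q][cursors[q]]
        cursors[q] += 1
        # Pop entries that end before the new node starts: they cannot be its ancestors.
        for stack in stacks:
            while stack and stack[-1][0][1] < elem[0]:
                stack.pop()
        if q > 0 and not stacks[q - 1]:
            continue                   # no candidate ancestor in the parent stack
        parent_top = len(stacks[q - 1]) - 1 if q > 0 else -1
        stacks[q].append((elem, parent_top))
        if q == n - 1:
            # The leaf stack holds only the node just pushed, so this emits its solutions.
            results.extend(expand(q, len(stacks[q]) - 1))
            stacks[q].pop()
    return results


# Hypothetical tree: a1[a2[c1[e1, e2]], c2[e3]] with pre-order (start, end) intervals.
STREAMS = [
    [(1, 14, "a1"), (2, 9, "a2")],                    # label A
    [(3, 8, "c1"), (10, 13, "c2")],                   # label C
    [(4, 5, "e1"), (6, 7, "e2"), (11, 12, "e3")],     # label E
]
```

Because each stack keeps its entries in root-to-leaf order and each entry remembers the parent-stack top at push time, all solutions for a leaf are enumerated by following links, without re-scanning the data.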

PathStackD/TwigStackD
• Find partial results from the tree cover (spanning tree(s)) using PathStack/TwigStack
• Join them with partial results outside the tree cover, using “pools”
[Figure: a data graph split into parts A and B; PathStack/TwigStack handles the tree cover, PathStackD/TwigStackD the whole graph]

Adding pools: PathStackD
• Stacks hold the (partial) solutions on the tree cover (TC partial solutions)
• Pools hold the (partial) solutions outside the tree cover (NTC partial solutions)
• A node is added to a pool if there exists a descendant in the child pool; nodes are considered in reverse topological order
• Stack (partial) solutions with a leaf matching node are replicated to the pools
• We must make sure that all NTC partial solutions can be constructed in the pools
[Figure: stacks and pools for an example with nodes 4 to 9]

Pool vs. Stack
Pool (reverse topological order):
• Nodes are considered for insertion into the pools in reverse topological order
• Nodes in a pool are maintained in root-to-leaf order
• Each pool node has exactly one link to an ancestor in the parent pool
Stack (pre-order):
• Nodes are considered for insertion into their respective stacks in pre-order
• Nodes in a stack are maintained in root-to-leaf order
• Each stack node has exactly one link to an ancestor in the parent stack
Given a set of nodes, pre-order = reverse topological order if no two of the nodes are connected on the tree cover (TC).
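The relation between the two orders can be checked on a small example. The sketch below assumes, beyond what the slide states, that the tree cover is the spanning forest produced by a depth-first search and that the reverse topological order is the DFS finishing order (which is always a valid reverse topological order of a DAG). The function names and the graph are hypothetical.

```python
def dfs_orders(graph, roots):
    """Return pre-order numbers, DFS finishing numbers and tree-cover parents."""
    pre, post, parent = {}, {}, {}

    def visit(u):
        pre[u] = len(pre)
        for v in graph.get(u, ()):
            if v not in pre:           # first visit: (u, v) becomes a tree-cover edge
                parent[v] = u
                visit(v)
        post[u] = len(post)            # descendants always finish before u

    for r in roots:
        if r not in pre:
            parent[r] = None
            visit(r)
    return pre, post, parent


def tc_connected(parent, u, v):
    """True if one node is an ancestor of the other on the tree cover."""
    def is_ancestor(a, d):
        while d is not None:
            d = parent[d]
            if d == a:
                return True
        return False
    return is_ancestor(u, v) or is_ancestor(v, u)


# Hypothetical DAG; b -> c is a non-tree edge because c is first reached via a.
DAG = {"r": ["a", "b"], "a": ["c"], "b": ["c", "d"]}
PRE, RTOP, PARENT = dfs_orders(DAG, ["r"])
```

Here c and b are not connected on the tree cover, and both orders place c before b; a and c are connected on the tree cover, and the two orders disagree on them.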

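Four types of solutions
• Type-1: √
• Type-2: √
• Type-3: √
• Type-4: ? (identified only in pretty ‘lucky’ cases)
Note on the figure: the TC component precedes the NTC component in rtop order.
[Figure: the four solution types, distinguishing nodes connected on TC from nodes connected via NTC]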

Type-4 solutions
[Figure: a path query, a data graph with nodes 1 to 5, and the corresponding stacks and pools]

Type-4 solutions: case 1
When the stack solution (1, 2, 3) is replicated to the pools, (1, 2) joins with (5), forming the lucky Type-4 solution.
[Figure: the path query, the data graph with nodes 1 to 5, and the stacks and pools for this case]

Type-4 solutions: case 2
(1, 2, 5) cannot be identified.
[Figure: the path query, the data graph with nodes 1 to 5, and the stacks and pools for this case]

Type-4 solutions can be found if and only if each Type-1 part partially constitutes one TC ancestor extension of a leaf matching node.
How can the problem be resolved?
[Figure: the path query, the data graph with nodes 1 to 5, and the stacks and pools]

TwigStackD: a little more involved
• A node is put into a stack if it has a descendant extension and there exists an ancestor in the parent stack.
• It is very expensive to check a descendant extension (essentially another pattern matching problem).
• A node is pushed into a pool if there exists a descendant in the child pool.
TwigStackD cannot find certain Type-2 and Type-4 solutions (not rare).
[Figure: a twig query with two Type-4 examples, Type-4(1) and Type-4(2)]

Pre-filtering process
• For tree-structured data, it is easy to filter redundant nodes
• This does not work on graphs
[Figure: example query over labels A, C, D, E, F and data nodes a1, a2, c1 to c4, d1, e1, e2, f1, f2]

Pre-filtering process
• The original work proposed a pre-filtering process using bit-vectors
• With the pre-filtering process, the time complexity of the modified TwigStackD can be reduced to that of PathStackD
• However, the process always needs to scan the whole data
[Figure: query nodes A, B, C, D and data nodes a1, b1, c1, c2, d1 annotated with 4-bit vectors such as 1111, 0110, 0011, 0010]
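One way such a bit-vector filter could work is sketched below. This is an assumption-laden illustration, not the procedure from the original paper: each data node gets one bit per query node, the bit is set when the node's label matches and every child of that query node is already satisfied by some proper descendant, and nodes are scanned once in reverse topological order. All names and the toy data are hypothetical.

```python
def prefilter(graph, labels, rtop_order, query_labels, query_children):
    """Compute per-node bit-vectors for a twig query in one scan of the data.

    graph: dict node -> list of successors (a DAG)
    rtop_order: all nodes, every node listed after its descendants
    query_labels: list of labels, index = query node id
    query_children: dict query node id -> list of child query node ids
    Returns (own, below): own[v] has bit q set if v itself satisfies query
    node q; below[v] has bit q set if some proper descendant of v does.
    """
    own, below = {}, {}
    for v in rtop_order:
        b = 0
        for w in graph.get(v, ()):
            b |= own[w] | below[w]          # successors were processed earlier
        o = 0
        for q, label in enumerate(query_labels):
            need = sum(1 << c for c in query_children.get(q, ()))
            if labels[v] == label and (b & need) == need:
                o |= 1 << q
        own[v], below[v] = o, b
    return own, below


# Hypothetical twig query: A(0) with children B(1) and D(3); B and D each have child C(2).
QUERY_LABELS = ["A", "B", "C", "D"]
QUERY_CHILDREN = {0: [1, 3], 1: [2], 3: [2]}

# Hypothetical data: a1 -> b1 -> c1, a1 -> d1 -> c2, and b2 with no C below it.
DATA = {"a1": ["b1", "d1", "b2"], "b1": ["c1"], "d1": ["c2"]}
DATA_LABELS = {"a1": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C", "d1": "D"}
ORDER = ["c1", "c2", "b1", "d1", "b2", "a1"]

OWN, BELOW = prefilter(DATA, DATA_LABELS, ORDER, QUERY_LABELS, QUERY_CHILDREN)
```

In this toy run b2 ends up with an empty vector and can be dropped before the join, while a1 has all four bits set once its own bit is combined with those of its descendants. The single pass over ORDER is the full scan of the data that the slide points out as the cost of the approach.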

Experimental Study
Datasets
• arXiv data: 9562 nodes and 28120 edges
• XMark data: 0.64M to 5.17M nodes, 0.77M to 6.20M edges
Algorithms
• PathStackD and TwigStackD from [VLDB05]
• The modified algorithms from [VLDB2012]: PathStackD-M, TwigStackD-NF, TwigStackD-F
Experiments
• Reveal the incorrectness of the algorithms in [VLDB05]
• Evaluate the efficiency of the modified versions

Experimental Study
• The original PathStackD identifies none of the solutions to the Type-4 path query
• The original TwigStackD identifies none of the Type-2 and Type-4 solutions

Experimental Study The pre-filtering process can significantly improve the performance

Summary
The original work in [VLDB05]:
• Generalizes the classical holistic twig join algorithms and proposes three algorithms to evaluate path, twig and DAG pattern queries on DAGs
Our work:
• Analyzes the PathStackD and TwigStackD algorithms in [VLDB05]
• Points out that the algorithms cannot find one or two types of query results that are common in practice
• Re-designs the algorithms to guarantee soundness and completeness
• Discusses the issues regarding optimization, including some discrepancies in [VLDB05]
• Provides theory for the new algorithms
• Demonstrates the effectiveness of the new algorithms

Main References
• L. Chen, A. Gupta, and M. E. Kurul, Stack-based algorithms for pattern matching on DAGs, VLDB 2005.
• L. Chen, A. Gupta, and M. E. Kurul, Efficient algorithms for pattern matching on directed acyclic graphs, ICDE 2005.
Papers that used the algorithms need to be checked:
• J. Cheng, J. X. Yu, and P. S. Yu, Graph pattern matching: A join/semijoin approach, IEEE TKDE, 23(7) (2011) 1006-1021.
• H. Wang, J. Li, J. Luo, and H. Gao, Hash-based subgraph query processing method for graph-structured XML documents, VLDB 2008.
• H. Wang, J. Li, W. Wang, and X. Lin, Coding-based join algorithms for structural queries on graph-structured XML documents, WWW, 11(4) (2008) 485-510.
• H. Wu, T. W. Ling, G. Dobbie, Z. Bao, and L. Xu, Reducing graph matching to tree matching for XML queries with ID references, International Conference on Database and Expert Systems Applications, 2010.
• Q. Zeng and H. Zhuge, Comments on ‘Stack-based algorithms for pattern matching on DAGs’, VLDB 2012.

Acknowledgements
• We thank the reviewers and proceedings editors for their helpful comments
• The authors of the original paper read our paper and provided comments during the review process
• Relevant research was supported by National Science Foundation of China

Thanks!
