Reachability Query Over A Large Graph

北京大学计算机科学技术研究所 Institute of Computer Science and Technology of Peking University. Reachability Query Over A Large Graph. Instructor: Lei Zou. Reachability Query.

Presentation Transcript


  1. 北京大学计算机科学技术研究所 Institute of Computer Science and Technology of Peking University Reachability Query Over A Large Graph Instructor: Lei Zou

  2. Reachability Query • Problem Definition: Given a single large directed graph G and two vertices u1 and u2, a reachability query verifies whether there exists a directed path from u1 to u2. • ?Query(1,11) • Yes • ?Query(3,9) • No [Figure: example directed graph with vertices 1 to 15]
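
As a baseline for the definition above, a reachability query can be answered without any index by a plain breadth-first search. The sketch below assumes an adjacency-list dictionary; the function name and the toy graph are hypothetical and are not the graph drawn on the slide.

```python
from collections import deque

def reachable(graph, source, target):
    """Return True if a directed path leads from source to target."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        # Enqueue every successor that has not been visited yet.
        for succ in graph.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return False

# Hypothetical toy graph (adjacency lists), for illustration only.
toy_graph = {1: [3, 4], 3: [6], 4: [7], 6: [11], 2: [5]}
```

Each query costs O(|V| + |E|) in the worst case, which is the cost the indexing schemes in the rest of the deck try to avoid.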

  3. Applications • XML Data “//”: ancestor-descendant search • Biological Networks “Finding all genes whose expressions are directly or indirectly influenced by a given molecule” • Paper Citation Network “Given two papers A and B, we want to check whether A is directly or indirectly cited by B”.

  4. Naïve Methods 10,000 nodes and 20,000 edges

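  5. Existing Solutions • Chain Cover [Figure: example graph with vertices 1 to 8]

  6. Existing Solutions • Tree Cover [Figure: example graph with vertices 1 to 8]

  7. Existing Solutions • 2-hop Labeling [Figure: example graph with vertices 1 to 8 and a "Hops" marker]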

  8. Preliminary • A directed acyclic graph (commonly abbreviated to DAG) is a directed graph with no directed cycles. • Any directed graph G can be transformed into a DAG G’ by condensing each “strongly connected component” into one node. • G and G’ have the same “reachability” information.

  9. Preliminary • In graph theory, a topological sort or topological ordering of a directed acyclic graph (DAG) is a linear ordering of its nodes in which each node comes before all nodes to which it has outbound edges. • Every DAG has one or more topological sorts.
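
One common way to compute such an ordering is Kahn's algorithm, sketched below. This is a generic illustration and not necessarily the method used in the lecture; the function name and the example edges are hypothetical.

```python
from collections import deque

def topological_order(graph):
    """Return one topological order of a DAG given as adjacency lists."""
    indegree = {v: 0 for v in graph}
    for succs in graph.values():
        for w in succs:
            indegree[w] = indegree.get(w, 0) + 1
    # Nodes without incoming edges can be emitted immediately.
    ready = deque(v for v, d in indegree.items() if d == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in graph.get(v, ()):
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    if len(order) != len(indegree):
        raise ValueError("graph contains a directed cycle")
    return order

# Hypothetical DAG, for illustration only.
dag = {"A": ["B", "C"], "G": ["C"], "B": ["D", "E"], "D": ["F"], "C": ["E"]}
order = topological_order(dag)
```

A DAG usually has several valid orders; the sketch returns whichever one the queue discipline produces.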

  10. Topological Ordering [Figure: DAG over nodes A, B, C, D, E, F, G] Topological order: A G B D C E F

  11. Outline • Chain Cover Chain@TODS1990 • Tree Cover GRIPP @ SIGMOD 2007, Dual Labeling@ICDE2006 • Path-Tree Cover pathtree@SIGMOD 2008 • 2Hop-Labeling HOPI@EDBT2004, FAST2Hop@EDBT2008

  13. Chain@TODS1990 • Chains: C0: A → B → D → F; C1: G → C → E • Labels (position, chain): A (0, C0), B (1, C0), D (2, C0), F (3, C0), G (0, C1), C (1, C1), E (2, C1)

  14. Chain@TODS1990 • Can A reach E? [Figure: the chain-labeled DAG with labels (0, C0) to (3, C0) and (0, C1) to (2, C1)]

  15. Chain Decomposition [Figure: DAG over nodes A, B, C, D, E, F, G] Topological order: A G B D C E F

  16. Chain Decomposition • A Simple Method: 1. Perform a topological sort over the DAG. 2. Find v, the smallest vertex (in the ascending order of the topological sort) in the graph, and add it to the path P. 3. Find v’, the smallest vertex in the graph such that there is an edge from v to v’, and add v’ to the path P. 4. Repeat Step 3 from v’ until no such v’ can be found. 5. Remove all vertices in P from the DAG. 6. Repeat from Step 2 until no vertex is left in the DAG.
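
The simple method above can be sketched as follows. The function name and the example edges are hypothetical: the slides only give the topological order A G B D C E F and the resulting chains, so the edge set was chosen to be consistent with both. Vertices are removed as soon as they join a path, which is equivalent to removing them afterwards because a path in a DAG never revisits a vertex.

```python
def chain_decomposition(graph, topo_order):
    """Greedily split a DAG into vertex-disjoint paths (chains)."""
    rank = {v: i for i, v in enumerate(topo_order)}
    remaining = set(topo_order)
    chains = []
    while remaining:
        # Step 2: start from the smallest remaining vertex.
        v = min(remaining, key=rank.__getitem__)
        remaining.discard(v)
        chain = [v]
        while True:
            # Step 3: smallest remaining successor of the current vertex.
            candidates = [w for w in graph.get(v, ()) if w in remaining]
            if not candidates:
                break
            v = min(candidates, key=rank.__getitem__)
            remaining.discard(v)
            chain.append(v)
        chains.append(chain)
    return chains

# Hypothetical edges, consistent with the order shown on the slides.
dag = {"A": ["B", "C"], "G": ["C"], "B": ["D", "E"], "D": ["F"], "C": ["E"]}
chains = chain_decomposition(dag, ["A", "G", "B", "D", "C", "E", "F"])
```

Picking the start vertex with `min` keeps the sketch short but makes it quadratic; walking the topological order with a pointer would avoid that.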

  17. Outline • Chain Cover Chain@TODS1990 • Tree Cover GRIPP @ SIGMOD 2007, Dual Labeling@ICDE2006 • Path-Tree Cover pathtree@SIGMOD 2008 • 2Hop-Labeling HOPI@EDBT2004, FAST2Hop@EDBT2008

  18. GRIPP @ SIGMOD 2007 • Pre- and Postorder Numbering For Trees: node labeling during depth-first traversal of G. • w reachable from v iff vpre < wpre < vpost • E reachable from A? Epre = 3, Apre = 1, Apost = 16, and 1 < 3 < 16 ⇒ true [Figure: tree with labels R [0, 17], A [1, 16], B [2, 7], C [8, 9], D [10, 15], E [3, 4], F [5, 6], G [11, 12], H [13, 14]]
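
A minimal sketch of this numbering, assuming a single counter that advances on entering and on leaving each node. The tree shape below is an assumption read off the nesting of the intervals on the slide, and the function names are hypothetical.

```python
def label_tree(tree, root):
    """Assign (pre, post) numbers with one shared depth-first counter."""
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        pre = counter
        counter += 1
        for child in tree.get(node, ()):
            visit(child)
        labels[node] = (pre, counter)
        counter += 1

    visit(root)
    return labels

def reaches(labels, v, w):
    """w is reachable from v iff v_pre < w_pre < v_post."""
    v_pre, v_post = labels[v]
    return v_pre < labels[w][0] < v_post

# Tree shape inferred from the interval nesting on the slide.
tree = {"R": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"], "D": ["G", "H"]}
labels = label_tree(tree, "R")
```

With these labels the slide's example holds: `labels["A"]` is `(1, 16)`, `labels["E"]` is `(3, 4)`, and `reaches(labels, "A", "E")` is true.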

  19. Non-Tree Edge • General Idea In GRIPP • Input: Given a graph, G = (V, E) • Output: Create the GRIPP index table, IND(G), one instance for every node v in G – Node identifier – Preorder value – Postorder value – Instance type [Figure: graph over nodes R, A, B, C, D, E, F, G, H with a non-tree edge marked]

  20. Index creation • Depth-first traversal of G • Tree instances: R [0, 21], A [1, 20], B [2, 7], E [3, 4], F [5, 6], C [8, 9], D [10, 19], G [11, 14], H [15, 18] • Non-tree instances: [12, 13], [16, 17]

  21. GRIPP Index Table, IND(G) • Depth-first traversal of G • Tree instances: R [0, 21], A [1, 20], B [2, 7], E [3, 4], F [5, 6], C [8, 9], D [10, 19], G [11, 14], H [15, 18] • Non-tree instances: [12, 13], [16, 17]
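
The index creation can be sketched as a depth-first traversal that emits a tree instance the first time a node is reached and a leaf non-tree instance every time an edge leads to an already visited node. The row layout `(node, pre, post, type)`, the function name, and the two non-tree edges G → B and H → A are assumptions; the edges were chosen because they reproduce the intervals listed on the slides.

```python
def build_gripp_index(graph, root):
    """Return index rows (node, pre, post, instance_type), sorted by pre."""
    rows = []
    visited = set()
    counter = 0

    def visit(node):
        nonlocal counter
        visited.add(node)
        pre = counter
        counter += 1
        for child in graph.get(node, ()):
            if child in visited:
                # Edge to a known node: record a leaf non-tree instance.
                rows.append((child, counter, counter + 1, "non-tree"))
                counter += 2
            else:
                visit(child)
        rows.append((node, pre, counter, "tree"))
        counter += 1

    visit(root)
    return sorted(rows, key=lambda row: row[1])

# Assumed example graph: the tree of slide 18 plus edges G -> B and H -> A.
graph = {"R": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"],
         "D": ["G", "H"], "G": ["B"], "H": ["A"]}
index = build_gripp_index(graph, "R")
```

The recursion is fine for a toy graph; a large graph would need an explicit stack.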

  22. Order Tree, O(G) • Tree instances: inner or leaf node • Non-tree instances: always leaf node [Figure: instances of O(G) plotted by pre (x-axis) and post (y-axis) values, next to the interval-labeled graph]

  23. Reachable Instance Set (RIS) [Figure: instances of O(G) plotted by pre (x-axis) and post (y-axis) values, next to the interval-labeled graph]

  24. Query Processing • Query: A → H? [Figure: pre/post plot of O(G); annotations: Yes, Yes]

  25. Query Processing • Query: D → C? • We must extend the search [Figure: pre/post plot of O(G); annotations: Yes, No]

  26. Query Processing • In the worst case, we need to perform (|E| - |V|) queries [Figure: pre/post plot of O(G); annotation: Yes]

  27. Algorithm • Step 1: Retrieve RIS(D). If C ∈ RIS(D), stop the search and return TRUE; else proceed to Step 2. • Step 2: Check every i ∈ RIS(D). If i is a tree instance [G and H], continue; else [A and B], i has no successors in O(G), but possibly in G: proceed to Step 3. • Step 3: Obtain the tree instance of node i and proceed to Step 1. • Repeat steps 1…3 until an instance of node C is found or no more hop nodes are available. [Figure: pre/post plot of O(G)]
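
The three steps can be sketched over a hard-coded index table. The table rows follow the intervals shown on the earlier slides; naming the non-tree instances [12, 13] and [16, 17] as instances of B and A is an assumption based on Step 2 above. Only the "simple" pruning rule (never expand a hop node twice) is included, and the helper names are hypothetical.

```python
# Rows: (node, pre, post, instance type), as read off the slides.
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non-tree"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non-tree"),
]

def ris(node):
    """Reachable instance set: rows whose pre lies inside node's tree interval."""
    _, pre, post, _ = next(r for r in IND if r[0] == node and r[3] == "tree")
    return [r for r in IND if pre < r[1] < post]

def reachable(source, target):
    used = set()           # hop nodes whose RIS was already retrieved
    hops = [source]
    while hops:
        hop = hops.pop()
        if hop in used:
            continue
        used.add(hop)
        for name, _, _, kind in ris(hop):      # Step 1
            if name == target:
                return True
            if kind == "non-tree":             # Steps 2 and 3
                hops.append(name)
    return False
```

For the query D → C, RIS(D) holds G, H and the non-tree instances of B and A; expanding the hop node A then finds C.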

  28. Pruning Rules • Pruning hop nodes • Simple – Never retrieve reachable instance sets twice • Skip – Skip already searched areas • Stop – Certain nodes cannot have hop nodes in their reachable instance set – This property can be pre-computed

  29. Simple and Skip-Strategy • Keep list of used hop nodes, U • When examining a new hop node h, there are four possible positions of h relative to u ∈ U: h = u; u ∈ RIS(h); h ∈ RIS(u); h sibling to u [Figure annotations: No; No; Yes; Yes, but skip]

  30. Stop Nodes • Node s is a stop node iff all non-tree instances in RIS(s) also have their corresponding tree instances in RIS(s) or are instances of s itself. • 1. Node A is a stop node ⇒ no hop node in RIS(A) can be used • 2. Nodes B, C, E, F, and R are also stop nodes • 3. Nodes D, G, and H are not stop nodes [Figure: pre/post plot of O(G)]
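
The stop-node property can be pre-computed directly from the index table. The sketch below reuses the table rows read off the earlier slides (the node names of the two non-tree instances are an assumption) and hypothetical helper names.

```python
# Rows: (node, pre, post, instance type), as read off the slides.
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non-tree"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non-tree"),
]

def ris(node):
    """Rows whose pre value lies inside the tree interval of node."""
    _, pre, post, _ = next(r for r in IND if r[0] == node and r[3] == "tree")
    return [r for r in IND if pre < r[1] < post]

def is_stop_node(s):
    members = ris(s)
    tree_nodes = {name for name, _, _, kind in members if kind == "tree"}
    # Every non-tree instance must be s itself or have its tree instance in RIS(s).
    return all(name == s or name in tree_nodes
               for name, _, _, kind in members if kind == "non-tree")

stop_nodes = sorted({r[0] for r in IND if is_stop_node(r[0])})
```

On this table the result agrees with the slide: A, B, C, E, F and R are stop nodes, while D, G and H are not.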

  31. Implementation Issues: GRIPP • 1. Introducing one virtual root node r. • 2. Connect r to the node with the highest degree. • 3. Traverse and Label the nodes. • 4. For the remaining part, repeat steps 2 – 3 until all vertices are covered.

  32. Dual Labeling @ICDE 2006 • Naïve Dual Labeling: • Tree Interval-Based Labeling • Link Table: 11 → [1, 8], 16 → [10, 15] [Figure: tree with nodes X and Y marked and interval labels [0, 21]; [1, 8], [9, 20]; [2, 7], [10, 15], [16, 19]; [3, 4], [5, 6], [11, 12], [13, 14], [17, 18]]

  33. Query Processing Over Naïve Dual Labeling • Case 1: Query B → F? 17 ∈ [9, 20] • Case 2: Query C → X? 16 → [10, 15], 11 ∈ [10, 15], 11 → [1, 8]: YES [Figure: interval-labeled tree over nodes A, B, C, D, E, F, G, H, I, X, Y]

  34. Query Processing Over Naïve Dual Labeling • Lemma 1. Assume two nodes u and v are labeled [a, b] and [c, d], respectively. There is a path from u to v iff c ∈ [a, b] or the link table contains a series of m non-tree edges $i_1 \to [j_1, k_1], \ldots, i_m \to [j_m, k_m]$ (1) such that $i_1 \in [a, b]$, $c \in [j_m, k_m]$, and $i_{m'} \in [j_{m'-1}, k_{m'-1}]$ for all $1 < m' \le m$.
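
Lemma 1 translates into an iterative search over the link table. In this sketch a node is represented by its interval label, a link by a pair `(i, (j, k))`, and the labels B = [9, 20], F = [17, 18], C = [16, 19], X = [1, 8] are the ones implied by the two query cases on the previous slide; the function name is hypothetical.

```python
def reachable(u, v, links):
    """u, v: interval labels (a, b) and (c, d); links: list of (i, (j, k))."""
    c = v[0]
    seen = {u}
    frontier = [u]
    while frontier:
        lo, hi = frontier.pop()
        if lo <= c <= hi:
            return True
        # Follow every non-tree edge that starts inside the current interval.
        for i, target in links:
            if lo <= i <= hi and target not in seen:
                seen.add(target)
                frontier.append(target)
    return False

LINKS = [(11, (1, 8)), (16, (10, 15))]
B, F, C, X = (9, 20), (17, 18), (16, 19), (1, 8)
```

`reachable(C, X, LINKS)` follows 16 → [10, 15] and then 11 → [1, 8], matching Case 2 of the slide.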

  35. Transitive Link Table • Size of the link table is O(t), where t = |E(G)| - |V(G)| • Naïve Query Processing: traversing and exploring the non-tree edges in an iterative fashion. • Transitive Link Table: given two links i1 → [j1, k1] and i2 → [j2, k2] in the link table, if i2 ∈ [j1, k1], we add a new link i1 → [j2, k2]. • Size of the transitive link table: O(t^2).
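
The closure rule can be applied repeatedly until no new link appears. This is a naive fixed-point sketch meant to illustrate the rule, not an efficient construction; links are written as `(i, (j, k))` and the function name is hypothetical.

```python
def transitive_link_table(links):
    """Close the link table under: i1 -> [j1,k1], i2 -> [j2,k2], i2 in [j1,k1]."""
    table = set(links)
    changed = True
    while changed:
        changed = False
        for i1, (j1, k1) in list(table):
            for i2, target2 in list(table):
                if j1 <= i2 <= k1 and (i1, target2) not in table:
                    table.add((i1, target2))   # new link i1 -> [j2, k2]
                    changed = True
    return table

closed = transitive_link_table([(11, (1, 8)), (16, (10, 15))])
```

For the two links of the running example, the only addition is 16 → [1, 8], because 11 lies in [10, 15].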

  36. Transitive Link Table • An Example: 16 → [10, 15], 11 → [1, 8], 16 → [1, 8] [Figure: interval-labeled tree over nodes A, B, C, D, E, F, G, H, I, X, Y]

  37. Query Over Transitive Link Table • Assume two nodes u and v are labeled [a, b] and [c, d], respectively. There is a path from u to v iff 1) c ∈ [a, b]; or 2) the transitive link table contains a link i → [j, k], where i ∈ [a, b] AND c ∈ [j, k]. [Figure: plot marking the positions a, i, b, j, c, d, k]
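
With the transitive link table, a query needs no iteration: one interval test plus one scan over the links. The table below is the three-link example from the previous slide; the node labels are those implied by the query examples in the deck (G = [2, 7] is an assumption based on the query B → G on the next slide), and the function name is hypothetical.

```python
TRANSITIVE_LINKS = [(16, (10, 15)), (11, (1, 8)), (16, (1, 8))]

def reachable(u, v, table=TRANSITIVE_LINKS):
    """u, v: interval labels (a, b) and (c, d)."""
    a, b = u
    c = v[0]
    if a <= c <= b:                      # condition 1: tree path
        return True
    # Condition 2: a single transitive link i -> [j, k] bridges u and v.
    return any(a <= i <= b and j <= c <= k for i, (j, k) in table)

B, C, G, X = (9, 20), (16, 19), (2, 7), (1, 8)
```

The linear scan is only for illustration; the N(.,.) function on the next slide replaces it with lookups.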

  38. N(.,.) • Query: B → G? N(9, 2) - N(9, 20) > 0 ⇒ (B → G) TRUE • N(9,2) = 2 • N(9,20) = 0 [Figure: grid with axis ticks 9, 11, 16, 20 and 1, 2, 7, 8, 10, 15]

  39. Reducing Space Cost • In each cell, all H(.,.) values are exactly the same. [Figure: grid with axis ticks 9, 11, 16, 20 and 1, 2, 7, 8, 10, 15]

  40. Outline • Chain Cover Chain@TODS1990 • Tree Cover GRIPP @ SIGMOD 2007, Dual Labeling@ICDE2006 • Path-Tree Cover pathtree@SIGMOD 2008 • 2Hop-Labeling HOPI@EDBT2004, FAST2Hop@EDBT2008

  41. Path-Tree in a Nutshell [Figure: example DAG with vertices 1 to 15 and paths P1, P2, P3, P4]

  42. Key Problems • How to construct a path-tree? • Algorithm • How can a path-tree help with reachability queries? • Labeling • Transitive Closure Compression • How does path-tree compare with the existing methods? • Optimality

  43. Constructing Path-Tree • Step 1: Path-Decomposition of DAG • Step 2: Minimal Equivalent Edge Set between any two paths • Step 3: Path-Graph Construction • Step 4: Path-Tree Cover Extraction

  44. Step 1: Path-Decomposition • Example label: (PID, SID) = (2, 5) • For any two nodes (u, v) in the same path, u → v if and only if u.sid ≤ v.sid • A simple linear algorithm based on topological sort can achieve a path-decomposition [Figure: example DAG with vertices 1 to 15 and paths P1, P2, P3, P4]

  45. Step 2: Minimal equivalent edge set • The reachability between any two paths can be captured by a unique minimal set of edges • The edges in the minimal equivalent edge set do not cross (always parallel)! [Figure: paths P1 and P2 with the edges P1 → P2, shown twice]

  46. Step 2: Constructing Minimal Equivalent Edge Set (PiPj) • 1. Order the vertices in Pi and Pj in decreasing order • 2. Find the first vertex v in P_j that P_i can reach • 3. Find the last vertex u in P_i that reaches v • 4. Remove all the edges that cross (u, v) • 5. Repeat steps 2-4 [Figure: paths P1 and P2 with the edges between them]
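
One way to sketch this step is a single sweep that drops every edge implied by another one. It is a reformulation under stated assumptions rather than a line-by-line transcription of the steps above: only direct edges from Pi to Pj are considered, each edge is written as `(a, b)` with `a` the position of its tail on Pi and `b` the position of its head on Pj, and positions grow along the path direction. The function name and sample edges are hypothetical.

```python
def minimal_equivalent_edges(edges):
    """Keep only the edges (a, b) that no other edge between the paths implies.

    Edge (a, b) is implied by (a2, b2) when a <= a2 and b2 <= b:
    walk along Pi from a to a2, cross to b2, then walk along Pj to b.
    """
    kept = []
    best_head = float("inf")
    # Visit tails from last to first; among equal tails, earliest head first.
    for a, b in sorted(set(edges), key=lambda e: (-e[0], e[1])):
        if b < best_head:       # reaches Pj earlier than any later tail does
            kept.append((a, b))
            best_head = b
    return sorted(kept)

sample = [(1, 3), (2, 2), (3, 4), (1, 1), (3, 5)]
reduced = minimal_equivalent_edges(sample)
```

In the result both coordinates increase together from one kept edge to the next, which is the "do not cross" property stated on the previous slide.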

  47. Step 3: Path-Graph Construction • Weight reflects the cost we have to pay for the transitive closure computation if we exclude this path-tree edge [Figure: example DAG with paths P1 to P4 and the resulting Weighted Directed Path-Graph over P1, P2, P3, P4]

  48. Step 4: Extracting Path-Tree Cover • Weighted Directed Path-Graph ⇒ Maximal Directed Spanning Tree • Chu-Liu/Edmonds algorithm, O(m’ + k log k) [Figure: the weighted path-graph over P1 to P4 and its spanning tree]

  49. Key Problems • How to construct a path-tree? • Algorithm • How can path-tree help with reachability queries? • Labeling • Transitive Closure Compression • How does path-tree compare with the existing methods? • Optimality

  50. 3-Tuple Labeling for Reachability • Interval labeling (2-tuple): high-level description about paths, Pi → Pj? • DFS labeling (1-tuple) [Figure: example DAG with paths P1 to P4 and the path-tree with interval labels [1,3], [1,4], [1,1], [2,2]]
