Exploiting Local Similarity for Indexing Paths in Graph-Structured Data
This paper presents a novel approach for indexing paths in graph-structured data by exploiting local similarities among nodes. Traditional methods struggle with large datasets that do not fit into memory, making it necessary to summarize structural information into a manageable graph. The authors introduce the concept of bisimilarity and k-bisimilarity for node matching, allowing efficient indexing based on structural relationships. Key processes, examples, and results illustrate the effectiveness of this indexing strategy, enhancing graph query performance.
Presentation Transcript
Exploiting Local Similarity for Indexing Paths in Graph-Structured Data, by Raghav Kaushik, Pradeep Shenoy, Philip Bohannon and Ehud Gudes Abdullah Mueen
Outline • No Outline • No Confusing Syntax • No Pseudocode • Examples • Results
XML as Data Graph
[Figure: an XML document drawn as a data graph, with callouts oid, label (3) and value (13)]
• Non-tree edges model IDREF relationships in the document
Some Notations
• node path: 1.2.3.7.14
• label path: ROOT.metro.cultural.museum.name
• 1.2.3.7 matches ROOT.metro.cultural.museum
• 2.3.7 does not match metro.cultural.museum.name
• 6 and 7 both match ROOT.metro.cultural.museum
• k-path: a label path of length ≤ k
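The matching rule in these notations can be sketched in a few lines of Python. The `labels` table covers only the five nodes named on the slide, and the helper names are illustrative rather than taken from the paper. The sketch compares label sequences only and assumes the node path is a real path in the graph. Counting the length of a label path in edges is also an assumption, chosen because a later slide treats C.D as a k-path for A(1).

```python
# Labels of the nodes on the slide's example node path 1.2.3.7.14.
labels = {1: "ROOT", 2: "metro", 3: "cultural", 7: "museum", 14: "name"}

def parse_node_path(text):
    """Turn "1.2.3.7" into [1, 2, 3, 7]."""
    return [int(part) for part in text.split(".")]

def matches(node_path, label_path):
    """True if the nodes' labels, read in order, spell out the label path.

    Edges are not checked: the node path is assumed to exist in the graph.
    """
    node_labels = [labels[n] for n in parse_node_path(node_path)]
    return node_labels == label_path.split(".")

def is_k_path(label_path, k):
    """True if the label path has at most k edges (length in edges is an assumption)."""
    return len(label_path.split(".")) - 1 <= k

print(matches("1.2.3.7", "ROOT.metro.cultural.museum"))  # first example on the slide
print(matches("2.3.7", "metro.cultural.museum.name"))    # second example on the slide
```

The second call fails simply because the node path carries three labels and the label path has four.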
Path Expression
• ROOT.metro.cultural.museum: 6, 7
• ROOT.(-.-.-).name: 12, 14, 16, 19, 22, 24
• ROOT.-*.hotel: all hotel nodes
• ROOT.metro.neighborhoods.neighborhood.(-|-.-)?.(hotel|museum).name: 12, 14, 16, 19
• Operators (figure callouts): "-" matches any label, "." is label sequencing, "|" is alternation, "*" is repetition, "?" is optional
• XPath and other query languages use path expressions:
• http://saxon.sourceforge.net/saxon6.5.3/expressions.html
• http://www.w3.org/1999/09/ql/docs/xquery.html
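One way to experiment with this syntax is to translate a path expression into a regular expression over label paths. The sketch below is an illustration, not the evaluation method of the paper. It assumes the operator meanings given by the slide's callouts, assumes labels consist of word characters, and writes each label with a trailing dot so that "-" can stand for exactly one label. The function names are hypothetical.

```python
import re

def compile_path_expression(expr):
    """Translate a slide-style path expression into a compiled regex."""
    parts = []
    # Tokens are labels or operators; sequencing dots and spaces are skipped.
    for token in re.findall(r"\w+|[-()|*?]", expr):
        if token == "-":
            parts.append(r"(?:[^.]+\.)")           # any single label
        elif token == "(":
            parts.append("(?:")                     # grouping, non-capturing
        elif token in (")", "|", "*", "?"):
            parts.append(token)                     # same meaning as in regex
        else:
            parts.append("(?:" + re.escape(token) + r"\.)")  # one fixed label
    return re.compile("".join(parts))

def matches_expression(expr, label_path):
    """True if the whole label path (a list of labels) matches the expression."""
    text = "".join(label + "." for label in label_path)
    return compile_path_expression(expr).fullmatch(text) is not None

print(matches_expression("ROOT.-*.hotel", ["ROOT", "metro", "hotel"]))
```

Evaluating a query this way still requires enumerating label paths in the data graph, which is the cost the index is meant to avoid.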
The Problem • Given a graph G and a path expression P, what are the labels of the nodes that match P? • A possible solution is to evaluate the path expression query directly on the data graph. • But the data graph can be too large to fit in main memory, and too large to search completely even if it fits.
Indexing Data Graph • No Schema • No Keys • Only structural information is available, and it can be summarized by a smaller graph I(G). This summary graph serves as an index for the whole data graph.
Indexing Data Graph: Example (1), Precise Index (e.g. DataGuide, 1-index)
[Figure: data graph G (nodes 0 to 9 with labels R, A, B, C, D) next to index graph I(G) (nodes 11 to 18), each index node annotated with its extent]
• Extents of the index nodes: {1}, {2,4}, {3}, {5,7}, {6}, {8,9}; for example ext(13) = {2,4} and ext(17) = {5,7}
• On both G and I(G): R.A.-*.C = {5,7} and R.-.B = {4,2}
Indexing Data Graph: Example (2), Safe Index
[Figure: data graph G (nodes 0 to 9) next to index graph I(G) (nodes 11 to 15), each index node annotated with its extent]
• Extents of the index nodes: {1}, {2,4}, {3,5,7}, {6,8,9}
• On index graph I(G): R.A.-*.C = {3,5,7} and R.-.-*.B = {2,4}
• On data graph G: R.A.-*.C = {5,7} and R.-.-*.B = {4}
Indexing Data Graph: Example (3), Unsafe Index
[Figure: data graph G (nodes 0 to 9) next to index graph I(G) (nodes 11 to 15), each index node annotated with its extent]
• Extents of the index nodes: {1}, {2,4}, {3,5,7}, {6,8,9}
• On index graph I(G): R.A.-*.C = {3,5,7} and R.-.-*.B = { }
• On data graph G: R.A.-*.C = {5,7} and R.-.-*.B = {2}
Bisimilarity
[Figure: data graph G (nodes 0 to 9 with labels R, A, B, C, D)]
• Two nodes u and v are called bisimilar (u ≈b v) if label(u) = label(v) and every incoming label path from ROOT to u matches at least one incoming label path from ROOT to v, and vice versa.
• 2,4 are bisimilar
• 5,7 are bisimilar
• 8,9 are bisimilar
• 6,8 are not bisimilar
• ≈b is an equivalence relation over the set of nodes in G, so it partitions the nodes into equivalence classes
• Finding the partition takes O(m log n) time
• On data graph G: R.A.-*.C = {5,7} and R.-.B = {4,2}
Equivalence classes of ≈b → the 1-index
[Figure: data graph G (nodes 0 to 9) next to index graph I(G) (nodes 11 to 18), one index node per equivalence class]
• Extents of the index nodes: {1}, {2,4}, {3}, {5,7}, {6}, {8,9}
• On both G and I(G): R.A.-*.C = {5,7} and R.-.B = {4,2}
Revisiting Bisimilarity • The size of the 1-index is bounded above by the size (number of nodes) of the data graph • For large real documents it is almost 45% of the size of the data graph • Bisimilarity partitions nodes by considering all incoming paths from ROOT, which is a global comparison between nodes.
k-bisimilarity
[Figure: data graph G (nodes 0 to 9 with labels R, A, B, C, D)]
• Two nodes u and v are called k-bisimilar (u ≈k v) if label(u) = label(v) and every incoming label path of length ≤ k to u matches at least one incoming label path of length ≤ k to v, and vice versa.
• ≈k is an equivalence relation over the set of nodes in G, so it partitions the nodes into equivalence classes
• The algorithm for computing k-bisimulation will be shown later
• 2,4 are 0-bisimilar
• 5,7 are 1-bisimilar
• 8,9 are 2-bisimilar
• 6,8 are 1-bisimilar
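The path-based wording of this definition can be checked directly on a small graph. The edges of the slide's data graph did not survive in this transcript, so the graph below is a hypothetical reconstruction: the node labels come from the slides, while the `parents` table is an assumption chosen so that the four examples above hold. Path length is counted in edges, which is also an assumption.

```python
# Node labels as on the slides; the parent lists are an assumed reconstruction.
labels = {0: "R", 1: "A", 2: "B", 3: "C", 4: "B",
          5: "C", 6: "D", 7: "C", 8: "D", 9: "D"}
parents = {1: [0], 2: [0], 3: [0], 4: [1], 5: [2],
           6: [3], 7: [4], 8: [5], 9: [7]}

def incoming_label_paths(node, k):
    """All label paths with at most k edges that end at `node`."""
    paths = {(labels[node],)}
    frontier = {(node, (labels[node],))}
    for _ in range(k):
        # Extend every partial path by one step towards the root.
        frontier = {
            (p, (labels[p],) + path)
            for n, path in frontier
            for p in parents.get(n, ())
        }
        paths |= {path for _, path in frontier}
    return paths

def k_bisimilar(u, v, k):
    """Slide-style test: same label and the same incoming label paths up to k."""
    return incoming_label_paths(u, k) == incoming_label_paths(v, k)

for u, v, k in [(2, 4, 0), (5, 7, 1), (8, 9, 2), (6, 8, 1)]:
    print(u, v, k, k_bisimilar(u, v, k))
```

On this assumed graph each pair stops being equivalent one level later: 2 and 4 differ at k = 1, 5 and 7 at k = 2, and 8 and 9 at k = 3.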
Equivalence classes of ≈0 → the A(0) index
[Figure: data graph G (nodes 0 to 9) next to index graph A(0) (nodes 11 to 15)]
• Label grouping / label partition
• Extents of the index nodes: {1}, {2,4}, {3,5,7}, {6,8,9}
Equivalence classes of ≈1 → the A(1) index
[Figure: data graph G (nodes 0 to 9) next to index graph A(1) (nodes 11 to 17)]
• Extents of the index nodes: {1}, {2}, {3}, {4}, {5,7}, {6,8,9}
A(k) index family
[Figure: data graph G (nodes 0 to 9) surrounded by the index graphs A(0), A(1), A(2) and A(3)]
• A(0) extents: {1}, {2,4}, {3,5,7}, {6,8,9}
• A(1) extents: {1}, {2}, {3}, {4}, {5,7}, {6,8,9}
• A(2) extents: {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8,9}
• A(3) = 1-index, extents: {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}
Properties of A(k) index
[Figure: data graph G (nodes 0 to 9) next to index graph A(1) with extents {1}, {2}, {3}, {4}, {5,7}, {6,8,9}]
How to compute A(1) index
[Figure: data graph G (nodes 0 to 9); each step shows a lookup partition and the partition being refined]
• Lookup (label partition, unchanged during the pass): {1} {2,4} {3,5,7} {6,8,9}
• Refining, start: {1} {2,4} {3,5,7} {6,8,9}
• Refining, next: {1} {2} {4} {3,5,7} {6,8,9}
• Refining, next: {1} {2} {4} {3} {5,7} {6,8,9}
• 1-bisimilar partition: {1} {2} {4} {3} {5,7} {6,8,9}
How to compute A(2) index
[Figure: data graph G (nodes 0 to 9); each step shows a lookup partition and the partition being refined]
• Lookup (1-bisimilar partition, unchanged during the pass): {1} {2} {4} {3} {5,7} {6,8,9}
• Refining, start: {1} {2} {4} {3} {5,7} {6,8,9}
• Refining, next: {1} {2} {4} {3} {5} {7} {6,8,9}
• Refining, next: {1} {2} {4} {3} {5} {7} {6} {8,9}
• 2-bisimilar partition: {1} {2} {4} {3} {5} {7} {6} {8,9}
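The level-by-level refinement on these two slides can be sketched as follows. This is a simple signature-based refinement, not necessarily the exact procedure drawn on the slides: at each level a node's new block is determined by its old block together with the set of old blocks of its parents. The edges of the slide's data graph did not survive in this transcript, so `parents` is an assumed reconstruction, chosen so that the partitions come out as listed on the slides (the root, node 0, gets a block of its own).

```python
# Node labels as on the slides; the parent lists are an assumed reconstruction.
labels = {0: "R", 1: "A", 2: "B", 3: "C", 4: "B",
          5: "C", 6: "D", 7: "C", 8: "D", 9: "D"}
parents = {1: [0], 2: [0], 3: [0], 4: [1], 5: [2],
           6: [3], 7: [4], 8: [5], 9: [7]}

def k_bisimulation(k):
    """Partition of the nodes after k refinement passes, as sorted lists."""
    # Level 0: nodes are grouped by label only (the label partition).
    block = dict(labels)
    for _ in range(k):
        # Look up the previous level's blocks, then refine: a node keeps
        # company only with nodes whose parents fall into the same blocks.
        block = {
            n: (block[n], frozenset(block[p] for p in parents.get(n, ())))
            for n in labels
        }
    groups = {}
    for node, signature in block.items():
        groups.setdefault(signature, []).append(node)
    return sorted(sorted(group) for group in groups.values())

for k in range(4):
    print(k, k_bisimulation(k))
```

Each pass reads only the previous level's blocks, which mirrors the separate lookup and refining partitions on the slides. On this graph the partition stops changing at k = 3, where every block is a single node.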
Query Evaluation: Forward or Backward
[Figure: index graph A(1) with extents {1}, {2}, {3}, {4}, {5,7}, {6,8,9}, next to the query drawn as the label sequence R, A, -, C]
• R.A.-*.C = {5,7}
• Repeated state is prevented
• O(|A|*m)
• Backward evaluation using label-group
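A forward evaluation on the index graph can be sketched as a search over (index node, query position) pairs, with a visited set so that no state is expanded twice. The sketch is illustrative and makes several assumptions: the data graph edges are a hypothetical reconstruction (the figure did not survive in this transcript), the query is given as a list of steps, and only plain labels, "-" and "-*" are supported.

```python
# Node labels as on the slides; the child lists are an assumed reconstruction.
labels = {0: "R", 1: "A", 2: "B", 3: "C", 4: "B",
          5: "C", 6: "D", 7: "C", 8: "D", 9: "D"}
children = {0: [1, 2, 3], 1: [4], 2: [5], 3: [6], 4: [7], 5: [8], 7: [9]}
# A(1) partition as listed on the slides, plus the root in its own block.
a1_blocks = [(0,), (1,), (2,), (3,), (4,), (5, 7), (6, 8, 9)]

def build_index(blocks):
    """One index node per block; an edge wherever two members are connected."""
    block_of = {n: b for b in blocks for n in b}
    index_labels = {b: labels[b[0]] for b in blocks}
    index_children = {b: set() for b in blocks}
    for u, vs in children.items():
        for v in vs:
            index_children[block_of[u]].add(block_of[v])
    return index_labels, index_children

def evaluate(steps, node_labels, node_children, root):
    """Nodes reached from `root` along a path whose labels match `steps`."""
    def advance(node, i):
        # Query positions `node` may occupy when its predecessor is at i.
        out = set()
        if i >= 0 and steps[i] == "-*":
            out.add(i)                      # "-*" absorbs one more node
        j = i + 1
        while j < len(steps):
            if steps[j] == "-*":
                out.add(j)                  # absorbed by this "-*" ...
                j += 1                      # ... or "-*" matches nothing
                continue
            if steps[j] in ("-", node_labels[node]):
                out.add(j)
            break
        return out

    seen = {(root, j) for j in advance(root, -1)}
    stack = list(seen)
    while stack:
        n, i = stack.pop()
        for c in node_children.get(n, ()):
            for j in advance(c, i):
                if (c, j) not in seen:      # repeated states are not revisited
                    seen.add((c, j))
                    stack.append((c, j))
    # Accept a state if only "-*" steps remain after its position.
    return {n for n, i in seen if all(s == "-*" for s in steps[i + 1:])}

index_labels, index_children = build_index(a1_blocks)
query = ["R", "A", "-*", "C"]
hits = evaluate(query, index_labels, index_children, (0,))
answer = {n for block in hits for n in block}
print(sorted(answer))
```

On this assumed graph the index answers with the whole extent {5,7}, as on the slide, while running the same `evaluate` on the data graph itself returns only node 7. That gap is what a validation step has to close.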
Query Evaluation: Validation
[Figure: index graph A(1) with extents {1}, {2}, {3}, {4}, {5,7}, {6,8,9}, next to the query drawn as the label sequence R, A, B, C, D]
• R.A.B.C.D = {6,8,9}
• Repeated state is prevented
• O(|A|*m)
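Validation can be pictured as checking each candidate returned by the index against the data graph. The sketch below walks backward from a candidate along parent edges and handles plain label paths only. The candidate set is the index answer shown on the slide. The data graph edges are a hypothetical reconstruction, since the figure did not survive in this transcript, so the validated result holds for that assumed graph only.

```python
# Node labels as on the slides; the parent lists are an assumed reconstruction.
labels = {0: "R", 1: "A", 2: "B", 3: "C", 4: "B",
          5: "C", 6: "D", 7: "C", 8: "D", 9: "D"}
parents = {1: [0], 2: [0], 3: [0], 4: [1], 5: [2],
           6: [3], 7: [4], 8: [5], 9: [7]}
ROOT = 0

def validate(node, label_path):
    """True if some path from ROOT to `node` spells out `label_path`."""
    frontier = {node}
    for position, label in enumerate(reversed(label_path)):
        # Keep only the nodes that carry the label expected at this step.
        frontier = {n for n in frontier if labels[n] == label}
        if not frontier:
            return False
        if position < len(label_path) - 1:
            # Step backward to the parents for the next (earlier) label.
            frontier = {p for n in frontier for p in parents.get(n, ())}
    return ROOT in frontier

candidates = {6, 8, 9}                      # index answer for R.A.B.C.D
query = ["R", "A", "B", "C", "D"]
validated = {n for n in candidates if validate(n, query)}
print(sorted(validated))
```

With these assumed edges only node 9 survives validation, because nodes 6 and 8 are reached through C nodes that do not lie below A.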
Avoiding Validation
[Figure: index graph A(1) with extents {1}, {2}, {3}, {4}, {5,7}, {6,8,9}]
• R.-*.C.D = {6,8,9}
• For queries like R.-*.p, we can safely avoid validation on A(k) if p is a k-path.
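The claim can be illustrated by checking whether the set of data nodes that end a label path p is a union of whole index blocks: if it is, the index can answer R.-*.p without validation. The sketch below assumes a hypothetical reconstruction of the data graph (the edges did not survive in this transcript), assumes every node is reachable from the root so that the R.-*. prefix adds no constraint, and uses the A(1) blocks listed on the slides. The function names are illustrative.

```python
# Node labels as on the slides; the parent lists are an assumed reconstruction.
labels = {0: "R", 1: "A", 2: "B", 3: "C", 4: "B",
          5: "C", 6: "D", 7: "C", 8: "D", 9: "D"}
parents = {1: [0], 2: [0], 3: [0], 4: [1], 5: [2],
           6: [3], 7: [4], 8: [5], 9: [7]}
# A(1) partition as listed on the slides, plus the root in its own block.
a1_blocks = [{0}, {1}, {2}, {3}, {4}, {5, 7}, {6, 8, 9}]

def ends_label_path(node, p):
    """True if some path ending at `node` spells out the label path p."""
    frontier = {node}
    for label in reversed(p):
        frontier = {n for n in frontier if labels[n] == label}
        if not frontier:
            return False
        frontier = {q for n in frontier for q in parents.get(n, ())}
    return True

def exact_on_index(p, blocks):
    """True if the nodes ending p form a union of whole blocks."""
    hits = {n for n in labels if ends_label_path(n, p)}
    return all(block <= hits or block.isdisjoint(hits) for block in blocks)

print(exact_on_index(["C", "D"], a1_blocks))       # C.D has one edge
print(exact_on_index(["B", "C", "D"], a1_blocks))  # B.C.D has two edges
```

On this assumed graph C.D selects exactly the block {6,8,9}, while B.C.D selects only {8,9} and cuts through that block, so A(1) alone would over-report node 6 for it.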
Results
Conclusion • The A(k) index is smaller than precise indexes yet keeps their advantages, such as faster execution time with significant accuracy. • Future presentations • Maintaining the indexes under updates. • Incorporating more complex queries.