
Covering Index for Branching Path Queries



Presentation Transcript


  1. Covering Index for Branching Path Queries • Raghav Kaushik, University of Wisconsin • Philip Bohannon, Bell Laboratories • Jeffrey F. Naughton, University of Wisconsin • Henry F. Korth, Bell Laboratories • SIGMOD 2002 • Presented by: Yu Fan

  2. Overview • Motivation • Problem • Introduction • Background • Covering Index Definition Scheme • Performance Study • Conclusion

  3. Motivation • The covering index is a well-known technique in relational database systems • Define an index that covers all attributes of a table that are referenced in a query • Evaluate the query without accessing the table • Speed up query performance • Can a covering index be used to accelerate branching path queries? • Yes

  4. Problem • Existing indexes are large in practice • DataGuide • 1-Index • Forward and Backward Index (F&B-Index)

  5. The Labeled Graph Data Model • Model XML or semi-structured data as a directed, node-labeled tree with an extra set of special edges called idref edges • The result is a directed graph

  6. The Labeled Graph Data Model

  7. Branching Path Expressions • Forward separators (/, //, ⇒) and backward separators (\, \\, ⇐) • If n_i and n_{i+1} are separated by • /: then n_i is the parent of n_{i+1} • //: then n_i is an ancestor of n_{i+1} • ⇒: then n_i points to n_{i+1} through an idref edge • \: then n_i is a child of n_{i+1} • \\: then n_i is a descendant of n_{i+1} • ⇐: then n_i is pointed to by n_{i+1} through an idref edge

  8. Branching Path Expressions • Label-path • A sequence of labels l_1, l_2, …, l_p separated by separators • Node-path • A sequence of nodes n_1, n_2, …, n_p separated by separators • A node-path matches a label-path if the corresponding separators are the same and label(n_i) = l_i for every i
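
The matching rule above can be sketched in a few lines. This is only an illustration: the alternating-list encoding of paths and the function name `matches` are assumptions made here, and the check covers only the two conditions stated on the slide (equal separators, equal labels), not whether the node-path actually exists in the data graph.

```python
def matches(node_path, label_path, label):
    """Check the slide's matching rule.

    node_path  : [n1, sep, n2, sep, ..., np]  (hypothetical encoding)
    label_path : [l1, sep, l2, sep, ..., lp]
    label      : dict mapping each node to its label
    """
    if len(node_path) != len(label_path):
        return False
    for pos, (n, l) in enumerate(zip(node_path, label_path)):
        if pos % 2 == 1:
            # Odd positions hold separators; they must be identical.
            if n != l:
                return False
        elif label[n] != l:
            # Even positions hold nodes; label(n_i) must equal l_i.
            return False
    return True
```

For example, with `label = {1: "ROOT", 2: "metro"}`, the node-path `[1, "/", 2]` matches the label-path `["ROOT", "/", "metro"]` but not `["ROOT", "//", "metro"]`.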

  9. Branching Path Expressions • The primary path is the path that remains when all parts between brackets “[” and “]” are removed • Example: ROOT/metro/neighborhoods/neighborhood[/business⇒hotel]/cultural⇒museum • Its primary path is ROOT/metro/neighborhoods/neighborhood/cultural⇒museum

  10. Index Graph • Index graph I(G), where G is the data graph • For each node A in I(G), ext(A), the extent of A, is a subset of V_G • Query result of a branching path expression P on I(G) • The union of the extents of the index nodes that result from evaluating P on I(G)
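
As a rough illustration of the query-result definition, the sketch below evaluates a path over a tiny index graph and unions the extents of the index nodes it reaches. The dictionary layout and the name `eval_on_index` are assumptions for this example, and only child ("/") steps are handled.

```python
def eval_on_index(index_labels, index_edges, extents, root, path):
    """Evaluate a child-only label path on an index graph.

    index_labels : dict index node -> label
    index_edges  : dict index node -> list of child index nodes
    extents      : dict index node -> set of data nodes (ext(A))
    path         : list of labels, starting with the root's label
    """
    frontier = {root} if index_labels[root] == path[0] else set()
    for lab in path[1:]:
        # Follow one "/" step, keeping only children with the wanted label.
        frontier = {b for a in frontier
                    for b in index_edges.get(a, ())
                    if index_labels[b] == lab}
    result = set()
    for a in frontier:
        # The answer is the union of the extents of the matching index nodes.
        result |= extents[a]
    return result
```

The data graph itself is never touched: the answer comes entirely from the index graph and its extents, which is the sense in which the index "covers" the query.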

  11. Bisimilarity • Definition: a symmetric, binary relation ≈ on V_G is called a bisimulation if, for any two data nodes u and v with u ≈ v, we have that: • u and v have the same label • If par_u is the parent of u and par_v is the parent of v, then par_u ≈ par_v • If u' points to u through an idref edge, then there is a v' that points to v through an idref edge such that u' ≈ v', and vice versa

  12. DataGuide • Concise and accurate structural summaries of semi-structured databases

  13. 1-Index • Index graph constructed on the data graph G using bisimulation • Intuition: group nodes together if they have the same incoming paths

  14. Forward and Backward Index • Construct the F&B-Index on an edge-labeled data graph • For every (edge) label l, add a new label l^{-1} • For every edge e labeled l from node u to node v, add an (inverse) edge e^{-1} with label l^{-1} from v to u • Compute the 1-Index (or DataGuide) on this modified graph
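
The graph modification described above is small enough to sketch directly. The `(u, label, v)` triple encoding and the `"^-1"` suffix used to name inverse labels are assumptions of this example.

```python
def add_inverse_edges(edges):
    """Return the edge list extended with one inverse edge per original edge.

    edges : list of (u, label, v) triples for an edge-labeled data graph
    """
    # For every edge u --l--> v, add v --l^-1--> u.
    inverse = [(v, lab + "^-1", u) for (u, lab, v) in edges]
    return edges + inverse
```

Running a forward-only summary such as the 1-Index on the extended edge list then takes both incoming and outgoing paths of each node into account.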

  15. Succ-Stable and Pred-Stable • For a set of nodes A, let Succ(A) denote the set of successors of the nodes in A • Given two sets of data graph nodes A and B, A is said to be succ-stable with respect to B if either A is a subset of Succ(B) or A and Succ(B) are disjoint • Pred-stable is defined in the same way, using the set of predecessors Pred(B)
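
The two stability tests translate almost word for word into set operations. The sketch below assumes the data graph is given as a list of `(u, v)` edge pairs; the function names are chosen here for illustration.

```python
def succ(nodes, edges):
    """Successors of a set of nodes: all v with an edge u -> v, u in nodes."""
    return {v for (u, v) in edges if u in nodes}

def pred(nodes, edges):
    """Predecessors of a set of nodes: all u with an edge u -> v, v in nodes."""
    return {u for (u, v) in edges if v in nodes}

def is_succ_stable(A, B, edges):
    """A is succ-stable w.r.t. B if A is inside Succ(B) or disjoint from it."""
    s = succ(B, edges)
    return A <= s or A.isdisjoint(s)

def is_pred_stable(A, B, edges):
    """A is pred-stable w.r.t. B if A is inside Pred(B) or disjoint from it."""
    p = pred(B, edges)
    return A <= p or A.isdisjoint(p)
```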

  16. Stability • If A is succ-stable with respect to B and there is an edge from B to A, then every node in the extent of A has a parent in the extent of B • Important for the precision of the index graph • Stabilize A with respect to B • Split A into A1 and A2 • A1 = A ∩ Succ(B) • A2 = A − Succ(B) • 1-Index • Initialize by grouping nodes by label • Keep splitting the label groups until we obtain a succ-stable refinement
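
The 1-Index construction outlined above (label grouping followed by repeated splitting) can be sketched as a naive fixed-point loop. This is not an efficient algorithm and not necessarily the authors' implementation; the function name and the `(u, v)` edge-list encoding are assumptions, and tree edges and idref edges are treated uniformly.

```python
def one_index_partition(labels, edges):
    """Naive succ-stable refinement of the label grouping.

    labels : dict node -> label
    edges  : list of (u, v) pairs
    Returns a set of frozensets, one per index node extent.
    """
    # Initialization: one block per label.
    groups = {}
    for n, lab in labels.items():
        groups.setdefault(lab, set()).add(n)
    partition = [frozenset(g) for g in groups.values()]

    changed = True
    while changed:
        changed = False
        for B in partition:
            succ_B = {v for (u, v) in edges if u in B}
            new_partition = []
            for A in partition:
                inside, outside = A & succ_B, A - succ_B
                if inside and outside:
                    # A is not succ-stable w.r.t. B: split it in two.
                    new_partition += [inside, outside]
                    changed = True
                else:
                    new_partition.append(A)
            if changed:
                partition = new_partition
                break  # restart with the refined partition
    return set(partition)
```

On a graph where two `c` nodes hang under differently labeled parents, the two `c` nodes end up in different blocks; if their parents share a label and a common parent, they stay together.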

  17. Another View of F&B-Index • Another way to build F&B-Index • Reverse all edges in G • Compute the bisimilarity partition • Set the current partition to what is output by the previous step • Reverse edges in G again • Compute the bisimilarity partition • Set the current partition to what is output by the previous step • Repeat the above steps till the current partition does not change • Obtain a partition of the data nodes that is both succ-stable and pred-stable
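
The alternating procedure above can be sketched as follows. The helper `refine` is a naive stand-in for "compute the bisimilarity partition" starting from a given partition; both function names, the edge-list encoding, and the uniform treatment of tree and idref edges are assumptions of this sketch.

```python
def refine(partition, edges):
    """Coarsest succ-stable refinement of `partition` under `edges` (naive)."""
    partition = list(partition)
    changed = True
    while changed:
        changed = False
        for B in partition:
            succ_B = {v for (u, v) in edges if u in B}
            for A in partition:
                inside, outside = A & succ_B, A - succ_B
                if inside and outside:
                    # Split A and restart the scan on the new partition.
                    partition.remove(A)
                    partition += [inside, outside]
                    changed = True
                    break
            if changed:
                break
    return set(partition)

def fb_index_partition(labels, edges):
    """Alternate backward and forward refinement until nothing changes."""
    groups = {}
    for n, lab in labels.items():
        groups.setdefault(lab, set()).add(n)
    current = {frozenset(g) for g in groups.values()}
    reversed_edges = [(v, u) for (u, v) in edges]
    while True:
        # One iteration: refine on the reversed graph, then on the original.
        step = refine(refine(current, reversed_edges), edges)
        if step == current:
            return current  # both succ-stable and pred-stable
        current = step
```

Two `a` nodes with the same parent but differently labeled children are merged by a forward-only summary, whereas this partition separates them because their outgoing paths differ.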

  18. Size of the F&B-Index • F&B-Index over a data graph G covers all branching path expressions over G • Any index graph that covers all branching path expressions over G must be a refinement of F&B Index • F&B-Index is the smallest index graph that covers all branching path expressions over G • F&B-Index is often big. It can approach the size of the base data itself

  19. Covering Index Definition Scheme • Eliminating branching path expressions which are deemed less important. • Smaller index handling the remaining branching queries more efficiently • Four approaches towards the goal • Tags to be indexed • Tree edges vs idref edges • Exploiting local similarity • Restricting tree depth

  20. Tags to be indexed • Tags that are never queried • Need not be indexed • Replace the label with a single special label: other • If such a node is not on the tree path to any indexed node, it can be assumed to be absent • Can have a large effect in practice • XMark data, 100MB (1.43M nodes) • F&B-Index has 436000 nodes • Ignore text tags such as bold and emph • Number of nodes drops to 18000

  21. Tree Edges vs idref Edges • Effect of idref edges • XMark data • F&B-Index on tree edges and idref edges has 1.35M nodes (ignoring text nodes) • F&B-Index on only tree edges has 18000 nodes (ignoring text nodes) • Give tree edges priority • Specify the set of idref edges to be indexed
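
The relabeling step is a one-line transformation. In the sketch below, the function name `relabel` and the dictionary encoding of node labels are assumptions; the special label `other` comes from the slide. Pruning of nodes that are not on the path to any indexed node is not shown.

```python
def relabel(labels, indexed_tags, other="other"):
    """Replace every label outside `indexed_tags` with the label `other`.

    labels       : dict node -> label
    indexed_tags : set of tags that the index should distinguish
    """
    return {n: (lab if lab in indexed_tags else other)
            for n, lab in labels.items()}
```

After relabeling, nodes with non-indexed tags can no longer cause splits on account of their labels, which is one reason the index can shrink.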

  22. Exploiting Local similarity • Observations: • Most queries refer to short paths and seldom ask for long paths • Two nodes are locally similar, but they may be stored in different extents due to a variety of complex paths • Exploiting local similarity • Give up absolute precision and group similar pieces of data together • A(k)-Index

  23. K-bisimulation • Definition: ≈^k (k-bisimilarity) is defined inductively • For any two nodes u and v, u ≈^0 v iff u and v have the same label • u ≈^k v iff u ≈^{k-1} v, par_u ≈^{k-1} par_v, and for every u' that points to u through an idref edge, there is a v' that points to v through an idref edge such that u' ≈^{k-1} v', and vice versa

  24. A(k)-Index • Constructed on data graph G using k-bisimulation • Precise for any simple path expression of length less than or equal to k • Use k to control the size of the index and the maximum area of the index graph affected • Increasing k refines the partition until a fixed point is reached, which is the 1-Index
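
The inductive definition of k-bisimilarity suggests a simple round-based computation: start from labels and, k times, extend each node's signature with the signatures of its parents. The sketch below assumes a `(u, v)` edge list in which tree edges and idref edges are treated alike; the name `ak_partition` is hypothetical, and the nested signatures are only practical for small examples.

```python
def ak_partition(labels, edges, k):
    """Partition nodes by k-bisimilarity (incoming edges only).

    labels : dict node -> label
    edges  : list of (u, v) pairs, u being a parent/referrer of v
    """
    parents = {n: set() for n in labels}
    for u, v in edges:
        parents[v].add(u)

    # Round 0: two nodes are equivalent iff they have the same label.
    block = dict(labels)
    for _ in range(k):
        # Round i: same block at round i-1 and same set of parent blocks.
        block = {n: (block[n], frozenset(block[p] for p in parents[n]))
                 for n in labels}

    groups = {}
    for n, b in block.items():
        groups.setdefault(b, set()).add(n)
    return {frozenset(g) for g in groups.values()}
```

With two branches ROOT/a/c/d and ROOT/b/c/d, the `c` nodes are separated at k = 1 while the `d` nodes stay together until k = 2, which illustrates how k bounds the path length for which the index is precise.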

  25. A(k)-Index Example

  26. Restricting Tree Depth • Tree depth • Given a branching path expression • Nodes on the primary path have tree-depth 0 • Nodes that do not have tree-depth 0 and have a path from some node in the primary path have tree-depth 1 • Nodes that do not have tree-depth 1 and have a path to some node of tree-depth 1 have tree-depth 2 • Nodes that do not have tree-depth 2 and have a path from some node of tree-depth 2 have tree-depth 3 • And so on… • The tree depth of a query is the maximum tree-depth of its nodes

  27. Tree Depth Example • Query example • //museums/history/museum[/featured and ⇐cultural\neighborhood[/cultural⇒museum[\art]]] • Asks for history museums that have a featured exhibit and also have an art museum in the same neighborhood

  28. F+B-Index • Consider one iteration of the F&B-Index computation • Reverse all edges in G • Compute the bisimilarity partition • Reverse edges in G again • Compute the bisimilarity partition • Call this index graph the F+B-Index • F+B+F+B-Index: two iterations

  29. F+B-Index • The F+B-Index is accurate for branching path expressions that have tree depth at most 1 • The F+B+F+B-Index is accurate for branching path expressions that have tree depth at most 3 • Cannot handle all queries • Meaningful queries often have small tree depth
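
A bounded number of backward/forward rounds can be sketched with signature-based refinement. The names `bisim_refine` and `f_plus_b`, the `iterations` parameter, and the edge-list encoding are assumptions of this example; `iterations=1` corresponds to the F+B-Index and `iterations=2` to the F+B+F+B-Index as described above.

```python
def bisim_refine(block, edges):
    """Refine block ids until stable w.r.t. incoming edges in `edges`."""
    parents = {n: set() for n in block}
    for u, v in edges:
        parents[v].add(u)
    while True:
        # Signature: own block plus the set of parent blocks.
        sig = {n: (block[n], frozenset(block[p] for p in parents[n]))
               for n in block}
        ids = {}
        new_block = {n: ids.setdefault(sig[n], len(ids)) for n in block}
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block  # no block was split: fixed point reached
        block = new_block

def f_plus_b(labels, edges, iterations=1):
    """Run a fixed number of (reverse, forward) refinement rounds."""
    ids = {}
    block = {n: ids.setdefault(lab, len(ids)) for n, lab in labels.items()}
    reversed_edges = [(v, u) for (u, v) in edges]
    for _ in range(iterations):
        block = bisim_refine(block, reversed_edges)  # edges reversed
        block = bisim_refine(block, edges)           # edges restored
    groups = {}
    for n, b in block.items():
        groups.setdefault(b, set()).add(n)
    return {frozenset(g) for g in groups.values()}
```

Stopping after a fixed number of rounds trades accuracy on deeply nested branching queries for a coarser, and therefore smaller, partition.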

  30. Putting it together • Index definition • A set of tags T to be indexed • For each of the forward and backward directions • The set of idref edges to be indexed (denoted ref_fwd and ref_back) • The extent of local similarity desired (denoted k_fwd and k_back) • Tree depth td, the number of iterations of the F&B-Index computation to be performed

  31. Compute Partition

  32. Compute Index

  33. Example • Tags to be indexed • ROOT, metro, cinema-hall, neighborhoods, neighborhood, business • Local similarity • k_fwd = k_back = ∞ • td = ∞ • [Figure: resulting index graph over the indexed tags]

  34. Definition Scheme of Existing Index

  35. Index Selection • Given a query • Every tag in the query should be indexed • k_fwd ≥ path length of the query • k_back ≥ path length of the query • td ≥ tree depth of the query • The more generic the index, the more queries it covers, but the worse its performance • The best choice depends heavily on the data and the queries
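
The selection conditions above amount to a handful of comparisons. The sketch below uses two hypothetical record types, `IndexDefinition` and `QueryProfile`, invented for this example; it omits the sets of indexed idref edges and only checks the conditions listed on the slide.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexDefinition:
    tags: frozenset            # tags T to be indexed
    k_fwd: float = math.inf    # forward local similarity
    k_back: float = math.inf   # backward local similarity
    td: float = math.inf       # tree depth (number of F&B iterations)

@dataclass(frozen=True)
class QueryProfile:
    tags: frozenset            # tags mentioned by the query
    path_length: int           # path length of the query
    tree_depth: int            # tree depth of the query

def may_cover(defn, query):
    """True if the definition satisfies the slide's selection conditions."""
    return (query.tags <= defn.tags
            and defn.k_fwd >= query.path_length
            and defn.k_back >= query.path_length
            and defn.td >= query.tree_depth)
```

A definition with every parameter left at infinity passes the numeric checks for any query, which mirrors the trade-off noted above: the most generic index covers the most queries but is also the largest.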

  36. Performance study • XMark XML benchmark dataset • Models an auction site

  37. Range of Index

  38. Range of Index

  39. Performance on Queries • Use index definitions 5, 6, and 8, called I_all, I_almost-all, and I_specific • Use 5 different queries • Some indexes may not cover all of the queries because of the reduction • Three scenarios • RELSTORE: data stored in a relational system • NSTORE: data stored using a native storage engine • RELPUBLISH: data stored in a relational system, with queries over an XML view of the data

  40. Test Queries

  41. Performance on Queries • [Figure panels: (a) RELSTORE, (b) NSTORE, (c) RELPUBLISH]

  42. Conclusion • Covering indexes are a promising approach to the efficient evaluation of branching path queries • The F&B-Index is a covering index for the set of all branching path queries, but its size is too big in practice • Using the index definition scheme, we can get much smaller covering indexes that cover certain classes of queries

  43. Questions?
