Covering Index for Branching Path Queries

Covering Index for Branching Path Queries • Raghav Kaushik, University of Wisconsin • Philip Bohannon, Bell Laboratories • Jeffrey F. Naughton, University of Wisconsin • Henry F. Korth, Bell Laboratories • SIGMOD 2002 • Presented by: Yu Fan

Overview • Motivation • Problem • Introduction • Background • Covering Index Definition Scheme • Performance Study • Conclusion

Motivation • Covering indexes are a well-known technique in relational database systems • Define an index that “covers” all attributes of a table that are referenced in a query • Evaluate the query without accessing the table • Speeds up query performance • Can covering indexes be used to accelerate branching path queries? • Yes

Problem • Existing indexes are large in practice • DataGuide • 1-Index • Forward and Backward Index (F&B-Index)

The Labeled Graph Data Model • Model XML or semi-structured data as a directed, node-labeled tree with an extra set of special edges called idref edges • Directed graph

The Labeled Graph Data Model

Branching Path Expressions • Forward and Backward Separators • If n_i and n_{i+1} are separated by • /: then n_i is the parent of n_{i+1} • //: then n_i is an ancestor of n_{i+1} • ⇒: then n_i points to n_{i+1} through an idref edge • \: then n_i is a child of n_{i+1} • \\: then n_i is a descendant of n_{i+1} • ⇐: then n_i is pointed to by n_{i+1} through an idref edge

Branching Path Expressions • Label-path • A sequence of labels l_1, l_2, …, l_p separated by the separators • Node-path • A sequence of nodes n_1, n_2, …, n_p separated by the separators • A node-path matches a label-path if the corresponding separators are the same and label(n_i) = l_i

Branching Path Expressions • Primary path is the path that remains when all parts between brackets “[” and “]” are removed. • Example: ROOT/metro/neighborhoods/neighborhood[/business⇒hotel]/cultural⇒museum has primary path ROOT/metro/neighborhoods/neighborhood/cultural⇒museum

Index Graph • Index graph I(G), where G is the data graph • For each node A in I(G), ext(A), the extent of A, is a subset of V_G, the set of data nodes • Query result • Evaluate a branching path expression P on I(G) • The result is the union of the extents of the index nodes that result from evaluating P on I(G)
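The query-result rule above can be sketched in a few lines. This is a minimal illustration only, assuming a simple label path that uses the child separator `/` throughout; the function name `evaluate_on_index` and the dictionary-based graph encoding are assumptions made for the example, not part of the presentation.

```python
def evaluate_on_index(index_edges, index_labels, extents, root, label_path):
    """Evaluate a child-axis label path on an index graph.

    Returns the union of the extents of the index nodes reached.
    """
    # Start at the root only if its label matches the first step.
    frontier = {root} if index_labels[root] == label_path[0] else set()
    for label in label_path[1:]:
        frontier = {
            b
            for a in frontier
            for b in index_edges.get(a, ())
            if index_labels[b] == label
        }
    result = set()
    for a in frontier:
        result |= extents[a]
    return result


# Hypothetical index graph with three index nodes.
index_labels = {"R": "ROOT", "M": "metro", "N": "neighborhood"}
index_edges = {"R": ["M"], "M": ["N"]}
extents = {"R": {1}, "M": {2, 3}, "N": {4, 5, 6}}

answer = evaluate_on_index(
    index_edges, index_labels, extents, "R", ["ROOT", "metro", "neighborhood"]
)
```

The data graph itself is never touched: only index nodes are traversed, and data node identifiers appear only in the extents.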

Bisimilarity • Definition: a symmetric, binary relation ≈ on V_G is called a bisimulation if, for any two data nodes u and v with u ≈ v, we have that: • u and v have the same label • If par_u is the parent of u and par_v is the parent of v, then par_u ≈ par_v • If u’ points to u through an idref edge, then there is a v’ that points to v through an idref edge such that u’ ≈ v’, and vice versa.

DataGuide • Concise and accurate structural summaries of semi-structured databases

1-index • Index graph which is constructed on data graph G using bisimulation • Intuition: try to group together nodes if they have the same incoming paths
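One common way to compute such a grouping is iterative partition refinement. The sketch below is an illustration under simplifying assumptions, not the construction used in the paper: tree parents and idref sources are merged into one predecessor list per node, and the names `bisim_partition`, `labels` and `preds` are hypothetical.

```python
def bisim_partition(labels, preds):
    """Group nodes by label, then split until predecessors agree.

    labels: node -> label
    preds:  node -> list of predecessor nodes (parent and idref sources)
    Returns node -> block id.
    """
    block = dict(labels)  # initial partition: one block per label
    while True:
        ids = {}
        new_block = {}
        for v in labels:
            # A node's signature is its own block plus its predecessors' blocks.
            sig = (block[v], frozenset(block[p] for p in preds.get(v, ())))
            new_block[v] = ids.setdefault(sig, len(ids))
        # Refinement only splits blocks, so an equal count means no change.
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block
        block = new_block


# Hypothetical tree: ROOT has children a and b; the c nodes hang below them.
labels = {1: "ROOT", 2: "a", 3: "b", 4: "c", 5: "c", 6: "c"}
preds = {2: [1], 3: [1], 4: [2], 5: [3], 6: [2]}
part = bisim_partition(labels, preds)
```

Nodes 4 and 6 share the incoming path ROOT/a/c and end up in one block, while node 5 (ROOT/b/c) is separated from them.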

Forward and Backward index • Construct F&B-Index on edge-labeled data graph • For every (edge) label l, add a new label l⁻¹ • For every edge e labeled l from node u to node v, add an (inverse) edge e⁻¹ with label l⁻¹ from v to u • Compute the 1-Index (or DataGuide) on this modified graph
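The inverse-edge step can be sketched as follows, assuming edges are stored as (source, label, target) triples. The function name and the `^-1` suffix used to spell the inverse label are illustrative choices.

```python
def add_inverse_edges(edges):
    """Return the edge list extended with one inverse edge per original edge."""
    inverses = [(v, label + "^-1", u) for (u, label, v) in edges]
    return edges + inverses


# Hypothetical two-edge graph.
edges = [(1, "metro", 2), (2, "neighborhood", 3)]
extended = add_inverse_edges(edges)
```

Any summary built on `extended` then sees both incoming and outgoing structure of every node through ordinary forward edges.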

Succ-Stable and Pred-Stable • For a set of nodes A, Let Succ(A) denote the set of successors of the nodes in A. • Given two sets of data graph nodes A and B, A is said to be succ-stable with respect to B if either A is a subset of Succ(B) or A and Succ(B) are disjoint • Pred-stable can be defined in the same way

Stability • If A is succ-stable with respect to B and there is an edge from B to A, then every node in the extent of A has a parent in the extent of B • Important for precision of index graph • Stabilize A and B • Split A into A1 and A2 • A1 is A ∩ Succ(B) • A2 is A – Succ(B) • 1-Index • Initialize with the label grouping • Split the label grouping until we obtain a succ-stable refinement
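The stability test and the split into A1 and A2 translate directly into set operations. A minimal sketch, assuming the data graph is a dictionary from each node to its list of successors; the function names are illustrative.

```python
def succ(edges, B):
    """Set of successors of the nodes in B."""
    return {v for u in B for v in edges.get(u, ())}


def is_succ_stable(edges, A, B):
    """A is succ-stable w.r.t. B if A is inside Succ(B) or disjoint from it."""
    s = succ(edges, B)
    return A <= s or A.isdisjoint(s)


def stabilize(edges, A, B):
    """Split A into A1 = A ∩ Succ(B) and A2 = A - Succ(B)."""
    s = succ(edges, B)
    return A & s, A - s


# Hypothetical graph: node 1 points to 3, node 2 points to 5.
edges = {1: [3], 2: [5]}
A, B = {3, 4}, {1}
stable_before = is_succ_stable(edges, A, B)
A1, A2 = stabilize(edges, A, B)
```

After the split, both A1 and A2 are succ-stable with respect to B.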

Another View of F&B-Index • Another way to build F&B-Index • Reverse all edges in G • Compute the bisimilarity partition • Set the current partition to what is output by the previous step • Reverse edges in G again • Compute the bisimilarity partition • Set the current partition to what is output by the previous step • Repeat the above steps till the current partition does not change • Obtain a partition of the data nodes that is both succ-stable and pred-stable
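The alternating procedure above can be sketched with a generic refinement helper. This is an illustration under assumptions: tree and idref edges are merged into a single `children` map, and the names `refine` and `fb_partition` are hypothetical.

```python
def refine(block, nbrs):
    """Split blocks until nodes in a block agree on their neighbours' blocks."""
    while True:
        ids = {}
        new = {}
        for v in block:
            sig = (block[v], frozenset(block[n] for n in nbrs.get(v, ())))
            new[v] = ids.setdefault(sig, len(ids))
        # Refinement only splits, so an equal block count means a fixed point.
        if len(set(new.values())) == len(set(block.values())):
            return new
        block = new


def fb_partition(labels, children):
    """Alternate reversed-graph and original-graph refinement until stable."""
    parents = {}
    for u, vs in children.items():
        for v in vs:
            parents.setdefault(v, []).append(u)
    block = dict(labels)
    while True:
        before = len(set(block.values()))
        block = refine(block, children)  # edges reversed: compare outgoing structure
        block = refine(block, parents)   # original edges: compare incoming structure
        if len(set(block.values())) == before:
            return block


# Hypothetical data: two neighborhoods, only one of which has a business child.
labels = {1: "ROOT", 2: "neighborhood", 3: "neighborhood", 4: "business"}
children = {1: [2, 3], 2: [4]}
part = fb_partition(labels, children)
```

Refining on incoming paths alone would keep nodes 2 and 3 together; the reversed pass separates them because only node 2 has a business child.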

Size of the F&B-Index • F&B-Index over a data graph G covers all branching path expressions over G • Any index graph that covers all branching path expressions over G must be a refinement of F&B Index • F&B-Index is the smallest index graph that covers all branching path expressions over G • F&B-Index is often big. It can approach the size of the base data itself

Covering Index Definition Scheme • Eliminating branching path expressions which are deemed less important. • Smaller index handling the remaining branching queries more efficiently • Four approaches towards the goal • Tags to be indexed • Tree edges vs idref edges • Exploiting local similarity • Restricting tree depth

Tags to be indexed • Tags that are never queried • Need not be indexed • Replace the label with a special label: other • If a tag is not on the tree path to any node that is indexed, it can be assumed to be absent • Can have a large effect in practice • XMark data, 100 MB (1.43M nodes) • F&B-Index has 436000 nodes • Ignore text tags such as bold and emph • Number of nodes drops to 18000

Tree Edges vs idref Edges • Effect of idref edges • XMark data • F&B-Index on tree edges and idref edges has 1.35M nodes (ignoring text nodes) • F&B-Index on only tree edges has 18000 nodes (ignoring text nodes) • Give tree edges priority • Specify the set of idref edges to be indexed

Exploiting Local similarity • Observations: • Most queries refer to short paths and seldom ask for long paths • Two nodes are locally similar, but they may be stored in different extents due to a variety of complex paths • Exploiting local similarity • Give up absolute precision and group similar pieces of data together • A(k)-Index

k-Bisimulation • Definition: ≈^k (k-bisimilarity) is defined inductively • For any two nodes u and v, u ≈^0 v iff u and v have the same label • u ≈^k v iff u ≈^(k-1) v, par_u ≈^(k-1) par_v, and • For every u’ that points to u through an idref edge, there is a v’ that points to v through an idref edge such that u’ ≈^(k-1) v’, and vice versa

A(k)-index • Constructed on data graph G using k-bisimulation • Precise for any simple path expression of length less than or equal to k • Use k to control the size of the index and the maximum area of the index graph affected • Increasing k refines the partition until a fixed point is reached, which is 1-Index.
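A partition by k-bisimilarity can be sketched as k rounds of refinement starting from the label grouping. The sketch assumes parents and idref sources are merged into one predecessor list per node; `k_bisim_partition` is an illustrative name.

```python
def k_bisim_partition(labels, preds, k):
    """Partition nodes by k-bisimilarity using k refinement rounds."""
    block = dict(labels)  # round 0: same label
    for _ in range(k):
        ids = {}
        new_block = {}
        for v in block:
            # Same block in the previous round, and predecessors in the same blocks.
            sig = (block[v], frozenset(block[p] for p in preds.get(v, ())))
            new_block[v] = ids.setdefault(sig, len(ids))
        block = new_block
    return block


# Hypothetical tree: ROOT/x/a/c and ROOT/y/a/c.
labels = {1: "ROOT", 2: "x", 3: "y", 4: "a", 5: "a", 6: "c", 7: "c"}
preds = {2: [1], 3: [1], 4: [2], 5: [3], 6: [4], 7: [5]}
p1 = k_bisim_partition(labels, preds, 1)
p2 = k_bisim_partition(labels, preds, 2)
```

With k = 1 the two c nodes stay together because their parents share the label a; with k = 2 they are split because the grandparents x and y differ.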

A(k)-Index Example

Restricting Tree Depth • Tree Depth • Given a branching path expression • Nodes on the primary path have tree depth 0 • Nodes that do not have tree depth 0 and have a path from some node in the primary path have tree depth 1 • Nodes that do not have tree depth 1 and have a path to some node of tree depth 1 have tree depth 2 • Nodes that do not have tree depth 2 and have a path from some node of tree depth 2 have tree depth 3 • And so on… • Tree depth of a query is the maximum tree depth of its nodes

Tree Depth Example • Query example • //museums/history/museum[/featured and cultural\neighborhood[/cultural⇒museum[\art]]] • Asks for history museums that have a featured exhibit and also have an art museum in the same neighborhood

F+B-Index • Consider one iteration of the F&B-Index computation • Reverse all edges in G • Compute the bisimilarity partition • Reverse edges in G again • Compute the bisimilarity partition • Call this index graph the F+B-Index • F+B+F+B-Index: two iterations

F+B-Index • F+B-Index is accurate for branching path expressions that have tree depth at most 1 • F+B+F+B-Index is accurate for branching path expressions that have tree depth at most 3 • Cannot handle all queries • Meaningful queries often have small tree depth

Putting it together • Index definition • A set of tags T to be indexed • For each of the forward and backward directions • Set of idref edges to be indexed (denoted reffwd and refback) • The extent of local similarity desired (denoted kfwd and kback) • Tree depth td, the number of iterations in the F&B-Index computation to be performed
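The components of an index definition can be collected in a small record type. This is a sketch only: the class name `IndexDefinition`, the field names, and the use of `math.inf` to mean "unbounded" are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexDefinition:
    tags: frozenset                  # set of tags T to be indexed
    ref_fwd: frozenset = frozenset()   # idref edges indexed in the forward direction
    ref_back: frozenset = frozenset()  # idref edges indexed in the backward direction
    k_fwd: float = math.inf          # local similarity, forward direction
    k_back: float = math.inf         # local similarity, backward direction
    td: float = math.inf             # tree depth (number of F&B iterations)


# A definition that indexes a handful of tags with no bound on k or td.
definition = IndexDefinition(
    tags=frozenset(
        {"ROOT", "metro", "cinema-hall", "neighborhoods", "neighborhood", "business"}
    )
)
```

Making the record immutable keeps a definition usable as a dictionary key, which is convenient when several candidate definitions are compared.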

Compute Partition

Compute Index

Example • Tags to be indexed • ROOT, metro, cinema-hall, neighborhoods, neighborhood, business • Local similarity • kfwd = kback = ∞ • td = ∞ • [Figure: resulting index graph over the indexed tags, with extents]

Definition Scheme of Existing Index

Index Selection • Given a query • Its tags should be indexed • kfwd ≥ path length of the query • kback ≥ path length of the query • td ≥ tree depth of the query • The more generic the index, the more queries it covers, but the worse the performance • Depends heavily on the data and the queries
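The selection conditions listed above can be checked mechanically. The sketch below encodes only those four bullets and ignores the choice of idref edges; the function name `may_cover` and its parameters are hypothetical.

```python
import math


def may_cover(index_tags, k_fwd, k_back, td,
              query_tags, query_path_length, query_tree_depth):
    """Check the tag, kfwd, kback and td conditions for one query."""
    return (
        set(query_tags) <= set(index_tags)
        and k_fwd >= query_path_length
        and k_back >= query_path_length
        and td >= query_tree_depth
    )


index_tags = {"ROOT", "metro", "neighborhoods", "neighborhood", "business"}

# Unbounded index: passes for a query over indexed tags.
ok = may_cover(index_tags, math.inf, math.inf, math.inf,
               {"ROOT", "metro", "neighborhoods"}, 3, 0)

# td = 0 is too small for a query of tree depth 1.
too_shallow = may_cover(index_tags, math.inf, math.inf, 0,
                        {"neighborhood", "business"}, 2, 1)

# The tag "hotel" is not indexed.
missing_tag = may_cover(index_tags, math.inf, math.inf, math.inf,
                        {"business", "hotel"}, 2, 1)
```

Among the definitions that pass such a check for a workload, the trade-off on the slide suggests preferring the least generic one.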

Performance study • XMark XML benchmark dataset • Models an auction site

Range of Index

Performance on Queries • Use index definitions 5, 6 and 8, called Iall, Ialmost-all and Ispecific • Use 5 different queries • Some indexes may not cover all the queries because of the reduction • Three scenarios • RELSTORE: stored in a relational system • NSTORE: stored using a native storage engine • RELPUBLISH: stored in a relational system and queries are over an XML view of the data

Test Queries

Performance on Queries • [Figure: three panels, (a) RELSTORE, (b) NSTORE, (c) RELPUBLISH]

Conclusion • Covering indexes are a promising approach to the efficient evaluation of branching path queries • The F&B-Index is a covering index for the set of all branching path queries, but it is too big in practice • Using the index definition scheme, we can get much smaller covering indexes that cover certain classes of queries

Questions?
