A New Algorithm for Evaluating Ordered Tree Pattern Queries

Yangjun Chen, Dept. of Applied Computer Science, University of Winnipeg, 515 Portage Ave., Winnipeg, Manitoba, Canada R3B 2E9

Outline
• Motivation
• Algorithm for tree pattern query evaluation based on ordered tree matching
- Tree encoding
- Algorithm description
• Experiment results
• Summary

Motivation
• An efficient method for evaluating XPath expression queries (XML query processing)
[Figure: a tree pattern query evaluated against XML documents]

Motivation
Document:
<Purchase>
  <Seller>
    <Name>dell</Name>
    <Item>
      <Manufacturer>IBM</Manufacturer>
      <Name>part#1</Name>
      <Item>
        <Manufacturer>Intel</Manufacturer>
      </Item>
    </Item>
    <Item>
      <Name>Part#2</Name>
    </Item>
    <Location>Houston</Location>
  </Seller>
  <Buyer>
    <Location>Winnipeg</Location>
    <Name>Y-Chen</Name>
  </Buyer>
</Purchase>
[Figure: the document drawn as a tree; element names are abbreviated to their initials (P, S, B, N, I, M, L)]

Motivation
Query: XPath expressions
Q1: /Purchase[Seller[Loc = ‘Boston’]]/Buyer[Loc = ‘New York’]
Q2: /Purchase//Item[Manufacturer = ‘Intel’]
• d-edge: ancestor-descendant relationship
• c-edge: parent-child relationship
[Figure: the document tree together with the tree patterns for the queries]

Motivation
• XPath evaluation against XML documents
- XPath expressions:
a[b[c and .//d]]/b[c and e//d]
book[title = ‘Art of Programming’]//author[fn = ‘Donald’ and ln = ‘Knuth’]
- Document fragment: <document> <book> <title> Art of Programming </title> <author> <fn>Donald Knuth</fn> … …
[Figure: the tree patterns for the two expressions]

Motivation
• XPath evaluation against XML documents
- Evaluation based on unordered tree matching
Definition: An embedding of a tree pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
(i) Preserve node label: for each u ∈ Q, label(u) matches label(f(u)).
(ii) Preserve parent-child/ancestor-descendant relationships: if (u, v) is a c-edge (parent-child edge) in Q, then f(v) is a child of f(u) in T; if (u, v) is a d-edge (ancestor-descendant edge) in Q, then f(v) is a descendant of f(u) in T.
[Figure: an example document tree T (nodes v1 to v6) and a tree pattern Q (nodes q1 to q3)]

Motivation
• XPath evaluation against XML documents
- Evaluation based on ordered tree matching
XPath expression: s/vp[v = “reads” and /following-sibling::n = “book” and /following-sibling::adv]
[Figure: a sentence parse tree (labels s, np, vp, det, adj, n, v, adv) and the tree pattern of the expression]

Motivation
• XPath evaluation against XML documents
- Evaluation based on ordered tree matching
Definition: An embedding of a tree pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
(i) Preserve node label: for each u ∈ Q, label(u) matches label(f(u)).
(ii) Preserve parent-child/ancestor-descendant relationships: if (u, v) is a c-edge (parent-child edge) in Q, then f(v) is a child of f(u) in T; if (u, v) is a d-edge (ancestor-descendant edge) in Q, then f(v) is a descendant of f(u) in T.
(iii) Preserve sibling order: for any two nodes v1 ∈ Q and v2 ∈ Q, if v1 is to the left of v2, then f(v1) is to the left of f(v2) in T.
[Figure: an example document tree T (nodes v1 to v6) and a tree pattern Q (nodes q1 to q3)]
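The three conditions can be checked mechanically for a given mapping f. The Python sketch below is only an illustration: the dictionary layout, the helper names (`ranks`, `left_of`, `ancestor_of`, `is_ordered_embedding`) and the small example trees are assumptions that do not appear on the slides, and label matching is simplified to equality. "To the left of" and "ancestor of" are derived from preorder and postorder ranks.

```python
def ranks(children, root):
    """Return the preorder and postorder rank of every node."""
    pre, post = {}, {}

    def visit(n):
        pre[n] = len(pre)
        for c in children[n]:
            visit(c)
        post[n] = len(post)

    visit(root)
    return pre, post


def left_of(u, v, pre, post):
    # u comes first in both traversals, so u lies strictly to the left of v
    return pre[u] < pre[v] and post[u] < post[v]


def ancestor_of(u, v, pre, post):
    return pre[u] < pre[v] and post[u] > post[v]


def is_ordered_embedding(f, Q, T):
    """Check conditions (i)-(iii) for a mapping f from nodes of Q to nodes of T."""
    q_pre, q_post = ranks(Q["children"], Q["root"])
    t_pre, t_post = ranks(T["children"], T["root"])
    t_parent = {c: p for p, cs in T["children"].items() for c in cs}
    for u in q_pre:
        if Q["label"][u] != T["label"][f[u]]:          # (i) labels
            return False
        for v in Q["children"][u]:                     # (ii) edges
            if Q["edge"][v] == "c":
                if t_parent.get(f[v]) != f[u]:
                    return False
            elif not ancestor_of(f[u], f[v], t_pre, t_post):
                return False
    for u in q_pre:                                    # (iii) left-to-right order
        for v in q_pre:
            if left_of(u, v, q_pre, q_post) and not left_of(f[u], f[v], t_pre, t_post):
                return False
    return True


# Hypothetical document: a(v1) with children b(v2) and c(v3); c(v3) has child b(v4).
T = {"root": "v1",
     "children": {"v1": ["v2", "v3"], "v2": [], "v3": ["v4"], "v4": []},
     "label": {"v1": "a", "v2": "b", "v3": "c", "v4": "b"}}
# Pattern a[//b]/c with b to the left of c.
Q_bc = {"root": "q1", "children": {"q1": ["q2", "q3"], "q2": [], "q3": []},
        "label": {"q1": "a", "q2": "b", "q3": "c"}, "edge": {"q2": "d", "q3": "c"}}
# Same labels, but c must now be to the left of b.
Q_cb = {"root": "q1", "children": {"q1": ["q2", "q3"], "q2": [], "q3": []},
        "label": {"q1": "a", "q2": "c", "q3": "b"}, "edge": {"q2": "c", "q3": "d"}}

f_ok = {"q1": "v1", "q2": "v2", "q3": "v3"}
f_bad = {"q1": "v1", "q2": "v3", "q3": "v4"}  # fine for (i) and (ii), violates (iii)
print(is_ordered_embedding(f_ok, Q_bc, T))
print(is_ordered_embedding(f_bad, Q_cb, T))
```

The second mapping satisfies the unordered definition but fails condition (iii), because v4 is a descendant of v3 rather than a node to its right.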

Algorithm for query evaluation
• Tree encoding
Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document.
Example tree T:
v1 A (1, 1, 11, 1)
  v2 B (1, 2, 9, 2)
    v3 C (1, 3, 3, 3)
    v4 B (1, 4, 8, 3)
      v5 C (1, 5, 5, 4)
      v6 C (1, 6, 6, 4)
      v7 D (1, 7, 7, 4)
  v8 B (1, 10, 10, 2)
[Figure: T drawn as a tree next to the word positions 1 to 11 of the document text]
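The quadruples of the example tree can be reproduced with one depth-first pass. The sketch below rests on an assumption that the slide does not state: a leaf consumes a single position, while an internal node consumes one position when it opens and another when it closes. The `Node` class and the `encode` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    code: tuple = None  # (DocId, LeftPos, RightPos, LevelNum)


def encode(root, doc_id=1):
    """Assign a quadruple to every node in one depth-first pass."""
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1
        left = counter
        for child in node.children:
            visit(child, level + 1)
        if node.children:
            counter += 1  # an internal node closes on a position of its own
        node.code = (doc_id, left, counter, level)

    visit(root, 1)
    return root


# The example tree T from the slide.
v3, v5, v6, v7, v8 = Node("C"), Node("C"), Node("C"), Node("D"), Node("B")
v4 = Node("B", [v5, v6, v7])
v2 = Node("B", [v3, v4])
v1 = Node("A", [v2, v8])
encode(v1)
print(v1.code, v2.code, v4.code, v8.code)
```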

Tree encoding
Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), denoted as a(v), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document.
(i) Ancestor-descendant: a node v1 associated with (d1, l1, r1, ln1) is an ancestor of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, and r1 > r2.
(ii) Parent-child: a node v1 associated with (d1, l1, r1, ln1) is the parent of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, r1 > r2, and ln2 = ln1 + 1.
(iii) From left to right: a node v1 associated with (d1, l1, r1, ln1) is to the left of another node v2 with (d2, l2, r2, ln2) iff d1 = d2 and r1 < l2.
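The three tests translate directly into comparisons on the quadruples. The following Python sketch is illustrative; the function names are hypothetical, and the sample values are the quadruples of the example tree T.

```python
def is_ancestor(a, b):
    """(i) a is an ancestor of b: same document, a's interval strictly contains b's."""
    d1, l1, r1, _ = a
    d2, l2, r2, _ = b
    return d1 == d2 and l1 < l2 and r1 > r2


def is_parent(a, b):
    """(ii) a is the parent of b: an ancestor exactly one level above."""
    return is_ancestor(a, b) and b[3] == a[3] + 1


def is_left_of(a, b):
    """(iii) a is to the left of b: a ends before b starts."""
    return a[0] == b[0] and a[2] < b[1]


# Quadruples (DocId, LeftPos, RightPos, LevelNum) from the example tree T.
v1, v2, v3 = (1, 1, 11, 1), (1, 2, 9, 2), (1, 3, 3, 3)
v4, v7, v8 = (1, 4, 8, 3), (1, 7, 7, 4), (1, 10, 10, 2)

print(is_ancestor(v1, v7), is_parent(v1, v7), is_parent(v4, v7))
print(is_left_of(v3, v4), is_left_of(v2, v8), is_left_of(v2, v4))
```

Each test costs a constant number of comparisons, so no tree traversal is needed once the quadruples are stored.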

Algorithm for query evaluation
• Tree encoding
Data streams, sorted by LeftPos values:
A: (1, 1, 11, 1)
B: (1, 2, 9, 2), (1, 4, 8, 3), (1, 10, 10, 2)
C: (1, 3, 3, 3), (1, 5, 5, 4), (1, 6, 6, 4)
D: (1, 7, 7, 4)
[Figure: the example tree T with its quadruples and the word positions 1 to 11 of the document text]

Algorithm for query evaluation
• Algorithm description
• Our algorithm works bottom-up. Therefore, we need to sort the XML data streams by (DocId, RightPos) values.
• Each time a query Q is submitted to the system, we associate each query node q with a data stream L(q) such that label(v) = label(q) for each v ∈ L(q). In this way, each query node is attached with a list of matching nodes of the document tree.
Streams for the example, sorted by RightPos values:
L(q1) = (1, 1, 11, 1), i.e. {v1}
L(q2) = L(q5) = (1, 4, 8, 3), (1, 2, 9, 2), (1, 10, 10, 2), i.e. {v4, v2, v8}
L(q3) = L(q4) = (1, 3, 3, 3), (1, 5, 5, 4), (1, 6, 6, 4), i.e. {v3, v5, v6}
[Figure: the query tree Q with root q1 (A), its children q2 (B) and q5 (B), and the children q3 (C) and q4 (C) of q2, each node annotated with its stream L(q)]
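Building the streams amounts to grouping the document nodes by label and sorting each group on the chosen position. The sketch below is a hypothetical in-memory version (the slides read the streams through indexes); `build_streams` and the dictionary layout are assumptions made for illustration.

```python
from collections import defaultdict

# Example tree T: node name -> (label, (DocId, LeftPos, RightPos, LevelNum)).
T = {
    "v1": ("A", (1, 1, 11, 1)), "v2": ("B", (1, 2, 9, 2)),
    "v3": ("C", (1, 3, 3, 3)), "v4": ("B", (1, 4, 8, 3)),
    "v5": ("C", (1, 5, 5, 4)), "v6": ("C", (1, 6, 6, 4)),
    "v7": ("D", (1, 7, 7, 4)), "v8": ("B", (1, 10, 10, 2)),
}


def build_streams(nodes, key):
    """Group node names by label and sort every group with the given key."""
    streams = defaultdict(list)
    for name, (label, _) in nodes.items():
        streams[label].append(name)
    for label in streams:
        streams[label].sort(key=lambda n: key(nodes[n][1]))
    return dict(streams)


by_left = build_streams(T, key=lambda c: (c[0], c[1]))   # (DocId, LeftPos)
by_right = build_streams(T, key=lambda c: (c[0], c[2]))  # (DocId, RightPos)

# Query nodes of the example and their labels; L(q) is the stream of label(q).
query_labels = {"q1": "A", "q2": "B", "q3": "C", "q4": "C", "q5": "B"}
L = {q: by_right[label] for q, label in query_labels.items()}
print(by_left["B"], L["q2"], L["q3"])
```

Sorting by RightPos makes every node appear after all of its descendants, which is the order a bottom-up pass needs.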

Algorithm for query evaluation
• Algorithm description
• First, we number the nodes of Q in postorder, so the nodes in Q are referenced by their postorder numbers. Additionally, we set a virtual node for Q, numbered 0, which is considered to be to the left of any node in Q.
• For each node q of Q, a link from it to the left-most leaf node in Q[q], denoted by δ(q), is established. For a leaf node q’, δ(q’) = q’.
[Figure: Q with the postorder numbers q3 = 1, q4 = 2, q2 = 3, q5 = 4, q1 = 5 and the virtual node 0; the links δ(q1) and δ(q2) both lead to q3]
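Both the postorder numbers and the left-most-leaf links (written δ(q) here) can be obtained in a single traversal. This is a minimal sketch on the example query; the child-list representation and the name `number_postorder` are assumptions.

```python
# Example query Q: q1 (A) has children q2 (B) and q5 (B); q2 has children q3 (C) and q4 (C).
Q = {"q1": ["q2", "q5"], "q2": ["q3", "q4"], "q3": [], "q4": [], "q5": []}


def number_postorder(children, root):
    """Return (postorder number, left-most leaf) for every query node."""
    number, leftmost = {}, {}

    def visit(q):
        for c in children[q]:
            visit(c)
        number[q] = len(number) + 1  # numbers start at 1; 0 is kept for the virtual node
        # The left-most leaf of Q[q] is that of its first child, or q itself for a leaf.
        leftmost[q] = leftmost[children[q][0]] if children[q] else q

    visit(root)
    return number, leftmost


number, delta = number_postorder(Q, "q1")
print(number)
print(delta)
```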

Algorithm for query evaluation
• Algorithm description
• Let q’ be a leaf node in Q. We denote by δ⁻¹(q’) the set of all nodes q with δ(q) = q’, where δ(q) is the left-most leaf of Q[q].
δ⁻¹(q3) = {q1, q2, q3}
δ⁻¹(q4) = {q4}
δ⁻¹(q5) = {q5}
[Figure: Q with the links δ(q1) and δ(q2) leading to q3]
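The inverse sets follow from the left-most-leaf links by grouping. A small sketch, assuming the links of the example query are given as a dictionary named `delta` (a hypothetical representation):

```python
from collections import defaultdict

# Left-most leaf of Q[q] for every node q of the example query.
delta = {"q1": "q3", "q2": "q3", "q3": "q3", "q4": "q4", "q5": "q5"}


def invert(delta):
    """Map every leaf q' to the set of nodes q whose left-most leaf is q'."""
    inverse = defaultdict(set)
    for q, leaf in delta.items():
        inverse[leaf].add(q)
    return dict(inverse)


delta_inv = invert(delta)
print(delta_inv)
```

Every set δ⁻¹(q’) is a chain of ancestors ending in the leaf q’, which is why the largest member (in postorder) is also the topmost one.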

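Algorithm for query evaluation
• Algorithm description
• Each node v in T is associated with an array Av of length |Q|, indexed from 0 to |Q| - 1. In Av, each entry is a query node or ∅, defined as follows:
$$A_v[q] = \begin{cases} \max\{x \mid x \in \delta^{-1}(q') \wedge T[v] \text{ embeds } Q[x]\} & \text{if there is a least leaf } q' \text{ larger than } q \text{ such that } \delta^{-1}(q') \text{ contains at least one node } x \text{ with } Q[x] \text{ embedded in } T[v], \\ \emptyset & \text{otherwise.} \end{cases}$$
[Figure: the array Av and the query tree Q, showing the entry Av[q], the leaf q’ and the nodes x1, ..., xj above it]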

Algorithm for query evaluation
• Algorithm description
• x1 is the largest ancestor of q’ such that T[v] contains Q[x1].
• q’ is the closest leaf node to the right of q.
In this way, both the subtree embedding and the left-to-right ordering can be recorded.
[Figure: the array Av and the query tree Q, showing the entry Av[q], the leaf q’ and the nodes x1, ..., xj above it]
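Read literally, the definition of Av can be evaluated directly once it is known which query subtrees T[v] embeds. The sketch below does exactly that and is not the incremental update procedure of the slides. It assumes that "least leaf q’ larger than q" means the first leaf after q (in postorder) whose set δ⁻¹(q’) has an embedded member, writes ∅ as `None`, and uses the hypothetical name `build_array`. Query nodes are given by their postorder numbers in the example query (leaves 1, 2 and 4).

```python
def build_array(embedded, leaves, inverse, size):
    """Evaluate the definition of A_v.

    embedded: postorder numbers x such that T[v] embeds Q[x]
    leaves:   postorder numbers of the leaves of Q
    inverse:  leaf q' -> set of nodes whose left-most leaf is q'
    size:     |Q|, so the array is indexed 0 .. |Q| - 1
    """
    array = [None] * size
    for q in range(size):
        for leaf in sorted(leaves):
            if leaf <= q:
                continue
            hits = inverse[leaf] & embedded
            if hits:
                array[q] = max(hits)  # largest node above the leaf that is embedded
                break
    return array


leaves = [1, 2, 4]
inverse = {1: {1, 3, 5}, 2: {2}, 4: {4}}

print(build_array({1, 2}, leaves, inverse, 5))           # a node embedding only Q[1] and Q[2]
print(build_array({1, 2, 3, 4, 5}, leaves, inverse, 5))  # a node embedding all of Q
```

Under these assumptions the first call gives [1, 2, None, None, None] and the second [5, 2, 4, 4, None], which agree with the arrays [1, 2, ∅, ∅, ∅] and [5, 2, 4, 4, ∅] that appear in the slide examples.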

Algorithm for query evaluation
• Setting values in Av
• (i) If we find that Q[x] can be embedded in T[v], we set Av[q1], ..., Av[qk] to x, where each ql (1 ≤ l ≤ k) is a query node to the left of x, to record the fact that x is the closest node to the right of ql such that T[v] embeds Q[x].
Example (indices 0 to 4): Av3 = [∅, ∅, ∅, ∅, ∅] before the update and Av3 = [1, ∅, ∅, ∅, ∅] after it.
[Figure: the example trees T (v1 to v8) and Q (q1 to q5 with postorder numbers 1 to 5 and the virtual node 0)]

Algorithm for query evaluation
• Setting values in Av
• (ii) If some time later we find another node x’, which is to the right of x, such that Q[x’] can be embedded in T[v], we set Av[p1], ..., Av[ps] to x’, where each pl (1 ≤ l ≤ s) is to the left of x’ but to the right of qk.
Example (indices 0 to 4): Av3 = [1, ∅, ∅, ∅, ∅] before the update and Av3 = [1, 2, ∅, ∅, ∅] after it.
[Figure: the example trees T (v1 to v8) and Q (q1 to q5 with postorder numbers 1 to 5 and the virtual node 0)]

Algorithm for query evaluation
• Setting values in Av
• (iii) If x’ is an ancestor of x, we find all those entries pointing to a descendant of x’ on the left-most path in Q[x’] and replace these entries with x’.
• For all the other nodes v’ such that T[v’] embeds Q[x], we set values for the entries in Av’ in the same way as in (i), (ii), and (iii).
Example (indices 0 to 4): Av4 = [1, ∅, ∅, ∅, ∅] before the update and Av4 = [3, ∅, ∅, ∅, ∅] after it.
[Figure: the example trees T and Q, here with q3 and v5 labeled B]

Algorithm for query evaluation
• Using Av to check tree embedding
• Let q in Q and v in T be the nodes encountered. Let v1, ..., vk be the child nodes of v, and let q1, ..., ql be the child nodes of q. We first check Av1, starting from Av1[h], where h = δ(q) - 1 and δ(q) is the left-most leaf of Q[q]. We begin the search from δ(q) - 1 because it is the closest node to the left of the first child of q. Let Av1[h] = q’. If q’ is neither q1 nor an ancestor of q1, we check Av2[h] in the next step. This process continues until one of the following conditions is satisfied:
(i) all Avj’s have been checked, or
(ii) there exists vj such that Avj[h] is q1 or an ancestor of q1.
[Figure: node v with children v1, ..., vk is tested against query node q with children q1, ..., ql (is label(v) = label(q)?); the scan of Av1 starts at index h = δ(q) - 1]

Algorithm for query evaluation
• Using Av to check tree embedding
• If all Avj’s have been checked (case (i)), Q[q1] cannot be embedded in any subtree rooted at a child node of v, so T[v] cannot embed Q[q]. In case (ii), we know that T[vj] embeds Q[q1]. If q1 is a //-child, or both q1 and vj are /-children, we continue to check Av(j+1)[g] against q2, where g = Avj[h]. (Otherwise, we continue to check Av(j+1)[h] against q1.)
[Figure: after q1 is found through Av1[h], with h = δ(q) - 1, the search for q2 continues in Av2 at the entry indexed by q1]

Algorithm for query evaluation
• Algorithm description
[Figure: the arrays Av computed bottom-up for v3, v5, v6, v4, v2 and v8, and finally for the root v1, whose array is [5, 2, 4, 4, ∅]]

Algorithm for query evaluation
• Experiments
• We conducted our experiments on a DELL desktop PC equipped with a 2.80 GHz Pentium(R) 4 CPU, 0.99 GB of RAM, and a 20 GB hard disk. The code was compiled with the Microsoft Visual C++ compiler, version 6.0, and run standalone.

Algorithm for query evaluation
• Experiments
• Tested methods
In the experiments, we have tested four methods:
- TwigStack (TS for short) [1],
- Twig2Stack (T2S for short) [2],
- PRIX [3],
- tree-embedding (discussed in this paper, TE for short).
[1] N. Bruno, N. Koudas, and D. Srivastava, Holistic Twig Joins: Optimal XML Pattern Matching, in Proc. SIGMOD Int. Conf. on Management of Data, Madison, Wisconsin, June 2002, pp. 310-321.
[2] S. Chen, H-G. Li, J. Tatemura, W-P. Hsiung, D. Agrawal, and K.S. Candan, Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents, in Proc. VLDB, Seoul, Korea, Sept. 2006, pp. 283-294.
[3] P. Rao and B. Moon, Sequencing XML Data and Query Twigs for Fast Pattern Matching, ACM Transactions on Database Systems, Vol. 31, No. 1, March 2006, pp. 299-345.

Algorithm for query evaluation
• Experiments
• Theoretical computational complexities
• Indexes: XB-trees are used for TwigStack, Twig2Stack, and TE; a trie structure is used for PRIX.

Algorithm for query evaluation
• Experiments
• Data sets
The TreeBank dataset is a real data set with a narrow and deeply recursive structure that includes multiple recursive elements. (U. of Washington, The Tukwila System, available from http://data.cs.washington.edu/integration/tukwila/.)
- Data size: 82 MB
- Number of nodes: 2.43 million
- Max/average tree depth: 36/7.9
• Queries
Q1: //VP[DT]//PRP_DOLLAR
Q2: //S/VP/PP[IN]/NP
Q3: //S/VP//PP[NP/VB]/IN
Q4: //VP[.//PP/IN]//NP/*//JJ
Q5: //S[CC][.//PP]//NP[VBZ][IN]//JJ

Algorithm for query evaluation
• Experiments
• Test results
For all the experiments, the buffer pool size was fixed at 2000 pages, with a page size of 8 KB. For each data set, all the tag names are stored in a single list, and each tag name is represented by its order number in that list during the evaluation of queries. In our implementation, each DocId occupies 4 bytes, while a number in a Prüfer sequence, a LeftPos, or a RightPos occupies 2 bytes. A LevelNum value takes only 1 byte.
[Table I: Page numbers]

Algorithm for query evaluation
• Experiments
[Figure: two charts comparing PRIX, TE, TS and T2S on the queries Q1 to Q5, one for execution time (sec.) and one for I/O time (sec.)]

Summary
• An efficient method for evaluating ordered tree pattern queries in XML document databases
- parent/child and ancestor/descendant relations
- from-left-to-right relations
• Computational complexity (leaf_T and leaf_Q denote the numbers of leaf nodes in T and Q)
- O(|T|·leaf_Q) time
- O(leaf_T·leaf_Q) space
• Experiments
- TreeBank database
- I/O time and CPU processing time

Thank you.
