BLAS: An Efficient XPath Processing System Zhimin Song Advanced Database System Professor: Dr. Mengchi Liu
Outline • Introduction • BLAS System • Experimental Results • Conclusions
<ProteinDatabase>
  <ProteinEntry>
    <protein>
      <name>cytochrome c [validated]</name>
      <classification>
        <superfamily>cytochrome c</superfamily>
      </classification>…
    </protein>
    <reference>
      <refinfo>
        <authors>
          <author>Evans, M.J.</author>…
        </authors>
        <year>2001</year>
        <title>The human somatic cytochrome c gene</title>…
      </refinfo>…
    </reference>…
  </ProteinEntry>…
</ProteinDatabase>
Figure 1: Sample XML protein repository
Introduction • XML documents have a complex, tree-like structure of nodes. • Languages for querying XML, such as XPath [1], are based on path navigation. • Child axis (/): navigates from a given node to its child nodes. • Descendant axis (//): navigates from a given node to its descendant nodes.
Introduction (cont.) • Several techniques have already been proposed to improve XPath processing. For example, D-labeling efficiently handles descendant-axis traversal. • What about complex queries that include child-axis steps and branches? • For these queries, this paper proposes P-labeling, which optimizes an important class of queries called suffix path queries.
BLAS (Bi-LAbeling based System) • Basic definitions • The labeling scheme (Index generator) • Query translator
Basic definitions: • BLAS: a system for efficiently processing complex queries, based on D-labeling and P-labeling. • BLAS handles a subset of XPath queries consisting of: • Child axis navigation ( / ) • Descendant axis navigation ( // ) • Branches ( […] ) • The evaluation of a path expression P, written [P], returns the set of nodes in an XML tree T that are reachable by P starting from the root of T. • Since P can be evaluated to retrieve a set of XML nodes, we use “path expression” and “query” interchangeably. • P ⊆ Q if and only if [P] ⊆ [Q]. • P ∩ Q = ∅ if and only if [P] ∩ [Q] = ∅.
Basic definitions (cont.): • Suffix path expression: a path expression P that optionally begins with a descendant axis step (//), followed by zero or more child axis steps (/). • Example: //protein/name • Another example: /proteinDatabase/proteinEntry/protein/name • SP(n): the unique simple path P from the root to the node n. • Evaluating a suffix path expression Q therefore means finding all the nodes n such that SP(n) ⊆ Q.
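As a rough illustration of the suffix path definition, the sketch below checks whether a node's simple path SP(n) is contained in a suffix path query Q by comparing tag sequences directly. The function name `matches_suffix_path` and the string representation of paths are assumptions made for this example; BLAS itself answers this test with P-labels rather than string comparison.

```python
def matches_suffix_path(simple_path: str, query: str) -> bool:
    """Return True if the root-to-node path is matched by the suffix path query."""
    path_steps = simple_path.strip("/").split("/")
    if query == "//":
        # The bare descendant step matches every node.
        return True
    if query.startswith("//"):
        query_steps = query[2:].split("/")
        # A leading // lets the query match any tail of the simple path.
        return (len(query_steps) <= len(path_steps)
                and path_steps[-len(query_steps):] == query_steps)
    # Without a leading //, the query must equal the whole simple path.
    return path_steps == query.strip("/").split("/")
```

For instance, a node whose simple path is /proteinDatabase/proteinEntry/protein/name is matched by //protein/name, but not by the absolute query /protein/name.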
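Architecture of BLAS (figure) • Index generator: a SAX parser turns the XML document into XML events for the P-labeling generator, the D-labeling generator and the data loader; the resulting P-labelings, D-labelings and data values are kept in storage. • Query translator: query decomposition splits the XPath query into suffix path queries and records the ancestor-descendant relationship between their results; the subquery generator (based on P-labeling) produces one subquery per suffix path query; subquery composition (based on D-labeling) combines the subqueries into one query. • Query engine: evaluates the query over the storage and returns the query result.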
The labeling scheme (Index generator) • D-labeling scheme: each XML node n is assigned a triplet <d1,d2,d3> with n.d1 <= n.d2. • For two nodes n and m, m is a descendant of n if and only if n.d1 < m.d1 and n.d2 > m.d2. • m is a child of n if and only if m is a descendant of n and n.d3 + 1 = m.d3. • d1 and d2 are the positions of the start tag and the end tag of n. • d3 is the level of n in the XML tree, i.e. the length of the path from the root to n. • A D-label is therefore written <start,end,level>.
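The D-labeling rules can be sketched in a few lines of Python. This is a minimal illustration, not the BLAS index generator: it assumes positions are counted per start or end tag only (text content is ignored) and that the root has level 1. The names `DLabel`, `d_labels`, `is_descendant` and `is_child` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from typing import NamedTuple


class DLabel(NamedTuple):
    tag: str
    start: int
    end: int
    level: int


def d_labels(xml_text: str) -> list:
    """Assign <start,end,level> to every element, in start-tag order."""
    rows, open_rows, position = [], [], 0
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        position += 1  # assumption: one position per tag event
        if event == "start":
            open_rows.append(len(rows))
            rows.append([elem.tag, position, None, len(open_rows)])
        else:
            rows[open_rows.pop()][2] = position
    return [DLabel(*row) for row in rows]


def is_descendant(m: DLabel, n: DLabel) -> bool:
    """m is a descendant of n iff n's interval strictly encloses m's."""
    return n.start < m.start and n.end > m.end


def is_child(m: DLabel, n: DLabel) -> bool:
    """m is a child of n iff it is a descendant exactly one level deeper."""
    return is_descendant(m, n) and n.level + 1 == m.level
```

Because a single pass over the tag events is enough, the labels can be produced while the document is being parsed, which fits the streaming (SAX-style) loading that the architecture slide mentions.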
Example: using D-labeling • Query: //proteinDatabase//refinfo • First retrieve all the nodes reachable by proteinDatabase and by refinfo. • Let pDB and refinfo be two relations that store these nodes, then D-join them:
  SELECT pDB.start, pDB.end, refinfo.start, refinfo.end
  FROM pDB, refinfo
  WHERE pDB.start < refinfo.start AND pDB.end > refinfo.end
(Figure: query tree over proteinDatabase, proteinEntry, protein, superfamily = “cytochrome c”, reference, refinfo, author = “Evans, M.J.”, year = “2001” and title; descendant-axis edges are marked with //.)
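The D-join can be tried out on a toy database. The sketch below runs the join from the slide in an in-memory SQLite database through Python's `sqlite3` module; the helper name `d_join` and the sample labels are made up for illustration, and the column names are double-quoted because `end` is an SQL keyword.

```python
import sqlite3


def d_join(pdb_rows, refinfo_rows):
    """Return (ancestor, descendant) label pairs via the interval join."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE pDB ("start" INTEGER, "end" INTEGER)')
    conn.execute('CREATE TABLE refinfo ("start" INTEGER, "end" INTEGER)')
    conn.executemany("INSERT INTO pDB VALUES (?, ?)", pdb_rows)
    conn.executemany("INSERT INTO refinfo VALUES (?, ?)", refinfo_rows)
    rows = conn.execute(
        'SELECT pDB."start", pDB."end", refinfo."start", refinfo."end" '
        'FROM pDB, refinfo '
        'WHERE pDB."start" < refinfo."start" AND pDB."end" > refinfo."end" '
        'ORDER BY refinfo."start"'
    ).fetchall()
    conn.close()
    return rows


# Hypothetical labels: one proteinDatabase node and three refinfo nodes,
# the last of which lies outside the proteinDatabase interval on purpose.
pairs = d_join([(1, 100)], [(10, 20), (30, 40), (200, 210)])
```

Only the refinfo nodes whose interval is strictly enclosed by the proteinDatabase interval survive the join.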
P-labeling Scheme • It is also important to implement child axis navigation efficiently. • e.g. /proteinDatabase/proteinEntry/protein/name • Target: improve “/” evaluation • Focus on suffix path queries: e.g. //protein/name
Assign each node a number <p1>, and each suffix path an interval <p1,p2>, such that: • For any two suffix paths Q1 and Q2, Q1 is contained in Q2 if the interval of Q1 lies within the interval of Q2, i.e. Q2.p1 <= Q1.p1 and Q1.p2 <= Q2.p2. • A node n is contained in the suffix path Q if Q.p1 <= SP(n).p1 <= Q.p2. • Let Q be a suffix path query. Then [Q] = {n | Q.p1 <= n.plabel <= Q.p2}, where n.plabel = SP(n).p1.
P-labeling Construction (algorithm) • Suppose that there are n distinct tags (t1, t2, …, tn). • Assign “/” a ratio r0 and each tag ti a ratio ri such that r0 + r1 + r2 + … + rn = 1. • Let ri = 1/(n+1). • Define the domain of the numbers in a P-label to be the integers in [0, m-1], where m is chosen such that m >= (n+1)^h and h is the length of the longest path in the XML tree. • Algorithm: • Path // is assigned the interval (P-label) <0, m-1>. • Partition the interval <0, m-1> in tag order, in proportion to the ratio r0 of the child axis “/” and the ratio ri of each path //ti. • That is, allocate the interval <0, m*r0 - 1> to “/” and <p_i, p_(i+1) - 1> to each ti such that (p_(i+1) - p_i)/m = ri and p_1/m = r0.
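The partitioning can be sketched as follows. The slide only spells out the first partitioning step; the repeated refinement for longer paths (partition the interval of the last tag again for the tag before it, and so on) is inferred here from the construction example in this deck and should be read as an assumption. The sketch also picks m = (n+1)^(h+1), which satisfies the stated bound and leaves one spare level so that the “/” slot of the longest absolute path is not empty. The names `p_label` and `contains` are hypothetical.

```python
def p_label(path: str, tags: list, h: int) -> tuple:
    """Interval <p1,p2> of a suffix path such as //protein/name or /protein/name."""
    base = len(tags) + 1          # slot 0 is "/", slots 1..n are the tags
    size = base ** (h + 1)        # m, with equal ratios ri = 1/(n+1)
    steps = [s for s in path.split("/") if s]
    if len(steps) > h:
        raise ValueError("path is longer than the assumed maximum depth h")
    low = 0
    # Refine from the last tag backwards: each step keeps one slot of the
    # current interval, chosen by the tag's position in tag order.
    for tag in reversed(steps):
        size //= base
        low += (tags.index(tag) + 1) * size
    if not path.startswith("//"):
        size //= base             # absolute path: keep only the "/" slot
    return low, low + size - 1


def contains(outer: tuple, inner: tuple) -> bool:
    """Suffix path containment as interval containment."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]
```

With 99 tags (proteinDatabase, proteinEntry, protein and name first) and h = 5, m is 10^12 and p_label("//protein/name", tags, 5) covers 4.03*10^10 up to just below 4.04*10^10, the same boundaries as in the construction example. A node's P-label is the lower bound of the interval of its absolute simple path.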
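P-labeling Construction (Example) • Setting: 99 tags, ri = 0.01, m = 10^12. Query: //protein/name • Step 1, partition the interval of // (0 to 10^12): “/” starts at 0, //proteinDatabase at 10^10, //proteinEntry at 2*10^10, //protein at 3*10^10, and //name spans 4*10^10 to 5*10^10, … • Step 2, partition the interval of //name: /name starts at 4*10^10, //proteinDatabase/name at 4.01*10^10, //proteinEntry/name at 4.02*10^10, and //protein/name spans 4.03*10^10 to 4.04*10^10, … • Step 3, partition the interval of //protein/name: /protein/name spans 4.03*10^10 to 4.0301*10^10, …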
Query translator: translates an input XPath query into standard SQL. • Query decomposition • Splits the query into a set of suffix path queries and records the ancestor-descendant relationships between them. • SQL generation • Computes the P-label of each suffix path query and generates a corresponding SQL subquery. • SQL composition • Combines the subqueries into a single SQL query based on D-labeling and the ancestor-descendant relationships.
Split algorithm: • D-elimination(query tree Q) • Traverses the query tree depth-first. • Splits p//q into p and //q. • Invokes B-elimination if Q has branches; otherwise evaluates Q using P-labels. • Joins the intermediate results by their D-labels. (Figure: the example query tree over proteinDatabase, proteinEntry, protein, superfamily = “cytochrome c”, reference, refinfo, author = “Evans, M.J.”, year = “2001” and title is cut at its // edges into subqueries Q1, Q2 and Q3.)
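For a branch-free path, the splitting step of D-elimination amounts to cutting the path string at every interior //. The sketch below shows only that step, under the simplifying assumption that the query is a plain string rather than a query tree; the function name `d_eliminate` and the sample query are hypothetical.

```python
def d_eliminate(path: str) -> list:
    """Split a branch-free path at each // into suffix path queries."""
    parts = path.split("//")
    pieces = []
    # parts[0] is whatever precedes the first //; it is empty when the
    # path itself starts with a descendant step.
    if parts[0]:
        pieces.append(parts[0])
    pieces.extend("//" + part for part in parts[1:])
    return pieces


subqueries = d_eliminate("/proteinDatabase/proteinEntry//refinfo//author")
```

Each piece is a suffix path query that can be answered from P-labels alone; consecutive pieces are then connected by an ancestor-descendant join on their D-labels.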
B-elimination(query tree Q1) • Rule: p[q1, q2, …, qi]/r → p, //q1, //q2, …, //qi, //r (Figure: Q1, over proteinDatabase, proteinEntry, protein, reference, refinfo, year = “2001” and title, is split into subqueries Q4, Q5 and Q6, connected by // edges.)
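B-elimination (cont.) (Figure: further branch elimination yields the branch-free subqueries Q4, Q5, Q7, Q8 and Q9 over proteinDatabase/proteinEntry, protein, reference/refinfo, year = “2001” and title, connected by // edges.)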
Push up algorithm: optimizes branch elimination (B-elimination). • Since p/qi and p/r are more specific than //qi and //r, split p[q1, q2, …, qi]/r → p, p/q1, p/q2, …, p/qi, p/r. (Figure: with push up, each subquery keeps its full prefix, e.g. Q4 = proteinDatabase/proteinEntry and Q5 = proteinDatabase/proteinEntry/reference/refinfo, with further subqueries ending in year = “2001”, title and protein.)
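The two branch-splitting rules differ only in the prefix given to each branch. The sketch below contrasts them on strings; it assumes the query has already been parsed into a prefix p, a list of branch paths and a remainder r, and the function name `b_eliminate` is hypothetical.

```python
def b_eliminate(p: str, branches: list, r: str, push_up: bool = False) -> list:
    """Split p[q1,...,qi]/r into suffix path queries.

    Plain B-elimination yields p, //q1, ..., //qi, //r.
    With push up, every branch keeps the prefix: p, p/q1, ..., p/qi, p/r.
    """
    prefix = p if push_up else "/"
    return [p] + [prefix + "/" + q for q in branches + [r]]


p = "/proteinDatabase/proteinEntry/reference/refinfo"
plain = b_eliminate(p, ["year"], "title")
pushed = b_eliminate(p, ["year"], "title", push_up=True)
```

The pushed-up subqueries are longer suffix paths, so each one selects a narrower P-label interval and hands fewer nodes to the later D-joins.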
Unfold algorithm: a further optimization of descendant-axis elimination (D-elimination). • Rule: p//q → p/r1/q, p/r2/q, …, p/ri/q • Example: Q2 = /ProteinDatabase/ProteinEntry/protein//superfamily=“cytochrome c” is unfolded into Q21 = /ProteinDatabase/ProteinEntry/protein/classification/superfamily=“cytochrome c”
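Unfolding needs to know which intermediate steps r1, …, ri can occur between p and q. The slide does not say where they come from, so the sketch below simply assumes that the set of distinct root-to-node paths in the data is available as a list; that list, and the function name `unfold`, are hypothetical.

```python
def unfold(p: str, q: str, known_paths: list) -> list:
    """Replace p//q by the simple paths that start with p and end with q."""
    p_steps = p.strip("/").split("/")
    q_steps = q.strip("/").split("/")
    unfolded = []
    for path in known_paths:
        steps = path.strip("/").split("/")
        # q must lie strictly below p, so the two parts may not overlap.
        if (len(steps) >= len(p_steps) + len(q_steps)
                and steps[:len(p_steps)] == p_steps
                and steps[-len(q_steps):] == q_steps):
            unfolded.append(path)
    return unfolded


known = [
    "/ProteinDatabase/ProteinEntry/protein/name",
    "/ProteinDatabase/ProteinEntry/protein/classification/superfamily",
    "/ProteinDatabase/ProteinEntry/reference/refinfo/year",
]
result = unfold("/ProteinDatabase/ProteinEntry/protein", "superfamily", known)
```

Every unfolded path is a plain child-axis path, so it can be answered with a single P-label lookup and no D-join.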
Experimental Results • Data sets • Query sets • Suffix path queries • Path queries • XPath queries • Query Engine: RDBMS or File System
Query Execution Time • Figure: query time for the Shakespeare, Protein and Auction data sets. • Legend: 1 = suffix path query, 2 = path query, 3 = XPath query; A = Auction, P = Protein, S = Shakespeare.
Scalability • Figure: the performance of D-labeling, Split and Push up for the suffix path query.
Conclusion • P-labeling scheme is proposed to evaluate suffix path queries efficiently. • BLAS combines P-labeling and D-labeling to evaluate XPath queries. • BLAS is more efficient because the queries translated from XPath queries require: • fewer disk accesses • fewer joins • Experiments show the effectiveness of BLAS
[1] J. Clark and S. DeRose. XML Path Language (XPath), November 1999. http://www.w3.org/TR/xpath. • [13] D. DeHaan, D. Toman, M. Consens, and M. T. Ozsu. A comprehensive XQuery to SQL translation using dynamic interval encoding. In Proceedings of SIGMOD, 2001. • [26] J.-K. Min, M.-J. Park, and C.-W. Chung. XPRESS: A queriable compression for XML data. In Proceedings of SIGMOD, 2003.
Thank you! Questions?