
BLAS: An Efficient XPath Processing System

BLAS: An Efficient XPath Processing System. Zhimin Song. Advanced Database System. Professor: Dr. Mengchi Liu. Outline: Introduction, BLAS System, Experimental Results, Conclusions.


Presentation Transcript


  1. BLAS: An Efficient XPath Processing System Zhimin Song Advanced Database System Professor: Dr. Mengchi Liu

  2. Outline • Introduction • BLAS System • Experimental Results • Conclusions

  3. Figure 1: Sample XML protein repository
       <ProteinDatabase>
         <ProteinEntry>
           <protein>
             <name>cytochrome c [validated]</name>
             <classification>
               <superfamily>cytochrome c</superfamily>
             </classification>…
           </protein>
           <reference>
             <refinfo>
               <authors>
                 <author>Evans, M.J.</author>…
               </authors>
               <year>2001</year>
               <title> The human somatic cytochrome c gene </title> …
             </refinfo>…
           </reference>…
         </ProteinEntry> …
       </ProteinDatabase>

  4. Introduction • XML has a complex, tree-like structure of nodes. • Languages for querying XML are based on path navigation (XPath [1]). • Given node → child node (child axis) • Given node → descendant node (descendant axis)

  5. Introduction (cont.) • Several techniques have already been proposed to improve XPath processing. For example, D-labeling efficiently handles descendant axis traversal. • What about complex queries that include child axes and branches? • For these, the paper proposes P-labeling, which optimizes an important class of queries called suffix path queries.

  6. BLAS (Bi-LAbeling based System) • Basic definitions • The labeling scheme (index generator) • Query translator

  7. Basic definitions • BLAS: a system for efficiently processing complex queries, based on D-labeling and P-labeling. • BLAS deals with a subset of XPath queries consisting of: • Child axis navigation ( / ) • Descendant axis navigation ( // ) • Branches ( […] ) • The evaluation of a path expression P, written [P], returns the set of nodes in an XML tree T that are reachable by P starting from the root of T. • Since P can be evaluated to retrieve a set of XML nodes, we use “path expression” and “query” interchangeably. • P ⊆ Q if and only if [P] ⊆ [Q]. • P ∩ Q = ∅ if and only if [P] ∩ [Q] = ∅.

  8. Basic definitions (cont.) • Suffix path expression: a path expression P that optionally begins with a descendant axis step (//), followed by zero or more child axis steps (/). • Example: //protein/name • Another one: /proteinDatabase/proteinEntry/protein/name • SP(n): the unique simple path P from the root to the node n. • So evaluating a suffix path expression Q means finding all the nodes n such that SP(n) ⊆ Q.
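The condition that SP(n) is contained in Q can be illustrated with a small sketch. It assumes simple paths and suffix path queries are given as plain strings, and the helper name `matches_suffix_path` is hypothetical. It compares tag sequences directly, so it shows what a suffix path query means, not how BLAS evaluates one with labels.

```python
def matches_suffix_path(simple_path, query):
    """Return True if the root-to-node path SP(n) is contained in the suffix path query."""
    path_tags = simple_path.strip("/").split("/")
    if query == "//":
        # The bare descendant axis contains every simple path.
        return True
    if query.startswith("//"):
        query_tags = query[2:].split("/")
        # //t1/.../tk contains any simple path that ends with t1/.../tk.
        return path_tags[-len(query_tags):] == query_tags
    # A query that starts with a single "/" must match the whole path from the root.
    return path_tags == query.strip("/").split("/")


sp = "/proteinDatabase/proteinEntry/protein/name"
print(matches_suffix_path(sp, "//protein/name"))  # suffix match
print(matches_suffix_path(sp, "/protein/name"))   # does not start at the root
```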

  9. Architecture of BLAS (figure) • Index generator: a SAX parser turns the XML into events that feed the P-labeling generator, the D-labeling generator and the data loader; the resulting P-labelings, D-labelings and data values go to storage. • Query translator: query decomposition splits an XPath query into suffix path queries plus the ancestor-descendant relationships between their results; the subquery generator (based on P-labeling) produces one subquery per suffix path query; subquery composition (based on D-labeling) combines the subqueries into a query. • Query engine: runs the query against storage and returns the query result.

  10. The labeling scheme (index generator) • D-labeling scheme: each XML node n is assigned a triplet <d1,d2,d3> with n.d1 <= n.d2. For two nodes n and m: • m is a descendant of n if and only if n.d1 < m.d1 and n.d2 > m.d2. • m is a child of n if and only if m is a descendant of n and n.d3 + 1 = m.d3. • Let d1 and d2 of a node n be the positions of its start tag and end tag. • d3 is the level of n in the XML tree, i.e. the length of the path from the root to n. • A D-label is therefore written <start,end,level>.
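The D-labeling rules above can be sketched as follows. This is a minimal illustration, not the system's index generator: it assumes positions are counted one per start or end tag, that the root has level 0, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def d_labels(root):
    """Assign each element a (start, end, level) triple in one depth-first walk."""
    labels = {}
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter                # position of the start tag
        for child in elem:
            visit(child, level + 1)
        counter += 1                   # position of the end tag
        labels[elem] = (start, counter, level)

    visit(root, 0)                     # assumption: the root is at level 0
    return labels


def is_descendant(m, n):
    """m is a descendant of n iff n.start < m.start and n.end > m.end."""
    return n[0] < m[0] and n[1] > m[1]


def is_child(m, n):
    """m is a child of n iff m is a descendant of n and n.level + 1 == m.level."""
    return is_descendant(m, n) and n[2] + 1 == m[2]


doc = ET.fromstring(
    "<proteinDatabase><proteinEntry><protein><name>cytochrome c</name>"
    "</protein></proteinEntry></proteinDatabase>"
)
labels = d_labels(doc)
name, protein = doc.find(".//name"), doc.find(".//protein")
print(labels[doc], labels[name])
print(is_descendant(labels[name], labels[doc]), is_child(labels[name], labels[protein]))
```

Both checks are constant-time comparisons on the labels, which is why the descendant axis can be evaluated without walking the tree.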

  11. Example: using D-labeling • Query: //proteinDatabase//refinfo • First retrieve all the nodes reachable by proteinDatabase and by refinfo. • Let pDB and refinfo be two relations that store these nodes, then D-join them: • SELECT pDB.start, pDB.end, refinfo.start, refinfo.end FROM pDB, refinfo WHERE pDB.start < refinfo.start AND pDB.end > refinfo.end • (Figure: query tree over proteinDatabase, proteinEntry, protein, reference, superfamily, refinfo, author, title and year, with the values “cytochrome c”, “Evans, M.J.” and “2001” and two descendant-axis (//) edges.)
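The D-join for //proteinDatabase//refinfo can be tried out with an in-memory SQLite database. This is a sketch under stated assumptions: the D-label values are made up for illustration, and the columns are named `start_pos` and `end_pos` (rather than start and end) to stay clear of SQL keywords.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pDB     (start_pos INTEGER, end_pos INTEGER);
    CREATE TABLE refinfo (start_pos INTEGER, end_pos INTEGER);
""")

# Hypothetical D-labels: one proteinDatabase node and two refinfo nodes,
# only the first of which lies inside the proteinDatabase interval.
conn.executemany("INSERT INTO pDB VALUES (?, ?)", [(1, 20)])
conn.executemany("INSERT INTO refinfo VALUES (?, ?)", [(8, 15), (25, 30)])

# D-join: keep the pairs where the refinfo interval is nested in the pDB interval.
rows = conn.execute("""
    SELECT pDB.start_pos, pDB.end_pos, refinfo.start_pos, refinfo.end_pos
    FROM pDB, refinfo
    WHERE pDB.start_pos < refinfo.start_pos
      AND pDB.end_pos   > refinfo.end_pos
""").fetchall()
print(rows)
```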

  12. P-labeling Scheme • It is also important to implement child axis navigation efficiently. • e.g. /proteinDatabase/proteinEntry/protein/name • Target: improve “/” evaluation • Focus on suffix path queries: e.g. //protein/name

  13. Assign each node a number <p1>, and each suffix path an interval <p1,p2>, such that: • For any two suffix paths Q1 and Q2, Q1 is contained in Q2 if Q2.p1 <= Q1.p1 and Q1.p2 <= Q2.p2, i.e. the interval of Q1 lies inside the interval of Q2. • A node n is contained in the suffix path Q if Q.p1 <= SP(n).p1 <= Q.p2. • Let Q be a suffix path query. Then [Q] = {n | Q.p1 <= n.plabel <= Q.p2}, where n.plabel = SP(n).p1.

  14. P-labeling Construction (algorithm) • Suppose that there are n distinct tags (t1, t2, …, tn). • Assign “/” a ratio r0 and each tag ti a ratio ri such that r0 + r1 + r2 + … + rn = 1. • Let ri = 1/(n+1). • Define the domain of the numbers in a P-label to be the integers in [0, m-1], where m is chosen such that m >= (n+1)^h and h is the length of the longest path in the XML tree. • The algorithm is as follows: • Path // is assigned the interval (P-label) <0, m-1>. • Partition the interval <0, m-1> in tag order, in proportion to the ratio r0 of the child axis “/” and the ratio ri of each path //ti. • That is, allocate the interval <0, m*r0 - 1> to “/” and <p_i, p_(i+1) - 1> to each //ti such that (p_(i+1) - p_i)/m = ri and p_1/m = r0.

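  15. P-labeling Construction (Example) • m = 10^12, 99 tags, ri = 0.01. Query: //protein/name • Level 1, partition of // over 0 to 10^12: / starts at 0, //proteinDatabase at 10^10, //proteinEntry at 2*10^10, //protein at 3*10^10, //name at 4*10^10 (ending at 5*10^10), … • Level 2, partition of //name over 4*10^10 to 5*10^10: /name starts at 4*10^10, //proteinDatabase/name at 4.01*10^10, //proteinEntry/name at 4.02*10^10, //protein/name at 4.03*10^10 (ending at 4.04*10^10), … • Level 3, partition of //protein/name over 4.03*10^10 to 4.04*10^10: /protein/name runs from 4.03*10^10 to 4.0301*10^10, …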
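The construction and the example above can be reproduced with a short sketch. It assumes equal ratios ri = 1/(n+1), tags ordered as in the example (proteinDatabase, proteinEntry, protein, name first), and 95 placeholder tag names to reach 99 tags; `plabel` and `node_plabel` are hypothetical names. The interval of a longer path is found by repeatedly partitioning the interval of its suffix, working from the last tag backwards.

```python
def plabel(path, tags, m):
    """Return the interval (p1, p2) of a suffix path such as '//protein/name'."""
    n = len(tags)
    absolute = not path.startswith("//")      # '/a/b' must start at the root
    steps = [s for s in path.split("/") if s]
    low, width = 0, m                          # '//' owns the whole domain <0, m-1>
    for tag in reversed(steps):                # refine from the last tag backwards
        width //= n + 1                        # every slot has ratio 1/(n+1)
        low += (tags.index(tag) + 1) * width   # slot 0 is reserved for '/'
    if absolute:
        width //= n + 1                        # take the '/' slot of the interval
    return low, low + width - 1


def node_plabel(simple_path, tags, m):
    """A node's P-label is p1 of its simple path SP(n)."""
    return plabel(simple_path, tags, m)[0]


# Assumption: tag order follows the example; tag5..tag99 are placeholders.
TAGS = ["proteinDatabase", "proteinEntry", "protein", "name"] + [f"tag{i}" for i in range(5, 100)]
M = 10**12

query = plabel("//protein/name", TAGS, M)
node = node_plabel("/proteinDatabase/proteinEntry/protein/name", TAGS, M)
print(query)                          # (40300000000, 40399999999)
print(query[0] <= node <= query[1])   # the node falls inside the query interval
```

With these labels a suffix path query becomes a single range test on the node's P-label, with no join per child step.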

  16. Query translator: translates an input XPath query into standard SQL. • Query decomposition • Splits the query into a set of suffix path queries and records the ancestor-descendant relationships between them. • SQL generation • Computes each suffix path query's P-label and generates a corresponding subquery in SQL. • SQL composition • Combines the subqueries into a single SQL query based on D-labeling and the ancestor-descendant relationships.

  17. Split algorithm: D-elimination(query tree Q) • Depth-first traversal. • Split p//q into p and //q: p//q → p and //q. • Invoke B-elimination if there are branches in Q; otherwise evaluate Q using P-labels. • Join the intermediate results by their D-labels. • (Figure: the example query tree over proteinDatabase, proteinEntry, protein, reference, refinfo, superfamily, year, title and author, split at its two descendant axes into subqueries Q1, Q2 and Q3.)
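For a branch-free path, the rule p//q → p and //q can be sketched on plain strings. This is an illustration only: the slide's algorithm works on query trees, and `d_eliminate` is a hypothetical helper. Consecutive pieces of the result stand in an ancestor-descendant relationship, which is what the later D-join uses.

```python
def d_eliminate(path):
    """Split a branch-free path at every descendant axis: p//q -> p, //q."""
    pieces = path.split("//")
    result = []
    if pieces[0]:
        result.append(pieces[0])       # leading part before the first '//'
    result.extend("//" + p for p in pieces[1:] if p)
    return result


subqueries = d_eliminate("/proteinDatabase/proteinEntry/protein//superfamily")
print(subqueries)
# Each adjacent pair (ancestor, descendant) must later be D-joined.
print(list(zip(subqueries, subqueries[1:])))
```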

  18. B-elimination(query tree Q1) • p[q1, q2, …, qi]/r → p, //q1, //q2, …, //qi, //r • (Figure: Q1 split at its branches into subqueries Q4, Q5 and Q6 over proteinDatabase, proteinEntry, protein, reference, refinfo, year, title and the value “2001”.)

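  19. B-elimination (cont.) • (Figure: the decomposition continued, giving subqueries Q4, Q5, Q7, Q8 and Q9 over proteinDatabase, proteinEntry, protein, reference, refinfo, year, title and the value “2001”, connected by descendant axes.)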

  20. Push up algorithm: optimizes branch elimination (B-elimination) • Since p/qi and p/r are more specific than //qi and //r, split p[q1, q2, …, qi]/r → p, p/q1, p/q2, …, p/qi, p/r. • (Figure: the subqueries Q4 and Q5 after push up, each written as a path starting from proteinDatabase, over the nodes proteinEntry, protein, reference, refinfo, year, title and the value “2001”.)
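The difference between plain B-elimination and push up can be seen side by side in a small sketch. It assumes the branching query p[q1, …, qi]/r is already parsed into its prefix p, its branch list and its continuation r; the function names and the example query are illustrative, not taken from the slides.

```python
def b_eliminate(p, branches, r):
    """p[q1,...,qi]/r -> p, //q1, ..., //qi, //r"""
    return [p] + ["//" + q for q in branches] + ["//" + r]


def push_up(p, branches, r):
    """p[q1,...,qi]/r -> p, p/q1, ..., p/qi, p/r (each subquery keeps its prefix)."""
    return [p] + [f"{p}/{q}" for q in branches] + [f"{p}/{r}"]


# Illustrative query: /proteinDatabase/proteinEntry[protein]/reference
p, branches, r = "/proteinDatabase/proteinEntry", ["protein"], "reference"
print(b_eliminate(p, branches, r))
print(push_up(p, branches, r))
```

The pushed-up subqueries are longer suffix paths, so their P-label intervals are narrower and each one selects fewer candidate nodes before the join.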

  21. Unfold algorithm: a further optimization of descendant-axis elimination (D-elimination) • p//q → p/r1/q, p/r2/q, …, p/ri/q • Example: • Q2 = /ProteinDatabase/ProteinEntry/protein//superfamily=“cytochrome c” is unfolded into • Q21 = /ProteinDatabase/ProteinEntry/protein/classification/superfamily=“cytochrome c”
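Unfolding needs to know which tags can appear between p and q. The sketch below assumes that information is available as a parent-to-children map, here written by hand from the sample repository in Figure 1, and that the schema is not recursive; `unfold` and `SCHEMA` are hypothetical names. Value predicates such as =“cytochrome c” are left out.

```python
def unfold(p, q, children):
    """Rewrite p//q as every child-only path p/.../q that the schema allows."""
    results = []

    def walk(tag, path):
        for child in children.get(tag, []):
            if child == q:
                results.append(f"{path}/{q}")
            walk(child, f"{path}/{child}")   # assumption: no recursive tags

    walk(p.rsplit("/", 1)[-1], p)            # start below the last tag of p
    return results


# Assumption: structure read off the sample XML in Figure 1.
SCHEMA = {
    "ProteinDatabase": ["ProteinEntry"],
    "ProteinEntry": ["protein", "reference"],
    "protein": ["name", "classification"],
    "classification": ["superfamily"],
    "reference": ["refinfo"],
    "refinfo": ["authors", "year", "title"],
    "authors": ["author"],
}

print(unfold("/ProteinDatabase/ProteinEntry/protein", "superfamily", SCHEMA))
```

Every unfolded path is a suffix path query, so it can be answered with P-labels alone and the D-join for the descendant axis disappears.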

  22. Experimental Results • Data sets • Query sets • Suffix path queries • Path queries • XPath queries • Query Engine: RDBMS or File System

  23. Query Execution Time • Figure: query time for the Shakespeare, Protein and Auction data sets. • Data sets: A = Auction, P = Protein, S = Shakespeare. • Query types: 1 = suffix path query, 2 = path query, 3 = XPath query.

  24. Scalability • Figure: the performance of D-labeling, Split and Push up for the suffix path query.

  25. Conclusion • P-labeling scheme is proposed to evaluate suffix path queries efficiently. • BLAS combines P-labeling and D-labeling to evaluate XPath queries. • BLAS is more efficient because the queries translated from XPath queries require: • fewer disk accesses • fewer joins • Experiments show the effectiveness of BLAS

  26. [1] J. Clark and S. DeRose. XML Path Language (XPath), November 1999. http://www.w3.org/TR/xpath. • [13] D. DeHaan, D. Toman, M. Consens, and M. T. Ozsu. A comprehensive XQuery to SQL translation using dynamic interval encoding. In Proceedings of SIGMOD, 2001. • [26] J.-K. Min, M.-J. Park, and C.-W. Chung. XPRESS: A queriable compression for XML data. In Proceedings of SIGMOD, 2003.

  27. Thank you! Questions?
