This paper addresses the inefficiencies in processing complex XPath queries, primarily caused by excessive disk accesses and joins. It introduces the Bi-LAbeling-based System (BLAS), which leverages D-labeling and P-labeling to optimize XPath processing using relational technology. By using a single join for descendant axis traversal instead of multiple transitive closures, BLAS significantly reduces the computational overhead. Experimental results validate its effectiveness, showcasing improved performance in extraction of nodes from large XML datasets, making query evaluations more efficient.
BLAS: An Efficient XPath Processing System • Chen Y., Davidson S., Zheng Y. • Νίκος Λούτας
Outline • Problem being addressed in the paper • Related work • BLAS • Experimental Results • Evaluation
Problem • Number of disk accesses and joins is the primary bottleneck for evaluating complex queries efficiently!
Motivation • Can we improve XPath processing that uses relational technology? • D-labeling • Processes descendant axis traversal using a single join rather than a transitive closure of joins. • Observation: D-labeling processes / and // in the same way, using joins. • XPRESS – queryable compressed XML files • Reverse arithmetic encoding • A label path as a distinct interval in [0.0, 1.0) • Handling of path expressions: containment relationships
Goals • Process / (simple path expressions) more efficiently • Reduce the number of disk accesses and joins • Optimize the join operations
Related work • XML storage and query processing • Store XML data naively as a file • The whole file needs to be traversed whenever a query is processed → not efficient for large XML data sets • Store XML using a commercial RDBMS • Indexing, query processing capabilities
Related work (cont’d) • XML storage and query processing • An XML document as a graph → generate a tuple for every edge • Simple, general and automatic generation of the XML query – SQL mapping • An XML query may involve many self-joins • Self-joins can be eliminated by inlining the distinct child information into the parent tuple → complex XML query – SQL mapping • Problem: In all the above approaches, we typically need to rely on auxiliary code in a general-purpose programming language together with SQL to express an XML query
Related work (cont’d) • Indexing • Structural indexes → create a structural summary, extracted from the XML document as a directed graph → queries are evaluated by pruning the search space • Path / tree queries • Indexing for branching path queries → restrict the class of queries indexed to achieve performance benefits • Materialized views
Related work (cont’d) • Labeling • D-labeling • Build minimum label size D-labels • Build a B+ tree over D-labels to support tree queries • Effective for translating XQuery to SQL • XPRESS: an XML data compression technique which uses reverse arithmetic encoding to encode each label path as a distinct interval within [0.0, 1.0). It also supports query evaluation over the compressed document, using the containment relationship among the intervals.
Bi-LAbeling based System (BLAS) • Based on D-labeling and P-labeling • Processes XPath queries which can be represented as trees • Index generator: stores the D-labeling, P-labeling and data values of an XML document • Query engine: RDBMS or twig join
BLAS (cont’d) • Query translator • Decomposes an XPath query into a set of suffix path queries • Encodes each suffix path query using P-labeling • Generates a corresponding SQL query for each suffix path query • Composes the SQL subqueries into a complete SQL query plan using D-labeling
Architecture of BLAS • [Figure: architecture diagram] • Index generator: SAX parser → XML events → P-labeling generator, D-labeling generator, data loader → storage (P-labelings, D-labelings, data values) • Query translator: XPath query → query decomposition → suffix path queries → subquery generator (based on P-labeling) → subqueries → subquery composition (based on D-labeling, using the ancestor-descendant relationships between the results of the suffix path queries) → query • Query engine: query → query result
BLAS: D-labeling • A D-label of an XML node is a triplet <d1, d2, d3>, such that for any two nodes n and m, n ≠ m: • n.d1 ≤ n.d2 (validation) • m is a descendant of n if and only if n.d1 < m.d1 and n.d2 > m.d2 (descendant) • m is a child of n if and only if m is a descendant of n and n.d3 + 1 = m.d3 (child) • n and m have no ancestor-descendant relationship if and only if n.d2 < m.d1 or n.d1 > m.d2 (nonoverlap)
BLAS: D-labeling (cont’d) • Where, for a node n: • d1: the position of the start tag of n in the XML document • d2: the position of the end tag of n in the XML document • d3: the level of n in the XML tree
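The D-label definition above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the function names are hypothetical, and positions are assumed to be counted in tags (one tick per start or end tag) rather than in bytes or words.

```python
import xml.etree.ElementTree as ET


def d_labels(root):
    """Assign (d1, d2, d3) = (start position, end position, level) by DFS.

    Assumption: a position is a running count of start and end tags.
    """
    labels = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1              # start tag of node
        start = counter
        for child in node:
            visit(child, level + 1)
        counter += 1              # end tag of node
        labels[node] = (start, counter, level)

    visit(root, 1)
    return labels


def is_descendant(n, m):
    """True if the node labelled m is a descendant of the node labelled n."""
    return n[0] < m[0] and n[1] > m[1]


def is_child(n, m):
    """True if m is a descendant of n exactly one level below it."""
    return is_descendant(n, m) and n[2] + 1 == m[2]


def no_overlap(n, m):
    """True if n and m have no ancestor-descendant relationship."""
    return n[1] < m[0] or n[0] > m[1]


doc = ET.fromstring(
    "<books><book><title/><section><title/></section></book></books>"
)
labels = d_labels(doc)
book = doc.find("book")
section = doc.find("book/section")
inner_title = doc.find("book/section/title")
```

Here `labels[inner_title]` is nested inside `labels[book]`, so the descendant test holds, while the child test fails because the levels differ by two.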
BLAS: D-labeling (cont’d) • Descendant axis query //t1//t2 • Retrieve all the nodes reachable by t1 and by t2 → two lists, l1 and l2 • Test for ancestor-descendant relationships between nodes in l1 and in l2 (D-join) • Example: //proteinDatabase//refinfo, where the relations pDB and refinfo store the nodes tagged proteinDatabase and refinfo • SELECT pDB.start, pDB.end, refinfo.start, refinfo.end • FROM pDB, refinfo • WHERE pDB.start < refinfo.start AND pDB.end > refinfo.end
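The D-join can be mimicked outside a database with a plain nested loop. The sketch below assumes each relation is a list of (start, end) pairs; the sample labels are invented for illustration, and a real engine would use an index or a structural join instead of comparing every pair.

```python
def d_join(ancestors, descendants):
    """Return (a, d) pairs where interval d lies strictly inside interval a."""
    return [
        (a, d)
        for a in ancestors
        for d in descendants
        if a[0] < d[0] and a[1] > d[1]
    ]


# Hypothetical (start, end) labels standing in for the pDB and refinfo relations.
pdb = [(1, 100)]
refinfo = [(10, 20), (40, 55), (120, 130)]

pairs = d_join(pdb, refinfo)
```

The third refinfo label starts after the pDB node ends, so it is filtered out by the same two comparisons that appear in the WHERE clause.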
D-labeling scheme • The labeling (start, end, level) can be used to detect ancestor-descendant relationships between nodes in a tree. • [Figure: example books tree (books, book, title, section, figure, description) annotated with D-labels, e.g. books (1, 20000, 1), book (6, 1200, 2), level-3 nodes (10, 80, 3) and (81, 250, 3), and a level-4 node (100, 200, 4)]
BLAS: P-labeling • Efficiently processes consecutive child axis steps (suffix path queries) • A P-label for a suffix path P is an interval I_P = <p1, p2>, such that for any two suffix path expressions P, Q: • P.p1 ≤ P.p2 (validation) • P ⊆ Q if and only if interval I_P is contained in I_Q, i.e. Q.p1 ≤ P.p1 and P.p2 ≤ Q.p2 (containment) • P ∩ Q = ∅ if and only if I_P and I_Q do not overlap, i.e. P.p1 > Q.p2 or P.p2 < Q.p1 (nonintersection)
BLAS: P-labeling (cont’d) • For an XML node n whose source path SP(n) has the P-label <p1, p2>, the P-label for this XML node, denoted n.plabel, is the integer p1 • Find all nodes n such that Q.p1 ≤ SP(n).p1 ≤ Q.p2: evaluate suffix path query Q by obtaining the set of XML nodes whose P-labels are contained in the P-label of Q • [[Q]] = {n | Q.p1 ≤ n.plabel ≤ Q.p2}
BLAS: Intuition for P-labels • Assign each node a number, and each suffix path an interval, such that: • For any two suffix paths Q1 and Q2, Q1 is contained in Q2 iff Q1’s interval is contained in Q2’s • A node is contained in the suffix path iff its number is contained in the path interval. • Replaces a sequence of joins by a selection.
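The "selection instead of joins" idea amounts to a range lookup over node numbers. A minimal sketch, assuming node P-labels are kept in a sorted list (a stand-in for an index); the numbers and the function name are illustrative only.

```python
from bisect import bisect_left, bisect_right


def select_by_plabel(sorted_plabels, q):
    """Return every P-label n with q.p1 <= n <= q.p2 via binary search."""
    p1, p2 = q
    lo = bisect_left(sorted_plabels, p1)
    hi = bisect_right(sorted_plabels, p2)
    return sorted_plabels[lo:hi]


# Illustrative values only: P-labels of five nodes and one query interval.
node_plabels = sorted([21000, 42100, 42100, 44210, 31000])
q_interval = (42000, 42999)

hits = select_by_plabel(node_plabels, q_interval)
```

Two nodes fall inside the query interval; no pairwise comparison between node lists is needed, which is the point of the bullet above.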
BLAS: P-labeling Construction • For paths • For XML trees • Assign / a ratio r0 and each of the n tags ti a ratio ri = 1 / (n + 1) • Define the domain [0, m - 1], with m ≥ (n + 1)^h (h: length of the longest path) • Construct P-labels for suffix paths • Assign // the interval <0, m - 1> • Partition the interval in tag order, proportionally to each tag ti’s ratio ri • Allocate <0, p1 - 1> to suffix paths starting with / and <pi, p(i+1) - 1> to suffix paths starting with //ti • Partition each subinterval of a path //ti by tags again, according to their ratios
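The construction can be sketched as repeated partitioning, processing the path from its last tag to its first. The sketch assumes uniform ratios 1 / (n + 1), a hypothetical alphabet of nine tags in a fixed order (the last three are placeholders) and m = 10^5; the function name and these values are assumptions chosen so that every slot width stays an integer.

```python
def p_label(path, tags, m):
    """Return the inclusive interval (p1, p2) of a suffix path.

    path starts with "//" (suffix path) or "/" (path anchored at the root).
    Slot 0 of every partition is reserved for "/", slot i for the i-th tag.
    """
    anchored = not path.startswith("//")
    steps = path.lstrip("/").split("/")
    slots = len(tags) + 1
    lo, width = 0, m                    # "//" owns the whole domain
    for tag in reversed(steps):         # last tag first
        width //= slots
        lo += (tags.index(tag) + 1) * width
    if anchored:                        # take the "/" slot of the final interval
        width //= slots
    return lo, lo + width - 1


# Hypothetical tag order; t7..t9 are placeholders to reach nine tags.
TAGS = ["books", "book", "title", "section", "figure", "description",
        "t7", "t8", "t9"]
M = 10 ** 5

book_any = p_label("//book", TAGS, M)
section_root = p_label("/books/book/section", TAGS, M)
section_any = p_label("//section", TAGS, M)
```

With these assumptions `//book` covers 20000 to 29999 and `/books/book/section` starts at 42100, and the anchored path's interval lies inside that of `//section`, as the containment property requires.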
BLAS: Constructing P-label for paths • [Figure: nested partition of the P-label domain [0, 10^5): the top level is divided among /, //books, //book, //title, //section, ...; the //book interval is subdivided among /book, //books/book, //book/book, ...; the //books/book interval is subdivided again, starting with /books/book]
BLAS: P-labeling Construction (cont’d) • m = 10^12 and 99 tags • Each tag is assigned a ratio r = 0.01 • Construct a P-label for the suffix path • P = /ProteinDatabase/ProteinEntry/protein/name
BLAS: Constructing P-label for XML nodes (cont’d) • P-label of an XML node: m, where the P-label for the path from the root is [m, n] • E.g. /books/book/section: [42100, 42110], node P-label 42100 • Evaluating a suffix path query Q = finding all nodes whose P-label is contained in the P-label of Q • [Figure: example books tree (books, book, title, section, figure, description) with node P-labels]
BLAS: Query Language • XPath queries containing /, //, *, and predicates (branches) → tree queries • The evaluation of a path expression P returns the set of nodes [[P]] in an XML tree T which are reachable by P starting from the root of T • The source path SP(n) of a node n in an XML tree T is the unique simple path P from the root to n. • A path expression P is contained in a path expression Q, P ⊆ Q, if and only if for any XML tree T, [[P]] ⊆ [[Q]] • Path expressions P and Q are non-overlapping, P ∩ Q = ∅, if and only if for any XML tree T, [[P]] ∩ [[Q]] = ∅
BLAS: Query Translator • Split • Steps: • Descendant axis elimination • Branch elimination • DFS traversal • p//q → p and //q • D-elimination – D-join
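Descendant axis elimination (p//q → p and //q) can be sketched as a string split at every interior //. The sketch assumes a branch-free path expression and a hypothetical function name; predicates would need a real XPath parser.

```python
def eliminate_descendant_axes(path):
    """Split a branch-free path at each interior '//' into suffix path queries.

    Consecutive pieces of the result are meant to be connected by a D-join.
    """
    parts = []
    start = 0
    cut = path.find("//", 1)          # skip a leading "//"
    while cut != -1:
        parts.append(path[start:cut])
        start = cut
        cut = path.find("//", cut + 2)
    parts.append(path[start:])
    return parts


pieces = eliminate_descendant_axes("/a/b//c/d//e")
leading = eliminate_descendant_axes("//a//b")
```

Each piece contains only child steps after its first axis, so it can be answered by a P-label selection.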
BLAS: Query Translator: (I) Decomposition • Q: //book[//title]/section/figure • [Figure, shown over several animation steps: the query tree with nodes book, title, section and figure, decomposed step by step into subqueries]
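BLAS: Query Translator: (II) Selection on P-labels • Q: //book[//title]/section/figure • [Figure: the decomposed query tree; each subquery is evaluated by a selection on P-labels]
BLAS: Query Translator: (III) Join on D-labels • Q: //book[//title]/section/figure • [Figure: the decomposed query tree; the subquery results are combined by joins on D-labels]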
BLAS: Query Translator - Push-up • Used when schema information is absent • Descendant axis elimination • Push-up branch elimination • p[q1…qn]/r → p, p/q1, …, p/qn, p/r
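The push-up rewrite rule can be written out directly. This sketch takes the prefix, the branches and the remainder as already-separated strings (parsing them out of an XPath expression is not shown), and the function name is hypothetical.

```python
def push_up(p, branches, r):
    """Rewrite p[q1]...[qn]/r as the list p, p/q1, ..., p/qn, p/r."""
    return [p] + [f"{p}/{q}" for q in branches] + [f"{p}/{r}"]


subqueries = push_up(
    "/books/book", ["title", "section/title"], "section/figure"
)
```

Because the prefix p is pushed into every subquery, each result is a longer and therefore more selective path expression.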
BLAS: Query Translator - Unfold • Used when schema information is present • Both non-recursive and recursive schemas • Replaces D-joins with a process that first performs selections on P-labels and then unions the results → very efficient • Selections using an index are cheap • The union is very simple since there are no duplicates • Subqueries are all simple path queries, which can be implemented as a select operation with equality predicates • Reduces the number of disk accesses
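For a non-recursive schema, unfolding can be sketched as enumerating every simple path from the root to a target tag. The schema below is a made-up example in the form tag → allowed child tags, and the function name is hypothetical; a recursive schema would need a depth bound that this sketch does not implement.

```python
def unfold(schema, root, target):
    """List every simple root-to-target path allowed by a non-recursive schema."""
    paths = []

    def walk(tag, prefix):
        path = f"{prefix}/{tag}"
        if tag == target:
            paths.append(path)
        for child in schema.get(tag, []):
            walk(child, path)

    walk(root, "")
    return paths


# Hypothetical non-recursive schema: tag -> allowed child tags.
schema = {
    "books": ["book"],
    "book": ["title", "section"],
    "section": ["title", "figure"],
    "figure": ["description"],
}

title_paths = unfold(schema, "books", "title")
```

A query such as //title then becomes the union of equality selections, one per enumerated path, with no D-join involved.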
BLAS: Comparison with D-labeling • [Figure: query plans over book, section, title and figure for BLAS and for D-labeling, side by side] • BLAS: fewer joins, fewer disk accesses
Experiment Setup • Data sets • Query sets • Suffix path queries • Path queries • XPath queries • Benchmark queries • Query Engine: TwigStack Join
Query Execution Time • [Chart] • Query names: A = Auction, P = Protein, S = Shakespeare; 1 = suffix path query, 2 = path query, 3 = XPath query
Number of data elements visited • [Chart] • Query names: A = Auction, P = Protein, S = Shakespeare; 1 = suffix path query, 2 = path query, 3 = XPath query
Scalability • [Chart: BLAS]
Contributions • P-labeling scheme is proposed to evaluate suffix path queries efficiently. • BLAS combines P-labeling and D-labeling to evaluate XPath queries. • BLAS is more efficient than state-of-the-art work because the queries translated from XPath queries require: • fewer disk accesses • fewer joins • Experiments show the effectiveness of BLAS
Evaluation • Successful effort • Trade off between additional cost and execution time • BLAS vs RDBMS ?