
XQuery: Technology Overview

Jerome Simeon, Scalable XML Infrastructure, IBM T.J. Watson Research Center


Presentation Transcript


  1. XQuery: Technology Overview. Jerome Simeon, Scalable XML Infrastructure, IBM T.J. Watson Research Center

  2. The “Lazy” Camp
     • Reuse as much as possible, invent as little as possible
     • From Programming Languages:
       • Implementation based on formal semantics
       • Static analysis methods
       • Rewriting systems
     • From Databases:
       • Algebraic compilation
       • Logical/physical independence
       • Specific optimizations (joins, etc.)
       • Indexing, data scalability

  3. Algebraic Compilation and Optimization

  4. Motivation
     • Data integration: querying different sources from different origins
     • Using specialized algorithms to optimize XQuery evaluation (Streaming, Staircase Join, Nested Loop, …)
     • Sources differ in their indices, data models, and XPath coverage
     • Questions: Which strategies do we need? How do we integrate them?
     [Figure: an XQuery engine in front of several sources: a store, an XPath query web service, trees, relations, a stream, …]

  5. You need more than one technique!
     • A query can mix different representations:
       • “persons” as a file or web message
       • “auctions” in a local XML database
     • The query mixes XPath with other query operations
     • Need for a combination of techniques:
       • Streaming for access to persons
       • StaircaseJoin or TwigJoin for access to auctions
       • Relational-style query unnesting for the nested query and predicate (join + group-by)
     Example query:
       for $p in doc("person.xml")//person
       return <person name="{$p/name}">{
         doc("auctions.xml")//open_auctions[bidder//@person = $p/@id]/happiness
       }</person>

  6. Our Approach
     • Integration of those techniques is desirable and feasible
     • Two key principles:
       • Declarativity: let the user say what she wants, not how to get it. Find the best evaluation strategy independently of the way the query is written.
       • Logical/physical independence: the compiler separates logical optimization (which does not depend on the document representation) from physical optimization (which does depend on it).

  7. Current status
     • 2002-2004: Compiler architecture to support both PL and DB techniques
     • 2004: Algebra
       • Definition of a complete logical algebra for XQuery (Chris Re, [ICDE’2006])
       • Complete compilation of XQuery 1.0 into that algebra
     • 2003-2005: Logical optimizations
       • 2003: Functional optimizations: inlining, loop fusion, type-based optimizations
       • 2004: Join optimization and query unnesting
       • 2005: XPath logical operation optimization
     • 2005-2006: Physical optimizations
       • 2004-2005: Streaming techniques
       • 2006?: Cost-based optimization

  8. Rest of the talk
     • 2002-2004: Compiler architecture to support both PL and DB techniques
     • 2004: Algebra
       • Definition of a complete logical algebra for XQuery (Chris Re, [ICDE’2006])
       • Complete compilation of XQuery 1.0 into that algebra
     • 2003-2005: Logical optimizations
       • 2003: Functional optimizations: inlining, loop fusion, type-based optimizations
       • 2004: Join optimization and query unnesting
       • 2005: XPath logical operation optimization
     • 2005-2006: Physical optimizations
       • 2004-2005: Streaming techniques
       • 2006?: Cost-based optimization

  9. Tuple-Based Algebra • Example and Challenges • Algebra by Example • Logical Rewrites • Physical Algorithms • Experiments

  10. Tuple-Based Algebra • Example and Challenges • Algebra by Example • Logical Rewrites • Physical Algorithms • Experiments

  11. Tuple-Algebra: Motivation Summary
      • Technical challenges and highly nested queries
      • Reuse relational experience as much as possible
      • Performance: XMark from 3m30s to 1.7s (these aren’t our best numbers)

  12. Key XQuery Challenges
      • Turing complete: recursive function calls
      • Order: sequence order and document order
      • Complex semantics: atomization, existential semantics
      • Example: $x = $y means
          exists $x’ in fn:data($x) satisfies exists $y’ in fn:data($y) satisfies op:eq($x’,$y’)
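
The existential reading of `$x = $y` on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: sequences are Python lists of already-atomic values, `fn_data` is a hypothetical stand-in for `fn:data`, and Python's `==` stands in for `op:eq` (no XQuery type promotion rules are modeled).

```python
def fn_data(seq):
    """Hypothetical stand-in for fn:data: items are assumed to be atomic already."""
    return list(seq)


def general_eq(xs, ys):
    """$x = $y holds if SOME atomized item of $x equals SOME atomized item of $y."""
    return any(x == y for x in fn_data(xs) for y in fn_data(ys))


print(general_eq([5, 6, 3], [5, 8]))  # True: 5 occurs on both sides
print(general_eq([1, 2], []))         # False: nothing to compare against
```

One consequence worth noticing is that this comparison is not transitive: `general_eq([1, 2], [2, 3])` and `general_eq([2, 3], [3, 4])` both hold, while `general_eq([1, 2], [3, 4])` does not. That is part of why a join on `=` cannot simply reuse a relational equi-join.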

  13. Related Work
      • Much work on tree pattern algorithms
      • Algebras and compilation rules described only for fragments of XQuery
      • Tree algebras: TAX [Chet et al 2003] supports TreePattern operators, focuses on simple FLWOR / paths
      • Nested relational algebras (with order): NEXT [Deutsch et al 2004]
      • [May et al 2003, Branter et al 2005]: handle query decorrelation and all of XPath 1.0
      • System RX (DB2) [SIGMOD 2005]: hybrid tree/tuples, closer to our approach

  14. Tuple-Based Algebra • Example and Challenges • Algebra by Example • Logical Rewrites • Physical Algorithms • Experiments

  15. The Compilation Pipeline
      [Figure: XQuery → parse → XQuery AST → Normalize → XQuery Core AST (Core Rewriting) → Compile → Logical Algebra (Logical Optimization) → Physical Decision (parameters) → Code → eval]
      FOCUS: Compilation and Logical Optimization

  16. Algebra Overview: three parts of the algebra
      • XML Operators: XPath, Constructors, Conditional, Type, Function calls, I/O
      • Tuple Operators (our focus)
      • XML/Tuple Operators: MapToItem, MapFromItem

  17. Introduction to the Algebra
      Running example: for $x in (1,2,3) for $y in (3,4,5) where $x = $y return <a>{$x}</a>

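  18. MapFromItem
      Query: for $x in (1,2,3) for $y in (3,4,5) where $x = $y return <a>{$x}</a>
      [Figure: plan with MapConcat over two MapFromItem operators, one binding $x over (1,2,3) and one binding $y over (3,4,5). Highlighted: MapFromItem over (1,2,3) yields tuples with field x = 1, 2, 3.]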

  19. MapConcat
      Query: for $x in (1,2,3) for $y in (3,4,5) where $x = $y return <a>{$x}</a>
      [Figure: plan with Select over MapConcat over MapFromItem $x (1,2,3) and MapFromItem $y (3,4,5). Highlighted: MapConcat yields tuples (x, y) = (1,3), (1,4), (1,5), (2,3), …]

  20. MapConcat (Detour)
      • Not necessarily a Product!
      • Tuple fields contain ONLY XML items
      Query: for $x in (1,2,3) for $y in ($x,4,5) where $x = $y return <a>{$x}</a>
      [Figure: same plan shape, but the $y branch is MapFromItem over (IN#x,4,5), so it depends on the current x. MapConcat yields tuples (x, y) = (1,1), (1,4), (1,5), (2,2), …]

  21. Select
      Query: for $x in (1,2,3) for $y in (3,4,5) where $x = $y return <a>{$x}</a>
      [Figure: plan with Select #x = #y over MapConcat over MapFromItem $x (1,2,3) and MapFromItem $y (3,4,5). Highlighted: Select yields the single tuple (x, y) = (3,3).]

  22. MapToItem
      Query: for $x in (1,2,3) for $y in (3,4,5) where $x = $y return <a>{$x}</a>
      [Figure: plan with MapToItem <a>{ #x }</a> over Select #x = #y over MapConcat over MapFromItem $x (1,2,3) and MapFromItem $y (3,4,5). Highlighted: MapToItem yields <a>3</a>.]
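
The four operators walked through on slides 18 to 22 can be mimicked with a small Python sketch. It is a toy model, not the engine's implementation: tuples are plain dicts, tuple streams are lists, and the function names and signatures are assumptions chosen to mirror the operator names on the slides.

```python
def map_from_item(field, items):
    """Item sequence -> tuple stream, one single-field tuple per item."""
    return [{field: item} for item in items]


def map_concat(dependent, tuples):
    """Dependent join: run `dependent` once per input tuple and
    concatenate each tuple it returns with that input tuple."""
    return [{**t, **u} for t in tuples for u in dependent(t)]


def select(predicate, tuples):
    """Keep the tuples that satisfy the predicate."""
    return [t for t in tuples if predicate(t)]


def map_to_item(expr, tuples):
    """Tuple stream -> item sequence."""
    return [expr(t) for t in tuples]


# for $x in (1,2,3) for $y in (3,4,5) where $x = $y return <a>{$x}</a>
xs = map_from_item("x", [1, 2, 3])
pairs = map_concat(lambda t: map_from_item("y", [3, 4, 5]), xs)
matches = select(lambda t: t["x"] == t["y"], pairs)
result = map_to_item(lambda t: f"<a>{t['x']}</a>", matches)
print(result)  # ['<a>3</a>']

# Detour (slide 20): the inner branch reads the current x, so this is
# not a plain product.
detour = map_concat(lambda t: map_from_item("y", [t["x"], 4, 5]), xs)
print(detour[:4])
```

Passing the inner branch as a function of the current tuple is what makes `map_concat` a dependent join: the same operator covers both the independent case and the `(IN#x,4,5)` case.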

  23. Normalization of XPath
      • XPath: $d/descendant::person[position()=1]
      • Normalized: for $fs:dot in $d return for $fs:dot at $fs:position in $fs:dot/descendant::person where $fs:position = 1 return $fs:dot
      • Can handle XPath using FLWORs
      • Not using tree pattern algorithms! (See later)
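
To make the normalized form concrete, here is a rough Python analogue of the two nested loops, using the standard `xml.etree.ElementTree` module. The sample document and the function names are made up for illustration, and the sketch leaves out the duplicate elimination and document-order sorting that a full path-expression semantics applies to step results.

```python
import xml.etree.ElementTree as ET


def descendants(node, tag):
    """descendant:: axis in document order (the context node itself is excluded)."""
    return [n for n in node.iter(tag) if n is not node]


def first_person(d):
    """Mirror of the normalized FLWOR for $d/descendant::person[position()=1]."""
    out = []
    for dot in d:  # for $fs:dot in $d
        # for $fs:dot at $fs:position in $fs:dot/descendant::person
        for position, inner_dot in enumerate(descendants(dot, "person"), start=1):
            if position == 1:          # where $fs:position = 1
                out.append(inner_dot)  # return $fs:dot
    return out


doc = ET.fromstring(
    "<site>"
    "<people><person id='p1'/><person id='p2'/></people>"
    "<people><person id='p3'/></people>"
    "</site>"
)
groups = doc.findall("people")
print([p.get("id") for p in first_person(groups)])  # ['p1', 'p3']
```

The positional predicate is evaluated per context node, which is why the result has one person per `people` element rather than a single first person overall.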

  24. Overview • Example and Challenges • Algebra by Example • Logical Rewrites • Physical Algorithms • Experiments

  25. Product Rewrite
      • Clearly identified dependency
      • Requires Op2 independent of Op1
      Query: for $x in (1,2,3) for $y in (3,4,5) where $x = $y return <a>{$x}</a>
      [Figure: plan with MapToItem <a>{ #x }</a> over Select #x = #y over MapConcat over MapFromItem $x (1,2,3) and MapFromItem $y (3,4,5); MapConcat is rewritten to Product. Op1 and Op2 mark the operators involved in the rewrite.]
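
A minimal sketch of the independence check behind this rewrite, in Python. Everything here is an assumption made for illustration: the `Dependent` record, its `reads` set (standing in for a real free-variable analysis), and the function names are hypothetical and not taken from the compiler.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Dependent:
    """The dependent branch of a MapConcat: a function of the current
    tuple, plus the set of tuple fields it reads (hypothetical bookkeeping)."""
    run: Callable[[dict], list]
    reads: FrozenSet[str]


def map_concat(dep, tuples):
    """Dependent join: re-evaluate the branch for every input tuple."""
    return [{**t, **u} for t in tuples for u in dep.run(t)]


def product(left, right):
    """Plain product: both inputs are evaluated once."""
    return [{**t, **u} for t in left for u in right]


def rewrite_and_run(dep, tuples):
    """Use Product when the branch reads no field of the input, else keep MapConcat."""
    fields = set().union(*(t.keys() for t in tuples))
    if dep.reads.isdisjoint(fields):
        return "Product", product(tuples, dep.run({}))  # branch evaluated once
    return "MapConcat", map_concat(dep, tuples)


xs = [{"x": 1}, {"x": 2}, {"x": 3}]
independent = Dependent(lambda t: [{"y": y} for y in (3, 4, 5)], frozenset())
dependent = Dependent(lambda t: [{"y": y} for y in (t["x"], 4, 5)], frozenset({"x"}))

kind1, out1 = rewrite_and_run(independent, xs)
kind2, out2 = rewrite_and_run(dependent, xs)
print(kind1, len(out1))  # Product 9
print(kind2, out2[3])    # MapConcat {'x': 2, 'y': 2}
```

When the check succeeds, the product returns the same tuples in the same order as the dependent join, which is what makes the rewrite safe in an order-sensitive language.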

  26. Standard Join Rewrite
      • Asymptotic savings
      • Physical join operators
      Query: for $x in (1,2,3) for $y in (3,4,5) where $x = $y return <a>{$x}</a>
      [Figure: Select #x = #y over Product is rewritten to Join #x = #y, under MapToItem <a>{ #x }</a> and over MapFromItem $x (1,2,3) and MapFromItem $y (3,4,5).]

  27. New Op: GroupBy (Motivation)
      Query: for $x in (1,2,3) let $a := for $y in (1,1,2,3) where $x = $y return <b>{$y}</b> return <a>{$a}</a>
      • Want to join, but the let is in the way
      • $x = 1 is associated with $a = (<b>1</b>,<b>1</b>)
      • GroupBy allows the join and recovers the association
      • Key differences:
        • Not just for value aggregates
        • Partitions not on value equality (sequence index)

  28. GroupBy Example
      Query: for $x in (1,1,2,5) let $a := avg(for $y in (3,4,5) where $y > $x return $y + 2) return <a>{$a}</a>
      • Technical difference: groups on fields of MapIndex
      • Technical difference: fn:avg is a standard XQuery function
      [Figure: plan with MapToItem <a>{ #a }</a> over GroupBy (on ind, applying #y+2 and fn:avg(IN)) over LeftOuterJoin #y > #x, whose inputs are MapIndex ind over MapFromItem $x (1,1,2,5) and MapFromItem $y (3,4,5). Highlighted: MapIndex yields tuples (ind, x) = (1,1), (2,1), (3,2), …]

  29. GroupBy Example
      Query: for $x in (1,1,2,5) let $a := avg(for $y in (3,4,5) where $y > $x return $y + 2) return <a>{$a}</a>
      [Figure: same plan as slide 28. Highlighted: LeftOuterJoin #y > #x yields tuples (ind, x, y) = (1,1,3), (1,1,4), (1,1,5), …]

  30. GroupBy Example
      Query: for $x in (1,1,2,5) let $a := avg(for $y in (3,4,5) where $y > $x return $y + 2) return <a>{$a}</a>
      [Figure: same plan as slide 28. Highlighted: for the first group, #y+2 yields (5,6,7) and fn:avg(IN) yields the tuple [x:1; a: 6].]
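
The plan on slides 28 to 30 can be traced with a small Python sketch. It is a toy reconstruction under stated assumptions: tuples are dicts, the `null` flag is a made-up way to mark the padding tuple produced by the outer join, and the helper names and signatures are hypothetical.

```python
from itertools import groupby


def map_index(field, tuples):
    """Add a 1-based sequence index to every tuple."""
    return [{**t, field: i} for i, t in enumerate(tuples, start=1)]


def left_outer_join(pred, left, right):
    """Keep every left tuple; unmatched ones get a single padding tuple."""
    out = []
    for l in left:
        matches = [{**l, **r, "null": False} for r in right if pred(l, r)]
        out.extend(matches or [{**l, "null": True}])
    return out


def group_by(index_field, keep, per_tuple, aggregate, out_field, tuples):
    """Partition on the index field (input is already ordered by it),
    apply per_tuple to each real match, then aggregate the group."""
    out = []
    for _, group in groupby(tuples, key=lambda t: t[index_field]):
        group = list(group)
        items = [per_tuple(t) for t in group if not t["null"]]
        out.append({**{f: group[0][f] for f in keep}, out_field: aggregate(items)})
    return out


def fn_avg(items):
    """Like fn:avg, return the empty sequence (None here) on empty input."""
    return sum(items) / len(items) if items else None


xs = map_index("ind", [{"x": v} for v in (1, 1, 2, 5)])
ys = [{"y": v} for v in (3, 4, 5)]
joined = left_outer_join(lambda l, r: r["y"] > l["x"], xs, ys)
grouped = group_by("ind", ["x"], lambda t: t["y"] + 2, fn_avg, "a", joined)
result = ["<a/>" if t["a"] is None else f"<a>{t['a']:g}</a>" for t in grouped]
print(result)  # ['<a>6</a>', '<a>6</a>', '<a>6</a>', '<a/>']
```

Grouping on `ind` rather than on the value of `x` is what keeps the two `x = 1` iterations apart, and the outer join is what keeps the `x = 5` iteration alive even though it matches nothing.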

  31. Overview • Example and Challenges • Algebra by Example • Logical Rewrites • Physical Algorithms • Experiments

  32. Join Algorithm Challenges
      • Type promotion, e.g. 5 ? 5.0 (see paper for details)
      • Complex predicates
      • Join condition: $x = $y, i.e. exists $x’ in fn:data($x) satisfies exists $y’ in fn:data($y) satisfies op:eq($x’,$y’)
      • Duplicate elimination

  33. Existential Semantics (Hash): Build
      • Input: tuple streams L and R (ordered)
      • Output: pairs (l,r) in L x R that satisfy the predicate exists $x’ in fn:data(L#x) satisfies exists $y’ in fn:data(R#y) satisfies op:eq($x’,$y’)
      • Inner relation tuple [y: (5,3);…] is the 1st tuple in R (index 0)
      • Hash table (y’ → R-index): 5 → 0; 3 → 0; …

  34. Existential Semantics (Hash): Build
      • Input: tuple streams L and R (ordered)
      • Output: pairs (l,r) in L x R that satisfy the predicate exists $x’ in fn:data(L#x) satisfies exists $y’ in fn:data(R#y) satisfies op:eq($x’,$y’)
      • Ordered inner relation: tuple [y: (5,8);…] is the 2nd tuple in R
      • Hash table (y’ → R-index): 5 → 0,1; 3 → 0; …

  35. Existential Semantics (Hash): Probe
      • Input: tuple streams L and R (ordered)
      • Output: pairs (l,r) that satisfy the predicate exists $x’ in fn:data(L#x) satisfies exists $y’ in fn:data(R#y) satisfies op:eq($x’,$y’)
      • Hash table (y’ → R-index): 5 → 0,1; 3 → 0; …
      • Outer relation tuple (L): [x:(5,6,3);…] retrieves the index lists (0,1), (0), …
      • Merge for duplicate elimination, since indexes may be repeated: result (0,1,…)
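
The build and probe phases on slides 33 to 35 can be sketched as follows. This is a simplified reading of the slides, not the paper's algorithm: fields hold Python lists of atomic values, Python's hashing and `==` stand in for `op:eq` (so the type promotion issue is not addressed), and the function names are assumptions.

```python
import heapq
from itertools import groupby


def build(right, field):
    """Hash each atomized value of R#field to the ascending list of
    R indexes whose tuple contains it."""
    table = {}
    for index, r in enumerate(right):
        for value in dict.fromkeys(r[field]):  # distinct values of this tuple
            table.setdefault(value, []).append(index)
    return table


def probe(left, right, table, field):
    """For each L tuple, look up every value, merge the sorted index
    lists, and drop repeated indexes so each (l, r) pair appears once."""
    out = []
    for l in left:
        hits = [table[v] for v in l[field] if v in table]
        for index, _ in groupby(heapq.merge(*hits)):
            out.append((l, right[index]))
    return out


R = [{"y": [5, 3]}, {"y": [5, 8]}]
L = [{"x": [5, 6, 3]}]
table = build(R, "y")
print(table)  # {5: [0, 1], 3: [0], 8: [1]}
pairs = probe(L, R, table, "x")
print([(l["x"], r["y"]) for l, r in pairs])
```

Because every index list is ascending, the merge keeps matches in the order of R, and removing adjacent repeats is enough to avoid emitting the same pair twice when several values of one L tuple hit the same R tuple.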

  36. Overview • Example and Challenges • Algebra by Example • Logical Rewrites • Physical Algorithms • Experiments

  37. Experiments: XMark 1-20 on 1MB document
      Implementation          Total Time
      No Algebra              3m30s
      Algebra (No Opts)       50.0s
      Optim + Nested Loop     5.1s
      Optim + XQuery Join     1.7s

  38. Clio Experiments
      • CLIO: Schema Matching Tool Queries
      • Factor of 30x
      • 250K document
      Query   Joins   Depth
      N2      2       2
      N3      3       3
      N4      6       4

  39. Extending the Algebra for XPath • Challenges • Algebra by Example • Logical Rewrites • Physical Algorithms • Experiments

  40. XPath: Back to The Compilation Pipeline
      [Figure: XQuery → parse → XQuery AST → Normalize → XQuery Core AST (Core Rewriting) → Compile → Logical Algebra (Logical Optimization) → Physical Decision (parameters) → Code → eval]
      FOCUS: Compilation and Logical Optimization

  41. Overview of XPath-specific Algorithms
      • Twig Joins: twig pattern matching
        • One scan of the document for multiple steps
        • Rely on access to all elements of a given QName
      • Staircase Joins:
        • Use the efficiency of a relational join
        • With an R-Tree index over the region encoding, allowing skipping of irrelevant parts of the document
        • With pruning to avoid duplicates
      • Streaming:
        • [Barton et al] FSA-based algorithms, XAOS
        • [Fernandez, Stark] Step-at-a-time streaming
      • All: work for limited (disjoint!) subsets of XPath
      • No work on how to detect / integrate those in arbitrary XQuery

  42. Issue 1: Recovering Declarativity
      • TwigJoins/Staircase/streaming will work very well for: $auctions//a
      • They rely on syntactic detection
      • The following will not work: $auctions//a[position() … ]
      • The following should work, but is typically not detected: for $x in $auction return $x/descendant::a

  43. Algebraic approach: Tuple Algebra Overview
      • Logical level:
        • Relational operators: Select, Join, Product, OuterJoin, GroupBy, etc.; Map, MapConcat (i.e., dependent join)
        • XML operators: Step (for navigation), Parse, Element construction, etc.
      • Physical level:
        • Physical algorithms for each algebraic operator
        • Tied to a specific XML representation: tree in memory, XML stream, native XML index in DB, relational shredding, virtual XML

  44. The Algebra & XPath
      Query: doc("xmark.xml")//person[emailaddress]/name
      [Figure: the query's tree pattern (doc(), node(), person, with name and email… below it) next to its compiled plan, a nest of MapToItem, MapFromItem, MapConcat, Select, Step and Parse operators over the fields dot1, dot3, dot5 and dot7, with steps d-o-s::node, child::person and child::email…]

  45. The Compilation Pipeline (goal)
      [Figure: XQuery → parse → XQuery AST → Normalize → XQuery Core AST (Core Rewriting) → Compile → Logical Algebra (Logical Optimization) → Compile → Physical Algebra (Physical Optimization) → Code Select → Code → eval]
      • Tree patterns are ‘normalized’ to FLWORs
      • Introduce tree patterns (back)

  46. Logical Rewrites
      [Figure: three rewrite rules that replace combinations of MapToItem, MapFromItem, MapConcat and Step (axis, nt) operators over Input fields with TupleTreePattern operators parameterized by in, out, axis, nt; adjacent TupleTreePattern operators under a MapConcat are combined.]

  47. Rewritten Tree
      • Path expressions visible
      • Several physical plans possible: TwigJoin, Staircase Join, SortJoin, Nested Loop, Streaming
      [Figure: TreePattern rewrites. The plan is MapToItem (Input # dot_n1) over TupleTreePattern (dot7, dot_n1, child::name) over a Select whose predicate is a TupleTreePattern for child::emailaddress on Input # dot7, over TupleTreePattern (dot3, dot4, child::person), over TupleTreePattern (dot1, dot3, d-o-s::node), over MapFromItem dot1, over Parse. It is shown next to the tree pattern doc(), node(), person, with name and email… below it.]

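  48. Issue 2: Which algorithm for which Physical XML?
      [Figure: physical XML representations (Stream, Tree Model, Shredded) set against algorithms (Nested Loop, Staircase Join, Prune + Sort Join, Twig Join, Streaming); which algorithm fits which representation is the open question.]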

  49. Physical XPath algorithms
      Logical Operator          Algorithm           Stream          Tree Model      Shredded
      MapConcat + TreePattern   Nested Loop         ?               Standard        Standard
      TreePattern               Staircase Join      ?               ?               [Grust et al]
      “”                        Sort Join + Index   ?               ?               ?
      “”                        Twig Join           ?               [Bruno et al]   [Jiang et al]
      “”                        Streaming           [Stark et al]   ?               ?

  50. Physical XPath algorithms
      Logical Operator          Algorithm           Stream          Tree Model      Shredded
      MapConcat + TreePattern   Nested Loop         n/a             Standard        Standard
      TreePattern               Staircase Join      n/a             ?               [Grust et al]
      “”                        Sort Join + Index   n/a             New             New
      “”                        Twig Join           New             [Bruno et al]   [Jiang et al]
      “”                        Streaming           [Stark et al]   n/a             n/a
