Adaptive Query Processing

Presentation Transcript


  1. Adaptive Query Processing Amol Deshpande, University of Maryland Joseph M. Hellerstein, University of California, Berkeley Vijayshankar Raman, IBM Almaden Research Center

  2. Outline • 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Adaptive Join Processing • Research Roundup

  3. Outline • 20th Century Adaptivity: Intuition from the Classical Systems • Data Independence and Adaptivity • The Adaptivity Loop • Two case studies • System R • INGRES • Tangential topics • Adaptive Selection Ordering • Adaptive Join Processing • Research Roundup

  4. Data Independence Redux: dapp/dt << denv/dt • The taproot of modern database technology • Separation of specification (“what”) from implementation (“how”) • Refamiliarizing ourselves: Why do we care about data independence?

  5. D. I. → Adaptivity • Query Optimization: the key to data independence • bridges specification and implementation • isolates static applications from dynamic environments • How does a DBMS account for dynamics in the environment? ADAPTIVITY • This tutorial is on a 30-year-old topic • With a 21st-Century renaissance

  6. Why the Renaissance? [figure: denv/dt ?] • Breakdown of traditional query optimization • Queries over many tables • Unreliability of traditional cost estimation • Success & maturity make problems more apparent, critical • c.f. Oracle v6! • Query processing in new environments • E.g. data integration, web services, streams, P2P, sensornets, hosting, etc. • Unknown and dynamic characteristics for data and runtime • Increasingly aggressive sharing of resources and computation • Interactivity in query processing • Note two separate themes? • Unknowns: even static properties often unknown in new environments • and often unknowable a priori • Dynamics: can be very high -- motivates intra-query adaptivity

  7. The Adaptivity Loop: Measure/Model → Plan → Actuate → (repeat) • Need not happen at the same timescales!
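
As a mental model, the loop can be written as a short Python skeleton. The five callables are hypothetical hooks, not an API of any system discussed in the tutorial; the sketch only makes the Measure/Model, Plan, and Actuate steps explicit.

```python
def adaptivity_loop(measure, plan, actuate, execute, finished):
    """Skeleton of the adaptivity loop; measure/plan/actuate are hypothetical
    hooks and, in real systems, may run at very different timescales."""
    model = measure()                 # Measure / Model: build or refresh statistics
    current = plan(model)             # Plan: pick a plan using the current model
    actuate(current)                  # Actuate: install the plan in the executor
    while not finished():
        execute()                     # do some query-processing work...
        model = measure()             # ...then go around the loop again
        candidate = plan(model)
        if candidate != current:      # re-actuate only if the plan changed
            actuate(candidate)
            current = candidate
```

The slide's point is that these steps need not run at the same timescale: System R measures offline and plans/actuates once per query, while INGRES goes around the whole loop many times within a single query.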

  8. An example query (over Professor, Course, Student) • SQL: SELECT * FROM Professor P, Course C, Student S WHERE P.pid = C.pid AND S.sid = C.sid • QUEL: range of P is Professor range of C is Course range of S is Student RETRIEVE (P.ALL, C.ALL, S.ALL) WHERE P.pid = C.pid AND S.sid = C.sid

  9. System R Optimizer [figure: "UPDATE STATISTICS" gathers cardinalities and index lo/hi keys; at "SELECT * FROM …" time, a dynamic-programming optimizer uses them to pick a plan]

  10. System R Adaptivity: Measure/Model → Plan → Actuate • Note different timescales

  11. INGRES "Query Decomposition" 1 [figure: the one-variable query processor (OVQP) first applies the selections ("RANGE OF P IS …"), materializing hashed temps; then, for one P tuple, a substituted one-variable query "RANGE OF C_T IS … WHERE C_T.pid=44 …" is answered by OVQP from the hashed temps]

  12. INGRES "Query Decomposition" 1 (cont.) [figure: as on slide 11, plus the next substituted one-variable query "RANGE OF T_T IS … WHERE T_T.sid = 273", whose OVQP execution produces output tuples]

  13. INGRES "Query Decomposition" 2 [figure: for another P tuple, after the selections ("RANGE OF P IS …", hashed temps), the substituted one-variable queries are "RANGE OF CT IS … WHERE CT.pid=26 …" and "RANGE OF ST IS … WHERE ST.sid=441", producing output tuples]

  14. INGRES: Post-Mortem [figure: the per-tuple choices amount to two different join trees over P, C, S, mixing hash joins and index nested-loops (INL)] • Case 1: P, S, PC • Case 2: P, PC, S • Plan choice determined by number of C matches per P • Each P tuple is either Type 1 or Type 2.

  15. Horizontal Partitioning • “Post-mortem” behavior • Horizontal partitioning of inputs into different static plans [Ives02] • “Driving” input relation effectively partitioned by join keys • Each partition participates in a different static plan • Recurses up each different join tree • End result can be described as a union of static plans over partitions • In general, many such plans! • Note: post-mortem always has a relational description of some sort • But often “unusual”: plans that are simply not considered in System R! • Often cannot know the partitioning prior to query execution • So: plan-space and adaptivity loop settings have strong interactions! • A theme we’ll see throughout.

  16. INGRES Adaptivity: Measure/Model → Plan → Actuate • All ‘round the loop each time…

  17. Observations on 20thC Systems • Both INGRES & System R used adaptive query processing • To achieve data independence • They “adapt” at different timescales • Ingres goes ‘round the whole loop many times per query • System R decouples parts of loop, and is coarser-grained • measurement/modeling: periodic • planning/actuation: once per query • Query Post-Mortem reveals different relational plan spaces • System R is direct: each query mapped to a single relational algebra stmt • Ingres’ decision space generates a union of plans over horizontal partitions • this “super-plan” not materialized -- recomputed via FindMin • Both have zero-overhead actuation • Never waste query processing work

  18. 20th Century Summary • System R’s optimization scheme deemed the winner for 25 years • Nearly all 20thC research varied System R’s individual steps • More efficient measurement (e.g. sampling) • More efficient/effective models (samples, histograms, sketches) • Expanded plan spaces (new operators, bushy trees, richer queries and data models, materialized views, parallelism, remote data sources, etc) • Alternative planning strategies (heuristic and enumerative) • Speaks to the strength of the scheme • independent innovation on multiple fronts • as compared with tight coupling of INGRES • But… minimal focus on the interrelationship of the steps • Which, as we saw from Ingres, also affects the plan space

  19. 21st Century Adaptive Query Processing • (well, starts in late 1990’s) • Revisit basic architecture of System R • In effect, change the basic adaptivity loop! • As you examine schemes, keep an eye on: • Rate of change in the environment that is targeted • How radical the scheme is wrt the System R scheme • ease of evolutionary change • Increase in plan space: are there new, important opportunities? • even if environment is ostensibly static! • New overheads introduced • How amenable the scheme is to independent innovation at each step • Measure/Analyze/Plan/Actuate

  20. Tangentially Related Work • An incomplete list!!! • Competitive Optimization [Antoshenkov93] • Choose multiple plans, run in parallel for a time, let the most promising finish • 1x feedback: execution doesn’t affect planning after the competition • Parametric Query Optimization [INSS92, CG94, etc.] • Given partial stats in advance. Do some planning and prune the space. At runtime, given the rest of statistics, quickly finish planning. • Changes interaction of Measure/Model and Planning • No feedback whatsoever, so nothing to adapt to! • “Self-Tuning”/“Autonomic” Optimizers [CR94, CN97, BC02, etc.] • Measure query execution (e.g. cardinalities, etc.) • Enhances measurement, on its own doesn’t change the loop • Consider building non-existent physical access paths (e.g. indexes, partitions) • In some senses a separate loop – adaptive database design • Longer timescales

  21. Tangentially Related Work II • Robust Query Optimization [CHG02, MRS+04, BC05, etc.] • Goals: • Pick plans that remain predictable across wide ranges of scenarios • Pick least expected cost plan • Changes cost function for planning, not necessarily the loop. • If such functions are used in adaptive schemes, less fluctuation [MRS+04] • Hence fewer adaptations, less adaptation overhead • Adaptive query operators [NKT88, KNT89, PCL93a, PCL93b] • E.g. memory-adaptive sort and hash-join • Doesn’t address whole-query optimization problems • However, if used with AQP, can result in complex feedback loops • Especially if their actions affect each other’s models!

  22. Extended Topics in Adaptive QP • An incomplete list!! • Parallelism & Distribution • River [A-D03] • FLuX [SHCF03, SHB04] • Distributed eddies [TD03] • Data Streams • Adaptive load shedding • Shared query processing

  23. Adaptive Selection Ordering (section title slide)

  24. Selection Ordering • Complex predicates on relations are common • E.g., on an employee relation: ((salary > 120000) AND (status = 2)) OR ((salary between 90000 and 120000) AND (age < 30) AND (status = 1)) OR … • Selection ordering problem: decide the order in which to evaluate the individual predicates against the tuples • We focus on evaluating conjunctive predicates (containing only ANDs) • Example query: select * from R where R.a = 10 and R.b < 20 and R.c like ‘%name%’;

  25. Why Study Selection Ordering • Many join queries reduce to this problem • Queries posed against a star schema • Queries where only pipelined left-deep plans are considered • Queries involving web indexes • Increasing interest in recent years • Web indexes [CDY’95, EHJKMW’96, GW’00] • Web services [SMWM’06] • Data streams [AH’00, BMMNW’04] • Sensor Networks [DGMH’05] • Similar to many problems in other domains • Sequential testing (e.g. for fault detection) [SF’01, K’01] • Learning with attribute costs [KKM’05]

  26. Why Study Selection Ordering • Simpler to understand and analyze • Many fundamental AQP ideas can be demonstrated with these queries • Very good analytical and theoretical results known • No analogues for general multi-way joins • Big differences to look out for • These queries are stateless; queries involving joins are not stateless • No burden of routing history • Selections are typically very inexpensive • The costs of AQP techniques become important

  27. Execution Strategies • Pipelined execution (tuple-at-a-time) [figure: R → R.a = 10 → R.b < 20 → R.c like … → result]: for each tuple r ∈ R, apply predicate R.a = 10 first; if the tuple satisfies the selection, apply R.b < 20; if both are satisfied, apply R.c like ‘%name%’ • Operator-at-a-time execution [figure: R → R.a = 10 → materialize R1 → R.b < 20 → materialize R2 → R.c like … → result]: apply predicate R.a = 10 to all tuples of R and materialize the result as R1; apply predicate R.b < 20 to all tuples of R1 and materialize the result as R2; …

  28. Execution Strategies (cont.) [figure: the same two plans as slide 27] • Pipelined execution (tuple-at-a-time): preferred for selection ordering • The two strategies are fundamentally different from an adaptivity perspective
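
To make the contrast concrete, here is a small Python sketch of both strategies for the example query's three predicates. The relation is just a list of dicts and the predicate list is an illustrative stand-in for real operators, not anything from the tutorial.

```python
# Hypothetical stand-ins for the three selections of the example query.
predicates = [
    lambda r: r["a"] == 10,
    lambda r: r["b"] < 20,
    lambda r: "name" in r["c"],
]

def pipelined(R, preds):
    """Tuple-at-a-time: each tuple flows through the predicates in order and
    drops out at the first predicate it fails; nothing is materialized."""
    return [r for r in R if all(p(r) for p in preds)]  # all() short-circuits

def operator_at_a_time(R, preds):
    """Operator-at-a-time: apply each predicate to the whole input and
    materialize the intermediate result (R1, R2, ...) before the next one."""
    current = R
    for p in preds:
        current = [r for r in current if p(r)]  # materialized intermediate
    return current

R = [{"a": 10, "b": 5, "c": "a name"}, {"a": 3, "b": 1, "c": "x"}]
assert pipelined(R, predicates) == operator_at_a_time(R, predicates)
```

Both compute the same result; what matters for adaptivity is that the pipelined plan never materializes intermediate results and could, in principle, change the predicate order from one tuple to the next, while the operator-at-a-time plan commits to an order for the entire input at each step.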

  29. Outline • 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Setting and motivation • Four Approaches • Static Selinger-style optimization • KBZ Algorithm for independent selections [KBZ’86] • A 4-approx greedy algorithm for correlated selections [BMMNW’04] • Mid-query reoptimization [KD’98] • Adapted to handle selection ordering • A-Greedy [BMMNW’04] • Eddies [AH’00] • Other related work • Adaptive Join Processing • Research Roundup

  30. Static Selinger-style Optimization [figure: two of the possible plans, each applying the three selections to R in a different order] • Find a single order of the selections to be used for all tuples • Query: select * from R where R.a = 10 and R.b < 20 and R.c like ‘%name%’; • Query plans considered: 3! = 6 distinct plans possible

  31. Static Selinger-style Optimization [figure: the plan R → R.a = 10 → R.b < 20 → R.c like … → result] • Cost metric: CPU instructions • Computing the cost of a plan: need to know the costs (c1, c2, c3) and the selectivities (s1, s2, s3) of the predicates • Cost per tuple of the three operators (independence assumption): c1, s1 * c2, s1 * s2 * c3 • cost(plan) = |R| * (c1 + s1 * c2 + s1 * s2 * c3)
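
To make the formula concrete, here is a small Python sketch that computes the cost of a pipelined selection order from per-predicate costs and selectivities; the function name and example numbers are illustrative, not from the tutorial.

```python
def plan_cost(card_R, costs, sels):
    """cost(plan) = |R| * (c1 + s1*c2 + s1*s2*c3 + ...), assuming independent
    selectivities; `costs` and `sels` are given in the order of application."""
    per_tuple, surviving = 0.0, 1.0
    for c, s in zip(costs, sels):
        per_tuple += surviving * c   # only tuples that passed so far reach this predicate
        surviving *= s               # fraction of tuples surviving this predicate
    return card_R * per_tuple

# e.g. |R| = 1000, unit costs, selectivities 0.05, 0.1, 0.2:
print(plan_cost(1000, [1, 1, 1], [0.05, 0.1, 0.2]))  # ≈ 1000 * (1 + 0.05 + 0.005) = 1055
```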

  32. Static Selinger-style Optimization • Dynamic programming algorithm: compute the optimal order and cost for the 1-subsets of predicates (R.a = 10; R.b < 20; R.c like …), then the 2-subsets (R.a = 10 AND R.b < 20; R.a = 10 AND R.c like …; …), then the 3-subset (R.a = 10 AND R.b < 20 AND R.c like …) • Selectivities of 1-subsets estimated using 1-d histograms or random samples etc.; selectivities of larger subsets estimated using 2-d histograms or random samples, or by assuming independence • Complexity: O(2^n)
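
Under the independence assumption, the subset dynamic program can be sketched in a few lines of Python; a real optimizer would estimate the selectivity of larger predicate subsets from 2-d histograms or samples rather than multiplying 1-d selectivities, as the slide notes. This is a sketch of the idea only.

```python
from itertools import combinations
from math import prod

def best_order_dp(costs, sels):
    """O(2^n) dynamic program: best[S] = cheapest per-tuple cost of applying the
    predicate subset S, built bottom-up; assumes independent selectivities."""
    n = len(costs)
    best = {frozenset(): (0.0, [])}                 # subset -> (per-tuple cost, order)
    for k in range(1, n + 1):
        for subset in map(frozenset, combinations(range(n), k)):
            choices = []
            for last in subset:                     # try each predicate as the last one applied
                rest = subset - {last}
                rest_cost, rest_order = best[rest]
                reach = prod(sels[i] for i in rest) # fraction of tuples reaching `last`
                choices.append((rest_cost + reach * costs[last], rest_order + [last]))
            best[subset] = min(choices)
    return best[frozenset(range(n))]

print(best_order_dp([1, 1, 1], [0.2, 0.05, 0.1]))   # -> (≈1.055, [1, 2, 0]): sort by selectivity
```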

  33. Static Selinger-style Optimization • KBZ algorithm for independent selections [KBZ’86] • Apply the predicates in the decreasing order of: (1 – s) / c where s = selectivity, c = cost • Correlated selections • NP-hard under several different formulations • E.g. when given a random sample of the relation • Greedy algorithm: • Apply the selection with the highest (1 - s)/c • Compute the selectivities of remaining selections over the result • Conditional selectivities • Repeat • Can be shown to be 4-approximate [BMMNW’04] • Best possible unless P = NP
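
Both rules can be sketched in Python. For the correlated case, the sketch below stands in for the "statistics" with a random sample of the relation and measures conditional selectivities by filtering that sample; the predicates, sample, and unit costs are illustrative assumptions, not the exact procedures of [KBZ'86] or [BMMNW'04].

```python
import random

def kbz_order(costs, sels):
    """Independent selections: apply predicates in decreasing (1 - s) / c order."""
    return sorted(range(len(costs)), key=lambda i: (1 - sels[i]) / costs[i], reverse=True)

def greedy_correlated_order(preds, costs, sample):
    """Greedy rule for correlated selections: repeatedly pick the predicate with
    the highest (1 - s) / c, where s is its selectivity conditioned on the
    predicates already chosen, measured on the sample."""
    remaining, order, survivors = list(range(len(preds))), [], list(sample)
    while remaining:
        def rank(i):
            s = sum(1 for r in survivors if preds[i](r)) / max(len(survivors), 1)
            return (1 - s) / costs[i]
        nxt = max(remaining, key=rank)
        order.append(nxt)
        remaining.remove(nxt)
        survivors = [r for r in survivors if preds[nxt](r)]  # condition on the chosen predicate
    return order

# Illustrative usage on a synthetic random sample of tuples (a, b, c):
sample = [{"a": random.randint(0, 20), "b": random.randint(0, 40),
           "c": "name" * random.randint(0, 1)} for _ in range(1000)]
preds = [lambda r: r["a"] == 10, lambda r: r["b"] < 20, lambda r: "name" in r["c"]]
print(kbz_order([1, 1, 1], [0.05, 0.1, 0.2]))            # -> [0, 1, 2]
print(greedy_correlated_order(preds, [1, 1, 1], sample))
```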

  34. Static Selinger-Style: Measure/Model → Plan → Actuate

  35. Outline • 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Setting and motivation • Four Approaches • Static Selinger-style optimization • KBZ Algorithm for independent selections [KBZ’86] • A 4-approx greedy algorithm for correlated selections [BMMNW’04] • Mid-query reoptimization [KD’98] • Adapted to handle selection ordering • A-Greedy [BMMNW’04] • Eddies [AH’00] • Other related work • Adaptive Join Processing • Research Roundup

  36. Mid-query Reoptimization [figure: plan R → R.a = 10 → materialize R1 → R.b < 20 → R.c like … → result, with estimated selectivities 0.05, 0.1, 0.2 for the three predicates] • At materialization points, re-evaluate the rest of the query plan • A materialization point is a free opportunity to re-evaluate the rest of the query plan: exploit it by gathering information about the materialized result • Example: the initial query plan shown above is chosen using the estimated selectivities

  37. Mid-query Reoptimization (cont.) [figure: the same plan, with the materialization step annotated "Materialize R1; build 1-d hists": while materializing R1, build 1-d histograms on it] • At materialization points, re-evaluate the rest of the query plan, exploiting the information gathered about the materialized result

  38. Mid-query Reoptimization (cont.) [figure: using the histograms on R1, the selectivities of the two remaining predicates are re-estimated as 0.5 and 0.01, versus the original estimates of 0.1 and 0.2] • Re-estimated selectivities significantly different → original plan probably sub-optimal • Reoptimize the remaining part of the query
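
A hedged sketch of the decision made at the materialization point: re-estimate the selectivities of the remaining predicates from the materialized result (here by probing a random sample of it, as a stand-in for the 1-d histograms on the slide) and re-plan only if the estimates are badly off. The sampling, the 2x threshold, and the helper names are assumptions for illustration, not details of [KD'98].

```python
import random

def maybe_reoptimize(materialized, remaining_preds, est_sels, costs,
                     sample_size=1000, threshold=2.0):
    """Returns the order (as indices into remaining_preds) to use for the rest of
    the plan; est_sels[i] is the optimizer's estimate for remaining_preds[i]."""
    sample = random.sample(materialized, min(sample_size, len(materialized)))
    observed = [sum(1 for r in sample if p(r)) / max(len(sample), 1)
                for p in remaining_preds]

    def ratio(est, obs):
        lo, hi = sorted((est, obs))
        return hi / max(lo, 1e-9)

    if all(ratio(e, o) <= threshold for e, o in zip(est_sels, observed)):
        return list(range(len(remaining_preds)))   # estimates close enough: keep the current order
    # Estimates significantly wrong: reoptimize (here, reorder) the remaining part of the query.
    return sorted(range(len(remaining_preds)),
                  key=lambda i: (1 - observed[i]) / costs[i], reverse=True)
```

With the numbers on the slide (estimates 0.1 and 0.2 versus re-estimates 0.5 and 0.01), both ratios exceed the threshold, so the two remaining selections would be reordered to apply the far more selective one first.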

  39. Mid-query Reoptimization • Explored plan space identical to static • The operators are applied to the tuples in the same order • The order is determined lazily • The specific approach equivalent to the 4-Approx Greedy algorithm • Cost of adaptivity: • Materialization cost • Many (join) query plans typically have materialization points • May want to introduce materialization points if there is high uncertainty • Constructing statistics on intermediate results • Depends on the statistics maintained • Re-optimization cost • Optimizer should be re-invoked only if the estimates are significantly wrong

  40. Mid-query Reoptimization • Advantages: • Easy to implement in a traditional query processing system • Familiar plan space; easy to optimize and understand what's going on • Operator-at-a-time query processing • Disadvantages: • Granularity of adaptivity is coarse • Once an operator starts executing, can’t change that decision • Explored plan space identical to static optimization • Can’t apply different orders to different sets of tuples • Requires materialization • Cost of materialization can be high • Ill-suited for data streams and similar environments

  41. Mid-query Adaptivity: Measure/Model → Plan → Actuate

  42. Outline • 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Setting and motivation • Four Approaches • Static Selinger-style optimization • KBZ Algorithm for independent selections [KBZ’86] • A 4-approx greedy algorithm for correlated selections [BMMNW’04] • Mid-query reoptimization [KD’98] • Adapted to handle selection ordering • A-Greedy [BMMNW’04] • Eddies [AH’00] • Other related work • Adaptive Join Processing • Research Roundup

  43. Adaptive Greedy [BMMNW’04] [figure: pipelined plan applying the three selections R.a = 10, R.b < 20, R.c like … to R] • Context: pipelined query plans over streaming data • Example: three independent predicates, each costing 1 unit, with initial estimated selectivities 0.05, 0.1, 0.2 • The optimal execution plan orders the predicates by selectivity (because the costs are identical)

  44. Adaptive Greedy [BMMNW’04] (cont.) [figure: the pipelined plan plus a Profile: a random sample of tuples (Rsample) is evaluated against all three predicates] • Monitor the selectivities: estimate the selectivities of the predicates over the tuples of the profile • Switch the order if the predicates are not ordered by selectivity • Reoptimizer: IF the current plan is not optimal w.r.t. these new selectivities THEN reoptimize using the profile

  45. Adaptive Greedy [BMMNW’04]: Correlated Selections [figure: the profile (Rsample) is evaluated against all three predicates] • Must monitor conditional selectivities, not just sel(R.a = 10), sel(R.b < 20), sel(R.c …): also sel(R.b < 20 | R.a = 10), sel(R.c like … | R.a = 10), sel(R.c like … | R.a = 10 and R.b < 20) • O(n^2) selectivities need to be monitored • Reoptimizer: uses conditional selectivities to detect violations; uses the profile to reoptimize
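
A rough sketch of the A-Greedy machinery for selections with unit costs: a profile of randomly sampled tuples is evaluated against all predicates, conditional selectivities are read off the profile, the greedy invariant is checked, and the order is recomputed from the profile when the invariant is violated. The class name, sampling rate, and violation threshold are illustrative assumptions, not the exact algorithm or data structures of [BMMNW'04].

```python
import random

class AGreedySketch:
    """Illustrative sketch of adaptive greedy selection ordering (unit costs)."""
    def __init__(self, preds, sample_prob=0.01, slack=1.2):
        self.preds = preds
        self.order = list(range(len(preds)))   # current predicate order
        self.profile = []                      # per sampled tuple: pass/fail bit per predicate
        self.sample_prob = sample_prob
        self.slack = slack                     # tolerated violation before reordering

    def process(self, r):
        if random.random() < self.sample_prob:                   # profile maintenance:
            self.profile.append(tuple(p(r) for p in self.preds)) # run ALL predicates
            self.check_and_reoptimize()        # (a real system would check only periodically)
        for i in self.order:                   # normal pipelined execution
            if not self.preds[i](r):
                return None
        return r

    def cond_sel(self, i, given):
        """sel(pred i | all predicates in `given` passed), measured on the profile."""
        hits = [row for row in self.profile if all(row[j] for j in given)]
        return sum(row[i] for row in hits) / max(len(hits), 1)

    def check_and_reoptimize(self):
        # Greedy invariant: the predicate at each position has (roughly) the lowest
        # conditional selectivity given that the predicates before it passed.
        for k in range(len(self.order)):
            given = self.order[:k]
            here = self.cond_sel(self.order[k], given)
            best = min(self.cond_sel(i, given) for i in self.order[k:])
            if here > self.slack * best:       # violation detected: rebuild the order greedily
                self.order = self.greedy_order()
                return

    def greedy_order(self):
        remaining, chosen = list(range(len(self.preds))), []
        while remaining:                       # may take several passes over the profile
            nxt = min(remaining, key=lambda i: self.cond_sel(i, chosen))
            chosen.append(nxt)
            remaining.remove(nxt)
        return chosen
```

Feeding a stream of tuples through `process` keeps the order roughly sorted by conditional selectivity; profile maintenance, periodic violation checks, and reoptimization passes over the profile are exactly the overheads discussed on the next slide.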

  46. Adaptive Greedy [BMMNW’04] • Cost of adaptivity: • Profile maintenance • Must evaluate a (random) fraction of tuples against all operators • Detecting violations • Periodic checks for detecting if the current order is optimal • Doing this per tuple too expensive • Reoptimization cost • Can require multiple passes over the profile

  47. Adaptive Greedy: Post-Mortem [figure: the tuple stream laid out by order of arrival, with a plan switch point; tuples arriving before the switch see one predicate order, tuples after it see another] • Plan space explored: “horizontal partitioning” by order of arrival • If the selectivities are correlated with tuple arrival order, this can lead to huge savings

  48. Adaptive Greedy [BMMNW’04] • Advantages: • Can adapt very rapidly • Theoretical guarantees on performance • Not known for any other AQP protocols • Disadvantages: • Limited applicability • Only applies to selection ordering and specific types of join queries • Possibly high runtime overheads • Several heuristics described in the paper

  49. A-Greedy Adaptivity: Measure/Model → Plan → Actuate

  50. Outline • 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Setting and motivation • Four Approaches • Static Selinger-style optimization • KBZ Algorithm for independent selections [KBZ’86] • A 4-approx greedy algorithm for correlated selections [BMMNW’04] • Mid-query reoptimization [KD’98] • Adapted to handle selection ordering • A-Greedy [BMMNW’04] • Eddies [AH’00] • Other related work • Adaptive Join Processing • Research Roundup
