Adaptive Query Processing

Presentation Transcript


  1. Adaptive Query Processing Amol Deshpande, University of Maryland Joseph M. Hellerstein, University of California, Berkeley Vijayshankar Raman, IBM Almaden Research Center

  2. Outline • 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Adaptive Join Processing • Research Roundup

  3. Outline • 20th Century Adaptivity: Intuition from the Classical Systems • Data Independence and Adaptivity • The Adaptivity Loop • Two case studies • System R • INGRES • Tangential topics • Adaptive Selection Ordering • Adaptive Join Processing • Research Roundup

  4. Data Independence Redux: dapp/dt << denv/dt • The taproot of modern database technology • Separation of specification (“what”) from implementation (“how”) • Refamiliarizing ourselves: Why do we care about data independence?

  5. D. I. → Adaptivity • Query Optimization: the key to data independence • bridges specification and implementation • isolates static applications from dynamic environments • How does a DBMS account for dynamics in the environment? ADAPTIVITY • This tutorial is on a 30-year-old topic • With a 21st-Century renaissance

  6. Why the Renaissance? [figure: denv/dt ?] • Breakdown of traditional query optimization • Queries over many tables • Unreliability of traditional cost estimation • Success & maturity make problems more apparent, critical • c.f. Oracle v6! • Query processing in new environments • E.g. data integration, web services, streams, P2P, sensornets, hosting, etc. • Unknown and dynamic characteristics for data and runtime • Increasingly aggressive sharing of resources and computation • Interactivity in query processing • Note two separate themes? • Unknowns: even static properties often unknown in new environments • and often unknowable a priori • Dynamics: can be very high -- motivates intra-query adaptivity

  7. The Adaptivity Loop: Measure/Model → Plan → Actuate → (repeat) • Need not happen at the same timescales!
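
As a mental model, the loop can be written as a short Python skeleton. The five callables are hypothetical hooks, not an API of any system discussed in the tutorial; the sketch only makes the Measure/Model, Plan, and Actuate steps explicit.

```python
def adaptivity_loop(measure, plan, actuate, execute, finished):
    """Skeleton of the adaptivity loop; measure/plan/actuate are hypothetical
    hooks and, in real systems, may run at very different timescales."""
    model = measure()                 # Measure / Model: build or refresh statistics
    current = plan(model)             # Plan: pick a plan using the current model
    actuate(current)                  # Actuate: install the plan in the executor
    while not finished():
        execute()                     # do some query-processing work...
        model = measure()             # ...then go around the loop again
        candidate = plan(model)
        if candidate != current:      # re-actuate only if the plan changed
            actuate(candidate)
            current = candidate
```

The slide's point is that these steps need not run at the same timescale: System R measures offline and plans/actuates once per query, while INGRES goes around the whole loop many times within a single query.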

  8. An example query (over Professor, Course, Student) • SQL: SELECT * FROM Professor P, Course C, Student S WHERE P.pid = C.pid AND S.sid = C.sid • QUEL: range of P is Professor range of C is Course range of S is Student RETRIEVE (P.ALL, C.ALL, S.ALL) WHERE P.pid = C.pid AND S.sid = C.sid

  9. System R Optimizer [figure: "UPDATE STATISTICS" gathers cardinalities and index lo/hi keys; at "SELECT * FROM …" time, a dynamic-programming optimizer uses them to pick a plan]

  10. System R Adaptivity: Measure/Model → Plan → Actuate • Note different timescales

  11. INGRES "Query Decomposition" 1 [figure: the one-variable query processor (OVQP) first applies the selections ("RANGE OF P IS …"), materializing hashed temps; then, for one P tuple, a substituted one-variable query "RANGE OF C_T IS … WHERE C_T.pid=44 …" is answered by OVQP from the hashed temps]

  12. INGRES "Query Decomposition" 1 (cont.) [figure: as on slide 11, plus the next substituted one-variable query "RANGE OF T_T IS … WHERE T_T.sid = 273", whose OVQP execution produces output tuples]

  13. INGRES "Query Decomposition" 2 [figure: for another P tuple, after the selections ("RANGE OF P IS …", hashed temps), the substituted one-variable queries are "RANGE OF CT IS … WHERE CT.pid=26 …" and "RANGE OF ST IS … WHERE ST.sid=441", producing output tuples]

  14. INGRES: Post-Mortem [figure: the per-tuple choices amount to two different join trees over P, C, S, mixing hash joins and index nested-loops (INL)] • Case 1: P, S, PC • Case 2: P, PC, S • Plan choice determined by number of C matches per P • Each P tuple is either Type 1 or Type 2.

  15. Horizontal Partitioning • “Post-mortem” behavior • Horizontal partitioning of inputs into different static plans [Ives02] • “Driving” input relation effectively partitioned by join keys • Each partition participates in a different static plan • Recurses up each different join tree • End result can be described as a union of static plans over partitions • In general, many such plans! • Note: post-mortem always has a relational description of some sort • But often “unusual”: plans that are simply not considered in System R! • Often cannot know the partitioning prior to query execution • So: plan-space and adaptivity loop settings have strong interactions! • A theme we’ll see throughout.

  16. INGRES Adaptivity: Measure/Model → Plan → Actuate • All ‘round the loop each time…

  17. Observations on 20thC Systems • Both INGRES & System R used adaptive query processing • To achieve data independence • They “adapt” at different timescales • Ingres goes ‘round the whole loop many times per query • System R decouples parts of loop, and is coarser-grained • measurement/modeling: periodic • planning/actuation: once per query • Query Post-Mortem reveals different relational plan spaces • System R is direct: each query mapped to a single relational algebra stmt • Ingres’ decision space generates a union of plans over horizontal partitions • this “super-plan” not materialized -- recomputed via FindMin • Both have zero-overhead actuation • Never waste query processing work

  18. 20th Century Summary • System R’s optimization scheme deemed the winner for 25 years • Nearly all 20thC research varied System R’s individual steps • More efficient measurement (e.g. sampling) • More efficient/effective models (samples, histograms, sketches) • Expanded plan spaces (new operators, bushy trees, richer queries and data models, materialized views, parallelism, remote data sources, etc) • Alternative planning strategies (heuristic and enumerative) • Speaks to the strength of the scheme • independent innovation on multiple fronts • as compared with tight coupling of INGRES • But… minimal focus on the interrelationship of the steps • Which, as we saw from Ingres, also affects the plan space

  19. 21st Century Adaptive Query Processing • (well, starts in late 1990’s) • Revisit basic architecture of System R • In effect, change the basic adaptivity loop! • As you examine schemes, keep an eye on: • Rate of change in the environment that is targeted • How radical the scheme is wrt the System R scheme • ease of evolutionary change • Increase in plan space: are there new, important opportunities? • even if environment is ostensibly static! • New overheads introduced • How amenable the scheme is to independent innovation at each step • Measure/Analyze/Plan/Actuate

  20. Tangentially Related Work • An incomplete list!!! • Competitive Optimization [Antoshenkov93] • Choose multiple plans, run in parallel for a time, let the most promising finish • 1x feedback: execution doesn’t affect planning after the competition • Parametric Query Optimization [INSS92, CG94, etc.] • Given partial stats in advance. Do some planning and prune the space. At runtime, given the rest of statistics, quickly finish planning. • Changes interaction of Measure/Model and Planning • No feedback whatsoever, so nothing to adapt to! • “Self-Tuning”/“Autonomic” Optimizers [CR94, CN97, BC02, etc.] • Measure query execution (e.g. cardinalities, etc.) • Enhances measurement, on its own doesn’t change the loop • Consider building non-existent physical access paths (e.g. indexes, partitions) • In some senses a separate loop – adaptive database design • Longer timescales

  21. Tangentially Related Work II • Robust Query Optimization [CHG02, MRS+04, BC05, etc.] • Goals: • Pick plans that remain predictable across wide ranges of scenarios • Pick least expected cost plan • Changes cost function for planning, not necessarily the loop. • If such functions are used in adaptive schemes, less fluctuation [MRS+04] • Hence fewer adaptations, less adaptation overhead • Adaptive query operators [NKT88, KNT89, PCL93a, PCL93b] • E.g. memory-adaptive sort and hash-join • Doesn’t address whole-query optimization problems • However, if used with AQP, can result in complex feedback loops • Especially if their actions affect each other’s models!

  22. Extended Topics in Adaptive QP • An incomplete list!! • Parallelism & Distribution • River [A-D03] • FLuX [SHCF03, SHB04] • Distributed eddies [TD03] • Data Streams • Adaptive load shedding • Shared query processing

  23. Adaptive Selection Ordering (section title slide)

  24. Selection Ordering • Complex predicates on relations are common • E.g., on an employee relation: ((salary > 120000) AND (status = 2)) OR ((salary between 90000 and 120000) AND (age < 30) AND (status = 1)) OR … • Selection ordering problem: decide the order in which to evaluate the individual predicates against the tuples • We focus on evaluating conjunctive predicates (containing only ANDs) • Example query: select * from R where R.a = 10 and R.b < 20 and R.c like ‘%name%’;

  25. Why Study Selection Ordering • Many join queries reduce to this problem • Queries posed against a star schema • Queries where only pipelined left-deep plans are considered • Queries involving web indexes • Increasing interest in recent years • Web indexes [CDY’95, EHJKMW’96, GW’00] • Web services [SMWM’06] • Data streams [AH’00, BMMNW’04] • Sensor Networks [DGMH’05] • Similar to many problems in other domains • Sequential testing (e.g. for fault detection) [SF’01, K’01] • Learning with attribute costs [KKM’05]

  26. Why Study Selection Ordering • Simpler to understand and analyze • Many fundamental AQP ideas can be demonstrated with these queries • Very good analytical and theoretical results known • No analogues for general multi-way joins • Big differences to look out for • These queries are stateless; queries involving joins are not stateless • No burden of routing history • Selections are typically very inexpensive • The costs of AQP techniques become important

  27. Execution Strategies • Pipelined execution (tuple-at-a-time) [figure: R → R.a = 10 → R.b < 20 → R.c like … → result]: for each tuple r ∈ R, apply predicate R.a = 10 first; if the tuple satisfies the selection, apply R.b < 20; if both are satisfied, apply R.c like ‘%name%’ • Operator-at-a-time execution [figure: R → R.a = 10 → materialize R1 → R.b < 20 → materialize R2 → R.c like … → result]: apply predicate R.a = 10 to all tuples of R and materialize the result as R1; apply predicate R.b < 20 to all tuples of R1 and materialize the result as R2; …

  28. Execution Strategies (cont.) [figure: the same two plans as slide 27] • Pipelined execution (tuple-at-a-time): preferred for selection ordering • The two strategies are fundamentally different from an adaptivity perspective
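
To make the contrast concrete, here is a small Python sketch of both strategies for the example query's three predicates. The relation is just a list of dicts and the predicate list is an illustrative stand-in for real operators, not anything from the tutorial.

```python
# Hypothetical stand-ins for the three selections of the example query.
predicates = [
    lambda r: r["a"] == 10,
    lambda r: r["b"] < 20,
    lambda r: "name" in r["c"],
]

def pipelined(R, preds):
    """Tuple-at-a-time: each tuple flows through the predicates in order and
    drops out at the first predicate it fails; nothing is materialized."""
    return [r for r in R if all(p(r) for p in preds)]  # all() short-circuits

def operator_at_a_time(R, preds):
    """Operator-at-a-time: apply each predicate to the whole input and
    materialize the intermediate result (R1, R2, ...) before the next one."""
    current = R
    for p in preds:
        current = [r for r in current if p(r)]  # materialized intermediate
    return current

R = [{"a": 10, "b": 5, "c": "a name"}, {"a": 3, "b": 1, "c": "x"}]
assert pipelined(R, predicates) == operator_at_a_time(R, predicates)
```

Both compute the same result; what matters for adaptivity is that the pipelined plan never materializes intermediate results and could, in principle, change the predicate order from one tuple to the next, while the operator-at-a-time plan commits to an order for the entire input at each step.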

  29. Outline • 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Setting and motivation • Four Approaches • Static Selinger-style optimization • KBZ Algorithm for independent selections [KBZ’86] • A 4-approx greedy algorithm for correlated selections [BMMNW’04] • Mid-query reoptimization [KD’98] • Adapted to handle selection ordering • A-Greedy [BMMNW’04] • Eddies [AH’00] • Other related work • Adaptive Join Processing • Research Roundup

  30. Static Selinger-style Optimization [figure: two of the possible plans, each applying the three selections to R in a different order] • Find a single order of the selections to be used for all tuples • Query: select * from R where R.a = 10 and R.b < 20 and R.c like ‘%name%’; • Query plans considered: 3! = 6 distinct plans possible

  31. Static Selinger-style Optimization [figure: the plan R → R.a = 10 → R.b < 20 → R.c like … → result] • Cost metric: CPU instructions • Computing the cost of a plan: need to know the costs (c1, c2, c3) and the selectivities (s1, s2, s3) of the predicates • Cost per tuple of the three operators (independence assumption): c1, s1 * c2, s1 * s2 * c3 • cost(plan) = |R| * (c1 + s1 * c2 + s1 * s2 * c3)
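
To make the formula concrete, here is a small Python sketch that computes the cost of a pipelined selection order from per-predicate costs and selectivities; the function name and example numbers are illustrative, not from the tutorial.

```python
def plan_cost(card_R, costs, sels):
    """cost(plan) = |R| * (c1 + s1*c2 + s1*s2*c3 + ...), assuming independent
    selectivities; `costs` and `sels` are given in the order of application."""
    per_tuple, surviving = 0.0, 1.0
    for c, s in zip(costs, sels):
        per_tuple += surviving * c   # only tuples that passed so far reach this predicate
        surviving *= s               # fraction of tuples surviving this predicate
    return card_R * per_tuple

# e.g. |R| = 1000, unit costs, selectivities 0.05, 0.1, 0.2:
print(plan_cost(1000, [1, 1, 1], [0.05, 0.1, 0.2]))  # ≈ 1000 * (1 + 0.05 + 0.005) = 1055
```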

  32. Static Selinger-style Optimization • Dynamic programming algorithm: compute the optimal order and cost for the 1-subsets of predicates (R.a = 10; R.b < 20; R.c like …), then the 2-subsets (R.a = 10 AND R.b < 20; R.a = 10 AND R.c like …; …), then the 3-subset (R.a = 10 AND R.b < 20 AND R.c like …) • Selectivities of 1-subsets estimated using 1-d histograms or random samples etc.; selectivities of larger subsets estimated using 2-d histograms or random samples, or by assuming independence • Complexity: O(2^n)
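
Under the independence assumption, the subset dynamic program can be sketched in a few lines of Python; a real optimizer would estimate the selectivity of larger predicate subsets from 2-d histograms or samples rather than multiplying 1-d selectivities, as the slide notes. This is a sketch of the idea only.

```python
from itertools import combinations
from math import prod

def best_order_dp(costs, sels):
    """O(2^n) dynamic program: best[S] = cheapest per-tuple cost of applying the
    predicate subset S, built bottom-up; assumes independent selectivities."""
    n = len(costs)
    best = {frozenset(): (0.0, [])}                 # subset -> (per-tuple cost, order)
    for k in range(1, n + 1):
        for subset in map(frozenset, combinations(range(n), k)):
            choices = []
            for last in subset:                     # try each predicate as the last one applied
                rest = subset - {last}
                rest_cost, rest_order = best[rest]
                reach = prod(sels[i] for i in rest) # fraction of tuples reaching `last`
                choices.append((rest_cost + reach * costs[last], rest_order + [last]))
            best[subset] = min(choices)
    return best[frozenset(range(n))]

print(best_order_dp([1, 1, 1], [0.2, 0.05, 0.1]))   # -> (≈1.055, [1, 2, 0]): sort by selectivity
```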

  33. Static Selinger-style Optimization • KBZ algorithm for independent selections [KBZ’86] • Apply the predicates in the decreasing order of: (1 – s) / c where s = selectivity, c = cost • Correlated selections • NP-hard under several different formulations • E.g. when given a random sample of the relation • Greedy algorithm: • Apply the selection with the highest (1 - s)/c • Compute the selectivities of remaining selections over the result • Conditional selectivities • Repeat • Can be shown to be 4-approximate [BMMNW’04] • Best possible unless P = NP
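
Both rules can be sketched in Python. For the correlated case, the sketch below stands in for the "statistics" with a random sample of the relation and measures conditional selectivities by filtering that sample; the predicates, sample, and unit costs are illustrative assumptions, not the exact procedures of [KBZ'86] or [BMMNW'04].

```python
import random

def kbz_order(costs, sels):
    """Independent selections: apply predicates in decreasing (1 - s) / c order."""
    return sorted(range(len(costs)), key=lambda i: (1 - sels[i]) / costs[i], reverse=True)

def greedy_correlated_order(preds, costs, sample):
    """Greedy rule for correlated selections: repeatedly pick the predicate with
    the highest (1 - s) / c, where s is its selectivity conditioned on the
    predicates already chosen, measured on the sample."""
    remaining, order, survivors = list(range(len(preds))), [], list(sample)
    while remaining:
        def rank(i):
            s = sum(1 for r in survivors if preds[i](r)) / max(len(survivors), 1)
            return (1 - s) / costs[i]
        nxt = max(remaining, key=rank)
        order.append(nxt)
        remaining.remove(nxt)
        survivors = [r for r in survivors if preds[nxt](r)]  # condition on the chosen predicate
    return order

# Illustrative usage on a synthetic random sample of tuples (a, b, c):
sample = [{"a": random.randint(0, 20), "b": random.randint(0, 40),
           "c": "name" * random.randint(0, 1)} for _ in range(1000)]
preds = [lambda r: r["a"] == 10, lambda r: r["b"] < 20, lambda r: "name" in r["c"]]
print(kbz_order([1, 1, 1], [0.05, 0.1, 0.2]))            # -> [0, 1, 2]
print(greedy_correlated_order(preds, [1, 1, 1], sample))
```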

  34. Static Selinger-Style: Measure/Model → Plan → Actuate

  35. Outline • 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Setting and motivation • Four Approaches • Static Selinger-style optimization • KBZ Algorithm for independent selections [KBZ’86] • A 4-approx greedy algorithm for correlated selections [BMMNW’04] • Mid-query reoptimization [KD’98] • Adapted to handle selection ordering • A-Greedy [BMMNW’04] • Eddies [AH’00] • Other related work • Adaptive Join Processing • Research Roundup

  36. Mid-query Reoptimization [figure: plan R → R.a = 10 → materialize R1 → R.b < 20 → R.c like … → result, with estimated selectivities 0.05, 0.1, 0.2 for the three predicates] • At materialization points, re-evaluate the rest of the query plan • A materialization point is a free opportunity to re-evaluate the rest of the query plan: exploit it by gathering information about the materialized result • Example: the initial query plan shown above is chosen using the estimated selectivities

  37. Mid-query Reoptimization (cont.) [figure: the same plan, with the materialization step annotated "Materialize R1; build 1-d hists": while materializing R1, build 1-d histograms on it] • At materialization points, re-evaluate the rest of the query plan, exploiting the information gathered about the materialized result

  38. Mid-query Reoptimization (cont.) [figure: using the histograms on R1, the selectivities of the two remaining predicates are re-estimated as 0.5 and 0.01, versus the original estimates of 0.1 and 0.2] • Re-estimated selectivities significantly different → original plan probably sub-optimal • Reoptimize the remaining part of the query
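
A hedged sketch of the decision made at the materialization point: re-estimate the selectivities of the remaining predicates from the materialized result (here by probing a random sample of it, as a stand-in for the 1-d histograms on the slide) and re-plan only if the estimates are badly off. The sampling, the 2x threshold, and the helper names are assumptions for illustration, not details of [KD'98].

```python
import random

def maybe_reoptimize(materialized, remaining_preds, est_sels, costs,
                     sample_size=1000, threshold=2.0):
    """Returns the order (as indices into remaining_preds) to use for the rest of
    the plan; est_sels[i] is the optimizer's estimate for remaining_preds[i]."""
    sample = random.sample(materialized, min(sample_size, len(materialized)))
    observed = [sum(1 for r in sample if p(r)) / max(len(sample), 1)
                for p in remaining_preds]

    def ratio(est, obs):
        lo, hi = sorted((est, obs))
        return hi / max(lo, 1e-9)

    if all(ratio(e, o) <= threshold for e, o in zip(est_sels, observed)):
        return list(range(len(remaining_preds)))   # estimates close enough: keep the current order
    # Estimates significantly wrong: reoptimize (here, reorder) the remaining part of the query.
    return sorted(range(len(remaining_preds)),
                  key=lambda i: (1 - observed[i]) / costs[i], reverse=True)
```

With the numbers on the slide (estimates 0.1 and 0.2 versus re-estimates 0.5 and 0.01), both ratios exceed the threshold, so the two remaining selections would be reordered to apply the far more selective one first.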

  39. Mid-query Reoptimization • Explored plan space identical to static • The operators are applied to the tuples in the same order • The order is determined lazily • The specific approach equivalent to the 4-Approx Greedy algorithm • Cost of adaptivity: • Materialization cost • Many (join) query plans typically have materialization points • May want to introduce materialization points if there is high uncertainty • Constructing statistics on intermediate results • Depends on the statistics maintained • Re-optimization cost • Optimizer should be re-invoked only if the estimates are significantly wrong

  40. Mid-query Reoptimization • Advantages: • Easy to implement in a traditional query processing system • Familiar plan space; easy to optimize and understand what's going on • Operator-at-a-time query processing • Disadvantages: • Granularity of adaptivity is coarse • Once an operator starts executing, can’t change that decision • Explored plan space identical to static optimization • Can’t apply different orders to different sets of tuples • Requires materialization • Cost of materialization can be high • Ill-suited for data streams and similar environments

  41. Mid-query Adaptivity: Measure/Model → Plan → Actuate

  42. Outline • 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Setting and motivation • Four Approaches • Static Selinger-style optimization • KBZ Algorithm for independent selections [KBZ’86] • A 4-approx greedy algorithm for correlated selections [BMMNW’04] • Mid-query reoptimization [KD’98] • Adapted to handle selection ordering • A-Greedy [BMMNW’04] • Eddies [AH’00] • Other related work • Adaptive Join Processing • Research Roundup

  43. Adaptive Greedy [BMMNW’04] [figure: pipelined plan applying the three selections R.a = 10, R.b < 20, R.c like … to R] • Context: pipelined query plans over streaming data • Example: three independent predicates, each costing 1 unit, with initial estimated selectivities 0.05, 0.1, 0.2 • The optimal execution plan orders the predicates by selectivity (because the costs are identical)

  44. Adaptive Greedy [BMMNW’04] (cont.) [figure: the pipelined plan plus a Profile: a random sample of tuples (Rsample) is evaluated against all three predicates] • Monitor the selectivities: estimate the selectivities of the predicates over the tuples of the profile • Switch the order if the predicates are not ordered by selectivity • Reoptimizer: IF the current plan is not optimal w.r.t. these new selectivities THEN reoptimize using the profile

  45. Adaptive Greedy [BMMNW’04]: Correlated Selections [figure: the profile (Rsample) is evaluated against all three predicates] • Must monitor conditional selectivities, not just sel(R.a = 10), sel(R.b < 20), sel(R.c …): also sel(R.b < 20 | R.a = 10), sel(R.c like … | R.a = 10), sel(R.c like … | R.a = 10 and R.b < 20) • O(n^2) selectivities need to be monitored • Reoptimizer: uses conditional selectivities to detect violations; uses the profile to reoptimize
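
A rough sketch of the A-Greedy machinery for selections with unit costs: a profile of randomly sampled tuples is evaluated against all predicates, conditional selectivities are read off the profile, the greedy invariant is checked, and the order is recomputed from the profile when the invariant is violated. The class name, sampling rate, and violation threshold are illustrative assumptions, not the exact algorithm or data structures of [BMMNW'04].

```python
import random

class AGreedySketch:
    """Illustrative sketch of adaptive greedy selection ordering (unit costs)."""
    def __init__(self, preds, sample_prob=0.01, slack=1.2):
        self.preds = preds
        self.order = list(range(len(preds)))   # current predicate order
        self.profile = []                      # per sampled tuple: pass/fail bit per predicate
        self.sample_prob = sample_prob
        self.slack = slack                     # tolerated violation before reordering

    def process(self, r):
        if random.random() < self.sample_prob:                   # profile maintenance:
            self.profile.append(tuple(p(r) for p in self.preds)) # run ALL predicates
            self.check_and_reoptimize()        # (a real system would check only periodically)
        for i in self.order:                   # normal pipelined execution
            if not self.preds[i](r):
                return None
        return r

    def cond_sel(self, i, given):
        """sel(pred i | all predicates in `given` passed), measured on the profile."""
        hits = [row for row in self.profile if all(row[j] for j in given)]
        return sum(row[i] for row in hits) / max(len(hits), 1)

    def check_and_reoptimize(self):
        # Greedy invariant: the predicate at each position has (roughly) the lowest
        # conditional selectivity given that the predicates before it passed.
        for k in range(len(self.order)):
            given = self.order[:k]
            here = self.cond_sel(self.order[k], given)
            best = min(self.cond_sel(i, given) for i in self.order[k:])
            if here > self.slack * best:       # violation detected: rebuild the order greedily
                self.order = self.greedy_order()
                return

    def greedy_order(self):
        remaining, chosen = list(range(len(self.preds))), []
        while remaining:                       # may take several passes over the profile
            nxt = min(remaining, key=lambda i: self.cond_sel(i, chosen))
            chosen.append(nxt)
            remaining.remove(nxt)
        return chosen
```

Feeding a stream of tuples through `process` keeps the order roughly sorted by conditional selectivity; profile maintenance, periodic violation checks, and reoptimization passes over the profile are exactly the overheads discussed on the next slide.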

  46. Adaptive Greedy [BMMNW’04] • Cost of adaptivity: • Profile maintenance • Must evaluate a (random) fraction of tuples against all operators • Detecting violations • Periodic checks for detecting if the current order is optimal • Doing this per tuple too expensive • Reoptimization cost • Can require multiple passes over the profile

  47. Adaptive Greedy: Post-Mortem [figure: the tuple stream laid out by order of arrival, with a plan switch point; tuples arriving before the switch see one predicate order, tuples after it see another] • Plan space explored: “horizontal partitioning” by order of arrival • If the selectivities are correlated with tuple arrival order, this can lead to huge savings

  48. Adaptive Greedy [BMMNW’04] • Advantages: • Can adapt very rapidly • Theoretical guarantees on performance • Not known for any other AQP protocols • Disadvantages: • Limited applicability • Only applies to selection ordering and specific types of join queries • Possibly high runtime overheads • Several heuristics described in the paper

  49. A-Greedy Adaptivity: Measure/Model → Plan → Actuate

  50. Outline • 20th Century Adaptivity: Intuition from the Classical Systems • Adaptive Selection Ordering • Setting and motivation • Four Approaches • Static Selinger-style optimization • KBZ Algorithm for independent selections [KBZ’86] • A 4-approx greedy algorithm for correlated selections [BMMNW’04] • Mid-query reoptimization [KD’98] • Adapted to handle selection ordering • A-Greedy [BMMNW’04] • Eddies [AH’00] • Other related work • Adaptive Join Processing • Research Roundup
