Adaptive Query Processing

Adaptive Query Processing Amol Deshpande, University of Maryland Vijayshankar Raman, IBM Almaden Research Center Title slide

Goal • Research in query processing and optimization has taken a new track into adaptive query processing (QP) • Most new research projects that need QP use some adaptive approach • Whether regular QP, XML QP, text, continuous QP, web QP, … • Selinger-style query processing fallen out of flavor • Also true in industrial projects, especially new ones. • Educate audience on what this is all about • Where adaptive QP works, where it has known problems, what is unknown • What is Out of Scope of this tutorial • Parallelism, distributed eddies, continuous query processing…

Agenda • Background on non-adaptive Query Processing • History, where it works, and where it breaks • Adaptive Query Processing • Dimensions of Adaptivity • Evolutionary Adaptive query processing • Late binding and parametric optimization • Competition • Mid-query reoptimization • Tuple-wise adaptive Query Processing • Eddies, SteMs, STAIRs • Commentary • What works and what fails • Open problems

Background on Non-Adaptive Query Processing • Central value proposition of relational model: declarative queries • You decide what data access (query) you want to do; the DBMS will figure out the “right” algorithm (plan) • “right” usually translated to “good enough” • Standard method of realizing this: cost-based query optimization

Cost-based Query Optimization Results Query Optimizer Query Executor Compiled Query Plan Declarative Query Disk(s)

Non-Adaptive Query Processing • Several Improvements over the years • Wider plan space • Started out with left deep tree of binary join operators • Now: • lots more operators • Many more query transformations • multi-block SQL • bushy plans • … • Plan space has grown to be huge and messy • Use of statistics to get better costs • Better dynamic programming techniques (transformational, memoization) • Good surveys exist: [Graefe 1993], [Chaudhuri 1998]

Success of Non-Adaptive QP • Used pretty much everywhere • every commercial DBMS uses some form of cost-based optimization • Why? • Plan space is too wide and combinatorial • Other ideas like randomized optimization simply don’t search widely enough(e.g. all possible ways of using indexes) • Benefits from 25 years of tuning • Widely viewed as success story for relational databases, and as vindication of relational model • Some qualifications • DBAs love and demand control (hints, syntax-direction, plan-fixing across releases, …) • Applications often help the query processor by splitting a query into multiple phases (precursor of adaptivity)

Where does traditional optimization break (1) • In recent years, optimizers have started to break at the seams as they are extended to new environments SELECT p.profile_id, p.account_id, p.profile_title, p.profile_default, r.profile_detail_id AS registrant_detail_id, r.profile_first_name AS registrant_first_name, r.profile_last_name AS registrant_last_name, r.profile_job_title AS registrant_job_title, r.profile_organization_name AS registrant_organization_name, r.profile_address1 AS registrant_address1, r.profile_address2 AS registrant_address2, r.profile_city AS registrant_city, r.state_id … FROM TABLE_ACCOUNT_PROFILES p LEFT JOIN TABLE_ACCOUNT_PROFILE_DETAILS r ON p.profile_id = r.profile_id AND r.profile_type_id = 1 LEFT JOIN TABLE_STATES rs ON r.state_id = rs.state_id LEFT JOIN TABLE_COUNTRIES rc ON … Results very complex user queries Query Optimizer Query Executor Declarative Query Disk(s)

Where does traditional optimization break (1) Network Disk(s) • In recent years, optimizers have started to break at the seams as they are extended to new environments wide area federations SELECT p.profile_id, p.account_id, p.profile_title, p.profile_default, r.profile_detail_id AS registrant_detail_id, r.profile_first_name AS registrant_first_name, r.profile_last_name AS registrant_last_name, r.profile_job_title AS registrant_job_title, r.profile_organization_name AS registrant_organization_name, r.profile_address1 AS registrant_address1, r.profile_address2 AS registrant_address2, r.profile_city AS registrant_city, r.state_id … FROM TABLE_ACCOUNT_PROFILES p LEFT JOIN TABLE_ACCOUNT_PROFILE_DETAILS r ON p.profile_id = r.profile_id AND r.profile_type_id = 1 LEFT JOIN TABLE_STATES rs ON r.state_id = rs.state_id LEFT JOIN TABLE_COUNTRIES rc ON … Results very complex user queries Query Optimizer Query Executor Declarative Query X M L data streams

Where does traditional optimization break (2) • Bad Statistics • DBA-challenged environment is the norm • New data types like XML and text • Deep Web and federated systems • Correlations (many optimizers implement only marginal histograms) • Very dynamic environments • Continuous queries – query runs for ever, so things change • Wide area federations

Where does traditional optimization break (3) • Complex queries • Large numbers of tables (usually bail to greedy beyond 10-20 tables in a join) • Queries with Parameter markers • Select average(salary) from Employee where age > $1 and age < $2 • Need for sharing work between queries – esp. in continuous queries • Asynchronous (push) data sources (e.g. data streams) • New Metrics for QP • E.g., Interactive metric • User preferences change too quickly for optimization to pay off • Pipelined plans: focus on early results

Idea of Adaptive Query Processing • Interleave the optimization and execution stages • We learn better statistics as the query executes • Two main styles Evolutionary Revolutionary Static Late Inter Intra Per Plans Binding Operator Operator Tuple Traditional Dynamic QEP Query Scrambling Xjoin, DPHJ, Eddies DBMS Parametric Mid-query Reopt. Convergent QP Competitive Progressive Opt.

Idea of Adaptive Query Processing • Interleave the optimization and execution stages • We learn better statistics as the query executes • Two main styles Evolutionary Revolutionary Static Late Inter Intra Per Plans Binding Operator Operator Tuple Traditional Dynamic QEP Query Scrambling Xjoin, DPHJ, Eddies DBMSParametricMid-query Reopt. Convergent QP Competitive Progressive Opt.

Dimensions of Adaptation • Measure: monitor cardinalities, resource utilization, etc • Analyze: verify compliance with performance goals (very tricky problem during intermediate stages of a query) • Plan: can involve optimizer, or be a routing policy • Actuate: switch to a new query plan, hopefully without wasting too much work Measure Actuate Analyze Plan

Dimensions of Adaptation • Traditional database system use a simple form of inter-query adaptation • Measurements done on the tables directly, and analyzed during query optimization time • Also, by measuring the cardinalities during execution: • Adaptive selectivity estimation [Chen/Roussopoulos 1994] • LEO [Volker et al 2001] • SITS [Bruno/Chaudhuri 2002] • But always a single plan is used to execute the query fully Measure Actuate Analyze Plan

Adaptive Query Processing Evolutionary Revolutionary Static Late Inter Intra Per Plans Binding Operator Operator Tuple Traditional Dynamic QEP Query Scrambling Xjoin, DPHJ, Eddies DBMSParametricMid-query Reopt. Convergent QP Competitive Progressive Opt.

Late Binding and Parametric Optimization • Idea: let optimizer pick multiple plans for each query • Choose the best one just before execution Pick one and Actuate Plan Q Plan Plans

Late Binding and Parametric Optimization Plan 2 Plan 3 Plan 4 Plan 1 Plan 5 • Parametric optimization [Graefe/Cole 1994, Ioannidis 1992, Ganguly 1998, etc.] • Pick optimal plans for various regions of the parameter spaces • Do this during dynamic programming itself • Using plausible assumptions about cost functions • Challenges: • makes optimization too expensive • Many plausible assumptions don’t hold in practice (see, e.g., [Haritsa 2005]) • Cannot address problems that 1st crop up during execution selectivity 2 selectivity 1

Late Binding and Parametric Optimization • Variation: Validity Ranges ([Markl/Raman/Lohman/Piraheh/Simmens 2005]) • Attach to each edge of a query plan, a range of cardinalities within which plan is optimal • Easier to compute during dynamic programming optimizer • Use numerical root finding techniques; works well with real cost functions • Extends nicely to more general notions of “robust query optimization” • e.g., range within which plan is within 50% of optimal (Raman/Haas/Reder/Markl 2005). • But, Limited effectiveness • only tells you if a plan is broken, not how to fix it • Meshes well with mid-query reoptimization

Competition Measure • E.g. DEC RDB [Antonshenkov 1993]: run access methods competitively • Measure: monitor cardinalities, costs of access methods • Analyze: decide which access method has higher rate • Advantage: simple method • Challenge: avoid redundant work, avoid duplicates • Use a shared hash table & global timestamp [Raman/Deshpande/Hellerstein 2003] • Needs more evaluation of overheads Run for a while Actuate a few Analyze Pick faster one Q Plan Plan Plans

Mid-Query Reoptimization Measure Actuate Analyze New Plan Q Plan • Kabra et al 1998, Markl et al 2005, Babu et al 2005 • Switch query execution plans at well-defined points • Typically, materialization points • Advantage: • seems to work well in practice • Independent implementations • Challenge: • coarse grained adaptation • Cannot always reuse work done in prior executions If plan is sufficiently sub-optimal, switch to new plan

Measure: monitoring cardinalities • During execution, monitor cardinalities on each edge of query plan • E.g., each call to getNext() increments count • must refine for operators that are executed multiple times(e.g., inner of nested loops joins) [e.g. Stillger/Markl/Lohman 2001] • As we monitor, we narrow down range of possible values for true cardinality (primarily the lower bound) • Can extrapolate to expected value of true cardinality • E.g.: • Expected value of c2 = expected value of c1 * (current c2) / (current c1) c1 c2 op

Analyze: Switching plans Mid-Query T S R re-optimize S T Re-optimize the original query to form a new plan • Getting a better plan: • Plug in all cardinality information learned during this query • Reusing work: • Plug in fully materialized relations from current executionas materialized views • Materialization points • Sorts • Inner of hash joins • Explicit materializations (uncorrelated inners, common subexpressions, …)

Analyze: when to switch plans • When no tuples have been output, and • When current plan is sufficiently sub-optimal, and • Range of true cardinalities on an edge has no overlap with validity range for that edge • Or, expected true cardinality falls far outside of validity range • Cost of reoptimization > cost of new plan – remaining cost of current plan • Hard to gauge this accurately; many variables • How many intermediate results we will succeed in reusing • how much better is the new plan • Relatively easy to gauge this if we switch at a materialization point • Because cost of reoptimization is negligible

Query Scrambling Measure Actuate Analyze New Plan Q Plan • Amsaleg/Franklin/Tomasic/Urhan 1996, Urhan/Franklin/Amsaleg 1998 • Designed for coping with delays in a wide area setting Query Processor Measure delays Network Response time optimization Wide Area Data Sources

Agenda • Background on non-adaptive Query Processing • History, where it works, and where it breaks • Adaptive Query Processing • Dimensions of Adaptivity • Evolutionary Adaptive query processing • Tuple-wise adaptive Query Processing • Eddies, SteMs, STAIRs • Commentary • What works and what fails • Open problems

Eddies [Avnur/Hellerstein 2000] Eddies combine all the dimensions of adaptivity into a single operator Measure Eddy Actuate Analyze Plan

Eddies [Avnur/Hellerstein 2000] SE EC SE EC A traditional query plan: queries executed using iterator (getNext) model Query execution using an eddy Output Eddy S Output E C C • An eddy operator • Intercepts tuples from sources and output tuples from operators • Uses feedback from the operators to route S E

Example Database StudentsEnrolled EnrolledCourses select * from students, enrolled, courses where students.name = enrolled.name and enrolled.course = courses.course Courses Students Enrolled

Query Execution using Eddies SE Probe to find matches Insert with key hash(joe) HashTable E.Name HashTable S.Name Eddy S E Output C HashTable C.Course HashTable E.Course No matches; Eddy processes the next tuple EC

Query Execution using Eddies SE E C Probe Insert HashTable E.Name HashTable S.Name Eddy S E Output C HashTable C.Course HashTable E.Course

Query Execution using Eddies SE EC Probe HashTable E.Name HashTable S.Name Eddy S E Output C HashTable C.Course HashTable E.Course Probe Note: this is a symmetric hash join

Eddies: Postmortem Output Output • Eddy executes different query execution plans for different parts of data • Eddy adapts the join order during query execution • Access methods and join algorithms still chosen up front EC ES SE Courses CE Students Students Enrolled Courses Enrolled

Eddies and Pipelined Joins • Simplest way to describe eddies is in terms of symmetric hash join • This is the approach we used in our example • Approach extends directly to index nested loops joins • Can also extend this to non-pipelined joins (e.g. sort-merge, or hybrid hash) • BUT, opportunities for adaptation are much less • Only at end of building whole hash table (just like with mid-query re-optimizations) • Generalized using SteMs and STAIRs

Eddies: Routing Policy – Non-adaptive Output EC SE EC SE Courses Students Enrolled • Choosing which operator to route a given tuple to • The brain of the eddy Send S and E tuples here Returns S JOIN E (SE) tuples Non-Adaptive Optimization 1. Run a regular optimizer to figure out the join order 2. Map join order into order of tuple routing Eddy S E Output C Send SE, and C tuples here Returns SEC tuples

Eddies: Routing Policy – Lottery Scheduling SE EC • Choosing which operator to route a given tuple to • The brain of the eddy Send here 99% of the time Send to the other operator 1% of the time sent = 100 received = 2 Lottery Scheduling [Avnur 00] Simplified Description 1. Maintain for each operator: tuples sent tuples returned cost per tuple 2. Choose (roughly) based on the above 3. Explore by randomly sending tuples in the wrong orders Eddy S E Output C sent = 10 received = 20

Eddies: Routing Policy – Statistics Based SE EC • Choosing which operator to route a given tuple to • The brain of the eddy [Deshpande, Hellerstein 04] 1. Learn statistics on each input table as data streams in 2. Choose (roughly) based on the statistics (analogous to traditional QO) Eddy S Output E C

Eddies: Routing Policy – Interactivity Metric SE EC • Choosing which operator to route a given tuple to • The brain of the eddy [Raman, Hellerstein 02] Use user preferences for different kinds of tuples and different kinds of result columns to determine the routing order Eddy S Output E C

Agenda • Background on non-adaptive Query Processing • History, where it works, and where it breaks • Adaptive Query Processing • Dimensions of Adaptivity • Evolutionary Adaptive query processing • Tuple-wise adaptive Query Processing • Eddies, SteMs, STAIRs • Commentary • What works and what fails • Open problems

State Modules (SteMs): More aggressive adaptation P SteMR SteMP hash jn + ind.jn Eddy Eddy Eddy R P P R P R • Eddies allow dynamic adaptation of join order • Within a particular spanning tree of the join graph • Other adaptations: • Access methods: especially index vs scan access methods • Join algorithms: e.g., index join (IJ) vs hash join (HJ) • To guard against wrong cardinality estimates • To simultaneously optimize for both pipelining & completion time (IJHJ) • To deal with memory overflows (HJIJ)

Query Processing with SteMs SteMR build R probe ST RSTmatches • Eddy execs queries by routing tuples to SteMs and AMs • SteM • Dictionary of homogeneous tuples • (insert) build and (search) probe operations • Access modules (AMs) • probe with tuples and get matches • One Example: Symmetric Hash Join • Build SteMR • Probe SteMS with R tuple • Build SteMS • Probe SteMR with R tuple • Generalizes to n-ary join • Can simulate other join algorithms, by using other routing policy • Thus, changing the routing allows adaptation of join algorithm • Subtle routing constraints needed to ensure correct execution [Raman et al 2003] S bld R bld S probe R probe Eddy repeat S R

Another application of SteMs • Sharing State across queries • E.g. 1000 queries of the form:select * from R where R.a = parameter • Equivalent to:select * from R, Q where R.a = Q.parameter • Eddy routes tuples from R SteM and queries from Q SteM • Useful for continuous-query proc. and multi-query proc • Need to extend SteMs with eviction (deletion) operation – for window joins • See Madden et. al paper in SIGMOD 2002

Agenda • Background on non-adaptive Query Processing • History, where it works, and where it breaks • Adaptive Query Processing • Dimensions of Adaptivity • Evolutionary Adaptive query processing • Tuple-wise adaptive Query Processing • Eddies, SteMs, STAIRs … • Commentary • What works and what fails • Open problems

Query execution using STAIRS [Deshpande/Hellerstein 2004](Storage, Transformation and Access for Intermediate Results) S.Name STAIR Build intoS.Name STAIR HashTable E.Name STAIR HashTable Eddy S Output E C HashTable HashTable E.Course STAIR C.Course STAIR Extend SteMs to handle reuse of intermediate results Probe into E.Name STAIR s1 s1 s1 s1

Agenda • Background on non-adaptive Query Processing • History, where it works, and where it breaks • Adaptive Query Processing • Dimensions of Adaptivity • Evolutionary Adaptive query processing • Tuple-wise adaptive Query Processing • Eddies, SteMs, STAIRs … • Commentary • What works and what fails • Open problems

Adaptive Query Processing