Lecture 9: Query Optimization

Lecture 9: Query Optimization Sept. 26 2006 ChengXiang Zhai Most slides are adapted from Kevin Chang’s lecture slides

DBMS Architecture Today’s lecture Past lectures User/Web Forms/Applications/DBA query transaction Query Parser Transaction Manager Query Rewriter Logging & Recovery Query Optimizer Lock Manager Query Executor Files & Access Methods Lock Tables Buffers Buffer Manager Main Memory Storage Manager Storage

SQL Constructs • SECLECT (DISTINCT) <list of columns> • FROM <list of tables> • WHERE <list of Boolean Factors> • GROUP BY <list of columns> • HAVING <list of Boolean Factors> • ORDER BY <list of columns>

SQL Semantics • Take Cartesian product of FROM tables • Project only those referenced columns • WHERE: apply all filters in WHERE • GROUP BY: form groups on results • HAVING: apply filter to groups • ORDER BY: make sure results in right order • DISTINCT: remove duplicates • Q: Is this “operational semantics” efficient? • Different plans: mainly different in the first three

Optimization: Different Strategies • Optimal approach: • enumerate each possible plan • measure its performance by running it • pick the fastest one • Heuristics approach: • fixed heuristics all the way through plan construction • e.g.: always nested loop joins, indexed relation as inner • e.g.: order relations from smallest to biggest • How does Selinger/System R differ from them? • What are the main components in their approach?

New Paradigm: Cost-based Optimization • Plan space: • what is the space of query plans? • Cost estimation: • how to estimate the cost, without executing each? • Search algorithm: • how to search the space, as guided by cost estimates

Space of Query Plans What can you do differently? Parameters to tune: • Selections: • Joins:

Space of Query Plans • Selections: • algorithms: sequential, index scan • ordering: why this will matter? • Joins: • algorithms: nested-loop, sort merge, hash • ordering • Ordering/Grouping: • can an “interesting order” be produced by join/selections? • algorithm: sorting, hash-based • They interleave with each other!

Assumptions to Reduce the Space? • Projections: • pushed down to reduce # of columns • Selections: • Joins:

Huge Space! Assumptions to Help Typical assumptions to help reduce the space: • Projections: • pushed down to reduce # of columns • Selections: • pushed down to reduce # of rows • Joins: • left-deep joins (what else can you do?) • avoid Cartesian products; delay it in the plan • Q: how to avoid Cartesian products? • May miss an optimal plan!

Cost/Size Estimation • Accurate relatively: • goal is to compare plans, not to predict exact cost • more of an art than an exact science • Each operator: input size, cost, output size • estimate cost based on input size • estimate output size (for next operator) or selectivity • selectivity = ratio of output to input

Cost Estimation: Selinger Style • Input: simple DB statistics • # of tuples & disk pages • # of distinct values per column • statistics updated periodically • Assumption of attribute/predicate independence • When no estimate available, use magic number • New/Alternative approaches: • sampling, histogram of DB

Selectivity Factors: Point Selection • [column = value] • [dept = CS] • input size: NCARD(Student) = 200, ICARD(dept) = 10 • output size = ? selectivity = ?

Selectivity Factors: Range Selection • What is assumed in these formulas? • column > value1: • F = (maxValue - value1) / rangeOfValue • column in value1:value2 • F = (value2 - value1) / rangeOfValue • column in (set of values): • F: as union of point selections

Selectivity Factors: Join Predicates • column1 = column2: • when both columns are keys: • student.sid = employee.sid • one key, the other foreign key • student.sid = enrollment.sid • student: NCARD = 200, SID distinct values = 200 • enrollment: NCARD = 800, SID distinct values = 100 • 800 * (200/200) = 800 • none are keys: • user.sid = enrollement.sid • user: NCARD = 1000, SID distinct values = 200 • enrollment: NCARD = 800, SID distinct values = 100 • 800 * (1000/200) = 4000 • given formula covers all?

Goal: Cost Estimate • System R cost model: • Sum of I/O and CPU • (# PAGE) + W * (RSI CALLS) • why RSI calls model workload of RSS?

Search the Plan Space • Baseline: exhaustive search • enumerate all combinations, and compare their cost • Search method parameters: • plan tree development: • construction: bottom-up, top-down • modification: improve a somehow-constructed tree • algorithms: • heuristic selections: make choices based on heuristics • branch and bound: search bounded by the current best tree • hill climbing: find “nearby” plans with lowest cost • dynamic programming: construction by greedy selections • where does System-R approach fit in?

Plan Search: System-R Style AKA: Selinger style optimization • Bottom-up • start from ground relations (in FROM) • work up the tree to form a plan • Dynamic programming • greedily prune subtrees that are obviously useless • but not necessarily local best at every step: ? why • Many other approaches…

DP: Local Optimal -> Global Optimal? Think of optimizing (T1 QT2 Q T3) Q T4 Q T5… • Sometime it does hold • does plan (T1 QT2 Q T3) affect the rest of the plan? • Not always true • example? • but hopefully at least give good plans

System R Search: Start from Relations • Base relations access: • find all plans for accessing each base relations • push down “sargable” arguments • ? example of non-sargable predicates? • choose good plans, discard bad ones • keep cheapest for unordered & each interesting order • Join ordering: • find all ways of joining a pair of base relations • choose good plans, discard bad ones

WHERE Clause and Sargable Predicates • WHERE conditions: [dept = CS] OR ([dept = EE] AND [GPA > 3.5]) • Normal forms: • DNF: (C) OR (E AND G) • CNF: (C OR E) AND (C OR G) • Difference in using the CNF/DNF “factors”

System R Search: Left-Deep Join Plans • Join ordering: • consider only left-deep trees: n! ordering for n tables • basis: • find all ways of joining a pair of base relations • choose good plans, discard bad ones • induction: until you have a full plan • k to k+1: given plans of k-relation join, add one more • prefer one with predicate: postpone Cartesian product • can throw away k-plans • Finally: grouping/ordering • using interesting order • additional sorting

Nested Subqueries • Subqueries optimized separately • Correlation: order of evaluations • uncorrelated queries are like “constants” • correlated queries are like “function calls”

Beyond System-R • Parallel/Distributed DB • User-defined predicates • Materialized views • Adaptive query optimization • Fuzzy queries

What You Should Know • Query optimization is critical for any commercial DBMS • good/bad can be orders of magnitude • The 3 components of the System-R style optimization • Plan space definition • Cost estimation • Search algorithm • huge number of alternative, semantically equivalent plans • Ideal goal: • map a declarative query to the most efficient plan • Conventional wisdom: avoid bad plans • State of the art: • industry: most optimizers are System-R style • academic: always a core database research topic

Carry Away Messages • Appropriate formulation of the problem leads to good solutions • Query optimization as a search problem • Messy problems always require a clean model • Make assumptions to make the problem tractable • An approximate solution could be good enough • Opens up opportunities for further improvement

Lecture 9: Query Optimization