1 / 28

Query Optimization

Query Optimization. R&G, Chapter 15 Lecture 17. Administrivia. Homework 3 available from class website Due date: Tuesday, March 20 by end of class period Homework 4 available today Implement nested loops and hash join operators for (new!) minibase Due date: April 10 (after Spring Break)

alika
Télécharger la présentation

Query Optimization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Query Optimization R&G, Chapter 15 Lecture 17

  2. Administrivia • Homework 3 available from class website • Due date: Tuesday, March 20 by end of class period • Homework 4 available today • Implement nested loops and hash join operators for (new!) minibase • Due date: April 10 (after Spring Break) • Midterm 2 is 3/22, 2 weeks from today • In class, covers lectures 10-17 • Review will be held Tuesday 3/20 7-9 pm 306 Soda Hall • Internships at Google this summer… • See http://www.postgresql.org/developer/summerofcode • Booth at the UCB TechExpo this Thursday: • http://csba.berkeley.edu/tech_expo.html • Contact josh@postgresql.org

  3. Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB Review We are here • Query plans are a tree of operators that compute the result of a query • Optimization is the process of picking the best plan • Execution is the process of executing the plan

  4. Query Plans: turning text into tuples Query Result Query Shift Name SELECT A.aname, max(F.feedingshift) FROM Animals A, Feeding F WHERE A.aid = F.aid AND (A.species = 'Big Cat' or A.species = 'Bear') GROUP BY A.aname HAVING COUNT(*) > 1 Operators Query Plan

  5. Query Optimization steps SELECT A.aname, max(F.feedingshift) FROM Animals A, Feeding F WHERE A.aid = F.aid AND (A.species = 'Big Cat' or A.species = 'Bear') GROUP BY A.aname HAVING COUNT(*) > 1 Query parser • Parse query from text to ‘intermediate model’ • Traverse ‘intermediate model’ and produce alternate query plans • Query plan = relational algebra tree • Plan cost = cumulative cost of tree • Consider equivalent but alternative relational algebra tress • Optimizer keeps track of cost and properties of plans • Pick the cheapest plan Block 3 Block 2 Block 1 Query optimizer Cost = 200 Cost = 500 Cost = 150 To execution engine

  6. (On-the-fly) sname (On-the-fly) (Page-Oriented Nested loops) sname sid=sid (Scan & Write to temp T2) (On-the-fly) rating > 5 bid=100 bid=100 rating>5 (On-the-fly) (Page-Oriented Nested loops) Sailors Reserves sid=sid Reserves Sailors Optimized query is 124x cheaper than the original! SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5 500,500 IOs 4010 IOs

  7. System R-Style Optimization • Impact: • Most widely used currently; works well for < 10 joins. • Cost estimation: • Plan cost = cumulative cost of plan tree • Cost = function(I/O, CPU) costs • Very inexact, but works ok in practice. • Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes. • Plan Space:Too large, must be pruned. • Many plans share common, “overpriced” subtrees • ignore them all! • Only left-deep plans are considered. • May cause optimizer to miss good plans • Avoid Cartesian products.

  8. Query Optimization steps Block 3 Block 2 Block 1 • Parse query from text to ‘intermediate model’ -> A list of query blocks • For each query block, pick the best plan • Convert block to tree of relational algebra operators • Build alternative plans bottom up: • Leaf nodes represent access paths to relations • Consider reordering relational algebra tree • Push selections and projections down • Consider left-deep join plans • Avoid cross products • Consider all possible join methods • Prune expensive plans with the same properties Rel 1 access path Rel 2 access path X BNL Cost = 200 no sorted order X BNL Cost = 150 no sorted order X HJ Cost = 125 no sorted order HJ Cost = 125 no sorted order X SM Cost = 300 Sorted by bid SM Cost = 300 Sorted by bid X

  9. Schema for Examples • Reserves: • Each tuple is 40 bytes long, 100 tuples per page, 1000 pages. 100 distinct bids. • Sailors: • Each tuple is 50 bytes long, 80 tuples per page, 500 pages. 10 ratings, 40,000 sids. Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string)

  10. Translating SQL to Relational Algebra SELECT S.sid, MIN (R.day) FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” AND S.rating > 5 GROUP BY S.sid HAVING COUNT (*) >= 2 For each sailor with a rating > 5 that has reserved at least 2 red boats, find the sailor id and the earliest date on which the sailor has a reservation for a red boat.

  11. p S.sid, MIN(R.day) Translating SQL to Relational Algebra SELECT S.sid, MIN (R.day) FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” AND S.rating > 5 GROUP BY S.sid HAVING COUNT (*) >= 2 HAVING COUNT(*)>2 GROUP BY S.Sid sB.color = “red” S.rating > 5 V Sailors Reserves Boats

  12. p S.sid, MIN(R.day) Relational Algebra Equivalences • Allow us to choose different join orders and to `push’ selections and projections ahead of joins. • Selections can be cascaded: c1…cn(R)  c1(…(cn(R))…) HAVING COUNT(*)>2 SELECT S.sid, MIN (R.day) FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” AND S.rating > 5 GROUP BY S.sid HAVING COUNT (*) >= 2 GROUP BY S.Sid sB.color = “red” Reserves Boats • Can apply these predicates separately • Can ‘push’ S.rating > 5 down to Sailors sS.rating > 5 Sailors

  13. p S.sid, MIN(R.day) Relational Algebra Equivalences • Selections can be commuted: c1(c2(R))  c2(c1(R)) SELECT S.sid, MIN (R.day) FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” AND S.rating > 5 GROUP BY S.sid HAVING COUNT (*) >= 2 HAVING COUNT(*)>2 GROUP BY S.Sid sS.Rating > 5 Can apply these predicates in different order Reserves Sailors sB.color = “red” Boats

  14. p p S.sid, MIN(R.day) S.sid Relational Algebra Equivalences • Projections can be cascaded: a1(R)  a1(…(a1, …, an(R))…) HAVING COUNT(*)>2 SELECT S.sid, MIN (R.day) FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” AND S.rating > 5 GROUP BY S.sid HAVING COUNT (*) >= 2 GROUP BY S.Sid sB.color = “red” Can project S.sid to reduce size of tuples Reserves Boats sS.rating > 5 Sailors

  15. p p p p p p p p p B.bid, B.color S.sid ?? S.sid, MIN(R.day) S.sid, R.day ?? ?? R.sid, R.bid, R.day ?? Relational Algebra Equivalences SELECT S.sid, MIN (R.day) FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” AND S.rating > 5 GROUP BY S.sid HAVING COUNT (*) >= 2 • Eager projection • Can cascade and “push” some projections thru selection • Can cascade and “push” some projections below one side of a join • Rule of thumb: can project anything not needed “downstream” HAVING COUNT(*)>2 GROUP BY S.Sid sB.color = “red” sS.rating > 5 Boats Sailors Reserves

  16. Relational Algebra Equivalences • Cartesian Product and Joins • R  (S  T)  (R  S)  T (associative) • R  S  S  R (commutative) • This means we can do joins in any order. • But…beware of cartesian product! • Only consider R X S if there is a join predicate between R & Select S.sid, R.bid, B.bid From Sailors S, Reserves R, Boats B Where S.sid = R.sid and R.bid = B.bid What about this query? Select S.sid, B.bid From Sailors S, Reserves R, Boats B Where S.sid = R.sid No need to consider plans with S X B User asked for Cartesian product so you have to compute it!

  17. Cost Estimation • For each plan considered, must estimate total cost: • Must estimate costof each operation in plan tree. • Plan cost = cumulative cost of the operators • Operator cost is computed from: • Input cardinalities. • Cost of each operator • e.g. (sequential scan, index scan, joins, etc.) • Must estimate size of result for each operation in tree! • It will contribute to the input cardinality for the next operator up the tree! • Computed from cardinality of input relations + selecitivity of the predicates. • For selections and joins, assume predicates are independent. • Q: Is “cost” the same as estimated “run time”?

  18. Statistics and Catalogs • Optimizer needs statistics about relations and their indexes to compute cardinality and selectivity estimates. • System Catalogs contain metadata about tables and relations • Just another set of relations; you can query them like any other table! • For optimizer, they typically contain: • # tuples (cardinality) and # pages (NPages) per rel’n. • # distinct key values (NKeys) for each index. • low/high key values (Low/High) for each index. • Index height (IHeight) for each tree index. • # index pages (INPages) for each index. • Statistics must be kept up to date • Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency ok. • Updated periodically • E.g. DB2: runstats on table feeding • Statistics out of date -> optimizer choosing very bad plans!

  19. The optimizer relies on good statistics Table Stats Name NumPages Sailors 1 Reserves 1 T1: Create table Sailors(…) T2: Create table Reserves(…) T3: Runstats on table Sailors T4: Runstats on table Reserves T5: Insert 100 pages of Sailors into Sailors T6: Runstats on Sailors T6: Insert 1000 pages of Reservations into Reserves T7: Run Query: X 100 SELECT S.sname, R.bid FROM Sailors S, Reserves R WHERE S.sid = R.sid Optimizer would incorrectly pick this plan! • Assume 7 pages of memory and only BNL • BNLJ: SailorsXReserves • 100 + 100/5x1 = 120 ReservesXSailors 1 + 1/5x100 = 101

  20. Size Estimation and Reduction Factors SELECT attribute list FROM relation list WHERE pred1 AND ... ANDpredk • Consider a query block: • Maximum Cardinality = product of the cardinalities of relations in the FROM clause. • Reduction factor (RF) associated with eachpredicatereflects the impact of the predicate in reducing result size. • Resultcardinality = Max # tuples * product of all RF’s. • RF usually called “selectivity” • only R&G seem to call it Reduction Factor

  21. Result Size Estimation • Resultcardinality = Max # tuples * product of all RF’s. • Term col=value (given index I on col, or knowledge about # column values) RF = 1/NKeys(I) • Term col1=col2 (This is handy for joins too…) RF = 1/MAX(NKeys(I1), NKeys(I2)) • Term col>value RF = (High(I)-value)/(High(I)-Low(I)) (Implicit assumptions: values are uniformly distributed and predicates are independent!) • Question: What if the optimizer has no statistics?

  22. /* default selectivity estimate for equalities such as "A = b" */ #define DEFAULT_EQ_SEL 0.005 /* default selectivity estimate for inequalities such as "A < b" */ #define DEFAULT_INEQ_SEL 0.3333333333333333 /* default selectivity estimate for range inequalities "A > b AND A < c" */ #define DEFAULT_RANGE_INEQ_SEL 0.005 /* default selectivity estimate for pattern-match operators such as LIKE */ #define DEFAULT_MATCH_SEL 0.005 /* default number of distinct values in a table */ #define DEFAULT_NUM_DISTINCT 200 /* default selectivity estimate for boolean and null test nodes */ #define DEFAULT_UNK_SEL 0.005 #define DEFAULT_NOT_UNK_SEL (1.0 - DEFAULT_UNK_SEL) No statistics is a common case • RF = 1/NKeys(I) for a column with an index • What about a non-index column, • e.g. R.day = ’01/31/2007’? • The original System R paper suggested 1/10 for this case, and many systems today still use that! • Here is what POSTGRES does:

  23. Estimating Join Cardinality • Q: Given a join of R and S, what is the range of possible result sizes (in #of tuples)? • Suppose the join is on a key for R and S • e.g. Students(sid, sname, did), Dorm(did,d.addr) Select S.sid, D.address From Students S, Dorms D Where S.did = D.did What is the cardinality? A student can only live in at most 1 dorm -> each S tuple can match with at most 1 D tuple -> cardinality (S join D) = cardinality of S

  24. Estimating Join Cardinality • General case: join on {A} ({A} is key for neither) • estimate each tuple r of R generates uniform number of matches in S and each tuple s of S generates uniform number of matches in R, e.g. RF = min(NTuples(R) * NTuples(S)/NKeys(A,S) NTuples(S) * NTuples(R)/NKeys(A,R)) Sailors: 100 tuples, 75 unique names -> 1.3 tuples for each sailor name Boats: 20 tuples, 10 unique names -> 2 tuples for each boat name e.g. SELECT S.name, B.name FROM Sailors S, Boats B WHERE S.name = B.name RF = 100*20/10 = 200 RF = 20*100/75 = 26.6

  25. D D C C D B A C B A B A Plan Enumeration • A heuristic decision in System R:only left-deep join treesare considered. • As the number of joins increases, the number of alternative plans grows rapidly; we need to restrict the search space. • Left-deep trees allow us to generate all fully pipelined plans. • Intermediate results not written to temporary files. • Not all left-deep trees are fully pipelined (e.g., SM join).

  26. Enumeration of Left-Deep Plans • Left-deep plans differ: • Order of relations • Access method for each relation, • Join method for each join. • Enumerated using N passes (if N relations joined): • Pass 1: Find best 1-relation plan for each relation. • Pass 2: Find best way to join result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.) • Pass N: Find best way to join result of a (N-1)-relation plan (as outer) to the N’th relation. (All N-relation plans.) • For each subset of relations, retain only: • Cheapest plan overall, plus • Cheapest plan for each interesting order of the tuples.

  27. A Note on “Interesting Orders” • An intermediate result has an “interesting order” if it is sorted by any of: • ORDER BY attributes • GROUP BY attributes • Join attributes of yet-to-be-added (downstream) joins

  28. Enumeration of Plans (Contd.) • Avoid Cartesian products! • An N-1 way plan is not combined with an additional relation unless there is a join condition between them, or all predicates in WHERE have been used up. • i.e., consider a Cartesian product only if the user specified one! • ORDER BY, GROUP BY, aggregates • handled as a final step, using either an `interestingly ordered’ plan or an additonal sort/hash operator. • In spite of pruning plan space, System R approach is still exponential in the # of tables.

More Related