DBMS Review: Database Systems Architecture, Operators, and Query Processing

EECS 647: Introduction to Database Systems Instructor: Luke Huan Spring 2007

Administrative • Homework 4 is due today • Homework 5 is assigned today, it is due April 23. Luke Huan Univ. of Kansas

A DBMS Review Luke Huan Univ. of Kansas

Review DBMS Architecture User/Web Forms/Applications/DBA query transaction Query Parser Transaction Manager Query Rewriter Logging & Recovery Query Optimizer Lock Manager Query Executor Files & Access Methods Lock Tables Buffers Buffer Manager Main Memory Storage Manager Storage Luke Huan Univ. of Kansas

Review Operators in QP • Logical operators: • what they do • Physical operators: • how they do it • Exercise: L: logical operator, P: physical operator • Union • Join • Nested loop join • Sort-merge join L L P P Luke Huan Univ. of Kansas

 buyer,item  City=‘urbana’ (Simple nested loops) buyer=name (Table scan) (Index scan) Person Purchase Review Query Tree and Physical Operators SELECT P.buyer, P.item FROM Purchase P, Person Q WHERE P.buyer=Q.name ANDQ.city=‘urbana’ Luke Huan Univ. of Kansas

SQL query SELECT title, SIDFROM Enroll, CourseWHERE Enroll.CID = Course.CID; <Query> Parser <SFW> Parse tree <select-list> <where-cond> <from-list> … … Rewritor <table> <table> ¼title, SID ¾Enroll.CID = Course.CID Logical plan Enroll Course £ Optimizer PROJECT (title, SID) Enroll Course Physical plan Sort MERGE-JOIN (CID) SORT (CID) Executor SORT (CID) SCAN (Enroll) Result SCAN (Course) Preview A query’s trip through the DBMS Luke Huan Univ. of Kansas

Today’s Topic • A system view of DBMS query processing process • Parser • Rewriter • Optimizer • Executor • Query optimization • Heuristic based methods • Cost based methods Luke Huan Univ. of Kansas

Parsing and validation • Parser: SQL ! parse tree • Good old lex & yacc • Detect and reject syntax errors • Validator: validate parse tree • Detect and reject semantic errors • Nonexistent tables/views/columns? • Insufficient access privileges? • Type mismatches? • Examples: AVG(name), name + GPA, Student UNION Enroll • Also • Expand * • Expand view definitions • Information required for semantic checking is found in system catalog (contains all schema information) Luke Huan Univ. of Kansas

¼title ¾Student.name=“Bart” ÆStudent.SID = Enroll.SIDÆEnroll.CID = Course.CID £ £ Course Student Enroll ¼title An equivalent plan: !Enroll.CID = Course.CID Course !Student.SID = Enroll.SID Enroll ¾name = “Bart” Student Logical plan • Nodes are logical operators (often relational algebra operators) • There are many equivalent logical plans Luke Huan Univ. of Kansas

¼title PROJECT (title) PROJECT (title) !Enroll.CID = Course.CID INDEX-NESTED-LOOP-JOIN (CID) MERGE-JOIN (CID) Course Index on Course(CID) SCAN (Course) !Student.SID = Enroll.SID SORT (CID) INDEX-NESTED-LOOP-JOIN (SID) MERGE-JOIN (SID) Enroll Index on Enroll(SID) ¾name = “Bart” SORT (SID) FILTER (name = “Bart”) INDEX-SCAN (name = “Bart”) SCAN (Enroll) Student Index on Student(name) SCAN (Student) Examples of physical plans • Many physical plans for a single query • Equivalent results, but different costs and assumptions! • DBMS query optimizer picks the “best” possible physical plan Luke Huan Univ. of Kansas

Physical plan execution • How are intermediate results passed from child operators to parent operators? • Temporary files • Compute the tree bottom-up • Children write intermediate results to temporary files • Parents read temporary files • Iterators • Do not materialize intermediate results • Children pipeline their results to parents Luke Huan Univ. of Kansas

Iterator interface • Every physical operator maintains its own execution state and implements the following methods: • open(): Initialize state and get ready for processing • getNext(): Return the next tuple in the result (or a null pointer if there are no more tuples); adjust state to allow subsequent tuples to be obtained • close(): Clean up Luke Huan Univ. of Kansas

An iterator for table scan • State: a block of memory for buffering input R; a pointer to a tuple within the block • open(): allocate a block of memory • getNext() • If no block of R has been read yet, read the first block from the disk and return the first tuple in the block • Or the null pointer if R is empty • If there is no more tuple left in the current block, read the next block of R from the disk and return the first tuple in the block • Or the null pointer if there are no more blocks in R • Otherwise, return the next tuple in the memory block • close(): deallocate the block of memory Luke Huan Univ. of Kansas

An iterator for SORT-MERGE-JOIN R: An iterator for the left subtree S: An iterator for the right subtree sort R; sort S;R.open(); S.open();r = R.next(); s = S.next(); while (r != NULL && s != NULL) { while (r.c < s.c) r = R.next(); if (r.c = s.c) { while (r.c = s.c) { output rs; r = R.next(); } "rewind" r to first tuple of R where r.c = s.c; s = S.next(); } while (r.c > s.c) s = S.next();} SORT-MERGE-JOIN R S Complexity: some of the DBMSs (e.g. pgSQL) predate C++! Luke Huan Univ. of Kansas

Any of these will do 1 second 1 minute 1 hour Query optimization • One logical plan ! “best” physical plan • Questions • How to enumerate possible plans • How to estimate costs • How to pick the “best” one • Often the goal is not getting the optimum plan, but instead avoiding the horrible ones Luke Huan Univ. of Kansas

! ! ! … = = = ! ! ! T T S R S S R R T Plan enumeration in relational algebra • Apply relational algebra equivalences • Join reordering: £ and ! are associative and commutative (except column ordering, but that is unimportant) Luke Huan Univ. of Kansas

More relational algebra equivalences • Convert ¾p-£ to/from !p: ¾p(R£S) = R!pS • Merge/split ¾’s: ¾p1(¾p2R) = ¾p1 Æp2R • Merge/split ¼’s: ¼L1(¼L2R) = ¼L1R, where L1 µL2 • Push down/pull up ¾:¾pÆprÆps (R!p’S) = (¾prR) !pÆp’ (¾psS), where • pr is a predicate involving only R columns • ps is a predicate involving only S columns • p and p’ are predicates involving both R and S columns • Push down ¼: ¼L (¾pR) = ¼L (¾p (¼L L’R)), where • L’ is the set of columns referenced by p that are not in L • Many more (seemingly trivial) equivalences… • Can be systematically used to transform a plan to new ones Luke Huan Univ. of Kansas

¼title ¾Student.name=“Bart” ÆStudent.SID = Enroll.SIDÆEnroll.CID = Course.CID £ £ Course ¼title Student Enroll Convert ¾p-£ to !p Push down ¾ ¾Enroll.CID = Course.CID ¼title £ !Enroll.CID = Course.CID Course ¾Student.SID = Enroll.SID Course £ !Student.SID = Enroll.SID Enroll ¾Student.name = “Bart” Enroll ¾name = “Bart” Student Student Relational query rewrite example Luke Huan Univ. of Kansas

Heuristics-based query optimization • Start with a logical plan • Push selections/projections down as much as possible • Why? Reduce the size of intermediate results • Why not? May be expensive; maybe joins filter better • Join smaller relations first, and avoid cross product • Why? Reduce the size of intermediate results • Why not? Size depends on join selectivity too • Convert the transformed logical plan to a physical plan (by choosing appropriate physical operators) Luke Huan Univ. of Kansas

SQL query rewrite • More complicated—subqueries and views divide a query into nested “blocks” • Processing each block separately forces particular join methods and join order • Even if the plan is optimal for each block, it may not be optimal for the entire query • Unnest query: convert subqueries/views to joins • We can just deal with select-project-join queries • Where the clean rules of relational algebra apply Luke Huan Univ. of Kansas

SQL query rewrite example • SELECT nameFROM StudentWHERE SID = ANY (SELECT SID FROM Enroll); • SELECT nameFROM Student, EnrollWHERE Student.SID = Enroll.SID; • Wrong—consider two Bart’s, each taking two classes • SELECT nameFROM (SELECT DISTINCT Student.SID, name FROM Student, Enroll WHERE Student.SID = Enroll.SID); • Right—assuming Student.SID is a key Luke Huan Univ. of Kansas

Heuristics- vs. cost-based optimization • Heuristics-based optimization • Apply heuristics to rewrite plans into cheaper ones • Cost-based optimization • Rewrite logical plan to combine “blocks” as much as possible • Optimize query block by block • Enumerate logical plans (already covered) • Estimate the cost of plans • Pick a plan with acceptable cost • Focus: select-project-join blocks Luke Huan Univ. of Kansas

Recap of Query Processing in DBMS • Parser: SQL ! parse tree • Rewriter: parse tree ! Logical plan • Next: • Optimizer: logical plan ! Physcial plan Luke Huan Univ. of Kansas

Input to SORT(CID): PROJECT (title) MERGE-JOIN (CID) SCAN (Course) SORT (CID) MERGE-JOIN (SID) SORT (SID) FILTER (name = “Bart”) SCAN (Enroll) SCAN (Student) Cost estimation • We have: cost estimation for each operator • Example: SORT(CID) takes log2B(input) £B(input) • But what is B(input)? • We need: size of intermediate results Physical plan example: Luke Huan Univ. of Kansas

Selections with equality predicates • Q: ¾A = vR • Suppose the following information is available • Size of R: |R| • Number of distinct A values in R: |¼AR| • Assumptions • Values of A are uniformly distributed in R • Values of v in Q are uniformly distributed over all R.A values • |Q| ¼ |R| ⁄ |¼AR| • Selectivity factor of (A = v) is 1 ⁄ |¼AR| Luke Huan Univ. of Kansas

Conjunctive predicates • Q: ¾A = u and B = vR • Additional assumptions • (A = u) and (B = v) are independent • Counterexample: major and advisor • No “over”-selection • Counterexample: A is the key • |Q| ¼ |R| ⁄ (|¼AR| · |¼BR|) • Reduce total size by all selectivity factors Luke Huan Univ. of Kansas

Negated and disjunctive predicates • Q: ¾A¹vR • |Q| ¼ |R| · (1 – 1 ⁄ |¼AR|) • Selectivity factor of :p is (1 – selectivity factor of p) • Q: ¾A = u or B = vR • |Q| ¼ |R| · (1 ⁄ |¼AR| + 1 ⁄ |¼BR|)? • No! Tuples satisfying (A = u) and (B = v) are counted twice • |Q| ¼ |R| · (1 – (1 – 1 ⁄ |¼AR|) · (1 – 1 ⁄ |¼BR|)) • Intuition: (A = u) or (B = v) is equivalent to: ( : (A = u) AND : (B = v)) Luke Huan Univ. of Kansas

Cost estimation: summary • Using similar ideas, we can estimate the size of projection, duplicate elimination, union, difference, aggregation (with grouping) • Lots of assumptions and very rough estimation • Accurate estimate is not needed • Maybe okay if we overestimate or underestimate consistently • May lead to very nasty optimizer “hints” SELECT * FROM Student WHERE GPA > 3.9; SELECT * FROM Student WHERE GPA > 3.9 AND GPA > 3.9; • Not covered: better estimation using histograms Luke Huan Univ. of Kansas

! ! ! ! R2 R1 R3 R4 R5 Search for the best plan • Huge search space • “Bushy” plan example: • Just considering different join orders, there are (2n – 2)! / (n – 1) bushy plans for R1!L!Rn • 30240 for n = 6 • And there are more if we consider: • Multiway joins • Different join methods • Placement of selection and projection operators Luke Huan Univ. of Kansas

! ! R5 ! R4 ! R3 R2 R1 Left-deep plans • Heuristic: consider only “left-deep” plans, in which only the left child can be a join • Tend to be better than plans of other shapes, because many join algorithms scan inner (right) relation multiple times—you will not want it to be a complex subtree • How many left-deep plans are there for R1!L!Rn? • Significantly fewer, but still lots— n! (720 for n = 6) Luke Huan Univ. of Kansas

! Minimize expected size Sk A greedy algorithm • S1, …, Sn • Say selections have been pushed down; i.e., Si = ¾p Ri • Start with the pair Si, Sj with the smallest estimated size for Si!Sj • Repeat until no relation is left:Pick Sk from the remaining relations such that the join of Sk and the current result yields an intermediate result of the smallest size Pick most efficient join method Remainingrelationsto be joined …, Sk,Sl,Sm, … Current subplan Luke Huan Univ. of Kansas

A dynamic programming approach • Generate optimal plans bottom-up • Pass 1: Find the best single-table plans (for each table) • Pass 2: Find the best two-table plans (for each pair of tables) by combining best single-table plans • … • Pass k: Find the best k-table plans (for each combination of k tables) by combining two smaller best plans found in previous passes • … • Rationale: Any subplan of an optimal plan must also be optimal (otherwise, just replace the subplan to get a better overall plan) • Well, not quite… Luke Huan Univ. of Kansas

The need for “interesting order” • Example: R(A, B) !S(A, C) !T(A, D) • Best plan for R!S: hash join (beats sort-merge join) • Best overall plan: sort-merge join R and S, and then sort-merge join with T • Subplan of the optimal plan is not optimal! • Why? • The result of the sort-merge join of R and S is sorted on A • This is an interesting order that can be exploited by later processing (e.g., join, duplicate elimination, GROUPBY, ORDERBY, etc.)! Luke Huan Univ. of Kansas

Summary • Relational algebra equivalence • SQL rewrite tricks • Heuristics-based optimization • Cost-based optimization • Need statistics to estimate sizes of intermediate results • Greedy approach • Dynamic programming approach Luke Huan Univ. of Kansas

DBMS Review: Database Systems Architecture, Operators, and Query Processing

DBMS Review: Database Systems Architecture, Operators, and Query Processing

Presentation Transcript

Introduction to Database Systems

Introduction to Database Systems

Introduction to Database Systems

CS186: Introduction to Database Systems

Introduction to Relational Database Systems

Introduction to Database Systems

An Introduction to Database Systems

Introduction to Database Systems

Introduction to Database Management Systems

Introduction to Database Management Systems

Introduction to Database Systems

Introduction to Database Systems

Introduction to Database Systems

Introduction to Database Systems

EECS 647: Introduction to Database Systems

Introduction to Database Systems

Introduction to Introduction to Database Systems

CS186: Introduction to Database Systems

Introduction to Database Systems

EECS 647: Introduction to Database Systems

Introduction to Database Systems