1 / 83

Query Evaluation

Donghui Zhang Northeastern University. Query Evaluation. Partially using Prof. Hector Garcia-Molina’s slides (Notes06, Notes07) http://www-db.stanford.edu/~ullman/dscb.html. Server. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50. Michael Jordan Donghui Zhang. Query Evaluation.

amina
Télécharger la présentation

Query Evaluation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Donghui Zhang Northeastern University Query Evaluation Partially using Prof. Hector Garcia-Molina’s slides (Notes06, Notes07) http://www-db.stanford.edu/~ullman/dscb.html

  2. Server SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 Michael Jordan Donghui Zhang Query Evaluation SQL Query  Query Result ??? • Check the data and • meta data; • Produce query result Michael Jordan Donghui Zhang

  3. Query Evaluation Steps • Query Compiling: get logical Q.P. • Query Optimization: choose a physical Q.P. • Query Execution: execute

  4. query compiling query execution query optimization answer SQL query parse parse tree convert logical query plan execute apply laws statistics Pi “improved” l.q.p pick best estimate result sizes {(P1,C1),(P2,C2)...} l.q.p. +sizes estimate costs consider physical plans {P1,P2,…..}

  5. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 Query Compiling  Parse • Background knowledge: Grammar. • Input: SQL query. • Output: a parse tree. • Start with a simple grammar: • Only SFW (no group by, having, nested query) • Simple AND condition (no OR, UNION, EXISTS, IN, …) • One table (no conditions like E.did=D.did)

  6. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 Query Compiling  Parse  Grammar • <SFW> := SELECT <SelList> FROM <Table> WHERE <CondList> • <SelList> := <Attribute> | <Attribute>, <SelList> • <CondList> := <Condition> | <Condition> AND <CondList> • <Condition> := <Attribute> <op> <value> • <op>:= > | < | = | >= | <=

  7. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 <Attribute> Emp E <Condition> AND <CondList> <Condition> E.Name <Attribute> <op> <value> E.SSN < 5000 <Attribute> <op> <value> E.Age > 50 Query Compiling  Parse  Parse Tree <SFW> SELECT <SelList> FROM <Table> WHERE <CondList>

  8. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 E.Name E.SSN<5000 AND E.Age>50 Emp E Query Compiling  Convert • Input: a parse tree. • Output: a logical query plan. • Algorithm:  followed by . • E.Name(E.SSN<5000 AND E.Age>50(E) ) • Alternatively, a l.q.p tree.

  9. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 E.Name E.SSN<5000 AND E.Age>50 Emp E Query Compiling  Apply Laws • Replace  with , push  [and ] down. • Only used for multiple tables. So skip.

  10. query compiling query execution query optimization answer SQL query parse parse tree convert logical query plan execute apply laws statistics Pi “improved” l.q.p pick best estimate result sizes {(P1,C1),(P2,C2)...} l.q.p. +sizes estimate costs consider physical plans {P1,P2,…..}

  11. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 Query Optimization  Estimate Result Sizes • The size of each input table is stored as meta data. • Intermediate result: size not known, but needed to estimate I/O cost of physical plan. • But for the simple case,  can be evaluated on the fly. So no need to estimate the size of . So skip.

  12. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 E.Name E.SSN<5000 AND E.Age>50 Emp E Query Optimization  Consider Physical Plans • Associate each RA operator with an implementation scheme. • Multiple implementation schemes? Enumerate all. Plan 1 (always work!) on-the-fly scan

  13. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 SSN index 1 2 3 4 5 6 Query Optimization  Consider Physical Plans • For the other physical plans, need to know what indices exist. • Primary index: controls the actual storage of a table. • Suppose a primary B+-tree index exists on SSN. • Secondary index: built on some other attribute. Does not store the actual record. Each leaf entry stores a set of page IDs in the primary index. • Suppose a secondary B+-tree index exists on Age. e.g. entry in Age index: Age=50, pageIDs= {1, 4, 6}

  14. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 E.Name E.SSN<5000 AND E.Age>50 Emp E Query Optimization  Consider Physical Plans Plan 2 on-the-fly range search in SSN index

  15. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 E.Name E.SSN<5000 AND E.Age>50 Emp E Query Optimization  Consider Physical Plans Plan 3 on-the-fly range search in Age index, follow pointers to SSN index

  16. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 Query Optimization  Estimate Costs • Estimate #I/Os for each physical plan. • Pick the cheapest one. • Input: physical plan. • Additional input: • meta data (e.g. how many levels a B+-tree has) • assumptions (e.g. the root node of every B+-tree is pinned) • memory buffer size.

  17. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 Query Optimization  Estimate Costs  Meta Data • All the database tables. • For each table R: • Schema • T(R): #records in R • For every attribute A: • V(R, A): #distinct values of A • min(R, A): minimum value of A • max(R, A): maximum value of A • Primary index: #levels, #leaf nodes. • Secondary index: #levels, #leaf nodes, average #pageIDs per leaf entry.

  18. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 Query Optimization  Estimate Costs  sample input • Assume for table E: • Schema = (SSN: int, Name: string, Age: int, Salary: int) • T(E) = 100 tuples. • For attribute SSN: • V(E, SSN)=100, min(E, SSN)=0000, max(E, SSN)=9999 • For attribute Age: • V(E, Age)=20, min(E, Age)=21, max(E, Age)=60 • Primary index on SSN: 3 level B+-tree, 50 leaf nodes. • Secondary index on Age: 2 level B+-tree, 10 leaf nodes, every leaf entry points to 3.5 pageIDs (on average). • Assumptions: all B+-tree roots are pinned. Can reach the first leaf page of a B+-tree directly. • Memory buffer size: 2 pages.

  19. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 E.Name E.SSN<5000 AND E.Age>50 Emp E Query Optimization  Estimate Costs • Cost = 50. (The primary index has 50 leaf nodes. Assume we can reach the first leaf page of a B+-tree directly.) Plan 1 (always work!) on-the-fly scan

  20. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 E.Name E.SSN<5000 AND E.Age>50 Emp E Query Optimization  Estimate Costs • Cost = 25. SSN<5000 selects half of the employees, so 50/2=25 leaf nodes. • Note: if condition is E.SSN>5000, needs 1 more I/O. Plan 2 on-the-fly range search in SSN index

  21. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 #I/Os in the SSN index #I/Os in the Age index E.Name E.SSN<5000 AND E.Age>50 Emp E Query Optimization  Estimate Costs • Cost = 10/4 +  20/4 * 3.5 = 21. Plan 3 on-the-fly range search in Age index, follow pointers to SSN index

  22. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 Check 1/4 of them, since [51,60] is 1/4 of [21,60]. Age index has 10 leaf nodes. E.Name E.SSN<5000 AND E.Age>50 Emp E Query Optimization  Estimate Costs • Cost = 10/4 +  20/4 * 3.5 = 21. Plan 3 on-the-fly range search in Age index, follow pointers to SSN index

  23. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 times 3.5 (#pageIDs per page) to get #I/Os in the SSN index. 20 distinct ages divided by 4 to get #ages in [51,60]. E.Name E.SSN<5000 AND E.Age>50 Emp E Query Optimization  Estimate Costs • Cost = 10/4 +  20/4 * 3.5 = 21. Plan 3 on-the-fly range search in Age index, follow pointers to SSN index

  24. SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 Pick! Query Optimization  Pick Best

  25. query compiling query execution query optimization answer SQL query parse parse tree convert logical query plan execute apply laws statistics Pi “improved” l.q.p pick best estimate result sizes {(P1,C1),(P2,C2)...} l.q.p. +sizes estimate costs consider physical plans {P1,P2,…..}

  26. Another case study: two tables. • Extended grammar: • Only SFW (no group by, having, nested query) • Simple AND condition (no OR, UNION, EXISTS, IN, …) • Allow two tables (allow conditions like E.did=D.did) • Example query: SELECT E.Name, D.Dname FROM Emp E, Dept D WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000

  27. SELECT E.Name, D.Dname FROM Emp E, Dept D WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000 Query Compiling  Parse  Grammar • <SFW> := SELECT <SelList> FROM <TableList> WHERE <CondList> • <SelList> := <Attribute> | <Attribute>, <SelList> • <TableList> := <Table> | <Table>, <Table> • <CondList> := <Condition> | <Condition> AND <CondList> • <Condition> := <Attribute> <op> <value> | <Attribute> = <Attribute> • <op>:= > | < | = | >= | <=

  28. SELECT E.Name, D.Dname FROM Emp E, Dept D WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000 <Attribute> , <SelList> E.Name <Attribute> D.Dname Query Compiling  Parse  Parse Tree <SFW> SELECT <SelList> FROM <TableList> WHERE <CondList>

  29. SELECT E.Name, D.Dname FROM Emp E, Dept D WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000 <Table> <Table> , Emp E Dept D Query Compiling  Parse  Parse Tree <SFW> SELECT <SelList> FROM <TableList> WHERE <CondList>

  30. SELECT E.Name, D.Dname FROM Emp E, Dept D WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000 <Condition> AND <CondList> <Attribute> = <Attribute> <Condition> AND <CondList> E.Did D.Did <Condition> Query Compiling  Parse  Parse Tree <SFW> SELECT <SelList> FROM <TableList> WHERE <CondList>

  31. SELECT E.Name, D.Dname FROM Emp E, Dept D WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000 E.Name, D.Dname E.Did=D.Did AND E.SSN<5000 AND D.budget=1000  Emp E Dept D Query Compiling  Convert • Algorithm:  then  then . • E.Name. D.Dname(E.Did=D.Did AND E.SSN<5000 AND D.budget=1000(ED) ) • The l.q.p tree:

  32. SELECT E.Name, D.Dname FROM Emp E, Dept D WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000 E.Name, D.Dname E.Did=D.Did AND E.SSN<5000 AND D.budget=1000  Emp E Dept D Query Compiling  Apply Laws • Always always: (try to) replace  with !

  33. SELECT E.Name, D.Dname FROM Emp E, Dept D WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000 Query Compiling  Apply Laws • Always always: (try to) replace  with ! • Also, push  down. E.Name, D.Dname E.SSN<5000 AND D.budget=1000 Emp E Dept D

  34. SELECT E.Name, D.Dname FROM Emp E, Dept D WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000 Query Compiling  Apply Laws • Always always: (try to) replace  with ! • Also, push  down. E.Name, D.Dname E.SSN<5000 AND D.budget=1000 Emp E Dept D

  35. SELECT E.Name, D.Dname FROM Emp E, Dept D WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000 Query Compiling  Apply Laws • Always always: (try to) replace  with ! • Also, push  down. E.Name, D.Dname E.SSN<5000 D.budget=1000 Emp E Dept D

  36. SELECT E.Name, D.Dname FROM Emp E, Dept D WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000 Query Compiling  Apply Laws Theory Behind • Let p = predicate with only E attributes q = predicate with only D attributes m = E & D’s common attributes are equal • We have: pqm (E  D) = p(E) q(D)

  37. query compiling query execution query optimization answer SQL query parse parse tree convert logical query plan execute apply laws statistics Pi “improved” l.q.p pick best estimate result sizes {(P1,C1),(P2,C2)...} l.q.p. +sizes estimate costs consider physical plans {P1,P2,…..}

  38. SELECT E.Name, D.Dname FROM Emp E, Dept D WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000 Query Optimization  Consider Physical Plans • Because join is so important, let’s skip result size estimation for now, and let’s assume selections are not pushed down. E.Name, D.Dname E.SSN<5000 AND D.budget=1000 Emp E Dept D

  39. Four Join Algorithms • Iteration join (nested loop join) • Merge join • Hash join • Join with index

  40. Example E D over common attribute Did • E: • T(E)=10,000 • primary index on SSN, 3 levels. • |E|= 1,000 leaf nodes. • D: • T(D)=5,000 • primary index on Did. 3 levels. • |D| = 500 leaf nodes. • Memory available = 101 blocks

  41. Iteration Join • for every block in E • scan through D; • join records in the E block with records in the D block. • I/O cost = |E| + |E| * |D| = • 1000 + 1000*500 = 501,000. • Works good for small buffer (e.g. two blocks).

  42. Can we do better? Use our memory (1) Read 100 blocks of E (2) Read all of D (using 1 block) + join (3) Repeat until done • I/O cost = |E| + |E|/100 * |D| = • 1000 + 10*500 = 6,000.

  43. Can we do better? Reverse join order: D E. i.e. For every 100 D blocks, go through E. • I/O cost = |D| + |D|/100 * |E| = • 500 + 5*1000 = 5,500.

  44. Merge join (conceptually) (1) if R1 and R2 not sorted, sort them (2) i  1; j  1; While (i  T(R1))  (j  T(R2)) do if R1{ i }.C = R2{ j }.C then outputTuples else if R1{ i }.C > R2{ j }.C then j  j+1 else if R1{ i }.C < R2{ j }.C then i  i+1

  45. Procedure Output-Tuples While (R1{ i }.C = R2{ j }.C) (i  T(R1)) do [jj  j; while (R1{ i }.C = R2{ jj }.C) (jj  T(R2)) do [output pair R1{ i }, R2{ jj }; jj  jj+1 ] i  i+1 ]

  46. Example i R1{i}.C R2{j}.C j 1 10 5 1 2 20 20 2 3 20 20 3 4 30 30 4 5 40 30 5 50 6 52 7

  47. Merge Join Cost • Recall that |E|=1000, |D|=500. And |D| is already sorted on Did. • External sort E: pass 0, by reading and writing E, produces a file with 10 sorted runs. Another read is enough. • No need to write! Can pipeline to join operator. • Cost = 3*1000 + 500 = 3,500.

  48. Hash join (conceptual) • Hash function h, range 0  k • Buckets for R1: G0, G1, ... Gk • Buckets for R2: H0, H1, ... Hk Algorithm (1) Hash R1 tuples into G buckets (2) Hash R2 tuples into H buckets (3) For i = 0 to k do match tuples in Gi, Hi buckets

  49. Simple example hash: even/odd R1 R2 Buckets 2 5 Even 4 4 R1 R2 3 12 Odd: 5 3 8 13 9 8 11 14 2 4 8 4 12 8 14 3 5 9 5 3 13 11

  50. Hash Join Cost • Read + write both E and D for partitioning, then read to join. • Cost = 3 * (1000 + 500) = 4,500.

More Related