Download
query processing and optimization n.
Skip this Video
Loading SlideShow in 5 Seconds..
Query processing and optimization PowerPoint Presentation
Download Presentation
Query processing and optimization

Query processing and optimization

2 Vues Download Presentation
Télécharger la présentation

Query processing and optimization

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Query processing and optimization

  2. Definitions • Query processing • translation of query into low-level activities • evaluation of query • data extraction • Query optimization • selecting the most efficient query evaluation

  3. Query Processing (1/2) • SELECT * FROM student WHERE name=Paul • Parse query and translate • check syntax, verify names, etc • translate into relational algebra (RDBMS) • create evaluation plans • Find best plan (optimization) • Execute plan

  4. Query Processing (2/2) parser and translator relational algebra expression query optimizer evaluation engine evaluation plan output data statistics data data

  5. Relational Algebra (1/2) • Query language • Operations: • select: σ • project: π • union:  • difference: - • product: x • join: • Extended relational algebra operations:

  6. Relational Algebra (2/2) • SELECT * FROM student WHERE name=Paul • σname=Paul(student) • πname( σcid<00112235(student) ) • πname(σcoursename=Advanced DBs((student cid takes) courseid course) )

  7. Why Optimize? • Many alternative options to evaluate a query • πname(σcoursename=Advanced DBs((student cid takes) courseid course) ) • πname((student cid takes) courseidσcoursename=Advanced DBs(course)) ) • Several options to evaluate a single operation • σname=Paul(student) • scan file • use secondary index on student.name • Multiple access paths • access path: how can records be accessed (using index; which index?; etc.)

  8. πname σcoursename=Advanced DBs l courseid; index-nested loop σname=Paul ; use index i σname=Paul student cid; hash join course student student takes Evaluation plans • Specify which access path to follow • Specify which algorithm to use to evaluate operator • Specify how operators interleave • Optimization: • estimate the cost of each plan (not all plans) • select plan with lowest estimated cost

  9. Estimating Cost • What needs to be considered: • Disk I/Os • sequential (reading neighbouring pages is faster) • random • CPU time • Network communication • What are we going to consider: • Disk I/Os • page (data block) reads/writes • Ignoring cost of writing final output

  10. Operations and Costs (1/2) • Operations: σ, π, , , -, x, • Costs: • NR: number of records in R (other notation: TR or T(R) -> tuple) • LR: size of record in R (length of record) • bfR: blocking factor (other notation: FR) • number of records in a page (datablock) • BR: number of pages to store relation R • V(R, A): number of distinct values of attribute A in Rother notation: IA(R) (Image size) • SC(R,A): selection cardinality of A in R (number of matching rec.) • A key: SC(R, A)=1 • A nonkey: SC(R, A)= NR / V(R,A) (uniform distribution assumption) • HTi: number of levels in index I (-> height of tree) • rounding up fractions and logarithms

  11. Operations and Costs (2/2) • relation takes • 7000 tuples • student cid 8 bytes • course id 4 bytes • 40 courses • 1000 students • page size 512 bytes • output size (in pages) of query: which students take the Advanced DBs course? • Ttakes = 7000 • V(courseid, takes) = 40 • SC(courseid,takes)=ceil( Ttakes/V(courseid, takes) )=ceil(7000/40)=175 • bftakes = floor(512/12) = 42 Btakes = 7000/42 = 167pages • bfoutput = floor(512/8) = 64 Boutput = 175/64 = 3 pages

  12. Cost of Selection σ (1/2) • Linear search • read all pages, find records that match (assuming equality search) • average cost: • nonkey (multiple occurences): BR, key: 0.5*BR • Binary search • on ordered field • average cost: • m additional pages to be read(first found then read the duplicates) • m = ceil( SC(A,R)/bfR ) - 1 • Primary/Clustered Index (B+ tree) • average cost: • single record: HTi + 1 • multiple records: HTi + ceil( SC(R,A)/bfR )

  13. Cost of Selection σ (2/2) • Secondary Index (B+ tree) • average cost: • key field: HTi + 1 • nonkey field • worst case HTi + SC(A,R) • linear search more desirable if many matching records !!!

  14. Complex selection σexpr • conjunctive selections: • perform simple selection using θi with the lowest evaluation cost • e.g. using an index corresponding to θi • apply remaining conditions θ on the resulting records • cost: the cost of the simple selection on selected θ • multiple indices • select indices that correspond to θis • scan indices and return RIDs (ROWID in Oracle) • answer: intersection of RIDs • cost: the sum of costs + record retrieval • disjunctive selections: • multiple indices • union of RIDs • linear search

  15. Projection and set operations • SELECT DISTINCT cid FROM takes • π requires duplicate elimination • sorting • set operations require duplicate elimination • R  S • R  S • sorting

  16. Sorting • efficient evaluation for many operations • required by query: • SELECT cid,name FROM student ORDER BY name • implementations • internal sorting (if records fit in memory) • external sorting (that’s why we need temporary space on disk)

  17. External Sort-Merge Algorithm (1/3) • Sort stage: create sorted runs i=0; repeat read M pages of relation R into memory (M: size of Memory) sort the M pages write them into file Ri increment i until no more pages N = i // number of runs

  18. External Sort-Merge Algorithm (2/3) • Merge stage: merge sorted runs //assuming N < M (N <= M-1 we need 1 output buffer) allocate a page for each run file Ri // N pages allocated read a page Pi of each Ri repeat choose first record (in sort order) among N pages, say from page Pj write record to output and delete from page Pj if page is empty read next page Pj’ from Rj until all pages are empty

  19. External Sort-Merge Algorithm (3/3) • Merge stage: merge sorted runs • What if N >= M ? • perform multiple passes • each pass merges M-1 runs until relation is processed • in next pass number of runs is reduced • final pass generated sorted output

  20. d 95 d d 95 95 b d d d 38 95 95 95 a 12 a a 12 12 f d 95 12 a a a a 12 12 12 12 x 44 x x 44 44 o e 87 73 s 95 s f 12 95 f 12 f 12 x n 44 67 f 12 R2 o 73 o 73 pass s 95 t 45 o b 73 38 e 87 n 67 s e 95 87 file memory n 67 e 87 t n 45 67 t 45 R3 z 11 v t 45 22 b 38 v 22 x v 22 44 v 22 b 38 z z 11 11 z 11 R4 Sort-Merge Example run pass R1

  21. Sort-Merge cost • BR the number of pages of R • Sort stage: 2 * BR • read/write relation • Merge stage: • initially runs to be merged • each pass M-1 runs sorted • thus, total number of passes: • at each pass 2 * BRpages are read/written • read/write relation (BR+ BR) • apart from final write (BR) • Total cost: • 2 * BR + 2 * BR* - BR eg. BR = 1000000, M=100

  22. Projection • πΑ1,Α2… (R) • remove unwanted attributes • scan and drop attributes • remove duplicate records • sort resulting records using all attributes as sort order • scan sorted result, eliminate duplicates (adjacent) • cost • initial scan + sorting + final scan

  23. Join • πname(σcoursename=Advanced DBs((student cid takes) courseid course) ) • implementations • nested loop join • block-nested loop join • indexed nested loop join • sort-merge join • hash join

  24. Nested loop join (1/2) • R S for each tuple tR of R for each tS of S if (tR tS match) output tR.tS end end • Works for any join condition • S inner relation • R outer relation

  25. Nested loop join (2/2) • Costs: • best case when smaller relation fits in memory • use it as inner relation • BR+BS • worst case when memory holds one page of each relation • S scanned for each tuple in R • TR * Bs + BR

  26. Block nested loop join (1/2) for each page XR of R foreach page XS of S for each tuple tR in XR for each tS in XS if (tR tS match) output tR.tS end end end end

  27. Block nested loop join (2/2) • Costs: • best case when smaller relation fits in memory • use it as inner relation • BR+BS • worst case when memory holds one page of each relation • S scanned for each page in R • BR * Bs + BR

  28. Block nested loop join(an improvement) Memory size: M for each M – 1 page size chunkMR of R foreach page XS of S for each tuple tR in MR for each tS in XS if (tR tS match) output tR.tS end end end end 28

  29. Block nested loop join(an improvement) Costs: best case when smaller relation fits in memory use it as inner relation BR+BS general case S scanned for each M-1 size chunk in R (BR/ (M-1)) * Bs + BR 29

  30. Indexed nested loop join • R S • Index on inner relation (S) • for each tuple in outer relation (R) probe index of inner relation • Costs: • BR + TR * c • c the cost of index-based selection of inner relation • relation with fewer records as outer relation

  31. d D e 67 e E e 87 x X n 11 v V v 22 z 38 Sort-merge join • R S • Relations sorted on the join attribute • Merge sorted relations • pointers to first record in each relation • read in a group of records of S with the same values in the join attribute • read records of R and process • Relations in sorted order to be read once • Cost: • cost of sorting + BS + BR

  32. R0 S0 R S R1 S1 . . . . . . Rn-1 Sn-1 Hash join • R S • use h1on joining attribute to map records to partitions that fit in memory • records of R are partitioned into R0… Rn-1 • records of S are partitioned into S0… Sn-1 • join records in corresponding partitions • using a hash-based indexed block nested loop join • Cost: 2*(BR+BS) + (BR+BS)

  33. πname σcoursename=Advanced DBs courseid; index-nested loop cid; hash join course student takes Evaluation • evaluate multiple operations in a plan • materialization • pipelining

  34. πname σcoursename=Advanced DBs courseid; index-nested loop cid; hash join course student takes Materialization • create and read temporary relations • create implies writing to disk • more page writes

  35. πname σcoursename=Advanced DBs courseid; index-nested loop cid; hash join course student takes Pipelining (1/2) • creating a pipeline of operations • reduces number of read-write operations • implementations • demand-driven - data pull • producer-driven - data push

  36. Pipelining (2/2) • can pipelining always be used? • any algorithm? • cost of R S • materialization and hash join: BR + 3(BR+BS) • pipelining and indexed nested loop join: TR * HTi courseid pipelined materialized R S cid σcoursename=Advanced DBs student takes course

  37. Choosing evaluation plans • cost based optimization • enumeration of plans • R S T, 12 possible orders (3! * 2) (R S) T, R (S T) • cost estimation of each plan • overall cost • cannot optimize operation independently

  38. πname σcoursename=Advanced DBs courseid; index-nested loop cid; hash join course student takes Cost estimation • operation (σ, π, …) • implementation • size of inputs • size of outputs • sorting

  39. Expression Equivalence • conjunctive selection decomposition • commutativity of selection • combining selection with join and product • σθ1(R x S) = … R θ1 S = … • commutativity of joins • R θ1 S = S θ1 R • distribution of selection over join • σθ1^θ2(R S) = σθ1(R) σθ2 (S) • distribution of projection over join • πA1,A2(R S) = πA1(R) πA2 (S) • associativity of joins: R (S T) = (R S) T

  40. Cost Optimizer (1/2) • transforms expressions • equivalent expressions • heuristics, rules of thumb • perform selections early • perform projections early • replace products followed by selection σ (R x S) with joins R S • start with joins, selections with smallest result • create left-deep join trees

  41. πname σcoursename=Advanced DBs ccourseid; index-nested loop cid; hash join course student takes Cost Optimizer (2/2) πname ccourseid; index-nested loop σcoursenam = Advanced DBs cid; hash join student takes course

  42. Summary • Estimating the cost of a single operation • Estimating the cost of a query plan • Optimization • choose the most efficient plan