This comprehensive overview covers the essentials of query processing and optimization within SQL databases. It delves into the relational model, formal query languages, functional dependencies, normalization, and physical design. Key topics include indexing strategies, query execution plans, and the role of the query optimizer. Special attention is given to hash joins, their algorithms, and cost estimates associated with recursive partitioning. Additional operations such as duplicate elimination and aggregation techniques are also highlighted, emphasizing their implementation and optimization within database systems.
General Overview
• Relational model - SQL
• Formal & commercial query languages
• Functional Dependencies
• Normalization
• Physical Design
• Indexing
• Query Processing and Optimization
Review: QP & O
The query processor pipeline:
SQL Query → Parser → Algebraic Expression → Query Optimizer → Execution plan → Evaluator → Data: result of the query
Review: QP & O
Inside the Query Optimizer:
Algebraic Representation → Query Rewriter → Algebraic Representation → Plan Generator (consulting data stats) → Query Execution Plan
Review: Plan Generation
Metadata: the DBMS maintains statistics about each relation, attribute and index.
Plan generation:
• Generate many alternative plans (we saw many for selections, joins)
• Estimate the cost of each and choose the best
Plans examined:
• Selection: Linear, binary, PI, SI
• Range: PI, SI
• Joins: NLJ, BNLJ, INLJ, SMJ, HJ
Hash Joins
• Applicable only to natural joins and equijoins.
• Depends on a hash function h, used to partition both relations: h must map values of the join attributes to {0, ..., n-1}, where n = #partitions.
Hash-Join Algorithm
1. Partition the relation S using hash function h so that each partition si fits in memory. Use 1 block of memory as the output buffer for each partition (requires at least n blocks).
2. Partition R using h.
3. For each partition #i (0, ..., n-1): use BNLJ to compute the join Ri ⋈ Si (optimal since si fits in memory and is the inner relation).
S is called the build input and R is called the probe input.
Note: CPU costs can be reduced by building an in-memory hash index for each si, using a different hash function than h.
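The three steps above can be sketched in Python. This is only an in-memory illustration: tuples are modeled as dicts, partitions as lists, and Python's built-in hash stands in for h; a real DBMS would write partitions to disk block by block. All names here are illustrative.

```python
from collections import defaultdict

def hash_join(r, s, key, n=8):
    """Sketch of hash join: partition both inputs with h, then join
    matching partitions pairwise."""
    h = lambda v: hash(v) % n  # partitioning hash function

    # Steps 1-2: partition build input S and probe input R with the same h
    s_parts = defaultdict(list)
    r_parts = defaultdict(list)
    for t in s:
        s_parts[h(t[key])].append(t)
    for t in r:
        r_parts[h(t[key])].append(t)

    # Step 3: join each pair (Ri, Si); a dict on si stands in for the
    # in-memory hash index built with a second hash function
    result = []
    for i in range(n):
        index = defaultdict(list)
        for t in s_parts[i]:
            index[t[key]].append(t)
        for t in r_parts[i]:
            for match in index[t[key]]:
                result.append({**match, **t})
    return result
```

Only tuples that agree on the join attribute can land in the same partition, so each Ri needs to be compared only against Si.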
Hash Join
Partitioning: must choose:
• # of partitions, n
• hashing function, h (maps each tuple to {0, ..., n-1})
Goals (in order of importance):
1. Each partition of the build relation should fit in memory (=> h is uniform, n is large)
2. For the partitioning step, 1 output block per partition fits in memory (=> n is small (<= M-1))
Strategy: ensure #1; deal with violations of #2 when needed.
Hash Join
Goal #1: partitions of the build relation should fit in memory (M blocks).
How large should n be? Each si must fit in at most M-2 blocks (reserving 2 blocks for the R partition and the output of BNLJ), so n should be at least bS / (M-2).
(In practice, a little larger (fudge factor ~1.2), as not all memory is available for the partition joins.)
Hash Join
Goal #2: keep n < M. What if that is not possible? Recursive partitioning!
Idea:
• Iteration #1: partition S into M-1 partitions using h1
• Iteration #2: partition each partition of S into M-1 partitions using a different hash function h2
• ... repeat until each partition of S fits in memory.
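A minimal sketch of recursive partitioning, assuming a partition is a Python list standing in for disk blocks, and h1, h2, ... are supplied as a list of distinct hash functions (all names illustrative):

```python
def recursive_partition(blocks, m, h_funcs, depth=0):
    """Split an oversized partition M-1 ways with a fresh hash function
    at each level, until every piece fits in memory (m - 2 blocks)."""
    if len(blocks) <= m - 2 or depth >= len(h_funcs):
        return [blocks]
    parts = [[] for _ in range(m - 1)]
    for b in blocks:
        parts[h_funcs[depth](b) % (m - 1)].append(b)
    out = []
    for p in parts:
        out.extend(recursive_partition(p, m, h_funcs, depth + 1))
    return out
```

Using a different hash function at each level matters: re-partitioning with the same h would send every tuple of a partition back into one bucket.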
Cost of Hash-Join
Case 1: no recursive partitioning
1. Partition S: bS reads and bS + n writes. Why n? The last block of each partition may be only partially filled, so up to n extra blocks are written.
2. Partition R: bR reads and bR + n writes.
3. n partition joins: bR + bS + 2n reads.
Total: 3(bR + bS) + 4n
Typically n is small enough (roughly bS / (M-2)) that it can be ignored.
Cost of Hash-Join
Case 2: recursive partitioning
Recall: the build relation is partitioned M-1 ways on each pass, so the total number of passes is:
⌈log_(M-1)(n)⌉ ≈ ⌈log_(M-1)(bS / (M-2))⌉ ≈ log_(M-1)(bS) - 1
Cost:
1. Partition S: 2 bS (log_(M-1)(bS) - 1)
2. Partition R: 2 bR (log_(M-1)(bS) - 1)
3. n partition joins: bR + bS
Total cost estimate: 2(bR + bS)(log_(M-1)(bS) - 1) + bR + bS
Example of Cost of Hash-Join
customer ⋈ depositor
• Assume memory size M = 3 blocks
• bdepositor = 100 and bcustomer = 400
• depositor is used as the build input.
Recursive partitioning is required: M-1 = 2 partitions of ~50 blocks each cannot fit in the M-2 = 1 block available per partition.
2(bcust + bdep)(⌈log2(bdep)⌉ - 1) + bdep + bcust = 1000 · 6 + 500 = 6500 I/Os!
Why ever use SMJ?
1) Both input relations may already be sorted.
2) Skew-free hash functions are sometimes hard to find.
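The two cost estimates can be checked with a small calculator. This is a sketch following the slides' formulas: the +4n term of case 1 is ignored, the recursion test uses bS <= (M-1)(M-2) (each of at most M-1 partitions must fit in M-2 blocks), and the function name is illustrative.

```python
import math

def hash_join_cost(b_r, b_s, m):
    """Estimated I/O cost of hash join; b_r/b_s are blocks in the
    probe/build relations, m is the number of memory blocks."""
    if b_s <= (m - 1) * (m - 2):
        # case 1: no recursive partitioning (4n term ignored)
        return 3 * (b_r + b_s)
    # case 2: recursive partitioning
    passes = math.ceil(math.log(b_s, m - 1)) - 1
    return 2 * (b_r + b_s) * passes + b_r + b_s
```

With b_r = 400, b_s = 100 and M = 3 this reproduces the 6500 I/Os of the example, while a larger memory avoids recursion and drops the cost to 3(bR + bS).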
Hybrid Hash-Join
• Useful when memory sizes are relatively large, and the build input is bigger than memory.
• Main feature of hybrid hash join: keep the smaller partition of the build relation in memory.
(Figure: partitions s0 ... sn-1 of S and r0 ... rn-1 of R; s0 is kept in memory while the remaining partitions are written to disk.)
Complex Joins
• Join with a conjunctive condition: r ⋈(θ1 ∧ θ2 ∧ ... ∧ θn) s
• Either use nested loops / block nested loops, or
• Compute the result of one of the simpler joins r ⋈θi s; the final result comprises those tuples in the intermediate result that satisfy the remaining conditions θ1 ∧ ... ∧ θi-1 ∧ θi+1 ∧ ... ∧ θn
• Join with a disjunctive condition: r ⋈(θ1 ∨ θ2 ∨ ... ∨ θn) s
• Either use nested loops / block nested loops, or
• Compute as the union of the records in the individual joins: (r ⋈θ1 s) ∪ (r ⋈θ2 s) ∪ ... ∪ (r ⋈θn s)
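The conjunctive-condition strategy can be sketched as follows. Assumptions: the "simpler join" θi is an equijoin on one attribute (so an efficient method like hash join could be used for it; plain nested loops here for brevity), and the remaining conditions are passed as predicates. Names are illustrative.

```python
def conjunctive_join(r, s, equi_key, extra_preds):
    """Evaluate r join s under theta1 AND theta2 AND ...: compute one
    simple equijoin, then filter with the remaining conditions."""
    # simple equijoin on equi_key
    intermediate = [(tr, ts) for tr in r for ts in s
                    if tr[equi_key] == ts[equi_key]]
    # keep only tuples satisfying the remaining conditions
    return [(tr, ts) for tr, ts in intermediate
            if all(p(tr, ts) for p in extra_preds)]
```

Picking the most selective θi as the "simpler join" keeps the intermediate result small, which is what makes this strategy attractive.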
Other Operations
• Duplicate elimination can be implemented via hashing or sorting.
• Optimization: duplicates can be deleted during run generation as well as at intermediate merge steps in external sort-merge.
• Hashing is similar: duplicates hash to the same bucket, so each bucket can be deduplicated independently.
• Projection is implemented by performing projection on each tuple, followed by duplicate elimination.
Other Operations: Aggregation
• Aggregation can be implemented in a manner similar to duplicate elimination (hashing or sorting).
• Optimization: combine tuples in the same group during run generation and intermediate merges, by computing partial aggregate values.
• For count, min, max, sum: keep aggregate values on the tuples found so far in the group; when combining partial aggregates for count, add up the counts.
• For avg, keep sum and count, and divide sum by count at the end.
• This works for all distributive aggregates, i.e. aggr(S) = f(aggr(S1), aggr(S2)) where S = S1 ∪ S2.
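The avg case above can be sketched with partial aggregates (helper names are illustrative): each run keeps a (sum, count) pair, partial pairs are combined componentwise, and the division happens only once at the end.

```python
def partial_avg(chunk):
    # per-run partial aggregate: keep (sum, count), not the average
    return (sum(chunk), len(chunk))

def merge_avg(p1, p2):
    # combining partial aggregates: add sums and add counts
    return (p1[0] + p2[0], p1[1] + p2[1])

def final_avg(p):
    # finalize: divide total sum by total count
    total, count = p
    return total / count
```

Averaging the partial averages directly would be wrong whenever the runs have different sizes, which is why the (sum, count) representation is needed.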