Query Evaluation: Hash Join, General Queries
General Overview • Relational model - SQL • Formal & commercial query languages • Functional Dependencies • Normalization • Physical Design • Indexing • Query Processing and Optimization
Review: QP & O. The query processor pipeline: SQL Query → Parser → Algebraic Expression → Query Optimizer → Execution Plan → Evaluator → Data (the result of the query).
Review: QP & O. Inside the Query Optimizer: Algebraic Representation → Query Rewriter → (rewritten) Algebraic Representation → Plan Generator → Query Execution Plan. The Plan Generator consults data statistics.
Review: Plan Generation. Metadata: the DBMS maintains statistics about each relation, attribute and index. Plan generation: • Generate many alternative plans (we saw many for selections and joins) • Estimate the cost of each and choose the best. Plans examined: Selection: linear scan, binary search, primary index (PI), secondary index (SI); Range: PI, SI; Joins: NLJ, BNLJ, INLJ, SMJ, HJ.
Hash Joins • Applicable only to natural joins and equijoins. • Depends on a hash function h, used to partition both relations; h must map values of the join attributes to {0, ..., n-1}, where n = number of partitions.
Hash-Join Algorithm 1. Partition S using hash function h so that each partition si fits in memory. Use one block of memory as the output buffer for each partition (requires at least n blocks). 2. Partition R using the same h. 3. For each partition i (0, ..., n-1), use BNLJ to compute the join Ri ⋈ Si (optimal, since si fits in memory as the inner relation). S is called the build input and R the probe input. Note: CPU cost can be reduced by building an in-memory hash index for each si, using a hash function different from h.
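The two phases above can be sketched in Python. This is a minimal in-memory illustration, not the disk-based algorithm: lists of dicts stand in for relations, a Python dict plays the role of the per-partition index, and the function and parameter names (`hash_join`, `key_r`, `key_s`, `n`) are this sketch's own, not from the slides.

```python
from collections import defaultdict

def hash_join(R, S, key_r, key_s, n=8):
    """Hash join sketch: partition both inputs with the same
    hash function h, then join matching partitions pairwise."""
    h = lambda v: hash(v) % n  # partitioning hash function

    # Phase 1: partition the build input S and the probe input R.
    s_parts = defaultdict(list)
    for s in S:
        s_parts[h(s[key_s])].append(s)
    r_parts = defaultdict(list)
    for r in R:
        r_parts[h(r[key_r])].append(r)

    # Phase 2: join partition i of R with partition i of S.
    # An in-memory index on s_i (built with a different "hash
    # function" -- here a plain Python dict) replaces the BNLJ scan.
    result = []
    for i in range(n):
        index = defaultdict(list)
        for s in s_parts[i]:
            index[s[key_s]].append(s)
        for r in r_parts[i]:
            for s in index[r[key_r]]:
                result.append({**s, **r})
    return result
```

Only tuples that land in the same partition are ever compared, which is exactly why the method works only for equijoins.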
Hash Join Partitioning: must choose • the number of partitions, n • the hashing function h (maps each join-attribute value to {0, ..., n-1}). Goals (in order of importance): 1. Each partition of the build relation should fit in memory (⇒ h is uniform, n is large). 2. During the partitioning step, one output block per partition fits in memory (⇒ n is small, n ≤ M-1). Strategy: ensure #1; deal with violations of #2 when needed.
Hash Join Goal #1: each partition of the build relation should fit in memory. With M blocks of memory, each si may occupy at most M-2 blocks (reserving 2 blocks: one input block for the R partition and one output block for BNLJ), so n should be at least ⌈bS/(M-2)⌉. In practice n is taken a little larger (fudge factor ~1.2), as not all memory is available for the partition joins.
Hash Join Goal #2: keep n ≤ M-1. What if that is not possible? Recursive partitioning! Idea: Iteration #1: partition S into M-1 partitions using h1. Iteration #2: partition each partition of S into M-1 partitions using a different hash function h2. ... Repeat until each partition of S fits in memory.
Cost of Hash-Join. Case 1: no recursive partitioning. 1. Partition S: bS reads and bS + n writes. Why the extra n? Each partition may end with a partially filled block. 2. Partition R: bR reads and bR + n writes. 3. n partition joins: bR + bS + 2n reads. • Total: 3(bR + bS) + 4n. Typically n is small enough (roughly bS/M) that it can be ignored.
Cost of Hash-Join. Case 2: recursive partitioning. Recall: the build relation is partitioned M-1 ways on each pass, so the total number of passes is ⌈log_{M-1}(n)⌉ ≈ ⌈log_{M-1}(bS/(M-2))⌉ ≈ ⌈log_{M-1}(bS)⌉ - 1. • Cost: 1. Partition S: 2 bS (⌈log_{M-1}(bS)⌉ - 1). 2. Partition R: 2 bR (⌈log_{M-1}(bS)⌉ - 1). 3. Partition joins: bR + bS. Total cost estimate: 2(bR + bS)(⌈log_{M-1}(bS)⌉ - 1) + bR + bS.
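The two cost formulas can be wrapped in a small cost estimator. The function name `hash_join_cost` and the rough choice n ≈ ⌈bS/M⌉ for deciding between the two cases are this sketch's assumptions; the formulas themselves are the ones on the slides.

```python
import math

def hash_join_cost(b_r, b_s, M):
    """I/O cost estimate for hash join.
    b_r: blocks of the probe relation R
    b_s: blocks of the build relation S
    M:   memory size in blocks"""
    n = math.ceil(b_s / M)  # rough number of partitions needed
    if n <= M - 1:
        # Case 1: no recursive partitioning
        return 3 * (b_r + b_s) + 4 * n
    # Case 2: recursive partitioning
    passes = math.ceil(math.log(b_s, M - 1)) - 1
    return 2 * (b_r + b_s) * passes + b_r + b_s
```

For instance, with bR = 400, bS = 100 and M = 3 (the example on the next slide), the recursive branch gives 2 · 500 · 6 + 500 = 6500 I/Os.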
Example of Cost of Hash-Join: customer ⋈ depositor • Assume the memory size is M = 3 blocks • bdepositor = 100 and bcustomer = 400 • depositor is used as the build input. Recursive partitioning is required (with M-1 = 2 partitions, each partition of depositor has ~50 blocks and cannot fit in memory): 2(bcust + bdep)(⌈log2(bdep)⌉ - 1) + bdep + bcust = 1000 · 6 + 500 = 6500 I/Os! Why ever use SMJ? 1) Both input relations may already be sorted. 2) Skew-free hash functions are sometimes hard to find.
Hybrid Hash-Join • Useful when memory sizes are relatively large and the build input is bigger than memory. • Main feature of hybrid hash join: keep the first partition s0 of the build relation in memory, so tuples of R that hash to partition 0 are joined immediately during partitioning; neither R0 nor s0 is ever written to disk.
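A minimal in-memory sketch of the hybrid idea, under the same list-of-dicts assumptions as before: partition 0 of S is kept as an index instead of being spilled, and R tuples hashing to it are probed on the fly. The names (`hybrid_hash_join`, `in_memory`) are this sketch's own.

```python
from collections import defaultdict

def hybrid_hash_join(R, S, key, n, in_memory=(0,)):
    """Hybrid hash join sketch: partitions of the build input S listed
    in `in_memory` (here just partition 0) stay resident as an index,
    so matching R tuples are joined during the partitioning pass."""
    h = lambda v: hash(v) % n
    index0 = defaultdict(list)   # resident index for partition 0 of S
    s_parts = defaultdict(list)  # "spilled" partitions of S
    for s in S:
        if h(s[key]) in in_memory:
            index0[s[key]].append(s)
        else:
            s_parts[h(s[key])].append(s)

    result = []
    r_parts = defaultdict(list)  # "spilled" partitions of R
    for r in R:
        if h(r[key]) in in_memory:
            # probe immediately: no I/O for R0 or s0
            result.extend({**s, **r} for s in index0[r[key]])
        else:
            r_parts[h(r[key])].append(r)

    # remaining partitions are joined as in plain hash join
    for p, rs in r_parts.items():
        idx = defaultdict(list)
        for s in s_parts[p]:
            idx[s[key]].append(s)
        for r in rs:
            result.extend({**s, **r} for s in idx[r[key]])
    return result
```

The larger the memory, the larger the fraction of S that can play the role of s0, and the closer the cost gets to a single pass over both relations.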
Complex Joins • Join with a conjunctive condition: r ⋈(θ1 ∧ θ2 ∧ ... ∧ θn) s • Either use nested loops / block nested loops, or • Compute the result of one of the simpler joins r ⋈θi s; the final result comprises those tuples in the intermediate result that satisfy the remaining conditions θ1 ∧ ... ∧ θi-1 ∧ θi+1 ∧ ... ∧ θn • Join with a disjunctive condition: r ⋈(θ1 ∨ θ2 ∨ ... ∨ θn) s • Either use nested loops / block nested loops, or • Compute as the union of the records in the individual joins r ⋈θi s: (r ⋈θ1 s) ∪ (r ⋈θ2 s) ∪ ... ∪ (r ⋈θn s)
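The conjunctive strategy (one cheap join, then residual filtering) can be sketched as a small combinator. `join_fn` stands for any simpler join on one θi (e.g. a hash join on an equality), and `residual_preds` are the remaining conditions; both names are this sketch's assumptions.

```python
def conjunctive_join(R, S, join_fn, residual_preds):
    """Join with condition th1 AND ... AND thn: evaluate one cheap
    join (join_fn) first, then filter the intermediate result with
    the remaining predicates."""
    intermediate = join_fn(R, S)
    return [t for t in intermediate
            if all(pred(t) for pred in residual_preds)]
```

This pays off when the chosen θi is selective and has an efficient algorithm (equijoin), so the intermediate result stays small.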
Other Operations • Duplicate elimination can be implemented via hashing or sorting. • Optimization: duplicates can be deleted during run generation as well as at intermediate merge steps in external sort-merge. • Hashing is similar – duplicates will come into the same bucket. • Projection is implemented by performing projection on each tuple followed by duplicate elimination.
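The hashing route for duplicate elimination can be shown with an in-memory sketch: equal tuples necessarily hash to the same bucket, so duplicates are detected locally. The external version would first partition to disk exactly as in hash join; here a single Python set stands in for the buckets, and the function name is this sketch's own.

```python
def hash_distinct(tuples):
    """Duplicate elimination by hashing: tuples with equal values land
    in the same bucket, so each duplicate meets its earlier copy."""
    seen = set()
    out = []
    for t in tuples:
        key = tuple(sorted(t.items()))  # hashable form of the tuple
        if key not in seen:
            seen.add(key)
            out.append(t)
    return out
```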
Other Operations: Aggregation • Aggregation can be implemented in a manner similar to duplicate elimination • Hashing or sorting • Optimization: combine tuples in the same group during run generation and intermediate merges, by computing partial aggregate values • For count, min, max, sum: keep aggregate values on the tuples found so far in the group • When combining partial aggregates for count, add up the aggregates • For avg, keep sum and count, and divide sum by count at the end • True for all distributive aggregates, i.e. aggr(S) = f(aggr(S1), aggr(S2)) where S = S1 ∪ S2
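The avg trick above (keep (sum, count) per group, divide only at the end) can be sketched as two functions: one computes partial states over a single run or bucket, the other merges states and finalizes. All names here are this sketch's assumptions.

```python
from collections import defaultdict

def partial_aggregate(run, key, val):
    """Per-group partial state (sum, count) for one run / hash bucket."""
    acc = defaultdict(lambda: [0, 0])
    for t in run:
        acc[t[key]][0] += t[val]
        acc[t[key]][1] += 1
    return acc

def merge_and_finalize(partials):
    """Merge partial (sum, count) states and finish avg = sum / count."""
    merged = defaultdict(lambda: [0, 0])
    for acc in partials:
        for k, (s, c) in acc.items():
            merged[k][0] += s
            merged[k][1] += c
    return {k: s / c for k, (s, c) in merged.items()}
```

avg itself is not distributive, but the (sum, count) pair is, which is exactly why partial states can be combined across runs.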