Quick Review of Apr 17 material • Multiple-Key Access • There are good and bad ways to run queries on multiple single keys • Indices on Multiple Attributes • Combining two keys into a single concatenated attribute • Grid files • crude array with one dimension and linear scale for each attribute • more than one cell may point to a given bucket of values • array grid may be dynamically resized during use • Other alternatives are spatial databases: R-tree, quad-trees, k-d tree • Bitmap Indices • linear array of bits: bit j is set if tuple j has the attribute that this bitmap tracks (e.g., “Moonroof”: bit j is 1 if record j is a car with moonroof) • Queries are answered by combining several bitmaps using and, or, not
Today • HW #4: due Thursday April 24 (next class) • Questions: 12.11, 12.12, 12.13, 12.16 • No HW for next week • Today: • Start Chapter 13: Query Processing
Query Processing • SQL is good for humans, but not as an internal (machine) representation of how to calculate a result • Processing an SQL (or other) query requires these steps: • parsing and translation • turning the query into a useful internal representation in the extended relational algebra • optimization • manipulating the relational algebra query into the most efficient form (one that gets results the fastest) • evaluation • actually computing the results of the query
Query Processing Steps 1. parsing and translation • details of parsing are covered in other places (texts and courses on compilers). We’ve already covered SQL and relational algebra; translating between the two should be relatively familiar ground 2. optimization • This is the meat of chapter 13. How to figure out which plan, among many, is the best way to execute a query 3. evaluation • actually computing the results of the query is mostly mechanical (doesn’t require much cleverness) once a good plan is in place.
Query Processing Example • Initial query: select balance from account where balance<2500 • Two different relational algebra expressions could represent this query: • sel balance<2500(Pro balance(account)) • Pro balance( sel balance<2500(account)) • which choice is better? It depends upon metadata (data about the data) and what indices are available for use on these operations.
Query Processing Metadata • Cost parameters(some are easy to maintain; some are very hard -- this is statistical info maintained in the system’s catalog) • n(r ): number of tuples in relation r • b(r ): number of disk blocks containing tuples of relation r • s(r ): average size of a tuple of relation r • f(r ): blocking factor of r: how many tuples fit in a disk block • V(A,r): number of distinct values of attribute A in r. (V(A,r)=n(r ) if A is a candidate key) • SC(A,r): average selectivity cardinality factor for attribute A of r. Equivalent to n(r )/V(A,r). (1 if A is a key) • min(A,r): minimum value of attribute A in r • max(A,r): maximum value of attribute A in r
Query Processing Metadata (2) • Cost parameters are used in two important computations: • I/O cost of an operation • the size of the result • In the following examination we’ll find it useful to differentiate three important operations: • Selection (search) for equality (R.A1=c) • Selection (search) for inequality (R.A1>c) (range queries) • Projection on attribute A1
Selection for Equality (no indices) • Selection (search) for equality (R.A1=c) • cost (sequential search on a sorted relation) = b(r )/2 average unsuccessful b(r )/2 + SC(A1,r) -1 average successful • cost (binary search on a sorted relation) = log b(r ) average unsuccessful log b(r ) + SC(A1,r) -1 average successful • size of the result n(select(R.A1=c)) = SC(A1,r) = n(r )/V(A1,r)
Selection for Inequality (no indices) • Selection (search) for inequality (R.A1>c) • cost (file unsorted) = b(r ) • cost (file sorted on A1) = b(r )/2 + b(r )/2 (if we assume that half the tuples qualify) b(r ) in general (regardless of the number of tuples that qualify. Why?) • size of the result = depends upon the query; unpredictable
Projection on A1 • Projection on attribute A1 • cost = b(r ) • size of the result n(Pro(R,A1)) = V(A1,r)
Selection (Indexed Scan) for Equality Primary Index on key: cost = (height+1) unsuccessful cost = (height+1) +1 successful Primary (clustering) Index on non-key: cost = (height+1) + SC(A1,r)/f(r ) all tuples with the same value are clustered Secondary Index cost = (height+1) + SC(A1,r) tuples with the same value are scattered
Selection (Indexed Scan) for Inequality Primary Index on key: search for first value and then pick tuples >= value cost = (height+1) +1+ size of the result (in disk pages) = height+2 + n(r ) * (max(A,r)-c)/(max(A,r)-min(A,r))/f(r ) Primary (clustering) Index on non-key: cost as above (all tuples with the same value are clustered) Secondary (non-clustering) Index cost = (height+1) +B-treeLeaves/2 + size of result (in tuples) = height+1 + B-treeLeaves/2 + n(r ) * (max(A,r)-c)/(max(A,r)-min(A,r))
Complex Selections • Conjunction (select where theta1 and theta2) (s1 = # of tuples satisfying selection condition theta1) combined SC = (s1/n(r )) * (s2/n(r )) = s1*s2/n(r )2 assuming independence of predicates • Disjunction (select where theta1 or theta2) combined SC = 1 - (1 - s1/n(r )) * (1 - s2/n(r )) = s1/n(r )) + s2/n(r ) - s1*s2/n(r )2 • Negation (select where not theta1) n(! Theta1) = n(r ) - n(Theta1)
Complex Selections with Indices GOAL: apply the most restrictive condition first and combined use of multiple indices to reduce the intermediate results as early as possible • Why? No index will be available on intermediate results! • Conjunctive selection using one index B: • select using B and then apply remaining predicates on intermediate results • Conjunctive selection using a composite key index (R.A1, R.A2): • create a composite key or range from the query values and search directly (range search on the first attribute (MSB of the composite key) only) • Conjunctive selection using two indices B1 and B2: • search each separately and intersect the tuple identifiers (TIDs) • Disjunctive selection using two indices B1 and B2: • search each separately and union the tuple identifiers (TIDs)