Query Processing

Query Processing

Steps in Query Processing • Validate and translate the query • Good syntax. • All referenced relations exist. • Translate the SQL to relational algebra. • Optimize • Make it run faster. • Evaluate

Translation Example Possible SQL Query: SELECT balance FROM account WHERE balance<2500 Possible Relational Algebra Query: balancebalance<2500(account))

Tree Representation of Relational Algebra • balancebalance<2500(account)) balance balance<2500 account

Making An Evaluation Plan • Annotate Query Tree with evaluation instructions: • The query can now be executed by the query execution engine. balance balance<2500 use index 1 account

Before Optimizing the Query • Must predict the cost of execution plans. • Measured by • CPU time, • Number of disk block reads, • Network communication (in distributed DBs), • where C(CPU) < C(Disk) < C(Network). • Major factor is buffer space. • Use statistics found in the catalog to help predict the work required to evaluate a query.

Disk Cost • Seek time = rotational latency + arm movement. • Scan time = time to read the data. • Typically, seek time is orders of magnitude greater. • Disk cost is assumed to be highest, so it can be used to approximate total cost.

Reading Data, No Indices • Linear scan • Cost is a function of file size. • Binary search on ordering attribute • Cost is lg of the file size. • Requires table to be sorted.

Reading Data with Indices • Primary index: index on sort key. • Can be dense or sparse. • Secondary index: index on non-sort key. • Queries can be point queries or range queries. • Point queries return a single record. • Range queries return a sequence of consecutive records.

Point Queries • Point queries • Cost = index cost + block read cost. • Range queries (c1 <= key <= c2) • Primary index: • Cost = index cost + scan of blocks • Secondary index: • Cost = #blocks(index cost + scan of block)

More on Range Queries • Range query on sort key (c1 <= key) • c1 <= key: Linear scan until you find key. • c1 >= key: Use index to find key, then linear scan. • Range query using secondary index • Scan through index blocks. Requires accessing index for every record.

More Complex Selections • Conditions on multiple attributes • Negations • Disjunctions • Grouping pointers when selection is on multiple attributes: • Find a set of solutions for each condition. • Either compute its union or intersection, depending on the condition (disjunction or conjunction.)

Sorting • Sorted relations are easier to scan. • The cost of sorting a relation before querying it can be less than querying an unsorted relation. • Two types of sorts: • In memory • Out of memory (a.k.a., external sorting)

External Merge Sort • Use this when you cannot fit the relation in memory. • Assume there are M memory buffers. • Two phases: • Create sorted runs. • Merge sorted runs.

External Merge Sort, Phase 1 • Fill the M memory buffers with the next M blocks of the relation. • Sort the M blocks. • Write the sorted blocks to disk.

External Merge Sort, Phase 2 • Assume there are at most M-1 runs. • Read the first block of each run into memory. • At each iteration, find the lowest record from the M-1 runs. • Place it into the memory buffer. • If any run is empty, read its next block.

External Merge Sort Notes • Can be extended to an arbitrarily large relation using multiple passes. • Cost is: • Br(2 * lg_(M-1) (Br/M) + 1) • Br is the number of blocks for the relation. • B is the size of a memory buffer.

Nested Loop Join • No indices (for now). • Nested Loop • R join S • R is the outer relation. • S is the inner relation. • Read a block of R, then read each block of S and compare their contents using the join condition. • Write any matching tuples to another block.

Nested Loop Join Cost • If you read tuple by tuple, it’s: • #tuples in R * #blocks in S + #blocks in R. • Question: Which should be in inner relation, and which should be the outer?

Block Nested Loop • Nested Loop Join, but block by block instead. • Cost for R join S, where R is outer, S is inner: • #blocks in R * #blocks in S + #blocks in S

Block Nested Loop Improvements • Sorted relations? • More memory?

Indexed Nested Loop Join • Assume we have an index on a join attribute of one of the relations, R or S. • Questions: • Which should the index be on? • Or, if both have indices on them, which should be the outer one?

Indexed Nested Loop Join Cost • #blocks in R + #rows in R * Ls • Ls is the cost of looking up a record in S using the index.

More Joins • Merge join • Sort R and S, and then merge them. • Hash join • Hash R and S into buckets, and compare the bucket contents.

Evaluation • Materialization: Build intermediate tables as the expression goes up the tree. • Here, one intermediate table is created for the select, and is the input of the project. balance balance<2500 account

Materialization Cost • Cost of writing out intermediate results to disk.

Pipelining • Compute several operations simultaneously. • As soon as a tuple is created from one operation, send it to the next. Here, send selected tuples straight to the projection. balance balance<2500 account

Implementation of Pipelining • Requires buffers for each operation. • Can be: • Demand driven – an operator must be asked to generate a tuple. • Producer driven – an operator generates a tuple whether its asked for or not.

Query Optimization

Some Actions of Query Optimization • Reordering joins. • Changing the positions of projects and selects. • Changing the access structures used to read data.

Catalog Info • Number of tuples in r. • Number of blocks for r. • Size of tuple of r. • Blocking factor a r – the number of r tuples that fit in a block. • The number of distinct values of each attribute of r.

Query Processing