Overview of Query Evaluation

Overview of Query Evaluation Chapter 12

Outline • Query Optimization Overview • Algorithms for Relational Operations

Overview of Query Evaluation • The DBMS keeps descriptive data in system catalogs. • SQL queries are translated into an extended form of relational algebra. • Query Plan Reasoning: • A plan is a tree of operators, • with one algorithm chosen from among several for each operator.

Query Plan Evaluation • [Figure: query plan tree over Sailors and Reserves: projection on sname, selections rating > 5 and bid=100, join on sid=sid] • Query Plan Execution: • Each operator is typically implemented using a 'pull' interface: • when an operator is 'pulled' for its next output tuples, it 'pulls' on its inputs and computes them.
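The pull interface described above can be sketched with Python generators. This is only an illustration under stated assumptions, not how any particular DBMS implements it: the operator names (`scan`, `select`, `project`, `nested_loop_join`) and the sample rows are hypothetical, and the two selections are folded into one operator for brevity.

```python
# Hypothetical sample data; each tuple is a plain dict.
SAILORS = [
    {"sid": 1, "sname": "Dustin", "rating": 7},
    {"sid": 2, "sname": "Lubber", "rating": 3},
    {"sid": 3, "sname": "Rusty", "rating": 9},
]
RESERVES = [
    {"sid": 1, "bid": 100},
    {"sid": 2, "bid": 100},
    {"sid": 3, "bid": 101},
]

def scan(table):
    """Leaf operator: produces one tuple each time it is pulled."""
    for row in table:
        yield row

def select(pred, child):
    """Pulls on its child until a tuple satisfies pred, then emits it."""
    for row in child:
        if pred(row):
            yield row

def nested_loop_join(outer, inner_table, key):
    """For each tuple pulled from outer, rescans the inner table."""
    for o in outer:
        for i in scan(inner_table):
            if o[key] == i[key]:
                yield {**o, **i}

def project(attrs, child):
    """Keeps only the requested attributes of each pulled tuple."""
    for row in child:
        yield {a: row[a] for a in attrs}

# Plan: project sname over the selections over the join on sid.
plan = project(
    ["sname"],
    select(lambda r: r["rating"] > 5 and r["bid"] == 100,
           nested_loop_join(scan(RESERVES), SAILORS, "sid")),
)

# Nothing runs until the root is pulled; list() pulls until exhaustion.
result = list(plan)
```

Because each operator is a generator, no intermediate result is materialized: pulling on the root drives the whole tree one tuple at a time.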

Overview of Query Evaluation • Query Plan Optimization: • Ideally: find the best plan. Practically: avoid the worst plans! • Two main issues in query optimization: • For a given query, what plans are considered? • An algorithm searches the plan space for the cheapest (estimated) plan. • How is the cost of a plan estimated? • Cost models are based on I/O estimates.

Query Processing • Common Techniques for Query Processing Algorithms: • Indexing: Use WHERE conditions to retrieve a small set of tuples from a large relation. • Iteration: Examine all tuples in an input table, one after the other (as in a sorting algorithm). • Partitioning: By sorting or hashing, partition the input tuples and replace an expensive operation with similar operations on smaller inputs. • Watch for these techniques as we discuss query evaluation!

Access Paths • An access path: • A method of retrieving tuples from a table. • Methods: • File scan • Index that matches a selection (in the query) • Note: • The choice of access path contributes significantly to the cost of a relational operator.

Access Path • Most selective access path: An index or file scan that we estimate will require fewest page I/Os.

Matching an Access Path • A tree index matches (a conjunction of) terms that involve only attributes in a prefix of the search key. • Example: Given a tree index on <a, b, c>: • selection a=5 AND b=3? • selection a=5 AND b>6? • selection b=3?
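The three questions above can be checked mechanically. The sketch below assumes one reading of the rule: the attributes used by the terms must be exactly the first k attributes of the search key, for some k of at least 1. The function name `tree_index_matches` is hypothetical.

```python
def tree_index_matches(search_key, term_attrs):
    """True if the attributes in the terms form a prefix of the search key.

    Assumed reading: the set of term attributes equals the first k
    attributes of the key for some k >= 1. Operators are ignored, since
    a tree index supports both equality and range comparisons.
    """
    attrs = set(term_attrs)
    return any(attrs == set(search_key[:k])
               for k in range(1, len(search_key) + 1))

KEY = ("a", "b", "c")
match_1 = tree_index_matches(KEY, ["a", "b"])  # a=5 AND b=3
match_2 = tree_index_matches(KEY, ["a", "b"])  # a=5 AND b>6 (same attributes)
match_3 = tree_index_matches(KEY, ["b"])       # b=3
```

Under this reading the first two selections match the index and the third does not, because b alone is not a prefix of <a, b, c>.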

Matching an Access Path • A hash index matches (a conjunction of) terms that includes a term of the form attribute = value for every attribute in the search key of the index. • Example: Given a hash index on <a, b, c>: • selection a=5 AND b=3 AND c=6? • selection c=5 AND b=6? • selection a=5? • selection a>5 AND b=3 AND c=5?
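The hash-index rule can be sketched the same way. Here each term is represented as an (attribute, operator) pair; that representation and the name `hash_index_matches` are assumptions made for illustration.

```python
def hash_index_matches(search_key, terms):
    """True if every search-key attribute has an equality term.

    terms: list of (attribute, operator) pairs from a conjunction.
    """
    eq_attrs = {attr for attr, op in terms if op == "="}
    return set(search_key) <= eq_attrs

KEY = ("a", "b", "c")
q1 = hash_index_matches(KEY, [("a", "="), ("b", "="), ("c", "=")])
q2 = hash_index_matches(KEY, [("c", "="), ("b", "=")])
q3 = hash_index_matches(KEY, [("a", "=")])
q4 = hash_index_matches(KEY, [("a", ">"), ("b", "="), ("c", "=")])
```

Only the first selection matches: the second and third leave key attributes without a term, and the fourth uses a range comparison on a, which a hash index cannot evaluate.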

Query Evaluation of Selection

Selection • Example: σ_{R.attr OP value}(R) • Case 1: No index, NOT sorted on R.attr • Must scan the entire relation. • Most selective access path = file scan. • Cost: M page I/Os, where M is the number of pages in R.

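Selection • Example: σ_{R.attr OP value}(R) • Case 2: No index, data sorted on R.attr • Binary search for the first qualifying tuple. • Scan R from there to retrieve all tuples that satisfy the condition. • Cost: O(log2 M) I/Os for the binary search, plus the pages scanned to retrieve the qualifying tuples.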
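The binary-search-then-scan idea can be illustrated in memory with Python's `bisect` module. This is a sketch for an equality selection only: a sorted list of key values stands in for the sorted file, whereas a real system would binary-search over pages rather than individual entries. The function name `select_eq_sorted` is hypothetical.

```python
from bisect import bisect_left

def select_eq_sorted(sorted_keys, value):
    """Return the positions of all entries equal to value in a sorted list."""
    # Binary search for the first position whose key is >= value.
    i = bisect_left(sorted_keys, value)
    matches = []
    # Scan forward while entries still satisfy the condition.
    while i < len(sorted_keys) and sorted_keys[i] == value:
        matches.append(i)
        i += 1
    return matches

positions = select_eq_sorted([1, 3, 3, 3, 7, 9], 3)
```

The scan stops at the first non-matching entry, which is safe only because the data is sorted on the selection attribute.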

Selection Using B+ Tree Index • Example: σ_{R.attr OP value}(R) • Case 3: B+ tree index on R.attr • Costs: • Cost I (finding qualifying data entries), and • Cost II (retrieving records) • Cost I: 2-3 I/Os (the depth of the B+ tree). • Cost II: • clustered index: 1 I/O when only a few tuples match, since qualifying tuples are stored together • unclustered index: up to one I/O per qualifying tuple.

Example: Using B+ Index for Selections SELECT * FROM Reserves R WHERE R.rname < 'D%' • Assume a uniform distribution of names, so roughly 10% of tuples qualify. • Assume R occupies 100 pages and has 10,000 tuples, so about 10 pages and 1,000 tuples qualify. • Clustered index: • 2 I/Os + a little more than 10 I/Os • Unclustered index: • 2 I/Os + up to 1,000 I/Os
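The arithmetic behind these estimates is worth spelling out. The sketch below uses the figures from the slide (100 pages, 10,000 tuples, 10% selectivity, 2 I/Os to traverse the index); the variable names are illustrative, and the file-scan cost is added for comparison.

```python
pages = 100            # pages occupied by Reserves
tuples = 10_000        # tuples in Reserves
percent_qualifying = 10
index_ios = 2          # I/Os to reach the first qualifying data entry

qualifying_tuples = tuples * percent_qualifying // 100
qualifying_pages = pages * percent_qualifying // 100

# Clustered: qualifying tuples sit on adjacent pages.
clustered_cost = index_ios + qualifying_pages
# Unclustered, worst case: every qualifying tuple is on a different page read.
unclustered_cost = index_ios + qualifying_tuples
# Ignoring the index and scanning the whole file.
file_scan_cost = pages
```

With these numbers the unclustered worst case (1,002 I/Os) is roughly ten times the cost of simply scanning the file (100 I/Os), while the clustered index needs about 12.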

Selection: B+ Index • Refinement for unclustered indexes: 1. Find the qualifying data entries. 2. Sort the rids of the data records to be retrieved. 3. Fetch the records in rid order, which avoids retrieving the same page multiple times. • However, the number of pages fetched is still likely to be higher than with clustering. • Using an unclustered index for a range selection can be expensive. • It is often better (and simpler) to scan the data file instead.
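A small simulation shows why sorting the rids helps. It assumes rids are (page_id, slot) pairs and, pessimistically, that only the most recently read page stays in memory; the rid values and the name `page_fetches` are made up for illustration.

```python
def page_fetches(rids):
    """Count page reads when only the last page read is kept in memory."""
    fetches = 0
    last_page = None
    for page_id, _slot in rids:
        if page_id != last_page:
            fetches += 1
            last_page = page_id
    return fetches

# Hypothetical rids in the order an unclustered index might return them.
rids_from_index = [(7, 0), (2, 3), (7, 1), (2, 0), (5, 2), (2, 1)]

unsorted_cost = page_fetches(rids_from_index)
# Sorting tuples orders them by page_id first, then by slot.
sorted_cost = page_fetches(sorted(rids_from_index))
```

After sorting, each of the three distinct pages is read exactly once instead of being revisited.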

Selection: Hash Index • A hash index is good for equality selections. • Costs: • Cost I (retrieving the index bucket page) • + Cost II (retrieving qualifying tuples from R) • Cost I is about one I/O. • Cost II could be up to one I/O per satisfying tuple.

General Condition : Conjunction • A condition with several predicates combined by conjunction (AND): • Example : day<8/9/94 AND bid=5 AND sid=3.

General Selections (Conjunction) First approach (utilizing a single index): • Find the most selective access path and retrieve tuples using it. • Terms that match this index reduce the number of tuples retrieved. • Apply any remaining terms that don’t match the index to discard some of the retrieved tuples; this does not affect the number of tuples/pages fetched. • Example: Consider day<8/9/94 AND bid=5 AND sid=3. • A B+ tree index on day can be used; • then bid=5 and sid=3 must be checked for each retrieved tuple. • Alternatively, a hash index on <bid, sid> could be used; • day<8/9/94 must then be checked on the fly.

General Selections Second approach (utilizing multiple indexes): • Assume 2 or more matching indexes that use Alternative (2) or (3) for data entries. • Get the set of rids of data records using each matching index. • Then intersect these sets of rids. • Retrieve the records and apply any remaining terms. • Example: Consider day<8/9/94 AND bid=5 AND sid=3. • A B+ tree index I on day and an index II on sid, both using Alternative (2): • Retrieve rids of records satisfying day<8/9/94 using index I. • Retrieve rids of records satisfying sid=3 using index II. • Intersect the rids. • Retrieve the records and check bid=5.
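The rid-intersection steps can be sketched with Python sets. Everything here is hypothetical: the records, the integer rids, and the reading of 8/9/94 as August 9, 1994. The two rid sets are computed by filtering, as stand-ins for what index I and index II would return.

```python
# Hypothetical Reserves records keyed by rid. Dates are ISO strings, so
# string comparison agrees with chronological order.
records = {
    1: {"sid": 3, "bid": 5, "day": "1994-08-01"},
    2: {"sid": 3, "bid": 7, "day": "1994-07-15"},
    3: {"sid": 4, "bid": 5, "day": "1994-08-02"},
    4: {"sid": 3, "bid": 5, "day": "1994-09-30"},
}

# Stand-ins for the rid sets returned by index I (day) and index II (sid).
rids_day = {rid for rid, r in records.items() if r["day"] < "1994-08-09"}
rids_sid = {rid for rid, r in records.items() if r["sid"] == 3}

# Intersect first, so only records passing both indexed terms are fetched.
candidates = rids_day & rids_sid

# Fetch the candidates and apply the remaining term bid=5.
result = sorted(rid for rid in candidates if records[rid]["bid"] == 5)
```

Only two of the four records are fetched here, and the bid=5 check then discards one of them.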

General Condition: Disjunction • Disjunction condition: one or more terms (R.attr op value) connected by OR (∨). • Example: (day<8/9/94) OR (bid=5 AND sid=3)

General Selection (Disjunction) • Case 1: No index is available for one of the terms. • Then: • A file scan is needed anyway. • Thus, check the other conditions during this file scan. • E.g., consider day<8/9/94 OR rname='Joe' • No index on day, so a file scan is needed. • Even if an index is available on rname, it does not help.

General Selection (Disjunction) • Case 2: Every term has a matching index. • Retrieve candidate tuples using the respective index for each term. • Then union the results. • Example: Consider day<8/9/94 OR rname='Joe' • Assume two B+ tree indexes, on day and on rname. • Retrieve tuples satisfying day<8/9/94. • Retrieve tuples satisfying rname='Joe'. • Union the retrieved tuples, removing duplicates: a tuple that satisfies both terms is retrieved once through each index.
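One way to handle the duplicates is to union rids before fetching any records. The sketch below is an illustration under assumptions, not a method the slides prescribe: the records and rids are hypothetical, and 8/9/94 is read as August 9, 1994.

```python
# Hypothetical Reserves records keyed by rid (ISO date strings).
records = {
    1: {"rname": "Joe", "day": "1994-08-01"},
    2: {"rname": "Ann", "day": "1994-07-15"},
    3: {"rname": "Joe", "day": "1994-10-02"},
    4: {"rname": "Bob", "day": "1994-09-30"},
}

# Stand-ins for the rid lists returned by the indexes on day and rname.
rids_day = [rid for rid, r in records.items() if r["day"] < "1994-08-09"]
rids_rname = [rid for rid, r in records.items() if r["rname"] == "Joe"]

# Plain concatenation would list rid 1 twice, since it satisfies both terms.
concatenated = rids_day + rids_rname

# A set union keeps each rid once, so each record is fetched only once.
result_rids = sorted(set(rids_day) | set(rids_rname))
```

Deduplicating on rids rather than on fetched tuples also avoids reading the same record twice.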

Query Evaluation of Projection

Algorithms for Projection SELECT R.sid, R.bid FROM Reserves R

Algorithms for Projection SELECT DISTINCT R.sid, R.bid FROM Reserves R • The expensive part is removing duplicates. • SQL systems don’t remove duplicates unless the keyword DISTINCT is specified in the query. • Sorting Approach: • Sort on <sid, bid> and remove duplicates. (Can optimize this by dropping unwanted information while sorting.) • Hashing Approach: • Hash on <sid, bid> to create partitions. Load partitions into memory one at a time, build an in-memory hash structure, and eliminate duplicates. • Indexing Approach: • If there is an index with both R.sid and R.bid in the search key, it may be cheaper to sort the data entries than the data records.
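The sorting and hashing approaches can be sketched in memory as follows. This is a simplified illustration: a real system would sort or partition on disk, and the function names, the partition count, and the sample rows are all assumptions.

```python
def project_distinct_sort(rows, attrs):
    """Project first (dropping unwanted attributes), sort, then skip repeats."""
    projected = sorted(tuple(r[a] for a in attrs) for r in rows)
    out = []
    for t in projected:
        # After sorting, duplicates are adjacent.
        if not out or out[-1] != t:
            out.append(t)
    return out

def project_distinct_hash(rows, attrs, num_partitions=2):
    """Partition by hash, then deduplicate each partition independently."""
    partitions = [[] for _ in range(num_partitions)]
    for r in rows:
        t = tuple(r[a] for a in attrs)
        # Equal tuples always land in the same partition.
        partitions[hash(t) % num_partitions].append(t)
    out = []
    for part in partitions:
        seen = set()  # in-memory hash structure for this partition only
        for t in part:
            if t not in seen:
                seen.add(t)
                out.append(t)
    return out

# Hypothetical Reserves rows; the day attribute is projected away.
rows = [
    {"sid": 1, "bid": 100, "day": "d1"},
    {"sid": 2, "bid": 100, "day": "d2"},
    {"sid": 1, "bid": 100, "day": "d3"},
    {"sid": 1, "bid": 101, "day": "d4"},
]
by_sort = project_distinct_sort(rows, ["sid", "bid"])
by_hash = project_distinct_hash(rows, ["sid", "bid"])
```

Both functions return the same set of <sid, bid> pairs; only the sort-based version guarantees sorted output order.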

Summary • There are several alternative evaluation algorithms for each relational operator.

Conclusion • No single method wins. • The optimizer must assess the situation to select the best possible candidate.
