Query optimization in relational DBs

Query optimization in relational DBs Leveraging the mathematical formal underpinnings of the relational model.

Pre-processing a query • Convert SQL to algebra • Create a binary parse tree • Assign execution fragments to each interior node

Optimizing a query: stage 1 • Manipulate parse tree to create new, but equivalent trees that will be faster independently of the properties of the database • Select a handful that would most likely be the fastest

Stage 2: collect information about the DB’s physical state • Size of tables being manipulated (in both rows and space) • Sorting order of tables • Location of indices on tables • Distribution of PKs and FKs (to estimate hit ratio on joins) • Hit ratio of attributes referenced by 1-operator commands (select and project)

Stage 3: transform parse tree into a tree that is likely to be faster • Heuristics: unary operations • Push unary operations down stream • Bundle unary operations on common leaves • Goal is to perform unary operations together and before binary operations, to make binary operations faster • Heuristics: joins • Most join algorithms tree the two operands differently • Make use of sort orders and index locations

Stage 4: choose implementations of binary operations • If both tables sorted by desired attributes, do a merge join • If one table has an index, to a nested join where the inner table is accessed via the index and the outer one is accessed sequentially • Order unary in-line operations so that only necessary attributes are funneled upstream

Stage 4, Notes on joins • If no indices exist and the tables are not sorted the way we want, build an index on one, then do a sort merge or a nested join • If both tables have indices, use the one for the shortest table • If the hit ratio of a join will be very small, do it as early as you can

Stage 5: consider multiple execution plans • Estimate the costs of various execution plans and pick the best one • Balance time spent optimizing versus time spent executing • For one time only transactions, pick a plan fast

Other important issues • Use indices that are fast with respect to pulling in blocks from the disk • Wide, but shallow trees, like b-trees • An index on a sorted attribute can be interpolated • Hash indices good for key-based selections, but are not as versatile as b-trees

A note on b-trees • They are dynamic and therefore do not suffer from the migrating artificial primary key problems • Since they are dynamic, the tradeoff is that they are slightly less space efficient and often lead to an extra layer in the tree

Important: manipulating parse trees • There is a sort of algebra of parse tree manipulation • For example, A join B and B join A are equivalent • (A join B) join C = A join (B join C) • If a and b are used as join attributes, we can strip away all but those and the ones that will be chosen by a later projection or selection

More notes on optimization • The smallest parse tree might not be the best • Sometimes we add unary operations to make the operands of binary operations smaller • If a unary operation will have a very small hit ratio, do it early • The cost of a complete execution plan is complex to calculate • Statistics of the data size and value distribution • Choice of parse tree • Choice of execution plan for each tree • !! It is not a linear process

In light of modern database applications • Often the cost of optimization is not worth it • Many one of a kind transactions • Database with very small transaction load, perhaps do to multiple servers and hard drives • Tables are small and queries are simple, and so almost any execution plan is good • This is a key reason by traditional relational database servers are often far less efficient than no-SQL DBs – because they were built to efficiently run a high volume of similar transactions that involve multiple tables

Making use of parallelism • SQL, compared to various non-declarative languages, is very east to parallelize • But communication costs can be very high if the servers are not co-located • Often, disk arrays are the chosen form of parallelism

Optimization in the real world today • Languages and queries are often not set-based • The code in a query can be very imperative • Binary operations might be in the minority • Databases can be so huge that optimization has to be heavily biased toward certain kinds of queries

Query optimization in relational DBs

Query optimization in relational DBs

Presentation Transcript

Relational Query Optimization

Relational Query Optimization

Query Optimization

Relational Query Optimization

Alternatives to relational DBs

ICS 421 Spring 2010 Relational Query Optimization

Relational Query Optimization

Relational Query Optimization

Query Optimization: Relational Queries to Data Mining

Relational Query Optimization

Relational Query Optimization

Relational Query Optimization

Relational Query Optimization

Relational DBs

Overview of relational dbs

Relational Query Optimization I: Cost Estimation

Relational Query Optimization

Relational Query Optimization

Relational Query Optimization

Relational Query Optimization

Relational Query Optimization