You’re Smarter than a Database

You’re Smarter than a Database Overcoming the optimizer’s bad cardinality estimates

About me • Bobby Durrett • US Foodservice • bobbyandmarielle@yahoo.com • Scripts in http://www.geocities.com/bobbyandmarielle/sqltuning.zip

What you know

What the database knows

Before SQL • Example - mainframe Datacom/DB COBOL • List index names • Write loops read a from one index i1 where one.c=10 while more table one rows exist get next row read b from two index i2 where two.a = one.a while more table two rows exist get next row print one.a,two.b end while end while

SQL • Tell what you want, not how to get it select one.a,two.b from one,two where one.c=10 and one.a=two.a;

Pre-SQL versus SQL • Pre-SQL code very efficient – runs in megabytes – VSE mainframe COBOL • Labor intensive • SQL can be inefficient – runs in gigabytes (if you are lucky!) • Much more productive – do in minutes what took hours before – create tables

What the database doesn’t know • Optimizer has a limited set of statistics that describe the data • It can miscalculate the number of rows a query will return, its cardinality • A cardinality error can lead optimizer to choose a slow way to run the SQL

How to fix cardinality problems • Find out if it really is a cardinality issue • Determine the reason it occurred • Single column • Multiple columns • Choose a strategy • Give the optimizer more information • Override optimizer decision • Change the application

Four examples • Four examples of how the optimizer calculates cardinality • Full scripts and their outputs on portal, pieces on slides – edited for simplicity

Step 1: Find out if it really is a cardinality issue • Example 1 • Data select a,count(*) from test1 group by a; A COUNT(*) ---------- ---------- 1 10 • Query select * from test1 where a=1;

Step 1: Find out if it really is a cardinality issue • Get estimated cardinality from plan ------------------------------------------- | Id | Operation | Name | Rows | ------------------------------------------- | 0 | SELECT STATEMENT | | 10 | |* 1 | TABLE ACCESS FULL| TEST1 | 10 | ------------------------------------------- • Do query for actual number of rows select count(*) from test1 where a=1;

join table join table table Step 1: Find out if it really is a cardinality issue • Plan is a tree – find cardinality and select count(*) on part of query represented by that part of plan.

Step 2: Understand the reason for the wrong cardinality • Unequal distribution of data: • Within a single column • Last name • “Smith” or “Jones” • Among multiple columns – • Address • Zipcode and State

Step 2: Understand the reason for the wrong cardinality • Example 2 - Unequal distribution of values in a single column • 1,000,000 rows with value 1 • 1 row with value 2 select a,count(*) from TEST2 group by a; A COUNT(*) ---------- ---------- 1 1000000 2 1

Step 2: Understand the reason for the wrong cardinality • SQL statement – returns one row select * from TEST2 where a=2;

Step 2: Understand the reason for the wrong cardinality • Plan with wrong number of rows = 500,000 • Full scan instead of range scan – 100 times slower --------------------------------------------- | Operation | Name | Rows | --------------------------------------------- | SELECT STATEMENT | | 500K| | INDEX FAST FULL SCAN| TEST2INDEX | 500K| ---------------------------------------------

Step 2: Understand the reason for the wrong cardinality • Column statistics – two distinct values LOW HIGH NUM_DISTINCT ---------- ---------- ------------ 1 2 2 • Table statistic – total # of rows – 1,000,001 NUM_ROWS ---------- 1000001

Step 2: Understand the reason for the wrong cardinality • Rows in plan = (rows in table)/ (distinct values of column) • 500000=1000001/2 • Optimizer knew that there were only two values – assumed they had equal number of rows

Step 2: Understand the reason for the wrong cardinality • Example 3 - Combinations of column values not equally distributed • 1,000,000 rows with values 1,1 • 1,000,000 rows with values 2,2 • 1 row with value 1,2 • ~ Equal numbers of 1’s and 2’s in each column A B COUNT(*) ---------- ---------- ---------- 1 1 1000000 1 2 1 2 2 1000000

Step 2: Understand the reason for the wrong cardinality • SQL statement – retrieves one row select sum(a+b) from TEST3 where a=1 and b=2;

Step 2: Understand the reason for the wrong cardinality • Plan with wrong number of rows = 500,000 • Inefficient full scan ---------------------------------------------- | Operation | Name | Rows | ---------------------------------------------- | SELECT STATEMENT | | 1 | | SORT AGGREGATE | | 1 | | INDEX FAST FULL SCAN| TEST3INDEX | 500K| ----------------------------------------------

Step 2: Understand the reason for the wrong cardinality • Column statistics C LOW HIGH NUM_DISTINCT - ---------- ---------- ------------ A 1 2 2 B 1 2 2 • Table statistic – total # of rows – 2,000,001 NUM_ROWS ---------- 2000001

Step 2: Understand the reason for the wrong cardinality • Rows in plan = (rows in table)/ (distinct values A * distinct values B) • 500000=2000001/(2 * 2) • Optimizer assumes all four combinations (1,1),(1,2),(2,1),(2,2) equally likely

Step 2: Understand the reason for the wrong cardinality • How to tell which assumption is in play? • Select count(*) each column select a,count(*) from TEST3 group by a; select b,count(*) from TEST3 group by b; • count(*) each column combination select a,b,count(*) from TEST3 group by a,b;

Step 3: Choose the best strategy for fixing the cardinality problem • Giving the optimizer more information • Histograms • SQL Profiles • Overriding optimizer decisions • Hints • Changing the application Try to use optimizer as much as possible to minimize development work

Step 3: Choose the best strategy for fixing the cardinality problem • Giving the optimizer more information – using histograms • Works for unequal distribution within a single column • A histogram records the distribution of values within a column in up to 254 “buckets” • Works best on columns with fewer than 255 distinct values

Step 3: Choose the best strategy for fixing the cardinality problem • Run gather_table_stats command to get histograms on the column – 254 is max number of buckets method_opt=>'FOR ALL COLUMNS SIZE 254'

Step 3: Choose the best strategy for fixing the cardinality problem • Plan for Example 2 with correct number of rows with histogram • Uses range scan ----------------------------------------- | Operation | Name | Rows | ----------------------------------------- | SELECT STATEMENT | | 1 | | INDEX RANGE SCAN| TEST2INDEX | 1 | -----------------------------------------

Step 3: Choose the best strategy for fixing the cardinality problem • Column statistics – two buckets LOW HIGH NUM_DISTINCT NUM_BUCKETS ---------- ---------- ------------ ----------- 1 2 2 2 • Table statistic – unchanged NUM_ROWS ---------- 1000001

Step 3: Choose the best strategy for fixing the cardinality problem • Time without histograms (1 second): Elapsed: 00:00:01.00 • Time with histograms(1/100th second): Elapsed: 00:00:00.01

Step 3: Choose the best strategy for fixing the cardinality problem • Giving the optimizer more information – using SQL Profiles • Works for unequal distribution among multiple columns • Includes information about the relationship between columns in the SQL – correlated columns or predicates

Step 3: Choose the best strategy for fixing the cardinality problem • SQL Tuning Advisor gathers statistics on the columns ...DBMS_SQLTUNE.CREATE_TUNING_TASK(... ...DBMS_SQLTUNE.EXECUTE_TUNING_TASK(... • Accept the SQL Profile it creates to use the new statistics ...DBMS_SQLTUNE.ACCEPT_SQL_PROFILE (...

Step 3: Choose the best strategy for fixing the cardinality problem • Example 3 plan with correct number of rows = 1 using SQL profile -------------------------------------------------- | Operation | Name | Rows | Bytes | -------------------------------------------------- | SELECT STATEMENT | | 1 | 6 | | SORT AGGREGATE | | 1 | 6 | | INDEX RANGE SCAN| TEST3INDEX | 1 | 6 | -------------------------------------------------|

Step 3: Choose the best strategy for fixing the cardinality problem • Time without a profile (1 second): Elapsed: 00:00:01.09 • Time with a profile(1/100th second): Elapsed: 00:00:00.01

Step 3: Choose the best strategy for fixing the cardinality problem • Overriding optimizer decisions – using hints • Example 4 has unequal distribution of column values across two tables – histograms and SQL Profiles don’t work • Hint forces index range scan • Small amount of additional code – not like Cobol on mainframe

Step 3: Choose the best strategy for fixing the cardinality problem • Example 4 - SMALL table • MANY relates to 1 – there are many rows with value 1 • FEW relates to 2 – there are few with value 2 insert into SMALL values ('MANY',1); insert into SMALL values ('FEW',2);

Step 3: Choose the best strategy for fixing the cardinality problem • Example 4 - LARGE table: • 1,000,000 rows with value 1 • 1 row with value 2 NUM COUNT(*) ---------- ---------- 1 1000000 2 1

Step 3: Choose the best strategy for fixing the cardinality problem • SQL statement – returns one row select B.NUM from SMALL A,LARGE B where A.NUM=B.NUM and A.NAME='FEW';

Step 3: Choose the best strategy for fixing the cardinality problem • Plan with wrong number of rows = 125,000 ---------------------------------------------- | Operation | Name | Rows | ---------------------------------------------- | SELECT STATEMENT | | 125K| | HASH JOIN | | 125K| | TABLE ACCESS FULL | SMALL | 1 | | INDEX FAST FULL SCAN| LARGEINDEX | 1000K| ----------------------------------------------

Step 3: Choose the best strategy for fixing the cardinality problem • Column statistics – two buckets on all columns – using histograms LOW HIGH NUM_DISTINCT NUM_BUCKETS ---------- ---------- ------------ ----------- 1 2 2 2 LOW HIGH NUM_DISTINCT NUM_BUCKETS ---- ---- ------------ ----------- FEW MANY 2 2

Step 3: Choose the best strategy for fixing the cardinality problem • Table statistics – SMALL has 2 rows, LARGE 1000001 NUM_ROWS ---------- 2 NUM_ROWS ---------- 1000001

Step 3: Choose the best strategy for fixing the cardinality problem • 125000=1000001/8 • Optimizer appears to assume all eight combinations of the three columns’ values are equally likely • Can’t verify formula – references don’t include formula with histograms • Even worse without histograms – cardinality is 500000

Step 3: Choose the best strategy for fixing the cardinality problem • No SQL profile from SQL Tuning Advisor: There are no recommendations to improve the statement. • Neither histograms nor SQL profiles help example 4

Step 3: Choose the best strategy for fixing the cardinality problem • Statement with hints: • Use index • Don’t do full scan select /*+ INDEX(B LARGEINDEX) NO_INDEX_FFS(B LARGEINDEX) */ B.NUM from SMALL A,LARGE B where a.NUM=B.NUM and A.NAME='FEW';

Step 3: Choose the best strategy for fixing the cardinality problem • Time without a hint (1 second): Elapsed: 00:00:01.03 • Time with a hint (1/100th second): Elapsed: 00:00:00.01

Step 3: Choose the best strategy for fixing the cardinality problem • Changing the application • Change your tables so that the optimizer gets your SQL’s cardinality right • Requires more work designing tables, but keeps productivity benefits of SQL

Step 3: Choose the best strategy for fixing the cardinality problem • Example 4 – moved NAME column to LARGE table and split table in two • One million (‘MANY’,1) rows in LARGEA • One (‘FEW’,2) row in LARGEB • Query: select NUM from (select * from largea union select * from largeb) where NAME='FEW';

Step 3: Choose the best strategy for fixing the cardinality problem • Plan is just as efficient as with hint: • Number of rows = 2 (reality is 1) • Range Scan -------------------------------------------------------------- | Id | Operation | Name | Rows | -------------------------------------------------------------- | 0 | SELECT STATEMENT | | 2 | | 1 | VIEW | | 2 | | 2 | SORT UNIQUE | | 2 | | 3 | UNION-ALL | | | | 4 | TABLE ACCESS BY INDEX ROWID| LARGEA | 1 | |* 5 | INDEX RANGE SCAN | LARGEAINDEX | 1 | | 6 | TABLE ACCESS BY INDEX ROWID| LARGEB | 1 | |* 7 | INDEX RANGE SCAN | LARGEBINDEX | 1 | --------------------------------------------------------------

You’re Smarter than a Database