350 likes | 361 Vues
Oracle/SQL Tutorial CBIL Lab. Meeting August 15, 2002 Jonathan Crabtree crabtree@pcbi.upenn.edu. Benefits. Master the black art of query optimization! Discover which of your co-workers has locked the tables you want to modify! Win friends and influence people with recursive queries!.
E N D
Oracle/SQL TutorialCBIL Lab. Meeting August 15, 2002Jonathan Crabtreecrabtree@pcbi.upenn.edu
Benefits • Master the black art of query optimization! • Discover which of your co-workers has locked the tables you want to modify! • Win friends and influence people with recursive queries!
Outline • Introduction • Overview of query optimization. • Why fast databases sometimes run slowly. • Speeding things up: the virtue of impatience. • A smattering of SQL.
Some Oracle terminology • God: Larry Ellison • Database instance:an Oracle server • SID: unique System IDentifier for an instance • SID=GUSDEV (erebus.pcbi.upenn.edu) • SID=GUS (nemesis.pcbi.upenn.edu) • Schema/User: a database user/login • each database object is owned by one of these • Role: a group that can be granted privileges
GUSdev (erebus.pcbi.upenn.edu:1521) ORACLE INSTANCE(S) TABLES & INDEXES TABLESPACES DATAFILES DISKS Core.Taxon Core.PK_Taxon USERS INDX I1 I2 U1 U2 I3 DISK1 DISK2 DISK3
SQL: Querying the database • SQL is used to tell a database what information you want, but not how to compute it. • compare this to an "imperative" language like Perl. • e.g., "display the primary accession number of each externally-generated Plasmodium falciparum sequence longer than 50 kb." SELECT ena.source_id FROM gusdev.ExternalNASequence ena, gusdev.Taxon t WHERE ena.taxon_id = t.taxon_id AND t.scientific_name = 'Plasmodium falciparum' AND ena.length >= 50000;
Query Access Plans • PLAN A: • 1. look up Plasmodium'staxon_id in Taxon • 2. search through ExternalNASequence; print the source_id whenever taxon_id = 211 and length >= 50,000 • PLAN B: • 1. retrieve source_id and taxon_id for all ExternalNASequence entries with length >= 50,000 • 2. look up Plasmodium's taxon_id in Taxon • 3. scan the result of 1. for entries with taxon_id = 211 • Which plan is faster (both here and in general)?
Query optimization • Optimizer's job is to find the best query access plan for a given query (and database). • In practice an extremely difficult problem. • Huge number of possible query access plans. • Two main approaches: • Rule(s)-based query optimization • Cost-based query optimization • Also logical versus physical optimizations.
What can affect the query optimization process? • What does the optimizer have to work with? • order of operations (e.g., join order) • use of auxiliary data structures such as indexes • different join implementations (e.g., hash join) • logical transformations (e.g., pushing projections) • What is the goal? • Minimize overall response time (the usual goal) • Minimize initial response time • Minimize CPU time, disk accesses, etc., etc.
The role of indexes • Like Perl hash tables; provide fast lookups • CREATE INDEX T1 ON Taxon(scientific_name) TABLESPACE INDX; • CREATE INDEX T2 ON Taxon(Genus,Species) TABLESPACE INDX; • Optimizer will typically try to match columns used in your query with a prefix of the index cols. • Can be defined as unique indexes • Can allow for fast range queries
Oracle diagnostic tools • Queries at http://www.cbil.upenn.edu/dba/ • current logins/running SQL queries • indexes, foreign key constraints for a table • locks on database tables • EXPLAIN PLAN • describe query access plan without running it • TRACE/TKPROF facility • collect data while a query is running
Hands-on demo time! • http://www.cbil.upenn.edu/dba/ • EXPLAIN PLAN example using our earlier SQL query for long Plasmodium falciparum DNA sequences in ExternalNASequence: SELECT ena.source_id FROM gusdev.ExternalNASequence ena, gusdev.Taxon t WHERE ena.taxon_id = t.taxon_id AND t.scientific_name = 'Plasmodium falciparum' AND ena.length >= 50000;
Using EXPLAIN PLAN • Requires a PLAN_TABLE for computed plans • Can be used on any SQL statement: • EXPLAIN PLAN FOR SELECT * FROM NASequenceImp; • A different table can be used and/or a name given to the stored plan: • EXPLAIN PLAN INTO MY_PLAN_TABLE SET STATEMENT_ID = 'myquery1' SELECT * FROM NASequenceImp; • A script is then used to query the plan table
EXPLAIN PLAN shows two full table scans • "TABLE ACCESS FULL" is often bad news! using "PLAN_TABLE" by default this script shows the most recent plan in PLAN_TABLE note the cost estimates generated by Oracle's Cost-Based Optimizer (CBO)
Taxon has no indexes, so we create one • Follow by "ANALYZE TABLE" to update stats. • And then note the change in plan
But why does this query have a different plan? • optimizer can be sensitive to small changes • Note also the estimated number of rows is 7K! (actual number = 172) this is an index on (external_db_id, taxon_id); SKIP SCAN is used because (taxon_id) is NOT a prefix of the index columns
Can we force our query to use a similar plan? • Almost, by using the following optimizer "hint" (contained in an SQL comment): • INDEX(ena, NASequenceImp_IND15) multiple hints can be given, separated by spaces
How fast is each one? • Original - double table scan (>700M used!): • 11min 23sec: cost=91453 rows=136 actual=172 • With index -table scan of NASequenceImp: • 13min 45sec: cost=91384 rows=136 actual=172 • Index skip scan on ExternalNASequence only: • 7 sec: cost=4224 rows=7K (why?) actual=172 • Original (2-table) query with INDEX hint: • 34 sec: cost=31419 rows=136 actual=172
Causes of slow or stalled updates • Slow-running subquery (for insert/update) • Missing foreign key indexes in tables that reference the one you're modifying (delete) • Contention - tables are locked or being heavily modified by someone else • e.g., running plugins with "nocommit" flag
Categorizing performance problems • Optimizer-related performance problems: • Cases in which optimizer is actually choosing an optimal or (more likely) near-optimal plan. • can generally be solved by giving the optimizer more options to work with e.g., creating a new index. • Those in which it's making a Bad Call. • can generally be solved by giving the optimizer more information about the choices it already has e.g., by generating more accurate statistics or providing it with explicit hints. Try variants of your original query. • Performance problems not related to optimization
Attacking the problem • Use EXPLAIN PLAN to see what's going on • ANALYZE TABLE/CREATE INDEX as needed • Add some optimizer hints • suggest an index with INDEX(tname, ind_name) • or a join order with ORDERED • try a NO_UNNEST or a RULE • Build some intermediate and/or temporary tables; the database is better at simpler operations
SPEED 2 (Starring You) • Living dangerously: using unlogged operations • TRUNCATE TABLE (versus DELETE *) • The "NOLOGGING" option for large updates • Don't recommend doing this in general • But truncate table is a good way to remove data quickly (if you're sure you don't need it!)
A SMATTERING OF SQL • SELECTing from DUAL • Using ROWNUM to limit output • Recursive queries • Fancy coercions • Using a private workspace/schema • Database links and distributed queries
The "DUAL" table This kind of syntax (i.e. no "FROM") is OK in Sybase, but not Oracle So Oracle has a "dummy" table called dual that can be used in such situations
Limiting output with the ROWNUM pseudocolumn SELECT external_db_id, name FROM ExternalDatabase WHERE ROWNUM < 3; 1 assembly 2 GenBank (nrdb) 1 assembly 2 GenBank (nrdb) 105 zfin 100 yepd SELECT external_db_id, name FROM ExternalDatabase WHERE ROWNUM < 3 ORDER BY name desc; SELECT * FROM (SELECT external_db_id, name FROM ExternalDatabase ORDER BY name desc) WHERE ROWNUM <3;
Recursive Queries • Basic restriction of SQL: number of tables to be joined is fixed when the query is written • recursive queries are one exception to this rule • allow a table to be joined with itself • SQL syntax (single table select): SELECT c1, c2, ... FROM table1 START WITH <initial row condition> CONNECT BY PRIOR cx = cy;
gusdev.Anatomy "hier_level" is actually a column of the gusdev.Anatomy table ...and we use it to make sure that we only retrieve the first two "levels" of the table in our recursive query, starting with anatomy_id = 0 (the top)
Another pseudocolumn: LEVEL But if we don't make a "hier_level" column ourselves, Oracle provides a pseudocolumn, LEVEL, that serves the same purpose.
Climbing up the tree This time we're starting at an internal node of the tree/graph and traversing back up to the root; note the difference in the CONNECT BY clause that's responsible for this change. Note also how LEVEL is now reversed.
Fancy stuff INSERT INTO IsExpressed SELECT /*+ RULE */ IsExpressed_sq.nextval, aap.anatomy_id, 56 AS table_id, aap.na_sequence_id AS row_id, nvl2(cast( (select 1 from dual where round((aap.percent / 100.0) * est_count) >= 2) as number), 1, 0) AS is_confirmed, SYSDATE, 1, 1, 1, 1, 1, 0, 3, 0, 0, 0 FROM gusdev.AssemblyAnatomyPercent aap WHERE aap.anatomy_id IN ( SELECT DISTINCT a.anatomy_id FROM gusdev.Anatomy a WHERE hier_level > 1 START WITH a.anatomy_id in (321,60, ...) CONNECT BY prior a.parent_id = a.anatomy_id );
Benefits of using SQL like this? • Original AssemblyAnatomyPercent plugin • 313,368 rows in 5.5 hours; 15.8 rows/second • Improved plugin • 73612 rows in 55 minutes; 22.1 rows/second • SQL insert method • 2,523,838 rows in 33 minutes; 1245 rows/second • >50X faster than the improved plugin
Private workspacesand "CREATE TABLE AS" There's no SYNONYM for this gusdev table in my workspace So I have to explicitly say which Taxon table I want to copy This query tells me which tables are owned by the current login (crabtree@gusdev)
Database links GUSDEV You can query gus (nemesis) tables from gusdev (erebus) GUS And vice versa!
Linking to other schemas • CREATE TABLE AS The same goes for other public tables on the same server as the one to which the link has been created. GUSDEV
Where to go from here? • Complete Oracle documentation online: • http://technet.oracle.com • http://technet.oracle.com/docs/products/oracle9i/content.html • Standard database textbooks • e.g., Ramakrishnan: Database Management Systems. • Try things out on your own!