500 likes | 616 Vues
www.brianhitchcock.net. Session id: 31885. Dedicated to Pramitha Chowrira, the Goddess of the Rockies, Mike Waldron, the Student who became the Master, Sheryl Driscoll, who leads purely for the glory and Ann Bischoff, who took me in trade for a developer.
E N D
Session id: 31885 Dedicated to Pramitha Chowrira, the Goddess of the Rockies, Mike Waldron, the Student who became the Master, Sheryl Driscoll, who leads purely for the glory and Ann Bischoff, who took me in trade for a developer... How STATSPACK Was Used to Solve Common Performance Issues Brian HitchcockOCP 8, 8i, 9i DBA Sun Microsystems brian.hitchcock@sun.com www.brianhitchcock.net
What STATSPACK Is • Set of SQL and PL/SQL • Collects performance data from v$ tables • Stores collected data in separate tables • Each collection of data is a ‘snapshot’ • Reports deltas in data between snapshots • Supports ad-hoc SQL queries of the snapshot data
STATSPACK Details • Works for 8.1.7 onwards • Gathers data for a single instance • Snapshot levels • Determine how much data is collected • Defaults are fine • Snapshot interval of 15 minutes suggested • Long report periods miss transient events • Reports over an instance restart are not valid
STATSPACK -- Good • Free (very cool!) • Gathers a wide range of data • You don’t know what you’re looking for at first • Root cause isn’t usually obvious • Standard process to collect performance data • Gathers the same data on all instances • Easy to share with vendors, support groups
STATSPACK -- Not Perfect • Gathers a wide range of data • Ocean of data • Any information? • Easy to get lost • Does not tell you what the problem is • Shows what is happening in the instance • You need to figure out if this is a problem or not • Does not tell you the solution • Does not tell you that you are done tuning...
How to Interpret Output? • Requires experience with your system • No single way to analyze output • Must have history of your system • Look for possible problem areas • Trial and error to change problem behavior • Only you can tell if you have a performance problem
STATSPACK Report Sections • Instance Summary, Efficiency • Top 5 Wait Events, Wait Events • SQL Ordered by Gets, Reads, Executions • Instance Activity Stats • Tablespace IO Stats Ordered by IOs, Tblspc-file • Buffer Pool Statistics • Rollback Segment Stats, Storage • Latch Activity, Sleep, Miss Sources • Dictionary Cache Stats • Library Cache Activity • SGA Memory Summary, Breakdown • init.ora Parameters
Documentation of output • No comprehensive documentation • Oracle 8i Reference • Appendix A -- Wait Events defined • Appendix B -- Enqueue Names defined • Appendix C -- Statistics Descriptions • Database Performance Guide and Ref 9.0.1 • Chapters 21-23, Supplied packages, how to use • $ORACLE_HOME/rdbms/admin/spdoc.txt • ORACLE High-Performance Tuning with STATSPACK, Donald K. Burleson, Oracle Press ISBN 0-07-213378-3 • No explanation of • What output means for your system
Configuration Used • Oracle 8.1.7.2 • Snapshots every 15 minutes • snapshots taken continuously • Default STATSPACK snapshot ‘level’ • Application loads and analyzes web site click stream data • Lots of data • More data all the time • We don’t know what vendor code looks like
Actual Use • 4 Performance issues in 2002 • Case 1) Reports Running Slow • STATSPACK output didn’t show the problem • Case 2) Vendor Demo Slow • STATSPACK output didn’t show the problem • Case 3) Data Load Slow • STATSPACK output led to 18x speedup (1800%) • Case 4) Data Load Time Varies • STATSPACK output led to the root cause
Case 1) Reports Running Slow • Vendor code allows users to setup reports • Vendor code generates SQL for report • Long run interferes with next day’s data load • STATSPACK captures SQL • Generate explain plan(s) • Report SQL doesn’t generate where clause properly to use partition pruning • Vendor refuses to change their code • We simply removed the reports • Performance issue ‘resolved’
Case 2) Vendor Demo Slow • Due to issues like Case 1) • New vendor sets up demo, data load slow • Data load runs twice as fast at vendor • Statspack output doesn’t show anything obvious • Compare configuration of vendor and our dbs • Vendor has only one redo log file per group • We had two redo log files per group • We drop one file per group, performance issue resolved
Case 3) Data Load Slow • First time loading new type of web log data • No baseline to compare with • Classical performance tuning doesn’t always apply to the real world • Data load so slow no time for daily reporting • Must run faster or the data won’t be loaded • We don’t know if this load will run faster • Do we have a ‘performance’ issue? • Yes, data load must run faster to be useful • No, perhaps this is as fast as it can be...
Case 3) Data Load Slow • SQL -- Highest Gets per Exec SQL ordered by Gets for DB: BHDATA04 Instance: BHDATA04 Snaps: 4426 -4427 -> End Buffer Gets Threshold: 10000 -> Note that resources reported for PL/SQL includes the resources used by all SQL statements called within the PL/SQL code. As individual SQL statements are also reported, it is possible and valid for the summed total % to exceed 100 Buffer Gets Executions Gets per Exec % Total Hash Value --------------- ------------ -------------- ------- ------------ 634,304 68 9,328.0 97.5 3948948039 SELECT t526.keyvalueid FROM bh_lqueryvalue t526 WHERE t526.query infoid = :ph0 ORDER BY t526.keyvalueid ASC
Bad SQL? • SQL shouldn’t cost much • Select looking for one row • Table has two indexes • Explain plan shows ‘index full scan’ • Should show ‘index range scan’ • Explain plan with hint to force one index • Verify cost of each index • Optimizer is choosing wrong index! • Drop the costly index! • Indexes added by vendor ‘to be safe’...
Explain Plan Force Index1 • Cost 4 SQL> truncate table plan_table; Table truncated. SQL> explain plan set Statement_Id = 'TEST' for SELECT /*+ INDEX(t526 X_LQRYVL_QUERYIDKYVL) */ t526.keyvalueid FROM bh_lqueryvalue t526 WHERE t526.queryinfoid = 100 ORDER BY t526.keyvalueid ASC; 2 3 4 5 Explained. SQL> @$ORACLE_HOME/rdbms/admin/utlxpls Plan Table -------------------------------------------------------------------------------- | Operation | Name | Rows | Bytes| Cost | Pstart| Pstop | -------------------------------------------------------------------------------- | SELECT STATEMENT | | 3 | 24 | 1 | | | | INDEX RANGE SCAN |X_LQRYVL_ | 3 | 24 | 3 | | | --------------------------------------------------------------------------------
Explain Plan Force Index2 • Cost 11,667 SQL> explain plan set Statement_Id = 'TEST' for SELECT /*+ INDEX(t526 X_LQYVAL_KYVLQRYID) */ t526.keyvalueid FROM bh_lqueryvalue t526 WHERE t526.queryinfoid = 100 ORDER BY t526.keyvalueid ASC; 2 3 4 5 Explained. SQL> @$ORACLE_HOME/rdbms/admin/utlxpls Plan Table -------------------------------------------------------------------------------- | Operation | Name | Rows | Bytes| Cost | Pstart| Pstop | -------------------------------------------------------------------------------- | SELECT STATEMENT | | 3 | 24 | 2334 | | | | INDEX FULL SCAN |X_LQYVAL_ | 3 | 24 | 9333 | | | -------------------------------------------------------------------------------- SQL>
Solution • After dropping costly index • data load time was 18 hours, became 1 hour • 18:1 improvement (1800%) • Why did optimizer choose wrong index? • No idea, Oracle requested running 18 hour data load to gather instance data • Business users said “NO!” • Indexes created by vendor, no need for both indexes • Know when to quit tuning!
What About Wait Events? • Popular DBAs • Wait Events are all that matters • I want (desperately) to be popular too… • Return to Case 3) • Examine Top 5 Wait Events section • Try to understand what is causing the wait time
Case 3) Wait Events • Top 5 Wait Events Top 5 Wait Events ~~~~~~~~~~~~~~~~~ Wait % Total Event Waits Time (cs) Wt Time -------------------------------------------- ------------ ------------ ------- PX Deq: Execution Msg 335 1,156 45.30 latch free 1,740 522 20.45 control file parallel write 88 329 12.89 db file sequential read 253 323 12.66 log file parallel write 2 42 1.65 ------------------------------------------------------------- • PX -- parallel query issues? • Contact Oracle Tech Support
Ask the Experts • Oracle Tech Support • Many wait events in STATSPACK output should be ignored (a bug perhaps? Or an RFE?) • Requests the full STATSPACK report • Tells me that the latch free wait event must be addressed • set session_cached_cursors = 100 • Performance improvement will be ‘significant’ • Data load now takes 19 hours (about 10% worse)
What Happened? • How could the experts miss the bad SQL? • Wait events are important • If the total wait time is the largest problem • In this case • Bad SQL dominated the overall run time • Wait event analysis • Many events should be ignored • Need to determine how much of total run time is due to wait events only • Go back and fix the SQL issue
Total CPU Time • From Instance Activity Stats • Total CPU time is 27475 10s of milliseconds • 10 milliseconds is 1 centisecond (cs) = 0.01 sec • 27457 --> 27475 cs = 274.75 seconds • Report interval was 4.60 minutes (276 seconds) • confused? How many cs left until Happy Hour? Instance Activity Stats for DB: BHDATA04 Instance: BHDATA04 Snaps: 4426 -4427 Statistic Total per Second per Trans --------------------------------- ---------------- ------------ ------------ CPU used by this session 27,441 99.4 27,441.0 CPU used when call started 27,475 99.6 27,475.0 … … …
Total Wait Time • Look at all Wait Events Wait Events for DB: BHDATA04 Instance: BHDATA04 Snaps: 4426 -4427 -> cs - centisecond - 100th of a second -> ms - millisecond - 1000th of a second -> ordered by wait time desc, waits desc (idle events last) Avg Total Wait wait Waits Event Waits Timeouts Time (cs) (ms) /txn ---------------------------- ------------ ---------- ----------- ------ ------ PX Deq: Execution Msg 335 0 1,156 35 335.0 latch free 1,740 1,046 522 3 ###### control file parallel write 88 0 329 37 88.0 db file sequential read 253 0 323 13 253.0 log file parallel write 2 0 42 210 2.0 enqueue 61 0 40 7 61.0 refresh controlfile command 602 0 37 1 602.0 PX Deq: Msg Fragment 64 0 29 5 64.0 PX Deq: Parse Reply 59 0 26 4 59.0 control file sequential read 2,673 0 16 0 ###### log file sync 1 0 15 150 1.0 PX Deq: Signal ACK 2 2 11 55 2.0 PX Deq: Join ACK 57 0 3 1 57.0 PX Deq: Execute Reply 5 0 3 6 5.0 SQL*Net more data to client 37 0 0 0 37.0 file open 6 0 0 0 6.0 db file parallel write 4 0 0 0 4.0 PX Idle Wait 4,316 4,287 877,818 2034 ###### SQL*Net message from client 2,213 0 27,615 125 ###### SQL*Net message to client 2,214 0 1 0 ###### ------------------------------------------------------------- -----------> Total Wait Time 907997 cs
Real Total Wait Time • Remove idle events • MetaLink Note:191103.1 PQ Wait Events • STATSPACK report should filter out idle events • Database Performance Guide and Ref 9.0.1 • Explains more about this ‘feature’
Real Total Wait Time Wait Events for DB: BHDATA04 Instance: BHDATA04 Snaps: 4426 -4427 -> cs - centisecond - 100th of a second -> ms - millisecond - 1000th of a second -> ordered by wait time desc, waits desc (idle events last) Avg Total Wait wait Waits Event Waits Timeouts Time (cs) (ms) /txn ---------------------------- ------------ ---------- ----------- ------ ------ PX Deq: Execution Msg 335 0 1,156 35 335.0 <----- remove latch free 1,740 1,046 522 3 ###### control file parallel write 88 0 329 37 88.0 db file sequential read 253 0 323 13 253.0 log file parallel write 2 0 42 210 2.0 enqueue 61 0 40 7 61.0 refresh controlfile command 602 0 37 1 602.0 PX Deq: Msg Fragment 64 0 29 5 64.0 PX Deq: Parse Reply 59 0 26 4 59.0 control file sequential read 2,673 0 16 0 ###### log file sync 1 0 15 150 1.0 PX Deq: Signal ACK 2 2 11 55 2.0 <----- remove PX Deq: Join ACK 57 0 3 1 57.0 PX Deq: Execute Reply 5 0 3 6 5.0 SQL*Net more data to client 37 0 0 0 37.0 file open 6 0 0 0 6.0 db file parallel write 4 0 0 0 4.0 PX Idle Wait 4,316 4,287 877,818 2034 ###### <----- remove SQL*Net message from client 2,213 0 27,615 125 ###### <----- remove SQL*Net message to client 2,214 0 1 0 ###### ------------------------------------------------------------- ----------> Total Wait Time 1397 cs • Idle Events
Total Response Time • Total CPU Time + Total Wait Time • 27475 cs + 1397 cs = 28872 cs • Total Wait Time • 1397/28872 = 0.05 • 5% of Total Response Time • Wait Time was never an issue! • If you don’t remove the idle events • 907997/(27475+907997) = 97%
Bad SQL Rules • Slow data load time • Time due to Bad SQL • Time due All Others • Including Wait Events
Case 4) Data Load Time Varies • Data Load Time • Normal 6.5 hours, Long 16 hours • Varies randomly, no pattern • Generate STATSPACK report • Normal, Long • Compare reports • Look for differences between reports • Tablespace IO Stats Section • Normal -- 11 tablespaces accessed • Long -- 74 tablespaces accessed
Problem and Solution • Vendor data load shouldn’t touch all tables • What process would access all tables? • Production db supported by another group • We aren’t allowed to connect as ‘oracle’ • Can’t see what they might be running (cron?) • Turns out • Production DBAs decided we needed full exports • We weren’t notified • Stop the exports, performance issue goes away!
What About Cache Hit Rates? • Back to the subject of experts • Remember when it was cool to discuss hit rates? • For Case 4) • Compute buffer cache hit ratio from tables • Tables larger than physical memory • Can’t have all pages in memory at once • Buffer cache hit ratio won’t be 100% • Even if we had 100% • Bad SQL (index) was the real problem • Buffer cache hit ratio wasn’t relevant
Select Buffer Cache Hit Ratio • Data Load without Exports running
Total Wait Time? • For Case 4), 15 minute report interval • Wait Time is 28% of total time Instance Activity Stats for DB: BHDATA01 Instance: BHDATA01 Snaps: 13817 -1381 Statistic Total per Second per Trans --------------------------------- ---------------- ------------ ------------ CPU used by this session 1,422,921 1,556.8 154.4 CPU used when call started 2,237,915 2,448.5 242.8 <----- Top 5 Wait Events ~~~~~~~~~~~~~~~~~ Wait % Total Event Waits Time (cs) Wt Time --------------------------------------- ------------ ------------ ------- PX Deq: Table Q Normal 495,566 1,017,115 30.12 <-- remove slave wait 25,580 744,985 22.06 PX Deq Credit: send blkd 406,262 678,556 20.10 <-- remove PX Deq: Execution Msg 33,009 438,309 12.98 <-- remove latch free 42,575 134,840 3.99 ------------------------------------------------------------- 879825 cs total time = 2237915 + 879825 = 3117740 Wait time is 879825/3117740 = 28%
Review • For the 4 issues we had, STATSPACK output • Was useful for all 4 • Provided standard set of data for all involved • Fixed 2 issues • Provided the data that led to the root cause • Verified that the fix was working • Performance improvements were substantial • Tuning process much faster with STATSPACK • Same process worked for all 4 issues • Decided not to look for further improvements • Wait Time analysis might be useful...
oraperf.com Analyzer • Website oraperf.com • Submit STATSPACK report • Analyzer reviews report • Generates detailed analysis • CPU time • Wait time • Gives specific advice • Not perfect, but it is fast and free! • Has same issues with idle wait events as STATSPACK report
oraperf.com • Who or what is oraperf.com? • From the website... Oraperf.com is run by Anjo Kolk. Anjo has worked for over 16 years at Oracle (1985-2001). While at Oracle he worked in different countries and different departments. Many people generate utlbstat/utlestats and statspack reports, but don't know how to interpret the data. People that do look at these are reports also mostly looking at the wrong information and end up making the wrong tuning decisions. That is why the reports are analyzed based on the YAPP method. The YAPP method will show what component of the total response time should be tuned first. YAPP-Method -- Yet Another Performance Profiling Method http://oraperf.com/download/yapp_anjo_kolk.pdf
oraperf.com -- Case 3) • Upload report from slow data load • Analyzer shows • Response time • 91.63% CPU Time • 8.37% Wait Time • Advice? • Reduce the number of buffer gets or executions • Wait time • Matters only as a % of total response time
oraperf.com -- Case 3) • Upload report from fast data load • Analyzer shows • Response time • 5.16% CPU Time • 94.84% Wait Time • Advice? • Tune PX Deq: Execution Msg event • But this is an idle event... • Non-idle wait time is only about 25% total time
oraperf.com -- Case 3) • Conclusion • oraperf.com analyzer provides another tool for performance tuning • Well worth using if only to compute • Response Time • CPU Time • Wait Time • Check for idle wait events...
No Excuses • Install STATSPACK • Generate two snapshots • Generate standard report • Upload to oraperf.com • Review advice • Fast, free performance analysis!
Installing STATSPACK • Create separate tablespace • Create PERFSTAT user • Execute SQL script to create tables • Setup job to execute snapshots • Setup process to purge data over time • Set timed_statistics = TRUE • Not required, but needed to get wait time data
Installing STATSPACK • As user ‘SYS’ create tablespace perfstat datafile '/xxx/xxx/perfstat_01.dbf' size 500M; cd $ORACLE_HOME/rdbms/admin sqlplus sys @spcreate.sql Enter value for default_tablespace: perfstat Enter value for temporary_tablespace: temp
Generate Standard Report • Report SQL supplied by Oracle sqlplus perfstat/perfstat@<SID> execute statspack.snap execute statspack.snap sqlplus perfstat/perfstat@<SID> @$ORACLE_HOME/rdbms/admin/spreport.sql Enter value for begin_snap: 1 Enter value for end_snap: 2 Enter value for report_name: testing
Select STATSPACK Data • Query the tables directly select to_char(snap_time,'yyyy-mm-dd HH24') mydate, new.name buffer_pool_name, (((new.consistent_gets-old.consistent_gets)+ (new.db_block_gets-old.db_block_gets))-(new.physical_reads-old.physical_reads)) / ((new.consistent_gets-old.consistent_gets)+ (new.db_block_gets-old.db_block_gets)) bhr from perfstat.stats$buffer_pool_statistics old, perfstat.stats$buffer_pool_statistics new, perfstat.stats$snapshot sn where new.snap_id > 13125 and new.snap_id < 13149 and new.name = old.name and new.snap_id = sn.snap_id and old.snap_id = sn.snap_id-1; Based on SQL from ORACLE High-Performance Tuning with STATSPACK Donald K. Burleson Oracle Press ISBN 0-07-213378-3
Buffer Cache Hit Ratio Case 4) yr. mo dy Hr BUFFER_POOL_NAME BHR ------------- -------------------- ----- 2002-09-17 02 DEFAULT 1.00 2002-09-17 02 DEFAULT 1.00 2002-09-17 02 DEFAULT .94 2002-09-17 02 DEFAULT .87 2002-09-17 03 DEFAULT .90 2002-09-17 03 DEFAULT .93 2002-09-17 03 DEFAULT .97 2002-09-17 03 DEFAULT .82 2002-09-17 04 DEFAULT .78 2002-09-17 04 DEFAULT .78 2002-09-17 04 DEFAULT .76 2002-09-17 04 DEFAULT .80 2002-09-17 05 DEFAULT .69 2002-09-17 05 DEFAULT .73 2002-09-17 05 DEFAULT .76 2002-09-17 05 DEFAULT .69 2002-09-17 06 DEFAULT .81 2002-09-17 06 DEFAULT 1.00 2002-09-17 06 DEFAULT 1.00 2002-09-17 06 DEFAULT 1.00 2002-09-17 07 DEFAULT 1.00 2002-09-17 07 DEFAULT 1.00 2002-09-17 07 DEFAULT 1.00 • Output of SQL on previous slide
Space Used • Snapshot size varies with • Number of tablespaces • Number of SQL statements captured • Db1 21 tablespaces --> 0.15 Mb/snapshot • Db2 376 tablespaces --> 0.37 Mb/snapshot • Assuming a snapshot every 15 minutes • 96 snapshots per day • Db1 --> 14.4 Mb/day • Db2 --> 35.6 Mb/day
Removing Snapshot Data • Oracle supplied SQL • SQL removes snapshot data for a range of snapshot id numbers • Example sqlplus perfstat/perfstat@<SID> @$ORACLE_HOME/rdbms/admin/sppurge ... ... (listing of all existing snapshots) ... Specify the Lo Snap Id and Hi Snap Id range to purge ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Enter value for losnapid: 4001 Using 4001 for lower bound. Enter value for hisnapid: 5000 Using 5000 for upper bound. Deleting snapshots 4001 - 5000. commit; Note: large deletes may fill rollback segments
Summary • STATSPACK • Free, easy to install, easy to run • Output can be very useful or confusing • Real-world use has resulted in big performance gains • Useful for all instances • Standard way to gather performance data
Reminder – please complete the OracleWorld session surveyThank you.