RDBMS Forensics: Troubleshooting Using ASH Session UGF-9740 Tim Gorman EvDBT.com

RDBMS Forensics: Troubleshooting Using ASHSession UGF-9740 Tim Gorman EvDBT.com

Speaker Qualifications • Co-author… • “Oracle8 Data Warehousing”, 1998 John Wiley & Sons • “Essential Oracle8i Data Warehousing”, 2000 John Wiley & Sons • “Oracle Insights: Tales of the Oak Table”, 2004 Apress • “Basic Oracle SQL” 2009 Apress • “Expert Oracle Practices: Database Administration with the Oak Table”, 2010 Apress • 29 years in IT… • “C” programmer, UNIX system administrator since 1984 • Consultant and technical consulting manager at Oracle (1990-1998) • Independent consultant (www.EvDBT.com) since 1998 • Board of Rocky Mountain Oracle Users Group (www.RMOUG.org) since 1995 • Board of Oracle Developers Tools Users Group (www.ODTUG.com) since 2013 • Oak Table network (www.OakTable.net) since 2002 • Oracle ACE since 2007, Oracle ACE Director since 2012 • More… • Board of directors of Project SafeGuard (www.PSGHelps.org), since 2011

Agenda • When someone complains of a performance problem • The problem is not occurring right this minute • So, what can we do to find performance problems, even outright failure conditions, that have occurred in the past? • Where do we begin? • Where do we look, what tools? • How do we decide where to go next? • Two case studies as illustration…

First case study… • Informatica jobs reporting Oracle error (something about “connection lost contact”) • Obtaining the error messages from the Informatica logs showed ORA-03135 • Searching for that error message in MyOracleSupport (MOS) yields… • Note #730066.1 (entitled “Diagnosis of ORA-3135/ORA-3136 Connection Timeouts when the Fault is in the Database”)

Researching in MOS… • Note #730066.1 indicates that ORA-03135 (“connection lost contact”) is returned to the client-side of the database connection (i.e. Informatica PowerMart, in this case) • It also mentions that on the database side, error ORA-03136 (“inbound connection timed out”) is recorded to the database alert.log file as well…

Researching in MOS… • Note #730066.1 goes on to explain the sequence of SQL commands executed when establishing a database connection • Each of these SQL statements has to be parsed, executed, and fetched • Thus presenting the possibility of a “hang” if there is any kind of contention • Also, mentions two Shared Pool bugs with symptoms including ORA-03135/3136 errors • Almost goes without saying… 

Researching in MOS… • Note #730066.1 then goes on to suggest three different methods for diagnosing “hangs” while establishing database connections… • From a separate SYSDBA session in SQL*Plus, take three SYSTEMSTATE dumps each 90 seconds apart while the problems are happening • Examine Active Session History (ASH) information for the 15 minutes leading up to the occurrence of the error • Continuous monitoring of the contents of V$LATCHHOLDER while the problems are happening

Researching in MOS… • Also, note #465043.1 (entitled “Troubleshooting ORA-3136 WARNING Inbound connection timed out”) is useful • Further describing the error as a timeout specified by the SQL*Net parameter SQLNET.INBOUND_CONNECT_TIMEOUT (defaults to 60 secs) • Introduced in Oracle10g to prevent Denial Of Service (DoS) attacks on the database

Researching on Google… • Searching for keywords “oracle ora-3135 ora-3136” seems to mostly yield articles that repeat the following advice posted to an Oracle TechNet (OTN) discussion forum… It's a warning. by default there would be some time in seconds, usually a connection from client will get connected to DB Server with in prescribed time limit. If the connection does not happen then you can find the warning messages. In order to handle this you can include some parameters in sqlnet.ora and listener.oraExample :- sqlnet.oraSQLNET.INBOUND_CONNECT_TIMEOUT = 120 In the listener.oraINBOUND_CONNECT_TIMEOUT_LISTENER = 110

Research summary • Connections taking longer to establish than the timeout specified by SQL*Net timeout parameters • On both instances of a 2-node RAC cluster, we have already set all these SQL*Net parameters to 300 secs (5 mins) • Would it be reasonable to set these higher? • Consider: • timeouts tend to turn a failure condition into a performance problem • longer timeouts show as bigger performance problems…

Research summary • ORA-03136 errors only showed up in the alert.log file of RAC02 • No error messages found in RAC01 ... Mon Oct 4 22:05:44 2010 SUCCESS: diskgroup D3098_ORAARCH was dismounted Mon Oct 4 22:07:05 2010 WARNING: inbound connection timed out (ORA-3136) Mon Oct 4 22:07:05 2010 WARNING: inbound connection timed out (ORA-3136) ...

Active Session History (ASH) ASH is a slightly different take from other information in the Automatic Workload Repository (AWR) • All AWR information is based on real-time statistics maintained in memory-based V$ performance views • V$ performance views come in two basic flavors… • Cumulative instance-level statistics • Cumulative session-level statistics for currently-active sessions only • AWR is the long-term repository for the information in the V$ performance views • Only for the cumulative instance-level statistics… • …session-level statistics are not retained past the end of the session…

Active Session History (ASH) • Why is session-level data valuable? • Instance-level data is aggregated almost to the point of being meaningless • AWR data aggregated to an hour, instance-wide, by default • Good for finding “Top N Worst” SQL and general trends lasting more than an hour • Bad for finding anomalies, incidents happening in seconds or minutes • Session-leveldata shows more detailed information from sampled session data • Includes information about the identity of the user or program, giving some idea of the context of a problem

Active Session History (ASH) • Circular buffer is intended to store about an hour of active session information • MMNL process scans V$SESSION info in shared memory every second for “active” session information, flushes to ASH buffers in shared memory • Parameter “_ash_sampling_interval”, defaults to 1000 millisecs (1 sec) • MMNL process also periodically flushes V$ view info to AWR table • Parameter “_ash_disk_filter_ratio”,defaults to 10 (seconds) • If ASH buffer becomes 66% full, then MMNL process flushes to AWR between snapshots • Parameter “_ash_eflush_trigger”, defaults to 66 (percent) • How long data is retained in V$A_S_H is clearly dependent on workload • Historical ASH data retained in AWR tables under all circumstances • Intra-snapshot flushes provide a valuable trail of evidence in dire situations

V$ACTIVE_SESSION_HISTORY • View is too wide to display here, but information includes… • Timestamp • User ID, session ID, type, and status • Program, module, action • SQL ID, SQL plan hash value, SQL opcode • PL/SQL module info (entry point, current procedure) • Parallel execution query coordinator (QC) information • Blocking session information, if waiting on enqueue • Wait event information with parameters

DBA_HIST_ACTIVE_SESS_HISTORY • Contains same columns as V$ACTIVE_SESSION_HISTORY, plus… • AWR snapshot ID • DBID and instance number • A few more columns, though fewer rows, but data retained much longer (default: 1 week) than the V$ view

Back to the case study… • ORA-03136 errors are occurring on RAC02 only • We focus on two “groupings” of errors a couple hours apart (jump to Listing01.rtf) (then, jump to Listing02.rtf)

Lots of “concurrency” class waits… • Remembering that the long-term retained ASH information in the AWR view represents 10-second samples (instead of 1-second samples) • Filter out the I/O waits and PX-related waits… • …notice the concurrency waits… • enq: SQ – contention • row cache lock • gc current block 2-way • gc buffer busy • gc current request • latch: row cache objects

Waits on sequence cache refresh? • Drilling down on the event “enq: SQ – contention”… • This event means that a sequence is “refreshing” its next set of sequence values from the data dictionary table SYS.SEQ$ • Note the P2 value of “144”? • According to the view V$EVENT_NAME, parameter P2 on the wait event contains an OBJECT_ID value • Translating OBJECT_ID = 144 using the DBA_OBJECTS view in the RAC database shows… SQL> select owner, object_name, object_type from dba_objects where object_id = 144; OWNER OBJECT_NAME OBJECT_TYPE ------ --------------- ------------------- SYS AUDSES$ SEQUENCE • This sequence is used to generate values for the column AUDSES in the V$SESSION view when a database session is created…

Waits on the row cache? • Drilling down on the event “row cache lock”… • This event indicates problems modifying data values in the data dictionary cache (a.k.a. “row cache”) within the Shared Pool of the SGA • Note the P1 value of “13”? • According to view V$EVENT_NAME, parameter P1 is the “cache id” • So, looking in V$ROWCACHE where CACHE# = 13, we see… SQL> select parameter from v$rowcache 2 where cache# = 13; PARAMETER -------------------------- dc_sequences • Hmmm… are we seeing a pattern? • ..but wait – there’s more!!!

Waits on RAC “cache fusion”? • Drilling down on the event “gc current request”… • This event is posted when an instance must obtain the most current copy of a block buffer from another RAC instance… • According to the view V$EVENT_NAME, parameter P1 is file# and P2 is block# • In the ASH query, we saw a *lot* of these events where P1=1 and P2=180… • Looking in the DBA_EXTENTS view, we can find the database object in which this contended-for block resides… SQL> select owner, segment_name from dba_extents 2 where file_id = 1 and 3 180 between block_id and (block_id + (blocks - 1)); OWNER SEGMENT_NAME -------------- ---------------------- SYS SEQ$ • …it just keeps piling up, doesn’t it?…

Poorly-cached sequences? • Let’s find what sequences might need caching… SQL> select sequence_owner, sequence_name, cache_size, last_number 2 from dba_sequences 3 where cache_size< 20 4 and (nvl(min_value,0) + last_number) / increment_by>= 10000 5 order by (nvl(min_value,0) + last_number) / increment_bydesc; OWNER SEQUENCE_NAME CACHE_SIZE LAST_NUMBER -------- ------------------------------ ---------- ----------- SYS DBMS_LOCK_ID 10 1073741891 SYS IDGEN1$ 10 18987301 SYS WRI$_ADV_SEQ_MSGGROUP 10 6428319 PROD JL_SEQ 0 2344665 SYS ORA_TQ_BASE$ 0 1895033 SYS WRI$_ALERT_SEQUENCE 10 1180544 MDSYS TMP_COORD_OPS 0 1000000 PROD DL_SEQ 0 964612 PROD RUN_ID_SEQ 0 187744 SYS SCHEDULER$_INSTANCE_S 10 166467 SYS WRI$_ADV_SEQ_TASK 10 151284 SYS OBJECT_GRANT 10 139382 PROD DF_SEQ 0 53402

Poorly-cached sequences? • The sequence SYS.AUDSES$ is not poorly cached… • CACHE_SIZE = 100 (jump to Listing03.rtf – oops!!!) • But all of the other poorly-cached sequences affect the performance of any sequence • They’re all housed within the same database block in the SEQ$ table

RAC contention on a block? • Does the AUDSES$ sequence reside within block #180 on file #1? SQL> set serveroutput on SQL> declare 2 v_ridrowid; 3 v_type number; 4 v_obj number; 5 v_rfno number; 6 v_bno number; 7 v_rno number; 8 begin 9 select rowid into v_rid from sys.seq$ where obj# = 144; 10 dbms_output.put_line('v_rid = "'||v_rid||'"'); 11 dbms_rowid.rowid_info(v_rid, v_type, v_obj, v_rfno, v_bno, v_rno); 12 dbms_output.put_line('v_type = "'||v_type||'"'); 13 dbms_output.put_line('v_obj = "'||v_obj||'"'); 14 dbms_output.put_line('v_rfno = "'||v_rfno||'"'); 15 dbms_output.put_line('v_bno = "'||v_bno||'"'); 16 dbms_output.put_line('v_rno = "'||v_rno||'"'); 17 end; 18 / v_rid = "AAAABEAABAAAAC0AAE" v_type = "1" v_obj = "68" v_rfno = "1“  file #1 v_bno = "180“  block #180 v_rno = "4"

RAC contention on a block? • Using the same PL/SQL block to translate a ROWID into file# and block#... • it turns out that all of these busy sequences with low or disabled caching are in file# = 1 and block# = 180… • Now, do the waits on “gc buffer busy”,“gc current block 2-way”, and “gc current request” make more sense? • We’re experiencing high demand for slowly-serviced sequences, due to lack of caching… • Causing inter-instance contention • Since RAC02 is the less-busy of the two-node RAC environment… • …it spends more time waiting because RAC01 has the current image of these blocks

Taking all the findings together… • Uncached or poorly-cached sequences are resulting in lots of I/O to one block in the SYS.SEQ$ data dictionary table • Thus waits on event “enq: SQ – contention”… • This causes contention within the Row Cache of the Shared Pool • Thus waits on “row cache lock” on the section of the Row Cache storing sequence information… • The less-utilized RAC instance (RAC02) frequently has to wait for a current image of that block to be shipped over from the heavily-utilized RAC instance (RAC01) • Thus waits on “gc” events…

…to reach a conclusion • The resolution for the ORA-03135/3136 turns out to be very simple… • Increase caching on all under-cached sequences • Increased sequence caching reduces physical I/O to/from the SYS.SEQ$ table, which reduces inter-instance RAC contention as well (gc waits), which eliminates starvation by lesser-used instance

Resolution SQL> select 'alter sequence '||sequence_owner||'.'||sequence_name|| 2 ' cache 100 /* prev_cache_size='||cache_size||', last#='||last_number||' */ ;' 3 from dba_sequences 4 where cache_size <= 20 5 and (nvl(min_value,0) + last_number) / increment_by>= 10000 6 order by (nvl(min_value,0) + last_number) / increment_bydesc; CMD -------------------------------------------------------------------------------------------- alter sequence SYS.DBMS_LOCK_ID cache 100 /* prev_cache_size=20, last#=1073741891 */ ; alter sequence SYS.IDGEN1$ cache 100 /* prev_cache_size=20, last#=18987301 */ ; alter sequence SYS.WRI$_ADV_SEQ_MSGGROUP cache 100 /* prev_cache_size=10, last#=6428319 */ ; alter sequence PROD.JL_SEQ cache 100 /* prev_cache_size=0, last#=2344665 */ ; alter sequence SYS.ORA_TQ_BASE$ cache 100 /* prev_cache_size=0, last#=1895045 */ ; alter sequence SYS.WRI$_ALERT_SEQUENCE cache 100 /* prev_cache_size=20, last#=1180544 */ ; alter sequence MDSYS.TMP_COORD_OPS cache 100 /* prev_cache_size=0, last#=1000000 */ ; alter sequence PROD.DL_SEQ cache 100 /* prev_cache_size=0, last#=964612 */ ; alter sequence PROD.RUN_ID_SEQ cache 100 /* prev_cache_size=0, last#=187744 */ ; alter sequence SYS.SCHEDULER$_INSTANCE_S cache 100 /* prev_cache_size=20, last#=166467 */ ; alter sequence SYS.WRI$_ADV_SEQ_TASK cache 100 /* prev_cache_size=10, last#=151284 */ ; alter sequence SYS.OBJECT_GRANT cache 100 /* prev_cache_size=20, last#=139402 */ ; alter sequence PROD.DF_SEQ cache 100 /* prev_cache_size=0, last#=53402 */ ; 13 rows selected.

Another case study… • Another two-node RAC environment… • Supporting OBIEE 10.1.3.1.2 • App-server (NQServer.exe) supporting Hyperion users encountering ORA-03135 • Numerous ORA-03136 found frequently in bothdatabase instance “alert.log” files

…which at first seems similar… • Similar alert log entries in both instances, at different times… Mon Oct 31 05:51:07 2011 WARNING: inbound connection timed out (ORA-3136) Mon Oct 31 05:51:07 2011 WARNING: inbound connection timed out (ORA-3136) Mon Oct 31 05:51:07 2011 WARNING: inbound connection timed out (ORA-3136) Mon Oct 31 05:51:07 2011 WARNING: inbound connection timed out (ORA-3136) Mon Oct 31 05:51:07 2011 WARNING: inbound connection timed out (ORA-3136)

…but ultimately seems quite different… SQL> select event, 2 count(*) cnt, 3 sum(decode(session_state, 'ON CPU', wait_time, time_waited))/1000000 time_waited 4 from dba_hist_active_sess_history 5 where sample_time between to_date('31-OCT-2011 05:50:30','DD-MON-YYYY HH24:MI:SS') 6 and to_date('31-OCT-2011 05:51:10','DD-MON-YYYY HH24:MI:SS') 6 group by event 7 order by time_waiteddesc; EVENT CNT TIME_WAITED ------------------------------ -------- ------------ latch free 277 81 latch: shared pool 153 44 latch: library cache 45 13 reliable message 1 1 12 0 DBMS_LDAP: LDAP operation 1 0

Research… • Bearing in mind that this is the AWR view… • Not the memory-based V$ view… • With 10-second samples, not 1-second samples… • Latch contention… • “latch free” • “latch: shared pool” • “latch: library cache” • No RAC-related events • Nothing to do with sequences either…

Research (cont’d)… SQL> select name, parameter1, parameter2, parameter3 2 from v$event_name 3 where name in ('latch free','latch: shared pool', 4 'latch: library cache'); NAME PARAMETER1 PARAMETER2 PARAMETER3 -------------------- ------------- ------------- ------------ latch: shared pool address number tries latch: library cache address number tries latch free address numbertries

Research (cont’d)… SQL> select h.event, l.name, count(*) cnt, 2 sum(decode(h.session_state,'ON CPU',h.wait_time,h.time_waited))/1000000 secs 3 from dba_hist_active_sess_history h, v$latchname l 4 where h.sample_time between to_date('31-OCT-2011 05:50:30','DD-MON-YYYY HH24:MI:SS') 5 and to_date('31-OCT-2011 05:51:10','DD-MON-YYYY HH24:MI:SS') 6 and h.event in ('latch free','latch: shared pool','latch: library cache') 7 and l.latch# = h.p2 8 group by h.event, l.name 9 order by secsdesc; EVENT NAME CNT SECS ---------------------- --------------------------------------- -------- ------------ latch free parameter table allocation management 174 51 latch: shared pool shared pool 153 44 latch free user lock 90 26 latch: library cache library cache 45 13 latch free active checkpoint queue latch 12 4 latch free kokc descriptor allocation latch 1 0

Research (cont’d)… • parameter table allocation management • Automatic SGA management (10.1  10.2) • kokc descriptor allocation latch • Related to high concurrency in XML/XDB (10.2.0.4+) • user lock • Acquired when cleaning dead user processes or dropping a schema user (from note on MOS) • active checkpoint queue latch • Database full/incremental checkpoint in progress • shared pool • Synchronizes changes to Shared Pool in SGA • library cache • Synchronizes changes to sub-component of Shared Pool (hard parsing?)

Research summary • Not RAC-related • Not sequence-caching related • Not hard-parsing (i.e. no bind variables) • Checkpointing? LGWR problems? • Process cleanup? PMON issues? • Auto SGA management?

Automatic SGA Management • SGA_TARGET and SGA_MAX_SIZE • Both set to 20G • DB_CACHE_SIZE, SHARED_POOL_SIZE, LARGE_POOL_SIZE, JAVA_POOL_SIZE, STREAMS_POOL_SIZE, and LOG_BUFFER all unset • displays as “0” • Has auto SGA management been particularly active? • …bingo!... (jump to Listing04.rtf and Listing05.rtf)

Dampening the thrashing… • When SGA_TARGET > 0, then the meaning of the standard SGA sizing parameters changes • Instead of specifying a static size… • …they become a “floor” or minimum value • So for example, setting SHARED_POOL_SIZE = 6G when SGA_TARGET > 0 means… • Shared Pool can still be automatically resized larger than 6G, but not smaller…

Resolution for the 2nd case study • Dampen the thrashing between the Buffer Cache and the Shared Pool • If Buffer Cache is shrunk or constrained, then more cache misses and physical read I/O is the result • If Shared Pool is shrunk or constrained, then more latch contention is the result • Based on past history of auto re-sizing, set a floor value for the Shared Pool ALTER SYSTEM SET SHARED_POOL_SIZE = 3584M;

Summary • Although primarily regarded as a performance tuning tool • ASH is also a great diagnostic tool • If you understand its limitations • Database forensics • Understand what is recorded and stored within AWR • Understand how it can be used

COLLABORATE14.IOUG.ORG April 7 – 11, 2014The Venetian Resort, Las Vegas, NV • Network with 5,000+ database and application professionals • 5 days of in-depth education built by users, for users • Complimentary Pre-Conference Workshops for IOUG registrants Attend for free! IOUG’s Call for Speakers is now open collaborate14.ioug.org/call-for-speakers

Training Close to Home IOUG Master Class Series Returns Coming soon to a city near you! One Day Interactive Training Events to meet your educational needs Communities • Maximum Availability Architecture with Oracle’s Larry Carpenter and Joe Meeks • October 15: Milwaukee, WI • October 16: Rochester, NY • Performance and Internals with Craig Shallahamer November 6: Chicago, ILNovember 8: Atlanta, GA • Get more information at www.ioug.org/masterclass • or visit the IOUG kiosk in Moscone West, 2nd floor

Come See IOUG at the User Group Pavilion Win a Free Registration to COLLABORATE 14! Stop by the IOUG kiosk in the User Group Pavilion in Moscone West, 2nd floor, to pick up a lucky poker chip and see if you are a winner of a free registration to COLLABORATE 14 in Las Vegas!

www.mhprofessional.com/dbsecUse Code: db12c Complimentary eBook Register Now

Thank You! tim@evdbt.com RDBMS Forensics: Troubleshooting Using ASH Session UGF-9740

RDBMS Forensics: Troubleshooting Using ASH Session UGF-9740 Tim Gorman EvDBT.com

RDBMS Forensics: Troubleshooting Using ASH Session UGF-9740 Tim Gorman EvDBT.com

Presentation Transcript

Tim Gorman Evergreen Database Technologies, Inc. EvDBT

RDBMS-based GIS

Gorman Paving

Tuning PL/SQL procedures using DBMS_PROFILER 20-August 2009 Tim Gorman Evergreen Database Technologies, Inc.

Understanding RDBMS

RDBMS

RDBMS Fundamentals: Indexing

Representation of Trust Model using RDBMs

DBMS, RDBMS Concepts

RDBMS for CDB ?

ADABAS to RDBMS Using NatQuery

QC Your RDBMS Data Using Dictionary Tables

Tim Gorman Evergreen Database Technologies, Inc. EvDBT

Troubleshooting using the DTE-port

Tim Gorman - Delphix

Tim Gorman Evergreen Database Technologies, Inc. EvDBT

RDBMS

Practical Active Session History (ASH)