Diagnosing the “Bottlenecks” in your Streams Environment

Presentation Transcript


  1. Diagnosing the “Bottlenecks” in your Streams Environment Brian Keating Chris Lawson May 17, 2007

  2. What’s our Goal for Today? “We will walk you through some of the complications of Streams and show you how to deal with them.”

  3. Our Agenda • Overview of Streams Replication • What are the parts of Streams? • How does it work? • Troublesome Parts of Streams • Troubleshooting techniques • A fun trivia question • Types of bottlenecks & how to resolve them • Some helpful SQL scripts • Questions?

  4. A Few Caveats • Unless otherwise stated, our examples assume Oracle 10g, release 2. • Streams set-up is not the real problem; thus we focus on monitoring and troubleshooting. • Set-up is documented in the Oracle 10g Release 2 Concepts & Admin Guide. • Also, “conflict resolution” is a complex subject of its own, and will not be covered.

  5. Is Streams Difficult or Easy? How difficult is it to figure out all the pieces, & keep Streams working well? Here’s one opinion . . .

  6. Gosh, It’s Really Easy! Your Life will be so Peaceful “You don’t need rocket scientists on your staff!” * * Marketing executive, on the simplicity of Streams/CDC.

  7. Here’s another point of view . . .

  8. Streams is like a Whirlpool “Is Streams really a proven product?” * DBA * Programmer, noting the numerous bug fixes.

  9. Streams Replication: A View from 10,000 Feet • Streams propagates both DML & DDL changes to another database (or elsewhere in the same database). • Streams is based on LogMiner, which has been around for years. • You decide, via “rules,” which changes are replicated. • You can optionally “transform” data before you apply it at the destination.

  10. Overview of Streams: Three Main Processes Capture → Propagate → Apply (the c0nn, j0nn, and a0nn background processes, respectively). Note: “p” processes are also used in capture & apply.

  11. What’s in the Redo Log? Redo logs → LogMiner processing → LCRs • LogMiner continuously reads DML & DDL changes from redo logs. • It converts those changes into logical change records (LCRs). • There is at least 1 LCR per row changed. • The LCR contains the actual change, as well as the original data.

  12. What’s in the Redo Log? • Recall that the redo log buffer is flushed to disk whenever any session commits—not just your own. This means that your changes will often be captured and propagated (but not applied) even though you haven’t committed!

  13. The Capture Process: Capture Change (Capture Change → Generate LCR → Enqueue Message) • The capture change step reads changes from the redo logs. • All changes (DML and DDL) in the redo logs are captured – regardless of whether Streams is configured to propagate any given change. • Observe the “capture_message_number” value (in v$streams_capture); it regularly increments – even if there are no application transactions, as the query below illustrates.
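
A quick way to watch this (a minimal sketch against v$streams_capture; run it a couple of times, a few seconds apart, and capture_message_number should keep climbing even when the application is idle):

select capture_name, state, capture_message_number, total_messages_captured
from v$streams_capture;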

  14. The Capture Process: Generate LCR (Capture Change → Generate LCR → Enqueue Message) • First, examine the change & determine whether Streams is configured to handle that change or not. • If so, convert it into one or more “logical change records”, or “LCRs”. • An LCR only contains changes for one row. DML statements that affect multiple rows will cause multiple LCRs to be generated. • So, if one statement updates 1,000 rows, at least 1,000 LCRs will be created!

  15. The Capture Process: Enqueue Message (Capture Change → Generate LCR → Enqueue Message) • This step places previously-created LCRs onto the capture process’s queue. • Oracle uses the “q00n” background workers & the QMNC (Queue Monitor Coordinator) process.

  16. The Capture Process: Enqueue Message Example of the queue background processes:

Select Program, Sid
From V$session
Where Upper(program) Like '%(Q%'
And Osuser = 'Oracle'
Order By 1;

PROGRAM                                         SID
--------------------------------------------  -----
oracle@ny1-server_NY-isb-sa (QMNC)              471
oracle@ny1-server_NY-isb-sa (q000)              456
oracle@ny1-server_NY-isb-sa (q001)              936
oracle@ny1-server_NY-isb-sa (q002)             1366
oracle@ny1-server_NY-isb-sa (q003)              261
oracle@ny1-server_NY-isb-sa (q004)              744

  17. The Propagate Process (Source Queue → Propagate Process → Destination Queue) • Streams copies LCRs from the source queue to the destination queue. • These transfers are done by the j00n background processes. • How do we know it’s the “j” processes? • Let’s look at session statistics and see who’s doing all the db link communicating.

  18. The Propagate Process: Who’s doing the Sending?

Select Sid, Name, Value
From V$sesstat One, V$statname Two
Where One.Statistic# = Two.Statistic#
And Name Like '%Link%'
And Value > 1000;

  SID NAME                                          VALUE
----- -------------------------------------------- ----------
  467 bytes sent via SQL*Net to dblink             7.3870E+10
  947 bytes sent via SQL*Net to dblink             2.5478E+10
  467 bytes received via SQL*Net from dblink        113698397
  947 bytes received via SQL*Net from dblink         47786284
  467 SQL*Net roundtrips to/from dblink               2661410
  947 SQL*Net roundtrips to/from dblink               1179795

SIDs 467, 947 are indeed the “j00” processes.

  19. The Propagate Process: What’s in the Queue? Sample of the queue content:

Select Queue_Name, Num_Msgs
From V$Buffered_Queues;

QUEUE_NAME                 NUM_MSGS
-------------------------  --------
APP_NY_PROD_Q                   544
CAP_LA_PROD_Q                  1640

  20. The Apply Process: Queue and Dequeue • LCRs usually wait in the destination queue until their transaction is committed or rolled back. • There are some exceptions, due to the queue “spilling” (covered later). • The LCRs also remain on the capture queue until they are applied at the destination.

  21. Commits & Rollbacks • When a transaction commits, Streams sends over a single commit LCR. • After the destination applies the LCRs, it sends back a confirming LCR, so that the capture queue can remove those LCRs. • For a rollback, the source sends 1 rollback LCR for each LCR in that transaction. And now to the troubleshooting . . .

  22. Trivia Question Question: how does Streams avoid “infinite loop” replication? For example, after a change is applied at a destination database, why doesn’t Streams “re-capture” that change, and then propagate it back to the original source database? The answer to this question will be provided later in the presentation.

  23. LCR Concepts An LCR is one type of Streams message – the type that is formatted by the Capture process. Two types of LCRs: DDL LCRs and row LCRs. A DDL LCR contains information about a single DDL change. A row LCR contains information about a change to a single row. As a result, transactions that affect multiple rows will cause multiple LCRs to be generated.

  24. LCR Concepts: LOB Issues Tables that contain LOB datatypes can cause multiple LCRs to be generated per row affected! Inserts into tables with LOBs will always cause multiple LCRs to be generated per row. Updates to those tables might cause multiple LCRs, if any of the LOB columns are updated. Deletes will never generate multiple LCRs – deletes always generate 1 LCR per row.

  25. LCR Concepts: Other Items to Note • All of the LCRs that are part of a single transaction must be applied by one apply server. • If a transaction generates 10,000 LCRs, then all 10,000 of them must be applied by just one apply server – no matter how many servers are running. • For each transaction, there is one additional LCR generated, at the end of the transaction. • If a single transaction deletes 2,000 rows, then 2,001 LCRs will be generated. This item is important to note for the “spill threshold” value.

  26. “Spill” Concepts • Normally, outstanding messages are held in buffered queues, which reside in memory. • In some cases, messages can be “spilled”; that is, they can be moved into tables on disk. 3 reasons: • The total number of outstanding messages is too large to fit in the buffered queue; • A given message has been in memory too long; • The number of messages in a single transaction is larger than the “LCR spill threshold” parameter.

  27. “Spill” Concepts: Types of Spill Tables • There are two types of tables that can hold “spilled” messages: • “Queue” tables: each buffered queue has a “queue table” associated with it. Queue tables are created at the same time as buffered queues. • Name format for queue tables: AQ$_<name that you specified>_P • Example: AQ$_CAPTURE_QUEUE_TABLE_P

  28. “Spill” Concepts: Types of Spill Tables (cont.) • The “Spill” table: there is one (and only one) “spill” table in any database that uses Streams. • The name of the spill table is SYS.STREAMS$_APPLY_SPILL_MSGS_PART. • As the name implies, the spill table can only be used by apply processes – not by capture processes.

  29. “Spill” Concepts: Where are the spilled messages? • Tables where spilled messages are placed: • If a message is spilled because the total number of outstanding messages is too large, then that message is placed in the associated queue table. • If a message has been held in memory too long, then that message is placed in the queue table. • If the number of LCRs in a single transaction exceeds the “LCR spill threshold”, then all of those LCRs are placed in the spill table – SYS.STREAMS$_APPLY_SPILL_MSGS_PART.
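
A quick way to check for spilling (a minimal sketch; v$buffered_queues carries a spill-count column, and the row count of the apply spill table shows per-transaction spill):

Select Queue_Name, Num_Msgs, Spill_Msgs
From V$Buffered_Queues;

Select Count(*)
From Sys.Streams$_Apply_Spill_Msgs_Part;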

  30. Troubleshooting • Two basic ways to troubleshoot Streams issues: • Query the internal Streams tables and views; • Search the alert log for messages. (This is particularly useful for capture process issues).

  31. Capture Process Tables and Views streams$_capture_process: lists all defined capture processes dba_capture: basic status, error info v$streams_capture: detailed status info dba_capture_parameters: configuration info
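
For a first look at capture health, a status query like this is often enough (a sketch; error_number and error_message are NULL unless the capture process has aborted):

Select Capture_Name, Status, Error_Number, Error_Message
From Dba_Capture;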

  32. Propagate Process Tables and Views streams$_propagation_process: lists all defined propagate procs dba_propagation: basic status, error info v$propagation_sender: detailed status v$propagation_receiver: detailed status

  33. Apply Process Tables and Views streams$_apply_process: lists all defined apply processes dba_apply: basic status, error info v$streams_apply_reader: status of the apply reader v$streams_apply_server: status of apply server(s) v$streams_apply_coordinator: overall status, latency info dba_apply_parameters: configuration info
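
Similarly for apply, a status check plus a rough latency figure (a sketch; the latency arithmetic below assumes the hwm_time and hwm_message_create_time date columns of v$streams_apply_coordinator):

Select Apply_Name, Status, Error_Number, Error_Message
From Dba_Apply;

Select Apply_Name, State, Total_Applied,
       (Hwm_Time - Hwm_Message_Create_Time) * 86400 As Latency_Secs
From V$streams_Apply_Coordinator;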

  34. “Miscellaneous” Tables and Views v$buffered_queues: view that displays the current and cumulative number of messages, for each buffered queue. sys.streams$_apply_spill_msgs_part: table that the apply process uses, to “spill” messages from large transactions to disk. system.logmnr_restart_ckpt$: table that holds capture process “checkpoint” information.

  35. Types of Bottlenecks Two main types of “bottlenecks”: Type 1: Replication is completely stopped; i.e., no changes are being replicated at all. Type 2: Replication is running, but it is slower than the rate of DML; i.e., it is “falling behind”.

  36. Capture Process Bottlenecks Capture process bottlenecks typically have to do with a capture process being unable to read necessary online or (especially) archive logs. This will result in a “Type 1” bottleneck – that is, no changes will be replicated at all (because no changes can be captured). Almost all of the “Type 1” bottlenecks that I have ever encountered have been due to archive log issues with capture processes.

  37. Capture Process Bottlenecks: Capture “Checkpoints” The capture process writes its own “checkpoint” information to its data dictionary tables. This checkpoint info keeps track of the SCN values that the capture process has scanned. That information is used to calculate the “required_checkpoint_scn” value. Capture process checkpoint information is primarily stored in the following table: SYSTEM.LOGMNR_RESTART_CKPT$
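
To see the current value, and how much checkpoint data has piled up, something like this will do (a sketch):

Select Capture_Name, Required_Checkpoint_Scn
From Dba_Capture;

Select Count(*)
From System.Logmnr_Restart_Ckpt$;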

  38. Capture Process Bottlenecks: Capture “Checkpoints” (cont.) By default, the capture process writes checkpoints very frequently, and stores their data for a long time. On a very write-intensive system, this can cause the LOGMNR_RESTART_CKPT$ table to become extremely large, very quickly. The amount of data stored in that table can be modified with these capture process parameters: • _checkpoint_frequency: number of megabytes captured which will trigger a checkpoint • checkpoint_retention_time: number of days to retain checkpoint information
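
For example, both settings can be changed through DBMS_CAPTURE_ADM (a sketch; 'MY_CAPTURE', the 500 MB frequency, and the 7-day retention are illustrative placeholders, not recommendations):

begin
  -- checkpoint only after every 500 MB of mined redo (illustrative value)
  dbms_capture_adm.set_parameter(
    capture_name => 'MY_CAPTURE',
    parameter    => '_checkpoint_frequency',
    value        => '500');

  -- keep checkpoint data for 7 days (illustrative value)
  dbms_capture_adm.alter_capture(
    capture_name              => 'MY_CAPTURE',
    checkpoint_retention_time => 7);
end;
/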

  39. Capture Process Bottlenecks: SCN checks When a capture process starts up, it calculates its “required_checkpoint_scn” value. This value determines which redo logs must be scanned, before any new transactions can be captured. The capture process needs to do this check, in order to ensure that it does not “miss” any transactions that occurred while it was down. As a result, when a capture process starts up, the redo log that contains that SCN value – and every subsequent log – must be present in the log_archive_dest directory (or in online logs).

  40. Capture Process Bottlenecks: SCN checks (cont.) If any required redo logs are missing during a capture process restart, then the capture process will permanently “hang” during its startup. This issue will completely prevent the capture process from capturing any new changes. If the required redo log(s) cannot be restored, then the only way to resolve this situation is to completely rebuild the Streams environment. As a result, it is extremely important to “keep track” of redo logs that the capture process needs, before deleting any logs.

  41. Capture Process Bottlenecks: Which archive logs are needed? The following query can be used to determine the oldest archive log that will need to be read during the next restart of a capture process.

select a.sequence#, b.name
from v$log_history a, v$archived_log b
where a.first_change# <= (select required_checkpoint_scn
                          from dba_capture
                          where capture_name = '<capture process name>')
  and a.next_change# > (select required_checkpoint_scn
                        from dba_capture
                        where capture_name = '<capture process name>')
  and a.sequence# = b.sequence#(+);

If no rows are returned from that query, then the SCN in question resides in an online log.

  42. Capture Process Bottlenecks: Flow Control • In 10g, the capture process is configured with “automatic flow control”. This feature prevents the capture process from spilling many messages. • If a large number of messages build up in the capture process’s buffered queue, then it will “pause” capturing new messages, until some messages are removed from the queue. • A message cannot be removed from the capture process’s queue until one of these items occurs: • The message is applied at the destination • The message is spilled at the destination
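
You can see flow control kicking in by watching the capture state (a sketch; in 10gR2 the state column typically shows 'PAUSED FOR FLOW CONTROL' while capture is throttled):

Select Capture_Name, State
From V$streams_Capture;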

  43. Propagate Process Bottlenecks • I have heard of a few “Type 2” propagate bottlenecks in some “extreme” environments. • Parameter changes to consider, to try to avoid propagate-related bottlenecks (see the sketch below): • Set the propagate parameter LATENCY to 0 • Set the propagate parameter QUEUE_TO_QUEUE to TRUE (10.2 only) • Set the init.ora parameter _job_queue_interval to 1 • Set the init.ora parameter job_queue_processes to 4 or more
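
A sketch of how those changes might be applied (the queue name and destination db link below are placeholders; _job_queue_interval is a hidden static parameter, so it only takes effect from the spfile after a restart; QUEUE_TO_QUEUE is normally specified when the propagation is first created, so it is not shown here):

alter system set job_queue_processes = 4;
alter system set "_job_queue_interval" = 1 scope = spfile;

begin
  -- drive the propagation schedule with zero latency
  -- (placeholder queue and destination names)
  dbms_aqadm.alter_propagation_schedule(
    queue_name  => 'STRMADMIN.CAP_LA_PROD_Q',
    destination => 'NY_PROD_DBLINK',
    latency     => 0);
end;
/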

  44. Apply Process Bottlenecks • Apply process bottlenecks usually deal with the apply process not “keeping up” with the rate of DML being executed on the source tables. • I have generally only seen apply bottlenecks with “batch” DML (as opposed to “OLTP” DML). • The three main areas to be concerned with regarding apply process bottlenecks are: • Commit frequency; • Number of apply servers; • Other Streams parameters.

  45. Apply Process Bottlenecks: Commit Frequency • The number of rows affected per transaction has an enormous impact on apply performance: • The number of rows affected per transaction should not be too large – due to spill, and to the “one apply server per transaction” restriction. • The number of rows affected per transaction should not be too small, either – due to DML degradation, and to apply transaction overhead. • From my experience, the optimal commit frequency is around 500 rows.

  46. Apply Process Bottlenecks: Number of Apply Servers The number of parallel apply servers also has a large impact on the apply process’s performance. The number of parallel apply servers is set by the apply “PARALLELISM” parameter. From my experience, for best throughput, the PARALLELISM parameter should be three times the number of CPUs on the host machine. Note: setting PARALLELISM that high can potentially “eat up” all of the CPU cycles on the host, during intensive DML jobs.
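
For example, on an 8-CPU host the rule of thumb above works out to 24 apply servers; setting it might look like this (a sketch; 'MY_APPLY' is a placeholder apply process name):

begin
  -- 3 x 8 CPUs = 24 parallel apply servers (illustrative)
  dbms_apply_adm.set_parameter(
    apply_name => 'MY_APPLY',
    parameter  => 'parallelism',
    value      => '24');
end;
/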

  47. Apply Process Bottlenecks: Other Streams Parameters • Several other Streams-related parameters can also have an effect on apply performance: • Apply process parameter _HASH_TABLE_SIZE: set this relatively high (such as 10000000), to minimize “wait dependency” bottlenecks. • Apply process parameter TXN_LCR_SPILL_THRESHOLD: set to be a little bit higher than the maximum number of LCRs generated per transaction, to prevent spill. • Init.ora parameter aq_tm_processes: set to 1.
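
A sketch of how those settings might be applied ('MY_APPLY' and the threshold value of 10000 are illustrative placeholders; _HASH_TABLE_SIZE and TXN_LCR_SPILL_THRESHOLD are apply parameters set through DBMS_APPLY_ADM, while aq_tm_processes is an init.ora parameter):

begin
  dbms_apply_adm.set_parameter('MY_APPLY', '_hash_table_size', '10000000');
  dbms_apply_adm.set_parameter('MY_APPLY', 'txn_lcr_spill_threshold', '10000');
end;
/

alter system set aq_tm_processes = 1;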

  48. Apply Process Bottlenecks: “Turning Off” Streams As mentioned previously, apply bottlenecks generally occur with “batch”-style DML. It is possible to configure a session so that Streams will not capture any DML from that session – and therefore, that session’s DML will not be propagated or applied. This technique allows batch-style DML to be run at each Streams database individually – rather than running the batch DML in one database, and then having that DML get replicated to all of the other databases.

  49. Apply Process Bottlenecks: “Turning Off” Streams (cont.) “Turning off” Streams in a session is done by setting the session’s “Streams tag” to a non-NULL value. Here is an example of turning Streams off before a batch statement, and then back on:

exec dbms_streams.set_tag (tag => hextoraw('01'));
<batch DML statement>
commit;
exec dbms_streams.set_tag (tag => NULL);

Note: this technique will only work if the “include_tagged_lcr” parameter, in the Streams capture rules, is set to FALSE. (FALSE is the default value for that parameter in Streams rules.)

  50. Apply Process Bottlenecks: Apply Process Aborts! If there are any triggers on any replicated tables, then the apply process can sometimes abort – and not restart – during periods of intense DML activity. These aborts happen most frequently when one particular apply server becomes “overloaded”, and has to run too much recursive SQL (triggers cause recursive SQL). This issue is caused by Oracle bug # 4712729. This bug is apparently fixed in 10.2.0.3; and there are “backport” patches available for 10.2.0.1 and 10.2.0.2.
