REP708: Using Replication Server to Feed a Sybase IQ Warehouse

REP708: Using Replication Server to Feed a Sybase IQ Warehouse Robert SegessenmenSystems Administrator Dept of Defense robtronics@mac.com Jeff TallmanPrincipal Systems Consultant Sybase, Inc tallman@sybase.com

Using RS to Feed an IQ DW Agenda: • How to use RS to feed IQ • What not to do • Why use RS? What is role of ETL? • Architecture and Implementation • Dept Of Defense Case Study • In Production for >5 yrs • Implementation • OLTP  Staging • Staging  DW • Stumbling blocks

What Not to Do • Do NOT Replicate Directly into IQ • IQ is tuned for bulk loads, RS does atomic • It will not keep up • Table versioning • Don’t Try to keep the DW up-to-the-minute • This is a DW right? • If analyzing the last 5 years of data – how really important is the last 30 minutes of changes???? • Although Telestra did it every 10-30 seconds for 2000 Summer Olympics in Australia • If frequent updates are required, extensive denormalization, transformation and other processing will not be sustainable on large schemas. • Simple star-schemas with minimal aggregation

Why Use RS at All? • Best Change Detection mechanism available • The E in ELT stands for “Extract” – specifically, “Extract Changed Data” • RS detects changes by scanning Transaction Log • What about ELT tools? • Still may be needed to Load/Transform • We aren’t suggesting they aren’t necessary – just the facts – they can’t do change detection as efficiently as RS • Extractions are problematic – Lack of access to transaction log dictates use of timestamp columns or triggers to identify changed rows • Timestamp columns • miss peak events, no delta, high contention, non-transactional, no user info, etc. • Shadow tables • trigger changes, extra I/O impact on OLTP, non-transactional, possible contention on shadow tables during load times

The Architecture ELT Tool …or… Custom Job Scheduler OLTP (Source System) DW Staging DW/ Data Mart (Today, we will focus here)

Why a Staging Area? Ralph Kimball’s Data Warehouse Bus Architecture “The Data Warehouse LifeCycle Toolkit”

What the Process Looks Like Synchronization Process Suspend/Resume 6 3 5 Truncate Local Change 1 4 Function Strings Insert/Location Mapping Stored Procedures 2 7 Staging Stored Procedures Exception Handling Tables

Implementation Overview • Replication Server Interface • Function strings calling stored procedures in staging database • Staging Database • Change tracking tables • Stored procedures for mapping RS input stream to tracking tables • Synchronization Process • Suspends/Resumes RS DSI to staging database • Invokes Insert/Location scripts to copy changes to local change tables • Invokes scripts/procedures to map changes to DW schema • Processes exceptions (i.e. records errors, alerts DBA) • Overall Guiding Principles/Goals • Minimize overhead on replication throughput • Maximize performance advantages of IQ in aggregating values, etc. • Minimize joins within IQ as part of processing changed data

The Replication Server Interface • Function String/Stored Procedure based approach • Function Strings call Stored Procedures to map replicated rows to staging tables as appropriate • Advantages: • Procs are easily supported by CASE tools & general coders/developers • Allows an intermediary scheme to be used (semi-star) • Avoid: • Over complicating staging logic – keep it simple • Complicated logic could slow down replication throughput • Tracking complicated aggregates – track the raw data & let IQ aggregate • Simple aggregates within the same synch period are fine • Extensive denormalization • Denormalization can be done via the insert/location query • Alternatively, denormalization can be done from local change tables in IQ

The Staging Database • Contains data staging tables • Staging tables for inserts, updates, deletes for each source • Alternatively map updates as delete/insert • The problem with this approach is you could loose intermediate values, mins, maxes, etc. • Staging tables have extended columns • Source DS/DB, commit timestamp, user info, etc. • Before/After image for update staging tables • Contains reference data • Useful for decoding lookups/minor denormalization • Avoids too many joins during insert/location • Contains views of staging tables • Useful for simplifying insert/location queries • Extremely useful if using bcp/load table vs. insert/location • You can bcp out from a view..

General Staging Strategy • Insert  Adds row to insert tracking table • Update  Adds row to update tracking table • 5 updates = 5 inserted rows – this allows DW to find peak values, etc. • Delete  adds row to delete tracking table • May remove rows from insert & update tables • If not, then situation may be more of a transaction register and a single table may work for all three (insert, update, delete) • i.e. checking account • Consideration: • A business transaction may take several hours or days to complete • What do you send to the staging database – only completed business transactions or do you filter incomplete transactions during the load process??

Synchronization Process • Scheduling Utility/Tool • CA Autosys, BMC, unix cron • ASIQ Events may not work too well • System down, connections to RS, etc. • Script-based logic • Suspend RS DSI to staging database • Use Insert/Location to copy changes to local IQ tables • Can be scripts or stored procedures • Keep SELECT clause of Insert/Location optimized • Table scan at staging database will be necessary (obvious), however, watch the number and complexity of the joins • Resume RS DSI to staging database • Invoke mapping procedures in IQ • Perform denormalizations, aggregations, mapping to DW schema, etc. • Check for exceptions

Why Insert/Location? • Isn’t “Load Table” and flat files faster? • Yes…but….you have to first bcp the data out, possibly compress it, ftp it to the IQ server, and then uncompress it….(how long does that take?) • Insert/Location will be generally faster as an overall process than the above • Advantages of Insert/Location • No bcp/ftp errors • file system full, 2GB file limits, \0x0a vs. \0x0d\0x0a, etc. • On the fly denormalization • Select clause can use joins, unions, views, etc. • No messing with date formats, etc. • Can be used with DirectConnect for Heterogeneous • Including DB2/Mainframe data • Caveat • Prior to IQ 12.4.3, Insert/Location could not be concurrent • Hence, it is one reason (besides the fact it wasn’t around 5 years ago) why the case study uses bcp and load table via a pipe file

Local Change Tables in IQ • Advantages: • Allows IQ to do aggregation with IQ indexes • Avoids locking main IQ tables during data transfer from ASE • Shortens interval for multi-table updates • A plus if ensuring the DW is commit consistent at each point due to ongoing query activity • Hints & Tips: • Truncate at the beginning of the synch process vs. at the end • Allows you to go back and review data that was loaded • Use permanent local change tables • Creating tables on the fly causes problems with IQ Multiplex • Readers will suspend (ouch)! • Temporary tables are lost if writer crashes

Mapping Changes into IQ Tables • General Sequence: • Deletes  Inserts  Updates • Stored Procedure or Script based • Not a 1:1 with tables – but 1:1 with mapping requirements • Hints & Tips • You can resume RS before starting this step • Wait until all insert/locations complete prior to starting • Avoids errors, etc. during denormalization • Update logical groupings in DW schema in a single transaction • Dimensions first, then FACT tables • Avoids issues with ongoing queries, FK violations • You may need to create “sparse” cubes to contain incremental values for star-schemas

Some RS Gotcha’s • Minimal Column Replication • Useful feature, but wrecks function strings • Use a separate repdef for staging database • Minimal column replication can significantly improve performance of RS delivery to WS and other OLTP targets • Only disable minimal column replication for primary targets as a last resort • Stored procedure replication • This could be a major problem • What data was modified is not recorded in the log records scanned • So we have to figure it out somehow….

Handling Procedure Replication – 3 Scenarios • Same Transaction  Exec in Staging DB • Data modified is restricted to same transaction • Data is guaranteed to all be in staging • Alternative is to use a Java Object to contain array of PKeys and pass as a parameter to a replicated procedure • Especially easy if rows to be modified are “pre-identified” in a temp table during the normal course of regular procedure processing. • Insert Procs  Exec in Staging DB • In this case, the parameters to the procedure contain all the data necessary • At OLTP source, proc maps params to multiple tables • Just do the same in Staging DB • Update or Delete Procs  Handle in IQ (skip purge procs) • Data modified may be in staging or not • May have just sync’d and been truncated, unless sync is carefully regulated to avoid this problem (then exec proc in staging DB) • Track proc exec in separate tables w/ params • Apply proc in IQ as per next slide

Proc Replication: Handling in IQ • Worse case scenario • In other words, avoid this if you can • Apply mappings for data changes PRIOR to proc execution • Use commit time datetime values • Execute a proc in IQ with equivalent logic for DW schema • Keep in mind time dimension • Usually will involve updating a dimension, etc. • i.e. changing the status dimension • Change all blood supplies collected within 100 miles of somewhere over the past 60 days to a “suspect/contaminated status” • Apply mappings for data changes AFTER proc execution • The more procs you handle this way the more “breaks” logic • This could complicate the mapping logic to the point that you may wish to change the primary logic instead • Lengthening the synch interval might seem to drive more, but it may also allow the proc to execute in staging DB if data impact is within sync timeframe

Dept Of Defense Case Study • Went into production in 1996 • OLTP was ASE 11.0.3, RS 11.0 • Initial IQ was IQ 11.1 • Today • OLTP is ASE 12.5.0.2, RS 12.5 • IQ currently is 12.4.2 • IQ DB Size is approximately 500GB • Growing just under 1GB/day (2.4 million objects)

Case Study Architecture Load Balanced OLTP Systems Semi-Denormalized Staging Database Load Balancing Router SGI Origin 3000 Series

Case Study Baseline • 2.4 million objects generated per 24 hour day • Object composed of one main record - plus one or more related rows in five subordinate tables • Each object requires 10-20 transactions from creation to completion • Objects are machine generated locally and via replication • Growth currently at 5-10% monthly - Yes this is a major challenge!!! • Only “last” transaction triggers replication to staging area • Do the math - That’s about 100,000 objects per hour to “move” and load • Final transaction to complete object fires “archive proc” from trigger • OLTP ‘archive’ proc “collects” several tables worth of columns or parameters.

Case Study (OLTP  Staging) • OLTP System uses Peer-to-Peer for load balancing and availability • Fault tolerance an absolute requirement for this mission critical service • Replication between OLTP systems is table based • Uses some fstrings for handling delta values and “accepted” errors • Replication from each OLTP to Staging DB is procedure based • Only completed objects are desired • Trigger on OLTP tables call replicated procs • Procs are empty shells • In staging database, replicated procs insert into tracking tables • These procs perform some denormalization and pre-calculate several time based deltas

Case Study (Staging  DW) • “Loading” of DW occurs once each day… • Basic steps • Replication (DSI to Staging) suspended • Full Staging DB backup • BCP out of staging DB to named pipe read by “load_table” command • Staging DB “cleaned” via truncate table • Replication resumed • Incremental save of DW taken (To local disk) • Weekly full save / Monthly full save - Were to put it?

Case Study (Staging  DW) cont… • Stumbling blocks • Some data values do not “translate” to DW column types • Where to put or at least record these rows • As the volume of records to load increases… • So does the time the Replication process is interrupted • So does the time it takes to work off the backlog • And so does the amount of time a “writer” is competing for resources • Time to optimize the staging database! • Spend some of that P&T time and money where it may not be obvious

Questions?

REP708: Using Replication Server to Feed a Sybase IQ Warehouse