
ETL Framework



  1. ETL Framework

  2. Motivations
  • It is nice to not have jobs fail overnight
    • Automated retries
    • Detect transient failures (e.g. an interrupted transfer)
    • Track data quality issues without failing the job
  • It is nice to know when data is not right
    • Provide edit messages for problems
    • Fix / sanitise data where possible
  • It is nice to know what data looked like before the most recent refresh
    • Track versions of rows as they are received
  • These are not features of SSIS or Talend

  3. Features
  • Standardised jobs without loss of flexibility
  • Batch tracking
  • Row count tracking
  • Change capture and tracking
  • Globally unique, persistent surrogate keys
  • Edit messages
  • Scheduling & logging

  4. Standardised Jobs - Staging

  5. Standardised Jobs - Staging
  • Having a standard layout and sequence of steps makes it easier to learn what each job does.

  6. Standardised Jobs
  • All staging tables have a generated primary key comprising batch ID and row number
  • All staging tables have a reject flag indicating whether any reject edits were raised
  • As rows are staged, edits are checked
  • Edit columns are nullable: if non-null, the edit was raised and the column contains the message
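
  As a concrete illustration of these conventions, here is a minimal sketch of a staging table, reusing the ip_s_code_map table from the change capture example later in the deck; the column types, the etl_row_num name and the edit_missing_desc edit column are assumptions, since the deck does not spell them out:

    CREATE TABLE dbo.ip_s_code_map (
        etl_batch_id      bigint       NOT NULL,            -- assigned when the batch starts
        etl_row_num       bigint       NOT NULL,            -- assigned as rows are staged (name assumed)
        etl_reject_flag   bit          NOT NULL DEFAULT 0,  -- set when any reject edit is raised
        map_name          varchar(50)  NOT NULL,
        code              varchar(20)  NOT NULL,
        cond_desc         varchar(200) NULL,
        full_desc         varchar(200) NULL,
        mapped_code       varchar(20)  NULL,
        -- one nullable column per edit: non-null means the edit was raised
        -- and holds the message (hypothetical edit column)
        edit_missing_desc varchar(200) NULL,
        CONSTRAINT pk_ip_s_code_map
            PRIMARY KEY (etl_batch_id, etl_row_num)         -- generated key: batch ID + row number
    );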

  7. Non-standard Jobs

  8. Batch Tracking
  • Call the etl_start_batch and etl_stop_batch stored procedures at the start and end of each job
  • Use the database for all ETL timestamps to get a consistent clock
  • Stopping the batch optionally performs row count checks (not part of the stored proc)
  • Logs to the etl_r_batch table
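
  The deck does not show the stored procedure signatures, so the following is only a hedged sketch of the bracketing pattern, with assumed parameter names and counter values:

    DECLARE @batch_id bigint;

    -- start of job: register the batch; timestamps come from the database clock
    EXEC dbo.etl_start_batch @job_name = 'stage_code_map', @batch_id = @batch_id OUTPUT;

    -- ... stage rows, tagging each with @batch_id ...

    -- end of job: log the outcome and counters to etl_r_batch
    EXEC dbo.etl_stop_batch @batch_id = @batch_id,
                            @rows_in = 1000, @rows_out = 990, @rejects = 10;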

  9. Row Count Tracking
  • Row counters track input rows, output rows, rejects and edits; these are also logged to etl_r_batch
  • Allows reconciliation to check rows aren't "lost" through unintended ETL actions
  • Lets me see how jobs are progressing
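
  A sketch of the reconciliation this enables, assuming etl_r_batch stores the counters under names like rows_in, rows_out and rejects (the deck does not name the columns):

    -- find any batch where the input rows are not fully accounted for
    SELECT etl_batch_id, rows_in, rows_out, rejects,
           rows_in - (rows_out + rejects) AS unexplained
    FROM   dbo.etl_r_batch
    WHERE  rows_in <> rows_out + rejects;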

  10. Row Numbering
  • Surrogate keys, batch IDs and row numbers are assigned by a custom Talend component

  11. Change Capture
  • We want to know when a source system row changes between snapshots
    • Permits investigation of historical reports
    • Helps undo data problems in the source
    • Provides triggers for re-evaluating edits
  • Efficient storage of changes is useful
    • but space is cheap, so it's not a big constraint
  • Having three tables (staging, base, history) for each source table has impacts for naming

  12. Change Capture
  • Use a left join from staging to current (yesterday) to pick up insertions (IN) and new versions of updated rows (UI)
  • Union with a left join from current to staging to pick up deletions (DE) and old versions of updated rows (UD)
  • Null-safe compare on all non-key columns
  • Retain the previously allocated skey for updates
  • Allocate a new skey for inserts only
    • Means skeys persist while the row remains in the source
    • Conserves the global pool of skeys (although bigint has plenty!)

  13. Change Capture
  Note that if you process changes while the query runs, you need to use snapshot isolation to stop the second half of the query from picking up changes that have already been applied.

    SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

    SELECT etl_batch_id, etl_skey, etl_chg_type, [etl_reject_flag],
           [map_name], [code], [cond_desc], [full_desc], [mapped_code]
    FROM (
        -- Left join staging to current to find new rows (IN)
        -- or the new version of existing rows (UI).
        SELECT CASE WHEN dst.etl_batch_id IS NOT NULL THEN 'UI' ELSE 'IN' END AS etl_chg_type,
               dst.etl_skey,
               src.etl_batch_id,
               src.[etl_reject_flag], dst.[etl_reject_flag] AS [etl_cmp_etl_reject_flag],
               src.[map_name],        dst.[map_name]        AS [etl_cmp_map_name],
               src.[code],            dst.[code]            AS [etl_cmp_code],
               src.[cond_desc],       dst.[cond_desc]       AS [etl_cmp_cond_desc],
               src.[full_desc],       dst.[full_desc]       AS [etl_cmp_full_desc],
               src.[mapped_code],     dst.[mapped_code]     AS [etl_cmp_mapped_code]
        FROM [dbo].[ip_s_code_map] AS src
        LEFT JOIN [dbo].[ip_r_code_map] AS dst
          ON src.[map_name] = dst.[map_name]
         AND src.[code] = dst.[code]

  Using separate rows for UI & UD means you do not need to double the number of columns in your output or change the nullability of columns.

  14. Change Capture

        UNION ALL

        -- Left join current to staging to find deleted rows (DE)
        -- or the old version of existing rows (UD).
        SELECT CASE WHEN dst.etl_batch_id IS NOT NULL THEN 'UD' ELSE 'DE' END AS etl_chg_type,
               src.etl_skey,
               src.etl_batch_id,
               src.[etl_reject_flag], dst.[etl_reject_flag] AS [etl_cmp_etl_reject_flag],
               src.[map_name],        dst.[map_name]        AS [etl_cmp_map_name],
               src.[code],            dst.[code]            AS [etl_cmp_code],
               src.[cond_desc],       dst.[cond_desc]       AS [etl_cmp_cond_desc],
               src.[full_desc],       dst.[full_desc]       AS [etl_cmp_full_desc],
               src.[mapped_code],     dst.[mapped_code]     AS [etl_cmp_mapped_code]
        FROM [dbo].[ip_r_code_map] AS src
        LEFT JOIN [dbo].[ip_s_code_map] AS dst
          ON src.[map_name] = dst.[map_name]
         AND src.[code] = dst.[code]
    ) AS cmp
    -- Filter out unchanged rows; the use of CASE helps guard against
    -- accidental NULL-unsafe comparisons.
    WHERE CASE WHEN etl_chg_type NOT IN ('IN', 'DE')
                AND ([etl_reject_flag] = [etl_cmp_etl_reject_flag])
                AND ([cond_desc] = [etl_cmp_cond_desc]
                     OR ([cond_desc] IS NULL AND [etl_cmp_cond_desc] IS NULL))
                AND ([full_desc] = [etl_cmp_full_desc])
                AND ([mapped_code] = [etl_cmp_mapped_code])
           THEN 1 ELSE 0 END = 0;

  15. Change Capture
  • The first batch is all inserts
  • Subsequently rows may be updated or deleted, and new rows inserted
  • Outgoing data is also present
  • Useful to join UI & UD rows to see the effect of an update
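
  A sketch of that UI/UD join, assuming the change rows land in a history table ip_h_code_map (following the naming scheme on the next slide) and that both versions carry the retained skey:

    -- pair each old version (UD) with its new version (UI) to see what changed
    SELECT old.map_name, old.code,
           old.mapped_code AS old_mapped_code,
           new.mapped_code AS new_mapped_code
    FROM   dbo.ip_h_code_map AS old
    JOIN   dbo.ip_h_code_map AS new
      ON   new.etl_skey = old.etl_skey       -- the skey persists across versions
    WHERE  old.etl_chg_type = 'UD'
      AND  new.etl_chg_type = 'UI';

  With multiple snapshots you would also match on the batch that captured the change; the deck does not show which column records that.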

  16. An Aside: Naming Schemes
  • I don't use tbl_, vw_, sp_, etc.
    • Names tend to be right-trimmed, so the left-most characters are precious
    • Names tend to be prefix-sorted, so the left-most characters should group
    • You can fairly easily work out what an object is without the prefix
    • If I decide to replace a table with a view (it happens!) it looks silly to have tbl_ in the view's name; the corollary is that "what it is" is less relevant than "how do you use it", e.g. does it return a row set on SELECT?
  • I start the name with a functional group
    • E.g. ip_, op_, rt_, pl_, ph_, wl_, ad_
  • I use a letter to indicate the workflow role
    • c = hand-entered config, s = staging (truncated daily), r = repository (current records), h = history (change tracking records), f = fact, d = dimension, a = aggregation, m = materialized view (refreshed daily)
  • I try to name columns consistently across systems
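
  Decoding the tables from the change capture example under this scheme (ip_h_code_map is the assumed history companion, not shown in the deck):

    -- ip_s_code_map : ip_ functional group, s = staging (truncated daily)
    -- ip_r_code_map : ip_ functional group, r = repository (current records)
    -- ip_h_code_map : ip_ functional group, h = history (change tracking records)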

  17. Standardised jobs - Merging

  18. Surrogate Keys
  • A globally unique key for all rows
    • Allows me to have a single edits table
    • Allows easy merging of different data sources
    • Makes it easier to identify a row in queries (vs. compound or non-integer keys)
  • Issued monotonically, making them suitable for clustered indexes
  • Not used in staging
    • Allocated after change capture has identified the new rows that need a new skey
  • Can remain attached to the row while cardinality doesn't change (e.g. into facts but not aggregations)
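
  The deck assigns skeys from a custom Talend component (slide 10); purely as an illustration of the same allocation rule, here is a SQL equivalent using an assumed sequence and the hypothetical ip_h_code_map history table:

    CREATE SEQUENCE dbo.etl_skey_seq AS bigint START WITH 1;  -- global, monotonic pool

    -- only rows change capture classified as inserts get a new skey;
    -- updated rows keep the skey already attached to them
    UPDATE h
    SET    etl_skey = NEXT VALUE FOR dbo.etl_skey_seq
    FROM   dbo.ip_h_code_map AS h
    WHERE  h.etl_chg_type = 'IN'
      AND  h.etl_skey IS NULL;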

  19. Edit Messages
  • Writing data quality messages means you can massage bad data to load it, without losing track of what you originally received
  • Capturing your understanding of the data in DQ messages helps provide living documentation of key business concepts
  • Using change capture, edits can be regenerated only when the source data changes
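
  A sketch of the "fix it, but say so" pattern this describes; the cleansing rule and the edit_trimmed_code column are hypothetical:

    -- sanitise the code, but record what was received in an edit message;
    -- in T-SQL the right-hand sides see the pre-update value of code,
    -- so the message captures the original
    UPDATE s
    SET    edit_trimmed_code = 'code had surrounding whitespace: ''' + s.code + '''',
           code              = LTRIM(RTRIM(s.code))
    FROM   dbo.ip_s_code_map AS s
    WHERE  s.code <> LTRIM(RTRIM(s.code));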

  20. Other Logging
  • The scheduler logs job output and error streams to the database
  • Jobs can log details (e.g. file name processed, date range, etc.) to a journal
  • Jobs can create log messages
    • but mostly don't, relying on stdout/stderr

  21. Web Job Monitor
  • Web Job Monitor is underdeveloped
    • Should be integrated with the scheduler and allow more job control
  • Provides a web-based listing of job outcomes and logs

  22. Web Job Monitor
  • Status, run times, rows processed, etc.
  • Links to a detail screen for each job (next slide)

  23. Web Job Monitor
  • Edits can be acknowledged to hide them from data quality reports
  • Acknowledgement persists until the row changes in the source (as detected by change capture)
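
  The deck only describes the behaviour; here is a hedged sketch of how the acknowledgement filter could look, with hypothetical etl_r_edit and etl_r_edit_ack tables:

    -- DQ report: show only edits that have not been acknowledged;
    -- change capture would delete ack rows when the underlying row changes
    SELECT e.etl_skey, e.edit_name, e.edit_message
    FROM   dbo.etl_r_edit AS e
    LEFT JOIN dbo.etl_r_edit_ack AS a
      ON   a.etl_skey  = e.etl_skey
     AND   a.edit_name = e.edit_name
    WHERE  a.etl_skey IS NULL;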

  24. Scheduler
  • Uses Oddjob
    • http://www.rgordon.co.uk/oddjob/
  • Talend does not easily express non-trivial dependencies for job sequences
    • E.g. six independent reference tables can run in parallel, but then the transaction table load must wait for all of them to succeed
  • Talend also makes you pay for its scheduler
  • SQL Server Agent requires a SQL Server licence
  • Task Scheduler is not sophisticated enough

  25. Job Stream Structure
  Callouts from the scheduler screenshot, from the root of the job tree down:
  • The root node is a link to the underlying configuration file DI_All.xml and needs to be running for the daily schedule to be active.
  • This sequence is run (and keeps running) to load the configuration and the schedule.
  • This job represents the daily schedule: when it shows a blue dot it is "running" and will trigger.
  • A batch job corresponds to the master job.
  • A sequential job contains the phases.
  • A sequential job represents the first phase.
  • A sequential job within a phase represents the logical step (stage + merge bookings).
  • Talend jobs perform the actual stage and merge processing.

  26. Future
  • I have broken every rule I have created
    • It would be nice to reconsider some of those
  • The change capture query builder could be turned into a component
    • The hard part is getting the schema to load into the Talend designer
  • Edit messages should be re-checked and reloaded if the edit message changes
  • Web Job Monitor needs love
