200 likes | 204 Vues
Patterns and Best Practices in SSIS. or, how to keep your DBA happy with your crazy-ass ETL. It’s all about me. I’m a SpeakingMentor !. Obligatory LOTR reference…. Mantra. C# is *not* the only way to initialise a variable C# is *not* the only way to move files
E N D
Patterns and Best Practices in SSIS or, how to keep your DBA happy with your crazy-ass ETL
It’s all about me... I’m a SpeakingMentor !
Mantra • C# is *not* the only way to initialise a variable • C# is *not* the only way to move files • C# is *not* the only way to call a web service • C# scripts are opaque to the SSIS runtime • C#... <sigh>
Not a pipeline • Data does not get passed to components (cough) • Components manipulate blocks of data (true)
What you *think* happens…. Get data Replace nulls Conditional Split Merge Sort
What actually happens… Get data Conditional Split Replace nulls Sort Merge
It’s all about speed… • There are 2 transformation types: • Synchronous – fastest (streaming and row-based) • Asynchronous – slower • And three ‘modes’: • Non-blocking • Semi-blocking • Full blocking
Non-blocking synchronous streaming transforms • Audit • Cache Transform • Character Map • Conditional Split • Copy Column • Data Conversion • Derived Column • Multicast • Percent Sampling • Row Count • Lookup
Non-blocking synchronous row transforms • DQS cleansing • Export Column • Import Column • OLEDB Command • Script Task • SCD • Lookup
Wait, there’s two ‘Lookup’s ? • Lookups are non-blocking streaming transforms when the ‘Full Cache’ option is used for the lookup data • Using ‘Partial Cache’ or ‘No Cache’ options in the lookup make the Lookup a row-based transform, which is necessarily slower
Semi-blocking asynchronous transforms • Data Mining • Merge • Merge Join • Pivot • Unpivot • Union All • Term Lookup
Full blocking asynchronous transforms • Aggregate • Fuzzy Grouping • Fuzzy Lookup • Row Sampling • Sort • Term Extraction • Script Task
Wait, there’s two ‘Script Task’s ?? • Script tasks are non-blocking when they’re using an outside resource (i.e. not the data that you’re working on) • They become blocking when they collect a dataset before sending it on to a destination • Set the ‘SynchronousInputID’ property on the output columns to ‘None’
Large Data Sets • BufferTempStoragePath • BlobTempStoragePath • These can either be set in your package template, or injected into the .dtsx
Really ? You can just inject stuff ? • Yes • Find and Replace default entries with your custom requirement… any decent text editor will do. BLOBTempStoragePath="F:\astDisk\WithLoadsOfSpace_temp" bufferTempStoragePath="F:\astDisk\WithLoadsOfSpace_temp”
Things that make you go hmmm • Logging • Package Checkpoints • Location, location, location • Expressions • Parallel Operations • ‘IsSorted=True’ property • Auditing • Bad Data
Time to dissect Here’s one I prepared earlier…
Take-aways – You should know.. • Your data • Your environment • Transform types • Ingest / Egress of data • What you are trying to achieve • When to give up and use C#