1 / 60

ISQS 6339, Data Management & Business Intelligence Extraction, Transformation, and Loading (II)

ISQS 6339, Data Management & Business Intelligence Extraction, Transformation, and Loading (II). Zhangxi Lin Texas Tech University. Agenda. I. Using SSIS for ETL Integration Services Learn by doing Package items Problem-oriented package development II. The Principle of ETL Extraction

Télécharger la présentation

ISQS 6339, Data Management & Business Intelligence Extraction, Transformation, and Loading (II)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ISQS 6339, Data Management & Business IntelligenceExtraction, Transformation, and Loading (II) Zhangxi Lin Texas Tech University ISQS 6339, Data Management & Business Intelligence

  2. Agenda I. Using SSIS for ETL • Integration Services • Learn by doing • Package items • Problem-oriented package development II. The Principle of ETL • Extraction • Transformation • Loading ISQS 6339, Data Management & Business Intelligence

  3. II. The Principle of ETL ISQS 6339, Data Management & Business Intelligence

  4. Structure and Components of Business Intelligence SSMS SSIS SSAS SSRS SAS EG SAS EM ISQS 6339, Data Management & Business Intelligence

  5. Automating your routine information processing tasks • Your routine information processing tasks • Read online news at 8:00a and collect a few most important pieces • Retrieve data from database to draft a short daily report at 10a • View and reply emails and take some notes that are saved in a database • View 10 companies’ webpage to see the updates. Input the summaries into a database • Browse three popular magazines twice a week. Input the summaries into a database • Generate a few one-way frequency and two-way frequency tables and put them on the web • Merge datasets collected by other people into a main database. • Prepare a weekly report using the database and at 4p every Monday, and publish it to the internal portal site. • Prepare a monthly report at 11a on the first day of a month, which must be converted into a pdf file and uploaded to the website. • Seems there are many things are on going. How to handle them properly in the right time? • Organizer – yes • How about regular data processing tasks? ISQS 6339, Data Management & Business Intelligence

  6. Information Processing and Information Flow • Transaction processing • Interactions between a user and a computer application system with immediate responses from the application • Operational processing • Make use of computer to control a process • Batch processing • Consisting of a series of executions, each of which is applied to a set of data and turns the result to the next one. • Analytical processing • The interaction between analysts and collections of aggregated data that may have been reformulated into alternative representational forms for improved analytical performance. ISQS 6339, Data Management & Business Intelligence

  7. Programs Gateways Tools Extraction, Transformation, Loading (ETL) Processes • Extract source data • Transform/clean data • Index and summarize • Load data into warehouse • Detect changes • Refresh data Operational systems ETL Data Warehouse ISQS 6339, Data Management & Business Intelligence

  8. ETL: Tasks, Importance, and Cost Extract Clean up Consolidate Restructure Load Maintain Refresh Data Warehouse Operationalsystems ETL Relevant Useful Quality Accurate Accessible ISQS 6339, Data Management & Business Intelligence

  9. Data mapping Transform Extracting Data • Source systems • Data from various data sources in various formats • Extraction Routines • Developed to select data fields from sources • Consist of business rules, audit trails, error correction facilities Warehouse database Operational databases Data staging area ISQS 6339, Data Management & Business Intelligence

  10. IMS DB2 Oracle Sybase Informix VSAM SAP Shared Medical Systems Dun and Bradstreet Financials Hogan Financials Oracle Financials Production Data • Operating system platforms • File systems • Database systems and vertical applications ISQS 6339, Data Management & Business Intelligence

  11. Archive Data • Historical data • Useful for analysis over long periods of time • Useful for first-time load • May require unique transformations Operation databases Warehouse database ISQS 6339, Data Management & Business Intelligence

  12. Planning Marketing Accounting Internal Data • Planning, sales, and marketing organization data • Maintained in the form of: • Spreadsheets (structured) • Documents (unstructured) • Treated like any other source data Warehouse database ISQS 6339, Data Management & Business Intelligence

  13. A.C. Nielsen, IRI, IMS,Walsh America Purchased databases Competitive information Dun and Bradstreet Economic forecasts Wall Street Journal Barron's Warehousing databases External Data • Information from outside the organization • Issues of frequency, format, and predictability • Described and tracked using metadata ISQS 6339, Data Management & Business Intelligence

  14. Possible ETL Failures • A missing source file • A system failure • Inadequate metadata • Poor mapping information • Inadequate storage planning • A source structural change • No contingency plan • Inadequate data validation ISQS 6339, Data Management & Business Intelligence

  15. Maintaining ETL Quality • ETL must be: • Tested • Documented • Monitored and reviewed • Disparate metadata must be coordinated. ISQS 6339, Data Management & Business Intelligence

  16. Transformation • Transformation eliminates anomalies from operational data: • Cleans and standardizes • Presents subject-oriented data • Transform: • Clean up • Consolidate • Restructure Extract Warehouse Operationalsystems Load Data Staging Area ISQS 6339, Data Management & Business Intelligence

  17. Operationalsystem Operationalsystem Warehouse Warehouse Remote Staging Model Data staging area within the warehouse environment Transform Extract Load Staging area Data staging area in its own environment Transform Extract Load Staging area ISQS 6339, Data Management & Business Intelligence

  18. On-site Staging Model • Data staging area within the operational environment,possibly affecting the operational system Transform Extract Load Operational system Staging area Warehouse ISQS 6339, Data Management & Business Intelligence

  19. Data Anomalies • No unique key • Data naming and coding anomalies • Data meaning anomalies between groups • Spelling and text inconsistencies ISQS 6339, Data Management & Business Intelligence

  20. Transformation Routines • Cleaning data • Eliminating inconsistencies • Adding elements • Merging data • Integrating data • Transforming data before load ISQS 6339, Data Management & Business Intelligence

  21. Transforming Data: Problems and Solutions • Multipart keys • Multiple local standards • Multiple files • Missing values • Duplicate values • Element names • Element meanings • Input formats • Referential Integrity constraints • Name and address ISQS 6339, Data Management & Business Intelligence

  22. Product code = 12M654313 45 Salesperson code Country code Sales territory Productnumber Multipart Keys Problem • Multipart keys ISQS 6339, Data Management & Business Intelligence

  23. Multiple Local Standards Problem • Multiple local standards • Tools or filters to preprocess cm DD/MM/YY 1,000 GBP inches MM/DD/YY FF 9,990 cm DD-Mon-YY USD 600 ISQS 6339, Data Management & Business Intelligence

  24. Multiple Files Problem • Added complexity of multiple source files • Start simple Multiple source files Logic to detectcorrect source Transformed data ISQS 6339, Data Management & Business Intelligence

  25. Missing Values Problem • Solution: • Ignore • Wait • Mark rows • Extract when time-stamped If NULL thenfield = ‘A’ ISQS 6339, Data Management & Business Intelligence

  26. Duplicate Values Problem • Solution: • SQL self-join techniques • RDMBS constraint utilities ACME Inc ACME Inc ACME Inc SQL> SELECT ... 2 FROM table_a, table_b 3 WHERE table_a.key (+)= table_b.key 4 UNION 5 SELECT ... 6 FROM table_a, table_b 7 WHERE table_a.key = table_b.key (+); ISQS 6339, Data Management & Business Intelligence

  27. Element Names Problem • Solution: Common naming conventions Customer Client Customer Contact Name ISQS 6339, Data Management & Business Intelligence

  28. Element Meaning Problem • Avoid misinterpretation • Complex solution • Document meaning in metadata Customer’s name All customer details All details except name Customer_detail ISQS 6339, Data Management & Business Intelligence

  29. Input Format Problem EBCDIC ASCII “123-73” 12373 ACME Co. áøåëéí äáàéí Beer (Pack of 8) ISQS 6339, Data Management & Business Intelligence

  30. Referential Integrity Problem • Solution: • SQL anti-join • Server constraints • Dedicated tools ISQS 6339, Data Management & Business Intelligence

  31. Name and Address Problem • Single-field format • Multiple-field format Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565 ISQS 6339, Data Management & Business Intelligence

  32. Quality Data: Importance and Benefits • Quality data: • Key to a successful warehouse implementation • Quality data helps you in: • Targeting right customers • Determining buying patterns • Identifying householders: private and commercial • Matching customers • Identify historical data ISQS 6339, Data Management & Business Intelligence

  33. Data Quality Guidelines • Operational data: • Should not be used directly in the warehouse • Must be cleaned for each increment • Is not simply fixed by modifying applications ISQS 6339, Data Management & Business Intelligence

  34. Transformation Techniques • Merging data • Adding a Date Stamp • Adding Keys to Data ISQS 6339, Data Management & Business Intelligence

  35. Merging Data • Operational transactions do not usually map one-to-one with warehouse data. • Data for the warehouse is merged to provide information for analysis. ISQS 6339, Data Management & Business Intelligence

  36. Merging Data ISQS 6339, Data Management & Business Intelligence

  37. Adding a Date Stamp • Time element can be represented as a: • Single point in time • Time span • Add time element to: • Fact tables • Dimension data ISQS 6339, Data Management & Business Intelligence

  38. Product Table Product_id Time_key Product_desc Store Table Store_id District_id Time_key Sales Fact Table Item_id Store_id Time_key Sales_dollars Sales_units Time Table Week_id Period_id Year_id Time_key Item Table Item_id Dept_id Time_key Adding a Date Stamp:Fact Tables and Dimensions ISQS 6339, Data Management & Business Intelligence

  39. #1 Sale 1/2/98 12:00:01 Ham Pizza $10.00 #2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 #3 Sale 1/2/98 12:00:02 Anchovy Pizza $12.00 #4 Return 1/2/98 12:00:03 Anchovy Pizza - $12.00 #5 Sale 1/2/98 12:00:04 Sausage Pizza $11.00 Data values or artificial keys #dw1 Sale 1/2/98 12:00:01 Ham Pizza $10.00 #dw2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 #dw3 Sale 1/2/98 12:00:04 Sausage Pizza $11.00 Adding Keys to Data ISQS 6339, Data Management & Business Intelligence

  40. Summarizing Data 1. During extraction on staging area 2. After loading to the warehouse server Operational databases Staging area Warehouse database ISQS 6339, Data Management & Business Intelligence

  41. Sources Stage Rules Publish Extract Transform Load Query Maintaining Transformation Metadata • Transformation metadata contains: • Transformation rules • Algorithms and routines ISQS 6339, Data Management & Business Intelligence

  42. Maintaining Transformation Metadata • Restructure keys • Identify and resolve coding differences • Validate data from multiple sources • Handle exception rules • Identify and resolve format differences • Fix referential integrity inconsistencies • Identify summary data ISQS 6339, Data Management & Business Intelligence

  43. Transformation Timing and Location • Transformation is performed: • Before load • In parallel • Can be initiated at different points: • On the operational platform • In a separate staging area ISQS 6339, Data Management & Business Intelligence

  44. Monitoring and Tracking • Transformations should: • Be self-documenting • Provide summary statistics • Handle process exceptions ISQS 6339, Data Management & Business Intelligence

  45. Loading Data into the Warehouse • Loading moves the data into the warehouse • Loading can be time-consuming: • Consider the load window • Schedule and automate the loading • Initial load moves large volumes of data • Subsequent refresh moves smaller volumes of data Transform Extract Transport,Load Operational databases Staging area Warehouse database ISQS 6339, Data Management & Business Intelligence

  46. Initial Load and Refresh • Initial Load: • Single event that populates the database with historical data • Involves large volumes of data • Employs distinct ETL tasks • Involves large amounts of processing after load • Refresh: • Performed according to a business cycle • Less data to load than first-time load • Less-complex ETL tasks • Smaller amounts of post-load processing ISQS 6339, Data Management & Business Intelligence

  47. Operational databases T1 T2 T3 Data Refresh Models: Extract Processing Environment • After each time interval, build a new snapshot of the database. • Purge old snap shots. ISQS 6339, Data Management & Business Intelligence

  48. Operational databases T1 T2 T3 Data Refresh Models: Warehouse Processing Environment • Build a new database. • After each time interval, add changes to database. • Archive or purge oldest data. ISQS 6339, Data Management & Business Intelligence

  49. Building the Loading Process • Techniques and tools • File transfer methods • The load window • Time window for other tasks • First-time and refresh volumes • Frequency of the refresh cycle • Connectivity bandwidth ISQS 6339, Data Management & Business Intelligence

  50. Building the Loading Process • Test the proposed technique • Document proposed load • Monitor, review, and revise ISQS 6339, Data Management & Business Intelligence

More Related