1 / 27

Data Warehouse

Data Warehouse. Chapter 11. Multiple Files Problem. Added complexity of multiple source files Start simple. Logic to detect Correct source. Multiple Source files. Extracted data. Transforming Data from Multiple files. File. File. File. File. File. File. File. File. File.

kairos
Télécharger la présentation

Data Warehouse

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Warehouse Chapter 11

  2. Multiple Files Problem • Added complexity of multiple source files • Start simple Logic to detect Correct source Multiple Source files Extracted data

  3. Transforming Data from Multiple files File File File File File File File File File

  4. Missing Values Problem Solution • Ignore • Wait • Mark rows • Extract when time-stamped If NULL then Field=‘A’

  5. Duplicate Value Problem Solution • SQL self-join techniques • RDMBS constrains utilities SELECT… FROM table_a, table_b WHERE table_a.key(+)=table_b.key UNION SELECT… FROM table_a, table_b WHERE table_a.key=table_b.key(+) ACME Inc ACME Inc ACME Inc ACME Inc ACME Inc

  6. Element Names Problem Solution • CTAS • SQL*Loader Customer Customer Client Contact Name

  7. Element Meaning Problem • Avoid misinterpretation • Complex solution • Document meaning in metadata All customer details All details Except name Customer’s name

  8. Input Format Problem EBCDIC ASCII 12373 “123-73”

  9. Referential Integrity Problem Solution • SQL anti-join • Server constraints • Dedicated tools • Emp Name Department • Smith 10 • Jones 20 • Doe 50 • 6787 Harris 60 Department 10 20 30 40

  10. Name and Address Problem • No unique key • Missing values • Personal and commercial names mixed • Different addresses for same member • Different names and spelling for same number • Many names on one line • One name on two lines

  11. Name and Address Problem • Single-field format • Multiple-field format Mr.J.Smith, 100 Main St., Bigtown, County Luth, 23565

  12. Clean and Organize • Create atomic values. • Standardize formats. • Verify data accuracy. • Match with other records. • Identify private and commercial addresses and inhabitants. • Document in metadata. Requires sophisticated tools and techniques

  13. Merging Data • Operational transactions do not usually map one-to-one with warehouse data • Data for the warehouse is merged to provide information for analysis Pizza sales/return by day, hour, seconds Sale 1/2/98 12:00:01 Ham Pizza $10.00 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 Sale 1/2/98 12:00:02 Anchovy Pizza $12.00 Return 1/2/98 12:00:03 Ham Pizza -$12.00 Sale 1/2/98 12:00:04 Sausage Pizza $11.00

  14. Merging Data Sale 1/2/98 12:00:01 Ham Pizza $10.00 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 Sale 1/2/98 12:00:02 Anchovy Pizza $12.00 Return 1/2/98 12:00:03 Ham Pizza -$12.00 Sale 1/2/98 12:00:04 Sausage Pizza $11.00 Sale 1/2/98 12:00:01 Ham Pizza $10.00 Sale 1/2/98 12:00:02 Cheese Pizza $10.00 Sale 1/2/98 12:00:04 Sausage Pizza $11.00

  15. Adding a Date Stamp • Enables time analysis • Label loaded data with a date stamp • Add time to fact and dimension data

  16. Adding a Date Stamp Product Table Product_id Time_key Product_desc Store Table Store_id District_id Time_key Sales Fact Table Item_id Store_id Time_key Sales_dollars Sales_units Time Table Week_id Period_id Year_id Time_key Item_Table Item_id Dept_id Time_key

  17. Adding a Date Stamp • Fact table - Add triggers - Recode applications - Compare tables • Dimension table • Time representation - Point in time - Time span

  18. Adding Keys to Data #1 Sale 1/2/98 12:00:01 Ham Pizza $10.00 #2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 #3 Sale 1/2/98 12:00:02 Anchovy Pizza $12.00 #4 Sale 1/2/98 12:00:03 Ham Pizza -$12.00 #5 Sale 1/2/98 12:00:04 Sausage Pizza $11.00 Data values or artificial keys #dw1 Sale 1/2/98 12:00:01 Ham Pizza $10.00 #dw2 Sale 1/2/98 12:00:02 Cheese Pizza $10.00 #dw3 Sale 1/2/98 12:00:04 Sausage Pizza $11.00

  19. Summarizing Data • During extraction on staging area • After loading onto the warehouse server Warehouse database Operational databases Staging area

  20. Maintaining Transformation Metadata Contains transformation rules, algorithms, and routines Sources Stages Rules Publish Extract Transform Load Query

  21. Transformation Timing and Location • Transformation is performed: - Before load - In parallel • May be initiated at different points Unlikely Probable Possible

  22. Choosing a Transformation Point * Workload * Network bandwidth * Environment * Parallel execution * CPU use * Load window time * Disk space * User information needs

  23. Monitoring and Tracking Transformations should: • Be self-documenting • Provides summary statistics • Handle process exceptions

  24. Designing Transformation Processes • Analysis: - Sources and target mappings, business rules - Key users, metadata, grain • Design options: PL/SQL, replication, custom, third-party tools • Design issues: - Performance - Size of the staging area - Exception handling, integrity maintenance

  25. Transformation Tools • Purchased • SQL*Loader • In-house developed

  26. Data Management, Quality, and Auditing Tools • Data management: - Innovative Systems - Postalsoft - Vality Technology • Data quality and auditing: - Innovative Systems - Vality Technology

  27. Summary This lesson discussed the following topics: • Importance of data quality • Transformation processes • Data transformation issuess • Data anomalies • Name and address management • Tools

More Related