Design and Maintenance of Data Warehouses - PowerPoint PPT Presentation

design and maintenance of data warehouses n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Design and Maintenance of Data Warehouses PowerPoint Presentation
Download Presentation
Design and Maintenance of Data Warehouses

play fullscreen
1 / 120
Design and Maintenance of Data Warehouses
98 Views
Download Presentation
yukio
Download Presentation

Design and Maintenance of Data Warehouses

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Design and Maintenance of Data Warehouses Timos Sellis National Technical University of Athens KDBS Laboratoryhttp://www.dbnet.ece.ntua.gr/ Many thanks to P. Vassiliadis and A. Tsois

  2. Outline • What’s and Why’s for DW’s • DW architecture • DW Schema • Back End of the DW • Front End of the DW • DW Servers • Metadata Repository • Conclusions

  3. OLTP • On-line transaction processing (OLTP) is the traditional way of using a database • Legacy systems: relational, hierarchical, network databases / COBOL applications / … • Short transactions (read/update few records) with ACID properties • Normally, only the last version of data stored in the database

  4. DSS & OLAP • Decision support systems - help the executive, manager, analyst make faster and better decisions. • What where the sales volumes by region and product category for the last year? • Will a 10% discount increase sales volumes sufficiently? • On-line analytical processing (OLAP) is an element of decision support systems (DSS)

  5. OLTP vs. OLAP Chaudhuri & Dayal @VLDB’96

  6. Data Warehouse • A decision support database that is maintained separately from the organization’s operational database. • S. Chaudhuri, U. Dayal, VLDB’96 tutorial • A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making. • W.H. Inmon, Building the Data Warehouse, 1992

  7. Reasons for Building Data Warehouses • Semantic Reconciliation • Dispread data sources within the same organization • Different encoding of the same entities • DW encompasses the full volume of these data under a single, reconciled schema • Keeps the history of these data, too

  8. Reasons for Building Data Warehouses • Performance • OLAP applications need different organization of data • Complex OLAP queries would degrade OLTP performance • Availability • Separation increases availability • Possibly the only way to query the dispread data sources

  9. Reasons for Building Data Warehouses • Data Quality • The validity of source data is not guaranteed (data can be missing, inconsistent, out of date, violating business and database rules…) • Errors in data reach a minimum 10% in most data stores • Can lead to wasting of resources of 25-40% • DW acts as a data cleaning buffer …. and the market is there!

  10. The Market Estimated sales in millions of dollars [ShTy98] (*estimates are from [Pend00]).

  11. Client Client Query & Analysis Metadata Warehouse Integration Source Source Source Data Warehouse Architecture A Simple View

  12. Quality Issues Quality Issues Data Warehouse Architecture Reporting / OLAP tools Metadata Repository Data Marts DW DSA Sources End User Administrator Administrator Designer

  13. Two / Three Tier Architecture • Warehouse database server • almost always relational (RDBMS) • Data Marts / OLAP server • Relational OLAP (ROLAP) • Multidimensional OLAP (MOLAP) • Clients • Query and reporting tools • Analysis tools / Data mining tools

  14. Data Warehouse Architecture • Enterprise warehouse: collects all information about subjects • requires extensive business modeling • may take years to design and build • Data Marts: Departmental subsets that focus on selected subjects • Virtual warehouse: views over operational dbs

  15. How to build the DW • Top – down • Single integrated enterprise model • Reduce all sources (and clients, if necessary) to the central model • Time consuming; labor intensive; slow to produce results • Enhances the risk of the DW project due to late delivery of results • Provides a consistent, global view of the enterprise data

  16. How to build the DW • Bottom – up • Build smaller data marts first • Progressively combine pairwise • Fails to provide a global view of the enterprise data • Possibly enhances the risk since a complete integration might prove impossible late in the project • Early delivery of results • Less time consuming, less labor intensive

  17. Quality Issues Quality Issues Data Warehouse Back-End Reporting / OLAP tools Metadata Repository Data Marts DW DSA Sources End User Administrator Administrator Designer

  18. Design: Global-As-View Integration • Preintegration. What schemata to integrate and in which order • Schema Comparison. To determine the correlations among concepts of different schemata and to detect possible naming, semantic, structural, … conflicts • Schema Conforming. Conflict resolution for heterogeneous schemata • Schema Merging and Restructuring. Production of a single conformed schema

  19. Design: Local-As-View Integration • Works the other way around. • Main deliverable is a central conceptual model, produced by interactively examining user needs and existing schemata • All source and client schemata are expressed in terms of the central data warehouse schema and not the other way around.

  20. PKEY, DAY MIN(COST) DW.PARTSUPP S1_PARTSUPP V1 Aggregate1 U PKEY, MONTH AVG(COST) DW.PARTSUPP.DATE, DAY TIME S2_PARTSUPP V2  Aggregate2 Sources DW DW = Materialized Views? Simple View of a DW

  21. DWMaterialized Views ! DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY DS.PS_NEW1 DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY SUPPKEY=1 COST DATE DS.PS1 DIFF1 A2EDate SK1 $2€ Add_SPK1 DS.PS_OLD1 U rejected rejected rejected Log Log Log DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY DS.PS_NEW2 DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY SUPPKEY=2 COST DATE=SYSDATE QTY>0 DS.PS2 NotNULL AddDate Add_SPK2 SK2 CheckQTY DIFF2 DS.PS_OLD2 rejected rejected Log Log DSA PKEY, DAY MIN(COST) DW.PARTSUPP S1_PARTSUPP V1 Aggregate1 FTP1 PKEY, MONTH AVG(COST) DW.PARTSUPP.DATE, DAY TIME S2_PARTSUPP V2  Aggregate2 FTP2 Sources DW

  22. Operational Processes • Data extraction, transform & load • Originally treated as the ‘refreshment’ problem • Requires to transform, clean, integrate data from different sources. • Build/refresh derived data and views • Service queries • Monitor the warehouse

  23. The Refreshment Problem • Propagate updates on source data to the warehouse • Issues: • when to refresh • on every update • periodically • refresh policy set by administrator • how to refresh

  24. Refreshment Techniques • Full extract from base tables • Incremental techniques • detect changes on base tables • snapshots • transaction shipping • active rules • logical correctness • transactional correctness • Currently, in practice we use ETL tools/scripts (see next)…

  25. Data Extraction • Can take snapshot or differentials (new/deleted/updated) of source data • Transfer, encryption, compression are also involved • Time window and source system overhead involved • In general, faced with the requirement of minimal changes to existing configuration of sources

  26. Data Transformation • Schema Reconciliation: conflicts at the schema level (different attributes for the same information) • ValueIdentification&Reconciliation: different (same) id’s for same (different) objects (use surrogate keys)

  27. Data Cleaning • OffendingData: duplicates, integrity/business rules/format violations … • Incompleteness: missing data • Renicing: esp. addresses

  28. Data Loading • This final stage may still require additional preprocessing: • sorting, summarizing, performing computations • Issues: • huge volumes of data to be loaded • small time window • when to build indexes and summary tables • restart after failure with no loss of data integrity

  29. Loading Techniques • Cannot use SQL language interface to update or append data. • record at a time • too slow since it uses random disc I/O • can make rollback segment or log file to burst • Use batch load utility • sort input records on a clustering key • sequential I/O 100 times faster than random I/O • build index at the same time • use parallelism to accelerate load operations

  30. Incremental Loading • Use incremental loads during refresh to reduce data volume (e.g. Redbrick) • insert only updated tuples • incremental load conflicts with queries • break into sequence of shorter transactions • coordinate this sequence of transactions: must ensure consistency between base and derived tables and indices.

  31. Quality Issues Quality Issues Data Warehouse Front-End Reporting / OLAP tools Metadata Repository Data Marts DW DSA Sources End User Administrator Administrator Designer

  32. Front End Tools • Ad hoc query and reporting • Example: MS Excel, ProReports • OLAP: ‘Multidimensional spreadsheet’ • pivot tables, drill down, roll up, slice, dice • Data Mining

  33. Basic ideas for OLAP • Several numeric measures that are analyzed • sales, budget, revenue, inventory • Dimensions • contexts in which a measure appears • Example: store, product, date information associated with a sale. • each context is a dimension and the measure is a point in a multi-dimensional world

  34. Basic ideas for OLAP • Nature of Analysis • aggregation (total sales, percent-to-total) • comparison (budget vs. expense) • ranking (top 10) • access to detailed and aggregate data • complex criteria specification • visualization

  35. Basic ideas for OLAP • Attributes • information associated with a dimension • example: owner of store, county in which the store is located • Attribute Hierarchies • Attributes of a dimension are often related in a a hierarchical way • example: street í city ícountry

  36. Sales volume Region Product Month Multidimensional Data Dimensions: Product, Region, DateHierarchical summarization paths: Country Year Industry Category Region Quarter City Week Product Month Day Office

  37. Operations • Roll up: summarize data • Drill down: go from higher level summary to lower level summary or detailed data • Slice and dice: select and project • Pivot: re-orient cube

  38. Sales volume Products Store1 Store2 Sales volume $5,2$1,9$2,3$1,1 ElectronicsToysClothingCosmetics $5,6$1,4$2,6$1,1 Products Store1 Store2 Q1 $14,1$2,65$6,9$2,6 ElectronicsToysClothingCosmetics $12,8$1,8$7,2$1,6 Year 1996 $8,9$0,75$4,6$1,5 ElectronicsToysClothingCosmetics $7,2$0,4$4,6$0,5 Q2 Roll up

  39. Sales volume Sales volume Electronics Store1 Store2 Products Store1 Store2 $1,4$0,6$2,0$1,2 VCRCamcorderTVCD player $1,4$0,6$2,4$1,2 $5,2$1,9$2,3$1,1 ElectronicsToysClothingCosmetics $5,6$1,4$2,6$1,1 Q1 Q1 $2,4$3,3$2,2$1,0 VCRCamcorderTVCD player $2,4$1,3$2,5$1,0 $8,9$0,75$4,6$1,5 ElectronicsToysClothingCosmetics $7,2$0,4$4,6$0,5 Q2 Q2 Drill down

  40. Sales volume Sales volume Products Q1 Q2 Products Store1 Store2 $5,2$1,9$2,3$1,1 ElectronicsToysClothingCosmetics $8,9$0,75$4,6$1,5 $5,2$1,9$2,3$1,1 ElectronicsToysClothingCosmetics $5,6$1,4$2,6$1,1 Store 1 Q1 $5,6$1,4$2,6$1,1 ElectronicsToysClothingCosmetics $7,2$0,4$4,6$0,5 $8,9$0,75$4,6$1,5 ElectronicsToysClothingCosmetics $7,2$0,4$4,6$0,5 Store 2 Q2 Pivot

  41. Sales volume Sales volume Products Store1 Store2 Products Store1 $5,2$1,9$2,3$1,1 ElectronicsToysClothingCosmetics $5,6$1,4$2,6$1,1 $5,2$1,9 ElectronicsToys Q1 Q1 $8,9$0,75$4,6$1,5 ElectronicsToysClothingCosmetics $8,9$0,75 ElectronicsToys $7,2$0,4$4,6$0,5 Q2 Q2 Slice and Dice

  42. Quality Issues Quality Issues Data Warehouse Server Reporting / OLAP tools Metadata Repository Data Marts DW DSA Sources End User Administrator Administrator Designer

  43. Data Warehouse Servers - Outline • Server Technology: ROLAP & MOLAP • Indexing Techniques • Query Processing and Optimization

  44. Database Servers • Relational and Specialized Relational DBMS • Relational OLAP (ROLAP) DBMS • Multidimensional OLAP (MOLAP) DBMS

  45. Relational DBMS • Features that support DSS • Specialized Indexing techniques • Specialized Join and Scan Methods • Data Partitioning and use of Parallelism • Complex Query Processing • Intelligent Processing of Aggregates • Extensions to SQL and their processing

  46. ROLAP Servers • Exploits services of a relational engine effectively • Key functionality • needs aggregation navigation logic • ability to generate multi statement SQL • optimize for each individual database backend • Additional services • cost-based query governor • design tool for DSS schema • performance analysis tool

  47. Database Schemata for DW & ROLAP • Star Schema • Snowflake Schema • Fact Constellation • Aggregated data

  48. Star Schema • A star schema consists of one central fact table and several denormalized dimension tables. • The measures of interest for OLAP are stored in the fact table (e.g. Dollar Amount, Units in the table SALES). • For each dimension of the multidimensional model there exists a dimension table (e.g. Geography, Product, Time, Account) with all the levels of aggregation and the extra properties of these levels.

  49. Star Schema Stanford Technology Group, Inc., 1996 Time Geography Time Code Geography Code Quarter Code Region Code Quarter Name Region Manager SALES Month Code State Code Geography Code Month Name City Code Time Code Date ..... Account Code Product Code Dollar Amount Account Product Units Account Code Product Code KeyAccount Code Product Name KeyAccountName Brand Code Account Name Brand Name Account Type Prod. Line Code Account Market Prod. Line Name

  50. Snowflake Schema • The normalized version of the star schema • Explicit treatment of dimension hierarchies (each level has its own table) • Easier to maintain, slower in query answering