
Data Warehouse Design



Presentation Transcript


  1. Data Warehouse Design Enrico Franconi CS 636

  2. Implementing a Warehouse • Monitoring: Sending data from sources • Integrating: Loading, cleansing,... • Processing: Query processing, indexing, ... • Managing: Metadata, Design, ... CS 336

  3. Monitoring • Source types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … • How to get data out? Replication tool; dump file; create report; ODBC or third-party “wrappers”

  4. Monitoring Techniques • Periodic snapshots • Database triggers • Log shipping • Data shipping (replication service) • Transaction shipping • Polling (queries to source) • Screen scraping • Application-level monitoring
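
As an illustration of the "periodic snapshots" technique, the sketch below compares two dumps of a source table and reports what changed. It assumes each snapshot fits in memory as a dictionary keyed by primary key; the function name `diff_snapshots` and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple  # a source tuple, e.g. (clerk, age)


def diff_snapshots(old: Dict[str, Row], new: Dict[str, Row]) -> Dict[str, List]:
    """Compare two snapshots keyed by primary key and report the changes."""
    inserted = [(k, new[k]) for k in new if k not in old]
    deleted = [(k, old[k]) for k in old if k not in new]
    updated = [(k, old[k], new[k]) for k in new if k in old and old[k] != new[k]]
    return {"inserted": inserted, "deleted": deleted, "updated": updated}


# Hypothetical snapshots of Emp(clerk, age) taken on two consecutive days.
yesterday = {"mary": ("Mary", 25), "joe": ("Joe", 41)}
today = {"mary": ("Mary", 26), "ann": ("Ann", 33)}
changes = diff_snapshots(yesterday, today)
```

Only the three change lists would be shipped to the integrator, which is what makes snapshot comparison usable for incremental loading even when the source offers no triggers or logs.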

  5. Monitoring Issues • Frequency: periodic (daily, weekly, …) or triggered (on a “big” change, on lots of changes, ...) • Data transformation: convert data to a uniform format; remove & add fields (e.g., add a date to get history) • Standards (e.g., ODBC) • Gateways

  6. Wrapper • Converts data and queries from one data model to another • Extends query capabilities for sources with limited capabilities [Diagram labels: Data Model A, Data Model B, Wrapper, Source, Queries, Data]
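
Both roles of a wrapper can be sketched in a few lines. The example assumes a hypothetical flat-file source that can only dump all of its lines; the wrapper converts each line into a record (a different data model) and adds a selection operation the source itself lacks. The class names `FlatFileSource` and `Wrapper` are illustrative, not part of any real tool.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


class FlatFileSource:
    """Hypothetical source with limited capabilities: it can only dump every line."""

    def __init__(self, lines: List[str]) -> None:
        self._lines = lines

    def scan(self) -> Iterable[str]:
        return iter(self._lines)


class Wrapper:
    """Converts comma-separated lines into records and adds selection on top."""

    def __init__(self, source: FlatFileSource, fields: List[str]) -> None:
        self._source = source
        self._fields = fields

    def select(self, predicate: Callable[[Record], bool]) -> List[Record]:
        # Data conversion: each raw line becomes a field-name -> value record.
        records = (dict(zip(self._fields, line.split(","))) for line in self._source.scan())
        # Extended capability: filtering happens in the wrapper, not the source.
        return [r for r in records if predicate(r)]


source = FlatFileSource(["Computer,Mary", "Printer,Joe"])
wrapper = Wrapper(source, ["item", "clerk"])
marys_sales = wrapper.select(lambda r: r["clerk"] == "Mary")
```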

  7. Wrapper Generation • Solution 1: Hard code for each source • Solution 2: Automatic wrapper generation [Diagram: Definition → Wrapper Generator → Wrapper]

  8. Integration • Data Cleaning • Data Loading • Derived Data [Architecture diagram labels: Clients, Query & Analysis, Metadata, Warehouse, Integration, Sources]

  9. Data Integration • Receive data (changes) from multiple wrappers/monitors and integrate into the warehouse • Rule-based • Actions: resolve inconsistencies; eliminate duplicates; integrate into the warehouse (which may not be empty); summarize data; fetch more data from sources (warehouse updates); etc.

  10. Data Cleaning • Find (& remove) duplicate tuples (e.g., Jane Doe vs. Jane Q. Doe) • Detect inconsistent, wrong data (attribute values that don’t match) • Patch missing, unreadable data (insert default values) • Notify sources of errors found
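
A minimal sketch of duplicate detection for the "Jane Doe vs. Jane Q. Doe" case follows. It assumes a deliberately simple matching rule (lowercase, strip punctuation, drop single-letter initials); real cleaning tools use richer, domain-specific rules, and the function names here are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List


def name_key(name: str) -> str:
    """Normalize a name: lowercase, strip punctuation, drop single-letter initials."""
    tokens = re.sub(r"[^a-z\s]", "", name.lower()).split()
    return " ".join(t for t in tokens if len(t) > 1)


def group_duplicates(names: List[str]) -> Dict[str, List[str]]:
    """Group names sharing a normalized key; groups larger than one are candidates."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return {k: v for k, v in groups.items() if len(v) > 1}


candidates = group_duplicates(["Jane Doe", "Jane Q. Doe", "John Smith"])
```

The result is a set of candidate duplicates, not a verdict: two different people can share a normalized key, so a rule like this would normally feed a review or a stricter second comparison.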

  11. Data Cleaning • Migration (e.g., yen to dollars) • Scrubbing: use domain-specific knowledge (e.g., social security numbers) • Fusion (e.g., mail list, customer merging) [Diagram: customer1(Joe) from the billing DB and customer2(Joe) from the service DB are fused into merged_customer(Joe)]

  12. Loading Data in the Warehouse • Incremental vs. refresh • Off-line vs. on-line • Frequency of loading: at night, once a week/month, continuously • Parallel/partitioned load

  13. Warehouse Maintenance • Warehouse data ≈ materialized view: initial loading; view maintenance • Derived warehouse data: indexes; aggregates; materialized views • View maintenance

  14. Materialized Views • Define new warehouse relations using SQL expressions [Figure annotation: the derived relation does not exist at any source]

  15. Differs from Conventional View Maintenance... • Warehouses may be highly aggregated and summarized • Warehouse views may be over history of base data • Process large batch updates • Schema may evolve

  16. Differs from Conventional View Maintenance... • Base data doesn’t participate in view maintenance: it simply reports changes; it is loosely coupled; there is no locking and there are no global transactions; it may not be queryable

  17. Warehouse Maintenance Anomalies • Materialized view maintenance in a loosely coupled, non-transactional environment • Simple example: the source (Sales Comp.) holds Sale(item,clerk) and Emp(clerk,age); an integrator maintains the data warehouse view Sold(item,clerk,age), where Sold = Sale ⋈ Emp

  18. Warehouse Maintenance Anomalies • Setting: source (Sales Comp.) with Sale(item,clerk) and Emp(clerk,age); integrator; data warehouse view Sold(item,clerk,age) • 1. Insert (Mary,25) into Emp, notify integrator • 2. Insert (Computer,Mary) into Sale, notify integrator • 3. (1) → integrator adds Sale ⋈ (Mary,25) • 4. (2) → integrator adds (Computer,Mary) ⋈ Emp • 5. View incorrect (duplicate tuple)
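
The five steps can be replayed in a small simulation. It assumes the integrator answers each notification by querying the current state of the source, after both inserts have already been applied; the warehouse view is kept as a list (a bag) so the duplicate is visible. Variable names are illustrative.

```python
from typing import List, Tuple

# Source relations (the "Sales Comp." database in the slide).
sale: List[Tuple[str, str]] = []   # Sale(item, clerk)
emp: List[Tuple[str, int]] = []    # Emp(clerk, age)

# Warehouse view Sold(item, clerk, age), kept as a bag so duplicates show up.
sold: List[Tuple[str, str, int]] = []

# Steps 1 and 2: both source updates happen before the integrator reacts.
emp.append(("Mary", 25))
sale.append(("Computer", "Mary"))
notifications = [("Emp", ("Mary", 25)), ("Sale", ("Computer", "Mary"))]

# Steps 3 and 4: each notification is handled by joining the new tuple with
# the *current* source state, which already contains both inserts.
for relation, row in notifications:
    if relation == "Emp":
        clerk, age = row
        sold += [(item, c, age) for (item, c) in sale if c == clerk]
    else:
        item, clerk = row
        sold += [(item, clerk, age) for (c, age) in emp if c == clerk]

# Recomputing the join from scratch gives the correct view for comparison.
correct = [(item, c, age) for (item, c) in sale for (c2, age) in emp if c == c2]
```

After the loop, `sold` holds the tuple (Computer, Mary, 25) twice while `correct` holds it once: each notification was individually handled correctly, but the source state had moved on between the update and the query.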

  19. Maintenance Anomaly - Solutions • Incremental update algorithms (ECA, Strobe, etc.) • Research issues: self-maintainable views: which views are self-maintainable; store auxiliary views so that original + auxiliary views are self-maintainable

  20. Self-Maintainability: Examples • Sold(item,clerk,age) = Sale(item,clerk) ⋈ Emp(clerk,age) • Inserts into Emp: if Emp.clerk is a key and Sale.clerk is a foreign key (with referential integrity), then no effect • Inserts into Sale: maintain auxiliary view Emp − π_{clerk,age}(Sold) • Deletes from Emp: delete from Sold based on clerk

  21. Self-Maintainability: Examples • Deletes from Sale: delete from Sold based on {item,clerk}, unless age at time of sale is relevant • Auxiliary views for self-maintainability: must themselves be self-maintainable; one solution is all source data, but we want a minimal set

  22. Partial Self-Maintainability • Avoid (but don’t prohibit) going to sources • Sold = Sale(item,clerk) ⋈ Emp(clerk,age) • Inserts into Sale: check if clerk is already in Sold and go to source if not; or replicate all clerks over age 30; or ...
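
The first option for inserts into Sale can be sketched as follows: look the clerk up in Sold, and fall back to the source only when the clerk has no earlier sale. The function `insert_sale` and the dictionary standing in for a source query on Emp are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

SoldRow = Tuple[str, str, int]  # (item, clerk, age)


def insert_sale(
    sold: List[SoldRow],
    item: str,
    clerk: str,
    fetch_age_from_source: Callable[[str], Optional[int]],
) -> bool:
    """Apply an insert into Sale to the Sold view.

    Returns True if the warehouse had to go back to the source.
    """
    # A clerk's age is recoverable from Sold if the clerk already appears there.
    known_ages: Dict[str, int] = {c: age for (_, c, age) in sold}
    if clerk in known_ages:
        sold.append((item, clerk, known_ages[clerk]))
        return False
    age = fetch_age_from_source(clerk)  # fallback: query the source
    if age is not None:
        sold.append((item, clerk, age))
    return True


# Hypothetical stand-in for a source query against Emp(clerk, age).
source_emp = {"Mary": 25, "Joe": 41}
sold: List[SoldRow] = [("Computer", "Mary", 25)]

went_to_source_1 = insert_sale(sold, "Printer", "Mary", source_emp.get)
went_to_source_2 = insert_sale(sold, "Monitor", "Joe", source_emp.get)
```

Mary's second sale is handled entirely inside the warehouse, while Joe's first sale needs one source lookup. Note that the fallback query is exposed to the same kind of anomaly as any other query sent to a loosely coupled source.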

  23. Warehouse Specification (ideally) [Diagram labels: View Definitions; Integration rules; Change Detection Requirements; Warehouse Configuration Module; Warehouse; Integrator; Metadata; Extractor/Monitor (several)]

  24. Processing • ROLAP servers vs. MOLAP servers • Index Structures • What to Materialize? • Algorithms [Architecture diagram labels: Clients, Query & Analysis, Metadata, Warehouse, Integration, Sources]

  25. ROLAP Server • Relational OLAP Server • Special indices, tuning; schema is “denormalized” [Diagram labels: tools, ROLAP server, utilities, relational DBMS]

  26. MOLAP Server • Multi-Dimensional OLAP Server • Could also sit on a relational DBMS [Diagram: Sales cube with dimensions City (A, B), Product (milk, soda, eggs, soap), and Date (1, 2, 3, 4); labels: M.D. tools, utilities, multi-dimensional server]

  27. Index Structures (sketch) • Traditional access methods: B-trees, hash tables, R-trees, grids, … • Popular in warehouses: inverted lists; bit map indexes; join indexes; text indexes
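
To make the bit map index concrete, the sketch below builds one bitmap per distinct column value and answers a conjunctive query with a bitwise AND. It uses Python integers as bit vectors and a made-up five-row fact table; production systems typically compress the bitmaps, which is not shown.

```python
from typing import Dict, List


def build_bitmap_index(column: List[str]) -> Dict[str, int]:
    """Map each distinct value to an int whose bit i is set when row i has that value."""
    index: Dict[str, int] = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_of(bitmap: int) -> List[int]:
    """Decode a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical fact table columns (position i in each list is the same sale).
city = ["A", "B", "A", "B", "A"]
product = ["milk", "milk", "soda", "eggs", "milk"]

city_idx = build_bitmap_index(city)
product_idx = build_bitmap_index(product)

# city = 'A' AND product = 'milk' becomes a bitwise AND of two bitmaps.
matches = rows_of(city_idx["A"] & product_idx["milk"])
```

Bitmaps suit warehouse columns with few distinct values (city, product category), where combining several predicates reduces to cheap bitwise operations before any fact rows are read.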

  28. What to Materialize? • Store in warehouse results useful for common queries • Example: total sales [Figure residue: per-day tables labeled day 1, day 2, . . .; a cell with value 129; label “materialize”]

  29. Materialization Factors • Type/frequency of queries • Query response time • Storage cost • Update cost

  30. Cube Aggregates Lattice • Lattice nodes: (city, product, date); (city, product), (city, date), (product, date); city, product, date; all • Use greedy algorithm to decide what to materialize [Figure residue: per-day tables labeled day 1, day 2; a cell with value 129]
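
The slide does not say which greedy algorithm is meant. One well-known candidate is benefit-based greedy selection over the lattice, sketched below under these assumptions: the cost of a query on a group-by is the row count of the smallest materialized view that contains its dimensions, the top view is always materialized, and the row counts are invented for the example.

```python
from typing import Dict, FrozenSet, List

View = FrozenSet[str]


def greedy_select(sizes: Dict[View, int], top: View, k: int) -> List[View]:
    """Pick k views (besides the top view) that most reduce total query cost."""
    chosen = [top]

    def cost(w: View) -> int:
        # A query on w is answered from the smallest chosen view containing w.
        return min(sizes[v] for v in chosen if w <= v)

    def benefit(v: View) -> int:
        # Savings summed over every group-by that v could answer.
        return sum(max(0, cost(w) - sizes[v]) for w in sizes if w <= v)

    for _ in range(k):
        candidates = [v for v in sizes if v not in chosen]
        if not candidates:
            break
        chosen.append(max(candidates, key=benefit))
    return chosen[1:]


dims = ("city", "product", "date")
# Hypothetical row counts for each group-by in the lattice; () is "all".
row_counts = {
    ("city", "product", "date"): 1000,
    ("city", "product"): 300,
    ("city", "date"): 800,
    ("product", "date"): 600,
    ("city",): 20,
    ("product",): 50,
    ("date",): 365,
    (): 1,
}
sizes = {frozenset(key): n for key, n in row_counts.items()}
picked = greedy_select(sizes, frozenset(dims), k=2)
```

With these numbers the first pick is (city, product), which saves 700 rows on each of four group-bys, and the second is (product, date). The small single-dimension views lose out once (city, product) already answers most of their queries cheaply.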

  31. Dimension Hierarchies [Diagram: hierarchy with levels all, state, city, from coarsest to finest]

  32. Dimension Hierarchies • Lattice with the state level added: all; product, city, state, date; (city, product), (city, date), (product, date), (state, product), (state, date); (city, product, date), (state, product, date) • Not all arcs shown...
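
A dimension hierarchy lets a coarser aggregate be derived from a finer one instead of from the base data. The sketch below rolls (city, product) totals up to (state, product); the city-to-state mapping and the sales figures are made up, and the approach relies on the aggregate being a sum.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical city -> state mapping (one level of the location hierarchy).
city_to_state = {"San Jose": "CA", "Fresno": "CA", "Reno": "NV"}

# Hypothetical aggregate stored at the finer level: (city, product) -> sales.
sales_by_city_product: Dict[Tuple[str, str], int] = {
    ("San Jose", "milk"): 12,
    ("Fresno", "milk"): 7,
    ("Reno", "milk"): 5,
    ("Reno", "soda"): 3,
}


def roll_up_to_state(fine: Dict[Tuple[str, str], int]) -> Dict[Tuple[str, str], int]:
    """Derive (state, product) totals from (city, product) totals."""
    coarse: Dict[Tuple[str, str], int] = defaultdict(int)
    for (city, product), amount in fine.items():
        coarse[(city_to_state[city], product)] += amount
    return dict(coarse)


sales_by_state_product = roll_up_to_state(sales_by_city_product)
```

This corresponds to the lattice arc from (city, product) to (state, product): if the finer view is materialized, the coarser one can be computed from it without touching the base data.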

  33. Interesting Hierarchy [Diagram: time hierarchy with nodes all, years, quarters, months, weeks, days; labeled as a conceptual dimension table]

  34. Managing • Metadata • Warehouse Design • Tools [Architecture diagram labels: Clients, Query & Analysis, Metadata, Warehouse, Integration, Sources]

  35. Metadata • Administrative: definition of sources, tools, ...; schemas, dimension hierarchies, …; rules for extraction, cleaning, …; refresh, purging policies; user profiles, access control, ...

  36. Metadata • Business: business terms & definitions; data ownership, charging • Operational: data lineage; data currency (e.g., active, archived, purged); use stats, error reports, audit trails

  37. Design Summary • What data is needed? • Where does it come from? • How to clean data? • How to represent in warehouse (schema)? • What to summarize? • What to materialize? • What to index?

  38. Tools • Development: design & edit schemas, views, scripts, rules, queries, reports • Planning & Analysis: what-if scenarios (schema changes, refresh rates), capacity planning • Warehouse Management: performance monitoring, usage patterns, exception reporting • System & Network Management: measure traffic (sources, warehouse, clients) • Workflow Management: “reliable scripts” for cleaning & analyzing data

  39. Current State of Industry • Extraction and integration done off-line: usually in large, time-consuming batches • Everything copied at warehouse: not selective about what is stored; query benefit vs. storage & update cost • Query optimization aimed at OLTP: high throughput instead of fast response; process whole query before displaying anything

  40. State of Commercial Practice ... • Data extract, clean, transform, refresh: CA-Ingres Replicator; ETI-Extract; IBM Data Joiner, Data Propagator; Prism Warehouse Manager; SAS Access; Sybase Replication Server; Trinzic InfoPump • Connectivity to sources: Apertus; Information Builders; Informix Enterprise Gateway; Oracle Open Connect; CA-Ingres gateway; MS ODBC; Platinum InfoHub

  41. … State of Commercial Practice ... • ROLAP Servers: HP Intelligent Warehouse; Informix Metacube; MicroStrategy DSS Server; Information Advantage Axsys • Multidimensional Database Engines: Arbor Essbase; Oracle IRI Express; Comshare Commander; SAS System • Warehouse Data Servers: CA-Ingres; Oracle 8; RedBrick; Sybase IQ; Informix Dynamic Server; IBM DB2

  42. … State of Commercial Practice • Multidimensional Analysis: Kenan Systems Acumate; Microsoft Excel; Arbor Essbase Analysis server; Cognos PowerPlay; IQ Software IQ/Vision; Lotus 123; SAS OLAP++; Business Objects • Query/Reporting Environments: IBM DataGuide; SAS Access; CA Visual Express; Platinum Forest & Trees; Informix ViewPoint • Lots and lots of consulting!!

  43. Future Directions • Better performance • Larger warehouses • Easier to use • What are companies & research labs working on?

  44. Research (1) • Incremental Maintenance • Data Consistency • Data Expiration • Recovery • Data Quality • Error Handling (Back Flush)

  45. Research (2) • Rapid Monitor Construction • Temporal Warehouses • Materialization & Index Selection • Data Fusion • Data Mining • Integration of Text & Relational Data • Conceptual Modelling

  46. Conclusions • Massive amounts of data and complexity of queries will push limits of current warehouses • Need better systems: easier to use; provide quality information
