1 / 79

An Introduction to Data Warehousing

An Introduction to Data Warehousing. Anand Deshpande Persistent Systems Pvt. Ltd. http://www.persistent.co.in. Optimizing the Warehouse for Decision Support. Data -- Heart of the Data Warehouse. Heart of the data warehouse is the data itself! Single version of the truth Corporate memory

sirius
Télécharger la présentation

An Introduction to Data Warehousing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. An Introduction to Data Warehousing Anand Deshpande Persistent Systems Pvt. Ltd. http://www.persistent.co.in

  2. Optimizing the Warehouse for Decision Support

  3. Data -- Heart of the Data Warehouse • Heart of the data warehouse is the data itself! • Single version of the truth • Corporate memory • Data is organized in a way that represents business -- subject orientation

  4. Consider a Retail Sales Example • A retail chain sells products in retail stores. • We want to track the effect of promotions on retail sales. • Product Table (60,000 SKUs) • Store Table (500 stores) • Promotion (3000 promotions per week) • Sales (2 Billion)

  5. What is the total dollar sales and the total dollar costs of all candy sold in supermarket stores on Saturdays? SELECT p.category, sum(f.dollar_sales), sum(f.dollar.cost)FROM sales_fact f, product p, time t, store s WHERE f.product_key = p.product_key and f.time_key = t.time_keyand f.store_key = s.store_key and p.category = ‘Candy’ and t.day_of_week = ‘Saturday’and s.floor_plan_type = ‘Super_Market”GROUP BY p.category

  6. Time Dimension time_keySQL_dateday_of_weekmonthfiscal_periodseason Product Dimension product_keySKUdescriptionbrandcategorydepartmentpackage_typesize… etc. Sales FactTable time_keyproduct_keystore_keypromotion_keydollars_soldunits_solddollars_cost Store Dimension store_keystore_IDstore_nameaddressregiondivisionfloor_plan_type Promotion Dimension promotion_keypromotion_nameprice_typead_typedisplay_type… etc.

  7. SELECT NON-AGGREGATE FIELDNAME1, NON-AGGREGATE FIELDNAME2, SUM(AGGREGATE FIELDNAME3) SUM(AGGREGATE FIELDNAME4) SUM(AGGREGATE FIELDNAME5) FROM DIMENSION TABLE1 DIMENSION TABLE2 DIMENSION TABLE3 DIMENSION TABLE4 FACT TABLE WHERE JOINCONDITION1 JOINCONDITION2 AND JOINCONDITION3 AND JOINCONDITION4 AND DIMENSIONCONSTRAINT1 AND DIMENSIONCONSTRAINT2 AND DIMENSIONCONSTRAINT3 AND DIMENSIONCONSTRAINT4 AND DIMENSIONCONSTRAINT5 AND DIMENSIONCONSTRAINT6 GROUP BY NON-AGGREGATE FIELDNAME1 NON-AGGREGATE FIELDNAME2 ORDER BY NON-AGGREGATE FIELDNAME1 NON-AGGREGATE FIELDNAME2

  8. A Real Data Warehouse Query

  9. Schema Design • Database organization • must look like business • must be recognizable by business user • approachable by business user • Must be simple • Schema Types • Star Schema • Fact Constellation Schema • Snowflake schema

  10. Dimension Tables • Dimension tables • Define business in terms already familiar to users • Wide rows with lots of descriptive text • Small tables (about a million rows) • Joined to fact table by a foreign key • heavily indexed • typical dimensions • time periods, geographic region (markets, cities), products, customers, salesperson, etc.

  11. Fact Table • Central table • mostly raw numeric items • narrow rows, a few columns at most • large number of rows (millions to a billion) • Access via dimensions

  12. Star Schema • A single fact table and for each dimension one dimension table • Does not capture hierarchies directly p r o d T i m e date, custno, prodno, cityname, ... f a c t c u s t c i t y

  13. p r o d T i m e date, custno, prodno, cityname, ... f a c t c u s t r e g i o n c i t y Snowflake schema • Represent dimensional hierarchy directly by normalizing tables. • Easy to maintain and saves storage

  14. Promotion Hotels Booking Checkout Travel Agents Room Type Customer Fact Constellation • Fact Constellation • Multiple fact tables that share many dimension tables • Booking and Checkout may share many dimension tables in the hotel industry

  15. Data Warehouse Structure • Subject Orientation -- customer, product, policy, account etc... A subject may be implemented as a set of related tables. E.g., customer may be five tables

  16. Data Granularity in Warehouse • Summarized data stored • reduce storage costs • reduce cpu usage • increases performance since smaller number of records to be processed • design around traditional high level reporting needs • tradeoff with volume of data to be stored and detailed usage of data

  17. Granularity in Warehouse • Can not answer some questions with summarized data • Did Anand call Seshadri last month? Not possible to answer if total duration of calls by Anand over a month is only maintained and individual call details are not. • Detailed data too voluminous

  18. Granularity in Warehouse • Tradeoff is to have dual level of granularity • Store summary data on disks • 95% of DSS processing done against this data • Store detail on tapes • 5% of DSS processing against this data

  19. Vertical Partitioning Frequently accessed Rarely accessed Product_key SKU description brand category department package_type size… etc. . . . Product_key brand category department . . . Product_key SKU Smaller table and so less I/O

  20. Derived Data • Introduction of derived (calculated data) may often help • Have seen this in the context of dual levels of granularity • Can keep auxiliary views and indexes to speed up query processing

  21. Denormalization • Normalization in a data warehouse may lead to lots of small tables • Can lead to excessive I/O’s since many tables have to be accessed • Denormalization is the answer especially since updates are rare

  22. Creating Arrays • Many time each occurrence of a sequence of data is in a different physical location • Beneficial to collect all occurrences together and store as an array in a single row • Makes sense only if there are a stable number of occurrences which are accessed together • In a data warehouse, such situations arise naturally due to time based orientation • can create an array by month

  23. Selective Redundancy • Description of an item can be stored redundantly with order table -- most often item description is also accessed with order table • Updates have to be careful

  24. Partitioning • Breaking data into several physical units that can be handled separately • Not a question of whether to do it in data warehouses but how to do it • Granularity and partitioning are key to effective implementation of a warehouse

  25. Why Partitioning? • Flexibility in managing data • Smaller physical units allow • easy restructuring • free indexing • sequential scans if needed • easy reorganization • easy recovery • easy monitoring

  26. Criterion for Partitioning • Typically partitioned by • date • line of business • geography • organizational unit • any combination of above

  27. Where to Partition? • Application level or DBMS level • Makes sense to partition at application level • Allows different definition for each year • Important since warehouse spans many years and as business evolves definition changes • Allows data to be moved between processing complexes easily

  28. Where to Partition? • Application level or DBMS level • Makes sense to partition at application level • Allows different definition for each year • Important since warehouse spans many years and as business evolves definition changes • Allows data to be moved between processing complexes easily

  29. Indexing Techniques • Bitmap index: • A collection of bitmaps -- one for each distinct value of the column • Each bitmap has N bits where N is the number of rows in the table • A bit corresponding to a value v for a row r is set if and only if r has the value for the indexed attribute

  30. 0 1 0 M Y 1 1 F Y 1 F N 1 0 0 0 0 0 M N 1 1 F Y 1 0 F N 1 0 Bitmap Index gender vote result gender (f) vote (y) Query : select * from customer where gender = ‘F’ and vote = ‘Y’ Customer

  31. Join Indexes • Pre-computed joins • A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute • e.g., a join index on city dimension of calls fact table • correlates for each city the calls (in the calls table) that originated from that city

  32. Join Indexes • Join indexes can also span multiple dimension tables • e.g., a join index on city and time dimension of calls fact table

  33. Calls C+T Time C+T+L Loca- tion C+T+L +P Plan Star Join Processing • Use join indexes to join dimension and fact table

  34. Time Loca- tion Calls Plan Optimized Star Join Processing Apply Selections Virtual Cross Product of T, L and P

  35. Bitmaps 1 0 1 Time Calls Loca- tion 0 0 1 Calls AND Plan Calls 1 1 0 Bitmapped Join Processing

  36. Intelligent Scan • Piggyback multiple scans of a relation (Redbrick) • piggybacking also done if second scan starts a little while after the first scan

  37. Parallel Query Processing • Three forms of parallelism • Independent • Pipelined • Partitioned and “partition and replicate” • Deterrents to parallelism • startup • communication

  38. Parallel Query Processing • Partitioned Data • Parallel scans • Yields I/O parallelism • Parallel algorithms for relational operators • Joins, Aggregates, Sort • Parallel Utilities • Load, Archive, Update, Parse, Checkpoint, Recovery • Parallel Query Optimization

  39. Pre-computed Aggregates • Keep aggregated data for efficiency (pre-computed queries) • Questions • Which aggregates to compute? • How to update aggregates? • How to use pre-computed aggregates in queries?

  40. Pre-computed Aggregates • Aggregated table can be maintained by the • warehouse server • middle tier • client applications • Pre-computed aggregates -- special case of materialized views -- same questions and issues remain

  41. Summary Management Staging File Operational Data Extract Incremental Details Transforms Details and Aggregates Query Rewrite Incremental Load and Refresh MDDB Extraction Analysis Tools

  42. SQL Extensions • Extended family of aggregate functions • rank (top 10 customers) • percentile (top 30% of customers) • median, mode • Object Relational Systems allow addition of new aggregate functions

  43. SQL Extensions • Reporting features • running total, cumulative totals • Cube operator • group by on all subsets of a set of attributes (month,city) • redundant scan and sorting of data can be avoided

  44. Server Scalability • Scalability is the #1 IT requirement for Data Warehousing • Hardware Platform options • SMP • Clusters (shared disk) • MPP • Loosely coupled (shared nothing) • Hybrid

  45. SMP Characteristics • SMP -- Symmetric multi processing -- shared everything • Multiple CPUs share same memory • Workload is balanced across CPUs by OS • Scalability is limited to bandwidth of internal bus and OS architecture • Not tolerant to failure in processing node • Architecture is mostly invisible to applications

  46. SMP Benefits • Lower entry point -- can start with SMP • Mature technology

  47. MPP Characteristics • Each node owns a portion of the database • Nodes are connected via an interconnection network • Each node can be a single CPU or SMP • Load balancing done by application • High scalability due to local processing isolation

  48. MPP benefits • High availability • High scalability

  49. Viewing the Data with OLAP Making Decision Support Possible

  50. Limitations of SQL “A Freshman in Business needs a Ph.D. in SQL” -- Ralph Kimball

More Related