1 / 47

Data Warehousing

Data Warehousing. Dr. Navneet Goyal Assistant Professor Computer Science Department BITS, Pilani. Background. 1980’s to early 1990’s Focus on computerizing business processes To gain competitive advantage By early 1990’s All companies had operational systems

brody-gates
Télécharger la présentation

Data Warehousing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Warehousing Dr. Navneet Goyal Assistant Professor Computer Science Department BITS, Pilani

  2. Background • 1980’s to early 1990’s • Focus on computerizing business processes • To gain competitive advantage • By early 1990’s • All companies had operational systems • It no longer offered any advantage • How to get competitive advantage??

  3. OLTP Systems: Primary Purpose Run the operations of the business • For example: Banks, Railway reservation etc. • Based on ER Data Modeling • Transaction based system • Data is always current valued • Little history is available • Data is highly volatile • Has “Intelligent keys”

  4. OLTP Systems • Has relational normalized design • Redundant data is undesirable • Consists of many tables • High volume retrieval is inefficient • Optimized for repetitive “narrow” queries • Common data in many applications

  5. Need for Data Warehousing • Companies, over the years, gathered huge volumes of data • “Hidden Treasure” • Can this data be used in any way? • Can we analyze this data to get any competitive advantage? • If yes, what kind of advantage?

  6. Benefits of Data Warehousing • Allows “efficient” analysis of data • Competitive Advantage • Analysis aids strategic decision making • Increased productivity of decision makers • Potential high ROI • Classic example: Diaper and Beer

  7. Decision Support Systems, DW, & OLAP • Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions. • Data Warehouse is a DSS • A data warewhouse is an architectural construct of an information system that provides users with current and historical decision support information that is hard to access or present in traditional operational systems. • Data Warehouse is not an Intelligent system • On-Line Analytical Processing (OLAP) is an element of DSS

  8. DW: Interesting Statistics

  9. Data Warehouse: Characteristics • Analysis driven • Ad-hoc queries • Complex queries • Used by top managers • Based on Dimensional Modeling • Denormalized structures

  10. Data Warehouse:Major Players • SAS institute • IBM • Oracle • Sybase • Microsoft • HP • Cognos • Business Objects

  11. Data Warehouse • A decision support database that is maintained separately from the organization’s operational databases. • A data warehouse is a • subject-oriented, • integrated, • time-varying, • non-volatile collection of data that is used primarily in organizational decision making

  12. Subject Oriented • Data Warehouse is designed around “subjects” rather than processes • A company may have • Retail Sales System • Outlet Sales System • Catalog Sales System • Problems Galore!!! • DW will have a Sales Subject Area

  13. Subject Oriented OLTP Systems Data Warehouse Subject-Oriented Sales Information Retail Sales System Catalog Sales System Outlet Sales System Sales Subject Area

  14. Integrated • Heterogeneous Source Systems • Little or no control • Need to Integrate source data • For Example: Product codes could be different in different systems • Arrive at common code in DW

  15. Non-Volatile(Read-Mostly) Write USER Read Read USER OLTP DW

  16. Time Variant • Most business analysis has a time component • Trend Analysis (historical data is required) Sales 2001 2002 2003 2004

  17. Data Warehousing Architecture Monitoring & Administration OLAP servers Metadata Repository Analysis Query/ Reporting Data Mining Data Warehouse Extract Transform Load Refresh External Sources Serve Operational dbs Data Marts

  18. Populating & Refreshing the Warehouse • Data Extraction • Data Cleaning • Data Transformation • Convert from legacy/host format to warehouse format • Load • Sort, summarize, consolidate, compute views, check integrity, build indexes, partition • Refresh • Bring new data from source systems

  19. ETL Process Issues & Challenges • Consumes 70-80% of project time • Heterogeneous Source Systems • Little or no control over source systems • Source systems scattered • Source systems operating in different time zones • Different currencies • Different measurement units • Data not captured by OLTP systems • Ensuring data quality

  20. Data Staging Area • A storage area where extracted data is • Cleaned • Transformed • Deduplicated • Initial storage for data • Need not be based on Relational model • Spread over a number of machines • Mainly sorting and Sequential processing • COBOL or C code running against flat files • Does not provide data access to users • Analogy – kitchen of a restaurant

  21. Presentation Servers • A target physical machine on which DW data is organized for • Direct querying by end users using OLAP • Report writers • Data Visualization tools • Data mining tools • Data stored in Dimensional framework • Analogy – Sitting area of a restaurant

  22. Data Cleaning • Why? • Data warehouse contains data that is analyzed for business decisions • More data and multiple sources could mean more errors in the data and harder to trace such errors • Results in incorrect analysis • Detecting data anomalies and rectifying them early has huge payoffs • Long Term Solution • Change business practices and data entry tools • Repository for meta-data

  23. Soundex Algorithms • Misspelled terms • For example NAMES • Phonetic algorithms – can find similar sounding names • Based on the six phonetic classifications of human speech sounds

  24. Data Warehouse Design • OLTP Systems are Data Capture Systems • “DATA IN” systems • DW are “DATA OUT” systems OLTP DW

  25. Analyzing the DATA • Active Analysis – User Queries • User-guided data analysis • Show me how X varies with Y • OLAP • Automated Analysis – Data Mining • What’s in there? • Set the computer FREE on your data • Supervised Learning (classification) • Unsupervised Learning (clustering)

  26. OLAP Queries • How much of product P1 was sold in 1999 state wise? • Top 5 selling products in 2002 • Total Sales in Q1 of FY 2002-03? • Color wise sales figure of cars from 2000 to 2003 • Model wise sales of cars for the month of Jan from 2000 to 2003

  27. Data Mining Investigations • Which type of customers are more likely to spend most with us in the coming year? • What additional products are most likely to be sold to customers who buy sportswear? • In which area should we open a new store in the next year? • What are the characteristics of customers most likely to default on their loans before the year is out?

  28. Continuum of Analysis OLTP Specialized Algorithms SQL Data Mining OLAP Primitive & Canned Analysis Complex Ad-hoc Analysis Automated Analysis

  29. Design Requirements • Design of the DW must directly reflect the way the managers look at the business • Should capture the measurements of importance along with parameters by which these parameters are viewed • It must facilitate data analysis, i.e., answering business questions

  30. ER Modeling • A logical design technique that seeks to eliminate data redundancy • Illuminates the microscopic relationships among data elements • Perfect for OLTP systems • Responsible for success of transaction processing in Relational Databases

  31. Problems with ER Model ER models are NOT suitable for DW? • End user cannot understand or remember an ER Model • Many DWs have failed because of overly complex ER designs • Not optimized for complex, ad-hoc queries • Data retrieval becomes difficult due to normalization • Browsing becomes difficult

  32. ER vs Dimensional Modeling • ER models are constituted to • Remove redundant data (normalization) • Facilitate retrieval of individual records having certain critical identifiers • Thereby optimizing OLTP performance • Dimensional model supports the reporting and analytical needs of a data warehouse system.

  33. Dimensional Modeling:Salient Features • Represents data in a standard framework • Framework is easily understandable by end users • Contains same information as ER model • Packages data in symmetric format • Resilient to change • Facilitates data retrieval/analysis

  34. Dimensional Modeling:Vocabulary • Measures or facts • Facts are “numeric” & “additive” • For example; Sale Amount, Sale Units • Factors or dimensions • Star Schemas • Snowflake & Starflake Schemas Sales Amt = f( Product, Location, Time) Fact Dimensions

  35. Star Schema FK FK FK FK Product Dimension Location Dimension Sales Fact Table Time Dimension Promotion Dimension

  36. Dimensional Modeling • Facts are stored in FACT Tables • Dimensions are stored in DIMENSION tables • Dimension tables contains textual descriptors of business • Fact and dimension tables form a Star Schema • “BIG” fact table in center surrounded by “SMALL” dimension tables

  37. The “Classic” Star Schema Fact Table STORE KEY Store Dimension Time Dimension PRODUCT KEY STORE KEY PERIOD KEY PERIOD KEY Store Description City State District ID District Desc. Region_ID Region Desc. Regional Mgr. Dollars_sold Units Dollars_cost Period Desc Year Quarter Month Day Product Dimension PRODUCT KEY Product Desc. Brand Color Size Manufacturer

  38. Fact Tables • Contains numerical measurements of the business • Each measurement is taken at the intersection of all dimensions • Intersection is the composite key • Represents Many-to-many relationships between dimensions • Examples of facts Sale_amt, Units_sold, Cost, Customer_count

  39. Dimension Tables • Contains attributes for dimensions • 50 to 100 attributes common • Best attributes are textual and descriptive • DW is only as good as the dimension attributes • Contains hierarchal information albeit redundantly • Entry points into the fact table

  40. Types of Facts • Fully-additive-all dimensions • Units_sold, Sales_amt • Semi-additive-some dimensions • Account_balance, Customer_count 28/3,tissue paper,store1, 25, 250,20 28/3,paper towel,store1, 35, 350,30 Is no. of customers who bought either tissue paper or paper towel is 50? NO. • Non-additive-none • Gross margin=Gross profit/amount • Note that GP and Amount are fully additive • Ratio of the sums and not sum of the ratios

  41. Data Warehouse:Design Steps Step 1: Identify the Business Process Step 2: Declare the Grain Step 3: Identify the Dimensions Step 4: Identify the Facts

  42. Grocery Store:The Universal Example The Scenario: • Chain of 100 Grocery Stores • 60000 individual products in each store • 10000 of these products sold on any given day(average) • 3 year data

  43. Grocery Store DW • Step 1: Sales Business Process • Step 2: Daily Grain • A word about GRANULARITY • Temp sensor data: per ms, sec, min, hr? • Size of the DW is governed by granularity • Daily grain (club products sold on a day for each store) Aggregated data • Receipt line Grain (each line in the receipt is recorded – finest grain data)

  44. Grocery Store:DW Size Estimate • Daily Grain • Size of Fact Table = 100*10000*3*365 = 1095 million records • 3 facts & 4 dimensions (49 bytes) • 1095 m * 49 bytes = 53655 m bytes • i.e. ~ 50 GB

  45. Data Cube Goa Location Dubai Dimensions: Time, Product, Location Attributes: Product (upc, price, …) Location… … Hierarchies: Product  Brand  … Day  Week  Quarter City  Region  Country Pilani 10 34 56 32 12 56 Juice Milk Coke Cream Soap Bread Product roll-up to week M T W Th F S S Time 56 units of bread sold in Pilani on M

  46. Q & A

  47. Thank You

More Related