510 likes | 724 Vues
Data Warehousing. Dr. Navneet Goyal Assistant Professor Computer Science Department BITS, Pilani. Background. 1980’s to early 1990’s Focus on computerizing business processes To gain competitive advantage By early 1990’s All companies had operational systems
E N D
Data Warehousing Dr. Navneet Goyal Assistant Professor Computer Science Department BITS, Pilani
Background • 1980’s to early 1990’s • Focus on computerizing business processes • To gain competitive advantage • By early 1990’s • All companies had operational systems • It no longer offered any advantage • How to get competitive advantage??
OLTP Systems: Primary Purpose Run the operations of the business • For example: Banks, Railway reservation etc. • Based on ER Data Modeling • Transaction based system • Data is always current valued • Little history is available • Data is highly volatile • Has “Intelligent keys”
OLTP Systems • Has relational normalized design • Redundant data is undesirable • Consists of many tables • High volume retrieval is inefficient • Optimized for repetitive “narrow” queries • Common data in many applications
Need for Data Warehousing • Companies, over the years, gathered huge volumes of data • “Hidden Treasure” • Can this data be used in any way? • Can we analyze this data to get any competitive advantage? • If yes, what kind of advantage?
Benefits of Data Warehousing • Allows “efficient” analysis of data • Competitive Advantage • Analysis aids strategic decision making • Increased productivity of decision makers • Potential high ROI • Classic example: Diaper and Beer
Decision Support Systems, DW, & OLAP • Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions. • Data Warehouse is a DSS • A data warewhouse is an architectural construct of an information system that provides users with current and historical decision support information that is hard to access or present in traditional operational systems. • Data Warehouse is not an Intelligent system • On-Line Analytical Processing (OLAP) is an element of DSS
Data Warehouse: Characteristics • Analysis driven • Ad-hoc queries • Complex queries • Used by top managers • Based on Dimensional Modeling • Denormalized structures
Data Warehouse:Major Players • SAS institute • IBM • Oracle • Sybase • Microsoft • HP • Cognos • Business Objects
Data Warehouse • A decision support database that is maintained separately from the organization’s operational databases. • A data warehouse is a • subject-oriented, • integrated, • time-varying, • non-volatile collection of data that is used primarily in organizational decision making
Subject Oriented • Data Warehouse is designed around “subjects” rather than processes • A company may have • Retail Sales System • Outlet Sales System • Catalog Sales System • Problems Galore!!! • DW will have a Sales Subject Area
Subject Oriented OLTP Systems Data Warehouse Subject-Oriented Sales Information Retail Sales System Catalog Sales System Outlet Sales System Sales Subject Area
Integrated • Heterogeneous Source Systems • Little or no control • Need to Integrate source data • For Example: Product codes could be different in different systems • Arrive at common code in DW
Non-Volatile(Read-Mostly) Write USER Read Read USER OLTP DW
Time Variant • Most business analysis has a time component • Trend Analysis (historical data is required) Sales 2001 2002 2003 2004
Data Warehousing Architecture Monitoring & Administration OLAP servers Metadata Repository Analysis Query/ Reporting Data Mining Data Warehouse Extract Transform Load Refresh External Sources Serve Operational dbs Data Marts
Populating & Refreshing the Warehouse • Data Extraction • Data Cleaning • Data Transformation • Convert from legacy/host format to warehouse format • Load • Sort, summarize, consolidate, compute views, check integrity, build indexes, partition • Refresh • Bring new data from source systems
ETL Process Issues & Challenges • Consumes 70-80% of project time • Heterogeneous Source Systems • Little or no control over source systems • Source systems scattered • Source systems operating in different time zones • Different currencies • Different measurement units • Data not captured by OLTP systems • Ensuring data quality
Data Staging Area • A storage area where extracted data is • Cleaned • Transformed • Deduplicated • Initial storage for data • Need not be based on Relational model • Spread over a number of machines • Mainly sorting and Sequential processing • COBOL or C code running against flat files • Does not provide data access to users • Analogy – kitchen of a restaurant
Presentation Servers • A target physical machine on which DW data is organized for • Direct querying by end users using OLAP • Report writers • Data Visualization tools • Data mining tools • Data stored in Dimensional framework • Analogy – Sitting area of a restaurant
Data Cleaning • Why? • Data warehouse contains data that is analyzed for business decisions • More data and multiple sources could mean more errors in the data and harder to trace such errors • Results in incorrect analysis • Detecting data anomalies and rectifying them early has huge payoffs • Long Term Solution • Change business practices and data entry tools • Repository for meta-data
Soundex Algorithms • Misspelled terms • For example NAMES • Phonetic algorithms – can find similar sounding names • Based on the six phonetic classifications of human speech sounds
Data Warehouse Design • OLTP Systems are Data Capture Systems • “DATA IN” systems • DW are “DATA OUT” systems OLTP DW
Analyzing the DATA • Active Analysis – User Queries • User-guided data analysis • Show me how X varies with Y • OLAP • Automated Analysis – Data Mining • What’s in there? • Set the computer FREE on your data • Supervised Learning (classification) • Unsupervised Learning (clustering)
OLAP Queries • How much of product P1 was sold in 1999 state wise? • Top 5 selling products in 2002 • Total Sales in Q1 of FY 2002-03? • Color wise sales figure of cars from 2000 to 2003 • Model wise sales of cars for the month of Jan from 2000 to 2003
Data Mining Investigations • Which type of customers are more likely to spend most with us in the coming year? • What additional products are most likely to be sold to customers who buy sportswear? • In which area should we open a new store in the next year? • What are the characteristics of customers most likely to default on their loans before the year is out?
Continuum of Analysis OLTP Specialized Algorithms SQL Data Mining OLAP Primitive & Canned Analysis Complex Ad-hoc Analysis Automated Analysis
Design Requirements • Design of the DW must directly reflect the way the managers look at the business • Should capture the measurements of importance along with parameters by which these parameters are viewed • It must facilitate data analysis, i.e., answering business questions
ER Modeling • A logical design technique that seeks to eliminate data redundancy • Illuminates the microscopic relationships among data elements • Perfect for OLTP systems • Responsible for success of transaction processing in Relational Databases
Problems with ER Model ER models are NOT suitable for DW? • End user cannot understand or remember an ER Model • Many DWs have failed because of overly complex ER designs • Not optimized for complex, ad-hoc queries • Data retrieval becomes difficult due to normalization • Browsing becomes difficult
ER vs Dimensional Modeling • ER models are constituted to • Remove redundant data (normalization) • Facilitate retrieval of individual records having certain critical identifiers • Thereby optimizing OLTP performance • Dimensional model supports the reporting and analytical needs of a data warehouse system.
Dimensional Modeling:Salient Features • Represents data in a standard framework • Framework is easily understandable by end users • Contains same information as ER model • Packages data in symmetric format • Resilient to change • Facilitates data retrieval/analysis
Dimensional Modeling:Vocabulary • Measures or facts • Facts are “numeric” & “additive” • For example; Sale Amount, Sale Units • Factors or dimensions • Star Schemas • Snowflake & Starflake Schemas Sales Amt = f( Product, Location, Time) Fact Dimensions
Star Schema FK FK FK FK Product Dimension Location Dimension Sales Fact Table Time Dimension Promotion Dimension
Dimensional Modeling • Facts are stored in FACT Tables • Dimensions are stored in DIMENSION tables • Dimension tables contains textual descriptors of business • Fact and dimension tables form a Star Schema • “BIG” fact table in center surrounded by “SMALL” dimension tables
The “Classic” Star Schema Fact Table STORE KEY Store Dimension Time Dimension PRODUCT KEY STORE KEY PERIOD KEY PERIOD KEY Store Description City State District ID District Desc. Region_ID Region Desc. Regional Mgr. Dollars_sold Units Dollars_cost Period Desc Year Quarter Month Day Product Dimension PRODUCT KEY Product Desc. Brand Color Size Manufacturer
Fact Tables • Contains numerical measurements of the business • Each measurement is taken at the intersection of all dimensions • Intersection is the composite key • Represents Many-to-many relationships between dimensions • Examples of facts Sale_amt, Units_sold, Cost, Customer_count
Dimension Tables • Contains attributes for dimensions • 50 to 100 attributes common • Best attributes are textual and descriptive • DW is only as good as the dimension attributes • Contains hierarchal information albeit redundantly • Entry points into the fact table
Types of Facts • Fully-additive-all dimensions • Units_sold, Sales_amt • Semi-additive-some dimensions • Account_balance, Customer_count 28/3,tissue paper,store1, 25, 250,20 28/3,paper towel,store1, 35, 350,30 Is no. of customers who bought either tissue paper or paper towel is 50? NO. • Non-additive-none • Gross margin=Gross profit/amount • Note that GP and Amount are fully additive • Ratio of the sums and not sum of the ratios
Data Warehouse:Design Steps Step 1: Identify the Business Process Step 2: Declare the Grain Step 3: Identify the Dimensions Step 4: Identify the Facts
Grocery Store:The Universal Example The Scenario: • Chain of 100 Grocery Stores • 60000 individual products in each store • 10000 of these products sold on any given day(average) • 3 year data
Grocery Store DW • Step 1: Sales Business Process • Step 2: Daily Grain • A word about GRANULARITY • Temp sensor data: per ms, sec, min, hr? • Size of the DW is governed by granularity • Daily grain (club products sold on a day for each store) Aggregated data • Receipt line Grain (each line in the receipt is recorded – finest grain data)
Grocery Store:DW Size Estimate • Daily Grain • Size of Fact Table = 100*10000*3*365 = 1095 million records • 3 facts & 4 dimensions (49 bytes) • 1095 m * 49 bytes = 53655 m bytes • i.e. ~ 50 GB
Data Cube Goa Location Dubai Dimensions: Time, Product, Location Attributes: Product (upc, price, …) Location… … Hierarchies: Product Brand … Day Week Quarter City Region Country Pilani 10 34 56 32 12 56 Juice Milk Coke Cream Soap Bread Product roll-up to week M T W Th F S S Time 56 units of bread sold in Pilani on M