350 likes | 530 Vues
CSD305 Data Warehouse Design. An introduction to data warehousing. Connolly and Begg Database Systems 4 th edition chapters 31 and 32 Introduction to Data Mining, Tan, Steinbach and Kumar, Pearson Educational. Agenda. Business Intelligence The architecture of data warehouse and data mart
E N D
CSD305Data Warehouse Design An introduction to data warehousing Connolly and Begg Database Systems 4th edition chapters 31 and 32 Introduction to Data Mining, Tan, Steinbach and Kumar, Pearson Educational
Agenda • Business Intelligence • The architecture of data warehouse and data mart • Data integration and cleansing • Data extraction and OLAP
Business Intelligence (BI)Data Mining (DM) and Knowledge Discovery in Databases (KDD) • Recent advances in technologies have resulted in an explosion in the amount of data generated and collected by businesses and organizations • Point of sale data in retail • Smart card technology (e.g. Oyster cards) • Satellite data Extracting useful information from these data sets can be challenging
Benefits Potential high returns on investment (ROI) • Can cost tens thousands to millions to implement data warehouse • Study by International Data Corporation (IDC) DW projects delivered average three-year ROI of 401% in 1996 later study analytical tools delivered average one-year ROI of 431% in 2002 Competitive advantage • Huge ROI evidence of enormous competitive advantage. • Taps into previous unknown untapped info on customers, trends and demands. Increased productivity of corporate decision makers • Consistent, subject orientated historical data • Integrates from multiple incompatible systems providing a consistent view • Allows for more substantive, accurate and consistent analysis
Knowledge Discovery in Databases (KDD) Filtering Pattern analysis Visualization Cleansing
How to get to BI? • Data mining techniques, technologies and algorithms are being developed to provide BI • Before we get to data mining we must first pre process the data to get it into the right form • This will often involve the design of a data warehouse: • Design the data warehouse architecture • Extract and cleanse the data • Integrate the data into a common data mart schema
Data Warehousing concepts • Subject-oriented • Integrated • Time-variant • Non-volatile
Data Integration Component • Systems storing data may be different • Data Type Differences • Value differences (Colour:Black(0, or BL, or Black)) • Semantic Differences ( Terms -> Different interpretations) • E.g., Column ‘Title’ in one database means ‘Job Title’ while in another database it means ‘Person Title’ • Missing Values (NULL VALUES) • Different Schemata
Data Integration Conflicts • Schema level conflicts are due to • different perceptions • different focus • Value level conflicts are due to • different representations, coding, etc • different precision • incorrect information • data entry errors • Data cleaning
The Warehousing Approach • Information integrated in advance • Stored in WH for direct querying and analysis
Dimensional modelling • Dimensional modelling - presenting the data in a standard form • Each Dimensional Model • One fact table • Multiple Dimension tables • Each dimensions gives a different focus to the data • Star schema or Star join
Star schema Fact table • one table with a composite primary key. • factual data generated by events occurred in the past. • Unlikely to change, regardless of how they are analysed. • Can be large relative to dimension tables • Numerical measures or ‘facts’ that occur for each record • Other examples: offer price, selling price, sale commission, sale revenue. Dimension Table • Each has a simple PK • Corresponds to exactly one component of composite key in fact table • All natural keys are replaced by surrogate keys • Natural keys has relationship with data, ISBN, vehicle registration number. • Surrogate key no relationship with data, a generated value to make the data unique. • General structure based on integers
Star schema essentials • One fact table • Multiple dimension tables • Each dimensions gives a different focus to the data
Star join essentials • Again - fact tables and dimension tables. • Said to be star join when one large central table joined to two or more dimension tables • The fact table represents the structure that holds the majority of the occurrences of the data. • Fact tables typically combine data and cross reference keys from a variety of other tables. • Dimension tables contain data which is not terribly voluminous. Dimension tables are related to fact tables by means of a foreign key relationship. • Typical use would be for an end-user query on any fact table
Why star joins? • By building star joins, the designer has created a structure for efficient access of large volumes of data and natural end-user viewing. • Problem with star joins. • In order to know how to create the star join, the designer must make assumptions about the usage of the data. • One department will look at data very differently from another department. The star join for finance will be very different than the star join for production, for example.
Star Schema development steps • Choose the process • Choose the grain • Identify the dimensions • Choose the facts
Choose Business Process Choose business process for the data mart • Data Mart contains subset of corporate data to support requirements of business unit e.g. sales department • DreamHome processes include: • Property sales • Property rentals • Property viewing • Property advertising • Property maintenance • Best choice for a first data mart is sales and finance. • Data source likely accessible, high quality
Choose the Grain • Balance between meeting business requirements and what is possible given the data source • Grain determines what the fact table represents • In this case sales Facts for each product • Best to build model using lowest level of detail available. • Only when the grain is chosen can we identify dimensions. • Time is included as a core dimension, always present in dimensional models. • Grain decision determines grain for dimension tables e.g. if grain is for saleFacts of each product, StoreDimension details of the store each sale took place in.
Choose Dimensions • Dimension are context for asking questions about the facts. • Identify dimensions in sufficient detail to describe things at the correct grain. • Dimensions can be used in more than one dimensional model (Data Mart) • Referred to as being conformed. Must be exactly the same or a subset. • Allows for individual data marts to form the enterprise data warehouse support.
Choose Facts • The grain of the fact table determines facts to be used. • All facts must be expressed at level implied by grain • E.g. if grain is for sale of individual products • All numerical facts must refer to this particular sale • Facts must also be numerical and addative • (can be summed across any dimension) • Additional facts can be added to the fact table at any time • Provided they are consistent with the grain
Data extractions • Data mart software will often include extraction and visualization tools • These are given the umbrella term OLAP • On-Line Analytical Process
Data extractions • Image from http://projects.cs.dal.ca/panda/olap.html
OLAP and Data Mining • We will see how a data warehouse could be examined in different dimensions • In the following week we shall look at Data Mining and begin investigating various data mining models
Dimensional Modelling Step 1: Select business process The process (function) refers to the subject matter of a particular data mart. First data mart built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions.
Step 2: Declare grain • Decide what a record of the fact table is to represent. • Identify dimensions of the fact table. The grain decision for the fact table also determines the grain of each dimension table. • Also include time as a core dimension, which is always present in star schemas.
Step 3: Choose dimensions • Dimensions set the context for asking questions about the facts in the fact table. • If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a mathematical subset of the other. • A dimension used in more than one data mart is referred to as being conformed.
Step 4: Identify facts • The grain of the fact table determines which facts can be used in the data mart. • Facts should be numeric and additive. • Unusable facts include: • non-numeric facts • non-additive facts • fact at different granularity from other facts in table Once the facts have been selected each should be re-examined to determine whether there are opportunities to use pre-calculations.