320 likes | 446 Vues
MIS 451 Building Business Intelligence Systems. Logical Design (1). Project Planning. Requirements Analysis. Logical Design. Physical Design. Data Staging. Data Analysis (OLAP). Introduction to Dimensional Modeling.
E N D
MIS 451Building Business Intelligence Systems Logical Design (1)
Project Planning Requirements Analysis Logical Design Physical Design Data Staging Data Analysis (OLAP)
Introduction to Dimensional Modeling • Dimensional Modeling is a DW logical design technique that seeks to present data in a standard framework that is intuitive for data access and allows for high performance data access. • Intuitive: easy to write SQL • High performance: high performance SQL
Dimensional Model (Star Schema) ER Model For detailed information, please refer handout 1.
Introduction to Dimensional Modeling • Analytical Report: 2-dimension January sales report by customer state and product category • Query: list sales in Jan. by customer state and product category?
Introduction to Dimensional Modeling Query based on Dimensional Model: Select State, PCName, SUM(Sales) From Sales S, Customer C, Product P, Time T Where S.Time_ Key = T.Time_Key and S.Product_ Key = P.Product_Key and S.Customer_Key = C.Customer_Key and T.Month= ’JAN’ Group by State, PCName Join: 4 tables Query based on ER Model: Select State, PCName, SUM(Price*Quantity) From OrderLine OL, Customer C, Product_Category PC, Product P, Order O Where OL.OID = O.OID and OL.PID = P.PID and O.CID = C.CID and to_char(O.OrderDate,’MON’) = ’JAN’ and P.PCID = PC.PCID Group by State, PCName Join: 5 tables
Fact and Dimension Fact table Dimension table
Fact and Dimension • There are two types of tables in dimensional modeling: • Fact table: attributes in fact tables are measurements for analysis or contents in reports. • Dimension table: attributes in dimension tables are constraints for the measurements or headers in reports. Dimensions Facts
Facts and Dimensions • How to identify facts and dimensions? • Requirements Analysis: • Analytical requirements: Marketing managers want to know sales performance for different product category in different states? • Information requirements: quantity of product sold, sales amount, product category, and customer states • ER Model
F1: Calculation F: refers to special considerations for fact table or special type of fact table
F1: Calculation • Normalization in RDB • 1NF • 2NF • 3NF • Non-volatile property of data warehouse enables DW design to resist normalization and improve query performance.
D1: Slowly changing dimension D: refers to special considerations for dimension table or special type of dimension table
CID CName State City 101 Jon Arizona Tucson 102 Tom Arizona Tucson 103 Mark Arizona Phoenix D1: Slowly changing dimension • Values of attributes in dimension tables may evolve over time. For example, customers moved from one city to another city. Phoenix Tom moved from Tucson to Phoenix
CID CID CName CName State State City City 101 101 Jon Jon Arizona Arizona Tucson Tucson 102 102 Tom Tom Arizona Arizona Tucson Phoenix 103 103 Mark Mark Arizona Arizona Phoenix Phoenix D1: Slowly changing dimension • There are three ways to handle slowly changing dimension. • Method 1: Overwrite old values with new values
D1: Slowly changing dimension Drawbacks of method 1: Historical information is totally lost. We will never know that customer 102 lived in Tucson before. Moreover, when listing sales by city, all the sales of customer 102 will be counted as part of Phoenix sales, although 102 was in Tucson before.
CID CID CName CName State State Original City City Current City 101 Jon Arizona Tucson 101 Jon Arizona Tucson Tucson 102 Tom Arizona Tucson 102 Tom Arizona Tucson Phoenix 103 Mark Arizona Phoenix 103 Mark Arizona Phoenix Phoenix D1: Slowly changing dimension • Method 2: Add a new attribute to record current value of the changing attribute.
D1: Slowly changing dimension Drawbacks of method 2: Only partial Historical information (original & current) is kept. Considering that customer 102 moved from Tucson to Flagstaff then to Phoenix, the customer information of customer 102 only includes Tucson and Phoenix.
CID CName State City 101 Jon Arizona Tucson 102 Tom Arizona Tucson 103 Mark Arizona Phoenix D1: Slowly changing dimension • Method 3: Add a record whenever a dimension attribute changes.
D1: Slowly changing dimension Method 3 keep all the information. However, Is there any problem?
CID CName State City 101 Jon Arizona Tucson 102 Tom Arizona Tucson 103 Mark Arizona Phoenix D1: Slowly changing dimension • Method 4: warehouse key + method 3 • Warehouse key is a sequence of non-negative integers served as primary keys of tables in data warehouse. Warehouse key
D1: Slowly changing dimension • Why warehouse key is needed in data warehouse? • Solve slowly changing dimension problem • Compared with natural keys (i.e., primary keys of tables in RDB, such as CID of customer table), warehouse keys have high join performance.
D1: Slowly changing dimension • Warehouse key • Primary keys in dimensional tables are warehouse keys. • Primary key in fact table is a collection of warehouse keys of all/part of its associated dimensions.
D1: Slowly changing dimension Notation: Primary key
D2: Time Dimension D: refers to special considerations for dimension table or special type of dimension table
D2: Time Dimension • Data warehouse needs an explicit time dimension table instead of just a time attribute (e.g, ORDERDATE). • Besides the time attribute, time dimension table includes the following additional attributes: • Day_of_week (1-7); Day_number_in_month (1-31); • Day_number_in_year (1-365) • Week_number (1-52); month (1-12), Quarter (1-4) • Holiday_flag (y/n) • Fiscal_quarter, Fiscal_year
D2: Time Dimension • Time dimension can: • Save computation effort and improve query performance • Complex queries regarding calendar calculation are hidden from end users of data warehouse.
D3: Snowflake D: refers to special considerations for dimension table or special type of dimension table
D3: Snowflake Snowflake structure
D3: Snowflake • Snowflake structure should be avoided in data warehouse design • Tradeoff of avoiding snowflake • Advantage: improve query performance • Disadvantage: require more storage space