ISQS 6339, Business Intelligence Dimensional Modeling

ISQS 6339, Business IntelligenceDimensional Modeling Zhangxi Lin Texas Tech University 1

Outline • Principles of Dimensional Modeling • Data Warehousing Methodology • Three Phases of Dimensional Modeling

Principles of Dimensional Modeling

Dimensional Model • Also called star schema (but snowflake schema is also fine) • Fact table is in the middle and dimensions serving as the points on the star. • A normalized fact table plus denormalized dimension tables • Reference: database normalization • Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization and what we now know as the First Normal Form (1NF) in 1970. Codd went on to define the Second Normal Form (2NF) and Third Normal Form (3NF) in 1971, and Codd and Raymond F. Boyce defined the Boyce-Codd Normal Form (BCNF) in 1974. • Informally, a relational database table is often described as "normalized" if it is in the Third Normal Form. Most 3NF tables are free of insertion, update, and deletion anomalies. 4

Star Schema Model Denormalized dimensions Central fact table Product Table Product_id Product_disc,... Store Table Store_id District_id,... Sales Fact Table Product_id Store_id Item_id Day_id Sales_amount Sales_units, ... Time Table Day_id Month_id Year_id,... Item Table Item_id Item_desc,... 5

Snowflake Schema Model Store Table Store_id Store_desc District_id Product Table Product_id Product_desc District Table District_id District_desc Sales Fact Table Item_id Store_id Product_id Week_id Sales_amount Sales_units Item Table Item_id Item_desc Dept_id Dept Table Dept_id Dept_desc Mgr_id Mgr Table Dept_id Mgr_id Mgr_name Time Table Week_id Period_id Year_id 6

Snowflake Schema Model • Direct use by some tools • More flexible to change • Provides for speedier data loading • Can become large and unmanageable • Degrades query performance • More complex metadata Country State County City 7

Facts • Definition • Measure – a numeric quantity expressing some aspect of the organization’s performance • Aggregate – formed by combining values from a given dimension or set of dimensions to create a single value. • Measurements associated with a specific business process. • Most facts are additive (calculative); others are semi-additive, non-additive, or descriptive (e.g. factless fact table). • Many facts can be derived from other facts. So, non-additive facts can be avoided by calculating it from additive facts.

Fact Table Characteristics • Contain numerical metrics of the business • Can hold large volumes of data • Can grow quickly • Can contain base, derived,and summarized data • Are typically additive • Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables Sales Fact Table Product_id Store_id Item_id Day_id Sales_amount Sales_units ... 9

The Three Fact Table Types • Transaction fact table • The most basic and fundamental • “One row per line in a transaction", e.g., every line on a receipt • A transactional fact table holds data of the most detailed level • have a great number of dimensions associated with it • Periodic snapshot fact table • Takes a "picture of the moment“ • Cumulative performance over specific time intervals • Dependent on the transactional table • Valuable to combine data across several business processes in the value chain • Accumulating snapshot fact table • Used to show the activity of a process that has a well-defined beginning and end • Constantly updated over time 10

Types of facts • Transaction fact: each row • Periodic snapshot fact: (OldBal, NewBal) on each transaction • Accumulating snapshot fact: The average numbers in a week, such as average balance, number of transactions, average amount of transactions

Dimensions • Definition: a categorization used to spread out an aggregate measure to reveal its constituent part • The foundation of the dimensional model to describe the objects of the business • The nouns of the DW/BI system • Business processes (facts) are the verbs of the business • Dimension tables link to all the business processes. • A dimension shared across all processes is called conformed dimension • The analysis involving data from more than one business process is called drill-across. 12

Attributes • An additional piece of information pertaining to a dimension member that is not the unique identifier or the description of the member. • Attributes can be used to more fully describe dimension members

Dimension Table Characteristics • Dimension tables have the following characteristics: • Contain textual information that represents the attributes of the business • Contain relatively static data • Are joined to a fact table through a foreign key reference 14

Star Dimensional Model Characteristics • The model is easy for users to understand. • Primary keys represent a dimension. • Nonforeign key columns are values. • Facts are usually highly normalized. • Dimensions are completely denormalized. • Fast response to queries is provided. • Performance is improved by reducing table joins. • End users can express complex queries. • Support is provided by many front-end tools. 15

Sales fact Timedimension Where should the element of time be stored? The Time Dimension • Time is critical to the data warehouse. A consistent representation of time is required for extensibility. 16

Hierarchies • Meaningful, standard ways to group the data within a dimension • Variable-depth hierarchies • Frequently changing hierarchies • Examples of hierarchy in a dimension • Address: street, city, state, country • Organization: section, division, branch, region • Time: year, quarter, month, date 17

Data Cube • Data cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a three-dimensional extension of a square. The word cube brings to mind a 3-D object, and we can think of a 3-D data cube as being a set of similarly structured 2-D tables stacked on top of one another. • Data cubes aren't restricted to just three dimensions. Most OLAP systems can build data cubes with many more dimensions allows up to 64 dimensions. • In practice, we often construct data cubes with many dimensions, but we tend to look at just three at a time. What makes data cubes so valuable is that we can index the cube on one or more of its dimensions. 18

Data Cube Region Product Time

OLAP system • OLAP – allows users to retrieve information from data quickly for analysis purposes • Features • Multidimensional database • Easily understood • What is OLAP? 5’04” • SQL OLAP Tutorial - Data Warehouse Schema Design 9’45”

Dimensional Modeling Process • High level dimensional model design • Choosing business model • Declaring the grain • Choosing dimensions • Identifying the facts • Detailed dimensional model development • Dimensional model review and validation • IS • Core users • Business community • Final design iteration ISQS 6339, Data Mgmt & BI, Zhangxi Lin

Data Warehousing Methodology 22

Data Warehouse Development Approaches Data warehouse development approaches Kimball Model: Data mart approach Data marts - EDW Inmon Model: EDW approach EDW – Data Marts Which model is better? There is no one-size-fits-all strategy to data warehousing One alternative is the hosted warehouse 23

Comparison • Kimball Model • Kimball’s model follows a bottom-up approach. The Data Warehouse (DW) is provisioned from Datamarts (DM) as and when they are available or required. • The Datamarts are sourced from OLTP systems are usually relational databases in Third normal form (3NF). • The Data Warehouse which is central to the model is a de-normalized star schema. The OLAP cubes are built on this DW. • Inmon Model • Inmon’s model follows a top-down approach. The Data Warehouse (DW) is sourced from OLTP systems and is the central repository of data. • The Data Warehouse in Inmon’s model is in Third Normal Form (3NF). • The Datamarts (DM) are provisioned out of the Data Warehouse as and when required. Datamarts in Inmon’s model are in 3NF from which the OLAP cubes are built.

Strengths and Weaknesses • Scalable vs. structural • Kimball’s model is more scalable because of the bottom-up approach and hence you can start small and scale-up eventually. The ROI is usually faster with Kimball’s model. Because of this approach it is difficult to created re-usable structures/ ETL for different data marts. • On the other hand Inmon’s model is more structured and easier to maintain while it is rigid and takes more time to build. The significant advantage of Inmon’s model is because the DW is in 3NF; it is easier to build data mining models. • Both Kimball and Inmon models agree and emphasis that DW is the central repository of data and OLAP cubes are built of de-normalized star schemas. • In conclusion, when it comes to data modeling, it is irrelevant which camp you belong to as long as you understand why you are adopting a specific model. Sometimes it makes sense to take a hybrid approach.

General Data Warehouse Development Approaches • “Big bang” approach • Incremental approach: • Top-down incremental approach • Bottom-up incremental approach ISQS 6339, Data Mgmt & BI, Zhangxi Lin 27

Analyze enterprise requirements Build enterprise data warehouse Report in subsets or store in data marts “Big Bang” Approach ISQS 6339, Data Mgmt & BI, Zhangxi Lin 28

Incremental Approach to Warehouse Development • Multiple iterations • Shorter implementations • Validation of each phase Increment 1 Strategy Definition Analysis Design Build Iterative Production ISQS 6339, Data Mgmt & BI, Zhangxi Lin 29

Top-Down Approach • Analyze requirements at the enterprise level • Develop conceptual information model • Identify and prioritize subject areas • Complete a model of selected subject area • Map to available data • Perform a source system analysis • Implement base technical architecture • Establish metadata, extraction, and load processes for the initial subject area • Create and populate the initial subject area data mart within the overall warehouse framework ISQS 6339, Data Mgmt & BI, Zhangxi Lin 30

Bottom-Up Approach • Define the scope and coverage of the data warehouse and analyze the source systems within this scope • Define the initial increment based on the political pressure, assumed business benefit and data volume • Implement base technical architecture and establish metadata, extraction, and load processes as required by increment • Create and populate the initial subject areas within the overall warehouse framework ISQS 6339, Data Mgmt & BI, Zhangxi Lin 31

Note: There are many details about data warehouse design, which need a lot effort to learn. Because of limited time to spend for this part, here are only some of the details. THREE PHASES OF DATA WAREHOUSE DESIGN

Data Warehouse Database Design Phases • Phase 1: Defining the business model • Phase 2: Defining the dimensional model • Phase 3: Defining the physical model 33

Phase 1: Defining the Business Model • Performing strategic analysis • Creating the business model • Documenting metadata 34

Performing Strategic Analysis • Identify crucial business processes • Understand business processes • Prioritize and select the business processes to implement High Business Benefit Low Feasibility Low High 35

Creating the Business Model • Defining business requirements: • Identifying the business measures • Identifying the dimensions • Identifying the grain • Identifying the business definitions and rules • Verifying data sources 36

Business Requirements Drive the Design Process Business Requirements Existing Metadata Production ERD Model Research 37 • Primary input • Secondary input

Identifying Measures and Dimensions • The attribute varies continuously: • Balance • Units Sold • Cost • Sales Measures Dimensions 38 • The attribute is perceived as constant or discrete: • Product • Location • Time • Size

Using a Business Process Matrix Sample of business process matrix 39

Determining Granularity YEAR? QUARTER? MONTH? WEEK? DAY? 40

Identifying Business Rules Product Type Monitor Status PC 15 inch New Server 17 inch Rebuilt 19 inch Custom None Time Month > Quarter > Year Store Store > District > Region Location Geographic proximity 0 - 1 miles 1 - 5 miles > 5 miles 41

Documenting Metadata • Documenting metadata should include: • Documenting the design process • Documenting the development process • Providing a record of changes • Recording enhancements over time 42

Metadata Documentation Approaches • Automated • Data modeling tools • ETL tools • End-user tools • Manual 43

Phase 2: Defining the Dimensional Model • Identify fact tables: • Translate business measures into fact tables • Analyze source system information for additional measures • Identify dimension tables • Link fact tables to the dimension tables • Model the time dimension 44

Illustrative case – IMW Data • Transaction fact – The transaction table • Periodic snapshot fact table – current records in Land & Office facts • Accumulating snapshot fact table – N/A in this case

Steps in designing a fact table • Identify a business process for analysis (like sales). • Identify measures or facts (sales dollar), by asking questions like 'What number of XX are relevant for the business process?', replacing the XX with various options that make sense within the context of the business. • Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension), by asking questions that make sense within the context of the business, like 'Analyse by XX', where XX is replaced with the subject to test. • List the columns that describe each dimension (region name, branch name, business unit name). • Determine the lowest level (granularity) of summary in a fact table (e.g. sales dollars). • An alternative approach is the four step design process described in Kimball. – Check what it is

Using Time in the Data Warehouse • Defining standards for time is critical. • Aggregation based on time is complex. 47

Using Data Modeling Tools • Tools with a GUI enable definition, modeling, and reporting. • Avoid a mix of modeling techniques caused by: • Development pressures • Developers with lack of knowledge • No strategy • Determine a strategy. • Write and publish formally. • Make available electronically. 48

Phase 3: Defining the Physical Model • Why • Huge amount of data must be effectively processed and retrieved in realtime. • How • Translate the dimensional design to a physical model for implementation. • Define storage strategy for tables and indexes. • Perform database sizing. • Define initial indexing strategy. • Define partitioning strategy. • Update metadata document with physical information. 49

Storage and Performance Considerations • Database sizing • Data partitioning • Indexing • Star query optimization 50

ISQS 6339, Business Intelligence Dimensional Modeling