Data Warehousing

Data Warehousing
2. Warehouse Data Modelling

Objectives Requirements to data design Dimensional modelling vs ER modelling The Star Schema Fact and Dimensions Advanced topics

Requirements and Data Design Ponniah develops the idea of the Information Package Think of information as business dimensions Using SALES as a subject example Dimensions Granularity

Information Packages allow us to: Define common subject areas Define Key metrics How will data be presented Determine aggregate and roll up Establish granularity Estimate warehouse size Determine frequency of loading

Dimensions Dimensional analysis is the underlying basis for designing the Data Warehouse The questions we want to answer help us decide what dimensions we need to include in the Warehouse Eg. Hotel Occupancy (as a SUBJECT) Occupancy of rooms in various branches of the hotel chain, by individual hotels, room types, time and perhaps customer demographics; Age, Gender, income range, marital status, household size, number of vehicles, home value, education etc.

Dimension Data Data by which the user performs query analysis Dimension tables contain a key that links to primary key values in the Fact table Data quality is important Encoded attributes (for operational systems) are expanded for the dimension Data warehouse fact tables are refreshed but dimension tables typically remain static.

Dimensions Contain textual or qualitative data Smaller volumes Discrete values Examples include Time Item or product Store Region Promotion Payment method

Fact Data The numerical measures of the business Access via dimensions (eg. how many sales did we make during March in the Wellington region) SALES is the fact table, the TIME dimension and REGION dimensions provide the other data Facts are point in time snapshots of operational data Contain a time element (order date, sale data, date shipped etc) Contain composite primary key values – Each part of the key references a dimension

Fact Tables May contain aggregated or summary data Or, this may be represented in separate tables, but access is seamless May contain derived or calculated data created from multiple sources of data Derived data is often stored to make queries more efficient (no need to perform the calculation) Derived data is usually created as part of the routine that transforms data prior to loading into the warehouse

More about Fact Tables Contains large volumes – millions of rows Historic in nature. Date time stamped Fact data is loaded according to the refresh cycle (weekly, monthly) May be more than one fact table – containing different business measures May be made up of summarised data held in different fact tables Linked to Dimensions with many FK values

Facts or Dimensions How to decide if an attribute is a fact or a dimension If the data changes value – it is a fact – units sold, account balances etc. If the data is constant it is a dimension – colour, address of a customer, store location, etc.

Keys Fact table composite primary keys join to key values in dimension tables Indexed for speed Example Time_id + Product_id + Store_id + Promotion + Customer Above could be representative of any of the attributes in the dimension tables, including heavily summarised data -

Granularity The level of detail of data stored in the warehouse Granularity affects Size of the warehouse Type (and complexity) of analytical query that you can perform How flexible the warehouse is from the users perspective

Levels of Granularity Individual transactions Daily snapshots Monthly, quarterly, yearly, etc. High to lowest level – the level is called the grain. Key driver in determining granularity is the business requirements – what the users need to make decisions.

Grain Choosing the wrong level of granularity can result in lost data For eg. If you store sales data rolled up to a quarterly level – analyst wants to determine sales by month – Problem! Review granularity with “proof of concept”, prototyping, feedback sessions

Modelling criteria Model provides for data access Model should be query centric Optimised for query and analysis Dimension tables interact with fact table Drill down and roll up along dimension hierarchies The modelling approach is often called the STAR Schema approach.

Dimensional v ER ER modelling Removes and controls data redundancy to ensure data consistency and integrity Often not intuitive to users Dimensional (Star Schema) Intuitive structure to users Views data along the dimensions Captures critical business measures in FACT table

Dimensions Generally constant over time. Key values don’t change Data may change (slowly), new products, new stores etc. Descriptions change, in OLTP systems old values overwrite the new ones – in Dimensions you may wish to keep older values

Large Dimensions Some dimensions can contain considerable volumes of data with a great many attributes. Customer dimensions for large US or European service and retail franchises can be very large – consider Vodafone in the UK, travel industry warehouses and large retail loyalty schemes

Shattered Stars In the previous case – consider splitting dimensions with many attributes into mini dimensions Basic Customer account dimension Customer account status dimension Customer demographic dimension

Denormalisation Warehouse data is often termed denormalised Redundancy is apparent, repeating data exists Contains derived data, summarised data etc. Usually done for speed of query / performance

Summary Data Provides fast data access Pre-calculated from millions of rows from fact tables Also referred to as Aggregated fact tables Can be lightly or heavily summarised In Oracle called Materialised Views You can design and pre-calculate or let users calculate on the fly.

Summary Data Tables Can be rolled up data – sales by state, city, region, store etc. Presummarised by hierarchical granularity Can be many summary tables – potentially all the possible combinations of grain (hundreds). Thus, need for maintainance – get rid of those not being used.

Snowflake schema A method of normalising STAR dimensions So PRODUC T dimension broken down into CATEGORY, BRAND, etc. May help with storage space Easier to maintain But.. Less intuitive to users, may degrade performance due to joins being required

Data Warehousing