1 / 59

TechnoLogica Ltd. 3, Sofiisko Pole Str. tel : ( + 359 2) 91 91 2 (ten lines)

DW Concepts Dimension Modeling Techniques. Milena Gerova Project Manager. TechnoLogica Ltd. 3, Sofiisko Pole Str. tel : ( + 359 2) 91 91 2 (ten lines) e-mail: office@technologica.com, http:// www.technologica.com. TechnoLogica DW Projects.

meryl
Télécharger la présentation

TechnoLogica Ltd. 3, Sofiisko Pole Str. tel : ( + 359 2) 91 91 2 (ten lines)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. DW Concepts Dimension Modeling Techniques Milena Gerova Project Manager TechnoLogica Ltd. 3, Sofiisko Pole Str. tel: (+ 3592) 91 91 2 (ten lines) e-mail: office@technologica.com, http:// www.technologica.com

  2. TechnoLogica DW Projects • Business Management System National Health Insurance Fund (10.2004 – current) • Customer Data Integration Allianz Bulgaria Holding (10.2004 – current) • Regulatory Reporting System BULBANK (2002 - 2003) • Information System Monetary StatisticsBulgarian National Bank (April 2003 – August 2004) • Management Information System BULBANK (January 2001 - June 2002)

  3. Agenda • DW Terminology Overview • Dimensional Modeling • Dimension Types • History and Dimensions • Hierarchy in Dimensions

  4. The data warehouse must • Make an organization’s information easily accessible. • Present the organization’s information consistently. • Be adaptive and resilient to change • Be a secure bastion that protects our informationassets. • Serve as the foundation for improved decision making • The business community must accept the data warehouse if it is to bedeemed successful.

  5. Components of a Data Warehouse

  6. Dimensional Modeling • Dimensional modeling is a new name for an old technique for making databasessimple and understandable • Dimensional modeling is quite different from third-normal-form (3NF) modeling ERM ->The Transaction Processing Model • One table per entity • Minimize data redundancy • Optimize update DM -> The data warehousing model • One fact table for a process in the organization • Maximize understandability • Optimized for retrieval • Resilient to change

  7. Star Dimensional Modeling

  8. Four-Step Dimensional Design Process • 1. Select the business process to model. • 2. Declare the grain of the business process. • 3. Choose the dimensions that apply to each fact table row. • 4. Identify the numeric facts that will populate each fact table row.

  9. Dimensions • Determine these by the ways you want to slice and dice the data • Small number of rows compared to facts • Usually 5-10 dimensions surrounding a fact table • Time is almost always a dimension used by every fact • Track history • Uses Surrogate Keys • Hierarchies are usually built into them if possible

  10. Date Dimension • The date dimension is the one dimensionnearly guaranteed to be in every data mart • Date Dimension = Time Dimension before • We can build the date dimension table in advance (5-10 years -> only 3,650 rows)

  11. Date Dimension

  12. Date Dimension

  13. Date Dimension Data warehouses always need an explicit date dimension table. There are manydate attributes not supported by the SQL date function, including fiscal periods,seasons, holidays, and weekends. Rather than attempting to determine these nonstandardcalendar calculations in a query, we should look them up in a datedimensiontable. select sum(f.amount_sold)from DATE_DIM d, FACT fwhere d.Calendar_Month = ‘January’ and d.id = f.date_dim_id;

  14. Dimension Normalization(Denormalized dimension)

  15. Dimension Normalization(Denormalized dimension)

  16. Dimension Normalization(Snowflaking)

  17. Dimension Normalization(Snowflaking) • The dimension tables should remain as flat tables physically. • Normalized,snowflaked dimension tables penalize cross-attribute browsing and prohibit the useof bit-mapped indexes. • Disk space savings gained by normalizing the dimension tablestypically are less than 1 percent of the total disk space needed for the overallschema

  18. Too Many Dimensions

  19. Too Many Dimensions • A very large number of dimensions typically is a sign that several dimensions are notcompletely independent and should be combined into a single dimension. • If our design has 25 or more dimensions, we should look forways to combine correlated dimensions into a single dimension • It is a dimensionalmodeling mistake to represent elements of a hierarchy as separate dimensionsin the fact table.

  20. Surrogate Keys • Every join between dimension and fact tables in the data warehouse should bebased on meaningless integer surrogate keys. • You should avoid using the natural operationalproduction codes. None of the data warehouse keys should be smart,where you can tell something about the row just by looking at the key.

  21. Surrogate Keys Surrogate keys are like an immunization for the data warehouse • Buffer the data warehouseenvironment from operational changes • Performance advantagesThe smaller surrogate key translates into smaller fact tables,smaller fact table indices, and more fact table rows per blockinput-outputoperation • Surrogate keys are used to record dimension conditionsthat may not have an operational code“No Promotion inEffect”, “Date Not Applicable.”

  22. Surrogate Keys • The date dimension is the one dimension where surrogate keys should beassigned in a meaningful, sequential order • Surrogate keys are needed to support one of the primary techniquesfor handling changes to dimension table attributes • Don’t use concatenatedor compound keys for dimension tables

  23. Data Warehouse Bus Architecture

  24. Data Warehouse Bus Matrix

  25. Conformed Dimensions • Most dimensions are defined naturally at the most granular levelpossible • Conformed dimensions are either identical or strict mathematicalsubsets of the most granular, detailed dimension • They have consistent dimension keys, consistent attribute column names,consistent attribute definitions, and consistent attribute values • Theconformed dimension may be the same physical table within the database or may be duplicated synchronously in each data mart

  26. Conformed Dimensions • Roll-up dimensions conform to the base-level atomic dimension if they are a strictsubset of that atomic dimension.

  27. Conformed Dimensions • They should be built once in the staging area • They must be published prior to staging of the fact data • The dimension authority has responsibility fordefining,maintaining, and publishing a particular dimension or its subsets to all the datamart clients who need it

  28. Tracking History in Dimensions • Unchanging Dimensions • Changing, but Original Values are Irrelevant A phone number in a customer record • Slowly Changing Dimensions (SCD) A customer address, manager • Rapidly Changing DimensionsIncome range of a customer • Continuously Changing Dimensions Customer age

  29. Type 1: Overwrite the Value • The type 1 response is easy to implement, but: • it does not maintain any history ofprior attribute values • any preexisting aggregations based on thedepartment value will need to be rebuilt

  30. Type 2: Add a Dimension Row • The type 2 response is the primary technique for accurately tracking slowly changingdimension attributes. It is extremely powerful because the new dimension row automaticallypartitions history in the fact table. • It’s not suitable for dimension tables that already exceed a million rows

  31. Type 2: Add a Dimension Row

  32. Type 3: Add a Dimension Column • The type 3 slowly changing dimension technique allows us to see new and historicalfact data by either the new or prior attribute values.

  33. Hybrid SCD TechniquesSeries of Type 3 Attributes Predictable Changes withMultiple Version Overlays • Report each year’s sales using the district map for that year. • Report each year’s sales using a district map from an arbitrarydifferent year. • Report an arbitrary span of years’ sales using a single district map fromany chosen year. The most common version of this requirement wouldbe to report the complete span of fact data using the current district map.

  34. Hybrid SCD TechniquesType 2 with "Current" Overwrite Unpredictable Changes withSingle-Version Overlaypreserves historical accuracywhile supporting the abilityto report historical data according to the current values

  35. Dimension Table Staging

  36. Dimension Table Staging

  37. Junk Dimensions • What to do with flags and indicators • Leave the flags and indicators unchanged in the fact table row. • Make each flag and indicator into its own separate dimension • Strip out all the flags and indicators from the design. • A junk dimension is a convenient grouping of typically low-cardinality flags and indicators

  38. Junk Dimensions • Whether to use junk dimension • 5 indicators, each has 3 values -> 243 (35) rows • 5 indicators, each has 100 values -> 100 million (1005) rows • When to insert rows in the dimension

  39. Multiple Currencies

  40. Customer Dimension • Critical element for effective CRM • The most challenging dimension for anydata warehouse • extremely deep (with millions of rows) • extremely wide (with dozens or evenhundreds of attributes) • sometimes subject to rather rapid change

  41. Customer Dimension Name and Address Parsing

  42. Customer Dimension Other Common Customer Attributes • Gender • Ethnicity • Age or other life-stage classifications • Income or other lifestyle classifications • Status (for example, new, active, inactive, closed) • Referring source • Business-specific market segment • Scores characterizing the customer, such as purchase behavior, paymentbehavior, product preferences

  43. Customer Dimension Aggregated Facts as Attributes • These attributes are to be used for constraining and labeling;they are not to be used in numeric calculations • Focus on those which will beused frequently • Minimize the frequency withwhich these attributes need to be updated • Replace metrics with more meaningfuldescriptive values, such as “High Spender”

  44. Dimension Outriggers for aLow-Cardinality Attribute Set

  45. Rapidly Changing CustomerDimensions • Challenges • It generally takes too long to constrain or browse among the relationships in such a big table • It is difficult to use previously described techniques for tracking changes in these large dimensions • One solution is to break offfrequently analyzed or frequently changing attributes into a separate dimension,referred to as a minidimension

  46. Rapidly Changing CustomerDimensions The Mini Dimension with "Current" Overwrite

  47. Rapidly Changing CustomerDimensions • The minidimension terminology refers to when the demographics key is part of thefact table composite key • If the demographics key is a foreign key in the customer dimension,we refer to it as an outrigger

  48. Rapidly Changing CustomerDimensions Type 2 with Natural Keys in Fact Table

  49. Implications of Type 2 CustomerDimension Changes • Be careful to avoidovercounting because we may have multiple rows in the customer dimensionfor the same individual • COUNT DISTINCT • A most recent row indicator • The comparison operators depend on the businessrules used to set our effective/expiration dates.

  50. Customer Behavior Study Groups • Capture the keysof the customers or products whose behavior you are tracking

More Related