1 / 29

CS 345: Topics in Data Warehousing

CS 345: Topics in Data Warehousing. Tuesday, October 12, 2004. Review of Thursday’s Class. Facts Semi-additive facts “Factless” fact tables Slowly Changing Dimensions Overwrite history Preserve history Hybrid schemes More dimension topics Dimension roles Junk dimension

alicia
Télécharger la présentation

CS 345: Topics in Data Warehousing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS 345:Topics in Data Warehousing Tuesday, October 12, 2004

  2. Review of Thursday’s Class • Facts • Semi-additive facts • “Factless” fact tables • Slowly Changing Dimensions • Overwrite history • Preserve history • Hybrid schemes • More dimension topics • Dimension roles • Junk dimension • More fact topics • Multiple currencies • Master/Detail facts and fact allocation • Accumulating Snapshot fact tables

  3. Outline of Today’s Class • Customer Relationship Management (CRM) • Dimension-focused queries • Drill-across • Conformed dimensions • Customer dimension • Behavioral attributes • Auxiliary tables • Techniques for very large dimensions • Outriggers • Mini-dimensions • Hierarchies • Bridge tables

  4. Customer Relationship Management (CRM) • Currently a hot topic in business data analysis • Idea: Gain better understanding of customer behavior by integrating data from various sources • Multiple interaction types • Orders • Returns • Customer support • Billing • Service / repairs • Multiple interaction channels • Retail store • E-mail • Call center (Inbound / Outbound) • Web site

  5. CRM questions • Customer profitability • Identify most / least profitable customers • 80/20 rule • Customer retention • Which customers are most likely to defect to a competitor? • Which retention measures work best? • Customer acquisition • Which prospects are most promising? • What offers will entice them to become customers? • Up-sell / Cross-sell • Gain additional business from existing customers • Provide targeted offers during “inbound” communications

  6. Dimension-focused Queries • Standard OLAP queries are fact-focused • Query touches one fact table and its associated dimensions • Some types of analysis are dimension-focused • Bring together data from different fact tables that have a dimension in common • Common dimension used to coordinate facts • Sometimes referred to as “drilling across”

  7. Drill-Across Example • Example scenario: • Sales fact with dimensions (Date, Customer, Product, Store) • CustomerSupport fact with dimensions (Date, Customer, Product, ServiceRep) • Question: How does frequency of support calls by California customers affect their purchases of Product X? • Step 1: Query CustomerSupport fact • Group by Customer SSN • Filter on State = California • Compute COUNT • Query result has schema (Customer SSN, SupportCallCount) • Step 2: Query Sales fact • Group by Customer SSN • Filter on State = California, Product Name = Product X • Compute SUM(TotalSalesAmt) • Query result has schema (Customer SSN, TotalSalesAmt) • Step 3: Combine query results • Join Result 1 and Result 2 based on Customer SSN • Group by SupportCallCount • Compute COUNT, AVG(TotalSalesAmt)

  8. A Problem with the Example • What if some customers don’t make any support calls? • No rows for these customers in CustomerSupport fact • No rows for these customers in result of Step 1 • No data for these customers in result of Step 3 • Solution: use outer join in Step 3 • Customers who are in Step 2 but not Step 1 will be included in result of Step 3 • Attributes from Step 1 result table will be NULL for these customers • Convert these NULLs to an appropriate value before presenting results • Using SQL NVL() function

  9. Conformed Dimensions • Bottom-up data warehousing approach builds one data mart at a time • Drill-across between data marts requires common dimension tables • Common dimensions and attributes should be standardized across data marts • Create master copy of each common dimension table • Three types of “conformed” dimensions: • Dimension table identical to master copy • Dimension table has subset of rows from the master copy • Can improve performance when many dimension rows are not relevant to a particular process • Dimension table has subset of attributes from master copy • Allows for roll-up dimensions at different grains

  10. Conformed Dimension Example • Monthly sales forecasts • Predicted sales for each brand in each district in each month • POS Sales fact recorded at finer-grained detail • Product SKU vs. Brand • Date vs. Month • Store vs. District • Use roll-up dimensions • Brand dimension is rolled-up version of master Product dimension • One row per brand • Only include attributes relevant at brand level or higher • Month dimension is rolled-up Date • District dimension is rolled-up Store • Schema • Sales (Date, Product, Store, Promotion, Transaction ID) • Forecast (Month, Brand, District)

  11. Drill-Across Example • Question: How did actual sales diverge from forecasted sales in Sept. ‘04? • Drill-across between Forecast and Sales • Step 1: Query Forecast fact • Group by Brand Name, District Name • Filter on MonthAndYear =‘Sept 04’ • Calculate SUM(ForecastAmt) • Query result has schema (Brand Name, District Name, ForecastAmt) • Step 2: Query Sales fact • Group by Brand Name, District Name • Filter on MonthAndYear =‘Sept 04’ • Calculate SUM(TotalSalesAmt) • Query result has schema (Brand Name, District Name, TotalSalesAmt) • Step 3: Combine query results • Join Result 1 and Result 2 on Brand Name and District Name • Result has schema (Brand Name, District Name, ForecastAmt, TotalSalesAmt) • Outer join unnecessary assuming: • Forecast exists for every brand, district, and month • Every brand has some sales in every district during every month

  12. The Customer Dimension • Customer dimensions can be very wide • Often dozens or even hundreds of attributes • Contact information (name, address, phone, e-mail) • Demographics (age, ethnicity, gender, education, profession, income, household size, etc.) • Psychographics (interests, values, beliefs, attitudes) • Dates (birthday, first purchase, last purchase, online reg. date) • Behavioral scores (RFM, churn propensity, etc.) • Data available from many sources • Information provided directly by customers • Prospect lists acquired from partners or vendors • Syndicated data • Market research • Customs data • Data derived from warehouse analysis

  13. Behavioral Attributes • Customers can be segmented based on past behavior • Aggregated fact data converted to dimensional attributes • Examples: • RFM scoring • Recency of last purchase • Frequency of purchases • Monetary value of purchases • Scores based on predictive models • Propensity to churn • Probability of default • Segmentation based on clustering algorithms • Raw aggregated data • Total dollar sales in past year

  14. Behavioral Attributes • Two techniques for handling behavioral attributes • Dimension attributes generated during ETL process • Stored in dimension table • Good query performance • Limited flexibility • Preserving history possible (but may be expensive) • Virtual attributes created “on demand” via user queries • Stored in auxiliary tables • Very flexible and customizable • Increased management complexity • Increased query complexity • Query performance may suffer • Can’t easily preserve history

  15. Auxiliary Tables for User-Defined Attributes User-CreatedAuxiliary Table Customer Dimension Natural keyof customerdimension User-definedattribute fromquery result Name is natural key • Join dimension and auxiliary table using natural key • Join result looks like “expanded” customer dimension

  16. Another Use of Auxiliary Tables • Track a set of customers over time • For example, a focus group or pre-selected sample • Set of customers may be defined based on a query • Query results may change over time as customer attributes slowly change • How to preserve the initial set? • Create single-column auxiliary table containing natural key of customers in the set • Join to the auxiliary table to filter based on the initial customer set

  17. Continuous vs. Discrete Values • Some attributes have large number of possible values on a continuous scale • Income • Age • Most simple behavioral attributes are of this type • TotalSalesOfProductXIn2003 • Disadvantages of continuous attributes: • Grouping by continuous attribute produces huge, meaningless report • Number of unique attribute combinations explodes • Greater number of rows in dimension • More frequent changes to value of dimension attributes • Group continuous attributes into discrete bands • Like a histogram • Instead of Salary = 47540, use Salary = $40K-$50K • Avoids above disadvantages • Downside = loss of information

  18. Very Large Dimensions • Customer dimensions can be very wide • Dozens or hundreds of attributes • Customer dimensions can be very large • Tens of millions of rows in some warehouses • Sometimes includes prospects as well as actual customers • Size can lead to performance challenges • One case when performance concerns can trump simplicity • Can we reduce width of dimension table? • Can we reduce number of rows caused by preserving history for slowly changing dimension?

  19. Outrigger Tables • Limited normalization of large dimension table to save space • Identify attribute sets with these properties: • Highly correlated • Low in cardinality (compared to # of customers) • Change in unison • Example: • External data provider computes demographic data for each county • 100 demographic attributes are provided • Updates are supplied every six months • Follow these steps for each attribute set: • Create a separate “outriggerdimension” for each attribute set • Remove the attributes from the customer dimension • Replace with a foreign key to the outrigger table • No foreign key from fact row to outrigger • Outrigger attributes indirectly associated with facts via customer dim.

  20. Outrigger Example Customer Dimension County Demographics Outrigger

  21. Outrigger Tables • Advantages: • Space savings • Customer dimension table becomes narrower • Outrigger table has relatively few rows • One copy per county vs. one copy per customer • Disadvantages: • Additional tables introduced • Accessing outrigger attributes requires an extra join • Users must remember which attributes are in outrigger vs. main customer dimension • Creating a view can solve this problem

  22. Mini-Dimensions • Some attributes change relatively frequently • Behavior-based scores • Certain demographic attributes • Age, Income, Marital Status, # of children • How to preserve history without row explosion? • Some attributes are queried relatively frequently • Queries using huge customer dimension are slowed • How to improve query performance? • Create a mini-dimension: • Remove frequently-changing or frequently-queried attributes from the customer dimension • Add them to a separate mini-dimension table instead • Discretize mini-dimension attributes to reduce cardinality • Group continuously-valued attributes into buckets or bands • Example: Age < 20, Age 20-29, Age 30-39, Age 40-49, Age 50+ • Include foreign keys to both customer dimension & mini-dimension in fact table

  23. Mini-Dimensions • Advantages: • History preserved without space blow-up • Fact table captures historical record of attribute values • Mini-dimension has small number of rows • # of unique combinations of mini-dimension attributes is small • Consequence of discretization • Limit number of attributes in a single mini-dimension! • Improved performance for queries that use mini-dimension • At least for those queries that can avoid the main customer dimension • Disadvantages: • Fact table width increases • Due to increased number of dimension foreign keys • Information lost due to discretization • Less detail is available • Impractical to change bucket / band boundaries • Additional tables introduced • Users must remember which attributes are in mini-dimension vs. main customer dimension

  24. Outrigger vs. Mini-Dimension CustomerDimension CustomerDimension Fact Fact Outrigger Mini-dimension

  25. Outrigger vs. Mini-Dimension • Mini-dimension approach • Explicit link between fact table and mini-dimension • No explicit link between customer dimension and mini-dimension • Difficult to express queries that group past customer behavior based on current demographic values • Implicit association via fact table • Outrigger approach • No explicit link between fact table and outrigger • Explicit link between customer dimension and outrigger • Associating facts with outrigger requires join through customer dimension • Preserving history for rapidly-changing attributes leads to customer dimension blow-up • Hybrid approach • Separate some attributes into their own mini-dimension • Add foreign key to mini-dimension to both fact table and customer dimension • Customer dimension foreign key updated using “overwrite history” semantics • Queries based on historically accurate attribute values use fact table foreign key • Queries based on latest attribute values use customer dimension foreign key • Greater expressive power, and greater risk of confusing users!

  26. More Outriggers / Mini-Dims • Lots of information about some customers, little info about others • A common scenario • Example: web site browsing behavior • Web User dimension (= Customer dimension) • Unregistered users • User identity tracked over time via cookies • Limited information available • First active date, Latest active date, Behavioral attributes • Possibly ZIP code through IP lookup • Registered users • Lots of data provided by user during registration • Many more unregistered users than registered users • Most attribute values are unknown for unregistered users • Split registered user attributes into a separate table • Either an outrigger or a mini-dimension • For unregistered users, point to special “Unregistered” row

  27. Handling Hierarchies • Hierarchical relationships among dimension attributes are common • There are various ways to handle hierarchies • Store all levels of hierarchy in denormalized dimension table • The preferred solution in almost all cases! • Create “snowflake” schema with hierarchy captured in separate outrigger table • Only recommended for huge dimension tables • Storage savings have negligible impact in most cases • What about variable-depth hierarchies? • Examples: • Corporate organization chart • Parts composed of subparts • Previous two solutions assumed fixed-depth • Creating recursive foreign key to parent row is a possibility • Employee dimension has “boss” attribute which is FK to Employee • The CEO has NULL value for boss • This approach is not recommended • Cannot be queried effectively using SQL • Alternative approach: bridge table

  28. Fact Bridge Customer parent_id cust_id cust_id child_id Bridge Tables Customer 1 • Customer dimension has one row for each customer entity at any level of the hierarchy • Separate bridge table has schema: • Parent customer key • Subsidiary customer key • Depth of subsidiary • Bottom flag • Top flag • One row in bridge table for every (ancestor, descendant) pair • Customer counts as its own Depth-0 ancestor • 16 rows for the hierarchy at right • Fact table can join: • Directly to customer dimension • Through bridge table to customer dimension Customer 2 Customer 3 Customer 4 Customer 5 Customer 6 Customer 7

  29. Using Bridge Tables in Queries • Two join directions • Navigate up the hierarchy • Fact joins to subsidiary customer key • Dimension joins to parent customer key • Navigate down the hierarchy • Fact joins to parent customer key • Dimension joins to subsidiary customer key • Safe uses of the bridge table: • Filter on customer dimension restricts query to a single customer • Use bridge table to combine data about that customer’s subsidiaries or parents • Filter on bridge table restricts query to a single level • Require Top Flag = Y • Require Depth = 1 • For immediate parent / child organizations • Require (Depth = 1 OR Top Flag = Y) • Generalizes the previous example to properly treat top-level customers • Other uses of the bridge table risk over-counting • Bridge table is many-to-many between fact and dimension

More Related