1 / 57

Database Schema

Database Schema. Akanksha (HT00-6255Y) E. V. Anantha Lakshmi (HD99-9054A) Sze Eng Koon (HD99-2800H) Material Available at http://www.comp.nus.edu.sg/~szeek/cs6203. Agenda. Data warehouse Identifying facts and dimensions Designing fact tables Designing dimension tables

amelia
Télécharger la présentation

Database Schema

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Database Schema Akanksha (HT00-6255Y) E. V. Anantha Lakshmi (HD99-9054A) Sze Eng Koon (HD99-2800H) Material Available at • http://www.comp.nus.edu.sg/~szeek/cs6203

  2. Agenda • Data warehouse • Identifying facts and dimensions • Designing fact tables • Designing dimension tables • Designing starflake schemas • Multidimensional schema

  3. Data Warehouse • Relational database designed for query and analysis • Contains historical data derived from transaction data • Enables organization to consolidate data from several sources

  4. Schema • Consists of DB objects such as tables, views, indexes • There are many types of schema, e.g., star, snowflake

  5. Fact • In decision support DW, a number of queries ask about an essential fact • E.g., top 20% spending customers • Fact: Loyalty card transactions • Characteristics • Transactions that occurred in past and won’t change • Analyzed in many ways by cross-referencing with attributes

  6. Fact Table • A table that contains facts • Contains numeric, additive fields (measurements of the business) • It has two types of columns: • containing facts • foreign keys to dimension tables

  7. Dimension • A structure that categorizes data to enable end users to answer business questions • e.g., Time, Location, Customer

  8. Example • Video chain store builds DW to analyze sales of its products across all stores over time • Effect of promoting one product on sale of related product that is not promoted? • Product sales before and after the promotion? • Facts • sales (units sold or rented) and profits • Dimensions • products, locations (stores), promotions, time

  9. Major Considerations • Data warehouse supports business process, not specific queries • Must understand what information will be used • Challenge: • Solution should be effective for a reasonable time period (3-5 years) • Design should allow unknown queries to be able to perform

  10. Identifying Facts & Dimensions Given large entity model • Look for elemental transactions in business process • Determine key dimensions that apply to facts • Check that a candidate fact is not a dimension table • Check that a candidate dimension is not a fact table • If 3 or 4 changes assignment go to step 2

  11. 1) Look for Elemental Transactions in Business • Examine business and identify transactions of interest • Transactions that describe fundamental events • e.g., record of account transactions made by a banking customer • Is information operated on by business? • Don’t assume reported transactions are facts • Design to store most detailed transaction to prevent restructuring in future

  12. Example • Consider sales information in retail shop • Stored record Store_id, Product_id, Date, Sale • Business operates on basket, i.e.,individual sales transaction and not aggregate sales • Fact should be basket transaction

  13. 2) Determine Key Dimensions for Facts • Find out entities that are associated with the facts • Need to focus on key dimensions

  14. Account_id Owner_id Account_type Balance Account_id Owner_id Account_type Balance Owner_id Name Address Contact_no account customer analyze account transaction by account Account transaction Owner_id Name Address Contact_no customer analyze how customer uses services

  15. 3) Check If Fact Is Dimension • Entities may appear to be fact tables but may be combination of both facts and dimensions • Consider whether attributes were designed around data entry requirements • Data in fact table rows should not vary over time

  16. Example Address Mistaken for fact table because users will query on information about address

  17. Example (Cont.) Address Date Cable laid Date sub added • Will affect the DB size • Query of operational events much easier

  18. 4) Check If Dimension Is Fact • An entity could be either dimensions or facts • Consider focal point of analysis of the business • Good check: by how many dimensions entity can be viewed, >3 probably fact

  19. Account_id Owner_id Account_type Balance Owner_id Name Address Contact_no account customer Account transaction Owner_id Name Address Contact_no customer analyze how customer uses services

  20. Customer_id Name Age Location_id Occupation_id Location Occupation customer Locationwise % patronage

  21. Designing Fact Table • No practical limit to size of fact table • Balance between size of fact table and value of data

  22. Techniques to Reduce Size of Fact Tables • Identification of significant historical period for each supported function • Determination of samples to satisfy requirements of detailed data • Selection of relevant / appropriate columns

  23. Techniques (Cont.) • Minimization of column size • Determination of intelligent or non-intelligent keys • Determination of format of storing time • Partition the fact table (Next group)

  24. 1.Identification Of Significant Historical Period • Draw retention period graph showing period and detail necessary for each business function • e.g., comparison of sales of seasonal products such as winter clothing, text books etc SalesAnalysis Jan ‘99 July ‘98 Jan ‘98

  25. 2. Determine samples to satisfy requirement of detailed data • Appropriate in situations where analysis requirement is to analyze trends • The business requirement does not require all the detailed data • Store samples and aggregate the rest

  26. 3.Selection of Relevant / appropriate columns • Decide whether optional data needs to be stored in the fact table • e.g., for a table related to bank accounts • The columns typically required are: Acc#, IC#, Amt#, Bank # or Branch# • Optional columns related to Teller Name and Time of entry of records to be decided based on queries’ requirements

  27. 4. Minimization of Column Size • Saving per row can have significant effect on total table size • e.g., a typical transaction table contains 4 billion rows (2 million customers, 3 transactions per day per customer for 2 or 3 yr period) saving of 20 bytes per row will save 20 x 4 =80 GB

  28. 5. Intelligent /Non-intelligent keys • A fact table can be structured in two ways using • Intelligent keys: • Each key represents unique identifier for the item in the real world • Non-Intelligent: • Each unique key is generated automatically and refers to unique identifier of the item in the real world

  29. Intelligent Keys Sales Salesman_name Customer Vehicle_desc Date Sales_price

  30. Non – Intelligent Keys Sales Analysis Salesman Time Sales id Salesman Name Time id Date Vehicle Sales Vehicle id Vehicle Description Sales id Cust id Vehicle id Time id Sales_price

  31. Comparison • Advantages • Intelligent keys improve query performance by avoiding joins • Disadvantages • For intelligentkeys,if any of the unique identifier changes, a lot of rows in fact table have to be updated to reflect new identifiers

  32. 6. Determination of format of storing time • Time information can be stored in many ways within fact table, e.g., • Use of foreign key into time dimension table • Actual physical dates are stored within dimension table

  33. Determination of format of storing time (Cont.) • Possible techniques to store dates are • Storing the physical date • Storing an offset from the inherent start of the table • Storing a date range • Use physical dates rather than foreign keys as the cost of storing is minimum

  34. Determination of format of storing time (Cont.) • A fact table is usually partitioned in a unit of time (e.g. week, month,quarter etc) • Dates can be referred as offsets from the inherent start • Storage costs are low. Two bytes enough to store up to 31 numbers

  35. Determination of format of storing time (Cont.) • Disadvantages: • Queries have to be constructed to convert physical dates into date offsets. • E.g.: To return sum of transactions on ‘9-Jan-00’ Select count(*) from customer where date=‘9-Jan-00’ – ‘1-Jan-00’; • Solution: • If access tools can’t interpret offset, create view that looks like logical table by adding offset to start date

  36. Date Range • Significant saving in disk capacity and query performance • Reduces table size • Has significance when changes occur substantially over time

  37. Date Range (Cont.) • Select count (*) from customer where date between (st_date,end_date); • A number of access tools may not be able to process fact tables in this view • Sometimes a Cartesian product is formed by joining time dimension table against date range in the fact table

  38. Date Range (Cont.) • Cost of Cartesian product is high, it requires • Significant processing power • Temporary space • Consider the use of date range if access tools can cope directly with structure and do not allow data expansion

  39. Designing Dimension Table • Done after fact tables are designed • dimension tables support querying on fact tables • Wrong design is not a major disaster • volumes relatively small • restructuring costs relatively small

  40. Considerations • Star dimension • Hierarchies and networks • Dimensions that vary over time • Managing large dimension tables

  41. Star Dimension • Denormalized tables • put reference information into a single table • speed up query performance • Rely on perceived use of information by typical queries • Not appropriate when additional data is not accessed often • overhead of scanning expanded table

  42. mod#, title, dept# 1 • Put under same table n dept#, department, fac# 1 n mod#, title, dept#, department, faculty#, faculty fac#, faculty Star Dimension (cont.) • View enrollment size by department, faculty acad_year, mod#, enrollment

  43. Star Dimension (cont.) • Shape of a star dimension dimension dimension fact dimension dimension

  44. Hierarchies and Network • Not possible to denormalize all entities into star dimensions • Many-to-many relationships should not be denormalized into star dimension • not efficient • Multiple hierarchies are used to represent different views • denormalized the hierarchy likely to be used by most queries

  45. Dimension Dimension Dimension Dimension Dimension Dimension Hierarchies and Network (cont.) • Example Fact Dimension Dimension Dimension Dimension

  46. Hierarchies and Network (cont.) • Useful if query profiles do not change to a point where other hierarchy becomes more popular • existing star dimension table is useless • For minor changes • add columns • note impact on canned queries if columns are dropped or modified

  47. Dimensions That Vary Over Time • Change in business • t-shirts move from menswear to unisex • baked beans move from canned foods to canned vegetables • May need to support queries that compare facts based on present and past groupings • for example, to compare sales by departments between past and present, need to use both old and new classification • instead of updating record in dimension table, add new record and validity dates

  48. Dimensions That Vary Over Time (cont.) • Year 1996 select sum(s.revenue_achieved) from sales_year_to_date s, product_dimension pd where s.product_id = pd.id and pd.department = ‘Menswear’ and pd.end_date > 1-Jan-96 and pd.start_date < 31-Dec-96; • Year 1997 select sum(s.revenue_achieved) from sales_year_to_date s, product_dimension pd where s.product_id = pd.id and pd.department = ‘Menswear’ and pd.end_date > 1-Jan-97 and pd.start_date < 31-Dec-97;

  49. Dimensions That Vary Over Time (cont.) • If query needs to compare a significant event in corporate calendar, year on year, keep in time dimension • Great Singapore Sales in 1998 is from 15 Jul to 15 Aug • Great Singapore Sales in 1999 is from 10 Jun to 10 Jul • Instead of just joining sales_year_to_date and product_dimension tables, now we also join with time_dimension table.

  50. Managing Large Dimension Tables • Need to watch out for dimension tables that grow too large over time • especially for those dimension tables that store new values instead of updating existing tuples • Indicated by: • size similar to that of fact table • full table scan takes too long • Horizontal partition the large dimension tables • product_dimension_1990s • product_dimension_2000s

More Related