The Fitness Benchmark for Data Warehousing


  1. The Fitness Benchmark for Data Warehousing. University of Minnesota, Fall 2006. CSci 8701, Group 12. Michael Goshey

  2. Outline • Problem statement • Related work • Contributions • The Fitness approach • Application of benchmark • Assumptions • Summary and conclusions • Suggestions for future work • Citations

  3. Problem Statement • Given: an existing database with its schema, plus an accompanying workload of user-submitted queries • Find: an empirical measurement that rates the relationship between the database and its queries • Objectives: the measurement should reflect best practices, focus on high-impact concerns, apply to the major database management systems, be easy to understand, and be easy to remedy when diagnosed as problematic • Constraints: limited specifically to decision support (data warehouse) databases • Importance: data warehouses are a major segment of the database field today, and their relative importance continues to grow • Difficulty: designing measures and benchmarks can be a political, subjective exercise; broad applicability and relevance are required to achieve consensus support

  4. Related Work • Decision support optimization: much work on optimizing data warehouses, but the focus is on improving query performance through techniques such as clustering schemes, cache management, materialized view maintenance, improved indexing, and query approximation • Database benchmarking: OLTP and decision support benchmarks such as TPC (the industry standard), DWEB, and others, but their primary focus is on measuring query performance • Data warehouse quality: broad coverage of data accuracy assurance, but these efforts do not explicitly address decision support effectiveness • Metadata in data warehousing: best practices for using metadata in data warehouse environments from Kimball, Inmon, and others, but they treat data age and similar attributes as the key drivers of change rather than employing usage-based metadata as the framework for a decision support benchmark

  5. Contributions • A data warehouse benchmark that evaluates decision support capability rather than query execution speed • How fast does it answer queries?...No. • How well does it support decisions?...Yes. • A framework of constituent metrics • Metadata-driven assessment

  6. The Fitness Approach • Selection criteria • desirability: accepted as a best practice • impactfulness: related to high-impact concerns • portability: applicable to the major DBMSs • simplicity: built on easy-to-understand concepts • improvability: easily remedied when problematic • Metrics • Utilization: thorough use of existing data structures • Coverage: non-reliance on OLTP source systems • Granularity: robust support of required aggregations

  7. Utilization • based on the notion of requested structures: those elements of the warehouse (tables, materialized views, etc.) that are directly referenced in the queries that users submit. • measures the ratio of the size of requested structures to the size of all structures in the warehouse. A high utilization score indicates that the warehouse is efficiently sized.
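
A minimal sketch of how the utilization ratio might be computed from a query log and a map of structure sizes. The helper names and the whole-word regex matching are illustrative assumptions standing in for a real SQL parser, not the authors' implementation:

    import re

    def requested_structures(queries, all_structures):
        # Structures (tables, materialized views, etc.) directly
        # referenced by at least one user-submitted query.
        requested = set()
        for query in queries:
            for name in all_structures:
                if re.search(r'\b' + re.escape(name) + r'\b', query, re.IGNORECASE):
                    requested.add(name)
        return requested

    def utilization(queries, structure_sizes):
        # Ratio of the size of requested structures to the size of
        # all structures in the warehouse; 1.0 means fully utilized.
        requested = requested_structures(queries, structure_sizes)
        used = sum(structure_sizes[s] for s in requested)
        total = sum(structure_sizes.values())
        return used / total if total else 0.0

For example, a workload that touches only FactInternetSales, scored against sizes {"FactInternetSales": 1200, "DimPromotion": 50}, yields 1200/1250 = 0.96.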

  8. Coverage • based on the notion of contained queries: the subset of queries submitted that do not require access to or reference of source (transactional) systems. • measures the ratio of contained queries to overall queries. A high coverage score indicates that the warehouse includes the necessary data sources.
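
Under the same caveats, coverage might be sketched as below; extract_table_names is an invented helper, and the FROM/JOIN regex is a crude stand-in for real query parsing:

    import re

    def extract_table_names(query):
        # Naively collect every identifier that follows FROM or JOIN.
        return set(re.findall(r'\b(?:FROM|JOIN)\s+([\w.]+)', query, re.IGNORECASE))

    def coverage(queries, warehouse_structures):
        # Ratio of contained queries (those touching only warehouse
        # structures, never OLTP sources) to all submitted queries.
        warehouse = set(warehouse_structures)
        contained = sum(1 for q in queries
                        if extract_table_names(q) <= warehouse)
        return contained / len(queries) if queries else 0.0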

  9. Granularity • based on the notion of distinct groups: the set of unique fields appearing in one or more GROUP BY aggregations across the workload. • measures the ratio of distinct groups to total groups. A high granularity score indicates that the warehouse provides enough aggregated data to avoid most redundant aggregation.
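
One plausible reading of this metric, sketched under the assumption that a "group" is a single field occurrence inside a GROUP BY clause; again, the regex is only a placeholder for a parser:

    import re

    GROUP_BY = re.compile(r'GROUP\s+BY\s+(.+?)(?:\bORDER\s+BY\b|\bHAVING\b|;|$)',
                          re.IGNORECASE | re.DOTALL)

    def granularity(queries):
        # Ratio of distinct GROUP BY fields to total GROUP BY field
        # occurrences. Repeated groupings on the same fields lower
        # the score, flagging aggregations worth materializing.
        occurrences = []
        for query in queries:
            match = GROUP_BY.search(query)
            if match:
                occurrences += [f.strip().lower() for f in match.group(1).split(',')]
        return len(set(occurrences)) / len(occurrences) if occurrences else 1.0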

  10. Fitness • combines the three metrics • introduces three coefficients (a, b, c) for weighting • initially set a = 0.3, b = 0.5, c = 0.2 • 1 = highest score, 0 = lowest score
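
The slide does not spell out how the metrics are combined; since the coefficients sum to 1 and each metric lies in [0, 1], a weighted sum is the natural reading, and it is sketched here as an assumption:

    def fitness(utilization, coverage, granularity, a=0.3, b=0.5, c=0.2):
        # Weighted combination of the three metrics. With a + b + c = 1
        # and each metric in [0, 1], the fitness score stays in [0, 1].
        return a * utilization + b * coverage + c * granularity

For example, fitness(0.80, 0.90, 0.60) = 0.24 + 0.45 + 0.12 = 0.81.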

  11. Application of Benchmark • Reference Data Set from Microsoft • AdventureWorks • AdventureWorksDW • Reference Metadata • DBMS system catalog • user queries (fabricated)
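
For the system-catalog side of the metadata, per-table sizes on SQL Server (the AdventureWorks host) could be gathered roughly as follows; the connection string is a placeholder, and this catalog query is one common idiom, not necessarily the one used in the project:

    import pyodbc

    CATALOG_SQL = """
        SELECT t.name, SUM(p.rows) AS row_count
        FROM sys.tables AS t
        JOIN sys.partitions AS p ON p.object_id = t.object_id
        WHERE p.index_id IN (0, 1)  -- heap or clustered index only
        GROUP BY t.name
    """

    def table_row_counts(connection_string):
        # Approximate per-table row counts from the system catalog,
        # usable as the structure sizes fed to the utilization metric.
        with pyodbc.connect(connection_string) as conn:
            return {name: rows for name, rows in conn.execute(CATALOG_SQL)}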

  12. 1st Application of Benchmark

  13. 2nd Application of Benchmark

  14. Assumptions • Correctness of queries: queries request only the data they need, from the correct sources. • Ad hoc nature of queries: canned queries that are known in advance and not subject to change do not require this type of benchmarking. • Accessibility of both the OLTP (AdventureWorks) and decision support (AdventureWorksDW) databases: the coverage metric requires visibility into both.

  15. Summary and Conclusions • Challenged the prevailing notion that speed of query execution is the critical metric in benchmarking data warehouses. • Defined the new fitness benchmark and a framework of metrics for rating the ability of a data warehouse to provide decision support. • Applying the benchmark to an actual data warehouse is feasible and provides meaningful results. • The fitness benchmark can be used to guide directed change and measurable improvement in a warehouse’s ability to provide decision support.

  16. Suggestions for Future Work • Automate the scoring engine • Fine-tune the weighting coefficients • Increase the precision of the metrics • Replace simulated queries with actual user queries • Test with a much larger data warehouse • Incorporate additional metrics

  17. Citations • S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy, “Join synopses for approximate query answering,” in SIGMOD Conference, 1999, pp. 275–286. • E. Baralis, S. Paraboschi, and E. Teniente, “Materialized views selection in a multidimensional database,” in VLDB’97, Proceedings of the 23rd International Conference on Very Large Data Bases. Morgan Kaufmann, 1997, pp. 156–165. • S. Chaudhuri and U. Dayal, “An overview of data warehousing and OLAP technology,” SIGMOD Record, vol. 26, no. 1, pp. 65–74, 1997. • J. Darmont, F. Bentayeb, and O. Boussaid, “The design of DWEB,” ERIC, University of Lyon 2, France, Tech. Rep., June 2005. • ——, “DWEB: A data warehouse engineering benchmark,” in 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 05), Copenhagen, Denmark, ser. LNCS, vol. 3589. Heidelberg, Germany: Springer, August 2005, pp. 85–94. • H. Garcia-Molina, W. Labio, and J. Yang, “Expiring data in a warehouse,” in VLDB’98, Proceedings of the 24th International Conference on Very Large Data Bases. Morgan Kaufmann, 1998, pp. 500–511. • H. Gupta, V. Harinarayan, A. Rajaraman, and J. D. Ullman, “Index selection for OLAP,” in Proceedings of the Thirteenth International Conference on Data Engineering. IEEE Computer Society, 1997, pp. 208–219. • J. Hammer, H. Garcia-Molina, J. Widom, W. Labio, and Y. Zhuge, “The Stanford data warehousing project,” IEEE Data Eng. Bull., vol. 18, no. 2, pp. 41–48, 1995. • V. Harinarayan, A. Rajaraman, and J. D. Ullman, “Implementing data cubes efficiently,” in Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. ACM Press, 1996, pp. 205–216. • H. V. Jagadish, L. V. S. Lakshmanan, and D. Srivastava, “Snakes and sandwiches: Optimal clustering strategies for a data warehouse,” in SIGMOD Conference. ACM Press, 1999, pp. 37–48. • R. Kimball, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd Edition. John Wiley and Sons, 2002. • R. Kimball, L. Reeves, M. Ross, and W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Tools and Techniques for Designing, Developing, and Deploying Data Warehouses. John Wiley and Sons, 1998.

  18. Citations (cont.) • Y. W. Lee, D. M. Strong, B. K. Kahn, and R. Y. Wang, “AIMQ: A methodology for information quality assessment,” Information and Management, vol. 40, no. 2, pp. 133–146, 2002. • S. T. Leutenegger and D. M. Dias, “A modeling study of the TPC-C benchmark,” in SIGMOD Conference. ACM Press, 1993, pp. 22–31. • Microsoft. (2006) AdventureWorks Sample Data Warehouse. SQL Server 2005 Books Online: accessed 11/10/2006. [Online]. Available: http://msdn2.microsoft.com/en-us/library/ms124623.aspx • M. Pöss and C. Floyd, “New TPC benchmarks for decision support and web commerce,” SIGMOD Record, vol. 29, no. 4, pp. 64–71, 2000. • P. Scheuermann, J. Shim, and R. Vingralek, “WATCHMAN: A data warehouse intelligent cache manager,” in VLDB’96, Proceedings of the 22nd International Conference on Very Large Data Bases, 1996, pp. 51–62. • M. Sheina. (2006) A new approach to data warehousing? Computer Business Review Online: accessed 10/10/2006. [Online]. Available: http://www.cbronline.com/article_cbr.asp?guid=902820B7-06C8-462D-AF69-1FA025C75885 • S. E. Sim, S. Easterbrook, and R. C. Holt, “Using benchmarking to advance research: A challenge to software engineering,” in Proceedings of the 25th International Conference on Software Engineering. Los Alamitos, CA, USA: IEEE Computer Society, 2003, p. 74. • J. Skyt, C. S. Jensen, and T. B. Pedersen, “Specification-based data reduction in dimensional data warehouses,” in Proceedings of the 18th International Conference on Data Engineering. IEEE Computer Society, 2002, p. 278. • Transaction Processing Performance Council. (2006) DBGEN and QGEN and Reference Data Set. TPC: accessed 11/12/2006. [Online]. Available: http://www.tpc.org/tpch/ • P. Vassiliadis, M. Bouzeghoub, and C. Quix, “Towards quality-oriented data warehouse usage and evolution,” in Advanced Information Systems Engineering, 11th International Conference, CAiSE’99, ser. Lecture Notes in Computer Science, vol. 1626. Springer, 1999, pp. 164–179. • J. Widom, “Research problems in data warehousing,” in CIKM ’95, Proceedings of the 1995 International Conference on Information and Knowledge Management. ACM, 1995, pp. 25–30.

  19. Key Concepts • Decision Support Systems: OLTP systems are built for speedy, efficient handling of insert, update, and delete transactions, whereas decision support systems are designed for reporting and ad hoc query and analysis. • Dimensional Modeling: highly normalized OLTP data vs. ‘star’ schemas of de-normalized, flattened dimensions surrounding tables of additive facts; this simplifies filtering and aggregation for users and minimizes the number of required joins (see the sketch below). • Data Warehouses and Data Marts: the overall decision support system of ETL processes, staging area, data, and tools vs. a single logical star schema within a warehouse. • Metadata: data about data; accessible in the system catalog, and also mined from user queries via tracing and monitoring tools.
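
To make the dimensional-modeling contrast concrete, the sketch below shows a hypothetical star-schema query (table and column names invented for illustration). Each flattened dimension costs exactly one join, and a single GROUP BY covers the analysis, where a normalized OLTP schema would need several more joins to reach the same attributes:

    # Hypothetical star-schema query: one fact table, one join per
    # de-normalized dimension, additive facts summed in one pass.
    STAR_QUERY = """
        SELECT d.CalendarYear,
               p.Category,
               SUM(f.SalesAmount) AS TotalSales
        FROM FactSales AS f
        JOIN DimDate    AS d ON d.DateKey    = f.DateKey
        JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
        GROUP BY d.CalendarYear, p.Category
    """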
