1 / 41

Best Practices for Query Performance In a Data Warehouse

Best Practices for Query Performance In a Data Warehouse. Calisto Zuzarte IBM calisto@ca.ibm.com Session Code: D09 May 13, 2010 8:30AM–9:30AM Platform: Linux, Unix and Windows. Data Warehouse Life Cycle. Database design / Application design

Télécharger la présentation

Best Practices for Query Performance In a Data Warehouse

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Best Practices for Query Performance In a Data Warehouse Calisto Zuzarte IBM calisto@ca.ibm.com Session Code: D09 May 13, 2010 8:30AM–9:30AM Platform: Linux, Unix and Windows

  2. Data Warehouse Life Cycle • Database design / Application design • The Warehouse Application architects and Database Administrators work together to design the queries and schema before they put the application in production • Database performance layer implementation • In order to meet SLAs, DBAs usual go through some iterations augmenting the database with performance layer objects and set up the initial configuration to get good performance • Database tuning operations • During production, with changing requirements and change in data, there is on-going tuning required to keep operations smooth.

  3. Motivation • Data warehouse environments characteristics: • Large volumes of data • Millions/Billions of rows involved in some tables • Large amounts of data rolled-in and rolled-out • Complex queries • Large Joins • Large Sorts, • Large amounts of Aggregations • Many tables involved • Ad Hoc Queries • It is important to pay attention to query performance

  4. Objectives • Provide recommendations so that you can improve data warehouse query performance • Database Design considerations • Application Design considerations • Performance Layer Considerations • Ongoing Tuning Considerations

  5. Agenda • Best Practices – Database Design • Best Practices – Application Design • Best Practices – Performance Layer • Best Practices – Configuration and Operations

  6. Best Practices – Database Design • Best Practices - Parallelism • Inter-partition Shared nothing parallelism • Intra-Query Parallelism (SMP) • Best Practices - Partitioning • Database Partitioning • Table Partitioning • Multi-Dimension Clustering • UNION ALL Views • Best Practices – Schema • Best Practices - Compression

  7. Best Practices - Parallelism • Database partition feature (DPF) is recommended • To achieve parallelism in a data warehouse • For scalability and query performance • SMP (Intra-Query Parallelism) not recommended • In concurrent multi-user environments with heavy CPU usage • SMP recommended • When CPUs are highly under utilized • When DPF is not an option

  8. Partitioning (Complimentary Strategies in DB2) • Database Partitioning (DPF) … DISTRIBUTE BY HASH • Key Benefit : Better scalability and performance through parallelism • Multidimensional Clustering (MDC) … ORGANIZE BY DIMENSION • Key Benefit : Better query performance through data clustering • Table (Range) Partitioning … PARTITION BY RANGE • Table Partitioning • Key Benefit : Better data management (roll-in and roll-out of data) • UNION ALL Views • Key Benefit : Independent branch optimization

  9. Divide And Conquer ! Distribute, Partition, Organize ! Organize By Partition By Distribute By

  10. Best Practices – Database Partitioning • Collocate the fact and largest dimension • Choose to avoid significant skew on some partitions • Avoid DATE dimension where active transactions for current date all fall on one database partition (TIMESTAMP is good) • Possibilities for workload isolation for data marts • Different partition groups but common dimension tables • Needs replicated tables (discussed later)

  11. Best Practices – Table Partitioning • Recommend partitioning the fact tables • Typically based on DATE dimension • Works better with application key predicates applied directly • Table or Range Partitioning • Recommend table or range partitioning (V9.7 :partitioned indexes) • Choose partitioning based on roll-in / roll-out granularity • UNION ALL Views • Each branch optimized independently • Use with well designed applications (Dangers of materialization) • Large number of branches require time and memory to optimize • Needs predicates with constants for branch elimination

  12. Best Practices – Multidimensional Clustering (MDC) • Recommend defining MDC on the fact table • Guaranteed clustering (Avoids the need to REORG for clustering) • I/O optimization • Compact indexes (compact, coexists with regular indexes) • Choose dimensions based on query predicates • Recommend the use of 1 to 4 dimensions • Need to ensure dimensions are chosen such that they do not waste storage • Could choose a finer granularity of Table partitioning range • For example: Table partition range by month, MDC by date

  13. Star Schema STORE PRODUCT Store_id Region_id … Product_id Class_id Group_id Family_id Line_id Division_id … SALES Product_id Store_id Channel_id Date_id Amount Quantity … TIME CHANNEL Date_id Month_id Quarter_id Year_id Channel_id …

  14. Dimension Hierarchy Product Dimension Division Level 5 Line Level 4 Time Dimension Year Family Level 3 Quarter Group Level 2 Store Dimension Month Retailer Class Level 1 Channel Dimension Date Product Channel Store Level 0 Sales Fact

  15. Best Practices - Schema • Surrogate Keys • As far as possible use application keys themselves • allows predicates to be applied/transferred directly on the fact table • DATE is a good candidate (easier to roll-in/roll-out and for MDC ) • Star Schema / Snowflakes • Separate tables for each dimension hierarchy (snowflake) may result in a large number of joins • Flattened dimensions may contain a lot of redundancy (space) • Define Columns NOT NULL when appropriate • Many optimizations that are done based on NOT NULL • Define Uniqueness when appropriate • Primary Keys / Unique Constraints / Unique Indexes

  16. Compression • Table, Index and Temp Table compression • Huge benefits with storage savings • With table and TEMP compression 30-70% • With Index compression 30-40% • Performance gains because • Less I/O and better use of bufferpools • TEMP table compression helps operators like Hash Join, Merge Join, Sorts and Table Queues if they spill

  17. Best Practices - Compression • Consider compression particularly with the fact table • Strongly recommend compression on the fact table when not CPU bound

  18. Agenda • Best Practices – Database Design • Best Practices – Application Design • Best Practices – Performance Layer • Best Practices – Configuration and Operations

  19. Best Practices – Application Considerations • Use constants instead of expressions in the query • Example • WHERE DateCol <= CURRENT DATE – 5 • Use VALUES(CURRENT DATE – 5) to get the resulting constant first and use it in the query • Avoid expressions on indexed columns • Example • WHERE DATECOL – 2 DAYS > ‘2009-10-22’ • WHERE DATECOL > ‘2009-10-22’ + 2 DAYS

  20. Best Practices – Application Considerations • Avoid mixing data types in join predicates • Example • WHERE IntegerCol = DecimalCol • Use Global Temporary Tables to split a query if it contains more than 10-15 tables • Reduces optimization time

  21. Agenda • Best Practices – Database Design • Best Practices – Application Design • Best Practices – Performance Layer • Best Practices – Configuration and Operations

  22. Best Practices – Performance Layer • Indexes • Statistics • Distribution Statistics • Column Group Statistics • Statistical Views • Constraints • Referential Integrity • Materialized Query Tables • Replicated Tables

  23. Indexes • Indexes are a vertical subset of the data in the table • Indexes provide ORDER • Indexes may allow for clustered access to the table

  24. Index Considerations • To get Index Only Access instead of more expensive ISCAN-FETCH or TSCAN (Table Scan) • To avoid SORTs particularly those that spill • To promote index-ORing and index-ANDing • To promote Star Joins • When you have range join predicates • Better possibilities with Nested Loop Join • Indexes for clustering (MDC)

  25. Cardinality Estimation • Estimating the size of intermediate results is critical to getting good query execution plans • Without sufficient information, the optimizer can only guess based on some assumptions • Data skew and statistical correlation between multiple column values introduce uncertainty • Pay attention to DATE columns

  26. Best Practices - Statistics • Collect distribution Statistics when there is skew and predicates use constants • Consider a high number of quantile statistics on columns with DATE range predicates and character string columns

  27. Column Group Statistics 1 2 3 • Example: COUNTRY = ‘Germany’ And CITY = ‘Frankfurt’ • No CGS: Selectivity = ½ * 1/3 =1/6 …Estimate 1 row • With CGS: Selectivity = 1/3 …Estimate 2 rows

  28. Problem Scenario - Skew 10000000 rows CUST Table 100 rows, 100 custids Frequency Statistics – SALES Table SELECT … FROM SALES, CUST WHERE CUST.CNAME = ‘IBM’ AND CUST.CUSTID = SALES.CUSTID Cardinality Estimate with Uniformity = 100,000 Actual Cardinality : 2,000,000 !!!!!!!!!!!!!!!!!!!!

  29. Best Practices - Statistics • Collect Column Group Statistics with multiple predicates on the same table • WHERE Country = ‘CANADA’ and City = ‘TORONTO’ RUNSTATS … ON ALL COLUMNS AND ON COLUMNS ((country, city) … ) … • Consider Statistical Views when • There is skew on the join column • There is a significant difference in the range of values in the fact and the dimension CREATE VIEW SV1 AS (SELECT C.* FROM CUST C, FACT F WHERE C.CUST_ID = F.CUST_ID) ALTER VIEW cust_fact ENABLE QUERY OPTIMIZATION RUNSTATS ON TABLE dba.cust_fact WITH DISTRIBUTION

  30. Referential Integrity (RI) • Facilitates aggregation push down • Example in the appendix section • Eliminates redundant joins in views • RI helps determine that queries that do not require data from a primary key table need not do that join even if it is in the view • Helps with Materialized Query Table matching • Allows Queries to match MQTs with more dimension table joins

  31. GB GB GB GB JOIN JOIN JOIN JOIN JOIN JOIN JOIN JOIN Dim2 Dim2 Dim2 Dim2 Fact Fact Fact Fact Dim1 Dim1 Dim1 Dim1 Consider Materialized Query Tables Joe’s Query Q9 Q9 Sue’s Joe’s Q Bob’s Q GB Sue’s Query JOIN JOIN Dim2 Bob’s Query MQT Fact Dim1

  32. Best Practices - Defining Materialized Query Tables • What MQTs should I define ? • Estimate the size of the candidate MQTs by executing COUNT queries against base tables. • Try to achieve at least a 10X reduction in size between fact and the MQT • Build MQTs with a reasonable number of GROUP BY columns (3 to 6 dimension keys) at a time based on query patterns • As far as possible build the MQT from the fact table alone • Use Table Partitioning for the fact table and the MQTs

  33. Best Practices - MQT Matching • Define Referential Integrity to help with matching MQTs that contain more tables than the queries • Define Functional Dependencies for thinner MQTs • Use COUNT_BIG instead of COUNT for DPF MQTs • Define indexes on MQTs • Keep statistics up-to-date • Define base table columns NOT NULL as far as possible • For example we can match SUM(A + B) with SUM(A) + SUM(B)

  34. Best Practices – MQT Maintenance • REFRESH IMMEDIATE • Create an index on the GROUP BY columns • Create the index on the set of columns that form a unique key • Always keep the base table and MQT statistics up-to-date • REFRESH DEFERRED • If log space is an issue, consider NOT LOGGED INITIALLY or LOAD from cursor • An MQT can be temporarily toggled into a regular table by using • ALTER TABLE … DROP MATERIALIZED QUERY • ALTER TABLE … ADD MATERIALIZED QUERY • Use ATTACH / DETACH if fact table and MQT are range partitioned tables

  35. Replicated Tables • Replicate dimension tables (unless collocated with fact ) • Benefit : Avoids data movement • Important : Define suitable indexes • If too large, replicate a subset of frequently used columns JOIN JOIN JOIN BTQ BTQ BTQ CUST CUST COPY SALES CUST COPY SALES CUST COPY SALES

  36. Agenda • Best Practices – Database Design • Best Practices – Application Design • Best Practices – Performance Layer • Best Practices – Configuration and Operations

  37. Best Practices – Configuration • Optimization Level 5 • Registry Variables • DB2_ANTIJOIN=EXTEND • If slow queries have NOT EXISTS, NOT IN predicates • DB2_REDUCED_OPTIMIZATION=YES • If compile time is an issue • Configuration thumb rules • BUFFPOOL ~= SHEAPTHRES • SORTHEAP ~= SHEAPTHRES/(# of concurrent SORT, HSJN)

  38. Best Practices - Statistics • The DB2 Query Optimizer relies on reasonably accurate statistics to get a good query plans • User runs RUNSTATS when data changes (part of ETL) • Statistics Fabrication (unreliable) • DB2 keeps UPDATE / DELETE / INSERT counters • Fabrication limited to a few statistics – Not enough • Consider configuring Automatic Statistics • Automatically collects statistics on tables in need • Runs in the background as a low priority job • Consider configuring Real Time Statistics • Collects statistics on-the-fly

  39. Summary – Best Practices • Database Design : • Parallelism, Partitioning, Schema, Compression • Application Design • SQL Tips • Performance Layer • Indexes, Statistics, Referential Integrity, Materialized Query Tables, Replicated Tables • Configuration and Operations • Configuration, Collecting Statistics

  40. Calisto Zuzartecalisto@ca.ibm.com

More Related