多表连接查询优化的相关研究

多表连接查询优化的相关研究 吕彬 2009.3.5

motivation • 图中表示连接顺序对查询效率的影响： • 准确估计选择度要考虑属性间的相关性 • 问题关键： • 高效地计算属性间的相关度

Agenda • Multi-table join overview • Heuristic and randomized optimization for the join ordering problem • Michael Steinbrunn，et al, The VLDB Journal (1997) 6: 191–208 • Attribute correlation detection • BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data • Paul G. Brown Peter J. Haas , Proceedings of the 29th VLDB Conference, 2003 • CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies • Ihab F. Ilyas，Volker Markl, et al, SIGMOD 2004, June 13–18, 2004, • COCA: More Accurate Multidimensional Histograms out of More Accurate Correlations Detection • CAO Wei1, QIN Xiongpai, WANG Shan,WAIM2008 • Star join • Star Gazing from atop your DB2 z/OS Database Server • Terry Purcell, et al, Intelligent Optimizer • Star join revisited: Performance internals for cluster architectures • Josep Aguilar-Saborit, Data & Knowledge Engineering 63 (2007) 995–1013

Multi-table join overview • Heuristic and randomized optimization for the join ordering problem • Choosing join type based on cost • Solution space for the join ordering problem • Join ordering strategies • Quantitative analysis • Conclusion

Multi-table join overview Elapsed Time IO Cost CPU Cost Base Cost Row Cost Page Cost Scan Cost • Choosing join type based on cost • Cost models • Nested loop join • Sort-merge join • Hash join

Multi-table join overview • Solution space for the join ordering problem • Left-deep trees • n! ways to allocate n base relations to the tree’s leaves • good solutions because of exploiting the cost-reducing pipelining technique • Bushy trees • an adaptable plan enumeration strategy • linear graphs (n3 − n)/6 • star graphs (n−1) 2n−2

Multi-table join overview • Join ordering strategies • Deterministic algorithms • heuristic or exhaustive search • Randomized algorithms • Define a set of moves which constitute edgesbetween the different solutions of the solution space • performs a random walk along the edges according to certain rules, • terminating as soon as no more applicable moves exist or a time limit is exceeded • Genetic algorithms • make use of a randomized search strategy very similar to biological evolution • Hybrid algorithms • combine the strategies of pure deterministic and pure randomized algorithms

Multi-table join overview • Quantitative analysis • Class of the join graph • Relation cardinalities and domain sizes

Multi-table join overview • Quantitative analysis • Solution spaces

Multi-table join overview • Quantitative analysis • deterministic algorithms

Multi-table join overview • Quantitative analysis • Randomized and genetic algorithms

Multi-table join overview • Quantitative analysis • Total running time • Find final solution time

Multi-table join overview • Conclusion • Heuristic algorithms Vs Randomized and genetic algorithms • compute quickly, but far from the optimum. • better suited for join optimizations; although require a longer running time • Solution space • Except the star join graph, the bushy tree is preferable than left-deep processing trees. • The extensibility of randomized and genetic algorithms

Attribute correlation detection • Query-Driven approaches • Query workload • Query feedback • Multidimensional histogram • SASH algorithm • Advantage and disadvantage • Data-Driven approaches • Statistical • Chi-square test, Log-linear model • Probability • Bayesian network, Markov network • Full scan vs. Sample • hard FDs, soft FDs, Correlation

Attribute correlation detection • BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data • Example

Attribute correlation detection • Overview of BHUNT • Algebraic Constraints • Example • a1 : deliveries.deliveryDate, • a2 : orders.shipDate, • ⊕:as the subtraction operator, • P : orders.orderID = deliveries.orderID, • I = { 2, 3, 4, 5} ∪ {12, 13, . . . , 19 }∪ {31, 32, 33, 34, 35 } .

Attribute correlation detection • Overview of BHUNT • Generating Candidates C = (a1, a2, P,⊕). • Generating Pairing Rules • Turning Pairing Rules Into Candidates • Identifying Fuzzy Constraints • Constructing Bump Intervals, by applying statistical histogramming, segmentation, or clustering techniques to a sample of the column values • Choosing the Sample Size, The sample size is selected to control the number of “exception” records that fail to satisfy the constraint. • Exploiting the Constraints • Identify the most useful set of constraints, and create “exception tables” to hold all of the exception records. • Modify the queries to incorporate the constraints • the optimizer uses the constraints to identify new, more efficient access paths.

Attribute correlation detection • Experimental results of BHUNT

Attribute correlation detection • CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies • .

Attribute correlation detection • CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies • soft functional dependencies • C1 => C2, the value of C1 determines the value of C2 not with certainty, but merely with high probability. • hard functional dependencies • the value of C1 completely determines the value of C2.

Attribute correlation detection • CORDS overview • First: Enumerating and Pruning • Pruning rule • Type constraint • Statistical constraint • Paring constraint • Workload constraint • Identify trivial cases

Attribute correlation detection • CORDS overview Analyze distinct value of sampled column (to test for SOFT FD) Soft FD? Chi-squared analysis (to test for statistical dependency) NO YES Column Group (Top-k pair) YES Correlated?

Attribute correlation detection • CORDS discovers three property and relationship • Trivial cases • “soft” keys • “trivial” column • Soft FDsC1 => C2 • |C1|/|C1, C2| • CORDS estimates the |C1| and |C1, C2| using sample • Correlations • Detect statistical dependence using the sample-based chi-squared analysis

Attribute correlation detection • CORDS experimental result • Synthetic Data: • The Accidents database • Benchmarking Data: • The TPC-H benchmark • Real-world Data: • include a subset of the Census database and the Auto database,

Attribute correlation detection • CORDS experimental result

Attribute correlation detection • COCA: More Accurate Multidimensional Histograms out of More Accurate Correlations Detection • robust and informative metric —— entropy correlation coefficients • a novel yet simple kind of multi-dimensional synopses —— COCA-Hist to cope with different correlations

Attribute correlation detection • Entropy correlation coefficients

Attribute correlation detection • Merits of COCA • Simple and straightforward • Accurate and robust • Informative and unified

Attribute correlation detection • Different histograms for different degrees of correlations • Histograms for value sparsity • MHIST+squeezing • Histograms for mutual independence • AVI assumption • Histograms for other situations • MHist-MaxDiff histograms • Other improvements • discarding empty buckets

Attribute correlation detection • Experiments • CORDS: #1(0.39, 0.29) ,#3 (0.45, 0.58)

Attribute correlation detection • Experiments

Star join • Star Gazing from atop your DB2 z/OS Database Server • Index key feedback • The problems are: • The cartesian join • difficult to create suitable multi-column indexes unless the combination of filtering dimensions that are included in the queries are known.

Star join • Star Gazing from atop your DB2 z/OS Database Server • PAIR-WISE JOIN

Star join • Star Gazing from atop your DB2 z/OS Database Server • Performance

Star join • Star join revisited: Performance internals for cluster architectures

Star join • Cluster architectures and horizontal partitioning • partitioning schemes: • round robin, • range partitioning • hash partitioning • Collocated and non-collocated • Repartitioning Operator

Star join • Star join processing techniques • Bitmap join (BJ) • size ||T|| · ||S|| bits • large size of the indices • the maintenance of these indices • Multi hierarchical clustering (MHC) • reducing the number of I/O accesses to the fact table. • a chunk-based file system specific to OLAP cubes is proposed. • star join queries are transformed into range queries • Star hash join (SHJ) • Push down bit filters technique

Star join • Analysis

Thanks!Q & A

Agenda • 卡方检验(Chi - square test) 作为非参数检验的一种,是常用的检验方法,通过比较两项或多项频数,旨在检测在一定显著性水平上实际频数与以某种理论模型或分布特征假设为基础的期望频数的差异度[3 ] 。卡方检验第一步,建立原假设H0 和备择假设H1 ;第二步,根据理论经验或理论分布计算期望频数;第三步,根据实际频数和期望频数计算样本卡方值。如果运算出的卡方值大于卡方临界值,接受原假设,反之,接受备择假设[4 ,5 ,6 ]

多表连接查询优化的相关研究

多表连接查询优化的相关研究

Presentation Transcript