
Consistently Estimating the Selectivity of Conjuncts of Predicates


Presentation Transcript


  1. Consistently Estimating the Selectivity of Conjuncts of Predicates. Volker Markl, Nimrod Megiddo, Marcel Kutsch, Tam Minh Tran, Peter Haas, Utkarsh Srivastava

  2. Agenda • Consistency and Bias Problems in Cardinality Estimation • The Maximum Entropy Solution • Iterative Scaling • Performance Analysis • Related Work • Conclusions

  3. What is the problem? • Consider the following three attributes: Make, Model, Color (diagram: correlations among the three attributes)

  4. How to estimate the cardinality of the predicate Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’? • card(Make = ‘Mazda’) = 100,000 • card(Model = ‘323’) = 200,000 • card(Color = ‘red’) = 200,000 • Real cardinality of the conjunct: 49,000

  5. Without any additional knowledge • Base cardinality: 1,000,000 • Selectivity(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’) under the independence assumption: s(Make = ‘Mazda’) * s(Model = ‘323’) * s(Color = ‘red’) = (100,000 / 1,000,000) * (200,000 / 1,000,000) * (200,000 / 1,000,000) = 0.1 * 0.2 * 0.2 = 0.004 • Estimated cardinality: 0.004 * 1,000,000 = 4,000 • Legend: s(?) denotes the selectivity of predicate ?

  6. Additional knowledge given (1): • Additional knowledge: card(Make = ‘Mazda’ AND Model = ‘323’) = 50,000 • Selectivity(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’) • case 1: s(Make AND Model) * s(Color) = (50,000 / 1,000,000) * (200,000 / 1,000,000) = 0.05 * 0.2 = 0.01 → estimated cardinality: 10,000

  7. Additional knowledge given (2): • Additional knowledge: card(Make = ‘Mazda’ AND Model = ‘323’) = 50,000, card(Make = ‘Mazda’ AND Color = ‘red’) = 90,000 • Selectivity(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’) • case 1: s(Make AND Model) * s(Color) = 0.01 → estimated cardinality: 10,000 • case 2: s(Make AND Color) * s(Model) = (90,000 / 1,000,000) * (200,000 / 1,000,000) = 0.09 * 0.2 = 0.018 → estimated cardinality: 18,000

  8. Additional knowledge given (3): • Additional knowledge: card(Make = ‘Mazda’ AND Model = ‘323’) = 50,000, card(Make = ‘Mazda’ AND Color = ‘red’) = 90,000, card(Model = ‘323’ AND Color = ‘red’) = 150,000 • Selectivity(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’) • case 1: s(Make AND Model) * s(Color) = 0.01 → estimated cardinality: 10,000 • case 2: s(Make AND Color) * s(Model) = 0.018 → estimated cardinality: 18,000 • case 3: s(Model AND Color) * s(Make) = (150,000 / 1,000,000) * (100,000 / 1,000,000) = 0.15 * 0.1 = 0.015 → estimated cardinality: 15,000
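To make the inconsistency concrete, here is a small sketch (my own illustration, not material from the talk) that recomputes the estimates of slides 5–8 from the example statistics; every choice of statistics yields a different selectivity for the same conjunct:

```python
# Sketch of the arithmetic on slides 5-8, using the example numbers from the talk.
N = 1_000_000                               # base table cardinality

s_make, s_model, s_color = 100_000 / N, 200_000 / N, 200_000 / N   # single-predicate selectivities
s_make_model  = 50_000  / N                 # s(Make='Mazda' AND Model='323')
s_make_color  = 90_000  / N                 # s(Make='Mazda' AND Color='red')
s_model_color = 150_000 / N                 # s(Model='323' AND Color='red')

cases = {
    "case 0 (independence)":     s_make * s_model * s_color,
    "case 1 (use Make,Model)":   s_make_model * s_color,
    "case 2 (use Make,Color)":   s_make_color * s_model,
    "case 3 (use Model,Color)":  s_model_color * s_make,
}
for name, sel in cases.items():
    print(f"{name}: selectivity = {sel:.3f}, estimated cardinality = {sel * N:,.0f}")
# Four different answers for the same conjunct -> the cardinality model is inconsistent.
```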

  9. Why is this a problem? • Three plans produce the same intermediate result, but the optimizer estimates its cardinality differently in each: • Index intersection of Make, Model, Color → case 0: s(Make) * s(Model) * s(Color) = 0.004 → estimated card: 4,000 • Index scan on (Make, Color) [90,000 rows] + FETCH applying Model → case 2: s(Make AND Color) * s(Model) = 0.018 → estimated card: 18,000 • Index scan on (Model, Color) [150,000 rows] + FETCH applying Make → case 3: s(Model AND Color) * s(Make) = 0.015 → estimated card: 15,000 • Cardinality bias: the plan with the lowest estimate wins, i.e., the plan that uses the least knowledge (fleeing from knowledge to ignorance)

  10. What has happened? • Inconsistent model • different estimates for the same intermediate result • due to multivariate statistics with overlapping information • Bias during plan selection • results in the selection of sub-optimal plans • Bias Avoidance means keeping the model consistent • State-of-the-art is to do bookkeeping of the first multivariate statistic used, and ignore further overlapping multivariate statistics • Does not solve the problem, as ignoring knowledge also means bias • Bias is arbitrary, depends on what statistics are used first during optimization • Only possible solution is to exploit all knowledge consistently

  11. Problem: Only partial knowledge of the DNF atoms • The three predicates partition the rows into 2^3 = 8 DNF atoms (the regions of the Venn diagram): Mazda&323&red, Mazda&323&¬red, Mazda&¬323&red, ¬Mazda&323&red, Mazda&¬323&¬red, ¬Mazda&323&¬red, ¬Mazda&¬323&red, ¬Mazda&¬323&¬red • The additional knowledge only constrains sums of these atoms: • Make AND Model: card(‘Mazda’ AND ‘323’) = 50,000 • Make AND Color: card(‘Mazda’ AND ‘red’) = 90,000 • Model AND Color: card(‘323’ AND ‘red’) = 150,000 • Legend: DNF = disjunctive normal form; ¬X denotes not X

  12. How to compute the missing values of the distribution? • We want Probability(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’), i.e., the DNF atom Mazda&323&red • The known selectivities do not determine the eight atom probabilities uniquely, so the missing values must be filled in by some principle

  13. Solution: Information Entropy H(X) = -∑ xi log(xi) • Entropy is a measure of the “uninformedness” of a probability distribution X = (x1, …, xm) with x1 + … + xm = 1 • Maximizing information entropy for the unknown selectivities, using the known selectivities as constraints, avoids bias • The less is known about a probability distribution, the larger its entropy • Nothing known → uniformity: s(X = ?) = 1/m • Only marginals known → independence: s(X = ? AND Y = ?) = s(X = ?) * s(Y = ?) • Thus the principle of maximum entropy generalizes the uniformity and independence assumptions used in today’s query optimizers
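A minimal sketch (my own, not from the talk) of the entropy function, illustrating that the uniform distribution is the least informed and that more concentrated distributions have lower entropy:

```python
import math

def entropy(xs):
    """H(X) = -sum x_i * log(x_i); terms with x_i = 0 contribute 0."""
    return -sum(x * math.log(x) for x in xs if x > 0)

m = 4
print(entropy([1 / m] * m))           # uniform: log(4) ~= 1.386, the maximum over 4 outcomes
print(entropy([0.7, 0.1, 0.1, 0.1]))  # skewed: ~0.940, lower entropy = more informed
print(entropy([1.0, 0.0, 0.0, 0.0]))  # degenerate: 0.0, the distribution is fully determined
```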

  14. Entropy Maximization for Cardinality Estimation • Given some selectivities (single and conjunctive) over a space of n predicates p1, …, pn → choose a model which is consistent with this knowledge but otherwise as uniform as possible → maximize the entropy of the probability distribution X = (xb | b ∈ {0,1}^n) • xb is the selectivity of the DNF atom b • bi = 0 means that predicate pi is negated in the DNF atom; bi = 1 means that pi occurs positively • Legend: {0,1}^n denotes the n-fold cross product of the set {0,1}, i.e., {0,1} × … × {0,1}, n times; for a predicate p, p^1 = p and p^0 = not p
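To make the notation concrete, a small sketch (my own illustration, not code from the paper) of DNF atoms as bit vectors for the three example predicates; the helper mirrors the set C(Y) formalized on slide 17:

```python
from itertools import product

# Predicates (0-indexed here): p0 = (Make='Mazda'), p1 = (Model='323'), p2 = (Color='red').
# A DNF atom is a bit vector b in {0,1}^n; b[i] = 1 keeps predicate p_i, b[i] = 0 negates it.
n = 3
atoms = list(product((0, 1), repeat=n))   # (0,0,0), (0,0,1), ..., (1,1,1) -- 2^n atoms

def C(Y):
    """All DNF atoms that contribute to the known selectivity s_Y (cf. slide 17)."""
    return [b for b in atoms if all(b[i] == 1 for i in Y)]

print(C({0, 1}))   # atoms summing up to s(Make AND Model): [(1, 1, 0), (1, 1, 1)]
print(C(set()))    # the empty set constrains all 8 atoms (they must sum to 1)
```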

  15. Maximum Entropy Principle – Example • Each Venn-diagram region is a DNF atom labeled by a bit vector b = (b1, b2, b3), e.g., 111 = Mazda&323&red, 110 = Mazda&323&¬red, …, 000 = ¬Mazda&¬323&¬red • Knowledge sY, Y ∈ T: s1 = s(Mazda) = 0.1, s2 = s(323) = 0.2, s3 = s(red) = 0.2, s1,2 = s(Mazda & 323) = 0.05, s1,3 = s(Mazda & red) = 0.09, s2,3 = s(323 & red) = 0.15 • T = {{1}, {2}, {3}, {1,2}, {1,3}, {2,3}, ∅} • Example constraint: s1 = x100 + x101 + x110 + x111

  16. Maximum Entropy Principle – Example • Knowledge sY, Y ∈ T: s1 = s(Mazda) = 0.1, s2 = s(323) = 0.2, s3 = s(red) = 0.2, s1,2 = s(Mazda & 323) = 0.05, s1,3 = s(Mazda & red) = 0.09, s2,3 = s(323 & red) = 0.15; T = {{1}, {2}, {3}, {1,2}, {1,3}, {2,3}, ∅} • Constraints: • 0.10 = s1 = x100 + x101 + x110 + x111 • 0.20 = s2 = x010 + x011 + x110 + x111 • 0.20 = s3 = x001 + x011 + x101 + x111 • 0.05 = s1,2 = x110 + x111 • 0.09 = s1,3 = x101 + x111 • 0.15 = s2,3 = x011 + x111 • 1.00 = sØ = x000 + x001 + x010 + x011 + x100 + x101 + x110 + x111 • Objective function: maximize the entropy H(X) = -∑b xb log(xb)
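As a sanity check on this constraint system (my own sketch using the example numbers), the seven known selectivities give only seven linear equations over eight unknown atom selectivities, so the constraints alone are underdetermined and a selection principle such as maximum entropy is needed to pick a unique solution:

```python
import numpy as np
from itertools import product

atoms = list(product((0, 1), repeat=3))                 # the 8 DNF atoms
known = {(): 1.0,                                       # s_empty
         (0,): 0.1, (1,): 0.2, (2,): 0.2,               # s1, s2, s3
         (0, 1): 0.05, (0, 2): 0.09, (1, 2): 0.15}      # s12, s13, s23

# Row Y of A sums exactly the atoms in C(Y); the constraints read A @ x = s.
A = np.array([[1.0 if all(b[i] for i in Y) else 0.0 for b in atoms] for Y in known])
s = np.array(list(known.values()))

print(A.shape, np.linalg.matrix_rank(A))   # (7, 8), rank 7: one degree of freedom remains
# Many non-negative x solve A @ x = s; the maximum-entropy principle selects one of them.
```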

  17. Solving the Constrained Optimization Problem • Minimize the objective function ∑b xb log(xb) (equivalently, maximize the entropy H(X)) • Satisfying the |T| constraints, T ⊆ 2^{1,…,n}: ∑b∈C(Y) xb = sY for each Y ∈ T • Legend: 2^{1,…,n} denotes the powerset of {1,…,n}; C(Y) denotes all DNF atoms that contribute to sY, i.e., formally, C(Y) := {b ∈ {0,1}^n | ∀i∈Y: bi = 1} and C(∅) := {0,1}^n • General solution: Iterative Scaling

  18. Maximum Entropy and Lagrange Multipliers • The objective function ∑b xb log(xb) is convex • We can build a Lagrangian function by associating a multiplier λY with each constraint and subtracting the constraints from the objective function • Differentiation w.r.t. xb and equating to zero yields the conditions for the minimum • Exponentiating the Lagrange multipliers (zY = e^λY) in the derivatives yields the product form xb = e^(-1) * ∏ Y∈P(b,T) zY • Replacing xb in each constraint yields a condition in the exponentiated Lagrange multipliers zY • Legend: P(b,T) ⊆ T denotes the indexes Y of all known selectivities sY to which DNF atom b contributes its value xb: P(b,T) = {Y∈T | ∀i∈Y: bi = 1} ∪ {∅}

  19. Iterative Scaling • We can now isolate zY for a particular Y ∈ T and thus iteratively compute zY from all zW, W ∈ T\{Y} • This algorithm is called Iterative Scaling (Darroch and Ratcliff, 1972) and converges to a stable set of Lagrange multipliers zY, Y ∈ T • This stable point minimizes the objective function and satisfies all constraints • We can compute all DNF atoms xb from these stable multipliers using the product form, and can in turn compute all missing selectivities
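A compact sketch of the iterative scaling loop for this example (my own implementation, not the authors' code). The e^(-1) factor in the product form follows from the Lagrangian derivation and is consistent with the per-step numbers on the following slides; the update order mirrors slides 21–27:

```python
import math
from itertools import product

def iterative_scaling(n, known, iters=10000, tol=1e-9):
    """Maximum-entropy DNF-atom selectivities via iterative scaling (Darroch & Ratcliff 1972).

    `known` maps frozensets of predicate indexes Y to the known selectivity s_Y,
    including frozenset() -> 1.0. One exponentiated multiplier z_Y per constraint.
    """
    atoms = list(product((0, 1), repeat=n))
    z = {Y: 1.0 for Y in known}

    def x(b):
        # Product form from the Lagrangian: x_b = e^-1 * prod of z_Y over all Y that b contributes to.
        val = math.exp(-1)
        for Y in known:
            if all(b[i] for i in Y):
                val *= z[Y]
        return val

    for _ in range(iters):
        delta = 0.0
        for Y, s_Y in known.items():
            contrib = [b for b in atoms if all(b[i] for i in Y)]   # C(Y)
            denom = sum(x(b) for b in contrib) / z[Y]              # current sum with z_Y factored out
            new_z = s_Y / denom                                    # rescale so constraint Y holds exactly
            delta = max(delta, abs(new_z - z[Y]))
            z[Y] = new_z
        if delta < tol:
            break
    return {b: x(b) for b in atoms}

# Example from the talk (predicate indexes: 0 = Mazda, 1 = 323, 2 = red); update order as on the slides.
known = {frozenset({0}): 0.1, frozenset({1}): 0.2, frozenset({0, 1}): 0.05,
         frozenset({2}): 0.2, frozenset({0, 2}): 0.09, frozenset({1, 2}): 0.15,
         frozenset(): 1.0}
sol = iterative_scaling(3, known)
print(round(sol[(1, 1, 1)], 6))   # max-entropy estimate for Mazda AND 323 AND red
```

Running this should reproduce the example's result, x111 ≈ 0.049918 (see slide 28).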

  20. Maximum Entropy Solution of the Example • Knowledge: s(Mazda) = s1 = 0.1, s(323) = s2 = 0.2, s(red) = s3 = 0.2, s(Mazda & 323) = s1,2 = 0.05, s(Mazda & red) = s1,3 = 0.09, s(323 & red) = s2,3 = 0.15 • Wanted: Selectivity(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’) = s1,2,3 = x111 = ???

  21. Iterative Scaling – 1st iteration, step s1 (atoms x100, x101, x110, x111) • Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1 • Multipliers after the step: z1 = 0.067957, z2 = 1, z1,2 = 1, z3 = 1, z1,3 = 1, z2,3 = 1, zØ = 1 • Current model values: s1 = 0.1, s2 = 0.785759, s1,2 = 0.05, s3 = 0.785759, s1,3 = 0.05, s2,3 = 0.392879, sØ = 1.571518

  22. Iterative Scaling – 1st iteration, step s2 (atoms x010, x011, x110, x111) • Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1 • Multipliers after the step: z1 = 0.067957, z2 = 0.254531, z1,2 = 1, z3 = 1, z1,3 = 1, z2,3 = 1, zØ = 1 • Current model values: s1 = 0.062727, s2 = 0.2, s1,2 = 0.012727, s3 = 0.492879, s1,3 = 0.031363, s2,3 = 0.1, sØ = 0.985759

  23. Iterative Scaling – 1st iteration, step s1,2 (atoms x110, x111) • Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1 • Multipliers after the step: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 1, z1,3 = 1, z2,3 = 1, zØ = 1 • Current model values: s1 = 0.1, s2 = 0.237273, s1,2 = 0.05, s3 = 0.511516, s1,3 = 0.05, s2,3 = 0.118637, sØ = 1.023032

  24. Iterative Scaling – 1st iteration, step s3 (atoms x001, x011, x101, x111) • Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1 • Multipliers after the step: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 1, z2,3 = 1, zØ = 1 • Current model values: s1 = 0.069550, s2 = 0.165023, s1,2 = 0.034775, s3 = 0.2, s1,3 = 0.019550, s2,3 = 0.046386, sØ = 0.711516

  25. Iterative Scaling – 1st iteration, step s1,3 (atoms x101, x111) • Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1 • Multipliers after the step: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1, zØ = 1 • Current model values: s1 = 0.14, s2 = 0.200248, s1,2 = 0.07, s3 = 0.27045, s1,3 = 0.09, s2,3 = 0.081611, sØ = 0.781966

  26. Iterative Scaling – 1st iteration, step s2,3 (atoms x011, x111) • Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1 • Multipliers after the step: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1.837978, zØ = 1 • Current model values: s1 = 0.177709, s2 = 0.268637, s1,2 = 0.107709, s3 = 0.338839, s1,3 = 0.127709, s2,3 = 0.15, sØ = 0.850355

  27. Iterative Scaling – 1st iteration, step sØ (all atoms) • Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1 • Multipliers after the step: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1.837978, zØ = 1.175979 • Current model values: s1 = 0.208982, s2 = 0.315911, s1,2 = 0.126664, s3 = 0.398468, s1,3 = 0.150183, s2,3 = 0.176397, sØ = 1 • DNF atoms after the 1st iteration: x000 = 0.432619, x001 = 0.169152, x010 = 0.110115, x011 = 0.079133, x100 = 0.029399, x101 = 0.052919, x110 = 0.029399, x111 = 0.097264

  28. Maximum Entropy Solution of the Example • Knowledge: s(Mazda) = s1 = 0.1, s(323) = s2 = 0.2, s(red) = s3 = 0.2, s(Mazda & 323) = s1,2 = 0.05, s(Mazda & red) = s1,3 = 0.09, s(323 & red) = s2,3 = 0.15 • Converged DNF atom selectivities: x000 = 0.740082, x001 = 0.009918, x010 = 0.049918, x011 = 0.100082, x100 = 0.009918, x101 = 0.040082, x110 = 0.000082, x111 = 0.049918 • Selectivity(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’) = s1,2,3 = x111 = 0.049918 • Iterations: 241
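As a quick consistency check (again my own sketch, not from the talk), the converged atoms listed above satisfy every known selectivity, and x111 is the desired estimate:

```python
# Atom values from slide 28, keyed by the bit vector (b_Mazda, b_323, b_red).
atoms = {
    (0, 0, 0): 0.740082, (0, 0, 1): 0.009918, (0, 1, 0): 0.049918, (0, 1, 1): 0.100082,
    (1, 0, 0): 0.009918, (1, 0, 1): 0.040082, (1, 1, 0): 0.000082, (1, 1, 1): 0.049918,
}
known = {(): 1.0, (0,): 0.1, (1,): 0.2, (2,): 0.2, (0, 1): 0.05, (0, 2): 0.09, (1, 2): 0.15}

for Y, s_Y in known.items():
    model = sum(x for b, x in atoms.items() if all(b[i] for i in Y))
    print(Y, round(model, 6), s_Y)   # each model value matches its constraint
print(atoms[(1, 1, 1)])              # 0.049918 -> estimated cardinality ~49,918 of 1,000,000 rows
```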

  29. Let’s compare: Selectivity(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’) • Real: s(Make AND Model AND Color) = 0.049 → actual card: 49,000 • case 0: s(Make) * s(Model) * s(Color) = 0.004 → estimated card: 4,000 (Error: 10x) • case 1: s(Make AND Model) * s(Color) = 0.010 → estimated card: 10,000 (Error: 5x) • case 2: s(Make AND Color) * s(Model) = 0.018 → estimated card: 18,000 (Error: 2.5x) • case 3: s(Model AND Color) * s(Make) = 0.015 → estimated card: 15,000 (Error: 3x) • ME: maximum-entropy estimate s1,2,3 = 0.049918 → estimated card: 49,918 (almost no error)

  30. Forward Estimation: Predicting s1,2,3, given different subsets of known selectivities (box plots over 200 queries: quartiles and median of the estimates for knowledge sets such as {s1,2, s1,3, s2,3}, {s1, s2, s3}, {s1,3, s2,3}, {s1,2, s1,3}, {s1,2, s2,3}, {s1,3}, {s2,3}, {s1,2})

  31. Comparing DB2 and ME: Predicting s1,2,3, given different subsets of known selectivities (box plots over 200 queries: quartiles, median, and mean for DB2 vs. ME, for knowledge sets {s1,3, s2,3}, {s1,2, s1,3}, {s1,2, s2,3}, {s1,2, s1,3, s2,3})

  32. Backward Estimation: Given s1,2,3, predicting s1,2, s1,3, and s2,3 (box plots over 200 queries: quartiles, median, and mean for DB2 vs. ME)

  33. Computation Cost

  34. Related Work • Selectivity Estimation • SAC+79 P. G. Selinger et al.: Access Path Selection in a Relational DBMS. SIGMOD 1979 • Chr83 S. Christodoulakis: Estimating record selectivities. Inf. Syst. 8(2): 105-115 (1983) • Lyn88 C. A. Lynch: Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distribution of Column Values. VLDB 1988: 240-251 • PC84 G. Piatetsky-Shapiro, C. Connell: Accurate Estimation of the Number of Tuples Satisfying a Condition. SIGMOD Conference 1984: 256-276 • PIH+96 V. Poosala et al.: Improved histograms for selectivity estimation of range predicates. SIGMOD 1996 • Recommending, Constructing, and Maintaining Multivariate Statistics • AC99 A. Aboulnaga, S. Chaudhuri: Self-tuning Histograms: Building Histograms Without Looking at Data. SIGMOD 1999: 181-192 • BCG01 N. Bruno, S. Chaudhuri, L. Gravano: STHoles: A Multidimensional Workload-Aware Histogram. SIGMOD 2001 • BC02 N. Bruno, S. Chaudhuri: Exploiting Statistics on Query Expressions for Optimization. SIGMOD 2002 • BC03 N. Bruno, S. Chaudhuri: Efficient Creation of Statistics over Query Expressions. ICDE 2003 • BC04 N. Bruno, S. Chaudhuri: Conditional Selectivity for Statistics on Query Expressions. SIGMOD 2004: 311-322 • SLM+01 M. Stillger, G. Lohman, V. Markl, and M. Kandil: LEO – DB2’s Learning Optimizer. VLDB 2001 • IMH+04 I. F. Ilyas, V. Markl, P. J. Haas, P. G. Brown, A. Aboulnaga: CORDS: Automatic discovery of correlations and soft functional dependencies. SIGMOD 2004 • CN00 S. Chaudhuri, V. Narasayya: Automating Statistics Management for Query Optimizers. ICDE 2000: 339-348 • DGR01 A. Deshpande, M. Garofalakis, R. Rastogi: Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data. SIGMOD 2001 • GJW+03 C. Galindo-Legaria, M. Joshi, F. Waas, et al.: Statistics on Views. VLDB 2003: 952-962 • GTK01 L. Getoor, B. Taskar, D. Koller: Selectivity Estimation using Probabilistic Models. SIGMOD 2001 • PI97 V. Poosala, Y. Ioannidis: Selectivity Estimation without value independence. VLDB 1997 • Entropy and Maximum Entropy • Sha48 C. E. Shannon: A mathematical theory of communication. Bell System Technical Journal 27: 379-423 and 623-656, 1948 • DR72 J. N. Darroch, D. Ratcliff: Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics 43, 1972: 1470-1480 • GP00 W. Greiff, J. Ponte: The maximum entropy approach and probabilistic IR models. ACM TOIS 18(3): 246-287, 2000 • GS85 S. Guiasu, A. Shenitzer: The principle of maximum entropy. The Mathematical Intelligencer 7(1), 1985

  35. Conclusions • Problem: Inconsistent cardinality model and bias in today’s query optimizers • due to overlapping multivariate statistics (MD histograms, etc.) • To reduce bias, today’s optimizers only use a consistent subset of the available multivariate statistics • → Cardinality estimates are suboptimal despite better information • → Bias towards plans without proper statistics (“fleeing from knowledge to ignorance”) • Solution: Maximizing information entropy • Generalizes the concepts of uniformity and independence used in today’s query optimizers • All statistics are utilized → cardinality estimates improve, some by orders of magnitude • Cardinality model is consistent → no bias towards particular plans • Consistent estimates are computed in subsecond time for up to 10 predicates per table; however, the algorithm is exponential in the number of predicates • Not covered in the talk (see paper): • Reducing algorithm complexity through pre-processing • Impact on query performance → speedup, sometimes by orders of magnitude • Future Work: Extension to join estimates
