
Consistently Estimating the Selectivity of Conjuncts of Predicates


Presentation Transcript


  1. Consistently Estimating the Selectivity of Conjuncts of Predicates. Volker Markl, Nimrod Megiddo, Marcel Kutsch, Tam Minh Tran, Peter Haas, Utkarsh Srivastava

  2. Agenda • Consistency and Bias Problems in Cardinality Estimation • The Maximum Entropy Solution • Iterative Scaling • Performance Analysis • Related Work • Conclusions

  3. What is the problem? • Consider the following three attributes: Make, Model, Color (diagram: correlations among the three attributes)

  4. How to estimate the cardinality of the predicate Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’? • card(Make = ‘Mazda’) = 100,000 • card(Model = ‘323’) = 200,000 • card(Color = ‘red’) = 200,000 • Real cardinality of the conjunct: 49,000

  5. Without any additional knowledge • Base cardinality: 1,000,000 • Selectivity(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’) under the independence assumption: s(Make = ‘Mazda’) * s(Model = ‘323’) * s(Color = ‘red’) = (100,000 / 1,000,000) * (200,000 / 1,000,000) * (200,000 / 1,000,000) = 0.1 * 0.2 * 0.2 = 0.004 • Estimated cardinality: 0.004 * 1,000,000 = 4,000 • Legend: s(?) denotes the selectivity of predicate ?

  6. Additional knowledge given (1): • Additional knowledge: card(Make = ‘Mazda’ AND Model = ‘323’) = 50,000 • Selectivity(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’) • case 1: s(Make AND Model) * s(Color) = (50,000 / 1,000,000) * (200,000 / 1,000,000) = 0.05 * 0.2 = 0.01 → estimated cardinality: 10,000

  7. Additional knowledge given (2): • Additional knowledge: card(Make = ‘Mazda’ AND Model = ‘323’) = 50,000, card(Make = ‘Mazda’ AND Color = ‘red’) = 90,000 • Selectivity(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’) • case 1: s(Make AND Model) * s(Color) = 0.01 → estimated cardinality: 10,000 • case 2: s(Make AND Color) * s(Model) = (90,000 / 1,000,000) * (200,000 / 1,000,000) = 0.09 * 0.2 = 0.018 → estimated cardinality: 18,000

  8. Additional knowledge given (3): • Additional knowledge: card(Make = ‘Mazda’ AND Model = ‘323’) = 50,000, card(Make = ‘Mazda’ AND Color = ‘red’) = 90,000, card(Model = ‘323’ AND Color = ‘red’) = 150,000 • Selectivity(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’) • case 1: s(Make AND Model) * s(Color) = 0.01 → estimated cardinality: 10,000 • case 2: s(Make AND Color) * s(Model) = 0.018 → estimated cardinality: 18,000 • case 3: s(Model AND Color) * s(Make) = (150,000 / 1,000,000) * (100,000 / 1,000,000) = 0.15 * 0.1 = 0.015 → estimated cardinality: 15,000
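To make the inconsistency concrete, here is a small sketch (my own illustration, not material from the talk) that recomputes the estimates of slides 5–8 from the example statistics; every choice of statistics yields a different selectivity for the same conjunct:

```python
# Sketch of the arithmetic on slides 5-8, using the example numbers from the talk.
N = 1_000_000                               # base table cardinality

s_make, s_model, s_color = 100_000 / N, 200_000 / N, 200_000 / N   # single-predicate selectivities
s_make_model  = 50_000  / N                 # s(Make='Mazda' AND Model='323')
s_make_color  = 90_000  / N                 # s(Make='Mazda' AND Color='red')
s_model_color = 150_000 / N                 # s(Model='323' AND Color='red')

cases = {
    "case 0 (independence)":     s_make * s_model * s_color,
    "case 1 (use Make,Model)":   s_make_model * s_color,
    "case 2 (use Make,Color)":   s_make_color * s_model,
    "case 3 (use Model,Color)":  s_model_color * s_make,
}
for name, sel in cases.items():
    print(f"{name}: selectivity = {sel:.3f}, estimated cardinality = {sel * N:,.0f}")
# Four different answers for the same conjunct -> the cardinality model is inconsistent.
```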

  9. Why is this a problem? • Three plans produce the same intermediate result, but the optimizer estimates its cardinality differently in each: • Index intersection of Make, Model, Color → case 0: s(Make) * s(Model) * s(Color) = 0.004 → estimated card: 4,000 • Index scan on (Make, Color) [90,000 rows] + FETCH applying Model → case 2: s(Make AND Color) * s(Model) = 0.018 → estimated card: 18,000 • Index scan on (Model, Color) [150,000 rows] + FETCH applying Make → case 3: s(Model AND Color) * s(Make) = 0.015 → estimated card: 15,000 • Cardinality bias: the plan with the lowest estimate wins, i.e., the plan that uses the least knowledge (fleeing from knowledge to ignorance)

  10. What has happened? • Inconsistent model • different estimates for the same intermediate result • due to multivariate statistics with overlapping information • Bias during plan selection • results in the selection of sub-optimal plans • Bias Avoidance means keeping the model consistent • State-of-the-art is to do bookkeeping of the first multivariate statistic used, and ignore further overlapping multivariate statistics • Does not solve the problem, as ignoring knowledge also means bias • Bias is arbitrary, depends on what statistics are used first during optimization • Only possible solution is to exploit all knowledge consistently

  11. Problem: Only partial knowledge of the DNF atoms • The three predicates partition the rows into 2^3 = 8 DNF atoms (the regions of the Venn diagram): Mazda&323&red, Mazda&323&¬red, Mazda&¬323&red, ¬Mazda&323&red, Mazda&¬323&¬red, ¬Mazda&323&¬red, ¬Mazda&¬323&red, ¬Mazda&¬323&¬red • The additional knowledge only constrains sums of these atoms: • Make AND Model: card(‘Mazda’ AND ‘323’) = 50,000 • Make AND Color: card(‘Mazda’ AND ‘red’) = 90,000 • Model AND Color: card(‘323’ AND ‘red’) = 150,000 • Legend: DNF = disjunctive normal form; ¬X denotes not X

  12. How to compute the missing values of the distribution? • We want Probability(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’), i.e., the DNF atom Mazda&323&red • The known selectivities do not determine the eight atom probabilities uniquely, so the missing values must be filled in by some principle

  13. Solution: Information Entropy H(X) = -∑ xi log(xi) • Entropy is a measure of the “uninformedness” of a probability distribution X = (x1, …, xm) with x1 + … + xm = 1 • Maximizing information entropy for the unknown selectivities, using the known selectivities as constraints, avoids bias • The less is known about a probability distribution, the larger its entropy • Nothing known → uniformity: s(X = ?) = 1/m • Only marginals known → independence: s(X = ? AND Y = ?) = s(X = ?) * s(Y = ?) • Thus the principle of maximum entropy generalizes the uniformity and independence assumptions used in today’s query optimizers
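A minimal sketch (my own, not from the talk) of the entropy function, illustrating that the uniform distribution is the least informed and that more concentrated distributions have lower entropy:

```python
import math

def entropy(xs):
    """H(X) = -sum x_i * log(x_i); terms with x_i = 0 contribute 0."""
    return -sum(x * math.log(x) for x in xs if x > 0)

m = 4
print(entropy([1 / m] * m))           # uniform: log(4) ~= 1.386, the maximum over 4 outcomes
print(entropy([0.7, 0.1, 0.1, 0.1]))  # skewed: ~0.940, lower entropy = more informed
print(entropy([1.0, 0.0, 0.0, 0.0]))  # degenerate: 0.0, the distribution is fully determined
```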

  14. Entropy Maximization for Cardinality Estimation • Given some selectivities (single and conjunctive) over a space of n predicates p1, …, pn → choose a model which is consistent with this knowledge but otherwise as uniform as possible → maximize the entropy of the probability distribution X = (xb | b ∈ {0,1}^n) • xb is the selectivity of the DNF atom b • bi = 0 means that predicate pi is negated in the DNF atom; bi = 1 means that pi occurs positively • Legend: {0,1}^n denotes the n-fold cross product of the set {0,1}, i.e., {0,1} × … × {0,1}, n times; for a predicate p, p^1 = p and p^0 = not p
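To make the notation concrete, a small sketch (my own illustration, not code from the paper) of DNF atoms as bit vectors for the three example predicates; the helper mirrors the set C(Y) formalized on slide 17:

```python
from itertools import product

# Predicates (0-indexed here): p0 = (Make='Mazda'), p1 = (Model='323'), p2 = (Color='red').
# A DNF atom is a bit vector b in {0,1}^n; b[i] = 1 keeps predicate p_i, b[i] = 0 negates it.
n = 3
atoms = list(product((0, 1), repeat=n))   # (0,0,0), (0,0,1), ..., (1,1,1) -- 2^n atoms

def C(Y):
    """All DNF atoms that contribute to the known selectivity s_Y (cf. slide 17)."""
    return [b for b in atoms if all(b[i] == 1 for i in Y)]

print(C({0, 1}))   # atoms summing up to s(Make AND Model): [(1, 1, 0), (1, 1, 1)]
print(C(set()))    # the empty set constrains all 8 atoms (they must sum to 1)
```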

  15. Maximum Entropy Principle – Example • Each Venn-diagram region is a DNF atom labeled by a bit vector b = (b1, b2, b3), e.g., 111 = Mazda&323&red, 110 = Mazda&323&¬red, …, 000 = ¬Mazda&¬323&¬red • Knowledge sY, Y ∈ T: s1 = s(Mazda) = 0.1, s2 = s(323) = 0.2, s3 = s(red) = 0.2, s1,2 = s(Mazda & 323) = 0.05, s1,3 = s(Mazda & red) = 0.09, s2,3 = s(323 & red) = 0.15 • T = {{1}, {2}, {3}, {1,2}, {1,3}, {2,3}, ∅} • Example constraint: s1 = x100 + x101 + x110 + x111

  16. Maximum Entropy Principle – Example • Knowledge sY, Y ∈ T: s1 = s(Mazda) = 0.1, s2 = s(323) = 0.2, s3 = s(red) = 0.2, s1,2 = s(Mazda & 323) = 0.05, s1,3 = s(Mazda & red) = 0.09, s2,3 = s(323 & red) = 0.15; T = {{1}, {2}, {3}, {1,2}, {1,3}, {2,3}, ∅} • Constraints: • 0.10 = s1 = x100 + x101 + x110 + x111 • 0.20 = s2 = x010 + x011 + x110 + x111 • 0.20 = s3 = x001 + x011 + x101 + x111 • 0.05 = s1,2 = x110 + x111 • 0.09 = s1,3 = x101 + x111 • 0.15 = s2,3 = x011 + x111 • 1.00 = sØ = x000 + x001 + x010 + x011 + x100 + x101 + x110 + x111 • Objective function: maximize the entropy H(X) = -∑b xb log(xb)
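As a sanity check on this constraint system (my own sketch using the example numbers), the seven known selectivities give only seven linear equations over eight unknown atom selectivities, so the constraints alone are underdetermined and a selection principle such as maximum entropy is needed to pick a unique solution:

```python
import numpy as np
from itertools import product

atoms = list(product((0, 1), repeat=3))                 # the 8 DNF atoms
known = {(): 1.0,                                       # s_empty
         (0,): 0.1, (1,): 0.2, (2,): 0.2,               # s1, s2, s3
         (0, 1): 0.05, (0, 2): 0.09, (1, 2): 0.15}      # s12, s13, s23

# Row Y of A sums exactly the atoms in C(Y); the constraints read A @ x = s.
A = np.array([[1.0 if all(b[i] for i in Y) else 0.0 for b in atoms] for Y in known])
s = np.array(list(known.values()))

print(A.shape, np.linalg.matrix_rank(A))   # (7, 8), rank 7: one degree of freedom remains
# Many non-negative x solve A @ x = s; the maximum-entropy principle selects one of them.
```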

  17. Solving the Constrained Optimization Problem • Minimize the objective function ∑b xb log(xb) (equivalently, maximize the entropy H(X)) • Satisfying the |T| constraints, T ⊆ 2^{1,…,n}: ∑b∈C(Y) xb = sY for each Y ∈ T • Legend: 2^{1,…,n} denotes the powerset of {1,…,n}; C(Y) denotes all DNF atoms that contribute to sY, i.e., formally, C(Y) := {b ∈ {0,1}^n | ∀i∈Y: bi = 1} and C(∅) := {0,1}^n • General solution: Iterative Scaling

  18. Maximum Entropy and Lagrange Multipliers • The objective function ∑b xb log(xb) is convex • We can build a Lagrangian function by associating a multiplier λY with each constraint and subtracting the constraints from the objective function • Differentiation w.r.t. xb and equating to zero yields the conditions for the minimum • Exponentiating the Lagrange multipliers (zY = e^λY) in the derivatives yields the product form xb = e^(-1) * ∏ Y∈P(b,T) zY • Replacing xb in each constraint yields a condition in the exponentiated Lagrange multipliers zY • Legend: P(b,T) ⊆ T denotes the indexes Y of all known selectivities sY to which DNF atom b contributes its value xb: P(b,T) = {Y∈T | ∀i∈Y: bi = 1} ∪ {∅}

  19. Iterative Scaling • We can now isolate zY for a particular Y ∈ T and thus iteratively compute zY from all zW, W ∈ T\{Y} • This algorithm is called Iterative Scaling (Darroch and Ratcliff, 1972) and converges to a stable set of Lagrange multipliers zY, Y ∈ T • This stable point minimizes the objective function and satisfies all constraints • We can compute all DNF atoms xb from these stable multipliers using the product form, and can in turn compute all missing selectivities
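A compact sketch of the iterative scaling loop for this example (my own implementation, not the authors' code). The e^(-1) factor in the product form follows from the Lagrangian derivation and is consistent with the per-step numbers on the following slides; the update order mirrors slides 21–27:

```python
import math
from itertools import product

def iterative_scaling(n, known, iters=10000, tol=1e-9):
    """Maximum-entropy DNF-atom selectivities via iterative scaling (Darroch & Ratcliff 1972).

    `known` maps frozensets of predicate indexes Y to the known selectivity s_Y,
    including frozenset() -> 1.0. One exponentiated multiplier z_Y per constraint.
    """
    atoms = list(product((0, 1), repeat=n))
    z = {Y: 1.0 for Y in known}

    def x(b):
        # Product form from the Lagrangian: x_b = e^-1 * prod of z_Y over all Y that b contributes to.
        val = math.exp(-1)
        for Y in known:
            if all(b[i] for i in Y):
                val *= z[Y]
        return val

    for _ in range(iters):
        delta = 0.0
        for Y, s_Y in known.items():
            contrib = [b for b in atoms if all(b[i] for i in Y)]   # C(Y)
            denom = sum(x(b) for b in contrib) / z[Y]              # current sum with z_Y factored out
            new_z = s_Y / denom                                    # rescale so constraint Y holds exactly
            delta = max(delta, abs(new_z - z[Y]))
            z[Y] = new_z
        if delta < tol:
            break
    return {b: x(b) for b in atoms}

# Example from the talk (predicate indexes: 0 = Mazda, 1 = 323, 2 = red); update order as on the slides.
known = {frozenset({0}): 0.1, frozenset({1}): 0.2, frozenset({0, 1}): 0.05,
         frozenset({2}): 0.2, frozenset({0, 2}): 0.09, frozenset({1, 2}): 0.15,
         frozenset(): 1.0}
sol = iterative_scaling(3, known)
print(round(sol[(1, 1, 1)], 6))   # max-entropy estimate for Mazda AND 323 AND red
```

Running this should reproduce the example's result, x111 ≈ 0.049918 (see slide 28).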

  20. Maximum Entropy Solution of the Example • Knowledge: s(Mazda) = s1 = 0.1, s(323) = s2 = 0.2, s(red) = s3 = 0.2, s(Mazda & 323) = s1,2 = 0.05, s(Mazda & red) = s1,3 = 0.09, s(323 & red) = s2,3 = 0.15 • Wanted: Selectivity(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’) = s1,2,3 = x111 = ???

  21. Iterative Scaling – 1st iteration, step s1 (atoms x100, x101, x110, x111) • Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1 • Multipliers after the step: z1 = 0.067957, z2 = 1, z1,2 = 1, z3 = 1, z1,3 = 1, z2,3 = 1, zØ = 1 • Current model values: s1 = 0.1, s2 = 0.785759, s1,2 = 0.05, s3 = 0.785759, s1,3 = 0.05, s2,3 = 0.392879, sØ = 1.571518

  22. Iterative Scaling – 1st iteration, step s2 (atoms x010, x011, x110, x111) • Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1 • Multipliers after the step: z1 = 0.067957, z2 = 0.254531, z1,2 = 1, z3 = 1, z1,3 = 1, z2,3 = 1, zØ = 1 • Current model values: s1 = 0.062727, s2 = 0.2, s1,2 = 0.012727, s3 = 0.492879, s1,3 = 0.031363, s2,3 = 0.1, sØ = 0.985759

  23. Iterative Scaling – 1st iteration, step s1,2 (atoms x110, x111) • Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1 • Multipliers after the step: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 1, z1,3 = 1, z2,3 = 1, zØ = 1 • Current model values: s1 = 0.1, s2 = 0.237273, s1,2 = 0.05, s3 = 0.511516, s1,3 = 0.05, s2,3 = 0.118637, sØ = 1.023032

  24. Iterative Scaling – 1st iteration, step s3 (atoms x001, x011, x101, x111) • Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1 • Multipliers after the step: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 1, z2,3 = 1, zØ = 1 • Current model values: s1 = 0.069550, s2 = 0.165023, s1,2 = 0.034775, s3 = 0.2, s1,3 = 0.019550, s2,3 = 0.046386, sØ = 0.711516

  25. Iterative Scaling – 1st iteration, step s1,3 (atoms x101, x111) • Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1 • Multipliers after the step: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1, zØ = 1 • Current model values: s1 = 0.14, s2 = 0.200248, s1,2 = 0.07, s3 = 0.27045, s1,3 = 0.09, s2,3 = 0.081611, sØ = 0.781966

  26. Iterative Scaling – 1st iteration, step s2,3 (atoms x011, x111) • Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1 • Multipliers after the step: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1.837978, zØ = 1 • Current model values: s1 = 0.177709, s2 = 0.268637, s1,2 = 0.107709, s3 = 0.338839, s1,3 = 0.127709, s2,3 = 0.15, sØ = 0.850355

  27. Iterative Scaling – 1st iteration, step sØ (all atoms) • Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1 • Multipliers after the step: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1.837978, zØ = 1.175979 • Current model values: s1 = 0.208982, s2 = 0.315911, s1,2 = 0.126664, s3 = 0.398468, s1,3 = 0.150183, s2,3 = 0.176397, sØ = 1 • DNF atoms after the 1st iteration: x000 = 0.432619, x001 = 0.169152, x010 = 0.110115, x011 = 0.079133, x100 = 0.029399, x101 = 0.052919, x110 = 0.029399, x111 = 0.097264

  28. Maximum Entropy Solution of the Example • Knowledge: s(Mazda) = s1 = 0.1, s(323) = s2 = 0.2, s(red) = s3 = 0.2, s(Mazda & 323) = s1,2 = 0.05, s(Mazda & red) = s1,3 = 0.09, s(323 & red) = s2,3 = 0.15 • Converged DNF atom selectivities: x000 = 0.740082, x001 = 0.009918, x010 = 0.049918, x011 = 0.100082, x100 = 0.009918, x101 = 0.040082, x110 = 0.000082, x111 = 0.049918 • Selectivity(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’) = s1,2,3 = x111 = 0.049918 • Iterations: 241
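As a quick consistency check (again my own sketch, not from the talk), the converged atoms listed above satisfy every known selectivity, and x111 is the desired estimate:

```python
# Atom values from slide 28, keyed by the bit vector (b_Mazda, b_323, b_red).
atoms = {
    (0, 0, 0): 0.740082, (0, 0, 1): 0.009918, (0, 1, 0): 0.049918, (0, 1, 1): 0.100082,
    (1, 0, 0): 0.009918, (1, 0, 1): 0.040082, (1, 1, 0): 0.000082, (1, 1, 1): 0.049918,
}
known = {(): 1.0, (0,): 0.1, (1,): 0.2, (2,): 0.2, (0, 1): 0.05, (0, 2): 0.09, (1, 2): 0.15}

for Y, s_Y in known.items():
    model = sum(x for b, x in atoms.items() if all(b[i] for i in Y))
    print(Y, round(model, 6), s_Y)   # each model value matches its constraint
print(atoms[(1, 1, 1)])              # 0.049918 -> estimated cardinality ~49,918 of 1,000,000 rows
```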

  29. Let’s compare: Selectivity(Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’) • Real: s(Make AND Model AND Color) = 0.049 → actual card: 49,000 • case 0: s(Make) * s(Model) * s(Color) = 0.004 → estimated card: 4,000 (Error: 10x) • case 1: s(Make AND Model) * s(Color) = 0.010 → estimated card: 10,000 (Error: 5x) • case 2: s(Make AND Color) * s(Model) = 0.018 → estimated card: 18,000 (Error: 2.5x) • case 3: s(Model AND Color) * s(Make) = 0.015 → estimated card: 15,000 (Error: 3x) • ME: maximum-entropy estimate s1,2,3 = 0.049918 → estimated card: 49,918 (almost no error)

  30. Forward Estimation: Predicting s1,2,3, given different subsets of known selectivities (box plots over 200 queries: quartiles and median of the estimates for knowledge sets such as {s1,2, s1,3, s2,3}, {s1, s2, s3}, {s1,3, s2,3}, {s1,2, s1,3}, {s1,2, s2,3}, {s1,3}, {s2,3}, {s1,2})

  31. Comparing DB2 and ME: Predicting s1,2,3, given different subsets of known selectivities (box plots over 200 queries: quartiles, median, and mean for DB2 vs. ME, for knowledge sets {s1,3, s2,3}, {s1,2, s1,3}, {s1,2, s2,3}, {s1,2, s1,3, s2,3})

  32. Backward Estimation: Given s1,2,3, predicting s1,2, s1,3, and s2,3 (box plots over 200 queries: quartiles, median, and mean for DB2 vs. ME)

  33. Computation Cost

  34. Related Work • Selectivity Estimation • SAC+79 P. G. Selinger et al.: Access Path Selection in a Relational DBMS. SIGMOD 1979 • Chr83 S. Christodoulakis: Estimating record selectivities. Inf. Syst. 8(2): 105-115 (1983) • Lyn88 C. A. Lynch: Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distribution of Column Values. VLDB 1988: 240-251 • PC84 G. Piatetsky-Shapiro, C. Connell: Accurate Estimation of the Number of Tuples Satisfying a Condition. SIGMOD Conference 1984: 256-276 • PIH+96 V. Poosala et al.: Improved histograms for selectivity estimation of range predicates. SIGMOD 1996 • Recommending, Constructing, and Maintaining Multivariate Statistics • AC99 A. Aboulnaga, S. Chaudhuri: Self-tuning Histograms: Building Histograms Without Looking at Data. SIGMOD 1999: 181-192 • BCG01 N. Bruno, S. Chaudhuri, L. Gravano: STHoles: A Multidimensional Workload-Aware Histogram. SIGMOD 2001 • BC02 N. Bruno, S. Chaudhuri: Exploiting Statistics on Query Expressions for Optimization. SIGMOD 2002 • BC03 N. Bruno, S. Chaudhuri: Efficient Creation of Statistics over Query Expressions. ICDE 2003 • BC04 N. Bruno, S. Chaudhuri: Conditional Selectivity for Statistics on Query Expressions. SIGMOD 2004: 311-322 • SLM+01 M. Stillger, G. Lohman, V. Markl, and M. Kandil: LEO – DB2’s Learning Optimizer. VLDB 2001 • IMH+04 I. F. Ilyas, V. Markl, P. J. Haas, P. G. Brown, A. Aboulnaga: CORDS: Automatic discovery of correlations and soft functional dependencies. SIGMOD 2004 • CN00 S. Chaudhuri, V. Narasayya: Automating Statistics Management for Query Optimizers. ICDE 2000: 339-348 • DGR01 A. Deshpande, M. Garofalakis, R. Rastogi: Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data. SIGMOD 2001 • GJW+03 C. Galindo-Legaria, M. Joshi, F. Waas, et al.: Statistics on Views. VLDB 2003: 952-962 • GTK01 L. Getoor, B. Taskar, D. Koller: Selectivity Estimation using Probabilistic Models. SIGMOD 2001 • PI97 V. Poosala, Y. Ioannidis: Selectivity Estimation without value independence. VLDB 1997 • Entropy and Maximum Entropy • Sha48 C. E. Shannon: A mathematical theory of communication. Bell System Technical Journal 27: 379-423 and 623-656, 1948 • DR72 J. N. Darroch, D. Ratcliff: Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics 43, 1972: 1470-1480 • GP00 W. Greiff, J. Ponte: The maximum entropy approach and probabilistic IR models. ACM TOIS 18(3): 246-287, 2000 • GS85 S. Guiasu, A. Shenitzer: The principle of maximum entropy. The Mathematical Intelligencer 7(1), 1985

  35. Conclusions • Problem: Inconsistent cardinality model and bias in today’s query optimizers • due to overlapping multivariate statistics (MD histograms, etc.) • To reduce bias, today’s optimizers only use a consistent subset of the available multivariate statistics • → Cardinality estimates are suboptimal despite better information • → Bias towards plans without proper statistics (“fleeing from knowledge to ignorance”) • Solution: Maximizing information entropy • Generalizes the concepts of uniformity and independence used in today’s query optimizers • All statistics are utilized → cardinality estimates improve, some by orders of magnitude • Cardinality model is consistent → no bias towards particular plans • Consistent estimates are computed in subsecond time for up to 10 predicates per table; however, the algorithm is exponential in the number of predicates • Not covered in the talk (see paper): • Reducing algorithm complexity through pre-processing • Impact on query performance → speedup, sometimes by orders of magnitude • Future Work: Extension to join estimates
