Optimal Workload-Based Weighted Wavelet Synopsis
This paper presents a novel approach to creating optimal workload-based weighted wavelet synopses for approximating answers in operational databases. It addresses the challenge of long response times for SQL queries by developing compact data synopses that yield the most accurate answers using minimal memory, even in massive data sets. Key contributions include the introduction of Workload-Based Mean Squared Error (WB-MSE) and Workload-Based Mean-squared Relative Error (WB-MRE) metrics. The research builds on existing wavelet synopsis techniques, offering a significant enhancement in accuracy for query workloads.
Optimal Workload-Based Weighted Wavelet Synopsis
E N D
Presentation Transcript
Optimal Workload-Based Weighted Wavelet Synopsis Yossi Matias Daniel Urieli School of Computer Science Tel Aviv University
Outline • Motivation • Background & Contributions • Wavelet synopses • Optimal WB weighted wavelet synopses
Outline • Motivation • Background & Contributions • Wavelet synopses • Optimal WB weighted wavelet synopses
Compact Data Synopses “Transformed” Query Approximate Answer KB/MB FAST!! Approximate Query Processing OperationalDatabase SQL Query Long Response Times! Exact Answer GB/TB
Goals • Develop data synopses • Most accurate answers • Using a small amount of memory • Massive data sets efficient construction • Time • I/O
Outline • Motivation • Background & Contributions • Wavelet synopses • Optimal WB weighted wavelet synopses
Data synopses • Samples: random samples, stratified samples, congressional samples, reservoir-sampling, backing samples, join synopses, sketches • [Olken-Rotem, Vitter, Alon-Matias-Szegedy, Gibbons-Matias-Poosala, Acharia et al…] • Used in commercial DB systems • Histograms: equi-depth, compressed, v-optimal, spline, multi-dimensional, dynamic, Max-diff, MHIST • [Poosala-Ionnidis, etc.] • Used in commercial DB systems • Wavelets synopses: basic, multi-dim, probabilistic, dynamic, extended • Adapts to nature of data effectively • [Matias-Vitter-Wang, Garafolakis-Gibbons, Chakrabarti et al, Rousopoulous-Kiotidis…] • Workload-based wavelet synopses [Matias, Portman]
Workload-based synopses • Future queries correlated to past queries • Can be thought of as taken from a probability distribution roughly determined by the workload • Workload based synopses: optimized for a given query workload • “Standard” synopses assume uniform workload
Workload-based synopses – prior work • Workload-based sampling • Overcoming limitations of sampling for aggregation queries [Chaudhuri, Das, Datar, Motwani, and Narasayya] • Icicles: Self-tuning samples for approximate query answering [Ganti, Lee, Ramakrishnan] • Workload-based histograms • Self-tuning histograms [Aboulnaga and Chaudhuri] • ST-holes [ Bruno et al. ] • Hierarchical range histogram [Guha-Koudas-Srivastava-02] • Workload-based wavelets • By Yossi Matias and Leon Portman
Workload-Based Wavelet synopses [MP03] • Adapts effectively to a given query workload (not only to data) • Reduces the mean-squared-absolute / relative error over a workload of queries • Order magnitude improvement over prior wavelet synopses • Not necessarily optimal
Contributions • Optimal Workload-based Weighted Wavelet (WWW) synopses • WB-MSE (Workload-Based Mean Squared Error) • WB-MRE (Workload-Based Mean-squared Relative Error) • Equivalently, minimize the expected squared, absolute or relative error over a point query • First to minimize the MRE over the data • WB-MRE with uniform distribution • Both WWW synopses are optimal enhanced wavelet synopses • A generalized definition which allows coefficients with arbitrary values • Optimal cost construction • Linear construction time • I/O optimal
Techniques • Problem definition in terms of • Weighted norm • Weighted-inner-product • Weighted-inner-product-space • Weighted wavelets for building data synopses
Outline • Motivation • Background & Contributions • Wavelet synopses • Optimal WB weighted wavelet synopses
Resolution Averages Detail Coefficients 3 [2, 2, 0, 2, 3, 5, 4, 4] ---- 2 [2, 1, 4, 4] [0, -1, -1, 0] 1 [1.5, 4] [0.5, 0] 0 [2.75] [-1.25] [2.75, -1.25, 0.5, 0, 0, -1, -1, 0] Haar wavelet decomposition • Wavelets: mathematical tool for hierarchical decomposition of functions/signals • Haar wavelets: simplest wavelet basis, easy to understand and implement • Recursive pair wise averaging and differencing at different resolutions. • A linear time algorithm.
+ 2.75 + - -1.25 0.5 0 0 -1 0 -1 + - + - + - + - + - + - 2 2 0 2 3 5 4 4 Wavelet error tree [MVW98] Original data
+ + - + - + - + - + - + - + - 1 -1 0 1 The Haar Basis
+ 2.75 + - -1.25 0.5 0 0 -1 0 -1 + - + - + - + - + - + - 2 2 0 2 3 5 4 4 Wavelet error tree [MVW98] How should we choose which coefficients to retain? 1 1 Original data
Parseval-based optimal thresholding • Given a vector with respect to some orthonormal basis • Goal: approximate the vector using only M << N basis coefficients • Then, choosing the largest M coefficients is optimal • Minimizes the L2 norm of the error vector
Haar Wavelet Synopses - summary • Compute Haar wavelet decomposition of D • Coefficient thresholding: only M<<|D| = N coefficients can be kept • Parseval-based thresholding • optimal w.r.t the MSE • Several other greedy heuristics exists
Outline • Motivation • Background & Contributions • Wavelet synopses • Optimal WB weighted wavelet synopses
Given a synopsis S 3.5 3.5 -0.5 -1 0 -2 -1 0 0 -0.5 standard thresholding -0.707 0 0 -1 -0.5 0 WL2(S) 0.498 standard: 4 4 2 2 2 6 4 4 Importance: 0.001 0.001 0.001 0.001 0.249 0.249 0.249 0.249 Workload Example 2 2 2 6 3 5 4 4
0 3.5 -0.5 -1 0 0 -2 -1 WL2(S) 0.008 Importance: 0.001 0.001 0.001 0.001 0.249 0.249 0.249 0.249 Workload Example 3.5 Workload- based thresholding -0.5 -0.707 0 0 -1 -0.5 0 2 2 2 6 3 5 4 4 standard: 4 4 2 2 2 6 4 4 Workload based 5 4 2 2 4 4 3 4
Error definition • D = (d1,…,dN) - our data. • - the point query • - the approximated answer • abs-error: rel-error: • The purpose: reduce a norm of • For example:
Workload-based Error • A workload: (c1,…,cN), where ci is the probability that qi appears. • Given a workload W = (c1,…,cN) we define the Weighted L2 Norm: • When ci = 1/N: WL2(E) = MSE
Our goal • Minimizing the WL2 norm of the errors vector E • For given data set D and query workloads W • Equivalently: minimizing the expected squared error over a point query taken from a given distribution
Regular Haar transform Given a data set D = (d0,…,dN-1) D Haar Transform (HT) HT(D) standard thresholding wavelet synopsis
Parseval’s formula, the WL2 norm, the weighted inner product, and the algorithm for computing the WH basis from the workload Overview Given a data set D = (d0,…,dN-1) and a workload vector W = (c0,…,cN-1) W D WHB(W) Weighted Haar Basis (WHB( Weighted Haar Transform (WHT) WHT(D) standard thresholding WB – wavelet synopsis
x -y 0 1 The weighted Haar basis • The Weighted Haar Basis would also look like but
0 1 c0,c1,… , cN-1 Compute theWeighted Haar Basis • Meaning it would look more like: Recall the weight coefficients (the workload) W = (c0,…,cN-1) for D = (d0,…,dN-1)
Experimental results WB-MSE VS. STANDARD
Experimental results WB-MRE, ADAPTIVE, STANDARD
Experimental results WB-MRE, ADAPTIVE