A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Written By SurajitChaudhuri , Gautam Das , VivekMarasayya (Microsoft Research, Washington) Presented By Melissa J Fernandes A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Goal of the papers we have studied so far is to approximately answer aggregation queries Accurately Efficiently Main Idea of This Paper : Tailor the choice of samples to be robust for W similar not necessary identical to give workloads.

Two ways of Data mining for analyzing Large DB. OLAP :- Online analytical processing is an approach to quickly answer multi-dimensional analytical queries using OLAP cubes. OLAP cube :- is a data structure that allows fast analysis of data. Example Data of Product By city By type By product type Drawbacks: Expensive Resource intensive -wiki

2) Pre-computed samples • -This method gives approximate answers very efficiently. • But may have larger errors, cause finding the samples with large variance is almost impossible. • 3) Using Workloads • What is a Workload ? • Set of Transact-SQL statements that execute against a DB • or databases that have to be tuned. - msdn • Tuned :- Optimize performance of DB

How Sample set is generated from Workload? • - In practical world Queries fall under a particular pattern. Eg : Info of Texas state. • The queries are run on the entire data base (R). • A column is added to the records. This column holds the data that states if the record was selected by the queries. (tagging) • If the record was selected by many queries its probability of being in the sample is higher. • This way Workload is used to generate Sample set.

ICICLES : A new class of samples that tune themselves to a dynamic Workload. R Outliers : Identify the tuples with Outlier values and store them in a separate relation. Run query on T1 URSAMP on T2 Estimate the true result Using methods in paper combine results. Icicle Outlier table (T1) R(Q1) R(Q2) No Outlier values Table (T2) R(Q3)

Drawbacks of Previous studies :- They have intuitive appeal they lack rigorous problem formulation. Do not deal with uncertainty in Workloads. Ignore data variance in the data distribution of the aggregate column.

Architecture for Approximate Query Processing Offline component for selecting samples from (R) Rewrite the queries to use the sample to answer the query approximately. Report the answers with estimate errors.

Offline component selects samples from R. Each record has Scale Factor Column (or in different relation) Value for the aggregate column of each record is scaled up by multiplying by scale factor and then aggregated.

Error Metrics : y = correct ans y’ = approximate ans Relative error :E(Q) = (|y-y’|)/|y| Squared relative error : SE(Q) = (|y-y’|)2 / |y|2 Group By Query includes g groups yi = correct ans for ith group SE(Q) = (1/g) Σi (yi – yi’) 2 / yi2 A group by query with g groups g select queries with 1/g weight each .

pw = Probability Distribution pw(Q) = probability of query Q is given Mean Square error is MSE(pw) = ΣQ pw(Q) * SE(Q) Root mean squared error :

Fundamental Regions Q1 = select no. between 10 and 50 Q2= select values between 40 and 70 R = Relation W = Workspace n = no. of records in R R is divided into min no. of regions R1,R2,…… Rr Rj = each query in the W selects either all records in Rj or none. Upper Bound on no. of regions is min (2|w| , n)

Fixed Workload Problem Statement : (Fixed Samp) Input : R, W, k Output: A sample of k records such that MSE(W) is minimized. Soultion : (3 steps) Identify fundamental regions Picking exactly one record from each imp fundamental region Assign appropriate values to additional columns in the sample records.

Step 1 : Number of fundamental regions (r) is induced by the Workload W. Case 1 : (r <=k) Case 2 : (r >k) ____________________________ Case 1: (r<=k) Step 2 : Pick one sample from each fundamental region (10,40,60,80) Step 3: Column RegionCount and AggSum are updated Region Count = { 3,2,2,2,) AggSum = {60,90,130,170) We can ans COUNT , SUM and AVG queries.

Case 2 :- (r>k) Step 2: 1) Sort all regions by importance Importance = fj*nj2fj = sum of weights of all the queries in W that select the region j. nj = no. of records in the region j. (fj) = measures the weights of the queries that are affected by Rj (nj2) = measures the effect on the error by not including Rj. 2) Pick the top k Step 3 Assign the values to additional coulmns (Regioncount , AggSum) Not same as pervious (k rec , 2k unknows ({RC1, … RCk} & {AS1, ,ASk} ) MSE(W) = quadratic eq. on Differentiation we get Linear eq. Disadvantage : If queries not identical unpredictable errors.

Lifting Workload Q an Q’ => similar when records selected by them significantly overlap R = Relation Q = Query RQ = rec selected by the query p{q} (R’) => denotes the probability of occurrence of any query that selects the set of records R’.

Define the degree to which W influences Q’s distribution. δ (½ ≤ δ ≤1) and γ (0 ≤ γ ≤ ½) For any record inside RQ Probability that an incoming query will select this record δ For any record outside RQ γ Probability that an incoming query will select this record • n1, n2, n3, and n4 are the counts of records in the regions. • n2or n4large (large overlap), P{Q}(R’) is high • n1orn3large (small overlap), P{Q}(R’) is low

δ = inside RQ &selected by R’ γ = outside RQ &selected by R’ δ => 1 γ => 0 RQ and R’ identical δ => 1 γ => ½ R’ super set of RQ δ => ½ γ => 0 R’ subset of RQ δ => ½ γ => ½ R’ is unrestricted

STRATIFIED SAMPLING PROBLEM : SAMP Input : R, pw , k (pw probability distribution fun specified by W) Output : a sample of k records such that MSE(pw) is minimized

Solution STRAT for single-table selection queries with Aggregation • 3 Steps : • 1.Stratification • (a) How many strata to partition R into ? • (b) How many records in each strata? • 2. Allocation • Determine the number of samples required across each strata • 3.Sampling

Count Aggregation • Stratification : • No of statum = No. of fundamental regions (Lemma1) • 2) Allocation : We want to minimize the error over queries in pw . k1, … kr are unknown variables such that Σkj = k. From Equation on an earlier slide, MSE(pW) can be expressed as a weighted sum of the MSE of each query in the workload: Lemma 2: MSE(pW) = Σiwi MSE(p{Q})

Lemma 3 : For a COUNT query Q in W, let ApproxMSE(p{Q}) = Then Expected relative squared error in estimating count of RQ Expected squared error in estimating the count (RQ union Rj) Expected squared error in estimation the count of (RQ union R/RQ)

Since we have an (approximate) formula for MSE(p{Q}), we can express MSE(pw) as a function of the kj’svariables. Corollary 1 : MSE(pw) = Σj(αj/ kj), where each αj is a function of n1,…,nr, δ, and γ. αj captures the “importance” of a region; it is positively correlated with nj as well as the frequency of queries in the workload that access Rj. Now we can minimize MSE(pw). Lemma 4: Σj(αj/ kj) is minimized subject to Σjkj = k if kj = k * ( sqrt(αj) / Σisqrt(αi) ) This provides a closed-form and computationally inexpensive solution to the allocation problem since αj depends only on δ, γ and the number of tuples in each fundamental region.

Solution For Sum Aggregate • 1) Stratification • Can not use same stratification as in Count. • Use Bucketing Technique • Fundamental regions are with large variance are divided into finer regions with significantly lower internal variance. • Each finer region treated as a strata.

2) Allocation -Like COUNT, we express an optimization problem with h*r unknowns k1,…, kh*r. • Unlike COUNT, the specific values of the aggregate column in each region (as well as the variance of values in each region) influence MSE(p{Q}). • Let yj(Yj) be the average (sum) of the aggregate column values of all records in region Rj. • Since the variance = small, so approximate each value in the region to yj.

Thus to express MSE(p{Q}) as a function of the kj’sfor a SUM query Q in W: As with COUNT, MSE(pW) for SUM is functionally of the form Σj(αj/ kj), and αj depends on the same parameters n1, …nh*r , δ, and γ (Corollary 1).

Pragmatic Issues -Identifying Fundamental Regions -Handling Large Number of Fundamental Regions -Obtaining Integer Solutions -Obtaining an Unbiased Estimator

Extensions for more general Workloads • Group Queries : - • Q partitions R into g groups • Lifting model : - Replace Q with g separate selection queries • Tagging step :- append <c,v> • (c = column id used in Group by. V = value of Group by in record t.) • Join Query :- Star queries contains • One source relation and a set of dimension relations connected via foreign key joins. • Group by and selection on source and dimension relation • Aggregation over columns of the source relation. • Approach 1 • - identify samples only over source relation. ( source relation is large) • -Approach 2 • - Identify source relation and precompute its join with all dimension relations.

Experimental Results : FIXED – solution for FIXEDSAMP, fixed workload, identical queries STRAT – solution for SAMP, workloads with single-table selection queries with aggregation PREVIOUS WORK USAMP – uniform random sampling WSAMP – weighted sampling OTLIDX – outlier indexing combined with weighted sampling CONG – Congressional sampling

Conclusion: The solutions FIXED and STRAT handle the problems of data variance, heterogeneous mixes of queries, GROUP BY and foreign-key joins.

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries