240 likes | 425 Vues
Ripple Joins for Online Aggregation. by Peter J. Haas and Joseph M. Hellerstein published in June 1999 presented by Nag Prajval B.C. In Simple words.
E N D
Ripple Joins for Online Aggregation by Peter J. Haas and Joseph M. Hellerstein published in June 1999 presented by Nag Prajval B.C
In Simple words This paper tells how to join a bunch of tables and get the SUM, COUNT, or AVG in GROUP BY clauses showing approximate results immediately and the confidence interval of the results from the first few tuples retrieved updating a GUI display with closer approximation information as the join adds more tuples.
Motivation • Traditional Join Algorithms Nested- Loop Join Merge Join Hash Join • Main Goal: These join algorithms are designed minimize the time to completion of query.
Online System • Huge database with large tables. Select student.honors_code,avg(enroll grade) From enroll, student Where enroll.sid=student.sid Group by student.honors_code;
Traditional Algorithms • Traditional algorithms take a lot of time since they have to process the entire tables or relations • The users have to wait for a long time before the results are returned. • Many times aggregation queries are used to get a big picture of a dataset. • More Appropriate method would be Online Aggregation.
Online Aggregation • A running estimate of the final aggregates are continuously displayed to the user. • The proximity of the running estimate to the final result is also displayed to the user.(confidence interval) • Quick results rather than minimize time for completion. • This required some changes to be made to the traditional algorithms for query processing.
Join Algorithms for Online Processing • Traditional Nested loop algorithm could be used in an online fashion. • Drawbacks: 1. If R was of non trivial size then time between successive updates could be excessive. 2. consider a query like this: Select AVG(S.a+ R.b/1000000) From S and R Where S.x=R.x As scan progresses no new information is retrieved ie the Confidence Intervals do not shrink well.
Ripple Join: Operation Assume ripple join of relations R and S • Select a random tuple r from R. • Join with previously selected S tuples. • Select a random tuple s from S. • Join with previously selected R tuples. • Join r and s.
Aspect Ratios • Aspect ratio: How many tuples are retrieved from each base relation per sampling step. e.g. β1 = 1, β2 = 3, … • Square Ripple Join:Samples are drawn from R and S at the same rate. • Rectangular Ripple Join:One relation is sampled at a higher rate than another to provide shortest possible Confidence Intervals.
Ripple Join Algorithm For(max=1 to infinity) { for(i=1 to max-1) if(predicate(R[i],s[max])) output(R[i],S[max]); for(i=1 to max) if(predicate(R[max],s[i])) output(R[max],S[i]); }
Ripple Join Algorithm • Generalization of Nested loop join • Roles of Inner and Outer Relations continuosly interchange during processing • Cursor at S fixed at max=n whil cursor into R loops from 1 to n-1 • When cursor into R reaches value n,the cursor into S loops from 1 to n.
Why call this "Ripple Join"? • The algorithm seems to ripple out from a corner of the join. • Acronym: "Rectangles of Increasing Perimeter Length"
Iterators • Ripple joins can be developed using the iterator model. • 2 variants 1.Square Binary Ripple Join Iterator 2.Enhanced Ripple Join Iterator
Enhanced Ripple Join Iterator • Drawbacks of Square Ripple join iterator are handled 1)Non Unitary aspect ratios 2)Pipelining multiple ripple joins
Ripple Join Variants • Block Ripple Join • Index Ripple Join • Hash Ripple Join
Performance • Goal: Provide efficient and accurate estimation. • Join results should be returned in a such a way that aggregates are updated regularly and confidence intervals shrink rapidly. • Depends on 2 important things: 1.Estimators for SUM,COUNT and AVG 2.Confidence intervals
Estimators for SUM,COUNT and AVG Consider a query of the form SELECT op(expression) from R,S WHERE predicate; At end of nth sampling step estimator of SUM(expression) is
Estimators for SUM,COUNT and AVG • Rn and Sn :Set of Tuples that have been read at nth sampling step • R and S :Set of Relations • expressionP(r,s) equals expression(r,s) if (r,s) satisfies where clause and 0 otherwise • Count(*): expressionP(r,s) replaced by oneP(r,s) where oneP(r,s) =1 if (r,s) satisfies where clause and 0 otherwise. • AVG(expression)=SUM estimator/COUNT estimator.
Confidence interval • A running confidence interval displays how close this answer is to the final result. • This could be calculated in many ways. • The authors present an example calculation built on extending the Central Limit Theorem.
Ripple Joins • Ripple joins are designed to minimize the time until an acceptably precise estimate of the query result is available, as measured by the length of a confidence interval. • Ripple joins are adaptive, adjusting their behavior during processing in accordance with the statistical properties of the data.
Ripple Joins • Ripple joins also permit the user to dynamically trade off the two key performance factors of online aggregation: the time between successive updates of the running aggregate, and the amount by which the confidence-interval length decreases at each update. • Ripple joins appear to be among the first database algorithms to use statistical information about the data not just to estimate selectivities and processing costs, but to estimate the quality of the result currently being displayed to the user and to dynamically adjust algorithm behavior accordingly.