1 / 24

Ripple Joins for Online Aggregation

Ripple Joins for Online Aggregation. by Peter J. Haas and Joseph M. Hellerstein published in June 1999 presented by Nag Prajval B.C. In Simple words.

chance
Télécharger la présentation

Ripple Joins for Online Aggregation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Ripple Joins for Online Aggregation by Peter J. Haas and Joseph M. Hellerstein published in June 1999 presented by Nag Prajval B.C

  2. In Simple words This paper tells how to join a bunch of tables and get the SUM, COUNT, or AVG in GROUP BY clauses showing approximate results immediately and the confidence interval of the results from the first few tuples retrieved updating a GUI display with closer approximation information as the join adds more tuples.

  3. Motivation • Traditional Join Algorithms Nested- Loop Join Merge Join Hash Join • Main Goal: These join algorithms are designed minimize the time to completion of query.

  4. Online System • Huge database with large tables. Select student.honors_code,avg(enroll grade) From enroll, student Where enroll.sid=student.sid Group by student.honors_code;

  5. Traditional Algorithms • Traditional algorithms take a lot of time since they have to process the entire tables or relations • The users have to wait for a long time before the results are returned. • Many times aggregation queries are used to get a big picture of a dataset. • More Appropriate method would be Online Aggregation.

  6. Online Aggregation • A running estimate of the final aggregates are continuously displayed to the user. • The proximity of the running estimate to the final result is also displayed to the user.(confidence interval) • Quick results rather than minimize time for completion. • This required some changes to be made to the traditional algorithms for query processing.

  7. GUI, 1999

  8. Join Algorithms for Online Processing • Traditional Nested loop algorithm could be used in an online fashion. • Drawbacks: 1. If R was of non trivial size then time between successive updates could be excessive. 2. consider a query like this: Select AVG(S.a+ R.b/1000000) From S and R Where S.x=R.x As scan progresses no new information is retrieved ie the Confidence Intervals do not shrink well.

  9. Ripple Join: Operation Assume ripple join of relations R and S • Select a random tuple r from R. • Join with previously selected S tuples. • Select a random tuple s from S. • Join with previously selected R tuples. • Join r and s.

  10. Example:

  11. Aspect Ratios • Aspect ratio: How many tuples are retrieved from each base relation per sampling step. e.g. β1 = 1, β2 = 3, … • Square Ripple Join:Samples are drawn from R and S at the same rate. • Rectangular Ripple Join:One relation is sampled at a higher rate than another to provide shortest possible Confidence Intervals.

  12. Ripple Join Algorithm For(max=1 to infinity) { for(i=1 to max-1) if(predicate(R[i],s[max])) output(R[i],S[max]); for(i=1 to max) if(predicate(R[max],s[i])) output(R[max],S[i]); }

  13. Ripple Join Algorithm • Generalization of Nested loop join • Roles of Inner and Outer Relations continuosly interchange during processing • Cursor at S fixed at max=n whil cursor into R loops from 1 to n-1 • When cursor into R reaches value n,the cursor into S loops from 1 to n.

  14. Why call this "Ripple Join"? • The algorithm seems to ripple out from a corner of the join. • Acronym: "Rectangles of Increasing Perimeter Length"

  15. Iterators • Ripple joins can be developed using the iterator model. • 2 variants 1.Square Binary Ripple Join Iterator 2.Enhanced Ripple Join Iterator

  16. Square Binary Ripple Join Iterator

  17. Enhanced Ripple Join Iterator • Drawbacks of Square Ripple join iterator are handled 1)Non Unitary aspect ratios 2)Pipelining multiple ripple joins

  18. Ripple Join Variants • Block Ripple Join • Index Ripple Join • Hash Ripple Join

  19. Performance • Goal: Provide efficient and accurate estimation. • Join results should be returned in a such a way that aggregates are updated regularly and confidence intervals shrink rapidly. • Depends on 2 important things: 1.Estimators for SUM,COUNT and AVG 2.Confidence intervals

  20. Estimators for SUM,COUNT and AVG Consider a query of the form SELECT op(expression) from R,S WHERE predicate; At end of nth sampling step estimator of SUM(expression) is

  21. Estimators for SUM,COUNT and AVG • Rn and Sn :Set of Tuples that have been read at nth sampling step • R and S :Set of Relations • expressionP(r,s) equals expression(r,s) if (r,s) satisfies where clause and 0 otherwise • Count(*): expressionP(r,s) replaced by oneP(r,s) where oneP(r,s) =1 if (r,s) satisfies where clause and 0 otherwise. • AVG(expression)=SUM estimator/COUNT estimator.

  22. Confidence interval • A running confidence interval displays how close this answer is to the final result. • This could be calculated in many ways. • The authors present an example calculation built on extending the Central Limit Theorem.

  23. Ripple Joins • Ripple joins are designed to minimize the time until an acceptably precise estimate of the query result is available, as measured by the length of a confidence interval. • Ripple joins are adaptive, adjusting their behavior during processing in accordance with the statistical properties of the data.

  24. Ripple Joins • Ripple joins also permit the user to dynamically trade off the two key performance factors of online aggregation: the time between successive updates of the running aggregate, and the amount by which the confidence-interval length decreases at each update. • Ripple joins appear to be among the first database algorithms to use statistical information about the data not just to estimate selectivities and processing costs, but to estimate the quality of the result currently being displayed to the user and to dynamically adjust algorithm behavior accordingly.

More Related