
Evaluating Window Joins over Unbounded Streams



Presentation Transcript


  1. Evaluating Window Joins over Unbounded Streams Authors: Jaewoo Kang, Jeffrey F. Naughton, Stratis D. Viglas, University of Wisconsin-Madison CS Dept. Presenter: Yang Ying-Chia 楊應甲 (R01922018), CSIE, National Taiwan University

  2. Outline • Abstract • Background • Introduction • Related Work • Estimating the Cost of Moving Window Joins • On Maximizing the Efficiency of Processing Joins • Conclusion

  3. Abstract – Problem and Solution • Problem: Process joins over unbounded streams. • Solution: Moving Window Join • Queries have “window predicates”

  4. Abstract – Central Point of the Thesis • The paper proposes a unit-time-basis cost model for evaluating moving window joins. • Using this cost model, it proposes strategies for maximizing the efficiency of processing joins in different scenarios.

  5. Abstract • Background • Introduction • Related Work • Estimating the Cost of Moving Window Joins • On Maximizing the Efficiency of Processing Joins • Conclusion

  6. Background • Join • Nested Loops Join (NLJ) • Hash Join (HJ) • Moving Window Join

  7. Background – Join

  8. Background – Nested Loops Join (NLJ)

  9. Background – Hash Join (HJ)

  10. Background – Moving Window Join

  11. Background – Moving Window Join • Instead of saying we want to join all tuples of A and B, we say we want to join all tuples that have arrived on A in the last t1 seconds with all tuples that have arrived on B in the last t2 seconds.
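The window semantics on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `window_join`, the `(timestamp, stream, value)` event layout, and the choice of a nested-loops probe are all assumptions made for the example.

```python
from collections import deque


def window_join(events, t1, t2, predicate):
    """Join streams A and B under time windows t1 (for A) and t2 (for B).

    events: iterable of (timestamp, stream, value) sorted by timestamp,
            where stream is "A" or "B" (a hypothetical layout).
    predicate: function (a_value, b_value) -> bool, the join condition.
    """
    window_a = deque()  # (timestamp, value) pairs still inside A's window
    window_b = deque()  # (timestamp, value) pairs still inside B's window
    results = []
    for ts, stream, value in events:
        # Invalidate: drop tuples that have aged out of their window.
        while window_a and window_a[0][0] < ts - t1:
            window_a.popleft()
        while window_b and window_b[0][0] < ts - t2:
            window_b.popleft()
        if stream == "A":
            # Probe B's window with the new A tuple, then insert it into A's window.
            results.extend((value, b) for _, b in window_b if predicate(value, b))
            window_a.append((ts, value))
        else:
            # Probe A's window with the new B tuple, then insert it into B's window.
            results.extend((a, value) for _, a in window_a if predicate(a, value))
            window_b.append((ts, value))
    return results
```

Each arrival triggers the three operations that the later cost slides refer to: invalidate, probe, and insert. A tuple that has left its window can no longer produce results, which is what keeps the state bounded on an unbounded stream.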

  12. Abstract • Background • Introduction • Related Work • Estimating the Cost of Moving Window Joins • On Maximizing the Efficiency of Processing Joins • Conclusion

  13. Introduction – Questions • How can we measure the efficiency of a moving window join evaluation strategy, since the traditional metric of execution time to completion does not apply? • Can an algorithm for a moving window join take advantage of asymmetries in the rates of the input streams? • How can we deal with cases in which an input stream is so fast that the system cannot keep up? • If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?

  14. Introduction – The Three Scenarios • One stream is much faster than the other. • System resources are insufficient to keep up with the input streams. • Memory is limited.

  15. Abstract • Background • Introduction • Related Work • Estimating the Cost of Moving Window Joins • On Maximizing the Efficiency of Processing Joins • Conclusion

  16. Related Work • Predicate grouping and group optimization techniques • Adaptive query processing and query scrambling • Symmetric hash join and symmetric nested loops join • Diag-Join for data warehouse environments • Rate-based streaming query optimization framework

  17. Abstract • Background • Introduction • Related Work • Estimating the Cost of Moving Window Joins • On Maximizing the Efficiency of Processing Joins • Conclusion

  18. Estimating the Cost of Moving Window Joins • Cost model • Cost of a single join operation

  19. Cost of Nested Loops Join A to B • Number of tuples accessed to search for matches in window B • Number of tuples inserted and invalidated • Cost of accessing a single tuple • Number of tuples accessed in a time unit
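The formula itself is not in the transcript, so the following is only one plausible reading of the four labels on this slide, written as a Python sketch. The function name, the parameter names, and the assumptions that every arriving A tuple scans all of window B and that each B tuple costs one insert plus one invalidation are mine, not the slide's.

```python
def nlj_one_way_cost(rate_a, rate_b, window_b_secs, cost_per_tuple):
    """Per-time-unit cost of a nested-loops join from A into window B (assumed model).

    rate_a, rate_b: arrival rates of the two streams, in tuples per time unit.
    window_b_secs:  length of B's time window, in time units.
    cost_per_tuple: cost of accessing a single tuple.
    """
    window_b_size = rate_b * window_b_secs   # tuples resident in window B
    probe_accesses = rate_a * window_b_size  # every arriving A tuple scans window B
    maintenance_accesses = 2 * rate_b        # one insert and one invalidation per B tuple
    tuples_accessed = probe_accesses + maintenance_accesses
    return cost_per_tuple * tuples_accessed
```

Under this reading the cost is the product of two factors, the number of tuples accessed in a time unit and the cost of a single access, which matches the way the slide groups its labels.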

  20. Cost of Hash Join A to B • Cost of accessing a single tuple in a specific hash table implementation • The cost of probe(b) and invalidate(b) is a function of the hash bucket size in window B

  21. Cost of Full Join • Symmetric Join • HHJ, NNJ

  22. Cost of Full Join • Asymmetric Join • HNJ

  23. Cost Curves for Full Joins • σa = 1/|A| = 1/Nkey(A) • σb = 1/|B| = 1/Nkey(B)

  24. Observations from the Previous Graphs • When the speed difference between the input streams is minimal, HJ outperforms every other join combination. • As the speed gap increases, the cost of HJ increases considerably and exceeds that of HNJ at around 70 tuples/sec and 140 tuples/sec. • These are the performance crossover points.

  25. Estimating the Weight Factors • The crossover points can be calculated by equating the two cost formulas. • For two given streams, we can determine when NLJ will outperform HJ, depending on the ratio of the arrival rates of the input streams. …
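Equating two cost formulas amounts to finding the rate at which their difference changes sign. The sketch below does this numerically by bisection. It is generic: `find_crossover` is a hypothetical helper, and the caller supplies the two cost functions, since the actual formulas are not in the transcript.

```python
def find_crossover(cost_x, cost_y, lo, hi, tol=1e-9):
    """Return a rate r in [lo, hi] where cost_x(r) equals cost_y(r).

    cost_x, cost_y: functions mapping an arrival rate to a cost.
    Requires the difference cost_x - cost_y to change sign on [lo, hi].
    """
    def diff(r):
        return cost_x(r) - cost_y(r)

    if diff(lo) * diff(hi) > 0:
        raise ValueError("cost curves do not cross in the given interval")
    # Bisection: keep the half-interval in which the sign change lies.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if diff(lo) * diff(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```

As a usage example with made-up linear costs, `find_crossover(lambda r: 2 * r + 10, lambda r: 3 * r, 0, 100)` converges to a rate of 10, below which the second strategy is cheaper and above which the first one is.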

  26. Abstract • Background • Introduction • Related Work • Estimating the Cost of Moving Window Joins • On Maximizing the Efficiency of Processing Joins • Conclusion

  27. Recall the three scenarios • One stream is much faster than the other. • System resources are insufficient to keep up with the input streams. • Memory is limited.

  28. Exploiting Asymmetry in Input Stream Speeds • Assumptions: • The two time windows are fixed. • The aggregate speed of the two streams is less than the system’s service rate μ (i.e., λa + λb < μ). • The following inequality determines the likely winner between NLJ and HJ: • If the inequality holds, NLJ will outperform HJ; otherwise, HJ outperforms NLJ.

  29. Graphs to Prove the Previous Hypothesis

  30. Observation from the Previous Graphs • HHJ costs the least until the input rate reaches about 70 tuples/sec; then HNJ takes over. Hence, either HHJ or HNJ is the winner. • Both hash join output rates decrease drastically after passing their thrashing point.

  31. Maximizing the Number of Result Tuples with Limited Computing Resources • This scenario arises under the following conditions: • The system evaluates very expensive predicates. • The aggregate speed of the input streams exceeds the join operator’s service rate, i.e., λa + λb > μ. • Hence, not all answer tuples can be generated and the input streams need to be “regulated”. • But under what policy?

  32. Performance Comparison between Policies • The winner is the equal distribution strategy! • Regardless of time window sizes and window selectivity factors.
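A short numerical check suggests why the equal split can win independently of the windows and the selectivity. It assumes a steady-state output-rate estimate of σ · λa · λb · (Ta + Tb), which is my assumption and not stated on the slides, and it assumes both raw streams are fast enough to supply any admitted rate up to μ. The function names are hypothetical.

```python
def output_rate(rate_a, rate_b, window_a_secs, window_b_secs, selectivity):
    """Assumed steady-state result rate of a window join.

    Each A arrival matches selectivity * rate_b * window_b_secs tuples in window B,
    and symmetrically for B arrivals, which sums to the product form below.
    """
    return selectivity * rate_a * rate_b * (window_a_secs + window_b_secs)


def best_split(service_rate, window_a_secs, window_b_secs, selectivity, steps=100):
    """Return the fraction of the service rate to give stream A (grid search)."""
    def rate_for(i):
        admitted_a = service_rate * i / steps
        admitted_b = service_rate * (steps - i) / steps
        return output_rate(admitted_a, admitted_b,
                           window_a_secs, window_b_secs, selectivity)

    best_i = max(range(steps + 1), key=rate_for)
    return best_i / steps
```

Under this model the windows and the selectivity only scale the product λa · λb, and a product with a fixed sum is largest when the two factors are equal, so the search returns 0.5 for any window sizes and selectivity.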

  33. Maximizing the Number of Result Tuples with Limited Memory • Assumptions: • The two time window sizes can be adjusted to fully utilize available memory. • The two arrival rates are constant. • Hence, a memory allocation strategy is necessary. But under what policy? Will equal distribution win again?

  34. Performance Comparison between Policies • The winner is the Max A strategy, which allocates all memory to the slower stream. • Keep the slower stream in memory and let the faster one probe against it and pass by.
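The same kind of check illustrates the memory result. The sketch assumes the result rate is σ · (λa · |B| + λb · |A|), where |A| and |B| are the numbers of tuples held in each window and |A| + |B| equals the available memory. Both the model and the function names are assumptions for illustration.

```python
def output_rate_for_allocation(mem_a, total_memory, rate_a, rate_b, selectivity):
    """Assumed result rate when window A holds mem_a tuples and B holds the rest."""
    mem_b = total_memory - mem_a
    # A arrivals probe window B; B arrivals probe window A.
    return selectivity * (rate_a * mem_b + rate_b * mem_a)


def best_allocation(total_memory, rate_a, rate_b, selectivity):
    """Return the number of tuple slots to give window A (exhaustive search)."""
    return max(
        range(total_memory + 1),
        key=lambda mem_a: output_rate_for_allocation(
            mem_a, total_memory, rate_a, rate_b, selectivity
        ),
    )
```

Because the assumed rate is linear in the allocation, the optimum sits at an extreme: every slot given to window A is probed at rate λb, so when B is the faster stream all memory goes to A, the slower one. With `rate_a=10`, `rate_b=100`, and 1000 slots, the search returns 1000.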

  35. Maximizing the Number of Result Tuples with Limited Memory • Another assumption: • Variable time windows • Variable arrival rates

  36. Performance Comparison between Policies • The best policy is either to maximize stream A’s time window together with B’s arrival rate or, alternatively, to maximize B’s time window together with A’s arrival rate.

  37. Abstract • Background • Introduction • Related Work • Estimating the Cost of Moving Window Joins • On Maximizing the Efficiency of Processing Joins • Conclusion

  38. Conclusion • A unit-time-basis model for analyzing the expected performance of moving window joins is introduced. • The proposed cost model divides the join cost into two independent terms, each corresponding to one of the two join directions. • This work can be extended to a cost model that goes beyond single joins to full query plans. • Other algorithms apart from NLJ and HJ can be modeled and evaluated.

  39. The End Thanks for your attention 
