Evaluating Window Joins Over Unbounded Streams

By Nishant Mehta and Abhishek Kumar

Problem
• Processing joins over unbounded streams.
• Solution: moving window joins. Queries have window predicates: given two streams R and S, we are only interested in R tuples that have arrived in the last t1 seconds and S tuples that have arrived in the last t2 seconds.

Moving Window Join
• λa: arrival rate of stream A; λb: arrival rate of stream B
• Ta: time window size of stream A; Tb: time window size of stream B

Central Point
The paper proposes a cost model for the evaluation of moving window joins. Using this cost model, it then proposes strategies for maximizing the efficiency of join processing in different scenarios.

Background: Implementation Strategies for Joins
Example: R ⋈ S on the condition a = b
• Nested Loop Join (brute force): for each record t in R, retrieve every record s from S and test whether the two satisfy the condition t[a] = s[b].
• Hash Join:
• Inputs: a build input (the smaller one) and a probe input.
• Scan the build input and generate a hash table using a hash function on attribute "a".
• For each probe row, compute the hash key's value, scan the corresponding hash bucket, and produce the matches.
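To make the two strategies concrete, here is a minimal Python sketch of both on small in-memory lists. The function names, the dictionary record layout, and the sample data are assumptions made for illustration; they do not come from the paper.

```python
from collections import defaultdict


def nested_loop_join(R, S, a, b):
    """Brute force: compare every record of R with every record of S."""
    return [(t, s) for t in R for s in S if t[a] == s[b]]


def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on the build input, then probe it row by row."""
    table = defaultdict(list)
    for row in build:
        table[row[build_key]].append(row)
    matches = []
    for row in probe:
        # Only the bucket for this key is scanned, not the whole build input.
        for candidate in table.get(row[probe_key], []):
            matches.append((candidate, row))
    return matches


# Hypothetical sample data: R joins S on R.a = S.b.
R = [{"a": 1, "x": "r1"}, {"a": 2, "x": "r2"}]
S = [{"b": 2, "y": "s1"}, {"b": 2, "y": "s2"}, {"b": 3, "y": "s3"}]

nlj_result = nested_loop_join(R, S, "a", "b")
hj_result = hash_join(R, S, "a", "b")
```

Both functions return the same pairs here; the difference lies in how many records each probe has to touch.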

Algorithm: Moving Window Join (NLJ)
For each arrival of a new tuple from stream A:
• Scan stream B's window to find any matching tuples and propagate them to the result.
• Insert the new tuple into stream A's window.
• Invalidate all expired tuples in stream A's window; these are just the tuples whose timestamps now fall outside the current time window.
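The three steps above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the `Window` class, the `on_arrival` function, and the tuple layout `(timestamp, key, payload)` are all hypothetical. One added assumption is that the scan skips tuples in the opposite window that have already aged out, since that window is only cleaned up when its own stream delivers a tuple.

```python
from collections import deque


class Window:
    """Time-based window holding (timestamp, key, payload) in arrival order."""

    def __init__(self, size):
        self.size = size          # window length in time units (Ta or Tb)
        self.tuples = deque()

    def expire(self, now):
        # A tuple is valid while now - timestamp < size.
        while self.tuples and self.tuples[0][0] <= now - self.size:
            self.tuples.popleft()


def on_arrival(own, other, now, key, payload):
    """Handle one arrival on the stream that owns `own`; return the matches."""
    # 1. Scan the opposite window (nested loop) for matching tuples.
    #    Assumption: stale tuples not yet invalidated are skipped here.
    results = [(payload, p) for (ts, k, p) in other.tuples
               if k == key and now - ts < other.size]
    # 2. Insert the new tuple into the stream's own window.
    own.tuples.append((now, key, payload))
    # 3. Invalidate expired tuples in the stream's own window.
    own.expire(now)
    return results


win_a, win_b = Window(size=5), Window(size=3)
out0 = on_arrival(win_b, win_a, 0, "k", "b0")   # nothing in A's window yet
out1 = on_arrival(win_a, win_b, 1, "k", "a1")   # b0 is still inside B's window
out4 = on_arrival(win_a, win_b, 4, "k", "a4")   # b0 has aged out of B's window
out5 = on_arrival(win_b, win_a, 5, "k", "b5")   # a1 and a4 are both in A's window
```

A deque keeps expiry cheap because tuples arrive in timestamp order, so expired tuples are always at the front.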

Questions
• How can we measure the efficiency of a moving window join evaluation strategy, since the traditional metric of execution time to completion does not apply?
• Can an algorithm for a moving window join take advantage of asymmetries in the rates of the input streams?
• How can we deal with cases in which an input stream is so fast that the system cannot keep up?
• If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?

Cost Model
Evaluating window joins in 3 scenarios:
• One stream is faster than the other. The question is whether we can exploit this to optimize the performance of the join algorithm.
• Resources are insufficient to keep up with the speed of the input streams, i.e. the service rate is slower than the arrival rate.
• Memory is the constraining resource. The problem is the following: given a fixed amount of memory and flexible window times, how can we adjust the window sizes so that the number of tuples produced is maximized?

Cost Model
A traditional cardinality-based cost model cannot produce useful cost estimates for streams, because the algorithm may never complete. We need something that measures the rate at which output is generated, and we then optimize the algorithm to maximize this rate. This is called rate-based query optimization.

Cost Model
A cost formula for computing the window join:
• Each arrival in A's window triggers three tasks (probe, insert, and invalidate); the same holds for an arrival in B's window.
• The same formula is used henceforth to evaluate the cost of different join implementations; only the parameters change. For NLJ, for example, probe(b) becomes B · Cn, where Cn is the cost of accessing a single tuple and B is the number of tuples in B's window.

Cost Model
The cost of a single join operation can be divided into 2 independent components, one for each input stream. The component for window B is the unit cost of joining A tuples to B tuples plus the cost of inserting tuples into B and invalidating tuples in B; in other words, the aggregate cost of accessing window B in a single time unit.
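The formulas on these slides were not preserved in the transcript. One way to write down the decomposition described in the text is sketched below. The subscript notation is an assumption, and so is the reading that insertions into and invalidations from B's window both happen at B's arrival rate in steady state.

```latex
% Full join cost per unit time: one independent term per join direction.
C_{A \bowtie B} = C_{A \to B} + C_{B \to A}

% Aggregate cost of accessing window B in one time unit:
% A arrivals probe B; B arrivals are inserted into and later invalidated from B.
C_{A \to B} = \lambda_a \cdot \mathrm{probe}(b)
            + \lambda_b \cdot \mathrm{insert}(b)
            + \lambda_b \cdot \mathrm{invalidate}(b)
```

The term for the other direction follows by swapping the roles of A and B.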

Related Work
• Query Scrambling
• Adaptive Query Processing
• Streaming Algorithms for Hash Join (SHJ and XJoin)
• Diag-Join (data warehouse environment): most warehouse joins are performed on foreign keys, and matching tuples are likely to have been created physically close together in time.
• Babu and Widom: proposed an architecture for a general-purpose stream data management system and identified research problems in continuous query processing over streams.

Cost of Nested Loops Join A to B
For nested loop joins:
• Cost of the nested loop join = cost of accessing one tuple × number of tuples accessed in unit time
• Number of tuples accessed while probing = λa · B = λa · Tb · λb = λa · probe(b)
• Cost of insertion = 1 tuple access, i.e. only the inserted tuple
• Cost of invalidation = 1 tuple access on average
• Cost of a single tuple access = Cn
Putting it all together, we get the cost formula for NLJ.
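The components listed above can be combined in a short Python sketch. The function names and the example numbers are hypothetical. The sketch assumes that one insertion and one invalidation are charged per B arrival, which is one reading of the slide, not a formula quoted from the paper.

```python
def nlj_directional_cost(lam_a, lam_b, T_b, C_n):
    """Unit-time cost of accessing window B when A tuples probe it with NLJ.

    Assumption: insert and invalidate each touch one tuple per B arrival.
    """
    window_b = lam_b * T_b               # B: number of tuples in B's window
    probe_accesses = lam_a * window_b    # every A arrival scans all of B's window
    maintenance_accesses = 2 * lam_b     # 1 insert + 1 invalidation per B arrival
    return C_n * (probe_accesses + maintenance_accesses)


def nnj_cost(lam_a, lam_b, T_a, T_b, C_n):
    """Symmetric nested loop join: sum of the two independent directions."""
    return (nlj_directional_cost(lam_a, lam_b, T_b, C_n)
            + nlj_directional_cost(lam_b, lam_a, T_a, C_n))


# Hypothetical numbers: 10 and 70 tuples/sec, 10-second windows, unit access cost.
cost_a_to_b = nlj_directional_cost(lam_a=10, lam_b=70, T_b=10, C_n=1)
cost_full = nnj_cost(lam_a=10, lam_b=70, T_a=10, T_b=10, C_n=1)
```

The probe term grows with the product of the two arrival rates, so it dominates the maintenance term for all but very small windows.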

Cost of Hash Join A to B
If Hash Join (HJ) is used:
• The cost of probe(b) and invalidate(b) is a function of the hash bucket size in window B.
• A typical probe requires 1 key hashing plus one key comparison for each tuple in the bucket.
• Number of tuples in a hash bucket in window B = λb · Tb / |B|, where |B| is the number of hash buckets.
Again, we put everything together and get the cost formula for HJ.
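The bucket-size expression and the probe description above can be turned into a small Python sketch. The function names and the cost constants `C_hash` and `C_cmp` are hypothetical, introduced only for illustration; the slide's own HJ formula is not preserved in the transcript, so this is not a reconstruction of it.

```python
def expected_bucket_size(lam_b, T_b, num_buckets):
    """Average number of tuples per hash bucket in window B."""
    return lam_b * T_b / num_buckets


def hash_probe_cost(lam_b, T_b, num_buckets, C_hash, C_cmp):
    """Cost of one probe: hash the key once, then compare against each
    tuple in the bucket (C_hash and C_cmp are assumed unit costs)."""
    return C_hash + expected_bucket_size(lam_b, T_b, num_buckets) * C_cmp


# Hypothetical numbers: 70 tuples/sec, 10-second window, 100 buckets.
bucket = expected_bucket_size(lam_b=70, T_b=10, num_buckets=100)
probe = hash_probe_cost(lam_b=70, T_b=10, num_buckets=100, C_hash=2, C_cmp=1)
```

With these numbers a probe touches about 7 tuples rather than the 700 a nested loop scan of the same window would touch.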

Cost of Full Joins
Full joins are categorized into two types:
• Symmetric joins: the same join mechanism is used from A to B as well as from B to A, e.g. HHJ (hash joins from both sides) and NNJ (nested loop joins from both sides).
• Asymmetric joins: a combination of HJ and NLJ is used, for example HNJ (nested loop join from A to B and hash join from B to A).
Some more formulas follow.

Cost Curves for Full Joins
[Figure not preserved: cost curves for full joins.] What results do we see in these graphs?

Observations from the Previous Graphs
• When the difference between the input stream rates is minimal, HJ outperforms every other join mechanism.
• As the difference increases, the cost of HJ increases considerably and exceeds that of HNJ.
• At about 70 tuples/sec (graph 1) and 140 tuples/sec (graph 2), we have a performance crossover point.

Determining Crossover Points
• In graph 1 we saw that the crossover point was 70 tuples/sec, which is roughly where input stream B is 7 times faster than stream A.
• To calculate crossover points accurately, we use the formulas obtained previously to derive an equation for the crossover point.
• How is this equation useful? For a given pair of streams, we can determine when NLJ will outperform HJ from the ratio of the arrival rates of the input streams.

Maximizing Efficiency of Processing Joins
The following 3 scenarios are considered:
• One stream is much faster than the other.
• Computing resources are insufficient to keep up with the speed of the input streams.
• Memory resources are limited.

Exploiting Asymmetry in Input Stream Speeds
Some assumptions:
• The two time windows are fixed.
• The aggregate speed of the two streams is less than the system's service rate μ (λa + λb < μ).
An inequality derived from the cost model determines the likely winner between NLJ and HJ: if the inequality holds, NLJ will outperform HJ; otherwise HJ will outperform NLJ.

Graphs Supporting the Previous Hypothesis
What observations can we make from these graphs?
• As the mismatch between the input rates increases, the performance of HHJ degrades before that of HNJ.
• After the thrashing point is reached, the performance degradation of HNJ is less severe than that of HHJ.

Maximizing the Number of Result Tuples with Limited Computing Resources
This scenario arises in the following cases:
• Evaluation of expensive predicates.
• The input streams' speed is faster than the join operator's service rate (λa + λb > μ).
Consequences:
• Not all result tuples can be generated, or else the system falls behind.
• The streams need to be "regulated" by dropping some tuples.
But what policy should be adopted when regulating the streams? There are 3 basic choices:
1) Proportional to input rates.
2) Proportional to window size.
3) Equal distribution.

We Have a Winner!
• The equal distribution strategy is the winner in this case.
• Mathematical analysis of the cost model proposed in the paper also confirms this result.
• The maximum number of output tuples is generated when the ratio of the two input stream rates is equal to 1.
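A small Python sketch can illustrate why an even split tends to win. It rests on simplifying assumptions that are not taken from the paper: the resource limit is modeled as a fixed budget on the sum of the admitted rates, both streams arrive faster than half that budget, and the output rate of a window join is taken as selectivity × ra × rb × (Ta + Tb). All function names, policy names, and numbers are hypothetical.

```python
def admitted_rates(policy, lam_a, lam_b, T_a, T_b, budget):
    """Split a total admitted-rate budget between streams A and B."""
    if policy == "proportional_to_rate":
        share_a = lam_a / (lam_a + lam_b)
    elif policy == "proportional_to_window":
        share_a = T_a / (T_a + T_b)
    elif policy == "equal":
        share_a = 0.5
    else:
        raise ValueError(f"unknown policy: {policy}")
    return budget * share_a, budget * (1 - share_a)


def output_rate(r_a, r_b, T_a, T_b, selectivity):
    """Result tuples per time unit for admitted rates r_a and r_b."""
    # Each A arrival matches against r_b * T_b tuples, and vice versa.
    return selectivity * r_a * r_b * (T_a + T_b)


# Hypothetical workload: arrivals exceed the budget of 100 tuples/sec.
params = dict(lam_a=300, lam_b=100, T_a=10, T_b=30)
rates = {}
for policy in ("proportional_to_rate", "proportional_to_window", "equal"):
    r_a, r_b = admitted_rates(policy, budget=100, **params)
    rates[policy] = output_rate(r_a, r_b, params["T_a"], params["T_b"], 0.01)

best_policy = max(rates, key=rates.get)
```

Under this model the output rate depends on the product of the admitted rates, and a product with a fixed sum is largest when the two factors are equal.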

Maximizing the Number of Result Tuples with Limited Memory
Assumptions:
• We have a variable time window.
• The arrival rate is constant.
• Memory is a constraint, hence memory allocation strategies are needed.
What are the different ways in which we can allocate memory to the streams?
• All to one: we allocate all memory to one stream, either the slower one or the faster one.
• Proportional to the arrival rate, either directly or inversely.
• Equal distribution (our winner in the last case).
• Will equal distribution win again?

A New Winner!
• The Max A strategy, which allocates all memory to the slower stream, is the clear winner.
• In this strategy, we keep the slower stream in memory and let the faster one probe against it and pass by, thus maximizing the number of result tuples.
• Mathematical analysis of the cost model confirms this result.
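The intuition can be checked with a small Python sketch under a simplified model that is an assumption here, not the paper's analysis. Memory is counted in tuples, a window holding m tuples of a stream with rate λ spans m / λ time units, and the output rate is selectivity × λa × λb × (Ta + Tb). The function name and numbers are hypothetical.

```python
def output_rate_with_memory(lam_a, lam_b, mem_a, mem_b, selectivity):
    """Result tuples per time unit when windows are bounded by memory.

    Assumption: a window with room for m tuples covers m / rate time units.
    """
    T_a = mem_a / lam_a
    T_b = mem_b / lam_b
    return selectivity * lam_a * lam_b * (T_a + T_b)


# Hypothetical workload: A is the slower stream, 1000 tuples of memory in total.
lam_a, lam_b, total_mem, sel = 10, 100, 1000, 0.01

max_a = output_rate_with_memory(lam_a, lam_b, total_mem, 0, sel)      # all to slower
max_b = output_rate_with_memory(lam_a, lam_b, 0, total_mem, sel)      # all to faster
equal = output_rate_with_memory(lam_a, lam_b, total_mem / 2, total_mem / 2, sel)
```

In this model the output rate simplifies to selectivity × (λb · mem_a + λa · mem_b), which is linear in the allocation, so every tuple of memory is worth more in the slower stream's window because the faster stream probes it more often.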

Conclusions and Future Work
• A unit-time-basis model for analyzing the expected performance of moving window joins is introduced.
• The proposed cost model divides the join cost into two independent terms, each corresponding to one of the two join directions.
• This work can be extended to a cost model that goes beyond single joins and covers full query plans.
• Other algorithms apart from NLJ and HJ can be modeled and evaluated.
