Providing Resiliency to Load Variations in Distributed Stream Processing. Ying Xing, Jeong-Hyon Hwang, Ugur Cetintemel, Stan Zdonik. Brown University. Financial Data Streams. Surveillance. Network Monitoring. Click Stream Analysis. Traffic Monitoring. Sensor Network. Stream Processing.
Providing Resiliency to Load Variations in Distributed Stream Processing • Ying Xing, Jeong-Hyon Hwang, Ugur Cetintemel, Stan Zdonik • Brown University
Stream Processing • Financial Data Streams • Surveillance • Network Monitoring • Click Stream Analysis • Traffic Monitoring • Sensor Network • Monitoring Apps
Roadmap • Problem Statement • Linear Load Model • Feasible Set • The Algorithm • Extensions • Lower Bound of Input Rates • Non-linear Load Model • Network Bandwidth / Communication Overhead • Experimental Results • Related Work • Conclusions
Problem Statement • Goal • Find an operator distribution with the largest feasible set size [Figure: each operator distribution determines a feasible set in the input rate space (axes r1, r2), which separates feasible from infeasible rate points]
Linear Load Model • rj - input rate of input j (tuples/sec) • ck - processing cost of operator ok (CPU cycles/tuple) • l(ok) - the processing load of operator ok (CPU cycles/sec) • sk - selectivity of operator ok ([# of output tuples] / [# of input tuples]) [Figure: example query graph with operators o1, o2, o3, o4]
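The definitions above can be combined in a small sketch. It assumes a simple chain of operators fed by a single input, so the rate reaching each operator is the input rate multiplied by the selectivities of the operators upstream of it. The function names and the sample numbers are illustrative and do not come from the slides.

```python
def chain_load_coefficients(costs, selectivities):
    """Per-operator load coefficients for a chain o1 -> o2 -> ...

    Assumed topology: one system input feeds the first operator, and each
    operator feeds the next. The load of operator k is then
    coefficient[k] * r, where r is the input rate in tuples/sec.
    """
    coefficients = []
    rate_factor = 1.0  # fraction of the input rate that reaches the operator
    for cost, selectivity in zip(costs, selectivities):
        coefficients.append(cost * rate_factor)
        rate_factor *= selectivity
    return coefficients


def operator_loads(coefficients, input_rate):
    """Load of each operator (CPU cycles/sec) at the given input rate."""
    return [coefficient * input_rate for coefficient in coefficients]


# Hypothetical numbers: o1 costs 4 cycles/tuple and passes half its input,
# o2 costs 6 cycles/tuple.
coeffs = chain_load_coefficients(costs=[4.0, 6.0], selectivities=[0.5, 1.0])
loads = operator_loads(coeffs, input_rate=100.0)
```

Because every load is a constant times the input rate, the load of a node is also linear in the input rates, which is what makes the feasible set a region bounded by hyperplanes.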
Example Feasible Sets [Figure: feasible sets in the (r1, r2) plane for different placements of operators o1, o2, o3, o4]
"Ideal" Feasible Set • Theorem 1. The feasible set is maximized when the load coefficients of each input are perfectly balanced over all nodes (relative to their capacities) [Figure: ideal feasible set in the (r1, r2) plane for operators o1, o2, o3, o4]
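A minimal sketch of what Theorem 1 implies, under assumptions that are not in the slides: two nodes of equal capacity, four operators with made-up load coefficients, and a coarse grid over the rate space used to approximate feasible set size. A rate point counts as feasible when no node's linear load exceeds its capacity.

```python
def is_feasible(placement, coefficients, capacities, rates):
    """Return True if no node is overloaded at the given input rates.

    placement[k]:       index of the node hosting operator k
    coefficients[k][j]: load coefficient of operator k for input j
    capacities[i]:      CPU capacity of node i (cycles/sec)
    """
    loads = [0.0] * len(capacities)
    for k, node in enumerate(placement):
        loads[node] += sum(c * r for c, r in zip(coefficients[k], rates))
    return all(load <= cap for load, cap in zip(loads, capacities))


def feasible_fraction(placement, coefficients, capacities, max_rate, steps=20):
    """Share of a (steps + 1)^2 grid over [0, max_rate]^2 that is feasible."""
    feasible = 0
    for i in range(steps + 1):
        for j in range(steps + 1):
            rates = (max_rate * i / steps, max_rate * j / steps)
            if is_feasible(placement, coefficients, capacities, rates):
                feasible += 1
    return feasible / (steps + 1) ** 2


# Hypothetical setup: o1 and o2 depend only on r1, o3 and o4 only on r2.
coefficients = [(2.0, 0.0), (2.0, 0.0), (0.0, 2.0), (0.0, 2.0)]
capacities = [4.0, 4.0]
balanced = [0, 1, 0, 1]    # each node gets half of every input's coefficient
unbalanced = [0, 0, 1, 1]  # each node carries one input entirely

balanced_share = feasible_fraction(balanced, coefficients, capacities, 2.0)
unbalanced_share = feasible_fraction(unbalanced, coefficients, capacities, 2.0)
```

In this toy setup the balanced placement yields the triangle r1 + r2 <= 2, while the unbalanced one yields the smaller square r1 <= 1, r2 <= 1, so `balanced_share` comes out larger than `unbalanced_share`.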
Resilient Operator Distribution Algorithm • Compute the Ideal Feasible Set • Sort Operators based on Load Coefficients • For each operator, determine the destination server [Figure: ideal feasible set in the (r1, r2) plane]
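The three steps above can be sketched as a simple greedy procedure. This is a simplified illustration, not the exact rule from the slides: the slide does not say how the destination server is chosen, so the sketch assumes operators are sorted by the norm of their load coefficient vector and each one goes to the node whose capacity boundary stays farthest from the origin after the assignment. All function names are hypothetical.

```python
import math


def norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def ideal_node_coefficients(coefficients, capacities):
    """Step 1: share each input's total coefficient in proportion to capacity."""
    total_capacity = sum(capacities)
    totals = [sum(column) for column in zip(*coefficients)]
    return [[t * cap / total_capacity for t in totals] for cap in capacities]


def greedy_placement(coefficients, capacities):
    """Steps 2 and 3 under the assumed selection rule described above."""
    node_coeffs = [[0.0] * len(coefficients[0]) for _ in capacities]
    placement = [None] * len(coefficients)
    # Step 2: largest operators first (assumed ordering: descending norm).
    order = sorted(range(len(coefficients)),
                   key=lambda k: norm(coefficients[k]), reverse=True)
    # Step 3: pick the node whose boundary stays farthest from the origin.
    for k in order:
        best_node, best_distance = 0, -1.0
        for i, cap in enumerate(capacities):
            candidate = [a + b for a, b in zip(node_coeffs[i], coefficients[k])]
            size = norm(candidate)
            distance = cap / size if size > 0 else math.inf
            if distance > best_distance:
                best_node, best_distance = i, distance
        placement[k] = best_node
        node_coeffs[best_node] = [
            a + b for a, b in zip(node_coeffs[best_node], coefficients[k])
        ]
    return placement, node_coeffs


# Hypothetical input: four operators, two inputs, two equal nodes.
coefficients = [(2.0, 0.0), (2.0, 0.0), (0.0, 2.0), (0.0, 2.0)]
capacities = [4.0, 4.0]
ideal = ideal_node_coefficients(coefficients, capacities)
placement, node_coeffs = greedy_placement(coefficients, capacities)
```

For this input the greedy choice reaches the ideal split, with each node carrying coefficient 2 for both inputs. In general a greedy pass is not guaranteed to hit the ideal feasible set exactly.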
Result: R.O.D. vs. Load Balancing • 10 nodes • 5 input streams
Extension: Network Bandwidth & Comm. Overhead • Network Bandwidth • Comm. Overhead
Extension: Nonlinear Load Model • Add an artificial variable [Figure: query graph with inputs r1, r2 and operators o1 … ou, om, ou+1 …, shown before and after the artificial variable is added]
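One plausible reading of the artificial-variable idea, given as a hedged illustration: suppose the load of operator om depends on the product of two input rates, as it would for a windowed join. The join, the window factor w, and the variable name r3 are assumptions and do not appear in the slides. Renaming the product as a new variable makes every load linear again, at the price of one extra dimension in the rate space.

```latex
% Assumed nonlinear load of operator o_m (e.g. a windowed join):
l(o_m) = c_m \, w \, r_1 r_2

% Introduce the artificial variable
r_3 = w \, r_1 r_2

% so that the load is linear in (r_1, r_2, r_3):
l(o_m) = c_m \, r_3

% A downstream operator o_{u+1} fed by o_m is then linear in r_3 as well:
l(o_{u+1}) = c_{u+1} \, s_m \, r_3
```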
Extension: Lower Bound of Input Rates • Use the lower bound instead of the origin [Figure: feasible sets in the (r1, r2) plane measured from the origin and from the lower bound]
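A minimal sketch of the shift from the origin to a lower bound, assuming the quantity of interest is the distance from a reference point to a node's capacity boundary (the hyperplane where the node's linear load equals its capacity). The function name and the numbers are hypothetical.

```python
import math


def boundary_distance(node_coeffs, capacity, lower_bound=None):
    """Distance from a reference point to the hyperplane sum(L_j * r_j) = C.

    With lower_bound=None the reference point is the origin. Otherwise it is
    the vector of known minimum input rates.
    """
    if lower_bound is None:
        lower_bound = [0.0] * len(node_coeffs)
    size = math.sqrt(sum(c * c for c in node_coeffs))
    if size == 0:
        return math.inf  # the node hosts no load, so it can never overload
    # Capacity left once the guaranteed minimum load is accounted for.
    slack = capacity - sum(c * b for c, b in zip(node_coeffs, lower_bound))
    return slack / size


from_origin = boundary_distance([3.0, 4.0], capacity=10.0)
from_bound = boundary_distance([3.0, 4.0], capacity=10.0, lower_bound=[1.0, 1.0])
```

Here the boundary is 2.0 away from the origin but only 0.6 away from the lower bound (1, 1), since the minimum rates already consume 7 of the 10 units of capacity.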
Related Work • Traditional Distributed Systems • Load balancing and load sharing [Shivaratri92] [Diekmann97] • Parallel query processing [DeWitt92] • Graph partitioning [Walshaw97] [Schloegel00] • Stream Processing Systems • Load management • Flux [Shah03] – data partitioning based parallel continuous query processing • Medusa [Balazinska04] – federated distributed stream processing
Conclusion • Distributed Stream Processing • Resilient Operator Distribution • Maximize feasible set size • Performance • Much better than conventional load distribution algorithms
Computation Complexity • Computation time is determined by • n – number of nodes • m – number of operators • d – number of system input streams • k – number of samples in load time series • Static operator distribution • Dynamic operator distribution
Heuristics • Heuristic #1 • Choose the case where feasibility boundaries are close on each axis • Heuristic #2 • Choose the case where all the feasibility boundaries are far from the origin [Figure: four feasible sets in the (r1, r2) plane illustrating the two heuristics]
Resilient vs. Optimal • 2 nodes • 4 input streams
Varying Bandwidth Constraints • Resilient vs. Connected-Load-Balancing
Varying Data Communication CPU Overhead • Resilient vs. Connected-Load-Balancing