
DAX: Dynamically Adaptive Distributed System for Processing CompleX Continuous Queries


Presentation Transcript


  1. DAX: Dynamically Adaptive Distributed System for Processing CompleX Continuous Queries Bin Liu, Yali Zhu, Mariana Jbantova, Brad Momberger, and Elke A. Rundensteiner Department of Computer Science, Worcester Polytechnic Institute 100 Institute Road, Worcester, MA 01609 Tel: 1-508-831-5857, Fax: 1-508-831-5776 {binliu, yaliz, jbantova, bmombe, rundenst}@cs.wpi.edu VLDB’05 Demonstration http://davis.wpi.edu/dsrg/CAPE/index.html

  2. Uncertainties in Stream Query Processing [Diagram: end users register continuous queries with a distributed stream query engine over streaming data and receive streaming results.] • Streaming data may have time-varying rates and high volumes. • The query workload is high, and real-time, accurate responses are required. • The resources available for executing each operator may vary over time; memory and CPU resources are limited. • Distribution and adaptation are therefore required.

  3. Adaptation in Distributed Stream Processing • Adaptation Techniques: spilling data to disk; relocating work to other machines; reoptimizing and migrating the query plan. • Granularity of Adaptation: operator-level distribution and adaptation; partition-level distribution and adaptation. • Integrated Methodologies: consider the trade-offs between spilling and redistributing, and between migrating and redistributing.

  4. System Overview [LZ+05, TLJ+05] [Architecture diagram: a global Distribution Manager (Connection Manager, Runtime Monitor, Global Plan Migrator, Global Adaptation Controller, Repository) coordinates multiple Query Processors, each running the CAPE Continuous Query Processing Engine (Query Plan Manager, Local Plan Migrator, Local Adaptation Controller, Local Statistics Gatherer, Data Receiver, Data Distributor, Repository); stream generators, application servers, and end users connect over the network to feed streaming data and receive results.]

  5. Motivating Example: Scalable Real-Time Data Processing Systems • Decision-making applications (e.g., a decision support system) analyze the relationships among stock prices, reports, and news. • A real-time data integration server receives stock prices, volumes, reviews, external reports, news, and more. • The example query is an equi-join of stock prices, reports, and news on stock symbols; complex queries such as multi-joins are common. • Goal 1: produce as many results as possible at run time (i.e., 9:00am-4:00pm) using main-memory-based processing. • Goal 2: eventually deliver complete query results (i.e., for offline analysis after 4:00pm or whenever possible); load shedding is not acceptable, so data must temporarily be spilled to disk.

  6. Initial Distribution Policies • Balanced Distribution: Goal: to equalize workload per machine. Algorithm: iteratively takes each query operator and places it on the query processor with the least number of operators. • Network-Aware Distribution: Goal: to minimize network connectivity. Algorithm: takes each query plan and creates sub-plans where neighbouring operators are grouped together. [Diagram: example operator placements across machines M1, M2, and M3 under each policy.]
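
Below is a minimal Python sketch of the Balanced policy as described on this slide: each operator goes to the machine that currently hosts the fewest operators. The function and variable names are illustrative assumptions, not the actual CAPE/DAX API.

# Balanced initial distribution: place each operator on the machine that
# currently hosts the fewest operators.
def distribute_balanced(operators, machines):
    """Return a distribution table mapping operator id -> machine id."""
    table = {}
    load = {m: 0 for m in machines}               # operators assigned per machine
    for op in operators:
        target = min(machines, key=lambda m: load[m])
        table[op] = target
        load[target] += 1
    return table

if __name__ == "__main__":
    ops = ["op%d" % i for i in range(1, 9)]
    print(distribute_balanced(ops, ["M1", "M2"]))
    # e.g. {'op1': 'M1', 'op2': 'M2', 'op3': 'M1', 'op4': 'M2', ...}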

  7. Distribution Manager • Step 1: Create the distribution table using the initial distribution algorithm. • Step 2: Send the distribution information to the processing machines (nodes). Example distribution table for a plan of operators 1-8 over machines M1 and M2: Operator 1 -> M1, Operator 2 -> M1, Operator 3 -> M2, Operator 4 -> M2, Operator 5 -> M1, Operator 6 -> M1, Operator 7 -> M2, Operator 8 -> M2. [Diagram: the query plan from stream source to application, with its operators assigned to M1 and M2.]

  8. Operator-level Adaptation: Redistribution • CAPE's cost models: number of tuples in memory and network output rate. The cost of a machine is the percentage of its memory filled with tuples (machine capacity: 4500 tuples). • Statistics table: M1 holds 2000 tuples, M2 holds 4100 tuples. Current cost table: M1 = .44 (Op1: .25, Op2: .25, Op5: .25, Op6: .25); M2 = .91 (Op3: .3, Op4: .2, Op7: .3, Op8: .2). • Desired cost table under the Balance policy (Op7 moves from M2 to M1): M1 = .71 (Op1: .15, Op2: .15, Op5: .15, Op6: .15, Op7: .4); M2 = .64 (Op3: .4, Op4: .3, Op8: .3). • Operators are redistributed based on a redistribution policy; CAPE's redistribution policies are Balance and Degradation.
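
A rough sketch of the Balance redistribution decision, assuming the per-machine cost is the fraction of the 4500-tuple capacity in use and that operators are moved from the most to the least loaded machine. The helper names, the victim-selection rule, and the stopping rule are assumptions for illustration, not the CAPE implementation.

CAPACITY = 4500   # tuples a machine can hold, as on this slide

def machine_cost(tuples_on_machine):
    """Cost of a machine = fraction of its tuple capacity in use."""
    return tuples_on_machine / CAPACITY

def balance(distribution, op_tuples, tolerance=0.05, max_moves=100):
    """distribution: op -> machine; op_tuples: op -> tuples held in its state."""
    for _ in range(max_moves):
        load = {}
        for op, m in distribution.items():
            load[m] = load.get(m, 0) + op_tuples[op]
        hot = max(load, key=load.get)
        cold = min(load, key=load.get)
        gap = load[hot] - load[cold]
        if gap / CAPACITY <= tolerance:
            break                                   # balanced enough
        # pick the operator on the hot machine whose size is closest to half
        # the gap, i.e. the move that best equalizes the two machines
        candidates = [op for op, m in distribution.items() if m == hot]
        victim = min(candidates, key=lambda op: abs(gap / 2 - op_tuples[op]))
        if op_tuples[victim] >= gap:
            break                                   # any move would overshoot
        distribution[victim] = cold
    return distribution

# Example roughly mirroring the slide: M2 is overloaded, so one of its
# larger operators moves to M1, approximating the .71 / .64 split.
dist = {"Op1": "M1", "Op2": "M1", "Op5": "M1", "Op6": "M1",
        "Op3": "M2", "Op4": "M2", "Op7": "M2", "Op8": "M2"}
sizes = {"Op1": 500, "Op2": 500, "Op5": 500, "Op6": 500,
         "Op3": 1230, "Op4": 820, "Op7": 1230, "Op8": 820}
print(balance(dist, sizes))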

  9. Redistribution Protocol: Moving Operators Across Machines

  10. Experimental Results of Distribution and Redistribution Algorithms Performance of a query plan with 40 operators. Observations: • The initial distribution is important for query plan performance. • Redistribution improves query plan performance at run time.

  11. Operator-level Adaptation: Dynamic Plan Migration • The last step of plan re-optimization: after the optimizer generates a new query plan, how do we replace the currently running plan with the new plan on the fly? • A new challenge in streaming systems because of stateful operators, and a unique feature of the DAX system. • Can we just take out the old plan and plug in the new plan? Naive steps: (1) pause execution of the old plan; (2) drain out all tuples inside the old plan; (3) replace the old plan with the new plan; (4) resume execution of the new plan. • This leads to a deadlock/waiting problem. Key observation: purging tuples from operator states relies on the processing of new tuples, so a paused plan can never be fully drained.
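
The following toy sketch (assumed sliding-window purge semantics, not DAX code) illustrates the key observation: state tuples are purged only when newer tuples arrive from the other input, so once execution is paused the old plan cannot drain.

W = 2   # sliding-window size, in time units

def purge(state, newest_other_timestamp):
    """Drop state tuples that have fallen out of the sliding window."""
    return [t for t in state if newest_other_timestamp - t <= W]

state_a = [1, 2, 3]   # timestamps of A-tuples held in the join state
# While the plan is paused, no new B-tuples arrive, so the newest known B
# timestamp never advances and purge() can never shrink the state:
print(purge(state_a, newest_other_timestamp=3))   # [1, 2, 3], forever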

  12. Migration Strategy: Moving State • Migration requirements: no missing results and no duplicates. • Two migration boxes: one contains the old sub-plan, one contains the new sub-plan. The two sub-plans are semantically equivalent, with the same input and output queues, so migration is abstracted as replacing the old box by the new box. • Basic idea: share common states between the two migration boxes. • Key steps: (1) drain tuples in the old box; (2) state matching: each state in the old box has a unique ID, newly generated states in the new box receive new IDs during rewriting, and when rewriting is done states are matched based on IDs; (3) state moving between matched states. • What is left? Unmatched states in the new box and unmatched states in the old box. [Diagram: old and new join trees over inputs QA-QD with output QABCD and per-operator states such as SA, SB, SC, SD, SAB, SABC, SBC, SBCD.]
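
A minimal sketch of the state matching and moving step, assuming states are keyed by the unique IDs mentioned above; the state names follow the ABCD example, but the function itself is purely illustrative.

def match_and_move(old_states, new_states):
    """old_states / new_states: state id -> state contents (None if empty)."""
    matched = set(old_states) & set(new_states)
    for sid in matched:
        new_states[sid] = old_states[sid]          # move (share) the state
    unmatched_new = set(new_states) - matched      # must be recomputed
    unmatched_old = set(old_states) - matched      # safe to discard after sync
    return matched, unmatched_new, unmatched_old

# Old box ((A join B) join C) join D vs. new box A join ((B join C) join D):
old = {"SA": ["a1"], "SB": ["b1"], "SAB": ["a1b1"],
       "SC": ["c1"], "SABC": ["a1b1c1"], "SD": ["d1"]}
new = {"SA": None, "SB": None, "SC": None, "SD": None, "SBC": None, "SBCD": None}
matched, unmatched_new, unmatched_old = match_and_move(old, new)
print(sorted(unmatched_new), sorted(unmatched_old))   # ['SBC', 'SBCD'] ['SAB', 'SABC']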

  13. Moving State: Unmatched States • Unmatched new states (recomputation): recursively recompute the unmatched states in the new box from the bottom up. • Unmatched old states (execution synchronization): first clean the accumulated tuples in the box input queues; it is then safe to discard these unmatched old states. [Diagram: example with window W = 2 and input tuples a1-a2, b1-b3, c1-c3 showing how the unmatched states (e.g., SBC) are recomputed from matched states SB and SC.]

  14. Distributed Dynamic Migration Protocols (I) • Migration start: using its distribution table, the Distribution Manager initiates migration of the sub-plan (operators op1-op4) running on machines M1 and M2. • Migration stage: execution synchronization. (1) The Distribution Manager requests a SyncTime from each involved query processor; (2) each processor reports its local SyncTime; (3) the Distribution Manager broadcasts the global SyncTime; (4) each processor signals Execution Synced.

  15. Distributed Dynamic Migration Protocols (II) • Migration stage: change plan shape. (5) The Distribution Manager sends the new sub-query plan to each involved query processor; (6) each processor restructures its local sub-plan and replies PlanChanged. [Diagram: on M1 and M2 the sub-plan changes from the old operator ordering to the new one.]

  16. Distributed Dynamic Migration Protocols (III) • Migration stage: fill states and reactivate operators. (7) The Distribution Manager tells each query processor which states to fill (e.g., Fill States [3, 5] to one machine and Fill States [2, 4] to the other); (7.1-7.4) the processors request and move the needed states between each other (e.g., Request state [4] / Move state [4], Request state [2] / Move state [2]); (8) each processor reports States Filled; (9) the Distribution Manager instructs the processors to reconnect operators; (10) each processor reports Operator Reconnected; (11) the Distribution Manager activates the migrated operators (e.g., Activate [op1], Activate [op2]).
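
The skeleton below condenses the three protocol stages of slides 14-16 into one hypothetical message flow. Class, method, and message names are assumptions made for illustration; they are not the actual DAX protocol classes.

class QueryProcessor:
    def __init__(self, name, states):
        self.name = name
        self.states = states                      # state id -> state contents

    def local_sync_time(self):                    # stage I, step (2)
        return 0                                  # placeholder timestamp

    def sync_to(self, global_sync_time):          # stage I, step (4)
        return "ExecutionSynced"

    def install_plan(self, sub_plan):             # stage II, steps (5)-(6)
        self.plan = sub_plan
        return "PlanChanged"

    def fill_states(self, needed_ids, peer):      # stage III, steps (7)-(8)
        for sid in needed_ids:                    # request + move each state
            self.states[sid] = peer.states.get(sid)
        return "StatesFilled"


class DistributionManager:
    def migrate(self, processors, new_plans, needed_states):
        # Stage I: execution synchronization, messages (1)-(4)
        global_sync = max(p.local_sync_time() for p in processors)
        assert all(p.sync_to(global_sync) == "ExecutionSynced" for p in processors)
        # Stage II: change plan shape, messages (5)-(6)
        assert all(p.install_plan(new_plans[p.name]) == "PlanChanged" for p in processors)
        # Stage III: fill states from the peer machine (two-machine case),
        # then reconnect and reactivate operators, messages (7)-(11)
        for p, peer in zip(processors, reversed(processors)):
            assert p.fill_states(needed_states[p.name], peer) == "StatesFilled"
        return "MigrationComplete"


m1 = QueryProcessor("M1", {"2": "state2", "4": "state4"})
m2 = QueryProcessor("M2", {"3": "state3", "5": "state5"})
dm = DistributionManager()
print(dm.migrate([m1, m2], {"M1": "newPlanM1", "M2": "newPlanM2"},
                 {"M1": ["3", "5"], "M2": ["2", "4"]}))   # MigrationComplete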

  17. From Operator-level to Partition-level • Problem with operator-level adaptation: operators have large states, and moving them across machines can be expensive. • Solution, partition-level adaptation: partition state-intensive operators [Gra90, SH03, LR05] and distribute the partitioned plan across multiple machines. [Diagram: split operators (SplitA, SplitB, SplitC) route inputs A, B, C to join instances on machines m1-m4, whose outputs are merged by a Union.]

  18. Partitioned Symmetric M-way Join • Example query: A.A1 = B.B1 = C.C1, processed on two machines, each running a 3-way join. • Split operators hash each input on the join attribute, e.g., A1 % 2 = 0 -> m1 and A1 % 2 = 1 -> m2 (likewise for B1 and C1), so partitions PA1, PB1, PC1 reside on m1 and PA2, PB2, PC2 reside on m2. • Correctness: A ⋈ B ⋈ C = (PA1 ⋈ PB1 ⋈ PC1) ∪ (PA2 ⋈ PB2 ⋈ PC2).
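
A small illustrative sketch of the partitioned join above: tuples are routed by hashing the join attribute (key % 2), each machine joins only its own partitions, and the union of the per-machine results equals the full join. All names and the simplified tuple representation are assumptions for the example.

from collections import defaultdict

NUM_MACHINES = 2

def route(key):                       # SplitA / SplitB / SplitC
    return key % NUM_MACHINES         # e.g. A1 % 2 = 0 -> m1, A1 % 2 = 1 -> m2

def partitioned_join(a_tuples, b_tuples, c_tuples):
    parts = defaultdict(lambda: {"A": [], "B": [], "C": []})
    for name, tuples in (("A", a_tuples), ("B", b_tuples), ("C", c_tuples)):
        for key in tuples:
            parts[route(key)][name].append(key)
    results = []
    for machine, p in parts.items():              # each machine joins locally
        results += [(a, b, c) for a in p["A"] for b in p["B"] for c in p["C"]
                    if a == b == c]                # equi-join on the key
    return results

print(partitioned_join([1, 2, 3, 4], [1, 2, 4], [2, 3, 4]))   # [(2, 2, 2), (4, 4, 4)]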

  19. Partition-level Adaptations • 1: State relocation, for uneven workload among machines. Relocated states remain active on the other machine; the cost is the overhead of monitoring and moving states across machines. • 2: State spill, because the memory overflow problem still exists. Operator states are temporarily pushed to disk (secondary storage); spilled states are temporarily inactive, so new incoming tuples probe only against the partial in-memory states.

  20. Approaches: Lazy-Disk vs. Active-Disk • Lazy-Disk approach: independent spill and relocation decisions. The Distribution Manager triggers state relocation if M_r < Θ_r and t > Θ_t (memory-usage and minimum-span thresholds); each Query Processor starts a state spill if Mem_u / Mem_all > Θ_s. • Active-Disk approach: partitions on different machines may have different productivity, i.e., the most productive partitions on machine 1 may be less productive than the least productive ones on other machines. Proposed technique: perform state spill globally, forcing spills based on memory usage and average productivity across machines. [Diagram: the Distribution Manager coordinating state relocation across query processors 1..n, each with a Local Adaptation Controller spilling state to its local disk.]
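
A sketch of the Lazy-Disk decisions with assumed threshold semantics (the exact conditions used in DAX may differ); the constants mirror the parameters reported on the next slide (80% memory threshold, 30M relocation trigger, 45s minimum span).

import time

SPILL_THRESHOLD = 0.8           # Θ_s: fraction of memory in use before spilling
RELOCATION_GAP = 30 * 2**20     # Θ_r: imbalance (bytes) that triggers relocation
MIN_SPAN_SECONDS = 45           # Θ_t: minimum time between relocations

def should_spill(mem_used, mem_all):
    """Local decision made independently by each query processor."""
    return mem_used / mem_all > SPILL_THRESHOLD

def should_relocate(mem_used_per_machine, last_relocation):
    """Global decision made by the distribution manager."""
    imbalance = max(mem_used_per_machine.values()) - min(mem_used_per_machine.values())
    return (imbalance > RELOCATION_GAP and
            time.time() - last_relocation > MIN_SPAN_SECONDS)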

  21. Performance Results of Lazy-Disk and Active-Disk Approaches • Lazy-Disk vs. No-Relocation in a memory-constrained environment: three machines, M1 (50%), M2 (25%), M3 (25%); input rate 30ms; tuple range 30K; incremental join ratio 2; state spill memory threshold 100M; state relocation: > 30M, memory threshold 80%, minimum span 45s. • Lazy-Disk vs. Active-Disk: three machines; input rate 30ms; tuple ranges 15K and 45K; state spill memory threshold 80M; average incremental join ratios M1 (4), M2 (1), M3 (1); maximal Force-Disk memory 100M, ratio > 2; state relocation: > 30M, memory threshold 80%, minimum span 45s.

  22. Plan-Wide State Spill: Local Methods • Local Output: a direct extension of the single-operator solution. Update operator productivity values individually and spill the partitions with the smallest P_output / P_size values across all operators. • Bottom-Up Pushing: push states from the bottom operators first, selecting partitions randomly or by their local productivity value. Fewer intermediate results (states) are stored, which reduces the number of state spills. [Diagram: a three-join plan (Join1 over A, B, C; Join2 adds D; Join3 adds E) whose partitions each track P_output and P_size.]
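
A minimal sketch of the Local Output victim selection: rank partitions by P_output / P_size and spill the least productive ones first until enough memory is freed. The field and function names are illustrative.

def choose_spill_victims(partitions, bytes_to_free):
    """partitions: list of dicts with 'id', 'p_output', and 'p_size' (bytes)."""
    ranked = sorted(partitions, key=lambda p: p["p_output"] / p["p_size"])
    victims, freed = [], 0
    for p in ranked:
        if freed >= bytes_to_free:
            break
        victims.append(p["id"])
        freed += p["p_size"]
    return victims

parts = [{"id": "Join1/P3", "p_output": 4, "p_size": 10},
         {"id": "Join2/P1", "p_output": 20, "p_size": 10},
         {"id": "Join3/P7", "p_output": 2, "p_size": 10}]
print(choose_spill_victims(parts, bytes_to_free=20))   # ['Join3/P7', 'Join1/P3']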

  23. Plan-Wide State Spill: Global Output • P_output now measures a partition's contribution to the final query output. A lineage tracing algorithm updates the P_output statistics: first update the P_output values of partitions in Join3; then apply Split2 to each tuple to find the corresponding partitions in Join2 and update their P_output values; and so on down the plan. The same lineage tracing algorithm is applied to intermediate results. • Intermediate result sizes are also considered via an intermediate-result factor P_inter, and partitions are ranked by P_output / (P_size + P_inter). [Diagram: the partitioned plan (Join1 over A, B, C; Join2 adds D; Join3 adds E, with Split operators between them) and an example tracing output counts back through partitions p11, p12, p21, ..., e.g. P11: P_size = 10, P_output = 20 and P12: P_size = 10, P_output = 20.]
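
An illustrative sketch of the lineage tracing idea: each final output tuple is traced back through the split functions level by level, incrementing a global P_output counter for every partition that contributed to it. This is a simplification of the algorithm sketched above, not the DAX implementation.

def trace_lineage(output_tuples, split_fns, p_output):
    """
    output_tuples: final result tuples from the top join.
    split_fns: functions ordered top-down; split_fns[i](t) returns the id of
               the partition at join level i that contributed to tuple t.
    p_output: dict (level, partition_id) -> global output count.
    """
    for t in output_tuples:
        for level, split in enumerate(split_fns):
            pid = split(t)
            p_output[(level, pid)] = p_output.get((level, pid), 0) + 1
    return p_output

# Toy usage: tuples carry a key; level 0 partitions it mod 2, level 1 mod 3.
outputs = [(4, "x"), (5, "y"), (6, "z")]
splits = [lambda t: t[0] % 2, lambda t: t[0] % 3]
print(trace_lineage(outputs, splits, {}))
# {(0, 0): 2, (1, 1): 1, (0, 1): 1, (1, 2): 1, (1, 0): 1}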

  24. Experiment Results for Plan-Wide Spill • 300 partitions • Memory threshold: 60MB • Push 30% of states in each state spill • Average tuple inter-arrival time: 50ms for each input • Two workloads: a query with average join rates Join1: 1, Join2: 3, Join3: 3, and a query with average join rates Join1: 3, Join2: 2, Join3: 3.

  25. Backup Slides

  26. Plan Shape Restructuring and Distributed Stream Processing • New slides for Yali's migration and distribution ideas.

  27. Migration Strategies: Parallel Track • Basic idea: execute both plans in parallel until the old box has "expired", after which the old box is disconnected and the migration is over. • Potential duplicates: both boxes generate all-new tuples. At the root operator in the old box, if both to-be-joined tuples have all-new sub-tuples, do not join them; all other operators in the old box proceed as normal. • Pros: migrates in a gradual fashion and still produces output during migration. Cons: still relies on executing the old box to process tuples during the migration stage. [Diagram: old and new boxes running side by side over inputs QA-QD with output QABCD.]
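
A tiny sketch of the duplicate-avoidance rule above: sub-tuples are tagged old or new depending on whether they arrived before or after migration start, and the root operator of the old box skips joins where both sides consist only of new sub-tuples, since the new box will produce that combination. The flag representation is an assumption for illustration.

def root_should_join(left_subtuple_flags, right_subtuple_flags):
    """Flags are True for 'new' sub-tuples and False for 'old' ones."""
    both_all_new = all(left_subtuple_flags) and all(right_subtuple_flags)
    return not both_all_new            # skip only the all-new / all-new case

print(root_should_join([True, True], [True]))    # False: the new box covers it
print(root_should_join([True, False], [True]))   # True: the old box must join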

  28. Cost Estimations • Moving State (MS), dominated by recomputing the unmatched states in the new box: T_MS = T_match + T_move + T_recompute ≈ T_recompute(S_BC) + T_recompute(S_BCD) = λ_B λ_C W^2 (T_j + T_s σ_BC) + 2 λ_B λ_C λ_D W^3 (T_j σ_BC + T_s σ_BC σ_BCD). • Parallel Track (PT), determined by how long the old box takes to expire: T_PT ≈ 2W given enough system resources. [Diagram: the new-box and old-box join trees, and a timeline from TM-start through the first and second windows of old/new tuples to TM-end.]
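
A worked sketch of the two cost estimates on this slide; the numeric parameter values below are made up purely for illustration.

def t_ms(lam_b, lam_c, lam_d, W, t_j, t_s, sel_bc, sel_bcd):
    """Moving-state cost: recompute S_BC and S_BCD in the new box."""
    recompute_sbc = lam_b * lam_c * W**2 * (t_j + t_s * sel_bc)
    recompute_sbcd = (2 * lam_b * lam_c * lam_d * W**3
                      * (t_j * sel_bc + t_s * sel_bc * sel_bcd))
    return recompute_sbc + recompute_sbcd

def t_pt(W):
    """Parallel-track duration, assuming enough system resources."""
    return 2 * W

# Example: arrival rates of 10 tuples/s, a 2s window, per-tuple join/store
# costs in seconds, and selectivities of 0.01.
print(t_ms(10, 10, 10, 2, 1e-4, 1e-5, 0.01, 0.01), t_pt(2))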

  29. Experimental Results for Plan Migration • Observations: the results confirm the prior cost analysis; the duration of moving state is affected by window size and arrival rates; the duration of parallel track is 2W given enough system resources, and is otherwise affected by system parameters such as window size and arrival rates.

  30. Related Work on Distributed Continuous Query Processing [1] Medusa: M. Balazinska, H. Balakrishnan, and M. Stonebraker. Contract-based load management in federated distributed systems. In Proceedings of NSDI, March 2004. [2] Aurora*: M. Cherniack, H. Balakrishnan, M. Balazinska, et al. Scalable distributed stream processing. In CIDR, 2003. [3] Borealis: The Borealis Team. The design of the Borealis stream processing engine. Technical report, Brown University, CS Department, August 2004. [4] Flux: M. Shah, J. Hellerstein, S. Chandrasekaran, and M. Franklin. Flux: An adaptive partitioning operator for continuous query systems. In ICDE, pages 25-36, 2003. [5] Distributed Eddies: F. Tian and D. DeWitt. Tuple routing strategies for distributed Eddies. In VLDB, Berlin, Germany, 2003.

  31. Related Work on Partitioned Processing • Prior partitioned processing targets non state-intensive queries [BB+02, AC+03, GT03]; we target state-intensive operators facing run-time memory shortage. • Prior adaptation is operator-level [CB+03, SLJ+05, XZH05]; we perform fine-grained state-level adaptation (adapting partial states). • Load shedding [TUZC03] drops input tuples to handle resource shortage; we require complete query results, so load shedding is not acceptable. • XJoin [UF00] and Hash-Merge Join [MLA04] only spill states of one single operator in central environments; we integrate both spill and relocation in distributed environments and investigate the dependency problem among multiple operators. • Flux [SH03] adapts the states of one single-input operator across machines; we handle multi-input operators and integrate both state spill and state relocation.

  32. CAPE Publications and Reports [RDZ04] E. A. Rundensteiner, L. Ding, Y. Zhu, T. Sutherland and B. Pielech, "CAPE: A Constraint-Aware Adaptive Stream Processing Engine". Invited book chapter. http://www.cs.uno.edu/~nauman/streamBook/. July 2004. [ZRH04] Y. Zhu, E. A. Rundensteiner and G. T. Heineman, "Dynamic Plan Migration for Continuous Queries Over Data Streams". SIGMOD 2004, pages 431-442. [DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman, "Joining Punctuated Streams". EDBT 2004, pages 587-604. [DR04] L. Ding and E. A. Rundensteiner, "Evaluating Window Joins over Punctuated Streams". CIKM 2004, to appear. [DRH03] L. Ding, E. A. Rundensteiner and G. T. Heineman, "MJoin: A Metadata-Aware Stream Join Operator". DEBS 2003. [RDSZBM04] E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech and N. Mehta, "CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity". Demonstration paper. VLDB 2004. [SR04] T. Sutherland and E. A. Rundensteiner, "D-CAPE: A Self-Tuning Continuous Query Plan Distribution Architecture". Tech Report WPI-CS-TR-04-18, 2004. [SPR04] T. Sutherland, B. Pielech, Y. Zhu, L. Ding and E. A. Rundensteiner, "Adaptive Multi-Objective Scheduling Selection Framework for Continuous Query Processing". IDEAS 2005. [SLJR05] T. Sutherland, B. Liu, M. Jbantova and E. A. Rundensteiner, "D-CAPE: Distributed and Self-Tuned Continuous Query Processing". CIKM, Bremen, Germany, Nov. 2005. [LR05] B. Liu and E. A. Rundensteiner, "Revisiting Pipelined Parallelism in Multi-Join Query Processing". VLDB 2005. [B05] B. Liu and E. A. Rundensteiner, "Partition-based Adaptation Strategies Integrating Spill and Relocation". Tech Report WPI-CS-TR-05, 2005 (in submission). CAPE Project: http://davis.wpi.edu/dsrg/CAPE/index.html

  33. CAPE Engine: Constraint-Aware Adaptive Continuous Query Processing Engine • Exploit semantic constraints such as sliding windows and punctuations to reduce resource usage and improve response time. • Incorporate heterogeneous-grained adaptivity at all query processing levels: adaptive query operator execution, adaptive query plan re-optimization, adaptive operator scheduling, and adaptive query plan distribution. • Process queries in a real-time manner by employing well-coordinated heterogeneous-grained adaptations.

  34. Analyzing Adaptation Performance • Questions Addressed: • Partitioned Parallel Processing • Resolves memory shortage • Should we partition non-memory intensive queries? • How effective is partitioning memory intensive queries? • State Spill • Known Problem: Slows down run-time throughput • How many states to push? • Which states to push? • How to combine memory/disk states to produce complete results? • State Relocation • Known Asset: Low overhead • When (how often) to trigger state relocation? • Is state relocation an expensive process? • How to coordinate state moving without losing data & states? • Analyzing State Adaptation Performance & Policies • Given sufficient main memory, state relocation helps run-time throughput • With insufficient main memory, Active-Disk improves run-time throughput • Adapting Multi-Operator Plan • Dependency among operators • Global throughput-oriented spill solutions improve throughput

  35. Amount of State Pushed in Each Adaptation • Percentage spilled per adaptation = # of tuples pushed / total # of tuples. [Charts: percentage spilled per adaptation, run-time query throughput, and run-time main memory usage. Input rate: 30ms/input, tuple range: 30K, join ratio: 3, adaptation threshold: 200MB.]
