
Aurora – system architecture



  1. Aurora – system architecture Pawel Jurczyk

  2. Currently used DB systems • Classical DBMS: • Passive repository storing data (HADP – human-active, DBMS-passive model) • Only the current state of data is important • Data is synchronized; queries have exact answers (no support for approximation) • Monitoring applications are difficult to implement in a traditional DBMS • Triggers do not scale past a few triggers per table • Problems with getting the required data from historical time series • Development of dedicated middleware is expensive • Conclusion: these systems are ill-suited for applications meant to alert a human when an abnormal situation occurs (the expected model is DAHP – DBMS-active, human-passive)

  3. Aurora – main assumptions • Data comes from various, uniquely identified data sources (data streams) • Each incoming tuple is timestamped • Aurora is expected to process the incoming streams • Tuples are transferred through a loop-free, directed graph • Outputs from the system are presented to applications • Aurora maintains historical storage
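A minimal sketch of how a tuple could be represented under these assumptions (Python; the field names source_id, timestamp and values are illustrative, not Aurora's actual schema):

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass(frozen=True)
class StreamTuple:
    """One element of a data stream: stamped on arrival and tagged with
    the id of the uniquely identified data source that produced it."""
    source_id: str            # the data source (stream) this tuple came from
    timestamp: float          # assigned when the tuple enters Aurora
    values: Dict[str, Any]    # the application attributes, e.g. {"A": 3}

# Example: a reading arriving on stream S
t = StreamTuple(source_id="S", timestamp=1700000000.0, values={"A": 3})
print(t.source_id, t.values["A"])
```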

  4. Aurora system overview [Diagram: input data streams enter the system, flow through a network of boxes with access to storage and queries, and output data is delivered to applications] • Any box can filter a stream (select operation) • A box can compute stream aggregates, applying an aggregate function across a window of values in the stream • The output of any box can be an input for several other boxes (split operation) • Each box can gather tuples from many inputs (union operation) – these operations are sketched below
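The box operations listed above can be illustrated with ordinary generator functions; this is a hedged sketch (plain dicts stand in for tuples), not Aurora's operator implementation:

```python
def filter_box(stream, predicate):
    """Select: pass through only the tuples satisfying the predicate."""
    for t in stream:
        if predicate(t):
            yield t

def window_aggregate_box(stream, window_size, agg):
    """Apply an aggregate function across a (tumbling) window of values."""
    window = []
    for t in stream:
        window.append(t)
        if len(window) == window_size:
            yield agg(window)
            window = []

def union_box(*streams):
    """Union: gather tuples from many inputs into one output stream
    (a real union interleaves by arrival time; concatenation keeps it short)."""
    for s in streams:
        yield from s

# Split is simply feeding one box's output to several downstream boxes.
readings = [{"A": 1}, {"A": 3}, {"A": 5}, {"A": 7}]
small = filter_box(readings, lambda t: t["A"] < 6)
print(list(window_aggregate_box(small, 2, lambda w: sum(t["A"] for t in w))))
# -> [4]   (1 + 3; the remaining 5 never fills a second window)
```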

  5. Aurora query model [Diagram: storage areas S1, S2, S3; a continuous query path (boxes b1, b2, b3) ending at an application with a QoS spec; a view path (boxes b4, b5) with its own QoS spec; an ad-hoc query path (boxes b6, b7) ending at an application with a QoS spec; connection points carry persistence specifications such as „keep 2 hr”] • Each connection point (CP) and each view should have a persistence specification (e.g. „keep data for 2 hr”) • Each output is associated with a QoS specification (this helps to allocate the processing elements along the path)

  6. Queries in Aurora • Continuous queries • The query continuously processes tuples • Output tuples are delivered to an application • Ad-hoc queries • The system processes the data and delivers an answer starting from the earliest time stored in the connection point • The semantics are the same as a continuous query that started execution at tnow – (persistence specification) • The query continues until explicit termination • Views • Similar to materialized or partially materialized views in classical DB systems • An application may connect to the end of this path whenever there is a need

  7. Connection points • Support dynamic modification of the network • Support data caching (persistence specification) – helpful for ad-hoc queries • A connection point without an upstream connection can be used as a stored data set (as in a classical DBMS) • Tuples from a connection point can be pushed through the system (e.g. when the connection point is „materialized” and the stored tuples are passed as a stream to the downstream nodes) • Alternatively, a downstream node can pull the data (helpful in the execution of filtering or joining operations)

  8. Optimization in Aurora – problems • Many changes in the network over time • The need to deal with a large number of boxes • The system operates in a data-flow mode • Optimization issues address different needs than in a classical DBMS

  9. Optimization of continuous queries • Optimization is done at run time • Aurora starts executing the unoptimized network • Optimization is performed step by step for portions of the network (subnetworks) • First, hold all input messages for the selected subnetwork – drain it of messages • Then, optimize the selected subnetwork • Insert projections (get rid of unneeded attributes of tuples as soon as possible) • Combine boxes (e.g. a projection with a filter; sketched below) • Reorder boxes (e.g. a filter can be pushed down the query tree through a join) • Finally, stop holding the input messages • The optimizer cycles periodically through all subnetworks (it is a background task)
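As an illustration of the "combine boxes" step, a small sketch of a projection box and of a combined filter-plus-projection box (names are hypothetical; this is not the optimizer's actual code):

```python
def project_box(stream, keep):
    """Projection: drop unneeded attributes as early as possible."""
    for t in stream:
        yield {k: v for k, v in t.items() if k in keep}

def combined_filter_project(stream, predicate, keep):
    """Combining boxes: a single pass performs both the filter and the
    projection, saving one box call (and one queue) per tuple."""
    for t in stream:
        if predicate(t):
            yield {k: v for k, v in t.items() if k in keep}

rows = [{"A": 1, "B": 9, "C": 0}, {"A": 3, "B": 2, "C": 1}]
print(list(combined_filter_project(rows, lambda t: t["A"] > 2, {"A", "B"})))
# -> [{'A': 3, 'B': 2}]
```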

  10. Optimization of continuous queries – details [Figure: streams S(A) and T(B, C) pass through Filter (A>2, A<4) and Filter (B>2, B<4), then Join (A=B) (bi: c=4; s=5) followed by Filter (C > 0) (bj: c=1; s=0.5)] • Each box has: • c(b) – execution cost • s(b) – selectivity – expected number of output tuples per input tuple • Amount of processing for two successive boxes (as arranged in the figure): c(bi) + c(bj)*s(bi) • Boxes are in the right order if: (1 - s(bj))/c(bj) < (1 - s(bi))/c(bi) • Checking the condition for bi and bj: (1 - 0.5)/1 < (1 - 5)/4, i.e. 0.5 < -1 – FALSE • The condition is not satisfied – we should change the order of the boxes
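The ordering test from this slide, checked numerically (a sketch using the cost and selectivity values given in the figure):

```python
def in_right_order(c_i, s_i, c_j, s_j):
    """Upstream box bi may stay before downstream box bj when
    (1 - s(bj)) / c(bj) < (1 - s(bi)) / c(bi)."""
    return (1 - s_j) / c_j < (1 - s_i) / c_i

# Figure values: bi is the join (c=4, s=5), bj the filter on C (c=1, s=0.5)
print(in_right_order(c_i=4, s_i=5, c_j=1, s_j=0.5))
# -> False: 0.5 < -1 does not hold, so the two boxes should be swapped
```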

  11. Optimization of ad-hoc queries • Each ad-hoc query is attached to a connection point – it runs on all the historical data stored in the connection point • The connection point keeps historical data in a B-tree • The 'historical part' of the ad-hoc query (the successor(s) of the connection point) is examined first • filter boxes compatible with the B-tree storage key can use indexed lookup • joins can use merge-sort or indexed lookup – the cheapest option is chosen • The rest of the query is optimized like a continuous query

  12. Run-time architecture [Diagram: inputs enter through the router; the storage manager (buffer manager and persistent storage) maintains the box queues Q1 … Qn; the scheduler assigns work to the box processors; outputs are observed by the QoS monitor, which activates the load shedder]

  13. Run-time components • Router • Routes tuples in the system • Forwards them either to the outputs or to the storage manager • Storage manager • Responsible for maintaining the box queues and managing the buffer • Scheduler • Decides which box will be processed next • Box processor • Executes the appropriate operation • Forwards the output to the router • QoS monitor • Observes the outputs and activates the load shedder • Load shedder • Sheds load until performance reaches an acceptable level

  14. QoS • Optimization attempts to maximize the perceived QoS of the outputs • Basically, QoS is a function of: • Response times (production of output tuples) • Tuple drops • Values produced (importance of the produced values) • The administrator specifies QoS graphs for each output based on one or more of the functions above • Other types of QoS functions can be defined too • The administrator also defines the headroom for the system (the percentage of computing resources that may be used by Aurora)

  15. QoS graphs • Graphs are expected to be normalized • Graphs should allow a properly sized network to operate with all outputs in a 'good zone' • Graphs should be convex (the value-based graph is an exception) [Graphs: QoS (0–1) as a function of delay, of the percentage of tuples delivered, and of output value, each with its good zone marked]
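A small sketch of what a normalized delay-based QoS graph might look like as a function (the breakpoints of 100 ms and 500 ms are illustrative assumptions, not values from the paper):

```python
def delay_qos(delay_ms, good_until=100.0, zero_at=500.0):
    """Normalized delay-based QoS: utility 1.0 while the output delay stays
    in the good zone (<= good_until ms), falling linearly to 0 at zero_at ms."""
    if delay_ms <= good_until:
        return 1.0
    if delay_ms >= zero_at:
        return 0.0
    return 1.0 - (delay_ms - good_until) / (zero_at - good_until)

print(delay_qos(50), delay_qos(300), delay_qos(800))   # -> 1.0 0.5 0.0
```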

  16. Aurora Storage Manager (ASM) – queue management [Figure: queue organization – tuples accumulate over time at the outputs of boxes b1 and b2; processed tuples are discarded] • Windowed operations (e.g. aggregations) require a historical collection of tuples • Tuples may accumulate in various places when the network is saturated • There is one queue at the output of each box; this queue is shared by all successor boxes • Queues are stored in memory and on disk • Queues may change length • The scheduler and the ASM share each box's scheduling priority and the percentage of its queue held in main memory

  17. Aurora Storage Manager (ASM) – connection point management • If the amount of historical data needed at the CP is less than the maximal window size of the successor boxes, no extra storage is needed • Historical data is organized in B-trees based on the storage key (default: timestamp) • Periodically, all tuples that are older than the history requirement are removed from the B-tree • The B-trees are stored in space allocated by the ASM

  18. Scheduling in Aurora • The scheduler (and Aurora) aims to reduce the overall tuple execution cost • It exploits two nonlinearities in tuple processing • Interbox nonlinearity: • Minimize tuple thrashing (if buffer space is insufficient, tuples have to be shuttled between memory and disk) • Avoid copying data from output to buffer (a possibility of bypassing the ASM when one box is scheduled right after another) • Intrabox nonlinearity: • The cost of processing a tuple may decrease as the number of available tuples in the queue increases (less context switching, better optimization)

  19. Scheduling in Aurora • Aurora's approach: (1) have as many tuples in the queues as possible, (2) process them at once – train scheduling, and (3) pass them to subsequent boxes without going to disk – superbox scheduling • Two goals: (1) minimize the number of I/O operations and (2) minimize the number of box calls per tuple • How does it work? (sketched below) • An output is selected for execution • The first downstream box whose queue is in memory is found • Then the upstream boxes are considered – as many upstream boxes with non-empty, in-memory queues as possible are added • The resulting sequence of boxes can be scheduled one after another • The storage manager is notified to keep all queues of the selected boxes in memory during execution
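A hedged sketch of the selection steps above, assuming a single linear path of boxes; the helper callables (upstream_of, queue_in_memory, queue_nonempty) are hypothetical, not Aurora's API:

```python
def pick_superbox(output_box, upstream_of, queue_in_memory, queue_nonempty):
    """Walk upstream from the chosen output until a box with an in-memory
    queue is found, then extend the sequence with upstream boxes whose
    non-empty queues are also in memory."""
    box = output_box
    while box is not None and not queue_in_memory(box):
        box = upstream_of(box)
    if box is None:
        return []
    sequence = [box]
    prev = upstream_of(box)
    while prev is not None and queue_in_memory(prev) and queue_nonempty(prev):
        sequence.append(prev)
        prev = upstream_of(prev)
    # The queues of these boxes are then pinned in memory and the boxes are
    # executed one after another, upstream first.
    return list(reversed(sequence))

# Toy chain b1 -> b2 -> b3 (output); b3's queue has been spilled to disk
upstream = {"b3": "b2", "b2": "b1", "b1": None}
print(pick_superbox("b3", upstream.get,
                    queue_in_memory=lambda b: b != "b3",
                    queue_nonempty=lambda b: True))   # -> ['b1', 'b2']
```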

  20. Priority assignment in the scheduler • The waiting delay of tuples (a part of the latency of each output) is a function of scheduling • The scheduler's goal: assign priorities to the outputs so as to maximize the overall QoS • The scheduler's approach has two aspects: • a state-based analysis that assigns priorities to the outputs and picks for scheduling the output with the highest utility • a feedback-based analysis that observes the overall system and increases the priorities of outputs that are not doing well

  21. Scheduler – execution overhead [Chart: execution cost vs. scheduling overhead, time in ms (0–300), for tuple-at-a-time, train, and superbox scheduling]

  22. Prediction of overload situations • Static analysis • The goal: determine whether the hardware running the network is sized correctly • Each box has a processing cost c(b) and a selectivity s(b) • Each input has a tuple production rate r(d) • The analysis starts from each data source and continues downstream • The system is stable when: 1/c(bi) ≥ r(di) • The output rate from bi is: min(1/c(bi), r(di)) * s(bi) • Iterating these steps gives the output data rate and the computational requirements for each box • This makes it possible to predict the required computational resources (a worked example follows on the next slide)

  23. Prediction of overload situations – example [Figure: stream S(A, B, C) at rs = 100 t/s feeds Filter (A>2, A<4) (b1: c = 0.05 s; s = 0.1), and stream T(B, C) at rt = 100 t/s feeds Filter (B>2, B<4) (b2: c = 0.05 s; s = 0.1); both feed Join (A=B) (b3: c = 0.1 s; s = 5), followed by Filter (C > 0) (b4: c = 0.05 s; s = 0.5)] • b1: 1/0.05 t/s ≥ 100 t/s (not true!) – output stream: min(1/0.05, 100) * 0.1 = 2 t/s • b2: 1/0.05 t/s ≥ 100 t/s (not true!) – output stream: min(1/0.05, 100) * 0.1 = 2 t/s • b3: 1/0.1 t/s ≥ (2 + 2) t/s (true) – output stream: min(1/0.1, 4) * 5 = 20 t/s • b4: 1/0.05 t/s ≥ 20 t/s (true) – output stream: min(1/0.05, 20) * 0.5 = 10 t/s • Needed computation: 100 t/s + 100 t/s + 2 t/s + 2 t/s + 20 t/s + 10 t/s = 234 t/s
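The static analysis of this example can be reproduced in a few lines (a sketch; each box is reduced to its cost, selectivity, and total input rate):

```python
def analyze_box(cost, selectivity, input_rate):
    """Stability test 1/c(b) >= r(d) and output rate min(1/c(b), r(d)) * s(b)."""
    stable = (1.0 / cost) >= input_rate
    out_rate = min(1.0 / cost, input_rate) * selectivity
    return stable, out_rate

_, out1 = analyze_box(0.05, 0.1, 100)           # b1: not stable, 2 t/s out
_, out2 = analyze_box(0.05, 0.1, 100)           # b2: not stable, 2 t/s out
_, out3 = analyze_box(0.1,  5.0, out1 + out2)   # b3: stable, 20 t/s out
_, out4 = analyze_box(0.05, 0.5, out3)          # b4: stable, 10 t/s out
# Source rates plus every box's output rate, as summed on the slide:
print(100 + 100 + out1 + out2 + out3 + out4)    # -> 234.0 t/s
```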

  24. Prediction of overload situations • Run-time analysis • Helps to deal with spikes in the input rates • Uses delay-based QoS information • If many tuples fall outside the 'good zone', an overload is likely

  25. Load shedding • A reaction to overload • The load shedding process relies on QoS information • Load shedding by dropping tuples • Drop is a system-level operator that randomly drops tuples from a stream at a specified rate (sketched below) • The drop box is placed as far upstream as possible • As a result of static analysis • Tuples are dropped on network branches that terminate in the more tolerant outputs • Algorithm: (1) choose the output with the smallest negative slope in its tuple-drop QoS graph, (2) move horizontally along this curve until another output has a smaller negative slope at that point, (3) this horizontal difference indicates the drop rate for that output's tuples • As a result of dynamic analysis • A similar algorithm as above • Delay-based graphs can be used • Tuples are dropped on branches that terminate in higher-priority outputs (otherwise it would be ineffective)
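A minimal sketch of the drop operator itself (where to place it and at what rate is decided by the QoS-graph algorithm above):

```python
import random

def drop_box(stream, drop_rate, rng=random.random):
    """System-level drop operator: discard tuples at random at the
    specified rate (drop_rate = 0.25 drops roughly a quarter of them)."""
    for t in stream:
        if rng() >= drop_rate:
            yield t

kept = list(drop_box(range(1000), drop_rate=0.25))
print(len(kept))   # roughly 750 tuples survive
```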

  26. Load shedding • Load shedding by filtering tuples • Idea: remove the less important tuples rather than randomly chosen ones • It uses value-based QoS information • A histogram is prepared containing the frequency with which value ranges have been observed • The utility of each interval can then be calculated (multiply the frequency by the value-based QoS function's value) • Backward interval propagation: Aurora picks the interval with the lowest utility and prepares a predicate for it that is used in a filter box (sketched below) • Forward interval propagation: estimate a suitable filter predicate and check it by trial and error
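A hedged sketch of the histogram-based utility computation and the backward-propagation predicate described above (the data structures are assumptions, not Aurora's):

```python
def interval_utilities(histogram, value_qos):
    """Utility of each value interval: observed frequency multiplied by the
    value-based QoS of that interval. histogram maps (low, high) -> frequency."""
    return {iv: freq * value_qos(iv) for iv, freq in histogram.items()}

def lowest_utility_filter(histogram, value_qos):
    """Pick the interval with the lowest utility and build a filter
    predicate that discards values falling inside it."""
    utils = interval_utilities(histogram, value_qos)
    low, high = min(utils, key=utils.get)
    return lambda v: not (low <= v < high)

hist = {(0, 10): 500, (10, 20): 100, (20, 30): 400}
qos = {(0, 10): 0.1, (10, 20): 0.9, (20, 30): 1.0}.get
keep = lowest_utility_filter(hist, qos)
print(keep(5), keep(25))   # -> False True  (values in [0, 10) are filtered out)
```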
