Retrospective on Aurora: Advances in Stream-based Data Management and Linear Road Benchmark

Chapter 10: Stream-based Data Management • Title: Retrospective on Aurora • Authors: Hari Balakrishnan et al.

Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core • Problem • Problem Statement • Why is this problem important? • Why is this problem hard? • Approaches • Approach description, key concepts • Contributions (novelty, improved) • Assumptions

Problem Statement • Given • Stream data • Experience from developing five stream-based applications with the Aurora stream processing engine • Find: • Key requirements of streaming applications • Objectives • Reflect on the design of Aurora based on this experience • Eliminate the limitations and address new challenges in a follow-on project, Borealis • Constraints • Data streams arrive in no particular order. • Data streams arrive without any temporal regularity.

Why is this problem important? • Stream-processing applications • Financial Services – stock ticker • Transportation – congestion pricing, dynamic tolls • Sensor Networks – Environment monitoring • Defense – Battalion monitoring

Why is this problem hard? • High update rate • Time series • Streaming applications entail time series. • Time-series operations are not well supported by current DBMSs. • Real-time constraints • Outbound processing, where data are stored before being processed, cannot deliver real-time latency. • SPEs must adopt inbound processing, where query processing is performed directly on incoming messages. • Spikes in message load • Incoming traffic is bursty. • Quality of Service (QOS) requirements

Novel Contributions • Comparison with SQL-centric related work: • Data Flow Network (DFN) centric • Developer – composes the DFN using a graphical user interface • Optimizer – rearranges the DFN, e.g., swaps boxes • Compiler – translates the DFN to an intermediate representation • Run-time – schedules tasks based on QOS requirements • Other Contributions – Lessons Learnt • Identify characteristics of streaming applications • from 5 case studies • Identify core performance tuning ideas

Aurora Architecture • Aurora is based on a dataflow-style ‘boxes & arrows’ paradigm, unlike other systems that use an SQL-style query interface (sending queries back and forth adds system overhead and latency). • Can be spread across any number of machines for scalability and availability. • Figures: Aurora operators (input, operator, output); Aurora GUI
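The boxes-and-arrows idea can be sketched in a few lines of Python. This is only an illustration, not Aurora's API: `filter_box`, `map_box`, `run_network`, and the `sensor`/`temp` fields are hypothetical names. Each box consumes tuples as they arrive and passes results along an arrow to the next box.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, float]  # hypothetical tuple layout


def filter_box(pred: Callable[[Record], bool], stream: Iterable[Record]) -> Iterator[Record]:
    """A 'box' that forwards only the tuples satisfying pred."""
    for t in stream:
        if pred(t):
            yield t


def map_box(fn: Callable[[Record], Record], stream: Iterable[Record]) -> Iterator[Record]:
    """A 'box' that applies a user-defined function to every tuple."""
    for t in stream:
        yield fn(t)


def run_network(source: Iterable[Record]) -> List[Record]:
    """Wire two boxes together; the 'arrows' are the generator hand-offs."""
    hot = filter_box(lambda t: t["temp"] > 30.0, source)
    converted = map_box(
        lambda t: {"sensor": t["sensor"], "temp_f": t["temp"] * 9 / 5 + 32}, hot
    )
    return list(converted)
```

For example, `run_network([{"sensor": 1, "temp": 25.0}, {"sensor": 2, "temp": 35.0}])` keeps only the second reading and converts it to Fahrenheit. Nothing is stored before processing, which is the inbound style described earlier.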

Aurora Case Study 1: Financial Services • An application detects feed problems and triggers a switch between feeds in real time. • Hierarchical Alarm • A low alarm is triggered when an update is delayed beyond a threshold (e.g., 5 sec). • A high alarm is triggered when low alarms accumulate beyond a threshold (e.g., 100 times). • The boxes circled in red on the slide separate the alarms from both Reuters and Comstock into alarms from NYSE and alarms from NASDAQ (filtering and merging techniques). • This case study illustrates the ability to detect stream imperfections and extend functionality using user-defined Map functions.
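The two-level alarm above can be sketched as a small state machine. This is a minimal sketch under stated assumptions, not the application's actual logic: the class name `FeedMonitor` is hypothetical, a delay is only noticed when the next update arrives (a real monitor would more likely use a timer), and the high alarm fires when the low-alarm count reaches the threshold and then resets.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FeedMonitor:
    delay_threshold: float = 5.0   # seconds; the slide's example value
    high_threshold: int = 100      # low alarms per high alarm; the slide's example value
    last_ts: Optional[float] = None
    low_count: int = 0

    def on_update(self, ts: float) -> List[str]:
        """Process one feed update and return any alarms it raises."""
        alarms: List[str] = []
        if self.last_ts is not None and ts - self.last_ts > self.delay_threshold:
            self.low_count += 1
            alarms.append("LOW")
            # Assumption: escalate once the count reaches the threshold, then reset.
            if self.low_count >= self.high_threshold:
                alarms.append("HIGH")
                self.low_count = 0
        self.last_ts = ts
        return alarms
```

With `FeedMonitor(delay_threshold=5.0, high_threshold=2)`, updates at 0, 3, 10, and 20 seconds raise no alarm, no alarm, a low alarm, and then a low plus a high alarm.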

Aurora Case Study 2: Linear Road Benchmark • Linear Road is a benchmark for stream processing engines. • Simulates an urban highway system that uses ‘variable tolling’ (i.e., congestion-based pricing). • Linear Road must support • Two continuous queries • Calculate a segment toll every time a vehicle enters the segment. • Detect and report accidents and adjust tolls accordingly. • Three historical queries • Request an account balance • Day’s total expenditure for a given vehicle • Prediction of travel time between two segments using historical data • Each of these queries must be answered with a specified accuracy and within a specified response time.

Aurora Case Study 3: Battalion Monitoring • Aircraft gather data and send them to monitoring stations. • Enemy units cross a given line, signaling an attack. • The limited resource is the bandwidth between aircraft and ground. When an attack is initiated, selective dropping of data is allowed so that important classes are served. • The authors could test their load-shedding techniques. • Insert random drop boxes to discard a fraction of their input tuples. • Insert semantic, predicate-based drop filters. • Observations • The semantic load-shedding techniques achieve the lowest loss of value utility. • As load increases, the two techniques show similar performance. • At high loads, all algorithms converge to the same loss levels.
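The two load-shedding styles named above differ only in how a tuple is chosen for dropping. The sketch below is illustrative: `random_drop`, `semantic_drop`, and the `priority` field are hypothetical names, not Aurora operators.

```python
import random
from typing import Callable, Dict, Iterable, List

Record = Dict[str, int]  # hypothetical tuple layout


def random_drop(stream: Iterable[Record], drop_fraction: float,
                rng: random.Random) -> List[Record]:
    """Random drop box: discard roughly drop_fraction of the tuples, ignoring content."""
    return [t for t in stream if rng.random() >= drop_fraction]


def semantic_drop(stream: Iterable[Record],
                  keep: Callable[[Record], bool]) -> List[Record]:
    """Semantic drop filter: discard the tuples that a predicate marks as less valuable."""
    return [t for t in stream if keep(t)]
```

For instance, `semantic_drop(tuples, lambda t: t["priority"] >= 2)` keeps only the higher-priority tuples, while `random_drop(tuples, 0.5, random.Random(0))` discards about half of them regardless of priority. Passing the random generator in explicitly keeps runs repeatable.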

Aurora Case Study 4: Environmental Monitoring • Monitoring toxins in water. • Stream data is fish behavior (e.g., breathing rate) and water quality (e.g., temperature). • When the fish behave abnormally, an alarm is sounded. • The water data are processed with 1-, 2-, and 4-hour sliding windows. • Ease of developing stream applications • Aurora proved very convenient for sliding window calculation. • Aurora’s GUI proved invaluable.
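A time-based sliding window of the kind mentioned above can be sketched as follows. This is a simplified illustration, not Aurora's window operator: the class name `SlidingWindowMean` is hypothetical, the aggregate (a mean) is chosen only for the example, and readings are assumed to arrive in timestamp order, which real streams do not guarantee.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowMean:
    """Mean of the readings whose timestamps fall in (now - width, now]."""

    def __init__(self, width_seconds: float) -> None:
        self.width = width_seconds
        self.items: Deque[Tuple[float, float]] = deque()
        self.total = 0.0

    def add(self, ts: float, value: float) -> float:
        """Add one reading and return the mean over the current window."""
        self.items.append((ts, value))
        self.total += value
        # Evict readings that have slid out of the window.
        while self.items and self.items[0][0] <= ts - self.width:
            _, old = self.items.popleft()
            self.total -= old
        return self.total / len(self.items)
```

One instance per window width (for example 3600, 7200, and 14400 seconds) would cover the 1-, 2-, and 4-hour windows; each new reading is fed to all three.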

Aurora Case Study 5: Medusa • Is a distributed stream-processing system using Aurora. • Takes Aurora queries and distributes them across multiple nodes. • Offers several Benefits: • Incremental scalability over multiple nodes. • High availability by mutual monitoring between nodes. • Composition of stream feeds from different participants. • Handling load spikes by federated system.

Lessons Learnt: Application Characteristics • Common Queries • Historical data using an open window • Last 10 weeks’ worth of toll data for each driver • Aggregate - How much has a driver spent on tolls over the past 10 weeks? • Tables of historical data with arbitrary update patterns • Synchronization • Stream applications rely on shared data and computation. • WaitFor (P: Predicate, T: Timeout) • Unpredictable stream behavior • The financial services application detects the arrival rate of a stream. • The military application adjusts resources during times of stress.
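The slide gives only the signature WaitFor (P: Predicate, T: Timeout). One plausible reading, assumed here, is that a tuple on one input is buffered until a tuple on a second input satisfies P together with it, and is discarded if that does not happen within T. The sketch below follows that assumption; `wait_for`, the `"left"`/`"right"` labels, and the event layout are hypothetical.

```python
from typing import Any, Callable, Iterable, List, Tuple

Event = Tuple[float, str, Any]  # (timestamp, "left" or "right", payload)


def wait_for(events: Iterable[Event],
             pred: Callable[[Any, Any], bool],
             timeout: float) -> List[Any]:
    """Release buffered 'left' payloads once a 'right' payload satisfies pred with them.

    Assumes events are given in timestamp order.
    """
    pending: List[Tuple[float, Any]] = []  # buffered left tuples with arrival times
    released: List[Any] = []
    for ts, side, payload in events:
        # Discard buffered tuples whose timeout has expired.
        pending = [(t0, p) for t0, p in pending if ts - t0 <= timeout]
        if side == "left":
            pending.append((ts, payload))
        else:
            still_waiting = []
            for t0, p in pending:
                if pred(p, payload):
                    released.append(p)
                else:
                    still_waiting.append((t0, p))
            pending = still_waiting
    return released
```

In a usage such as matching on an `id` field with a 10-second timeout, a left tuple whose partner shows up after 2 seconds is released, while one whose partner shows up after 20 seconds has already been dropped.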

Lessons Learnt: Performance Tuning • Requirements • Main memory implementation • Data movement across DFN elements • Scheduling of DFN elements • Performance Decisions • Memory copying – memcpy() implementations • Scheduler • Reduce scheduler overheads by aggressive profiling • Tight loops • Keep unnecessary housekeeping out of tight loops • Data structures • Optimize data structures used to implement DFN elements

Future Plans: Borealis • Dynamic revision of query results • Intelligently corrects query results that have already been emitted, using corrected data that arrive later. • Dynamic query modification • E.g., traders wish to be alerted of interesting events, where the definition of ‘interesting’ varies. • Distributed optimization • Server-heavy and sensor-heavy optimization problems are emerging. • More flexible optimization to handle a very large number of devices • Implementation plans

Summary • Paper’s focus • Identify the requirements of stream applications from experience with the design and implementation of the Aurora stream-processing engine • Ideas • Describe five applications and their implementation in detail. • Reflect on the design of Aurora based on the experience. • Discuss future ideas for the follow-on project. • Contributions • Identify key requirements of streaming applications • Analytical Validation • Case study

Assumptions, Rewrite today • Assumptions • Archiving is not necessary! • Performance is more important than a declarative query language • Rewrite today • Compare performance with the competition, e.g., STREAM • Allow archiving along with stream processing • Consider other applications • RFID, cell phone applications • Include the current status of the Borealis implementation.
