StreamBase Systems. Stream Processing Overview. Dr. Stan Zdonik, Co-Founder March 14, 2006. Agenda. Problem Space and Landscape Case Scenarios Technical Approaches to CEP What is required of a Stream Processing Engine Emphasis on StreamSQL Future Directions for the Community. Investors.
StreamBase Systems Stream Processing Overview Dr. Stan Zdonik, Co-Founder March 14, 2006
Agenda • Problem Space and Landscape • Case Scenarios • Technical Approaches to CEP • What is required of a Stream Processing Engine • Emphasis on StreamSQL • Future Directions for the Community
StreamBase at a Glance
• Founded in 2003 by Dr. Mike Stonebraker (Ingres, Illustra)
• Initial research prototype at MIT, Brown, & Brandeis (2001)
• Boston-based company, with offices in NY, Washington, DC, & Europe
• Financial backing by tier-one venture capital firms
• Solid, growing customer base
• Do for real-time data what relational database and SQL do for stored data
Use Case: Running VWAP • Scenario: • Every minute for every stock I am trading: • Calculate VWAP (vol. weighted avg. price) for my trades & all trades • Alert whenever my personal trading execution is inferior to market • Solution: • 5 StreamBase operators, 30 min to build
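The per-minute comparison in this scenario can be sketched in plain Python. This is only an illustration of the arithmetic, not the five-operator StreamBase application the slide refers to. The tuple layout `(symbol, ts_seconds, price, volume)`, the function names, and the reading of "inferior" as "a buyer paid more than the market VWAP" are all assumptions made for the example.

```python
from collections import defaultdict


def vwap_by_minute(trades):
    """Return {(symbol, minute): VWAP} for (symbol, ts_seconds, price, volume) tuples."""
    notional = defaultdict(float)  # sum of price * volume per (symbol, minute)
    volume = defaultdict(float)    # sum of volume per (symbol, minute)
    for symbol, ts, price, vol in trades:
        key = (symbol, int(ts // 60))  # one-minute tumbling window
        notional[key] += price * vol
        volume[key] += vol
    return {k: notional[k] / volume[k] for k in notional if volume[k] > 0}


def inferior_executions(my_trades, all_trades, side="buy"):
    """List (symbol, minute, my_vwap, market_vwap) where my execution was worse.

    Assumption: for a buyer, worse means paying above market VWAP;
    for a seller, worse means receiving below market VWAP.
    """
    mine = vwap_by_minute(my_trades)
    market = vwap_by_minute(all_trades)
    alerts = []
    for key in sorted(mine):
        if key not in market:
            continue
        if side == "buy":
            worse = mine[key] > market[key]
        else:
            worse = mine[key] < market[key]
        if worse:
            alerts.append((key[0], key[1], mine[key], market[key]))
    return alerts
```

A streaming engine would emit each window's result as the minute closes; the batch form here just makes the per-window calculation easy to inspect.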
Use Case: Intrusion Detection
• Client Scenario
  • Need to identify unusual patterns in IP connections
• Solution
  • Implement sophisticated filtering & monitoring to drive real-time alerting
  • Immediate termination of suspicious user access
• Delivery
  • Process, analyze, & act on 50k msgs/sec
[Figure: Example of IP intrusion detection with StreamBase. Operators shown: (Group by IP prefix; Sum), Filter sum > T1; (Group by source IP; Count), Filter count > T1; (Group by source IP; count distinct protocol), Filter count > T2; Join on Source IP. Sample values: 18.31.0.*, 18.31.0.89; http, dns, ssh.]
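One plausible reading of the diagram is that a source IP is flagged when it both opens more than T1 connections and uses more than T2 distinct protocols, with the join on source IP combining the two branches. The sketch below follows that reading; the `(src_ip, protocol)` tuple layout and the function name are assumptions, and the IP-prefix branch of the diagram is left out for brevity.

```python
from collections import Counter, defaultdict


def suspicious_sources(connections, t1, t2):
    """Return source IPs with more than t1 connections AND more than t2 distinct protocols.

    connections: iterable of (src_ip, protocol) tuples (hypothetical layout).
    """
    counts = Counter()            # group by source IP; count
    protocols = defaultdict(set)  # group by source IP; distinct protocols
    for src_ip, protocol in connections:
        counts[src_ip] += 1
        protocols[src_ip].add(protocol)
    busy = {ip for ip, n in counts.items() if n > t1}
    varied = {ip for ip, seen in protocols.items() if len(seen) > t2}
    return sorted(busy & varied)  # the "join" on source IP
```

In a live deployment both aggregates would be computed over a time window rather than over the whole input.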
Use Case: Battalion Monitoring
• Client Scenario
  • Government contractor required filtering of data and reports from reconnaissance aircraft of friendly and enemy activity
  • Determine positions of friendly vs enemy troops, tanks, aircraft in real-time
• Solution
  • Critical alerting established to pinpoint any/every enemy movement
[Figure: Example of combat military monitoring of friendly and enemy forces in real-time with StreamBase. Diagram labels: (unit#, x, y); Lookup (unit#); (unit#, x, y, enemy?); (x, y) across line and enemy; Count, Window = 1 min; Count > 3.]
CEP/Stream Processing Marketplace
• The high end
  • ~100K messages/second
  • ~1 msec latency
• Anything will work at the low end
  • 1 message/day: Use pencil & paper
  • 1 message/hour: Use spreadsheet
  • 1 message/minute: Use favorite app server, RDBMS and/or enterprise middleware
[Figure: Processing Complexity (simple events to complex events) plotted against Processing Speed (human speed, seconds to minutes, to machine speed, msec). Regions: Conventional Architectures; Stream Processing Engines (StreamSQL).]
Technical Approaches to CEP • Custom code • Almost everybody does this today • Nobody wants to continue to do this going forward • Replacing this with commercial off-the-shelf (COTS) infrastructure will fuel an explosion in exploitation of increasingly ubiquitous real-time data • Your favorite rule engine • StreamSQL stream processing engine
Required Characteristics for Complex Event Processing Engines
Rule 1: Keep the Data Moving
To achieve low-latency, perform data processing without first storing and retrieving the data
• Low latency
• No waiting
• Results delivered in-flight
[Figure: In-stream Processing compared with Traditional Data Processing. Labels: Event Data, StreamBase Application, Alerts, Actions, Memory, Disk, Updates, Queries.]
Rule 2: Query Paradigm (StreamSQL) What is StreamSQL? • StreamSQL extends conventional SQL with time windows for key functions (e.g. joining, querying, aggregating data) • Streams do not have “end of table” • Optimal approach for unifying processing of real-time and stored data • SQL is a good paradigm • For analytics • And filtering • “Gold standard” for stored data Use querying mechanism to find output events of interest or compute analytics on real-time and historical data
StreamSQL Programming Paradigm
• Time window-based computations, statistics
• Extensibility
  • User-defined functions and aggregates
  • Custom Java or C++ operators
• Modules for reusability
• Stores state
[Figure: stream of tuples with columns Arrival time and Data Value. Arrival times shown: 3:01.00, 3:01.10, 3:01.20, 3:01.30, 3:02.00, 3:02.40, 3:03.55, 3:04.10, 3:04.88, 3:05.75, 3:06.28, 3:07.00, 3:08.50, 3:09.50.]
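A time-window computation of the kind listed above can be illustrated with a sliding-window average. This is a minimal Python sketch, not StreamSQL syntax; the class name and the choice of a time-based sliding window are assumptions, and it assumes tuples arrive in timestamp order.

```python
from collections import deque


class SlidingWindowAverage:
    """Average of the values seen in the last `width_seconds` seconds.

    Assumes timestamps are non-decreasing (no out-of-order arrivals).
    """

    def __init__(self, width_seconds):
        self.width = width_seconds
        self.items = deque()  # (timestamp, value) pairs currently in the window
        self.total = 0.0

    def add(self, ts, value):
        """Add one tuple and return the current window average."""
        self.items.append((ts, value))
        self.total += value
        # Evict tuples that have fallen out of the window (ts - width, ts].
        while self.items[0][0] <= ts - self.width:
            _, old = self.items.popleft()
            self.total -= old
        return self.total / len(self.items)
```

Because a stream has no "end of table", the window is what bounds the aggregate: each arriving tuple produces an answer over a finite, recent slice of the stream.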
Integrating Real Time and Stored State…… Produce the split-adjusted price of every security in a feed over several days (stock can split more than once) Two feeds: Tick (symbol, price, volume, date, time) Splits (symbol, date, time, split_factor)
StreamSQL solution for Real-Time and Stored Data
Stored table: Store (symbol, factor)
Feeds: Tick and Split
_________________________________________
    UPDATE Store (SET factor = factor * S.split_factor)
    FROM Split S
    WHERE symbol = S.symbol
  (callout: Mixing Stream and Table)
    SELECT T.symbol, price = T.price * S.factor, T.volume, T.date, T.time
    FROM Tick T, Store S
    WHERE S.symbol = T.symbol
  (callout: Mixing Stream and Table)
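The same logic can be traced in a few lines of Python, with a dictionary standing in for the Store table. This is a sketch for following the data flow, not a rendering of how StreamBase executes the two statements. The merged, time-ordered event list, the event tuple shapes, and the default factor of 1.0 for symbols that have not split are assumptions.

```python
def split_adjusted(events):
    """Apply split factors to ticks, mirroring the UPDATE and SELECT statements.

    events: time-ordered iterable of
        ("split", symbol, split_factor)  or  ("tick", symbol, price, volume)
    Returns a list of (symbol, adjusted_price, volume).
    """
    store = {}  # stands in for Store(symbol, factor)
    out = []
    for event in events:
        if event[0] == "split":
            _, symbol, split_factor = event
            # UPDATE Store SET factor = factor * S.split_factor
            store[symbol] = store.get(symbol, 1.0) * split_factor
        else:
            _, symbol, price, volume = event
            # SELECT T.symbol, T.price * S.factor, T.volume
            out.append((symbol, price * store.get(symbol, 1.0), volume))
    return out
```

For example, a stock that trades at 80, splits 2-for-1, trades at 40, splits again, and trades at 20 yields an adjusted price of 80 for all three ticks, which is why the factor has to accumulate across splits.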
StreamSQL Solution
….or a four box application in the StreamBase GUI
Some programmers prefer textual notation; some prefer GUI. Take your pick.
[Figure: four-box application. Tick (symbol, price, volume, date, time), (read) T.price * S.factor; Store (Symbol, Factor); Splits (symbol, date, time, split_factor), (write) factor * S.split_factor.]
Characteristics of Example • Storage of (perhaps lots of) state • Decision making based on a mix of stored state and real time computation StreamSQL has a single programming paradigm for both kinds of data. Not necessarily true for other technical approaches.
What About Pattern-Matching?
• Example: Find IBM ticks over 80 followed by at least two ticks under 80.
    CREATE STREAM TickTriples AS
      SELECT T1.symbol AS symbol, T1.price AS price1, T2.price AS price2, T3.price AS price3
      FROM Ticks T1 -> Ticks T2 -> Ticks T3
      WHERE T1.symbol = T2.symbol AND T2.symbol = T3.symbol;
    SELECT * FROM TickTriples
      WHERE price1 > 80 AND price2 < 80 AND price3 < 80 AND symbol = "IBM";
Regular expression (pattern matching) is the same in any technology!!!
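The pattern can also be written out by hand, which shows what the engine has to track. The sketch below assumes that `->` means "the next tick of the same symbol", so each match is three consecutive ticks for one symbol; that interpretation, the `(symbol, price)` tuple layout, and the function name are assumptions rather than documented StreamSQL semantics.

```python
from collections import defaultdict, deque


def tick_triples(ticks, symbol="IBM", threshold=80.0):
    """Find one tick above `threshold` followed by two consecutive ticks below it.

    ticks: iterable of (symbol, price) in arrival order.
    Returns a list of (symbol, price1, price2, price3).
    """
    # Keep only the last three prices per symbol.
    recent = defaultdict(lambda: deque(maxlen=3))
    matches = []
    for sym, price in ticks:
        window = recent[sym]
        window.append(price)
        if sym == symbol and len(window) == 3:
            p1, p2, p3 = window
            if p1 > threshold and p2 < threshold and p3 < threshold:
                matches.append((sym, p1, p2, p3))
    return matches
```

The per-symbol buffer of three prices is the state a hand-coded solution has to manage explicitly; the declarative query leaves that bookkeeping to the engine.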
Performance – StreamSQL • Internal query plan (think of it as our graphical workflow notation) • For any event, we know exactly what processing happens next • As a result, we can optimize the plan
StreamSQL Advantages • Superior performance • Easy programmability (and maintainability) • One notation for real-time and stored data • Includes regular expression evaluation • Closer to basis for standardization • FROM clause can mix stored tables and streams • Add time windows to SQL • Add stream disorder to SQL
Rule 3: Handle Delayed, Missing, & Out-of-Order Data • Ability to time-out individual calculations or computations • Ability to merge streams and plug gaps from one with valid value in another • Bounded sort operation (BSORT) • Outer-join Make provision for handling data which is late or delayed, missing, or out-of-sequence
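The idea behind a bounded sort can be sketched with a small heap: hold back a fixed number of tuples, and release the earliest one each time the buffer overflows. This is a generic illustration of the technique, not the implementation of StreamBase's BSORT operator; the function name, the `capacity` parameter, and the `(timestamp, payload)` layout are assumptions.

```python
import heapq


def bounded_sort(events, capacity):
    """Yield (timestamp, payload) events in timestamp order using a bounded buffer.

    Ordering is only guaranteed if no event is displaced by more than
    `capacity` positions; later stragglers are emitted out of order.
    """
    heap = []
    for seq, (ts, payload) in enumerate(events):
        # seq breaks timestamp ties so payloads are never compared.
        heapq.heappush(heap, (ts, seq, payload))
        if len(heap) > capacity:
            ts_out, _, payload_out = heapq.heappop(heap)
            yield ts_out, payload_out
    while heap:  # flush at end of input
        ts_out, _, payload_out = heapq.heappop(heap)
        yield ts_out, payload_out
```

The buffer size is the trade-off: a larger buffer tolerates more disorder but delays every tuple longer, which works against the low-latency goal of Rule 1.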
Rule 4: Generate Predictable Outcomes • Two distinct runs of the system with the same input should yield the same output (deterministic). • Ensure calculations performed on one time-series record do not interfere with calcs done on another Process time-series records (tuples) in a consistent manner
Rule 5: Process Streaming or Stored Data
Store and access current or historical state information, preferably using a familiar standard such as SQL
• Interfaces:
  • Embedded in-process DB for low latency, low overhead
  • Standards such as ODBC, JDBC to external databases
• Ability to test trading algorithms on historical data, then switch seamlessly to live feed
[Figure labels: Real-time Feeds; Alerts; Actions; Embedded local storage; Remote process; Data store.]
Rule 6: Guarantee Data Safety & Availability
If a failure occurs (hardware, operating system, software), the streaming application must fail over to a back-up and keep running
• Restarting and recovering from a log for real-time processing is not practical.
• Better idea: A tandem-style approach for streaming data
[Figure labels: Primary; Secondary; Market Data; Checkpoint; Alerts; Actions.]
Rule 7: Partition & Scale Automatically • Easily split application without custom-coding • Multi-threading: • To utilize multi-CPU (Multi-core) hardware • Avoid blocking for external events and maintain low latency Split an application over multiple processors or machines for scalability, without developer having to write low-level code
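The kind of splitting an engine would do on the developer's behalf can be sketched as key-based hash partitioning: every tuple with the same key goes to the same worker, so per-key state never has to be shared. This is a generic sketch, not a description of how StreamBase partitions applications; the function names and the assumption that the first field of each event is the partitioning key are hypothetical.

```python
import zlib


def partition_for(key, n_partitions):
    """Map a string key to a partition index with a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % n_partitions


def partition_stream(events, n_partitions):
    """Split events into n_partitions lists, keyed on the first field of each event.

    Arrival order is preserved within each partition.
    """
    partitions = [[] for _ in range(n_partitions)]
    for event in events:
        partitions[partition_for(event[0], n_partitions)].append(event)
    return partitions
```

A stable hash such as CRC32 is used instead of Python's built-in `hash`, which is randomized per process for strings and would route the same key differently on different machines.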
Rule 8: Process & Respond Instantaneously • Ensure high availability, stored/real-time processing, handling stream imperfections all work concurrently with low latency • Test rigorously—simulated and live feeds • Monitor latency and processing speed in messages/second Run all 7 rules in-process at tens to hundreds of thousands of messages/second with low latency
Stream Processing Engine Architecture
The StreamBase Server
Infrastructure Capabilities:
• 10k-500K+ msgs/sec
• High availability
• 64 bit addressing
• Supports clusters & blade configurations via application & data partitioning
Functional Capabilities:
• Implements StreamSQL
• Multi-threaded with real-time scheduling
• Multiple options for managing stored data
• Insertion of custom logic & analytics to the data stream
• Adapters to external data sources & messaging systems
[Figure: Client Applications and two server stacks, each labeled Input Stream, Output Stream, StreamBase Application, Messaging/Transport System, StreamBase Server, Operating System, Hardware.]
Integrated Development Environment
Integrated environment for building, testing, deploying
• Eclipse-based IDE
• Drag-and-connect with workflow orientation
• Built-in load simulation for easy testing
• Stream Record/Playback
• Custom C++ or Java operators
• Debugger & performance monitor
Future Directions for the Community • Standard vocabulary and vernacular: • E.g. “events,” “CEP,” “stream processing,” “pattern-matching” • Education and visibility around category: • Analyst reports • Broader market education • Technical standards: • Benchmarks: Performance, scalability • Languages: StreamSQL or extended SQL • Research: • Approximation • Distributed processing • Self-adaptive • Sensor applications • Scientific applications
London Office: 107-111 Fleet Street, London EC4A 2AB, United Kingdom; +44 (0)20 7936 9050
Corporate Headquarters: 181 Spring Street, Lexington, Massachusetts 02421; +1 866 STRMBAS, +1 866 787 6227, +1 781 761 0800
Reston, Virginia Office: 11921 Freedom Drive, Suite 550, Reston, VA 20190; +1 703 608 6958
New York City Office: 220 West 42nd Street, 20th Floor, New York, New York 10036; +1 866 STRMBAS, +1 866 787 6227
Enterprise-class stream processing software designed to transform real-time complex events into actionable intelligence
Thank You