Efficiently correlate live and archived data streams with Pattern Correlation Queries (PCQs) using recency region matching for stock market analysis. Explore modeling, processing, and optimizing strategies for handling high-stream rates and complex patterns.
Nihal Dindar, Peter M. Fischer, Merve Soner, Nesime Tatbul (ETH Zurich, Switzerland): Efficiently Correlating Complex Events over Live and Archived Data Streams
What is a Pattern Correlation Query (PCQ)? • Upon detecting a fall in the current price of stock X on the live stream, look for a tick-shaped pattern for X within the recent archive. [Figure: timeline showing a price fall pattern (live match), its recency region, and tick-shaped patterns (archive matches)]
PCQ = Live Archive • Fall pattern on live stream: • PATTERN(A+) • DEFINE A AS A.Price < PREV(A.Price) • Tick-shaped pattern on archive stream: • PATTERN(A+B+) • DEFINE A AS A.Price < PREV(A.Price) B AS B.Price > PREV(B.Price) AND LAST(B.Price) > FIRST(A.Price) • Correlation Criteria • WHERE symbol_l = symbol_a • RECENCY = 10 minutes
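The two pattern definitions above can be sketched over a plain list of prices. This is a minimal illustration, not the authors' matcher: the function names `fall_matches` and `tick_matches` are hypothetical, and the greedy "maximal run" matching is an assumption, since the slides do not say which of several possible matches is reported.

```python
from typing import List, Tuple

def fall_matches(prices: List[float]) -> List[Tuple[int, int]]:
    """PATTERN(A+) with A.Price < PREV(A.Price).

    Returns (first, last) indices of the tuples bound to A, one entry
    per maximal falling run (assumed matching semantics).
    """
    matches = []
    i, n = 1, len(prices)
    while i < n:
        if prices[i] < prices[i - 1]:
            first = i
            # Extend the run while the price keeps falling.
            while i + 1 < n and prices[i + 1] < prices[i]:
                i += 1
            matches.append((first, i))
        i += 1
    return matches

def tick_matches(prices: List[float]) -> List[Tuple[int, int]]:
    """PATTERN(A+B+): a falling run followed by a rising run whose last
    price exceeds the first price bound to A."""
    matches = []
    for a_first, a_last in fall_matches(prices):
        j = a_last
        # Extend B while the price keeps rising.
        while j + 1 < len(prices) and prices[j + 1] > prices[j]:
            j += 1
        # B must be non-empty and LAST(B.Price) > FIRST(A.Price).
        if j > a_last and prices[j] > prices[a_first]:
            matches.append((a_first, j))
    return matches

# Toy price series (made up for illustration).
prices = [10, 9, 8, 9, 11, 10]
falls = fall_matches(prices)
ticks = tick_matches(prices)
```

On the toy series, the fall pattern matches indices 1 to 2 and index 5, while the tick-shaped pattern matches indices 1 to 4 (fall to 8, then rise to 11, which exceeds the first falling price 9).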
Challenges • A clean, useful, optimizable semantics for PCQs • Needed definitions, e.g., archive of an event, recency • Efficient access to and processing of fast-growing archive data • Optimized processing of high-cost complex pattern matching queries to achieve scalability with potentially high live stream rates
Related Work • Pattern matching systems for live streams • Academic: Cayuga, SASE+, ZStream • Commercial: Coral8, ESPER, Oracle CEP, StreamBase • Systems which combine live and historical data • Moirae, NiagaraST/Latte, TelegraphCQ • Summary: either live pattern matching or combined processing of live and historical data, but not both
Outline • Introduction • Modeling PCQs • Processing PCQs • Optimizing PCQs • Experimental Results • Conclusions and Future Work
Modeling PCQs • An event has a start time and an end time. • Event a happens before (->) event b if a starts before b starts and ends before b ends. • A stream is totally ordered by the start time and then the end time of its events. • An event b has a recency correlation with an event a if a -> b and a’s start time is inside b’s recency region. [Figure: timeline showing a price fall pattern (live match), a recency region of size P, and tick-shaped patterns (archive matches)]
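The definitions on this slide translate into a few short predicates. The sketch below is an illustration only: the `Event` class and function names are hypothetical, and it assumes that b's recency region is the interval of size P ending at b's start time, which the slide's figure suggests but the text does not state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    start: float
    end: float

def happens_before(a: Event, b: Event) -> bool:
    """a -> b: a starts before b starts and ends before b ends."""
    return a.start < b.start and a.end < b.end

def recency_correlated(a: Event, b: Event, p: float) -> bool:
    """b has a recency correlation with a.

    Assumption: b's recency region is [b.start - p, b.start].
    """
    return happens_before(a, b) and b.start - p <= a.start <= b.start

def stream_order(events):
    """Total order of a stream: by start time, then by end time."""
    return sorted(events, key=lambda e: (e.start, e.end))

# Toy events (made up for illustration).
live = Event(start=10, end=12)
recent = Event(start=5, end=7)    # inside the region for p = 10
stale = Event(start=-5, end=0)    # starts before the region begins
ordered = stream_order([Event(3, 9), Event(1, 4), Event(3, 5)])
```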
Baseline PCQ Processing Strategy: The Lazy Approach • Step 1: Look for live matches • Step 2: Calculate the recency region • Step 3: Look for archive matches • Step 4: Apply the join condition and join the live and archive matches [Figure: timeline showing a price fall pattern (live match), its recency region, and tick-shaped patterns (archive matches)]
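The four steps of the lazy approach can be sketched as a single loop. This is a simplified illustration rather than the authors' implementation: `Match`, `lazy_pcq`, and `archive_matcher` are hypothetical names, the recency region is assumed to be the interval of size P ending at the live match's start time, and the archive pattern matcher is stubbed out by a lookup over precomputed toy matches.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple

class Match(NamedTuple):
    symbol: str
    start: int
    end: int

def lazy_pcq(
    live_matches: Iterable[Match],
    archive_matcher: Callable[[int, int], List[Match]],
    p: int,
) -> List[Tuple[Match, Match]]:
    results = []
    for live in live_matches:                  # Step 1: live matches
        lo, hi = live.start - p, live.start    # Step 2: recency region (assumed bounds)
        candidates = archive_matcher(lo, hi)   # Step 3: archive matches, computed on demand
        for arch in candidates:                # Step 4: join condition
            same_symbol = arch.symbol == live.symbol
            before = arch.start < live.start and arch.end < live.end
            if same_symbol and before:
                results.append((live, arch))
    return results

# Toy stand-in for archive pattern matching (made-up data).
ARCHIVE = [Match("XOM", 2, 4), Match("XOM", 20, 25), Match("IBM", 3, 5)]

def archive_matcher(lo: int, hi: int) -> List[Match]:
    return [m for m in ARCHIVE if lo <= m.start <= hi]

pairs = lazy_pcq([Match("XOM", 10, 12)], archive_matcher, p=10)
```

Archive matching runs only after a live match is found, which is what makes the strategy lazy. It also means that overlapping recency regions are matched again from scratch.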
Optimizing PCQs - Recent Input Buffer • is an in-memory data structure that mediates between live and archived event stores • caches the most recent stream tuples for efficient access • provides bulk inserts into the stream archive
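A minimal sketch of such a buffer follows, assuming a bounded in-memory queue that moves its oldest tuples to the archive in batches. The class name, the `capacity` and `batch_size` parameters, and the in-memory list standing in for the on-disk archive are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List, Tuple

Tuple2 = Tuple[int, float]  # (timestamp, price)

class RecentInputBuffer:
    def __init__(self, capacity: int, batch_size: int) -> None:
        self.capacity = capacity
        self.batch_size = batch_size
        self.recent: Deque[Tuple2] = deque()  # most recent tuples, in memory
        self.archive: List[Tuple2] = []       # stand-in for the stream archive
        self.bulk_inserts = 0

    def append(self, ts: int, price: float) -> None:
        self.recent.append((ts, price))
        if len(self.recent) > self.capacity:
            # Move the oldest tuples to the archive in one bulk insert.
            batch = [self.recent.popleft() for _ in range(self.batch_size)]
            self.archive.extend(batch)
            self.bulk_inserts += 1

    def scan(self, lo: int, hi: int) -> List[Tuple2]:
        """Tuples with lo <= timestamp <= hi, from archive and buffer."""
        older = [t for t in self.archive if lo <= t[0] <= hi]
        newer = [t for t in self.recent if lo <= t[0] <= hi]
        return older + newer

# Toy usage: seven tuples, capacity 4, batches of 2.
buf = RecentInputBuffer(capacity=4, batch_size=2)
for ts in range(7):
    buf.append(ts, 100.0 + ts)
window = buf.scan(3, 5)
```

In the toy run, the seven appends trigger two bulk inserts instead of seven single-tuple writes, and the scan over timestamps 3 to 5 is answered partly from the archive and partly from memory.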
Optimizing PCQs - Query Result Caching • caches archive matches in order to avoid recomputing them for overlapping regions • archive matches are retrieved from the Query Result Cache [Figure: live stream matches 1 to 3, archive stream matches 1 to 5, recency region P, and the Query Result Cache]
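The caching idea can be sketched as follows. This is one possible design, not the one from the talk: it assumes integer timestamps, recency regions whose bounds never move backward, and archive matches that are assigned to a region by their start time. `QueryResultCache` and the `matcher` callable are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

ArchiveMatch = Tuple[int, int]  # (start, end)

class QueryResultCache:
    def __init__(self, matcher: Callable[[int, int], List[ArchiveMatch]]) -> None:
        # matcher(lo, hi) returns archive matches with lo <= start <= hi.
        self.matcher = matcher
        self.matches: List[ArchiveMatch] = []
        self.covered_hi: Optional[int] = None

    def get(self, lo: int, hi: int) -> List[ArchiveMatch]:
        if self.covered_hi is None or lo > self.covered_hi:
            # No overlap with the cached region: compute everything.
            self.matches = self.matcher(lo, hi)
        else:
            # Drop matches that fell out of the region ...
            self.matches = [m for m in self.matches if m[0] >= lo]
            # ... and compute only the part not covered yet.
            if hi > self.covered_hi:
                self.matches += self.matcher(self.covered_hi + 1, hi)
        self.covered_hi = hi  # valid under the forward-moving assumption
        return list(self.matches)

# Toy archive matches and a matcher that records what was recomputed.
ARCHIVE_MATCHES = [(1, 2), (4, 6), (8, 9), (12, 13)]
calls = []

def matcher(lo: int, hi: int) -> List[ArchiveMatch]:
    calls.append((lo, hi))
    return [m for m in ARCHIVE_MATCHES if lo <= m[0] <= hi]

cache = QueryResultCache(matcher)
first = cache.get(0, 5)
second = cache.get(3, 9)
```

For the second, overlapping region only the range 6 to 9 is matched again. The match starting at 4 is served from the cache.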
Optimizing PCQs - Join Source Ordering • Selectivity criterion: process the more selective pattern first • Processing cost criterion: avoid the processing of hot spots [Figure: recency regions on the live and archive streams for archive-first versus live-first processing]
Experimental Results • Data: stock-market data from NYSE, January 26 to 31, 2006 • Query: (live pattern: fall), (archive pattern: tick-shaped) • Stock: Exxon Mobil (XOM), P covers several hours [Figure: results chart, including the baseline]
Summary of Experimental Results • PCQs are expensive • Optimization pays off • Our optimizations provide a big improvement over the baseline
Conclusions • We have investigated the problem of efficiently correlating complex events over live and archived data streams, providing: • an optimizable semantics for Pattern Correlation Queries • recent input buffering to deal with the different access speeds of live and archive data • query result caching and join source ordering to reduce the quadratic complexity of PCQ processing for scaling with high stream rates
Future Work • Optimizations for response time • Indexes on the result cache • Introduction of other correlation criteria, such as context similarity and temporal periodicity