
Continuously Adaptive Continuous Queries (CACQ) over Streams




Presentation Transcript


  1. Continuously Adaptive Continuous Queries (CACQ) over Streams by Samuel Madden, Mehul Shah, Joseph Hellerstein, and Vijayshankar Raman SIGMOD’2002

  2. CACQ Introduction • Proposed continuous query (CQ) systems are based on static plans • But, CQs are long running • Initially valid assumptions less so over time • Static optimizers at their worst! • CACQ insight: apply continuous adaptivity of eddies to continuous queries • Avoid static optimization via dynamic operator ordering • Process multiple queries simultaneously • Explore sharing of work & storage

  3. Motivating Applications • “Monitoring” queries look for recent events in streams • Sensor data processing • Stock analysis • Router, web, or phone events • In CACQ, queries over ‘recent-history’ • Only tuples currently entering the system • Stored in in-memory data tables for time-windowed joins between streams

  4. Continuous Queries • Long running, “standing queries”, similar to trigger systems • Installed; continuously produce streamed results until removed • Lots of queries, over the same data sources • Opportunity for work sharing! • Global query optimization problem: hard! • Idea: adaptive heuristics not quite as hard? • Bad decisions are not final • Future work: finding an optimal plan (adaptively)

  5. Joins in CACQ • CACQ uses parallel pipelined joins • To avoid blocking • Example: Symmetric (Windowed) Hash Join • [Diagram: streams R and S; each arrival is built into a hash table on R.a or S.b and probes the other side's table]
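The symmetric hash join named on this slide can be sketched as follows. This is a minimal Python sketch (not the paper's Java implementation); window eviction is omitted and all names are hypothetical:

```python
from collections import defaultdict

def symmetric_hash_join(arrivals, key_r, key_s):
    """Non-blocking pipelined join: each arriving tuple is first built
    into its own side's hash table, then probes the other side's table."""
    r_table = defaultdict(list)  # hash table on R's join attribute
    s_table = defaultdict(list)  # hash table on S's join attribute
    for source, tup in arrivals:
        if source == 'R':
            k = key_r(tup)
            r_table[k].append(tup)            # build into R's table
            for match in s_table.get(k, []):  # probe S's table
                yield (tup, match)
        else:
            k = key_s(tup)
            s_table[k].append(tup)            # build into S's table
            for match in r_table.get(k, []):  # probe R's table
                yield (match, tup)

# Interleaved arrivals from streams R and S, joining on R.a = S.b:
arrivals = [('R', {'a': 1}), ('S', {'b': 1}), ('S', {'b': 2}), ('R', {'a': 2})]
joined = list(symmetric_hash_join(arrivals, lambda r: r['a'], lambda s: s['b']))
```

Because every arrival both builds and probes, results stream out as soon as a match exists, which is what makes the join non-blocking.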

  6. CACQ Query Model • SELECT R.a, S.b FROM R, S WHERE R.c = x AND R.d[n] = S.e[m] • [Diagram: windowed join R ⋈ S with probes into S; n and m denote the window sizes on R.d and S.e]

  7. CACQ Main Points • Adaptivity via Eddies and Routing policies • Tuple Lineage for flexible sharing of operators between queries • Grouped Filter for efficiently computing selections over multiple queries • State Modules (SteMs) for enabling state sharing among joins

  8. Step by Step, Using an Example • First, just one query with only selections • Then, add multiple queries • Then, add joins to the picture

  9. Eddies & Adaptivity • Eddies (Avnur & Hellerstein, SIGMOD 2000): Continuous Adaptivity • No static ordering of operators (“no query plan”) • Routing policy dynamically determines in which order individual tuples go through operators • Encoding of state (lineage) with each tuple : • Use ready bits to track what to do next • Use done bits to track what has been done

  10. Eddies: Single Query, Single Source • SELECT * FROM R WHERE R.a > 10 AND R.b < 15 • Ready bits track what to do next • All 1’s in single source • Done bits track what has been done • Tuple can be output when all bits set • [Diagram: eddy routing tuples R1 and R2 through σ(R.a > 10) and σ(R.b < 15), with per-tuple ready and done bit vectors]
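The done-bit machinery on this slide can be illustrated with a small sketch (Python, assuming a single source so every ready bit starts at 1; the fixed left-to-right choice stands in for a real routing policy, and all names are hypothetical):

```python
def eddy(tuples, operators):
    """Route each tuple through the operators; done bits record which
    have been applied. Output when all done bits are set."""
    all_done = (1 << len(operators)) - 1
    for tup in tuples:
        done = 0        # bitmap of operators already applied
        alive = True
        while alive and done != all_done:
            # routing policy stub: lowest-numbered operator not yet done
            op = next(i for i in range(len(operators)) if not done & (1 << i))
            alive = operators[op](tup)  # predicate False => drop tuple
            done |= 1 << op
        if alive:
            yield tup                   # all done bits set: output

# SELECT * FROM R WHERE R.a > 10 AND R.b < 15
ops = [lambda r: r['a'] > 10, lambda r: r['b'] < 15]
survivors = list(eddy([{'a': 12, 'b': 9}, {'a': 5, 'b': 9}], ops))
```

Because ordering state travels with each tuple, the router is free to pick a different operator order for every tuple without losing correctness.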

  11. Multiple Queries • Q1: SELECT * FROM R WHERE R.a > 10 AND R.b < 15 • Q2: SELECT * FROM R WHERE R.a > 20 AND R.b = 25 • Q3: SELECT * FROM R WHERE R.a = 0 AND R.b <> 50 • [Diagram: grouped filters over R.a (R.a > 10, R.a > 20, R.a = 0) and over R.b (R.b < 15, R.b = 25, R.b <> 50), with per-tuple done and queriesCompleted bitmaps]

  12. Multiple Queries (continued) • Same queries Q1–Q3 as before • Reorder operators! • [Diagram: tuple R2 routed through the R.b grouped filter before the R.a grouped filter; the done and queriesCompleted bitmaps are updated correctly regardless of the order chosen]

  13. Tuple & Query Data Structures • Per-tuple bitmaps: • queriesCompleted: what queries has this tuple been output to or rejected by? • done: what operators have been applied to this tuple? • ready: what operators can be applied to this tuple? • Per-query bitmap: • completionMask: what operators must be applied to output a tuple to this query? • [Diagram: a tuple carrying its value and bitmaps, e.g. [10, 1100, …], and a query with its completionMask, e.g. [0110]]

  14. Outputting Tuples • Q1: SELECT * FROM R WHERE R.a > 10 AND R.b < 15 • Q2: SELECT * FROM R WHERE R.b < 15 AND R.c <> 5 AND R.d = 10 • Store a completionMask bitmap for each query • One bit per operator, set if the operator is in the query (over operators a, b, c, d: Q1 = 1100, Q2 = 0111) • Every time a tuple returns from an operator, to determine if tuple t can be output to query q: • Eddy ANDs q’s completionMask with t’s done bits • Output only if q’s bit is not set in t’s queriesCompleted bits • e.g., for Q1: completionMask & done == 1100 && queriesCompleted[0] == 0
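The output check on this slide amounts to two bitwise tests per (tuple, query) pair. A sketch in Python (hypothetical names; operators a, b, c, d are mapped to bits 3 down to 0 as on the slide):

```python
def can_output(done, queries_completed, completion_masks, q):
    """Output a tuple to query q iff q's completionMask is covered by the
    tuple's done bits and q's bit in queriesCompleted is still unset."""
    mask = completion_masks[q]
    return (done & mask) == mask and not (queries_completed >> q) & 1

# Q1 uses operators a, b (bits 1100); Q2 uses b, c, d (bits 0111).
masks = {0: 0b1100, 1: 0b0111}
```

The queriesCompleted test prevents emitting a tuple to the same query twice when it keeps circulating for the sake of other queries.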

  15. Grouped Filter • Use binary trees to efficiently index range predicates • Two trees (LT & GT) per attribute • Insert each predicate’s constant into the tree • When a tuple arrives, scan to the right (for GT) or left (for LT) of the tuple’s attribute value in the tree • Those are the queries that the tuple does not pass • Hash tables index equality and inequality predicates • [Diagram: greater-than tree over S.a holding the predicates S.a > 1, S.a > 7, and S.a > 11, one per query Q1–Q3]
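The greater-than side of the grouped filter can be approximated with a sorted list of constants: predicates whose constants lie to the left of the tuple's value pass, those to the right fail. A Python sketch (a sorted list stands in for the slide's balanced tree; names are hypothetical):

```python
import bisect

class GroupedGTFilter:
    """Grouped filter for predicates of the form 'attr > c',
    one (constant, query id) entry per query."""
    def __init__(self):
        self.consts = []   # predicate constants, kept sorted
        self.queries = []  # query id for each constant, kept aligned

    def insert(self, const, qid):
        i = bisect.bisect_left(self.consts, const)
        self.consts.insert(i, const)
        self.queries.insert(i, qid)

    def passing(self, value):
        """Queries whose predicate the value satisfies: exactly those
        with c < value, i.e. everything to the left of value."""
        i = bisect.bisect_left(self.consts, value)
        return set(self.queries[:i])

f = GroupedGTFilter()
f.insert(1, 'Q1'); f.insert(7, 'Q2'); f.insert(11, 'Q3')
```

One lookup evaluates all registered range predicates on the attribute at once, which is the point of grouping them.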

  16. Work Sharing via Tuple Lineage Q1: SELECT * FROM s WHERE A, B, C Q2: SELECT * FROM s WHERE A, B, D

  17. Work Sharing via Tuple Lineage (continued) • Q1: SELECT * FROM s WHERE A, B, C • Q2: SELECT * FROM s WHERE A, B, D • Conventional shared subexpressions: AB must be applied first, and tuples in the intersection of C and D go through AB an extra time! • CACQ: lineage (queriesCompleted) enables any ordering! • [Diagram: conventional plans for Query 1 and Query 2 sharing the subexpression AB over data stream S, versus a CACQ eddy routing S’s tuples through A, B, C, D in any order, carrying per-tuple queriesCompleted (QC) bits]

  18. Tradeoff: Overhead vs. Shared Work • Overhead in additional bits per tuple • One bit per query per tuple has a significant effect • Trading accounting overhead for work sharing • 100 bits / tuple allows a tuple to be processed once, not 100 times • Reduce overhead by not keeping state about operators a tuple will never pass through

  19. Joins in CACQ • Use symmetric hash join to avoid blocking • Use State Modules (SteMs) to share storage between joins with a common base relation • [Diagram: symmetric hash join over streams R and S, building and probing hash tables on R.a and S.b]

  20. Processing Joins via State Modules • Idea: share join indices over base relations • State Modules (SteMs) are: • Unary indexes (e.g., hash tables, trees) • Built on the fly (as data arrives) • Scheduled by CACQ as first-class operators • Based on symmetric hash join • [Diagram: Query 1 (S.b = T.c) and Query 2 (R.a = S.b) sharing SteMs built on R.a, S.b, and T.c]
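A SteM as described here is essentially a shared unary index. A minimal Python sketch (hypothetical names; real SteMs also handle window eviction and are scheduled by the eddy like any other operator):

```python
from collections import defaultdict

class SteM:
    """State Module: a unary hash index over one attribute of one base
    relation, built on the fly as tuples arrive."""
    def __init__(self, attr):
        self.attr = attr
        self.index = defaultdict(list)

    def build(self, tup):
        """Insert an arriving tuple under its attribute value."""
        self.index[tup[self.attr]].append(tup)

    def probe(self, key):
        """Return all stored tuples matching the join key."""
        return self.index.get(key, [])

# Both joins R.a = S.b and S.b = T.c probe this one shared index on S.b,
# instead of each join keeping a private hash table over S.
stem_sb = SteM('b')
stem_sb.build({'b': 1, 'val': 'first S tuple'})
```

Sharing the index means S's state is stored once, however many queries join against it.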

  21. Routing Policies • Machinery so far assures correctness • Routing policy responsible for performance • Consult the policy to determine where to route every tuple that: • Enters the system • Returns from an operator

  22. Eddies with Lottery Scheduling • Operator gets 1 ticket when it takes a tuple • Favor operators that run fast (low cost) • Operator loses a ticket when it returns a tuple • Favor operators that drop tuples (low selectivity) • Winner? • Large number of tickets == measure of goodness • Lottery Scheduling: • When two operators vie for the same tuple, hold a lottery • Never let any operator go to zero tickets • Support occasional random “exploration”
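The ticket scheme above can be sketched as follows (Python, hypothetical names; the ticket floor implements "never let any operator go to zero tickets", which preserves occasional exploration):

```python
import random

class LotteryRouter:
    """Lottery routing policy sketch: an operator gains a ticket when it
    takes a tuple and loses one when it returns a tuple, so low-cost,
    low-selectivity operators accumulate tickets and win more lotteries."""
    def __init__(self, num_ops, floor=1):
        self.floor = floor                 # never let tickets hit zero
        self.tickets = [floor] * num_ops

    def on_take(self, op):
        self.tickets[op] += 1              # fast operators take more tuples

    def on_return(self, op):
        # returning a tuple means it was not dropped: penalize
        self.tickets[op] = max(self.floor, self.tickets[op] - 1)

    def choose(self, eligible):
        """Hold a lottery among the eligible operator indices."""
        return random.choices(eligible,
                              weights=[self.tickets[i] for i in eligible])[0]

router = LotteryRouter(num_ops=2)
router.on_take(0); router.on_take(0); router.on_return(0)
```

Net tickets approximate (tuples taken minus tuples returned), i.e. how many tuples an operator drops, which is why the count doubles as a goodness measure.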

  23. Routing Policies For CACQ • Give more tickets to operators shared by multiple queries (e.g., grouped filters) • When a shared operator outputs a tuple, charge it multiple tickets • Intuition: cardinality reducing shared operators reduce global work more than unshared operators • Not optimizing for the throughput of a single query!

  24. Evaluation • Java implementation on top of Telegraph QP • 4,000 new lines of code in 75,000 line codebase • Server Platform • Linux 2.4.10 • Pentium III 733, 756 MB RAM • Queries posed from separate workstation • Output suppressed

  25. Results: Routing Policy • All attributes uniformly distributed over [0,100] • [Graph: per-query results for Queries 1–5]

  26. Experiment: Increased Scalability • Workload, per query: 1–5 randomly selected range predicates of the form ‘attr > x’ over 5 attributes • Predicate constants drawn uniformly from [0,100] • 50% chance of a predicate over each attribute

  27. Conclusion • CACQ: sharing and adaptivity for processing monitoring queries over data streams • Features • Adaptivity • Adapt without costly multi-query reoptimization • Work sharing via tuple lineage • Without constraining the available plans • Computation sharing via grouped filter • Storage sharing via SteMs • Future Work • More sophisticated routing policies • Batching & query grouping • Better integration with historical results
