Continuously Adaptive Continuous Queries (CACQ) over Streams, by Samuel Madden, Mehul Shah, Joseph Hellerstein, and Vijayshankar Raman. SIGMOD 2002.
CACQ Introduction • Proposed continuous query (CQ) systems are based on static plans • But, CQs are long running • Initially valid assumptions less so over time • Static optimizers at their worst! • CACQ insight: apply continuous adaptivity of eddies to continuous queries • Avoid static optimization via dynamic operator ordering • Process multiple queries simultaneously • Explore sharing of work & storage
Motivating Applications • “Monitoring” queries look for recent events in streams • Sensor data processing • Stock analysis • Router, web, or phone events • In CACQ, queries over ‘recent-history’ • Only tuples currently entering the system • Stored in in-memory data tables for time-windowed joins between streams
Continuous Queries • Long running, “standing queries”, similar to trigger systems • Installed; continuously produce streamed results until removed • Lots of queries, over the same data sources • Opportunity for work sharing! • Global query optimization problem: hard! • Idea: adaptive heuristics not quite as hard? • Bad decisions are not final • Future work: finding an optimal plan (adaptively)
Joins in CACQ • CACQ uses parallel pipelined joins • To avoid blocking • Example: symmetric (windowed) hash join • [Figure: symmetric hash join of R and S; each input builds into its own hash table (on R.a or S.b) and probes the other]
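The build/probe pattern on this slide can be sketched roughly as follows. This is a minimal illustration, not the CACQ implementation: the class name `SymmetricHashJoin`, the count-based `window` parameter, and the `insert` method are all assumptions made for the example.

```python
from collections import defaultdict, deque


class SymmetricHashJoin:
    """Non-blocking equi-join sketch: every arriving tuple is built into
    its own side's hash table, then probed against the other side's."""

    def __init__(self, window=None):
        # window: hypothetical count-based limit per side (None = unbounded)
        self.window = window
        self.tables = {"R": defaultdict(list), "S": defaultdict(list)}
        self.order = {"R": deque(), "S": deque()}

    def insert(self, side, key, payload):
        other = "S" if side == "R" else "R"
        # Build: remember the new tuple on its own side.
        self.tables[side][key].append(payload)
        self.order[side].append((key, payload))
        # Evict the oldest tuple on this side once the window is exceeded.
        if self.window is not None and len(self.order[side]) > self.window:
            old_key, old_payload = self.order[side].popleft()
            self.tables[side][old_key].remove(old_payload)
        # Probe: match against everything currently stored on the other side.
        matches = self.tables[other].get(key, [])
        if side == "R":
            return [(payload, m) for m in matches]
        return [(m, payload) for m in matches]
```

Because results are emitted as soon as either input delivers a tuple, neither input has to be read to completion before output starts, which is the property the slide refers to as avoiding blocking.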
CACQ Query Model • SELECT R.a, S.b FROM R, S WHERE R.c = x AND R.d[n] = S.e[m] • [Figure: windowed join of R and S with windows n and m; an R tuple probes into the S window m]
CACQ Main Points • Adaptivity via Eddies and Routing policies • Tuple Lineage for flexible sharing of operators between queries • Grouped Filter for efficiently computing selections over multiple queries • State Modules (SteMs) for enabling state sharing among joins
Step by Step Using Example • First, just one query with only selections • Then, add multiple queries • Then, add joins to the picture
Eddies & Adaptivity • Eddies (Avnur & Hellerstein, SIGMOD 2000): Continuous Adaptivity • No static ordering of operators (“no query plan”) • Routing policy dynamically determines in which order individual tuples go through operators • Encoding of state (lineage) with each tuple: • Use ready bits to track what to do next • Use done bits to track what has been done
Eddies: Single Query, Single Source • SELECT * FROM R WHERE R.a > 10 AND R.b < 15 • Ready bits track what to do next • All 1’s in single source • Done bits track what has been done • Tuple can be output when all done bits are set • [Figure: tuples R1 and R2 routed by the eddy through the selections (R.a > 10) and (R.b < 15), with the Ready and Done bits for a and b shown at each step]
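The ready/done bookkeeping for the single-query case can be sketched as below. This is an illustrative sketch only: the function name `eddy`, the bit assignment, and the uniformly random routing choice are assumptions standing in for a real routing policy.

```python
import random

# Operators for: SELECT * FROM R WHERE R.a > 10 AND R.b < 15
OPERATORS = [
    lambda t: t["a"] > 10,  # bit 0
    lambda t: t["b"] < 15,  # bit 1
]
ALL_DONE = (1 << len(OPERATORS)) - 1


def eddy(tuple_, rng=random):
    """Route one tuple through the operators in a dynamically chosen order.
    Returns True if the tuple can be output, False if an operator drops it."""
    ready = ALL_DONE  # single source: every operator is applicable at once
    done = 0
    while done != ALL_DONE:
        # Operators that are ready and have not been applied yet.
        candidates = [i for i in range(len(OPERATORS))
                      if ready & ~done & (1 << i)]
        i = rng.choice(candidates)  # stand-in routing policy (assumption)
        if not OPERATORS[i](tuple_):
            return False  # tuple fails this selection and is dropped
        done |= 1 << i  # record that operator i has been applied
    return True  # all done bits set: output the tuple
```

The result does not depend on the order in which the two selections are tried; only the amount of work does, which is what the routing policy is meant to optimize.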
Multiple Queries • Q1: SELECT * FROM R WHERE R.a > 10 AND R.b < 15 • Q2: SELECT * FROM R WHERE R.a > 20 AND R.b = 25 • Q3: SELECT * FROM R WHERE R.a = 0 AND R.b <> 50 • Grouped Filters: one over R.a (R.a > 10, R.a > 20, R.a = 0) and one over R.b (R.b < 15, R.b = 25, R.b <> 50) • [Figure: tuple R1 routed through the grouped filters on a and b, with its Done bits (a, b) and QueriesCompleted bits (Q1, Q2, Q3) updated at each step]
Multiple Queries (continued) • Q1: SELECT * FROM R WHERE R.a > 10 AND R.b < 15 • Q2: SELECT * FROM R WHERE R.a > 20 AND R.b = 25 • Q3: SELECT * FROM R WHERE R.a = 0 AND R.b <> 50 • Grouped Filters: one over R.a (R.a > 10, R.a > 20, R.a = 0) and one over R.b (R.b < 15, R.b = 25, R.b <> 50) • Reorder operators! • [Figure: tuple R2 routed through the same grouped filters, with its Done bits (a, b) and QueriesCompleted bits (Q1, Q2, Q3) updated at each step]
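A rough sketch of how the three queries above could share one pass over each tuple is given below. The names `FILTERS` and `process` are hypothetical, a Python set stands in for the queriesCompleted bitmap, and each grouped filter is evaluated with a plain loop rather than an index.

```python
QUERIES = ["Q1", "Q2", "Q3"]

# One grouped filter per attribute, holding each query's predicate on it.
FILTERS = {
    "a": {"Q1": lambda a: a > 10, "Q2": lambda a: a > 20, "Q3": lambda a: a == 0},
    "b": {"Q1": lambda b: b < 15, "Q2": lambda b: b == 25, "Q3": lambda b: b != 50},
}


def process(tuple_, order):
    """Apply the grouped filters in the given order and return the queries
    the tuple is output to. `completed` plays the role of queriesCompleted."""
    completed = set()
    for attr in order:
        for query, predicate in FILTERS[attr].items():
            # Skip queries that already rejected this tuple.
            if query not in completed and not predicate(tuple_[attr]):
                completed.add(query)  # rejected by this query
        if len(completed) == len(QUERIES):
            return []  # no query is still interested; stop routing early
    return [q for q in QUERIES if q not in completed]
```

Calling `process({"a": 15, "b": 10}, ["a", "b"])` and `process({"a": 15, "b": 10}, ["b", "a"])` gives the same answer, which is the point of the "Reorder operators!" slide: the per-tuple lineage makes the filter order a performance choice rather than a correctness constraint.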
Tuple & Query Data Structures Tuple [10, 1100, …] • Per tuple bitmaps: • queriesCompleted • What queries has this tuple been output to or rejected by? • done • What operators have been applied to this tuple? • ready • What operators can be applied to this tuple? • Per query bitmaps: • completionMask • What operators must be applied to output a tuple to this query? Query [0110]
Outputting Tuples • Store a completionMask bitmap for each query • One bit per operator • Set if operator in query • To determine if a tuple t can be output to query q: • Eddy ANDs q’s completionMask with t’s done bits; the result must equal the completionMask • Output only if q’s bit not set in t’s queriesCompleted bits • Checked every time a tuple returns from an operator • Example, with operators on a, b, c, d: • Q1: SELECT * FROM R WHERE R.a > 10 AND R.b < 15 (completionMask 1100) • Q2: SELECT * FROM R WHERE R.b < 15 AND R.c <> 5 AND R.d = 10 (completionMask 0111) • Q1 test: completionMask & Done == 1100 && QueriesCompleted[0] == 0 • Q2 test: completionMask & Done == 0111 • [Figure: a tuple’s Done and QueriesCompleted (QC) bits at successive steps]
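The output test on this slide amounts to two bitwise checks per query. A minimal sketch, using the slide's masks for Q1 and Q2; the bit layout of the queriesCompleted bitmap and the function name `deliverable` are assumptions made for the example.

```python
# Operator bits, written left to right as a, b, c, d to match the slide.
A, B, C, D = 0b1000, 0b0100, 0b0010, 0b0001

COMPLETION_MASKS = {"Q1": A | B, "Q2": B | C | D}  # 1100 and 0111
QUERY_BIT = {"Q1": 1 << 0, "Q2": 1 << 1}  # positions in queriesCompleted (assumed)


def deliverable(done, queries_completed):
    """Return the queries a tuple can be output to right now."""
    result = []
    for query, mask in COMPLETION_MASKS.items():
        all_ops_applied = (done & mask) == mask
        not_yet_handled = (queries_completed & QUERY_BIT[query]) == 0
        if all_ops_applied and not_yet_handled:
            result.append(query)
    return result
```

For a tuple whose done bits are 1100, only Q1 qualifies; Q2 has to wait until the operators on c and d have also been applied.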
Grouped Filter • Use binary trees to efficiently index range predicates • Two trees (LT & GT) per attribute • Insert constant • When tuple arrives • Scan to right (for GT) or left (for LT) of the tuple-attribute in the tree • Those are the queries that the tuple does not pass • Hash tables to index equality, inequality predicates • [Figure: greater-than tree over S.a indexing the predicates S.a > 1, S.a > 7, and S.a > 11 from queries Q1 to Q3]
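The greater-than half of a grouped filter can be sketched as follows. A sorted list with `bisect` stands in for the binary tree on the slide; the class name `GreaterThanFilter`, the assignment of constants to query names, and the example tuple values are assumptions for illustration.

```python
import bisect


class GreaterThanFilter:
    """Indexes predicates of the form `attr > constant` for many queries."""

    def __init__(self):
        self.constants = []  # kept sorted in ascending order
        self.queries = []    # queries[i] owns constants[i]

    def add(self, query, constant):
        i = bisect.bisect_left(self.constants, constant)
        self.constants.insert(i, constant)
        self.queries.insert(i, query)

    def failed_queries(self, value):
        # `value > c` fails exactly when c >= value, i.e. for every
        # constant at or to the right of `value` in sorted order.
        i = bisect.bisect_left(self.constants, value)
        return set(self.queries[i:])


gt = GreaterThanFilter()
gt.add("Q1", 1)    # S.a > 1   (query names assigned arbitrarily here)
gt.add("Q2", 7)    # S.a > 7
gt.add("Q3", 11)   # S.a > 11
```

With these predicates, a tuple with S.a = 8 fails only the query holding `S.a > 11`, found with a single search plus a scan of the failing suffix instead of evaluating all three predicates.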
Work Sharing via Tuple Lineage • Q1: SELECT * FROM s WHERE A, B, C • Q2: SELECT * FROM s WHERE A, B, D • Conventional Queries: Query 1 and Query 2 each apply their own selections to data stream S • Shared Subexpr.: AB must be applied first! • Intersection of CD goes through AB an extra time! • CACQ: Lineage (Queries Completed) Enables Any Ordering! • [Figure: the three plans over data stream S side by side; in the CACQ plan each tuple carries QC bits for Q1 and Q2 as it passes through A, B, C, and D, and is rejected once both queries have rejected it]
Tradeoff: Overhead vs. Shared Work • Overhead in additional bits per tuple • Bits per query per tuple can have a significant effect • Trading accounting overhead for work sharing • 100 bits / tuple allows a tuple to be processed once, not 100 times • Reduce overhead by not keeping state about operators tuple will never pass through
Joins in CACQ • Use symmetric hash join to avoid blocking • Use State Modules (SteMs) to share storage between joins with a common base relation • [Figure: symmetric hash join of R and S, with build and probe on R.a and S.b]
Processing Joins Via State Modules • Idea: Share join indices over base relations • State Modules (SteMs) are: • Unary indexes (e.g. hash tables, trees) • Built on fly (as data arrives) • Scheduled by CACQ as first class operators • Based on symmetric hash join • [Figure: two join queries, R.a = S.b and S.b = T.c, over streams R, S, and T; SteMs on R.a, S.b, and T.c are built and probed, with the S.b SteM shared by both queries]
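The idea of sharing one index per base relation across several joins can be sketched like this. It is a simplified illustration under stated assumptions: the `SteM` class, the `route` helper, and the dictionary of SteMs keyed by attribute name are hypothetical, and windowing and eddy scheduling are left out.

```python
from collections import defaultdict


class SteM:
    """Unary index over one attribute of one stream, built as data arrives."""

    def __init__(self):
        self.index = defaultdict(list)

    def build(self, key, tuple_):
        self.index[key].append(tuple_)

    def probe(self, key):
        return list(self.index.get(key, []))


# One SteM per join attribute; the S.b SteM serves both join queries.
stems = {"R.a": SteM(), "S.b": SteM(), "T.c": SteM()}


def route(tuple_, build_into, key, probe):
    """Build the tuple into its own SteM, then probe the named SteMs."""
    stems[build_into].build(key, tuple_)
    return {name: stems[name].probe(key) for name in probe}


# R and T tuples arrive first and find nothing to join with yet.
r_matches = route({"a": 5}, "R.a", 5, probe=["S.b"])
t_matches = route({"c": 5}, "T.c", 5, probe=["S.b"])
# One S tuple is stored once and answers both R.a = S.b and S.b = T.c.
s_matches = route({"b": 5}, "S.b", 5, probe=["R.a", "T.c"])
```

The S tuple is stored a single time even though two joins use it, which is the storage sharing the slide describes.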
Routing Policies • Machinery so far assures correctness • Routing policy responsible for performance • Consult the policy to determine where to route every tuple that: • Enters the system • Returns from an operator
Eddies with Lottery Scheduling • Operator gets 1 ticket when it takes a tuple • Favor operators that run fast (low cost) • Operator loses a ticket when it returns a tuple • Favor operators that drop tuples (low selectivity) • Winner? • Large number of tickets == measure of goodness • Lottery Scheduling: • When two operators vie for the same tuple, hold a lottery • Never let any operator go to zero tickets • Support occasional random “exploration”
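The ticket accounting above can be sketched as a small class. This is an illustration rather than the Telegraph code: the name `LotteryRouter`, its method names, and the choice of a one-ticket floor as the way to keep every operator eligible are assumptions.

```python
import random


class LotteryRouter:
    """Ticket-based routing: operators that consume many tuples and
    return few of them accumulate tickets and win more lotteries."""

    def __init__(self, operator_names, rng=None):
        self.tickets = {name: 1 for name in operator_names}
        self.rng = rng or random.Random()

    def took_tuple(self, name):
        # Credit: the operator accepted a tuple from the eddy.
        self.tickets[name] += 1

    def returned_tuple(self, name):
        # Debit: the operator sent a tuple back; never drop below one
        # ticket, so every operator keeps a chance of being explored.
        self.tickets[name] = max(1, self.tickets[name] - 1)

    def choose(self, candidates):
        # Hold a lottery weighted by current ticket counts.
        weights = [self.tickets[c] for c in candidates]
        return self.rng.choices(candidates, weights=weights, k=1)[0]
```

An operator that drops most of what it receives ends up with a net gain in tickets, so it tends to be tried first; one that passes everything through stays near the floor but is still picked occasionally.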
Routing Policies For CACQ • Give more tickets to operators shared by multiple queries (e.g., grouped filters) • When a shared operator outputs a tuple, charge it multiple tickets • Intuition: cardinality reducing shared operators reduce global work more than unshared operators • Not optimizing for the throughput of a single query!
Evaluation • Java implementation on top of Telegraph QP • 4,000 new lines of code in 75,000 line codebase • Server Platform • Linux 2.4.10 • Pentium III 733, 756 MB RAM • Queries posed from separate workstation • Output suppressed
Results: Routing Policy • All attributes uniformly distributed over [0,100] • [Figure: results for Queries 1 to 5]
Experiment: Increased Scalability Workload, Per Query: 1-5 randomly selected range predicates of form ‘attr > x’ over 5 attributes. Predicates from the uniform distribution [0,100]. 50% chance of predicate over each attribute.
Conclusion • CACQ: sharing and adaptivity for processing monitoring queries over data streams • Features • Adaptivity • Adapt without costly multi-query reoptimization • Work sharing via tuple lineage • Without constraining the available plans • Computation sharing via grouped filter • Storage sharing via SteMs • Future Work • More sophisticated routing policies • Batching & query grouping • Better integration with historical results