390 likes | 573 Vues
Window Aggregates in NiagaraST. Kristin Tufte, Jin Li Thanks to the NiagaraST Group @ PSU. Outline. Review of Window Aggregate Evaluation WID Window Semantics Panes Disordered Data and Out-of-order Processing. window. window. t4 (a5, 47, 01:10:10).
E N D
Window Aggregates in NiagaraST Kristin Tufte, Jin Li Thanks to the NiagaraST Group @ PSU Data Streams: Lecture 10
Outline • Review of Window Aggregate Evaluation • WID Window Semantics • Panes • Disordered Data and Out-of-order Processing Data Streams: Lecture 10
window window t4 (a5, 47, 01:10:10) t6 (a6, 46, 01:11:02) t4 (a5, 47, 01:10:10) t5 (a6, 48, 01:10:40) Window Aggregate – Buffering SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid ( aid, max)( a5, 47 )( a6, 48 ) (aid, amt ts (hh:mm:ss)) windowMax t1 (a5, 40, 01:06:30) t2 (a6, 42, 01:07:45) t3 (a5, 45, 01:08:15) window windows: 01:06:00 – 01:11:00 01:07:00 – 01:12:00 01:08:00 – 01:13:00 window: 01:06:00 – 01:11:00 01:07:00 – 01:12:00 01:08:00 – 01:13:00 windows: 01:06:00 – 01:11:00 01:07:00 – 01:12:00 01:08:00 – 01:13:00 ( aid, amt, ts ) t6 ( a6, 46, 01:11:02) t5 ( a6, 48, 01:10:30) t4 ( a5, 47, 01:10:10) Data Streams: Lecture 10
wid max aid 70 70 70 70 70 a6 a5 a5 a5 a6 max: 45 max: 42 max: 40 max: 48 max: 47 … … … … … … 74 74 74 74 74 a5 a5 a6 a6 a5 max: 42 max: 45 max: 48 max: 40 max: 47 71 a6 max: 48 70 a6 max: 48 a6 max: 48 74 a6 max: 48 74 … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … a6 75 max: 46 … … … … … … … … … … … … Window Aggregate Evaluation in NiagaraST -WID (aid, window-id, max) ( a6, 70, 48 ) Max groups on: aid, wid (aid, amt, ts,window-id) t5 ( a6, 48, 01:10:30, 70-74 ) p1 ( a6, *, *, 70 ) t4 ( a5, 47, 01:10:10, 70-74 ) t6 ( a6, 46, 01:11:02, 71-75 ) t1 ( a5, 40, 01:06:30, 70-74 ) t3 ( a5, 45, 01:08:15, 70-74 ) t2 ( a6, 42, 01:07:45, 70-74 ) Bucket Range 5 minutes Slide 1 minute Wattr ts windows: 01:06:00 – 01:11:00 01:07:00 – 01:12:00 01:08:00 – 01:13:00 (aid, amt, ts ) p1( a6, *, 01:11:00) t3 ( a5, 45, 01:08:15) t2 ( a6, 42, 01:07:45) t1 ( a5, 40, 01:06:30) t6 ( a6, 46, 01:11:02) t5 ( a6, 48, 01:10:30) t4 ( s5, 47, 01:10:10) Data Streams: Lecture 10
What’s the Difference • Window Semantics • Assumptions of data arrival order vs. Window-id • Data arrival (query answer and result production) vs. Punctuation • Query evaluation performance • Space • Time • Latency Data Streams: Lecture 10
Bucket • Bucket maps each tuple to windows ( aid, amt, ts ) windows: A 01:03:00 – 01:08:00 B 01:04:00 – 01:09:00 C 01:05:00 – 01:10:00 D 01:06:00 – 01:11:00 E 01:07:00 – 01:12:00 F 01:08:00 – 01:13:00 t3 ( a5, 45, 01:08:15) t2 ( a6, 42, 01:07:45) t6 ( a6, 46, 01:11:02) t1 ( a5, 40, 01:06:30) t5 ( a6, 48, 01:10:30) t4 ( s5, 47, 01:10:10) [Range 5 minutes Slide 1 minute Wattr ts] Data Streams: Lecture 10
Window Semantics Framework in NiagaraST • T: the set of all tuples in the input stream • S: a window specification • W: a set of window-ids • In this lecture, we assume time starts at 0 • windows: (T, S) W • Defines the set of window ids to be used, e.g., 0, 1, 2, … • extent:(T, S, w) U T, where w W • Specifies which tuples belong to a given window • wids:(T, S, t) V W, where t T • Determines the set of window-ids to which a tuple belongs • Is the dual of extent Data Streams: Lecture 10
Extent • Window 74’s content ( aid, amt, ts ) windows: 71 01:03:00 – 01:08:00 72 01:04:00 – 01:09:00 73 01:05:00 – 01:10:00 7401:06:00 – 01:11:00 75 01:07:00 – 01:12:00 76 01:08:00 – 01:13:00 t3 ( a5, 45, 01:08:15) t2 ( a6, 42, 01:07:45) t6 ( a6, 46, 01:11:02) t1 ( a5, 40, 01:06:30) t5 ( a6, 48, 01:10:30) t4 ( s5, 47, 01:10:10) [Range 5 minutes Slide 1 minute Wattr ts] Data Streams: Lecture 10
Wids • t3’s window membership ( aid, amt, ts ) windows: 71 01:03:00 – 01:08:00 7201:04:00 – 01:09:00 7301:05:00 – 01:10:00 7401:06:00 – 01:11:00 7501:07:00 – 01:12:00 7601:08:00 – 01:13:00 t3 ( a5, 45, 01:08:15) t2 ( a6, 42, 01:07:45) t6 ( a6, 46, 01:11:02) t1 ( a5, 40, 01:06:30) t5 ( a6, 48, 01:10:30) t4 ( s5, 47, 01:10:10) [Range 5 minutes Slide 1 minute Wattr ts] Data Streams: Lecture 10
windows (T, S [RANGE, SLIDE, WATTR]) = {0, 1, 2, …} windows (T, S [5, 1, ts]) = {0, 1, 2, …} extent (w, T, S[RANGE, SLIDE, WATTR]) = { t T | ((w+1) * SLIDE)-RANGE) ≤ t.WATTR < (w+1) *SLIDE} extent (w, T, S[5,1, ts]) = { t T | ((w+1) * 1) − 5) ≤ t.ts < (w+1) * 1} wids (t, T, S [5, 1, ts]) = {w W| t.ts / 1 – 1 < w ≤ (t.ts + 5) / 1) – 1 } wids (t, T, S [RANGE, SLIDE, WATTR]) = {w W| t.WATTR / SLIDE– 1 < w ≤ (t.WATTR + RANGE) / SLIDE) – 1 } Defining Window Semantics - sliding window Q1:SELECT aid, max(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP-BY aid For t4(s5, 47, 01:10:10), wids (t4, T, S [5, 1, ts]) = {w W | t4.ts / 1 – 1 < w ≤ (t4.ts + 5) / 1) – 1 } = {w W | 69.17 < w ≤ 74.17 } = {w W | 70 ≤ w ≤ 74} where t4.ts is 01:10:10 ≈ 70.17 minute Data Streams: Lecture 10
Partitioned Window Q2: SELECT aid, MAX(amt ) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid] Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid Q1': SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts Pattr aid] Q1 == Q1' Data Streams: Lecture 10
Defining Window Semantics - partitioned window Q2: SELECT aid, MAX(amt ) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid] T: the set of all tuples in the input stream S: a window specification W: a set of window-ids windows (T, S [RANGE, SLIDE, row-num, PATTR]) = {(i, p) | i {0, 1, 2, …}, p T.PATTR} extent ((i, p), T, S[RANGE, SLIDE, row-num, PATTR]) = { t T | t.PATTR = p, ((i+1) * SLIDE)-RANGE ) ≤rank(t, row-num, PATTR, T) < (i+1) *SLIDE} Data Streams: Lecture 10
Rank function rank ( aid, amt, ts) t1 ( a5, 40, 01:06:30) 0 1 t1 ( a5, 40, 01:06:30) t3 ( a5, 45, 01:08:15) a5 2 t2 ( a6, 42, 01:07:45) t4 ( s5, 47, 01:10:10) t3 ( a5, 45, 01:08:15) t4 ( s5, 47, 01:10:10) t2 ( a6, 42, 01:07:45) 0 t5 ( a6, 48, 01:10:30) a6 t5 ( a6, 48, 01:10:30) 1 t6 ( a6, 46, 01:11:02) t6 ( a6, 46, 01:11:02) 2 rank(t, row-num, PATTR, T): tuple t’s arrival position in partition PATTR Data Streams: Lecture 10
Defining Window Semantics - partitioned window (cont.) wids (t, T, S[RANGE, row-num, PATTR]) = {(i, p)W | t.PATTR = p, r / SLIDE – 1 i (r + RANGE) / SLIDE –1}wherer = rank (t, row-num, PATTR, T) Q2: SELECT aid, MAX(amt ) FROM Bids [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR aid] T: the set of all tuples in the input stream S: a window specification W: a set of window-ids Data Streams: Lecture 10
Defining Window Semantics - slide-by-tuple window Q3: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Rattr ts Slide 1 row Sattr row-num] GROUP BY aid windows (T, S [RANGE, RATTR, 1, row-num]) = {w | t T, w = t.RATTR} extent (t, T, S[RANGE, RATTR,1, row-num]) = { t T | (w+1) -RANGE ) ≤t.RATTR< (w+1)} wids (t, T, S[RANGE, RATTR,1, row-num]) = { wW | t.RATTR≤w< t.RATTR + RANGE} Data Streams: Lecture 10
Discussion: Landmark Windows • “Incrementally compute the max bid price of each auction; update the results every 1 minute.” • Assume time starts from 0 • windows (T, S [-inf, 1, ts]) = {0, 1, 2, …} Data Streams: Lecture 10
Implementation of wids functions -sliding window • No state, no delay to determine the window-ids for each tuple • Context-free Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid wids (t, T, S [RANGE, SLIDE, WATTR]) = {w W| t.WATTR / SLIDE– 1 < w ≤ (t.WATTR + RANGE) / SLIDE) – 1 } wids (t, T, S [5, 1, ts]) = {w W| t.ts / 1 – 1 < w ≤ (t.ts + 5) / 1) – 1 } Data Streams: Lecture 10
Implementation of wids functions -partitioned window • Need to maintain the number of arrived tuples for each partition • Backward Context Q2: SELECT aid, MAX(amt ) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid] wids (t, T, S[1000, 100, row-num, aid]) = {(i, p)W | t.aid = p, r / 100 – 1 i (r + 1000) / 100 –1}wherer = rank (t, row-num, PATTR, T) Data Streams: Lecture 10
Implementation of wids functions -slide-by-tuple window • Forward context Q3: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Rattr ts Slide 1 row Sattr row-num] GROUP BY aid wids (t, T, S[5, ts,1, row-num]) = { wW | t.ts≤w< t.ts + 5} windows (T, S [5, ts, 1, row-num]) = {w | t T, w = t.ts} Data Streams: Lecture 10
Backward Context and Forward Context • Backward context • state to be kept • Forward context • the mapping from tuples to window-ids has to be delayed Data Streams: Lecture 10
WID vs. Buffering – execution time comparison (overview) Data Streams: Lecture 10
WID vs. Buffering – execution time comparison (zoom-in) Data Streams: Lecture 10
Outline • Review of Window Aggregate Evaluation • WID Window Semantics • Panes • Disordered Data and Out-of-order Processing Data Streams: Lecture 10
W1 W2 Windows W3 W4 W5 … … P1 P2 P3 P4 P5 P6 P7 P8 Panes Sharing Panes Li, J. Maier, D., Tufte, K., Papadimos, V., Tucker, P. No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams. http://www.cs.pdx.edu/datalab/niagara/nopanenogain.pdf Q4: SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts]GROUP BY aid • Pane size = GCD (SLIDE, RANGE) Data Streams: Lecture 10
Query Transformation Q4: SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts] GROUP BY aid Q5: SELECT aid, sum(cnt) FROM Q6 [Range 4 Slide 1 Wattr pid] GROUP BY aid Q6: SELECT aid, count(*) as cnt, window-id as pid FROM Bids [Range 1 minute Slide 1 minute Wattr ts] GROUP BY aid Data Streams: Lecture 10
Pane Implementation (aid, sum, window-id) ( a5, 8, 70 ) s0 sum(*) (group on window-id, aid) (aid, pane-id, count, window-id) ( a5, 70, 8, 70-74 ) m0 bucket B2 as window-id RANGE 4 SLIDE 1 WATTR pane-id Q5 (aid, pane-id, count) ( a5, 70, 8 ) m0 count (*) (group on pane-id, aid) (aid, amt, pane-id) ( a5, 47, 70-70 ) t1 ( a5, *, 70 ) p1 ( a6, 48, 70-70 ) t2 Q6 bucket B1 as pane-id RANGE 1 min SLIDE 1 min WATTR ts (aid, amt, ts ) ( a5, 47, 01:10:10 ) t1 ( a5, *, 01:11:00 ) p1 ( a6, 48, 01:10:30 ) t2 streamscan Data Streams: Lecture 10
Aggregate Functions • Distributive, Algebraic, Holistic • Distributive • f(X1), f(X2) f(X1 U X2) • Algebraic • g(X) f(X), g is a “synopsis function” • g(X) can be stored in constant memory • g(X1), g(X2) g(X1 U X2) • Holistic: otherwise • Differential • g(Y), g(X) f(Y – X) • g(Y-X), g(X) f(Y) • g(X) can be stored in constant memory • distributive or algebraic Data Streams: Lecture 10
Pane Implementation - Another Window Sum Window operator Range 4 rows Slide 1 row Wattr row-num Tumbling Count Range 1 minute Slide 1 minute Wattr ts SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts] GROUP BY aid Data Streams: Lecture 10
When are panes better than windows? SELECT aid, max(amt) FROM Bids [Range X rows Slide Y rows Wattr row-num] GROUP BY aid Panes are better when cost ratio is less than 1 The number of tuples per pane affects whether using panes is better Data Streams: Lecture 10
Outline • Review of Window Aggregate Evaluation • WID Window Semantics • Panes • Handling Disorder Data Streams: Lecture 10
Sources of Disorder • Sources of disorder • Merging different data sources • Various network transmission delay • Data prioritization • Query processing algorithms, e.g., shared window joins [Hammad, et al.] • Multiple possible windowing attributes, e.g., two timestamps Data Streams: Lecture 10
Example of Disorder: Band Disorder • Network flow: source IP + port destination IP + port • Timestamp of 8th packet of network flow vs. flow start time • Passive Measurement and Analysis project, San Diego Supercomputer Center, http://pma.nlanr.net/PMA Data Streams: Lecture 10
Example of Disorder: Block-sorted Disorder Network flow start time (s) • Scatter plot of netflow records emitted by a router in the Abilene Network (http://abilene.internet2.edu/observatory) • X-axis is position of flow record in stream, y-axis is network flow start time Data Streams: Lecture 10
Handling Disorder • Sort (In-order processing) • Slack – BSort in Aurora • Heartbeat in STREAM • Sort-based Merge in Gigascope • Output buffering and sorting in a shared-window join • Space and time cost Data Streams: Lecture 10
Out-of-Order Processing • Out-of-order processing in Join • M. A. Hammad, W. G. Aref, and A. K. Elmagarmid Optimizing In-Order Execution of Continuous Queries over Streamed Sensor Data. SSDBM 2005 • NiagaraST: Punctuation + Window-Id • Analogue to CPU’s out-of-order processing of instructions Data Streams: Lecture 10
wid max aid 70 70 s5 s5 max: 47 max: 52 … … … … … … 74 74 s5 s5 max: 47 max: 52 71 s6 max: 48 70 70 s6 s6 max: 48 max: 48 s6 max: 48 74 74 s6 s6 max: 48 max: 48 74 … … … … … … … … … … … … s6 75 max: 46 … … … … … … … … … … … … … … … … … … Disorder Handling - WID Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP-BY aid (aid, window-id, max) ( a6, 70, 48 ) Max Group on: wid, aid (aid, amt, ts,window-id) p1( a6, *, *, 70 ) t7 ( a5, 52, 01:10:15, 70-74) t6 ( a6, 46, 01:11:02, 71-75) bucket (aid, amt, ts ) p1( a6, *, 01:11:00) t6 ( a6, 46, 01:11:02) t7 ( a5, 52, 01:10:15) Data Streams: Lecture 10
Sources of Punctuation • External Punctuation • Data sources, e.g., Gigascope • Internal Punctuation • Mechanisms that can be used to generate punctuations • Slack • Heartbeat Data Streams: Lecture 10
Latency vs. AccuracyBand Disorder • Compare external punctuation and two flavors of slack • As slack increases, error decreases and latency increases • External punctuation has better latency and accuracy than slack Data Streams: Lecture 10
Latency vs. AccuracyBlock-Sorted Disorder SELECT count (*) From Bids [Range 10 minutes Slide 1 minute Wattr ts] Latency vs. Accuracy Block-Sorted-Disorder (percentage of incorrect answers) Data Streams: Lecture 10