1 / 39

Window Aggregates in NiagaraST

Window Aggregates in NiagaraST. Kristin Tufte, Jin Li Thanks to the NiagaraST Group @ PSU. Outline. Review of Window Aggregate Evaluation WID Window Semantics Panes Disordered Data and Out-of-order Processing. window. window. t4 (a5, 47, 01:10:10).

elise
Télécharger la présentation

Window Aggregates in NiagaraST

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Window Aggregates in NiagaraST Kristin Tufte, Jin Li Thanks to the NiagaraST Group @ PSU Data Streams: Lecture 10

  2. Outline • Review of Window Aggregate Evaluation • WID Window Semantics • Panes • Disordered Data and Out-of-order Processing Data Streams: Lecture 10

  3. window window t4 (a5, 47, 01:10:10) t6 (a6, 46, 01:11:02) t4 (a5, 47, 01:10:10) t5 (a6, 48, 01:10:40) Window Aggregate – Buffering SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid ( aid, max)( a5, 47 )( a6, 48 ) (aid, amt ts (hh:mm:ss)) windowMax t1 (a5, 40, 01:06:30) t2 (a6, 42, 01:07:45) t3 (a5, 45, 01:08:15) window windows: 01:06:00 – 01:11:00 01:07:00 – 01:12:00 01:08:00 – 01:13:00 window: 01:06:00 – 01:11:00 01:07:00 – 01:12:00 01:08:00 – 01:13:00 windows: 01:06:00 – 01:11:00 01:07:00 – 01:12:00 01:08:00 – 01:13:00 ( aid, amt, ts ) t6 ( a6, 46, 01:11:02) t5 ( a6, 48, 01:10:30) t4 ( a5, 47, 01:10:10) Data Streams: Lecture 10

  4. wid max aid 70 70 70 70 70 a6 a5 a5 a5 a6 max: 45 max: 42 max: 40 max: 48 max: 47 … … … … … … 74 74 74 74 74 a5 a5 a6 a6 a5 max: 42 max: 45 max: 48 max: 40 max: 47 71 a6 max: 48 70 a6 max: 48 a6 max: 48 74 a6 max: 48 74 … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … a6 75 max: 46 … … … … … … … … … … … … Window Aggregate Evaluation in NiagaraST -WID (aid, window-id, max) ( a6, 70, 48 ) Max groups on: aid, wid (aid, amt, ts,window-id) t5 ( a6, 48, 01:10:30, 70-74 ) p1 ( a6, *, *, 70 ) t4 ( a5, 47, 01:10:10, 70-74 ) t6 ( a6, 46, 01:11:02, 71-75 ) t1 ( a5, 40, 01:06:30, 70-74 ) t3 ( a5, 45, 01:08:15, 70-74 ) t2 ( a6, 42, 01:07:45, 70-74 ) Bucket Range 5 minutes Slide 1 minute Wattr ts windows: 01:06:00 – 01:11:00 01:07:00 – 01:12:00 01:08:00 – 01:13:00 (aid, amt, ts ) p1( a6, *, 01:11:00) t3 ( a5, 45, 01:08:15) t2 ( a6, 42, 01:07:45) t1 ( a5, 40, 01:06:30) t6 ( a6, 46, 01:11:02) t5 ( a6, 48, 01:10:30) t4 ( s5, 47, 01:10:10) Data Streams: Lecture 10

  5. What’s the Difference • Window Semantics • Assumptions of data arrival order vs. Window-id • Data arrival (query answer and result production) vs. Punctuation • Query evaluation performance • Space • Time • Latency Data Streams: Lecture 10

  6. Bucket • Bucket maps each tuple to windows ( aid, amt, ts ) windows: A 01:03:00 – 01:08:00 B 01:04:00 – 01:09:00 C 01:05:00 – 01:10:00 D 01:06:00 – 01:11:00 E 01:07:00 – 01:12:00 F 01:08:00 – 01:13:00 t3 ( a5, 45, 01:08:15) t2 ( a6, 42, 01:07:45) t6 ( a6, 46, 01:11:02) t1 ( a5, 40, 01:06:30) t5 ( a6, 48, 01:10:30) t4 ( s5, 47, 01:10:10) [Range 5 minutes Slide 1 minute Wattr ts] Data Streams: Lecture 10

  7. Window Semantics Framework in NiagaraST • T: the set of all tuples in the input stream • S: a window specification • W: a set of window-ids • In this lecture, we assume time starts at 0 • windows: (T, S)  W • Defines the set of window ids to be used, e.g., 0, 1, 2, … • extent:(T, S, w) U  T, where w  W • Specifies which tuples belong to a given window • wids:(T, S, t) V  W, where t  T • Determines the set of window-ids to which a tuple belongs • Is the dual of extent Data Streams: Lecture 10

  8. Extent • Window 74’s content ( aid, amt, ts ) windows: 71 01:03:00 – 01:08:00 72 01:04:00 – 01:09:00 73 01:05:00 – 01:10:00 7401:06:00 – 01:11:00 75 01:07:00 – 01:12:00 76 01:08:00 – 01:13:00 t3 ( a5, 45, 01:08:15) t2 ( a6, 42, 01:07:45) t6 ( a6, 46, 01:11:02) t1 ( a5, 40, 01:06:30) t5 ( a6, 48, 01:10:30) t4 ( s5, 47, 01:10:10) [Range 5 minutes Slide 1 minute Wattr ts] Data Streams: Lecture 10

  9. Wids • t3’s window membership ( aid, amt, ts ) windows: 71 01:03:00 – 01:08:00 7201:04:00 – 01:09:00 7301:05:00 – 01:10:00 7401:06:00 – 01:11:00 7501:07:00 – 01:12:00 7601:08:00 – 01:13:00 t3 ( a5, 45, 01:08:15) t2 ( a6, 42, 01:07:45) t6 ( a6, 46, 01:11:02) t1 ( a5, 40, 01:06:30) t5 ( a6, 48, 01:10:30) t4 ( s5, 47, 01:10:10) [Range 5 minutes Slide 1 minute Wattr ts] Data Streams: Lecture 10

  10. windows (T, S [RANGE, SLIDE, WATTR]) = {0, 1, 2, …} windows (T, S [5, 1, ts]) = {0, 1, 2, …} extent (w, T, S[RANGE, SLIDE, WATTR]) = { t T | ((w+1) * SLIDE)-RANGE) ≤ t.WATTR < (w+1) *SLIDE} extent (w, T, S[5,1, ts]) = { t T | ((w+1) * 1) − 5) ≤ t.ts < (w+1) * 1} wids (t, T, S [5, 1, ts]) = {w W| t.ts / 1 – 1 < w ≤ (t.ts + 5) / 1) – 1 } wids (t, T, S [RANGE, SLIDE, WATTR]) = {w W| t.WATTR / SLIDE– 1 < w ≤ (t.WATTR + RANGE) / SLIDE) – 1 } Defining Window Semantics - sliding window Q1:SELECT aid, max(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP-BY aid For t4(s5, 47, 01:10:10), wids (t4, T, S [5, 1, ts]) = {w W | t4.ts / 1 – 1 < w ≤ (t4.ts + 5) / 1) – 1 } = {w W | 69.17 < w ≤ 74.17 } = {w W | 70 ≤ w ≤ 74} where t4.ts is 01:10:10 ≈ 70.17 minute Data Streams: Lecture 10

  11. Partitioned Window Q2: SELECT aid, MAX(amt ) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid] Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid Q1': SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts Pattr aid] Q1 == Q1' Data Streams: Lecture 10

  12. Defining Window Semantics - partitioned window Q2: SELECT aid, MAX(amt ) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid] T: the set of all tuples in the input stream S: a window specification W: a set of window-ids windows (T, S [RANGE, SLIDE, row-num, PATTR]) = {(i, p) | i  {0, 1, 2, …}, p  T.PATTR} extent ((i, p), T, S[RANGE, SLIDE, row-num, PATTR]) = { t T | t.PATTR = p, ((i+1) * SLIDE)-RANGE ) ≤rank(t, row-num, PATTR, T) < (i+1) *SLIDE} Data Streams: Lecture 10

  13. Rank function rank ( aid, amt, ts) t1 ( a5, 40, 01:06:30) 0 1 t1 ( a5, 40, 01:06:30) t3 ( a5, 45, 01:08:15) a5 2 t2 ( a6, 42, 01:07:45) t4 ( s5, 47, 01:10:10) t3 ( a5, 45, 01:08:15) t4 ( s5, 47, 01:10:10) t2 ( a6, 42, 01:07:45) 0 t5 ( a6, 48, 01:10:30) a6 t5 ( a6, 48, 01:10:30) 1 t6 ( a6, 46, 01:11:02) t6 ( a6, 46, 01:11:02) 2 rank(t, row-num, PATTR, T): tuple t’s arrival position in partition PATTR Data Streams: Lecture 10

  14. Defining Window Semantics - partitioned window (cont.) wids (t, T, S[RANGE, row-num, PATTR]) = {(i, p)W | t.PATTR = p, r / SLIDE – 1 i  (r + RANGE) / SLIDE –1}wherer = rank (t, row-num, PATTR, T) Q2: SELECT aid, MAX(amt ) FROM Bids [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR aid] T: the set of all tuples in the input stream S: a window specification W: a set of window-ids Data Streams: Lecture 10

  15. Defining Window Semantics - slide-by-tuple window Q3: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Rattr ts Slide 1 row Sattr row-num] GROUP BY aid windows (T, S [RANGE, RATTR, 1, row-num]) = {w | t T, w = t.RATTR} extent (t, T, S[RANGE, RATTR,1, row-num]) = { t T | (w+1) -RANGE ) ≤t.RATTR< (w+1)} wids (t, T, S[RANGE, RATTR,1, row-num]) = { wW | t.RATTR≤w< t.RATTR + RANGE} Data Streams: Lecture 10

  16. Discussion: Landmark Windows • “Incrementally compute the max bid price of each auction; update the results every 1 minute.” • Assume time starts from 0 • windows (T, S [-inf, 1, ts]) = {0, 1, 2, …} Data Streams: Lecture 10

  17. Implementation of wids functions -sliding window • No state, no delay to determine the window-ids for each tuple • Context-free Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid wids (t, T, S [RANGE, SLIDE, WATTR]) = {w W| t.WATTR / SLIDE– 1 < w ≤ (t.WATTR + RANGE) / SLIDE) – 1 } wids (t, T, S [5, 1, ts]) = {w W| t.ts / 1 – 1 < w ≤ (t.ts + 5) / 1) – 1 } Data Streams: Lecture 10

  18. Implementation of wids functions -partitioned window • Need to maintain the number of arrived tuples for each partition • Backward Context Q2: SELECT aid, MAX(amt ) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid] wids (t, T, S[1000, 100, row-num, aid]) = {(i, p)W | t.aid = p, r / 100 – 1 i  (r + 1000) / 100 –1}wherer = rank (t, row-num, PATTR, T) Data Streams: Lecture 10

  19. Implementation of wids functions -slide-by-tuple window • Forward context Q3: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Rattr ts Slide 1 row Sattr row-num] GROUP BY aid wids (t, T, S[5, ts,1, row-num]) = { wW | t.ts≤w< t.ts + 5} windows (T, S [5, ts, 1, row-num]) = {w | t T, w = t.ts} Data Streams: Lecture 10

  20. Backward Context and Forward Context • Backward context • state to be kept • Forward context • the mapping from tuples to window-ids has to be delayed Data Streams: Lecture 10

  21. WID vs. Buffering – execution time comparison (overview) Data Streams: Lecture 10

  22. WID vs. Buffering – execution time comparison (zoom-in) Data Streams: Lecture 10

  23. Outline • Review of Window Aggregate Evaluation • WID Window Semantics • Panes • Disordered Data and Out-of-order Processing Data Streams: Lecture 10

  24. W1 W2 Windows W3 W4 W5 … … P1 P2 P3 P4 P5 P6 P7 P8 Panes Sharing Panes Li, J. Maier, D., Tufte, K., Papadimos, V., Tucker, P. No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams. http://www.cs.pdx.edu/datalab/niagara/nopanenogain.pdf Q4: SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts]GROUP BY aid • Pane size = GCD (SLIDE, RANGE) Data Streams: Lecture 10

  25. Query Transformation Q4: SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts] GROUP BY aid Q5: SELECT aid, sum(cnt) FROM Q6 [Range 4 Slide 1 Wattr pid] GROUP BY aid Q6: SELECT aid, count(*) as cnt, window-id as pid FROM Bids [Range 1 minute Slide 1 minute Wattr ts] GROUP BY aid Data Streams: Lecture 10

  26. Pane Implementation (aid, sum, window-id) ( a5, 8, 70 ) s0 sum(*) (group on window-id, aid) (aid, pane-id, count, window-id) ( a5, 70, 8, 70-74 ) m0 bucket B2 as window-id RANGE 4 SLIDE 1 WATTR pane-id Q5 (aid, pane-id, count) ( a5, 70, 8 ) m0 count (*) (group on pane-id, aid) (aid, amt, pane-id) ( a5, 47, 70-70 ) t1 ( a5, *, 70 ) p1 ( a6, 48, 70-70 ) t2 Q6 bucket B1 as pane-id RANGE 1 min SLIDE 1 min WATTR ts (aid, amt, ts ) ( a5, 47, 01:10:10 ) t1 ( a5, *, 01:11:00 ) p1 ( a6, 48, 01:10:30 ) t2 streamscan Data Streams: Lecture 10

  27. Aggregate Functions • Distributive, Algebraic, Holistic • Distributive • f(X1), f(X2)  f(X1 U X2) • Algebraic • g(X)  f(X), g is a “synopsis function” • g(X) can be stored in constant memory • g(X1), g(X2)  g(X1 U X2) • Holistic: otherwise • Differential • g(Y), g(X)  f(Y – X) • g(Y-X), g(X)  f(Y) • g(X) can be stored in constant memory • distributive or algebraic Data Streams: Lecture 10

  28. Pane Implementation - Another Window Sum Window operator Range 4 rows Slide 1 row Wattr row-num Tumbling Count Range 1 minute Slide 1 minute Wattr ts SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts] GROUP BY aid Data Streams: Lecture 10

  29. When are panes better than windows? SELECT aid, max(amt) FROM Bids [Range X rows Slide Y rows Wattr row-num] GROUP BY aid Panes are better when cost ratio is less than 1 The number of tuples per pane affects whether using panes is better Data Streams: Lecture 10

  30. Outline • Review of Window Aggregate Evaluation • WID Window Semantics • Panes • Handling Disorder Data Streams: Lecture 10

  31. Sources of Disorder • Sources of disorder • Merging different data sources • Various network transmission delay • Data prioritization • Query processing algorithms, e.g., shared window joins [Hammad, et al.] • Multiple possible windowing attributes, e.g., two timestamps Data Streams: Lecture 10

  32. Example of Disorder: Band Disorder • Network flow: source IP + port  destination IP + port • Timestamp of 8th packet of network flow vs. flow start time • Passive Measurement and Analysis project, San Diego Supercomputer Center, http://pma.nlanr.net/PMA Data Streams: Lecture 10

  33. Example of Disorder: Block-sorted Disorder Network flow start time (s) • Scatter plot of netflow records emitted by a router in the Abilene Network (http://abilene.internet2.edu/observatory) • X-axis is position of flow record in stream, y-axis is network flow start time Data Streams: Lecture 10

  34. Handling Disorder • Sort (In-order processing) • Slack – BSort in Aurora • Heartbeat in STREAM • Sort-based Merge in Gigascope • Output buffering and sorting in a shared-window join • Space and time cost Data Streams: Lecture 10

  35. Out-of-Order Processing • Out-of-order processing in Join • M. A. Hammad, W. G. Aref, and A. K. Elmagarmid Optimizing In-Order Execution of Continuous Queries over Streamed Sensor Data. SSDBM 2005 • NiagaraST: Punctuation + Window-Id • Analogue to CPU’s out-of-order processing of instructions Data Streams: Lecture 10

  36. wid max aid 70 70 s5 s5 max: 47 max: 52 … … … … … … 74 74 s5 s5 max: 47 max: 52 71 s6 max: 48 70 70 s6 s6 max: 48 max: 48 s6 max: 48 74 74 s6 s6 max: 48 max: 48 74 … … … … … … … … … … … … s6 75 max: 46 … … … … … … … … … … … … … … … … … … Disorder Handling - WID Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP-BY aid (aid, window-id, max) ( a6, 70, 48 ) Max Group on: wid, aid (aid, amt, ts,window-id) p1( a6, *, *, 70 ) t7 ( a5, 52, 01:10:15, 70-74) t6 ( a6, 46, 01:11:02, 71-75) bucket (aid, amt, ts ) p1( a6, *, 01:11:00) t6 ( a6, 46, 01:11:02) t7 ( a5, 52, 01:10:15) Data Streams: Lecture 10

  37. Sources of Punctuation • External Punctuation • Data sources, e.g., Gigascope • Internal Punctuation • Mechanisms that can be used to generate punctuations • Slack • Heartbeat Data Streams: Lecture 10

  38. Latency vs. AccuracyBand Disorder • Compare external punctuation and two flavors of slack • As slack increases, error decreases and latency increases • External punctuation has better latency and accuracy than slack Data Streams: Lecture 10

  39. Latency vs. AccuracyBlock-Sorted Disorder SELECT count (*) From Bids [Range 10 minutes Slide 1 minute Wattr ts] Latency vs. Accuracy Block-Sorted-Disorder (percentage of incorrect answers) Data Streams: Lecture 10

More Related