Aurora introduces a novel approach to data querying that deviates from traditional SQL-like languages, opting instead for a graphical interface that uses dataflow diagrams, where boxes represent operators and arrows signify data streams. This design aims to simplify complex queries and enhance user experience. Key functionalities include various operators like Filter, Map, Union, and more, alongside features like time-based windows for aggregation. By emphasizing Quality of Service (QoS), Aurora addresses challenges such as tuple delays and unordered data, enabling users to define parameters for efficient window closure.
Panel on Stream Query Languages: The Aurora View. Stan Zdonik, Brown University
Aurora Queries • We do not have an SQL-like language. • We have a GUI for dataflow diagrams. • Boxes = operators • Arrows = streams • Rationale: • CSE is tough for thousands of queries. • Workflow is more natural. • Easier for users to extend what’s been done. • Best to understand implementation first.
Aurora Operators • Very relational in spirit. • Filter, Map, Union, Join, Aggregate • Adds Windows (everyone seems to agree). … with some wrinkles that we will get to. • Adds a few operators. • Wsort • Resample
Simple Aggregation (generalized aggregate)
[Figure: a stream with schema (A, B, C) enters an Aggregate box, Agg(init, incr, final), with Window(on C, size = 2, offset = 1) and GroupBy A, B; the output stream also has schema (A, B, C).]
• init: called when window opens • incr: called for each new value • final: called when window closes • One or more open windows per group. • Size and Offset given in: • #tuples, attribute interval, or time interval
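The init/incr/final protocol above can be sketched in a few lines of Python. This is only an illustration, not Aurora's implementation: it assumes the count-based (#tuples) form of Size and Offset, and the function name `windowed_aggregate` and the `incr(state, tuple)` calling convention are made up for the example.

```python
from collections import defaultdict

def windowed_aggregate(stream, key, init, incr, final, size, offset):
    """Count-based sliding windows per group (hypothetical sketch).

    Yields (group, result) each time a window of `size` tuples closes.
    A new window opens every `offset` tuples within a group.
    """
    open_windows = defaultdict(list)  # group -> list of [state, tuples_seen]
    seen = defaultdict(int)           # group -> tuples seen so far
    for t in stream:
        g = key(t)
        if seen[g] % offset == 0:
            open_windows[g].append([init(), 0])   # init: window opens
        seen[g] += 1
        still_open = []
        for w in open_windows[g]:
            w[0] = incr(w[0], t)                  # incr: one call per new tuple
            w[1] += 1
            if w[1] == size:
                yield g, final(w[0])              # final: window closes
            else:
                still_open.append(w)
        open_windows[g] = still_open

# Usage: sum of C over windows of 2 tuples, sliding by 1, grouped by (A, B).
tuples = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 1, 3), (1, 2, 2)]
results = list(windowed_aggregate(
    tuples,
    key=lambda t: (t[0], t[1]),
    init=lambda: 0,
    incr=lambda s, t: s + t[2],
    final=lambda s: s,
    size=2,
    offset=1,
))
print(results)
```

With size = 2 and offset = 1, each group has at most two open windows at a time, which is the "one or more open windows per group" case from the slide.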
Query 1 Generate the stream of packets whose length is greater than twice the average packet length over the last 1 hour.
Input: (pID, length, time)
Aggregate agg(init, incr, final), Window(on time, size = 1 hr, offset = 1 tuple)
State = (sum int, num int, endtime int)
init = {sum := 0; num := 0}
incr(p) = {sum := sum + p.length; num := num + 1; endtime := p.time}
final = emit (time2 = endtime, avgLen = sum/num)
Join Match (length > 2 * avgLen and time = time2)
Map f(t): (t.ID, t.length, t.time)
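As a rough illustration of what Query 1 computes, the sketch below collapses the Aggregate, Join, and Map boxes into a single Python loop. It is not how Aurora evaluates the diagram. It assumes packets arrive in time order, that `time` is in seconds, and that the window ending at a packet includes that packet; the name `long_packets` is hypothetical.

```python
from collections import deque

def long_packets(packets, window=3600):
    """Yield (pID, length, time) for packets longer than twice the
    average length over the trailing `window` seconds (sketch)."""
    recent = deque()  # (time, length) of packets currently in the window
    total = 0         # running sum of lengths in the window
    for pid, length, time in packets:
        recent.append((time, length))
        total += length
        # Expire packets that fell out of the one-hour window.
        while recent[0][0] <= time - window:
            _, old_length = recent.popleft()
            total -= old_length
        avg_len = total / len(recent)
        if length > 2 * avg_len:
            yield pid, length, time

# Usage with made-up packets.
packets = [(1, 100, 0), (2, 100, 10), (3, 100, 20), (4, 1000, 30), (5, 100, 4000)]
print(list(long_packets(packets)))
```

Keeping a running sum mirrors the (sum, num) state of the aggregate on the slide, so the average is updated per tuple rather than recomputed over the whole window.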
Query 2 Create an alert when more than 20 type 'A' squirrels are in Jennifer's backyard. Assume squirrels report every p sec.
Inputs: (sID1, region, time) and ST (sID2, type)
Join Match (sID1 = sID2)
Filter region = JWY and type = “A”
Aggregate agg(count), Window(on time, size = p sec, offset = p sec)
Filter count > 20
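The Query 2 pipeline (join, filter, tumbling count, filter) might look like the following in Python. This is a sketch under assumptions: ST is modeled as a plain dictionary from squirrel ID to type, report times are integers in seconds and arrive in order, and the last window is flushed when the input ends, which a real unbounded stream would not do. The name `squirrel_alerts` is hypothetical.

```python
def squirrel_alerts(reports, squirrel_type, p, region="JWY", threshold=20):
    """Yield (window_start, count) for each p-second tumbling window in
    which more than `threshold` type 'A' reports came from `region`."""
    current = None  # index of the window currently open
    count = 0
    for sid, reg, time in reports:
        w = time // p
        if current is not None and w != current:
            if count > threshold:          # final Filter box
                yield current * p, count
            count = 0
        current = w
        # Join with ST on sID, then Filter on region and type.
        if reg == region and squirrel_type.get(sid) == "A":
            count += 1
    if current is not None and count > threshold:
        yield current * p, count           # flush at end of finite input

# Usage with a small threshold so the example stays short.
types = {1: "A", 2: "A", 3: "A", 4: "B"}
reports = [(1, "JWY", 0), (2, "JWY", 1), (3, "JWY", 2), (4, "JWY", 3),
           (1, "JWY", 10), (2, "ELSEWHERE", 11)]
print(list(squirrel_alerts(reports, types, p=10, threshold=2)))
```

Because each squirrel reports once every p seconds, a tumbling window of size p counts each squirrel at most once, which is why the slide sets size = offset = p.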
Query 3 Stream an event each time 3 different squirrels within a pairwise distance of 5 meters from each other chirp within 10 seconds of each other.
Input: (sID, loc, time). Each Join has inputs labeled 1 and 2; the second Join takes the first Join's output as input 1 and the stream (sID, loc, time) as input 2.
Join Match (1.sID not= 2.sID and dist(1.loc, 2.loc) < 5 m), Window(on time, size = 5 sec, offset = 1 tuple)
Join Match (dist(1.1.loc, 2.loc) < 5 m and dist(1.2.loc, 2.loc) < 5 m and 1.1.sID not= 2.sID and 1.2.sID not= 2.sID), Window(on time, size = 5 sec, offset = 1 tuple)
Super-bonus Query Create a log of flow information from a stream of packets. A flow (simple definition) from a source S to a destination D ends when no packet from S to D is seen for at least 2 minutes after the last packet from S to D. The next packet from S to D starts a new flow. The flow log contains the source, destination, count of packets, and total length of packets for each flow. Are you kidding!!!!
Actually, it’s Pretty Easy
[Figure: packets from S to D on a timeline, with a 2 min gap marked.]
Input: (pID, src, dest, length, time)
Aggregate Aggr = (init1, incr1, final1), Window(size = 2 tuples, offset = 1), GroupBy (src, dest)
State1 = (flow#: int, first packet, second packet)
init1 = {flow# := 0; first := null; second := null}
incr1(p) = {first := second; second := p; if first != null and second.time - first.time > 2 min then flow# := flow# + 1}
final1 = emit (second.src, second.dest, second.length, second.time, flow#)
Aggregate Aggr = (init2, incr2, final2), Window(on flow#, size = 1, offset = 1), GroupBy (src, dest)
State2 = (count int, len int)
init2 = {count := 0; len := 0}
incr2(p) = {count := count + 1; len := len + p.length}
final2 = emit (src, dest, len, count)
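The two-aggregate idea (number the flows first, then summarize each flow) can be approximated in Python as below. This sketch fuses the two stages into one pass and is not the Aurora plan itself. It assumes packets arrive in time order with `time` in seconds, and it emits (src, dest, count, total length), the order used in the problem statement. The name `flow_log` is hypothetical.

```python
def flow_log(packets, gap=120):
    """Yield (src, dest, count, total_length) for each finished flow.

    A flow for a (src, dest) pair is finished when the next packet for
    that pair arrives more than `gap` seconds after the previous one.
    """
    last_time = {}  # (src, dest) -> time of the most recent packet
    stats = {}      # (src, dest) -> [count, total_length] of the open flow
    for pid, src, dest, length, time in packets:
        key = (src, dest)
        if key in last_time and time - last_time[key] > gap:
            count, total = stats.pop(key)   # previous flow is over
            yield src, dest, count, total
        last_time[key] = time
        flow = stats.setdefault(key, [0, 0])
        flow[0] += 1
        flow[1] += length

# Usage with made-up packets: (pID, src, dest, length, time).
packets = [(1, "S", "D", 10, 0), (2, "S", "D", 20, 60), (3, "X", "Y", 5, 70),
           (4, "S", "D", 30, 300), (5, "S", "D", 40, 310)]
print(list(flow_log(packets)))
```

Like the plan on the slide, this version only learns that a flow has ended when a later packet for the same pair shows up, so the last flow of every pair stays open until more input arrives.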
… but this is not enough! • What if it was really important that I know about the squirrels within 1 minute of the intrusion? => Queries need Quality-of-Service support. In fact, QoS is an integral part of the declarative spec. of the query.
…but it gets worse! • Networks (e.g., mobile) can arbitrarily delay or lose tuples. => Operators can’t block indefinitely waiting for tuples. A corollary of latency-based QoS.
…and worse! • Tuples may not arrive at an operator in sort order. • The network can reorder them • Operators themselves can shuffle them. • Priority scheduling might force them out of order. • This complicates things. • windows • aggregates
Our Solution • Problem has to do with when to close windows. • Tradeoff: Latency (QoS) vs. Accuracy • Define additional parameters on windows that determine termination. • might result in lost data.
Our Solution (cont.) • For blocking (late tuples) => Timeout • For disorder (early tuples) => Slack
[Figure: timelines of arriving tuples, labeled "timeout interval (time)", "slack", and "timeout interval (#tuples)".]
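One possible reading of timeout and slack as window-closing rules is sketched below in Python. The exact semantics here are assumptions made for illustration, not Aurora's definitions: windows are tumbling over an integer event time, a window tolerates up to `slack` tuples from later windows before it closes, and it also closes once `timeout` time units have passed since its first tuple arrived. The timeout is only checked when a tuple arrives, whereas a real system would need a timer. The name `close_windows` is hypothetical.

```python
def close_windows(arrivals, size, slack, timeout):
    """arrivals: (arrival_time, event_time, value) in arrival order.
    Yields (window_start, values, reason) as windows close (sketch)."""
    open_w = {}     # window index -> {"values", "opened", "later"}
    closed = set()  # indices of closed windows (grows unbounded here)
    for arrival, event, value in arrivals:
        idx = event // size
        if idx in closed:
            continue  # tuple arrived after its window closed: lost data
        w = open_w.setdefault(idx, {"values": [], "opened": arrival, "later": 0})
        w["values"].append(value)
        for other_idx in sorted(open_w):  # sorted() copies, so deleting is safe
            other = open_w[other_idx]
            if other_idx < idx:
                other["later"] += 1       # a tuple from a later window was seen
            if other["later"] > slack:
                reason = "slack"
            elif arrival - other["opened"] >= timeout:
                reason = "timeout"
            else:
                continue
            closed.add(other_idx)
            del open_w[other_idx]
            yield other_idx * size, other["values"], reason

# Usage: "c" is out of order but saved by slack; "e" arrives too late and is dropped.
arrivals = [(0, 1, "a"), (1, 12, "b"), (2, 3, "c"), (3, 15, "d"),
            (4, 4, "e"), (200, 25, "f")]
print(list(close_windows(arrivals, size=10, slack=1, timeout=100)))
```

Raising `slack` or `timeout` keeps windows open longer, which recovers more late or reordered tuples at the cost of latency; lowering them does the opposite. That is the latency versus accuracy tradeoff named on the previous slide.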
Status • Now: • users supply values for timeout and slack. • As in examples, not always needed. • Goal: • automatically insert / adjust these values based on QoS specs.