Enhancing Data Querying with Aurora's Graphical Operators and Stream Processing

Panel on Stream Query Languages: The Aurora View
Stan Zdonik, Brown University

Aurora Queries
• We do not have an SQL-like language.
• We have a GUI for dataflow diagrams.
  • Boxes = operators
  • Arrows = streams
• Rationale:
  • CSE (common subexpression elimination) is tough for thousands of queries.
  • Workflow is more natural.
  • Easier for users to extend what's been done.
  • Best to understand implementation first.

Aurora Operators
• Very relational in spirit.
  • Filter, Map, Union, Join, Aggregate
• Adds Windows (everyone seems to agree).
  … with some wrinkles that we will get to.
• Adds a few operators.
  • Wsort
  • Resample

Simple Aggregation
[Figure: a stream of (A, B, C) tuples passing through an Aggregate box, labeled "Generalized aggregate".]
Aggregate: Agg(init, incr, final), Window(on C, size = 2, offset = 1), GroupBy A, B
• init: called when window opens
• incr: called for each new value
• final: called when window closes
• One or more open windows per group.
• Size and Offset given in:
  • #tuples, attribute interval, or time interval
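The init/incr/final protocol can be sketched in Python. This is a minimal sketch, not Aurora's implementation: it assumes windows are counted in #tuples per group (the slide's window is on attribute C), and the names `windowed_aggregate`, `key`, and the `final(group, state)` signature are hypothetical.

```python
from collections import defaultdict


def windowed_aggregate(stream, key, size, offset, init, incr, final):
    """Yield final(group, state) for each count-based sliding window.

    Assumption: size and offset are both measured in #tuples per group.
    """
    open_windows = defaultdict(list)  # group -> list of [state, tuples_seen]
    seen = defaultdict(int)           # group -> tuples seen so far
    for t in stream:
        k = key(t)
        # A new window opens every `offset` tuples within the group.
        if seen[k] % offset == 0:
            open_windows[k].append([init(), 0])
        seen[k] += 1
        still_open = []
        for w in open_windows[k]:
            w[0] = incr(w[0], t)      # incr: called for each new value
            w[1] += 1
            if w[1] == size:
                yield final(k, w[0])  # final: called when the window closes
            else:
                still_open.append(w)
        open_windows[k] = still_open


# Sum of C over windows of 2 tuples advancing by 1, grouped by (A, B).
tuples = [(1, 1, 1), (1, 1, 3), (1, 2, 2), (1, 1, 2), (1, 2, 1)]
sums = list(windowed_aggregate(
    tuples,
    key=lambda t: (t[0], t[1]),
    size=2,
    offset=1,
    init=lambda: 0,
    incr=lambda state, t: state + t[2],
    final=lambda group, state: (group, state),
))
```

With offset smaller than size, several windows are open at once in a group, which is the "one or more open windows per group" case.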

Query 1
Generate the stream of packets whose length is greater than twice the average packet length over the last 1 hour.
Input stream: (pID, length, time)
Aggregate: agg(init, incr, final), Window(on time, size = 1 hr, offset = 1 tuple)
  State = (sum int, num int, endtime int)
  init = {sum := 0; num := 0}
  incr(p) = {sum := sum + p.length; num := num + 1; endtime := p.time}
  final = emit(time2 = endtime, avgLen = sum / num)
Join: Match(length > 2 * avgLen and time = time2)
Map: f(t) = (t.ID, t.length, t.time)
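As a rough illustration of what Query 1 computes, the Aggregate, Join, and Map boxes can be collapsed into one Python generator. This is a sketch under stated assumptions: packets arrive in time order, time is in seconds, and the window ending at a packet includes that packet (as `endtime := p.time` suggests). The function name `long_packets` is hypothetical.

```python
from collections import deque


def long_packets(packets, window=3600):
    """Yield packets longer than twice the average length over the last hour."""
    recent = deque()  # (time, length) pairs still inside the window
    total = 0         # running sum of lengths in `recent`
    for pid, length, time in packets:
        recent.append((time, length))
        total += length
        # Expire packets that fell out of the one-hour window.
        while recent[0][0] <= time - window:
            _, old_length = recent.popleft()
            total -= old_length
        avg_len = total / len(recent)
        if length > 2 * avg_len:
            yield (pid, length, time)


packets = [(1, 100, 0), (2, 100, 10), (3, 100, 20), (4, 1000, 30), (5, 100, 4000)]
flagged = list(long_packets(packets))
```

Keeping a running sum avoids rescanning the window for every tuple, which matters because the window advances by one tuple at a time.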

Query 2
Create an alert when more than 20 type 'A' squirrels are in Jennifer's backyard. Assume squirrels report every p sec.
Input stream: (sID1, region, time)
ST: (sID2, type)
Join: Match(sID1 = sID2)
Filter: region = JWY and type = "A"
Aggregate: agg(count), Window(on time, size = p sec, offset = p sec)
Filter: count > 20
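The Query 2 pipeline can be approximated in Python as follows. This is only a sketch: it assumes ST is a static lookup from squirrel ID to type (modeled as a dict), that reports arrive in time order, that each squirrel reports once per window, and that windows are aligned to the first report. The names `squirrel_alerts` and `squirrel_type` are hypothetical, and the end-of-input flush exists only because the example stream is finite.

```python
def squirrel_alerts(reports, squirrel_type, p, region="JWY", wanted="A", threshold=20):
    """Yield (window_start, count) for each p-second window exceeding threshold."""
    window_start = None
    count = 0
    for sid, reg, time in reports:
        if window_start is None:
            window_start = time
        # Close every tumbling window that ends at or before this report.
        while time >= window_start + p:
            if count > threshold:
                yield (window_start, count)
            window_start += p
            count = 0
        # Join with the type table, then filter on region and type.
        if reg == region and squirrel_type.get(sid) == wanted:
            count += 1
    if window_start is not None and count > threshold:
        yield (window_start, count)


types = {1: "A", 2: "A", 3: "A", 4: "B"}
reports = [(1, "JWY", 0), (2, "JWY", 1), (3, "JWY", 2), (4, "JWY", 3),
           (1, "JWY", 10), (2, "PARK", 11)]
# A threshold of 2 keeps the example small; the slide uses 20.
alerts = list(squirrel_alerts(reports, types, p=10, threshold=2))
```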

Query 3
Stream an event each time 3 different squirrels within a pairwise distance of 5 meters from each other chirp within 10 seconds of each other.
Input streams 1 and 2 of each Join: (sID, loc, time)
Join: Match(1.sID not= 2.sID and dist(1.loc, 2.loc) < 5 m), Window(on time, size = 5 sec, offset = 1 tuple)
Join: Match(dist(1.1.loc, 2.loc) < 5 m and dist(1.2.loc, 2.loc) < 5 m and 1.1.sID not= 2.sID and 1.2.sID not= 2.sID), Window(on time, size = 5 sec, offset = 1 tuple)

Super-bonus Query
Create a log of flow information from a stream of packets. A flow (simple definition) from a source S to a destination D ends when no packet from S to D is seen for at least 2 minutes after the last packet from S to D. The next packet from S to D starts a new flow. The flow log contains the source, destination, count of packets, and total length of packets for each flow.
Are you kidding!!!!

Actually, it’s Pretty Easy
[Figure: packets from S to D separated by a 2 min gap, feeding two Aurora Aggregate boxes in sequence.]
Input stream: (pID, src, dest, length, time)
Aggregate 1: Aggr = (init1, incr1, final1), Window(size = 2 tuples, offset = 1), GroupBy(src, dest)
  State1 = (flow#: int, first packet, second packet)
  init1 = {flow# := 0; first := null; second := null}
  incr1(p) = {first := second; second := p; if second.time - first.time > 2 min then flow# := flow# + 1}
  final1 = emit(second.src, second.dest, second.length, second.time, flow#)
Aggregate 2: Aggr = (init2, incr2, final2), Window(on flow#, size = 1, offset = 1), GroupBy(src, dest)
  State2 = (count int, len int)
  init2 = {count := 0; len := 0}
  incr2(p) = {count := count + 1; len := len + p.length}
  final2 = emit(src, dest, len, count)
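The two-stage idea (tag each packet with a flow number, then aggregate per flow number) can be sketched in Python. This is an illustration under assumptions, not Aurora code: time is in seconds, packets arrive in time order, the gap test is strict (`>`) as in incr1, and the names `tag_flows` and `flow_log` are hypothetical. The final flush of open flows is there only because the example input is finite.

```python
def tag_flows(packets, gap=120):
    """Stage 1: append a per-(src, dest) flow number to every packet."""
    last_time = {}  # (src, dest) -> time of the previous packet
    flow_no = {}    # (src, dest) -> current flow number
    for pid, src, dest, length, time in packets:
        k = (src, dest)
        if k not in last_time:
            flow_no[k] = 0
        elif time - last_time[k] > gap:
            flow_no[k] += 1  # silence longer than the gap starts a new flow
        last_time[k] = time
        yield (src, dest, length, time, flow_no[k])


def flow_log(tagged):
    """Stage 2: emit (src, dest, len, count) whenever a flow number changes."""
    open_flows = {}  # (src, dest) -> [flow#, count, total_len]
    for src, dest, length, time, flow in tagged:
        k = (src, dest)
        state = open_flows.get(k)
        if state is not None and state[0] != flow:
            yield (src, dest, state[2], state[1])
            state = None
        if state is None:
            state = open_flows[k] = [flow, 0, 0]
        state[1] += 1
        state[2] += length
    # End of input: flush flows that are still open.
    for (src, dest), (_, count, total) in open_flows.items():
        yield (src, dest, total, count)


packets = [(1, "S", "D", 10, 0), (2, "S", "D", 20, 60),
           (3, "S", "D", 5, 300), (4, "X", "Y", 7, 310)]
log = list(flow_log(tag_flows(packets)))
```

On an unbounded stream a flow is only reported when the next packet for the same pair arrives, which is one reason the later slides introduce timeouts.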

… but this is not enough!
• What if it was really important that I know about the squirrels within 1 minute of the intrusion?
=> Queries need Quality-of-Service support.
In fact, QoS is an integral part of the declarative spec. of the query.

…but it gets worse!
• Networks (e.g., mobile) can arbitrarily delay or lose tuples.
=> Operators can’t block arbitrarily waiting.
A corollary of latency-based QoS.

…and worse!
• Tuples may not arrive at an operator in sort order.
  • The network can reorder them.
  • Operators themselves can shuffle them.
  • Priority scheduling might force them out of order.
• This complicates things.
  • windows
  • aggregates

Our Solution
• Problem has to do with when to close windows.
• Tradeoff: Latency (QoS) vs. Accuracy
• Define additional parameters on windows that determine termination.
  • might result in lost data.

Our Solution (cont.)
• For blocking (late tuples) => Timeout
• For disorder (early tuples) => Slack
[Figure: two timelines of tuples, one labeled "timeout interval (time)" and one labeled "slack" with "timeout interval (#tuples)".]
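To make the latency vs. accuracy tradeoff concrete, here is a small Python sketch of slack only (timeout would additionally need arrival times). The semantics are an assumption, not Aurora's definition: a tumbling window of width `size` stays open until more than `slack` tuples belonging to later windows have arrived, and tuples for an already closed window are dropped. The name `sum_with_slack` is hypothetical.

```python
def sum_with_slack(stream, size, slack):
    """Yield (window_index, sum) for tumbling windows that tolerate disorder."""
    sums = {}       # open window index -> running sum
    beyond = {}     # open window index -> tuples seen from later windows
    closed = set()  # windows already emitted
    for ts, value in stream:
        w = ts // size
        if w in closed:
            continue  # arrived too late: this data is lost
        sums[w] = sums.get(w, 0) + value
        beyond.setdefault(w, 0)
        # Every earlier open window has now seen one more "early" tuple.
        for open_w in sorted(sums):
            if open_w < w:
                beyond[open_w] += 1
                if beyond[open_w] > slack:
                    yield (open_w, sums.pop(open_w))
                    closed.add(open_w)


disordered = [(0, 1), (1, 1), (10, 5), (2, 1), (11, 5), (3, 1)]
strict = list(sum_with_slack(disordered, size=10, slack=0))
relaxed = list(sum_with_slack(disordered, size=10, slack=1))
```

With slack 0 the first window closes as soon as a tuple at time 10 shows up, so the stragglers at times 2 and 3 are lost; with slack 1 the tuple at time 2 is still counted, at the cost of emitting the result one tuple later.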

Status
• Now:
  • users supply values for timeout and slack.
  • As in examples, not always needed.
• Goal:
  • automatically insert / adjust these values based on QoS specs.
