XStreamCast: Broadcasting and Query Processing of Streamed XML

XStreamCast: Broadcasting and Query Processing of Streamed XML
Leonidas Fegaras
University of Texas at Arlington

The XStreamCast Group
• Faculty: Leonidas Fegaras, David Levine
• PhD students: Sujoe Bose, Weimin He, Hao Zhou, Tejas Shah
• Masters students: Vamsi K. Chaluvadi, Darsan Tatineni, Sravani Reddy
• Funded by NSF (will start on 1/1/04)
• Web page: http://lambda.uta.edu/XStreamCast/

The XStreamCast Architecture
Most web servers are pull-based: a client submits a request and the server returns the requested data. This does not scale to a very large number of clients that request similar query results.
Push-based dissemination: a server multicasts a stream of data to registered clients.
In our framework:
• A client registers with a server using a pull-based web service
• A server multicasts data to registered clients in a continuous stream
• Data are often derived by merging multiple input streams (e.g., sensor data)
• The server does not have any knowledge about the client queries
• The only tasks performed by the server are slicing, scheduling, and multicasting data:
  • Critical data may be repeated more often than non-critical data
  • Invalid data may be revoked
  • New updates may be broadcast as soon as they become available
• A client connects to multiple streams and evaluates continuous queries locally:
  • It doesn't register queries with the servers
  • All processing is done at the client side
  • No handshaking, no error-correction

The XStreamCast Data Model
• Based on XML rather than on flat relational data
• The server slices an XML data source into XML fragments. Each fragment:
  • is a filler that fills a hole
  • may contain holes, which can be filled by other fragments
  • is wrapped with control information, such as its unique hole ID, the path that reaches this fragment, etc.
• Hole IDs:
  • are similar to surrogates but are hidden from clients
  • are less restrictive than hierarchical key structures
• A continuous stream consists of a fragmented XML data source followed by continuous updates
• The unit of update is a fragment
• Snapshot view: a hole ID is associated with the latest update
• Temporal view: a hole ID is associated with the sequence of all updates

The Fragmented Hole-Filler Model

    <commodities>
      <vendor>
        <name> Wal-Mart </name>
        <items>
          <stream:hole id="10" tsid="5"/>
          <stream:hole id="20" tsid="5"/>
          ...
        </items>
      </vendor>
      ...
    </commodities>

    <stream:filler id="10" tsid="5">
      <item>
        <name> PDA </name>
        <make> HP </make>
        <model> PalmPilot </model>
        <price currency="USD">315.25</price>
      </item>
    </stream:filler>

    <stream:filler id="20" tsid="5">
      <item>
        <name> Calculator </name>
        <make> Casio </make>
        <model> FX-100 </model>
        <price currency="USD">50.25</price>
      </item>
    </stream:filler>
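The slicing step on the server side can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the XStreamCast implementation: the function name slice_document, the plain (un-namespaced) hole tag, the choice of cutting by tag name, and the convention that the root becomes filler 0 are all hypothetical.

```python
import copy
import itertools
import xml.etree.ElementTree as ET


def slice_document(root, cut_tags):
    """Split a document into fillers (hypothetical helper).

    Every element whose tag is in `cut_tags` is detached into its own
    filler and replaced in its parent by <hole id="..."/>.
    Returns a dict: filler id -> element; id 0 is the sliced root.
    """
    ids = itertools.count(1)
    fillers = {}

    def cut(elem):
        # Iterate over a copy of the child list; replacing a child in
        # place keeps the indices valid.
        for i, child in enumerate(list(elem)):
            cut(child)  # slice nested fragments first
            if child.tag in cut_tags:
                fid = next(ids)
                fillers[fid] = child
                elem[i] = ET.Element("hole", {"id": str(fid)})

    root = copy.deepcopy(root)  # leave the caller's tree untouched
    cut(root)
    fillers[0] = root
    return fillers


doc = ET.fromstring(
    "<commodities><vendor><name>Wal-Mart</name><items>"
    "<item><name>PDA</name></item>"
    "<item><name>Calculator</name></item>"
    "</items></vendor></commodities>"
)
fillers = slice_document(doc, {"item"})
```

Here fillers[0] holds the vendor skeleton with two holes, and fillers[1] and fillers[2] hold the two item fragments, which a scheduler could then broadcast independently.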

Query Processing
A client opens connections to streams and evaluates XQueries against these streams.
• The data view at the client side is the unfragmented data source
• For large streams, it is a bad idea to reconstruct the streamed data in the client's memory:
  • fragments need to be processed as soon as they become available from the server
• Some operators block or require unbounded memory:
  • Sorting
  • Joins between two streams or self-joins
  • Group-by with aggregation

Rest of the Talk
• An algebra for stored XML data
• An algebra for streamed XML data (snapshot view)
• The XCQL query language for querying time-varying streamed XML data (temporal view)
• Schema-based translation of XCQL

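An Algebra for Stored XML Data
Based on the nested-relational algebra:
• \(\mathrm{access}_v(T)\): access the XML data source \(T\) using \(v\)
• \(\sigma_{pred}(X)\): select fragments from \(X\) that satisfy \(pred\)
• \(\pi_{v_1,\ldots,v_n}(X)\): project
• \(X \cup Y\): merge
• \(X \bowtie_{pred} Y\): join
• \(\mu^{v,path}_{pred}(X)\): unnest (retrieve descendants of elements)
• \(\Delta^{\oplus,h}_{pred}(X)\): apply \(h\) and reduce by \(\oplus\)
• \(\Gamma^{v,\oplus,h}_{gs,pred}(X)\): group by \(gs\), apply \(h\) to each group, and reduce each group by \(\oplus\)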

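Semantics
\[
\begin{aligned}
\mathrm{access}_v(T) &= \{\, \langle v = T \rangle \,\} \\
\sigma_{pred}(X) &= \{\, t \mid t \in X,\ pred(t) \,\} \\
\pi_{v_1,\ldots,v_n}(X) &= \{\, \langle v_1 = t.v_1, \ldots, v_n = t.v_n \rangle \mid t \in X \,\} \\
X \cup Y &= X \mathbin{+\!\!+} Y \\
X \bowtie_{pred} Y &= \{\, t_x \circ t_y \mid t_x \in X,\ t_y \in Y,\ pred(t_x, t_y) \,\} \\
\mu^{v,path}_{pred}(X) &= \{\, t \circ \langle v = w \rangle \mid t \in X,\ w \in \mathrm{PATH}(t, path),\ pred(t, w) \,\} \\
\Delta^{\oplus,h}_{pred}(X) &= \oplus / \{\, h(t) \mid t \in X,\ pred(t) \,\} \\
\Gamma^{v,\oplus,h}_{gs,pred}(X) &= \ldots
\end{aligned}
\]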
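The set-comprehension semantics above translate almost literally into list comprehensions. The following Python sketch is an illustration only: tuples are modeled as dicts, XML values as nested dicts and lists, and all function names, the sample data, and the `zero` argument of reduce_by are assumptions introduced here (group-by is omitted because its definition is elided on the slide).

```python
from functools import reduce


def access(v, T):
    return [{v: T}]                                   # { <v = T> }

def select(pred, X):
    return [t for t in X if pred(t)]

def project(names, X):
    return [{v: t[v] for v in names} for t in X]

def merge(X, Y):
    return X + Y                                      # X ++ Y

def join(pred, X, Y):
    return [{**tx, **ty} for tx in X for ty in Y if pred(tx, ty)]

def unnest(v, path, X, pred=lambda t, w: True):
    # `path` plays the role of PATH(t, path): it maps a tuple to a list.
    return [{**t, v: w} for t in X for w in path(t) if pred(t, w)]

def reduce_by(op, zero, h, X, pred=lambda t: True):
    # op / { h(t) | t in X, pred(t) }, with `zero` as the identity of op.
    return reduce(op, [h(t) for t in X if pred(t)], zero)


# Made-up sample data, not taken from the talk.
bib = {"book": [
    {"title": "Book A", "publisher": "Addison-Wesley", "year": 1994},
    {"title": "Book B", "publisher": "Other Press", "year": 2000},
    {"title": "Book C", "publisher": "Addison-Wesley", "year": 1990},
]}

titles = reduce_by(
    lambda a, b: a + b, [],
    lambda t: [("book", t["b"]["title"])],
    unnest("b", lambda t: t["v"]["book"], access("v", bib)),
    pred=lambda t: t["b"]["publisher"] == "Addison-Wesley" and t["b"]["year"] > 1991,
)
```

The final expression composes access, unnest, and reduce in the same way the algebraic plans in the examples do, here with list concatenation as the accumulator.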

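XPath Expressions
• Path evaluation is central to the algebra:
\[ \mathrm{PATH} : (\text{XML-data},\ \text{simple-XPath}) \rightarrow \mathrm{set}(\text{XML-data}) \]
• Some rules for stored XML data:
\[
\begin{aligned}
\mathrm{PATH}(\texttt{<A>}x\texttt{</A>},\ A/path) &= \mathrm{PATH}(x,\ path) \\
\mathrm{PATH}(\texttt{<A>}x\texttt{</A>},\ A) &= \{\, \texttt{<A>}x\texttt{</A>} \,\} \\
\mathrm{PATH}(x_1\, x_2,\ path) &= \mathrm{PATH}(x_1,\ path) \cup \mathrm{PATH}(x_2,\ path) \\
\mathrm{PATH}(x,\ path) &= \emptyset \quad \text{otherwise}
\end{aligned}
\]
• Predicates have existential semantics:
\[ \$v/A/B = \text{"text"} \;\equiv\; \exists x \in \mathrm{PATH}(v,\ A/B) : x = \text{"text"} \]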
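The four PATH rules can be read as a recursive function over a sequence of elements. The sketch below is a minimal illustration using Python's xml.etree.ElementTree; the function names path and exists_equal, the representation of a simple path as a list of tag names, and the sample document are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def path(nodes, steps):
    """Evaluate a simple path (list of tag names) over a sequence of elements."""
    result = []
    for node in nodes:              # PATH(x1 x2, p) = PATH(x1, p) U PATH(x2, p)
        if node.tag != steps[0]:
            continue                # PATH(x, p) = {} otherwise
        if len(steps) == 1:
            result.append(node)     # PATH(<A>x</A>, A) = { <A>x</A> }
        else:
            # PATH(<A>x</A>, A/p) = PATH(x, p), where x is the child sequence
            result.extend(path(list(node), steps[1:]))
    return result


def exists_equal(nodes, steps, text):
    """Existential predicate: some node reached by the path has this text."""
    return any((n.text or "").strip() == text for n in path(nodes, steps))


doc = ET.fromstring(
    "<bib>"
    "<book><title>A</title><publisher>Addison-Wesley</publisher></book>"
    "<book><title>B</title><publisher>Other</publisher></book>"
    "</bib>"
)
titles = [n.text for n in path([doc], ["bib", "book", "title"])]
```

A predicate such as $v/bib/book/publisher = "Addison-Wesley" then corresponds to exists_equal([doc], ["bib", "book", "publisher"], "Addison-Wesley").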

Transforming XQueries to the Algebra
Transformation steps:
• XQueries to list comprehensions
• XPath terms to simple paths without predicates
• Normalization of nested comprehensions
• Generator domains are normalized into simple path expressions
• List comprehension to XML algebra

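Example #1
Algebraic plan (figure), from the root operator down:
• reduce (\(\Delta\)): head element("book", $b/title), predicate $b/publisher = "Addison-Wesley" and $b/@year > 1991
• unnest (\(\mu\)): $b over $v/bib/book
• source access: $v over document("http://www.bn.com")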

Example #2

    for $u in document("users.xml")//user_tuple
    return
      <user>
        { $u/name }
        { for $b in document("bids.xml")//bid_tuple[userid=$u/userid]/itemno,
              $i in document("items.xml")//item_tuple[itemno=$b]
          return <bid> { $i/description/text() } </bid>
          sortby(.) }
      </user>
    sortby(name)

Algebraic plan (figure), operator annotations from the root down:
• reduce: sort($u/name), elem("user", $u/name++)
• group-by: sort, elem("bid", $i/description/text())
• predicates: $i/itemno = $b and $c/userid = $u/userid
• unnest: $b over $c/itemno, $i over $is/items/item_tuple, $u over $us/users/user_tuple, $c over $bs/bids/bid_tuple
• source access: $is over document("items.xml"), $us over document("users.xml"), $bs over document("bids.xml")

Algebraic Optimization
• Optimizing query expressions as in relational algebra
• Query unnesting:
  • Nested queries are executed in nested-loop fashion
  • This is not possible in stream-based processing
• Blocking operators are replaced with non-blocking outer versions

The Streamed XML Algebra
Much like the stored XML algebra, but works on streams.
• The streams between operators are streams of tuples with fragments as tuple components.
• An input fragment is stored in a central state \(\Sigma\) (which can be garbage-collected) but can also be attached to tuples streamed through operators.
• A stream \(\omega\) between operators takes one of two forms:
  • \(t ; \omega'\): a tuple of fragments \(t\) followed by the rest of the stream \(\omega'\)
  • \(\mathit{eos}\): end-of-stream
• Each stored XML algebraic operator has a streamed counterpart, e.g.:
\[
\begin{aligned}
\sigma_{pred}(t ; \omega) &= t ; \sigma_{pred}(\omega) && \text{if } pred \text{ is true for } t \\
\sigma_{pred}(t ; \omega) &= \sigma_{pred}(\omega) && \text{otherwise} \\
\sigma_{pred}(\mathit{eos}) &= \mathit{eos}
\end{aligned}
\]
  but we may not be able to validate \(pred\) due to holes in \(t\).

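Streamed Algebra Semantics
• To keep the suspended fragments, each streamed algebraic operator has:
  • one state \(\Sigma_0\) for the output and
  • optional state(s) \(\Sigma_1\)/\(\Sigma_2\) for the input(s)
• The result of PATH may now be unspecified:
\[
\mathrm{PATH}(\texttt{<hole id="m" …>},\ path) =
\begin{cases}
\mathrm{PATH}(\Sigma(m),\ path) & \text{if } m \in \Sigma \\
\{\bot\} & \text{otherwise}
\end{cases}
\]
• When it appears in predicates, \(\bot\) requires 3-value logic
• Tuples with incomplete fragments are suspended when necessary, e.g.:
\[
\begin{aligned}
\sigma_{pred}(t ; \omega) &= t ; \sigma_{pred}(\omega) && \text{if } \mathit{true} \in \mathrm{PATH}(t, pred) \\
\sigma_{pred}(t ; \omega) &= \sigma_{pred}(\omega) && \text{otherwise} \\
\Sigma_0 &\leftarrow \Sigma_0 \cup \{t\} && \text{if } \bot \in \mathrm{PATH}(t, pred)
\end{aligned}
\]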
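A small Python sketch may make the suspension rule concrete. It is an illustration under several assumptions that are not in the talk: tuples are dicts, a missing fragment is a Hole marker, the unspecified value is modeled as None, and suspended tuples are re-evaluated when the matching filler arrives (the slide only states that they are kept in the output state). The class and method names are hypothetical.

```python
class Hole:
    """Marker for a tuple component whose fragment has not arrived yet."""
    def __init__(self, hole_id):
        self.hole_id = hole_id


class StreamedSelect:
    def __init__(self, field, test):
        self.field = field        # component the predicate looks at
        self.test = test          # ordinary two-valued test on a value
        self.fillers = {}         # central state: hole id -> value
        self.suspended = []       # output state: tuples with unknown verdict

    def _evaluate(self, t):
        """Three-valued verdict: True, False, or None (unspecified)."""
        value = t[self.field]
        if isinstance(value, Hole):
            if value.hole_id not in self.fillers:
                return None
            value = self.fillers[value.hole_id]
        return self.test(value)

    def on_tuple(self, t):
        """Return the tuples emitted downstream for this input tuple."""
        verdict = self._evaluate(t)
        if verdict is None:
            self.suspended.append(t)   # suspend: predicate not decidable yet
            return []
        return [t] if verdict else []

    def on_filler(self, hole_id, value):
        """Record a filler and release suspended tuples it decides."""
        self.fillers[hole_id] = value
        ready, still_waiting = [], []
        for t in self.suspended:
            verdict = self._evaluate(t)
            if verdict is None:
                still_waiting.append(t)
            elif verdict:
                ready.append(t)
        self.suspended = still_waiting
        return ready
```

For example, with StreamedSelect("price", lambda p: p > 100), a tuple whose price is Hole(20) produces no output until on_filler(20, 150.0) is called, at which point it is emitted.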

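Join
Much like the main-memory symmetric join.
• States:
  • \(\Sigma_0\): all suspended output tuples due to unfilled holes
  • \(\Sigma_1\): all tuples from the left stream
  • \(\Sigma_2\): all tuples from the right stream
• A tuple from the left stream:
\[
\begin{aligned}
(t_1 ; \omega_1) \bowtie_{pred} \omega_2 &= \{\, t_1 \circ t_2 \mid t_2 \in \Sigma_2,\ \mathit{true} \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\} ;\ (\omega_1 \bowtie_{pred} \omega_2) \\
\Sigma_1 &\leftarrow \Sigma_1 \cup \{t_1\} \\
\Sigma_0 &\leftarrow \Sigma_0 \cup \{\, t_1 \circ t_2 \mid t_2 \in \Sigma_2,\ \bot \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\}
\end{aligned}
\]
• A tuple from the right stream:
\[
\begin{aligned}
\omega_1 \bowtie_{pred} (t_2 ; \omega_2) &= \{\, t_1 \circ t_2 \mid t_1 \in \Sigma_1,\ \mathit{true} \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\} ;\ (\omega_1 \bowtie_{pred} \omega_2) \\
\Sigma_2 &\leftarrow \Sigma_2 \cup \{t_2\} \\
\Sigma_0 &\leftarrow \Sigma_0 \cup \{\, t_1 \circ t_2 \mid t_1 \in \Sigma_1,\ \bot \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\}
\end{aligned}
\]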
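The two symmetric cases above can be sketched as one Python class with a handler per input. This is a simplified illustration, not the authors' implementation: tuples are dicts with disjoint keys, the predicate returns True, False, or None (standing in for the unspecified value), and the class and attribute names are hypothetical.

```python
class SymmetricJoin:
    def __init__(self, pred):
        self.pred = pred       # three-valued: True, False, or None
        self.left = []         # all tuples seen on the left stream
        self.right = []        # all tuples seen on the right stream
        self.suspended = []    # joined tuples whose predicate is undecided

    def _probe(self, pairs):
        """Emit pairs known to match; suspend pairs that cannot be decided."""
        emitted = []
        for t1, t2 in pairs:
            verdict = self.pred(t1, t2)
            joined = {**t1, **t2}          # tuple concatenation
            if verdict is None:
                self.suspended.append(joined)
            elif verdict:
                emitted.append(joined)
        return emitted

    def on_left(self, t1):
        # Probe the stored right tuples first, then remember t1.
        emitted = self._probe([(t1, t2) for t2 in self.right])
        self.left.append(t1)
        return emitted

    def on_right(self, t2):
        # Probe the stored left tuples first, then remember t2.
        emitted = self._probe([(t1, t2) for t1 in self.left])
        self.right.append(t2)
        return emitted
```

Because each arriving tuple probes only the tuples already stored on the other side, every matching pair is produced exactly once, whichever side arrives first. Both input states grow without bound unless something evicts them, which is the memory concern raised on the Query Processing slide.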

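Reconstructing the XML Data
• \(\rho : \mathrm{set}(\mathrm{int} \times \text{XML-data})\) is an environment that binds filler ids to XML.
• \(x \triangleleft \rho\) replaces holes with fillers in \(x\) using the environment \(\rho\):
\[
\begin{aligned}
\texttt{<A>}\, x\, \texttt{</A>} \triangleleft \rho &= \texttt{<A>}\, (x \triangleleft \rho)\, \texttt{</A>} \\
(x_1\, x_2) \triangleleft \rho &= (x_1 \triangleleft \rho)\,(x_2 \triangleleft \rho) \\
\texttt{<hole id="m" …>} \triangleleft \rho &= \rho[m] \quad \text{if } m \in \rho \\
x \triangleleft \rho &= x \quad \text{otherwise}
\end{aligned}
\]
• \(R(\omega)\) returns a pair \((a, \rho)\), where \(a\) is \(\rho[0]\) (the reconstructed data). If \(R(\omega) = (a, \rho)\) then:
\[
\begin{aligned}
R(\texttt{<filler id="m">}\, x\, \texttt{</filler>} ; \omega) &=
\begin{cases}
(x \triangleleft \rho,\ \rho) & \text{if } m = 0 \\
(a',\ \rho') & \text{if } m \neq 0
\end{cases}
\quad \text{where } \rho' = \{(m,\ x \triangleleft \rho)\} \cup \rho[m/x] \\
R(\mathit{eos}) &= (\emptyset,\ \emptyset)
\end{aligned}
\]
• Basically, \(R(t ; \omega) = f(R(\omega))\)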
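For intuition, reconstruction in the snapshot view can be sketched in Python. Note that this sketch swaps in a simpler two-pass strategy (collect the latest filler per id, then substitute holes starting from filler 0) rather than the recursive definition of R above. The function name reconstruct, the un-namespaced hole and filler tags, and the assumption that the root arrives as filler 0 are illustrative choices.

```python
import xml.etree.ElementTree as ET


def reconstruct(stream):
    """Rebuild the snapshot document from a sequence of <filler> elements."""
    latest = {}
    for filler in stream:
        # Snapshot view: a later filler for the same id replaces the earlier one.
        latest[int(filler.get("id"))] = filler

    def resolve(elem):
        """Copy elem, replacing each hole with the payload of its filler."""
        out = ET.Element(elem.tag, dict(elem.attrib))
        out.text, out.tail = elem.text, elem.tail
        for child in elem:
            if child.tag == "hole" and int(child.get("id")) in latest:
                payload = latest[int(child.get("id"))]
                out.extend([resolve(c) for c in payload])
            else:
                out.append(resolve(child))   # unfilled holes stay in place
        return out

    # Filler 0 wraps the root of the document.
    return resolve(latest[0])[0]


stream = [ET.fromstring(s) for s in (
    '<filler id="10"><item><name>PDA</name></item></filler>',
    '<filler id="0"><commodities><items>'
    '<hole id="10"/><hole id="20"/></items></commodities></filler>',
    '<filler id="10"><item><name>Phone</name></item></filler>',
)]
doc = reconstruct(stream)
```

In this run the second filler for id 10 supersedes the first, and hole 20 remains in the result because no filler for it has arrived. A client that must not materialize the whole document would avoid this step, which is the motivation for the streamed algebra.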

Equivalence Between Stored & Streamed Algebras
If we reconstruct the XML document from the streamed fragments and evaluate a query using the stored algebra, we get the same result as when we use the equivalent streamed algebra over the streamed XML fragments and reconstruct the result.
Diagram: XML fragments → (reconstruction) → XML document → (stored XML algebra) → result, and XML fragments → (streamed XML algebra) → XML fragments → (reconstruction) → result.
Proof sketch: We prove \(R(\hat{p}(\omega)) = p(R(\omega))\) inductively, where \(\hat{p}\) is the stream version of \(p\). If \(\mathit{true} \in \mathrm{PATH}(t, pred)\), then
\[
R(\hat{p}(t ; \omega)) = R(t ; \hat{p}(\omega)) = f(R(\hat{p}(\omega))) = f(p(R(\omega))) = p(f(R(\omega))) = p(R(t ; \omega)) \ \ldots
\]

A Data Model for Temporal XML
Based on the Hole-Filler model, but:
• A fragment is now associated with a timestamp
• A hole may be associated with a sequence of fragments, say \((\langle f_1, t_1 \rangle, \ldots, \langle f_n, t_n \rangle)\), sorted by timestamp \(t_i\):
  • The \(i\)th version of this hole is \(f_i\)
  • The "last" version is \(f_n\)
  • The lifespan of the fragment \(f_i\) is \([t_i, t_{i+1}]\), where \(t_{n+1}\) is "now"
• The snapshot XML data are derived by ignoring all but the last version
• Holes, fragments, and timestamps are hidden from clients
• The client sees a temporal view, which can be queried by XCQL
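The versioning rules above amount to a small computation over the fragments received for one hole. The Python sketch below is illustrative only; the function names, the representation of a version as a (fragment, timestamp) pair, and the sample values are assumptions.

```python
def lifespans(versions, now="now"):
    """Turn [(fragment, timestamp), ...] into [(fragment, vtFrom, vtTo), ...].

    Version i lives from its own timestamp until the timestamp of
    version i+1; the last version lives until `now`.
    """
    ordered = sorted(versions, key=lambda v: v[1])
    ends = [t for _, t in ordered[1:]] + [now]
    return [(frag, start, end) for (frag, start), end in zip(ordered, ends)]


def snapshot(versions):
    """Snapshot view: ignore all but the last version."""
    return max(versions, key=lambda v: v[1])[0]


# Made-up versions of one hole, deliberately out of timestamp order.
versions = [("price=315.25", 5), ("price=299.00", 9), ("price=310.00", 7)]
spans = lifespans(versions)
current = snapshot(versions)
```

Here spans lists the three versions in timestamp order with their lifespans, and current is the fragment a snapshot query would see.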

XCQL: Continuous Query Language for XML
• It is basically XQuery extended with interval and version projections
• Inspired by Stanford's CQL (which is based on SQL)
• Without using the extensions, XCQL is equivalent to XQuery over the snapshot data
• Extensions:
  • Interval projection: e?[t1,t2], with shortcut e?[t] = e?[t,t], where t can be any XQuery time expression, including "now" and "start"
  • Version projection: e@[v1,v2], with shortcut e@[v] = e@[v,v], where v is any integer expression, including "last"
  • Valid time begin: vtFrom(e)
  • Valid time end: vtTo(e)
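The two projections can be illustrated on the version list of a single element. This Python sketch makes several assumptions that go beyond the slide: times are plain numbers, intervals are closed on both ends, versions are numbered from 1, and "now", "start", and "last" are expected to be resolved to concrete values by the caller. The function names mirror the XCQL operators but are otherwise hypothetical.

```python
def interval_projection(spans, tb, te):
    """e?[tb,te]: keep versions whose lifespan intersects [tb, te],
    clipping each lifespan to the requested interval."""
    result = []
    for frag, vt_from, vt_to in spans:
        if vt_from <= te and tb <= vt_to:          # closed intervals intersect
            result.append((frag, max(vt_from, tb), min(vt_to, te)))
    return result


def version_projection(spans, vb, ve):
    """e@[vb,ve]: keep versions vb through ve (1-based, inclusive)."""
    return spans[vb - 1:ve]


# Made-up lifespans: (fragment, vtFrom, vtTo), sorted by vtFrom.
spans = [("a", 0, 5), ("b", 5, 9), ("c", 9, 20)]
window = interval_projection(spans, 6, 12)
last_two = version_projection(spans, 2, 3)
```

Under these assumptions, the shortcut e?[t] corresponds to interval_projection(spans, t, t), and e@[last] to version_projection(spans, len(spans), len(spans)).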

Example
• A network management system receives two streams from a backbone router for TCP connections: one for SYN packets and another for ACK packets that acknowledge the receipt. We want to identify the misbehaving packets that do not receive an acknowledgment within a minute:

    for $s in stream("syn")//packet,
        $a in stream("ack")//packet?[vtFrom($s)+1min,now]
    where $s/id = $a/id
      and $s/srcIP = $a/destIP
      and $s/srcPort = $a/destPort
    return <warning> { $s/id } </warning>

The Temporal View
Deriving the temporal view from the fragmented stream:

    define function temporalize($tag as element()*)
    {
      for $e in $tag
      return
        if (not(empty($e/*)))
        then element {name($e)} {$e/@*, temporalize($e/*)}
        else if (name($e) = "hole")
        then temporalize(get_fillers($e/@id))
        else $e
    }

    define function get_fillers($fid as xs:integer)
    {
      let $fillers := doc("fragments.xml")/fragments/filler[@id=$fid]
      for $f at $p in $fillers
      let $e := $f/*
      order by $f/@validTime
      return
        element {name($e)}
          {$e/@*,
           attribute vtFrom {$f/@validTime},
           attribute vtTo {
             if ($p = count($fillers))
             then "now"
             else $fillers[$p+1]/@validTime },
           $e/node()}
    }

Translation of XCQL into XQuery
• e?[tb,te] is translated into interval_projection(e,tb,te)
• e@[vb,ve] is translated into version_projection(e,vb,ve)

    define function interval_projection
        ($e as element(), $tb as xs:time, $te as xs:time)
    {
      if (empty($e/@vtFrom))
      then element {name($e)}
             { for $c in $e/*
               return interval_projection($c,$tb,$te) }
      else if (not(interval_intersection($e/@vtFrom,$e/@vtTo,$tb,$te)))
      then ()
      else element {name($e)}
             { attribute vtFrom {max($e/@vtFrom,$tb)},
               attribute vtTo {min($e/@vtTo,$te)},
               for $c in $e/*
               return interval_projection($c,$tb,$te) }
    }

Recursion is Hard to Optimize
The recursion in temporalize, interval_projection, etc., can be eliminated if we know
• the complete schema, or
• the structural summary

    <tag name="creditAccounts">
      <temporal name="account">
        <tag name="customer"/>
        <tag name="creditLimit"/>
        <event name="transaction">
          <tag name="vendor"/>
          <tag name="amount"/>
          <tag name="status"/>
        </event>
      </temporal>
    </tag>

Fragmentation can only be done on temporal or event nodes.
• Temporal: has lifespan [vtFrom,vtTo]
• Event: occurs at one point in time (vtFrom=vtTo)

Schema-Based Mapping

    define function temporalizeCreditAccounts ( $e1 as element() ) as element()
    {
      <creditAccounts>
      { for $e2 in $e1/hole,
            $e3 in get_fillers($e2/@id)
        return
          <account>
          { $e3/customer, $e3/type, $e3/creditLimit,
            for $e4 in $e3/hole,
                $e5 in get_fillers($e4/@id)
            return
              <transaction>
              { $e5/vendor, $e5/amount, $e5/status }
              </transaction> }
          </account> }
      </creditAccounts>
    }

Example
Query:

    doc("creditSystem.xml")/account/transaction[amount > 1000]

Default translation:

    get_fillers_list(get_fillers_list(get_fillers_list(0)/account/hole/@id)
                     /transaction/hole/@id)[amount > 1000]

Using schema-based translation:

    temporalizeCreditAccounts(get_fillers(0))/account/transaction[amount > 1000]

Optimized (optimistic) translation:

    doc("fragments.xml")/fragments/filler/transaction[amount > 1000]

Future Work
• Optimal fragmentation and scheduling of fragments based on client profiles
• Query optimization of XCQL
• Design main-memory evaluation techniques for XML fragments
• Implement the framework!
• Application domain: network management
