
Efficient Processing of XML Update Streams

This article discusses the efficient processing of continuous XML update streams, including the challenges, techniques, and goals such as handling continuous XQueries and producing exact results. It also presents a motivating example and the implementation of a stream engine pipeline.



Presentation Transcript


  1. Efficient Processing of XML Update Streams
     Leonidas Fegaras, University of Texas at Arlington

  2. Data Stream Processing
     • What is a data stream?
       • continuous, time-varying data arriving at unpredictable rates
       • continuous updates, long-running queries, continuous results
     • Desired characteristics of stream processing engines:
       • real-time processing
       • high throughput, low latency, fast mean response time, low jitter
       • low memory footprint
     • Why bother?
       • much data is already available in stream form: sensor networks, network traffic monitoring, stock tickers
       • publisher-subscriber systems
       • data stream mining for fraud detection
       • data may be too volatile to index: continuous measurements

  3. XML Stream Processing
     • Why XML? There is no reason to normalize stream data
     • Various sources of XML streams:
       • tokenized XML documents
       • RSS feeds
       • web service results
     • Granularity:
       • XML tokens (events): <tag>, </tag>, "text", etc.
       • region-encoded XML elements (e.g., based on pre-order numbering)
       • XML fragments (the hole-filler model)
     • Push-based processing: SAX
     • Pull-based processing: StAX

  4. Traditional Stream Processing
     • Works on streams that consist of numerical values or relational tuples
     • Focuses on a sliding window:
       • a fixed number of tuples, or
       • a fixed time span
     • Calculates approximate results
     • Uses a small (bounded) state
     • Examples:
       • top-k most frequent values
       • group-by SQL queries
     [diagram: input stream → stream engine with sliding window and state (past/future) → output stream]

  5. Our Goals
     • Handle continuous XQueries over continuous streamed XML data
       • with embedded updates in the streams
     • Exact rather than approximate answers
     • Produce continuous results, even when the results are not complete
     • Problem: most interesting operations are blocking and/or require unbounded state:
       • grouping & aggregation
       • predicate evaluation
       • sorting
       • sequence concatenation
       • backward axis steps
     • We want to address the blocking problem differently:
       • display the current result of the blocking operation continuously, in the form of an update stream
       • incoming vs. generated updates

  6. Our View of XML Update Streams
     • A continuous (possibly infinite) sequence of XML tokens with embedded updates
       • typically, a finite data stream followed by an infinite stream of updates
     • SAX-like events
       • three basic types of tokens: <tag>, </tag>, text
     • The target of an update is a stream subsequence that contains zero, one, or more "complete" XML elements
       • the source is also a token sequence that contains complete XML elements
     • Updates are embedded in the data stream and can come at any time
       • update events can be interleaved with data events and with each other
       • each event must now have an ID to associate it with an update
       • updated regions can be updated too
     • To update a stream subsequence, you wrap it into a Mutable region
     • Three types of updates: replace, insertBefore, insertAfter

  7. Example
     id  Event
     1   <a>
     1   <b>
     2   startMutable(1)
     2   <c>
     2   X
     2   </c>
     2   endMutable(1)
     1   </b>
     3   startInsertBefore(2)
     3   <c>
     3   Y
     3   </c>
     3   endInsertBefore(2)
     1   </a>

     … equivalent to: <a> <b> <c> Y </c> <c> X </c> </b> </a>
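The equivalence above can be checked mechanically. Here is a small, hypothetical Python sketch (not the paper's implementation) that applies embedded Mutable/insertBefore updates to produce the equivalent plain token stream; the (kind, arg) event encoding is an illustrative assumption, and for simplicity the mutable region is labeled with the id its tokens carry (2):

```python
def apply_updates(events):
    """Flatten a token stream with embedded updates into plain tokens."""
    out = []       # flattened tokens produced so far
    region = {}    # mutable region id -> [start, end] positions in out
    pending = []   # stack of in-progress updates: (target id, buffered tokens)
    for kind, arg in events:
        if kind == "token":
            # tokens inside an update are buffered; the rest go to the output
            (pending[-1][1] if pending else out).append(arg)
        elif kind == "startMutable":
            region[arg] = [len(out), None]   # remember where the region starts
        elif kind == "endMutable":
            region[arg][1] = len(out)
        elif kind == "startInsertBefore":
            pending.append((arg, []))
        elif kind == "endInsertBefore":
            target, buf = pending.pop()
            pos = region[target][0]
            out[pos:pos] = buf               # splice before the target region
            for r in region.values():        # shift regions after the splice
                if r[0] >= pos:
                    r[0] += len(buf)
                if r[1] is not None and r[1] >= pos:
                    r[1] += len(buf)
    return out

# The stream from the example above, mutable region labeled by id 2:
events = [
    ("token", "<a>"), ("token", "<b>"),
    ("startMutable", 2), ("token", "<c>"), ("token", "X"),
    ("token", "</c>"), ("endMutable", 2),
    ("token", "</b>"),
    ("startInsertBefore", 2), ("token", "<c>"), ("token", "Y"),
    ("token", "</c>"), ("endInsertBefore", 2),
    ("token", "</a>"),
]
```

Joining the result of apply_updates(events) yields the equivalent stream shown above, <a><b><c>Y</c><c>X</c></b></a>.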

  8. Continuous Results
     • Our stream engine is implemented as a pipeline
       • each pipeline stage performs a very simple task
     • The final pipeline stage is the Result Display, which displays the query results continuously
       • the display is an editable text window, where text can be inserted, deleted, and replaced at any point
       • when an update comes in on the input stream, it is propagated all the way to the result display, where it causes an update to the display text!
     [diagram: input stream with updates → query pipeline → output stream with updates → result display]

  9. Motivating Example
     • Group and order book titles by author:

         let $al := distinct-values(doc("bib.xml")//book/author)
         return <books num="{ count($al) }">{
                  for $a in $al order by $a
                  return <author name="{ $a/text() }">{
                           doc("bib.xml")//book[author = $a]/title
                         }</author>
                }</books>

     • Multiple points of blocking:
       • distinct-values
       • count
       • self-join
       • order-by

  10. Motivating Example
      • The same query; the result display is refreshed continuously
      • Input stream:

          <book><author>D</author><title>T1</title></book>   ← currently
          <book><author>A</author><title>T2</title></book>
          <book><author>A</author><title>T3</title></book>
          <book><author>B</author><title>T4</title></book>
          <book><author>B</author><title>T5</title></book>
          <book><author>A</author><title>T6</title></book>
          <book><author>C</author><title>T7</title></book>
          <book><author>C</author><title>T8</title></book>
          <book><author>A</author><title>T9</title></book>
          <book><author>B</author><title>T10</title></book>
          <book><author>D</author><title>T11</title></book>
          ...

      • Display:

          <books num="35">
            <author name="D">
              <title>T1</title>
            </author>
          </books>

  11. Motivating Example
      • Same query and input stream; after <book><author>A</author><title>T2</title></book> arrives, the display becomes:

          <books num="36">
            <author name="A">
              <title>T2</title>
            </author>
            <author name="D">
              <title>T1</title>
            </author>
          </books>

  12. Motivating Example
      • Same query and input stream; after <book><author>A</author><title>T3</title></book> arrives, the display becomes:

          <books num="37">
            <author name="A">
              <title>T2</title>
              <title>T3</title>
            </author>
            <author name="D">
              <title>T1</title>
            </author>
          </books>

  13. Motivating Example
      • Same query and input stream; after <book><author>B</author><title>T4</title></book> arrives, the display becomes:

          <books num="38">
            <author name="A">
              <title>T2</title>
              <title>T3</title>
            </author>
            <author name="B">
              <title>T4</title>
            </author>
            <author name="D">
              <title>T1</title>
            </author>
          </books>

  14. Motivating Example
      • Same query and input stream; after <book><author>B</author><title>T5</title></book> arrives, the display becomes:

          <books num="39">
            <author name="A">
              <title>T2</title>
              <title>T3</title>
            </author>
            <author name="B">
              <title>T4</title>
              <title>T5</title>
            </author>
            <author name="D">
              <title>T1</title>
            </author>
          </books>

  15. Why?
      • Because this is what you really want to see as the result of a query
        • e.g., in a stock ticker feed stream, where updates to ticker values come continuously
      • It leads to "optimistic evaluation", where results are displayed immediately, to be retracted or modified later when more information is available
        • addresses the "blocking problem"
        • we proceed without waiting, displaying the results "so far", but later we may have to send updates
        • generalizes "on-line aggregation"

  16. Optimistic Evaluation
      • Pessimistic evaluation: at all times, the query display must show the correct results up to that point
      • Optimistic evaluation: display any possible output without delay and later, if necessary, retract or modify it to make it correct
        • complementary to, but different from, lazy evaluation
      • How?
        • generated and incoming updates are propagated through the evaluation pipeline until they are processed by the display
        • they may cause changes to the states of the pipeline stages
      • Examples:
        • event counting: instead of waiting until we have counted all events, we generate updates that continuously display the counter "so far"
        • predicate testing: assume the predicate is true, but when you later find that it is false, retract all output associated with this predicate
        • sorting: wrap each element to be sorted in an update that inserts it into the correct place in the element sequence "so far"

  17. Contributions
      • Instead of eagerly performing the updates on cached portions of the stream, we propagate the updates through the pipeline
        • … all the way to the query result display
        • the display prints the results continuously, replacing old results with new ones
      • Other approaches display approximate answers continuously by focusing on a sliding window over the stream
      • Our approach generates exact answers continuously, in the form of an update stream
      • But the propagated updates may affect the state of the operators
        • we developed a uniform methodology to incorporate state change
      • We used this framework to unblock operations and reduce buffering
        • let the operations themselves embed new updates into the stream that retroactively perform the blocking parts of the operation
        • why? because "later" is often better than "now"

  18. State Transformers
      • Each stage in the query evaluation pipeline is a state transformer
        • input: a single event and a state S
        • output: a sequence of events and a new state S'
      • Implemented as a function from an event to a sequence of events that destructively modifies the state
        • can be used in both pull- and push-based stream processing
      • The state transformers need only handle the basic SAX events: <tag>, </tag>, text, and begin/end of stream
      [diagram: a stream with regular and update events enters the state transformer, which holds its state and emits a stream with regular events]
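As a rough illustration of this abstraction (a sketch under an assumed (kind, value) event encoding, not the paper's code), each stage can be an object that consumes one event and returns a list of events, with the pipeline threading every input event through all stages:

```python
class StateTransformer:
    """One pipeline stage: maps one input event to a sequence of output
    events, destructively updating its internal state."""
    def transform(self, event):
        raise NotImplementedError

class TextOnly(StateTransformer):
    """Toy stage for demonstration: pass only text events through."""
    def transform(self, event):
        return [event] if event[0] == "text" else []

def run_pipeline(stages, events):
    """Push each incoming event through every stage, in order."""
    for e in events:
        batch = [e]
        for stage in stages:
            # each stage may expand one event into several (or none)
            batch = [out for ev in batch for out in stage.transform(ev)]
        yield from batch
```

Because a stage returns a sequence, one incoming event can fan out into several outgoing events (or be discarded), which is exactly what propagating updates requires.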

  19. Example: Element Counting
      • The state is an integer counter, count
      • A blocking state transformer, f(e):

          if e is a text event:
            count = count + 1
            return [ ]
          else if e is end-of-stream:
            return [ "count value" ]

      • A non-blocking state transformer:

          if e is begin-of-stream:
            return [ startMutable(id), "0", endMutable(id) ]
          else if e is a text event:
            count = count + 1
            return [ startReplace(id), "count value", endReplace(id) ]
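A runnable sketch of the non-blocking transformer above (the event tuples and class name are illustrative assumptions, not the paper's encoding):

```python
class CountingTransformer:
    """Non-blocking element counting: emit '0' wrapped in a mutable
    region at stream start, then a replacement update per text event."""
    def __init__(self, rid):
        self.count = 0      # the state: text events counted so far
        self.rid = rid      # id of the mutable region holding the count
    def transform(self, event):
        kind = event[0]
        if kind == "begin-of-stream":
            # display the initial count inside a mutable region
            return [("startMutable", self.rid), ("text", "0"),
                    ("endMutable", self.rid)]
        if kind == "text":
            self.count += 1
            # replace the displayed count with the value "so far"
            return [("startReplace", self.rid),
                    ("text", str(self.count)),
                    ("endReplace", self.rid)]
        return []
```

Instead of blocking until end-of-stream, the operator itself embeds replacement updates that retroactively correct the displayed count.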

  20. XPath Steps
      • The state transformers of simple XPath forward steps are trivial to implement
      • Example: the Child step (/tag)
        • state:
          • a counter nest to keep track of the nesting depth, and
          • a flag pass to remember if we are currently passing through or discarding events
        • logic:
          • when we see the event <tag> at nest=1, we enter pass mode and stay there until we see </tag> at nest=1
          • when in pass mode, we return the current event
          • otherwise, we return [ ]
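The nest/pass logic above might look as follows in Python (a sketch; the (kind, tag-or-text) event encoding is an assumption):

```python
class ChildStep:
    """Sketch of the child step /tag: enter pass mode on <tag> at
    nesting depth 1, leave it at the matching </tag>."""
    def __init__(self, tag):
        self.tag = tag
        self.nest = 0        # current nesting depth
        self.passing = False # are we inside a matching child?
    def transform(self, event):
        kind, val = event
        out = []
        if kind == "open":
            if self.nest == 1 and val == self.tag:
                self.passing = True            # entering a matching child
            if self.passing:
                out = [event]
            self.nest += 1
        elif kind == "close":
            self.nest -= 1
            if self.passing:
                out = [event]
            if self.nest == 1 and val == self.tag:
                self.passing = False           # leaving the matching child
        elif self.passing:                     # text event
            out = [event]
        return out
```

For example, over the event stream of <a><b>x</b><c>y</c><b>z</b></a>, ChildStep("b") passes through exactly the events of the two b children.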

  21. Handling Updates
      • It would be cumbersome to modify each state transformer to handle incoming update events
      • Our solution: the update events are handled in the same way for all state transformers
      • Each state transformer is wrapped by a fixed function that handles update events by adjusting the states, while passing the regular events to the state transformer
      [diagram: regular and update events enter the wrapper; the state-adjustment component keeps one state per update id plus the current state, and forwards only regular events to the wrapped state transformer]

  22. Adjusting States
      • For each state transformer, we need to provide only one function adjust(s1,s2,s3):
          "if state s2 is replaced by s3, adjust the succeeding state s1 accordingly"
      • Example: the adjust function for element counting is

          adjust(s1,s2,s3).count = s1.count + (s3.count - s2.count)

      • The adjust function for XPath steps is the identity:

          adjust(s1,s2,s3) = s1
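In code, the two adjust functions above are one-liners (a sketch; the counting state is represented as a plain integer, and the function names are illustrative):

```python
def adjust_count(s1, s2, s3):
    """Element counting: if state s2 is replaced by s3, shift the
    succeeding state s1 by the same difference."""
    return s1 + (s3 - s2)

def adjust_xpath(s1, s2, s3):
    """XPath steps: the identity -- their state needs no adjustment."""
    return s1
```

For example, if a saved count of 3 is later replaced by 5 (two extra elements were inserted), a succeeding count of 10 becomes 12.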

  23. Adjusting State for Element Counting
      id  Event                   element counting        adjustment
      1   <a>
      1   <b>
      2   startMutable(1)         start state(2) = n      n+1
      2   <c>
      2   X
      2   </c>
      2   endMutable(1)           end state(2) = n+1      n+2
      1   </b>
      3   startInsertBefore(2)    start state(3) = n
      3   <c>
      3   Y
      3   </c>
      3   endInsertBefore(2)      end state(3) = n+1
      1   </a>
      • Events with id=2 work with the id=2 state copy; events with id=3 work with the id=3 state copy

  24. The Result Display
      • It is like any other state transformer, but it also performs side effects on the display screen
      • The state of an update id is its position on the screen:

          adjust(s1,s2,s3) = s1 + s3 - s2

      • Side effects:
        • remove_text(start, end)
        • insert_text(position, text)
      • The state transformer is very simple:
        • e.g., for a <tag> event, insert "<tag>" on the screen at the current position
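A minimal sketch of the display stage (the side-effect method names follow the slide; representing the screen as a plain string is an assumption):

```python
class ResultDisplay:
    """An editable text buffer standing in for the display window."""
    def __init__(self):
        self.text = ""
    def insert_text(self, position, s):
        # splice s into the buffer at the given screen position
        self.text = self.text[:position] + s + self.text[position:]
    def remove_text(self, start, end):
        # delete the characters in [start, end)
        self.text = self.text[:start] + self.text[end:]

def adjust(s1, s2, s3):
    """A screen position after a replaced region shifts by the size
    difference of the replacement: s1 + s3 - s2."""
    return s1 + s3 - s2
```

Here the adjust function is plain position arithmetic: when a region of the screen grows or shrinks, every saved position after it shifts by the same amount.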

  25. Problem
      • For each update id, each state transformer must keep a separate copy of the state
        • this will lead to space explosion for an infinite stream of updates
      • Not applicable to replacement updates, since our queries are snapshot, not historical
      • For incoming updates, we can ignore updates that are irrelevant to the query
        • hard for content-based predicates
      • For generated updates, the scope of an update is usually limited
        • the scope is often known at run time
        • this allows the removal of out-of-scope states
        • e.g., predicate testing

  26. Unblocking XQuery Operations
      • We have used this technique for unblocking:
        • concatenation
        • predicates
        • descendant
        • backward steps
        • sorting
      • In our preliminary results, many XQueries on large data sets had high throughput, required very little buffering, and (of course) had a very fast first response time

  27. Future Work
      • Plan to cover most XQuery features completely
      • Would like to handle historical queries
        • same model for update streams
        • ... but now replacement updates may add a new version
        • example: "which stock increased its value by at least 10% since the last update?"
        • need to extend the XQuery syntax with historical features
        • need to cut off out-of-scope historical data
