
Efficient Processing of XML Update Streams

This article discusses the efficient processing of continuous XML update streams, including the challenges, techniques, and goals such as handling continuous XQueries and producing exact results. It also presents a motivating example and the implementation of a stream engine pipeline.



Presentation Transcript


  1. Efficient Processing of XML Update Streams
     Leonidas Fegaras, University of Texas at Arlington

  2. Data Stream Processing
     • What is a data stream?
       • continuous, time-varying data arriving at unpredictable rates
       • continuous updates, long-running queries, continuous results
     • Desired characteristics of stream processing engines:
       • real-time processing
       • high throughput, low latency, fast mean response time, low jitter
       • low memory footprint
     • Why bother?
       • much data is already available in stream form: sensor networks, network traffic monitoring, stock tickers
       • publisher-subscriber systems
       • data stream mining for fraud detection
       • data may be too volatile to index: continuous measurements

  3. XML Stream Processing
     • Why XML? There is no reason to normalize stream data
     • Various sources of XML streams:
       • tokenized XML documents
       • RSS feeds
       • web service results
     • Granularity:
       • XML tokens (events): <tag>, </tag>, "text", etc.
       • region-encoded XML elements (e.g., based on pre-order numbering)
       • XML fragments (the hole-filler model)
     • Push-based processing: SAX
     • Pull-based processing: StAX

  4. Traditional Stream Processing
     • Works on streams that consist of numerical values or relational tuples
     • Focuses on a sliding window:
       • a fixed number of tuples, or
       • a fixed time span
     • Calculates approximate results
     • Uses a small (bounded) state
     • Examples:
       • top-k most frequent values
       • group-by SQL queries
     [diagram: input stream → stream engine with sliding window and state (past/future) → output stream]

  5. Our Goals
     • Handle continuous XQueries over continuous streamed XML data
       • with embedded updates in the streams
     • Exact rather than approximate answers
     • Produce continuous results, even when the results are not complete
     • Problem: most interesting operations are blocking and/or require unbounded state:
       • grouping & aggregation
       • predicate evaluation
       • sorting
       • sequence concatenation
       • backward axis steps
     • We want to address the blocking problem differently:
       • display the current result of the blocking operation continuously, in the form of an update stream
       • incoming vs. generated updates

  6. Our View of XML Update Streams
     • A continuous (possibly infinite) sequence of XML tokens with embedded updates
       • typically, a finite data stream followed by an infinite stream of updates
     • SAX-like events
       • three basic types of tokens: <tag>, </tag>, text
     • The target of an update is a stream subsequence that contains zero, one, or more "complete" XML elements
       • the source is also a token sequence that contains complete XML elements
     • Updates are embedded in the data stream and can come at any time
       • update events can be interleaved with data events and with each other
       • each event must now have an ID to associate it with an update
       • updated regions can be updated too
     • To update a stream subsequence, you wrap it into a Mutable region
     • Three types of updates: replace, insertBefore, insertAfter

  7. Example
     id  Event
     1   <a>
     1   <b>
     2   startMutable(1)
     2   <c>
     2   X
     2   </c>
     2   endMutable(1)
     1   </b>
     3   startInsertBefore(2)
     3   <c>
     3   Y
     3   </c>
     3   endInsertBefore(2)
     1   </a>

     … equivalent to: <a> <b> <c> Y </c> <c> X </c> </b> </a>
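The equivalence above can be checked mechanically. Here is a small, hypothetical Python sketch (not the paper's implementation) that applies embedded Mutable/insertBefore updates to produce the equivalent plain token stream; the (kind, arg) event encoding is an illustrative assumption, and for simplicity the mutable region is labeled with the id its tokens carry (2):

```python
def apply_updates(events):
    """Flatten a token stream with embedded updates into plain tokens."""
    out = []       # flattened tokens produced so far
    region = {}    # mutable region id -> [start, end] positions in out
    pending = []   # stack of in-progress updates: (target id, buffered tokens)
    for kind, arg in events:
        if kind == "token":
            # tokens inside an update are buffered; the rest go to the output
            (pending[-1][1] if pending else out).append(arg)
        elif kind == "startMutable":
            region[arg] = [len(out), None]   # remember where the region starts
        elif kind == "endMutable":
            region[arg][1] = len(out)
        elif kind == "startInsertBefore":
            pending.append((arg, []))
        elif kind == "endInsertBefore":
            target, buf = pending.pop()
            pos = region[target][0]
            out[pos:pos] = buf               # splice before the target region
            for r in region.values():        # shift regions after the splice
                if r[0] >= pos:
                    r[0] += len(buf)
                if r[1] is not None and r[1] >= pos:
                    r[1] += len(buf)
    return out

# The stream from the example above, mutable region labeled by id 2:
events = [
    ("token", "<a>"), ("token", "<b>"),
    ("startMutable", 2), ("token", "<c>"), ("token", "X"),
    ("token", "</c>"), ("endMutable", 2),
    ("token", "</b>"),
    ("startInsertBefore", 2), ("token", "<c>"), ("token", "Y"),
    ("token", "</c>"), ("endInsertBefore", 2),
    ("token", "</a>"),
]
```

Joining the result of apply_updates(events) yields the equivalent stream shown above, <a><b><c>Y</c><c>X</c></b></a>.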

  8. Continuous Results
     • Our stream engine is implemented as a pipeline
       • each pipeline stage performs a very simple task
     • The final pipeline stage is the Result Display, which displays the query results continuously
       • the display is an editable text window, where text can be inserted, deleted, and replaced at any point
       • when an update comes in on the input stream, it is propagated all the way to the result display, where it causes an update to the display text!
     [diagram: input stream with updates → query pipeline → output stream with updates → result display]

  9. Motivating Example
     • Group and order book titles by author:

         let $al := distinct-values(doc("bib.xml")//book/author)
         return <books num="{ count($al) }">{
                  for $a in $al order by $a
                  return <author name="{ $a/text() }">{
                           doc("bib.xml")//book[author = $a]/title
                         }</author>
                }</books>

     • Multiple points of blocking:
       • distinct-values
       • count
       • self-join
       • order-by

  10. Motivating Example
      • The same query; the result display is refreshed continuously
      • Input stream:

          <book><author>D</author><title>T1</title></book>   ← currently
          <book><author>A</author><title>T2</title></book>
          <book><author>A</author><title>T3</title></book>
          <book><author>B</author><title>T4</title></book>
          <book><author>B</author><title>T5</title></book>
          <book><author>A</author><title>T6</title></book>
          <book><author>C</author><title>T7</title></book>
          <book><author>C</author><title>T8</title></book>
          <book><author>A</author><title>T9</title></book>
          <book><author>B</author><title>T10</title></book>
          <book><author>D</author><title>T11</title></book>
          ...

      • Display:

          <books num="35">
            <author name="D">
              <title>T1</title>
            </author>
          </books>

  11. Motivating Example
      • Same query and input stream; after <book><author>A</author><title>T2</title></book> arrives, the display becomes:

          <books num="36">
            <author name="A">
              <title>T2</title>
            </author>
            <author name="D">
              <title>T1</title>
            </author>
          </books>

  12. Motivating Example
      • Same query and input stream; after <book><author>A</author><title>T3</title></book> arrives, the display becomes:

          <books num="37">
            <author name="A">
              <title>T2</title>
              <title>T3</title>
            </author>
            <author name="D">
              <title>T1</title>
            </author>
          </books>

  13. Motivating Example
      • Same query and input stream; after <book><author>B</author><title>T4</title></book> arrives, the display becomes:

          <books num="38">
            <author name="A">
              <title>T2</title>
              <title>T3</title>
            </author>
            <author name="B">
              <title>T4</title>
            </author>
            <author name="D">
              <title>T1</title>
            </author>
          </books>

  14. Motivating Example
      • Same query and input stream; after <book><author>B</author><title>T5</title></book> arrives, the display becomes:

          <books num="39">
            <author name="A">
              <title>T2</title>
              <title>T3</title>
            </author>
            <author name="B">
              <title>T4</title>
              <title>T5</title>
            </author>
            <author name="D">
              <title>T1</title>
            </author>
          </books>

  15. Why?
      • Because this is what you really want to see as the result of a query
        • e.g., in a stock ticker feed stream, where updates to ticker values come continuously
      • It leads to "optimistic evaluation", where results are displayed immediately, to be retracted or modified later when more information is available
        • addresses the "blocking problem"
        • we proceed without waiting, displaying the results "so far", but later we may have to send updates
        • generalizes "on-line aggregation"

  16. Optimistic Evaluation
      • Pessimistic evaluation: at all times, the query display must show the correct results up to that point
      • Optimistic evaluation: display any possible output without delay and later, if necessary, retract or modify it to make it correct
        • complementary to, but different from, lazy evaluation
      • How?
        • generated and incoming updates are propagated through the evaluation pipeline until they are processed by the display
        • they may cause changes to the states of the pipeline stages
      • Examples:
        • event counting: instead of waiting until we have counted all events, we generate updates that continuously display the counter "so far"
        • predicate testing: assume the predicate is true, but when you later find that it is false, retract all output associated with this predicate
        • sorting: wrap each element to be sorted in an update that inserts it into the correct place in the element sequence "so far"

  17. Contributions
      • Instead of eagerly performing the updates on cached portions of the stream, we propagate the updates through the pipeline
        • … all the way to the query result display
        • the display prints the results continuously, replacing old results with new ones
      • Other approaches display approximate answers continuously by focusing on a sliding window over the stream
      • Our approach generates exact answers continuously, in the form of an update stream
      • But the propagated updates may affect the state of the operators
        • we developed a uniform methodology to incorporate state change
      • We used this framework to unblock operations and reduce buffering
        • let the operations themselves embed new updates into the stream that retroactively perform the blocking parts of the operation
        • why? because "later" is often better than "now"

  18. State Transformers
      • Each stage in the query evaluation pipeline is a state transformer
        • input: a single event and a state S
        • output: a sequence of events and a new state S'
      • Implemented as a function from an event to a sequence of events that destructively modifies the state
        • can be used in both pull- and push-based stream processing
      • The state transformers need only handle the basic SAX events: <tag>, </tag>, text, and begin/end of stream
      [diagram: a stream with regular and update events enters the state transformer, which holds its state and emits a stream with regular events]
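As a rough illustration of this abstraction (a sketch under an assumed (kind, value) event encoding, not the paper's code), each stage can be an object that consumes one event and returns a list of events, with the pipeline threading every input event through all stages:

```python
class StateTransformer:
    """One pipeline stage: maps one input event to a sequence of output
    events, destructively updating its internal state."""
    def transform(self, event):
        raise NotImplementedError

class TextOnly(StateTransformer):
    """Toy stage for demonstration: pass only text events through."""
    def transform(self, event):
        return [event] if event[0] == "text" else []

def run_pipeline(stages, events):
    """Push each incoming event through every stage, in order."""
    for e in events:
        batch = [e]
        for stage in stages:
            # each stage may expand one event into several (or none)
            batch = [out for ev in batch for out in stage.transform(ev)]
        yield from batch
```

Because a stage returns a sequence, one incoming event can fan out into several outgoing events (or be discarded), which is exactly what propagating updates requires.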

  19. Example: Element Counting
      • The state is an integer counter, count
      • A blocking state transformer, f(e):

          if e is a text event:
            count = count + 1
            return [ ]
          else if e is end-of-stream:
            return [ "count value" ]

      • A non-blocking state transformer:

          if e is begin-of-stream:
            return [ startMutable(id), "0", endMutable(id) ]
          else if e is a text event:
            count = count + 1
            return [ startReplace(id), "count value", endReplace(id) ]
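A runnable sketch of the non-blocking transformer above (the event tuples and class name are illustrative assumptions, not the paper's encoding):

```python
class CountingTransformer:
    """Non-blocking element counting: emit '0' wrapped in a mutable
    region at stream start, then a replacement update per text event."""
    def __init__(self, rid):
        self.count = 0      # the state: text events counted so far
        self.rid = rid      # id of the mutable region holding the count
    def transform(self, event):
        kind = event[0]
        if kind == "begin-of-stream":
            # display the initial count inside a mutable region
            return [("startMutable", self.rid), ("text", "0"),
                    ("endMutable", self.rid)]
        if kind == "text":
            self.count += 1
            # replace the displayed count with the value "so far"
            return [("startReplace", self.rid),
                    ("text", str(self.count)),
                    ("endReplace", self.rid)]
        return []
```

Instead of blocking until end-of-stream, the operator itself embeds replacement updates that retroactively correct the displayed count.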

  20. XPath Steps
      • The state transformers of simple XPath forward steps are trivial to implement
      • Example: the Child step (/tag)
        • state:
          • a counter nest to keep track of the nesting depth, and
          • a flag pass to remember if we are currently passing through or discarding events
        • logic:
          • when we see the event <tag> at nest=1, we enter pass mode and stay there until we see </tag> at nest=1
          • when in pass mode, we return the current event
          • otherwise, we return [ ]
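The nest/pass logic above might look as follows in Python (a sketch; the (kind, tag-or-text) event encoding is an assumption):

```python
class ChildStep:
    """Sketch of the child step /tag: enter pass mode on <tag> at
    nesting depth 1, leave it at the matching </tag>."""
    def __init__(self, tag):
        self.tag = tag
        self.nest = 0        # current nesting depth
        self.passing = False # are we inside a matching child?
    def transform(self, event):
        kind, val = event
        out = []
        if kind == "open":
            if self.nest == 1 and val == self.tag:
                self.passing = True            # entering a matching child
            if self.passing:
                out = [event]
            self.nest += 1
        elif kind == "close":
            self.nest -= 1
            if self.passing:
                out = [event]
            if self.nest == 1 and val == self.tag:
                self.passing = False           # leaving the matching child
        elif self.passing:                     # text event
            out = [event]
        return out
```

For example, over the event stream of <a><b>x</b><c>y</c><b>z</b></a>, ChildStep("b") passes through exactly the events of the two b children.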

  21. Handling Updates
      • It would be cumbersome to modify each state transformer to handle incoming update events
      • Our solution: the update events are handled in the same way for all state transformers
      • Each state transformer is wrapped by a fixed function that handles update events by adjusting the states, while passing the regular events to the state transformer
      [diagram: regular and update events enter the wrapper; the state-adjustment component keeps one state per update id plus the current state, and forwards only regular events to the wrapped state transformer]

  22. Adjusting States
      • For each state transformer, we need to provide only one function adjust(s1,s2,s3):
          "if state s2 is replaced by s3, adjust the succeeding state s1 accordingly"
      • Example: the adjust function for element counting is

          adjust(s1,s2,s3).count = s1.count + (s3.count - s2.count)

      • The adjust function for XPath steps is the identity:

          adjust(s1,s2,s3) = s1
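In code, the two adjust functions above are one-liners (a sketch; the counting state is represented as a plain integer, and the function names are illustrative):

```python
def adjust_count(s1, s2, s3):
    """Element counting: if state s2 is replaced by s3, shift the
    succeeding state s1 by the same difference."""
    return s1 + (s3 - s2)

def adjust_xpath(s1, s2, s3):
    """XPath steps: the identity -- their state needs no adjustment."""
    return s1
```

For example, if a saved count of 3 is later replaced by 5 (two extra elements were inserted), a succeeding count of 10 becomes 12.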

  23. Adjusting State for Element Counting
      id  Event                   element counting        adjustment
      1   <a>
      1   <b>
      2   startMutable(1)         start state(2) = n      n+1
      2   <c>
      2   X
      2   </c>
      2   endMutable(1)           end state(2) = n+1      n+2
      1   </b>
      3   startInsertBefore(2)    start state(3) = n
      3   <c>
      3   Y
      3   </c>
      3   endInsertBefore(2)      end state(3) = n+1
      1   </a>
      • Events with id=2 work with the id=2 state copy; events with id=3 work with the id=3 state copy

  24. The Result Display
      • It is like any other state transformer, but it also performs side effects on the display screen
      • The state of an update id is its position on the screen:

          adjust(s1,s2,s3) = s1 + s3 - s2

      • Side effects:
        • remove_text(start, end)
        • insert_text(position, text)
      • The state transformer is very simple:
        • e.g., for a <tag> event, insert "<tag>" on the screen at the current position
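A minimal sketch of the display stage (the side-effect method names follow the slide; representing the screen as a plain string is an assumption):

```python
class ResultDisplay:
    """An editable text buffer standing in for the display window."""
    def __init__(self):
        self.text = ""
    def insert_text(self, position, s):
        # splice s into the buffer at the given screen position
        self.text = self.text[:position] + s + self.text[position:]
    def remove_text(self, start, end):
        # delete the characters in [start, end)
        self.text = self.text[:start] + self.text[end:]

def adjust(s1, s2, s3):
    """A screen position after a replaced region shifts by the size
    difference of the replacement: s1 + s3 - s2."""
    return s1 + s3 - s2
```

Here the adjust function is plain position arithmetic: when a region of the screen grows or shrinks, every saved position after it shifts by the same amount.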

  25. Problem
      • For each update id, each state transformer must keep a separate copy of the state
        • this will lead to space explosion for an infinite stream of updates
      • Not applicable to replacement updates, since our queries are snapshot, not historical
      • For incoming updates, we can ignore updates that are irrelevant to the query
        • hard for content-based predicates
      • For generated updates, the scope of an update is usually limited
        • the scope is often known at run time
        • this allows the removal of out-of-scope states
        • e.g., predicate testing

  26. Unblocking XQuery Operations
      • We have used this technique for unblocking:
        • concatenation
        • predicates
        • descendant
        • backward steps
        • sorting
      • In our preliminary results, many XQueries on large data sets had high throughput, required very little buffering, and (of course) had a very fast first response time

  27. Future Work
      • Plan to cover most XQuery features completely
      • Would like to handle historical queries
        • same model for update streams
        • ... but now replacement updates may add a new version
        • example: "which stock increased its value by at least 10% since the last update?"
        • need to extend the XQuery syntax with historical features
        • need to cut off out-of-scope historical data
