PARTIAL RESULTS AND STRUCTURAL AGGREGATION OVER XML DATA STREAMS

PARTIAL RESULTS AND STRUCTURAL AGGREGATION OVER XML DATA STREAMS Kristin Tufte PhD Defense Dec 17, 2004

Streams & XML • That was then: flat tuples, e.g. (Jones, Bob, 153 Fir St., Portland) • …this is now: nested, structured data (XML), e.g. person (lname: Jones, fname: Bob, address (street: 153 Fir St., city: Portland)) • Streams: network traffic information, environmental sensor data, telephone call records, click streams

New Challenges • XML • Data is nested • New operators, query language • Streams • Potentially infinite • Produce results without waiting for end of stream/data • Arrival rate not in control of database system • XML Streams • Stock Data • Data Exchange • Intelligent Transportation Systems

Talk Preview • Incremental Query Evaluation (IQE) • Merge Operation • Merge Theory • Merge Performance

Context for IQE • Continuous Queries – Tapestry (early 1990s) • Monotonic queries, append-only databases • Long-running Queries • Online aggregation (Hellerstein et al.) • Nested Aggregates (Tan et al.) • Incremental Query Evaluation (IQE) (Partial Results) • General solution for long-running queries over XML data • Stream Processing • Potentially infinite streams of data • STREAM, Aurora (Borealis), Niagara West • Triggers (Eric Hanson, NiagaraCQ)

Incremental Query Evaluation* • Motivation: Internet queries (long-running, data in XML) • Get results to users before all of the data arrives • Non-monotonic (blocking) operators are problematic • Modify operators and system framework • Example query plan (top to bottom): count group by Subject; select DateTime ≥ “12/17/04:12AM”; input (Title, Subject, DateTime) * Joint work with Jai Shanmugasundaram

(Non-)monotonic Operators • An operator O is monotonic if: A ⊆ B ⇒ O(A) ⊆ O(B) • Examples: select, join (but often implemented with a blocking algorithm) • O is non-monotonic if it is not monotonic • Examples: aggregates, nest • On new input, monotonic operators add to output; non-monotonic operators change output • Example query plan (top to bottom): count group by Subject; select DateTime ≥ “12/17/04:12AM”; input (Title, Subject, DateTime)
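The definition can be checked on a toy input. The following is a minimal sketch, assuming tuples of the form (Title, Subject, DateTime) with the time reduced to an integer hour; the function names `select_recent` and `count_by_subject` are hypothetical and only illustrate the two operator classes.

```python
from collections import Counter

def select_recent(tuples, cutoff):
    """Monotonic: keeps tuples whose DateTime (an integer hour here) is >= cutoff."""
    return {t for t in tuples if t[2] >= cutoff}

def count_by_subject(tuples):
    """Non-monotonic: emits one (Subject, count) pair per group."""
    counts = Counter(subject for _title, subject, _when in tuples)
    return set(counts.items())

A = {("Title1", "Ukraine", 1), ("Title2", "Ukraine", 3)}
B = A | {("Title3", "Ukraine", 5)}  # A is a subset of B

# select: everything output for A is still output for B
print(select_recent(A, 0) <= select_recent(B, 0))

# count: ("Ukraine", 2) was output for A but is no longer valid for B
print(count_by_subject(A) <= count_by_subject(B))
```

The first check holds because new input can only add tuples to the output of select; the second fails because the new tuple changes an already-emitted count.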

Handling Non-monotonic Operators • Users issue partial result requests • Re-evaluation – transmit full result on every partial result request • Differential – avoid retransmitting duplicate data • Operators produce and process tuple inserts, deletes, updates • All tuples contain “old value” and “new value” • Example query plan (top to bottom): top10(count); count group by Subject; select DateTime ≥ “12/17/04:12AM”; input (Title, Subject, DateTime) • Tuples into count (old value: Title, Subject, D/T; new value: Title, Subject, D/T): (null, null, null, Title1, Ukraine, 1AM), (null, null, null, Title2, Ukraine, 3AM), (null, null, null, Title3, Ukraine, 5AM) • Tuples out of count (old value: Subject, Count; new value: Subject, Count): (null, null, Ukraine, 2), (Ukraine, 2, Ukraine, 3)
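The differential approach for count can be sketched as follows. This is an illustration only, not the Niagara implementation: the class name `DifferentialCount` and its methods are assumptions, and deletes are left out.

```python
class DifferentialCount:
    """Hypothetical differential 'count group by Subject' operator."""

    def __init__(self):
        self.counts = {}    # current count per subject
        self.reported = {}  # count last sent downstream per subject

    def insert(self, title, subject, when):
        # An inserted input tuple increments the count of its group.
        self.counts[subject] = self.counts.get(subject, 0) + 1

    def partial_result(self):
        """Return (old subject, old count, new subject, new count) tuples
        describing what changed since the previous partial result request."""
        updates = []
        for subject, count in self.counts.items():
            old = self.reported.get(subject)
            if old is None:
                updates.append((None, None, subject, count))    # insert
            elif old != count:
                updates.append((subject, old, subject, count))  # update
            self.reported[subject] = count
        return updates

op = DifferentialCount()
op.insert("Title1", "Ukraine", "1AM")
op.insert("Title2", "Ukraine", "3AM")
first = op.partial_result()   # one insert tuple for Ukraine with count 2
op.insert("Title3", "Ukraine", "5AM")
second = op.partial_result()  # one update tuple: count 2 becomes 3
```

Only the changed group is retransmitted on the second request, which is the saving the differential approach aims for.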

Re-evaluation vs. Differential

Skewed Data

Differential Nest • Input tuples (Subject, Title): (Google, Title1), (Microsoft, Title2), (Microsoft, Title3) • Produce partial result; nest emits (old value, new value): (null, null, Google, {Title1}), (null, null, Microsoft, {Title2, Title3}) • Input (Google, Title4) yields (Google, {Title1}, Google, {Title1, Title4}) • Input (Google, Title5) yields (Google, {Title1, Title4}, Google, {Title1, Title4, Title5}) • But what you’d really like to send is (Google, {Title5}) and “merge” it with (Google, {Title1, Title4}) • Merge: Subject: Google (Title1, Title4) merged with Subject: Google (Title5) gives Subject: Google (Title1, Title4, Title5)

Talk Preview • Incremental Query Evaluation • Merge Operation • Merge Theory • Merge Performance

Merge Operation • Flexible method for combining two XML (nested) documents: a “recursive union” over similarly-structured XML documents • Merge Template guides the process • “Keys” are used to indicate when elements should be combined

Merge Example • Auction Document: auction (item (iid:433, desc: 1971 Martin Guitar), item (iid:501, desc: Trek Madone 5.9 Bike, bid (bidder: Dave, amt: $1500))) • New Bid: auction (item (iid:501, bid (bidder: Sue, amt: $1550))) • Merged Document: auction (item (iid:433, desc: 1971 Martin Guitar), item (iid:501, desc: Trek Madone 5.9 Bike, bid (bidder: Dave, amt: $1500), bid (bidder: Sue, amt: $1550))) • Legend: Combined, Inserted, Used in Match

Merge Template (MT) • Merge Template is an XML document consisting of a tree of Element Merge Templates (EMT) • EMT is a triplet containing: (name, local key, content combine function) • Auction Merge Template: (auction, [], NoContentNoAttrs) has child (item, [iid], NoContentNoAttrs); item has children (iid, [], ExactMatch), (desc, [], ShallowContent - Replace) and (bid, [bidder, amt], NoContentNoAttrs); bid has children (bidder, [], ExactMatch) and (amt, [], ExactMatch) • Matching document: auction (item (iid:501, bid (bidder: Sue, amt: $1550)))
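A template-guided merge of the auction example can be sketched as below. This is a simplified illustration, not the Niagara code: elements and templates are plain dicts, attributes are ignored, and the combine names "none", "exact" and "replace" are stand-ins for NoContentNoAttrs, ExactMatch and ShallowContent - Replace. The helpers `elem`, `emt`, `key_value` and `merge` are hypothetical.

```python
def elem(name, text=None, children=()):
    return {"name": name, "text": text, "children": list(children)}

def emt(name, key=(), combine="none", children=()):
    """Element Merge Template: (name, local key, content combine function)."""
    return {"name": name, "key": list(key), "combine": combine,
            "children": {c["name"]: c for c in children}}

def key_value(e, tmpl):
    """Local key value: text of the child elements named in the template key."""
    return tuple(c["text"] for k in tmpl["key"]
                 for c in e["children"] if c["name"] == k)

def merge(a, b, tmpl):
    """Merge element b into a copy of element a; both match template tmpl."""
    replace = tmpl["combine"] == "replace" and b["text"] is not None
    out = elem(a["name"], b["text"] if replace else a["text"], a["children"])
    for cb in b["children"]:
        ct = tmpl["children"][cb["name"]]
        for i, ca in enumerate(out["children"]):
            same = ca["name"] == cb["name"] and key_value(ca, ct) == key_value(cb, ct)
            if same and (ct["combine"] != "exact" or ca["text"] == cb["text"]):
                out["children"][i] = merge(ca, cb, ct)  # combine matching elements
                break
        else:
            out["children"].append(cb)                  # no match: insert
    return out

AUCTION_MT = emt("auction", children=[
    emt("item", key=["iid"], children=[
        emt("iid", combine="exact"),
        emt("desc", combine="replace"),
        emt("bid", key=["bidder", "amt"], children=[
            emt("bidder", combine="exact"), emt("amt", combine="exact")]),
    ]),
])

auction_doc = elem("auction", children=[
    elem("item", children=[elem("iid", "433"), elem("desc", "1971 Martin Guitar")]),
    elem("item", children=[
        elem("iid", "501"), elem("desc", "Trek Madone 5.9 Bike"),
        elem("bid", children=[elem("bidder", "Dave"), elem("amt", "$1500")])]),
])
new_bid = elem("auction", children=[
    elem("item", children=[
        elem("iid", "501"),
        elem("bid", children=[elem("bidder", "Sue"), elem("amt", "$1550")])]),
])

merged = merge(auction_doc, new_bid, AUCTION_MT)
```

In this sketch the two auction roots and the two items with iid 501 are combined because their keys match, while Sue's bid is inserted next to Dave's because its [bidder, amt] key differs.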

Merge Template Features • Used as the basis for an Accumulate operator • Repeated merge over a stream of XML documents to create an Accumulator • Accumulator is a view of the stream • Performs structural aggregation • Keys used to identify elements to combine • Keys external to document • Content-Combine Functions • aggregate, deep replace • Attributes – handled like elements without children

Outline • Incremental Query Evaluation • Partial results over XML data • Merge Operation • Merge Theory • Merge Performance

Theoretical Foundations • Why a formal definition? • Prove Merge is deterministic (unique result) • Unambiguous definition • Key results: • Formal definition of Merge as the join of an upper semi-lattice • Merge is the least upper bound of two documents (under some constraints) • Path Set Representation • Good for reasoning about XML documents

View Merge as Least Upper Bound • D3 is the “smallest” document that “contains” D1 and D2 • Auction Document (D1): auction (item (iid:433, desc: 1971 Martin Guitar), item (iid:501, desc: Trek Madone 5.9 Bike, bid (bidder: Dave, amt: $1500))) • New Bid (D2): auction (item (iid:501, bid (bidder: Sue, amt: $1550))) • Merged Document (D3): auction (item (iid:433, desc: 1971 Martin Guitar), item (iid:501, desc: Trek Madone 5.9 Bike, bid (bidder: Dave, amt: $1500), bid (bidder: Sue, amt: $1550)))

What can go wrong? • No unique result (no Least Upper Bound (LUB)) • Keys in Merge Template eliminate ambiguity • Know D4 is correct result if we know iid is a key for item • D1: auction (item (iid:501)) • D2: auction (item (iid:433)) • D3: auction (item (iid:501, iid:433)) • D4: auction (item (iid:501), item (iid:433))

What is a lattice? • An Upper Semi-Lattice is: a partially ordered set, in which least upper bounds (LUBs) exist and are unique • A set of sets closed under union forms an upper semi-lattice • Ex 1 – Not a lattice: {1, 2}, {2, 3}; the LUB of {1, 2} and {2, 3} does not exist • Ex 2 – Lattice: {1, 2}, {2, 3}, {1, 2, 3}; order: S1 ≤ S2 if S1 ⊆ S2 • Ex 3 – Lattice: D1, D2, D3; order: document containment
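The first two examples can be checked mechanically. A minimal sketch, assuming finite families of sets ordered by ⊆; the function name `least_upper_bound` is hypothetical.

```python
def least_upper_bound(family, a, b):
    """Return the least upper bound of a and b within family under ⊆, or None."""
    uppers = [s for s in family if a <= s and b <= s]
    least = [u for u in uppers if all(u <= v for v in uppers)]
    return least[0] if least else None

s12, s23, s123 = frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 2, 3})

ex1 = {s12, s23}        # Ex 1: no set in the family contains both
ex2 = {s12, s23, s123}  # Ex 2: closed under union

print(least_upper_bound(ex1, s12, s23))
print(least_upper_bound(ex2, s12, s23))
```

In Ex 1 the search finds no upper bound at all, while in Ex 2 the union {1, 2, 3} is in the family and is the least upper bound.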

What do I need for a lattice? • Set of documents (LT) (T is a Merge Template) • Order (document containment) • Show LT satisfies the properties of a lattice.

Document Containment Order • D1 is contained in D2 if there is a structure-preserving mapping from D1 into D2 • D1: auction (item (iid:501, desc: Trek Madone 5.9 Bike)) • D2: auction (item (iid:501, desc: Trek Madone 5.9 Bike), item (iid:433, desc: 1971 Martin Guitar))

Merge Template (T) Defines LT • A Merge Template, T, is specific to a set of documents • Auction MT specific to “auction” documents • LT is all documents that are “compatible” and “key-respecting” with respect to T • Different lattice for each Merge Template • [Figure: T selects LT as a subset of the set of all documents; example documents D1, D2, D3, D4, D5, D8, D10]

Non-Key-Respecting Documents • D ⊑ D′ means D is contained in D′, i.e. there is a structure-preserving mapping from D into D′ • D3 is not key-respecting with respect to T and should not be in LT • T: (auction, [], NoContentNoAttrs) with child (item, [iid], NoContentNoAttrs), which has child (iid, [], ExactMatch) • D1: auction (item (iid:501)) • D2: auction (item (iid:433)) • D3: auction (item (iid:501, iid:433)) • D4: auction (item (iid:501), item (iid:433))

Merge-Lattice Theorem Overview • Associate each document D with a unique path set ρ(D) • ρ(D1) ∪ ρ(D2) is a Least Upper Bound (LUB) for ρ(D1) and ρ(D2) • ρ(D1) ∪ ρ(D2) is the “smallest” set that contains both ρ(D1) and ρ(D2) • Intuition: Merge of D1 and D2 should be the document associated with ρ(D1) ∪ ρ(D2) • [Figure: ρ1 and ρ2 map D1 and D2 to ρ(D1) and ρ(D2); D3 in LT corresponds to ρ(D1) ∪ ρ(D2)]
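The intuition can be illustrated with path sets written out by hand. This is a sketch under assumptions: paths are plain strings in a rooted-key notation of the form name[key]:content, the two documents are an item with Dave's bid and a new bid from Sue, and the helper `bid_paths` is hypothetical.

```python
ITEM = "auction[].item[iid:501]"

def bid_paths(bidder, amt):
    """Paths contributed by one bid element and its two children."""
    bid = f"{ITEM}.bid[bidder:{bidder},amt:{amt}]"
    return {f"{bid}:", f"{bid}.bidder[]:{bidder}", f"{bid}.amt[]:{amt}"}

# Paths present in both documents: the auction root, the item, and its iid.
shared = {"auction[]:", f"{ITEM}:", f"{ITEM}.iid[]:501"}

rho_d1 = shared | {f"{ITEM}.desc[]:Trek Madone 5.9 Bike"} | bid_paths("Dave", "$1500")
rho_d2 = shared | bid_paths("Sue", "$1550")

# The union is an upper bound of both path sets under ⊆ ...
rho_d3 = rho_d1 | rho_d2
print(rho_d1 <= rho_d3 and rho_d2 <= rho_d3)

# ... and it is the least one: any set containing both contains the union.
other_upper = rho_d3 | {"auction[].item[iid:433]:"}
print(rho_d3 <= other_upper)
```

The shared paths appear once in the union, which mirrors how Merge combines the two auction roots and the two items with the same key instead of duplicating them.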

Document and Path Set • Use Merge Template + document to create path set • One element in path set for each element in document • Path comprised of rooted key value and element content • Path set order (subset) identical to document containment order • Document: auction (item (iid:501, desc: Trek Madone 5.9 Bike, bid (bidder: Dave, amt: $1500))) • Path set:
auction[]:
auction[].item[iid:501]:
auction[].item[iid:501].iid[]:501
auction[].item[iid:501].desc[]:Trek Madone 5.9 Bike
auction[].item[iid:501].bid[bidder:Dave,amt:$1500]:
auction[].item[iid:501].bid[bidder:Dave,amt:$1500].bidder[]:Dave
auction[].item[iid:501].bid[bidder:Dave,amt:$1500].amt[]:$1500
• Anatomy of a path: in auction[].item[iid:501].desc[]:Trek Madone 5.9 Bike, the rooted key value is auction[].item[iid:501].desc[] and the element content is Trek Madone 5.9 Bike
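Path set construction can be sketched as a recursive walk. Assumptions in this illustration: elements are (name, text, children) tuples, the key element is spelled iid throughout, and the Merge Template is flattened into a name-to-key map `KEYS`, which only works because element names are unique in this example. The function name `path_set` is hypothetical.

```python
# Local key per element name, taken from the auction Merge Template.
KEYS = {"auction": [], "item": ["iid"], "iid": [], "desc": [],
        "bid": ["bidder", "amt"], "bidder": [], "amt": []}

def path_set(element, prefix=""):
    """Return one path (rooted key value + content) per element in the document."""
    name, text, children = element
    key = ",".join(f"{c[0]}:{c[1]}" for k in KEYS[name]
                   for c in children if c[0] == k)
    path = f"{prefix}{name}[{key}]"
    paths = {f"{path}:{text}"}
    for child in children:
        paths |= path_set(child, path + ".")
    return paths

doc = ("auction", "", [
    ("item", "", [
        ("iid", "501", []),
        ("desc", "Trek Madone 5.9 Bike", []),
        ("bid", "", [("bidder", "Dave", []), ("amt", "$1500", [])]),
    ]),
])

for p in sorted(path_set(doc)):
    print(p)
```

Each element contributes exactly one path, so the seven elements of this document yield seven paths.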

Proof that D3 is in LT • Construct D3 from ρ(D1) ∪ ρ(D2), show D3 is compatible and key-respecting with respect to T • [Figure: ρ1 and ρ2 (with inverses ρ1⁻¹ and ρ2⁻¹) relate D1 and D2 to ρ(D1) and ρ(D2); σ constructs D3 from ρ(D1) ∪ ρ(D2) under T, with σ⁻¹ = ρ3]

Outline • Incremental Query Evaluation • Partial results over XML data • Merge Operation • Merge Theory • Merge Performance

Implementation Highlights • Accumulate operator uses repeated binary Merges to combine a series of XML documents into one result document • Accumulate is implemented as a recursive walk over input docs and the Merge Template • Implemented in Niagara v1.0 (UW-Madison) • Lazy construction of DOM nodes: SAXDOM • General improvements to Niagara 1.0 code base
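The idea of Accumulate as repeated binary Merges can be sketched as a fold over the stream. This is a simplification, not the Niagara operator: nested dicts stand in for XML documents, dict keys play the role of element names plus key values, and the functions `merge` and `accumulate` are hypothetical.

```python
from functools import reduce

def merge(a, b):
    """Recursive union of two nested dicts; equal dict keys are combined."""
    out = dict(a)
    for k, v in b.items():
        if k in out and isinstance(out[k], dict) and isinstance(v, dict):
            out[k] = merge(out[k], v)  # same key: combine recursively
        else:
            out[k] = v                 # new key (or leaf content): insert/replace
    return out

def accumulate(stream):
    """Fold a stream of documents into one accumulator with binary merges."""
    return reduce(merge, stream, {})

stream = [
    {"item 501": {"desc": "Trek Madone 5.9 Bike"}},
    {"item 501": {"bids": {"Dave": "$1500"}}},
    {"item 433": {"desc": "1971 Martin Guitar"}},
    {"item 501": {"bids": {"Sue": "$1550"}}},
]

accumulator = accumulate(stream)
```

Because each step only needs the current accumulator and the next document, a partial result is available after every input document.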

Performance Environment • 866 MHz Pentium III, 512MB memory, Red Hat Linux 8.0 • Sun JVM J2SE 1.4.2, maximum memory 412MB

Input Data - XMark • Persons: site (people (person* (id, name, email, phone?, profile (education)))) • Items: site (open_auctions (open_auction* (id, reserve?, seller (person), interval (start, end)))) • Bids: site (open_auctions (open_auction* (id, bidder (time, bid, personref (person))))) • Legend: * 0 or more, ? optional

Structural Aggregation with Restructuring • Q5.1 – simple structural aggregation query • For each person produce a list of items they bid on and their bids on those items • Q5.1 input (Bids): site (open_auctions (open_auction* (id, bidder (time, bid, personref (person))))) • Q5.1 output: people (person* (id, itemsbid (item* (id, bid* (time, amt)))))

Restructuring of Input • [Figure: Q5.1 Input is restructured into Restructured Input, which accumulate turns into Q5.1 Output; the example follows the bid by person:53 on open_auction iid:8 (time:5:00, bid:$82), which ends up under people, person id:53, itemsbid, item id:8 as bid (time:5:00, amt:$82)]

Q5.1 query plans • Merge Query Plan (top to bottom): accumulate; construct (restructured document); unnest (bidder.person_ref.person as bidderid); unnest (site.open_auctions.open_auction); filescan • Nest Query Plan (top to bottom): nest (“”); nest (bidderid); nest (bidderid); nest (itemid, bidderid); unnest (amt); unnest (time); unnest (person_ref.person as bidderid); unnest (bidder); unnest (open_auction.id as itemid); unnest (site.open_auctions.open_auction); filescan

Q5.1 Nest Query Plan • Nest Query Plan (top to bottom): nest (“”); nest (bidderid); nest (bidderid); nest (itemid, bidderid); unnest (open_auction, open_auction.id, bidder, person_ref.person, time, amt); filescan • Q5.1 Output: people (person (id:53, itemsbid (item (id:8, bid (time:5:00, amt:$82)))))

Q5.1 Execution Time

Q5.2 Execution Time • Q5.2: for every item, list of bidders and their bids • Q5.1: for every person, list of items bid on and bids on those items • Q5.2 output: items (item* (id, bidder* (id, bid* (time, amt))))

Execution time breakdown Q5.2

Simplified Q5.4-A Output • For each person, provide person information, list of items put up for auction (itemssold) and items bid on (itemsbid) • Output: people (person* (id, name, email, phone?, profile (education), itemssold (open_auction* (id, reserve?, seller (person), interval (start, end))), itemsbid (open_auction* (id, bidder* (time, bid, personref (person))))))

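Simplified Q5.4-B Output • [Figure: output schema tree: people (person* (id, name, email, phone?, profile (education), itemssold (item*), itemsbid (item*))); elements shown under item*: id, seller, person, reserve?, interval (start, end), bid, time, amt; elements shown in the key: person, personref; the key marks renamed and deleted elements]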

Q5.4-A and Q5.4-B Results • Q5.4-B is faster despite having to unnest the input more deeply • Key factor: Q5.4-B has fewer elements in the result • [Charts: Query 5.4-A, Query 5.4-B]

Merge-Ready Structural Aggregation • No restructuring; input structured similarly to output • Best case for Merge • [Charts: Q5.5 (small documents), Q5.6 (big documents)]

Sliding Structural Aggregation • Extend accumulate to handle sliding windows • For each element, maintain range of windows • Test vs. sliding nest • [Chart: Q6.1 (group bids by item then person)]

Conclusion • Studied processing of XML Streams • IQE • General framework for partial results over initial portion of stream • Merge • Flexible operator for combining XML documents • Formal definition in terms of lattice theory • Outperforms nest-based alternatives

Extras/Deletes

Re-evaluation vs. differential • Query plan for re-evaluation vs. differential (top to bottom): Nest on Author; Join on Author; inputs (Author, Address) and (Author, Book)

Partially-Ordered Set (POSet) • Let P be a set. A partial order (≤) on P is such that for all x, y, z ∈ P • x ≤ x • x ≤ y and y ≤ x ⇒ x = y • x ≤ y and y ≤ z ⇒ x ≤ z • Example: set of sets {1, 2, 3}, {1, 2}, {2, 3}, {1}, with S1 ≤ S2 if S1 ⊆ S2
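The three axioms can be checked by brute force on the example. A minimal sketch for finite sets only; the function name `is_partial_order` is hypothetical.

```python
def is_partial_order(elements, leq):
    """Check reflexivity, antisymmetry and transitivity of leq on a finite set."""
    reflexive = all(leq(x, x) for x in elements)
    antisymmetric = all(x == y for x in elements for y in elements
                        if leq(x, y) and leq(y, x))
    transitive = all(leq(x, z) for x in elements for y in elements
                     for z in elements if leq(x, y) and leq(y, z))
    return reflexive and antisymmetric and transitive

sets = [frozenset({1, 2, 3}), frozenset({1, 2}), frozenset({2, 3}), frozenset({1})]

# Subset inclusion satisfies all three axioms.
print(is_partial_order(sets, frozenset.issubset))

# Comparing by size does not: {1, 2} and {2, 3} are each "<=" the other
# but are not equal, so antisymmetry fails.
print(is_partial_order(sets, lambda a, b: len(a) <= len(b)))
```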

Sliding Accum query plan Q6.1 • Plan (top to bottom): sliding accumulate; bucket; construct; filescan + series of unnests • Into bucket (document, timestamp): t1 = (D1, 12:01 PM), t2 = (D2, 12:20 PM), p1 = (*, 2:00 PM) • Out of bucket (document, timestamp, window-min, window-max): t1′ = (D1, 12:01 PM, 0, 7), t2′ = (D2, 12:20 PM, 1, 8), p1′ = (*, 2:00 PM, 0, 0)

Sliding Nest Query Plan Q6.1 • Plan (top to bottom): sliding nest (windowid); sliding nest (bidderid, windowid); sliding nest (bidderid, windowid); sliding nest (itemid, bidderid, windowid); bucket; construct; filescan + series of unnests • Into bucket (document, timestamp): t1 = (D1, 12:01 PM), t2 = (D2, 12:20 PM), p1 = (*, 2:00 PM) • Out of bucket (document, timestamp, window-min, window-max): t1′ = (D1, 12:01 PM, 0, 7), t2′ = (D2, 12:20 PM, 1, 8), p1′ = (*, 2:00 PM, 0, 0)
