Efficient Query Evaluation over XML Streams

Buffering in Query Evaluation over XML Streams
Ziv Bar-Yossef (Technion)
Marcus Fontoura, Vanja Josifovski (IBM Almaden Research Center)

XML Document
<paper>
  <section id="1">
    <title>Intro</title>
    <content>bla bla bla</content>
  </section>
  <section id="2">
    <title>Results</title>
    <content>yada yada yada</content>
  </section>
  <section id="3">
    <title>Conclusions</title>
    <content>etc etc etc</content>
  </section>
  <title>On the Complexity of Database Queries</title>
  <author>Papadimitriou</author>
  <author>Yannakakis</author>
</paper>

XML Document Tree
root
  paper
    section
      id: 1
      title: Intro
      content: bla bla bla
    section
      id: 2
      title: Results
      content: yada yada yada
    section
      id: 3
      title: Conclusions
      content: etc etc etc
    title: On the Complexity of Database Queries
    author: Papadimitriou
    author: Yannakakis

XPath Queries
/paper[author="Papadimitriou"]/section[@id = "2" or title = "Intro"]/content
[Figure: the XML document tree]

XPath Queries
/paper[title != section/title]/author
[Figure: the XML document tree]

XPath
• Query = path pattern + predicates
• XPath 2.0
• Forward axes only
• Eval(Q,D): the nodes in D that match Q
• Two modes of XPath evaluation:
  • Full-fledged evaluation: given Q and D, output Eval(Q,D)
  • Filtering: given Q and D, determine whether Eval(Q,D) is nonempty
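To make the two evaluation modes concrete, here is a minimal Python sketch that hand-codes the first example query over an in-memory tree built with the standard xml.etree.ElementTree module. The names eval_query and filter_query are hypothetical, the query is hard-coded rather than parsed, and nothing here is streaming; it only illustrates what Eval(Q,D) and filtering return on the example document.

```python
import xml.etree.ElementTree as ET

DOC = """<paper>
  <section id="1"><title>Intro</title><content>bla bla bla</content></section>
  <section id="2"><title>Results</title><content>yada yada yada</content></section>
  <section id="3"><title>Conclusions</title><content>etc etc etc</content></section>
  <title>On the Complexity of Database Queries</title>
  <author>Papadimitriou</author>
  <author>Yannakakis</author>
</paper>"""


def eval_query(paper):
    """Hand-coded Eval(Q, D) for the query
    /paper[author="Papadimitriou"]/section[@id="2" or title="Intro"]/content
    (illustrative only; not a general XPath engine)."""
    if paper.tag != "paper":
        return []
    # Predicate on paper: some author child has the required text.
    authors = paper.findall("author")
    if not any((a.text or "").strip() == "Papadimitriou" for a in authors):
        return []
    result = []
    for section in paper.findall("section"):
        # Predicate on section: @id = "2" or some title child equals "Intro".
        id_ok = section.get("id") == "2"
        title_ok = any((t.text or "").strip() == "Intro"
                       for t in section.findall("title"))
        if id_ok or title_ok:
            result.extend(section.findall("content"))
    return result


def filter_query(paper):
    """Filtering mode: only report whether Eval(Q, D) is nonempty."""
    return len(eval_query(paper)) > 0


paper = ET.fromstring(DOC)
selected = [c.text for c in eval_query(paper)]
print(selected)             # texts of the selected content nodes
print(filter_query(paper))  # True if at least one node is selected
```

Full-fledged evaluation has to produce the selected nodes themselves, while filtering needs only one bit of output; the buffering results later in the talk depend on this difference.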

XML Streams
• XML stream: a sequence of SAX events
  • startDocument(), endDocument(), startElement(name), endElement(name), text(str)
• Why XML streams?
  • For transferring XML between systems
  • For efficient access to large XML documents
• Critical resources
  • Memory
  • Processing time
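The event sequence can be observed with Python's standard xml.sax module, as in the sketch below. The EventRecorder class and the to_events helper are hypothetical names; Python calls the text event characters(), and the sketch renders it as text(str) to match the slide. Attributes are ignored here for brevity.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Records the SAX events of a document as readable strings."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument()")

    def endDocument(self):
        self.events.append("endDocument()")

    def startElement(self, name, attrs):
        # Attributes are dropped in this sketch.
        self.events.append(f"startElement({name})")

    def endElement(self, name):
        self.events.append(f"endElement({name})")

    def characters(self, content):
        # Skip whitespace-only text between elements.
        if content.strip():
            self.events.append(f"text({content.strip()})")


def to_events(xml_text):
    """Parse xml_text and return the list of recorded events."""
    recorder = EventRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), recorder)
    return recorder.events


events = to_events("<paper><author>Papadimitriou</author></paper>")
for event in events:
    print(event)
```

A streaming algorithm sees these events one at a time and in this order, so anything it may need later has to be kept in its own memory.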

Streaming XML Algorithms
• XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
• X-scan [Ives, Levy, and Weld 00]
• XMLTK [Avila-Campillo et al 02]
• XTrie [Chan et al 02]
• SPEX [Olteanu, Kiesling, and Bry 03]
• Lazy DFAs [Green et al 03]
• The XPush Machine [Gupta and Suciu 03]
• XSQ [Peng and Chawathe 03]
• FluX [Koch et al 04]
• TurboXPath [Josifovski, Fontoura, and Barta 05]
• …
All of them use a lot of memory on certain queries and documents.

Memory Bottleneck I: Storage of Large Transition Tables
• Framework of most algorithms:
  • Q → NFA
  • Simulate NFA by DFA
• Caveat: exponential blowup
• However: exponential blowup is not necessary [Bar-Yossef, Fontoura, Josifovski 04]
  • Algorithm for filtering XML streams whose space is linear in the query size
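The blowup can be reproduced on a toy case. The sketch below is an illustration under stated assumptions, not taken from the talk: it models a hypothetical path query of the form //a followed by k wildcard steps as an NFA over root-to-node label paths, uses a two-label alphabet, and counts the DFA states reachable by the textbook subset construction. The function name count_dfa_states is made up for this example.

```python
def count_dfa_states(k, alphabet=("a", "b")):
    """Count the DFA states reachable by subset construction from an NFA
    for label paths matching //a followed by k wildcard steps.

    NFA states are 0..k+1. State 0 loops on every label and also moves to
    state 1 on 'a'. State i (1 <= i <= k) moves to i+1 on every label.
    State k+1 is accepting and has no outgoing transitions.
    """
    def step(subset, label):
        nxt = {0}  # state 0 is always active because of its self-loop
        for s in subset:
            if s == 0 and label == "a":
                nxt.add(1)
            elif 1 <= s <= k:
                nxt.add(s + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    frontier = [start]
    while frontier:
        subset = frontier.pop()
        for label in alphabet:
            nxt = step(subset, label)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)


for k in range(5):
    # Columns: wildcard steps, NFA states, reachable DFA states.
    print(k, k + 2, count_dfa_states(k))
```

The NFA grows linearly in k while the reachable DFA state count doubles with each added wildcard, which is the kind of transition-table growth the slide warns about.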

Memory Bottleneck II: Buffering of Document Fragments
• Scenario 1: buffering nodes that may or may not be part of the output.
/paper[author="Papadimitriou"]/section[@id = "2" or title = "Intro"]/content
[Figure: the XML document tree]
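Scenario 1 can be seen in a hand-coded streaming evaluator for this one query, sketched below with Python's standard xml.sax module. This is not the algorithm from the talk: the class BufferingEvaluator and its fields are hypothetical, the query is hard-coded, and the code stores node text rather than nodes. It holds content nodes whose fate is still undecided and records the peak number held at once.

```python
import xml.sax

DOC = """<paper>
  <section id="1"><title>Intro</title><content>bla bla bla</content></section>
  <section id="2"><title>Results</title><content>yada yada yada</content></section>
  <section id="3"><title>Conclusions</title><content>etc etc etc</content></section>
  <title>On the Complexity of Database Queries</title>
  <author>Papadimitriou</author>
  <author>Yannakakis</author>
</paper>"""


class BufferingEvaluator(xml.sax.ContentHandler):
    """Streaming evaluation of
    /paper[author="Papadimitriou"]/section[@id="2" or title="Intro"]/content."""

    def __init__(self):
        super().__init__()
        self.path = []           # names of the currently open elements
        self.text = []           # text chunks of the innermost open element
        self.author_ok = False   # has the predicate on paper become true?
        self.section_ok = False  # is the predicate on the open section true?
        self.pending = []        # contents of the section that is still open
        self.buffer = []         # contents of matching sections, awaiting author
        self.output = []
        self.peak = 0            # maximum number of contents held at once

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if self.path == ["paper", "section"]:
            self.section_ok = attrs.get("id") == "2"

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.path == ["paper", "section", "title"]:
            self.section_ok = self.section_ok or value == "Intro"
        elif self.path == ["paper", "section", "content"]:
            self.pending.append(value)
            self.peak = max(self.peak, len(self.pending) + len(self.buffer))
        elif self.path == ["paper", "section"]:
            if self.section_ok:
                # Output directly if the author predicate is already known.
                target = self.output if self.author_ok else self.buffer
                target.extend(self.pending)
            self.pending = []  # non-matching contents are discarded here
        elif self.path == ["paper", "author"] and value == "Papadimitriou":
            self.author_ok = True
            self.output.extend(self.buffer)  # predicate resolved: flush
            self.buffer = []
        self.path.pop()
        self.text = []


evaluator = BufferingEvaluator()
xml.sax.parseString(DOC.encode("utf-8"), evaluator)
print(evaluator.output)
print(evaluator.peak)
```

Because the author elements arrive after all sections, the contents of the matching sections cannot be output or discarded until the end of the document, so they have to be held in memory in the meantime.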

Memory Bottleneck II: Buffering of Document Fragments
• Scenario 2: buffering nodes needed for evaluating pending predicates.
/paper[title != section/title]/author
[Figure: the XML document tree]

Memory Bottleneck II: Buffering of Document Fragments
• Scenario 3: buffering multiple candidate matches that are nested within each other.
//a[b and c]
[Figure: document tree with nested a elements and b and c children]
• Relevant only when the document is "recursive"
• Space required: Ω(doc-recursion-depth) [Bar-Yossef, Fontoura, Josifovski 04]

Our Results
• Quantitative space lower bounds for:
  • Full-fledged evaluation of queries with predicates (Scenario 1)
  • Filtering/full-fledged evaluation of queries with "multi-variate" predicates (Scenario 2)
• Matching upper bound
  • Eager evaluation of predicates
• In all other scenarios: no buffering required
  • Filtering of queries with "univariate" predicates over non-recursive documents is possible without buffering [Bar-Yossef, Fontoura, Josifovski 04]

Related Work
• Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
• Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
• Space complexity of select-project-join queries over relational data streams [Arasu et al 02]

Document Concurrency
• Q: query
• D = σ_1,…,σ_n: document
  • Each σ_i is a SAX event
• D_t = (σ_1,…,σ_t): the prefix of D up to step t
• Definition: x ∈ D is alive at step t if x ∈ D_t and ∃ suffixes α, β s.t.
  • x ∈ Eval(Q, D_t α)
  • x ∉ Eval(Q, D_t β)
• t-concurrency(D,Q): number of nodes that are alive at step t
• concurrency(D,Q): max_t t-concurrency(D,Q)

Concurrency: Example
/paper[author="Papadimitriou"]/section[@id = "2" or title = "Intro"]/content
 1: <paper>
 2: <section id="1">
 3: <title>
 4: Intro
 5: </title>
 6: <content>
 7: bla bla bla
 8: </content>
 9: </section>
10: <section id="2">
11: <title>
12: Results
13: </title>
14: <content>
15: yada yada yada
16: </content>
17: </section>
18: <section id="3">
19: <title>
20: Conclusions
21: </title>
22: <content>
23: etc etc etc
24: </content>
25: </section>
26: <title>
27: On the Complexity of Database Queries
28: </title>
29: <author>
30: Papadimitriou
31: </author>
32: <author>
33: Yannakakis
34: </author>
35: </paper>
[Figure annotations on nodes of the document: dead, alive, alive]

Lower Bound Notions
• A "normal" lower bound: for every algorithm A, there exist Q and D s.t. A uses Ω(concurrency(D,Q)) bits of space on Q and D.
  • Q and D may be "pathological"
  • Doesn't say much about real-world queries/documents
• An "ideal" lower bound: for every A, every Q, and every D, A uses Ω(concurrency(D,Q)) bits of space on Q and D.
  • Too good to be true
  • A can have D and Q "hard-coded", and then know the result a priori
  • Space of A on D and Q = minimum description length of Q and D

Our Lower Bound
• Theorem: for every A, every Q, and every D, there exists an almost-isomorphic document D' s.t. A uses Ω(concurrency(D,Q)) bits of space on Q and D'.
  • D' is the same as D, except for a few extra empty nodes with auxiliary names.
• The theorem holds only if:
  • Q is "star-free"
  • D is non-recursive

Why isn't this Obvious?
• Reason 1: we want the theorem to work for every Q and D, not only ones with high minimum description length (MDL).
• Reason 2:
  • Obvious: if x is alive at step t ⇒ A has to remember x
    • Because: A may or may not need to output x
  • Not obvious: if x and y are alive at step t ⇒ A has to remember both
    • If x and y are not "independent", maybe it's enough to remember just x (or just y)

Proof of Lower Bound
• C = t-concurrency(D,Q)
• x_1,…,x_C = nodes that are alive at step t
• Recall (D_t = the first t events of D): for every x_i there exist suffixes α_i and β_i s.t.
  • x_i ∈ Eval(Q, D_t α_i)
  • x_i ∉ Eval(Q, D_t β_i)
• Lemma: there exist a single α and a single β s.t. for all i,
  • x_i ∈ Eval(Q, D_t α)
  • x_i ∉ Eval(Q, D_t β)

Proof of Lower Bound (cont.)
• For every S ⊆ {1,…,C} define a document D_S:
  • D_S is the same as D, except
  • For every i ∈ S, we "mark" x_i
  • Marking: an extra empty child with an auxiliary name
• Note: D_S is almost-isomorphic to D
• A = any algorithm
• Note: from the output of A on D_S, one can "reconstruct" the set S.

Proof of Lower Bound (cont.)
• Consider the state of A at step t when running on D_S
• If the suffix is β (the suffix from the lemma for which no x_i is in Eval), none of the x_i's should be output
  • ⇒ A could not have output any x_i by step t
• If the suffix is α (the suffix from the lemma for which every x_i is in Eval), there is no information in the suffix about S, but S can be reconstructed from the output
  • ⇒ the state of A at step t must have all the information about S
• Conclusion: space ≥ Ω(C)
• Actual proof: by one-way communication complexity
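The last step can be spelled out as a counting argument. This is a simplified version for a deterministic algorithm A, offered as a sketch and not as the communication-complexity proof the slide refers to. Suppose two different sets S and S' brought A to the same state at step t. A has output no x_i by step t, and the remaining events (the common suffix under which S can be reconstructed from the output) are identical for both documents, so A would produce the same output on both. That contradicts the fact that S can be reconstructed from the output. Hence distinct sets need distinct states:

```latex
\begin{align*}
\#\{\text{states of } A \text{ at step } t\}
  &\ge \#\{S : S \subseteq \{1,\dots,C\}\} = 2^{C}, \\
\text{space of } A
  &\ge \log_2 2^{C} = C \text{ bits} = \Omega(C).
\end{align*}
```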

Conclusions
• Our contributions:
  • Quantitative space lower bounds
    • Full-fledged evaluation of queries with predicates
    • Filtering/full-fledged evaluation of queries with "multi-variate" predicates
  • Matching upper bound
• Open problems:
  • Quantitative lower bounds for XQuery evaluation over streams
  • Address larger fragments of XPath
