This article explores the complexities of querying streaming XML data, particularly in scenarios involving real-time information streams. It addresses common issues faced during XML querying, such as filtering and specifying path queries, and proposes a structured solution leveraging efficient design principles. By dissecting the components of XPath queries and their practical applications, this piece highlights the advantages of streaming XML in various domains, including the stock market and news dissemination. Learn how to effectively construct and refine queries for optimal data retrieval.
Layout of the presentation • Introduction • Common Problems faced • Solution proposed • Basic Building blocks of the solution • How to build up a solution to a given query • Features of the system
Streaming XML • XML is a standard for information exchange. • Some XML documents are available only as streams. • Streaming is like reading data from a tape drive: one sequential pass, no random access. • Used for stock market data, news feeds, and network statistics. • Predecessor systems only filtered whole documents.
Structure of an XPath Query • Consists of a Location path and an Output Expression (name). • Location path consists of closure axis(//), node test (book) and predicate (year>2000). • e.g. //book[year>2000]/name
Features of our Approach • Efficient • Easy to understand design. • Design of BPDT is tricky
Common Problems faced
<root>
  <pub>
    <book id="1">
      <price> 12.00 </price>
      <name> First </name>
      <author> A </author>
      <price type="discount"> 10.00 </price>
    </book>
    <book id="2">
      <price> 14.00 </price>
      <name> Second </name>
      <author> A </author>
      <author> B </author>
      <price type="discount"> 12.00 </price>
    </book>
    <year> 2002 </year>
  </pub>
</root>
Query: /pub[year=2002]/book[price<11]/author
• Book 1, first price (12.00): Failure?? This price fails price<11, but a later price may still pass.
• Book 1, author A: Element satisfies the path.
• Book 1, discount price (10.00): Test passed. But year=2002? The year has not been seen yet.
• Book 2, authors: Buffer both A & B.
• Book 2, no price below 11: Failed price<11. Remove A & B from the buffer.
• Year (2002): Test passed. Output.
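The walkthrough above can be sketched as a small hand-coded SAX handler. This is only an illustration of the buffering problem, not the system described in the talk: the class `AuthorQuery`, the function `run_query`, and the two lists `book_buf` and `pub_buf` are hypothetical names, and the sketch assumes the flat layout of the slide (no nested `pub` or `book` elements, wrapping `<root>` ignored).

```python
import xml.sax


class AuthorQuery(xml.sax.ContentHandler):
    """Hand-coded evaluator for /pub[year=2002]/book[price<11]/author."""

    def __init__(self):
        super().__init__()
        self.path = []         # names of the currently open elements
        self.text = []         # text chunks of the innermost open element
        self.year_ok = False   # has year=2002 been seen in this pub?
        self.price_ok = False  # has price<11 been seen in this book?
        self.book_buf = []     # authors waiting on price<11
        self.pub_buf = []      # authors waiting on year=2002
        self.results = []

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if name == "pub":
            self.year_ok, self.pub_buf = False, []
        elif name == "book":
            self.price_ok, self.book_buf = False, []

    def characters(self, content):
        self.text.append(content)  # SAX may deliver text in several chunks

    def _book_passed(self, authors):
        # The book predicate holds; these authors now depend only on the year.
        if self.year_ok:
            self.results.extend(authors)
        else:
            self.pub_buf.extend(authors)

    def endElement(self, name):
        value = "".join(self.text).strip()
        self.text = []
        self.path.pop()
        parent = self.path[-1] if self.path else None
        if name == "author" and parent == "book":
            if self.price_ok:
                self._book_passed([value])
            else:
                self.book_buf.append(value)        # buffer the candidate
        elif name == "price" and parent == "book":
            if not self.price_ok and float(value) < 11:
                self.price_ok = True
                self._book_passed(self.book_buf)   # hand candidates upward
                self.book_buf = []
        elif name == "book":
            self.book_buf = []     # drop whatever still waits on price<11
        elif name == "year" and parent == "pub":
            if float(value) == 2002:
                self.year_ok = True
                self.results.extend(self.pub_buf)  # all predicates true: output
                self.pub_buf = []
        elif name == "pub":
            self.pub_buf = []      # drop whatever still waits on year=2002


def run_query(xml_text):
    """Parse xml_text in one pass and return the matching author values."""
    handler = AuthorQuery()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.results
```

On the document from the slide, `run_query` returns `['A']`: author A of book 1 waits in `pub_buf` until the year arrives, while A and B of book 2 are dropped at the end of that book.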
Problems caused by closure axis
<root>
  <pub>
    <book>
      <name> X </name>
      <author> A </author>
    </book>
    <book>
      <name> Y </name>
      <pub>
        <book>
          <name> Z </name>
          <author> B </author>
        </book>
        <year> 1999 </year>
      </pub>
    </book>
    <year> 2002 </year>
  </pub>
</root>
Query: //pub[year=2002]//book[author]//name
• Inner pub (year 1999): Fails year=2002
• Outer pub (year 2002): Passes year=2002
Problems caused by closure axis
<root>
  <pub>
    <book>
      <name> X </name>
      <author> A </author>
    </book>
    <book>
      <name> Y </name>
      <author> B </author>
      <pub>
        <book>
          <name> Z </name>
          <author> B </author>
        </book>
        <year> 1999 </year>
      </pub>
    </book>
    <year> 2002 </year>
  </pub>
</root>
Query: //pub[year=2002]//book[author]//name
• Let's add an author to book Y. Result?
• Inner pub (year 1999): Fails year=2002
• Outer pub (year 2002): Passes year=2002
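The question on this slide can be checked with a non-streaming reference evaluator. The sketch below is an assumption-laden helper, not part of the presented system: it loads the whole document with `xml.etree.ElementTree` and, for every `name` element, counts the (pub, book) ancestor pairs that satisfy both predicates. The function name `matching_names` is hypothetical.

```python
import xml.etree.ElementTree as ET


def matching_names(xml_text):
    """Reference answer for //pub[year=2002]//book[author]//name.

    Returns (name text, number of matching (pub, book) ancestor pairs)
    for each result, in document order.
    """
    root = ET.fromstring(xml_text)
    results = []

    def pub_ok(elem):
        return elem.tag == "pub" and any(
            c.tag == "year" and (c.text or "").strip() == "2002" for c in elem
        )

    def book_ok(elem):
        return elem.tag == "book" and any(c.tag == "author" for c in elem)

    def walk(node, ancestors):
        if node.tag == "name":
            # A pair matches when a qualifying pub lies above a qualifying book.
            pairs = [
                (i, j)
                for i, p in enumerate(ancestors) if pub_ok(p)
                for j, b in enumerate(ancestors) if j > i and book_ok(b)
            ]
            if pairs:
                results.append(((node.text or "").strip(), len(pairs)))
        for child in node:
            walk(child, ancestors + [node])

    walk(root, [])
    return results
```

With the author added to book Y, the names X, Y and Z all qualify, and Z is reached through two different (pub, book) pairs even though the inner pub fails its predicate. A streaming evaluator therefore has to track several partial matches for the same element and still output it only once.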
Handling XML Stream • Input: a well-formed XML stream. • Use the SAX API to parse XML. • Events belong to one of three sets (a = element name, d = depth): • Begin = {(a, attrs, d)} • End = {(/a, d)} • Text = {(a, text(), d)} • XML stream: {e1, e2, …, ei, …} | ei ∈ Begin ∪ End ∪ Text
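A minimal sketch of this event model using Python's standard `xml.sax` module follows. The class `EventCollector` and the function `to_events` are hypothetical names, each tuple carries an extra leading tag (`"begin"`, `"end"`, `"text"`) to tell the three sets apart, and the document element is assumed to have depth 1.

```python
import xml.sax


class EventCollector(xml.sax.ContentHandler):
    """Turns SAX callbacks into Begin, End and Text event tuples."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.path = []  # open element names; len(path) is the current depth

    def startElement(self, name, attrs):
        self.path.append(name)
        self.events.append(("begin", name, dict(attrs), len(self.path)))

    def characters(self, content):
        # Whitespace-only chunks are skipped. SAX may split long text into
        # several chunks; this sketch emits one Text event per chunk.
        text = content.strip()
        if text and self.path:
            self.events.append(("text", self.path[-1], text, len(self.path)))

    def endElement(self, name):
        self.events.append(("end", "/" + name, None, len(self.path)))
        self.path.pop()


def to_events(xml_text):
    """Parse xml_text and return its event sequence as a list."""
    handler = EventCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events
```

For `<pub><year>2002</year></pub>` this produces a begin event for `pub` at depth 1, then begin, text and end events for `year` at depth 2, and finally the end event for `pub`.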
Grammar for XPath Queries • Q → N+ [/O] • N → [/ | //] tag [F] • F → "[" FO [OP constant] "]" • FO → @attribute | tag [@attribute] | text() • O → @attribute | text() • OP → > | ≥ | = | < | ≤ | ≠ | contains • An XPath query has the form N1N2…Nn/O. • Cannot handle reverse axes or positional functions.
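As a rough illustration of this grammar, the sketch below splits a query into its location steps N and optional output expression O with regular expressions. It is a simplification under stated assumptions: `Step` and `parse_query` are hypothetical names, operators use their ASCII spellings (`>=`, `<=`, `!=`), `contains` is not handled, and anything outside the grammar (such as a reverse axis) raises `ValueError`.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

Predicate = Tuple[str, Optional[str], Optional[str]]  # (FO, OP, constant)


class Step(NamedTuple):
    axis: str                  # "/" (child) or "//" (closure)
    tag: str                   # node test
    pred: Optional[Predicate]  # filter F, if present


NAME = r"[A-Za-z_][\w.-]*"
STEP_RE = re.compile(rf"(//|/)({NAME})(?:\[([^\]]*)\])?")
OUTPUT_RE = re.compile(rf"/(@{NAME}|text\(\))$")
PRED_RE = re.compile(r"\s*(.+?)\s*(>=|<=|!=|=|>|<)\s*(.+?)\s*$")


def parse_pred(text: Optional[str]) -> Optional[Predicate]:
    if text is None:
        return None
    m = PRED_RE.match(text)
    if m:
        return (m.group(1), m.group(2), m.group(3))
    return (text.strip(), None, None)  # existence test such as [author]


def parse_query(query: str) -> Tuple[List[Step], Optional[str]]:
    """Return (location steps, output expression or None)."""
    m = OUTPUT_RE.search(query)
    output = m.group(1) if m else None
    path = query[:m.start()] if m else query
    steps, pos = [], 0
    while pos < len(path):
        s = STEP_RE.match(path, pos)
        if not s:
            raise ValueError(f"unsupported syntax at offset {pos}: {path[pos:]!r}")
        steps.append(Step(s.group(1), s.group(2), parse_pred(s.group(3))))
        pos = s.end()
    if not steps:
        raise ValueError("a query needs at least one location step")
    return steps, output
```

For `/pub[year>2000]/book[author]/name/text()` the parser yields three child-axis steps (`pub` with the predicate `year > 2000`, `book` with the existence test `author`, and `name`) plus the output expression `text()`.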
Solution to Query Query: /pub[year=2002]/book[price<11]/author PDA PDT
Basic PushDown Transducer (BPDT) • Similar to PushDown Automata • Actions defined on Transition Arcs • Finite set of states • A Start state • A set of final states • Set of input symbols • Set of Stack symbols
Building a BPDT Query: /pub[year>2000]/book[author]/name/text() Consider location step: /book[author] • Book – Author: Buffer for future: Begin event of Author. • Book – Author: Remove from Buffer: End event of Book. • Book – Author: Output result if predicates true: Begin event of Author.
Basic Building Blocks XPath Expression: /tag[child]
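One plausible reading of the /tag[child] building block is a three-state machine: START, NA (predicate not yet known) and TRUE. The sketch below is an assumption about how such a template could behave, not a transcription of the slide's diagram. The class `ChildPredicateBpdt` and its method `candidate` are hypothetical names, and a plain list stands in for the buffer.

```python
class ChildPredicateBpdt:
    """State machine for a single location step /tag[child]."""

    def __init__(self, tag, child):
        self.tag, self.child = tag, child
        self.state = "START"
        self.depth = 0     # nesting depth inside the matched tag element
        self.buffer = []   # candidates waiting for the predicate
        self.output = []   # candidates known to be results

    def begin(self, name):
        if self.state == "START":
            if name == self.tag:
                self.state, self.depth = "NA", 0
            return
        self.depth += 1
        # Only a direct child named `child` satisfies the predicate.
        if self.state == "NA" and name == self.child and self.depth == 1:
            self.state = "TRUE"
            self.output.extend(self.buffer)  # flush buffered candidates
            self.buffer.clear()

    def end(self, name):
        if self.state == "START":
            return
        if self.depth == 0:       # end of the tag element itself
            self.buffer.clear()   # predicate never held: discard candidates
            self.state = "START"
        else:
            self.depth -= 1

    def candidate(self, item):
        """Offer a potential result produced while inside the tag element."""
        if self.state == "TRUE":
            self.output.append(item)
        elif self.state == "NA":
            self.buffer.append(item)
```

For /book[author], a candidate offered before the begin event of `author` is held in the buffer and released once that event arrives. If the end event of `book` arrives first, the buffer is cleared instead.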
Buffer Operations needed • Enqueue(x): Add x to the end of the queue. • Clear(): Removes all items from the queue. • Flush(): Outputs all items in the queue in FIFO order. • Upload(): Moves all items to the end of the queue of a parent BPDT. • No Dequeue operation needed.
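The four operations map naturally onto a FIFO queue. The following is a minimal sketch, assuming that Flush empties the queue after writing its items and that each buffer holds a reference to its parent's buffer. `BpdtBuffer`, `parent` and the `out` list are hypothetical names chosen for illustration.

```python
from collections import deque


class BpdtBuffer:
    """FIFO buffer of potential results for one BPDT."""

    def __init__(self, parent=None):
        self.items = deque()
        self.parent = parent  # buffer of the parent BPDT, if any

    def enqueue(self, x):
        """Add x to the end of the queue."""
        self.items.append(x)

    def clear(self):
        """Remove all items without outputting them."""
        self.items.clear()

    def flush(self, out):
        """Append all items to `out` in FIFO order, then empty the queue."""
        out.extend(self.items)
        self.items.clear()

    def upload(self):
        """Move all items to the end of the parent's queue."""
        if self.parent is None:
            raise ValueError("upload() needs a parent buffer")
        self.parent.items.extend(self.items)
        self.items.clear()
```

No dequeue is needed because items never leave one at a time: a predicate outcome applies to everything buffered under it, so the whole queue is output, discarded, or handed to the parent together.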
Basic Building Blocks XPath Expression: /tag[@attr=val]
Basic Building Blocks XPath Expression: /tag[text()=val]
Basic Building Blocks XPath Expression: /tag[child@attr=val]
Basic Building Blocks XPath Expression: /tag[child=val]
A sample BPDT Query: /pub[year>2000]
Building a solution HPDT for Query: //pub[year>2000]//book[author]//name/text()
HPDT Structure • Each BPDT in the HPDT has: • Position (l, k): l = depth of the BPDT in the HPDT, k = sequence number from right to left. • The BPDT at position (i-1, k) has a right child at position (i, 2k), connected to its NA state. • The BPDT at position (i-1, k) has a left child at position (i, 2k+1), connected to its TRUE state. • Position (i, 2^i - 1) means that the predicates in all higher-level BPDTs evaluate to true. • Buffer: potential results. • Stack: stack of element (SAX) events. • Depth Vector.
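The numbering scheme can be checked with a few lines of arithmetic. The sketch below assumes the root BPDT sits at position (0, 0), which the slide does not state; the function names are hypothetical. Following only left (TRUE) children from the root gives k = 1, 3, 7, …, that is 2^i - 1, and the binary digits of k record which ancestor predicates are TRUE (1) or not yet known (0).

```python
def children(depth, k):
    """Return (left, right) child positions of the BPDT at (depth, k)."""
    left = (depth + 1, 2 * k + 1)  # attached to the TRUE state
    right = (depth + 1, 2 * k)     # attached to the NA state
    return left, right


def all_true_position(depth):
    """Follow left children from the assumed root (0, 0) down to `depth`."""
    pos = (0, 0)
    for _ in range(depth):
        pos, _right = children(*pos)
    return pos


def predicate_states(depth, k):
    """Decode k into the TRUE/NA state of each ancestor, top level first."""
    return [
        "TRUE" if (k >> (depth - 1 - j)) & 1 else "NA"
        for j in range(depth)
    ]
```

For example, `all_true_position(3)` gives `(3, 7)`, and `predicate_states(2, 2)` gives `['TRUE', 'NA']`: the first predicate is satisfied and the second is still undecided.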
Example Query
<root>
  <pub>
    <book>
      <name> X </name>
      <author> A </author>
    </book>
    <book>
      <name> Y </name>
      <pub>
        <book>
          <name> Z </name>
          <author> B </author>
        </book>
        <year> 1999 </year>
      </pub>
    </book>
    <year> 2002 </year>
  </pub>
</root>
Query: //pub[year=2002]//book[author]//name
• 3 paths from $1 to $14
Reference • Feng Peng and Sudarshan S. Chawathe. XPath Queries on Streaming Data. In SIGMOD 2003.
Thank You ???