This document discusses advanced structure-encoded sequence indexing schemes for optimized XML querying. It details the use of Virtual Suffix Trees (ViST) and other techniques to represent XML documents and perform queries effectively. By employing top-down and bottom-up approaches, it showcases methods for preventing expensive join operations, supporting dynamic index updates, and providing a unified index covering both content and structure. The algorithms for matching XML queries are explored, including the Naïve algorithm and RIST (Relationships Indexed Suffix Tree), facilitating efficient subsequence matching in XML data.
Sequence Indexing Schemes Roman Čížek Erasmus 2687, Nelly Vouzoukidou MET601
Introduction • Graph indexes • Precise • Path queries (twig queries are supported by only a few methods) • Sequence indexing schemes • Top-down or bottom-up • XML documents and XML queries are represented as structure-encoded sequences • Path and twig queries
ViST – Virtual Suffix Tree • Top-down sequence index • Represents XML documents and XML queries as structure-encoded sequences • Querying XML data is equivalent to finding subsequence matches • Avoids expensive join operations • Provides a unified index on both content and structure • Supports dynamic index updates • Relies on B+Trees, which are supported in DBMSs
DTD of purchase records
<!ELEMENT purchases (purchase*)>
<!ELEMENT purchase (seller, buyer)>
<!ATTLIST seller ID ID location CDATA name CDATA>
<!ELEMENT seller (item*)>
<!ATTLIST buyer ID ID location CDATA name CDATA>
<!ELEMENT item (item*)>
<!ATTLIST item name CDATA manufacturer CDATA>
Preorder Sequence of XML • Use capital letters to represent the names of elements/attributes • Use a hash function h() to encode attribute values into integers • v1 = h(“dell”) • v2 = h(“ibm”) • Preorder sequence of the XML purchase record example • PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8 • Isomorphic trees may produce different preorder sequences • The DTD schema embodies a linear order of all elements/attributes • Without a DTD, use lexicographical order
Structure-Encoded Sequence Definition: A Structure-Encoded Sequence, derived from a prefix (preorder) traversal of a semi-structured XML document, is a sequence of (symbol, prefix) pairs: D = (a_1,p_1), (a_2,p_2), …, (a_n,p_n), where a_i represents a node in the XML document tree (of which a_1, …, a_n is the preorder sequence), and p_i is the path from the root node to node a_i.
Structure-Encoded Sequence D= (P,ϵ),(S,P),(N,PS),(v1,PSN),(I,PS),(M,PSI),(v2,PSIM),(N,PSI), (v3,PSIN),(I,PSI),(M,PSII),(v4,PSIIM),(I,PS),(N,PSI),(v5,PSIN), (L,PS),(v6,PSL),(B,P),(L,PB),(v7,PBL),(N,PB),(v8,PBN)
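The derivation of D from the document tree can be sketched in a few lines of Python. This is only an illustration: the `(symbol, children)` tuple representation and the helper names `leaf` and `structure_encoded` are assumptions made for the sketch, and the empty string stands in for ϵ.

```python
def leaf(symbol, value):
    """Hypothetical helper: an attribute node holding one hashed value."""
    return (symbol, [(value, [])])


def structure_encoded(node, prefix=""):
    """Return the preorder list of (symbol, prefix) pairs for a tree."""
    symbol, children = node
    pairs = [(symbol, prefix)]
    for child in children:
        # Each child's prefix is the path from the root down to its parent.
        pairs.extend(structure_encoded(child, prefix + symbol))
    return pairs


# The purchase record tree from the slides (P, S, B, I, N, M, L, v1..v8).
purchase = ("P", [
    ("S", [
        leaf("N", "v1"),
        ("I", [leaf("M", "v2"), leaf("N", "v3"), ("I", [leaf("M", "v4")])]),
        ("I", [leaf("N", "v5")]),
        leaf("L", "v6"),
    ]),
    ("B", [leaf("L", "v7"), leaf("N", "v8")]),
])

D = structure_encoded(purchase)
preorder = "".join(symbol for symbol, _ in D)
print(preorder)   # PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8
print(D[:4])      # [('P', ''), ('S', 'P'), ('N', 'PS'), ('v1', 'PSN')]
```

Dropping the prefixes gives back the plain preorder sequence, which is why the two slides agree symbol for symbol.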
XML Queries in Path Expression and Sequence Form • Each query is given as a path expression followed by its structure-encoded sequence • Q1: /Purchase/Seller/Item/Manufacturer → (P,ϵ)(S,P)(I,PS)(M,PSI) • Q2: /Purchase[Seller[Loc = v5]]/Buyer[Loc = v7] → (P,ϵ)(S,P)(L,PS)(v5,PSL)(B,P)(L,PB)(v7,PBL) • Q3: /Purchase/*[Loc = v5] → (P,ϵ)(L,P*)(v5,P*L) • Q4: /Purchase//Item[Manufacturer = v3] → (P,ϵ)(I,P//)(M,P//I)(v3,P//IM)
Querying XML through Structure-Encoded Sequence Matching • Querying XML is equivalent to finding (non-contiguous) subsequence matches • Most structural XML queries can be performed through direct subsequence matching • Exception: a branch with multiple identical child nodes • Q5 = /A[B/C]/B/D • It corresponds to two different sequences • (A, ϵ)(B,A)(C,AB)(B,A)(D,AB) • (A, ϵ)(B,A)(D,AB)(B,A)(C,AB) • Find the matches for each sequence separately and union the results • False matches are possible if the indexed documents contain branches with identical child nodes; in that case we ask multiple queries and compute the set difference of their results • If the query contains a large number of identical child nodes under a branch, we can disassemble the query tree into multiple trees and use join operations to combine their results
Algorithms • Naïve algorithm • RIST – Relationships Indexed Suffix Tree • ViST – Virtual Suffix Tree
Naïve algorithm: Suffix-Tree-Like structure • Doc1 : (P, ϵ)( S, P)(N, PS)(v1, PSN)(L, PS)(v2, PSL) • Doc2 : (P, ϵ)(B, P)(L, PB)(v2, PBL) • Q1 : (P, ϵ)(B, P)(L,PB)(v2, PBL) • Q2 : (P, ϵ)(L, P*)(v2,P*L)
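The matching that the suffix-tree-like structure performs can be approximated by a direct scan, sketched below in Python. This is a simplified stand-in, not the algorithm from the slides: the function names are hypothetical, prefixes are assumed to be strings of single-character symbols, `*` is read as exactly one symbol and `//` as any run of symbols, and the sketch does not check that a wildcard is bound to the same path across query elements.

```python
import re


def prefix_regex(pattern):
    """Translate a query prefix such as 'P*L' or 'P//I' into a regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("//", i):
            parts.append(".*")      # any run of symbols (possibly empty)
            i += 2
        elif pattern[i] == "*":
            parts.append(".")       # exactly one symbol
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(query, doc):
    """True if query is a (non-contiguous) subsequence of doc."""
    pos = 0
    for symbol, pattern in query:
        rx = prefix_regex(pattern)
        # Scan forward to the next document pair that fits this query pair.
        while pos < len(doc) and not (
            doc[pos][0] == symbol and rx.fullmatch(doc[pos][1])
        ):
            pos += 1
        if pos == len(doc):
            return False
        pos += 1
    return True


doc1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
doc2 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
q1 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
q2 = [("P", ""), ("L", "P*"), ("v2", "P*L")]

print(matches(q1, doc1), matches(q1, doc2))  # False True
print(matches(q2, doc1), matches(q2, doc2))  # True True
```

Q1 names the buyer path explicitly and so only reaches Doc2, while the wildcard in Q2 lets it match the location under either the seller or the buyer.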
D-Ancestorship and S-Ancestorship • D-Ancestorship • Ancestor-descendant relationship in the original XML tree • Element (S,P) is a D-Ancestor of (L,PS) • S-Ancestorship • Ancestor-descendant relationship in the suffix tree • Element (v1, PSN) is an S-Ancestor of (L, PS)
RIST – Indexing Construction • Testing S-Ancestorship requires additional information • Label each suffix tree node x with a pair <n_x, size_x> • n_x is the prefix traversal order of x in the suffix tree • size_x is the total number of descendants of x in the suffix tree • For nodes x and y labeled <n_x, size_x> and <n_y, size_y>, x is an S-Ancestor of y if n_y ∈ (n_x, n_x + size_x] • Construct the B+Trees: • Insert all suffix tree nodes into the D-Ancestorship B+Tree, using (Symbol, Prefix) as keys • Index all nodes x inserted with the same (Symbol, Prefix) in an S-Ancestorship B+Tree, using the n_x values of their labels as keys
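The labeling rule can be tried out on a toy tree. The Python sketch below is illustrative only: the dictionary-based tree, the node names and the function names are assumptions, and the B+Trees themselves are left out.

```python
def assign_labels(tree, root):
    """Label every node with (n, size): preorder rank and descendant count.

    `tree` maps a node name to the list of its children (hypothetical layout).
    """
    labels = {}

    def visit(node, next_n):
        n = next_n
        next_n += 1
        for child in tree.get(node, []):
            next_n = visit(child, next_n)
        # Everything numbered after n inside this call is a descendant.
        labels[node] = (n, next_n - n - 1)
        return next_n

    visit(root, 0)
    return labels


def is_s_ancestor(x_label, y_label):
    """x is an S-Ancestor of y if n_y lies in (n_x, n_x + size_x]."""
    n_x, size_x = x_label
    n_y, _ = y_label
    return n_x < n_y <= n_x + size_x


# A small made-up suffix tree: root -> a -> (b -> d, c)
tree = {"root": ["a"], "a": ["b", "c"], "b": ["d"]}
labels = assign_labels(tree, "root")
print(labels["a"], labels["b"], labels["c"])        # (1, 3) (2, 1) (4, 0)
print(is_s_ancestor(labels["a"], labels["c"]))      # True
print(is_s_ancestor(labels["b"], labels["c"]))      # False
```

With these labels the ancestor test is a range comparison on integers, which is the kind of lookup a B+Tree keyed on n_x can serve.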
ViST – Virtual Suffix Tree • Dynamic Virtual suffix tree labeling • Semantic and statistical clues • Dynamic scope allocation without clues
Dynamic scope allocation • Suppose the number of child nodes of x is λ • We allocate 1/λ of the remaining scope to x’s first child • Figure: dynamic scope allocation with λ=2
subScope(parent, e): create a subscope within the parent scope for e
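One possible reading of the allocation rule is sketched below in Python. The slides only state the rule for the first child; applying the same 1/λ share of whatever scope remains to each later child is an assumption of this sketch, as are the function name, the integer arithmetic and the error raised when the scope runs out.

```python
def allocate_children(n, size, lam, count):
    """Hand out subscopes to `count` children of a node labeled <n, size>.

    The parent owns the numbers in (n, n + size]. Each child takes 1/lam of
    the scope that is still unallocated (assumed rule, see lead-in).
    Returns a list of child labels (n_child, size_child).
    """
    labels = []
    start = n + 1        # first free number inside the parent's scope
    remaining = size     # how many numbers are still free
    for _ in range(count):
        chunk = remaining // lam
        if chunk < 1:
            raise ValueError("parent scope exhausted")
        # One number is the child's own n; the rest is its descendants' scope.
        labels.append((start, chunk - 1))
        start += chunk
        remaining -= chunk
    return labels


children = allocate_children(n=0, size=16, lam=2, count=3)
print(children)  # [(1, 7), (9, 3), (13, 1)]
```

With λ=2 each child receives half of what is left, so the subscopes shrink geometrically and never overlap. A real index would also need a policy for the exhausted case, which the sketch only reports.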
Insertion index • Doc1 = (P,ϵ)(S,P)(N,PS)(v1,PSN)(L,PS)(v2,PSL) • Doc2 = (P,ϵ)(S,P)(L,PS)(v2,PSL)
EXPERIMENTS - Sample queries (path expression, dataset) • Q1: /inproceedings/title (DBLP) • Q2: /book/author[text=‘David’] (DBLP) • Q3: /*/author[text=‘David’] (DBLP) • Q4: //author[text=‘David’] (DBLP) • Q5: /book[key=‘books/bc/MaierW88’]/author (DBLP) • Q6: /site//item[location=‘US’]/mail/date[text=‘12/15/1999’] (XMARK) • Q7: /site//person/*/city[text=‘Pocatello’] (XMARK) • Q8: //closed_auction[*[person=‘person1’]]/date[text=‘12/15/1999’] (XMARK)
Comparing indexing methods (time in seconds)
Index structure • DBLP (301 MB of data) • XMARK (52MB of data)
Conclusion • structure-encoded sequences • Sequence matching • Avoid expensive join operations • Top-down scope allocation method • Index structure – B+Tree
PRIX: PRufer sequences for Indexing Xml • Rao & Moon (2006) proposed a new method for indexing XML documents using sequences • It uses the same idea as the ViST index: • The XML tree is transformed into a sequence and saved in the database • Each query is also transformed into a sequence • The answer to the query is obtained by performing subsequence matching
Motivation: Twig Queries and Wildcards • Like ViST, PRIX aims to answer twig queries efficiently, as well as queries containing wildcards (‘*’ for any element and ‘//’ for descendant-or-self) • Twig query example, XPath: P/Q[T]/S • Query with wildcards example, XPath: P//Q/S
Motivation: Problems in ViST Index • Memory requirements: • In the worst case, ViST requires O(N²) space to index the document • Example: <A> <B> <C> <D> <E> </E> </D> </C> </B> </A> gives D = (A, ε), (B, A), (C, AB), (D, ABC), (E, ABCD) • Elements at height k appear k times
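The quadratic growth can be checked by counting symbols in the sequence of a chain-shaped document. The Python sketch below only counts the symbols of the raw structure-encoded sequence; how an actual index stores prefixes is not covered by the slides, and both function names are hypothetical.

```python
def chain_sequence(symbols):
    """Structure-encoded sequence of a chain: each node is the only child of
    the previous one, so its prefix is everything before it."""
    return [(symbols[i], symbols[:i]) for i in range(len(symbols))]


def stored_symbols(sequence):
    """Count one symbol per node plus the length of its prefix."""
    return sum(1 + len(prefix) for _, prefix in sequence)


seq = chain_sequence("ABCDE")
print(seq[-1])                  # ('E', 'ABCD')
print(stored_symbols(seq))      # 15

# For a chain of N nodes the count is N + N(N-1)/2 = N(N+1)/2.
n = 100
print(stored_symbols(chain_sequence(list(range(n)))))  # 5050
```

For N = 5 the sequence holds 15 symbols and for N = 100 it holds 5050, matching N(N+1)/2 and hence the O(N²) worst case.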
Motivation: Problems in ViST Index • False positives • In many cases, query processing in ViST results in false alarms • Doc1 = (P, ε)(Q, P)(T, PQ)(S, PQ)(R, P)(U, PR)(T, PR) • Doc2 = (P, ε)(Q, P)(T, PQ)(Q, P)(S, PQ) • XPath: P/Q[T]/S • Q = (P, ε)(Q, P)(T, PQ)(S, PQ) • Q is a subsequence of both documents, but in Doc2 the T and S nodes belong to different Q elements, so Doc2 is a false alarm
Motivation: Problems in ViST Index • False negatives • Correctly answering a twig query depends on the order in which the branches are created • Doc = (P, ε)(F, P)(T, PF)(N, P)(G, PN) • XPath: P[N]/F • Q = (P, ε)(N, P)(F, P) is not a subsequence of Doc, although Doc satisfies the query
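Both failure modes can be reproduced with plain subsequence matching on the example sequences. The Python sketch below is a deliberately naive check, not ViST itself; the function name is hypothetical and the empty string stands in for ε.

```python
def is_subsequence(query, doc):
    """True if the pairs of `query` occur in `doc` in order, gaps allowed."""
    remaining = iter(doc)
    # `pair in remaining` advances the iterator until the pair is found.
    return all(pair in remaining for pair in query)


# False positive: XPath P/Q[T]/S
q_twig = [("P", ""), ("Q", "P"), ("T", "PQ"), ("S", "PQ")]
doc1 = [("P", ""), ("Q", "P"), ("T", "PQ"), ("S", "PQ"),
        ("R", "P"), ("U", "PR"), ("T", "PR")]
doc2 = [("P", ""), ("Q", "P"), ("T", "PQ"), ("Q", "P"), ("S", "PQ")]
print(is_subsequence(q_twig, doc1))  # True  (genuine match)
print(is_subsequence(q_twig, doc2))  # True  (false alarm: T and S sit under different Q nodes)

# False negative: XPath P[N]/F
q_branch = [("P", ""), ("N", "P"), ("F", "P")]
doc3 = [("P", ""), ("F", "P"), ("T", "PF"), ("N", "P"), ("G", "PN")]
print(is_subsequence(q_branch, doc3))  # False (missed: F precedes N in the document)
```

The sequence alone cannot tell which Q a T or S hangs from, and it fixes one branch order. These are the two gaps the slides attribute to ViST.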
Indexing and Querying in PRIX • Indexing: • The first step is to take an XML document as input and convert it into a sequence • This is achieved using Prufer Sequences • The sequence is saved in the database in a way equivalent to the one used in ViST • It is a Virtual Trie implemented as B+ Trees
Indexing and Querying in PRIX • Querying: • Queries (XPath) are also transformed into trees and then into Prufer Sequences • The query sequence is looked up in the document sequence and all matching subsequences are retrieved • After this initial filtering, three refinement phases follow
Indexing XML Documents • The first step is to transform the XML document into the equivalent XML tree • Notice that both elements and text values are represented as nodes (the same holds for attributes) • The tree is not saved in the database • Example document: <A> <B></B> <B> <C> D </C> <C> <F/> <E/> </C> </B> </A> • Its tree: root A has two B children; the second B has two C children, the first holding the text node D and the second holding F and E
Indexing XML Documents • Then the Prufer Sequence is created from the XML tree • Prufer (1918) proposed a method that constructs a one-to-one correspondence between a labeled tree and a sequence • Example: the tree with nodes (1,B), (2,D), (3,C), (4,F), (5,E), (6,C), (7,B), (8,A) yields the sequence 8, 3, 7, 6, 6, 7, 8
Indexing XML Documents • Prufer Sequences can only be created from trees with numerical labeling, with each node having a unique number • Since the XML tree contains string labels (the names of elements etc.), we add an additional label to each node • We use the post-order traversal to number the nodes • The Prufer sequence can be extracted for any labeling of the tree, but post-order numbering has some properties that make the querying process easier
Indexing XML Documents • Initial labeling (figure): each node of the example tree receives its post-order number: (1,B), (2,D), (3,C), (4,F), (5,E), (6,C), (7,B), (8,A)
Indexing XML Documents Finding the Prufer Sequence • The algorithm to find the Prufer sequence is the following: • Find the leaf with the smallest value and delete it. • Add the label of its parent to the sequence • Repeat until only one node is left • In PRIX index, two sequences are held: • The actual Prufer Sequence holding the numbers of the labels called Numbered Prufer Sequence: NPS • The corresponding sequence holding the actual labels of the nodes of the XML Tree called Labeled Prufer Sequence: LPS
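The three steps, together with the post-order numbering, can be sketched in Python as follows. The `(label, children)` tuple representation and the function names are assumptions made for illustration, and a node counts as a leaf when it has no remaining children.

```python
import heapq


def postorder_number(tree):
    """Number the nodes of a (label, children) tree in post-order.

    Returns two dicts keyed by node number: parent number and label.
    """
    parent, label = {}, {}
    counter = 0

    def visit(node):
        nonlocal counter
        name, children = node
        child_ids = [visit(child) for child in children]
        counter += 1
        label[counter] = name
        for cid in child_ids:
            parent[cid] = counter
        return counter

    visit(tree)
    return parent, label


def prufer(parent, label):
    """Return (NPS, LPS) by repeatedly deleting the smallest leaf."""
    child_count = {}
    for par in parent.values():
        child_count[par] = child_count.get(par, 0) + 1
    leaves = [n for n in label if child_count.get(n, 0) == 0]
    heapq.heapify(leaves)
    nps, lps = [], []
    for _ in range(len(label) - 1):       # stop when one node is left
        leaf = heapq.heappop(leaves)      # leaf with the smallest number
        par = parent[leaf]
        nps.append(par)                   # numbered Prufer sequence
        lps.append(label[par])            # labeled Prufer sequence
        child_count[par] -= 1
        if child_count[par] == 0:         # the parent has become a leaf
            heapq.heappush(leaves, par)
    return nps, lps


# The example tree from the slides.
xml_tree = ("A", [
    ("B", []),
    ("B", [("C", [("D", [])]), ("C", [("F", []), ("E", [])])]),
])
parent, label = postorder_number(xml_tree)
nps, lps = prufer(parent, label)
print(nps)  # [8, 3, 7, 6, 6, 7, 8]
print(lps)  # ['A', 'C', 'B', 'C', 'C', 'B', 'A']
```

With post-order numbering the smallest remaining leaf is always the next number in sequence, so the i-th entry of the NPS is simply the parent of node i. The heap is kept only so that the sketch also works for other numberings.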
Indexing XML Documents Finding the Prufer Sequence • Step 1: the smallest leaf is (1,B); delete it and add the label of its parent (8,A) to the sequence • NPS: 8 • LPS: A
Indexing XML Documents Finding the Prufer Sequence • Step 2: the smallest remaining leaf is (2,D); delete it and add the label of its parent (3,C) to the sequence • NPS: 8, 3 • LPS: A, C