XML Storage and Query Processing

XML Storage and Query Processing Yanlei Diao University of Massachusetts Amherst Some slide content courtesy of Donald Kossmann

XML Storage Alternatives • Plain Text • Trees with Navigation • Tuples (i.e., mapping to RDBMS)

Plain Text • Use XML standards to encode data • Advantages: • simple, universal • indexing possible • Disadvantages: • need to re-parse (re-validate) all the time • no compliance with XQuery data model (collections) • not an option for XQuery processing

f1 f2 f3 f4 f5 f6 f7 f8 Trees • XML data model uses tree semantics • use Trees/Forests to represent XML instances • annotate nodes of tree with data model info • Examples: • Document Object Model (DOM) http://www.w3.org/DOM/ • Object Exchange Model (OEM)

generates Data formulates execute against Queries DataGuides [Goldman & Widom 97] • Schema-based environments Schema

Summarized into generate Data DataGuides formulate execute against Queries DataGuides [Goldman & Widom 97] • Schema-free environments: • don't know the schema in advance. • semantic heterogeneity (i.e. a mix of schemas) App-specific Templates

Schema vs. DataGuides • A DataGuide only includes info that exists in a DB. • A schema can be a superset of any DB that conforms to it. • So, a schema defines a superset of a DataGuide. • Issues addressed in the paper: • Summarize data into DataGuides; • Use them for query formulation and optimization.

Object Exchange Model (OEM) • Object Exchange Model (OEM) • Each object has an id (oid) and a value (atomic or a set of subobjects). • Each edge links an object to one of its subobjects with a label; a subobject may have multiple parents. • Label path: a seq. of labels • Data path: an alternating seq. of labels and oids • Target set: a set of all objects reached by traversing a label path

Definition of a DataGuide • Conciseness: a DataGuide describes every unique label path of a source exactly once • Accuracy: a DataGuide does not encode any label path that does not appear in the source • Convenience: represented as an OEM model, like the data • A DataGuide reflects the structure of a DB; it contains no atomic values.

From Data to DataGuides • Creating a DataGuide is equivalent to converting an NFA to DFA! • Consider a label path (query) as a string to be accepted by the data source and the DataGuide. • Intuition: The data source has multiple matches, so execution is non-deterministic. But the DataGuide has only one path, so execution is deterministic. • Cost of creation • Source DB is a tree: linear • Worst-case: exponential in #. of objects and edges in the source • Empirical results: average performance for certain datasets is quite encouraging

22 Multiple DataGuides • An OEM source may have multiple DataGuides • A single NFA may have many equivalent DFAs. • Minimal DataGuide • Can be created using DFA minimization • Minimality may not always be desirable • Hard to maintain as the data source changes--well known problem with DFA. • Does not allow annotations. ? ?

Annotations • Annotation: a property of the target set of a label path l in the data source s • Statistical information: e.g. # occurrences of l in s • Pointers to objects reachable via l • … • Issue with minimality Annotation for A.C ? Annotation for B.C

Strong DataGuides • Each set of label paths that share a node in the DataGuide is the set of label paths that share the same target set in the source. • Label paths can be merged in the DataGuide if they lead to the same target set. • There is one-to-one correspondence between source target sets and DataGuide objects. • Creation from the data source • A DFS algorithm that examines source target sets reachable by al possible label paths… • Maintenance uses a similar set of data structures…

Query Formulation & Optimization • Query formulation • Query by example: click buttons to select a path and add value filters • Blurs the distinction between formulating a query and browsing a query result • Query optimization • Uses the DataGuide for structural matching (e.g. A.B.C) and retrieves the target set • Uses value indexes (e.g. B+trees) for value filters for a specific label (e.g. C.price>100) • Intersects the two resulting sets of objects

XML Data Stored as Tuples • Motivation: Use an RDBMS infrastructure to store and process the XML data • query optimization • scalability • richness and maturity of RDBMS • Alternative relational storage approaches: • Map XML schema to relational schema • Generic shredding of the data (edge, binary, …) • New XML storage integrated tightly with the relational processor

Relational Support for XML [Zhang et al. 2001] • Goal: relational support for path queries, including storage and query processing • Assumption: we have the DTD/schema • Problem addressed: to support XML path queries • Can we use a relational DBMS? • Shall we design a native XML store, i.e. using novel storage and indexing techniques?

Representation of XML <?xml version="1.0" ?> 1<book> 2<section id=“intro” difficulty=“easy”> 3<title> 4XML 5</title> 6<section difficulty=“easy”> 7<title> 8XML 9Processing 10</title> 11<figure source=“g1.jpg”> 12<title> 13XML 14Processing 15Cost 16</title> 17</figure> 18</section> 19<figure source=“g2.jpg”> 20<title> 21Scalability 22</title> 23</figure> 24</section> 25</book> • Each XML document is parsed to a seq. of items: • Start tag • Text word • End tag • All items are numbered, from 1.

Element Index • An Element Index (E-index) records occurrences of each element name inside the entire collection of documents. • Each index entry in an E-index corresponds to one occurrence of the element name. It has: • document identifier, • start position of the element in the doc, i.e. position of its start tag. • end position of the element in the doc, i.e. position of its end tag • document level of the element in the doc, i.e. level from the root. • An E-index is sorted in increasing order of <document id, start position, end position>.

<book> (1, 1:25, 1) (2, … <section> (1, 2:24, 2) (1, 6:18, 3) (2, … <title> (1, 3:5, 3) (1, 7:10, 4) (1, 12:16, 5) (1, 20:22, 4) (2, … <figure> (1, 11:17, 4) (1, 19:23, 3) (2, … Example of E-Index <?xml version="1.0" ?> 1<book> 2<section id=“intro” difficulty=“easy”> 3<title> 4XML 5</title> 6<section difficulty=“easy”> 7<title> 8XML 9Processing 10</title> 11<figure source=“g1.jpg”> 12<title> 13XML 14Processing 15Cost 16</title> 17</figure> 18</section> 19<figure source=“g2.jpg”> 20<title> 21Scalability 22</title> 23</figure> 24</section> 25</book>

Text Index • A Text Index (T-index) records the occurrences of each text word inside the entire collection of documents, similar to E-Index. • Difference is that each index entry in a T-index contains a single word position, instead of the pair of start and end positions. • Similarly, a T-index is sorted in increasingof <document identifier, word position>.

“XML” (1, 4:4, 4) (1, 8:8, 5) (1, 13:13, 6) (2, … “Processing” (1, 9:9, 5) (1, 14:14, 6) (2, … “Cost” (1, 15:15, 6) (2, … “Scalability” (1, 21:21, 5) (2, … Example of T-Index <?xml version="1.0" ?> 1<book> 2<section id=“intro” difficulty=“easy”> 3<title> 4XML 5</title> 6<section difficulty=“easy”> 7<title> 8XML 9Processing 10</title> 11<figure source=“g1.jpg”> 12<title> 13XML 14Processing 15Cost 16</title> 17</figure> 18</section> 19<figure source=“g2.jpg”> 20<title> 21Scalability 22</title> 23</figure> 24</section> 25</book>

doc_id start_pos end_pos doc_level <figure> <figure> (a) Element-Index (b) Text-Index “XML” “Processing” “Cost” term <title> <title> term <section> 1 2 1 1 1 1 2 1 1 2 1 1 2 … 20 … 3 12 11 1 … 7 … 6 2 19 … 24 … 25 10 23 18 22 … … 16 5 17 4 1 4 3 5 … 3 … 4 3 … 2 … “XML” “Processing” <title> <section> “Scalability” <book> “XML” <title> doc_id word_pos doc_level “Cost” “Processing” “XML” <book> “Scalability” <title> <section> <figure> 1 4 4 1 8 5 1 13 6 1 9 5 1 14 6 1 15 1 21 2 2 2 2 … … … … … … … … 6 5 Relational Storage

Relational Storage (contd.) • One relation for elements, one for text words • Clustered B+trees over each table • On (term, docno) • On all columns: lead to index-only plans

 (//) l.doc_id = r.doc_id and l.start_pos < r.start_pos and l.end_pos > r.end_pos Index Scan on <title> Index Scan on <section> “//section//title”

Questions

Outline • Storage and Query Processing • DataGuides [Goldman and Widom 97] • Relational Approach [Zhang et al. 2001] • Other Research Topics • Query Rewriting • Benchmarking • …

Node Identifiers • XQuery Data Model Requirements • identify a node uniquely (implementing identity) • lives as long as node lives • robust to updates • Identifiers might include additional information • Schema/type information • Document order • Parent/child relationship • Ancestor/descendent relationship • Document information • Required for indexes

Simple Node Identifiers • Examples: • Alternative 1 (data: trees) • id of document (integer) • pre-order number of node in document (integer) • Alternative 2 (data: plain text) • file name • offset in file • Encode document ordering (Alternative 1) • identity: doc1 = doc2 AND pre1 = pre2 • order: doc1 < doc2 OR (doc1 = doc2 AND pre1 < pre2) • Assessment: • bad: Not robust to updates • bad: Not able to answer more complex queries

Dewey Order • Idea: • Generate surrogates for each path • 1.2.3 identifies the third child of the second child of the first child of the given root • Assessment: • good: order comparison, ancestor/descendent easy • bad: updates expensive, space overhead • Improvement: ORDPath Bit Encoding O‘Neil et al. 2004 (Microsoft SQL Server)

1 person 1.1 name child 1.2 1.2.1 person name hobby hobby 1.2.1.1 1.2.1.2 1.2.1.3 Example: Dewey Order

XML Storage Alternatives • Plain Text • Trees with Random Access • Tuples (i.e., mapping to RDBMS)

Plain Text • Use XML standards to encode data • Advantages: • simple, universal • indexing possible • Disadvantages: • need to re-parse (re-validate) all the time • no compliance with XQuery data model (collections) • not an option for XQuery processing

f1 f2 f3 f4 f5 f6 f7 f8 Trees • XML data model uses tree semantics • use Trees/Forests to represent XML instances • annotate nodes of tree with data model info • Example <f1> <f2>..</f2> <f3>..</f3> <f4> <f7/> <f8>..</f8> </f4> <f5/> <f6>..</f6> </f1>

Trees • Advantages • natural representation of XML data • good support for navigation, updates index built into the data structure • compliance with DOM standard interface • Disadvantages • difficult to partition • high overhead: mixes indexes and data • index everything • Example: Document Object Model (DOM) • http://www.w3.org/DOM/

Edge Approach (Florescu & Kossmann 99) Edge Table Value Table (String) Value Table (Integer)

XML Example <person id = “4711“> <name> Lilly Potter </name> <child> <person id = “314“> <name> Harry Potter </name> <age> 12 </age> </child> </person> <person id = “666“> <name> James Potter </name> <child idref = “314“/> </person>

0 person person <person id = “4711“> <name> Lilly Potter </name> <child> <person id = “314“> <name> Harry Potter </name> <age> 12 </age> </child> </person> <person id = “666“> <name> James Potter </name> <child idref = “314“/> </person> 4711 666 name child name Lilly Potter i314 James Potter person 314 age name Harry Potter 12

Kinds of Indexes • Value Indexes • index atomic values; e.g., //emp/salary/fn:data(.) • use B+ trees (like in relational world) • (integration into query optimizer more tricky) • Structure Indexes • materialize results of path expressions • (pendant to Rel. join indexes, OO path indices) • Full text indexes • Keyword search, inverted files • (IR world, text extenders) • Any combination of the above

Outline • XML Storage • XML Indexing • Query Processing • Other Research Topics • Query Rewriting • Benchmarking • …

What is a Correct Rewriting • E1 -> E2 is a legal rewriting iff • Type(E2) is a subtype of Type(E1) • FreeVar(E2) is a subset of FreeVar(E1) • For a binding of free variables, either • E1 or E2 return ERROR (possibly different errors) • Or E1 and E2 return the same result • This definition allows the rewrite E1->ERROR • Trust your vendor she does not do that for all E1!

Handling Backwards Navigation • Replace backwards navigation with forward navigation for $x in $input/a/b for $y in $input/a, return <c>{$x/.., $x/d}</c> $x in $y/b return <c>{$y, $x/d}</c> for $x in $input/a/b return <c>{$x//e/..}</c> ?? • Enables streaming

FLWR Unnesting • Traditional database technique for $x in $input/a/b for $x in $input/a/b, where $x/c eq 3 $y in $x/d return (for $y in $x/d where ($x/e eq 4) and ($x/c eq 3) where $x/e eq 4 return $y return $y) • Problem simpler than in OQL/ODMG • No nested collections in XML

XML Query Processing • Techniques vary a lot, depending on • Storage model • Indexes available • Algebra used • … • A large body of ongoing work • Research community: McHugh and Widom 1999, Zhang et al. 2001, Bruno et al. 2002, Ghua et al. 2002, Chen et al. 2003, Paparizos et al. 2004, Jagadish 2004, … (just look at SIGMOD and VLDB proceedings in recent years!) • Industry: IBM DB2, Oracle, SQL Server, …

XML Processing Benchmark • We cannot really compare approaches until we decide on a comparison basis • XML processing very broad • Industry not mature enough • Usage patterns not clear enough • Existing XML benchmarks (Xmark, etc. ) limited • Strong need for a TP benchmark

XML Storage and Query Processing