The Storage and Benchmarking of XML

The Storage and Benchmarking of XML Presenter: Kevin See (see@ca.ibm.com) IBM Toronto Lab. DB2 SQL /Catalog Development Date: Nov 2, 2001 <?XML?> The Storage and Benchmarking of XML

Outline • Introduction • Text file / OODBMS/ native DB approach • Relational DB approaches • Categories • 2 latest proposals • XML benchmarks • Conclusion The Storage and Benchmarking of XML

Introduction • XML is emerging to become the standard for data exchanging • Demand for storage and management of the XML documents is growing • There are a few ways to manage the XML document The Storage and Benchmarking of XML

Text Approach • File system • A separate query engine will need to be implemented • Parsing not possible • Index strategies : (parent_offset, tag), (child_offset, parent_offset), (tagname, value), (attribute_name, attribute_value)  not good for update The Storage and Benchmarking of XML

Object-oriented Database Management System • Michael R. Olson and Byung S. Lee (1997) • OO model fit well • Immature technology: Hard to scale • Conclude the experiment without any great success. The Storage and Benchmarking of XML

Native Database Approach • Prototypes: Lore (Stanford University), Xyleme (INRIA, France). • Immature technology • No optimization capabilities The Storage and Benchmarking of XML

Relational Database Technology • Very mature technology • Query optimization techniques and the processing mechanisms in relational databases have been studied for a quarter of a century • A very large percentage of the data are currently stored in RDMS The Storage and Benchmarking of XML

Storage and Retrieval of XML Using Relational Database XML to Relational Mapping Table1 XML XML Query Language such as XPath, XQuery, Quilt, XQL SQL XML Query to SQL Table1 XML Relational to XML Conversion The Storage and Benchmarking of XML

Structure-mapping approach The XML document’s logical structures (or DTDs if available) are represented by the database schemas 1 DTD : 1 set of generated schemas Model-mapping approach Constructs of XML model are represented by the database schemas 1 set of generated schemas for all/any DTD Classifications of Various Mapping Methods The Storage and Benchmarking of XML

Relational Schema Prototype Tree Mapping Method • M. Yan and A. Fu @ The Chinese University of Hong Kong (2001) • Structure-mapping • Global Schema Extraction Algorithm • DTD Splitting Schema Extraction Algorithm The Storage and Benchmarking of XML

Relational Schema Prototype Tree Mapping Method (Cont’d) • Relational Databases for Querying XML Documents: Limitations and Opportunities (J. Shanmugasundaram) • Basic steps: • Simplify DTD • Construct schema prototype tree • Generate relational schema prototypes • Detect functional dependencies and candidate keys • Normalize the relational schema prototypes The Storage and Benchmarking of XML

DTD Splitting Schema Extraction Algorithm • Step 1: Simplify DTD p|p'  p, p' p+  p* (p, p')  p, p' ..., p,..., p*,...  p* p?  p (p, p’)*  p*, p’* ..., p,..., p,...  p* The Storage and Benchmarking of XML

An Book DTD <!ENTITY %txt “(#PCDATA)”> <!ELEMENT book (booktitle, price?, author, authority*)> <!ELEMENT authority (authname, country)> <!ELEMENT authname %txt> <!ELEMENT country %txt> <!ELEMENT booktitle %txt> <!ELEMENT price %txt> <!ELEMENT monograph (title, author, editor)> <!ELEMENT title %txt> <!ELEMENT editor (monograph+)> <!ATTLIST editor name CDATA #REQUIRED> <!ELEMENT author (name, address)> <!ATTLIST author id ID> <!ELEMENT name (firstname, lastname)> <!ELEMENT firstname %txt> <!ELEMENT lastname %txt> <!ELEMENT address %txt> The Storage and Benchmarking of XML

Transformed/ Simplified DTD <!ELEMENT book (booktitle, price, author, authority*)> <!ELEMENT authority (authname, country)> <!ELEMENT authname (#PCDATA)> <!ELEMENT country (#PCDATA)> <!ELEMENT booktitle (#PCDATA)> <!ELEMENT price (#PCDATA)> <!ELEMENT monograph (title, author, editor)> <!ELEMENT title (#PCDATA)> <!ELEMENT editor (monograph*)> <!ATTLIST editor name CDATA> <!ELEMENT author (name, address)> <!ATTLIST author id ID> <!ELEMENT name (firstname, lastname)> <!ELEMENT firstname (#PCDATA)> <!ELEMENT lastname (#PCDATA)> <!ELEMENT address (#PCDATA)> The Storage and Benchmarking of XML

Step 2: Construct Schema Prototypes Trees • Only an element can become a root • An element that is not nested inside other elements can become the root • A non-#PCDATA element that is nested in more than 1 other element becomes the root • If a non-#PCDATA element B is not the only subelement of A and B only appears in A with a “*”, it becomes the root • One of the elements in the recursion is selected as root should recursion occurs in the DTD The Storage and Benchmarking of XML

Roots for the Example DTD <!ELEMENT book (booktitle, price, author, authority*)> <!ELEMENT authority (authname, country)> <!ELEMENT authname (#PCDATA)> <!ELEMENT country (#PCDATA)> <!ELEMENT booktitle (#PCDATA)> <!ELEMENT price (#PCDATA)> <!ELEMENT monograph (title, author, editor)> <!ELEMENT title (#PCDATA)> <!ELEMENT editor (monograph*)> <!ATTLIST editor name CDATA> <!ELEMENT author (name, address)> <!ATTLIST author id ID> <!ELEMENT name (firstname, lastname)> <!ELEMENT firstname (#PCDATA)> <!ELEMENT lastname (#PCDATA)> <!ELEMENT address (#PCDATA)> • Element book is selected as root – rule 2 • Element author is selected as root – rule 3 • Element authority is selected as root – rule 4 • Element monograph is selected as root – rule 5 The Storage and Benchmarking of XML

Step 2: Construct Schema Prototypes Trees (Cont’d) • Tree construction: • Depth-first scan on DTD for all selected root(s) starting from the subelements of the root • New nodes for each visited elements and attributes • A mixed element (element containing both #PCDATA and other subelement) will be marked with a “#” in the tree • Recursion – a new leaf node with label <node name>.A The Storage and Benchmarking of XML

Schema Prototype Trees The Storage and Benchmarking of XML

Step 3: Generate Relational Schema Prototype • All necessary descendants are inlined starting from the root except key nodes or foreign key nodes. The Storage and Benchmarking of XML

Relational Schema Prototype Book (booktitle, price) Authority (country, authname) Author (address, id, firstname, lastname) Monograph (title, name) The Storage and Benchmarking of XML

Step 4: Discover FDs and Candidate Keys • Functional dependencies (FDs) and the candidate keys discovery by analyzing the XML data • TANE algorithm (http://www.cs.helsinki.fi/research/fdk/datamining/tane/) The Storage and Benchmarking of XML

Candidate Keys Book {booktitle} Authority {country, authname} Monograph {title} Author {id}, {lastname, address} The Storage and Benchmarking of XML

Relational Schema Prototype With Candidate Keys Book (booktitle, price, author.id) Authority (country, authname, assigned, book.booktitle) Author (address, id, firstname, lastname) Monograph (title, name, author.id, monograph.title) Book (booktitle, price) Authority (country, authname) Author (address, id, firstname, lastname) Monograph (title, name) The Storage and Benchmarking of XML

Step 5: Normalize the Relational Schema Prototypes • The last step. • Normalize the schema to 3NF (third normal form) if possible. • Structure mapping methods does not handle order but leave it to metadata or user to handle. The Storage and Benchmarking of XML

X-Rel • Masatoshi Yoshikawa, Toshiyuki Amagasa, Takeyuki Shimura and Shunsuke Uemura @ Nara Institute of Science and Technology, Japan (2001) • Model-mapping • Data model: XPath (root node, element nodes, attribute nodes, and text nodes) • The concept of region The Storage and Benchmarking of XML

Definition of Region • The region of: • An element node or a text node is a pair of numbers representing the start and end positions of the node in the XML document • An attribute node is a pair of identical numbers equal to the start position of the parent element node plus one The Storage and Benchmarking of XML

Simple Path Expressions • Path – an unit of decomposition of XML trees • Store simple path expression (denoted by SimplePathExpr) from the root node. Why? • Path is appear in XML queries frequently The Storage and Benchmarking of XML

Why “#” Is Added? • Look for family descendants of issue. • WHERE p1.pathexp LIKE ‘/issue%/family’ • WHERE p1.pathexp LIKE ‘#/issue#%/family’ • /issuelist/family (WRONG) is match for the first but not the second. The Storage and Benchmarking of XML

Example XML Document <Paper Title = “The Suffix-Signature Method for Searching Phrases in Text”> <Authors> <FN> Mei </FN> <LN> Zhou </LN> <Affiliation> Open Text Corporation </Affiliation> </Authors> <Authors> <FN> Frank </FN> <LN> Tompa </LN> <Affiliation> University of Waterloo </Affiliation> </Authors> </Paper> The Storage and Benchmarking of XML

XML Tree The Storage and Benchmarking of XML

Simple Path Expressions /Regions • Node 3 • #/Paper#/@Title • (1,1) • Node 9 • #/Paper#/Authors#/affiliation • (99, 145) The Storage and Benchmarking of XML

Mapping Idea • A relational table per node type • Simple path expression are normalized • docID is introduced • Basic XRel schema  Element (docID, pathID, start, end, index, reindex) Attribute (docID, pathID, start, end, value) Text (docID, pathID, start, end, value) Path (pathID, pathexp) The Storage and Benchmarking of XML

docID pathID start end index reindex 1 1 0 257 1 1 1 3 66 155 1 2 1 4 75 86 1 2 1 5 87 99 1 2 1 6 100 145 1 2 1 3 156 249 2 1 1 4 165 178 2 1 1 5 179 192 2 1 1 6 193 239 2 1 Table - Element The Storage and Benchmarking of XML

Table - Attribute The Storage and Benchmarking of XML

Table - Text The Storage and Benchmarking of XML

Table - Path The Storage and Benchmarking of XML

XML Benchmarking Desiderata • Bulk loading • Reconstruction • Path traversals • Casting • Missing elements The Storage and Benchmarking of XML

XML Benchmarking Desiderata (continued) • Ordered access • References • Joins • Construction of large results • Containment, full-text search The Storage and Benchmarking of XML

XML Benchmarks • 3 XML benchmark proposals. • XMach-1. • University of Leipzig, Germany. • XML benchmark project. • CWI, the Netherlands. • Kanda et al. Proposal (unpublished). • University of Michigan, IBM Toronto lab center for advanced studies. The Storage and Benchmarking of XML

Conclusion • From the different mapping approaches and experiments, there are a few places where relational database enhancement can help in coping with XML model differences. • Support for sets. • Flexible comparisons operators. • Multi-predicate merge join. The Storage and Benchmarking of XML

Questions & Answers The Storage and Benchmarking of XML

Appendix A Enhancing Structural Mappings Based on Statistics The Storage and Benchmarking of XML

Optimal Hybrid Database Algorithm • M. Klettke, and H. Meyer • XML and object-relational database systems - enhancing structural mappings based on statistics (2000) • An algorithm that finds a type of optimal mapping based on the statistics and the DTD The Storage and Benchmarking of XML

Optimal Hybrid Database Algorithm • Build a graph representing the hierarchy of the elements and attributes of the DTD. • For every element/attribute of the graph, a measure of significance, w, is determined. • Derive the resulting database design from the graph. The Storage and Benchmarking of XML

The Storage and Benchmarking of XML

Graph for an Example DTD The Storage and Benchmarking of XML

Calculate the Weight (Step 2) • W = 1/6 (SQ + SA + SH) + ¼ (DA/DG) + ¼ (QA/QG) where • SQ - exploitation of quantifiers • SA - exploitation of alternatives • SH - position in the hierarchy • DA -number of documents containing the element/attribute • DG - absolute number of XML documents • QA - number of queries containing the element/attribute • QG - absolute number of queries The Storage and Benchmarking of XML

The Graph With the Colored Weight The Storage and Benchmarking of XML

Step 3 - Deriving Hybrid Databases From the Graph • First, specify a limit on which attributes and/or elements is represented as attributes of the databases and which attributes and/or element are represented as XML attributes The Storage and Benchmarking of XML

Step 3 - Deriving Hybrid Databases From the Graph • Then, search for all nodes of the graph that satisfy the following conditions: • The node is not a leaf of the graph • The node and all its descendants are below the limit given • No predecessor that satisfies the first two conditions exists. The Storage and Benchmarking of XML

The Storage and Benchmarking of XML

The Storage and Benchmarking of XML

Presentation Transcript

The Challenge of International Benchmarking

Advanced Views: XML-Relational Storage and Datalog

Benchmarking the Business of Agriculture

XML Storage and Indexing Native XML

Benchmarking Holistic Approaches to XML TPQ Processing

Merging the Storage Interfaces and I/O Benchmarking WGs

XML Storage and Query Processing

2 Basics of XML and XML documents

XML Storage and XPath Queries in Oracle

Benchmarking and the Kyoto Protocol

Adaptive XML Storage

Storing XML using native storage

The Utility of XML

Lecture 13: XQuery XML Publishing, XML Storage

XML Storage and Indexing Native XML

Benchmarking of the FLUKA code

Efficient XML Storage, Query, and Update

The Benefits of Benchmarking

The Why and What of XML

THE ROLE OF BENCHMARKING AND INDEPENDENT REVIEWS

Lecture 12: XML Publishing, XML Storage

The Utility of XML