1 / 53

The Storage and Benchmarking of XML

The Storage and Benchmarking of XML. Presenter: Kevin See (see@ca.ibm.com) IBM Toronto Lab. DB2 SQL /Catalog Development Date: Nov 2, 2001. <? XML ?>. Outline. Introduction Text file / OODBMS/ native DB approach Relational DB approaches Categories 2 latest proposals XML benchmarks

esben
Télécharger la présentation

The Storage and Benchmarking of XML

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Storage and Benchmarking of XML Presenter: Kevin See (see@ca.ibm.com) IBM Toronto Lab. DB2 SQL /Catalog Development Date: Nov 2, 2001 <?XML?> The Storage and Benchmarking of XML

  2. Outline • Introduction • Text file / OODBMS/ native DB approach • Relational DB approaches • Categories • 2 latest proposals • XML benchmarks • Conclusion The Storage and Benchmarking of XML

  3. Introduction • XML is emerging to become the standard for data exchanging • Demand for storage and management of the XML documents is growing • There are a few ways to manage the XML document The Storage and Benchmarking of XML

  4. Text Approach • File system • A separate query engine will need to be implemented • Parsing not possible • Index strategies : (parent_offset, tag), (child_offset, parent_offset), (tagname, value), (attribute_name, attribute_value)  not good for update The Storage and Benchmarking of XML

  5. Object-oriented Database Management System • Michael R. Olson and Byung S. Lee (1997) • OO model fit well • Immature technology: Hard to scale • Conclude the experiment without any great success. The Storage and Benchmarking of XML

  6. Native Database Approach • Prototypes: Lore (Stanford University), Xyleme (INRIA, France). • Immature technology • No optimization capabilities The Storage and Benchmarking of XML

  7. Relational Database Technology • Very mature technology • Query optimization techniques and the processing mechanisms in relational databases have been studied for a quarter of a century • A very large percentage of the data are currently stored in RDMS The Storage and Benchmarking of XML

  8. Storage and Retrieval of XML Using Relational Database XML to Relational Mapping Table1 XML XML Query Language such as XPath, XQuery, Quilt, XQL SQL XML Query to SQL Table1 XML Relational to XML Conversion The Storage and Benchmarking of XML

  9. Structure-mapping approach The XML document’s logical structures (or DTDs if available) are represented by the database schemas 1 DTD : 1 set of generated schemas Model-mapping approach Constructs of XML model are represented by the database schemas 1 set of generated schemas for all/any DTD Classifications of Various Mapping Methods The Storage and Benchmarking of XML

  10. Relational Schema Prototype Tree Mapping Method • M. Yan and A. Fu @ The Chinese University of Hong Kong (2001) • Structure-mapping • Global Schema Extraction Algorithm • DTD Splitting Schema Extraction Algorithm The Storage and Benchmarking of XML

  11. Relational Schema Prototype Tree Mapping Method (Cont’d) • Relational Databases for Querying XML Documents: Limitations and Opportunities (J. Shanmugasundaram) • Basic steps: • Simplify DTD • Construct schema prototype tree • Generate relational schema prototypes • Detect functional dependencies and candidate keys • Normalize the relational schema prototypes The Storage and Benchmarking of XML

  12. DTD Splitting Schema Extraction Algorithm • Step 1: Simplify DTD p|p'  p, p' p+  p* (p, p')  p, p' ..., p,..., p*,...  p* p?  p (p, p’)*  p*, p’* ..., p,..., p,...  p* The Storage and Benchmarking of XML

  13. An Book DTD <!ENTITY %txt “(#PCDATA)”> <!ELEMENT book (booktitle, price?, author, authority*)> <!ELEMENT authority (authname, country)> <!ELEMENT authname %txt> <!ELEMENT country %txt> <!ELEMENT booktitle %txt> <!ELEMENT price %txt> <!ELEMENT monograph (title, author, editor)> <!ELEMENT title %txt> <!ELEMENT editor (monograph+)> <!ATTLIST editor name CDATA #REQUIRED> <!ELEMENT author (name, address)> <!ATTLIST author id ID> <!ELEMENT name (firstname, lastname)> <!ELEMENT firstname %txt> <!ELEMENT lastname %txt> <!ELEMENT address %txt> The Storage and Benchmarking of XML

  14. Transformed/ Simplified DTD <!ELEMENT book (booktitle, price, author, authority*)> <!ELEMENT authority (authname, country)> <!ELEMENT authname (#PCDATA)> <!ELEMENT country (#PCDATA)> <!ELEMENT booktitle (#PCDATA)> <!ELEMENT price (#PCDATA)> <!ELEMENT monograph (title, author, editor)> <!ELEMENT title (#PCDATA)> <!ELEMENT editor (monograph*)> <!ATTLIST editor name CDATA> <!ELEMENT author (name, address)> <!ATTLIST author id ID> <!ELEMENT name (firstname, lastname)> <!ELEMENT firstname (#PCDATA)> <!ELEMENT lastname (#PCDATA)> <!ELEMENT address (#PCDATA)> The Storage and Benchmarking of XML

  15. Step 2: Construct Schema Prototypes Trees • Only an element can become a root • An element that is not nested inside other elements can become the root • A non-#PCDATA element that is nested in more than 1 other element becomes the root • If a non-#PCDATA element B is not the only subelement of A and B only appears in A with a “*”, it becomes the root • One of the elements in the recursion is selected as root should recursion occurs in the DTD The Storage and Benchmarking of XML

  16. Roots for the Example DTD <!ELEMENT book (booktitle, price, author, authority*)> <!ELEMENT authority (authname, country)> <!ELEMENT authname (#PCDATA)> <!ELEMENT country (#PCDATA)> <!ELEMENT booktitle (#PCDATA)> <!ELEMENT price (#PCDATA)> <!ELEMENT monograph (title, author, editor)> <!ELEMENT title (#PCDATA)> <!ELEMENT editor (monograph*)> <!ATTLIST editor name CDATA> <!ELEMENT author (name, address)> <!ATTLIST author id ID> <!ELEMENT name (firstname, lastname)> <!ELEMENT firstname (#PCDATA)> <!ELEMENT lastname (#PCDATA)> <!ELEMENT address (#PCDATA)> • Element book is selected as root – rule 2 • Element author is selected as root – rule 3 • Element authority is selected as root – rule 4 • Element monograph is selected as root – rule 5 The Storage and Benchmarking of XML

  17. Step 2: Construct Schema Prototypes Trees (Cont’d) • Tree construction: • Depth-first scan on DTD for all selected root(s) starting from the subelements of the root • New nodes for each visited elements and attributes • A mixed element (element containing both #PCDATA and other subelement) will be marked with a “#” in the tree • Recursion – a new leaf node with label <node name>.A The Storage and Benchmarking of XML

  18. Schema Prototype Trees The Storage and Benchmarking of XML

  19. Step 3: Generate Relational Schema Prototype • All necessary descendants are inlined starting from the root except key nodes or foreign key nodes. The Storage and Benchmarking of XML

  20. Relational Schema Prototype Book (booktitle, price) Authority (country, authname) Author (address, id, firstname, lastname) Monograph (title, name) The Storage and Benchmarking of XML

  21. Step 4: Discover FDs and Candidate Keys • Functional dependencies (FDs) and the candidate keys discovery by analyzing the XML data • TANE algorithm (http://www.cs.helsinki.fi/research/fdk/datamining/tane/) The Storage and Benchmarking of XML

  22. Candidate Keys Book {booktitle} Authority {country, authname} Monograph {title} Author {id}, {lastname, address} The Storage and Benchmarking of XML

  23. Relational Schema Prototype With Candidate Keys Book (booktitle, price, author.id) Authority (country, authname, assigned, book.booktitle) Author (address, id, firstname, lastname) Monograph (title, name, author.id, monograph.title) Book (booktitle, price) Authority (country, authname) Author (address, id, firstname, lastname) Monograph (title, name) The Storage and Benchmarking of XML

  24. Step 5: Normalize the Relational Schema Prototypes • The last step. • Normalize the schema to 3NF (third normal form) if possible. • Structure mapping methods does not handle order but leave it to metadata or user to handle. The Storage and Benchmarking of XML

  25. X-Rel • Masatoshi Yoshikawa, Toshiyuki Amagasa, Takeyuki Shimura and Shunsuke Uemura @ Nara Institute of Science and Technology, Japan (2001) • Model-mapping • Data model: XPath (root node, element nodes, attribute nodes, and text nodes) • The concept of region The Storage and Benchmarking of XML

  26. Definition of Region • The region of: • An element node or a text node is a pair of numbers representing the start and end positions of the node in the XML document • An attribute node is a pair of identical numbers equal to the start position of the parent element node plus one The Storage and Benchmarking of XML

  27. Simple Path Expressions • Path – an unit of decomposition of XML trees • Store simple path expression (denoted by SimplePathExpr) from the root node. Why? • Path is appear in XML queries frequently The Storage and Benchmarking of XML

  28. Why “#” Is Added? • Look for family descendants of issue. • WHERE p1.pathexp LIKE ‘/issue%/family’ • WHERE p1.pathexp LIKE ‘#/issue#%/family’ • /issuelist/family (WRONG) is match for the first but not the second. The Storage and Benchmarking of XML

  29. Example XML Document <Paper Title = “The Suffix-Signature Method for Searching Phrases in Text”> <Authors> <FN> Mei </FN> <LN> Zhou </LN> <Affiliation> Open Text Corporation </Affiliation> </Authors> <Authors> <FN> Frank </FN> <LN> Tompa </LN> <Affiliation> University of Waterloo </Affiliation> </Authors> </Paper> The Storage and Benchmarking of XML

  30. XML Tree The Storage and Benchmarking of XML

  31. Simple Path Expressions /Regions • Node 3 • #/Paper#/@Title • (1,1) • Node 9 • #/Paper#/Authors#/affiliation • (99, 145) The Storage and Benchmarking of XML

  32. Mapping Idea • A relational table per node type • Simple path expression are normalized • docID is introduced • Basic XRel schema  Element (docID, pathID, start, end, index, reindex) Attribute (docID, pathID, start, end, value) Text (docID, pathID, start, end, value) Path (pathID, pathexp) The Storage and Benchmarking of XML

  33. docID pathID start end index reindex 1 1 0 257 1 1 1 3 66 155 1 2 1 4 75 86 1 2 1 5 87 99 1 2 1 6 100 145 1 2 1 3 156 249 2 1 1 4 165 178 2 1 1 5 179 192 2 1 1 6 193 239 2 1 Table - Element The Storage and Benchmarking of XML

  34. Table - Attribute The Storage and Benchmarking of XML

  35. Table - Text The Storage and Benchmarking of XML

  36. Table - Path The Storage and Benchmarking of XML

  37. XML Benchmarking Desiderata • Bulk loading • Reconstruction • Path traversals • Casting • Missing elements The Storage and Benchmarking of XML

  38. XML Benchmarking Desiderata (continued) • Ordered access • References • Joins • Construction of large results • Containment, full-text search The Storage and Benchmarking of XML

  39. XML Benchmarks • 3 XML benchmark proposals. • XMach-1. • University of Leipzig, Germany. • XML benchmark project. • CWI, the Netherlands. • Kanda et al. Proposal (unpublished). • University of Michigan, IBM Toronto lab center for advanced studies. The Storage and Benchmarking of XML

  40. Conclusion • From the different mapping approaches and experiments, there are a few places where relational database enhancement can help in coping with XML model differences. • Support for sets. • Flexible comparisons operators. • Multi-predicate merge join. The Storage and Benchmarking of XML

  41. Questions & Answers The Storage and Benchmarking of XML

  42. Appendix A Enhancing Structural Mappings Based on Statistics The Storage and Benchmarking of XML

  43. Optimal Hybrid Database Algorithm • M. Klettke, and H. Meyer • XML and object-relational database systems - enhancing structural mappings based on statistics (2000) • An algorithm that finds a type of optimal mapping based on the statistics and the DTD The Storage and Benchmarking of XML

  44. Optimal Hybrid Database Algorithm • Build a graph representing the hierarchy of the elements and attributes of the DTD. • For every element/attribute of the graph, a measure of significance, w, is determined. • Derive the resulting database design from the graph. The Storage and Benchmarking of XML

  45. The Storage and Benchmarking of XML

  46. Graph for an Example DTD The Storage and Benchmarking of XML

  47. Calculate the Weight (Step 2) • W = 1/6 (SQ + SA + SH) + ¼ (DA/DG) + ¼ (QA/QG) where • SQ - exploitation of quantifiers • SA - exploitation of alternatives • SH - position in the hierarchy • DA -number of documents containing the element/attribute • DG - absolute number of XML documents • QA - number of queries containing the element/attribute • QG - absolute number of queries The Storage and Benchmarking of XML

  48. The Graph With the Colored Weight The Storage and Benchmarking of XML

  49. Step 3 - Deriving Hybrid Databases From the Graph • First, specify a limit on which attributes and/or elements is represented as attributes of the databases and which attributes and/or element are represented as XML attributes The Storage and Benchmarking of XML

  50. Step 3 - Deriving Hybrid Databases From the Graph • Then, search for all nodes of the graph that satisfy the following conditions: • The node is not a leaf of the graph • The node and all its descendants are below the limit given • No predecessor that satisfies the first two conditions exists. The Storage and Benchmarking of XML

More Related