L09: Introduction to XML Data Management

L09: Introduction to XML Data Management XML and XML Query Languages Structural Summary and Coding Scheme Managing XML Data in Relational Systems

XML and XML Query Languages XML and XML Query Languages Structural Summary and Coding Scheme Managing XML Data in Relational Systems

XML • Extensible Markup Language for data • A W3C standard to complement HTML: http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0 Second Edition, October 2000) • Standard for publishing and interchange • Origins: structured text (SGML) • A “cleaner” SGML for the Internet • Motivation: • HTML describes presentation • XML describes content

XML – Describing the Content <project> <talk> <title> XML Query Processing &amp; Optimization </title> <date> March 18, 2004 </date> <instructor> <name> Lu Hongjun </name> <affiliation> HKUST </affiliation> <email> luhj@cs.ust.hk </email> <name> Jeffrey X. Yu </name> <affiliation> CUHK </affiliation> <email> yu@se.cuhk.edu.hk </email> </instructor> </talk> </project>

XML Document/Data • Hierarchical document format for information exchange in WWW • Self describing data (tags) • Nested element structures having a root • Element data can have • Attributes • Sub-elements

Basic XML Structures • Elements: <title>… </title>, <name>… </name> • Open and close tags, or an “empty tag” • Ordered, nestable • An element can be empty • Attributes • PCDATA/CDATA • An XML document has a single root element • Well-formed XML document: every open tag has a matching close tag and elements are properly nested

Basic XML Structures: Attributes • Single-valued, unordered <project proj_id="P1234" budget="1000000"> <title> XML Data Management </title> … <year> 2003-2004 </year> </project> • Special types: ID, IDREF, IDREFS • <member id="m007"> <name> James </name> </member> • <project id="p123"> <title> XML Data Management </title> <member idref="m007 m008"/> </project>

Other XML Structures • Processing instructions: instructions for applications <?xml version="1.0"?> • CDATA sections: treat content as character data <![CDATA[<tag>Whatever!!!</tag><whatever>]]> • Comments: just like HTML <!-- comment --> • Entities: external resources and macros • &my-entity; (non-parameter entity) • %param-entity; (parameter entity for DTD declarations)

Data Centric vs. Document Centric • Data centric: <project> <pname> XML </pname> <member ID="&3" age="50"> <name> H. Lu </name> <email> luhj@cs.ust.hk </email> <publication author="H. Lu"> <title> Managing XML data using RDBMS </title> <year> 2001 </year> </publication> … </member> <member ID="&24" age="35"> <name> J.X. Yu </name> <project> <pname> Data mining </pname> </project> </member> </project> • Document centric: <bio> Dr Lu is a professor at HKUST. He worked at NUS before 1998. </bio>

XML Data Model • Several competing models • Document Object Model (DOM) • a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents • http://www.w3.org/DOM/

DOM Core Interface: Node • DOM tree: a tree-like structure of Node objects; the root of the tree is a Document object. • Node object (nodeName, nodeValue, nodeType, parentNode, childNodes, firstChild, lastChild, previousSibling, nextSibling, attributes, ownerDocument) • nodeType: ELEMENT_NODE, ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_NODE, PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, DOCUMENT_FRAGMENT_NODE, NOTATION_NODE

DOM Interface • Each node of the document tree may have a number of child nodes, contained in a NodeList object. • Two ways of accessing a node object • Based on the location of an object in the document tree • Based on the name of an object
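The two access styles above can be tried with Python's standard `xml.dom.minidom` module. This is only a small sketch: the inline document is a trimmed, whitespace-free stand-in for the project sample, chosen so that `childNodes` contains no whitespace text nodes.

```python
from xml.dom import minidom

# Trimmed stand-in for the project document (no whitespace between tags).
doc = minidom.parseString(
    "<project><pname>XML</pname>"
    "<member><name>H. Lu</name><email>luhj@cs.ust.hk</email></member>"
    "<member><name>J.X. Yu</name></member></project>"
)
root = doc.documentElement  # the <project> element

# 1) Access by location: navigate with childNodes / firstChild.
first_member = root.childNodes[1]            # second child of <project>
name_text_node = first_member.firstChild.firstChild
by_location = name_text_node.nodeValue

# 2) Access by name: collect every element with a given tag name.
by_name = [n.firstChild.nodeValue for n in doc.getElementsByTagName("name")]

# nodeType distinguishes element nodes from text nodes.
node_types = (root.nodeType == root.ELEMENT_NODE,
              name_text_node.nodeType == root.TEXT_NODE)
```

Location-based access depends on the exact shape of the tree (including whitespace text nodes in a pretty-printed file), while name-based access does not.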

A Sample DOM Tree • [Figure: DOM tree of the sample project document. The root &1 (project) has a pname child (XML) and two member children, &3 and &24. Member &3 has name (H. Lu), email (luhj@cs.ust.hk) and publication (author, title “Managing …”, year 2001); member &24 has name (J.X. Yu) and project (pname “Data mining”).] • Callout, project node: nodeType = ELEMENT_NODE, tagName = “project”, nodeValue = null • Callout, name node: nodeType = TEXT_NODE, tagName = “name”, nodeValue = ‘H. Lu’ • Callout, publication node: nodeType = ELEMENT_NODE, tagName = “publication”, nodeValue = null

Data Graph • Similar to a DOM tree, but may use different notations to represent an XML document • [Figure: data graph of the sample project document, using the same node identifiers (&1 project, &3 and &24 member, …); the age and author attributes are also drawn in the graph.]

Document Type Definition • Inherited from SGML DTD standard • BNF grammar establishing constraints on element structure and content • Specification of attributes and their types • Definitions of entities

A Sample DTD <?xml version="1.0" standalone="yes"?> <!DOCTYPE Research > <!ELEMENT project (pname, member*, publication*)> <!ELEMENT pname (#PCDATA)> <!ELEMENT member (name, email?, publication*, project*)> <!ATTLIST member ID ID #REQUIRED> <!ELEMENT name (#PCDATA)> <!ELEMENT email (#PCDATA)> <!ELEMENT publication (title, year)> <!ATTLIST publication author IDREF #IMPLIED> <!ELEMENT title (#PCDATA)> <!ELEMENT year (#PCDATA)> • [Figure: DTD graph over project, pname, member, publication, author, ID, name, email, title and year, with edges marked * (zero or more) and ? (optional).]

XML Query Languages • A large number of proposals have been made during the past few years: • XPath [Clark, DeRose, W3C 1999] • XQuery [Boag, Chamberlin et al., W3C 2003] • XML-QL [Deutsch, Fernandez et al., QL99] • XQL [Robie, Lapp, QL99] • XML-GL [Ceri, Comai et al., WWW99] • Quilt [Chamberlin, Robie et al., 2000] • From W3C • XQuery 1.0 (W3C Working Draft, 12 November 2003) • http://www.w3.org/TR/xquery/ • XPath 2.0 (W3C Working Draft, 12 November 2003) • http://www.w3.org/TR/xpath20/

XPath: XML Path Language • The purpose • To address the nodes of an XML tree, using a path notation to navigate the hierarchical structure of an XML document. • Uses a compact, non-XML syntax • Designed to be embedded in a host language (e.g., XSLT, XQuery) • XPath expressions • A string of characters • The value of an expression is always an ordered collection of zero or more items (atomic values or nodes)

XPath: Steps • An XPath expression has the following syntax: Path ::= /Step1/Step2/…/Stepn, where each XPath step is defined as: Step ::= Axis::Node-test Predicate* • Axis specifies the “direction” in which the document should be navigated. For example, child::title[position() = 2] • There are 13 axes: child, descendant, descendant-or-self, parent, ancestor, ancestor-or-self, following, preceding, following-sibling, preceding-sibling, attribute, self, namespace

XPath Path Expressions • project: matches a project element • *: matches any element • /: matches the root element • /project: matches a project element under the root • project/member: matches a member in project • project//name: matches a name in project, at any depth • //title: matches a title at any depth • member|publication: matches a member or a publication • @age: matches an age attribute • project/member/@age: matches the age attribute of a member in project • project/member[@age < 45]: matches a member with age < 45

XPath Query Examples • /project/member/name: matches a name of a member in project. Result: <name> H. Lu </name> <name> J.X. Yu </name> • /project/member/name/text(): text of name elements. Result: H. Lu J.X. Yu • /project/publication/venue. Result: empty, since there is no venue element • //pname: matches a pname at any depth. Result: <pname> XML </pname> <pname> Data mining </pname>

More XPath Queries • /project/member[publication] returns <member ID="&3" age="50"> <name> H. Lu </name> <email> luhj@cs.ust.hk </email> <publication author="H. Lu"> <title> Managing XML data using RDBMS </title> <year> 2001 </year> </publication> </member> • /project/member[@age < 45] returns <member ID="&24" age="35"> <name> J.X. Yu </name> <project> <pname> Data mining </pname> </project> </member> • /project[member/@age < 25] returns no element • /project/member[email/text()] returns the member that has a non-empty email (the member with email luhj@cs.ust.hk)
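Several of the queries above can be reproduced with Python's standard `xml.etree.ElementTree`, which implements only a subset of XPath. In this sketch the IDs `m3` and `m24` are assumed stand-ins for &3 and &24, and the numeric comparison on `@age` is done in Python because ElementTree predicates do not support `<`.

```python
import xml.etree.ElementTree as ET

XML = """
<project>
  <pname>XML</pname>
  <member ID="m3" age="50">
    <name>H. Lu</name>
    <email>luhj@cs.ust.hk</email>
    <publication><title>Managing XML data using RDBMS</title><year>2001</year></publication>
  </member>
  <member ID="m24" age="35">
    <name>J.X. Yu</name>
    <project><pname>Data mining</pname></project>
  </member>
</project>
"""
root = ET.fromstring(XML)  # root is the outer <project> element

# /project/member/name/text()
names = [n.text for n in root.findall("./member/name")]

# //pname (pname at any depth, in document order)
pnames = [p.text for p in root.iter("pname")]

# /project/member[publication]
with_pub = [m.get("ID") for m in root.findall("./member[publication]")]

# /project/member[@age < 45]: the comparison is applied in Python
under_45 = [m.get("ID") for m in root.findall("./member")
            if int(m.get("age")) < 45]

# /project/publication/venue: no such element, so the result is empty
venues = root.findall("./publication/venue")
```

A full XPath engine would evaluate the comparison predicates directly; the Python filter is only a workaround for the limited XPath subset in the standard library.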

XQuery • XQuery 1.0: An XML Query Language • W3C Working Draft 12 November 2003 • http://www.w3.org/TR/xquery/ • XPath expressions are still the basic building block

FOR/LET Clauses • FLWR expressions: FOR-LET-WHERE-RETURN • FOR $x IN expr • binds $x to each value in the list expr • LET $x := expr • binds $x to the entire list expr • Useful for common subexpressions and for aggregations • Evaluation pipeline: FOR/LET clauses → ordered list of tuples of bound variables → WHERE clause → pruned list of tuples of bound variables → RETURN clause → instance of the XML Query data model

XQuery Examples • <result> { FOR $x IN /project/member/publication WHERE $x/year > 2000 RETURN <recentpub> { $x/title } </recentpub> } </result> • <active_members> { FOR $m IN distinct(document("project.xml")//member) LET $p := document("project.xml")//publication[author = $m] WHERE count($p) > 10 RETURN $m } </active_members> • distinct = a function that eliminates duplicates • count = an (aggregate) function that returns the number of elements
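The FOR, WHERE and RETURN stages of the first query can be imitated in plain Python to make the pipeline concrete. This is an analogy, not an XQuery engine; the function name `recent_publications` and the sample titles A, B, C are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: three publications spread over two members.
root = ET.fromstring(
    "<project>"
    "<member><publication><title>A</title><year>1999</year></publication>"
    "<publication><title>B</title><year>2001</year></publication></member>"
    "<member><publication><title>C</title><year>2003</year></publication></member>"
    "</project>"
)

def recent_publications(project, after_year):
    result = ET.Element("result")
    # FOR $x IN /project/member/publication  (one binding per publication)
    for pub in project.findall("./member/publication"):
        # WHERE $x/year > after_year  (prune the bindings)
        if int(pub.findtext("year")) > after_year:
            # RETURN <recentpub>{ $x/title }</recentpub>
            recent = ET.SubElement(result, "recentpub")
            recent.append(pub.find("title"))
    return result

out = ET.tostring(recent_publications(root, 2000), encoding="unicode")
```

The loop produces the ordered list of bindings, the `if` prunes it, and the element construction plays the role of the RETURN clause.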

Structural Summary and Coding Scheme XML and XML Query Languages Structural Summary and Coding Scheme Managing XML Data in Relational Systems

Structural Summary • A structural summary for a data graph $G_D(V_D, E_D)$ is another labeled graph $G_I(V_I, E_I)$. • Each node $v_i \in V_I$ represents a set of data nodes, $extent(v_i)$, with $extent(v_i) \subseteq V_D$. • An edge $(v_i, v_i') \in E_I$ exists if there is an edge $(v_d, v_d') \in E_D$ with $v_d \in extent(v_i)$ and $v_d' \in extent(v_i')$. • The summary preserves all the paths in the data graph. A path expression query can be executed on $G_I$ instead of $G_D$, which is usually more efficient because $G_I$ is much smaller than $G_D$.

Structural Summary • Basically, nodes in the data graph are grouped based on certain criteria, and each group of nodes is represented by one node in the summary. • The size of the summary is determined by the grouping criteria. • Desired properties when evaluating path expression queries on the summary: • The results are safe: they contain no false negatives • If not safe, only approximate answers can be obtained • The results are precise: they contain no false positives • If not precise, the results must be validated against the data graph

Structural Summary • [Figure: a data graph with nodes r, a1, a2, a3, b1, b2, b3, c1, c2, c3, and its structural summary with index nodes R {r}, A {a1,a2,a3}, B {b1,b2,b3}, C {c1,c2,c3}.]
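A minimal sketch of the simplest grouping criterion, one index node per label, following the definition of extent and summary edges given earlier. The edge set of the data graph is not recoverable from the slide, so the edges below are an assumption; `build_summary` and `eval_path` are illustrative names.

```python
from collections import defaultdict

def build_summary(labels, edges):
    """labels: data node -> label; edges: iterable of (parent, child) data edges.
    Returns (extent, summary_edges) for a summary that groups nodes by label."""
    extent = defaultdict(set)
    for node, label in labels.items():
        extent[label].add(node)
    # A summary edge exists whenever some data edge connects the two extents.
    summary_edges = {(labels[u], labels[v]) for u, v in edges}
    return dict(extent), summary_edges

def eval_path(extent, summary_edges, path):
    """Evaluate a simple label path on the summary.
    Returns the extent of the last label if every step exists in the summary."""
    for parent, child in zip(path, path[1:]):
        if (parent, child) not in summary_edges:
            return set()
    return set(extent.get(path[-1], set()))

# Assumed data graph: r -> a_i -> b_i -> c_i for i = 1..3.
labels = {"r": "R"}
edges = []
for i in "123":
    labels.update({"a" + i: "A", "b" + i: "B", "c" + i: "C"})
    edges += [("r", "a" + i), ("a" + i, "b" + i), ("b" + i, "c" + i)]

extent, summary_edges = build_summary(labels, edges)
answer = eval_path(extent, summary_edges, ["R", "A", "B", "C"])
```

Grouping by label alone keeps every path of the data graph, so answers are safe, but on less regular graphs it can introduce paths that no data node has, so answers are not precise in general.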

Sample Structural Summaries • Query workload independent summaries • DataGuide • 1-index [Milo, Suciu, ICDT99] • A(k)-index [Kaushik, Shenoy et al., ICDE02] • Query workload dependent summaries • APEX [Chung, Min et al., SIGMOD02] • D(k)-index [Chen, Lim et al., SIGMOD03]

Data Guides • DataGuide: dynamic structural summary of the current database • Each label path in the database appears exactly once in the DataGuide • No extraneous paths in the DataGuide • Maintained incrementally as the database evolves • Serves the role of a schema • [Figure note: C1 is duplicated to achieve determinism in DataGuides]
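For tree-shaped data, the property that each label path appears exactly once can be illustrated by collecting the distinct root-to-node label paths together with the nodes they reach. This is a simplified sketch for trees only (a DataGuide over a general graph needs a determinization step); `label_paths` is an illustrative name.

```python
import xml.etree.ElementTree as ET

def label_paths(root):
    """Map each distinct label path (a tuple of tags) to the elements it reaches."""
    guide = {}
    def visit(elem, path):
        path = path + (elem.tag,)
        guide.setdefault(path, []).append(elem)
        for child in elem:
            visit(child, path)
    visit(root, ())
    return guide

root = ET.fromstring(
    "<project><pname>XML</pname>"
    "<member><name>H. Lu</name><email>luhj@cs.ust.hk</email></member>"
    "<member><name>J.X. Yu</name></member></project>"
)
guide = label_paths(root)
```

Here the two member elements share the single path entry `("project", "member")`, which is what keeps the summary small.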

Bisimilarity and 1-Index • Most existing structural summaries are based on graph bisimilarity, defined as follows: • Two data nodes u and v are bisimilar ($u \approx v$) if • u and v have the same label; • if u’ is a parent of u, then there is a parent v’ of v such that $u' \approx v'$, and vice versa; • Intuitively, if two nodes are bisimilar, the sets of paths coming into them are the same • Tova Milo and Dan Suciu. Index structures for path expressions. In ICDT ’99, 277-295, January 1999.

1-Index • 1-index: each index node represents an equivalence class in which data nodes are mutually bisimilar. • Evaluating path expression queries using the 1-index • safe: the result always contains the result of evaluating on the data graph; • precise: the result contains no false data nodes;

K-bisimilarity • The 1-index can be big • A coarser grouping is based on the notion of k-bisimilarity ($\approx^k$), which is defined inductively: • For any two nodes u and v, $u \approx^0 v$ iff u and v have the same label; • $u \approx^k v$ iff $u \approx^{k-1} v$ and, for every parent u’ of u, there is a parent v’ of v such that $u' \approx^{k-1} v'$, and vice versa; • Intuitively, if two data nodes are k-bisimilar, the sets of paths of length $\le k$ coming into them are the same
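The inductive definition translates directly into iterative partition refinement: start from labels and, in each of k rounds, split classes by the set of classes of the parents. The sketch below is a straightforward, unoptimized version; the small graph and the name `k_bisimulation_classes` are assumptions made for illustration.

```python
def k_bisimulation_classes(labels, parents, k):
    """labels: node -> label; parents: node -> iterable of parent nodes.
    Returns node -> class id; two nodes share an id iff they are k-bisimilar."""
    block = dict(labels)                      # round 0: same label
    for _ in range(k):
        ids = {}
        new_block = {}
        for n in labels:
            # Same previous class, and the same set of previous parent classes.
            sig = (block[n], frozenset(block[p] for p in parents.get(n, ())))
            new_block[n] = ids.setdefault(sig, len(ids))
        block = new_block
    return block

# Assumed graph: r -> a -> d1 -> e1 and r -> b -> d2 -> e2.
labels = {"r": "R", "a": "A", "b": "B", "d1": "D", "d2": "D", "e1": "E", "e2": "E"}
parents = {"a": ["r"], "b": ["r"], "d1": ["a"], "d2": ["b"], "e1": ["d1"], "e2": ["d2"]}

c1 = k_bisimulation_classes(labels, parents, 1)
c2 = k_bisimulation_classes(labels, parents, 2)
```

With k = 1 the nodes e1 and e2 fall into one class because both have a D parent; with k = 2 they are separated because their incoming paths of length two (A/D versus B/D) differ.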

A(k)-Index • A(k)-index: groups nodes based on their local structure, i.e. incoming paths of length up to k, instead of global path information • Data nodes in each index node of the A(k)-index are mutually k-bisimilar; • Evaluating path expression queries using the A(k)-index: • safe: the result always contains the result of evaluating on the data graph; • precise only for path expressions of length at most k; for longer paths the result may contain false data nodes and must be validated against the data graph; • Raghav Kaushik, Pradeep Shenoy, Philip Bohannon and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. ICDE ’02, 129-140.

A(2)-Index C2 and C3 can be grouped because their length-2 incoming paths are the same

APEX: Adaptive Path Index • 1-index, A(k)-index and F&B index are all workload independent • APEX: Adaptive Path indEX • Maintains two types of paths in the summary: • All paths of length two, so that all queries can be answered using APEX • Full paths for those paths that appear frequently in the query workload, so that frequently asked queries can be answered efficiently • A hash table is included in the index so that partial matching queries with the descendant-or-self axis (//) can be processed efficiently • C.-W. Chung, J.-K. Min, K. Shim. APEX: An Adaptive Path Index for XML Data. SIGMOD 02.

D(k)-Index • A generalization of the 1-index and the A(k)-index. • Assigns different local bisimilarities to index nodes in the summary structure according to the query load, in order to optimize its structure. • For any two index nodes $n_i$ and $n_j$, $k(n_i) \ge k(n_j) - 1$ if there is an edge from $n_i$ to $n_j$, where $k(n_i)$ and $k(n_j)$ are the local bisimilarities of $n_i$ and $n_j$, respectively. • Advantages over the 1-index and the A(k)-index • workload-sensitive; • can be updated more efficiently • Qun Chen, Andrew Lim and Kian Win Ong. D(k)-index: An adaptive structural summary for graph-structured data. SIGMOD 03, 134-144.

Node (Edge) Encoding • Structural relationships • Is node u an ancestor of node v? • Is node u the parent of node v? • Assign a unique code to each node (edge) in the data graph so that the above questions can be answered by looking at the codes rather than the original data graph. • Issues: • Length of the code. • Complexity of computing the structural relationship between two nodes from their codes. • Efficient code generation and code maintenance.

XML Data Coding Scheme • Region-based • An XML document is ordered • Codes are assigned based on the lexicographical location of an element in the original document • Path-based • An XML document is nested • Codes are assigned based on the nesting structure of the document, or the path that reaches an element from the root. • There are quite a number of variants in both categories of coding schemes

XML Region Based Coding • Region code: (start, end, level) • u is an ancestor of v iff u.start < v.start < u.end • u is the parent of v iff, additionally, u.level = v.level − 1 • Codes are generated in a single depth-first traversal • Property: strict nesting • Two regions are either completely disjoint (cases 1, 4) or one contains the other (cases 2, 3) • Formally, a.start < b.start < b.end < a.end if a is an ancestor of b

Sample of Region Codes • The order of start values is also the document order • The region can also be interpreted as an interval
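A minimal sketch of region code generation: one depth-first traversal with a shared counter that is incremented once when an element is entered and once when it is left. The counter-based numbering is one common choice and an assumption here (other variants use byte offsets or word positions); the contact document is the one used for the Dewey example.

```python
import xml.etree.ElementTree as ET

def region_codes(root):
    """Assign (start, end, level) to every element in one depth-first pass."""
    codes = {}
    counter = 0
    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter                 # number assigned on entering the element
        for child in elem:
            visit(child, level + 1)
        counter += 1                    # number assigned on leaving the element
        codes[elem] = (start, counter, level)
    visit(root, 1)
    return codes

def is_ancestor(u, v):
    return u[0] < v[0] < u[1]           # u.start < v.start < u.end

def is_parent(u, v):
    return is_ancestor(u, v) and u[2] == v[2] - 1

root = ET.fromstring(
    "<contact><name>blah</name><phone><office>1234</office>"
    "<home>5678</home><mobile>0000</mobile></phone></contact>"
)
codes = region_codes(root)
contact, name, phone = codes[root], codes[root.find("name")], codes[root.find("phone")]
office = codes[root.find("phone/office")]
```

Both structural tests use only the codes, so they run in constant time without touching the document.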

Dewey • <contact> <name>blah</name> <phone> <office>1234</office> <home>5678</home> <mobile>0000</mobile> </phone> </contact> • Dewey codes: contact 1; name 1.1 (text blah 1.1.1); phone 1.2; office 1.2.1 (text 1234 1.2.1.1); home 1.2.2 (text 5678 1.2.2.1); mobile 1.2.3 (text 0000 1.2.3.1) • a is an ancestor of d iff a.Dewey is a proper prefix of d.Dewey • Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. SIGMOD 2002.
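A minimal sketch of Dewey code assignment for the contact document. Codes are kept as tuples of integers rather than dotted strings, so that the prefix test compares whole components (a string test would wrongly treat 1.1 as a prefix of 1.10). Only elements are numbered here, text nodes are skipped, and keying the result by tag assumes tags are unique in this sample.

```python
import xml.etree.ElementTree as ET

def dewey_codes(root):
    """Map each element tag to its Dewey code (assumes unique tags)."""
    codes = {}
    def visit(elem, code):
        codes[elem.tag] = code
        # The i-th child extends the parent's code with component i.
        for i, child in enumerate(elem, start=1):
            visit(child, code + (i,))
    visit(root, (1,))
    return codes

def is_ancestor(a, d):
    """a is an ancestor of d iff a's code is a proper prefix of d's code."""
    return len(a) < len(d) and d[:len(a)] == a

root = ET.fromstring(
    "<contact><name>blah</name><phone><office>1234</office>"
    "<home>5678</home><mobile>0000</mobile></phone></contact>"
)
codes = dewey_codes(root)
dotted = {tag: ".".join(map(str, code)) for tag, code in codes.items()}
```

Unlike region codes, a Dewey code also encodes the full path from the root, so the codes of all ancestors can be derived from it directly.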

Managing XML Data in Relational Systems XML and XML Query Languages Structural Summary and Coding Scheme Managing XML Data in Relational Systems

XML-Enabled DB Systems • IBM DB2 XML Extender • XML column support, XML collection, files linked from the DBMS, or Character Large Objects (CLOBs). • Side tables serve as XML indexes • Oracle 9i • CLOB, OracleText Cartridge, XMLType, and XML SQL Utility • Microsoft SQL Server • CLOBs, generic Edge technique and user-defined decomposition (from XML to tables), XML views.

Storing XML Data in RDBMSs • RDBMS: a mature technology • RDBMSs are widely available • Less investment needed to adopt the new technology • Easy to integrate with other existing applications • Impedance mismatch • Two-level nature of relational schema (tuples and attributes) vs. arbitrary nesting of XML DTD • Flat structure vs. recursion • Structure-based and content-based queries

XQuery vs SQL: Different Cultures • Data characteristics • Relational data: regular, homogeneous, flat in structure, and no order among tuples. • XML data: irregular, heterogeneous, unpredictable structure, order sensitive. • Query languages • SQL: • Select-from-where • With the capability to support some fixpoint operations • XQuery: • FLWOR (pronounced “flower”): For-let-where-order by-return • Simple/regular path expressions

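Storing XML Data in RDBMSs: Architecture • [Figure: XML documents and their DTD go through automatic schema/data mapping into a relational schema and tuples stored in a commercial RDBMS; an XML query is mapped to an SQL query, and the relational result is turned back into an XML result.]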

Storing XML Data in RDBMSs: Issues • Schema/Data mapping: • Automate storage of XML in RDBMS • Query mapping: • Provide XML views of relational sources • Result construction: • Export existing data as XML

XML-Relational Mapping • Model mapping • Database schemas represent constructs of the XML document model. • DTD independent [Florescu & Kossmann 99, Yoshikawa et al. TOIT 01] • Structure mapping • Database schemas represent the logical structure of target XML documents • DTD dependent [Shanmugasundaram et al. VLDB 99]
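To make model mapping concrete, here is a simplified, DTD-independent sketch that shreds any XML document into a single edge table and answers a path query with self-joins. It is loosely modeled on the generic edge technique; the table layout (`source`, `ordinal`, `name`, `target`, `value`) and the function name `shred` are assumptions for illustration, and attributes are ignored.

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred(root, conn):
    """Store every parent-child edge of the document as one row."""
    conn.execute("CREATE TABLE edge (source INTEGER, ordinal INTEGER, "
                 "name TEXT, target INTEGER, value TEXT)")
    next_id = 0
    def visit(elem, parent_id, ordinal):
        nonlocal next_id
        next_id += 1
        node_id = next_id
        text = (elem.text or "").strip() or None   # leaf text, if any
        conn.execute("INSERT INTO edge VALUES (?, ?, ?, ?, ?)",
                     (parent_id, ordinal, elem.tag, node_id, text))
        for i, child in enumerate(elem, start=1):
            visit(child, node_id, i)
    visit(root, 0, 1)                              # 0 = virtual document root

root = ET.fromstring(
    "<project><pname>XML</pname>"
    "<member><name>H. Lu</name></member>"
    "<member><name>J.X. Yu</name></member></project>"
)
conn = sqlite3.connect(":memory:")
shred(root, conn)

# /project/member/name/text(): one self-join per step of the path.
rows = conn.execute("""
    SELECT n.value
    FROM edge p
    JOIN edge m ON m.source = p.target
    JOIN edge n ON n.source = m.target
    WHERE p.source = 0 AND p.name = 'project'
      AND m.name = 'member' AND n.name = 'name'
    ORDER BY m.ordinal, n.ordinal
""").fetchall()
names = [r[0] for r in rows]
```

The schema never changes with the document, which is the appeal of model mapping; the cost is one join per path step, which is where the coding schemes and structural summaries discussed earlier become useful.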
