Chapter 29 Semistructured Data and XML Transparencies
Chapter - Objectives • What semistructured data is. • Concepts of the Object Exchange Model (OEM), a model for semistructured data. • Basics of Lore, a semistructured DBMS, and its query language, Lorel. • Main language elements of XML. • Difference between well-formed and valid XML documents. • How Document Type Definitions (DTDs) can be used to define the valid syntax of an XML document.
Chapter - Objectives • How Document Object Model (DOM) compares with OEM. • About other related XML technologies. • Limitations of DTDs and how the W3C XML Schema overcomes these limitations. • How RDF and RDF Schema provide a foundation for processing meta-data.
DTD: XML Names and NMTOKEN • Name characters are letters, digits, hyphens, underscores, colons and full stops. • An NMTOKEN is any sequence of one or more name characters. • NMTOKENS is a list of NMTOKENs separated by white space (space, tab, newline, etc.). • Case is significant: PERSON and person are distinct names. • Attribute and element names are NMTOKENs with additional restrictions: • Names cannot begin with a digit, a hyphen or a full stop. • Names cannot begin with xml (or any variant obtained by changing case); this prefix is reserved for the XML standards themselves.
Element Declarations: EMPTY • The keyword ELEMENT introduces a new element: <!ELEMENT NAME CONTENT_MODEL> • The element name must begin with a letter (or an underscore or colon) and may additionally contain digits and some punctuation, i.e. '.', '-', '_' and ':', as described earlier under NMTOKEN. • If an element can hold no child elements and no text, it is known as an empty element and its CONTENT_MODEL is EMPTY. • This seems trivial but it is not, because the presence or absence of the element in an XML file can be used as a flag. • HTML offers several examples, such as HR and IMG, which never have children and include no text. Here we would write <!ELEMENT HR EMPTY>, and then <HR/> or <HR></HR> generates a horizontal line. • EMPTY elements can have attributes, such as the SRC attribute of <IMG/>, which specifies the source of the image.
Element Declarations: ANY • An element declared with a content model of ANY may contain any of the other elements declared in the DTD, as well as text. • This is not quite the same as having no DTD for the file: <!DOCTYPE fred [<!ELEMENT fred ANY >]> <fred> <people>Me and You</people> <people>Them</people></fred> • A validating parser reports an error because the <people> element is not declared. • Adding <!ELEMENT people ANY > inside the DTD declaration produces a valid document.
Entities • The DTD of an XML document can contain entity declarations. These are like macro substitutions in other languages. • Entities are defined in the DTD and come in several flavors: • General entities are referenced as &EntName; • Parameter entities are referenced as %EntName; • We have already seen the character entities: • &amp; for & • &apos; for ' • &gt; for > • &lt; for < • &quot; for " • These are built in, but you can add other such entities, e.g. <!ENTITY aitself "A" >, after which &aitself; is replaced by A.
General Entities • As another example, we can declare in the DTD <!ENTITY TODAY "May 12 2003" > and then <comment>&TODAY; was very quiet in Irvine</comment> is parsed as <comment>May 12 2003 was very quiet in Irvine</comment> • General entity references can be nested inside a DTD, e.g. one can write <!ENTITY YEAR "2003" > <!ENTITY TODAY "May 12 &YEAR;" > • However, one must use parameter entities, not general entities, for macro substitution in other DTD declarations such as <!ATTLIST and <!ELEMENT. • Parameter entities are declared as in <!ENTITY % CUSTARDTAGS "(NAME,DATE,ORDERS)" >
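The nested TODAY/YEAR substitution described above can be checked with the DOM parser that ships with the JDK. This is only a minimal sketch: the class name EntityExpansionDemo and the helper expandedText are illustrative names, not part of the original slides, and the declaration <!ELEMENT comment (#PCDATA)> is added so that the internal DTD is complete.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityExpansionDemo {
    // Internal DTD subset declaring two general entities; TODAY nests YEAR.
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE comment [\n"
        + "  <!ELEMENT comment (#PCDATA)>\n"
        + "  <!ENTITY YEAR \"2003\">\n"
        + "  <!ENTITY TODAY \"May 12 &YEAR;\">\n"
        + "]>\n"
        + "<comment>&TODAY; was very quiet in Irvine</comment>";

    // Parses the document and returns the text of the root element
    // after the parser has replaced the entity references.
    static String expandedText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(expandedText(XML));
    }
}
```

The application never sees &TODAY; or &YEAR;, only the replacement text, which is why entities behave like macros.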
Parameter Entities • <!ENTITY % peopletags "(firstname,lastname,dateofbirth)" > <!ELEMENT student %peopletags; > <!ELEMENT teacher %peopletags; > <!ELEMENT administrator %peopletags; > • This gives several people elements the same child elements. Note the space required between % and the entity name in the declaration, and that references placed inside other declarations, as here, are only allowed in the external DTD subset. • Parameter entities are even more commonly used for attributes, because several elements almost always share the same attributes (often with a basic set augmented in different ways for different elements). • This basic set can be defined in a parameter entity.
Defining Implied Attributes • Attributes must be declared in the DTD before they can be used. • "Implied" means that the attribute is optional and has no default value. • <!ELEMENT population (#PCDATA)> <!ATTLIST population year CDATA #IMPLIED> • The attribute year may be present or absent in the element population. Valid examples: • <population year="2000">80</population> • <population>80</population>
Defining Required Attributes • <!ELEMENT population (#PCDATA)> <!ATTLIST population year CDATA #REQUIRED> • The population element must carry a year attribute: <population year="1996">80</population> • <!ELEMENT population (#PCDATA)> <!ATTLIST population year (2000|2001) #REQUIRED> • The population element must carry a year attribute whose value is 2000 or 2001: <population year="2000">80</population> • The enumeration values are written without quotes.
Defining Default Attributes • <!ELEMENT population (#PCDATA)> <!ATTLIST population year CDATA "2000"> • All these are valid: • <population year="2001">80</population> • <population year="2000">80</population> • <population>80</population>
Defining Fixed Attributes • <!ELEMENT population (#PCDATA)> <!ATTLIST population year CDATA #FIXED "2000"> • Invalid: <population year="2001">80</population> • Valid: <population year="2000">80</population> • Valid: <population>80</population>
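The default and #FIXED rules above can be tried out with a validating DOM parser. The following is a sketch under stated assumptions: AttributeDefaultsDemo and yearSeenByApplication are illustrative names, and the method simply returns null whenever the parser reports a validity error.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class AttributeDefaultsDemo {
    static final String DEFAULT_DTD =
          "<!DOCTYPE population [<!ELEMENT population (#PCDATA)>"
        + "<!ATTLIST population year CDATA \"2000\">]>";
    static final String FIXED_DTD =
          "<!DOCTYPE population [<!ELEMENT population (#PCDATA)>"
        + "<!ATTLIST population year CDATA #FIXED \"2000\">]>";

    // Returns the year attribute as the application sees it after parsing,
    // or null if the validating parser reported a validity error.
    static String yearSeenByApplication(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { problems.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return problems.isEmpty()
                ? doc.getDocumentElement().getAttribute("year")
                : null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Attribute omitted: the parser supplies the declared default.
        System.out.println(yearSeenByApplication(DEFAULT_DTD + "<population>80</population>"));
        // #FIXED violated: reported as a validity error, so null is printed.
        System.out.println(yearSeenByApplication(FIXED_DTD + "<population year=\"2001\">80</population>"));
    }
}
```

Note that when the attribute is omitted the parser fills in the declared value, so the application cannot tell a defaulted attribute from an explicit one just by reading it.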
Defining Unique Attributes • <!ELEMENT animal (name)> <!ATTLIST animal code ID #REQUIRED> • The value of the code attribute must be unique within the XML document. • <animal code="T50"><name>Lion</name> </animal> <animal code="T51"><name>Rabbit</name> </animal>
Referring to Unique Attributes • <!ELEMENT website (url)> <!ATTLIST website animal_refer IDREF #REQUIRED> • The value of the animal_refer attribute must match the value of an ID attribute that appears somewhere in the document. • <website animal_refer="T50"> <url>http://www.lions.com</url> </website>
Referring to Multiple Unique Attributes • <!ELEMENT website (url)> <!ATTLIST website contents IDREFS #REQUIRED> • The contents attribute holds a white-space-separated list of ID values. • <website contents="T50 T51"> <url>http://www.animals.com</url> </website>
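A validating parser enforces both halves of the ID/IDREF contract: ID values must be unique, and every IDREF must point at an existing ID. The sketch below assumes a hypothetical root element zoo (the slides show no root for the animal and website elements); IdRefCheckDemo and validityErrors are likewise illustrative names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class IdRefCheckDemo {
    // Internal DTD; the root element "zoo" is an assumption made for this sketch.
    static final String DTD =
          "<!DOCTYPE zoo ["
        + "<!ELEMENT zoo (animal | website)*>"
        + "<!ELEMENT animal (name)>"
        + "<!ATTLIST animal code ID #REQUIRED>"
        + "<!ELEMENT name (#PCDATA)>"
        + "<!ELEMENT website (url)>"
        + "<!ATTLIST website animal_refer IDREF #REQUIRED>"
        + "<!ELEMENT url (#PCDATA)>"
        + "]>";

    // Validates DTD + body and returns the validity errors the parser reported.
    static List<String> validityErrors(String body) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { problems.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return problems;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String lion = "<animal code=\"T50\"><name>Lion</name></animal>";
        String dangling = "<website animal_refer=\"T99\"><url>http://www.lions.com</url></website>";
        // T99 is never declared as an ID, so at least one error is printed.
        System.out.println(validityErrors("<zoo>" + lion + dangling + "</zoo>"));
    }
}
```

The parser only checks that the referenced ID exists somewhere in the document; it does not check which kind of element carries it.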
XML Example - the DTD <!ELEMENT addressBook (person)+> <!ELEMENT person (name, email*, link?)> <!ATTLIST person id ID #REQUIRED> <!ATTLIST person gender (male|female) #IMPLIED> <!ELEMENT name (#PCDATA|family|given)*> <!ELEMENT family (#PCDATA)> <!ELEMENT given (#PCDATA)> <!ELEMENT email (#PCDATA)> <!ELEMENT link EMPTY> <!ATTLIST link manager IDREF #IMPLIED subordinates IDREFS #IMPLIED>
DOCTYPE declarations • Internal: local definition of DTD • External: to an external file • Can combine both
Internal DTD <?xml version="1.0" standalone="yes" ?> <!--open the DOCTYPE declaration - the open square bracket indicates an internal DTD--> <!DOCTYPE foo [ <!--define the internal DTD--> <!ELEMENT foo (#PCDATA)> <!--close the DOCTYPE declaration--> ]> <foo>Hello World.</foo>
Internal DTD: rules • The document type declaration must be placed between the XML declaration and the first element (root element) in the document. • The keyword DOCTYPE must be followed by the name of the root element in the XML document. • The keyword DOCTYPE must be in upper case.
External DTD • Useful for creating a common DTD that can be shared between multiple documents. • Any change made to the external DTD automatically applies to all the documents that reference it. • Two types: private and public. • Rules: • If the XML document uses any elements, attributes or entities that are referenced or defined in an external DTD, standalone="no" should be included in the XML declaration (this is also the value assumed when standalone is omitted).
"Private" External DTDs • Identified by the keyword SYSTEM • Intended for use by a single author or group of authors. • Example: <!DOCTYPE root_element SYSTEM "DTD_location"> where: DTD_location is relative or absolute URL (such as “http:/” and “file:/”).
"Private" External DTDs (cont) XML document: <?xml version="1.0" standalone="no" ?> <!DOCTYPE document SYSTEM "subjects.dtd"> <document> … </document> subjects.dtd: <!ELEMENT document …> …
"Public" External DTDs • Identified by the keyword PUBLIC • Intended for broad use. <!DOCTYPE root_element PUBLIC "DTD_name" "DTD_location"> where: • DTD_location: relative or absolute URL • DTD_name: follows the syntax "prefix//owner_of_the_DTD//description_of_the_DTD//ISO_639_language_identifier" • "DTD_location" is used to find the public DTD if it cannot be located by the "DTD_name".
"Public" External DTDs (cont) <?xml version="1.0" standalone="no" ?> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> <HTML> <HEAD> <TITLE>A typical HTML file</TITLE> </HEAD> <BODY> … </BODY> </HTML>
"Public" External DTDs (cont) Valid DTD_name prefixes: • ISO: the DTD is an ISO standard. All ISO standards are approved. • +: the DTD is an approved non-ISO standard. • -: the DTD is an unapproved non-ISO standard.
Combining Internal and External DTDs • A document can use both internal and external DTD subsets. • The internal DTD subset is specified between the square brackets of the DOCTYPE declaration. • The declaration for the external DTD subset is placed before the square brackets immediately after the SYSTEM keyword. • Declaring an ELEMENT with the same name in both the internal and external DTD subsets is invalid
Example <?xml version="1.0" standalone="no" ?> <!DOCTYPE document SYSTEM "subjects.dtd" [ <!ATTLIST assessment assessment_type (exam | assignment | prac) #IMPLIED> <!ELEMENT results (#PCDATA)> ]> subjects.dtd: <!ELEMENT document (title*,subjectID,subjectname,prerequisite?, classes,assessment,syllabus,textbooks*)> <!ELEMENT prerequisite (subjectID,subjectname)> …
DTD Validation • An XML document can be well-formed yet invalid under the rules of its DTD. • E.g. DTD rule: <!ELEMENT name (#PCDATA)> • Acceptable: <name> Giancarlo Succi </name> • Unacceptable: <name> <first_name> Giancarlo </first_name> <last_name> Succi </last_name> </name>
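The difference between well-formed and valid can be made concrete by parsing the same text twice, once with DTD validation switched off and once with it switched on. This is a minimal sketch; WellFormedVsValidDemo and parsesCleanly are illustrative names, and the internal DOCTYPE wrapper around the slide's DTD rule is an assumption made to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class WellFormedVsValidDemo {
    static final String DOCTYPE = "<!DOCTYPE name [<!ELEMENT name (#PCDATA)>]>";

    // validating == false: true means the text is well-formed.
    // validating == true:  true means it is well-formed AND valid against its DTD.
    static boolean parsesCleanly(String xml, boolean validating) {
        boolean[] ok = {true};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                // Validity violations arrive here as recoverable errors.
                public void error(SAXParseException e) { ok[0] = false; }
                // Well-formedness violations are fatal.
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return ok[0];
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String nested = DOCTYPE + "<name> <first_name> Giancarlo </first_name>"
                      + " <last_name> Succi </last_name> </name>";
        System.out.println("well-formed: " + parsesCleanly(nested, false));
        System.out.println("valid:       " + parsesCleanly(nested, true));
    }
}
```

The design choice worth noting is the error handler: validity problems are reported through error() and parsing continues, whereas a well-formedness problem is fatal and stops the parse.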
Beyond DTDs… • DTD limitations • Simple document structures • Lack of “real” datatypes • Advanced schema languages • XML Schema • Relax NG • …
Limitations of DTDs • No typing of text elements and attributes • All values are strings, no integers, reals, etc. • Difficult to specify unordered sets of subelements • Order is usually irrelevant in databases • (A | B)* allows specification of an unordered set, but • Cannot ensure that each of A and B occurs only once • IDs and IDREFs are untyped • The owners attribute of an account may contain a reference to another account, which is meaningless • owners attribute should ideally be constrained to refer to customer elements
XML Schema • XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. Supports • Typing of values • E.g. integer, string, etc • Also, constraints on min/max values • User defined types • Is itself specified in XML syntax, unlike DTDs • More standard representation, but verbose • Is integrated with namespaces • Many more features • List types, uniqueness and foreign key constraints, inheritance .. • BUT: significantly more complicated than DTDs.
XML Schema – Simple Types • Elements that do not contain other elements or attributes are of type simpleType. <xsd:element name="STAFFNO" type="xsd:string"/> <xsd:element name="DOB" type="xsd:date"/> <xsd:element name="SALARY" type="xsd:decimal"/> • Attributes must be defined last: <xsd:attribute name="branchNo" type="xsd:string"/>
XML Schema – Complex Types • Elements that contain other elements are of type complexType. • The list of children of a complex type is described by a sequence element. <xsd:element name="STAFFLIST"> <xsd:complexType> <xsd:sequence> <!-- children defined here --> </xsd:sequence> </xsd:complexType> </xsd:element>
Cardinality • Cardinality of an element can be represented using the attributes minOccurs and maxOccurs. • To represent an optional element, set minOccurs to 0; to indicate there is no maximum number of occurrences, set maxOccurs to "unbounded". <xsd:element name="DOB" type="xsd:date" minOccurs="0"/> <xsd:element name="NOK" type="xsd:string" minOccurs="0" maxOccurs="3"/>
References • Can use references to element and attribute definitions. <xsd:element name="STAFFNO" type="xsd:string"/> …. <xsd:element ref="STAFFNO"/> • If there are many references to STAFFNO, use of references will place the definition in one place and improve the maintainability of the schema.
Defining New Types • Can also define new data types to create elements and attributes. <xsd:simpleType name="STAFFNOTYPE"> <xsd:restriction base="xsd:string"> <xsd:maxLength value="5"/> </xsd:restriction> </xsd:simpleType> • The new type has been defined as a restriction of string (to have a maximum length of 5 characters).
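The STAFFNOTYPE restriction can be exercised with the javax.xml.validation API. This is a sketch only: the wrapping xsd:schema element, the STAFFNO element declared with that type, and the names StaffNoSchemaDemo and isValid are assumptions added to make the fragment runnable.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class StaffNoSchemaDemo {
    // The slide's simple type, wrapped in a schema with one element that uses it.
    static final String XSD =
          "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xsd:simpleType name=\"STAFFNOTYPE\">"
        + "<xsd:restriction base=\"xsd:string\">"
        + "<xsd:maxLength value=\"5\"/>"
        + "</xsd:restriction>"
        + "</xsd:simpleType>"
        + "<xsd:element name=\"STAFFNO\" type=\"STAFFNOTYPE\"/>"
        + "</xsd:schema>";

    // Returns true if the instance document conforms to the schema above.
    static boolean isValid(String xml) {
        Validator validator;
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            validator = factory.newSchema(new StreamSource(new StringReader(XSD)))
                               .newValidator();
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the instance violates the schema or is not well-formed
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<STAFFNO>SL21</STAFFNO>"));      // 4 characters
        System.out.println(isValid("<STAFFNO>SL210000</STAFFNO>")); // 8 characters
    }
}
```

A DTD could only say that STAFFNO contains #PCDATA; the length limit is something only the schema can express.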
Groups • Can define both groups of elements and groups of attributes. A group is not a data type but acts as a container holding a set of elements or attributes. <xsd:group name="StaffType"> <xsd:sequence> <xsd:element name="StaffNo" type="StaffNoType"/> <xsd:element name="Position" type="PositionType"/> <xsd:element name="DOB" type="xsd:date"/> <xsd:element name="Salary" type="xsd:decimal"/> </xsd:sequence> </xsd:group>
Constraints • XML Schema provides XPath-based features for specifying uniqueness constraints and corresponding reference constraints that will hold within a certain scope. <xsd:unique name="NAMEDOBUNIQUE"> <xsd:selector xpath="STAFF"/> <xsd:field xpath="NAME/LNAME"/> <xsd:field xpath="DOB"/> </xsd:unique>
XML Schema Version of Bank <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <xsd:element name="bank" type="BankType"/> <xsd:element name="account"> <xsd:complexType> <xsd:sequence> <xsd:element name="account-number" type="xsd:string"/> <xsd:element name="branch-name" type="xsd:string"/> <xsd:element name="balance" type="xsd:decimal"/> </xsd:sequence> </xsd:complexType> </xsd:element> ….. definitions of customer and depositor …. <xsd:complexType name="BankType"> <xsd:sequence> <xsd:element ref="account" minOccurs="0" maxOccurs="unbounded"/> <xsd:element ref="customer" minOccurs="0" maxOccurs="unbounded"/> <xsd:element ref="depositor" minOccurs="0" maxOccurs="unbounded"/> </xsd:sequence> </xsd:complexType> </xsd:schema>
What is an XML Parsing API? • Programming model for accessing an XML document • Sits on top of an XML parsing engine • Language/platform independent
Java XML Parsing Specification • The Java XML Parsing Specification is a request to include a standardised way of parsing XML in the Java standard library. • The specification defines the following packages: • javax.xml.parsers • org.xml.sax • org.xml.sax.helpers • org.w3c.dom • The first is an all-new pluggability layer; the others come from existing packages.
Two ways of using XML parsers: SAX and DOM • The Java XML Parsing Specification specifies two interfaces for XML parsers: • Simple API for XML (SAX) is a flat, event-driven parser • Document Object Model (DOM) is an object-oriented parser which translates the XML document into a Java Object hierarchy
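The DOM half of this comparison can be sketched as follows: the whole document is first turned into a tree of Java objects, which the application then navigates. DomWalkDemo and names are illustrative names, and the address-book document is a small sample assumed for the sketch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomWalkDemo {
    static final String XML =
          "<addressbook>"
        + "<person><name>John Doe</name><email>email@example.com</email></person>"
        + "<person><name>Jane Doe</name><email>firstname.lastname@example.org</email></person>"
        + "</addressbook>";

    // Builds the full object tree, then walks it to collect every person's name.
    static List<String> names(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> result = new ArrayList<>();
            NodeList persons = doc.getElementsByTagName("person");
            for (int i = 0; i < persons.getLength(); i++) {
                Element person = (Element) persons.item(i);
                // Navigate from the person node down to its name child.
                result.add(person.getElementsByTagName("name").item(0).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(names(XML));
    }
}
```

Because the tree is held in memory, the application can revisit nodes in any order, at the cost of loading the entire document first.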
SAX • Simple API for XML • Event-based XML parsing API • Not governed by any standards body • Developed on the XML-DEV mailing list, with David Megginson as its original maintainer • SAX is simply a programming model that the developers of individual XML parsers implement • A SAX parser written in Java exposes the equivalent events • A "serial access" protocol for XML
SAX (cont) • A SAX parser reads the XML document as a stream of XML tags: • starting elements, ending elements, text sections, etc. • Every time the parser encounters an XML tag it calls a method in its HandlerBase object to deal with the tag. • The HandlerBase object is usually written by the application programmer. (HandlerBase is the SAX 1 class; SAX 2 replaces it with DefaultHandler.) • The HandlerBase object is given as a parameter to the parse() method of the SAX parser. It includes all the code that defines what the XML tags actually "do".
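The callback model can be sketched with the JDK's SAX parser. One assumption to flag: the sketch uses DefaultHandler, the SAX 2 successor of the HandlerBase class named on the slide, because HandlerBase is deprecated. SaxEventDemo and events are illustrative names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventDemo {
    static final String XML =
          "<addressbook>"
        + "<person><name>John Doe</name><email>email@example.com</email></person>"
        + "<person><name>Jane Doe</name><email>firstname.lastname@example.org</email></person>"
        + "</addressbook>";

    // Parses the document and returns a log of the callbacks, in the order fired.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("startElement " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("endElement " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one text section in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("characters " + text);
            }
        };
        try {
            // The handler is passed to parse(), as described above.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events(XML)) {
            System.out.println(event);
        }
    }
}
```

Nothing is retained between callbacks unless the handler stores it, which is what keeps SAX memory use low and also what makes it "flat" compared with DOM.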
How Does SAX work? [Figure: an XML document is passed piece by piece through the parser, which fires SAX events. XML document: <?xml version="1.0"?> <addressbook> <person> <name>John Doe</name> <email>email@example.com</email> </person> <person> <name>Jane Doe</name> <email>firstname.lastname@example.org</email> </person> </addressbook> • SAX events, in document order: startDocument at the start; startElement for each opening tag; startElement & characters for each <name> and <email>; endElement for each closing tag; endElement & endDocument at </addressbook>.]
SAX tutorial http://java.sun.com/xml/jaxp/dist/1.1/docs/tutorial/sax/index.html Notes: some files are at http://www.ics.uci.edu/~ics185/handouts/slides13-sax/