Learn about the features, goals, and terminology of XML, an extensible markup language for data exchange. See how XML is different from HTML, its advantages, and key concepts like tags, elements, and attributes.
HTML to XML • HTML documents • Emerging Web standards - XML • XML is well suited to data interchange across platforms, enterprise-wide • HTML-to-XML conversion - IBM, Microsoft
XML - Motivation • In HTML, both the tag set and the tag semantics are fixed, so tags have a limited and strict interpretation. • HTML has been very successful at disseminating documents across the Internet. • Data can be disseminated through HTML, but extracting it again is painful and laborious. • EDI has been the predominant mode of exchanging data among businesses, but its very rigid format requires highly customized applications.
XML - Introduction • XML aims to combine the ease of authoring of HTML documents with the ease of data exchange that EDI makes possible. • Tags are used to mark up documents. • XML is a meta-language for describing markup languages. • XML provides a facility to define tags and the structural relationships between them. • No predefined tag set implies no preconceived semantics: the semantics of an XML document are defined by the applications that process it.
XML - Goals • Straightforward to use over the Internet • Support a wide variety of applications: authoring, browsing, content analysis, etc. • Easy to write programs that process XML documents and validate them • XML documents must be human-legible and reasonably clear • The design of XML shall be formal and concise, expressed in EBNF (Extended Backus-Naur Form) and amenable to modern compiler tools and techniques
XML - Features • Some structure, but not rigid • Extensibility - user-defined tags • Nested elements • Validation - documents may specify their own grammar • DTD (Document Type Definition) - the schema exists with the data as tag names • Applications - EDI, extraction, conversion, transformation, integration • Can be modeled using DOM
More terminology • RDF - Resource Description Framework - a method to describe metadata for XML documents • XSL - Extensible Stylesheet Language - a language for transforming and formatting XML • Transformation and linking languages - XSLT, XPath, XPointer, XLink
Example - HTML • Printed text: Sanjay Madria, Web Warehouse Tutorial, ADBIS’99 • HTML: <H2> Sanjay Madria </H2> <I> Web Warehouse Tutorial, ADBIS’99</I> • Difficult for a program to interpret: the structure is hidden and the markup describes only appearance
XML • <Ref> <Speaker> <Firstname> Sanjay </Firstname> <Lastname> Madria </Lastname> </Speaker> <Title> Web Warehouse Tutorial </Title> <Conference> ADBIS’99 </Conference> </Ref> • Another format: <Firstname Value="Sanjay"/>
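Because the structure is explicit, a program can pull out individual fields by name. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module, assuming a well-formed version of the record (matching tag case, no stray end-tags); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the reference record (start- and end-tags match exactly).
doc = """
<Ref>
  <Speaker>
    <Firstname>Sanjay</Firstname>
    <Lastname>Madria</Lastname>
  </Speaker>
  <Title>Web Warehouse Tutorial</Title>
  <Conference>ADBIS'99</Conference>
</Ref>
"""

ref = ET.fromstring(doc)
speaker = ref.find("Speaker")

# Each field is addressed by its tag name rather than by its appearance.
first = speaker.findtext("Firstname")
last = speaker.findtext("Lastname")
title = ref.findtext("Title")
conference = ref.findtext("Conference")

print(f"{first} {last}: {title}, {conference}")
```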
XML can Separate Data from HTML • XML is used to Exchange Data • XML can be used to Share Data • XML can be used to Store Data • XML can be used to Create new Languages (WML)
XML • <Person> - a start-tag • </Person> - an end-tag • Tags are also called markup. • Tags must be balanced: they close in the inverse order of their opening • Tags are defined by users; there are no predefined tags
<person> <name> Alan </name> <age> 42 </age> <email> agb@abc.com </email> </person> • Element - <person>…</person> • Subelement - <age>
XML elements must follow these naming rules: • Names can contain letters, digits, and other characters such as hyphens, underscores, and periods • Names must start with a letter or "_" (underscore), not with a digit, hyphen, or period • Names starting with the letters xml (or XML or Xml ..) are reserved and should not be used • Names cannot contain spaces
<table> <description> People on the fourth floor </description> <people> <person> <name> Alan </name> <age> 42 </age> <email> agb@abc.com </email> </person> <person> <name> Patsy </name> <age> 36 </age> <email> ptn@abc.com </email> </person> <person> <name> Ryan </name> <age> 58 </age> <email> rgz@abc.com </email> </person> </people> </table>
<married></married> Can be abbreviated to <married/>
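A quick way to see that the two spellings are equivalent is to parse both and compare the results. This is a small sketch with Python's standard `xml.etree.ElementTree`; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<married></married>")
short_form = ET.fromstring("<married/>")

# Both spellings yield an element with the same tag, no text, and no children.
same = (
    long_form.tag == short_form.tag
    and (long_form.text or "") == (short_form.text or "")
    and len(long_form) == len(short_form) == 0
)
print(same)
```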
XML Attributes • An attribute is a (name, value) pair • <product> <name language="French"> trompette six trous </name> <price currency="Euro"> 420.12 </price> <address format="XLB56" language="French"> <street>31 rue Croix-Bosset</street> <zip>92310</zip><city>Sevres</city> <country>France</country> </address> </product>
Attribute values are always strings, written in quotes ("..") • A given attribute may occur only once within a tag, whereas subelements with the same tag name may be repeated within an element
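The two rules can be checked with a parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper `is_well_formed` and the `vat` attribute are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The same attribute twice in one tag is rejected by the parser ...
repeated_attribute = '<name language="French" language="English">trompette</name>'
# ... while the same subelement twice inside one element is fine.
repeated_subelement = "<product><name>trompette</name><name>trumpet</name></product>"

print(is_well_formed(repeated_attribute))
print(is_well_formed(repeated_subelement))

# Attribute values always come back as strings, even when they look numeric.
price = ET.fromstring('<price currency="Euro" vat="20">420.12</price>')
vat = price.get("vat")
print(type(vat).__name__)
```

Any conversion to a number is left to the application, for example `int(vat)`.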
XML tags are case sensitive • With XML, white space is preserved • <b><i>This text is bold and italic</b></i> • Improperly nested: tolerated in HTML, but not well-formed XML • <b><i>This text is bold and italic</i></b> • Properly nested, as XML requires
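These three points can be observed directly with a parser. A minimal sketch using Python's standard `xml.etree.ElementTree` follows; the helper name `is_well_formed` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Nesting: tags must close in the inverse order of their opening.
crossed = "<b><i>This text is bold and italic</b></i>"
nested = "<b><i>This text is bold and italic</i></b>"

# Case sensitivity: <Person> and </person> are different names.
mixed_case = "<Person>Alan</person>"

# White space inside an element is kept as part of its text.
name_text = ET.fromstring("<name> Alan </name>").text

print(is_well_formed(crossed), is_well_formed(nested), is_well_formed(mixed_case))
print(repr(name_text))
```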
XML Elements are Extensible • An application can extract the <to>, <from>, and <body> elements of the note below to produce: • MESSAGE • To: Tove • From: Jani • Don't forget me this weekend!
<?xml version="1.0" ?> <note><to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
<note> <date>1999-08-01</date> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note> • No problem: adding a <date> element does not break an application that reads only <to>, <from>, and <body>
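The extensibility claim can be illustrated with a small extraction routine that looks elements up by name. This is a sketch using Python's standard `xml.etree.ElementTree`; the function `render_message` is a hypothetical stand-in for the application.

```python
import xml.etree.ElementTree as ET

def render_message(xml_text):
    """Build the MESSAGE output from the to, from, and body elements only."""
    note = ET.fromstring(xml_text)
    return "MESSAGE\nTo: {}\nFrom: {}\n{}".format(
        note.findtext("to"), note.findtext("from"), note.findtext("body")
    )

original = (
    "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading>"
    "<body>Don't forget me this weekend!</body></note>"
)
# Same note with an extra <date> element added later by the author.
extended = (
    "<note><date>1999-08-01</date><to>Tove</to><from>Jani</from>"
    "<heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"
)

# Elements are found by name, so the new element is simply ignored.
print(render_message(extended))
print(render_message(original) == render_message(extended))
```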
Book Title: My First XML • Chapter 1: Introduction to XML • What is HTML • What is XML • Chapter 2: XML Syntax • Elements must have a closing tag • Elements must be correctly nested
<book> <title>My First XML</title> <prod id="33-657" media="paper"></prod> <chapter>Introduction to XML <para>What is HTML</para> <para>What is XML</para> </chapter> <chapter>XML Syntax <para>Elements must have a closing tag</para> <para>Elements must be properly nested</para> </chapter> </book>
<person sex="female"> <firstname>Anna</firstname> <lastname>Smith</lastname> </person> • <person> <sex>female</sex> <firstname>Anna</firstname> <lastname>Smith</lastname> </person>
Bad Design • <note day="12" month="11" year="99" to="Tove" from="Jani" heading="Reminder" body="Don't forget me this weekend!"> </note>
<note date="12/11/99"> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
<note> <date>12/11/99</date> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
<note> <date> <day>12</day> <month>11</month> <year>99</year> </date> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
PCDATA • XML parsers normally treat all text as parsed character data (PCDATA). • When an XML element is parsed, the text between the XML tags is also parsed for markup and entity references. • CDATA • Everything inside a CDATA section is treated as literal text; the parser does not interpret it as markup. • A CDATA section starts with "<![CDATA[" and ends with "]]>".
<person> <name> Alan </name> <age> 42 </age> <email> agb@abc.com </email> </person> or <person name="Alan" age="42" email="agb@abc.com"/> or <person age="42"> <name> Alan </name> <email> agb@abc.com </email> </person>
[Figure: two tree diagrams of the person data; each has a person root with name, age, and email nodes holding Alan, 42, and agb@abc.com.]
XML can associate a unique identifier with an element, as the value of a certain attribute (here called id) • Other elements refer to that element using idref
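The difference between parsed text and a CDATA section shows up as soon as the text contains characters such as "<" or "&". The sketch below uses Python's standard `xml.etree.ElementTree`; the element name `code` and the sample text are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

snippet = "if (a < b && b < c) { ok(); }"

# As ordinary parsed text, the bare "<" and "&" make the document ill-formed.
try:
    ET.fromstring("<code>" + snippet + "</code>")
    parsed_as_pcdata = True
except ET.ParseError:
    parsed_as_pcdata = False

# Inside a CDATA section the same characters are taken literally.
element = ET.fromstring("<code><![CDATA[" + snippet + "]]></code>")

print(parsed_as_pcdata)
print(element.text)
```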
<messages> <note ID="501"> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note> <note ID="502"> <to>Jani</to> <from>Tove</from> <heading>Re: Reminder</heading> <body>I will not!</body> </note> </messages>
<state id="s2"> <scode>NV</scode> <sname>Nevada</sname> </state> <city id="c2"> <ccode>CCN</ccode> <cname>Carson City</cname> <state-of idref="s2"/> </city>
[Figure: diagram with the labels a, a, c, and b.]
<a><b id="o123"> some string </b></a> <a c="o123"/> • Assume c is a reference attribute • <a b="o123"/> <a><c id="o123"> some string </c></a> • Assume b is a reference attribute
<geography> <states> <state id="s1"> <scode>ID</scode> <sname>Idaho</sname> <capital idref="c1"/> <cities-in idref="c1"/><cities-in idref="c3"/>…… </state> <state id="s2"> <scode>NV</scode> <sname>Nevada</sname> <capital idref="c2"/> <cities-in idref="c2"/>……. </state> …. </states>
<cities> <city id="c1"> <ccode>BOI</ccode> <cname>Boise</cname> <state-of idref="s1"/> </city> <city id="c2"> <ccode>CCN</ccode> <cname>Carson City</cname> <state-of idref="s2"/> </city> <city id="c3"> <ccode>MOC</ccode> <cname>Moscow</cname> <state-of idref="s1"/> </city> … </cities> </geography>
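References between states and cities can be followed by first indexing every element by its identifier. This is a minimal sketch using Python's standard `xml.etree.ElementTree` on a trimmed copy of the geography data. ElementTree knows nothing about ID or IDREF types without a DTD, so the attribute names `id` and `idref` are treated as a convention here, and the helper `deref` is hypothetical.

```python
import xml.etree.ElementTree as ET

doc = """
<geography>
  <states>
    <state id="s1"><sname>Idaho</sname><capital idref="c1"/></state>
  </states>
  <cities>
    <city id="c1"><cname>Boise</cname><state-of idref="s1"/></city>
    <city id="c3"><cname>Moscow</cname><state-of idref="s1"/></city>
  </cities>
</geography>
"""

root = ET.fromstring(doc)

# Index every element that carries an id attribute.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

def deref(element):
    """Follow an element's idref attribute to the element it points at."""
    return by_id[element.get("idref")]

capital_name = deref(by_id["s1"].find("capital")).findtext("cname")
moscow_state = deref(by_id["c3"].find("state-of")).findtext("sname")

print(capital_name, moscow_state)
```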
Ordering • person:{firstname: "John", lastname: "Smith"} • person:{lastname: "Smith", firstname: "John"} • As semistructured data (SSD), the two are the same
As XML documents, these two are not the same: <person><firstname>John</firstname> <lastname>Smith </lastname></person> <person><lastname>Smith </lastname> <firstname>John</firstname></person> • The following two are equivalent, because attributes are not ordered: <person firstname="John" lastname="Smith"/> <person lastname="Smith" firstname="John"/>
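The ordering difference between subelements and attributes is visible in how a parser exposes them: children as a sequence, attributes as an unordered mapping. A small sketch with Python's standard `xml.etree.ElementTree`, with illustrative variable names:

```python
import xml.etree.ElementTree as ET

first_last = ET.fromstring(
    "<person><firstname>John</firstname><lastname>Smith</lastname></person>"
)
last_first = ET.fromstring(
    "<person><lastname>Smith</lastname><firstname>John</firstname></person>"
)

# Children form an ordered sequence, so the two documents differ.
order_a = [child.tag for child in first_last]
order_b = [child.tag for child in last_first]

attrs_a = ET.fromstring('<person firstname="John" lastname="Smith"/>')
attrs_b = ET.fromstring('<person lastname="Smith" firstname="John"/>')

# Attributes form a name-to-value mapping; dict comparison ignores order.
print(order_a == order_b)
print(attrs_a.attrib == attrs_b.attrib)
```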
Mixing elements and Text <Person> This is my best friend <Name> Alan </Name> <Age> 42 </Age> I am not too sure of the following email <Email> agb@abc.com </Email > </Person>
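Mixed content needs some care when it is read programmatically, because the free text sits between the subelements. In Python's standard `xml.etree.ElementTree`, text before the first child is stored in the parent's `text`, and text following a child is stored in that child's `tail`. The sketch below uses a copy of the example with surrounding white space trimmed.

```python
import xml.etree.ElementTree as ET

doc = (
    "<Person>This is my best friend"
    "<Name>Alan</Name><Age>42</Age>"
    "I am not too sure of the following email"
    "<Email>agb@abc.com</Email></Person>"
)

person = ET.fromstring(doc)

# Text before the first subelement belongs to the parent element.
leading_text = person.text
# Text after </Age> is attached to the Age element as its tail.
text_after_age = person.find("Age").tail
# itertext() walks all text pieces in document order.
pieces = list(person.itertext())

print(leading_text)
print(text_after_age)
print(pieces)
```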
<!-- this is a comment --> - Comments are allowed anywhere except inside other markup, and are part of the document. • <?xml-stylesheet href="book.css" type="text/css"?> - Processing instruction (PI) for applications • <?xml version="1.0"?> - This is not a PI; it is not passed to the application. • <![CDATA[<start>this is an incorrect element </end>]]> • <!DOCTYPE name [markupdeclarations]> • Document layout: <?xml….?> <!DOCTYPE name [markupdeclarations]> <name>…</name>
<db><person> <name> Alan </name> <age> 42 </age> <email> agb@abc.com </email> </person> <person>… </person> … </db> <!DOCTYPE db [ <!ELEMENT db (person*)> <!ELEMENT person (name,age,email)> <!ELEMENT name (#PCDATA)> <!ELEMENT age (#PCDATA)> <!ELEMENT email (#PCDATA)> ]>
Recursion <!ELEMENT node (leaf | (node,node))> <!ELEMENT leaf (#PCDATA)> An example of such XML document is <node> <node> <node> <leaf> 1 </leaf> </node> <node> <leaf> 2 </leaf> </node> </node> <node> <leaf> 3 </leaf> </node> </node>
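A recursive content model is naturally processed by a recursive function. The following sketch, using Python's standard `xml.etree.ElementTree`, collects the leaf values of the example tree from left to right; the function name `collect_leaves` is an assumption, and the leaf values are taken to be integers.

```python
import xml.etree.ElementTree as ET

doc = (
    "<node>"
    "<node><node><leaf>1</leaf></node><node><leaf>2</leaf></node></node>"
    "<node><leaf>3</leaf></node>"
    "</node>"
)

def collect_leaves(node):
    """Return leaf values in document order, mirroring (leaf | (node,node))."""
    leaf = node.find("leaf")  # searches direct children only
    if leaf is not None:
        return [int(leaf.text)]
    values = []
    for child in node.findall("node"):
        values.extend(collect_leaves(child))
    return values

leaves = collect_leaves(ET.fromstring(doc))
print(leaves)
```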
<db> <r1><a> a1 </a><b> b1 </b><c> c1 </c></r1> <r1><a> a2 </a><b> b2 </b><c> c2 </c></r1> <r2><c> c2 </c><d> d2 </d></r2> <r2><c> c3 </c><d> d3 </d></r2> <r2><c> c4 </c><d> d4 </d></r2> </db>
<!DOCTYPE db [ <!ELEMENT db (r1*,r2*)> <!ELEMENT r1 (a,b,c)> <!ELEMENT r2 (c,d)> <!ELEMENT a (#PCDATA)> <!ELEMENT b (#PCDATA)> <!ELEMENT c (#PCDATA)> <!ELEMENT d (#PCDATA)> ]>
<!ELEMENT r2 ((c,d) | (d,c))> • <!ELEMENT db ((r1|r2)*)> • <!ELEMENT r1 (a,b?,c+)> • <!DOCTYPE db [<!ELEMENT …>…]> • <!DOCTYPE db SYSTEM "schema.dtd"> • <!DOCTYPE db SYSTEM "http://www.schemaauthority.com/schema.dtd">
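Element content models are regular expressions over child element names, which suggests a simple way to check them. Python's standard `xml.etree.ElementTree` does not validate against DTDs, so the sketch below translates a content model into a Python regular expression and matches it against the sequence of child tags. It is an illustration of the idea, not a validator: the helpers `model_to_regex` and `children_match` are hypothetical, and `#PCDATA`, `ANY`, and `EMPTY` are not handled.

```python
import re
import xml.etree.ElementTree as ET

def model_to_regex(model):
    """Translate an element-only content model such as (a,b?,c+) to a regex."""
    model = re.sub(r"\s+", "", model)
    # Each element name becomes a group matching "name " (space-terminated).
    pattern = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: "(?:" + re.escape(m.group()) + " )",
        model,
    )
    # "," means sequence, which is plain concatenation in a regex;
    # "|", "?", "*", "+", and parentheses carry over unchanged.
    return re.compile(pattern.replace(",", ""))

def children_match(element, model):
    """Check the element's child tag sequence against the content model."""
    sequence = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(sequence) is not None

ok_r1 = children_match(ET.fromstring("<r1><a/><c/><c/></r1>"), "(a,b?,c+)")
bad_r1 = children_match(ET.fromstring("<r1><a/><b/></r1>"), "(a,b?,c+)")
either_order = children_match(ET.fromstring("<r2><d/><c/></r2>"), "((c,d) | (d,c))")
fixed_order = children_match(ET.fromstring("<r2><d/><c/></r2>"), "(c,d)")

print(ok_r1, bad_r1, either_order, fixed_order)
```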
<product> <name language="French" department="music"> trompette six trous </name> <price currency="Euro"> 420.12 </price> </product> <!ATTLIST name language CDATA #REQUIRED department CDATA #IMPLIED> <!ATTLIST price currency CDATA #IMPLIED>
IDREF - the attribute's value is some other element's identifier • IDREFS - the attribute's value is a list of identifiers, separated by spaces • <!DOCTYPE family [ <!ELEMENT family (person*)> <!ELEMENT person (name)> <!ELEMENT name (#PCDATA)> <!ATTLIST person id ID #REQUIRED mother IDREF #IMPLIED father IDREF #IMPLIED children IDREFS #IMPLIED> ]>
<family> <person id="jane" mother="mary" father="john"> <name> Jane Doe </name> </person> <person id="john" children="jane jack"> <name> John Doe </name> </person> <person id="mary" children="jane jack"> <name> Mary Smith </name> </person> <person id="jack" mother="mary" father="john"> <name> Jack Smith </name> </person> </family>
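A validating parser requires every IDREF and IDREFS value to match the ID of some element in the document. The same check can be sketched by hand with Python's standard `xml.etree.ElementTree`, which does not validate on its own. The sample below is a variant of the family data in which one reference, mother="smith", matches no id; the attribute lists simply mirror the ATTLIST declaration, and all variable names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = """
<family>
  <person id="jane" mother="mary" father="john"><name>Jane Doe</name></person>
  <person id="john" children="jane jack"><name>John Doe</name></person>
  <person id="mary" children="jane jack"><name>Mary Smith</name></person>
  <person id="jack" mother="smith" father="john"><name>Jack Smith</name></person>
</family>
"""

IDREF_ATTRS = ("mother", "father")  # declared as IDREF: a single identifier
IDREFS_ATTRS = ("children",)        # declared as IDREFS: space-separated list

family = ET.fromstring(doc)
ids = {person.get("id") for person in family.iter("person")}

dangling = []
for person in family.iter("person"):
    refs = [person.get(a) for a in IDREF_ATTRS if person.get(a) is not None]
    for attr in IDREFS_ATTRS:
        refs.extend(person.get(attr, "").split())
    # Record (referring person, missing identifier) for every broken reference.
    dangling.extend((person.get("id"), ref) for ref in refs if ref not in ids)

print(dangling)
```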