Intro to XML Hachim Haddouti Al Akhawayn University SSE H.Haddouti@alakhawayn.ma http://mail.alakhawayn.ma/~H.Haddouti
TOC • Intro • W3C • Historical development • Scenarios in XML and Data Management
1) Motivation • XML - EXtensible Markup Language • Markup language: marks up data and information about information within a document • Developed through the World Wide Web Consortium (W3C) • Human-readable • Mostly used as an exchange format
W3C (World Wide Web Consortium) • Over 400 member organisations • Hosted by MIT, INRIA, and Keio University • 50 full-time staff members • Invention of WWW • Examples of W3C standards: • XML • HTML • DOM • XPath • XML Schema • ...
Process at W3C • Note • Suggestions (no responsibility on the part of W3C) • Working Draft • Work in progress; has to be approved by all involved in W3C • Candidate Recommendation • Approved for test implementation only • Recommendation
Phenomenon XML • "XML is the ASCII of the 21st century." • "XML is the ASCII of the Web" Henry Thompson (1999) • Why is it popular?
What's Wrong with HTML? Y. Papakonstantinou, S. Abiteboul, H. Garcia-Molina. "Object Fusion in Mediator Systems". In VLDB 96. HTML is markup for presentation only <DT> <IMG SRC="greenball.gif"> <A NAME="object-fusion"></A> Y.Papakonstantinou, S.Abiteboul, H.Garcia-Molina. <A HREF="http://www-cse.ucsd.edu/~yannis/papers/fusion.ps"> "Object Fusion in Mediator Systems".</A> In <I>VLDB 96.</I> </DT>
...What's Wrong with HTML... No Explicit Structure, Semantics, or Object-Orientation <DT> <IMG SRC="greenball.gif"> <A NAME="object-fusion"></A> Y.Papakonstantinou, S.Abiteboul, H.Garcia-Molina. <A HREF="http://www-cse.ucsd.edu/~yannis/papers/fusion.ps"> "Object Fusion in Mediator Systems".</A> In <I>VLDB 96.</I> </DT> (Author, title, and conference are only implicit in this markup.)
... And Some Repercussions • Lack of schema/semantics when querying the Web (HTML): • "find documents (books, papers, ...) where author = Michael Jackson" (... and learn how software engineering meets the moonwalker ...) • "create a list of M. Jackson's books and (if available) their prices" => HTML is inappropriate for • data exchange • automation of information management (retrieval, manipulation, integration)
XML is ... • Standardized for all applications • Worldwide exchange format (write once, read everywhere) • XML is a metalanguage for defining other languages • Examples: MathML, ChessML, XUL (user interfaces), CellML, Gene Expression Markup Language, Chemical Markup Language, XML/EDI, UN/EDIFACT • Nowadays over 300 such languages
So what is XML (all about)? Executive Summary: • XML = HTML – idiosyncrasies (simplified syntax) + user-definable ("semantic") tags • Separation of data and its presentation • Tags such as FONT or CENTER are not necessary in XML. XML uses structural tags such as TITLE, CHAPTER, etc.; the document structure remains constant across different media. => simple, very flexible data exchange format: semistructured data model => new applications: • Information exchange (B2B), sharing (diglib), integration ("mediation"), archival, ... • Web site management (XML+XSL stylesheets), ...
… It takes ten minutes to understand (base) XML, and then ten months to understand the new technologies hung around it. (Peter Chen) • So why an XML course at AUI?
Extensible Markup Language (XML) • De facto standard adopted by W3C • Describes content rather than presentation • Three major differences from HTML • New tags may be defined at will • Nested structures • An XML document can contain an optional description of its grammar
Historical Development XML /1 (from Neil Bradley: The XML Companion) • 1960: Generalized Markup, Internet • 1986: SGML • 1992: HTML, WWW • 1997: XML
Historical Development XML /2 • 1997: XML • 1998: XQL, XML-QL, DOM • 1999: XPath 1.0 • 2000: Quilt • 2001: XML Schema • 2002: XUpdate, XPath 2.0, XQuery 1.0 • Figure legend: W3C recommendations, in progress, other proposals
2) Documents ... • For communication between humans • Human – Human • Natural (human) language is used; it contains complex and irregular structures • For computer communication: • Computer – Computer • Data-oriented • Human – Computer • Document-oriented • XML allows representation and transport of this information
XML Documents • Before the syntax, some examples of XML documents
Example: XML Document <?xml version="1.0" encoding="UTF-8"?> <invoice customerNo="k333063143"> <monthsprice>0,00</monthsprice> <detailedinvoice> <call> <date>26.2.</date> <time>19:47</time> <number>200xxxx</number> <itemprice currency="Euro">0,66</itemprice> </call> <call> <date>27.2.</date> <time>19:06</time> <number>200xxxx</number> <itemprice currency="Euro">0,46</itemprice> </call> <call_charge_total currency="Euro">2.19</call_charge_total> </detailedinvoice> </invoice>
XML Document - Features • XML documents contain both data and the structure of that data within one document (self-describing) • All documents have the same/similar structure (regular) • Typed information in XML documents • For the previous example: the information could be stored in a DB.
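Because the invoice carries its own structure, a program can pull out rows that would fit a database table. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the function name call_rows and the row layout are illustrative assumptions, and the invoice is an abridged, well-formed copy of the example (XML declaration and call_charge_total omitted).

```python
import xml.etree.ElementTree as ET

# Abridged, well-formed copy of the invoice example.
INVOICE_XML = """<invoice customerNo="k333063143">
  <monthsprice>0,00</monthsprice>
  <detailedinvoice>
    <call>
      <date>26.2.</date><time>19:47</time><number>200xxxx</number>
      <itemprice currency="Euro">0,66</itemprice>
    </call>
    <call>
      <date>27.2.</date><time>19:06</time><number>200xxxx</number>
      <itemprice currency="Euro">0,46</itemprice>
    </call>
  </detailedinvoice>
</invoice>"""


def call_rows(xml_text):
    """Return one (date, time, number, price, currency) tuple per <call>."""
    root = ET.fromstring(xml_text)
    rows = []
    for call in root.iter("call"):
        price = call.find("itemprice")
        rows.append((
            call.findtext("date"),
            call.findtext("time"),
            call.findtext("number"),
            # The document writes decimals with a comma ("0,66").
            float(price.text.replace(",", ".")),
            price.get("currency"),
        ))
    return rows


rows = call_rows(INVOICE_XML)
customer = ET.fromstring(INVOICE_XML).get("customerNo")
```

Each tuple in rows corresponds to one call element, so the regular part of the document maps directly onto a relational table keyed by the customer number.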
Other XML Documents • XML documents can also be irregular • Semistructured information • Document-centric information
Recall Semistructured Data • Features of semistructured data • Structure is irregular. • Schema is implicitly included in data. • Structure of data is incomplete. • Schema is flexible. • Schema is big. • Schema is changing. (Abiteboul, 1997)
Object Exchange Model (OEM) /1 • See slides of first session
XML is Based on Markup Markup indicates structure and semantics <bibliography> <paper ID= "object-fusion"> <authors> <author>Y.Papakonstantinou</author> <author>S.Abiteboul</author> <author>H.Garcia-Molina</author> </authors> <fullPaper source="fusion"/> <title>Object Fusion in Mediator Systems</title> <booktitle>VLDB 96</booktitle> </paper> </bibliography> Decoupled from presentation
XML Elements • Structure in an XML document is provided by markup, which consists of elements • Element consists of start/end tags • Element can be empty • See naming rules for elements • XML processors are case-sensitive • Start with letter or underscore, avoid colon • Special element called ROOT element • Contains all other elements
Element Content • An XML element can: • be empty OR • have content ANY * OR • be of mixed content type (#PCDATA | car | train | plane) OR • have a list of child elements • *Elements with content type ANY are not checked by XML validators
Sample Elements and their Content <bibliography> <paper ID="object-fusion"> <authors> <author>Y.Papakonstantinou</author> <author>S.Abiteboul</author> <author>H.Garcia-Molina</author> </authors> <fullPaper source="fusion"/> <title>Object Fusion in Mediator Systems</title> <booktitle>VLDB 96</booktitle> </paper> </bibliography> • Figure labels: element name, element content, nested element, character content (PCDATA), empty element (fullPaper)
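The content kinds from the Element Content slide can be told apart programmatically. The sketch below uses Python's standard xml.etree.ElementTree; the function name content_kind and the short description sample are illustrative assumptions, and the classification looks only at the instance document, not at a DTD.

```python
import xml.etree.ElementTree as ET


def content_kind(elem):
    """Classify an element as 'empty', 'text', 'children', or 'mixed'."""
    has_children = len(elem) > 0
    # Character data sits in elem.text (before the first child)
    # and in each child's tail (after that child's end tag).
    chunks = [elem.text] + [child.tail for child in elem]
    has_text = any(chunk and chunk.strip() for chunk in chunks)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "children"
    return "text" if has_text else "empty"


paper = ET.fromstring(
    '<paper ID="object-fusion">'
    "<authors><author>Y.Papakonstantinou</author></authors>"
    '<fullPaper source="fusion"/>'
    "<title>Object Fusion in Mediator Systems</title>"
    "</paper>"
)
description = ET.fromstring(
    "<description>Our hotel has a <equipment>sauna</equipment>.</description>"
)

kinds = {child.tag: content_kind(child) for child in paper}
```

Here authors has only child elements, fullPaper is empty, title holds character data, and the description sample mixes text with a child element.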
Markup and Character Data • XML docs are made up of markup and character data • You can reference EXTERNAL binary data with entity references • Markup: Start/end tags, entity references, character references, comments, CDATA section delimiters, document type declarations, and processing instructions • Character data: all text that is NOT markup
Example <?xml version="1.0" encoding="UTF-8"?> <DOCUMENT> <GREETING> This text is inside the &lt;GREETING&gt; element. </GREETING> <MESSAGE> Welcome to the wild and woolly world of XML. </MESSAGE> </DOCUMENT> • General entity references turn into parsed character data (PCDATA) • Figure labels: markup, parsed character data, character data
XML Attributes • Be careful with the terminology vs. relational attributes • Attributes are defined as name-value pairs • Let you specify additional information in start tags and empty-element tags • Follow the same naming rules as tag names • Attribute values are text • Enclose values in quotation marks (" ") • A given attribute may occur only once within a tag, whereas elements can be repeated • Special attribute types
Element Attributes Attribute name Attribute Value <bibliography> <paper ID="object-fusion"> <authors> <author>Y.Papakonstantinou</author> <author>S.Abiteboul</author> <author>H.Garcia-Molina</author> </authors> <fullPaper source="fusion"/> <title>Object Fusion in Mediator Systems</title> <booktitle>VLDB 96</booktitle> </paper> </bibliography>
Other XML Constructs • Prolog: XML declaration, comments, processing instructions, DTD • XML declaration <?xml version="1.0" encoding="UTF-8" standalone="yes"?> • Comments <!-- this is a comment --> • Processing instruction <?xml-stylesheet href="book.css" type="text/css"?> • CDATA • Escape block containing characters that are not to be parsed (otherwise they would be recognized as markup) <![CDATA[<start>this is an incorrect element</end>]]> • (Note: not to be confused with the attribute type CDATA used in DTD attribute declarations)
Other XML Constructs • Entities (like macros) • &lt; for < • &gt; for > • &amp; for & • &apos; for ' • &quot; for " • Parameter entities in DTDs • Document Type Definition • Defines the document's grammar • See separate lecture
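The predefined entities can be tried out directly. This is a small sketch with Python's standard library (xml.sax.saxutils and xml.etree.ElementTree); the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "This text is inside the <GREETING> element & nowhere else."
escaped = escape(raw)          # replaces &, <, > with entity references
restored = unescape(escaped)   # reverses those three replacements

# A parser expands the references again when it builds the text node.
greeting = ET.fromstring("<GREETING>" + escaped + "</GREETING>")

# &apos; and &quot; are mainly needed inside attribute values.
quoted = ET.fromstring("<q says='&quot;hi&quot; &apos;there&apos;'/>")
```

After parsing, greeting.text holds the original characters again, which is why escaped markup characters count as character data rather than markup.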
Well-Formed XML Documents • XML documents are subject to two specific constraints • Well-formedness: A data object is an XML document if it is well-formed. A textual object is well-formed if: • The document follows the document production (prolog, root element, optional misc. part, e.g., comments and/or processing instructions) • It adheres to the syntax rules specified in the XML 1.0 recommendation (e.g., unique attribute names, properly nested tags) • Each parsed entity is itself well-formed • Validity: An XML document is valid if it obeys the document type definition (DTD) or XML schema that specifies its legal syntax • Ensures that the XML document parses into a labeled tree
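A non-validating parser checks well-formedness only, so it can serve as a quick test. The sketch below relies on Python's standard xml.etree.ElementTree raising ParseError; the helper name is_well_formed and the sample snippets are illustrative, and the check says nothing about validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = {
    "properly nested": "<a><b>text</b></a>",
    "overlapping tags": "<a><b>text</a></b>",
    "mismatched end tag": "<monthsprice>0,00</monathprice>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "two root elements": "<a/><b/>",
}
results = {name: is_well_formed(doc) for name, doc in samples.items()}
```

Only the first sample passes; each of the others violates one of the rules listed above (nesting, matching tags, unique attribute names, single root element).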
XML and Semistructured Data <bibliography> <paper id="23" ...> <authors> <author>Yannis</author> <author>Serge</author> ... </authors> <title>Object Fusion</title> ... </paper> </bibliography> • An XML document can be represented as an ssd-expression {bibliography: {paper: { id: 23, authors: {author: "Yannis", author: "Serge", … }, title: "Object Fusion …"} } }
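The mapping from an XML document to an ssd-expression can be sketched in a few lines. The code below is an illustration only: to_ssd and render are made-up names, labels are kept as (label, value) pairs because a label such as author may repeat, attribute values are rendered as quoted strings, and text in mixed content is ignored.

```python
import xml.etree.ElementTree as ET


def to_ssd(elem):
    """Return a string for a leaf, else a list of (label, value) pairs."""
    pairs = [(name, value) for name, value in elem.attrib.items()]
    pairs += [(child.tag, to_ssd(child)) for child in elem]
    if not pairs:
        return (elem.text or "").strip()
    return pairs


def render(value):
    """Write an ssd value in the {label: value, ...} notation."""
    if isinstance(value, str):
        return '"' + value + '"'
    return "{" + ", ".join(f"{label}: {render(v)}" for label, v in value) + "}"


root = ET.fromstring(
    '<bibliography><paper id="23">'
    "<authors><author>Yannis</author><author>Serge</author></authors>"
    "<title>Object Fusion</title>"
    "</paper></bibliography>"
)
expression = render([(root.tag, to_ssd(root))])
```

The resulting string has the same shape as the ssd-expression on the slide, with the attribute id treated like any other labeled value.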
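XML = Labeled Ordered Graph • XML denotes graphs with labels on nodes • [Figure: tree with root node bibliography and child nodes paper, paper, ...; the first paper has the nodes @id (23), authors (with author nodes Yannis and Serge), title (Object Fusion …), and fullpaper]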
Ssd-expression = Labeled Unordered Graph • Ssd-expression denotes graphs with labels on edges • [Figure: the same graph with the labels bibliography, paper, id, authors, author, title, and fullpaper on the edges, and the values 23, Yannis, Serge, and Object Fusion … at the leaves]
XML Graphs • So far, only seen XML trees • Using references, can create graphs • Can use the IDREF attribute type, which holds the ID value of another element in the document <state sid="s2"> <scode>NV</scode> <sname>Nevada</sname> </state> <city cid="c2"> <ccode>CCN</ccode> <cname>Carson City</cname> <state-of state_ref="s2"/> </city> • ID: special attribute type; XML processors make sure that no two elements in the same document have the same value for an attribute of type ID • IDREF: holds the ID value of another element in the document
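Following a reference from city to state turns the tree into a graph. In this sketch the wrapper element geo is hypothetical, the example is abridged to the name elements, and the code simply assumes that sid is the ID attribute and state_ref the IDREF attribute, since Python's standard ElementTree does not read attribute types from a DTD.

```python
import xml.etree.ElementTree as ET

GEO_XML = """<geo>
  <state sid="s2"><sname>Nevada</sname></state>
  <city cid="c2">
    <cname>Carson City</cname>
    <state-of state_ref="s2"/>
  </city>
</geo>"""

root = ET.fromstring(GEO_XML)

# Index every state by its (assumed) ID attribute.
states_by_id = {state.get("sid"): state for state in root.iter("state")}


def state_of(city):
    """Follow the state_ref reference of a city to the state's name."""
    ref = city.find("state-of").get("state_ref")
    return states_by_id[ref].findtext("sname").strip()


city = root.find("city")
state_name = state_of(city)
```

The lookup table plays the role of the ID index that a validating processor would maintain.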
A Word About Order • Our semistructured data model is based on UNORDERED collections of tuples • As are relations • Unordered collections can be processed more efficiently (exploited by commercial DBMS) • XML is ORDERED • Based on its origins in the information retrieval community • Order is critical in documents • However, attributes in XML are UNORDERED
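The difference between ordered elements and unordered attributes shows up when two documents are compared. This is a small sketch with Python's standard xml.etree.ElementTree; the helper child_tags and the four sample strings are illustrative.

```python
import xml.etree.ElementTree as ET


def child_tags(xml_text):
    """Return the tags of the root's children in document order."""
    return [child.tag for child in ET.fromstring(xml_text)]


call_a = "<call><date>26.2.</date><time>19:47</time></call>"
call_b = "<call><time>19:47</time><date>26.2.</date></call>"
point_a = '<coordinates x="200" y="300"/>'
point_b = '<coordinates y="300" x="200"/>'

# Child elements form a sequence, so swapping them changes the document.
same_children = child_tags(call_a) == child_tags(call_b)

# Attributes form a name-to-value mapping, so their order does not matter.
same_attributes = ET.fromstring(point_a).attrib == ET.fromstring(point_b).attrib
```

Here same_children is False while same_attributes is True.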
Usage of Schema Description: DTD • Specifies which elements can occur and how they may be nested • Also: declaration of structure information • Advantages of a DTD: • Serves as documentation for XML documents • Errors in XML documents can be detected • Better quality of XML documents because of a structured and well-thought-out methodology
Definition of Elements in DTD • XML document: <speaker> Ronald Bourret </speaker> • Corresponding DTD: <!ELEMENT speaker (#PCDATA)> • XML document: <speaker> <lname> Bourret </lname> <fname> Ronald </fname> </speaker> • Corresponding DTD: <!ELEMENT speaker (lname, fname)> <!ELEMENT lname (#PCDATA)> <!ELEMENT fname (#PCDATA)>
Definition of Elements in DTD cont. <!ELEMENT hotel (name, address)> <!ELEMENT name (#PCDATA)> <!ELEMENT address (zip, city, ((street, number?) | BP))> <!ELEMENT description (#PCDATA | equipment | gastronomy)*> • Sequence (A, B): A and B must occur in the document in the given order • Alternative (A | B): either A or B occurs in the document • Repetition: A? (0..1 times), A+ (1..n times), A* (0..n times) • Mixed content (#PCDATA | A | B)*: A, B, and text may occur in any order, any number of times
Example: Definition of Elements <!ELEMENT hotel (name, address)> <hotel> <name>Hotel Anwal</name> <address>...</address> </hotel>
Example: Definition of Elements cont. <!ELEMENT address (zip, city, ((street, number?) | BP))> <address> <zip>53000</zip> <city>Ifrane</city> <street>Abdelkrim El Khattabi</street> <number>12</number> </address> <address> <zip>62000</zip> <city>Al Hoceima</city> <BP>12345</BP> </address>
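A content model built from sequence, alternative, and repetition is essentially a regular expression over child element names. The sketch below makes that explicit for the address model; model_to_regex and conforms are made-up helper names, and the translation covers only element content (no #PCDATA, EMPTY, or ANY), so it is not a substitute for a validating parser.

```python
import re
import xml.etree.ElementTree as ET

ADDRESS_MODEL = "(zip, city, ((street, number?) | BP))"


def model_to_regex(model):
    """Translate an element-content model into a regex over 'tag;tag;' strings."""
    compact = re.sub(r"\s+", "", model)
    # Each element name becomes a group that matches "name;".
    named = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: "(?:" + re.escape(m.group(0)) + ";)",
        compact,
    )
    # A comma means sequence, which is plain concatenation in a regex;
    # |, ?, + and * already have the same meaning in both notations.
    return re.compile(named.replace(",", ""))


def conforms(model, elem):
    """Check the sequence of child tags of elem against the model."""
    sequence = "".join(child.tag + ";" for child in elem)
    return model_to_regex(model).fullmatch(sequence) is not None


with_street = ET.fromstring(
    "<address><zip>53000</zip><city>Ifrane</city>"
    "<street>Abdelkrim El Khattabi</street><number>12</number></address>"
)
with_bp = ET.fromstring(
    "<address><zip>62000</zip><city>Al Hoceima</city><BP>12345</BP></address>"
)
incomplete = ET.fromstring("<address><zip>62000</zip><city>Al Hoceima</city></address>")
```

Both sample addresses from the slide conform, while an address with neither street nor BP does not. Element names are case-sensitive, so a child named bp would not match BP.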
Example: Definition of Elements DTD cont. <!ELEMENT description (#PCDATA | equipment | gastronomy)*> <description> The Hotel Anwal is located in front of the City Hall, with a view of the Atlas mountains and high-quality service. </description> <description> Our hotel has a <equipment>sauna</equipment> and a <equipment>swimming pool</equipment>. The <gastronomy>hotel restaurant</gastronomy> offers regional cuisine and seafood. </description>
Attribute Syntax /1 • Attributes are assigned to elements of an XML document: <speaker tutorial='T1'> Ronald Bourret </speaker> • Parts of the element: start tag (with attribute name tutorial and attribute value T1), element content, end tag • Corresponding DTD: <!ELEMENT speaker (#PCDATA)> <!ATTLIST speaker tutorial CDATA #REQUIRED>
Attribute Syntax /2 • XML document: <coordinates x='200' y='300' z='150'/> • DTD: <!ELEMENT coordinates EMPTY> <!ATTLIST coordinates x CDATA #REQUIRED y CDATA #REQUIRED z CDATA #IMPLIED>
Representation of XML Documents (incl. elements and attributes) • XML documents are trees! Example: <speaker tutorial='T1'> <lname>Bourret</lname> <fname>Ronald</fname> </speaker> • Tree: element node speaker with the attribute node tutorial (value T1) and the child element nodes lname and fname, which hold the text nodes Bourret and Ronald
Declaration of Attributes in DTD • Attributes have • A name • A type (CDATA, ID, IDREF/IDREFS, ENTITY/ENTITIES, NMTOKEN/NMTOKENS, or an enumeration (value1|value2|...)) • A declaration of whether the attribute must occur (#REQUIRED, #IMPLIED, or #FIXED), or • An optional default value (required in the case of #FIXED) <!ATTLIST price currency CDATA #REQUIRED> <!ATTLIST project id ID #REQUIRED> <!ATTLIST person project IDREF #REQUIRED> <!ATTLIST zip xml-sqltype CDATA #FIXED 'INTEGER'>
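What #REQUIRED and #FIXED mean for an instance document can be illustrated with a hand-written check. Python's standard library does not validate against a DTD, so the ATTLIST dictionary below is a manual, hypothetical transcription of two of the declarations above, and attribute_errors is a made-up helper that ignores attribute types.

```python
import xml.etree.ElementTree as ET

# element name -> attribute name -> (presence keyword, fixed/default value)
ATTLIST = {
    "price": {"currency": ("#REQUIRED", None)},
    "zip": {"xml-sqltype": ("#FIXED", "INTEGER")},
}


def attribute_errors(elem, attlist=ATTLIST):
    """Return a list of violations of the attribute declarations for elem."""
    errors = []
    declared = attlist.get(elem.tag, {})
    for name, (presence, default) in declared.items():
        value = elem.get(name)
        if presence == "#REQUIRED" and value is None:
            errors.append(f"{elem.tag}: missing required attribute {name}")
        elif presence == "#FIXED" and value is not None and value != default:
            errors.append(f"{elem.tag}: attribute {name} must be {default!r}")
    for name in elem.attrib:
        if name not in declared:
            errors.append(f"{elem.tag}: undeclared attribute {name}")
    return errors


ok_price = attribute_errors(ET.fromstring('<price currency="Euro">0,66</price>'))
bad_price = attribute_errors(ET.fromstring("<price>0,66</price>"))
bad_zip = attribute_errors(ET.fromstring('<zip xml-sqltype="TEXT">53000</zip>'))
```

A #FIXED attribute may be left out of the document, in which case a validating processor supplies the fixed value; the sketch therefore reports an error only when a different value is given.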
ID / IDREF • Attributes can also be declared as ID / IDREF / IDREFS • ID values must be unique within a document <project member='p0001'> <title>...</title> </project> <project member='a0001'> <title>...</title> </project> ... <person id='p0001'> <name>Khattabi</name> </person> ... <department dep_id='a0001'/> • DTD: <!ELEMENT project (title)> <!ATTLIST project member IDREF #REQUIRED> <!ELEMENT person (name)> <!ATTLIST person id ID #REQUIRED> ... <!ELEMENT department EMPTY> <!ATTLIST department dep_id ID #REQUIRED>
ID/IDREF Overview • Values of IDREF attributes show which IDs are referenced. The referenced elements can be of different element types. • Global uniqueness of IDs is a MUST (compare to PK and FK in a DB?) • [Figure: project elements with title and member; each member reference points to the id of a person or the dep_id of a department]
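The two rules, document-wide unique IDs and IDREFs that point to an existing ID, can be checked by hand as follows. This is only a sketch: the wrapper element company is hypothetical, the ID_ATTRS and IDREF_ATTRS tables are a manual transcription of the ATTLIST declarations from the ID / IDREF slide, and check_ids is a made-up helper rather than a real validator.

```python
import xml.etree.ElementTree as ET

ID_ATTRS = {"person": "id", "department": "dep_id"}   # attributes of type ID
IDREF_ATTRS = {"project": "member"}                   # attributes of type IDREF


def check_ids(root):
    """Report duplicate IDs and IDREF values that match no ID."""
    seen, errors = set(), []
    for elem in root.iter():
        attr = ID_ATTRS.get(elem.tag)
        value = elem.get(attr) if attr else None
        if value is not None:
            if value in seen:
                errors.append(f"duplicate ID {value}")
            seen.add(value)
    for elem in root.iter():
        attr = IDREF_ATTRS.get(elem.tag)
        value = elem.get(attr) if attr else None
        if value is not None and value not in seen:
            errors.append(f"dangling IDREF {value}")
    return errors


good = ET.fromstring(
    '<company><project member="p0001"><title>...</title></project>'
    '<project member="a0001"><title>...</title></project>'
    '<person id="p0001"><name>Khattabi</name></person>'
    '<department dep_id="a0001"/></company>'
)
clash = ET.fromstring(
    '<company><person id="a0001"><name>Khattabi</name></person>'
    '<department dep_id="a0001"/></company>'
)
dangling = ET.fromstring(
    '<company><project member="x9"><title>...</title></project>'
    '<person id="p0001"><name>Khattabi</name></person></company>'
)
```

Unlike primary keys in a relational database, which are unique per table, ID values share one namespace across all element types, which is why the person and department in the second sample clash.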