Introduction to XML and its processing techniques
Presentation Transcript

  1. Introduction to XML and its processing techniques Cheng-Chia Chen 4/22 2003

  2. Outline • What is XML? • A glimpse of XML • Why do we need XML? • Some XML applications • XML and related core specifications • APIs for XML • Combining XML technology with traditional language-processing technology • Other important XML programming technologies • Summary and information for further study

  3. What is XML? • The eXtensible Markup Language. • A data format (syntax) for representing, storing, and transmitting data. • A data-structure definition language: it lets you define the structure and format of your own data. • A text-based markup language that lets you define your own HTML-like markup languages. • Recommended by the World Wide Web Consortium (W3C) in Feb. 1998. • Intended as a new message format for the Internet that makes up for the inadequacies of HTML.

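  4. The idea of XML • Existing student information (student ID, name, department, year, email): • S9010 張得功 資科系 (Computer Science) 三年級 (3rd year) • S9021 王德財 應數系 (Applied Mathematics) 二年級 (2nd year) null • …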

  5. HTML’s concerns • How to present the data: <TABLE BORDER=1 bgcolor="yellow"> <TR><TH>學號</TH> <TH>姓名</TH> <TH>科系</TH> <TH>年級</TH> <TH>電郵</TH> </TR> <TR><TD> S9010</TD><TD>張得功</TD> <TD>資科系</TD> <TD>三年級</TD> <TD> </TD></TR> <TR> <TD> S9021 </TD> <TD>王德財</TD> <TD>應數系</TD> <TD>二年級 </TD> </TR> </TABLE> (Column headers: 學號 student ID, 姓名 name, 科系 department, 年級 year, 電郵 email.)

  6. XML’s concerns • XML uses markup tags as well, but they describe the content rather than the presentation of that content. • The same example coded in XML: <students> <student><學號> S9010 </學號> <姓名>張得功</姓名> <科系>資科系</科系> <年級>三年級</年級> <電郵> </電郵> </student> <student><學號> S9021 </學號> <姓名>王德財</姓名> <科系>應數系</科系> <年級>二年級</年級><電郵/> </student> … </students> Notes: 1. Only content is encoded in the XML text. 2. All data are annotated by tags indicating their roles or functions in the message.

  7. Where does XML come from? • A simplified subset of the Standard Generalized Markup Language (SGML), which was standardized in 1986. • Simplified for more general use on the Web and as a data interchange format: • without losing extensibility, • easier for anyone to write valid XML, • easier to write a parser for, • easier for the parser to quickly verify that documents are well-formed and/or valid. • Recommended by the W3C in Feb. 1998.

  8. A Glimpse of XML

  9. An example XML document <?xml version="1.0"?> <note> <to>Wang</to> <from>Chen</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note> Notes: • The XML declaration is optional in XML 1.0, but it is good practice to include it. • <note>…</note> is the root element, which has 4 children.
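
As a small illustration of the note document above, the following sketch parses it with Python's standard xml.etree.ElementTree module and lists the root element and its children. The variable names are chosen here for illustration only; the slides do not prescribe any particular API.

```python
import xml.etree.ElementTree as ET

# The example document from the slide, reproduced as a string.
NOTE_XML = """<?xml version="1.0"?>
<note>
  <to>Wang</to>
  <from>Chen</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

# fromstring() returns the root (document) element.
root = ET.fromstring(NOTE_XML)

# Iterating over an element yields its child elements in document order.
child_tags = [child.tag for child in root]

print(root.tag)               # name of the root element
print(child_tags)             # names of its children
print(root.findtext("body"))  # character data of the <body> child
```

Iterating over an element visits only its direct children, which is why the root reports exactly the four elements written inside it.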

  10. <!-- the structure of the document element --> <department> <employee id="s8931"> <name>張德治</name> </employee> <employee id="s9017" id-no="L12345678"> <name>李大春</name> <url href = ""/> </employee> </department>

  11. Key terminology • Element • Element type (or element name) • Start tag • End tag • [Element] Content • child element • character data [PCDATA] • Attribute • Attribute name • Attribute value • DTD • Comment • Processing instruction • <?target data?>

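  12. The same document with its parts labeled: <!-- the structure of the document element --> <department> <employee id="s8931"> <name>張德治</name> </employee> <employee id="s9017" id-no="L12345678"> <name>李大春</name> <url href = ""/> </employee> </department> • <department>…</department>: the root (document) element • department, employee, name, url: element types (names) • <employee id="s8931">: a start-tag • </employee>: an end-tag • id="s8931": an attribute, with attribute name id and attribute value "s8931" • 張德治, 李大春: character data (PCDATA)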

  13. Containment Hierarchy of XML Documents

  14. All XML elements must have an end tag • In HTML some elements do not have to have a closing tag. The following code is legal in HTML: <p>This is a paragraph <p>This is another paragraph • In XML all elements must have a closing tag like this: <p>This is a paragraph</p> <p>This is another paragraph</p>

  15. XML tags are case sensitive • XML tags are case sensitive. • <Letter> != <letter> • Opening and closing tags must match with the same case: • <Message>This is incorrect</message> • <message>This is correct</message>

  16. All XML elements must be properly nested • HTML, as handled by browsers, tolerates overlapping elements: <b><i>bold and italic</b> italic only</i> • In XML all elements must be properly nested: <b><i>bold and italic</i> bold only</b>

  17. Single root (document) element • A document contains exactly one root element. • All other elements must be nested within the root element. • Elements can have child elements (subelements), subelements can have their own subelements, and so on. • Which elements and text data may appear as children of an element, and in what order and multiplicity, can be defined by a DTD or XML Schema. <root> <child> <subchild>…</subchild> or text data <subchild>…</subchild> </child> … </root>
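
The syntax rules from the last few slides (every element closed, matching case, proper nesting, a single root) can be checked mechanically. The sketch below uses Python's standard xml.etree.ElementTree parser, which rejects text that is not well-formed; the helper name is_well_formed is hypothetical and introduced only for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Examples taken from the slides, plus one document with two root elements.
checks = {
    "missing end tag": "<p>This is a paragraph",
    "case mismatch": "<Message>This is incorrect</message>",
    "overlapping elements": "<b><i>bold and italic</b> italic only</i>",
    "two root elements": "<p>one</p><p>two</p>",
    "properly nested": "<b><i>bold and italic</i> bold only</b>",
}

for label, text in checks.items():
    print(f"{label}: {is_well_formed(text)}")
```

Note that this only tests well-formedness. Checking validity against a DTD needs a validating parser, which this module is not.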

  18. XML Attributes • Appear within the start tag of an element. • Which attributes may appear in the start tag of an element can be defined by a DTD or XML Schema. • ID attributes are for identification: no two elements in a document instance may have the same ID value. • HTML examples: <img src="computer.gif"> <a href=demo.asp> • XML examples: <file type="gif"> <person id='3344'> Note: • In XML an attribute value must be quoted with ' or ".
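
To make the quoting rule concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It reads an attribute from the slide's person example and shows that an unquoted, HTML-style attribute value is rejected by the parser. The helper name parses is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Attributes are exposed as a string-to-string mapping on the element.
person = ET.fromstring("<person id='3344'/>")
print(person.get("id"))   # the attribute value, always a string
print(person.get("age"))  # None, because no such attribute is present
print(person.attrib)      # all attributes as a dictionary


def parses(text: str) -> bool:
    """Return True if `text` is accepted by the XML parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses('<file type="gif"/>'))     # quoted value: accepted
print(parses("<a href=demo.asp></a>"))  # unquoted value: rejected
```

The uniqueness of ID attributes is a validity constraint declared in a DTD, so a non-validating parser like this one does not check it.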

  19. Well-formed vs. valid XML documents • Well-formed XML documents • Essentially any document conforming to the XML syntax rules described so far. • A text must be well-formed to be an XML document at all. • Example: <?xml version="1.0"?> <note> <to>Wang</to> <from>Chen</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>

  20. Valid XML documents • A valid XML document is a well-formed XML document that also conforms to the grammar attached to it. • The grammar attached to an XML document is called a DTD (Document Type Definition). • A document with a reference to an external DTD: <?xml version="1.0"?> <!DOCTYPE note SYSTEM "Note.dtd"> <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>

  21. DTD • DTD • Document Type Definition; • a grammar for a class of XML documents • used to define the legal building blocks of an XML document. • Document Type Declaration: • Declares the DTD for an XML document; • External subset: // defined at external places <!DOCTYPE note SYSTEM "note.dtd"> • Internal subset: // inline declarations <!DOCTYPE note SYSTEM "externSubset.dtd" [ ……inline markup declarations……… ]>

  22. DTD: markup declarations • Element type declarations • Attribute list declarations • Entity declarations • declare macro-like abbreviations. • <!ENTITY chencc "Cheng-Chia Chen"> • <!ENTITY chapter1 SYSTEM "chapter1.xml"> • <!ENTITY % subDTD SYSTEM "dtd1.dtd"> • Notation declarations • Define types of non-XML data • <!NOTATION png SYSTEM "">
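
The macro-like behavior of an internal general entity can be observed with a short sketch. It declares the chencc entity from the slide in an internal DTD subset and parses a document that references it, using Python's standard xml.etree.ElementTree module. The surrounding note/from document is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# An internal DTD subset declaring one general entity, as on the slide.
DOC = """<!DOCTYPE note [
  <!ENTITY chencc "Cheng-Chia Chen">
]>
<note><from>&chencc;</from></note>"""

root = ET.fromstring(DOC)

# The parser replaces the reference &chencc; with the declared text.
author = root.findtext("from")
print(author)
```

This only covers internal entities. As far as the Python documentation describes it, this parser does not fetch external entities such as the chapter1 example, so those need a different toolchain.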

  23. DTD: Element Type Declaration • Specifies the element type and content: <!ELEMENT Name contentSpec> • Element’s content: • Empty: <!ELEMENT homepage EMPTY> • Any: <!ELEMENT container ANY> • Only elements (element content) • No character data • Mixed: • Character data mixed with elements

  24. DTD: Element content model • Basically represented by a regular expression over element types. • Building blocks: • Choice (p | list | table | form) • Sequence (street, zip, city, country) • Occurrences: ? (zero or one), + (one or more), * (zero or more) • Example: <!ELEMENT person (name, address+, homepage?, (email | telephone)+, note*)>
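
Since a content model is essentially a regular expression over element types, the person declaration above can be transcribed by hand into an ordinary regular expression over the sequence of child element names. The sketch below does this in Python. It is a hand-written translation for illustration only, not how a validating parser is implemented, and the function name matches_person_model is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of: (name, address+, homepage?, (email | telephone)+, note*)
# Each child name is followed by one space so that names cannot run together.
PERSON_MODEL = re.compile(
    r"name (address )+(homepage )?((email|telephone) )+(note )*"
)


def matches_person_model(person: ET.Element) -> bool:
    """Check the order and multiplicity of the children of a <person>."""
    child_names = "".join(child.tag + " " for child in person)
    return PERSON_MODEL.fullmatch(child_names) is not None


good = ET.fromstring(
    "<person><name/><address/><address/><homepage/>"
    "<telephone/><email/><note/></person>"
)
bad = ET.fromstring("<person><name/><email/></person>")  # no <address>

print(matches_person_model(good))
print(matches_person_model(bad))
```

The sequence operator becomes concatenation, the choice operator becomes alternation, and the occurrence indicators keep their usual regular-expression meaning.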

  25. DTD: Mixed element content • can contain either • other elements and character data or • only character data • Examples: <!ELEMENT para (#PCDATA |em | strong | abbr )* > <!ELEMENT p (#PCDATA |em | i | b | a| ul)*> <!ELEMENT street (#PCDATA)> <!ELEMENT city (#PCDATA)>

  26. DTD: Attribute List Declaration • Defines the attributes that can appear in an element type. • Format: <!ATTLIST elName attrName1 attrType1 attrDefault1 attrName2 attrType2 attrDefault2 ………………………………… > • Attribute types: • String type • Tokenized type • Enumerated type

  27. DTD: ATTLIST Attribute Type • String type: <!ATTLIST person age CDATA #IMPLIED> • Tokenized types: • ID, IDREF, IDREFS • ENTITY, ENTITIES • NMTOKEN, NMTOKENS <!ATTLIST person id ID #REQUIRED father IDREF #REQUIRED children IDREFS #IMPLIED> • Enumerated type: <!ATTLIST person gender (Male|Female) #REQUIRED>

  28. DTD: ATTLIST Attribute defaults • Provide information about the attribute’s presence: • #REQUIRED • The attribute must appear in the associated element. • <!ATTLIST person gender (Male|Female) #REQUIRED> • #IMPLIED • The attribute may be absent. • No default value. • <!ATTLIST person age CDATA #IMPLIED> • Default/constant value • <!ATTLIST list type (ol|ul) "ul"> • <!ATTLIST list type (ol|ul) #FIXED "ul">
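
The effect of the four kinds of attribute default can be sketched as a small function. Everything here is an illustrative assumption: the function apply_attribute_defaults, the tuple encoding of declarations, and the "DEFAULT" label are invented for this example and are not part of any XML library. The sketch also ignores attribute types, so it does not check enumerated values such as (ol|ul).

```python
def apply_attribute_defaults(attrs, declarations):
    """Return the attributes of one element after applying DTD-style defaults.

    `declarations` maps an attribute name to (kind, value), where kind is
    "#REQUIRED", "#IMPLIED", "#FIXED", or "DEFAULT" (a plain default value).
    """
    result = dict(attrs)
    for name, (kind, value) in declarations.items():
        if kind == "#REQUIRED":
            if name not in result:
                raise ValueError(f"missing required attribute {name!r}")
        elif kind == "#FIXED":
            # Supplied if absent; if present it must equal the fixed value.
            if result.setdefault(name, value) != value:
                raise ValueError(f"attribute {name!r} must be {value!r}")
        elif kind == "DEFAULT":
            result.setdefault(name, value)
        # "#IMPLIED": the attribute may be absent and nothing is supplied.
    return result


LIST_DECLS = {"type": ("DEFAULT", "ul")}
PERSON_DECLS = {"gender": ("#REQUIRED", None), "age": ("#IMPLIED", None)}

print(apply_attribute_defaults({}, LIST_DECLS))
print(apply_attribute_defaults({"type": "ol"}, LIST_DECLS))
print(apply_attribute_defaults({"gender": "Male"}, PERSON_DECLS))
```

With a plain default, a missing type attribute is filled in as "ul" while an explicit "ol" is kept. With #FIXED, an explicit "ol" would be an error.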

  29. Why do we need XML ?

  30. XML unifies the syntax of information • Layers of information (data): • bit • byte • character: BCD, EBCDIC, ASCII, BIG5, ISO-8859 ==> Unicode • syntax (form): XML • semantics (ontology): Semantic Web • application • Semantic Web: • an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. • --- Tim Berners-Lee

  31. New requirements in the Internet age • Easy retrieval of information over the net • realized by current Web/Internet technology • good browsers • web servers • HTTP, DNS, search engines • HTML, URI, hypertext, MIME • Easy/cheap interoperation of existing software over the Internet • also the old goal of distributed systems/computing • RPC, RMI, CORBA, ... • a prerequisite for e-commerce • issues: • data transmission ==> solved by existing Internet infrastructure • data representation?

  32. Why do we need a unifying format for data? • Case: n = 10 word processors, each of which needs to process documents generated by any other. • 1st approach: • write a converter A-->B for every ordered pair of distinct A and B. • number of converters = n x (n-1) = 90 (bad!) • 2nd approach: • invent a common format (C). • write a pair of converters (A-->C, C-->A) for each word processor. • to process in B a document generated by A, simply chain: A ==(A-->C)==> C ==(C-->B)==> B • required converters: 2 x n = 20 (much better!) • prerequisite: a common format. • This is the role XML plays!
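
The converter counts on this slide follow directly from the two formulas. The short sketch below simply evaluates them; the function names are chosen here for illustration.

```python
def pairwise_converters(n: int) -> int:
    """One converter A-->B for every ordered pair of distinct formats."""
    return n * (n - 1)


def hub_converters(n: int) -> int:
    """One converter to and one from the common format, per format."""
    return 2 * n


for n in (3, 10, 100):
    print(n, pairwise_converters(n), hub_converters(n))
```

For n = 10 this gives 90 versus 20. The gap widens quickly because the first count grows quadratically in n and the second only linearly.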

  33. Example: XML in EDA (Electronic Design Automation)

  34. Additional benefits of XML (as a common format) • Enables the interoperation of internet/intranet/extranet software and services. • Software for processing XML is free or cheap to obtain: • no need to reinvent the wheel. • developers can focus on value-added software built on this underlying software. • Decouples tightly coupled distributed systems into loosely coupled ones: • less monopolization of software by vendors • more combinations for buyers to choose from • more chances for small companies to contribute software • less investment for buyers.

  35. Comparison of XML with Other formats • HTML • Text-based non-markup formats • .c .cpp .java .ini … • Binary formats • .dll .exe .o .swf • .class .png .jpeg …

  36. Advantages of XML over HTML • XML lets you define your own tags. • XML tags describe the content rather than the presentation of that content: • easier content search (no distracting presentation data) • easier page development (content separated from view) • easier for devices to render the content according to their environment (single model / multiple views)

  37. Advantages of XML over text formats Examples: • JavaML vs. Java; CppML vs. C++ • XMI vs. Rational's proprietary format • web.xml, plugin.xml vs. ***.ini (for configuration) • build.xml vs. makefile • XQuery XML format vs. plain text format • RelaxNG XML format vs. plain text format • advantages: • structure is explicitly represented in the XML format. • (free and) standard tools (and APIs) exist for quick parsing of the XML format. => front-end processing avoided/reduced • disadvantage: too verbose • for storage and transmission: can be overcome by compression • for human generation (not a problem for machine generation): requires a smarter editor • for human reading/comprehension: a real problem!

  38. Advantages of XML over binary formats • Examples: • ASN.1 XER encoding rules vs. BER/CER/DER/PER • classML vs. the .class file format • swfml vs. swf (Flash file format) • advantages: • readable; editable • (free and) open software and APIs available • disadvantage: • takes longer to parse. The trend: • one data model / multiple representation formats + • converters among the formats.

  39. Some XML Applications

  40. Some XML applications • An XML application is a language adopting the XML syntax [usually defined by a DTD/Schema]. • XML as an alternative representation format • Scalable Vector Graphics (SVG): for vector graphics • MathML: for mathematical expressions • SMIL (Synchronized Multimedia Integration Language) • Resource Description Framework (RDF): an XML language for describing web resources and their relationships • CML (Chemical Markup Language): for chemical molecules • JavaML: for Java programs • CppML: XML format for C++ • Ant: a replacement for make for Java • Maven: a Java project management and project comprehension tool • OOML: an object-oriented programming language in XML • UIML: User Interface Markup Language • WAP WML (Wireless Markup Language) • See The XML Cover Pages for an extensive listing.

  41. Mathematical Markup Language <?xml version="1.0"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN" ""> <html xmlns="" xmlns:m="" > <head> <title>Fiat Lux</title> </head> <body> <p> And God said, </p> <m:math> <m:mrow> <m:msub> <m:mi>&delta;</m:mi> <m:mi>&alpha;</m:mi> </m:msub> <m:msup> <m:mi>F</m:mi> <m:mi>&alpha;&beta;</m:mi> </m:msup> <m:mi> </m:mi> <m:mo>=</m:mo> <m:mi></m:mi> <m:mfrac> <m:mrow> <m:mn>4</m:mn> <m:mi>&pi;</m:mi> </m:mrow> <m:mi>c</m:mi> </m:mfrac> <m:mi> </m:mi> <m:msup> <m:mi>J</m:mi> <m:mrow> <m:mi>&beta;</m:mi> <m:mo> </m:mo> </m:mrow> </m:msup> </m:mrow> </m:math> <p> and there was light </p> </body> </html>

  42. Vector Graphics • Scalable Vector Graphics (SVG) • Adobe SVG Viewer • Apache Batik SVG toolkit • Vector Markup Language (VML) • Internet Explorer 5.0 or above • Microsoft Office 2000

  43. Example

  44. Ant • A make-like build tool • Sample build.xml <project default="echoFoo" name="ant-test" basedir="."> <property name="foo5.1" value="${foo5}"/> <target name="writeFoo3Bar3"> <echo message="foo3 = bar3" file=""/> </target> <target name="readWriteFoo4.1Foo4"> <echo message="foo4.1 = ${foo4}" file=""/> </target> <target name="readWriteFoo5.1Foo5InStart"> <echo message="foo5.1 = ${foo5.1}" file=""/> </target> <target name="echoFoo"> <echo message="${foo}"/> </target> </project>

  45. XML and related Core Specifications

  46. Major W3C XML Technologies

  47. Related technologies • XML is a key technology to ensure interoperability • But XML, by itself, is not really useful... we need to • have datatypes, validation (DTDs, Schemas, ...) • mix XML applications (Namespaces) • link (XLink, XBase, ...) • compose/decompose (XInclude, Fragments, ...) • refer to XML data content (XPath, Query, ...) • transform (XSLT) • encrypt, decrypt, sign (Signature, Encryption, ...) • interact, script (DOM, Events, ...) • etc.

  48. Core specifications for XML • XML 1.0 • XML Namespaces • XML Path Language (XPath) • Extensible Stylesheet Language (XSL) • XSL Transformations (XSLT) • XSL Formatting Objects (XSL-FO) • XML Linking Language (XLink) • XML Pointer Language (XPointer) • XML schemas (; RelaxNG) • XHTML • XML signatures/canonicalization • XML protocols • XForms • XQuery (XML language for querying XML documents)

  49. Core Specifications for XML • XML • a specification that defines which texts are well-formed XML documents. • Document Type Definition (DTD): a mechanism for defining the format and content of valid XML documents. • XML Namespaces • define a mechanism to avoid collisions of element and/or attribute names in documents that use multiple sets of DTDs. • XLink • defines a mechanism for linking to web resources from an XML document. • XPointer • defines a mechanism for addressing locations inside an XML document. • XPath • defines a mechanism to refer to parts of an XML document.
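
To give a feel for how XPath refers to parts of a document, the sketch below runs a few path expressions against the department example from the earlier slides. It relies on Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath, so this is a rough illustration rather than a full XPath engine. The variable names are chosen for this example.

```python
import xml.etree.ElementTree as ET

# The department document from the earlier slide (the empty href is kept as-is).
DEPARTMENT_XML = """<department>
  <employee id="s8931"><name>張德治</name></employee>
  <employee id="s9017" id-no="L12345678">
    <name>李大春</name>
    <url href=""/>
  </employee>
</department>"""

root = ET.fromstring(DEPARTMENT_XML)

# A path of child steps: every <name> under every <employee> of the root.
all_names = [name.text for name in root.findall("./employee/name")]

# A predicate on an attribute value selects one particular employee.
selected = root.find(".//employee[@id='s9017']/name").text

# A predicate on a child element: employees that have a <url> child.
with_url = [emp.get("id") for emp in root.findall("./employee[url]")]

print(all_names)
print(selected)
print(with_url)
```

Each expression names a location by structure (element names, attribute values, presence of children), not by character offset, which is what lets XSLT, XPointer, and XQuery build on XPath.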