
COMS E6125 Web-enHanced Information Management (WHIM)






Presentation Transcript


  1. COMS E6125 Web-enHanced Information Management (WHIM) Prof. Gail Kaiser Spring 2008 Kaiser: COMS E6125

  2. Today’s Topic: Markup Languages
     • History of markup languages
     • SGML = Standard Generalized Markup Language
     • HTML = HyperText Markup Language
     • XML = eXtensible Markup Language

  3. What is Markup?
     • Special text (“mark”) that is added to the regular text of a document in order to convey some information about it
     • A markup language is a formalized way of providing markup, and specifies:
        • what markup is allowed (the lexicon)
        • what markup is required
        • how markup is distinguished from content text
        • what the markup “means”

  4. Specific Coding
     • Historically, electronic manuscripts contained procedural control codes (markup) that caused the text to be formatted in a particular way
        • tj6
        • troff
        • TeX

  5. Procedural Markup
     • Advantages:
        • Instructs agent how to process text
        • Generally concerned with formatting and presentation
        • Is “efficient” because it requires little further interpretation
     • Disadvantages:
        • Often specific to one proprietary processing system
        • Usually ties a document to a single purpose
           • printing on paper
           • viewing on a screen
        • Provides no information on “meaning”

  6. Markup Steps
     • Author first analyzes the information structure and other attributes of the document; that is, s/he identifies each meaningful separate element, and characterizes it as a paragraph, heading, ordered list, footnote, or some other element type
     • Author then determines, from memory or a style book, the processing instructions (“marks”) that will produce the format desired for that type of element
     • Finally, s/he inserts the chosen marks into the text

  7. Example Specific Coding
        .SK 1
        Text processing and word processing systems typically require additional information to be interspersed among the natural text of the document being processed. This added information, called "markup", serves two purposes:
        .TB 4
        .OF 4
        .SK 1
        1.#Separating the logical elements of the document; and
        .OF 4
        .SK 1
        2.#Specifying the processing functions to be performed on those elements.
        .OF 0
        .SK 1
     • Slide callouts: .TB = TaB stop, .OF = OFfset, .SK = SKipping vertical space

  8. Generic Coding
     • In contrast, generic (or generalized, or descriptive) coding uses descriptive tags (e.g., “heading”)
        • Scribe
        • LaTeX
        • HTML

  9. Descriptive Markup
     • Advantages:
        • Identifies the logical components of a document
        • Generally concerned with what text is
        • Does not specify what procedures are to be applied to text
        • Therefore requires that other process(es) supply formatting and presentation

  10. Descriptive Markup
     • Advantages (continued):
        • Is (usually) human and machine readable
        • Identifies information content
        • Is not directed towards a particular purpose or rendition of the document
        • Therefore can be non-proprietary

  11. Markup Steps
     • Author first analyzes the information structure and other attributes of the document; that is, s/he identifies each meaningful separate element, and characterizes it as a paragraph, heading, ordered list, footnote, or some other element type (same as for specific coding)
     • Author then associates each significant element with the mnemonic tag (“mark”) that s/he feels best characterizes it

  12. Example Generic Coding
        <p>
        Text processing and word processing systems typically require additional information to be interspersed among the natural text of the document being processed. This added information, called <em>markup</em>, serves two purposes:
        <ol>
        <li>Separating the logical elements of the document; and
        <li>Specifying the processing functions to be performed on those elements.
        </ol>

  13. The Case for Generalized Markup
     • Markup should describe a document's structure and other attributes rather than specify processing to be performed on it, so markup need be done only once and will suffice for all future processing
     • Markup should be rigorous so that the techniques available for rigorously-defined objects like programs and databases can be used for processing documents as well
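The "mark up once, process many ways" argument can be illustrated with a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` module and a version of the generic-coding example with end tags added so that an XML parser accepts it. The function names (`inline_text`, `to_plain_text`, `list_items`) are hypothetical and not part of the lecture.

```python
import xml.etree.ElementTree as ET

# Descriptive source adapted from the generic-coding slide. End tags and a
# <doc> wrapper are added here (an assumption) so a stock XML parser reads it.
SOURCE = """<doc>
<p>This added information, called <em>markup</em>, serves two purposes:</p>
<ol>
<li>Separating the logical elements of the document; and</li>
<li>Specifying the processing functions to be performed on those elements.</li>
</ol>
</doc>"""


def inline_text(element):
    """Flatten an element's mixed content, marking <em> with asterisks."""
    parts = [element.text or ""]
    for child in element:
        text = inline_text(child)
        parts.append(f"*{text}*" if child.tag == "em" else text)
        parts.append(child.tail or "")
    return "".join(parts).strip()


def to_plain_text(root):
    """First use of the markup: format the document for a text terminal."""
    lines = []
    for block in root:
        if block.tag == "p":
            lines.append(inline_text(block))
        elif block.tag == "ol":
            for number, item in enumerate(block, start=1):
                lines.append(f"  {number}. {inline_text(item)}")
    return "\n".join(lines)


def list_items(root):
    """Second use of the same markup: retrieve every list item, no formatting."""
    return [inline_text(item) for item in root.iter("li")]


if __name__ == "__main__":
    document = ET.fromstring(SOURCE)
    print(to_plain_text(document))
    print(list_items(document))
```

Neither function is mentioned anywhere in the source text itself; the tags only say what each piece of text is, which is what lets a formatter and a retrieval tool share one document.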

  14. Who Invented Markup?
     • Specialized markup: ???
     • Generalized markup:
        • Many credit William Tunnicliffe, chairman of the Graphic Communications Association Composition Committee, who presented a talk on the separation of information content of documents from their format during a meeting at the Canadian Government Printing Office, September 1967
        • Others credit Stanley Rice, a New York book designer, who proposed the idea of a universal catalog of parameterized editorial structure macros in several articles, e.g., "Editorial Text Structures," Memorandum to Standards Planning and Requirements Committee, ANSI, March 17, 1970

  15. An Early Implementation
     • At IBM in 1969, Charles Goldfarb, Ed Mosher and Ray Lorie invented Generalized Markup Language (GML) as part of a law office project integrating text editing with information retrieval and page composition
     • Instead of a simple tagging scheme, GML introduced the concept of a formally-defined document type (DTD = Document Type Definition) with an explicit nested element structure
     • By 1971 developed first DTD, for the manuals for IBM's “Telecommunications Access Method”, which enabled all the headings of a given head-level to be automatically formatted identically
     • Productized in 1973 in IBM’s Document Composition Facility (DCF)

  16. Example GML
        :h1.Chapter 1: Introduction
        :p.GML supported hierarchical containers, such as
        :ol
        :li.Ordered lists (like this one),
        :li.Unordered lists, and
        :li.Definition lists
        :eol.
        as well as simple structures.
        :p.Markup minimization (later generalized and formalized in SGML), allowed the end-tags to be omitted for the "h1" and "p" elements.

  17. SGML = Standard GML
     • Standardization effort started in 1978, when ANSI (American National Standards Institute) created the Computer Languages for the Processing of Text Committee
     • Series of draft standards 1980-1986 (1983 version adopted by IRS and DoD); ISO (International Organization for Standardization) joined the ANSI effort in 1984
     • Final international standard in 1986, based in part on an SGML system developed by Anders Berglund, then of the European Particle Physics Laboratory (CERN)
     • Hmm… isn’t CERN where Tim Berners-Lee invented the “World Wide Web” in 1989?

  18. SGML
     • A metalanguage (grammar)
        • How to write tags, how to define the document structure
     • Structural paradigm is that of
        • an inverted tree structure, a root component branching out into leaves
        • or a series of nested containers
     • Defines three kinds of objects
        • Elements are the basic structural components
        • Attributes are qualities of elements
        • Entities are a short representation of special characters

  19. SGML Pro and Con
     • Advantages:
        • Documents held in a standards-based, non-proprietary, platform-independent storage format
        • Scope for document re-use and re-presentation, enhancement of retrieval possibilities
        • Easy to process
        • Can (optionally) validate against DTDs
     • Disadvantages:
        • Remained a niche market in the 1980s, unknown to the masses
        • Not well supported by the major document processing vendors, tools expensive

  20. Then Came the Web…
     • HyperText Markup Language (HTML) is derived from SGML
     • As an SGML-compliant language, it has a DTD with a fixed set of tags
     • Initially, the number of tags was very limited (~10), so they were very easy to remember and to use

  21. HTML Example
        <html>
          <head>
            <title> My title </title>
          </head>
          <body>
            <h1> A huge heading </h1>
            <h2> A smaller one </h2>
            <ul>
              <li> a list item in <b>bold</b> </li>
              <li> a list item in <i>italics</i> </li>
            </ul>
            <p> A paragraph </p>
          </body>
        </html>

  22. Another HTML Example
     • From original IETF Internet Draft for HTML
        See <A HREF="http://info.cern.ch/">CERN</A>'s information for more details.
        A <A NAME=serious>serious</A> crime is one which is associated with imprisonment.
        The Organization may refuse employment to anyone convicted of a <a href="#serious">serious</A> crime.
        Warning: <IMG SRC="triangle.gif" ALT="Warning:"> This must be done by a qualified technician.
        <A HREF="Go"><IMG SRC="Button"> Press to start</A>

  23. HTML Pro and Con
     • Advantages
        • Simple to learn and to use
        • Easy to create from scratch or by converting legacy text files
        • Easy to parse and render
     • Drawbacks
        • Syntaxless
        • Much more a presentation language than a structural language
        • Too limited, not a good substitute for a word processor

  24. HTML History
     • 1990: First implementation by TBL on a NeXT computer at CERN
        • Used SGML tools to create original HTML language (DTD, parser)
        • Scalability and simplicity of HTML (and HTTP), compared to OHS or Gopher, were part of the basis for WWW success
     • 1991-1992: Various text-only and graphical browsers developed, the latter usually platform-specific

  25. HTML History
     • 1993: NCSA Mosaic
        • First widely available graphical WWW browser (Unix X-Windows and Mac)
        • Developed primarily by UIUC undergraduate Marc Andreessen
        • The killer application of the Internet is born and the number of Web servers explodes
     • 1994: Competition
        • Mosaic team leaves NCSA to found Netscape
        • Microsoft adopts the Web (Internet Explorer bundled with Windows 95)
        • Divergence of supported HTML tags between Internet Explorer and Netscape -> browser wars
        • HTTP traffic becomes more common than telnet and ftp

  26. HTML History
     • 1994-1995: HTML 2.0 adds image maps, forms
     • 1995 and beyond: Commercial websites
        • Java development started (as “Oak”) for programming set-top boxes in 1991, BIG FAILURE, but launched on the Web in March 1995 (in HotJava) and May 1995 (in Netscape), BIG SUCCESS
        • Amazon.com opens in July 1995
        • “dot com” era begins (and soon ends)

  27. HTML History
     • Jan 1997: HTML 3.2 adds tables, applets, text flow around images, superscripts and subscripts
     • Dec 1997: HTML 4.0 adds frames, cascading style sheets, more multimedia options, scripting languages, web accessibility conventions, internationalization

  28. XHTML = eXtensible HyperText Markup Language
     • XHTML 1.0 W3C Recommendation January 2000, revised August 2002 (XHTML 1.1 still working draft)
     • Made element and attribute names case-sensitive (in particular, use lowercase)
     • Include end tags, e.g., <p> … </p>
     • Add a “/” to empty elements, e.g., <br/> and <hr/>
     • Quote all attribute values, e.g., <img src="duck.jpg" alt="A Duck"/>
     • Most browsers still work fine with older HTML

  29. Where did the “X” come from?
     • XML = eXtensible Markup Language
     • XHTML is a reformulation of HTML 4.x in XML
     • XHTML can be used in conjunction with other XML vocabularies
        • SMIL (Synchronized Multimedia Integration Language)
        • SVG (Scalable Vector Graphics)
        • MathML (Mathematical Markup Language)
        • Plus hundreds dedicated to specific applications (the extensible part)

  30. What is XML for?
     • The universal markup format for structured documents and data on the Web
     • For data exchange (messages) and persistent data
        • Syntax
        • Data Modeling
        • Data Processing

  31. XML History
     • XML 1.0 became a W3C Recommendation in February 1998, revised several times, most recently September 2006
     • XML 1.1 draft released Nov 2003, recommendation last revised September 2006 (addresses various issues wrt Unicode and mainframe compatibility)
     • Conceptually an SGML descendant
     • Unlike SGML, it quickly became widespread

  32. SGML -> XML
     • Like SGML, XML is a grammar (or a metalanguage), NOT a specific language
     • Specification simplified
        • SGML spec ~600 pages
        • XML spec 36 pages (initial 1.0) -> 54 pages (1.1 2nd edition)
     • Parsing made simpler through two-level mechanism
        • Well-formed
        • Valid

  33. Well-Formed
     • (Optionally) starts with XML declaration
          <?xml version="1.0"?>
     • Rest of document inside the root element
          <myroot>…</myroot>
     • All text contained in some element
          <someelement>text text text</someelement>
     • Explicit empty elements
          <anotherelement></anotherelement>
          <anotherelement/>

  34. Well-Formed
     • Element tags must be properly nested (no crossing tags)
          NO: <i><b>blah blah blah</i></b>
     • Start and end tags must match exactly (same case)
     • Quotes placed around all attribute values
          <a href="stuff.html">stuff</a>
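The well-formedness rules on the two slides above can be checked mechanically. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` parser (which checks well-formedness only, not validity); the helper `is_well_formed` and the `SAMPLES` table are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return (True, None) if text parses as XML, else (False, reason)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        return False, str(error)
    return True, None


# One sample per rule; the labels describe the rule being exercised.
SAMPLES = {
    "properly nested": "<p><i><b>blah blah blah</b></i></p>",
    "crossing tags": "<p><i><b>blah blah blah</i></b></p>",
    "case mismatch": "<P>text</p>",
    "unquoted attribute": "<a href=stuff.html>stuff</a>",
    "text outside root": "<a/>stray text",
    "empty element": "<anotherelement/>",
}

if __name__ == "__main__":
    for label, sample in SAMPLES.items():
        ok, reason = is_well_formed(sample)
        print(f"{label:20} {'well-formed' if ok else 'rejected: ' + reason}")
```

A parser that stops at the first violation is the behavior the XML specification asks for, in contrast to the forgiving HTML parsing that browsers perform.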

  35. Valid
     • Well-formed, plus
     • Conforms to a DTD or Schema
        • tags and attributes are all declared
        • tags and attributes are used correctly
     • XML browsers and editors usually require validity
     • Other tools might not (e.g., search engines)

  36. XML Goes Beyond Document Processing
     • XML more oriented to distributed computing than to document markup
     • Thus complements rather than replaces HTML (or XHTML)
     • DOM = Document Object Model
     • SAX = Simple API for XML
     • SOAP = Simple Object Access Protocol
     • Web Services
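DOM and SAX are the two processing styles named above: DOM builds the whole document as a tree in memory, while SAX reports parsing events to a handler as it reads. A minimal sketch of the difference, assuming Python's standard `xml.dom.minidom` and `xml.sax` modules; the sample message and the names `dom_child_names`, `TagCounter`, and `sax_tag_counts` are hypothetical.

```python
import xml.sax
from xml.dom import Node, minidom

# Hypothetical sample document, not taken from the lecture.
MESSAGE = "<message><to>1970s</to><line>first</line><line>second</line></message>"


def dom_child_names(text):
    """DOM style: build the full tree, then navigate it."""
    document = minidom.parseString(text)
    root = document.documentElement
    return [
        node.tagName
        for node in root.childNodes
        if node.nodeType == Node.ELEMENT_NODE
    ]


class TagCounter(xml.sax.ContentHandler):
    """SAX style: react to start-tag events; no tree is ever built."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def sax_tag_counts(text):
    handler = TagCounter()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.counts


if __name__ == "__main__":
    print(dom_child_names(MESSAGE))
    print(sax_tag_counts(MESSAGE))
```

The trade-off is the usual one: the DOM version allows random access but holds the entire document in memory, while the SAX version scales to large inputs but must track any context it needs itself.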

  37. Let’s Reinvent XML
     • Someone in the far future sends a message in a virtual bottle, containing parts of the universal library of human and post-human literature, back into the 1970s when ...
        • … the Web, XML, P2P, Java were unheard of
        • … computer manufacturers talked about mips and kilobytes
        • … music was played by rotating vinyl discs under a diamond-tip stylus or on cassette tapes

  38. … and Microsoft looked like [image not included in transcript]

  39. The Message in the Bottle, 1st try ÐÏ^Qࡱ^Zá^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@>^@^C^@þÿ^@^F^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@#^@^@^@^@^@^@^@^@^P^@^@%^@^@^@^A^@^@^@þÿÿÿ^@^@^@^@"^@^@^@ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á^@q^@^D^@^@^@^R¿^@^@^@^@^@^@^P^@^@^@^@^@^D^@^@Ç^G^@^@^N^@bjbjt+t+^@^@^@ ^@Some Quotations from the Universal Library^M1 Famous Quotes^M1.1 By William I^M[2, Sonnet XVIII]^MShall I compare thee to a summer's day?^MThou art more lovely and more temperate.^MRough winds do shake the darling buds of May,^MAnd summer's lease hath all too short a date.^MSometime too hot the eye of heaven shines,^MAnd often is his gold complexion dimmed.^MAnd every fair from fair some declines,^MBy chance or nature's changing course untrimmed.^MBut thy eternal summer shall not fade,^MNor lose possession of that fair thou owest,^MNor shall Death brag thou wander'st in his shade^MWhile in eternal lines to time thou growest.^MSo long as men can breathe, or eyes can see,^MSo long live this, and this gives life to thee.^M1.2 ^M[2] W. Shakespeare. The Sonnets of Shakespeare.609.^M^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ Kaiser: COMS E6125

  40. The Message in the Bottle, 2nd try
        \documentclass{article}
        \begin{document}
        \title{Some Quotations from the Universal Library}
        ...
        \section{Famous Quotes}
        \subsection{By William I}
        \textbf{\cite[Sonnet XVIII]{shakespeare-sonnets-1609}}
        \begin{verse}
        Shall I compare thee to a summer's day?\\
        Thou art more lovely and more temperate. \\
        Rough winds do shake the darling buds of May, \\
        …
        \end{verse}
        \bibliographystyle{abbrv}
        \bibliography{msg}
        \end{document}

  41. The Message in the Bottle, finally
        <?xml version="1234.56"?>
        <universal_library>
          <books>
            <book>
              <title>Some Quotations from the Universal Library</title>
              <section>
                <title>Famous Quotes</title>
                <subsection>
                  <title>By William I</title>
                  <quote bibref="shakespeare-sonnets-1609">
                    <title>Sonnet XVIII</title>
                    <verse>
                      <line>Shall I compare thee to a summer's day?</line>
                      <line>Thou art more lovely and more temperate. </line>
                      <line>Rough winds do shake the darling buds of May, </line>
                      …
                    </verse>
                  </quote>
                </subsection>
              </section>
            </book>
            …
          </books>
        </universal_library>

  42. XML as a Self-Describing Data Exchange Format
     • Someone from the 1970s receives the message in the virtual bottle, and it …
        • … can be easily “understood” (even using CP/M & edlin)
        • … can be parsed easily
        • … allows the application programmer to rediscover schema and semantics (sort of…)
        • … may include an explicit schema description
        • … allows separation of marked-up content from presentation

  43. XML Anatomy
        <bibliography>
          <paper ID="goto">
            <authors>
              <author>Edsger W. Dijkstra</author>
            </authors>
            <title>Go To Statement Considered Harmful</title>
            <booktitle>Communications of the ACM</booktitle>
            <year>1968</year>
            <fullPaper source="harmful"/>
          </paper>
        </bibliography>
     • Slide callouts: element name; element; attribute name; attribute value (attributes cannot contain elements); element content; character content; number content; empty element
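Each part labeled on the anatomy slide corresponds to something a parser exposes directly. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the variable names are hypothetical, and the document is the slide's bibliography with straight quotes.

```python
import xml.etree.ElementTree as ET

BIBLIOGRAPHY = """<bibliography>
  <paper ID="goto">
    <authors>
      <author>Edsger W. Dijkstra</author>
    </authors>
    <title>Go To Statement Considered Harmful</title>
    <booktitle>Communications of the ACM</booktitle>
    <year>1968</year>
    <fullPaper source="harmful"/>
  </paper>
</bibliography>"""

root = ET.fromstring(BIBLIOGRAPHY)
paper = root.find("paper")

element_name = paper.tag                  # element name
attribute_value = paper.get("ID")         # value of the attribute named ID
title = paper.findtext("title")           # character content
# "Number content" arrives as text; the application converts it explicitly.
year = int(paper.findtext("year"))
# Element content: <authors> holds child elements rather than text.
author_names = [author.text for author in paper.find("authors")]
# An empty element has no text and no children, but may carry attributes.
full_paper = paper.find("fullPaper")
is_empty = full_paper.text is None and len(full_paper) == 0

if __name__ == "__main__":
    print(element_name, attribute_value, title, year, author_names, is_empty)
```

Note that attribute values are always plain strings, which matches the slide's remark that attributes cannot contain elements.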

  44. Perspectives on XML
     • Document (SGML) Community
        • data = linear text documents
        • markup (annotate) text to describe context, structure, semantics
     • Database Community
        • XML as a prominent example of the semi-structured data model
        • captures the whole spectrum from highly structured, regular data to unstructured data
     • “XML is the cure for your data exchange, information integration, e-commerce, … problems” (also cures baldness, lose 28 pounds in 14 days, get rich quick, …)

  45. Pure XML - Instance Model
        <A>
          <B>foo</B>
          <C>bar</C>
          <C>psl</C>
        </A>
     [Figure: the same document drawn as nested boxes and as a tree with root A and children B: "foo", C: "bar", C: "psl"; children are ordered]
     • XML 1.0 implicit data model (infoset):
        • nested containers ("boxes within boxes")
        • labeled ordered trees (= semistructured data model)
        • relational, object-oriented easy to encode
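The labeled ordered tree on this slide can be printed straight from a parser. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the function name `outline` is hypothetical.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return one indented line per element, in document order."""
    text = (element.text or "").strip()
    label = f"{element.tag}: {text!r}" if text else f"{element.tag}:"
    lines = ["  " * depth + label]
    for child in element:  # iteration preserves child order
        lines.extend(outline(child, depth + 1))
    return lines


TREE = ET.fromstring("<A><B>foo</B><C>bar</C><C>psl</C></A>")

if __name__ == "__main__":
    print("\n".join(outline(TREE)))
```

Because children are ordered, swapping the two C elements would produce a different tree even though the set of labels and values is unchanged; that is one place where this model differs from an unordered relational row.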

  46. Identifying Vocabularies
     • My element may not be your element:
        • geometry context: <element>line</element>
        • chemistry context: <element>oxygen</element>

  47. Identifying Vocabularies
     • An XML Schema (with XML 1.1) defines a vocabulary of names of type definitions, element and attribute declarations [Schema ~= new improved DTD]
     • Use XML Namespaces (with XML 1.1) to identify which vocabulary
        • Simple method for qualifying element and attribute names used in XML documents
        • Useful when a single XML document contains elements and attributes that are defined for and used by multiple software modules

  48. Namespace Scoping
     • XML namespaces are declared with an xmlns attribute, which can associate a prefix with the namespace
     • The declaration is in scope for the element containing the attribute and all its descendants
        <html:html xmlns:html='http://www.w3.org/1999/xhtml'>
          <html:head>
            <html:title>Frobnostication</html:title>
          </html:head>
          <html:body>
            <html:p>Moved to <html:a href='http://frob.example.com'>here.</html:a></html:p>
          </html:body>
        </html:html>

  49. Namespace Defaulting
        <?xml version="1.1"?>
        <!-- elements are in the HTML namespace, in this case by default -->
        <html xmlns='http://www.w3.org/1999/xhtml'>
          <head>
            <title>Frobnostication</title>
          </head>
          <body>
            <p>Moved to <a href='http://frob.example.com'>here</a>.</p>
          </body>
        </html>

  50. Multiple Namespaces
     • All element types are prefixed
        <bk:book xmlns:bk='urn:loc.gov:books'
                 xmlns:isbn='urn:ISBN:0-395-36341-6'
                 xmlns:money='urn:Finance:AllAboutMoney'>
          <bk:title>Cheaper by the Dozen</bk:title>
          <isbn:number>1568491379</isbn:number>
          <bk:price money:currencySymbol="$">99.99</bk:price>
        </bk:book>
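The three namespace slides come down to one rule: a name is identified by its namespace URI plus its local part, and the prefix is only shorthand. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`, which spells a qualified name as `{uri}local`. The helper `qualified_names` is hypothetical, the XHTML samples are shortened from the slides, and the XML declaration is left out.

```python
import xml.etree.ElementTree as ET

PREFIXED = (
    "<html:html xmlns:html='http://www.w3.org/1999/xhtml'>"
    "<html:head><html:title>Frobnostication</html:title></html:head>"
    "</html:html>"
)
DEFAULTED = (
    "<html xmlns='http://www.w3.org/1999/xhtml'>"
    "<head><title>Frobnostication</title></head>"
    "</html>"
)
BOOK = """<bk:book xmlns:bk='urn:loc.gov:books'
         xmlns:isbn='urn:ISBN:0-395-36341-6'
         xmlns:money='urn:Finance:AllAboutMoney'>
  <bk:title>Cheaper by the Dozen</bk:title>
  <isbn:number>1568491379</isbn:number>
  <bk:price money:currencySymbol="$">99.99</bk:price>
</bk:book>"""


def qualified_names(text):
    """List every element name in document order as '{uri}local'."""
    return [element.tag for element in ET.fromstring(text).iter()]


# Scoping and defaulting: both documents yield identical qualified names.
same_vocabulary = qualified_names(PREFIXED) == qualified_names(DEFAULTED)

# Multiple namespaces: elements and attributes are looked up by URI.
book = ET.fromstring(BOOK)
price = book.find("{urn:loc.gov:books}price")
currency = price.get("{urn:Finance:AllAboutMoney}currencySymbol")

if __name__ == "__main__":
    print(qualified_names(DEFAULTED))
    print(same_vocabulary, currency, price.text)
```

Looking names up by URI rather than by prefix is what lets two modules share one document without agreeing on prefixes in advance.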
