Processing of structured documents Spring 2001 Helena Ahonen-Myka
Course organization • 581290-5 laudatur course, 3 cu • lectures (in Finnish) • 27.2.-5.4. Tue 12-14, Thu 10-12 • exceptions: no lectures 6. and 8.3. • exercise sessions • 6.3.-5.4. Tue 10-12 A318 (in English?), Thu 12-14 C454 (in Finnish; 22.3. at 8-10) • course assistant: Olli Lahti • not obligatory
Project work • an XML application that is constructed during the course • a framework is given in the first lecture • in connection with the exercises, more requirements are given • a report has to be returned by 12.4.
Requirements • Exam (Wed 11.4. at 16-20): 45 points • Project: 15 points • Exercises: 5 extra points • Maximum of points: 60
Outline (preliminary) • 1. Introduction • 2. Descriptions of structure • context-free grammars • XML DTD, XML Schema • 3. Programming interfaces • SAX, DOM • 4. Querying structured documents • XML Query
Outline... • 5. Transforming structured documents • XSL (XSLT, formatting objects) • presentation issues • 6. Document architectures • 7. Metadata: RDF • 8. Compressing XML data • 9. ...
Structured documents • Document? • A structured representation of (textual) information on some medium • normally for a human reader • messages, manuals, memos, books… • also to/from/between applications • source code, program-generated mail, EDI (electronic data interchange) • static - dynamic
Presentation and structure • Presentation informs the human reader about the meaning of text and the role of its parts • markup: indicating the presentation or the meaning of different parts of text • originally hand-written annotations for the typesetter • nowadays primarily codes embedded in digital documents
Markup • Procedural markup • formatting commands (start boldface, produce an empty line, indent 5mm…) • Descriptive markup • indicating the logical structure of text using chosen names
Structured documents? • Generally speaking any text is structured (punctuation, words, sentences…) • but especially descriptively marked-up documents… • especially if they adhere to a rigorous specification of structure.
”Document”: <memo importance="high" date="19990323"> <from>Paul V. Biron</from> <to>Ashok Malhotra</to> <subject>Latest draft</subject> <body> We need to discuss the latest draft <emph>immediately</emph>. Either email me at <email>mailto:paul.v.biron@kp.org</email> or call <phone>555-9876</phone> </body> </memo>
”Data”: <invoice> <orderDate>19990121</orderDate> <shipDate>19990125</shipDate> <billingAddress> <name>Ashok Malhotra</name> <street>123 IBM Ave.</street> <city>Hawthorne</city> <state>NY</state> <zip>10532-0000</zip> </billingAddress> <voice>555-1234</voice> <fax>555-4321</fax> </invoice>
<body> <p><b>Order date:</b> 19990121</p> <p><b>Shipping date:</b> 19990125</p> <p><b>Address:</b></p> <table> <tr><th>name<th>street<th>city<th>state<th>zip <tr><td>Ashok Malhotra <td>123 IBM Ave. <td>Hawthorne <td>NY <td>10532-0000 </table> <p>Phone: 555-1234</p> <p>Fax: 555-4321</p> </body>
Theses of structured documenting • Separation of structure and presentation • markup of structure and other (meta) information should be done • at creation time • for future needs • rigor of markup • automatization of processing
Advantages of structure • Better control over documents • guidance of writing, validation of structure • higher-precision retrieval (conditions for parts) • reuse of information • automated processing • control of uniform style
Advantages of structure • Transport of documents between different environments and applications • archival of documents • storing in databases • multiuse of documents • different layout styles • paper, online, CD-ROM, pda • different versions
Disadvantages of structure • Start-up costs • design of document structures • conversion of legacy (non-structured) documents • implementation/adaptation of tools, procedures and policies • attitudes of authors • from a producer of a final publication to an information-feeding clerk?
2. Project work • The goal: everyone builds a (non-trivial) XML application that can be used during the course to train different concepts and methods • Example: I would need a system to track the work of my Master’s thesis students
A wish list: • I want to store information about my students, e.g., name, contact information, scheduled meetings and deadlines, comments, problems, ”deals”, links to the drafts and the homepages of the students, etc. • As a primary interface I’d like to have a web page (with forms)
A wish list: functions • I want to add information using the HTML form on the web page (easily!) • I want to have a listing on the web page of 1) all the students 2) information about one student • I need also other listings (e.g. simple ASCII) for reporting the state of my students (or just a list of my current students)
And now you... • Design an application that is somehow ”similar” to mine • set of persons (or other objects) with information (e.g. your customer contacts) • some parts free text • several different ways to use the data, e.g. several listings (both content and presentation)
Requirements • More requirements follow later... • return a report by 12.4. • The report should include • (short) requirements analysis • descriptions of the structure (DTD, Schema) • other designs, architecture, ... • Some kind of a working prototype • not necessarily the whole system
3. Structure descriptions • Regular expressions, context-free grammars • XML Document type definitions • XML Schema
Regular expressions • A way to describe sets of strings over an alphabet (of chars, events, elements…) • many uses: • text searching (e.g. emacs, grep, perl) • in grammatical formalisms (e.g. XML DTDs) • relevant for document structures: they specify what kind of structural content is allowed for different document components
Regular expressions • A regular expression over alphabet Σ is either • ∅ (an empty set) • ε (epsilon; sometimes λ, lambda) • a, where a ∈ Σ • R | S (choice; sometimes R ∪ S) • R S (catenation) or • R* (Kleene closure) • where R and S are regular expressions
Regular expressions • Regular expression E denotes a language (a set of strings) L(E): • L(∅) = ∅ (empty set) • L(ε) = {ε} (singleton set of empty string) • L(a) = {a} (singleton set of a ∈ Σ) • L(R|S) = L(R) ∪ L(S) = {w | w ∈ L(R) or w ∈ L(S)} • L(RS) = L(R)L(S) = {xy | x ∈ L(R) and y ∈ L(S)} • L(R*) = L(R)* = {x1…xn | xk ∈ L(R), k=1,…,n; n ≥ 0}
Example • top-level structure of a document: • Σ = {title, auth, date, sect} • title followed by an optional list of authors, followed by an optional date, followed by one or more sections: • title auth* (date | ε) sect sect* • common abbreviations: • E? = (E | ε); E+ = E E* • -> title auth* date? sect+
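The abbreviated expression above can be tried out with an ordinary regular-expression engine. The following is only a sketch under an encoding assumption: each symbol is written as a word followed by a space, so that Python's character-level `re` module treats whole element names as symbols. The helper name `matches_top_level` is hypothetical.

```python
import re

# Each symbol of the alphabet is written as a word followed by one space,
# so the character-level regex engine treats element names as units.
TOP_LEVEL = re.compile(r"(title )(auth )*(date )?(sect )+")

def matches_top_level(symbols):
    """Return True if the symbol sequence is in L(title auth* date? sect+)."""
    encoded = "".join(name + " " for name in symbols)
    return TOP_LEVEL.fullmatch(encoded) is not None

print(matches_top_level(["title", "auth", "auth", "date", "sect"]))  # True
print(matches_top_level(["title", "sect", "sect"]))                  # True
print(matches_top_level(["title", "date", "auth", "sect"]))          # False
print(matches_top_level(["title", "auth"]))                          # False
```

The last two sequences are rejected because an author may not follow the date, and at least one section is required.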
Context-free grammars • Widely used for syntax specification (programming languages) • G = (V, Σ, P, S) • V: the alphabet of the grammar G; V = N ∪ Σ • Σ: the set of terminal symbols; N = V − Σ: the set of nonterminal symbols • P: set of productions • S ∈ N: the start symbol
Productions and derivations • Productions: A -> α, where A ∈ N, α ∈ V* • e.g. A -> aBa (1) • Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if • γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> ω is a production of the grammar • e.g. AA => AaBa (assuming prod. 1 above)
Language generated by a context-free grammar • γ derives δ, γ =>* δ, if there is a sequence of 0 or more direct derivations that transforms γ to δ • The language generated by a CFG G: • L(G) = {w ∈ Σ* | S =>* w} • L(G) is a set of strings: to model structural elements, we consider parse trees
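Direct derivation and the generated language can be illustrated with a few lines of code. The sketch below uses a hypothetical example grammar, S -> aSb | ε, which is not from the slides; the function names are likewise made up for illustration. Nonterminals are the keys of the production table, everything else is a terminal.

```python
from collections import deque

# Hypothetical example grammar (not from the slides): S -> aSb | ε.
PRODUCTIONS = {"S": ["aSb", ""]}

def direct_derivations(form):
    """All forms reachable from `form` by applying one production once."""
    results = []
    for i, symbol in enumerate(form):
        for rhs in PRODUCTIONS.get(symbol, []):
            # form = alpha A beta  becomes  alpha omega beta
            results.append(form[:i] + rhs + form[i + 1:])
    return results

def language_up_to(max_len, start="S"):
    """Terminal strings w with start =>* w and len(w) <= max_len."""
    seen, words = {start}, set()
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if not any(c in PRODUCTIONS for c in form):
            words.add(form)            # no nonterminals left: a word of L(G)
            continue
        for nxt in direct_derivations(form):
            # Prune forms that already contain too many terminals. This is
            # enough for the grammar above; grammars that can grow without
            # adding terminals would also need a bound on the form length.
            terminals = sum(1 for c in nxt if c not in PRODUCTIONS)
            if terminals <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(words, key=len)

print(direct_derivations("aSb"))   # ['aaSbb', 'ab']
print(language_up_to(4))           # ['', 'ab', 'aabb']
```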
Parse trees of a CFG • Aka syntax trees or derivation trees • nodes labeled by symbols of V (or by ε): • internal nodes by nonterminals, root by start symbol • leaves using terminal symbols (or ε) • parent with label A can have children labeled by X1,…,Xk only if A -> X1…Xk is a production
CFGs for document structures • Nonterminals represent document structures • e.g. Ref -> AuthorList Title PublData; AuthorList -> Author AuthorList; AuthorList -> ε • problem: • obscures the relation of elements (the last Author is several hierarchical levels away from Ref) -> solution: extended CFGs
Extended CFGs (ECFGs) • Like CFGs, but right-hand sides of productions are regular expressions over V, e.g. Ref -> Author* Title PublData • Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if • γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> E is a production such that ω ∈ L(E) • e.g. Ref => Author Author Author Title PublData
Language generated by an ECFG • Defined similarly to CFGs • Theorem: Languages generated by extended and ordinary CFGs are the same
Parse trees of an ECFG • Similar to parse trees of an ordinary CFG, except that… • parent with label A can have children labeled by X1,…,Xk when A -> E is a production such that X1…Xk ∈ L(E) • -> an internal node may have arbitrarily many children (e.g. Authors below a Ref node)
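The condition on ECFG parse trees can be checked mechanically: at every internal node, the sequence of child labels must belong to the language of the production's right-hand side. The sketch below is an illustration under assumptions: trees are nested `(label, children)` tuples, symbols are encoded as a word plus a space for the regex engine, and Author, Title and PublData are treated as leaves to keep the example short. The names `ECFG`, `is_valid` and `leaf` are hypothetical.

```python
import re

# Right-hand sides are regular expressions over symbol names
# (each symbol encoded as a word followed by a space).
ECFG = {"Ref": re.compile(r"(Author )*(Title )(PublData )")}

def is_valid(node):
    """node = (label, [children]); check each internal node against its production."""
    label, children = node
    if label not in ECFG:              # treated as a leaf in this sketch
        return not children
    child_labels = "".join(child[0] + " " for child in children)
    if ECFG[label].fullmatch(child_labels) is None:
        return False
    return all(is_valid(child) for child in children)

def leaf(name):
    return (name, [])

ref = ("Ref", [leaf("Author"), leaf("Author"), leaf("Author"),
               leaf("Title"), leaf("PublData")])
print(is_valid(ref))                                        # True
print(is_valid(("Ref", [leaf("Title"), leaf("Author")])))   # False
```

Note that all three Author nodes are direct children of Ref, which is exactly what the plain CFG with AuthorList could not express.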
What is XML? • W3C Recommendation Feb 1998 • metalanguage that can be used to define markup languages • gives syntax for defining extended context free grammars • XML documents that adhere to the ECFG are strings in the language • document types (grammars)- document instances (strings in the language)
XML encoding of structure • XML document essentially a parenthesized linear encoding of a parse tree • corresponds to a preorder walk • start of inner node (element) A denoted by a start tag <A>, end denoted by end tag </A> • leaves are strings (or empty elements) • + certain extensions (especially attributes)
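The "parenthesized linear encoding" can be made concrete with a small serializer. This is a minimal sketch, assuming a parse tree given as nested `(label, children)` tuples with plain strings as leaves; attributes and empty-element tags are left out, and the function name `to_xml` is made up for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def to_xml(node):
    """Serialize a parse tree by a preorder walk: start tag, children, end tag."""
    if isinstance(node, str):          # leaf: character data
        return escape(node)
    label, children = node
    inner = "".join(to_xml(child) for child in children)
    return f"<{label}>{inner}</{label}>"

tree = ("billingAddress", [("name", ["Ashok Malhotra"]),
                           ("city", ["Hawthorne"]),
                           ("state", ["NY"])])
text = to_xml(tree)
print(text)

# Reading the string back and visiting the elements in document order
# gives the same preorder sequence of inner nodes.
print([element.tag for element in ET.fromstring(text).iter()])
# ['billingAddress', 'name', 'city', 'state']
```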
Terminal symbols in practice • Leaves of parse trees are labeled by single characters (symbols of Σ) • too granular in practice: instead terminal symbols which stand for all values of a type • e.g. #PCDATA in XML for variable length content of data characters • richer data types in proposed XML schema formalisms
XML: logical structure • Elements • correspond to internal nodes of the parse tree • unique root element -> document is a single parse tree • indicated by matching (case-sensitive!) tags <ElementTypeName>…</ElementTypeName> • can contain text and/or subelements • can be empty: • <elem-type></elem-type> • <br />
Logical structure • Attributes • name-value pairs attached to elements • "metadata", usually not treated as content • e.g. <div class="preface" date="990126"> • also: • <!-- comments --> • <?note this text would be passed to the application as a processing instruction named 'note'?>
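How elements, mixed content and attributes appear to a program can be sketched with Python's standard `xml.etree.ElementTree` module. The input is a shortened version of the memo example; the choice of this particular library is an assumption made for the illustration, not something the course prescribes.

```python
import xml.etree.ElementTree as ET

memo = ET.fromstring(
    '<memo importance="high" date="19990323">'
    '<from>Paul V. Biron</from><to>Ashok Malhotra</to>'
    '<subject>Latest draft</subject>'
    '<body>We need to discuss the latest draft <emph>immediately</emph>.</body>'
    '</memo>'
)

print(memo.tag)                    # memo  (the unique root element)
print(memo.get("importance"))      # high  (attribute value)
print(memo.find("subject").text)   # Latest draft

# Mixed content: text and subelements interleaved inside <body>.
body = memo.find("body")
print("".join(body.itertext()))    # We need to discuss the latest draft immediately.
```

In this API the text before `<emph>` is stored in `body.text` and the text after it in the `tail` of the `emph` element, which is one way of representing mixed content in a tree.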
Document type declaration • Provides a grammar (document type definition, DTD) for a class of documents • syntax: • <!DOCTYPE root-type-name SYSTEM "ex.dtd" [ <!-- internal subset may come here --> ]> • the external subset is in the file ex.dtd • external and internal subset make up the DTD; internal has higher precedence
XML declaration • <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
Defining the structure: DTD • document type definition (DTD) • content model for each element • describes how the elements are formed from the other elements and text • defines which attributes an element may/must have; default values • content models are regular expressions
Markup declarations • Element type declarations (similar to productions of ECFGs) • attribute-list declarations (for declared element types) • entity declarations • notation declarations
Element type declarations • The general form is • <!ELEMENT elem-type-name (E)> • where E is a content model = regular expression over element names
Regular expression syntax • + : 1 or more • * : 0 or more • ? : 0 or 1 • | : choice (exactly one alternative has to be chosen) • () : grouping • , : sequence (in the given order)
Examples of definitions • <!ELEMENT name (fname+, lname)> • <!ELEMENT address (name, street, ((city, state, zipcode) | (zipcode, city)))> • <!ELEMENT contact (address, phone*, email?)> • <!ELEMENT contact2 (address | phone | email)*>
DTD for the Invoice example <!DOCTYPE invoice [ <!ELEMENT invoice (orderDate, shipDate, billingAddress, voice*, fax?)> <!ELEMENT orderDate (#PCDATA)> <!ELEMENT shipDate (#PCDATA)> <!ELEMENT billingAddress (name, street, city, state, zip)> <!ELEMENT voice (#PCDATA)> <!ELEMENT fax (#PCDATA)> <!ELEMENT name (#PCDATA)> <!ELEMENT street (#PCDATA)> <!ELEMENT city (#PCDATA)> <!ELEMENT state (#PCDATA)> <!ELEMENT zip (#PCDATA)>]>
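Because content models are regular expressions, checking an element against its declaration amounts to matching the sequence of its child element names. The sketch below is not a DTD validator: it only handles element content (no #PCDATA or mixed content), the content models are typed in by hand rather than parsed from a DTD, and the function names are hypothetical. The standard library's `xml.etree.ElementTree` does not validate against DTDs, which is why the check is spelled out here.

```python
import re
import xml.etree.ElementTree as ET

def content_model_to_regex(model):
    """Translate a DTD element-content model into a regex over 'name ' tokens."""
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.-]*|[(),|*+?]", model):
        if token == ",":
            continue                   # sequence is plain concatenation
        elif token == "(":
            parts.append("(?:")
        elif token in ")|*+?":
            parts.append(token)        # same meaning in both notations
        else:
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))

def children_match(element, model):
    """True if the child element names of `element` satisfy the content model."""
    names = "".join(child.tag + " " for child in element)
    return content_model_to_regex(model).fullmatch(names) is not None

INVOICE_MODEL = "(orderDate, shipDate, billingAddress, voice*, fax?)"

good = ET.fromstring(
    "<invoice><orderDate>19990121</orderDate><shipDate>19990125</shipDate>"
    "<billingAddress/><voice>555-1234</voice><fax>555-4321</fax></invoice>"
)
bad = ET.fromstring(
    "<invoice><shipDate>19990125</shipDate><orderDate>19990121</orderDate>"
    "<billingAddress/></invoice>"
)
print(children_match(good, INVOICE_MODEL))   # True
print(children_match(bad, INVOICE_MODEL))    # False (wrong order)
```

The same helper accepts nested groups and choices, for example `(name, street, ((city, state, zipcode) | (zipcode, city)))`.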
Attribute-list declarations • Name, data type and possible default value for each attribute for a given element type • Example: • <!ATTLIST FIG • id ID #IMPLIED • descr CDATA #REQUIRED • class (a | b | c) "a"> • semantics mainly up to the application
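What the three kinds of declaration in the FIG example mean for a document instance can be sketched as follows. The dictionary is a hand-written rendering of the attribute list, and both its layout and the function `check_attributes` are assumptions for illustration; uniqueness of ID values across the document is not checked.

```python
# Hand-written Python rendering of the FIG attribute list above;
# the dictionary layout is an assumption, not the output of a DTD parser.
FIG_ATTLIST = {
    "id":    {"type": "ID",            "default": "#IMPLIED"},
    "descr": {"type": "CDATA",         "default": "#REQUIRED"},
    "class": {"type": ("a", "b", "c"), "default": "a"},
}

def check_attributes(attrs, attlist):
    """Return (effective attributes, list of problems) for one element."""
    effective, problems = dict(attrs), []
    for name in attrs:
        if name not in attlist:
            problems.append(f"undeclared attribute {name}")
    for name, decl in attlist.items():
        if name not in effective:
            if decl["default"] == "#REQUIRED":
                problems.append(f"missing required attribute {name}")
            elif decl["default"] != "#IMPLIED":
                effective[name] = decl["default"]   # supply the declared default
        elif isinstance(decl["type"], tuple) and effective[name] not in decl["type"]:
            problems.append(f"{name} must be one of {decl['type']}")
    return effective, problems

print(check_attributes({"descr": "A chart"}, FIG_ATTLIST))
# ({'descr': 'A chart', 'class': 'a'}, [])
print(check_attributes({"class": "z"}, FIG_ATTLIST)[1])
# ['missing required attribute descr', "class must be one of ('a', 'b', 'c')"]
```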