
Processing of structured documents

Processing of structured documents. Spring 2002, Part 1 Helena Ahonen-Myka. Course organization. 581290-5 laudatur course, 3 cu lectures (in Finnish) 22.1.-21.2. Tue 12-14, Thu 10-12 not obligatory exercise sessions 29.1.-27.2.


Presentation Transcript


  1. Processing of structured documents Spring 2002, Part 1 Helena Ahonen-Myka

  2. Course organization • 581290-5 laudatur course, 3 cu • lectures (in Finnish) • 22.1.-21.2. Tue 12-14, Thu 10-12 • not obligatory • exercise sessions • 29.1.-27.2. • course assistants: Olli Lahti and Miro Lehtonen (new group Wed 12-14 A318) • not obligatory

  3. Requirements • Exam (Wed 6.3. at 16-20): 45 points • Project: 15 points • Exercises: 5 extra points • Maximum of points: 60

  4. Outline (preliminary) • 1. Descriptions of structure • context-free grammars • namespaces, information sets • (XML DTD,) XML Schema • 2. Programming interfaces • SAX, DOM • SOAP • 3. Traversing documents • XPath

  5. Outline... • 4. Querying structured documents • XML Query • 5. XML Linking • 6. XML databases • 7. Metadata: RDF • 8. Compressing XML data • 9. ...

  6. Prerequisites • You should know the basics of XML • DTD, elements, attributes, syntax • XSLT (basics), formatting • some programming experience is needed

  7. Group project • Group of 4-5 students • groups are formed in the exercise sessions in the 2nd week • Task: construct a toy B2B e-commerce application • a travel agency which sells packages containing hotel nights and concerts • a hotel (or several) • a concert ticket office

  8. Group project • Task continues • a customer can reserve packages using a web page • a reservation causes a query to the hotels and the ticket offices for the availability of rooms and tickets • for all the communication and for the storage of all the documents you should use XML

  9. Group project • Try to get some simple implementation to work • may depend on the support we can offer • you don't have to consider all the real-life problems, like consistency of reservations • concentrate on playing with XML • the state of the work is presented in the last exercise sessions (this also applies to students who don't normally attend exercises)

  10. Requirements for project • More instructions follow later... • return a report by 22.3. (as a URL) • The report should include • (short) requirements analysis • descriptions of the structure (DTD, Schema) • other designs, architecture, ... • Some kind of a working prototype • not necessarily the whole system

  11. 1. Structure descriptions • Regular expressions, context-free grammars -> What is XML? • (XML Document type definitions) • namespaces, information sets • XML Schema

  12. Regular expressions • A way to describe sets of strings over an alphabet (of chars, events, elements…) • many uses: • text searching (e.g. emacs, grep, perl) • in grammatical formalisms (e.g. XML DTDs) • relevant for document structures: what kind of structural content is allowed for different document components

  13. Regular expressions • A regular expression over alphabet Σ is either • ∅ (an empty set) • ε (epsilon; sometimes lambda λ) • a, where a ∈ Σ • R | S (choice; sometimes R ∪ S) • R S (catenation) or • R* (Kleene closure) • where R and S are regular expressions

  14. Regular expressions • Regular expression E denotes a language (a set of strings) L(E): • L(∅) = ∅ (empty set) • L(ε) = {ε} (singleton set of empty string) • L(a) = {a} (singleton set of a ∈ Σ) • L(R|S) = L(R) ∪ L(S) = {w | w ∈ L(R) or w ∈ L(S)} • L(RS) = L(R)L(S) = {xy | x ∈ L(R) and y ∈ L(S)} • L(R*) = L(R)* = {x1…xn | xk ∈ L(R), k=1,…,n; n ≥ 0}
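
The definition of L(E) can be turned into a small program almost clause by clause. The following is a minimal sketch, not part of the original slides: regular expressions are written as nested tuples (a representation assumed here for illustration), and because L(R*) is usually infinite, the hypothetical function `lang` only lists strings up to a given length.

```python
def lang(expr, max_len):
    """Return the strings of L(expr) whose length is at most max_len.

    expr is a nested tuple (an assumed encoding, not from the slides):
    ("empty",), ("eps",), ("sym", a), ("alt", R, S), ("cat", R, S), ("star", R).
    """
    kind = expr[0]
    if kind == "empty":                      # L(empty set) = {}
        return set()
    if kind == "eps":                        # L(epsilon) = {""}
        return {""}
    if kind == "sym":                        # L(a) = {a}
        return {expr[1]} if len(expr[1]) <= max_len else set()
    if kind == "alt":                        # L(R|S) = L(R) union L(S)
        return lang(expr[1], max_len) | lang(expr[2], max_len)
    if kind == "cat":                        # L(RS) = {xy | x in L(R), y in L(S)}
        left, right = lang(expr[1], max_len), lang(expr[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":                       # L(R*) = x1...xn, n >= 0
        base = lang(expr[1], max_len)
        result, frontier = {""}, {""}
        while frontier:                      # stops: only finitely many short strings
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind}")


# a b*  restricted to strings of length <= 3
a_then_bs = ("cat", ("sym", "a"), ("star", ("sym", "b")))
print(sorted(lang(a_then_bs, 3), key=len))   # ['a', 'ab', 'abb']
```

The length bound is a design choice of the sketch only; the mathematical definition has no such bound.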

  15. Example • top-level structure of a document: • Σ = {title, author, date, sect} • title followed by an optional list of authors, followed by an optional date, followed by one or more sections: • title author* (date | ε) sect sect* • common abbreviations: • E? = (E | ε); E+ = E E* • -> title author* date? sect+
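
A content model like this one can be tried out with an ordinary regular-expression library. The sketch below is an illustration under assumptions, not course material: it uses Python's `re` module, takes `author` as the element name from the alphabet, and encodes each element name followed by a space so that names cannot run together. The helper `matches_model` is hypothetical.

```python
import re

# Content model from the slide: title author* date? sect+
# Each element name is followed by one space in the encoded string.
CONTENT_MODEL = re.compile(r"title (author )*(date )?(sect )+")


def matches_model(children):
    """Return True if the sequence of element names fits the content model."""
    encoded = "".join(name + " " for name in children)
    return CONTENT_MODEL.fullmatch(encoded) is not None


print(matches_model(["title", "author", "author", "date", "sect"]))  # True
print(matches_model(["title", "sect", "sect"]))                      # True
print(matches_model(["title", "date"]))                              # False: no sect
```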

  16. Context-free grammars • Used widely for syntax specification (programming languages) • G = (V, Σ, P, S) • V: the alphabet of the grammar G; V = Σ ∪ N • Σ: the set of terminal symbols; N = V − Σ: the set of nonterminal symbols • P: set of productions • S ∈ N: the start symbol

  17. Productions and derivations • Productions: A -> α, where A ∈ N, α ∈ V* • e.g. A -> aBa (1) • Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if • γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> ω is a production of the grammar • e.g. AA => AaBa (assuming prod. 1 above)

  18. Language generated by a context-free grammar • γ derives δ, γ =>* δ, if there is a sequence of 0 or more direct derivations that transforms γ to δ • The language generated by a CFG G: • L(G) = {w ∈ Σ* | S =>* w} • L(G) is a set of strings: to model structural elements, we consider parse trees
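
Direct derivation and the generated language can be sketched in a few lines. The grammar below (S -> a S b and S -> empty string) is a hypothetical toy example chosen for this illustration, and the function names are assumptions. Since L(G) is infinite here, the search is cut off at a maximum word length; the cut-off only guarantees termination for grammars like this one, where repeated rewriting keeps adding terminal symbols.

```python
from collections import deque

# Hypothetical toy grammar: S -> a S b | (empty string).
# Keys are the nonterminals; every other symbol is a terminal.
PRODUCTIONS = {"S": [("a", "S", "b"), ()]}


def direct_derivations(form):
    """Yield every sentential form reachable from `form` in one step (=>)."""
    for i, symbol in enumerate(form):
        for rhs in PRODUCTIONS.get(symbol, []):
            yield form[:i] + rhs + form[i + 1:]


def language(start, max_len):
    """Collect the terminal strings w with start =>* w and len(w) <= max_len."""
    seen, words = {(start,)}, set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        if all(s not in PRODUCTIONS for s in form):
            words.add("".join(form))         # only terminals left: a word of L(G)
            continue
        for nxt in direct_derivations(form):
            # Prune forms that already contain too many terminals.
            terminals = sum(1 for s in nxt if s not in PRODUCTIONS)
            if terminals <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return words


print(sorted(language("S", 4), key=len))     # ['', 'ab', 'aabb']
```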

  19. Parse trees of a CFG • Aka syntax trees or derivation trees • nodes labeled by symbols of V (or by ε): • internal nodes by nonterminals, root by start symbol • leaves using terminal symbols (or ε) • parent with label A can have children labeled by X1,…,Xk only if A -> X1…Xk is a production

  20. CFGs for document structures • Nonterminals represent document structures • e.g. Ref -> AuthorList Title PublData; AuthorList -> Author AuthorList; AuthorList -> ε • problem: • obscures the relation of elements (the last Author several hierarchical levels away from Ref) -> solution: extended CFGs

  21. Extended CFGs (ECFGs) • Like CFGs, but right-hand-sides of productions are regular expressions over V, e.g. Ref -> Author* Title PublData • Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if • γ = αAβ, δ = αωβ for some α, β ∈ V*, and A -> E is a production such that ω ∈ L(E) • e.g. Ref => Author Author Author Title PublData

  22. Language generated by an ECFG • Defined similarly to CFGs • Theorem: Languages generated by extended and ordinary CFGs are the same

  23. Parse trees of an ECFG • Similar to parse trees of an ordinary CFG, except that… • parent with label A can have children labeled by X1,…,Xk when A -> E is a production such that X1…Xk ∈ L(E) • -> an internal node may have arbitrarily many children (e.g. Authors below a Ref node)
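
The parse-tree condition for ECFGs can be checked mechanically: at every internal node, the sequence of child labels must belong to the language of the right-hand side. The sketch below is illustrative only. It assumes trees are `(label, children)` pairs, writes the right-hand side of Ref -> Author* Title PublData as a Python regular expression over space-terminated names, and, to stay short, treats Author, Title and PublData as leaves.

```python
import re

# Hypothetical ECFG with a single production: Ref -> Author* Title PublData.
# The right-hand side is a regex over symbol names, each followed by a space.
ECFG = {"Ref": r"(Author )*Title PublData "}


def is_parse_tree(node):
    """Check that the children of every internal node match its production."""
    label, children = node
    if label not in ECFG:                    # no production: must be a leaf
        return not children
    encoded = "".join(child[0] + " " for child in children)
    if re.fullmatch(ECFG[label], encoded) is None:
        return False
    return all(is_parse_tree(child) for child in children)


def leaf(name):
    return (name, [])


good = ("Ref", [leaf("Author"), leaf("Author"), leaf("Author"),
                leaf("Title"), leaf("PublData")])
bad = ("Ref", [leaf("Title"), leaf("Author"), leaf("PublData")])
print(is_parse_tree(good), is_parse_tree(bad))   # True False
```

Note that the Ref node in `good` has five children directly below it, which an ordinary CFG with the recursive AuthorList productions could not express.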

  24. What is XML? • metalanguage that can be used to define markup languages • gives syntax for defining extended context-free grammars • XML documents that adhere to an ECFG are strings in the language generated by that grammar • document types (grammars) vs. document instances (strings in the language)

  25. XML encoding of structure • An XML document is essentially a parenthesized linear encoding of a parse tree • corresponds to a preorder walk • start of inner node (element) A denoted by a start tag <A>, end denoted by end tag </A> • leaves are strings (or empty elements) • + certain extensions (especially attributes)
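
The encoding can be illustrated with a short recursive function that walks a tree in preorder and emits start and end tags. This is a minimal sketch under assumptions: trees are `(label, children)` pairs with plain strings as leaves, attributes are ignored, and the function name `serialize` is hypothetical. Character data is escaped with `xml.sax.saxutils.escape` from the Python standard library.

```python
from xml.sax.saxutils import escape


def serialize(node):
    """Encode a tree as XML text by a preorder walk."""
    if isinstance(node, str):                # leaf: character data
        return escape(node)
    label, children = node
    if not children:                         # leaf: empty element
        return f"<{label}/>"
    inner = "".join(serialize(child) for child in children)
    return f"<{label}>{inner}</{label}>"     # start tag, subtrees, end tag


tree = ("billingAddress", [("city", ["Hawthorne"]), ("state", ["NY"])])
print(serialize(tree))
# <billingAddress><city>Hawthorne</city><state>NY</state></billingAddress>
```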

  26. Terminal symbols in practice • Leaves of parse trees are labeled by single characters (symbols of Σ) • too granular in practice: instead terminal symbols which stand for all values of a type • e.g. #PCDATA in XML for variable length content of data characters • richer data types in XML schema formalisms

  27. An example DTD
      <!DOCTYPE invoice [
      <!ELEMENT invoice (orderDate, shipDate, billingAddress, voice*, fax?)>
      <!ELEMENT orderDate (#PCDATA)>
      <!ELEMENT shipDate (#PCDATA)>
      <!ELEMENT billingAddress (name, street, city, state, zip)>
      <!ELEMENT voice (#PCDATA)>
      <!ELEMENT fax (#PCDATA)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT street (#PCDATA)>
      <!ELEMENT city (#PCDATA)>
      <!ELEMENT state (#PCDATA)>
      <!ELEMENT zip (#PCDATA)>
      ]>

  28. And a document:
      <invoice>
        <orderDate>19990121</orderDate>
        <shipDate>19990125</shipDate>
        <billingAddress>
          <name>Ashok Malhotra</name>
          <street>123 IBM Ave.</street>
          <city>Hawthorne</city>
          <state>NY</state>
          <zip>10532-0000</zip>
        </billingAddress>
        <voice>555-1234</voice>
        <fax>555-4321</fax>
      </invoice>
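
To see how the document relates to the DTD, one can parse it and compare each element's children with the corresponding content model. The sketch below is an approximation, not a real validator: Python's `xml.etree.ElementTree` does not validate against DTDs, so the content models are copied by hand into regular expressions, reading the invoice model as orderDate, shipDate, billingAddress, voice*, fax?. The names `MODELS` and `conforms` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

DOCUMENT = """<invoice>
  <orderDate>19990121</orderDate>
  <shipDate>19990125</shipDate>
  <billingAddress>
    <name>Ashok Malhotra</name>
    <street>123 IBM Ave.</street>
    <city>Hawthorne</city>
    <state>NY</state>
    <zip>10532-0000</zip>
  </billingAddress>
  <voice>555-1234</voice>
  <fax>555-4321</fax>
</invoice>"""

# Content models transcribed by hand from the DTD (an assumption of this
# sketch); each child name is followed by one space in the encoding.
MODELS = {
    "invoice": r"orderDate shipDate billingAddress (voice )*(fax )?",
    "billingAddress": r"name street city state zip ",
}


def conforms(element):
    """Check the element structure against the hand-written content models."""
    model = MODELS.get(element.tag)
    if model is None:                        # treated as (#PCDATA): no child elements
        return len(element) == 0
    encoded = "".join(child.tag + " " for child in element)
    return (re.fullmatch(model, encoded) is not None
            and all(conforms(child) for child in element))


print(conforms(ET.fromstring(DOCUMENT)))     # True
```

The check ignores text between child elements and does not verify that element names are declared at all; a validating parser would cover both.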

  29. XML processing model • A processor (parser) • reads XML documents • passes data to an application • XML Specification tells how to read, what to pass
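
The division of labor between processor and application can be observed with an event-based parser. The following sketch uses Python's standard `xml.sax` module as the processor; the handler class `EventLogger` is a hypothetical application that simply records what the parser passes to it.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """A toy application: records the events reported by the parser."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # The parser may deliver character data in several pieces.
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


handler = EventLogger()
xml.sax.parseString(b"<voice>555-1234</voice>", handler)
for event in handler.events:
    print(event)
```

The application never sees angle brackets or tag syntax, only the start, text and end events that the processor derives from the document.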

  30. Well-formed XML documents • documents that adhere to the formal requirements (syntax) of the XML specification • if a document is not well-formed, it is not an XML document (and the XML tools do not have to process it)
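
Well-formedness is easy to test in practice, because a conforming parser must report an error for any input that violates the syntax. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` as the parser and a hypothetical helper `is_well_formed`:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts `text` as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<a><b/></a>"))         # True
print(is_well_formed("<a><b></a>"))          # False: <b> is never closed
```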

  31. Valid documents • a document is a valid XML document if it is well-formed and adheres to the structure defined in the given DTD • an XML processor can be validating or non-validating • sometimes validity is important, sometimes not
