This document provides an extensive introduction to XML technologies useful for text encoding, focusing on processing XML files, the role of CSS in styling, and methods for information extraction through XPATH and XSLT. Readers will discover the benefits of XML, including its portability and structure clarity, as well as its limitations such as verbosity and efficiency issues. The text also discusses the Text Encoding Initiative (TEI) and standard annotation practices, making it an essential resource for anyone looking to understand XML's capabilities and applications in encoding various content types.
XML technologies for text encoding Tamás Váradi varadi@nytud.hu
Introduction • Processing XML files • CSS – getting the picture right • XPath – finding our way around • XSLT – extracting the right info • Encoding content the right way • Text Encoding Initiative • TEI Lite • Tools
Benefits of XML • makes structure and content clear • encoding independent of display and device • portable, platform independent • ideal for exchange of data • with a DTD, validation of document is easy
Limitations of XML • Verbose annotation increases the size of the files (sometimes hugely) • Not very efficient format for fast access and recall
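The size cost of verbose annotation can be made concrete with a small measurement. The sketch below compares the length of an annotated sentence with the length of the text it contains. The sample sentence, its `w`, `pos` and `lemma` names, and the helper `markup_overhead` are assumptions invented for illustration, not taken from any particular corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one sentence with word-level annotation.
ANNOTATED = (
    '<s id="s1">'
    '<w pos="DET" lemma="the">The</w> '
    '<w pos="NOUN" lemma="cat">cat</w> '
    '<w pos="VERB" lemma="sleep">sleeps</w>'
    '</s>'
)

def markup_overhead(xml_text):
    """Return (annotated length, plain-text length, ratio between the two)."""
    root = ET.fromstring(xml_text)
    plain = "".join(root.itertext())  # text content only, all tags dropped
    return len(xml_text), len(plain), len(xml_text) / len(plain)

total, plain, ratio = markup_overhead(ANNOTATED)
print(f"annotated: {total} chars, plain text: {plain} chars, ratio: {ratio:.1f}x")
```

The richer the annotation per word, the higher this ratio climbs, which is why heavily annotated corpora can be many times larger than the raw text.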
Displaying XML files? • Style sheets • consistent design • easy to change • one stylesheet can serve many XML documents • one document can use different stylesheets
Cascading Stylesheets • Elements are associated with display styles • h1 { font-size: 3em; } (selector: h1, property: font-size, value: 3em) • A stylesheet is a collection of style rules
Declaring the stylesheet • <?xml-stylesheet type="text/css" href="url-of-stylesheet"?> • Example: <?xml version="1.0"?> <?xml-stylesheet type="text/css" href="cards.css"?>
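To check which stylesheet a document declares, the processing instruction can be read back programmatically. The following is a minimal sketch using Python's standard `xml.dom.minidom`. The `<cards>` document body and the helper name `stylesheet_declarations` are assumptions made up for the example.

```python
from xml.dom import minidom

# Hypothetical document that declares cards.css as its stylesheet.
DOC = (
    '<?xml version="1.0"?>\n'
    '<?xml-stylesheet type="text/css" href="cards.css"?>\n'
    '<cards><card>Hello</card></cards>'
)

def stylesheet_declarations(xml_text):
    """Return the data of every top-level xml-stylesheet processing instruction."""
    dom = minidom.parseString(xml_text)
    return [
        node.data
        for node in dom.childNodes
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        and node.target == "xml-stylesheet"
    ]

print(stylesheet_declarations(DOC))
```

The XML declaration itself is not reported as a processing instruction, so only the stylesheet line is returned.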
An example • Load the file letter.xml into Internet Explorer • Now load the file letter2.xml • View source • Open the file letter.css in notepad • Check that what you see corresponds to what is in the css file
Cascading stylesheets • Features are inherited down the XML tree • Three levels of applying styles: • External stylesheets • Internal style definitions • Inline style settings
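Inheritance down the XML tree can be sketched in a few lines. The example below is a simplified model, not a CSS engine: it assumes rules are keyed by element name only and that every property inherits (in real CSS only some do, `color` and `font-family` among them). The letter fragment, the `RULES` table and `effective_styles` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical letter fragment.
LETTER = "<letter><body><p>Dear <name>Anna</name>,</p></body></letter>"

# Hypothetical style rules, keyed by element name (like a CSS type selector).
RULES = {
    "letter": {"font-family": "serif", "color": "black"},
    "name": {"color": "blue"},
}

def effective_styles(root, rules):
    """Map each element's path to its style after inheriting from its ancestors."""
    result = {}

    def walk(elem, inherited, path):
        # An element's own rules override whatever it inherits from above.
        style = {**inherited, **rules.get(elem.tag, {})}
        here = f"{path}/{elem.tag}"
        result[here] = style
        for child in elem:
            walk(child, style, here)

    walk(root, {}, "")
    return result

styles = effective_styles(ET.fromstring(LETTER), RULES)
for path, style in styles.items():
    print(path, style)
```

Here `name` has no font rule of its own, yet it ends up with the serif font set on `letter`, while its own `color` rule replaces the inherited one.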
Limitations of CSS • Elements are formatted in their original sequence • No means to reorder elements • No means to select a set of elements
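Reordering requires transforming the document itself rather than styling it. As a rough illustration of what such a transformation does, the sketch below sorts elements with Python's standard library. The `<cards>` data and the function `sort_cards_by_name` are invented for the example, and this stands in for, rather than reproduces, a stylesheet-based transformation.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: cards in their original (unsorted) document order.
CARDS = (
    "<cards>"
    "<card><name>Varga</name></card>"
    "<card><name>Kiss</name></card>"
    "<card><name>Nagy</name></card>"
    "</cards>"
)

def sort_cards_by_name(xml_text):
    """Return the document with <card> elements reordered alphabetically by <name>."""
    root = ET.fromstring(xml_text)
    # Replace the children of the root with the same elements in sorted order.
    root[:] = sorted(root, key=lambda card: card.findtext("name", default=""))
    return ET.tostring(root, encoding="unicode")

print(sort_cards_by_name(CARDS))
```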
More advanced techniques • XSL – Extensible Stylesheet Language • XSLT – XSL Transformations • XPath – a standard way to find elements in the XML hierarchy
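A few XPath expressions show how elements are located in the hierarchy. The sketch below uses Python's `xml.etree.ElementTree`, which supports only a limited subset of XPath, so it illustrates the idea rather than the full language. The letter document, its date and its `type` attribute values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical letter with a few annotated names.
LETTER = (
    "<letter>"
    "<head><date>2005-05-12</date></head>"
    "<body>"
    '<p n="1">Dear <name type="person">Anna</name>,</p>'
    '<p n="2">Greetings from <name type="place">Budapest</name>.</p>'
    "</body>"
    "</letter>"
)

root = ET.fromstring(LETTER)

# All <name> elements anywhere below the root.
all_names = [e.text for e in root.findall(".//name")]

# Only the <name> elements whose type attribute is "place".
places = [e.text for e in root.findall(".//name[@type='place']")]

# The second <p> child of <body> (XPath positions start at 1).
second_p = root.find("body/p[2]")

print(all_names)
print(places)
print("".join(second_p.itertext()))
```

Each expression is a path through the tree: `.//` searches at any depth, a bracketed predicate filters by attribute or position, and `/` steps from parent to child.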
XSLT • See the excellent introduction to XSLT by Sebastian Rahtz available here
Standard annotation of content • XML is an annotation standard • it is not designed for any particular domain • Need for a standard way of encoding typical text genres like books, dictionaries, letters, radio news, etc. • => TEXT ENCODING INITIATIVE (TEI)