Solutions mentioned by the TEI
This document delves into the Text Encoding Initiative (TEI) solutions, particularly focusing on concurrency in SGML, milestone elements, and issues related to redundant encoding and fragmentation. It highlights the problems associated with milestone elements lacking content and the complexities of processing these regions. Additionally, the discussion covers stand-off annotation advantages and drawbacks, emphasizing the challenges of combining different annotation layers. Non-SGML markup languages like LMNL are also introduced as potential alternatives that address some of the inherent limitations in SGML and XML.
Solutions mentioned by the TEI
• CONCUR: an optional feature of SGML (not XML) that allows multiple hierarchies to be marked up concurrently in the same document
• milestone elements: empty elements that mark the boundaries between elements in a non-nesting structure
• fragmentation of an item: the division of a single element into two or more parts, each of which nests properly within its context
• virtual joins: the re-creation of a virtual element from fragments of text
• redundant encoding: information encoded in multiple forms
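Fragmentation and virtual joins can be illustrated with a minimal sketch. It assumes a quotation that has been split across two verse lines and chained with plain `id` and `next` attributes; these attribute names, the sample text, and the function `virtual_join` are illustrative assumptions, not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one quotation fragmented so that each part nests in its line.
SAMPLE = (
    "<text>"
    "<l><q id='q1' next='q2'>Who goes there,</q> he said,</l>"
    "<l><q id='q2'>friend or foe?</q></l>"
    "</text>"
)

def virtual_join(xml_text, first_id):
    """Rebuild the text of a fragmented element by following 'next' links."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    parts = []
    seen = set()  # guards against circular 'next' chains
    current = by_id.get(first_id)
    while current is not None and current.get("id") not in seen:
        seen.add(current.get("id"))
        parts.append("".join(current.itertext()))
        current = by_id.get(current.get("next"))
    return " ".join(parts)

if __name__ == "__main__":
    print(virtual_join(SAMPLE, "q1"))
```

Neither `<q>` fragment contains the whole quotation on its own; the complete quotation only exists after this extra interpretation step over the parsed document.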
Problems with milestones
• milestones are empty elements, i.e. they have no content
• consequences:
  • no content model restriction can be stated by a document grammar
  • standard SGML/XML editors cannot annotate these regions
  • SGML/XML parsers cannot ensure proper nesting of the milestone elements
  • processing these regions by means of a style sheet is
    • more difficult (XSLT) or
    • impossible (CSS)
CLIX/Horse-milestones
• Differing types of milestones:
  <milestone type='start' gi='q' id='foo'/> … <milestone type='end' gi='q' coid='foo'/>
  <start gi='q' id='foo'/> ... <end gi='q' coid='foo'/>
• CLIX: the non-XML markup <B>s<I>xyz</B>t</I> would be encoded as <B sID='1'/>s<I sID='2'/>xyz<B eID='1'/>t<I eID='2'/>
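How the overlapping ranges might be recovered from CLIX-style milestones can be sketched as follows. This is a minimal illustration that assumes all milestones are direct children of a wrapper element (`doc` is a hypothetical name) and that every `eID` has a matching `sID`; the function `clix_ranges` is likewise an assumption, not part of CLIX.

```python
import xml.etree.ElementTree as ET

# The CLIX example from the slide, wrapped in a hypothetical <doc> root element.
SAMPLE = "<doc><B sID='1'/>s<I sID='2'/>xyz<B eID='1'/>t<I eID='2'/></doc>"

def clix_ranges(xml_text):
    """Return the plain text and a list of (element name, covered text) pairs."""
    root = ET.fromstring(xml_text)
    chunks = [root.text or ""]
    offset = len(chunks[0])
    open_ranges = {}  # sID value -> (element name, start offset)
    ranges = []       # (element name, start offset, end offset)
    for el in root:
        if "sID" in el.attrib:
            open_ranges[el.get("sID")] = (el.tag, offset)
        elif "eID" in el.attrib:
            name, start = open_ranges.pop(el.get("eID"))
            ranges.append((name, start, offset))
        tail = el.tail or ""  # text that follows the empty milestone element
        chunks.append(tail)
        offset += len(tail)
    text = "".join(chunks)
    return text, [(name, text[s:e]) for name, s, e in ranges]

if __name__ == "__main__":
    print(clix_ranges(SAMPLE))
```

The XML parser only sees empty elements here; pairing start and end milestones, and checking that they match, is left to application code such as this.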
Problems with the other TEI solutions
• CONCUR:
  • (de facto) not implemented (and not part of XML)
• fragmentation of an item:
  • results in 'containers' that hold only part of the text, e.g. a fragmented sentence or paragraph element would not contain an entire sentence or paragraph, as its name implies
• virtual joins:
  • require a separate interpretation of the SGML document
• redundant encoding:
  • results in multiple files
  • the files are not integrated into a larger unit
  • there is no unit containing all the information
Stand-off annotation
• new layers of annotation are added by building a new tree whose nodes are SGML elements which do not contain textual content, but links to another layer
• in some respects a generalization of the virtual joins (although not mentioned by the TEI), because
  • not only contents of elements are joined, but also ranges between points within the document
• link base:
  • Distinction 1: markup already contained in an annotation layer vs. text content, addressed by character offsets
  • Distinction 2: one (dedicated) layer as the link target vs. (free) interlinking of several layers
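The variant in which annotations address text content by character offsets can be sketched like this. The `Span` class, the layer names, and the sample text are hypothetical; the point is only that the source string stays untouched while each layer points into it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    layer: str  # hypothetical layer name
    label: str  # element name the annotation stands for
    start: int  # character offset into the source, inclusive
    end: int    # character offset into the source, exclusive

def text_of(source, span):
    """Resolve a stand-off annotation against the (read-only) source text."""
    return source[span.start:span.end]

def overlaps(a, b):
    """True if two spans cross, i.e. neither one contains the other."""
    return a.start < b.start < a.end < b.end or b.start < a.start < b.end < a.end

SOURCE = "sxyzt"
ANNOTATIONS = [
    Span("bold", "B", 0, 4),
    Span("italic", "I", 1, 5),
]

if __name__ == "__main__":
    for span in ANNOTATIONS:
        print(span.layer, span.label, repr(text_of(SOURCE, span)))
    print("overlap:", overlaps(ANNOTATIONS[0], ANNOTATIONS[1]))
```

Each layer is well formed on its own, but relating the two layers (here, detecting the overlap) requires code that interprets the offsets, since no generic XML tool sees both at once.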
Advantages of stand-off annotation
• Thompson & McKelvie (1997)
  • the source document might be read-only
  • annotation files can be distributed without distributing the source text
• Michael Glass & Barbara Di Eugenio (2002)
  • discontinuous segments of text can be combined in a single annotation
  • independent parallel coders can produce independent annotations
  • different annotation files can contain different layers of information
• Pianta & Bentivogli (2004)
  • elegance and clarity
  • processing conceptually simple
Drawbacks of stand-off annotation
• new layers require a separate interpretation
• the layers, although separate, depend on each other
• the information, although included, is difficult to access using generic methods
  • standard parsing or editing software cannot be employed
  • standard document grammars can only be used for the level containing both markup and textual data
• linking at a sub-element range is difficult
  • the primary layer should be a (primary) level
Non-SGML-based Markup Languages
• some non-SGML-based markup languages have been proposed, e.g. Multi-Element Code System (MECS) or TexMECS
  • their major extension with respect to SGML and XML is that overlapping ranges are admitted within documents
• in 2002 the Layered Markup and Annotation Language (LMNL) was proposed (Tennison and Piez 2002)
• LMNL is a markup language which not only allows overlapping elements to be annotated but also allows element names to be connected to corresponding annotation levels
• LMNL solves both problems, but
  • (full) LMNL is not SGML-based
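The idea of admitting overlapping ranges directly in the markup can be sketched with a deliberately tiny reader. It assumes a simplified form of LMNL's "sawtooth" notation, in which `[name}` opens and `{name]` closes a range, and it ignores everything else in LMNL (annotations on ranges, anonymous ranges, atoms). The function `parse_sawtooth` and the sample string are illustrative assumptions, not a LMNL implementation.

```python
import re

# Matches a simplified start tag "[name}" or end tag "{name]".
TAG = re.compile(r"\[(\w+)\}|\{(\w+)\]")

def parse_sawtooth(marked_up):
    """Return (plain text, [(name, start, end), ...]) for simplified range tags."""
    chunks = []
    offset = 0
    pos = 0
    open_starts = {}  # range name -> stack of start offsets
    ranges = []
    for match in TAG.finditer(marked_up):
        chunk = marked_up[pos:match.start()]
        chunks.append(chunk)
        offset += len(chunk)
        pos = match.end()
        if match.group(1):  # start tag
            open_starts.setdefault(match.group(1), []).append(offset)
        else:               # end tag closes the most recent range of that name
            start = open_starts[match.group(2)].pop()
            ranges.append((match.group(2), start, offset))
    chunks.append(marked_up[pos:])
    return "".join(chunks), ranges

SAMPLE = "[B}s[I}xyz{B]t{I]"

if __name__ == "__main__":
    print(parse_sawtooth(SAMPLE))
```

Because ranges are matched by name rather than by nesting, the overlapping B and I ranges need no milestones or fragmentation; the price is that the input is no longer SGML or XML and standard parsers cannot read it.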