text-technology.de

Information Modelling of Language and Text: XML-based, multi-level, semantics-oriented
Some methods (only)
www.text-technology.de

Research Group "Texttechnological Information Modelling"

University of Bielefeld:
• D. Gibbon: MODELEX
• D. Metzing: SEKIMO
• associated: J.-T. Milde, Multimodal Corpora, TASX
University of Dortmund:
• A. Storrer: HYTEX
University of Giessen:
• H. Lobin: SEMDOC
University of Tübingen:
• U. Mönnich: COMOD

The TASX-Annotator: http://tasxforce.lili.uni-bielefeld.de/

Methodological issues: Multidimensionality of linguistic data requires:
(1) multiple tiers of annotation (XML-based)
(2) connections between multiple tiers (specific methods)
(3) multi-annotation of identical raw data (multiple trees)
(4) specific relations between multi-level annotations

Methodological issues: Multidimensionality of linguistic data requires (continued):
(5) a distinction between one or more conceptual levels (semantic markup) and one or more annotation layers (syntactic markup), as well as mappings between the two
(6) ways to use and to generate different annotation sets (annotation + data) given more uniform conceptual representations, for the accessibility of corpora (search, hypothesis testing, comparative or typological analysis)

Semdoc: Annotation

structural

    <sect1>
      <para> ... From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996). <footnoteref linkend="i5">5</footnoteref> </para>
    </sect1>

thematic

    <segment id="s24" parent="g6" newtopic="illustration_bck" litref="s" footnoteref="s33a"> From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet (see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996).5 </segment>

rhetorical

    <segment id="i17" parent="i56" relname="span"> From the now infamous McDonald's coffee spill case to litigation against Ford and Firestone for injuries caused by tire tread separation to tobacco litigation, high stakes civil cases have become familiar staples of our media diet</segment>
    <segment id="i18" parent="i17" relname="evidence">(see e.g., Are lawyers burning America, 1995; Budiansky, 1995; Church, 1986; Langley, 1986; Stossel, 1996).5</segment>
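The three layers above annotate the same passage in separate XML trees. A minimal Python sketch of how one might check that several layers really share identical primary data is given here. It is an illustration only, not the Semdoc tooling: the function names, the `<layer>` wrapper element and the shortened sample sentence are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def primary_text(xml_fragment):
    """Return the text of one annotation layer with markup and whitespace removed."""
    root = ET.fromstring(xml_fragment)
    # itertext() yields all character data in document order, ignoring tags.
    return "".join("".join(root.itertext()).split())


def same_primary_data(*layers):
    """True if all layers annotate the same character sequence (whitespace ignored)."""
    return len({primary_text(layer) for layer in layers}) == 1


# Shortened sample sentence (assumption: abbreviated from the slide for brevity).
structural = (
    "<sect1><para>High stakes civil cases have become familiar staples "
    "of our media diet (see e.g., Stossel, 1996).</para></sect1>"
)
thematic = (
    '<segment id="s24" parent="g6">High stakes civil cases have become '
    "familiar staples of our media diet (see e.g., Stossel, 1996).</segment>"
)
# Hypothetical <layer> wrapper so that the two rhetorical segments form one tree.
rhetorical = (
    '<layer><segment id="i17" parent="i56" relname="span">High stakes civil '
    "cases have become familiar staples of our media diet</segment> "
    '<segment id="i18" parent="i17" relname="evidence">(see e.g., Stossel, '
    "1996).</segment></layer>"
)

if __name__ == "__main__":
    print(same_primary_data(structural, thematic, rhetorical))
```

The comparison ignores whitespace because layers often differ in where segment boundaries introduce blanks. A stricter check would compare character offsets instead.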

Sekimo: Multiple annotations of Japanese dialogue corpora
• Annotation categories are based upon widely used tag-sets like IPADIC (ChaSen)
• The results of corpus analysis can be used to
    • compare the tag-sets empirically
    • augment tag-sets with conceptual information
    • reuse existing corpora which are based upon the same tag-sets

Sekimo: Sample Annotation

Example: Modelling of congruency in Japanese and German

The conceptual difference in congruency is reflected in different configurations of annotations, related via secondary information structuring:

Lexical-pragmatic congruency (Japanese): watashi ha murano to moushimasu
Morpho-syntactic congruency (German): Ich heiße Meier

General: two annotation units have marker
Ja-Germ-1: verb has marker
Ja-1: verb and utterance have marker
Ja-Germ-2: sentence has subject
Ja-Germ-3: subject has marker

Visualisation as SVG graphic

Sekimo: Example for mapping annotations <-> concepts

Concepts: WORD, NOUN, KOPULA

Mapping (concept NOUN):
    noun
    word[@pos="noun"]
    word[feature="pos" & value="noun"]

Annotations:
    <noun>watashi</noun>
    <word pos="noun">watashi</word>
    <word><feature>pos</feature> <value>noun</value> watashi</word>

Transformation:
    <NOUN>watashi</NOUN>
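The mapping idea on this slide can be sketched in Python: three differently encoded annotations of the token "watashi" are recognised as the concept NOUN and rewritten to one uniform element. This is a minimal sketch under stated assumptions, not the Sekimo implementation. The function names are hypothetical, and the three matching rules are hand-coded from the slide's mapping expressions rather than evaluated by a query engine.

```python
import xml.etree.ElementTree as ET


def is_noun(elem):
    """Check whether an element encodes the concept NOUN in one of three styles."""
    if elem.tag == "noun":                                # <noun>...</noun>
        return True
    if elem.tag == "word" and elem.get("pos") == "noun":  # word[@pos="noun"]
        return True
    if elem.tag == "word":                                # feature/value children
        feature, value = elem.find("feature"), elem.find("value")
        if feature is not None and value is not None:
            return feature.text == "pos" and value.text == "noun"
    return False


def surface_text(elem):
    """Collect the token text: the element's own text plus the tails of its children."""
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    return "".join(parts).strip()


def to_concept(xml_fragment):
    """Rewrite a recognised annotation as <NOUN>token</NOUN>, else return None."""
    elem = ET.fromstring(xml_fragment)
    if not is_noun(elem):
        return None
    concept = ET.Element("NOUN")
    concept.text = surface_text(elem)
    return ET.tostring(concept, encoding="unicode")


if __name__ == "__main__":
    for fragment in (
        "<noun>watashi</noun>",
        '<word pos="noun">watashi</word>',
        "<word><feature>pos</feature> <value>noun</value> watashi</word>",
    ):
        print(to_concept(fragment))
```

Keeping the recognition rules separate from the rewriting step means that further tag-sets can be supported by adding rules, while the conceptual output format stays unchanged.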

ModeLex: Temporal Calculus (Allen) for multimodal annotations
• Relations between annotation layers
• Can be applied to
    • Text: Order is given by character sequence
    • Signal: Order is given by timestamps
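Allen's interval calculus distinguishes thirteen relations between two intervals. The sketch below classifies a pair of annotation units given as (start, end) pairs. For text these would be character offsets, for signal data timestamps. The function name and the relation labels are choices made for this example, not ModeLex identifiers, and intervals are assumed to satisfy start < end.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a to interval b.

    a and b are (start, end) pairs with start < end; the units may be
    character offsets (text) or timestamps (signal).
    """
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Remaining case: the intervals partially overlap.
    return "overlaps" if a_start < b_start else "overlapped-by"


if __name__ == "__main__":
    # Text: character offsets of two annotation units on different layers.
    print(allen_relation((0, 7), (0, 7)))
    # Signal: timestamps in seconds, e.g. a word tier and a gesture tier.
    print(allen_relation((0.0, 1.5), (1.0, 2.0)))
```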

Lexicon Model: Subclassification of annotation units, based upon temporal relations

Classification hierarchy and corpus properties:
class: properties of class
    subclass: properties of subclass
        subsubclass: properties of subsubclass

HyTex: Multi-level approach
• Adaptive generation of hypertext views on coherence criteria
• User model (static or dynamic)
• TermNet: Representation of semantic relations between technical terms of the domain
• Textgrammatical annotation:
    • Definitions and technical terms
    • Topical and rhetorical structures
• Linguistic annotation:
    • POS tagging
    • Lemmatization
    • Chunk parsing

HyTex: Research Cooperation and Contacts

Text-grammatical foundations for the (semi)automated text-to-hypertext conversion (HyTex)

WordNet Project, Princeton University
GermaNet Project, University of Tübingen
Exchange of entities and relations for the TermNet model
TEMIS: Text Mining Solutions, Heidelberg/Paris
Annotation schema for anaphoric and co-reference relations in German texts.
Usage of the text mining tool Knowledge Extractor for the annotation of definitions
DFG-Forschergruppe 437: Text-technological modelling of information
Intelligent Views: Knowledge Management, Darmstadt
Usage of the tool "K-Infinity" supporting the convenient construction and maintenance of the TermNet
DEREKO: Corpus Technology at the University of Tübingen
Chunk parser for the syntactic annotation of the HyTex corpus

Sekimo: Project Context

Secondary information structuring and comparative discourse analysis

• DFG-Forschergruppe 437: Texttechnological Information Modelling
• NITE: Natural Interactivity Tools Engineering
    • University of Southern Denmark
    • Universitat Autònoma de Barcelona
    • DFKI Saarbrücken
    • HCRC Edinburgh
    • IMS Stuttgart
    • ILC Pisa
• SFB Mehrsprachigkeit Hamburg: Jadex, Japanese and German expert discourse in mono- and multilingual constellations

Research Group "Texttechnological Information Modelling"

International Conference "Modeling Linguistic Information Resources"
January 2004, Center for Interdisciplinary Research, University of Bielefeld
• Semantics of Generic Document Structures and Discourse Parsing
• Modelling Textual, Lexical and World Knowledge as a Basis for Hypertext Linking
• Multiple Annotation of Language Data
• Multimodal Lexical Information for Language Documentation
