
TEI for language resources: a missed chance or a coming opportunity ?




  1. TEI for language resources: a missed chance or a coming opportunity?
     Tomaž Erjavec
     Dept. of Knowledge Technologies, Jožef Stefan Institute
     Ljubljana, Slovenia

  2. TEI for Language Resources: Overview
     • Some history
     • Why TEI isn't used for LRs (as much as expected)
     • MULTEXT-East and other case studies
     • Conclusions

  3. History
     At its inception, TEI was meant to cover CL/NLP LRs, esp. corpora:
     • ACL was one of the supporting associations
     • Modules for corpora, linguistic analysis, feature structures, graphs
     • BNC in TEI
     • At the time, CL/NLP did not use SGML: a clear playing field

  4. The age of XML and LRs
     The release of XML (more or less) corresponds to the beginning of the era of language resources:
     • 1998: XML 1.0, first LREC conference
     But the LRs that were developed (mostly) did not use TEI. Why?

  5. Reason 1: (X)CES
     • EAGLES Corpus Encoding Standard
     • „constraining or simplifying the TEI specifications in order to ensure interoperability“ (Ide 1998)
     • So, more compact and easier to apply than TEI
     • Almost TEI, but not quite
     • No methods for extension

  6. Reason 2: Comp Sci attitude
     • I don't care about the data format, I want to develop algorithms! (... I even hate XML...)
     • If I use XML I will roll my own schema, optimal for my experiments / application (...that's what 'X' means...)
     • I won't spend weeks (months, years) just getting to know TEI (...I need only 4 different elements anyway...)

  7. Reason 3: General gripes
     • Missing modules for syntactic analysis & lexical databases
     • Not prescriptive / precise enough
     • Elements too general
     • Too book-oriented

  8. Result
     • Project-local proposals:
       • TIGER treebank format
       • Concede lexical database format
       • GENIA NER format
       • ...
     • Semantic Web: DC, RDF, OWL
     • ISO TC 37 SC4:
       • LMF, isoCat,
       • LAF, MAF, SynAF, ...

  9. MyTEI
     • MULTEXT-East: multilingual corpora and lexica
     • Fida(PLUS): Slovene Reference Corpus
     • IJS-ELAN, SVEZ-IJS: en-sl parallel corpora
     • jaSlo: Japanese-Slovene L2 dictionary
     • eZISS: Scholarly Digital Editions of Slovene Literature
     • JRC-ACQUIS: Parallel corpus of EC laws
     • SDT: Slovene Dependency Treebank
     • SBL: Slovene Biographic Lexicon
     • AHLib: DL/corpus of 19th century Slovene books
     • JOS: Slovene gold-standard corpus for HLT
     • MULTEXT-East...

  10. MULTEXT-East
      • EU project 1995-97: MULTEXT sequel
      • Development of standardised language resources for Central and Eastern European languages + English hub
      • Corpora, lexica, morphosyntactic specifications
      • V1: 1998, 7 languages, LaTeX + CES/SGML
      • V4: 2010, 16 languages, TEI P5
      • http://nl.ijs.si/ME/

  11. MULTEXT-East Version 4 by language and resource type

  12. Why TEI for MTE?
      • Because I like TEI
      • Varied resources:
        • Metadata / Documentation
        • „Document“ corpus: rich annotation structure
        • Linguistically annotated „1984“ corpus
        • Sentence alignments: stand-off markup
        • Morphosyntactic specifications: book-like
      Either choose several (moving target) schemas or use TEI.

  13. Documentation

  14. TEI Header
      [Figure not preserved; recoverable labels: v4, v3, v2, v1, eci, ota, soas]

  15. Annotated 1984
      <text xml:id="Osl." xml:lang="sl">
        <body>
          <div type="part" xml:id="Osl.1">
            <div type="chapter" xml:id="Osl.1.2">
              <p xml:id="Osl.1.2.2">
                <s xml:id="Osl.1.2.2.1">
                  <w xml:id="Osl.1.2.2.1.1" lemma="biti" ana="#Va-p-sm">Bil</w>
                  <w xml:id="Osl.1.2.2.1.2" lemma="biti" ana="#Va-r3s-n">je</w>
                  <w xml:id="Osl.1.2.2.1.3" lemma="jasen" ana="#Agpmsnn">jasen</w>
                  <c xml:id="Osl.1.2.2.1.4">,</c> ← sorry!
                  <w xml:id="Osl.1.2.2.1.5" lemma="mrzel" ana="#Agpmsnn">mrzel</w>
                  <w xml:id="Osl.1.2.2.1.6" lemma="aprilski" ana="#Agpmsny">aprilski</w>
                  <w xml:id="Osl.1.2.2.1.7" lemma="dan" ana="#Ncmsn">dan</w>
                  <w xml:id="Osl.1.2.2.1.8" lemma="in" ana="#Cc">in</w>
                  <w xml:id="Osl.1.2.2.1.9" lemma="ura" ana="#Ncfpn">ure</w>
                  <w xml:id="Osl.1.2.2.1.10" lemma="biti" ana="#Va-r3p-n">so</w>
                  <w xml:id="Osl.1.2.2.1.11" lemma="biti" ana="#Va-p-pf">bile</w>
                  <w xml:id="Osl.1.2.2.1.12" lemma="trinajst" ana="#Mlc-pa">trinajst</w>
                  <c xml:id="Osl.1.2.2.1.13">.</c>
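A minimal sketch of how such token-level annotation might be read programmatically, using only the Python standard library. The sample is a shortened, self-closed copy of the sentence on the slide; it assumes no default namespace (as on the slide), whereas real TEI P5 files normally declare the TEI namespace, which would change the tag names. The function name `read_tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined; ElementTree expands it to this namespace URI.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Shortened copy of the slide's sentence, closed so that it is well-formed.
SAMPLE = """
<s xml:id="Osl.1.2.2.1">
  <w xml:id="Osl.1.2.2.1.1" lemma="biti" ana="#Va-p-sm">Bil</w>
  <w xml:id="Osl.1.2.2.1.2" lemma="biti" ana="#Va-r3s-n">je</w>
  <w xml:id="Osl.1.2.2.1.3" lemma="jasen" ana="#Agpmsnn">jasen</w>
  <c xml:id="Osl.1.2.2.1.4">,</c>
</s>
"""


def read_tokens(xml_text):
    """Return one dict per token (<w> or <c>) of a single <s> element."""
    sentence = ET.fromstring(xml_text.strip())
    tokens = []
    for element in sentence:
        ana = element.get("ana")
        tokens.append({
            "id": element.get(XML_NS + "id"),
            "kind": element.tag,            # "w" for words, "c" for punctuation
            "form": element.text,
            "lemma": element.get("lemma"),  # None for punctuation
            # ana is a pointer to the MSD definition; drop the leading '#'.
            "msd": ana.lstrip("#") if ana else None,
        })
    return tokens


for token in read_tokens(SAMPLE):
    print(token["id"], token["form"], token["lemma"], token["msd"])
```

Keeping the `ana` value as a pointer (rather than a bare tag string) is what lets the corpus link each token to the feature-structure definition of its MSD.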

  16. Whitespace
      • A long time ago „1984“ lost its spaces
      • Whitespace is brittle but important:
        • Retokenisation
        • Reading
      • TEI <space> no good!
      • So <mte:space> </mte:space>, 24:1?
      • Sitting on the fence JOS solution: </S>
      • <mte:g/>?

  17. Sentence alignments
      In MTE V3:
      <?xml version="1.0" encoding="us-ascii"?>
      <!DOCTYPE cesAlign SYSTEM "xcesAlign.dtd">
      <cesAlign version="4.1">
        <linkList id="Oruen">
          <linkGrp type="body" targType="s" domains="Oru Oen">
            <link xtargets="Oru.1.1.1.1 ; Oen.1.1.1.1"/>
            <link xtargets="Oru.1.1.16.6 Oru.1.1.16.7 ; Oen.1.1.15.6"/>
            <link xtargets="Oru.1.3.4.1 ; Oen.1.3.4.1 Oen.1.3.4.2"/>
            <link xtargets=" ; Oen.1.3.4.3"/>
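The `xtargets` convention shown above (source sentence ids, a semicolon, then target sentence ids) can be unpacked with a few lines of code. This is only a sketch inferred from the four links on the slide; the function names are hypothetical and no error handling for malformed values is attempted.

```python
def parse_xtargets(xtargets):
    """Split a cesAlign xtargets value into (source_ids, target_ids)."""
    # The semicolon separates the two sides; either side may be empty.
    source, _, target = xtargets.partition(";")
    return source.split(), target.split()


def link_type(xtargets):
    """Return the alignment type, e.g. '2:1', for one xtargets value."""
    source_ids, target_ids = parse_xtargets(xtargets)
    return f"{len(source_ids)}:{len(target_ids)}"


# The four links from the slide.
LINKS = [
    "Oru.1.1.1.1 ; Oen.1.1.1.1",
    "Oru.1.1.16.6 Oru.1.1.16.7 ; Oen.1.1.15.6",
    "Oru.1.3.4.1 ; Oen.1.3.4.1 Oen.1.3.4.2",
    " ; Oen.1.3.4.3",
]

for value in LINKS:
    print(link_type(value), parse_xtargets(value))
```

The last link has an empty source side, i.e. a target sentence with no counterpart.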

  18. TEI P5 Alignments
      • The TEI way uses two levels of indirection: 1st grouping, 2nd alignment
      • Too complicated, esp. as 98% of alignments are 1-1
      • Chose fence-sitting one-level:
      <linkGrp type="alignment" corresp="oana-mk.xml oana-sl.xml">
        <link n="1:1" targets="oana-mk.xml#Omk.1.1.1.1 oana-sl.xml#Osl.1.2.2.1"/>
        <link n="2:1" targets="oana-mk.xml#Omk.1.1.2.6 oana-mk.xml#Omk.1.1.2.7 oana-sl.xml#Osl.1.2.3.6"/>
        <link n="1:2" targets="oana-mk.xml#Omk.1.1.2.8 oana-sl.xml#Osl.1.2.3.7 oana-sl.xml#Osl.1.2.3.8"/>
        <!--link n="0:1" targets="oana-sl.xml#Osl.4.12.2"/-->
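In the one-level encoding, both sides of an alignment sit in a single `targets` list, so a consumer has to regroup the pointers by document. The sketch below does this using the document order given in `corresp`, and recomputes the type recorded in `n`. It is an illustration based only on the links shown on the slide; the function names are hypothetical.

```python
def split_targets(targets, corresp):
    """Group 'file#id' pointers by document, in the order given by corresp."""
    documents = corresp.split()
    sides = {document: [] for document in documents}
    for pointer in targets.split():
        document, _, identifier = pointer.partition("#")
        # Raises KeyError if a pointer refers to a document not in corresp.
        sides[document].append(identifier)
    return [sides[document] for document in documents]


def alignment_type(targets, corresp):
    """Recompute the value stored in the n attribute, e.g. '2:1'."""
    return ":".join(str(len(side)) for side in split_targets(targets, corresp))


CORRESP = "oana-mk.xml oana-sl.xml"
TARGETS = "oana-mk.xml#Omk.1.1.2.6 oana-mk.xml#Omk.1.1.2.7 oana-sl.xml#Osl.1.2.3.6"

print(split_targets(TARGETS, CORRESP))
print(alignment_type(TARGETS, CORRESP))
```

A check of this kind can also be used to validate that the `n` attribute agrees with the pointers actually listed.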

  19. Morphosyntactic specifications
      • Define categories (PoS) and their features
      • Map feature-structures to morphosyntactic descriptions (MSD tagsets)
      • Specify which languages have which features and tagsets
      • E.g. [Category=Adverb Type=general Degree=superlative] ≡ Rgs ∈ Tagset_sl
      • Complex morphology → complex specifications
      • MSD tagsets are grounded in lexicon and corpus
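The mapping between an MSD and its feature structure is positional: the first character encodes the category and each following character encodes the value of one attribute. The sketch below decodes the slide's example `Rgs`. The specification table here is a toy containing only the values that appear on the slide, not the real MULTEXT-East adverb specification, and treating `-` as "attribute not applicable" is an assumption based on tags such as `Va-p-sm` in the annotated corpus.

```python
# Toy specification: only the codes that appear on the slide (assumption).
ADVERB_SPEC = {
    "category": ("R", "Adverb"),
    "attributes": [
        ("Type", {"g": "general"}),
        ("Degree", {"s": "superlative"}),
    ],
}


def decode_msd(msd, spec):
    """Expand a positional MSD string into a feature dictionary."""
    code, name = spec["category"]
    if not msd.startswith(code):
        raise ValueError(f"{msd!r} does not belong to category {name}")
    features = {"Category": name}
    # Position i (i >= 1) of the MSD holds the code for the i-th attribute.
    for (attribute, values), char in zip(spec["attributes"], msd[1:]):
        if char == "-":
            continue  # assumed to mean: attribute not applicable
        features[attribute] = values[char]
    return features


print(decode_msd("Rgs", ADVERB_SPEC))
```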

  20. Example: common specifications
      <table n="msd.cat" xml:lang="en" xml:id="msd.cat.Q">
        <head>Common specifications for Particle</head>
        <row role="type">
          <cell role="position">0</cell>
          <cell role="name">CATEGORY</cell>
          <cell role="value">Particle</cell>
          <cell role="code">Q</cell>
          <cell role="lang">ro</cell>
          <cell role="lang">sl</cell>
          ...
        </row>
        <row role="attribute">
          <cell role="position">1</cell>
          <cell role="name">Type</cell>
          <cell>
            <table>
              <row role="value">
                <cell role="name">negative</cell>
                <cell role="code">z</cell>
                <cell role="lang">ro</cell>
              </row>
              <row role="value">
                <cell role="name">interrogative</cell>
                <cell role="code">q</cell>
                <cell role="lang">bg</cell>
                <cell role="lang">hr</cell>
                ....
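Because the specification tables use `role` attributes consistently, they can be turned into a lookup structure with generic code. The following sketch parses a closed-off copy of the fragment above; the language lists are truncated exactly as on the slide, so the output is incomplete by construction. Function names and the shape of the returned dictionary are hypothetical choices.

```python
import xml.etree.ElementTree as ET

# Closed-off copy of the slide fragment; elided cells are simply omitted.
SAMPLE = """
<table n="msd.cat" xml:lang="en" xml:id="msd.cat.Q">
  <head>Common specifications for Particle</head>
  <row role="type">
    <cell role="position">0</cell>
    <cell role="name">CATEGORY</cell>
    <cell role="value">Particle</cell>
    <cell role="code">Q</cell>
    <cell role="lang">ro</cell>
    <cell role="lang">sl</cell>
  </row>
  <row role="attribute">
    <cell role="position">1</cell>
    <cell role="name">Type</cell>
    <cell>
      <table>
        <row role="value">
          <cell role="name">negative</cell>
          <cell role="code">z</cell>
          <cell role="lang">ro</cell>
        </row>
        <row role="value">
          <cell role="name">interrogative</cell>
          <cell role="code">q</cell>
          <cell role="lang">bg</cell>
          <cell role="lang">hr</cell>
        </row>
      </table>
    </cell>
  </row>
</table>
"""


def cells(row, role):
    """Text of the direct <cell> children of a row that carry the given role."""
    return [cell.text for cell in row.findall("cell") if cell.get("role") == role]


def read_spec(xml_text):
    """Convert one category table into a nested dictionary."""
    table = ET.fromstring(xml_text.strip())
    spec = {"attributes": {}}
    for row in table.findall("row"):
        if row.get("role") == "type":
            spec["category"] = cells(row, "value")[0]
            spec["code"] = cells(row, "code")[0]
            spec["langs"] = cells(row, "lang")
        elif row.get("role") == "attribute":
            values = {}
            # Value rows live in a table nested inside an unlabelled cell.
            for value_row in row.findall("cell/table/row"):
                values[cells(value_row, "code")[0]] = {
                    "name": cells(value_row, "name")[0],
                    "langs": cells(value_row, "lang"),
                }
            spec["attributes"][cells(row, "name")[0]] = values
    return spec


print(read_spec(SAMPLE))
```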

  21. TEI for Language Resources

  22. Language particular specifications
      <div type="section" select="sl" xml:id="msd.Q-sl">
        <head>Slovene Particle</head>
        <table n="msd.cat" select="sl" xml:id="msd.cat.Q-sl">
          <head>Slovene Specification for Particle</head>
          <row role="type">
            <cell role="position">0</cell>
            <cell role="name" xml:lang="sl">besedna_vrsta</cell>
            <cell role="value" xml:lang="sl">členek</cell>
            <cell role="code" xml:lang="sl">L</cell>
            <cell role="name" xml:lang="en">CATEGORY</cell>
            <cell role="value" xml:lang="en">Particle</cell>
            <cell role="code" xml:lang="en">Q</cell>
          </row>
        </table>
        <p xml:lang="sl">Opombe:
          <list>
            <item>kot členki so označene le pojavnice, ki so navedene v leksikonu</item>
          </list>
        </p>
        <divGen xml:id="msd.Q-sl.lexicon" type="msd.lex" select="sl"/>
      </div>
      MTEsl = JOS

  23. TEI for Language Resources

  24. Encoding
      • TEI provides the needed elements, also for commentary, bibliography, ...
      • TEI XSLT used to render as HTML
      • Tables retained from MULTEXT
      • Several XSLT scripts for MSD conversions, e.g. to collating sequence, to fvLib and fsLib
      • Interesting challenge: conversion to isoCat (Adam P. for Polish tagset), OWL

  25. MTE specifications in OWL (by Christian Chiarcos)

  26. Morals, 1
      • TEI is good for in-place markup of richly annotated resources with varied structure:
        • Readable
        • Updatable (validation)
      • Not good for huge datasets with shallow annotation:
        • Processable
        • Read only
      → useful for (small, medium size) gold standard hand-corrected language resources / „new“ languages → localisation /

  27. IMPACT @ JSI
      • EU IP „Improving Access to Text“
      • Make better OCR and IR for historical texts
      • JSI: Developing a lemmatisation (+ modernisation) module for XIX century Slovene
      • Background: Lexicon, Tagging and Lemmatisation for modern Slovene + FSA rewrite patterns
      • Current dataset: AHLib (~100 books)
      • AHLib marked up in TEI

  28. AHLib Digital Library

  29. IMPACT Lexicon

  30. Mark-up challenges
      • Text-critical apparatus vs. linguistic annotation
      • „Parallel“ corpora of transcriptions and modernisations
      • Layered linguistic annotations: tokenisation, tagsets
      • Lexicon (+dictionary) encoding

  31. Morals, 2
      • Text-critical editions use TEI anyway
      • Ditto for DLs of historical texts
      • HLT increasingly applied also to such texts
      • TEI provides a good basis to join the two views

  32. Current EU Projects: FlareNet
      • Fostering Language Resources Network (2008-11)
      • WG4 - Harmonisation of Formats and Standards
      • D4.1 Identification of problems in the use of LR standards and of standardisation needs (M12):
        „For academic purposes the TEI Guidelines (current version P5) has been a well established and widely used resource of LR‐specific standards mainly for corpus analysis, markup and annotation. But TEI is hardly known in industrial communities (with a few exceptions) and completely foreign to professional groups such as localizers and translators. We see great potential in using TEI Guidelines in industrial contexts.“ /underlined by T.E./
      • D4.2 Proposal of a European Language Resource Standards Framework (M24 / 2010-09-01)

  33. Research Infrastructures for the Humanities
      • DG Research funded RIs; pilot phase, 2008-2010
      • DARIAH → ask Lou...
      • EU RI CLARIN: Common Language Resources and Technology Infrastructure
      • WP5 Language Resources and Technologies Overview
      • D5C-3: Interoperability & Standards: „Due to the versatile nature of TEI, most of the following chapters include details on encoding digital text by following the P5 guidelines and conversion methods.“

  34. Morals, 3
      • TEI is firmly acknowledged in current work on LR encoding standardisation
      • But it is not prescriptive enough and lacks modules for many types of LRs
      → Need for constrained solutions & linkages to ISO/W3C standards:
        • Cross-walks
        • Roma & Schema „namespace“ catalogue to DC, LMF, MAF, ...

  35. TEI for LRs: SWOT
      • Strengths: Universality, Maturity, Community, Extensibility (compare ISO)
      • Weaknesses: Vagueness, Learning curve, ISO/W3C linkage
      • Opportunities: HLT (Humanities Language Technologies), New languages
      • Threats: Marginalisation, Technical obsolescence

  36. Conclusions
      • Frontiers: DL+HLT, Gold standard LRs
      • Priority: Instantiated connections to other standards and languages
      • Connection with linguistics? SIG will tell...
