Establishing Morpho-syntactic Data Categories for Enhanced NLP Interoperability

Data Category Registry: Morpho-syntactic Profile • Gil Francopoulo (TAGMATICA + CNRS-LIMSI, France), with the help of the rather active Morphosyntactic group: Nuria Bel (univ. Pompeu Fabra, Spain); Thierry Declerck (DFKI, Germany); Aida Khemakhem (Miracle/Sfax, Tunisia); Monte George (ANSI, USA); Sue Ellen Wright (Kent univ, USA); Chu Ren Huang (Polytechnical univ, Hong-Kong); Monica Monachini (CNR-ILC, Italy); Tokunaga Takenobu (TIT, Japan); Adam Przepiokowki (Poland); Tomaz Erjavec (JSI, Slovenia); Daniel Zeman (univ Karlova, Czech Rep); Gunta Nespore (univ of Latvia, Latvia); Karlheinz Mörth (Vienna, Austria); Karin Beck (Univ of Tübingen, Germany)

Introduction • Work is in progress to define an initial set of morpho-syntactic data categories dedicated to NLP applications • The aim is to improve interoperability among language resources and to streamline their integration into applications • The main point is to ensure that when a language resource uses a value, other language resources and programs interpret that value in the same way • These values have been collected from existing lists, discussed, extended, and then recorded in a freely accessible database: the ISO Data Category Registry (DCR)

Context • This work is done within the context of ISO-TC37 • The TC37 standards are currently being developed as high level specifications and deal with word segmentation (ISO 24614), annotations (ISO-MAF, ISO-LAF, ISO-SynAF, i.e. 24611, 24612 and 24615), feature structures (ISO-FSD 24610), and lexicons (ISO-LMF 24613) • These standards rely on low level specifications dedicated to constants, namely data categories (revision of ISO 12620), language codes (ISO 639), script codes (ISO 15924), country codes (ISO 3166) and Unicode (ISO 10646)

Context (cont.) • This bi-level approach will form a coherent family of standards with the following common and simple rules: 1) the low level specifications provide the constants 2) the high level specifications provide structural elements that are decorated by the constants
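
The two rules can be illustrated with a minimal Python sketch. The names `CONSTANTS`, `LexicalEntry` and `decorate` are hypothetical and were invented for this illustration; they are not taken from LMF or any other ISO specification. The identifier `partOfSpeech` is assumed by analogy with the camel case identifiers quoted elsewhere in this presentation.

```python
from dataclasses import dataclass, field

# Low level: constants only. An illustrative stand-in for the registry,
# not an extract of its actual content.
CONSTANTS = {"partOfSpeech", "commonNoun", "grammaticalGender", "feminine"}


@dataclass
class LexicalEntry:
    """Hypothetical high level structural element (not the LMF class)."""

    lemma: str
    features: dict = field(default_factory=dict)

    def decorate(self, attribute: str, value: str) -> None:
        # Only constants provided by the low level may decorate a structure.
        for name in (attribute, value):
            if name not in CONSTANTS:
                raise ValueError(f"unknown constant: {name}")
        self.features[attribute] = value


entry = LexicalEntry(lemma="table")
entry.decorate("partOfSpeech", "commonNoun")
entry.decorate("grammaticalGender", "feminine")
```

The structure carries no linguistic vocabulary of its own; everything it says about the entry comes from the shared constants, which is what lets two resources agree on the interpretation.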

Data model: notion of profile • The registry is divided into profiles • A profile is a set of data categories • Each profile is associated with a team of experts (with a convener) who collectively represent a community of practice in the area of language resources • There are currently fourteen profiles, such as terminology and metadata, covering all activities of ISO-TC37. The current presentation focuses on one profile dedicated to NLP: the morpho-syntactic profile • Note: in most cases a DC belongs to only one profile, but some belong to several profiles (e.g. part of speech)
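
As a rough sketch of this data model, a profile can be treated as a named set of data category identifiers, so that one DC may appear in several profiles. The profile memberships below are assumptions made for illustration (the presentation only states that part of speech belongs to several profiles, not which ones), and `profiles_of` is a hypothetical helper.

```python
# Profile name -> set of data category identifiers.
# Memberships are illustrative assumptions, not registry content.
profiles = {
    "morphosyntax": {"partOfSpeech", "commonNoun", "dativeCase"},
    "terminology": {"partOfSpeech"},
}


def profiles_of(dc: str) -> list:
    """Return the sorted names of all profiles that contain the given DC."""
    return sorted(name for name, dcs in profiles.items() if dc in dcs)


shared = profiles_of("partOfSpeech")   # a DC shared by two profiles
single = profiles_of("dativeCase")     # a DC in one profile only
```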

Methodology: phases • We proceeded in four phases: • Phase-1: collation of candidate data categories (2006) • Phase-2: grouping, discussion, structuring, and drafting of definitions (2007-2008) • Phase-3: global revision (2009) • Phase-4: a further revision with a group of newcomers (2010)

Methodology: sources • For the morpho-syntactic profile, a long list was collected from ISO-12620:1999, from Eagles and Multext-East, and from some values for Semitic languages provided by Sfax Univ. • Some values needed for ISO-TC37 standards (MAF, SynAF, LMF) were also added • A few isolated values came from various remarks made in 2010 • These values were collected in close coordination with the syntactic profile in order to distinguish the morpho-syntactic values from the syntactic ones • For the syntactic values, an initial list was collected, based on Eagles, Tiger (German project), and Technolangue/Easy (French project)

Methodology: detail of recording • Each DC has an English-based identifier written in camel case (e.g. commonNoun), as specified in the revision of ISO-12620 • Each DC has a definition in English and French. The text respects the ISO rules for definitions. A definition may be complemented by a note. • A DC may be linked through a broader link to another DC. A DC may have a value domain. • Each DC has at least a name in English and one in French, which may be used directly for display without any transformation (e.g. « common noun »)
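
The recording rules above can be sketched as a small Python record. This is an assumed shape for illustration, not the actual isocat schema: the class `DataCategory`, its field names and the `CAMEL_CASE` pattern are hypothetical, the definition strings are placeholders rather than registry text, and the French name and the broader link to `noun` are plausible guesses.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Camel case: lowercase start, then optional capitalised parts (e.g. commonNoun).
CAMEL_CASE = re.compile(r"[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)*")


@dataclass
class DataCategory:
    identifier: str                 # English-based, camel case
    definitions: dict               # language code -> definition text
    names: dict                     # language code -> display name
    note: Optional[str] = None      # optional complement to the definition
    broader: Optional[str] = None   # identifier of a broader DC, if any
    value_domain: tuple = ()        # permitted values, if any

    def __post_init__(self) -> None:
        if not CAMEL_CASE.fullmatch(self.identifier):
            raise ValueError(f"identifier is not camel case: {self.identifier}")
        for lang in ("en", "fr"):
            if lang not in self.definitions or lang not in self.names:
                raise ValueError(f"missing definition or name for: {lang}")


common_noun = DataCategory(
    identifier="commonNoun",
    definitions={"en": "(English definition)", "fr": "(French definition)"},
    names={"en": "common noun", "fr": "nom commun"},  # French name assumed
    broader="noun",                                   # assumed broader link
)
```

Because the names are stored ready for display, an application can show `common_noun.names["en"]` directly, with no transformation of the identifier.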

Current registry • The 12620 revision work started in 2003 and a lot of energy has been spent on finding an operational consensus • The model is implemented in a system called « isocat » which is currently running and located at: « http://www.isocat.org » • About a dozen people have entered values, mainly in the domains of metadata, terminology, morpho-syntax, and syntax. The other profiles are almost empty. • The number of values is rather large (468), so a series of sub-profiles was created to make management easier

Practical organization of data
Morpho-syntactic profile (sub-profile, number of values, description):
• Basics (61): These are general purpose linguistic constants, like: comment, derivation, elision, foreignText, and label.
• Cases (33): Examples of values: ablativeCase or dativeCase.
• FormRelated (36): These are constants for the specifications of forms like: spokenForm, writtenForm, abbreviation, expansionVariation, transliteration, romanization, transcription, script.
• Morphological Features excluding cases (82): Attributes include, for instance, grammaticalGender, mood and tense. Values include, for instance, feminine, indicative, present.
• Operations (29): Constants include, for instance, addAffix, addLemma.
• Part of speech (120): Part of speech values are structured with a top level set composed of 10 values like noun or verb. A very precise ontology is specified for grammatical words. Most parts of speech are common to lexicons and annotations, but two sets of values (i.e. punctuation and residual) are specific to annotation and are not usually used in lexical descriptions.
• Register, dating and frequency (19): Constants include, for instance, slangRegister or rarelyUsed.
• Total: 380
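
As a quick consistency check, the sub-profile counts on this slide can be summed; the short Python sketch below simply restates the figures given above and confirms the stated total. The variable names are illustrative.

```python
# Number of values per sub-profile, as listed on the slide.
sub_profiles = {
    "Basics": 61,
    "Cases": 33,
    "FormRelated": 36,
    "Morphological Features excluding cases": 82,
    "Operations": 29,
    "Part of speech": 120,
    "Register, dating and frequency": 19,
}

total = sum(sub_profiles.values())
print(total)  # 380, matching the total on the slide
```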

Extract: Cases: abessiveCase ablativeCase absolutiveCase accusativeCase adessiveCase aditiveCase allativeCase benefactiveCase causativeCase comitativeCase dativeCase delativeCase elativeCase equativeCase ergativeCase essiveCase genitiveCase illativeCase inessiveCase instrumentalCase lativeCase locativeCase nominativeCase obliqueCase partitiveCase prolativeCase sociativeCase sublativeCase superessiveCase terminativeCase translativeCase vocativeCase

Extract: Form related values: affix infix prefix suffix affixRank allomorph apocope componentRank conjugated contextualVariation expansionVariation geographicalVariant graphicalSeparator homograph homonym homophone lemma lexicalType morpheme etymologicalRoot native orthographyName patternType phoneticForm phoneticSeparator pinyin nonSpacedPinyin spacedPinyinAndTone reduplication root script stem stemRank symbol token writtenForm

Problems encountered • As said earlier, we started from existing lists that are rather stable, like those of Eagles or Multext-East • The first problem was that we had to write definitions. We searched various sources and found definitions that looked fine in isolation but did not collectively constitute a coherent set • Linguistics is not a field with common agreement on basic terms. Ex: paradigm, collocation, morpheme, ergative • As an example, look at the entry « morphology » in Wikipedia • Another problem was that the definitions had to be valid for both lexicon and annotation activities. Ex: « word » • To deal with this problem, we carefully avoided some dangerous terms

Forthcoming data • The current database records values for Western and Eastern European languages and, to a certain extent, for Semitic languages • We know that this is clearly not enough • Two parallel tasks are currently being conducted • One task deals with Asian values within the NEDO project. A small set of DCs has been entered in the database • The other task deals with the DCs specifically needed for African languages: a study is being conducted by the ISO South African delegation, but the values have not yet been entered in the database

Conclusion • The registry is far from complete, but it is beginning to be used in different applications in order to be tested • The idea is to progressively increase the number and coverage of these data categories • The ambition is that the registry will become the reference point for linguistic terms and data elements used in lexicons and annotations in an NLP context • Thank you for your attention
