
Towards portability and interoperability for linguistic annotation and language-specific ontologies



Presentation Transcript


  1. Towards portability and interoperability for linguistic annotation and language-specific ontologies Robert Munro & David Nathan Endangered Languages Archive, School of Oriental and African Studies

  2. Outline • Introduction and motivation • Linguistic ontologies and markups • Representing knowledge • Supporting fieldworkers • Supporting speakers • Conclusions

  3. 1. Introduction and motivation

  4. Introduction • The main goal of this paper: • to examine how GOLD meets the requirements of portability for language documentation and description (Bird & Simons, 2003) • Road-testing: • its ability to meet the needs of archive users and contributors

  5. Motivation • The Endangered Languages Archive (ELAR) is part of the Hans Rausing Endangered Languages Project (HRELP) • HRELP supports: • the archive • grants for documentation projects • postgraduate programs focussing on language documentation

  6. Motivation • We (ELAR): • support a digital archive (preserve data and provide access to it) • We also train students and grantees in: • markup strategies • data management strategies • multimedia development • choice of recording equipment

  7. Motivation • There is concern that cataloguing metadata (IMDI / OLAC) has not yet been sufficiently extended (Nathan and Austin, 2004) • rich linguistic and contextual information is not being recorded in well-formed portable formats/structures • Common ontologies present a solution to this

  8. How does GOLD meet our needs? • We find GOLD to be the most suitable ontology for supporting data portability • GOLD’s focus has been on ‘data analysis sets’

  9. Summary • We suggest extending the focus to: • data acquisition • data access • Key extensions: • formalising the definitions of concepts by representing them as a set of formal properties • explicitly capturing the conventions and constraints for presentation (rendering) • modelling features that are inherently indeterminate and/or complex structures

  10. 2. Linguistic ontologies and markups

  11. Linguistic ontologies and markups • Ontology: • strictly, what we agree exists • Markup: • strictly, what we are certain about • Ontology and markup converge: • only with consensus and complete confidence • but there is rarely full confidence in the classification of new hard-to-classify phenomena in little-studied endangered languages

  12. Indeterminacy • Builders of ontologies outside of linguistics have been reluctant to accept inherent indeterminacy: In some cases, the incompatibilities [between ontologies] can be smoothed over by tweaking definitions of concepts or formalizations of axioms; in other cases, wholesale theoretical revision may be required. (Niles & Pease, 2001) • If we can identify the incompatibilities, we can model them

  13. Supporting linguistics • A theory-neutral model of linguistics is not possible: • Theories are poly-centric • They will change • We need a pan-theory model of linguistics

  14. Formalising definitions • Each concept in GOLD should be represented by a set of properties that describe that concept • Three possible values for a given property: • ‘Yes’, ‘No’, or ‘Undefined’ (default) • To accurately represent variance: • include enough properties to distinguish terms • For portability: • include as many properties as possible

  15. Formalising definitions • ‘Yes’ can potentially be expanded to record: • whether the property is mandatory or optional for the concept • dependencies between properties for a concept
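
The three-valued property scheme on slides 14 and 15 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Value` and `Concept`, the `assign` and `get` methods, and the three property names are hypothetical and do not come from GOLD.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class Value(Enum):
    YES = "yes"
    NO = "no"
    UNDEFINED = "undefined"  # the default for any property not yet decided


@dataclass
class Concept:
    name: str
    # property name -> Value; a property that is absent counts as UNDEFINED
    properties: Dict[str, Value] = field(default_factory=dict)
    # expansion of 'Yes': the subset of 'Yes' properties that are mandatory
    mandatory: Set[str] = field(default_factory=set)

    def assign(self, prop: str, value: Value, mandatory: bool = False) -> None:
        self.properties[prop] = value
        if value is Value.YES and mandatory:
            self.mandatory.add(prop)
        else:
            self.mandatory.discard(prop)

    def get(self, prop: str) -> Value:
        return self.properties.get(prop, Value.UNDEFINED)


# Hypothetical property names, chosen only to show the mechanics.
noun = Concept("Noun")
noun.assign("heads_noun_phrase", Value.YES, mandatory=True)
noun.assign("inflects_for_number", Value.YES)  # 'Yes', but optional
noun.assign("inflects_for_tense", Value.NO)

print(noun.get("inflects_for_tense"))  # Value.NO
print(noun.get("takes_classifier"))    # Value.UNDEFINED
```

Treating an absent key as ‘Undefined’ means new properties can be added to the ontology later without touching existing concepts. Dependencies between properties are not modelled in this sketch.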

  16. Example • ‘Noun’ in GOLD: Noun Definition: A noun is a broad classification of parts of speech which include substantives and nominals (Crystal 1997:371; Mish et al. 1990:1176). (http://emeld.org/gold-ns/description.html#Noun, last checked 23/05/2003) • How do I know if my definition is the same as Crystal’s or Mish et al.’s? • Does the GOLD definition cover both definitions, or only their common ground?

  17. Example • Will future users of GOLD have the same definition? • the core of ‘noun’ may have longevity • the boundaries with other concepts will not • COPEs can define extensions in terms of sets of properties, and add those properties to GOLD

  18. Example • GOLD: NOUN • COPEs: GerundNOUN, NomVerbNOUN • Can’t formally identify the similarities

  19. Example • GOLD: NOUN • COPEs: GerundNOUN (+ property: verb suffix), NomVerbNOUN (+ property: verb suffix) • Can formally identify the similarities • Definition of NOUN can grow
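
The contrast between slides 18 and 19 can be sketched as follows: once each COPE extension lists its properties, the overlap is a plain set intersection, and the shared property can be added to GOLD's NOUN. The dictionary layout and the function name `shared_properties` are assumptions made for illustration; only the concept names and the "verb suffix" property come from the slides.

```python
# GOLD concept -> properties recorded for it so far (NOUN starts with none here)
gold = {"NOUN": set()}

# Two community-defined extensions of NOUN, each listing its extra properties
copes = {
    "GerundNOUN": {"parent": "NOUN", "properties": {"verb suffix"}},
    "NomVerbNOUN": {"parent": "NOUN", "properties": {"verb suffix"}},
}


def shared_properties(a: str, b: str) -> set:
    """Return the properties that two COPE extensions have in common."""
    return copes[a]["properties"] & copes[b]["properties"]


shared = shared_properties("GerundNOUN", "NomVerbNOUN")
print(shared)  # {'verb suffix'}

# The similarity is now formal, so the definition of NOUN can grow.
gold["NOUN"] |= shared
```

Without the property sets, the two extensions are just two unrelated labels under NOUN, which is the situation on slide 18.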

  20. 3. Representing knowledge

  21. Rendering • Separating form from content: • ideal for flexibility • not possible for some materials (esp. video)

  22. Rendering conventions / constraints • Some are well known: • italicize part-of-speech in dictionaries • align interlinear transcriptions • Some are not: • representation of language-specific kinship systems, ethnobotanical ontologies etc

  23. Solution 1 • Include a (written) description and/or example of the rendering conventions and constraints: • hard-code the interface

  24. Solution 2 • Include formal representations of the conventions within the data: • interface takes instructions from the data

  25. Solutions • These are two extremes • hard-coded and language specific • data driven and language independent • Database architectures and linguistic ontologies • not designed for navigation • ‘transparent’ access to such structures – who does it support?
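
As a rough illustration of Solution 2, the sketch below stores the rendering conventions alongside the data, and a generic renderer follows them. The record layout, the `rendering` keys, and the sample entry are all hypothetical; the only convention taken from the slides is italicising the part of speech in a dictionary entry.

```python
import html

# Hypothetical record: the data carries its own rendering conventions.
entry = {
    "fields": {"headword": "example", "pos": "n.", "gloss": "a sample entry"},
    "rendering": {
        "order": ["headword", "pos", "gloss"],
        "style": {"headword": "bold", "pos": "italic"},
    },
}

# The interface knows how to realise a style, but not which field gets it.
TAGS = {"bold": "b", "italic": "i"}


def render(record: dict) -> str:
    """Render a record as an HTML fragment using only instructions in the data."""
    parts = []
    for name in record["rendering"]["order"]:
        text = html.escape(record["fields"][name])
        tag = TAGS.get(record["rendering"]["style"].get(name))
        parts.append(f"<{tag}>{text}</{tag}>" if tag else text)
    return " ".join(parts)


print(render(entry))  # <b>example</b> <i>n.</i> a sample entry
```

A hard-coded interface (Solution 1) would instead fix the field order and styles inside `render` itself, which is simpler but ties the code to one language's materials.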

  26. 4. Supporting fieldworkers

  27. Supporting indeterminacy • There are two kinds of indeterminacy in linguistics: • confidence in assigning a category (uncertainty) • phenomena that are inherently variable, probabilistic, gradient or continuous

  28. The most valuable information • The most valuable information that a field linguist learns may be the least likely to be annotated • Example: 7uhch in Lakandon Maya: • A temporal-modal deictic expressing participant frames and speaker's footings (Bergqvist 2005) • This term has been given the most thought by the researcher, but it is still not completely understood • The uncertainty (or the extent of certainty) should be recorded: all the properties we do know

  29. 5 reasons for modelling uncertainty • 1. To record the extent of our knowledge • For example, we want everything known about 7uhch in Lakandon Maya to be recorded, even if we don’t yet have a category for it

  30. 5 reasons for modelling uncertainty • 2. For searchability • If an archive implementing an ontology with uncertain categories exists, then we can more easily find existing solutions to a problem • If a problem is truly new, then we can allow future researchers to find it
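
The first two reasons can be sketched together: an annotation with no category still records what is known, and that record is what makes it findable. The `Annotation` class, its field names, and the second archive entry are hypothetical; the three properties given for 7uhch are read off the description on slide 28 and are not a claim about how the term should be analysed.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Annotation:
    form: str
    language: str
    category: Optional[str] = None  # None means no category assigned yet
    known_properties: Set[str] = field(default_factory=set)


archive = [
    # Category unknown, but everything known so far is still recorded.
    Annotation("7uhch", "Lakandon Maya", None, {"temporal", "modal", "deictic"}),
    # Hypothetical second entry, included only to give the search a contrast.
    Annotation("form_b", "Language B", "Noun", {"nominal"}),
]


def find_by_properties(items: List[Annotation], wanted: Set[str]) -> List[Annotation]:
    """Return annotations whose known properties include all the wanted ones."""
    return [a for a in items if wanted <= a.known_properties]


def uncategorised(items: List[Annotation]) -> List[Annotation]:
    """Return annotations that still lack a category."""
    return [a for a in items if a.category is None]


print([a.form for a in find_by_properties(archive, {"deictic"})])  # ['7uhch']
print([a.form for a in uncategorised(archive)])                    # ['7uhch']
```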

  31. 5 reasons for modelling uncertainty • 3. To reach certainty • Even an indeterminate markup can allow a corpus analysis that can inform a decision about assigning the appropriate category

  32. 5 reasons for modelling uncertainty • 4. To highlight problems with descriptive frameworks • A feature may only appear to belong to multiple (or no) categories because the descriptive framework does not yet account for it

  33. 5 reasons for modelling uncertainty • 5. Because the concept is inherently indeterminate • The concept may be inherently fuzzy but not previously encountered as a continuous / contiguous phenomenon

  34. Inherently indeterminate features • E.g.: cline, gradience, squish, continuities, contiguities, vague, fuzzy, probabilistic • Many prosodic, semantic and discourse features are inherently continuous • Growing arguments for probabilities to be part of our formal linguistic models for morphological and syntactic structures (Aarts, 2004; Baayen, 2003; Manning, 2003)

  35. Inherently indeterminate features • Representing categories by formal properties meets the current requirements of modelling gradience (Aarts, 2004) • Perhaps the “ContinuousObject” concept of SUMO (Niles & Pease, 2001) could also be used? • The problem is, currently, largely unresolved
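
One simple way to see how property sets could express gradience is to score an item by the share of a category's properties it exhibits. This is a deliberately crude sketch and an assumption on our part, not the model proposed by Aarts (2004); the function name `membership` and the placeholder properties `p1` to `p4` are hypothetical.

```python
def membership(item_properties: set, category_properties: set) -> float:
    """Fraction of a category's defining properties that an item exhibits."""
    if not category_properties:
        return 0.0
    return len(item_properties & category_properties) / len(category_properties)


# Placeholder properties; real ones would come from the ontology.
category = {"p1", "p2", "p3", "p4"}

print(membership({"p1", "p2", "p3", "p4"}, category))  # 1.0, a central member
print(membership({"p1", "p2"}, category))              # 0.5, a peripheral member
print(membership({"p9"}, category))                    # 0.0, not a member
```

A score like this only covers gradience within a category. Features that are continuous in themselves, such as many prosodic ones, would need a different representation.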

  36. Incorporating new categories • How do we know that a given category is not the same as another one identified elsewhere? • Formal properties for concepts give us another means for comparison
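
A minimal sketch of such a comparison, assuming each category is stored as a mapping from property name to `True` (‘Yes’), `False` (‘No’) or `None` (‘Undefined’): two categories are candidates for being the same if no property is ‘Yes’ in one and ‘No’ in the other. The function name `conflicts` and the placeholder properties are hypothetical.

```python
from typing import Dict, List, Optional

Properties = Dict[str, Optional[bool]]  # True = Yes, False = No, None = Undefined


def conflicts(a: Properties, b: Properties) -> List[str]:
    """List properties on which two categories give opposite defined answers."""
    return sorted(
        p
        for p in a.keys() & b.keys()
        if a[p] is not None and b[p] is not None and a[p] != b[p]
    )


new_category = {"p1": True, "p2": None, "p3": False}
existing_category = {"p1": True, "p2": False, "p3": True}

print(conflicts(new_category, existing_category))  # ['p3']
```

An empty result does not prove the two categories are identical. It only shows that nothing recorded so far rules that out, which is why including as many properties as possible helps.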

  37. Incorporating structures • As well as inherently discrete phenomena and inherently indeterminate ones, there is a third kind: concepts that are complex structures • common in syntax and discourse semantics • How do we model a structure in an ontology?

  38. 5. Supporting speakers

  39. Users of EL archives • The largest (and growing) user group for endangered languages materials is the speakers of endangered languages • They are rarely interested in linguistic categories or in navigating a corpus or archive via them • Supporting language-specific ontologies means supporting information-rich structures for both navigation and analysis

  40. Case Study: Yolngu kinship • The Yolngu languages have an extensive kinship terminology called Gurrutu • 27 terms that identify individuals and sets of individuals in terms of moiety, generation, gender, and patriline or matriline. • The terms extend infinitely through cyclicity

  41. Case Study: Yolngu kinship • Speakers draw from the same sets of kinship relations to describe their relationship to the Yolngu lands • We cannot always annotate well-known linguistic concepts independently of language-specific ontologies

  42. 6. Conclusions

  43. Conclusions • Ontology building for endangered languages can be very different to other ontology projects • The uncertain is often more valuable than the certain • The local is often more interesting than the universal • … but will still need interoperability • We suggest extending the focus of GOLD to • data acquisition • data access

  44. Conclusions • Current GOLD does not need to be altered to incorporate our suggestions • except to remove assumptions of invariability • Key extensions • formalising the definitions of concepts by representing them as a set of formal properties • explicitly capturing the conventions and constraints for presentation (rendering) • modelling features that are inherently indeterminate and/or complex structures

  45. References
  Aarts, B. 2004. Modelling linguistic gradience. Studies in Language 28(1): 1–49.
  Baayen, H. 2003. Probabilistic approaches to morphology. In R. Bod, J. Hay & S. Jannedy (eds.), Probabilistic Linguistics. MIT Press.
  Bateman, J. 1992. The theoretical status of ontologies in natural language processing. In Text Representation and Domain Modelling – ideas from linguistics and AI. Technische Universität Berlin.
  Bergqvist, H. 2005. Semantics of temporal deictics in Lakandon Maya. Presentation given at the ELAP-ELAR seminar series, SOAS, London.
  Bird, S. & G. Simons. 2003. Seven dimensions of portability for language documentation and description. Language 79(3): 557–582.
  Christie, M. & W. Gaykamangu. 2003. Kinship, moiety, land & language in Arnhem Land. Literacy Link 23(5), October 2003. Australian Council for Adult Literacy.
  Christie, M., W. Gaykamangu & D. Nathan. 2001. Yolngu Languages and Culture: Gupapuyngu. Faculty of Aboriginal and Torres Strait Islander Studies, NTU. [Multimedia CD-ROM]
  Crystal, D. 1997. A dictionary of linguistics and phonetics. 4th edition. Cambridge, MA: Blackwell.
  Cysouw, M., J. Good, M. Albu & H. J. Bibiko. 2005. Can GOLD “cope” with WALS? Retrofitting an ontology onto the World Atlas of Language Structures. Proceedings of E-MELD 2005.
  Farrar, S. & D. T. Langendoen. 2003. A linguistic ontology for the Semantic Web. GLOT International 7(3): 97–100.
  Farrar, S. 2003a. Markup and the GOLD ontology. Proceedings of EMELD 2003.
  Farrar, S. 2003b. An ontological account of linguistics: extending SUMO with GOLD. Proceedings of the 2003 IEEE International Conference on Natural Language Processing and Knowledge Engineering, Beijing.
  Foley, W. A. 2003. Genre, register and language documentation in literate and preliterate communities. In P. K. Austin (ed.), Language Documentation and Description, vol. 1.
  Grinevald, C. 2003. Speakers and documentation of endangered languages. In P. K. Austin (ed.), Language Documentation and Description, vol. 1.
  Gruber, T. R. 1993. A translation approach to portable ontologies. Knowledge Acquisition 5(2): 199–220.
  Himmelmann, N. P. 1998. Documentary and descriptive linguistics. Linguistics 36: 161–195. Berlin: de Gruyter.
  Holton, G. 2003. Approaches to digitization and annotation: A survey of language documentation materials in the Alaska Native Language Center Archive. Proceedings of EMELD 2003.
  Manning, C. 2003. Probabilistic syntax. In R. Bod, J. Hay & S. Jannedy (eds.), Probabilistic Linguistics. MIT Press.
  Nathan, D. (ed.) 1996. Australia’s Indigenous Languages. Adelaide: SSABSA.
  Nathan, D. & P. K. Austin. 2004. Reconceiving metadata: language documentation through thick and thin. In P. K. Austin (ed.), Language Documentation and Description, vol. 2.
  Niles, I. & A. Pease. 2001. Towards a standard upper ontology. Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001).
  Penton, D., C. Bow, S. Bird & B. Hughes. 2004. Towards a general model for linguistic paradigms. Proceedings of EMELD 2004.
