
The Open Language Archives Community and Asian Language Resources. Steven Bird, Gary Simons, Chu-Ren Huang



    1. The Open Language Archives Community and Asian Language Resources. Steven Bird (Penn), Gary Simons (SIL), Chu-Ren Huang (Academia Sinica)

    2. OLAC Aims OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: developing consensus on best current practice for the digital archiving of language resources; developing a network of interoperating repositories and services for housing and accessing such resources.

    3. OLAC Organization. Coordinators: Steven Bird & Gary Simons. Advisory Board: Helen Aristar Dry, Susan Hockey, Chu-Ren Huang, Mark Liberman, Brian MacWhinney, Michael Nelson, Nicholas Ostler, Henry Thompson, Hans Uszkoreit, Antonio Zampolli. Participating Archives & Services: LDC, ELRA, DFKI, CBOLD, ANLC, LACITO, Perseus, SIL, APS, Utrecht. Prospective Participants: ASEDA, Academia Sinica, AISRI, INALF, LCAAJ, Linguist, MPI, NAA, OTA, Rosetta, Tibetan Digital Library (UVA). Individual Members: ~120. www.language-archives.org

    4. Types of Language Resource. DATA: any information which documents or describes a language, such as a monograph, data file, shoebox of index cards, unanalyzed recordings, heavily annotated texts, or complete descriptive grammar. TOOLS: computational resources that facilitate creating, viewing, querying, or otherwise using language data, including fonts, stylesheets, DTDs, and schemas. ADVICE: any information about reliable data sources, appropriate tools, and practices.

    5. The Gap

    6. Coordinated Approach

    7. OLAC

    8. The Foundation: 3 initiatives. Dublin Core Metadata Initiative (DC): founded in 1995 (Dublin, Ohio); conventions for resource discovery on the web. Open Archives Initiative (OAI): founded in 1999 (Santa Fe); interoperability of e-print services. Open Language Archives Community (OLAC): founded in 2000 (Philadelphia); a partnership of institutions and individuals creating a worldwide virtual library of language resources.

    9. Foundation 1: DC Elements. 15 metadata elements: broad interdisciplinary consensus; each element is optional and repeatable; applies to digital and traditional formats. The elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights. dublincore.org
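
The "optional and repeatable" property of the 15 elements can be sketched as a small data structure. This is a minimal illustration, not part of the presentation: the function name `make_record` and the sample values are hypothetical.

```python
from collections import defaultdict

# The 15 Dublin Core elements listed on the slide.
DC_ELEMENTS = frozenset({
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
})


def make_record(pairs):
    """Build a record from (element, value) pairs.

    Every element is optional (it may be absent) and repeatable
    (it may occur several times), so values are kept in lists.
    """
    record = defaultdict(list)
    for element, value in pairs:
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        record[element].append(value)
    return dict(record)


# Hypothetical sample values, for illustration only.
record = make_record([
    ("Title", "A sample word list"),
    ("Creator", "First Author"),
    ("Creator", "Second Author"),
    ("Language", "en"),
])
```

Here `Creator` occurs twice and `Publisher` not at all, and both are acceptable under the slide's description.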

    10. Foundation 1: DC Qualifiers. Encoding schemes: a controlled vocabulary or notation used to express the value of an element; helps a client system to interpret the element content, e.g. Language = "en" (not "English", "Anglais", ...). Refinements: make the meaning of an element more specific, e.g. Subject.language, Type.linguistic.

    11. Foundation 2: OAI Repository

    12. Foundation 2: OAI Standards. To implement the OAI infrastructure, an archive must comply with two standards: (1) the OAI Shared Metadata Set: Dublin Core; interoperability across all repositories; (2) the OAI Metadata Harvesting Protocol: HTTP requests with 6 verbs (Identify, ListIdentifiers, ListMetadataFormats, ListSets, ListRecords, GetRecord); XML responses.
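
A harvesting request under this protocol is an ordinary HTTP request whose query string names one of the six verbs. The sketch below only builds such a URL. The base URL is a hypothetical placeholder, and the `metadataPrefix` argument with the value `oai_dc` is an assumption taken from the OAI protocol documentation rather than from the slide.

```python
from urllib.parse import urlencode

# The six verbs named on the slide.
OAI_VERBS = (
    "Identify", "ListIdentifiers", "ListMetadataFormats",
    "ListSets", "ListRecords", "GetRecord",
)


def build_request(base_url, verb, **arguments):
    """Return the URL for one harvesting request.

    `arguments` are passed through as extra query parameters; which
    parameters a given verb accepts is defined by the protocol, not here.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical repository address, for illustration only.
BASE_URL = "http://archive.example.org/oai"
url = build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
```

A service provider would fetch this URL and parse the XML response; that step is omitted because it depends on the repository being harvested.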

    13. Foundation 2: OAI Service Providers and Data Providers

    14. Foundation 3: OLAC & OAI. Recall: OAI data providers must support Dublin Core metadata and the OAI metadata harvesting protocol. BUT: OAI data providers can also support a more specialized metadata format and a more specialized harvesting protocol. What OLAC does: specialized metadata for language resources; specialized harvesting (extra validation).

    15. OLAC Standards. Aside: standards = the protocols and interfaces that allow the community to function; recommendations = "standards" for representing linguistic content. OLAC has three primary standards: OLACMS, the OLAC Metadata Set (Qualified DC); OLAC MHP, refinements to the OAI protocol; OLAC Process, a procedure for identifying Best Common Practice Recommendations.

    16. The OLAC Metadata Set. The three categories of metadata: Work language: describes information entities and their intellectual attributes, e.g. names of works and their creators. Document language: describes and provides access to the physical manifestation of information, e.g. format, publisher, date, rights. Subject language: describes what a document is about, e.g. subject, description.

    17. OLACMS and Controlled Vocabularies. Language: a language of the intellectual content of the resource (OLAC-Language). Subject.language: a language which the content of the resource describes or discusses (OLAC-Language). OLAC-Language: a vocabulary for identifying the language(s) that the data is in, or that a piece of linguistic description is about, or that a particular tool can process.

    18. Summary: With the software in place, we have a complete platform

    19. OLAC and Asian Languages: Two Issues. (1) Language identification: Is the current OLAC/Ethnologue vocabulary rich enough to describe all Asian languages? Note: OLAC is expected to adopt Ethnologue codes for language identification. See www.ethnologue.org. (2) Multilingual resources: Is the current OLACMS comprehensive enough to describe multilingual resources?

    20. Language Identification. The DC two-letter code (e.g. en for English) is not enough to describe all the languages in the world. Ethnologue (http://www.ethnologue.org) is currently the most comprehensive description of the world's languages. Potential problems of using Ethnologue (or any existing language list): over-splitting, over-chunking, omission.

    21. Solution to LI Problems I. Use controlled vocabulary for elaboration: <language code="x-sil-BNN">Northern/Takituduh</language> <language code="x-sil-BNN">Northern/Takibakha</language> <language code="x-sil-BNN">Central/Takbanuaz</language> <language code="x-sil-BNN">Central/Takivatan</language> <language code="x-sil-BNN">Southern/Isbukun</language>

    22. Solution to LI Problems II. Register language groups with an OLAC registration service. An OLAC language classification server would house a comprehensive list of language family names (defined by users) and their extensional definitions (i.e. sets of Ethnologue codes), e.g. AS:Amis = {ALV, AIS}.
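
The extensional definition of a language group can be sketched as a mapping from a group name to a set of Ethnologue codes. This is a minimal sketch: the `AS:Amis` entry comes from the slide, but the names `LANGUAGE_GROUPS`, `expand`, and `matches` are hypothetical, and the slide does not say how a real registration service would be queried.

```python
# Registry: group name -> extensional definition (a set of Ethnologue codes).
# Only the AS:Amis entry comes from the slide.
LANGUAGE_GROUPS = {
    "AS:Amis": {"ALV", "AIS"},
}


def expand(term, groups=LANGUAGE_GROUPS):
    """Return the set of language codes that a query term stands for.

    A registered group name expands to its member codes; any other
    term is treated as a single language code.
    """
    return set(groups.get(term, {term}))


def matches(record_codes, query_term, groups=LANGUAGE_GROUPS):
    """True if a resource tagged with `record_codes` satisfies the query."""
    return bool(expand(query_term, groups) & set(record_codes))
```

With this registry, a search for `AS:Amis` would find a resource tagged only with `AIS`, which is the kind of grouping the slide proposes as a remedy for over-splitting.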

    23. Multilingual Resources I. Directionality is crucial in multilingual resources. However, OLAC metadata is flat and unordered. For MT systems, this loses information but is sufficient for resource harvesting. Bi-directional MT: <Language code="X"/> <Language code="Y"/> <Subject.language code="X"/> <Subject.language code="Y"/>

    24. Multilingual Resources II. One-to-many MT: <Subject.language code="S"/> <Language code="T1"/> <Language code="T2"/> <Language code="T3"/>. Many-to-one MT: <Subject.language code="S1"/> <Subject.language code="S2"/> <Subject.language code="S3"/> <Language code="T"/>
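
The one-to-many and many-to-one patterns can be generated mechanically. The sketch below follows the convention shown on this slide, with source languages as Subject.language and target languages as Language. The function name `mt_metadata` is hypothetical, and the output is a flat list of elements rather than a complete OLAC record.

```python
import xml.etree.ElementTree as ET


def mt_metadata(sources, targets):
    """Return flat metadata elements for an MT resource.

    Source languages become Subject.language elements and target
    languages become Language elements, as on the slide. The list is
    flat: it does not record which source pairs with which target.
    """
    elements = [ET.Element("Subject.language", {"code": c}) for c in sources]
    elements += [ET.Element("Language", {"code": c}) for c in targets]
    return [ET.tostring(e, encoding="unicode") for e in elements]


one_to_many = mt_metadata(["S"], ["T1", "T2", "T3"])
many_to_one = mt_metadata(["S1", "S2", "S3"], ["T"])
```

Calling `mt_metadata(["X", "Y"], ["X", "Y"])` yields the same four elements as the bi-directional case, though in a different order, which does not matter because the metadata is unordered. It also shows why direction cannot be recovered from the flat form alone.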

    25. Multilingual Resources III. Text: language. Bitext (bilingual aligned corpus): there is always a directionality; Original: language; Translation: Subject.language. Language Description (Field Notes): elicitation, transcription, translation, notes (multiple related resources?).

    26. Summary: Repositories completely bridge the gap, letting us consistently organize and archive our resources

    27. OLAC

    28. OLAC Launch
