
The Rosetta Project ALL Language Archive

The Rosetta Project ALL Language Archive. Presented by: Laura Buszard-Welcher, The Rosetta Project / University of California, Berkeley. A Project of the Long Now Foundation & A National Science Digital Library. www.rosettaproject.org





Presentation Transcript


  1. The Rosetta Project ALL Language Archive • Presented by: Laura Buszard-Welcher, The Rosetta Project / University of California, Berkeley • A Project of the Long Now Foundation & A National Science Digital Library • www.rosettaproject.org

  2. Primary Goals • Support the documentation of the world’s nearly 7000 languages through building • A digital archive of language documentation • A linguistically sophisticated site that is also useful and interesting for the general public • Networks of speakers, educators, linguists • Contributes to the effort to document endangered languages • Promotes linguistic diversity by educating the public about languages with small numbers of speakers.

  3. Secondary Goals • Support metadata standardization and interoperability • OLAC • EMELD • Develop tools for collaborative linguistic research • Endangered Language Query Room • Wordlist Tool • Collaborative document editing/creation (new site)

  4. Roles • The Long Now Foundation • Parent organization of The Rosetta Project • Projects, seminars on topics that foster long term thinking • The National Science Digital Library • U.S. National Science Foundation Program • Goal is to bring online high quality STEM (Science, Technology, Engineering, and Math) resources for education • Sponsor of Rosetta Project (NSF 333727) • Stanford University • Online and offline storage of Rosetta materials

  5. The Long Now Foundation

  6. The National Science Digital Library

  7. Stanford University Libraries

  8. Project History: The 1000 Language Archive • Initiated by The Long Now Foundation • Wanted to experiment with new microetching technology, looking for suitable content • Decided to collect basic descriptive information for 1000 of the world’s approximately 7000 languages

  9. Why language information? • Most natural human languages are products of millennia of human history (therefore a good long term thinking project) • Repositories of cultural information • Languages showcase • Human intellectual sophistication • Cultural diversity • To draw attention to the critical issue of language endangerment

  10. The Rosetta Disk • Next generation microfiche • Micro-etched 2" nickel disk at densities of up to 200,000 page images per disk • Developed by Los Alamos Laboratories and Norsam Technologies • Reading the disk requires a microscope, either optical or electron, depending on the density of encoding

  11. The Rosetta Stone • Not us! (196 BC) • Parallel text written in three scripts: • Hieroglyphic • Demotic (script form) • Greek • The key to deciphering Egyptian Hieroglyphs

  12. Rosetta Stone Language Learning Software (Also not us!)

  13. Design of the Disk • Original design has human-eye readable text (Genesis text) and micro-etched text inside an index • New design has human-eye readable text (instructions) on one side and microetched images on the reverse

  14. In-House Scanning • HP CapShare Scanners • Scan printed page in multiple passes, any direction • Page is ‘assembled’ into one image • Stores about 50 pages at a time (300 dpi bitmap .tif) • Uploads numerically sequenced images to computer by infrared port

  15. In-House Scanning • Minolta PS 7000 Overhead • Bitmap and grayscale scans up to 600 dpi • Multiple sizes, orientations • Single page / double page spread (good for text collections with verso annotations) • Best for fragile books, manuscripts that would be damaged by hand scanning

  16. Categories of Collection (1)

  17. Categories of Collection (2)

  18. Language Curation

  19. Rosetta Project Web Site • Welcome • Search for a language • Language overview page • Browse (by name, family, country) • Wordlist tool

  20. Welcome

  21. Search

  22. Language Overview

  23. Browse

  24. Projects • Endangered Language Query Rooms • Digital Online Curation Services for Endangered Language Archives (DOCS) • Wordlist Tool • LangGator

  25. Endangered Language Query Rooms http://emeld.rosettaproject.org/

  26. Query Room Virtual Keyboard

  27. Potawatomi Query Room Re: Bozho by Donald Perrot (host) on July 9 2004, 8:53 PM Nmedagwe'ndan e'gi nebye'ge'yen ngom. Neaseno ndesh ne kas ge' nin, mine E'shkanabe' e'nda ge' nin. I like what you have written. I am called Neaseno (Southwind) myself, and I live in Escanaba, MI. Re: Bozho by Justin Neely on September 7 2004, 1:16 PM Bozho Neaseno mine Lameen Zagnenibi ndeznekas. Nishnabe ndaw ipi Bodewadmi ndaw. Shi shi ban nee yek ndebendagwes. Zego ndotem. Kansas City,Mo ndoch bya. Eskanabe edayen ge nin. Bama pi ngom Zagnenibi ndeznekas [Hello Neaseno and Lameen my name is Zagnenibi. I’m Native and Potawatomi. I belong to the Citizen Band. I’m Crane Clan . I’m from Kansas City, Missouri. I also live in Escanaba. Bye for now, Zagnenibi.]

  28. Taking Conversational Risks by [TL] on July 17 2004, 10:30 AM mbesuk onago ngi zhyamen . nseze wgi bye tot i jiman ewi nepamshkamen be gishek. wabek nuwi zhya men ibe eje shna mbesuk . ngi wabmak gode chemokmanuk demojgewat. wabek nin gezhe ni demojgeyan gnebech. bama mine mtego [I went to the lake yesterday. My brother brought a canoe so we could float around all day. Tomorrow we’ll go there to the lake. I saw the white folks fishing. Tomorrow I’ll fish too, maybe. So long for now, Mtego.] Re: onago egi zhejkeyak by [JN] on July 17 2004, 8:12 PM mbesek ndazhya ngom. Mbish ksenyak shode. Nedwendan ode Mbish gshatek. Megwa Nwinebyege ode bodewadmi kiktowenen bama. Megwetch Zagnenibi nin se [I should go to the lake today. The water is cold here. I wish the water were warm. I’ll write more of this Potawatomi conversation later. Thanks, yours truly Zagnenibi.]

  29. Factors in query room success

  30. DOCS Project • Digital Online Curation Services for Endangered Language Archives • Many small language archives are beginning to digitize their materials • Lack technical infrastructure to bring resources online • Goal is to provide access through Rosetta

  31. DOCS Project Archives • Endangered Language Fund (ELF) • Survey for California and Other American Indian Languages (SCOIL) • The Alaska Native Languages Center (ANLC) • Max-Planck Institute for Evolutionary Anthropology (Leipzig)

  32. Wordlist Tool • Swadesh lists (100, 200, 207 terms) from: • Tryon's Comparative Austronesian Dictionary (rekeyed) • Tim Usher's Indo-Pacific database (2002 version) • Paul Whitehouse's Australian and New Guinea database (2002 version) • George Starostin's Dravidian database • Ilya Peiros' Mon Khmer database • Total of 1,384 languages, 3,090 lists online • Additional 3000 lists, up to 1850 terms per list, most 300-500 words in length.

  33. LangGator • A linguistic “Wayback Machine” • Language resource location and aggregation • Use alternate language names, spellings • Deutsch, Hochdeutsch, High German, Allemande • Fadicca, Fadicha, Fedija, Fadija, Fiadidja, Fiyadikkya, and Fedicca • Character identification (inventory, distribution) • Dera (Chadic, Nigeria) • Dera (Trans-New Guinea, Indonesia) • Seed crawler with Wordlist terms (see previous slide), weighted towards longer terms • Archiving through Internet Archive • Serve results through the Rosetta site
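The alternate-name lookup and the length-weighted seeding described on the LangGator slide can be sketched roughly as follows. This is only an illustration, not the project's implementation: the record layout, the function names (`normalize`, `build_index`, `lookup`, `pick_seed_terms`), and the choice of the first listed name as a label are all assumptions; only the name variants themselves come from the slide.

```python
import random
import unicodedata

# Name groups copied from the slide. The record layout and the use of the
# first listed name as "label" are assumptions made for this sketch.
LANGUAGES = [
    {"label": "Deutsch", "note": None,
     "names": ["Deutsch", "Hochdeutsch", "High German", "Allemande"]},
    {"label": "Fadicca", "note": None,
     "names": ["Fadicca", "Fadicha", "Fedija", "Fadija",
               "Fiadidja", "Fiyadikkya", "Fedicca"]},
    # Two different languages share the name "Dera".
    {"label": "Dera", "note": "Chadic, Nigeria", "names": ["Dera"]},
    {"label": "Dera", "note": "Trans-New Guinea, Indonesia", "names": ["Dera"]},
]


def normalize(name):
    """Fold case, strip diacritics, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def build_index(languages):
    """Map every normalized name variant to the records that use it."""
    index = {}
    for lang in languages:
        for name in lang["names"]:
            index.setdefault(normalize(name), []).append(lang)
    return index


def lookup(index, name):
    """Return all candidate languages for a name (possibly several, possibly none)."""
    return index.get(normalize(name), [])


def pick_seed_terms(terms, k, rng=None):
    """Sample k crawler seed terms (with replacement), weighted by term length."""
    rng = rng or random.Random()
    weights = [len(term) for term in terms]
    return rng.choices(terms, weights=weights, k=k)
```

A lookup for "Dera" returns two candidates, which is why the slide pairs name matching with other evidence such as character inventory. Weighting seeds by length is one simple reading of "weighted towards longer terms"; longer words are less likely to collide with unrelated languages.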

  34. Collaborations • Electronic Metastructure for Endangered Languages Data (E-MELD) • General Ontology for Linguistic Description (GOLD) • Open Language Archives Community (OLAC)

  35. E-MELD • Electronic Metastructure for Endangered Language Data • School of Best Practice http://emeld.org/school/index.html • Guidelines and examples for putting linguistic data into best practice digital formats • XML with XML Schema or DTD • Mapping terminology to ontology (GOLD) • FIELD lexical database tool http://emeld.org/tools/field/beta/ • Online collaborative tool to build linguistic dictionaries, backed by ontology (GOLD)

  36. GOLD • General Ontology for Linguistic Description • Built in OWL (Web Ontology Language), linked to SUMO (Suggested Upper Merged Ontology) • Best practice resources should include a mapping between the researcher’s terms, and a standard set, known as the ‘profile’ • ‘independent’ (mine) = ‘main clause’ (GOLD) • ‘obviative’ (mine) = ‘fourth person’ (GOLD) • The standard terminology set can then allow sophisticated searches across disparate resources.
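As a rough illustration of the ‘profile’ idea on the GOLD slide, the sketch below maps each resource's own terms to a shared standard set and then searches across resources by the standard term. The two term pairs are the examples from the slide; the data layout, the resource names, and the functions `to_standard` and `search_by_standard_term` are hypothetical and do not reflect how GOLD or OWL tooling is actually used.

```python
def to_standard(term, profile):
    """Translate a researcher's term via the profile; unmapped terms pass through."""
    return profile.get(term.strip().lower(), term.strip().lower())


def search_by_standard_term(resources, standard_term):
    """Return names of resources that carry a tag equivalent to standard_term."""
    wanted = standard_term.strip().lower()
    hits = []
    for resource in resources:
        tags = {to_standard(tag, resource["profile"]) for tag in resource["tags"]}
        if wanted in tags:
            hits.append(resource["name"])
    return hits


# Hypothetical resources: one uses the researcher's own terminology plus a
# profile, the other already uses the standard terms and needs no mapping.
RESOURCES = [
    {"name": "my-grammar-notes",
     "profile": {"independent": "main clause", "obviative": "fourth person"},
     "tags": ["independent", "obviative"]},
    {"name": "other-corpus",
     "profile": {},
     "tags": ["main clause"]},
]
```

With these sample resources, a search for "main clause" finds both entries even though only one of them uses that label directly, which is the kind of cross-resource search the slide describes.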

  37. GOLD Community Model

  38. OLAC • Open Language Archives Community • Set of 23 metadata elements and controlled vocabularies (based on Dublin Core) • Subject.language (language described, rather than audience language) uses SIL language codes • Type.linguistic (grammar, lexicon, text) • IMDI (ISLE MetaData Initiative) has 135 elements • Recommended extensions (Discourse Types, Linguistic Field, Participant Roles) • Enables searches across a network of archives that use OLAC metadata http://www.language-archives.org/tools/search/
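A minimal sketch of why shared metadata matters for cross-archive search: if every archive fills in the same two fields the same way, one query works everywhere. The field names `subject_language` and `type_linguistic` stand in for the slide's Subject.language and Type.linguistic; the three-letter code check, the sample records, and the code value are assumptions for illustration and should not be read as the OLAC schema or the full controlled vocabulary.

```python
import re

# The three values listed on the slide; the real vocabulary may be larger.
LINGUISTIC_TYPES = {"grammar", "lexicon", "text"}
# Assumption: language codes are three letters.
LANGUAGE_CODE = re.compile(r"^[A-Za-z]{3}$")


def validate(record):
    """Return a list of problems found in one metadata record."""
    problems = []
    if not LANGUAGE_CODE.match(record.get("subject_language", "")):
        problems.append("subject_language must be a three-letter code")
    if record.get("type_linguistic") not in LINGUISTIC_TYPES:
        problems.append("type_linguistic must be one of: "
                        + ", ".join(sorted(LINGUISTIC_TYPES)))
    return problems


def search_archives(archives, subject_language, type_linguistic=None):
    """Return (archive, title) pairs matching the language and optional type."""
    hits = []
    for archive_name, records in archives.items():
        for record in records:
            if record.get("subject_language", "").lower() != subject_language.lower():
                continue
            if type_linguistic and record.get("type_linguistic") != type_linguistic:
                continue
            hits.append((archive_name, record["title"]))
    return hits


# Hypothetical archives and records; "pot" is used as an illustrative code.
ARCHIVES = {
    "archive-a": [
        {"title": "Wordlist", "subject_language": "pot", "type_linguistic": "lexicon"},
        {"title": "Sketch grammar", "subject_language": "pot", "type_linguistic": "grammar"},
    ],
    "archive-b": [
        {"title": "Story collection", "subject_language": "POT", "type_linguistic": "text"},
        {"title": "Untyped item", "subject_language": "potawatomi", "type_linguistic": "notes"},
    ],
}
```

The last sample record fails both checks, which shows the practical point: records that ignore the controlled vocabularies drop out of network-wide searches.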

  39. URLs • Electronic Metastructure for Endangered Language Data (E-MELD) http://www.emeld.org (School of Best Practice, FIELD Tool). • Endangered Language Query Rooms http://rosettaproject.org:8080/emeldbase/. • The Ethnologue http://www.ethnologue.com. • General Ontology for Linguistic Description (GOLD) http://www.linguistics-ontology.org. • ISLE MetaData Initiative (IMDI) http://www.mpi.nl/IMDI/. • National Science Digital Library (NSDL) http://nsdl.org • Open Language Archives Community (OLAC) http://www.language-archives.org. • The Rosetta Project, http://www.rosettaproject.org/live. A preview of the new Web site (currently under construction) is available at http://preview.rosettaproject.org.

  40. Credits • This project is funded by the US National Science Digital Library (NSF 333727)
