
Digital data archives in the humanities

Digital data archives in the humanities: issues for participating in the semantic web - the case of PARADISEC. Linda Barwick, University of Sydney. APAN Semantic Web workshop, Bangkok, 27 January 2005.


Presentation Transcript


  1. Digital data archives in the humanities: Issues for participating in the semantic web - the case of PARADISEC • Linda Barwick, University of Sydney • APAN Semantic Web workshop, Bangkok, 27 January 2005

  2. Endangered languages • Over 2000 of the world’s 6000 languages in the Asia-Pacific region • Number likely to fall to a few hundred by 2100 (UNESCO) • Australian researchers active in region since 1950s - making unique recordings of unrepeatable events • Recordings now themselves endangered (format obsolescence, media deterioration, loss of metadata)

  3. PARADISEC’s mission • To preserve and make accessible Australian researchers’ field recordings of endangered languages and musics from the Asia-Pacific region • Preservation: to adopt world’s best practice standards and formats to maximise sustainability and future useability of the collection • Access: To take advantage of emerging information and communication technologies to maximise access to our collection by both researchers and cultural heritage communities

  4. PARADISEC structure • CIs: Cliff Goddard, Hugh de Ferranti • CIs: William Foley, Allan Marett, Jane Simpson • Audio Archiving Unit • Director: Linda Barwick • Audio: Frank Davey • Project Liaison: Amanda Harris • CIs: Andrew Pawley, John Bowden, Malcolm Ross, Alan Rumsey • CIs: Steve Bird, Nick Evans, Cathy Falk, Janet Fletcher, John Hajek • Store account - web interface: Stuart Hungerford • Project Manager (Metadata guru): Nick Thieberger

  5. Networking • Main campuses (University of Sydney, University of Melbourne, Australian National University) connected by Grangenet (next generation research network, 10Gbps connections) • Pay subscription, not traffic costs • Satellite campus UNE connected by AARnet (Australian research and education network - currently billed traffic cost, 155Mbps connection) • Both with connections to APAN community (Asia Pacific Advanced Networks) - potential for linking to regional and international R&E networks - potential traffic costs an issue
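
To put the two link speeds on the networking slide in perspective, the following minimal sketch estimates how long one terabyte would take to move at each nominal rate. It assumes the full line rate is available and ignores protocol overhead and contention, so real transfers would be slower; the function and constant names are illustrative only.

```python
def transfer_hours(size_bytes: float, link_bits_per_second: float) -> float:
    """Hours needed to move size_bytes over a link at its nominal line rate."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 3600


ONE_TB = 1e12            # decimal terabyte, an assumption
AARNET_BPS = 155e6       # 155 Mbps connection quoted for UNE
GRANGENET_BPS = 10e9     # 10 Gbps connection quoted for the main campuses

for name, bps in (("AARnet 155 Mbps", AARNET_BPS), ("Grangenet 10 Gbps", GRANGENET_BPS)):
    print(f"{name}: {transfer_hours(ONE_TB, bps):.2f} hours per TB")
```

At nominal rates this is roughly 14 hours per terabyte on the 155 Mbps link against well under an hour at 10 Gbps, which is one reason billed traffic on the slower link matters for ingest planning.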

  6. Storage • Australian Partnership for Advanced Computing National Facility Mass Data Storage System - Hierarchical Storage Manager system • Funded by consortium of Australian higher education bodies • Tape robot system - can handle 1.2PB • PARADISEC will add 2-3TB per year once satellite ingest commissioned • Current horizon of facility 2008 - project PARADISEC collection up to 9TB by then • Will need to apply to host material/share data from other collections
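
The growth figures on the storage slide can be checked with simple arithmetic. The sketch below assumes a starting point of 1.1 TB (the repository total reported for January 2005 later in this deck) and three full years of ingest to 2008; both the starting point and the number of years are assumptions, since the slide says the 2-3 TB rate applies only once satellite ingest is commissioned.

```python
def project_collection_tb(start_tb: float, growth_tb_per_year: float, years: float) -> float:
    """Linear projection of collection size in terabytes."""
    return start_tb + growth_tb_per_year * years


START_TB = 1.1   # repository total, 21 January 2005
YEARS = 3        # 2005 to 2008, assuming ingest at full rate throughout

low = project_collection_tb(START_TB, 2, YEARS)
high = project_collection_tb(START_TB, 3, YEARS)
print(f"Projected size in 2008: {low:.1f} to {high:.1f} TB")
```

Under these assumptions the range brackets the 9 TB figure quoted on the slide.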

  7. Software • Initial metadata database in Filemaker Pro 6 with periodic XML dumps for OLAC static harvesting • Currently being ported to MySQL/PHP to allow dynamic harvesting and other functionality • Python software for managing repository and website (Stuart Hungerford, ANU) • Developing Java-based geographic search interface (TimeMap) • All based on Open Source tools
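
As an illustration of what a periodic XML dump of catalogue metadata might look like, here is a minimal sketch that serialises one record as Dublin Core elements using Python's standard library. It is not PARADISEC's actual export code and it does not implement the full OLAC schema; the record values, the `record` wrapper element, and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace (OLAC metadata builds on Dublin Core).
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def record_to_xml(record: dict) -> str:
    """Serialise a flat metadata record as Dublin Core child elements."""
    root = ET.Element("record")  # hypothetical wrapper element
    for field, value in record.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{field}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical catalogue entry, for illustration only.
example = {
    "identifier": "EXAMPLE-001",
    "title": "Hypothetical field recording",
    "language": "Tok Pisin",
    "format": "Broadcast WAV",
}
print(record_to_xml(example))
```

A static dump would write one such record per catalogue item to a file for harvesters to fetch; the MySQL/PHP port mentioned on the slide would generate equivalent records on request instead.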

  8. Audio Ingest • Initially ingested as raw WAV on AudioCube 5 Dell 670 workstations running Wavelab (2005 will add remote Pyramix workstations) • Masters 24-bit 96 kHz Broadcast WAV Format (uncompressed audio with encapsulated metadata) • Some lower rate (e.g. if digital original 16-bit 48 kHz from DAT) • WAV > BWF by Quadriga audio archiving software • Derivatives produced by batch processing - CD-audio quality (16-bit, 44.1 kHz) and mp3 quality (128 kbps)
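
The storage cost of each format on the audio ingest slide follows directly from sample rate, bit depth and channel count. The sketch below works this out per hour of audio; two channels (stereo) is an assumption, as the slide does not state the channel count, and the function names are illustrative.

```python
def pcm_bytes_per_hour(sample_rate_hz: int, bit_depth: int, channels: int = 2) -> int:
    """Uncompressed PCM size for one hour of audio (header overhead ignored)."""
    return sample_rate_hz * (bit_depth // 8) * channels * 3600


def compressed_bytes_per_hour(bitrate_bits_per_second: int) -> int:
    """Size of one hour of constant-bitrate compressed audio."""
    return bitrate_bits_per_second // 8 * 3600


formats = {
    "Master, 24-bit 96 kHz": pcm_bytes_per_hour(96_000, 24),
    "DAT original, 16-bit 48 kHz": pcm_bytes_per_hour(48_000, 16),
    "CD-quality derivative, 16-bit 44.1 kHz": pcm_bytes_per_hour(44_100, 16),
    "mp3 derivative, 128 kbps": compressed_bytes_per_hour(128_000),
}
for name, size in formats.items():
    print(f"{name}: {size / 1e9:.3f} GB per hour")
```

On these assumptions a stereo master needs about 2.07 GB per hour, roughly 36 times the size of the mp3 derivative.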

  9. Digital preservation • “Azoulay” server partitioned for working files and archive partition for sealed masters - current capacity 750GB (>3TB in 2005) • Sealed masters archived to 100GB data tapes on University of Sydney LTO Mass Data Storage System (high-low watermark script) - duplicate data tapes kept at 2 locations on campus • Sealed masters mirrored to Australian Partnership for Advanced Computing national Store facility (Canberra) nightly • Password-protected online access to Store facility
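
The slide mentions a high-low watermark script without describing it. A common reading of that pattern is sketched below: once disk usage passes a high threshold, files are moved off to tape until usage drops below a low threshold. The thresholds, the oldest-first ordering, and all names here are assumptions for illustration, not the script actually used.

```python
from typing import List, NamedTuple


class ArchivedFile(NamedTuple):
    name: str
    size_gb: float
    sealed_order: int  # lower number = sealed earlier


def files_to_migrate(files: List[ArchivedFile], capacity_gb: float,
                     high: float = 0.9, low: float = 0.7) -> List[str]:
    """Pick files to move to tape, oldest first, once usage passes the high mark."""
    used = sum(f.size_gb for f in files)
    if used <= high * capacity_gb:
        return []  # below the high watermark: nothing to do
    selected = []
    for f in sorted(files, key=lambda f: f.sealed_order):
        if used <= low * capacity_gb:
            break  # usage is back under the low watermark
        selected.append(f.name)
        used -= f.size_gb
    return selected


# Hypothetical partition contents on a 750 GB server.
partition = [
    ArchivedFile("a.wav", 300, 1),
    ArchivedFile("b.wav", 250, 2),
    ArchivedFile("c.wav", 150, 3),
]
print(files_to_migrate(partition, capacity_gb=750))
```

Using two thresholds rather than one avoids triggering a migration every time a single new file arrives.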

  10. Data repository contents • Repository totals 21 January 2005 • total files: 2714 • total items: 847 • total size: 1.1TB • total hours audio: 668 hours • file types: .wav, .mp3 (1210); .tif (171); .jpg (46); .pdf (34); .txt (3); .rtf (8); .xml (32)
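
A few averages can be derived from the repository totals above. This sketch treats the whole 1.1 TB as audio and uses decimal terabytes; both are simplifying assumptions, since the repository also holds images and documents.

```python
TOTAL_BYTES = 1.1e12   # 1.1 TB, decimal units assumed
AUDIO_HOURS = 668
ITEMS = 847
FILES = 2714


def repository_averages(total_bytes: float, hours: float, items: int, files: int) -> dict:
    """Derive rough per-hour and per-item averages from the repository totals."""
    return {
        "gb_per_audio_hour": total_bytes / 1e9 / hours,
        "files_per_item": files / items,
        "audio_hours_per_item": hours / items,
    }


for key, value in repository_averages(TOTAL_BYTES, AUDIO_HOURS, ITEMS, FILES).items():
    print(f"{key}: {value:.2f}")
```

The result is about 1.65 GB stored per audio hour, a little over three files per item, and a little under an hour of audio per item.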

  11. Data repository collections • Bradley (5hr) • Brotchie (15hr) • Capell (9hr)* • Corris (6hr) • Crowther (2hr) • Donohue (3hr) • Dutton (266hr) • Fedden (7hr) • Foley (23hr) • Gardner (56hr) • Kartomi (2hr)* • Loughnane (9hr) • Lawton (3hr) • Laycock (29hr) • McElhanon (41hr) • McIntyre (10hr) • Margetts (17hr) • Poignant (2hr) • Rumsey (20hr) • San Roque (1hr) • Sam (6hr) • Tepano (19hr) • Thieberger (39hr) • Toulmin (35hr) • Voorhoeve (33hr) • Wurm (11)* • Evans (Hons thesis) • Thieberger (PhD thesis) • * Ingestion ongoing January 2005

  12. PARADISEC Repository Languages November 2004 • PALAU: Palauan • PAPUA N. GUINEA: Abau Ambonese Pidgin Angoram (Kanduanuin) Angoram (Moim dialect) Aomie Arapesh Arifama Aunalei Auwim Awomo Ba Balawaia Barai Baruga Barupu (Warapu) Be'anivia Biage Bibo Binandere Bodinumu Boera Boine Boku Boridi Bouxula BratMomire Buin Burum Chimba Chirima Daga Darava Dawawa Dedua Dima Dimadima Dina Doga Domu Doromu Doura Efogi Efogi Dialects Emo Enivilogo Fore Fuyugey Gabadi Ginuman Gwedena Herei Hiae Motu Hiri Motu Hube Hula I'ai Ikega Ioma Isaka (Krisa) Kaipi Kairi Kambot Kanga Karama Karawari Lg (Ambinwari) Karukaru Kâte Kinalaknga Kimi Kiriwina Koiari Koita Koitabu Kokila Kokoro Komba Kopar Koriki Koriko Kosorong Kovai Kovio Kubuirubu Kuman Kumukio Kuni Kunimaipa Kwale Laimodo Mada'a Magi Mâgobineng Magore Maisin Maiwa Managalas Manam Manubara Manumu Mapei Mapena Mari Maria Mekeo Melpa Mian Mid-Wahgi Migabac Mindik Miniafa Mogoni Mom Mor Motu Muhiang Arapesh Nabak Naga Namanadza Naoro Nara New Ireland Pidgin Ngala Nomu Notu Ondoro One (Onne) Onjab Ono Opao Orokaiva Orokolo Ouma Paiwa Police Motu Porome Qld Pidgin Rabuka Raepa Tati Saliba Samo Sene Sepik Tok Pisin Sialum Sinaugoro Sona Suau Suku Surai Taboro Tairuma Tauade Tobo Tok Pisin Tolai Uberi Ubir Ubir Gonjoe Vesilogo Vioribaiwa Wamora Wangun Wiga Wosera Yele Yewudu Yimas Yoba • SOLOMONS: Babatana Ririo Ruviana Varese Lau Santa Cruz • INDONESIA: Asmat Brat Hatam Inanwatan Manikion Moi Ningrum Sahu Sebyar Tinam Todahe Tok Pisin Yahadian • VANUATU: South Efate Bislama Lelepa • FIJI: Lauan • TONGA: Tongan • COOK ISLANDS: Rarotongan Pukapuka • FRENCH POLYNESIA: Tahitian • CHILE: Rapa Nui • INDIA: Rajbangsi • NEW CALEDONIA: Dehu

  13. Sample item interface

  14. Sample item interface

  15. Sample catalog metadata

  16. Metadata January 2005 • 1800 items (recordings or theses) digitised or assessed for digitisation (1629 findable online via metadata repository) • 254 languages from 39 countries in Asia-Pacific • Cassettes: 1256 hours • Reel-to-reel tapes: 417,356 metres of tape • Video: 356 hours
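
The reel-to-reel figure is given in metres rather than hours. A rough playing time can be derived from tape speed, as in the sketch below. The slide does not state recording speeds, so the two standard speeds used here (19.05 cm/s and 9.525 cm/s, i.e. 7.5 and 3.75 inches per second) are assumptions, as is treating each tape as recorded once along its length.

```python
def tape_hours(length_m: float, speed_cm_per_s: float) -> float:
    """Playing time in hours for a given tape length at a constant speed."""
    seconds = length_m / (speed_cm_per_s / 100)
    return seconds / 3600


REEL_METRES = 417_356  # total reel-to-reel tape quoted on the slide

for speed in (19.05, 9.525):
    print(f"{speed} cm/s: about {tape_hours(REEL_METRES, speed):.0f} hours")
```

Depending on the speed assumed, the reel-to-reel holdings come to somewhere between roughly 600 and 1,200 hours, the same order of magnitude as the cassette holdings.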

  17. Open Language Archives Community (OLAC) http://www.language-archives.org • Sub-community of Open Archives Initiative • Worldwide virtual library of language resources • PARADISEC one of 29 participating archives • AIMS • develop consensus on best current practice for digital archiving of language resources • develop network of interoperating repositories & services for housing & accessing such resources

  18. Metadata OLAC harvest
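
OLAC harvesting is built on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), in which a harvester issues HTTP requests carrying a `verb` and a `metadataPrefix`. The sketch below only builds such a request URL and makes no network call; the base URL is a placeholder, and the `olac` prefix is assumed rather than taken from the slides.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "olac",
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, not a real repository address.
BASE_URL = "https://archive.example.org/oai"
print(list_records_url(BASE_URL))
print(list_records_url(BASE_URL, from_date="2005-01-01"))
```

The optional `from` argument is what makes dynamic harvesting attractive: a harvester can ask only for records changed since its last visit instead of re-reading a full static dump.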

  19. DELAMAN connections www.delaman.org • www.mpi.nl/DOBES • www.uaf.edu/anlc/ • emeld.org • lacito.vjf.cnrs.fr/archivage • www.hrelp.org/archive/ • www.ailla.utexas.org • paradisec.org.au • www.arts.auckland.ac.nz/ant • www.aiatsis.gov.au

  20. General Ontology for Linguistic Description

  21. Music Description Ontologies? • Much more complicated situation because of commercial music industry interests • Most ontologies designed for commercial music (albums, tracks, composers etc.) or Western music notation (diatonic scale etc.) • Most recent ethnomusicological discourse concentrates on social context rather than description or analysis and is suspicious of universalist approaches • Some current initiatives e.g. EU MusicNetwork

  22. Issues for semantic web • Small-scale specialist archive with few staff and precarious funding - not resourced for huge amount of work for RDF markup • Curator-intensive - cannot be readily automated • Need to motivate and involve researchers and communities in description as well as high-level ICT advisors • Present highest priority: salvage of endangered media • Lack of appropriate ontologies, especially for music

  23. But… • We have a good foundation - well-structured data and metadata (for whole-item level) conforming to international standards • We are in conversation with international disciplinary communities through OLAC, EMELD, DELAMAN • Our collection is of high cultural heritage and scholarly value, of interest to international community • We are motivated to learn more from other large-scale distributed digital data archives

  24. PARADISEC gratefully acknowledges support from: • Partner Universities (Sydney, Melbourne, ANU, UNE) • Australian Research Council LIEF scheme • Australian Partnership for Sustainable Repositories (SORRT testbed) • Australian Partnership for Advanced Computing • Grangenet • ANU Internet Futures

  25. Contact us • http://www.paradisec.org.au • Linda.Barwick@paradisec.org.au (Director) • Nicholas.Thieberger@paradisec.org.au (Project Manager)

  26. Relevant URLs • PARADISEC website http://paradisec.org.au/ • PARADISEC repository login http://store.apac.edu.au/cgi-bin/pdsc-v3.0.cgi/login • PARADISEC streaming trial http://paradisec.org.au/streamingtrial.html • Transcript page image trial http://www.austehc.unimelb.edu.au/~gavan/lana/hdms.htm • EMELD General Ontology for Linguistic Description http://www.emeld.org/tools/ontology.cfm
