Learn about digital archiving challenges, image standards, metadata, data migration, transcription methods, and more in the ESSSS digital archive field. Explore the use of handheld digital cameras, various image formats, web image management infrastructure, digital preservation, and crowdsourcing for transcription projects.
Technology Support for ESSSS: Progress, Issues, and Challenges • Marshall Breeding • Director for Innovative Technology and Research, Vanderbilt University Library • Founder and Publisher, Library Technology Guides • http://www.librarytechnology.org/ • http://twitter.com/mbreeding • ESSSS Digital Archive Workshop • February 4, 2012
Turning Pages on Paper to Digital Images • Digitizing in the field involves many compromises compared to what can be done in more controlled settings • Access to archives may be of limited duration • Arbitrary and political • Materials deteriorating rapidly • Practices related to physical preservation tend to be minimal • Equipment must be light, fast, and inexpensive
Achieve best results possible • Maximize quality and consistency • Handheld digital cameras • Rapid advancement in capabilities • Early images done at lower resolutions compared with what is possible today • Fixed camera stands • Consistency in orientation and framing • Organization of images (folders / image names)
Image Standards • TIFF: Currently regarded as best image format for archiving images • RAW: Native proprietary format of a camera • JPEG: Compressed images for display on the Web • Data lost during compression: not reversible • VU system creates multiple sizes of JPEG images • JPEG 2000 • Supports lossless compression • Not well supported on the Web
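The "multiple sizes of JPEG images" point can be illustrated with a small sketch that plans derivative dimensions from one archival master. This is not the VU system's actual logic: the tier names and pixel limits below are hypothetical, and the sketch only computes target dimensions, leaving the actual resampling to an imaging library.

```python
# Hypothetical derivative tiers (name -> longest side in pixels).
# These values are illustrative assumptions, not the VU system's settings.
DERIVATIVE_TIERS = {"thumbnail": 150, "medium": 800, "large": 2000}


def derivative_size(width, height, max_side):
    """Scale (width, height) so the longer side is at most max_side.

    Aspect ratio is preserved with integer arithmetic, and images that
    are already small enough are returned unchanged (never upscaled).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    new_width = max(1, width * max_side // longest)
    new_height = max(1, height * max_side // longest)
    return new_width, new_height


def plan_derivatives(width, height):
    """Return the target size of every derivative tier for one master image."""
    return {
        name: derivative_size(width, height, side)
        for name, side in DERIVATIVE_TIERS.items()
    }
```

Because JPEG compression discards data, derivatives like these would normally be generated from the TIFF or RAW master rather than from another JPEG.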
Bringing Images to the Web • Take advantage of infrastructure developed by the Vanderbilt University Library to manage images • Digital Library framework: • Presentation and functionality created in Perl-based interface • Data and metadata stored in MySQL relational tables • ODBC connectivity between presentation layer and MySQL • Microsoft Windows Server/IIS for Web server • Images reside on digital storage provided by the Vanderbilt University Library
Digital Preservation • Disaster Recovery • Ability to restore files in the case of any hardware, software, or human error • Digital Preservation • Commitment and processes in place to preserve digital information for the very long term • Multiple replications • Migration of data into future formats as current standards become obsolete
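One common way to make "multiple replications" verifiable is a fixity check: record a checksum for every file in the primary copy, then confirm each replica still matches. The sketch below assumes SHA-256 and a simple directory-per-replica layout. Both are illustrative choices, and the function names are hypothetical rather than part of the ESSSS infrastructure.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_replica(manifest, replica_root):
    """Return {relative_path: problem} for files missing or altered in a replica."""
    replica_root = Path(replica_root)
    problems = {}
    for relative_path, expected in manifest.items():
        candidate = replica_root / relative_path
        if not candidate.is_file():
            problems[relative_path] = "missing"
        elif sha256_of(candidate) != expected:
            problems[relative_path] = "checksum mismatch"
    return problems
```

An empty result from `compare_replica` means every file listed in the manifest is present in the replica and byte-for-byte identical to the primary copy.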
Building structure through Metadata • Metadata structure based on Dublin Core • Volume-level descriptive metadata • Courtney Campbell designed metadata structure and is analyzing volumes to populate metadata for each volume • EXIF Data extracted from images into the individual records for each page • Page-level structure • Supports ability to select volumes and browse page images
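The two-level structure described above (volume-level descriptive metadata, page-level records carrying EXIF data) might be modeled roughly as follows. The field selection is an assumption: `identifier`, `title`, `date`, `language`, and `coverage` are standard Dublin Core elements, but the actual ESSSS schema is not shown in the slides, and the class names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PageRecord:
    """One page image within a volume (page-level structure)."""
    sequence: int                               # position of the page in the volume
    image_file: str                             # file name of the page image
    exif: dict = field(default_factory=dict)    # EXIF values extracted from the image


@dataclass
class VolumeRecord:
    """Volume-level description using a few Dublin Core elements."""
    identifier: str
    title: str
    date: str
    language: str
    coverage: str
    pages: list = field(default_factory=list)

    def add_page(self, image_file, exif=None):
        """Append a page, numbering it by its order of arrival."""
        page = PageRecord(len(self.pages) + 1, image_file, dict(exif or {}))
        self.pages.append(page)
        return page

    def browse(self):
        """Return page image names in reading order."""
        ordered = sorted(self.pages, key=lambda page: page.sequence)
        return [page.image_file for page in ordered]
```

Selecting a volume and calling `browse()` corresponds to the "select volumes and browse page images" capability. In the deployed system the same relationships would live in MySQL tables rather than in-memory objects.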
Demonstration • Image management environment • Interface • Metadata • Page Images
Turning Pages into Data • The contents of the page images contain valuable data • Page images can be read by humans but do not support essential features: search, computer analysis, etc. • Full value of these collections can be realized through transcription
Challenges in transcription • Page characteristics • Handwritten by many different hands • Many names and numbers • Spanish language • Varying contrast • Many defects: water damage, insects, etc.
Human transcription • Scholars who work with pages of interest can create transcriptions manually • Optical character recognition? • Highly accurate for typescript • Not effective for handwritten manuscripts
Crowdsourcing • Find ways to have large numbers of people create transcription snippets • Google uses crowdsourcing to improve transcriptions for the Google Books project
Google ReCAPTCHA: • “Digitizing books one word at a time” • Each transaction transcribes one or two words • Each word is transcribed many times • Results compared to determine correct version
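The "transcribed many times, results compared" idea can be sketched as a simple majority vote over independent transcriptions of the same snippet. This is not reCAPTCHA's actual algorithm, which the slides do not describe. The normalization rule and the `min_votes` and `min_agreement` thresholds are illustrative assumptions.

```python
from collections import Counter


def normalize(text):
    """Collapse whitespace and ignore case so trivial variants count as one answer."""
    return " ".join(text.split()).casefold()


def consensus(transcriptions, min_votes=3, min_agreement=0.6):
    """Return the agreed transcription, or None if there is no clear winner.

    A result is accepted only when at least `min_votes` non-empty answers
    were collected and the most common answer holds at least
    `min_agreement` of them.
    """
    votes = Counter(normalize(t) for t in transcriptions if t.strip())
    total = sum(votes.values())
    if total < min_votes:
        return None
    winner, count = votes.most_common(1)[0]
    return winner if count / total >= min_agreement else None
```

Snippets that return `None` would be shown to more contributors or routed to a scholar for review. For names and numbers in handwritten records, a stricter agreement threshold may be appropriate.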
Crowdsourcing to Transcribe ESSSS • Scholars contribute any transcriptions created as they work with any given set of pages • Students assigned to create transcriptions • Language, history, LIS • Collaboration with an organization that has reCAPTCHA-like infrastructure