This document discusses the critical ingest workflow issues faced by the North Carolina State University Libraries, particularly in the context of archiving geospatial data. Key challenges include inconsistent metadata, varied formats, missing information, and the management of data received from multiple agencies. It covers the decisions made to improve metadata quality and integrity and highlights tools and methods used for automated extraction and normalization. Attention is also given to rights encoding and ongoing engagement with stakeholders to address metadata deficiencies.
Steve Morris, North Carolina State University Libraries. Ingest Workflow Issues: Metadata. North Carolina Geospatial Data Archiving Project
How the Data is Received • Data is delivered as is – no control over organization of received data • Contributing organizations • County and municipal agencies • State agencies • Regional councils of government • Data transfer modes • CD/DVD, External Drive • FTP or Web Download
Ingest Challenges: General • Data consists of multi-file, multi-format objects • Ancillary data files can be shared by datasets • Some formats require conversion now • Some format conversions involve one-to-many relationships • Compressed archive files are common and behave unpredictably • And all the usual challenges: format validation, validity checking, threat scanning, …
Ingest Challenges: Metadata • Metadata is encoded in a variety of ways • The FGDC content standard for metadata lacked an encoding standard (it arrived pre-XML); this will soon be addressed in the ISO 19115/19139 FGDC implementation • XML (varied schemas), TXT, HTML • Metadata is missing • Only about 25% of local agencies use FGDC • Metadata is wrong • Metadata is commonly out of sync with the data • Inconsistent use of dataset naming, etc.
Some Key Decisions • Capture “transfer set” metadata • Normalize, synchronize, and remediate existing metadata, and retain original metadata record • Treat contact information as archival • Update metadata with format conversions • Use ESRI Profile of FGDC • added technical and administrative elements • Has an XML schema • ArcCatalog tool support • Use simple rights encoding scheme • Record metadata in a workflow management database
What is Transfer Set Metadata? • Administrative and technical metadata associated with a transfer device or download • Propagates to individual data objects • [Figure: PHP application interface for transfer set metadata capture]
If No Metadata, What Then? • Autoextract a subset of technical and descriptive metadata through ArcCatalog • Apply an agency-specific metadata template (many elements are static within the context of the agency) • Acquire information from the NC OneMap Inventory • Data Source • Contact Info • Datum, Coordinate System • Acquire information from agency web site • Avoid direct inquiries to local agencies (“contact fatigue”)
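The agency-template step above can be sketched as a simple merge, where template values fill only the elements that the incoming record lacks. This is a minimal illustration, not the project's actual code: the function name `apply_template`, the element names, and the sample values are all hypothetical.

```python
def apply_template(record: dict, template: dict) -> dict:
    """Fill missing or empty metadata elements from an agency template.

    Values already present in the record always win; the template only
    supplies elements that are absent, None, or empty strings.
    """
    merged = dict(template)
    merged.update({k: v for k, v in record.items() if v not in (None, "")})
    return merged


# Hypothetical template: elements assumed static for one agency.
county_template = {
    "contact_org": "Example County GIS",
    "datum": "NAD83",
}

# Hypothetical record as it might come out of automated extraction.
extracted = {"title": "Streets", "datum": "", "format": "Shapefile"}

completed = apply_template(extracted, county_template)
```

Keeping the template as plain data means one template per agency can be maintained without touching the ingest code.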
What Gets Remediated and Why? • Key technical elements that are wrong • Datum, coordinate system, format, … • Title • Qualify to the agency (e.g. “Streets” becomes “Henderson County Streets”) • Keywords • Add ISO keywords • NCSU GIS Lookup terms added later if needed for access • These are basic requirements for access and use
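Title qualification, as in the “Streets” example above, could be scripted roughly as follows. The function name `qualify_title` and the rule for detecting an already-qualified title (a case-insensitive prefix check) are assumptions made for illustration.

```python
def qualify_title(title: str, agency: str) -> str:
    """Prefix a dataset title with its agency name unless already present."""
    title = title.strip()
    agency = agency.strip()
    # Assumed rule: a title that already starts with the agency name is
    # considered qualified and is returned unchanged.
    if title.lower().startswith(agency.lower()):
        return title
    return f"{agency} {title}"


qualified = qualify_title("Streets", "Henderson County")
```

Because already-qualified titles are left alone, the function can be rerun safely on records that were remediated earlier.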
Metadata Tools • ArcCatalog • Automated metadata extraction • ArcGIS Toolbar • Metadata synchronization, normalization, templating • cns and mp • Raw text handling • Python classes • Ingest workflow
Source Metadata Translation • Hub-and-spoke model a la Echo DEPository • repository agnostic • modular conversion hub • facilitate repository software migration & inter-archive exchange
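A hub-and-spoke translator can be sketched as a registry of readers (source format to hub record) and writers (hub record to target format), so that supporting a new format means adding one reader or writer rather than one converter per format pair. Everything below is a hypothetical illustration: the hub is a flat dictionary, and the "txt" and "xml" spokes are toy formats, not FGDC or ESRI encodings.

```python
from typing import Callable, Dict
from xml.sax.saxutils import escape

Hub = Dict[str, str]  # hypothetical neutral record: element name -> value

READERS: Dict[str, Callable[[str], Hub]] = {}
WRITERS: Dict[str, Callable[[Hub], str]] = {}


def read_keyvalue_text(raw: str) -> Hub:
    """Spoke in: parse 'Name: value' lines into the hub record."""
    hub: Hub = {}
    for line in raw.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            hub[key.strip().lower()] = value.strip()
    return hub


def write_simple_xml(hub: Hub) -> str:
    """Spoke out: serialize the hub record as flat XML.

    Assumes element names are single tokens that are valid XML names.
    """
    items = "".join(f"<{k}>{escape(v)}</{k}>" for k, v in sorted(hub.items()))
    return f"<metadata>{items}</metadata>"


READERS["txt"] = read_keyvalue_text
WRITERS["xml"] = write_simple_xml


def translate(raw: str, source: str, target: str) -> str:
    """Convert between any registered formats by way of the hub."""
    return WRITERS[target](READERS[source](raw))


xml_out = translate("Title: Streets\nDatum: NAD83", "txt", "xml")
```

With N source formats and M targets, this arrangement needs N + M spokes instead of N × M direct converters, which is what makes repository migration and inter-archive exchange cheaper to support.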
What is the Rights Encoding? • Purpose: Define a basic set of codes to hold dataset rights information in a script-actionable form, and assign related text for use in constructing brief rights statements. The encoding propagates to individual data objects • Structure: Codes are assigned on a fixed string position basis. Rights assigned to particular user types are grouped after a flag character for that user group. • Initial User Groups: • NCSU Faculty/Staff/Students (Code “N”) • General Public (Code “P”) • Library of Congress (Code “L”) • Initial Rights Types: • Use • Redistribute • Commercial Use
Sample Rights Record • M01N110P110L110 • Interpretation: This dataset was acquired in a mediated transaction directly from the data producer (acquired on media or via arranged download). There is no data agreement but there is a data disclaimer. NCSU, General Public, and LC all can use and redistribute the data but commercial use is not allowed.
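Because the codes sit at fixed string positions, a record such as the sample above can be decoded with a few lines of script. The sketch below is inferred from the sample record and its interpretation, not taken from a published specification. It assumes a three-character acquisition prefix (here "M01"), followed by four-character groups made of a user-group flag and three digits in the order use, redistribute, commercial use, with "1" meaning allowed. The function name `parse_rights` is hypothetical.

```python
USER_GROUPS = {
    "N": "NCSU Faculty/Staff/Students",
    "P": "General Public",
    "L": "Library of Congress",
}
RIGHTS_TYPES = ("use", "redistribute", "commercial_use")


def parse_rights(record: str) -> dict:
    """Decode a fixed-position rights record into a nested dictionary."""
    # Assumed layout: 3-character acquisition prefix, then 4-character groups.
    prefix, body = record[:3], record[3:]
    if len(prefix) != 3 or len(body) % 4 != 0:
        raise ValueError(f"unexpected record length: {record!r}")
    rights = {}
    for i in range(0, len(body), 4):
        flag, bits = body[i], body[i + 1 : i + 4]
        if flag not in USER_GROUPS:
            raise ValueError(f"unknown user group flag: {flag!r}")
        if any(bit not in "01" for bit in bits):
            raise ValueError(f"rights digits must be 0 or 1: {bits!r}")
        # Assumed meaning: "1" grants the right, "0" withholds it.
        rights[flag] = {name: bit == "1" for name, bit in zip(RIGHTS_TYPES, bits)}
    return {"acquisition": prefix, "rights": rights}


parsed = parse_rights("M01N110P110L110")
public_may_sell = parsed["rights"]["P"]["commercial_use"]
```

The decoded dictionary can then drive both access decisions in scripts and the assembly of a brief rights statement from text associated with each code.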
Deferred Activities • Implementing METS and PREMIS • Developing a serial object metadata scheme
Ongoing Challenges • When to automate and when not to • Learn first from human intervention • Minimizing risk of error related to human intervention • Accepting that ingest packages used will evolve over time (implications for archive?) • Handling post-ingest migrations
Engagement Opportunities • NCGDAP partner NCCGIA runs the NC OneMap Metadata Outreach Program • Provide feedback to spatial data infrastructure about metadata inconsistencies, lack of adherence to best practices • Partner with industry and standards organizations on addressing metadata issues such as poor standards support for versioned data (e.g., through OGC Data Preservation Working Group)