Data Curation and Citation at the California Digital Library

Data Curation and Citation at the California Digital Library 4 May 2011 John Kunze, Patricia Cruse, Stephen Abrams, Catherine Mitchell California Digital Library

Problem: data seems to be a second-class citizen in the scholarly record • Research data • Journal article

CDL and UC3: who we are • California Digital Library: five programs, including the UC Curation Center • Digital repositories via Merritt • Persistent id management (DOIs, ARKs, et al.) via EZID • Web Archiving Service (WAS) • Open access publishing • eScholarship online journals, with peer review • Electronic texts • Search and display tools (XTF) • Serving the 10 UC campuses • 226,000 students • 134,000 faculty and staff • Working collaboratively • libraries • data centers • museums, archives • faculty and researchers

Modernizing a long tradition • Formats: PDF/A, BagIt, Pairtree, WARC • Identifiers: ARK, URL • Metadata: UDFR, Dublin Kernel, Dublin Core • JHOVE format characterization tool • NOID nice opaque identifier minter/resolver tool
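To make the "filesystem-centric" flavor of these conventions concrete, here is a minimal sketch of how a Pairtree-style layout might map an identifier to a directory path. It reflects one reading of the Pairtree convention (hex-encode awkward characters, swap "/", ":" and "." for filesystem-safe characters, then split into two-character directory names). The function names are hypothetical and this is not CDL's implementation.

```python
# Characters that are hex-encoded as ^hh before mapping (an assumption
# based on the Pairtree convention; non-visible bytes are encoded too).
_SPECIAL = set(b'"*+,<=>?\\^|')

# Single-character swaps applied after hex-encoding.
_SWAP = str.maketrans("/:.", "=+,")


def pairtree_clean(identifier: str) -> str:
    """Return a filesystem-safe form of the identifier."""
    out = []
    for byte in identifier.encode("utf-8"):
        if byte < 0x21 or byte > 0x7E or byte in _SPECIAL:
            out.append("^%02x" % byte)
        else:
            out.append(chr(byte))
    return "".join(out).translate(_SWAP)


def pairtree_path(identifier: str) -> str:
    """Split the cleaned identifier into two-character path segments."""
    cleaned = pairtree_clean(identifier)
    return "/".join(cleaned[i:i + 2] for i in range(0, len(cleaned), 2))


if __name__ == "__main__":
    print(pairtree_path("ark:/13030/xt12t3"))
```

Because the path is derived from the identifier alone, an object can be located with ordinary filesystem tools and no database lookup.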

Data curation at CDL: an overview • Merritt: general-purpose data repository • EZID: scheme-agnostic and de-coupled creation, resolution, and management of persistent ids • DataONE global data network [NSF] • Web archiving service [Library of Congress] • DataCite consortium and citation standards • Data management plan generator • Open-source Excel add-in [MS Research & GBMF] • New “data paper” publishing model [GBMF]

Service overview • Open to the UC community & more • Discipline-agnostic • Service delivery: hosted or local • Easy to use UI or API • Why another repository system? • Composable micro-services • Filesystem-centric • Simple

EZID: easy identifiers for the long-term • A service to make and manage actionable ids • User interface • Programming interface for bulk ops (Dryad) • Ids for anything: digital, physical, living, abstract • Can manage identifiers under different schemes: • ARKs, DOIs, and more to come (LSIDs, ...) • Visit EZID at • http://n2t.net/ezid
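As an illustration of what a bulk client of the programming interface might prepare, the sketch below builds a line-oriented "name: value" request body and parses a one-line status reply. The body format, the percent-escaping rules, the metadata element names, and the reply shape are assumptions for illustration, not a statement of the EZID API. The helper names are hypothetical and no network call is made.

```python
import re


def anvl_escape(text: str, is_name: bool = False) -> str:
    """Percent-escape characters that would break a 'name: value' line.

    Assumption: names also escape ':' while values do not.
    """
    pattern = "[%:\r\n]" if is_name else "[%\r\n]"
    return re.sub(pattern, lambda m: "%%%02X" % ord(m.group(0)), text)


def build_request_body(metadata: dict) -> bytes:
    """Serialize metadata as one 'name: value' line per element."""
    lines = [
        "%s: %s" % (anvl_escape(name, is_name=True), anvl_escape(value))
        for name, value in metadata.items()
    ]
    return "\n".join(lines).encode("utf-8")


def parse_status(reply: str) -> tuple:
    """Split a reply such as 'success: <id>' into (status, detail)."""
    status, _, detail = reply.partition(":")
    return status.strip(), detail.strip()


if __name__ == "__main__":
    body = build_request_body({
        "_target": "http://example.org/dataset/1",  # hypothetical location
        "erc.what": "Example dataset\nsecond line",
    })
    print(body.decode("utf-8"))
    print(parse_status("success: ark:/99999/fk4example"))
```

A bulk client would loop over its records, send each body with an authenticated HTTP request, and record the identifier returned in the status line.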

Partners & Clients • Dryad • NASA HelioPhysics • Fred Hutchinson Cancer Research Center • Healthy Pathways (UC Davis) • Open Context (UC Berkeley) • Purdue University • DataONE • USGS • ESIP • LabArchives LLC • CNDP-CRDP • 26 other organizations in pipeline

DataONE: enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it • Engaging the scientist in the data curation process • Supporting the full data life cycle • Encouraging data stewardship and sharing • Promoting best practices • Engaging citizens • Developing domain-agnostic solutions • 1. Build on existing cyberinfrastructure • 2. Create new cyberinfrastructure • 3. Create new communities of practice

Web archiving service snapshot (2010) • Stats since January 2007: • 17 organizations using service • 4,316 sites captured • 30,619 captures run • 19.6 terabytes • 200+ archives under construction • 29 archives published • In partnership with the IIPC consortium of national libraries

Archiving the Gulf oil spill • Improving support for collaboration • 946 sites • 8,400+ captures • 1.3 TB • Begun May 5, 2010

Data curation at CDL: an overview • Merritt: general-purpose data repository • EZID: scheme-agnostic & de-coupled creation, resolution, and management of persistent ids • DataONE global data network [NSF] • Web archiving service [Library of Congress] • DataCite consortium and citation standards • Data management plan generator • Open-source Excel add-in [MS Research & GBMF] • New “data paper” publishing model [GBMF]

Discovery: DataCite consortium • Technische Informationsbibliothek (TIB), Germany • Australian National Data Service (ANDS) • The British Library • California Digital Library, USA • Canada Institute for Scientific and Technical Information (CISTI) • L’Institut de l’Information Scientifique et Technique (INIST), France • Library of the ETH Zürich • Library of TU Delft, The Netherlands • Office of Scientific and Technical Information, US Department of Energy • Purdue University, USA • Technical Information Center of Denmark

DataCite/DataONE example [diagram; labels recovered from the slide] • Components: Research scientist; DataONE Member Node data archive (e.g., Dryad); DataONE Coordinating Node metadata catalog (e.g., UNM or UCSB); CDL-hosted EZID id minting service; EZID resolver and registration service (CDL); DOI resolver and TIB registration • Flow labels: data + metadata; get unique id string (shown twice, once marked “opt”); 2. metadata + URL + id; 3. citation + URL + id; 4. save full citation; 5. URL plus id; 6. full citation; 7. full citation

UC3 data management guidelines • Researchers seeking NSF funding are now required to submit data management plans • UC received over $600 million from NSF in 2009 • Best practices for managing and sharing data, meeting funding agency requirements, developed in concert with UCLA, UCM, UCSD • Working with U Va, Smithsonian, et al. on a tool to generate data management plans

Why Excel? • Spreadsheets hold an important class of data • Cons: poor feature set and scaling compared to regular DB management systems • Pluses: ubiquity, familiarity, ease-of-use • An open-source add-in that extends Excel functionality leverages a huge installed base

What an Excel add-in could do • Some ideas to better publish, share, and archive: • Permit standardized column headers • Versioning and standard date formats • Auto-archiving and persistent id assignment • “Speed bumps” to discourage macros et al. • Best ideas to be selected by project partners

Need to save data + processing • Algorithms + Data Structures = Programs

Vision for a “data paper” • Idea: wrap the unfamiliar in a familiar façade • Funded by the Gordon and Betty Moore Foundation • A “data paper” minimally consists of a cover sheet and a set of links to archived artifacts • Cover sheet contains familiar elements such as title, date, authors, abstract, and persistent identifier (DOI, ARK, etc.) • Just enough to permit basic exposure to and discovery of datasets by internet search engines, which in turn permits • Building a basic data citation • Indexing by services such as Google Scholar • Instilling confidence in the identifier’s stability
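A small sketch may help show how little a cover sheet needs in order to yield a basic data citation. The field names, the sample values, and the "Creators (Year): Title. Publisher. Identifier" pattern are assumptions for illustration; the slides do not prescribe a schema or a citation style.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoverSheet:
    """Hypothetical minimal cover sheet for a data paper."""
    title: str
    date: str                  # assumed to be ISO 8601, e.g. "2011-05-04"
    authors: List[str]
    abstract: str
    identifier: str            # persistent identifier (DOI, ARK, etc.)
    artifact_links: List[str] = field(default_factory=list)


def basic_citation(sheet: CoverSheet, publisher: str) -> str:
    """Build a citation string from cover-sheet elements.

    The pattern used here is one common style, chosen as an assumption.
    """
    year = sheet.date[:4]
    creators = "; ".join(sheet.authors)
    return "%s (%s): %s. %s. %s" % (
        creators, year, sheet.title, publisher, sheet.identifier
    )


if __name__ == "__main__":
    sheet = CoverSheet(
        title="Example Stream Temperature Dataset",      # made-up example
        date="2011-05-04",
        authors=["Doe, J.", "Roe, R."],
        abstract="Hourly temperatures from a hypothetical stream gauge.",
        identifier="ark:/99999/fk4example",
        artifact_links=["http://example.org/archive/fk4example/data.csv"],
    )
    print(basic_citation(sheet, "Example Repository"))
```

The same few elements that make the citation are also what a search engine or indexing service would pick up from the cover sheet.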

Hypothetical Data Paper

Vision for a “data journal” • Potential parallel emergence of a new kind of “data journal” • Like regular journals, data journals would spring up around disciplines and sub-disciplines as needed • Expect some of them to be peer-reviewed, as well as hybrids • Envisioned as “overlay” journal • A variety of data paper sources • Table of contents, editorial policies, submission guidelines, etc. • The “data paper” format itself should evolve • incorporating more general-purpose and discipline-specific elements to enrich discovery, re-use, and archiving

Return incremental value for incremental effort ... with room for nano-publications.

Data paper: envisioned outcomes • Data papers and citations look familiar to people and indexers • Data authors motivated to deposit for author credit • Datasets routinely re-used, annotated, and corrected, with stable storage and citable, trackable identifiers • Data products, including those resulting from synthesis efforts, enter the scientific record instead of being lost • Data journals spring up around disciplines, even if their “papers” are scattered in distributed repositories • Peer review optimized by authors’ ability to indicate which information is essential and which information is not • Relevant but non-essential information available for the interested reader but does not interfere with peer review

Data curation at CDL: an overview • Merritt: general-purpose data repository • EZID: scheme-agnostic & de-coupled creation, resolution, and management of persistent ids • DataONE global data network [NSF] • Web archiving service [Library of Congress] • DataCite consortium and citation standards • Data management plan generator • Open-source Excel add-in [MS Research & GBMF] • New “data paper” publishing model [GBMF]

Summary • CDL is working on several fronts to make a complex problem smaller • Supporting research: data papers and data management planning tool • Technology: micro-services (EZID, Merritt, etc.) and the open-source Excel add-in project • Community: DataONE and DataCite

Questions? • John.Kunze@ucop.edu • California Digital Library • http://www.cdlib.org • “Data Paper” paper: http://escholarship.org/uc/item/9jw4964t
