Explore the importance of data curation for long-term preservation, using research and scholarly perspectives to enhance data value. Learn about digital preservation issues, data flows, and the role of the UK Digital Curation Centre in maintaining trusted digital information. Discover data curation practices, knowledge extraction through mining, analysis, and synthesis, as well as the significance of open access and collaboration in scholarly communications and business transactions. This work, licensed under a Creative Commons License, delves into repository metadata management and the integration of engineering workflows for effective data management and dissemination.
Looking to the longer term: some perspectives on data curation and preservation. Dr Liz Lyon, DCC Associate Director (Outreach); Director, UKOLN, University of Bath, UK. This work is licensed under a Creative Commons Licence: Attribution-ShareAlike 2.0
About UKOLN • “a centre of expertise in digital information management” • Funding: Joint Information Systems Committee (JISC) + Museums, Libraries & Archives Council (MLA) • Portfolio of R&D projects Delos, DRIVER, Grand Challenge • 29+ staff based at the University of Bath • Inform the library, information, education and cultural heritage communities • Policy, advocacy at national level, build innovative Web-based systems & services, R&D, e-journal Ariadne, workshops and conferences. • http://www.ukoln.ac.uk/ Acknowledgement: Alex Ball, Grand Challenge Project
UK Digital Curation Centre • Digital Curation Centre • Funded by JISC & EPSRC • Development activities • Research agenda • Delivering services • Outreach Programme • http://www.dcc.ac.uk/
Overview • Data curation and digital preservation issues • Draw on research and scholarship perspectives • Data / information flows and the “business process” • UK Digital Curation Centre activities “maintaining and adding value to a trusted body of digital information for current and future use”
Reference datasets as infrastructure? Data-centric 2020 vision
(Very simple) Product Research Cycle & Data Curation • Formulate ideas / hypothesis, test, experiment, observe, design: data creation, collection & capture • Data management, storage & validation: description, deposit, self-archiving, preservation, certification • Adding value: data linking, annotation, visualisation, simulation • Scholarly communications & business transactions: data disclosure, publication, citation, discovery, re-use • (New) knowledge extraction: data mining, modelling, analysis, synthesis • Data processing links each stage to the next • Underpinned by e-Infrastructure, open (??) access and collaboration
RepoMMan: Repository Metadata and Management (Hull) using WS-BPEL • Are your engineering workflows identified and described? Workflow e-Scientist desktop? Slide: Carole Goble
“JISC Vision”: a global landscape of federated repositories • e-Framework and Information Environment context • Define common + domain-specific + repository “services” • Interoperability based on open standards, software tools • Multi-disciplinary, cross-sectoral • National, institutional • Different platforms • Many format types: data, eprints, images, geospatial • Diagram: repositories (heterogeneous metadata formats, content formats, identifiers, packaging standards) feed a fusion layer (‘repository federator’), which presents homogeneous metadata formats, content formats, identifiers and packaging standards to portals • From Andy Powell: http://www.ukoln.ac.uk/distributed-systems/jisc-ie/arch/presentations/jiie-jcs-2005/
Pilot Engineering Repository Xsearch PerX http://www.engineering.ac.uk/
Interoperability??? STEP ISO10303
Repositories and OAIS Reference Model • “an archive consisting of an organisation of people and systems that has accepted the responsibility to preserve information and make it available for a Designated Community ... an identified group of potential consumers who should be able to understand a particular set of information”
Assuring permanence: digital preservation • Trusted DR Audit Checklist for Certification Draft Research Libraries Group-NARA Taskforce 2005 Defined criteria: • Organisation • Functions, processes & procedures • Designated community & usability • Technologies & technical infrastructure • Revised Checklist based on feedback and pilot audits (KB, BADC) • Self-certification: DINI-Zertifikat: requirements & recommendations: • Server policy / Guidelines • Author support • Legal issues • Authenticity and integrity • Cataloguing • Access statistics • Long-term sustainability • Has your repository / PLM been audited?
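One routine way a repository can support the “Authenticity and integrity” criterion is a fixity check: record a checksum for each object at deposit and recompute it at audit time. The following is a minimal sketch under stated assumptions, not a procedure taken from the RLG-NARA checklist or the DINI certificate; the function names and the in-memory `objects` mapping are hypothetical stand-ins for a real object store.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of an object's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(objects: dict) -> dict:
    """Record one checksum per object name at deposit time."""
    return {name: sha256_of(content) for name, content in objects.items()}


def audit(objects: dict, manifest: dict) -> list:
    """Return the names of objects that are missing or whose bytes have changed."""
    problems = []
    for name, expected in manifest.items():
        content = objects.get(name)
        if content is None or sha256_of(content) != expected:
            problems.append(name)
    return sorted(problems)


# Hypothetical holdings: names and contents are illustrative only.
holdings = {"dataset-001.csv": b"x,y\n1,2\n", "model-001.step": b"ISO-10303-21;"}
manifest = build_manifest(holdings)

# Simulate silent corruption of one object between deposit and audit.
holdings["model-001.step"] = b"ISO-10303-21;corrupted"
damaged = audit(holdings, manifest)
```

A check like this only detects change; it does not show who changed an object or why, so it complements rather than replaces the organisational criteria listed on the slide.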
Interdisciplinary discovery • Validation, publication & discovery of data models & schema • Harmonisation and normalisation of metadata and semantics • Packaging standards: METS, MPEG-21 DIDL • Formal high-level and domain ontologies • ePrints DC Application Profile http://www.ukoln.ac.uk/repositories/digirep/index/Eprints_Application_Profile • eBank Application Profile crystallography data http://www.ukoln.ac.uk/projects/ebank-uk/schemas/ • What data models and metadata schema are in place?
Persistent identifiers for data citation • How will they be used? We need use cases: depositor, author, service provider, researcher, publisher? • Schemes: DOI, Handle, ARK, PURL • Global identification: express as http URIs • Data citation (human and machine-actionable) • Publication & citation of scientific primary data project National Library for Science & Technology (TIB), University of Hanover, Germany. STD-DOI Project DOI registry for datasets http://www.std-doi.de • Is there a data citation policy? • What persistent identifiers have been assigned to your data?
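The point about expressing identifiers as http URIs can be sketched in a few lines. This is an illustrative example only: the resolver prefixes are widely used public resolvers chosen here as an assumption (the slides do not prescribe any), and `to_http_uri` and `cite` are hypothetical helper names. The DOI is the one cited for the eBank paper in this deck.

```python
from urllib.parse import quote

# Assumed public resolver prefixes; a repository may run or prefer its own.
RESOLVERS = {
    "doi": "https://doi.org/",
    "handle": "https://hdl.handle.net/",
    "ark": "https://n2t.net/",
}


def to_http_uri(scheme: str, identifier: str) -> str:
    """Express a persistent identifier as an http(s) URI."""
    scheme = scheme.lower()
    if scheme == "purl":
        # A PURL is already an http URI, so it is returned unchanged.
        return identifier
    if scheme not in RESOLVERS:
        raise ValueError(f"unsupported scheme: {scheme}")
    # Keep '/' and ':' readable and percent-encode anything unsafe in a URI path.
    return RESOLVERS[scheme] + quote(identifier, safe="/:")


def cite(authors: str, year: int, title: str, scheme: str, identifier: str) -> str:
    """Build a human-readable citation that ends in a machine-actionable URI."""
    return f"{authors} ({year}). {title}. {to_http_uri(scheme, identifier)}"


doi_uri = to_http_uri("doi", "10.1039/b502828k")
```

The same string then serves both audiences named on the slide: a person can read the citation, and a harvester can dereference the trailing URI.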
Discovering data: eBank Project • Domain identifier: International Chemical Identifier (INChI) code • Google molecule using INChI • Slide from Simon Coles Coles, S.J., Day, N.E., Murray-Rust, P., Rzepa, H.S., Zhang, Y., Org. Biomol. Chem., 2005, (10),1832-1834. DOI: 10.1039/b502828k Domain identifiers for engineering?
Format migration challenges? CAD Program Compatibility Chart http://www.okino.com/conv/filefrmt_cad.htm
Development: Representation Information Registry Repository • “DCC Approach to Digital Curation” based on OAIS • Representation Information Registry Repository • Prototype demonstrator: based on 2 key concepts to facilitate sharing of the curation effort • Curation Persistent Identifier (CPID) • Descriptive “label” (structural, semantic, other metadata) • Development of (M2M) tools and interfaces for creating, using and re-using representation information • http://dev.dcc.ac.uk Wiki and email list • EU CASPAR Integrated Project http://www.casparpreserves.info/pages/1/index.htm • Task Force on the Permanent Access to the Records of Science http://tfpa.kb.nl/
Registry API • Allows applications to talk to many different registry implementations, e.g. GDFR, PRONOM, UDDI • GUI access and access via Web browser http://registry.dcc.ac.uk
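The idea of one API in front of many registry implementations can be illustrated with a small adapter pattern. This is a sketch under stated assumptions and not the DCC registry interface: `FormatRegistry`, `InMemoryRegistry`, `RegistryClient` and the `example:` identifiers are all hypothetical, and a real adapter would wrap calls to a service such as GDFR or PRONOM instead of a dictionary.

```python
from typing import Optional, Protocol, Tuple


class FormatRegistry(Protocol):
    """The single interface an application codes against (hypothetical)."""

    name: str

    def lookup(self, format_id: str) -> Optional[dict]:
        ...


class InMemoryRegistry:
    """Stand-in for an adapter around one concrete registry implementation."""

    def __init__(self, name: str, records: dict):
        self.name = name
        self._records = dict(records)

    def lookup(self, format_id: str) -> Optional[dict]:
        return self._records.get(format_id)


class RegistryClient:
    """Queries each configured registry in order and returns the first hit."""

    def __init__(self, registries):
        self._registries = list(registries)

    def lookup(self, format_id: str) -> Optional[Tuple[str, dict]]:
        for registry in self._registries:
            record = registry.lookup(format_id)
            if record is not None:
                return registry.name, record
        return None


# Illustrative data only: these identifiers and labels are made up.
first = InMemoryRegistry("registry-a", {"example:format-a": {"label": "Format A"}})
second = InMemoryRegistry("registry-b", {"example:format-b": {"label": "Format B"}})
client = RegistryClient([first, second])
hit = client.lookup("example:format-b")
```

Because the application only sees `RegistryClient.lookup`, another registry can be added by writing one more adapter with the same `lookup` method, without changing calling code.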
Adding value through annotation • Research at the University of Edinburgh • Scientific databases: Annotation scoping report • New annotation model + prototype MONDRIAN • Intuitive visual interface iMONDRIAN • Annotate sets of values • Support for querying annotations
NaCTeM http://www.nactem.ac.uk/ Emerging tools: TerMine, GENIA, Cafetiere • Knowledge extraction: • Mining (data, text, structures) • Modelling (economic, climate, mathematical, biological…) • Analysis (statistical, lexical, gene….) Nature 23 March 2006 OTMI: Open Text Mining Interface
Supporting the community: Services • HELPDESK@dcc.ac.uk • legal - technical guidance • Curation Manual 45 chapters planned • Metadata (umbrella) • Open Source • Archival metadata • Preservation metadata • Selection & appraisal • Curating emails • Briefing Papers • Curating emails • Digital repositories • Geospatial data • Data protection • eScience data • Case studies
Supporting the community: Outreach & Services • Workshops: • Geospatial data, NeSC, 27 October • OAIS 5 year Review, October • Audit & Certification Forum, October • Records Management, L’pool 30 Nov • Curation & Preservation Training, Dec • 2007 Preservation of journals tbc • 2007 Legal environment tbc • 2007 Preparing for audit tbc • Information Days British Library L’pool UCL • 2nd International DCC Conference 21-22 November, Glasgow • Keynotes: Hans F. Hoffmann, CERN, Clifford Lynch, CNI
DCC Phase 2: 2007-2010 • Working more closely with data centres, e-Science Programmes and Research Councils • SCARP Project: disciplinary approach • JISC Digital Repository Programme collaboration • RepInfo Registry service migration • Define self-assessment procedures and tools • Collaborate with CASPAR, DPE and PLANETS (EU-funded Digital Preservation Projects) • Workshop Programme, International Conference 2007
Thank you. Questions? e.lyon@ukoln.ac.uk Join the DCC Associates Network at www.dcc.ac.uk