This research provides high-level guidance for developing Data Management and Preservation plans for large collaborative projects in fields like astronomy and particle physics. It covers challenges related to big data, software preservation, and more.
MaRDI-Gross: research data guidance for big science • Juan Bicarregui, Norman Gray, Roger Jones, Simon Lambert and Brian Matthews (STFC e-science, Glasgow, & Lancaster) • JISC MRD wrap-up, London, 2012 March 23
The goal: to provide high-level guidance for the strategic and engineering development of Data Management and Preservation plans for ‘Big Science’ data. Following: http://purl.org/nxg/projects/mrd-gw
big science • big money: ~20 year history, and millions of $/€/£ (LHC budget is €3bn + detectors, hardware and people) • big author lists: collaborations of 100s of people (LIGO is 800 authors, ATLAS 3000) • big data: petabytes per year (1 LHC = 10 PB/yr) • big admin: MOUs, councils, workshop series • big careers: PhD to tenure on a single project
lots of data • ATLAS/CMS at LHC: 10 PB/yr • LIGO: ~1 PB/yr • SKA (by 2020): 1 TB/min or 0.5 EB/yr intercontinentally (this is 0.05% of 1 ZB/yr total worldwide 2015 IP traffic) • Not a problem kilo ➛ mega ➛ giga ➛ tera ➛ peta ➛ exa ➛ zetta ➛ yotta
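The SKA figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal (SI) prefixes and a 365-day year; the variable names are illustrative and not taken from the report.

```python
# Decimal (SI) prefixes, following the kilo -> ... -> yotta ladder on the slide.
TB = 10**12
PB = 10**15
EB = 10**18
ZB = 10**21

# Assumption: a 365-day year.
MINUTES_PER_YEAR = 60 * 24 * 365

# SKA: 1 TB/min, converted to bytes per year.
ska_bytes_per_year = 1 * TB * MINUTES_PER_YEAR
ska_eb_per_year = ska_bytes_per_year / EB

# Share of the 1 ZB/yr worldwide IP traffic figure quoted on the slide.
share_of_ip_traffic = ska_bytes_per_year / (1 * ZB)

# For scale: ATLAS/CMS at the LHC, 10 PB/yr as quoted on the slide.
lhc_bytes_per_year = 10 * PB
ska_in_lhc_units = ska_bytes_per_year / lhc_bytes_per_year

print(f"SKA: {ska_eb_per_year:.2f} EB/yr")
print(f"Share of 1 ZB/yr: {share_of_ip_traffic:.2%}")
print(f"SKA / LHC: {ska_in_lhc_units:.0f}x")
```

Under these assumptions, 1 TB/min comes to a little over 0.5 EB/yr, which is about 0.05% of 1 ZB/yr, in line with the rounded numbers on the slide.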
software • Very large custom data-analysis software suites • ...which are hard to use • ...and require lots of tacit knowledge (ie gained from officemates, and maybe written into wikis) • A major software preservation challenge
data longevity • Astronomy data lasts for 1000 years • Particle physics data becomes unintelligible about 30 times faster than astronomy data
things that make it easy • Big science projects are often well-resourced, with IT experience, engineering management and clear collaboration infrastructure • Historical experience of ‘large’ data volumes means everyone knows ad hoc doesn’t work • Always shared facilities, so documented interfaces and SLAs are natural • Confidentiality concerns are well understood (professional priority rather than family secrets)
target reader • Those (senior and/or über-techie) with the responsibility (voluntary or not) for developing a DMP plan for a large collaboration • ...or other many-person, multi-institutional or multi-national project • ...or funders evaluating such plans
the advice “Here is a copy of CCSDS 650; be creative” but Do The Right Thing
backing that up... • OAIS rationale (what and why) • Policy background: RCUK and STFC data policies; why share data?; issues about openness • Technical background: OAIS as terminology, the DCC model, CASPAR • Planning: turning OAIS into practice; release planning; validation and assessment; modelling costs • Case studies
put another way • A framework for approaching the problem exists, in OAIS • ...which is not just waffle • ...so read X, Y and Z to become the local expert • ...so X’, Y’ and Z’ are the questions to ask, or critical approaches to take, if you’re a funder
the document http://purl.org/nxg/projects/mardi-gross/report • Comments on v0.1 by 13 April would be most appreciated • v0.2 and possibly a v0.3 in the spring, then a final version after collaboration meetings over the summer