Better Software, Better Research: Why reproducibility is important for your research | http://dx.doi.org/10.6084/m9.figshare.1126304 | EMCSR14, St. Andrews, 5 August 2014 | Neil Chue Hong (@npch), Software Sustainability Institute | ORCID: 0000-0002-8876-7606 | N.ChueHong@software.ac.uk | Original slides licensed under CC-BY as indicated
The Software Sustainability Institute: A national facility for cultivating world-class research through software • Better software enables better research • Software reaches boundaries in its development cycle that prevent improvement, growth and adoption • Providing the expertise and services needed to negotiate to the next stage • Developing the policy and tools to support the community developing and using research software • Supported by EPSRC Grant EP/H043160/1
Four Paradigms of Research Empirical Theoretical Computational Data Exploration
“Scientific publications have at least two goals: (i) to announce a result and (ii) to convince readers that the result is correct. Papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension.” Jill Mesirov, Accessible Reproducible Research, DOI: 10.1126/science.1179653
Raise standards for preclinical cancer research. Begley, Ellis. Nature, 483, 2012 doi:10.1038/483531a • 47 out of 53 “landmark” publications could not be replicated
SIGMOD Reproducibility • SIGMOD conference offered to attempt to repeat/reproduce papers accepted at conference • 2008-2012 • “High burden on reviewers when setting up experiments” • Use of VMs advocated • Bonnet et al, SIGMOD Record, June 2011 (Vol. 40, No. 2) doi: 10.1145/2034863.2034873
Water Swap Reaction Coordinate • Long Time Scale GPU Dynamics Reveal the Mechanism of Drug Resistance of the Dual Mutant I223R/H275Y Neuraminidase from H1N1-2009 Influenza Virus. Biochemistry (2012), vol. 51, pp. 4364-4375 http://dx.doi.org/10.1021/bi300561n • A water-swap reaction coordinate for the calculation of absolute protein-ligand binding free energies. Woods CJ, Malaisree M, Hannongbua S, Mulholland AJ. J. Chem. Phys. (2011), vol. 134, pp. 054114 http://dx.doi.org/10.1063/1.3519057
Computational science is hard to make truly re-***-ble (perhaps impossible?)
[Figure: ways of re-testing a result] • Repeat: same experiment, same lab • Replicate: same experiment, different lab • Reproduce: same experiment, different set up • Reuse: different experiment, some of same • Figure by Carole Goble, adapted from Drummond C, Replicability is not Reproducibility: Nor is it Good Science, online, and Peng RD, Reproducible Research in Computational Science, Science 2 Dec 2011: 1226-1227.
[Figure: research cycle with stages labelled Design, Collection, Execution, Result Analysis, Prediction, Publish, Peer Review, Peer Reuse] • Can I repeat & defend my method? • Can I review / replicate and certify your method? • Can I review / reproduce and compare my results / method with your results / method? • Can I transfer your results into my research and reuse this method? • Figure by Carole Goble, adapted from: Mesirov, J. Accessible Reproducible Research, Science DOI: 10.1126/science.1179653
Group Exercise 1 • Pick a respectable journal from your field (or use arXiv) • Choose a research article from the journal: • What makes you think that you could repeat it? • What makes you think that you could extend it? • What do you think makes a research article more or less reproducible?
Software Infrastructure and Environments for Reproducible and Extensible Research • Open licensing should be used for data and code • Workflow tracking should be carried out during the research process • Data must be available and accessible • Code and methods must be available and accessible • All 3rd party data and software should be cited Stodden V and Miguez S, (2014) Best Practices for Computational Science: Software Infrastructure and Environments for Reproducible and Extensible Research, DOI: 10.5334/jors.ay
10 Simple Rules for Reproducible Computational Research • For Every Result, Keep Track of How It Was Produced • Avoid Manual Data Manipulation Steps • Archive the Exact Versions of All External Programs Used • Version Control All Custom Scripts • Record All Intermediate Results, When Possible in Standardized Formats • For Analyses That Include Randomness, Note Underlying Random Seeds • Always Store Raw Data behind Plots • Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected • Connect Textual Statements to Underlying Results • Provide Public Access to Scripts, Runs, and Results Sandve GK, Nekrutenko A, Taylor J, Hovig E (2013) Ten Simple Rules for Reproducible Computational Research. DOI:10.1371/journal.pcbi.1003285
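Several of these rules (keeping track of how a result was produced, noting random seeds, recording the versions of what was used) can be combined in a small provenance record saved next to each result. The following is a minimal sketch in Python, not a method taken from Sandve et al.: the function names, the record fields and the toy "analysis" are all hypothetical and chosen only for illustration.

```python
import hashlib
import json
import platform
import random
from datetime import datetime, timezone


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_analysis(seed, n_samples=1000):
    """Hypothetical stand-in for a real analysis: mean of seeded Gaussian draws."""
    rng = random.Random(seed)  # local generator, so the seed alone fixes the draws
    samples = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return sum(samples) / n_samples


def provenance_record(seed, result, input_paths=()):
    """Bundle a result with the details needed to trace how it was produced."""
    return {
        "result": result,
        "seed": seed,  # rule: note underlying random seeds
        "input_sha256": {p: sha256_of(p) for p in input_paths},  # which data went in
        "python_version": platform.python_version(),  # rule: record versions used
        "platform": platform.platform(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    seed = 20140805  # any fixed integer works; what matters is that it is recorded
    record = provenance_record(seed, run_analysis(seed))
    print(json.dumps(record, indent=2))
```

Keeping the JSON record under version control alongside the script means a later reader can rerun the analysis with the same seed and check that the same number comes out.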
Group Exercise 2 • Now that you’ve seen what others think is important, go back to your chosen paper and decide: • Whether you can identify the software and data used • Whether you can download the software and data used • Whether you could describe to someone else the steps needed to install, configure and run the software to do something similar to the experiment presented in the paper • What are the challenges you face?
The reproducibility spectrum Peng RD (2011) Reproducible Research in Computational Science. DOI:10.1126/science.1213847
The Ladder of Academic Software Reusability Brown CT (2013) http://ivory.idyll.org/blog/ladder-of-academic-software-notsuck.html
5 Stars of Research Software • Community: There is a community infrastructure • Open: Software has permissive license • Defined: Accurate metadata for the software • Extensible: Usable, modifiable for my purpose • Runnable: I can access and run software • c.f. 5 Stars of Linked Data (Berners-Lee), 5 Stars of Online Journals (Shotton) • http://www.software.ac.uk/blog/2013-04-09-five-stars-research-software • “Golden Star” originally by Ssolbergj, CC-BY
Group Exercise 3 • Now that you’ve looked at someone else’s paper let’s think about your own work • For the piece of research you’re currently doing • What objections might your supervisor/boss make if you said you wanted to make your research reproducible? • What might your supervisor/boss misunderstand about reproducible research? • What would be the biggest barriers to making your research reproducible? • What would be your main motivation for making your research reproducible?
Victoria Stodden, AMP 2011 http://www.stodden.net/AMP2011/ • Special Issue Reproducible Research, Computing in Science and Engineering, July/August 2012, 14(4) • Howison and Herbsleb (2013) "Incentives and Integration In Scientific Software Production", CSCW 2013.
Five selfish reasons to make your research reproducible • It will make it easier to build up your own research group • You won’t be panicking so much about writing about your results near to that deadline • You are less likely to let mistakes get through to the published article • You’ll get more collaborators • It will make you more productive
Reproducibility isn’t about making other people’s lives easier, it’s about making you a more productive researcher… be selfish!
What you can do now • Read the Best Practices for Scientific Computing • http://dx.doi.org/10.1371/journal.pbio.1001745 • Make the code and data you use available through a repository, under version control • http://software.ac.uk/resources/guides/choosing-repository-your-software-project • http://www.software.ac.uk/blog/2013-09-30-top-tips-version-control • Publish your software in a journal • http://bit.ly/softwarejournals • Ask for software and data if you’re reviewing a paper • Forge a career in research, and change it for those coming behind you • The DOI for this presentation: 10.6084/m9.figshare.1126304 • Acknowledgements: the SSI team, particularly Steve Crouch, Dave De Roure, Carole Goble, Mike Jackson; C Titus Brown; Dan Katz; Jennifer Schopf; Victoria Stodden; Arfon Smith; Greg Wilson; Robin Wilson. • The Software Sustainability Institute is a collaboration between the universities of Edinburgh, Manchester, Oxford and Southampton. Supported by EPSRC Grant EP/H043160/1.
Reproducibility isn’t just a set of things to do… it’s about instilling knowledge http://bit.ly/datasharingpanda
Purposes • Achieve legal compliance • Create heritage value • Enable continued access to data • Encourage software reuse • Manage systems and services • http://www.software.ac.uk/attach/SoftwarePreservationBenefitsFramework.pdf
Approaches • Preservation (techno-centric) • Emulation (data-centric) • Migration (functionality-centric) • Transition (process-centric) • Hibernation (knowledge-centric)
Software Sustainability: preservation vs sustainability • Sustainability? • Preservation? • Image courtesy of London Permaculture under CC-by-nc-sa license • Image courtesy of RGB Kew – not for reuse
Software Preservation vs Software Sustainability • There are several approaches we have identified that could be classed as “sustainability” • The choice depends on a number of factors, which change through time • Preservation, Emulation, Migration, Cultivation, Hibernation, Deprecation, Procrastination • http://www.software.ac.uk/resources/approaches-software-sustainability