
Reproducible Research






Presentation Transcript


  1. Reproducible Research Jonathon LeFaive, University of Michigan Big Data Summer Institute June 18, 2019

  2. Replicating vs. Reproducing 2019 Big Data Summer Institute

  3. Replication • “Replication is the ultimate standard by which scientific claims are judged”—Roger Peng • Come to same conclusions with different data/protocols • Replicable = results are consistent and reliable

  4. Reproduction • “An attainable minimum standard for assessing the value of scientific claims, particularly when full independent replication of a study is not feasible”—Roger Peng • Obtain same results with same data and same protocols • Reproducible = methods are consistent and reliable

  5. Reproducible != True

  6. Motivation for reproducibility

  7. The reproducibility crisis

  8. Why make my work reproducible? • Transparency • Save time • Keep track of complex projects • Higher impact • More efficient collaboration http://science.sciencemag.org/content/334/6060/1226.full

  9. The Reproducibility Toolkit • Well-organized projects • Documentation • Version control • Capturing the computational environment • Automation

  10. Organizing your projects

  11. Exercise: Evaluate the structure of past projects • On your own: • Find a project you have worked on in the past • Sketch out the directory structure • (If applicable) sketch out the flow of information • Write down any naming conventions you used for files/folders • With a partner: • A collaborator wants to reproduce your project—what instructions would you give them? • What would you change about your organizational structure to make this easier? http://pgbovine.net/research-directory-structure.htm

  12. Challenges of organization • Developing an intuitive directory structure • Coming up with good names for things • How long/when/where to keep intermediate data files? • Dealing with clutter of old stuff/temp files • Keeping track of dependencies • Managing backups/previous versions • Collaborating effectively with shared project content • Different local versions, determining structure of shared directory, content spread across computers, cloud apps, etc.

  13. Tips for organizing your projects • Develop your own system and be consistent! • Think about where to put files you haven’t yet created • Separate raw from processed data • Separate code from data • Use file shortcuts to avoid unnecessary duplicates • File/folder names and paths should be self-explanatory • Code as verbs, data as nouns
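The tips above can be sketched as a small scaffolding script. This is a minimal sketch, not part of the slides; the directory names are hypothetical examples of one layout that separates raw from processed data and code from data:

```python
from pathlib import Path

# Hypothetical layout following the tips above: raw vs. processed data
# kept apart, code ("verbs") and outputs in their own directories.
LAYOUT = [
    "data/raw",        # original inputs, never modified by hand
    "data/processed",  # derived files, safe to regenerate
    "scripts",         # code as verbs: normalize_input.py, run_association.py
    "results",         # tables and figures produced by the scripts
]

def scaffold(root: str) -> Path:
    """Create the project skeleton under `root` and return its path."""
    base = Path(root)
    for sub in LAYOUT:
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base
```

Running `scaffold("my-project")` once at the start of a project makes the "where do new files go?" question answerable before the files exist.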

  14. Documentation

  15. Documentation techniques • Including README files in project directories • Source code comments • Self-documenting code • Literate programming

  16. README

  project-name/
  +- data/
  |  +- README.md
  |  +- genotypes.data
  |  +- phenotypes.data
  +- scripts/
  |  +- README.md
  |  +- normalize-input.sh
  |  +- run-association.sh
  +- README.md

  17. Documenting code

  Cryptic:
  n = (icnt + gs - 1) / gs;

  Self-documenting:
  uint ceil_divide(uint x, uint y) { return (x + y - 1) / y; }
  group_count = ceil_divide(item_count, max_group_size);

  Commented:
  /* Calculates number of groups by ceil-dividing item count by max group size. */
  group_count = (item_count + max_group_size - 1) / max_group_size;

  18. Doxygen-style documentation in R (roxygen2)

  #' Adds together two numbers.
  #'
  #' @param x Left hand number.
  #' @param y Right hand number.
  #' @return The sum of x and y.
  #' @examples
  #' add(1, 1)
  #' add(10, 1)
  add <- function(x, y) {
    x + y
  }
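For comparison, here is the same documented interface in Python's docstring convention — a sketch added for readers working outside R, not taken from the slides:

```python
def add(x, y):
    """Add together two numbers.

    Args:
        x: Left hand number.
        y: Right hand number.

    Returns:
        The sum of x and y.

    Examples:
        >>> add(1, 1)
        2
        >>> add(10, 1)
        11
    """
    return x + y
```

Docstrings play the same role as roxygen comments: tools such as `help()` and documentation generators read them directly from the source.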

  19. “Literate programming is a methodology that combines a programming language with a documentation language, thereby making programs more robust, more portable, more easily maintained, and arguably more fun to write than programs that are written only in a high-level language.”—Donald Knuth

  20. Literate programming • Literate programming allows you to encapsulate & share every aspect of your analysis in an interactive and descriptive way: • Data provenance • Dependencies • Code • Documentation • Tables/Figures/References etc. http://www.datacarpentry.org/rr-literate-programming/02-literate-programming/

  21. R Markdown • Literate programming format implemented in RStudio • Integrates standard R code with Markdown text formatting (+other stuff!) • knitr: package that executes the code chunks and renders your R Markdown document as a web page, PDF, Word document, etc.

  22. Structure of R Markdown documents • YAML-formatted header • Markdown-formatted text • Code chunks [figure from https://blogs.uoregon.edu/rclub/2016/04/26/r-markdown-resources/]

  23. A rendered R Markdown document (HTML) • Information from the YAML header • Text • Evaluated R code [figure from https://blogs.uoregon.edu/rclub/2016/04/26/r-markdown-resources/]

  24. R Notebooks • Like R Markdown documents, but you can evaluate code chunks independently, without needing to render the entire document

  25. Exercise: Create a new R Markdown document or an R Notebook, and try the commands listed in the tutorial

  26. Jupyter notebooks • Similar to R Notebooks: a literate & interactive programming environment combining Markdown documentation with code chunks • Popular with Python users • Supports many different kernels (e.g., you can run R code in Jupyter notebooks)

  27. Version Control

  28. Version control gone wrong

  29. Version control system (VCS) • System for managing changes to files • No more duplicate files! • Full history of revisions is accessible • Revisions can be compared, restored, and merged • Multiple team members can edit files; VCS handles merging

  30. Git: a flexible VCS • Git is a distributed version control system • i.e., does not require a central server • Flexible branching design • Faster than non-distributed systems

  31. GitHub: a cloud-based VCS hosting platform • Central repository for storing Git-enabled projects • Issue tracking, wiki, package hosting, … • Extremely popular and community-oriented! • >36 million users, >100 million code repositories

  32. Setting up GitHub • Create a GitHub account: https://github.com/join • Windows users: install GitHub Desktop https://desktop.github.com/ • This will install Git Shell on your PC—check that you have this in your programs • Mac/Linux users: type git in your terminal (Windows users use Git Shell)—you should see usage info; if not, you should be prompted to install • Detailed instructions: https://help.github.com/articles/set-up-git/

  33. Git terminology • A repository is a directory containing your project files and the Git metadata • A branch is a development arm of a project • A commit is a checkpoint for changes you have made • Your commits are added to a remote repository by pushing • You can grab someone else’s revisions by pulling

  34. History tracking • Git tracks every committed change—you can access and revert to any commit in the repository’s history • No need to manually save old versions!

  35. Basic Git commands

  git init
  git add <source_file>
  git status
  git diff <file|directory>
  git commit
  git log

  git remote add origin <github_url>
  git push -u origin master

  git push
  git pull

  36. Git demo

  37. Controlling the Computational Environment

  38. Capturing the computational environment • Computational projects for big data can be extremely complex, and make use of multiple languages, libraries, and dependencies • Reproducibility encompasses the entire computational environment, not just your data & code • Debugging dependency issues can be time-consuming & impede reproducibility
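A first step toward capturing the environment can be as simple as recording it alongside your results. This is a minimal standard-library sketch (the function names are illustrative, not from the slides); a real project would also record library versions, e.g. via `pip freeze` or `conda env export`:

```python
import json
import platform
import sys

def capture_environment() -> dict:
    """Snapshot basic facts about the interpreter and machine.

    Only covers what the standard library exposes; library versions
    and system packages need tool-specific exports on top of this.
    """
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }

def save_environment(path: str) -> None:
    """Write the snapshot next to the analysis outputs."""
    with open(path, "w") as fh:
        json.dump(capture_environment(), fh, indent=2)
```

Committing the resulting file with each set of results makes "which machine/interpreter produced this?" answerable later.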

  39. Package managers • Language specific • Pip, cget, Cargo, npm, etc. • Built-in (R, Julia, etc.) • System specific • apt, rpm, MacPorts, etc. • Conda

  Example Conda environment file:
  name: my-project
  channels:
    - conda-forge
    - bioconda
  dependencies:
    - cyvcf2=0.8
    - pyfaidx=0.5
    - joblib=0.11

  40. Conda environments • Conda is a cross-platform package manager, very popular with Python users • All dependencies (OS applications, Python libraries, R packages, etc.) can be specified in a YAML file • Provides an easy way for other users to reproduce your analyses with exactly the same dependencies

  conda env create --name "my-env" -f env.yml
  conda activate my-env
  conda deactivate

  41. Containers (Singularity/Docker) • Frameworks for isolating computational environments & applications in standalone virtual “containers” • Containers include a barebones operating system and all dependencies specified by the creator • Containers can run on nearly any system, ensuring your code is widely reproducible

  42. Conda & Singularity demo

  43. What about hardware? • Conda environments & Docker containers only control the OS & software environment/dependencies • Hardware requirements are harder to control—e.g., not everyone can reproduce analyses that require 100s of CPUs • Best practices: • Document specific hardware environment you used (e.g., processor model, available RAM, required disk space) • Keep track of program performance (e.g., RAM usage, runtime, CPUs required) • Always provide your code, even if it’s not runnable on every system
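Tracking program performance can itself be lightly automated. A minimal sketch (the `profile` helper is hypothetical, not a named tool) that records runtime and peak Python-level memory for a call; note that `tracemalloc` only sees Python allocations, so the memory figure is a lower bound compared to OS-level monitoring:

```python
import time
import tracemalloc

def profile(func, *args, **kwargs):
    """Run `func` and report wall time and peak Python memory use."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) bytes
    tracemalloc.stop()
    return result, {"seconds": elapsed, "peak_bytes": peak}
```

Logging these numbers for each pipeline stage tells future users what hardware they realistically need.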

  44. Automation

  45. Why automate your workflow? • Complex projects are often spread across several stages of data (pre)processing and analyses • Not only must each stage be reproducible, but also the flow of information between stages • Automation tools ensure complex workflows are reproduced in the proper order • When data or code changes, automation ensures all steps are followed in proper order and all downstream data is updated properly
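The core idea behind make-style automation — rerun a step only when its output is missing or older than its inputs — fits in a few lines. A sketch with a hypothetical helper, not any particular tool's API:

```python
import os

def needs_rebuild(inputs, output):
    """True if `output` is missing or older than any input (make-style).

    Build tools compare file modification times exactly like this to
    decide which downstream steps must rerun after a change.
    """
    if not os.path.exists(output):
        return True
    out_time = os.path.getmtime(output)
    return any(os.path.getmtime(p) > out_time for p in inputs)
```

Workflow engines layer rule definitions, wildcards, and parallel scheduling on top of this dependency check.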

  46. Tools for automation • GNU make (build system) • Workflow engines • Snakemake, Nextflow, CWL, etc. • A series of “recipes” for controlling the flow of your analyses and managing data dependencies

  47. Snakemake Example

  rule step_one:
      output: "tmp/{letter}.txt"
      shell: "echo 'Hello' > {output}"

  rule step_two:
      input: "tmp/a.txt", "tmp/b.txt", "tmp/c.txt"
      output: "out/merged.txt"
      shell: "cat {input} > out/merged.txt"
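For readers without Snakemake installed, a plain-Python rendering of the same two rules (same file paths as the slide) shows the data flow the workflow engine manages:

```python
from pathlib import Path

def step_one(letter: str) -> Path:
    """rule step_one: write 'Hello' to tmp/<letter>.txt."""
    out = Path("tmp") / f"{letter}.txt"
    out.parent.mkdir(exist_ok=True)
    out.write_text("Hello\n")
    return out

def step_two(inputs) -> Path:
    """rule step_two: concatenate the inputs into out/merged.txt."""
    merged = Path("out") / "merged.txt"
    merged.parent.mkdir(exist_ok=True)
    merged.write_text("".join(p.read_text() for p in inputs))
    return merged
```

Here the caller must invoke `step_two([step_one(c) for c in "abc"])` in the right order by hand; Snakemake instead infers that ordering (and skips up-to-date steps) from the declared input/output files.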

  48. Snakemake demo

  49. Summary • Starting with a mindset of reproducibility will pay off in the future! • Pay attention to project organization • Convey the ideas behind your code with documentation • Take advantage of version control systems • Capture environment and software dependencies • Automate whenever possible

  50. Things you can do today • Think through how to effectively organize your projects and create a folder structure • Install RStudio, Conda & Jupyter and try to reproduce some examples • Create a GitHub account and add your notes, code, etc. to a repository • Install Singularity or Docker and try to containerize your software • Install Snakemake (or another engine) and explore how you can port existing pipelines into a portable workflow
