Managing the NextGen data pipeline

Managing the NextGen data pipeline Jim Denvir, Ph.D.

NextGen data challenges • NextGen Sequencing produces very large data sets • Order of Terabytes (1012 bytes) per run • Data analysis requires considerable computing power and specialist management • Main challenge is in distilling useful information from raw data WV-INBRE West Virginia IDeA Network of Biomedical Excellence

Core Facility support • Bioinformatics and Genomics core facilities provide support for investigators needing to have NextGen Sequencing data analyzed • Perform analysis from early part of pipeline • Perform downstream analysis, or provide support and software for individual investigators • Depending on needs and expertise of investigator WV-INBRE West Virginia IDeA Network of Biomedical Excellence

NextGen Analysis Pipeline Real Time Analysis performed by RTA software on sequencer Image Analysis Automated Base Calling CASAVA (Illumina) or open source (Tuxedo Suite, R/Bioconductor) * May require custom scripts Demultiplexing* Core Facility Alignment SNP calling or RNA Read Counting Statistical Analysis Partek or R/Bioconductor Investigator Functional Analysis IPA WV-INBRE West Virginia IDeA Network of Biomedical Excellence

Commercial Tools • Examples: RTA, CASAVA, Partek, IPA • Pros: • Short learning curve • Potentially can be used by individual investigators • Usually come with technical support and training • Cons: • Expensive • Closed, proprietary source code WV-INBRE West Virginia IDeA Network of Biomedical Excellence

Open Source • Examples: R/Bioconductor, Tuxedo suite • Pros: • Free • Open source • Enables rapid, community-led improvement • Potentially more academically reviewable • Cons: • Steeper learning curve • Typically prohibitive for individual investigators • Sparse technical support WV-INBRE West Virginia IDeA Network of Biomedical Excellence

Tools developed on site • Pros: • Can fill in missing functionality from available tools • Customized exactly to our needs • Potential for a revenue source • Cons: • Development is very time consuming WV-INBRE West Virginia IDeA Network of Biomedical Excellence

Roadmap • Experience from microarray data analysis suggests: • Start with commercial tools • Rapid start-up enables us to focus on learning scientific basis for the analyses • Transition to open-source tools for some parts of pipeline • Probably mid 2012-mid 2014 • Provides for financial saving further down the road • Sometimes better received by journal reviewers • Initial steps of analysis pipeline and functional analysis will still be managed by commercial software • Develop custom solutions only when needed WV-INBRE West Virginia IDeA Network of Biomedical Excellence

Storing Data • Archiving data from NextGen experiments requires a large amount of disc space • Once analysis is complete, some raw image data will be deleted • Storage of data is more expensive than re-running an experiment! • Will consider exceptions for experiments which cannot be repeated WV-INBRE West Virginia IDeA Network of Biomedical Excellence

NextGen analysis server • Genomics Core has a Linux server for managing analysis and storing data • Housed in Drinko library and managed by central campus IT staff • Has 42 Terabytes of usable disc space • Uses redundant system to allow for potential of drive failures without losing data • Additionally, IT will back up data off site WV-INBRE West Virginia IDeA Network of Biomedical Excellence

Things to remember • Core facilities are there to help! • At experimental design stage, be sure you understand what analysis the core facility will perform • Would you prefer to have IPA done by the core, or would you prefer control over that stage • If so, do you need training and/or support? WV-INBRE West Virginia IDeA Network of Biomedical Excellence

Questions Presentation available at http://users.marshall.edu/~denvir/presentations.html ? WV-INBRE West Virginia IDeA Network of Biomedical Excellence

Managing the NextGen data pipeline

Managing the NextGen data pipeline

Presentation Transcript

Managing Data

NextGen

NextGen

Managing Data

Managing the Global Pipeline

Data Pipeline: Finance

Managing Data Flow Through the Barcoding Pipeline

Data Pipeline

Data-pipeline using ALSPAC data

Data Pipeline Project

GCOD Data Pipeline

Managing data

Data Pipeline to Data Use

Managing the Data

NextGen

KFPA Data Pipeline

Managing Data

Managing Data

NextGen

Data Pipeline to Data Use

NextGen

Data Pipeline Project