Bioinformatics in the CDC Biotechnology Core Facility Branch

Computational Lab: Scott Sammons, Kevin Tang, Chandni Desai • Sequencing Lab: Mike Frace, Missy Olsen-Rasmussen, Marina Khristova, Lori Rowe

Genome Sequencing Lab sequencing platforms – current and upcoming • AB 3730XL • Roche 454 Titanium + • Illumina GA IIx • Ion Torrent Personal Genome Machine • Pacific Biosciences SMRT sequencer

Building 23 Server Room – Main ISLE

High Performance Computing Cluster (Aspen) • What is it? • 35 compute nodes, each with 12 processor cores, 48 GB of memory, and 2 Tesla 2050 GPU cards • Currently in the final stages of development in preparation for code-freeze and C&A • What can it do today? • 25 cluster applications are currently enabled for our phase-one deployment, including MATLAB, Geneious, BEAST, BLAST, and PacBio • Collaboration with NCI via IAA will GPU-enable scientific applications even further • How fast is it? • For example, a BLAST job that takes over 60 hours to complete on our old cluster takes 2 hours on the new cluster* • *NOT GPU-OPTIMIZED CODE
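The Aspen figures above imply some aggregate numbers that are easy to work out. The short sketch below uses only the values quoted on the slide; the variable names are illustrative, and the speedup is a lower bound because the old job took "over 60 hours".

```python
# Figures quoted on the slide; everything else is derived arithmetic.
NODES = 35
CORES_PER_NODE = 12
MEM_GB_PER_NODE = 48
GPUS_PER_NODE = 2

total_cores = NODES * CORES_PER_NODE      # aggregate CPU cores
total_mem_gb = NODES * MEM_GB_PER_NODE    # aggregate memory in GB
total_gpus = NODES * GPUS_PER_NODE        # aggregate GPU cards

old_hours = 60   # "over 60 hours", so treat this as a lower bound
new_hours = 2
speedup = old_hours / new_hours           # at least this many times faster

print(total_cores, total_mem_gb, total_gpus, speedup)
```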

Isilon • What is it? • High speed, scalable, and redundant Network Attached Storage • Currently in the process of being integrated with applications • Connected to both the CDC network and the Aspen HPC cluster utilizing InfiniBand • What can it do today? • It provides user workspace for end-users and HPC applications • Solves the problem of running out of disk space on individual servers • What are we doing with it? • Data warehouse for all scientific equipment • Central network share for all scientific users • Integrating directly with ITSO’s Active Directory forest

Private Cloud • What is it? • Support science through front-end and back-end services • Implementation of virtualized infrastructure. • Currently in the process of being deployed. • What can it do today? • Provide test environments for scientific projects • Lay the foundation for hardware consolidation and migration • What are we doing with it? • Standardize platforms • Centralize management • Support ongoing growth within the scientific computing community while enabling science

Scientific Computing Infrastructure: The Server Room • 2 Linux High Performance Computing Clusters (~40 nodes each) • 1 Genomics Cluster • 4 Solaris Servers • 12 Stand-Alone Linux Servers • 1 Stand-Alone Database Server • 5 Stand-Alone Windows Servers • Virtualized Cluster with 15 VMs • 3 NAS Devices • 2 Tape Libraries • 2 Dedicated IP Subnets • One C&A addressing all legacy production hardware (NCEZID) with several in-process for systems currently under development (NCIRD)

GSL sequencing 2011 • Groups: INFLUENZA, NCIRD, NCEZID, CGH • Guinea worm • Taenia solium • Angiostrongylus • Haemophilus influenzae • Legionella pneumophila • Legionella spp. • Mycoplasma pneumoniae • Water cooling tower metagenomics • Respiratory filter metagenomics • Bat metagenomics • Vibrio cholerae • Vibrio spp. • Cyclospora • Bacillus anthracis • Listeria • Yersinia pestis • Brucella spp. • Klebsiella pneumoniae • Junin virus • Rift Valley Fever virus • Lujo virus • Marburg virus • CCHF virus • Lassa Fever virus • Clinical sample metagenomics • Tick metagenomics • Soil metagenomics

Sequencing: extended PCR • Figure: position of E-PCR overlapping amplicons (End-L, A1 to A18, End-R) on the HindIII map (fragment order D P O C E R K H M L I F N A S J B G Q) • Primers designed using VAR-BSH and VAC-CPN sequences • Primers target genes involved in reproduction & host response • Sequence sample: primers 40 sites, 1 enz. RFLP ~120 sites • PCR uses minimal DNA amounts, often no need to grow virus • PCR uses hifi expand long-template Taq & Pwo enzymes (Roche)

First Pass Assembly: Seqmerge • Figure: fold redundancy plot (axis ticks 4, 8, 12, 16)

Sequencing Assembly: Phred/Phrap/Consed

Gene Prediction • Heuristic algorithm to assign quality scores to ORFs (from 1 to 100) • Quality scores are based on a number of factors including • Gene predictions (Glimmer, GeneMark, getorf) • Primary sequence homology to known genes (BLAST) • Presence of predicted promoter (MEME/MAST) • Size of predicted ORF • Presence of transcription termination signals
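The slide lists the factors behind the ORF quality score but not how they are combined. The sketch below is one possible way to fold them into a 1 to 100 score; the weights, the `OrfEvidence` fields, and the 300 nt size cap are all hypothetical and are not taken from the facility's actual algorithm.

```python
from dataclasses import dataclass

# Hypothetical weights (sum to 100); the slide does not give the real ones.
WEIGHTS = {
    "predictors": 30,   # agreement among Glimmer, GeneMark, getorf
    "homology": 30,     # BLAST hit to a known gene
    "promoter": 15,     # predicted promoter (MEME/MAST)
    "size": 15,         # length of the predicted ORF
    "terminator": 10,   # transcription termination signal
}


@dataclass
class OrfEvidence:
    predictors_agreeing: int   # how many of the 3 gene predictors call this ORF
    has_blast_hit: bool
    has_promoter: bool
    length_nt: int
    has_terminator: bool


def orf_quality(e: OrfEvidence, full_size_nt: int = 300) -> int:
    """Combine the evidence into a score clamped to the range 1..100."""
    score = 0.0
    score += WEIGHTS["predictors"] * min(e.predictors_agreeing, 3) / 3
    score += WEIGHTS["homology"] * e.has_blast_hit
    score += WEIGHTS["promoter"] * e.has_promoter
    # ORFs at or above full_size_nt (an assumed cap) earn the full size credit.
    score += WEIGHTS["size"] * min(e.length_nt, full_size_nt) / full_size_nt
    score += WEIGHTS["terminator"] * e.has_terminator
    return max(1, min(100, round(score)))


strong = OrfEvidence(3, True, True, 900, True)
weak = OrfEvidence(0, False, False, 0, False)
print(orf_quality(strong), orf_quality(weak))
```

A weighted sum keeps each factor's contribution easy to inspect, which may matter when a curator has to decide which predicted ORFs to accept as genes.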

Visualizing Gene Predictions and Differences

ORFs of CPXVs from 4 different clades • Figure labels: ITR, ITR, crm-D

45 Smallpox Strains • A. West African int. CFR ~10% • B. American alastrim minor CFR <1% • C. Asian major CFR ~5 - 35% • C-1. non-West-African African int. CFR ~10% • C-2. non-West-African African minor CFR <1%

Unrooted tree: phylogenetic relationships of the ORF encoding the hemagglutinin protein • Labelled groups: Taterapox, Camelpox, Variola, Ectromelia, Cowpox clade I, Cowpox clade II, Cowpox clade III (CPXV91_ger3), Cowpox clade IV (CPXV90_ger2), Vaccinia, Monkeypox • Tip labels are GenBank accession numbers and strain names

Next-Gen Diagnostic Sequencing Applications • Shotgun / Paired-End Sequencing: random shearing of DNA gives even sequence coverage over the entire genome. • Ultra-deep (amplicon) sequencing: ‘massively parallel’ sequencing not only produces throughput, it provides the sequences of potentially millions of individual molecules (instant cloning). Sequencing a PCR reaction allows a detailed search for low-frequency quasi-species or mutations which may signal growing drug or vaccine resistance. Example: clinical case of poxvirus infection with samples exhibiting a reduced sensitivity to an antiviral drug. • Metagenomic sequencing: complex clinical, laboratory or environmental samples can be sequenced to provide a diagnostic ‘snapshot’ of the resident organisms. Examples: tissue culture, soil

Shotgun / Paired-End Sequencing • De novo Assembly: Newbler, CLCBio, Mira, Geneious, Velvet, Celera • Reference Mapping: Newbler, CLCBio, Mosaik, Mira, Geneious, BWA

Genome Assembly Visualization

Amplicon (deep) sequencing project • Li, Damon (NCEZID/DVRD/PRB) • Clinical case of progressive vaccinia infection from smallpox vaccination of an immunocompromised patient • Pox antiviral ST-246 administered, which targets pox gene F13L, a major envelope protein that mediates production of extracellular virus • Oral ST-246 given daily and vaccination site sampled over a 3-week period

A region of gene F13L was amplified from clinical samples, deep sequenced, and compared to the smallpox vaccine reference sequence (Acambis 2000) • Figure panel: control swab prior to ST-246

2 weeks after ST-246 • T > A at 943 • C > T at 869

3 weeks after ST-246 • C > T at 869 • T > A at 943
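Deep sequencing of an amplicon comes down to counting, at each reference position, how often reads disagree with the reference. The following is a minimal sketch of that counting step, assuming the reads are already aligned, all the same length as the reference, and padded with `-` where they give no coverage. The function name, the 1% threshold, and the toy sequences are illustrative and are not the F13L data.

```python
from collections import Counter


def variant_frequencies(reference, reads, min_freq=0.01):
    """Return {(position, ref_base, alt_base): frequency} for non-reference bases.

    Positions are 1-based. Reads must be pre-aligned strings of the same
    length as the reference, with '-' marking uncovered positions.
    """
    results = {}
    for i, ref_base in enumerate(reference):
        column = Counter(r[i] for r in reads if r[i] != "-")
        depth = sum(column.values())
        if depth == 0:
            continue  # no read covers this position
        for base, count in column.items():
            freq = count / depth
            if base != ref_base and freq >= min_freq:
                results[(i + 1, ref_base, base)] = freq
    return results


# Toy data: one of five reads carries a G > A change at position 3.
reference = "ACGTAC"
reads = ["ACGTAC", "ACGTAC", "ACATAC", "ACGTAC", "--GTAC"]
variants = variant_frequencies(reference, reads)
print(variants)
```

Running the same count on samples taken at successive time points would show whether a substitution is rising in frequency under drug pressure; real data would also need base-quality filtering, which this sketch omits.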

What is Metagenomics? • The genomic study of DNA from uncultured microorganisms, generally from environmental samples • Related fields • Metatranscriptomics • Metaproteomics

Sample Coverage: Rarefaction Curves • Figure source: Wooley JC, Godzik A, Friedberg I (2010) A Primer on Metagenomics. PLoS Comput Biol 6(2)
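A rarefaction curve plots the number of distinct taxa expected in a random subsample of n reads against n; a curve that flattens suggests the sample has been sequenced deeply enough. The sketch below uses the standard analytical expectation for sampling without replacement rather than repeated random subsampling. The function name and the toy counts are illustrative.

```python
from math import comb


def expected_taxa(counts, n):
    """Expected number of taxa seen when n reads are drawn without replacement.

    counts holds the number of reads assigned to each taxon.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total read count")
    denom = comb(total, n)
    # For each taxon: 1 minus the probability that none of its reads is drawn.
    return sum(1 - comb(total - c, n) / denom for c in counts)


# Toy community of 100 reads across 5 taxa.
counts = [50, 30, 15, 4, 1]
curve = [(n, expected_taxa(counts, n)) for n in (1, 10, 25, 50, 100)]
for n, taxa in curve:
    print(n, round(taxa, 2))
```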

Classification Techniques • Supervised Taxonomic Classification • Homology-based: database searching by similarity (BLAST, SW); BLAST, BLASTX against GenBank and specialized DBs (NCBI-ENV-NT, NCBI-ENV-NR) • Composition-based: N-mer frequency; Markov Models, Support Vector Machines (SVM), which need a training set • Unsupervised Taxonomic Classification • Clustering methods • SOM (self-organizing maps) • PCA (principal component analysis)
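Composition-based classifiers start from an N-mer frequency vector for each read or contig. The sketch below builds such a vector; the function name and the choice of k = 2 are illustrative, and windows containing anything other than A, C, G, or T are simply ignored. The resulting vectors are the kind of input that an SVM, a SOM, or PCA would then consume.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Return relative k-mer frequencies in a fixed order (AA, AC, ..., TT for k=2)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only count windows made entirely of A, C, G, T.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


profile = kmer_profile("ACGTACGT", k=2)
print(len(profile), profile)
```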

Viral Metagenomic Pipeline (Wash U scripts implemented at CDC) • Sample Collection • DNA Library Construction • Sequencing • Basecalling • Vector Trimming • Remove redundant sequences (yields unique sequences) • Mask repetitive and low complexity seqs (yields good sequences) • BLASTN against Human Genome (e ≤ 1e-10) (yields non-human sequences) • Assembly (contigs, reads) • BLASTN vs GB-viral • BLASTN vs nt • BLASTX vs nr • Report Generation, Display in MEGAN, inspect top hits
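The host-filtering step of the pipeline can be sketched as follows: any read with a BLASTN hit against the human genome at e ≤ 1e-10 is dropped. This is not the Wash U script itself. It assumes BLAST+ tabular output (`-outfmt 6`), where the query ID is the first column and the e-value is the eleventh; the function names and example rows are hypothetical.

```python
import csv
import io


def human_read_ids(blast_tabular, max_evalue=1e-10):
    """Return IDs of queries with at least one hit at or below max_evalue.

    blast_tabular is an iterable of lines in BLAST+ tabular format (-outfmt 6).
    """
    hits = set()
    for row in csv.reader(blast_tabular, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if float(row[10]) <= max_evalue:
            hits.add(row[0])
    return hits


def non_human(all_ids, blast_tabular, max_evalue=1e-10):
    """Keep reads that have no significant hit to the human genome."""
    human = human_read_ids(blast_tabular, max_evalue)
    return [rid for rid in all_ids if rid not in human]


# Made-up rows: read1 matches human strongly, read2 only weakly, read3 not at all.
example = io.StringIO(
    "read1\tchr1\t99.0\t200\t2\t0\t1\t200\t1000\t1199\t1e-100\t360\n"
    "read2\tchr7\t85.0\t40\t6\t0\t1\t40\t500\t539\t0.002\t40.1\n"
)
kept = non_human(["read1", "read2", "read3"], example)
print(kept)
```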

Software for Taxonomic Classification • MEGAN – GUI for classification based on BLAST searches • CARMA – web-based classification using the Pfam database and HMMER alignment of protein families • MG-RAST – classification system utilizing protein-encoding databases and several ribosomal DBs. Can analyze user-provided datasets; web use only • Geneious – commercial product • NextGENe – commercial product • Phymm, PhymmBL – composition-based classification system

Software for Comparative Metagenomics • MEGAN – can display two metagenome populations on the same phylogenetic tree, uses a BLAST file as input • STAMP – calculates statistical differences between sets of metagenomes • XIPE-TOTEC – performs pairwise comparisons of every metagenome in the two sets, creating a distance matrix which is then used for clustering and PCA to calculate statistical values of relatedness

Megan

Ugandan Outbreak Samples • 4 patients • Total RNA from patient sera • 2 samples per 454 run • ~565,000 reads/sample, avg length = 235 nt • Sequences were screened for random library amplification primers and low quality • Assembled each run de novo using the 454 gsAssembler • Performed a BLASTX database search using the assembled contigs (overnight) • Visualized the BLAST output using MEGAN.

MEGAN (MetaGenomeANalyzer)

Ugandan Outbreak - results • Run 1 – 5 contigs (out of 2463 > 100 nt) matched YF virus, covering 98% of the genome (10,441 of 10,823 bp) • Mapped each sample from Run 1 using an Ethiopian YF virus as reference. 3229 individual reads from Sample 1 identified as YF. • Run 2 – no YF reads found
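Genome coverage by a set of contigs is the fraction of reference positions touched by at least one contig, so overlapping contigs must not be counted twice. The sketch below shows that calculation on made-up intervals. The function name and coordinates are illustrative and are not the yellow fever contig positions.

```python
def covered_fraction(intervals, genome_length):
    """Fraction of a genome covered by (start, end) intervals.

    Coordinates are 1-based and inclusive; overlapping or adjacent
    intervals are merged before counting.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # A gap: close the previous merged block and start a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / genome_length


# Toy example: three contigs placed on a 1,000 bp reference.
toy = [(1, 400), (350, 600), (700, 1000)]
print(covered_fraction(toy, 1000))
```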

Phylogenetic analysis of yellow fever virus sequences Laura McMullan (DHPP/VSPB)

Comparative Metagenomics – current work • One 454 run • Two samples • Sample 1 – ~578,000 reads, avg read length 438 bases • Sample 2 – ~550,000 reads, avg read length 425 bases • Total number of bases sequenced - ~488,000,000

Sample 1 – Rarefaction Curve

Sample 1 Taxa tree (collapsed at the Order level)

Comparison of Sample 1 and 2

Bioinformatics Tools • Bioinformatics Packages: EMBOSS, BioInquiry • General Tools: Java/BioJava, Perl/BioPerl, BLAST Suite, BioEdit, GFFtoPS • Genome Comparison/Alignment Tools: Mavid, Mauve, Clustal, Muscle • Gene Prediction: Glimmer, GeneMark • Assembly/Mapping Tools: 454 Suite, Mosaik Tools, MUMmer, CLC Bio, BWA, Velvet, AHA (PacBio) • Functional Annotation: Manatee • Phylogenetics: PAUP, Phylip, MrBayes, BEAUti/BEAST, MEGA, DnaSP • Metagenomics: MEGAN, Galaxy, CARMA • In-House: WAMS, POCs/VOCs

Challenges • Data Management – image files are large (1 run ~25 GB); moving these files around the network is slow • Assembly/Mapping Software – some are provided with the instrument, but additional methods and algorithms are needed • Finishing Tools – gap filling, primer design • Visualization Tools – tools to graphically display contigs on a reference sequence as well as genome multiple alignments • Generic Robust Annotation Tools – researchers need tools to intelligently choose predicted ORFs as genes, assign function, and submit to GenBank

What are the weaknesses of current next-gen sequencers? • Complicated and time-consuming library preparation • Requires micrograms of DNA to begin • 3 days to prepare library • Requires amplification of library • Low copy number polymorphisms may be missed • Emulsion PCR is an inefficient, time-consuming, oily mess • Potential to introduce PCR bias into sample • Instruments require repetitive sequential ‘flows’ of reagents • Repetitive flows of nucleotides, blocking/unblocking chemistry, and washing out reaction byproducts all slow synthesis and hinder read length • Consumes liters of reagents ($) • Repetitive flows and imaging extend sequence runs to days (or weeks)

Pacific Biosciences SMRT sequencer (single-molecule sequencer) • Ion Torrent Personal Genome Machine (solid-state sequencer) • Nanopore sequencing

Pacific Biosciences SMRT sequencer Sponsor: Influenza Research Agenda

Pacific Biosciences SMRT Technology • Figure labels: individual ZMW with attached polymerase and DNA strand; laser excitation/detection volume; glass; ~ 50 nm • Functional volume (red) is in zL! • SMRTcell array = 1.5 million ZMW • SMRTcell = 160,000 ZMW

Nucleotide incorporation is a real-time data movie • Figure label: 100 ms

Pacific Biosciences Advantages • Read lengths of 1,000 – 10,000 bases • No reagent ‘flows’ = 10-fold increase in sequencing speed • Substitute reverse transcriptase for polymerase and sequence RNA directly • Bacterial genomes sequenced in hours • Sequence run costs $99; takes 15 minutes to complete
