Introduction to Next Generation Sequencing

Overview • Day 1: AM - Basic biology recap and Intro to NGS • Day 1: PM - Intro to Data Analysis • Format(s), Quality checking, Trimming • Day 2: AM - General procedures and strategies in NGS • Day 2: PM - Exome sequence analysis practical (Galaxy) • Day 3: AM - Introduction to RNA-Seq (and a touch of miRNA-Seq) • Day 3: PM - RNA-Seq practical (Tophat + Cuffdiff pipeline on Galaxy) • Note: practical write-ups = assessment assignment

Overview • Day 4: AM – NGS in the wild (case studies) • Clinical genomics • Human microbiome • Day 4: PM - Candidate filtering and prioritization • Mostly SNP based • Little bit of functional and pathway enrichment analysis • Day 5: AM - Knowledge-driven methods for finding ‘causative’ genes & wrap-up • Day 5: PM – Free or wrap up practical

Next Generation Sequencing Day 1: Introduction

Full genome sequencing

Day 1 - Overview • Central Dogma Review • History of DNA Sequencing • First Generation (Sanger) Sequencing • Next Generation Sequencing Introduction • NGS Opportunities and Challenges • NGS Applications • NGS Study Design and Technology Choice

History 1866 Gregor Mendel published the results of his investigations of the inheritance of "factors" in pea plants.

DNA was first isolated by the Swiss physician Friedrich Miescher in 1869.

1950's • Maurice Wilkins (1916-2004), Rosalind Franklin (1920-1958), Francis Crick (1916-2004) and James Watson (1928- ) discover the chemical structure of DNA • Starts a new branch of science - molecular biology.

The Central Dogma of Molecular Biology Reverse Transcription

Structure of the DNA molecule • DNA is shaped like a double helix • It is like a spiral staircase • Another way to think of it is a twisted ladder

Connecting the DNA molecule • Rails of the DNA ladder are alternating sugar & phosphates • Rungs are composed of pairs of bases • A bonds with T • G bonds with C

Connecting the DNA molecule • The two strands of DNA are different • One is called the sense strand and it is the plan to make a protein • The other strand is the antisense strand

Connecting the DNA molecule • The two strands of DNA are said to be antiparallel • One strand is oriented in a 5’ to 3’ direction • The other strand is oriented in the opposite 3’ to 5’ direction • [Figure: the sense and antisense strands with their 5’ and 3’ ends labelled]
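
The pairing rules (A with T, G with C) and the antiparallel orientation together mean that one strand fully determines the other. A minimal Python sketch of that idea follows; the function name and the example sequence are made up for illustration and are not part of the course material.

```python
# Watson-Crick pairing from the slides: A bonds with T, G bonds with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, also written 5' to 3'."""
    # Complement each base, then read in reverse order, because the
    # partner strand runs in the opposite (antiparallel) direction.
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


sense = "ATGGCC"  # made-up example sequence, written 5' to 3'
antisense = reverse_complement(sense)
print(antisense)  # GGCCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check when handling reads from either strand.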

Replication of DNA

DNA sequencing exploits the physicochemical properties of DNA and the enzymes involved in its replication (more later…)

Introns and Exons • Introns – non-coding sequences in the DNA that are NOT used to make a protein • Exons – coding sequences in the DNA that are expressed or used to make mRNA and ultimately are used to make a protein
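
As a rough illustration of the intron/exon distinction, the sketch below joins the exon segments of a toy gene and drops the intron between them. It is a simplification: in the cell, splicing acts on the RNA transcript rather than on the DNA itself, and the gene, the coordinates and the function name here are all invented for the example.

```python
def splice(gene: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon segments of a gene, dropping the introns between them.

    exons holds (start, end) positions, 0-based with end exclusive.
    """
    return "".join(gene[start:end] for start, end in sorted(exons))


# Made-up toy gene: upper case marks exons, lower case marks an intron.
gene = "ATGGCC" + "gtaagtttcag" + "GATTAA"
exons = [(0, 6), (17, 23)]  # hypothetical exon coordinates
print(splice(gene, exons))  # ATGGCCGATTAA
```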

Introns and Exons

Transcription

Translation

Sanger Method • Fred Sanger was originally a protein chemist • Made his first mark in sequencing proteins (Nobel Prize, 1958) • Made his second mark in sequencing RNA • Dideoxy DNA sequencing (Nobel Prize, 1980)

Sanger Method: Dideoxy Chain Termination 300-500 bases

Capillary Method - Fluorescent Dyes 800-1000 bases

Automated Sequencing • Leroy Hood developed fluorescent color labels for the 4 terminator nucleotide bases (late 80s). • This allowed all 4 bases to be sequenced in a single reaction and sorted in a single gel lane. • Hood also pioneered direct data collection by computer. • Improvements in this technology now enable sequencing of billion-base genomes in < 1 year.

Automated sequencing machines use 4 colors, so they can read all 4 bases at once.

Genome Sequencing • [Figure: a genome is broken into short fragments of DNA; the fragments are read as short DNA sequences, which are then assembled into the sequenced genome]

2001: The HGP consortium publishes its working draft in Nature (15 February), and Celera publishes its draft in Science (16 February).

Sequencing the Human Genome [Figure: Log10(price) plotted against year, 2000 to 2012] • 2001: Human Genome Project, 2.7G$, 11 years • 2001: Celera, 100M$, 3 years • 2007: 454, 1M$, 3 months • 2008: ABI SOLiD, 60K$, 2 weeks • 2009: Illumina, Helicos, 40-50K$ • 2010: 5K$, a few days? • 2012: 100$, <24 hrs?

Exponential Data Increase [Figure: sequence database size by year] NAR. 2007 September; 35(18): 6227–6237.

History of DNA Sequencing • 1870 Miescher: Discovers DNA • 1940 Avery: Proposes DNA as ‘Genetic Material’ • 1953 Watson & Crick: Double Helix Structure of DNA • 1965 Holley: Sequences Yeast tRNA-Ala • 1970 Wu: Sequences λ Cohesive End DNA • 1977 Sanger: Dideoxy Chain Termination; Gilbert: Chemical Degradation • 1980 Messing: M13 Cloning • 1986 Hood et al.: Partial Automation • 1990 to 2002: Cycle Sequencing, Improved Sequencing Enzymes, Improved Fluorescent Detection Schemes • 2002 to 2008: Next Generation Sequencing, Improved enzymes and chemistry, Improved image processing • Efficiency axis (bp/person/year), rising over the timeline: 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000 • Adapted from Eric Green, NIH; adapted from Messing & Llaca, PNAS (1998)

Sanger vs NGS • ‘Sanger sequencing’ has been the dominant DNA sequencing method for about 30 years but… • …hunger for even greater sequencing throughput, at lower cost • NGS has the ability to process millions of sequence reads in parallel rather than 96 at a time (at a small fraction of the cost)

Next Generation Sequencing: Why Now? • Motivation: HGP and its derivatives, personalized medicine • Short reads applications: (re-)sequencing, other methods (e.g. gene expression) • Advancements in technology

“Paradigm Shift” • Standard ABI “Sanger” sequencing • 96 samples per run • Read length ~650 bp, so about 62,000 bases per run and roughly 450,000 bases/day at about seven runs • 454 was the game changer! • ~400,000 different templates (reads)/day • Read length ~250 bp • Total = 100,000,000 bases of sequence data!!!
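
The totals on this slide are simply the number of reads multiplied by the read length. The short Python sketch below redoes that arithmetic with the slide's own figures; the function name is invented for the example. Note that a single batch of 96 reads at 650 bp comes to 62,400 bases, so the 450,000-base Sanger total corresponds to about seven such batches.

```python
def total_bases(reads: int, read_length_bp: int) -> int:
    """Total bases produced: number of reads times bases per read."""
    return reads * read_length_bp


sanger_batch = total_bases(96, 650)    # one batch of 96 Sanger reads
pyro_454 = total_bases(400_000, 250)   # 454 figures from the slide

print(f"Sanger, 96 reads x 650 bp: {sanger_batch:,} bases")
print(f"454, 400,000 reads x 250 bp: {pyro_454:,} bases")
```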

Solexa ups the Game • Solexa (Illumina GA) • 60,000,000 different sequence templates (yes that is an insane 60 million reads) • 36 bp read length (much longer now) • 4 billion bases of DNA per run (3 days)

Next Generation Sequencing • 454 Life Sciences/Roche • Genome Sequencer FLX: currently produces 400-600 million bases per day per machine • Published 1 million bases of Neanderthal DNA in 2006 • May 2007 published complete genome of James Watson (3.2 billion bases ~20x coverage) • Solexa/Illumina • 10 Gb per machine/week • May 2008 published complete genomes for 3 HapMap subjects (14x coverage) • ABI SOLiD • 20 Gb per machine/week
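
Coverage figures such as 14x or 20x describe how many times, on average, each position in the genome is sequenced. A common back-of-the-envelope estimate is total sequenced bases divided by genome size. The Python sketch below uses that estimate with the 3.2 billion base genome size quoted above; it assumes reads are spread evenly over the genome, and the function names are made up for illustration.

```python
import math


def mean_coverage(reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average number of times each genome position is sequenced."""
    return reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Reads required to reach a target mean coverage (rounded up)."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


HUMAN_GENOME_BP = 3_200_000_000  # genome size as quoted on the slide

# Example: 60 million reads of 36 bp against the human genome.
print(mean_coverage(60_000_000, 36, HUMAN_GENOME_BP))

# Example: 36 bp reads needed for 14x mean coverage.
print(reads_needed(14, 36, HUMAN_GENOME_BP))
```

In practice some reads fail quality filters or do not map, so the usable coverage is lower than this estimate.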

Nanotechnology • Each system works differently, but they are all based on similar principles: • Shear target DNA into small pieces • Bind individual DNA molecules to a solid surface • Amplify each molecule into a cluster • Copy one base at a time and detect different signals for A, C, T, & G bases • Requires very precise high-resolution imaging of tiny features • (Solexa has 800 images @ 4 megapixels each)

Sequencing by Synthesis (SBS)

Problem: Huge Amount of Image Data • Raw image data is huge: 1-2 TB for the Solexa, more for ABI SOLiD, less for 454 • The images are immediately processed into intensity data (spots w/ location and brightness) • Intensity data is then processed into basecalls (A, C, T, or G plus a quality score for each) • Basecall data is on the order of 5-10 GB per run (or a week of runs for 454)
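
The slide does not say how the per-base quality score is defined. A widely used convention is the Phred scale, where Q = -10 × log10(probability the basecall is wrong), often stored as one ASCII character per base. The Python sketch below assumes that convention and an ASCII offset of 33; early Solexa pipelines used a different scale and offset, so treat both as assumptions. The function names and example data are invented for illustration.

```python
import math


def phred_to_error_prob(q: int) -> float:
    """Probability that a basecall with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> int:
    """Phred score for an error probability p, rounded to the nearest integer."""
    return round(-10 * math.log10(p))


def decode_quality_string(qualities: str, offset: int = 33) -> list[int]:
    """Decode an ASCII-encoded quality string (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in qualities]


bases = "ACGT"  # made-up basecalls
quals = "II5#"  # made-up quality characters, one per base
for base, q in zip(bases, decode_quality_string(quals)):
    print(base, q, phred_to_error_prob(q))
```

On this scale a score of 20 means a 1 in 100 chance of a wrong call and 30 means 1 in 1,000, which is why quality checking and trimming usually work with score thresholds.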

Next-gen sequencers (from John McPherson, OICR) [Figure: bases per machine run, 1 Mb to 100 Gb, plotted against read length, 10 bp to 1,000 bp] • AB/SOLiDv3, Illumina/GAII short-read sequencers: 10+ Gb in 50-100 bp reads, >100M reads, 4-8 days • 454 GS FLX pyrosequencer: 100-500 Mb in 100-400 bp reads, 0.5-1M reads, 5-10 hours • ABI capillary sequencer: 0.04-0.08 Mb in 450-800 bp reads, 96 reads, 1-3 hours

2009/10 (adapted from John McPherson, OICR) [Figure: bases per machine run, 1 Mb to 100 Gb, plotted against read length, 10 bp to 1,000 bp] • AB SOLiDv3: 120 Gb, 100 bp reads • Illumina HiSeq: 100 Gb, 150 bp reads • 454 GS FLX Titanium: 0.4-0.6 Gb, 100-400 bp reads • ABI capillary sequencer: 0.04-0.08 Mb, 450-800 bp reads

Stein Genome Biology 2010 11:207

Storage is becoming a real problem Kahn, 2011, Science

Lower Cost = More Innovation • As sequencing becomes cheaper, more investigators can use it for routine assays • Leads to variations and absolutely novel applications

Lower Cost = More samples • More patients in GWAS studies • More replicates (or the use of some replicates and statistical approaches) in all other assays

Bioinformatics is the Bottleneck • Sequencing is a commodity – can easily be outsourced • Bioinformatics is the essential point of the science • Data analysis and discovery of meaning in results • As the data throughput increases, the cost and time spent on analysis increase more than linearly

More Investigators = Less Informatics skill • Sequencing is a readout for many different types of laboratory experiments • Clinical and basic science investigators from all areas of biology can make use of this technology • Many are completely naïve about bioinformatics • Informatics tools for NGS are very challenging

Challenging Bioinformatics Environment • Very rapid change in technology platform • New file formats, new data types • Different “standards” from different vendors • Very rapid evolution of new methods • Very rapid ‘release’ of methods as ‘software’ via unsupported open source distribution • Large data sizes (both experimental and reference)

The key Automation, automation, automation…
