Next Generation Sequencing Technologies Rob Mitra Lecture 02/17/09
Forward Genetics Genotype Phenotype Hypothesis Test Hypothesis By Genetic Manipulation
Forward Genetics Mutation in APC Gene Two groups: 1. Develop Colorectal cancer At Young Age 2. Do not Genotype Phenotype Hypothesis APC is a Tumor Suppressor Gene Test Hypothesis By Genetic Manipulation Delete APC in Mouse Control: Isogenic APC+
The Cycle of Forward Genetics In 2005 $9 million/genome Not feasible ?Sequencing? Genotype Observation Thinking Phenotype Hypothesis Test Hypothesis By Genetic Manipulation Gene Deletion/Replacement Recombinant Technology
End Runs • Linkage Studies (Humans, Model Organisms) • Association Studies (GWAS) BUT, these end runs have a cost! 1. Requires a large family (many crosses in model organisms); very difficult to analyze multi-factorial traits 2. Common variants But, these end runs will not be needed in 5-10 years. Why?
The Problem with Forward Genetics Currently $60,000 /genome Cost is rapidly dropping Sequencing Genotype Observation Thinking Phenotype Hypothesis Test Hypothesis By Genetic Manipulation Gene Deletion/Replacement Recombinant Technology
Bp/US dollar: increases exponentially with time Adapted from Shendure et al 2004
Two questions: • How was this dramatic acceleration achieved? • What will it mean?
How was this achieved? • Integration (Think about sequencing pipeline) • Parallelization • Miniaturization The same concepts that revolutionized integrated circuits Plus one additional insight
Read Length is Not As Important For Resequencing Jay Shendure
Two Short Read Technologies • Illumina GA • ABI SOLID
Technology Overview: Solexa/Illumina Sequencing http://www.illumina.com/
Immobilize DNA to Surface Source: www.illumina.com
ABI SOLID Dressman et al 2003
Sequencing By Ligation Shendure et al
ABI SOLID This allows for error correction: See board Raw error rate = ~3% Corrected error rate = ~0.1%
Paired End Reads are Important! Known Distance Read 1 Read 2 Repetitive DNA Unique DNA Paired read maps uniquely Single read maps to multiple positions
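The paired-read argument above can be sketched with a toy example. This is a minimal illustration, not a real aligner: the genome string, the read sequences, and the helper names `find_all` and `map_pair` are all hypothetical, reads are matched exactly, and both reads are assumed to come from the same strand (real paired-end reads come from opposite strands).

```python
def find_all(genome, read):
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


def map_pair(genome, read1, read2, distance, tolerance=0):
    """Return (pos1, pos2) placements whose start-to-start spacing
    matches the known distance between the two reads."""
    hits = []
    for p1 in find_all(genome, read1):
        for p2 in find_all(genome, read2):
            if abs((p2 - p1) - distance) <= tolerance:
                hits.append((p1, p2))
    return hits


# Toy genome: the same repeat appears twice, flanked by unique sequence.
repeat = "ACGTTGCA"
genome = "GGCTA" + repeat + "GGATCCA" + repeat + "CTTAGGC"

read1 = repeat      # falls in repetitive DNA
read2 = "CTTAGGC"   # falls in unique DNA

single_hits = find_all(genome, read1)         # read 1 alone is ambiguous
pair_hits = map_pair(genome, read1, read2, distance=8)

print("single read positions:", single_hits)
print("paired read placements:", pair_hits)
```

Read 1 on its own matches both copies of the repeat, but only one of those copies sits at the known distance from read 2, so the pair has a single placement.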
Paired Ends are Important Part 2 Deletion Insertion Inversion Shendure et al 2005
How can we generate paired end reads? • Amplify Large Fragments and Sequence From Each End (some trickery required – see board) • Length is limited (150bp – 1kb). • Jumping Library
Jumping Library Construction From Shendure et al
Other Second Generation Technologies • 454 • Emulsion PCR • Polymerase • Natural Nucleotides • 20-100Mb for 5-15k • 1% error rate • Homopolymers
Helicos • No Amplification Single molecule detection • Homopolymer (solved) • Expensive Detection
Pacific Biosciences: A Third Generation Sequencing Technology Eid et al 2008
How did they do? • 150 bp circular template • ~93% raw accuracy • 15x coverage 99.3% accuracy • Still early days
Where are they going • Phi29 so long read lengths possible • Ease of sample prep • Camera costs
Summary • Sequencing will become very inexpensive in 3-5 years • So now what?
Areas of Broad Impact • Understanding Common Diseases • Cancer
Why don’t we understand common traits or diseases? • GWAS is relatively new • But, this method can only analyze common variants • If rare variants play a significant role in common traits then we need to sequence. (Board) SO DO THEY?
Studies on human height • Heritability of height is 0.8 (80% of variation in height is due to genetic factors) • 3 studies genotyped 63,000 individuals at 500,000 loci (biggest cohort analyzed to date) • 54 loci explain ~4% of the variance. WHAT!?
Do rare variants matter? • What is the genetic basis of variation in blood pressure? • Lifton and colleagues sequenced 1000 individuals at these 3 loci (SLC12A3, SLC12A1, and KCNJ1) and correlated the observed genetic variation with blood pressure measurements. • 20 individuals had heterozygous, rare mutations that caused a significant decrease in blood pressure. Each rare mutation had a relatively large effect, and these mutations also protected individuals against developing clinical hypertension. • Although only about 2% of the population has a functional mutation in one of these three genes, Lifton and colleagues hypothesize: “Because these three genes comprise only a small fraction of those in which mutations are known to alter blood pressure, and because there are likely to be many more genes yet to be discovered, it seems probable that the combined effects of rare independent mutations will account for a substantial fraction of blood pressure variation in the population.” Ji et al 2008
Conclusions • The common disease/common variant (CDCV) hypothesis may not hold for many common traits • Rare variants may cumulatively play a big role in common traits, but sequencing candidate genes isn’t getting it done. • Whole genome sequencing.
Cancer and Whole Genome Sequencing • Cancer is a disease of the genome • Acquisition of somatic mutation • The genome records a history of disease
Complete genome sequence of AML genome • 32.7 fold haploid coverage • 14 fold coverage of normal skin • Remove SNPs, check for non-synonymous somatic mutations in coding DNA • 10 mutations found (2 known to be involved in cancer progression)
We need more genomes! • Complete Genomics ($5000) • ABI ($10,000) • Illumina (?) • Intelligent Biosystems (<$1000)
Sequencing coverage calculations • Let’s say you need a base to be sequenced 5x for an accurate base call • How much average coverage do you need to ensure that 95% of the genome is sequenced at least 5 times?
Poisson Distribution Originally derived for time. Average coverage = lambda Probability of getting k reads from a base given the average coverage lambda: $P(k; \lambda) = \lambda^{k} e^{-\lambda} / k!$
Example • Average coverage = 5x • Probability of a given base being sequenced exactly 10 times is: $5^{10} e^{-5} / 10! \approx 0.018$, or about 2% of bases will have exactly 10x coverage.
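The coverage question can be worked through numerically. The sketch below assumes reads land uniformly at random, so that per-base depth follows a Poisson distribution with mean equal to the average coverage; real libraries have coverage biases, so treat the result as a lower bound. The function names and the 0.1x search step are illustrative choices, not part of the lecture.

```python
import math


def poisson_pmf(k, lam):
    """Probability that a base is covered exactly k times at average coverage lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def fraction_at_least(min_depth, lam):
    """Probability that a base is covered min_depth or more times."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(min_depth))


def min_average_coverage(min_depth, target_fraction):
    """Smallest average coverage (searched in 0.1x steps, up to 100x) at which
    at least target_fraction of bases reach min_depth."""
    for tenths in range(1, 1001):
        lam = tenths / 10
        if fraction_at_least(min_depth, lam) >= target_fraction:
            return lam
    raise ValueError("target not reached within 100x average coverage")


# Slide example: exactly 10 reads at 5x average coverage.
print(round(poisson_pmf(10, 5), 3))

# Slide question: average coverage needed so 95% of bases have >= 5 reads.
print(min_average_coverage(5, 0.95))
```

Under these assumptions the first value reproduces the 0.018 from the example, and the search gives roughly 9.2x average coverage to have 95% of bases sequenced at least 5 times.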