This summary presents key points from the informatics workshop on next-generation sequence analysis, focusing on SNP calling with sequencing technologies such as Illumina/Solexa and AB/SOLiD. Topics include genome resequencing, the importance of accurate base calling and base quality values, error rates, and data management challenges in SNP detection workflows. The talk also points to future applications in genomics, including structural variation detection and epigenetic analysis, and reports validation results for SNP calls across diverse genetic isolates.
Informatics for next-generation sequence analysis – SNP calling
Gabor T. Marth
Boston College Biology Department
PSB 2008, January 4-8, 2008
Read length and throughput
[Figure: bases per machine run (1 Mb to 1 Gb) plotted against read length (10 bp to 1,000 bp)]
• Illumina/Solexa, AB/SOLiD short-read sequencers (1-4 Gb in 25-50 bp reads)
• 454 pyrosequencer (20-100 Mb in 100-250 bp reads)
• ABI capillary sequencer
Current and future application areas
• Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery
• De novo genome sequencing
• Short-read sequencing will be (at least) an alternative to micro-arrays for:
  • DNA-protein interaction analysis (ChIP-Seq)
  • novel transcript discovery
  • quantification of gene expression
  • epigenetic analysis (methylation profiling)
[Figure labels: reference genome, DEL, SNP]
Fundamental informatics challenges (I)
1. Interpreting machine readouts – base calling, base error estimation
2. Dealing with non-uniqueness in the genome: resequenceability
3. Alignment of billions of reads
Informatics challenges (II)
4. SNP and short INDEL, and structural variation discovery
5. Data visualization
6. Data storage & management
Resequencing-based SNP discovery
• Read mapping
• Read alignment
• Paralog identification
• SNP detection + inspection
[Figure label: genome reference sequence]
SNP calling workflow
• read alignment
• SNP detection
• visual checking
Bayesian detection algorithm
• Inputs: base call + base quality, polymorphism rate (prior), base composition, depth of coverage
• Output: Bayesian posterior probability, i.e. the SNP score
[Figure: columns of A, C, G, T base calls illustrating a polymorphic combination and a monomorphic combination]
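To make the idea of a posterior "SNP score" concrete, here is a minimal sketch of a Bayesian calculation over one alignment column. It is not the software described in the talk. It assumes haploid samples, independent reads, sequencing errors spread evenly over the three other bases, and a polymorphism prior shared equally among all polymorphic base combinations. The names `base_likelihoods` and `snp_score`, and the default prior of 0.001, are hypothetical.

```python
from itertools import product

BASES = "ACGT"

def base_likelihoods(reads):
    """reads: list of (base_call, phred_quality) for one sample.
    Returns {true_base: P(reads | true_base)} under an independent-error model."""
    likelihoods = {}
    for true_base in BASES:
        p = 1.0
        for call, qual in reads:
            err = 10 ** (-qual / 10)  # Phred quality -> error probability
            p *= (1 - err) if call == true_base else err / 3
        likelihoods[true_base] = p
    return likelihoods

def snp_score(samples, prior_poly=0.001, composition=None):
    """Posterior probability that the samples do not all carry the same base.
    samples: one read list per sample. prior_poly and composition are
    illustrative assumptions, not values taken from the slides."""
    composition = composition or {b: 0.25 for b in BASES}
    liks = [base_likelihoods(reads) for reads in samples]
    n_poly = 4 ** len(samples) - 4  # number of polymorphic combinations
    mono = poly = 0.0
    for combo in product(BASES, repeat=len(samples)):
        data_lik = 1.0
        for lik, base in zip(liks, combo):
            data_lik *= lik[base]
        if len(set(combo)) == 1:
            # monomorphic combination: every sample has the same true base
            mono += (1 - prior_poly) * composition[combo[0]] * data_lik
        else:
            # polymorphic combination: at least two different true bases
            poly += prior_poly / n_poly * data_lik
    return poly / (mono + poly)

# Two samples, five Q30 reads each: one all A, the other all C.
print(snp_score([[("A", 30)] * 5, [("C", 30)] * 5]))
```

Enumerating all 4^n combinations is only practical for a handful of samples; the point is to show how base qualities, the prior, base composition, and depth of coverage all enter a single posterior.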
Base quality values for SNP calling
• base quality values help us decide if mismatches are true polymorphisms or sequencing errors
• accurate base qualities are crucial, especially in lower coverage
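As a reminder of what a base quality value encodes, the sketch below uses the standard Phred convention, Q = -10 log10(P_error). The slides do not state which scale each platform reports, so treating the qualities as Phred-scaled is an assumption here, and the function names are hypothetical.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)

for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

Under this convention a mismatch at Q10 is a sequencing error about one time in ten, while at Q30 it is an error about one time in a thousand. This is why miscalibrated qualities matter most when only a few reads cover a site.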
Priors for specific resequencing scenarios
[Figure: aligned reads (AACGTTAGCATA / AACGTTCGCATA) grouped by source: individual 1, individual 2, individual 3 and strain 1, strain 2, strain 3]
Consensus sequence generation (genotyping)
[Figure: aligned reads (AACGTTAGCATA / AACGTTCGCATA) grouped by source (individual 1-3, strain 1-3), labeled with the calls A, A/C, C, C/C, A, A/A]
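The genotype labels in the figure (single bases alongside calls such as A/C) can be illustrated with a small maximum-likelihood sketch. This is a simplified illustration, not the method used in the talk: it assumes independent reads, Phred-scaled qualities, equal sampling of each allele, and a flat prior over genotypes. `read_prob` and `call_genotype` are hypothetical names, and the read piles in the example are invented.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"

def read_prob(call, qual, allele):
    """P(observed base call | true allele) under a uniform-error model."""
    err = 10 ** (-qual / 10)
    return 1 - err if call == allele else err / 3

def call_genotype(reads, ploidy=2):
    """Return the most likely genotype for one sample at one site.
    reads: list of (base_call, phred_quality).
    ploidy=2 for a diploid individual, ploidy=1 for a haploid or inbred strain."""
    best, best_lik = None, -1.0
    for genotype in combinations_with_replacement(BASES, ploidy):
        lik = 1.0
        for call, qual in reads:
            # each read is drawn from one of the genotype's alleles at random
            lik *= sum(read_prob(call, qual, a) for a in genotype) / ploidy
        if lik > best_lik:
            best, best_lik = genotype, lik
    return "/".join(best)

print(call_genotype([("A", 30)] * 5 + [("C", 30)] * 2))  # mixed pile, diploid
print(call_genotype([("C", 30)] * 4))                    # uniform pile, diploid
print(call_genotype([("C", 30)] * 4, ploidy=1))          # uniform pile, haploid
```

A real caller would replace the flat prior with scenario-specific priors, as the preceding slide title suggests.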
SNP calling in low 454 coverage
• DNA courtesy of Chuck Langley, UC Davis
• with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)
• 10 different African and American melanogaster isolates
• 10 runs of 454 reads (~300,000 reads per isolate) (~1.5X total)
• can we detect SNPs in survey-style 454 read coverage?
• 92.9% validation rate (1,342 / 1,443)
• 2.0% missed SNP rate (25 / 1,247)
[Figure labels: iso-1 reference, 46-2 454 read, 46-2 ABI reads (2 fwd + 2 rev)]
SNP calling in short-read coverage
C. elegans reference genome (Bristol, N2 strain); Pasadena, CB4858 (1 ½ machine runs)
• SNP calling error rate very low:
  • Validation rate = 97.8% (224/229)
  • Conversion rate = 92.6% (224/242)
  • Missed SNP rate = 3.75% (26/693)
• INDEL candidates validate and convert at similar rates to SNPs:
  • Validation rate = 89.3% (193/216)
  • Conversion rate = 87.3% (193/221)
[Figure labels: SNP, INS]
SNP calling in AB/SOLiD color-space reads
A C G G T C G T C G T G T G C G T
A C G G T C G T C G T G T G C G T   No change
A C G G T C G C C G T G T G C G T   SNP
A C G G T C G T C G T G T G C G T   Measurement error
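The distinction between a SNP and a measurement error is easier to see once the sequences are translated into colors. The sketch below assumes the usual two-base encoding, in which each color is determined by a pair of adjacent bases (equivalent to XOR-ing 2-bit base codes with A=0, C=1, G=2, T=3). The numbering of colors as 0-3 and the function names are illustrative choices, and the measurement error is simulated by altering one color by hand.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colors(seq):
    """Encode a base sequence as colors, one per pair of adjacent bases."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def diff_positions(colors_a, colors_b):
    """Indices at which two color strings disagree."""
    return [i for i, (x, y) in enumerate(zip(colors_a, colors_b)) if x != y]

reference = "ACGGTCGTCGTGTGCGT"
snp_read = "ACGGTCGCCGTGTGCGT"  # T -> C substitution, as on the slide

ref_colors = to_colors(reference)
snp_colors = to_colors(snp_read)

# Simulate a measurement error: a single color is read incorrectly.
error_colors = list(ref_colors)
error_colors[6] = (error_colors[6] + 1) % 4

print(diff_positions(ref_colors, snp_colors))    # [6, 7]
print(diff_positions(ref_colors, error_colors))  # [6]
```

In this example the substitution changes two adjacent colors, while the isolated measurement error changes only one, which gives a SNP caller working in color space a way to tell the two apart.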
Mutational profiling: deep 454/Illumina/SOLiD data
• collaboration with Doug Smith at Agencourt
• Pichia stipitis converts xylose to ethanol (bio-fuel production)
• one mutagenized strain had especially high conversion efficiency
• determine where the mutations were that caused this phenotype
• we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads
• 14 true point mutations in the entire genome
• at about 15X nominal coverage each technology can find every point mutation with essentially no false positives
[Figure: Pichia stipitis reference sequence; image from JGI web site]
Our software is available for testing http://bioinformatics.bc.edu/marthlab/Beta_Release
Credits
• Elaine Mardis (Washington University)
• Andy Clark (Cornell University)
• Doug Smith (Agencourt)
• Michael Stromberg, Chip Stewart, Michele Busby, Aaron Quinlan, Damien Croteau-Chonka, Eric Tsung, Derek Barnett, Weichun Huang
Research supported by: NHGRI (G.T.M.), BC Presidential Scholarship (A.R.Q.)
http://bioinformatics.bc.edu/marthlab