
High throughput sequencing : informatics & software aspects

High throughput sequencing : informatics & software aspects. Gabor T. Marth Boston College Biology Department BI543 Fall 2013 January 29, 2013. Traditional DNA sequencing. Genetics of living organisms. Chromosomes. DNA. Radioactive label gel sequencing. Four-color capillary sequencing.


Presentation Transcript


  1. High throughput sequencing: informatics & software aspects Gabor T. Marth Boston College Biology Department BI543 Fall 2013 January 29, 2013

  2. Traditional DNA sequencing

  3. Genetics of living organisms Chromosomes DNA

  4. Radioactive label gel sequencing

  5. Four-color capillary sequencing [Figure: ABI 3700 four-color sequence trace; scale labels ~1 Mb, ~100 Mb, >100 Mb, ~3,000 Mb]

  6. Individual human resequencing

  7. Next-generation DNA sequencing

  8. New sequencing technologies…

  9. … vast throughput, many applications [Figure: bases per machine run (1 Mb to 1 Tb) versus read length (10 bp to 1,000 bp) for ABI / capillary, 454, and Illumina / SOLiD]

  10. Sequencing chemistries DNA base extension DNA ligation Church, 2005

  11. Template clonal amplification Church, 2005

  12. Massively parallel sequencing Church, 2005

  13. Chemistry of paired-end sequencing Double strand DNA is folded into a bridge shape then separated into single strands. The end of each strand is then sequenced. (Figure courtesy of Illumina)

  14. Paired-end reads • circularization: 500 bp - 10 kb (sweet spot ~3 kb) • fragment length limited by library complexity • fragment amplification: fragment length 100 - 600 bp • fragment length limited by amplification efficiency (Korbel et al. Science 2007)

  15. Features of NGS data • Short sequence reads • 100-200 bp • 25-35 bp (micro-reads) • Huge amount of sequence per run • Up to gigabases per run • Huge number of reads per run • Up to hundreds of millions • Higher error rate than Sanger sequencing • Error profile different from Sanger
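The scale of these figures is easier to judge as an expected depth of coverage, which is simply read count times read length divided by genome size. The sketch below is illustrative only: the function name and the example numbers (100 million reads of 100 bp against a 3,000 Mb genome) are assumptions chosen to match the orders of magnitude on the slides, not values reported for any particular run.

```python
def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base: N * L / G."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical run: 100 million reads of 100 bp on a ~3,000 Mb genome.
coverage = expected_coverage(100_000_000, 100, 3_000_000_000)
print(f"{coverage:.1f}x")
```

The same arithmetic run backwards gives the number of reads needed for a target depth.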

  16. Application areas of next-gen sequencing

  17. Application areas • Genome resequencing • variant discovery • somatic mutation detection • mutational profiling • De novo assembly • Identification of protein-bound DNA • chromatin structure • methylation • transcription binding sites • RNA-Seq • expression • transcript discovery (Mikkelsen et al. Nature 2007; Cloonan et al. Nature Methods, 2008)

  18. SNP and short-INDEL discovery

  19. Structural variation detection • copy number (for amplifications, deletions) from depth of read coverage • structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations
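The read-depth idea can be sketched in a few lines: if a diploid region normally yields some expected depth, a window with about half that depth suggests one copy and a window with about double suggests four. This is a minimal sketch under simplifying assumptions (uniform expected depth, no GC or mappability correction); the function name and the example depths are hypothetical.

```python
def copy_number_from_depth(window_depth: float, expected_depth: float, ploidy: int = 2) -> int:
    """Estimate integer copy number of a window from its read depth.

    Assumes depth scales linearly with copy number and that
    expected_depth is the depth of a normal region with `ploidy` copies.
    """
    if expected_depth <= 0:
        raise ValueError("expected_depth must be positive")
    return round(ploidy * window_depth / expected_depth)


# Hypothetical per-window depths along a chromosome; expected depth is 30.
window_depths = [30, 31, 14, 16, 29, 61]
copy_numbers = [copy_number_from_depth(d, 30) for d in window_depths]
print(copy_numbers)
```

Real callers smooth or segment neighboring windows before calling an event, since single-window depth is noisy.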

  20. Identification of protein-bound DNA • Chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007) • Transcription binding sites (Robertson et al. Nature Methods, 2007) [Figure: reads aligned to the genome sequence]

  21. Novel transcript discovery (genes) Mortazavi et al. Nature Methods • novel exons • novel transcripts containing known exons

  22. Novel transcript discovery (miRNAs) Ruby et al. Cell, 2006

  23. Expression profiling • tag counting (e.g. SAGE, CAGE) • shotgun transcript sequencing (Jones-Rhoades et al. PLoS Genetics, 2007) [Figure: aligned reads over genes]
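Expression profiling from aligned reads reduces, in its simplest form, to counting how many reads fall inside each gene. The following is a rough sketch of that counting step, assuming reads are summarized by their start coordinate on a single chromosome and genes are half-open intervals; the gene names and coordinates are made up for illustration.

```python
def count_reads_per_gene(read_starts: list[int], genes: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Count reads whose start position lies in each gene interval [start, end)."""
    counts = {name: 0 for name in genes}
    for pos in read_starts:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts


# Hypothetical annotation and read start positions.
genes = {"geneA": (100, 200), "geneB": (500, 900)}
read_starts = [105, 150, 199, 200, 510, 899, 1000]
counts = count_reads_per_gene(read_starts, genes)
print(counts)
```

Raw counts are usually normalized by gene length and sequencing depth before genes or samples are compared.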

  24. De novo genome sequencing (Lander et al. Nature 2001) [Figure: short reads, read pairs, and longer reads assembled into sequence contigs]

  25. The informatics of sequencing

  26. Re-sequencing informatics pipeline (i) base calling (ii) read mapping (iii) SNP and short INDEL calling (iv) SV calling (v) data viewing, hypothesis generation [Figure: reads from individuals (IND) compared against the reference (REF)]

  27. The variation discovery toolbox • base callers • read mappers • SNP callers • SV callers • assembly viewers

  28. Raw data processing / base calling (trace extraction, then base calling) • These steps are usually handled well by the machine manufacturers’ software • What most analysts want to see is base calls and well-calibrated base quality values
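Base quality values are conventionally reported on the Phred scale, where Q = -10 log10(P_error), so Q20 corresponds to a 1% chance that the base call is wrong. The sketch below shows the conversion in both directions and how a FASTQ quality string decodes to integers. The ASCII offset of 33 is an assumption (it is the common Sanger-style encoding; some older files use 64), and the function names are illustrative.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality value into an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability into a Phred quality value: Q = -10 log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


def decode_fastq_qualities(quality_string: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into Phred values (offset 33 assumed)."""
    return [ord(c) - offset for c in quality_string]


qualities = decode_fastq_qualities("I5!")
print(qualities, [phred_to_error_prob(q) for q in qualities])
```

"Well-calibrated" then has a concrete meaning: among bases reported at Q20, roughly 1 in 100 should actually be wrong.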

  29. Sequence traces are machine-specific • Base calling is increasingly left to machine manufacturers

  30. Read mapping… is like a jigsaw puzzle… where they give you the cover on the box

  31. Some pieces are easier to place than others… pieces that look like each other… pieces with unique features

  32. Repeats → multiple mapping problem (Lander et al. 2001)
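The multiple mapping problem is easy to reproduce on a toy reference: a read drawn from a repeated segment matches in more than one place, while a read from unique sequence has a single placement. This is a deliberately naive sketch using exact string matching; real mappers use indexed, mismatch-tolerant search. The reference sequence and function name are invented for the example.

```python
def find_all_placements(reference: str, read: str) -> list[int]:
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue searching one base further so overlapping matches are found too.
        start = reference.find(read, start + 1)
    return positions


# Toy reference containing the segment GATTACA twice.
reference = "GATTACAGGCCTTGATTACATTCGA"
repeat_hits = find_all_placements(reference, "GATTACA")
unique_hits = find_all_placements(reference, "GGCCTT")
print(repeat_hits, unique_hits)
```

A read with several equally good placements cannot be assigned confidently on its own, which is part of the motivation for paired-end reads and mapping quality values.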

  33. Paired-end (PE) reads • fragment length: 1 – 10 kb • fragment length: 100 – 600 bp • PE reads are now the standard for whole-genome short-read sequencing (Korbel et al. Science 2007)

  34. Mapping quality values [Figure: candidate read placements labeled 0.8, 0.19, 0.01]
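If the three numbers on this slide are read as the probabilities of three candidate placements for one read (an assumption; the transcript carries no labels), a mapping quality can be derived from them as the Phred-scaled probability that the best placement is wrong. The sketch below follows that convention; the cap of 60 and the function name are illustrative choices, not taken from the slides.

```python
import math


def mapping_quality(placement_probs: list[float], cap: int = 60) -> int:
    """Phred-scaled probability that the most likely placement is incorrect.

    placement_probs are assumed to be posterior probabilities of the
    candidate placements, summing to roughly 1.
    """
    p_wrong = 1.0 - max(placement_probs)
    if p_wrong <= 0:
        return cap  # a unique placement gets the maximum reported quality
    return min(cap, round(-10 * math.log10(p_wrong)))


mapq = mapping_quality([0.8, 0.19, 0.01])
print(mapq)
```

A best placement with probability 0.8 gives a low mapping quality, which tells downstream callers to treat that read's evidence with caution.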

  35. SNP calling

  36. SNP calling: what goes into it? • Base qualities (sequencing error vs. true polymorphism) • Base coverage • Prior expectation

  37. Bayesian SNP calling • Base call + base quality • Expected polymorphism rate • Base composition • Depth of coverage • Bayesian posterior probability [Figure: polymorphic versus monomorphic permutations of the aligned bases A, C, G, T]

  38. The PolyBayes software http://bioinformatics.bc.edu/~marth/PolyBayes • First statistically rigorous SNP discovery tool • Correctly analyzes alternative cDNA splice forms Marth et al., Nature Genetics, 1999

  39. SNP calling (continued) • Reads observed in each sample: B1 = aacc; Bi = aaaac; Bn = cccc • “Genotype likelihoods”: P(B1=aacc|G1=aa), P(B1=aacc|G1=cc), P(B1=aacc|G1=ac); P(Bi=aaaac|Gi=aa), P(Bi=aaaac|Gi=cc), P(Bi=aaaac|Gi=ac); P(Bn=cccc|Gn=aa), P(Bn=cccc|Gn=cc), P(Bn=cccc|Gn=ac) • Prior(G1,..,Gi,.., Gn) • “Genotype probabilities”: P(G1=aa|B1=aacc; Bi=aaaac; Bn=cccc), P(G1=cc|B1=aacc; Bi=aaaac; Bn=cccc), P(G1=ac|B1=aacc; Bi=aaaac; Bn=cccc); P(Gi=aa|B1=aacc; Bi=aaaac; Bn=cccc), P(Gi=cc|B1=aacc; Bi=aaaac; Bn=cccc), P(Gi=ac|B1=aacc; Bi=aaaac; Bn=cccc); P(Gn=aa|B1=aacc; Bi=aaaac; Bn=cccc), P(Gn=cc|B1=aacc; Bi=aaaac; Bn=cccc), P(Gn=ac|B1=aacc; Bi=aaaac; Bn=cccc) • P(SNP)
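The step from genotype likelihoods to genotype probabilities can be illustrated with the three samples on this slide. The sketch below is a simplification and not the PolyBayes implementation: it assumes a single uniform base error rate instead of per-base qualities, and it applies an independent flat prior to each sample, whereas the slide uses a joint prior over all samples. All function names are hypothetical.

```python
from math import prod

GENOTYPES = ("aa", "ac", "cc")


def base_likelihood(observed: str, allele: str, error_rate: float) -> float:
    """P(observed base | true allele), with errors spread over the 3 other bases."""
    return 1.0 - error_rate if observed == allele else error_rate / 3.0


def genotype_likelihood(bases: str, genotype: str, error_rate: float = 0.01) -> float:
    """P(B | G) for a diploid genotype: each read samples either allele with prob 1/2."""
    a1, a2 = genotype
    return prod(
        0.5 * base_likelihood(b, a1, error_rate) + 0.5 * base_likelihood(b, a2, error_rate)
        for b in bases
    )


def genotype_posteriors(bases: str, prior: dict[str, float], error_rate: float = 0.01) -> dict[str, float]:
    """P(G | B) by Bayes' rule: likelihood times prior, normalized over genotypes."""
    weighted = {g: genotype_likelihood(bases, g, error_rate) * prior[g] for g in GENOTYPES}
    total = sum(weighted.values())
    return {g: w / total for g, w in weighted.items()}


# Assumed flat per-sample prior; the slide's joint prior would couple the samples.
flat_prior = {g: 1.0 / 3.0 for g in GENOTYPES}
samples = {"B1": "aacc", "Bi": "aaaac", "Bn": "cccc"}
posteriors = {name: genotype_posteriors(bases, flat_prior) for name, bases in samples.items()}
calls = {name: max(post, key=post.get) for name, post in posteriors.items()}
print(calls)
```

With a prior that reflects the expected polymorphism rate, the sample with a single c among five reads would be pulled toward the homozygous call, which is the effect the prior term is meant to capture.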

  40. Insertion/deletion (INDEL) variants • These variants have been on the “radar screen” for decades • Accurate automated detection is difficult • Different mutation mechanisms • Often appear in repetitive sequence and therefore difficult to align • Often multi-allelic • Deleted allele has no base quality values

  41. Alignment methods became more refined • Original alignment • After left realignment • After haplotype-aware realignment
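Left realignment addresses the fact that an INDEL inside a repeat can be written at several equivalent positions. One common normalization, sketched here for deletions only, shifts the deletion left for as long as the resulting sequence is unchanged. This is a minimal illustration of the idea and not the specific method behind the slide; the function names and the toy reference are assumptions.

```python
def apply_deletion(reference: str, start: int, length: int) -> str:
    """Return the sequence obtained by deleting reference[start:start+length]."""
    return reference[:start] + reference[start + length:]


def left_align_deletion(reference: str, start: int, length: int) -> int:
    """Shift a deletion to its leftmost equivalent 0-based start position.

    Moving the deleted window one base left leaves the result unchanged
    exactly when the base entering the window on the left equals the
    base leaving it on the right.
    """
    while start > 0 and reference[start - 1] == reference[start + length - 1]:
        start -= 1
    return start


# Deleting one "AC" unit from an (AC)n repeat: reported at position 6, normalized to 1.
reference = "GCACACACT"
normalized = left_align_deletion(reference, 6, 2)
print(normalized, apply_deletion(reference, normalized, 2))
```

Once every read reports the deletion at the same position, the evidence for it accumulates at one site instead of being scattered across the repeat.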

  42. Medium length INDELs still a problem Guillermo Angel

  43. Structural variation detection (Feuk et al. Nature Reviews Genetics, 2006)

  44. Structural variant detection (cont’d)

  45. Detection approaches • Read depth: good for big CNVs • Paired-end: all types of SV • Split reads: good break-point resolution • De novo assembly: the future (SV slides courtesy of Chip Stewart, Boston College)
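The paired-end approach works by comparing how the two mates of a fragment map to the reference against what the library leads one to expect. The sketch below classifies a single read pair from its mapped fragment length, chromosome agreement, and orientation. The three-standard-deviation threshold, the labels, and the function name are illustrative assumptions, and a real caller would require several supporting pairs before reporting an event.

```python
def classify_pair(
    mapped_length: int,
    same_chromosome: bool,
    proper_orientation: bool,
    mean_fragment: float,
    sd_fragment: float,
    n_sd: float = 3.0,
) -> str:
    """Label a read pair by the kind of structural variant it may support."""
    if not same_chromosome:
        return "translocation candidate"
    if not proper_orientation:
        return "inversion candidate"
    # Mates flanking a deletion in the sample map too far apart on the reference.
    if mapped_length > mean_fragment + n_sd * sd_fragment:
        return "deletion candidate"
    # Mates flanking an insertion in the sample map too close together.
    if mapped_length < mean_fragment - n_sd * sd_fragment:
        return "insertion candidate"
    return "concordant"


# Hypothetical library: mean fragment length 400 bp, standard deviation 30 bp.
labels = [
    classify_pair(410, True, True, 400, 30),
    classify_pair(1200, True, True, 400, 30),
    classify_pair(150, True, True, 400, 30),
    classify_pair(400, True, False, 400, 30),
    classify_pair(400, False, True, 400, 30),
]
print(labels)
```

Detectable insertions are limited to sizes below the fragment length, since a longer insertion leaves no pair spanning it.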

  46. SV detection – resolution [Figure: relative numbers of events versus CNV event length [bp]; expected CNVs compared with the ranges covered by karyotype, micro-array, and sequencing]

  47. Standard data formats • Reads: FASTQ • Alignments: SAM/BAM • Variants: VCF
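All three formats are line-oriented text (BAM being the compressed binary form of SAM), which makes them easy to inspect with short scripts. As an example, the sketch below splits one VCF data line into its eight fixed, tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is a minimal illustration that ignores header lines and per-sample genotype columns; for real work a maintained library or the tools on the next slide are the safer choice. The function name is hypothetical.

```python
def parse_vcf_line(line: str) -> dict:
    """Parse the eight fixed columns of a single VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]

    info_dict = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue
        key, sep, value = entry.partition("=")
        # INFO entries without "=" are flags; store them as True.
        info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # multiple alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }


record = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB\n")
print(record)
```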

  48. Tools for analyzing & manipulating 1000G data Alignments: SAM/BAM • samtools: http://samtools.sourceforge.net/ • BamTools: http://sourceforge.net/projects/bamtools/ • GATK: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit Variants: VCF • VCFTools: http://vcftools.sourceforge.net/ • VcfCTools: https://github.com/AlistairNWard/vcfCTools
