Discovery tools for human genetic variations

Presentation Transcript

  1. Discovery tools for human genetic variations • Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA 02467

  2. Sequence variations • The Human Genome Project produced a reference genome sequence that is 99.9% common to each human being • Sequence variations make our genetic makeup unique • Single-nucleotide polymorphisms (SNPs) are the most abundant, but other types of variation exist and are important

  3. How do we find variations? • Comparative analysis of multiple sequences from the same region of the genome (redundant sequence coverage) • Diverse sequence resources can be used: EST, WGS, BAC

  4. Steps of SNP discovery • Sequence clustering • Cluster refinement • Multiple alignment • SNP detection

  5. Computational SNP mining – PolyBayes • Two innovative ideas: 1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources 2. Use sequence quality information (base quality values) to distinguish true mismatches from sequencing errors [Figure labels: sequencing error, true polymorphism]
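
The second idea depends on turning a base quality value into an error probability. A minimal sketch, assuming the quality values are Phred-scaled (Q = -10 log10 of the error probability); the function names are illustrative and are not taken from PolyBayes:

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled base quality value into an error probability."""
    return 10 ** (-q / 10)


def prob_any_call_wrong(q_a, q_b):
    """Probability that at least one of two base calls is a sequencing error.

    Assumes the two calls fail independently (a simplification).
    """
    e_a = phred_to_error_prob(q_a)
    e_b = phred_to_error_prob(q_b)
    return 1.0 - (1.0 - e_a) * (1.0 - e_b)


# A mismatch between two high-quality calls is far less likely to be an
# artifact than one that involves a low-quality call.
print(prob_any_call_wrong(40, 40))
print(prob_any_call_wrong(10, 40))
```

Under this assumption, a mismatch between two Q40 calls is very unlikely to stem from a sequencing error, whereas a mismatch involving a Q10 call is suspect.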

  6. SNP discovery with PolyBayes • Starting from the genome reference sequence: 1. Fragment recruitment (database search) 2. Anchored alignment 3. Paralog identification 4. SNP detection

  7. Sequence clustering • Clustering simplifies to a search against a sequence database to recruit relevant sequences • Clusters = groups of overlapping sequence fragments matching the genome reference [Figure: fragments along the genome reference grouped into cluster 1, cluster 2, cluster 3]
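
Once each fragment has been placed on the reference by a database search, clustering reduces to grouping overlapping intervals. A minimal sketch, assuming every hit is a hypothetical `(fragment_name, start, end)` tuple in half-open reference coordinates; this illustrates the idea and is not the PolyBayes implementation:

```python
def cluster_hits(hits):
    """Group fragments whose reference intervals overlap.

    hits: iterable of (name, start, end) with half-open [start, end)
    coordinates on the genome reference (an assumed input format).
    Returns a list of clusters, each a list of fragment names.
    """
    clusters = []
    cluster_end = None
    for name, start, end in sorted(hits, key=lambda h: h[1]):
        if cluster_end is None or start >= cluster_end:
            # No overlap with the current cluster: open a new one.
            clusters.append([name])
            cluster_end = end
        else:
            # Overlaps the current cluster: join it and extend its reach.
            clusters[-1].append(name)
            cluster_end = max(cluster_end, end)
    return clusters


hits = [("f3", 300, 400), ("f1", 0, 100), ("f2", 50, 150),
        ("f4", 390, 500), ("f5", 1000, 1100)]
print(cluster_hits(hits))
```

Sorting by start position means each fragment only needs to be compared against the running end of the current cluster, so no all-against-all fragment comparison is required.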

  8. (Anchored) multiple alignment • The genomic reference sequence serves as an anchor • Fragments are aligned pair-wise to the genomic sequence • Insertions are propagated (“sequence padding”) • Advantages: efficient (only involves pair-wise comparisons); accurate (correctly aligns alternatively spliced ESTs)
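
The padding step can be sketched as follows. This is an illustrative reconstruction rather than PolyBayes code: it assumes each pair-wise alignment is given as a hypothetical `(start, ref_row, frag_row)` triple of equal-length gapped strings, where `-` in `ref_row` marks an insertion in the fragment, and it uses `*` as the pad character.

```python
from collections import defaultdict


def parse_pairwise(start, ref_row, frag_row):
    """Split one pair-wise alignment into aligned bases and insertions."""
    aligned = {}                  # reference position -> fragment character
    inserted = defaultdict(str)   # reference position -> bases inserted before it
    pos = start
    for r, f in zip(ref_row, frag_row):
        if r == "-":
            inserted[pos] += f
        else:
            aligned[pos] = f
            pos += 1
    return aligned, inserted


def anchored_multiple_alignment(ref, pairwise, pad="*"):
    """Merge pair-wise alignments into one multiple alignment.

    Every insertion is propagated to all rows as pad characters, so the
    columns stay in register without any fragment-to-fragment alignment.
    """
    parsed = [parse_pairwise(*p) for p in pairwise]
    widest = defaultdict(int)     # longest insertion seen before each position
    for _, inserted in parsed:
        for pos, seq in inserted.items():
            widest[pos] = max(widest[pos], len(seq))

    def build_row(aligned, inserted):
        row = []
        for pos in range(len(ref) + 1):
            row.append(inserted.get(pos, "").ljust(widest[pos], pad))
            if pos < len(ref):
                row.append(aligned.get(pos, " "))  # blank where not covered
        return "".join(row)

    ref_row = build_row(dict(enumerate(ref)), {})
    return [ref_row] + [build_row(a, i) for a, i in parsed]


rows = anchored_multiple_alignment(
    "ACGTACGT",
    [(0, "ACG-TAC", "ACGGTAC"),   # fragment with an inserted G
     (2, "GTACGT", "GTATGT")],    # fragment with a C/T mismatch
)
for row in rows:
    print(row)
```

Because each fragment is compared only against the reference, the cost grows linearly with the number of fragments, which is the efficiency advantage the slide mentions.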

  9. Paralog filtering • The “paralog problem”: unrecognized paralogs give rise to spurious SNP predictions; SNPs in duplicated regions may be useless for genotyping • Challenge: to differentiate between sequencing errors and paralogous differences [Figure labels: sequencing errors, paralogous difference]

  10. Paralog filtering • Bayesian discrimination algorithm • Pair-wise comparison between fragment and genomic sequence • Model of expected discrepancies: orthologous = sequencing error + polymorphisms; paralog = sequencing error + paralogous sequence difference
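
The two discrepancy models can be compared with Bayes' rule. The sketch below is a simplified illustration, not the PolyBayes model itself: it assumes a single average error rate per fragment, binomially distributed discrepancies, and hypothetical values for the pair-wise polymorphism rate (0.001), the paralogous divergence (0.02) and the prior (0.5).

```python
from math import exp, log


def log_likelihood(d, n, rate):
    """Binomial log-likelihood of d discrepancies in n aligned bases.

    The binomial coefficient is omitted because it is identical under
    both models and cancels in the posterior.
    """
    return d * log(rate) + (n - d) * log(1.0 - rate)


def prob_orthologous(d, n, err_rate, poly_rate=0.001,
                     paralog_rate=0.02, prior_orthologous=0.5):
    """Posterior probability that a fragment comes from the same locus.

    Orthologous model: discrepancy rate = err_rate + poly_rate.
    Paralog model:     discrepancy rate = err_rate + paralog_rate.
    All default rates and the prior are assumed, illustrative values.
    """
    log_orth = log(prior_orthologous) + log_likelihood(d, n, err_rate + poly_rate)
    log_para = log(1.0 - prior_orthologous) + log_likelihood(d, n, err_rate + paralog_rate)
    top = max(log_orth, log_para)   # subtract the maximum to avoid underflow
    w_orth = exp(log_orth - top)
    w_para = exp(log_para - top)
    return w_orth / (w_orth + w_para)


print(prob_orthologous(d=1, n=500, err_rate=0.002))    # few discrepancies
print(prob_orthologous(d=15, n=500, err_rate=0.002))   # many discrepancies
```

A fragment with far more discrepancies than sequencing error and polymorphism can explain receives a low posterior and would be excluded before SNP detection.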

  11. Paralog filtering

  12. SNP detection • Goal: to discern true variation from sequencing error [Figure labels: sequencing error, polymorphism]

  13. Bayesian-statistical SNP detection • Inputs to the Bayesian posterior probability: base call + base quality, expected polymorphism rate, base composition, depth of coverage [Figure: columns of A, C, G, T illustrating polymorphic and monomorphic permutations]

  14. Priors • Polymorphism rate in population, e.g. 1 / 300 bp • Distribution of SNPs according to minor allele frequency • Distribution of SNPs according to specific variation • Sample size (alignment depth)
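
How base calls, base qualities and the prior polymorphism rate combine into a posterior can be shown with a deliberately simplified sketch. It is not the PolyBayes calculation, which considers all permutations of true nucleotides and richer priors. The assumptions here: Phred-scaled qualities, errors spread evenly over the three other bases, biallelic sites only, a fixed allele frequency of 0.5, and uniform priors over the four monomorphic and six polymorphic hypotheses.

```python
from itertools import combinations

BASES = "ACGT"


def call_likelihood(call, err, true_base):
    """P(observed call | true base), spreading errors over the other 3 bases."""
    return 1.0 - err if call == true_base else err / 3.0


def snp_posterior(column, prior_poly=1 / 300, allele_freq=0.5):
    """Posterior probability that an alignment column is polymorphic.

    column: list of (base_call, phred_quality), one entry per fragment.
    prior_poly follows the 1 / 300 bp example rate; allele_freq is an
    assumed fixed value, not an estimate.
    """
    reads = [(call, 10 ** (-q / 10)) for call, q in column]

    def mono(base):
        p = 1.0
        for call, err in reads:
            p *= call_likelihood(call, err, base)
        return p

    def poly(a, b):
        p = 1.0
        for call, err in reads:
            p *= (allele_freq * call_likelihood(call, err, a)
                  + (1.0 - allele_freq) * call_likelihood(call, err, b))
        return p

    pairs = list(combinations(BASES, 2))
    like_mono = sum(mono(b) for b in BASES) / len(BASES)
    like_poly = sum(poly(a, b) for a, b in pairs) / len(pairs)
    weighted_poly = prior_poly * like_poly
    return weighted_poly / (weighted_poly + (1.0 - prior_poly) * like_mono)


print(snp_posterior([("A", 40)] * 3 + [("C", 40)] * 3))   # well supported
print(snp_posterior([("A", 40)] * 5 + [("C", 5)]))        # low-quality mismatch
```

Even in this reduced form, a minor allele seen in several high-quality reads scores far higher than a single low-quality mismatch.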

  15. SNP score [Figure labels: polymorphism, specific variation]

  16. Validation: pooled sequencing [Figure labels: African, Asian, Caucasian, Hispanic, CHM 1]

  17. Validation -- resequencing

  18. Properties of SNP detection algorithm • frequent alleles are easier to detect • high-quality alleles are easier to detect

  19. The PolyBayes software • http://genome.wustl.edu/gsc/polybayes • First statistically rigorous SNP discovery tool • Correctly analyzes alternative cDNA splice forms • Available for use (~70 licenses) • Marth et al., Nature Genetics, 1999

  20. SNP mining: genome BAC overlaps • overlap detection • SNP analysis • candidate SNP predictions • inter- & intra-chromosomal duplications • known human repeats • fragmentary nature of draft data

  21. BAC overlap mining results • ~30,000 clones • 25,901 clones (7,122 finished, 18,779 draft with base quality values) • 21,020 clone overlaps (124,356 fragment overlaps) • 507,152 high-quality candidate SNPs (validation rate 83-96%) • Marth et al., Nature Genetics 2001 [Figure: overlapping clone sequences CloneX and CloneY; example overlap ACCTAGGAGACTGAACTTACTG vs. ACCTAGGAGACCGAACTTACTG]
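
The example overlap on this slide differs at a single position. A minimal sketch of locating such candidate sites, assuming the two overlapping sequences are already aligned without gaps; the function name is illustrative, and a real pipeline would also weigh base quality values before reporting a candidate:

```python
def candidate_mismatches(seq_a, seq_b):
    """List (0-based position, base_a, base_b) where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


overlap_x = "ACCTAGGAGACTGAACTTACTG"
overlap_y = "ACCTAGGAGACCGAACTTACTG"
print(candidate_mismatches(overlap_x, overlap_y))
```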

  22. SNP mining projects 1. Short deletions/insertions (DIPs) in the BAC overlaps (Weber et al., AJHG 2002) 2. The SNP Consortium (TSC): polymorphism discovery in random, shotgun reads from whole-genome libraries (Sachidanandam et al., Nature 2001)

  23. Genotyping by sequence • SNP discovery usually deals with single-stranded (clonal) sequences • It is often necessary to determine the allele state of individuals at known polymorphic locations • Genotyping usually involves double-stranded DNA, so the possibility of heterozygosity exists • At a heterozygous site there is no unique underlying nucleotide and no meaningful base quality value, hence statistical methods of SNP discovery do not apply

  24. Genotyping [Figure labels: homozygous peak, heterozygous peak]
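
The slide does not say how heterozygous peaks are called. One common heuristic, shown here purely as an assumption, compares the two tallest trace peaks at the known polymorphic position and calls a heterozygote when the second peak is a substantial fraction of the first. The input format, the function name and the 0.3 threshold are all hypothetical.

```python
def call_genotype(peak_heights, min_ratio=0.3):
    """Call a genotype from trace peak heights at one known site.

    peak_heights: dict mapping each of "A", "C", "G", "T" to the peak
    height in its trace channel (an assumed input format).
    min_ratio: assumed minimum secondary/primary height ratio for a
    heterozygous call.
    """
    ranked = sorted(peak_heights.items(), key=lambda kv: kv[1], reverse=True)
    (top_base, top_height), (second_base, second_height) = ranked[0], ranked[1]
    if top_height <= 0:
        raise ValueError("no signal at this position")
    if second_height / top_height >= min_ratio:
        return "".join(sorted(top_base + second_base))   # heterozygous, e.g. "AG"
    return top_base * 2                                  # homozygous, e.g. "AA"


print(call_genotype({"A": 1200, "C": 30, "G": 20, "T": 10}))
print(call_genotype({"A": 800, "C": 20, "G": 700, "T": 15}))
```

A fixed ratio is only a starting point; peak heights depend on sequence context and chemistry, so a practical threshold would have to be calibrated on known genotypes.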