1 / 27

The 1000 Genomes Project Lessons From Variant Calling and Genotyping

The 1000 Genomes Project Lessons From Variant Calling and Genotyping. October 13 th , 2011 Hyun Min Kang University of Michigan, Ann Arbor. Overview of Phase 1 CALL SET . 1000 Genomes integrated genotypes. Low-pass Genomes. Low-pass Genomes. Low-pass Genomes. Low-pass Genomes.

chyna
Télécharger la présentation

The 1000 Genomes Project Lessons From Variant Calling and Genotyping

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The 1000 Genomes ProjectLessons From Variant Calling and Genotyping October 13th, 2011 Hyun Min Kang University of Michigan, Ann Arbor

  2. Overview of Phase 1 CALL SET

  3. 1000 Genomes integrated genotypes Low-pass Genomes Low-pass Genomes Low-pass Genomes Low-pass Genomes Low-pass Genomes Low-pass Genomes Low-pass Genomes Low-pass Genomes Deep Exomes Low-pass Genomes SVs 15k SNPs 38M INDELs 4.0M Integrated Genotypes ~42M

  4. Methods for integrated genotypes

  5. From PILOT to PHASE1 • PHASE1 • 36.8M SNPs • Ts/Tv 2.17 • Includes 98.9% HapMap3 • PILOT • 14.8M SNPs • Ts/Tv 2.01 • Includes 97.8% HapMap3 Autosomal chromosomes only

  6. From PILOT to PHASE1 • PHASE1-only • 23.8M SNPs • Ts/Tv 2.16 • Includes 1.2% of HapMap3 • PILOT ∧ PHASE1 • 13.1M SNPs • Ts/Tv 2.18 • Includes 97.7% of HapMap3 • PILOT-only • 1.7M SNPs • Ts/Tv 1.11 • Includes 0.15% HapMap3

  7. From PILOT to PHASE1 100k monomorphic SNPs in 2.5M OMNI Array (>1,000 individuals) • PHASE1-only • 23.8M SNPs • Ts/Tv 2.16 • Includes 1.2% of HapMap3 • PILOT ∧ PHASE1 • 13.1M SNPs • Ts/Tv 2.18 • Includes 97.7% of HapMap3 • PILOT-only • 1.7M SNPs • Ts/Tv 1.11 • Includes 0.15% HapMap3

  8. From PILOT to PHASE1 : Improved SNP calls 100k monomorphic SNPs in 2.5M OMNI Array (>1,000 individuals) • PHASE1-only • 23.8M SNPs • Ts/Tv 2.16 • Includes 1.2% of HapMap3 • PILOT ∧ PHASE1 • 13.1M SNPs • Ts/Tv 2.18 • Includes 97.7% of HapMap3 0.3% of OMNI-MONO • PILOT-only • 1.7M SNPs • Ts/Tv 1.11 • Includes 0.15% HapMap3 1.2% of OMNI-MONO 59.6% of OMNI-MONO OMNI-MONO information was not used in making phase1 variant calls

  9. improvement in methodssince pilot

  10. 1000 Genomes’ engines for improved variant calls and genotypes • INDEL realignment • Per Base Alignment Quality (BAQ) adjustment • Robust consensus SNP selection strategy • Variant Quality Score Recalibration (VQSR) • Support Vector Machine (SVM) • improved Genotype Likelihood Calculation • BAM-specific Binomial Mixture Model (BBMM) • Leveraging off-target exome reads

  11. INDEL Realignment : How it works… • Given a list of potential indels… • Check if reads consistent with SNP or indel • Adjust alignment as needed • Greatly reduces false-positive SNP calls Ref: AAGCGTCGG AAGCGT AAGCGTC AAGCGTCG AAGCGC AAGCGCG AAGCG-CGG AAGCGTCG AAGCGT AAGCGTC AAGCGTCG AAGCG-C AAGCG-CG AAGCG-CGG Read pile consistent with the reference sequence With one read, hard to choose between alternatives Read pile consistent with a 1bp deletion Neighboring SNPs? Short Indel? Eric Banks and Mark DePristo

  12. Per Base Alignment Qualities Short Read GATAGCTAGCTAGCTGATGA GCCG 5’-AGCTGATAGCTAGCTAGCTGATGAGCCCGATC-3’ Reference Genome Heng Li

  13. Per Base Alignment Qualities Should we insert a gap? Short Read GATAGCTAGCTAGCTGATGAGCC-G 5’-AGCTGATAGCTAGCTAGCTGATGAGCCCGATC-3’ Reference Genome Heng Li

  14. Per Base Alignment Qualities Compensate for Alignment Uncertainty With Lower Base Quality Short Read GATAGCTAGCTAGCTGATGAGCCG 5’-AGCTGATAGCTAGCTAGCTGATGAGCCCGATC-3’ Reference Genome Heng Li

  15. Per Base Alignment Qualities Compensate for Alignment Uncertainty With Lower Base Quality Short Read GATAGCTAGCTAGCTGATGAGCCG 5’-AGCTGATAGCTAGCTAGCTGATGAGCCCGATC-3’ Reference Genome Improves quality near new indels and sequencing artifacts Heng Li

  16. Producing high-quality consensus call sets Ryan Poplin

  17. Consensus SNP site selection under multidimensional feature space Feature 3 Feature 2 • Filter PASS • Filter FAIL Feature 1 Goo Jun – Joint variant calling and … - Platform 192, Friday 5:30 (Room 517A)

  18. Improved likelihood estimationproduces more accurate genotypes Evaluation in April 2011 KEY IDEA in BBMM: Re-estimate the genotype likelihood by clustering the variants based on the read distribution HET-discordance(MAQ model) HET-discordance(BBMM) Fuli Yu, James Lu – An integrative variant analysis… - Poster 842W

  19. Off-target exome reads improves genotype quality Integrated on-target coding genotypes are also more accurate than low-coverage-only or exome-only platforms

  20. Genotype Qualities in SVs and INDELs Bob Handsaker

  21. More in-depth view ofphase 1 integrated genotypes

  22. Sensitivity at low-frequency SNPs 0.1% 0.5% 1.0%

  23. >96% SNPs are detected compared to deep genomes

  24. Genotype discordance by frequency Genotype Discordance

  25. Impact of sequencing depth on genotype accuracy(interim integrated panel, chr20)

  26. Highlights • The quality of phase 1 call set is much more improved compared to pilot call set • 1000G engines for phase1 variant calls produced high-sensitivity, high-specificity variant calls • >99% of genotypes are concordant with array-based genotypes • Likelihood-based integrated improves off-target & on-target genotyping qualities

  27. Acknowledgements The 1000 Genomes Project 1000 Genomes Analysis Group

More Related