
Completion of the 1000 Genomes Project

Goncalo Abecasis, University of Michigan School of Public Health





Presentation Transcript


  1. Completion of the 1000 Genomes Project
     Goncalo Abecasis
     University of Michigan School of Public Health

  2. Project Goals (2008)
     • Discover >95% of accessible genetic variants with a frequency of >1% in each of multiple continental regions
     • Extend discovery effort to lower frequency variants in coding regions of the genome
     • Define haplotype structure in the genome

  3. Pilot Projects (2010)
     • 2 deeply sequenced trios
     • 179 whole genomes sequenced at low coverage
     • 8,820 exons deeply sequenced in 697 individuals
     • 15M SNPs, 1M indels, 20,000 structural variants

  4. Phase I (2012)
     • More diverse set of populations sequenced
     • Total >1,092 individuals (EUR, ASN, AFR, AMR groupings)
     • >38.5 million SNPs
       • 8.5M sites discovered before project (dbSNP 129)
       • 30M sites newly discovered
       • 98.9% of HapMap III sites rediscovered
       • Transition/transversion ratio of 2.16 vs 2.04 in pilot
     • ~1.5M insertion/deletion polymorphisms
     • Data: ftp://ftp.1000genomes.ebi.ac.uk and ftp://ftp.ncbi.nlm.nih.gov/1000genomes/
     Reference: The 1000 Genomes Project (Nature, 2012)
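
The transition/transversion (Ts/Tv) ratio quoted on this slide is a common sanity check on a SNP callset: random errors would give a ratio near 0.5, so a higher ratio is usually read as a cleaner callset. The sketch below shows how such a ratio can be computed from (ref, alt) pairs. The function names and the example SNPs are illustrative assumptions, not the project's code or data.

```python
# Minimal Ts/Tv sketch. Names and example data are hypothetical.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Return transitions / transversions for an iterable of (ref, alt) bases."""
    ts = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            raise ValueError("ref and alt must differ for a SNP")
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ts / tv


# Toy input: three transitions (A>G, C>T, G>A) and two transversions.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
ratio = ts_tv_ratio(example_snps)
```

On a real callset the pairs would come from the REF and ALT columns of biallelic SNP records.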

  5. Phase 1 samples
     [Figure: bubble size = sample size]

  6. Samples in the final phase
     [Figure: bubble size = sample size]

  7. 1000 Genomes data generation
     [Figure: 1000 Genomes data in terabases (axis 0 to 80) by year (2010 to 2013), with the pilot data and the Phase 1 data freeze marked]
     • Total Dataset: 84 TB of BAM Files
     • Data Generation Complete: May 2013

  8. Multiple input variant callsets
     Short Variants (SNPs, indels)
     • Baylor College of Medicine - SNPTOOLs
     • Sanger - SAMtools
     • Sanger - SGA-Dindel
     • Broad - UnifiedGenotyper
     • Broad - HaplotypeCaller
     • Boston College - Freebayes
     • UMich - GotCloud
     • Oxford - Platypus
     • Oxford - Cortex
     • Stanford - RealtimeGenomics
     Structural Variants
     • Variation Hunter
     • MD Anderson TIGRA
     • Delly
     • Invy
     • Genome STRiP
     • Breakdancer
     • Dinumt
     • CNVnator
     • Boston College ALUs
     • Pindel
     • UMaryland MELT
     • UW Deletions

  9. Phase 3 variant calling pipeline
     [Flow diagram; labels as they appear on the slide]
     • Stages: Raw data → 24 initial callsets → Consensus callsets → Final callset
     • Raw data: Low Coverage and Exome Read Data; Genotyping arrays; PCR-free data
     • 24 initial callsets (Callset 1 … Callset n): 10 SNP/INDEL callsets, 2 STR callsets, 12 SV callsets
     • Consensus callsets: SNPs and high confidence indels; Multi-allelic SNPs, indels, and MNPs; Structural Variants; Short Tandem Repeats
     • Steps: Quality assessment and filtering; Integration; Phasing; Phasing of multi-allelic variants onto haplotype scaffold
     • Final callset: Integrated Haplotypes

  10. More and longer indels
      [Figure: indel length plots with 4bp, 8bp and 12bp marked]

  11. Quality Control of Short Variants
      • For short variants, the high coverage PCR-free data from 26 individuals was used to assess the false discovery rate for each variant type.
      • An allele is considered ‘validated’ if multiple supporting reads can be identified in PCR-free data.
      • Sites included in the Phase 3 haplotypes have been selected to control the allele False Discovery Rate at 5%.
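
The validation rule and the 5% target above can be illustrated with a small sketch. It assumes that "multiple supporting reads" means at least two reads, and that each candidate allele carries some quality score on which a cutoff can be chosen. Both are assumptions for illustration; the slide does not say how sites were actually ranked or selected.

```python
# Sketch of FDR estimation against PCR-free validation data.
# min_reads=2 and the per-call "score" are assumptions, not from the slides.

def is_validated(supporting_reads, min_reads=2):
    """An allele counts as validated if enough PCR-free reads support it."""
    return supporting_reads >= min_reads


def false_discovery_rate(read_counts, min_reads=2):
    """Fraction of called alleles that fail validation."""
    if not read_counts:
        raise ValueError("need at least one call")
    failed = sum(1 for n in read_counts if not is_validated(n, min_reads))
    return failed / len(read_counts)


def loosest_threshold(calls, max_fdr=0.05, min_reads=2):
    """calls: list of (score, supporting_reads) pairs.

    Returns the lowest score cutoff for which the calls with
    score >= cutoff have an FDR of at most max_fdr, or None.
    """
    best = None
    for cutoff in sorted({score for score, _ in calls}, reverse=True):
        kept = [reads for score, reads in calls if score >= cutoff]
        if false_discovery_rate(kept, min_reads) <= max_fdr:
            best = cutoff
    return best


# Toy data: 19 well-supported calls, 1 unsupported call at score 0.8,
# and 5 unsupported low-score calls.
calls = [(0.9, 5)] * 19 + [(0.8, 0)] + [(0.2, 0)] * 5
cutoff = loosest_threshold(calls)
```

With this toy data, keeping calls at score 0.8 and above leaves 1 unvalidated allele out of 20, which sits exactly at the 5% limit.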

  12. Quality Control of Structural Variants
      • CNVs (deletions, duplications)
        • Custom 400K aCGH array
        • Custom 1M aCGH array
        • Omni 2.5 probe intensities
        • Affy6.0 probe intensities
        • Complete Genomics data
      • Mobile element insertions (MEIs)
        • PCR (N ~ 100 per MEI class)
        • Custom aCGH, incl. targeted breakpoint probes
      • NUMTs (nuclear mitochondrial sequence insertions)
        • PCR (N = 50)
        • PacBio Amplicon-Seq
      • Inversions
        • PCR (N ~ 50)
        • PacBio Amplicon-Seq
      • Multiple SV types
        • PacBio NA12878 WGS data to resolve breakpoints & verify sites
        • PCR-free trios to verify sites and genotypes

  13. Verification & further characterization of inversions by PacBio sequencing
      • Regular (“simple”) inversion
      • Inversion with flanking deletion
      • Complex SVs with inverted sequences

  14. Contribution of the 1000G to dbSNP
      • 61.9 million novel variants discovered by 1000G (62% of dbSNP)
      • 22.4 million variants validated by 1000G (58.4% of non-1000G variants)
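
The two percentages on this slide imply approximate totals that the slide does not state. The short calculation below back-computes them, assuming that "62% of dbSNP" refers to all dbSNP variants at the time and that "non-1000G variants" means dbSNP variants not first contributed by 1000G. The results are rounded estimates derived from the slide's figures, not reported numbers.

```python
# Back-of-the-envelope check using only the figures on the slide.
novel_by_1000g = 61.9e6        # novel variants discovered by 1000G
novel_share_of_dbsnp = 0.62    # stated as 62% of dbSNP
validated_by_1000g = 22.4e6    # variants validated by 1000G
validated_share = 0.584        # stated as 58.4% of non-1000G variants

# Implied totals (assumptions about the denominators are noted above).
implied_dbsnp_total = novel_by_1000g / novel_share_of_dbsnp
implied_non_1000g = validated_by_1000g / validated_share

print(f"Implied dbSNP size:        {implied_dbsnp_total / 1e6:.1f}M")
print(f"Implied non-1000G variants: {implied_non_1000g / 1e6:.1f}M")
```

The two implied groups (61.9M novel plus about 38.4M non-1000G) add up to roughly 100M, which agrees with the implied dbSNP size to within rounding of the percentages.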

  15. Private vs. Shared Variation (Population View)

  16. Private vs. Shared Variation (Individual View)
      [Figure: two panels, "Variants per Genome" and "Variants per Genome (Zoom)"; axis labels 5 million and 160,000]
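
As a rough illustration of the private versus shared distinction in the two slides above, the sketch below labels a variant "private" when carriers are seen in exactly one population and "shared" otherwise. This definition, the variant IDs, and the carrier data are assumptions for illustration; the slides do not give the exact criteria used.

```python
# Sketch: classify variants as private to one population or shared.
# The definition and the example data are hypothetical.
from collections import Counter


def classify_variant(carrier_populations):
    """Return 'private' if carriers come from one population, else 'shared'."""
    populations = set(carrier_populations)
    if not populations:
        raise ValueError("variant has no carriers")
    return "private" if len(populations) == 1 else "shared"


def summarize(variants):
    """variants: dict of variant id -> populations with at least one carrier."""
    return Counter(classify_variant(pops) for pops in variants.values())


# Hypothetical variants with the populations in which they were observed.
example_variants = {
    "var1": ["YRI"],
    "var2": ["YRI", "GBR", "CHB"],
    "var3": ["GBR"],
    "var4": ["CHB", "GBR"],
}
counts = summarize(example_variants)
```

The same function works at the continental level if the labels are groupings such as EUR or AFR rather than individual populations.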

  17. Variants per genome

  18. Population histories

  19. Biases in Variation Databases?
      [Figure: ClinVar Variants per Genome plotted against Genetic Distance (Fst) to CEU]
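
Fst, the genetic distance on this slide, measures how much allele frequencies differ between populations. The slide does not say which estimator was used, so the sketch below uses the textbook heterozygosity form, Fst = (H_T - H_S) / H_T, for one biallelic site and two equally weighted populations. Treat it as one simple definition, not the calculation behind the figure.

```python
# Sketch of Wright's Fst at a single biallelic site for two populations.
# Equal weighting of the two populations is an assumption.

def fst_two_populations(p1, p2):
    """p1, p2: frequency of the same allele in each population (0..1)."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0  # site is monomorphic in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


same = fst_two_populations(0.5, 0.5)      # identical frequencies
fixed = fst_two_populations(1.0, 0.0)     # fixed for different alleles
partial = fst_two_populations(0.2, 0.8)   # intermediate differentiation
```

Genome-wide estimates are usually built by combining many sites, and sample-size-corrected estimators are preferred in practice.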

  20. Imputation Accuracy
      TODO: Multiallelic SNPs and indels to be renamed
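
The transcript does not say how imputation accuracy was measured on this slide. A commonly used metric is the squared correlation (r²) between imputed allele dosages and genotypes from an independent assay, and the sketch below computes it in plain Python. The metric choice and the example values are assumptions.

```python
# Sketch: imputation accuracy as squared Pearson correlation (r^2)
# between imputed dosages and true genotypes. Example data is made up.

def squared_correlation(imputed, truth):
    """imputed: dosages in [0, 2]; truth: genotypes coded 0, 1 or 2."""
    n = len(imputed)
    if n != len(truth) or n < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mean_x = sum(imputed) / n
    mean_y = sum(truth) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, truth))
    sxx = sum((x - mean_x) ** 2 for x in imputed)
    syy = sum((y - mean_y) ** 2 for y in truth)
    if sxx == 0 or syy == 0:
        raise ValueError("r^2 is undefined when one input has no variation")
    return (sxy * sxy) / (sxx * syy)


true_genotypes = [0, 1, 2, 0]
imputed_dosages = [0.1, 0.9, 1.8, 0.2]
r2 = squared_correlation(imputed_dosages, true_genotypes)
```

Values close to 1 indicate that the imputed dosages track the true genotypes well; accuracy typically drops for rarer variants.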

  21. AMD Imputation Example #1

  22. Imputation Example #2: del443ins54

  23. Parting Thoughts
      • Variation is extremely rare
      • In any one genome, nearly all variation is shared …
      • But almost all variants are unique to a population or continent
      • Great benefits to integrated analyses
      • But analysis still requires time comparable to data generation
      • Major improvements in genome coverage, variant quality and integration
      • Advances can be transferred to disease studies through imputation

  24. Phase 3 release
      • The Phase 3 final release is available at www.1000genomes.org.
