The 1000 Genomes Project

The 1000 Genomes Project • Current 1April 2009

Overview • Making personal genomics possible • A short history of the effects of disruptive sequencing technology • From 1 to 1001: • The goals and science of the 1000 Genomes Project • Bioinformatics infrastructure requirements (hint: they are large) • Project progress: Where are we today? • Collecting information and making it accessible to researchers • The 1000 Genomes DCC • EBI, European Read Archive, Ensembl and the public domain

Evolution of Large-scale Genome Analysis • 2000: The year of the human genome • Working drafts completed • All data freely released (eventually) • Project took about 10 years and cost about $3 billion • 2007: The year of the personal genome • Craig Venter; James Watson • 2008: 1000 Genomes project launched • First Solexa genomes published Yanhuang Project

2009 • Major genome centers produce the same number of base pairs every 6 hours at a cost approaching $50,000 • Dense genotyping assays are more than two orders of magnitude less expensive compared to the HapMap project • Making the 1000 Genomes Project and massive case-control studies of human variation possible • 1000 Genomes project is an international collaboration funded by the Wellcome Trust, the National Human Genome Research Institute, the Chinese Academy of Science, the German Science Federation and several corporate partners • Details at www.1000genomes.org

1000 Genomes Project: Primary goals • Overall: Create a deep catalogue of human variation to provide a better baseline to underpin human genetics • Discover shared variation (shared = not private to individual) and characterise by allele frequency • Aim for effectively all (not just a lot of) common variation • Structural variants as well as SNPs • Accessible because the project will used paired-end sequencing reads • Deeper discovery in gene regions, down to 0.5% to 0.1% MAF • Place variants in their haplotype context • Call and phase the variants on the sampled individuals • This will support tagging by genotyping and imputation based approaches • Synergistic with large-scale genotyping projects and whole genome association studies

1000 Genomes Project Details • Three pilots during 2008 with analysis currently ongoing • Pilot 1: 3x60 samples at 2x (6Gb) per person: • European CEU, African YRI, East Asian CHB/JPT • Pilot 2: CEU and YRI trios at 20x • Pilot 3: 1000 genes in 1000 people • Multiple platforms/protocols • Develop and evaluate methods for data collection and analysis • Two year main project • Population components now be finalised • All data has appropriate consent for full and unrestricted public release • This is a tremendous amount of raw data (between 500 terabytes and 1 petabyte for the main project)

Bioinformatics Requirements for 1000 Genomes • Basic Requirements • File formats • SRF: New technology sequencing does not produce the same type of raw data as Sanger-style sequencing • SAM/BAM: Alignment formats must be efficient if one is mapping half a trillion reads • Initial analysis tools • Most short read aligners incorporate the quality scores into the mapping • Advanced Requirements • Genome likelihood format for representing an individual genome with appropriate uncertainty • Advanced analysis tools (mostly under development) • SNP calling that are trio aware and population based • Assembly & Search

1000 Genomes data production • Over 150 terabytes of data deposited so far • Approximately 6,000,000,000,000 bp (6 terabases) have already been submitted • Perspective: During September and October the 1000 Genomes project produced the equivalent of EMBL/GenBank every week • Approaching 2000x total coverage • Raw and processed data is freely available • ftp://ftp.1000genomes.ebi.ac.uk • ftp://ftp-trace.ncbi.nih.gov/1000genomes/ • Data also available from the EBI’s European Read Archive and the NCBI’s Short Read Archive • Millions of new SNPs submitted to dbSNP

1000 Genomes Data Availability • Three data releases • Late December (trio SNPs) • Early February (first set of low coverage SNPs) • Late February (trio alignments) • 1000 Genomes Browser • Based on Ensembl • Visualization and portal for project data • Complement to 1000genomes.org website • Deep coverage trios plus other individuals will be available from main Ensembl browser this spring

1000 Genomes Browser

Data Transfer Infrastructure • Data transfer requirements are enormous • FTP does not work well for terabytes of data • “Old fashioned” solutions • Copy the data onto a hard drive and mail the hard drive around the world • (Significant personnel costs) • Infrastructure solutions • Create/buy dedicated lines for point to point transfer or direct connection to faster points on the backbone • Expensive to do collaborative analysis, but will probably be part of the solution • Advanced technology solutions • Asperasoft • Uses udp to transfer files to avoid tcp • Can quickly saturate connections

Conclusions • The 1000 Genomes Project will construct a human polymorphism reference resource • True baseline information for human genetics • New sequencing technologies make this possible, but high accuracy requires high coverage • A picture of population variation appears achievable with low coverage sequencing of many individuals • Good down to 1% • Poor for private/rare variants • Human disease studies will be major beneficiaries • The bioinformatics challenges are large and nearly everywhere in the project

Acknowledgements • Vertebrate Genomics Team • Mario Caccamo, Ilkka Lappalainen, Jonathan Hinton, Vasudev Kumanduri (European Genotype Archive) • Zam Iqbal, Laura Clarke (1000 Genomes DCC); Fiona Cunningham, Yuan Chen, Will McLaren(Ensembl Variation) • Nathan Johnson, Steven Wilder, Damian Keefe (Ensembl Functional Genomics) • Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Albert Vilella, Leo Gordon, Benoit Ballester (Ensembl Comparative Genomics) • NCBI portion of the 1000 Genomes DCC • Steve Sherry, Martin Shumway, Justin Paschall, Hoda Khouri • Ensembl Systems, Web and Genebuild Teams • EMBL Sequence Archive / European Trace Archive Team • EBI’s The Protein and Nucleotide Database Group • EBI Systems Group • 1000 Genomes Project • Funding: Wellcome Trust, EMBL

The 1000 Genomes Project

The 1000 Genomes Project

Presentation Transcript

Structural Variation in the 1000 Genomes Project

1000 Genomes Project Data Tutorial

Lessons learnt from the 1000 Genomes Project about sequencing in populations

The 1000 Genomes Project

The 1000 Genomes Project Lessons From Variant Calling and Genotyping

Towards Completion of the 1000 Genomes Project

1000 Genomes SV detection Boston College

1000 Genomes Project Haplotype Integration

The 1000 Genomes Project Tutorial

Accessing the 1000 Genomes Data

Released 1000 Genomes indels : 328,528

The Human Genomes

The 1000 Genomes project, Data Availability and Accessibility

The 1000 Genomes Project Advanced Information Laura Clarke

1000-films project

The 1000 Genomes Project Advanced Information Laura Clarke

The 1000 Genomes Project The Phase 1 Variant Set and Future Developments Laura Clarke

1000 Genomes Project Phase III Tutorial Structural Variants (SVs ) Eugene J. Gardner

The 1000 Genomes Project: A Tutorial

An introduction to the Y-chromosomal data from Phase 3 of the 1000 Genomes Project

1000 Genomes Tutorial

Disease, natural selection and the 1000 Genomes Project