1000G Pilot 3 Progress in silico analysis and comparison to experimental validation

Gabor Marth (Boston College) + A + L
Kiran Garimella (Broad Institute) + C
February 2, 2010

Acknowledgements
• Boston College: Amit Indap, Wen Fung Leong, Gabor Marth
• Cornell: Andy Clark
• Stanford: Simon Gravel, Carlos Bustamante
• Michigan: Tom Blackwell
• Baylor: Matthew Bainbridge, Fuli Yu, Donna Muzny, Richard Gibbs
• Broad: Chris Hartl, Kiran Garimella, Carrie Sougnez, Mark DePristo
• WUGSC: Dan Koboldt, Bob Fulton
• WTSI: Aarno Palotie

Data
• Capture targets:
  • Started with ~1,000 genes / ~10,000 exons / 2.3 Mb
  • 1.43 Mb of total target length shared between 4 data centers used for this analysis
• Samples:
  • 697 total samples
  • 7 populations
• Sequence coverage:
  • Goal was deep per-sample coverage
  • Effective coverage somewhat reduced by fragment duplications
• Capture technologies:
  • Nimblegen solid phase
  • Agilent liquid phase
• Sequencing technologies:
  • SLX
  • 454
• Data producers:
  • BCM
  • BI
  • WTSI
  • WUGSC

Pipelines
[Figure: pipeline diagram]
• SNP calling: run on each population sample (CEU, CHB, CHD, JPT, LWK, TSI, YRI) and on all 697 samples
• SNP statistics:
  • Segregating sites in each population sample
  • Union of all called sites in all 697 samples

BC and BI call sets are converging
[Figure panels: All called sites; Called sites per population (BC/BI intersection)]

SNP calls (per population)

SNP calls (all samples)
BI: 18,149 SNPs; BC: 14,502 SNPs; BC ∪ BI = 19,890 SNPs

Region    SNPs     dbSNPs   dbSNP %
BC only   1,741    79       4.54%
BC∩BI     12,761   3,869    30.32%
BI only   5,388    172      3.19%
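The totals on this slide follow from the three Venn regions, and that arithmetic is also what identifies the 1,741-site region as BC-only and the 5,388-site region as BI-only. The sketch below redoes the sums and the dbSNP percentages from the slide's counts. The variable and function names are illustrative and not part of either pipeline.

```python
# Counts copied from the slide; names are illustrative only.
bc_only = {"snps": 1741, "dbsnp": 79}
shared = {"snps": 12761, "dbsnp": 3869}   # BC∩BI
bi_only = {"snps": 5388, "dbsnp": 172}


def dbsnp_percent(region):
    """Percentage of a region's SNP sites that are already in dbSNP."""
    return 100.0 * region["dbsnp"] / region["snps"]


# Each caller's total is its exclusive region plus the shared region.
bc_total = bc_only["snps"] + shared["snps"]
bi_total = bi_only["snps"] + shared["snps"]
union_total = bc_only["snps"] + shared["snps"] + bi_only["snps"]

print(bc_total, bi_total, union_total)
for name, region in [("BC only", bc_only), ("BC∩BI", shared), ("BI only", bi_only)]:
    print(f"{name}: {dbsnp_percent(region):.2f}% in dbSNP")
```

The computed totals match the slide (14,502 for BC, 18,149 for BI, 19,890 for the union), so the three regions are internally consistent.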

Genotype call accuracy relative to HapMap3
• Data quality in CHB and JPT samples seems consistently lower
• Statistics only include genotype calls at SNP sites in BC∩BI

Genotype calls
• Filtering:
  • BC filters on genotype call quality
  • BI (the Broad caller) does not filter on genotype quality; it reports a genotype for any site covered by at least one read
• Nominally, BI makes more calls than BC and has, on average, higher allele frequency (AF)
• BC vs. BI allele frequency (figure panels):
  • All SNP sites considered: # SNP sites = 3,489, r = 0.9921
  • Only SNP sites with >= 80% called genotypes: # SNP sites = 3,075, r = 0.9979
• Good allele frequency concordance between BC and BI
• At genotype calls that pass the BC filter and where BI also makes a call, no discordance was found
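The allele frequency comparison can be sketched as follows: compute each caller's alternate-allele frequency per site from its called genotypes, optionally drop sites below a call-rate threshold, and take the Pearson correlation across the remaining sites. This is a minimal illustration, not either group's code. The data layout (alt-allele counts 0/1/2, `None` for a missing call), the function names, the toy genotypes, and the choice to require the call rate in both callers are all assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def call_rate(genotypes):
    """Fraction of samples with a called genotype at one site."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def alt_allele_frequency(genotypes):
    """Alt-allele frequency among called genotypes (None if nothing is called)."""
    called = [g for g in genotypes if g is not None]
    return sum(called) / (2 * len(called)) if called else None


def af_concordance(bc_sites, bi_sites, min_call_rate=0.0):
    """Return (number of sites, Pearson r) over sites present in both call sets.

    Assumption: a site is kept only if BOTH callers reach min_call_rate.
    """
    xs, ys = [], []
    for site in sorted(bc_sites.keys() & bi_sites.keys()):
        bc, bi = bc_sites[site], bi_sites[site]
        if min(call_rate(bc), call_rate(bi)) < min_call_rate:
            continue
        f_bc, f_bi = alt_allele_frequency(bc), alt_allele_frequency(bi)
        if f_bc is None or f_bi is None:
            continue
        xs.append(f_bc)
        ys.append(f_bi)
    return len(xs), pearson_r(xs, ys)


# Toy genotypes for five samples at four sites (hypothetical data).
bc_calls = {"s1": [0, 0, 1, None, 0], "s2": [1, 2, 1, 1, 2],
            "s3": [None, None, None, 1, 0], "s4": [0, 1, 0, 0, 0]}
bi_calls = {"s1": [0, 0, 1, 1, 0], "s2": [1, 2, 1, 1, 2],
            "s3": [0, 0, 1, 1, 0], "s4": [0, 1, 0, 0, 0]}

n_all, r_all = af_concordance(bc_calls, bi_calls)
n_filtered, r_filtered = af_concordance(bc_calls, bi_calls, min_call_rate=0.8)
print(n_all, n_filtered)
```

In the toy data the sparsely called site `s3` is dropped by the 80% threshold, which mirrors how the filtered panel on the slide has fewer sites than the unfiltered one.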

1KG validation executive summary
• Evaluated BI and BC calls against validation
• 1KG chip [1]
  • 312/697 samples across 7 populations represented
  • ~300 sites (150 novel) overlap with Pilot 3 target region
• Concordance with 1KG chip is very high
  • Where covered (> 5 reads):
    • 302/312 (97%) of samples have >90% variant sensitivity
    • 269/312 (86%) of samples have >90% genotype sensitivity
• Remaining disparities between 1KG chip and Pilot 3 calls can be explained by data quality issues
  • Later sequencing has far greater concordance with chip than earlier sequencing
[1] Details in Appendix
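A per-sample comparison against the chip could look like the sketch below. The slide does not define its metrics, so the definitions here are assumptions: variant sensitivity is the fraction of chip non-reference sites that are called non-reference, genotype sensitivity additionally requires the exact genotype to match, and PPV is the fraction of non-reference calls confirmed by the chip. Only sites covered by more than 5 reads are evaluated, as on the slide. All names and the toy data are hypothetical.

```python
def sample_metrics(chip, calls, depth, min_depth=6):
    """Compare one sample's sequencing calls with its chip genotypes.

    chip, calls: dict site -> alt-allele count (0, 1 or 2)
    depth:       dict site -> number of reads covering the site
    min_depth=6 encodes the slide's "> 5 reads" condition.
    """
    tp = fn = fp = genotype_match = 0
    for site, truth in chip.items():
        if depth.get(site, 0) < min_depth:
            continue  # not covered well enough to evaluate
        called = calls.get(site, 0)  # assumption: a missing call counts as hom-ref
        if truth > 0:
            if called > 0:
                tp += 1
                genotype_match += int(called == truth)
            else:
                fn += 1
        elif called > 0:
            fp += 1
    n_variant = tp + fn
    n_called = tp + fp
    return {
        "variant_sensitivity": tp / n_variant if n_variant else None,
        "genotype_sensitivity": genotype_match / n_variant if n_variant else None,
        "ppv": tp / n_called if n_called else None,
    }


# Toy sample: site "d" is skipped because only 3 reads cover it.
chip = {"a": 1, "b": 2, "c": 0, "d": 1, "e": 0}
calls = {"a": 1, "b": 1, "c": 1, "d": 0, "e": 0}
depth = {"a": 10, "b": 8, "c": 7, "d": 3, "e": 20}
metrics = sample_metrics(chip, calls, depth)
print(metrics)
```

Under these definitions genotype sensitivity can never exceed variant sensitivity, which is consistent with fewer samples clearing the 90% genotype threshold (269) than the 90% variant threshold (302).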

Nearly all samples in call-set overlap have high sensitivity and specificity
[Figure: per-sample sensitivity and PPV; x-axis: Pilot 3 individual (312 individuals total after eliminating low-coverage samples)]
• All but one sample with low PPV (false-positive rate > 10%) are among the earliest-sequenced samples (JPT/CHB/CHD)
• The 10 low-sensitivity samples have strange allele balances and are likely contaminated

Mean sensitivity/PPV per population is good, and improves on more recently-sequenced populations
[Figure: mean sensitivity and PPV per population. Panel annotations (sequencing date, platform, centers): 8/2008 ILMN/454 All Ctrs; 8/2008 ILMN/454 All Ctrs; 8/2008 ILMN/454 BI/BCM; 8/2008 ILMN/454 BI/BCM; 1/2009 454 BCM; 10/2008 ILMN BI/SC; 2008/2009 ILMN/454 All Ctrs. N samples: 13, 69, 27, 102, 69, 3, 24]

Low-frequency / singleton validation: executive summary
• Low-frequency Sequenom assay [1]
  • Chose 105 putative novel singletons from early Pilot 3 46-CEU-sample callsets (called in at least 2/4 callers)
  • Validated sites in those 46 individuals
    • 89/105 are true singletons
    • 16/105 are false-positive singletons (hom-refs and two non-singletons)
• Concordance with low-frequency assay is very high
  • Callsets today (January 2010)
    • In BI and BC overlap, recovered 71/89 (80%) of assayed singletons with 0 false-positives and 0 non-singletons
    • In BI and BC union, recovered all 89 singletons with 3 false-positives and 0 non-singletons
[1] Details in Appendix
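The trade-off between the overlap and the union can be made explicit with a little arithmetic on the counts above. In this sketch PPV is computed over the assayed sites only, which is an assumption about how to read the slide. The function name is illustrative.

```python
TRUE_SINGLETONS = 89  # validated true singletons among the 105 assayed sites


def summarize(recovered, false_positives, n_true=TRUE_SINGLETONS):
    """Return (sensitivity, PPV) for one call set, restricted to assayed sites."""
    sensitivity = recovered / n_true
    ppv = recovered / (recovered + false_positives)
    return sensitivity, ppv


overlap = summarize(recovered=71, false_positives=0)   # BI and BC overlap
union = summarize(recovered=89, false_positives=3)     # BI and BC union

for name, (sens, ppv) in [("overlap", overlap), ("union", union)]:
    print(f"{name}: sensitivity={sens:.1%}, PPV={ppv:.1%}")
```

On these numbers the overlap gives up about 20% of the true singletons in exchange for no assayed false positives, while the union recovers every singleton at a PPV of 89/92.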

Callers are able to detect most singletons with a very low false-positive rate
• Joint calls find every singleton in the assay, with exceedingly few false positives

Conclusions / future directions
• Data quality has improved significantly over the life of the project
• Both BC and BI pipelines produce high-quality call sets
  • Good agreement between call sets
  • Intersection highly concordant with experimental validation data
  • Estimated FP rate below 5%
• The current Pilot 3 release is the BC∩BI (intersection) call set
• We are proceeding with validations
  • Dual focus: accuracy and functional classes
  • Results will inform future releases

APPENDIX

Population spectrum of called SNPs

Population spectrum of called SNPs
• Observation: BC calls more SNPs at the population level, but fewer SNP sites overall
• Reason: BC tends to call the same site in more populations…
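A toy example shows how both halves of this observation can hold at once: a caller that reports the same sites in every population has large per-population counts but a small union, while a caller that reports different sites in each population has the opposite profile. The site IDs, population labels, and caller names below are hypothetical.

```python
def per_population_counts(calls):
    """Number of called sites in each population (calls: dict pop -> set of sites)."""
    return {pop: len(sites) for pop, sites in calls.items()}


def union_size(calls):
    """Number of distinct sites called in at least one population."""
    return len(set().union(*calls.values()))


# Caller A repeats the same three sites in every population.
caller_a = {"pop1": {1, 2, 3}, "pop2": {1, 2, 3}, "pop3": {1, 2, 3}}
# Caller B calls fewer sites per population, but they differ between populations.
caller_b = {"pop1": {1, 2}, "pop2": {4, 5}, "pop3": {6, 7}}

print(per_population_counts(caller_a), union_size(caller_a))
print(per_population_counts(caller_b), union_size(caller_b))
```

Caller A has more sites in every population (3 versus 2) yet fewer sites overall (3 versus 6), which is the pattern the slide attributes to BC.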

BC/BI SNP calls per population (more detail)

SNP calls (per population)

Broad & BC calls per population
[Each slide: Venn diagram of BC and Broad calls. Rows are in slide order; the shared region is labeled BC∩Broad, and the two caller-specific regions are not attributed to a caller here.]

Population  Region           SNPs    dbSNP (%)        Ts/Tv
CEU         caller-specific  613     122 (19.90%)     0.92
CEU         BC∩Broad         3,489   2,300 (65.92%)   3.47
CEU         caller-specific  327     52 (15.90%)      1.32
CHB         caller-specific  925     247 (26.70%)     1.23
CHB         BC∩Broad         3,415   1,795 (52.56%)   3.74
CHB         caller-specific  557     32 (5.75%)       1.37
CHD         BC∩Broad         3,431   1,724 (50.25%)   3.64
CHD         caller-specific  450     31 (6.44%)       1.33
CHD         caller-specific  831     200 (24.07%)     1.68
JPT         caller-specific  983     271 (27.57%)     1.54
JPT         BC∩Broad         2,900   1,679 (57.90%)   3.67
JPT         caller-specific  1,819   31 (1.70%)       0.74
LWK         caller-specific  580     136 (23.45%)     2.09
LWK         BC∩Broad         5,459   2,736 (50.12%)   3.67
LWK         caller-specific  911     89 (9.77%)       1.56
TSI         caller-specific  448     105 (23.44%)     0.71
TSI         BC∩Broad         3,281   2,152 (65.59%)   3.54
TSI         caller-specific  1,004   48 (4.78%)       0.85
YRI         caller-specific  716     112 (15.64%)     0.95
YRI         BC∩Broad         5,175   2,785 (53.82%)   3.56
YRI         caller-specific  694     71 (10.23%)      1.48
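The Ts/Tv figures on these slides are transition/transversion ratios. As a reminder of how the statistic is computed, the sketch below classifies each SNP by its reference and alternate base: a transition stays within the purines (A, G) or within the pyrimidines (C, T), and every other substitution is a transversion. The function names and the toy SNP list are illustrative and are not taken from the BC or Broad pipelines.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions; assumes ref != alt."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Transition/transversion ratio for an iterable of (ref, alt) pairs."""
    snps = list(snps)
    ts = sum(is_transition(ref, alt) for ref, alt in snps)
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")


# Toy input: three transitions and two transversions.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
ratio = ts_tv_ratio(toy_snps)
print(ratio)
```

Because there are twice as many possible transversions as transitions, substitutions drawn uniformly at random would give a ratio near 0.5, which is why a low Ts/Tv in a call set is commonly read as a sign of more false positives.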

BC vs. BI allele frequency comparisons per population at SNPs in the BC∩BI call set

BC/BI genotype calls (CHB & CHD)

Population  SNP set                              #sites  r
CHB         All SNPs                             3,415   0.9925
CHB         SNPs with >= 80% called genotypes    3,028   0.9993
CHD         All SNPs                             3,431   0.9941
CHD         SNPs with >= 80% called genotypes    3,310   0.9991

BC/BI genotype calls (TSI & JPT)

Population  SNP set                              #sites  r
JPT         All SNPs                             2,900   0.9922
JPT         SNPs with >= 80% called genotypes    2,370   0.9991
TSI         All SNPs                             3,281   0.9912
TSI         SNPs with >= 80% called genotypes    3,108   0.9973

BC/BI genotype calls (LWK & YRI)

Population  SNP set                              #sites  r
LWK         All SNPs                             5,459   0.9924
LWK         SNPs with >= 80% called genotypes    5,337   0.9984
YRI         All SNPs                             5,175   0.9917
YRI         SNPs with >= 80% called genotypes    4,276   0.9978

Low frequency / singleton validation design

Per population PPV and sensitivity
