
Resolving membership in a study in shared aggregate genetics data

Resolving membership in a study in shared aggregate genetics data. David W. Craig, Ph.D. Investigator & Associate Director Neurogenomics Division dcraig@tgen.org. Genome-wide Association Studies.


Presentation Transcript


  1. Resolving membership in a study in shared aggregate genetics data
     David W. Craig, Ph.D.
     Investigator & Associate Director, Neurogenomics Division
     dcraig@tgen.org

  2. Genome-wide Association Studies
     • Genome-wide Association Studies (GWAS) genotype millions of Single Nucleotide Polymorphisms (SNPs) across thousands of individuals.
     • SNPs are typically biallelic and, in a diploid genome, have three possible genotypes:
        • CC/CT/TT
        • 00/01/11
     • Due to ancestral meiotic recombination, SNPs are not independent of neighboring variants. They are often in linkage disequilibrium (LD).
     • Because of LD, a SNP may be associated with disease through its underlying correlation with a different, functional variant.
     • Summary stats for a SNP across hundreds or thousands of individuals:
        • 33% C / 77% T for cases and 45% C / 55% T for controls
        • P = 10^-8
        • CC = 508 / CT = 250 / TT = 108
        • OR = 1.8
     (Figure credit: Nature Reviews Genetics)
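As a small illustration of how the summary statistics on this slide relate to one another, the sketch below derives allele frequencies from diploid genotype counts, using the slide's example counts (CC = 508, CT = 250, TT = 108). The function name `allele_frequencies` is an assumption made for this example and is not from the talk. The counts appear to be an independent illustration, so the result is not expected to match the case/control percentages quoted on the same slide.

```python
def allele_frequencies(n_cc: int, n_ct: int, n_tt: int) -> dict:
    """Allele frequencies for a biallelic C/T SNP from diploid genotype counts.

    Each individual carries two alleles: a CC homozygote contributes two C
    alleles, a CT heterozygote contributes one C and one T, and a TT
    homozygote contributes two T alleles.
    """
    total_alleles = 2 * (n_cc + n_ct + n_tt)
    c_count = 2 * n_cc + n_ct
    t_count = 2 * n_tt + n_ct
    return {"C": c_count / total_alleles, "T": t_count / total_alleles}


# Genotype counts taken from the slide's example.
freqs = allele_frequencies(508, 250, 108)
print(f"C = {freqs['C']:.1%}, T = {freqs['T']:.1%}")
```

This is the kind of per-SNP aggregate (one frequency per SNP per cohort) that the rest of the talk treats as shared data.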

  3. Resolving Identity from aggregate genetics data
     • GWAS are expensive, requiring genotyping of thousands of individuals.
     • They often require consortia of consortia.
     • Sharing individual-level data was, and is, a challenge.
     • Sharing meta-data is a reasonable option.
     • In 2007, summary allele frequencies and genotype counts were routinely placed on the web for all SNPs.
     • In 2008, after broad deliberation with the scientific community, we published a forensics paper showing that even crude estimates of allele frequency are enough to resolve individuals.
     • "Resolve" is the term we use on purpose. "Identify" has multiple meanings, particularly in GWAS studies.

  4. Example Aggregate Data

     SNP          % A allele (~500 cases)    % A allele (~500 controls)
     rs903252     25%                        26%
     rs232323     15%                        15%
     rs323555     29%                        29%
     rs232343     73%                        75%
     rs233432     21%                        22%
     rs234312     5.1%                       5.1%
     rs163232     3.1%                       2.8%
     rs8392731    15%                        16%
     rs238764     7.3%                       7.1%
     rs383745     45%                        54%

     Other SNP aggregate data types: genotypes, odds ratios, p-values, etc.

  5. Visual example (SNP data as visualized): 250,000 pixels, with genotypes encoded as AA = 1.0, AB = 0.5, BB = 0

  6. Merge 96 independent data images equally

  7. After merging, individual images still resolvable (panels: No Adjustment; Auto Contrast & Smooth Filter)

  8. Conceptual Approach

     SNP         Reference Data Set    Data Set of Question    Person of Interest    Directional score
     rs903252    25%                   35%                     100%                  +10
     rs232323    15%                   13%                     50%                   -2
     rs323555    29%                   39%                     100%                  +10
     rs232343    73%                   51%                     0%                    +22
     rs233432    21%                   32%                     100%                  +11
     rs234312    5%                    15%                     50%                   +10
     rs163232    3%                    0%                      0%                    +3
     …           …                     …                       …                     …
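The directional scores in this table are consistent with a simple distance comparison: how far the person's allele frequency is from the reference data set, minus how far it is from the data set in question. A positive score means the person looks more like the data set in question. The sketch below recomputes the scores from the slide's rows under that assumption. The formula is inferred from the table values (it reproduces all seven listed scores), and the names `ROWS` and `directional_score` are illustrative, not from the talk.

```python
# Rows copied from the slide:
# (SNP, reference %, data set in question %, person of interest %)
ROWS = [
    ("rs903252", 25, 35, 100),
    ("rs232323", 15, 13, 50),
    ("rs323555", 29, 39, 100),
    ("rs232343", 73, 51, 0),
    ("rs233432", 21, 32, 100),
    ("rs234312", 5, 15, 50),
    ("rs163232", 3, 0, 0),
]


def directional_score(reference: float, in_question: float, person: float) -> float:
    """Distance of the person from the reference minus distance from the data set in question."""
    return abs(person - reference) - abs(person - in_question)


scores = {
    snp: directional_score(ref, in_question, person)
    for snp, ref, in_question, person in ROWS
}

for snp, score in scores.items():
    print(f"{snp}: {score:+d}")
```

A person of interest is encoded as 100%, 50%, or 0% of the allele, matching the homozygous, heterozygous, and opposite-homozygous genotypes.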

  9. Equations (one approach of many!!)

     SNP         Reference Data Set    Data Set of Question    Person of Interest    Directional score
     rs903252    25%                   35%                     100%                  +10
     rs232323    15%                   13%                     50%                   -2
     rs323555    29%                   39%                     100%                  +10
     rs232343    73%                   51%                     0%                    +22
     rs233432    21%                   32%                     100%                  +11
     rs234312    5%                    15%                     50%                   +10
     rs163232    3%                    0%                      0%                    +3
     …           …                     …                       …                     …

     D = 9.1 (mean directional score over the SNPs listed)
     sd(D) = 7.4
     s = 7 (number of SNPs)
     T = D / ( sd(D) / √s )
     3.2 ≈ 9.1 / ( 7.4 / √7 )
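The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming that sd(D) is the sample standard deviation (n - 1 denominator) of the seven listed scores, which is what reproduces the slide's 7.4. The function name `t_statistic` is an assumption for illustration.

```python
import math
import statistics

# Directional scores for the seven SNPs listed on the slide.
scores = [10, -2, 10, 22, 11, 10, 3]


def t_statistic(d):
    """One-sample t statistic: mean(D) / (sd(D) / sqrt(s)), with s the number of SNPs."""
    s = len(d)
    mean_d = statistics.mean(d)
    sd_d = statistics.stdev(d)  # sample standard deviation (n - 1 denominator)
    return mean_d / (sd_d / math.sqrt(s))


mean_d = statistics.mean(scores)
sd_d = statistics.stdev(scores)
t = t_statistic(scores)
print(f"mean(D) = {mean_d:.1f}, sd(D) = {sd_d:.1f}, T = {t:.2f}")
```

With unrounded inputs T comes out at roughly 3.25, in line with the 3.2 shown on the slide. Seven SNPs are only an illustration; because T scales with √s, the same per-SNP signal becomes much stronger when many more SNPs contribute.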

  10. Resolving Individuals in Aggregate Data Sets

  11. Results on pooled samples

  12. Impact
     • NIH policy was changed.
     • Summary-level data is no longer freely distributed on the web without restrictions.
     • Additional papers refined the math and described limitations.

  13. Managing Risk
     • Distributing results of studies on human subjects inherently increases the risk of a person being identifiable.
     • Context is important. The concept of Positive Predictive Value (PPV) can provide a measure.
     • PPV can also account for "at-risk" populations.
     • We are currently working with NIH on guidance for measuring the risk posed by a given dataset.
     • The approaches leveraged a critical concept of directionality, specific to genotype data and frequency tables.
     • P-values represent a fundamentally different data type with low information content.
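To make the point about context concrete, here is a minimal sketch of the standard Bayes form of PPV: the probability that a person flagged as "in the study" really is in it, given the test's sensitivity, its specificity, and the prior probability that a tested person is a participant. All numeric values below are hypothetical and chosen only for illustration; none of them come from the talk.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prior: float) -> float:
    """P(truly in the study | test says 'in the study'), by Bayes' rule."""
    true_positives = sensitivity * prior
    false_positives = (1.0 - specificity) * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical test characteristics (assumed, not from the talk).
SENSITIVITY = 0.99
SPECIFICITY = 0.999

# Context 1: the person is drawn from a large population, so the prior is low
# (e.g. a 1,000-person study within a population of 1,000,000).
ppv_general = positive_predictive_value(SENSITIVITY, SPECIFICITY, prior=1_000 / 1_000_000)

# Context 2: the person is already suspected of participating, so the prior is high.
ppv_suspected = positive_predictive_value(SENSITIVITY, SPECIFICITY, prior=0.5)

print(f"PPV, general population: {ppv_general:.3f}")
print(f"PPV, suspected participant: {ppv_suspected:.3f}")
```

With these assumed numbers the same test gives a PPV of about 0.50 in the first context and about 0.999 in the second, which is why the prior (including membership in an "at-risk" population) matters when measuring risk.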

  14. A new era

  15. The era of whole-genome sequencing is approaching
     • SNPs are common variants, usually defined as having an allele frequency greater than 1%.
     • Whole-genome sequencing and exome sequencing inherently measure rare variants.
     • Rare variants can be highly informative, particularly in combination.
     • Approaches need to be explored for summarizing results without revealing identity.

  16. Acknowledgements
     • Lab: Jennifer Dinh; Szabolcs Szelinger; Holly Benson; Meredith Sanchez-Castillo; Brooke Hjelm
     • Informatics: Nils Homer, Ph.D.; Tyler Izatt; Jessica Aldrich; Alexis Christoforides; Ahmet Kurdoglu; James Long; Shripad Sinari
     • Funding: NINDS U24NS051872; State of Arizona; NHGRI U01HG005210; this work: ENDGAME (NHLBI U01 HL086528)

  17. Thank you
