## Which SNP genotyping errors are most costly and when?

**Which SNP genotyping errors are most costly and when?**
Stephen J. Finch, Stony Brook University

**Acknowledgments**
• Joint work
  • Derek Gordon (Rockefeller University)
  • Sun Jung Kang (Duke University)
• Five papers are the material for this talk, with additional coauthors
  • Michael Nothnagel and Jurg Ott in paper 1
  • Mark Levenstien and Jurg Ott in paper 2
  • Abe Brown and Jurg Ott in paper 4

**Acknowledgments**
• Colleagues:
  • Nancy Mendell
  • Kenny Ye
• Stony Brook students (work in progress)
  • Nathan Tintle (repeated sampling)
  • Qing Wang (LRT for mixtures)
  • Kwangmi Ahn, Rose Saint Fleur
• Undergraduates: Alex Borress, Josh Ren, Jelani Wiltshire

**First Paper**
• Gordon, D., Finch, S.J., Nothnagel, M., Ott, J. (2002). Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Human Heredity, 54, 22-33.

**Second Paper**
• Gordon, D., Levenstien, M.A., Finch, S.J., Ott, J. (2003). Errors and linkage disequilibrium interact multiplicatively when computing sample sizes for genetic case-control association studies. Pacific Symposium on Biocomputing: 490-501.

**Third Paper**
• Kang, S.J., Gordon, D., Finch, S.J. (2004). What SNP Genotyping Errors Are Most Costly for Genetic Association Studies. Genetic Epidemiology, 26, 132-141.

**Fourth Paper**
• Kang, S.J., Gordon, D., Brown, A.M., Ott, J., Finch, S.J. (2004). Tradeoff between No-Call Reduction in Genotyping Error Rate and Loss of Sample Size for Genetic Case/Control Association Studies. Pacific Symposium on Biocomputing:

**Fifth Paper**
• Kang, S.J., Finch, S.J., Gordon, D. (2004). Quantifying the cost of SNP genotyping errors in genetic model based association studies. Human Heredity, In press.

**PAWE Web Site**
http://linkage.rockefeller.edu/pawe/pawe.cgi

**Review Paper**
• Gordon, D., Finch, S.J. (2004). Factors affecting statistical power to detect genetic association.
Submitted for publication.

**Background**
• Definition of SNPs
• SNP genotyping measurements
• Specification of error models
• Tests of association
• Two supplementary measurement approaches

**Definition of SNP**
• A gene with two possible alleles (here A and B)
• A is the more common allele in the controls
• Three possible genotypes
  • AA, index=1 (more common homozygote)
  • AB, index=2 (heterozygote)
  • BB, index=3 (less common homozygote)

**Measure of Cost**
• Our measure of the cost of a genotyping error is the percentage increase in the minimum sample size necessary to maintain constant Type I and Type II error rates when that genotyping error rate increases by 1%.
• %MSSN is our abbreviation for this measure.

**SNP Genotyping Measurements**
• Two dye intensities are measured: R and G.
• Measurements are typically taken at two or three time points.
• The ratio F = R/(R+G) is used to classify subjects into genotypes.
• Genotyping error: an event in which the observed genotype differs from the true genotype.

**Approaches to Replication**
• Sutcliffe studied the reclassification of subjects using the same classification procedure at all remeasurements.
• Tenenbein studied the reclassification of subjects using a virtually perfect instrument for the second classification.

**Regenotyping Results**
• There is a common perception that genotyping error is negligible.
• One test is to regenotype a set of data.
• COGA provided such data to the last GAW.
• Tintle et al. (2004) analyzed it.

**Observations on Table**
• Homozygote to homozygote inconsistencies are extremely rare.
• CIDR “missing rate” is 6.7%.
• Affymetrix “missing rate” is 6.1%.
• Double missing rate is 1.7%, much higher than the 0.4% expected under independence, suggesting some subjects may be consistently more difficult to genotype.

**Regenotyping Definitions**
• Consistency: Two genotypes on a SNP for a regenotyped subject exist and are the same.
• Nonreplication: One genotype on a SNP for a regenotyped subject exists, and data is “missing” for the other genotype. Note that we treat two missing genotypes as replicated.
• SNP nonreplication rate: the number of nonreplications divided by the sum of the number of replications and the number of nonreplications.

**Critical assumptions about errors**
• Regardless of the nature of the errors, they are random and independent.
• The error model is the same for cases (affecteds) and controls (unaffecteds).

**Mote-Anderson Model [1965] Penetrance Table (most general)**
Table layout: rows give the true genotype (AA, AB, BB); columns give the observed genotype (AA, AB, BB).

**Simple but Realistic Error Model**
• Homozygote to homozygote error rates are set to zero.
• All other error rates are set to a common (equal) error rate.

**Three Component Normal Mixture**
• Given AA, F is normal(-Δ, 1)
• Given AB, F is normal(0, 1)
• Given BB, F is normal(Δ, 1)
• Symmetric cutpoints create an error model that has equal error rates for all errors except homozygote to homozygote errors.

**Tests of Association**
• Case-control study. The ratio of the number of controls to the number of cases is k.
• We use the 2 × 3 chi-squared test of independence (simplest non-trivial case).
• Mitra found the noncentrality parameter of the chi-squared test of association, which is needed for power and sample size calculations.
• The recommended (Sasieni) test is the test of trend (Armitage).

**Test Statistic**
• Pearson’s chi-squared statistic on 2 × 3 tables
Example Table

**Effect of Misclassification Errors on Tests of Association**
• Bross found that the level of significance is unchanged when the same error mechanism affects cases and controls, and that parameter estimates are biased.
• Mote and Anderson found that the power is reduced (with the level of significance held constant) when there are misclassification errors.

**Notation**
Count parameters:
• NA = number of cases in the absence of errors
• NU = number of controls in the absence of errors
• NA* = number of cases in the presence of errors
• NU* = number of controls in the presence of errors

**Genetic model free parameterization**
• Specify the genotype probabilities directly
• Assuming Hardy-Weinberg Equilibrium (HWE), all probabilities are specified with two parameters (p, q):

**Genetic model free parameterization**
• Specify the genotype probabilities directly
• Not assuming HWE, all probabilities can be specified with four parameters:

**Genetic Model Specification**
• p1 = allele frequency of SNP marker 1 allele
• p2 = allele frequency of SNP marker 2 allele = 1 - p1
• pd = allele frequency of disease locus d allele
• p+ = allele frequency of disease wild-type allele = 1 - pd

**Genetic Model Specification**
• D = disequilibrium (non-scaled, as defined in Hartl and Clark)
• DMAX = min(p1 pd, p2 p+)
• D’ = D/DMAX

**Results**
Demonstrate an analytic solution for asymptotic power using the standard chi-squared test of genotypic association.

**Noncentrality Parameter**
We assume NU = k NA. Using Mitra’s work (1958),

**Noncentrality Parameter**
• Let λ = k NA g, where g is the bracketed function for genotypes measured without error.
• Let λ* = k NA* g*, where g* is the bracketed function using frequencies for genotypes observed with error.

**To maintain constant asymptotic power**
We choose NA* so that λ* = λ.

**Paper 1 Findings**
• Noncentrality parameter for the 2 × 3 chi-squared test of independence from Mitra to describe asymptotic power.
• An increase in error rate (three error models) requires a corresponding increase in sample size to maintain Type I and Type II error rates.
• Regression analysis of the increase in %MSSN as a function of error rate in a number of published models.
• Interaction of linkage disequilibrium (D) and measure of overall error rate (S).

**Paper 2 Findings**
• Linkage disequilibrium (LD) and errors interact in a non-linear fashion.
• The increase in sample size necessary to maintain constant asymptotic power and level of significance as a function of S (sum of error rates) is smallest when D’ = 1 (perfect LD).
• The increase grows monotonically as D’ decreases to 0.5 for all studies.

**Paper 3 Method**
• Saturated error model (called Mote-Anderson in PAWE software).
• Taylor series expansion of the ratio of sample sizes expressed with the noncentrality parameters.
• The coefficients of each error parameter give the %MSSN for a 1% increase in that error rate.

**Recall the Noncentrality Parameters:**
• Let λ = k NA g, where g is the bracketed function.
• Let λ* = k NA* g*, where g* is the bracketed function using frequencies for genotypes observed with error.
• Then, when λ = λ* (that is, equal power for both specifications), NA*/NA = g/g*.

**%MSSN Function**
NA*/NA ≈ 1 + C12 ε12 + C13 ε13 + C21 ε21 + C23 ε23 + C31 ε31 + C32 ε32.
Suppose C13 = 7. Then every 1% increase in ε13 requires a 7% increase in sample size to maintain constant power.

**%MSSN Coefficients**
• The %MSSN coefficient associated with the error rate of misclassifying the more common homozygote as the heterozygote is given by

**%MSSN Coefficients**
• Similar expressions hold for the other five %MSSN coefficients.

**Example of Sample Size increase in presence of errors**
Suppose we have: