Which SNP genotyping errors are most costly and when? Stephen J. Finch Stony Brook University
Acknowledgments • Joint work • Derek Gordon (Rockefeller University) • Sun Jung Kang (Duke University) • Five papers are the material for this talk with additional coauthors • Michael Nothnagel and Jurg Ott in paper 1 • Mark Levenstien and Jurg Ott in paper 2 • Abe Brown and Jurg Ott in paper 4
Acknowledgments • Colleagues: • Nancy Mendell • Kenny Ye • Stony Brook students (work in progress) • Nathan Tintle (repeated sampling) • Qing Wang (LRT for mixtures) • Kwangmi Ahn, Rose Saint Fleur • Undergraduates: Alex Borress, Josh Ren, Jelani Wiltshire
First Paper • Gordon, D., Finch, S.J., Nothnagel, M., Ott, J. (2002). Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Human Heredity, 54, 22-33.
Second Paper • Gordon D., Levenstien M.A., Finch S.J., and Ott J. (2003). Errors and linkage disequilibrium interact multiplicatively when computing sample sizes for genetic case-control association studies. Pacific Symposium on Biocomputing: 490-501.
Third Paper • Kang, S.J., Gordon, D., Finch, S.J. (2004). What SNP Genotyping Errors Are Most Costly for Genetic Association Studies. Genetic Epidemiology, 26, 132-141.
Fourth Paper • Kang, S.J., Gordon, D., Brown, A.M., Ott, J., Finch, S.J. (2004). Tradeoff between No-Call Reduction in Genotyping Error Rate and Loss of Sample Size for Genetic Case/Control Association Studies. Pacific Symposium on Biocomputing:
Fifth Paper • Kang, S.J., Finch, S.J., Gordon, D. (2004). Quantifying the cost of SNP genotyping errors in genetic model based association studies. Human Heredity, In press.
PAWE Web Site http://linkage.rockefeller.edu/pawe/pawe.cgi
Review Paper • Gordon, D., Finch, S.J. (2004). Factors affecting statistical power to detect genetic association. Submitted for publication.
Background • Definition of SNPs • SNP genotyping measurements • Specification of error models • Tests of association • Two supplementary measurement approaches
Definition of SNP • A gene with two possible alleles (here A and B) • A is the more common allele in the controls • Three possible genotypes • AA, index=1 (more common homozygote) • AB, index=2 (heterozygote) • BB, index=3 (less common homozygote)
Measure of Cost • The percentage increase in the minimum sample size necessary to maintain constant Type I and Type II error rates associated with an increase of 1% in a genotyping error rate is our measure of the cost of a genotyping error. • %MSSN is our abbreviation for this measure.
SNP Genotyping Measurements • Two dye intensities are measured: R and G. • Measurements are typically taken at two or three time points. • The ratio F=R/(R+G) is used to classify into genotypes. • Genotyping error: an event in which an observed genotype is different from the true genotype.
Approaches to Replication • Sutcliffe studied the reclassification of subjects using the same classification procedure at all remeasurements. • Tenenbein studied the reclassification of subjects using a virtually perfect instrument for the second reclassification.
Regenotyping Results • There is a common perception that genotyping error is negligible. • One test is to regenotype a set of data. • COGA provided such data to the last GAW. • Tintle et al. (2004) analyzed it.
Observations on Table • Homozygote to homozygote inconsistencies are extremely rare. • CIDR “missing rate” is 6.7%. • Affymetrix “missing rate” is 6.1%. • Double missing rate is 1.7%, much higher than the 0.4% expected under independence, suggesting some subjects may be consistently more difficult to genotype.
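The 0.4% figure can be checked with a short calculation. This is only a sketch of the arithmetic, using the three rates quoted on the slide; the variable names are illustrative.

```python
# Missing rates quoted on the slide for the two genotyping runs.
cidr_missing = 0.067
affymetrix_missing = 0.061
observed_double_missing = 0.017

# Under independence, the chance that both calls are missing is the product.
expected_double_missing = cidr_missing * affymetrix_missing

# How many times larger the observed double-missing rate is than expected.
excess_ratio = observed_double_missing / expected_double_missing

print(f"expected double missing rate: {expected_double_missing:.4f}")
print(f"observed / expected: {excess_ratio:.1f}")
```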
Regenotyping Definitions • Consistency: Two genotypes on a SNP for a regenotyped subject exist and are the same. • Nonreplication: One genotype on a SNP for a regenotyped subject exists, and data is “missing” for the other genotype. Note that we treat two missing genotypes as replicated. • SNP nonreplication rate: the number of nonreplications divided by the sum of the number of replications and the number of nonreplications.
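A minimal sketch of how these definitions might be applied to paired calls. The data layout (a list of pairs with `None` for a missing call) and the function name are hypothetical. One assumption is made loudly here: a pair with two calls present counts as a replication whether or not the calls agree, since the slide does not say how discordant pairs are counted.

```python
def call_pair_summary(call_pairs):
    """Summarize paired genotype calls; None marks a missing call.

    Assumption (not stated on the slide): any pair with both calls present
    counts as a replication, even when the two calls disagree.
    Requires at least one pair.
    """
    both_present = sum(1 for a, b in call_pairs if a is not None and b is not None)
    both_missing = sum(1 for a, b in call_pairs if a is None and b is None)
    one_missing = len(call_pairs) - both_present - both_missing
    consistent = sum(1 for a, b in call_pairs if a is not None and a == b)
    replications = both_present + both_missing  # two missing calls count as replicated
    return {
        "consistent": consistent,
        "nonreplication_rate": one_missing / (replications + one_missing),
    }

# Hypothetical calls for five subjects at one SNP.
pairs = [("AA", "AA"), ("AB", "AB"), ("AB", None), (None, None), ("AA", "AB")]
summary = call_pair_summary(pairs)
print(summary)
```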
Critical assumptions about errors • Regardless of the nature of the errors, they are random and independent. • The error model is the same for cases (affecteds) and controls (unaffecteds).
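Mote-Anderson Model • Penetrance Table (most general): the entry in row i, column j is εij = Pr(observed genotype j | true genotype i)
True Genotype    Observed AA      Observed AB      Observed BB
AA               1 − ε12 − ε13    ε12              ε13
AB               ε21              1 − ε21 − ε23    ε23
BB               ε31              ε32              1 − ε31 − ε32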
Simple but Realistic Error Model • Homozygote to homozygote error rates set to zero • All other error rates set to equal error rate
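A minimal sketch of this error model applied to a set of true genotype frequencies. The function names, the error rate of 0.01, and the true frequencies are illustrative assumptions, not values from the talk.

```python
def simple_error_matrix(e):
    """Rows: true AA, AB, BB. Columns: observed AA, AB, BB.

    Homozygote-to-homozygote errors are zero; every other error rate is e.
    """
    return [
        [1 - e, e, 0.0],
        [e, 1 - 2 * e, e],
        [0.0, e, 1 - e],
    ]

def observed_frequencies(true_freqs, error_matrix):
    """Observed frequency of genotype j is the sum over true i of true_i * Pr(j | i)."""
    return [
        sum(true_freqs[i] * error_matrix[i][j] for i in range(3))
        for j in range(3)
    ]

true_freqs = [0.49, 0.42, 0.09]      # hypothetical true AA, AB, BB frequencies
matrix = simple_error_matrix(0.01)   # hypothetical common error rate
observed = observed_frequencies(true_freqs, matrix)
print(observed)
```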
Three Component Normal Mixture • Given AA, F is normal(-Δ, 1) • Given AB, F is normal(0, 1) • Given BB, F is normal(Δ,1) • Symmetric cutpoints create an error model that has equal error rates for all errors except homozygote to homozygote errors.
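The error rates implied by the mixture can be sketched numerically. The cutpoints at ±Δ/2 and the value Δ = 4 are assumptions made for illustration; the slide does not give them. Under these choices the heterozygote-related error rates agree up to the (tiny) homozygote-to-homozygote tail, so the equality is approximate rather than exact.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mixture_error_matrix(delta, cut):
    """Rows: true AA, AB, BB. Columns: called AA, AB, BB.

    F is normal with mean -delta, 0, or delta and unit variance.
    Calls: AA if F < -cut, BB if F > cut, AB otherwise.
    """
    matrix = []
    for mean in (-delta, 0.0, delta):
        called_aa = normal_cdf(-cut - mean)
        called_bb = 1.0 - normal_cdf(cut - mean)
        matrix.append([called_aa, 1.0 - called_aa - called_bb, called_bb])
    return matrix

# Hypothetical separation and symmetric cutpoints at +/- delta / 2.
matrix = mixture_error_matrix(delta=4.0, cut=2.0)
for row in matrix:
    print([f"{value:.2e}" for value in row])
```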
Tests of Association • Case-control study. The ratio of number of controls to number of cases is k. • We use the 2 × 3 chi-squared test of independence (simplest non-trivial case). • Mitra found the noncentrality parameter of the chi-squared test of association, which is needed for power and sample size calculations. • The recommended (Sasieni) test is the test of trend (Armitage).
Test Statistic • Pearson’s χ² statistic on 2 × 3 tables • Example Table
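A minimal sketch of the Pearson statistic on a 2 × 3 table. The counts are hypothetical and do not reproduce the example table from the slide. With 2 degrees of freedom the chi-squared upper tail probability is exp(−x/2), so no statistics library is needed.

```python
import math

def pearson_chi_square(table):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical genotype counts, columns AA, AB, BB.
table = [
    [60, 100, 40],  # cases
    [90, 90, 20],   # controls
]
stat = pearson_chi_square(table)
p_value = math.exp(-stat / 2.0)  # exact upper tail for 2 degrees of freedom
print(f"chi-square = {stat:.3f}, p = {p_value:.5f}")
```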
Effect of Misclassification Errors on Tests of Association • Bross found that the level of significance is unchanged when the same error mechanism affects cases and controls, and that parameter estimates are biased. • Mote and Anderson found that the power is reduced (level of significance constant) when there are misclassification errors.
Notation • Count parameters: • NA = number of cases in the absence of errors • NU = number of controls in the absence of errors • NA* = number of cases in the presence of errors • NU* = number of controls in the presence of errors
Genetic model free parameterization • Specify the genotype probabilities directly Assuming Hardy Weinberg Equilibrium (HWE), all probabilities specified with two parameters ( p, q ):
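A minimal sketch of the HWE parameterization. It assumes that p and q are the frequencies of allele A in cases and in controls respectively; the slide's own formula did not survive, so this reading, the function name, and the numeric values are assumptions.

```python
def hwe_genotype_frequencies(allele_a_freq):
    """Return [AA, AB, BB] frequencies under Hardy-Weinberg equilibrium."""
    a = allele_a_freq
    b = 1.0 - a
    return [a * a, 2.0 * a * b, b * b]

p = 0.6  # hypothetical frequency of allele A in cases
q = 0.7  # hypothetical frequency of allele A in controls

case_freqs = hwe_genotype_frequencies(p)
control_freqs = hwe_genotype_frequencies(q)
print(case_freqs, control_freqs)
```

Without HWE, each group has three genotype frequencies that sum to one, which leaves two free parameters per group and four in total.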
Genetic model free parameterization • Specify the genotype probabilities directly • Not assuming HWE, can specify all probabilities with four parameters:
Genetic Model Specification • p1 = allele frequency of SNP marker 1 allele • p2 = allele frequency of SNP marker 2 allele = 1 - p1 • pd = allele frequency of disease locus d allele • p+ = allele frequency of disease wild-type allele = 1 - pd
Genetic Model Specification • D = disequilibrium (non-scaled, as defined in Hartl and Clark) • DMAX = min(p1 pd, p2 p+) • D’ = D / DMAX
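A hedged sketch of scaled disequilibrium. It assumes the sign convention D = freq(haplotype 1-d) − p1·pd, which the slide does not state. Under that convention the slide's DMAX expression is the bound for negative D; the bound used below for non-negative D is the standard Lewontin counterpart and is not taken from the slide. Function names and numeric values are illustrative.

```python
def haplotype_frequencies(p1, pd, D):
    """Haplotype frequencies implied by allele frequencies and D (assumed sign convention)."""
    p2, p_plus = 1.0 - p1, 1.0 - pd
    return {
        "1d": p1 * pd + D,
        "1+": p1 * p_plus - D,
        "2d": p2 * pd - D,
        "2+": p2 * p_plus + D,
    }

def d_prime(p1, pd, D):
    """D divided by the largest |D| that keeps all haplotype frequencies non-negative."""
    p2, p_plus = 1.0 - p1, 1.0 - pd
    if D >= 0:
        d_max = min(p1 * p_plus, p2 * pd)   # standard bound for D >= 0
    else:
        d_max = min(p1 * pd, p2 * p_plus)   # the expression given on the slide
    return D / d_max

haps = haplotype_frequencies(p1=0.3, pd=0.1, D=0.035)  # hypothetical values
scaled = d_prime(p1=0.3, pd=0.1, D=0.035)
print(haps, scaled)
```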
Results Demonstrate analytic solution of asymptotic power using standard chi-square test of genotypic association
Noncentrality Parameter We assume NU = kNA. Using Mitra’s work (1958),
Noncentrality Parameter • Let λ=kNAg, where g is the bracketed function for genotypes measured without error. • Let λ*=kNA*g*, where g* is the bracketed function using frequencies for genotypes observed with error.
To maintain constant asymptotic power We choose NA* so that λ* = λ.
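A minimal sketch of the sample size adjustment. Since the bracketed function is not reproduced in the text, the code assumes the standard noncentrality form for Pearson's test on a 2 × 3 table with NU = kNA, which gives g = Σj (aj − uj)² / (aj + k·uj) for case frequencies aj and control frequencies uj. The genotype frequencies, error rate, and function names are hypothetical.

```python
def g_function(case_freqs, control_freqs, k):
    """Assumed form of the bracketed function, so that lambda = k * NA * g."""
    return sum(
        (a - u) ** 2 / (a + k * u)
        for a, u in zip(case_freqs, control_freqs)
    )

def apply_errors(freqs, error_matrix):
    """Observed genotype frequencies given Pr(observed j | true i)."""
    return [sum(freqs[i] * error_matrix[i][j] for i in range(3)) for j in range(3)]

def cases_needed_with_errors(n_cases, case_freqs, control_freqs, error_matrix, k):
    """Choose NA* so that lambda* = lambda, i.e. NA* = NA * g / g*."""
    g = g_function(case_freqs, control_freqs, k)
    g_star = g_function(
        apply_errors(case_freqs, error_matrix),
        apply_errors(control_freqs, error_matrix),
        k,
    )
    return n_cases * g / g_star

e = 0.01  # hypothetical common error rate, homozygote-to-homozygote errors zero
error_matrix = [[1 - e, e, 0.0], [e, 1 - 2 * e, e], [0.0, e, 1 - e]]
case_freqs = [0.40, 0.45, 0.15]     # hypothetical AA, AB, BB in cases
control_freqs = [0.49, 0.42, 0.09]  # hypothetical AA, AB, BB in controls

n_star = cases_needed_with_errors(500, case_freqs, control_freqs, error_matrix, k=1.0)
print(f"cases needed with errors: {n_star:.1f}")
```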
Paper 1 Findings • Noncentrality parameter for the 2x3 chi-squared test of independence from Mitra to describe asymptotic power. • Increase in error rate (three error models) requires a corresponding increase in sample size to maintain Type I and Type II error rates. • Regression analysis of increase in %MSSN as function of error rate in a number of published models. • Interaction of linkage disequilibrium (D) and measure of overall error rate (S).
Paper 2 Findings • Linkage Disequilibrium (LD) and errors interact in a non-linear fashion. • The increase in sample size necessary to maintain constant asymptotic power and level of significance as a function of S (sum of error rates) is smallest when D’ = 1 (perfect LD). • The increase grows monotonically as D’ decreases to 0.5 for all studies.
Paper 3 Method • Saturated error model (called Mote-Anderson in PAWE software). • Taylor series expansion of the ratio of sample sizes expressed with the non-centrality parameters. • The coefficients of each error parameter give the %MSSN for a 1% increase in that error rate.
Recall the Noncentrality Parameters: • Let λ=kNAg, where g is the bracketed function. • Let λ*=kNA*g*, where g* is the bracketed function using frequencies for genotypes observed with error. • Then, when λ= λ* (that is, equal power for both specifications), NA*/NA=g/g*.
%MSSN Function • NA*/NA ≈ 1 + C12ε12 + C13ε13 + C21ε21 + C23ε23 + C31ε31 + C32ε32. • Suppose C13 = 7. Then every 1% increase in ε13 requires a 7% increase in sample size to maintain constant power.
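The coefficients can be approximated numerically. This sketch replaces the Taylor series derivation with a one-sided finite difference at zero error, and it assumes the standard noncentrality form g = Σj (aj − uj)² / (aj + k·uj); both choices, the function names, and the genotype frequencies are illustrative assumptions.

```python
def g_function(case_freqs, control_freqs, k):
    """Assumed form of the bracketed function in the noncentrality parameter."""
    return sum((a - u) ** 2 / (a + k * u) for a, u in zip(case_freqs, control_freqs))

def saturated_error_matrix(eps):
    """eps maps (i, j), genotypes indexed 1..3, to Pr(observed j | true i); others are 0."""
    matrix = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            if i != j:
                matrix[i][j] = eps.get((i + 1, j + 1), 0.0)
        matrix[i][i] = 1.0 - sum(matrix[i])  # diagonal is still 0 when summed
    return matrix

def sample_size_ratio(case_freqs, control_freqs, k, eps):
    """NA*/NA = g / g* for the given error rates."""
    m = saturated_error_matrix(eps)

    def observe(freqs):
        return [sum(freqs[i] * m[i][j] for i in range(3)) for j in range(3)]

    g = g_function(case_freqs, control_freqs, k)
    g_star = g_function(observe(case_freqs), observe(control_freqs), k)
    return g / g_star

def mssn_coefficients(case_freqs, control_freqs, k, step=1e-6):
    """Finite-difference slope of NA*/NA in each error rate, evaluated at zero error."""
    pairs = [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]
    return {
        pair: (sample_size_ratio(case_freqs, control_freqs, k, {pair: step}) - 1.0) / step
        for pair in pairs
    }

case_freqs = [0.40, 0.45, 0.15]     # hypothetical AA, AB, BB in cases
control_freqs = [0.49, 0.42, 0.09]  # hypothetical AA, AB, BB in controls
coefficients = mssn_coefficients(case_freqs, control_freqs, k=1.0)
for pair, value in coefficients.items():
    print(pair, round(value, 3))
```

A coefficient of C for a given εij reads as roughly a C% increase in sample size for each 1% increase in that error rate.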
%MSSN Coefficients • The %MSSN coefficient associated with the error rate of misclassifying the more common homozygote as the heterozygote is given by
%MSSN Coefficients • Similar expressions hold for the other five %MSSN coefficients.
Example of Sample Size increase in presence of errors Suppose we have: