Which SNP genotyping errors are most costly and when?

Presentation Transcript

  1. Which SNP genotyping errors are most costly and when? Stephen J. Finch Stony Brook University

  2. Acknowledgments • Joint work • Derek Gordon (Rockefeller University) • Sun Jung Kang (Duke University) • Five papers are the material for this talk with additional coauthors • Michael Nothnagel and Jurg Ott in paper 1 • Mark Levenstien and Jurg Ott in paper 2 • Abe Brown and Jurg Ott in paper 4

  3. Acknowledgments • Colleagues: • Nancy Mendell • Kenny Ye • Stony Brook students (work in progress) • Nathan Tintle (repeated sampling) • Qing Wang (LRT for mixtures) • Kwangmi Ahn, Rose Saint Fleur • Undergraduates: Alex Borress, Josh Ren, Jelani Wiltshire

  4. First Paper • Gordon, D., Finch, S.J., Nothnagel, M., Ott, J. (2002). Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Human Heredity, 54, 22-33.

  5. Second Paper • Gordon, D., Levenstien, M.A., Finch, S.J., Ott, J. (2003). Errors and linkage disequilibrium interact multiplicatively when computing sample sizes for genetic case-control association studies. Pacific Symposium on Biocomputing: 490-501.

  6. Third Paper • Kang, S.J., Gordon, D., Finch, S.J. (2004). What SNP Genotyping Errors Are Most Costly for Genetic Association Studies. Genetic Epidemiology, 26, 132-141.

  7. Fourth Paper • Kang, S.J., Gordon, D., Brown, A.M., Ott, J., Finch, S.J. (2004). Tradeoff between No-Call Reduction in Genotyping Error Rate and Loss of Sample Size for Genetic Case/Control Association Studies. Pacific Symposium on Biocomputing:

  8. Fifth Paper • Kang, S.J., Finch, S.J., Gordon, D. (2004). Quantifying the cost of SNP genotyping errors in genetic model based association studies. Human Heredity, In press.

  9. PAWE Web Site http://linkage.rockefeller.edu/pawe/pawe.cgi

  10. Review Paper • Gordon, D., Finch, S.J. (2004). Factors affecting statistical power to detect genetic association. Submitted for publication.

  11. Background • Definition of SNPs • SNP genotyping measurements • Specification of error models • Tests of association • Two supplementary measurement approaches

  12. Definition of SNP • A gene with two possible alleles (here A and B) • A is the more common allele in the controls • Three possible genotypes • AA, index=1 (more common homozygote) • AB, index=2 (heterozygote) • BB, index=3 (less common homozygote)

  13. Measure of Cost • Our measure of the cost of a genotyping error is the percentage increase in the minimum sample size necessary to maintain constant Type I and Type II error rates when a genotyping error rate increases by 1%. • %MSSN is our abbreviation for this measure.

  14. SNP Genotyping Measurements • Two dye intensities are measured: R and G. • Measurements are typically taken at two or three time points. • The ratio F=R/(R+G) is used to classify into genotypes. • Genotyping error: an event in which an observed genotype is different from the true genotype.

  15. SNP Genotyping Measurements (Raw Data)

  16. Scatterplot of SNP Dye Intensities

  17. Scatterplot of SNP Fraction by Cycle Time

  18. Approaches to Replication • Sutcliffe studied the reclassification of subjects using the same classification procedure at all remeasurements. • Tenenbein studied the reclassification of subjects using a virtually perfect instrument for the second reclassification.

  19. Regenotyping Results • There is a common perception that genotyping error is negligible. • One test of this perception is to regenotype a set of data. • COGA provided such data to the last GAW. • Tintle et al. (2004) analyzed it.

  20. Regenotyping Results Summed over All SNPs (COGA GAW Data)

  21. Observations on Table • Homozygote to homozygote inconsistencies are extremely rare. • CIDR “missing rate” is 6.7%. • Affymetrix “missing rate” is 6.1%. • Double missing rate is 1.7%, much higher than the 0.4% expected under independence, suggesting some subjects may be consistently more difficult to genotype.
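
The 0.4% figure can be checked directly: under independence, the probability that both laboratories report a missing genotype is the product of the two marginal missing rates. The short sketch below only redoes that arithmetic with the rates quoted on the slide; the variable names are illustrative.

```python
# Marginal "missing" rates and observed double missing rate quoted on the slide.
cidr_missing = 0.067
affymetrix_missing = 0.061
observed_double_missing = 0.017

# Under independence the joint rate is the product of the marginals.
expected_double_missing = cidr_missing * affymetrix_missing

# How many times larger the observed rate is than the independence prediction.
excess_ratio = observed_double_missing / expected_double_missing

print(f"expected under independence: {expected_double_missing:.4f}")
print(f"observed / expected: {excess_ratio:.1f}")
```

The observed rate is roughly four times the independence prediction, which is the gap the slide points to.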

  22. Regenotyping Definitions • Consistency: Two genotypes on a SNP for a regenotyped subject exist and are the same. • Nonreplication: One genotype on a SNP for a regenotyped subject exists, and data is “missing” for the other genotype. Note that we treat two missing genotypes as replicated. • SNP nonreplication rate: the number of nonreplications divided by the sum of the number of replications and the number of nonreplications.
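
A minimal sketch of the nonreplication rate as defined above. The slide does not say how a pair with two different called genotypes is counted, so this sketch leaves such pairs out of both counts; that choice, the function name, and the example data are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

Call = Optional[str]  # None stands for a "missing" genotype


def nonreplication_rate(pairs: Sequence[Tuple[Call, Call]]) -> float:
    """Nonreplications / (replications + nonreplications) for one SNP."""
    replications = 0
    nonreplications = 0
    for first, second in pairs:
        if (first is None) != (second is None):
            # Exactly one of the two genotypes is missing.
            nonreplications += 1
        elif first == second:
            # Same call twice, or both missing (treated as replicated).
            replications += 1
        # Assumption: two different called genotypes count in neither total.
    total = replications + nonreplications
    if total == 0:
        raise ValueError("no replications or nonreplications to count")
    return nonreplications / total


# Hypothetical genotype pairs for five regenotyped subjects.
example_pairs = [("AA", "AA"), ("AB", None), (None, None), ("AB", "BB"), ("BB", "BB")]
example_rate = nonreplication_rate(example_pairs)
print(example_rate)
```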

  23. Critical assumptions about errors • Regardless of nature of errors, they are random and independent • Error model is same for cases (affecteds) and controls (unaffecteds)

  24. Mote-Anderson Model [1965] (most general) • Penetrance table: true genotype (AA, AB, BB) by observed genotype (AA, AB, BB)

  25. Simple but Realistic Error Model • Homozygote to homozygote error rates set to zero • All other error rates set to equal error rate

  26. Three Component Normal Mixture • Given AA, F is normal(-Δ, 1) • Given AB, F is normal(0, 1) • Given BB, F is normal(Δ,1) • Symmetric cutpoints create an error model that has equal error rates for all errors except homozygote to homozygote errors.
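
The link between the mixture and the error model can be sketched numerically. The slide says only that the cutpoints are symmetric; the sketch below assumes they sit at ±Δ/2, halfway between adjacent component means, and the function names are illustrative. Under that assumption the four errors between a homozygote and the heterozygote come out nearly equal, while the homozygote to homozygote rates are orders of magnitude smaller.

```python
import math


def normal_cdf(x: float, mean: float = 0.0) -> float:
    """CDF of a normal distribution with the given mean and unit variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))


def mixture_error_rates(delta: float) -> dict:
    """Misclassification rates keyed by (true genotype, observed genotype).

    Assumption: F < -delta/2 is called AA, F > delta/2 is called BB,
    and anything in between is called AB.
    """
    cut = delta / 2.0
    means = {"AA": -delta, "AB": 0.0, "BB": delta}
    rates = {}
    for true_genotype, mean in means.items():
        p_called_aa = normal_cdf(-cut, mean)
        p_called_bb = 1.0 - normal_cdf(cut, mean)
        called = {
            "AA": p_called_aa,
            "AB": 1.0 - p_called_aa - p_called_bb,
            "BB": p_called_bb,
        }
        for observed, prob in called.items():
            if observed != true_genotype:
                rates[(true_genotype, observed)] = prob
    return rates


rates = mixture_error_rates(4.0)
for (true_genotype, observed), rate in sorted(rates.items()):
    print(f"{true_genotype} -> {observed}: {rate:.3e}")
```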

  27. Tests of Association • Case-control study. The ratio of number of controls to number of cases is k. • We use the 2x3 chi-squared test of independence (simplest non-trivial case). • Mitra found the noncentrality parameter of the chi-squared test of association which is needed for power and sample size calculations • Recommended (Sasieni) test is test of trend (Armitage).

  28. Test Statistic • Pearson’s chi-squared statistic on 2 × 3 tables • Example table
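
As a reminder of how the statistic is formed, here is a small sketch of Pearson's chi-squared computation for a 2 × 3 case-control table. The counts are hypothetical and are not taken from the talk; the function name is illustrative.

```python
from typing import List, Sequence


def pearson_chi_squared(table: Sequence[Sequence[float]]) -> float:
    """Pearson's chi-squared statistic for a table of observed counts."""
    row_totals: List[float] = [sum(row) for row in table]
    col_totals: List[float] = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of status and genotype.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are cases and controls, columns are AA, AB, BB.
example_table = [
    [30, 50, 20],
    [50, 40, 10],
]
example_statistic = pearson_chi_squared(example_table)
print(round(example_statistic, 3))
```

For a 2 × 3 table the statistic is compared with a chi-squared distribution on 2 degrees of freedom.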

  29. Effect of Misclassification Errors on Tests of Association • Bross found that level of significance is unchanged when the same error mechanism affects cases and controls and that parameter estimates are biased. • Mote and Anderson found that the power is reduced (level of significance constant) when there are misclassification errors.

  30. Notation • Count parameters: • NA = number of cases in the absence of errors • NU = number of controls in the absence of errors • NA* = number of cases in the presence of errors • NU* = number of controls in the presence of errors

  31. What is needed for asymptotic power calculations?

  32. Genetic model free parameterization • Specify the genotype probabilities directly Assuming Hardy Weinberg Equilibrium (HWE), all probabilities specified with two parameters ( p, q ):

  33. Genetic model free parameterization • Specify the genotype probabilities directly • Not assuming HWE, can specify all probabilities with four parameters:

  34. Genetic Model Specification • p1 = allele frequency of SNP marker 1 allele • p2 = allele frequency of SNP marker 2 allele = 1 - p1 • pd = allele frequency of disease locus d allele • p+ = allele frequency of disease wild-type allele = 1 - pd

  35. Genetic Model Specification • D = disequilibrium (non-scaled, as defined in Hartl and Clark) • DMAX = min (p1 pd, p2 p+) • D’ = D/DMAX
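
A minimal sketch of the scaling from D to D’. It assumes D is measured on the haplotype carrying marker allele 1 and disease allele d, that is, D = P(1, d) - p1 pd; the slide does not state this. Under that assumption the bound quoted on the slide, min(p1 pd, p2 p+), applies when D is negative, and the usual companion bound min(p1 p+, p2 pd) applies when D is positive. The function name and example values are hypothetical.

```python
from typing import Tuple


def d_and_d_prime(hap_1d: float, p1: float, pd: float) -> Tuple[float, float]:
    """Return (D, D') from the frequency of the 1-d haplotype.

    Assumption: D = hap_1d - p1 * pd, with D' = D / DMAX.
    """
    p2 = 1.0 - p1
    p_plus = 1.0 - pd
    d = hap_1d - p1 * pd
    if d < 0:
        d_max = min(p1 * pd, p2 * p_plus)   # bound quoted on the slide
    else:
        d_max = min(p1 * p_plus, p2 * pd)   # companion bound for D >= 0
    if d_max == 0:
        return d, 0.0
    return d, d / d_max


# Hypothetical values: every d allele sits on a marker 1 haplotype.
example_d, example_d_prime = d_and_d_prime(hap_1d=0.2, p1=0.6, pd=0.2)
print(example_d, example_d_prime)
```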

  36. Genetic Model Specification (penetrance parameters)

  37. Results Demonstrate analytic solution of asymptotic power using standard chi-square test of genotypic association

  38. Genotype Frequencies in the Presence of Errors

  39. Noncentrality Parameter We assume NU = kNA. Using Mitra’s work (1958),
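
For orientation, one standard large-sample form of the noncentrality parameter for Pearson's statistic on a 2 × 3 table with NU = kNA is sketched below. The symbols p_{Aj} and p_{Uj}, the frequencies of genotype j among cases and controls, are notation introduced here, and the arrangement of terms in the original bracketed expression may differ.

```latex
\lambda \;=\; k\,N_A \left[\; \sum_{j=1}^{3}
    \frac{\left(p_{Aj} - p_{Uj}\right)^{2}}{p_{Aj} + k\,p_{Uj}} \;\right]
```

This follows from substituting the expected counts under the alternative, N_A p_{Aj} and k N_A p_{Uj}, into Pearson's statistic, with pooled genotype frequency (p_{Aj} + k p_{Uj})/(1 + k).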

  40. Noncentrality Parameter • Let λ=kNAg, where g is the bracketed function for genotypes measured without error. • Let λ*=kNA*g*, where g* is the bracketed function using frequencies for genotypes observed with error.

  41. To maintain constant asymptotic power We choose NA* so that λ* = λ.

  42. Paper 1 Findings • Noncentrality parameter for the 2x3 chi-squared test of independence from Mitra to describe asymptotic power. • Increase in error rate (three error models) requires a corresponding increase in sample size to maintain Type I and Type II error rates. • Regression analysis of increase in %MSSN as function of error rate in a number of published models. • Interaction of linkage disequilibrium (D) and measure of overall error rate (S).

  43. Paper 2 Findings • Linkage Disequilibrium (LD) and errors interact in a non-linear fashion. • The increase in sample size necessary to maintain constant asymptotic power and level of significance as a function of S (sum of error rates) is smallest when D’ = 1 (perfect LD). • The increase grows monotonically as D’ decreases to 0.5 for all studies.

  44. Paper 3 Method • Saturated error model (called Mote-Anderson in PAWE software). • Taylor series expansion of the ratio of sample sizes expressed with the non-centrality parameters. • The coefficients of each error parameter give the %MSSN for a 1% increase in that error rate.

  45. Recall the Noncentrality Parameters: • Let λ=kNAg, where g is the bracketed function. • Let λ*=kNA*g*, where g* is the bracketed function using frequencies for genotypes observed with error. • Then, when λ= λ* (that is, equal power for both specifications), NA*/NA=g/g*.
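
The ratio NA*/NA = g/g* can be sketched numerically. The code below assumes the bracketed function has the standard form g = Σ (pAj − pUj)² / (pAj + k pUj) for Pearson's 2 × 3 test, and it uses the simple error model in which homozygote to homozygote rates are zero and all other rates are equal. The genotype frequencies, the error rate, and all function names are hypothetical choices for illustration.

```python
from typing import List, Sequence


def bracketed_g(case_freqs: Sequence[float], control_freqs: Sequence[float], k: float) -> float:
    """Assumed form of the bracketed function: sum of (pA - pU)^2 / (pA + k * pU)."""
    return sum((pa - pu) ** 2 / (pa + k * pu) for pa, pu in zip(case_freqs, control_freqs))


def observe_with_errors(true_freqs: Sequence[float], error_matrix: Sequence[Sequence[float]]) -> List[float]:
    """Observed genotype frequencies; error_matrix[i][j] = P(observed j | true i)."""
    n = len(true_freqs)
    return [sum(true_freqs[i] * error_matrix[i][j] for i in range(n)) for j in range(n)]


def simple_error_matrix(eps: float) -> List[List[float]]:
    """Homozygote to homozygote rates zero; every other error rate equal to eps."""
    return [
        [1.0 - eps, eps, 0.0],
        [eps, 1.0 - 2.0 * eps, eps],
        [0.0, eps, 1.0 - eps],
    ]


def sample_size_ratio(case_freqs, control_freqs, k, error_matrix) -> float:
    """NA*/NA = g/g*, the inflation needed to keep lambda* equal to lambda."""
    g = bracketed_g(case_freqs, control_freqs, k)
    g_star = bracketed_g(
        observe_with_errors(case_freqs, error_matrix),
        observe_with_errors(control_freqs, error_matrix),
        k,
    )
    return g / g_star


# Hypothetical genotype frequencies (AA, AB, BB) for cases and controls.
cases = [0.49, 0.42, 0.09]
controls = [0.64, 0.32, 0.04]
ratio = sample_size_ratio(cases, controls, k=1.0, error_matrix=simple_error_matrix(0.01))
print(f"NA*/NA = {ratio:.4f}")
```

With no errors the ratio is 1; any error rate above zero in this example pushes it above 1, which is the sample size cost the talk quantifies.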

  46. %MSSN Function • NA*/NA ≈ 1 + C12ε12 + C13ε13 + C21ε21 + C23ε23 + C31ε31 + C32ε32. • Suppose C13 = 7. Then every 1% increase in ε13 requires a 7% increase in sample size to maintain constant power.
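
The linear approximation is easy to evaluate once the coefficients are known. The sketch below reuses only the slide's illustrative value C13 = 7, treats the error rates as proportions (1% = 0.01), and sets every other error rate to zero; the function name and dictionary layout are assumptions.

```python
from typing import Dict, Tuple

ErrorKey = Tuple[int, int]  # (true genotype index, observed genotype index)


def approx_sample_size_ratio(coefficients: Dict[ErrorKey, float],
                             error_rates: Dict[ErrorKey, float]) -> float:
    """First-order approximation NA*/NA ~ 1 + sum of C_ij * eps_ij."""
    return 1.0 + sum(coefficients[key] * eps for key, eps in error_rates.items())


# Slide's illustrative coefficient: BB observed when the true genotype is AA.
coefficients = {(1, 3): 7.0}
error_rates = {(1, 3): 0.01}  # a 1% error rate, written as a proportion

approx_ratio = approx_sample_size_ratio(coefficients, error_rates)
percent_increase = (approx_ratio - 1.0) * 100.0
print(f"NA*/NA ~ {approx_ratio:.2f}, a {percent_increase:.0f}% increase in sample size")
```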

  47. %MSSN Coefficients • The %MSSN coefficient associated with the error rate of misclassifying the more common homozygote as the heterozygote is given by

  48. %MSSN Coefficients • Similar expressions hold for the other five %MSSN coefficients.

  49. Example of Sample Size increase in presence of errors Suppose we have: