Which SNP genotyping errors are most costly and when?

Which SNP genotyping errors are most costly and when? Stephen J. Finch Stony Brook University

Acknowledgments • Joint work • Derek Gordon (Rockefeller University) • Sun Jung Kang (Duke University) • Five papers are the material for this talk with additional coauthors • Michael Nothnagel and Jurg Ott in paper 1 • Mark Levenstien and Jurg Ott in paper 2 • Abe Brown and Jurg Ott in paper 4

Acknowledgments • Colleagues: • Nancy Mendell • Kenny Ye • Stony Brook students (work in progress) • Nathan Tintle (repeated sampling) • Qing Wang (LRT for mixtures) • Kwangmi Ahn, Rose Saint Fleur • Undergraduates: Alex Borress, Josh Ren, Jelani Wiltshire

First Paper • Gordon, D., Finch, S.J., Nothnagel, M., Ott, J. (2002). Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Human Heredity, 54, 22-33.

Second Paper • Gordon D., Levenstien M.A., Finch S.J., and Ott J. (2003). Errors and linkage disequilibrium interact multiplicatively when computing sample sizes for genetic case-control association studies. Pacific Symposium on Biocomputing: 490-501.

Third Paper • Kang, S.J., Gordon, D., Finch, S.J. (2004). What SNP Genotyping Errors Are Most Costly for Genetic Association Studies. Genetic Epidemiology, 26, 132-141.

Fourth Paper • Kang, S.J., Gordon, D., Brown, A.M., Ott, J., Finch, S.J. (2004). Tradeoff between No-Call Reduction in Genotyping Error Rate and Loss of Sample Size for Genetic Case/Control Association Studies. Pacific Symposium on Biocomputing:

Fifth Paper • Kang, S.J., Finch, S.J., Gordon, D. (2004). Quantifying the cost of SNP genotyping errors in genetic model based association studies. Human Heredity, In press.

PAWE Web Site http://linkage.rockefeller.edu/pawe/pawe.cgi

Review Paper • Gordon, D., Finch, S.J. (2004). Factors affecting statistical power to detect genetic association. Submitted for publication.

Background • Definition of SNPs • SNP genotyping measurements • Specification of error models • Tests of association • Two supplementary measurement approaches

Definition of SNP • A gene with two possible alleles (here A and B) • A is the more common allele in the controls • Three possible genotypes • AA, index=1 (more common homozygote) • AB, index=2 (heterozygote) • BB, index=3 (less common homozygote)

Measure of Cost • The percentage increase in the minimum sample size necessary to maintain constant Type I and Type II error rates associated with an increase of 1% in a genotyping error rate is our measure of the cost of a genotyping error. • %MSSN is our abbreviation for this measure.

SNP Genotyping Measurements • Two dye intensities are measured: R and G. • Measurements are typically taken at two or three time points. • Ratio F=R/(R+G) is used to classify into genotypes. • Genotyping error: an event in which the observed genotype differs from the true genotype.
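As a rough illustration of the ratio-based classification described on this slide, the sketch below computes F = R/(R+G) and assigns a genotype from it. The cutpoints (0.25 and 0.75) and the mapping of low F to AA are hypothetical choices made only for this example; the talk does not specify them.

```python
def dye_fraction(r, g):
    """Return F = R / (R + G) for one pair of dye intensities."""
    total = r + g
    if total <= 0:
        raise ValueError("R + G must be positive")
    return r / total


def call_genotype(f, low=0.25, high=0.75):
    """Classify a fraction F into a genotype.

    The cutpoints `low` and `high`, and the choice that small F means AA,
    are assumptions for illustration, not values from the talk.
    """
    if f < low:
        return "AA"
    if f > high:
        return "BB"
    return "AB"


if __name__ == "__main__":
    # Made-up intensity pairs (R, G)
    for r, g in [(100, 900), (500, 500), (900, 100)]:
        f = dye_fraction(r, g)
        print(r, g, round(f, 3), call_genotype(f))
```

A genotyping error in this picture is any case where measurement noise pushes F across a cutpoint, so that the called genotype differs from the true one.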

SNP Genotyping Measurements (Raw Data)

Scatterplot of SNP Dye Intensities

Scatterplot of SNP Fraction by Cycle Time

Approaches to Replication • Sutcliffe studied the reclassification of subjects using the same classification procedure at all remeasurements. • Tenenbein studied the reclassification of subjects using a virtually perfect instrument for the second reclassification.

Regenotyping Results • There is a common perception that genotyping error is negligible. • One test is to regenotype a set of data. • COGA provided such data to last GAW. • Tintle et al. (2004) analyzed it.

Regenotyping Results Summed over All SNPs (COGA GAW Data)

Observations on Table • Homozygote to homozygote inconsistencies are extremely rare. • CIDR “missing rate” is 6.7%. • Affymetrix “missing rate” is 6.1%. • Double missing rate is 1.7%, much higher than the 0.4% expected under independence (6.7% × 6.1%), suggesting some subjects may be consistently more difficult to genotype.

Regenotyping Definitions • Consistency: Two genotypes on a SNP for a regenotyped subject exist and are the same. • Nonreplication: One genotype on a SNP for a regenotyped subject exists, and data is “missing” for the other genotype. Note that we treat two missing genotypes as replicated. • SNP nonreplication rate: the number of nonreplications divided by the sum of the number of replications and the number of nonreplications.
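The nonreplication rate defined above can be sketched as a short calculation. This example assumes one reading of the definitions: a pair counts as a nonreplication when exactly one of the two calls is missing, and as a replication otherwise (both present, or both missing). The function name and the use of `None` for a missing call are illustrative choices.

```python
def nonreplication_rate(pairs):
    """Nonreplication rate for one SNP.

    pairs: list of (first_call, second_call); None marks a missing call.
    Assumed reading of the slide: exactly one missing call is a
    nonreplication; every other pair (including two missing calls)
    counts as a replication.
    """
    if not pairs:
        raise ValueError("need at least one regenotyped subject")
    nonreps = sum(1 for a, b in pairs if (a is None) != (b is None))
    reps = len(pairs) - nonreps
    return nonreps / (reps + nonreps)


if __name__ == "__main__":
    # Made-up calls for four regenotyped subjects on one SNP
    calls = [("AA", "AA"), ("AB", None), (None, None), ("BB", "BB")]
    print(nonreplication_rate(calls))
```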

Critical assumptions about errors • Regardless of nature of errors, they are random and independent • Error model is same for cases (affecteds) and controls (unaffecteds)

Mote-Anderson Model [1965] • Penetrance table (most general): a 3 × 3 table indexed by true genotype (AA, AB, BB) and observed genotype (AA, AB, BB).

Simple but Realistic Error Model • Homozygote to homozygote error rates set to zero • All other error rates set to equal error rate

Three Component Normal Mixture • Given AA, F is normal(-Δ, 1) • Given AB, F is normal(0, 1) • Given BB, F is normal(Δ,1) • Symmetric cutpoints create an error model that has equal error rates for all errors except homozygote to homozygote errors.
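A minimal sketch of how the three-component mixture turns into an error matrix, assuming symmetric cutpoints at ±Δ/2 (the midpoints between component means; the slide does not state the cutpoint values). Calls are AA when F < -c, BB when F > c, and AB otherwise. Under this assumption the four heterozygote-related error rates agree up to the homozygote-to-homozygote tail, which is negligible for well-separated clusters.

```python
from math import erf, sqrt


def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def mixture_error_matrix(delta, cut=None):
    """Rows: true genotype (AA, AB, BB); columns: called genotype.

    F given the true genotype is normal with mean -delta, 0, or +delta
    and unit variance. `cut` defaults to delta / 2, an assumed midpoint
    cutpoint used only for this illustration.
    """
    if cut is None:
        cut = delta / 2.0
    rows = []
    for mu in (-delta, 0.0, delta):
        p_aa = phi(-cut - mu)          # P(F < -cut)
        p_bb = 1.0 - phi(cut - mu)     # P(F > cut)
        p_ab = 1.0 - p_aa - p_bb       # P(-cut <= F <= cut)
        rows.append((p_aa, p_ab, p_bb))
    return rows


if __name__ == "__main__":
    for label, row in zip(("AA", "AB", "BB"), mixture_error_matrix(4.0)):
        print(label, ["%.3e" % p for p in row])
```

With Δ = 4 the AA-to-BB and BB-to-AA entries are many orders of magnitude smaller than the heterozygote-related entries, which matches the "simple but realistic" error model in which those rates are set to zero.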

Tests of Association • Case-control study. The ratio of number of controls to number of cases is k. • We use the 2x3 chi-squared test of independence (simplest non-trivial case). • Mitra found the noncentrality parameter of the chi-squared test of association, which is needed for power and sample size calculations. • The test recommended by Sasieni is the test of trend (Armitage).

Test Statistic • Pearson’s on 2 × 3 tables Example Table
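For readers who want to see the statistic computed, here is a small sketch of Pearson's chi-squared statistic on a 2 × 3 table of case/control genotype counts. The counts are made up for illustration and are not the example table from the talk.

```python
def pearson_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom.

    table: list of rows of observed counts, e.g. [cases, controls],
    each row holding counts for AA, AB, BB. Every column total must be
    positive, otherwise an expected count is zero.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_tot) - 1)
    return stat, df


if __name__ == "__main__":
    # Hypothetical counts: rows are cases and controls, columns AA, AB, BB
    counts = [[60, 30, 10],
              [40, 50, 10]]
    stat, df = pearson_chi2(counts)
    print("chi-squared = %.3f on %d df" % (stat, df))
```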

Effect of Misclassification Errors on Tests of Association • Bross found that the level of significance is unchanged when the same error mechanism affects cases and controls, and that parameter estimates are biased. • Mote and Anderson found that the power is reduced (level of significance constant) when there are misclassification errors.

Notation Count parameters: NA = number of cases in the absence of errors NU = number of controls in the absence of errors NA* = number of cases in the presence of errors NU* = number of controls in the presence of errors

What is needed for asymptotic power calculations?

Genetic model free parameterization • Specify the genotype probabilities directly Assuming Hardy Weinberg Equilibrium (HWE), all probabilities specified with two parameters ( p, q ):

Genetic model free parameterization • Specify the genotype probabilities directly • Not assuming HWE, can specify all probabilities with four parameters:

Genetic Model Specification • p1 = allele frequency of SNP marker 1 allele • p2 = allele frequency of SNP marker 2 allele = 1- p1 • pd = allele frequency of disease locus d allele • p+ = allele frequency of disease wild-type allele = 1- pd

Genetic Model Specification • D = disequilibrium (non-scaled, as defined in Hartl and Clark) • DMAX = min (p1 pd, p2 p+) • D’ = D/DMAX
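The scaling of D by DMAX can be sketched directly from the definitions on this slide. The function name and the numeric inputs are illustrative; the formula for DMAX is taken as written on the slide, without checking which haplotype D is defined for.

```python
def d_prime(d, p1, pd):
    """Scaled disequilibrium D' = D / DMAX.

    p1: frequency of SNP marker allele 1 (p2 = 1 - p1)
    pd: frequency of disease allele d (p+ = 1 - pd)
    DMAX = min(p1 * pd, p2 * p+), following the slide's definition.
    """
    p2 = 1.0 - p1
    p_plus = 1.0 - pd
    d_max = min(p1 * pd, p2 * p_plus)
    if d_max <= 0:
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    return d / d_max


if __name__ == "__main__":
    # Made-up values: D = 0.02, p1 = 0.3, pd = 0.1
    print(d_prime(0.02, 0.3, 0.1))
```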

Genetic Model Specification (penetrance parameters)

Results Demonstrate analytic solution of asymptotic power using standard chi-square test of genotypic association

Genotype Frequencies in the Presence of Errors
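The formula on this slide did not survive in the transcript. A standard way to obtain observed genotype frequencies, offered here as an assumption rather than as the talk's exact expression, is to mix the true frequencies through the error matrix: the observed frequency of genotype j is the sum over true genotypes i of (true frequency of i) × (probability that i is called as j). The sketch uses the simple error model described earlier, with homozygote-to-homozygote rates set to zero and all other error rates equal to a single value e.

```python
def simple_error_matrix(e):
    """Rows: true AA, AB, BB; columns: observed AA, AB, BB.

    Homozygote-to-homozygote errors are zero; every other error rate is e.
    """
    return [
        [1.0 - e, e, 0.0],
        [e, 1.0 - 2.0 * e, e],
        [0.0, e, 1.0 - e],
    ]


def observed_frequencies(true_freqs, error_matrix):
    """Observed genotype frequencies after independent random misclassification."""
    return [
        sum(true_freqs[i] * error_matrix[i][j] for i in range(3))
        for j in range(3)
    ]


if __name__ == "__main__":
    # Hypothetical true frequencies for AA, AB, BB (sum to 1)
    true_freqs = [0.49, 0.42, 0.09]
    print(observed_frequencies(true_freqs, simple_error_matrix(0.01)))
```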

Noncentrality Parameter We assume NU = kNA. Using Mitra’s work (1958),

Noncentrality Parameter • Let λ=kNAg, where g is the bracketed function for genotypes measured without error. • Let λ*=kNA*g*, where g* is the bracketed function using frequencies for genotypes observed with error.

To maintain constant asymptotic power We choose NA* so that λ* = λ.
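A hedged sketch of the sample-size calculation: because λ = kNAg and λ* = kNA*g*, setting λ* = λ gives NA*/NA = g/g*. The bracketed function itself is not reproduced in this transcript, so the code assumes the usual noncentrality form for a 2 × 3 test of homogeneity with NU = kNA, namely g equal to the sum over genotypes of (case frequency − control frequency)² / (case frequency + k × control frequency). Treat that expression, the function names, and all numeric inputs as assumptions for illustration.

```python
def g_function(case_freqs, control_freqs, k):
    """Assumed bracketed function: lambda = k * NA * g."""
    return sum(
        (pa - pu) ** 2 / (pa + k * pu)
        for pa, pu in zip(case_freqs, control_freqs)
    )


def with_errors(freqs, e):
    """Apply the simple error model (hom-to-hom rate 0, all others e)."""
    aa, ab, bb = freqs
    return [
        aa * (1 - e) + ab * e,
        aa * e + ab * (1 - 2 * e) + bb * e,
        ab * e + bb * (1 - e),
    ]


def sample_size_ratio(case_freqs, control_freqs, k, e):
    """NA*/NA = g / g*, the inflation needed to keep lambda* = lambda."""
    g = g_function(case_freqs, control_freqs, k)
    g_star = g_function(with_errors(case_freqs, e),
                        with_errors(control_freqs, e), k)
    return g / g_star


if __name__ == "__main__":
    # Hypothetical genotype frequencies (AA, AB, BB) for cases and controls
    cases = [0.36, 0.48, 0.16]
    controls = [0.49, 0.42, 0.09]
    ratio = sample_size_ratio(cases, controls, k=1.0, e=0.01)
    print("cases needed with errors / without errors: %.4f" % ratio)
```

The same error mechanism is applied to cases and controls, in line with the critical assumptions stated earlier in the talk.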

Paper 1 Findings • Noncentrality parameter for the 2x3 chi-squared test of independence from Mitra to describe asymptotic power. • Increase in error rate (three error models) requires a corresponding increase in sample size to maintain Type I and Type II error rates. • Regression analysis of increase in %MSSN as function of error rate in a number of published models. • Interaction of linkage disequilibrium (D) and measure of overall error rate (S).

Paper 2 Findings • Linkage Disequilibrium (LD) and errors interact in a non-linear fashion. • The increase in sample size necessary to maintain constant asymptotic power and level of significance as a function of S (sum of error rates) is smallest when D’ = 1 (perfect LD). • The increase grows monotonically as D’ decreases to 0.5 for all studies.

Paper 3 Method • Saturated error model (called Mote-Anderson in PAWE software). • Taylor series expansion of the ratio of sample sizes expressed with the non-centrality parameters. • The coefficients of each error parameter give the %MSSN for a 1% increase in that error rate.

Recall the Noncentrality Parameters: • Let λ=kNAg, where g is the bracketed function. • Let λ*=kNA*g*, where g* is the bracketed function using frequencies for genotypes observed with error. • Then, when λ= λ* (that is, equal power for both specifications), NA*/NA=g/g*.

%MSSN Function NA*/NA ≈ 1 + C12ε12 + C13ε13 + C21ε21 + C23ε23 + C31ε31 + C32ε32. Suppose C13 = 7. Then every 1% increase in ε13 requires a 7% increase in sample size to maintain constant power.
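The linear approximation can be evaluated with a few lines of code. In this sketch only C13 = 7 comes from the slide; the dictionary layout and the function name are hypothetical conveniences, and any coefficient left out is treated as contributing nothing.

```python
def approx_sample_size_ratio(coefficients, error_rates):
    """First-order approximation NA*/NA ~ 1 + sum of C_ij * eps_ij.

    coefficients: dict mapping an error label such as "13" to C_ij
    error_rates:  dict mapping the same labels to eps_ij (as proportions)
    Labels missing from error_rates are treated as zero error.
    """
    return 1.0 + sum(
        c * error_rates.get(label, 0.0) for label, c in coefficients.items()
    )


if __name__ == "__main__":
    coefficients = {"13": 7.0}   # C13 = 7, the value supposed on the slide
    error_rates = {"13": 0.01}   # a 1% rate of calling true AA as BB
    ratio = approx_sample_size_ratio(coefficients, error_rates)
    print("approximate increase in sample size: %.1f%%" % (100 * (ratio - 1)))
```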

%MSSN Coefficients • The %MSSN coefficient associated with the error rate of misclassifying the more common homozygote as the heterozygote is given by

%MSSN Coefficients • Similar expressions hold for the other five %MSSN coefficients.

Example of Sample Size increase in presence of errors Suppose we have:
