Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 206 2-4271

EPI293Design and analysis of gene association studiesWinter Term 2008Lecture 4: Multilocus haplotypes Peter Kraft pkraft@hsph.harvard.eduBldg 2 Rm 2062-4271

“Effective number of SNPs” • Can use spectral decomposition to approximate permutation correction for multiple testing

Example: Candidate Genes from Smoking GWAS Number of SNPs in gene region Meff Bonferroni corrected threshold Effective number of SNPs

Tradeoffs: Coverage vs. Sample Size • Say you want to study JNK1 in your study of 400 cases and 800 controls and can afford 19,200 genotypes • You have to choose between good coverage (high minimum r2) and sample size

Outline • Background • Haplotype estimation algorithms • Rh2, Rs2 and htSNPs • Plug-in methods (regression calibration) • Missing data methods (structural modeling) • Rare haplotypes, robust models • Summary

Why haplotypes? • May be useful in linkage-disequilibrium mapping • May be directly relevant (cis-interaction) • May be efficient surrogates for other SNPs • “tag SNPs” enable indirect testing for unknown variants • Haplotypes may reduce test degrees of freedom = 1 1 0 0 = 1 0 0 1

Basic “tagging SNP” design Genotype reference panel(many markers, few subjects) Choose “tagging SNPs”(many markers, few subjects) Genotype “tagging SNPs” in main study(few markers, many subjects) Analyze appropriately(single locus or multilocus--unphased or phased)

ATM

CYP19

G A A G A/G A G G A G A G A G A G A A/G A G G A A/G G A A G G A G A A G Unobservedtrue haplotyes Observedgenotypes Possiblehaplotypes Basic technical problem: unknown phase

Need to infer haplotypes while accounting for haplotype uncertainty • Missing data methods [EM algorithm, Bayesian methods] can be used to estimate haplotype frequencies, infer haplotype probability, conditional on observed genotypes = Pr(H|G;qH) • Can use this probability to assign haplotype pair [as if it were known] or as a weight in standard regression analysis • Missing data methods can also be used to jointly estimate regression parameters and haplotype frequencies

G A A G G A G A G A G A G A G A A G G A G A A G Only one pair is possible! Example: how we can say something about haplotypes given genotypes What if we knew the haplotype frequencies?

Haplotype frequencies Indicator: 1 if G is compatible with H, 0 otherwise Sum is over all haplotype pairs How to calculate Pr(H|G;qH) Use Bayes’ theorem: Assumes HWE

Example 1 Denominator

Example 2 Denominator

The EM algorithm • Uses calculations on previous pages to estimate allele frequencies • Individual haplotype probabilities (given genotypes) are natural by-product • Assumes HWE • Pr({Hi,Hj})=2qiqj if ij; qi2 otherwise • Does not substantially affect individual haplotype estimates in regions of limited haplotype diversity (high rh2) • EM algorithm makes no assumptions about inter-marker distance or marker order! • More sophisticated algorithms (e.g. PHASE) do • Again, in regions of limited haplotype diversity (low recombination) may not matter much

Plain-vanilla EM • Start with guess at haplotype frequencies qH(0) • Count expected numbers of haps, given genos and qH(0) • Update qH(1) as relative frequencies of haps • Keep going until converged • Provides natural measure of phase uncertainty • Pr(H|G;estimated q)

Example – plain vanilla EM • Raw data • 64 subjects with genotypes {G/G,G/G,G/G} • 32 subjects with genotypes {G/A,G/A,G/A} • 4 subjects with genotypes {A/A,A/A,A/A} • Initialize qH(0)=(.125,.125,.125,.125,.125,.125,.125,.125) • Expected hap counts given genos, qH(0) • 64 with haps {G-G-G,G-G-G} • 8 with haps {G-G-G,A-A-A} • 8 each with {G-G-A,A-A-G},{G-A-G,A-G-A},{G-A-A,A-G-G} • 4 with {A-A-A,A-A-A} • Update qH(1)=(.68,.04,.04,.04,.04,.04,.04,.08) qH(2)=(.791,.003,.003,.003,.003,.003,.003,.191)

Modifications to EM algorithm • For #SNPs > 8, plain vanilla algorithm can be slow • Faster versions involve “partial ligation” • Calculate frequencies within blocks*—then merge blocks • PLEM, tagsnps—user-defined blocks • SNPHAP, PROC HAPLOTYPE (EST=STEPEM)—add one SNP at a time Niu et al. (2002) AJHG 70:157 * These blocks need have nothing to do with “haplotype blocks”—they are defined arbitrarily for computational convenience

Bayesian approaches • HAPLOTYPER • Original “partition-ligation” software • Used Bayesian framework as computational tool • Prior = dirichlet • PHASE • Uses what we know about genetics to aid freq estimation • Prior = coalescent plus recombination • May improve estimation where recomb. matters (long dist) • Marginal improvement over EM where recomb. doesn’t matter • Others (Arlequin 3.0, hap, …)

Stram et al. (2003) Hum Hered 55:27 • Presents Rh2 and Rs2 measures • These account for phase uncertainty • Other measures [Weale et al. (2003) AJHG 73:551] do not • Uses these to select haplotype tagging SNPs • Implemented in “tagsnps”

Rh2 • Rh2 is the squared correlation between true and predicted haplotype dosage • Expected number of hap h given G and q = E{h(H)|G} • G may or may not contain all relevant markers • E.g. G may be genotypes of tagging SNPs • May be <1 even if all relevant markers are genotyped! Count of hs in H

Rh2 =.54 Even if all four SNPs are genotyped, Rh2 runs from .72 to .98 Kraft et al (2005) Genet Epidemiol

Rs2 • Rs2 is the squared correlation between true and predicted genotype • Expected allele count given G and q = E{s(H)|G} • Will be 1 if SNP s is genotyped (contained in G) • Not used by tagger (tagger looks at correlation with one haplotype and untyped SNP; this looks at correlation with all haplotypes and untyped SNP)

Rs2 and Rh2 related to power • 1/R2 is “sample size inflator” • To achieve same power as study with N subjects that observed causal SNP s (haplotype h), we need N/R2 subjects • Equivalently: calculate power with “effective sample size” R2 N

Choosing tag SNPs s.t. min R2 >  • Stram et al. use stepwise algorithm • Given k-1 tag SNPs… • …add SNP that leads to greatest increase in R2… • …then see if any of the original k-1 can be “swapped out” • Keep going until R2 >  • Performs well relative to exhaustive search

Alternative paradigmsNO PHASE! NO BLOCKS! • Carlson—use pairwise R2 • Partition SNPs in “bins” of high mutual correlation • Choose one SNP per bin • Chapman—use multivariate R2 • Predict untyped SNP using tags • Choose tags s.t. min R2 >  using “crude step up” algorithm • How many SNPs to put into test? • Still have to define region to tag • Can lead to trying to fit too many SNPs at once

Solutions to the unknown phase problem • Plug in (or regression calibration) • Old school: estimate freqs in cases and controls seperately • Naïve: use “best guess haplotype” • Use expected haplotype scores • Closely related: Use posterior haplotype probability weights • Missing data (or structural modeling) • Treats phase as missing data • Fits likelihood marginal over phase

Old school • Estimate haplotypes assuming HWE • Pooled • In cases and controls separately • LRT* = 2 (log Lcases + log Lcontrols – log Lpooled) • Disadvantages • May not have usual ChiSquare dist’n • Cases generally not in HWE under alternative • Rare haplotypes • Cannot incorporate covariates • Haplotype-specific odds ratios dodgy * Caveat Emptor! Test statistic in SAS PROC HAPLOTYPE leaves off the 2! Must double statistic & calculate p-value by hand.

Plug-in methods • Write down GLM you want to fit... • ...then use GLM you can fit

Naïve plug-in approach • Use “best guess haplotypes” • Assign Hi s.t. Pr(Hi|Gi;qhat) = maxH Pr(H|Gi;qhat) • Error in best guess leads to bias in parameter estimates Rh2=0.83 Rh2=0.54 * All biases significantly diff’t from 0 with p < .005 Kraft et al (2005) Genet Epidemiol

Best guess: carrier Rh2=0.83 Best guess: non-carrier Rh2=0.54 Kraft et al (2005) Genet Epidemiol

Expectation substitution (regression calibration) • Estimate qhat in pooled cases and controls • Set Z* equal to E[Z|G;qhat] • E.g. for dominant model, Zh* = Pr(Carrier of h|G) • “Hedging your bets” • Fudges some technical details • Estimates of parameter variance may be narrow • Variation in Z* due to uncertainty in qhat not accounted for • Imputation of H and hence Z* should account for sampling • Want Z = E[Z|G,Asc;q,] • These details don’t seem to matter for realistic situations

Bias in parameter estimates: a comparison of “plug in” methods Expectation substitution Best Guess Kraft et al (2005) Genet Epidemiol

Expectation substitution: Example Macro %happy creates data set with expected counts for different models (see www.hsph.harvard.edu/faculty/kraft/soft.htm) %happy( indsn=oohay, /* input data set */ keep=m1 m2 m3 m4 m5 m6 m7 m8, /* marker variables--REQUIRED */ style=MAR, /* use default proc haplotype input */ outadd=yooadd, /* output data set--additive coding */ outcodomnant=yoocod, /* output data set--codominant coding */ range=.05); /* set 'rare' threshold for pooling */

Haplotype frequencies Haplotype Freq Code 1-4-3-2 0.37209 z1 3-4-3-2 0.28084 z2 3-2-4-2 0.11223 z3 3-4-4-1 0.09201 z4 3-4-4-2 0.08943 z5Other 0.054 z6 Using additive coding (zh = expected number of hap h) /* run logistic model including rare haps (z6)*/ proclogistic; model d(descending)=z2-z6; /* run logistic model without rare haps */ proclogistic; model d(descending)=z2-z5; run; This works just like a multiallelic test, as if we had observed the alleles (haplotypes) 1 through 6

Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 16.3993 4 0.0025 Score 16.3316 4 0.0026 Wald 16.1984 4 0.0028 Without “other” Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits z2 0.894 0.756 1.057 z3 1.331 1.048 1.690 z4 1.349 1.048 1.735 z5 1.081 0.838 1.394 Slight changes towards (away) null[3rd SNP is causal] Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 17.8874 5 0.0031 Score 17.8094 5 0.0032 Wald 17.6553 5 0.0034 With “other” Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits z2 0.910 0.768 1.079 z3 1.338 1.053 1.699 z4 1.369 1.063 1.763 z5 1.112 0.859 1.440 z6 1.238 0.878 1.744

Using codominant coding (z1h = probability of carrying one copy of h; z2h = probability of carrying two copies of h) proclogistic; model d(descending)=z13 z23; run; Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 6.1077 2 0.0472 Score 6.0395 2 0.0488 Wald 5.9014 2 0.0523 Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits z13 1.262 0.980 1.625 z23 2.327 0.880 6.156 Most useful and informative for individualhaplotype—including several haplotypes in model leads to problems with sparse data colinearity

Posterior-prob. Weights • Raw output data set from PROC HAPLOTYPE: • Recode hap1, hap2 as z1…zK (for additive coding) id HAPLOTYPE1 HAPLOTYPE2 PROB 1 1-1-1-4 1-3-3-2 1.00000 ... 6 1-1-1-4 1-1-1-4 0.98866 6 1-1-1-4 1-3-1-2 0.01129 6 1-1-1-4 1-3-1-4 0.00002 6 1-3-1-2 1-3-1-2 0.00003 ... 38 1-3-3-2 3-3-1-4 0.80049 38 1-3-3-4 3-3-1-4 0.19951 data dome; merge yahap yahoo; by id; z3=(haplotype1 eq '1-1-1-4') + (haplotype2 eq '1-1-1-4'); keep id z3 d prob; run; French et al. Genet Epidemiol 2006 Sep;30(6):485-94.

New data set • Analyze using probability weights id PROB d z3 1 1.00000 2 1 2 1.00000 2 1 ... 5 1.00000 2 0 6 0.98866 2 2 6 0.01129 2 1 6 0.00002 2 1 6 0.00003 2 0 procgenmod ...; model d=z3 / ...; weight prob; run; Need to use “sandwich” variance estimator (identity working correlation) Can use Taylor series expansion to show E-sub and prob weight likelihoods are almost identical

Missing data methods • Fits marginal likelihood—jointly estimates  and q: • Computational tricks—require specialized software • Weighted EM (M-step involves separate easy calcs for  and q) • Estimating equations • Brute force • Still fudges some details • “Prospective” likelihood ignores ascertainment • Assumes HWE • Again: not so bad?

Checking the details • Ascertainment • Sampling cases  oversampling risk haplotypes • Std. result (Prentice and Pyke): pros. likelihood OK for retro data • E.g. logistic regression for unmatched case-control studies • Result does not immediately follow for haplotypes • Saturated dist of hap pairs cannot be estimated [Schaid (2005)] • But little-no observed bias when Rh2 > .8 [> .5 ?] • Retrospective likelihood available [Epstein and Satten (2003)] • Deviations from HWE • Not too bad in realistic cases • Retrospective likelihood is more sensitive than prospective! • Even after modeling departures from HWE! [Satten and Epstein (2004)]

Excess heterozygotes Retro Prosp Prosp [Satten and Epstein (2004)]

Exp-Sub “Plug In” vs. Missing Data Prevalence=1/1000; 1000 replicate studies of 350 cases and 350 controls;causal haplotype freq=.07, good rh2 NB: haplotype frequencies are not representative of population frequencies;frequency of risk haplotype in case-control sample 3x that in gen’l pop’n

Very small in regions of “limited haplotype diversity” Why so close? Z* is “plug in” estimate; Z is true E(H|G,design); f is logit function. Still needs to be written up—and open question whether this can be extended beyond GLMs

Select bibliographyI=“plug in”; Ib=“quasi-missing data”; II=“missing data” Yet more:Zeng D & Lin DY Genet Epidemiol. 2005 Jan;28(1):70-82Spinka C et al. Genet Epidemiol. 2005 Sep;29(2):108-27Lin DY & Zeng D JASA 2006 March (with discussion)http://www.bios.unc.edu/~lin/publications/2006/LinZeng06.pdf Number of papers with “haplotype” in title, Genet Epidemiol Am J Hum Genet, Biostats, Biometrika and Human Heredity, 1997-2007

Common modeling issues • Many haplotypes, rare haplotypes • Simple approaches • 1 d.f. per common haplotype—pool rare haplotypes • Model each haplotype separately • Assumes causal variant lies on one haplotype • Note differences in reference category across models • More sophisticated approaches • Shrinkage / multi-model inference • Trade increased bias for smaller MSE, smaller prediction error • Cluster haplotypes Refs 20,22

Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 206 2-4271