1 / 56

Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 206 2-4271

EPI293 Design and analysis of gene association studies Winter Term 2008 Lecture 4: Multilocus haplotypes. Peter Kraft pkraft@hsph.harvard.edu Bldg 2 Rm 206 2-4271. “Effective number of SNPs”. Can use spectral decomposition to approximate permutation correction for multiple testing.

ervin
Télécharger la présentation

Peter Kraft pkraft@hsph.harvard Bldg 2 Rm 206 2-4271

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. EPI293Design and analysis of gene association studiesWinter Term 2008Lecture 4: Multilocus haplotypes Peter Kraft pkraft@hsph.harvard.eduBldg 2 Rm 2062-4271

  2. “Effective number of SNPs” • Can use spectral decomposition to approximate permutation correction for multiple testing

  3. Example: Candidate Genes from Smoking GWAS Number of SNPs in gene region Meff Bonferroni corrected threshold Effective number of SNPs

  4. Tradeoffs: Coverage vs. Sample Size • Say you want to study JNK1 in your study of 400 cases and 800 controls and can afford 19,200 genotypes • You have to choose between good coverage (high minimum r2) and sample size

  5. Outline • Background • Haplotype estimation algorithms • Rh2, Rs2 and htSNPs • Plug-in methods (regression calibration) • Missing data methods (structural modeling) • Rare haplotypes, robust models • Summary

  6. Why haplotypes? • May be useful in linkage-disequilibrium mapping • May be directly relevant (cis-interaction) • May be efficient surrogates for other SNPs • “tag SNPs” enable indirect testing for unknown variants • Haplotypes may reduce test degrees of freedom = 1 1 0 0 = 1 0 0 1

  7. Basic “tagging SNP” design Genotype reference panel(many markers, few subjects) Choose “tagging SNPs”(many markers, few subjects) Genotype “tagging SNPs” in main study(few markers, many subjects) Analyze appropriately(single locus or multilocus--unphased or phased)

  8. ATM

  9. CYP19

  10. CYP19

  11. G A A G A/G A G G A G A G A G A G A A/G A G G A A/G G A A G G A G A A G Unobservedtrue haplotyes Observedgenotypes Possiblehaplotypes Basic technical problem: unknown phase

  12. Need to infer haplotypes while accounting for haplotype uncertainty • Missing data methods [EM algorithm, Bayesian methods] can be used to estimate haplotype frequencies, infer haplotype probability, conditional on observed genotypes = Pr(H|G;qH) • Can use this probability to assign haplotype pair [as if it were known] or as a weight in standard regression analysis • Missing data methods can also be used to jointly estimate regression parameters and haplotype frequencies

  13. G A A G G A G A G A G A G A G A A G G A G A A G Only one pair is possible! Example: how we can say something about haplotypes given genotypes What if we knew the haplotype frequencies?

  14. Haplotype frequencies Indicator: 1 if G is compatible with H, 0 otherwise Sum is over all haplotype pairs How to calculate Pr(H|G;qH) Use Bayes’ theorem: Assumes HWE

  15. Example 1 Denominator

  16. Example 2 Denominator

  17. The EM algorithm • Uses calculations on previous pages to estimate allele frequencies • Individual haplotype probabilities (given genotypes) are natural by-product • Assumes HWE • Pr({Hi,Hj})=2qiqj if ij; qi2 otherwise • Does not substantially affect individual haplotype estimates in regions of limited haplotype diversity (high rh2) • EM algorithm makes no assumptions about inter-marker distance or marker order! • More sophisticated algorithms (e.g. PHASE) do • Again, in regions of limited haplotype diversity (low recombination) may not matter much

  18. Plain-vanilla EM • Start with guess at haplotype frequencies qH(0) • Count expected numbers of haps, given genos and qH(0) • Update qH(1) as relative frequencies of haps • Keep going until converged • Provides natural measure of phase uncertainty • Pr(H|G;estimated q)

  19. Example – plain vanilla EM • Raw data • 64 subjects with genotypes {G/G,G/G,G/G} • 32 subjects with genotypes {G/A,G/A,G/A} • 4 subjects with genotypes {A/A,A/A,A/A} • Initialize qH(0)=(.125,.125,.125,.125,.125,.125,.125,.125) • Expected hap counts given genos, qH(0) • 64 with haps {G-G-G,G-G-G} • 8 with haps {G-G-G,A-A-A} • 8 each with {G-G-A,A-A-G},{G-A-G,A-G-A},{G-A-A,A-G-G} • 4 with {A-A-A,A-A-A} • Update qH(1)=(.68,.04,.04,.04,.04,.04,.04,.08) qH(2)=(.791,.003,.003,.003,.003,.003,.003,.191)

  20. Modifications to EM algorithm • For #SNPs > 8, plain vanilla algorithm can be slow • Faster versions involve “partial ligation” • Calculate frequencies within blocks*—then merge blocks • PLEM, tagsnps—user-defined blocks • SNPHAP, PROC HAPLOTYPE (EST=STEPEM)—add one SNP at a time Niu et al. (2002) AJHG 70:157 * These blocks need have nothing to do with “haplotype blocks”—they are defined arbitrarily for computational convenience

  21. Bayesian approaches • HAPLOTYPER • Original “partition-ligation” software • Used Bayesian framework as computational tool • Prior = dirichlet • PHASE • Uses what we know about genetics to aid freq estimation • Prior = coalescent plus recombination • May improve estimation where recomb. matters (long dist) • Marginal improvement over EM where recomb. doesn’t matter • Others (Arlequin 3.0, hap, …)

  22. Outline • Background • Haplotype estimation algorithms • Rh2, Rs2 and htSNPs • Plug-in methods (regression calibration) • Missing data methods (structural modeling) • Rare haplotypes, robust models • Summary

  23. Stram et al. (2003) Hum Hered 55:27 • Presents Rh2 and Rs2 measures • These account for phase uncertainty • Other measures [Weale et al. (2003) AJHG 73:551] do not • Uses these to select haplotype tagging SNPs • Implemented in “tagsnps”

  24. Rh2 • Rh2 is the squared correlation between true and predicted haplotype dosage • Expected number of hap h given G and q = E{h(H)|G} • G may or may not contain all relevant markers • E.g. G may be genotypes of tagging SNPs • May be <1 even if all relevant markers are genotyped! Count of hs in H

  25. Rh2 =.54 Even if all four SNPs are genotyped, Rh2 runs from .72 to .98 Kraft et al (2005) Genet Epidemiol

  26. Rs2 • Rs2 is the squared correlation between true and predicted genotype • Expected allele count given G and q = E{s(H)|G} • Will be 1 if SNP s is genotyped (contained in G) • Not used by tagger (tagger looks at correlation with one haplotype and untyped SNP; this looks at correlation with all haplotypes and untyped SNP)

  27. Rs2 and Rh2 related to power • 1/R2 is “sample size inflator” • To achieve same power as study with N subjects that observed causal SNP s (haplotype h), we need N/R2 subjects • Equivalently: calculate power with “effective sample size” R2 N

  28. Choosing tag SNPs s.t. min R2 >  • Stram et al. use stepwise algorithm • Given k-1 tag SNPs… • …add SNP that leads to greatest increase in R2… • …then see if any of the original k-1 can be “swapped out” • Keep going until R2 >  • Performs well relative to exhaustive search

  29. Alternative paradigmsNO PHASE! NO BLOCKS! • Carlson—use pairwise R2 • Partition SNPs in “bins” of high mutual correlation • Choose one SNP per bin • Chapman—use multivariate R2 • Predict untyped SNP using tags • Choose tags s.t. min R2 >  using “crude step up” algorithm • How many SNPs to put into test? • Still have to define region to tag • Can lead to trying to fit too many SNPs at once

  30. Outline • Background • Haplotype estimation algorithms • Rh2, Rs2 and htSNPs • Plug-in methods (regression calibration) • Missing data methods (structural modeling) • Rare haplotypes, robust models • Summary

  31. Solutions to the unknown phase problem • Plug in (or regression calibration) • Old school: estimate freqs in cases and controls seperately • Naïve: use “best guess haplotype” • Use expected haplotype scores • Closely related: Use posterior haplotype probability weights • Missing data (or structural modeling) • Treats phase as missing data • Fits likelihood marginal over phase

  32. Old school • Estimate haplotypes assuming HWE • Pooled • In cases and controls separately • LRT* = 2 (log Lcases + log Lcontrols – log Lpooled) • Disadvantages • May not have usual ChiSquare dist’n • Cases generally not in HWE under alternative • Rare haplotypes • Cannot incorporate covariates • Haplotype-specific odds ratios dodgy * Caveat Emptor! Test statistic in SAS PROC HAPLOTYPE leaves off the 2! Must double statistic & calculate p-value by hand.

  33. Plug-in methods • Write down GLM you want to fit... • ...then use GLM you can fit

  34. Naïve plug-in approach • Use “best guess haplotypes” • Assign Hi s.t. Pr(Hi|Gi;qhat) = maxH Pr(H|Gi;qhat) • Error in best guess leads to bias in parameter estimates Rh2=0.83 Rh2=0.54 * All biases significantly diff’t from 0 with p < .005 Kraft et al (2005) Genet Epidemiol

  35. Best guess: carrier Rh2=0.83 Best guess: non-carrier Rh2=0.54 Kraft et al (2005) Genet Epidemiol

  36. Expectation substitution (regression calibration) • Estimate qhat in pooled cases and controls • Set Z* equal to E[Z|G;qhat] • E.g. for dominant model, Zh* = Pr(Carrier of h|G) • “Hedging your bets” • Fudges some technical details • Estimates of parameter variance may be narrow • Variation in Z* due to uncertainty in qhat not accounted for • Imputation of H and hence Z* should account for sampling • Want Z = E[Z|G,Asc;q,] • These details don’t seem to matter for realistic situations

  37. Bias in parameter estimates: a comparison of “plug in” methods Expectation substitution Best Guess Kraft et al (2005) Genet Epidemiol

  38. Expectation substitution: Example Macro %happy creates data set with expected counts for different models (see www.hsph.harvard.edu/faculty/kraft/soft.htm) %happy( indsn=oohay, /* input data set */ keep=m1 m2 m3 m4 m5 m6 m7 m8, /* marker variables--REQUIRED */ style=MAR, /* use default proc haplotype input */ outadd=yooadd, /* output data set--additive coding */ outcodomnant=yoocod, /* output data set--codominant coding */ range=.05); /* set 'rare' threshold for pooling */

  39. Haplotype frequencies Haplotype Freq Code 1-4-3-2 0.37209 z1 3-4-3-2 0.28084 z2 3-2-4-2 0.11223 z3 3-4-4-1 0.09201 z4 3-4-4-2 0.08943 z5Other 0.054 z6 Using additive coding (zh = expected number of hap h) /* run logistic model including rare haps (z6)*/ proclogistic; model d(descending)=z2-z6; /* run logistic model without rare haps */ proclogistic; model d(descending)=z2-z5; run; This works just like a multiallelic test, as if we had observed the alleles (haplotypes) 1 through 6

  40. Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 16.3993 4 0.0025 Score 16.3316 4 0.0026 Wald 16.1984 4 0.0028 Without “other” Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits z2 0.894 0.756 1.057 z3 1.331 1.048 1.690 z4 1.349 1.048 1.735 z5 1.081 0.838 1.394 Slight changes towards (away) null[3rd SNP is causal] Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 17.8874 5 0.0031 Score 17.8094 5 0.0032 Wald 17.6553 5 0.0034 With “other” Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits z2 0.910 0.768 1.079 z3 1.338 1.053 1.699 z4 1.369 1.063 1.763 z5 1.112 0.859 1.440 z6 1.238 0.878 1.744

  41. Using codominant coding (z1h = probability of carrying one copy of h; z2h = probability of carrying two copies of h) proclogistic; model d(descending)=z13 z23; run; Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 6.1077 2 0.0472 Score 6.0395 2 0.0488 Wald 5.9014 2 0.0523 Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits z13 1.262 0.980 1.625 z23 2.327 0.880 6.156 Most useful and informative for individualhaplotype—including several haplotypes in model leads to problems with sparse data colinearity

  42. Posterior-prob. Weights • Raw output data set from PROC HAPLOTYPE: • Recode hap1, hap2 as z1…zK (for additive coding) id HAPLOTYPE1 HAPLOTYPE2 PROB 1 1-1-1-4 1-3-3-2 1.00000 ... 6 1-1-1-4 1-1-1-4 0.98866 6 1-1-1-4 1-3-1-2 0.01129 6 1-1-1-4 1-3-1-4 0.00002 6 1-3-1-2 1-3-1-2 0.00003 ... 38 1-3-3-2 3-3-1-4 0.80049 38 1-3-3-4 3-3-1-4 0.19951 data dome; merge yahap yahoo; by id; z3=(haplotype1 eq '1-1-1-4') + (haplotype2 eq '1-1-1-4'); keep id z3 d prob; run; French et al. Genet Epidemiol 2006 Sep;30(6):485-94.

  43. New data set • Analyze using probability weights id PROB d z3 1 1.00000 2 1 2 1.00000 2 1 ... 5 1.00000 2 0 6 0.98866 2 2 6 0.01129 2 1 6 0.00002 2 1 6 0.00003 2 0 procgenmod ...; model d=z3 / ...; weight prob; run; Need to use “sandwich” variance estimator (identity working correlation) Can use Taylor series expansion to show E-sub and prob weight likelihoods are almost identical

  44. Missing data methods • Fits marginal likelihood—jointly estimates  and q: • Computational tricks—require specialized software • Weighted EM (M-step involves separate easy calcs for  and q) • Estimating equations • Brute force • Still fudges some details • “Prospective” likelihood ignores ascertainment • Assumes HWE • Again: not so bad?

  45. Checking the details • Ascertainment • Sampling cases  oversampling risk haplotypes • Std. result (Prentice and Pyke): pros. likelihood OK for retro data • E.g. logistic regression for unmatched case-control studies • Result does not immediately follow for haplotypes • Saturated dist of hap pairs cannot be estimated [Schaid (2005)] • But little-no observed bias when Rh2 > .8 [> .5 ?] • Retrospective likelihood available [Epstein and Satten (2003)] • Deviations from HWE • Not too bad in realistic cases • Retrospective likelihood is more sensitive than prospective! • Even after modeling departures from HWE! [Satten and Epstein (2004)]

  46. Excess heterozygotes Retro Prosp Prosp [Satten and Epstein (2004)]

  47. Exp-Sub “Plug In” vs. Missing Data Prevalence=1/1000; 1000 replicate studies of 350 cases and 350 controls;causal haplotype freq=.07, good rh2 NB: haplotype frequencies are not representative of population frequencies;frequency of risk haplotype in case-control sample 3x that in gen’l pop’n

  48. Very small in regions of “limited haplotype diversity” Why so close? Z* is “plug in” estimate; Z is true E(H|G,design); f is logit function. Still needs to be written up—and open question whether this can be extended beyond GLMs

  49. Select bibliographyI=“plug in”; Ib=“quasi-missing data”; II=“missing data” Yet more:Zeng D & Lin DY Genet Epidemiol. 2005 Jan;28(1):70-82Spinka C et al. Genet Epidemiol. 2005 Sep;29(2):108-27Lin DY & Zeng D JASA 2006 March (with discussion)http://www.bios.unc.edu/~lin/publications/2006/LinZeng06.pdf Number of papers with “haplotype” in title, Genet Epidemiol Am J Hum Genet, Biostats, Biometrika and Human Heredity, 1997-2007

  50. Common modeling issues • Many haplotypes, rare haplotypes • Simple approaches • 1 d.f. per common haplotype—pool rare haplotypes • Model each haplotype separately • Assumes causal variant lies on one haplotype • Note differences in reference category across models • More sophisticated approaches • Shrinkage / multi-model inference • Trade increased bias for smaller MSE, smaller prediction error • Cluster haplotypes Refs 20,22

More Related