1 / 41

Linear Modelling III

Linear Modelling III. Richard Mott Wellcome Trust Centre for Human Genetics. Synopsis. Logistic Regression Mixed Models. Logistic Regression and Statistical Genetics. Applicable to Association Studies Data: Binary outcomes ( eg disease status)

caia
Télécharger la présentation

Linear Modelling III

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Linear Modelling III Richard Mott Wellcome Trust Centre for Human Genetics

  2. Synopsis • Logistic Regression • Mixed Models

  3. Logistic Regression and Statistical Genetics • Applicable to Association Studies • Data: • Binary outcomes (eg disease status) • Dependent on genotypes [+ sex, environment] • Aim is to identify which factors influence the outcome • Rigorous tests of statistical significance • Flexible modelling language • Generalisation of Chi-Squared Test

  4. Data • Data for individual labelled i=1…n: • Binary Response yi • Unphased Genotypes gij for markers j=1..m

  5. Coding Unphased Genotypes • Several possibilities: • AA, AG, GG original genotypes • 12, 21, 22 • 1, 2, 3 • 0, 1, 2 # of G alleles • Missing Data • NA default in R

  6. Using R • Load genetic logistic regression tools • > source(‘logistic.R’) • Read data table from file • > t <- read.table(‘geno.dat’, header=TRUE) • Column names • names(t) • t$yresponse (0,1) • t$m1, t$m2, ….Genotypes for each marker

  7. Contigency Tables in R • ftable(t$y,t$m31)prints the contingency table > ftable(t$y,t$m31) 11 12 22 0 515 387 75 1 28 11 2 >

  8. Chi-Squared Test in R > chisq.test(t$y,t$m31) Pearson's Chi-squared test data: t$y and t$m31 X-squared = 3.8424, df = 2, p-value = 0.1464 Warning message: Chi-squared approximation may be incorrect in: chisq.test(t$y, t$m31) >

  9. The Logistic Model • Effect of SNP k on phenotype • Null model

  10. The Logistic Model • Prob(Yi=1) = exp(hi)/(1+exp(hi)) • hi = Sjxijbj- Linear Predictor • xij– Design Matrix (genotypes etc) • bj– Model Parameters (to be estimated) • Model is investigated by • estimating the bj’s by maximum likelihood • testing if the estimates are different from 0

  11. The Logistic FunctionProb(Yi=1) = exp(hi)/(1+exp(hi)) Prob(Y=1) h

  12. Types of genetic effect at a single locus

  13. Additive Genotype Model • Code genotypes as • AA x=0, • AG x=1, • GG x=2 • Linear Predictor • h = b0 + xb1 • P(Y=1|x) = exp(b0 + xb1)/(1+exp(b0 + xb1)) • PAA = P(Y=1|x=0) = exp(b0)/(1+exp(b0)) • PAG = P(Y=1|x=1) = exp(b0 + b1)/(1+exp(b0 + b1)) • PGG = P(Y=1|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1))

  14. Additive Model: b0 = -2 b1 = 2PAA = 0.12 PAG = 0.50 PGG = 0.88 Prob(Y=1) h

  15. Additive Model: b0 = 0 b1 = 2PAA = 0.50 PAG = 0.88 PGG = 0.98 Prob(Y=1) h

  16. Recessive Model • Code genotypes as • AA x=0, • AG x=0, • GG x=1 • Linear Predictor • h = b0 + xb1 • P(Y=1|x) = exp(b0 + xb1)/(1+exp(b0 + xb1)) • PAA = PAG = P(Y=1|x=0) = exp(b0)/(1+exp(b0)) • PGG = P(Y=1|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))

  17. Recessive Model: b0 = 0 b1 = 2PAA = PAG = 0.50 PGG = 0.88 Prob(Y=1) h

  18. Genotype Model • Each genotype has an independent effect • Code genotypes as (for example) • AA x=0, z=0 • AG x=1, z=0 • GG x=0, z=1 • Linear Predictor • h = b0 + xb1+zb2 two parameters • P(Y=1|x) = exp(b0 + xb1+zb2)/(1+exp(b0 + xb1+zb2)) • PAA = P(Y=1|x=0,z=0) = exp(b0)/(1+exp(b0)) • PAG = P(Y=1|x=1,z=0) = exp(b0 + b1)/(1+exp(b0 + b1)) • PGG = P(Y=1|x=0,z=1) = exp(b0 + b2)/(1+exp(b0 + b2))

  19. Genotype Model: b0 = 0 b1 = 2 b2 = -1 PAA = 0.5 PAG = 0.88 PGG = 0.27 Prob(Y=1) h

  20. Models in Rresponse ygenotype g

  21. Data Transformation • g <- t$m1 • use these functions to treat a genotype vector in a certain way: • a <- additive(g) • r <- recessive(g) • d <- dominant(g) • g <- genotype(g)

  22. Fitting the Model • afit <- glm( t$y ~ additive(g),family=‘binomial’) • rfit <- glm( t$y ~ recessive(g),family=‘binomial’) • dfit <- glm( t$y ~ dominant(g),family=‘binomial’) • gfit <- glm( t$y ~ genotype(g),family=‘binomial’) • Equivalent models: • genotype = dominant + recessive • genotype = additive + recessive • genotype = additive + dominant • genotype ~ standard chi-squared test of genotype association

  23. Parameter Estimates > summary(glm( t$y ~ genotype(t$m31), family='binomial')) Coefficients: Estimate Std. Error z value Pr(>|z|) b0 (Intercept) -2.9120 0.1941 -15.006 <2e-16 *** b1 genotype(t$m31)12 -0.6486 0.3621 -1.791 0.0733 . b2 genotype(t$m31)22 -0.7124 0.7423 -0.960 0.3372 --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 >

  24. Analysis of DevianceChi-Squared Test > anova(glm( t$y ~ genotype(t$m31), family='binomial'), test="Chisq") Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. DfResid. Dev NULL 1017 343.71 genotype(t$m31) 2 3.96 1015 339.76

  25. Model Comparison • Compare general model with additive, dominant or recessive models: > afit <- glm(t$y ~ additive(t$m20)) > gfit <- glm(t$y ~ genotype(t$m20)) > anova(afit,gfit) Analysis of Deviance Table Model 1: t$y ~ additive(t$m20) Model 2: t$y ~ genotype(t$m20) Resid. Df Resid. Dev Df Deviance 1 1016 38.301 2 1015 38.124 1 0.177 >

  26. Scanning all Markers > logscan(t,model=‘additive’) Deviance DF Pval LogPval m1 8.604197e+00 1 3.353893e-03 2.474450800 m2 7.037336e+00 1 7.982767e-03 2.097846522 m3 6.603882e-01 1 4.164229e-01 0.380465360 m4 3.812860e+00 1 5.086054e-02 1.293619014 m5 7.194936e+00 1 7.310960e-03 2.136025588 m6 2.449127e+00 1 1.175903e-01 0.929628598 m7 2.185613e+00 1 1.393056e-01 0.856031566 m8 1.227191e+00 1 2.679539e-01 0.571939852 m9 2.532562e+01 1 4.842353e-07 6.314943565 m10 5.729634e+01 1 3.748518e-14 13.426140380 m11 3.107441e+01 1 2.483233e-08 7.604982503 … … …

  27. Multilocus Models • Can test the effects of fitting two or more markers simultaneously • Several multilocus models are possible • Interaction Model assumes that each combination of genotypes has a different effect • eg t$y ~ t$m10 * t$m15

  28. Multi-Locus Models > f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26) , family='binomial’, test="Chisq") > anova(f) Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. DfResid. Dev NULL 1017 343.71 genotype(t$m13) 2 108.68 1015 235.03 genotype(t$m26) 2 1.14 1013 233.89 genotype(t$m13):genotype(t$m26) 3 6.03 1010 227.86 > pchisq(6.03,3,lower.tail=F) calculate p-value [1] 0.1101597

  29. Adding the effects of Sex and other Covariates • Read in sex and other covariate data, eg. age from a file into variables, say a$sex, a$age • Fit models of the form • fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family=‘binomial’) • fit2 <- glm(t$y ~ a$sex + a$age, family=‘binomial’)

  30. Adding the effects of Sex and other Covariates • Compare models using anova – test if the effect of the marker m10 is significant after taking into account sex and age • anova(fit1,fit2)

  31. Generalised Linear Models • Applies linear modelling framework to data that are not normally distributed • Can be used for probability distributions from the exponential family: • f(y|q) = h(y) exp (h(q).T(y) –A(q)) • Expected value of y is a transformed version of a linear predictor • E(y) = g-1(Xb) • g() is called the link function • Examples include logistic regression, survival analysis • In R GLMs are fitted using glm() and models compared using the anova() function

  32. Random EffectsMixed Models • So far our models have had fixed effects • each parameter can take any value independently of the others • each parameter estimate uses up a degree of freedom • models with large numbers of parameters have problems • saturation - overfitting • poor predictive properties • numerically unstable and difficult to fit • in some cases we can treat parameters as being sampled from a distribution – random effects • only estimate the parameters needed to specify that distribution • can result in a more robust model

  33. Example of a Mixed Model • Testing genetic association across a large number of of families • yi = trait value of i’th individual • bi = genotype of individual at SNP of interest • fi = family identifier (factor) • ei = error, variance s2 • y = m + b + f + e • If we treat family effects f as fixed then we must estimate a large number of parameters • Better to think of these effects as having a distribution N(0,t2) for some variance t which must be estimated • individuals from the same family have the same f and are correlated • cov = t2 • individuals from different families are uncorrelated • genotype parameters b still treated as fixed effects • mixed model

  34. Mixed Models • y = m + b + f + e • Fixed effects model • E(y) = m + b + f • Var(y) = Is2 • I is identity matrix • Mixed model • E(y) = m + b • Var(y) = Is2 + Ft2 • F is a matrix, Fij = 1 if i, j in same family • Need to estimate both s2 and t2

  35. Generalised Least Squares • y = Xb + e • Var(y) = V (symmetric covariance matrix) • V = Is2 (uncorrelated errors) • V = Is2 + Ft2 (single grouping random effect) • V = Is2 + Gt2 (G = genotype identity by state) • GLS solution, if V is known • unbiased, • efficient (minimum variance) • consistent (tends to true value in large samples) • asymptotically normally distributed

  36. Multivariate Normal Distribution • Joint distribution of a vector of correlated observations • Another way of thinking about the data • Estimation of parameters in a mixed model is special case of likelihood analysis of a MVN • y ~ MVN(m, V) • m is vector of expected values • V is variance-covariance matrix • If V = Is2 then elements of y are uncorrelated and equivalent to n independent Normal variables • probability density/Likelihood is

  37. Henderson’s Mixed Model Equations • General linear mixed model. • b are fixed effect • u are random effects • X, Z design matrices

  38. Mixed Models • Fixed effects models are special cases of mixed models • Mixed models sometimes are more powerful (as fewer parameters) • But check that the assumption effects are sampled from a Normal distribution is reasonable • Differences between estimated parameters and statistical significance in fixed vs mixed models • Shrinkage: often random effect estimates from a mixed model have smaller magnitudes than from a fixed effects model.

  39. Mixed Models in R • Several R packages available, e.g. • lme4 general purpose mixed models package • emma for genetic association in structured populations • Formulas in lme4, eg test for association between genotype and trait • H0: E(y) = m • vs • H1: E(y) = m + b • Var(y) = Is2 + Ft2 • h0 = lmer(y ~ 1 +(1|family), data=data) • h1 = lmer(y ~ genotype +(1|family), data=data) • anova(h0,h1)

  40. Formulas in lmer() • y ~ fixed.effects + (1|Group1) + (1|Group2) etc • random intercept models • y ~ fixed.effects +(1|Group1) + (x|Group1) • random slope model

  41. Example: HDL Data • > library(lme4) • > b=read.delim("Biochemistry.txt”) • cc=complete.cases(b$Biochem.HDL) • > f0=lm(Biochem.HDL ~ Family , data=b,subset=cc) • > f1=lmer(Biochem.HDL ~ (1|Family) , data=b,subset=cc) • > anova(f0) • Analysis of Variance Table • Response: Biochem.HDL • Df Sum Sq Mean Sq F value Pr(>F) • Family 175 147.11 0.84064 5.2098 < 2.2e-16 *** • Residuals 1681 271.24 0.16136 • --- • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 • > f1 • Linear mixed model fit by REML • Formula: Biochem.HDL ~ (1 | Family) • Data: b • Subset: cc • AIC BIC logLik deviance REMLdev • 2161 2178 -1078 2149 2155 • Random effects: • Groups Name Variance Std.Dev. • Family (Intercept) 0.066633 0.25813 • Residual 0.161487 0.40185 • Number of obs: 1857, groups: Family, 176 • Fixed effects: • Estimate Std. Error t value • (Intercept) 1.60615 0.02263 70.97 • > plot(resid(f0),resid(f1))

More Related