Télécharger la présentation
## BIOL 582

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**BIOL 582**Lecture Set 12 ANOVA for simple linear regression Multiple regression With R annotations**BIOL 582**Linear Regression Model Set-up • Often, biological research is concerned with the relationship of two variables • Consider the snake data. There are two variables that could be related: HS and SVL • First, one might not wish to assert a “dependence” on one variable on the other, but simply ask if there is a linear relationship • Are the variables correlated? > rm(list=ls()) > > snake<-read.csv("snake.data.csv") > > attach(snake) > > cor(SVL,HS) [1] 0.7289513**BIOL 582**Linear Regression Model Set-up • Often, biological research is concerned with the relationship of two variables • Consider the snake data. There are two variables that could be related: HS and SVL • First, one might not wish to assert a “dependence” on one variable on the other, but simply ask if there is a linear relationship • Are the variables correlated? The null hypothesis for a correlation coefficient is H0: ρ = 0 The alternative can have direction HA: ρ > 0 HA: ρ< 0 HA: ρ ≠ 0 A t-test or randomization test can be used to test the null. Remember that r is the sample stat, ρ is the population parameter > rm(list=ls()) > > snake<-read.csv("snake.data.csv") > > attach(snake) > > cor(SVL,HS) [1] 0.7289513**BIOL 582**Linear Regression Model Set-up • Often, biological research is concerned with the relationship of two variables • Consider the snake data. There are two variables that could be related: HS and SVL • First, one might not wish to assert a “dependence” on one variable on the other, but simply ask if there is a linear relationship • Are the variables correlated? The null hypothesis for a correlation coefficient is H0: ρ = 0 The alternative can have direction HA: ρ > 0 HA: ρ< 0 HA: ρ ≠ 0 A t-test or randomization test can be used to test the null. Remember that r is the sample stat, ρ is the population parameter > rm(list=ls()) > > snake<-read.csv("snake.data.csv") > > attach(snake) > > cor(SVL,HS) [1] 0.7289513 > r<-cor(SVL,HS) > n<-length(HS) > t<-r/sqrt(((1-r^2)/(n-2))) > t [1] 6.5641 > p.value<-1-pt(t,n-2) > p.value [1] 4.809665e-08**BIOL 582**Linear Regression Model Set-up • Often, biological research is concerned with the relationship of two variables • Consider the snake data. There are two variables that could be related: HS and SVL • First, one might not wish to assert a “dependence” on one variable on the other, but simply ask if there is a linear relationship • Are the variables correlated? > # Randomization approach > > r<-cor(SVL,HS) > result<-r; p<-1 > > for(i in 1:10000){ + HS.r<-sample(HS) + r.r<-cor(SVL,HS.r) + result<-c(result,r.r) + p<-ifelse(r.r>=r,p+1,p) + } > > p.value<-p/10000 > p.value [1] 1e-04**BIOL 582**Linear Regression Model Set-up • Often, biological research is concerned with the relationship of two variables • Consider the snake data. There are two variables that could be related: HS and SVL • Second, one might not wish to assert a “dependence” on one variable on the other. This makes most sense when one variable is clearly independent of the other (e.g., age), but sometimes, the independent variable is simple a “general” variable (e.g., overall size) and the dependent variable is a specific variable (e.g., size of certain morphological trait). • The linear equation is • Which has the model • That produces error The value, x0, is generally 1 for all subjects, and can, therefore, be eliminated from the equation. However, it can also be 0, if one wishes to force the linear model through the origin (where y must be 0, if x is 0). Thus, x0 is a dummy variable which indicates whether to include or exclude a y-intercept.**BIOL 582**Linear Regression Model Set-up • It is simple to see how the linear model is reduced • Imagine that for each model (with at least one parameter) , SSEcan be obtained easily (from residuals of predictions made by estimated model parameters). There are two sets of SSE from the two different models • From model containing: slopeintercept only • Both models contain an intercept**BIOL 582**Linear Regression Model Evaluation • Calculating SS for the regression (the type of SS is irrelevant with only one slope) • These are the only values needed for ANOVA • One can use ANOVA or t-tests**BIOL 582**Linear Regression Model Evaluation • t-tests to test coefficient hull hypotheses • First, the t stats are • Where the standard errors are • In each case > lm.svl<-lm(HS~SVL, x=T) > summary(lm.svl) Call: lm(formula = HS ~ SVL, x = T) Residuals: Min 1Q Median 3Q Max -4.7469 -1.2607 -0.7911 1.1206 7.2280 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.26976 1.15227 1.970 0.0562 . SVL 0.09987 0.01521 6.564 9.62e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1 Residual standard error: 2.547 on 38 degrees of freedom Multiple R-squared: 0.5314, Adjusted R-squared: 0.519 F-statistic: 43.09 on 1 and 38 DF, p-value: 9.619e-08**BIOL 582**Linear Regression Model Evaluation • Summary of ANOVA simple linear regression > lm.svl<-lm(HS~SVL, x=T) > anova(lm.svl) Analysis of Variance Table Response: HS Df Sum Sq Mean Sq F value Pr(>F) SVL 1 279.48 279.475 43.087 9.619e-08 *** Residuals 38 246.48 6.486 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1**BIOL 582**Linear Regression Model Uses and Assumptions • Linear regression (or extensions of it) are used in just about any analysis which requires examining trends. • E.g., trait vs. time, trait vs. age, trait vs. size, response vs. temp. • Assumptions include • Normally distributed residuals (not data) • Homoscedasticity • Independent observations (i.e., sample sizes don’t contain multiple measurements on the same subjects; however, there are analyses which deal with repeated measures.) • These are the assumptions of Linear Models!**BIOL 582**Linear Regression Model Set-up • One can also consider more than one independent variable • Some data on fish body shape will be used as an example, but first the model set-up • The linear equation is • Which has the model • That produces error**BIOL 582**Linear Regression Model Set-up • Using two independent variables, the model is reduced as • It should be obvious that ANOVA will have to consider the type of SS**BIOL 582**Example Data (more pupfish) • Data collected from photographs – 13 anatomical landmarks • Body shape calculated with geometric morphometrics (GM)methods • The size of the head, body, and tail also calculated using log(centroid size) • The goal of the research was to understand how salinity influences body shape, but we knew that fish size could also influence shape • Question: Is body shape (streamlined/bluff) more associated with head, body, or tail size? Response variable more negative Shape index more positive From Collyer et al. 2007, Ecol. Res.**BIOL 582**Example Data (more pupfish) • Data collected from photographs – 13 anatomical landmarks • Body shape calculated with geometric morphometrics (GM)methods • The size of the head, body, and tail also calculated using log(centroid size) • The goal of the research was to understand how salinity influences body shape, but we knew that fish size could also influence shape • Question: Is body shape (streamlined/bluff) more associated with head, body, or tail size? From Collyer et al. 2007, Ecol. Res.**BIOL 582**Multiple Linear Regression Model Uses and Assumptions • Linear regression (or extensions of it) are used in just about any analysis which requires examining trends. • E.g., trait vs. time, trait vs. age, trait vs. size, response vs. temp. • E.g., multiple regression: trait vs. time, age, and size • Assumptions include • Normally distributed residuals (not data) • Homoscedasticity • Independent observations (i.e., sample sizes don’t contain multiple measurements on the same subjects; however, there are analyses which deal with repeated measures.) • These are the assumptions of Linear Models!**BIOL 582**Multiple Linear Regression Model Uses and Assumptions • Assumptions include • Normally distributed residuals (not data) • Homoscedasticity • Independent observations (i.e., sample sizes don’t contain multiple measurements on the same subjects; however, there are analyses which deal with repeated measures.) • These are the assumptions of Linear Models! > shape.fit<-lm(shape~log(headCS) +log(bodyCS)+log(tailCS)) > > par(mfcol=c(2,2)) > plot(shape.fit) > shapiro.test(resid(shape.fit)) Shapiro-Wilk normality test data: resid(shape.fit) W = 0.9888, p-value = 0.8362**BIOL 582**Multiple Linear Regression Model Uses and Assumptions • Assumptions include • Normally distributed residuals (not data) • Homoscedasticity • Independent observations (i.e., sample sizes don’t contain multiple measurements on the same subjects; however, there are analyses which deal with repeated measures.) • These are the assumptions of Linear Models! > # Some additional diagnostic plots (just for visual reference) > > par(mfcol=c(1,3)) > > plot(log(headCS),resid(shape.fit)) > plot(log(bodyCS),resid(shape.fit)) > plot(log(tailCS),resid(shape.fit))**BIOL 582**Linear Regression Model Evaluation • Calculating SS for the model • These are the only values needed for ANOVA, but ANOVA can also look at the SS for each independent variable, separately • One can use ANOVA or t-tests**BIOL 582**Linear Regression Model Evaluation • t-tests to test coefficient hull hypotheses • First, the t stats are • Where the standard errors are • In each case > summary(shape.fit) Call: lm(formula = shape ~ log(headCS) + log(bodyCS) + log(tailCS)) Residuals: Min 1Q Median 3Q Max -0.036973 -0.011269 -0.001346 0.011482 0.039555 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.73692 0.16605 4.438 4.03e-05 *** log(headCS) 0.10427 0.04503 2.315 0.0241 * log(bodyCS) 0.92555 0.15862 5.835 2.43e-07 *** log(tailCS) 0.12293 0.06217 1.977 0.0527 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1 Residual standard error: 0.01589 on 59 degrees of freedom Multiple R-squared: 0.4348, Adjusted R-squared: 0.4061 F-statistic: 15.13 on 3 and 59 DF, p-value: 2.043e-07**BIOL 582**Linear Regression Model Evaluation • t-tests to test coefficient hull hypotheses • First, the t stats are • Where the standard errors are • In each case > summary(shape.fit) Call: lm(formula = shape ~ log(headCS) + log(bodyCS) + log(tailCS)) Residuals: Min 1Q Median 3Q Max -0.036973 -0.011269 -0.001346 0.011482 0.039555 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.73692 0.16605 4.438 4.03e-05 *** log(headCS) 0.10427 0.04503 2.315 0.0241 * log(bodyCS) 0.92555 0.15862 5.835 2.43e-07 *** log(tailCS) 0.12293 0.06217 1.977 0.0527 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1 Residual standard error: 0.01589 on 59 degrees of freedom Multiple R-squared: 0.4348, Adjusted R-squared: 0.4061 F-statistic: 15.13 on 3 and 59 DF, p-value: 2.043e-07 Full Model ANOVA**BIOL 582**Linear Regression Model Evaluation • Summary of ANOVA for different independent variables • Type I SS • Note M refers to b1,b2,..,b(k-1), bk , or in other words, all model parameters**BIOL 582**Linear Regression Model Evaluation • Summary of ANOVA for different independent variables • Type III SS • Note M refers to b1,b2,..,b(k-1), bk , or in other words, all model parameters • Note M-bkrefers to all model parameters except the kth • Note Type II only applies if there are interactions**BIOL 582**Linear Regression Model Evaluation > library(car) > # Type I SS ANOVA > anova(shape.fit) Analysis of Variance Table Response: shape Df Sum Sq Mean Sq F value Pr(>F) log(headCS) 1 0.0011737 0.0011737 4.6475 0.03519 * log(bodyCS) 1 0.0093027 0.0093027 36.8360 9.921e-08 *** log(tailCS) 1 0.0009874 0.0009874 3.9099 0.05268 . Residuals 59 0.0149001 0.0002525 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1 > > # Type III SS ANOVA > Anova(shape.fit,type="III") Anova Table (Type III tests) Response: shape Sum Sq Df F value Pr(>F) (Intercept) 0.0049742 1 19.6963 4.034e-05 *** log(headCS) 0.0013539 1 5.3609 0.02409 * log(bodyCS) 0.0085985 1 34.0477 2.425e-07 *** log(tailCS) 0.0009874 1 3.9099 0.05268 . Residuals 0.0149001 59 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1**BIOL 582**Multiple Comparisons? > # There really is not a way to do multiple comparisons if there are not multiple groups > # But one can still visualize which effects are more meaningful > # for predicting responses > > par(mfcol=c(1,3)) > > plot(log(headCS),shape); abline(lm(shape~log(headCS)),col='red') > plot(log(bodyCS),shape); abline(lm(shape~log(bodyCS)),col='blue') > plot(log(tailCS),shape); abline(lm(shape~log(tailCS)),col='green')**BIOL 582**Multiple Comparisons? > # And if one is clever, one can look at partial effects > # Warning, only try this if you know what you are doing > > par(mfcol=c(1,3)) > > # partial residuals > res.no.head<-resid(lm(shape~log(bodyCS)+log(tailCS))) > res.no.body<-resid(lm(shape~log(headCS)+log(tailCS))) > res.no.tail<-resid(lm(shape~log(headCS)+log(bodyCS))) > > plot(log(headCS),res.no.head); abline(lm(res.no.head~log(headCS)),col='red') > plot(log(bodyCS),res.no.body); abline(lm(res.no.body~log(bodyCS)),col='blue') > plot(log(tailCS),res.no.tail); abline(lm(res.no.tail~log(tailCS)),col='green') > > # The plots show “partial” effects**BIOL 582**Some other considerations What if one just wanted to know correlations among variables? Correlation matrix > # first make a matrix > log.headCS<-log(headCS);log.bodyCS<-log(bodyCS);log.tailCS<-log(tailCS) > Y<-cbind(log.headCS,log.bodyCS,log.tailCS,shape) > > # then make a correlation matrix > > cor(Y) log.headCS log.bodyCS log.tailCS shape log.headCS 1.00000000 0.01516104 -0.1124418 0.2109957 log.bodyCS 0.01516104 1.00000000 -0.6660615 0.5971490 log.tailCS -0.11244183 -0.66606147 1.0000000 -0.2754235 shape 0.21099571 0.59714905 -0.2754235 1.0000000 > > # and plot > > pairs(Y,main="All pairwise scatterplots")**BIOL 582**Some other considerations What if one does not want to assume strict independence of independent variables? Structural Equation Modeling (Path Analysis) Calculates partial correlations among variables and one can create a path diagram to show relationships. (This is a more complex subject, we might cover it later) Variable 1 Response Variable 2 Variable 3**BIOL 582**Some other considerations What if one wants to examine curvilinear relationships? Polynomial Regression > # polynomial regression of shape and total body size (total CS) > > plot(totalCS,shape)**BIOL 582**Some other considerations What if one wants to examine curvilinear relationships? Polynomial Regression > shapefit<-lm(shape~poly(totalCS,3)) > par(mfcol=c(2,2)) > plot(shapefit) > shapiro.test(resid(shapefit)) Shapiro-Wilk normality test data: resid(shapefit) W = 0.9723, p-value = 0.1665**BIOL 582**Some other considerations What if one wants to examine curvilinear relationships? Polynomial Regression > plot(totalCS,shape) > points(totalCS,predict(shapefit),pch=21,bg="red") > summary(shapefit) Call: lm(formula = shape ~ poly(totalCS, 3)) Residuals: Min 1Q Median 3Q Max -0.035464 -0.012123 -0.002539 0.010906 0.055230 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -5.902e-20 2.254e-03 0.000 1.0000 poly(totalCS, 3)1 7.911e-02 1.789e-02 4.422 4.27e-05 *** poly(totalCS, 3)2 -3.485e-02 1.789e-02 -1.948 0.0562 . poly(totalCS, 3)3 1.798e-03 1.789e-02 0.100 0.9203 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1 Residual standard error: 0.01789 on 59 degrees of freedom Multiple R-squared: 0.2836, Adjusted R-squared: 0.2472 F-statistic: 7.785 on 3 and 59 DF, p-value: 0.0001834