
Linear Models in R


Presentation Transcript


  1. Linear Models in R Fish 507 F: Lecture 7

  2. Supplemental Readings • Practical Regression and ANOVA using R (Faraway, 2002) • Chapters 2, 3, 6, 7, 8, 10, 16 • http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf • (This is essentially an older, freely available version of Julian Faraway’s excellent book Linear Models with R) • QERM 514 Lecture Notes (Nesse, 2008) • Chapters 3-9 • http://students.washington.edu/nesse/qerm514/notes.pdf • Hans Nesse practically wrote a book for this course!

  3. Linear models • Today’s lecture will focus on a single equation and how to fit it. • Classic linear model: yi = β0 + β1xi1 + . . . + βpxip + εi, where the response yi is normally distributed (εi ~ N(0, σ²)) • Categorical predictor(s): ANOVA • Continuous predictor(s): Regression • Mixed categorical/continuous predictor(s): Regression/ANCOVA

  4. Typical Goals • Is this model for . . . • Estimation: What parameter values in a particular model best fit the data? • Tricks (e.g. formulaic relationships, Ricker, . . . ) • Inference: How certain are those estimates and what can be interpreted from them? • Adequacy: Is the model an appropriate choice for the data? • Prediction: What predictions can be made for new observations?

  5. Outline • ANOVA • Regression (ANCOVA) • Model Adequacy • * This lecture will not go into detail on statistical models or concepts, but rather present functions to fit those models

  6. One-way ANOVA • The classical (null) hypothesis in a one-way ANOVA is that the means of all the groups are the same • Model: yij = μj + εij, where i = 1, 2, . . . , nj indexes the observations within group j = 1, 2, . . . , J • H0: μ1 = μ2 = . . . = μJ • H1: At least two of the μj’s are different

  7. Archaeological metals • Archaeological investigations work to identify similarities and differences between sites. Traces of metals found in artifacts give some indication of manufacturing techniques • The data set metals gives the percentage of iron found in pottery from four Roman-era sites

> metals <- read.table("http://studen…F/data/metals.txt",
+                      header=T)
> head(metals, n=3)
    Al   Fe   Mg   Ca   Na Site
1 14.4 7.00 4.30 0.15 0.51    L
2 13.8 7.08 3.43 0.12 0.17    L
3 14.6 7.09 3.88 0.13 0.20    L

Note: Site will automatically get coded as a factor
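Because the Site column is text, read.table() converts it to a factor automatically (the default in the version of R this lecture used; in R ≥ 4.0 you would need to pass stringsAsFactors=TRUE). A quick check, with levels matching the contrasts output shown later:

> is.factor(metals$Site)   # TRUE: read.table() converted the text column
[1] TRUE
> levels(metals$Site)      # the four Roman-era sites
[1] "A" "C" "I" "L"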

  8. aov() / summary() • The simplest way to fit an ANOVA model is with the aov() function

> Fe.anova <- aov(Fe ~ Site, data=metals)
> summary(Fe.anova)
            Df  Sum Sq Mean Sq F value    Pr(>F)
Site         3 134.222  44.741  89.883 1.679e-12 ***
Residuals   22  10.951   0.498
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Reading the table:
• Df – degrees of freedom for the grouping factor and for the residuals
• Sum Sq (Site) – sum of squared differences between the group means and the overall mean
• Sum Sq (Residuals) – sum of squared differences between observations within a group and their respective group mean
• Mean Sq – mean squared error: sum of squares divided by degrees of freedom
• F value – partial F-statistic comparing the full model to the reduced model
• Pr(>F) – probability of observing an F value this large or larger
• Signif. codes – a quick visual guide to which p-values are low

  9. The model statement • We fit the ANOVA model by specifying a model: • Fe ~ Site • The functions in R that fit the more common statistical models take as a first argument a model statement in a compact symbolic form • We’ve actually briefly seen this symbolic form in the first plotting lecture • plot(y ~ x, data = ), where y is the response (dependent variable) and x is the predictor (independent variable)

  10. The model statement • Look up help on ?formula for a full explanation • Some common model statements
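The table of examples on the original slide did not come through the transcript; these are the standard formula idioms such a table typically lists (a reconstruction, not the slide’s exact contents):

y ~ x                 # simple regression of y on x
y ~ x1 + x2           # two main effects, no interaction
y ~ x1 + x2 + x1:x2   # main effects plus their interaction
y ~ x1 * x2           # shorthand for the line above
y ~ x - 1             # remove the intercept
y ~ .                 # regress y on all other variables in the data frame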

  11. One-way ANOVA • In the previous slide we fit the model with the aov() function. We can also fit the ANOVA model with the functions lm() and anova(). Depending on what analysis we are conducting we might choose either one of these approaches.

> Fe.lm <- lm(Fe ~ Site, data=metals)
> anova(Fe.lm)
Analysis of Variance Table

Response: Fe
          Df  Sum Sq Mean Sq F value    Pr(>F)
Site       3 134.222  44.741  89.883 1.679e-12 ***
Residuals 22  10.951   0.498
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  12. Coding categorical variables • The lm() function is also used to fit regression models. (In fact, regression and ANOVA are really the same thing.) It all has to do with how a categorical variable is coded in the model. • The ANOVA model can be written in the familiar regression form by cleverly selecting the predictors, e.g. with treatment coding

  13. Treatment coding • Coding schemes describe how each group is represented as the values of x1, x2, . . . , xJ • In R the default coding scheme for unordered factors is treatment coding. This is likely what you learned in your introductory statistics courses. • Recall that in this scheme the estimate of the intercept β0 represents the mean of the baseline group, and the estimate of each βj describes the difference between the mean of group j and the baseline group

  14. Coding schemes • Treatment coding: μ1 is the group chosen as the baseline • This may seem trivial, but it’s very important to know how categorical variables are being coded in a linear model when interpreting the parameters. • To find out the current coding scheme:

> options()$contrasts
        unordered           ordered
"contr.treatment"     "contr.poly"

  15. Other coding schemes • There are several other coding schemes • helmert: Awkward interpretation, but improves matrix computations. • poly: Levels of a group are ordered. β0 = constant effect, β1 = linear effect, β2 = quadratic effect, . . . • SAS: Same as treatment, but the last level of a group is used as the baseline (treatment always uses the first level) • sum: When the group sample sizes are equal, the estimate of the intercept represents the grand mean and the βj represent the differences of the levels from the grand mean

  16. Changing coding schemes • The C() function is used to specify the contrasts of a factor • C(object, contr)

> ( metals$Site <- C(metals$Site, sum) )
 [1] L L L L L L L L L L L L L L C C I I I I I A A A A A
attr(,"contrasts")
[1] contr.sum
Levels: A C I L

• The contr.*() functions (e.g. contr.treatment()) will create the matrix of contrasts used in lm() and other functions
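To see what these contrast matrices look like for a four-level factor such as Site:

> contr.treatment(4)   # default: the first level is the baseline (row of zeros)
  2 3 4
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1
> contr.sum(4)         # the last level gets -1 in every column
  [,1] [,2] [,3]
1    1    0    0
2    0    1    0
3    0    0    1
4   -1   -1   -1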

  17. One-way ANOVA

> summary(Fe.lm)

Call:
lm(formula = Fe ~ Site, data = metals)

Residuals:
     Min       1Q   Median       3Q      Max
-2.11214 -0.33954  0.01143  0.49036  1.22800

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.5120     0.3155   4.792 8.73e-05 ***
SiteC         3.9030     0.5903   6.612 1.20e-06 ***
SiteI         0.2000     0.4462   0.448    0.658
SiteL         4.8601     0.3676  13.222 6.04e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7055 on 22 degrees of freedom
Multiple R-squared: 0.9246, Adjusted R-squared: 0.9143
F-statistic: 89.88 on 3 and 22 DF, p-value: 1.679e-12

Notes on the output:
• summary() on an lm object gives more output than anova(), including a summary of the residuals
• Recall that each t-test tests whether the corresponding βj is significantly different from zero. So this says that sites C and L were significantly different from the baseline site, A
• Got to report R-squared, right?

  18. Multiple comparisons • When comparing the means for the levels of a factor in an analysis of variance, a simple comparison using t-tests will inflate the probability of declaring a significant difference when it is not in fact present • There are several ways around this, the most common being Tukey’s honest significant difference • TukeyHSD(object), where object needs to be a fitted model object from aov()

  19. TukeyHSD()

> TukeyHSD(Fe.anova)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Fe ~ Site, data = metals)

$Site
         diff        lwr       upr     p adj
C-A  3.9030000  2.2638764  5.542124 0.0000068
I-A  0.2000000 -1.0390609  1.439061 0.9692779
L-A  4.8601429  3.8394609  5.880825 0.0000000
I-C -3.7030000 -5.3421236 -2.063876 0.0000146
L-C  0.9571429 -0.5238182  2.438104 0.3023764
L-I  4.6601429  3.6394609  5.680825 0.0000000

Intervals (lwr, upr) that do not contain zero indicate a significant difference

  20. plot(TukeyHSD(Fe.anova))

  21. Model assumptions • Independence • Within and between samples • Normality • Histograms, QQ-plots • Tests for normality • Kolmogorov-Smirnov test: ks.test() – the null hypothesis is that the data follow a specified distribution • Shapiro-Wilk: shapiro.test() – the null hypothesis is that the data came from a normal distribution • Homogeneity of variance • Boxplots • Tests for equal variances • Bartlett’s test: bartlett.test() – the null hypothesis is that all the variances are equal • Fligner-Killeen test: fligner.test() – the null hypothesis is that all the variances are equal • Load the MASS library
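A sketch of how these checks are called, with fit, y ~ group, and dat standing in for a fitted lm()/aov() object, a grouping formula, and a data frame (all three are placeholders; the functions are standard stats-package functions):

> res <- residuals(fit)                       # normality is checked on the residuals
> hist(res); qqnorm(res); qqline(res)         # graphical checks
> shapiro.test(res)                           # H0: residuals came from a normal distribution
> ks.test(res, "pnorm", mean(res), sd(res))   # H0: residuals follow the stated distribution
> bartlett.test(y ~ group, data=dat)          # H0: all group variances are equal
> fligner.test(y ~ group, data=dat)           # rank-based alternative to Bartlett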

  22. In-class exercise 1 • Check the assumptions using plots and tests for the archaeological ANOVA model. Recall that the normality test is conducted on the residuals of the model, so you will need to figure out how to extract these from Fe.lm • Are the assumptions met?

  23. ANCOVA / Regression • With a basic understanding of the lm() function it’s then not hard to fit other linear models • Regression models with mixed categorical and continuous variables can be fit with the lm() function. • There is also a suite of functions associated with lm() objects which we use for common model evaluation and prediction routines.

  24. Marmot data • The length of yellow-bellied marmot whistles in response to simulated predators.

> head(marmot)
         len rep       dist  type loc
1 0.12214826   1 17.2733271 Human   A
2 0.07630072   3  0.2445166 Human   A
3 0.11584495   1 13.0901767 Human   A
4 0.11318707   1 14.9489510 Human   A
5 0.09931512   2 13.0074619 Human   A
6 0.10285429   2 10.6129169 Human   A

  25. Marmot data • len: length of marmot whistle (response variable) • rep: number of repetitions of whistle per bout – continuous • dist: distance to challenge when whistle began – continuous • type: type of challenge (Human, RC Plane, Dog) – categorical • loc: test location (A, B, C) – categorical

  26. Exploring potential models • Basic exploratory data analysis should always be performed before starting to fit a model • Always try to fit a meaningful model • When there are at least two categorical predictors, an interaction plot is useful for determining whether the effect of x1 on y depends on the level of x2 (an interaction) • interaction.plot(x.factor, trace.factor, response)

  27. interaction.plot(marmot$loc, marmot$type, marmot$len, . . . ) • Slight evidence for an interaction • No RC planes were tested at location C. This will prevent us from fitting an interaction between these two variables.

  28. Exploring potential models • We can also examine potential interactions between continuous and categorical variables with simple bivariate plots conditioned on factors (the code below assumes the marmot columns are available directly, e.g. via attach(marmot)):

# set up a blank plot
plot(dist, len, xlab="Distance", ylab="Length", type='n')
# add points for x and y separately for each level of the factor
points(dist[type == "Dog"],     len[type == "Dog"],     pch=17, col="blue")
points(dist[type == "Human"],   len[type == "Human"],   pch=18, col="red")
points(dist[type == "RCPlane"], len[type == "RCPlane"], pch=19, col="green")
# levels() is a quick way to extract the names of a categorical variable
legend("bottomleft", bty='n', levels(type),
       col=c("blue", "red", "green"), pch=17:19)

  29. (Plot of whistle length vs. distance, by challenge type.) Humans have a much stronger linear effect

  30. A model • Suppose after some exploratory data analysis and model fitting we arrived at the model: • Length ~ Location + Distance + Type + Distance*Type • We can fit this model simply by

> ( lm1 <- lm(len ~ loc + type*dist, data=marmot) )

Call:
lm(formula = len ~ loc + type * dist, data = marmot)

Coefficients:
     (Intercept)              locB              locC         typeHuman
       0.0941227         0.0031960         0.0026906        -0.0353553
     typeRCPlane              dist    typeHuman:dist  typeRCPlane:dist
       0.0001025         0.0005970         0.0034316        -0.0011266

  31. The summary() command works just as before

> summary(lm1)

Call:
lm(formula = len ~ loc + type * dist, data = marmot)

Residuals:
      Min        1Q    Median        3Q       Max
-0.092966 -0.010812  0.001030  0.010029  0.059588

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       0.0941227  0.0106280   8.856 4.82e-15 ***
locB              0.0031960  0.0042574   0.751  0.45417
locC              0.0026906  0.0049046   0.549  0.58421
typeHuman        -0.0353553  0.0136418  -2.592  0.01063 *
typeRCPlane       0.0001025  0.0153766   0.007  0.99469
dist              0.0005970  0.0008158   0.732  0.46555
typeHuman:dist    0.0034316  0.0010809   3.175  0.00187 **
typeRCPlane:dist -0.0011266  0.0011891  -0.947  0.34515
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.02147 on 132 degrees of freedom
Multiple R-squared: 0.2906, Adjusted R-squared: 0.2529
F-statistic: 7.723 on 7 and 132 DF, p-value: 8.208e-08

  32. Extracting model components • All the components of the summary() output are also stored in a list

> names(lm1)
 [1] "coefficients"  "residuals"     "effects"
 [4] "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "contrasts"
[10] "xlevels"       "call"          "terms"
[13] "model"
> lm1$call
lm(formula = len ~ loc + type * dist, data = marmot)
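Reading the list elements directly with $ works, but the idiomatic way to pull out the common pieces is with R’s extractor functions:

> coef(lm1)        # parameter estimates (same as lm1$coefficients)
> residuals(lm1)   # model residuals
> fitted(lm1)      # fitted values
> formula(lm1)     # the model formula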

  33. Comparing models • Suppose that the model without the interaction was also a potential model. We can look at the t-values to test whether a single βj = 0, but we need to perform a partial F-test to test whether several predictors are simultaneously 0. • This is what anova() tests when given two nested lm() objects • H0: reduced model – H1: full model

> lm2 <- lm(len ~ loc + type + dist, data=marmot)   # the reduced model (fit not shown on the slide)
> anova(lm2, lm1)
Analysis of Variance Table

Model 1: len ~ loc + type + dist
Model 2: len ~ loc + type * dist
  Res.Df      RSS Df Sum of Sq      F    Pr(>F)
1    134 0.069588
2    132 0.060827  2  0.008761 9.5058 0.0001391 ***

The small p-value is evidence for the interaction

  34. AIC • AIC is a more sound way to select a model. • In its simplest form, AIC = -2 log(L) + 2p, where L is the maximized likelihood and p = # of parameters • The function AIC() will extract the AIC from a linear model. • Note that there is also the function extractAIC(), which evaluates the log-likelihood based on the model deviance (Generalized Linear Models) and uses a different penalty. • Be careful!

  35. AICc • AIC corrected is a better choice, especially when there are a lot of parameters relative to the size of the data: AICc = AIC + 2p(p + 1)/(n - p - 1) • The last term is the correction term. What will it converge to as n gets very large? • AICc should always be used, since AIC and AICc yield equivalent results as n gets large.

  36. Hands-on exercise 2 • Compute the AIC and AICc for the two marmot models that were fit • For AIC use the AIC() function, but also do the computation by hand • R does not have a built-in function for AICc in its base packages, so you will also have to do this computation by hand. • Hint: Use the logLik() function
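One possible sketch of the by-hand computation (aicc() is a helper defined here, not a base-R function; note that logLik() on an lm object carries a "df" attribute counting the parameters, including σ², which is what AIC() uses):

> aicc <- function(m) {
+   ll <- logLik(m)                 # maximized log-likelihood
+   p  <- attr(ll, "df")            # number of parameters (includes sigma^2)
+   n  <- nobs(m)                   # number of observations
+   -2 * as.numeric(ll) + 2*p + 2*p*(p + 1)/(n - p - 1)
+ }
> aicc(lm1)
> aicc(lm2)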

  37. Checking assumptions • Suppose that we decided on the marmot model that included the interaction term between distance and type of challenge • The same assumptions must be met and can be evaluated by plotting the model object • plot(lm1) • Clicking on the plot will allow you to scroll through the plots • Specifying which= in the command will allow you to select a particular plot • plot(lm1, which=1)
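To see the standard diagnostic panels at once instead of clicking through them, set the plotting grid first (standard par() usage):

> par(mfrow=c(2,2))   # 2 x 2 grid of diagnostic plots
> plot(lm1)           # residuals vs. fitted, QQ-plot, scale-location, residuals vs. leverage
> par(mfrow=c(1,1))   # restore the default layout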

  38. Checking the constant variance assumption (residuals vs. fitted values) • Unusual observations are flagged (high leverage)

  39. (Normal QQ-plot) Very heavy tails; the normality assumption is not met.

  40. (Scale-location plot) Similar to the fitted vs. residuals plot

  41. (Residuals vs. leverage) Check for influential observations. Cook’s distance is a function of the residual and the leverage, so the isoclines trace out the Cook’s distance for any point in the region

  42. Box-Cox transformations • Failing to meet assumptions indicates model mis-specification • The QQ-plot suggested that we might want to transform the response variable, length of whistle. • It is always a good idea to use a meaningful transformation if possible. • An optimal transformation can be found with the boxcox() function (in the MASS package), which computes a likelihood profile over λ

  43. boxcox(lm1, lambda=seq(0.5,2.5, l=25))
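boxcox() also returns the profile it plots; a sketch of extracting the λ with the highest profile likelihood and refitting on the transformed response (the refit shown is illustrative, not from the lecture):

> library(MASS)                             # boxcox() lives in MASS
> bc <- boxcox(lm1, lambda=seq(0.5, 2.5, l=25), plotit=FALSE)
> ( lambda <- bc$x[which.max(bc$y)] )       # lambda maximizing the profile likelihood
> lm1.bc <- lm(len^lambda ~ loc + type*dist, data=marmot)   # refit with transformed response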

  44. Influential observations • Unusual observations are also likely driving the lack of fit. Cook’s distance is commonly used to identify influential observations • This statistic has an F distribution, and thus points greater than 1 usually require further examination. A cutoff of 0.5 is also commonly used

  45. Influential observations • cooks.distance() takes as input an lm() object and returns the Cook’s distance for each observation • None of the observations were influential:

> which(cooks.distance(lm1) > 0.5)
named integer(0)

• The values flagged by R in the diagnostic plots likely had high leverage, meaning that they had the potential to be influential.

  46. Making predictions • After assessing model assumptions, influential observations, refitting the model, validating the model, . . . we want to make a prediction for a future observation • The point estimate for a new observation x* is ŷ* = x*ᵀβ̂, but we wish to express our uncertainty in this observation • Two types of predictions with uncertainty • Confidence interval: expresses uncertainty in a parameter • Prediction interval: expresses uncertainty in a new observation

  47. predict()

predict(object, newdata,
        interval = c("none", "confidence", "prediction"),
        level = 0.95, na.action = na.pass)

• object – an lm() object
• newdata – data frame of new observations. Note the column names of the data frame must match the data used in the model
• interval / level – the type and level of the desired interval
• na.action – how to handle missing values. Recall all the na.* functions. By default R will not make the prediction (pass)

  48. predict()

> ( newobs <- data.frame(rep=c(3,1), dist=c(10, 13),
+                        type=rep("Dog",2), loc=rep("A",2)) )
  rep dist type loc
1   3   10  Dog   A
2   1   13  Dog   A
> predict(lm1, newobs, interval="confidence")
        fit        lwr       upr
1 0.1000929 0.09170514 0.1084807
2 0.1018840 0.09411091 0.1096571
> predict(lm1, newobs, interval="prediction")
        fit        lwr       upr
1 0.1000929 0.05680941 0.1433764
2 0.1018840 0.05871539 0.1450526

Note that the column names in newobs match the data, and that the prediction intervals are wider than the confidence intervals

  49. Parameter confidence intervals • Confidence intervals for model parameters can easily be obtained with the confint() function

> confint(lm1)
                        2.5 %       97.5 %
(Intercept)       0.073099364  0.115145945
locB             -0.005225605  0.011617694
locC             -0.007011095  0.012392320
typeHuman        -0.062340202 -0.008370466
typeRCPlane      -0.030313894  0.030518809
dist             -0.001016644  0.002210696
typeHuman:dist    0.001293523  0.005569594
typeRCPlane:dist -0.003478859  0.001225620

  50. Other useful functions (all in the MASS package) • addterm(): forward selection using AIC • dropterm(): backward selection using AIC • stepAIC(): step-wise selection using AIC
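A sketch of how the stepwise search is typically called (standard MASS usage; the direction and trace settings shown are just one choice):

> library(MASS)
> lm.step <- stepAIC(lm1, direction="both", trace=FALSE)
> lm.step$anova    # record of the terms added/dropped at each step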
