
Chapter 15: Multiple Linear Regression



  1. Chapter 15: Multiple Linear Regression

  2. In Chapter 15: 15.1 The General Idea 15.2 The Multiple Linear Regression Model 15.3 Categorical Explanatory Variables in Regression Models 15.4 Regression Coefficients 15.5 ANOVA for Multiple Linear Regression 15.6 Examining Multiple Regression Conditions

  3. 15.1 The General Idea Simple regression considers the relation between a single explanatory variable X and response variable Y.

  4. The General Idea, cont. Multiple regression considers the relation between multiple explanatory variables (X1, X2, …, Xk) and response variable Y.

  5. The General Idea, cont. • The multiple regression model helps “adjust out” the effects of confounding factors (i.e., extraneous lurking variables that bias results). • Simple linear concepts (Chapter 14) must be mastered before tackling multiple regression.

  6. Simple Regression Model The simple regression population model is:

  μY|x = α + βx

  where μY|x ≡ expected value of Y given value x; α ≡ intercept parameter; β ≡ slope parameter

  7. Simple Regression Model, cont. Estimates for the model are derived by minimizing ∑residuals² to come up with estimates for an intercept (denoted a) and slope (b):

  ŷ = a + bx

  The standard error of the regression (sY|x) quantifies variability around the line:

  sY|x = √(∑residuals² / (n − 2))
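
  As a concrete illustration of these formulas, here is a minimal Python sketch that computes a, b, and sY|x from the least-squares expressions. The data values are made up for illustration, not taken from the text:

```python
import numpy as np

# Hypothetical illustration (not the survey data): five (x, y) pairs
x = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y = np.array([1.4, 2.1, 2.7, 3.4, 3.9])

n = len(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()                       # least-squares intercept
residuals = y - (a + b * x)
s_yx = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # standard error of the regression

print(f"a = {a:.3f}, b = {b:.3f}, s_Y|x = {s_yx:.3f}")
```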

  8. Multiple Regression Model The multiple regression population model with two explanatory variables is:

  μY|x = α + β1x1 + β2x2

  where μY|x ≡ expected value of Y given values x1 and x2; α ≡ intercept parameter; β1 ≡ slope parameter for X1; β2 ≡ slope parameter for X2

  9. Multiple Regression Model, cont. Estimates for the coefficients are derived by minimizing ∑residuals² to derive this fitted multiple regression model:

  ŷ = a + b1x1 + b2x2

  The standard error of the regression quantifies variability around the regression plane.
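
  A minimal sketch of the same computation for two explanatory variables, using NumPy's least-squares solver. The design matrix and data values are assumptions for illustration:

```python
import numpy as np

# Hypothetical data: response y with two explanatory variables x1 and x2
x1 = np.array([9.0, 8.0, 12.0, 10.0, 15.0, 11.0])
x2 = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
y  = np.array([2.3, 2.1, 2.9, 2.6, 3.4, 2.8])

X = np.column_stack([np.ones_like(x1), x1, x2])  # design matrix: [1, x1, x2]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # minimizes the sum of squared residuals
a, b1, b2 = coef

residuals = y - X @ coef
n, k = len(y), 2
s = np.sqrt(np.sum(residuals ** 2) / (n - k - 1))  # variability around the plane
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}, s = {s:.3f}")
```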

  10. Regression Modeling • A simple regression model fits a line in two-dimensional space • A multiple regression model with two explanatory variables fits a regression plane in three-dimensional space • A multiple regression model with k explanatory variables fits a regression “surface” in (k + 1)-dimensional space (cannot be visualized)

  11. Interpretation of Coefficients • Here’s a model with two independent variables: ŷ = a + b1x1 + b2x2 • The intercept a predicts where the regression plane crosses the Y axis (the value of Y when X1 = X2 = 0) • The slope b1 for variable X1 predicts the change in Y per unit X1, holding X2 constant • The slope b2 for variable X2 predicts the change in Y per unit X2, holding X1 constant
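
  A tiny sketch of this interpretation, with made-up coefficients: the prediction changes by exactly b1 when x1 increases by one unit and x2 is held fixed, and by exactly b2 in the mirror case:

```python
# Fitted surface y-hat = a + b1*x1 + b2*x2, with hypothetical coefficients
def predict(x1, x2, a=0.4, b1=0.23, b2=-0.21):
    return a + b1 * x1 + b2 * x2

# Raise x1 by one unit while x2 stays fixed: prediction changes by b1
print(round(predict(x1=10, x2=1) - predict(x1=9, x2=1), 2))   # 0.23 (= b1)
# Raise x2 by one unit while x1 stays fixed: prediction changes by b2
print(round(predict(x1=10, x2=1) - predict(x1=10, x2=0), 2))  # -0.21 (= b2)
```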

  12. 15.3 Categorical Explanatory Variables in Regression Models • Categorical explanatory (independent) variables can be fit into a regression model by converting them into 0/1 indicator variables (“dummy variables”) • For binary variables, code the variable 0 for “no” and 1 for “yes” • For categorical variables with k categories, use k − 1 dummy variables.

  13. Dummy Variables, cont. In this example, the variable SMOKE2 has three levels, initially coded 0 = non-smoker, 1 = former smoker, 2 = current smoker. To use this variable in a regression model, convert it to two dummy variables, with non-smoker as the reference category:

  SMOKE2 level → Dummy 1, Dummy 2
  non-smoker (0) → 0, 0
  former smoker (1) → 1, 0
  current smoker (2) → 0, 1
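
  A sketch of this recoding in Python with pandas; the data values and the column names DUMMY1/DUMMY2 are illustrative, not taken from the text:

```python
import pandas as pd

# Hypothetical recreation of the three-level variable
smoke2 = pd.Series([0, 1, 2, 0, 2], name="SMOKE2")

dummies = pd.DataFrame({
    "DUMMY1": (smoke2 == 1).astype(int),  # 1 for former smokers, else 0
    "DUMMY2": (smoke2 == 2).astype(int),  # 1 for current smokers, else 0
})
print(pd.concat([smoke2, dummies], axis=1))
# Non-smokers (SMOKE2 = 0) get DUMMY1 = DUMMY2 = 0: the reference category
```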

  14. Illustrative Example Childhood respiratory health survey. • A binary explanatory variable (SMOKE) is coded 0 for non-smoker and 1 for smoker • The response variable, forced expiratory volume (FEV), is measured in liters • The mean FEV in nonsmokers is 2.566 • The mean FEV in smokers is 3.277

  15. Illustrative Example, cont. • A regression model is applied (regress FEV on SMOKE) • The least squares regression line is ŷ = 2.566 + 0.711X • The intercept (2.566) is the mean FEV of group 0 (nonsmokers) • The slope is the mean difference: 3.277 − 2.566 = 0.711 • tstat = 6.464 with 652 df, P ≈ 0.000 (equivalent to the equal-variance t test) • The 95% CI for the slope is 0.495 to 0.927 (the same as the 95% CI for μ1 − μ2)
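
  The equivalence between this regression and the two-sample comparison can be sketched with simulated stand-in data (group means chosen near those in the text; none of this reproduces the actual survey):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
# Simulated stand-in: FEV for 20 non-smokers (0) and 20 smokers (1)
smoke = np.repeat([0, 1], 20)
fev = np.where(smoke == 0, 2.57, 3.28) + rng.normal(0, 0.8, size=40)

fit = sm.OLS(fev, sm.add_constant(smoke)).fit()
print(fit.params)  # [intercept, slope]: intercept = mean FEV of group 0
print(fev[smoke == 1].mean() - fev[smoke == 0].mean())  # equals the slope
print(fit.tvalues[1], fit.pvalues[1])  # matches the equal-variance t test
```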

  16. Illustrative Example, cont. Scatter plot for categorical independent variable (SMOKE) with regression line passing through group means.

  17. Confounding, Illustrative Example • In our example, the children who smoked had higher mean FEV than the children who did not smoke • How can this be true, given what we know about the deleterious respiratory effects of smoking? • The answer lies in the fact that the smokers were older than the nonsmokers: AGE confounds the relationship between SMOKE and FEV • A multiple regression model can adjust for AGE in this situation

  18. 15.4 Multiple Regression Coefficients We rely on software to calculate the multiple regression statistics. This is the regression dialogue box from SPSS.

  19. Multiple Regression Example, cont. Here is the SPSS output, with the intercept a and the slopes b1 (for SMOKE) and b2 (for AGE) labeled. The multiple regression model is: FEV = 0.367 − 0.209(SMOKE) + 0.231(AGE)
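
  A sketch of fitting the same two-predictor model with the statsmodels formula interface. The variable names FEV, SMOKE, and AGE follow the text, but the data here are simulated stand-ins, so the printed coefficients will not match the SPSS output:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 654  # same sample size as the survey in the text
df = pd.DataFrame({"AGE": rng.integers(3, 20, size=n),
                   "SMOKE": (rng.random(n) < 0.1).astype(int)})
df["FEV"] = 0.37 + 0.23 * df["AGE"] - 0.21 * df["SMOKE"] + rng.normal(0, 0.4, size=n)

fit = smf.ols("FEV ~ SMOKE + AGE", data=df).fit()
print(fit.params)  # Intercept (a), SMOKE slope (b1), AGE slope (b2)
```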

  20. Multiple Regression Coefficients, cont. • The slope coefficient for SMOKE is −0.209, suggesting that smokers have 0.209 less FEV on average than non-smokers (after adjusting for age) • The slope coefficient for AGE is 0.231, suggesting that each additional year of age is associated with an increase of 0.231 FEV units on average (after adjusting for SMOKE)

  21. Inference About the Coefficients Inferential statistics are calculated for each regression coefficient. For example, in testing H0: β1 = 0 (the SMOKE coefficient, controlling for AGE), tstat = −2.588 and P = 0.010. This statistic has df = n − k − 1 = 654 − 2 − 1 = 651. See the text for formulas.

  22. Inference About the Coefficients The 95% confidence interval for this slope of SMOKE, controlling for AGE, is −0.368 to −0.050.
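
  A sketch of how these inferential statistics (t, P, df, and the 95% CI) can be read off a fitted statsmodels model; the data are simulated stand-ins as above, so the printed numbers will not match the text:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 654
df = pd.DataFrame({"AGE": rng.integers(3, 20, size=n),
                   "SMOKE": (rng.random(n) < 0.1).astype(int)})
df["FEV"] = 0.37 + 0.23 * df["AGE"] - 0.21 * df["SMOKE"] + rng.normal(0, 0.4, size=n)

fit = smf.ols("FEV ~ SMOKE + AGE", data=df).fit()
print(fit.tvalues["SMOKE"], fit.pvalues["SMOKE"])  # t stat and P for H0: beta1 = 0
print(fit.df_resid)                                # n - k - 1 = 651
print(fit.conf_int(alpha=0.05).loc["SMOKE"])       # 95% CI for the adjusted slope
```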

  23. 15.5 ANOVA for Multiple Regression The ANOVA table partitions the total sum of squares into regression and residual components; the resulting F statistic tests H0: β1 = β2 = 0 (no linear association with any explanatory variable). See pp. 343–346.
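
  A short sketch of the quantities the ANOVA table is built from, pulled from a statsmodels fit on simulated data (all values are illustrative only):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 60  # small hypothetical sample
X = sm.add_constant(np.column_stack([rng.uniform(3, 19, n),
                                     (rng.random(n) < 0.1).astype(float)]))
y = X @ np.array([0.37, 0.23, -0.21]) + rng.normal(0, 0.4, size=n)

fit = sm.OLS(y, X).fit()
print("Regression SS:", fit.ess)               # explained (model) sum of squares
print("Residual SS:  ", fit.ssr)               # sum of squared residuals
print("F =", fit.fvalue, "P =", fit.f_pvalue)  # tests H0: all slopes = 0
```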

  24. 15.6 Examining Regression Conditions • Conditions for multiple regression mirror those of simple regression • Linearity • Independence • Normality • Equal variance • These can be evaluated by analyzing the pattern of the residuals

  25. Examining Conditions This plot shows standardized residuals against standardized predicted values for the illustrative model (FEV regressed on AGE and SMOKE). • Same number of points above and below the horizontal → no major departures from linearity • Higher variability at high values of Y → possible unequal variance (biologically reasonable)
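
  A sketch of producing this kind of diagnostic plot in Python (simple regression on simulated data for brevity; the plotted values are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(3, 19, size=100)                   # hypothetical predictor
y = 0.4 + 0.23 * x + rng.normal(0, 0.4, size=100)  # hypothetical response

fit = sm.OLS(y, sm.add_constant(x)).fit()
zresid = fit.resid / np.std(fit.resid, ddof=2)  # standardized residuals (n - 2 df)
zpred = (fit.fittedvalues - fit.fittedvalues.mean()) / fit.fittedvalues.std()

plt.scatter(zpred, zresid)
plt.axhline(0, linestyle="--")  # residuals should straddle this line evenly
plt.xlabel("Standardized predicted value")
plt.ylabel("Standardized residual")
plt.show()
```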

  26. Examining Conditions Normal Q-Q plot of standardized residuals. The fairly straight diagonal suggests no major departures from Normality.
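
  A matching sketch for the Q-Q plot, again on simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(3, 19, size=100)
y = 0.4 + 0.23 * x + rng.normal(0, 0.4, size=100)  # hypothetical data

fit = sm.OLS(y, sm.add_constant(x)).fit()
# fit=True standardizes the residuals before comparing them to the Normal;
# an approximately straight 45-degree line suggests approximate Normality
sm.qqplot(fit.resid, line="45", fit=True)
plt.show()
```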
