
Linear regression



  1. Linear regression Summer program Brian Healy

  2. Previous classes • Hypothesis testing • Parametric • Nonparametric • Correlation

  3. What are we doing today? • Linear regression • Continuous outcome with continuous, dichotomous or categorical predictor • Equation: y = b0 + b1x + e • Connection between regression and • t-test • ANOVA

  4. Big picture • Linear regression is the most commonly used statistical technique. It allows the comparison of dichotomous, categorical and continuous predictors with a continuous outcome. More importantly, it allows multiple predictors of a single outcome. • Extensions of linear regression allow • Dichotomous outcomes: logistic regression • Repeated measures • Multilevel models • Median regression • Survival analysis • Amazingly, many of the analyses we have learned can be completed using linear regression

  5. Quick math review • As you remember from high school math, the basic equation of a line is given by y = mx + b, where m is the slope and b is the y-intercept • One interpretation of m is that for every one unit increase in x, there is an m unit increase in y • One interpretation of b is that it is the value of y when x is equal to zero

  6. Picture • Look at the data in this picture • Does there seem to be a correlation (linear relationship) in the data? • Is the data perfectly linear? • Could we fit a line to this data?

  7. What is linear regression? • Linear regression tries to find the best line (curve) to fit the data • The equation of the line is y = b0 + b1x • The method of finding the best line (curve) is least squares, which minimizes the sum of the squared distances from the line to each of the points

  8. Linearity assumption • One of the assumptions of linear regression is that the relationship between the predictors and the outcome is in fact linear • We call this the population regression line: E(Y | X = x) = μy|x = b0 + b1x • This equation says that the mean of y given a specific value of x is defined by the b coefficients • There are two keys to understanding the equation • The equation is conditional on the value of x • The coefficients act exactly like the slope and y-intercept from the simple equation of a line from before

  9. Individual regression equation • So far, we have only described the relationship between the mean of the outcome, μy|x, and the predictors • The actual observations, yi, may be slightly off the population line because of variability in the population, leading to the following individual regression equation: yi = b0 + b1xi + ei, where ei is the deviation from the population line (see picture: e1 is the distance from the line for patient 1)

  10. Normality and homoscedasticity assumption • Two other assumptions of linear regression are related to the ei’s • Normality: the distribution of the errors is normal. Since we are conditioning on the values of xi, the outcomes, yi, are also normal given a specific value of xi • Homoscedasticity: all of the errors have the same variance. This implies the variance of y given x is the same for all values of x [Figure: the distribution of y-values at each value of x is normal with the same variance]

  11. Assumptions of linear regression • Linearity • The relationship between the outcome and the predictors can be described by a linear relationship • The linear relationship applies to the coefficients, not the predictors. For example, E(Y|X=x) = b0 + b1x1 + b2x2² is still a linear regression equation because each of the b’s is to the first power • Normality of the errors • The errors, ei, are normally distributed, N(0, σ²) • Homoscedasticity of the errors • The errors, ei, have the same variance • This assumption is required for ANOVA and a two sample t-test with equal variance • Independence • All of the data points are independent • Correlated data points can be taken into account using multivariate and longitudinal data methods (BIO226 and BIO245)

  12. How do we find the best line? • Let’s look at three candidate lines • Which do you think is the best? • What is a way to determine the best line to use?

  13. Least squares • The method employed to find the best line is called least squares. This method finds the values of b that minimize the squared vertical distance from the line to each of the points. This is the same as minimizing the sum of the ei² • As you remember from calculus, to minimize this sum we take the derivative of the sum with respect to each b and set it equal to zero • A matrix form of the regression equations can also be written and is easier to apply to other situations. You will see this in the fall
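  For concreteness, here is a minimal R sketch of the least squares solution for simple regression, using the closed-form formulas b1hat = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and b0hat = ȳ - b1hat·x̄; the data here are simulated for illustration, not from the course examples.

    # Simulated data for illustration (not the course data)
    set.seed(1)
    x <- rnorm(50, mean = 10, sd = 2)
    y <- 3 + 0.5 * x + rnorm(50)

    # Closed-form least squares estimates for simple regression
    b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    b0 <- mean(y) - b1 * mean(x)

    # These match the estimates returned by lm()
    c(b0, b1)
    coef(lm(y ~ x))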

  14. Estimates of regression coefficients • Once we have solved the least squares equations, we obtain estimates for the b’s, which we refer to as b0hat and b1hat • The hat notation is used from now on to refer to the estimated value of a coefficient or outcome • The final least squares equation is yhat = b0hat + b1hat x1, where yhat is the estimated mean value of y for a given value of x1

  15. Example • Let’s look at an example • I recently worked on a study of patients who had to receive Fresh Frozen Plasma (FFP) after a surgery. The investigator wanted to determine whether any pre-operative factors predicted red blood cell loss during surgery. • We would like to see if there is a linear relationship between pre-operative PT time and red blood cell loss
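  Before fitting a line, a scatterplot is the natural first look at the relationship; a minimal sketch, assuming the variables are stored in a data frame called data with columns ptpre (pre-operative PT time) and rbc (red blood cell loss), the names that appear in the R output on the next slide:

    # Scatterplot of pre-operative PT time versus red blood cell loss
    plot(data$ptpre, data$rbc,
         xlab = "Pre-operative PT time (seconds)",
         ylab = "Red blood cell loss")
    # Overlay the fitted least squares line for reference
    abline(lm(rbc ~ ptpre, data = data))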

  16. Results • To find the least squares solution using R, we use the following command • summary(lm(rbc~ptpre))

Call:
lm(formula = rbc ~ ptpre)

Residuals:
    Min      1Q  Median      3Q     Max
 -5.941  -3.816  -1.941   2.273  20.907

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.2225     6.9015   0.902    0.369
ptpre        -0.0930     0.4489  -0.207    0.836

Residual standard error: 5.157 on 143 degrees of freedom
Multiple R-Squared: 0.0003001, Adjusted R-squared: -0.006691
F-statistic: 0.04292 on 1 and 143 DF, p-value: 0.8362

(The Estimate column contains the parameter estimates.)

  17. The final regression equation is yhat = 6.2225 - 0.0930(PT time) • The coefficients mean • the estimate of the mean red blood cell loss for a patient with a PT time of 0 is 6.22 (b0hat) • an increase of one second in PT time leads to an estimated decrease of 0.09 in mean red blood cell loss (b1hat) • For any given value of PT time, we can predict the value of the red blood cell loss. For a patient with a PT time of 16 seconds: yhat = 6.2225 - 0.0930(16) ≈ 4.73
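  The same prediction can be obtained in R with predict(); a minimal sketch, again assuming a data frame called data with columns rbc and ptpre:

    # Fit the model, then predict red blood cell loss at a PT time of 16 seconds
    fit <- lm(rbc ~ ptpre, data = data)
    predict(fit, newdata = data.frame(ptpre = 16))

    # Equivalent hand calculation from the fitted coefficients
    coef(fit)[1] + coef(fit)[2] * 16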

  18. Example 2 • Another classic example that appears in the Pagano book • Here we are interested in determining the effect of certain factors on the systolic blood pressure (sbp) of a low birth weight infant, including the gestational age of the child.

  19. Results • To find the least squares solution using R, we use the following command • summary(lm(data$sbp ~ data$gestage))

Call:
lm(formula = data$sbp ~ data$gestage)

Residuals:
     Min       1Q   Median       3Q      Max
 -23.162   -7.828   -1.483    5.568   39.781

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   10.5521    12.6506   0.834  0.40625
data$gestage   1.2644     0.4362   2.898  0.00463 **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 11 on 98 degrees of freedom
Multiple R-Squared: 0.07895, Adjusted R-squared: 0.06956
F-statistic: 8.401 on 1 and 98 DF, p-value: 0.004628

(The Estimate column contains the parameter estimates.)

  20. The final regression equation is yhat = 10.5521 + 1.2644(gestational age) • The coefficients mean • the estimate of the mean sbp for a baby with a gestational age of 0 is 10.6 (b0hat) • an increase of one week of gestational age leads to an estimated increase of 1.26 in mean sbp (b1hat) • For any given value of gestational age, we can predict the value of the systolic blood pressure. For example, a child with a gestational age of 27 weeks has an estimated systolic blood pressure of yhat = 10.5521 + 1.2644(27) ≈ 44.7

  21. Unanswered questions • Is the estimate of b1 significantly different than zero? In other words, is there a significant relationship between the predictor and the outcome? • Have the assumptions of regression been met? • Can we include more than one predictor in the model?

  22. Significance of results • From the R output, we can see that there are standard errors, t-statistics, and p-values attached to each of the coefficients. As we know, the point estimates alone are not enough to determine the significance of the results • In the fall, you will derive how to determine the standard error of the parameter estimates. For now, we will use R to calculate the standard error and determine the t-statistic using this standard error • The degrees of freedom for the test statistic are n-2 because we had to estimate 2 parameters in the regression model • To test a hypothesis about b1, we follow the same steps we have followed before
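  As a concrete check, the t-statistic is simply the estimate divided by its standard error; a minimal sketch using the gestational age numbers from the output on slide 19:

    # t-statistic for b1hat: estimate / standard error
    t_stat <- 1.2644 / 0.4362        # = 2.898, matching the R output
    # Two-sided p-value from a t distribution with n - 2 = 98 degrees of freedom
    2 * pt(-abs(t_stat), df = 98)    # = 0.00463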

  23. Hypothesis test-blood • Linear regression with alpha level of 0.05 • Null hypothesis: No relationship between predictor and outcome • Test statistic: t-statistic with 143 dof • p-value= 0.836 • Fail to reject null hypothesis • Conclusion: There is no significant association between the PT time before the operation and the red blood cell loss

  24. Hypothesis test-sbp • Linear regression with alpha level of 0.05 • Null hypothesis: No relationship between predictor and outcome • Test statistic: t-statistic with 98 dof • p-value=0.00463 • Reject null hypothesis • Conclusion: An increase in gestational age is significantly associated with an increase in systolic blood pressure

  25. Intercept • Notice that R also provides a test statistic and p-value for the estimate of the intercept • This is for Ho: b0 = 0, which is often not a hypothesis of interest because it corresponds to testing whether the mean sbp is equal to zero among babies with a gestational age of 0 weeks • Since no babies actually have a gestational age of 0 weeks, there is no reason to test hypotheses about this coefficient. Therefore, the non-significant p-value for the intercept does not matter • There are instances in which this hypothesis does have meaning, but often we then need to test b0 = c, for a constant c not equal to 0

  26. Tests for the assumptions • There are several different ways to test the assumptions of linear regression. We will briefly discuss some in the next couple of slides, but detail is left to the fall course • Many of the tests use the residuals, which are the distances between the outcomes and the fitted line • Other methods have been developed to allow departures from the assumptions; these will be briefly mentioned as well

  27. Linearity • Simple check: plot the predictor versus the outcome • In cases with multiple predictors, you need to look at other plots • One way to handle nonlinearity is to transform the predictor, for example by including a quadratic or higher order term

  28. Homoscedasticity • Check the residual plot, which plots the residuals versus the fitted values. This allows you to see if there is a change in the distance to the line as you move along the line • The top plot shows the assumption is met, while the bottom plot shows that there is a greater amount of variance for larger fitted values

  29. Normality • To test if the residuals are normal: • Histogram of residuals • Normal probability plot • You could also check the residual plot again. Notice here that the errors are clearly not normally distributed because they are all far from the line
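  These three checks are easy to produce with base R plots; a minimal sketch for the gestational age model, assuming the data frame data with columns sbp and gestage:

    # Refit the gestational age model
    fit <- lm(sbp ~ gestage, data = data)

    # Linearity: plot the predictor against the outcome
    plot(data$gestage, data$sbp)

    # Homoscedasticity: residuals versus fitted values
    plot(fitted(fit), resid(fit))
    abline(h = 0)

    # Normality: histogram and normal probability (Q-Q) plot of the residuals
    hist(resid(fit))
    qqnorm(resid(fit))
    qqline(resid(fit))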

  30. Linear regression with dichotomous predictor • Linear regression can also be used for dichotomous predictors, like sex • To do this, we use an indicator variable, which equals 1 for male and 0 for female. The resulting regression equation for predicting sbp is E(sbp | sex) = b0 + b1 sex • We determine the regression coefficients in the same way as before and use the same R code (see the sketch below)
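  A minimal sketch of the fit, assuming the sex indicator is already coded 0/1 in a data frame called data (the recoding line is hypothetical, in case the variable is stored as labels):

    # sex is an indicator variable: 1 = male, 0 = female
    # If sex were stored as labels, it could be recoded first, e.g.:
    # data$sex <- ifelse(data$sexlabel == "male", 1, 0)   # hypothetical label column
    summary(lm(sbp ~ sex, data = data))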

  31. Results from R

Call:
lm(formula = data$sbp ~ data$sex)

Residuals:
      Min        1Q    Median        3Q       Max
 -27.4643   -6.4643   -0.1640    5.1364   39.1364

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   46.464      1.529  30.395   <2e-16 ***
data$sex       1.399      2.305   0.607    0.545
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 11.44 on 98 degrees of freedom
Multiple R-Squared: 0.003748, Adjusted R-squared: -0.006418
F-statistic: 0.3687 on 1 and 98 DF, p-value: 0.5451

(The Estimate column contains the coefficient estimates; the Pr(>|t|) column contains their p-values.)

  32. Interpretation of results • The final regression equation is yhat = 46.5 + 1.4 sex • The meaning of the coefficients in this case is • 46.5 (b0hat) is the estimate of the mean sbp when sex=0, i.e., in the female group • 47.9 (b0hat + b1hat) is the estimate of the mean sbp when sex=1, i.e., in the male group • 1.4 (b1hat) is the estimate of the mean difference in sbp between males and females • The difference between the two groups is b1 • Can we test if there is a significant difference between the two groups?

  33. Hypothesis test • Linear regression with alpha level of 0.05 • Null hypothesis: There is no difference between males and females in terms of systolic blood pressure • Test statistic: t-statistic with 98 dof • p-value=0.545 • Fail to reject null hypothesis • Conclusion: There is no evidence of a difference in systolic blood pressure between males and females

  34. T-test • As hopefully you remember, you could have tested this same null hypothesis using a two sample t-test • We are going to make the assumption of equal variance in the two groups because this is the assumption we made for the regression analysis • The R code to do this is t.test(data$sbp[data$sex==1], data$sbp[data$sex==0],var.equal=T)

  35. Hypothesis test • Two independent sample t-test with equal variance; alpha = 0.05 • Null hypothesis: There is no difference between males and females in terms of systolic blood pressure • Test statistic: t-statistic with 98 dof • p-value=0.545 • Fail to reject null hypothesis • Conclusion: There is no evidence of a difference in systolic blood pressure between males and females

  36. Comparison of linear regression and t-test • Notice that the test statistic, p-value and conclusion are exactly the same!!! • A similar link will occur with other tests, including ANOVA
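  To see the equivalence directly, the two sets of results can be printed side by side; a minimal sketch, same variables as above (the sign of the t-statistic may flip depending on group ordering):

    # t-statistic and p-value for the sex coefficient from the regression
    summary(lm(sbp ~ sex, data = data))$coefficients["sex", ]

    # Equal-variance two sample t-test gives the identical t-statistic and p-value
    t.test(data$sbp[data$sex == 1], data$sbp[data$sex == 0], var.equal = TRUE)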

  37. Multiple regression • The largest advantage of regression is the ability to include multiple predictors of an outcome in one analysis • Confounders can also be included in multiple regression to control for these factors. Remember, confounders are factors related to both the outcome and the predictor; they can mask the true effect • A multiple regression equation looks just like a simple regression equation: y = b0 + b1x1 + b2x2 + … + bkxk + e

  38. Example • Now, we believe that low birth weight babies from mothers with toxemia would have different systolic blood pressure than babies from mothers without toxemia • We decide that we may need to control for toxemia in our analysis of gestational age and systolic blood pressure • The resulting regression equation is E(sbp) = b0 + b1 gestage + b2 toxemia • where toxemia = 1 if the mother experienced toxemia and = 0 if not
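  Fitting this model in R is a one-line extension of the earlier calls; a minimal sketch, assuming columns sbp, gestage, and toxemia in the data frame data:

    # Multiple regression: gestational age plus the toxemia indicator
    summary(lm(sbp ~ gestage + toxemia, data = data))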

  39. The meaning of each coefficient • b0: the average sbp when gestational age is 0 and the mother does not have toxemia • b1: the average increase in sbp for a one unit increase in gestational age, HOLDING TOXEMIA STATUS CONSTANT • b2: the average change in sbp between mothers with toxemia and without toxemia, HOLDING GESTATIONAL AGE CONSTANT • In terms of the regression lines, this means that the two groups have different intercepts, but have the same slope

  40. Practice • Read in the red blood cell loss data • The outcome of interest is red blood cell loss • There are six possible predictors: PT time before operation (pttim), PTT time before operation (ptttim), units of FFP transfused (units), PT time after the operation (ptpos), age (age), and sex (sex) • Using a for loop, run six linear models and determine which of the six predictors are significantly associated with the outcome (see the sketch below)
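  A minimal sketch of the practice loop, assuming the data are read into a data frame called data with outcome column rbc and the six predictor columns named above (the file name here is hypothetical):

    # Read in the red blood cell loss data (hypothetical file name)
    data <- read.csv("rbc_loss.csv")

    # Fit one simple linear model per candidate predictor
    predictors <- c("pttim", "ptttim", "units", "ptpos", "age", "sex")
    for (p in predictors) {
      fit <- lm(as.formula(paste("rbc ~", p)), data = data)
      # Pull the slope estimate and its p-value from the coefficient table
      est  <- summary(fit)$coefficients[2, "Estimate"]
      pval <- summary(fit)$coefficients[2, "Pr(>|t|)"]
      cat(p, ": estimate =", round(est, 3), ", p-value =", round(pval, 4), "\n")
    }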
