
Linear regression





Presentation Transcript


  1. Linear regression Brian Healy, PhD BIO203

  2. Previous classes • Hypothesis testing • Parametric • Nonparametric • Correlation

  3. What are we doing today? • Linear regression • Continuous outcome with continuous, dichotomous or categorical predictor • Equation: y = b0 + b1x • Interpretation of coefficients • Connection between regression and • correlation • t-test • ANOVA

  4. Big picture • Linear regression is the most commonly used statistical technique. It allows the comparison of dichotomous, categorical and continuous predictors with a continuous outcome. • Extensions of linear regression allow • Dichotomous outcomes- logistic regression • Survival analysis- Cox proportional hazards regression • Repeated measures • Amazingly, many of the analyses we have learned can be completed using linear regression

  5. Example • Yesterday, we investigated the association between age and BPF using a correlation coefficient • Can we fit a line to this data?

  6. Quick math review • As you remember from high school math, the basic equation of a line is given by y=mx+b where m is the slope and b is the y-intercept • One definition of m is that for every one unit increase in x, there is an m unit increase in y • One definition of b is the value of y when x is equal to zero

  7. Picture • Look at the data in this picture • Does there seem to be a correlation (linear relationship) in the data? • Is the data perfectly linear? • Could we fit a line to this data?

  8. What is linear regression? • Linear regression tries to find the best line (curve) to fit the data • The method of finding the best line (curve) is least squares, which minimizes the sum of the squared vertical distances from the line to each of the points

  9. How do we find the best line? • Let’s look at three candidate lines • Which do you think is the best? • What is a way to determine the best line to use?

  10. Residuals • The actual observations, yi, may be slightly off the population line because of variability in the population. The equation is yi = b0 + b1xi + ei, where ei is the deviation from the population line (see picture). • This deviation is called the residual. In the picture, e1 is the distance from the line for patient 1

  11. Least squares • The method employed to find the best line is called least squares. This method finds the values of b that minimize the squared vertical distance from the line to each of the points. This is the same as minimizing the sum of the ei²

  12. Estimates of regression coefficients • Once we have solved the least squares equation, we obtain estimates for the b’s, which we refer to as b0hat and b1hat • The final least squares equation is yhat = b0hat + b1hat × x1, where yhat is the estimated mean value of y for a value of x1
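As a concrete illustration, here is a minimal Python sketch of the closed-form least-squares solutions, b1hat = Σ(xi - xbar)(yi - ybar) / Σ(xi - xbar)² and b0hat = ybar - b1hat × xbar. The lecture’s actual age/BPF measurements are not reproduced in the transcript, so the data below are simulated stand-ins (n = 29, matching the 27 degrees of freedom reported later).

```python
import numpy as np

# Simulated stand-in data; the true lecture dataset is not available here
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=29)
bpf = 0.957 - 0.0029 * age + rng.normal(0, 0.02, size=29)

# Closed-form least-squares estimates
xbar, ybar = age.mean(), bpf.mean()
b1hat = np.sum((age - xbar) * (bpf - ybar)) / np.sum((age - xbar) ** 2)
b0hat = ybar - b1hat * xbar
print(f"yhat = {b0hat:.4f} + ({b1hat:.4f}) * x")
```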

  13. Assumptions of linear regression • Linearity • Linear relationship between outcome and predictors • E(Y|X=x) = b0 + b1x1 + b2x2² is still a linear regression equation because each of the b’s is to the first power • Normality of the residuals • The residuals, ei, are normally distributed, N(0, s²) • Homoscedasticity of the residuals • The residuals, ei, have the same variance • Independence • All of the data points are independent • Correlated data points can be taken into account using multivariate and longitudinal data methods

  14. Linearity assumption • One of the assumptions of linear regression is that the relationship between the predictors and the outcomes is linear • We call this the population regression line E(Y | X=x) = μy|x = b0 + b1x • This equation says that the mean of y given a specific value of x is defined by the b coefficients • The coefficients act exactly like the slope and y-intercept from the simple equation of a line from before

  15. Normality and homoscedasticity assumption • Two other assumptions of linear regression are related to the ei’s • Normality- the distribution of the residuals is normal • Homoscedasticity- the variance of y given x is the same for all values of x (In the picture, the distribution of y-values at each value of x is normal with the same variance)

  16. Example • Here is a regression equation for the comparison of age and BPF

  17. Results • The estimated regression equation is BPFhat = 0.957 - 0.0029 × age

  18. [Annotated STATA output highlighting the estimated slope and the estimated intercept]

  19. Interpretation of regression coefficients • The final regression equation is BPFhat = 0.957 - 0.0029 × age • The coefficients mean • the estimate of the mean BPF for a patient with an age of 0 is 0.957 (b0hat) • an increase of one year in age leads to an estimated decrease of 0.0029 in mean BPF (b1hat)

  20. Unanswered questions • Is the estimate of b1 (b1hat) significantly different from zero? In other words, is there a significant relationship between the predictor and the outcome? • Have the assumptions of regression been met?

  21. Estimate of variance for bhat’s • In order to determine if there is a significant association, we need an estimate of the variance of b0hat and b1hat • se(b1hat) = sy|x/√(Σ(xi - xbar)²) and se(b0hat) = sy|x × √(1/n + xbar²/Σ(xi - xbar)²) • sy|x = √(Σei²/(n-2)) is the residual variation in y after accounting for x (the standard deviation from regression, or root mean square error)

  22. Test statistic • For both regression coefficients, we use a t-statistic to test any specific hypothesis • Each has n-2 degrees of freedom (this is the sample size minus the number of parameters estimated) • What is the usual null hypothesis for b1?
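To make the pieces concrete, here is a sketch that computes sy|x, the standard errors, and the t-statistics by hand, again on simulated stand-in data (the lecture’s dataset is not included in the transcript):

```python
import numpy as np

# Simulated stand-in data (n = 29, matching the 27 dof reported later)
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=29)
bpf = 0.957 - 0.0029 * age + rng.normal(0, 0.02, size=29)

xbar = age.mean()
Sxx = np.sum((age - xbar) ** 2)
b1hat = np.sum((age - xbar) * (bpf - bpf.mean())) / Sxx
b0hat = bpf.mean() - b1hat * xbar

# Residual standard deviation s_y|x = sqrt(sum(e_i^2) / (n - 2))
resid = bpf - (b0hat + b1hat * age)
n = len(age)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Standard errors of the coefficient estimates
se_b1 = s_yx / np.sqrt(Sxx)
se_b0 = s_yx * np.sqrt(1 / n + xbar ** 2 / Sxx)

# t-statistics for H0: b = 0, each with n - 2 degrees of freedom
print("t for b1hat:", b1hat / se_b1)
print("t for b0hat:", b0hat / se_b0)
```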

  23. Hypothesis test • H0: b1=0 • Continuous outcome, continuous predictor • Linear regression • Test statistic: t=-3.67 (27 dof) • p-value=0.0011 • Since the p-value is less than 0.05, we reject the null hypothesis • We conclude that there is a significant association between age and BPF
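The p-value can be reproduced from the reported test statistic alone; the t-statistic and degrees of freedom below are taken directly from the slide:

```python
from scipy import stats

t_stat, dof = -3.67, 27
p_value = 2 * stats.t.sf(abs(t_stat), dof)  # two-sided p-value
print(f"p = {p_value:.4f}")  # approximately 0.0011, as reported
```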

  24. [Annotated STATA output highlighting the estimated slope, the p-value for the slope, and the estimated intercept]

  25. Comparison to correlation • In this example, we found a relationship between age and BPF. We also investigated this relationship using correlation • We get the same p-value!! • Our conclusion is exactly the same!! • There are other relationships we will see later

  26. Confidence interval for b1 • As we have done previously, we can construct a confidence interval for the regression coefficients: b1hat ± (t cut-off with n-2 dof) × se(b1hat) • Since we are using a t-distribution, we do not automatically use 1.96. Rather, we use the cut-off from the t-distribution • The interpretation of the confidence interval is the same as we have seen previously
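A sketch of the confidence interval computation. The slope estimate is taken from the slide, and its standard error is back-calculated from the reported t-statistic, so the numbers are illustrative rather than exact:

```python
from scipy import stats

b1hat, dof = -0.0029, 27
se_b1 = abs(b1hat) / 3.67        # back-calculated from t = b1hat / se
t_cut = stats.t.ppf(0.975, dof)  # about 2.05, not 1.96
lower = b1hat - t_cut * se_b1
upper = b1hat + t_cut * se_b1
print(f"95% CI for b1: ({lower:.4f}, {upper:.4f})")
```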

  27. Intercept • STATA also provides a test statistic and p-value for the estimate of the intercept • This is for H0: b0 = 0, which is often not a hypothesis of interest because it corresponds to testing whether the BPF is equal to zero at an age of 0 • Since BPF can’t be 0 at age 0, this test is not really of interest • We can center covariates to make this test meaningful

  28. Prediction

  29. Prediction • Beyond determining if there is a significant association, linear regression can also be used to make predictions • Using the regression equation, we can predict the BPF for patients with specific age values • Ex. a patient with age = 40: BPFhat = 0.957 - 0.0029 × 40 = 0.841 • The expected BPF for a patient of age 40 based on our experiment is 0.841
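A one-line check of this prediction using the estimated coefficients from the slide:

```python
# Estimated coefficients from the slide
b0hat, b1hat = 0.957, -0.0029

age = 40
bpf_hat = b0hat + b1hat * age   # 0.957 - 0.116
print(f"predicted BPF at age 40: {bpf_hat:.3f}")  # 0.841
```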

  30. Extrapolation • Can we predict the BPF for a patient with age 80? What assumption would we be making?

  31. Confidence interval for prediction • We can place a confidence interval around our predicted mean value • This corresponds to the plausible values for the mean BPF at a specific age • To calculate a confidence interval for the predicted mean value, we need an estimate of variability in the predicted mean

  32. Confidence interval • Note that the standard error equation has a different magnitude based on the x value. In particular, the magnitude is smallest when x equals the mean of x • Since the test statistic is based on the t-distribution, our confidence interval is yhat ± (t cut-off with n-2 dof) × se(yhat), where se(yhat) = sy|x × √(1/n + (x - xbar)²/Σ(xi - xbar)²) • This confidence interval is rarely used for hypothesis testing because we are usually interested in testing the slope, not the mean outcome at a particular x

  33. Prediction interval • A confidence interval for a mean provides information regarding the accuracy of an estimated mean value for a given sample size • Often, we are interested in how accurate our prediction would be for a single observation, not the mean of a group of observations. This is called a prediction interval • What would you estimate as the value for a single new observation? • Do you think a prediction interval is narrower or wider?

  34. Prediction interval • A confidence interval is always tighter than the corresponding prediction interval • The variability in the prediction of a single observation contains two types of variability • Variability of the estimate of the mean (confidence interval) • Variability around the estimate of the mean (residual variability) • The standard error for a single observation is sy|x × √(1 + 1/n + (x - xbar)²/Σ(xi - xbar)²)
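A sketch comparing the two intervals at age 40, on the same simulated stand-in data as the earlier sketches. The only difference is the extra “1 +” inside the square root for the prediction interval:

```python
import numpy as np
from scipy import stats

# Simulated stand-in data, as in the earlier sketches
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=29)
bpf = 0.957 - 0.0029 * age + rng.normal(0, 0.02, size=29)

n, xbar = len(age), age.mean()
Sxx = np.sum((age - xbar) ** 2)
b1hat = np.sum((age - xbar) * (bpf - bpf.mean())) / Sxx
b0hat = bpf.mean() - b1hat * xbar
s_yx = np.sqrt(np.sum((bpf - b0hat - b1hat * age) ** 2) / (n - 2))
t_cut = stats.t.ppf(0.975, n - 2)

x0 = 40                       # hypothetical new patient, age 40
yhat = b0hat + b1hat * x0

# CI for the mean at x0 vs. prediction interval for a single observation:
# the PI adds "1 +" under the square root (the residual variability),
# which is why it is always wider, regardless of sample size.
se_ci = s_yx * np.sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)
se_pi = s_yx * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)
print("95% CI:", yhat - t_cut * se_ci, yhat + t_cut * se_ci)
print("95% PI:", yhat - t_cut * se_pi, yhat + t_cut * se_pi)
```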

  35. Conclusions • The prediction interval is always wider than the confidence interval • It is common to find significant differences between groups but not be able to predict very accurately for a single patient • To predict accurately for a single patient, we need limited overlap of the group distributions. Increasing the sample size shrinks the standard error of the mean, but this does not help because the residual variability around the mean remains

  36. Model checking

  37. How good is our model? • Although we have found a relationship between age and BPF, linear regression also allows us to assess how well our model fits the data • R² = coefficient of determination = proportion of variance in the outcome explained by the model • When we have only one predictor, it is the proportion of the variance in y explained by x
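A minimal sketch of computing R² as 1 - SSres/SStot, on simulated stand-in data; with a single predictor it matches the squared Pearson correlation, as the next slide discusses:

```python
import numpy as np

# Simulated stand-in data
rng = np.random.default_rng(0)
x = rng.uniform(20, 60, size=29)
y = 0.957 - 0.0029 * x + rng.normal(0, 0.02, size=29)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

ss_res = np.sum((y - yhat) ** 2)      # variation left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in the outcome
r_squared = 1 - ss_res / ss_tot

# With one predictor, R^2 equals the squared Pearson correlation
print(r_squared, np.corrcoef(x, y)[0, 1] ** 2)
```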

  38. R² • What if all of the variability in y was explained by x? • What would R² equal? • What does this tell you about the correlation between x and y? • What if the correlation between x and y is negative? • What if none of the variability in y is explained by x? • What would R² equal? • What is the correlation between x and y in this case?

  39. r vs. R² • R² = (Pearson’s correlation coefficient)² = r² • Since r is between -1 and 1, R² is always less than or equal to |r| • r=0.1, R²=0.01 • r=0.5, R²=0.25

  40. Evaluation of model • Linear regression requires several assumptions • Linearity • Homoscedasticity • Normality • Independence- usually addressed by the study design • We must determine whether the model assumptions were reasonable; if not, a different model may be needed • Statistical research has investigated relaxing each of these assumptions

  41. Scatter plot • A good first step in any regression is to look at the x vs. y scatter plot. This allows us to see • Are there any outliers? • Is the relationship between x and y approximately linear? • Is the variance in the data approximately constant for all values of x?

  42. Tests for the assumptions • There are several different ways to test the assumptions of linear regression • Graphical • Statistical • Many of the tests use the residuals, which are the distances between the observed outcomes and the fitted line

  43. Residual plot • If the assumptions of linear regression are met, we will observe a random scatter of points
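A sketch of constructing a residual plot with matplotlib, using simulated stand-in data; under the assumptions, the points should scatter randomly around zero:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated stand-in data
rng = np.random.default_rng(0)
x = rng.uniform(20, 60, size=29)
y = 0.957 - 0.0029 * x + rng.normal(0, 0.02, size=29)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x
resid = y - fitted

# Residuals vs. fitted values: a random scatter around zero is what we
# hope to see if the linearity and homoscedasticity assumptions hold
plt.scatter(fitted, resid)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```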

  44. Investigating linearity • Scatter plot of predictor vs outcome • What do you notice here? • One way to handle this is to transform the predictor to include a quadratic or other term

  45. Aging • Research has shown that the decrease in BPF in normal people is fairly slow up until age 65, and then there is a steeper drop

  46. Fitted line • Note how the majority of the values are above the fitted line in the middle and below the fitted line on the two ends

  47. What if we fit a line for this? • Residual plot shows a non-random scatter because the relationship is not really linear

  48. What can we do? • If the relationship between x and y is not linear, we can try a transformation of the values • Possible transformations • Add a quadratic term • Fit a spline, which allows one slope over part of the range of x and a different slope over the rest of the range (see the sketch below)
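A sketch of the quadratic-term transformation using numpy.polyfit; the curved “true” relationship below is invented purely for illustration, loosely mimicking a BPF decline that steepens after age 65:

```python
import numpy as np

# Hypothetical curved data: BPF roughly flat, then dropping after age 65
rng = np.random.default_rng(0)
age = rng.uniform(20, 85, size=50)
bpf = 0.96 - 2e-5 * np.maximum(age - 65, 0) ** 2 + rng.normal(0, 0.01, size=50)

# A straight line misses the bend; a quadratic term captures curvature
b_linear = np.polyfit(age, bpf, deg=1)     # coefficients [b1, b0]
b_quadratic = np.polyfit(age, bpf, deg=2)  # coefficients [b2, b1, b0]
print("linear fit:", b_linear)
print("quadratic fit:", b_quadratic)
```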
