Simple Linear Regression

# Simple Linear Regression

Télécharger la présentation

## Simple Linear Regression

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
##### Presentation Transcript

1. STAT 101 Dr. Kari Lock Morgan 11/6/12 Simple Linear Regression • SECTIONS 9.1, 9.3 • Inference for slope (9.1) • Confidence and prediction intervals (9.3) • Conditions for inference (9.1) • Transformations (not in book)

2. Sample to Population • Everything we did last class on regression was based solely on sample data • Now, we will do inference, extending from the sample to the population

3. Simple Linear Model • The population/true simple linear model is • 0 and 1, are unknown parameters • Can use familiar inference methods! Intercept Slope Random error

4. Simple Linear Regression • Simple linear regression estimates the population model • with the sample model:

5. Inference for the Slope • Confidence intervals and hypothesis tests for the slope can be done using the familiar formulas: • Population Parameter: 1, Sample Statistic: • Use t-distribution with n – 2 degrees of freedom

6. Inference for Slope Give a 95% confidence interval for the true slope. Is the slope significantly different from 0? (a) Yes (b) No

7. Confidence Interval > percentile(“t”,.975,df=5) [1] 2.570582 We are 95% confident that the true slope, regressing temperature on cricket chirp rate, is between 0.194 and 0.266 degrees per chirp per minute.

8. Hypothesis Test > tail.p(“t”,16.21,df=5,tail=“two”) [1] 1.628701e-05 There is strong evidence that the slope is significantly different from 0, and that there is an association between cricket chirp rate and temperature.

9. Correlation Test • Test for a correlation between temperature and cricket chirps (r = 0.9906). • The t-statistic (and p-value) for a test for a non-zero slope and a test for a non-zero correlation are identical! • Equivalent ways of testing for an association between two quantitative variables. > tail.p(“t”,16.21,df=5,tail=“two”) [1] 1.628701e-05

10. Two Quantitative Variables • The t-statistic (and p-value) for a test for a non-zero slope and a test for a non-zero correlation are identical! • They are equivalent ways of testing for an association between two quantitative variables.

11. Small Samples • The t-distribution is only appropriate for large samples (definitely not n = 7)! • We should have done inference for the slope using simulation methods...

12. Presidential Elections • We found the point estimate for Obama’s predicted margin of victory based on his approval rating to be 5.3 • We really want an interval estimate

13. Prediction • We would like to use the regression equation to predict y for a certain value of x • This includes not only a point estimate, but interval estimates also • We will predict the value of y at x = x*

14. Point Estimate • The point estimate for the average y value at x=x* is simply the predicted value: • Alternatively, you can think of it as the value on the line above the x value • The uncertainty in this point estimate comes from the uncertainty in the coefficients

15. Confidence Intervals • We can calculate a confidence interval for the averagey value for a certain x value • “We are 95% confident that the average y value for x=x* lies in this interval” • Equivalently, the confidence interval is for the point estimate, or the predicted value • This is the amount the line is free to “wiggle,” and the width of the interval decreases as the sample size increases

16. Bootstrapping • We need a way to assess the uncertainty in predicted y values for a certain x value… any ideas? • Take repeated samples, with replacement, from the original sample data (bootstrap) • Each sample gives a slightly different fitted line • If we do this repeatedly, take the middle P% of predicted y values at x* for a confidence interval of the predicted y value at x*

17. Bootstrap CI Middle 95% of predicted values gives the confidence interval for average (predicted) margin of victory for an incumbent president with an approval rating of 50%

18. Confidence Interval

19. Confidence Interval • For x* = 50%: (1.07, 9.52) • We are 95% confident that the average margin of victory for incumbent U.S. presidents with approval ratings of 50% is between 1.07 and 9.52 percentage points • But wait, this still doesn’t tell us about Obama! We don’t care about the average, we care about an interval for one incumbent president with an approval rating of 50%!

20. Prediction Intervals • We can also calculate a prediction interval for y values for a certain x value • “We are 95% confident that the y value for x = x* lies in this interval” • This takes into account the variability in the line (in the predicted value) AND the uncertainty around the line (the random errors)

21. Intervals

22. Intervals • A confidence interval has a given chance of capturing the mean y valueat a specified x value • A prediction interval has a given chance of capturing the y value for a particular case at a specified x value • For a given x value, which will be wider? • Confidence interval • Prediction interval A confidence interval only addresses uncertainty about the line, a prediction interval also includes the scatter of the points around the line

23. Intervals • -- Prediction • -- Confidence

24. Intervals • As the sample size increases: • the standard errors of the coefficients decrease • we are more sure of the equation of the line • the widths of the confidence intervals decrease • for a huge n, the width of the CI will be almost 0 • The prediction interval may be wide, even for large n, and depends more on the correlation between x and y (how well y can be linearly predicted by x)

25. Prediction Interval • Based on the data and the simple linear model (and using Obama’s current approval rating): • The predicted margin of victory for Obama is 5.3 percentage points • We are 95% confident that Obama’s margin of victory (or defeat) will be between -8.8 and 19.4 percentage points

26. Formulas for Intervals • NOTE: You will never need to use these formulas in this class – you will just have RStudio do it for you. • se : estimate for the standard deviation of the residuals

27. Conditions • Inference based on the simple linear model is only valid if the following conditions hold: • Linearity • Constant Variability of Residuals • Normality of Residuals

28. Linearity • The relationship between x and y is linear (it makes sense to draw a line through the scatterplot)

29. Dog Years Charlie • From www.dogyears.com: • “The old rule-of-thumb that one dog year equals seven years of a human life is not accurate. The ratio is higher with youth and decreases a bit as the dog ages.” ACTUAL • 1 dog year = 7 human years • Linear: human age = 7×dog age LINEAR A linear model can still be useful, even if it doesn’t perfectly fit the data.

30. “All models are wrong, but some are useful” • -George Box

31. Simple Linear Model

32. Residuals (errors) • Conditions for residuals: The errors are normally distributed The standard deviation of the errors is constant for all cases The average of the errors is 0 Constant vertical spread in the residual plot Check with a histogram (Always true for least squares regression)

33. Conditions not Met? • If the association isn’t linear: • Try to make it linear • If can’t make linear, then simple linear regression isn’t a good fit for the data • If variability is not constant, or residuals are not normal: •  The model itself is still valid, but inference may not be accurate

34. Non-Constant Variability

35. Non-Normal Residuals

36. Simple Linear Regression • Plot your data! • Association approximately linear? • Outliers? • Constant variability? • Fit the model (least squares) • Use the model • Interpret coefficients • Make predictions • Look at histogram of residuals (normal?) • Inference (extend to population) • Inference on slope • Confidence and prediction intervals

37. President Approval and Re-Election 1. Plot the data: Is the trend approximately linear? (a) Yes (b) No

38. President Approval and Re-Election 1. Plot the data: Are there obvious outliers? (a) Yes (b) No

39. President Approval and Re-Election 1. Plot the data: Is there approximately constant variability? (a) Yes (b) No

40. President Approval and Re-Election 2. Fit the Model:

41. President Approval and Re-Election 3. Use the model: • Which of the following is a correct interpretation? • For every percentage point increase in margin of victory, approval increases by 0.84 percentage points • For every percentage point increase in approval, predicted margin of victory increases by 0.84 percentage points • For every 0.84 increase in approval, predicted margin of victory increases by 1

42. President Approval and Re-Election 3. Use the model: • Based on Obama’s current approval rating of 50%, his predicted margin of victory is Predicted margin of victory = -36.5 + 0.84*50 = 5.5

43. President Approval and Re-Election 4. Look at histogram of residuals: • Are the residuals approximately normally distributed? • (a) Yes (b) No No, although this is a very small sample size, so it’s hard to tell. At least there are no extreme outliers.

44. President Approval and Re-Election 5. Inference • Should we do inference? • (a) Yes • (b) No • Due to the non-constant variability, the non-normal residuals, and the small sample size, our inferences may not be entirely accurate. • However, they are still better than nothing!

45. President Approval and Re-Election 5. Inference • Give a 95% confidence interval for the slope coefficient. • Is it significantly different than 0? • (a) Yes • (b) No 0.836  2.26*0.163 = (0.48, 1.20) t = 0.836/0.163 = 5.13 p-value = 0.0006

46. President Approval and Re-Election 5. Inference: • We don’t really care about the slope coefficient, we care about Obama’s margin of victory in the upcoming election! • A 95% prediction interval for Obama’s margin of victory is -8.8 to 19.4. • Actual margin of victory: 2 (50% Obama to 48% Romney)

47. Transformations • If the conditions are not satisfied, there are some common transformations you can apply to the response variable • You can take any function of y and use it as the response, but the most common are • log(y) (natural logarithm - ln) • y (square root) • y2 (squared) • ey (exponential))

48. log(y) Original Response, y: Logged Response, log(y):

49. y Original Response, y: Square root of Response, y: