
STATS 330: Lecture 11



  1. STATS 330: Lecture 11 Diagnostics 3 330 lecture 11

  2. Outliers and high-leverage points • An outlier is a point that has a larger or smaller y value than the model would suggest • Can be due to a genuine large error e • Can be caused by typographical errors in recording the data • A high-leverage point is a point with extreme values of the explanatory variables

  3. Outliers • The effect of an outlier depends on whether it is also a high-leverage point • A “high leverage” outlier • Can attract the fitted plane, distorting the fit, sometimes extremely • In extreme cases may not have a big residual • In extreme cases can increase R2 • A “low leverage” outlier • Does not distort the fit to the same extent • Usually has a big residual • Inflates standard errors, decreases R2

  4. [Figure: four scatterplots: (a) no outliers, no high-leverage points; (b) a low-leverage outlier with a big residual; (c) a high-leverage point that is not an outlier; (d) a high-leverage outlier]

  5. Example: the education data (ignoring urban) [Figure: scatterplot showing a high-leverage point]

  6. [Figure: is it an outlier also? The residual is somewhat extreme]

  7. Measuring leverage It can be shown (see e.g. STATS 310) that the fitted value of case i is related to the response data y1, …, yn by the equation

    ŷi = hi1 y1 + hi2 y2 + … + hin yn

The hij depend on the explanatory variables. The quantities hii are called “hat matrix diagonals” (HMDs) and measure the influence yi has on the ith fitted value. They can also be interpreted as the distance between the x-data for the ith case and the average x-data for all the cases. Thus, they directly measure how extreme the x-values of each point are.

  8. Interpreting the HMDs • Each HMD lies between 0 and 1 • The average HMD is p/n (p = number of regression coefficients, so p = k + 1 for k explanatory variables) • An HMD of more than 3p/n is considered extreme
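These rules can be sketched in R. The educ.df data set is not reproduced here, so the example below uses simulated data (all variable names are made up for illustration); one point is given an extreme x-value so that the 3p/n rule flags it.

```r
# Sketch: hat matrix diagonals and the 3p/n rule on simulated data.
set.seed(1)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
x1[n] <- 6                      # give one point an extreme x-value
y  <- 1 + 2*x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

h <- hatvalues(fit)             # the hat matrix diagonals (HMDs)
p <- length(coef(fit))          # p = k + 1 = 3 here
mean(h)                         # the average HMD is exactly p/n
which(h > 3*p/n)                # flags the high-leverage point
```

Note that the HMDs always sum to exactly p, which is why their average is p/n.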

  9. Example: the education data
educ.lm <- lm(educ ~ percapita + under18, data=educ.df)
> hatvalues(educ.lm)[50]
       50
0.3428523
> 9/50          # 3p/n with n = 50, p = 3
[1] 0.18
Clearly extreme!

  10. Studentized residuals • How can we recognize a big residual? How big is big? • The actual size depends on the units in which the y-variable is measured, so we need to standardize • We can divide the residuals by their standard deviations • The variance of a typical residual e is var(e) = (1 − h)σ², where h is the hat matrix diagonal for the point and σ² is the error variance

  11. Studentized residuals (2) • “Internally studentised” (called “standardised” in R): ri = ei / (s √(1 − hii)), where s² is the usual estimate of σ² • “Externally studentised” (called “studentised” in R): ti = ei / (s(i) √(1 − hii)), where s(i)² is the estimate of σ² after deleting the ith data point

  12. Studentized residuals (3) • How big is big? • Both types of studentised residual are approximately distributed as standard normals when the model is OK and there are no outliers (in fact the externally studentised residual has an exact t-distribution) • Thus, studentised residuals should be between −2 and 2 with approximately 95% probability

  13. Studentized residuals (4) Calculating in R:
library(MASS)      # load the MASS library
stdres(educ.lm)    # internally studentised (standardised in R)
studres(educ.lm)   # externally studentised (studentised in R)
> stdres(educ.lm)[50]
      50
3.275808
> studres(educ.lm)[50]
      50
3.700221
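MASS is not strictly required: base R's rstandard() gives the internally studentised ("standardised") residuals and rstudent() the externally studentised ones. A quick check on simulated data (hypothetical, since educ.df is not reproduced here) that rstandard() matches the definition from slide 11:

```r
set.seed(2)
x <- rnorm(25); y <- 1 + x + rnorm(25)
fit <- lm(y ~ x)

r.int <- rstandard(fit)   # internally studentised (same as MASS::stdres)
r.ext <- rstudent(fit)    # externally studentised (same as MASS::studres)

# Internally studentised residuals from the definition e / (s * sqrt(1 - h)):
e <- resid(fit)
h <- hatvalues(fit)
s <- summary(fit)$sigma   # the usual estimate of sigma
all.equal(r.int, e / (s * sqrt(1 - h)))   # TRUE
```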

  14. What does studentised mean?

  15. Recognizing outliers • If a point is a low-leverage outlier, the residual will usually be large, so a large residual and a low HMD indicate an outlier • If a point is a high-leverage outlier, then a large error will usually cause a large residual • However, in extreme cases, a high-leverage outlier may not have a very big residual, depending on how much the point attracts the fitted plane. Thus, if a point has a large HMD and the residual is not particularly big, we can’t always tell whether the point is an outlier or not

  16. [Figure: a high-leverage outlier with a small residual; the point attracts the fitted line, so its residual stays small]
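This attraction effect is easy to reproduce. In the simulated sketch below (made-up data, not the education data), a wildly wrong y at an extreme x-value drags the fitted line towards itself, so its own residual stays modest even though the point is badly wrong:

```r
set.seed(3)
x <- rnorm(20); y <- 2 + 3*x + rnorm(20, sd = 0.5)
x[20] <- 10        # extreme x-value...
y[20] <- 0         # ...with y far from the true line (about 32 at x = 10)
fit <- lm(y ~ x)

hatvalues(fit)[20]  # huge leverage, far above 3p/n
resid(fit)[20]      # residual much smaller than the raw error of about 32
coef(fit)["x"]      # slope dragged well below the true value of 3
```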

  17. Leverage-residual plot
plot(educ.lm, which=5)
We can plot standardised residuals versus leverage (the HMDs): the leverage-residual plot (LR plot). Point 50 has high leverage and a big residual, so it is an outlier.

  18. Interpreting LR plots [Diagram: schematic LR plot of standardised residual (vertical axis, reference lines at −2, 0 and 2) against leverage (horizontal axis, reference line at 3p/n). Points with |standardised residual| > 2 and leverage below 3p/n are low-leverage outliers; points with |standardised residual| > 2 and leverage above 3p/n are high-leverage outliers; points with leverage above 3p/n and a residual near ±2 are possible high-leverage outliers; points within ±2 and below 3p/n are OK]

  19. Residuals and HMDs No big studentized residuals, no big HMDs (3p/n = 0.2 for this example)

  20. Residuals and HMDs (2) One big studentized residual (point 24), no big HMDs (3p/n = 0.2 for this example). The line moves a bit

  21. Residuals and HMDs (3) No big studentized residuals, one big HMD (point 1; 3p/n = 0.2 for this example). The line hardly moves. Point 1 is high leverage but not influential

  22. Residuals and HMDs (4) One big studentized residual and one big HMD, both for point 1 (3p/n = 0.2 for this example). The line moves and the residual is large. Point 1 is influential

  23. Residuals and HMDs (5) No big studentized residuals, one big HMD (point 1; 3p/n = 0.2). Point 1 is high leverage and influential

  24. Influential points • How can we tell if a high-leverage point/outlier is affecting the regression? • By deleting the point and refitting the regression: a large change in the coefficients means the point is affecting the regression • Such points are called influential points • We don’t want the analysis to be driven by one or two points

  25. “Leave-one-out” measures • We can calculate a variety of measures by leaving out each data point in turn and looking at the change in key regression quantities such as • Coefficients • Fitted values • Standard errors • We discuss each in turn
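These deletion measures can always be checked by brute force: delete each case, refit, and difference the coefficients. The sketch below (simulated data, not the education data) compares the brute-force changes in the coefficients with R's efficient dfbeta() computation; the magnitudes agree.

```r
set.seed(4)
x <- rnorm(15); y <- 1 + x + rnorm(15)
fit <- lm(y ~ x)

# Brute force: change in the coefficients when case i is deleted
delta <- t(sapply(seq_along(y), function(i) {
  coef(fit) - coef(lm(y[-i] ~ x[-i]))
}))

# Compare with the built-in leave-one-out computation
all.equal(abs(delta), abs(dfbeta(fit)), check.attributes = FALSE)   # TRUE
```

The brute-force loop refits the model n times; dfbeta() (and lm.influence()) get the same numbers from a single fit using the hat matrix.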

  26. Example: education data

  27. Standardized difference in coefficients: DFBETAS Formula: DFBETASj(i) = (bj − bj(i)) / (s(i) √((X′X)⁻¹jj)), the standardised change in the jth coefficient when case i is deleted. Problem when: greater than 1 in absolute value. This is the criterion coded into R.

  28. Standardized difference in fitted values: DFFITS Formula: DFFITSi = (ŷi − ŷi(i)) / (s(i) √hii), the standardised change in the ith fitted value when case i is deleted. Problem when: greater than 3√(p/(n − p)) in absolute value (p = number of regression coefficients)

  29. COVRATIO & Cook’s D • COVRATIO: measures the change in the standard errors of the estimated coefficients. Problem indicated when COVRATIO is more than 1 + 3p/n or less than 1 − 3p/n • Cook’s D: measures the overall change in the coefficients. Problem when more than qf(0.50, p, n − p) (the lower 50% point of the F distribution), roughly 1 in most cases
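In R these two measures are covratio() and cooks.distance(), and the cutoffs above can be coded directly. A sketch on simulated data with one planted outlier (the thresholds follow the slide; the data are made up):

```r
set.seed(5)
x <- rnorm(40); y <- 1 + x + rnorm(40)
y[40] <- y[40] + 8              # plant one outlier
fit <- lm(y ~ x)
n <- length(y); p <- length(coef(fit))

cr <- covratio(fit)
cd <- cooks.distance(fit)

which(abs(cr - 1) > 3*p/n)      # cases noticeably changing the std errors
which(cd > qf(0.5, p, n - p))   # cases above the rough Cook's D cutoff
```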

  30. Calculations
> influence.measures(educ.lm)
Influence measures of
lm(formula = educ ~ under18 + percap, data = educ.df)
     dfb.1.   dfb.un18  dfb.prcp    dffit  cov.r    cook.d    hat inf
10  0.06381  -0.02222  -0.16792  -0.3631  0.803  4.05e-02  0.0257   *
44  0.02289  -0.02948   0.00298  -0.0340  1.283  3.94e-04  0.1690   *
50 -2.36876   2.23393   1.50181   2.4733  0.821  1.66e+00  0.3429   *
p = 3, n = 50, 3p/n = 0.18, 3√(p/(n − p)) = 0.758, qf(0.5, 3, 47) = 0.8002294

  31. Plotting influence
# set up plot window with a 2 x 4 array of plots
par(mfrow=c(2,4))
# plot dfbetas, dffits, cov ratio, Cook's D, HMDs
influenceplots(educ.lm)

  32. [Figure: influence plots for the education data]

  33. Remedies for outliers • Correct typographical errors in the data • Delete a small number of points and refit (we don’t want the fitted regression to be determined by one or two influential points) • Report the existence of outliers separately: they are often of scientific interest • Don’t delete too many points (1 or 2 at most)

  34. Summary: Doing it in R • LR plot: plot(educ.lm, which=5) • Full diagnostic display: plot(educ.lm) • Influence measures: influence.measures(educ.lm) • Plots of influence measures: par(mfrow=c(2,4)); influenceplots(educ.lm)

  35. HMD Summary • Hat matrix diagonals • Measure the effect of a point on its fitted value • Measure how outlying the x-values are (how “high-leverage” a point is) • Are always between 0 and 1, with bigger values indicating higher leverage • Points with HMDs of more than 3p/n are considered “high leverage”
