
Additional Topics in Regression


Presentation Transcript


  1. STAT E-150 Statistical Methods: Additional Topics in Regression
  2. Transformations We may expect that the number of doctors in a city is related to the number of hospitals, and we can try to find a linear model for this relationship. Here are the SPSS results: The results suggest that there is a linear relationship between the variables. (F = 382.692, with p close to 0.)
  3. But the scatter diagram and an examination of the residuals do not suggest a linear relationship. (This is not unusual when working with counted data, where variability may increase as the counts increase.)
  4. And the normal probability plot (NPP) of the residuals also does not support the linear model.
  5. When not all conditions for a linear model are satisfied, we can try a transformation of one or more variables in the data. For counted data, where the variability increases as the counts increase, we can try a square root transformation, using √NumMDs as the response. The regression equation is √NumMDs = 14.033 + 2.915 NumHosp. Transforming back to the original variable: NumMDs = (14.033 + 2.915 NumHosp)²
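To make the transformation concrete, here is a minimal Python/statsmodels sketch of the same idea; the file name doctors.csv and the column names NumMDs and NumHosp are assumptions for illustration, not the course data file.

```python
# A minimal sketch of the square-root transformation (assumed file and column names).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("doctors.csv")                    # hypothetical data file

# Regress sqrt(NumMDs) on NumHosp; np.sqrt is applied inside the formula.
model = smf.ols("np.sqrt(NumMDs) ~ NumHosp", data=df).fit()
print(model.summary())

# Back-transform a fitted value to the original scale: NumMDs = (b0 + b1*NumHosp)^2
pred_sqrt = model.predict(pd.DataFrame({"NumHosp": [10]}))
print(pred_sqrt ** 2)
```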
  6. This model shows improvement over the results before the transformation:
  7. Other transformations of the predictor variables, the response variables, or both can be considered depending on the type of data and the shape of the scatter diagram. These possibilities include the square root, reciprocal, logs, other powers of the variables, and many more. How do you determine the best model for your data?
  8. To operate efficiently, power companies must be able to predict the peak power load at their stations. Peak power load is the maximum amount of power that must be generated to meet the demand. A power company wants to use daily high temperature, x, to model daily peak power load, y, during the summer months when demand is high. Although it is expected that peak load increases as the temperature rises, the rate of increase may not be constant. For example, a 1-degree increase in high temperature from 100ºF to 101ºF might result in a larger increase in power demand than an increase from 80ºF to 81ºF. That is, the model might include a quadratic term and possibly a cubic term.
  9. A random sample of 25 summer days is selected and both the peak load (measured in megawatts) and the high temperature (measured in ºF) are recorded for each day. The scatterplot on the left shows a nonlinear, upward curving pattern. It may be a second-order model, so we can see if a quadratic model is appropriate. The results are shown in the graph on the right, which shows that a quadratic model appears to be a good fit. But it is still important to check the results to assess the fit.
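For readers who want to reproduce this kind of fit outside SPSS, here is a hedged Python/statsmodels sketch of a second-order model; power.csv and the columns load and temp are hypothetical names.

```python
# A sketch of the second-order (quadratic) fit (assumed file and column names).
import pandas as pd
import statsmodels.formula.api as smf

power = pd.read_csv("power.csv")                   # hypothetical data file

# I(...) tells the formula parser to square temp before fitting.
quad = smf.ols("load ~ temp + I(temp ** 2)", data=power).fit()
print(quad.summary())                              # t-tests, ANOVA F, and R-squared
```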
  10. It can be seen that both the first-order and second-order coefficients are significant:
  11. Also, the ANOVA results show a p-value close to zero, indicating that the quadratic model is useful: And R² is close to 1:
  12. What if we were to try a third-order (cubic) model?
  13. What if we were to try a third-order (cubic) model? The results are shown below: Is this model a better fit? How can you decide?
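A cubic fit can be sketched the same way and compared with the quadratic fit on coefficient significance and adjusted R²; again, the file and column names below are hypothetical.

```python
# A sketch of the third-order (cubic) fit for comparison (assumed names as above).
import pandas as pd
import statsmodels.formula.api as smf

power = pd.read_csv("power.csv")                   # hypothetical data file

quad  = smf.ols("load ~ temp + I(temp ** 2)", data=power).fit()
cubic = smf.ols("load ~ temp + I(temp ** 2) + I(temp ** 3)", data=power).fit()

# Is the cubic coefficient significant, and does adjusted R-squared improve?
print(cubic.summary())
print(quad.rsquared_adj, cubic.rsquared_adj)
```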
  14. Try looking at the residuals:
  15. Try looking at the residuals: Which model should you use?
  16. We know how to use a t-test for individual predictors (H0: βi = 0) and how to use ANOVA to test all of the predictors (H0: β1 = β2 = β3 = 0). What if we want to test a subset of the predictors? A nested F-test is used to test a subset of the predictors. One model is nested within another model if all of its predictors are present in the larger model. We use this to compare the full model with some nested subset of the full model, to help determine which predictors to retain.
  17. For example, suppose you have data where there is one response variable (y) and two quantitative predictors, x1 and x2, and you are considering two models: a linear interaction model, y = β0 + β1x1 + β2x2 + β3x1x2, and a curvilinear model, y = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2². These are nested models because all terms in the linear model are in the more complex curvilinear model. The linear model is the reduced model and the more complex model is the full or complete model.
  18. To determine whether the complete model contributes more information for the prediction of y than the reduced model, test whether the quadratic terms are significant by testing H0: β4 = β5 = 0 (see the sketch below). How can we best choose predictors?
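One way to carry out such a nested F-test outside SPSS is with statsmodels' anova_lm, which compares a reduced fit and a full fit; the file data.csv and the columns y, x1, and x2 are placeholders for this sketch.

```python
# A sketch of a nested F-test with anova_lm (hypothetical columns y, x1, x2).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("data.csv")                       # hypothetical data file

reduced = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
full    = smf.ols("y ~ x1 + x2 + x1:x2 + I(x1 ** 2) + I(x2 ** 2)", data=df).fit()

# The F-test compares the two fits, i.e. it tests H0: beta4 = beta5 = 0.
print(anova_lm(reduced, full))
```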
  19. For example, what variables would help a music producer predict CD sales? One possible predictor is advertising. Here is a sample of data for 200 CDs, showing the amount spent in advertising (in thousands of pounds) and the sales (in thousands) one week after the CD was released:
  20. We can see what the simple regression model looks like. These results show that there is a linear association between these two variables: Sales = .096 adverts + 134.14
  21. But there are certainly more factors that contribute to sales. Suppose we add two more predictors: the number of times the CD was played on air during the week prior to release (airplay), and the attractiveness of the band (attract).
  22. The new regression equation is: Sales = .085 adverts + 3.37 airplay + 11.086 attract - 26.613
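As a rough parallel to the SPSS output, here is a sketch of both the one-predictor and three-predictor fits in Python/statsmodels; album_sales.csv and the column names sales, adverts, airplay, and attract are assumed for illustration.

```python
# A sketch of the one-predictor and three-predictor fits (assumed file/column names).
import pandas as pd
import statsmodels.formula.api as smf

album = pd.read_csv("album_sales.csv")             # hypothetical data file

fit1 = smf.ols("sales ~ adverts", data=album).fit()
fit3 = smf.ols("sales ~ adverts + airplay + attract", data=album).fit()
print(fit1.params)
print(fit3.params)                                 # compare with the SPSS coefficients
```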
  23. We can look at more diagnostics to decide which variables to include in the model. Select > Analyze > Regression > Linear Indicate the response variable and all predictors, then click on Statistics. Then choose the options shown in the Linear Regression Statistics dialog box.
  24. The results will include the Correlation matrix which shows the value of the correlation coefficient, r, for each pair of variables, as well as the one-tailed significance of each correlation. This table can also be used to check for multicollinearity; there should be no strong correlations (r > .9) between predictors.
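The same checks can be sketched in Python: a correlation matrix for the response and predictors, plus variance inflation factors as a second look at multicollinearity. The file and column names below are the same hypothetical ones used in the earlier sketch.

```python
# A sketch of the correlation and multicollinearity checks (assumed names as above).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

album = pd.read_csv("album_sales.csv")             # hypothetical data file

# Pairwise correlations among the response and the predictors
print(album[["sales", "adverts", "airplay", "attract"]].corr())

# Variance inflation factors: values well below 10 suggest no serious multicollinearity.
X = sm.add_constant(album[["adverts", "airplay", "attract"]])
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```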
  25. Which predictor would be the best predictor of CD sales? How can you tell?
  26. Which predictor would be the best predictor of CD sales? How can you tell? The number of plays correlates best with the CD Sales, so it might be the best predictor. (r = .599, p < .001)
  27. Which predictor would be the best predictor of CD sales? How can you tell? This result is also suggested by the matrix scatterplot:
  28. Let's compare the results for both models: Although both models show a significant association, the model with three predictors has a larger, and therefore more significant, F-ratio.
  29. But how can we find the best model? One method is Stepwise Regression, which is an iterative technique that is used to choose which predictor variables to include in a regression model. In our example, what variables would help a music producer predict CD sales?
  30. We will use SPSS for this analysis. In the Linear Regression dialog box, choose Advertising Budget (adverts) as the only predictor variable. Note that this is Block 1. Click on Next to create another model by adding an additional predictor, and enter the predictor for this second block.
  31. Then repeat this process to create the third model with all three predictors:
  32. Click on Statistics to select the statistics you want, then click on Continue to return to the main dialog box.
  33. In the main dialog box you can choose the values you wish to save. Choose Options to enter the criteria you want to use for entering variables in the stepwise regression, as well as the level of significance. The option to exclude cases listwise removes cases with any missing values from the regression analysis.
  34. Then click on Continue to return to the main regression dialog box once again. Choose Plots to enter the values you want to use and the residual plots you want to create. Then click on Continue and OK to run the analysis.
  35. The Model Summary table shows the results for all three models you chose so that you can compare them:
  36. We can also consider the ANOVA results:
  37. Which model, then, should we choose? In the table of coefficients, the t-tests measure the contribution each predictor makes to the model. Since all predictors have small p-values, all three are significant predictors in the model and should be included.
  38. So the final model is:
  39. Let's also look at the graphs that were produced. 1. Check the graph of the residuals vs. the predicted values for any pattern: This is exactly the scatterplot we would like to see; the points are randomly and evenly scattered around zero with no particular pattern or shape.
  40. 2. Check the residuals for normality: The histogram shows a unimodal and symmetric distribution, and the Normal Probability Plot shows that the points representing the residuals lie close to the line.
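The corresponding diagnostic plots can be sketched with matplotlib and scipy; as before, the data file and column names are assumptions, and the model is refit so the snippet stands on its own.

```python
# A sketch of the residual diagnostics for the three-predictor model (assumed names).
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.formula.api as smf

album = pd.read_csv("album_sales.csv")             # hypothetical data file
fit3 = smf.ols("sales ~ adverts + airplay + attract", data=album).fit()

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].scatter(fit3.fittedvalues, fit3.resid)     # want a patternless cloud around zero
axes[0].axhline(0, color="gray")
axes[0].set(xlabel="Predicted values", ylabel="Residuals")
axes[1].hist(fit3.resid, bins=20)                  # unimodal and roughly symmetric?
axes[1].set(xlabel="Residuals")
stats.probplot(fit3.resid, plot=axes[2])           # normal probability plot
plt.tight_layout()
plt.show()
```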
  41. You are saving to buy a house or condo and you wonder how much it will cost. Depending on what you want and its location, the price will vary dramatically. We will examine data for 100 home sales in Gainesville, Florida. The variables are: selling price (in dollars), house size (in square feet), number of bedrooms, number of bathrooms, lot size (in square feet), annual real estate tax (in dollars), and whether the house is in the northwest (NW) quadrant of the city.
  42. Here is a view of the data:
  43. 1. Find the association with the selling price as the response variable and the house size and lot size as explanatory variables.
  44. 1. Find the association with the selling price as the response variable and the house size and lot size as explanatory variables. The regression equation is y = 53.8xhouse size + 2.84xlot size - 10536
  45. Find the association with the selling price as the response variable and the house size and region as explanatory variables The regression equation is y = 78.0xhouse size + 30569xNW - 15258
  46. Find the association with the selling price as the response variable and the house size and region as explanatory variables The regression equation is y = 78.0xhouse size + 30569xNW - 15258 What is the regression equation for homes not in the NW quadrant?
  47. Find the association with the selling price as the response variable and the house size and region as explanatory variables The regression equation is y = 78.0xhouse size + 30569xNW - 15258 What is the regression equation for homes not in the NW quadrant? y = 78.0xhouse size + 30569xNW - 15258 = 78.0xhouse size + 30569(0) - 15258 = 78.0xhouse size - 15258
  48. Find the association with the selling price as the response variable and the house size and region as explanatory variables The regression equation is y = 78.0xhouse size + 30569xNW - 15258 What is the regression equation for homes in the NW quadrant? y = 78.0xhouse size + 30569xNW - 15258 = 78.0xhouse size + 30569 - 15258 = 78.0xhouse size+ 15311
  50. Find the association with the selling price as the response variable and the house size and region as explanatory variables The regression equation is y = 78.0xhouse size + 30569xNW - 15258 What is the additional amount paid for a home in the desirable NW quadrant? 15311 - (- 15258) = 30569 Note that this is the slope of the indicator (xNW ) term.
  52. Is there interaction? The regression equation for homes in the NW quadrant is y = 78.0xhouse size + 15311. The regression equation for homes not in the NW quadrant is y = 78.0xhouse size - 15258. Since these two lines have the same slope, they are parallel: the effect of house size on the selling price is the same for both regions, so this model has no interaction between house size and region.
  53. 3. Suppose you wanted to use indicator variables for all of the quadrants. How should this be coded? How many indicator variables are needed?
  54. 3. Suppose you wanted to use indicator variables for all of the quadrants. How should this be coded? How many indicator variables are needed? If you want to use all four quadrants, you only need three indicator variables: Generally, a categorical explanatory variable can be expressed using one less indicator than the number of categories.
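In Python/statsmodels this coding happens automatically when a categorical variable is placed in the formula; the sketch below assumes a hypothetical houses.csv with columns price, size, and a four-level quadrant variable.

```python
# A sketch of indicator coding for a four-level quadrant variable (assumed names).
import pandas as pd
import statsmodels.formula.api as smf

houses = pd.read_csv("houses.csv")                 # hypothetical data file

# C(quadrant) expands the four categories into 4 - 1 = 3 indicator columns,
# with the remaining quadrant serving as the reference level.
fit = smf.ols("price ~ size + C(quadrant)", data=houses).fit()
print(fit.summary())
```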
  55. Here are the results when all three indicators are added to the model: Are these additional indicators significant in this model? What changes do you see?
  56. Here are the results when all three indicators are added to the model: Are these additional indicators significant in this model? The two new indicators are not significant. What changes do you see? NW is no longer significant either!