
Dummy Variables



  1. Dummy Variables • Some potential explanatory variables are categorical and cannot be measured on a quantitative scale. • However, we often need to use these variables because they are related to the response variable. • The trick is to create dummy variables, also called indicator or 0-1 variables. • These are variables that indicate the category a given observation is in.

  2. Dummy Variables -- continued • To create dummy variables we can use an IF statement or we can use StatPro’s Dummy variable procedure. • The Dummy variable procedure is usually easier particularly when there are multiple categories. • Once the dummy variables are created, we can combine the variables if we like by simply adding the columns to get the dummy for the new category.
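
The IF-statement approach can also be sketched outside Excel. The following is a minimal Python sketch, not the StatPro procedure: the education levels are made-up values rather than the bank data, and `make_dummies` is a hypothetical helper name introduced only for illustration.

```python
# Hypothetical education levels for six employees (not taken from the bank data).
educ_lev = [1, 3, 2, 3, 1, 2]

def make_dummies(values):
    """Return one 0-1 column per category, keyed by category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

dummies = make_dummies(educ_lev)

# Combine two categories by adding their dummy columns element by element.
ed_2_or_3 = [a + b for a, b in zip(dummies[2], dummies[3])]

print(dummies[1])   # indicator for category 1
print(ed_2_or_3)    # indicator for "category 2 or 3"
```

Adding columns works because each observation belongs to exactly one category, so the sum is still a 0-1 variable.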

  3. Regression Analysis • In this example we create dummy variables for Gender, and EducLev. • Then we can run a regression analysis with Salary as the response variable, using any combination of numerical and dummy explanatory variables. • We must follow two rules: • We shouldn’t use any of the original categorical variables that the dummies are based on. • We should use one less dummy than the number of categories for any categorical variable.

  4. Regression Analysis -- continued • This second rule is a technical one. If we violate it, the software will give us an error message. • For example, of the six dummies Ed_1-Ed_6, any five can be used. The omitted dummy then corresponds to the reference category. • As we will see, the interpretations of the dummy variable coefficients are all relative to this reference category. • To get used to dummy variables in regression analysis, we will proceed in several stages.
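
The reason behind the second rule can be illustrated with a small Python sketch. The job grades below are hypothetical and the variable names are ours. The point is that the full set of dummies always adds up to a column of 1s, which duplicates the constant term of the regression.

```python
# Hypothetical job grades for five employees (illustrative only).
job_grade = [1, 2, 3, 2, 1]
categories = sorted(set(job_grade))

# One 0-1 column per category.
columns = {c: [1 if g == c else 0 for g in job_grade] for c in categories}

# Summing every dummy column row by row reproduces the constant column of 1s,
# so the full set is perfectly collinear with the intercept.
row_sums = [sum(col[i] for col in columns.values()) for i in range(len(job_grade))]

# Dropping one category (here the lowest, as the reference) removes the redundancy.
reference = categories[0]
kept = {c: col for c, col in columns.items() if c != reference}

print(row_sums)
print(sorted(kept))
```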

  5. Regression Analysis -- continued • We first estimate a regression equation with only one variable. The output is shown in this table. The resulting equation is Predicted Salary = 45.505 - 8.296Female

  6. Regression Analysis -- continued • To interpret this equation, recall that Female has only two possible values, 0 and 1. If we substitute 1, the predicted salary equals 37.209, and if we substitute 0, the predicted salary is 45.505. • These are the average salaries of females and males. Therefore the interpretation of the -8.296 coefficient of the Female dummy variable is straightforward: it is the difference between the two averages.
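
The substitution can be checked with a few lines of Python. This is only a sketch built from the two predicted salaries quoted above; the function name `predicted_salary` is ours.

```python
# Group averages quoted in the text.
male_mean = 45.505
female_mean = 37.209

intercept = male_mean                      # prediction when Female = 0
female_coef = female_mean - male_mean      # shift applied when Female = 1

def predicted_salary(female):
    """Single-dummy regression equation: intercept plus shift for females."""
    return intercept + female_coef * female

print(round(female_coef, 3))
print(round(predicted_salary(0), 3), round(predicted_salary(1), 3))
```

With a single dummy as the only explanatory variable, the intercept is the average of the reference group and the dummy coefficient is the difference between the two group averages.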

  7. Regression Analysis -- continued • The above equation tells only part of the story because it ignores all information except gender. • We expand this equation by adding the experience variables. The output is shown in this table.

  8. Regression Analysis -- continued • The corresponding equation is Predicted Salary = 35.492 + 0.988YrsExper + 0.131YrsPrior - 8.080Female • It is useful to write two separate equations, one for females and one for males. For females (Female = 1): Predicted Salary = 27.412 + 0.988YrsExper + 0.131YrsPrior. For males (Female = 0): Predicted Salary = 35.492 + 0.988YrsExper + 0.131YrsPrior • We interpret the coefficient -8.080 of the Female dummy variable as the average salary disadvantage for females relative to males after controlling for job experience. But there is still more of the story to tell.
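
A short Python sketch shows how the two separate equations come out of the single one. It uses the coefficients from the two separate equations above; the employee profile (10 years at the bank, 2 years elsewhere) is a hypothetical example.

```python
# Coefficients as used in the two separate equations.
B0, B_EXPER, B_PRIOR, B_FEMALE = 35.492, 0.988, 0.131, -8.080

def predicted_salary(yrs_exper, yrs_prior, female):
    """Regression equation with two numerical variables and one dummy."""
    return B0 + B_EXPER * yrs_exper + B_PRIOR * yrs_prior + B_FEMALE * female

# Setting the dummy to 1 or 0 only shifts the intercept.
female_intercept = B0 + B_FEMALE
male_intercept = B0

# Hypothetical profile: the gap is the same for any experience values.
gap = predicted_salary(10, 2, 1) - predicted_salary(10, 2, 0)

print(round(female_intercept, 3), round(male_intercept, 3), round(gap, 3))
```

Because the dummy enters additively, the female and male lines are parallel: same slopes, different intercepts.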

  9. Regression Analysis -- continued • We next add job grade to the equation by including five of the six job grade dummies. Although any five can be used, we use Job_2-Job_6. The resulting output is shown in this table.

  10. Regression Analysis -- continued • The estimated regression equation is now Predicted Salary = 30.230 + 0.408YrsExper + 0.149YrsPrior - 1.962Female + 2.57Job_2 + 6.295Job_3 + 10.475Job_4 + 16.011Job_5 + 27.647Job_6 • There are now two categorical variables involved, gender and job grade. • However, we can still write a separate equation for any combination of categories by setting the dummies to the appropriate values.

  11. Regression Analysis -- continued • For example, the equation for females at the fifth job grade is found by setting Female=1 and Job_5=1 and setting the other job dummies equal to 0. The equation formed is Predicted Salary = 44.279 + 0.408YrsExper + 0.149YrsPrior • We interpret this equation as follows: • For either gender and any job grade, the expected increase in salary for one extra year of experience with Fifth National is $408; the expected salary increase for one year of experience with another bank is $149.
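
The intercept for any gender and job grade combination can be worked out the same way. The sketch below uses the coefficients from the estimated equation; the reference grade (Job_1) gets a shift of 0, and `group_intercept` is a name we introduce for illustration.

```python
INTERCEPT = 30.230
FEMALE = -1.962
# Job grade shifts relative to the reference grade 1 (omitted dummy).
JOB = {1: 0.0, 2: 2.57, 3: 6.295, 4: 10.475, 5: 16.011, 6: 27.647}

def group_intercept(female, job_grade):
    """Intercept of the equation for one gender and job grade combination."""
    return INTERCEPT + FEMALE * female + JOB[job_grade]

print(round(group_intercept(1, 5), 3))   # females at the fifth job grade
print(round(group_intercept(0, 1), 3))   # males at the reference grade
```

The slopes on YrsExper and YrsPrior are unchanged in every group; only the intercept moves.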

  12. Regression Analysis -- continued • The coefficients of the job dummies indicate the average increase in salary an employee can expect relative to the reference (lowest) job grade. • The key coefficient, the negative $1962 for females, indicates the average salary disadvantage for females relative to males, given that they have the same experience levels and are in the same job grade. • Although the “penalty” is still substantial, it is less than a fourth of the penalty we saw before. • It appears that females might be getting paid less on average partly because they are in the lower job categories.

  13. Regression Analysis -- continued • We can check whether females are disproportionately in the lower job categories by using a pivot table with JobGrade in the row area, Gender in the column area and the count (expressed as a percentage) of any variable in the data area.
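
The pivot table logic can be sketched in Python as a cross-tabulation. The records below are made up for illustration and are not the bank data, and we assume the percentages are taken within each gender column, so each column sums to 100.

```python
from collections import Counter

# Hypothetical (JobGrade, Gender) records; not the bank data.
records = [(1, "Female"), (1, "Female"), (1, "Male"), (2, "Female"),
           (2, "Male"), (3, "Male"), (3, "Male"), (3, "Female")]

counts = Counter(records)
gender_totals = Counter(gender for _, gender in records)

# Percentage of each gender that falls in each job grade (columns sum to 100).
pct = {(grade, gender): 100 * n / gender_totals[gender]
       for (grade, gender), n in counts.items()}

# One row per job grade, one column per gender.
for grade in sorted({g for g, _ in records}):
    row = [pct.get((grade, gender), 0.0) for gender in ("Female", "Male")]
    print(grade, [round(p, 1) for p in row])
```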

  14. Regression Analysis -- continued • Clearly, females tend to be concentrated at the lower job grades. • This certainly helps to explain why females get lower salaries on average, but it doesn’t explain why females are at the lower job grades in the first place. • We won’t be able to provide a thorough analysis of this issue but we can add one more piece to the puzzle now by adding education level, age, and PCJob to the equation.

  15. Regression Analysis -- continued • We don’t provide the whole equation but the resulting output is shown here.

  16. Regression Analysis -- continued • The coefficients can be seen in the output. • It doesn’t appear to add much to the previous equation. The “penalty” does, however, go up to $2555, which is slightly greater than the $1962. • At face value we can interpret the coefficients of the education dummies as a benefit (or loss if negative) of extra education relative to a high school diploma, the reference category.

  17. Regression Analysis -- continued • The coefficient of PCJob implies that an employee with a computer-related job can expect an extra $4923 in salary relative to an employee without a computer-related job, provided the other variables are the same for each employee. • The age coefficient is quite small and has little effect on salary.

  18. Conclusion • The main conclusion we can draw from the output is that there is still a plausible case to be made for discrimination against females, even after including information on all the variables in the database in the regression equation.

  19. Modeling Possibilities

  20. BANK.XLS • The Fifth National Bank of Springfield is facing a gender-discrimination suit. The charge is that its female employees receive substantially smaller salaries than its male employees. • The bank’s employee database is listed in this file. Here is a partial list of the data.

  21. Question • Earlier we estimated an equation for Salary using the numerical explanatory variables YrsExper and YrsPrior and the dummy variable Female. • If we drop the YrsPrior variable from the equation (for simplicity) and rerun the regression, we obtain the equation Predicted Salary = 35.824 + 0.981YrsExper - 8.012Female • The R² value for this equation is 49.1%. If we decide to include an interaction variable between YrsExper and Female in this equation, what is the effect?

  22. Interaction Terms • An interaction variable algebraically is the product of two variables. Its effect is to allow the effect of one of the variables on Y to depend on the value of the other variable. • The interaction term allows the slope of the regression line to differ between the two categories.

  23. Solution • We first need to form an interaction variable that is the product of YrsExper and Female. • This can be done two ways in Excel. • we can do it manually by introducing a new variable that contains the product of the two variables involved, or • we can use the StatPro/Data Utilities/Create Interaction Variable menu item. • Using the latter way we must select Female and YrsExper as the variables, and we do not check either of the boxes in the dialog box -- neither should be a categorical variable.

  24. Solution -- continued • Once the interaction variable has been created, we include it in the regression equation in addition to the other variables. The multiple regression output is shown here.

  25. Solution -- continued • The estimated regression equation is Predicted Salary = 30.430 + 1.528YrsExper + 4.098Female - 1.248YrsExper_Female • As we discussed before, it is useful to write this equation as two separate equations, one for females and one for males. The female equation is Predicted Salary = 34.528 + 0.280YrsExper and the male equation is Predicted Salary = 30.430 + 1.528YrsExper • Next we can show these equations graphically.
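
The effect of the interaction variable on the two lines can be sketched in Python. The Female coefficient is taken here as 4.098, which is an assumption on our part: it is the value implied by the stated female intercept (34.528 - 30.430). The `crossover` quantity is not in the slides; it is simply the experience level at which the two fitted lines meet.

```python
B0, B_EXPER, B_INTER = 30.430, 1.528, -1.248
B_FEMALE = 4.098  # assumption: value implied by the stated female intercept 34.528

def predicted_salary(yrs_exper, female):
    """Equation with a Female dummy and a YrsExper-by-Female interaction."""
    interaction = yrs_exper * female   # the product variable YrsExper_Female
    return B0 + B_EXPER * yrs_exper + B_FEMALE * female + B_INTER * interaction

# For females the dummy shifts the intercept and the interaction shifts the slope.
female_intercept = B0 + B_FEMALE
female_slope = B_EXPER + B_INTER

# Years of experience at which the female and male lines cross.
crossover = B_FEMALE / -B_INTER

print(round(female_intercept, 3), round(female_slope, 3), round(crossover, 2))
```

Unlike the additive dummy model, the two lines are no longer parallel: the dummy changes the intercept and the interaction term changes the slope.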

  26. Nonparallel Female and Male Salary Lines

  27. Solution -- continued • The Y-intercept for the female line is slightly higher - females with no experience at Fifth National Bank tend to start out slightly higher than males - but the slope of the female line is much lower. That is, males tend to move up the salary ladder much more quickly than females. • Again, this provides another argument, although a somewhat different one, for gender discrimination against females. • The R² value increased from 49.1% to 63.9%. The interaction variable has definitely added to the explanatory power of the equation.

  28. Modeling Possibilities

  29. BANK.XLS • The Fifth National Bank of Springfield is facing a gender-discrimination suit. The charge is that its female employees receive substantially smaller salaries than its male employees. • The bank’s employee database is listed in this file. Here is a partial list of the data.

  30. Question • A glance at the distribution of salaries of the 208 employees shows some skewness to the right - a few employees make substantially more than the majority of employees. • Therefore, it might make sense to use the natural logarithm of Salary instead of Salary as the response variable. • If we do this, how do we interpret the results?

  31. Solution • All of the analyses we did earlier with this data set could be repeated except with Log_Salary as the response variable. • For the sake of discussion we will look only at the regression equation with Female and YrsExper as explanatory variables. • After we create the Log_Salary variable and run the regression, we obtain the output shown here.

  32. Regression Output with Log_Salary as Response Variable

  33. Solution • The estimated regression equation is Predicted Log_Salary = 3.5829 + 0.0188YrsExper - 0.1616Female • The R² and se values are 42.4% and 0.1794. For comparison, with Salary as the response variable these were 49.1% and 8.070. • We first note that neither of these values is directly comparable to the corresponding Salary value. • The two R² values are percentages explained of different response variables, Log_Salary and Salary. The fact that one is smaller does not mean a “worse” fit. They simply aren’t comparable.

  34. Solution -- continued • The situation for se is even worse. Each se is a measure of a typical residual, but the residuals in the Log_Salary equation are in log dollars, whereas the residuals in the Salary equation are in dollars. • Therefore it is no surprise that the se for the Log_Salary equation is much smaller than the se for the Salary equation. • If we want comparable standard error measures for the two equations, we should take antilogs of the fitted values from the Log_Salary equation to convert them back to dollars, subtract these from the original Salary values, and take the standard deviation of these residuals.
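
The conversion described above can be sketched in Python. The salaries and fitted Log_Salary values below are hypothetical numbers chosen only to show the steps, and we use the sample standard deviation from the standard library as one reasonable reading of "standard deviation of these residuals".

```python
import math
import statistics

# Hypothetical salaries and fitted Log_Salary values (illustrative only).
salary = [32.0, 41.5, 38.0, 55.0, 47.2]
fitted_log_salary = [3.50, 3.70, 3.65, 3.95, 3.88]

# Antilog converts each fitted value from log dollars back to dollars.
fitted_salary = [math.exp(v) for v in fitted_log_salary]

# Residuals on the original dollar scale.
residuals = [y - f for y, f in zip(salary, fitted_salary)]

# Standard deviation of the dollar-scale residuals, comparable to the
# se of the Salary equation.
comparable_se = statistics.stdev(residuals)

print(round(comparable_se, 3))
```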

  35. Solution -- continued • The resulting standard deviation is 7.74. This is somewhat smaller than the se from the Salary equation, an indication of a slightly better fit. • Finally we interpret the equation itself. • When the response variable is Log_Y and a term on the right-hand side of the equation is of the form bX, then whenever X increases by one unit, Y-hat changes by a constant percentage, and this percentage is approximately equal to b (written as a percentage).

  36. Solution -- continued • This means that for each year of experience with Fifth National, an employee’s salary can be expected to increase by about 1.88%. • Similarly, the coefficient of Female implies an expected decrease in salary of about 16.16% for females. • In other words, this equation implies that females can expect to make about 16% less than men for comparable years of experience.
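
The "approximately equal to b" rule can be compared with the exact percentage change, which for a natural-log response is exp(b) - 1. The Python sketch below uses the two coefficients from the Log_Salary equation; the function name is ours.

```python
import math

B_EXPER = 0.0188
B_FEMALE = -0.1616

def exact_pct_change(b):
    """Exact percentage change in predicted Y for a one-unit increase in X."""
    return 100 * (math.exp(b) - 1)

# Approximation (b as a percentage) next to the exact value.
for name, b in (("YrsExper", B_EXPER), ("Female", B_FEMALE)):
    print(name, round(100 * b, 2), round(exact_pct_change(b), 2))
```

The approximation is very close for a small coefficient such as 0.0188 (roughly 1.9% exactly). For the larger Female coefficient the exact decrease is closer to 15% than to 16%, so the 16% figure should be read as a rough value.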

  37. Modeling Possibilities

  38. POWER.XLS • The Public Service Electric Company produces different quantities of electricity each month, depending on the demand. • This file lists the number of units of electricity produced (Units) and the total cost of producing these (Cost) for a 36-month period. • The data set appears on the next slide. • How can regression be used to analyze the relationship between Cost and Units?

  39. Data for Electric Power

  40. Solution • A good place to start is with a scatterplot of Cost versus Units.

  41. Solution -- continued • The scatterplot indicates a definite positive relationship and one that is nearly linear. • However, there is also some evidence of curvature in the plot. The points increase slightly less rapidly as Units increase from left to right. • In economic terms, there may be economies of scale, where the marginal cost of electricity decreases as more units of electricity are produced. • Nevertheless, we use regression to estimate a linear relationship between Cost and Units.

  42. Solution -- continued • The resulting regression equation is Predicted Cost = 23,651 + 30.53Units • The corresponding R² and se are 73.6% and $2734. We also requested a scatterplot of the residuals versus the fitted values. The scatterplot is on the next slide. Obtaining this scatterplot is always a good idea if nonlinearity is suspected. • The sign of nonlinearity in this plot is that the residuals to the far left and the far right are all negative, whereas the majority of the residuals in the middle are positive.
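
The negative-positive-negative residual pattern can be reproduced with a small Python sketch. The data here are not the electric power data: the Units values are hypothetical, and Cost is generated without noise from a concave curve purely to show what a straight-line fit does to such data.

```python
# Hypothetical Units values; Cost comes from a concave curve with no noise,
# purely to illustrate the residual pattern (not the electric power data).
units = [200 + 50 * i for i in range(13)]            # 200, 250, ..., 800
cost = [5793 + 98.3 * u - 0.06 * u ** 2 for u in units]

n = len(units)
mean_u = sum(units) / n
mean_c = sum(cost) / n

# Ordinary least squares for a single explanatory variable.
slope = (sum((u - mean_u) * (c - mean_c) for u, c in zip(units, cost))
         / sum((u - mean_u) ** 2 for u in units))
intercept = mean_c - slope * mean_u

# Residual = actual cost minus fitted cost from the straight line.
residuals = [c - (intercept + slope * u) for u, c in zip(units, cost)]
signs = ["+" if r > 0 else "-" for r in residuals]

print("".join(signs))
```

A straight line fitted to a curve that bends downward overshoots at both ends and undershoots in the middle, which is what the sign string shows.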

  43. Residuals from a Straight-Line Fit

  44. Solution -- continued • Admittedly the pattern is far from perfect - there are a few negatives in the middle - but the plot does hint at nonlinear behavior. • The negative-positive-negative behavior of the residuals suggests a parabola; that is, a quadratic equation with the square of Units included in the equation. • We first create a new variable Sqr_Units in the data set. This can be done manually or using StatPro’s Transform Variables menu item.

  45. Solution -- continued • Then we use multiple regression to estimate the equation for Cost with both explanatory variables, Units and Sqr_Units, included. • The resulting equation from the output on the next slide is Predicted Cost = 5793 + 98.3Units - 0.0600Sqr_Units • Note that R² has increased to 82.2% and se has decreased to $2281.

  46. Regression Output with Squared Term Included

  47. Solution -- continued • One way to see how this regression equation fits the scatterplot of Cost versus Units is to use Excel’s trendline option. • To do so, activate the scatterplot, click on any point, use the Chart/Add Trendline menu item, click the Type tab, and select the Polynomial type of order 2, that is, a quadratic. • A graph of the equation is superimposed on the scatterplot on the following slide. It shows a reasonably good fit, plus an obvious curvature.

  48. Quadratic Fit Scatterplot

  49. Solution -- continued • The main downside to a quadratic regression equation is that there is no easy interpretation of the coefficients of Units and Sqr_Units. • All we can say is that the terms in the equation combine to explain the nonlinear relationship between units produced and total cost. • A final note about the equation concerns the coefficient of Sqr_Units. • First, the fact that it is negative makes the parabola bend downward. This produces the decreasing marginal cost behavior, where every extra unit of electricity incurs a smaller cost.
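
The decreasing marginal cost can be made concrete by differentiating the fitted quadratic with respect to Units, which gives 98.3 - 2(0.0600)Units. The Python sketch below evaluates this at a few hypothetical production levels; the levels are our choice and are not taken from the data.

```python
B_UNITS = 98.3
B_SQR_UNITS = -0.0600

def marginal_cost(units):
    """Derivative of the fitted quadratic with respect to Units."""
    return B_UNITS + 2 * B_SQR_UNITS * units

# Hypothetical production levels, for illustration only.
for units in (200, 400, 600):
    print(units, round(marginal_cost(units), 1))
```

Because the coefficient of Sqr_Units is negative, the marginal cost falls steadily as Units increases.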

  50. Solution -- continued • Second, we shouldn’t be fooled by the small magnitude of this coefficient. Remember that it is the coefficient of Units squared, which is a large quantity. Therefore, the effect of the product -0.0600Sqr_Units is sizable. • One other possibility we might examine is a logarithmic fit. • In this case we create a new variable Log_Units, the natural logarithm of Units, and then regress Cost against the single variable Log_Units.
