Part 1: Regression Analysis Estimating Relationships
Preparing to Use Stat ToolsPharmex Drug Stores Pharmex.xls • Stat Tools is a part of the Decision Tools Suite • Open both Excel and Stat Tools. • Select StatTools + Data Set Manager • Select New • Highlight the portion of the spreadsheet that includes the data and select OK
Scatterplots: Graphing RelationshipsPharmex Drug Stores Pharmex.xls • Pharmex is a chain of drugstores that operates around the country. • The company has collected data from 50 randomly selected metropolitan regions. In each region it has collected data on its promotional expenditures and sales in the region over the past year. • There are two variables each of which are indexes, not dollar amounts. • Promote: Pharmex’s promotional expenditures as a percentage of those of the leading competitor. • Sales: Pharmex’s sales as a percentage of those of the leading competitor. • The company expects that there is a positive relationship between the two variables, so that regions with relatively more expenditures have relatively more sales. However, it is not clear what the nature of this relationship is.
Creating the ScatterplotPharmex Drug Stores Pharmex.xls • The tricky part is to decide which variable should be on the horizontal axis. • Select any data cell. • Select StatTools + Summary Graphs + Scatterplot… • In regression analysis, we always put the explanatory variable on the horizontal axis and the response variable on the vertical axis. In this example the store tends to believe that large promotional expenditures “cause” larger values of sales, so select “Sales” as the Y variable (the vertical axis).. • Select “Promote” as the X variable (the horizontal axis).
InterpretationPharmex Drug Stores • The scatterplot indicates that there is a positive relationship between Promote and Sales - the points tend to rise from bottom left to top right - but the relationship is not perfect. • The correlation of 0.673 is shown automatically on the plot. The important things to note about the correlation is that it is positive and its magnitude is moderately large. • Causation - we can never make definitive statements about causation based on regression analysis. Regression identifies only a statistical relationship, not a causal relationship
Simple Linear RegressionPharmex Drug Stores • The Pharmex scatterplot hints at a linear relationship between Promote and Sales. We want to draw the “best fitting” straight line through the points to quantify that linear relationship. • Since the relationship is not perfect, not all points lie exactly on the line. The differences are the residuals. They show how much the observed values differ from the fitted values. The fitted valueis the vertical distance from the horizontal axis to the line . • We decide to define “best fitting” line through the points in the scatterplot to be the one with the smallest sum of the squared residuals. This line is called the least squaresline • We now want to find the least squares line for the Pharmex drugstore data, using Sales as the response variable and Promote as the explanatory variable.
Least Squares Line with StatToolsPharmex Drug Stores • Select any data cell. • From the Menu bar, select : StatTools + Regression & Classification + Regression… • Specify that “Sales” is the response (dependent) variable. • Specify that “Promote” is the explanatory (independent) variable. • Select graph option: “Residuals vs Fitted values”
Regression Output TablePharmex Drug Stores • The “Constant” and “Promote” coefficients B18:C18 imply that the equation for the least squares line is: • Predicted Sales = 25.1264 + (0.7623 x Promote)
Least Square Line EquationPharmex Drug Stores We can interpret this equation as follows: • The slope 0.7623 indicates that the sales index tends to increase by about 0.76 for each unit increase in the promotional expenses index. • The interpretation of the intercept is less important. It is literally the predicted sales index for a region that does no promotions. The Scatterplot • A useful graph in almost any regression analysis is a scatterplot of residuals (on the vertical axis) versus fitted values. • We typically examine the scatterplot for striking patterns. A “good” fit not only has small residuals, but it has residuals scattered randomly around 0 with no apparent pattern. This is the case here.
Multiple RegressionBendrix Automotive Parts Company • The Bendrix Company manufactures various types of parts for automobiles. • The factory manager wants to get a better understanding of overhead costs, including supervision, indirect labor, supplies, payroll taxes, overtime premiums,depreciation, and a number of miscellaneous items such as insurance, utilities, and janitorial and maintenance expenses. • Some of the overhead costs are “fixed” in the sense they do not vary appreciably with the volume of work being done, whereas others are “variable” and do vary directly with the volume of work being done. It is not easy to draw a clear line between the fixed and variable overhead components. • The Bendrix manager has tracked total overhead costs for 36 months.
Explanatory VariablesBendrix Automotive Parts Company Bendrix.xls • The factory manager collected data on two variables he believes might be responsible for variations in overhead costs: • MachHrs: number of machine hours used during the month. • ProdRuns: the number of separate production runs during the month (Bendrix manufactures parts in fairly large batches called production runs. Between each run there is a downtime.). • Each observation (row) corresponds to a single month. • We need to estimate and interpret the equation for Overhead when both explanatory variables, MachHrs and ProdRuns, are included in the regression equation, but because these are time series variables we should also look out for relationships between these variables and the Month variable.
Multiple Regression with StatToolsBendrix Automotive Parts Company • Select StatTools + Regression & Classification + Regression… • Check “Overhead” as the response (dependent) variable. • Check “MachHrs” and ProdRuns” as the explanatory (independent) variables. • Select the Graph options in the dialog box as shown here.
Multiple Regression Output TableBendrix Automotive Parts Company • The coefficients in B18:B20 indicate that the estimated regression equation is Predicted Overhead = 3997 + (43.45 x MachHrs) + (883.62 x ProdRuns)
Interpretation of EquationBendrix Automotive Parts Company • If the number of production runs is held constant, then the overhead cost is expected to increase by $43.54 for each extra machine hour • If the number of machine hours is held constant, the overhead is expected to increase by $883.62 for each extra production run. • $3997 is the fixed component of overhead. • The slope terms involving MachHrs and ProdRuns are the variable components of overhead.
Equation ComparisonBendrix Automotive Parts Company • It is interesting to compare this equation with the separate equations: Predicted Overhead = 48,621 + 34.70(MachHrs) and Predicted Overhead = 75,606 + 655.07(ProdRuns) Predicted Overhead = 3,997 + 43.45 MachHrs + 883.62 ProdRuns • Note that both coefficients have increased. Also, the intercept is now lower than either intercept in the single variable equation. It is difficult to guess the changes that more explanatory variables will cause, but it is likely that changes will occur. • The reasoning for this is that when MachHrs is the only variable in the equation, we are obviously not holding ProdRuns constant - we are ignoring it - so in effect the coefficient 34.7 of MachHrs indicates the effect of MachHrs and the omitted ProdRuns on Overhead. • But when we include both variables, the coefficient of 43.5 of MachHrs indicates the effect of MachHrs only, holding ProdRuns constant. • Since the coefficients have different meanings, it is not surprising that we obtain different estimates.
Modeling PossibilitiesFifth National Bank Gender-Discrimination Suit • The Fifth National Bank of Springfield is facing a gender-discrimination suit. The charge is that its female employees receive substantially smaller salaries than its male employees. • The bank’s employee database is listed in this file. Here is a partial list of the data. Bank.xls
EducLev: education level with categories 1 (high school grad), 2 (some college), 3 (bachelor’s degree), 4 (some graduate courses) & 5 (graduate degree) JobGrade: current job level, the possible levels being from 1-6 (6 is highest) YrHired: year employee was hired Salary: current annual salary in thousands of dollars YrBorn: year employee was born Gender: a categorical variable with values “Female” and “Male” YrsPrior: number of years of work experience at another bank prior to working at Fifth National PCJob: a dummy variable with value 1 if the employee’s current job is computer-related and value 0 otherwise VariablesFifth National Bank Gender-Discrimination Suit For each of the 208 employees, the variables in the data set are: Do the data provide evidence that females are discriminated against in terms of salary?
Naïve ApproachFifth National Bank Gender-Discrimination Suit • A naïve approach to the problem is to compare the average salaries of the males and females. • The average of all salaries is $39,922, the average female salary is $37,210, and the average male salary is $45,505. • The difference between the averages is statistically different. The females are definitely earning less, but perhaps there is a reason. • The question is whether the differences between the average salaries is still evident after taking other attributes into account. A perfect task for regression.
Dummy VariablesFifth National Bank Gender-Discrimination Suit • Some potential explanatory variables are categorical and cannot be measured on a quantitative scale. However, we often need to use these variables because they are related to the response variable. • The trick is to create dummyvariables, also called indicatoror 0-1 variables, that indicate the category a given observation is in. • To create dummy variables we can use an IF statement or we can use StatTools’ Dummy variable procedure, which is usually easier particularly when there are multiple categories. • Once the dummy variables are created, we can combine the variables if we like by simply adding the columns to get the dummy for the new category.
Regression Analysis w/Dummy VariablesFifth National Bank Gender-Discrimination Suit • In this example we create dummy variables for Gender, and JobGrade. We also create another variable: YrsExper = 95 – YrHired (since this is 1995 data) • We must follow two rules: • We shouldn’t use any of the original categorical variables that the dummies are based on. • We should use one less dummy than the number of categories for any categorical variable. • Then we can run a regression analysis with Salary as the response variable, using any combination of numerical and dummy explanatory variables.
Creating Dummy VariablesGender Categorical Variable • To create a dummy variable called Female for Gender: • Select any data cell. • From the Menu bar, selectStatTools + Data Utilities • + Dummy… • Select “Gender”, as the variable • Select “Create One Dummy Variable for Each Distinct Category”. • Answer “Yes” to warnings. • Repeat the procedure for JobGrade.
Regression AnalysisGender Only • We first estimate a regression equation with Female as the only variable. The resulting equation is: Predicted Salary = 45.505 - 8.296Female • To interpret this equation recall that Female has only two possible values, 0 and 1. If we substitute 1 then the predicted salary equals 37.209 and if we substitute 0 the predicated salary is 45.505. • These are the average salaries of females and males. Therefore the interpretation of the -8.2955 coefficient of the Female dummy variable is straightforward. • The above equation only tells part of the story, it ignores all information except for gender.
Regression AnalysisGender + YrsExper + YrsPrior • We expand this equation by adding YrsExper and YrsPrior. • The corresponding equation is: Pred Salary = 35.492 + 0.988YrsExper + 0.131YrsPrior - 8.080Female • It is useful to write two separate equations, one for females: Predicted Salary = 27.412 + 0.988YrsExper + 0.131YrsPrior and one for males: Predicted Salary = 35.492 + 0.988YrsExper + 0.131YrsPrior • We interpret the coefficient -8.080 of the Female dummy variable as the average salary disadvantage for females relative to males after controlling for job experience. But there is still more story to tell.
Regression AnalysisGender + YrsExper + YrsPrior + JobGrade • We next add job grade to the equation by including five of the six job grade dummies. Although any five can be use we use Job_2 - Job_6. • The estimated regression equations is now: Predicted Salary = 30.230 + 0.408YrsExper + 0.149YrsPrior - 1.962Female + 2.575Job_2+ 6.295Job_3 + 10.475Job_4 +16.011Job_5 + 27.647Job_6 • There are now two categorical variables involved, gender and job grade. However, we can still write a separate equation for any combination of categories by setting the dummies to the appropriate values.
InterpretationGender + YrsExper + YrsPrior + JobGrade • The equation for females at the fifth job grade is found by setting Female=1, Job_5=1, & other job dummies equal to 0. PredictedSalary = 44.279 + 0.408YrsExper + 0.149YrsPrior • The expected salary increase for one extra year of experience is $408; the expected salary increase for one year experience with another bank is $149 (either gender and any job grade). • The coefficients of the job dummies indicate the average increase in salary an employee can expect relative to the reference (lowest) job grade. • The key coefficient, the negative $1962 for females indicates the average salary disadvantage for females relative to males, given that they have the same experience levels and are in the same job grade • The “penalty” is less than a fourth of the penalty we saw before. It appears that females might be getting paid less on average partly because they are in the lower job categories.
Pivot TableConcentration of Females in Lower Paid Jobs • We can use a pivot table to check whether females are disproportionately in the lower job categories (set JobGrade in the row area, Gender in the column area and the count (expressed as a percentage) of any variable in the data area). • Clearly, females tend to be concentrated at the lower job grades. • This helps explain why females get lower salaries on average, but doesn’t explain why females are at the lower job grades in the first place. • We won’t be able to provide a thorough analysis of this issue.
Conclusion • The main conclusion we can draw from the output is that there is still a plausible case to be made for discrimination against females, even after including information on all the variables in the database in the regression equation.
Interaction TermsFifth National Bank Gender-Discrimination Suit • An interaction variable algebraically is the product of two variables. Its effect is to allow the effect of one of the variables on Y to depend on the value of the other variable. • The interaction term allows the slope of the regression line to differ between the two categories. • Earlier we estimated an equation for Salary using the numerical explanatory variables YrsExper and YrsPrior and the dummy variable Female. • If we drop the YrsPrior variable from the equation (for simplicity) and rerun the regression, we obtain the equation Predicted Salary = 35.824 + 0.981YrsExper - 8.012Female • The R2 value for this equation is 49.1%. If we decide to include an interaction variable between YrsExper and Female in this equation, what is the effect?
Solution with Interaction TermsFifth National Bank Gender-Discrimination Suit • We first need to form an interaction variable that is the product of YrsExper and Female. • This can be done two ways in Excel. • Do it manually by introducing a new variable that contains the product of the two variables involved, or • Use: StatTools + Data Utilities + Interaction… • Using the latter way we must select Female and YrsExper as the variables. • Once the interaction variable has been created, we include it in the regression equation in addition to the other variables.
Interpretation w/ Interaction TermsFifth National Bank Gender-Discrimination Suit • The estimated regression equation is Predicted Salary = 30.430 + 1.528YrsExper + 4.098Female - 1.248YrsExper_Female • The female equation is: Pred Salary = 34.528 + 0.280YrsExper& the male equation is: Pred Salary = 30.430 + 1.528YrsExper • Graphically - Nonparallel Female and Male Salary Lines
Conclusion w/Interaction TermsFifth National Bank Gender-Discrimination Suit • The Y-intercept for the female line is slightly higher - females with no experience at Fifth National Bank tend to start out slightly higher than males - but the slope of the female line is much lower. That is, males tend to move up the salary ladder much more quickly than females. • Again, this provides another argument, although a somewhat different one, for gender discrimination against females. • The R2 value increased from 49.1% to 63.9%. The interaction variable has definitely added to the explanatory power of the equation.
Part 2: Regression Analysis Statistical Inference
Inference About Regression CoefficientsBendrix Automotive Parts Company Bendrix1.xls • As before, the response variable is Overhead and the explanatory variables are MachHrs and ProdRuns. • What inferences can we make about the regression coefficients? • We obtain the output from using StatTools
Multiple Regression OutputBendrix Automotive Parts Company • Regression coefficientsestimate the true, but unobservable, population coefficients. • The standard error of bi indicates the accuracy of these point estimates. • For example, the effect on Overhead of a one-unit increase in MachHrs is 43.536. We are 95% confident that the coefficient is between 36.234 to 50.839. Similar statements can be made for the coefficient of ProdRuns and the intercept term. Predicted Overhead = 3997 + 43.54MachHrs + 883.62ProdRuns
A Test for the Overall Fit:The ANOVA TableBendrix Automotive Parts Company • Does the ANOVA table for the Bendrix manufacturing data indicate that the combination MachHrs and ProdRuns has at least some ability to explain variation in Overhead? • The F-ratio is “off the charts” and the p-value is practically 0.
Interpretation of the ANOVA TableBendrix Automotive Parts Company • This information wouldn’t be much comfort for the Bendrix manager who is trying to understand the causes of variation in overhead costs. • This manager already knows that machine hours and production runs are related positively to overhead costs - everyone in the company knows that! • What he really wants to know is a set of explanatory variables that yields a high R2 and a low se. • The low p-value in the ANOVA table does not guarantee these. All it guarantees is that MachHrs and ProdRuns are of “some help” in explaining variation in Overhead.
Violations of Regression AssumptionsBendrix Automotive Parts Company • Is there evidence of non constant variance? • Is there any evidence of lag 1 autocorrelation in the Bendrix data when Overhead is regressed on MachHrs and ProdRuns? • Is there evidence of non Normality?
Do the Residuals HaveConstant Variance?Bendrix Automotive Parts Company • If the residual variance is not constant, the standard error of the regression coefficient, s(bi), is incorrect. • Note: when we ran the regression we selected “Residuals vs Fitted Values” graphs.
Plot of Residuals vs Fitted ValuesBendrix Automotive Parts Company • Residuals appear to have equal Variances (homoscedasticity)
Autocorrelated ResidualsBendrix Automotive Parts Company • The residuals of time series data are often autocorrelated. The most frequent type of autocorrelation is positive autocorrelation. For example, if residuals separated by 1 month are auto correlated, this is called lag 1 autocorrelation. • We use the fitted (col C) and residuals values (col D) In the “Regression” tab. The residuals represent how much the regression over-predicts (if negative) or under-predicts (if positive) the overhead cost for that month.
Durbin-Watson TestBendrix Automotive Parts Company • We can check for lag 1 autocorrelation in two ways, with the Durbin-Watson(DW) statistic and by examining the time series graph of the residuals. • The Durbin-Watson (DW) statistic is scaled between 0 and 4. • 2 - little lag 1 autocorrelation • < 2 - positive autocorrelation • > 2 – negative autocorrelation. • If n = 30 and bi’s 1-5, <1.2 is a problem) • We calculate the DW statistics in cell E45 with the formula: =StatDurbinWatson(D45:D80) Based on our guidelines for DW value 1.3131 suggests positive autocorrelation - it is less than 2 - but not enough to cause concern.
Time Series Graph of ResidualsBendrix Automotive Parts Company • This general conclusion is supported by the time series graph. • Add the range A44:D80 as a Data set • StatTools + Time Series & Forecasting + Time Series Graph • Select Residuals as the variable • Serious autocorrelation of lag 1 would tend to show long runs of residuals • alternating above and below the horizontal axis - positives would tend to • follow positives and negatives would tend to follow negatives. There is some • indication of this in the graph but not an excessive amount.
Are the ResidualsNormally Distributed?Bendrix Automotive Parts Company • The Inferences we want to make assume the residuals • are normally distributed. • Using Data Set #2 • Select: StatTools + Normality Tests • + Q-Q Normal Plot • Select “Residuals” as the variable • Check “Plot Using Standardized Q-Values” • and “Include Reference Line”
Normal Probability PlotBendrix Automotive Parts Company • Error terms appear to be Normally Distributed
MulticollinearityHeight vs Left & Right Feet • The relationship between the explanatory variable X and the response variable Y is not always accurately reflected in the coefficient of X; it depends on which other X’s are included or not included in the equation (especially when there is a linear relationship between two or more explanatory variables, in which case we have multicollinearity). • Multicollinearity is the presence of a fairly strong linear relationship between two or more explanatory variables, and it can make estimation difficult. • We want to explain a person’s height by means of foot length. The response variable is Height, and the explanatory variables are Right and Left, the length of the right foot and the left foot, respectively. • It is likely that there is a large correlation between height and foot size, so we would expect this regression equation to do a good job. The R2 value will probably be large. But what about the coefficients of Right and Left?
Correlation of Left & RightHeight vs Left & Right Feet Height.xls • To show what can happen numerically, we generated a hypothetical data set of heights and left and right foot lengths in this file. • We did this so that, except for random error, height is approximately 32 plus 3.2 times foot length (in inches). StatTools + Summary Statistics + Correlation & Covariance The correlations between Height and either Right or Left in our data set are quite large, and the correlation between Right and Left is very close to 1.
Multiple RegressionHeight vs Left & Right Feet • The Regression output tells a somewhat confusing story. • The multiple R and the corresponding R2 are about what we would expect, given the correlations between Height and either Right or Left. • In particular, the multiple R is close to the correlation between Height and either Right or Left. Also, the se value is quite good. It implies that predictions of height from this regression equation will typically be off by only about 2 inches. • However, the coefficients of Right and Left are not all what we might expect, given that we generated heights as approximately 32 plus 3.2 times foot length. • In fact, the coefficient of Left has the wrong sign - it is negative! • Besides this wrong sign, the tip-off that there is a problem is that the t-value of Left is quite small and the corresponding p-value is quite large.
Solution • Judging by this, we might conclude that Height and Left are either not related or are related negatively. But we know from the table of correlations that both of these are false. • In contrast, the coefficient of Right has the “correct” sign, and its t-value and associated p-value do imply statistical significance, at least at the 5% level. • However, this happened mostly by chance, slight changes in the data could change the results completely.
Solution • Although both Right and Left are clearly related to Height, it is impossible for the least squares method to distinguish their separate effects. • Note that the sum of the coefficients is 3.178 which is close to the coefficient of 3.2 we used to generate the data. Therefore, the estimated equation will work well for predicting heights, but does not provide reliable estimates of the coefficients of Right and Left. • When Right is only variable: Predicted Height = 31.546 + 3.195Right • The R2 = 81.6%, se = 2.005, the t-value = 21.34 and p-value = 0.000 for the coefficient of Right - very significant. • When Left is only variable: Predicted Height = 31.526 + 3.197Left • The R2 = 81.1%, and se = 2.033, the t-value = 20.99, and the p-value = 0.0000 for the coefficient of Left - again very significant. • Clearly, both of these equations tell almost identical stories, and they are much easier to interpret than the equation with both Right and Left included.