Hierarchical Multiple Regression

##### Presentation Transcript

1. Hierarchical Multiple Regression • Differences between hierarchical and standard multiple regression • Sample problem • Steps in hierarchical multiple regression • Homework problems

2. Differences between standard and hierarchical multiple regression • Standard multiple regression is used to evaluate the relationship between a set of independent variables and a dependent variable. • Hierarchical regression is used to evaluate the relationship between a set of independent variables and the dependent variable, controlling for or taking into account the impact of a different set of independent variables on the dependent variable. • For example, a research hypothesis might state that there are differences between the average salary for male employees and female employees, even after we take into account differences between education levels and prior work experience. • In hierarchical regression, the independent variables are entered into the analysis in a sequence of blocks, or groups that may contain one or more variables. In the example above, education and work experience would be entered in the first block and sex would be entered in the second block.

3. Differences in statistical results • SPSS shows the statistical results (Model Summary, ANOVA, Coefficients, etc.) as each block of variables is entered into the analysis. • In addition (if requested), SPSS prints and tests the key statistic used in evaluating the hierarchical hypothesis: change in R² for each additional block of variables. • The null hypothesis for the addition of each block of variables to the analysis is that the change in R² (contribution to the explanation of the variance in the dependent variable) is zero. • If the null hypothesis is rejected, then our interpretation indicates that the variables in block 2 had a relationship to the dependent variable, after controlling for the relationship of the block 1 variables to the dependent variable.
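
The test of the change in R² described above can be sketched in a few lines. This is a minimal illustration of the standard F test for an added block, not a reproduction of the SPSS output; the function name `r2_change_f` and the numbers passed to it are hypothetical.

```python
def r2_change_f(r2_block1, r2_block2, n_cases, k_total, k_added):
    """Return (delta_r2, F, df1, df2) for adding one block of variables.

    r2_block1 : R squared with only the earlier block(s) entered
    r2_block2 : R squared after the new block is added
    n_cases   : number of valid cases
    k_total   : number of independent variables in the larger model
    k_added   : number of variables in the block just added
    """
    delta_r2 = r2_block2 - r2_block1
    df1 = k_added
    df2 = n_cases - k_total - 1
    # F compares the variance gained per added variable with the
    # variance still unexplained per residual degree of freedom.
    f_change = (delta_r2 / df1) / ((1.0 - r2_block2) / df2)
    return delta_r2, f_change, df1, df2


# Hypothetical values for illustration only (not taken from the slides).
delta, f_change, df1, df2 = r2_change_f(0.10, 0.30, n_cases=103, k_total=3, k_added=1)
print(round(delta, 3), round(f_change, 2), df1, df2)
```

The null hypothesis that the change in R² is zero is rejected when this F value is large relative to an F distribution with `df1` and `df2` degrees of freedom; SPSS reports the corresponding probability as "Sig. F Change".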

4. Variations in hierarchical regression - 1 • A hierarchical regression can have as many blocks as there are independent variables, i.e. the analyst can state a hypothesis that specifies an exact order of entry for variables. • A more common hierarchical regression specifies two blocks of variables: a set of control variables entered in the first block and a set of predictor variables entered in the second block. • Control variables are often demographics which are thought to make a difference in scores on the dependent variable. Predictors are the variables whose effect is the real focus of our research question, but whose effect we want to separate from that of the control variables.

5. Variations in hierarchical regression - 2 • Support for a hierarchical hypothesis would be expected to require statistical significance for the addition of each block of variables. • However, many times, we want to exclude the effect of blocks of variables previously entered into the analysis, whether or not a previous block was statistically significant. The goal of the analysis is to obtain the best indicator of the effect of the predictor variables. The statistical significance of previously entered variables is not interpreted. • The latter strategy is the one that we will employ in our problems.

6. Differences in solving hierarchical regression problems • R² change, i.e. the increase when the predictor variables are added to the analysis, is interpreted rather than the overall R² for the model with all variables entered. • In the interpretation of individual relationships, the relationship between the predictors and the dependent variable is presented. • Similarly, in the validation analysis, we are only concerned with verifying the significance of the predictor variables. Differences in control variables are ignored.

7. A hierarchical regression problem The problem asks us to examine the feasibility of doing multiple regression to evaluate the relationships among these variables. The inclusion of the “controlling for” phrase indicates that this is a hierarchical multiple regression problem. Multiple regression is feasible if the dependent variable is metric and the independent variables (both predictors and controls) are metric or dichotomous, and the available data is sufficient to satisfy the sample size requirements.

8. Level of measurement - answer Hierarchical multiple regression requires that the dependent variable be metric and the independent variables be metric or dichotomous. "Spouse's highest academic degree" [spdeg] is ordinal, satisfying the metric level of measurement requirement for the dependent variable, if we follow the convention of treating ordinal level variables as metric. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation. "Age" [age] is interval, satisfying the metric or dichotomous level of measurement requirement for independent variables. "Highest academic degree" [degree] is ordinal, satisfying the metric or dichotomous level of measurement requirement for independent variables, if we follow the convention of treating ordinal level variables as metric. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation. "Sex" [sex] is dichotomous, satisfying the metric or dichotomous level of measurement requirement for independent variables. True with caution is the correct answer.

9. Sample size - question The second question asks about the sample size requirements for multiple regression. To answer this question, we will run the initial or baseline multiple regression to obtain some basic data about the problem and solution.

10. The baseline regression - 1 After we check for violations of assumptions and outliers, we will make a decision whether we should interpret the model that includes the transformed variables and omits outliers (the revised model), or whether we will interpret the model that uses the untransformed variables and includes all cases including the outliers (the baseline model). In order to make this decision, we run the baseline regression before we examine assumptions and outliers, and record the R² for the baseline model. If using transformations and omitting outliers substantially improves the analysis (an increase in R² of 2% or more), we interpret the revised model. If the increase is smaller, we interpret the baseline model. To run the baseline model, select Regression | Linear… from the Analyze menu.

11. The baseline regression - 2 First, move the dependent variable spdeg to the Dependent text box. Second, move the control variables age and sex to the Independent(s) list box. Third, select the method for entering the variables into the analysis from the drop down Method menu. In this example, we accept the default of Enter for direct entry of all variables in the first block, which will force the controls into the regression. Fourth, click on the Next button to tell SPSS to add another block of variables to the regression analysis.

12. The baseline regression - 3 SPSS identifies that we will now be adding variables to a second block. First, move the predictor independent variable degree to the Independent(s) list box for block 2. Second, click on the Statistics… button to specify the statistics options that we want.

13. The baseline regression - 4 First, mark the checkbox for Estimates on the Regression Coefficients panel. Second, mark the checkboxes for Model Fit, Descriptives, and R squared change. The R squared change statistic will tell us whether or not the variables added after the controls have a relationship to the dependent variable. Third, mark the Durbin-Watson statistic on the Residuals panel. Fourth, mark the Collinearity diagnostics to get tolerance values for testing multicollinearity. Fifth, click on the Continue button to close the dialog box.

14. The baseline regression - 5 Click on the OK button to request the regression output.

15. R² for the baseline model The R² of 0.281 is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers. Prior to any transformations of variables to satisfy the assumptions of multiple regression or the removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 28.1%. The relationship is statistically significant, though we would not stop if it were not significant because the lack of significance may be a consequence of violation of assumptions or the inclusion of outliers.

16. Sample size – evidence and answer Hierarchical multiple regression requires that the minimum ratio of valid cases to independent variables be at least 5 to 1. The ratio of valid cases (136) to number of independent variables (3) was 45.3 to 1, which was equal to or greater than the minimum ratio. The requirement for a minimum ratio of cases to independent variables was satisfied. In addition, the ratio of 45.3 to 1 satisfied the preferred ratio of 15 cases per independent variable. The answer to the question is true.
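
The sample size check above is simple arithmetic, sketched here with the figures from this slide. The function name `cases_per_variable` is only illustrative.

```python
def cases_per_variable(n_valid, n_independent):
    """Ratio of valid cases to independent variables."""
    return n_valid / n_independent


ratio = cases_per_variable(136, 3)   # 136 valid cases, 3 independent variables
meets_minimum = ratio >= 5           # minimum ratio of 5 to 1
meets_preferred = ratio >= 15        # preferred ratio of 15 to 1
print(f"{ratio:.1f} to 1", meets_minimum, meets_preferred)  # 45.3 to 1 True True
```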

17. Assumption of normality for the dependent variable - question Having satisfied the level of measurement and sample size requirements, we turn our attention to conformity with three of the assumptions of multiple regression: normality, linearity, and homoscedasticity. First, we will evaluate the assumption of normality for the dependent variable.

18. Run the script to test normality First, move the variables to the list boxes based on the role that the variable plays in the analysis and its level of measurement. Second, click on the Normality option button to request that SPSS produce the output needed to evaluate the assumption of normality. Third, mark the checkboxes for the transformations that we want to test in evaluating the assumption. Fourth, click on the OK button to produce the output.

19. Normality of the dependent variable: spouse’s highest degree The dependent variable "spouse's highest academic degree" [spdeg] did not satisfy the criteria for a normal distribution. The skewness of the distribution (0.573) was between -1.0 and +1.0, but the kurtosis of the distribution (-1.051) fell outside the range from -1.0 to +1.0. The answer to the question is false.
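
The rule of thumb used above (skewness and kurtosis both between -1.0 and +1.0) can be sketched as follows. The skewness and kurtosis formulas are the usual bias-adjusted sample versions that statistical packages commonly report; that the script uses exactly these formulas is an assumption, and the function names are hypothetical.

```python
import math


def sample_skew_kurtosis(values):
    """Bias-adjusted sample skewness and excess kurtosis (needs n >= 4)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    z = [(v - mean) / sd for v in values]
    skew = n / ((n - 1) * (n - 2)) * sum(t ** 3 for t in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(t ** 4 for t in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return skew, kurt


def within_rule_of_thumb(skew, kurt, limit=1.0):
    """True when both statistics fall between -limit and +limit."""
    return -limit <= skew <= limit and -limit <= kurt <= limit


# Values reported on this slide for spdeg: the kurtosis is out of range.
print(within_rule_of_thumb(0.573, -1.051))  # False
```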

20. Normality of the transformed dependent variable: spouse’s highest degree The "log of spouse's highest academic degree [LGSPDEG=LG10(1+SPDEG)]" satisfied the criteria for a normal distribution. The skewness of the distribution (-0.091) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.678) was between -1.0 and +1.0. The "log of spouse's highest academic degree [LGSPDEG=LG10(1+SPDEG)]" was substituted for "spouse's highest academic degree" [spdeg] in the analysis.
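
The transformation named in the variable label, LG10(1+SPDEG), is a base-10 logarithm of the value plus one. A minimal sketch follows; the degree codes are hypothetical and assumed to start at 0, which is why one is added before taking the log.

```python
import math


def lg10_plus_one(value):
    """Compute LG10(1 + value), the log transformation named on the slide."""
    return math.log10(1 + value)


# Hypothetical degree codes; adding 1 keeps a code of 0 from producing log(0).
degree_codes = [0, 1, 2, 3, 4]
lgspdeg = [round(lg10_plus_one(code), 3) for code in degree_codes]
print(lgspdeg)
```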

21. Normality of the control variable: age Next, we will evaluate the assumption of normality for the control variable, age.

22. Normality of the control variable: age The independent variable "age" [age] satisfied the criteria for a normal distribution. The skewness of the distribution (0.595) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.351) was between -1.0 and +1.0.

23. Normality of the predictor variable: highest academic degree Next, we will evaluate the assumption of normality for the predictor variable, highest academic degree.

24. Normality of the predictor variable: respondent’s highest academic degree The independent variable "highest academic degree" [degree] satisfied the criteria for a normal distribution. The skewness of the distribution (0.948) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.051) was between -1.0 and +1.0.

25. Assumption of linearity for spouse’s degree and respondent’s degree - question The metric independent variables satisfied the criteria for normality, but the dependent variable did not. However, the logarithmic transformation of "spouse's highest academic degree" produced a variable that was normally distributed and will be tested as a substitute in the analysis. The script for linearity will support our using the transformed dependent variable without having to add it to the data set.

26. Run the script to test linearity First, click on the Linearity option button to request that SPSS produce the output needed to evaluate the assumption of linearity. When the linearity option is selected, a default set of transformations to test is marked. Second, since we have decided to use the log transformation of the dependent variable, we mark the check box for the Logarithmic transformation and clear the check box for the Untransformed version of the dependent variable. Third, click on the OK button to produce the output.

27. Linearity test: spouse’s highest degree and respondent’s highest academic degree The correlation between "highest academic degree" and logarithmic transformation of "spouse's highest academic degree" was statistically significant (r=.519, p<0.001). A linear relationship exists between these variables.

28. Linearity test: spouse’s highest degree and respondent’s age The assessment of the linear relationship between logarithmic transformation of "spouse's highest academic degree" [LGSPDEG=LG10(1+SPDEG)] and "age" [age] indicated that the relationship was weak, rather than nonlinear. Neither the correlation between logarithmic transformation of "spouse's highest academic degree" and "age" nor the correlations with the transformations were statistically significant. The correlation between "age" and logarithmic transformation of "spouse's highest academic degree" was not statistically significant (r=.009, p=0.921). The correlations for the transformations were: the logarithmic transformation (r=.061, p=0.482); the square root transformation (r=.034, p=0.692); the inverse transformation (r=.112, p=0.194); and the square transformation (r=-.037, p=0.668).
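
The comparison above, correlating the dependent variable with the independent variable and with several transformations of it, can be sketched like this. The data are hypothetical, and the plain transformation formulas are an assumption; the script may shift or reflect values before transforming.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Transformations of the independent variable (assumes positive values).
TRANSFORMS = {
    "untransformed": lambda v: v,
    "logarithmic": math.log10,
    "square root": math.sqrt,
    "inverse": lambda v: 1.0 / v,
    "square": lambda v: v * v,
}


def linearity_table(xs, ys):
    """Correlation of ys with each transformation of xs."""
    return {name: pearson_r([f(x) for x in xs], ys) for name, f in TRANSFORMS.items()}


# Hypothetical ages and log-transformed spouse's degree values.
ages = [23, 31, 38, 45, 52, 60, 67, 74]
lgspdeg = [0.30, 0.48, 0.00, 0.60, 0.30, 0.48, 0.70, 0.30]
for name, r in linearity_table(ages, lgspdeg).items():
    print(f"{name:14s} r = {r:+.3f}")
```

If none of the correlations is statistically significant, as on this slide, the relationship is read as weak rather than nonlinear, and no transformation is adopted.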

29. Assumption of homogeneity of variance - question Sex is the only dichotomous independent variable in the analysis. We will test it for homogeneity of variance using the logarithmic transformation of the dependent variable, which we have already decided to use.

30. Run the script to test homogeneity of variance First, click on the Homogeneity of variance option button to request that SPSS produce the output needed to evaluate the assumption of homogeneity of variance. When the homogeneity of variance option is selected, a default set of transformations to test is marked. Second, since we have decided to use the log transformation of the dependent variable, we mark the check box for the Logarithmic transformation and clear the check box for the Untransformed version of the dependent variable. Third, click on the OK button to produce the output.

31. Assumption of homogeneity of variance – evidence and answer Based on the Levene Test, the variance in "log of spouse's highest academic degree [LGSPDEG=LG10(1+SPDEG)]" was homogeneous for the categories of "sex" [sex]. The probability associated with the Levene statistic (0.687) was p=0.409, greater than the level of significance for testing assumptions (0.01). The null hypothesis that the group variances were equal was not rejected. The homogeneity of variance assumption was satisfied. The answer to the question is true.
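
For reference, the Levene statistic itself can be computed by hand. The sketch below uses the mean-centered form of the test and hypothetical data; whether the script uses this exact variant is an assumption. The decision rule at the end uses the probability and significance level quoted on this slide.

```python
def levene_statistic(groups):
    """Mean-centered Levene statistic W for a list of groups of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each value from its own group mean.
    deviations = []
    for g in groups:
        group_mean = sum(g) / len(g)
        deviations.append([abs(v - group_mean) for v in g])
    dev_means = [sum(d) / len(d) for d in deviations]
    grand_mean = sum(sum(d) for d in deviations) / n_total
    between = sum(len(d) * (m - grand_mean) ** 2 for d, m in zip(deviations, dev_means))
    within = sum((z - m) ** 2 for d, m in zip(deviations, dev_means) for z in d)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical values of the dependent variable for two categories of sex.
males = [0.30, 0.48, 0.00, 0.60, 0.30]
females = [0.48, 0.30, 0.70, 0.00, 0.48]
print(round(levene_statistic([males, females]), 3))

# Decision rule with the figures reported on the slide.
p_value, alpha = 0.409, 0.01
print("homogeneous" if p_value > alpha else "not homogeneous")  # homogeneous
```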

32. Including the transformed variable in the data set - 1 In the evaluation for normality, we resolved a problem with normality for spouse’s highest academic degree with a logarithmic transformation. We need to add this transformed variable to the data set, so that we can incorporate it in our detection of outliers. We can use the script to compute transformed variables and add them to the data set. We select an assumption to test (Normality is the easiest), mark the check box for the transformation we want to retain, and clear the check box "Delete variables created in this analysis." NOTE: this will leave the transformed variable in the data set. To remove it, you can delete the column or close the data set without saving.

33. Including the transformed variable in the data set - 2 First, move the variable SPDEG to the list box for the dependent variable. Second, click on the Normality option button to request that SPSS do the test for normality, including the transformation we will mark. Third, mark the transformation we want to retain (Logarithmic) and clear the checkboxes for the other transformations. Fourth, clear the check box for the option "Delete variables created in this analysis". Fifth, click on the OK button.

34. Including the transformed variable in the data set - 3 If we scroll to the rightmost column in the data editor, we see that the log of SPDEG is included in the data set.

35. Including the transformed variable in the list of variables in the script - 1 If we scroll to the bottom of the list of variables, we see that the log of SPDEG is not included in the list of available variables. To tell the script to add the log of SPDEG to the list of variables in the script, click on the Reset button. This will start the script over again, with a new list of variables from the data set.

36. Including the transformed variable in the list of variables in the script - 2 If we scroll to the bottom of the list of variables now, we see that the log of SPDEG is included in the list of available variables.

37. Detection of outliers - question In multiple regression, an outlier in the solution can be defined as a case that has a large residual because the equation did a poor job of predicting its value. We will run the regression again incorporating any transformations we have decided to test, and have SPSS compute the standardized residual for each case. Cases with a standardized residual greater than 3.0 in absolute value (beyond +/- 3.0) will be treated as outliers.
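
The outlier rule above can be sketched as follows, assuming a standardized residual is the raw residual divided by the standard error of the estimate. The function names and the residuals are hypothetical.

```python
import math


def standardized_residuals(residuals, n_predictors):
    """Divide each residual by the standard error of the estimate."""
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    std_error = math.sqrt(sse / (n - n_predictors - 1))
    return [e / std_error for e in residuals]


def find_outliers(residuals, n_predictors, cutoff=3.0):
    """Return (case index, standardized residual) pairs beyond the cutoff."""
    z = standardized_residuals(residuals, n_predictors)
    return [(i, zi) for i, zi in enumerate(z) if abs(zi) > cutoff]


# Hypothetical residuals: thirty well-predicted cases and one poorly predicted case.
residuals = [0.5, -0.5] * 15 + [6.0]
for index, z in find_outliers(residuals, n_predictors=3):
    print(f"case {index}: standardized residual {z:.2f}")
```

An empty result corresponds to the situation in which SPSS prints no Casewise Diagnostics table.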

38. The revised regression using transformations To run the regression to detect outliers, select the Linear Regression command from the menu that drops down when you click on the Dialog Recall button.

39. The revised regression: substituting transformed variables Remove the variable SPDEG from the Dependent text box. Include the log of the variable, LGSPDEG, in its place. Click on the Statistics… button to select statistics we will need for the analysis.

40. The revised regression: selecting statistics First, mark the checkbox for Estimates on the Regression Coefficients panel. Second, mark the checkboxes for Model Fit, Descriptives, and R squared change. The R squared change statistic will tell us whether or not the variables added after the controls have a relationship to the dependent variable. Third, mark the Durbin-Watson statistic on the Residuals panel. Fourth, mark the checkbox for the Casewise diagnostics, which will be used to identify outliers. Fifth, mark the Collinearity diagnostics to get tolerance values for testing multicollinearity. Sixth, click on the Continue button to close the dialog box.

41. The revised regression: saving standardized residuals Mark the checkbox for Standardized Residuals so that SPSS saves a new variable in the data editor. We will use this variable to omit outliers in the revised regression model. Click on the Continue button to close the dialog box.

42. The revised regression: obtaining output Click on the OK button to obtain the output for the revised model.

43. Outliers in the analysis If cases have a standardized residual beyond +/- 3.0, SPSS creates a table titled Casewise Diagnostics, in which it lists the cases and the values that result in their being outliers. If there are no outliers, SPSS does not print the Casewise Diagnostics table. There was no table for this problem. The answer to the question is true. We can verify that all standardized residuals were within +/- 3.0 by looking at the minimum and maximum standardized residuals in the table of Residual Statistics. Both the minimum and maximum fell in the acceptable range. Since there were no outliers, we can use the regression just completed to make our decision about which model to interpret.

44. Selecting the model to interpret - question Since there were no outliers, we can use the regression just completed to make our decision about which model to interpret. If the R² for the revised model is higher by 2% or more, we will base our interpretation on the revised model; otherwise, we will interpret the baseline model.

45. Selecting the model to interpret – evidence and answer Prior to any transformations of variables to satisfy the assumptions of multiple regression and the removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 28.1%. After substituting transformed variables, the proportion of variance in the dependent variable explained by the independent variables (R²) was 27.1%. Since the revised regression model did not explain at least two percent more variance than explained by the baseline regression analysis, the baseline regression model with all cases and the original form of all variables should be used for the interpretation. The transformations used to satisfy the assumptions will not be used, so cautions should be added for the assumptions violated. False is the correct answer to the question.
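
The decision rule above reduces to a comparison of the two R² values. A minimal sketch with the figures from this slide follows; the function name `choose_model` is hypothetical.

```python
def choose_model(r2_baseline, r2_revised, min_gain=0.02):
    """Pick the revised model only if it explains at least min_gain more variance."""
    # Rounding guards against floating-point noise right at the threshold.
    gain = round(r2_revised - r2_baseline, 6)
    return "revised" if gain >= min_gain else "baseline"


# R squared was 28.1% for the baseline model and 27.1% for the revised model.
print(choose_model(0.281, 0.271))  # baseline
```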

46. Re-running the baseline regression - 1 Having decided to use the baseline model for the interpretation of this analysis, the SPSS regression output was re-created. To run the baseline regression again, select the Linear Regression command from the menu that drops down when you click on the Dialog Recall button.

47. Re-running the baseline regression - 2 Remove the transformed variable lgspdeg from the dependent variable textbox and add the variable spdeg. Click on the Save button to remove the request to save standardized residuals to the data editor.

48. Re-running the baseline regression - 3 Clear the checkbox for Standardized Residuals so that SPSS does not save a new set of them in the data editor when it runs the new regression. Click on the Continue button to close the dialog box.

49. Re-running the baseline regression - 4 Click on the OK button to request the regression output.

50. Assumption of independence of errors - question We can now check the assumption of independence of errors for the analysis we will interpret.
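
The Durbin-Watson statistic requested earlier on the Residuals panel is the usual check for this assumption. Its standard formula is sketched below with hypothetical residuals; the residuals must be in the order of the cases in the data set.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residuals in case order.
residuals = [0.4, -0.2, 0.1, -0.5, 0.3, 0.2, -0.4, 0.1]
print(round(durbin_watson(residuals), 3))
```

The statistic ranges from 0 to 4, and values near 2 indicate little first-order serial correlation among the residuals.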