Hierarchical Multiple Regression
Differences between standard and hierarchical multiple regression • Standard multiple regression is used to evaluate the relationship between a set of independent variables and a dependent variable. • Hierarchical regression is used to evaluate the relationship between a set of independent variables (predictors) and the dependent variable, controlling for or taking into account the impact of a different set of independent variables (control variables) on the dependent variable. • For example, a research hypothesis might state that there are differences in the average salary between male employees and female employees, even after we take into account differences in education levels and prior work experience. • In hierarchical regression, the independent variables are entered into the analysis in a sequence of blocks, or groups, each of which may contain one or more variables. In the example above, education and work experience would be entered in the first block and sex would be entered in the second block. • Generally, our interest is in R² change, i.e. the increase in R² when the predictor variables are added to the analysis, rather than in the overall R² for the model that includes both controls and predictors. • Moreover, the interpretation of individual relationships may focus on the relationship between the predictors and the dependent variable, and ignore the significance and interpretation of control variables. However, in our problems, we will interpret both controls and predictors.
Differences in statistical results • SPSS shows the statistical results (Model Summary, ANOVA, Coefficients, etc.) as each block of variables is entered into the analysis. • In addition (if requested), SPSS prints and tests the key statistic used in evaluating the hierarchical hypothesis: the change in R² for each additional block of variables. • The null hypothesis for the addition of each block of variables to the analysis is that the change in R² (the contribution to the explanation of the variance in the dependent variable) is zero. • If the null hypothesis is rejected, our interpretation indicates that the variables in block 2 had a relationship to the dependent variable after controlling for the relationship of the block 1 variables to the dependent variable, i.e. the variables in block 2 explain something about the dependent variable that was not explained in block 1. • The key statistic in hierarchical regression is R² change (the increase in R² when the predictor variables are added to the model that includes only the control variables). If R² change is significant, the R² for the overall model that includes both controls and predictors will usually be significant as well, since R² change is part of the overall R².
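For readers who want to see the computation behind the R² change test outside of SPSS, here is a minimal sketch in Python using pandas, statsmodels, and scipy. It is not part of the SPSS workflow in these slides; the data frame `df` and the column names from the salary example are assumptions made purely for illustration.

```python
# Sketch: R-squared change test for a two-block hierarchical regression.
# Assumes a pandas DataFrame `df` with hypothetical columns named after the
# salary example in the text (salary, education, experience, sex_male).
import pandas as pd
import statsmodels.api as sm
from scipy import stats

def r2_change_test(df, dv, block1, block2):
    """F test for the increase in R-squared when block2 is added after block1."""
    data = df[[dv] + block1 + block2].dropna()   # both models must use the same cases
    y = data[dv]
    X1 = sm.add_constant(data[block1])            # controls only
    X2 = sm.add_constant(data[block1 + block2])   # controls + predictors
    m1 = sm.OLS(y, X1).fit()
    m2 = sm.OLS(y, X2).fit()

    r2_change = m2.rsquared - m1.rsquared
    df_num = len(block2)                          # predictors added in block 2
    df_den = m2.df_resid                          # residual df of the full model
    f_change = (r2_change / df_num) / ((1 - m2.rsquared) / df_den)
    p_value = stats.f.sf(f_change, df_num, df_den)
    return r2_change, f_change, p_value

# Hypothetical usage with the salary illustration:
# r2, f, p = r2_change_test(df, 'salary', ['education', 'experience'], ['sex_male'])
```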
Variations in hierarchical regression • A hierarchical regression can have as many blocks as there are independent variables, i.e. the analyst can test a hypothesis that specifies an exact order of entry for the variables. • A more common hierarchical regression specifies two blocks of variables: a set of control variables entered in the first block and a set of predictor variables entered in the second block. • Control variables are often demographics which are thought to make a difference in scores on the dependent variable. Predictors are the variables whose effect our research question is really about, but whose contribution we want to separate from that of the control variables. • Hierarchical regression specifies the order in which the variables are added to the regression analysis. However, once the regression analysis includes all of the independent variables, each variable will have the same coefficient and significance regardless of whether the variables were entered simultaneously or sequentially.
Confusion in Terminology over the Meaning of the Designation of a Control Variable • Control variables are entered in the first block in a hierarchical regression, and we describe this as “controlling for the effect of some variables…” Sometimes, we mistakenly interpret this to imply that the shared effects between a predictor and the control variables are removed from the explanation of the dependent variable credited to that predictor. • In fact, the coefficient for each variable is computed to control for all other variables in the analysis, whether the variable is designated as a control or a predictor variable, and regardless of when it was entered into the equation. • In a general sense, each variable in a regression analysis is credited only with what it uniquely explains, and the variance it shares with all other variables is removed.
Research Questions for which Hierarchical Regression is Useful • Previous research has found that variables a, b, and c have statistically significant relationships to the dependent variable z, both collectively and individually. You believe that variables d and e should also be included and, in fact, would substantially increase the proportion of variability in z that we can explain. You treat a, b, and c as controls and d and e as predictors in a hierarchical regression. • You have found that variables a and b account for a substantial proportion of the differences in the dependent variable z. In presenting your findings, someone asserts that while a and b may have a relationship to z, differences in z are better understood through the relationship that variables d and e have with z. You treat d and e as controls to show that the predictors a and b have a significant relationship to z that is not explicable by d and e. • You find that there are significant differences in the demographic characteristics (a, b, and c) of the subjects in your control and treatment groups (d). To isolate the relationship between group membership (d) and the treatment effect (z), you do a hierarchical regression with a, b, and c as controls and d as the predictor.
The Problem in Blackboard • The problem statement tells us: • the variables included in the analysis • whether each variable should be treated as metric or non-metric • the reference category for non-metric variables to be dummy-coded • the alpha for both the statistical relationships and the diagnostic tests NOTE: these problems use a data set, GSS2002_PrejudiceAndAltruism.SAV, that we have not used before.
The Statement about Level of Measurement The first statement in the problem asks about level of measurement. Hierarchical multiple regression requires that the dependent variable and the metric independent variables be interval level, and that the non-metric independent variables be dummy-coded if they are not dichotomous. The only way we would violate the level of measurement requirement would be to use a nominal variable as the dependent variable, or to attempt to dummy-code an interval level variable that was not grouped.
Marking the Statement about Level of Measurement • Mark the check box as a correct statement because: • "Accuracy of the description of being a pretty soft-hearted person" [empathy7] is ordinal level, but the problem calls for treating it as metric, applying the common convention of treating ordinal variables as interval level. • "Degree of religious fundamentalism" [fund] is ordinal level, but the problem calls for treating it as metric, applying the common convention of treating ordinal variables as interval level. • "Description of political views" [polviews] is ordinal level, but the problem calls for treating it as metric, applying the common convention of treating ordinal variables as interval level. • The non-metric independent variable "sex" [sex] is dichotomous, satisfying the requirement for independent variables. • The non-metric independent variable "race of the household" [hhrace] is nominal level, but will satisfy the requirement for independent variables when dummy-coded.
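The dummy coding that the script performs later can be illustrated outside SPSS with a short Python/pandas sketch. The data frame `df` and the numeric category codes (1 = male for sex; 1 = white, 2 = black for hhrace) are assumptions about the GSS coding; the reference categories named in the problem (2 = FEMALE, 3 = OTHER) receive no indicator.

```python
# Sketch: creating indicator (dummy) variables with the designated reference categories.
# `df` and the numeric category codes are assumptions; missing values are not handled here.
import pandas as pd

# sex: 2 = FEMALE is the reference category, so only an indicator for males is created.
df['sex_1'] = (df['sex'] == 1).astype(int)        # 1 = male respondent, 0 = otherwise

# hhrace: 3 = OTHER is the reference category, so indicators remain for white and black.
df['hhrace_1'] = (df['hhrace'] == 1).astype(int)  # 1 = white household, 0 = otherwise
df['hhrace_2'] = (df['hhrace'] == 2).astype(int)  # 1 = black household, 0 = otherwise
```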
Satisfying the Assumptions of Multiple Regression The next four statements identify the strategies that we will follow to attempt to satisfy the assumptions of multiple regression. If we fail to satisfy the requirement for independence of variables, we halt interpretation of the problem. Once we satisfy the assumptions, we skip the remaining strategies. If we do not satisfy the assumptions with any of the strategies, we will continue to interpret the analysis with all cases and untransformed variables, but we should mention the failure to satisfy assumptions in our findings.
Using the Script to Evaluate Assumptions We will use the new script to evaluate regression assumptions for analyses that include metric and non-metric variables. Assuming that you have downloaded the script from the course web site to My Documents, select the Run Script command from the Utilities menu.
Opening the Script In the Run Script dialog box, navigate to My Documents and highlight the file named: SatisfyingRegressionAssumptionsWithMetricAndNonmetricVariables.SBS With the script file highlighted, click on the Run button to use the script.
Select the Variables First, move the dependent variable "accuracy of the description of being a pretty soft-hearted person" [empathy7] to the DV text box. Second, move the variables "degree of religious fundamentalism" [fund] and "description of political views" [polviews] to the metric independent variables list box. Third, move the variables "sex" [sex] and "race of the household" [hhrace] to the non-metric independent variables list box.
Select the Reference Category for Sex With the variable "sex" [sex] highlighted, select 2=FEMALE as the reference category.
Select the Reference Category for Race With the variable "race of the household" [hhrace] highlighted, select 3=OTHER as the reference category.
Request the Tests of Assumptions for Multiple Regression Having included all of the variables mentioned in the problem and identified the reference categories for non-metric variables, we click on the OK button to request the output.
The Variables Included in the Analysis We can look at the variables listed in the table of Descriptive Statistics to make certain we have included the correct variables. The list includes three dummy-coded variables for sex and race. The list includes two independent variables treated as metric.
Evaluating the Assumption of Independence of Variables • The tolerance values for all of the independent variables are larger than 0.10: • "degree of religious fundamentalism" [fund] (0.876), • "description of political views" [polviews] (0.952), • "survey respondents who were male" [sex_1] (0.976), • "survey respondents who were white" [hhrace_1] (0.970) and • "survey respondents who were black" [hhrace_2] (0.896). Multicollinearity is not a problem in this regression analysis.
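Tolerance is the proportion of a predictor's variance that is not explained by the other predictors (the reciprocal of the variance inflation factor). As a hedged illustration outside SPSS, the tolerances could be computed in Python with statsmodels; `df` and the dummy-coded column names follow the earlier sketch and are assumptions.

```python
# Sketch: tolerance (1/VIF) for each independent variable in the analysis.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = ['fund', 'polviews', 'sex_1', 'hhrace_1', 'hhrace_2']
X = sm.add_constant(df[predictors].dropna())      # constant in column 0

for i, name in enumerate(predictors, start=1):    # skip the constant
    vif = variance_inflation_factor(X.values, i)
    tolerance = 1.0 / vif
    print(f"{name}: tolerance = {tolerance:.3f}")  # values > 0.10 suggest no multicollinearity problem
```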
Evaluating the Assumption of Linearity In the lack of fit test, the probability of the F test statistic (F=1.66) was p = .004, less than or equal to the alpha level of significance of 0.01. The null hypothesis that "a linear regression model is appropriate" is rejected. The research hypothesis that "a linear regression model is not appropriate" is supported by this test. The assumption of linearity is violated.
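For reference, the classical lack-of-fit F test compares the variation of the observed responses around the fitted model to the pure error within groups of cases that share the same predictor values; whether the script uses exactly this pure-error form is an assumption. In LaTeX notation:

$$F = \frac{SS_{LF}/(c - p)}{SS_{PE}/(n - c)}$$

where $SS_{LF}$ is the lack-of-fit sum of squares, $SS_{PE}$ is the pure-error sum of squares, $c$ is the number of distinct combinations of predictor values, $p$ is the number of regression parameters, and $n$ is the number of cases.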
Evaluating the Assumption of Homoscedasticity The homogeneity of error variance is tested with the Breusch-Pagan test. For this analysis, the Breusch-Pagan statistic was 17.643. The probability of the statistic was p = .003, which was less than or equal to the alpha level for diagnostic tests (0.010). The null hypothesis that "the variance of the residuals is the same for all values of the independent variable" is rejected. The research hypothesis that "the variance of the residuals is different for some values of the independent variable" is supported. The assumption of homogeneity of error variance is violated.
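As an illustration outside SPSS, statsmodels provides a Breusch-Pagan test. The sketch below assumes an already-fitted OLS results object `model` (for example, the full model from the earlier sketch); the statsmodels version is the Lagrange-multiplier form, so its value may not match the script's statistic exactly.

```python
# Sketch: Breusch-Pagan test of the residuals from a fitted OLS model `model` (assumed).
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan LM = {lm_stat:.3f}, p = {lm_pvalue:.3f}")
# A p-value at or below the 0.010 alpha for diagnostic tests would indicate heteroscedasticity.
```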
Evaluating the Assumption of Normality Regression analysis assumes that the errors or residuals are normally distributed. The Shapiro-Wilk test of studentized residuals yielded a statistical value of 0.868, which had a probability of p < .001, which was less than or equal to the alpha level for diagnostic tests (0.010). The null hypothesis that "the distribution of the residuals is normally distributed" is rejected. The research hypothesis that "the distribution of the residuals is not normally distributed" is supported. The assumption of normality of errors is violated.
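The same check can be sketched in Python: studentized residuals are taken from a fitted OLS results object `model` (an assumption, as above) and passed to scipy's Shapiro-Wilk test.

```python
# Sketch: Shapiro-Wilk test of the studentized residuals from a fitted model `model`.
from scipy import stats

studentized = model.get_influence().resid_studentized_internal
w_stat, p_value = stats.shapiro(studentized)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.4f}")
# p <= 0.010 would lead us to reject normality of the residuals.
```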
Evaluating the Assumption of Independence of Errors Regression analysis assumes that the errors (residuals) are independent and there is no serial correlation. No serial correlation implies that the size of the residual for one case has no impact on the size of the residual for the next case. The Durbin-Watson statistic tests for the presence of serial correlation among the residuals. The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 - 2.50. The Durbin-Watson statistic for this problem is 1.94 which falls within the acceptable range from 1.50 to 2.50. The analysis satisfies the assumption of independence of errors.
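For completeness, a sketch of the Durbin-Watson statistic in Python, again assuming a fitted OLS results object `model`:

```python
# Sketch: Durbin-Watson statistic for serial correlation among the residuals.
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(model.resid)
print(f"Durbin-Watson = {dw:.2f}")  # values roughly between 1.50 and 2.50 are treated as acceptable here
```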
Excluding Extreme Outliers to Try to Achieve Normality • Three assumptions were violated: • linearity, • homogeneity of error variance, and • normality of the residuals. We will run the analysis without extreme outliers to see if we can satisfy the assumptions with that model. Click on the Exclude extreme outliers button.
Feedback on Extreme Outliers When we clicked on the Exclude extreme outliers button, the script provided feedback that there were no extreme outliers to remove.
Outliers and Extreme Outliers in SPSS Output If we look back at the output from the previous regression analysis, we see that it produced a list of 16 outliers. The absence of a table of extreme outliers implies that none were found for the analysis.
Testing Normality for Variables Treated as Metric: empathy7 Since there are no extreme outliers to remove, we will test transformations of the metric variables. First, highlight the dependent variable, empathy7. Second, click on the Test normality button.
Test of Normality in SPSS Output The logarithmic transformation of "accuracy of the description of being a pretty soft-hearted person" [empathy7] was used in this analysis because it had the largest value of the Shapiro-Wilk statistic (0.758).
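To show the logic of this choice outside SPSS, here is a Python sketch that compares Shapiro-Wilk statistics across a few candidate transformations of empathy7. The set of transformations (and any offsets the script applies) are assumptions; the point is simply that the transformation with the largest W is selected.

```python
# Sketch: comparing transformations of empathy7 by their Shapiro-Wilk W statistic.
# The candidate transformations are assumptions; the script's exact set may differ.
import numpy as np
from scipy import stats

x = df['empathy7'].dropna().astype(float)
candidates = {
    'no transformation': x,
    'logarithm':         np.log(x),    # assumes all values are positive
    'square root':       np.sqrt(x),
    'inverse':           1.0 / x,
}
for name, values in candidates.items():
    w, _ = stats.shapiro(values)
    print(f"{name}: W = {w:.3f}")
# Per the slide's output, the logarithm had the largest W (0.758) and would be applied.
```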
Adding the Transformation of Empathy7 to Data Set First, make certain the variable that we want to transform is still selected. Second, select the option button for the transformation we want to use - Logarithm in this example. Third, click on the Apply transformation button.
Testing Normality for Variables Treated as Metric: polviews The name of the variable in the Dependent variable text box is changed to reflect the transformation. Next, we select the first Metric independent variable to test for normality, polviews. And, we click on the Test normality button.
Test of Normality in SPSS Output No transformation of "description of political views" [polviews] was used in this analysis. None of the transformations had a value of the Shapiro-Wilk statistic that was at least 0.01 larger than the value for description of political views (0.924).
Testing Normality for Variables Treated as Metric: fund Next, we select the second Metric independent variable to test for normality, fund. And, we click on the Test normality button.
Test of Normality in SPSS Output No transformation of "degree of religious fundamentalism" [fund] was used in this analysis. None of the transformations had a value of the Shapiro-Wilk statistic that was at least 0.01 larger than the value for degree of religious fundamentalism (0.809).
The Hierarchical Regression with Transformed Variables Having completed the tests for normality and having transformed empathy7, we click on the OK button to run the regression with the transformed variables. If none of the transformations had improved normality, this would be the same model as the one with untransformed variables and we would conclude that we cannot improve the model with transformations.
Evaluating the Assumption of Independence of Variables Multicollinearity is not a problem in this regression analysis. • The tolerance values for all of the independent variables are larger than 0.10: • "degree of religious fundamentalism" [fund] (0.876), • "description of political views" [polviews] (0.952), • "survey respondents who were male" [sex_1] (0.976), • "survey respondents who were white" [hhrace_1] (0.970) and • "survey respondents who were black" [hhrace_2] (0.896).
Evaluating the Assumption of Linearity In the lack of fit test, the probability of the F test statistic (F=1.51) was p = .015, greater than the alpha level of significance of 0.01. The null hypothesis that "a linear regression model is appropriate" is not rejected. The research hypothesis that "a linear regression model is not appropriate" is not supported by this test. The assumption of linearity is satisfied.
Evaluating the Assumption of Homoscedasticity The homogeneity of error variance is tested with the Breusch-Pagan test. For this analysis, the Breusch-Pagan statistic was 8.811. The probability of the statistic was p = .117, which was greater than the alpha level for diagnostic tests (0.010). The null hypothesis that "the variance of the residuals is the same for all values of the independent variable" is not rejected. The research hypothesis that "the variance of the residuals is different for some values of the independent variable" is not supported. The assumption of homogeneity of error variance is satisfied.
Evaluating the Assumption of Normality Regression analysis assumes that the errors or residuals are normally distributed. The Shapiro-Wilk test of studentized residuals yielded a statistical value of 0.908, which had a probability of p < .001, which was less than or equal to the alpha level for diagnostic tests (0.010). The null hypothesis that "the distribution of the residuals is normally distributed" is rejected. The research hypothesis that "the distribution of the residuals is not normally distributed" is supported. The assumption of normality of errors is violated.
Evaluating the Assumption of Independence of Errors Regression analysis assumes that the errors (residuals) are independent and there is no serial correlation. No serial correlation implies that the size of the residual for one case has no impact on the size of the residual for the next case. The Durbin-Watson statistic tests for the presence of serial correlation among the residuals. The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 - 2.50. The Durbin-Watson statistic for this problem is 1.96 which falls within the acceptable range from 1.50 to 2.50. The analysis satisfies the assumption of independence of errors.
Excluding Extreme Outliers • One assumption was violated: • normality of the residuals. We will run the analysis without extreme outliers to see if we can satisfy all of the assumptions with that model. Click on the Exclude extreme outliers button.
Feedback on Extreme Outliers in the Script When we clicked on the Exclude extreme outliers button, the script provided feedback that there were no extreme outliers to remove.
Outliers and Extreme Outliers in SPSS Output If we look back at the output from the regression analysis with the transformed dependent variable, we see that it produced a list of 14 outliers. The absence of a table of extreme outliers implies that none were found for the analysis.
Marking the Check Boxes for Regression Assumptions None of the models using transformed variables or excluding extreme outliers satisfied all of the assumptions of multiple regression. None of the check boxes are marked. The model including the original variables and all cases will be interpreted. The violations of assumptions should be mentioned as limitations to the analysis.
Removing the Transformed Variable Since the transform of the dependent variable did not enable us to satisfy all of the assumptions, we will remove it from the data set. First, make certain the transformed variable is highlighted. Second, click on the No transformation option button to select it. Third, click on the Apply transformation button, which will restore the original variable name and delete the transformed variable from the data set.
Retaining the Dummy-coded Variables for the Hierarchical Regression To retain the dummy-coded variables so that they can be used in the hierarchical regression, clear the check box for Delete variables created in this analysis. Click on the Cancel button to close the script dialog box.
The Dummy-coded Variables in the Data Editor If we scroll to the right in the data editor, we see that the dummy-coded variables have been retained.
Running the Hierarchical Regression Model in SPSS To initiate a hierarchical regression in SPSS, select the Regression > Linear command from the Analyze menu.
Including the Dependent and Control Variables in the Analysis First, move the dependent variable, empathy7, to the Dependent text box. Second, move the control variables (hhrace_1, hhrace_2, and polviews) to the Independent(s) list box. Third, click on the Next button to start a new block of variables. We will put the predictor independent variables in the second block.
Adding the Predictor Independent Variables to the Analysis First, note that the block label changed to Block 2 of 2 when we clicked on the Next button. Second, note that the list box for independent variables is empty. To see the list we had entered in Block 1, click on the Previous button. To add the two predictors (sex_1 and fund) to the analysis, highlight the variable names and click on the right arrow button.
Requesting Additional Statistical Output To request additional statistical output, including the very important R squared change statistic, click on the Statistics button.
Additional Statistical Output Estimates and Model fit are selected by default. We mark the check boxes for Durbin-Watson, R squared change, Descriptives, and Collinearity diagnostics. When we have selected the statistics, we click on the Continue button to close the dialog box. We could bypass the selection of Collinearity diagnostics and Durbin-Watson, since the script requested these statistics and we have already evaluated our variables' ability to satisfy the assumptions of regression.