
Hierarchical Binary Logistic Regression


Presentation Transcript


  1. Hierarchical Binary Logistic Regression

  2. Hierarchical Binary Logistic Regression • In hierarchical binary logistic regression, we are testing a hypothesis or research question that some predictor independent variables improve our ability to predict membership in the modeled category of the dependent variable, after taking into account the relationship between some control independent variables and the dependent variable. • In multiple regression, we evaluated this question by looking at R2 change, the increase in R2 associated with adding the predictors to the regression analysis. • The analog to R2 change in logistic regression is the Block Chi-square, the increase in Model Chi-square associated with the inclusion of the predictors. • In standard binary logistic regression, we interpreted the SPSS output that compared Block 0, a model with no independent variables, to Block 1, the model that included the independent variables. • In hierarchical binary logistic regression, the control variables are added in SPSS in Block 1, the predictor variables are added in Block 2, and the interpretation of the overall relationship is based on the change in the relationship from Block 1 to Block 2, as sketched in the code below.
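
The same blockwise comparison can be reproduced outside SPSS. Below is a minimal sketch in Python with statsmodels, using synthetic data invented for illustration (the variable names mirror the example, but none of the values come from the actual data set): fit the controls-only model, fit the full model, and take the drop in -2 log likelihood as the block chi-square.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 150
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "sex": rng.integers(0, 2, n),
    "reliten": rng.integers(1, 5, n),
    "happy": rng.integers(1, 4, n),
    "grass": rng.integers(0, 2, n),
})

# Block 1: controls only
X1 = sm.add_constant(df[["age", "sex"]].astype(float))
# Block 2: controls plus dummy-coded predictors (drop_first uses the first
# category as reference; the reference choice does not affect the chi-square)
X2 = sm.add_constant(
    pd.get_dummies(df.drop(columns="grass"), columns=["reliten", "happy"],
                   drop_first=True).astype(float))

m1 = sm.Logit(df["grass"], X1).fit(disp=0)
m2 = sm.Logit(df["grass"], X2).fit(disp=0)

# Block chi-square = drop in -2 log likelihood from Block 1 to Block 2
block_chi2 = (-2 * m1.llf) - (-2 * m2.llf)
df_diff = m2.df_model - m1.df_model       # number of predictor terms added
p = stats.chi2.sf(block_chi2, df_diff)
print(f"Block chi-square = {block_chi2:.3f}, df = {df_diff:.0f}, p = {p:.4f}")
```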

  3. Output for Hierarchical Binary Logistic Regression after control variables are added This output is for the sample problem worked below. In this example, the control variables do not have a statistically significant relationship to the dependent variable, but they can still serve their purpose as controls. After the controls are added, the measure of error, -2 Log Likelihood, is 195.412.

  4. Output for Hierarchical Binary Logistic Regression after predictor variables are added The hierarchical relationship is based on the reduction in error associated with the inclusion of the predictor variables. After the predictors are added, the measure of error, -2 Log Likelihood, is 168.542. The difference between the -2 log likelihood at Block 1 (195.412) and the -2 log likelihood at Block 2 (168.542) is Block Chi-square (26.870) which is significant at p < .001. Model Chi-square is the cumulative reduction in -2 log likelihood for the controls and the predictors.
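
As a quick check, the block chi-square on the slide is just arithmetic on the two -2 log likelihood values, with 5 degrees of freedom for the five dummy-coded predictor terms added in Block 2 (three for reliten, two for happy). A short sketch, assuming scipy:

```python
from scipy import stats

block_chi2 = 195.412 - 168.542      # -2LL at Block 1 minus -2LL at Block 2
df = 5                              # 3 reliten dummies + 2 happy dummies
print(f"chi2 = {block_chi2:.3f}, p = {stats.chi2.sf(block_chi2, df):.5f}")
# chi2 = 26.870, p < .001, matching the slide
```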

  5. The Problem in Blackboard • The problem statement tells us: • the variables included in the analysis • whether each variable should be treated as metric or non-metric • the type of dummy coding and reference category for non-metric variables • the alpha for both the statistical relationships and for diagnostic tests

  6. The Statement about Level of Measurement The first statement in the problem asks about level of measurement. Hierarchical binary logistic regression requires that the dependent variable be dichotomous, the metric independent variables be interval level, and the non-metric independent variables be dummy-coded if they are not dichotomous. SPSS Binary Logistic Regression calls non-metric variables “categorical.” SPSS Binary Logistic Regression will dummy-code categorical variables for us, provided either the first or the last category is a suitable reference category. A sketch of what this coding produces follows below.
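
For readers who want to see what the Indicator method produces, here is a minimal sketch in Python with pandas as a stand-in for what SPSS does internally; the three happiness categories come from the example, but the data values are invented.

```python
import pandas as pd

# Indicator (dummy) coding with the last category as the reference:
# each non-reference category gets a 0/1 column, and the reference
# category is the row that is 0 on all of them.
happy = pd.Categorical(
    ["not too happy", "pretty happy", "very happy", "pretty happy"],
    categories=["not too happy", "pretty happy", "very happy"],  # last = reference
)
dummies = pd.get_dummies(happy).drop(columns="very happy")
print(dummies)
```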

  7. Marking the Statement about Level of Measurement • Mark the check box as a correct statement because: • The dependent variable "should marijuana be made legal" [grass] is dichotomous level, satisfying the requirement for the dependent variable. • The independent variable "age" [age] is interval level, satisfying the requirement for independent variables. • The independent variable "sex" [sex] is dichotomous level, satisfying the requirement for independent variables. • The independent variable "strength of religious affiliation" [reliten] is ordinal level, which the problem instructs us to dummy-code as a non-metric variable. • The independent variable "general happiness" [happy] is ordinal level, which the problem instructs us to dummy-code as a non-metric variable.

  8. The Statement about Outliers While we do not need to be concerned about normality, linearity, and homogeneity of variance, we do need to determine whether outliers are substantially reducing the classification accuracy of the model. To test for outliers, we run the binary logistic regression in SPSS, saving the standardized residuals. Next, we exclude the outliers and run the logistic regression a second time. We then compare the accuracy rates of the models with and without the outliers. If the model excluding outliers is more accurate by 2% or more than the model including all cases, we interpret the model excluding outliers. The whole procedure is sketched in code below.
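
The screen-and-compare procedure is easy to express in code. The sketch below uses Python with statsmodels on synthetic data; only the 2.58 residual cutoff and the 2% accuracy rule come from the slides, while the model, data, and variable names are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 150
X = sm.add_constant(pd.DataFrame({"age": rng.normal(45.0, 15.0, n)}))
y = pd.Series(rng.integers(0, 2, n), name="grass")

def fit_and_score(y, X):
    fit = sm.Logit(y, X).fit(disp=0)
    accuracy = ((fit.predict(X) >= 0.5) == y).mean()
    return fit, accuracy

fit, acc_all = fit_and_score(y, X)

zre = fit.resid_pearson                   # SPSS saves these as ZRE_1
keep = np.abs(zre) < 2.58                 # same rule as abs(ZRE_1) < 2.58
_, acc_trim = fit_and_score(y[keep], X[keep])

# Interpret the trimmed model only if it is at least 2 points more accurate.
print(f"all cases: {acc_all:.1%}, excluding outliers: {acc_trim:.1%}")
print("interpret:",
      "excluding outliers" if acc_trim - acc_all >= 0.02 else "all cases")
```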

  9. Running the hierarchical binary logistic regression Select the Regression | Binary Logistic… command from the Analyze menu.

  10. Selecting the dependent variable First, highlight the dependent variable grass in the list of variables. Second, click on the right arrow button to move the dependent variable to the Dependent text box.

  11. Selecting the control independent variables First, move the control independent variables stated in the problem (age and sex) to the Covariates list box. Second, click on the Next button to start a new block and add the predictor independent variables.

  12. Selecting the predictor independent variables Note that the block is now labeled as Block 2 of 2. First, move the predictor independent variables stated in the problem (reliten and happy) to the Covariates list box. Second, click on the Categorical button to specify which variables should be dummy-coded.

  13. Declare the categorical variables - 1 Move the variables sex, reliten, and happy to the Categorical Covariates list box. SPSS assigns its default method for dummy-coding, Indicator coding, to each variable, placing the name of the coding scheme in parentheses after each variable name.

  14. Declare the categorical variables - 2 We accept the default of using the Indicator method for dummy-coding each variable. We will also accept the default of using the last category as the reference category for each variable. Click on the Continue button to close the dialog box.

  15. Specifying the method for including variables Since the problem calls for a hierarchical binary logistic regression, we accept the default Enter method for including variables in both blocks.

  16. Adding the values for outliers to the data set - 1 SPSS will calculate the values for standardized residuals and save them to the data set so that we can check for outliers and remove the outliers easily if we need to run a model excluding outliers. Click on the Save… button to request the statistics that we want to save.

  17. Adding the values for outliers to the data set - 2 First, mark the checkbox for Standardized residuals in the Residuals panel. Second, click on the Continue button to complete the specifications.

  18. Requesting the output While optional statistical output is available, we do not need to request any optional statistics. Click on the OK button to request the output.

  19. Detecting the presence of outliers - 1 SPSS created a new variable, ZRE_1, which contains the standardized residual. If SPSS finds that the data set already contains a ZRE_1 variable, it will create ZRE_2. I find it easier to delete the ZRE_1 variable after each analysis rather than have multiple ZRE_ variables in the data set, requiring that I remember which one goes with which analysis.

  20. Detecting the presence of outliers - 2 • To detect outliers, we will sort the ZRE_1 column twice: • first, in ascending order to identify outliers with a standardized residual of -2.58 or less. • second, in descending order to identify outliers with a standardized residual of +2.58 or greater. Click the right mouse button on the column header and select Sort Ascending from the pop-up menu.

  21. Detecting the presence of outliers - 3 After scrolling down past the cases with missing data (. in the ZRE_1 column), we see that we have one outlier that has a standardized residual of -2.58 or less.

  22. Detecting the presence of outliers - 4 To check for outliers with large positive standardized residuals, click the right mouse button on the column header and select Sort Descending from the pop-up menu.

  23. Detecting the presence of outliers - 5 After scrolling up to the top of the data set, we see that there are no outliers that have standardized residuals of +2.58 or more. Since we found outliers, we will run the model excluding them and compare accuracy rates to determine which one we will interpret. Had there been no outliers, we would move on to the issue of sample size.

  24. Running the model excluding outliers - 1 We will use a Select Cases command to exclude the outliers from the analysis.

  25. Running the model excluding outliers - 2 First, in the Select Cases dialog box, mark the option button If condition is satisfied. Second, click on the If button to specify the condition.

  26. Running the model excluding outliers - 3 To eliminate the outliers, we request that the cases that are not outliers be selected into the analysis. The formula specifies that we should include a case if the absolute value of its standardized residual (ZRE_1) is less than 2.58. The abs() or absolute value function tells SPSS to ignore the sign of the value. After typing in the formula, click on the Continue button to close the dialog box.

  27. Running the model excluding outliers - 4 SPSS displays the condition we entered on the Select Cases dialog box. Click on the OK button to close the dialog box.

  28. Running the model excluding outliers - 5 SPSS indicates which cases are excluded by drawing a slash across the case number. Scrolling down in the data, we see that the outliers and cases with missing values are excluded.

  29. Running the model excluding outliers - 6 To run the logistic regression excluding outliers, select Logistic Regression from the Dialog Recall menu.

  30. Running the model excluding outliers - 7 The only change we will make is to clear the check box for saving standardized residuals. Click on the Save button to open the dialog box.

  31. Running the model excluding outliers - 8 First, clear the check box for Standardized residuals. Second, click on the Continue button to close the dialog box.

  32. Running the model excluding outliers - 9 Finally, click on the OK button to request the output.

  33. Accuracy rate of the baseline model including all cases Navigate to the Classification Table for the logistic regression with all cases. The accuracy rate for the model with all cases is 71.3%. To distinguish the two models, I often refer to the first one as the baseline model.

  34. Accuracy rate of the revised model excluding outliers Navigate to the Classification Table for the logistic regression excluding outliers. The accuracy rate for the model excluding outliers is 71.1%. To distinguish the two models, I often refer to the second one as the revised model.

  35. Marking the statement for excluding outliers In the initial logistic regression model, 1 case had a standardized residual of +2.58 or greater or -2.58 or lower: - Case 20001058 had a standardized residual of -2.78 The classification accuracy of the model that excluded outliers (71.14%) was not greater by 2% or more than the classification accuracy for the model that included all cases (71.33%), so the model including all cases should be interpreted. The check box is not marked because removing outliers did not increase the accuracy of the model. All of the remaining statements will be evaluated based on the output for the model that includes all cases.

  36. The statement about multicollinearity and other numerical problems Multicollinearity in the logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, a cell with a zero count for a dummy-coded independent variable because all of the subjects have the same value for the variable, or 'complete separation', whereby the two groups of the dependent variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted. The screen itself is sketched in code below.
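
The screen is mechanical: fit the model and flag any coefficient whose standard error exceeds 2.0. A minimal sketch, again with statsmodels on invented data; only the 2.0 cutoff is the slide's rule of thumb.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 150
X = sm.add_constant(pd.DataFrame({
    "age": rng.normal(45.0, 15.0, n),
    "sex": rng.integers(0, 2, n).astype(float),
}))
y = pd.Series(rng.integers(0, 2, n))
fit = sm.Logit(y, X).fit(disp=0)

se = fit.bse.drop("const")        # standard errors of the b coefficients
if (se > 2.0).any():
    print("numerical problems suspected:", dict(se[se > 2.0].round(3)))
else:
    print("no standard error exceeds 2.0:", dict(se.round(3)))
```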

  37. Checking for multicollinearity • The standard errors for the variables included in the analysis were: • "age" [age]: .01 • survey respondents who said that overall they were not too happy: .92 • survey respondents who said that overall they were pretty happy: .47 • survey respondents who said they had no religious affiliation: .53 • survey respondents who said they had a somewhat strong religious affiliation: .70 • survey respondents who said they had a not very strong religious affiliation: .47 • survey respondents who were male: .39

  38. Marking the statement about multicollinearity and other numerical problems Since none of the independent variables in this analysis had a standard error larger than 2.0, we mark the check box to indicate there was no evidence of multicollinearity.

  39. The statement about sample size Hosmer and Lemeshow, who wrote the widely used text on logistic regression, suggest that the sample size should be 10 cases for every independent variable.

  40. The output for sample size We find the number of cases included in the analysis in the Case Processing Summary. The 150 cases available for the analysis satisfied the sample size of 70 recommended by Hosmer and Lemeshow (10 cases for each of the 7 independent variables, counting the dummy-coded terms).

  41. Marking the statement for sample size Since we satisfy the sample size requirement, we mark the check box.

  42. The hierarchical relationship between the dependent and independent variables In a hierarchical logistic regression, the presence of a relationship between the dependent variable and the combination of independent variables entered after the control variables have been taken into account is based on the statistical significance of the block chi-square for the second block of variables, in which the predictor independent variables are included.

  43. The output for the hierarchical relationship In this analysis, the probability of the block chi-square was less than or equal to the alpha of 0.05 (χ²(5, N = 150) = 26.87, p < .001). The null hypothesis that there is no difference between the model with only the control variables and the model that adds the predictor independent variables was rejected. The existence of the hierarchical relationship between the predictor independent variables and the dependent variable was supported.

  44. Marking the statement for hierarchical relationship Since the hierarchical relationship was statistically significant, we mark the check box.

  45. The statement about the relationship between age and legalization of marijuana Having satisfied the criteria for the hierarchical relationship, we examine the findings for individual relationships with the dependent variable. If the overall relationship were not significant, we would not interpret the individual relationships. The first statement concerns the relationship between age and legalization of marijuana.

  46. Output for the relationship between age and legalization of marijuana The probability of the Wald statistic for the control independent variable "age" [age] (χ²(1, N = 150) = 1.83, p = .176) was greater than the level of significance of .05. The null hypothesis that the b coefficient for "age" [age] was equal to zero was not rejected. "Age" [age] does not have an impact on the odds that survey respondents supported the legalization of marijuana. The analysis does not support the statement that 'For each unit increase in "age", survey respondents were 1.7% less likely to support the legalization of marijuana.'
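
Both numbers on this slide come from the b coefficient and its standard error: the Wald statistic is (b/SE)² on 1 degree of freedom, and the "1.7% less likely" reading is 100·(exp(b) − 1). A short sketch with illustrative values chosen to be roughly consistent with the slide's rounded output; they are not the actual SPSS estimates.

```python
import math
from scipy import stats

b, se = -0.017, 0.0126                 # illustrative, not the actual estimates
wald = (b / se) ** 2                   # Wald chi-square with 1 df
p = stats.chi2.sf(wald, 1)
pct = (math.exp(b) - 1) * 100          # percent change in odds per year of age
print(f"Wald = {wald:.2f}, p = {p:.3f}, odds change = {pct:.1f}% per unit")
```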

  47. Marking the statement for relationship between age and legalization of marijuana Since the relationship was not statistically significant, we do not mark the check box for the statement.

  48. Statement for relationship between general happiness and legalization of marijuana The next statement concerns the relationship between the dummy-coded variable for general happiness and legalization of marijuana.

  49. Output for relationship between general happiness and legalization of marijuana The probability of the Wald statistic for the predictor independent variable survey respondents who said that overall they were not too happy (χ²(1, N = 150) = 13.96, p < .001) was less than or equal to the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said that overall they were not too happy was equal to zero was rejected. The value of Exp(B) for this variable was 31.642, which implies that the odds were multiplied by approximately 31.6. The statement that 'Survey respondents who said that overall they were not too happy were approximately 31.6 times more likely to support the legalization of marijuana compared to those who said that overall they were very happy' is correct.
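
To make the Exp(B) reading concrete: the slide's 31.642 is e raised to the b coefficient, and it scales the odds of the modeled category relative to the reference group. A tiny sketch; the baseline odds value here is arbitrary, purely for illustration.

```python
import math

exp_b = 31.642                          # Exp(B) from the slide
b = math.log(exp_b)                     # the b coefficient, about 3.455
odds_reference = 0.5                    # arbitrary baseline odds ("very happy")
odds_not_too_happy = odds_reference * exp_b
print(f"b = {b:.3f}; odds: {odds_reference} -> {odds_not_too_happy:.2f}")
```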

  50. Marking the relationship between general happiness and legalization of marijuana Since the relationship was statistically significant, and the statement that survey respondents who said that overall they were not too happy were approximately 31.6 times more likely to support the legalization of marijuana compared to those who said that overall they were very happy is correct, the check box is marked.
