Logistic Regression

Logistic Regression – Stepwise Entry of Variables Sample Problem Steps in Solving Problems Homework Problems

Level of Measurement - question The first question requires us to examine the level of measurement requirements for binary logistic regression. Binary logistic regression requires that the dependent variable be dichotomous and the independent variables be metric or dichotomous.

Level of Measurement – evidence and answer True with caution is the correct answer, since we satisfy the level of measurement requirements, but include ordinal level variables in the analysis.

Sample Size - question The second question asks about the sample size requirements for binary logistic regression. To answer this question, we will run a baseline logistic regression to obtain some basic data about the problem and solution. The phrase “stepwise entry” dictates the method for including variables in the model.

Request stepwise logistic regression Select the Regression | Binary Logistic… command from the Analyze menu.

Selecting the dependent variable First, highlight the dependent variable uswary in the list of variables. Second, click on the right arrow button to move the dependent variable to the Dependent text box.

Adding the independent variables First, move the predictors to the Covariates list box.

Specifying the method for including variables In our stepwise logistic regression, we specify the Forward Conditional method for adding variables. This is one of the available methods for doing stepwise logistic regression.

Adding options to the output To add a summary of steps at the end of the analysis and specifications for stepwise method, click on the Options… button.

Set the option for listing outliers First, mark the checkbox for Casewise listing of residuals, accepting the default of outliers outside 2 standard deviations. Second, click on the At last step option to display the table of outliers only at the end of the analysis.

Specifications for stepwise method Click on the Continue button to close the dialog box. We can change the criteria for adding and removing variables from the analysis by changing the probability for entry and removal. We will use the default level of significance of 0.05 for entry and 0.10 for removal.

Completing the logistic regression request Click on the OK button to request the output for the logistic regression.

Sample size – ratio of cases to variables The minimum ratio of valid cases to independent variables for stepwise logistic regression is 10 to 1, with a preferred ratio of 50 to 1. In this analysis, there are 136 valid cases and 3 independent variables. The ratio of cases to independent variables is 45.33 to 1, which satisfies the minimum requirement. However, the ratio of 45.33 to 1 does not satisfy the preferred ratio of 50 to 1. A caution should be added to the interpretation of the analysis and a split sample validation should be conducted. True with caution is the correct answer to the question about sample size.
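The ratio check above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the SPSS procedure; the function name and the wording of the returned answers are assumptions made for the example.

```python
def sample_size_answer(valid_cases, n_predictors, minimum=10.0, preferred=50.0):
    """Return the cases-to-variables ratio and the answer it implies."""
    ratio = valid_cases / n_predictors
    if ratio < minimum:
        # Below the 10-to-1 minimum the analysis should not be used.
        answer = "inappropriate application of a statistic"
    elif ratio < preferred:
        # Minimum met, preferred 50-to-1 ratio not met.
        answer = "true with caution"
    else:
        answer = "true"
    return ratio, answer


# Values from this problem: 136 valid cases, 3 independent variables.
ratio, answer = sample_size_answer(136, 3)
print(f"{ratio:.2f} to 1 -> {answer}")  # 45.33 to 1 -> true with caution
```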

Outliers in the analysis - question Outliers in logistic regression are defined as cases that have a studentized residual of +/-2.0 or larger.
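As a rough sketch of the outlier rule, the snippet below flags cases whose studentized residual is 2.0 or larger in absolute value. The residual values are made up for illustration only; in practice they would come from the SPSS casewise output.

```python
def flag_outliers(studentized_residuals, cutoff=2.0):
    """Return (case index, residual) pairs at or beyond +/- cutoff."""
    return [
        (index, residual)
        for index, residual in enumerate(studentized_residuals)
        if abs(residual) >= cutoff
    ]


# Hypothetical residuals, NOT taken from the analysis in these slides.
example_residuals = [0.41, -1.12, 1.87, -0.65]
outliers = flag_outliers(example_residuals)
print(outliers)  # [] -> no outliers, so the baseline model is interpreted
```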

Outliers in the analysis – evidence and answer Using the criterion of studentized residuals greater than +/- 2.0, SPSS did not identify any outliers and did not print the Casewise List. SPSS reports this in a footnote in place of the Casewise List output. The correct answer to the outlier question is true. Since there were no outliers, there is no revised model to run and no choice to make between a baseline model and a revised model. We will interpret the baseline model.

Multicollinearity and Numerical Problems - question Multicollinearity in the logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, cells with a zero count for a dummy-coded independent variable because all of the subjects have the same value for the variable, and 'complete separation' whereby the two groups in the dependent event variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted.

Multicollinearity and Numerical Problems – evidence and answer The standard errors for the variables included in the analysis were: "total family income" (.033). None of the independent variables in this analysis had a standard error larger than 2.0. True is the correct answer. SPSS does not output the standard error for variables not included in the equation, so we cannot tell if any variables were excluded because of multicollinearity.

Overall Relationship - question The presence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the model chi-square at the step when the last variable was entered into the analysis. Only one variable, total family income, is indicated to be a useful predictor of group membership. Total family income is the most useful predictor if it is the first variable entered into the analysis.

Overall Relationship – evidence and answer - 1 There was only one step in this stepwise analysis. The selection of variables stopped at step one because neither of the other potential predictors could improve the fit at a statistically significant level. At the end of that step, the probability of the model chi-square (9.001) was p=0.003, less than or equal to the level of significance of 0.05. The null hypothesis that there is no difference between the model with only a constant and the model with independent variables was rejected. The existence of a relationship between the independent variables and the dependent variable was supported.
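The reported probability can be checked by hand. The sketch below assumes the model chi-square has 1 degree of freedom, which is what we would expect when a single predictor treated as metric is entered; SPSS reports the actual degrees of freedom in the Omnibus Tests table. For 1 degree of freedom the upper-tail probability has a closed form, so only the standard library is needed.

```python
import math


def chi_square_p_value_df1(chi_square):
    """Upper-tail probability of a chi-square statistic with 1 df.

    With 1 df the statistic is the square of a standard normal
    variable, so P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_square / 2.0))


# Model chi-square at step 1 of the stepwise analysis.
p_value = chi_square_p_value_df1(9.001)
print(round(p_value, 3))   # 0.003
print(p_value <= 0.05)     # True -> reject the null hypothesis
```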

Overall Relationship – evidence and answer - 2 On step 1, the variable INCOME98, or total family income, was included in the analysis. The statement that it is the most useful predictor is supported. The answer to this question is true with caution. Caution in interpreting the relationship should be exercised because the ordinal level variable "highest academic degree" [degree] was treated as metric; the ordinal level variable "total family income" [income98] was treated as metric; the ordinal level variable "satisfaction with financial situation" [satfin] was treated as metric; and the available sample was less than the preferred number of cases.

Relationship of Individual Independent Variables to Dependent Variable The probability of the Wald statistic for the variable total family income was p=0.004, less than or equal to the level of significance of 0.05. The null hypothesis that the b coefficient for total family income was equal to zero was rejected. Total family income is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who had higher total family incomes. The value of Exp(B) was 0.909, which implies a decrease in the odds of 9.1% (0.909 - 1.0 = -0.091) for each one-unit increase in total family income. This supports the relationship that "survey respondents who had higher total family incomes were 9.1% less likely to have been more positive that the United States would fight in another world war within the next ten years."
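The conversion from Exp(B) to a percentage change in the odds is simple arithmetic, sketched below. The function names are illustrative. The second function is an added note rather than something stated on the slide: because Exp(B) is multiplicative, the odds ratio for a change of several units is Exp(B) raised to that power.

```python
def percent_change_in_odds(exp_b):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (exp_b - 1.0) * 100.0


def odds_ratio_for_units(exp_b, units):
    """Odds ratio implied by a change of `units` in the predictor."""
    return exp_b ** units


# Exp(B) for total family income from the Variables in the Equation table.
change = percent_change_in_odds(0.909)
print(f"{change:.1f}%")  # -9.1%
```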

Individual Relationships – Academic degree - question To answer the question about an individual relationship, we look to the significance of the Wald test of the B coefficient and the interpretation of the odds ratio and the step summary which lists the order of entry.

Individual Relationships – Academic degree – evidence and answer The independent variable "highest academic degree" [degree] was not included in the stepwise logistic regression analysis. False is the correct answer.

Individual Relationships – Family income – question To answer the question about an individual relationship, we look to the significance of the Wald test of the B coefficient and the interpretation of the odds ratio and the step summary which lists the order of entry.

Individual Relationships – Family income – evidence and answer In the Step Summary table, "total family income" [income98] was added to the logistic regression analysis in step 1. This makes it the best predictor.

Individual Relationships – Family income – evidence and answer The probability of the Wald statistic for the variable "total family income" [income98] was p=0.004, less than or equal to the level of significance of 0.05. The null hypothesis that the b coefficient for "total family income" [income98] was equal to zero was rejected. "Total family income" [income98] is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who had higher total family incomes. The value of Exp(B) was 0.909, which implies a decrease in the odds of 9.1% (0.909 - 1.0 = -0.091) for each one-unit increase in total family income. The correct interpretation of the relationship is that 'survey respondents who had higher total family incomes were 9.1% less likely to have been more positive that the United States would fight in another world war within the next ten years.'

Individual Relationships – Family income – evidence and answer True with caution is the correct answer. Caution in interpreting the relationship should be exercised because the ordinal level variable "total family income" [income98] was treated as metric, and the available sample was less than the preferred number of cases.

Individual Relationships – financial satisfaction – question To answer the question about an individual relationship, we look to the significance of the Wald test of the B coefficient and the interpretation of the odds ratio and the step summary which lists the order of entry.

Individual Relationships – financial satisfaction – evidence and answer The independent variable "satisfaction with financial situation" [satfin] was not included in the stepwise logistic regression analysis. False is the correct answer.

Classification Accuracy - question The independent variables could be characterized as useful predictors distinguishing survey respondents who have been more positive that the United States would fight in another world war within the next ten years from survey respondents who have been less positive that the United States would fight in another world war within the next ten years if the classification accuracy rate was substantially higher than the accuracy attainable by chance alone. Operationally, the classification accuracy rate should be 25% or more higher than the proportional by chance accuracy rate.

Classification Accuracy – evidence and answer: by chance accuracy rate The proportional by chance accuracy rate was computed from the proportion of cases in each group in the classification table at Step 0. The proportion in the No group was 0.603, making the proportion in the Yes group 0.397 (1.0 – 0.603). The proportions of cases in each group are then squared and summed (0.603² + 0.397² = 0.521). The proportional by chance accuracy criterion is 25% higher, or 65.2% (1.25 x 52.1% = 65.2%).

Classification Accuracy – evidence and answer The classification accuracy rate computed by SPSS was 67.6%, which was greater than or equal to the proportional by chance accuracy criterion of 65.2% (1.25 x 52.1% = 65.2%). The criterion for classification accuracy is satisfied. The answer to the question is true.
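The by chance accuracy computation can be reproduced with a short Python sketch. The function name is an assumption for illustration; the proportions and the 67.6% accuracy rate are the values reported for this problem.

```python
def by_chance_criterion(group_proportions, improvement=1.25):
    """Return the proportional by chance accuracy rate and the criterion.

    The by chance rate is the sum of squared group proportions; the
    criterion is that rate multiplied by 1.25 (a 25% improvement).
    """
    by_chance = sum(p ** 2 for p in group_proportions)
    return by_chance, improvement * by_chance


# Proportions from the classification table at Step 0.
by_chance, criterion = by_chance_criterion([0.603, 1.0 - 0.603])
model_accuracy = 0.676  # classification accuracy rate computed by SPSS

print(f"{by_chance:.3f}")            # 0.521
print(f"{criterion:.3f}")            # 0.652
print(model_accuracy >= criterion)   # True -> criterion satisfied
```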

Validation - question For a stepwise logistic regression, the 75%-25% cross-validation must verify the overall contribution of the independent variables included in the analysis. In addition, the pattern of significance for the individual relationships between the dependent variable and the predictors for the training sample should be the same as the pattern for the full data set. And finally, the classification accuracy rate for the validation sample must be within 2% of the accuracy rate for the training sample.

Validation analysis: set the random number seed To set the random number seed, select the Random Number Seed… command from the Transform menu.

Set the random number seed First, click on the Set seed to option button to activate the text box. Second, type in the random seed stated in the problem. Third, click on the OK button to complete the dialog box. Note that SPSS does not provide you with any feedback about the change.

Validation analysis: compute the split variable To enter the formula for the variable that will split the sample in two parts, click on the Compute… command.

The formula for the split variable First, type the name for the new variable, split, into the Target Variable text box. Second, enter the formula for the value of split, uniform(1) <= 0.75, in the text box. The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.75. If the random number is less than or equal to 0.75, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.75, the formula will return a 0, the SPSS numeric equivalent to false. Third, click on the OK button to complete the dialog box.
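The logic of the split variable can be mimicked in Python as a rough analogue. Note the assumptions: Python's random number generator is not the SPSS generator, so the same seed will not reproduce the SPSS split, and the seed value below is a placeholder rather than the seed stated in the problem.

```python
import random


def make_split(n_cases, seed, training_share=0.75):
    """Assign each case 1 (training) or 0 (validation) at random.

    Mirrors the comparison described above: a uniform random number
    between 0 and 1 is compared to 0.75 for every case.
    """
    rng = random.Random(seed)  # seeded so the split is repeatable
    return [1 if rng.random() <= training_share else 0 for _ in range(n_cases)]


# 136 valid cases; the seed here is a placeholder, not the problem's seed.
split = make_split(136, seed=12345)
print(sum(split), "training cases,", len(split) - sum(split), "validation cases")
```

Because the assignment is random, the training share will be close to 75% of the cases but rarely exactly 75%.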

Running the logistic regression again with the training sample We repeat the logistic regression analysis for the first validation sample. Select the Regression | Binary Logistic… command from the Analyze menu.

Using "split" as the selection variable First, scroll down the list of variables and highlight the variable split. Second, click on the right arrow button to move the split variable to the Selection Variable text box.

Setting the value of split to select cases When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split. Click on the Rule… button to enter a value for split.

Completing the value selection First, type the value that identifies the training sample, 1, into the Value text box. Second, click on the Continue button to complete the value entry.

Requesting output for the validation sample Click on the OK button to request the output. When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable.

CROSS-VALIDATION - 1 In the cross-validation analysis, the relationship between the independent variables and the dependent variable was statistically significant. The probability for the model chi-square (7.572) testing overall relationship was p=0.006, less than or equal to the level of significance of 0.05. The significance of the overall relationship between the independent variables and the dependent variable supports the validation analysis.

CROSS-VALIDATION - 2 The relationship between family income and "expectation about war" [uswary] was statistically significant for the model using the full data set (p=0.004). Similarly, the relationship in the cross-validation analysis was statistically significant. In the cross-validation analysis, the probability for the test of relationship between family income and "expectation about war" [uswary] was p=0.008, which was less than or equal to the level of significance of 0.05 and statistically significant.

CROSS-VALIDATION - 5 The classification accuracy rate for the model using the training sample was 67.0%, compared to 66.7% for the validation sample. The shrinkage in classification accuracy for the validation analysis is the difference between the accuracy for the training sample (67.0%) and the accuracy for the validation sample (66.7%), which equals 0.3% in this analysis. The shrinkage was within the 2% criterion for minimal shrinkage, small enough to support a conclusion that the logistic regression model based on this analysis would be effective in predicting scores for cases other than those included in the calculation of the regression analysis. The validation analysis supports the generalizability of the findings. The answer to the question is true.
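The shrinkage comparison amounts to a single subtraction, sketched here in Python. The function name is illustrative, and rounding to one decimal place is an assumption that matches how the accuracy rates are reported.

```python
def shrinkage_check(training_accuracy, validation_accuracy, tolerance=2.0):
    """Return the shrinkage in percentage points and whether it is within tolerance."""
    # Round to one decimal place to avoid floating-point noise
    # in the difference of two reported percentages.
    shrinkage = round(training_accuracy - validation_accuracy, 1)
    return shrinkage, abs(shrinkage) <= tolerance


# Accuracy rates for the training and validation samples.
shrinkage, within_tolerance = shrinkage_check(67.0, 66.7)
print(shrinkage, within_tolerance)  # 0.3 True
```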

Summary of Findings - question The final question is a summary of the findings of the analysis: overall relationship, individual relationships, and usefulness of the model. Cautions are added, if needed, for sample size and level of measurement issues.

Summary of Findings – evidence and answer True with caution is the correct answer.

Stepwise binary logistic regression: level of measurement Question: Variables included in the analysis satisfy the level of measurement requirements? Is the dependent variable dichotomous, and are the independent variables metric or dichotomous? If no, the answer is inappropriate application of a statistic. If yes, is an ordinal independent variable included in the analysis? If yes, the answer is true with caution. If no, the answer is true.

Stepwise binary logistic regression: sample size Question: Number of variables and cases satisfy sample size requirements? Run the baseline logistic regression, using the stepwise method for including variables identified in the research question. Record the classification accuracy for evaluation of the effect of removing outliers. Is the ratio of cases to independent variables at least 10 to 1? If no, the answer is inappropriate application of a statistic. If yes, is the ratio of cases to independent variables at least 50 to 1? If no, the answer is true with caution. If yes, the answer is true.

Stepwise binary logistic regression: detecting outliers Question: Outliers were not detected in the analysis? Were outliers for the solution identified by studentized residuals > ±2.0? If no, the answer is true. If yes, the answer is false.
