Logistic Regression


  1. Logistic Regression

  2. Who intends to vote? • Scholars and politicians would both like to understand who votes. • Imagine that you did a survey of voters after an election and asked people whether they voted. • You want to be able to use their responses to understand what factors influence turnout decisions. • How would you analyze who voted? What factors do you think influence whether people voted – or at least said that they voted?

  3. Problem: Dichotomous Variable • No one factor influences turnout. • A multivariate analysis makes the most sense, allowing you to include independent variables like education, interest in politics, feelings toward the candidates, and the perceived closeness of the election. • The problem is that turnout is a dichotomous variable. • You either voted or you did not vote. • Linear (OLS) regression is best used for explaining variation in continuous variables. • Multivariate analyses with dichotomous dependent variables require logistic regression.

  4. Logistic Regression • Like linear (OLS) regression, binomial logistic regression models the relationship between multiple independent variables and a dependent variable. • For logistic regressions, though, the dependent variable is dichotomous • Usually coded 0 and 1

  5. Dichotomous Variables • A variable is dichotomous when there are only two possible options, like yes and no. • Sometimes dichotomous variables are called binary variables since the values are often coded as one and zero. • This is a common dependent variable in political science because scholars are often interested in Yes/No questions like: • Did you vote? • Do you approve of the President’s performance? • Does a country have an independent judiciary?

  6. Dichotomous Variable Conventions • Dichotomous variables have only two values. • Typically, inaction, absences or negative outcomes are coded as 0. • Examples: Did not vote, does not have an independent judiciary, did not riot, does not have the death penalty. • An action or an occurrence of an event, the presence of something, or if a person agreed with a statement is coded as 1.
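A minimal Python sketch of this coding convention, using made-up survey responses (the variable names are hypothetical):

```python
# Hypothetical post-election survey responses.
responses = ["yes", "no", "yes", "yes", "no"]

# Convention: the action (voted) is coded 1; inaction (did not vote) is coded 0.
voted = [1 if r == "yes" else 0 for r in responses]
print(voted)  # [1, 0, 1, 1, 0]
```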

  7. When to use Logistic Regression • Logistic regression is a multivariate analysis designed to gauge how independent variables influence the likelihood of an outcome occurring. • This outcome could be: • an event, like a war, occurring, • a choice being made, like deciding to vote Democrat. • or an action being taken, like voting or joining a protest.

  8. Summary: Interpretation of Logistic Results • Logistic regression coefficients cannot be interpreted like linear regression coefficients. • You can assess whether the independent variable increases (or decreases) the chances of the event occurring, choosing an option, or partaking in the action like voting by looking at the sign of the coefficient (negative or positive). • You can see if the effect of the independent variable on the dependent variable is due to chance by looking at familiar measures of statistical significance. • Measures similar to R-squared as well as a classification table indicate model goodness of fit.

  9. Did poor states vote for George W. Bush? • Let's say you want to test the hypothesis that George W. Bush was more likely in 2000 to win a plurality of votes in a state if people in that state tended to be poor. • Poor states include “red” states like Mississippi and Alabama. • The independent variable is the median household income of the state (an interval variable, “medhhinc”) in 100’s of dollars. • The dependent variable is whether or not a plurality of voters in that state cast votes for Bush (“votebush”), giving Bush that state's electoral college votes. • This is a dichotomous dependent variable.

  10. Problems with Linear Regression • Linear (or Ordinary Least Squares [OLS]) regression is inappropriate for explaining dependent variables that are dichotomous because the regression model tries to fit a straight line between the observations, and this line does a very poor job of fitting the data. • Linear regression is fine for dichotomous independent “dummy” variables. • To illustrate, I am going to use a bivariate regression that depicts the relationship between the wealth of a state and whether or not a plurality in that state voted for Bush or Gore in 2000.

  11. Example In the example to the right, all of the observations are at two values on the Y axis, at one and at zero. Observations are depicted at one if Bush won a plurality of votes in that state in 2000, zero if the state voted for Gore. The X-axis is the median income of American states. To simplify, I include only the ten poorest states and the ten richest states. On the graph you can see in the top left that almost all of the poorest states (including Gore's home state, Tennessee) voted for Bush. In the bottom right most of the rich states voted for Gore.

  12. Example What does the model predict is the value of Vote for Bush (y-axis) if median household income ≈ $40,000 (x-axis)? Look at where the line crosses $40,000 (solid red arrow) and then read the value of the y-axis at that point (dashed red arrow). The answer looks to be about 0.6-0.65… Which is impossible for a variable that is either zero or one.

  13. Example What does the model predict is the value of Vote for Bush if median household income ≈ $50,000? Look at where the line crosses $50,000 (solid red arrow) and then read the value of the y-axis at that point (dashed red arrow). The answer looks to be about 0.3… Which is impossible for a variable that is only either zero or one. 0.3 is not even very close to either zero or one!

  14. There’s no such thing as a little pregnant! • In our example, when we fit a straight regression line to the data, the line predicts that at most levels of median household income, states vote a little for Bush or a little for Gore… • This is fine if the dependent variable is the percentage of the vote, but when the variable is dichotomous, voting a little for Bush (or Gore) is like being a little pregnant… A plurality in each state votes either for Gore or for Bush! • Remember that all that matters in the Electoral College is whether a candidate wins a plurality – the margin is unimportant. • As a result, analysts might like to study whether or not Bush won, not the margin of Bush's victory.

  15. Problems with fitting the straight line • In the preceding graph, the linear regression line predicts that for most levels of X (median household income), Y (a plurality vote for Bush) should be between one and zero. • This is problematic because Y (a plurality vote for Bush) can only be either one (yes) or zero (no). • Even worse problems are not uncommon, as linear regression lines can predict values from negative infinity to positive infinity (“unbounded”). • A better model would reflect the reality that only two options – one (yes) or zero (no) – are possible.
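To sketch the “unbounded” problem in Python, the snippet below (with made-up coefficient values) compares a straight-line prediction to an S-shaped logistic one over a range of incomes; only the latter stays between zero and one:

```python
import math

# Hypothetical coefficients from fitting the 0/1 outcome on income (in dollars).
def linear(income):
    return 3.0 - 0.00005 * income  # straight line: unbounded predictions

def logistic(income):
    return 1.0 / (1.0 + math.exp(-(3.0 - 0.00005 * income)))  # bounded in (0, 1)

for income in (20_000, 40_000, 60_000, 100_000):
    print(income, round(linear(income), 2), round(logistic(income), 3))
```

At $100,000 the straight line predicts -2.0, an impossible value for a 0/1 outcome, while the logistic curve still returns a probability strictly between zero and one.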

  16. Implications of linear regression • When linear regression is applied to a model with a categorical dependent variable, the distance between almost all observed points and the regression line is quite large. • By predicting values that are both impossible and far from the actual values, the model inflates standard errors and explains little of the total variation. • Linear regression also assumes that the error terms are normally distributed, an assumption that logit does not make.

  17. A solution By fitting a “sigmoidal,” or S-shaped, curved line to the data (see chart on left), we can do a much better job of minimizing the errors. For much of the range, the black line in the middle of the graph is very close to 1 or 0. Note: this line has a negative slope, so it looks more like the letter Z; a curve with a positive slope looks more like the letter S, running from the bottom left to the top right.

  18. Curves are better than sticks • This S-curve does a much better job of minimizing the errors than a straight line. • The range of values of X for which Y is predicted to be between one and zero is confined to a narrow band in the middle of the distribution. • Most computer programs round predicted values over 0.5 up to one (by default) to gauge how many observations the model correctly predicts.
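The default 0.5 cutoff can be sketched in a few lines of Python; the predicted probabilities and observed outcomes below are made up:

```python
# Hypothetical predicted probabilities from the S-curve, with observed outcomes.
probs    = [0.92, 0.71, 0.55, 0.40, 0.08]
observed = [1, 1, 0, 0, 0]

# Predictions over 0.5 are rounded up to 1; the rest are rounded down to 0.
predicted = [1 if p > 0.5 else 0 for p in probs]
correct = sum(p == o for p, o in zip(predicted, observed))
print(predicted)  # [1, 1, 1, 0, 0]
print(f"{correct} of {len(observed)} correctly classified")
```

This is the calculation behind the classification tables many programs report as a goodness of fit measure.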

  19. Curve dynamics • We can interpret the data in terms of increasing the odds, chances, or probability that the choice is one (a plurality for Bush). • Because the curve flattens out as it approaches the extreme ends of the range of X, the predicted probabilities of choosing 1 or 0 also change very little at extreme values of X. • In our example, poor states were more likely to vote for Bush, but the likelihood of voting for Bush does not change much for the three poorest states compared to the other poor states.

  20. Logit and Probit • There are two types of similar S-curves used to analyze these data, logit and probit. • The two tend to yield similar results. • Probit curves approach probabilities of zero and one more quickly (probit has thinner tails), so logit models tend to be more sensitive when dealing with rare events. • Logit analyses appear more frequently in political science, largely because they can be more readily interpreted in terms of odds and odds ratios.
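The difference in the tails can be illustrated in Python by comparing the logistic function with the standard normal CDF used by probit (computed here via the error function):

```python
import math

def logit_p(z):
    """Logistic (logit) probability."""
    return 1.0 / (1.0 + math.exp(-z))

def probit_p(z):
    """Standard normal CDF (probit), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both curves pass through 0.5 at z = 0, but probit reaches the extremes faster.
for z in (-1, -2, -3, -4):
    print(z, round(logit_p(z), 5), round(probit_p(z), 5))
```

Far out in the tail, the probit probability is much closer to zero than the logit probability, which is why the two models can diverge for rare events.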

  21. Maximum Likelihood Analysis • Both logit and probit are examples of maximum likelihood estimation techniques, which find the parameters that maximize the likelihood of observing the sample data if the model's assumptions are accurate. • The techniques work by fitting an equation to the observed values and then repeatedly adjusting the equation slightly to find a better fit, until the new equation barely improves on the previous model. • These techniques can also be used to explain the number of times an event takes place. • See King (1989), Long and Freese (2006).
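The “repeatedly improve the fit” idea can be sketched in pure Python as gradient ascent on the log-likelihood. The toy data and step size below are invented for illustration, and real packages use faster algorithms such as Newton-Raphson:

```python
import math

# Toy data: x is a standardized predictor, y a 0/1 outcome (mostly negative relationship).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [1, 1, 0, 1, 0, 0]

def log_likelihood(b0, b1):
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        ll += math.log(p) if y == 1 else math.log(1.0 - p)
    return ll

# Start at zero and repeatedly nudge the coefficients uphill on the likelihood.
b0, b1, step = 0.0, 0.0, 0.1
for _ in range(2000):
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        g0 += y - p          # gradient with respect to the intercept
        g1 += (y - p) * x    # gradient with respect to the slope
    b0, b1 = b0 + step * g0, b1 + step * g1

print(round(b0, 3), round(b1, 3), round(log_likelihood(b0, b1), 3))
```

After the loop, the fitted coefficients give a higher log-likelihood than the starting values, which is exactly the improvement the iterative procedure is chasing.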

  22. Beyond Logit and Probit • In this lecture, we will only discuss modeling choices between dichotomous variables, but there are ways of analyzing more than two choices. • Ordered logit (or probit) fits multiple S-curves like steps on ordinal dependent variables. • Multinomial logit (or probit) enables scholars to explain variation in nominal dependent variables. • These methods bridge the gap between the logit and probit models of dichotomous choices and OLS regression models best used when the dependent variable is at least ordinal with many value categories.

  23. Running Logit • To analyze a logistic regression with a computer program, one must specify a dependent variable and at least one independent variable, in much the same way that they are specified in a linear regression. • Independent variables can be used just as they are in linear regressions. • All independent variables must be interval, ordinal, or “dummies.” • Interaction terms can be used, following the same rules as linear regression.

  24. Dichotomous Dependent Variable • The dependent variable must be dichotomous • Ensure that options like “don’t know” or “maybe” are declared missing or recoded. • Some computer programs require that the dependent variable be coded 0 and 1. • Even if this is not the case, recoding the variable as 0 and 1 is recommended. • Run a frequency table before running the regression to make sure you did the recoding correctly!
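A quick Python sketch of this recoding check; the raw responses and codes are hypothetical:

```python
from collections import Counter

# Hypothetical raw survey variable, including a "don't know" response.
raw = ["voted", "did not vote", "voted", "don't know", "voted"]

# Recode: action -> 1, inaction -> 0, anything else -> missing (None).
recode = {"voted": 1, "did not vote": 0}
dv = [recode.get(r) for r in raw]

# Frequency table to verify the recoding before running the regression.
print(Counter(dv))
```

The frequency table makes any miscoded or unhandled categories visible before they silently distort the regression.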

  25. Logit in STATA • In the menus, select “Statistics”, then Binary Outcomes and finally, Logistic Regression. • A new window will open; • Select the dependent variable from the drop down menu on the left. • Select independent variables from the drop down menu on the left (you can just keep clicking to add more variables to the analysis). • Click “OK”

  26. Logit in STATA: command line • Or, at the command line, simply type: logit depvar var1 var2 • Replace depvar with the name of your dependent variable, and var1, var2 with the names of your independent variables. • For example, when asking whether poorer states (medhhinc) were more likely to be won by Bush (votebush), the command would be: logit votebush medhhinc

  27. Logit in SPSS, using menus • To run logit in SPSS using the menu interface, Go the Analyze menu, select Regression and then Binary Logistic to see a dialogue window. • Select a dependent variable in the box at the top of the window. • Select independent variables (labeled “covariates” in the middle of the window).

  28. SPSS Syntax • You can also manually enter the logit regression syntax. LOGISTIC REGRESSION VARIABLES depvar /METHOD=ENTER var1 var2 var3 /PRINT=CI(95) /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5). • Replace depvar with the name of your dependent variable, and var1, var2… with the names of your independent variable(s). • The confidence intervals can be omitted, and the criteria can be adjusted as desired.

  29. Example SPSS Syntax For example, when asking whether poorer states (medhhinc) were more likely to be won by Bush (votebush), the command would be: LOGISTIC REGRESSION VARIABLES votebush /METHOD=ENTER medhhinc /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

  30. R Syntax • R requires two steps to run logit: • a call to the glm function. • a “summary” command to print the output. • The glm function, with ordinal independent variables: glm(depvar~var1+var2, family=binomial(link="logit"), na.action=na.pass) • If “var2” is categorical, then type: glm(depvar~var1+as.factor(var2),…

  31. R Summary Command • One of the easiest ways of running glm is to give the model a name, like “logit1”, at the start of the glm line. logit1<-glm(depvar~var1+var2, family=binomial(link="logit"), na.action=na.pass) • Then request a summary, using that name. summary(logit1) • This is especially useful when you are trying several different versions of the model.

  32. Logit using Interactive Menus in R • Deducer includes logit commands. • After loading the Deducer package, click on the “Analysis” menu and then “Logistic Model” • Put the dependent variable in the top box marked “Outcome” • Put any ordinal independent variables in the next box, labeled “As Numeric.” • Put any categorical independent variables in the box labeled “As Factor.”

  33. Logit output • The output generated by the computer will look – at least at the bottom of the screen – a lot like a linear regression analysis. • Look for a list of the independent variables followed by columns of coefficients, standard errors, Z-scores or Wald Chi-Square scores, and a test of significance.

  34. Logit output • After that, the output varies by program. • STATA includes 95% confidence intervals for the coefficients (reported as log odds unless odds ratios are requested). • SPSS includes the degrees of freedom and odds ratios, labeled Exp(B). • SAS includes odds ratio point estimates and confidence intervals along with Wald Chi-Square tests. • R just includes stars to indicate significance levels. • What else is presented varies widely between programs.

  35. Did poor states vote for Bush? Logit • Let's return to the sample analysis of whether poorer states were more likely to vote for Bush. • Earlier, we saw slides that illustrated the S-curve that fit the relationship between state median household income and whether or not the state voted for Bush. • In the next slides, I will present the logit regression output of that relationship produced by statistical programs.

  36. STATA logit output

  37. STATA logit output [Annotated screenshot highlighting: the maximum likelihood iteration log, the dependent variable, model goodness of fit statistics, the independent variable and its coefficient, and the significance of the coefficient.]

  38. SPSS Logit Output - 1 SPSS presents a long set of outputs, not all of which is relevant to most researchers, and it can be confusing. After three tables presented under the heading “Block 0: Beginning Block” (which includes no independent variables), look for the label Block 1: Method=Enter in big black letters. Under this label are your logit results. The first three tables, “Omnibus Tests of Model Coefficients,” “Model Summary,” and “Classification Table,” are all model goodness of fit measures.

  39. SPSS logit output - 2 • After the three goodness of fit tables, the “Variables in the Equation” table displays the independent variable(s), coefficients and significance.

  40. SPSS logit output - 2 [Annotated screenshot: the independent variable and its coefficient are boxed in red; the significance is circled in green.] • In this example, there is only one independent variable, State Median Household Income (medhhinc), measured in $100’s.

  41. Logit Presentation • Scholars typically publish the results of a logistic regression much as they present the results of a linear regression analysis. • They emphasize coefficients and statistical significance. • They normally also present measures indicating goodness of fit (how well the model as a whole explains variation in the dependent variable).

  42. Presentation Example This model explains how members of Congress voted on the North American Free Trade Agreement (NAFTA). From: Livingston, C. Don, & Wink, Kenneth A. (1997). “The passage of the North American Free Trade Agreement in the U.S. House of Representatives: Presidential leadership or presidential luck?” Presidential Studies Quarterly, 27(1), 52-70.

  43. Presentation Example [Annotated table: coefficients, goodness of fit measures, and significance are marked.] Coefficients show the effect of each independent variable on the likelihood of voting in favor of NAFTA. From: Livingston & Wink (1997).

  44. Interpretation • Not a linear model, so coefficients are not the slope of a line. • As a result, logistic regression coefficients cannot be interpreted in a simple, straightforward fashion. • Coefficients must be transformed to get an easily understood measure of the magnitude of the effect of the independent variable on the dependent variable.
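One standard transformation, sketched here in Python with a made-up coefficient, is to exponentiate the coefficient to get an odds ratio (this is what SPSS reports as Exp(B)):

```python
import math

b = 0.405  # hypothetical logit coefficient

# exp(coefficient) is the odds ratio: the factor by which the odds of the
# outcome are multiplied for a one-unit increase in the independent variable.
odds_ratio = math.exp(b)
print(round(odds_ratio, 2))  # about 1.5: the odds rise roughly 50% per unit
```

An odds ratio above 1 means the variable raises the odds of the outcome; below 1, it lowers them.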

  45. At a glance conclusions • Although gauging the magnitude of the impact of each logistic regression coefficient is difficult, one can still readily ascertain: • Whether the independent variable has a negative or positive effect on the dependent variable, by looking at the sign of the coefficient. • If the coefficient is significant, we can be confident that variation in the independent variable is associated with variation in the dependent variable.

  46. Negative or positive • As in linear regression, the sign of the coefficient tells whether a variable has a positive or negative effect on the dependent variable. • A positive coefficient means that as values of the independent variable go up, the outcome or choice described by the dependent variable becomes more likely. • A negative coefficient means that as values of the independent variable go up, the outcome or choice described by the dependent variable becomes less likely.
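A small Python check of this sign rule, using invented coefficients:

```python
import math

def prob(x, b0, b1):
    """Predicted probability that the outcome equals 1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Positive coefficient: higher x makes the outcome more likely.
print(prob(1, 0.0, 0.8) > prob(0, 0.0, 0.8))    # True
# Negative coefficient: higher x makes the outcome less likely.
print(prob(1, 0.0, -0.8) < prob(0, 0.0, -0.8))  # True
```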

  47. Statistical significance • Statistical significance means exactly the same thing as in linear regression. • If the coefficient is significant, we can be confident that the results are not due to chance. • So, if the coefficient is significant, we can be confident that variation in the independent variable explains variation in the dependent variable.

  48. Interpreting coefficients (STATA): Example from did poor states vote for Bush? • In the output presented earlier, the coefficient (highlighted by a red square) is negative and significant at P < 0.01 (green circle). • We can conclude that wealthier states are less likely to vote for Bush.

  49. Interpreting coefficients (SPSS): Example from did poor states vote for Bush? • In the output presented earlier, the coefficient (highlighted by a red square) is negative and significant at P < 0.01 (green circle). • We can conclude that wealthier states are less likely to vote for Bush.

  50. NAFTA Vote Coefficient Example Voting in favor of NAFTA was coded as 1. Therefore, a positive coefficient, like that for the independent variable HISPANIC (the proportion of Latinos in a Congressional district [green rectangle]), indicates that Representatives with high proportions of Latino constituents were more likely to vote for NAFTA, controlling for all other variables. Three stars indicate that this effect is statistically significant at p < 0.01. Source: Livingston & Wink (1997).
