Logistic Regression
Logistic regression is a statistical method for predicting dichotomous outcomes, such as yes/no or male/female. Unlike linear regression, logistic regression keeps predictions within the bounds of 0 to 1 by modeling the log-odds of the response variable. It accommodates both continuous and categorical predictors when estimating the probability of a particular outcome. Key assumptions include independence of observations and inclusion of the relevant predictors. This technique supports robust data analysis in many fields, making it a core tool for researchers and data scientists alike.
Logistic Regression Predicting Dichotomous Data
Predicting a Dichotomy • Response variable has only two states: male/female, present/absent, yes/no, etc. • Linear regression fails because it cannot keep the prediction within the bounds of 0–1 • Continuous and non-continuous predictors possible
Logistic Model • Explanatory variables used to predict the probability that the response will be present (male, yes, etc) • We fit a linear model to the log of the odds that an event will occur • If the probability that an event will occur is p, then the odds = p/(1-p)
Logits • Equations: • logit(p) = log(p/(1-p)) • logit(p) = b0 + b1x1 + b2x2 + … • So logistic regression is a linear regression on logits (logs of the odds)
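The two equations can be checked directly in base R, where `qlogis()` is the logit function and `plogis()` its inverse (a quick sketch, not part of the original slides):

```r
p <- 0.75
odds <- p / (1 - p)                       # odds = 3
logit_p <- log(odds)                      # log-odds, about 1.0986
stopifnot(all.equal(logit_p, qlogis(p)))  # qlogis() computes logit(p)
stopifnot(all.equal(plogis(logit_p), p))  # plogis() maps log-odds back to p
```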
Assumptions • Dichotomous response (only two states possible) • Outcomes statistically independent • Model contains all relevant predictors and no irrelevant ones • Sample sizes of about 50 cases per predictor
Two Approaches • Data consisting of individual cases with a dichotomous variable • Grouped data where the number present and number absent are known for each combination of explanatory variables (in practice these will usually be categorical/ordinal)
Inverting Snodgrass • Instead of seeing if houses inside the white wall are larger than those outside, we can use area to predict where the house is located.
# Use Rcmdr to create a dichotomous variable In
Snodgrass$In <- with(Snodgrass, ifelse(Inside=="Inside", 1, 0))

# Use Rcmdr to bin Area into 10 bins using numbers as labels
Snodgrass$AreaBin <- bin.var(Snodgrass$Area, bins=10, method='intervals', labels=FALSE)

# Use Rcmdr to aggregate, computing mean Area and In for each AreaBin
AggregatedData <- aggregate(Snodgrass[,c("Area","In"), drop=FALSE],
  by=list(AreaBin=Snodgrass$AreaBin), FUN=mean)

# Plot raw data
plot(In~Area, data=Snodgrass, las=1)

# Plot means by AreaBin groups
points(AggregatedData[,2:3], type="b", pch=16)
Fitting a Simple Model • We start with a simple model using Area only • Statistics | Fit Models | Generalized Linear Model • In is the response, Area is the explanatory variable • Family is binomial, Link function is logit
> GLM.1 <- glm(In ~ Area, family=binomial(logit), data=Snodgrass)
> summary(GLM.1)

Call:
glm(formula = In ~ Area, family = binomial(logit), data = Snodgrass)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1103  -0.4815  -0.1836   0.2885   2.5706

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.663071   1.818444  -4.764 1.90e-06 ***
Area         0.034760   0.007515   4.626 3.74e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 123.669  on 90  degrees of freedom
Residual deviance:  57.728  on 89  degrees of freedom
AIC: 61.728

Number of Fisher Scoring iterations: 6
Results • Slope value for Area is highly significant – Area is a significant predictor of the odds of being inside the white wall • The residual deviance is less than the degrees of freedom (an indicator that the binomial model fits)
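The deviance check in the second bullet can be read straight off the fitted object; a sketch assuming `GLM.1` has been fitted as on the previous slide:

```r
deviance(GLM.1)     # residual deviance: 57.728
df.residual(GLM.1)  # 89 degrees of freedom
# A rough goodness-of-fit test: a small p-value would suggest lack of fit
pchisq(deviance(GLM.1), df.residual(GLM.1), lower.tail=FALSE)
```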
# Rcmdr command
> GLM.1 <- glm(In ~ Area, family=binomial(logit), data=Snodgrass)

# Typed commands
> x <- seq(20, 470, 5)
> y <- predict(GLM.1, data.frame(Area=x), type="response")
> plot(In~Area, data=Snodgrass, las=2)
> points(AggregatedData[,2:3], type="b", lty=2, pch=16)
> lines(x, y, col="red", lwd=2)

# Rcmdr command
> Snodgrass$Predicted <- with(Snodgrass,
+   factor(ifelse(fitted.GLM.1 < .5, "Outside", "Inside")))

# Use Rcmdr to produce a crosstabulation of Inside and Predicted
> .Table <- xtabs(~Inside+Predicted, data=Snodgrass)
> .Table
         Predicted
Inside    Inside Outside
  Inside      29       9
  Outside      5      48
> (29 + 48)/(29 + 9 + 5 + 48)
[1] 0.8461538

Predictions are correct 84.6% of the time
Expanding the Model • Expand the model by adding Total and Types • Check the results – neither of the new variables is significant, but this could be due to the high correlation between the two (+.94) • Delete Types and try again
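These steps can be sketched in R (assuming the `Total` and `Types` columns of the Snodgrass data, as in the archdata package):

```r
# Second model: add Total and Types to the Area-only model
GLM.2 <- glm(In ~ Area + Total + Types, family=binomial(logit), data=Snodgrass)
summary(GLM.2)                         # neither Total nor Types individually significant
cor(Snodgrass$Total, Snodgrass$Types)  # strong collinearity (about +.94 per the slide)
```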
Third Model • Without Types, Total is now highly significant • ANOVA comparing the 2nd and 3rd models shows no significant difference, so the 3rd (simpler) model is preferred • Also AIC, Akaike's Information Criterion, is lower (which is better) • New model is 89% accurate
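A sketch of the comparison, assuming `GLM.2` (Area + Total + Types) and `GLM.3` (Area + Total) have been fitted as described on these slides:

```r
GLM.3 <- glm(In ~ Area + Total, family=binomial(logit), data=Snodgrass)
anova(GLM.3, GLM.2, test="Chisq")  # likelihood-ratio test of the nested models
AIC(GLM.2, GLM.3)                  # lower AIC favors the simpler model
```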
Akaike Information Criterion • AIC measures the relative goodness of fit of a statistical model • Roughly, it describes the tradeoff between the accuracy and the complexity of the model • A method of comparing different statistical models – generally prefer the model with the lower AIC
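Concretely, AIC = 2k − 2 log L, where k is the number of estimated parameters and L the maximized likelihood; the 2k penalty is what trades accuracy against complexity. A sketch verifying this by hand for the Area-only model fitted earlier:

```r
k <- attr(logLik(GLM.1), "df")         # parameters estimated (2: intercept + Area)
2 * k - 2 * as.numeric(logLik(GLM.1))  # equals AIC(GLM.1), i.e. 61.728 in the summary
```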