Logistic Regression
Logistic regression is a statistical method for predicting dichotomous outcomes, such as yes/no or male/female. Unlike linear regression, logistic regression keeps predictions within the bounds of 0 to 1 by modeling the log-odds of the response variable. It accommodates both continuous and categorical predictors when estimating the probability of a particular outcome. Key assumptions include independence of observations and inclusion of the relevant predictors. This technique supports robust data analysis in many fields, making it a core tool for researchers and data scientists alike.
Logistic Regression Predicting Dichotomous Data
Predicting a Dichotomy • Response variable has only two states: male/female, present/absent, yes/no, etc. • Linear regression fails because it cannot keep the prediction within the bounds of 0–1 • Continuous and non-continuous predictors possible
Logistic Model • Explanatory variables used to predict the probability that the response will be present (male, yes, etc) • We fit a linear model to the log of the odds that an event will occur • If the probability that an event will occur is p, then the odds = p/(1-p)
Logits • Equations: • logit(p) = log(p/(1-p)) • logit(p) = b0 + b1x1 + b2x2 + … • So logistic regression is a linear regression on logits (logs of the odds)
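The two equations can be checked directly in base R, where `qlogis()` is the logit function and `plogis()` its inverse (a quick sketch, not part of the original slides):

```r
p <- 0.75
odds <- p / (1 - p)                       # odds = 3
logit_p <- log(odds)                      # log-odds, about 1.0986
stopifnot(all.equal(logit_p, qlogis(p)))  # qlogis() computes logit(p)
stopifnot(all.equal(plogis(logit_p), p))  # plogis() maps log-odds back to p
```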
Assumptions • Dichotomous response (only two states possible) • Outcomes statistically independent • Model contains all relevant predictors and no irrelevant ones • Sample sizes of about 50 cases per predictor
Two Approaches • Data consisting of individual cases with a dichotomous variable • Grouped data where the number present and number absent are known for each combination of explanatory variables (in practice these will usually be categorical/ordinal)
Inverting Snodgrass • Instead of seeing if houses inside the white wall are larger than those outside, we can use area to predict where the house is located.
# Use Rcmdr to create a dichotomous variable In
Snodgrass$In <- with(Snodgrass, ifelse(Inside=="Inside", 1, 0))

# Use Rcmdr to bin Area into 10 bins using numbers as labels
Snodgrass$AreaBin <- bin.var(Snodgrass$Area, bins=10, method='intervals', labels=FALSE)

# Use Rcmdr to aggregate, computing mean Area and In for each AreaBin
AggregatedData <- aggregate(Snodgrass[,c("Area","In"), drop=FALSE],
  by=list(AreaBin=Snodgrass$AreaBin), FUN=mean)

# Plot raw data
plot(In~Area, data=Snodgrass, las=1)

# Plot means by AreaBin groups
points(AggregatedData[,2:3], type="b", pch=16)
Fitting a Simple Model • We start with a simple model using Area only • Statistics | Fit Models | Generalized Linear Model • In is the response, Area is the explanatory variable • Family is binomial, Link function is logit
> GLM.1 <- glm(In ~ Area, family=binomial(logit), data=Snodgrass)
> summary(GLM.1)

Call:
glm(formula = In ~ Area, family = binomial(logit), data = Snodgrass)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1103  -0.4815  -0.1836   0.2885   2.5706

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.663071   1.818444  -4.764 1.90e-06 ***
Area         0.034760   0.007515   4.626 3.74e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 123.669  on 90  degrees of freedom
Residual deviance:  57.728  on 89  degrees of freedom
AIC: 61.728

Number of Fisher Scoring iterations: 6
Results • Slope value for Area is highly significant – Area is a significant predictor of the odds of being inside the white wall • The residual deviance is less than the degrees of freedom (an indicator that the binomial model fits)
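The deviance check in the second bullet can be read straight off the fitted object; a sketch assuming `GLM.1` has been fitted as on the previous slide:

```r
deviance(GLM.1)     # residual deviance: 57.728
df.residual(GLM.1)  # 89 degrees of freedom
# A rough goodness-of-fit test: a small p-value would suggest lack of fit
pchisq(deviance(GLM.1), df.residual(GLM.1), lower.tail=FALSE)
```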
# Rcmdr command
> GLM.1 <- glm(In ~ Area, family=binomial(logit), data=Snodgrass)

# Typed commands
> x <- seq(20, 470, 5)
> y <- predict(GLM.1, data.frame(Area=x), type="response")
> plot(In~Area, data=Snodgrass, las=2)
> points(AggregatedData[,2:3], type="b", lty=2, pch=16)
> lines(x, y, col="red", lwd=2)

# Rcmdr command
> Snodgrass$Predicted <- with(Snodgrass,
+   factor(ifelse(fitted.GLM.1 < .5, "Outside", "Inside")))

# Use Rcmdr to produce a crosstabulation of Inside and Predicted
> .Table <- xtabs(~Inside+Predicted, data=Snodgrass)
> .Table
         Predicted
Inside    Inside Outside
  Inside      29       9
  Outside      5      48
> (29 + 48)/(29 + 9 + 5 + 48)
[1] 0.8461538

Predictions are correct 84.6% of the time
Expanding the Model • Expand the model by adding Total and Types • Check the results – neither of the new variables is significant, but this could be due to the high correlation between the two (+.94) • Delete Types and try again
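These steps can be sketched in R (assuming the `Total` and `Types` columns of the Snodgrass data, as in the archdata package):

```r
# Second model: add Total and Types to the Area-only model
GLM.2 <- glm(In ~ Area + Total + Types, family=binomial(logit), data=Snodgrass)
summary(GLM.2)                         # neither Total nor Types individually significant
cor(Snodgrass$Total, Snodgrass$Types)  # strong collinearity (about +.94 per the slide)
```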
Third Model • Without Types, Total is now highly significant • ANOVA comparing the 2nd and 3rd models shows no significant difference, so the 3rd (simpler) model is preferred • Also AIC, Akaike's Information Criterion, is lower (which is better) • New model is 89% accurate
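A sketch of the comparison, assuming `GLM.2` (Area + Total + Types) and `GLM.3` (Area + Total) have been fitted as described on these slides:

```r
GLM.3 <- glm(In ~ Area + Total, family=binomial(logit), data=Snodgrass)
anova(GLM.3, GLM.2, test="Chisq")  # likelihood-ratio test of the nested models
AIC(GLM.2, GLM.3)                  # lower AIC favors the simpler model
```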
Akaike Information Criterion • AIC measures the relative goodness of fit of a statistical model • Roughly, it describes the tradeoff between the accuracy and the complexity of the model • A method of comparing different statistical models – generally prefer the model with the lower AIC
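Concretely, AIC = 2k − 2 log L, where k is the number of estimated parameters and L the maximized likelihood; the 2k penalty is what trades accuracy against complexity. A sketch verifying this by hand for the Area-only model fitted earlier:

```r
k <- attr(logLik(GLM.1), "df")         # parameters estimated (2: intercept + Area)
2 * k - 2 * as.numeric(logLik(GLM.1))  # equals AIC(GLM.1), i.e. 61.728 in the summary
```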