Logistic Regression

Logistic Regression (An Introduction)

Objective: Modeling a binary response (success, failure) or a probability through a set of predictors X1…Xp.

Let's consider the following example. In 1846, a group of people (the Donner Party) attempted to cross the Sierra desert and mountains (Journal of Anthropological Research 46 (1990), 223-42, and The Statistical Sleuth (1997) by Ramsey & Schafer). The available data for the journey of these adults are: survival (yes/no), age (continuous, in years) and gender (categorical):

Y (survival)   Age     Gender
No             23      Male
Yes            40      Female
Yes            40      Male
. . .          . . .   . . .
Yes            25      Female

In formulating the logistic regression model, we need to focus on the fact that the response is binary or a fraction of successes. Assuming each Y is Bernoulli/binary, the assumptions are going to be very different from those of a typical linear regression model with a continuous response and errors. Here we have $Y \sim \text{Bernoulli}(\pi)$, so that $E(Y) = \pi$ and $\text{Var}(Y) = \pi(1-\pi)$.

As in multiple linear regression, it is advantageous if we can express our predictors in a linear manner:

$\text{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$

The logit function is the log-odds, or the log of the odds of a success as opposed to a failure. The linear right-hand side is often written as $\eta$.

The inverse function, $\pi = \frac{e^{\eta}}{1 + e^{\eta}}$ (the logistic function), is the direct connection between the predictors in the linear part $\eta$ and the population proportion $\pi$. Since $\pi$ is E(Y|X1…Xp), we are 'linking' the mean of Y to the predictors through the logit function, and in this way it's a 'generalized' linear model.

Why use the logit and logistic functions? The logit or log-odds will range from negative infinity to positive infinity, while the probability itself will range from 0 to 1.
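As a small illustration of the link and its inverse, the following Python sketch evaluates the logit at a few probabilities and maps each value back. The function names `logit` and `inv_logit` are illustrative choices, not part of the original slides.

```python
import math

def logit(p):
    """Log-odds of a probability p, valid for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Logistic function: maps any real eta to a probability."""
    # Branch on the sign of eta so that exp() never overflows
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

# Probabilities near 0 give large negative logits, near 1 large positive ones
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    eta = logit(p)
    print(f"p = {p:4.2f}   logit = {eta:7.3f}   back-transformed = {inv_logit(eta):4.2f}")
```

The printed logits are symmetric about 0 (the logit of 0.5), which is one way to see why the log-odds scale is convenient for a linear model.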

The logit scale is also intuitively reasonable, since extremely small probabilities will result in very negative log-odds (logits) and very high probabilities will correspond to very large positive log-odds (logits). Note that for the S-shaped logistic function $\pi = e^{\eta}/(1 + e^{\eta})$, where $\pi$ is the probability of success and $\eta$ is the linear part of the model, the linear part may involve one predictor or several.

Also in support of this S-shaped logistic function is the fact that a monotonic nonlinear relationship of this form often exists between a predictor x (e.g. age) and the probability of the event. In this case of a single predictor, differing interpretations of the slope (rate of change) will be based on its sign and magnitude. Linear approximations for certain values of x are also possible.

The odds ratio and the meaning of coefficients in logistic regression

The odds of 'survive' versus 'not survive', P(Y=1)/P(Y=0), is $\pi/(1-\pi)$, and we're modeling $\log[\pi/(1-\pi)]$ as a function of X1…Xp. Suppose we originally have only one categorical predictor X1 (gender, for example), and hence we're considering our model to be:

$\text{logit}(\pi) = \beta_0 + \beta_1 X_1$

Assuming we code X1 = 0 for male and X1 = 1 for female, the log-odds that the response (survival) is 1 as compared to 0 (not survived) for males is $\beta_0$. The odds of survival for males is $e^{\beta_0}$ (written as exp($\beta_0$)). For females the log-odds is $\beta_0 + \beta_1$, so the odds ratio of survival for females relative to males is exp($\beta_1$).

Now consider adding a continuous predictor (age, X2), so that our model is the additive model $\text{logit}(\pi) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$, and let's interpret some possible odds ratios. The ratio of the odds of survival for a 45 year old woman as compared to a 30 year old woman is

exp[$\beta_0 + \beta_1 + 45\beta_2$] / exp[$\beta_0 + \beta_1 + 30\beta_2$] = exp[$(45-30)\beta_2$] = exp[$15\beta_2$]

In general, for fixed values of the other predictors (in this case gender is held constant while age has changed), the odds ratio comparing Xk = A with Xk = B will be exp[$\beta_k$(A-B)].

Let's return to the earlier example:

20 of the 45 adults survived the journey: 10 of 15 females survived, while only 10 of 30 males survived. The logistic regression model with both predictors was fit using SAS's PROC LOGISTIC (or GENMOD). The estimated odds ratio of survival for females relative to males (of equal/constant age) is exp($\hat\beta_1$) = exp(1.5973) = 4.94. This certainly seems reasonable given the larger fraction of females surviving. A point estimate for the odds ratio of a 45 year old woman surviving as opposed to a 30 year old woman is exp[15$\hat\beta_2$] = exp[(15)(-.0782)] = .309 (note that gender is held constant).
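The two odds ratios can be checked numerically. The sketch below uses the estimates quoted in the text (1.6331, 1.5973 and -0.0782) and the coding X1 = 1 for female; the helper `odds` is a hypothetical name introduced here. It forms each ratio from the full fitted odds to show that the terms held constant cancel.

```python
import math

# Estimates quoted in the text: intercept, gender (1 = female), age (per year)
b0, b_gender, b_age = 1.6331, 1.5973, -0.0782

def odds(female, age):
    """Estimated odds of survival under the additive model."""
    return math.exp(b0 + b_gender * female + b_age * age)

# Females vs. males at the same age (30 is arbitrary): the age term cancels
or_gender = odds(1, 30) / odds(0, 30)

# 45 year old vs. 30 year old woman: intercept and gender terms cancel
or_age15 = odds(1, 45) / odds(1, 30)

print(f"female vs. male:        {or_gender:.2f}  (exp(b_gender) = {math.exp(b_gender):.2f})")
print(f"age 45 vs. 30 (female): {or_age15:.3f} (exp(15*b_age) = {math.exp(15 * b_age):.3f})")
```

Changing the fixed age in the first ratio, or the gender in the second, leaves both results unchanged, which is what "holding the other predictor constant" means in an additive model.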

Looking at the actual probability of survival for a particular gender and age, what was the estimated probability of surviving the journey for a 25 year old male? logit($\hat\pi$) = 1.633 + 1.5973(0) - .0782(25) = -0.322. Using the logistic function (our inverse link), the estimated probability for such a man to have survived the journey was exp(-0.322)/[1 + exp(-0.322)] = 0.420.
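The same calculation can be written as a short function, which makes it easy to try other ages or the other gender. This is a minimal sketch using the estimates quoted in the text; `survival_probability` is a hypothetical helper name, and the female comparison is an extra illustration not taken from the slides.

```python
import math

# Estimates quoted in the text: intercept, gender (1 = female), age (per year)
b0, b_gender, b_age = 1.6331, 1.5973, -0.0782

def survival_probability(female, age):
    """Estimated P(survival) from the fitted additive logit model."""
    eta = b0 + b_gender * female + b_age * age   # linear predictor (logit scale)
    return 1.0 / (1.0 + math.exp(-eta))          # inverse link

p_male_25 = survival_probability(0, 25)
p_female_25 = survival_probability(1, 25)

print(f"25 year old male:   {p_male_25:.3f}")
print(f"25 year old female: {p_female_25:.3f}")
```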

Returning to the Donner survival example, the following estimates were obtained: maximum likelihood estimation (using a binomial model for the response) is used by SAS, and for the given data the most likely estimates for $\beta_0$, $\beta_1$ and $\beta_2$ are 1.6331, 1.5973 and -0.0782 respectively. What about inference and test statistics such as those reported in the SAS output?
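To give a feel for what maximum likelihood estimation does, here is a minimal Newton-Raphson sketch in Python. It is not the additive model above: the individual ages are not listed in the text, so the sketch fits a gender-only model, logit(pi) = b0 + b1*X1, to the grouped counts quoted earlier (10 of 30 males and 10 of 15 females survived). Its estimates therefore differ from the age-adjusted values reported by SAS, and the function name and iteration settings are assumptions made for illustration.

```python
import math

# Grouped counts from the text: (x = 1 for female, survivors, group size)
groups = [(0, 10, 30),   # males: 10 of 30 survived
          (1, 10, 15)]   # females: 10 of 15 survived

def newton_raphson(groups, max_iter=25, tol=1e-10):
    """Maximize the binomial log-likelihood for logit(pi) = b0 + b1*x."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0            # score vector (gradient of the log-likelihood)
        i00 = i01 = i11 = 0.0    # information matrix entries
        for x, y, n in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = y - n * p            # observed minus expected successes
            w = n * p * (1.0 - p)        # binomial variance weight
            g0 += resid
            g1 += resid * x
            i00 += w
            i01 += w * x
            i11 += w * x * x
        # Newton step: solve the 2x2 system (information) * delta = score
        det = i00 * i11 - i01 * i01
        d0 = (i11 * g0 - i01 * g1) / det
        d1 = (i00 * g1 - i01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

b0_hat, b1_hat = newton_raphson(groups)
print(f"b0 = {b0_hat:.4f}, b1 = {b1_hat:.4f}, odds ratio = {math.exp(b1_hat):.2f}")
```

Because this model has one parameter per group, the fitted odds match the observed odds exactly: 10/20 for males and 10/5 for females, giving an unadjusted odds ratio of 4.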

Testing & Inference

In the long run, we believe that these maximum likelihood estimates will be approximately normally distributed (for reasonably large samples). This approximate sampling distribution (and standard error for $\hat\beta_k$) leads to tests of Ho: $\beta_k$ = 0 that are analogous to those in multiple linear regression. An approximate statistic of the form $z = \hat\beta_k / SE(\hat\beta_k)$ is a Wald Z statistic; equivalently, $z^2$ can be referred to a chi-square distribution on 1 df as a Wald chi-square (as in the SAS output). In this example, both predictors (gender, age) are significant using this approximate test.

Assuming this approximate normal distribution for the estimate, a 95% confidence interval for $\beta_2$ (the change in the log-odds for a unit change in age) will be $\hat\beta_2 \pm 1.96\,SE(\hat\beta_2)$, or (-.1511, -.00489). Exponentiating the point estimate gives exp(-.078) = 0.9249, the multiplicative change in the odds of survival for a 1 year increase in age. Exponentiating the interval endpoints gives the 95% Wald confidence limits as (0.860, 0.995) for this odds ratio (given by SAS's PROC LOGISTIC for the fitted model).
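The Wald quantities for age can be sketched in a few lines of Python. One caveat: the standard error itself does not appear in the text, so the sketch backs it out from the width of the quoted 95% interval (-.1511, -.00489). That recovered value is an assumption, not a number taken from the SAS output.

```python
import math

b_age = -0.0782                     # age estimate quoted in the text
ci_low, ci_high = -0.1511, -0.00489  # 95% Wald interval quoted in the text

# ASSUMPTION: the SE is not shown in the text; recover it from the interval width
se_age = (ci_high - ci_low) / (2 * 1.96)

z = b_age / se_age                            # Wald Z statistic
wald_chi2 = z ** 2                            # Wald chi-square on 1 df
p_value = math.erfc(abs(z) / math.sqrt(2))    # two-sided normal p-value

# Back-transform from the log-odds scale to the odds-ratio scale
odds_ratio = math.exp(b_age)
or_low, or_high = math.exp(ci_low), math.exp(ci_high)

print(f"SE (recovered) = {se_age:.4f}, z = {z:.3f}, chi-square = {wald_chi2:.3f}, p = {p_value:.3f}")
print(f"odds ratio per year = {odds_ratio:.3f}, 95% CI = ({or_low:.3f}, {or_high:.3f})")
```

Exponentiating the endpoints is valid because exp is monotone increasing, so the interval on the log-odds scale maps directly onto an interval for the odds ratio.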

Note that this interval (0.860, 0.995) doesn't contain 1, and this corresponds to the earlier evidence (p = .036) against Ho: $\beta_2$ = 0. The endpoints of this approximate confidence interval (C.I.) both being less than 1 suggests less likely survival with increased age.

What is an approximate C.I. for the odds ratio of survival on this journey for a difference of 25 years in age (gender held constant)? The approximate C.I. for $\beta_2$ (the change in log-odds for a unit change in age) was (-.1511, -.00489). The endpoints of the desired interval for the log-odds ratio are [(25)(-.1511), (25)(-.00489)] = [-3.78, -0.122]. Back-transforming, this interval gives a 95% confidence interval for the odds ratio of (.023, .885).
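The rescaling step generalizes to any age difference: multiply the per-year endpoints by the number of years, then exponentiate. A minimal Python sketch, taking the per-year interval (-.1511, -.00489) from the text as given (the function name is a hypothetical choice):

```python
import math

ci_low, ci_high = -0.1511, -0.00489   # per-year 95% interval for the age coefficient

def odds_ratio_ci(years):
    """95% interval for the odds ratio associated with an age difference of `years`."""
    log_low, log_high = years * ci_low, years * ci_high   # log-odds-ratio scale
    return math.exp(log_low), math.exp(log_high)          # odds-ratio scale

for years in (1, 10, 25):
    low, high = odds_ratio_ci(years)
    print(f"{years:2d} years: ({low:.3f}, {high:.3f})")
```

The interval widens on the odds-ratio scale as the age difference grows, but its upper endpoint stays below 1 for any positive difference, since both per-year endpoints are negative.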

SAS code (additive model). PROC GENMOD is similar, but differences in syntax and capabilities (e.g. model selection) exist.

title 'Donner Party Survival Example';
title2 'Proc Logistic Results';
proc logistic;
   /* gender is categorical; GLM (indicator) coding */
   class gender / param=GLM descending;
   /* model the probability that survival = 1 */
   model survival (event='1') = gender age;
run;
