Logistic Regression Analysis

Logistic Regression Analysis Gerrit Rooks 30-03-10

This lecture • Why do we have to know and sometimes use logistic regression? • What is the model? What is maximum likelihood estimation? • Logistics of logistic regression analysis • Estimate coefficients • Assess model fit • Interpret coefficients • Check residuals • An SPSS example

Suppose we have 100 observations with information about an individual's age and whether or not this individual had some kind of heart disease (CHD)

A graphic representation of the data

Suppose, as a researcher, I am interested in the relation between age and the probability of CHD

To try to predict the probability of CHD, I can regress CHD on Age: pr(CHD|age) = -.54 + .0218107*Age

However, linear regression is not a suitable model for probabilities. pr(CHD|age) = -.54 + .0218107*Age
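A quick check of why the linear model is unsuitable: the sketch below plugs a few ages into the fitted line from the slide. The ages are illustrative choices, not values taken from the data.

```python
# Linear probability model from the slide: pr(CHD|age) = -.54 + .0218107 * Age
INTERCEPT = -0.54
SLOPE = 0.0218107


def linear_probability(age):
    """Predicted 'probability' of CHD under the linear model."""
    return INTERCEPT + SLOPE * age


# Illustrative ages (assumed for this sketch, not taken from the data set).
for age in (20, 40, 60, 80):
    print(age, round(linear_probability(age), 3))
```

For a young enough person the line predicts a negative probability, and for an old enough person a probability above 1, which is impossible for a probability.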

In this graph for 8 age groups, I plotted the probability of having a heart disease (proportion)

Instead of a linear probability model, I need a non-linear one

Something like this

This is the logistic regression model

Predicted probabilities are always between 0 and 1 • Similar to classic regression analysis

Logistics of logistic regression • How do we estimate the coefficients? • How do we assess model fit? • How do we interpret coefficients? • How do we check regression assumptions?

Maximum likelihood estimation • The method of maximum likelihood yields values for the unknown parameters which maximize the probability of obtaining the observed set of data.

Maximum likelihood estimation • First we have to construct the likelihood function (probability of obtaining the observed set of data). Likelihood = pr(obs1)*pr(obs2)*pr(obs3)*…*pr(obsn), assuming that observations are independent

The likelihood function (for the CHD data) Given that we have 100 observations, I summarize the function

Log-likelihood • For technical reasons the likelihood is transformed into the log-likelihood: LL = ln[pr(obs1)] + ln[pr(obs2)] + ln[pr(obs3)] + … + ln[pr(obsn)]
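The likelihood and the log-likelihood can be sketched in a few lines. This is a minimal illustration, assuming the usual logistic form p = 1 / (1 + exp(-(b0 + b1*Age))). The ten observations are made up (they are not the 100 CHD observations), and the candidate coefficients are hypothetical.

```python
import math

# Made-up toy data (NOT the 100 CHD observations from the lecture).
AGES = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
CHD = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]


def predicted_probability(age, b0, b1):
    """Logistic model: probability of CHD at a given age."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))


def observation_probability(age, y, b0, b1):
    """pr(obs): p if CHD was observed (y = 1), otherwise 1 - p."""
    p = predicted_probability(age, b0, b1)
    return p if y == 1 else 1.0 - p


def likelihood(b0, b1):
    """Product of pr(obs_i), assuming independent observations."""
    product = 1.0
    for age, y in zip(AGES, CHD):
        product *= observation_probability(age, y, b0, b1)
    return product


def log_likelihood(b0, b1):
    """Sum of ln[pr(obs_i)]: the same quantity on the log scale."""
    return sum(math.log(observation_probability(age, y, b0, b1))
               for age, y in zip(AGES, CHD))


# Hypothetical candidate coefficients, chosen only for illustration.
for b0, b1 in [(-5.0, 0.05), (-5.0, 0.11), (-5.0, 0.40)]:
    print(b0, b1, round(log_likelihood(b0, b1), 2))
```

A higher (less negative) log-likelihood means the candidate coefficients make the observed data more probable. Summing logs also avoids the very small numbers that a product of many probabilities produces.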

The likelihood function (for the CHD data) A clever algorithm gives us values for the parameters b0 and b1 that maximize the likelihood of this data
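The slide does not say which algorithm is used. One common choice is Newton-Raphson, sketched below as an assumption rather than as the method SPSS actually runs. The data are made up, and the function names are hypothetical.

```python
import math

# Made-up toy data (NOT the CHD data from the lecture).
AGES = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
CHD = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]


def prob(age, b0, b1):
    """Logistic model: predicted probability at a given age."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))


def log_likelihood(b0, b1):
    """Sum of ln[pr(obs_i)] over the toy data."""
    return sum(math.log(prob(x, b0, b1)) if y == 1
               else math.log(1.0 - prob(x, b0, b1))
               for x, y in zip(AGES, CHD))


def gradient(b0, b1):
    """First derivatives of the log-likelihood with respect to b0 and b1."""
    g0 = sum(y - prob(x, b0, b1) for x, y in zip(AGES, CHD))
    g1 = sum((y - prob(x, b0, b1)) * x for x, y in zip(AGES, CHD))
    return g0, g1


def fit(max_iter=50, tol=1e-10):
    """Newton-Raphson: repeatedly step uphill until the steps vanish."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0, g1 = gradient(b0, b1)
        # Entries of the negative Hessian [[a, b], [b, c]].
        weights = [prob(x, b0, b1) * (1.0 - prob(x, b0, b1)) for x in AGES]
        a = sum(weights)
        b = sum(w * x for w, x in zip(weights, AGES))
        c = sum(w * x * x for w, x in zip(weights, AGES))
        det = a * c - b * b
        # Step = inverse(negative Hessian) times gradient, written out for 2x2.
        step0 = (c * g0 - b * g1) / det
        step1 = (a * g1 - b * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) < tol and abs(step1) < tol:
            break
    return b0, b1


b0_hat, b1_hat = fit()
print(round(log_likelihood(b0_hat, b1_hat), 3))
print(round(log_likelihood(b0_hat, 0.5 * b1_hat), 3))  # worse
print(round(log_likelihood(b0_hat, 2.0 * b1_hat), 3))  # worse
```

At the fitted values the gradient is (numerically) zero, and moving the slope away from its estimate lowers the log-likelihood.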

Estimation of coefficients: SPSS Results

This function fits very well; other values of b0 and b1 give worse results

Illustration 1: suppose we chose .05X instead of .11X

Illustration 2: suppose we chose .40X instead of .11X

Logistics of logistic regression • Estimate the coefficients • Assess model fit • Interpret coefficients • Check regression assumptions

Logistics of logistic regression • Estimate the coefficients • Assess model fit • Between model comparisons • Pseudo R2 (similar to multiple regression) • Predictive accuracy • Interpret coefficients • Check regression assumptions

Model fit: Between model comparison • The log-likelihood ratio test statistic can be used to test the fit of a model: chi-square = 2[LL(full model) - LL(reduced model)] • The test statistic has a chi-square distribution

Between model comparisons: likelihood ratio test (full model vs. reduced model) • The model including only an intercept is often called the empty model. SPSS uses this model as a default.

Between model comparisons: the test can be used for individual coefficients (full model vs. reduced model)

Between model comparison: SPSS output • This is the test statistic, and its associated significance • 29.31 = -107.35 - 2LL(baseline), so -2LL(baseline) = 136.66
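The arithmetic on this slide can be reproduced as a small sketch. The two -2LL values are those from the SPSS output above. The use of 1 degree of freedom is an assumption (one predictor, age, is added to the empty model), and the helper name is hypothetical.

```python
import math

NEG2LL_BASELINE = 136.66  # -2LL of the empty (intercept-only) model
NEG2LL_MODEL = 107.35     # -2LL of the model that includes age

# Likelihood ratio statistic: 2[LL(full) - LL(reduced)]
# = (-2LL reduced) - (-2LL full).
chi_square = NEG2LL_BASELINE - NEG2LL_MODEL


def chi_square_upper_tail_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


p_value = chi_square_upper_tail_df1(chi_square)
print(round(chi_square, 2))
print(p_value < 0.001)
```

The closed form with `math.erfc` only holds for 1 degree of freedom; a comparison that adds several predictors at once would need a general chi-square distribution function.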

Overall model fit: pseudo R2 • Just like in multiple regression, pseudo R2 ranges from 0.0 to 1.0 • Cox and Snell: cannot theoretically reach 1 • Nagelkerke: adjusted so that it can reach 1 • Based on the log-likelihood of the model that you want to test and the log-likelihood of the model before any predictors were entered • NOTE: R2 in logistic regression tends to be (even) smaller than in multiple regression
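As an illustration, the two pseudo R2 measures can be computed from the -2LL values in the SPSS output (136.66 for the baseline, 107.35 for the model) and the sample size of 100. The formulas below are the standard textbook definitions of Cox and Snell and Nagelkerke; that the slide uses exactly these forms is an assumption.

```python
import math

N = 100                   # number of observations
NEG2LL_BASELINE = 136.66  # -2LL before any predictors were entered
NEG2LL_MODEL = 107.35     # -2LL of the model that you want to test


def cox_snell_r2(neg2ll_baseline, neg2ll_model, n):
    """Cox and Snell: 1 - (L_baseline / L_model) ** (2 / n)."""
    return 1.0 - math.exp((neg2ll_model - neg2ll_baseline) / n)


def nagelkerke_r2(neg2ll_baseline, neg2ll_model, n):
    """Cox and Snell divided by its maximum attainable value."""
    maximum = 1.0 - math.exp(-neg2ll_baseline / n)
    return cox_snell_r2(neg2ll_baseline, neg2ll_model, n) / maximum


print(round(cox_snell_r2(NEG2LL_BASELINE, NEG2LL_MODEL, N), 3))
print(round(nagelkerke_r2(NEG2LL_BASELINE, NEG2LL_MODEL, N), 3))
```

Because the maximum of Cox and Snell is below 1, the Nagelkerke value is always the larger of the two.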

Overall model fit: Classification table We correctly predict 74% of our observations

Overall model fit: Classification table 14 cases had a CHD while, according to our model, this shouldn't have happened.

Overall model fit: Classification table 12 cases didn't have a CHD while, according to our model, this should have happened.
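The 74% follows directly from the two kinds of misclassification on these slides. A small worked sketch, using only the counts stated above and the 100 observations:

```python
N_OBSERVATIONS = 100
FALSE_NEGATIVES = 14  # had CHD, but the model predicted no CHD
FALSE_POSITIVES = 12  # no CHD, but the model predicted CHD

correct = N_OBSERVATIONS - FALSE_NEGATIVES - FALSE_POSITIVES
accuracy = correct / N_OBSERVATIONS

print(correct)
print(f"{accuracy:.0%}")
```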

Logistics of logistic regression • Estimate the coefficients • Assess model fit • Interpret coefficients • Direction • Significance • Magnitude • Check regression assumptions

Interpreting coefficients: direction We can rewrite our LRM as follows: into:

Interpreting coefficients: direction • Original b reflects changes in logit: b > 0 -> positive relationship • Exponentiated b reflects the changes in odds: exp(b) > 1 -> positive relationship

Testing significance of coefficients • In linear regression analysis this statistic is used to test significance: t = estimate / standard error of estimate (t-distribution) • In logistic regression something similar exists • However, when b is large, the standard error tends to become inflated, hence underestimation (Type II errors are more likely) • Note: This is not the Wald statistic SPSS presents!

Interpreting coefficients: significance SPSS presents While Andy Field thinks SPSS presents this:

Interpreting coefficients: magnitude • The slope coefficient (b) is interpreted as the rate of change in the "log odds" as X changes … not very useful. • exp(b) is the effect of the independent variable on the odds, more useful for calculating the size of an effect

Magnitude of association: Percentage change in odds = (exponentiated coefficient_i - 1.0) * 100

Magnitude of association • For our age variable: Percentage change in odds = (exponentiated coefficient - 1) * 100 = 12% • A one-unit increase in age will result in a 12% increase in the odds that the person will have CHD • So if a person is one year older, the odds that (s)he will have CHD are 12% higher
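The 12% can be checked with a short calculation. This sketch assumes the rounded age coefficient of .11 shown on the estimation slides. The `units` parameter is a hypothetical addition to show the effect of a larger age difference.

```python
import math

B_AGE = 0.11  # rounded age coefficient (.11X on the estimation slides)


def percent_change_in_odds(b, units=1):
    """(exp(b * units) - 1) * 100: percentage change in the odds."""
    return (math.exp(b * units) - 1.0) * 100.0


print(round(percent_change_in_odds(B_AGE)))      # one year older
print(round(percent_change_in_odds(B_AGE, 10)))  # ten years older
```

Note that the effect compounds: ten years older multiplies the odds by exp(10 * b), which is much more than ten times the one-year percentage.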

Another way: Calculating predicted probabilities • So, for somebody 20 years old, the predicted probability is .04 • For somebody 70 years old, the predicted probability is .91
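A minimal sketch of this calculation, assuming the usual logistic form p = 1 / (1 + exp(-(b0 + b1*Age))). The slope .11 is the rounded value from the estimation slides. The intercept -5.31 is an assumed value that does not appear in this transcript, so the output only approximates the probabilities quoted above.

```python
import math

B0 = -5.31  # ASSUMED intercept: not stated in the transcript
B1 = 0.11   # rounded slope for age (.11X on the estimation slides)


def predicted_probability(age, b0=B0, b1=B1):
    """Logistic model: predicted probability of CHD at a given age."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))


for age in (20, 70):
    print(age, round(predicted_probability(age), 2))
```

Whatever the age, the result stays strictly between 0 and 1, unlike the linear probability model.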

Checking assumptions • Influential data points & residuals • Follow Samantha's tips • Hosmer & Lemeshow • Divides sample in subgroups • Checks whether there are differences between observed and predicted between subgroups • Test should not be significant; if so: indication of lack of fit

Hosmer & Lemeshow • Test divides sample in subgroups, checks whether the difference between observed and predicted is about equal in these groups • Test should not be significant (indicating no difference)
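The idea behind the test can be sketched as follows. This is a simplified illustration of the Hosmer & Lemeshow statistic, not SPSS's implementation. The function name, the grouping rule (equal-sized groups after sorting by predicted probability), and the toy data are all assumptions. The p-value, which comes from a chi-square distribution with (number of groups - 2) degrees of freedom, is left out.

```python
def hosmer_lemeshow_statistic(outcomes, probabilities, n_groups=10):
    """Sum over groups of (observed - expected)^2 / (n * mean_p * (1 - mean_p)).

    Assumes every predicted probability lies strictly between 0 and 1.
    """
    pairs = sorted(zip(probabilities, outcomes))
    n = len(pairs)
    statistic = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        size = len(group)
        expected = sum(p for p, _ in group)  # expected number of events
        observed = sum(y for _, y in group)  # observed number of events
        mean_p = expected / size
        statistic += (observed - expected) ** 2 / (size * mean_p * (1.0 - mean_p))
    return statistic


# Made-up predicted probabilities and two made-up sets of outcomes.
PREDICTED = [0.2] * 5 + [0.8] * 5
WELL_FITTING = [0, 0, 0, 0, 1] + [1, 1, 1, 1, 0]
POORLY_FITTING = [1, 1, 1, 1, 0] + [0, 0, 0, 0, 1]

print(hosmer_lemeshow_statistic(WELL_FITTING, PREDICTED, n_groups=2))
print(hosmer_lemeshow_statistic(POORLY_FITTING, PREDICTED, n_groups=2))
```

When observed and predicted counts agree in every subgroup the statistic is close to zero (a non-significant test); large values indicate lack of fit.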

Examining residuals in LR • Isolate points for which the model fits poorly • Isolate influential data points
