Logistic Regression Analysis. Gerrit Rooks 30-03-10. This lecture. Why do we have to know and sometimes use logistic regression ? What is the model? What is maximum likelihood estimation ? Logistics of logistic regression analysis Estimate coefficients Assess model fit
Logistic Regression Analysis Gerrit Rooks 30-03-10
This lecture • Why do we have to know and sometimes use logistic regression? • What is the model? What is maximum likelihood estimation? • Logistics of logistic regression analysis • Estimate coefficients • Assess model fit • Interpret coefficients • Check residuals • An SPSS example
Suppose we have 100 observations with information about an individual's age and whether or not this individual had some kind of heart disease (CHD)
Suppose, as a researcher, I am interested in the relation between age and the probability of CHD
To try to predict the probability of CHD, I can regress CHD on age: pr(CHD|age) = -.54 + .0218107*Age
However, linear regression is not a suitable model for probabilities, because its predictions are not restricted to the range 0 to 1. pr(CHD|age) = -.54 + .0218107*Age
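To see why the linear model is unsuitable, the fitted line above can be evaluated at a few ages. This is a small sketch; the ages 20, 50 and 80 are picked only for illustration and are not taken from the lecture data.

```python
def linear_probability(age):
    """Fitted linear regression line from the slide: pr(CHD|age) = -.54 + .0218107*Age."""
    return -0.54 + 0.0218107 * age


# Illustrative ages (an assumption, not from the lecture data).
for age in (20, 50, 80):
    print(age, round(linear_probability(age), 3))
```

For a young person the line gives a negative "probability" and for an old person a value above 1, which cannot be interpreted as probabilities.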
In this graph I plotted, for 8 age groups, the probability of having heart disease (the proportion with CHD in each group)
In the logistic regression model, predicted probabilities are always between 0 and 1; the linear part of the model (b0 + b1*age) is similar to classic regression analysis
Logistics of logistic regression • How do we estimate the coefficients? • How do we assess model fit? • How do we interpret coefficients? • How do we check regression assumptions?
Maximum likelihood estimation • The method of maximum likelihood yields values for the unknown parameters (here b0 and b1) which maximize the probability of obtaining the observed set of data.
Maximum likelihood estimation • First we have to construct the likelihood function (the probability of obtaining the observed set of data). Assuming that observations are independent: Likelihood = pr(obs1)*pr(obs2)*pr(obs3)…*pr(obsn)
The likelihood function (for the CHD data) Given that we have 100 observations, I summarize the function
Log-likelihood • For technical reasons the likelihood is transformed into the log-likelihood: LL = ln[pr(obs1)] + ln[pr(obs2)] + ln[pr(obs3)] … + ln[pr(obsn)]
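The lecture does not say which technical reasons are meant. One common reason, assumed here, is numerical: a product of many probabilities becomes too small for a computer to represent, while the sum of their logarithms stays manageable. A minimal sketch with made-up probabilities:

```python
import math

# Hypothetical probabilities of four observed outcomes (not from the CHD data).
probs = [0.9, 0.8, 0.7, 0.6]

likelihood = math.prod(probs)                      # pr(obs1)*pr(obs2)*...
log_likelihood = sum(math.log(p) for p in probs)   # ln[pr(obs1)] + ln[pr(obs2)] + ...

# Both carry the same information: exp(LL) recovers the likelihood.
print(likelihood, math.exp(log_likelihood))

# With many observations the raw product underflows to zero,
# while the log-likelihood remains a usable finite number.
many = [0.5] * 2000
underflowed = math.prod(many)
stable = sum(math.log(p) for p in many)
print(underflowed, stable)
```

Because the logarithm is an increasing function, the parameter values that maximize the log-likelihood also maximize the likelihood.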
The likelihood function (for the CHD data) A clever algorithm gives us values for the parameters b0 and b1 that maximize the likelihood of these data
This function fits very well; other values of b0 and b1 give worse results
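The slides do not name the algorithm. A common choice, assumed here, is Newton-Raphson, which repeatedly updates b0 and b1 using the slope and curvature of the log-likelihood. The sketch below uses a small made-up data set (not the CHD data), and `fit_logistic` and `log_likelihood` are illustrative names.

```python
import math


def log_likelihood(b0, b1, xs, ys):
    """Sum of ln[pr(obs)] for a logistic model with one predictor."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, iterations=25):
    """Newton-Raphson for b0 and b1, starting from zero (a sketch, no safeguards)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient with respect to b0
            g1 += (y - p) * x      # gradient with respect to b1
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical toy data: age and CHD (1 = yes, 0 = no).
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
chd = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0_hat, b1_hat = fit_logistic(ages, chd)
print(b0_hat, b1_hat, log_likelihood(b0_hat, b1_hat, ages, chd))
```

Moving b0 or b1 away from the fitted values lowers the log-likelihood, which is what "other values give worse results" means.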
Logistics of logistic regression • Estimate the coefficients • Assess model fit • Interpret coefficients • Check regression assumptions
Logistics of logistic regression • Estimate the coefficients • Assess model fit: between model comparisons, pseudo R2 (similar to multiple regression), predictive accuracy • Interpret coefficients • Check regression assumptions
Model fit: Between model comparison The log-likelihood ratio test statistic can be used to test the fit of a model: chi-square = 2*[LL(full model) - LL(reduced model)]. The test statistic has a chi-square distribution
Between model comparisons: likelihood ratio test The reduced model including only an intercept is often called the empty model. SPSS uses this model as the default reduced model.
Between model comparisons: The test can also be used for individual coefficients, by comparing the full model with a reduced model that leaves out that coefficient
Between model comparison: SPSS output This is the test statistic and its associated significance. 29.31 = -2LL(baseline) - 107.35, where 107.35 is the -2LL of the model, so -2LL(baseline) = 136.66
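The arithmetic on this slide can be checked in a few lines. This sketch assumes 1 degree of freedom, because the model adds one predictor (age) to the intercept-only baseline. For 1 degree of freedom the chi-square tail probability can be written with the complementary error function, so no statistics library is needed.

```python
import math

neg2ll_baseline = 136.66   # -2LL of the intercept-only model (from the slide)
neg2ll_model = 107.35      # -2LL of the model (from the slide)

chi_square = neg2ll_baseline - neg2ll_model
df = 1  # assumption: one extra parameter (the age coefficient)

# For a chi-square variable with 1 df: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2.0))

print(round(chi_square, 2), df, p_value)
```

A very small p-value means the model with age fits significantly better than the empty model.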
Overall model fit: pseudo R2 Just like in multiple regression, pseudo R2 ranges from 0.0 to 1.0 • Cox and Snell: cannot theoretically reach 1 • Nagelkerke: adjusted so that it can reach 1 • The measures compare the log-likelihood of the model that you want to test with the log-likelihood of the model before any predictors were entered NOTE: R2 in logistic regression tends to be (even) smaller than in multiple regression
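As a rough illustration, the pseudo R2 measures can be computed from the two -2LL values reported in the SPSS output and the sample size of 100. The formulas below are the usual textbook definitions, which I am assuming here because the slide's own formula did not survive; the likelihood-ratio version is labelled McFadden-style for the same reason.

```python
import math

n = 100                    # number of observations
neg2ll_baseline = 136.66   # -2LL before any predictors were entered
neg2ll_model = 107.35      # -2LL of the model that you want to test

# Likelihood-ratio (McFadden-style) pseudo R2: proportional reduction in -2LL.
r2_likelihood_ratio = 1.0 - neg2ll_model / neg2ll_baseline

# Cox and Snell: based on the likelihood ratio, scaled by sample size.
r2_cox_snell = 1.0 - math.exp(-(neg2ll_baseline - neg2ll_model) / n)

# Largest value Cox and Snell can reach for these data (below 1).
r2_cox_snell_max = 1.0 - math.exp(-neg2ll_baseline / n)

# Nagelkerke: Cox and Snell divided by its maximum, so it can reach 1.
r2_nagelkerke = r2_cox_snell / r2_cox_snell_max

print(round(r2_likelihood_ratio, 3), round(r2_cox_snell, 3), round(r2_nagelkerke, 3))
```

Nagelkerke is always at least as large as Cox and Snell, because it divides by a number smaller than 1.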
Overall model fit: Classification table We correctly predict 74% of our observations
Overall model fit: Classification table 14 cases had CHD while according to our model this shouldn't have happened.
Overall model fit: Classification table 12 cases didn't have CHD while according to our model this should have happened.
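The 74% follows directly from the two kinds of misclassification and the 100 observations. A small sketch of that bookkeeping; the function name is illustrative.

```python
def classification_accuracy(n_total, false_negatives, false_positives):
    """Share of cases on the diagonal of the classification table."""
    correct = n_total - false_negatives - false_positives
    return correct, correct / n_total


# 14 cases had CHD but were predicted not to (false negatives),
# 12 cases had no CHD but were predicted to have it (false positives).
correct, accuracy = classification_accuracy(100, false_negatives=14, false_positives=12)
print(correct, accuracy)
```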
Logistics of logistic regression • Estimate the coefficients • Assess model fit • Interpret coefficients • Check regression assumptions
Logistics of logistic regression • Estimate the coefficients • Assess model fit • Interpret coefficients: direction, significance, magnitude • Check regression assumptions
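Interpreting coefficients: direction We can rewrite our LRM (logistic regression model) as follows: pr(CHD|age) = 1 / (1 + e^-(b0 + b1*age)) into: ln[pr(CHD|age) / (1 - pr(CHD|age))] = b0 + b1*age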
Interpreting coefficients: direction • The original b reflects changes in the logit: b > 0 -> positive relationship • The exponentiated b reflects changes in the odds: exp(b) > 1 -> positive relationship
Testing significance of coefficients • In linear regression analysis this statistic is used to test significance: t = estimate / standard error of estimate, which follows a t-distribution • In logistic regression something similar exists • However, when b is large, the standard error tends to become inflated, so the statistic is underestimated (Type II errors are more likely) Note: This is not the Wald statistic SPSS presents!!!
Interpreting coefficients: significance SPSS presents Wald = (b / SE_b)^2, while Andy Field thinks SPSS presents this: Wald = b / SE_b
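The two versions of the statistic lead to the same conclusion. The sketch below assumes the usual convention that the squared ratio (b / SE)^2 is compared with a chi-square distribution with 1 degree of freedom, and the unsquared ratio with a standard normal distribution. The values of `b` and `se` are hypothetical and not read from the SPSS output.

```python
import math

b = 0.111    # hypothetical coefficient
se = 0.024   # hypothetical standard error

z = b / se        # unsquared ratio
wald = z ** 2     # squared ratio (chi-square with 1 df)

# Two-sided p-value from the standard normal distribution.
p_from_z = math.erfc(abs(z) / math.sqrt(2.0))
# Upper-tail p-value from the chi-square distribution with 1 df.
p_from_wald = math.erfc(math.sqrt(wald / 2.0))

print(round(z, 3), round(wald, 3), p_from_z, p_from_wald)
```

The p-values agree, so the difference between the two forms matters for reading the output, not for the significance decision.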
Interpreting coefficients: magnitude The slope coefficient (b) is interpreted as the rate of change in the "log odds" as X changes … not very useful. exp(b) is the effect of the independent variable on the odds, more useful for calculating the size of an effect
Magnitude of association: Percentage change in odds = (exponentiated coefficient_i - 1.0) * 100
Magnitude of association For our age variable: Percentage change in odds = (exponentiated coefficient - 1) * 100 = 12%. A one unit increase in age will result in a 12% increase in the odds that the person will have CHD. So if a soccer player is one year older, the odds that (s)he will have CHD are 12% higher
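The percentage change formula is easy to apply by hand or in code. In this sketch the coefficient 0.111 is a hypothetical value, chosen only because it is roughly consistent with the 12% quoted above; it is not taken from the lecture's output.

```python
import math


def percent_change_in_odds(b, units=1):
    """(exp(b * units) - 1) * 100: percentage change in the odds for a change of `units` in X."""
    return (math.exp(b * units) - 1.0) * 100.0


b_age = 0.111  # hypothetical coefficient for age

print(round(percent_change_in_odds(b_age)))            # one year older
print(round(percent_change_in_odds(b_age, units=10)))  # ten years older
print(round(percent_change_in_odds(-b_age)))           # a negative b lowers the odds
```

Note that the effect of ten years is not ten times the one-year percentage: odds multiply, so the change compounds.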
Another way: Calculating predicted probabilities So, for somebody 20 years old, the predicted probability is .04. For somebody 70 years old, the predicted probability is .91
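Predicted probabilities come from plugging an age into the logistic function. The coefficients below are hypothetical values chosen to be roughly in line with the probabilities quoted above, so the results may differ slightly in the second decimal.

```python
import math


def predicted_probability(age, b0, b1):
    """Logistic function: 1 / (1 + e^-(b0 + b1*age))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))


# Hypothetical coefficients, not read from the lecture's SPSS output.
B0, B1 = -5.31, 0.111

for age in (20, 70):
    print(age, round(predicted_probability(age, B0, B1), 2))
```

Whatever age is entered, the result stays between 0 and 1, unlike the linear regression line.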
Checking assumptions • Influential data points & residuals: follow Samantha's tips • Hosmer & Lemeshow: divides the sample into subgroups and checks whether there are differences between observed and predicted values within the subgroups • The test should not be significant; if it is, this indicates lack of fit
Hosmer & Lemeshow The test divides the sample into subgroups and checks whether observed and predicted values are about equal in these groups. The test should not be significant (indicating no difference)
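The idea behind the test can be sketched as follows: sort cases by predicted probability, cut them into groups, and compare the observed number of events with the expected number in each group. This is a simplified version under my own assumptions (equal-sized groups, no special handling of ties or of groups whose expected count is 0 or equals the group size); SPSS may group cases differently. The data are made up.

```python
def hosmer_lemeshow(outcomes, probabilities, groups=10):
    """Return (statistic, degrees of freedom) for a simplified Hosmer & Lemeshow test."""
    pairs = sorted(zip(probabilities, outcomes))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of events
        observed = sum(y for _, y in chunk)   # observed number of events
        statistic += (observed - expected) ** 2 / (expected * (1.0 - expected / size))
    return statistic, groups - 2


# Hypothetical predicted probabilities for 15 cases in three groups of five.
probs = [0.2] * 5 + [0.4] * 5 + [0.8] * 5

# Outcomes that match the predictions: 1, 2 and 4 events per group.
good_fit = [0, 0, 0, 0, 1] + [0, 0, 0, 1, 1] + [0, 1, 1, 1, 1]
# Outcomes that contradict the predictions in the first group: 5 events instead of 1.
poor_fit = [1, 1, 1, 1, 1] + [0, 0, 0, 1, 1] + [0, 1, 1, 1, 1]

stat_good, df = hosmer_lemeshow(good_fit, probs, groups=3)
stat_poor, _ = hosmer_lemeshow(poor_fit, probs, groups=3)
print(stat_good, stat_poor, df)
```

The statistic is compared with a chi-square distribution with (number of groups - 2) degrees of freedom. A large value, and therefore a significant test, points to lack of fit.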
Examining residuals in LR • Isolate points for which the model fits poorly • Isolate influential data points