Statistical Analysis SC504/HS927 Spring Term 2008

Statistical Analysis SC504/HS927 Spring Term 2008 • Introduction to Logistic Regression • Dr. Daniel Nehring

Outline • Preliminaries: The SPSS syntax • Linear regression and logistic regression • OLS with a binary dependent variable • Principles of logistic regression • Interpreting logistic regression coefficients • Advanced principles of logistic regression (for self-study) • Source: http://privatewww.essex.ac.uk/~dfnehr

PRELIMINARIES

The SPSS syntax • Simple programming language allowing access to all SPSS operations • Access to operations not covered in the main interface • Accessible through syntax windows • Accessible through ‘Paste’ buttons in every window of the main interface • Documentation available in ‘Help’ menu

Using SPSS syntax files • Saved in a separate file format through the syntax window • Run commands by highlighting them and pressing the arrow button. • Comments can be entered into the syntax. • Copy-paste operations allow easy learning of the syntax. • The syntax is preferable at all times to the main interface to keep a log of work and identify and correct mistakes.

PART I

Simple linear regression • Relation between 2 continuous variables • Regression coefficient $b_1$ • Measures association between y and x • Amount by which y changes on average when x changes by one unit • Least squares method • [Figure: y plotted against x, with the slope of the fitted line marked]

Multiple linear regression • Relation between a continuous variable and a set of i continuous variables • Partial regression coefficients $b_i$ • Amount by which y changes on average when $x_i$ changes by one unit and all the other $x_i$ remain constant • Measures association between $x_i$ and y adjusted for all other $x_i$

Multiple linear regression: terminology • y: predicted, response or dependent variable • $x_i$: predictor, explanatory or independent variables

OLS with a binary dependent variable • Binary variables can take only 2 possible values: • yes/no (e.g. educated to degree level, smoker/non-smoker) • success/failure (e.g. of a medical treatment) • Coded 1 or 0 (by convention 1 = yes/success) • Using OLS for a binary dependent variable → predicted values can be interpreted as probabilities; expected to lie between 0 and 1 • But nothing constrains the regression model to predict values between 0 and 1; values less than 0 and greater than 1 are possible and have no logical interpretation • Approaches which ensure that predicted values lie between 0 and 1 are required, such as logistic regression
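
The problem with out-of-range predictions can be illustrated with a minimal Python sketch. The ten data points are hypothetical (they are not from the course data), and the least-squares fit uses the standard single-predictor formulas.

```python
# Hypothetical data: x is any continuous predictor, y is a 0/1 outcome.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept for a single predictor.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Predicted "probabilities" from the straight line.
predicted = [intercept + slope * xi for xi in x]
print(f"prediction at x=0: {predicted[0]:.3f}")   # falls below 0
print(f"prediction at x=9: {predicted[-1]:.3f}")  # falls above 1
```

With these made-up values the fitted line predicts a negative value at the lowest x and a value above 1 at the highest x, neither of which can be read as a probability.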

Fitting equation to the data • Linear regression: Least squares • Logistic regression: Maximum likelihood • Likelihood function • Estimates parameters with property that likelihood (probability) of observed data is higher than for any other values • Practically easier to work with log-likelihood

Maximum Likelihood Estimation (MLE) • OLS cannot be used for logistic regression since the relationship between the dependent and independent variable is non-linear • MLE is used instead to estimate coefficients on independent variables (parameters) • Of all possible values of these parameters, MLE chooses those under which the model would have been most likely to generate the observed sample
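
As a rough illustration of the maximum likelihood idea, the sketch below evaluates the log-likelihood of the simplest possible logit model (intercept only, no independent variables) on hypothetical data. The function name `log_likelihood` and the data are assumptions made for this example; statistical software such as SPSS finds the maximum iteratively for models with independent variables.

```python
import math

y = [1, 0, 0, 1, 1, 0, 1, 1]  # hypothetical 0/1 outcomes

def log_likelihood(a, outcomes):
    """Log-likelihood of an intercept-only logit model with intercept a."""
    p = math.exp(a) / (1 + math.exp(a))
    return sum(math.log(p) if yi == 1 else math.log(1 - p) for yi in outcomes)

# For this model the maximum likelihood estimate has a closed form:
# the predicted probability equals the sample proportion of 1s.
p_hat = sum(y) / len(y)
a_hat = math.log(p_hat / (1 - p_hat))

# Any other value of the intercept gives a lower log-likelihood.
for a in (a_hat - 0.5, a_hat, a_hat + 0.5):
    print(f"a = {a:+.3f}  log-likelihood = {log_likelihood(a, y):.4f}")
```

The middle row (the estimate itself) has the highest log-likelihood, which is what "most likely to generate the observed sample" means in practice.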

Logistic regression • Models the relationship between a set of variables $x_i$, which may be • dichotomous (yes/no) • categorical (social class, ...) • continuous (age, ...) • and a dichotomous (binary) variable Y

PART II

Logistic regression (1) • ‘Logistic regression’ or ‘logit’ • p is the probability of an event occurring • 1-p is the probability of the event not occurring • p can take any value from 0 to 1 • the odds of the event occurring = $\frac{p}{1-p}$ • the dependent variable in a logistic regression is the natural log of the odds: $\ln\left(\frac{p}{1-p}\right)$

Logistic regression (2) • ln(.) can take any value, but p will always range from 0 to 1 • the equation to be estimated is: $\ln\left(\frac{p}{1-p}\right) = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$

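Logistic regression (3) • Logistic transformation: $\ln\left(\frac{P(y|x)}{1 - P(y|x)}\right) = a + b_1 x_1 + \dots + b_k x_k$ • the left-hand side is the logit of P(y|x)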

Predicting p • let $z_i = a + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki}$ • then, to predict p for individual i, $p_i = \frac{e^{z_i}}{1 + e^{z_i}}$
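
The link between a probability and its log odds can be sketched in Python. The function names `logit` and `inverse_logit` are chosen for this example only; the second function turns a value of the linear equation back into a probability.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Probability corresponding to log odds z."""
    return math.exp(z) / (1 + math.exp(z))

# Going from p to log odds and back recovers p.
for p in (0.1, 0.5, 0.9):
    z = logit(p)
    print(f"p = {p}: log odds = {z:+.3f}, back-transformed p = {inverse_logit(z):.3f}")

# However large or small the log odds, p stays between 0 and 1.
for z in (-10, 0, 10):
    print(f"log odds = {z:+d}: p = {inverse_logit(z):.5f}")
```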

Logistic function (1) • [Figure: probability of event y plotted against x]

PART III

Interpreting logistic regression coefficients • intercept is value of ‘log of the odds’ when all independent variables are zero • each slope coefficient is the change in log odds from a 1-unit increase in the independent variable, controlling for the effects of other variables • two problems: • log odds not easy to interpret • change in probability from a 1-unit increase in one independent variable depends on the values of the other independent variables • but the exponent of b ($e^b$) is not dependent on values of other independent variables and is the odds ratio
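
A small numerical sketch may help here. The intercept and slope below are hypothetical values chosen for illustration. The same 1-unit increase in x changes the probability by different amounts depending on where it starts, while the ratio of the odds is $e^b$ every time.

```python
import math

def inverse_logit(z):
    return math.exp(z) / (1 + math.exp(z))

def odds(p):
    return p / (1 - p)

a, b = -2.0, 0.5  # hypothetical intercept and slope

results = []
for x in (0, 4):
    p_before = inverse_logit(a + b * x)
    p_after = inverse_logit(a + b * (x + 1))
    change_in_p = p_after - p_before
    odds_ratio = odds(p_after) / odds(p_before)
    results.append((change_in_p, odds_ratio))
    print(f"x: {x} -> {x + 1}  change in p = {change_in_p:.4f}  odds ratio = {odds_ratio:.4f}")

print(f"exp(b) = {math.exp(b):.4f}")
```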

Odds ratio • odds ratio for coefficient on a dummy variable, e.g. female = 1 for women, 0 for men • odds ratio = ratio of the odds of the event occurring for women to the odds of its occurring for men • odds for women are $e^b$ times odds for men

General rules for interpreting logistic regression coefficients • if $b_1$ > 0, $X_1$ increases p • if $b_1$ < 0, $X_1$ decreases p • if odds ratio > 1, $X_1$ increases p • if odds ratio < 1, $X_1$ decreases p • if CI for $b_1$ includes 0, $X_1$ does not have a statistically significant effect on p • if CI for odds ratio includes 1, $X_1$ does not have a statistically significant effect on p

An example: modelling the relationship between disability, age and income in the 65+ population • dependent variable = presence of disability (1 = yes, 0 = no) • independent variables: • $X_1$: age in years in excess of 65 (i.e. 65 → 0, 70 → 5) • $X_2$: whether has low income (in lowest third of the income distribution) • data: Health Survey for England, 2000

Example: logistic regression estimate for probability of being disabled, people aged 65+

PART IV

Odds, log odds, odds ratios and probabilities

Odds, odds ratios and probabilities • $p_j$ = 0.2, i.e. a 20% probability • $\text{odds}_j$ = 0.2/(1-0.2) = 0.2/0.8 = 0.25 • $p_k$ = 0.4 • $\text{odds}_k$ = 0.4/0.6 = 0.67 • relative probability/risk $p_j/p_k$ = 0.2/0.4 = 0.5 • odds ratio, $\text{odds}_j/\text{odds}_k$ = 0.25/0.67 = 0.37 • odds ratio is not equal to relative probability/risk • except approximately if $p_j$ and $p_k$ are small
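
The arithmetic on this slide can be checked with a few lines of Python. The second pair of probabilities (0.002 and 0.004) is an added, hypothetical example of the "small p" case.

```python
def odds(p):
    return p / (1 - p)

# Values from the slide.
p_j, p_k = 0.2, 0.4
relative_risk = p_j / p_k
odds_ratio = odds(p_j) / odds(p_k)
print(f"odds_j = {odds(p_j):.2f}, odds_k = {odds(p_k):.2f}")
print(f"relative risk = {relative_risk:.3f}, odds ratio = {odds_ratio:.3f}")

# Hypothetical small probabilities with the same relative risk of 0.5.
small_j, small_k = 0.002, 0.004
small_odds_ratio = odds(small_j) / odds(small_k)
print(f"small p: relative risk = {small_j / small_k:.3f}, odds ratio = {small_odds_ratio:.3f}")
```

Without rounding, the odds ratio for the slide's values is 0.375; the slide's 0.37 comes from rounding $\text{odds}_k$ to 0.67 first. For the small probabilities the odds ratio is very close to the relative risk.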

Points to note from logit example.xls • if you see an odds ratio of e.g. 1.5 for a dummy variable indicating female, beware of saying ‘women have a probability 50% higher than men’. Only if both p’s are small can you say this. • better to calculate probabilities for example cases and compare these
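
The warning about reading an odds ratio of 1.5 as "50% higher probability" can be sketched as follows. The two baseline probabilities for men are hypothetical, and `apply_odds_ratio` is a helper written for this example.

```python
def apply_odds_ratio(p_men, odds_ratio):
    """Probability for women implied by the men's probability and the odds ratio."""
    odds_men = p_men / (1 - p_men)
    odds_women = odds_ratio * odds_men
    return odds_women / (1 + odds_women)

ratios = {}
for p_men in (0.01, 0.5):
    p_women = apply_odds_ratio(p_men, 1.5)
    ratios[p_men] = p_women / p_men
    print(f"p(men) = {p_men}: p(women) = {p_women:.4f}, "
          f"ratio of probabilities = {ratios[p_men]:.3f}")
```

When the men's probability is 0.01 the women's probability is close to 50% higher, but when it is 0.5 the women's probability is 0.6, only 20% higher.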

Predicting p • let $z_i = a + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki}$ • then, to predict p for individual i, $p_i = \frac{e^{z_i}}{1 + e^{z_i}}$

E.g.: Predicting a probability from our model • Predict disability for someone on low income aged 75: • Add up the linear equation: a (= -0.912) + 10 × 0.078 [age in excess of 65] + 1 × (-0.27) [low income] = -0.402 • Take the exponent of it to get the odds of being disabled = 0.669 • Put the odds over 1 + the odds to give the probability = c. 0.4, or a 40 per cent chance of being disabled
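
The same calculation can be written as a short Python sketch. The coefficient values are those quoted on the slide; the variable names are chosen for this example.

```python
import math

# Coefficients as quoted on the slide.
a = -0.912            # intercept
b_age = 0.078         # per year of age in excess of 65
b_low_income = -0.27  # low income dummy

age = 75
low_income = 1

# Step 1: add up the linear equation (the log odds).
z = a + b_age * (age - 65) + b_low_income * low_income

# Step 2: take the exponent to get the odds.
odds = math.exp(z)

# Step 3: put the odds over 1 + the odds to get the probability.
p = odds / (1 + odds)

print(f"log odds = {z:.3f}, odds = {odds:.3f}, probability = {p:.3f}")
```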

Goodness of fit in logistic regressions • based on improvements in the likelihood of observing the sample • use a chi-square test with the test statistic = $-2(\ln L_R - \ln L_U)$ • where R and U indicate restricted and unrestricted models • unrestricted – all independent variables in model • restricted – all or a subset of variables excluded from the model (their coefficients restricted to be 0) • degrees of freedom = number of coefficients restricted to be 0
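
A minimal sketch of this test in Python, using the usual likelihood ratio statistic, -2 × (restricted log-likelihood minus unrestricted log-likelihood). Both log-likelihood values are hypothetical, as is the choice of two excluded variables. With 2 degrees of freedom the chi-square tail probability has the simple closed form exp(-x/2), so no statistics library is needed; other degrees of freedom would need a chi-square table or software.

```python
import math

# Hypothetical log-likelihoods as reported by the software.
log_lik_restricted = -520.0    # model with two variables excluded
log_lik_unrestricted = -505.0  # model with all independent variables
n_restrictions = 2             # degrees of freedom of the test

lr_statistic = -2 * (log_lik_restricted - log_lik_unrestricted)

# For a chi-square distribution with 2 degrees of freedom,
# P(X > x) = exp(-x / 2). This shortcut holds only for 2 df.
p_value = math.exp(-lr_statistic / 2)

critical_value_5pct = 5.991  # chi-square critical value, 2 df, 5% level

print(f"LR statistic = {lr_statistic:.1f}, p-value = {p_value:.2e}")
print("excluded variables jointly significant:", lr_statistic > critical_value_5pct)
```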

Statistical significance of coefficient estimates in logistic regressions • Calculated using standard errors as in OLS • for large n, |t| > 1.96 means the coefficient is statistically significant at the 5% level (p ≤ 0.05): if the true value of the coefficient were 0, an estimate this far from 0 would occur with a probability of 5% or lower

95% confidence intervals for logistic regression coefficient estimates • For CIs of odds ratios calculate CIs for coefficients and take their exponents
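
The significance test and the confidence interval can be sketched together in Python. The coefficient 0.078 is the age coefficient quoted in the worked example; the standard error of 0.010 is a hypothetical value, since the estimation results are not reproduced in these notes.

```python
import math

b = 0.078   # age coefficient quoted in the worked example
se = 0.010  # hypothetical standard error

# Test statistic: coefficient divided by its standard error.
t = b / se
significant = abs(t) > 1.96

# 95% confidence interval for the coefficient (large-sample approximation).
ci_b = (b - 1.96 * se, b + 1.96 * se)

# Exponentiate both ends to get the interval for the odds ratio.
ci_or = (math.exp(ci_b[0]), math.exp(ci_b[1]))

print(f"t = {t:.2f}, significant at 5%: {significant}")
print(f"95% CI for b:          ({ci_b[0]:.4f}, {ci_b[1]:.4f})")
print(f"odds ratio = {math.exp(b):.4f}")
print(f"95% CI for odds ratio: ({ci_or[0]:.4f}, {ci_or[1]:.4f})")
```

With these values the interval for b excludes 0 and the interval for the odds ratio excludes 1, so both rules of thumb give the same answer.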
