
Logistic Regression using STATA








  1. Logistic Regression using STATA

  2. In this STATA lab session we will explore the following:
  • How we invoke logistic regression using STATA
  • What the logistic regression output from STATA tells us
  • How we prepare data for logistic regression analysis using STATA
  • Seeking corroborative support for apparently confusing logistic regression results
  • How we re-organize our data from one format for logistic regression to another
  • How we conduct simple likelihood ratio tests using STATA
  • How logistic regression with categorical variables is performed
  • Special precautions to be taken in the logistic analysis of categorical variables
  • Interpreting logistic regression output with continuous and categorical variables
  • Default reference variable levels for logistic regression
  • Controlling variable reference level in logistic regression
  • Obtaining stratified logistic regression results
  • Tabulating stratified results
  • Sensitivity and specificity analysis using STATA
  • Model assessment

  3. Starting a Logistic Regression analysis

  The first data set comes from Collett (1991) and uses fibrinogen and gamma globulin measurements (plasma levels [gm/l]) from 32 people in attempting to create a model to predict erythrocyte sedimentation rate (ESR). ESR is coded 0 if its value is < 20 [mm/hr] and 1 if >= 20 [mm/hr]. Thus the dichotomous outcome is to be predicted using two continuous variates. We'll load the data set and describe the data:

  . use c:\stata\ep521\session_3\esr, clear
  . describe

  Contains data from c:\stata\ep521\session_3\esr.dta
    obs:  32
   vars:   5                              18 Mar 1999 06:12
   size: 480 (99.9% of memory free)
  ---------------------------------------------------------------
    1. fibrin     float   %9.0g
    2. id         byte    %8.0g
    3. globulin   byte    %8.0g
    4. response   byte    %8.0g
    5. in_order   float   %9.0g
  ---------------------------------------------------------------

  Note the variable in_order. This was created to restore the order of the original data set as needed.

  4. We can summarize the data as follows:

  . summarize

  Variable |  Obs       Mean    Std. Dev.    Min    Max
  ---------+-----------------------------------------------------
    fibrin |   32    2.78875     .6370686   2.09   5.06
        id |   32       16.5     9.380832      1     32
  globulin |   32   35.65625     4.583346     28     46
  response |   32      .1875     .3965578      0      1
  in_order |   32       16.5     9.380832      1     32

  The entire data set is as follows:

  . list fibrin id globulin response in_order

        fibrin   id   globulin   response   in_order
    1.    2.52    1         38          0          1
    2.    3.41    5         37          0          2
    3.    3.15    9         39          0          3
    4.    5.06   13         37          1          4
    5.    3.53   17         46          1          5
    6.    2.88   21         30          0          6
    7.    2.67   25         39          0          7
    8.    3.93   29         32          1          8
    9.    2.56    2         31          0          9
   10.    2.46    6         36          0         10
   11.     2.6   10         41          0         11
   12.    3.34   14         32          1         12
   13.    2.68   18         34          0         13
   14.    2.65   22         46          0         14
   15.    2.29   26         31          0         15
   16.    3.34   30         30          0         16
   17.    2.19    3         33          0         17
   18.    3.22    7         38          0         18
   19.    2.29   11         36          0         19
   20.    2.38   15         37          1         20
   21.     2.6   19         38          0         21
   22.    2.09   23         44          1         22
   23.    2.15   27         31          0         23
   24.    2.99   31         36          0         24
   25.    2.18    4         31          0         25
   26.    2.21    8         37          0         26
   27.    2.35   12         29          0         27
   28.    3.15   16         36          0         28
   29.    2.23   20         37          0         29
   30.    2.28   24         36          0         30
   31.    2.54   28         28          0         31
   32.    3.32   32         35          0         32

  5. A good guide to the likelihood of a relationship between ESR and the two measurements can be obtained as follows:

  . table response, contents(mean fibrin sd f mean globulin sd g)

  ----------+---------------------------------------------------------------
  response  |  mean(fibrin)   sd(fibrin)   mean(globulin)   sd(globulin)
  ----------+---------------------------------------------------------------
         0  |     2.650385     .4056647          35.1154        4.179253
         1  |     3.388333      1.07821               38        5.899152
  ----------+---------------------------------------------------------------

  Which measurement would you think 'esr' is more closely related to?

  6. We now fit a logistic model to the esr values using fibrin and globulin

  . logit response globulin fibrin

  Iteration 0:  log likelihood = -15.442482   [1]
  Iteration 1:  log likelihood = -11.798793
  Iteration 2:  log likelihood = -11.491693
  Iteration 3:  log likelihood = -11.485565
  Iteration 4:  log likelihood = -11.485557

  Logit estimates                      Number of obs =     32   [3]
                                       LR chi2(2)    =   7.91   [4]
                                       Prob > chi2   = 0.0191   [5]
  Log likelihood = -11.485557 [2]      Pseudo R2     = 0.2562

  ------------------------------------------------------------------------------
  response |  Coef. [6]  Std. Err. [7]    z [8]   P>|z| [9]  [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
  globulin |   .1557781    .1195382       1.303   0.193      -.0785121   .3900684
    fibrin |   1.910369    .9709966       1.967   0.049       .0072502   3.813487
     _cons |  -12.79208    5.796411      -2.207   0.027      -24.15283  -1.431319
  ------------------------------------------------------------------------------

  [1] Null (intercept-only) model log likelihood
  [2] Log likelihood for our model
  [3] Number of observations
  [4] Model chi-square value = 2(Lmodel - Lnull); its d.f. is the number of fitted covariates (here 2)
  [5] p-value for the chi-square statistic (check against the d.f.); the null hypothesis is that the covariates are unable to change the likelihood value
  [6] Log-odds coefficient = amount by which the log odds change for each unit change in the fitted covariate
  [7] Standard error of the coefficient estimate
  [8] Wald statistic (z) for the particular covariate
  [9] p-value for the covariate

  7. How does Logistic Regression work?

  8. We fit the ESR logistic model using the method of maximum likelihood

  Prob(O|x) = 1/(1 + exp(-b.x))
  Prob(Ō|x) = 1 - 1/(1 + exp(-b.x)) = exp(-b.x)/(1 + exp(-b.x))

  Here b.x = b0 + b1.globulin + b2.fibrin

  The likelihood of all positive (O = 1) outcomes is the joint probability of these all occurring given the 'x' configuration (values); its logarithm is

  l(O|x) = Σ [ -ln(1 + exp(-b.xi)) ]

  and that for the '0' outcomes is

  l(Ō|x) = Σ [ -b.xi - ln(1 + exp(-b.xi)) ]

  To find the values for 'b' we maximize the composite likelihood, including the '1' outcomes (ESR = 1) and the '0' outcomes (ESR = 0).

  [Note: O denotes the outcome, here ESR; Ō denotes NOT O]
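This recipe can be checked outside Stata. The following is a minimal pure-Python sketch (my own illustration, not part of the original lab) that fits the same model by Newton-Raphson on the 32 observations listed on slide 4; the helper names `loglik` and `solve` are mine.

```python
import math

# ESR data transcribed from slide 4: (fibrin, globulin, response)
data = [
    (2.52, 38, 0), (3.41, 37, 0), (3.15, 39, 0), (5.06, 37, 1),
    (3.53, 46, 1), (2.88, 30, 0), (2.67, 39, 0), (3.93, 32, 1),
    (2.56, 31, 0), (2.46, 36, 0), (2.60, 41, 0), (3.34, 32, 1),
    (2.68, 34, 0), (2.65, 46, 0), (2.29, 31, 0), (3.34, 30, 0),
    (2.19, 33, 0), (3.22, 38, 0), (2.29, 36, 0), (2.38, 37, 1),
    (2.60, 38, 0), (2.09, 44, 1), (2.15, 31, 0), (2.99, 36, 0),
    (2.18, 31, 0), (2.21, 37, 0), (2.35, 29, 0), (3.15, 36, 0),
    (2.23, 37, 0), (2.28, 36, 0), (2.54, 28, 0), (3.32, 35, 0),
]
X = [(g, f, 1.0) for f, g, _ in data]   # globulin, fibrin, constant
y = [r for _, _, r in data]

def loglik(beta):
    # y*eta - ln(1 + e^eta) equals ln(p) when y = 1 and ln(1-p) when y = 0
    ll = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))
        ll += yi * eta - math.log(1.0 + math.exp(eta))
    return ll

def solve(A, b):
    # Gaussian elimination with partial pivoting for the small Newton system
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

beta = [0.0, 0.0, 0.0]            # at zero, loglik = 32*ln(0.5) = -22.1807
for _ in range(25):               # Newton-Raphson: beta += (X'WX)^-1 X'(y - p)
    score = [0.0] * 3
    info = [[0.0] * 3 for _ in range(3)]
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))
        p = 1.0 / (1.0 + math.exp(-eta))
        for j in range(3):
            score[j] += (yi - p) * xi[j]
            for k in range(3):
                info[j][k] += p * (1.0 - p) * xi[j] * xi[k]
    beta = [b + s for b, s in zip(beta, solve(info, score))]
```

Run to convergence, `beta` reproduces the slide-6 coefficients (.1558 globulin, 1.9104 fibrin, -12.79 constant) and `loglik(beta)` the final log likelihood of -11.4856.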

  9. Let's see what happens when we request odds ratios instead of log odds

  . logit, or

  Logit estimates                  Number of obs =     32
                                   LR chi2(2)    =   7.91
                                   Prob > chi2   = 0.0191
  Log likelihood = -11.485557      Pseudo R2     = 0.2562

  ------------------------------------------------------------------------------
  response | Odds Ratio   Std. Err.      z     P>|z|    [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
  globulin |   1.168567    .1396882    1.303   0.193    .9244909    1.477082
    fibrin |   6.755578    6.559644    1.967   0.049    1.007276    45.30815
  ------------------------------------------------------------------------------

  Odds ratios are calculated as
  Odds Ratio = exp(coefficient)
  Std. Err. of Odds Ratio = Std. Err. of coefficient * Odds Ratio

  logodds(esr | fibrin, globulin) = logit = 0.156 glob + 1.91 fibrin - 12.8

  Thus logodds(esr | fibrin, globulin+1) - logodds(esr | fibrin, globulin) = 0.156,
  or odds(esr | fibrin, globulin+1) / odds(esr | fibrin, globulin) = exp(0.156) = 1.169

  The Odds Ratio is the fractional change in the odds, relative to some reference odds, per unit increase in the covariate. The odds of an elevated ESR (>= 20 mm/hr) increase by about 17% for each 1 gm/l increase in gamma globulin. The reference for the odds ratio is a null state, or absent state, or 'unit-decremented' state.
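The two conversion rules can be verified numerically. A small Python sketch (mine, using the coefficients and standard errors from the logit output; 1.96 is a slightly coarser normal quantile than Stata uses internally):

```python
import math

coef = {"globulin": 0.1557781, "fibrin": 1.910369}   # log-odds coefficients
se = {"globulin": 0.1195382, "fibrin": 0.9709966}    # their standard errors

odds_ratio = {k: math.exp(b) for k, b in coef.items()}
or_se = {k: se[k] * odds_ratio[k] for k in coef}     # delta-method SE
# CI: exponentiate the end-points of the log-odds CI (not OR +/- 1.96*SE)
ci = {k: (math.exp(coef[k] - 1.96 * se[k]),
          math.exp(coef[k] + 1.96 * se[k])) for k in coef}
```

Note that the confidence interval is built on the log-odds scale and then exponentiated, which is why the OR intervals in the table are asymmetric about the point estimate.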

  10. We can examine the adequacy of our model using the Hosmer-Lemeshow statistic

  . lfit, group(10) table

  Logistic model for response, goodness-of-fit test
  (Table collapsed on quantiles of estimated probabilities)

  _Group   _Prob   _Obs_1   _Exp_1   _Obs_0   _Exp_0   _Total
       1  0.0269        0      0.1        4      3.9        4
       2  0.0443        0      0.1        3      2.9        3
       3  0.0570        0      0.2        3      2.8        3
       4  0.0769        0      0.2        3      2.8        3
       5  0.1191        1      0.3        2      2.7        3
       6  0.1658        1      0.6        3      3.4        4
       7  0.1936        1      0.6        2      2.4        3
       8  0.3271        0      0.8        3      2.2        3
       9  0.3743        0      1.1        3      1.9        3
      10  0.9333        3      2.1        0      0.9        3

  number of observations = 32
  number of groups = 10
  Hosmer-Lemeshow chi2(8) = 7.55
  Prob > chi2 = 0.4789

  The fit seems quite acceptable. To see how this is calculated we conduct the following:

  . predict p
  . xtile z=p, n(10)
  . by z: drop if _n~=_N
  . li z p

        z          p
   1.   1    .026904
   2.   2   .0442598
   3.   3   .0569809
   4.   4    .076948
   5.   5   .1132086
   6.   6   .1657512
   7.   7   .1936377
   8.   8   .3271465
   9.   9   .3742733
  10.  10   .9332787
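The Hosmer-Lemeshow statistic itself is just Σ (Obs - Exp)²/Exp over both outcome classes in each decile group. A quick pure-Python check (my own sketch; with the expected counts rounded to one decimal in the printout it lands near, rather than exactly on, the reported 7.55):

```python
# (Obs_1, Exp_1, Obs_0, Exp_0) per decile group, from the lfit table
groups = [
    (0, 0.1, 4, 3.9), (0, 0.1, 3, 2.9), (0, 0.2, 3, 2.8), (0, 0.2, 3, 2.8),
    (1, 0.3, 2, 2.7), (1, 0.6, 3, 3.4), (1, 0.6, 2, 2.4), (0, 0.8, 3, 2.2),
    (0, 1.1, 3, 1.9), (3, 2.1, 0, 0.9),
]
hl = sum((o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
         for o1, e1, o0, e0 in groups)   # about 7.2 with these rounded entries
```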

  11. A good way to examine the importance of a (set of) variable(s) in regard to predicting or explaining an outcome is the Likelihood Ratio Test.

  Here we first fit a so-called full model including the variable(s) under investigation:
  . quietly logit response fibrin globulin

  and we save the log likelihood value under a particular name, e.g. 'mall':
  . lrtest, s(mall)

  Then we remove the variable(s) under investigation and refit the model:
  . quietly logit response fibrin

  Finally, using our saved log likelihood value, we test whether the new value is significantly 'smaller' than the full model value. Again the null hypothesis is that the removed variables have no influence on the 'goodness' of the fit; rejecting this hypothesis leads to 'possible' retention of the tested variables.

  . lrtest, u(mall)
  Logit: likelihood-ratio test   chi2(1) = 1.87   Prob > chi2 = 0.1716

  In the case here, globulin seems not to have a significant influence on esr. Prior to performing tests such as the above, the full model should be determined to be able to predict response significantly well.
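The mechanics of the test are just arithmetic on the two log likelihoods. In this Python sketch (mine, not Stata output), the reduced-model log likelihood is back-computed from the printed chi2 of 1.87, and for 1 d.f. the chi-square survival function reduces to erfc(sqrt(x/2)):

```python
import math

ll_full = -11.485557            # logit response fibrin globulin
chi2 = 1.87                     # reported LR statistic
ll_reduced = ll_full - chi2 / 2 # ~ -12.42 for logit response fibrin (back-computed)

lr = 2 * (ll_full - ll_reduced)      # LR statistic = 2 * (L_full - L_reduced)
p = math.erfc(math.sqrt(lr / 2))     # chi-square survival function, 1 d.f.
```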

  12. Next we want to demonstrate how different data organizations can be examined using logistic regression.

  The data are taken from Kleinbaum's ('91) report on coronary heart disease amongst males in Georgia and reflect an attempt to relate personal health and socioeconomic factors to CHD risk. The information assembled included:
  CHD status; 1 disease present, 0 absent
  CAT status; 1 unsafe, 0 safe ... considered the exposure
  AGE status; 1 if age >= 50, 0 if age < 50
  ECG status; 1 unsafe, 0 safe

  The data is listed as follows. Note the stratification.

  . list chd n cat age ecg

       chd     n   cat   age   ecg
   1.   17   274     0     0     0
   2.   15   122     0     1     0
   3.    7    59     0     0     1
   4.    5    32     0     1     1
   5.    1     8     1     0     0
   6.    9    39     1     1     0
   7.    3    17     1     0     1
   8.   14    58     1     1     1

  13. This data looks different from our erythrocyte sedimentation rate data in three major regards:
  1. The variables are all dichotomous
  2. The data is stratified
  3. There is a structured, epidemiologic basis for the analysis

  To examine such data via 'logistic techniques' using STATA we invoke STATA's blogit procedure

  . blogit chd n cat age ecg, or

  Logit estimates                  Number of obs =    609
                                   LR chi2(3)    =  20.38
                                   Prob > chi2   = 0.0001
  Log likelihood = -209.09035      Pseudo R2     = 0.0465

  ------------------------------------------------------------------------------
  _outcome | Odds Ratio   Std. Err.      z     P>|z|    [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
       cat |   1.863132    .5949992    1.948   0.051    .9963432    3.484003
       age |   1.850907    .5253473    2.169   0.030    1.061173    3.228369
       ecg |   1.436151     .417012    1.247   0.213    .8129023     2.53724
  ------------------------------------------------------------------------------

  Notice that we didn't worry about the fact that the variables were dichotomous: dichotomization is the only categorization we need not worry about. Here the odds ratios are all with respect to a non-risk-CAT, young, non-risk-ECG individual. We see that moving from the non-risk group to the risk group increases the odds of CHD as follows:
  CAT - 86%
  AGE - 85%
  ECG - 44%

  The model is clearly significant, and exposure (cat) and age, but not ecg, seem associated with 'chd'.

  14. We'll now return to analysis of this data using the familiar logistic procedure. But first we'll need to get the data into 'binary' form, as opposed to the above blocked (stratified) form.

  Two points:
  1. We can only perform logistic regression on a dichotomous (0/1) outcome variate
  2. Binary data here is frequency-weighted: each record carries the number of times its outcome class occurred

  Prepare for duplication to represent both outcome classes:
  . expand 2
  Generate a dichotomous variate for our logistic regression:
  . gen x=_n<=_N/2
  Correct the frequencies to appropriately represent all outcome classes:
  . replace chd=n-chd[_n-8] in -8/l

  . list x chd cat ecg age

       x   chd   cat   ecg   age
   1.  1    17     0     0     0
   2.  1    15     0     0     1
   3.  1     7     0     1     0
   4.  1     5     0     1     1
   5.  1     1     1     0     0
   6.  1     9     1     0     1
   7.  1     3     1     1     0
   8.  1    14     1     1     1
   9.  0   257     0     0     0
  10.  0   107     0     0     1
  11.  0    52     0     1     0
  12.  0    27     0     1     1
  13.  0     7     1     0     0
  14.  0    30     1     0     1
  15.  0    14     1     1     0
  16.  0    44     1     1     1

  This data is now in the correct form for frequency-weighted logistic regression.
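The same reshape can be sketched outside Stata. Here is a pure-Python version (my own illustration) of the expand/gen/replace sequence, starting from the blocked records of slide 12:

```python
# Blocked records from slide 12: (chd cases, stratum size n, cat, age, ecg)
blocks = [
    (17, 274, 0, 0, 0), (15, 122, 0, 1, 0), (7, 59, 0, 0, 1), (5, 32, 0, 1, 1),
    (1, 8, 1, 0, 0), (9, 39, 1, 1, 0), (3, 17, 1, 0, 1), (14, 58, 1, 1, 1),
]
# One frequency-weighted row per outcome class, as expand/gen/replace produce:
weighted = (
    [(1, chd, cat, age, ecg) for chd, n, cat, age, ecg in blocks]       # x=1: cases
  + [(0, n - chd, cat, age, ecg) for chd, n, cat, age, ecg in blocks]   # x=0: non-cases
)
total = sum(w for _, w, *_ in weighted)                # 609 observations in all
cases = sum(w for x, w, *_ in weighted if x == 1)      # 71 CHD cases
```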

  15. Fit the data using the logistic command.

  . logistic x cat age ecg [fwe=chd]

  Logit estimates                  Number of obs =    609
                                   LR chi2(3)    =  20.38
                                   Prob > chi2   = 0.0001
  Log likelihood = -209.09035      Pseudo R2     = 0.0465

  ------------------------------------------------------------------------------
         x | Odds Ratio   Std. Err.      z     P>|z|    [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
       cat |   1.863132    .5949992    1.948   0.051    .9963432    3.484003
       age |   1.850907    .5253473    2.169   0.030    1.061173    3.228369
       ecg |   1.436151     .417012    1.247   0.213    .8129023     2.53724
  ------------------------------------------------------------------------------

  Note that the results are identical to those obtained using the blogit command. Again we didn't concern ourselves with the fact that the variables were dichotomous.

  16. Finally, let's make our data into casewise form and see if our logistic regression produces the same results.

  Create the casewise records by expanding appropriately:
  . expand chd
  (593 observations created)
  This number plus the original 16 makes up all 609 cases.

  Remove un-needed variables:
  . drop chd n

  . li in 1/10

       cat   age   ecg   x
   1.    0     0     0   1
   2.    0     1     0   1
   3.    0     0     1   1
   4.    0     1     1   1
   5.    1     0     0   1
   6.    1     1     0   1
   7.    1     0     1   1
   8.    1     1     1   1
   9.    0     0     0   0
  10.    0     1     0   0

  These are the first 10 records of the casewise data set. The logistic regression run for this data produced the following:

  . logistic x cat ecg age

  Logit estimates                  Number of obs =    609
                                   LR chi2(3)    =  20.38
                                   Prob > chi2   = 0.0001
  Log likelihood = -209.09035      Pseudo R2     = 0.0465

  ------------------------------------------------------------------------------
         x | Odds Ratio   Std. Err.      z     P>|z|    [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
       cat |   1.863132    .5949992    1.948   0.051    .9963432    3.484003
       ecg |   1.436151     .417012    1.247   0.213    .8129023     2.53724
       age |   1.850907    .5253473    2.169   0.030    1.061173    3.228369
  ------------------------------------------------------------------------------

  17. Now we will explore modeling categorical variables with more than two levels. For this purpose we will use the 'mi' data.

  . su

  Variable |   Obs       Mean     Std. Dev.   Min   Max
  ---------+-----------------------------------------------------
    agecat |  1976    3.11336     1.359189      1     5
  smokelev |  1976    .792004     .7767226      0     2
     ocuse |  1976    .082996     .2759459      0     1
        mi |  1976   .1184211     .3231878      0     1
     count |  1976   105.2439     54.48981      1   175

  . table oc mi age, by(smoke)

  ----------+--------------------------------------------------------------
            |                        agecat and mi
  smokelev  |  --- 1 ---   --- 2 ---   --- 3 ---   --- 4 ---   --- 5 ---
  and ocuse |    0    1      0    1      0    1      0    1      0    1
  ----------+--------------------------------------------------------------
  0       0 |  106    1    175    0    153    3    165   10    155   20
          1 |   25    0     13    0      8           4    1      2    3
  ----------+--------------------------------------------------------------
  1       0 |   79    0    142    5    119   11    130   21     96   42
          1 |   25    1     10    1     11    1      4           1
  ----------+--------------------------------------------------------------
  2       0 |   39    1     73    7     58   19     67   34     50   31
          1 |   12    3     10    8      7    3      1    5      2    3
  ----------+--------------------------------------------------------------

  OC and MI are dichotomous; smokelev takes 3 values; agecat takes 5 values.

  18. Inappropriate approaches to analyzing this data are

  . logistic mi oc age

  Logit estimates                  Number of obs =   1976
                                   LR chi2(2)    = 149.19
                                   Prob > chi2   = 0.0000
  Log likelihood = -644.2106       Pseudo R2     = 0.1038

  ------------------------------------------------------------------------------
        mi | Odds Ratio   Std. Err.      z     P>|z|    [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
     ocuse |   3.805339    .9398504    5.411   0.000    2.345103    6.174828
    agecat |   2.031445    .1336688   10.771   0.000    1.785649    2.311074
  ------------------------------------------------------------------------------

  and

  . logistic mi oc age smoke

  Logit estimates                  Number of obs =   1976
                                   LR chi2(3)    = 272.70
                                   Prob > chi2   = 0.0000
  Log likelihood = -582.45183      Pseudo R2     = 0.1897

  ------------------------------------------------------------------------------
        mi | Odds Ratio   Std. Err.      z     P>|z|    [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
     ocuse |   3.268099     .852567    4.539   0.000    1.959917    5.449452
    agecat |   2.139077    .1506925   10.794   0.000    1.863209    2.455791
  smokelev |   2.888646    .2920668   10.492   0.000    2.369358    3.521747
  ------------------------------------------------------------------------------

  19. The above approaches presume that the quantification of age and smoking levels translates to actual amounts of age (years) and smoking (cigarettes/day), which is not so, and hence the results would be interpreted incorrectly. A guide to how the analysis should be handled comes from the fact that dichotomous variables are correctly managed as categorical variables without any modification. Thus if we convert the polychotomous categorical variables to sets of dichotomous variables we may be 'on safe ground'. STATA's xi modifier accomplishes exactly this for us:

  . li x
   1. 1
   2. 1
   3. 1
   4. 2
   5. 2
   6. 2
   7. 3
   8. 3
   9. 3

  . xi i.x
  i.x  Ix_1-3  (naturally coded; Ix_1 omitted)

  . li x Ix_2 Ix_3
       x   Ix_2   Ix_3
   1.  1      0      0
   2.  1      0      0
   3.  1      0      0
   4.  2      1      0
   5.  2      1      0
   6.  2      1      0
   7.  3      0      1
   8.  3      0      1
   9.  3      0      1

  Note that 2 dichotomous variables (indicator, or dummy, variables) were automatically created for us by 'xi' to represent the x 'classes'. If a categorical variable has L levels then L-1 indicators are needed to represent the levels unambiguously.
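The xi behaviour is easy to mimic. A small Python sketch (my own; the function name `dummies` is hypothetical, not a Stata or Python built-in):

```python
def dummies(values, omit):
    """0/1 indicators for each level except the omitted (reference) level,
    mimicking Stata's  xi i.x  with a chosen omitted level."""
    levels = sorted(set(values))
    return {f"Ix_{lev}": [1 if v == lev else 0 for v in values]
            for lev in levels if lev != omit}

x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
ind = dummies(x, omit=1)   # L = 3 levels -> L-1 = 2 indicators: Ix_2, Ix_3
```

Changing `omit` reproduces the effect of the `char x [omit]` command shown on the next slide.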

  20. Note that level '1' of variable 'x' was by default (automatically) omitted. We can control the omitted level using the 'char' command.

  . drop I*
  . char x [omit] 3
  . xi i.x
  i.x  Ix_1-3  (naturally coded; Ix_3 omitted)

  . li x Ix_1 Ix_2
       x   Ix_1   Ix_2
   1.  1      1      0
   2.  1      1      0
   3.  1      1      0
   4.  2      0      1
   5.  2      0      1
   6.  2      0      1
   7.  3      0      0
   8.  3      0      0
   9.  3      0      0

  Here we see that we have controlled which level of 'x' is omitted: level '3' has been omitted.

  21. Armed with a familiarity with the 'xi' modifier we use it to create the requisite set of 'age' dichotomizations

  . xi: logistic mi oc i.age
  i.age  Iageca_1-5  (naturally coded; Iageca_1 omitted)

  Logit estimates                  Number of obs =   1976
                                   LR chi2(5)    = 151.47
                                   Prob > chi2   = 0.0000
  Log likelihood = -643.06749      Pseudo R2     = 0.1054

  ------------------------------------------------------------------------------
        mi | Odds Ratio   Std. Err.      z     P>|z|    [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
     ocuse |   3.995528    1.001009    5.529   0.000    2.445236    6.528714
  Iageca_2 |   3.121652     1.48894    2.387   0.017    1.225703    7.950304
  Iageca_3 |   6.919897    3.171998    4.220   0.000    2.817857     16.9934
  Iageca_4 |   14.12659    6.353765    5.888   0.000    5.850452    34.11027
  Iageca_5 |   24.39291    10.91662    7.138   0.000    10.14666    58.64136
  ------------------------------------------------------------------------------

  In this table the referent group is the age-group-'1', non-oc-using group, so each odds ratio is interpreted with respect to it. We see that the odds of mi for oc users is about 4 times that for non-oc users in the youngest age group. Furthermore, the odds of mi for study participants in the second age group is about 3.12 times that for the youngest, non-oc-using group. Note that there is no controlling for smoking in this analysis. We see that 4 dichotomous variables (indicator, or dummy, variables) were created to represent the 5 age categories.

  22. Most logistic analyses start by exploring strata-wise exposure odds ratio changes in regard to a (confounding) risk factor. For example, with epitab's mhodds utility we can see how the odds ratio for mi with oc use changes by age. The referent group is the non-oc user within each stratum.

  . mhodds mi oc, by(age)

  Maximum likelihood estimate of the odds ratio
  Comparing ocuse==1 vs ocuse==0 by agecat

  ----------+--------------------------------------------------------------------
     agecat | Odds ratio    chi2(1)   P>chi2    [95% Conf. Interval]
  ----------+--------------------------------------------------------------------
          1 |   7.225806       6.78   0.0092    1.263340    41.32877
          2 |   8.863636      28.64   0.0000    3.369128    23.31881
          3 |   1.538462       0.58   0.4450    0.504952    4.687303
          4 |   3.712821       6.58   0.0103    1.266587    10.88361
          5 |   3.883871       5.53   0.0187    1.147490    13.14561
  ----------+--------------------------------------------------------------------

  Mantel-Haenszel estimate controlling for agecat
  ----------------------------------------------------------------
   Odds ratio    chi2(1)   P>chi2    [95% Conf. Interval]
  ----------------------------------------------------------------
     3.969895      34.72   0.0000    2.418041    6.517702
  ----------------------------------------------------------------

  Test of homogeneity of ORs (approx): chi2(4) = 6.27  Pr>chi2 = 0.1797

  23. We can accomplish precisely this analysis using our logistic regression procedure as follows:

  . byvar age: logistic mi oc, nolog

  -> agecat==1
  Logit estimates                  Number of obs =    292
                                   LR chi2(1)    =   5.42
                                   Prob > chi2   = 0.0199
  Log likelihood = -26.535604      Pseudo R2     = 0.0927
  ------------------------------------------------------------------------------
        mi | Odds Ratio   Std. Err.      z     P>|z|    [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
     ocuse |   7.225806    6.343062    2.253   0.024     1.29322    40.37387
  ------------------------------------------------------------------------------

  -> agecat==2
  ------------------------------------------------------------------------------
     ocuse |   8.863636    4.225934    4.577   0.000    3.481632     22.5653
  ------------------------------------------------------------------------------

  -> agecat==3
  ------------------------------------------------------------------------------
     ocuse |   1.538462    .8727223    0.759   0.448    .5060879    4.676784
  ------------------------------------------------------------------------------

  -> agecat==4
  ------------------------------------------------------------------------------
     ocuse |   3.712821    2.019736    2.411   0.016    1.278377    10.78323
  ------------------------------------------------------------------------------

  -> agecat==5
  ------------------------------------------------------------------------------
     ocuse |   3.883871    2.396514    2.199   0.028    1.158877    13.01644
  ------------------------------------------------------------------------------
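The agecat==1 result can be reproduced by hand from a 2x2 table. In this Python sketch (mine), the cell counts are the slide-17 counts for age group 1 summed over smoking levels, and the interval uses the Woolf (log-OR) method, which is what the stratified logistic output reports; mhodds prints a slightly different interval because it uses a different approximation:

```python
import math

# 2x2 table for agecat==1, summed over smoking levels (slide 17):
#            mi=1  mi=0
a, b = 4, 62        # ocuse == 1
c, d = 2, 224       # ocuse == 0

or1 = (a * d) / (b * c)                    # cross-product odds ratio
se_ln = math.sqrt(1/a + 1/b + 1/c + 1/d)   # Woolf SE of ln(OR)
lo = math.exp(math.log(or1) - 1.96 * se_ln)
hi = math.exp(math.log(or1) + 1.96 * se_ln)
```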

  24. For the smoking covariate we have (full output displayed)

  . byvar smoke: logistic mi oc, nolog

  -> smokelev==0
  Logit estimates                  Number of obs =    844
                                   LR chi2(1)    =   0.84
                                   Prob > chi2   = 0.3590
  Log likelihood = -154.53224      Pseudo R2     = 0.0027
  ------------------------------------------------------------------------------
        mi | Odds Ratio   Std. Err.      z     P>|z|    [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
     ocuse |   1.705882    .9342955    0.975   0.329    .5831149      4.9905
  ------------------------------------------------------------------------------

  -> smokelev==1
  Logit estimates                  Number of obs =    699
                                   LR chi2(1)    =   2.58
                                   Prob > chi2   = 0.1084
  Log likelihood = -251.42202      Pseudo R2     = 0.0051
  ------------------------------------------------------------------------------
        mi | Odds Ratio   Std. Err.      z     P>|z|    [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
     ocuse |   .4214445    .2554406   -1.426   0.154    .1284757    1.382484
  ------------------------------------------------------------------------------

  -> smokelev==2
  Logit estimates                  Number of obs =    433
                                   LR chi2(1)    =   6.12
                                   Prob > chi2   = 0.0134
  Log likelihood = -246.54893      Pseudo R2     = 0.0123
  ------------------------------------------------------------------------------
        mi | Odds Ratio   Std. Err.      z     P>|z|    [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
     ocuse |   2.144701    .6471826    2.529   0.011     1.18716    3.874578
  ------------------------------------------------------------------------------

  25. Notice the approximate test of homogeneity of odds ratios reported by the mhodds utility:

  Test of homogeneity of ORs (approx): chi2(4) = 6.27  Pr>chi2 = 0.1797

  Using slightly more advanced approaches to logistic regression, in conjunction with the likelihood ratio test, we can confirm this value:

  1. Fit a full oc, age logistic model to the mi data including an interaction term
     . quietly xi: logistic mi i.oc*i.age
  2. Save the log likelihood value
     . lrtest, s(int)
  3. Fit a model involving just oc and age
     . quietly xi: logistic mi oc i.age
  4. Test the new likelihood value against that including the interaction
     . lrtest, u(int)
     Logistic: likelihood-ratio test   chi2(4) = 6.54   Prob > chi2 = 0.1626

  This demonstration shows not only that homogeneity of odds ratios can be studied using logistic regression, but also that heterogeneity is in fact the existence of a significant interaction between exposure and a confounding effect. Here the odds ratios are homogeneous.

  26. We can capture stratified logistic regression results into a table as the following demonstration shows

  . byvar smoke, b(oc) se(oc) e(chi2) tab: logistic mi oc

  smokelev |     e(chi2)       _b[oc]     _se[oc]
  ---------+------------------------------------------
         0 |   .84139909    .53408249   .54769045
         1 |   2.5773654   -.86406712    .6061074
         2 |   6.1158379    .76300019    .3017589

  Here '_b' denotes the log odds coefficient and 'e(chi2)' denotes the model chi-square value.

  An interesting (and potentially important) anomaly exists in this data:

  . tab oc sm, su(mi) mean

  Means of mi
             |             smokelev
       ocuse |         0           1           2 |      Total
  -----------+-----------------------------------+-----------
           0 | .04314721   .12248062   .24274406 |  .11313466
           1 | .07142857   .05555556   .40740741 |  .17682927
  -----------+-----------------------------------+-----------
       Total |  .0450237   .11731044   .26327945 |  .11842105

  27. Sensitivity and Specificity

  We can demonstrate the concepts of sensitivity and specificity using logistic regression. We'll see how sensitive mi is to oc and smoking together. First regress (logistically) mi on oc and smoking level:

  . xi: logistic mi oc i.smoke
  i.smoke  Ismoke_0-2  (naturally coded; Ismoke_0 omitted)

  Logit estimates                  Number of obs =   1976
                                   LR chi2(3)    = 125.07
                                   Prob > chi2   = 0.0000
  Log likelihood = -656.26727      Pseudo R2     = 0.0870

  ------------------------------------------------------------------------------
        mi | Odds Ratio   Std. Err.      z     P>|z|    [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
     ocuse |   1.391117    .3164664    1.451   0.147    .8906832    2.172723
  Ismoke_1 |    2.80938    .5716297    5.077   0.000     1.88545    4.186066
  Ismoke_2 |   7.439769    1.481485   10.078   0.000    5.035694    10.99157
  ------------------------------------------------------------------------------

  Let's see how well our model predicts the actual observations:

  . lpredict p
  . table oc sm, c(mean p mean mi)

  ----------+-----------------------------
            |            smokelev
      ocuse |        0         1         2
  ----------+-----------------------------
          0 | .0439517  .1143808  .2548561
            | .0431472  .1224806  .2427441
            |
          1 | .0601086  .1523036  .3223985
            | .0714286  .0555556  .4074074
  ----------+-----------------------------

  The agreement is much better for the non-users of oc than for the oc users.

  28. . tab p

       Pr(mi) |   Freq.   Percent     Cum.
  ------------+-----------------------------------
     .0439517 |     788     39.88    39.88
     .0601086 |      56      2.83    42.71
     .1143808 |     645     32.64    75.35
     .1523036 |      54      2.73    78.09
     .2548561 |     379     19.18    97.27
     .3223985 |      54      2.73   100.00
  ------------+-----------------------------------
        Total |    1976    100.00

  . lsens, gensens(se) genspec(sp) genprob(pp)
  . tabdisp oc smoke, cell(se sp pp)

  ----------+-----------------------------
            |            smokelev
      ocuse |        0         1         2
  ----------+-----------------------------
          0 | 0.854701  0.500000  0.094017
            | 0.432836  0.787600  0.981630
            | 0.060109  0.152304  0.322399
            |
          1 | 0.837607  0.487179  0.000000
            | 0.462687  0.816877  1.000000
            | 0.114381  0.254856  1.000000
  ----------+-----------------------------
  (cells show se, sp, pp)

  . count if p>pp[6] & mi
    196
  . local a=_result(1)
  . count if mi
    234
  . local n=_result(1)
  . local s=`a'/`n'
  . di in green "The sensitivity is " `s'
  The sensitivity is .83760684

  . count if p<pp[6] & ~mi
    754
  . local b=_result(1)
  . count if ~mi
    1742
  . local m=_result(1)
  . local s=`b'/`m'
  . di in green "The specificity is " `s'
  The specificity is .43283582
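The two counts-based calculations at the bottom can be replayed in Python. In this sketch (my own), the six covariate cells carry the predicted probabilities and frequencies from `tab p`, the observed case counts are back-computed from the cell means of mi on slide 26, and the cut-off 0.0601086 plays the role of pp[6]:

```python
# (predicted p, cell size, observed mi cases) for the six oc x smoke cells
cells = [
    (0.0439517, 788, 34), (0.0601086, 56, 4), (0.1143808, 645, 79),
    (0.1523036, 54, 3), (0.2548561, 379, 92), (0.3223985, 54, 22),
]
cut = 0.0601086    # the pp[6] cut-off used on the slide

cases = sum(mi for _, _, mi in cells)          # 234 mi cases
noncases = sum(n - mi for _, n, mi in cells)   # 1742 non-cases
sens = sum(mi for p, _, mi in cells if p > cut) / cases          # 196/234
spec = sum(n - mi for p, n, mi in cells if p < cut) / noncases   # 754/1742
```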

  29. Linear Regression

  The problem: minimize S² = Σ (Yi - Xiβ)² with respect to β. This leads, explicitly, to β̂ (an estimate of β) = (X'X)⁻¹(X'Y).

  Logistic Regression

  With logistic regression we maximize the likelihood of all observed outcomes deriving from our assumed model. Here xi is the covariate pattern (stratum) pertaining to subject 'i' and yi is the outcome for subject 'i'. The problem is, then:

  maximize L = Π π(xi)^yi . (1 - π(xi))^(1-yi)

  [Remember P(n|r) = nCr . p^r . (1-p)^(n-r) is the probability of r individuals out of a group of n having an 'outcome' of interest; in our definition p = π.]

  To ease the problem we use l = ln(L), as opposed to L. The problem becomes:

  maximize l(β) = Σ [ yi.ln(π(xi)) + (1 - yi).ln(1 - π(xi)) ]  with respect to β,

  where logit(π(xi)) = xiβ.

  30. Stata code to fit the model  logodds(O|x) = b0 + b1.globulin + b2.fibrin

  program drop _all
  program define fit_esr
          version 7.0
          args lnf theta
          quietly replace `lnf' = cond($ML_y1, /*
                  */ -ln(1+exp(-`theta')), /*
                  */ -(`theta')-ln(1+exp(-`theta')))
  end

  use c:\stata\ep521\session_3\esr, clear
  ml model lf fit_esr (response=globulin fibrin)
  ml maximize, nolog

  Convergence to final coefficients:

  initial:       log likelihood = -22.18071
  alternative:   log likelihood = -18.170463
  rescale:       log likelihood = -16.024374
  Iteration 0:   log likelihood = -16.024374
  Iteration 1:   log likelihood = -11.68683
  Iteration 2:   log likelihood = -11.486454
  Iteration 3:   log likelihood = -11.485557
  Iteration 4:   log likelihood = -11.485557

                                   Number of obs =     32
                                   Wald chi2(2)  =   4.76
  Log likelihood = -11.485557      Prob > chi2   = 0.0927

  ------------------------------------------------------------------------------
      response |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
      globulin |   .1557781   .1195382     1.30   0.193    -.0785124    .3900687
        fibrin |   1.910369   .9709982     1.97   0.049     .0072471     3.81349
         _cons |  -12.79208   5.796421    -2.21   0.027    -24.15285     -1.4313
  ------------------------------------------------------------------------------
