1 / 32

Logistic Regression and Generalized Linear Models:

Blood Screening, Women’s Role in Society, and Colonic Polyps. Logistic Regression and Generalized Linear Models:. Blood Screening. ESR measurements (erythrocyte sedimentation rate) ‏ Study checks for the rate of increase of ESR when (blood proteins) fibrinogen and globulin increase

ginger
Télécharger la présentation

Logistic Regression and Generalized Linear Models:

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Blood Screening, Women’s Role in Society, and Colonic Polyps Logistic Regression and Generalized Linear Models:

  2. Blood Screening • ESR measurements (erythrocyte sedimentation rate)‏ • Study checks for the rate of increase of ESR when (blood proteins) fibrinogen and globulin increase • Looking for an association between the probability of an ESR reading > 20mm/hr and the levels of the two plasma proteins. • Less than 20mm/hr indicates a healthy individual

  3. Multiple Regression Doesn’t Work • Not Normally distributed • The response variable is binary. • In fact, the distribution is Binomial. The can be seen by looking at how the error term relates to the probability. If y = 1 then the error term is 1- P(y=1). We will assume for our Random Variables Y that

  4. Logistic Regression, a Generalized Linear Model • Modelling the expected value of the response requires a transformation • This chapter is about the logit transformation of a model, which is of the form • If the response variable is 1, the log odds of the response is the logit function of a probability. • Fixing all explanatory variables but xj, exp(Bj) represents the odds that the response variable is 1 when xj increases by 1.

  5. R commands • Plasma_glm_1 <- glm(ESR ~ fibrinogen, data = plasma, family = binomial())‏ • Fits the model, in this example specifying the logistic function is implied because of the binomial() parameter is already passed. • Layout(matrix(1:2, ncol = 2))‏ • Lets you choose the layout of your graph screen. • Cdplot(ESR ~ fibrinogen, data = plasma)‏ • Cdplot(ESR ~ globulin, data = plasma)‏ • cdplot “plots conditional densities describing how the conditional distribution of a categorical variable y changes over a numerical variable x” (help(“cdplot”))‏

  6. Interpretation • The area of the dark region is the probability that ESR < 20 mm/hr, and this decreases as the protein levels increase • There is not much shape to the globulin density function.

  7. R Commands • Confint(plasma_glm_1, parm = “fibrinogen”)‏ • Gives a confidence interval • Output: • Waiting for profiling to be done... • 2.5 % 97.5 % • 0.3389465 3.9988602 • Exp(coef(plasma_glm_1)[“fibrinogen”])‏ • This is necessary to do the reverse transformation to get the model coefficients • Output: • fibrinogen • 6.215715

  8. R Commands • Summary(plasma_glm_1)‏ • Output: • Deviance Residuals: • Min 1Q Median 3Q Max • -0.9298 -0.5399 -0.4382 -0.3356 2.4794 • Coefficients: • Estimate Std. Error z value Pr(>|z|) • (Intercept) -6.8451 2.7703 -2.471 0.0135 * • fibrinogen 1.8271 0.9009 2.028 0.0425 * • --- • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 • (Dispersion parameter for binomial family taken to be 1) • Null deviance: 30.885 on 31 degrees of freedom • Residual deviance: 24.840 on 30 degrees of freedom

  9. R Commands • Exp(confint(plasma_glm_1, parm = “fibrinogen”))‏ • Output: • 2.5 % 97.5 % • 1.403468 54.535954 • Plasma_glm_2 <- glm(ESR ~ fibrinogen + globulin, data = plasma, family = binomial())‏ • Use logistic regression for both variables fibrinogen and globulin • Summary(Plasma_glm_2)‏

  10. R Commands • Summary(Plasma_glm_2)‏ • Output • Deviance Residuals: • Min 1Q Median 3Q Max • -0.9683 -0.6122 -0.3458 -0.2116 2.2636 • Coefficients: • Estimate Std. Error z value Pr(>|z|) • (Intercept) -12.7921 5.7963 -2.207 0.0273 * • fibrinogen 1.9104 0.9710 1.967 0.0491 * • globulin 0.1558 0.1195 1.303 0.1925 • --- • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 • (Dispersion parameter for binomial family taken to be 1) • Null deviance: 30.885 on 31 degrees of freedom • Residual deviance: 22.971 on 29 degrees of freedom

  11. Interpretation • large confidence interval is because there are not many observations in all, and even fewer ESR > 20mm/hr • The globulin coefficient is basically zero

  12. Anova(plasma_glm_1, plasma_glm_2, test = “Chisq”)‏ Compares the two models one of just fibrinogen, the other of both fibrinogen and globulin Why Chisquared test? ANOVA assumes normal distribution for each data set. Why is this OK? Output Analysis of Deviance Table Model 1: ESR ~ fibrinogen Model 2: ESR ~ fibrinogen + globulin Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 30 24.8404 2 29 22.9711 1 1.8692 0.1716 Comparing the residual deviance between the two models, we see that the difference is not significant. We must surmise that there is no association between globulin and ESR level. R Commands

  13. R Commands • Prob <- predict(plasma_glm_1, type = “response” • The model of just fibrinogen useful in creating a neat looking bubble plot. First we must take the predicted values from the first model and use them to determine the size of the bubbles • Plot(globulin ~ fibrinogen, data = plasma, xlim = c(2, 6), ylim = c(25, 50), pch = “.”)‏ • Plots the second model • Symbols(plasma$fibrinogen, plasma$globulin, circles = prob, add = TRUE)‏ • Uses the values of the first model to create different bubble sizes.

  14. The plot shows how the probability of having an ESR > 20mm/hr increases as fibrinogen increases Bubbles size is Probability

  15. Generalized Linear Model • Unifies the logistic regression, Analysis of Variance and multiple regression techniques. • GLMs have three essential parts • An error distribution- the distribution of the response variable. • Normal for Analysis of Variance and Multiple Regression • Binomial for Logistic Regression

  16. Main parts of GLM (continued)‏ • A link function which links the explanatory variables to the expected value of the response. • Logit function for logistic regression • Identity function for ANOVA and multiple regression • A variance function which shows the dependency of the response variable variability on the mean

  17. Measure of Fit • The deviance shows how well the model fits the data • Comparing two model’s deviances • Use a likelihood ratio test • Compare using Chi-square distribution

  18. Women’s Role in Society • Response to “Women should take care of running their homes and leave the running the country up to men”. • Factors: Education, Sex • Response: Agree or Disagree • data is presented as categories with counts for each education, sex combination

  19. R Commands • Womensrole_glm_1 <- glm(cbind(agree, disagree) ~ sex + education, data = womensrole, family= binomial())‏ • This uses the cbind function to change data from two responses to one response that is a matrix of agree, disagree counts. • Summary(womensrole_glm_1)‏ • We can see that education is a significant factor to the response.

  20. Summary Output • Deviance Residuals: • Min 1Q Median 3Q Max • -2.72544 -0.86302 -0.06525 0.84340 3.13315 • Coefficients: • Estimate Std. Error z value Pr(>|z|) • (Intercept) 2.50937 0.18389 13.646 <2e-16 *** • Why is there a P-value for the intercept? • sexFemale -0.01145 0.08415 -0.136 0.892 • education -0.27062 0.01541 -17.560 <2e-16 *** • --- • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 • (Dispersion parameter for binomial family taken to be 1) • Null deviance: 451.722 on 40 degrees of freedom • Residual deviance: 64.007 on 38 degrees of freedom

  21. Declaring a function in R • Myplot <- function(role.fitted) { • Declares the function myplot and passes it an object “role.fitted” to be used in the function • f <- womensrole$sex == “Female” • Stores everything in the data with sex = femail in f • plot(womensrole$education, role.fitted, type = "n", ylab = "probability of agreeing", xlab = "Education", ylim = c(0,1))‏ • Plots education against the role.fitted object which will be some predicted values from our GLM defined as • Role.fitted1 <- predict(womensrole_glm_1, type = “response”)‏

  22. Myplot function (cont)‏ • lines(womensrole$education[!f], role.fitted[!f], lty = 1) • lines(womensrole$education[f], role.fitted[f], lty = 2) • These graph the lines, one for males and one for femails. Lty indicates the kind of line (in this case solid or dotted)‏ • lgtxt <- c("Fitted (Males)", "Fitted (Females)") • legend("topright", lgtxt, lty = 1:2, bty = "n") • These add a legend for each line.

  23. Myplot Function (cont)‏ • y<-womensrole$agree/ (womensrole$agree + womensrole$disagree)‏ • A basic calculation of the proportion of women that agree • text(womensrole$education, y, ifelse(f, "\\VE", "\\MA"), family = "HersheySerif", cex = 1.25)‏ • Plots y and not y using male/female symbols

  24. Interpretation • The two fitted lines indicate that sex does not change a probability of agreeing vs education • The symbols of unfitted observations may indicate an interaction between sex and education

  25. MyPlot Sex:Education • By running the same analysis, e.g. • Womensrole_glm_2 <- glm(cbind(agree, disagree) ~ sex*education, data = womensrole, family = binomial())‏ • Summary(womensrole_glm_2)‏ • This summary shows a significant p-value for the sex:education interaction • Role.fitted2 <- predict(womensrole_glm_1, type = “response”)‏ • Myplot(role.fitted2)‏ • The plot shows that less education is associated with agreement that women belong in the home.

  26. The Deviance Residual • One of the many methods for checking the adequacy of the model fit • The deviance residual is the square root of the part of each observation that contributes to the deviance • Res <- residuals( womensrole_glm_2, type = “deviance”)‏ • Pulls the residuals for many other models as well • Plot(predict(womensrole_glm_2), res, xlab = “fitted values”, ylab = “Residuals”, lim = max(abs(res))*c(-1,1))‏ • No visible pattern: fit appears ok • Abline(h = 0, lty = 2)‏ • Adds a dotted line at height 0

  27. Drug Treatment TestingFamlilial Andenomatous Polyposis (FAP)‏ • Counts of colonic polyps after 12 months of treatment • Don’t want to be the guy that did that • Placebo controlled (Binary factor)‏ • Age

  28. GLM Analysis • Multiple Regression won’t work. • Count data strictly positive • Normality is not probable • Poisson Regression- GLM with a log link function • Ensures a Poisson Distribution • Ensures positive fitted amounts • R command: • Polyps_glm_1 <- glm(number ~ treat + age, data = polyps, family = poisson())‏

  29. Model Summary • Call: • glm(formula = number ~ treat + age, family = poisson(), data = polyps)‏ • Deviance Residuals: • Min 1Q Median 3Q Max • -4.2212 -3.0536 -0.1802 1.4459 5.8301 • Coefficients: • Estimate Std. Error z value Pr(>|z|) • (Intercept) 4.529024 0.146872 30.84 < 2e-16 *** • treatdrug -1.359083 0.117643 -11.55 < 2e-16 *** • age -0.038830 0.005955 -6.52 7.02e-11 *** • --- • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 • (Dispersion parameter for poisson family taken to be 1)‏ • Null deviance: 378.66 on 19 degrees of freedom • Residual deviance: 179.54 on 17 degrees of freedom • AIC: 273.88 • Number of Fisher Scoring iterations: 5

  30. Model Problem: Overdispersion • In previous models the variance can be seen as dependent solely on the mean • E.G. Binomial, Poisson • In practice, this doesn’t always work. • Sometimes the raw data points are not independent, there is some correlation, in this case, possible “clustering” • Compare residual deviance and degrees of freedom to determine • These should be basically equal

  31. Model Solution: Quasi-Likelihood • This procedure estimates the other factors that might contribute to the variance • R Command • Polyps_glm_2 <- glm(number ~ treat + age, data = polyps, family = quasipoisson())‏ • Summary(polyps_glm_2)‏ • The coefficients are still significant, but less so

  32. Homework • Please run through the commands for the myplot function (pg 100) and send me the command script. • Please send me what you thought of the presentation, give me a grade and add any constructive criticism. • zweihanderdawg@gmail.com

More Related