
Logistic Regression



  1. Logistic Regression Database Marketing Instructor: N. Kumar

  2. Logistic Regression vs TGDA • Two-Group Discriminant Analysis • Implicitly assumes that the Xs are Multivariate Normally (MVN) Distributed • This assumption is violated if Xs are categorical variables • Logistic Regression does not impose any restriction on the distribution of the Xs • Logistic Regression is the recommended approach if at least some of the Xs are categorical variables

  3. Data

  4. Contingency Table (counts implied by the probabilities on the following slides)

                     Preferred   Not Preferred   Total
      Large company      10             1          11
      Small company       2            11          13
      Total              12            12          24

  5. Basic Concepts • Probability • Probability of being a preferred stock = 12/24 = 0.5 • Probability that a company’s stock is preferred given that the company is large = 10/11 = 0.909 • Probability that a company’s stock is preferred given that the company is small = 2/13 = 0.154

  6. Concepts … contd. • Odds • Odds of a preferred stock = 12/12 = 1 • Odds of a preferred stock given that the company is large = 10/1 = 10 • Odds of a preferred stock given that the company is small = 2/11 = 0.182

  7. Odds and Probability • Odds(Event) = Prob(Event)/(1-Prob(Event)) • Prob(Event) = Odds(Event)/(1+Odds(Event))
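For example, these probability-odds conversions can be checked in a few lines of Python (a minimal sketch; the counts come from the contingency table above):

    # Counts from the contingency table: preferred stock vs. company size
    preferred_large, total_large = 10, 11
    preferred_small, total_small = 2, 13

    p_large = preferred_large / total_large   # 0.909
    p_small = preferred_small / total_small   # 0.154

    def odds(p):
        """Odds(Event) = Prob(Event) / (1 - Prob(Event))"""
        return p / (1 - p)

    def prob(o):
        """Prob(Event) = Odds(Event) / (1 + Odds(Event))"""
        return o / (1 + o)

    print(odds(p_large))         # 10.0
    print(odds(p_small))         # 0.1818... = 2/11
    print(prob(odds(p_large)))   # recovers 0.909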

  8. Logistic Regression • Take Natural Log of the odds: • ln(odds(Preferred|Large)) = ln(10) = 2.303 • ln(odds(Preferred|Small)) = ln(0.182) = -1.704 • Combining these relationships • ln(odds(Preferred|Size)) = -1.704 + 4.007*Size • Log of the odds is a linear function of size • The coefficient of size can be interpreted like the coefficient in regression analysis
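The two conditional log-odds pin down the intercept and slope exactly (a sketch, assuming Size is coded 0 = small and 1 = large, as the equation implies):

    import math

    ln_odds_large = math.log(10)       # 2.303
    ln_odds_small = math.log(2 / 11)   # -1.704

    # With Size = 0 for small and Size = 1 for large companies:
    intercept = ln_odds_small                 # about -1.704
    slope = ln_odds_large - ln_odds_small     # 2.303 - (-1.704) = 4.007

    print(intercept, slope)   # -> ln(odds) = -1.704 + 4.007 * Size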

  9. Interpretation • Positive sign ⇒ ln(odds) is increasing in the size of the company, i.e. a large company is more likely to have a preferred stock than a small company • The magnitude of the coefficient measures how much more likely

  10. General Model • ln(odds) = β0 + β1X1 + β2X2 + … + βkXk (1) • Recall: • Odds = p/(1-p) • ln(p/(1-p)) = β0 + β1X1 + β2X2 + … + βkXk (2) • p = exp(β0 + β1X1 + … + βkXk) / (1 + exp(β0 + β1X1 + … + βkXk)) • p = 1 / (1 + exp(-(β0 + β1X1 + … + βkXk)))

  11. Logistic Function
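The slide's plot is not reproduced in the transcript; a minimal sketch of the logistic (sigmoid) curve it shows, mapping the linear predictor to a probability:

    import math

    def logistic(x):
        """Map the linear predictor b0 + b1*X1 + ... to a probability in (0, 1)."""
        return 1 / (1 + math.exp(-x))

    print(logistic(-1.704))           # P(preferred | small) = 0.154
    print(logistic(-1.704 + 4.007))   # P(preferred | large) = 0.909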

  12. Estimation • In linear regression, coefficients are estimated by minimizing the sum of squared errors • Since p is non-linear in the parameters, we need a non-linear estimation technique • Maximum-Likelihood Approach • Non-Linear Least Squares

  13. Maximum Likelihood Approach • Conditional on the parameter vector β, write out the probability of observing the data • Write this probability out for each observation • Multiply the probabilities of all observations together to get the joint probability of observing the data conditional on β • Find the β that maximizes this conditional probability of realizing the data
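A minimal sketch of this recipe for the one-predictor model (the (size, preferred) pairs are reconstructed from the contingency table; the crude grid search stands in for a proper optimizer such as Newton-Raphson):

    import math

    # (size, preferred) pairs reconstructed from the contingency table
    data = [(1, 1)] * 10 + [(1, 0)] * 1 + [(0, 1)] * 2 + [(0, 0)] * 11

    def log_likelihood(b0, b1):
        """Log of the joint probability of the data, conditional on (b0, b1)."""
        total = 0.0
        for size, y in data:
            p = 1 / (1 + math.exp(-(b0 + b1 * size)))   # P(preferred | size)
            total += math.log(p if y == 1 else 1 - p)
        return total

    # Crude grid search for the maximizing pair; a real routine would use
    # Newton-Raphson or scipy.optimize on the negative log-likelihood.
    best = max(((b0 / 100, b1 / 100)
                for b0 in range(-300, 0, 5)
                for b1 in range(200, 600, 5)),
               key=lambda b: log_likelihood(*b))
    print(best)   # close to (-1.70, 4.00)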

  14. Logistic Regression • Logistic Regression with one categorical explanatory variable reduces to an analysis of the contingency table

  15. Interpretation of Results • Look at the –2 Log L statistic • Intercept only: 33.271 • Intercept and Covariates: 17.864 • Difference: 15.407 with 1 DF (p=0.0001) • The drop is highly significant: the size variable adds substantial explanatory power
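The quoted p-value can be verified against the chi-square distribution (a sketch using scipy; the –2 Log L values are from the slide):

    from scipy.stats import chi2

    lr_statistic = 33.271 - 17.864          # 15.407
    p_value = chi2.sf(lr_statistic, df=1)   # survival function = 1 - CDF
    print(lr_statistic, p_value)            # 15.407, about 0.0001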

  16. Do the Variables Have a Significant Impact? • Like testing whether the coefficients in the regression model are different from zero • Look at the output from Analysis of Maximum Likelihood Estimates • Loosely, the column Pr>Chi-Square gives the probability of realizing an estimate as large as the one in the Parameter Estimate column if the true coefficient were zero; if this value is < 0.05 the estimate is considered significant

  17. Other things to Look for • Akaike’s Information Criterion (AIC) and Schwarz’s Criterion (SC) are like adjusted R2: they penalize additional covariates • The larger the drop from the intercept-only column to the intercept-and-covariates column, the better the model fit

  18. Interpretation of the Parameter Estimates • ln(p/(1-p)) = -1.705 + 4.007*Size • p/(1-p) = e^(-1.705) · e^(4.007*Size) • For a unit increase in Size, the odds of being a preferred stock are multiplied by e^4.007 = 54.982
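A quick numerical check of the odds-ratio interpretation:

    import math

    odds_ratio = math.exp(4.007)            # about 54.982
    odds_small = math.exp(-1.705)           # about 0.182
    odds_large = odds_small * odds_ratio    # about 10
    print(odds_ratio, odds_small, odds_large)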

  19. Predicted Probabilities and Observed Responses • The response variable (success) classifies an observation into an event or a no-event • A concordant pair is an (event, no-event) pair in which the event has the higher predicted probability (PHAT) • The higher the concordant-pair percentage, the better the model discriminates
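A minimal sketch of how the concordant-pair percentage can be computed from PHAT values and observed responses (illustrative helper; tied pairs are ignored for brevity):

    def concordant_percent(phat, y):
        """Percent of (event, no-event) pairs where the event has the higher PHAT."""
        events = [p for p, yi in zip(phat, y) if yi == 1]
        non_events = [p for p, yi in zip(phat, y) if yi == 0]
        pairs = len(events) * len(non_events)
        concordant = sum(pe > pn for pe in events for pn in non_events)
        return 100 * concordant / pairs

    # With one binary predictor there are only two distinct PHAT values:
    phat = [0.909] * 11 + [0.154] * 13
    y = [1] * 10 + [0] * 1 + [1] * 2 + [0] * 11
    print(concordant_percent(phat, y))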

  20. Classification • For a set of new observations where you have information on size alone • You can use the model to predict the probability that success = 1, i.e. the stock is preferred • If PHAT > 0.5, classify success = 1; else success = 2
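The cutoff rule as a short sketch (coding follows the slide: 1 = preferred, 2 = not preferred):

    def classify(phat, cutoff=0.5):
        """Predict success = 1 (preferred) if PHAT exceeds the cutoff, else 2."""
        return 1 if phat > cutoff else 2

    print(classify(0.909))   # large company -> 1 (preferred)
    print(classify(0.154))   # small company -> 2 (not preferred)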

  21. Logistic Regression with multiple independent variables • Independent variables are a mixture of continuous and categorical variables

  22. Data

  23. General Model • ln(odds) = β0 + β1Size + β2FP • ln(p/(1-p)) = β0 + β1Size + β2FP • p = exp(β0 + β1Size + β2FP) / (1 + exp(β0 + β1Size + β2FP)) • p = 1 / (1 + exp(-(β0 + β1Size + β2FP)))

  24. Estimation & Interpretation of the Results • Identical to the case with one categorical variable
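For illustration, a hedged sketch of fitting the two-variable model with statsmodels (the data generated here is synthetic, not the course dataset; Size and FP are the variable names from the slides):

    import numpy as np
    import statsmodels.api as sm

    # Synthetic illustration: Size is binary, FP is a continuous score
    rng = np.random.default_rng(0)
    size = rng.integers(0, 2, 100)          # 0 = small, 1 = large
    fp = rng.normal(0, 1, 100)              # continuous covariate
    logit_p = -1.7 + 4.0 * size + 0.5 * fp
    y = (rng.random(100) < 1 / (1 + np.exp(-logit_p))).astype(int)

    X = sm.add_constant(np.column_stack([size, fp]))
    result = sm.Logit(y, X).fit()
    print(result.summary())   # coefficients, -2 Log L based fit statistics, AIC/BIC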

  25. Summary • Logistic Regression or Discriminant Analysis • Techniques differ in underlying assumptions about the distribution of the explanatory (independent) variables • Use logistic regression if you have a mix of categorical and continuous variables
