
MIS2502: Data Analytics Classification using Decision Trees






Presentation Transcript


  1. MIS2502: Data Analytics — Classification using Decision Trees

  2. What is classification?
  • Determining to what group a data element belongs, or the "attributes" of that "entity"
  • Examples:
  • Determining whether a customer should be given a loan
  • Flagging a credit card transaction as a fraudulent charge
  • Categorizing a news story as finance, entertainment, or sports

  3. How classification works

  4. Decision Tree Learning
  [Diagram: the training set feeds classification software to derive the model; the model is then applied to the validation set.]
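The derive-then-apply workflow in this diagram can be sketched in Python with scikit-learn (the course uses SAS, so the library choice is mine; the credit data below is made up):

```python
# Sketch: derive a model from the training set, then apply it to the
# validation set. Library (scikit-learn) and data are assumptions.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical credit cases: [income_in_thousands, debt_pct, owns_home]
X = [[55, 10, 1], [30, 25, 0], [70, 5, 1], [25, 30, 0],
     [45, 15, 1], [28, 22, 0], [60, 18, 1], [33, 27, 0]]
y = [0, 1, 0, 1, 0, 1, 0, 1]  # 1 = default, 0 = no default

# Hold out a validation set, derive the model from the rest...
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# ...then apply the model to the validation data
predictions = model.predict(X_valid)
```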

  5. Goals

  6. Classification Method: The Decision Tree
  • A model to predict membership of cases, or values of a dependent variable, based on one or more predictor variables (Tan, Steinbach, and Kumar 2004)

  7. Example: Credit Card Default
  [Table: training data with predictor columns and a classification column.]

  8. Example: Credit Card Default
  [Diagram: the tree derived from the training data, showing the root node, child nodes, and leaf nodes; predictors split the cases, leaves hold the classification.]

  9. Same Data, Different Tree
  [Diagram: an alternate tree built from the same training data.]
  • We just changed the order of the predictors.

  10. Apply to new (validation) data
  [Table: validation data scored by the tree.]
  • How well did the decision tree do in predicting the outcome?
  • When it's "good enough," we've got our model for future decisions.
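"How well did the tree do" is just the fraction of validation cases it got right; a minimal sketch (the case labels below are made up):

```python
# Sketch: score the model against held-out validation outcomes.
def accuracy(predicted, actual):
    """Fraction of validation cases the tree classified correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical validation results: 4 of 5 predictions match
predicted = ["No default", "Default", "No default", "Default", "No default"]
actual    = ["No default", "Default", "Default",    "Default", "No default"]

score = accuracy(predicted, actual)  # 4/5 correct
```

Whether a given score is "good enough" depends on the decision at stake; the slides leave that threshold open.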

  11. In a real situation…

  12. Tree Induction Algorithms
  • For instance, you may find: when income > $40k, debt < 20%, and the customer rents, no default occurs 80% of the time.
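That 80% figure is just the class proportion among the training cases that land in one leaf; a minimal sketch with made-up counts chosen to reproduce it:

```python
# Sketch: the leaf for (income > 40k, debt < 20%, rents).
# The 8-to-2 split is hypothetical, chosen to match the slide's 80%.
leaf_cases = ["no default"] * 8 + ["default"] * 2

p_no_default = leaf_cases.count("no default") / len(leaf_cases)
```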

  13. How the induction algorithm works
  • Start with a single node containing all the training data
  • Are the samples all of the same classification? If yes, this node is done.
  • If no: are there predictor(s) that will split the data? If no, this node is done.
  • If yes: partition the node into child nodes according to the predictor(s)
  • Are there more nodes (i.e., new child nodes)? If yes, go to the next node; if no, DONE!
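The loop above is naturally recursive; a minimal sketch (the data layout, the "any predictor that separates the cases" split rule, and the majority-vote fallback are my simplifying assumptions):

```python
# Sketch of the induction loop: stop when a node is pure or no predictor
# splits it, otherwise partition into child nodes and recurse.
def grow(rows, predictors):
    labels = {label for _, label in rows}
    if len(labels) == 1:                  # samples all of same classification
        return labels.pop()
    for p in predictors:                  # any predictor that splits the data?
        groups = {}
        for features, label in rows:
            groups.setdefault(features[p], []).append((features, label))
        if len(groups) > 1:               # partition into child nodes
            rest = [q for q in predictors if q != p]
            return {p: {v: grow(g, rest) for v, g in groups.items()}}
    # out of predictors: leaf labeled with the majority class
    return max(labels, key=[label for _, label in rows].count)

# Tiny hypothetical training set
rows = [({"income": ">40", "debt": ">20"}, "no default"),
        ({"income": "<40", "debt": ">20"}, "default"),
        ({"income": "<40", "debt": "<20"}, "no default")]
tree = grow(rows, ["income", "debt"])
```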

  14. Start with the root node
  • There are both "defaults" and "no defaults" in the set
  • So we need to look for predictors to split the data

  15. Split on income
  • Income is a factor
  • (Income < 40, Debt > 20, Owns = Default) but (Income > 40, Debt > 20, Owns = No default)
  • But there is still a mix of defaults and no defaults within each income group
  • So look for another split

  16. Split on debt
  • Debt is a factor
  • (Income < 40, Debt < 20, Owns = No default) but (Income < 40, Debt < 20, Rents = Default)
  • But there is still a mix of defaults and no defaults within some debt groups
  • So look for another split

  17. Split on Owns/Rents
  • Owns/Rents is a factor
  • For some cases it doesn't matter, but for some it does
  • So you group similar branches
  • …And we stop because we're out of predictors!
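The finished tree from the income → debt → owns/rents walkthrough can be written as nested rules. The branches the slides state explicitly are kept; the branches the slides leave unstated are filled with illustrative guesses:

```python
# Sketch of the finished tree as nested rules. Unstated branches are
# illustrative assumptions, not the slides' actual tree.
tree = {
    "income<40": {
        "debt<20": {"owns": "no default", "rents": "default"},  # per slide 16
        "debt>20": "default",                                   # per slide 15
    },
    "income>40": {
        "debt<20": "no default",                                # per slide 12
        "debt>20": {"owns": "no default", "rents": "default"},  # owns per slide 15
    },
}

def classify(income_lt_40, debt_lt_20, owns):
    """Walk the nested rules for one applicant."""
    node = tree["income<40" if income_lt_40 else "income>40"]
    node = node["debt<20" if debt_lt_20 else "debt>20"]
    if isinstance(node, dict):          # some branches also split on owns/rents
        node = node["owns" if owns else "rents"]
    return node
```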

  18. How does it know when and how to split? There are statistics that show:
  • When a predictor variable maximizes distinct outcomes (if age is a predictor, we should see that older people buy and younger people don't)
  • When a predictor variable separates outcomes (if age is a predictor, we should not see older people who buy mixed up with older people who don't)

  19. Example: Chi-squared test

  20. Chi-squared test
  • Small p-values (i.e., less than 0.05) mean it's very unlikely the groups are the same
  • So Owns/Rents is a predictor that creates two different groups

  21. Bottom line: Interpreting the Chi-Squared Test
  • A high statistic (low p-value) from the chi-squared test means the groups are different
  • SAS shows you the logworth value, which is -log(p-value)
  • It's a way to compare split variables (big logworth = better split)
  • Low p-values are better: -log(0.05) ≈ 2.99, -log(0.0001) ≈ 9.21
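The chi-squared statistic and logworth can be computed by hand for a 2×2 Owns/Rents-vs-Default table (the counts below are made up; the slide's 2.99 and 9.21 values correspond to the natural log, so -ln(p) is used here):

```python
import math

# Hypothetical 2x2 table: rows = Owns/Rents, columns = Default/No default
observed = [[5, 45],    # owners:  5 default, 45 no default
            [20, 30]]   # renters: 20 default, 30 no default

row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
total = sum(row)

# Pearson chi-squared statistic: sum over cells of (observed - expected)^2 / expected
chi2 = sum((observed[i][j] - row[i] * col[j] / total) ** 2
           / (row[i] * col[j] / total)
           for i in range(2) for j in range(2))

# p-value for 1 degree of freedom: survival function of chi-squared(1)
p_value = math.erfc(math.sqrt(chi2 / 2))

# Bigger logworth = stronger evidence that the split creates different groups
logworth = -math.log(p_value)
```

With these counts the statistic is large and the p-value is well below 0.05, so Owns/Rents would be kept as a split variable.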

  22. Reading the SAS Decision Tree
  • Outcome cases are not 100% certain; there are probabilities attached to each outcome in a node
  • So let's code "Default" as 1 and "No Default" as 0
  • So what is the chance that:
  • A renter making more than $40,000, with debt more than 20% of income, will default?
  • A home owner making less than $40,000, with debt more than 20% of income, will default?
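With Default coded as 1 and No Default as 0, the probability attached to a leaf is just the mean of its 0/1 outcomes; a minimal sketch (the leaf compositions are made up, not read from the SAS output):

```python
# Sketch: each leaf carries a probability of default = mean of its 0/1 cases.
# The case lists below are hypothetical.
leaves = {
    # (income, debt, owns_or_rents) -> coded outcomes observed in that leaf
    (">40k", ">20%", "rents"): [1, 1, 1, 0],   # 3 defaults, 1 no default
    ("<40k", ">20%", "owns"):  [1, 0, 0, 0],   # 1 default, 3 no defaults
}

def p_default(leaf):
    """Chance of default in a leaf: mean of the 0/1 outcomes."""
    return sum(leaf) / len(leaf)

renter_risk = p_default(leaves[(">40k", ">20%", "rents")])
owner_risk = p_default(leaves[("<40k", ">20%", "owns")])
```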
