
Chapter 2 Data Mining: A Closer Look

Jason C. H. Chen, Ph.D., Professor of MIS, School of Business Administration, Gonzaga University, Spokane, WA 99223, chen@jepson.gonzaga.edu


Presentation Transcript


  1. Chapter 2 Data Mining: A Closer Look. Jason C. H. Chen, Ph.D., Professor of MIS, School of Business Administration, Gonzaga University, Spokane, WA 99223, chen@jepson.gonzaga.edu

  2. 2.1 Data Mining Strategies

  3. Data Mining Strategies
     Figure 2.1 - A hierarchy of data mining strategies
     • Unsupervised Clustering: no output attributes
     • Supervised Learning
        • Classification: categorical/discrete output (current behavior)
        • Estimation: numeric output
        • Prediction: future outcome (categorical/numeric)
     • Market Basket Analysis

  4. Estimation • Learning is supervised. • The dependent variable is numeric. • Well-defined classes. • Current rather than future behavior.
     Classification • Learning is supervised. • The dependent variable is categorical. • Well-defined classes. • Current rather than future behavior.
     Prediction • The emphasis is on predicting future rather than current outcomes. • The output attribute may be categorical or numeric.

  5. The Cardiology Patient Dataset

  6. Table 2.1 • Cardiology Patient Data

  7. Table 2.2 • Most and Least Typical Instances from the Cardiology Domain

  8. A Sick Class Rule for the Cardiology Patient Dataset
     IF Thal = Rev & Chest Pain Type = Asymptomatic THEN Concept Class = Sick
     Rule accuracy: 91.14% Rule coverage: 52.17%
     A Healthy Class Rule for the Cardiology Patient Dataset
     IF 169 <= Maximum Heart Rate <= 202 THEN Concept Class = Healthy
     Rule accuracy: 85.07% Rule coverage: 34.55%

  9. Unsupervised Clustering • Determine if concepts can be found in the data. • Evaluate the likely performance of a supervised model. • Determine a best set of input attributes for supervised learning. • Detect outliers.

  10. Market Basket Analysis • Find interesting relationships among retail products. • Uses association rule algorithms. • The results of a market basket analysis help retailers design promotions, arrange shelf or catalog items, and develop cross-marketing strategies.

  11. 2.2 Supervised Data Mining Techniques

  12. The Credit Card Promotion Database

  13. Table 2.3 • The Credit Card Promotion Database

  14. Data file: CreditCardPromotion.xls

  15. A Hypothesis for the Credit Card Promotion Database. A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.

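  16. A Production Rule for the Credit Card Promotion Database
      IF Sex = Female & 19 <= Age <= 43 THEN Life Insurance Promotion = Yes
      Rule accuracy: 100.00% Rule coverage: 66.67%
      Question: Can we assume that two-thirds of all females in the specified age range will take advantage of the promotion?
      • Rule accuracy is a between-class measure: of the instances that satisfy the rule's conditions, the percentage that belong to the rule's class.
      • Rule coverage is a within-class measure: of the instances in the rule's class, the percentage that satisfy the rule's conditions.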
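To make the two measures concrete, here is a minimal Python sketch. The six records are hypothetical (they are not the rows of Table 2.3) and were chosen so that the rule reproduces the 100.00% accuracy and 66.67% coverage quoted on the slide; the function and field names are likewise illustrative. It also suggests the answer to the slide's question: coverage says that two-thirds of the card holders who accepted the promotion satisfy the rule, not that two-thirds of females aged 19 to 43 will accept it.

```python
# Hypothetical toy records (NOT Table 2.3): sex, age, and whether the
# card holder accepted the life insurance promotion.
records = [
    {"sex": "Female", "age": 27, "life_ins": "Yes"},
    {"sex": "Female", "age": 35, "life_ins": "Yes"},
    {"sex": "Male",   "age": 42, "life_ins": "Yes"},
    {"sex": "Male",   "age": 29, "life_ins": "No"},
    {"sex": "Female", "age": 55, "life_ins": "No"},
    {"sex": "Male",   "age": 45, "life_ins": "No"},
]


def antecedent(record):
    """IF Sex = Female & 19 <= Age <= 43."""
    return record["sex"] == "Female" and 19 <= record["age"] <= 43


def rule_accuracy(records, antecedent, attr, value):
    """Of the records that satisfy the antecedent, the fraction in the rule's class."""
    matched = [r for r in records if antecedent(r)]
    if not matched:
        return 0.0
    return sum(r[attr] == value for r in matched) / len(matched)


def rule_coverage(records, antecedent, attr, value):
    """Of the records in the rule's class, the fraction that satisfy the antecedent."""
    in_class = [r for r in records if r[attr] == value]
    if not in_class:
        return 0.0
    return sum(antecedent(r) for r in in_class) / len(in_class)


accuracy = rule_accuracy(records, antecedent, "life_ins", "Yes")
coverage = rule_coverage(records, antecedent, "life_ins", "Yes")
print(f"Rule accuracy: {accuracy:.2%}")  # Rule accuracy: 100.00%
print(f"Rule coverage: {coverage:.2%}")  # Rule coverage: 66.67%
```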

  17. Neural Networks • A neural network is a set of interconnected nodes designed to imitate the functioning of the human brain. • Two phases of operation: • Learning phase: training instances enter at the input layer and the connection weights are adjusted until the network reaches a predetermined minimum error rate. • Application phase: the weights are fixed and the network computes output values for new instances. • A major shortcoming of the neural network approach is a lack of explanation about what has been learned.

  18. Table 2.3 • The Credit Card Promotion Database (on the slide, input attributes are shown in blue and output attributes in red). There are therefore four input nodes and one output node; five hidden-layer nodes were chosen.
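The second phase described above (fixed weights, outputs computed for new instances) can be sketched for the 4-5-1 layout mentioned on this slide. This is only an illustration under stated assumptions: the sigmoid activation, the absence of bias terms, the random weights standing in for trained ones, and the input values are all hypothetical and do not come from the slides.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_network(n_in=4, n_hidden=5, n_out=1, seed=0):
    """Random weights as a stand-in for weights fixed after training (assumption)."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
    return w_hidden, w_out


def forward(network, inputs):
    """Compute the output values for one instance with the weights held fixed."""
    w_hidden, w_out = network
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]


network = make_network()
new_instance = [0.5, 1.0, 0.0, 0.3]  # hypothetical inputs scaled to [0, 1]
output = forward(network, new_instance)
print(output)  # one value between 0 and 1
```

The sketch also illustrates the shortcoming noted on slide 17: the weight lists are just numbers and do not explain what has been learned.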

  19. Table 2.4 - Neural Network Training: Actual and Computed Output

  20. Table 2.4 - Neural Network Training: Actual and Computed Output

  21. Statistical Regression • Statistical regression is a supervised learning technique that generalizes a set of numeric data by creating a mathematical equation relating one or more input attributes to a single numeric output attribute. • A linear regression model is characterized by an output attribute whose value is determined by a linear sum of weighted input attribute values.

  22. Statistical Regression
      Example: a female who does not have credit card insurance is a likely candidate for the life insurance promotion.
      Life insurance promotion = 0.5909 (credit card insurance) - 0.5455 (sex) + 0.7727
      Life insurance promotion = 0.5909 (0) - 0.5455 (0) + 0.7727 = 0.7727
      Because the value 0.7727 is close to 1.0, we conclude that the individual is likely to take advantage of the promotional offer.
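The arithmetic on this slide can be checked with a short Python sketch. The coefficients are the ones given above. The slide uses 0 for a female and 0 for no credit card insurance; coding the other values as 1 is an assumption, and the function name is illustrative.

```python
def life_insurance_score(credit_card_insurance, sex):
    """Linear regression equation from the slide.

    Assumed coding: credit_card_insurance 0 = no, 1 = yes; sex 0 = female, 1 = male.
    """
    return 0.5909 * credit_card_insurance - 0.5455 * sex + 0.7727


# The slide's example: a female without credit card insurance.
score = life_insurance_score(credit_card_insurance=0, sex=0)
print(round(score, 4))  # 0.7727

# Under the assumed coding, a male without credit card insurance scores much lower.
male_score = life_insurance_score(credit_card_insurance=0, sex=1)
print(round(male_score, 4))  # 0.2272
```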

  23. 2.3 Association Rules • Association rule mining techniques are used to discover interesting associations between attributes contained in a database. • Unlike traditional production rules, association rules can have one or several output attributes. • An output attribute for one rule can be an input attribute for another rule. • A popular technique for ‘market basket’ analysis. • Problem: we may generate many rules, several of which have little value.

  24. An Association Rule for the Credit Card Promotion Database IF Sex = Female & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = Yes

  25. 2.4 Clustering Techniques • By applying unsupervised clustering to the instances of the Acme Credit Card Company database, we hope to find a subset of input attributes that differentiates card holders who have taken advantage of the life insurance promotion from those card holders who have not accepted the promotional offer.

  26. IF Sex = Female & 35 <= Age <= 43 & Credit Card Insurance = No THEN Class = 3
      Rule accuracy: 100% Rule coverage: 66.67%

  27. End here for now!

  28. 2.5 Evaluating Performance • Performance evaluation is probably the most critical of all the steps in the data mining process. • Three general questions:
      1. Will the benefits received from a data mining project more than offset the cost of the data mining process?
      2. How do we interpret the results of a data mining session?
      3. Can we use the results of a data mining process with confidence?

  29. Evaluating Supervised Learner Models • Supervised learner models are designed to classify, estimate, and/or predict future outcomes. • Applications where classification correctness matters: • Develop a model to accept or reject credit card applications • Develop a model to accept or reject home mortgage applicants • Develop a model to decide whether or not to drill for oil

  30. Confusion Matrix • A matrix used to summarize the results of a supervised classification. • Entries along the main diagonal are correct classifications. • Entries other than those on the main diagonal are classification errors. • Classification correctness is best calculated by presenting previously unseen data, in the form of a test set, to the model being evaluated. • A confusion matrix is of little use for evaluating supervised learner models offering numeric output.

  31. Table 2.5 • A Three-Class Confusion Matrix

                Computed Decision
                C1    C2    C3
          C1    C11   C12   C13
          C2    C21   C22   C23
          C3    C31   C32   C33

      Rule 1: Values along the main diagonal represent correct classifications; e.g., C11 represents the total number of class C1 instances correctly classified by the model.
      Rule 2: For C2, the instances counted in C21, C22, and C23 are all actually members of C2, but those in C21 and C23 are incorrectly classified as members of another class.
      Rule 3: For C2, the instances counted in C12 and C32 are incorrectly classified as members of class C2.
      Computation questions: #1
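The three rules can be sketched in Python as follows. The counts are hypothetical and only illustrate the layout of Table 2.5 (rows are actual classes, columns are computed decisions); the function names are illustrative as well.

```python
# Hypothetical counts; matrix[i][j] = instances of actual class i+1
# that the model assigned to class j+1.
matrix = [
    [18, 1, 1],   # actual C1
    [2, 15, 3],   # actual C2
    [0, 4, 16],   # actual C3
]


def correctly_classified(matrix):
    """Rule 1: sum of the main diagonal."""
    return sum(matrix[i][i] for i in range(len(matrix)))


def members_misclassified(matrix, i):
    """Rule 2: off-diagonal entries in row i (class members sent elsewhere)."""
    return sum(v for j, v in enumerate(matrix[i]) if j != i)


def wrongly_assigned(matrix, i):
    """Rule 3: off-diagonal entries in column i (non-members assigned to the class)."""
    return sum(row[i] for k, row in enumerate(matrix) if k != i)


total = sum(sum(row) for row in matrix)
accuracy = correctly_classified(matrix) / total
print(correctly_classified(matrix), total)   # 49 60
print(members_misclassified(matrix, 1))      # 5  (C21 + C23)
print(wrongly_assigned(matrix, 1))           # 5  (C12 + C32)
print(f"{accuracy:.2%}")                     # 81.67%
```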

  32. Two-Class Error Analysis Table 2.6 • A Simple Confusion Matrix The more the better The less the better

  33. Table 2.7 • Two Confusion Matrices Each Showing a 10% Error Rate Which model is better?

  34. Evaluating Numeric Output • Mean absolute error • Mean squared error • Root mean squared error
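The three measures listed above follow their standard definitions, sketched here in Python. The actual and computed values are hypothetical and are not taken from Table 2.4.

```python
import math


def mean_absolute_error(actual, computed):
    """Average of |actual - computed| over all instances."""
    return sum(abs(a - c) for a, c in zip(actual, computed)) / len(actual)


def mean_squared_error(actual, computed):
    """Average of (actual - computed)^2; larger errors weigh more heavily."""
    return sum((a - c) ** 2 for a, c in zip(actual, computed)) / len(actual)


def root_mean_squared_error(actual, computed):
    """Square root of the mean squared error, in the units of the output attribute."""
    return math.sqrt(mean_squared_error(actual, computed))


# Hypothetical actual and computed outputs for four test instances.
actual = [1.0, 0.0, 1.0, 0.0]
computed = [0.8, 0.3, 0.6, 0.1]

mae = mean_absolute_error(actual, computed)
mse = mean_squared_error(actual, computed)
rmse = root_mean_squared_error(actual, computed)
print(round(mae, 4), round(mse, 4), round(rmse, 4))
```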

  35. Comparing Models by Measuring Lift • Marketing applications that focus on response rates from mass mailings are less concerned with test set classification error and more interested in building models able to extract biased samples from large populations. • Supervised learner models designed for extracting biased samples from a general population are often evaluated by a measure that comes directly from marketing, known as lift.

  36. Computing Lift • Ci is the class of all zero-balance customers who, given the opportunity, will take advantage of the promotional offer.
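The slide's formula did not survive in this transcript. A common definition of lift, assumed here, is the ratio P(Ci | Sample) / P(Ci | Population): the response rate in the sample selected by the model divided by the response rate in the whole population. The Python sketch below uses illustrative counts that are not taken from Tables 2.8 or 2.9; they were chosen so that the result is approximately 2.25.

```python
def lift(sample_hits, sample_size, population_hits, population_size):
    """Assumed definition: P(Ci | Sample) / P(Ci | Population)."""
    sample_rate = sample_hits / sample_size
    population_rate = population_hits / population_size
    return sample_rate / population_rate


# Illustrative counts (assumptions): 1,000 of 100,000 customers would respond;
# the model selects 24,000 customers, of whom 540 would respond.
model_lift = lift(sample_hits=540, sample_size=24_000,
                  population_hits=1_000, population_size=100_000)
print(round(model_lift, 2))

# Mailing everyone corresponds to no model and gives a lift of 1.
no_model_lift = lift(1_000, 100_000, 1_000, 100_000)
```

A lift above 1 means a targeted mailing reaches responders at a higher rate than a mass mailing of the same size would.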

  37. Figure 2.4 – A lift chart (Targeted vs. mass mailing)

  38. Table 2.8 • Two Confusion Matrices: No Model and an Ideal Model
      Table 2.9 • Two Confusion Matrices for Alternative Models with Lift Equal to 2.25

  39. Unsupervised Model Evaluation • Evaluating unsupervised data mining is, in general, a more difficult task than supervised evaluation. This is true because the goals of an unsupervised data mining session are frequently not as clear as the goals for supervised learning.
