
Some slide material taken from: Groth, Han and Kamber, SAS Education



  1. DSCI 4520/5240 (DATA MINING), Lecture 2: The SEMMA process and the CRISP-DM process. Some slide material taken from: Groth, Han and Kamber, SAS Education

  2. Objectives • Define SEMMA. • Introduce CRISP-DM.

  3. SEMMA • Sample • Explore • Modify • Model • Assess

  4. Sample: Input Data Source • Sampling • Data Partition

  5. Explore: Distribution Explorer • Multiplot • Insight • Association • Variable Selection • Link Analysis

  6. Modify: Data Set Attributes • Transform Variables • Filter Outliers • Replacement • Clustering • Self-Organized Maps / Kohonen Networks • Time Series

  7. Model: Regression • Tree • Neural Network • Princomp/Dmneural • User Defined Model • Ensemble • Memory Based Reasoning • Two-Stage Model

  8. Assess: Assessment • Reporter

  9. Other Types of Nodes – Scoring Nodes: Score • C*Score

  10. Other Types of Nodes – Utility Nodes: Group Processing • Data Mining Database • SAS Code • Control Point • Subdiagram

  11. Readings on SAS EM nodes • Read Sarma text, chapter 2, for more info on: • 2.6.1 Input Data node • 2.6.2.1 StatExplore node • 2.6.2.2 MultiPlot node • 2.6.3 Impute node • 2.6.4 Data Partition node • 2.6.6 Variable Selection node • 2.6.7 Transform Variables node • 2.6.8 SAS Code node

  12. Reducing armed robberies in South Africa • SAS helped Absa, a major South African bank, reduce armed robberies by 41 percent over two years (2002-2003), netting a 38 percent reduction in cash loss and an 11 percent increase in customer satisfaction ratings. • Absa, one of South Africa's largest banks, uses SAS's data mining capabilities to leverage its data for better customer relationships and more targeted marketing campaigns. With SAS analytics, the bank can also track which branches are more likely to fall victim to a robbery and take effective preventive measures. http://www.sas.com/success/absa.html See the ABSA video (4 min 19 sec)

  13. Detecting Credit Card Fraud at PlastikCash • Joe Analyst at PlastikCash credit card company wants to find a way to monitor new transactions and detect those made on stolen credit cards. His goal is to detect the fraud while it is taking place. • Joe knows that the data he plans to use consists of many transaction records for a large number of customers. He also knows that a number of these transactions will turn out to be fraudulent, since stolen credit cards are reported and entered into the database every day. • Joe puts some of the transactional data aside to use as a validation data set. In a few weeks he will know which of the transactions were fraudulent and which were not, and he can then use this data to validate his fraud detection and prediction scheme. • Joe decides to follow SAS Institute's SEMMA process (Sample, Explore, Modify, Model, Assess) to lead him to the results he needs.

  14. Sampling • Looking at old PlastikCash data, Joe determines that there is an average ratio of 1 fraudulent transaction for every 5 valid transactions in the database. So he selects a sample of 20,000 fraudulent transactions and 100,000 valid transactions. This will provide him with enough detail to represent both categories of transactions accurately, while keeping the files manageable.
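Joe's stratified draw can be sketched in Python. This is a toy version: the record layout, field names, and the small counts are illustrative, not PlastikCash data, but the idea of sampling each class separately at a fixed size is the same.

```python
import random

def stratified_sample(transactions, n_fraud, n_valid, seed=42):
    """Draw fixed numbers of fraudulent and valid transactions,
    a toy version of Joe's 20,000 / 100,000 split."""
    rng = random.Random(seed)
    fraud = [t for t in transactions if t["fraud"]]
    valid = [t for t in transactions if not t["fraud"]]
    return rng.sample(fraud, n_fraud) + rng.sample(valid, n_valid)

# Toy data with 1 fraudulent transaction per 5 valid ones, as in the text.
data = [{"id": i, "fraud": i % 6 == 0} for i in range(600)]
sample = stratified_sample(data, n_fraud=20, n_valid=100)
```

Sampling each class at a chosen size (rather than taking a simple random sample) is what keeps both categories well represented while the file stays manageable.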

  15. Exploration • The next step is to explore the data from this sample. Joe creates some plots to get a first impression about the data. For a number of stolen cards, he plots the amount of purchases against the date of purchase. He begins looking for interesting features like whether purchases after the theft are more expensive or more frequent. • Joe sorts the products by the number of times a product was bought with a stolen credit card, for all customers in the sample. Then he applies a color spectrum to the products so that the product most often bought with a stolen credit card is associated with red. He studies it carefully: does the plot reveal predictive clues?
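The product ranking Joe builds before applying his color spectrum can be sketched as follows; the transaction records and field names are hypothetical:

```python
from collections import Counter

# Hypothetical transaction records; "product" and "stolen" are
# illustrative field names, not PlastikCash's schema.
transactions = [
    {"product": "electronics", "stolen": True},
    {"product": "electronics", "stolen": True},
    {"product": "fuel", "stolen": True},
    {"product": "groceries", "stolen": False},
    {"product": "electronics", "stolen": False},
]

# Rank products by how often they appear on stolen-card transactions,
# the ordering used before assigning the color spectrum.
counts = Counter(t["product"] for t in transactions if t["stolen"])
ranked = [product for product, _ in counts.most_common()]
```

On this toy data, "electronics" ranks first and would be colored red.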

  16. Exploration (cont’d) • Instead of coloring just by product, Joe decides to incorporate information about the region and venue of purchase. He uses a multi-dimensional volume visualization tool to view multiple variables at one time.

  17. Modification • Joe realizes he could continue exploring this way forever; there are just too many directions to pursue. He decides to use an automated search to find a predictive relationship, and begins to prepare for the modeling process. The plots have provided some basic understanding of how transactions differ after a card is stolen, and they have also given Joe an idea of how far back before the date of theft he needs to look to detect this difference. For example, he has determined that the data set for the modeling process need only contain information on each transaction for the previous three months.

  18. Modeling • Looking in his data mining toolbox, Joe decides that neural networks and logistic regression (a regression technique used to model a binary outcome, in this case fraudulent or not fraudulent) are appropriate modeling techniques. He knows that neural networks search over a wide variety of candidate relationships, whereas logistic regression analysis can come up with a more restricted yet more interpretable prediction. The result of these modeling techniques is a scoring function that estimates the probability that a transaction is fraudulent.
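The kind of scoring function logistic regression produces can be sketched as follows. The feature names, weights, and bias here are invented for illustration, not fitted coefficients:

```python
import math

def score(transaction, weights, bias):
    """Logistic-regression style scoring function: returns the estimated
    probability that a transaction is fraudulent."""
    z = bias + sum(weights[k] * transaction[k] for k in weights)
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid maps z to (0, 1)

# Hypothetical features: purchase amount (in $100s) and purchases per day.
weights = {"amount": 0.8, "frequency": 1.1}
tx = {"amount": 3.0, "frequency": 2.0}
p = score(tx, weights, bias=-3.0)  # a probability between 0 and 1
```

The interpretability the text mentions comes from the weights: each one says how strongly a feature pushes the fraud probability up or down.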

  19. Assessment • Joe wants to convert the scoring function into a rule that decides whether a transaction should be questioned as fraudulent. He chooses a probability threshold of 0.8. This means that if the scoring function assigns 0.85 as the probability that a transaction is fraudulent, the model classifies the transaction as fraudulent and some investigative action ensues. If the probability is 0.7, the transaction is treated as probably not fraudulent. • Before implementing his rule, Joe wants to know how good his model is. To determine the appropriate threshold, he looks at the misclassification rate, the proportion of transactions that are incorrectly classified. By constructing a plot of misclassification rates versus all possible thresholds, he chooses the threshold that minimizes misclassification. • To further assess the model and his new rule, Joe uses the validation data to obtain misclassification rates. Since these rates are comparable to those obtained in the training phase, he decides to implement the model for a one-month testing period to see how well it performs in practice.
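Picking the threshold that minimizes misclassification can be sketched as a search over candidate thresholds; the scores and true outcomes below are toy values, not validation data:

```python
def misclassification_rate(scores, labels, threshold):
    """Fraction of transactions classified incorrectly at a given
    probability threshold (label True = actually fraudulent)."""
    errors = sum((s >= threshold) != y for s, y in zip(scores, labels))
    return errors / len(scores)

def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the lowest misclassification
    rate, mirroring Joe's rate-versus-threshold plot."""
    return min(candidates, key=lambda t: misclassification_rate(scores, labels, t))

# Toy scored transactions and their true outcomes (illustrative only).
scores = [0.95, 0.85, 0.70, 0.40, 0.30, 0.10]
labels = [True, True, False, False, False, False]
t = best_threshold(scores, labels, [i / 10 for i in range(1, 10)])
```

On this toy data the search lands on 0.8, the same threshold Joe chose; on real data the minimizing threshold falls wherever the score distributions of the two classes separate best.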

  20. Assessment (cont’d) • After some adjustment of the scoring function, the system is used in production on a routine basis. It turns out that the system classifies 87% of fraudulent transactions correctly (tested six months later, when the final outcome is known). The automated system generates significant manpower savings and also allows a faster response to any transactions detected as fraudulent. • Joe's data mining ultimately saves PlastikCash about $10 million a year. Joe is promoted to Vice President of Marketing and gets a hefty raise!

  21. CRISP-DM • A de facto industry standard for data mining • Created between 1997 and 1999 by DaimlerChrysler, SPSS and NCR • The acronym stands for Cross-Industry Standard Process for Data Mining • Consists of 6 phases, intended as a cyclical process • Not all phases are necessary in every analysis • Somewhat similar to SAS Institute’s Sample – Explore – Modify – Model – Assess (SEMMA)

  22. Phases in CRISP-DM • Business understanding • Data understanding • Data preparation • Modeling • Evaluation • Deployment

  23. Business Understanding • Determining business objectives • Assessing the current situation • Establishing Data Mining goals • Developing a project plan

  24. Data Understanding • Initial data collection • Data description • Data exploration • Verification of data quality

  25. Data Preparation • Data cleaning • Data transformation • Data formatting

  26. Modeling • Separation of data into training and test sets • Visualization techniques • Cluster analysis • Decision trees • Parametric predictive models • Regression • Neural Networks
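The first step listed above, separating data into training and test sets, can be sketched as a shuffled split; the 70/30 fraction is an illustrative choice, not a CRISP-DM requirement:

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle records and split them into training and test sets.
    The test_fraction here is an illustrative default."""
    rng = random.Random(seed)
    shuffled = records[:]        # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(100)))
```

Shuffling before cutting matters: data is often ordered (by date, by customer), and an unshuffled split would give the model a systematically different test set.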

  27. Evaluation • Results should be evaluated in the context of business objectives • This may lead to identification of other needs • Gaining understanding is an iterative process

  28. Deployment • Verification of previously held hypotheses • Identification of unexpected and useful relationships (knowledge discovery) • Sound models are applied to business operations • Models need to be monitored for changes in operating conditions (they may have to be redone) • Documentation for future reference

  29. Rexer Analytics 2011 Survey: Overview • SURVEY & PARTICIPANTS: 52-item survey of data miners, conducted on-line in 2011. Participants: 1,319 data miners from over 60 countries. • FIELDS & GOALS: CRM/Marketing has been the #1 field for the past five years. “Improving the understanding of customers”, “retaining customers” and other CRM goals continue to be the primary goals. • ALGORITHMS: Decision trees, regression, and cluster analysis continue to form the top three algorithms for most data miners. A third of data miners currently use text mining and another third plan to do so in the future. • TOOLS: R continued its rise this year and is now used by close to half of all data miners (47%). R users prefer it for being free, open source, and offering a wide variety of algorithms. STATISTICA is the tool most often selected as a primary data mining tool (17%). STATISTICA, KNIME, Rapid Miner and Salford Systems received the strongest satisfaction ratings. • ANALYTIC CAPABILITY AND SUCCESS MEASUREMENT: Only 12% of corporate respondents rate their company as having very high analytic sophistication. Measures of analytic success: Return on Investment (ROI) and the predictive validity or accuracy of their models. Challenges to measuring success: user cooperation and data availability/quality.

  30. Rexer Analytics 2011 Survey: The positive impact of Data Mining In the 5th Annual Survey (2011) of Rexer Analytics (1,319 participant data miners from over 60 countries) data miners shared examples of situations where data mining is having a positive impact on society. The five areas mentioned most often were: • Health / Medical Progress • Business Improvements • Personalized Communications & Marketing • Fraud Detection • Environmental

  31. Rexer Analytics 2011 Survey: Best practices in measuring success In the 5th Annual Survey (2011) data miners shared their best practices in how they measure analytics project performance / success. 236 data miners shared their best practices. There is great diversity in data miners' performance / success measurement methodologies, and many data miners described using multiple measurement techniques. The five methodologies mentioned by the most data miners were: • Model performance (accuracy, F, ROC, AUC, lift) • Financial performance (ROI and other financial measures) • Performance in a control or other group • Feedback from users, clients, or management • Cross-validation
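The cross-validation practice in the list above can be sketched as a k-fold splitter. This is a minimal sketch: a real project would fit the model on each training portion and score it on the corresponding hold-out fold.

```python
def k_folds(records, k):
    """Split records into k folds; each fold serves once as the hold-out
    set while the remaining folds form the training set."""
    folds = [records[i::k] for i in range(k)]   # round-robin assignment
    for i in range(k):
        holdout = folds[i]
        training = [r for j, f in enumerate(folds) if j != i for r in f]
        yield training, holdout

for training, holdout in k_folds(list(range(10)), k=5):
    pass  # fit on `training`, measure performance on `holdout`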

  32. Rexer Analytics: Overcoming Data Mining challenges In the four annual data miner surveys, these key challenges have been identified by data miners more than any others: • Dirty Data • Explaining Data Mining to Others • Unavailability of Data / Difficult Access to Data
