
Datamining: Techniques and Applications in Economics

Datamining: Techniques and Applications in Economics. Rob Potharst Econometric Institute. Outline of this lecture. Part 1: Intelligent Decisions in Direct Mailing Part 2: Brand Choice using Ensemble Methods Part 3: Ensemble techniques for Choice Problems, especially Churn.


Presentation Transcript


  1. Datamining: Techniques and Applications in Economics. Rob Potharst, Econometric Institute

  2. Outline of this lecture • Part 1: Intelligent Decisions in Direct Mailing • Part 2: Brand Choice using Ensemble Methods • Part 3: Ensemble techniques for Choice Problems, especially Churn

  3. Part 1: Intelligent Decisions in Direct Mailing • Rob Potharst, Uzay Kaymak, Wim Pijls (Erasmus University Rotterdam, Faculty of Economics, Dept. of Computer Science) • Jedid-Jah Jonker (SCP) and Nanda Piersma (HES)

  4. Outline • Decision problems in direct mailing • The charity organization case • Target selection (models: logistic regression, CHAID, neural networks, association rules, fuzzy modelling) • The frequency problem (models: MDP, reinforcement learning) • Legend on the slide: italic = CI methods Datamining for ICT & Economics, part 1: Direct Mailing

  5. Classical literature • On optimal mailing policies: Bitran & Mondschein (1996), Mailing Decisions in the Catalog Sales Industry • On target selection: Bult & Wansbeek (1995), Optimal Selection for Direct Mail

  6. This part of the lecture is based on: • R. Potharst, U. Kaymak & W. Pijls (2001), Neural Networks for Target Selection in Direct Marketing • W. Pijls, R. Potharst & U. Kaymak (2001), Pattern-based Target Selection Applied to Fund Raising • U. Kaymak (2001), Fuzzy Target Selection using RFM variables • J.J. Jonker, N. Piersma & R. Potharst (2002), Direct Mailing Decisions for a Dutch Fundraiser • http://www.few.eur.nl/few/people/potharst/

  7. Thanks to: • Jedid-Jah Jonker (Sociaal en Cultureel Planbureau, Den Haag) • Uzay Kaymak (Erasmus University, Rotterdam) • Nanda Piersma (HES, Amsterdam) • Wim Pijls (Erasmus University, Rotterdam) • an anonymous charity organization

  8. Decisions in direct mailing • Target selection: To which addresses are we going to send the next mailing? • Frequency: How often are we going to send a mailing to each separate address? • Inventory size: How many items of each product should we have in stock? • etc.

  9. Charity case • A large Dutch charity organization • Goal: to stimulate social and scientific research on a common disease • More than 700 000 supporters • Annual budget of more than 15 million euro • Multiple mailing campaigns a year, asking for donations

  10. Database • Information about more than 700 000 supporters • About 675 000 considered for mailings • Each supporter's donation history is tracked from the first-ever donation onward (cumulative database) • Recorded data (about 0.5 GB): mailing dates, donation amount, donation time, administrative data

  11. Target selection • A problem from (direct) marketing • Generate profiles (models) of customers who could be interested in a product • Models are built by analyzing data from similar (previous) campaigns • Classification problem: separate positive cases from negative cases and determine their characteristics

  12. Target selection cycle [diagram labels: customers, product, test campaign, data gathering, model conceptualization, target selection, purchase]

  13. Charity donations • Charity organizations have supporters who donate money for a good cause • Supporters are invited to donate through several mailings per year • Charity organizations may have different strategies for mailing supporters • Select those supporters who are likely to donate in a particular mailing

  14. Target selection for supporters [diagram labels: supporters; data gathering, past donation behavior; model; target selection; more donations]

  15. Target selection models • Segmentation based, e.g. CHAID: divide the customer base into disjoint segments; select the most promising segments; segments are assumed to be homogeneous • Scoring based, e.g. logistic regression: score each customer in the customer base; select the customers with the highest scores; individual approach
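
The difference between the two families can be sketched in a few lines of Python. This is only an illustration under assumptions: the customer records, the `segment` and `score` fields, and the segment response rates are all made up, and in practice they would come from a fitted CHAID tree or logistic regression.

```python
def select_by_segment(customers, segment_rate, n_segments):
    """Segmentation approach: keep everyone in the most promising segments."""
    best = sorted(segment_rate, key=segment_rate.get, reverse=True)[:n_segments]
    return [c for c in customers if c["segment"] in best]


def select_by_score(customers, n):
    """Scoring approach: rank individuals and keep the top n."""
    return sorted(customers, key=lambda c: c["score"], reverse=True)[:n]


# Hypothetical customers with a segment label and an individual score.
customers = [
    {"id": 1, "segment": "A", "score": 0.9},
    {"id": 2, "segment": "B", "score": 0.2},
    {"id": 3, "segment": "A", "score": 0.4},
    {"id": 4, "segment": "C", "score": 0.7},
]
segment_rate = {"A": 0.6, "B": 0.1, "C": 0.3}  # made-up response rates

by_segment = [c["id"] for c in select_by_segment(customers, segment_rate, 1)]
by_score = [c["id"] for c in select_by_score(customers, 2)]
```

With these numbers the segment rule keeps customer 3 (low score, good segment), while the score rule keeps customer 4 instead, which is the "individual approach" the slide refers to.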

  16. Gain chart [figure: fraction of responders (0 to 1) plotted against fraction selected (0 to 1); curves: ideal, typical, random]
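
As a rough sketch of how the points of a gain chart can be computed: rank the supporters by model score, then track which share of all responders has been reached after each selection. The function name and the scores and responses below are made up for illustration.

```python
def gain_curve(scores, responded):
    """Return (fraction selected, fraction of responders reached) points."""
    # Rank supporters from highest to lowest model score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_responders = sum(responded)
    points = [(0.0, 0.0)]
    found = 0
    for rank, i in enumerate(order, start=1):
        found += responded[i]
        points.append((rank / len(scores), found / total_responders))
    return points


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]  # made-up model scores
responded = [1, 1, 0, 1, 0, 0, 0, 0]                # made-up responses (1 = donated)
curve = gain_curve(scores, responded)
```

A random selection would follow the diagonal, while an ideal model would reach all responders as early as possible.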

  17. Hit probability chart [figure: response fraction (0.3 to 1) plotted against fraction selected (0 to 1); curves: ideal, typical, random]
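
A hit probability chart can be read as the response fraction inside the selected group, plotted against the size of that group. The sketch below assumes that reading; the function name and data are made up for illustration.

```python
def hit_probability_curve(scores, responded):
    """Return (fraction selected, response fraction within the selection) points."""
    # Rank supporters from highest to lowest model score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = []
    found = 0
    for rank, i in enumerate(order, start=1):
        found += responded[i]
        points.append((rank / len(scores), found / rank))
    return points


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]  # made-up model scores
responded = [1, 1, 0, 1, 0, 0, 0, 0]                # made-up responses (1 = donated)
hits = hit_probability_curve(scores, responded)
```

At full selection the curve ends at the overall response rate of the mailing, whatever the model.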

  18. Data sources • External databases (rental list): maintained by specialized companies; household-specific information; demographic information at ZIP code level • Internal databases (house list): maintained by the company itself; traces the purchase history of each customer; the most reliable and relevant information about the customer

  19. RFM variables • Recency: How recent was the last purchase? E.g. number of days since last purchase • Frequency: How frequent are the purchases? E.g. fraction of responded mailings • Monetary value: How much has the customer spent? E.g. average spending per mailing

  20. Feature selection • RFM variables: often appropriate to capture specifics of customers; relatively small number of variables; not suitable for identifying new or future prospects • Feature selection (and sometimes reduction) is still needed to select the most relevant variables

  21. Why neural networks? • The hope is that neural networks yield good target selection models that successfully predict which supporters are likely to donate • Performance might be better than that of segmentation models like CHAID and scoring methods like logistic regression

  22. Feature selection • R1 = Number of weeks since last response • R2 = Number of months since first-ever donation • F1 = Fraction of responded mailings • F2 = Response time for last response • M1 = Average donated amount per mailing • M2 = Last donated amount • M3 = Average donation per year
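
A minimal sketch of how some of these features could be derived from one supporter's mailing history. The record layout (a list of `(mailing_week, amount)` pairs) is hypothetical, the mailing week stands in for the response week, and M1 is averaged over all mailings received; the slides do not specify any of these details.

```python
def rfm_features(history, current_week):
    """history: (mailing_week, donated_amount) pairs, amount 0.0 if no response."""
    responses = [(week, amount) for week, amount in history if amount > 0]
    if not responses:
        return None  # no donation on record; not covered by the slides
    last_week, last_amount = max(responses)  # most recent response
    return {
        "R1": current_week - last_week,                    # weeks since last response
        "F1": len(responses) / len(history),               # fraction of responded mailings
        "M1": sum(a for _, a in history) / len(history),   # average amount per mailing
        "M2": last_amount,                                 # last donated amount
    }


# Made-up history: four mailings, two of them answered.
history = [(1, 10.0), (14, 0.0), (27, 25.0), (40, 0.0)]
features = rfm_features(history, current_week=52)
```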

  23. Data preparation • Data set selection: which previous mailing to use for modeling? Influence of mailing strategy; select most recent full mailings (1998, 1999) • Data set size: about 5000 randomly selected supporters; independent training and test sets • training set 1998: 4057 samples • test set 1998: 4080 samples • training set 1999: 4111 samples • test set 1999: 4131 samples

  24. Feedforward neural network [diagram labels: input layer, hidden layer, output layer; logistic, linear] • 7 inputs • 1 hidden layer • 4 hidden neurons • 1 output • normalized inputs and outputs • initial weights random in (-0.1, 0.1)
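
The network shape on this slide can be sketched as a plain forward pass. The sketch assumes logistic hidden units, a linear output unit, and one bias weight per neuron; the slide shows the labels "logistic" and "linear" but does not say which layer each belongs to. Training is omitted, and the input vector is made up.

```python
import math
import random


def init_network(n_inputs=7, n_hidden=4, seed=0):
    """Weights drawn uniformly between -0.1 and 0.1; last weight in each row is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]
    return hidden, output


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(network, inputs):
    """Score one supporter from 7 normalized input features."""
    hidden, output = network
    # Assumed: logistic activation in the hidden layer.
    activations = [
        logistic(sum(w * x for w, x in zip(row, inputs)) + row[-1])
        for row in hidden
    ]
    # Assumed: linear output unit.
    return sum(w * h for w, h in zip(output, activations)) + output[-1]


network = init_network()
score = forward(network, [0.2, 0.5, 0.1, 0.9, 0.3, 0.4, 0.6])  # made-up inputs
```

Supporters would then be ranked by `score`, as in any scoring-based selection.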

  25. Results on 1999 data set [figure: fraction of responders (0 to 1) plotted against fraction selected (0 to 1); curves: ideal, nn trained on 1998 data, nn trained on 1999 data, random]

  26. Results on 1999 data set [figure: response fraction (0.4 to 0.8) plotted against fraction selected (0 to 0.5); curves: nn trained on 1998 data, nn trained on 1999 data]

  27. NN vs. logistic regression [figure: fraction responded (0 to 1) plotted against fraction selected (0 to 1); curves: ideal, neural network, logistic regression, random; training set 1998, test set 1999]

  28. NN vs. logistic regression [figure: response fraction (0.4 to 0.8) plotted against fraction selected (0 to 0.5); curves: neural network, logistic regression; training set 1998, test set 1999]

  29. Neural network vs. CHAID [figure: fraction of responders (0 to 1) plotted against fraction selected (0 to 1); curves: ideal, neural network, CHAID, random; training set 1998, test set 1998]

  30. Conclusions • Neural networks can be used successfully to build target selection models • They outperform segmentation methods like CHAID, but their performance is comparable to that of statistical regression methods • There is evidence that a single neural network model can be used for target selection in multiple mailing campaigns

  31. Why patterns/association rules? Question: Is it possible to have + , + ? Answer: this study! = pattern-based

  32. Patterns and their support

  33. Definitions • a pattern is a set of attribute/value combinations • a record R is a supporter of a pattern P if all attribute/value combinations of P match those of R • Example: (3,1,2) is a supporter of (b = 1, c = 2) • the support of a pattern P is the number of supporters of P
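
These definitions translate directly into code. In the sketch below, records and patterns are represented as dictionaries from attribute name to value; that representation and the small record list are choices made for illustration, not something the slides prescribe.

```python
def matches(record, pattern):
    """True if the record is a supporter of the pattern."""
    return all(record.get(attr) == value for attr, value in pattern.items())


def support(pattern, records):
    """Number of records that are supporters of the pattern."""
    return sum(matches(record, pattern) for record in records)


# The example from the slide: record (3,1,2) over attributes a, b, c.
record = {"a": 3, "b": 1, "c": 2}
pattern = {"b": 1, "c": 2}
is_supporter = matches(record, pattern)

# A made-up database of three records.
records = [
    {"a": 3, "b": 1, "c": 2},
    {"a": 1, "b": 1, "c": 2},
    {"a": 1, "b": 2, "c": 2},
]
pattern_support = support(pattern, records)
```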

  34. Frequent patterns • Given a minimum support minsup, a pattern P is said to be frequent if support(P) ≥ minsup • The set of frequent patterns can be represented by a trie • An algorithm for finding frequent itemsets (like Apriori by Agrawal et al.) can also be used to find frequent patterns
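
A small level-wise search in the spirit of Apriori, as a sketch only: it grows patterns one attribute/value pair at a time and keeps a candidate only if all of its sub-patterns were frequent. It stores results in a dictionary rather than a trie, and the records are made up, so it should not be read as the authors' implementation.

```python
from itertools import combinations


def frequent_patterns(records, minsup):
    """Map each frequent pattern (a sorted tuple of (attr, value) pairs) to its support."""
    items = sorted({(a, v) for r in records for a, v in r.items()})

    def sup(pattern):
        return sum(all(r.get(a) == v for a, v in pattern) for r in records)

    frequent = {}
    level = [(item,) for item in items if sup((item,)) >= minsup]
    while level:
        for p in level:
            frequent[p] = sup(p)
        current = set(level)
        # Extend each pattern with a later item on an attribute it does not use yet.
        candidates = {
            p + (item,)
            for p in level
            for item in items
            if item > p[-1] and item[0] not in {a for a, _ in p}
        }
        # Keep candidates whose sub-patterns are all frequent and that reach minsup.
        level = sorted(
            c for c in candidates
            if all(s in current for s in combinations(c, len(c) - 1))
            and sup(c) >= minsup
        )
    return frequent


records = [  # made-up records over attributes a, b, c
    {"a": 1, "b": 1, "c": 2},
    {"a": 1, "b": 1, "c": 1},
    {"a": 2, "b": 1, "c": 2},
    {"a": 1, "b": 2, "c": 2},
]
found = frequent_patterns(records, minsup=2)
```

With `minsup=2`, the three single-attribute patterns a = 1, b = 1, c = 2 and their three pairwise combinations are frequent, while the full triple is supported by only one record and is dropped.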

  35. The trie of frequent patterns

  36. Support and response counts

  37. With response rates

  38. Selecting the target group • Target group: the first record (1,1,2) matches the following frequent patterns: • (a = 1) => response rate = 50% • (b = 1) => response rate = 80% • (a = 1, b = 2) => response rate = 100% => max (mrr)

  39. PatSelect • Input: a set of records • Output: a subset of size n, the target group • 1. For all records R in the given set: let S be the set of all frequent patterns that match R, and let mrr(R) = max { resp.rate(P) | P in S } • 2. Sort all records by decreasing mrr • 3. Select the topmost n records
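
The three steps of PatSelect can be sketched as follows. The sketch assumes the frequent patterns and their response rates have already been computed; the patterns, rates, and records are made up, and giving a record that matches no frequent pattern an mrr of 0.0 is an assumption the slides do not address.

```python
def pat_select(records, pattern_rates, n):
    """Return the n records with the highest maximum response rate (mrr).

    pattern_rates: list of (pattern, response_rate) pairs for the frequent patterns,
    where each pattern is a dict from attribute to value.
    """
    def mrr(record):
        rates = [
            rate for pattern, rate in pattern_rates
            if all(record.get(a) == v for a, v in pattern.items())
        ]
        return max(rates, default=0.0)  # assumption: 0.0 when nothing matches

    ranked = sorted(records, key=mrr, reverse=True)  # step 2
    return ranked[:n]                                # step 3


pattern_rates = [  # hypothetical frequent patterns with response rates
    ({"a": 1}, 0.5),
    ({"b": 1}, 0.8),
    ({"a": 1, "c": 2}, 1.0),
]
records = [
    {"id": "r4", "a": 2, "b": 2, "c": 1},  # matches nothing
    {"id": "r3", "a": 1, "b": 2, "c": 1},  # mrr 0.5
    {"id": "r2", "a": 2, "b": 1, "c": 1},  # mrr 0.8
    {"id": "r1", "a": 1, "b": 1, "c": 2},  # mrr 1.0
]
target_group = [r["id"] for r in pat_select(records, pattern_rates, n=2)]
```

Because each selected record carries the pattern that produced its mrr, the reason for its inclusion can be reported alongside the selection.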

  40. Fund raising application • Dutch charity organization • more than 700 000 supporters • 26 mailing campaigns (dates, targets, responses) • spread over six years ('94 to '99) • database of over 400 MB

  41. Research questions • 1) How to select a target group with as high a response rate as possible, on the basis of historical data? • 2) How to select a target group with as high a total donated amount as possible, again on the basis of historical data? • This study: question 1.

  42. RFM features • R1: # weeks since last response • R2: # months since first donation • F1: fraction of mailings supporter has responded to • F2: median response time of supporter • M1: etc.

  43. Model construction • Choose only the full mailing campaigns of '98 and '99 • random split: training set 50%, test set 50% • resulting datasets: tr98, tr99, test98, test99 • each somewhat less than 200 000 cases!
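
A random 50/50 split of this kind can be sketched with the standard library. The function name, the fixed seed, and the stand-in records are assumptions for illustration.

```python
import random


def split_half(records, seed=0):
    """Shuffle a copy of the records and cut it into training and test halves."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


campaign = list(range(10))  # stand-in for the records of one mailing campaign
training_set, test_set = split_half(campaign)
```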

  44. Results '99, trained on '98 data

  45. Results '99, trained on '99 data

  47. Comparison • Neither a pure scoring nor a pure segmentation method • not segments, since patterns can overlap • many patterns => many different scores => performance comparable with scoring methods • but also:

  48. Interpretability high, since each supporter's presence in the target group can be explained by the supporter matching a pattern with a high response rate!

  49. Conclusions • New method based on patterns and association rule algorithms, with the following characteristics: high response rate; high interpretability • An interesting method, especially for large databases

  50. Why fuzzy? Advantages of fuzzy target selection models in marketing • greater predictive power than conventional statistical models • a large degree of transparency, due to the linguistic rules that can be derived from the data
