PKDD Discovery Challenge (not only) on Financial Data. Petr Berka, Laboratory for Intelligent Systems, University of Economics, Prague, berka@vse.cz
Cups, Challenges, Competitions • KDD Cups (since 1997) • KDD Sisyphus at ECML 1998 • PKDD Discovery Challenges (since 1999) • COIL Competition 2000 • PAKDD Challenge 2000 • PT Challenge 2000, 2001 • JSAI KDD Challenge 2001 • EUNITE Competition 2001, 2002 • . . . DMLL Workshop, ICML 2002 Petr Berka, LISp, 2002
PKDD Discovery Challenge Idea • Realistic data mining conditions • collaborative rather than competitive nature • rather vague specification of the problem • Differences from real KDD projects • short time for analysis (2-3 months) • only indirect access to domain and data experts during the KDD process
Challenge Settings • Data and their full description available on the web for all participants • Submissions evaluated by domain experts (but no ranking, no winners and losers) • Workshop at PKDD to present the results and discuss them with domain experts • Results and comments of experts available on the web (after the workshop)
PKDD Challenges http://lisp.vse.cz/challenge • 1999, Prague • financial data, thrombosis data • 2000, Lyon • financial data, modified thrombosis data • 2001, Freiburg • modified thrombosis data • 2002, Helsinki • atherosclerosis data, hepatitis data
Financial Challenge Background • Czech bank offering private accounts • Available data for pilot study (29000 clients) • personal characteristics • basic info about accounts • transactions for three months • Proposed tasks • segmentation (defining different types of clients w.r.t. debt) • early detection of debts
Financial Challenge Data
Contributions • Method oriented • show a method/system working on the data • Problem oriented (prototype solutions) • loan and/or credit cards description • loan and/or credit cards classification • initial exploration • relation between branches • clients segmentation
Description of loans • Relations between loan category and account characteristics [Coufal et al., 1999 - GUHA] [Mikšovský et al., 1999 - EXCEL]
Classification of loans • Detecting risky clients before they are granted a loan [Mikšovský et al., 1999 - C5.0] • decision tree to find the relevance of attributes • decision tree for classification (using misclassification costs)
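The role of misclassification costs in the loan classification above can be illustrated with a minimal sketch. It is not the method of Mikšovský et al. or the internals of C5.0; it only shows the underlying decision rule, assuming a classifier already supplies an estimated probability that a client is risky. The function name and the cost values are hypothetical.

```python
def expected_cost_decision(p_bad, cost_grant_bad, cost_reject_good):
    """Choose 'grant' or 'reject' by minimizing expected misclassification cost.

    p_bad            -- estimated probability that the client will default
    cost_grant_bad   -- hypothetical cost of granting a loan to a bad client
    cost_reject_good -- hypothetical cost of rejecting a good client
    """
    # Granting is only costly when the client turns out to be bad.
    cost_if_grant = p_bad * cost_grant_bad
    # Rejecting is only costly when the client would have been good.
    cost_if_reject = (1.0 - p_bad) * cost_reject_good
    return "reject" if cost_if_reject < cost_if_grant else "grant"


if __name__ == "__main__":
    # With equal costs, a 20% default risk is still granted a loan.
    print(expected_cost_decision(0.2, 1.0, 1.0))
    # If a bad loan is assumed ten times as costly, the same client is rejected.
    print(expected_cost_decision(0.2, 10.0, 1.0))
```

With asymmetric costs the decision threshold moves away from 0.5, which is why a cost-aware tree can flag more risky clients than one trained for plain accuracy.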
Credit Cards Promotion • Description - find characteristics of a card holder • deviation detection • Classification - predict score for "card value" • k-nearest neighbour [Putten, 1999]
Clients Segmentation • Description - segmentation of clients according to transactions [Hotho, Maedche, 2000] • Kohonen map + decision trees • Rule #1 for Cluster 3: If ATTR5 > 9945 and ATTR13 > 0 Then -> Cluster 3 (115, 0.983)
Challenge Organizing Lessons • Getting and preparing real data is difficult • The time for analysis should be as long as possible • The response rate was rather low (~ 10%) • No synergy effect observed
DM Lessons (1/4) • Cooperate with experts • domain experts • data experts • . . . • … and with users
DM Lessons (2/4) • Use knowledge intensive preprocessing methods • … • compute age and sex from birth_number • set flags for different types of operations • compute monthly characteristics of transactions (sum, avg, min, max) $\mathit{lbalance} = \frac{1}{30} \sum_i \mathit{balance}(i) \cdot \mathit{days}(i)$ • …
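Two of the preprocessing steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the participants' code. It assumes that birth_number is a six-digit YYMMDD string in which women have 50 added to the month, that all birth years fall in the 1900s, and that the balance formula is a day-weighted average over a 30-day month. The function names and the sample values are hypothetical.

```python
from datetime import date


def decode_birth_number(birth_number, century=1900):
    """Split a YYMMDD-style birth number into (sex, birth date).

    Assumption: a month value above 50 marks a woman (month + 50);
    the century is not encoded and must be supplied by the caller.
    """
    yy = int(birth_number[0:2])
    mm = int(birth_number[2:4])
    dd = int(birth_number[4:6])
    sex = "F" if mm > 50 else "M"
    if mm > 50:
        mm -= 50
    return sex, date(century + yy, mm, dd)


def age_on(birth, ref):
    """Age in completed years on the reference date."""
    years = ref.year - birth.year
    # Subtract one year if the birthday has not yet occurred in ref's year.
    if (ref.month, ref.day) < (birth.month, birth.day):
        years -= 1
    return years


def day_weighted_balance(segments, period_days=30):
    """Average balance over the period, weighting each balance by the
    number of days it was held. segments is a list of (balance, days)."""
    return sum(balance * days for balance, days in segments) / period_days


if __name__ == "__main__":
    sex, birth = decode_birth_number("706213")
    print(sex, birth, age_on(birth, date(1999, 1, 1)))
    print(day_weighted_balance([(1000, 10), (4000, 20)]))
```

Weighting by days keeps a balance that was held for most of the month from being swamped by a short-lived spike, which a plain average over transactions would not guarantee.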
DM Lessons (3/4) • Make the results understandable [Werner, Fogarty 2001]
DM Lessons (4/4) • Show some (even preliminary) results soon • experts are interested in solutions, not in the application of sophisticated methods
Discovery Challenge Benefits • Experts • deeper insight into the data • Participants • experience with analyzing large real data • motivation for further research • ML/KDD Community • prototype tasks/solutions (like the MiningMart project?) • Organizers • … invitation to DMLL Workshop :-)
Thank You