COMP 4332 Tutorial 2 Feb 18 Chen Zhao


Presentation Transcript


  1. Project 1: KDD 2009 Orange Challenge COMP 4332 Tutorial 2 Feb 18 Chen Zhao

  2. All information on this website • http://www.kddcup-orange.com/

  3. Record KDD Cup Participation

  4. The story behind the challenge • The French telecom company Orange. • Task: predict the propensity of customers to • switch provider (churn), • buy new products or services (appetency), or • buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling) • Estimate the churn, appetency and up-selling probability of customers.

  5. Data, constraints and requirements • Train and deploy requirements • About one hundred models per month • Fast data preparation and modeling • Fast deployment • Model requirements • Robust • Accurate • Understandable • Business requirement • Return on investment for the whole process • Input data • Relational databases • Numerical or categorical • Noisy • Missing values • Heavily unbalanced distribution • Train data • Hundreds of thousands of instances • Tens of thousands of variables • Deployment • Tens of millions of instances

  6. Design of the challenge • Orange business objective • Benchmark the in-house system against state-of-the-art techniques • Data • Data store • Not an option • Data warehouse • Confidentiality and scalability issues • Relational data requires domain knowledge and specialized skills • Tabular format • Standard format for the data mining community • Domain knowledge incorporated using feature construction (PAC) • Easy anonymization • Tasks • Three representative marketing tasks • Requirements • Fast data preparation and modeling (fully automatic) • Accurate • Fast deployment • Robust • Understandable

  7. Data sets extraction and preparation • Input data • 10 relational tables • A few hundred fields • One million customers • Instance selection • Resampling given the three marketing tasks • Keep 100,000 instances, with less unbalanced target distributions • Variable construction • Using PAC technology • 20,000 constructed variables to get a tabular representation • Keep 15,000 variables (discard constant variables) • Small track: subset of 230 variables related to classical domain knowledge • Anonymization • Discard variable names, discard identifiers • Randomize order of variables • Rescale each numerical variable by a random factor • Recode each categorical variable using random category names • Data samples • 50,000 train and test instances sampled randomly • 5,000 validation instances sampled randomly from the test set
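  The anonymization recipe on this slide is simple to reproduce. Below is a rough pandas sketch of the rescaling, recoding, and renaming steps; the table, column names, and factor range are all invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Tiny stand-in table; the challenge applied this to ~15,000 variables.
df = pd.DataFrame({"revenue": [10.0, 20.0, 30.0],
                   "plan": ["gold", "silver", "gold"]})

# Rescale each numerical variable by a random positive factor.
for col in df.select_dtypes("number"):
    df[col] = df[col] * rng.uniform(0.5, 2.0)

# Recode each categorical variable with opaque random category names.
for col in df.select_dtypes("object"):
    mapping = {c: f"v{rng.integers(10**6)}" for c in df[col].unique()}
    df[col] = df[col].map(mapping)

# Randomize the column order and replace names with anonymous ones.
df = df.sample(frac=1, axis=1, random_state=0)
df.columns = [f"Var{i + 1}" for i in range(df.shape[1])]
```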

  8. Scientific and technical challenge • Scientific objective • Fast data preparation and modeling: within five days • Large scale: 50,000 train and test data, 15,000 variables • Heterogeneous data • Numerical with missing values • Categorical with hundreds of values • Heavily unbalanced distribution • KDD social meeting objective • Attract as many participants as possible • Additional small track and slow track • Online feedback on validation dataset • Toy problem (only one informative input variable) • Leverage challenge protocol overhead • One month to explore descriptive data and test submission protocol • Attractive conditions • No intellectual property conditions • Money prizes
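  Missing values and the unbalanced class distribution are the two properties most likely to trip up an off-the-shelf classifier. A minimal scikit-learn sketch of two standard countermeasures, mean imputation and class weighting, on synthetic stand-in data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: numeric features with ~20% missing entries and a
# heavily unbalanced -1/1 target (~5% positives), mimicking the slide.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X[rng.random(X.shape) < 0.2] = np.nan
y = np.where(rng.random(1000) < 0.05, 1, -1)

# Mean imputation handles the missing values; class_weight="balanced"
# reweights the rare positive class during training.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X, y)
```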

  9. Data • Each customer is a data instance with three labels: churn, appetency and up-selling (-1 or 1). • The feature vector for each customer has two versions: small (230 variables) and large (15,000 variables → sparse!) • For the large dataset, the first 14,740 variables are numerical and the last 260 are categorical. For the small dataset, the first 190 variables are numerical and the last 40 are categorical.
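  A possible loading sketch with pandas, assuming the tab-separated files as distributed on the challenge site (adjust the file names to your copies); the numerical/categorical split follows the counts on this slide:

```python
import pandas as pd

# File names as on the challenge download page -- adjust to your copies.
# The .data file is tab-separated with a variable-name header line; each
# .labels file holds one -1/1 label per line, in the same row order.
X = pd.read_csv("orange_small_train.data", sep="\t")
y_churn = pd.read_csv("orange_small_train_churn.labels", header=None)[0]

# Per this slide: in the small set the first 190 variables are numerical
# and the last 40 are categorical.
X_num = X.iloc[:, :190]
X_cat = X.iloc[:, 190:]
```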

  10. Training and Testing • Training: 50,000 samples with labels for churn, appetency and up-selling • Testing: 50,000 samples without labels • Task: predict a score for each customer in each task

  11. Binary classification, but predicting a score? http://www.kddcup-orange.com/evaluation.php
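  Since the evaluation ranks customers by score, you should submit a continuous score rather than a hard -1/1 prediction. With scikit-learn classifiers the usual choice is the predicted positive-class probability; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the challenge features and one of the -1/1 labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
y_train = rng.choice([-1, 1], size=200)
X_test = rng.normal(size=(50, 10))

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Submit P(y == 1) as the score: the AUC evaluation needs a ranking,
# not hard class predictions. Column order follows clf.classes_ = [-1, 1].
scores = clf.predict_proba(X_test)[:, 1]
```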

  12. How is AUC calculated?
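  AUC, the area under the ROC curve, equals the probability that a randomly drawn positive instance receives a higher score than a randomly drawn negative one (the Wilcoxon-Mann-Whitney statistic). A small sketch computing it both ways, via ranks and via scikit-learn:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, -1, -1, -1])          # toy labels
scores = np.array([0.9, 0.4, 0.5, 0.2, 0.1])   # toy predicted scores

# Rank-based AUC (Mann-Whitney): probability that a random positive
# outranks a random negative; average ranks handle tied scores.
ranks = rankdata(scores)
n_pos = (y_true == 1).sum()
n_neg = (y_true == -1).sum()
auc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc, roc_auc_score(y_true, scores))      # both: 0.8333...
```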

  13. How to deal with categorical values • Binarization: { A, B, C } -> create 3 binary variables • Ordinalization: { A, B, C } -> {1, 2, 3}
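  A quick pandas illustration of both encodings. Note the trade-off: binarization preserves the unordered nature of the categories but creates one column per value (costly for the challenge's categorical variables with hundreds of values), while ordinalization stays compact but imposes an artificial order.

```python
import pandas as pd

s = pd.Series(["A", "B", "C", "A"], name="Var1")

# Binarization (one-hot): one indicator column per category.
onehot = pd.get_dummies(s, prefix="Var1")      # -> Var1_A, Var1_B, Var1_C

# Ordinalization: map categories to integers (imposes an order).
ordinal = s.map({"A": 1, "B": 2, "C": 3})
```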

  14. Project 1 Requirement • Deadline: 25 March 2014 • Team: 1 or 2 students. • Since the judging site is down, we will judge your output on all the training data after the deadline, using AUC. • Submit your result/code/report to zchenah@ust.hk before the deadline. • We will check whether your code really works, so please also provide a readme if it is complicated. • 50% of the score for your AUC ranking over all teams • 50% of the score for the report: what you have tried and, more importantly, what you have found • Preprocessing steps • Classifiers • Ensemble methods

  15. Assignment 1 • Deadline: 11 March 2014 • Submit by team, as for Project 1 • Data exploration and experiment plan • What you have found in the data, e.g. data imbalance, various statistics over the data, data preprocessing methods you want to apply or have applied, etc. • A plan for which classification methods (SVM, kNN, naive Bayes, etc.) and ensemble methods you want to try. You should be familiar with the tools and their I/O formats. • At least a three-page report • Basically, Assignment 1 is a mid-term/progress report for the project. • By the way, the "10 paper presentation" is individual work and is not related to P1/A1.

  16. Winning methods • Fast track: • IBM Research, USA +: Ensemble of a wide variety of classifiers. Effort put into coding (most frequent values coded with binary features, missing values replaced by the mean, extra features constructed, etc.) • ID Analytics, Inc., USA +: Filter+wrapper feature selection. TreeNet by Salford Systems, an additive boosting decision tree technology; bagging also used. • David Slate & Peter Frey, USA: Grouping of modalities/discretization, filter feature selection, ensemble of decision trees. • Slow track: • University of Melbourne: CV-based feature selection targeting AUC. Boosting with classification trees and shrinkage, using Bernoulli loss. • Financial Engineering Group, Inc., Japan: Grouping of modalities, filter feature selection using AIC, gradient tree-classifier boosting. • National Taiwan University +: Average of 3 classifiers: (1) Solve the joint multiclass problem with an l1-regularized maximum entropy model. (2) AdaBoost with a tree-based weak learner. (3) Selective Naïve Bayes. • (+: small dataset unscrambling)
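  Tree ensembles with boosting recur across almost every winning entry above. A rough baseline in the same spirit, gradient-boosted trees with shrinkage evaluated by cross-validated AUC; this is an illustrative sketch on synthetic data, not a reconstruction of any team's pipeline:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; real features would come from the Orange table.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.choice([-1, 1], size=500, p=[0.9, 0.1])

# Boosted decision trees with shrinkage (learning_rate), echoing the
# slow-track winners' recipes; on random data the AUC will hover near 0.5.
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 max_depth=3, random_state=0)
print(cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean())
```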
