KDD-09 Tutorial: Predictive Modeling in the Wild: Success Factors in Data Mining Competitions and Real-Life Projects

Saharon Rosset, Tel Aviv University; Claudia Perlich, IBM Research

Presentation Transcript


  1. KDD-09 Tutorial: Predictive Modeling in the Wild: Success Factors in Data Mining Competitions and Real-Life Projects. Saharon Rosset, Tel Aviv University; Claudia Perlich, IBM Research

  2. Predictive modeling Most general definition: build a model from observed data, with the goal of predicting some unobserved outcomes • Primary example: supervised learning • Get training data (x1,y1), (x2,y2), …, (xn,yn) drawn i.i.d. from a joint distribution on (X, y) • Build a model f(x) to describe the relationship between x and y • Use it to predict y when only x is observed in the “future” • Other cases may relax some of the supervised learning assumptions • For example: in KDD Cup 2007 we did not see any yi’s and had to extrapolate them based on the training xi’s – see later in the tutorial
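A minimal sketch of this supervised-learning setup (assuming scikit-learn; the data and estimator are purely illustrative):

```python
# Minimal supervised-learning sketch: fit f(x) on observed (x_i, y_i) pairs,
# then predict y for "future" x where only x is observed.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))                      # observed x_1..x_n
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

f = LinearRegression().fit(X_train, y_train)             # build the model f(x)

X_future = rng.normal(size=(5, 3))                       # only x is observed here
y_hat = f.predict(X_future)                              # predicted outcomes
```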

  3. Predictive Modeling Competitions Competitions like KDD-Cup extract “core” predictive modeling challenges from their application environment • Usually supposed to represent real-life predictive modeling challenges • Extracting a real-life problem from its context and making a credible competition out of it is often more difficult than it seems • We will see it in examples

  4. The Goals of this Tutorial • Understand the two modes of predictive modeling, their similarities and differences: • Real life projects • Data mining competitions • Describe the main factors for success in the two modes of predictive modeling • Discuss some of the recurring challenges that come up in determining success These goals will be addressed and demonstrated through a series of case studies

  5. Credentials in Data Mining Competitions Saharon Rosset - Winner KDD CUP 99~ - Winner KDD CUP 00+ Claudia Perlich - Runner up KDD CUP 03* - Winner ILP Challenge 05 - Winner KDD CUP 09@ • Jointly • Winners of KDD CUP 2007@ • Winners of KDD CUP 2008@ • Winners of the INFORMS Data Mining Challenge 08@ Collaborators: @Prem Melville, @Yan Liu, @Grzegorz Swirszcz, *Foster Provost, *Sofus Macskassy, +~Aron Inger, +Nurit Vatnik, +Einat Neuman, @Alexandru Niculescu-Mizil

  6. Experience with Real Life Projects 2004-2009 Collaboration on Business Intelligence projects at IBM Research • Total of >10 publications on real life projects • Total 4 IBM Outstanding Technical Achievement awards • IBM accomplishment and major accomplishment • Finalists in this year’s INFORMS Edelman Prize for real-life applications of Operations Research and Statistics One of the successful projects will be discussed here as a case study

  7. Outline • Introduction and overview (SR) • Differences between competitions and real life • Success Factors • Recurrent challenges in competitions and real projects • Case Studies • KDD CUP 2007 (SR) • KDD CUP 2008 (CP) • Business Intelligence Example : Market Alignment Program (MAP) (CP) • Conclusions and summary (CP)

  8. Differences between competitions and projects In this tutorial we deal with the predictive modeling aspect, so our discussion of projects will also start with a well defined predictive task and ignore most of the difficulties with getting to that point

  9. Real life project evolution and our focus [Flow diagram of project stages: business/modeling problem definition → statistical problem definition → modeling methodology design → data preparation & integration → model generation & validation → implementation & application development, illustrated with the sales force management / wallet estimation project (quantile estimation, latent variable estimation, graphical models, IBM relationships and firmographics data, programming/simulation, IBM Wallets, OnTarget, MAP). Annotations mark which stages are our focus, which are loosely related, and which are not our focus.]

  10. Two types of competitions • Sterile: • Clean data matrix • Standard error measure • Often anonymized features • Pure machine learning • Examples: KDD Cup 2009, PKDD Challenge 2007 • Approach: emphasize algorithms and computation; attack with heavy (kernel?) machines • Challenges: size, missing values, number of features • Real: • Raw data • Set up the model yourself • Task-specific evaluation • Simulates real-life mode • Examples: KDD Cup 2007, KDD Cup 2008 • Approach: understand the domain, analyze the data, build the model • Challenges: too numerous to list

  11. Factors of Success in Competitions and Real Life 1. Data and domain understanding • Generation of data and task • Cleaning and representation/transformation 2. Statistical insights • Statistical properties • Test validity of assumptions • Performance measure 3. Modeling and learning approach • Most “publishable” part • Choice or development of the most suitable algorithm [Slide annotation: “Real” and “Sterile” markers indicating the relative weight of these factors in the two competition types]

  12. Recurring challenges We emphasize three recurring challenges in predictive modeling that often get overlooked: • Data leakage: impact, avoidance and detection • Leakage: use of “illegitimate” data for modeling • “Legitimate data”: data that will be available when model is applied • In competitions, the definition of leakage is unclear • Adapting learning to real-life performance measures • Could move well beyond standard measures like MSE, error rate, or AUC • We will see this in two of our case studies • Relational learning/Feature construction • Real data is rarely flat, and good, practical solutions for this problem remain a challenge

  13. 1 Leakage in Predictive Modeling Introduction of predictive information about the target by the data generation, collection, and preparation process • Trivial example: a binary target was created using a cutoff on a continuous variable and, by accident, the continuous variable was not removed (see the toy sketch below) • Reversal of cause and effect when information from the future becomes available • Leakage produces models that do not generalize: true model performance is much lower than the ‘out-of-sample’ (but leakage-containing) estimate • Commonly occurs when combining data from multiple sources or multiple time points, and often manifests itself in the ordering of records in data files • Leakage is surprisingly pervasive in competitions and real life • KDD CUP 2007 and KDD CUP 2008 had leakage, as we will see in the case studies • The INFORMS competition had leakage due to partial removal of information for only the positive cases
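A toy illustration of the “trivial example” above, assuming scikit-learn: the binary target is a threshold on a continuous column that is accidentally left among the features, so cross-validated accuracy looks nearly perfect even though nothing generalizable has been learned.

```python
# Toy leakage demo: the target is defined as a threshold on column 0,
# and column 0 is (accidentally) left in the feature matrix.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)          # target created from a feature that stays in X

leaky_acc = cross_val_score(DecisionTreeClassifier(), X, y, cv=5).mean()
clean_acc = cross_val_score(DecisionTreeClassifier(), X[:, 1:], y, cv=5).mean()
print(f"with leak: {leaky_acc:.2f}, without leak: {clean_acc:.2f}")   # ~1.00 vs ~0.50
```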

  14. Real life leakage example P. Melville, S. Rosset, R. Lawrence (2008) Customer Targeting Models Using Actively-Selected Web Content. KDD-08 Built models for identifying new customers for IBM products, based on: • IBM internal databases • Companies’ websites Example pattern: companies with the word “Websphere” on their website are likely to be good customers for IBM Websphere products • Ahem, a slight cause and effect problem • Source of the problem: we only have the current view of a company’s website, not its view from when the company was an IBM prospect (i.e., prior to buying) Ad-hoc solution: remove all obvious leakage words. • Does not solve the fundamental problem

  15. General leakage solution: “predict the future” Niels Bohr is quoted as saying: “Prediction is difficult, especially about the future” Flipping this around, if: • The true prediction task is “about the future” (it usually is) • We can make sure that our model only has access to information “at the present” • We can apply the time-based cutoff in the competition / evaluation / proof-of-concept stage ⇒ We are guaranteed (intuitively and mathematically) that we can prevent leakage For the websites example, this would require getting an internet snapshot from (say) two years ago, and using only what we knew then to learn who bought since
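One way to operationalize the “predict the future” rule is a hard timestamp cutoff when assembling the training data. A minimal pandas sketch (the event log and column names are hypothetical):

```python
# Time-based cutoff: features may only use information known before the cutoff,
# and the label is measured strictly on/after the cutoff.
import pandas as pd

# Hypothetical event log: one row per (company, date, purchased flag).
events = pd.DataFrame({
    "company": ["a", "a", "b", "b", "c"],
    "date": pd.to_datetime(["2005-03-01", "2006-07-10", "2005-11-20",
                            "2007-01-05", "2006-02-14"]),
    "purchased": [0, 1, 0, 1, 0],
})

cutoff = pd.Timestamp("2006-01-01")

# Features: only events observed strictly before the cutoff.
features = (events[events["date"] < cutoff]
            .groupby("company")
            .agg(n_events_before=("date", "size")))

# Labels: did the company buy at any point on/after the cutoff?
labels = (events[events["date"] >= cutoff]
          .groupby("company")["purchased"].max()
          .rename("bought_after_cutoff"))

train = features.join(labels).fillna({"bought_after_cutoff": 0})
```

Companies with no pre-cutoff history simply drop out of the training set, which is exactly the information state a model deployed at the cutoff date would have faced.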

  16. 2 Real-life performance measures Real life prediction models should be constructed and judged for performance on real-life measures: • Address the real problem at hand – optimize $$$, life span, etc. • At the same time, we need to maintain statistical soundness: • Can we optimize these measures directly? • Are we better off just building good models in general? Example: breast cancer detection (KDD Cup 2008) • At first sight, a standard classification problem (malignant or benign?) • Obvious extension: a cost-sensitive objective – much better to do a biopsy on a healthy subject than to send a malignant patient home! • Competition objective: optimize effective use of radiologists’ time • Complex measure called FROC • See the case study in Claudia’s part

  17. Optimizing real-life measures It is a common approach to use the prediction objective to motivate an empirical loss function for modeling: • If the prediction objective is the expected value of Y given x, then squared error loss (e.g., linear regression or CART) is appropriate • If we want to predict the median of Y instead, then absolute loss is appropriate • More generally, quantile loss can be used (cf. MAP case study) We will see successful examples of this approach in two case studies (KDD CUP 07 and MAP) What do we do with complex measures like FROC? • There is really no way to build a good model for them directly • Less ambitious approach: • Build a model using standard approaches (e.g., logistic regression) • Post-process the model’s scores to do well on the specific measure We will see a successful example of this approach in KDD CUP 08
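As a concrete instance of matching the loss to the objective, quantile loss can be plugged into standard learners. A sketch assuming scikit-learn (the data and quantile levels are illustrative):

```python
# Quantile regression sketch: optimize pinball (quantile) loss directly
# instead of squared error, here for the 0.5 and 0.8 quantiles of Y given x.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.uniform(size=(500, 1))
y = 10 * X[:, 0] + rng.exponential(scale=2.0, size=500)   # skewed noise

median_model = GradientBoostingRegressor(loss="quantile", alpha=0.5).fit(X, y)
q80_model = GradientBoostingRegressor(loss="quantile", alpha=0.8).fit(X, y)

x_new = np.array([[0.5]])
print(median_model.predict(x_new), q80_model.predict(x_new))
```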

  18. 3 Relational and Multi-Level Data Real-life databases are rarely flat! Example: INFORMS Challenge 08, medical records [slide shows a schema of several tables linked by many-to-many (m:n) relationships]

  19. Approaches for dealing with relational data Modeling approaches that use relational data directly • There has been a lot of research, but there is a scarcity of practically useful methods that take this approach Flattening the relational structure into a standard X,y setup • The key to this approach is generation of useful features from the relational tables • This is the approach we took in the INFORMS08 challenge Ad hoc approaches • Based on specific properties of the data and modeling problem, it may be possible to “divide and conquer” the relational setup • See example in the KDD CUP 08 case study

  20. Modeler’s best friend: Exploratory data analysis • Exploratory data analysis (EDA) is a general name for a class of techniques aimed at • Examining data • Validating data • Forming hypotheses about data • The techniques are often graphical or intuitive, but can also be statistical • Testing very simple hypotheses as a way of getting at more complex ones • E.g.: test each variable separately against response, and look for strong correlations • The most important proponent of EDA was the great, late statistician John Tukey

  21. The beauty and value of exploratory data analysis • EDA is a critical step in creating successful predictive modeling solutions: • Expose leakage • AVOID PRECONCEPTIONS about: • What matters • What would work • Etc. • Example: identifying the KDD CUP 08 leakage through EDA • Graphical display of identifier vs. malignant/benign (see case study slide) • Could also be discovered via a statistical variable-by-variable examination of significant correlations with the response • Key to finding this: AVOIDING PRECONCEPTIONS about the irrelevance of the identifier

  22. Elements of EDA for predictive modeling • Examine data variable by variable • Outliers? • Missing data patterns? • Examine relationships with response • Strong correlations? • Unexpected correlations? • Compare to other similar datasets/problems • Are variable distributions consistent? • Are correlations consistent? • Stare: at raw data, at graphs, at correlations/results Unexpected answers to any of these questions may change the course of the predictive modeling process
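A minimal sketch of the variable-by-variable screen described above (assuming pandas; the data frame and target column are hypothetical):

```python
# Variable-by-variable EDA screen: missing-value fraction and correlation of each
# feature with the response. Surprisingly strong correlations for "irrelevant"
# columns (e.g. an identifier) are leakage suspects.
import pandas as pd

def eda_screen(df: pd.DataFrame, target: str) -> pd.DataFrame:
    rows = []
    for col in df.columns:
        if col == target:
            continue
        x = pd.to_numeric(df[col], errors="coerce")
        rows.append({
            "variable": col,
            "missing_frac": x.isna().mean(),
            "corr_with_target": x.corr(df[target]),
        })
    return (pd.DataFrame(rows)
            .sort_values("corr_with_target", key=lambda s: s.abs(), ascending=False))

# Usage (hypothetical frame with a 0/1 response column "y"):
# print(eda_screen(my_data, target="y").head(20))
```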

  23. Case study #1: Netflix/KDD-Cup 2007

  24. October 2006 Announcement of the NETFLIX Competition USAToday headline: “Netflix offers $1 million prize for better movie recommendations” Details: • Beat the RMSE of NETFLIX’s current recommender ‘Cinematch’ by 10% prior to 2011 • $50,000 for the annual progress prize • First two awarded to the AT&T team: 9.4% improvement as of 10/08 (almost there!) • Data contains a subset of 100 million movie ratings from NETFLIX, covering 480,189 users and 17,770 movies • Performance is evaluated on holdout movie-user pairs • The NETFLIX competition has attracted ~50K contestants on ~40K teams from >150 different countries • ~40K valid submissions from ~5K different teams

  25. NETFLIX Data and the Internet Movie Data Base [Diagram: of all ~80K movies in IMDB, 17K appear in the NETFLIX data (selection unclear); of all ~6.8M users, 480K appear (those with at least 20 ratings by end of 2005); the NETFLIX competition data contains 100M ratings.]

  26. NETFLIX data generation process [Timeline diagram, 1998–2006: users and movies arrive over time; the NETFLIX training data covers 1998–2005, plus the 3M-rating qualifier dataset; the KDD CUP tasks (Task 1 and Task 2, over the existing 17K movies) concern 2006, where no new user or movie arrivals are included.]

  27. KDD-CUP 2007 based on the NETFLIX data • Training: NETFLIX competition data from 1998-2005 • Test: 2006 ratings, randomly split by movie into two tasks • Task 1: Who rated what in 2006 • Given a list of 100,000 pairs of users and movies, predict for each pair the probability that the user rated the movie in 2006 • Result: the IBM Research team was second runner-up (No. 3 out of 39 teams) • Task 2: Number of ratings per movie in 2006 • Given a list of 8863 movies, predict the number of additional reviews that all existing users will give in 2006 • Result: the IBM Research team was the winner (No. 1 out of 34 teams)

  28. Test sets from 2006 for Task 1 and Task 2 [Diagram: starting from the marginal 2006 distributions of ratings over users and movies, (movie, user) pairs are sampled according to the product of the marginals; pairs rated prior to 2006 are removed, yielding the Task 1 test set (100K pairs); the per-movie rating totals, on a log(n+1) scale, form the Task 2 test set (8.8K movies).]

  29. Task 1: Did User A review Movie B in 2006? A standard classification task: will “existing” users review “existing” movies? • More in line with the “synthetic” mode of competitions than the “real” mode • Challenges • Huge amount of data • How to sample the data so that standard learning algorithms can be applied is critical • Complex contributing factors • Decreasing interest in old movies, Netflix users’ growing tendency to watch (review) more movies • Key solutions • Effective sampling strategies to keep as much information as possible • Careful feature extraction from multiple sources

  30. Task 2: How many reviews in 2006? • Task formulation • Regression task to predict the total count of reviews from “existing” users for 8863 “existing” movies • Evaluation is by RMSE on the log scale • Challenges • Movie dynamics and life-cycle • Interest in movies changes over time • User dynamics and life-cycle • No new users are added to the database • Key solutions • Use counts from the Task 1 test set to learn a model for 2006, adjusting for pair removal • Build a set of quarterly lagged models to determine the overall scaling factor • Use Poisson regression

  31. Some data observations Leakage Alert! • The Task 1 test set is a potential response for training a model for Task 2 • It was sampled according to the marginal (= # reviews for movie in 06 / total # reviews in 06), which is proportional to the Task 2 response (= # reviews for movie in 06) • BIG advantage: we get a view of 2006 behavior for half the movies ⇒ build a model on this half, apply it to the other half (the Task 2 test set) • Caveats: • Proportional sampling implies there is a scaling parameter left, which we don’t know • Recall that after sampling, (movie, person) pairs that appeared before 2006 were dropped from the Task 1 test set ⇒ correcting for this is an inverse rejection sampling problem

  32. Test sets from 2006 for Task 1 and Task 2 [Same diagram as slide 28, annotated: the Task 1 test set is used to estimate the marginal 2006 rating distribution over movies – a surrogate learning problem for Task 2.]

  33. Some data observations (ctd.) • No new movies and reviewers in 2006 • Need to emphasize modeling the life-cycle of movies (and reviewers) • How are older movies reviewed relative to newer movies? • Does this depend on other features (like movie’s genre)? • This is especially critical when we consider the scaling caveat above

  34. Some statistical perspectives • The Poisson distribution is very appropriate for counts • Clearly true of the overall counts for 2006, assuming any kind of reasonable reviewer arrival process • The right modeling approach for true counts is Poisson regression: n_i ~ Pois(λ_i·t), log(λ_i) = Σ_j β_j·x_ij, β* = argmax_β ℓ(n; X, β) (maximum likelihood) What does this imply for the model evaluation approach? • The variance stabilizing transformation for the Poisson is the square root ⇒ √n_i has roughly constant variance • RMSE on the log scale emphasizes performance on unpopular movies (small Poisson parameter ⇒ larger log-scale variance) • We still assumed that if we do well in a likelihood formulation, we will do well with any evaluation approach • Adapting to evaluation measures!
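A minimal sketch of this Poisson-regression formulation, assuming statsmodels (the simulated covariates stand in for the real movie features):

```python
# Poisson regression sketch: n_i ~ Pois(lambda_i), log(lambda_i) = sum_j beta_j * x_ij,
# fit by maximum likelihood via a GLM with log link.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))                 # e.g. log past counts, movie age
lam = np.exp(1.0 + X @ np.array([0.8, -0.3]))  # true Poisson means
counts = rng.poisson(lam)

fit = sm.GLM(counts, sm.add_constant(X), family=sm.families.Poisson()).fit()
print(fit.params)                              # estimated (intercept, beta_1, beta_2)
pred_means = fit.predict(sm.add_constant(X))   # fitted Poisson means
```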

  35. Some statistical perspectives (ctd.) Can we invert the rejection sampling mechanism? This can be viewed as a missing data problem: • n_i, m_j are the observed counts for movie i and reviewer j, respectively • p_i, q_j are the true marginals for movie i and reviewer j, respectively • N is the total number of pairs rejected due to a review prior to 2006 • U_i, P_j are the users who reviewed movie i prior to 2006 and the movies reviewed by user j prior to 2006, respectively • Can we design a practical EM algorithm with our huge data size? Interesting research problem… • We implemented an ad-hoc inversion algorithm: iterate until convergence between (a) assuming the movie marginals are fixed and adjusting the reviewer marginals, and (b) assuming the reviewer marginals are fixed and adjusting the movie marginals (see the sketch below) • We verified that it indeed improved our data, since it increased the correlation with 4Q2005 counts
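A rough sketch of the alternating adjustment idea, under simplifying assumptions (the data structures, initialization, and fixed iteration count are illustrative; this is not the authors’ actual implementation, which had to cope with the full competition data size):

```python
# Alternating adjustment of movie and reviewer marginals to (approximately) invert
# the rejection step: pairs were drawn with probability ~ p_i * q_j and then dropped
# if the pair had been rated before 2006 (the sets U_i / P_j below).
import numpy as np

def adjust_marginals(n_movie, m_user, U, P, n_iter=50):
    """n_movie[i], m_user[j]: observed post-rejection counts for movie i / user j.
    U[i]: indices of users who rated movie i before 2006.
    P[j]: indices of movies rated by user j before 2006."""
    p = n_movie / n_movie.sum()          # initial movie marginals
    q = m_user / m_user.sum()            # initial reviewer marginals
    for _ in range(n_iter):
        # Hold q fixed: rescale each movie's count by the probability mass of
        # users that would NOT have caused a rejection for that movie.
        accept_m = np.array([1.0 - q[U[i]].sum() for i in range(len(p))])
        p = n_movie / np.maximum(accept_m, 1e-12)
        p /= p.sum()
        # Hold p fixed: adjust the reviewer marginals symmetrically.
        accept_u = np.array([1.0 - p[P[j]].sum() for j in range(len(q))])
        q = m_user / np.maximum(accept_u, 1e-12)
        q /= q.sum()
    return p, q
```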

  36. Utilizing leakage [Flowchart of the two pipelines we combined. The leakage-utilizing modeling approach: the “who reviewed” Task 1 test set (100K) → count ratings by movie → inverse rejection sampling → estimate Poisson regression M1 and predict on Task 1 movies → use M1 to predict the Task 2 movies → scale predictions to the total. The standard approach: NETFLIX challenge schema and IMDB movie features → construct movie features and lagged features Q1-Q4 2005 → estimate 4 Poisson regressions G1…G4 and predict for 2006 → estimate the 2006 total ratings for the Task 2 test set → find the optimal scalar.]

  37. Some observations on modeling approach • Lagged datasets are meant to simulate forward prediction to 2006 • Select a quarter (e.g., Q105), remove all movies & reviewers that “started” later • Build a model on this data with, e.g., Q305 as the response • Apply the model to our full dataset, which is naturally cropped at Q405 ⇒ gives a prediction for Q206 • With several models like this, predict all of 2006 • Two potential uses: • Use as our prediction for 2006 – but only if better than the model built on Task 1 movies! • Consider only the sum of their predictions, to use for scaling the Task 1 model • We evaluated models on the Task 1 test set • Used a holdout when also building them on this set • How can we evaluate the models built on lagged datasets? • Missing a scaling parameter between the 2006 prediction and the sampled set • Solution: select the optimal scaling based on Task 1 test set performance. Since the other model was still better, we knew we should use it!
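A compressed sketch of the lagged-dataset idea (assuming one row per movie with quarterly log-count columns; a plain linear model stands in here for the actual Poisson regressions):

```python
# Lagged-model sketch: train as if "now" were Q1 2005 and the response were two
# quarters ahead (Q3 2005); then apply the same model at the true edge of the data
# (Q4 2005) so the two-quarters-ahead prediction lands in Q2 2006.
import pandas as pd
from sklearn.linear_model import LinearRegression

def lagged_forecast(movie_quarters: pd.DataFrame) -> pd.Series:
    """movie_quarters: one row per movie, one column per quarter
    (e.g. 'Q4_2004' ... 'Q4_2005') holding log(1 + review count)."""
    model = LinearRegression().fit(
        movie_quarters[["Q4_2004", "Q1_2005"]].values,   # simulated "present"
        movie_quarters["Q3_2005"].values)                # simulated "future"
    preds = model.predict(movie_quarters[["Q3_2005", "Q4_2005"]].values)
    return pd.Series(preds, index=movie_quarters.index, name="Q2_2006_forecast")
```

Several such models, with different lags, cover all of 2006; their summed predictions can then serve as the scaling total for the Task 1-based model.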

  38. Some details on our models and submission • All models are at the movie level. Features we used: • Historical reviews in previous months/quarters/years (on log scale) • Movie’s age since premiere, movie’s age in Netflix (since first review) • Also consider log, square, etc. ⇒ flexibility in the form of the functional dependence • Movie’s genre • Include interactions between genre and age ⇒ the “life cycle” seems to differ by genre! • Models we considered (MSE on log scale on the Task 1 holdout): • Poisson regression on the Task 1 test set (0.24) • Log-scale linear regression model on the Task 1 test set (0.25) • Sum of lagged models built on 2005 quarters + best scaling (0.31) • Scaling based on the lagged models: • Our estimate of the number of reviews for all movies in the Task 1 test set: about 9.5M • Implied scaling parameter for the predictions: about 90 • The total of our submitted predictions for the Task 2 test set was 9.3M

  39. Competition evaluation • First we were informed that we won with an RMSE of ~770 • They had mistakenly evaluated on the non-log scale • Strong emphasis on the most popular movies • We won by a large margin ⇒ our model did well on the most popular movies! • Then they re-evaluated on the log scale, and we still won • On the log scale the least popular movies are emphasized • Recall that the variance stabilizing transformation is in between (square root) • So our predictions did well on unpopular movies too! • Interesting question: would we win on the square root scale (or similarly, a Poisson likelihood-based evaluation)? Sure hope so!

  40. Competition evaluation (ctd.) Results of the competition (log-scale evaluation): • Components of our model’s MSE: • The error of the model for the scaled-down Task 1 test set (which we estimated at about 0.24) • Additional error from an incorrect scaling factor • Scaling numbers: • True total reviews: 8.7M • Sum of our predictions: 9.3M • Interesting question: what would be the best scaling • For log-scale evaluation? Conjecture: need to under-estimate the true total • For square-root evaluation? Conjecture: need to estimate about right

  41. Effect of scaling on the two evaluation approaches

  42. Effect of scaling on the two evaluation approaches

  43. KDD CUP 2007: Summary Keys to our success: • Identify subtle leakage • Is it formally leakage? Depends on intentions of organizers… • Appropriate statistical approach • Poisson regression • Inverting rejection sampling in leakage • Careful handling of time-series aspects Not keys to our success: • Fancy machine learning algorithms

  44. Case Study #2: KDD CUP 2008 – Siemens Medical, Breast Cancer Identification [Slide graphic: mammography images in MLO and CC views; 6816 images from 1712 patients; 105,000 candidate regions, each described by a feature vector [x1, x2, …, x117, class]; which candidates are malignant?]

  45. KDD-CUP 2008 based on Mammography • Training: labeled candidates from 1300 patients, plus the association of each candidate to its location, image, and patient • Test: candidates from a separate set of 1300 patients • Task 1: • Rank all candidates by the likelihood of being cancerous • Result: the IBM Research team was the winner out of 246 • Task 2: • Identify a list of healthy patients • Result: the IBM Research team was the winner out of 205

  46. Task 1: Candidate Likelihood of Cancer [Slide graphic: FROC curve – true positive patient rate vs. false positive candidate rate per image] An almost standard probability estimation/ranking task on the candidate level • Somewhat synthetic, as the meaning of the features is unknown • Challenges • Low positive rate: 7% of patients and 0.6% of candidates • Beware of overfitting • Sampling • Unfamiliar evaluation measure • FROC, related to AUC • Non-robust • Hints at locality • Key solutions • Simple linear model • Post-processing of scores • Leakage in identifiers • Adapting to evaluation measures!

  47. Task 2: Classify patients A derivative of Task 1 • A patient is healthy if all her candidates are benign • The probability that a patient is healthy is the product of the probabilities of her candidates (see the sketch below) • Challenges • Extremely non-robust performance measure: • Including any patient with cancer in the list disqualified the entry • Risk tradeoff – need to anticipate the solutions of the other participants • Key solutions • Pick a model with high sensitivity to false negatives • Leakage in identifiers: EDA at work
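A small sketch of this candidate-to-patient aggregation (column names are hypothetical; the product over candidates is computed as a sum of logs for numerical stability):

```python
# Patient-level score: P(patient healthy) = product over her candidates of
# (1 - P(candidate malignant)); rank patients by that probability.
import numpy as np
import pandas as pd

def rank_healthy_patients(candidates: pd.DataFrame) -> pd.Series:
    """candidates: columns 'patient_id' and 'p_malignant' (per-candidate score)."""
    p_healthy = (candidates
                 .assign(log_benign=np.log1p(-candidates["p_malignant"]))
                 .groupby("patient_id")["log_benign"].sum()
                 .pipe(np.exp))                    # product of (1 - p) via log-sum
    return p_healthy.sort_values(ascending=False)  # most confidently healthy first
```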

  48. EDA on the Breast Cancer Domain Console output of sorted ‘patient_ID patient_label’: 144484 1 / 148717 0 / 168975 0 / 169638 1 / 171985 0 / 177389 1 / 182498 0 / 185266 0 / 193561 1 / 194771 0 / 198716 1 / 199814 1 / 1030694 0 / 1123030 0 / 1171864 0 / 1175742 0 / 1177150 0 / 1194527 0 / 1232036 0 / 1280544 0 / 1328709 0 / 1373028 0 / 1387320 0 / 1420306 0 / ---more--- Base rate of 7%???? What about 200K to 999K?

  49. Mystery of the Data Generation: Identifier Leakage in the Breast Cancer Data [Slide graphic: model score vs. log of patient ID, one point per candidate, showing distinct clusters – 18 patients with 85% cancer, 245 patients with 36% cancer, 414 patients with 1% cancer, 1027 patients with 0% cancer] • The distribution of identifiers has a strong natural grouping of patient identifiers • 3 natural buckets • The three groups have VERY different base rates of cancer prevalence • The last group seems to be sorted (cancer cases first) • In total, 4 groups with very different patient-level probabilities of cancer • The organizers admitted to having combined data from different years in order to increase the positive rate

  50. Building the classification model • For evaluation we created a stratified 50% training and test split by patient • Given the few positives (~300), results may exhibit high variance • We explored the use of various learning algorithms, including neural networks, logistic regression, and various SVMs • Linear models (logistic regression or linear SVMs) yielded the most promising results • FROC 0.0834 • Down-sampling the negative class? • Keep only 25% of all healthy patients • Helped in some cases, but not a reliable improvement • Add the identifier category (1,2,3,4) as an additional feature (see the sketch below)
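A sketch of turning the identifier grouping into a model feature (the bucket boundaries are illustrative placeholders to be read off the EDA plot; the fourth group on the previous slide came from record ordering, which an ID-only rule cannot reproduce):

```python
# ID-bucket feature sketch: map patient identifiers into the coarse groups found
# by EDA and add the bucket as a categorical predictor.
import pandas as pd

def add_id_bucket(df: pd.DataFrame, id_col: str = "patient_id",
                  edges=(0, 250_000, 1_500_000, float("inf"))) -> pd.DataFrame:
    """'edges' are placeholder cut points; in practice they come from the EDA."""
    out = df.copy()
    out["id_bucket"] = pd.cut(out[id_col], bins=list(edges),
                              labels=[f"group_{k}" for k in range(1, len(edges))])
    return out

# Usage: one-hot encode 'id_bucket' (e.g. pd.get_dummies) before adding it to a
# linear model such as logistic regression.
```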
