
Story of IBM Research’s success at KDD/Netflix Cup 2007



Presentation Transcript


  1. Story of IBM Research’s success at KDD/Netflix Cup 2007. Saharon Rosset, TAU Statistics (formerly IBM). IBM Research’s teams: Task 1: Yan Liu, Zhenzhen Kou (CMU intern); Task 2: Saharon Rosset, Claudia Perlich, Yan Liu

  2. October 2006: Announcement of the NETFLIX Competition. USAToday headline: “Netflix offers $1 million prize for better movie recommendations” Details: • Beat NETFLIX’s current recommender model ‘Cinematch’ by 10% based on absolute rating error prior to 2011 • $50,000 for the annual progress prize (relative to baseline) • Data contains a subset of 100 million movie ratings from NETFLIX, including 480,189 users and 17,770 movies • Performance is evaluated on holdout movie-user pairs • The NETFLIX competition has attracted 24,396 contestants on 19,799 teams from 155 different countries • 14,891 valid submissions from 2,282 different teams • Current best result is 7.8% better than baseline (up from 6.7% as of March)

  3. Data Overview: NETFLIX and the Internet Movie Database • Movies: 17K selected out of all movies (80K); selection criterion unclear • Users: 480K selected out of all users (6.8M); at least 20 ratings by end of 2005 • NETFLIX competition data: 100M ratings • Qualifier dataset: 3M

  4. NETFLIX data generation process [Figure: timeline running from 1998 through 2005 to 2006. Labels: Training Data (17K movies), User Arrival, Movie Arrival, Qualifier Dataset (3M), KDD CUP (no user or movie arrival), Task 1, Task 2.]

  5. KDD-CUP 2007 based on the NETFLIX competition • Knowledge Discovery and Data Mining (KDD)-CUP • Annual competition of the premier conference in Data Mining • Training: NETFLIX competition data from 1998-2005 • Test: 2006 ratings, randomly split by movie into two tasks • Task 1: Who rated what in 2006 • Given a list of 100,000 pairs of users and movies, predict for each pair the probability that the user rated the movie in 2006 • Result: We are the second runner-up, No. 3 out of 39 teams • Many of the competing teams have been working on the Netflix data for over six months, giving them a decided advantage in Task 1 here • Task 2: Number of ratings per movie in 2006 • Given a list of 8863 movies, predict the number of additional reviews that all existing users will give in 2006 • Result: We are the winner, No. 1 out of 34 teams

  6. Generation of test sets from 2006 for Task 1 and Task 2 • Start from the marginal 2006 distributions of ratings over users and over movies • Task 1: sample (movie, user) pairs according to the product of the marginals, then remove pairs that were rated prior to 2006 ⇒ Task 1 test set (100K) • Task 2: rating totals per movie, on the log(n+1) scale ⇒ Task 2 test set (8.8K)
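The Task 1 sampling step described on this slide can be sketched roughly as follows. This is a minimal illustration on toy data, not the organizers' actual code; the function name `sample_task1_pairs`, its parameters, and the toy counts are all assumptions made for the example.

```python
import random

def sample_task1_pairs(movie_counts, user_counts, rated_before, n_pairs, seed=0):
    """Draw (movie, user) pairs from the product of the two marginals,
    then drop pairs that already have a rating (the rejection step)."""
    rng = random.Random(seed)
    movies, movie_w = zip(*movie_counts.items())
    users, user_w = zip(*user_counts.items())
    # Movies and users are drawn independently, each in proportion to its count.
    drawn_movies = rng.choices(movies, weights=movie_w, k=n_pairs)
    drawn_users = rng.choices(users, weights=user_w, k=n_pairs)
    pairs = list(zip(drawn_movies, drawn_users))
    return [p for p in pairs if p not in rated_before]

# Hypothetical toy data: 2006 rating counts per movie and per user.
movie_counts = {"m1": 50, "m2": 30, "m3": 20}
user_counts = {"u1": 60, "u2": 40}
rated_before = {("m1", "u1")}  # pairs rated prior to 2006
kept = sample_task1_pairs(movie_counts, user_counts, rated_before, n_pairs=1000)
```

Because of the rejection step, the surviving pairs no longer follow the product of the marginals exactly, which is the distortion the later slides try to undo.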

  7. Insights from the battlefields: What makes a model successful? Previous successful ‘engagements’ of our team: • Competitions: KDD-CUP 1999, 2000, 2003, ILP-Challenge 2005 • Applications: MAP, OnTarget, … Components of successful modeling: 1. Data and domain understanding • Generation of data and task • Cleaning and representation/transformation 2. Statistical insights • Statistical properties • Test validity of assumptions • Performance measure 3. Modeling and learning approach • Most “publishable” part • Choice or development of most suitable algorithm Importance?

  8. Task 1: Did User A review Movie B in 2006? • Task formulation • A classification task: will “existing” users review “existing” movies? • Challenges • Huge amount of data • Sampling the data so that learning algorithms can be applied at all is critical • Complex influencing factors • Decreasing interest in old movies, growing tendency of Netflix users to watch (review) more movies • Key solutions • Effective sampling strategies to keep as much information as possible • Careful feature extraction from multiple sources

  9. Task 1: Effective Sampling Strategies • Sample movie-user pairs for “existing” users and “existing” movies from 2004 and 2005 as the training set, with 4Q 2005 as the development set • The probability of picking a movie is proportional to the number of ratings that movie received; users are sampled the same way [Figure: rating history records (e.g., 1488844,3,2005-09-06; 822109,5,2005-05-13) yield sampling probabilities for movies (Movie5 .0011, Movie3 .001, Movie4 .0007) and for users (User7 .0007, User6 .00012, User8 .00003), from which sample pairs such as (Movie5, User7), (Movie3, User7), (Movie4, User8) are drawn.]

  10. Task 1: Effective Sampling Strategies (ctd.) [Figure: the sampling illustration from slide 9, annotated with the ratio of positive examples among the sampled pairs.]

  11. Task 1: Multiple Information Sources • Graph-based features based on the NETFLIX training set: construct a graph with users and movies as nodes, and create an edge if the user reviewed the movie • Content-based features: plot, director, actor, genre, movie connections, box office, and scores of the movie crawled from Netflix and IMDB [Figure: bipartite user-movie graph built from rating records such as 1488844,3,2005-09-06.]

  12. Task 1: Feature Extraction • Movie-based features • Graph topology: # of ratings per movie (across different years), adjacency scores between movies calculated using SVD on the graph matrix • Movie content: similarity of two movies calculated using Latent Semantic Indexing based on bag of words from (1) plots of the movie and (2) other information, such as director, actors, and genre • User profile • Graph topology: # of ratings per user (across different years) • User preferences based on the movies being rated: key word match count, and average/min/max of similarity scores between the movie being predicted and the movies already rated by the user [Figure: a user linked to several rated movies, each compared with the movie to predict.]
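The average/min/max similarity features for a user could be computed along these lines. Note the substitution: the slide uses Latent Semantic Indexing, whereas this sketch uses plain cosine similarity on raw bag-of-words counts to stay dependency-free. The function names and the toy keywords are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def user_similarity_features(target_words, rated_words_list):
    """Summarize how similar the movie to predict is to the user's rated movies."""
    target = Counter(target_words)
    sims = [cosine(target, Counter(words)) for words in rated_words_list]
    if not sims:
        return {"avg": 0.0, "min": 0.0, "max": 0.0}
    return {"avg": sum(sims) / len(sims), "min": min(sims), "max": max(sims)}

# Hypothetical toy example: one closely related rated movie, one unrelated.
features = user_similarity_features(
    ["space", "war"],
    [["space", "war"], ["romance"]],
)
```

Replacing `cosine` with a similarity computed in an LSI-reduced space would bring the sketch closer to what the slide describes.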

  13. Task 1: Learning strategy • Learning Algorithm: • Single classifiers: logistic regression, Ridge regression, decision tree, support vector machines • Naïve Ensemble: combining sub-classifiers built on different types of features with pre-set weights • Ensemble classifiers: combining sub-classifiers with weights learned from the development set
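As a rough illustration of the difference between pre-set weights and weights learned from the development set, the sketch below blends two sub-classifier scores and picks the blend weight by grid search on squared error. The slides do not say how the weights were learned, so the grid search, the loss, and the name `learn_blend_weight` are assumptions.

```python
def learn_blend_weight(scores_a, scores_b, labels, steps=100):
    """Pick w in [0, 1] minimizing the mean squared error of
    w * scores_a + (1 - w) * scores_b on a development set."""
    best_w, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        err = sum((w * a + (1 - w) * b - y) ** 2
                  for a, b, y in zip(scores_a, scores_b, labels)) / len(labels)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err

# Hypothetical development-set scores from two sub-classifiers.
dev_labels = [1, 0, 1, 0]
graph_scores = [0.9, 0.2, 0.8, 0.1]    # e.g., built on graph features
content_scores = [0.5, 0.5, 0.5, 0.5]  # e.g., built on content features
weight, dev_error = learn_blend_weight(graph_scores, content_scores, dev_labels)
```

A naive ensemble would instead fix `weight` in advance (for example 0.5) without consulting the development set.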

  14. Task 2 description: How many reviews did a Movie receive in 2006? • Task formulation • Regression task to predict the total count of reviewers from “existing” users for 8863 “existing” movies • Challenges • Movie dynamics and life-cycle • Interest in movies changes over time • User dynamics and life-cycle • No new users are added to the database • Key solutions • Use counts from test set of Task 1 to learn a model for 2006 adjusting for pair removal • Build set of quarterly lagged models to determine the overall scalar • Use Poisson regression

  15. Some data observations • The Task 1 test set is a potential response for training a model for Task 2 • It was sampled according to the movie marginal (= # reviews for movie in 06 / # reviews in 06), which is proportional to the Task 2 response (= # reviews for movie in 06) • BIG advantage: we get a view of 2006 behavior for half the movies. Build a model on this half, apply it to the other half (the Task 2 test set) • Caveats: • Proportional sampling implies there is a scaling parameter left, which we don’t know • Recall that after sampling, (movie, person) pairs that appeared before 2006 were dropped from the Task 1 test set. Correcting for this is an interesting research challenge: inverse rejection sampling • No new movies and reviewers in 2006 • Need to emphasize modeling the life-cycle of movies (and reviewers) • How are older movies reviewed relative to newer movies? • Does this depend on other features (like the movie’s genre)? • This is especially critical when we consider the scaling caveat above

  16. Some statistical perspectives • The Poisson distribution is very appropriate for counts • Clearly true of overall counts for 2006 • Assuming any kind of reasonable reviewer arrival process • Implies that the appropriate modeling approach for true counts is Poisson regression: $n_i \sim \mathrm{Pois}(\lambda_i t)$, $\log(\lambda_i) = \sum_j \beta_j x_{ij}$, $\beta^* = \arg\max_\beta \, l(n; X, \beta)$ (maximum likelihood solution) • What happens when we sub-sample for the Task 1 test set? • Sum is fixed ⇒ multinomial • Large N, small p ⇒ each sub-sampled count is well approximated by a Poisson • It can be shown that Poisson regression (= assuming independence) is appropriate • What does this imply for the model evaluation approach? • The variance stabilizing transformation for Poisson is the square root: $\sqrt{n_i}$ has roughly constant variance • RMSE of log(prediction + 1) against log(# ratings + 1) emphasizes performance on unpopular movies (small Poisson parameter ⇒ larger log-scale variance) • We still assumed that if we do well in a likelihood formulation, we will do well with any evaluation approach
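To make the maximum likelihood formulation concrete, here is a minimal Poisson regression fit with an intercept and a single feature, solved by Newton's method. The slides do not state which solver or software was used; the one-feature restriction, the solver, the function name, and the toy data are assumptions for illustration only. The exposure t is absorbed into the intercept.

```python
import math

def fit_poisson_regression(x, n, iters=50, tol=1e-12):
    """Maximum-likelihood fit of log(mean_i) = b0 + b1 * x_i by Newton's method."""
    b0 = math.log(sum(n) / len(n))  # start at the intercept-only solution
    b1 = 0.0
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood sum(n_i * eta_i - exp(eta_i)).
        g0 = sum(ni - mi for ni, mi in zip(n, mu))
        g1 = sum((ni - mi) * xi for ni, mi, xi in zip(n, mu, x))
        # Entries of the negative Hessian (Fisher information).
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical toy data: movie age in years vs. number of new ratings.
ages = [0, 1, 2, 3, 4]
counts = [20, 12, 7, 5, 3]
b0, b1 = fit_poisson_regression(ages, counts)
predicted = [math.exp(b0 + b1 * a) for a in ages]
```

With many features one would normally use a GLM routine from a statistics library rather than hand-coded Newton steps.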

  17. Some statistical perspectives (ctd.) • Can we invert the rejection sampling mechanism? • This can be viewed as a missing data problem • Can we design a practical EM algorithm with our huge data size? Interesting research problem… • We implemented an ad-hoc inversion algorithm • Iterate until convergence between: (a) assuming movie marginals are correct and adjusting reviewer marginals; (b) assuming reviewer marginals are correct and adjusting movie marginals • We verified that it indeed improved our data, since it increased correlation with 4Q2005 counts
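The slides do not give the update rule of the ad-hoc inversion algorithm, so the following is only one plausible reading of the alternating scheme, not the team's implementation. It assumes that a sampled (movie, user) pair is rejected exactly when it appears in a known set of previously rated pairs, so the observed share of a movie equals its true share times the probability that the paired user has not already rated it (and symmetrically for users). All names and toy numbers are hypothetical.

```python
def invert_rejection(movie_obs, user_obs, rated_before, iters=200):
    """Alternately re-estimate the pre-rejection movie and user marginals."""
    def normalize(d):
        total = sum(d.values())
        return {k: v / total for k, v in d.items()}

    p = normalize(movie_obs)  # movie marginal, started from observed shares
    q = normalize(user_obs)   # user marginal, started from observed shares
    for _ in range(iters):
        # Treat the user marginals as correct: a draw of movie m is rejected
        # when the independently drawn user has already rated m.
        rejected = {m: 0.0 for m in p}
        for m, u in rated_before:
            rejected[m] += q[u]
        p = normalize({m: movie_obs[m] / (1.0 - rejected[m]) for m in p})
        # Treat the movie marginals as correct and adjust users the same way.
        rejected = {u: 0.0 for u in q}
        for m, u in rated_before:
            rejected[u] += p[m]
        q = normalize({u: user_obs[u] / (1.0 - rejected[u]) for u in q})
    return p, q

# Toy check: true marginals p = (0.6, 0.4), q = (0.7, 0.3), and the single
# pair (m0, u0) is rejected, which yields the surviving counts below.
movie_obs = {"m0": 180, "m1": 400}
user_obs = {"u0": 280, "u1": 300}
p_hat, q_hat = invert_rejection(movie_obs, user_obs, {("m0", "u0")})
```

On real data the loop would need a guard for movies or users whose rejection probability approaches 1, and a convergence check instead of a fixed iteration count.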

  18. Modeling Approach Schema [Figure: flow diagram with two branches. Branch 1: Task 1 test set (100K) ⇒ inverse rejection sampling ⇒ count ratings by movie ⇒ estimate Poisson regression M1 (on movie features constructed from IMDB and NETFLIX challenge data) and predict on Task 1 movies ⇒ use M1 to predict Task 2 movies ⇒ scale predictions to total. Branch 2: construct lagged features for Q1-Q4 2005 ⇒ estimate 4 Poisson regressions G1…G4 and predict for 2006 ⇒ validate against 2006 Task 1 counts ⇒ find optimal scalar ⇒ estimate 2006 total ratings for the Task 2 test set.]

  19. Some observations on modeling approach • Lagged datasets are meant to simulate forward prediction to 2006 • Select a quarter (e.g., Q105), remove all movies & reviewers that “started” later • Build a model on this data with, e.g., Q305 as response • Apply the model to our full dataset, which is naturally cropped at Q405 ⇒ gives a prediction for Q206 • With several models like this, predict all of 2006 • Two potential uses: • Use as our prediction for 2006, but only if better than the model built on Task 1 movies! • Consider only the sum of their predictions, to use for scaling the Task 1 model • We evaluated models on the Task 1 test set • Used holdout when also building them on this set • How can we evaluate the models built on lagged datasets? • Missing a scaling parameter between the 2006 prediction and the sampled set • Solution: select the optimal scaling based on Task 1 test set performance • Since the other model was still better, we knew we should use it!
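Selecting the scaling between full-year predictions and the sampled Task 1 counts might look like the sketch below, which tries a list of candidate scalars and keeps the one with the lowest log-scale MSE. The candidate grid, the function names, and the toy numbers are assumptions; the slides only say that the optimal scaling was chosen by Task 1 test set performance.

```python
import math

def log_scale_mse(pred, actual):
    """Mean squared error between log(pred + 1) and log(actual + 1)."""
    return sum((math.log(p + 1) - math.log(a + 1)) ** 2
               for p, a in zip(pred, actual)) / len(actual)

def best_scaling(full_year_pred, sampled_counts, candidates):
    """Return the divisor (and its error) that best maps full-year
    predictions onto the sampled counts."""
    scored = [
        (log_scale_mse([p / s for p in full_year_pred], sampled_counts), s)
        for s in candidates
    ]
    err, scale = min(scored)
    return scale, err

# Hypothetical toy data: lagged-model predictions vs. sampled counts.
full_year_pred = [900, 450, 90]
sampled_counts = [10, 5, 1]
scale, err = best_scaling(full_year_pred, sampled_counts, [30, 60, 90, 120])
```

The same error value can then be compared with the error of the model built directly on the Task 1 movies, which is how the two approaches are ranked on the slide.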

  20. Some details on our models and submission • All models are at the movie level. Features we used: • Historical reviews in previous months/quarters/years (on log scale) • Movie’s age since premiere, movie’s age in Netflix (since first review) • Also consider log, square, etc. ⇒ flexibility in the form of functional dependence • Movie’s genre • Include interactions between genre and age ⇒ “life cycle” seems to differ by genre! • Models we considered (MSE on log scale on Task 1 holdout): • Poisson regression on Task 1 test set (0.24) • Log-scale linear regression model on Task 1 test set (0.25) • Sum of lagged models built on 2005 quarters + best scaling (0.31) • Scaling based on lagged models • Our estimate of the number of reviews for all movies in the Task 1 test set: about 9.5M • Implied scaling parameter for predictions: about 90 • Total of our submitted predictions for the Task 2 test set was 9.3M

  21. Competition evaluation • First we were informed that we won with an RMSE of ~770 • They mistakenly evaluated on non-log scale • Strong emphasis on most popular movies • We won by a large margin ⇒ our model did well on popular movies! • Then they re-evaluated on log scale, and we still won • On log scale the least popular movies are emphasized • Recall that the variance stabilizing transformation is in between (square root) • So our predictions did well on unpopular movies too! • Interesting question: would we win on square root scale (or similarly, Poisson likelihood-based evaluation)? Sure hope so!
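The way the evaluation scale shifts emphasis between popular and unpopular movies can be seen on a two-movie toy example. The numbers are invented for illustration and the helper `rmse` is not from the competition.

```python
import math

def rmse(pred, actual, transform=lambda v: v):
    """Root mean squared error after applying a transform to both sides."""
    sq = [(transform(p) - transform(a)) ** 2 for p, a in zip(pred, actual)]
    return math.sqrt(sum(sq) / len(sq))

scales = {
    "raw": lambda v: v,               # the mistaken first evaluation
    "sqrt": math.sqrt,                # variance stabilizing for Poisson
    "log": lambda v: math.log(v + 1), # the official evaluation
}

# One popular movie predicted 10% too low, one unpopular movie 3x too high.
actual = [10000, 1]
pred = [9000, 3]
report = {name: rmse(pred, actual, f) for name, f in scales.items()}
```

On the raw scale almost all of the error comes from the popular movie, while on the log scale the unpopular movie dominates; the square root scale sits between the two.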

  22. Competition evaluation (ctd.) • Results of competition (log-scale evaluation): • Components of our model’s MSE: • The error of the model for the scaled-down Task 1 test set (which we estimated at about 0.24) • Additional error from an incorrect scaling factor • Scaling numbers: • True total reviews: 8.7M • Sum of our predictions: 9.3M • Interesting question: what would be the best scaling? • For log-scale evaluation? Conjecture: need to under-estimate the true total • For square-root evaluation? Conjecture: need to estimate about right

  23. Effect of scaling on the two evaluation approaches

  24. Effect of scaling on the two evaluation approaches [Figure: log-scale MSE and SQRT MSE plotted against the sum of predictions (M), with the true sum and the submitted sum marked.]

  25. Acknowledgements • Rick Lawrence • Naoki Abe • Prem Melville • Hisashi Kashima (TRL) • Shohei Hido (TRL) • Chandan Reddy • Grzegorz Swirszcz • And many more ..
