Predictive modeling competitions: making data science a sport. Anthony Goldbloom, CEO, Kaggle. E-mail: anthony.goldbloom@kaggle.com. Twitter: @antgoldbloom. Photo by mikebaird, www.flickr.com/photos/mikebaird
Global competitions. Predicting HIV viral load: state of the art 70%; after 1½ weeks 70.8%; at competition close 77%
Diverse experts solving diverse problems. Competitions: Grant Application Forecasting • Stock Price Prediction • HIV Research • Chess Ratings • Travel Time Prediction. Participants: Edmund & Adrian, London & USA • Dr. Derek Gatherer, UK • Felipe Maia, Uppsala University • Ivan, Russian Federation • Dr. Christopher Hefele, New York • Philipp Emanuel Widmann, Heidelberg, DE • Chih-Li Sung & Roy Tseng, Penghu & Taipei • Robert, Warsaw • Gzegorz Swiszcz, Gera • Cole Harris, Texas • Jure Zbontar, Ljubljana • Giuseppe Ragusa, Rome • Chris DuBois, Portland • Claudio Perlich, USA • Edmund & Adrian, London & USA • Jason Trigg, Pennsylvania • John Blatz, Baltimore • Rajstennaj Barrabas, USA • Chris Raimondi, Baltimore • Jason Trigg, Pennsylvania • Uri Blass, Tel-Aviv • Lee Baker, Las Cruces, NM • Nan Zhou, Pittsburgh • Jeremy Howard, Australia • Thomas Mahony, Canberra • Glen Maher, Canberra • Emir Delic, Australia
Motivation • Why host a competition? • Why compete? • How it works • Heritage Health Prize • Questions
“I keep saying the sexy job in the next ten years will be statisticians.” Hal Varian, Google Chief Economist, 2009
Crowdsourcing: mismatch between those with data and those with the skills to analyse it
Countless possible approaches to any data prediction problem. Which to choose?
Motivation • Why host a competition? • Why compete? • How it works • Heritage Health Prize • Questions
Tourism Forecasting Competition. [Chart: forecast error (MASE) for the existing model, on Aug 9, 2 weeks later, 1 month later, and at competition end]
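The forecast error metric named on this slide, MASE (mean absolute scaled error), divides a forecast's mean absolute error by the in-sample mean absolute error of a naive "repeat the previous value" forecast. The following is a minimal sketch of that standard definition, not the competition's actual scoring code; the function name `mase`, the `season` parameter, and the sample numbers are illustrative assumptions.

```python
def mase(train, actual, forecast, season=1):
    """Mean absolute scaled error (standard definition).

    train    -- historical series used to fit the model (scaling reference)
    actual   -- observed values over the forecast horizon
    forecast -- predicted values over the same horizon
    season   -- lag of the naive benchmark (1 = previous observation)
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(train) <= season:
        raise ValueError("train must be longer than the seasonal lag")

    # In-sample errors of the naive forecast y[t] = y[t - season].
    naive_errors = [abs(train[i] - train[i - season])
                    for i in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive forecast has zero error; MASE is undefined")

    # Out-of-sample mean absolute error of the forecast being scored.
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


# Made-up numbers: naive in-sample error is 2.0, forecast MAE is 1.5.
print(mase([10, 12, 14, 16], [18, 20], [17, 22]))  # 0.75
```

A value below 1 means the forecast beats the naive benchmark on average; a value above 1 means it does worse.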
Chess Ratings Competition. [Chart: error rate (RMSE) for the existing model (Elo), on Aug 4, 1 month later, 2 months later, and today]
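As a rough illustration of what this slide measures, the sketch below computes the standard Elo expected score for a game and the root mean squared error (RMSE) between predicted and actual results. It assumes a plain per-game RMSE and the conventional 400-point Elo scale; the competition's exact scoring procedure is not given in the slides, and the function names and sample ratings are hypothetical.

```python
import math


def elo_expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up games: (rating of A, rating of B, result for A: 1 win, 0.5 draw, 0 loss).
games = [(1500, 1500, 1.0), (1500, 1500, 0.0)]
predictions = [elo_expected_score(a, b) for a, b, _ in games]
results = [r for _, _, r in games]
print(rmse(predictions, results))  # 0.5
```

A competing rating system would be scored the same way: replace `elo_expected_score` with its own prediction function and compare the resulting RMSE.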
Users apply different techniques • neural networks • logistic regression • support vector machines • decision trees • ensemble methods • AdaBoost • Bayesian networks • genetic algorithms • random forests • Monte Carlo methods • principal component analysis • Kalman filters • evolutionary fuzzy modeling
NASA tried, now it’s our turn. Successful grant applications: ~25%
Outcomes of a competition to predict the success of grant applications (~25% of grant applications are successful): • Better identify likely successes to avoid wasting resources on hopeless applications • Identify and communicate the characteristics of a successful application to future applicants
Motivation • Why host a competition? • Why compete? • How it works • Heritage Health Prize • Questions
Why participants compete • More fun than Sudoku • Clean, real-world data • Professional reputation & experience • Interactions with experts in related fields • Prizes
Motivation • Why host a competition? • Why compete? • How it works • Heritage Health Prize • Questions
1. Upload 2. Submit 3. Evaluate & Exchange
Competition Mechanics: competitions are judged on objective criteria
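Judging on objective criteria can be pictured as scoring every submission against held-back true values with one fixed metric and ranking by that score. The sketch below is an illustration under assumptions, not Kaggle's implementation: the metric (mean absolute error), the function names, and the team names are all hypothetical.

```python
def mean_absolute_error(truth, predictions):
    """Average absolute difference between true and predicted values."""
    if len(truth) != len(predictions):
        raise ValueError("truth and predictions must have the same length")
    return sum(abs(t - p) for t, p in zip(truth, predictions)) / len(truth)


def build_leaderboard(truth, submissions, metric=mean_absolute_error):
    """Score each submission with the same metric; lowest error ranks first.

    submissions -- dict mapping a team name to its list of predictions
    """
    scored = [(team, metric(truth, preds)) for team, preds in submissions.items()]
    return sorted(scored, key=lambda item: item[1])


# Made-up example: hidden true values and two hypothetical teams.
hidden_truth = [1.0, 2.0, 3.0]
entries = {
    "team_a": [1.0, 2.0, 4.0],
    "team_b": [1.0, 2.0, 3.0],
}
for rank, (team, score) in enumerate(build_leaderboard(hidden_truth, entries), start=1):
    print(rank, team, round(score, 4))
```

Because every entry is scored by the same function on the same hidden data, the ranking does not depend on who submitted or which technique was used.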
Motivation • Why host a competition? • Why compete? • How it works • Heritage Health Prize • Questions
An upcoming competition, powered by Kaggle • De-identified dataset containing medical records of 100,000 Americans • $3 million prize http://www.heritagehealthprize.com
[Diagram: Diabetes & Unfilled Prescriptions & Hypertension & High Cholesterol: probability of going to hospital in the next year]
Netflix Prize (2006–2009): $1 million prize, 50,000 registrations. Heritage Health Prize (2011): $3 million prize, projected 100,000 registrations
Motivation • Why host a competition? • Why compete? • How it works • Heritage Health Prize • Questions
Chess Ratings – Elo vs. the Rest of the World • IJCNN Social Network Challenge • Tourism Forecasting (Part 2) • Predict Grant Applications
Jeff Moser • Jeremy Howard • Nicholas Gruen • Anthony Goldbloom
What could the world’s best analysts find in your data? E-mail: anthony.goldbloom@kaggle.com. Phone: +61438400053. Photo by gidzy, www.flickr.com/photos/gidzy