A Statistical Viewpoint on Data Science, Data Mining and Big Data

A Statistical Viewpoint on Data Science, Data Mining and Big Data. Alec Stephenson. DATA ANALYTICS, DIGITAL PRODUCTIVITY AND SERVICES.

  1. A Statistical Viewpoint on Data Science, Data Mining and Big Data Alec Stephenson DATA ANALYTICS, DIGITAL PRODUCTIVITY AND SERVICES

  2. Introduction • Statistics Vs Data Science • Statistician Vs Data Scientist • Data Science in Predictive Analytics • Data Science in Consulting • Big Data: Are Statisticians Relevant?

  3. Data Science Venn Diagram (Drew Conway)

  4. Statistician Vs Data Scientist

  5. I am a Data Scientist: • On LinkedIn • On my email signature • To market myself to internal and external clients. I am a Statistician: • At academic conferences • Providing expertise for journal articles • Any role as a technical expert

  6. Is There A Greater Demand For Data Scientists? • Experfy www.experfy.com • Melbourne Data Science Meet-Up www.meetup.com/Data-Science-Melbourne/ BUT: • Kaggle Connect no longer exists (March-December 2013)

  7. Data Science Skills Essential: • Statistical Modelling: e.g. R, Matlab, Python • Data Munging: e.g. Perl, Python, Ruby Additional: • Fast Computation: C, C++, Java • Data Storage: SQL, noSQL • Big Data: MapReduce, Mahout, Hive, Pig

  8. Data Mining Competitions www.kaggle.com Good For Building Essential Skills In Predictive Analytics Only Three Steps To Winning: • Data Munging • Machine Learning / Statistical Modelling • Ensembling

  9. Data Mining Competitions www.kaggle.com General Advice: • Just because you have data does not mean that you have to use it • There is no such thing as a single best model • Different models can capture different features • Visualize the data

  10. Data Mining Competitions www.kaggle.com General Advice: • If something takes more than one minute to run, do you really need to run it? • Spend more time on trying different data transformations and models, and less on parameter specification • Just have a go. • How much time can you afford?

  11. Data Mining Competitions www.kaggle.com Usually Good Methods: • Gradient boosting machine (gbm / mboost) • Random forest (randomForest) • Elastic net (glmnet) • Support Vector Machine (kernlab / e1071) • Neural networks (nnet)

  12. Data Mining Competitions www.kaggle.com Usually Not-So-Good Methods: • Recursive Partitioning (rpart / tree) • Nearest neighbour (class) • Multivariate Adaptive Regression Splines (earth) • Naive Bayes (e1071)

  13. Data Mining Example I

          library(randomForest)
          library(gbm)
          library(glmnet)

          data <- as.matrix(iris[,-5])
          set.seed(100)
          ind <- sample(150, 15)
          train <- data[-ind,]
          test <- data[ind,]

  14. Data Mining Example II

          # Random forest: predict column 1 (Sepal.Length) from columns 2 to 4
          set.seed(100)
          m1 <- randomForest(train[,2:4], train[,1], ntree = 1000, mtry = 2)
          pm1 <- predict(m1, test[,-1])
          mean((pm1 - test[,1])^2)  # hold-out mean squared error

          # Gradient boosting machine
          set.seed(100)
          m2 <- gbm.fit(train[,2:4], train[,1], distribution = "gaussian",
                        n.trees = 10000, shrinkage = 0.001, interaction.depth = 2)
          pm2 <- predict(m2, test[,-1], n.trees = 10000)
          mean((pm2 - test[,1])^2)

          # Elastic net
          set.seed(100)
          m3 <- glmnet(train[,2:4], train[,1], family = "gaussian", alpha = 0.5)
          pm3 <- predict(m3, test[,-1])
          pm3 <- pm3[,ncol(pm3)]  # keep predictions for the last lambda on the path
          mean((pm3 - test[,1])^2)

  15. Data Mining Example III

          mean(((pm1 + pm2)/2 - test[,1])^2)
          mean(((pm1 + pm3)/2 - test[,1])^2)
          mean(((pm2 + pm3)/2 - test[,1])^2)
          mean(((pm1 + pm2 + pm3)/3 - test[,1])^2)
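
The averaging in Example III can be tried without any extra packages. The following is a minimal base-R sketch, not the presenter's code: it reuses the seed and hold-out sizes from Example I, but swaps in two ordinary linear models (`fit_a` and `fit_b`, both hypothetical choices) for the three learners used in the slides. It illustrates one reason ensembling is a safe default: because squared error is convex, the MSE of an equal-weight average of two predictions can never exceed the average of the two individual MSEs.

```r
# Hold-out split with the seed and sizes used in Example I.
data <- iris[, -5]
set.seed(100)
ind <- sample(150, 15)
train <- data[-ind, ]
test <- data[ind, ]

# Two deliberately different base-R models (illustrative choices only).
fit_a <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = train)
fit_b <- lm(Sepal.Length ~ Petal.Width + I(Petal.Length^2), data = train)

mse <- function(pred, obs) mean((pred - obs)^2)

pa <- predict(fit_a, newdata = test)
pb <- predict(fit_b, newdata = test)

mse_a <- mse(pa, test$Sepal.Length)
mse_b <- mse(pb, test$Sepal.Length)
mse_avg <- mse((pa + pb) / 2, test$Sepal.Length)

# The averaged prediction is never worse than the mean of its members' errors.
print(c(model_a = mse_a, model_b = mse_b, average = mse_avg))
mse_avg <= (mse_a + mse_b) / 2
```

The inequality only guarantees that the average beats the mean member, not the best member, so it is still worth comparing every combination on held-out data.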

  16. Prediction: Competitions Vs Clients • Predictive analytics is a black box • Simplicity vs Predictive Accuracy • Communication with client • Reporting: methods or conclusions • Variable Importance • Client Implementation
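
Variable importance is one way to open the black box for a client. As a rough sketch (an assumption about what such a report could contain, not the presenter's method), permutation importance shuffles one predictor at a time in held-out data and records how much the prediction error rises. The helper `perm_importance` and the use of a linear model are hypothetical choices for illustration; the same idea applies to any fitted model with a `predict` method.

```r
# Permutation importance on a hold-out set, base R only.
data <- iris[, -5]
set.seed(100)
ind <- sample(150, 15)
train <- data[-ind, ]
test <- data[ind, ]

fit <- lm(Sepal.Length ~ ., data = train)

mse <- function(pred, obs) mean((pred - obs)^2)
baseline <- mse(predict(fit, newdata = test), test$Sepal.Length)

# Hypothetical helper: mean increase in test MSE when `var` is shuffled.
perm_importance <- function(var, n_rep = 50) {
  increases <- replicate(n_rep, {
    shuffled <- test
    shuffled[[var]] <- sample(shuffled[[var]])
    mse(predict(fit, newdata = shuffled), test$Sepal.Length) - baseline
  })
  mean(increases)
}

predictors <- setdiff(names(train), "Sepal.Length")
importance <- sapply(predictors, perm_importance)

# Larger values mean the model leans more heavily on that variable.
print(sort(importance, decreasing = TRUE))
```

A ranked list like this reports conclusions rather than methods, and it can be produced for a complex model as easily as for a simple one.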

  17. Big Data • Means different things to different people • SKA (Square Kilometre Array): 10 petabytes per hour by 2025 • Statisticians typically consider a few gigabytes to be a huge dataset • Do statisticians have a role to play?

  18. Big Data 3V’s: Volume, Velocity, Variety • Volume: MB, GB, TB, PB, ... • Velocity: Real-Time, Hourly, Weekly, Batch • Variety: Structured, Unstructured • Veracity: How accurate? • Value: How valuable?

  19. Gartner Hype Cycle 2013

  20. Big Data: A typical statistician… • Will say that they are heavily involved in big data • Will use big data for marketing purposes • Will never have programmed a MapReduce job • Will have never used datasets of 0.5TB+ • Will not know about big data technologies • Why is this?

  21. Statisticians may have a role in: • Deciding what data is relevant to the question • Subsetting and sampling big data • Modelling these subsets. Statisticians may not have a role: • If you need to touch all of the data (0.5TB+) • If you are restricted to linear (or linearithmic) algorithms • Sums / Averages / Graph Search / Sorting
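
Sampling does not require holding the full dataset in memory. Below is a minimal sketch of reservoir sampling (Algorithm R), which draws a uniform random sample of `k` records in a single linear pass over data arriving in chunks. The function name `reservoir_sample` and the list-of-chunks input are hypothetical stand-ins for reading blocks from a large file or database; this is an illustration, not something taken from the slides.

```r
# Reservoir sampling (Algorithm R): a uniform sample of size k from a
# stream of unknown length, using O(k) memory and one pass.
reservoir_sample <- function(chunks, k) {
  reservoir <- vector("list", k)
  seen <- 0L
  for (chunk in chunks) {
    for (x in chunk) {
      seen <- seen + 1L
      if (seen <= k) {
        reservoir[[seen]] <- x           # fill the reservoir first
      } else {
        j <- sample.int(seen, 1L)        # uniform position in 1..seen
        if (j <= k) reservoir[[j]] <- x  # keep x with probability k / seen
      }
    }
  }
  unlist(reservoir[seq_len(min(k, seen))])
}

# Illustrative stream: 100 chunks of 1,000 values each.
set.seed(100)
chunks <- lapply(1:100, function(i) ((i - 1) * 1000 + 1):(i * 1000))
s <- reservoir_sample(chunks, k = 500)

length(s)   # 500
mean(s)     # sample estimate of the stream mean
```

The resulting subset is small enough for ordinary statistical modelling, while the pass that produced it is linear in the size of the data.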

