Statistical Challenges in Dealing with Big Data

Is Statistics = Data Science? The big data issue. Nairanjana Dasgupta

What determines big data: The 5 V’s • Volume: considered too large for regular software • Variety: often a mix of many different data types • Velocity: extreme speed at which data is generated • Variability: inconsistency of the data set • Veracity: how reliable is this data?

How big is big? • By big we mean its volume is such that it is hard to analyze this on a single computer. • That in itself shouldn’t be problematic • But requiring specialized machines to analyze this has added to the myth and enigma of big data. • The problem with big data, at least as I see it, is some very pertinent statistical questions are bypassed when dealing with it.

Some statistical thoughts? • Is the big data a sample or a population? • If it is really a population, then analysis means constructing summary statistics. • This is bulky but not too difficult. • If it is a sample, what was the sampling frame? • If no population was considered when collecting this data, it is definitely not a representative sample. • So, should one really do inference on BIG data? • If one is allowed to do inference, wouldn’t the sheer size of the data give us so much power that pretty much any difference we test for, however small, comes out significant?
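The point about power can be made concrete with a small calculation. The sketch below assumes a two-sided one-sample z-test with known standard deviation 1 and a fixed, practically negligible mean shift of 0.01; the function name and all numbers are illustrative assumptions, not part of the original slides. It reports the p-value obtained when the sample mean sits exactly at that shift.

```python
import math

def z_test_p_value(observed_shift, sigma, n):
    """Two-sided p-value of a one-sample z-test for H0: mean = 0."""
    z = observed_shift * math.sqrt(n) / sigma
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

shift = 0.01  # assumed: an effect too small to matter in practice
for n in (100, 10_000, 1_000_000, 100_000_000):
    p = z_test_p_value(shift, 1.0, n)
    print(f"n = {n:>11,d}  p-value = {p:.3g}")
```

The effect size never changes, yet the same shift is nowhere near significant at n = 100 and far below any conventional threshold at n = 1,000,000. With big n, statistical significance says little about practical importance.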

Structure of data • Generally, most data sets are rectangular, with n observations collected on p variables. • In big data we often have many more predictors than observations (the big p problem), • or many more (orders of magnitude more) observations than predictors (the big n problem), • or both n and p are big and fluid, as the data are constantly updated and amassed.

The Variety and Velocity Piece • Generally, opportunistic data is a mix of categorical, discrete, ordinal and continuous variables, and combinations of these. So if we treat it as multivariate data we have to think about how to proceed. While not trivial, this can be surmounted without too much difficulty. • The big issue is that the data is often being amassed (I am intentionally not saying “collected”) at a faster rate than it can actually be analyzed and summarized.
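When data arrive faster than they can be stored and re-read, summaries have to be updated one observation at a time. A minimal sketch, assuming a single numeric stream and using Welford's online algorithm (a technique brought in here for illustration, not one named in the slides); the class name `RunningSummary` and the sample values are hypothetical.

```python
class RunningSummary:
    """Running count, mean and sample variance, updated one value at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the new mean.
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance (divides by n - 1); undefined below two values.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


summary = RunningSummary()
for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):  # stands in for a stream
    summary.update(value)
print(summary.n, summary.mean, summary.variance)
```

Each update costs constant time and memory, so the summary keeps pace with the stream without holding the raw data.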

Variability and Veracity • This type of data is extremely variable and there is no systematic model in place to capture the components of variability. Modeling is very hard when you have no idea about the sources of variability in these types of data sets. • Veracity: is it measuring what we think it is? How truthful is this data? • Just because it is big, is it really good? • O’Donoghue and Herbert: “Big data very often means 'dirty data' and the fraction of data inaccuracies increases with data volume growth.” (context of medical data)

Visualization of big data • Often called dashboards • Really a collection of well-known, age-old graphs that most of you can do in Excel! • It is really just summary data in pretty colors. • Don’t be fooled by these fancy terms.

Example of a Dashboard.

Prediction versus Inference • As the whole question of whether it is a sample or a population is itself muddy, let us leave inference out for now and focus on analysis. • A common analysis method associated with opportunistic data is predictive analysis.

Predictive Analytics and big data • Encompasses prediction models, machine learning and data mining for predicting the unknown from the history of the past. • If we are predicting, are we inferring? I will assume it is okay to do that. • Exploits patterns found in historical data and allows assessment of the risk associated with a particular set of conditions. • Credit scoring has used predictive analytics for a long time. • However, there, at least in the past, sampling was done in order to perform inference.

Techniques used in Predictive Analytics or supervised learning
Analytical Methods (modeled by humans): • Regression techniques • Logistic regression • Time series models • Survival or duration analysis • Classification and discrimination • Regression trees • etc.
Machine Learning Methods (done by machines: no model): • Neural networks • Multilayer perceptron • Radial basis functions • Support vector machines • Naïve Bayes • k-nearest neighbors • Geospatial predictive modeling • etc.

Supervised Learning • Idea is learning from a known data set to predict the unknown. • Essentially we know the class labels ahead of time. • What we need to do is find a RULE using features in the data that DISCRIMINATES effectively between the classes. • So that if we have a new observation with its features we can correctly classify it. • Machine Learning uses this idea and so it is very popular now.
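As one concrete instance of such a rule, the sketch below uses a nearest-centroid classifier: each class is summarized by the mean of its training features, and a new observation is assigned to the class whose mean is closest. This is an illustrative choice, not a method named in the slides; the toy data and function names are made up.

```python
def fit_centroids(features, labels):
    """Return {label: mean feature vector} from labelled training data."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def classify(centroids, x):
    """The RULE: assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda y: squared_distance(centroids[y], x))

# Toy training set (made up): two features, class labels known ahead of time.
train_x = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9), (3.0, 3.1), (3.2, 2.9), (2.9, 3.3)]
train_y = ["A", "A", "A", "B", "B", "B"]

centroids = fit_centroids(train_x, train_y)
print(classify(centroids, (1.1, 1.0)))  # new observation near the A points
print(classify(centroids, (3.1, 3.0)))  # new observation near the B points
```

The known labels are used only to build the rule; once the centroids are fitted, a new observation is classified from its features alone.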

Steps
Discriminant Analysis: • Selection of features • Model fitting • Model validation using prediction of known classes
Machine Learning: • Feature selection is done by the computer • No model, but the computer determines the functions of the predictors used • Model is validated based on prediction of known classes

Feature Selection • Find which of the observed variables can actually distinguish between the classes of interest. • This is variable selection
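One simple way to screen variables, sketched here under the assumption of two classes and numeric features, is to rank each variable by how far apart the class means are in pooled-standard-deviation units. The scoring function, variable names and data below are hypothetical, and this univariate screen is only one of many possible selection methods.

```python
import math
from statistics import mean, variance

def separation_score(values_a, values_b):
    """Absolute difference in class means, in pooled standard deviation units.

    Assumes each class has at least two values and the pooled variance is nonzero.
    """
    na, nb = len(values_a), len(values_b)
    pooled_var = ((na - 1) * variance(values_a)
                  + (nb - 1) * variance(values_b)) / (na + nb - 2)
    return abs(mean(values_a) - mean(values_b)) / math.sqrt(pooled_var)

# Made-up data: values of two candidate variables within each class.
class_a = {"x1": [1.0, 1.2, 0.9, 1.1], "x2": [5.0, 3.0, 4.0, 6.0]}
class_b = {"x1": [3.0, 3.1, 2.8, 3.2], "x2": [4.0, 6.0, 3.0, 5.0]}

scores = {name: separation_score(class_a[name], class_b[name]) for name in class_a}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # variables ordered from most to least separating
```

Here x1 separates the classes and x2 does not, so x1 ranks first. A screen like this looks at one variable at a time and can miss variables that only discriminate in combination.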

Model Fitting • Commonly used in statistics: • Linear discriminant analysis (LDA) • k-nearest neighbors • Quadratic discriminant analysis (QDA) • Logistic regression

Without models we can use Machine Learning methods • Neural networks • Naïve Bayes • Support Vector machines • Perceptron • Decision Trees • Random Forests

Validation • See how well the classifiers classify the observations into the different classes. • The most commonly used method is leave-one-out cross-validation. • A test data set (holdout sample) and resubstitution are still used as well.
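The contrast between leave-one-out cross-validation and resubstitution (scoring the rule on the same observations used to build it) can be sketched as follows. The example assumes a 1-nearest-neighbor rule and made-up data; both are illustrative choices, not taken from the slides.

```python
def nearest_neighbor_label(train_x, train_y, x):
    """Label of the training point closest to x (1-nearest-neighbor rule)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], x))
    return train_y[best]

def leave_one_out_accuracy(features, labels):
    """Hold out each observation in turn and classify it from the rest."""
    correct = 0
    for i in range(len(features)):
        rest_x = features[:i] + features[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        if nearest_neighbor_label(rest_x, rest_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)

def resubstitution_accuracy(features, labels):
    """Classify every observation with a rule built from all of the data."""
    hits = sum(nearest_neighbor_label(features, labels, x) == y
               for x, y in zip(features, labels))
    return hits / len(features)

# Made-up data; the last point is labelled "A" but sits among the "B" points.
data_x = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9), (3.0, 3.1), (3.2, 2.9), (3.1, 3.0)]
data_y = ["A", "A", "A", "B", "B", "A"]

print("leave-one-out: ", leave_one_out_accuracy(data_x, data_y))
print("resubstitution:", resubstitution_accuracy(data_x, data_y))
```

With a 1-nearest-neighbor rule every point is its own nearest neighbor, so resubstitution reports perfect accuracy on this data, while leave-one-out exposes the errors around the oddly labelled point. This optimism is why resubstitution alone is a weak check.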

Recap of Part 4 • The sticky problem is whether the data we have is a sample or a population. • Inference is tough, as it is hard to figure out what population we are inferring to. • Predictive analytics is often associated with big data. • At the end of the day, machines are faster and more efficient but cannot create interpretable models (not yet). • We still don’t know if big data is good data; it depends on who is collecting it and for what purpose.

Myth of Big Data • There is no myth; it is just unwieldy, unstructured, under-designed data that is already being amassed. • It still has to be good data for us to make good analyses and predictions. • At the end of the day, to make inferences on data (big or small) we need it to be representative.
