Explore Bayesian statistics as an alternative to hypothesis testing for behavioral predictions, assessing model fit and comparing effects.
Prediction • Greg Francis • PSY 626: Bayesian Statistics for Psychological Science • Fall 2018 • Purdue University
Hypothesis tests • Hypothesis tests are commonly used as part of a method to establish scientific “truth” • Is there an effect? • What should I believe? • An alternative approach is to give up on “truth” and instead focus on “prediction” • The question is not “Is there an effect?” or “What should I believe?” • Rather: “How should I behave?” • Follow the data, but do not follow it blindly • Build a quantitative model, but test the model
Model building • Suppose you have two samples and you are interested in the means • Further suppose that the population properties are: • μ1=0, μ2=0.3 • σ1=σ2=1 • Typically, we would draw random samples from each group and run a t-test to determine if we should treat the means as being different • Treatment • Theory • Future work • Prediction
Model building • We typically build the following kind of model • The score for subject k is related to the grand mean, to deviations from the grand mean due to being in group 1 or group 2, and to random noise • This model gives mean values for each group
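The equation on this slide was not preserved. One conventional way to write a model matching the description is sketched here. All symbols are assumptions rather than the slide's own notation: $X_{jk}$ is the score of subject $k$ in group $j$, $\mu$ is the grand mean, $\alpha_j$ is the deviation for group $j$, and $\epsilon_{jk}$ is random noise.

```latex
X_{jk} = \mu + \alpha_j + \epsilon_{jk}, \qquad j \in \{1, 2\}, \qquad \mu_j = \mu + \alpha_j
```

Here $\mu_j$ is the mean value the model assigns to group $j$.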
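Model building • Draw samples (n1=n2) from populations having • μ1=0, μ2=0.3 • σ1=σ2=1 • Construct different models that vary in the estimate of the mean values: • Null model: both groups are assigned the grand mean • Full model: each group is assigned its own sample mean • Hypothesis testing model: use the null model if H0 is not rejected, and the full model if H0 is rejected (p<.05)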
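The three modeling approaches can be sketched in code. This is a minimal standard-library sketch under stated assumptions: the null model predicts the grand mean for both groups, the full model predicts each sample mean, and the hypothesis-testing model picks between them with a pooled-variance two-sample t-test (the slides say only "a t-test"). The function names `t_two_sided_p`, `pooled_t_test`, and `fit_models` are hypothetical. In practice `scipy.stats.ttest_ind` would normally replace the hand-rolled p-value.

```python
import math
import statistics


def t_two_sided_p(t, df, steps=400):
    """Two-sided p-value for Student's t, via Simpson integration of the density."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    h = abs(t) / steps  # steps must be even for Simpson's rule
    total = 0.0
    for i in range(steps + 1):
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * c * (1 + (i * h) ** 2 / df) ** (-(df + 1) / 2)
    area = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1 - 2 * area)


def pooled_t_test(x1, x2):
    """p-value of a pooled-variance two-sample t-test (an assumed choice of test)."""
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * statistics.variance(x1) + (n2 - 1) * statistics.variance(x2)) / df
    t = (statistics.fmean(x2) - statistics.fmean(x1)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t_two_sided_p(t, df)


def fit_models(x1, x2, alpha=0.05):
    """Return the (group 1, group 2) mean estimates under each approach."""
    full = (statistics.fmean(x1), statistics.fmean(x2))
    grand = statistics.fmean(list(x1) + list(x2))
    null = (grand, grand)
    # Hypothesis-testing model: full model if H0 is rejected, null model otherwise.
    chosen = full if pooled_t_test(x1, x2) < alpha else null
    return {"null": null, "full": full, "hypothesis_test": chosen}
```

Calling `fit_models(x1, x2)` on two lists of scores returns one pair of predicted means per approach, which is the input the later fit comparisons need.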
Small samples • 20 experiments • n1=n2=10
Bigger samples • 20 experiments • n1=n2=50
Big samples • 20 experiments • n1=n2=100
Model fit/error • A standard way of judging the quality of a model is by its fit to a data set • One fit measure is root mean squared error • We want a model with low RMSE
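The formula itself did not survive on this slide. A plausible form for the two-group case is given below, assuming the error is taken between each model's estimated group means $\hat{\mu}_j$ and the true means $\mu_j$ (the hat notation is an assumption):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{2}\sum_{j=1}^{2}\left(\hat{\mu}_j - \mu_j\right)^2}
```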
Checking model approaches • Draw samples (n1=n2) from populations having • μ1=0,μ2=0.3 • σ1=σ2=1 • Repeat for 10,000 simulated experiments • Compute RMSE for each model and average across experiments • Vary sample size n1=n2
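The procedure above can be sketched as a small simulation. This is an illustrative standard-library sketch, not the code behind the slides. It assumes RMSE is computed between each model's two estimated means and the true means, and that the hypothesis-testing model uses a pooled-variance t-test at p<.05. The names `average_rmse`, `t_two_sided_p`, and `sample_var` are hypothetical.

```python
import math
import random

MU1, MU2, SIGMA = 0.0, 0.3, 1.0  # population values from the slide


def t_two_sided_p(t, df, steps=200):
    """Two-sided p-value for Student's t, via Simpson integration of the density."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    h = abs(t) / steps  # steps must be even for Simpson's rule
    total = 0.0
    for i in range(steps + 1):
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * c * (1 + (i * h) ** 2 / df) ** (-(df + 1) / 2)
    return max(0.0, 1 - 2 * total * h / 3)


def sample_var(x, mean):
    """Unbiased sample variance around a precomputed mean."""
    return sum((v - mean) ** 2 for v in x) / (len(x) - 1)


def rmse(pred, target):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def average_rmse(n, n_experiments=10_000, alpha=0.05, seed=0):
    """Average RMSE against the true means for each modeling approach."""
    rng = random.Random(seed)
    totals = {"null": 0.0, "full": 0.0, "hypothesis_test": 0.0}
    for _ in range(n_experiments):
        x1 = [rng.gauss(MU1, SIGMA) for _ in range(n)]
        x2 = [rng.gauss(MU2, SIGMA) for _ in range(n)]
        m1, m2 = sum(x1) / n, sum(x2) / n
        grand = (m1 + m2) / 2  # equal n, so the grand mean is the mean of the means
        sp2 = (sample_var(x1, m1) + sample_var(x2, m2)) / 2  # pooled variance, equal n
        t = (m2 - m1) / math.sqrt(sp2 * 2 / n)
        full, null = (m1, m2), (grand, grand)
        chosen = full if t_two_sided_p(t, 2 * n - 2) < alpha else null
        for name, pred in (("null", null), ("full", full), ("hypothesis_test", chosen)):
            totals[name] += rmse(pred, (MU1, MU2))
    return {name: total / n_experiments for name, total in totals.items()}
```

Calling `average_rmse(n)` over a range of sample sizes gives one average RMSE per model at each n, which is the kind of comparison the following slides plot.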
Comparing models • μ2 - μ1=0.3
Comparing models • For small samples, the null model provides the smallest average RMSE • For large samples, the full model provides the smallest average RMSE
Comparing models • There is always a better model (on average) than what is derived by hypothesis testing • Hypothesis testing (on average) leads to overfitting for some small samples (when it rejects) • Hypothesis testing (on average) leads to underfitting for some large samples (when it does not reject)
Bigger effects • Similar for other effect sizes: μ2 - μ1=0.8
Null effects • Similar for other effect sizes: μ2 - μ1=0
Known unknowns • But these simulations are all theoretical • To compute RMSE we need to know the true means • However, we can do something similar if we do not compute RMSE relative to the true means, but relative to test data
Prediction / validation • Suppose I build my models from one set of data, x1i and x2i, and then test them with another set of data, y1i and y2i • Here, we compute RMSE relative to means from the test data set • You could also compute RMSE relative to individual data points
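A minimal sketch of this build-then-test step follows, restricted to the null and full models to keep it short. A hypothesis-testing model could be scored the same way. The function name `validation_rmse` is hypothetical, and RMSE is taken relative to the two test-set means, as the slide describes.

```python
import math
import statistics


def rmse(pred, target):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def validation_rmse(x1, x2, y1, y2):
    """Fit models on the build data (x1, x2); score them on the means of (y1, y2)."""
    target = (statistics.fmean(y1), statistics.fmean(y2))
    grand = statistics.fmean(list(x1) + list(x2))
    models = {
        "null": (grand, grand),                               # one shared mean
        "full": (statistics.fmean(x1), statistics.fmean(x2)),  # one mean per group
    }
    return {name: rmse(pred, target) for name, pred in models.items()}
```

The build and test samples are passed separately, so they do not need to be the same size.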
Small effect • When μ2 - μ1=0.3
Bigger effect • When μ2 - μ1=0.8
Null effect • When μ2 - μ1=0
Prediction / validation • Can better see differences by subtracting full model RMSE from other models’ RMSE • [Figure panels: μ2 - μ1=0.8, μ2 - μ1=0.3, μ2 - μ1=0] • The smallest number (most negative) indicates the best model (the one with the smallest RMSE)
Prediction / validation • This looks good • At least on average, the RMSE patterns for testing against the means of new data are similar to those for testing against the true means • If we want to deduce which model best predicts values, we can pick the model that minimizes the test RMSE value • Cost: we have to run the experiment twice • Testing does not require equal sample sizes, but you trade off model development against model testing
Cross validation • We partly avoid that cost by using cross-validation to approximate the test RMSE • Divide the data set x1i and x2i into multiple subsets (a common choice is 10 subsets) • Build your model using all but one of the subsets • Compute RMSE for the left-out subset • Repeat so that each subset is left out once • With 10 subsets, that gives 10 build-and-test “folds” • Compute mean RMSE across the folds
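The steps above can be sketched as follows. This is an illustrative standard-library version under stated assumptions: each group is split into k folds separately, RMSE is taken relative to the means of the left-out fold, and only the null and full models are scored. The names `k_fold_indices` and `cross_validated_rmse` are hypothetical.

```python
import math
import random
import statistics


def rmse(pred, target):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_rmse(x1, x2, k=10, seed=0):
    """Mean held-out RMSE across k folds (requires at least k scores per group)."""
    rng = random.Random(seed)
    folds1 = k_fold_indices(len(x1), k, rng)
    folds2 = k_fold_indices(len(x2), k, rng)
    totals = {"null": 0.0, "full": 0.0}
    for f1, f2 in zip(folds1, folds2):
        held1, held2 = set(f1), set(f2)
        train1 = [v for i, v in enumerate(x1) if i not in held1]
        train2 = [v for i, v in enumerate(x2) if i not in held2]
        # Score against the means of the left-out fold.
        target = (statistics.fmean([x1[i] for i in f1]),
                  statistics.fmean([x2[i] for i in f2]))
        grand = statistics.fmean(train1 + train2)
        totals["null"] += rmse((grand, grand), target)
        totals["full"] += rmse((statistics.fmean(train1), statistics.fmean(train2)), target)
    return {name: total / k for name, total in totals.items()}
```

The model with the smaller returned value is the one cross-validation favors for the current data.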
Cross validation • When μ2 - μ1=0.3, 5-fold cross validation
Cross validation • When μ2 - μ1=0.8, 5-fold cross validation
Cross validation • When μ2 - μ1=0.0, 5-fold cross validation
Optional stopping • Actual use: μ2 - μ1=0.3, 10-fold cross validation • Start with n1=n2=10, compute cross-validated RMSE • Add 10 scores and repeat until n1=n2=200
Optional stopping • Actual use: μ2 - μ1=0.8, 10-fold cross validation • Start with n1=n2=10, compute cross-validated RMSE • Add 10 scores and repeat until n1=n2=200
Optional stopping • Actual use: μ2 - μ1=0.0, 10-fold cross validation • Start with n1=n2=10, compute cross-validated RMSE • Add 10 scores and repeat until n1=n2=200
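The accumulate-and-recheck loop on these slides can be sketched as below. This is a hedged illustration, not the slides' actual code. It assumes 10 scores are added to each group per step, uses interleaved folds (every k-th score) rather than shuffled ones, scores only the null and full models, and uses the hypothetical names `cv_rmse` and `optional_stopping`.

```python
import math
import random
import statistics


def cv_rmse(x1, x2, k=10):
    """k-fold cross-validated RMSE for the null and full models (interleaved folds)."""
    totals = {"null": 0.0, "full": 0.0}
    for f in range(k):
        test1, test2 = x1[f::k], x2[f::k]  # scores whose index i has i % k == f
        train1 = [v for i, v in enumerate(x1) if i % k != f]
        train2 = [v for i, v in enumerate(x2) if i % k != f]
        target = (statistics.fmean(test1), statistics.fmean(test2))
        grand = statistics.fmean(train1 + train2)
        preds = {"null": (grand, grand),
                 "full": (statistics.fmean(train1), statistics.fmean(train2))}
        for name, pred in preds.items():
            totals[name] += math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / 2)
    return {name: total / k for name, total in totals.items()}


def optional_stopping(mu1=0.0, mu2=0.3, sigma=1.0, start=10, step=10, stop=200, seed=0):
    """Grow both samples step by step, recording the currently best model each time."""
    rng = random.Random(seed)
    x1, x2, history = [], [], []
    while len(x1) < stop:
        add = start if not x1 else step
        x1 += [rng.gauss(mu1, sigma) for _ in range(add)]
        x2 += [rng.gauss(mu2, sigma) for _ in range(add)]
        scores = cv_rmse(x1, x2)
        best = min(scores, key=scores.get)  # model with the smallest CV RMSE so far
        history.append((len(x1), best, scores))
    return history
```

Each entry of the returned history holds the sample size, the currently preferred model, and both cross-validated RMSE values, so the preferred model can be seen changing as data accumulate.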
Cross validation • At each step, you should follow the data and use the best model for minimizing RMSE • As the data change, so does your model • You can make an intermediate decision, but should still expect it to change • If you have to make a decision with the current data, it makes sense to choose the best model • Note that the best model is not necessarily a good model • You have to judge whether the RMSE is small enough for whatever purpose you have in mind
Prediction / validation • Cross validation and test validation naturally generalize to more complicated models and experimental designs • Interactions, nonlinear models • Details of how to generate validation “folds” can get complicated • It’s mostly a matter of being careful about generating representative folds and not inputting your own bias • No need to use RMSE • Other “cost” functions work in a similar way
Conclusions • Prediction / validation seems like a viable approach • It encourages data accumulation • But it gives up on the idea of establishing “truth” from data • Instead, it focuses on practical uses of data • There are Bayesian methods that have the same goal • They are better if you have useful prior knowledge