Explore the use of hypothesis testing vs prediction in psychological science and the benefits of Bayesian statistics. Learn how to build quantitative models and test their fit using various samples. Compare different models and assess their performance using root mean squared error (RMSE). Discover the advantages of prediction and validation techniques, and understand the concept of cross-validation. Apply optional stopping to determine sample sizes for optimal model performance.
Prediction Greg Francis PSY 626: Bayesian Statistics for Psychological Science Fall 2018 Purdue University
Hypothesis tests • Hypothesis tests are commonly used as part of a method to establish scientific “truth” • Is there an effect? • What should I believe? • An alternative approach is to give up on “truth” and instead focus on “prediction” • The question is not “Is there an effect?” or “What should I believe?” • Rather: “How should I behave?” • Follow the data, but do not follow it blindly • Build a quantitative model, but test the model
Model building • Suppose you have two samples and you are interested in the means • Further suppose that the population properties are: • μ1=0, μ2=0.3 • σ1=σ2=1 • Typically, we would draw random samples from each group and run a t-test to determine if we should treat the means as being different • Treatment • Theory • Future work • Prediction
Model building • We typically build the following kind of model • The score for subject k is related to the grand mean, to deviations from the grand mean due to being in group 1 or group 2, and to random noise • This model gives mean values for each group
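One standard way to write a model matching this description is sketched below. The symbols are assumptions chosen for illustration, not the slide's own notation: X_{jk} is the score of subject k in group j, \mu is the grand mean, \alpha_j is the deviation for group j, and \varepsilon_{jk} is random noise.

```latex
X_{jk} = \mu + \alpha_j + \varepsilon_{jk}, \qquad j \in \{1, 2\},
\qquad \text{so the model's mean for group } j \text{ is } \hat{\mu}_j = \mu + \alpha_j .
```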
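Model building • Draw samples (n1=n2) from populations having • μ1=0, μ2=0.3 • σ1=σ2=1 • Construct different models that vary in the estimate of the mean values: • Null model: both groups share a single mean • Full model: each group has its own mean • Hypothesis testing model: use the null model if we do not reject H0 (p ≥ .05), and the full model if we do reject H0 (p < .05)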
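The three models can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function names are hypothetical, the test is assumed to be a pooled-variance two-sample t-test at α = .05, and the critical value comes from a Cornish-Fisher approximation (adequate for moderate degrees of freedom, less accurate for very small ones) so that only the standard library is needed.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance


def t_critical(df, alpha=0.05):
    """Approximate two-sided critical t value (Cornish-Fisher expansion)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (z + (z**3 + z) / (4 * df)
            + (5 * z**5 + 16 * z**3 + 3 * z) / (96 * df**2))


def rejects_null(x1, x2, alpha=0.05):
    """Pooled-variance two-sample t-test of H0: equal population means."""
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / df
    t = (fmean(x1) - fmean(x2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return abs(t) > t_critical(df, alpha)


def fit_models(x1, x2, alpha=0.05):
    """Return each model's predicted (group 1 mean, group 2 mean)."""
    full = (fmean(x1), fmean(x2))            # separate mean per group
    grand = fmean(list(x1) + list(x2))
    null = (grand, grand)                    # one shared mean
    chosen = full if rejects_null(x1, x2, alpha) else null
    return {"null": null, "full": full, "hypothesis_test": chosen}


if __name__ == "__main__":
    rng = random.Random(1)                   # arbitrary seed for the demo
    x1 = [rng.gauss(0.0, 1.0) for _ in range(10)]
    x2 = [rng.gauss(0.3, 1.0) for _ in range(10)]
    for name, means in fit_models(x1, x2).items():
        print(name, means)
```

The hypothesis testing model is never a third set of estimates: it is always a copy of either the null or the full model, depending on the test outcome.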
Small samples • 20 experiments • n1=n2=10
Bigger samples • 20 experiments • n1=n2=50
Big samples • 20 experiments • n1=n2=100
Model fit/error • A standard way of judging the quality of a model is by its fit to a data set • One fit measure is root mean squared error • We want a model with low RMSE
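As a hedged sketch of the measure: assuming the error is taken between a model's predicted group means \hat{\mu}_j and the true population means \mu_j (the "Known unknowns" slide notes that the true means are needed), RMSE for the two-group case could be written as follows. The notation is assumed for illustration.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{2} \sum_{j=1}^{2} \left( \hat{\mu}_j - \mu_j \right)^2 }
```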
Checking model approaches • Draw samples (n1=n2) from populations having • μ1=0,μ2=0.3 • σ1=σ2=1 • Repeat for 10,000 simulated experiments • Compute RMSE for each model and average across experiments • Vary sample size n1=n2
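The simulation described above might look roughly like this. It is a sketch under assumptions: RMSE is computed between predicted group means and the true means, the test is a pooled-variance t-test with an approximate critical value, and all function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def t_critical(df, alpha=0.05):
    """Approximate two-sided critical t value (Cornish-Fisher expansion)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (z + (z**3 + z) / (4 * df)
            + (5 * z**5 + 16 * z**3 + 3 * z) / (96 * df**2))


def fit_models(x1, x2, alpha=0.05):
    """Predicted (mean 1, mean 2) for the null, full, and test-chosen models."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = fmean(x1), fmean(x2)
    ss = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    df = n1 + n2 - 2
    t = (m1 - m2) / sqrt(ss / df * (1 / n1 + 1 / n2))
    grand = (n1 * m1 + n2 * m2) / (n1 + n2)
    full, null = (m1, m2), (grand, grand)
    reject = abs(t) > t_critical(df, alpha)
    return {"null": null, "full": full,
            "hypothesis_test": full if reject else null}


def rmse(pred, target):
    """Root mean squared error between predicted and target group means."""
    return sqrt(fmean((p - t) ** 2 for p, t in zip(pred, target)))


def average_rmse(n, mu=(0.0, 0.3), sigma=1.0, n_experiments=10_000, seed=0):
    """Average each model's RMSE (vs. the true means) over simulated experiments."""
    rng = random.Random(seed)
    totals = {"null": 0.0, "full": 0.0, "hypothesis_test": 0.0}
    for _ in range(n_experiments):
        x1 = [rng.gauss(mu[0], sigma) for _ in range(n)]
        x2 = [rng.gauss(mu[1], sigma) for _ in range(n)]
        for name, pred in fit_models(x1, x2).items():
            totals[name] += rmse(pred, mu)
    return {name: total / n_experiments for name, total in totals.items()}


if __name__ == "__main__":
    for n in (10, 50, 100):
        # Fewer experiments than the slide's 10,000 to keep the demo quick.
        print(n, average_rmse(n, n_experiments=2000))
```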
Comparing models • μ2 - μ1=0.3
Comparing models • For small samples, the null model provides the smallest average RMSE • For large samples, the full model provides the smallest average RMSE
Comparing models • There is always a better model (on average) than what is derived by hypothesis testing • Hypothesis testing (on average) leads to overfitting for some small samples (when it rejects) • Hypothesis testing (on average) leads to underfitting for some large samples (when it does not reject)
Bigger effects • Similar for other effect sizes: μ2 - μ1=0.8
Null effects • Similar for other effect sizes: μ2 - μ1=0
Known unknowns • But these simulations are all theoretical • To compute RMSE we need to know the true means • However, we can do something similar if we do not compute RMSE relative to the true means, but relative to test data
Prediction / validation • Suppose I build my models from one set of data, x1i and x2i, and then test them with another set of data, y1i and y2i • Here, we compute RMSE relative to the means of the test data set • You could also compute RMSE relative to individual data points
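A minimal sketch of this build-then-test step follows. It assumes each model is represented by its pair of predicted group means and that RMSE is taken against the test-set means; the function names and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import fmean


def rmse(pred, target):
    """Root mean squared error between predicted and target group means."""
    return sqrt(fmean((p - t) ** 2 for p, t in zip(pred, target)))


def validation_rmse(models, y1, y2):
    """Score each model's predicted means against the test-data means."""
    target = (fmean(y1), fmean(y2))
    return {name: rmse(pred, target) for name, pred in models.items()}


def best_model(scores):
    """Name of the model with the smallest test RMSE."""
    return min(scores, key=scores.get)


if __name__ == "__main__":
    # Made-up training data (x) and test data (y), for illustration only.
    x1, x2 = [0.1, -0.4, 0.6, 0.2], [0.5, 0.1, 0.9, 0.3]
    y1, y2 = [-0.2, 0.3, 0.0, 0.1], [0.4, 0.2, 0.6, 0.0]
    grand = fmean(x1 + x2)
    models = {"null": (grand, grand), "full": (fmean(x1), fmean(x2))}
    scores = validation_rmse(models, y1, y2)
    print(scores, best_model(scores))
```

Keeping the fitted models separate from the scoring step means the same scorer works for any model that outputs group means, including the hypothesis testing model.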
Small effect • When μ2 - μ1=0.3
Bigger effect • When μ2 - μ1=0.8
Null effect • When μ2 - μ1=0
Prediction / validation • We can see the differences better by subtracting the full model's RMSE from the other models' RMSE • Panels: μ2 - μ1=0.8, μ2 - μ1=0.3, μ2 - μ1=0 • The smallest number (most negative) indicates the best model (the one with the smallest RMSE)
Prediction / validation • This looks good • At least on average, the RMSE patterns for testing means of new data are similar to those for RMSE for testing against the true means • If we want to deduce which model best predicts values, we can pick the model that minimizes the test RMSE value • Cost: we have to run the experiment twice • Testing does not require equal sample sizes, but you trade off model development against model testing
Cross validation • We partly avoid that cost by using cross-validation to approximate RMSETest • Divide the data set, x1i and x2i, into multiple subsets (a common choice is 10 subsets) • Build your model using all but one of the subsets • Compute RMSE for the left-out subset • Repeat so that each subset is left out once • With 10 subsets, this gives 10 build-and-test “folds” • Compute mean RMSE across the subsets
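The procedure above can be sketched as follows. This is an illustration under assumptions: each group is split into folds separately, RMSE for a fold is taken against the means of the left-out scores, and only the null and full models are shown. The names `fit_null`, `fit_full`, `k_folds`, and `cross_validated_rmse` are hypothetical.

```python
import random
from math import sqrt
from statistics import fmean


def fit_null(x1, x2):
    """Null model: both groups share the grand mean."""
    grand = fmean(x1 + x2)
    return (grand, grand)


def fit_full(x1, x2):
    """Full model: each group has its own mean."""
    return (fmean(x1), fmean(x2))


def k_folds(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_rmse(x1, x2, fit, k=10, seed=0):
    """Mean RMSE over k folds; each fold is left out of model building once."""
    if k > min(len(x1), len(x2)):
        raise ValueError("k cannot exceed the smaller group size")
    rng = random.Random(seed)
    folds1 = k_folds(len(x1), k, rng)
    folds2 = k_folds(len(x2), k, rng)
    scores = []
    for held1, held2 in zip(folds1, folds2):
        h1, h2 = set(held1), set(held2)
        train1 = [v for i, v in enumerate(x1) if i not in h1]
        train2 = [v for i, v in enumerate(x2) if i not in h2]
        pred = fit(train1, train2)                       # build on k-1 folds
        target = (fmean(x1[i] for i in held1),           # test on held-out fold
                  fmean(x2[i] for i in held2))
        scores.append(sqrt(fmean((p - t) ** 2 for p, t in zip(pred, target))))
    return fmean(scores)


if __name__ == "__main__":
    rng = random.Random(2)                               # arbitrary demo seed
    x1 = [rng.gauss(0.0, 1.0) for _ in range(50)]
    x2 = [rng.gauss(0.3, 1.0) for _ in range(50)]
    for name, fit in (("null", fit_null), ("full", fit_full)):
        print(name, cross_validated_rmse(x1, x2, fit, k=10))
```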
Cross validation • When μ2 - μ1=0.3, 5-fold cross validation
Cross validation • When μ2 - μ1=0.8, 5-fold cross validation
Cross validation • When μ2 - μ1=0.0, 5-fold cross validation
Optional stopping • Actual use: μ2 - μ1=0.3, 10-fold cross validation • Start with n1=n2=10, compute cross-validated RMSE • Add 10 scores and repeat until n1=n2=200
Optional stopping • Actual use: μ2 - μ1=0.8, 10-fold cross validation • Start with n1=n2=10, compute cross-validated RMSE • Add 10 scores and repeat until n1=n2=200
Optional stopping • Actual use: μ2 - μ1=0.0, 10-fold cross validation • Start with n1=n2=10, compute cross-validated RMSE • Add 10 scores and repeat until n1=n2=200
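The accumulate-and-recheck loop might be sketched as below. Several choices are assumptions made for illustration: "add 10 scores" is read as 10 new scores per group, folds are formed by taking every k-th score rather than by shuffling, and only the null and full models are compared. The names `MODELS`, `cv_rmse`, and `optional_stopping` are hypothetical.

```python
import random
from math import sqrt
from statistics import fmean

# Each model maps two training samples to predicted (mean 1, mean 2).
MODELS = {
    "null": lambda a, b: (fmean(a + b),) * 2,
    "full": lambda a, b: (fmean(a), fmean(b)),
}


def cv_rmse(x1, x2, fit, k=10):
    """k-fold cross-validated RMSE using interleaved folds (every k-th score)."""
    scores = []
    for f in range(k):
        train1 = [v for i, v in enumerate(x1) if i % k != f]
        train2 = [v for i, v in enumerate(x2) if i % k != f]
        pred = fit(train1, train2)
        target = (fmean(x1[f::k]), fmean(x2[f::k]))
        scores.append(sqrt(fmean((p - t) ** 2 for p, t in zip(pred, target))))
    return fmean(scores)


def optional_stopping(mu=(0.0, 0.3), sigma=1.0, start=10, step=10,
                      max_n=200, seed=0):
    """Grow both samples step by step, re-scoring every model at each size."""
    rng = random.Random(seed)
    x1, x2, history = [], [], []
    n = start
    while n <= max_n:
        # Top up each group to the current per-group sample size n.
        x1 += [rng.gauss(mu[0], sigma) for _ in range(n - len(x1))]
        x2 += [rng.gauss(mu[1], sigma) for _ in range(n - len(x2))]
        scores = {name: cv_rmse(x1, x2, fit) for name, fit in MODELS.items()}
        history.append((n, min(scores, key=scores.get), scores))
        n += step
    return history


if __name__ == "__main__":
    for n, best, scores in optional_stopping(max_n=50):
        print(n, best, {name: round(s, 3) for name, s in scores.items()})
```

Each history entry records the sample size, the currently preferred model, and all scores, so the preferred model can be seen changing as data accumulate.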
Cross validation • At each step, you should follow the data and use the best model for minimizing RMSE • As the data change, so does your model • You can have an intermediate decision, but still expect it to change • If you have to make a decision with the current data, it makes sense to choose the best model • Note that the best model is not necessarily a good model • You have to judge whether the RMSE is small enough for whatever purpose you have in mind
Prediction / validation • Cross validation and test validation naturally generalize to more complicated models and experimental designs • Interactions, nonlinear models • Details of how to generate validation “folds” can get complicated • It’s mostly a matter of being careful about generating representative folds and not introducing your own bias • No need to use RMSE • Other “cost” functions work in a similar way
Conclusions • Prediction / validation seems like a viable approach • It encourages data accumulation • But it gives up on the idea of establishing “truth” from data • Instead, it focuses on practical uses of data • There are Bayesian methods that have the same goal • They are better if you have useful prior knowledge