This text discusses the use of Bayes Factors in the analysis of a study on face emotion and working memory, providing guidelines for interpreting evidence and demonstrating the importance of null findings. It also covers Bayesian meta-analysis as a way to combine replication studies.
Bayes Factors and inference Greg Francis PSY 626: Bayesian Statistics for Psychological Science Fall 2018 Purdue University
Bayes Factor • The ratio of the likelihood of the data under one model (e.g., the null) to the likelihood of the data under another model (e.g., the alternative) • Nothing special about the null; it compares any two models • The likelihoods are averaged across the possible parameter values specified by each model's prior distribution
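In standard notation (not on the original slide), that prior-averaged ratio is the ratio of marginal likelihoods:

$$ BF_{01} = \frac{p(D \mid H_0)}{p(D \mid H_1)} = \frac{\int p(D \mid \theta_0, H_0)\, \pi_0(\theta_0)\, d\theta_0}{\int p(D \mid \theta_1, H_1)\, \pi_1(\theta_1)\, d\theta_1} $$

where π₀ and π₁ are the prior distributions over each model's parameters, and BF₁₀ = 1/BF₀₁.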
What does it mean? • Guidelines for interpreting the Bayes Factor:

BF         Evidence
1 – 3      Anecdotal
3 – 10     Substantial
10 – 30    Strong
30 – 100   Very strong
> 100      Decisive
Evidence for the null • BF01 > 1 implies (some) support for the null hypothesis • Evidence for “invariances” • This is more or less impossible for NHST • It is a useful measure • Consider a recent study in Psychological Science • Liu, Wang, Wang & Jiang (2016). Conscious Access to Suppressed Threatening Information Is Modulated by Working Memory
Working memory face emotion • Explored whether keeping a face in working memory influenced its visibility under continuous flash suppression • To ensure subjects kept the face in memory, they were tested on its identity
Working memory face emotion • Different types of face emotions: fearful face, neutral face • No significant difference in correct responses (same/different) between emotions: • Experiment 1: t(11) = -1.74, p = 0.110 • If we compute the JZS Bayes Factor we get • > ttest.tstat(t=-1.74, n1=12, simple=TRUE) • B10 • 0.9240776 • which is anecdotal support for the null hypothesis • You would want B10 < 1/3 for substantial support for the null
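A minimal sketch (mine, not from the slide) of the same calculation in R, making the required library explicit and re-expressing the result as BF01 so it can be compared directly to the 1/3 criterion:

```r
# Requires the BayesFactor package; ttest.tstat() returns the JZS BF10
# when simple = TRUE. n1 = 12 treats t(11) as a one-sample/paired test.
library(BayesFactor)

b10 <- ttest.tstat(t = -1.74, n1 = 12, simple = TRUE)
b01 <- 1 / b10   # evidence for the null relative to the alternative
b01              # about 1.08: anecdotal support for the null
b01 > 3          # FALSE: not "substantial" support for the null
```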
Replications • Experiment 3 • t(11)=-1.62, p=.133 • Experiment 4 • t(13)=-1.37, p=.195 • Converting to JZS Bayes Factors shows that these provide only modest support for the null • Experiment 3 • > ttest.tstat(t= -1.62, n1=12, simple=TRUE) • B10 • 0.8033315 • Experiment 4 • > ttest.tstat(t= -1.37, n1=14, simple=TRUE) • B10 • 0.5857839
The null result matters • The authors wanted to demonstrate that faces with different emotions were equivalently represented in working memory, but affected visibility differently during the flash suppression part of a trial • Experiment 1: • Reaction times for seeing a face during continuous flash suppression were shorter for fearful faces than for neutral faces • Main effect of emotion: F(1, 11)=5.06, p=0.046 • Reaction times were shorter when the emotion of the face during continuous flash suppression matched the emotion of the face in working memory • Main effect of congruency: F(1, 11)=11.86, p=0.005
Main effects • We will talk about a Bayesian ANOVA later, but we can consider the t-test equivalent of these tests: • Effect of emotion • > ttest.tstat(t= sqrt(5.06), n1=12, simple=TRUE) • B10 • 1.769459 • Suggests anecdotal support for the alternative hypothesis • Effect of congruency • ttest.tstat(t= sqrt(11.86), n1=12, simple=TRUE) • B10 • 9.664241 • Suggests substantial support for the alternative hypothesis
Evidence • It is generally harder to get convincing evidence (BF>3 or BF>10) than to get p<.05 • Interaction: F(1, 11)=4.36, p=.061 • Contrasts: • RT for fearful faces shorter if congruent with working memory: t(11)=-3.59, p=.004 • RT for neutral faces unaffected by congruency: t(11)=-0.45 • Bayesian interpretations of the t-tests: • > ttest.tstat(t=-3.59, n1=12, simple=TRUE) • B10 • 11.94693 • > ttest.tstat(t=-0.45, n1=12, simple=TRUE) • B10 • 0.3136903
Substantial Evidence • For a two-sample t-test (n1=n2=10), a BF>3 corresponds to p<0.022 • For a two-sample t-test (n1=n2=100), a BF>3 corresponds to p<0.012 • For a two-sample t-test (n1=n2=1000), a BF>3 corresponds to p<0.004
Strong Evidence • For a two-sample t-test (n1=n2=10), a BF>10 corresponds to p<0.004 • For a two-sample t-test (n1=n2=100), a BF>10 corresponds to p<0.003 • For a two-sample t-test (n1=n2=1000), a BF>10 corresponds to p<0.001 • Of course, if you change your prior you change these values • (but not much) • Setting the scale parameter r=sqrt(2) (ultra wide) gives • For a two-sample t-test (n1=n2=10), a BF>10 corresponds to p<0.005 • For a two-sample t-test (n1=n2=100), a BF>10 corresponds to p<0.0017 • For a two-sample t-test (n1=n2=1000), a BF>10 corresponds to p<0.00054
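These correspondences can be checked numerically. Below is a small sketch (my own, not from the slides) that finds the two-tailed p value at which the JZS Bayes factor for an equal-n two-sample t-test reaches a given criterion; the helper name p_for_BF is mine.

```r
# Find the p value at which the JZS Bayes factor (BF10) reaches a criterion,
# for an equal-n two-sample t-test.
library(BayesFactor)

p_for_BF <- function(criterion, n, rscale = "medium") {
  # t value at which BF10 equals the criterion...
  f <- function(t) ttest.tstat(t = t, n1 = n, n2 = n,
                               rscale = rscale, simple = TRUE) - criterion
  t_crit <- uniroot(f, interval = c(0.1, 20))$root
  # ...converted to the corresponding two-tailed p value
  2 * pt(t_crit, df = 2 * n - 2, lower.tail = FALSE)
}

p_for_BF(3, 10)                       # should be near 0.022
p_for_BF(10, 1000)                    # should be near 0.001
p_for_BF(10, 1000, rscale = sqrt(2))  # ultrawide prior, near 0.0005
```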
Bayesian meta-analysis • Rouder & Morey (2011) showed how to combine replication studies to produce a JZS Bayes Factor that accumulates the information across experiments • For M one-sample, one-tailed t-tests, the formula for BF10 is

$$ BF_{10} = \frac{\int_0^{\infty} \left[ \prod_{i=1}^{M} g\!\left(t_i;\, \nu_i,\, \sqrt{n_i}\,\delta\right) \right] f(\delta)\, d\delta}{\prod_{i=1}^{M} g\!\left(t_i;\, \nu_i,\, 0\right)} $$

• f( ) is the Cauchy (for the one-tailed case, half-Cauchy) prior on the standardized effect size δ • g( ) is the non-central t density with νi = ni − 1 degrees of freedom and non-centrality √ni δ • It looks complicated, but it is easy enough to calculate
Bayesian meta-analysis • Consider the null results on face emotion and memorability • Experiment 1: t(11) = -1.74, p = 0.110 • Experiment 3: t(11) = -1.62, p = .133 • Experiment 4: t(13) = -1.37, p = .195 • Combined, they give substantial support for the alternative! > tvalues<-c(-1.74, -1.62, -1.37) > nvalues<-c(12, 12, 14) > meta.ttestBF(t=tvalues, n1=nvalues) Bayes factor analysis -------------- [1] Alt., r=0.707 : 4.414733 ±0% Against denominator: Null, d = 0 --- Bayes factor type: BFmetat, JZS
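The claim that the formula is "easy enough to calculate" can be illustrated directly. The sketch below (my own, not from the slides) computes the meta-analytic Bayes factor by numerical integration, using the two-sided Cauchy prior that meta.ttestBF() uses by default, and should land close to the packaged value of about 4.41:

```r
# Rouder & Morey (2011)-style meta-analytic Bayes factor by direct integration.
tvalues <- c(-1.74, -1.62, -1.37)
nvalues <- c(12, 12, 14)
r <- sqrt(2) / 2   # default "medium" scale on the effect size delta

# Marginal likelihood under H1: average the product of non-central t
# densities over the Cauchy prior on delta.
numerator_integrand <- function(delta) {
  sapply(delta, function(d) {
    prod(dt(tvalues, df = nvalues - 1, ncp = d * sqrt(nvalues))) *
      dcauchy(d, location = 0, scale = r)
  })
}
numer <- integrate(numerator_integrand, -Inf, Inf)$value

# Likelihood under H0: product of central t densities.
denom <- prod(dt(tvalues, df = nvalues - 1))

numer / denom   # should be close to the meta.ttestBF value (about 4.4)
```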
Equivalent statistics • Bayes Factors are not magic, and they use the very same information as other approaches to statistical inference • Consider a variety of statistics for different inferential methods • Standardized effect size (Cohen’s d, Hedges’ g) • Confidence interval for d or g • JZS Bayes Factor • Akaike Information Criterion (AIC) • Bayesian Information Criterion (BIC)
Equivalent statistics • For a two-sample t-test with known sample sizes n1 and n2, all of these statistics are mathematically equivalent to each other • Given one statistic, you can compute all the others • You should use the statistic that is appropriate for the inference you want to make
Equivalent statistics • Each of these statistics is a “sufficient statistic” for the population effect size δ • A data set provides an estimate d of the population effect size • It is “sufficient” because knowing the whole data set provides no more information about δ than just knowing d
Equivalent statistics: d, t, p • Any invertible transformation of a sufficient statistic is also sufficient • For example, for a two-sample t-test,

$$ d = t \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} $$

so d and t carry the same information • Similarly, a t value (with its degrees of freedom) corresponds to a unique p value
Equivalent statistics: CIs • The variance of Cohen’s d is a function of only sample size and d • This means that if you know d and the sample sizes, you can compute either limit of a confidence interval of d • If you know either limit of a confidence interval of d you can also compute d • You get no more information about the data set by reporting a confidence interval of d than by reporting a p value
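A small sketch of these conversions (my own; it uses one common large-sample approximation to the variance of d, not necessarily the exact method behind the slide):

```r
# Convert a two-sample t value into d, an approximate 95% CI for d, and p.
# Var(d) here uses the common large-sample approximation
# (n1 + n2)/(n1*n2) + d^2 / (2*(n1 + n2)).
d_from_t <- function(t, n1, n2) {
  d  <- t * sqrt(1/n1 + 1/n2)
  se <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))
  p  <- 2 * pt(abs(t), df = n1 + n2 - 2, lower.tail = FALSE)
  c(d = d, ci_lower = d - 1.96 * se, ci_upper = d + 1.96 * se, p = p)
}

d_from_t(t = 2.046, n1 = 250, n2 = 250)   # d is about 0.183, p about 0.04
```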
Equivalent statistics: Likelihood • Many statistics are based on likelihood • Essentially the “probability” of the observed data, given a specific model (not quite a probability, because a specific value of a continuous variable has probability zero, so it is a product of probability density function values) • For a two-sample t-test, the alternative hypothesis (full model) is that a score from group s (1 or 2) is

$$ Y_{is} = \mu_s + \epsilon_{is}, \qquad \epsilon_{is} \sim N(0, \sigma^2) $$

with different means for each group s • The likelihood for the full model is then

$$ L_F = \prod_{s=1}^{2} \prod_{i=1}^{n_s} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(Y_{is} - \mu_s)^2}{2\sigma^2} \right) $$
Equivalent statistics: Likelihood • For a two-sample t-test, the null hypothesis (reduced model) is that a score from group s (1 or 2) is

$$ Y_{is} = \mu + \epsilon_{is}, \qquad \epsilon_{is} \sim N(0, \sigma^2) $$

with the same mean μ for each group s • These calculations always use estimates of the mean(s) and standard deviation that maximize the likelihood value for that model
Equivalent statistics: Likelihood • Compare the full (alternative) model against the reduced (null) model with the log likelihood ratio

$$ \Lambda = \ln\!\left( \frac{L_F}{L_R} \right) $$

• Because the reduced model is a special case of the full model and both are fit by maximum likelihood, LF ≥ LR, so Λ ≥ 0 • If Λ is sufficiently big, you can argue that the full model is better than the reduced model • This is a likelihood ratio test
Equivalent statistics: t, Likelihood • No new information here • Let n = n1 + n2 • Then

$$ \Lambda = \frac{n}{2} \ln\!\left( 1 + \frac{t^2}{n - 2} \right) $$

so Λ is just an invertible transformation of t (given the sample sizes)
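A quick simulation-based check of that relation (my own sketch, not part of the slides), comparing the directly computed log likelihood ratio with the value obtained from t:

```r
# Verify Lambda = (n/2) * log(1 + t^2 / (n - 2)) for a two-sample t-test,
# using maximum-likelihood (divide-by-n) variance estimates for both models.
set.seed(1)
x <- rnorm(12, mean = 0.4)
y <- rnorm(12, mean = 0)
scores <- c(x, y)
n <- length(scores)

t_stat <- unname(t.test(x, y, var.equal = TRUE)$statistic)

group_means <- c(rep(mean(x), length(x)), rep(mean(y), length(y)))
sse_full    <- sum((scores - group_means)^2)
sse_reduced <- sum((scores - mean(scores))^2)

logL_full    <- sum(dnorm(scores, mean = group_means, sd = sqrt(sse_full / n), log = TRUE))
logL_reduced <- sum(dnorm(scores, mean = mean(scores), sd = sqrt(sse_reduced / n), log = TRUE))

c(direct = logL_full - logL_reduced,
  from_t = (n / 2) * log(1 + t_stat^2 / (n - 2)))   # the two values should match
```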
Equivalent statistics: AIC • As we saw earlier, just adding complexity to a model will make its claims unreplicable • The model ends up “explaining” random noise • The model will poorly predict future random samples • A better approach is to adjust the likelihood to take the complexity of the model into account • Models are penalized for their number of free parameters k • Akaike Information Criterion (AIC):

$$ AIC = 2k - 2\ln(L) $$

• Smaller (more negative) values are better
Equivalent statistics: AIC • For a two-sample t-test, we can compare the full (alternative, 3 parameters) model and the reduced (null, 2 parameters) model with

$$ \Delta AIC = AIC_R - AIC_F = 2\Lambda - 2 = n \ln\!\left( 1 + \frac{t^2}{n - 2} \right) - 2 $$

• When ΔAIC > 0, choose the full model • When ΔAIC < 0, choose the null model
Equivalent statistics: AIC • For small sample sizes, you will do better with a “corrected” formula

$$ AIC_c = AIC + \frac{2k(k+1)}{n - k - 1} $$

• So, for a two-sample t-test,

$$ \Delta AIC_c = \Delta AIC + \frac{12}{n - 3} - \frac{24}{n - 4} $$

• When ΔAICc > 0, choose the full model • When ΔAICc < 0, choose the null model • The chosen model is expected to do the better job of predicting future data • This does not mean it will do a “good” job; maybe both models are bad
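As a sketch (mine, following the formulas above, with function names of my own choosing), these quantities are easy to compute directly from t and the two group sizes:

```r
# Delta-AIC and small-sample-corrected Delta-AICc for a two-sample t-test
# (positive values favor the full, two-mean model).
delta_aic <- function(t, n1, n2) {
  n <- n1 + n2
  lambda <- (n / 2) * log(1 + t^2 / (n - 2))  # log likelihood ratio
  2 * lambda - 2                              # full model has one extra parameter
}

delta_aicc <- function(t, n1, n2) {
  n <- n1 + n2
  # correction 2k(k+1)/(n-k-1) with k = 2 (reduced) and k = 3 (full)
  delta_aic(t, n1, n2) + 12 / (n - 3) - 24 / (n - 4)
}

delta_aicc(t = 2.1, n1 = 20, n2 = 20)   # positive here: the full model is preferred
```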
Equivalent statistics: AIC • Model selection based on AIC is appropriate when you want to predict future data, but you do not have a lot of confidence that you have an appropriate model • You expect the model to change with future data • Perhaps guided by the current model • To me, this feels like a lot of research in experimental psychology • The calculations are based on the very same information in a data set as the t-value, d-value, and p-value
Equivalent statistics: AIC • Inference based on AIC is actually more lenient than the traditional criterion for p-values
Equivalent statistics: BIC • Decisions based on AIC are not guaranteed to pick the “correct” model • An alternative complexity correction does better in this regard • Bayesian Information Criterion:

$$ BIC = k \ln(n) - 2\ln(L) $$

• For a two-sample t-test,

$$ \Delta BIC = BIC_R - BIC_F = 2\Lambda - \ln(n) $$

• When ΔBIC > 0, choose the full model • When ΔBIC < 0, choose the null model
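A one-line sketch of that computation (my own; the function name is mine), following the same convention that positive values favor the full model:

```r
# Delta-BIC for a two-sample t-test: 2*Lambda - log(n), where n = n1 + n2.
delta_bic <- function(t, n1, n2) {
  n <- n1 + n2
  n * log(1 + t^2 / (n - 2)) - log(n)
}

delta_bic(t = 2.1, n1 = 20, n2 = 20)   # positive: (weak) evidence for the full model
```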
Equivalent statistics: BIC • Inference based on BIC is much more stringent than the traditional criterion for p-values
Equivalent statistics: JZS BF • AIC and BIC use the “best” (maximum likelihood) model parameters • A fully Bayesian approach is to average the likelihood across plausible parameter values • Requires a prior probability density function • Compute the ratio of average likelihoods for the full (alternative) and reduced (null) models • Bayes Factor • The JZS prior is a Cauchy prior on the standardized effect size • Its Bayes Factor is simply a function of t and the sample sizes • It contains no more information about the data set than a p-value
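A sketch (my own, not from the slides) of that averaging for the two-sample case, which places a Cauchy prior on the standardized effect size and uses non-central t likelihoods; it should agree closely with ttest.tstat() from the BayesFactor package:

```r
# JZS-style Bayes factor for a two-sample t-test by direct numerical integration:
# average the non-central t likelihood over a Cauchy prior on the effect size.
library(BayesFactor)

jzs_bf <- function(t, n1, n2, r = sqrt(2) / 2) {
  df    <- n1 + n2 - 2
  n_eff <- n1 * n2 / (n1 + n2)   # effective sample size for the noncentrality
  num <- integrate(function(delta)
           sapply(delta, function(d)
             dt(t, df, ncp = d * sqrt(n_eff)) * dcauchy(d, 0, r)),
           -Inf, Inf)$value
  num / dt(t, df)                # marginal likelihood ratio: alternative / null
}

jzs_bf(t = 2.5, n1 = 20, n2 = 20)                        # by integration
ttest.tstat(t = 2.5, n1 = 20, n2 = 20, simple = TRUE)    # packaged version
```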
Equivalent statistics: JZS BF • Inference based on the JZS Bayes Factor is much more stringent than the traditional criterion for p-values
Equivalent statistics: JZS BF • Model selection based on BIC or the JZS Bayes Factor is guaranteed (as the sample size grows) to select the “true” model, if it is among the models being tested • So, if you think you understand a situation well enough that you can identify plausible “true” models, then the BIC or Bayes Factor process is a good choice for identifying the true model
Equivalent statistics • I created a web site to do the conversions between statistics • http://psych.purdue.edu/~gfrancis/EquivalentStatistics/ • Also computes other relevant statistics (e.g., post hoc power)
Equivalent statistics • The various statistics are equivalent, but that does not mean you should report whatever you want • It means you should think very carefully about your analysis • Do you want to predict future data? • Do you think you can identify the “true” model? • Do you want to control the Type I error rate? • Do you want to estimate the effect size? • You also need to think carefully about whether you can satisfy the requirements of the inference • Can you avoid optional stopping in data collection? • Is your prior informative?
What should we do? • The first step is to identify what you want to do • Not as easy as it seems • “Produce a significant result” is not an appropriate answer • Your options are basically: • 1) Control Type I error: [identify an appropriate sample size and fix it; identify the appropriate analyses and adjust the significance criterion appropriately; do not include data from any other studies (past or future)] • 2) Estimate an effect size: [sample until you have a precise enough measurement; have to figure out what “precise enough” means; explore/describe the data without drawing conclusions] • 3) Find the “true” model: [sample until the Bayes Factor provides overwhelming evidence for one model versus other models; have to identify prior distributions of “belief” in those models; have to believe that the true model is among the set being considered] • 4) Find the model that best predicts future data: [machine learning techniques such as cross-validation; information criteria; be willing to accept that your current model is probably wrong]
Equivalent statistics • Common statistics are equivalent with regard to the information in the data set • But no method of statistical inference is appropriate for every situation • The choice of what to do can give radically different answers to seemingly similar questions • n1=n2=250, d=0.183 • p=0.04 • ΔBIC = -2.03 (evidence for the null) • ΔAICc = 2.16 (the full model is expected to predict future data better than the null model) • JZS Bayes Factor = 0.755 (weak evidence that slightly favors the null model)
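These four numbers can be reproduced from d and the group sizes alone, using the relations from the earlier slides (a sketch of mine; the JZS value relies on the BayesFactor package):

```r
# Reproduce the slide's example: p, Delta-BIC, Delta-AICc, and JZS BF10
# from d = 0.183 and n1 = n2 = 250.
library(BayesFactor)

n1 <- 250; n2 <- 250; d <- 0.183
n  <- n1 + n2
t  <- d / sqrt(1/n1 + 1/n2)                      # about 2.05
lambda <- (n / 2) * log(1 + t^2 / (n - 2))       # log likelihood ratio

2 * pt(t, df = n - 2, lower.tail = FALSE)        # p, about 0.04
2 * lambda - log(n)                              # Delta-BIC, about -2.03
(2 * lambda - 2) + 12/(n - 3) - 24/(n - 4)       # Delta-AICc, about 2.16
ttest.tstat(t = t, n1 = n1, n2 = n2, simple = TRUE)  # JZS BF10, about 0.755
```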
What should we do? • Do you even need to make a decision? (choose a model, reject a null) • Oftentimes the decision of a hypothesis test is really just a description of the data • When you make a decision you need to consider the context (weigh probabilities and utilities) • For example, suppose a teacher needs to improve mean reading scores by 7 points for a class of 30 students • Approach A (compared to current method): mean improvement = 6, s = 5, d = 1.2 • Approach B (compared to current method): mean improvement = 5, s = 50, d = 0.1 • A: P(Mean > 7) = 0.14 • B: P(Mean > 7) = 0.41
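A sketch of that calculation (mine; it assumes the mean improvements implied by d × s, i.e., 6 points for Approach A and 5 points for Approach B, and treats the class mean as approximately normal):

```r
# Probability that the class mean improvement (n = 30) exceeds 7 points,
# assuming mean improvements of 6 (Approach A) and 5 (Approach B).
n <- 30
p_exceed <- function(mean_improve, s, target = 7) {
  pnorm(target, mean = mean_improve, sd = s / sqrt(n), lower.tail = FALSE)
}

p_exceed(mean_improve = 6, s = 5)    # Approach A: about 0.14
p_exceed(mean_improve = 5, s = 50)   # Approach B: about 0.41
```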
Conclusions • These differences make sense because science involves many different activities at different stages of investigation • Discovery • Theorizing • Verification • Prediction • Testing • Bayes Factors fit into part (but not all) of these activities