Bruno Lecoutre C.N.R.S. et Université de Rouen E-mail: bruno.lecoutre@univ-rouen.fr

MaxEnt 2006 Twenty sixth International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering CNRS, Paris, France, July 9, 2006 And if you were a Bayesian without knowing it? Bruno Lecoutre C.N.R.S. et Université de Rouen E-mail: bruno.lecoutre@univ-rouen.fr Internet: http://www.univ-rouen.fr/LMRS/Persopage/Lecoutre/Eris Equipe Raisonnement Induction Statistique

Bayes, Thomas (b. 1702, London - d. 1761, Tunbridge Wells, Kent), mathematician who first used probability inductively and established a mathematical basis for probability inference (a means of calculating, from the number of times an event has not occured, the probability that it will occur in future trials)

Probability and Statistical Inference

(Mis)intepretations of p-values in Bayesian terms  Many statistical users misinterpret the p-values of significance tests as “inverse” probabilities: 1-p is “the probability that the alternative hypothesis is true”

(Mis)intepretations of confidence levels in Bayesian terms Frequentist interpretation of a 95% confidence interval:  In the long run 95% of computed confidence intervals will contain the “true value” of the parameter  Each interval in isolation has either a 0 or 100% probability of containing it This “correct” interpretation does not make sense for most users!  It is the interpretation in (Bayesian) terms of “a fixed interval having a 95% chance of including the true value of interest” which is the appealing feature of confidence intervals

(Mis)intepretations of frequentist procedures in Bayesian terms  Even experienced users and experts in statistics are not immune from conceptual confusions • “In these conditions [a p-value of 1/15], the odds of 14 to 1 • that this loss was caused by seeding [of clouds] • do not appear negligible to us” • Neyman et al., 1969  All the attempts to rectify these interpretations have been a loosing battle  Virtually all users interpret frequentist confidence intervals in a Bayesian fashion We ask themselves: “And if you were a Bayesian without knowing it?”

Two main definitions of probability (already in Bernoulli, 17th century)  The long-run frequency of occurrence of an event, either in a sequence of repeated trials or in an ensemble of “identically” prepared systems “Frequentist(“classical”,“orthodox”,“sampling theory”) conception • Seems to make probability an objective property, existing in the nature independently of us, that should be based on empirical frequencies A measure of the degree of belief (or confidence) in the occurrence of an event or more generally in a proposition The “Bayesian” conception • A much more generaldefinition: Ramsey, 1931; Savage, 1954; de Finetti, 1974 Jaynes, E.T. (2003) Probability Theory: The Logic of Science (Edited by G.L. Bretthorst) Cambridge, England: Cambridge University Press

 The Bayesian definition fits the meaning of the term probability in everyday language The Bayesian probability theory appears to be much more closely related to how people intuitively reason in the presence of uncertainty • “It is beyond any reasonable doubt that for most people, • probabilities about single events do make sense • even though this sense may be naïve and fall short from numerical accuracy” • Rouanet, in Rouanet et al., 2000, page 26

Frequentist approach Self-proclaimed “objective” contrary to the “Bayesian” inference that should be necessary “subjective” Bayesian approach The Bayesian definition can serve to describe “objective knowledge”, in particular based on symmetry arguments or on frequency data Bayesian statistical inference is no less objective than frequentist inference It is even the contrary in many contexts

Statistical Inference • Statistical inference is typically concerned with both known quantities - the observed data - and unknown quantities - the parameters and the data that have not been observed. • “The raw material of a statistical investigation is a set of observations; these are the values taken on by random variablesX whose distribution Pis at least partly unknown. • Lehmann, 1959

Frequentist inference All probabilities (in fact frequencies) are conditional on [unknown] parameters • Significance tests (parameter value fixed by hypothesis) • Confidence intervals Bayesian inference  Parameters can also be probabilized • Distributions of probabilities that express our uncertainty • before observations (does nor depend on data): prior probabilities • after observations (conditional on data): posterior (or revised) probabilities • also about future data: predictive probabilities

A simple illustrative situation A finite population of size N=20 With a dichotomous variable 1 (success) – 0 (failure) Proportion j of success Unknown parameter j = ? Known data 0 0 0 1 0 f = 1/5 A sample of size n=5 from this population has been observed

Inductive reasoning: generalisation from known to unknown Unknown parameter j = ? • In the frequentist framework: • no probabilities • no solution Known data 0 0 0 1 0 f = 1/5

Frequentist inference: from unknown to known Known data 0 1 0 0 0 f = 1/5 • no more solution Unknown parameter j = ?

Frequentist inference Data: 0 0 0 1 0 (f = 1/5 = 0.20) Imaginary repetitions of the observations f = 0/5 : 0.00006 f = 1/5 : 0.005 f = 2/5 : 0.068 f = 3/5 : 0.293 f = 4/5 : 0.440 f = 5/5 : 0.194 One sample have been observed out 15 503 possible samples Parameter Fixed value  Example: j = 15/20 = 0.75 Sampling probabilities = frequencies

Data 0 0 0 1 0 f = 0.20 Frequentist significance test Imaginary repetitions of the observations f = 0/5 : 0.00006 f = 1/5 : 0.005 f = 2/5 : 0.068 f = 3/5 : 0.293 f = 4/5 : 0.440 f = 5/5 : 0.194 Null hypothesis Example 1: j = 0.75 (15/20) Level: a = 0.05 0.995 If j = 0.75 one find in 99.5% of the repetitions a value f > 1/5 (greater than the observation f=0.20) The null hypothesis j = 0.75 is rejected (Significant: p = 0.00506)

However, this conclusion is based on the probability of the samples that have not been observed! “If P is small, that means that there have been unexpectedly large departures from prediction. But why should these be stated in terms of P? The latter gives the probability of departures, measured in a particular way, equal to or greaterthan the observed set, and the contribution from the actual value is nearly always negligible. What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure.”’ Jeffreys, 1961

Data 0 0 0 1 0 f = 0.20 Frequentist significance test Imaginary repetitions of the observations f = 0/5 : 0.00006 f = 1/5 : 0.005 f = 2/5 : 0.068 f = 3/5 : 0.293 f = 4/5 : 0.440 f = 5/5 : 0.194 Null hypothesis Example 2: j = 0.50 (10/20) Level: a = 0.05 0.848 If j = 0.50 one find in 84.8% of the repetitions A value f > 1/5 (greater than than the observation f =0.20) The null hypothesis j = 0.50 is not rejected (Non significant: p = 0.152) Obviously this does not prove that j = 0.50!

Data 0 0 0 1 0 f = 0.20 Frequentist confidence interval Set of possible values for j that are not rejected at level a Example a = 0.05 One get the “95% confidence” interval [0.05 , 0.60] How to interpret the “95% confidence”?

Interpretation of frequentist confidence? The frequentist interpretation is based on the universal statement: “Whatever the fixed value of the parameter is, in 95% (at least) of the repetitions the interval that should be computed includes this value”

A very strange interpretation: it does not involve the data in hand! It is at least unrealistic “Objection has sometimes been made that the method of calculating Confidence Limits by setting an assigned value such as 1% on the frequency of observing 3 or less (or at the other end of observing 3 or more) is unrealistic in treating the values less than 3, which have not been observed, in exactly the same manner as the value 3, which is the one that has been observed. This feature is indeed not very defensible save as an approximation.” Fisher, 1990/1973, page 71

Return to the inductive reasoning: Generalisation from known to unknown Set of all possible values of the unknown parameter j = 0/20, 1/20, 2/20… 20/20 Bayesian inference Probabilities that express our uncertainty (in addition to sampling probabilities) Known data 0 0 0 1 0 f = 1/5 “As long as we are uncertain about values of parameters, we will fall into the Bayesian camp” Iversen, 2000

Data 0 0 0 1 0 f = 1/5 Bayesian inference • All the frequentist probabilities associated with the data • Pr(f = 1/5 |j)  Likelihood function j = 0/20  0 j = 10/20  0.135 j = 1/20  0.250 j = 11/20  0.089 j = 2/20  0.395 j = 12/20  0.054 j = 3/20  0.461 j = 13/20  0.029 j = 4/20  0.470 j = 14/20  0.014 j = 5/20  0.440 j = 15/20  0.005 j = 6/20  0.387 j = 16/20  0.001 j = 7/20  0.323 j = 17/20  0 j = 8/20  0.255 j = 18/20  0 j = 9/20  0.192 j = 19/20  0 j = 20/20  0

Bayesian inference We assume prior probabilities Pr(j) (before observation) • joint probabilities (a simple product): Pr(jand f=1/5) = Pr(f=1/5 |j) × Pr(j) likelihood × prior probability • Predictive probabilities (sum of the joint probabilities) Pr(f=1/5) A weighted average of the likelihood function • posterior probabilities (A simple application of the definition of conditional probabilities) Pr(j | f=1/5) = Pr(jand f=1/5) / Pr(f) The normalized product of the prior and the likelihood

We can conclude with Berry “Bayesian statistics is difficult in the sense that thinking is difficult”Berry, 1997

Considerable difficulties with the frequentist approach

The mysterious and unrealistic use of the sampling distribution Frequent questions asked by students and statistical users “why one considers the probability of samples outcomes that are more extreme than the one observed?” “why must one calculate the probability of samples that have not been observed?” etc. No such difficulties with the Bayesian inference Involves the sampling probability of the data , via the likelihood function that writes the sampling distribution in the natural order : “from unknown to known”

Experts in statistics are not immune from conceptual confusions About confidence intervals A methodological paper by Rosnow and Rosenthal (1996) They take the example of an observed difference between two means d=+0.266 They consider the interval [0,+532] whose bounds are the “null hypothesis” (0) and what they call the “counternul value” (2d=+0.532), the symmetrical value of 0 with regard to d They interpret this specific interval [0,+532] as “a 77% confidence interval” (0.77=1-2×0.115, where 0.115 is the one-sided p-value for the usual t test) Clearly, 0.77 is here a data dependent probability, which needs a Bayesian approach to be correctly interpreted

Experimental research and statistical inference: A paradoxical situation • Null Hypothesis Significance Testing (NHST) • An unavoidable norm in most scientific publications • Often appears as a label of scientificness BUT • Innumerable misinterpretations and misuses • Use explicitly denounced by the most eminent and most experienced scientists “The test provides neither the necessary nor the sufficient scope or type of knowledge that basic scientific social research requires” Morrison & Henkel, 1969

Today is a crucial time Users' uneasiness is ever growing In all fields necessity of changes in reporting experimental results  routinely report effect size indicators  and their interval estimates in addition to or in place of the results of NHST

Common misinterpretations of NHST Emphasized by empirical studies Rosenthal & Gaito, 1963; Nelson, Rosenthal & Rosnow, 1986; Oakes, 1986; Zuckerman, Hodgins, Zuckerman & Rosenthal, 1993; Falk & Greenbaum, 1995; Mittag & Thompson, 2000; Gordon, 2001; M.-P. Lecoutre (2000), B. Lecoutre, M.-P. Lecoutre & Poitevineau, 2001 Shared by most methodology instructors Haller & Krauss, 2001 • Professional applied statisticians are not immune to misinterpretations M.-P. Lecoutre, Poitevineau & B.Lecoutre (2003) - Even statisticians are not immune to misinterpretations of Null Hypothesis Significance Tests. International Journal of Psychology, 38, 37-45

Why these misinterpretations? • An individual's lack of mastery? This explanation is hardly applicable to professional statisticians • “Judgmental adjustments” or “adaptative distorsions”' (M.-P. Lecoutre, in Rouanet et al., 2000, page 74) designed to make an ill-suited tool fit their true needs • Examples: • - Confusion between “statistical significance” and “scientific significance” • - Improper uses of nonsignificant results as “proof of the null hypothesis” • - “Incorrect” (“non frequentist”) interpretations of p-values as inverse probabilities

NHST does not address questions that are of primary interest for the scientific research • This suggests that • “users really want to make a different kind of inference” Robinson & Wainer, 2002, page 270

A more or less “naïve” mixture of NHST results and other information • The task of statisticians in pharmaceutical companies “Actually, what an experienced statistician does when looking at p-values is to combine them with information on sample size, null hypothesis, test statistic, and so forth to form in his mind something that is pretty much like a Confidence interval to be able to interpret the p-values in a reasonable way” Schmidt, 1995, page 490 BUT this is not an easy task!

A set of recipes and rituals • Many attempts to remedy the inadequacy of usual significance tests See for instance: the “Task Force” of the American Psychological Association (Wilkinson et al. 1999) • They are both partially technically redundant and conceptually incoherent • They do not supply real statistical thinking “We need statistical thinking, not rituals” Gigerenzer, 1998

Confidence intervals could quickly become a compulsory norm in experimental publications In practice two probabilities can be routinely associated with a specific interval estimate computed from a particular sample • The first probability is “the proportion of repeated intervals that contain the parameter” It is usually termed the coverage probability • The second is the Bayesian “posterior probability that this interval contains the parameter” (given the data in hand), assuming a noninformative prior distribution In the frequentist approach, it is forbidden to use the second probability In the Bayesian approach, the two probabilities are valid Moreover, an “objective Bayes” interval is often “a great frequentist procedure” (Berger, 2004)

The debates can be expressed on these terms: “whether the probabilities should only refer to data and be based on frequency or whether they should also apply to parameters and be regarded as measures of beliefs”

The ambivalence of statistical instructors It is undoubtedly the natural (Bayesian) interpretation “a fixed interval having a 95% chance of including the true value of interest” that is the appealing feature of confidence intervals Most statistical instructors tolerate and even use this heretic interpretation

The ambivalence of statistical instructors • In a popular statistical textbook (whose objective is to allow the reader “accessing the deep intuitions in the field”), one can found the following interpretation of the confidence interval for a proportion: “Si dans un sondage de taille 1000, on trouve P [la proportion observée] = 0.613, la proportion p1à estimer a une probabilité 0.95 de se trouver dans la fourchette: [0.58,0.64]” “If in a public opinion poll of size 1000, one find P [the observed proportion] = 0.613, the proportionp1 to be estimated has a 0.95 probability to be in the range: [0.58,0.64]'' Claudine Robert, 1995, page 221

The ambivalence of statistical instructors • In an other book that claims the goal of understanding statistics, a 95% confidence interval is described as “an interval such that the probability is 0.95 that the interval contains the population value]” Pagano, 1990, page 228

The ambivalence of statistical instructors “It would not be scientifically sound to justify a procedure by frequentist arguments and to interpret it in Bayesian terms” Rouanet, 2000

Other authors claim that the “correct” frequentist interpretation they advocate can be expressed as : “We can be 95% confident that the population mean is between 114.06 and 119.94” Kirk, 1982 “We may claim 95% confidence that the population value of multiple R2 is no lower than 0.266” Smithson, 2001  Hard to imagine that readers can understand that “confident” refers here to a frequentist view of probability! We will distinguish between probability as frequency, termed probability, and probability as information/uncertainty, termed confidence” Schweder & Hjort (2002)

Teaching the frequentist interpretation: a losing battle “we are fighting a losing battle” Freeman, 1993

Most statistical users are likely to be Bayesian “without knowing it”! “It could be argued that since most physicians use statement A [the probability the true mean value is in the interval is 95%] to describe ‘confidence’ intervals, what they really want are ‘probability’ intervals. Since to get them they must use Bayesian methods, then they are really Bayesians at heart!” Grunkemeier & Payne, 2002

The Bayesian therapy

It is not acceptable that that future statistical inference methods users will continue using non appropriate procedures because they know no other alternative Since most people use “inverse probability” statements to interpret NHST and confidence intervals, the Bayesian definition of probability, conditional probabilities and Bayes’ formula are already - at least implicitly - involved in the use of frequentist methods Which is simply required by the Bayesian approach is a very natural shift of emphasis about these concepts, showing that they can be used consistently and appropriately in statistical analysis (Lecoutre, 2006)

A better understanding of frequentist procedures “Students [exposed to a Bayesian approach] come to understand the frequentist concepts of confidence intervals and P values better than do students exposed only to a frequentist approach Berry, 1997 Combining descriptive statistics and significance tests A basic situation: the inference about the difference d between two normal means Let us denote by d (assuming d≠0) the observed difference and by t the value of the Student's test statistic Assuming the usual non informative prior, the posterior for d is a generalized (or scaled) t distribution (with the same degrees of freedom As the t test), centered on d and with scale factor the ratio e=d/t (see e.g. Lecoutre, 2006)

Conceptual links Bayesian interpretation of the p-value The one-sided p-value of the t test is exactly the posterior Bayesian probability that the difference dhas the opposite sign of the observed difference If d>0, there is a p posterior probability of a negative difference and a 1-p complementary probability of a positive difference In the Bayesian framework these statements are statistically correct Bayesian interpretation of the confidence interval It becomes correct to say that “there is a 95% [for instance] probability of d being included between the fixed bounds of the interval” (conditionally on the data)

Some decisive advantages overcoming usual difficulties In this way, Bayesian methods allow users to overcome usual difficulties encountered with the frequentist approach “public use” statements The use of noninformative priors has a privileged status in order to gain “public use” statements Combining information when “good prior information is available” other Bayesian techniques also have an important role to play in experimental investigations

Bruno Lecoutre C.N.R.S. et Université de Rouen E-mail: bruno.lecoutre@univ-rouen.fr

Bruno Lecoutre C.N.R.S. et Université de Rouen E-mail: bruno.lecoutre@univ-rouen.fr

Presentation Transcript

Inferring compositional style in the neo-plastic paintings of Piet Mondrian by machine learning

Craig Roberts Physics Division

Pricing Financial Derivatives

Árvores B

The GENIUS Grid Portal

H388 Presentations 11/28/06