Bayes or Bust?
Keith Arnaud

"Thou shalt not answer questionnaires
Or quizzes upon world affairs,
Nor with compliance
Take any test. Thou shalt not sit
With statisticians nor commit
A social science."
-- W.H. Auden

(with thanks to Eric Feigelson and Bill Press)
"As an undergraduate, I always found the subject of statistics to be rather mysterious. I was already familiar with the binomial, Poisson and normal distributions. Most of this made sense, but only seemed to relate to things like rolling dice, flipping coins, shuffling cards and so on. However, having aspirations of becoming a scientist, what I really wanted to know was how to analyse experimental data. Thus, I eagerly looked forward to the lectures on statistics. Sadly, they were a great disappointment. Although many of the tests and procedures expounded were intuitively reasonable, there was something deeply unsatisfactory about the whole affair; there didn't seem to be any underlying principles ! Hence, the course on 'probability and statistics' had led to an unfortunate dichotomy: probability made sense, but was just a game; statistics was important, but it was a bewildering collection of tests with little obvious rhyme or reason." - D.S. Sivia (Data Analysis: A Bayesian Tutorial)
Most astronomers and physicists don't understand statistics. Suppose you derive a 90% confidence region on some parameter of interest. What does this mean? Does it tell you that there is a probability of 0.9 that the true value of the parameter lies in the range calculated?
[Diagram: Cause → Effects by deductive logic ("Probability"); Effects → Possible causes by inference ("Statistics")]
James Bernoulli (1713) asked whether the rules of probability (then being worked out to calculate odds in gambling) could also be used in the problem of inference or inductive logic. Rev. Thomas Bayes answered this in the affirmative in an essay published posthumously in 1763. Pierre-Simon Laplace reformulated Bayes' theorem in 1812. He expressed it with far greater clarity than Bayes and used it to solve problems in celestial mechanics and medical statistics. Laplace combined the available astronomical data to provide an estimate (and uncertainty) on the mass of Saturn. He stated "...it is a bet of 11,000 to 1 that the error in this result is not 1/100th of its value". (The modern estimate differs from Laplace's by 0.63%)
Bayes and Laplace thought of probability as measuring a degree of belief in something. They used the language of odds and betting in their arguments. However, with the push in the 19th and early 20th centuries towards more rigorous arguments in mathematics, this view was considered too subjective and a new definition of probability was developed. Probability was defined solely in terms of the long-term frequency of events. Thus to say that the probability of rolling a 4 with a die is 1/6 is to mean that repeated rolling of the die will give a 4 one sixth of the time. This definition works fine for rolling dice and tossing coins but where does it leave Laplace's measurement of the mass of Saturn ? Laplace derived a probability distribution for the mass of Saturn. But there is only one planet - it makes no sense to talk about its mass in terms of frequencies. One could perhaps consider an ensemble of Universes where the only thing that changes is the mass of Saturn but this seems very contrived.
The solution arrived at was to relate the mass to the data through some function. Since the data have random noise this function can be considered a random variable, so we can talk about its probability distribution. The function is called a statistic. This definition of statistics is often referred to as frequentist or orthodox. A frequentist statistic is best thought of as a predictor of future events. If you talk to a good orthodox statistician they will refer to what they do as prediction, not as inference. The central problem of orthodox statistics is how to choose the function of the data so as to provide a good predictor. This was the work of Fisher, Neyman, Pearson, ... who gave us the statistics we know and love.
We can now return to the definition of a 90% confidence region. It does not express the probability of the true value of the parameter being within the confidence region because orthodox statistics does not allow us to talk about the probability of a parameter. The parameter has a single true value (which we don't know) and is not a random variable. Instead, the confidence region is a predictor of future events. It tells us that if we make many more identical observations then 90% of the time the confidence region will contain the true value of the parameter. Hypothesis testing must be thought of in a similar fashion. For instance, we are not testing whether the true value of some parameter is non-zero but are instead making a statement about future observations.
In addition to the problems of interpretation, orthodox statistics presents practical difficulties. As we all know from personal experience, there is a bewildering array of different statistics to choose from. These problems have led to a revival in recent years of the Bayesian and Laplacian idea of probability. Richard Cox showed in the 1940s that if we assign numbers to express our relative belief in various propositions then logical consistency can only be achieved if these numbers map to a set of real positive numbers that obey the usual rules of probability theory. Philosophers of science continue to argue about the details of Bayesian inference (see e.g. "Bayes or Bust?" by John Earman). Practical Bayesian inference was pioneered by scientists such as Harold Jeffreys and Ed Jaynes but has now established bridgeheads inside statistics departments. The Bayesians are still in the minority but are gaining in numbers and influence. Interestingly, recent work in cognitive science indicates that our minds may well operate by Bayesian inference (see e.g. "How the Mind Works" by Steven Pinker) – a possibility advanced by Jaynes over 40 years ago.
Bayes' theorem: p(H|D,I) = p(D|H,I) p(H|I) / p(D|I), where p(H|D,I) is the posterior, p(D|H,I) the likelihood, p(H|I) the prior and p(D|I) the normalizing evidence. The posterior probability distribution contains all the information available after the data have been acquired. As such it is the starting point for any further inference.
Is This Coin Fair? Suppose we have a coin and we wish to find out whether or not it is fair. Let H be a number between 0 and 1 such that if the coin has H=0 then it will always produce Tails when tossed and if H=1 then Heads. We now toss the coin N times and produce n Heads. By Bayes' theorem the posterior probability distribution for H is proportional to: p(H|I) H^n (1-H)^(N-n), where p(H|I) is the prior probability distribution for H.
[Figure: Heads / Tails]
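A minimal numerical sketch of this posterior (assuming a flat prior p(H|I) on [0, 1]; the function name and the use of numpy are illustrative, not part of the original talk):

```python
import numpy as np

def coin_posterior(n_heads, n_tosses, grid_size=1001):
    """Posterior for the coin bias H on a grid, assuming a flat prior."""
    H = np.linspace(0.0, 1.0, grid_size)
    prior = np.ones_like(H)                                # p(H|I): uniform on [0, 1]
    log_like = (n_heads * np.log(H + 1e-300)
                + (n_tosses - n_heads) * np.log(1.0 - H + 1e-300))
    post = prior * np.exp(log_like - log_like.max())       # unnormalized posterior
    post /= np.trapz(post, H)                              # normalize to integrate to 1
    return H, post

# e.g. 7 Heads in 10 tosses
H, post = coin_posterior(n_heads=7, n_tosses=10)
print("posterior mean of H:", np.trapz(H * post, H))
```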
X-ray or gamma-ray spectroscopy. Suppose we have a model with M parameters {mj}. Then we are interested in the probability distribution p({mj} | {Si}, {Bi}, I), where {Si} and {Bi} are the source and background observations (over N channels, i = 1, 2, ..., N) and I is the prior information. However, we also have an unknown background, which we can model as a set of N parameters {bi}. Our probability distribution is thus the marginalization over these N nuisance parameters, i.e. the N-dimensional integral over p({mj}, {bi} | {Si}, {Bi}, I). Using Bayes' theorem and the independence of the model and background parameters we can rewrite the integrand as proportional to
p({Si} | {mj}, {bi}, I) p({Bi} | {bi}, I) p({mj} | I) p({bi} | I). If the observed counts in each channel in the source and background are independent and distributed as Poisson variables then the first two terms in this integrand can be expressed as products of Poisson probabilities. Assuming that we have no prior information on the background we can represent it as a uniform distribution between 0 and some maximum. After considerable algebraic manipulation (originally due to Tom Loredo) we can show that the posterior probability distribution for the model parameters is proportional to p({mj} | I) f({mj}, {Si}, {Bi}), where f is a complicated (but analytic) function of the model parameters and the observables.
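A brute-force numerical sketch of the same marginalization (not Loredo's analytic result; the function, its arguments and the flat background prior up to b_max are assumptions made here for illustration):

```python
import numpy as np
from scipy.stats import poisson

def log_marginal_likelihood(model_counts, src_counts, bkg_counts,
                            bkg_scale=1.0, b_max=100.0, n_grid=2000):
    """Log-likelihood with each background rate bi integrated out numerically
    under a flat prior on [0, b_max].  Because the channels are independent,
    the N-dimensional integral factorizes into N one-dimensional integrals."""
    b = np.linspace(0.0, b_max, n_grid)
    logL = 0.0
    for m_i, S_i, B_i in zip(model_counts, src_counts, bkg_counts):
        # Poisson terms for the source (model + background) and the
        # background-only observation, evaluated on a grid of bi values.
        integrand = poisson.pmf(S_i, m_i + b) * poisson.pmf(B_i, bkg_scale * b)
        logL += np.log(np.trapz(integrand, b) / b_max)     # flat prior = 1/b_max
    return logL
```

Multiplying exp(logL) by the prior p({mj} | I) then gives the (unnormalized) posterior for the model parameters.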
Computational Issues. The end result of a Bayesian analysis will probably be a multi-dimensional probability density function. We will need to characterize its shape and/or integrate (marginalize) over some dimensions. Both these processes are very computationally intensive. The major practical issue in Bayesian inference is how to perform the multi-dimensional integrals required. There are a number of packages available and they do work well for a limited class of problems. However, a more general approach is necessary. The most promising method at present seems to be Markov Chain Monte Carlo. A Markov Chain is a sequence of random variables {X0, X1, X2, ...} such that each state Xt+1 is sampled from a distribution p(Xt+1|Xt) which depends only on the current state Xt. The fundamental theorem of Markov Chains shows that for large enough t the Xt are drawn from a stationary distribution which is independent of t and of the starting point of the chain.
The MCMC method. The MCMC method consists of setting up a Markov Chain such that the stationary distribution is the posterior probability distribution of interest. The Markov Chain values then provide a sampling of the probability distribution which we can use for integration or characterization. Constructing such Markov Chains turns out to be remarkably simple. The method was first developed in 1953 by Metropolis et al. (in the context of statistical physics) and generalized in 1970 by Hastings. Suppose that our posterior distribution is p(X) and we are at Xt in the chain. We sample a candidate point Y from a proposal distribution q(Y|Xt). We then accept Y with probability min(1, p(Y) q(Xt|Y) / [p(Xt) q(Y|Xt)]). If the candidate is accepted we set Xt+1 = Y, otherwise Xt+1 = Xt. Remarkably, q can be any distribution and the stationary distribution of the chain will still be p. However, it should be chosen so that the chain converges quickly (a short "burn-in") and mixes well, i.e. it samples all parts of the distribution p. There are a number of canonical choices for q and this is an active area of research in the statistical community.
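A minimal sketch of this recipe with a symmetric Gaussian random-walk proposal (so the q terms cancel and the acceptance probability reduces to min(1, p(Y)/p(Xt)); the target used below is just an illustrative stand-in for a real log-posterior):

```python
import numpy as np

def metropolis(log_post, x0, n_steps=10000, step=0.5, seed=0):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    logp = log_post(x)
    chain = np.empty((n_steps, x.size))
    for t in range(n_steps):
        y = x + step * rng.standard_normal(x.size)    # candidate Y ~ q(Y|Xt)
        logp_y = log_post(y)
        if np.log(rng.uniform()) < logp_y - logp:     # accept with prob min(1, p(Y)/p(Xt))
            x, logp = y, logp_y
        chain[t] = x
    return chain

# Example: sample a 2-d Gaussian "posterior" and discard a burn-in period.
chain = metropolis(lambda x: -0.5 * np.sum(x**2), x0=[5.0, -5.0])
print(chain[2000:].mean(axis=0), chain[2000:].std(axis=0))
```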
Why not to use the F-test. A common problem in high energy astrophysics is to determine whether an emission (or absorption) line is present in a spectrum or whether a source is present in the field of view. A standard (frequentist) technique has been to use the F-test or, more generally, the likelihood ratio test (LRT). In a recent preprint, Protassov et al. point out that this is incorrect. The LRT is not valid if the test hypothesis is at the boundary of the parameter space, e.g. if we are testing for a non-zero value of the normalization of a spectral line. These authors go on to argue that the best approach to this problem is to derive the posterior distribution for the line normalization and focus on estimating its strength rather than making a statement on whether or not it exists. However, if a test is required one valid method is to use MCMC to generate N samples from the posterior distribution under the null hypothesis (i.e. no line). For each sample simulate data and compute an LRT statistic. The fraction of these fake datasets with a value of the statistic greater than that observed for the actual data gives the false positive probability.
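A schematic of that posterior-predictive calibration (simulate_spectrum, fit_null and fit_alternative are placeholders for the user's own simulation and fitting code, not functions from any particular package):

```python
import numpy as np

def lrt_false_positive_prob(posterior_samples_null, simulate_spectrum,
                            fit_null, fit_alternative, lrt_observed):
    """Posterior-predictive calibration of the likelihood ratio test (LRT).

    posterior_samples_null : parameter draws (e.g. from MCMC) under the null
        hypothesis (no line); each row is one set of parameters.
    simulate_spectrum, fit_null, fit_alternative : user-supplied placeholders
        that fake a dataset and return the maximum log-likelihood under the
        null and alternative hypotheses respectively.
    """
    lrt_fake = []
    for params in posterior_samples_null:
        fake = simulate_spectrum(params)                    # fake data under the null
        lrt_fake.append(2.0 * (fit_alternative(fake) - fit_null(fake)))
    # Fraction of fake datasets with an LRT at least as large as the real one.
    return np.mean(np.asarray(lrt_fake) >= lrt_observed)
```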
Conclusions. Most astronomers are implicit Bayesians. However, most statistical inference in the literature is frequentist. This is problematic, especially when the frequentist methods are used outside their range of validity. Bayesian methods are conceptually much simpler and provide the sort of information that astronomers expect. Their drawbacks are practical. There is the problem of assigning prior probability distributions. They also typically require more computation, particularly when marginalizing over nuisance parameters. Markov Chain Monte Carlo looks to be the most promising numerical technique. Getting Bayesian methods widely used by astronomers will require the availability of fast, reliable, user-friendly software.
References
Sivia, D.S., 1996, "Data Analysis: A Bayesian Tutorial", Clarendon Press
O'Hagan, A., 1994, "Kendall's Advanced Theory of Statistics: Volume 2B, Bayesian Inference", Wiley
Gelman, A., et al., 1995, "Bayesian Data Analysis", Chapman & Hall
Gilks, W.R., et al., 1996, "Markov Chain Monte Carlo in Practice", Chapman & Hall
Protassov et al., 2002, astro-ph/0201547
http://astrosun.tn.cornell.edu/staff/loredo/bayes