- Anandamayee Majumdar Visiting Scientist, University of North Texas

Lecture 1: Bayesian Inference and Data Analysis. Department of Statistics, Rajshahi University, Rajshahi. Anandamayee Majumdar, Visiting Scientist, University of North Texas School of Public Health, USA; Professor, University of Suzhou, China.

Overview • Applications • Introduction • Steps and Components • Motivation • Bayes Rule • Probability as a Measure of Certainty • Simulation from a distribution using inverse CDF • A one parameter model example • Binomial example approached by Bayes and Laplace.

Applications to Computer Science • Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition since the late 1950s. • Recently Bayesian inference has gained popularity in the phylogenetics community, where a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously. In population genetics and dynamical systems theory, approximate Bayesian computation (ABC) is also becoming increasingly popular. • As applied to statistical classification, Bayesian inference has been used in recent years to develop algorithms for identifying e-mail spam.

Application to the Court Room • Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for 'beyond a reasonable doubt'. The benefit of a Bayesian approach is that it gives the juror an unbiased, rational mechanism for combining evidence.

Other Applications • Population genetics • Ecology • Archaeology • Environmental Science • Finance • ….and many more

Introduction: Bayesian Inference • Practical methods for learning from data • Use of Probability Models • Quantify Uncertainty

Steps 1. Set up a full probability model (a joint distribution for all observable and unobservable quantities in a problem) • The model should be consistent with the underlying scientific problem • The model should be consistent with the data collection process

Steps (continued) 2. Condition on the observed data: calculate and interpret the posterior distribution (the conditional probability distribution of the unobserved quantities given the observed data), P(θ | Data)

Steps (continued) 3. Evaluate the fit of the model and the implications of the resulting posterior distribution • Does model fit data? • Are conclusions reasonable? • How sensitive are results to the modeling assumptions in step 1?

Step 3 continued • If necessary one can alter or expand the model and repeat the three steps

Step 1 is a stumbling block • How do we go about constructing the joint distribution, i.e. the full probability model? • Improved techniques for the second step may help • Advances in carrying out the third step somewhat alleviate the issue of incorrect model specification in the first step.

Primary motivation for Bayesian thinking • Facilitates common sense interpretation of statistical conclusions. • E.g., a Bayesian (probability) interval for an unknown quantity of interest can be directly regarded as having a high probability of containing the unknown quantity, in contrast to a frequentist (confidence) interval, whose justification rests on the sampling procedure, that is, on how the interval would behave over repeated samples.

Primary motivation for Bayesian thinking (continued) • Increased emphasis has been placed on interval estimation rather than hypothesis testing, which adds a strong impetus to the Bayesian viewpoint • We shall look at the extent to which Bayesian interpretations of common simple statistical procedures are justified.

Real Life Example • A clinical trial of cancer patients might be designed to compare the 5-year survival probability under a new drug with that under the standard treatment • Inference is based on a sample of patients • We cannot assign a patient to both treatments • Causal inference (compare the observed outcome in a patient to the unobserved outcome had the patient been exposed to the other treatment)

Two kinds of estimands • Estimand = unobserved quantity for which inference is needed • Potentially observable quantities (Ÿ) • Quantities that are not directly observable (parameters) (θ) • The first kind helps us understand how the model fits real data

General notation • θ → denotes unobservable vector quantities or population parameters of interest • y → observed data y= (y1, y2, …, yn) • Ÿ → potentially observable but unknown quantity (replication, future prediction etc) • In general these are multivariate quantities

General notation • x → explanatory variable / covariate • X → entire set of explanatory variables for all n units (of data)

Fundamental Difference • Bayesian Approach: inference about θ is based on p(θ|y); inference about Ÿ is based on p(Ÿ|y); statistical conclusions are made using probability statements ('highly unlikely', 'very likely') • Frequentist Approach: inference about θ is based on p(y|θ); inference about Ÿ goes through θ and is therefore also based on p(y|θ); statistical conclusions are based on p-values ('not significant', 'the null hypothesis cannot be rejected', etc.)

Practical similarity? Difference? • Despite the differences, in many simple analyses the two procedures yield superficially similar results (especially asymptotically) • Bayesian methods can be easily extended to more complex problems • Bayesian models often cope better with small data sets • Bayesian methods can bring prior information into the analysis through the prior distribution • Easy sequential updating of inference: as new data become available, the previous posterior distribution is used as the new prior distribution (Bayesian updating).

A Fundamental Concept: The Prior Distribution • θ → treated as random because it is unknown to us, though we may have some feeling about it from before • Prior distribution → "subjective" probability that quantifies whatever belief (however vague) we may have about θ before having looked at the data.

Fundamental Result: Bayes Rule • Due to Thomas Bayes (1702–1761) • Joint distribution p(θ, y) = p(y | θ) p(θ) • p(θ | y) = p(θ, y)/p(y) = p(y | θ) p(θ)/p(y)

Gist: main point to remember • p(θ | y) ∝ p(y | θ) p(θ), since p(y) is free of θ • Any two data sets that yield the same likelihood yield the same inference • This encapsulates the technical core of Bayesian inference: the primary task is to develop the model p(θ, y) and perform the necessary computations to summarize p(θ | y) appropriately.

Attractive property of Bayes Rule • Posterior Odds = p(θ1|y)/p(θ2|y) = {p(θ1 )p(y |θ1)/p(y)}/{p(θ2) p(y |θ2)/p(y)} = {p(θ1 ) / p(θ2)} {p(y|θ1) / p(y|θ2)} = Prior Odds * Likelihood Ratio

Example: Hemophilia Inheritance • Father →XY, Mother →XX • Hemophilia exhibits X-chromosome-linked recessive inheritance • If son receives a bad chromosome from mother, he will be affected • If daughter receives one bad chromosome from mother, she will not be affected, but will be a carrier • If both X are affected in a woman it is fatal (occurrence rare)

A woman has an affected brother → her mother is a carrier of hemophilia • Mother → XgoodXbad • Father not affected • Unknown quantity of interest: θ = 0 if the woman is not a carrier, θ = 1 if the woman is a carrier • Prior: P(θ=0) = P(θ=1) = 0.5

Model and Likelihood • Suppose the woman has two sons, neither of whom is affected. Let yi = 1 denote an affected son and yi = 0 an unaffected son • The conditions of the two sons are independent given θ (they are not identical twins) • Pr(y1=0, y2=0 | θ=1) = (0.5)(0.5) = 0.25 • Pr(y1=0, y2=0 | θ=0) = (1)(1) = 1

Posterior distribution • Bayes Rule: combines the information in the data with the prior probability • y = (y1, y2) is the joint data • Posterior probability of interest: p(θ=1|y) = p(y|θ=1)p(θ=1) / {p(y|θ=1)p(θ=1) + p(y|θ=0)p(θ=0)} = (0.25)(0.5) / {(0.25)(0.5) + (1)(0.5)} = 0.2
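The posterior calculation on this slide can be reproduced in a few lines. The sketch below is only an illustration under the slide's assumptions (prior 0.5, each son of a carrier unaffected with probability 0.5, sons of a non-carrier always unaffected); the function name posterior_carrier is hypothetical and not part of the lecture.

```python
from fractions import Fraction

def posterior_carrier(prior_carrier, sons_unaffected):
    """Posterior probability that the woman is a carrier (theta = 1)
    after observing `sons_unaffected` sons, all unaffected."""
    half = Fraction(1, 2)
    like_carrier = half ** sons_unaffected  # each son unaffected w.p. 0.5 if theta = 1
    like_noncarrier = Fraction(1)           # sons always unaffected if theta = 0
    numerator = like_carrier * prior_carrier
    evidence = numerator + like_noncarrier * (1 - prior_carrier)  # p(y)
    return numerator / evidence

post = posterior_carrier(Fraction(1, 2), 2)
print(post, float(post))  # 1/5 0.2
```

Exact fractions are used so that the result matches the hand calculation without rounding.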

Conclusions • It is clear that if the woman has unaffected children it is less probable she is a carrier • Bayes Rule provides a formal mechanism in terms of prior and posterior odds. • Prior odds= 0.5/0.5=1 • Likelihood ratio = 0.25/1= 0.25 • So posterior odds = (1) (0.25) = 0.25 • So P(θ=1|y)=0.2

Easy sequential analysis with Bayesian methods • Suppose that the woman has a third son, also unaffected. • We do not repeat the entire analysis • Use the previous posterior distribution as the new prior: P(θ=1 | y1, y2, y3) = P(y3|θ=1)(0.2) / {P(y3|θ=1)(0.2) + P(y3|θ=0)(0.8)} = (0.5)(0.2) / {(0.5)(0.2) + (1)(0.8)} = 0.111
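Sequential updating can be sketched as a loop in which each posterior becomes the next prior. This is a minimal illustration of the hemophilia example, assuming (as the slides do) that a non-carrier's son is never affected; the helper name update is hypothetical.

```python
from fractions import Fraction

def update(prior_carrier, son_affected):
    """One step of Bayesian updating for theta = 1 (carrier) given one son."""
    p_obs_carrier = Fraction(1, 2)  # affected or not, each with prob 0.5 if carrier
    p_obs_noncarrier = Fraction(0) if son_affected else Fraction(1)
    numerator = p_obs_carrier * prior_carrier
    return numerator / (numerator + p_obs_noncarrier * (1 - prior_carrier))

belief = Fraction(1, 2)  # prior P(theta = 1)
for son_affected in [False, False, False]:
    belief = update(belief, son_affected)  # previous posterior is the new prior
    print(belief)
# prints 1/3, then 1/5, then 1/9
```

After the second son the belief is 1/5 = 0.2 and after the third it is 1/9 ≈ 0.111, matching the slides.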

Probability as a measure of uncertainty Legitimate to ask in Bayesian Analysis • Pr(Rain tomorrow)? • Pr(Victory of Bangladesh in 20-20 match)? • Pr(Heads if coin is tossed)? • Pr(Average height of students within (4ft, 5ft)) of interest after data is acquired • Pr(Sample average of students within (4ft, 5ft)) of interest before data is acquired

Bayesian Analysis methods enable statements to be made about the partial knowledge available (based on data) concerning some situation (unobservable, or as yet unobserved) in a systematic way, using probability as the measure • Guiding principle: State of knowledge about anything unknown is described by a probability distribution

Usual Numerical Methods of Certainty 1. Symmetry / Exchangeability Argument • Probability = # favourable cases / # possibilities • (Coin tossing experiment) • Involves assumptions about the physical conditions of the toss and the forces at work • Dubious if we know the coin is either double-headed or double-tailed.

Usual Numerical Methods of Certainty 2. Frequency Argument • Probability = relative frequency obtained in a very long sequence of experiments (assumed to be identically performed and physically independent of each other)

Other arguments in consideration • Physical randomness induces uncertainty (we speak of 'likely', 'less likely', etc. events) • Axiomatic approach: related to decision theory • Coherence of bets: (define probability through odds ratios) • Fundamental difficulties remain in defining odds • Ultimate test is success of application!

Summarizing inference using simulation • Simulation forms a central part of Bayesian analysis, because of the relative ease with which samples can be drawn from even a complex, explicitly unknown probability distribution

For example: • To estimate the 95th percentile of the posterior distribution of θ|y, draw a random sample of size L (large) from p(θ|y) and use the (0.95 L)-th order statistic. • For most purposes L = 1000 is adequate for such estimates
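The order-statistic estimate can be sketched as follows. The posterior here is a stand-in chosen only for illustration (a standard normal, whose exact 95th percentile is about 1.645); the function name percentile_from_draws and the seed are likewise assumptions, not part of the lecture.

```python
import random

def percentile_from_draws(draws, q):
    """Estimate the q-quantile by the (q * L)-th order statistic of L draws."""
    ordered = sorted(draws)
    k = max(1, round(q * len(ordered)))  # 1-based rank of the order statistic
    return ordered[k - 1]

rng = random.Random(2024)  # fixed seed so the sketch is reproducible
L = 1000
# Stand-in posterior (an assumption for illustration): theta | y ~ Normal(0, 1)
draws = [rng.gauss(0.0, 1.0) for _ in range(L)]
estimate = percentile_from_draws(draws, 0.95)
print(estimate)  # should be near the exact value 1.645
```

With L = 1000 the Monte Carlo error of this estimate is a few hundredths, which is consistent with the slide's remark that L = 1000 is usually adequate.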

Generating values from a probability distribution is often straightforward with modern computing techniques • This is based on (pseudo)random number generators → these yield a deterministic sequence that appears to have the same properties as a sequence of independent random draws from the uniform distribution on [0,1]

Sampling using the inverse cumulative distribution function • F is the c.d.f. of a random variable • F^(-1)(U) = inf{x : F(x) ≥ U} will follow the distribution defined by F(.), where U ~ Uniform(0, 1) • If F is discrete, F^(-1) can be tabulated
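A minimal sketch of inverse-c.d.f. sampling, covering one continuous and one discrete case. The exponential distribution and the two-point distribution are chosen only as illustrations, and the function names are hypothetical.

```python
import bisect
import math
import random

def sample_exponential(rate, rng):
    """Continuous case: F(x) = 1 - exp(-rate * x), so F^(-1)(u) = -log(1 - u) / rate."""
    u = rng.random()  # U ~ Uniform[0, 1)
    return -math.log(1.0 - u) / rate

def make_discrete_sampler(values, probs):
    """Discrete case: tabulate the c.d.f. once, then look up inf{x : F(x) >= u}."""
    cdf, total = [], 0.0
    for p in probs:
        total += p
        cdf.append(total)
    cdf[-1] = 1.0  # guard against floating-point rounding in the last entry

    def sample(rng):
        u = rng.random()
        return values[bisect.bisect_left(cdf, u)]  # first index with cdf[i] >= u
    return sample

rng = random.Random(0)
exp_draws = [sample_exponential(2.0, rng) for _ in range(20_000)]
coin = make_discrete_sampler([0, 1], [0.8, 0.2])
coin_draws = [coin(rng) for _ in range(20_000)]
print(sum(exp_draws) / len(exp_draws))    # close to the mean 1 / rate = 0.5
print(sum(coin_draws) / len(coin_draws))  # close to P(1) = 0.2
```

The discrete sampler uses a binary search over the tabulated c.d.f., which is one common way to implement the table lookup mentioned on the slide.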

Posterior samples as building blocks of the posterior distribution • Arrange the L posterior draws in an array, one row per draw and one column per component of θ • One can use this array to approximate the posterior distribution • One can also use it to find the posterior distribution of, say, θ1/θ2 or log(θ3), by adding columns computed from the existing ones; this is extremely straightforward!

Single-parameter models • Consider some fundamental and widely used one-dimensional models: the binomial, normal, Poisson, exponential, etc. • We shall discuss important concepts and computational methods for Bayesian data analysis

Estimating a probability from binomial data • Sequence of Bernoulli trials; data y1, …, yn, each of which is either 0 or 1 (n fixed) • Exchangeability implies that the likelihood depends on the data only through the sum of the yi, denoted y • Provides a relatively simple but important example • Parallels the very first published Bayesian analysis, by Thomas Bayes in 1763

Proportion of female births • 200 years ago it was established that the proportion of female births in European populations was less than 0.5 • This century interest has focused on factors that may influence the gender ratio. • The currently accepted value of the proportion of female births in very large European-race populations is 0.485.

Define the parameter θ to be the proportion of female births • An alternative way of reporting this parameter is as a ratio of male to female birth rates • For Bayesian inference in the binomial model, we must specify a prior distribution for θ • For simplicity assume the prior to be Uniform(0,1) • Bayes rule implies that p(θ|y) ∝ θ^y (1-θ)^(n-y)

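In single-parameter problems, the unnormalized posterior density p(θ|y) ∝ θ^y (1-θ)^(n-y) can be evaluated directly over a range of θ values, which allows immediate graphical presentation of the posterior distribution.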
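One way to prepare such a plot is to evaluate the unnormalized posterior on a grid of θ values and normalize numerically. The sketch below assumes the uniform prior of the slides; the data values (y = 3, n = 10), the grid size, and the function name grid_posterior are illustrative assumptions. Plotting is left out so the example needs only the standard library.

```python
import math

def log_unnormalized_posterior(theta, y, n):
    """log of theta^y * (1 - theta)^(n - y), handling the endpoints 0 and 1."""
    if (theta == 0.0 and y > 0) or (theta == 1.0 and n - y > 0):
        return -math.inf
    term_success = y * math.log(theta) if y > 0 else 0.0
    term_failure = (n - y) * math.log(1.0 - theta) if n - y > 0 else 0.0
    return term_success + term_failure

def grid_posterior(y, n, grid_size=1001):
    """Return (thetas, density) with the density normalized by a Riemann sum."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    log_post = [log_unnormalized_posterior(t, y, n) for t in thetas]
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]  # subtract peak to avoid underflow
    step = 1.0 / (grid_size - 1)
    total = sum(weights) * step
    return thetas, [w / total for w in weights]

thetas, density = grid_posterior(y=3, n=10)
mode = thetas[density.index(max(density))]
print(mode)  # the grid point with highest density, at y / n = 0.3
```

Working on the log scale keeps the computation stable for large n, where θ^y (1-θ)^(n-y) would underflow if computed directly.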

Since p(θ|y) is a density and should integrate to 1, the normalizing constant can be worked out. • The posterior distribution is recognizable as a beta distribution • θ|y ~ Beta(y+1, n-y+1) • In analyzing the binomial model, Pierre-Simon Laplace (1749–1827) also used the uniform prior distribution. • His first serious application was to estimate the proportion of female births in a population. • A total of 241,945 girls and 251,527 boys were born in Paris from 1745 to 1770. • Laplace used a normal approximation and showed that P(θ ≥ 0.5 | y = 241,945, n = 251,527 + 241,945) ≈ 1.15 × 10^(-42) • So he was 'morally certain' that θ < 0.5.
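The order of magnitude of Laplace's result can be checked with a normal approximation to the Beta(y+1, n-y+1) posterior. The sketch below matches the posterior mean and variance of the beta distribution; this is one possible approximation and is not claimed to be Laplace's exact calculation, so the value need not agree with his to the last digit.

```python
import math

girls, boys = 241_945, 251_527
n = girls + boys

# Posterior under the uniform prior: theta | y ~ Beta(a, b)
a, b = girls + 1, n - girls + 1
mean = a / (a + b)
variance = a * b / ((a + b) ** 2 * (a + b + 1))

# Normal approximation: P(theta >= 0.5) = P(Z >= z), computed via erfc
z = (0.5 - mean) / math.sqrt(variance)
tail_probability = 0.5 * math.erfc(z / math.sqrt(2.0))

print(mean)              # about 0.4903
print(z)                 # about 13.6 posterior standard deviations below 0.5
print(tail_probability)  # on the order of 1e-42
```

The complementary error function is used instead of 1 minus the normal c.d.f., since the latter would round to zero at this distance into the tail.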

Lecture 2: Bayesian Inference and Data Analysis. Dept. of Statistics, Rajshahi University, Rajshahi. Anandamayee Majumdar, Visiting Scientist, University of North Texas School of Public Health, USA; Professor, University of Suzhou, China.

Overview • Prediction in the Binomial example • Justification of the Uniform prior in the Binomial case • Prior distributions: more discussion and an example • Hyperparameters, hyperpriors • Hierarchical models • Posterior distribution as a compromise between prior distribution and data • Graphical and Numerical Summaries • Posterior probability intervals (or credible intervals) • Normal example with unknown mean and known variance • Central Limit Theorem in the Bayesian Context • Large sample properties and results

Prediction in the Binomial Example • In the binomial example with the uniform prior distribution, the prior predictive distribution (marginal of y) can be evaluated explicitly • Marginal distribution of y: p(y=i) = 1/(n+1) for i = 0, 1, …, n • All values of y are equally likely, a priori. • For posterior prediction, we might be more interested in the outcome of one new trial, rather than another set of n new trials.
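The uniform prior predictive distribution can be verified exactly. The sketch below uses the standard beta integral, ∫ θ^i (1-θ)^(n-i) dθ = i! (n-i)! / (n+1)! over [0, 1], multiplied by the binomial coefficient; n = 10 and the function name prior_predictive are illustrative choices.

```python
from fractions import Fraction
from math import comb, factorial

def prior_predictive(i, n):
    """p(y = i) under a Uniform(0, 1) prior on theta:
    C(n, i) * integral of theta^i (1 - theta)^(n - i) over [0, 1]."""
    beta_integral = Fraction(factorial(i) * factorial(n - i), factorial(n + 1))
    return comb(n, i) * beta_integral

n = 10
probs = [prior_predictive(i, n) for i in range(n + 1)]
print(probs[0], sum(probs))  # 1/11 1
```

Every entry equals 1/(n+1), so all n+1 possible values of y are equally likely before any data are seen.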

Prediction in the Binomial example • Let y_tilde denote the result of a new trial, exchangeable with the first n. Then Pr(y_tilde = 1 | y) = ∫ θ p(θ|y) dθ = E(θ|y) = (y+1)/(n+2), the mean of the Beta(y+1, n-y+1) posterior • This result, based on the uniform prior distribution, is known as 'Laplace's law of succession'.
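Under the Beta(y+1, n-y+1) posterior from Lecture 1, the probability that a new trial is a success is the posterior mean, (y+1)/(n+2). The sketch below computes this and checks it by simulation; the data values (y = 3, n = 10), the seed, and the function names are illustrative assumptions.

```python
import random
from fractions import Fraction

def succession_probability(y, n):
    """Pr(next trial is a success | y successes in n trials), uniform prior."""
    return Fraction(y + 1, n + 2)

def simulate_next_trial(y, n, draws, rng):
    """Monte Carlo check: draw theta from Beta(y+1, n-y+1), then one Bernoulli trial."""
    successes = 0
    for _ in range(draws):
        theta = rng.betavariate(y + 1, n - y + 1)
        successes += rng.random() < theta
    return successes / draws

rng = random.Random(7)  # fixed seed so the sketch is reproducible
exact = succession_probability(3, 10)             # 4/12 = 1/3
approx = simulate_next_trial(3, 10, 20_000, rng)  # should be close to 1/3
print(exact, approx)
```

With no data at all (y = 0, n = 0) the formula gives 1/2, the mean of the uniform prior.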
