Create Presentation
Download Presentation

Download Presentation
## Bayesian Essentials and Bayesian Regression

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Y**1 1 X Distribution Theory 101 • Marginal and Conditional Distributions: uniform**Simulating from Joint**• To draw from the joint: • i. draw from marginal on X • ii. Condition on this draw, and draw from conditional of Y|X**The Goal of Inference**• Make inferences about unknown quantities using available information. • Inference -- make probability statements • unknowns -- • parameters, functions of parameters, states or latent variables, “future” outcomes, outcomes conditional on an action • Information – • data-based • non data-based • theories of behavior; “subjective views” there is an underlying structure • parameters are finite or in some range**Data Aspects of Marketing Problems**• Ex: Conjoint Survey • 500 respondents rank, rate, chose among product configurations. • Small Amount of Information per respondent • Response Variable Discrete • Ex: Retail Scanning Data • very large number of products • large number of geographical units (markets, zones, stores) • limited variation in some marketing mix vars • Must make plausible predictions for decision making!**The likelihood principle**Note: any function proportional to data density can be called the likelihood. • LP: the likelihood contains all information relevant for inference. That is, as long as I have same likelihood function, I should make the same inferences about the unknowns. • Implies analysis is done conditional on the data,.**Bayes theorem**• p(|D) p(D| ) p() • Posterior “Likelihood” × Prior • Modern Bayesian computing– simulation methods for generating draws from the posterior distribution p(|D).**Summarizing the posterior**Output from Bayesian Inf: A high dimensional dist Summarize this object via simulation: marginal distributions of don’t just compute**Prediction**See D,compute: “Predictive Distribution”**Decision theory**• Loss: L(a,) where a=action; =state of nature • Bayesian decision theory: • Estimation problem is a special case:**Sampling properties of Bayes estimators**An estimator is admissible if there exist no other estimator with lower risk for all values of . The Bayes estimator minimizes expected (average) risk which implies they are admissible: The Bayes estimator does the best for every D. Therefore, it must work as at least as well as any other estimator.**Bayes Inference: Summary**• Bayesian Inference delivers an integrated approach to: • Inference – including “estimation” and “testing” • Prediction – with a full accounting for uncertainty • Decision – with likelihood and loss (these are distinct!) • Bayesian Inference is conditional on available info. • The right answer to the right question. • Bayes estimators are admissible. All admissible estimators are Bayes (Complete Class Thm). Which Bayes estimator?**Bayes/Classical Estimators**Prior washes out – locally uniform!!! Bayes is consistent unless you have dogmatic prior.**Benefits/Costs of Bayes Inf**Benefits- finite sample answer to right question full accounting for uncertainty integrated approach to inf/decision “Costs”- computational (true any more? < classical!!) prior (cost or benefit?) esp. with many parms (hierarchical/non-parameric problems)**Bayesian Computations**Before simulation methods, Bayesians used posterior expectations of various functions as summary of posterior. If p(θ|D) is in a convenient form (e.g. normal), then I might be able to compute this for some h. Via iid simulation for all h.**Conjugate Families**• Models with convenient analytic properties almost invariably come from conjugate families. • Why do I care now? • - conjugate models are used as building blocks • build intuition re functions of Bayesian inference • Definition: • A prior is conjugate to a likelihood if the posterior is in the same class of distributions as prior. • Basically, conjugate priors are like the posterior from some imaginary dataset with a diffuse prior.**Need a prior!**Beta-Binomial model**Regression model**Is this model complete? For non-experimental data, don’t we need a model for the joint distribution of y and x?**two separate analyses**Regression model simultaneous systems are not written this way! rules out x=f(β)!!! If Ψ is a priori indep of (β,σ),**Conjugate Prior**What is conjugate prior? Comes from form of likelihood function. Here we condition on X. quadratic form suggests normal prior. Let’s – complete the square on βor rewrite by projecting y on X (column space of X).**y**x2 2x2 1x1 x1 Geometry of regression**Traditional regression**• No one ever computes a matrix inverse directly. • Two numerically stable methods: • QR decomposition of X • Cholesky root of X’X and compute inverse using root • Non-Bayesians have to worry about singularity or near singularity of X’X. We don’t! more later**Cholesky Roots**In Bayesian computations, the fundamental matrix operation is the Cholesky root. chol() in R The Cholesky root is the generalization of the square root applied to positive definite matrices. As Bayesians with proper priors, we don’t ever have to worry about singular matrices! U is upper triangular with positive diagonal elements. U-1 is easy to compute by recursively solving TU = I for T, backsolve() in R.**Cholesky Roots**Cholesky roots can be useful to simulate from Multivariate Normal Distribution. To simulate a matrix of draws from MVN (each row is a separate draw) in R, Y=matrix(rnorm(n*k),ncol=k)%*%chol(Sigma) Y=t(t(Y)+mu)**data.txt:**UNIT Y X1 X2 A 1 0.23815 0.43730 A 2 0.55508 0.47938 A 3 3.03399 -2.17571 A 4 -1.49488 1.66929 B 10 -1.74019 0.35368 B 9 1.40533 -1.26120 B 8 0.15628 -0.27751 B 7 -0.93869 -0.04410 B 6 -3.06566 0.14486 df=read.table("data.txt",header=TRUE) myreg=function(y,X){ # # purpose: compute lsq regression # # arguments: # y -- vector of dep var # X -- array of indep vars # # output: # list containing lsq coef and std errors # XpXinv=chol2inv(chol(crossprod(X))) bhat=XpXinv%*%crossprod(X,y) res=as.vector(y-X%*%bhat) ssq=as.numeric(res%*%res/(nrow(X)-ncol(X))) se=sqrt(diag(ssq*XpXinv)) list(b=bhat,std_errors=se) } Regression with R**where**Regression likelihood**Regression likelihood**This is called an inverted gamma distribution. It can also be related to the inverse of a Chi-squared distribution. Note the conjugate prior suggested by the form the likelihood has a prior on βwhich depends on σ.**Bayesian Regression**Prior: Interpretation as from another dataset. Inverted Chi-Square: Draw from prior?**IID Simulations**Scheme: [y|X, , 2] [|2] [2] 1) Draw [2 | y, X] 2) Draw [ | 2,y, X] 3) Repeat**Bayes Estimator**The Bayes Estimator is the posterior mean of β. Marginal on β is a multivariate student t. Who cares?**Shrinkage and Conjugate Priors**The Bayes Estimator is the posterior mean of β. This is a “shrinkage” estimator. Is this reasonable?**Assessing Prior Hyperparameters**These determine prior location and spread for both coefs and error variance. It has become customary to assess a “diffuse” prior: This can be problematic. Var(y) might be a better choice.**Improper or “non-informative” priors**Classic “non-informative” prior (improper): • Is this “non-informative”? • Of course not, it says that is large with high prior “probability” • Is this wise computationally? • No, I have to worry about singularity in X’X • Is this a good procedure? • No, it is not admissible. Shrinkage is good!**runireg**• runireg= • function(Data,Prior,Mcmc){ • # • # purpose: • # draw from posterior for a univariate regression model with natural conjugate prior • # • # Arguments: • # Data -- list of data • # y,X • # Prior -- list of prior hyperparameters • # betabar,A prior mean, prior precision • # nu, ssq prior on sigmasq • # Mcmc -- list of MCMC parms • # R number of draws • # keep -- thinning parameter • # • # output: • # list of beta, sigmasq draws • # beta is k x 1 vector of coefficients • # model: • # Y=Xbeta+e var(e_i) = sigmasq • # priors: beta| sigmasq ~ N(betabar,sigmasq*A^-1) • # sigmasq ~ (nu*ssq)/chisq_nu**runireg**• RA=chol(A) • W=rbind(X,RA) • z=c(y,as.vector(RA%*%betabar)) • IR=backsolve(chol(crossprod(W)),diag(k)) • # W'W=R'R ; (W'W)^-1 = IR IR' -- this is UL decomp • btilde=crossprod(t(IR))%*%crossprod(W,z) • res=z-W%*%btilde • s=t(res)%*%res • # • # first draw Sigma • # • sigmasq=(nu*ssq + s)/rchisq(1,nu+n) • # • # now draw beta given Sigma • # • beta = btilde + as.vector(sqrt(sigmasq))*IR%*%rnorm(k) • list(beta=beta,sigmasq=sigmasq) • }**Inverted Wishart distribution**Form of the likelihood suggests that natural conjugate (convenient prior) for would be of the Inverted Wishart form: denoted • - tightness V- location however, as increases, spread also increases limitations: i. small -- thick tail ii. only one tightness parm**Multivariate regression prior and posterior**Prior: Posterior: