
Lectures 13,14 – Model Inference and Averaging



  1. Lectures 13,14 – Model Inference and Averaging Rice ECE697 Farinaz Koushanfar Fall 2006

  2. Summary • Bootstrap and maximum likelihood (ML) • Bayesian methods • The expectation maximization (EM) algorithm • MCMC for sampling from posterior • Bagging • Model averaging

  3. Empirical Distribution • Suppose x1, ..., xN are the observed outcomes of N iid random variables following an unknown PDF • The empirical distribution puts mass 1/N on each observation: P(X = a) = count(xi = a)/N • The empirical estimate of a parameter θ is computed from the empirical distribution by the formula that defines the parameter in terms of the true distribution • For example, the empirical estimate of the variance is σ̂² = (1/N) Σi (xi − x̄)² • Empirical estimates are often biased, and there is no guarantee that they have the best possible variance or other good properties.
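
As a concrete illustration, here is a minimal Python sketch of the plug-in (empirical) variance estimate; the data-generating distribution and sample size are illustrative. Dividing by N rather than N−1 is what makes the plug-in estimate biased.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=50)   # N iid draws from a PDF we pretend not to know

# Plug-in (empirical) estimate of the variance: apply the defining formula of the
# parameter to the empirical distribution, i.e. average over the observed sample.
emp_var = np.mean((x - x.mean()) ** 2)        # divides by N -> biased downward
unbiased_var = x.var(ddof=1)                  # divides by N-1, shown for comparison

print(emp_var, unbiased_var)
```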

  4. Bootstrap • Let T(x1, ..., xN) be an estimator of θ • The bootstrap generates repeated estimates from repeated "fake" samples; each fake sample is drawn according to the empirical distribution P(X = a) = count(xi = a)/N, i.e., by resampling the data with replacement • Taking R random samples with replacement gives R bootstrap estimates of θ; call these B1, ..., BR • What do we use the R bootstrap estimates for? The most common use is confidence intervals: (1) use the order statistics of the Br, e.g., for a 95% confidence interval take B(2.5%) and B(97.5%); (2) if we know that T(x1, ..., xN) is approximately Gaussian, base the CI on the sample standard deviation of the bootstrap estimates, T ± z(1−α/2) · sd(B1, ..., BR)
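
A minimal sketch of the nonparametric bootstrap for a generic estimator T, assuming numpy; the estimator (the sample median), sample and number of replicates R are illustrative. The 95% interval uses the 2.5% and 97.5% order statistics of the bootstrap estimates, as described above.

```python
import numpy as np

def bootstrap_ci(x, T, R=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the estimator T applied to sample x."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # R "fake" samples drawn with replacement from the empirical distribution
    boot = np.array([T(x[rng.integers(0, n, size=n)]) for _ in range(R)])
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return T(x), (lo, hi), boot.std(ddof=1)   # estimate, percentile CI, bootstrap SE

x = np.random.default_rng(1).exponential(scale=2.0, size=100)
print(bootstrap_ci(x, np.median))
```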

  5. Bootstrap-Example • Training data Z = {z1, z2, ..., zN}, with zi = (xi, yi) • Suppose we are fitting a cubic spline with 3 knots placed at quantiles of x (splines are one form of basis expansion, with basis functions centered at the knots) • This gives seven basis functions h1(x), h2(x), ..., h7(x)

  6. Bootstrap-Example (cont’d) • Write h(x)^T = (h1(x), ..., h7(x)) • Spline prediction: μ(x) = Σj βj hj(x) = h(x)^T β • Can think of μ(x) as an estimate of E(Y|X = x) • The usual least-squares estimate is β̂ = (H^T H)^(-1) H^T y, where H is the N×7 matrix with ij-th element hj(xi) • The estimated covariance of β̂ is (H^T H)^(-1) σ̂², with noise variance σ̂² = Σi (yi − μ̂(xi))² / N • How do we apply the bootstrap to this example?

  7. Bootstrap-Example (cont’d) • Draw B bootstrap datasets Z*, each by sampling N pairs zi = (xi, yi) from Z with replacement • To each bootstrap sample, fit a cubic spline μ̂*(x) • Example: 10 bootstrap spline fits (left) and the resulting pointwise confidence bands (right)
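
A sketch of this nonparametric bootstrap in Python, assuming a truncated-power cubic-spline basis with 3 knots (7 basis functions) stands in for the h1,…,h7 on the slides; the synthetic data, knot placement and B are all illustrative.

```python
import numpy as np

def h(x, knots):
    """Truncated-power cubic-spline basis: 4 polynomial terms + 3 knots = 7 functions."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)                 # N x 7 design matrix H

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
knots = np.quantile(x, [0.25, 0.5, 0.75])        # knots at quantiles, as on the slide

H = h(x, knots)
beta_hat = np.linalg.lstsq(H, y, rcond=None)[0]  # usual least-squares fit

# Nonparametric bootstrap: resample (x_i, y_i) pairs with replacement and refit.
B, grid = 200, np.linspace(0, 1, 100)
curves = np.empty((B, grid.size))
for b in range(B):
    idx = rng.integers(0, x.size, size=x.size)
    beta_b = np.linalg.lstsq(h(x[idx], knots), y[idx], rcond=None)[0]
    curves[b] = h(grid, knots) @ beta_b

lo, hi = np.percentile(curves, [2.5, 97.5], axis=0)   # pointwise 95% bands
```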

  8. Least Squares and Bootstrap - Example

  9. Least Squares, Bootstrap, and ML • The previous example was the nonparametric bootstrap • Suppose instead that the errors are Gaussian: ε ~ N(0, σ²) • In the parametric bootstrap, we draw samples by adding Gaussian noise to the predicted values: yi* = μ̂(xi) + εi*, with εi* ~ N(0, σ̂²) • The process is repeated B times, re-computing the spline on each sample; the confidence bands from this method match the least-squares bands as B → ∞ • The function estimated from a parametric bootstrap sample has the distribution μ̂*(x) ~ N(μ̂(x), h(x)^T (H^T H)^(-1) h(x) σ̂²)
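
A sketch of the parametric bootstrap step under the Gaussian-error assumption, reusing the illustrative objects h, H, beta_hat, x, y, knots and grid from the previous sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_hat = H @ beta_hat                               # fitted values at the training x's
sigma2_hat = np.mean((y - mu_hat) ** 2)             # noise-variance estimate

B = 200
curves = np.empty((B, grid.size))
for b in range(B):
    # Parametric bootstrap: add Gaussian noise to the *predicted* values ...
    y_star = mu_hat + rng.normal(scale=np.sqrt(sigma2_hat), size=y.size)
    # ... and refit the spline to the simulated responses (same x's, same design H).
    beta_star = np.linalg.lstsq(H, y_star, rcond=None)[0]
    curves[b] = h(grid, knots) @ beta_star

lo, hi = np.percentile(curves, [2.5, 97.5], axis=0)  # approaches the least-squares bands
```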

  10. Maximum Likelihood (ML) Inference • In general, the bootstrap agrees not with least squares but with maximum likelihood • Specify a probability density (or mass) function for the observations: zi ~ gθ(z) • ML is based on the likelihood function L(θ; Z) = ∏i gθ(zi), the probability of the observed data under the model gθ • The logarithm of L(θ; Z), denoted ℓ(θ; Z), is the log-likelihood function • ML chooses the value of θ that maximizes ℓ(θ; Z)
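
A generic sketch of ML fitting by numerically maximizing the log-likelihood with scipy; a Gaussian model and simulated data are used purely for illustration (the true values 3.0 and 1.5 are assumptions of the example).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=1.5, size=200)        # observations z_i ~ g_theta(z)

def neg_loglik(params, z):
    mu, log_sigma = params                           # sigma parameterized on the log scale
    return -np.sum(norm.logpdf(z, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), args=(z,))
mu_ml, sigma_ml = res.x[0], np.exp(res.x[1])
print(mu_ml, sigma_ml)                               # should be close to 3.0 and 1.5
```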

  11. ML – Some Definitions • The score function is the gradient of the log-likelihood, ∂ℓ(θ; Z)/∂θ = Σi ∂ℓ(θ; zi)/∂θ • Assuming the maximum is in the interior of the parameter space, the score is zero at the ML estimate θ̂ • The information matrix is I(θ) = −Σi ∂²ℓ(θ; zi)/∂θ∂θ^T • I(θ) evaluated at θ = θ̂ is called the observed information • The Fisher information (or expected information) is i(θ) = E[I(θ)]

  12. ML – Some More Results • Assume independent sampling from gθ(z) • The sampling distribution of the ML estimator has a limiting normal distribution: θ̂ → N(θ0, i(θ0)^(-1)) as N → ∞, where θ0 is the true parameter • The standard error of θ̂j is estimated by sqrt(I(θ̂)^(-1)jj) • A confidence interval for θj therefore has the form θ̂j ± z(1−α) · sqrt(I(θ̂)^(-1)jj)

  13. ML for Our Smoothing Example • The parameters are θ = (β, σ²); the log-likelihood is ℓ(θ) = −(N/2) log(2πσ²) − (1/2σ²) Σi (yi − h(xi)^T β)² • The ML estimates are obtained by setting ∂ℓ/∂β = 0 and ∂ℓ/∂σ² = 0, giving β̂ = (H^T H)^(-1) H^T y and σ̂² = (1/N) Σi (yi − μ̂(xi))², the same as the least-squares fit • The information matrix for θ = (β, σ²) is block-diagonal: I(θ) = diag(H^T H/σ², N/(2σ⁴))

  14. Bayesian Approach to Inference • Specify a sampling model Pr(Z|θ), the pdf of the data given the parameters, and a prior distribution Pr(θ), reflecting our knowledge about θ before we see the new data • The posterior distribution Pr(θ|Z) = Pr(Z|θ)Pr(θ) / ∫ Pr(Z|θ)Pr(θ) dθ corresponds to our updated knowledge about θ after we see the new data • The difference between Bayesian and standard inference is that the Bayesian approach uses a prior to express the uncertainty before seeing the data, and a posterior to express the uncertainty remaining after seeing it

  15. Bayesian Approach (Cont’d) • Predict the value of a future observation via the predictive distribution Pr(znew|Z) = ∫ Pr(znew|θ) Pr(θ|Z) dθ • ML would instead use Pr(znew|θ̂) to predict future data • Unlike the predictive distribution, this does not account for the uncertainty in estimating θ

  16. Bayesian Approach on Our Example • Parametric model: μ(x) = Σj βj hj(x) • Assume that σ² is known and that the randomness comes only from the variation of y around μ(x) • Assuming a finite number of basis functions, put the prior on the distribution of the coefficients: β ~ N(0, τΣ) • The implied prior for μ(x) is Gaussian, with covariance kernel K(x, x') = cov(μ(x), μ(x')) = τ h(x)^T Σ h(x') • The posterior distribution for β is also Gaussian

  17. Example (Cont’d) • The posterior for β has mean E(β|Z) = (H^T H + (σ²/τ)Σ^(-1))^(-1) H^T y and covariance (H^T H + (σ²/τ)Σ^(-1))^(-1) σ², so the corresponding posterior value for μ(x) is E(μ(x)|Z) = h(x)^T (H^T H + (σ²/τ)Σ^(-1))^(-1) H^T y • How do we choose Σ and the prior variance τ? Here take the prior correlation to be Σ = I and vary τ
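
A sketch of this Gaussian posterior computation, assuming the prior β ~ N(0, τI), a known noise variance, and reusing the illustrative objects h, H, y, grid, knots and sigma2_hat from the bootstrap sketches above; the value of τ is an arbitrary choice.

```python
import numpy as np

tau, sigma2 = 1.0, sigma2_hat            # prior variance tau; noise variance treated as known
p = H.shape[1]

# Posterior for beta under beta ~ N(0, tau*I) and Gaussian errors:
#   cov = (H'H/sigma^2 + I/tau)^{-1},   mean = cov @ H'y / sigma^2
post_cov = np.linalg.inv(H.T @ H / sigma2 + np.eye(p) / tau)
post_mean = post_cov @ H.T @ y / sigma2

# Draw posterior curves mu(x) = h(x)' beta, analogous to the bootstrap curves.
rng = np.random.default_rng(2)
betas = rng.multivariate_normal(post_mean, post_cov, size=200)
post_curves = betas @ h(grid, knots).T                 # 200 x len(grid)
```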

  18. Example (Cont’d) • Let’s look at the posterior curves and see the impact of the prior variance τ on the posterior • For large τ the posterior curves look like the bootstrap curves

  19. Bayes Inference – Example (from Wikipedia) • Suppose we wish to know the proportion r of voters in a large population who will vote "yes" in a referendum • Let n be the number of voters in a random sample (chosen with replacement, so that we have statistical independence) • Let m be the number of voters in that random sample who will vote "yes" • Suppose that we observe n = 10 voters and m = 7 say they will vote yes. From Bayes' theorem, f(r | n = 10, m = 7) = f(m = 7 | r, n = 10) f(r) / ∫0^1 f(m = 7 | r, n = 10) f(r) dr

  20. Example from Wikipedia (Cont’d) • From this we see that the posterior pdf is computed from the prior probability density function f(r) and the likelihood function L(r) = f(m = 7 | r, n = 10) • f(r) summarizes what we know about the distribution of r in the absence of any observation • We provisionally assume in this case that the prior distribution of r is uniform over the interval [0, 1]; that is, f(r) = 1 • If some additional background information is found, we should modify the prior accordingly. However, before we have any observations, all outcomes are taken to be equally likely.

  21. Example from Wikipedia (Cont’d) • Assuming random sampling, the likelihood function L(r) = P(m = 7 | r, n = 10) is just the probability of 7 successes in 10 trials for a binomial distribution: L(r) = C(10,7) r^7 (1 − r)^3 • As with the prior, the likelihood is open to revision; more complex assumptions will yield more complex likelihood functions. Maintaining the current assumptions, the normalizing factor is ∫0^1 C(10,7) r^7 (1 − r)^3 dr = 1/11 • For r ∈ [0, 1], the posterior distribution for r is then f(r | m = 7, n = 10) = 11 · C(10,7) r^7 (1 − r)^3 = 1320 r^7 (1 − r)^3, a Beta(8, 4) density

  22. Example from Wikipedia (Cont’d) • One may be interested in the probability that more than half the voters will vote "yes" • The prior probability that more than half the voters will vote "yes" is 1/2, by the symmetry of the uniform distribution • In comparison, the posterior probability that more than half the voters will vote "yes", i.e., the conditional probability given the outcome of the opinion poll (that seven of the 10 voters questioned will vote "yes"), is ∫ from 1/2 to 1 of 1320 r^7 (1 − r)^3 dr ≈ 0.887
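
A sketch that checks this number with scipy: since the posterior 1320·r^7(1−r)^3 is a Beta(8, 4) density, the probability that r exceeds 1/2 can be read off its survival function, or obtained by integrating the density from the slide directly.

```python
from scipy.stats import beta
from scipy.integrate import quad

# Posterior f(r | m=7, n=10) with a uniform prior is Beta(8, 4).
post = beta(8, 4)
print(post.sf(0.5))                        # P(r > 1/2 | data) ~ 0.887

# Same number by integrating the posterior density written on the slide.
print(quad(lambda r: 1320 * r**7 * (1 - r)**3, 0.5, 1.0)[0])
```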

  23. Expectation Maximization (EM) Algorithm • The EM algorithm is used to simplify difficult ML problems; we will show an example first • Assume we are fitting a simple mixture model

  24. EM - Example • Model: Y is a mixture of two normal distributions, Y1 ~ N(μ1, σ1²) and Y2 ~ N(μ2, σ2²) • Y = (1 − Δ)·Y1 + Δ·Y2, where Δ ∈ {0, 1} and Pr(Δ = 1) = π • Let φθ(y) denote the Gaussian density with parameters θ = (μ, σ²); the pdf of Y is gY(y) = (1 − π) φθ1(y) + π φθ2(y) • The parameters of the mixture model to be estimated by ML are θ = (π, μ1, σ1², μ2, σ2²) • The log-likelihood on N training cases is ℓ(θ; Z) = Σi log[(1 − π) φθ1(yi) + π φθ2(yi)]

  25. EM - Example • Direct maximization is numerically difficult because of the sum of terms inside the log • A simpler approach is to fix some of the values, solve for the others, and iterate; this is the core of the EM algorithm • Expectation step: do a soft assignment of each observation to one model, i.e., compute how much each model is responsible for explaining each data point • E.g., the responsibility of model 2 for observation i is γi = Pr(Δi = 1 | θ, Z) = π φθ2(yi) / [(1 − π) φθ1(yi) + π φθ2(yi)] • Maximization step: the responsibilities are used as weights in a weighted ML fit to update the parameter estimates

  26. EM Algorithm for 2-Component Gaussian Mixture Model • 1. Take initial guesses for the parameters π̂, μ̂1, σ̂1², μ̂2, σ̂2² • 2. Expectation step: compute the responsibilities γ̂i = π̂ φθ̂2(yi) / [(1 − π̂) φθ̂1(yi) + π̂ φθ̂2(yi)] • 3. Maximization step: compute the weighted means and variances, μ̂1 = Σ(1 − γ̂i)yi / Σ(1 − γ̂i) and σ̂1² = Σ(1 − γ̂i)(yi − μ̂1)² / Σ(1 − γ̂i), similarly μ̂2, σ̂2² with weights γ̂i, and the mixing probability π̂ = Σ γ̂i / N • 4. Iterate steps 2 and 3 (E and M) until convergence
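
A minimal numpy sketch of these four steps for the two-component mixture; the synthetic data, the crude initial guesses, and the fixed number of iterations (standing in for a convergence check) are all assumptions of the example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic data from a two-component mixture (for illustration only).
y = np.concatenate([rng.normal(1.0, 0.5, 60), rng.normal(4.0, 1.0, 40)])

# Step 1: initial guesses.
pi, mu1, var1, mu2, var2 = 0.5, y.min(), y.var(), y.max(), y.var()

for _ in range(100):
    # Step 2 (E): responsibilities of component 2 for each observation.
    p1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(var1))
    p2 = pi * norm.pdf(y, mu2, np.sqrt(var2))
    gamma = p2 / (p1 + p2)

    # Step 3 (M): weighted means, variances and mixing probability.
    mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
    var1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
    mu2 = np.sum(gamma * y) / np.sum(gamma)
    var2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
    pi = gamma.mean()

print(pi, mu1, var1, mu2, var2)
```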

  27. EM – Example (cont’d) • Figure: selected iterations of the EM algorithm for the mixture example

  28. EM – When is it Useful? • In the context of hidden Markov models, EM is known as the Baum-Welch algorithm • EM is used for maximizing the likelihood in a certain class of problems where direct ML is difficult • Data augmentation: the observed data are enlarged with latent (unobserved) variables • In the previous example, the unobserved variables were the Δi • EM is also widely used when part of the actual data is missing: treat the missing data as latent variables

  29. EM Algorithm – General Case • Assume the observed data is Z, with log-likelihood ℓ(θ; Z) depending on the parameter θ • The missing data is Zm; the complete data is T = (Z, Zm), with log-likelihood ℓ0(θ; T) based on the complete density • In the mixture problem, T = (Z, Zm) = (y, Δ) • The observed-data log-likelihood can be written ℓ(θ; Z) = ℓ0(θ; T) − ℓ1(θ; Zm|Z), where ℓ1 is based on the conditional density Pr(Zm|Z, θ)

  30. EM Algorithm – General Case • 1. Start with an initial guess for the parameter, θ̂(0) • 2. Expectation step: at the j-th step, compute Q(θ', θ̂(j)) = E[ℓ0(θ'; T) | Z, θ̂(j)] as a function of the dummy argument θ' • 3. Maximization step: determine the new estimate θ̂(j+1) as the maximizer of Q(θ', θ̂(j)) over θ' • 4. Iterate steps 2 and 3 (E and M) until convergence

  31. Why Does EM Work? • Recall that in terms of the log-likelihoods, ℓ(θ'; Z) = ℓ0(θ'; T) − ℓ1(θ'; Zm|Z), where ℓ0 is based on the complete density and ℓ1 on the conditional density Pr(Zm|Z, θ) • Taking expectations with respect to the conditional distribution of T|Z governed by the parameter θ gives ℓ(θ'; Z) = E[ℓ0(θ'; T)|Z, θ] − E[ℓ1(θ'; Zm|Z)|Z, θ] = Q(θ', θ) − R(θ', θ) • R(θ*, θ) is the expectation of the log-likelihood of a density (indexed by θ*) with respect to the same density indexed by θ, so it is maximized as a function of θ* when θ* = θ (Jensen's inequality) • Hence if θ' maximizes Q(θ', θ), then ℓ(θ'; Z) − ℓ(θ; Z) ≥ 0: the EM iteration never decreases the log-likelihood!

  32. MCMC Methods • Slides (for MCMC) are mostly borrowed from Sujit Sahu, www.maths.soton.ac.uk/staff/Sahu/ • Assume that we want to evaluate an integral ∫ from a to b of h(x) dx • If the interval [a, b] is divided so that a = x0 < x1 < ... < xN = b, the integral can be approximated by the sum Σi h(xi)(xi − xi−1) • Now assume instead that we need to find an expectation E_π[h(X)] = ∫ h(x) π(x) dx, where π is a probability density • This can be difficult!

  33. MCMC for Integration (Cont’d) • If we can draw samples X(1), ..., X(N) ~ π(x) • Then we can estimate E_π[h(X)] ≈ (1/N) Σt h(X(t)) • This is Monte Carlo (MC) integration • Note the change of notation: N is now the number of samples Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/
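
A sketch of plain Monte Carlo integration of E_π[h(X)], assuming for illustration that π is a standard normal and h(x) = x², so the true value of the expectation is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.standard_normal(N)             # independent draws from pi(x) = N(0, 1)
hx = x ** 2                            # h(X); here E_pi[h(X)] = 1

estimate = hx.mean()                   # Monte Carlo estimate of the expectation
nse_iid = hx.std(ddof=1) / np.sqrt(N)  # numerical standard error under independence
print(estimate, nse_iid)
```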

  34. Consistency of the Integral Estimate • For independent samples, by the law of large numbers, (1/N) Σt h(X(t)) → E_π[h(X)] as N → ∞ • But independent sampling from π(x) can be difficult • It turns out that the above convergence still holds if we generate the samples using a Markov chain whose stationary distribution is π Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  35. Markov Chains (Review) • A Markov chain is generated by sampling X(t+1) ~ p(x | X(t)), t = 0, 1, 2, ..., where p is the transition kernel • So X(t+1) depends only on X(t), not on X(0), X(1), ..., X(t−1) • For example, X(t+1) = 0.5 X(t) + ε(t) with ε(t) ~ N(0, 1), i.e., X(t+1) | X(t) ~ N(0.5 X(t), 1) • This is the first-order autoregressive process with lag-1 autocorrelation 0.5 Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/
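
A sketch simulating this AR(1) chain; the chain length, burn-in and starting value are illustrative. The transition only looks at the current state, which is the Markov property in action.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
x = np.empty(T)
x[0] = 10.0                              # start deliberately far from the stationary mean
for t in range(T - 1):
    # X(t+1) depends only on X(t): first-order autoregression, lag-1 correlation 0.5
    x[t + 1] = 0.5 * x[t] + rng.standard_normal()

# After burn-in the draws behave like draws from the stationary distribution N(0, 1/(1-0.25)).
print(x[1000:].mean(), x[1000:].var())   # roughly 0 and roughly 4/3
```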

  36. Markov Chains (Stationarity) • As t → ∞, the Markov chain converges (in distribution) to its stationary (invariant) distribution • In the previous example this is N(0, 1/(1 − 0.5²)) = N(0, 4/3) • This limit does not depend on the starting value x(0) • Does this happen for all Markov chains? Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  37. Markov Chains (Irreducibility) • Assuming that a stationary distribution exists, it is unique if the chain is irreducible • Irreducible means that any set of states can be reached from any other state in a finite number of moves • An example of a reducible Markov chain: suppose the state space splits into two sets A and B with p(x|y) = 0 for x ∈ A, y ∈ B and vice versa; the chain can then never move between A and B Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  38. Markov Chains (Aperiodicity) • A Markov chain taking only a finite number of values is aperiodic if the greatest common divisor of the return times to any particular state (i, say) is 1 • Think of recording the number of steps taken to return to state i; the g.c.d. of those numbers should be 1 • If the g.c.d. is bigger than 1, say 2, then the chain can only return in cycles of 2, 4, 6, ... steps; this is not allowed for aperiodicity • The definition can be extended to the general state-space case Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  39. Markov Chains (Ergodicity) • Assume the Markov chain has a stationary distribution π(x) and is aperiodic and irreducible • Then the ergodic theorem holds: (1/N) Σt h(X(t)) → E_π[h(X)] as N → ∞ • Also, for such chains with finite variance σ² = Var_π[h(X)] < ∞, the central limit theorem holds and convergence occurs geometrically Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  40. Numerical Standard Error (nse) • The numerical standard error (nse) of the Monte Carlo estimate is the standard deviation of its sampling distribution, nse = sqrt(Var[(1/N) Σt h(X(t))]) • In general, no simpler expression exists for the nse, because of the autocorrelation in {h(X(t))} Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  41. Numerical Standard Error (nse) • If {h(X(t))} can be approximated as a first-order autoregressive process, then nse ≈ sqrt(Var_π[h(X)]/N) · sqrt((1 + ρ)/(1 − ρ)) • where ρ is the lag-1 autocorrelation of {h(X(t))} • The first factor is the usual term under independent sampling • The second factor is usually > 1, and is thus the penalty to be paid because a Markov chain has been used Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/
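
A sketch of this AR(1)-corrected nse applied to the chain x from the AR(1) sketch above, with h(x) = x; the burn-in length is an arbitrary choice and the lag-1 autocorrelation is estimated from the sampled path.

```python
import numpy as np

hx = x[1000:]                                   # h(X(t)) = X(t), after burn-in
N = hx.size
rho = np.corrcoef(hx[:-1], hx[1:])[0, 1]        # estimated lag-1 autocorrelation

nse_iid = hx.std(ddof=1) / np.sqrt(N)           # usual term under independent sampling
nse_ar1 = nse_iid * np.sqrt((1 + rho) / (1 - rho))   # penalty for using a Markov chain
print(rho, nse_iid, nse_ar1)
```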

  42. More on nse • The nse may not be finite in general • It is finite if the chain converges geometrically • If the nse is finite, then we can make it as small as we like by increasing N • The ‘obvious’ estimator of nse is not consistent Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  43. Markov Chains -- Summary • A Markov chain may have a stationary distribution • The stationary distribution is unique if the chain is irreducible • We can estimate nse's if the chain is also geometrically convergent • Where does all this get us? Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  44. MCMC • How do we construct a Markov chain whose stationary distribution is our target distribution π(x)? • Metropolis et al. (1953) showed how • The method was generalized by Hastings (1970) • This is called Markov chain Monte Carlo (MCMC) Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  45. Metropolis-Hastings Algorithm • At each iteration t • Step 1: sample y ~ q(y | x(t)), where y is the candidate point and q(·) is the proposal density • Step 2: with probability α(x(t), y) = min{1, [π(y) q(x(t)|y)] / [π(x(t)) q(y|x(t))]} set x(t+1) = y (acceptance), else set x(t+1) = x(t) (rejection) Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/
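
A sketch of a random-walk Metropolis-Hastings sampler for an unnormalized target; the target exp(−x⁴/4), the step size and the chain length are purely illustrative. The Gaussian proposal is symmetric, so the proposal ratio q(x|y)/q(y|x) equals 1 here.

```python
import numpy as np

def log_target(x):
    return -x**4 / 4.0                  # unnormalized target: pi(x) proportional to exp(-x^4/4)

rng = np.random.default_rng(0)
T, step = 20_000, 1.0
x = np.empty(T)
x[0] = 0.0
accepted = 0
for t in range(T - 1):
    y = x[t] + step * rng.standard_normal()      # candidate from q(y|x) = N(x, step^2)
    # Log acceptance probability; the symmetric proposal makes the q-ratio equal to 1.
    log_alpha = min(0.0, log_target(y) - log_target(x[t]))
    if np.log(rng.uniform()) < log_alpha:
        x[t + 1] = y                             # accept the candidate
        accepted += 1
    else:
        x[t + 1] = x[t]                          # reject: stay at the current state

print("acceptance rate:", accepted / (T - 1))    # x[burn-in:] are (correlated) draws from pi
```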

  46. Metropolis-Hastings Notes • The normalizing constant of π(x) is not required to run the algorithm; it cancels in the ratio • If q(y|x) = π(y), then we obtain independent samples • Usually q is chosen so that q(y|x) is easy to sample from • Theoretically, any proposal density q(·|x) with the same support as π should work; however, some q's are better than others • The induced Markov chain has the desirable properties under mild conditions on π(x) Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  47. Gibbs Sampler • Gibbs sampling is a Markov chain Monte Carlo sampling method • Suppose that x = (x1, x2, ..., xk) is k (≥ 2) dimensional • The Gibbs sampler uses what are called the full (or complete) conditional distributions • Note that the full conditional of xi given all the other components is proportional to the joint density π(x1, ..., xk); often this helps in finding it Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  48. Gibbs Sampling (Cont’d) • Sample or update each component in turn from its full conditional, always conditioning on the most recent values of the other components Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  49. Gibbs Sampling -- Algorithm • 1. Take some initial values Uk(0), k = 1, 2, ..., K • 2. Repeat for t = 1, 2, ...: for k = 1, 2, ..., K generate Uk(t) from Pr(Uk(t) | U1(t), ..., Uk−1(t), Uk+1(t−1), ..., UK(t−1)) • 3. Continue step 2 until the joint distribution of (U1(t), U2(t), ..., UK(t)) does not change
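
A sketch of a two-component Gibbs sampler for a bivariate standard normal target with correlation ρ, where both full conditionals are normal and known in closed form; the value of ρ, the starting point and the chain length are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, T = 0.8, 10_000
u1, u2 = np.empty(T), np.empty(T)
u1[0], u2[0] = 5.0, 5.0                    # arbitrary initial values U(0)

for t in range(1, T):
    # Update each coordinate from its full conditional, always conditioning on the
    # most recent value of the other coordinate.
    u1[t] = rng.normal(rho * u2[t - 1], np.sqrt(1 - rho**2))   # U1(t) | U2(t-1)
    u2[t] = rng.normal(rho * u1[t],     np.sqrt(1 - rho**2))   # U2(t) | U1(t)

# After burn-in, the (U1, U2) draws come from the bivariate normal with correlation rho.
print(np.corrcoef(u1[1000:], u2[1000:])[0, 1])                 # roughly 0.8
```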

  50. Gibbs Sampling (Cont’d) • In Bayesian Inference, the goal is to draw a sample from the joint posterior of the parameters given the data Z • Gibbs sampling will be helpful if it is easy to sample from the conditional distribution of each parameter given the other parameters and Z • An example – Gaussian mixture problem is described on the next slide
