
Lectures 13,14 – Model Inference and Averaging



  1. Lectures 13,14 – Model Inference and Averaging Rice ECE697 Farinaz Koushanfar Fall 2006

  2. Summary • Bootstrap and maximum likelihood (ML) • Bayesian methods • The expectation maximization (EM) algorithm • MCMC for sampling from posterior • Bagging • Model averaging

  3. Empirical Distribution • Suppose x1, ..., xN are the observed outcomes of N iid random variables following an unknown PDF • The empirical distribution puts mass 1/N on each observation: P(X = a) = count(xi = a)/N • The empirical estimate of a parameter θ is computed from the empirical distribution by the formula that defines the parameter in terms of the true distribution • For example, the empirical estimate of the variance is σ̂² = (1/N) Σi (xi − x̄)² • Empirical estimates are often biased, and there is no guarantee that they have the best possible variance or other good properties.
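
As a concrete illustration, here is a minimal Python sketch of the plug-in (empirical) variance estimate; the data-generating distribution and sample size are illustrative. Dividing by N rather than N−1 is what makes the plug-in estimate biased.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=50)   # N iid draws from a PDF we pretend not to know

# Plug-in (empirical) estimate of the variance: apply the defining formula of the
# parameter to the empirical distribution, i.e. average over the observed sample.
emp_var = np.mean((x - x.mean()) ** 2)        # divides by N -> biased downward
unbiased_var = x.var(ddof=1)                  # divides by N-1, shown for comparison

print(emp_var, unbiased_var)
```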

  4. Bootstrap • Let T(x1, ..., xN) be an estimator of θ • The bootstrap generates repeated estimates from repeated "fake" samples; each fake sample is drawn according to the empirical distribution P(X = a) = count(xi = a)/N, i.e., by resampling the data with replacement • Taking R random samples with replacement gives R bootstrap estimates of θ; call these B1, ..., BR • What do we use the R bootstrap estimates for? The most common use is confidence intervals: (1) use the order statistics of the Br, e.g., for a 95% confidence interval take B(2.5%) and B(97.5%); (2) if we know that T(x1, ..., xN) is approximately Gaussian, base the CI on the sample standard deviation of the bootstrap estimates, T ± z(1−α/2) · sd(B1, ..., BR)
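
A minimal sketch of the nonparametric bootstrap for a generic estimator T, assuming numpy; the estimator (the sample median), sample and number of replicates R are illustrative. The 95% interval uses the 2.5% and 97.5% order statistics of the bootstrap estimates, as described above.

```python
import numpy as np

def bootstrap_ci(x, T, R=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the estimator T applied to sample x."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # R "fake" samples drawn with replacement from the empirical distribution
    boot = np.array([T(x[rng.integers(0, n, size=n)]) for _ in range(R)])
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return T(x), (lo, hi), boot.std(ddof=1)   # estimate, percentile CI, bootstrap SE

x = np.random.default_rng(1).exponential(scale=2.0, size=100)
print(bootstrap_ci(x, np.median))
```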

  5. Bootstrap-Example • Training data Z = {z1, z2, ..., zN}, with zi = (xi, yi) • Suppose we are fitting a cubic spline with 3 knots placed at quantiles of x (splines are one form of basis expansion, with basis functions centered at the knots) • This gives seven basis functions h1(x), h2(x), ..., h7(x)

  6. Bootstrap-Example (cont’d) • Write h(x)^T = (h1(x), ..., h7(x)) • Spline prediction: μ(x) = Σj βj hj(x) = h(x)^T β • Can think of μ(x) as an estimate of E(Y|X = x) • The usual least-squares estimate is β̂ = (H^T H)^(-1) H^T y, where H is the N×7 matrix with ij-th element hj(xi) • The estimated covariance of β̂ is (H^T H)^(-1) σ̂², with noise variance σ̂² = Σi (yi − μ̂(xi))² / N • How do we apply the bootstrap to this example?

  7. Bootstrap-Example (cont’d) • Draw B bootstrap datasets Z*, each by sampling N pairs zi = (xi, yi) from Z with replacement • To each bootstrap sample, fit a cubic spline μ̂*(x) • Example: 10 bootstrap spline fits (left) and the resulting pointwise confidence bands (right)
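
A sketch of this nonparametric bootstrap in Python, assuming a truncated-power cubic-spline basis with 3 knots (7 basis functions) stands in for the h1,…,h7 on the slides; the synthetic data, knot placement and B are all illustrative.

```python
import numpy as np

def h(x, knots):
    """Truncated-power cubic-spline basis: 4 polynomial terms + 3 knots = 7 functions."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)                 # N x 7 design matrix H

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
knots = np.quantile(x, [0.25, 0.5, 0.75])        # knots at quantiles, as on the slide

H = h(x, knots)
beta_hat = np.linalg.lstsq(H, y, rcond=None)[0]  # usual least-squares fit

# Nonparametric bootstrap: resample (x_i, y_i) pairs with replacement and refit.
B, grid = 200, np.linspace(0, 1, 100)
curves = np.empty((B, grid.size))
for b in range(B):
    idx = rng.integers(0, x.size, size=x.size)
    beta_b = np.linalg.lstsq(h(x[idx], knots), y[idx], rcond=None)[0]
    curves[b] = h(grid, knots) @ beta_b

lo, hi = np.percentile(curves, [2.5, 97.5], axis=0)   # pointwise 95% bands
```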

  8. Least Squares and Bootstrap - Example

  9. Least Squares, Bootstrap, and ML • The previous example was the nonparametric bootstrap • Suppose instead that the errors are Gaussian: ε ~ N(0, σ²) • In the parametric bootstrap, we draw samples by adding Gaussian noise to the predicted values: yi* = μ̂(xi) + εi*, with εi* ~ N(0, σ̂²) • The process is repeated B times, re-computing the spline on each sample; the confidence bands from this method match the least-squares bands as B → ∞ • The function estimated from a parametric bootstrap sample has the distribution μ̂*(x) ~ N(μ̂(x), h(x)^T (H^T H)^(-1) h(x) σ̂²)
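
A sketch of the parametric bootstrap step under the Gaussian-error assumption, reusing the illustrative objects h, H, beta_hat, x, y, knots and grid from the previous sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_hat = H @ beta_hat                               # fitted values at the training x's
sigma2_hat = np.mean((y - mu_hat) ** 2)             # noise-variance estimate

B = 200
curves = np.empty((B, grid.size))
for b in range(B):
    # Parametric bootstrap: add Gaussian noise to the *predicted* values ...
    y_star = mu_hat + rng.normal(scale=np.sqrt(sigma2_hat), size=y.size)
    # ... and refit the spline to the simulated responses (same x's, same design H).
    beta_star = np.linalg.lstsq(H, y_star, rcond=None)[0]
    curves[b] = h(grid, knots) @ beta_star

lo, hi = np.percentile(curves, [2.5, 97.5], axis=0)  # approaches the least-squares bands
```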

  10. Maximum Likelihood (ML) Inference • In general, the bootstrap agrees not with least squares but with maximum likelihood • Specify a probability density (or mass) function for the observations: zi ~ gθ(z) • ML is based on the likelihood function L(θ; Z) = ∏i gθ(zi), the probability of the observed data under the model gθ • The logarithm of L(θ; Z), denoted ℓ(θ; Z), is the log-likelihood function • ML chooses the value of θ that maximizes ℓ(θ; Z)
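
A generic sketch of ML fitting by numerically maximizing the log-likelihood with scipy; a Gaussian model and simulated data are used purely for illustration (the true values 3.0 and 1.5 are assumptions of the example).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=1.5, size=200)        # observations z_i ~ g_theta(z)

def neg_loglik(params, z):
    mu, log_sigma = params                           # sigma parameterized on the log scale
    return -np.sum(norm.logpdf(z, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), args=(z,))
mu_ml, sigma_ml = res.x[0], np.exp(res.x[1])
print(mu_ml, sigma_ml)                               # should be close to 3.0 and 1.5
```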

  11. ML – Some Definitions • The score function is the gradient of the log-likelihood, ∂ℓ(θ; Z)/∂θ = Σi ∂ℓ(θ; zi)/∂θ • Assuming the maximum is in the interior of the parameter space, the score is zero at the ML estimate θ̂ • The information matrix is I(θ) = −Σi ∂²ℓ(θ; zi)/∂θ∂θ^T • I(θ) evaluated at θ = θ̂ is called the observed information • The Fisher information (or expected information) is i(θ) = E[I(θ)]

  12. ML – Some More Results • Assume independent sampling from gθ(z) • The sampling distribution of the ML estimator has a limiting normal distribution: θ̂ → N(θ0, i(θ0)^(-1)) as N → ∞, where θ0 is the true parameter • The standard error of θ̂j is estimated by sqrt(I(θ̂)^(-1)jj) • A confidence interval for θj therefore has the form θ̂j ± z(1−α) · sqrt(I(θ̂)^(-1)jj)

  13. ML for Our Smoothing Example • The parameters are θ = (β, σ²); the log-likelihood is ℓ(θ) = −(N/2) log(2πσ²) − (1/2σ²) Σi (yi − h(xi)^T β)² • The ML estimates are obtained by setting ∂ℓ/∂β = 0 and ∂ℓ/∂σ² = 0, giving β̂ = (H^T H)^(-1) H^T y and σ̂² = (1/N) Σi (yi − μ̂(xi))², the same as the least-squares fit • The information matrix for θ = (β, σ²) is block-diagonal: I(θ) = diag(H^T H/σ², N/(2σ⁴))

  14. Bayesian Approach to Inference • Specify a sampling model Pr(Z|θ), the pdf of the data given the parameters, and a prior distribution Pr(θ), reflecting our knowledge about θ before we see the new data • The posterior distribution Pr(θ|Z) = Pr(Z|θ)Pr(θ) / ∫ Pr(Z|θ)Pr(θ) dθ corresponds to our updated knowledge about θ after we see the new data • The difference between Bayesian and standard inference is that the Bayesian approach uses a prior to express the uncertainty before seeing the data, and a posterior to express the uncertainty remaining after seeing it

  15. Bayesian Approach (Cont’d) • Predict the value of a future observation via the predictive distribution Pr(znew|Z) = ∫ Pr(znew|θ) Pr(θ|Z) dθ • ML would instead use Pr(znew|θ̂) to predict future data • Unlike the predictive distribution, this does not account for the uncertainty in estimating θ

  16. Bayesian Approach on Our Example • Parametric model: μ(x) = Σj βj hj(x) • Assume that σ² is known and that the randomness comes only from the variation of y around μ(x) • Assuming a finite number of basis functions, put the prior on the distribution of the coefficients: β ~ N(0, τΣ) • The implied prior for μ(x) is Gaussian, with covariance kernel K(x, x') = cov(μ(x), μ(x')) = τ h(x)^T Σ h(x') • The posterior distribution for β is also Gaussian

  17. Example (Cont’d) • The posterior for β has mean E(β|Z) = (H^T H + (σ²/τ)Σ^(-1))^(-1) H^T y and covariance (H^T H + (σ²/τ)Σ^(-1))^(-1) σ², so the corresponding posterior value for μ(x) is E(μ(x)|Z) = h(x)^T (H^T H + (σ²/τ)Σ^(-1))^(-1) H^T y • How do we choose Σ and the prior variance τ? Here take the prior correlation to be Σ = I and vary τ
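
A sketch of this Gaussian posterior computation, assuming the prior β ~ N(0, τI), a known noise variance, and reusing the illustrative objects h, H, y, grid, knots and sigma2_hat from the bootstrap sketches above; the value of τ is an arbitrary choice.

```python
import numpy as np

tau, sigma2 = 1.0, sigma2_hat            # prior variance tau; noise variance treated as known
p = H.shape[1]

# Posterior for beta under beta ~ N(0, tau*I) and Gaussian errors:
#   cov = (H'H/sigma^2 + I/tau)^{-1},   mean = cov @ H'y / sigma^2
post_cov = np.linalg.inv(H.T @ H / sigma2 + np.eye(p) / tau)
post_mean = post_cov @ H.T @ y / sigma2

# Draw posterior curves mu(x) = h(x)' beta, analogous to the bootstrap curves.
rng = np.random.default_rng(2)
betas = rng.multivariate_normal(post_mean, post_cov, size=200)
post_curves = betas @ h(grid, knots).T                 # 200 x len(grid)
```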

  18. Example (Cont’d) • Let’s look at the posterior curves and see the impact of the prior variance τ on the posterior • For large τ the posterior curves look like the bootstrap curves

  19. Bayes Inference – Example (from Wikipedia) • Suppose we wish to know the proportion r of voters in a large population who will vote "yes" in a referendum • Let n be the number of voters in a random sample (chosen with replacement, so that we have statistical independence) • Let m be the number of voters in that random sample who will vote "yes" • Suppose that we observe n = 10 voters and m = 7 say they will vote yes. From Bayes' theorem, f(r | n = 10, m = 7) = f(m = 7 | r, n = 10) f(r) / ∫0^1 f(m = 7 | r, n = 10) f(r) dr

  20. Example from Wikipedia (Cont’d) • From this we see that the posterior pdf is computed from the prior probability density function f(r) and the likelihood function L(r) = f(m = 7 | r, n = 10) • f(r) summarizes what we know about the distribution of r in the absence of any observation • We provisionally assume in this case that the prior distribution of r is uniform over the interval [0, 1]; that is, f(r) = 1 • If some additional background information is found, we should modify the prior accordingly. However, before we have any observations, all outcomes are taken to be equally likely.

  21. Example from Wikipedia (Cont’d) • Assuming random sampling, the likelihood function L(r) = P(m = 7 | r, n = 10) is just the probability of 7 successes in 10 trials for a binomial distribution: L(r) = C(10,7) r^7 (1 − r)^3 • As with the prior, the likelihood is open to revision; more complex assumptions will yield more complex likelihood functions. Maintaining the current assumptions, the normalizing factor is ∫0^1 C(10,7) r^7 (1 − r)^3 dr = 1/11 • For r ∈ [0, 1], the posterior distribution for r is then f(r | m = 7, n = 10) = 11 · C(10,7) r^7 (1 − r)^3 = 1320 r^7 (1 − r)^3, a Beta(8, 4) density

  22. Example from Wikipedia (Cont’d) • One may be interested in the probability that more than half the voters will vote "yes" • The prior probability that more than half the voters will vote "yes" is 1/2, by the symmetry of the uniform distribution • In comparison, the posterior probability that more than half the voters will vote "yes", i.e., the conditional probability given the outcome of the opinion poll (that seven of the 10 voters questioned will vote "yes"), is ∫ from 1/2 to 1 of 1320 r^7 (1 − r)^3 dr ≈ 0.887
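
A sketch that checks this number with scipy: since the posterior 1320·r^7(1−r)^3 is a Beta(8, 4) density, the probability that r exceeds 1/2 can be read off its survival function, or obtained by integrating the density from the slide directly.

```python
from scipy.stats import beta
from scipy.integrate import quad

# Posterior f(r | m=7, n=10) with a uniform prior is Beta(8, 4).
post = beta(8, 4)
print(post.sf(0.5))                        # P(r > 1/2 | data) ~ 0.887

# Same number by integrating the posterior density written on the slide.
print(quad(lambda r: 1320 * r**7 * (1 - r)**3, 0.5, 1.0)[0])
```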

  23. Expectation Maximization (EM) Algorithm • The EM algorithm is used to simplify difficult ML problems; we will show an example first • Assume we are fitting a simple mixture model

  24. EM - Example • Model: Y is a mixture of two normal distributions, Y1 ~ N(μ1, σ1²) and Y2 ~ N(μ2, σ2²) • Y = (1 − Δ)·Y1 + Δ·Y2, where Δ ∈ {0, 1} and Pr(Δ = 1) = π • Let φθ(y) denote the Gaussian density with parameters θ = (μ, σ²); the pdf of Y is gY(y) = (1 − π) φθ1(y) + π φθ2(y) • The parameters of the mixture model to be estimated by ML are θ = (π, μ1, σ1², μ2, σ2²) • The log-likelihood on N training cases is ℓ(θ; Z) = Σi log[(1 − π) φθ1(yi) + π φθ2(yi)]

  25. EM - Example • Direct maximization is numerically difficult because of the sum of terms inside the log • A simpler approach is to fix some of the values, solve for the others, and iterate; this is the core of the EM algorithm • Expectation step: do a soft assignment of each observation to one model, i.e., compute how much each model is responsible for explaining each data point • E.g., the responsibility of model 2 for observation i is γi = Pr(Δi = 1 | θ, Z) = π φθ2(yi) / [(1 − π) φθ1(yi) + π φθ2(yi)] • Maximization step: the responsibilities are used as weights in a weighted ML fit to update the parameter estimates

  26. EM Algorithm for 2-Component Gaussian Mixture Model • 1. Take initial guesses for the parameters π̂, μ̂1, σ̂1², μ̂2, σ̂2² • 2. Expectation step: compute the responsibilities γ̂i = π̂ φθ̂2(yi) / [(1 − π̂) φθ̂1(yi) + π̂ φθ̂2(yi)] • 3. Maximization step: compute the weighted means and variances, μ̂1 = Σ(1 − γ̂i)yi / Σ(1 − γ̂i) and σ̂1² = Σ(1 − γ̂i)(yi − μ̂1)² / Σ(1 − γ̂i), similarly μ̂2, σ̂2² with weights γ̂i, and the mixing probability π̂ = Σ γ̂i / N • 4. Iterate steps 2 and 3 (E and M) until convergence
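
A minimal numpy sketch of these four steps for the two-component mixture; the synthetic data, the crude initial guesses, and the fixed number of iterations (standing in for a convergence check) are all assumptions of the example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic data from a two-component mixture (for illustration only).
y = np.concatenate([rng.normal(1.0, 0.5, 60), rng.normal(4.0, 1.0, 40)])

# Step 1: initial guesses.
pi, mu1, var1, mu2, var2 = 0.5, y.min(), y.var(), y.max(), y.var()

for _ in range(100):
    # Step 2 (E): responsibilities of component 2 for each observation.
    p1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(var1))
    p2 = pi * norm.pdf(y, mu2, np.sqrt(var2))
    gamma = p2 / (p1 + p2)

    # Step 3 (M): weighted means, variances and mixing probability.
    mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
    var1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
    mu2 = np.sum(gamma * y) / np.sum(gamma)
    var2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
    pi = gamma.mean()

print(pi, mu1, var1, mu2, var2)
```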

  27. EM – Example (cont’d) • Figure: selected iterations of the EM algorithm for the mixture example

  28. EM – When is it Useful? • In the context of hidden Markov models, EM is known as the Baum-Welch algorithm • EM is used for maximizing the likelihood in a certain class of problems where direct ML is difficult • Data augmentation: the observed data are enlarged with latent (unobserved) variables • In the previous example, the unobserved variables were the Δi • EM is also widely used when part of the actual data is missing: treat the missing data as latent variables

  29. EM Algorithm – General Case • Assume the observed data is Z, with log-likelihood ℓ(θ; Z) depending on the parameter θ • The missing data is Zm; the complete data is T = (Z, Zm), with log-likelihood ℓ0(θ; T) based on the complete density • In the mixture problem, T = (Z, Zm) = (y, Δ) • The observed-data log-likelihood can be written ℓ(θ; Z) = ℓ0(θ; T) − ℓ1(θ; Zm|Z), where ℓ1 is based on the conditional density Pr(Zm|Z, θ)

  30. EM Algorithm – General Case • 1. Start with an initial guess for the parameter, θ̂(0) • 2. Expectation step: at the j-th step, compute Q(θ', θ̂(j)) = E[ℓ0(θ'; T) | Z, θ̂(j)] as a function of the dummy argument θ' • 3. Maximization step: determine the new estimate θ̂(j+1) as the maximizer of Q(θ', θ̂(j)) over θ' • 4. Iterate steps 2 and 3 (E and M) until convergence

  31. Why Does EM Work? • Recall that in terms of the log-likelihoods, ℓ(θ'; Z) = ℓ0(θ'; T) − ℓ1(θ'; Zm|Z), where ℓ0 is based on the complete density and ℓ1 on the conditional density Pr(Zm|Z, θ) • Taking expectations with respect to the conditional distribution of T|Z governed by the parameter θ gives ℓ(θ'; Z) = E[ℓ0(θ'; T)|Z, θ] − E[ℓ1(θ'; Zm|Z)|Z, θ] = Q(θ', θ) − R(θ', θ) • R(θ*, θ) is the expectation of the log-likelihood of a density (indexed by θ*) with respect to the same density indexed by θ, so it is maximized as a function of θ* when θ* = θ (Jensen's inequality) • Hence if θ' maximizes Q(θ', θ), then ℓ(θ'; Z) − ℓ(θ; Z) ≥ 0: the EM iteration never decreases the log-likelihood!

  32. MCMC Methods • Slides (for MCMC) are mostly borrowed from Sujit Sahu, www.maths.soton.ac.uk/staff/Sahu/ • Assume that we want to evaluate an integral ∫ from a to b of h(x) dx • If the interval [a, b] is divided so that a = x0 < x1 < ... < xN = b, the integral can be approximated by the sum Σi h(xi)(xi − xi−1) • Now assume instead that we need to find an expectation E_π[h(X)] = ∫ h(x) π(x) dx, where π is a probability density • This can be difficult!

  33. MCMC for Integration (Cont’d) • If we can draw samples X(1), ..., X(N) ~ π(x) • Then we can estimate E_π[h(X)] ≈ (1/N) Σt h(X(t)) • This is Monte Carlo (MC) integration • Note the change of notation: N is now the number of samples Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/
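
A sketch of plain Monte Carlo integration of E_π[h(X)], assuming for illustration that π is a standard normal and h(x) = x², so the true value of the expectation is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.standard_normal(N)             # independent draws from pi(x) = N(0, 1)
hx = x ** 2                            # h(X); here E_pi[h(X)] = 1

estimate = hx.mean()                   # Monte Carlo estimate of the expectation
nse_iid = hx.std(ddof=1) / np.sqrt(N)  # numerical standard error under independence
print(estimate, nse_iid)
```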

  34. Consistency of the Integral Estimate • For independent samples, by the law of large numbers, (1/N) Σt h(X(t)) → E_π[h(X)] as N → ∞ • But independent sampling from π(x) can be difficult • It turns out that the above convergence still holds if we generate the samples using a Markov chain whose stationary distribution is π Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  35. Markov Chains (Review) • A Markov chain is generated by sampling X(t+1) ~ p(x | X(t)), t = 0, 1, 2, ..., where p is the transition kernel • So X(t+1) depends only on X(t), not on X(0), X(1), ..., X(t−1) • For example, X(t+1) = 0.5 X(t) + ε(t) with ε(t) ~ N(0, 1), i.e., X(t+1) | X(t) ~ N(0.5 X(t), 1) • This is the first-order autoregressive process with lag-1 autocorrelation 0.5 Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/
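
A sketch simulating this AR(1) chain; the chain length, burn-in and starting value are illustrative. The transition only looks at the current state, which is the Markov property in action.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
x = np.empty(T)
x[0] = 10.0                              # start deliberately far from the stationary mean
for t in range(T - 1):
    # X(t+1) depends only on X(t): first-order autoregression, lag-1 correlation 0.5
    x[t + 1] = 0.5 * x[t] + rng.standard_normal()

# After burn-in the draws behave like draws from the stationary distribution N(0, 1/(1-0.25)).
print(x[1000:].mean(), x[1000:].var())   # roughly 0 and roughly 4/3
```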

  36. Markov Chains (Stationarity) • As t → ∞, the Markov chain converges (in distribution) to its stationary (invariant) distribution • In the previous example this is N(0, 1/(1 − 0.5²)) = N(0, 4/3) • This limit does not depend on the starting value x(0) • Does this happen for all Markov chains? Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  37. Markov Chains (Irreducibility) • Assuming that a stationary distribution exists, it is unique if the chain is irreducible • Irreducible means that any set of states can be reached from any other state in a finite number of moves • An example of a reducible Markov chain: suppose the state space splits into two sets A and B with p(x|y) = 0 for x ∈ A, y ∈ B and vice versa; the chain can then never move between A and B Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  38. Markov Chains (Aperiodicity) • A Markov chain taking only a finite number of values is aperiodic if the greatest common divisor of the return times to any particular state (i, say) is 1 • Think of recording the number of steps taken to return to state i; the g.c.d. of those numbers should be 1 • If the g.c.d. is bigger than 1, say 2, then the chain can only return in cycles of 2, 4, 6, ... steps; this is not allowed for aperiodicity • The definition can be extended to the general state-space case Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  39. Markov Chains (Ergodicity) • Assume the Markov chain has a stationary distribution π(x) and is aperiodic and irreducible • Then the ergodic theorem holds: (1/N) Σt h(X(t)) → E_π[h(X)] as N → ∞ • Also, for such chains with finite variance σ² = Var_π[h(X)] < ∞, the central limit theorem holds and convergence occurs geometrically Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  40. Numerical Standard Error (nse) • The numerical standard error (nse) of the Monte Carlo estimate is the standard deviation of its sampling distribution, nse = sqrt(Var[(1/N) Σt h(X(t))]) • In general, no simpler expression exists for the nse, because of the autocorrelation in {h(X(t))} Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  41. Numerical Standard Error (nse) • If {h(X(t))} can be approximated as a first-order autoregressive process, then nse ≈ sqrt(Var_π[h(X)]/N) · sqrt((1 + ρ)/(1 − ρ)) • where ρ is the lag-1 autocorrelation of {h(X(t))} • The first factor is the usual term under independent sampling • The second factor is usually > 1, and is thus the penalty to be paid because a Markov chain has been used Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/
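
A sketch of this AR(1)-corrected nse applied to the chain x from the AR(1) sketch above, with h(x) = x; the burn-in length is an arbitrary choice and the lag-1 autocorrelation is estimated from the sampled path.

```python
import numpy as np

hx = x[1000:]                                   # h(X(t)) = X(t), after burn-in
N = hx.size
rho = np.corrcoef(hx[:-1], hx[1:])[0, 1]        # estimated lag-1 autocorrelation

nse_iid = hx.std(ddof=1) / np.sqrt(N)           # usual term under independent sampling
nse_ar1 = nse_iid * np.sqrt((1 + rho) / (1 - rho))   # penalty for using a Markov chain
print(rho, nse_iid, nse_ar1)
```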

  42. More on nse • The nse may not be finite in general • It is finite if the chain converges geometrically • If the nse is finite, then we can make it as small as we like by increasing N • The ‘obvious’ estimator of nse is not consistent Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  43. Markov Chains -- Summary • A Markov chain may have a stationary distribution • The stationary distribution is unique if the chain is irreducible • We can estimate nse's if the chain is also geometrically convergent • Where does all this get us? Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  44. MCMC • How do we construct a Markov chain whose stationary distribution is our target distribution π(x)? • Metropolis et al. (1953) showed how • The method was generalized by Hastings (1970) • This is called Markov chain Monte Carlo (MCMC) Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  45. Metropolis-Hastings Algorithm • At each iteration t • Step 1: sample y ~ q(y | x(t)), where y is the candidate point and q(·) is the proposal density • Step 2: with probability α(x(t), y) = min{1, [π(y) q(x(t)|y)] / [π(x(t)) q(y|x(t))]} set x(t+1) = y (acceptance), else set x(t+1) = x(t) (rejection) Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/
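
A sketch of a random-walk Metropolis-Hastings sampler for an unnormalized target; the target exp(−x⁴/4), the step size and the chain length are purely illustrative. The Gaussian proposal is symmetric, so the proposal ratio q(x|y)/q(y|x) equals 1 here.

```python
import numpy as np

def log_target(x):
    return -x**4 / 4.0                  # unnormalized target: pi(x) proportional to exp(-x^4/4)

rng = np.random.default_rng(0)
T, step = 20_000, 1.0
x = np.empty(T)
x[0] = 0.0
accepted = 0
for t in range(T - 1):
    y = x[t] + step * rng.standard_normal()      # candidate from q(y|x) = N(x, step^2)
    # Log acceptance probability; the symmetric proposal makes the q-ratio equal to 1.
    log_alpha = min(0.0, log_target(y) - log_target(x[t]))
    if np.log(rng.uniform()) < log_alpha:
        x[t + 1] = y                             # accept the candidate
        accepted += 1
    else:
        x[t + 1] = x[t]                          # reject: stay at the current state

print("acceptance rate:", accepted / (T - 1))    # x[burn-in:] are (correlated) draws from pi
```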

  46. Metropolis-Hastings Notes • The normalizing constant of π(x) is not required to run the algorithm; it cancels in the ratio • If q(y|x) = π(y), then we obtain independent samples • Usually q is chosen so that q(y|x) is easy to sample from • Theoretically, any proposal density q(·|x) with the same support as π should work; however, some q's are better than others • The induced Markov chain has the desirable properties under mild conditions on π(x) Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  47. Gibbs Sampler • Gibbs sampling is a Markov chain Monte Carlo sampling method • Suppose that x = (x1, x2, ..., xk) is k (≥ 2) dimensional • The Gibbs sampler uses what are called the full (or complete) conditional distributions • Note that the full conditional of xi given all the other components is proportional to the joint density π(x1, ..., xk); often this helps in finding it Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  48. Gibbs Sampling (Cont’d) • Sample or update each component in turn from its full conditional, always conditioning on the most recent values of the other components Slide is courtesy of Sujit Sahu http://www.maths.soton.ac.uk/staff/Sahu/

  49. Gibbs Sampling -- Algorithm • 1. Take some initial values Uk(0), k = 1, 2, ..., K • 2. Repeat for t = 1, 2, ...: for k = 1, 2, ..., K generate Uk(t) from Pr(Uk(t) | U1(t), ..., Uk−1(t), Uk+1(t−1), ..., UK(t−1)) • 3. Continue step 2 until the joint distribution of (U1(t), U2(t), ..., UK(t)) does not change
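
A sketch of a two-component Gibbs sampler for a bivariate standard normal target with correlation ρ, where both full conditionals are normal and known in closed form; the value of ρ, the starting point and the chain length are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, T = 0.8, 10_000
u1, u2 = np.empty(T), np.empty(T)
u1[0], u2[0] = 5.0, 5.0                    # arbitrary initial values U(0)

for t in range(1, T):
    # Update each coordinate from its full conditional, always conditioning on the
    # most recent value of the other coordinate.
    u1[t] = rng.normal(rho * u2[t - 1], np.sqrt(1 - rho**2))   # U1(t) | U2(t-1)
    u2[t] = rng.normal(rho * u1[t],     np.sqrt(1 - rho**2))   # U2(t) | U1(t)

# After burn-in, the (U1, U2) draws come from the bivariate normal with correlation rho.
print(np.corrcoef(u1[1000:], u2[1000:])[0, 1])                 # roughly 0.8
```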

  50. Gibbs Sampling (Cont’d) • In Bayesian Inference, the goal is to draw a sample from the joint posterior of the parameters given the data Z • Gibbs sampling will be helpful if it is easy to sample from the conditional distribution of each parameter given the other parameters and Z • An example – Gaussian mixture problem is described on the next slide
