
Tutorial 3






Presentation Transcript


1. Tutorial 3 • Maximum likelihood – an example • Maximum likelihood – another example • Bayesian estimation • EM for a mixture model • EM Algorithm General Setting • Jensen's inequality

2. Bayesian Estimation: General Theory • Bayesian learning considers θ (the parameter vector to be estimated) to be a random variable. Before we observe the data, the parameters are described by a prior p(θ), which is typically very broad. Once we have observed the data, we can make use of Bayes' formula to find the posterior p(θ | X^(n)). Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning.

3. Bayesian parametric estimation • Density function for x, given the training data set X^(n) (it was defined in Lecture 2). • From the definition of conditional probability densities, p(x | X^(n)) is an average over θ (see the reconstruction below). • The first factor, p(x | θ, X^(n)) = p(x | θ), is independent of X^(n), since it is just our assumed form for the parameterized density. • Therefore the predictive density is the posterior-weighted average of p(x | θ).
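The integrals dropped from this slide are presumably the standard predictive-density identities; a reconstruction in the notation above (θ for the parameter vector, X^(n) for the training set):

```latex
% Bayesian predictive density (reconstruction of the missing slide formulas)
p(x \mid X^{(n)})
  = \int p(x, \theta \mid X^{(n)})\, d\theta
  = \int p(x \mid \theta, X^{(n)})\, p(\theta \mid X^{(n)})\, d\theta
  = \int p(x \mid \theta)\, p(\theta \mid X^{(n)})\, d\theta .
```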

4. Bayesian parametric estimation • Instead of choosing a specific value for θ, the Bayesian approach performs a weighted average over all values of θ. • If the weighting factor p(θ | X^(n)), which is the posterior of θ, peaks very sharply about some value θ̂, we obtain p(x | X^(n)) ≈ p(x | θ̂). • Thus the optimal estimator is the most likely value of θ given the data and the prior of θ.

5. Bayesian decision making • Suppose we know the distribution of possible values of θ, that is, a prior p(θ). • Suppose we also have a loss function λ(θ̂, θ) which measures the penalty for estimating θ̂ when the actual value is θ. • Then we may formulate the estimation problem as Bayesian decision making: choose the value of θ̂ which minimizes the risk. • Note that the loss function is usually continuous.
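The risk formula itself is missing from the slide; presumably it is the posterior expected loss, reconstructed here as a sketch:

```latex
% Bayesian risk of an estimate (reconstruction)
R(\hat\theta \mid X^{(n)}) = \int \lambda(\hat\theta, \theta)\, p(\theta \mid X^{(n)})\, d\theta ,
\qquad
\hat\theta^{*} = \arg\min_{\hat\theta} R(\hat\theta \mid X^{(n)}) .
```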

6. Maximum A-Posteriori (MAP) Estimation • Let us look at the posterior p(θ | X^(n)): the optimal estimator is the most likely value of θ given the data and the prior of θ. • This "most likely value" is given by the maximizer of the posterior.

7. Maximum A-Posteriori (MAP) Estimation • By Bayes' formula, and since the data are i.i.d., the posterior is proportional to the prior times a product of per-sample likelihoods. • We can disregard the normalizing factor p(X^(n)) when looking for the maximum.

8. MAP – continued • So, the θ̂_MAP we are looking for is given by the expression reconstructed below.
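Putting slides 6–8 together (a reconstruction; x_1, …, x_n are the i.i.d. training samples):

```latex
% MAP estimator (reconstruction of the formulas on slides 6-8)
\hat\theta_{\mathrm{MAP}}
  = \arg\max_{\theta}\, p(\theta \mid X^{(n)})
  = \arg\max_{\theta}\, \frac{p(X^{(n)} \mid \theta)\, p(\theta)}{p(X^{(n)})}
  = \arg\max_{\theta}\, p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)
  = \arg\max_{\theta}\, \Big[ \log p(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta) \Big].
```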

9. Maximum likelihood • In the MAP estimator, the larger n (the size of the data set), the less important the prior p(θ) is in the expression above; this can motivate us to omit the prior altogether. • What we get is the maximum likelihood (ML) method. • Informally: we don't use any prior knowledge about the parameters; we seek those values that "explain" the data in the best way. • l(θ) = log p(X^(n) | θ) is the log-likelihood of θ with respect to X^(n). • We seek a maximum of the likelihood function, the log-likelihood, or any monotonically increasing function of them.
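A reconstruction of the ML objective the slide refers to (assuming, as before, i.i.d. samples):

```latex
% Maximum likelihood estimator (reconstruction)
\hat\theta_{\mathrm{ML}}
  = \arg\max_{\theta} p(X^{(n)} \mid \theta)
  = \arg\max_{\theta} l(\theta),
\qquad
l(\theta) = \log p(X^{(n)} \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta).
```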

10. Maximum likelihood – an example • Let us find the ML estimator for the parameter θ of the exponential density function; the log is monotonic, so we are actually looking for the maximum of the log-likelihood. • Observe the form of the log-likelihood, differentiate with respect to θ, and set the derivative to zero. • The maximum is achieved at the point reconstructed below: we have got the empirical mean (average).
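The density and the algebra were images on the original slide; here is a sketch assuming the scale parameterization p(x | θ) = (1/θ) e^(−x/θ), x ≥ 0, chosen so that the answer comes out as the empirical mean, as the slide states:

```latex
% ML for the exponential density under the assumed scale parameterization
l(\theta) = \sum_{i=1}^{n} \log \Big( \tfrac{1}{\theta} e^{-x_i/\theta} \Big)
          = -n \log \theta - \frac{1}{\theta} \sum_{i=1}^{n} x_i ,
\qquad
\frac{\partial l}{\partial \theta}
  = -\frac{n}{\theta} + \frac{1}{\theta^{2}} \sum_{i=1}^{n} x_i = 0
  \;\Longrightarrow\;
  \hat\theta_{\mathrm{ML}} = \frac{1}{n} \sum_{i=1}^{n} x_i .
```

Under the rate parameterization p(x | θ) = θ e^(−θx) the same steps give the reciprocal of the empirical mean instead.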

11. Maximum likelihood – another example • Let us find the ML estimator for a density whose log-likelihood is a sum of absolute deviations (reconstructed below). • Observe the log-likelihood and where its derivative changes sign. • The maximum is at the point where equally many samples lie on either side. • This is the median of the sampled data.
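The density is missing from the slide; the median answer suggests the Laplace (double-exponential) density, so the following sketch assumes p(x | θ) = (1/2) e^(−|x−θ|):

```latex
% ML under the assumed Laplace density
l(\theta) = \sum_{i=1}^{n} \log \Big( \tfrac{1}{2} e^{-|x_i - \theta|} \Big)
          = -n \log 2 - \sum_{i=1}^{n} |x_i - \theta| ,
\qquad
\frac{\partial l}{\partial \theta} = \sum_{i=1}^{n} \operatorname{sign}(x_i - \theta) = 0 ,
```

so the maximum is where equally many samples lie on each side of θ, i.e. the median of x_1, …, x_n.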

12. Bayesian estimation – revisited • We saw the Bayesian estimator for the 0/1 loss function (MAP). • What happens when we assume other loss functions? • Example 1: absolute-error loss λ(θ̂, θ) = |θ̂ − θ| (θ is unidimensional). • The total Bayesian risk here is the posterior expectation of |θ̂ − θ|. • We seek its minimum.

13. Bayesian estimation – continued • At the θ̂ which is the solution, the posterior mass below θ̂ equals the posterior mass above it. • That is, for the absolute-error loss the optimal Bayesian estimator for the parameter is the median of the posterior distribution. • Example 2: squared-error loss λ(θ̂, θ) = (θ̂ − θ)². • Total Bayesian risk: the posterior expectation of (θ̂ − θ)². • Again, in order to find the minimum, set the derivative equal to 0.

14. Bayesian estimation – continued • The optimal estimator here is the conditional expectation of θ given the data X^(n).
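A sketch of the algebra behind both examples (writing p(θ | X^(n)) for the posterior):

```latex
% Example 1: absolute-error loss -> posterior median
R(\hat\theta) = \int |\hat\theta - \theta|\, p(\theta \mid X^{(n)})\, d\theta ,
\quad
\frac{\partial R}{\partial \hat\theta}
  = \int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta
  - \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta = 0
  \;\Longrightarrow\; P(\theta < \hat\theta \mid X^{(n)}) = \tfrac{1}{2} .

% Example 2: squared-error loss -> posterior mean
R(\hat\theta) = \int (\hat\theta - \theta)^{2}\, p(\theta \mid X^{(n)})\, d\theta ,
\quad
\frac{\partial R}{\partial \hat\theta}
  = 2 \int (\hat\theta - \theta)\, p(\theta \mid X^{(n)})\, d\theta = 0
  \;\Longrightarrow\; \hat\theta = E[\theta \mid X^{(n)}] .
```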

15. Mixture Models

16. Mixture Models • Introduce, for each data point, a multinomial (1-of-K) random variable Z^n with components Z_k^n: Z_k^n = 1 if and only if Z^n takes the k-th value, and Z_k^n = 0 otherwise. • Note that Σ_k Z_k^n = 1.

17. Mixture Models • The prior probabilities of the components are P(Z_k^n = 1) = π_k, where Σ_k π_k = 1. • The marginal probability of X is the mixture of the component densities.

18. Mixture Models • Conditional probability of Z given the observation: define the posterior (the responsibility of component k), as sketched below. • A mixture model as a graphical model: Z is a multinomial latent variable with an arrow to the observed variable X.
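The formulas dropped from the last two slides are presumably the standard mixture identities; a reconstruction with component densities p(x | θ_k) and mixing proportions π_k:

```latex
% Marginal of X and posterior ("responsibility") of component k (reconstruction)
p(x_n) = \sum_{k=1}^{K} P(Z_k^n = 1)\, p(x_n \mid Z_k^n = 1)
       = \sum_{k=1}^{K} \pi_k\, p(x_n \mid \theta_k),
\qquad
\tau_{nk} \equiv P(Z_k^n = 1 \mid x_n)
  = \frac{\pi_k\, p(x_n \mid \theta_k)}{\sum_{j=1}^{K} \pi_j\, p(x_n \mid \theta_j)} .
```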

19. Unconditional Mixture Models • Conditional mixture models are used to solve regression and classification (supervised) problems; they need observations of the data X and the labels Y, that is, (X, Y) pairs. • Unconditional mixture models are used to solve density estimation problems; they need only observations of the data X. • Applications: detection of outliers, compression, unsupervised classification (clustering), …

20. Unconditional Mixture Models

21. Gaussian Mixture Models • Estimate the parameters (mixing proportions, means and covariances) from i.i.d. data D = {x1, …, xN}.
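The model and the log likelihood (referred to as (9) on later slides) were images; a reconstruction under the usual Gaussian-mixture notation, with parameters θ = {π_k, μ_k, Σ_k}:

```latex
% Gaussian mixture density and its log likelihood (reconstruction)
p(x \mid \theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad
l(\theta; D) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k).
```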

22. The K-means algorithm • Group the data D = {x1, …, xN} into a set of K clusters, where K is given. • Represent the i-th cluster by one vector, its mean μ_i; data points are assigned to the nearest mean. • Phase 1: the values of the indicator variables are set by assigning each point x_n to the closest mean. • Phase 2: recompute each mean from the points assigned to it; the two phases are iterated, as in the sketch below.
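A minimal NumPy sketch of the two-phase loop just described (the function name, the random initialization and the stopping rule are choices of this sketch, not part of the slides):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Two-phase K-means sketch: X has shape (N, d), K is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize the K means with randomly chosen data points.
    means = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Phase 1: assign each point to its closest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # (N, K)
        labels = dists.argmin(axis=1)
        # Phase 2: recompute each mean as the average of its assigned points.
        new_means = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else means[k]
            for k in range(K)
        ])
        if np.allclose(new_means, means):  # stop once the means no longer move
            break
        means = new_means
    return means, labels
```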

23. EM Algorithm • If Z^n were observed, it would be the "class label" of x_n, and the estimate of each mean would be a simple per-class average. • We don't know them, so we replace them by their conditional expectations, conditioning on the data. • But from (6), (7) this expectation depends on the parameter estimates, so we should iterate.

24. EM Algorithm • Iteration formulas (see the sketch below):
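The iteration formulas themselves were images; the sketch below shows the standard EM updates for a Gaussian mixture in the spirit of the slide (no numerical safeguards or convergence check, and the function name is a choice of this sketch):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture model (sketch). X has shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Crude initialization: random means, identity covariances, uniform mixing proportions.
    means = X[rng.choice(N, size=K, replace=False)].copy()
    covs = np.stack([np.eye(d) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E step: responsibilities tau[n, k] = P(Z_k^n = 1 | x_n, current parameters).
        weighted = np.stack([
            pis[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
            for k in range(K)
        ], axis=1)                                   # shape (N, K)
        tau = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: re-estimate means, covariances and mixing proportions.
        Nk = tau.sum(axis=0)                         # effective counts per component
        means = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (tau[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / N
    return pis, means, covs, tau
```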

25. EM Algorithm • The Expectation step is (14). • The Maximization step is the parameter updates (15)-(17). • What relationship does this algorithm have to the quantity we want to maximize, the log likelihood (9)? • Calculating the derivatives of l with respect to the parameters and setting them to zero leads to the conditions on the next slide.

26. EM Algorithm • Setting the derivative with respect to the means to zero yields the update for μ_k. • Analogously for the covariances, and for the mixing proportions (see the sketch below).
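A sketch of the stationarity conditions being referred to, written with the responsibilities τ_nk defined earlier; these coincide with the fixed points of the EM updates above:

```latex
% Setting the derivatives of l(\theta; D) to zero (sketch)
\mu_k = \frac{\sum_{n=1}^{N} \tau_{nk}\, x_n}{\sum_{n=1}^{N} \tau_{nk}},
\qquad
\Sigma_k = \frac{\sum_{n=1}^{N} \tau_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^{\top}}{\sum_{n=1}^{N} \tau_{nk}},
\qquad
\pi_k = \frac{1}{N} \sum_{n=1}^{N} \tau_{nk} .
```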

27. EM General Setting • EM is an iterative technique designed for probabilistic models with latent variables. • We have two sample spaces: • X, which is observed (the dataset) • Z, which is missing (latent) • A probability model is p(x, z | θ). • If we knew Z, we would do ML estimation by maximizing the complete log likelihood log p(x, z | θ).

28. EM General Setting • Z is not observed, so we calculate the incomplete log likelihood log p(x | θ). • Given that Z is not observed, the complete log likelihood is a random quantity and cannot be maximized directly. • Thus we average over Z using some "averaging distribution" q(z | x). • We hope that maximizing this surrogate expression will yield a value of θ which is an improvement on the initial value of θ.
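Reconstructing the two likelihoods this slide contrasts (a sketch in the notation above):

```latex
% Incomplete log likelihood and expected complete log likelihood (reconstruction)
l(\theta; x) = \log p(x \mid \theta) = \log \sum_{z} p(x, z \mid \theta),
\qquad
\big\langle l_c(\theta; x, z) \big\rangle_{q} = \sum_{z} q(z \mid x)\, \log p(x, z \mid \theta).
```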

29. EM General Setting • The distribution q(z | x) can be used to obtain a lower bound L(q, θ) on the log likelihood (reconstructed below). • EM is coordinate ascent on L(q, θ). • At the (t+1)-st iteration, for fixed θ^(t), we first maximize L(q, θ^(t)) with respect to q, which yields q^(t+1). For this q^(t+1) we then maximize L(q^(t+1), θ) with respect to θ, which yields θ^(t+1).
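The bound itself was an image; a reconstruction via Jensen's inequality (which is why Jensen's inequality is reviewed at the end of this tutorial):

```latex
% Lower bound on the incomplete log likelihood (reconstruction)
l(\theta; x) = \log \sum_{z} p(x, z \mid \theta)
             = \log \sum_{z} q(z \mid x)\, \frac{p(x, z \mid \theta)}{q(z \mid x)}
  \;\ge\; \sum_{z} q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
  \;\equiv\; \mathcal{L}(q, \theta).
```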

30. EM General Setting • E step: maximize L(q, θ^(t)) over q. • M step: maximize L(q^(t+1), θ) over θ. • The M step is equivalently viewed as the maximization of the expected complete log likelihood. Proof: expand L(q, θ) into two terms (see the sketch below). • The second term is independent of θ. Thus maximizing L(q^(t+1), θ) with respect to θ is equivalent to maximizing the expected complete log likelihood.
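A sketch of the two steps and of the decomposition used in the proof (L is the lower bound from the previous slide):

```latex
% E and M steps, and the decomposition behind the proof (sketch)
\text{E step:}\quad q^{(t+1)} = \arg\max_{q} \mathcal{L}(q, \theta^{(t)}),
\qquad
\text{M step:}\quad \theta^{(t+1)} = \arg\max_{\theta} \mathcal{L}(q^{(t+1)}, \theta),

\mathcal{L}(q, \theta)
  = \underbrace{\sum_{z} q(z \mid x)\, \log p(x, z \mid \theta)}_{\text{expected complete log likelihood}}
  \;-\; \underbrace{\sum_{z} q(z \mid x)\, \log q(z \mid x)}_{\text{independent of } \theta} .
```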

31. EM General Setting • The E step can be solved once and for all: the choice q^(t+1)(z | x) = p(z | x, θ^(t)) yields the maximum, since it makes the lower bound tight: L(q^(t+1), θ^(t)) = log p(x | θ^(t)).

32. Jensen's inequality • Definition: a function f is convex over (a, b) if the condition reconstructed below holds. (Figure: a convex and a concave function.) • Jensen's inequality: for a convex function f, the inequality below holds.
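Reconstructing the two missing formulas (standard statements; the symbols are choices of this sketch):

```latex
% Convexity and Jensen's inequality (reconstruction)
f \text{ is convex on } (a, b) \;\iff\;
f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)
\quad \forall\, x_1, x_2 \in (a, b),\; 0 \le \lambda \le 1 .

\text{Jensen's inequality: for convex } f, \qquad E[f(X)] \;\ge\; f(E[X]) .
```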

33. Jensen's inequality • For a discrete random variable with two mass points the inequality is exactly the definition of convexity. • Let Jensen's inequality hold for k−1 mass points; then for k mass points it follows from the induction assumption and from convexity, as sketched below.
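A sketch of the induction step (p_1, …, p_k are the point masses):

```latex
% Induction step for k mass points (sketch)
E[f(X)] = \sum_{i=1}^{k} p_i f(x_i)
        = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k}\, f(x_i)
        \;\ge\; p_k f(x_k) + (1 - p_k)\, f\!\Big( \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k}\, x_i \Big)
        \quad \text{(induction assumption)}
        \;\ge\; f\!\Big( p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k}\, x_i \Big)
        = f(E[X]) \quad \text{(convexity)} .
```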

34. Jensen's inequality corollary • Let λ_1, …, λ_k ≥ 0 with Σ_i λ_i = 1. • The function log is concave, so from Jensen's inequality we have the bound below.
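A reconstruction of the corollary; this is exactly the step that justified the EM lower bound L(q, θ) above:

```latex
% Corollary of Jensen's inequality for the concave function log (reconstruction)
\log \sum_{i=1}^{k} \lambda_i x_i \;\ge\; \sum_{i=1}^{k} \lambda_i \log x_i ,
\qquad \lambda_i \ge 0, \;\; \sum_{i=1}^{k} \lambda_i = 1, \;\; x_i > 0 .
```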
