
Tutorial 3






Presentation Transcript


1. Tutorial 3 • Maximum likelihood – an example • Maximum likelihood – another example • Bayesian estimation • EM for a mixture model • EM Algorithm General Setting • Jensen's inequality

2. Bayesian Estimation: General Theory • Bayesian learning considers θ (the parameter vector to be estimated) to be a random variable. Before we observe the data, the parameters are described by a prior p(θ), which is typically very broad. Once we have observed the data, we can make use of Bayes' formula to find the posterior p(θ | X^(n)). Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning.

3. Bayesian parametric estimation • Density function for x, given the training data set X^(n) (it was defined in Lecture 2). • From the definition of conditional probability densities, p(x | X^(n)) is an average over θ (see the reconstruction below). • The first factor, p(x | θ, X^(n)) = p(x | θ), is independent of X^(n), since it is just our assumed form for the parameterized density. • Therefore the predictive density is the posterior-weighted average of p(x | θ).
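The integrals dropped from this slide are presumably the standard predictive-density identities; a reconstruction in the notation above (θ for the parameter vector, X^(n) for the training set):

```latex
% Bayesian predictive density (reconstruction of the missing slide formulas)
p(x \mid X^{(n)})
  = \int p(x, \theta \mid X^{(n)})\, d\theta
  = \int p(x \mid \theta, X^{(n)})\, p(\theta \mid X^{(n)})\, d\theta
  = \int p(x \mid \theta)\, p(\theta \mid X^{(n)})\, d\theta .
```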

4. Bayesian parametric estimation • Instead of choosing a specific value for θ, the Bayesian approach performs a weighted average over all values of θ. • If the weighting factor p(θ | X^(n)), which is the posterior of θ, peaks very sharply about some value θ̂, we obtain p(x | X^(n)) ≈ p(x | θ̂). • Thus the optimal estimator is the most likely value of θ given the data and the prior of θ.

5. Bayesian decision making • Suppose we know the distribution of possible values of θ, that is, a prior p(θ). • Suppose we also have a loss function λ(θ̂, θ) which measures the penalty for estimating θ̂ when the actual value is θ. • Then we may formulate the estimation problem as Bayesian decision making: choose the value of θ̂ which minimizes the risk. • Note that the loss function is usually continuous.
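The risk formula itself is missing from the slide; presumably it is the posterior expected loss, reconstructed here as a sketch:

```latex
% Bayesian risk of an estimate (reconstruction)
R(\hat\theta \mid X^{(n)}) = \int \lambda(\hat\theta, \theta)\, p(\theta \mid X^{(n)})\, d\theta ,
\qquad
\hat\theta^{*} = \arg\min_{\hat\theta} R(\hat\theta \mid X^{(n)}) .
```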

6. Maximum A-Posteriori (MAP) Estimation • Let us look at the posterior p(θ | X^(n)): the optimal estimator is the most likely value of θ given the data and the prior of θ. • This "most likely value" is given by the maximizer of the posterior.

7. Maximum A-Posteriori (MAP) Estimation • By Bayes' formula, and since the data are i.i.d., the posterior is proportional to the prior times a product of per-sample likelihoods. • We can disregard the normalizing factor p(X^(n)) when looking for the maximum.

8. MAP – continued • So, the θ̂_MAP we are looking for is given by the expression reconstructed below.
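Putting slides 6–8 together (a reconstruction; x_1, …, x_n are the i.i.d. training samples):

```latex
% MAP estimator (reconstruction of the formulas on slides 6-8)
\hat\theta_{\mathrm{MAP}}
  = \arg\max_{\theta}\, p(\theta \mid X^{(n)})
  = \arg\max_{\theta}\, \frac{p(X^{(n)} \mid \theta)\, p(\theta)}{p(X^{(n)})}
  = \arg\max_{\theta}\, p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)
  = \arg\max_{\theta}\, \Big[ \log p(\theta) + \sum_{i=1}^{n} \log p(x_i \mid \theta) \Big].
```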

9. Maximum likelihood • In the MAP estimator, the larger n (the size of the data set), the less important the prior p(θ) is in the expression above; this can motivate us to omit the prior altogether. • What we get is the maximum likelihood (ML) method. • Informally: we don't use any prior knowledge about the parameters; we seek those values that "explain" the data in the best way. • l(θ) = log p(X^(n) | θ) is the log-likelihood of θ with respect to X^(n). • We seek a maximum of the likelihood function, the log-likelihood, or any monotonically increasing function of them.
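A reconstruction of the ML objective the slide refers to (assuming, as before, i.i.d. samples):

```latex
% Maximum likelihood estimator (reconstruction)
\hat\theta_{\mathrm{ML}}
  = \arg\max_{\theta} p(X^{(n)} \mid \theta)
  = \arg\max_{\theta} l(\theta),
\qquad
l(\theta) = \log p(X^{(n)} \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta).
```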

10. Maximum likelihood – an example • Let us find the ML estimator for the parameter θ of the exponential density function; the log is monotonic, so we are actually looking for the maximum of the log-likelihood. • Observe the form of the log-likelihood, differentiate with respect to θ, and set the derivative to zero. • The maximum is achieved at the point reconstructed below: we have got the empirical mean (average).
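The density and the algebra were images on the original slide; here is a sketch assuming the scale parameterization p(x | θ) = (1/θ) e^(−x/θ), x ≥ 0, chosen so that the answer comes out as the empirical mean, as the slide states:

```latex
% ML for the exponential density under the assumed scale parameterization
l(\theta) = \sum_{i=1}^{n} \log \Big( \tfrac{1}{\theta} e^{-x_i/\theta} \Big)
          = -n \log \theta - \frac{1}{\theta} \sum_{i=1}^{n} x_i ,
\qquad
\frac{\partial l}{\partial \theta}
  = -\frac{n}{\theta} + \frac{1}{\theta^{2}} \sum_{i=1}^{n} x_i = 0
  \;\Longrightarrow\;
  \hat\theta_{\mathrm{ML}} = \frac{1}{n} \sum_{i=1}^{n} x_i .
```

Under the rate parameterization p(x | θ) = θ e^(−θx) the same steps give the reciprocal of the empirical mean instead.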

11. Maximum likelihood – another example • Let us find the ML estimator for a density whose log-likelihood is a sum of absolute deviations (reconstructed below). • Observe the log-likelihood and where its derivative changes sign. • The maximum is at the point where equally many samples lie on either side. • This is the median of the sampled data.
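The density is missing from the slide; the median answer suggests the Laplace (double-exponential) density, so the following sketch assumes p(x | θ) = (1/2) e^(−|x−θ|):

```latex
% ML under the assumed Laplace density
l(\theta) = \sum_{i=1}^{n} \log \Big( \tfrac{1}{2} e^{-|x_i - \theta|} \Big)
          = -n \log 2 - \sum_{i=1}^{n} |x_i - \theta| ,
\qquad
\frac{\partial l}{\partial \theta} = \sum_{i=1}^{n} \operatorname{sign}(x_i - \theta) = 0 ,
```

so the maximum is where equally many samples lie on each side of θ, i.e. the median of x_1, …, x_n.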

12. Bayesian estimation – revisited • We saw the Bayesian estimator for the 0/1 loss function (MAP). • What happens when we assume other loss functions? • Example 1: absolute-error loss λ(θ̂, θ) = |θ̂ − θ| (θ is unidimensional). • The total Bayesian risk here is the posterior expectation of |θ̂ − θ|. • We seek its minimum.

13. Bayesian estimation – continued • At the θ̂ which is the solution, the posterior mass below θ̂ equals the posterior mass above it. • That is, for the absolute-error loss the optimal Bayesian estimator for the parameter is the median of the posterior distribution. • Example 2: squared-error loss λ(θ̂, θ) = (θ̂ − θ)². • Total Bayesian risk: the posterior expectation of (θ̂ − θ)². • Again, in order to find the minimum, set the derivative equal to 0.

14. Bayesian estimation – continued • The optimal estimator here is the conditional expectation of θ given the data X^(n).
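A sketch of the algebra behind both examples (writing p(θ | X^(n)) for the posterior):

```latex
% Example 1: absolute-error loss -> posterior median
R(\hat\theta) = \int |\hat\theta - \theta|\, p(\theta \mid X^{(n)})\, d\theta ,
\quad
\frac{\partial R}{\partial \hat\theta}
  = \int_{-\infty}^{\hat\theta} p(\theta \mid X^{(n)})\, d\theta
  - \int_{\hat\theta}^{\infty} p(\theta \mid X^{(n)})\, d\theta = 0
  \;\Longrightarrow\; P(\theta < \hat\theta \mid X^{(n)}) = \tfrac{1}{2} .

% Example 2: squared-error loss -> posterior mean
R(\hat\theta) = \int (\hat\theta - \theta)^{2}\, p(\theta \mid X^{(n)})\, d\theta ,
\quad
\frac{\partial R}{\partial \hat\theta}
  = 2 \int (\hat\theta - \theta)\, p(\theta \mid X^{(n)})\, d\theta = 0
  \;\Longrightarrow\; \hat\theta = E[\theta \mid X^{(n)}] .
```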

15. Mixture Models

16. Mixture Models • Introduce, for each data point, a multinomial (1-of-K) random variable Z^n with components Z_k^n: Z_k^n = 1 if and only if Z^n takes the k-th value, and Z_k^n = 0 otherwise. • Note that Σ_k Z_k^n = 1.

17. Mixture Models • The prior probabilities of the components are P(Z_k^n = 1) = π_k, where Σ_k π_k = 1. • The marginal probability of X is the mixture of the component densities.

18. Mixture Models • Conditional probability of Z given the observation: define the posterior (the responsibility of component k), as sketched below. • A mixture model as a graphical model: Z is a multinomial latent variable with an arrow to the observed variable X.
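The formulas dropped from the last two slides are presumably the standard mixture identities; a reconstruction with component densities p(x | θ_k) and mixing proportions π_k:

```latex
% Marginal of X and posterior ("responsibility") of component k (reconstruction)
p(x_n) = \sum_{k=1}^{K} P(Z_k^n = 1)\, p(x_n \mid Z_k^n = 1)
       = \sum_{k=1}^{K} \pi_k\, p(x_n \mid \theta_k),
\qquad
\tau_{nk} \equiv P(Z_k^n = 1 \mid x_n)
  = \frac{\pi_k\, p(x_n \mid \theta_k)}{\sum_{j=1}^{K} \pi_j\, p(x_n \mid \theta_j)} .
```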

19. Unconditional Mixture Models • Conditional mixture models are used to solve regression and classification (supervised) problems; they need observations of the data X and the labels Y, that is, (X, Y) pairs. • Unconditional mixture models are used to solve density estimation problems; they need only observations of the data X. • Applications: detection of outliers, compression, unsupervised classification (clustering), …

20. Unconditional Mixture Models

21. Gaussian Mixture Models • Estimate the parameters (mixing proportions, means and covariances) from i.i.d. data D = {x1, …, xN}.
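The model and the log likelihood (referred to as (9) on later slides) were images; a reconstruction under the usual Gaussian-mixture notation, with parameters θ = {π_k, μ_k, Σ_k}:

```latex
% Gaussian mixture density and its log likelihood (reconstruction)
p(x \mid \theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad
l(\theta; D) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k).
```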

22. The K-means algorithm • Group the data D = {x1, …, xN} into a set of K clusters, where K is given. • Represent the i-th cluster by one vector, its mean μ_i; data points are assigned to the nearest mean. • Phase 1: the values of the indicator variables are set by assigning each point x_n to the closest mean. • Phase 2: recompute each mean from the points assigned to it; the two phases are iterated, as in the sketch below.
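A minimal NumPy sketch of the two-phase loop just described (the function name, the random initialization and the stopping rule are choices of this sketch, not part of the slides):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Two-phase K-means sketch: X has shape (N, d), K is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize the K means with randomly chosen data points.
    means = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Phase 1: assign each point to its closest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # (N, K)
        labels = dists.argmin(axis=1)
        # Phase 2: recompute each mean as the average of its assigned points.
        new_means = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else means[k]
            for k in range(K)
        ])
        if np.allclose(new_means, means):  # stop once the means no longer move
            break
        means = new_means
    return means, labels
```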

23. EM Algorithm • If Z^n were observed, it would be the "class label" of x_n, and the estimate of each mean would be a simple per-class average. • We don't know them, so we replace them by their conditional expectations, conditioning on the data. • But from (6), (7) this expectation depends on the parameter estimates, so we should iterate.

24. EM Algorithm • Iteration formulas (see the sketch below):
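The iteration formulas themselves were images; the sketch below shows the standard EM updates for a Gaussian mixture in the spirit of the slide (no numerical safeguards or convergence check, and the function name is a choice of this sketch):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture model (sketch). X has shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Crude initialization: random means, identity covariances, uniform mixing proportions.
    means = X[rng.choice(N, size=K, replace=False)].copy()
    covs = np.stack([np.eye(d) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E step: responsibilities tau[n, k] = P(Z_k^n = 1 | x_n, current parameters).
        weighted = np.stack([
            pis[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
            for k in range(K)
        ], axis=1)                                   # shape (N, K)
        tau = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: re-estimate means, covariances and mixing proportions.
        Nk = tau.sum(axis=0)                         # effective counts per component
        means = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (tau[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / N
    return pis, means, covs, tau
```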

25. EM Algorithm • The Expectation step is (14). • The Maximization step is the parameter updates (15)-(17). • What relationship does this algorithm have to the quantity we want to maximize, the log likelihood (9)? • Calculating the derivatives of l with respect to the parameters and setting them to zero leads to the conditions on the next slide.

26. EM Algorithm • Setting the derivative with respect to the means to zero yields the update for μ_k. • Analogously for the covariances, and for the mixing proportions (see the sketch below).
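A sketch of the stationarity conditions being referred to, written with the responsibilities τ_nk defined earlier; these coincide with the fixed points of the EM updates above:

```latex
% Setting the derivatives of l(\theta; D) to zero (sketch)
\mu_k = \frac{\sum_{n=1}^{N} \tau_{nk}\, x_n}{\sum_{n=1}^{N} \tau_{nk}},
\qquad
\Sigma_k = \frac{\sum_{n=1}^{N} \tau_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^{\top}}{\sum_{n=1}^{N} \tau_{nk}},
\qquad
\pi_k = \frac{1}{N} \sum_{n=1}^{N} \tau_{nk} .
```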

27. EM General Setting • EM is an iterative technique designed for probabilistic models with latent variables. • We have two sample spaces: • X, which is observed (the dataset) • Z, which is missing (latent) • A probability model is p(x, z | θ). • If we knew Z, we would do ML estimation by maximizing the complete log likelihood log p(x, z | θ).

28. EM General Setting • Z is not observed, so we calculate the incomplete log likelihood log p(x | θ). • Given that Z is not observed, the complete log likelihood is a random quantity and cannot be maximized directly. • Thus we average over Z using some "averaging distribution" q(z | x). • We hope that maximizing this surrogate expression will yield a value of θ which is an improvement on the initial value of θ.
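Reconstructing the two likelihoods this slide contrasts (a sketch in the notation above):

```latex
% Incomplete log likelihood and expected complete log likelihood (reconstruction)
l(\theta; x) = \log p(x \mid \theta) = \log \sum_{z} p(x, z \mid \theta),
\qquad
\big\langle l_c(\theta; x, z) \big\rangle_{q} = \sum_{z} q(z \mid x)\, \log p(x, z \mid \theta).
```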

29. EM General Setting • The distribution q(z | x) can be used to obtain a lower bound L(q, θ) on the log likelihood (reconstructed below). • EM is coordinate ascent on L(q, θ). • At the (t+1)-st iteration, for fixed θ^(t), we first maximize L(q, θ^(t)) with respect to q, which yields q^(t+1). For this q^(t+1) we then maximize L(q^(t+1), θ) with respect to θ, which yields θ^(t+1).
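The bound itself was an image; a reconstruction via Jensen's inequality (which is why Jensen's inequality is reviewed at the end of this tutorial):

```latex
% Lower bound on the incomplete log likelihood (reconstruction)
l(\theta; x) = \log \sum_{z} p(x, z \mid \theta)
             = \log \sum_{z} q(z \mid x)\, \frac{p(x, z \mid \theta)}{q(z \mid x)}
  \;\ge\; \sum_{z} q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
  \;\equiv\; \mathcal{L}(q, \theta).
```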

30. EM General Setting • E step: maximize L(q, θ^(t)) over q. • M step: maximize L(q^(t+1), θ) over θ. • The M step is equivalently viewed as the maximization of the expected complete log likelihood. Proof: expand L(q, θ) into two terms (see the sketch below). • The second term is independent of θ. Thus maximizing L(q^(t+1), θ) with respect to θ is equivalent to maximizing the expected complete log likelihood.
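A sketch of the two steps and of the decomposition used in the proof (L is the lower bound from the previous slide):

```latex
% E and M steps, and the decomposition behind the proof (sketch)
\text{E step:}\quad q^{(t+1)} = \arg\max_{q} \mathcal{L}(q, \theta^{(t)}),
\qquad
\text{M step:}\quad \theta^{(t+1)} = \arg\max_{\theta} \mathcal{L}(q^{(t+1)}, \theta),

\mathcal{L}(q, \theta)
  = \underbrace{\sum_{z} q(z \mid x)\, \log p(x, z \mid \theta)}_{\text{expected complete log likelihood}}
  \;-\; \underbrace{\sum_{z} q(z \mid x)\, \log q(z \mid x)}_{\text{independent of } \theta} .
```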

31. EM General Setting • The E step can be solved once and for all: the choice q^(t+1)(z | x) = p(z | x, θ^(t)) yields the maximum, since it makes the lower bound tight: L(q^(t+1), θ^(t)) = log p(x | θ^(t)).

32. Jensen's inequality • Definition: a function f is convex over (a, b) if the condition reconstructed below holds. (Figure: a convex and a concave function.) • Jensen's inequality: for a convex function f, the inequality below holds.
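Reconstructing the two missing formulas (standard statements; the symbols are choices of this sketch):

```latex
% Convexity and Jensen's inequality (reconstruction)
f \text{ is convex on } (a, b) \;\iff\;
f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)
\quad \forall\, x_1, x_2 \in (a, b),\; 0 \le \lambda \le 1 .

\text{Jensen's inequality: for convex } f, \qquad E[f(X)] \;\ge\; f(E[X]) .
```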

33. Jensen's inequality • For a discrete random variable with two mass points the inequality is exactly the definition of convexity. • Let Jensen's inequality hold for k−1 mass points; then for k mass points it follows from the induction assumption and from convexity, as sketched below.
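A sketch of the induction step (p_1, …, p_k are the point masses):

```latex
% Induction step for k mass points (sketch)
E[f(X)] = \sum_{i=1}^{k} p_i f(x_i)
        = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k}\, f(x_i)
        \;\ge\; p_k f(x_k) + (1 - p_k)\, f\!\Big( \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k}\, x_i \Big)
        \quad \text{(induction assumption)}
        \;\ge\; f\!\Big( p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} \frac{p_i}{1 - p_k}\, x_i \Big)
        = f(E[X]) \quad \text{(convexity)} .
```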

34. Jensen's inequality corollary • Let λ_1, …, λ_k ≥ 0 with Σ_i λ_i = 1. • The function log is concave, so from Jensen's inequality we have the bound below.
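A reconstruction of the corollary; this is exactly the step that justified the EM lower bound L(q, θ) above:

```latex
% Corollary of Jensen's inequality for the concave function log (reconstruction)
\log \sum_{i=1}^{k} \lambda_i x_i \;\ge\; \sum_{i=1}^{k} \lambda_i \log x_i ,
\qquad \lambda_i \ge 0, \;\; \sum_{i=1}^{k} \lambda_i = 1, \;\; x_i > 0 .
```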
