CS b553: Algorithms for Optimization and Learning
This presentation covers the Expectation-Maximization (EM) algorithm, a technique for fitting probability distributions to data with hidden (latent) variables. It sets up the problem of estimating the parameters of a Bayesian network with known structure when the data is incomplete, and lists applications such as clustering, dimensionality reduction, and modeling latent psychological traits. Key concepts include maximum likelihood and marginal likelihood estimation, their computational challenges, hard and soft EM, and worked examples with correlated variables.
Parameter Learning with Hidden Variables & Expectation Maximization
Agenda • Learning probability distributions from data in the setting of known structure, missing data • Expectation-maximization (EM) algorithm
Basic Problem • Given a dataset $D=\{x[1],\ldots,x[M]\}$ and a Bayesian model over observed variables $X$ and hidden (latent) variables $Z$ • Fit the distribution $P(X,Z)$ to the data • Interpretation: each example $x[m]$ is an incomplete view of the “underlying” sample $(x[m],z[m])$ • [Figure: network over hidden node Z and observed node X]
Applications • Clustering in data mining • Dimensionality reduction • Latent psychological traits (e.g., intelligence, personality) • Document classification • Human activity recognition
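Hidden Variables can Yield more Parsimonious Models • Hidden variables => conditional independences • [Figure: Z as the common parent of X1, X2, X3, X4] With Z: 1+4*2 = 9 parameters • Without Z, the observables become fully dependent • [Figure: fully connected network over X1, X2, X3, X4] Without Z: 1+2+4+8 = 15 parameters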
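The two parameter counts on this slide can be reproduced with a short sketch, assuming every variable is binary (which the slide's arithmetic implies). The function names are illustrative and not from the lecture.

```python
def params_with_hidden_parent(n_obs: int) -> int:
    # One parameter for P(Z), plus two per observable: P(X_i = 1 | Z = z) for each z.
    return 1 + 2 * n_obs


def params_fully_dependent(n_obs: int) -> int:
    # Chain-rule factorization: X_i has 2**(i-1) parent configurations,
    # each needing one Bernoulli parameter.
    return sum(2 ** i for i in range(n_obs))


print(params_with_hidden_parent(4))  # 9
print(params_fully_dependent(4))     # 15
```

The gap widens quickly: the hidden-parent count grows linearly in the number of observables, the fully dependent count exponentially.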
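Generating Model • [Figure: unrolled network in which $\theta_Z$ points to $z[1],\ldots,z[M]$, $\theta_{X|Z}$ points to $x[1],\ldots,x[M]$, and each $z[m]$ points to $x[m]$] • The CPTs of $z[1],\ldots,z[M]$ are identical and given by $\theta_Z$ • The CPTs of $x[1],\ldots,x[M]$ are identical and given by $\theta_{X|Z}$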
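Example: discrete variables • Categorical distributions given by parameters $\theta_Z$: $P(Z[i] \mid \theta_Z) = \mathrm{Categorical}(\theta_Z)$ • Categorical distribution $P(X[i] \mid z[i], \theta_{X|z[i]}) = \mathrm{Categorical}(\theta_{X|z[i]})$ (in other words, $z[i]$ multiplexes between Categorical distributions) • [Figure: the unrolled network over $\theta_Z$, $z[1],\ldots,z[M]$, $\theta_{X|Z}$, $x[1],\ldots,x[M]$]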
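As a minimal sketch of this generating model, the following samples pairs (x[m], z[m]) by first drawing z from Categorical(θ_Z) and then letting z pick which categorical distribution x is drawn from. The function name, the zero-based value indices, and the example numbers are assumptions made for illustration.

```python
import random


def sample_dataset(theta_z, theta_x_given_z, M, seed=0):
    """Draw M samples (x, z); theta_x_given_z[z] is the categorical for X given Z = z."""
    rng = random.Random(seed)
    samples = []
    for _ in range(M):
        # z[m] ~ Categorical(theta_z)
        z = rng.choices(range(len(theta_z)), weights=theta_z)[0]
        # x[m] ~ Categorical(theta_x_given_z[z]): z "multiplexes" between distributions
        x = rng.choices(range(len(theta_x_given_z[z])), weights=theta_x_given_z[z])[0]
        samples.append((x, z))
    return samples


samples = sample_dataset([0.75, 0.25], [[0.5, 0.5], [0.1, 0.9]], M=1000)
observed = [x for x, _ in samples]  # the learner only sees x; z stays hidden
print(len(observed), sum(observed))
```

Discarding the z column, as in the last lines, is what turns the complete sample into the incomplete dataset D that the learning problem starts from.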
Maximum Likelihood estimation • Approach: find values of $\theta = (\theta_Z, \theta_{X|Z})$ and $D_Z=(z[1],\ldots,z[M])$ that maximize the likelihood of the data • $L(\theta, D_Z; D) = P(D \mid \theta, D_Z)$ • Find $\arg\max_{\theta, D_Z} L(\theta, D_Z; D)$
Marginal Likelihood estimation • Approach: find values of $\theta = (\theta_Z, \theta_{X|Z})$ that maximize the likelihood of the data without assuming values of $D_Z=(z[1],\ldots,z[M])$ • $L(\theta; D) = \sum_{D_Z} P(D, D_Z \mid \theta)$ • Find $\arg\max_{\theta} L(\theta; D)$ • (A partially Bayesian approach)
Computational challenges • $P(D \mid \theta, D_Z)$ and $P(D, D_Z \mid \theta)$ are easy to evaluate, but… • Maximum likelihood $\arg\max L(\theta, D_Z; D)$ • Optimizing over the M assignments to Z ($|\mathrm{Val}(Z)|^M$ possible joint assignments) as well as continuous parameters • Maximum marginal likelihood $\arg\max L(\theta; D)$ • Optimizing locally over continuous parameters, but the objective requires summing over the M assignments to Z
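To make the size of the sum concrete, the sketch below evaluates the marginal likelihood of a tiny dataset in two ways: by brute force over all $|\mathrm{Val}(Z)|^M$ joint assignments, and as a product of per-example sums over z. The second form is valid when the examples are independent given θ, as in the generating model above. It is assumed here that the observables are binary and conditionally independent given Z; the helper names and the zero-based Z values are illustrative.

```python
from itertools import product


def joint(x, z, pz, px):
    """P(x, z | theta), with px[z][i] = P(X_i = 1 | Z = z)."""
    p = pz[z]
    for i, xi in enumerate(x):
        p *= px[z][i] if xi == 1 else 1.0 - px[z][i]
    return p


def marginal_brute_force(data, pz, px):
    # Sum P(D, D_Z | theta) over every joint assignment D_Z: |Val(Z)|^M terms.
    total = 0.0
    for dz in product(range(len(pz)), repeat=len(data)):
        term = 1.0
        for x, z in zip(data, dz):
            term *= joint(x, z, pz, px)
        total += term
    return total


def marginal_factorized(data, pz, px):
    # Same quantity as a product over examples of sums over z: M * |Val(Z)| terms.
    result = 1.0
    for x in data:
        result *= sum(joint(x, z, pz, px) for z in range(len(pz)))
    return result


data = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 0)]
pz, px = [0.5, 0.5], [[0.4, 0.7], [0.3, 0.6]]
print(marginal_brute_force(data, pz, px), marginal_factorized(data, pz, px))
```

Evaluating the objective is therefore cheap for this model; the harder part is maximizing it over θ, since the logarithm of a sum does not decompose into independent per-parameter problems.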
Expectation Maximization for ML • Idea: use a coordinate ascent approach • $\arg\max_{\theta, D_Z} L(\theta, D_Z; D) = \arg\max_{\theta} \max_{D_Z} L(\theta, D_Z; D)$ • Step 1: Finding $D_Z^* = \arg\max_{D_Z} L(\theta, D_Z; D)$ is easy given a fixed $\theta$ • Fully observed, ML parameter estimation • Step 2: Set $Q(\theta) = L(\theta, D_Z^*; D)$. Finding $\theta^* = \arg\max_{\theta} Q(\theta)$ is easy given that $D_Z$ is fixed • Fully observed, ML parameter estimation • Repeat steps 1 and 2 until convergence
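The two alternating steps can be sketched as follows for binary observables that are conditionally independent given Z. This is one possible reading of the slide, not the lecturer's code: step 1 here picks, for each example, the z with the highest joint probability under the current parameters, and step 2 re-estimates parameters as relative frequencies. All names and the zero-based Z values are assumptions.

```python
def joint(x, z, pz, px):
    """P(x, z | theta), with px[z][i] = P(X_i = 1 | Z = z)."""
    p = pz[z]
    for i, xi in enumerate(x):
        p *= px[z][i] if xi == 1 else 1.0 - px[z][i]
    return p


def hard_em(data, pz, px, max_iters=100):
    pz, px = list(pz), [list(row) for row in px]  # do not mutate the caller's lists
    K, n = len(pz), len(data[0])
    assign = None
    for _ in range(max_iters):
        # Step 1: most likely z[m] for each example, parameters held fixed.
        new_assign = [max(range(K), key=lambda z: joint(x, z, pz, px)) for x in data]
        if new_assign == assign:
            break  # assignments stopped changing: converged
        assign = new_assign
        # Step 2: fully observed ML estimates, assignments held fixed.
        for z in range(K):
            members = [x for x, a in zip(data, assign) if a == z]
            pz[z] = len(members) / len(data)
            if members:  # keep the old px[z] if no example was assigned to z
                px[z] = [sum(x[i] for x in members) / len(members) for i in range(n)]
    return pz, px, assign


data = [(1, 1)] * 3 + [(0, 0)] * 2
print(hard_em(data, [0.5, 0.5], [[0.8, 0.8], [0.2, 0.2]]))
```

Ties in step 1 are broken toward the lower z index by `max`, which is an arbitrary choice; a different tie-breaking rule can lead to a different fixed point.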
Example: Correlated variables • [Figure: the model drawn as an unrolled network ($\theta_Z$; $z[1],\ldots,z[M]$; $x_1[1],\ldots,x_1[M]$; $x_2[1],\ldots,x_2[M]$) and in plate notation (plate of size M over $z$, $x_1$, $x_2$)]
Example: Correlated variables • Suppose 2 types: • Type 1: $X_1 \ne X_2$, random • Type 2: $(X_1,X_2)=(1,1)$ with 90% chance, $(0,0)$ otherwise • Type 1 drawn 75% of the time • X Dataset • (1,1): 222 • (1,0): 382 • (0,1): 364 • (0,0): 32 • [Figure: plate notation with $\theta_Z$, $\theta_{X_1|Z}$, $\theta_{X_2|Z}$ and a plate of size M over $z$, $x_1$, $x_2$]
Example: Correlated variables • Same two types and X dataset as above • Parameter Estimates • $\theta_Z = 0.5$ • $\theta_{X_1|Z=1} = 0.4$, $\theta_{X_1|Z=2} = 0.3$ • $\theta_{X_2|Z=1} = 0.7$, $\theta_{X_2|Z=2} = 0.6$ • Estimated Z’s • (1,1): type 1 • (1,0): type 1 • (0,1): type 2 • (0,0): type 2
Example: Correlated variables • Same two types and X dataset as above • Parameter Estimates • $\theta_Z = 0.604$ • $\theta_{X_1|Z=1} = 1$, $\theta_{X_1|Z=2} = 0$ • $\theta_{X_2|Z=1} = 0.368$, $\theta_{X_2|Z=2} = 0.919$ • Estimated Z’s • (1,1): type 1 • (1,0): type 1 • (0,1): type 2 • (0,0): type 2 • Converged (true ML estimate)
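The updated estimates on this slide can be checked directly from the dataset counts and the estimated Z's. The sketch below assumes that $\theta_Z$ denotes P(Z = type 1) and that each $\theta_{X_i|Z=k}$ denotes P(X_i = 1 | Z = k), which is consistent with the numbers shown. It then re-runs the assignment step with the new parameters to confirm that nothing changes.

```python
counts = {(1, 1): 222, (1, 0): 382, (0, 1): 364, (0, 0): 32}
assign = {(1, 1): 1, (1, 0): 1, (0, 1): 2, (0, 0): 2}  # estimated types from the slide

M = sum(counts.values())
n_type = {k: sum(c for x, c in counts.items() if assign[x] == k) for k in (1, 2)}

# Fully observed ML estimates: relative frequencies within each type.
theta_z = n_type[1] / M
theta_x = {
    (i, k): sum(c for x, c in counts.items() if assign[x] == k and x[i] == 1) / n_type[k]
    for i in (0, 1)
    for k in (1, 2)
}


def joint(x, k):
    """P(x, Z = k) under the re-estimated parameters."""
    p = theta_z if k == 1 else 1.0 - theta_z
    for i in (0, 1):
        p *= theta_x[i, k] if x[i] == 1 else 1.0 - theta_x[i, k]
    return p


new_assign = {x: max((1, 2), key=lambda k: joint(x, k)) for x in counts}

print(round(theta_z, 3), round(theta_x[1, 1], 3), round(theta_x[1, 2], 3))  # 0.604 0.368 0.919
print(theta_x[0, 1], theta_x[0, 2])  # 1.0 0.0
print(new_assign == assign)          # True
```

Because $\theta_{X_1|Z=1} = 1$ and $\theta_{X_1|Z=2} = 0$, each type now gives probability zero to the examples assigned to the other type, so no assignment can flip and the procedure stops, even though the recovered types differ from the two types that generated the data.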
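Example: Correlated variables • Random initial guess • $\theta_Z = 0.44$ • $\theta_{X_1|Z=1} = 0.97$, $\theta_{X_2|Z=1} = 0.21$, $\theta_{X_3|Z=1} = 0.87$, $\theta_{X_4|Z=1} = 0.57$ • $\theta_{X_1|Z=2} = 0.07$, $\theta_{X_2|Z=2} = 0.97$, $\theta_{X_3|Z=2} = 0.71$, $\theta_{X_4|Z=2} = 0.03$ • Log likelihood: -5176 • [Figure: plate notation with a plate of size M over $z$, $x_1$, $x_2$, $x_3$, $x_4$]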
Example: E step • Random initial guess • $\theta_Z = 0.44$ • $\theta_{X_1|Z=1} = 0.97$, $\theta_{X_2|Z=1} = 0.21$, $\theta_{X_3|Z=1} = 0.87$, $\theta_{X_4|Z=1} = 0.57$ • $\theta_{X_1|Z=2} = 0.07$, $\theta_{X_2|Z=2} = 0.97$, $\theta_{X_3|Z=2} = 0.71$, $\theta_{X_4|Z=2} = 0.03$ • Log likelihood: -4401 • [Figures: X dataset; Z assignments; plate-notation model]
Example: M step • Current estimates • $\theta_Z = 0.43$ • $\theta_{X_1|Z=1} = 0.67$, $\theta_{X_2|Z=1} = 0.27$, $\theta_{X_3|Z=1} = 0.37$, $\theta_{X_4|Z=1} = 0.83$ • $\theta_{X_1|Z=2} = 0.31$, $\theta_{X_2|Z=2} = 0.68$, $\theta_{X_3|Z=2} = 0.31$, $\theta_{X_4|Z=2} = 0.21$ • Log likelihood: -3033 • [Figures: X dataset; Z assignments; plate-notation model]
Example: E step • Current estimates • $\theta_Z = 0.43$ • $\theta_{X_1|Z=1} = 0.67$, $\theta_{X_2|Z=1} = 0.27$, $\theta_{X_3|Z=1} = 0.37$, $\theta_{X_4|Z=1} = 0.83$ • $\theta_{X_1|Z=2} = 0.31$, $\theta_{X_2|Z=2} = 0.68$, $\theta_{X_3|Z=2} = 0.31$, $\theta_{X_4|Z=2} = 0.21$ • Log likelihood: -2965 • [Figures: X dataset; Z assignments; plate-notation model]
Example: E step • Current estimates • $\theta_Z = 0.40$ • $\theta_{X_1|Z=1} = 0.56$, $\theta_{X_2|Z=1} = 0.31$, $\theta_{X_3|Z=1} = 0.40$, $\theta_{X_4|Z=1} = 0.92$ • $\theta_{X_1|Z=2} = 0.45$, $\theta_{X_2|Z=2} = 0.66$, $\theta_{X_3|Z=2} = 0.26$, $\theta_{X_4|Z=2} = 0.04$ • Log likelihood: -2859 • [Figures: X dataset; Z assignments; plate-notation model]
Example: Last E-M step • Current estimates • $\theta_Z = 0.43$ • $\theta_{X_1|Z=1} = 0.51$, $\theta_{X_2|Z=1} = 0.36$, $\theta_{X_3|Z=1} = 0.35$, $\theta_{X_4|Z=1} = 1$ • $\theta_{X_1|Z=2} = 0.53$, $\theta_{X_2|Z=2} = 0.57$, $\theta_{X_3|Z=2} = 0.33$, $\theta_{X_4|Z=2} = 0$ • Log likelihood: -2683 • [Figures: X dataset; Z assignments; plate-notation model]
Problem: Many Local Minima • Flipping Z assignments causes large shifts in likelihood, leading to a poorly behaved energy landscape! • Solution: EM using the marginal likelihood formulation • “Soft” EM • (This is the typical form of the EM algorithm)
Expectation Maximization for MML • $\arg\max_{\theta} L(\theta; D) = \arg\max_{\theta} E_{D_Z|D,\theta}[L(\theta, D_Z; D)]$ • Do $\arg\max_{\theta} E_{D_Z|D,\theta}[\log L(\theta, D_Z; D)]$ instead • (justified later) • Step 1: Given current fixed $\theta^t$, find $P(D_Z \mid \theta^t, D)$ • Compute a distribution over each $Z[i]$ • Step 2: Use these probabilities in the expectation $E_{D_Z|D,\theta^t}[\log L(\theta, D_Z; D)] = Q(\theta)$. Now find $\max_{\theta} Q(\theta)$ • Fully observed, weighted, ML parameter estimation • Repeat steps 1 (expectation) and 2 (maximization) until convergence
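E step in detail • Ultimately, want to maximize $Q(\theta \mid \theta^t) = E_{D_Z|D,\theta^t}[\log L(\theta, D_Z; D)]$ over $\theta$ • $Q(\theta \mid \theta^t) = \sum_m \sum_{z[m]} P(z[m] \mid x[m], \theta^t) \log P(x[m], z[m] \mid \theta)$ • E step computes the terms $w_{m,z}(\theta^t) = P(Z[m]=z \mid D, \theta^t)$ over all examples $m$ and $z \in \mathrm{Val}[Z]$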
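For a single example, the weights $w_{m,z}$ follow from Bayes' rule: compute the joint $P(x[m], z \mid \theta^t)$ for every z and normalize. The sketch below assumes binary observables that are conditionally independent given Z and uses zero-based Z values; the function name is illustrative. The parameter values are the ones shown as the first estimates in the two-variable example.

```python
def responsibilities(x, pz, px):
    """Return [P(Z = z | x, theta) for each z]; px[z][i] = P(X_i = 1 | Z = z)."""
    joint = []
    for z in range(len(pz)):
        p = pz[z]
        for i, xi in enumerate(x):
            p *= px[z][i] if xi == 1 else 1.0 - px[z][i]
        joint.append(p)
    total = sum(joint)  # P(x | theta), the marginal likelihood of this one example
    return [p / total for p in joint]


w = responsibilities((1, 1), pz=[0.5, 0.5], px=[[0.4, 0.7], [0.3, 0.6]])
print(w)
```

Since the examples are independent given θ, conditioning on the whole dataset D reduces to conditioning on x[m] alone, which is why the function only needs one example at a time.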
M step in detail • $\arg\max_{\theta} Q(\theta \mid \theta^t) = \arg\max_{\theta} \sum_m \sum_z w_{m,z}(\theta^t) \log P(x[m], z[m]=z \mid \theta) = \arg\max_{\theta} \prod_m \prod_z P(x[m], z[m]=z \mid \theta)^{w_{m,z}(\theta^t)}$ • This is weighted ML • Each $z[m]$ is interpreted to be observed $w_{m,z}(\theta^t)$ times • Most closed-form ML expressions (Bernoulli, categorical, Gaussian) can be adapted easily to the weighted case
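Example: Bernoulli Parameter for Z • $\theta_Z^* = \arg\max_{\theta_Z} \sum_m \sum_z w_{m,z} \log P(x[m], z[m]=z \mid \theta_Z) = \arg\max_{\theta_Z} \sum_m \sum_z w_{m,z} \log\big(I[z=1]\,\theta_Z + I[z=0]\,(1-\theta_Z)\big) = \arg\max_{\theta_Z} \big[\log(\theta_Z) \sum_m w_{m,z=1} + \log(1-\theta_Z) \sum_m w_{m,z=0}\big]$ • => $\theta_Z^* = \big(\sum_m w_{m,z=1}\big) / \sum_m (w_{m,z=1} + w_{m,z=0})$ • “Expected counts”: $M_{\theta^t}[z] = \sum_m w_{m,z}(\theta^t)$ • Express $\theta_Z^* = M_{\theta^t}[z=1] / M_{\theta^t}[\ ]$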
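Example: Bernoulli Parameters for Xi | Z • $\theta_{X_i|z=k}^* = \arg\max_{\theta_{X_i|z=k}} \sum_m w_{m,z=k} \log P(x[m], z[m]=k \mid \theta_{X_i|z=k})$ • $= \arg\max_{\theta_{X_i|z=k}} \sum_m \sum_z w_{m,z} \log\big(I[x_i[m]=1, z=k]\,\theta_{X_i|z=k} + I[x_i[m]=0, z=k]\,(1-\theta_{X_i|z=k})\big)$ = … (similar derivation) • => $\theta_{X_i|z=k}^* = M_{\theta^t}[x_i=1, z=k] / M_{\theta^t}[z=k]$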
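Both closed forms are ratios of expected counts, so the M step is a few weighted sums. The sketch below generalizes the Bernoulli case for Z to K values (an assumption; the slides use a binary Z) and takes the weights $w_{m,k}$ as given. The function name and the toy numbers are illustrative.

```python
def m_step(data, weights):
    """Weighted ML estimates from expected counts.

    data[m] is a tuple of binary observations and weights[m][k] = w_{m,k}.
    Returns (pz, px) with pz[k] = M[z=k] / M and px[k][i] = M[x_i=1, z=k] / M[z=k].
    Assumes every M[z=k] is positive.
    """
    K, n, M = len(weights[0]), len(data[0]), len(data)
    Mz = [sum(w[k] for w in weights) for k in range(K)]  # expected counts M[z=k]
    pz = [Mz[k] / M for k in range(K)]
    px = [
        [sum(w[k] * x[i] for x, w in zip(data, weights)) / Mz[k] for i in range(n)]
        for k in range(K)
    ]
    return pz, px


data = [(1, 0), (1, 1), (0, 1)]
soft = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
print(m_step(data, soft))
```

With one-hot weights the same function returns plain relative frequencies, which is the sense in which the weighted case extends the fully observed ML estimate.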
EM on Prior Example (100 iterations) • Final estimates • $\theta_Z = 0.49$ • $\theta_{X_1|Z=1} = 0.64$, $\theta_{X_2|Z=1} = 0.88$, $\theta_{X_3|Z=1} = 0.41$, $\theta_{X_4|Z=1} = 0.46$ • $\theta_{X_1|Z=2} = 0.38$, $\theta_{X_2|Z=2} = 0.00$, $\theta_{X_3|Z=2} = 0.27$, $\theta_{X_4|Z=2} = 0.68$ • Log likelihood: -2833 • [Figures: X dataset; panel labeled “P(Z)=2”; plate-notation model]
Convergence • In general, no way to tell a priori how fast EM will converge • Soft EM is usually slower than hard EM • Still runs into local minima, but has more opportunities to coordinate parameter adjustments
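Since the number of iterations is not known in advance, a common practice is to track the marginal log-likelihood and stop once its improvement falls below a tolerance. The sketch below does this for soft EM on the two-variable dataset from the earlier example. The tolerance, the iteration cap, the function name, and the zero-based Z values are all assumptions, and no claim is made about which stopping rule the lecture used.

```python
import math


def soft_em(counts, pz, px, tol=1e-8, max_iters=1000):
    """Soft EM for binary observables; counts maps each tuple x to its frequency."""
    K, n = len(pz), len(next(iter(counts)))
    pz, px = list(pz), [list(row) for row in px]
    history = []  # marginal log-likelihood at the start of each iteration
    for _ in range(max_iters):
        ll, Mz = 0.0, [0.0] * K
        Mxz = [[0.0] * n for _ in range(K)]
        for x, c in counts.items():
            # E step: joint P(x, z | theta) for each z, then normalize.
            joint = [
                pz[z] * math.prod(px[z][i] if x[i] else 1.0 - px[z][i] for i in range(n))
                for z in range(K)
            ]
            total = sum(joint)
            ll += c * math.log(total)
            for z in range(K):
                w = c * joint[z] / total  # expected count contributed by x
                Mz[z] += w
                for i in range(n):
                    Mxz[z][i] += w * x[i]
        history.append(ll)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break  # improvement below tolerance: treat as converged
        # M step: ratios of expected counts.
        pz = [Mz[z] / sum(Mz) for z in range(K)]
        px = [[Mxz[z][i] / Mz[z] for i in range(n)] for z in range(K)]
    return pz, px, history


counts = {(1, 1): 222, (1, 0): 382, (0, 1): 364, (0, 0): 32}
pz, px, history = soft_em(counts, [0.5, 0.5], [[0.4, 0.7], [0.3, 0.6]])
print(len(history), history[0], history[-1])
```

The recorded history makes slow convergence visible: long stretches of tiny improvements show up as a flat tail, and a different initial guess may end at a different final value.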
Why does it work? • Why are we optimizing over $Q(\theta \mid \theta^t) = \sum_m \sum_{z[m]} P(z[m] \mid x[m], \theta^t) \log P(x[m], z[m] \mid \theta)$ • rather than the true marginalized likelihood: $L(\theta \mid D) = \prod_m \sum_{z[m]} P(z[m] \mid x[m], \theta^t)\, P(x[m], z[m] \mid \theta)$? • Can prove that: • The log likelihood does not decrease at any step • A stationary point of $\arg\max_{\theta} E_{D_Z|D,\theta}[L(\theta, D_Z; D)]$ is a stationary point of $\log L(\theta \mid D)$ • see K&F p882-884
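The first claim can be sketched with the standard Jensen's-inequality argument. This is the usual textbook derivation, not a transcription of the cited pages. It assumes the marginal likelihood in the form $L(\theta; D) = \prod_m \sum_z P(x[m], z \mid \theta)$ from the marginal likelihood slide, writes $w_{m,z} = P(z \mid x[m], \theta^t)$, and introduces $H(\theta^t)$ as a name for the entropy term, which does not depend on $\theta$.

```latex
\begin{align*}
\log L(\theta; D)
  &= \sum_m \log \sum_z w_{m,z}\, \frac{P(x[m], z \mid \theta)}{w_{m,z}} \\
  &\ge \sum_m \sum_z w_{m,z} \log \frac{P(x[m], z \mid \theta)}{w_{m,z}}
     && \text{(Jensen, since $\log$ is concave)} \\
  &= Q(\theta \mid \theta^t) + H(\theta^t),
     \qquad H(\theta^t) = -\sum_m \sum_z w_{m,z} \log w_{m,z}.
\end{align*}
% At \theta = \theta^t the ratio P(x[m], z | \theta^t) / w_{m,z} = P(x[m] | \theta^t)
% does not depend on z, so the bound holds with equality there. Hence, for
% \theta^{t+1} = \arg\max_\theta Q(\theta \mid \theta^t):
\begin{align*}
\log L(\theta^{t+1}; D)
  \ge Q(\theta^{t+1} \mid \theta^t) + H(\theta^t)
  \ge Q(\theta^t \mid \theta^t) + H(\theta^t)
  = \log L(\theta^t; D).
\end{align*}
```

Each M step therefore maximizes a lower bound that touches the log-likelihood at the current parameters, which is why the log-likelihood cannot go down from one iteration to the next.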
Gaussian Clustering using EM • One of the first uses of EM • Widely used approach • Finding good starting points: • k-means algorithm • (Hard assignment) • Handling degeneracies • Regularization
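A minimal one-dimensional sketch of Gaussian clustering with EM is given below. It takes the starting means as an argument, so they could come from a k-means run as the slide suggests, and it uses a variance floor as one simple form of regularization against a component collapsing onto a single point. The floor value, the fixed iteration count, and all names are assumptions made for illustration.

```python
import math


def gmm_em_1d(xs, means, iters=50, var_floor=1e-3):
    """EM for a 1-D Gaussian mixture; means holds the initial component means."""
    K = len(means)
    means = list(means)
    variances = [1.0] * K
    weights = [1.0 / K] * K
    for _ in range(iters):
        # E step: responsibility of component k for x, proportional to
        # weight_k * N(x; mean_k, var_k).
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-((x - mu) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
                for w, mu, v in zip(weights, means, variances)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M step: weighted ML estimates for each component.
        for k in range(K):
            Nk = sum(r[k] for r in resp)  # expected number of points in component k
            weights[k] = Nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / Nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / Nk
            variances[k] = max(var, var_floor)  # floor guards against degenerate variance
    return weights, means, variances


xs = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
print(gmm_em_1d(xs, means=[0.0, 5.0]))
```

The sketch does not guard against a point lying so far from every component that all densities underflow to zero; a practical implementation would typically work with log-densities to avoid that.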
Recap • Learning with hidden variables • Typically categorical