CS b553: Algorithms for Optimization and Learning

Parameter Learning with Hidden Variables & Expectation Maximization

Agenda
• Learning probability distributions from data in the setting of known structure, missing data
• Expectation-maximization (EM) algorithm

Basic Problem
• Given a dataset D = {x[1],…,x[M]} and a Bayesian model over observed variables X and hidden (latent) variables Z
• Fit the distribution P(X,Z) to the data
• Interpretation: each example x[m] is an incomplete view of the “underlying” sample (x[m],z[m])
[Figure: two-node network, Z → X]

Applications
• Clustering in data mining
• Dimensionality reduction
• Latent psychological traits (e.g., intelligence, personality)
• Document classification
• Human activity recognition


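Hidden Variables can Yield more Parsimonious Models
• Hidden variables => conditional independences
  [Figure: Z is the single parent of X1, X2, X3, X4]
  1 + 4*2 = 9 parameters (all variables binary)
• Without Z, the observables become fully dependent
  [Figure: fully connected network over X1, X2, X3, X4]
  1 + 2 + 4 + 8 = 15 parameters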
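The two parameter counts can be reproduced with a short sketch. It assumes every variable is binary, as the counts on the slide imply; the function names are illustrative only.

```python
def params_with_hidden(n_obs: int) -> int:
    """Binary Z is the only parent of n_obs binary observables."""
    # 1 parameter for P(Z), plus 2 per observable (one Bernoulli per value of Z).
    return 1 + 2 * n_obs


def params_without_hidden(n_obs: int) -> int:
    """Fully connected network over n_obs binary observables (chain rule)."""
    # The i-th variable has i binary parents (i = 0..n_obs-1),
    # so its CPT needs 2**i free parameters.
    return sum(2 ** i for i in range(n_obs))


for n in (4, 10):
    print(n, params_with_hidden(n), params_without_hidden(n))
```

The gap widens quickly: the count with a hidden parent grows linearly in the number of observables, while the fully dependent count grows exponentially.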

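Generating Model
[Figure: unrolled network; $\theta_z$ → z[1],…,z[M]; each z[m] → x[m]; $\theta_{x|z}$ → x[1],…,x[M]]
• The CPTs of z[1],…,z[M] are identical and given by $\theta_z$
• The CPTs of x[1],…,x[M] are identical and given by $\theta_{x|z}$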

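Example: discrete variables
[Figure: unrolled network; $\theta_z$ → z[1],…,z[M]; each z[m] → x[m]; $\theta_{x|z}$ → x[1],…,x[M]]
• Categorical distribution given by parameters $\theta_z$: $P(Z[i] \mid \theta_z) = \mathrm{Categorical}(\theta_z)$
• Categorical distribution for each value of z: $P(X[i] \mid z[i], \theta_{x|z[i]}) = \mathrm{Categorical}(\theta_{x|z[i]})$
• In other words, z[i] multiplexes between categorical distributions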
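A minimal sketch of this generating process, using only the Python standard library. The parameter values, the zero-based indexing of the values of Z and X, and the helper names are made up for illustration.

```python
import random


def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    u, cum = rng.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if u < cum:
            return i
    return len(probs) - 1  # guard against floating-point rounding


def sample_dataset(theta_z, theta_x_given_z, M, seed=0):
    """Sample M complete examples (x[m], z[m]) from the mixture."""
    rng = random.Random(seed)
    data = []
    for _ in range(M):
        z = sample_categorical(theta_z, rng)              # z[m] ~ Categorical(theta_z)
        x = sample_categorical(theta_x_given_z[z], rng)   # z[m] selects the row used for x[m]
        data.append((x, z))
    return data


# Hypothetical parameters: 2 hidden values, 3 observable values.
theta_z = [0.75, 0.25]
theta_x_given_z = [[0.1, 0.9, 0.0],
                   [0.6, 0.2, 0.2]]

complete = sample_dataset(theta_z, theta_x_given_z, M=1000)
observed = [x for x, _ in complete]  # the learner only ever sees the x's
```

Dropping the z column in the last line is exactly the "incomplete view" of the underlying sample that the learning problem starts from.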

Maximum Likelihood estimation
• Approach: find values of $\theta = (\theta_z, \theta_{x|z})$ and $D_Z = (z[1],…,z[M])$ that maximize the likelihood of the data
• $L(\theta, D_Z; D) = P(D \mid \theta, D_Z)$
• Find $\arg\max_{\theta, D_Z} L(\theta, D_Z; D)$

Marginal Likelihood estimation
• Approach: find values of $\theta = (\theta_z, \theta_{x|z})$ that maximize the likelihood of the data without assuming values of $D_Z = (z[1],…,z[M])$
• $L(\theta; D) = \sum_{D_Z} P(D, D_Z \mid \theta)$
• Find $\arg\max_{\theta} L(\theta; D)$
• (A partially Bayesian approach)

Computational challenges
• $P(D \mid \theta, D_Z)$ and $P(D, D_Z \mid \theta)$ are easy to evaluate, but…
• Maximum likelihood: $\arg\max_{\theta, D_Z} L(\theta, D_Z; D)$
  • Optimizing over M discrete assignments to Z ($|Val(Z)|^M$ possible joint assignments) as well as continuous parameters
• Maximum marginal likelihood: $\arg\max_{\theta} L(\theta; D)$
  • Optimizing locally over continuous parameters, but the objective requires summing over the assignments to the M hidden variables
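To make the size of the sum concrete, the sketch below evaluates the marginal likelihood of a toy dataset two ways: by brute force over all $|Val(Z)|^M$ joint assignments, and via the per-example factorization $\prod_m \sum_z P(x[m], z \mid \theta)$, which holds when the examples are i.i.d. as in the generating model. The parameter values are hypothetical. Even with the factorization, the log of a sum does not decompose over parameters, which is why the maximization has no closed form.

```python
import itertools
import math

# Hypothetical toy model: binary Z, binary X.
theta_z = [0.6, 0.4]               # P(Z = z)
theta_x_given_z = [[0.9, 0.1],     # P(X = x | Z = 0)
                   [0.2, 0.8]]     # P(X = x | Z = 1)


def joint(x, z):
    """P(x, z | theta) for a single example."""
    return theta_z[z] * theta_x_given_z[z][x]


def marginal_brute_force(data):
    """Sum P(D, D_Z | theta) over all |Val(Z)|**M joint assignments D_Z."""
    total = 0.0
    for dz in itertools.product(range(len(theta_z)), repeat=len(data)):
        total += math.prod(joint(x, z) for x, z in zip(data, dz))
    return total


def marginal_factorized(data):
    """Same quantity as prod_m sum_z P(x[m], z | theta): only M * |Val(Z)| terms."""
    return math.prod(sum(joint(x, z) for z in range(len(theta_z))) for x in data)


data = [0, 1, 1, 0, 1, 0, 0, 1]    # M = 8, so 2**8 = 256 joint assignments
print(marginal_brute_force(data), marginal_factorized(data))
```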

Expectation Maximization for ML
• Idea: use a coordinate ascent approach
• $\arg\max_{\theta, D_Z} L(\theta, D_Z; D) = \arg\max_{\theta} \max_{D_Z} L(\theta, D_Z; D)$
• Step 1: Finding $D_Z^* = \arg\max_{D_Z} L(\theta, D_Z; D)$ is easy given a fixed $\theta$
  • Each z[m] can be chosen independently of the others
• Step 2: Set $Q(\theta) = L(\theta, D_Z^*; D)$. Finding $\theta^* = \arg\max_{\theta} Q(\theta)$ is easy given that $D_Z$ is fixed
  • Fully observed, ML parameter estimation
• Repeat steps 1 and 2 until convergence

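Example: Correlated variables
[Figure, left: unrolled network; $\theta_z$ → z[1],…,z[M]; $\theta_{x_1|z}$ → x1[1],…,x1[M]; $\theta_{x_2|z}$ → x2[1],…,x2[M]; each z[m] → x1[m], x2[m]]
[Figure, right: plate notation; z → x1, x2 inside a plate of size M]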

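Example: Correlated variables
[Figure: plate notation; z → x1, x2 inside a plate of size M, with parameters $\theta_z$, $\theta_{x_1|z}$, $\theta_{x_2|z}$]
• Suppose 2 types:
  • Type 1: X1 != X2, chosen at random
  • Type 2: X1,X2 = 1,1 with 90% chance, 0,0 otherwise
• Type 1 drawn 75% of the time
• X Dataset (counts)
  • (1,1): 222
  • (1,0): 382
  • (0,1): 364
  • (0,0): 32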

Example: Correlated variables (continued)
• Parameter Estimates (starting values)
  • $\theta_z = 0.5$
  • $\theta_{x_1|z=1} = 0.4$, $\theta_{x_1|z=2} = 0.3$
  • $\theta_{x_2|z=1} = 0.7$, $\theta_{x_2|z=2} = 0.6$

Example: Correlated variables (continued)
• Estimated Z’s under the starting parameter estimates
  • (1,1): type 1
  • (1,0): type 1
  • (0,1): type 2
  • (0,0): type 2

Example: Correlated variables (continued)
• Parameter Estimates, re-fit to the estimated Z’s
  • $\theta_z = 0.604$
  • $\theta_{x_1|z=1} = 1$, $\theta_{x_1|z=2} = 0$
  • $\theta_{x_2|z=1} = 0.368$, $\theta_{x_2|z=2} = 0.919$

Example: Correlated variables (continued)
• Parameter Estimates: $\theta_z = 0.604$; $\theta_{x_1|z=1} = 1$, $\theta_{x_1|z=2} = 0$; $\theta_{x_2|z=1} = 0.368$, $\theta_{x_2|z=2} = 0.919$
• Estimated Z’s are unchanged: (1,1) and (1,0) are type 1; (0,1) and (0,0) are type 2
• Converged (true ML estimate)
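The numbers in this example can be checked with a small hard-EM sketch. It starts from the slide's estimated Z's rather than from the starting parameters, because under those starting values the patterns (1,0) and (0,1) score an exact tie between the two types. Two assumptions are made here and are not stated on the slides: $\theta_z$ is read as P(Z = type 1), and the hard E step scores each type by the joint $P(x, z \mid \theta)$.

```python
# Dataset counts and the estimated type for each (x1, x2) pattern, from the slides.
counts = {(1, 1): 222, (1, 0): 382, (0, 1): 364, (0, 0): 32}
assign = {(1, 1): 1, (1, 0): 1, (0, 1): 2, (0, 0): 2}


def m_step(counts, assign):
    """Fully observed ML estimation: relative frequencies given the hard assignments."""
    n = {1: 0, 2: 0}      # number of examples of each type
    n_x1 = {1: 0, 2: 0}   # ... of which x1 = 1
    n_x2 = {1: 0, 2: 0}   # ... of which x2 = 1
    for (x1, x2), c in counts.items():
        z = assign[(x1, x2)]
        n[z] += c
        n_x1[z] += c * x1
        n_x2[z] += c * x2
    theta_z = n[1] / (n[1] + n[2])                 # assumed meaning: P(Z = type 1)
    theta_x1 = {z: n_x1[z] / n[z] for z in n}      # P(X1 = 1 | Z = z)
    theta_x2 = {z: n_x2[z] / n[z] for z in n}      # P(X2 = 1 | Z = z)
    return theta_z, theta_x1, theta_x2


def bern(p, x):
    return p if x == 1 else 1.0 - p


def e_step(counts, theta_z, theta_x1, theta_x2):
    """Hard assignment: the type maximizing the (assumed) score P(x, z | theta)."""
    prior = {1: theta_z, 2: 1.0 - theta_z}
    new_assign = {}
    for (x1, x2) in counts:
        score = {z: prior[z] * bern(theta_x1[z], x1) * bern(theta_x2[z], x2)
                 for z in (1, 2)}
        new_assign[(x1, x2)] = max(score, key=score.get)
    return new_assign


theta_z, theta_x1, theta_x2 = m_step(counts, assign)
print(round(theta_z, 3), round(theta_x2[1], 3), round(theta_x2[2], 3))
print(e_step(counts, theta_z, theta_x1, theta_x2) == assign)  # fixed point reached
```

The M step reproduces 0.604, 0.368, and 0.919, and the following E step returns the same assignments, which is the convergence shown on the slide.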

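Example: Correlated variables
[Figure: plate notation; z → x1, x2, x3, x4 inside a plate of size M, with parameters $\theta_z$, $\theta_{x_1|z}$, …, $\theta_{x_4|z}$]
Random initial guess
• $\theta_Z = 0.44$
• $\theta_{X_1|Z=1} = 0.97$, $\theta_{X_1|Z=2} = 0.07$
• $\theta_{X_2|Z=1} = 0.21$, $\theta_{X_2|Z=2} = 0.97$
• $\theta_{X_3|Z=1} = 0.87$, $\theta_{X_3|Z=2} = 0.71$
• $\theta_{X_4|Z=1} = 0.57$, $\theta_{X_4|Z=2} = 0.03$
Log likelihood: -5176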

Example: E step
[Figure: plate notation as before; panels: X Dataset, Z Assignments]
Random initial guess
• $\theta_Z = 0.44$
• $\theta_{X_1|Z=1} = 0.97$, $\theta_{X_1|Z=2} = 0.07$
• $\theta_{X_2|Z=1} = 0.21$, $\theta_{X_2|Z=2} = 0.97$
• $\theta_{X_3|Z=1} = 0.87$, $\theta_{X_3|Z=2} = 0.71$
• $\theta_{X_4|Z=1} = 0.57$, $\theta_{X_4|Z=2} = 0.03$
Log likelihood: -4401

Example: M step
[Figure: plate notation as before; panels: X Dataset, Z Assignments]
Current estimates
• $\theta_Z = 0.43$
• $\theta_{X_1|Z=1} = 0.67$, $\theta_{X_1|Z=2} = 0.31$
• $\theta_{X_2|Z=1} = 0.27$, $\theta_{X_2|Z=2} = 0.68$
• $\theta_{X_3|Z=1} = 0.37$, $\theta_{X_3|Z=2} = 0.31$
• $\theta_{X_4|Z=1} = 0.83$, $\theta_{X_4|Z=2} = 0.21$
Log likelihood: -3033

Example: E step
[Figure: plate notation as before; panels: X Dataset, Z Assignments]
Current estimates
• $\theta_Z = 0.43$
• $\theta_{X_1|Z=1} = 0.67$, $\theta_{X_1|Z=2} = 0.31$
• $\theta_{X_2|Z=1} = 0.27$, $\theta_{X_2|Z=2} = 0.68$
• $\theta_{X_3|Z=1} = 0.37$, $\theta_{X_3|Z=2} = 0.31$
• $\theta_{X_4|Z=1} = 0.83$, $\theta_{X_4|Z=2} = 0.21$
Log likelihood: -2965

Example: M step
[Figure: plate notation as before; panels: X Dataset, Z Assignments]
Current estimates
• $\theta_Z = 0.40$
• $\theta_{X_1|Z=1} = 0.56$, $\theta_{X_1|Z=2} = 0.45$
• $\theta_{X_2|Z=1} = 0.31$, $\theta_{X_2|Z=2} = 0.66$
• $\theta_{X_3|Z=1} = 0.40$, $\theta_{X_3|Z=2} = 0.26$
• $\theta_{X_4|Z=1} = 0.92$, $\theta_{X_4|Z=2} = 0.04$
Log likelihood: -2859

Example: Last E-M step
[Figure: plate notation as before; panels: X Dataset, Z Assignments]
Current estimates
• $\theta_Z = 0.43$
• $\theta_{X_1|Z=1} = 0.51$, $\theta_{X_1|Z=2} = 0.53$
• $\theta_{X_2|Z=1} = 0.36$, $\theta_{X_2|Z=2} = 0.57$
• $\theta_{X_3|Z=1} = 0.35$, $\theta_{X_3|Z=2} = 0.33$
• $\theta_{X_4|Z=1} = 1$, $\theta_{X_4|Z=2} = 0$
Log likelihood: -2683

Problem: Many Local Minima
• Flipping Z assignments causes large shifts in likelihood, leading to a poorly behaved energy landscape!
• Solution: EM using the marginal likelihood formulation
  • “Soft” EM
  • (This is the typical form of the EM algorithm)

Expectation Maximization for MML
• Goal: $\arg\max_{\theta} L(\theta; D) = \arg\max_{\theta} \sum_{D_Z} P(D, D_Z \mid \theta)$
• Do $\arg\max_{\theta} E_{D_Z \mid D, \theta_t}[\log P(D, D_Z \mid \theta)]$ instead, where $\theta_t$ is the current estimate
  • (justified later)
• Step 1: Given current fixed $\theta_t$, find $P(D_Z \mid \theta_t, D)$
  • Compute a distribution over each Z[i]
• Step 2: Use these probabilities in the expectation $Q(\theta) = E_{D_Z \mid D, \theta_t}[\log P(D, D_Z \mid \theta)]$. Now find $\max_{\theta} Q(\theta)$
  • Fully observed, weighted, ML parameter estimation
• Repeat steps 1 (expectation) and 2 (maximization) until convergence
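Step 1 can be sketched for the binary-observable models used in the examples: for each example, compute the joint $P(x[m], z \mid \theta_t)$ for every value of z and normalize. The data layout (lists indexed from zero, `theta_x[z][j]` as the probability that the j-th observable equals 1) is an assumption of this sketch.

```python
def e_step(data, theta_z, theta_x):
    """Return w[m][z] = P(z[m] = z | x[m], theta_t) for each example.

    theta_z[z]    : P(Z = z)
    theta_x[z][j] : P(X_j = 1 | Z = z)   (layout assumed for this sketch)
    """
    weights = []
    for x in data:
        joint = []
        for z, prior in enumerate(theta_z):
            p = prior
            for j, xj in enumerate(x):
                p *= theta_x[z][j] if xj == 1 else 1.0 - theta_x[z][j]
            joint.append(p)                        # P(x[m], z | theta_t)
        norm = sum(joint)                          # P(x[m] | theta_t)
        weights.append([p / norm for p in joint])  # Bayes' rule
    return weights


# Starting values from the two-variable example, with types indexed 0 and 1.
theta_z = [0.5, 0.5]
theta_x = [[0.4, 0.7], [0.3, 0.6]]
w = e_step([(1, 1), (0, 0)], theta_z, theta_x)
print(w)
```

Unlike the hard assignment, each example keeps a fractional weight on every type, so a small parameter change moves the weights smoothly instead of flipping an assignment.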

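M step in detail
• $\arg\max_{\theta} Q(\theta \mid \theta_t) = \arg\max_{\theta} \sum_m \sum_z w_{m,z}(\theta_t) \log P(x[m], z[m]=z \mid \theta) = \arg\max_{\theta} \prod_m \prod_z P(x[m], z[m]=z \mid \theta)^{w_{m,z}(\theta_t)}$, where $w_{m,z}(\theta_t) = P(z[m]=z \mid x[m], \theta_t)$
• This is weighted ML
• Each completed example (x[m], z[m]=z) is interpreted as being observed $w_{m,z}(\theta_t)$ times
• Most closed-form ML expressions (Bernoulli, categorical, Gaussian) can be adapted easily to the weighted case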

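Example: Bernoulli Parameter for Z
• $\theta_Z^* = \arg\max_{\theta_Z} \sum_m \sum_z w_{m,z} \log P(x[m], z[m]=z \mid \theta_Z)$
  $= \arg\max_{\theta_Z} \sum_m \sum_z w_{m,z} \log\big(I[z=1]\,\theta_Z + I[z=0]\,(1-\theta_Z)\big)$ (dropping factors that do not depend on $\theta_Z$)
  $= \arg\max_{\theta_Z} \big[\log(\theta_Z) \sum_m w_{m,z=1} + \log(1-\theta_Z) \sum_m w_{m,z=0}\big]$
• => $\theta_Z^* = \big(\sum_m w_{m,z=1}\big) / \sum_m (w_{m,z=1} + w_{m,z=0})$
• “Expected counts”: $M_{\theta_t}[z] = \sum_m w_{m,z}(\theta_t)$
• Express $\theta_Z^* = M_{\theta_t}[z=1] / M_{\theta_t}[\,]$, where $M_{\theta_t}[\,]$ is the total expected count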
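The expected-count update can be combined with the weight computation into a complete soft-EM loop. The sketch below runs it on the two-variable count dataset from the earlier example; the data layout, the zero-based types, the choice of 20 iterations, and the function names are assumptions made for illustration.

```python
import math


def bern(p, x):
    return p if x == 1 else 1.0 - p


def joint(x, z, theta_z, theta_x):
    """P(x, z | theta) with theta_x[z][j] = P(X_j = 1 | Z = z)."""
    p = theta_z[z]
    for j, xj in enumerate(x):
        p *= bern(theta_x[z][j], xj)
    return p


def e_step(data, theta_z, theta_x):
    """Weights w[m][z] = P(z[m] = z | x[m], theta_t)."""
    w = []
    for x in data:
        js = [joint(x, z, theta_z, theta_x) for z in range(len(theta_z))]
        s = sum(js)
        w.append([p / s for p in js])
    return w


def m_step(data, w, n_types):
    """Weighted ML: every ratio of counts becomes a ratio of expected counts."""
    n_feat = len(data[0])
    counts = [sum(wm[z] for wm in w) for z in range(n_types)]   # M[z]
    theta_z = [c / len(data) for c in counts]                   # M[z] / M[ ]
    theta_x = [[sum(wm[z] * x[j] for wm, x in zip(w, data)) / counts[z]
                for j in range(n_feat)]
               for z in range(n_types)]
    return theta_z, theta_x


def log_likelihood(data, theta_z, theta_x):
    """Marginal log-likelihood: sum_m log sum_z P(x[m], z | theta)."""
    return sum(math.log(sum(joint(x, z, theta_z, theta_x)
                            for z in range(len(theta_z))))
               for x in data)


data = [(1, 1)] * 222 + [(1, 0)] * 382 + [(0, 1)] * 364 + [(0, 0)] * 32
theta_z, theta_x = [0.5, 0.5], [[0.4, 0.7], [0.3, 0.6]]
history = []
for _ in range(20):
    history.append(log_likelihood(data, theta_z, theta_x))
    w = e_step(data, theta_z, theta_x)
    theta_z, theta_x = m_step(data, w, n_types=2)
print(history[0], history[-1])
```

Recording the marginal log-likelihood before each iteration makes it easy to confirm that the sequence never goes down, which is the property discussed under "Why does it work?".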

EM on Prior Example (100 iterations)
[Figure: plate notation as before; panels: X Dataset, P(Z)=2]
Final estimates
• $\theta_Z = 0.49$
• $\theta_{X_1|Z=1} = 0.64$, $\theta_{X_1|Z=2} = 0.38$
• $\theta_{X_2|Z=1} = 0.88$, $\theta_{X_2|Z=2} = 0.00$
• $\theta_{X_3|Z=1} = 0.41$, $\theta_{X_3|Z=2} = 0.27$
• $\theta_{X_4|Z=1} = 0.46$, $\theta_{X_4|Z=2} = 0.68$
Log likelihood: -2833

Convergence
• In general, no way to tell a priori how fast EM will converge
• Soft EM is usually slower than hard EM
• Still runs into local minima, but has more opportunities to coordinate parameter adjustments

Why does it work?
• Why are we optimizing over $Q(\theta \mid \theta_t) = \sum_m \sum_{z[m]} P(z[m] \mid x[m], \theta_t) \log P(x[m], z[m] \mid \theta)$
• rather than the true marginalized likelihood $L(\theta; D) = \prod_m \sum_{z[m]} P(x[m], z[m] \mid \theta)$?
• Can prove that:
  • The log likelihood does not decrease at any step
  • A fixed point of the update $\theta_{t+1} = \arg\max_{\theta} Q(\theta \mid \theta_t)$ is a stationary point of $\log L(\theta; D)$
• See K&F pp. 882-884
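One standard way to see the first claim is a lower-bound argument based on Jensen's inequality. The sketch below follows the usual textbook route and is not necessarily the exact proof in K&F; $H(\cdot)$ denotes entropy, and terms with zero posterior weight are taken to contribute zero.

```latex
\begin{align*}
\log L(\theta; D)
  &= \sum_m \log \sum_{z} P(x[m], z \mid \theta) \\
  &= \sum_m \log \sum_{z} P(z \mid x[m], \theta_t)\,
       \frac{P(x[m], z \mid \theta)}{P(z \mid x[m], \theta_t)} \\
  &\ge \sum_m \sum_{z} P(z \mid x[m], \theta_t)\,
       \log \frac{P(x[m], z \mid \theta)}{P(z \mid x[m], \theta_t)}
       && \text{(Jensen's inequality)} \\
  &= Q(\theta \mid \theta_t) + \sum_m H\big(P(Z \mid x[m], \theta_t)\big).
\end{align*}
```

The entropy term does not depend on $\theta$, and the bound holds with equality at $\theta = \theta_t$ because the ratio inside the logarithm is then $P(x[m] \mid \theta_t)$, a constant in z. Hence $\log L(\theta_{t+1}; D) \ge Q(\theta_{t+1} \mid \theta_t) + \sum_m H \ge Q(\theta_t \mid \theta_t) + \sum_m H = \log L(\theta_t; D)$.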

Gaussian Clustering using EM
• One of the first uses of EM
• Widely used approach
• Finding good starting points:
  • k-means algorithm
  • (Hard assignment)
• Handling degeneracies
  • Regularization
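A minimal one-dimensional sketch of Gaussian clustering with soft EM follows. Two simplifications are assumptions of this sketch rather than anything from the slides: the starting means are placed at the smallest and largest data points instead of running k-means, and degeneracy is handled by a simple variance floor (`var_floor`), which is only one possible form of regularization.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_em(xs, mus, n_iters=50, var_floor=1e-3):
    """Soft EM for a 1-D Gaussian mixture with one component per starting mean."""
    k = len(mus)
    mus = list(mus)
    variances = [1.0] * k
    pis = [1.0 / k] * k
    for _ in range(n_iters):
        # E step: responsibilities w[m][j] = P(z[m] = j | x[m], theta_t).
        w = []
        for x in xs:
            js = [pis[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            s = sum(js)
            w.append([p / s for p in js])
        # M step: weighted ML estimates of mixing weight, mean, and variance.
        for j in range(k):
            nj = sum(wm[j] for wm in w)                       # expected count
            pis[j] = nj / len(xs)
            mus[j] = sum(wm[j] * x for wm, x in zip(w, xs)) / nj
            var = sum(wm[j] * (x - mus[j]) ** 2 for wm, x in zip(w, xs)) / nj
            variances[j] = max(var, var_floor)                # avoid collapse onto a point
    return pis, mus, variances


# Hypothetical data: two well-separated groups around 0 and 5.
xs = [0.0, 0.1, -0.1, 0.2, -0.2, 5.0, 5.1, 4.9, 5.2, 4.8]
pis, mus, variances = gmm_em(xs, mus=[min(xs), max(xs)])
print(pis, mus, variances)
```

Without the floor, a component that ends up responsible for a single point can shrink its variance toward zero and drive the likelihood to infinity, which is the kind of degeneracy the slide refers to.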

Recap
• Learning with hidden variables
  • Typically categorical
