Unsupervised Learning

Unsupervised Learning Gaussian Mixture Models Expectation-Maximization (EM)

Gaussian Mixture Models Like K-Means, GMM clusters have centers. In addition, each cluster has a probability distribution that gives the probability that a point belongs to it. The ellipses in the figure show “level sets”: curves along which the probability of belonging to the cluster is equal. Notice that green points still have SOME probability of belonging to the blue cluster, but it is much lower than it is for the blue points. This is a more complex model than K-Means: distance from the center can matter more in one direction than another. (Figure: clusters with elliptical level sets, plotted on axes X1 and X2.)

GMMs and EM Gaussian Mixture Models (GMMs) are a model, similar to a Naïve Bayes model but with important differences. Expectation-Maximization (EM) is a parameter-estimation algorithm for training GMMs using unlabeled data. To explain these further, we first need to review Gaussian (normal) distributions.

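The Normal (aka Gaussian) Distribution $f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, where μ is the mean, σ² is the variance, and σ is the standard deviation.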

Quiz: MLE for Gaussians Based on your statistics knowledge, • What is the MLE for μ from a bunch of example X points? • What is the MLE for σ from a bunch of example X points?

Answer: MLE for Gaussians Based on your statistics knowledge,
• What is the MLE for μ from a bunch of example X points? $\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} X_i$ (average of the X values)
• What is the MLE for σ from a bunch of example X points? $\hat{\sigma} = \sqrt{\frac{1}{M}\sum_{i=1}^{M} (X_i - \hat{\mu})^2}$ (square root of the average squared deviation from the mean)
Note: this is a so-called “biased” estimator for σ²; there is also an “unbiased” estimator, which simply uses (M-1) instead of M. We’ll stick to the “biased” one here, but either one is fine.
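To make the difference between the two estimators concrete, here is a minimal sketch using only the Python standard library. The sample values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
xs = [2.0, 4.0, 4.0, 6.0]
M = len(xs)

mu_mle = sum(xs) / M                          # average of the X values
sq_dev = sum((x - mu_mle) ** 2 for x in xs)   # total squared deviation from the mean

var_mle = sq_dev / M             # "biased" MLE: divide by M
var_unbiased = sq_dev / (M - 1)  # "unbiased" version: divide by M - 1

# The standard library offers both conventions.
assert abs(var_mle - statistics.pvariance(xs)) < 1e-12
assert abs(var_unbiased - statistics.variance(xs)) < 1e-12
```

The two values differ by a factor of M/(M-1), so the gap shrinks as the dataset grows.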

Quiz: Deriving the ML estimators How would you derive the MLE equations for Gaussian distributions?

Answer: Deriving the ML estimators How would you derive the MLE equations for Gaussian distributions? Same plan of attack as for MLE estimates of Bayes Nets:
• Write down the Likelihood function P(D | M).
• Make the assumption that the data points Xi are independently distributed, so P(D | M) = $\prod_{i} f(X_i \mid \mu, \sigma)$.
• Take the log.
• Take the partial derivative with respect to μ, set this equal to zero, and solve for μ.
• Take the partial derivative with respect to σ, set this equal to zero, and solve for σ.
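The steps above can be worked through as follows. This is a sketch in which $\mathcal{L}(\mu, \sigma)$ is introduced here as shorthand for the likelihood P(D | M) of the data under the model, and M in the sums denotes the number of data points.

```latex
\begin{align*}
\log \mathcal{L}(\mu, \sigma)
  &= \sum_{i=1}^{M} \log f(X_i \mid \mu, \sigma)
   = -M \log\!\left(\sigma\sqrt{2\pi}\right)
     - \frac{1}{2\sigma^2} \sum_{i=1}^{M} (X_i - \mu)^2 \\[4pt]
\frac{\partial \log \mathcal{L}}{\partial \mu}
  &= \frac{1}{\sigma^2} \sum_{i=1}^{M} (X_i - \mu) = 0
  \;\Longrightarrow\;
  \mu = \frac{1}{M} \sum_{i=1}^{M} X_i \\[4pt]
\frac{\partial \log \mathcal{L}}{\partial \sigma}
  &= -\frac{M}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{M} (X_i - \mu)^2 = 0
  \;\Longrightarrow\;
  \sigma^2 = \frac{1}{M} \sum_{i=1}^{M} (X_i - \mu)^2
\end{align*}
```

The second result is the “biased” estimator: it divides by M rather than M-1.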

Quiz: Estimating a Gaussian Consider a dataset with the following X values: 0, 3, 4, 5, 6, 7, 10. Find the maximum likelihood Gaussian distribution.

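Answer: Estimating a Gaussian The dataset has the following X values: 0, 3, 4, 5, 6, 7, 10. The maximum likelihood Gaussian has mean μ = 35/7 = 5 and variance σ² = 60/7 ≈ 8.57 (so σ ≈ 2.93), using the “biased” estimator.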
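The quiz can be checked numerically. The sketch below fits the maximum likelihood Gaussian to the quiz dataset using the “biased” estimator. The helper name `gaussian_pdf` is ours, introduced only for illustration.

```python
import math

xs = [0, 3, 4, 5, 6, 7, 10]   # dataset from the quiz
M = len(xs)

mu = sum(xs) / M                              # MLE for the mean
sigma2 = sum((x - mu) ** 2 for x in xs) / M   # "biased" MLE for the variance
sigma = math.sqrt(sigma2)

def gaussian_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

print(f"mu = {mu:.3f}, sigma^2 = {sigma2:.3f}, sigma = {sigma:.3f}")
# prints: mu = 5.000, sigma^2 = 8.571, sigma = 2.928
```

Evaluating `gaussian_pdf(x, mu, sigma)` at each data point shows that the fitted density is highest at the mean and falls off symmetrically toward 0 and 10.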

Clustering by fitting K Gaussians Suppose our dataset looks like the one above. It doesn’t really look Gaussian anymore; it looks like it has 3 clusters. Fitting a single Gaussian to this data will still give you an estimate. But that Gaussian will have a low Likelihood value: it will give very low probability to the leftmost and rightmost clusters.

Clustering by fitting K Gaussians What we’d like to do instead is to fit K Gaussians. A model for data that involves multiple Gaussian distributions is called a Gaussian Mixture Model (GMM).

Clustering by fitting K Gaussians Another way of drawing these is with “level sets”: curves that show points with equal probability under each Gaussian. Wider curves have lower probability than narrower curves. Notice that each point is contained within every Gaussian, but is most tightly bound to the closest Gaussian. (Figure: level sets of three Gaussians with means μred, μblue, and μgreen.)

Expectation-Maximization (EM) EM is “K-Means for GMMs”. It is a parameter estimation algorithm for GMMs that will determine a (locally-optimal) setting for all of the GMM parameters, using a bunch of unlabeled X points.
Input: 1. Data points X1, …, XM 2. A number K
Output: μ1, σ1, …, μK, σK, such that the GMM with those means and standard deviations has a locally-maximum likelihood for the training data set.

Visualization of EM
• Initialize the mean and standard deviation of each Gaussian randomly.
• Repeat until convergence:
  • Expectation: For each point X and each Gaussian k, find f(X | Gaussian k).
  • Maximization: Estimate new parameters for each Gaussian. (Technically, you also need to estimate a third parameter, called πk. More later.)

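Gaussian Mixture Model A GMM involves K Gaussian distributions with parameters (μ1, σ1) through (μK, σK). It also involves K additional parameters, called prior probabilities, π1 through πK. These describe the relative importance of each of the K Gaussian distributions in the full model. The likelihood equation for this model looks like this:
$f(X_1, \dots, X_M \mid \pi_1, \mu_1, \sigma_1, \dots, \pi_K, \mu_K, \sigma_K) = \prod_{i=1}^{M} f(X_i \mid \pi_1, \mu_1, \sigma_1, \dots, \pi_K, \mu_K, \sigma_K)$ (i.i.d. assumption)
$= \prod_{i=1}^{M} \sum_{k=1}^{K} \pi_k \, f(X_i \mid \mu_k, \sigma_k)$
Here πk is the prior and f(Xi | μk, σk) is the Gaussian density of component k.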
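As a rough illustration of the likelihood equation, the sketch below computes the log of the GMM likelihood (a product over points of a prior-weighted sum of Gaussian densities) and compares a single fitted Gaussian with a three-component mixture. The dataset and the mixture parameters are hypothetical and hand-picked, not fitted, and the function names are ours.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def gmm_log_likelihood(xs, pis, mus, sigmas):
    """log of prod_i sum_k pi_k * f(x_i | mu_k, sigma_k)."""
    total = 0.0
    for x in xs:
        mix = sum(p * gaussian_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas))
        total += math.log(mix)
    return total

# Hypothetical 1-D data with three visible clusters.
xs = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5, 19.5, 20.0, 20.5]

# A single Gaussian fitted by MLE, treated as a GMM with K = 1.
mu = sum(xs) / len(xs)
sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
ll_single = gmm_log_likelihood(xs, [1.0], [mu], [sigma])

# A hand-picked (not fitted) three-component mixture.
ll_mixture = gmm_log_likelihood(
    xs, pis=[1 / 3, 1 / 3, 1 / 3], mus=[0.0, 10.0, 20.0], sigmas=[0.5, 0.5, 0.5]
)

print(ll_single, ll_mixture)
```

On this data the mixture gets a noticeably higher log-likelihood than the single Gaussian, which matches the motivation for fitting K Gaussians instead of one.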

GMMs as Bayes Nets GMMs are simple Bayes Nets with a parent variable Cluster (values 1, 2, …, K) and a child variable X (a real number). Two differences from previous BNs we’ve seen:
• We’re used to binary variables in BNs. Here, the “Cluster” variable has K possible values (1, 2, …, K) instead of just two (+cluster and –cluster). We used to store P(+a) and P(-a) for the parent variable; now we store π1 through πK.
• The “X” variable has infinitely many values (any real number) instead of just two (+x and –x). We used to store P(+x | +a) and P(+x | -a). Now we store (μ1, σ1) through (μK, σK), and we say f(X | Cluster is j) = $\frac{1}{\sigma_j\sqrt{2\pi}} \exp\left(-\frac{(X-\mu_j)^2}{2\sigma_j^2}\right)$.
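One way to see the Bayes Net structure is to sample from it: first draw the parent Cluster from the priors, then draw X from that cluster's Gaussian. The sketch below does this with the standard library. The function name `sample_gmm` and all parameter values are hypothetical, and clusters are indexed from 0 in the code rather than from 1.

```python
import random

def sample_gmm(n, pis, mus, sigmas, seed=0):
    """Draw n (cluster, x) pairs from a 1-D GMM viewed as a two-node Bayes Net."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Parent node: Cluster takes value k with prior probability pis[k].
        k = rng.choices(range(len(pis)), weights=pis)[0]
        # Child node: X given Cluster = k is Gaussian with mean mus[k], std dev sigmas[k].
        x = rng.gauss(mus[k], sigmas[k])
        samples.append((k, x))
    return samples

# Hypothetical parameters for K = 3.
samples = sample_gmm(1000, pis=[0.5, 0.3, 0.2], mus=[0.0, 10.0, 20.0], sigmas=[1.0, 1.0, 1.0])

# In unsupervised learning we only observe the x values; the cluster labels are hidden.
observed = [x for _, x in samples]
```

EM works in the opposite direction: it is given only `observed` and has to estimate the priors, means, and standard deviations.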

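Formal Description of the Algorithm
1. Init: For each k in {1, …, K}, create a random πk, μk, σ2k.
2. Repeat until all πk, μk, σ2k remain the same from one iteration to the next:
   Expectation (aka Assignment in K-Means): For each Xi, for each k: let C[Xi,k] ← $\frac{\pi_k \, f(X_i \mid \mu_k, \sigma_k)}{\sum_{j=1}^{K} \pi_j \, f(X_i \mid \mu_j, \sigma_j)}$
   Maximization (aka Update in K-Means): For each k, let
     πk ← $\frac{1}{M} \sum_{i=1}^{M} C[X_i,k]$
     μk ← $\frac{\sum_{i=1}^{M} C[X_i,k] \, X_i}{\sum_{i=1}^{M} C[X_i,k]}$
     σ2k ← $\frac{\sum_{i=1}^{M} C[X_i,k] \, (X_i - \mu_k)^2}{\sum_{i=1}^{M} C[X_i,k]}$
3. Return (for all values of k) πk, μk, σ2k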
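The algorithm can be sketched in plain Python for one-dimensional data. This is a minimal illustration under several assumptions that are ours rather than the slides': the E-step uses the usual normalized weights (prior times density, divided by the sum over clusters), initialization uses uniform priors with means drawn from the data (or passed in through the hypothetical `init_mus` argument), the loop stops when the log-likelihood changes by less than a tolerance, and a small variance floor guards against collapse. It does not handle other degenerate cases.

```python
import math
import random

def gaussian_pdf(x, mu, var):
    """Density of the normal distribution with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, K, init_mus=None, iters=100, tol=1e-9, seed=0, min_var=1e-6):
    """Fit a 1-D GMM with EM. Returns (pis, mus, variances, log-likelihood history)."""
    rng = random.Random(seed)
    M = len(xs)
    # Init (assumption): uniform priors, means sampled from the data, shared variance.
    pis = [1.0 / K] * K
    mus = list(init_mus) if init_mus is not None else rng.sample(xs, K)
    mean_all = sum(xs) / M
    var_all = sum((x - mean_all) ** 2 for x in xs) / M
    variances = [var_all] * K
    history = []
    for _ in range(iters):
        # Expectation: C[i][k] is the weight of cluster k for point i.
        C, ll = [], 0.0
        for x in xs:
            w = [pis[k] * gaussian_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(w)
            ll += math.log(total)
            C.append([wk / total for wk in w])
        history.append(ll)
        # Maximization: re-estimate each cluster from the weighted points.
        for k in range(K):
            Nk = sum(C[i][k] for i in range(M))
            pis[k] = Nk / M
            mus[k] = sum(C[i][k] * xs[i] for i in range(M)) / Nk
            var_k = sum(C[i][k] * (xs[i] - mus[k]) ** 2 for i in range(M)) / Nk
            variances[k] = max(var_k, min_var)   # variance floor (assumption)
        # Stop (assumption) when the log-likelihood has stopped improving.
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
    return pis, mus, variances, history

# Hypothetical data with two well-separated clusters.
xs = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]
pis, mus, variances, history = em_gmm_1d(xs, K=2, init_mus=[1.0, 9.0])
print(pis, mus, variances)
```

With this starting point the means move to roughly 0 and 10, and the recorded log-likelihood values in `history` never decrease from one iteration to the next. A different initialization can end in a different local optimum.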

Evaluation metric for GMMs and EM Loss function (or objective function) for EM: EM (locally) maximizes the “marginal” likelihood: $\mathrm{EM}(X_1, \dots, X_M) = \arg\max_{\pi_1, \mu_1, \sigma_1, \dots, \pi_K, \mu_K, \sigma_K} f(X_1, \dots, X_M \mid \pi_1, \mu_1, \sigma_1, \dots, \pi_K, \mu_K, \sigma_K)$. Notice that this is the likelihood function for just the X variable in our Bayes Net, rather than the likelihood for (X and Cluster), which is why it is called “marginal likelihood” rather than just “likelihood”.

Analysis of EM Performance EM is guaranteed to find a local optimum of the Likelihood function. Theorem: After one iteration of EM, the Likelihood of the new GMM >= the Likelihood of the previous GMM. (Dempster, A.P.; Laird, N.M.; Rubin, D.B. 1977. "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B (Methodological) 39 (1): 1–38. JSTOR 2984875.)

EM Generality Although we have presented EM as a way to train GMMs, the same basic algorithm can be used for learning with arbitrary Bayes Nets when some of the training data has missing values. This has made EM one of the most popular unsupervised learning techniques in machine learning.

EM Quiz (Figure: points a, b, and c, with Gaussians g1, g2, and g3.) Which Gaussian(s) have a nonzero value for f(a)? How about f(c)?

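Answer: EM Quiz (Figure: points a, b, and c, with Gaussians g1, g2, and g3.) Which Gaussian(s) have a nonzero f(a)? All Gaussians (g1, g2, and g3) have a nonzero value for f(a), because a Gaussian density is positive for every real number. How about f(c)? Ditto. All Gaussians have a nonzero value for f(c).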

Quiz: EM vs. K-Means (Figure: points a and c, cluster centers g1 and g2, and candidate positions Option 1 through Option 4.) At the end of K-Means, where will cluster center g1 end up – Option 1 or Option 2? At the end of EM, where will cluster center g1 end up – Option 1 or Option 2?

Answer: EM vs. K-Means (Figure: points a and c, cluster centers g1 and g2, and candidate positions Option 1 through Option 4.)
At the end of K-Means, where will cluster center g1 end up – Option 1 or Option 2? Option 1: K-Means puts the “mean” at the center of all points in the cluster, and point a will be the only point in g1’s cluster.
At the end of EM, where will cluster center g1 end up – Option 1 or Option 2? Option 2: EM puts the “mean” at the center of all points in the dataset, where each point is weighted by how likely it is according to the Gaussian. Point a and Point b will both have some likelihood, but Point a’s likelihood will be much higher. So the “mean” for g1 will be very close to Point a, but not all the way at Point a.
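The difference between the two updates can be seen with two points on a line. The sketch below performs a single update of g1 under each method. The coordinates, the shared standard deviation, and the equal priors are hypothetical choices, and the EM weights are normalized across the two Gaussians, which is one common way to make "how likely it is" precise.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical 1-D layout: point a sits near g1, the other point sits near g2.
a, other = 0.0, 4.0
points = [a, other]
g1, g2, sigma = 0.5, 3.5, 1.0   # current centers and a shared std dev (assumed)

# K-Means update: plain mean of the points whose nearest center is g1.
assigned = [x for x in points if abs(x - g1) <= abs(x - g2)]
kmeans_g1 = sum(assigned) / len(assigned)

# EM update: mean of ALL points, each weighted by g1's share of its density
# (equal priors assumed, so the priors cancel).
weights = [
    gaussian_pdf(x, g1, sigma) / (gaussian_pdf(x, g1, sigma) + gaussian_pdf(x, g2, sigma))
    for x in points
]
em_g1 = sum(w * x for w, x in zip(weights, points)) / sum(weights)

print(kmeans_g1, em_g1)
```

K-Means moves g1 exactly onto point a, while the EM update lands slightly to the side of it, pulled by the small weight the far point still carries.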

How many clusters? We’ve been assuming a fixed K. Here’s a technique to determine this automatically, from data.
New objective function: Minimize:
Algorithm:
1. Initialize K somehow.
Repeat until convergence:
2. Run EM.
3. Remove unnecessary clusters (low π value).
4. Create new random clusters (more or fewer than before, depending on a heuristic estimate of whether there were too many or too few before).
This is slow. But one nice property is that it can overcome some difficulties with local maxima.

Quiz Is EM for GMMs Classification or Regression? Generative or Discriminative? Parametric or Nonparametric?

Answer
Is EM for GMMs Classification or Regression? Two possible answers:
• Classification: the output is a discrete value (cluster label) for each point.
• Regression: the output is a real value (probability) for each possible cluster label for each point.
Generative or Discriminative?
• Generative: normally, it’s used with a fixed set of input and output variables. However, GMMs are Bayes Nets that store a full joint distribution. Once it’s trained, a GMM can actually make predictions for any subset of the variables given any other subset, so technically it is generative.
Parametric or Nonparametric?
• Parametric: the number of parameters is 3K, which does not change with the number of training data points.

Quiz Is EM for GMMs Supervised or Unsupervised? Online or batch? Closed-form or iterative?

Answer
Is EM for GMMs Supervised or Unsupervised?
• Unsupervised.
Online or batch?
• Batch: if you add a new data point, you need to revisit all the training data to recompute the locally-optimal model.
Closed-form or iterative?
• Iterative: training requires many passes through the data.