
Unsupervised Learning II: Soft Clustering with Gaussian Mixture Models


Presentation Transcript


  1. Unsupervised Learning II: Soft Clustering with Gaussian Mixture Models

  2. K-means assigns each data point to exactly one cluster. But: many points could plausibly have come from more than one cluster, and a hard assignment cannot express that uncertainty.

  3. Soft Clustering with Gaussian mixture models • A “soft” version of K-means clustering • Given: Training set S = {x1, ..., xN}, and K. • Assumption: Data is generated by sampling probabilistically from a “mixture” (linear combination) of K Gaussians. [Figure: one-dimensional data points with the mixture’s probability density drawn over them]

  5. Gaussian Mixture Models Assumptions • K clusters • Each cluster is modeled by a Gaussian distribution with a certain mean and standard deviation (or covariance). [This contrasts with K-means, in which each cluster is modeled only by a mean.] • Assume that each data instance we have was generated by the following procedure: 1. Select cluster ci with probability P(ci) = πi. 2. Sample a point from ci’s Gaussian distribution. (A minimal sampling sketch follows below.)
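
The two-step generative procedure above can be made concrete with a short sketch. This is illustrative only: the function name sample_gmm and the mixture weights, means, and standard deviations are made-up example values, not parameters from the lecture.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical example parameters (K = 3 components); not taken from the slides.
    pis    = np.array([0.5, 0.3, 0.2])   # mixing probabilities P(c_i) = pi_i (sum to 1)
    mus    = np.array([-2.0, 0.0, 3.0])  # component means
    sigmas = np.array([0.5, 1.0, 0.8])   # component standard deviations

    def sample_gmm(n):
        """Generate n points by the slide's two-step procedure:
        1) pick cluster c_i with probability pi_i, 2) sample from that cluster's Gaussian."""
        ks = rng.choice(len(pis), size=n, p=pis)       # step 1: choose a cluster per point
        return rng.normal(mus[ks], sigmas[ks]), ks     # step 2: draw from the chosen Gaussians

    x, labels = sample_gmm(1000)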

  12. Mixture of three Gaussians (one-dimensional data). [Figure: the three component densities and the resulting mixture probability density plotted over the data points]
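
A curve like the one on this slide can be reproduced by evaluating the mixture density p(x) = Σk πk N(x | μk, σk²). The sketch below continues the previous one (it reuses the hypothetical pis, mus, and sigmas defined there); gmm_density is an illustrative helper name.

    import numpy as np
    from scipy.stats import norm

    def gmm_density(x):
        """Mixture density p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
        x = np.atleast_1d(x)[:, None]                       # shape (N, 1) for broadcasting
        return (pis * norm.pdf(x, loc=mus, scale=sigmas)).sum(axis=1)

    xs = np.linspace(-5.0, 6.0, 400)
    px = gmm_density(xs)   # plotting px against xs gives a curve like the one on this slide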

  13. Clustering via Gaussian mixture models • Clusters: Each cluster will correspond to a single Gaussian. Each point x ∈ S will have some probability distribution over the K clusters. • Goal: Given the data S, find the Gaussians! (And their probabilities πi.) That is, find the parameters {θk} of these K Gaussians such that P(S | {θk}) is maximized. • This is called a Maximum Likelihood method. • S is the data. • {θk} is the “hypothesis” or “model”. • P(S | {θk}) is the “likelihood”.
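
The “probability distribution over the K clusters” for a point is its set of posterior cluster probabilities P(ck | x) ∝ πk N(x | μk, σk²). A minimal sketch, assuming the component parameters are already known (the function name responsibilities and the example arguments are hypothetical):

    import numpy as np
    from scipy.stats import norm

    def responsibilities(x, pis, mus, sigmas):
        """Soft assignment: P(cluster k | x) for each point, proportional to
        pi_k * N(x | mu_k, sigma_k^2) and normalized over k."""
        x = np.atleast_1d(x)[:, None]
        joint = np.asarray(pis) * norm.pdf(x, loc=mus, scale=sigmas)   # shape (N, K)
        return joint / joint.sum(axis=1, keepdims=True)

    # Example: a point near the boundary of two equally weighted clusters.
    print(responsibilities(0.4, pis=[0.5, 0.5], mus=[-1.0, 1.0], sigmas=[1.0, 1.0]))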

  17. General form of a one-dimensional (univariate) Gaussian mixture model: p(x) = Σk=1..K πk N(x | μk, σk²), where the mixing coefficients satisfy πk ≥ 0 and Σk πk = 1.

  18. Learning a GMM. Simplest case: Maximum Likelihood for a single univariate Gaussian. • Assume training set S has N values generated by a univariate Gaussian distribution: p(x | μ, σ) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²)). • Likelihood function: probability of the data given the model: p(S | μ, σ) = Πn=1..N p(xn | μ, σ).
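
As a sketch (the helper names likelihood and log_likelihood are mine, and the densities come from scipy rather than being coded from the formula by hand), the likelihood and its logarithm can be evaluated like this; note that the raw product underflows for large N, which is one practical reason the upcoming slides switch to the log-likelihood:

    import numpy as np
    from scipy.stats import norm

    def likelihood(S, mu, sigma):
        """p(S | mu, sigma): product of the Gaussian density over all N points."""
        return np.prod(norm.pdf(S, loc=mu, scale=sigma))

    def log_likelihood(S, mu, sigma):
        """ln p(S | mu, sigma): a sum of log densities, numerically much better behaved."""
        return np.sum(norm.logpdf(S, loc=mu, scale=sigma))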

  22. How to estimate the parameters μ and σ from S? • Maximize the likelihood function with respect to μ and σ. • We want the μ and σ that maximize the probability of the data. • Use the log of the likelihood to make this tractable.

  26. Log-likelihood version: ln p(S | μ, σ) = ln Πn=1..N p(xn | μ, σ). Find a simplified expression for this.
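
For reference, expanding the log of the product of Gaussian densities gives the standard simplified expression:

    \ln p(S \mid \mu, \sigma)
      = \sum_{n=1}^{N} \ln \mathcal{N}(x_n \mid \mu, \sigma^2)
      = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2
        \;-\; \frac{N}{2} \ln \sigma^2 \;-\; \frac{N}{2} \ln (2\pi)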

  28. Now, given the simplified expression for ln p(S | μ, σ), find the maximum likelihood parameters μ and σ. First, maximize with respect to μ: set the derivative with respect to μ to 0. Result (ML = “Maximum Likelihood”): μML = (1/N) Σn=1..N xn, the sample mean.
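
The intermediate step behind that result (omitted here because the slide's equations were images) is the standard one:

    \frac{\partial}{\partial \mu} \ln p(S \mid \mu, \sigma)
      = \frac{1}{\sigma^2} \sum_{n=1}^{N} (x_n - \mu) = 0
    \quad\Longrightarrow\quad
    \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n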

  35. Now, maximize with respect to σ. It’s actually easier to maximize with respect to σ²: set the derivative with respect to σ² to 0. Result: σ²ML = (1/N) Σn=1..N (xn − μML)², the sample variance about μML.
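
Again for reference, the corresponding derivation step is:

    \frac{\partial}{\partial \sigma^2} \ln p(S \mid \mu, \sigma)
      = \frac{1}{2\sigma^4} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2\sigma^2} = 0
    \quad\Longrightarrow\quad
    \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2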

  40. The resulting distribution is called a generative model because it can generate new data values. • We say that θML (here, the fitted μML and σML) parameterizes the model. • In general, θ is used to denote the (learnable) parameters of a probabilistic model.
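
A minimal sketch of what “generative” means here, using made-up data (the toy Gaussian the data is drawn from, and the variable names, are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy training set: 500 draws from some Gaussian we pretend not to know.
    S = rng.normal(loc=4.0, scale=2.0, size=500)

    # Closed-form maximum-likelihood estimates from the preceding slides.
    mu_ml     = S.mean()                     # (1/N) * sum of x_n
    sigma2_ml = ((S - mu_ml) ** 2).mean()    # (1/N) * sum of (x_n - mu_ML)^2

    # The fitted model is generative: it can now produce new data values.
    new_values = rng.normal(mu_ml, np.sqrt(sigma2_ml), size=10)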

  43. Discriminative vs. generative models in machine learning • Discriminative model: learns a discrimination curve (decision boundary) that separates the data into classes. • Generative model: gives the parameters θML of a probability distribution that maximizes the likelihood p(S | θ) of the data.

  44. Learning a GMM. More general case: multivariate Gaussian distribution. Multivariate (D-dimensional) Gaussian: N(x | μ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)), where μ is the D-dimensional mean vector and Σ is the D×D covariance matrix.
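
The multivariate density can be evaluated directly with scipy; the 2-D mean vector and covariance matrix below are made-up example values:

    import numpy as np
    from scipy.stats import multivariate_normal

    # Hypothetical 2-D example (D = 2).
    mu    = np.array([0.0, 1.0])
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 2.0]])

    x = np.array([0.5, 0.5])
    density = multivariate_normal.pdf(x, mean=mu, cov=Sigma)   # N(x | mu, Sigma)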

  45. Variance: var(xi) = E[(xi − E[xi])²]. Covariance: cov(xi, xj) = E[(xi − E[xi])(xj − E[xj])]. Covariance matrix Σ: Σi,j = cov(xi, xj).
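
A sample covariance matrix can be estimated from data with numpy (the data here is made up for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 3))        # made-up data: 200 points in D = 3 dimensions

    Sigma_hat = np.cov(X, rowvar=False)  # 3x3 matrix with Sigma_hat[i, j] = cov(x_i, x_j)
    # np.cov divides by N-1 by default; pass bias=True for the 1/N (maximum-likelihood) form.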

  46. Let S be a set of multivariate data points (vectors): S = {x1, ..., xN}. • General expression for a finite Gaussian mixture model: p(x) = Σk=1..K πk N(x | μk, Σk), with πk ≥ 0 and Σk πk = 1.
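
To tie the section together, here is a sketch that fits a finite Gaussian mixture to made-up 2-D data with scikit-learn's off-the-shelf GaussianMixture (a library illustration, not the fitting algorithm derived in this lecture) and reads off the soft cluster assignments:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(3)
    # Made-up data: two 2-D blobs.
    X = np.vstack([rng.normal([-3.0, 0.0], 1.0, size=(150, 2)),
                   rng.normal([ 3.0, 1.0], 0.7, size=(150, 2))])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

    print(gmm.weights_)              # estimated mixing coefficients pi_k
    print(gmm.means_)                # estimated component means mu_k
    resp = gmm.predict_proba(X[:5])  # soft assignments P(cluster k | x) for the first 5 points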
