This part of the series focuses on solving computational problems in probabilistic models, including estimating parameters in models with latent variables and computing posterior distributions involving a large number of variables.
Estimation and inference • Actually working with probabilistic models requires solving some difficult computational problems… • Two key problems: • estimating parameters in models with latent variables • computing posterior distributions involving large numbers of variables
Part IV: Inference algorithms • The EM algorithm • for estimation in models with latent variables • Markov chain Monte Carlo • for sampling from posterior distributions involving large numbers of variables
SUPERVISED (figure: a collection of example images, each labeled “dog” or “cat”)
Supervised learning • (figure: observations from Category A and Category B) • What characterizes the categories? • How should we categorize a new observation?
Parametric density estimation • Assume that p(x|c) has a simple form, characterized by parameters θ • Given stimuli X = {x1, x2, …, xn} from category c, find θ by maximum-likelihood estimation or some form of Bayesian estimation
Spatial representations • Assume a simple parametric form for p(x|c): a Gaussian • For each category, estimate parameters • mean μ • variance σ² (figure: graphical model with c ~ P(c) generating x ~ p(x|c))
The Gaussian distribution • probability density $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ • mean μ, standard deviation σ, variance σ² (figure: density plotted against (x−μ)/σ)
Estimating a Gaussian • X = {x1, x2, …, xn} independently sampled from a Gaussian • maximum likelihood parameter estimates: $\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\mu)^2$
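A minimal NumPy sketch of these estimates (the sample values are made up for illustration):

```python
import numpy as np

x = np.array([2.1, 1.9, 2.4, 2.2, 1.8])  # hypothetical samples X = {x_1, ..., x_n}

mu_hat = x.mean()                         # ML mean: (1/n) * sum_i x_i
var_hat = ((x - mu_hat) ** 2).mean()      # ML variance: (1/n) * sum_i (x_i - mu_hat)^2
```

Note that the ML variance divides by n, not n − 1; it is the biased estimator.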
Multivariate Gaussians • $p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$ • mean $\boldsymbol{\mu}$, variance/covariance matrix $\Sigma$, quadratic form in the exponent
Estimating a Gaussian • X = {x1, x2, …, xn} independently sampled from a multivariate Gaussian • maximum likelihood parameter estimates: $\hat{\boldsymbol\mu} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$, $\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i - \hat{\boldsymbol\mu})(\mathbf{x}_i - \hat{\boldsymbol\mu})^\top$
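The same computation in the multivariate case, with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))               # hypothetical n x d data matrix

mu_hat = X.mean(axis=0)                     # ML mean vector
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)  # ML covariance (divides by n, not n - 1)
```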
Bayesian inference • a new observation x is categorized via $P(c|x) = \frac{p(x|c)\,P(c)}{\sum_{c'} p(x|c')\,P(c')}$ (figure: probability plotted against x)
Unsupervised learning What latent structure is present? What are the properties of a new observation?
An example: Clustering Assume each observed xi is from a cluster ci, where ci is unknown What characterizes the clusters? What cluster does a new x come from?
Density estimation • We need to estimate some probability distributions • what is P(c)? • what is p(x|c)? • But… c is unknown, so we only know the value of x (figure: graphical model with c ~ P(c) generating x ~ p(x|c))
Supervised and unsupervised Supervised learning: categorization • Given x = {x1, …, xn} and c = {c1, …, cn} • Estimate parameters of p(x|c) and P(c) Unsupervised learning: clustering • Given x = {x1, …, xn} • Estimate parameters of p(x|c) and P(c)
Mixture distributions • mixture distribution $p(x) = \sum_c P(c)\,p(x|c)$ • P(c): mixture weights • p(x|c): mixture components (figure: mixture density plotted against x)
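A small sketch of evaluating such a density, with made-up weights and Gaussian components:

```python
import numpy as np
from scipy.stats import norm

weights = np.array([0.3, 0.7])   # mixture weights P(c), assumed values
means = np.array([-2.0, 1.0])    # component means, assumed values
sds = np.array([0.5, 1.0])       # component standard deviations, assumed values

def mixture_density(x):
    # p(x) = sum_c P(c) * p(x|c), with Gaussian components p(x|c)
    return np.sum(weights * norm.pdf(x, loc=means, scale=sds))

print(mixture_density(0.0))
```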
More generally… • Unsupervised learning is density estimation using distributions with latent variables: marginalize out (i.e. sum over) the latent structure, $P(x) = \sum_z P(z)\,P(x|z)$ • z: latent (unobserved) • x: observed
A chicken and egg problem • If we knew which cluster the observations were from we could find the distributions • this is just density estimation • If we knew the distributions, we could infer which cluster each observation came from • this is just categorization
Alternating optimization algorithm 0. Guess initial parameter values 1. Given parameter estimates, solve for maximum a posteriori assignments ci: $c_i = \arg\max_c\, p(x_i|c)\,P(c)$ 2. Given assignments ci, solve for maximum likelihood parameter estimates, computing each cluster’s parameters from the points assigned to it 3. Go to step 1
Alternating optimization algorithm • x: observed data points • c: assignments to clusters • μ, σ, P(c): parameters • For simplicity, assume σ and P(c) fixed: the “k-means” algorithm
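A minimal sketch of this k-means loop; the data shape, cluster count, and initialization scheme are illustrative choices, not part of the original slides:

```python
import numpy as np

def k_means(X, k, n_iter=20, seed=0):
    """Alternating optimization with hard assignments (sigma and P(c) held fixed)."""
    rng = np.random.default_rng(seed)
    # Step 0: guess initial means by picking k distinct data points
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Step 1: MAP assignments; with equal variances and uniform P(c),
        # this reduces to "assign each point to the nearest mean"
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Step 2: ML estimate of each mean from the points assigned to it
        for j in range(k):
            if np.any(c == j):
                mu[j] = X[c == j].mean(axis=0)
    return mu, c
```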
Alternating optimization algorithm (figure sequence) • Step 0: initial parameter values • Step 1: update assignments • Step 2: update parameters • Steps 1 and 2 then repeat, alternating between the two updates
Alternating optimization algorithm (revisited) 0. Guess initial parameter values 1. Given parameter estimates, solve for maximum a posteriori assignments ci: $c_i = \arg\max_c\, p(x_i|c)\,P(c)$ 2. Given assignments ci, solve for maximum likelihood parameter estimates 3. Go to step 1 • why “hard” assignments?
Estimating a Gaussian (with hard assignments) • X = {x1, x2, …, xn} independently sampled from a Gaussian, each point carrying weight 1 if assigned to the cluster and 0 otherwise • maximum likelihood parameter estimates: $\hat\mu = \frac{1}{n_c}\sum_{i:\,c_i=c} x_i$, $\hat\sigma^2 = \frac{1}{n_c}\sum_{i:\,c_i=c} (x_i-\hat\mu)^2$, where $n_c$ is the number of points assigned to the cluster
Estimating a Gaussian (with soft assignments) • the “weight” of each point is the probability of being in the cluster, $w_i = P(c|x_i)$ • maximum likelihood parameter estimates: $\hat\mu = \frac{\sum_i w_i x_i}{\sum_i w_i}$, $\hat\sigma^2 = \frac{\sum_i w_i (x_i-\hat\mu)^2}{\sum_i w_i}$
The Expectation-Maximization algorithm (clustering version) 0. Guess initial parameter values 1. Given parameter estimates, compute posterior distribution over assignments ci: $P(c_i|x_i,\theta) \propto p(x_i|c_i,\theta)\,P(c_i)$ 2. Solve for maximum likelihood parameter estimates, weighting each observation by the probability it came from that cluster 3. Go to step 1
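A sketch of this loop for a hypothetical 1-D mixture of Gaussians; the initialization scheme and the helper name are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def em_mixture(x, k, n_iter=50, seed=0):
    """EM for a 1-D mixture of Gaussians, with soft assignments."""
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                   # mixture weights P(c)
    mu = rng.choice(x, size=k, replace=False)  # initial means: k random points
    sd = np.full(k, x.std())                   # initial standard deviations
    for _ in range(n_iter):
        # E-step: posterior P(c|x_i) for every point (an n x k array of weights)
        w = pi * norm.pdf(x[:, None], loc=mu, scale=sd)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted ML estimates, each point weighted by P(c|x_i)
        n_c = w.sum(axis=0)
        pi = n_c / len(x)
        mu = (w * x[:, None]).sum(axis=0) / n_c
        sd = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / n_c)
    return pi, mu, sd
```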
The Expectation-Maximization algorithm (more general version) 0. Guess initial parameter values 1. Given parameter estimates, compute posterior distribution over latent variables z: $P(z|x,\theta)$ 2. Find parameter estimates that maximize the expected log-likelihood: $\theta^{\text{new}} = \arg\max_{\theta}\, \mathbb{E}_{P(z|x,\theta^{\text{old}})}\!\left[\log P(x,z|\theta)\right]$ 3. Go to step 1
A note on expectations • For a function f(x) and distribution P(x), the expectation of f with respect to P is $\mathbb{E}_P[f] = \sum_x f(x)\,P(x)$ (an integral for continuous x) • The expectation is the average value of f when x is drawn from the probability distribution P
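For instance, computing this sum exactly for the number of spots on a fair die:

```python
# E_P[f] = sum_x f(x) P(x), with f(x) = x and P(x) = 1/6 for x in {1, ..., 6}
expectation = sum(x * (1 / 6) for x in range(1, 7))
print(expectation)  # 3.5
```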
Good features of EM • Convergence • guaranteed to converge to at least a local maximum of the likelihood (or other extremum) • likelihood is non-decreasing across iterations • Efficiency • big steps initially (other algorithms better later) • Generality • can be defined for many probabilistic models • can be combined with a prior for MAP estimation
Limitations of EM • Local optima • e.g., one component poorly fits two clusters, while two components split up a single cluster • Degeneracies • e.g., two components may merge, or a component may lock onto one data point, with its variance going to zero • May be intractable for complex models • dealing with this is an active research topic
EM and cognitive science • The EM algorithm seems like it might be a good way to describe some “bootstrapping” • anywhere there’s a “chicken and egg” problem • a prime example: language learning
Probabilistic context-free grammars • S → NP VP (1.0) • NP → T N (0.7) • NP → N (0.3) • VP → V NP (1.0) • T → the (0.8) • T → a (0.2) • N → man (0.5) • N → ball (0.5) • V → hit (0.6) • V → took (0.4) • (figure: parse tree for “the man hit the ball”) • P(tree) = 1.0 × 0.7 × 1.0 × 0.8 × 0.5 × 0.6 × 0.7 × 0.8 × 0.5
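The product above can be checked directly; a small sketch using the rule probabilities from this grammar:

```python
# Rule probabilities used in the parse of "the man hit the ball"
rules = [
    ("S  -> NP VP", 1.0),
    ("NP -> T N",   0.7),   # the man
    ("T  -> the",   0.8),
    ("N  -> man",   0.5),
    ("VP -> V NP",  1.0),
    ("V  -> hit",   0.6),
    ("NP -> T N",   0.7),   # the ball
    ("T  -> the",   0.8),
    ("N  -> ball",  0.5),
]

p_tree = 1.0
for _, p in rules:
    p_tree *= p             # P(tree) is the product of all rule probabilities
print(p_tree)               # ~0.047
```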
EM and cognitive science (continued) • Fried and Holyoak (1984) explicitly tested a model of human categorization that was almost exactly a version of the EM algorithm for a mixture of Gaussians
Part IV: Inference algorithms • The EM algorithm • for estimation in models with latent variables • Markov chain Monte Carlo • for sampling from posterior distributions involving large numbers of variables
The Monte Carlo principle • The expectation of f with respect to P can be approximated by $\mathbb{E}_P[f] \approx \frac{1}{n}\sum_{i=1}^{n} f(x_i)$, where the xi are sampled from P(x) • Example: the average number of spots on a die roll
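A quick simulation of the die-roll example (the sample size is chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # samples x_i drawn from P(x), a fair die
print(rolls.mean())                       # approaches the exact value 3.5 as n grows
```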
The Monte Carlo principle (figure: the law of large numbers; the average number of spots converges to 3.5 as the number of rolls grows)
Markov chain Monte Carlo • Sometimes it isn’t possible to sample directly from a distribution • Sometimes, you can only compute something proportional to the distribution • Markov chain Monte Carlo: construct a Markov chain that will converge to the target distribution, and draw samples from that chain • just uses something proportional to the target
Markov chains • x(t+1) is independent of all previous variables given its immediate predecessor x(t) (figure: chain x(1) → x(2) → … → x(t)) • Transition matrix T = P(x(t+1)|x(t))
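A minimal simulation sketch with a made-up 3-state transition matrix:

```python
import numpy as np

# A hypothetical 3-state chain: T[i, j] = P(x^(t+1) = j | x^(t) = i)
T = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

rng = np.random.default_rng(0)
state = 0
trajectory = [state]
for _ in range(10):
    # the next state depends only on the current state (the Markov property)
    state = rng.choice(3, p=T[state])
    trajectory.append(state)
print(trajectory)
```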
An example: card shuffling • Each state x(t) is a permutation of a deck of cards (there are 52! permutations) • Transition matrix T indicates how likely one permutation is to become another • The transition probabilities are determined by the shuffling procedure • riffle shuffle • overhand • one card
Convergence of Markov chains • Why do we shuffle cards? • Convergence to a uniform distribution takes only 7 riffle shuffles… • Other Markov chains will also converge to a stationary distribution, if certain simple conditions are satisfied (called “ergodicity”) • e.g. every state can be reached in some number of steps from every other state
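A quick check of convergence to a stationary distribution, reusing the made-up 3-state matrix from the previous sketch (its self-loops and connected states satisfy the reachability condition):

```python
import numpy as np

T = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

p = np.array([1.0, 0.0, 0.0])  # start in state 0 with certainty
for _ in range(1000):
    p = p @ T                  # one step of the chain, applied to the distribution
print(p)                       # converged stationary distribution, satisfying p = p @ T
```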
Markov chain Monte Carlo • States of the chain are the variables of interest • Transition matrix chosen to give the target distribution as its stationary distribution (figure: chain x(1) → x(2) → … → x(t)) • Transition matrix T = P(x(t+1)|x(t))
Metropolis-Hastings algorithm • Transitions have two parts: • proposal distribution: Q(x(t+1)|x(t)) • acceptance: take proposals with probability $A(x^{(t)}, x^{(t+1)}) = \min\!\left(1,\ \frac{P(x^{(t+1)})\,Q(x^{(t)}|x^{(t+1)})}{P(x^{(t)})\,Q(x^{(t+1)}|x^{(t)})}\right)$
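A minimal random-walk Metropolis sketch under the simplifying assumption of a symmetric Gaussian proposal, with a made-up target (an unnormalized standard Gaussian); note that only something proportional to P is needed, here via its log:

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: Q(x'|x) is a symmetric Gaussian, so the Q terms
    in the acceptance ratio cancel and A = min(1, P(x') / P(x))."""
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + step * rng.normal()   # draw x' from Q(x'|x)
        # accept with probability min(1, P(x') / P(x)); compare in log space
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal
        samples.append(x)                    # a rejected proposal repeats x
    return np.array(samples)

# target known only up to a constant: an unnormalized standard Gaussian
samples = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_samples=10_000)
print(samples.mean(), samples.std())         # near 0 and 1
```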