## CS 570 Artificial Intelligence Chapter 20. Bayesian Learning


Jahwan Kim, Dept. of CS, KAIST

Jahwan Kim – CS 570 Artificial Intelligence

**Contents**
• Bayesian Learning
• Bayesian inference
• MAP and ML
• Naïve Bayes method
• Bayesian network
• Parameter Learning
• Examples
• Regression and LMS
• EM Algorithm
• Algorithm
• Mixture of Gaussians

**Bayesian Learning**
• Let $h_1,\dots,h_n$ be the possible hypotheses.
• Let $\mathbf{d}=(d_1,\dots,d_N)$ be the observed data vectors.
• The data are usually assumed to be i.i.d.
• Let $X$ denote the prediction.
• In Bayesian learning,
  • Compute the probability of each hypothesis given the data, and predict on that basis.
  • Predictions are made using all hypotheses.
  • Learning in the Bayesian setting reduces to probabilistic inference.

**Bayesian Learning**
• The probability that the prediction is $X$ when the data $\mathbf{d}$ are observed is
  $P(X\mid\mathbf{d})=\sum_i P(X\mid\mathbf{d},h_i)\,P(h_i\mid\mathbf{d})=\sum_i P(X\mid h_i)\,P(h_i\mid\mathbf{d})$
• The prediction is a weighted average of the predictions of the individual hypotheses.
• Hypotheses are intermediaries between the data and the predictions.
• This requires computing $P(h_i\mid\mathbf{d})$ for all $i$, which is usually intractable.

**Bayesian Learning Basics: Terms**
• $P(h_i\mid\mathbf{d})$ is called the posterior (or a posteriori) probability.
• Using Bayes' rule, $P(h_i\mid\mathbf{d})\propto P(\mathbf{d}\mid h_i)\,P(h_i)$.
• $P(h_i)$ is called the (hypothesis) prior.
  • We can embed knowledge by means of the prior.
  • It also controls the complexity of the model.
• $P(\mathbf{d}\mid h_i)$ is called the likelihood of the data.
  • Under the i.i.d. assumption, $P(\mathbf{d}\mid h_i)=\prod_j P(d_j\mid h_i)$.
• Let $h_{\mathrm{MAP}}$ be the hypothesis for which the posterior probability $P(h_i\mid\mathbf{d})$ is maximal. It is called the maximum a posteriori (or MAP) hypothesis.

**Bayesian Learning Basics: MAP Approximation**
• Since calculating the exact probability is often impractical, we approximate it using the MAP hypothesis: $P(X\mid\mathbf{d})\approx P(X\mid h_{\mathrm{MAP}})$.
• MAP is often easier than the full Bayesian method because an optimization problem is solved instead of a large summation (or integration).

**Bayesian Learning Basics: MDL Principle**
• Since $P(h_i\mid\mathbf{d})\propto P(\mathbf{d}\mid h_i)\,P(h_i)$, instead of maximizing $P(h_i\mid\mathbf{d})$ we may maximize $P(\mathbf{d}\mid h_i)\,P(h_i)$.
• Equivalently, we may minimize $-\log\big(P(\mathbf{d}\mid h_i)\,P(h_i)\big)=-\log P(\mathbf{d}\mid h_i)-\log P(h_i)$.
• We can interpret this as choosing $h_i$ to minimize the number of bits required to encode the hypothesis $h_i$ and the data $\mathbf{d}$ under that hypothesis.
• The principle of minimizing code length (under some predetermined coding scheme) is called the minimum description length (or MDL) principle.
• MDL is used in a wide range of practical machine learning applications.

**Bayesian Learning Basics: Maximum Likelihood**
• Assume furthermore that the $P(h_i)$ are all equal, i.e., assume a uniform prior.
• This is a reasonable approach when there is no reason to prefer one hypothesis over another a priori.
• In that case, to obtain the MAP hypothesis it suffices to maximize the likelihood $P(\mathbf{d}\mid h_i)$. Such a hypothesis is called the maximum likelihood hypothesis $h_{\mathrm{ML}}$.
• In other words, MAP with a uniform prior $\Leftrightarrow$ ML.

**Bayesian Learning Basics: Candy Example**
• Two flavors of candy, cherry and lime.
• Each piece of candy is wrapped in the same opaque wrapper.
• Sold in very large bags, of which there are known to be five kinds:
  $h_1$: 100% cherry; $h_2$: 75% cherry + 25% lime; $h_3$: 50% cherry + 50% lime; $h_4$: 25% cherry + 75% lime; $h_5$: 100% lime
• Priors known: $P(h_1),\dots,P(h_5)$ are 0.1, 0.2, 0.4, 0.2, 0.1.
• Suppose we took $N$ pieces of candy from a bag and all of them were lime (data $\mathbf{d}_N$). What are the posterior probabilities $P(h_i\mid\mathbf{d}_N)$?
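
The question posed in the candy example can be answered numerically. The following is a minimal Python sketch under the stated priors and bag compositions; the names `LIME_FRACTION`, `PRIOR`, `posteriors`, and `predict_lime` are illustrative and do not come from the slides.

```python
# Hypotheses h1..h5: fraction of lime candies in each kind of bag.
LIME_FRACTION = [0.0, 0.25, 0.5, 0.75, 1.0]
# Prior probabilities P(h1), ..., P(h5) as given in the example.
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]


def posteriors(n_limes):
    """Return [P(h_i | d_N)] after observing n_limes lime candies in a row."""
    # Unnormalized posterior: i.i.d. likelihood times prior.
    unnormalized = [p * f ** n_limes for f, p in zip(LIME_FRACTION, PRIOR)]
    total = sum(unnormalized)
    # Normalize so that the posteriors sum to 1.
    return [u / total for u in unnormalized]


def predict_lime(n_limes):
    """Full Bayesian prediction: sum_i P(lime | h_i) * P(h_i | d_N)."""
    return sum(f * p for f, p in zip(LIME_FRACTION, posteriors(n_limes)))
```

Calling `posteriors(0)` returns the priors unchanged, and as `n_limes` grows the posterior mass moves toward $h_5$. `predict_lime` averages over all hypotheses, whereas a MAP prediction would use only the single hypothesis with the largest posterior.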
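
**Bayesian Learning Basics: Candy Example**
• For $N\ge 1$, the unnormalized posteriors are
  $P(h_1\mid\mathbf{d}_N)\propto P(\mathbf{d}_N\mid h_1)\,P(h_1)=0$,
  $P(h_2\mid\mathbf{d}_N)\propto P(\mathbf{d}_N\mid h_2)\,P(h_2)=0.2\,(0.25)^N$,
  $P(h_3\mid\mathbf{d}_N)\propto P(\mathbf{d}_N\mid h_3)\,P(h_3)=0.4\,(0.5)^N$,
  $P(h_4\mid\mathbf{d}_N)\propto P(\mathbf{d}_N\mid h_4)\,P(h_4)=0.2\,(0.75)^N$,
  $P(h_5\mid\mathbf{d}_N)\propto P(\mathbf{d}_N\mid h_5)\,P(h_5)=P(h_5)=0.1$.
• Normalize them so that they sum to 1.

**Bayesian Learning Basics: Parameter Learning**
• Introduce a parametric probability model with parameter $\theta$.
• Then the hypotheses are $h_\theta$, i.e., the hypotheses are parametrized.
• In the simplest case, $\theta$ is a single scalar. In more complex cases, $\theta$ consists of many components.
• Using the data $\mathbf{d}$, estimate the parameter $\theta$.

**Parameter Learning Example: Discrete Case**
• A bag of candy whose lime-cherry proportions are completely unknown.
• In this case the hypotheses are parametrized by the probability $\theta$ of cherry.
• $P(\mathbf{d}\mid h_\theta)=\prod_j P(d_j\mid h_\theta)=\theta^{c}(1-\theta)^{\ell}$, where $c$ and $\ell$ are the numbers of cherry and lime candies observed.
• Now suppose each candy has one of two wrappers, green or red, selected according to some unknown conditional distribution that depends on the flavor.
• This model has three parameters: $\theta=P(F=\text{cherry})$, $\theta_1=P(W=\text{red}\mid F=\text{cherry})$, $\theta_2=P(W=\text{red}\mid F=\text{lime})$. With $\Theta=(\theta,\theta_1,\theta_2)$,
  $P(\mathbf{d}\mid h_\Theta)=\theta^{c}(1-\theta)^{\ell}\;\theta_1^{r_c}(1-\theta_1)^{g_c}\;\theta_2^{r_\ell}(1-\theta_2)^{g_\ell}$,
  where $r_c, g_c$ are the numbers of red- and green-wrapped cherry candies and $r_\ell, g_\ell$ the corresponding numbers for lime.

**Parameter Learning Example: Single Variable Gaussian**
• Gaussian pdf on a single variable:
  $P(x)=\dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right)$
• Suppose $x_1,\dots,x_N$ are observed. Then the log likelihood is
  $L=\sum_{j=1}^{N}\log\left[\dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\dfrac{(x_j-\mu)^2}{2\sigma^2}\right)\right]=N\left(-\log\sqrt{2\pi}-\log\sigma\right)-\sum_{j=1}^{N}\dfrac{(x_j-\mu)^2}{2\sigma^2}$
• We want to find the $\mu$ and $\sigma$ that maximize this: find where the gradient is zero.

**Parameter Learning Example: Single Variable Gaussian**
• Solving this, we find
  $\mu=\dfrac{1}{N}\sum_{j}x_j,\qquad \sigma=\sqrt{\dfrac{1}{N}\sum_{j}(x_j-\mu)^2}$
• That is, the ML estimates are the sample mean and the sample standard deviation, in agreement with common sense.

**Parameter Learning Example: Linear Regression**
• Consider a linear Gaussian model with one continuous parent $X$ and a continuous child $Y$.
• $Y$ has a Gaussian distribution whose mean depends linearly on the value of $X$.
• $Y$ has a fixed standard deviation $\sigma$.
• The data are $(x_j, y_j)$.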
• Let the mean of $Y$ be $\theta_1 X+\theta_2$.
• Then $P(y\mid x)\propto\dfrac{1}{\sigma}\exp\!\left(-\dfrac{\big(y-(\theta_1 x+\theta_2)\big)^2}{2\sigma^2}\right)$.
• Maximizing the log likelihood is equivalent to minimizing $E=\sum_j\big(y_j-(\theta_1 x_j+\theta_2)\big)^2$.
• This quantity is the well-known sum of squared errors. Thus, in the linear regression case, ML $\Leftrightarrow$ Least Mean-Square (LMS).

**Parameter Learning Example: Beta Distribution**
• Candy example revisited.
• In the Bayesian view, $\theta$ is the value of a random variable $\Theta$.
• The prior $P(\Theta)$ is a continuous distribution.
• The uniform density is one candidate.
• Another possibility is to use beta distributions.
• A beta distribution has two hyperparameters $a$ and $b$ and is given by $\mathrm{beta}[a,b](\theta)=\alpha\,\theta^{a-1}(1-\theta)^{b-1}$, where $\alpha$ is a normalizing constant.
• It has mean $a/(a+b)$.
• It is more peaked when $a+b$ is large, suggesting greater certainty about the value of $\Theta$.

**Parameter Learning Example: Beta Distribution**
• The beta distribution has the nice property that if $\Theta$ has a prior $\mathrm{beta}[a,b]$, then the posterior distribution for $\Theta$ after a data point is observed is also a beta distribution.
• $P(\theta\mid d=\text{cherry})\propto P(d=\text{cherry}\mid\theta)\,P(\theta)\propto\theta\cdot\mathrm{beta}[a,b](\theta)\propto\theta\cdot\theta^{a-1}(1-\theta)^{b-1}=\theta^{a}(1-\theta)^{b-1}\propto\mathrm{beta}[a+1,b](\theta)$
• The beta family is called the conjugate prior for the family of distributions for a Boolean variable.

**Naïve Bayes Method**
• In the naïve Bayes method, the attributes (components of the observed data) are assumed to be conditionally independent of one another given the class.
• Works well for about 2/3 of real-world problems, despite the naivety of this assumption.
• Goal: predict the class $C$, given the observed data $X_i=x_i$.
• By the independence assumption, $P(C\mid x_1,\dots,x_n)\propto P(C)\prod_i P(x_i\mid C)$.
• We choose the most likely class.
• Merits of NB
  • Scales well: no search is required.
  • Robust against noisy data.
  • Gives probabilistic predictions.

**Bayesian Network**
• Combine all observations according to their dependency relations.
• More formally, a Bayesian network consists of the following:
  • A set of variables (nodes)
  • A set of directed edges between variables
  • The graph is assumed to be acyclic (i.e., there is no directed cycle).
  • To each variable $A$ with parents $B_1,\dots,B_n$ there is attached the potential table $P(A\mid B_1,\dots,B_n)$.

**Bayesian Network**
• A compact representation of the joint probability table (distribution).
• Without the dependency relations, the joint probability is intractable.
• (Figure: examples of Bayesian networks.)

**Issues in Bayesian Network**
• Learning the structure: no systematic method exists.
• Updating the network after an observation is also hard: NP-hard in general.
• There are algorithms that mitigate this computational complexity.
• Hidden (latent) variables can simplify the structure substantially.

**EM Algorithm: Learning with Hidden Variables**
• Latent (hidden) variables are not directly observable.
• Latent variables are everywhere: in HMMs, mixtures of Gaussians, Bayesian networks, …
• The EM (Expectation-Maximization) algorithm solves the problem of learning parameters in the presence of latent variables
  • in a very general way,
  • and also in a very simple way.
• EM is an iterative algorithm: it alternates E- and M-steps, updating the parameters at each iteration.

**EM Algorithm**
• An iterative algorithm.
• Let $\theta$ be the parameters of the model, $\theta^{(i)}$ their estimated value at the $i$-th step, and $Z$ the hidden variable.
• Expectation (E-step): compute the expectation, with respect to the hidden variable, of the complete-data log-likelihood function:
  $\sum_z P(Z=z\mid x,\theta^{(i)})\,\log P(x,Z=z\mid\theta)$
• Maximization (M-step): update $\theta$ by maximizing this expectation:
  $\theta^{(i+1)}=\arg\max_\theta\sum_z P(Z=z\mid x,\theta^{(i)})\,\log P(x,Z=z\mid\theta)$
• Iterate the E- and M-steps until convergence.
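
To make the E- and M-steps concrete, here is a minimal Python sketch on a hypothetical two-coin mixture. This example is an assumption chosen for illustration and is not part of the slides: each trial tosses one of two biased coins a fixed number of times, the hidden variable $Z_j$ records which coin was used, and the parameters are the mixing weight `pi` and the head probabilities `p_a`, `p_b`. The function name, the default initial values, and the fixed iteration count are likewise illustrative.

```python
import math


def em_two_coins(heads, flips, pi=0.5, p_a=0.6, p_b=0.4, iters=50):
    """EM for a mixture of two coins.

    heads[j] is the number of heads in trial j; every trial has `flips`
    tosses of a single coin. Returns the final (pi, p_a, p_b) and the log
    likelihood (up to a constant) evaluated at the start of each iteration.
    """
    history = []
    for _ in range(iters):
        # E-step: resp[j] = P(Z_j = A | x_j, theta^(i)).
        resp = []
        log_lik = 0.0
        for h in heads:
            like_a = pi * p_a ** h * (1 - p_a) ** (flips - h)
            like_b = (1 - pi) * p_b ** h * (1 - p_b) ** (flips - h)
            resp.append(like_a / (like_a + like_b))
            # The binomial coefficient is a constant and is omitted.
            log_lik += math.log(like_a + like_b)
        history.append(log_lik)

        # M-step: closed-form maximizer of the expected complete-data
        # log likelihood under the responsibilities computed above.
        weight_a = sum(resp)
        weight_b = len(heads) - weight_a
        pi = weight_a / len(heads)
        p_a = sum(r * h for r, h in zip(resp, heads)) / (flips * weight_a)
        p_b = sum((1 - r) * h for r, h in zip(resp, heads)) / (flips * weight_b)
    return (pi, p_a, p_b), history
```

For instance, `em_two_coins([5, 9, 8, 4, 7], 10)` runs 50 iterations on five trials of ten tosses each. The returned `history` can be inspected to check that the log likelihood does not decrease from one iteration to the next. A practical implementation would stop on a convergence test rather than a fixed iteration count.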

**EM Algorithm**
• Resembles a gradient-based optimization algorithm, but has no step-size parameter.
• EM never decreases the log likelihood at any step.
• May have convergence problems.
• Several variants of the EM algorithm have been proposed to overcome such difficulties.
• Placing priors on the parameters, trying different initializations, and choosing reasonable initial values all help.

**EM Algorithm Prototypical Example: Mixture of Gaussians**
• A mixture distribution: $P(X)=\sum_{i=1}^{k}P(C=i)\,P(X\mid C=i)$
• $P(X\mid C=i)$ is the distribution of the $i$-th component.
• When each $P(X\mid C=i)$ is a (multivariate) Gaussian, this distribution is called a mixture of Gaussians.
• Has the following parameters:
  • Weights $w_i=P(C=i)$
  • Means $\mu_i$
  • Covariances $\Sigma_i$
• Problem in learning the parameters: we don't know which component generated each data point.

**EM Algorithm Prototypical Example: Mixture of Gaussians**
• Introduce indicator hidden variables $Z=(Z_j)$: from which component was $x_j$ generated?
• The answer can be derived analytically, but the derivation is complicated. (See, for example, http://www.lans.ece.utexas.edu/course/ee380l/2002sp/blimes98gentle.pdf)
• Skipping the details, the answers are as follows:
• Let $p_{ij}=P(C=i\mid x_j)\propto P(x_j\mid C=i)\,P(C=i)$, $p_i=\sum_j p_{ij}$, $w_i=P(C=i)$, and let $N$ be the number of data points.
• Update
  $\mu_i\leftarrow\sum_j p_{ij}\,x_j/p_i$
  $\Sigma_i\leftarrow\sum_j p_{ij}\,(x_j-\mu_i)(x_j-\mu_i)^{T}/p_i$
  $w_i\leftarrow p_i/N$

**EM Algorithm Prototypical Example: Mixture of Gaussians**
• For a nice "look-and-feel" demo of the EM algorithm on mixtures of Gaussians, see http://www.neurosci.aist.go.jp/~akaho/MixtureEM.html

**EM Algorithm Example: Bayesian Network, HMM**
• Omitted.
• Covered later in class/student presentation (?)
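
The mixture-of-Gaussians updates discussed earlier can be sketched in Python for the one-dimensional case. This is a minimal illustration under stated assumptions, not the slides' own implementation: the names `gaussian_pdf` and `em_gmm_1d`, the fixed iteration count, and the variance floor `min_var` are introduced here. The sketch also assumes that the weights are normalized by the number of points and that the variance is centered on the updated mean, so that the weights sum to one and the variances are nonnegative.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, means, variances, weights, iters=100, min_var=1e-6):
    """EM for a 1-D mixture of Gaussians; returns (weights, means, variances)."""
    k, n = len(means), len(xs)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(iters):
        # E-step: p[i][j] = P(C = i | x_j), normalized over components i.
        p = [[weights[i] * gaussian_pdf(x, means[i], variances[i]) for x in xs]
             for i in range(k)]
        for j in range(n):
            total = sum(p[i][j] for i in range(k))
            for i in range(k):
                p[i][j] /= total

        # M-step: re-estimate each component from its weighted points.
        for i in range(k):
            p_i = sum(p[i])
            means[i] = sum(pij * x for pij, x in zip(p[i], xs)) / p_i
            variances[i] = max(
                sum(pij * (x - means[i]) ** 2 for pij, x in zip(p[i], xs)) / p_i,
                min_var,  # assumed floor: keeps a component from collapsing
            )
            weights[i] = p_i / n
    return weights, means, variances
```

A call such as `em_gmm_1d(xs, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])` fits two components to a list of floats `xs`. The result depends on the initial values, which is one reason reasonable initialization matters in practice.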