Ch 2. Probability Distributions
Pattern Recognition and Machine Learning, C. M. Bishop, 2006
Updated by B.-H. Kim, summarized by M.H. Kim
Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/
Content
• 2.3 The Gaussian Distribution
  • 2.3.6 Bayesian inference for the Gaussian
  • 2.3.7 Student's t-distribution
  • 2.3.8 Periodic variables
  • 2.3.9 Mixtures of Gaussians
• 2.4 The Exponential Family
  • 2.4.1 Maximum likelihood and sufficient statistics
  • 2.4.2 Conjugate priors
  • 2.4.3 Noninformative priors
• 2.5 Nonparametric Methods
  • 2.5.1 Kernel density estimators
  • 2.5.2 Nearest-neighbour methods
2.3.6 Bayesian inference for the Gaussian
• In the maximum likelihood framework: point estimation for the parameters μ and σ²
• Bayesian treatment of parameter estimation: introduce prior distributions over the parameters
• Three cases
  • [C1] When the covariance is known, inferring the mean
  • [C2] When the mean is known, inferring the covariance
  • [C3] Both mean and covariance are unknown
• Conjugate prior
  • Keeps the functional form of the posterior the same as that of the prior
  • Can be interpreted as effective fictitious data points (exponential family)
[C1] When the covariance is known, inferring the mean
• A single Gaussian random variable x, given a set of N observations X = {x1, …, xN}
• The likelihood function takes the form of the exponential of a quadratic form in μ
• Thus if we choose a prior p(μ) given by a Gaussian, it will be a conjugate distribution for this likelihood function
• The conjugate prior distribution: p(μ) = N(μ | μ0, σ0²)
[C1] When the covariance is known, inferring the mean
• The posterior distribution p(μ | X) is again Gaussian (equations below)
• Posterior mean and variance
  • The posterior mean is a compromise between the prior mean μ0 and the maximum likelihood solution μML
  • Precision (inverse variance) is additive: the posterior precision is the prior precision plus one contribution of data precision per observation
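The update equations this slide refers to (rendered as images in the original deck) are the standard results for a Gaussian likelihood with known variance σ²:

$$p(\mu \mid \mathbf{X}) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2), \qquad
\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{\mathrm{ML}}, \qquad
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$$

where μML = (1/N) Σn xn.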
[C1] When the covariance is known, inferring the mean
• Illustration of this Bayesian inference
  • Data points generated from N(x | 0.8, 0.1²), prior N(μ | 0, 0.1²)
• Sequential estimation of the mean
  • The Bayesian paradigm leads very naturally to a sequential view of the inference problem: the posterior distribution after observing N−1 data points acts as the prior before the Nth data point is observed, and is multiplied by the likelihood function associated with xN
[C2] When the mean is known, inferring the covariance
• The likelihood function for the precision λ
• The conjugate prior distribution is the gamma distribution Gam(λ | a, b)
• With prior Gam(λ | a0, b0), the posterior is again a gamma distribution (see below)
[Figure: plot of Gam(λ | a, b) for various values of a and b]
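For reference, the gamma density and the posterior update (equations shown as images in the original):

$$\mathrm{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\, b^a \lambda^{a-1} e^{-b\lambda}$$

$$a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^2 = b_0 + \frac{N}{2}\sigma_{\mathrm{ML}}^2$$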
Meaning of the conjugate prior for the exponential family
• The effect of observing N data points
  • Increases the value of the coefficient a0 by N/2 => 2a0 'effective' prior observations
  • Increases the value of the coefficient b0 by Nσ²ML/2 => b0 arises from the 2a0 effective prior observations having variance b0/a0
• A conjugate prior can thus be interpreted as effective fictitious data points
• This is a general property of the exponential family of distributions
[C3] Both mean and covariance are unknown
• The likelihood function now depends on both μ and λ
• The conjugate prior must take the form of a Gaussian over μ, whose precision is a linear function of λ, multiplied by a gamma distribution over λ (the normal-gamma or Gaussian-gamma distribution, given below)
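The normal-gamma prior referenced above, in LaTeX:

$$p(\mu, \lambda) = \mathcal{N}\!\left(\mu \mid \mu_0, (\beta\lambda)^{-1}\right)\,\mathrm{Gam}(\lambda \mid a, b)$$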
In the case of the multivariate Gaussian
• The same three cases, with the corresponding conjugate priors
  • [C1] When the covariance is known, inferring the mean: the conjugate prior is again a Gaussian
  • [C2] When the mean is known, inferring the covariance: the conjugate prior over the precision matrix Λ is the Wishart distribution
  • [C3] Both mean and covariance are unknown: the conjugate prior is the normal-Wishart (Gaussian-Wishart) distribution
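The Wishart density referred to here (the slide's equations were images):

$$\mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}, \nu) = B(\mathbf{W}, \nu)\, |\boldsymbol{\Lambda}|^{(\nu - D - 1)/2} \exp\!\left(-\tfrac{1}{2}\mathrm{Tr}(\mathbf{W}^{-1}\boldsymbol{\Lambda})\right)$$

where ν is the number of degrees of freedom and B(W, ν) is a normalization constant.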
2.3.7 Student's t-distribution
• If we take a univariate Gaussian N(x | μ, τ⁻¹) together with a gamma prior Gam(τ | a, b) over its precision and integrate out the precision, we obtain the marginal distribution of x
• The result is Student's t-distribution
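With ν = 2a degrees of freedom and λ = a/b, the resulting marginal (an image in the original deck) is:

$$\mathrm{St}(x \mid \mu, \lambda, \nu) = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-\nu/2 - 1/2}$$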
Properties of Student's t-distribution
• An infinite mixture of Gaussians: it adds up an infinite number of Gaussian distributions having the same mean but different precisions
• Longer 'tails' than a Gaussian => robustness: the ML solution is much less sensitive than the Gaussian to outliers
[Figure: ML solutions for Student's t-distribution (red) and a Gaussian (green) on data containing outliers; with no outliers, St ∼ Gaussian]
More on Student's t-distribution
• For regression problems
  • The least squares approach does not exhibit robustness, because it corresponds to ML under a (conditional) Gaussian distribution
  • We obtain a more robust model by basing it on a heavy-tailed distribution such as a t-distribution
• Multivariate t-distribution: the analogous form (below), where D is the dimensionality of x
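The multivariate form referenced above (an image in the original), in LaTeX:

$$\mathrm{St}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \frac{\Gamma(D/2 + \nu/2)}{\Gamma(\nu/2)} \frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}} \left[1 + \frac{\Delta^2}{\nu}\right]^{-D/2 - \nu/2}, \qquad \Delta^2 = (\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Lambda}(\mathbf{x}-\boldsymbol{\mu})$$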
2.3.8 Periodic variables
• Problem setting
  • We want to evaluate the mean of a set of observations {θ1, …, θN} of a periodic (angular) variable
  • Simply averaging the angular coordinates gives a result that depends on the choice of origin
  • To find an invariant measure of the mean, the observations are viewed as points on the unit circle in Cartesian coordinates, and the maximum likelihood estimator of the mean direction is obtained from their average (see below)
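The ML estimator of the mean direction (an image in the original):

$$\bar{\theta} = \tan^{-1}\left\{\frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n}\right\}$$

In code, the two-argument arctangent (e.g. np.arctan2) should be used to resolve the quadrant.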
Von Mises distribution (circular normal)
• Setting for a periodic generalization of the Gaussian
• Conditions on a distribution p(θ) with period 2π: p(θ) ≥ 0, ∫ p(θ) dθ = 1 over one period, and p(θ + 2π) = p(θ)
• A Gaussian-like distribution that satisfies these properties: take a 2D Gaussian and condition it on the unit circle => the von Mises distribution (below)
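The von Mises density (an image in the original deck):

$$p(\theta \mid \theta_0, m) = \frac{1}{2\pi I_0(m)} \exp\{m \cos(\theta - \theta_0)\}$$

where θ0 is the mean, m is the concentration parameter (analogous to the inverse variance), and I0(m) is the zeroth-order modified Bessel function of the first kind.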
Von Mises distribution (circular normal)
[Figure: plots of the von Mises distribution for two parameter settings, shown as a Cartesian plot and as a polar plot]
ML estimators for the parameters θ0 and m of the von Mises distribution
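The ML equations on this slide were images in the original: θ0ML is the mean direction given earlier, and mML solves A(m) = I1(m)/I0(m) = (1/N) Σn cos(θn − θ0ML), which must be inverted numerically. A minimal Python sketch, assuming SciPy is available (the function name vonmises_ml is illustrative):

    import numpy as np
    from scipy.special import iv          # modified Bessel functions I_v
    from scipy.optimize import brentq

    def vonmises_ml(theta):
        # ML mean direction: theta0 = atan2(sum of sines, sum of cosines)
        S, C = np.sin(theta).sum(), np.cos(theta).sum()
        theta0 = np.arctan2(S, C)
        # A(m) = I1(m)/I0(m) must equal the mean resultant length rbar;
        # A is monotonically increasing from 0 toward 1, so bracket and solve
        rbar = np.cos(theta - theta0).mean()
        m = brentq(lambda m: iv(1, m) / iv(0, m) - rbar, 1e-8, 1e3)
        return theta0, m

The bracketing assumes 0 < rbar < 1, which holds unless the sample is degenerate (all angles identical or perfectly dispersed).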
Some alternative techniques for the construction of periodic distributions
• The simplest approach is to use a histogram of observations in which the angular coordinate is divided into fixed bins => simple and flexible, but significantly limited (Section 2.5)
• Approach 2: start, as with the von Mises distribution, from a Gaussian distribution over a Euclidean space, but now marginalize onto the unit circle rather than conditioning
• Approach 3: 'wrap' the real axis around the unit circle, mapping successive intervals of width 2π onto (0, 2π)
• More flexible densities can be obtained from mixtures of von Mises distributions
2.3.9 Mixtures of Gaussians
• A mixture of Gaussians is a linear superposition of K Gaussian densities with mixing coefficients πk
• From the sum and product rules, the marginal density is given by the sum over components (below); the posterior probabilities γk(x), called responsibilities, play an important role
[Figures: example of a Gaussian mixture distribution formed from 3 Gaussians; example data set whose clustered structure requires a mixture distribution]
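The mixture density and responsibilities (images in the original deck):

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad 0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1$$

$$\gamma_k(\mathbf{x}) \equiv p(k \mid \mathbf{x}) = \frac{\pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j} \pi_j \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$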
2.3.9 Mixtures of Gaussians
• The maximum likelihood solution for the parameters
  • No closed-form analytical solution
  • Requires iterative numerical optimization techniques, or expectation maximization (Chapter 9)
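Not in the slides, but as a concrete illustration: scikit-learn's GaussianMixture fits exactly this model by EM. A minimal sketch (the data X here is a random placeholder):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.randn(500, 2)                   # placeholder data set
    gmm = GaussianMixture(n_components=3).fit(X)  # ML fit via EM (Chapter 9)
    resp = gmm.predict_proba(X)                   # responsibilities gamma_k(x)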
2.4 The Exponential Family
• The exponential family of distributions over x, given parameters η, is defined as the set of distributions of the form given below, where
  • x: scalar or vector, discrete or continuous
  • η: natural parameters
  • u(x): some function of x (the sufficient statistic)
  • g(η): the inverse of the normalizer (an alternative form writes the normalizer in the exponent)
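The defining form and its alternative (images in the original deck):

$$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\{\boldsymbol{\eta}^{\mathrm{T}} \mathbf{u}(\mathbf{x})\}$$

$$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \exp\{\boldsymbol{\eta}^{\mathrm{T}} \mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\}, \qquad A(\boldsymbol{\eta}) = -\ln g(\boldsymbol{\eta})$$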
Bernoulli distribution as an exponential family
• The Bernoulli distribution can be rewritten in the standard exponential-family form (below)
• The natural parameter is the log-odds, and inverting it recovers the mean through the logistic sigmoid function
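The rewriting referred to above:

$$p(x \mid \mu) = \mu^x (1-\mu)^{1-x} = (1-\mu)\exp\left\{x \ln\left(\frac{\mu}{1-\mu}\right)\right\}$$

so η = ln(μ/(1−μ)), u(x) = x, h(x) = 1, g(η) = σ(−η), and μ = σ(η) = 1/(1 + e^{−η}), the logistic sigmoid.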
Multinomial distribution as an exponential family
• The multinomial distribution can likewise be written in exponential-family form
• Removing the sum-to-one constraint and using M−1 parameters makes the representation unique; inverting the natural parameters recovers the means through the softmax function (below)
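With the constraint removed, the natural parameters and their inverse (the softmax), with sums running over j = 1, …, M−1:

$$\eta_k = \ln\left(\frac{\mu_k}{1 - \sum_{j}\mu_j}\right), \qquad \mu_k = \frac{\exp(\eta_k)}{1 + \sum_{j}\exp(\eta_j)}$$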
Gaussian distribution as an exponential family
• The univariate Gaussian can be written in exponential-family form by expanding the quadratic in the exponent (below)
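The resulting components (images in the original deck):

$$\boldsymbol{\eta} = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix}, \qquad \mathbf{u}(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}, \qquad h(x) = (2\pi)^{-1/2}, \qquad g(\boldsymbol{\eta}) = (-2\eta_2)^{1/2} \exp\!\left(\frac{\eta_1^2}{4\eta_2}\right)$$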
2.4.1 Maximum likelihood and sufficient statistics
• Maximum likelihood estimation of the parameter vector η in the general exponential-family distribution
• Taking the gradient of both sides of the normalization condition yields a key property of the partition function: the negative gradient of ln g(η) equals the expectation of u(x) (below)
• The covariance of u(x) is given by the second derivatives of ln g(η), and higher-order moments by the higher derivatives
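The property referred to above, with its alternative form:

$$-\nabla \ln g(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]$$

Equivalently, with A(η) = −ln g(η): ∇A(η) = E[u(x)] and ∇∇A(η) = cov[u(x)].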
Sufficient statistics
• With i.i.d. (independent, identically distributed) data, the likelihood function factorizes over the data points
• Setting the gradient of the log likelihood with respect to η to zero gives the ML condition (below)
  • The solution depends on the data only through Σn u(xn): the sufficient statistic of the distribution
  • We do not need to store the entire data set, only the sufficient statistic
  • Ex) Bernoulli distribution: the s.s. is the sum of the data points {xn}
  • Ex) Gaussian: the s.s. are the sum of {xn} and the sum of {xn²}
• Sufficiency in the Bayesian setting?
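The ML condition (an image in the original deck):

$$-\nabla \ln g(\boldsymbol{\eta}_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)$$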
2.4.2 Conjugate priors
• For any member of the exponential family, a conjugate prior can be written down directly (below)
• The posterior distribution then takes the same functional form as the prior, confirming conjugacy
  • f(χ, ν): a normalization coefficient
  • ν: the effective number of pseudo-observations, each with sufficient statistic χ
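The conjugate prior and the resulting posterior (images in the original deck):

$$p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu} \exp\{\nu \boldsymbol{\eta}^{\mathrm{T}} \boldsymbol{\chi}\}$$

$$p(\boldsymbol{\eta} \mid \mathbf{X}, \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{\nu + N} \exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\left(\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n) + \nu\boldsymbol{\chi}\right)\right\}$$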
2.4.3 Noninformative priors
• Role of a prior
  • If prior knowledge can be conveniently expressed through the prior distribution, all is well
  • When we have little idea of the appropriate form, we use a noninformative prior
• A noninformative prior is intended to have as little influence on the posterior distribution as possible: 'letting the data speak for themselves'
• Two difficulties in the case of continuous parameters
  • If the domain of the parameter is unbounded, the prior cannot be normalized: it is improper
  • The transformation behaviour of a probability density under a nonlinear change of variables: a density constant in one parameterization is not constant in another
Two examples of noninformative priors
• Family of densities with translation invariance (shifting x by a constant leaves the family unchanged)
  • Ex) the mean of a Gaussian distribution
• Family of densities with scale invariance
  • Ex) the standard deviation of a Gaussian distribution
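The corresponding priors (images in the original deck): translation invariance forces

$$p(\mu) = \text{const}$$

(improper over an unbounded domain), while scale invariance forces

$$p(\sigma) \propto \frac{1}{\sigma},$$

i.e. a prior that is uniform in ln σ.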
2.5 Nonparametric Methods
(For more details, see Ch. 4 of Duda & Hart)
• Parametric approach to density estimation
  • Uses p.d.f.s with specific functional forms (typically unimodal!)
  • Governed by a small number of parameters, which are determined from a data set
  • Limitation: the chosen density might be a poor model of the true distribution => poor predictions
• Nonparametric approach
  • Makes few assumptions about the form of the distribution
  • The form of the distribution typically depends on the size of the data set
  • Still contains parameters, but these control the model complexity rather than the form of the distribution
  • Nonparametric Bayesian methods are attracting increasing interest
Three nonparametric methods
• Histogram methods (Δi: width of the ith bin)
• Kernel density estimators (V is fixed, K is determined from the data)
• Nearest-neighbour methods (K is fixed, V is determined from the data)
• Common points: the concept of locality, and a smoothing parameter
• Notation: N: # observations; K: # points within some region; V: the volume of that region (estimator below)
[Figure: 50 data points drawn from a mixture of 2 Gaussians (green), with each estimator shown for several smoothing settings]
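The local density estimate underlying all three methods (an image in the original):

$$p(\mathbf{x}) \simeq \frac{K}{N V}$$

Fixing V and counting K gives the kernel approach; fixing K and growing V gives the K-nearest-neighbour approach. For a histogram, the per-bin estimate is pi = ni / (N Δi).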
More on kernel density estimators
• Set a kernel function (or Parzen window) k(u) on each data point
• Properties of the kernel function on a local region u around a data point: k(u) ≥ 0 and ∫ k(u) du = 1
• Uniform kernel function: counts the points falling inside a hypercube of side h centred on x (gives a discontinuous estimate)
• Gaussian kernel function: a smoother choice
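Not in the slides, but the Gaussian kernel estimator is short enough to sketch in Python (the function name gaussian_kde_at is illustrative):

    import numpy as np

    def gaussian_kde_at(x, data, h):
        # p(x) = (1/N) * sum_n N(x | x_n, h^2 I): one Gaussian bump per data point
        data = np.atleast_2d(np.asarray(data, dtype=float))
        N, D = data.shape
        sq = ((np.asarray(x, dtype=float) - data) ** 2).sum(axis=1)  # ||x - x_n||^2
        norm = (2.0 * np.pi * h ** 2) ** (D / 2.0)                   # Gaussian normalizer
        return np.exp(-sq / (2.0 * h ** 2)).sum() / (N * norm)

Here h plays the role of the smoothing parameter: small h gives a noisy estimate, large h over-smooths.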
Pros and cons
• Histogram methods
  • Once the histogram is computed, the data set can be discarded
  • Easily applied to sequential data processing
  • But the bin edges produce (artificial) discontinuities in the density
  • Weak scaling with dimensionality
• Kernel density estimators
  • No computation for 'training'
  • But require storage of the entire training set => the computational cost of evaluating the density grows linearly with the size of the data set
  • Fixed h: the optimal choice may depend on location within the data space
• K-nearest-neighbour methods
  • The model produced is not a true density model (its integral over all space diverges)
Classification with K-NN
• We apply the K-nearest-neighbour density estimation to each class separately, and then make use of Bayes' theorem
  • Density of each class: p(x | Ck) = Kk / (Nk V)
  • Unconditional density: p(x) = K / (N V), with class priors p(Ck) = Nk / N
  • Posterior probability of class membership: p(Ck | x) = p(x | Ck) p(Ck) / p(x) = Kk / K
• To minimize the probability of misclassification, assign the test point x to the class having the largest posterior probability, i.e. the largest Kk / K
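Not in the slides, but the decision rule is short enough to sketch in Python (the function name knn_classify is illustrative):

    import numpy as np

    def knn_classify(x, X, y, K):
        # Assign x to the class with the most members among its K nearest
        # training points, i.e. the class maximizing p(C_k | x) = K_k / K
        d = np.linalg.norm(X - x, axis=1)      # distances to all training points
        nearest = np.argsort(d)[:K]            # indices of the K nearest neighbours
        labels, counts = np.unique(y[nearest], return_counts=True)
        return labels[np.argmax(counts)]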
Illustrations of the K-NN classification
[Figure: K-NN classification results for several values of K]
Good density models
• Nonparametric methods: flexible, but require the entire training data set to be stored
• Parametric methods: very restricted in terms of the forms of distribution they can represent
• What we want
  • Density models that are flexible, yet
  • Whose complexity can be controlled independently of the size of the training set
• We shall see in subsequent chapters how to achieve this