Ch 2. Probability Distributions. Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Updated by B.-H. Kim. Summarized by M.H. Kim. Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/
Content
• 2.3 The Gaussian Distribution
  • 2.3.6 Bayesian inference for the Gaussian
  • 2.3.7 Student's t-distribution
  • 2.3.8 Periodic variables
  • 2.3.9 Mixtures of Gaussians
• 2.4 The Exponential Family
  • 2.4.1 Maximum likelihood and sufficient statistics
  • 2.4.2 Conjugate priors
  • 2.4.3 Noninformative priors
• 2.5 Nonparametric Methods
  • 2.5.1 Kernel density estimators
  • 2.5.2 Nearest-neighbour methods
(C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
2.3.6 Bayesian inference for the Gaussian
• In the maximum likelihood framework
  • Point estimation of the parameters $\mu$ and $\Sigma$
• Bayesian treatment of parameter estimation
  • Introduce prior distributions over the parameters
• Three cases
  • [C1] When the covariance is known, inferring the mean
  • [C2] When the mean is known, inferring the covariance
  • [C3] Both the mean and the covariance are unknown
• Conjugate prior
  • Keeps the functional form consistent: the posterior has the same form as the prior
  • Can be interpreted as effective fictitious data points (a property of the exponential family)
[C1] When the covariance is known, inferring the mean
• Single Gaussian random variable x with known variance $\sigma^2$
• Given a set of N observations X = {x1, …, xN}
• The likelihood function: $p(X \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = (2\pi\sigma^2)^{-N/2} \exp\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \}$
  • Viewed as a function of μ, it has the form of the exponential of a quadratic form in μ
  • Thus, if we choose a Gaussian prior p(μ), it will be a conjugate distribution for this likelihood function
• The conjugate prior distribution: $p(\mu) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2)$
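[C1] When the covariance is known, inferring the mean
• The posterior distribution: $p(\mu \mid X) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2)$
• Posterior mean and variance
  • $\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML}$
  • $\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$
• The posterior mean is a compromise between the prior mean $\mu_0$ and the maximum likelihood solution $\mu_{ML}$
• Precision (inverse variance) is additive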
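A minimal sketch of the posterior update for this case, assuming the standard result that the posterior mean weights the prior mean and the ML mean by their precisions, and that precisions add. The function name and the sample values are illustrative and not taken from the slides.

```python
def posterior_mean_known_variance(xs, sigma2, mu0, sigma0_2):
    """Posterior N(mu | mu_N, var_N) for a Gaussian mean with known variance sigma2.

    mu0 and sigma0_2 are the mean and variance of the Gaussian prior on mu.
    """
    n = len(xs)
    mu_ml = sum(xs) / n  # maximum likelihood estimate (sample mean)
    # Posterior mean: compromise between the prior mean and the ML estimate.
    mu_n = (sigma2 * mu0 + n * sigma0_2 * mu_ml) / (n * sigma0_2 + sigma2)
    # Posterior precision: prior precision plus N times the data precision.
    var_n = 1.0 / (1.0 / sigma0_2 + n / sigma2)
    return mu_n, var_n


# Hypothetical observations, chosen only for illustration.
xs = [0.9, 0.7, 0.8, 1.0]
mu_n, var_n = posterior_mean_known_variance(xs, sigma2=0.01, mu0=0.0, sigma0_2=0.01)
print(mu_n, var_n)
```

With N = 0 observations the posterior would reduce to the prior, and as N grows the posterior mean approaches the sample mean while the variance shrinks towards zero.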
[C1] When the covariance is known, inferring the mean
• Illustration of this Bayesian inference
  • Data points generated from N(x|0.8, 0.1²)
  • Prior N(μ|0, 0.1²)
• Sequential estimation of the mean
  • The Bayesian paradigm leads very naturally to a sequential view of the inference problem
  • $p(\mu \mid X) \propto \left[ p(\mu) \prod_{n=1}^{N-1} p(x_n \mid \mu) \right] p(x_N \mid \mu)$
  • The bracketed term is the posterior distribution after observing N-1 data points, or equivalently the prior distribution before the Nth data point is observed
  • The final factor is the likelihood function associated with $x_N$
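The sequential view can be sketched as a one-point update in which the current posterior acts as the prior for the next observation. This is an illustrative sketch: the function name and the four data values are assumptions, and only the prior N(μ|0, 0.1²) and the noise variance 0.1² follow the slide's example.

```python
def sequential_update(mu_prev, var_prev, x, sigma2):
    """Fold one observation x into the current Gaussian belief about the mean."""
    precision = 1.0 / var_prev + 1.0 / sigma2  # precisions add
    mu = (mu_prev / var_prev + x / sigma2) / precision
    return mu, 1.0 / precision


# Start from the prior and process the (hypothetical) points one at a time.
mu, var = 0.0, 0.1 ** 2
for x in [0.9, 0.7, 0.8, 1.0]:
    mu, var = sequential_update(mu, var, x, sigma2=0.1 ** 2)
print(mu, var)
```

Processing the points one by one should give the same posterior as a single batch update over all of them, which is what makes the sequential formulation convenient for streaming data.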
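[C2] When the mean is known, inferring the covariance
• Likelihood function for the precision $\lambda = 1/\sigma^2$: $p(X \mid \lambda) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \lambda^{-1}) \propto \lambda^{N/2} \exp\{ -\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \}$
• The conjugate prior distribution is the gamma distribution: $\mathrm{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)} b^{a} \lambda^{a-1} \exp(-b\lambda)$
• Let the prior be $\mathrm{Gam}(\lambda \mid a_0, b_0)$; then the posterior becomes $p(\lambda \mid X) \propto \lambda^{a_0 - 1} \lambda^{N/2} \exp\{ -b_0\lambda - \frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \}$
  • This is $\mathrm{Gam}(\lambda \mid a_N, b_N)$ with $a_N = a_0 + N/2$ and $b_N = b_0 + \frac{N}{2}\sigma_{ML}^2$
• [Figure: plot of Gam(λ|a, b)]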
Meaning of the conjugate prior for the exponential family
• The effect of observing N data points
  • Increases the value of the coefficient $a_0$ by N/2
    • $2a_0$: 'effective' prior observations
  • Increases the value of the coefficient $b_0$ by $N\sigma_{ML}^2/2$
    • $b_0$ arises from the $2a_0$ effective prior observations having variance $b_0/a_0$
• A conjugate prior can be interpreted as effective fictitious data points
  • This is a general property of the exponential family of distributions
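A small sketch of the gamma update for the precision, assuming a Gam(λ|a0, b0) prior and a known mean. The function name and data values are illustrative only.

```python
def gamma_posterior(xs, mu, a0, b0):
    """Posterior Gam(lambda | a_N, b_N) for the precision of a Gaussian with known mean mu."""
    n = len(xs)
    sigma2_ml = sum((x - mu) ** 2 for x in xs) / n  # ML variance about the known mean
    a_n = a0 + n / 2.0              # each observation adds 1/2 to a
    b_n = b0 + n * sigma2_ml / 2.0  # and sigma2_ml / 2 to b
    return a_n, b_n


# Hypothetical data with known mean 2.0 and a Gam(1, 1) prior.
a_n, b_n = gamma_posterior([1.0, 3.0, 2.0, 2.0], mu=2.0, a0=1.0, b0=1.0)
print(a_n, b_n, a_n / b_n)  # a_N / b_N is the posterior mean of the precision
```

Reading a0 and b0 as summaries of 2·a0 fictitious observations with variance b0/a0 makes the update look like pooling real and pseudo data.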
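[C3] Both the mean and the covariance are unknown
• Likelihood function: $p(X \mid \mu, \lambda) = \prod_{n=1}^{N} \left( \frac{\lambda}{2\pi} \right)^{1/2} \exp\{ -\frac{\lambda}{2} (x_n - \mu)^2 \}$
• Prior distribution: $p(\mu, \lambda) = \mathcal{N}(\mu \mid \mu_0, (\beta\lambda)^{-1}) \, \mathrm{Gam}(\lambda \mid a, b)$
  • First factor: a Gaussian whose precision is a linear function of λ
  • Second factor: a gamma distribution
  • Known as the normal-gamma or Gaussian-gamma distribution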
In the case of the multivariate Gaussian
• Three cases and the form of the conjugate prior distribution
  • [C1] When the covariance is known, inferring the mean: Gaussian
  • [C2] When the mean is known, inferring the covariance (via the precision matrix Λ): Wishart
  • [C3] Both the mean and the covariance are unknown: normal-Wishart (Gaussian-Wishart)
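2.3.7 Student's t-distribution
• If we have a univariate Gaussian $\mathcal{N}(x \mid \mu, \tau^{-1})$ together with a gamma prior $\mathrm{Gam}(\tau \mid a, b)$ and we integrate out the precision, we obtain the marginal distribution of x
  • $p(x \mid \mu, a, b) = \int_0^{\infty} \mathcal{N}(x \mid \mu, \tau^{-1}) \, \mathrm{Gam}(\tau \mid a, b) \, d\tau$
• Student's t-distribution, with $\nu = 2a$ and $\lambda = a/b$
  • $\mathrm{St}(x \mid \mu, \lambda, \nu) = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)} \left( \frac{\lambda}{\pi\nu} \right)^{1/2} \left[ 1 + \frac{\lambda (x - \mu)^2}{\nu} \right]^{-\nu/2 - 1/2}$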
Properties of Student's t-distribution
• Obtained by adding up an infinite number of Gaussian distributions having the same mean but different precisions
  • An infinite mixture of Gaussians
• Longer 'tails' than a Gaussian => robustness: much less sensitive than the Gaussian to outliers
• [Figure: ML solutions of Student's t-distribution (red) and Gaussian (green); panel labels: "St ∼ Gaussian" and "outlier"]
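The 'infinite mixture of Gaussians' view can be checked numerically. The sketch below integrates N(x|μ, τ⁻¹)·Gam(τ|a, b) over the precision τ with a simple midpoint rule and compares the result with the closed-form t density using ν = 2a and λ = a/b. The function names, the integration range and the step count are assumptions made for this illustration.

```python
import math


def gaussian_pdf(x, mu, tau):
    """Gaussian density parameterized by precision tau."""
    return math.sqrt(tau / (2.0 * math.pi)) * math.exp(-0.5 * tau * (x - mu) ** 2)


def gamma_pdf(tau, a, b):
    """Gam(tau | a, b) with shape a and rate b, evaluated in log space."""
    return math.exp(a * math.log(b) + (a - 1.0) * math.log(tau) - b * tau - math.lgamma(a))


def student_t_pdf(x, mu, lam, nu):
    """Student's t density with location mu, precision-like lam and nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
                + 0.5 * math.log(lam / (math.pi * nu)))
    return math.exp(log_norm - (nu + 1.0) / 2.0 * math.log1p(lam * (x - mu) ** 2 / nu))


def mixture_integral(x, mu, a, b, upper=60.0, steps=100_000):
    """Midpoint-rule approximation of the integral over tau in (0, upper)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        tau = (i + 0.5) * h
        total += gaussian_pdf(x, mu, tau) * gamma_pdf(tau, a, b)
    return total * h


a, b = 2.0, 1.0
numeric = mixture_integral(1.5, 0.0, a, b)
closed_form = student_t_pdf(1.5, 0.0, lam=a / b, nu=2.0 * a)
print(numeric, closed_form)
```

The two numbers should agree to several decimal places; the truncation at `upper` is harmless here because the gamma density decays exponentially in τ.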
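More on Student's t-distribution
• For regression problems
  • The least squares approach does not exhibit robustness, because it corresponds to ML under a (conditional) Gaussian distribution
  • We obtain a more robust model by using a heavy-tailed distribution such as the t-distribution
• Multivariate t-distribution
  • $\mathrm{St}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \frac{\Gamma(D/2 + \nu/2)}{\Gamma(\nu/2)} \frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}} \left[ 1 + \frac{\Delta^2}{\nu} \right]^{-D/2 - \nu/2}$
  • where D is the dimensionality of x and $\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Lambda} (\mathbf{x} - \boldsymbol{\mu})$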
2.3.8 Periodic variables
• Problem setting
  • We want to evaluate the mean of a set of observations of a periodic variable: $D = \{\theta_1, \ldots, \theta_N\}$
  • To find an invariant measure of the mean, the observations are considered as points on the unit circle
  • Angular coordinate: $\theta_n$; Cartesian coordinates: $\mathbf{x}_n = (\cos\theta_n, \sin\theta_n)$
• The maximum likelihood estimator: $\bar{\theta} = \tan^{-1}\left\{ \frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n} \right\}$
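Von Mises distribution (circular normal)
• Setting for a periodic generalization of the Gaussian
• Conditions on a distribution p(θ) with period 2π
  • $p(\theta) \geq 0$, $\int_0^{2\pi} p(\theta)\, d\theta = 1$, $p(\theta + 2\pi) = p(\theta)$
• A Gaussian-like distribution that satisfies these properties: a 2D Gaussian conditioned on the unit circle
• Von Mises distribution: $p(\theta \mid \theta_0, m) = \frac{1}{2\pi I_0(m)} \exp\{ m \cos(\theta - \theta_0) \}$
  • $I_0(m) = \frac{1}{2\pi} \int_0^{2\pi} \exp\{ m \cos\theta \}\, d\theta$: zeroth-order modified Bessel function of the first kind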
Von Mises distribution (circular normal)
• [Figure: plot of the von Mises distribution, shown as a Cartesian plot and as a polar plot]
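ML estimators for the parameters
• Mean: $\theta_0^{ML} = \tan^{-1}\left\{ \frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n} \right\}$
• Concentration: $m_{ML}$ solves $A(m_{ML}) = \frac{1}{N} \sum_{n=1}^{N} \cos(\theta_n - \theta_0^{ML})$, where $A(m) = I_1(m) / I_0(m)$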
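A hedged sketch of these estimators using only the standard library. The mean direction uses `atan2` on the summed sines and cosines; the concentration m is found by bisection on A(m) = I1(m)/I0(m), with the Bessel ratio evaluated from its integral representation. The function names, the search bound `m_max` and the sample angles are assumptions for illustration.

```python
import math


def circular_mean(thetas):
    """Mean direction of angles (radians), invariant to the choice of origin."""
    s = sum(math.sin(t) for t in thetas)
    c = sum(math.cos(t) for t in thetas)
    return math.atan2(s, c)


def bessel_ratio(m, steps=2000):
    """A(m) = I1(m) / I0(m) from the integral representations of I0 and I1."""
    num = den = 0.0
    for i in range(steps):
        t = 2.0 * math.pi * (i + 0.5) / steps
        # Factor exp(-m) cancels in the ratio and avoids overflow for large m.
        w = math.exp(m * (math.cos(t) - 1.0))
        num += math.cos(t) * w
        den += w
    return num / den


def fit_von_mises(thetas, m_max=500.0, iters=60):
    """ML estimates (theta0, m); assumes the true m lies in [0, m_max]."""
    theta0 = circular_mean(thetas)
    target = sum(math.cos(t - theta0) for t in thetas) / len(thetas)
    lo, hi = 0.0, m_max
    for _ in range(iters):  # A(m) increases monotonically from 0 towards 1
        mid = 0.5 * (lo + hi)
        if bessel_ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    return theta0, 0.5 * (lo + hi)


# Angles clustered around 0 degrees; their arithmetic mean would be 180 degrees.
thetas = [math.radians(d) for d in (350.0, 10.0, 355.0, 5.0)]
theta0, m = fit_von_mises(thetas)
print(math.degrees(theta0), m)
```

The example shows why the naive arithmetic mean is unsuitable for periodic data: the circular mean lands near 0 degrees, where the points actually cluster.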
Some alternative techniques for the construction of periodic distributions
• Approach 1: the simplest approach is to use a histogram of observations in which the angular coordinate is divided into fixed bins => simple and flexible, but significantly limited (Section 2.5)
• Approach 2: like the von Mises distribution, start from a Gaussian distribution over a Euclidean space, but marginalize onto the unit circle rather than conditioning
• Approach 3: 'wrapping' the real axis around the unit circle
  • Mapping successive intervals of width 2π onto (0, 2π)
• Mixtures of von Mises distributions
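2.3.9 Mixtures of Gaussians
• Mixture of Gaussians: $p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
  • $\pi_k$: mixing coefficients, with $0 \leq \pi_k \leq 1$ and $\sum_k \pi_k = 1$
• From the sum and product rules, the marginal density is given by $p(\mathbf{x}) = \sum_{k=1}^{K} p(k)\, p(\mathbf{x} \mid k)$
  • $\gamma_k(\mathbf{x}) \equiv p(k \mid \mathbf{x})$: responsibility, plays an important role
• [Figure: example data set that requires a mixture distribution; example of a Gaussian mixture distribution (3 Gaussians)]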
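A minimal one-dimensional sketch of the mixture density and the responsibilities, obtained by applying Bayes' theorem to the mixing coefficients and component densities. The function names and the two-component parameters are illustrative assumptions.

```python
import math


def normal_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, pis, mus, variances):
    """p(x) = sum_k pi_k N(x | mu_k, var_k)."""
    return sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances))


def responsibilities(x, pis, mus, variances):
    """gamma_k(x) = p(k | x), the posterior probability of each component."""
    weighted = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical symmetric two-component mixture.
pis, mus, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
print(mixture_pdf(0.0, pis, mus, variances))
print(responsibilities(2.0, pis, mus, variances))
```

At the midpoint x = 0 both components are equally responsible; moving towards one mean shifts the responsibility to that component.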
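2.3.9 Mixtures of Gaussians
• The maximum likelihood solution for the parameters
  • $\ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$
  • No closed-form analytical solution
  • Needs iterative numerical optimization techniques, or
  • Expectation maximization (Chapter 9)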
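2.4 The Exponential Family
• The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form $p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \}$
  • x: scalar or vector
  • η: natural parameters
  • u(x): some function of x (sufficient statistic)
  • g(η): inverse of the normalizer, so that $g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \}\, d\mathbf{x} = 1$
• Alternative form, with the log-partition function $A(\boldsymbol{\eta}) = -\ln g(\boldsymbol{\eta})$: $p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta}) \}$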
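Bernoulli distribution as an exponential family
• Bernoulli distribution: $p(x \mid \mu) = \mu^x (1 - \mu)^{1 - x} = (1 - \mu) \exp\left\{ \ln\left( \frac{\mu}{1 - \mu} \right) x \right\}$
• Natural parameter: $\eta = \ln\left( \frac{\mu}{1 - \mu} \right)$, so $\mu = \sigma(\eta) = \frac{1}{1 + \exp(-\eta)}$ (logistic sigmoid function)
• Exponential family form: $p(x \mid \eta) = \sigma(-\eta) \exp(\eta x)$, with $u(x) = x$, $h(x) = 1$, $g(\eta) = \sigma(-\eta)$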
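As a quick check of this reparameterization, the sketch below maps μ to the natural parameter η with the logit, maps back with the logistic sigmoid, and evaluates the Bernoulli probability in the form σ(-η)·exp(ηx). Function names and the value μ = 0.3 are illustrative.

```python
import math


def logit(mu):
    """Natural parameter eta = ln(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))


def sigmoid(eta):
    """Logistic sigmoid, the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-eta))


def bernoulli_exp_family(x, eta):
    """p(x | eta) = sigma(-eta) * exp(eta * x) for x in {0, 1}."""
    return sigmoid(-eta) * math.exp(eta * x)


mu = 0.3
eta = logit(mu)
print(sigmoid(eta), bernoulli_exp_family(1, eta), bernoulli_exp_family(0, eta))
```

The three printed values should be close to μ, μ and 1 - μ respectively, up to floating-point rounding.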
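Multinomial distribution as an exponential family
• Multinomial distribution: $p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{M} \mu_k^{x_k} = \exp\left\{ \sum_{k=1}^{M} x_k \ln \mu_k \right\}$, with $\eta_k = \ln \mu_k$
• Removing the constraint $\sum_k \mu_k = 1$ and using M-1 parameters: $\eta_k = \ln\left( \frac{\mu_k}{1 - \sum_{j=1}^{M-1} \mu_j} \right)$
• Solving for $\mu_k$ gives $\mu_k = \frac{\exp(\eta_k)}{1 + \sum_{j=1}^{M-1} \exp(\eta_j)}$: softmax function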
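The M-1 parameter form can be sketched as a pair of conversions between the probabilities μ and the natural parameters η, assuming the last class is the one whose parameter is eliminated. The function names and the example probabilities are illustrative.

```python
import math


def eta_from_mu(mus):
    """M probabilities -> M-1 natural parameters, eta_k = ln(mu_k / mu_M)."""
    # 1 - sum of the first M-1 probabilities equals the last probability mu_M.
    return [math.log(m / mus[-1]) for m in mus[:-1]]


def mu_from_eta(etas):
    """M-1 natural parameters -> M probabilities via the softmax with one fixed logit."""
    denom = 1.0 + sum(math.exp(e) for e in etas)
    mus = [math.exp(e) / denom for e in etas]
    return mus + [1.0 / denom]  # the remaining class takes the leftover mass


mus = [0.2, 0.3, 0.5]
etas = eta_from_mu(mus)
print(etas, mu_from_eta(etas))
```

The round trip should recover the original probabilities, which sum to one by construction.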
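Gaussian distribution as an exponential family
• Gaussian distribution: $p(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} x^2 + \frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} \mu^2 \right\}$
• Exponential family form with $\boldsymbol{\eta} = \left( \mu/\sigma^2, \; -1/(2\sigma^2) \right)^T$, $\mathbf{u}(x) = (x, x^2)^T$, $h(x) = (2\pi)^{-1/2}$, $g(\boldsymbol{\eta}) = (-2\eta_2)^{1/2} \exp\left( \frac{\eta_1^2}{4\eta_2} \right)$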
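2.4.1 Maximum likelihood and sufficient statistics
• Maximum likelihood estimation of the parameter vector η in the general exponential family distribution
• The property of the partition function: $g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \}\, d\mathbf{x} = 1$
• Taking the gradient of both sides with respect to η: $-\nabla \ln g(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]$
  • In the alternative form with $A(\boldsymbol{\eta}) = -\ln g(\boldsymbol{\eta})$: $\nabla A(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]$
• The covariance of u(x) ~ the second derivatives of g(η)
• The higher-order moments ~ the nth derivatives of g(η)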
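Sufficient statistics
• With i.i.d. (independent, identically distributed) data $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
• The likelihood function: $p(X \mid \boldsymbol{\eta}) = \left( \prod_{n=1}^{N} h(\mathbf{x}_n) \right) g(\boldsymbol{\eta})^N \exp\left\{ \boldsymbol{\eta}^T \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n) \right\}$
• Setting the gradient of the log likelihood with respect to η to zero: $-\nabla \ln g(\boldsymbol{\eta}_{ML}) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)$
• The solution depends on the data only through $\sum_n \mathbf{u}(\mathbf{x}_n)$
  • $\sum_n \mathbf{u}(\mathbf{x}_n)$: sufficient statistic of the distribution
  • We do not need to store the entire data set itself, only the sufficient statistic
  • Ex) Bernoulli distribution: the sufficient statistic is the sum of the data points $\{x_n\}$
  • Ex) Gaussian: the sufficient statistics are the sum of $\{x_n\}$ and the sum of $\{x_n^2\}$
• Sufficiency in the Bayesian setting?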
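To illustrate that only the sufficient statistics need to be kept, the sketch below accumulates the count, the sum of x and the sum of x² for a univariate Gaussian and recovers the ML mean and variance from them. The class name and data are illustrative; note that computing the variance as a difference of two large numbers can lose precision on badly scaled data.

```python
class GaussianSuffStats:
    """Running sufficient statistics for a univariate Gaussian."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_x2 = 0.0

    def add(self, x):
        """Fold in one observation; the observation itself is not stored."""
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def ml_estimates(self):
        """ML mean and (biased) ML variance from the accumulated sums."""
        mean = self.sum_x / self.n
        variance = self.sum_x2 / self.n - mean * mean
        return mean, variance


stats = GaussianSuffStats()
for x in [1.0, 2.0, 3.0, 4.0]:
    stats.add(x)
mean, variance = stats.ml_estimates()
print(mean, variance)
```

Because the statistics are simple sums, two accumulators built on different chunks of data could be merged by adding their fields.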
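2.4.2 Conjugate priors
• The exponential family: $p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\{ \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \}$
• Conjugate prior: $p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu} \exp\{ \nu \boldsymbol{\eta}^T \boldsymbol{\chi} \}$
  • $f(\boldsymbol{\chi}, \nu)$: a normalization coefficient
  • ν: effective number of pseudo-observations
• Posterior distribution: $p(\boldsymbol{\eta} \mid X, \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{\nu + N} \exp\left\{ \boldsymbol{\eta}^T \left( \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n) + \nu \boldsymbol{\chi} \right) \right\}$
  • This takes the same functional form as the prior, confirming conjugacy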
2.4.3 Noninformative priors
• Role of a prior
  • If prior knowledge can be conveniently expressed through the prior distribution, so much the better
  • When we have little idea of what form the distribution should take, we use a noninformative prior
  • A noninformative prior is intended to have as little influence on the posterior distribution as possible
  • 'Letting the data speak for themselves'
• Two difficulties in the case of continuous parameters
  • If the domain of the parameter is unbounded => the prior cannot be normalized: improper
  • The transformation behaviour of a probability density under a nonlinear change of variables
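Two examples of noninformative priors
• Family of densities with translation invariance: $p(x \mid \mu) = f(x - \mu)$
  • Shifting x by a constant, $\hat{x} = x + c$, leaves the form of the density unchanged
  • The prior should satisfy $p(\mu - c) = p(\mu)$, so $p(\mu)$ is constant
  • Ex) mean of a Gaussian distribution
• Family of densities with scale invariance: $p(x \mid \sigma) = \frac{1}{\sigma} f\left( \frac{x}{\sigma} \right)$
  • Scaling x by a constant, $\hat{x} = cx$, leaves the form of the density unchanged
  • The prior should satisfy $p(\sigma) \propto 1/\sigma$
  • Ex) standard deviation of a Gaussian distribution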
2.5 Nonparametric Methods
(For more details, see Ch. 4 of Duda & Hart)
• Parametric approach to density estimation
  • Uses probability density functions with specific functional forms (unimodal!)
  • Governed by a small number of parameters
  • Parameters are determined from a data set
  • Limitation: the chosen density might be a poor model => poor prediction
• Nonparametric approach
  • Makes few assumptions about the form of the distribution
  • The form of the distribution typically depends on the size of the data set
  • Such models still contain parameters, but these control the model complexity rather than the form of the distribution
• Nonparametric Bayesian methods are attracting interest
Three nonparametric methods
• Histogram methods: $p_i = \frac{n_i}{N \Delta_i}$, where $\Delta_i$ is the width of the ith bin
• Kernel density estimators: V is fixed, K is determined from the data
• Nearest-neighbour methods: K is fixed, V is determined from the data
• Local density estimate shared by the last two: $p(\mathbf{x}) \simeq \frac{K}{NV}$
  • N: number of observations
  • K: number of points within some region
  • V: the volume of the region
• Common points
  • Concept of locality
  • Smoothing parameter
• [Figure: data of 50 points drawn from a mixture of 2 Gaussians (green)]
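More on kernel density estimators
• Set a kernel function (or Parzen window) for each data point
• Properties of the kernel function k(u) on a local region around a data point: $k(\mathbf{u}) \geq 0$, $\int k(\mathbf{u})\, d\mathbf{u} = 1$
• Uniform kernel function: $k(\mathbf{u}) = 1$ if $|u_i| \leq 1/2$ for $i = 1, \ldots, D$, and 0 otherwise
  • $p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D} k\left( \frac{\mathbf{x} - \mathbf{x}_n}{h} \right)$
• Gaussian kernel function (smoother)
  • $p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{ -\frac{\|\mathbf{x} - \mathbf{x}_n\|^2}{2h^2} \right\}$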
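A one-dimensional sketch of the Gaussian kernel density estimator: one Gaussian bump of width h is placed on every data point and the bumps are averaged. The function name, the data and the bandwidth h = 0.5 are illustrative choices.

```python
import math


def gaussian_kde(x, data, h):
    """Gaussian kernel density estimate at x (D = 1) with bandwidth h."""
    norm = 1.0 / (len(data) * math.sqrt(2.0 * math.pi * h * h))
    return norm * sum(math.exp(-(x - xn) ** 2 / (2.0 * h * h)) for xn in data)


# Hypothetical data; evaluating the density requires the whole data set.
data = [0.0, 1.0, 3.0]
print(gaussian_kde(0.5, data, h=0.5))

# Riemann-sum check that the estimate integrates to approximately one.
step = 0.01
total = sum(gaussian_kde(-10.0 + step * i, data, 0.5) for i in range(2001)) * step
print(total)
```

A small h gives a spiky, noisy estimate and a large h oversmooths, so h plays the role of the smoothing parameter.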
Pros and cons
• Histogram methods
  • Once the histogram is computed, the data set can be discarded
  • Easily applied to sequential data processing
  • The bin edges produce (artificial) discontinuities in the density
  • Poor scaling with dimensionality
• Kernel density estimators
  • No computation for 'training'
  • Require storage of the entire training set => the computational cost of evaluating the density grows linearly with the size of the data set
  • Fixed h: the optimal choice may depend on location
• K-nearest-neighbour method
  • The model is not a true density model (the integral over all space diverges)
Classification with K-NN
• We apply K-nearest-neighbour density estimation to each class separately, and then make use of Bayes' theorem
  • Density of each class: $p(\mathbf{x} \mid C_k) = \frac{K_k}{N_k V}$
  • Unconditional density: $p(\mathbf{x}) = \frac{K}{NV}$
  • Class priors: $p(C_k) = \frac{N_k}{N}$
  • Posterior probability of class membership: $p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})} = \frac{K_k}{K}$
• We wish to minimize the probability of misclassification
  • => assign the test point x to the class having the largest posterior probability, $K_k / K$
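A minimal sketch of this rule: find the K nearest training points, count how many belong to each class, and use K_k / K as the posterior. The function name and the toy data are illustrative, and the brute-force sort is chosen for clarity rather than speed.

```python
import math
from collections import Counter


def knn_classify(x, points, labels, k):
    """Return the predicted label and the posteriors K_k / K for a test point x."""
    # Indices of training points sorted by Euclidean distance to x.
    order = sorted(range(len(points)), key=lambda i: math.dist(x, points[i]))
    counts = Counter(labels[i] for i in order[:k])
    posteriors = {label: count / k for label, count in counts.items()}
    return max(posteriors, key=posteriors.get), posteriors


# Hypothetical two-class training set.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify((0.5, 0.5), points, labels, k=3))
print(knn_classify((0.5, 0.5), points, labels, k=5))
```

Choosing an odd K avoids ties in the two-class case; larger K smooths the decision boundary at the cost of locality.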
[Figure: illustrations of K-NN classification]
Good density models
• Nonparametric methods
  • Flexible, but require the entire training data set to be stored
• Parametric methods
  • Very restricted in terms of the forms of distribution they can represent
• What we want
  • Density models that are flexible, yet
  • whose complexity can be controlled independently of the size of the training set
• We shall see in subsequent chapters how to achieve this