
Drawing samples from high dimensional Gaussians using polynomials



Presentation Transcript


  1. Drawing samples from high dimensional Gaussians using polynomials Al Parker September 14, 2010

  2. Acknowledgements • Colin Fox, Physics, University of Otago • New Zealand Institute of Mathematics, University of Auckland • Center for Biofilm Engineering, Montana State University, Bozeman

  3. The normal or Gaussian distribution

  4. How to sample from a Gaussian N(µ, σ^2)? • Sample z ~ N(0, 1) • y = (σ^2)^(1/2) z + µ ~ N(µ, σ^2)
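In numpy, a one-line sketch of this recipe (the values of µ and σ^2 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.5, 4.0                  # illustrative values
z = rng.standard_normal()              # z ~ N(0, 1)
y = np.sqrt(sigma2) * z + mu           # y ~ N(mu, sigma2)
```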

  5. The multivariate Gaussian distribution

  6. How to sample from a Gaussian N(µ, Σ)? • Sample z ~ N(0, I) • y = Σ^(1/2) z + µ ~ N(µ, Σ) • (e.g. y = W Λ^(1/2) z + µ, using the eigendecomposition Σ = W Λ W^T)
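A numpy sketch of both factorizations mentioned in the talk, Cholesky and eigendecomposition, on an illustrative 3-dimensional Gaussian (µ and Σ are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-dimensional Gaussian (placeholder values, not from the talk).
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.1],
                  [0.2, 0.1, 1.5]])

z = rng.standard_normal(3)                 # z ~ N(0, I)

# Cholesky factorization: Sigma = C C^T, so y = C z + mu ~ N(mu, Sigma).
C = np.linalg.cholesky(Sigma)
y_chol = C @ z + mu

# Eigendecomposition: Sigma = W Lambda W^T, so y = W Lambda^(1/2) z + mu.
lam, W = np.linalg.eigh(Sigma)
y_eig = W @ (np.sqrt(lam) * z) + mu
```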

  7. Example: From 64 faces, modeling "face space" with a Gaussian process N(μ, Σ). Pixel intensity at the i-th row and j-th column is y(s(i,j)); y(s) ∈ R^(112 x 112), μ(s) ∈ R^(112 x 112), Σ(s,s) ∈ R^(12544 x 12544)

  8. [Figure: samples ~ N(μ, Σ), with the mean μ displayed as a face image]

  9. How to estimate μ,Σ for N(μ,Σ)? • MLE/BLUE (least squares) • MVQUE • Use a Bayesian Posterior via MCMC

  10. Another example: Interpolation

  11. One can assume a covariance function which has some parameters θ
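For concreteness, a squared-exponential covariance function is one common choice; this particular family is an assumption for illustration, the talk only says "some parameters θ":

```python
import numpy as np

def sq_exp_cov(s, t, theta):
    """Squared-exponential covariance k(s, t) with parameters
    theta = (sigma2, ell): marginal variance and length scale.
    (An illustrative choice; the talk does not specify the family.)"""
    sigma2, ell = theta
    d = np.linalg.norm(np.asarray(s) - np.asarray(t))
    return sigma2 * np.exp(-0.5 * (d / ell) ** 2)

# Build a covariance matrix over a set of locations.
locations = np.linspace(0.0, 1.0, 5)
theta = (1.0, 0.2)
Sigma = np.array([[sq_exp_cov(si, sj, theta) for sj in locations]
                  for si in locations])
```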

  12. I used a Bayesian posterior for θ|data to construct μ|data

  13. Simulating the process: samples from N(μ, Σ | data) • y = Σ^(1/2) z + μ

  14. Gaussian Processes modeling global ozone Cressie and Johannesson, Fixed rank kriging for very large spatial datasets, 2006

  15. Gaussian Processes modeling global ozone

  16. The problem • To generate a sample y = Σ^(1/2) z + µ ~ N(µ, Σ), how to calculate the factorization Σ = Σ^(1/2) (Σ^(1/2))^T ? • Σ^(1/2) = W Λ^(1/2) by eigendecomposition, 10/3 n^3 flops • Σ^(1/2) = C by Cholesky factorization, 1/3 n^3 flops • For LARGE Gaussians (n > 10^5, e.g. in image analysis and global data sets), these approaches are not possible: • n^3 flops is computationally TOO EXPENSIVE • storing an n x n matrix requires TOO MUCH MEMORY

  17. Some solutions • Work with sparse precision matrix Σ^(-1) models (Rue, 2001) • Circulant embeddings (Gneiting et al, 2005) • Iterative methods: • Advantages: • COST: n^2 flops per iteration • MEMORY: only vectors of size n x 1 need be stored • Disadvantages: • If the method runs for n iterations, then there is no cost saving over a direct method

  18. Gibbs: an iterative sampler of N(0, A) and N(0, A^(-1)) Let A = Σ or A = Σ^(-1) • Split A into D = diag(A), L = lower(A) (the strictly lower triangular part), L^T = upper(A) • Sample z ~ N(0, I) • Take conditional samples in each coordinate direction, so that a full sweep of all n coordinates is y^k = -D^(-1) L y^k - D^(-1) L^T y^(k-1) + D^(-1/2) z • y^k converges in distribution geometrically to N(0, A^(-1)) • A y^k converges in distribution geometrically to N(0, A)
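A coordinate-by-coordinate numpy sketch of one Gibbs sweep written directly from this splitting; the precision matrix A below is an illustrative placeholder:

```python
import numpy as np

def gibbs_sweep(A, y, rng):
    """One forward sweep of the Gibbs sampler for N(0, A^(-1)):
    D y_new = -L y_new - L^T y_old + D^(1/2) z, done coordinate by coordinate."""
    n = A.shape[0]
    y = y.copy()
    z = rng.standard_normal(n)
    for i in range(n):
        # new values for j < i, old values for j > i
        s = A[i, :i] @ y[:i] + A[i, i+1:] @ y[i+1:]
        y[i] = (-s + np.sqrt(A[i, i]) * z[i]) / A[i, i]
    return y

# Illustrative precision matrix (placeholder, not from the talk).
rng = np.random.default_rng(1)
A = np.array([[2.0, 0.8, 0.0],
              [0.8, 2.0, 0.8],
              [0.0, 0.8, 2.0]])
y = np.zeros(3)
for _ in range(200):
    y = gibbs_sweep(A, y, rng)         # y ~ N(0, A^(-1)) in the limit
```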

  19. Gibbs: an iterative sampler Gibbs sampling from N(µ,Σ) starting from (0,0)

  20. Gibbs: an iterative sampler Gibbs sampling from N(µ,Σ) starting from (0,0)

  21. There’s a link to solving Ax = b • Solving Ax = b is equivalent to minimizing an n-dimensional quadratic f(x) = 1/2 x^T A x - b^T x (when A is symmetric positive definite) • A Gaussian is sufficiently specified by the same quadratic (with A = Σ^(-1) and b = Aμ): π(y) ∝ exp(-1/2 y^T A y + b^T y)

  22. Gauss-Seidel linear solve of Ax = b • Split A into D = diag(A), L = lower(A), L^T = upper(A) • Minimize the quadratic f(x) in each coordinate direction, so that a full sweep of all n coordinates is • x^k = -D^(-1) L x^k - D^(-1) L^T x^(k-1) + D^(-1) b • x^k converges geometrically to A^(-1) b
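For comparison with the Gibbs sweep sketched above, the same coordinate sweep as a Gauss-Seidel solver: the only change is that D^(-1) b replaces the injected noise D^(-1/2) z.

```python
import numpy as np

def gauss_seidel_sweep(A, x, b):
    """One forward Gauss-Seidel sweep for A x = b:
    D x_new = -L x_new - L^T x_old + b, coordinate by coordinate.
    Identical to the Gibbs sweep except for the b term."""
    n = A.shape[0]
    x = x.copy()
    for i in range(n):
        s = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
        x[i] = (b[i] - s) / A[i, i]
    return x
```

Repeating the sweep until the residual b - Ax is small gives the stationary solve discussed on the next slides.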

  23. Gauss-Seidel linear solve of Ax = b

  24. Gauss-Seidel linear solve of Ax = b • x^k converges geometrically to A^(-1) b: (x^k - A^(-1) b) = G^k (x^0 - A^(-1) b), where ρ(G) < 1

  25. Theorem: A Gibbs sampler is a Gauss-Seidel linear solver Proof: • A Gibbs sampler is y^k = -D^(-1) L y^k - D^(-1) L^T y^(k-1) + D^(-1/2) z • A Gauss-Seidel linear solve of Ax = b is x^k = -D^(-1) L x^k - D^(-1) L^T x^(k-1) + D^(-1) b

  26. Gauss-Seidel is a stationary linear solver • A Gauss-Seidel linear solve of Ax = b is x^k = -D^(-1) L x^k - D^(-1) L^T x^(k-1) + D^(-1) b • Gauss-Seidel can be written as M x^k = N x^(k-1) + b where M = D + L and N = -L^T, so that A = M - N: the general form of a stationary linear solver

  27. Stationary linear solvers of Ax = b • Split A = M - N • Iterate M x^k = N x^(k-1) + b, equivalently x^k = M^(-1) N x^(k-1) + M^(-1) b = G x^(k-1) + M^(-1) b • x^k converges geometrically to A^(-1) b: (x^k - A^(-1) b) = G^k (x^0 - A^(-1) b) when ρ(G) = ρ(M^(-1) N) < 1

  28. Stationary samplers from stationary solvers • Solving Ax = b: • Split A = M - N • Iterate M x^k = N x^(k-1) + b • x^k → A^(-1) b if ρ(M^(-1) N) < 1 • Sampling from N(0, A) and N(0, A^(-1)): • Split A = M - N • Iterate M y^k = N y^(k-1) + c^(k-1), where c^(k-1) ~ N(0, M^T + N) • y^k → N(0, A^(-1)) in distribution if ρ(M^(-1) N) < 1 • A y^k → N(0, A) in distribution if ρ(M^(-1) N) < 1
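A minimal numpy sketch of this stationary sampler, using the Gauss-Seidel splitting M = D + L as a concrete example; the test matrix A is a placeholder, and the noise covariance M^T + N is factored with a Cholesky here, which is cheap for the splittings listed on the next slide:

```python
import numpy as np

def stationary_sampler(A, M, n_iter, rng):
    """Approximately sample y ~ N(0, A^(-1)) via the splitting A = M - N:
    iterate M y_k = N y_{k-1} + c_{k-1},  c_{k-1} ~ N(0, M^T + N)."""
    n = A.shape[0]
    N = M - A
    noise_cov = M.T + N                     # covariance of the injected noise
    Cn = np.linalg.cholesky(noise_cov)      # diagonal for Gauss-Seidel, so cheap
    y = np.zeros(n)
    for _ in range(n_iter):
        c = Cn @ rng.standard_normal(n)
        y = np.linalg.solve(M, N @ y + c)   # in practice a triangular/diagonal solve
    return y                                # y ~ N(0, A^(-1));  A y ~ N(0, A)

rng = np.random.default_rng(2)
A = np.array([[2.0, 0.8, 0.0],
              [0.8, 2.0, 0.8],
              [0.0, 0.8, 2.0]])              # placeholder precision matrix
M = np.tril(A)                               # Gauss-Seidel splitting: M = D + L
y = stationary_sampler(A, M, 200, rng)
```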

  29. How to sample c^(k-1) ~ N(0, M^T + N)? • Gauss-Seidel: M = D + L, c^(k-1) ~ N(0, D) • SOR (successive over-relaxation): M = (1/ω) D + L, c^(k-1) ~ N(0, ((2 - ω)/ω) D) • Richardson: M = I, c^(k-1) ~ N(0, 2I - A) • Jacobi: M = D, c^(k-1) ~ N(0, 2D - A)
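The splittings and noise covariances above, collected into one small helper (a sketch; omega is the SOR relaxation parameter):

```python
import numpy as np

def splitting(A, method, omega=1.0):
    """Return (M, noise_cov) for the splittings listed above; the stationary
    sampler then injects c ~ N(0, noise_cov), with noise_cov = M^T + N."""
    D = np.diag(np.diag(A))
    L = np.tril(A, -1)
    I = np.eye(A.shape[0])
    if method == "gauss-seidel":
        return D + L, D
    if method == "sor":
        return D / omega + L, (2.0 - omega) / omega * D
    if method == "richardson":
        return I, 2.0 * I - A
    if method == "jacobi":
        return D, 2.0 * D - A
    raise ValueError(method)
```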

  30. Theorem: the stationary linear solver converges iff the stationary sampler converges, and the convergence of both is geometric • Proof: They have the same iteration operator: For linear solves: x^k = G x^(k-1) + M^(-1) b, so that (x^k - A^(-1) b) = G^k (x^0 - A^(-1) b) For sampling: y^k = G y^(k-1) + M^(-1) c^(k-1), so that E(y^k) = G^k E(y^0) and Var(y^k) = A^(-1) - G^k A^(-1) (G^k)^T (starting from y^0 = 0) Proof for Gaussians given by Barone and Frigessi, 1990; for arbitrary distributions by Duflo, 1997

  31. Acceleration schemes for stationary linear solvers can be used to accelerate stationary samplers Polynomial acceleration of a stationary solver of Ax = b is 1. Split A = M - N 2. x^(k+1) = (1 - v_k) x^(k-1) + v_k x^k + v_k u_k M^(-1) (b - A x^k), which replaces (x^k - A^(-1) b) = G^k (x^0 - A^(-1) b) with a k-th order polynomial (x^k - A^(-1) b) = p_k(G) (x^0 - A^(-1) b)

  32. Chebyshev acceleration • x^(k+1) = (1 - v_k) x^(k-1) + v_k x^k + v_k u_k M^(-1) (b - A x^k), where v_k, u_k are functions of the 2 extreme eigenvalues of G (estimates of these eigenvalues are not very expensive to obtain) • Gauss-Seidel converged like this …

  33. Chebyshev acceleration • x^(k+1) = (1 - v_k) x^(k-1) + v_k x^k + v_k u_k M^(-1) (b - A x^k), where v_k, u_k are functions of the 2 extreme eigenvalues of G (estimates of these eigenvalues are not very expensive to obtain) • … convergence (geometric-like) with Chebyshev acceleration
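A sketch of Chebyshev acceleration written in the common three-term form, which is algebraically equivalent to the second-order (v_k, u_k) recursion above; lam_min and lam_max are assumed bounds on the eigenvalues of M^(-1) A, i.e. one minus the extreme eigenvalues of G. The demo splitting and matrices are placeholders.

```python
import numpy as np

def chebyshev_solve(A, M, b, lam_min, lam_max, n_iter):
    """Chebyshev-accelerated stationary solve of A x = b with splitting matrix M.
    lam_min, lam_max bound the eigenvalues of M^(-1) A (assumed real, positive)."""
    theta = 0.5 * (lam_max + lam_min)        # center of the eigenvalue interval
    delta = 0.5 * (lam_max - lam_min)        # half-width of the interval
    sigma = theta / delta
    x = np.zeros_like(b)
    r = b - A @ x
    rho = 1.0 / sigma
    d = np.linalg.solve(M, r) / theta        # first update direction
    for _ in range(n_iter):
        x = x + d
        r = r - A @ d
        rho_new = 1.0 / (2.0 * sigma - rho)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * np.linalg.solve(M, r)
        rho = rho_new
    return x

# Toy demo with the Richardson splitting M = I (placeholder A, b); the exact
# eigenvalue bounds are computed here only to keep the illustration short.
A = np.array([[2.0, 0.8, 0.0], [0.8, 2.0, 0.8], [0.0, 0.8, 2.0]])
b = np.array([1.0, 0.0, -1.0])
lam = np.linalg.eigvalsh(A)
x = chebyshev_solve(A, np.eye(3), b, lam.min(), lam.max(), 50)
```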

  34. Polynomial accelerated stationary sampler from N(0, A) and N(0, A^(-1)) 1. Split A = M - N 2. y^(k+1) = (1 - v_k) y^(k-1) + v_k y^k + v_k u_k M^(-1) (c^k - A y^k), where c^k ~ N(0, ((2 - v_k)/v_k) (((2 - u_k)/u_k) M^T + N))

  35. Theorem: A polynomial accelerated sampler converges with the same convergence rate as the corresponding linear solver, as long as v_k, u_k are independent of the iterates y^k. [Figure: convergence of the Gibbs sampler vs. the Chebyshev accelerated Gibbs sampler]

  36. Chebyshev acceleration is guaranteed to be faster than a Gibbs sampler. [Figure: covariance matrix convergence, ||A^(-1) - S^k||_2]

  37. Chebyshev accelerated Gibbs sampling in 10^6 dimensions: • data = SPHERE + ε, ε ~ N(0, σ^2 I) • Sample from π(SPHERE | data)

  38. Conclusions Gaussian Processes are cool! Common techniques from numerical linear algebra can be used to sample from Gaussians • Cholesky factorization (precise but expensive) • Any stationary linear solver can be used as a stationary sampler (inexpensive but with geometric convergence) • Stationary samplers can be accelerated by polynomials (guaranteed!) • Polynomial accelerated Samplers • Chebyshev • Conjugate Gradients • Lanczos Sampler

  39. Estimation of Σ(θ, r) from the data using a Markov chain

  40. Marginal Posteriors

  41. Conjugate Gradient (CG) acceleration • x^(k+1) = (1 - v_k) x^(k-1) + v_k x^k + v_k u_k M^(-1) (b - A x^k), where v_k, u_k are functions of the residuals b - A x^k • … convergence guaranteed in finitely many (at most n) steps with CG acceleration

  42. Conjugate Gradient (CG) acceleration • The theorem does not apply since the parameters v_k, u_k are functions of the residuals b - A y^k • We have devised an approach called a CD sampler to construct samples with covariance Var(y^k) = V_k D_k^(-1) V_k^T ≈ A^(-1), where V_k is a matrix of unit-length residuals b - A x^k from the standard CG algorithm.

  43. CD sampler (CG accelerated Gibbs) • A GOOD THING: The CG algorithm is a great linear solver! If the eigenvalues of A fall in c clusters, then a solution to Ax = b is found in c << n steps. • A PROBLEM: When the CG residuals get small, the CD sampler is forced to stop after only c << n steps. Thus, covariances with well separated eigenvalues work well. • The CD samples y^k (approximately N(0, A^(-1))) and A y^k (approximately N(0, A)) have the correct covariances when the eigenvectors of A with the smallest/largest eigenvalues lie in the Krylov space spanned by the residuals.
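A hedged sketch of a CG-based (CD-style) sampler: standard CG on Ax = b, with an extra accumulation of independent N(0,1) weights on the A-conjugate search directions. This is written with the conjugate directions p_k rather than the normalized residuals V_k of the slide, but the sample covariance, P diag(p_k^T A p_k)^(-1) P^T, likewise approaches A^(-1) on the Krylov space that CG explores. The matrices below are placeholders.

```python
import numpy as np

def cd_sampler(A, b, n_iter, rng, tol=1e-12):
    """Run standard CG on A x = b and, along the way, build a sample y whose
    covariance is P diag(p_k^T A p_k)^(-1) P^T, an approximation to A^(-1)
    on the Krylov space spanned by the search directions."""
    x = np.zeros_like(b)
    y = np.zeros_like(b)                     # the sample being built
    r = b - A @ x
    p = r.copy()
    for _ in range(n_iter):
        Ap = A @ p
        gamma = p @ Ap                       # p_k^T A p_k
        if gamma <= tol:                     # residuals too small: CG (and sampler) stop early
            break
        alpha = (r @ r) / gamma
        x = x + alpha * p                    # usual CG solution update
        y = y + (rng.standard_normal() / np.sqrt(gamma)) * p   # sampler update
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x, y

rng = np.random.default_rng(3)
A = np.array([[2.0, 0.8, 0.0], [0.8, 2.0, 0.8], [0.0, 0.8, 2.0]])
b = rng.standard_normal(3)
x, y = cd_sampler(A, b, n_iter=3, rng=rng)
```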

  44. Lanczos sampler • Fixing the problem of small residuals is easy: hijack the iterative Lanczos eigen-solver to produce samples y^k ~ N(0, A^(-1)) with Var(y^k) = W_k D_k^(-1) W_k^T ≈ A^(-1), where W_k is a matrix of “Lanczos vectors”

  45. One extremely effective sampler for LARGE Gaussians Use a combination of the ideas presented: • Generate samples with the CD or Lanczos sampler while at the same time cheaply estimating the extreme eigenvalues of G. • Seed these samples and extreme eigenvalues into a Chebyshev accelerated SOR sampler
