
# Bayesian regularization of learning



##### Presentation Transcript

1. Bayesian regularization of learning. Sergey Shumsky, NeurOK Software LLC

2. Scientific methods (diagram): Induction (F. Bacon), Data → Models, i.e. (machine) learning; Deduction (R. Descartes), Models → Data, i.e. mathematical modeling.

3. Outline

- Learning as an ill-posed problem
  - General problem: data generalization
  - General remedy: model regularization
- Bayesian regularization: theory
  - Hypothesis comparison
  - Model comparison
  - Free energy & EM algorithm
- Bayesian regularization: practice
  - Hypothesis testing
  - Function approximation
  - Data clustering

4. Outline (repeated; section: learning as an ill-posed problem)

5. Problem statement

- Learning is an inverse, ill-posed problem: Model ← Data
- Learning paradoxes
  - Infinite predictions from finite data?
  - How to optimize future predictions?
  - How to separate the regular from the accidental in data?
- Regularization of learning
  - Optimal model complexity

6. Well-posed problem

- Solution is unique
- Solution is stable
- Hadamard (1900s), Tikhonoff (1960s)

7. Learning from examples

- Problem: find the hypothesis h generating the observed data D within model H
- Well defined only if not sensitive to:
  - noise in the data (Hadamard)
  - the learning procedure (Tikhonoff)

8. Learning is an ill-posed problem

- Example: function approximation
- Sensitive to noise in the data
- Sensitive to the learning procedure

9. Learning is an ill-posed problem

- Solution is non-unique

10. Outline (repeated; section: model regularization)

11. Problem regularization

- Main idea: restrict the set of admissible solutions, sacrificing precision for stability
- How to choose the restriction?

12. Statistical learning practice

- Data = learning set + validation set
- Cross-validation
- Bayes: a systematic approach to ensembles
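The learning/validation split and cross-validation mentioned above can be sketched as follows; the sine data, polynomial model, degrees, and fold count are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average held-out MSE of a degree-`degree` polynomial fit over k folds."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)                      # all points not held out
        coef = np.polyfit(x[train], y[train], degree)        # fit on the learning set
        errs.append(np.mean((np.polyval(coef, x[fold]) - y[fold]) ** 2))
    return float(np.mean(errs))                              # validation error

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(40)   # noisy observations

# Validation error, not training error, is what selects the model complexity.
scores = {d: kfold_mse(x, y, d) for d in (1, 3, 9)}
```

A too-simple model (degree 1) underfits the sine and scores a high validation error, which is exactly the signal cross-validation uses to pick complexity.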

13. Outline (repeated; section: Bayesian regularization theory)

14. Statistical learning theory

- Learning as inverse probability
- Probability theory: H: h → D (Bernoulli, 1713)
- Learning theory: H: h ← D (Bayes, ~1750)

15. Bayesian learning

- Bayes' rule: Posterior = Likelihood × Prior / Evidence
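As a concrete instance of the prior/posterior/evidence triplet, here is a minimal two-hypothesis sketch of Bayes' rule; the coin example and its numbers are my own illustration, not from the slides:

```python
import numpy as np

# Two hypotheses about a coin: fair (p = 0.5) or biased (p = 0.8), equal priors.
priors = np.array([0.5, 0.5])
p_heads = np.array([0.5, 0.8])

# Observed data D: 8 heads in 10 flips.
heads, n = 8, 10
likelihood = p_heads**heads * (1 - p_heads)**(n - heads)   # P(D | h)

evidence = np.sum(priors * likelihood)         # P(D) = sum_h P(D|h) P(h)
posterior = priors * likelihood / evidence     # P(h | D), normalized by the evidence
```

The evidence is just the normalizer here; its real role appears later, when comparing whole models rather than hypotheses within one model.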

16. Monte Carlo simulations

17. Bayesian regularization

- Most probable hypothesis ⇒ minimize learning error + regularization term
- Example: function approximation
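The error-plus-regularization decomposition can be illustrated with the classic quadratic penalty: under Gaussian noise and a Gaussian prior, the most probable (MAP) weights minimize squared error plus a weighted squared norm. A minimal sketch; the polynomial features, penalty weight, and data are assumptions for illustration:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """MAP weights under Gaussian noise + Gaussian prior:
    minimizes ||y - Xw||^2 + lam * ||w||^2 (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
X = np.vander(x, 10)                        # degree-9 polynomial features
y = x**2 + 0.1 * rng.standard_normal(30)    # noisy quadratic data

w_free = ridge_fit(X, y, 1e-12)             # essentially unregularized fit
w_map = ridge_fit(X, y, 1.0)                # the prior shrinks the weights
```

The regularization term tames the ill-posedness of slide 8: the shrunken weights depend far less on the particular noise realization.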

18. Minimal Description Length, Rissanen (1978)

- Most probable hypothesis ⇔ shortest code length for the data plus the hypothesis
- Example: optimal prefix code (binary code tree with codewords 0, 10, 110, 111)
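A prefix code like the slide's example can be checked directly: no codeword is a prefix of another, and the code lengths satisfy the Kraft inequality. The symbol names a-d are assumed for illustration:

```python
# Prefix code built from the slide's example tree: codewords 0, 10, 110, 111.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

def is_prefix_free(code):
    """True if no codeword is a proper prefix of another codeword."""
    words = list(code.values())
    return not any(v != w and w.startswith(v) for v in words for w in words)

def kraft_sum(code):
    """Kraft inequality: any prefix code satisfies sum over codewords of 2^(-len) <= 1."""
    return sum(2.0 ** -len(w) for w in code.values())

def encode(message, code):
    """Concatenated codewords remain uniquely decodable for a prefix code."""
    return "".join(code[s] for s in message)
```

Equality in the Kraft sum (here exactly 1) means the code is complete: no codeword could be shortened without breaking the prefix property.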

19. Data complexity, Kolmogoroff (1965)

- Complexity K(D|H) = min L(h, D|H)
- Code length L(h, D) = coded data L(D|h) + decoding program L(h)
- (figure: program h decoding into data D)

20. Complex = unpredictable, Solomonoff (1978)

- Prediction error ~ L(h, D) / L(D)
- Random data is incompressible
- Compression = predictability
- Example: block coding
- (figure: program h of length L(h, D) decoding into data D)

21. Universal prior, Solomonoff (1960); Bayes (~1750)

- All 2^L programs of length L are equiprobable
- Data complexity L(h, D) defines the universal prior

22. Statistical ensemble

- The ensemble has a shorter description length
- Corollary: ensemble predictions are superior to the most probable prediction
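The corollary can be checked numerically: under the posterior, the posterior-weighted (ensemble) prediction minimizes expected squared error, so it never does worse than the single most probable hypothesis. The three hypotheses and their numbers below are an assumed toy setup:

```python
import numpy as np

# Posterior over three hypotheses and each hypothesis's point prediction.
posterior = np.array([0.5, 0.3, 0.2])
predictions = np.array([1.0, 2.0, 4.0])

map_pred = predictions[np.argmax(posterior)]       # most probable hypothesis alone
ensemble_pred = np.sum(posterior * predictions)    # posterior-weighted average

def expected_sq_err(guess):
    """Expected squared error of a guess when truth is drawn from the posterior."""
    return float(np.sum(posterior * (predictions - guess) ** 2))
```

The ensemble mean is exactly the minimizer of `expected_sq_err`, which is the squared-error version of the slide's corollary.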

23. Ensemble prediction

24. Outline (repeated; section: model comparison)

25. Model comparison

- Posterior over models ∝ Evidence × Prior

26. Statistics: Bayes vs. Fisher

- Fisher: maximum Likelihood
- Bayes: maximum Evidence

27. Historical outlook

- 1920s-60s
  - Parametric statistics (Fisher, 1912)
  - Asymptotic: N → ∞
- 1960s-80s
  - Non-parametric statistics (Chentsoff, 1962)
  - Regularization of ill-posed problems (Tikhonoff, 1963)
  - Non-asymptotic learning (Vapnik, 1968)
  - Algorithmic complexity (Kolmogoroff, 1965)
  - Statistical physics of disordered systems (Gardner, 1988)

28. Outline (repeated; section: free energy & EM algorithm)

29. Statistical physics

- Probability of a hypothesis: microstate
- Optimal model: macrostate

30. Free energy

- F = -log Z: log of a sum
- F = E - TS: sum of logs
- Gibbs distribution P = P{L}
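The two expressions for F agree once P is the Gibbs distribution over hypotheses with "energy" L(h); written at a general temperature T, the slide's F = -log Z is the T = 1 case:

```latex
% Gibbs distribution over hypotheses with "energy" L(h) at temperature T
Z = \sum_h e^{-L(h)/T}, \qquad P(h) = \frac{e^{-L(h)/T}}{Z}

% "Log of a sum" equals average energy minus temperature times entropy:
% since \log P(h) = -L(h)/T - \log Z, averaging over P gives
F \equiv -T \log Z
  = \sum_h P(h)\,L(h) + T \sum_h P(h)\log P(h)
  = E - TS
```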

31. EM algorithm: main idea

- Introduce an independent distribution P
- Iterate an E-step and an M-step

32. EM algorithm

- E-step: estimate the posterior for the given model
- M-step: update the model for the given posterior

33. Outline (repeated; section: Bayesian regularization practice)

34. Bayesian regularization: examples

- Hypothesis testing
- Function approximation
- Data clustering

35. Outline (repeated; section: hypothesis testing)

36. Hypothesis testing

- Problem: noisy observations y; is the theoretical value h0 true?
- Model H: Gaussian noise, Gaussian prior

37. Optimal model: phase transition

- Confidence: finite vs. infinite

38. Threshold effect

- Student coefficient
- Hypothesis h0 is true
- Corrections to h0
- (figure: posterior P(h) around h0 and around y)

39. Outline (repeated; section: function approximation)

40. Function approximation

- Problem: noisy data y(x); find an approximation h(x)
- Model: noise + prior

41. Optimal model

- Free energy minimization

42. Saddle-point approximation

- Free energy as a function of the best hypothesis

43. EM learning

- E-step: optimal hypothesis
- M-step: optimal regularization

44. Laplace prior

- Pruned weights
- Equisensitive weights
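The pruning behavior of a Laplace (L1) prior can be seen in its proximal operator, soft-thresholding: unlike a Gaussian penalty, it drives small weights exactly to zero. A minimal sketch with made-up weights:

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal step for an L1 (Laplace-prior) penalty: weights with
    |w| <= lam are pruned to exactly zero; the rest shrink by lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([2.0, -0.05, 0.3, -1.2, 0.01])
w_l1 = soft_threshold(w, 0.1)   # the two smallest weights become exactly zero
```

This is why a Laplace prior yields sparse, pruned weight vectors, while the Gaussian prior of the earlier ridge example only shrinks weights toward zero without ever reaching it.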

45. Outline (repeated; section: data clustering)

46. Clustering

- Problem: noisy data x; find prototypes (mixture density approximation)
- How many clusters?
- Model: noise

47. Optimal model

- Free energy minimization
- Iterate an E-step and an M-step

48. EM algorithm

- E-step
- M-step
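The two steps can be sketched for a 1-D Gaussian mixture; the data, initialization scheme, and iteration count below are assumptions of this illustration, not the slides' exact algorithm:

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=60):
    """EM for a 1-D Gaussian mixture (a minimal sketch)."""
    # Deterministic init: spread the means across the data range.
    mu = np.linspace(x.min(), x.max(), k)
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities r[n, j] = P(cluster j | x_n)
        dens = pi / sigma * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, prototypes, and widths from responsibilities
        n = r.sum(axis=0)
        pi = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n) + 1e-9
    return pi, mu, sigma

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])
pi, mu, sigma = em_gmm_1d(x)
```

The E-step is the posterior estimate of slide 32 and the M-step its model update; the recovered means land near the two prototypes that generated the data.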

49. h(m) 1/ How many clusters? • Number of clusters M() • Optimal number of clusters