Bayesian Regularization in Learning: A Comprehensive Approach to Ill-Posed Problems
This document explores the intricacies of Bayesian regularization as a solution to ill-posed learning problems. It delves into model generalization, hypothesis and model comparison, and the application of the Free Energy and EM algorithm. Emphasizing the importance of model regularization, it discusses how to balance precision and stability in learning. Key concepts include function approximation, data clustering, and the use of statistical learning theory. By understanding these foundational elements, practitioners can enhance the predictability and reliability of their learning models.
Presentation Transcript
Bayesian regularization of learning Sergey Shumsky NeurOK Software LLC
Scientific methods • Induction (F. Bacon): Data → Models; machine learning • Deduction (R. Descartes): Models → Data; mathematical modeling
Outline • Learning as ill-posed problem • General problem: data generalization • General remedy: model regularization • Bayesian regularization. Theory • Hypothesis comparison • Model comparison • Free Energy & EM algorithm • Bayesian regularization. Practice • Hypothesis testing • Function approximation • Data clustering
Problem statement • Learning is an inverse, ill-posed problem • Model ← Data • Learning paradoxes • Infinite predictions ← Finite data? • How to optimize future predictions? • How to separate the regular from the accidental in data? • Regularization of learning • Optimal model complexity
Well-posed problem • Solution is unique • Solution is stable • Hadamard (1900s) • Tikhonoff (1960s)
Learning from examples • Problem: • Find hypothesis h, generating observed data D in model H • Well defined if not sensitive to: • noise in data (Hadamard) • learning procedure (Tikhonoff)
Learning is an ill-posed problem • Example: Function approximation • Sensitive to noise in data • Sensitive to learning procedure
Learning is an ill-posed problem • Solution is non-unique
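The sensitivity to noise can be illustrated with a small sketch. It is not taken from the slides: the node layout, the zero target function, and the noise size `eps` are assumptions chosen for illustration. An exact interpolating polynomial is fitted to 11 equally spaced samples, one sample is perturbed slightly, and the change in the prediction between two nodes is measured.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the unique interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        basis = 1.0
        for k, xk in enumerate(xs):
            if k != j:
                basis *= (x - xk) / (xj - xk)
        total += yj * basis
    return total

xs = [float(i) for i in range(11)]   # 11 equally spaced nodes (assumption)
ys = [0.0] * 11                      # "true" data: the zero function
eps = 0.01                           # small noise added to a single sample
noisy = list(ys)
noisy[5] += eps

x_query = 0.5                        # a point between the first two nodes
change = lagrange_eval(xs, noisy, x_query) - lagrange_eval(xs, ys, x_query)
amplification = abs(change) / eps
print(f"noise {eps} changes the prediction by {change:.4f} "
      f"(amplified {amplification:.1f} times)")
```

The unrestricted fit amplifies the perturbation several times over, which is the instability that regularization is meant to suppress.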
Problem regularization • Main idea: restrict solutions, sacrificing precision for stability • How to choose?
Statistical Learning practice • Data → Learning set + Validation set • Cross-validation: several learning/validation splits (+ + … +) • Systematic approach to ensembles → Bayes
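A minimal sketch of cross-validation as a way to choose the amount of regularization. The one-parameter ridge model `y = w * x`, the synthetic data, and the candidate list are all assumptions made for illustration; the slides do not specify a model here.

```python
import random

def fit_ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for the one-parameter model y = w * x."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def cv_error(xs, ys, lam, k=5):
    """Mean squared validation error averaged over k folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        val_idx = set(range(fold, n, k))      # every k-th point is held out
        tr_x = [xs[i] for i in range(n) if i not in val_idx]
        tr_y = [ys[i] for i in range(n) if i not in val_idx]
        w = fit_ridge_slope(tr_x, tr_y, lam)
        total += sum((ys[i] - w * xs[i]) ** 2 for i in val_idx) / len(val_idx)
    return total / k

rng = random.Random(0)                        # fixed seed for repeatability
xs = [rng.uniform(-1, 1) for _ in range(30)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]     # hypothetical penalty strengths
errors = {lam: cv_error(xs, ys, lam) for lam in candidates}
best_lam = min(errors, key=errors.get)
print("selected penalty:", best_lam)
```

Each fold plays the role of one learning set + validation set split; averaging over folds is what makes the choice of penalty less dependent on a single split.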
Statistical Learning theory • Learning as inverse probability • Probability theory. H: h → D. Bernoulli (1713) • Learning theory. H: h ← D. Bayes (~1750)
Bayesian learning • Posterior = Likelihood × Prior / Evidence: $P(h \mid D, H) = \dfrac{P(D \mid h, H)\, P(h \mid H)}{P(D \mid H)}$
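Bayesian regularization • Most probable hypothesis: $h_{MP} = \arg\max_h P(h \mid D, H) = \arg\min_h \left[ -\log P(D \mid h, H) - \log P(h \mid H) \right]$ • Learning error: $-\log P(D \mid h, H)$ • Regularization: $-\log P(h \mid H)$ • Example: Function approximation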
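Minimal Description Length. Rissanen (1978) • Most probable hypothesis: $h_{MP} = \arg\min_h \left[ L(D \mid h) + L(h) \right]$ • Code length for: • Data: $L(D \mid h) = -\log_2 P(D \mid h, H)$ • Hypothesis: $L(h) = -\log_2 P(h \mid H)$ • Example: Optimal prefix code (code-tree figure with nodes 0, 1, 10, 11, 110, 111)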
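The link between probability and code length can be checked on a small prefix code. The sketch below builds a Huffman code, which is one standard way to obtain an optimal prefix code; the slides do not say which construction they use, and the four-symbol distribution is an assumption picked so that every probability is a power of 1/2.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a Huffman prefix code."""
    # Heap items: (probability, tie-breaker, {symbol: depth so far}).
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # illustrative only
lengths = huffman_code_lengths(probs)
ideal = {s: -math.log2(p) for s, p in probs.items()}     # L = -log2 P
mean_len = sum(probs[s] * lengths[s] for s in probs)
entropy = sum(-p * math.log2(p) for p in probs.values())
print(lengths, mean_len, entropy)
```

For this distribution each code length equals -log2 of the symbol's probability, so the mean code length equals the entropy. For general probabilities the Huffman lengths are integers and only approximate -log2 P.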
Data Complexity • Complexity $K(D \mid H) = \min_h L(h, D \mid H)$. Kolmogoroff (1965) • Code length L(h,D) = coded data L(D|h) + decoding program L(h) • Decoding → Data D
Complex = Unpredictable. Solomonoff (1978) • Prediction error ~ L(h,D)/L(D) • Random data is incompressible • Compression = predictability • Example: block coding • Program h: length L(h,D) → Decoding → Data D
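A quick way to see "compression = predictability" is to hand a general-purpose compressor one fully regular sequence and one pseudo-random sequence. This is only a rough stand-in for the ideal code length L(h,D): `zlib` is a practical compressor, not a shortest-program search, and the data sizes are arbitrary.

```python
import random
import zlib

rng = random.Random(0)                                  # fixed seed
n = 10_000
regular = b"ab" * (n // 2)                              # fully predictable
noise = bytes(rng.randrange(256) for _ in range(n))     # pseudo-random bytes

def compression_ratio(data):
    """Compressed size divided by original size (smaller = more regular)."""
    return len(zlib.compress(data, 9)) / len(data)

ratio_regular = compression_ratio(regular)
ratio_noise = compression_ratio(noise)
print(f"regular: {ratio_regular:.3f}, noise: {ratio_noise:.3f}")
```

The regular sequence shrinks to a tiny fraction of its size, while the pseudo-random one does not shrink at all, in line with the statement that random data is incompressible.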
Universal Prior • All $2^L$ programs with length L are equiprobable • Data complexity • Solomonoff (1960) • Bayes (~1750)
Statistical ensemble • Shorter description length • Proof: • Corollary: Ensemble predictions are superior to most probable prediction
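The proof itself did not survive in the transcript. One standard argument, sketched here under the assumption that each joint description (h, D) of length L(h,D) is assigned probability $2^{-L(h,D)}$ as in the universal prior, is that a sum of non-negative terms is at least as large as its largest term:

```latex
\begin{aligned}
L(D \mid H) &= -\log_2 P(D \mid H) = -\log_2 \sum_{h} 2^{-L(h, D)} \\
            &\le -\log_2 \max_{h} 2^{-L(h, D)} = \min_{h} L(h, D).
\end{aligned}
```

Under that assumption, the description length of the data under the whole ensemble of hypotheses never exceeds the description length under the single best hypothesis.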
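Model comparison • Posterior ∝ Evidence × Prior: $P(H \mid D) \propto P(D \mid H)\, P(H)$ • Evidence: $P(D \mid H) = \sum_h P(D \mid h, H)\, P(h \mid H)$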
Statistics: Bayes vs. Fisher • Fisher: max Likelihood • Bayes: max Evidence
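The difference between the two criteria shows up on a coin-tossing example, which is an illustration added here rather than one from the slides. Model H0 says the coin is fair; model H1 has an unknown bias with a uniform prior (an assumption). Maximum likelihood always prefers the more flexible model, whereas the evidence penalizes its extra freedom.

```python
import math

def log_max_likelihood(k, n):
    """Log-likelihood at the best-fitting bias p = k / n (Fisher's criterion)."""
    p = k / n
    out = 0.0
    if k:
        out += k * math.log(p)
    if n - k:
        out += (n - k) * math.log(1 - p)
    return out

def log_evidence_fair(k, n):
    """Model H0: fair coin, no free parameters."""
    return n * math.log(0.5)

def log_evidence_biased(k, n):
    """Model H1: unknown bias with a uniform prior, integrated out.
    Integral of p^k (1-p)^(n-k) dp over [0, 1] = k! (n-k)! / (n+1)!."""
    return math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)

for k in (11, 18):                    # heads observed in n = 20 tosses
    n = 20
    print(k,
          "ML prefers biased:", log_max_likelihood(k, n) > log_evidence_fair(k, n),
          "evidence prefers biased:", log_evidence_biased(k, n) > log_evidence_fair(k, n))
```

With 11 heads out of 20 the best-fitting bias still beats the fair coin on likelihood, but the evidence favours the simpler fair-coin model. With 18 heads both criteria favour the biased model.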
Historical outlook • 1920s to 1960s • Parametric statistics. Fisher (1912) • Asymptotic: N → ∞ • 1960s to 1980s • Non-parametric statistics. Chentsoff (1962) • Regularization of ill-posed problems. Tikhonoff (1963) • Non-asymptotic learning. Vapnik (1968) • Algorithmic complexity. Kolmogoroff (1965) • Statistical physics of disordered systems. Gardner (1988)
Statistical physics • Probability of hypothesis - microstate • Optimal model - macrostate
Free energy • F = - log Z: • Log of Sum • F = E – TS: • Sum of logs • P = P{L}
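The two forms of the free energy can be checked numerically. The sketch below assumes temperature T = 1, the Gibbs distribution P_h = exp(-L_h)/Z, and a handful of made-up costs L_h; none of these specifics come from the slides.

```python
import math

def free_energy(lengths):
    """F = -log Z with Z = sum_h exp(-L_h), computed stably (log-sum-exp)."""
    m = min(lengths)
    return m - math.log(sum(math.exp(-(L - m)) for L in lengths))

def energy_minus_entropy(lengths):
    """E - T*S at T = 1 for the Gibbs distribution P_h = exp(-L_h) / Z."""
    f = free_energy(lengths)
    probs = [math.exp(-(L - f)) for L in lengths]    # exp(-L)/Z, as Z = exp(-F)
    energy = sum(p * L for p, L in zip(probs, lengths))
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return energy - entropy

costs = [3.0, 3.5, 5.0, 9.0]     # hypothetical description lengths (in nats)
F = free_energy(costs)
print(F, energy_minus_entropy(costs))
```

Both expressions give the same number, and F is never larger than the smallest individual cost. The second form is the useful one for optimization, because it replaces the logarithm of a sum with averages of logarithms.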
EM algorithm. Main idea • Introduce independent P: • Iterations • E-step: • M-step:
EM algorithm • E-step • Estimate Posterior for given Model • M-step • Update Model for given Posterior
Bayesian regularization: Examples • Hypothesis testing • Function approximation • Data clustering
Hypothesis testing • Problem • Noisy observations: y • Is theoretical value h0 true? • Model H: • Gaussian noise • Gaussian prior
Optimal model: Phase transition • Confidence • finite • infinite
Threshold effect • Student coefficient • Hypothesis h0 is true • Corrections to h0
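The slide formulas are not recoverable from the transcript, so the following is only a sketch of how such a threshold can arise. It assumes observations y_i ~ N(h, sigma^2) with known sigma and a prior h ~ N(h0, s^2) whose variance s^2 is chosen by maximizing the evidence. Under these assumptions the sample mean is distributed as N(h0, sigma^2/N + s^2), and the optimal s^2 is zero until the deviation from h0 exceeds the standard error.

```python
def optimal_prior_variance(ys, h0, sigma):
    """Evidence-maximizing prior variance s^2 (known noise level sigma)."""
    n = len(ys)
    mean = sum(ys) / n
    return max(0.0, (mean - h0) ** 2 - sigma ** 2 / n)

def posterior_mean(ys, h0, sigma):
    """Most probable h under the evidence-optimal prior."""
    n = len(ys)
    mean = sum(ys) / n
    s2 = optimal_prior_variance(ys, h0, sigma)
    shrink = s2 / (s2 + sigma ** 2 / n)   # 0 below threshold, near 1 far above
    return h0 + shrink * (mean - h0)

ys = [1.0, 1.2, 0.8, 1.1, 0.9]            # illustrative observations, mean 1.0
print(posterior_mean(ys, h0=0.8, sigma=1.0))   # small deviation from h0
print(posterior_mean(ys, h0=0.0, sigma=1.0))   # large deviation from h0
```

Below the threshold the prior collapses onto h0 (infinite confidence) and the estimate stays at h0; above it the prior has finite width and the estimate becomes h0 plus a correction toward the sample mean.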
Function approximation • Problem • Noisy data: y(x) • Find approximation h(x) • Model: • Noise • Prior
Optimal model • Free energy minimization
Saddle point approximation • Function of best hypothesis
EM learning • E-step. Optimal hypothesis • M-step. Optimal regularization
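A minimal sketch of this alternation for the simplest possible case. The assumptions are a one-parameter model y = w * x, Gaussian noise with known precision `beta`, and a Gaussian prior w ~ N(0, 1/alpha); the slides' own model and update formulas are not available in the transcript. The E-step computes the posterior over the weight for the current regularization strength `alpha`, and the M-step re-estimates `alpha` from that posterior.

```python
import random

def em_ridge(xs, ys, beta, alpha=1.0, iters=50):
    """EM for y = w*x + noise with prior w ~ N(0, 1/alpha), known beta."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    for _ in range(iters):
        # E-step: Gaussian posterior over w for the current alpha.
        precision = alpha + beta * sxx
        mean = beta * sxy / precision
        # M-step: alpha that maximizes the expected log prior, 1 / E[w^2].
        alpha = 1.0 / (mean ** 2 + 1.0 / precision)
    return mean, alpha

rng = random.Random(1)                               # fixed seed
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [1.5 * x + rng.gauss(0, 0.3) for x in xs]       # synthetic data
w_map, alpha_opt = em_ridge(xs, ys, beta=1 / 0.3 ** 2)
print(w_map, alpha_opt)
```

The regularization strength is learned from the same data as the weight, so no separate validation set is needed in this sketch.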
Laplace Prior • Pruned weights • Equisensitive weights
Clustering • Problem • Noisy data: x • Find prototypes (mixture density approximation) • How many clusters? • Model: • Noise
Optimal model • Free energy minimization • Iterations • E-step: • M-step:
EM algorithm • E-step: • M-step:
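The update formulas are missing from the transcript, so the sketch below uses the textbook EM updates for a mixture of one-dimensional Gaussians with a fixed, known width. The fixed `sigma`, the number of clusters, the starting prototypes, and the synthetic data are all assumptions.

```python
import math
import random

def em_gmm_1d(xs, means, sigma=1.0, iters=50):
    """EM for a mixture of equal-width 1-D Gaussians with known sigma."""
    m = len(means)
    weights = [1.0 / m] * m
    for _ in range(iters):
        # E-step: responsibilities P(cluster j | x_i) under current parameters.
        resp = []
        for x in xs:
            r = [w * math.exp(-0.5 * ((x - mu) / sigma) ** 2)
                 for w, mu in zip(weights, means)]
            z = sum(r)
            resp.append([rj / z for rj in r])
        # M-step: re-estimate mixing weights and prototypes (cluster centres).
        for j in range(m):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
    return means, weights

rng = random.Random(0)                               # fixed seed
data = ([rng.gauss(-2.0, 1.0) for _ in range(200)]
        + [rng.gauss(3.0, 1.0) for _ in range(200)])
centres, mix = em_gmm_1d(data, means=[-1.0, 1.0])
print(centres, mix)
```

The E-step estimates the posterior over cluster membership for the given model, and the M-step updates the model for the given posterior. The number of clusters is fixed here; choosing it is a separate model-comparison question.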
How many clusters? • Number of clusters M() • Optimal number of clusters