
Bayesian Learning



  1. Bayesian Learning Machine Learning Chapter 6 Presenter: 김석준

  2. Bayesian Reasoning • Basic assumption • The quantities of interest are governed by probability distributions • These probabilities + observed data ==> reasoning ==> optimal decisions • Significance • The basis of algorithms that manipulate probabilities directly • e.g., the naïve Bayes classifier • A framework for analyzing algorithms that do not manipulate probabilities • e.g., cross entropy, the inductive bias of decision trees, the MDL principle

  3. Feature & Limitation • Feature of Bayesian Learning • 관측된 데이터들은 추정된 확률을 점진적으로 증감 • Prior Knowledge : P(h) , P(D|h) • Probabilistic Prediction에 응용 • multiple hypothesis의 결합에 의한 prediction • 문제점 • initial knowledge 요구 • significant computational cost

  4. Bayes Theorem • Terms • P(h) : prior probability of h • P(D) : prior probability that D will be observed • P(D|h) : probability of observing D given h (prior knowledge) • P(h|D) : posterior probability of h, given D • Theorem: P(h|D) = P(D|h)P(h) / P(D) • Machine learning: the process of finding the most probable hypothesis given the observed data

  5. Example • Medical diagnosis • P(cancer) = 0.008, P(~cancer) = 0.992 • P(+|cancer) = 0.98, P(-|cancer) = 0.02 • P(+|~cancer) = 0.03, P(-|~cancer) = 0.97 • P(cancer|+) ∝ P(+|cancer)P(cancer) = 0.0078 • P(~cancer|+) ∝ P(+|~cancer)P(~cancer) = 0.0298 • hMAP = ~cancer
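A short Python sketch of this arithmetic (the numbers are those on the slide; the variable names are only illustrative):

    p_cancer, p_not_cancer = 0.008, 0.992
    p_pos_given_cancer, p_pos_given_not_cancer = 0.98, 0.03

    # Unnormalized posteriors P(+|h)P(h) for a positive test result
    joint_cancer = p_pos_given_cancer * p_cancer              # ~0.0078
    joint_not_cancer = p_pos_given_not_cancer * p_not_cancer  # ~0.0298

    # hMAP is whichever hypothesis has the larger unnormalized posterior
    h_map = "cancer" if joint_cancer > joint_not_cancer else "~cancer"    # ~cancer

    # Normalizing gives the true posterior: P(cancer|+) is only about 0.21
    p_cancer_given_pos = joint_cancer / (joint_cancer + joint_not_cancer)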

  6. MAP hypothesis • MAP (maximum a posteriori) hypothesis • hMAP = argmax h∈H P(h|D) = argmax h∈H P(D|h)P(h) / P(D) = argmax h∈H P(D|h)P(h), since P(D) is constant over H

  7. ML hypothesis • Maximum likelihood (ML) hypothesis: hML = argmax h∈H P(D|h) • Basic assumption: every hypothesis is equally probable a priori • Basic formula • P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)

  8. Bayes Theorem and Concept Learning • Brute-force MAP learning • For each h in H, calculate P(h|D) • Output hMAP • Assumptions • Noise-free data D • The target concept c is contained in the hypothesis space H • Every hypothesis is equally probable a priori • Result • P(h|D) = 1 / |VS(H,D)| if h is consistent with D, and P(h|D) = 0 otherwise • Every consistent hypothesis is a MAP hypothesis
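A minimal Python sketch of brute-force MAP learning under these assumptions (uniform prior, noise-free data); representing hypotheses as plain functions is an assumption made only for illustration:

    def brute_force_map(hypotheses, data):
        # data is a list of (x, label) pairs; each hypothesis maps x -> label
        consistent = [h for h in hypotheses
                      if all(h(x) == label for x, label in data)]
        # P(h|D) = 1/|VS(H,D)| if h is consistent with D, 0 otherwise
        posterior = {h: (1.0 / len(consistent) if h in consistent else 0.0)
                     for h in hypotheses}
        # Any consistent hypothesis is a MAP hypothesis
        h_map = max(posterior, key=posterior.get)
        return posterior, h_map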

  9. Consistent learner • Definition: an algorithm that outputs a hypothesis with zero error over the training examples • Result • Every hypothesis it outputs is a MAP hypothesis, i.e. every consistent learner outputs a MAP hypothesis • if a uniform prior probability distribution over H is assumed • if deterministic, noise-free training data are assumed

  10. ML and LSE hypothesis • Least-squared-error hypothesis • NN, curve fitting, linear regression • Continuous-valued target function • Task: learn f, where di = f(xi) + ei • Preliminaries • Probability densities, normal distribution • Target values are mutually independent • Result: hML = argmin h∈H Σi (di − h(xi))² • Limitation: assumes noise only in the target values
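A minimal sketch of this result (Gaussian noise in the target values only); the linear hypothesis class, the data values, and numpy's least-squares solver are assumptions made for illustration:

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0])    # illustrative inputs
    d = np.array([0.1, 1.9, 4.2, 5.8])    # noisy targets d_i = f(x_i) + e_i

    # Minimizing sum_i (d_i - h(x_i))^2 gives the ML hypothesis under Gaussian noise
    A = np.column_stack([x, np.ones_like(x)])        # design matrix for h(x) = w*x + b
    (w, b), *_ = np.linalg.lstsq(A, d, rcond=None)   # least-squares fit == hML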

  11. ML hypothesis for predicting probabilities • Task: learn g, where g(x) = P(f(x)=1) • Question: what criterion should we optimize in order to find an ML hypothesis for g? • Result: maximize the cross entropy • hML = argmax h∈H Σi di ln h(xi) + (1−di) ln(1−h(xi))

  12. (BP) Gradient search to ML in NN • Let G(h,D) = cross entropy = Σi di ln h(xi) + (1−di) ln(1−h(xi)) • By gradient ascent: wjk ← wjk + η Σi (di − h(xi)) xijk
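A minimal sketch of this update for a single sigmoid output unit; the learning rate, data shapes, and variable names are assumptions for illustration:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_ascent_step(w, X, d, eta=0.1):
        # One ascent step on G(h,D): w_j <- w_j + eta * sum_i (d_i - h(x_i)) * x_ij
        h = sigmoid(X @ w)            # current outputs h(x_i)
        return w + eta * (X.T @ (d - h))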

  13. MDL principle • Purpose: interpret the MDL principle and its inductive bias in terms of Bayesian (MAP) reasoning • Shannon and Weaver's optimal code: a message with probability pi gets a code of length −log2 pi bits • hMAP = argmin h∈H [ LC1(h) + LC2(D|h) ], where C1 and C2 are optimal codes for hypotheses and for data given a hypothesis

  14. Bayes optimal classifier • Motivation: the classification of a new instance is optimized by combining the predictions of all hypotheses, weighted by their posterior probabilities • Task: find the most probable classification of the new instance given the training data • Answer: combine the predictions of all hypotheses • Bayes optimal classification: argmax vj∈V Σ hi∈H P(vj|hi) P(hi|D) • Limitation: significant computational cost ==> Gibbs algorithm
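A minimal sketch of this combination rule, assuming deterministic hypotheses so that P(vj|hi) is 1 for the label hi predicts and 0 otherwise; the names are illustrative:

    def bayes_optimal_classify(x, hypotheses, posterior, labels):
        # argmax_vj sum_hi P(vj|hi) P(hi|D); posterior[i] holds P(h_i|D)
        score = {v: 0.0 for v in labels}
        for h, p in zip(hypotheses, posterior):
            score[h(x)] += p
        return max(score, key=score.get)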

  15. Bayes optimal classifier example

  16. Gibbs algorithm • Algorithm • 1. Choose h from H at random, according to the posterior probability distribution over H • 2. Use h to predict the classification of x • Usefulness of the Gibbs algorithm • Haussler, 1994 • E[error(Gibbs algorithm)] ≤ 2 · E[error(Bayes optimal classifier)]
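A minimal sketch of the two steps above (names illustrative):

    import random

    def gibbs_classify(x, hypotheses, posterior):
        h = random.choices(hypotheses, weights=posterior, k=1)[0]  # 1. sample h ~ P(h|D)
        return h(x)                                                # 2. classify x with h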

  17. Naïve Bayes classifier • Naïve Bayes classifier: vNB = argmax vj∈V P(vj) Πi P(ai|vj) • Differences • No explicit search through H • Probabilities are estimated by counting the frequencies in the training examples • m-estimate of probability = (nc + m·p) / (n + m) • m: equivalent sample size, p: prior estimate of the probability, nc: number of examples matching both the class and the attribute value, n: number of examples matching the class
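A one-line sketch of the m-estimate:

    def m_estimate(n_c, n, p, m):
        # (n_c + m*p) / (n + m): smooths the raw frequency n_c/n toward the prior estimate p
        return (n_c + m * p) / (n + m)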

  18. Example • Instance: (outlook=sunny, temperature=cool, humidity=high, wind=strong) • P(wind=strong|PlayTennis=yes) = 3/9 = .33 • P(wind=strong|PlayTennis=no) = 3/5 = .60 • P(yes)P(sunny|yes)P(cool|yes)P(high|yes)P(strong|yes) = .0053 • P(no)P(sunny|no)P(cool|no)P(high|no)P(strong|no) = .0206 • vNB = no
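Reproducing this arithmetic in Python; the conditional probabilities not shown on the slide are assumed to be the usual PlayTennis counts (they recover the .0053 and .0206 products above):

    # P(yes)=9/14, P(sunny|yes)=2/9, P(cool|yes)=3/9, P(high|yes)=3/9, P(strong|yes)=3/9
    p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9      # ~0.0053
    # P(no)=5/14, P(sunny|no)=3/5, P(cool|no)=1/5, P(high|no)=4/5, P(strong|no)=3/5
    p_no = 5/14 * 3/5 * 1/5 * 4/5 * 3/5       # ~0.0206
    v_nb = "yes" if p_yes > p_no else "no"    # vNB = no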

  19. Bayes Belief Networks • Definition • Describes the joint probability distribution for a set of variables • Does not require all of the variables to be conditionally independent • Expresses the partial dependence relationships among the variables as probabilities • Representation: a directed acyclic graph over the variables plus a conditional probability table for each variable

  20. Bayesian Belief Networks

  21. Inference • Task: infer the probability distribution of the target variables • Methods • Exact inference: NP-hard • Approximate inference • Theoretically NP-hard • Practically useful • Monte Carlo methods
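A minimal sketch of Monte Carlo (sampling-based) approximate inference on a toy two-node network Rain → WetGrass; the network structure and all probabilities here are assumptions for illustration only:

    import random

    def sample_once():
        rain = random.random() < 0.2                       # assumed P(Rain) = 0.2
        wet = random.random() < (0.9 if rain else 0.1)     # assumed P(WetGrass|Rain)
        return rain, wet

    def p_rain_given_wet(n=100_000):
        # Estimate P(Rain | WetGrass=true) by keeping only samples where WetGrass is true
        kept = [rain for rain, wet in (sample_once() for _ in range(n)) if wet]
        return sum(kept) / len(kept)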

  22. Learning • Settings • Structure known + fully observable data • Easy: estimate the conditional probability tables as in the naïve Bayes classifier • Structure known + partially observable data • Gradient ascent procedure (Russell et al., 1995) • Analogous to searching for the ML hypothesis that maximizes P(D|h) • Structure unknown ==> next slide

  23. Learning (2) • Structure unknown • Bayesian scoring metric (Cooper & Herskovits, 1992) • K2 algorithm • Cooper & Herskovits, 1992 • Heuristic greedy search • Requires fully observed data • Constraint-based approach • Spirtes et al., 1993 • Infer dependence and independence relationships from the data • Construct the structure using these relationships

  24. EM algorithm • EM: expectation, maximization • Setting • Learning in the presence of unobserved variables • The form of the probability distribution is known • Applications • Training Bayesian belief networks • Training radial basis function networks • The basis for many unsupervised clustering algorithms • The basis for the Baum-Welch forward-backward algorithm for hidden Markov models

  25. K-means algorithm • Setting: the data are generated at random from a mixture of k normal distributions • Task: find the mean value of each distribution • A full instance is <xi, zi1, zi2>, where zij indicates whether distribution j generated xi • If z is known: estimate each mean as the sample mean of the points it generated • Otherwise: use the EM algorithm

  26. K-means algorithm • Initialize h = <μ1, μ2> arbitrarily • E-step: calculate the expected value E[zij] of each hidden variable, assuming the current hypothesis h holds • M-step: calculate a new ML hypothesis <μ1′, μ2′>, assuming each zij takes its expected value • Repeating these steps converges to a local ML hypothesis
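A minimal sketch of these two steps for data drawn from two unit-variance Gaussians; the initialization, the fixed variance, and the names are assumptions for illustration:

    import math

    def em_two_means(xs, iters=50, sigma2=1.0):
        mu = [min(xs), max(xs)]                 # arbitrary initialization of <mu1, mu2>
        for _ in range(iters):
            # E-step: E[z_ij] = relative likelihood that x_i came from distribution j
            resp = []
            for x in xs:
                w = [math.exp(-(x - m) ** 2 / (2 * sigma2)) for m in mu]
                resp.append([wj / sum(w) for wj in w])
            # M-step: mu_j <- sum_i E[z_ij] * x_i / sum_i E[z_ij]
            for j in range(2):
                mu[j] = (sum(r[j] * x for r, x in zip(resp, xs))
                         / sum(r[j] for r in resp))
        return mu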

  27. General statement of the EM algorithm • Terms • θ : the parameters of the underlying probability distribution • X : the observed data from the distributions • Z : the unobserved data • Y = X ∪ Z : the full data • h : the current hypothesis of θ • h′ : the revised hypothesis • Task: estimate θ from X

  28. Guideline • Search for the h′ that maximizes E[ln P(Y|h′)] • Under the assumption θ = h, and given the observed portion X of the full data, define the function Q(h′|h) = E[ln P(Y|h′) | h, X]

  29. EM algorithm • Estimation (E) step: calculate Q(h′|h) using the current hypothesis h and the observed data X: Q(h′|h) ← E[ln P(Y|h′) | h, X] • Maximization (M) step: replace h by the h′ that maximizes Q: h ← argmax h′ Q(h′|h) • Converges to a local maximum of the likelihood
