## Classification

Yan Pan

**Probability Theory**
• Non-negativity and unit measure: $0 \le p(y)$, $p(\Omega) = 1$, $p(\emptyset) = 0$
• Conditional probability $p(y \mid x)$: $p(x, y) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y)$
• Bayes' theorem: $p(y \mid x) = p(x \mid y)\, p(y) / p(x)$
• Marginalization: $p(x) = \int_y p(x, y)\, dy$
• Independence: $p(x_1, x_2) = p(x_1)\, p(x_2) \Leftrightarrow p(x_1 \mid x_2) = p(x_1)$
• Reference: Chris Bishop, “Pattern Recognition & Machine Learning”

**The Univariate Gaussian Density**
• $p(x \mid \mu, \sigma) = \exp\left(-(x - \mu)^2 / 2\sigma^2\right) / (2\pi\sigma^2)^{1/2}$
• [Figure: plot of the density, horizontal axis from −3 to 3]

**The Multivariate Gaussian Density**
• $p(x \mid \mu, \Sigma) = \exp\left(-\tfrac{1}{2}(x - \mu)^t \Sigma^{-1} (x - \mu)\right) / \left((2\pi)^{D/2} |\Sigma|^{1/2}\right)$

**The Beta Density**
• $p(\theta \mid a, b) = \theta^{a-1}(1 - \theta)^{b-1}\, \Gamma(a + b) / \left(\Gamma(a)\Gamma(b)\right)$

**Probability Distribution Functions**
• Bernoulli: single trial with probability of success $\theta$
  – $n \in \{0, 1\}$, $\theta \in [0, 1]$
  – $p(n \mid \theta) = \theta^n (1 - \theta)^{1-n}$
• Binomial: $N$ iid Bernoulli trials with $n$ successes
  – $n \in \{0, 1, \dots, N\}$, $\theta \in [0, 1]$
  – $p(n \mid N, \theta) = \binom{N}{n} \theta^n (1 - \theta)^{N-n}$

**A Toy Example**
• We don’t know whether a coin is fair or not. We are told that heads occurred $n$ times in $N$ coin flips.
• We are asked to predict whether the next coin flip will result in a head or a tail.
• Let $y$ be a binary random variable such that $y = 1$ represents the event that the next coin flip will be a head and $y = 0$ that it will be a tail.
• We should predict heads if $p(y=1 \mid n, N) > p(y=0 \mid n, N)$.

**The Maximum Likelihood Approach**
• Let $p(y=1 \mid n, N) = \theta$ and $p(y=0 \mid n, N) = 1 - \theta$, so that we should predict heads if $\theta > \tfrac{1}{2}$.
• How should we estimate $\theta$?
• Assuming that the observed coin flips followed a Binomial distribution, we could choose the value of $\theta$ that maximizes the likelihood of observing the data.
• $\theta_{ML} = \arg\max_\theta p(n \mid \theta) = \arg\max_\theta \binom{N}{n} \theta^n (1 - \theta)^{N-n}$
• $\quad = \arg\max_\theta\; n \log\theta + (N - n) \log(1 - \theta)$
• $\quad = n / N$
• We should predict heads if $n > \tfrac{1}{2} N$.

**The Maximum A Posteriori Approach**
• We should choose the value of $\theta$ maximizing the posterior probability of $\theta$ conditioned on the data.
• We assume a
  – Binomial likelihood: $p(n \mid \theta) = \binom{N}{n} \theta^n (1 - \theta)^{N-n}$
  – Beta prior: $p(\theta \mid a, b) = \theta^{a-1}(1 - \theta)^{b-1}\, \Gamma(a + b) / \left(\Gamma(a)\Gamma(b)\right)$
• $\theta_{MAP} = \arg\max_\theta p(\theta \mid n, a, b) = \arg\max_\theta p(n \mid \theta)\, p(\theta \mid a, b)$
• $\quad = \arg\max_\theta\; \theta^n (1 - \theta)^{N-n}\, \theta^{a-1} (1 - \theta)^{b-1}$
• $\quad = (n + a - 1) / (N + a + b - 2)$, as if we saw an extra $a - 1$ heads and $b - 1$ tails
• We should predict heads if $n > \tfrac{1}{2}(N + b - a)$.

**The Bayesian Approach**
• We should marginalize over $\theta$.
• $p(y=1 \mid n, a, b) = \int p(y=1 \mid n, \theta)\, p(\theta \mid a, b, n)\, d\theta$
• $\quad = \int \theta\, p(\theta \mid a, b, n)\, d\theta$
• $\quad = \int \theta\, \mathrm{Beta}(\theta \mid a + n,\, b + N - n)\, d\theta$
• $\quad = (n + a) / (N + a + b)$, as if we saw an extra $a$ heads and $b$ tails
• We should predict heads if $n > \tfrac{1}{2}(N + b - a)$.
• The Bayesian and MAP predictions coincide in this case.
• In the very large data limit, both the Bayesian and MAP predictions coincide with the ML prediction ($n > \tfrac{1}{2} N$).

**Approaches to Classification**
• Memorization
  – Cannot deal with previously unseen data
  – The cost of acquiring large-scale annotated data might be very high
• Rule-based expert system
  – Dependent on the competence of the expert
  – Complex problems lead to a proliferation of rules, exceptions, exceptions to exceptions, etc.
  – Rules might not transfer to similar problems
• Learning from training data and prior knowledge
  – Focuses on generalization to novel data

**Notation**
• Training data: set of $N$ labeled examples of the form $(x_i, y_i)$
• Feature vector: $x \in \mathbb{R}^D$. $X = [x_1\; x_2\; \dots\; x_N]$
• Label: $y \in \{\pm 1\}$. $y = [y_1, y_2, \dots, y_N]^t$. $Y = \mathrm{diag}(y)$
• Example: gender identification, $(x_1 = \text{[image]}, y_1 = +1)$, $(x_2 = \text{[image]}, y_2 = +1)$, $(x_3 = \text{[image]}, y_3 = +1)$, $(x_4 = \text{[image]}, y_4 = -1)$

**Binary Classification**
• [Figure: separating hyperplane $w^t x + b = 0$ with normal vector $w$ and offset $b$; the parameters are stacked as $[w; b]$]

**Machine Learning from the Optimization View**
• Before we go into the details of classification and regression methods, we should take a close look at the objective functions of machine learning.
• Machine learning means finding a rule from data by choosing the best among several candidate rules. What is the selection criterion?
• Apply each candidate rule to the training data and measure its prediction error rate. The rule with the fewest prediction errors is the one we are looking for.

**Common Form of Supervised Learning Problems**
• Minimize the following objective function: regularization term + loss function
• Regularization term: controls the model complexity and helps avoid overfitting
• Loss function: measures the quality of the learned function, i.e. the prediction error on the training data

**Ex.1 Linear Regression**
• $E(w) = \tfrac{1}{2} \sum_n (y_n - w^t x_n)^2 + \tfrac{1}{2} w^t w$

**Ex.2 Logistic Regression (classification method)**
• $E(w, b) = \tfrac{1}{2} w^t w + \sum_i \log\left(1 + \exp(-y_i (b + w^t x_i))\right)$

**Ex.3 SVM**
• $E(w) = \tfrac{1}{2} w^t w + \sum_i \max(0,\, 1 - y_i w^t x_i)$
• Or
• $E(w) = \tfrac{1}{2} w^t w + \sum_i \max(0,\, 1 - y_i w^t x_i)^2$

**How to measure error?**
• True: $y_i$
• Predicted: $w^t x_i$
• The closer the two are, the better. Should they be exactly equal?
  – Indicator of a mismatch: $I(y_i \ne w^t x_i)$
  – Squared difference: $(y_i - w^t x_i)^2$
• Assuming both take values in $[-1, 1]$: make the product as large as possible
  – $y_i\, w^t x_i$

**Approximate the Zero-One Loss**
• Squared Error
• Exponential Loss
• Logistic Loss
• Hinge Loss
• Sigmoid Loss

**Regularized Logistic Regression**
• Zhu & Hastie, “KLR and the Import Vector Machine”, NIPS 01

**Convex Functions**
• Convex $f$: $f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$
• The Hessian $\nabla^2 f$ is always positive semi-definite
• The tangent is always a lower bound to $f$

**Gradient Descent**
• Iteration: $x_{n+1} = x_n - \eta_n \nabla f(x_n)$
• Step size selection: Armijo rule
• Stopping criterion: change in $f$ is “minuscule”

**Gradient Descent – Logistic Regression**
• $E(w, b) = \tfrac{1}{2} w^t w + \sum_i \log\left(1 + \exp(-y_i (b + w^t x_i))\right)$
• $\nabla_w E(w, b) = w - \sum_i p(-y_i \mid x_i, w)\, y_i x_i$
• $\nabla_b E(w, b) = -\sum_i p(-y_i \mid x_i, w)\, y_i$
• Beware of numerical issues while coding!
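
The warning about numerical issues can be made concrete. The following is a minimal Python sketch, not the author's implementation: it evaluates the regularized logistic objective and its two gradients using overflow-safe forms of log(1 + exp(z)) and the sigmoid, then runs gradient descent with a backtracking step-size search that enforces the Armijo sufficient-decrease condition. The function names, the toy data, the constant `c = 1e-4`, and the step-halving schedule are assumptions chosen for illustration.

```python
import math


def log1p_exp(z):
    """log(1 + exp(z)), rearranged so exp() never sees a large positive argument."""
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))


def sigmoid(z):
    """1 / (1 + exp(-z)), rearranged so exp() never sees a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def objective_and_gradient(theta, xs, ys):
    """Value and gradient of 0.5 w.w + sum_i log(1 + exp(-y_i (b + w.x_i))).

    theta holds w followed by b; the gradient uses the same layout.
    """
    w, b = theta[:-1], theta[-1]
    value = 0.5 * sum(wj * wj for wj in w)
    grad = list(w) + [0.0]  # the regularizer contributes w to grad_w and nothing to grad_b
    for x, y in zip(xs, ys):
        margin = y * (b + sum(wj * xj for wj, xj in zip(w, x)))
        value += log1p_exp(-margin)
        p_wrong = sigmoid(-margin)  # p(-y_i | x_i, w)
        for j, xj in enumerate(x):
            grad[j] -= p_wrong * y * xj
        grad[-1] -= p_wrong * y
    return value, grad


def gradient_descent(xs, ys, dim, tol=1e-10, max_iter=10000, c=1e-4):
    """Gradient descent with a backtracking (Armijo) step-size search."""
    theta = [0.0] * (dim + 1)
    value, grad = objective_and_gradient(theta, xs, ys)
    for _ in range(max_iter):
        grad_sq = sum(g * g for g in grad)
        step = 1.0
        while True:
            trial = [t - step * g for t, g in zip(theta, grad)]
            trial_value, trial_grad = objective_and_gradient(trial, xs, ys)
            # Armijo condition: accept the step once the decrease is large enough.
            if trial_value <= value - c * step * grad_sq or step < 1e-12:
                break
            step *= 0.5
        if trial_value >= value:  # no further progress at this precision
            break
        decrease = value - trial_value
        theta, value, grad = trial, trial_value, trial_grad
        if decrease <= tol:  # stopping criterion: change in f is minuscule
            break
    return theta, value


# Hypothetical toy data: two positive and two negative points in the plane.
toy_xs = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
toy_ys = [1, 1, -1, -1]
theta_hat, final_value = gradient_descent(toy_xs, toy_ys, dim=2)
print("w =", theta_hat[:-1], "b =", theta_hat[-1], "objective =", final_value)
```

Evaluating `exp(-margin)` directly would raise an overflow error for a badly misclassified point with a large negative margin, which is why both helpers branch on the sign of their argument.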

**Gradient Descent Algorithm**
• Input: $x_0$, objective $f(x)$, tolerance $e$, maximum number of iterations $T$ (e.g. 100000)
• Output: x_star that minimizes $f(x)$
• t = 0
• while (t == 0 || (f(x_{t-1}) − f(x_t) > e && t < T)) {
•   g_t = gradient of f(x) at x_t
•   for (i = 10; i >= −6; i--) {
•     s = 2^i
•     x_{t+1} = x_t − s * g_t
•     if (f(x_{t+1}) < f(x_t)) break;
•   }
•   t++;
• }
• Output x_t

**Newton Methods**
• Iteration: $x_{n+1} = x_n - \eta_n H^{-1} \nabla f(x_n)$
• Approximate $f$ by a 2nd order Taylor expansion
• The error can now decrease quadratically

**Newton Descent Algorithm**
• Input: $x_0$, objective $f(x)$, tolerance $e$, maximum number of iterations $T$ (e.g. 10)
• Output: x_star that minimizes $f(x)$
• t = 0
• while (t == 0 || (f(x_{t-1}) − f(x_t) > e && t < T)) {
•   g_t = gradient of f(x) at x_t
•   h_t = Hessian matrix of f(x) at x_t
•   s = inverse matrix of h_t
•   x_{t+1} = x_t − s * g_t
•   t++;
• }
• Output x_t

**Quasi-Newton Methods**
• Computing and inverting the Hessian is expensive
• Quasi-Newton methods can approximate $H^{-1}$ directly (LBFGS)
• Iteration: $x_{n+1} = x_n - \eta_n B_n^{-1} \nabla f(x_n)$
• Secant equation: $\nabla f(x_{n+1}) - \nabla f(x_n) = B_{n+1}(x_{n+1} - x_n)$
• The secant equation does not fully determine $B$
• LBFGS updates $B_{n+1}^{-1}$ using two rank one matrices

**Bayes’ Decision Rule**
• $p(y=+1 \mid x) > p(y=-1 \mid x)$ ? $y = +1$ : $y = -1$
• $p(y=+1 \mid x) > \tfrac{1}{2}$ ? $y = +1$ : $y = -1$

**Bayesian Approach**
• $p(y \mid x, X, Y) = \int_f p(y, f \mid x, X, Y)\, df$
• $\quad = \int_f p(y \mid f, x, X, Y)\, p(f \mid x, X, Y)\, df$
• $\quad = \int_f p(y \mid f, x)\, p(f \mid X, Y)\, df$
• This integral is often intractable.
• To solve it we can
  – Choose the distributions so that the solution is analytic (conjugate priors)
  – Approximate the true distribution $p(f \mid X, Y)$ by a simpler distribution (variational methods)
  – Sample from $p(f \mid X, Y)$ (MCMC)

**Maximum A Posteriori (MAP)**
• $p(y \mid x, X, Y) = \int_f p(y \mid f, x)\, p(f \mid X, Y)\, df$
• $\quad = p(y \mid f_{MAP}, x)$ when $p(f \mid X, Y) = \delta(f - f_{MAP})$
• The more training data there is, the better $p(f \mid X, Y)$ approximates a delta function.
• We can make predictions using a single function, $f_{MAP}$, and our focus shifts to estimating $f_{MAP}$.
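
The claim that the posterior approaches a delta function as data accumulates can be checked on the coin example, where the posterior over θ is Beta(a + n, b + N − n). The sketch below is illustrative only: the function name, the prior a = b = 2, and the assumed 70% heads rate are hypothetical choices, and the mean, mode, and variance formulas are the standard ones for a Beta distribution.

```python
import math


def beta_posterior_stats(n, N, a, b):
    """Mean, mode, and standard deviation of the posterior Beta(a + n, b + N - n)."""
    alpha, beta = a + n, b + N - n
    total = alpha + beta
    mean = alpha / total                # the Bayesian prediction (n + a) / (N + a + b)
    mode = (alpha - 1) / (total - 2)    # the MAP estimate; valid when alpha > 1 and beta > 1
    std = math.sqrt(alpha * beta / (total ** 2 * (total + 1)))
    return mean, mode, std


for N in (10, 100, 1000, 10000):
    n = 7 * N // 10  # hypothetical data: 70% of the flips are heads
    mean, mode, std = beta_posterior_stats(n, N, a=2.0, b=2.0)
    print(f"N={N:6d}  mode={mode:.4f}  mean={mean:.4f}  std={std:.4f}")
```

The standard deviation shrinks roughly like one over the square root of N, so the posterior mass concentrates around its mode and the integral over θ is increasingly well approximated by plugging in the single MAP value.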

**MAP & Maximum Likelihood (ML)**
• $f_{MAP} = \arg\max_f p(f \mid X, Y)$
• $\quad = \arg\max_f p(X, Y \mid f)\, p(f) / p(X, Y)$
• $\quad = \arg\max_f p(X, Y \mid f)\, p(f)$
• $f_{ML} = \arg\max_f p(X, Y \mid f)$ (Maximum Likelihood)
• Maximum Likelihood holds if
  – There is a lot of training data, so that $p(X, Y \mid f) \gg p(f)$
  – Or there is no prior knowledge, so that $p(f)$ is uniform (improper)

**IID Data**
• $f_{ML} = \arg\max_f p(X, Y \mid f)$
• $\quad = \arg\max_f \prod_i p(x_i, y_i \mid f)$
• The independent and identically distributed assumption holds only if we know everything about the joint distribution of the features and labels.
• In particular, $p(X, Y) \ne \prod_i p(x_i, y_i)$
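
The relationship between the ML, MAP, and Bayesian estimates can be illustrated with the coin example from earlier in the deck, using the closed forms n / N, (n + a − 1) / (N + a + b − 2), and (n + a) / (N + a + b). This is a small sketch under assumed values: the function name and the prior a = 5, b = 2 (a prior belief that favors heads) are hypothetical.

```python
def coin_estimates(n, N, a, b):
    """Return the (ML, MAP, Bayesian) estimates of the probability of heads."""
    ml = n / N
    map_est = (n + a - 1) / (N + a + b - 2)
    bayes = (n + a) / (N + a + b)
    return ml, map_est, bayes


# Same observed heads rate (40%), with little data and with a lot of data.
for n, N in ((4, 10), (4000, 10000)):
    ml, map_est, bayes = coin_estimates(n, N, a=5.0, b=2.0)
    predictions = ["heads" if est > 0.5 else "tails" for est in (ml, map_est, bayes)]
    print(f"N={N}: ML={ml:.4f} MAP={map_est:.4f} Bayes={bayes:.4f} -> {predictions}")
```

With only ten flips the prior is strong enough to flip the MAP and Bayesian predictions to heads while ML predicts tails. With ten thousand flips all three estimates nearly coincide, which is the large-data condition under which maximum likelihood is a reasonable substitute for MAP.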