This comprehensive guide delves into classification, focusing on the nuances of Bayesian methods and maximum likelihood estimation (MLE). It covers core concepts such as underfitting, overfitting, conditional probabilities, and Bayes' Theorem. The principles of marginalization, the Univariate and Multivariate Gaussian densities, and the Beta density for Bernoulli trials are introduced. Furthermore, the text explains the Maximum A Posteriori (MAP) estimation and how to predict outcomes based on previously observed data. Key learning areas include regularization, error measurement, and practical applications in binary classification.
Classification Yan Pan
Non-negativity and unit measure • 0 ≤ p(y) ≤ 1, p(Ω) = 1, p(∅) = 0 • Conditional probability – p(y|x) • p(x, y) = p(y|x) p(x) = p(x|y) p(y) • Bayes’ Theorem • p(y|x) = p(x|y) p(y) / p(x) • Marginalization • p(x) = ∫ p(x, y) dy • Independence • p(x1, x2) = p(x1) p(x2) ⇔ p(x1|x2) = p(x1) • Chris Bishop, “Pattern Recognition & Machine Learning” Probability Theory
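These rules can be checked numerically on a small discrete joint distribution; a minimal sketch (the 2×2 joint table below is a made-up example, not from the slides):

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary variables.
p_xy = np.array([[0.3, 0.1],   # rows index x, columns index y
                 [0.2, 0.4]])

p_x = p_xy.sum(axis=1)                  # marginalization: p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)                  # p(y) = sum_x p(x, y)
p_y_given_x = p_xy / p_x[:, None]       # conditional: p(y|x) = p(x, y) / p(x)

# Product rule: p(x, y) = p(y|x) p(x)
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)

# Bayes' theorem: p(x|y) = p(y|x) p(x) / p(y)
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]
assert np.allclose(p_x_given_y, p_xy / p_y[None, :])

# Unit measure: the whole table sums to 1
assert np.isclose(p_xy.sum(), 1.0)
```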
p(x|μ,σ) = exp(−(x − μ)^2 / 2σ^2) / (2πσ^2)^{1/2} The Univariate Gaussian Density
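A direct transcription of the density into code; a minimal sketch (the function name is mine):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density p(x | mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

# The density peaks at the mean: p(mu | mu, sigma) = 1 / sqrt(2*pi*sigma^2)
print(gaussian_pdf(0.0, 0.0, 1.0))  # ~0.3989
```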
p(x|μ,Σ) = exp(−½ (x − μ)^t Σ^{-1} (x − μ)) / ((2π)^{D/2} |Σ|^{1/2}) The Multivariate Gaussian Density
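The same density in D dimensions, assuming Σ is positive definite; a sketch using numpy (solving Σz = x − μ rather than forming Σ^{-1} explicitly):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density p(x | mu, Sigma) for x, mu in R^D."""
    D = x.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)      # (x - mu)^t Sigma^{-1} (x - mu)
    norm = (2.0 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

print(mvn_pdf(np.zeros(2), np.zeros(2), np.eye(2)))  # ~0.1592 = 1 / (2*pi)
```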
p(θ|a,b) = θ^{a−1} (1 − θ)^{b−1} Γ(a+b) / (Γ(a)Γ(b)) The Beta Density
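A sketch of the Beta density using the standard-library gamma function (function name is mine):

```python
from math import gamma

def beta_pdf(theta, a, b):
    """Beta density p(theta | a, b) on [0, 1]."""
    return theta ** (a - 1) * (1.0 - theta) ** (b - 1) * gamma(a + b) / (gamma(a) * gamma(b))

# Beta(1, 1) is uniform on [0, 1]; Beta(2, 2) peaks at theta = 0.5.
print(beta_pdf(0.5, 1, 1))  # 1.0
print(beta_pdf(0.5, 2, 2))  # 1.5
```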
Bernoulli: Single trial with probability of success = θ • n ∈ {0, 1}, θ ∈ [0, 1] • p(n|θ) = θ^n (1 − θ)^{1−n} • Binomial: N iid Bernoulli trials with n successes • n ∈ {0, 1, …, N}, θ ∈ [0, 1] • p(n|N,θ) = NCn θ^n (1 − θ)^{N−n} Probability Distribution Functions
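Both mass functions are one-liners; a sketch (uses math.comb, available in Python 3.8+):

```python
from math import comb

def bernoulli_pmf(n, theta):
    """p(n | theta) for a single trial, n in {0, 1}."""
    return theta ** n * (1.0 - theta) ** (1 - n)

def binomial_pmf(n, N, theta):
    """p(n | N, theta): probability of n successes in N iid Bernoulli trials."""
    return comb(N, n) * theta ** n * (1.0 - theta) ** (N - n)

print(binomial_pmf(3, 10, 0.5))  # C(10,3) / 2^10 ~ 0.1172
```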
We don’t know whether a coin is fair or not. We are told that heads occurred n times in N coin flips. • We are asked to predict whether the next coin flip will result in a head or a tail. • Let y be a binary random variable such that y = 1 represents the event that the next coin flip will be a head and y = 0 that it will be a tail • We should predict heads if p(y=1|n,N) > p(y=0|n,N) A Toy Example
Let p(y=1|n,N) = θ and p(y=0|n,N) = 1 − θ, so that we should predict heads if θ > ½ • How should we estimate θ? • Assuming that the observed coin flips followed a Binomial distribution, we could choose the value of θ that maximizes the likelihood of observing the data • θ_ML = argmax_θ p(n|θ) = argmax_θ NCn θ^n (1 − θ)^{N−n} • = argmax_θ [n log(θ) + (N − n) log(1 − θ)] • = n / N • We should predict heads if n > ½ N The Maximum Likelihood Approach
We should choose the value of θ maximizing the posterior probability of θ conditioned on the data • We assume a • Binomial likelihood : p(n|θ) = NCn θ^n (1 − θ)^{N−n} • Beta prior : p(θ|a,b) = θ^{a−1} (1 − θ)^{b−1} Γ(a+b) / (Γ(a)Γ(b)) • θ_MAP = argmax_θ p(θ|n,a,b) = argmax_θ p(n|θ) p(θ|a,b) • = argmax_θ θ^n (1 − θ)^{N−n} θ^{a−1} (1 − θ)^{b−1} • = (n + a − 1) / (N + a + b − 2), as if we saw an extra a − 1 heads & b − 1 tails • We should predict heads if n > ½ (N + b − a) The Maximum A Posteriori Approach
We should marginalize over θ • p(y=1|n,a,b) = ∫ p(y=1|θ) p(θ|a,b,n) dθ • = ∫ θ p(θ|a,b,n) dθ • = ∫ θ Beta(θ|a + n, b + N − n) dθ • = (n + a) / (N + a + b), as if we saw an extra a heads & b tails • We should predict heads if n > ½ (N + b − a) • The Bayesian and MAP prediction coincide in this case • In the very large data limit, both the Bayesian and MAP prediction coincide with the ML prediction (n > ½ N) The Bayesian Approach
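A minimal sketch comparing the three estimates for the coin example (the counts and Beta hyperparameters below are made up for illustration):

```python
def coin_estimates(n, N, a, b):
    """Probability of heads under the ML, MAP, and Bayesian treatments."""
    theta_ml = n / N                            # maximum likelihood
    theta_map = (n + a - 1) / (N + a + b - 2)   # posterior mode with a Beta(a, b) prior
    p_bayes = (n + a) / (N + a + b)             # posterior predictive p(y=1 | n, a, b)
    return theta_ml, theta_map, p_bayes

# Hypothetical data: 7 heads in 10 flips, with a Beta(2, 2) prior.
ml, map_, bayes = coin_estimates(7, 10, 2, 2)
print(ml, map_, bayes)   # 0.7, 0.666..., 0.642... -- all three predict heads
```

As N grows with n/N fixed, the three numbers converge, matching the large-data remark above.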
Memorization • Cannot deal with previously unseen data • The cost of acquiring large-scale annotated data might be very high • Rule-based expert system • Depends on the competence of the expert • Complex problems lead to a proliferation of rules, exceptions, exceptions to exceptions, etc. • Rules might not transfer to similar problems • Learning from training data and prior knowledge • Focuses on generalization to novel data Approaches to Classification
Training Data • Set of N labeled examples of the form (xi, yi) • Feature vector – xi ∈ R^D. X = [x1 x2 … xN] • Label – yi ∈ {±1}. y = [y1, y2, … yN]^t. Y = diag(y) • Example – Gender Identification: (x1 = [face image], y1 = +1), (x2 = [face image], y2 = +1), (x3 = [face image], y3 = +1), (x4 = [face image], y4 = −1) Notation
Binary Classification • Separating hyperplane: w^t x + b = 0, with normal w and bias b • Parameters collected as [w; b]
Machine Learning from the Optimization View • Before we go into the details of classification and regression methods, we should take a close look at the objective functions of machine learning • Machine learning: find patterns in data (pick the best out of many candidate patterns). What is the selection criterion? • Apply each candidate pattern to the training data and measure its prediction error; the pattern with the fewest prediction errors is the one we want.
Common Form of Supervised Learning Problems • Minimize the following objective function • Regularization term + Loss function • Regularization term: controls the model complexity and avoids overfitting • Loss function: measures the quality of the learned function, i.e., the prediction error on the training data.
Ex.1 Linear Regression • E(w) = ½ Σ_n (y_n − w^t x_n)^2 + ½ w^t w
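Setting the gradient of E(w) to zero gives a closed-form solution, (Σ_n x_n x_n^t + λI) w = Σ_n y_n x_n. A sketch using the X = [x1 … xN] convention from the notation slide; λ is an explicit regularization weight here (the slide's objective corresponds to λ = 1), and the data is random:

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """Minimize 0.5 * sum_n (y_n - w^t x_n)^2 + 0.5 * lam * w^t w.

    X has one column per example (D x N), y has length N.
    """
    D = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))            # 50 examples in R^3
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X + 0.1 * rng.normal(size=50)
print(ridge_regression(X, y))           # close to w_true
```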
Ex.2 Logistic Regression (a classification method) • L(w, b) = ½ w^t w + Σ_i log(1 + exp(−y_i (b + w^t x_i)))
Ex.3 SVM • E(w) = ½ w^t w + Σ_i max(0, 1 − y_i w^t x_i) • or • E(w) = ½ w^t w + Σ_i max(0, 1 − y_i w^t x_i)^2
How to measure error? • True label: y_i • Predicted: w^t x_i • The closer the prediction is to the label, the better. Exactly equal? • I(y_i ≠ w^t x_i) • (y_i − w^t x_i)^2 • Assuming the values lie in [−1, 1]: make the product y_i w^t x_i as large as possible
Approximate the Zero-One Loss • Squared Error • Exponential Loss • Logistic Loss • Hinge Loss • Sigmoid Loss
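All of these can be written as functions of the margin m = y_i · w^t x_i, so that the zero-one loss penalizes m ≤ 0 and the surrogates upper-bound or smooth it. A sketch (exact scalings vary between textbooks):

```python
import math

def zero_one(m):    return 1.0 if m <= 0 else 0.0
def squared(m):     return (1.0 - m) ** 2             # squared error on the margin
def exponential(m): return math.exp(-m)               # used by AdaBoost
def logistic(m):    return math.log(1.0 + math.exp(-m))
def hinge(m):       return max(0.0, 1.0 - m)          # used by the SVM
def sigmoid(m):     return 1.0 / (1.0 + math.exp(m))  # a bounded, non-convex surrogate

for m in (-2.0, 0.0, 2.0):
    print(m, [round(f(m), 3) for f in (zero_one, squared, exponential, logistic, hinge, sigmoid)])
```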
Zhu & Hastie, “KLR and the Import Vector Machine”, NIPS 01 Regularized Logistic Regression
Convex f : f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2) for λ ∈ [0, 1] • The Hessian ∇^2 f is always positive semi-definite • The tangent is always a lower bound to f Convex Functions
Iteration : x_{n+1} = x_n − η_n ∇f(x_n) • Step size selection : Armijo rule • Stopping criterion : change in f is “minuscule” Gradient Descent
L(w, b) = ½ w^t w + Σ_i log(1 + exp(−y_i (b + w^t x_i))) • ∇_w L(w, b) = w − Σ_i p(−y_i|x_i,w) y_i x_i • ∇_b L(w, b) = −Σ_i p(−y_i|x_i,w) y_i • Beware of numerical issues while coding! Gradient Descent – Logistic Regression
Gradient Descent Algorithm • Input: x_0, objective f(x), tolerance e, max iterations T • Output: x* that minimizes f(x) • t = 0 • while (t == 0 || (f(x_{t-1}) − f(x_t) > e && t < T)) { • g_t = gradient of f(x) at x_t • for (i = 10; i >= -6; i--) {  // backtracking search over step sizes 2^i • s = 2^i • x_{t+1} = x_t − s * g_t • if (f(x_{t+1}) < f(x_t)) • break; • } • t++; • } • Output x_t
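Putting the gradient formulas and the descent loop together for regularized logistic regression; a minimal sketch (function names and the default step size are mine, the ½w^t w regularizer corresponds to lam = 1, and a fixed step is used for brevity where the pseudocode above uses a backtracking search):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))   # clip to avoid exp overflow

def logreg_gd(X, y, lam=1.0, step=0.1, iters=500):
    """Gradient descent on 0.5*lam*w^t w + sum_i log(1 + exp(-y_i (b + w^t x_i))).

    X is D x N (one column per example), y is in {-1, +1}^N.
    """
    D, N = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(iters):
        margins = y * (w @ X + b)              # y_i (b + w^t x_i)
        p_wrong = sigmoid(-margins)            # p(-y_i | x_i, w, b)
        grad_w = lam * w - X @ (p_wrong * y)   # lam*w - sum_i p(-y_i|x_i) y_i x_i
        grad_b = -np.sum(p_wrong * y)
        w -= step * grad_w
        b -= step * grad_b
    return w, b
```

The clipped sigmoid and the use of p(−y_i|x_i) rather than raw exponentials are one way to handle the numerical issues flagged on the previous slide.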
Iteration : x_{n+1} = x_n − η_n H^{-1} ∇f(x_n) • Approximate f by a 2nd order Taylor expansion • The error can now decrease quadratically Newton Methods
Newton Descent Algorithm • Input: x_0, objective f(x), tolerance e, max iterations T • Output: x* that minimizes f(x) • t = 0 • while (t == 0 || (f(x_{t-1}) − f(x_t) > e && t < T)) { • g_t = gradient of f(x) at x_t • h_t = Hessian matrix of f(x) at x_t • s = inverse matrix of h_t • x_{t+1} = x_t − s * g_t • t++; • } • Output x_t
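For the regularized logistic regression objective, the Hessian is λI + Σ_i p(−y_i|x_i)(1 − p(−y_i|x_i)) x_i x_i^t, so a Newton step can reuse the gradient from before. A sketch (bias term dropped for brevity; a handful of iterations usually suffices, which is why the loop bound is small):

```python
import numpy as np

def logreg_newton(X, y, lam=1.0, iters=10):
    """Newton's method for 0.5*lam*w^t w + sum_i log(1 + exp(-y_i w^t x_i))."""
    D, N = X.shape
    w = np.zeros(D)
    for _ in range(iters):
        p_wrong = 1.0 / (1.0 + np.exp(y * (w @ X)))    # p(-y_i | x_i, w)
        grad = lam * w - X @ (p_wrong * y)
        s = p_wrong * (1.0 - p_wrong)                  # per-example curvature
        hess = lam * np.eye(D) + (X * s) @ X.T         # lam*I + sum_i s_i x_i x_i^t
        w -= np.linalg.solve(hess, grad)               # Newton step: w - H^{-1} grad
    return w
```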
Computing and inverting the Hessian is expensive • Quasi-Newton methods can approximate H^{-1} directly (LBFGS) • Iteration : x_{n+1} = x_n − η_n B_n^{-1} ∇f(x_n) • Secant equation : ∇f(x_{n+1}) − ∇f(x_n) = B_{n+1}(x_{n+1} − x_n) • The secant equation does not fully determine B • LBFGS updates B_{n+1}^{-1} using two rank-one matrices Quasi-Newton Methods
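In practice the LBFGS update is rarely hand-coded; scipy exposes an implementation that only needs the objective and its gradient. A sketch, minimizing a simple quadratic as a stand-in for the losses above:

```python
import numpy as np
from scipy.optimize import minimize

# A strictly convex test objective, f(x) = 0.5 * ||x - c||^2, with its gradient.
c = np.array([1.0, -2.0, 3.0])
f = lambda x: 0.5 * np.sum((x - c) ** 2)
grad = lambda x: x - c

result = minimize(f, x0=np.zeros(3), jac=grad, method="L-BFGS-B")
print(result.x)  # converges to c
```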
Bayes’ decision rule • p(y=+1|x) > p(y=-1|x) ? y = +1 : y = -1 • p(y=+1|x) > ½ ? y = +1 : y = -1 Bayes’ Decision Rule
p(y|x,X,Y) = ∫ p(y,f|x,X,Y) df • = ∫ p(y|f,x,X,Y) p(f|x,X,Y) df • = ∫ p(y|f,x) p(f|X,Y) df • This integral is often intractable. • To solve it we can • Choose the distributions so that the solution is analytic (conjugate priors) • Approximate the true distribution p(f|X,Y) by a simpler distribution (variational methods) • Sample from p(f|X,Y) (MCMC) Bayesian Approach
p(y|x,X,Y) = ∫ p(y|f,x) p(f|X,Y) df • = p(y|f_MAP,x) when p(f|X,Y) = δ(f − f_MAP) • The more training data there is, the better p(f|X,Y) approximates a delta function • We can make predictions using a single function, f_MAP, and our focus shifts to estimating f_MAP. Maximum A Posteriori (MAP)
f_MAP = argmax_f p(f|X,Y) • = argmax_f p(X,Y|f) p(f) / p(X,Y) • = argmax_f p(X,Y|f) p(f) • f_ML = argmax_f p(X,Y|f) (Maximum Likelihood) • The Maximum Likelihood approximation f_MAP ≈ f_ML holds if • There is a lot of training data, so that • p(X,Y|f) >> p(f) • Or if there is no prior knowledge, so that p(f) is uniform (improper) MAP & Maximum Likelihood (ML)
f_ML = argmax_f p(X,Y|f) • = argmax_f Π_i p(x_i,y_i|f) • The independent and identically distributed assumption holds only if we know everything about the joint distribution of the features and labels. • In particular, p(X,Y) ≠ Π_i p(x_i,y_i) IID Data