An Introduction to Support Vector Machine (SVM)

Presenter: Ahey • Date: 2007/07/20 • The slides are based on lecture notes of Prof. 林智仁 (Chih-Jen Lin) and Daniel Yeung

Outline • Background • Linearly Separable SVM • Lagrange Multiplier Method • Karush-Kuhn-Tucker (KKT) Conditions • Non-linear SVM: Kernel • Non-Separable SVM • libsvm

Background – Classification Problem • The goal of classification is to organize and categorize data into distinct classes • A model is first created based on the previous data (training samples) • This model is then used to classify new data (unseen samples) • A sample is characterized by a set of features • Classification is essentially finding the best boundary between classes

Background – Classification Problem • Applications: • Personal Identification • Credit Rating • Medical Diagnosis • Text Categorization • Denial of Service Detection • Character recognition • Biometrics • Image classification

Classification Formulation • Given • an input space $\mathcal{X}$ • a set of classes $\Omega = \{\omega_1, \omega_2, \dots, \omega_c\}$ • the Classification Problem is • to define a mapping $f: \mathcal{X} \to \Omega$ where each $x \in \mathcal{X}$ is assigned to one class • This mapping function is called a Decision Function

Decision Function • The basic problem in classification is to find $c$ decision functions $d_1(x), d_2(x), \dots, d_c(x)$ with the property that, if a pattern $x$ belongs to class $i$, then $d_i(x) > d_j(x)$ for all $j \ne i$ • $d_i(x)$ is some similarity measure between $x$ and class $i$, such as a distance or a probability

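Decision Function • Example (figure): three regions separated by the boundaries $d_1 = d_2$, $d_1 = d_3$ and $d_2 = d_3$ • Class 1 where $d_2, d_3 < d_1$ • Class 2 where $d_1, d_3 < d_2$ • Class 3 where $d_1, d_2 < d_3$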
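The rule pictured on this slide (assign x to the class whose decision function is largest) can be sketched in a few lines of Python. The three linear functions below are made-up illustrations, not the ones behind the original figure.

```python
# Hypothetical linear decision functions d_i(x) for three classes (illustrative only).
DECISION_FUNCTIONS = {
    1: lambda x: 1.0 * x[0] + 0.0 * x[1],    # d1
    2: lambda x: -1.0 * x[0] + 1.0 * x[1],   # d2
    3: lambda x: -1.0 * x[0] - 1.0 * x[1],   # d3
}

def classify(x):
    """Return the class label i whose decision function d_i(x) is largest."""
    return max(DECISION_FUNCTIONS, key=lambda label: DECISION_FUNCTIONS[label](x))

print(classify((2.0, 0.5)))    # d1 = 2.0, d2 = -1.5, d3 = -2.5
print(classify((-1.0, 3.0)))   # d1 = -1.0, d2 = 4.0, d3 = -2.0
print(classify((-1.0, -3.0)))  # d1 = -1.0, d2 = -2.0, d3 = 4.0
```

The boundaries between regions are exactly the points where two of the functions tie, which is why the figure labels them d1 = d2, d1 = d3 and d2 = d3.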

Single Classifier • Most popular single classifiers: • Minimum Distance Classifier • Bayes Classifier • K-Nearest Neighbor • Decision Tree • Neural Network • Support Vector Machine

Minimum Distance Classifier • Simplest approach to the selection of decision boundaries • Each class is represented by a prototype (or mean) vector: $m_j = \frac{1}{N_j} \sum_{x \in \omega_j} x$, $j = 1, \dots, c$, where $N_j$ = the number of pattern vectors from class $\omega_j$ • A new unlabelled sample is assigned to the class whose prototype is closest to the sample
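A minimal sketch of the minimum distance classifier, assuming Euclidean distance and a tiny made-up training set. The function names `class_means` and `classify` are illustrative choices, not from the slides.

```python
import math

def class_means(samples):
    """samples: dict mapping class label -> list of feature vectors (tuples)."""
    means = {}
    for label, vectors in samples.items():
        n = len(vectors)          # number of pattern vectors in this class
        dim = len(vectors[0])
        means[label] = tuple(sum(v[k] for v in vectors) / n for k in range(dim))
    return means

def classify(x, means):
    """Assign x to the class whose prototype (mean) is closest in Euclidean distance."""
    return min(means, key=lambda label: math.dist(x, means[label]))

# Made-up training data for two classes.
training = {
    "A": [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)],
    "B": [(6.0, 5.0), (7.0, 6.0), (6.0, 7.0)],
}
means = class_means(training)
print(means)
print(classify((2.0, 2.0), means))
print(classify((5.0, 5.0), means))
```

Only the class means are stored after training, so the decision boundary between two classes is the perpendicular bisector of the segment joining their prototypes.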

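Bayes Classifier • Bayes rule: $P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}$ • $p(x)$ is the same for each class, therefore • Assign $x$ to class $j$ if $p(x \mid \omega_j)\, P(\omega_j) > p(x \mid \omega_i)\, P(\omega_i)$ • for all $i \ne j$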

Bayes Classifier • The following information must be known: • The probability density functions of the patterns in each class • The probability of occurrence of each class • Training samples may be used to obtain estimations on these probability functions • Samples assumed to follow a known distribution pattern
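As a sketch of the Bayes decision rule, the example below assumes one-dimensional Gaussian class-conditional densities with made-up priors and parameters. In practice these would be estimated from training samples, as the slide notes.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Assumed (made-up) class models: prior P(class) and Gaussian parameters of p(x | class).
CLASSES = {
    "low":  {"prior": 0.7, "mean": 0.0, "std": 1.0},
    "high": {"prior": 0.3, "mean": 3.0, "std": 1.0},
}

def bayes_classify(x):
    """Pick the class maximizing p(x | class) * P(class); p(x) is a common factor."""
    def score(label):
        c = CLASSES[label]
        return gaussian_pdf(x, c["mean"], c["std"]) * c["prior"]
    return max(CLASSES, key=score)

for x in (0.0, 1.6, 2.0, 3.0):
    print(x, bayes_classify(x))
```

With equal priors the boundary would sit midway between the two means at 1.5. The larger prior of "low" pushes it to roughly 1.78, so 1.6 is still assigned to "low".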

K-Nearest Neighbor • K-Nearest Neighbor Rule (k-NNR) • Examine the labels of the k nearest samples and classify by using a majority voting scheme • (Figure: scatter plot with axes from 0 to 10, the point (7, 3), and neighborhoods for 1NN, 3NN, 5NN, 7NN and 9NN)
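A minimal k-NN sketch, assuming Euclidean distance. The training points and labels are made up. The query (7, 3) reuses the coordinates that appear on the slide, on the assumption that they mark the point being classified.

```python
import math
from collections import Counter

def knn_classify(query, samples, k):
    """samples: list of (feature_vector, label). Majority vote among the k nearest."""
    nearest = sorted(samples, key=lambda s: math.dist(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up training set with two classes.
samples = [
    ((6.5, 3.0), "red"),
    ((8.0, 4.5), "blue"),
    ((8.5, 2.0), "blue"),
    ((3.0, 3.0), "red"),
    ((2.0, 6.0), "red"),
]
for k in (1, 3, 5):
    print(k, knn_classify((7.0, 3.0), samples, k))
```

On this data the predicted label changes with k, which is the effect the 1NN to 9NN neighborhoods in the figure illustrate. Using an odd k avoids ties when there are two classes.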

Decision Tree • The decision boundaries are hyperplanes parallel to the feature axes • A sequential classification procedure may be developed by considering successive partitions of R

Decision Trees • Example

Neural Network • A Neural Network generally maps a set of inputs to a set of outputs • The number of inputs and outputs varies • The network itself is composed of an arbitrary number of nodes with an arbitrary topology • It is a universal approximator • (Figure: network diagram with labelled nodes and connections)

Neural Network • A popular NN is the feed-forward neural network • E.g. • Multi-layer Perceptron (MLP) • Radial Basis Function (RBF) network • Learning algorithm: back-propagation • Weights are adjusted based on how well the current weights match an objective

Support Vector Machine • Basically a 2-class classifier developed by Vapnik and co-workers (linear form with Chervonenkis in the 1960s; kernelized form by Boser, Guyon and Vapnik, 1992) • Which line is optimal?

Support Vector Machine • Training vectors: $x_i$, $i = 1, \dots, n$ • Consider a simple case with two classes • Define a label vector $y$ with $y_i = 1$ if $x_i$ is in class 1 and $y_i = -1$ if $x_i$ is in class 2 • Look for a hyperplane which separates all data • (Figure labels: ρ, r, separating plane, margin, Class 1, Class 2, support vectors of Class 1 and Class 2)

Linearly Separable SVM • Label the training data $\{x_i, y_i\}$, $i = 1, \dots, n$, $y_i \in \{-1, +1\}$, $x_i \in \mathbb{R}^d$ • Suppose we have a hyperplane which separates the “+” from the “-” examples (a separating hyperplane) • Points $x$ which lie on the hyperplane satisfy $w \cdot x + b = 0$ • $w$ is normal to the hyperplane, and $|b|/\|w\|$ is the perpendicular distance from the hyperplane to the origin

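Linearly Separable SVM • Define two support hyperplanes as $H_1: w^T x + b = +\delta$ and $H_2: w^T x + b = -\delta$ • To remove the over-parameterization (scaling $w$, $b$ and $\delta$ by a common factor leaves the hyperplanes unchanged), set $\delta = 1$ • The distance between the optimal separating hyperplane (OSH) and each of the two support hyperplanes is $1/\|w\|$ • Margin = distance between $H_1$ and $H_2$ = $2/\|w\|$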
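The margin formula can be checked numerically. The sketch below assumes the convention $w \cdot x + b = 0$ for the separating hyperplane (the one used in the constraint $y_i(x_i \cdot w + b) - 1 \ge 0$ of the primal problem) and a made-up weight vector and data set.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(w):
    return math.sqrt(dot(w, w))

def signed_distance(x, w, b):
    """Signed perpendicular distance from x to the hyperplane w.x + b = 0."""
    return (dot(w, x) + b) / norm(w)

# Assumed toy values: w and b are scaled so the closest points satisfy |w.x + b| = 1.
w, b = (0.5, 0.5), 0.0
points = [((1.0, 1.0), +1), ((2.0, 3.0), +1), ((-1.0, -1.0), -1), ((-3.0, -1.0), -1)]

margin = 2.0 / norm(w)
for x, y in points:
    # Functional margin y * (w.x + b) and geometric distance to the hyperplane.
    print(x, y, y * (dot(w, x) + b), signed_distance(x, w, b))
print("margin:", margin)
```

The points (1, 1) and (-1, -1) lie exactly on the two support hyperplanes, and the distance between them along the normal equals the printed margin.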

The Primal Problem of SVM • Goal: find a separating hyperplane with the largest margin. An SVM finds $w$ and $b$ that (1) minimize $\|w\|^2/2 = w^T w/2$ (2) subject to $y_i(x_i \cdot w + b) - 1 \ge 0$ for all $i$ • Switch the above problem to a Lagrangian formulation for two reasons: (1) the constraints are replaced by constraints on the Lagrange multipliers, which are easier to handle in the resulting quadratic program (2) the training data only appear in the form of dot products between vectors => can be generalized to the nonlinear case

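Lagrange Multiplier Method • A method to find the extremum of a multivariate function $f(x_1, x_2, \dots, x_n)$ subject to the constraint $g(x_1, x_2, \dots, x_n) = 0$ • For an extremum of $f$ to exist on $g$, the gradient of $f$ must line up with the gradient of $g$: • $\frac{\partial f}{\partial x_k} + \lambda \frac{\partial g}{\partial x_k} = 0$ for all $k = 1, \dots, n$, where the constant $\lambda$ is called the Lagrange multiplier • The Lagrangian transformation of the problem is $L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i\,[y_i(x_i \cdot w + b) - 1]$, with $\alpha_i \ge 0$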
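As a small illustration of the multiplier condition itself, here is a toy problem that is not from the slides: minimize $f(x, y) = x^2 + y^2$ subject to $g(x, y) = x + y - 1 = 0$. The sign convention (stationarity of $f + \lambda g$) is an assumption of this sketch.

```latex
\begin{align*}
\frac{\partial f}{\partial x} + \lambda \frac{\partial g}{\partial x} &= 2x + \lambda = 0
  &&\Rightarrow\; x = -\tfrac{\lambda}{2} \\
\frac{\partial f}{\partial y} + \lambda \frac{\partial g}{\partial y} &= 2y + \lambda = 0
  &&\Rightarrow\; y = -\tfrac{\lambda}{2} \\
g(x, y) = x + y - 1 &= -\lambda - 1 = 0
  &&\Rightarrow\; \lambda = -1 \\
x = y &= \tfrac{1}{2}, \qquad f\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = \tfrac{1}{2}
\end{align*}
```

The solution is the point of the line $x + y = 1$ closest to the origin, where the gradient of $f$ is parallel to the gradient of $g$.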

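Lagrange Multiplier Method • To find the stationary point of the Lagrangian $L_P$, set its gradient with respect to $w$ and $b$ to zero: (1) $\frac{\partial L_P}{\partial w} = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i$ (2) $\frac{\partial L_P}{\partial b} = 0 \Rightarrow \sum_i \alpha_i y_i = 0$ • Substituting them into the Lagrangian gives the dual problem: maximize $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\,(x_i \cdot x_j)$ subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$ • Inner product form => can be generalized to the nonlinear case by applying a kernel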

KKT Conditions • Since the SVM problem is convex, the KKT conditions are necessary and sufficient for $w$, $b$ and $\alpha$ to be a solution • $w$ is determined by the training procedure • $b$ is easily found from the KKT complementary slackness condition $\alpha_i\,[y_i(x_i \cdot w + b) - 1] = 0$, by choosing any $i$ for which $\alpha_i \ne 0$
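A worked two-point example may help tie the dual problem and the complementary slackness condition together. The data are made up for illustration: $x_1 = (1, 1)$ with $y_1 = +1$ and $x_2 = (-1, -1)$ with $y_2 = -1$, using the standard hard-margin dual.

```latex
\begin{align*}
\textstyle\sum_i \alpha_i y_i = 0 &\;\Rightarrow\; \alpha_1 = \alpha_2 = \alpha \\
L_D &= \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j)
     = 2\alpha - \tfrac{1}{2}\,\alpha^2\,(2 + 2 + 2 + 2)
     = 2\alpha - 4\alpha^2 \\
\frac{dL_D}{d\alpha} = 2 - 8\alpha = 0 &\;\Rightarrow\; \alpha = \tfrac{1}{4} \\
w &= \sum_i \alpha_i y_i x_i
   = \tfrac{1}{4}(1, 1) - \tfrac{1}{4}(-1, -1)
   = \left(\tfrac{1}{2}, \tfrac{1}{2}\right) \\
\alpha_1 \neq 0 &\;\Rightarrow\; y_1 (x_1 \cdot w + b) = 1
   \;\Rightarrow\; 1 + b = 1 \;\Rightarrow\; b = 0 \\
\text{margin} &= \frac{2}{\|w\|} = 2\sqrt{2} = \|x_1 - x_2\|
\end{align*}
```

Both points are support vectors (both multipliers are non-zero), and the margin equals the distance between them, as expected when the two classes contain one point each.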

Non-linear SVM: Kernel • To extend to the non-linear case, we need to map the data to some other Euclidean space.

Kernel • $\Phi$ is a mapping function • Since the training algorithm depends on the data only through dot products, we can use a “kernel function” $K$ such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$ • One commonly used example is the radial basis function (RBF) kernel, e.g. the Gaussian form $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ with $\gamma > 0$ • An RBF is a real-valued function whose value depends only on the distance from the origin, so that $\varphi(x) = \varphi(\|x\|)$; or alternatively on the distance from some other point $c$, called a center, so that $\varphi(x, c) = \varphi(\|x - c\|)$
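The kernel idea can be checked numerically. The sketch below uses a degree-2 polynomial kernel on 2-D inputs, for which an explicit feature map is known, and a Gaussian RBF kernel. The value of `gamma` and the test points are assumptions chosen for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the degree-2 kernel (x.z)^2 on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """K(x, z) = (x.z)^2, computed without ever forming phi(x)."""
    return dot(x, z) ** 2

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel exp(-gamma * ||x - z||^2); gamma is an assumed value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

x, z = (1.0, 2.0), (3.0, -1.0)
print(poly2_kernel(x, z), dot(phi(x), phi(z)))  # the two agree up to rounding
print(rbf_kernel(x, x), rbf_kernel(x, z))
```

The polynomial kernel gives the same value as the dot product in the mapped 3-D space while working only with the original 2-D vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the kernel form is practical.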

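Non-separable SVM • Real-world applications usually have no OSH (optimal separating hyperplane). We need to add an error term $\zeta_i \ge 0$ => $y_i(x_i \cdot w + b) \ge 1 - \zeta_i$ • To penalize the error term, minimize $\frac{1}{2}\|w\|^2 + C \sum_i \zeta_i$, where $C > 0$ sets the weight of the penalty • New Lagrangian form is $L_P = \frac{1}{2}\|w\|^2 + C \sum_i \zeta_i - \sum_i \alpha_i\,[y_i(x_i \cdot w + b) - 1 + \zeta_i] - \sum_i \mu_i \zeta_i$, with multipliers $\alpha_i \ge 0$ and $\mu_i \ge 0$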

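Non-separable SVM • New KKT Conditions (with $C$ the penalty weight, $\alpha_i$ the multipliers of the margin constraints and $\mu_i$ the multipliers enforcing $\zeta_i \ge 0$) • $\frac{\partial L_P}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0$ • $\frac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0$ • $\frac{\partial L_P}{\partial \zeta_i} = C - \alpha_i - \mu_i = 0$ • $y_i(x_i \cdot w + b) - 1 + \zeta_i \ge 0$ • $\zeta_i \ge 0$, $\alpha_i \ge 0$, $\mu_i \ge 0$ • $\alpha_i\,[y_i(x_i \cdot w + b) - 1 + \zeta_i] = 0$ • $\mu_i \zeta_i = 0$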
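To make the error term concrete, the sketch below computes the smallest slack each point needs for a fixed hyperplane, and the penalized objective for two values of the penalty weight. The data, `w`, `b` and the name `C` are assumptions chosen for illustration. No optimization is performed here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slacks(points, w, b):
    """Smallest zeta_i >= 0 satisfying y_i (w.x_i + b) >= 1 - zeta_i for each point."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in points]

def soft_margin_objective(points, w, b, C):
    """0.5 * ||w||^2 + C * sum(zeta_i), the penalized objective."""
    return 0.5 * dot(w, w) + C * sum(slacks(points, w, b))

# Made-up data: the last point carries label -1 but lies on the +1 side.
points = [((2.0, 2.0), +1), ((1.0, 1.0), +1), ((-2.0, -2.0), -1), ((0.5, 0.5), -1)]
w, b = (0.5, 0.5), 0.0
print(slacks(points, w, b))
print(soft_margin_objective(points, w, b, C=1.0))
print(soft_margin_objective(points, w, b, C=10.0))
```

Only the misplaced point gets a non-zero slack. A larger penalty weight makes that violation more expensive, which pushes the optimizer toward fewer violations at the cost of a narrower margin.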
