
Support Vector Machine


Presentation Transcript


  1. Support Vector Machine Chan-Gu Gang, MK Hasan and Ha-Yong, Jung 2004/11/17

  2. Introduction

  3. Learning Theory
  • Objective: given two classes of objects and a previously unseen object, assign the new object to one of the two classes
  • This is binary pattern recognition (binary classification)
  • xi : pattern, case, input, instance, ...
  • X : domain (the set from which the values of xi are taken)
  • yi : label, target, output
  • To learn the mapping from xi to yi, we need a notion of similarity in X and in the set of labels
  • For yi : trivial
  • For X : ?

  4. Similarity Measure
  • A similarity measure k, given two patterns x and x', returns a real number characterizing their similarity
  • k is called a kernel

  5. Simple Example of a Similarity Measure: the Dot Product
  • Dot product of vectors
  • However, the patterns xi are not yet in the form of vectors; they do not live in a dot-product space and can be any kind of object
  • To use the dot product as a similarity measure, transform the patterns into vectors in a dot-product space H
  • Three benefits of the transformation (into vector form):
  • define a similarity measure from the dot product in H
  • deal with the patterns geometrically, so linear algebra and analytic geometry apply
  • freedom to choose the mapping phi: this enables a large variety of similarity measures and learning algorithms, and lets us change the representation into one that is more suitable for the given problem
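As a concrete illustration of this transformation, here is a minimal Python sketch. The feature map phi and the string patterns are made up for illustration (they are not from the slides); any mapping into a vector space would do.

```python
import numpy as np

def phi(pattern: str) -> np.ndarray:
    """Hypothetical feature map: represent a raw (non-vector) pattern as a
    vector in a dot-product space H (here: string length and vowel count)."""
    vowels = sum(ch in "aeiou" for ch in pattern.lower())
    return np.array([len(pattern), vowels], dtype=float)

def k(x: str, x_prime: str) -> float:
    """Similarity measure defined via the dot product in H."""
    return float(np.dot(phi(x), phi(x_prime)))

# Two arbitrary patterns compared through their images in H.
print(k("apple", "orange"))
```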

  6. Simple Pattern Recognition Algorithm
  • Basic idea: assign a previously unseen pattern to the class with the closer mean
  • Compute the means of the two classes (+, -)
  • With c the midpoint of c+ and c-
  • x is classified into the class with the closer mean
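A minimal Python sketch of this nearest-mean rule on made-up 2-D data (the data and numbers are illustrative, not from the slides):

```python
import numpy as np

# Toy 2-D data: rows are patterns, labels are +1 / -1.
X = np.array([[2.0, 2.5], [3.0, 2.0], [2.5, 3.0],          # class +1
              [-1.0, -2.0], [-2.0, -1.5], [-1.5, -2.5]])   # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

c_plus = X[y == +1].mean(axis=0)   # mean of class +1
c_minus = X[y == -1].mean(axis=0)  # mean of class -1

def classify(x):
    """Assign x to the class whose mean is closer (the rule on this slide)."""
    return +1 if np.linalg.norm(x - c_plus) < np.linalg.norm(x - c_minus) else -1

print(classify(np.array([2.2, 2.2])))    # -> +1
print(classify(np.array([-1.8, -2.0])))  # -> -1
```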

  7. Decision Function
  • Substituting the class means into the classification rule above yields the decision function
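The formula itself is not preserved in this transcript; a standard reconstruction of the substitution, following the usual textbook presentation of this simple classifier (an assumption about what the slide showed), is:

```latex
f(x) = \operatorname{sgn}\big(\langle x, c_+\rangle - \langle x, c_-\rangle + b\big)
     = \operatorname{sgn}\Big(\tfrac{1}{m_+}\textstyle\sum_{\{i:\,y_i=+1\}} \langle x, x_i\rangle
       - \tfrac{1}{m_-}\textstyle\sum_{\{i:\,y_i=-1\}} \langle x, x_i\rangle + b\Big),
\qquad
b = \tfrac{1}{2}\big(\lVert c_-\rVert^2 - \lVert c_+\rVert^2\big).
```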

  8. Parzen Windows Estimators
  • Conditions for the resulting decision function to be a Bayes classifier:
  • the class means have the same distance to the origin, so b = 0
  • k is a probability density function
  • The new sample is then labeled according to which of p+(x) and p-(x) is larger
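The density estimates referred to here are not shown in the transcript; in standard notation (again a reconstruction, assuming the usual Parzen-window form), they are

```latex
p_+(x) = \frac{1}{m_+}\sum_{\{i:\, y_i=+1\}} k(x, x_i), \qquad
p_-(x) = \frac{1}{m_-}\sum_{\{i:\, y_i=-1\}} k(x, x_i),
```

and the new sample is labeled +1 when p_+(x) > p_-(x), which is the b = 0 case of the decision function above.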

  9. Generalization
  • The simple classifier above can be generalized
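The generalized form is missing from the transcript; the standard generalization (an assumed reconstruction) replaces the fixed weights 1/m+ and 1/m- by adjustable coefficients:

```latex
f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{m} y_i\,\alpha_i\, k(x, x_i) + b\Big),
```

where the simple classifier corresponds to the special case alpha_i = 1/m+ for positive examples and alpha_i = 1/m- for negative ones; slide 10 discusses how the kernel centers and the weights can be chosen more carefully.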

  10. To Make the Classification Technique More Sophisticated
  • Two ways to make it more sophisticated:
  • selection of the patterns on which the kernels are centered
  • e.g. remove the influence of patterns that are very far away from the decision boundary, because we expect that they will not improve the generalization error of the decision function, or in order to reduce the computational cost of evaluating the decision function
  • choice of the weights ai placed on the individual kernels in the decision function
  • above, the weights are only (1/m+) or (1/m-); a greater variety of weights is possible

  11. Some Insights from Statistical Learning Theory
  • Some exceptions ("outliers") are allowed, so the boundary is ambiguous
  • An almost linear separation of the classes misclassifies the two outliers, as well as other "easy" points which are so close to the decision boundary that the classifier really should be able to get them right
  • A compromise gets most points right, without putting too much trust in any individual point

  12. More into Statistical Learning Theory
  • Put the above intuitive arguments into a mathematical framework
  • Assumption: the data (x, y) are generated independently from a probability distribution P(x, y), i.e. they are iid (independent and identically distributed)
  • Goal: find a function f that will correctly classify unseen examples (x, y)
  • Measurement of correctness: the zero-one loss function c(x, y, f(x)) := (1/2) |f(x) - y|
  • Without a restriction on the set of functions from which we choose the estimate f, it might not generalize well

  13. Training Error & Test Error
  • Minimizing the training error (empirical risk) does not imply a small test error (risk)
  • Restrict the set of functions from which f is chosen to one that has a capacity suitable for the amount of available training data
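In standard notation (a reconstruction, since the slide's formulas are not preserved), the risk and the empirical risk are

```latex
R[f] = \int c\big(x, y, f(x)\big)\, dP(x, y), \qquad
R_{\mathrm{emp}}[f] = \frac{1}{m}\sum_{i=1}^{m} c\big(x_i, y_i, f(x_i)\big),
```

with c the zero-one loss from slide 12; a small R_emp[f] (training error) does not by itself imply a small R[f] (test error).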

  14. VC Dimension
  • Each function of the class separates the patterns in a certain way; with labels in {+1, -1}, there are at most 2^m different labelings of m patterns
  • "Shattering": a function class shatters the m points if it can realize all 2^m separations
  • VC dimension: the largest m such that there exists a set of m points which the class can shatter (infinity if no such largest m exists); it is a one-number summary of a learning machine's capacity
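A small Python sketch of shattering (the three points, the use of scikit-learn, and the large-C linear SVM are choices made here for illustration, not part of the slides): it checks that three points in general position in the plane can be shattered by linear classifiers, which is why the VC dimension of hyperplanes in R^2 is 3.

```python
from itertools import product
import numpy as np
from sklearn.svm import SVC

# Three points in general position in the plane.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

shattered = True
for labels in product([-1, +1], repeat=len(X)):
    y = np.array(labels)
    if len(set(labels)) == 1:
        # An all-(+1) or all-(-1) labeling is realized trivially by a
        # hyperplane pushed far away from the points; skip the fit.
        continue
    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # (nearly) hard-margin linear classifier
    if (clf.predict(X) != y).any():
        shattered = False

print("3 points shattered by linear classifiers:", shattered)  # expected: True
# Four points in general position cannot all be shattered in the plane,
# so the VC dimension of lines in R^2 is 3 (more generally d + 1 in R^d).
```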

  15. Example of a VC Bound
  • If h < m is the VC dimension of the class of functions that the learning machine can implement, then for all functions of that class, independent of the underlying probability distribution generating the data, with probability at least 1 - delta the test error is bounded by the training error plus a confidence term phi(h, m, delta)
  • To reproduce a random labeling by correctly separating all training examples, the machine would require a large VC dimension h, so phi(h, m, delta) would be large: a small training error does not guarantee a small test error
  • To get nontrivial predictions from the bound, the function class must be restricted so that its capacity is small enough
  • At the same time, the class should be large enough to provide functions that can model the dependencies hidden in P(x, y)
  • The choice of the set of functions is therefore crucial for learning from data
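The bound itself is not preserved in the transcript; the standard VC bound (an assumed reconstruction of the slide) states that with probability at least 1 - delta,

```latex
R[f] \;\le\; R_{\mathrm{emp}}[f] \;+\; \phi(h, m, \delta),
\qquad
\phi(h, m, \delta) = \sqrt{\frac{h\left(\ln\frac{2m}{h} + 1\right) + \ln\frac{4}{\delta}}{m}} .
```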

  16. Kernel Machine Classifier

  17. Hyperplane Classifier
  • We have a set of points
  • Each point belongs to class +1 or class -1
  • The points are linearly separable
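In symbols (standard notation, reconstructed here since the slide's formulas are not in the transcript): given training points (x_1, y_1), ..., (x_m, y_m) with y_i in {+1, -1}, a hyperplane classifier has the form

```latex
f(x) = \operatorname{sgn}\big(\langle w, x\rangle + b\big),
\qquad \text{hyperplane } \{x : \langle w, x\rangle + b = 0\},
```

and linear separability means that some (w, b) satisfies y_i(\langle w, x_i\rangle + b) > 0 for all i.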

  18. A Point Set

  19. Growing Ball

  20. Growing Ball • Several hyperplanes exist

  21. Growing Balls • Bigger balls, fewer hyperplanes

  22. Growing Balls • A single hyperplane is left

  23. Growing Ball • Support vectors

  24. Why Maximum Margin
  • Generalization capability increases with increasing margin (we skip the proof of this statement)
  • The problem can be solved using quadratic programming, which is quite efficient
  • A single global optimum exists
  • This is a turning point for choosing the Support Vector Machine instead of a Neural Network as a tool

  25. How to get the hyperplane with maximum margin

  26. How to get the optimum margin

  27. Formulation
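The formulation on this slide is not preserved in the transcript; the standard maximum-margin (hard-margin) formulation, given here as a reconstruction, is the quadratic program

```latex
\min_{w,\, b}\ \tfrac{1}{2}\lVert w\rVert^2
\quad \text{subject to} \quad
y_i\big(\langle w, x_i\rangle + b\big) \ge 1, \quad i = 1, \dots, m.
```

The geometric margin of the resulting hyperplane is 2/\lVert w\rVert, so minimizing \lVert w\rVert^2 maximizes the margin; this is the quadratic programming problem with a single global optimum mentioned on slide 24.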

  28. What if the points are not linearly separable?
  • We can map the points into a higher-dimensional space using a nonlinear transformation, so that the points become linearly separable in that higher-dimensional space

  29. How to avoid the computation of mapping into the higher dimension
  • We always use the dot products of the input vectors
  • Let Φ(x) be the function that maps an input vector xi to a vector in some higher-dimensional space
  • If we can compute k(xi, xj) = Φ(xi)·Φ(xj) without calculating Φ(xi) and Φ(xj) individually, then we save the time needed to map the input vectors into the higher dimension, and at the same time we can reuse the previous input-space formulation while obtaining a nonlinear decision boundary
  • This k(x, y) is called the kernel function
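A small Python check of this idea (the degree-2 polynomial kernel and the explicit feature map Φ below are illustrative choices, not taken from the slides): the kernel value computed purely in the input space matches the dot product of the explicitly mapped vectors.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input:
    Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, y):
    """Degree-2 polynomial kernel: computed in the input space only."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y_vec = np.array([3.0, -1.0])

# The kernel value agrees with the dot product in the higher-dimensional space,
# without ever forming Phi(x) and Phi(y) inside the kernel evaluation.
print(k(x, y_vec))                        # 1.0
print(float(np.dot(phi(x), phi(y_vec))))  # 1.0
```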

  30. Formulation using kernel function
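The slide's formulas are not preserved; the standard dual of the maximum-margin problem with the dot products replaced by a kernel (a reconstruction under that assumption) is

```latex
\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i
  - \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j\, y_i y_j\, k(x_i, x_j)
\quad \text{subject to} \quad
\alpha_i \ge 0, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0,
```

with the resulting decision function f(x) = \operatorname{sgn}\big(\sum_i \alpha_i y_i\, k(x_i, x) + b\big).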

  31. Some Applications

  32. Text Categorization
  • Why is it needed?
  • As the volume of electronic information increases, there is growing interest in developing tools to help people better find, filter, and manage these resources
  • What is it?
  • The assignment of natural-language texts to one or more predefined categories based on their contents
  • Text categorization involves:
  • representing text: a bag of words per document
  • feature selection: because the feature dimension is otherwise too large
  • machine learning
  • Extended applications: patent classification, spam-mail filtering, categorization of Web pages, automatic essay grading(?), ...

  33. Text Categorization with SVM
  • Conventional learning methods: Naïve Bayes classifier, Rocchio algorithm, decision tree classifier, k-nearest neighbors
  • Experiments: test collections
  • Reuters-21578 dataset: 9603 training documents, 3299 test documents, 90 categories, 9947 distinct terms; direct correspondence, single category
  • Ohsumed corpus: 10000 training documents, 10000 test documents, 23 MeSH "diseases" categories, 15561 distinct terms; less direct correspondence, multiple categories

  34. Text Categorization with SVM
  • (Slide shows results tables annotated "Best", "No overfitting", "Fail", "Almost")
  • SVMs perform better independent of the choice of parameters
  • SVM is better than k-NN on 62 of the 90 categories (20 ties), which is a significant improvement according to the binomial sign test
  • SVM outperforms k-NN on all 23 categories

  35. Text Categorization with SVM
  • Why should SVMs work well for text categorization?
  • High-dimensional input space: SVMs use overfitting protection which does not necessarily depend on the number of features
  • Few irrelevant features: even the lowest-ranked features still contain considerable information and are somewhat relevant, so a good classifier should combine many features
  • Document vectors are sparse: in the mistake-bound model, additive algorithms, which have an inductive bias similar to SVMs, are well suited for problems with dense concepts and sparse instances
  • Most text categorization problems are linearly separable, and the idea of SVMs is to find such linear (or polynomial, RBF, etc.) separators
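To make the setting concrete, here is a minimal sketch of linear-SVM text categorization using scikit-learn; the tiny corpus, labels, and pipeline are made up for illustration and are not the tools or data used in the experiments above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for a real collection such as Reuters-21578.
docs = [
    "stocks fell as the market reacted to interest rate news",      # finance
    "the central bank raised interest rates again this quarter",    # finance
    "the team won the championship after a dramatic final match",   # sports
    "the striker scored twice and the coach praised the defense",   # sports
]
labels = ["finance", "finance", "sports", "sports"]

# Bag-of-words (tf-idf) representation + linear SVM: the sparse,
# high-dimensional setting described on this slide.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)

print(clf.predict(["rates and markets moved after the bank announcement"]))
print(clf.predict(["the coach changed the team before the final"]))
```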

  36. TREC11
  • Kernel Methods for Document Filtering (MIT)
  • Ranking
  • Adaptive T11F/U - assessor 1st
  • Batch T11F/U - assessor/intersection 1st
  • Routing - assessor/intersection 1st
  • Features: words in the documents
  • Filtering: digits and words occurring fewer than two times are removed
  • Titles are given double weight
  • Various kernels applied:
  • second-order perceptron (2)
  • SVM with uneven margin
  • SVM + new threshold selection (3)
  • Conclusion
  • Good ranking, except on the intersection topics
  • More complex kernels gave poorer results
  • Performance varies by category

  37. Face Detection
  • The face-detection problem can be defined as follows: given as input an arbitrary image, which could be a digitized video signal or a scanned photograph, determine whether there are any human faces in the image, and if there are, return an encoding of their location
  • The encoding in this system fits each face into a bounding box defined by the image coordinates of its corners
  • It can be extended to many applications: face recognition, HCI, surveillance systems, ...

  38. Applying SVMs to Face Detection
  • Overview of the overall process:
  • training an SVM on a database of face and nonface patterns of fixed size
  • testing candidate image locations for local patterns that appear like faces, using a classification procedure that determines whether a given local image pattern is a face
  • This turns the face-detection problem into a classification problem: faces vs. nonfaces

  39. Applying SVMs to Face Detection
  • The SVM face-detection system:
  1. Rescale the input image several times
  2. Cut 19x19 window patterns out of the scaled image
  3. Preprocess each window using masking, light correction and histogram equalization
  4. Classify the pattern using the SVM
  5. If the class corresponds to a face, draw a rectangle around the face in the output image
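A rough Python sketch of this pipeline (everything here is an assumption made for illustration: clf stands for an already-trained face/nonface SVM, the preprocessing is reduced to simple patch normalization rather than the masking, light correction, and histogram equalization of the original system, and the scales, step, and threshold values are arbitrary):

```python
import numpy as np
from scipy.ndimage import zoom

WINDOW = 19  # window size used by the system described on this slide

def detect_faces(image, clf, scales=(1.0, 0.8, 0.64), step=4, threshold=0.0):
    """Sliding-window detection sketch.
    `image` is a 2-D grayscale array; `clf` is an already-trained SVM with a
    decision_function method (e.g. sklearn.svm.SVC fit on flattened 19x19
    face/nonface patches); returns candidate boxes (row, col, size)."""
    boxes = []
    for s in scales:                                  # 1. rescale the input image
        scaled = zoom(image, s)
        h, w = scaled.shape
        for i in range(0, h - WINDOW + 1, step):      # 2. cut 19x19 windows
            for j in range(0, w - WINDOW + 1, step):
                patch = scaled[i:i + WINDOW, j:j + WINDOW].astype(float)
                patch = (patch - patch.mean()) / (patch.std() + 1e-8)   # 3. simplified preprocessing
                score = clf.decision_function(patch.reshape(1, -1))[0]  # 4. classify with the SVM
                if score > threshold:                 # 5. record a face box in original coordinates
                    boxes.append((int(i / s), int(j / s), int(WINDOW / s)))
    return boxes
```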

  40. Applying SVMs to Face Detection
  • Experimental results on static images
  • Set A: 313 high-quality images, containing the same number of faces
  • Set B: 23 images of mixed quality, containing a total of 155 faces

  41. Applying SVMs to Face Detection
  • Extension to a real-time system: face detection on a PC-based color real-time system
  • An example: the skin-detection module implemented using SVMs

  42. Summary
  • Single-layer neural networks have a simple and efficient learning algorithm, but very limited expressive power
  • Multilayer networks, on the other hand, are much more expressive but are very hard to train
  • The kernel machine overcomes this problem: it can be trained very easily and, at the same time, it can represent complex nonlinear functions
  • Kernel machines are very efficient in handwriting recognition, text categorization, and face recognition
