
Statistical Learning Theory and Classification based on Support Vector Machines

Presentation by Michael Sullivan. Based on: The Nature of Statistical Learning Theory by V. Vapnik, a 2009 presentation by John DiMona, and some slides based on lectures given by Professor Andrew Moore of Carnegie Mellon University.


Presentation Transcript


  1. Statistical Learning Theory and Classification based on Support Vector Machines. Presentation by Michael Sullivan. Based on: The Nature of Statistical Learning Theory by V. Vapnik, a 2009 presentation by John DiMona, and some slides based on lectures given by Professor Andrew Moore of Carnegie Mellon University.

  2. Empirical Data Modeling. Observations of a system are collected. Based on these observations, a process of induction is used to build up a model of the system. This model is then used to deduce responses of the system not yet observed.

  3. Empirical Data Modeling. Data obtained through observation is finite and sampled by nature. Typically this sampling is non-uniform. Due to the high-dimensional nature of some problems, the data will form only a sparse distribution in the input space. Creating a model from this type of data is an ill-posed problem.

  4. Empirical Data Modeling. [Diagram: globally optimal model, best reachable model, and selected model.] The goal in modeling is to choose a model from the hypothesis space which is closest (with respect to some error measure) to the underlying function in the target space.

  5. Modeling Error. Approximation error is a consequence of the hypothesis space not exactly fitting the target space: the underlying function may lie outside the hypothesis space, and a poor choice of the model space will result in a large approximation error (model mismatch). Estimation error is the error due to the learning procedure converging to a non-optimal model in the hypothesis space. Together these form the generalization error.

  6. Empirical Data Modeling. [Diagram: the generalization error between the selected model and the globally optimal model decomposes into the approximation error and the estimation error.] The goal in modeling is to choose a model from the hypothesis space which is closest (with respect to some error measure) to the underlying function in the target space.

  7. What is Statistical Learning? Definition: “Consider the learning problem as a problem of finding a desired dependence using a limited number of observations.” (Vapnik 17)

  8. Model of Supervised Learning. Training: the supervisor takes each generated x value and returns an output value y according to the (unknown) joint distribution F(x, y) = F(x) F(y|x). Each (x, y) pair is part of the training set (x1, y1), …, (xl, yl).

  9. Model of Supervised Learning. Goal: for each (x, y) pair, we want to choose the learning machine's estimation function f(x, α), α ∈ Λ, that most closely estimates the supervisor's response y. Once we have the estimation function, we can classify new and unseen data.

  10. Risk Minimization. To find the best function, we need to measure loss. L(y, f(x, α)) is the discrepancy between the response y generated by the supervisor and the response f(x, α) generated by the estimation function.

  11. Risk Minimization. To do this, we calculate the risk functional R(α) = ∫ L(y, f(x, α)) dF(x, y). We choose the function f(x, α) that minimizes the risk functional R(α) over the class of functions f(x, α), α ∈ Λ. Remember, F(x, y) is unknown except for the information contained in the training set.

  12. Risk Minimization with Pattern Recognition. In pattern recognition, the supervisor's output y can take on only two values, y ∈ {0, 1}, and the loss takes the following values: L(y, f(x, α)) = 0 if y = f(x, α), and L(y, f(x, α)) = 1 if y ≠ f(x, α). So the risk functional determines the probability that the supervisor and the estimation function give different answers.

  13. Risk Minimization. The expected value of the loss with respect to some estimation function f(x, α) is R(α) = ∫ L(y, f(x, α)) dF(x, y), where the expectation is taken over the joint distribution F(x, y). Problem: we still don't know F(x, y).

  14. To simplify these terms: from this point on, we'll refer to the training set {(x1, y1), (x2, y2), …, (xl, yl)} as {z1, z2, …, zl}, and we'll refer to the loss functional L(y, f(x, α)) as Q(z, α).

  15. Empirical Risk Minimization (ERM). Instead of measuring risk over the whole distribution of z, measure it over just the training set, giving the empirical risk functional Remp(α) = (1/l) Σ_{i=1..l} Q(zi, α). The empirical risk must converge uniformly to the actual risk over the set of loss functions, in both directions.
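
A minimal sketch (my own illustration, not from the slides) of the empirical risk functional under 0/1 loss, with a hypothetical one-parameter decision stump standing in for the family f(x, α):

```python
# Minimal sketch: empirical risk R_emp(alpha) with 0/1 loss.
# `decision_stump` is a hypothetical estimation function f(x, alpha)
# parameterized by a threshold alpha; it is not from the slides.
import numpy as np

def zero_one_loss(y, y_hat):
    """Q(z, alpha): 0 if the prediction matches the supervisor, 1 otherwise."""
    return np.where(y == y_hat, 0.0, 1.0)

def decision_stump(x, alpha):
    """f(x, alpha): predict 1 when x exceeds the threshold alpha, else 0."""
    return (x > alpha).astype(int)

def empirical_risk(x, y, alpha):
    """R_emp(alpha) = (1/l) * sum_i Q(z_i, alpha) over the l training pairs."""
    return zero_one_loss(y, decision_stump(x, alpha)).mean()

# Toy training set {z_1, ..., z_l} = {(x_i, y_i)}.
x = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
y = np.array([0, 0, 0, 1, 1, 1])

# ERM over a small candidate set of alphas: pick the one with lowest R_emp.
alphas = np.linspace(0, 1, 21)
best = min(alphas, key=lambda a: empirical_risk(x, y, a))
print(best, empirical_risk(x, y, best))
```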

  16. So what does learning theory need to address? i. What are the (necessary and sufficient) conditions for consistency of a learning process based on the ERM principle? ii. How fast is the rate of convergence of the learning process? iii. How can one control the rate of convergence (the generalization ability) of the learning process? iv. How can one construct algorithms that can control the generalization ability?

  17. VC Dimension (Vapnik–Chervonenkis). The VC dimension is a scalar value that measures the capacity of a set of functions, and it is responsible for the generalization ability of learning machines. The VC dimension of a set of indicator functions Q(z, α), α ∈ Λ, is the maximum number h of vectors z1, …, zh that can be separated into two classes in all possible ways using functions of the set.

  18. VC Dimension. In the plane, 3 vectors can be shattered, but not 4, since vectors z2, z4 cannot be separated by a line from vectors z1, z3. Rule: the set of linear indicator functions in n-dimensional space has VC dimension h = n + 1.
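
A small sketch (my own, assuming SciPy is available) of the shattering idea: a point set is shattered by linear indicator functions if every labeling of it is linearly separable, which can be checked with an LP feasibility problem:

```python
# Minimal sketch: test whether a point set can be shattered by linear
# indicator functions, by checking every +1/-1 labeling for separability.
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Feasible iff some (w, b) satisfies y_i * (w . x_i + b) >= 1 for all i."""
    # Variables are [w_1, ..., w_d, b]; objective is 0 (pure feasibility).
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (X.shape[1] + 1))
    return res.success

def shatterable(X):
    """True iff every labeling of X is linearly separable."""
    return all(linearly_separable(X, np.array(labels))
               for labels in itertools.product([-1, 1], repeat=len(X)))

three = np.array([[0, 0], [1, 0], [0, 1]])          # 3 points in general position
four = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])   # 4 points, XOR-like layout
print(shatterable(three))   # True: 3 points can be shattered in the plane
print(shatterable(four))    # False: the XOR labeling is not separable
```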

  19. Upper Bound for Risk. It can be shown that R(α) ≤ Remp(α) + Φ(l/h), where Φ is the confidence interval and h is the VC dimension. ERM only minimizes Remp(α); the confidence interval Φ is fixed by the VC dimension of the set of functions, which is determined a priori. When implementing ERM one must tune the confidence interval based on the problem to avoid underfitting/overfitting the data.
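
A commonly quoted form of the bound behind this slide, for indicator (0/1) loss and holding with probability at least 1 − η, is (my reconstruction of the dropped formula, not copied from the slides):

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  \;+\; \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\eta}{4}}{l}}
```

Here l is the number of training examples and h is the VC dimension; the square-root term plays the role of the confidence interval, shrinking as l/h grows.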

  20. Structural Risk Minimization (SRM). SRM attempts to minimize the right-hand side of the inequality over both terms simultaneously. The first term depends on a specific function's error and the second depends on the VC dimension of the space that function is in. Therefore the VC dimension must be a controlling variable.

  21. Structural Risk Minimization (SRM). We define our hypothesis space S to be the set of functions Q(z, α), α ∈ Λ. We say that Sk is the hypothesis subspace of VC dimension k, such that the subspaces are nested: S1 ⊂ S2 ⊂ … ⊂ Sn, with VC dimensions h1 ≤ h2 ≤ … ≤ hn. For a set of observations z1, …, zl, SRM chooses the function minimizing the empirical risk in the subset Sk for which the guaranteed risk (the bound on the actual risk) is minimal.

  22. Structural Risk Minimization (SRM). SRM defines a trade-off between the quality of the approximation of the given data and the complexity of the approximating function. As the VC dimension increases, the minima of the empirical risks decrease but the confidence interval increases. SRM is more general than ERM because it uses the subset Sk for which minimizing Remp(α) yields the best bound on R(α).

  23. Support Vector Classification. Uses the SRM principle to separate two classes by a linear indicator function induced from available examples in the training set. The goal is to produce a classifier that will work well on unseen test examples, i.e. the classifier with the maximum generalizing capacity and the lowest risk.

  24. Simplest case: linear classifiers. How would you classify this data?

  25. Simplest case: linear classifiers. All of these lines work as linear classifiers. Which one is the best?

  26. Simplest case: linear classifiers. Define the margin of a linear classifier as the width the boundary can be increased by before hitting a datapoint.

  27. Simplest case: linear classifiers. We want the maximum margin linear classifier. This is the simplest kind of SVM, called a linear SVM. Support vectors are the datapoints that the margin pushes up against.

  28. Simplest case: linear classifiers. [Figure: the plus plane bounding the +1 zone and the minus plane bounding the -1 zone.] We can define these two planes by the offset b and a vector w perpendicular to the lines they lie on, so that the dot product (w · x) + b equals +1 on the plus plane and -1 on the minus plane.

  29. The optimal separating hyperplane. But how can we find the margin M in terms of w and b when the planes are defined as: positive plane: (w · x) + b = 1; negative plane: (w · x) + b = -1. Note: the linear classifier plane is (w · x) + b = 0.

  30. The optimal separating hyperplane. The margin is defined as the distance from any point on the minus plane to the closest point on the plus plane; points classified +1 satisfy (w · x) + b ≥ 1 and points classified -1 satisfy (w · x) + b ≤ -1.

  31. The optimal separating hyperplane. Why is the margin M = 2 / |w|? Take any point x⁻ on the minus plane and let x⁺ be the closest point to it on the plus plane.

  32. The optimal separating hyperplane. Since w is perpendicular to both planes, x⁺ = x⁻ + λw for some scalar λ, and the margin is M = |x⁺ − x⁻|.

  33. The optimal separating hyperplane. (w · x⁺) + b = 1 and (w · x⁻) + b = -1.

  34. The optimal separating hyperplane. Substituting, (w · (x⁻ + λw)) + b = (w · x⁻) + b + λ(w · w) = -1 + λ(w · w) = 1. So λ = 2 / (w · w).

  35. The optimal separating hyperplane. M = |x⁺ − x⁻| = |λw| = λ|w|.

  36. The optimal separating hyperplane. M = λ √(w · w) = 2 √(w · w) / (w · w).

  37. The optimal separating hyperplane. M = 2 / √(w · w) = 2 / |w|.

  38. The optimal separating hyperplane. So we want to maximize 2 / |w|, or equivalently minimize (1/2)(w · w).
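
A minimal sketch (assuming the scikit-learn API; the data is an invented toy set) showing that the fitted maximum-margin classifier has margin width 2 / |w|:

```python
# Minimal sketch: fit a maximum-margin linear classifier and recover
# the margin width M = 2 / |w| from the learned weight vector.
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data, labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 0.5],
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin (separable-case) SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # offset, so the hyperplane is w . x + b = 0
margin = 2.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("margin width M = 2 / |w| =", margin)
```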

  39. Generalized Optimal Hyperplane. It is possible to extend to non-separable training sets by adding an error (slack) term ξi for each example and minimizing (1/2)(w · w) + C Σi ξi subject to yi((w · xi) + b) ≥ 1 - ξi, ξi ≥ 0. Data can be split into more than two classifications by using successive runs on the resulting classes.
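
A short sketch (again assuming scikit-learn, with synthetic overlapping classes) of the soft-margin trade-off controlled by the error parameter C:

```python
# Minimal sketch: C weighs the sum of slack variables against (1/2) w . w,
# so a small C accepts some training errors in exchange for a wider margin.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)        # overlapping, non-separable classes

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, "training accuracy:", clf.score(X, y),
          "margin:", 2.0 / np.linalg.norm(clf.coef_[0]))
```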

  40. Quadratic Programming. Optimization algorithms used to maximize a quadratic function of some real-valued variables subject to linear constraints. If we were working in the linear (primal) world, we'd want to minimize (1/2)(w · w). Now, we want to maximize W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj (xi · xj) in the nonnegative quadrant αi ≥ 0, under the constraint Σi αi yi = 0.
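
A sketch of the dual problem on this slide solved with a general-purpose solver (SciPy's minimize as a stand-in for a dedicated QP package; the toy data is invented):

```python
# Minimal sketch: solve the SVM dual by minimizing the negative of W(alpha)
# under alpha_i >= 0 and sum_i alpha_i y_i = 0, then recover w and b.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 5.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
l = len(y)

# H_ij = y_i y_j (x_i . x_j)
H = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(a):
    # Negative of W(a) = sum(a) - 0.5 * a^T H a.
    return 0.5 * a @ H @ a - a.sum()

cons = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
bnds = [(0.0, None)] * l                        # nonnegative quadrant

res = minimize(neg_dual, x0=np.zeros(l), bounds=bnds, constraints=cons)
alpha = res.x

w = (alpha * y) @ X                  # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                # a support vector has alpha_i > 0
b = y[sv] - w @ X[sv]

print("alpha:", np.round(alpha, 4))
print("w:", w, "b:", b)
```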

  41. Support Vector Machines (SVM). An SVM maps the input vectors x into a high-dimensional feature space using a kernel function. In this feature space the optimal separating hyperplane is constructed.

  42. How do SV machines handle data in different circumstances? A basic one-dimensional example.

  43. How do SV machines handle data in different circumstances? Easy!

  44. How do SV machines handle data in different circumstances? A harder one-dimensional example.

  45. How do SV machines handle data in different circumstances? Project the lower-dimensional training points into a higher-dimensional space.
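
A minimal sketch (my own toy data) of this projection step: one-dimensional points that no single threshold separates become linearly separable after the explicit map x → (x, x²):

```python
# Minimal sketch: lift 1-D data into 2-D with phi(x) = (x, x^2) and fit a
# linear classifier in the lifted space.
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])     # class +1 sits on both ends

X_lifted = np.column_stack([x, x ** 2])    # explicit feature map phi(x)
clf = SVC(kernel="linear", C=1e6).fit(X_lifted, y)
print("training accuracy in the lifted space:", clf.score(X_lifted, y))  # 1.0
```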

  46. SV Machines. How are SV machines implemented? • Polynomial learning machines • Radial basis function machines • Two-layer neural networks. Each of these methods, and every SV machine implementation technique, uses a different kernel function.
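
For reference, sketches of the three kernel functions involved; the constants (degree, gamma, v, c) are illustrative choices, not values from the slides:

```python
# Minimal sketch: the textbook kernels behind the three machines listed above.
import numpy as np

def polynomial_kernel(x, z, degree=3):
    """Polynomial learning machine: K(x, z) = (x . z + 1)^d."""
    return (np.dot(x, z) + 1.0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function machine: K(x, z) = exp(-gamma * |x - z|^2)."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def sigmoid_kernel(x, z, v=0.1, c=-1.0):
    """Two-layer neural network machine: K(x, z) = tanh(v * (x . z) + c)."""
    return np.tanh(v * np.dot(x, z) + c)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))
```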

  47. Two-Layer Neural Network Approach. The kernel is a sigmoid function: K(x, xi) = S(v(x · xi) + c). Implementing the rule f(x) = sign(Σi αi S(v(x · xi) + c) + b), the following are found automatically: i. the architecture of the two-layer machine, determining the number N of units in the first layer (the number of support vectors); ii. the vectors of the weights in the first layer (the support vectors); iii. the vector of weights for the second layer (the values of αi).
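
A small sketch (assuming scikit-learn's sigmoid-kernel SVC as a stand-in for the machine described here): after training, the support vectors play the role of the first-layer units and the dual coefficients give the second-layer weights:

```python
# Minimal sketch: an SVM with the sigmoid kernel read as a two-layer network.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (60, 2)), rng.normal(1, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

# Sigmoid kernel K(x, z) = tanh(gamma * (x . z) + coef0); constants are illustrative.
clf = SVC(kernel="sigmoid", gamma=0.5, coef0=-1.0, C=1.0).fit(X, y)

print("N (first-layer units / support vectors):", clf.support_vectors_.shape[0])
print("second-layer weights (alpha_i y_i):", clf.dual_coef_.round(3))
```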

  48. Two-layer neural network approach

  49. Handwritten Digit Recognition. Data used from the U.S. Postal Service database (1990). The purpose was to experiment with learning the recognition of handwritten digits using different SV machines. 7,300 training patterns and 2,000 test patterns were collected from real-life zip codes; the 16×16 pixel resolution of the database gives a 256-dimensional input space.
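
A sketch of this kind of experiment; the USPS database itself isn't bundled with scikit-learn, so the small built-in 8×8 digits set is used as a stand-in (the dimensions and error rates will not match the original study):

```python
# Minimal sketch: train SV machines with different kernels on a digits dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()                         # 8x8 images -> 64-dim input space
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

for kernel in ("poly", "rbf", "sigmoid"):      # the three machines above
    clf = SVC(kernel=kernel, degree=3, gamma="scale").fit(X_train, y_train)
    print(kernel, "test accuracy:", round(clf.score(X_test, y_test), 3))
```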

  50. Handwritten digit recognition
