Mathematical Programming in Support Vector Machines
Olvi L. Mangasarian
University of Wisconsin - Madison
High Performance Computation for Engineering Systems Seminar
MIT, October 4, 2000
What is a Support Vector Machine?
• An optimally defined surface
• Typically nonlinear in the input space
• Linear in a higher dimensional space
• Implicitly defined by a kernel function
What are Support Vector Machines Used For?
• Classification
• Regression & data fitting
• Supervised & unsupervised learning
(Will concentrate on classification)
Outline of Talk
• Generalized support vector machines (SVMs)
  • Completely general kernel allows complex classification (no Mercer condition!)
• Smooth support vector machines (SSVM)
  • Smooth & solve SVM by a fast Newton method
• Lagrangian support vector machines (LSVM)
  • Very fast, simple iterative scheme
  • One matrix inversion: no LP, no QP
• Reduced support vector machines (RSVM)
  • Handle large datasets with nonlinear kernels
Generalized Support Vector Machines
2-Category Linearly Separable Case
[Figure: two linearly separable point sets, A+ and A-]
Generalized Support Vector Machines
Algebra of the 2-Category Linearly Separable Case
• Given m points in n-dimensional space
  • Represented by an m-by-n matrix A
• Membership of each point in class +1 or -1 specified by:
  • An m-by-m diagonal matrix D with +1 & -1 entries
• Separate by two bounding planes, $x^\top w = \gamma + 1$ and $x^\top w = \gamma - 1$:
  $A_i w \ge \gamma + 1$ for $D_{ii} = +1$, $\quad A_i w \le \gamma - 1$ for $D_{ii} = -1$
• More succinctly:
  $D(Aw - e\gamma) \ge e$,
  where $e$ is a vector of ones.
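As a concrete illustration (made-up data, not from the talk), a minimal MATLAB sketch of this algebra: build D from a ±1 label vector and check the succinct separation condition for a candidate plane.

% Minimal sketch (assumed example): check whether a candidate plane
% (w, gamma) separates the two classes with margin 1.
A = [2 3; 3 4; -1 -2; -2 -1];   % four points in 2-space (made up)
d = [1; 1; -1; -1];             % class labels
D = diag(d);                    % m-by-m diagonal label matrix
e = ones(size(A,1), 1);
w = [1; 1]; gamma = 0;          % candidate plane x'*w = gamma
separated = all(D*(A*w - e*gamma) >= e)   % 1 iff D(Aw - e*gamma) >= e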
Generalized Support Vector Machines
Maximizing the Margin between Bounding Planes
[Figure: the bounding planes $x^\top w = \gamma \pm 1$ around A+ and A-; the margin between them is $2/\|w\|$]
Generalized Support Vector Machines
The Linear Support Vector Machine Formulation
• Solve the following mathematical program for some $\nu > 0$:
  $\min_{w,\gamma,y} \ \nu e^\top y + \tfrac{1}{2} w^\top w$
  s.t. $D(Aw - e\gamma) + y \ge e, \quad y \ge 0$
• The nonnegative slack variable $y$ is zero iff:
  • Convex hulls of A+ and A- do not intersect
  • $\nu$ is sufficiently large
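For orientation only, a minimal sketch of solving this program directly with MATLAB's quadprog (Optimization Toolbox); this is not one of the talk's algorithms. A (m-by-n data), D (diagonal label matrix), and nu are assumed given as above; the stacked variable is z = [w; gamma; y].

% Minimal sketch (assumes quadprog; not the methods of the talk):
% min nu*e'*y + (1/2)*w'*w  s.t. D(A*w - e*gamma) + y >= e, y >= 0
[m, n] = size(A); e = ones(m, 1);
P = blkdiag(eye(n), 0, zeros(m));     % quadratic term: (1/2)*w'*w only
f = [zeros(n+1, 1); nu*e];            % linear term: nu*e'*y
Aineq = -[D*A, -D*e, eye(m)];         % -(D(Aw - e*gamma) + y) <= -e
bineq = -e;
lb = [-inf(n+1, 1); zeros(m, 1)];     % y >= 0; w, gamma free
z = quadprog(P, f, Aineq, bineq, [], [], lb, []);
w = z(1:n); gamma = z(n+1);           % separating plane x'*w = gamma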
Breast Cancer Diagnosis Application
97% Tenfold Cross-Validation Correctness
780 Samples: 494 Benign, 286 Malignant
Another Application: Disputed Federalist Papers
Bosch & Smith 1998
56 Hamilton, 50 Madison, 12 Disputed
Generalized Support Vector Machine Motivation
(Nonlinear Kernel Without Mercer Condition)
• Linear SVM: linear separating surface $x^\top w = \gamma$
• Set $w = A^\top D u$. Resulting linear surface:
  $x^\top A^\top D u = \gamma$
• Replace $A A^\top$ by an arbitrary nonlinear kernel $K(A, A^\top)$
• Resulting nonlinear surface:
  $K(x^\top, A^\top)\, D u = \gamma$
SSVM: Smooth Support Vector Machine
(SVM as Unconstrained Minimization Problem)
Changing the slack to the 2-norm and measuring the margin in $(w, \gamma)$ space gives:
  $\min_{w,\gamma,y} \ \tfrac{\nu}{2}\, y^\top y + \tfrac{1}{2}(w^\top w + \gamma^2)$
  s.t. $D(Aw - e\gamma) + y \ge e$
At the solution $y = (e - D(Aw - e\gamma))_+$, where $(\cdot)_+$ replaces negative components by zero, so the SVM becomes the unconstrained nonsmooth problem:
  $\min_{w,\gamma} \ \tfrac{\nu}{2}\, \|(e - D(Aw - e\gamma))_+\|^2 + \tfrac{1}{2}(w^\top w + \gamma^2)$
SSVM: The Smooth Support Vector Machine
Smoothing the Plus Function
• Integrating the sigmoid approximation $1/(1 + \exp(-\alpha x))$ of the step function gives a smooth, excellent approximation to the plus function:
  $p(x, \alpha) = x + \tfrac{1}{\alpha}\log(1 + \exp(-\alpha x))$
• Replacing the plus function in the nonsmooth SVM by this smooth approximation gives our SSVM:
  $\min_{w,\gamma} \ \tfrac{\nu}{2}\, \|p(e - D(Aw - e\gamma), \alpha)\|^2 + \tfrac{1}{2}(w^\top w + \gamma^2)$
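A quick illustration of how tight this smoothing is (made-up values; alpha = 5 is an arbitrary choice):

% Minimal sketch (illustrative): the plus function versus its smooth
% approximation p(x, alpha) obtained by integrating the sigmoid.
x = linspace(-2, 2, 9)';
alpha = 5;
plus_x = max(x, 0);                         % (x)_+
p = x + log(1 + exp(-alpha*x))/alpha;       % smooth approximation
max(abs(p - plus_x))                        % gap shrinks as alpha grows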
Newton-Armijo Algorithm for SSVM
• Newton: minimize a sequence of quadratic approximations to the strongly convex objective function, i.e. solve a sequence of linear equations in n+1 variables. (Small-dimensional input space.)
• Armijo: shorten the step between successive iterates so as to generate sufficient decrease in the objective function. (In computational reality, not needed!)
• Global quadratic convergence: starting from any point, the iterates are guaranteed to converge to the unique solution at a quadratic rate, i.e. errors get squared. (Typically 6 to 8 iterations, without an Armijo step.)
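A minimal sketch of this Newton iteration for the linear-kernel SSVM objective (an assumed implementation, not the authors' code; the gradient and Hessian follow by differentiating p). A, d (±1 labels), nu, and alpha are assumed given.

% Minimal sketch (assumed): Newton iteration on the SSVM objective
% Phi(z) = (nu/2)*||p(e - H*z, alpha)||^2 + (1/2)*z'*z, z = [w; gamma].
[m, n] = size(A); e = ones(m, 1);
H = diag(d)*[A -e];                 % so D(A*w - e*gamma) = H*z
z = zeros(n+1, 1);
for it = 1:50
    r = e - H*z;
    s = 1./(1 + exp(-alpha*r));     % sigmoid = derivative of p
    p = r + log(1 + exp(-alpha*r))/alpha;
    g = -nu*(H'*(p.*s)) + z;        % gradient
    Hs = nu*(H'*(diag(s.^2 + alpha*p.*s.*(1-s))*H)) + eye(n+1);
    z = z - Hs\g;                   % Newton step (Armijo rarely needed)
    if norm(g) < 1e-8, break; end
end
w = z(1:n); gamma = z(n+1);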
SSVM with a Nonlinear Kernel
[Figure: nonlinear separating surface $K(x^\top, A^\top) D u = \gamma$ in the input space]
Examples of Kernels
Generate Nonlinear Separating Surfaces in Input Space
• Polynomial kernel: $(A A^\top + \mu e e^\top)^d$, the $d$-th power taken componentwise
• Gaussian (radial basis) kernel: element $ij$ is $\exp(-\mu \|A_i - A_j\|^2)$
• Neural network kernel: a step function applied componentwise to $A A^\top + \mu e e^\top$
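For example, a small MATLAB helper (hypothetical, not from the talk) that forms the Gaussian kernel matrix for row-wise data A and B; later sketches reuse it. Save as gauss_kernel.m.

% Minimal sketch (illustrative): Gaussian kernel matrix K(A, B') with
% element ij = exp(-mu*||A(i,:) - B(j,:)||^2).
function K = gauss_kernel(A, B, mu)
sqA = sum(A.^2, 2);                  % squared row norms of A
sqB = sum(B.^2, 2);
% implicit expansion (R2016b+/Octave) builds the full distance matrix
K = exp(-mu*(sqA + sqB' - 2*A*B'));
end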
LSVM: Lagrangian Support Vector Machine
Dual of SVM
• Taking the dual of the SVM formulation
  $\min_{w,\gamma,y} \ \tfrac{\nu}{2}\, y^\top y + \tfrac{1}{2}(w^\top w + \gamma^2)$ s.t. $D(Aw - e\gamma) + y \ge e$
  gives the following simple dual problem:
  $\min_{0 \le u} \ \tfrac{1}{2} u^\top Q u - e^\top u$
• The variables $(w, \gamma, y)$ of SSVM are related to $u$ by:
  $w = A^\top D u, \quad \gamma = -e^\top D u, \quad y = u/\nu$
LSVM: Lagrangian Support Vector Machine
Dual SVM as Symmetric Linear Complementarity Problem
• Defining the two matrices:
  $H = D\,[A \ \ {-e}], \quad Q = I/\nu + H H^\top$
  reduces the dual SVM to:
  $\min_{0 \le u} \ \tfrac{1}{2} u^\top Q u - e^\top u$
• The optimality condition for this dual SVM is the LCP:
  $0 \le u \ \perp \ Qu - e \ge 0$
  which, by the Implicit Lagrangian theory, is equivalent to:
  $Qu - e = ((Qu - e) - \alpha u)_+$ for any $\alpha > 0$
LSVM Algorithm
Simple & Linearly Convergent – One Small Matrix Inversion
• The iteration:
  $u^{i+1} = Q^{-1}\big(e + ((Q u^i - e) - \alpha u^i)_+\big), \quad 0 < \alpha < 2/\nu$
• Key idea: the Sherman-Morrison-Woodbury formula allows the inversion of an extremely large m-by-m matrix $Q$ by merely inverting a much smaller (n+1)-by-(n+1) matrix, as follows:
  $\left(\tfrac{I}{\nu} + H H^\top\right)^{-1} = \nu\left(I - H\left(\tfrac{I}{\nu} + H^\top H\right)^{-1} H^\top\right)$
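A quick numerical check of the SMW identity on a small random H (illustrative only; in practice m is huge and only the small inversion is ever formed):

% Minimal sketch (illustrative): verify
% inv(I/nu + H*H') = nu*(I - H*inv(I/nu + H'*H)*H')
m = 200; n = 5; nu = 0.1;
H = randn(m, n);
Qinv_big   = inv(eye(m)/nu + H*H');                    % m-by-m inversion
Qinv_small = nu*(eye(m) - H*((eye(n)/nu + H'*H)\H'));  % n-by-n inversion
norm(Qinv_big - Qinv_small)                            % ~1e-12: they agree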
LSVM Algorithm – Linear Kernel
11 Lines of MATLAB Code

function [it, opt, w, gamma] = svml(A,D,nu,itmax,tol)
% lsvm with SMW for min 1/2*u'*Q*u-e'*u s.t. u=>0,
% Q=I/nu+H*H', H=D[A -e]
% Input: A, D, nu, itmax, tol; Output: it, opt, w, gamma
% [it, opt, w, gamma] = svml(A,D,nu,itmax,tol);
[m,n]=size(A); alpha=1.9/nu; e=ones(m,1); H=D*[A -e]; it=0;
S=H*inv((speye(n+1)/nu+H'*H));
u=nu*(1-S*(H'*e)); oldu=u+1;
while it<itmax & norm(oldu-u)>tol
    z=(1+pl(((u/nu+H*(H'*u))-alpha*u)-1));
    oldu=u;
    u=nu*(z-S*(H'*z));
    it=it+1;
end;
opt=norm(u-oldu); w=A'*D*u; gamma=-e'*D*u;

function pl = pl(x); pl = (abs(x)+x)/2;
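A usage sketch on made-up data (not from the talk): two shifted Gaussian clouds in 10-space, labels ±1; nu = 1 is an arbitrary choice.

% Usage sketch (assumed example) for the svml code above.
m = 1000; n = 10;
A = [randn(m/2, n) + 1; randn(m/2, n) - 1];
d = [ones(m/2, 1); -ones(m/2, 1)];
D = spdiags(d, 0, m, m);                   % sparse diagonal label matrix
[it, opt, w, gamma] = svml(A, D, 1, 100, 1e-5);
train_correctness = mean(sign(A*w - gamma) == d)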
LSVM Algorithm – Linear Kernel
Computational Results
• 2 million random points in 10-dimensional space
  • Classified in 6.7 minutes in 6 iterations to 1e-5 accuracy
  • 250 MHz UltraSPARC II with 2 gigabytes of memory
  • CPLEX ran out of memory
• 32562 points in 123-dimensional space (UCI Adult dataset)
  • Classified in 141 seconds & 55 iterations to 85% correctness
  • SVM classified in 178 seconds & 4497 iterations
  • 400 MHz Pentium II with 2 gigabytes of memory
LSVM – Nonlinear Kernel Formulation
• For the nonlinear kernel $K(G, G^\top)$, with $G = [A \ \ {-e}]$, the separating nonlinear surface is given by:
  $K([x^\top \ \ {-1}],\, G^\top)\, D u = 0$
• where $u$ is the solution of the dual problem:
  $\min_{0 \le u} \ \tfrac{1}{2} u^\top Q u - e^\top u$
• with $Q$ redefined as:
  $Q = I/\nu + D\, K(G, G^\top)\, D$
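A minimal sketch of this nonlinear variant (assumed, modeled on the linear code above; it forms Q explicitly and skips the SMW shortcut, so each step solves with the full m-by-m Q). It reuses the hypothetical gauss_kernel helper; A, D, nu, mu, and itmax are assumed given.

% Minimal sketch (assumed): LSVM iteration with an explicit kernel Q.
m = size(A, 1); e = ones(m, 1);
G = [A -e];                          % augmented data, as in the dual above
K = gauss_kernel(G, G, mu);          % m-by-m Gaussian kernel K(G, G')
Q = speye(m)/nu + D*K*D;             % Q redefined for the nonlinear case
alpha = 1.9/nu; u = Q\e;
for it = 1:itmax
    u = Q\(e + max((Q*u - e) - alpha*u, 0));   % same LSVM iteration
end
% A new point x is classified by the sign of gauss_kernel([x' -1], G, mu)*D*u.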
LSVM Algorithm – Nonlinear Kernel Application
100 Iterations, 58 Seconds on Pentium II, 95.9% Accuracy
Reduced Support Vector Machines (RSVM)
Large Nonlinear Kernel Classification Problems
• Key idea: use a rectangular kernel $K(A, \bar{A}^\top)$, where $\bar{A}$ is a small random sample of the rows of $A$
  • Typically $\bar{A}$ has 1% to 10% of the rows of $A$
• Two important consequences:
  • The nonlinear separator depends only on $\bar{u}$, the components of $u$ associated with $\bar{A}$
  • Separating surface: $K(x^\top, \bar{A}^\top)\, \bar{D} \bar{u} = \gamma$
• Using the small sample $\bar{A}$ alone gives lousy results
• RSVM can solve very large problems
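For instance, forming the rectangular kernel from a 10% random row sample (illustrative, reusing the hypothetical gauss_kernel helper from above):

% Minimal sketch (illustrative): the rectangular RSVM kernel.
m = size(A, 1);
mbar = ceil(0.10*m);                    % reduced set: 10% of the rows
idx = randperm(m); idx = idx(1:mbar);
Abar = A(idx, :);                       % small random sample of A
Krect = gauss_kernel(A, Abar, mu);      % m-by-mbar rectangular kernel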
Conventional SVM Result on Checkerboard Using 50 Random Points Out of 1000
RSVM Result on Checkerboard Using SAME 50 Random Points Out of 1000
RSVM on Large Classification Problems
Standard Error over 50 Runs = 0.001 to 0.002
RSVM Time = 1.24 * (Random Points Time)
Conclusion
• Mathematical programming plays an essential role in SVMs
• Theory
  • New formulations
    • Generalized SVMs
  • New algorithm-generating concepts
    • Smoothing (SSVM)
    • Implicit Lagrangian (LSVM)
• Algorithms
  • Fast: SSVM
  • Massive: LSVM, RSVM
Future Research
• Theory
  • Concave minimization
  • Concurrent feature & data selection
  • Multiple-instance problems
  • SVMs as complementarity problems
  • Kernel methods in nonlinear programming
• Algorithms
  • Chunking for massive classification
  • Multicategory classification algorithms
Talk & Papers Available on Web www.cs.wisc.edu/~olvi