Support vector machine

Presentation Transcript


  1. Support vector machine LING 572 Fei Xia Week 7: 2/19-2/21/08

  2. Why another learning method? • It is based on some “beautifully simple” ideas. • It produces good performance on many applications. • It has some nice properties w.r.t. training and test error rates.

  3. Kernel methods • Rich family of “pattern analysis” algorithms • Best known element is the Support Vector Machine (SVM) • Very general task: given a set of data, find patterns • Examples: • Classification • Regression • Correlation • Clustering • ….

  4. History of SVM • Linear classifier: 1962 • Use a hyperplane to separate examples • Choose the hyperplane that maximizes the minimal margin • Kernel trick: 1992

  5. History of SVM (cont) • Soft margin: 1995 • To deal with non-separable data or noise • Transductive SVM: 1999 • To take advantage of unlabeled data

  6. Main ideas • Use a hyperplane to separate the examples. • Among all the hyperplanes, choose the one with the maximum margin. • Maximizing the margin is the same as minimizing ||w|| subject to some constraints.

  7. Main ideas (cont) • For data sets that are not linearly separable, map the data to a higher-dimensional space and separate them there by a hyperplane. • The kernel trick allows the mapping to be “done” efficiently. • Soft margin deals with noise and/or inseparable data sets.

  8. Outline • Linear SVM • Maximizing the margin • Soft margin • Nonlinear SVM • Kernel trick • Using SVM

  9. Papers • (Scholkopf, 2000) • Sections 3, 4, and 7 • (Hearst et al., 1998)

  10. Linear SVM

  11. The setting • Input: x ∈ X • Output: y ∈ Y, Y = {-1, +1} • Training set: S = {(x1, y1), …, (xi, yi)} • Goal: Find a function y = f(x) that fits the data: f: X → R

  12. Basic notation • Inner product between two vectors: <w, x> • Hyperplane: <w, x> + b = 0, where ||w|| is the Euclidean norm of w • The hyperplane divides the space into the two half-spaces <w, x> + b > 0 and <w, x> + b < 0.

  13. An example of a hyperplane: <w, x> + b = 0. The line x1 + 2 x2 – 2 = 0 (w = (1, 2), b = -2) crosses the axes at (2, 0) and (0, 1); the scaled version 10 x1 + 20 x2 – 20 = 0 (w = (10, 20), b = -20) describes the same hyperplane.
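
As a minimal sketch (in Python; not from the slides), one can check numerically that the two parameterizations define the same hyperplane: scaling w and b by a positive constant never changes the sign of <w, x> + b.

```python
import numpy as np

w1, b1 = np.array([1.0, 2.0]), -2.0        # x1 + 2*x2 - 2 = 0
w2, b2 = np.array([10.0, 20.0]), -20.0     # 10*x1 + 20*x2 - 20 = 0

for x in np.array([[2.0, 0.0], [0.0, 1.0], [3.0, 1.0], [0.0, 0.0]]):
    # Both parameterizations give the same sign (points on the plane give 0).
    print(x, np.sign(w1 @ x + b1), np.sign(w2 @ x + b2))
```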

  14. Finding a hyperplane • Given the training instances, we want to find hyperplanes that separate them. • If there is more than one separating hyperplane, SVM chooses the one with the maximum margin.

  15. Maximizing the margin (figure: positive and negative training points separated by the hyperplane <w,x>+b = 0, with the margin to be maximized). Training: find w and b.

  16. Support vectors (figure: positive and negative points with the hyperplanes <w,x>+b = 1, <w,x>+b = 0, and <w,x>+b = -1; the training points lying on the margin hyperplanes <w,x>+b = ±1 are the support vectors).

  17. Minimizing ||w||²/2 subject to the constraints yi (<w, xi> + b) ≥ 1 for all i. This is a quadratic programming (QP) problem.

  18. Lagrangian: L(w, b, α) = ||w||²/2 – Σi αi [yi (<w, xi> + b) – 1], with Lagrange multipliers αi ≥ 0.

  19. The dual problem • Maximize W(α) = Σi αi – (1/2) Σi Σj αi αj yi yj <xi, xj> • Subject to αi ≥ 0 for all i and Σi αi yi = 0

  20. Finding the solution • This is a Quadratic Programming (QP) problem. • The function is convex, so there are no non-global local minima. • Solvable in polynomial time.
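
Since the dual is a small convex QP, any general-purpose solver can handle a toy instance. The sketch below is an illustrative assumption (scipy's SLSQP on made-up 2-D data), not a tool recommended by the slides, but it shows training reducing to setting the αi.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable 2-D data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T                                    # Gram matrix of inner products <xi, xj>

def neg_dual(alpha):
    # Negative of the dual objective: maximize sum(alpha) - 1/2 sum_ij ai aj yi yj Kij.
    return 0.5 * np.sum(np.outer(alpha * y, alpha * y) * K) - alpha.sum()

constraints = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i * y_i = 0
bounds = [(0.0, None)] * len(y)                        # alpha_i >= 0
result = minimize(neg_dual, np.zeros(len(y)), bounds=bounds, constraints=constraints)

alpha = result.x
w = (alpha * y) @ X                            # w = sum_i alpha_i * y_i * x_i
sv = int(np.argmax(alpha))                     # index of a support vector (largest alpha)
b = y[sv] - w @ X[sv]                          # from y_sv * (<w, x_sv> + b) = 1
print("w =", w, "b =", b)
```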

  21. The solution • Training: set the weight αi for each xi; then w = Σi αi yi xi • Decoding: f(x) = sign(Σi αi yi <xi, x> + b)

  22. Decoding example. Hyperplane: x1 + 2 x2 – 2 = 0. For x = (3, 1): 3 + 2·1 – 2 = 3 > 0, so the predicted label is +1. For x = (0, 0): 0 + 0 – 2 = -2 < 0, so the predicted label is -1.

  23. Soft margin

  24. The highlight • Problem: some data sets are not separable, or contain mislabeled examples. • Idea: split the data as cleanly as possible while maximizing the distance to the nearest cleanly split examples. • Mathematically: introduce slack variables ξi ≥ 0 that measure how much each example violates the margin.

  25. Objective function • Minimize ||w||²/2 + C Σi ξi^k such that yi (<w, xi> + b) ≥ 1 – ξi and ξi ≥ 0 for all i, where C is the penalty and k = 1 or 2.
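
As a hedged illustration of how the penalty C trades margin width against slack, here is a sketch using scikit-learn's LibSVM-backed SVC; the toy data and C values are assumptions, not from the slides.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with one mislabeled point inside the negative region.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [0.2, 0.2]])
y = np.array([-1, -1, -1, 1, 1, 1, 1])        # last label is "noise"

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C allows more slack (wider margin, noisy point tolerated);
    # large C penalizes slack heavily and shrinks the margin.
    print(C, "support vectors:", clf.support_, "training accuracy:", clf.score(X, y))
```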

  26. The dual problem • Maximize W(α) = Σi αi – (1/2) Σi Σj αi αj yi yj <xi, xj> (the same objective as before) • Subject to 0 ≤ αi ≤ C for all i and Σi αi yi = 0 (for k = 1)

  27. Non-linear SVM

  28. The highlight • Problem: some data are not linearly separable. • Intuition: transform the data from the input space to a higher-dimensional feature space and separate them there (figure: mapping from input space to feature space).

  29. Example: the two spirals, separated by a hyperplane in feature space (Gaussian kernels).

  30. Learning a non-linear SVM: • Define φ • Calculate φ(x) for each training example • Find a linear SVM in the feature space. • Problem: • The feature space can be high-dimensional or even infinite-dimensional. • Calculating φ(x) explicitly can be very inefficient or even impossible.

  31. Kernels • Kernels are functions that return inner products between the images of data points. • Kernels can often be computed efficiently even for very high dimensional spaces. • Choosing K is equivalent to choosing φ.

  32. The kernel trick • No need to know what φ is and what the feature space is. • No need to map the data to the feature space. • Define a kernel function k, and replace the dot product <x, z> with the kernel value k(x, z) in both training and testing.

  33. An example
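
One standard worked example of the kernel trick (assumed here; not necessarily the slide's own) is the degree-2 polynomial kernel on R²: the explicit map φ(x) = (x1², √2·x1·x2, x2²) satisfies <φ(x), φ(z)> = <x, z>², so the feature-space inner product can be computed entirely in the input space.

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D inputs.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = phi(x) @ phi(z)          # inner product in the feature space
rhs = (x @ z) ** 2             # kernel evaluated in the input space
print(lhs, rhs)                # both equal 1.0 here: (1*3 + 2*(-1))^2 = 1
```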

  34. Another example

  35. Common kernel functions • Linear: k(x, z) = <x, z> • Polynomial: k(x, z) = (<x, z> + c)^d • Gaussian radial basis function (RBF): k(x, z) = exp(-||x – z||² / (2σ²)) • Sigmoid: k(x, z) = tanh(a <x, z> + r)
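
A minimal sketch of these kernels as functions; the parameter defaults (c, d, σ, a, r) are illustrative choices, not values from the slides.

```python
import numpy as np

def k_linear(x, z):
    return x @ z

def k_poly(x, z, d=2, c=1.0):
    return (x @ z + c) ** d

def k_rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def k_sigmoid(x, z, a=1.0, r=0.0):
    return np.tanh(a * (x @ z) + r)
```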

  36. Linear kernel • The map φ is linear. • The kernel adjusts the weight of the features according to their importance.

  37. Other kernels • Kernels for • trees • sequences • sets • graphs • general structures • …

  38. Making kernels • The set of kernels is closed under some operations. For instance, if K1 and K2 are kernels, so are the following: • K1 + K2 • cK1 and cK2 for c > 0 • cK1 + dK2 for c > 0 and d > 0 • One can make complicated kernels from simple ones.
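
A small sketch of these closure properties, with a numerical sanity check that the combined Gram matrix has no negative eigenvalues (positive semi-definiteness, the property a kernel must have); the base kernels and sample points are assumptions.

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(5, 2))   # sample points

def gram(k):
    # Gram matrix of kernel k on the sample points.
    return np.array([[k(x, z) for z in X] for x in X])

k1 = lambda x, z: x @ z                                   # linear kernel
k2 = lambda x, z: np.exp(-np.sum((x - z) ** 2))           # RBF kernel
combined = lambda x, z: 3.0 * k1(x, z) + 0.5 * k2(x, z)   # cK1 + dK2, c, d > 0

eigs = np.linalg.eigvalsh(gram(combined))
print(eigs.min() >= -1e-10)    # no (significantly) negative eigenvalues
```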

  39. The choice of kernel function • Given a function, we can test whether it is a kernel function by using Mercer’s theorem. • Different kernel functions lead to different classifiers: e.g., polynomial classifiers, RBF classifiers. • Experimental results with different kernels can vary a lot.

  40. Decoding • Linear SVM: f(x) = sign(<w, x> + b) • Non-linear SVM: f(x) = sign(Σi αi yi k(xi, x) + b), since w could be infinite-dimensional and cannot be written out explicitly.
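
A minimal sketch of non-linear decoding via the kernel expansion; the training outputs (αi, yi, support vectors, b) and the example values below are assumed placeholders.

```python
import numpy as np

def decode(x, support_X, support_y, alpha, b, kernel):
    # f(x) = sign( sum_i alpha_i * y_i * k(x_i, x) + b ), no explicit w needed.
    s = sum(a * y_i * kernel(x_i, x) for a, y_i, x_i in zip(alpha, support_y, support_X))
    return np.sign(s + b)

# Example call with an RBF kernel (all values are placeholders):
rbf = lambda u, v: np.exp(-np.sum((u - v) ** 2))
label = decode(np.array([1.0, 2.0]),
               support_X=[np.array([0.0, 0.0]), np.array([2.0, 2.0])],
               support_y=[-1, 1], alpha=[0.5, 0.5], b=0.0, kernel=rbf)
print(label)
```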

  41. Using SVM

  42. Steps • Define features in the input space • Scale the data before training/test • Choose a kernel function • Tune parameters using cross-validation
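
A hedged end-to-end sketch of these steps with scikit-learn (which wraps LibSVM); the feature matrix, kernel choice, and parameter grid are placeholders, not recommendations from the slides.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Placeholder features and labels (step 1: define features in the input space).
X = np.random.default_rng(0).normal(size=(40, 3))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

pipe = Pipeline([("scale", MinMaxScaler(feature_range=(-1, 1))),   # step 2: scale to [-1, +1]
                 ("svm", SVC(kernel="rbf"))])                      # step 3: choose a kernel
grid = {"svm__C": [0.1, 1, 10, 100], "svm__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)                  # step 4: tune by cross-validation
print(search.best_params_)
```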

  43. Scaling the data • To prevent features with larger variance from dominating those with smaller variance. • Scale each feature to the range [-1, +1] or [0, 1]. • [0, 1] is faster than [-1, +1].

  44. Summary • Find the hyperplane that maximizes the margin. • Introduce soft margin to deal with noisy data • Implicitly map the data to a higher dimensional space. • The kernel trick allows easy computation of the dot product in the feature space.

  45. More info • Website: www.kernel-machines.org • Textbook (2000): www.support-vector.net • Tutorials: http://www.svms.org/tutorials/ • Workshops at NIPS

  46. Hw8

  47. LibSVM Package • Website: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ • On Patas: /NLP_TOOLS/svm/libsvm/latest
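
A minimal sketch of using the package through its Python interface (svmutil); the file names and option values are placeholders, and on some installs the import path is libsvm.svmutil instead.

```python
from svmutil import svm_read_problem, svm_train, svm_predict

y_train, x_train = svm_read_problem("train.libsvm")    # data in LibSVM sparse format
y_test, x_test = svm_read_problem("test.libsvm")

model = svm_train(y_train, x_train, "-t 2 -c 1")        # -t 2: RBF kernel, -c: penalty C
p_label, p_acc, p_val = svm_predict(y_test, x_test, model)
```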

  48. Hw8 • Q1: Learn to use the package • Q2: Write an SVM decoder • Q3: Incorporate the decoder into the beam search code

  49. Additional slides
