Understanding Kernel Tricks: Features Beyond Dimensions

Kernel-class Jan. 13 2005

Recap: Feature Spaces
Non-linear mapping $\Phi: x \mapsto \Phi(x) \in F$, where F is one of:
1. a high-D space
2. an infinite-D countable space
3. a function space (Hilbert space)
Example:

Recap: Kernel Trick
Note: In the dual representation we used the Gram matrix to express the solution.
Kernel Trick: Replace the inner product $\langle \Phi(x_i), \Phi(x_j) \rangle$ by a kernel $k(x_i, x_j)$, so that $G_{ij} = k(x_i, x_j)$.
If we use algorithms that only depend on the Gram matrix, G, then we never have to know (compute) the actual features $\Phi(x)$. This is the crucial point of kernel methods.
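As a minimal sketch of the kernel trick, assume the homogeneous quadratic kernel k(x, y) = (x · y)^2 on 2-D inputs, whose explicit feature map is (x1^2, sqrt(2) x1 x2, x2^2). The kernel choice and the names `phi` and `k_poly2` are illustrative assumptions, not taken from the slides. The Gram matrix computed from the kernel alone should agree with the one built from explicit features:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on 2-D inputs (assumed example)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def k_poly2(x, y):
    """Same inner product, evaluated directly in input space: (x . y)^2."""
    return float(np.dot(x, y)) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

# Gram matrix from explicit features: G_ij = <phi(x_i), phi(x_j)>
F = np.array([phi(x) for x in X])
G_features = F @ F.T

# Gram matrix from the kernel only; phi is never evaluated here
G_kernel = np.array([[k_poly2(x, y) for y in X] for x in X])

gram_matches = np.allclose(G_features, G_kernel)
print("Gram matrices agree:", gram_matches)
```

Any algorithm that consumes only `G_kernel` never needs `phi`, which matters when the feature space is very high- or infinite-dimensional.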

Recap: Properties of a Kernel
Definition: A finitely positive semi-definite function is a symmetric function of its arguments for which the matrices formed by restriction to any finite subset of points are positive semi-definite.
Theorem: A function $k(x,y)$ can be written as $k(x,y) = \langle \Phi(x), \Phi(y) \rangle$, where $\Phi$ is a feature map, iff $k(x,y)$ satisfies the semi-definiteness property.
Relevance: We can now check if k(x,y) is a proper kernel using only properties of k(x,y) itself, i.e. without the need to know the feature map!
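The semi-definiteness property can be probed numerically on a finite sample. The sketch below uses a hypothetical helper `is_psd_gram` (not from the slides) that builds the Gram matrix and inspects its smallest eigenvalue. A failed check on some sample shows that a function is not a kernel; a passed check on one sample is only evidence, not a proof.

```python
import numpy as np

def is_psd_gram(kernel, X, tol=1e-10):
    """Return True if the Gram matrix of `kernel` on the points X is
    symmetric with no eigenvalue below a small negative tolerance."""
    G = np.array([[kernel(x, y) for y in X] for x in X])
    if not np.allclose(G, G.T):
        return False
    eigvals = np.linalg.eigvalsh(G)  # eigenvalues of a symmetric matrix, ascending
    scale = max(1.0, abs(eigvals[-1]))
    return bool(eigvals[0] >= -tol * scale)

def linear(x, y):
    """Linear kernel: a valid kernel with the identity feature map."""
    return float(np.dot(x, y))

def neg_dist(x, y):
    """Negative Euclidean distance: symmetric, but not positive semi-definite."""
    return -float(np.linalg.norm(x - y))

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))

print("linear kernel passes:", is_psd_gram(linear, X))
print("negative distance passes:", is_psd_gram(neg_dist, X))
```

The negative-distance Gram matrix has a zero diagonal and non-zero off-diagonal entries, so its eigenvalues sum to zero and at least one must be negative.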

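Reproducing Kernel Hilbert Spaces
The proof of the above theorem proceeds by constructing a very special feature map (note that several different feature maps may give rise to the same kernel): $\Phi: x \mapsto k(\cdot, x)$, i.e. we map to a function space.
Definition of the function space: $f(\cdot) = \sum_i \alpha_i k(\cdot, x_i)$, with inner product $\langle f, g \rangle = \sum_{i,j} \alpha_i \beta_j k(x_i, z_j)$ for $g(\cdot) = \sum_j \beta_j k(\cdot, z_j)$.
Reproducing property: $\langle f, \Phi(x) \rangle = \langle f, k(\cdot, x) \rangle = f(x)$, and in particular $\langle \Phi(x), \Phi(y) \rangle = k(x, y)$.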

Modularity
Kernel methods consist of two modules:
1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input
Modularity: Any kernel can be used with any kernel algorithm.
Some kernel algorithms:
- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
- kernel CCA
Some kernels:
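One way to picture the two modules is an algorithm that accepts the kernel as an argument. The sketch below uses kernel ridge regression as the algorithm and a linear and a Gaussian kernel as inputs. All of these are illustrative choices rather than the slides' own examples, and the function names, the regularization value `lam`, and the Gaussian width 0.5 are assumptions.

```python
import numpy as np

def gram(kernel, A, B):
    """Matrix of kernel evaluations between the rows of A and the rows of B."""
    return np.array([[kernel(a, b) for b in B] for a in A])

def fit_kernel_ridge(kernel, X, y, lam=1e-3):
    """Algorithm module: solve (G + lam * I) alpha = y using only the Gram matrix."""
    G = gram(kernel, X, X)
    return np.linalg.solve(G + lam * np.eye(len(X)), y)

def predict(kernel, X_train, alpha, X_new):
    """Prediction f(x) = sum_i alpha_i k(x, x_i)."""
    return gram(kernel, X_new, X_train) @ alpha

# Kernel module: two interchangeable kernels (illustrative choices)
def linear(a, b):
    return float(np.dot(a, b))

def gaussian(a, b):
    return float(np.exp(-np.sum((a - b) ** 2) / 0.5))

rng = np.random.default_rng(2)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0])

errors = {}
for name, k in [("linear", linear), ("gaussian", gaussian)]:
    alpha = fit_kernel_ridge(k, X, y)  # same algorithm, different kernel
    errors[name] = float(np.mean((predict(k, X, alpha, X) - y) ** 2))

print(errors)
```

Swapping the kernel changes the hypothesis class without touching the fitting code; here the linear kernel cannot follow the sine curve, while the Gaussian kernel can.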

Niceties and Challenges
Niceties:
• Kernel algorithms are typically constrained convex optimization problems, solved with either spectral methods or convex optimization tools.
• Efficient algorithms do exist in most cases.
• The similarity to linear methods facilitates analysis. There are strong generalization bounds on test error.
Challenges:
• You need to choose the appropriate kernel.
• Kernel learning is prone to over-fitting.
• All information must go through the kernel bottleneck.

Regularization
• Regularization is very important!
• Regularization parameters are typically determined by out-of-sample measures (cross-validation, leave-one-out).
Example: Gaussian kernel with width c:
• if c is very small: G = I (all data are dissimilar): over-fitting
• if c is very large: G = 1, the all-ones matrix (all data are very similar): under-fitting
In the RKHS view we compute the overlap between 2 Gaussians with width “c”.
Demo: Trevor Hastie.
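The two limits of the Gaussian kernel can be checked numerically. The slide's exact formula did not survive extraction, so the sketch below assumes the parameterization k(x, y) = exp(-||x - y||^2 / c); the function name `gaussian_gram` and the sample points are likewise assumptions.

```python
import numpy as np

def gaussian_gram(X, c):
    """Gram matrix for the assumed form k(x, y) = exp(-||x - y||^2 / c)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared pairwise distances
    d2 = np.maximum(d2, 0.0)                          # guard against round-off
    return np.exp(-d2 / c)

# Five distinct points in the plane (arbitrary illustrative data)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]])

G_small = gaussian_gram(X, c=1e-3)  # very narrow kernel
G_large = gaussian_gram(X, c=1e6)   # very wide kernel

print(np.round(G_small, 3))  # close to the identity matrix
print(np.round(G_large, 3))  # close to the all-ones matrix
```

With G close to I, every point is similar only to itself, so the model can memorize the training set; with G close to the all-ones matrix, the points are indistinguishable and the model can barely fit anything.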

Learning Kernels
[Figure: cone containing the kernels k1 and k2]
• All information is tunneled through the Gram-matrix information bottleneck.
• The real art is to pick an appropriate kernel for the data domain.
• Warning: Since kernels can overfit, we need to regularize.
Solution: We need to learn the kernel.
Here are some ways to combine kernels to improve them: any polynomial in existing kernels with positive coefficients is again a kernel.
The parameters can be set by i) cross-validation, ii) Bayesian methods, iii) test-error bound minimization.
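A small numerical check of these closure properties, assuming a linear kernel for k1 and a Gaussian kernel for k2 (both illustrative choices, as are the weights and the helper `min_eig`): non-negative weighted sums, element-wise products, and positive-coefficient polynomials of Gram matrices should all remain positive semi-definite.

```python
import numpy as np

def min_eig(G):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(G)[0])

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 3))

# k1: linear kernel Gram matrix
K1 = X @ X.T

# k2: Gaussian kernel Gram matrix (assumed form exp(-||x - y||^2 / 2))
sq = np.sum(X ** 2, axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * K1, 0.0)
K2 = np.exp(-d2 / 2.0)

combos = {
    "sum": 0.3 * K1 + 0.7 * K2,            # non-negative weights stay in the cone
    "product": K1 * K2,                    # element-wise (Schur) product
    "polynomial": 1.0 * K2 + 0.5 * K2 * K2 + 2.0 * K1 * K2,
}

smallest = {name: min_eig(G) for name, G in combos.items()}
print(smallest)  # each value should be >= 0 up to round-off
```

The weights (0.3, 0.7, and the polynomial coefficients) are the kind of parameters that would then be tuned by cross-validation or one of the other methods listed above.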
