Input Space versus Feature Space in Kernel-Based Methods
Schölkopf, Mika, Burges, Knirsch, Müller, Rätsch, Smola
Presented by: Joe Drish
Department of Computer Science and Engineering, University of California, San Diego
Goals
Objectives of the paper
• Introduce and illustrate the kernel trick
• Discuss the kernel mapping from input space to feature space F
• Review kernel algorithms: SVMs and kernel PCA
• Discuss how to interpret the return from F to input space after the dot product computation
• Discuss how to construct sparse approximations of feature space expansions
• Evaluate and discuss the performance of SVMs and kernel PCA
Applications of kernel methods
• Handwritten digit recognition
• Face recognition
• De-noising: this paper
Definition
A reproducing kernel k is a function k: 𝒳 × 𝒳 → R.
• The domain of k consists of the data patterns {x1, …, xl}
• 𝒳 is a compact set in which the data lives
• 𝒳 is typically a subset of R^N
Computing k is equivalent to mapping the data patterns into a higher dimensional space F and then taking the dot product there.
A feature map Φ: R^N → F is a function that maps the input data patterns into the higher dimensional space F.
Illustration
Using a feature map Φ to map the data from input space into a higher dimensional feature space F.
[Figure: crosses X and circles O in input space are mapped to their images Φ(X) and Φ(O) in the feature space F]
Kernel Trick
We would like to compute the dot product in the higher dimensional space, Φ(x) · Φ(y). To do this we only need to compute k(x, y), since k(x, y) = Φ(x) · Φ(y). Note that the feature map Φ is never explicitly computed; we avoid it, and thereby avoid a burdensome computation.
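As a concrete illustration of the trick (a minimal sketch, not code from the paper), take the homogeneous polynomial kernel of degree 2 on R²: the explicit feature map Φ(x) = (x1², √2·x1·x2, x2²) gives the same dot product as simply squaring x · y.

```python
# Minimal sketch: degree-2 polynomial kernel on R^2.
# phi is the explicit feature map; k computes the same value without ever
# forming phi(x) or phi(y).
import numpy as np

def phi(x):
    """Explicit feature map into R^3 for 2-D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    """Degree-2 polynomial kernel, computed entirely in input space."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
print(np.dot(phi(x), phi(y)))  # 121.0 -- dot product in feature space
print(k(x, y))                 # 121.0 -- same value via the kernel trick
```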
Example Kernels
• Gaussian: k(x, y) = exp(−‖x − y‖² / (2σ²))
• Polynomial: k(x, y) = (x · y)^d
• Sigmoid: k(x, y) = tanh(κ(x · y) + Θ)
Nonlinear separation can be achieved.
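A hedged sketch of these three kernels as functions (the parameter values σ, d, κ, Θ below are arbitrary defaults, not values taken from the paper):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def polynomial_kernel(x, y, d=3):
    # k(x, y) = (x . y)^d
    return np.dot(x, y) ** d

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    # k(x, y) = tanh(kappa * (x . y) + theta)
    return np.tanh(kappa * np.dot(x, y) + theta)
```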
Mercer Theory: Input Space to Feature Space
Necessary condition for the kernel trick: k must satisfy Mercer's condition, so that it can be expanded as k(x, y) = Σᵢ λᵢ ψᵢ(x) ψᵢ(y) with λᵢ ≥ 0.
• N_F, the dimension of the feature space, equals the number of nonzero λᵢ – in the linear algebra analogy, the rank of the matrix Σᵢ λᵢ uᵢ uᵢᵀ built from the outer products uᵢ uᵢᵀ
• ψᵢ is the normalized eigenfunction – analogous to a normalized eigenvector
Mercer :: Linear Algebra
Linear algebra analogy:
• Eigenvector problem: A u = λ u
• Eigenfunction problem: ∫ k(x, y) ψ(y) dy = λ ψ(x)
A is a matrix; x and y are vectors
u is the normalized eigenvector; λ is the eigenvalue
ψ is the normalized eigenfunction
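The analogy can be checked numerically on a finite sample (an illustrative sketch under the assumption of a Gaussian kernel and arbitrary toy data, not code from the paper): the Gram matrix Kᵢⱼ = k(xᵢ, xⱼ) plays the role of A, and its eigendecomposition K = Σᵢ λᵢ uᵢ uᵢᵀ mirrors Mercer's expansion of k.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))  # 50 toy patterns in R^2 (arbitrary data)

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

# Gram matrix: the finite-sample analogue of the kernel k(x, y).
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])

lam, U = np.linalg.eigh(K)              # eigenvalues lambda_i, eigenvectors u_i
K_reconstructed = (U * lam) @ U.T       # sum_i lambda_i u_i u_i^T

print(np.allclose(K, K_reconstructed))  # True: K is recovered from its spectrum
print(np.sum(lam > 1e-12))              # numerically nonzero eigenvalues
```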
RKHS, Capacity, Metric
Reproducing kernel Hilbert space (RKHS)
• Hilbert space of functions f on some set X such that all evaluation functionals are continuous, and the functions can be reproduced by the kernel
Capacity of the kernel map
• Bound on how many training examples are required for learning, measured by the VC-dimension h
Metric of the kernel map
• Intrinsic shape of the manifold to which the data is mapped
Support Vector Machines
The decision boundary takes the form f(x) = sgn(Σᵢ αᵢ yᵢ k(x, xᵢ) + b)
• Similar to a single layer perceptron
• Training examples xᵢ with non-zero coefficients αᵢ are the support vectors
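A minimal sketch of evaluating that decision function, assuming the support vectors, coefficients αᵢ, labels yᵢ and bias b come from an already-trained SVM (training itself is not shown here):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=gaussian_kernel):
    """Sign of the kernel expansion over the support vectors:
    f(x) = sgn( sum_i alpha_i * y_i * k(x, x_i) + b )."""
    s = sum(a * y_i * kernel(x, sv)
            for a, y_i, sv in zip(alphas, labels, support_vectors))
    return np.sign(s + b)
```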
Kernel Principal Component Analysis
Kernel PCA carries out a linear PCA in the feature space F.
The extracted features take the nonlinear form f_k(x) = Σᵢ αᵢᵏ k(xᵢ, x).
The αᵢᵏ are the components of the k-th eigenvector of the kernel matrix Kᵢⱼ = k(xᵢ, xⱼ).
KPCA and Dot Products
We wish to find the eigenvectors V and eigenvalues λ of the covariance matrix in feature space, C = (1/l) Σⱼ Φ(xⱼ) Φ(xⱼ)ᵀ.
Again, replace Φ(x) · Φ(y) with k(x, y).
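Putting the last two slides together, a sketch of kernel PCA on a training set (an assumed implementation with a Gaussian kernel; the toy data and parameter values are placeholders, not the paper's): build the Gram matrix, center it in feature space, eigendecompose, normalize the eigenvectors so that λ_k (αᵏ · αᵏ) = 1, and read off the nonlinear features.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def kernel_pca(X, n_components=2, sigma=1.0):
    l = len(X)
    K = np.array([[gaussian_kernel(a, b, sigma) for b in X] for a in X])

    # Center the Gram matrix (equivalent to centering Phi(x_i) in feature space).
    one = np.ones((l, l)) / l
    Kc = K - one @ K - K @ one + one @ K @ one

    # Eigenvectors of Kc give the expansion coefficients alpha^k.
    lam, alpha = np.linalg.eigh(Kc)
    idx = np.argsort(lam)[::-1][:n_components]
    lam, alpha = lam[idx], alpha[:, idx]

    # Normalize so that lambda_k * (alpha^k . alpha^k) = 1.
    alpha = alpha / np.sqrt(np.maximum(lam, 1e-12))

    # Nonlinear features of the training points: f_k(x_j) = sum_i alpha_i^k Kc_ij.
    features = Kc @ alpha
    return features, alpha, lam

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # arbitrary toy data
features, alpha, lam = kernel_pca(X)
```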
From Feature Space to Input Space
Pre-image problem: given a vector Ψ in F, find a point z in input space such that Φ(z) = Ψ.
Here, Ψ is generally not in the image of Φ, so an exact pre-image need not exist.
Projection Distance Illustration
Approximate the vector Ψ ∈ F by the image Φ(z) of an input point z.
[Figure: Ψ and its approximation Φ(z) in the feature space F]
Minimizing Projection Distance
z is an approximate pre-image for Ψ if it minimizes the projection distance ρ(z) = ‖Φ(z) − Ψ‖².
Equivalently, maximize: Φ(z) · Ψ − ½ k(z, z).
For kernels where k(z, z) = 1 (e.g. the Gaussian), this reduces to maximizing Φ(z) · Ψ.
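Since Ψ is given as an expansion Ψ = Σᵢ γᵢ Φ(xᵢ), the projection distance can be evaluated purely in terms of kernel values. A sketch of that computation (assumed helper code, not from the paper), using a Gaussian kernel:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def projection_distance(z, gamma, X, sigma=1.0):
    """rho(z) = ||Phi(z) - Psi||^2 for Psi = sum_i gamma_i Phi(x_i):
    k(z, z) - 2 sum_i gamma_i k(z, x_i) + sum_{i,j} gamma_i gamma_j k(x_i, x_j)."""
    k_zz = gaussian_kernel(z, z, sigma)   # equals 1 for the Gaussian kernel
    k_zx = np.array([gaussian_kernel(z, x_i, sigma) for x_i in X])
    K = np.array([[gaussian_kernel(a, b, sigma) for b in X] for a in X])
    return k_zz - 2.0 * gamma @ k_zx + gamma @ K @ gamma
```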
Fixed-Point Iteration
So, assuming a Gaussian kernel and Ψ = Σᵢ γᵢ Φ(xᵢ):
• the γᵢ are expansion coefficients obtained from the eigenvectors of the centered Gram matrix
• the xᵢ are the input space patterns
• σ is the kernel width
Requiring no step size, we can iterate:
z_{t+1} = Σᵢ γᵢ exp(−‖z_t − xᵢ‖² / (2σ²)) xᵢ / Σᵢ γᵢ exp(−‖z_t − xᵢ‖² / (2σ²))
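A sketch of that iteration (the coefficients γᵢ, the patterns X, the width σ and the starting point z0 are assumed to be supplied by the caller, e.g. by a kernel PCA reconstruction):

```python
import numpy as np

def preimage_fixed_point(gamma, X, sigma, z0, n_iter=100):
    """Iterate z <- (sum_i w_i x_i) / (sum_i w_i), with Gaussian weights
    w_i = gamma_i * exp(-||z - x_i||^2 / (2 sigma^2))."""
    z = np.asarray(z0, dtype=float)
    for _ in range(n_iter):
        w = gamma * np.exp(-np.sum((X - z) ** 2, axis=1) / (2.0 * sigma ** 2))
        denom = np.sum(w)
        if abs(denom) < 1e-12:   # numerically unstable point: stop iterating
            break
        z = (w @ X) / denom      # weighted mean of the input patterns
    return z
```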
Kernel PCA Toy Example
Generated an artificial data set from three point sources, 100 points each.
De-noising by Reconstruction, Part One
• Reconstruction from projections onto the eigenvectors from the previous example
• Generated 20 new points from each Gaussian
• Represented them by their first n = 1, 2, …, 8 nonlinear principal components
De-noising by Reconstruction, Part Two
• The original points move in the direction of de-noising
De-noising in Two Dimensions
• A half circle and a square in the plane
• De-noised versions are the solid lines
De-noising USPS Data Patterns
• USPS handwritten digit patterns: 7291 training, 2007 test
• Pattern size: 16 × 16 pixels
• De-noising results compared for linear PCA and kernel PCA