
Kernel Methods



Presentation Transcript


  1. Kernel Methods Mesfin Adane DEMA

  2. Polynomial regression

  3. K-Nearest Neighbour

  4. Advantage of increasing the dimension: classifier of 2D data vs. classifier of 1D data

  5. Kernel Methods

  6. Kernel Methods • Used to find and study general types of relations (e.g. clustering, correlation, classification) in general types of data (vectors, images, sets). • Approach the problem by mapping the data into a higher-dimensional space in which each coordinate corresponds to one data item. • For a nonlinear feature-space mapping $\phi(x)$, the kernel function is given as $k(x, x') = \phi(x)^T \phi(x')$ (a small sketch follows).
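
A minimal sketch of the idea above: define a nonlinear feature mapping and compute the kernel as an inner product in the induced feature space. The Gaussian basis-function centres and width below are illustrative choices, not part of the slides.

```python
import numpy as np

centres = np.linspace(-1.0, 1.0, 11)   # basis-function centres mu_i (assumed)
s = 0.2                                # basis-function width (assumed)

def phi(x):
    """Map a scalar input x to a vector of Gaussian basis-function activations."""
    return np.exp(-(x - centres) ** 2 / (2 * s ** 2))

def kernel(x, xp):
    """Kernel defined as an inner product in the induced feature space."""
    return phi(x) @ phi(xp)

print(kernel(0.3, -0.1))
```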

  7. Kernel Methods • Kernel Trick (Kernel Substitution): if the input vector enters the algorithm only in the form of scalar products, then the scalar product can be replaced by some other kernel function. • Linear kernels: $k(x, x') = x^T x'$. • Stationary kernels: $k(x, x') = k(x - x')$, invariant to translations of the input space. • Homogeneous kernels (radial basis functions): $k(x, x') = k(\|x - x'\|)$, depending only on the distance between the arguments. The sketch below illustrates each family.
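
A small sketch of the three kernel families listed above. The particular functional forms chosen for the stationary and homogeneous cases (an anisotropic and an isotropic Gaussian) are illustrative, not from the slides.

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp                                    # k(x, x') = x^T x'

def stationary_kernel(x, xp, lengthscales=np.array([0.5, 2.0])):
    d = (x - xp) / lengthscales                      # depends only on x - x'
    return np.exp(-0.5 * d @ d)

def homogeneous_kernel(x, xp, s=1.0):
    r = np.linalg.norm(x - xp)                       # depends only on ||x - x'||
    return np.exp(-r ** 2 / (2 * s ** 2))

x, xp = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(linear_kernel(x, xp), stationary_kernel(x, xp), homogeneous_kernel(x, xp))
```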

  8. Dual Representations • Consider a linear regression model whose regularized sum-of-squares error function is given by $J(w) = \frac{1}{2} \sum_{n=1}^{N} \{w^T \phi(x_n) - t_n\}^2 + \frac{\lambda}{2} w^T w$. • Minimizing with respect to $w$ gives $w = -\frac{1}{\lambda} \sum_{n=1}^{N} \{w^T \phi(x_n) - t_n\}\, \phi(x_n) = \Phi^T a$, with $a_n = -\frac{1}{\lambda}\{w^T \phi(x_n) - t_n\}$, where $\Phi$ is the design matrix, whose nth row is given by $\phi(x_n)^T$.

  9. Dual Representations • Substituting $w = \Phi^T a$ into $J(w)$. • Define the Gram matrix $K = \Phi \Phi^T$, which is an $N \times N$ symmetric matrix with elements $K_{nm} = \phi(x_n)^T \phi(x_m) = k(x_n, x_m)$. • Re-writing the sum-of-squares error using the Gram matrix: $J(a) = \frac{1}{2} a^T K K a - a^T K t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T K a$.

  10. Dual Representations • Minimizing $J(a)$ with respect to $a$ and solving gives $a = (K + \lambda I_N)^{-1} t$. • Substituting back into the original linear regression model, the prediction for a new input $x$ is $y(x) = w^T \phi(x) = a^T \Phi \phi(x) = k(x)^T (K + \lambda I_N)^{-1} t$, where the vector $k(x)$ has elements $k_n(x) = k(x_n, x)$. • Note that the vector $a$ is computed by inverting an $N \times N$ matrix. • Compare: the primal solution $w = (\lambda I_M + \Phi^T \Phi)^{-1} \Phi^T t$ requires inverting an $M \times M$ matrix, where $M$ is the number of basis functions. A small sketch of the dual solution follows.
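
A minimal sketch of the dual solution above: compute $a = (K + \lambda I_N)^{-1} t$ and predict with $y(x) = k(x)^T a$. The Gaussian kernel, regularization constant and toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def k(x, xp, s=0.3):
    return np.exp(-(x - xp) ** 2 / (2 * s ** 2))     # Gaussian kernel (assumed)

X = rng.uniform(0, 1, 20)                                     # training inputs
t = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(20)     # noisy targets
lam = 1e-3                                                    # regularization lambda

K = k(X[:, None], X[None, :])                        # N x N Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(X)), t)     # dual variables a

def predict(x_new):
    return k(x_new, X) @ a                           # y(x) = k(x)^T a

print(predict(0.25))
```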

  11. Dual Representations • Even though the dual formulation requires inverting an $N \times N$ matrix, which is typically larger than the $M \times M$ matrix of the primal solution, it comes with some advantages. • The prediction $y(x)$ is expressed entirely in terms of the kernel function $k(x, x')$. • This allows us to work implicitly in feature spaces of very high, even infinite, dimensionality.

  12. Constructing Kernels • First approach: select a feature-space mapping $\phi(x)$ and use it to construct the corresponding kernel $k(x, x') = \sum_i \phi_i(x)\, \phi_i(x')$. • Examples of basis functions: • Polynomial: $\phi_i(x) = x^i$. • Gaussian: $\phi_i(x) = \exp\{-(x - \mu_i)^2 / 2s^2\}$. • Sigmoid: $\phi_i(x) = \sigma((x - \mu_i)/s)$, with $\sigma(\cdot)$ the logistic sigmoid.

  13. Constructing Kernels

  14. Constructing Kernels • Second approach: construct kernel functions directly. We need to make sure that we are selecting a valid kernel. • Valid kernels: kernels whose Gram matrix $K$ is positive semi-definite for all possible choices of the input set $\{x_n\}$. • Note that a positive semi-definite matrix is equal to its own conjugate transpose (symmetric in the real case), and all of its eigenvalues are real and non-negative. A small numerical check follows.
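
A small numerical illustration of kernel validity: build the Gram matrix for an arbitrary set of inputs and confirm it is symmetric with non-negative eigenvalues. The Gaussian kernel and random inputs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 2))                     # arbitrary input points

def k(x, xp, s=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * s ** 2))

K = np.array([[k(xi, xj) for xj in X] for xi in X])  # Gram matrix
eigvals = np.linalg.eigvalsh(K)                      # real eigenvalues (K is symmetric)
print(np.allclose(K, K.T), eigvals.min() >= -1e-10)  # symmetric, PSD up to round-off
```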

  15. Constructing Kernels • Consider a simple example: $k(x, z) = (x^T z)^2$. • Considering a 2D input space $x = (x_1, x_2)$: $(x^T z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)(z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)^T = \phi(x)^T \phi(z)$, so the kernel corresponds to an explicit feature mapping (verified numerically below). • New, more complex kernels can also be constructed by using simpler kernels as building blocks.
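
A numerical check of the 2D example above: the explicit feature map $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ reproduces $k(x, z) = (x^T z)^2$. The test points are arbitrary.

```python
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x, z = np.array([1.5, -0.7]), np.array([0.3, 2.0])
print((x @ z) ** 2, phi(x) @ phi(z))                 # the two values agree
```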

  16. Constructing Kernels • Gaussian kernel: $k(x, x') = \exp\{-\|x - x'\|^2 / 2\sigma^2\}$. • Taking the inner part: $\|x - x'\|^2 = x^T x - 2 x^T x' + x'^T x'$, so $k(x, x') = \exp\{-x^T x / 2\sigma^2\}\, \exp\{x^T x' / \sigma^2\}\, \exp\{-x'^T x' / 2\sigma^2\}$, which is valid by the kernel-construction rules. A numerical check follows. • Substituting a kernel: the scalar product $x^T x'$ can itself be replaced by a nonlinear kernel $\kappa(x, x')$, giving $k(x, x') = \exp\{-\frac{1}{2\sigma^2}(\kappa(x, x) + \kappa(x', x') - 2\kappa(x, x'))\}$.
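
A numerical check of the expansion above: the Gaussian kernel factorises into three terms, each of which is a valid kernel by the standard construction rules. Width and test points are arbitrary.

```python
import numpy as np

def gaussian_kernel(x, xp, s=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * s ** 2))

def factorised(x, xp, s=1.0):
    return (np.exp(-(x @ x) / (2 * s ** 2))
            * np.exp((x @ xp) / s ** 2)
            * np.exp(-(xp @ xp) / (2 * s ** 2)))

x, xp = np.array([0.4, -1.2]), np.array([1.0, 0.5])
print(gaussian_kernel(x, xp), factorised(x, xp))     # identical values
```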

  17. Constructing Kernels • Kernels are a powerful tool for combining generative and discriminative probabilistic approaches. • By combining the two, we can benefit from the generative model's ability to handle missing data while retaining the better performance of discriminative models on discriminative tasks. • Idea: define a kernel using a generative model and use this kernel in a discriminative approach.

  18. Constructing Kernels • Given a generative model $p(x)$, define a kernel: $k(x, x') = p(x)\, p(x')$. • Note that the kernel measures the similarity between $x$ and $x'$: it is large when both points have high probability under the model. • Taking a weighted sum over different probability distributions (with index $i$ playing the role of a latent variable): $k(x, x') = \sum_i p(x \mid i)\, p(x' \mid i)\, p(i)$. • Example: a hidden Markov model, where the sum runs over hidden state sequences. A small sketch with a Gaussian mixture follows.
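
A minimal sketch of a kernel built from a generative model, here a two-component Gaussian mixture, giving $k(x, x') = \sum_i p(x \mid i)\, p(x' \mid i)\, p(i)$. The mixture parameters are illustrative assumptions.

```python
import numpy as np

means  = np.array([-1.0, 1.0])    # component means (assumed)
sigmas = np.array([0.5, 0.8])     # component standard deviations (assumed)
priors = np.array([0.4, 0.6])     # mixing coefficients p(i) (assumed)

def component_density(x, i):
    """Gaussian density p(x | i) of mixture component i."""
    return (np.exp(-(x - means[i]) ** 2 / (2 * sigmas[i] ** 2))
            / (sigmas[i] * np.sqrt(2 * np.pi)))

def generative_kernel(x, xp):
    """k(x, x') = sum_i p(x|i) p(x'|i) p(i)."""
    return sum(component_density(x, i) * component_density(xp, i) * priors[i]
               for i in range(len(priors)))

print(generative_kernel(0.2, -0.3))
```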

  19. Radial Basis Function • Each basis function depends only on the radial distance (typically Euclidean) from a centre $\mu_j$, so that $\phi_j(x) = h(\|x - \mu_j\|)$. • Given input vectors $x_1, \dots, x_N$ and corresponding targets $t_1, \dots, t_N$. • Goal: exact interpolation, i.e. find a function $f(x)$ with $f(x_n) = t_n$ for every $n$. • Achieved by expressing $f(x)$ as a linear combination of radial basis functions, one centred on every data point: $f(x) = \sum_n w_n\, h(\|x - x_n\|)$ (see the sketch below).
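
A minimal sketch of exact interpolation with one Gaussian RBF centred on every data point: solve the linear system $H w = t$ for the weights. The Gaussian basis function, its width and the toy data are illustrative assumptions.

```python
import numpy as np

X = np.linspace(0, 1, 10)                             # input vectors x_1..x_N
t = np.cos(2 * np.pi * X)                             # corresponding targets

def h(r, s=0.15):
    return np.exp(-r ** 2 / (2 * s ** 2))             # radial basis function h(||x - x_n||)

H = h(np.abs(X[:, None] - X[None, :]))                # H_nm = h(|x_n - x_m|)
w = np.linalg.solve(H, t)                             # weights giving f(x_n) = t_n

def f(x_new):
    return h(np.abs(x_new - X)) @ w

print(f(X[3]), t[3])                                  # the training point is interpolated
```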

  20. Radial Basis Function • For input data corrupted by noise with distribution $\nu(\xi)$, the sum-of-squares error is given by $E = \frac{1}{2} \sum_n \int \{y(x_n + \xi) - t_n\}^2\, \nu(\xi)\, d\xi$. • Optimizing it, we can get $y(x) = \sum_n t_n\, h(x - x_n)$, • where $h(x - x_n) = \nu(x - x_n) / \sum_m \nu(x - x_m)$, i.e. the basis functions are normalized.

  21. Radial Basis Function

  22. Radial Basis Function • With one basis function per data point, the model is computationally costly when predicting for new data points. • Orthogonal least squares: a sequential selection procedure in which the data point chosen as the next basis-function centre is the one that gives the greatest reduction in the sum-of-squares error.

  23. Nadaraya-Watson Model

  24. Gaussian Process

  25. Gaussian Process

  26. Linear Regression Revisited • Consider $y(x) = w^T \phi(x)$. • Take a prior distribution over $w$: $p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$. • Note that this induces a probability distribution over functions $y(x)$. • In terms of vector representation: $y = \Phi w$, where $y = (y(x_1), \dots, y(x_N))^T$ and $\Phi$ is the design matrix.

  27. Linear Regression Revisited • Since $y$ is a linear combination of the Gaussian-distributed variables $w$, it is itself Gaussian: $p(y) = \mathcal{N}(y \mid 0, K)$, • where $K$ is the Gram matrix with elements $K_{nm} = k(x_n, x_m) = \frac{1}{\alpha} \phi(x_n)^T \phi(x_m)$. • Definition: a Gaussian process is defined as a probability distribution over functions $y(x)$ such that the values of $y(x)$ evaluated at an arbitrary set of points $x_1, \dots, x_N$ jointly have a Gaussian distribution.

  28. Linear Regression Revisited • Note that the joint distribution of a Gaussian process is completely specified by its mean and covariance (the mean is usually taken to be zero). • Note also that the covariance is given by the kernel function: $\mathbb{E}[y(x_n)\, y(x_m)] = k(x_n, x_m)$. • Taking the Gaussian (squared-exponential) and exponential kernel functions, for example, sample functions can be drawn from the prior, as in the sketch below.
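
A minimal sketch of drawing sample functions from a Gaussian-process prior $y \sim \mathcal{N}(0, K)$, using a Gaussian (squared-exponential) kernel and an exponential kernel. The grid, length scales and jitter term are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 100)
d = np.abs(x[:, None] - x[None, :])                   # pairwise distances

K_gauss = np.exp(-d ** 2 / (2 * 0.2 ** 2))            # Gaussian (squared-exponential) kernel
K_exp   = np.exp(-d / 0.2)                            # exponential kernel

jitter = 1e-6 * np.eye(len(x))                        # small diagonal term for numerical stability
for K in (K_gauss, K_exp):
    sample = np.linalg.cholesky(K + jitter) @ rng.standard_normal(len(x))
    print(sample[:3])                                 # first few values of one sample path
```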

  29. Linear Regression Revisited

  30. Gaussian processes for regression • Considering noise on the observed target: $t_n = y_n + \epsilon_n$. • Considering Gaussian noise: $p(t_n \mid y_n) = \mathcal{N}(t_n \mid y_n, \beta^{-1})$. • For independent noise, the joint distribution is given by $p(t \mid y) = \mathcal{N}(t \mid y, \beta^{-1} I_N)$. • Using the Gaussian process prior: $p(y) = \mathcal{N}(y \mid 0, K)$. • Marginalizing over $y$: $p(t) = \int p(t \mid y)\, p(y)\, dy = \mathcal{N}(t \mid 0, C)$.

  31. Gaussian processes for regression • Where $C$ is the covariance matrix given as $C(x_n, x_m) = k(x_n, x_m) + \beta^{-1} \delta_{nm}$. • Note the summation in the covariance: the function values and the noise are independent Gaussian sources of randomness, so their covariances simply add. • A widely used kernel function is $k(x_n, x_m) = \theta_0 \exp\{-\frac{\theta_1}{2} \|x_n - x_m\|^2\} + \theta_2 + \theta_3\, x_n^T x_m$ (built below).
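
A minimal sketch of the covariance above: the widely used kernel form plus independent noise $\beta^{-1}$ on the diagonal. The hyperparameters $\theta_0, \theta_1, \theta_2, \theta_3$, the noise precision $\beta$ and the inputs are illustrative assumptions.

```python
import numpy as np

def kernel(xn, xm, th0=1.0, th1=4.0, th2=0.0, th3=0.0):
    """theta_0 exp(-theta_1/2 (x_n - x_m)^2) + theta_2 + theta_3 x_n x_m."""
    return th0 * np.exp(-0.5 * th1 * (xn - xm) ** 2) + th2 + th3 * xn * xm

X = np.linspace(0, 1, 5)                              # training inputs
beta = 25.0                                           # noise precision (assumed)

K = kernel(X[:, None], X[None, :])                    # k(x_n, x_m)
C = K + (1.0 / beta) * np.eye(len(X))                 # C_nm = k(x_n, x_m) + beta^{-1} delta_nm
print(C)
```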

  32. Gaussian processes for regression

  33. Gaussian processes for regression • Goal of regression: predict the target $t_{N+1}$ for a new input $x_{N+1}$, given the training targets $t = (t_1, \dots, t_N)^T$. • Using the joint distribution $p(t_{N+1}) = \mathcal{N}(t_{N+1} \mid 0, C_{N+1})$. • Partitioning the covariance: $C_{N+1} = \begin{pmatrix} C_N & k \\ k^T & c \end{pmatrix}$. • The vector $k$ has elements $k(x_n, x_{N+1})$ for $n = 1, \dots, N$, and the scalar $c$ is given as $c = k(x_{N+1}, x_{N+1}) + \beta^{-1}$. • Using the conditional distribution of a partitioned Gaussian: $m(x_{N+1}) = k^T C_N^{-1} t$ and $\sigma^2(x_{N+1}) = c - k^T C_N^{-1} k$ (see the sketch below).
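
A minimal sketch of the Gaussian-process predictive distribution above: mean $m = k^T C_N^{-1} t$ and variance $\sigma^2 = c - k^T C_N^{-1} k$. The kernel, noise precision and toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def kfun(a, b, s=0.2):
    return np.exp(-(a - b) ** 2 / (2 * s ** 2))       # Gaussian kernel (assumed)

X = rng.uniform(0, 1, 15)                             # training inputs
t = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(15)
beta = 100.0                                          # noise precision (assumed)

C_N = kfun(X[:, None], X[None, :]) + np.eye(len(X)) / beta

def predict(x_star):
    k_vec = kfun(X, x_star)                           # k_n = k(x_n, x_{N+1})
    c = kfun(x_star, x_star) + 1.0 / beta             # c = k(x*, x*) + beta^{-1}
    mean = k_vec @ np.linalg.solve(C_N, t)
    var = c - k_vec @ np.linalg.solve(C_N, k_vec)
    return mean, var

print(predict(0.5))
```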

  34. Gaussian processes for regression • Note that the predictive mean and covariance depend on the vector $k$, which in turn depends on the new input $x_{N+1}$. • Note also that the kernel must be valid, so that the covariance matrix $C_N$ is positive definite and can be inverted. • The Gaussian process viewpoint is advantageous in that we can consider covariance functions that would correspond to an infinite number of basis functions.

  35. Gaussian processes for regression

  36. Gaussian processes for regression

  37. Gaussian processes for classification • Since the Gaussian process makes predictions over the entire real axis, it cannot be applied to classification directly. • Solution: apply an activation function (e.g. the logistic sigmoid) to the output $a(x)$ of the Gaussian process, so that $y = \sigma(a(x)) \in (0, 1)$ can be interpreted as a class probability. • Note that the covariance of the latent function does not include an explicit noise term, because the training labels are assumed to be correct; a small noise-like term $\nu$ is nevertheless introduced for computational (numerical) reasons: $C(x_n, x_m) = k(x_n, x_m) + \nu\, \delta_{nm}$. A small sketch follows.
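
A minimal sketch of the idea above: sample a latent function $a(x)$ from a Gaussian process and squash it through the logistic sigmoid so the output lies in (0, 1). The kernel, the noise-like term $\nu$ and the grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(-1, 1, 50)
d2 = (x[:, None] - x[None, :]) ** 2

nu = 1e-6                                             # small noise-like term for numerical stability
C = np.exp(-d2 / (2 * 0.3 ** 2)) + nu * np.eye(len(x))

a = np.linalg.cholesky(C) @ rng.standard_normal(len(x))   # sample of the latent function a(x)
p_class1 = 1.0 / (1.0 + np.exp(-a))                       # sigma(a): class-1 probabilities in (0, 1)
print(p_class1[:5])
```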

  38. Gaussian processes for classification

  39. Gaussian processes for classification • Considering the two-class problem, the predictive distribution is $p(t_{N+1} = 1 \mid t) = \int \sigma(a_{N+1})\, p(a_{N+1} \mid t)\, da_{N+1}$. • This integral is analytically intractable, so an approximation is necessary: • Variational inference • Expectation propagation • Laplace approximation

  40. Gaussian processes for classification
