
Fisher kernels for image representation & generative classification models


Presentation Transcript


  1. Fisher kernels for image representation & generative classification models Jakob Verbeek December 11, 2009

  2. Plan for this course • Introduction to machine learning • Clustering techniques • k-means, Gaussian mixture density • Gaussian mixture density continued • Parameter estimation with EM • Classification techniques 1 • Introduction, generative methods, semi-supervised • Fisher kernels • Classification techniques 2 • Discriminative methods, kernels • Decomposition of images • Topic models, …

  3. Classification • Training data consists of “inputs”, denoted x, and corresponding output “class labels”, denoted y. • The goal is to correctly predict the class label for a test input. • Learn a “classifier” f(x) from the training data that outputs the class label, or a probability distribution over the class labels. • Example: input: image; output: category label, e.g. “cat” vs. “no cat”. • Classification can be binary (two classes) or over a larger number of classes (multi-class). In binary classification we often refer to one class as “positive” and the other as “negative”. • A binary classifier creates a boundary in the input space between the regions assigned to each class.

  4. Example of classification • Given: training images and their categories. • What are the categories of these test images?

  5. Discriminative vs generative methods • Generative probabilistic methods: model the density of inputs x from each class, p(x|y); estimate the class prior probability p(y); use Bayes’ rule to infer the distribution over classes given the input. • Discriminative (probabilistic) methods: directly estimate the class probability given the input, p(y|x). • Some methods do not have a probabilistic interpretation, e.g. they fit a function f(x) and assign to class 1 if f(x)>0 and to class 2 if f(x)<0.

  6. Generative classification methods • Generative probabilistic methods: model the density of inputs x from each class, p(x|y); estimate the class prior probability p(y); use Bayes’ rule to infer the distribution over classes given the input. • Modeling the class-conditional densities over the inputs x requires selecting a model class: parametric models, such as a Gaussian (for continuous data) or Bernoulli (for binary data); semi-parametric models, such as mixtures of Gaussians or Bernoullis; non-parametric models, such as histograms over one- or multi-dimensional data, the nearest-neighbor method, or kernel density estimators. • Given the class-conditional models, classification is trivial: just apply Bayes’ rule (see the sketch below). • Adding a new class only requires adding a new class-conditional model; the existing class-conditional models stay as they are.
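As a toy illustration of this recipe, here is a minimal sketch in Python, assuming numpy and scipy are available; the choice of a single Gaussian per class and the names fit_generative / predict_posterior are illustrative, not taken from the slides.

import numpy as np
from scipy.stats import multivariate_normal

def fit_generative(X, y):
    """Fit one Gaussian per class plus class priors (a toy parametric choice)."""
    classes = np.unique(y)
    models = {}
    for c in classes:
        Xc = X[y == c]
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularize
        models[c] = (Xc.mean(axis=0), cov)
    priors = {c: np.mean(y == c) for c in classes}
    return models, priors

def predict_posterior(x, models, priors):
    """Bayes' rule: p(y|x) is proportional to p(x|y) p(y)."""
    joint = {c: priors[c] * multivariate_normal.pdf(x, mean=m, cov=S)
             for c, (m, S) in models.items()}
    Z = sum(joint.values())
    return {c: v / Z for c, v in joint.items()}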

  7. Histogram methods • Suppose we have N data points; use a histogram with C cells. • How to set the density level in each cell? Use the maximum (log-)likelihood estimator: the density in a cell is proportional to the number of points n in the cell and inversely proportional to the volume V of the cell, i.e. p = n / (N V) (see the sketch below). • Problems with the histogram method: the number of cells scales exponentially with the dimension of the data; the density estimate is discontinuous; how to choose the cell size?
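A minimal sketch of this estimator in Python, assuming numpy; the 1-D setting and the helper names are illustrative.

import numpy as np

def histogram_density(x_train, edges):
    """Maximum-likelihood histogram density: p = n / (N * V) per cell."""
    counts, _ = np.histogram(x_train, bins=edges)
    volumes = np.diff(edges)                    # cell widths (1-D "volumes")
    return counts / (len(x_train) * volumes)

def evaluate(x, edges, density):
    """Look up the density of the cell x falls into (0 outside the range)."""
    idx = np.searchsorted(edges, x, side="right") - 1
    if 0 <= idx < len(density):
        return density[idx]
    return 0.0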

  8. The ‘curse of dimensionality’ • The number of bins increases exponentially with the dimensionality of the data: a fine division of each dimension gives many empty bins, while a rough division of each dimension gives a poor density model. • A probability distribution over D binary variables takes at least 2^D values (at least 2 values for each variable). • The number of cells may be reduced by assuming independence between the components of x: the naïve Bayes model (see the sketch below). • The model is “naïve” since it assumes that all variables are independent, which is unrealistic for high-dimensional data, where variables tend to be dependent. • Although it is a poor density estimator, classification performance can still be good using the derived p(y|x).
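A minimal sketch of the naïve Bayes factorization in Python, assuming numpy; modeling each dimension with an independent 1-D histogram is an illustrative assumption, not prescribed by the slides.

import numpy as np

def fit_naive_histograms(X, n_bins=10):
    """Fit an independent 1-D histogram per dimension instead of one D-dim histogram."""
    models = []
    for d in range(X.shape[1]):
        counts, edges = np.histogram(X[:, d], bins=n_bins, density=True)
        models.append((counts, edges))
    return models

def naive_log_density(x, models):
    """log p(x) = sum_d log p(x_d) under the independence assumption."""
    logp = 0.0
    for x_d, (counts, edges) in zip(x, models):
        idx = np.clip(np.searchsorted(edges, x_d, side="right") - 1, 0, len(counts) - 1)
        logp += np.log(counts[idx] + 1e-12)     # small constant avoids log(0)
    return logp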

  9. Example of generative classification • Hand-written digit classification. Input: binary 28×28 scanned digit images, collected in a 784-dimensional vector. Desired output: class label of the image. • Generative model: an independent Bernoulli model for each class, i.e. one probability per pixel per class. • The maximum likelihood estimator is the average value of each pixel over the training images of the class. • Classify using Bayes’ rule (see the sketch below).
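A minimal sketch of the independent-Bernoulli digit classifier in Python, assuming numpy and images already binarized to {0,1}; the names and the smoothing constant eps are illustrative.

import numpy as np

def fit_bernoulli(X, y, eps=1e-3):
    """X: (N, 784) binary images, y: digit labels. ML estimate = per-class pixel mean."""
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(axis=0) for c in classes])     # (C, 784)
    mu = np.clip(mu, eps, 1 - eps)                               # avoid log(0)
    log_prior = np.log([np.mean(y == c) for c in classes])
    return classes, mu, log_prior

def classify(x, classes, mu, log_prior):
    """Bayes' rule with independent Bernoulli pixels: pick the class maximizing
       log p(y) + sum_i [ x_i log mu_i + (1 - x_i) log(1 - mu_i) ]."""
    log_lik = x @ np.log(mu).T + (1 - x) @ np.log(1 - mu).T      # (C,)
    return classes[np.argmax(log_lik + log_prior)]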

  10. k-nearest-neighbor estimation method • Idea: fix the number of samples in the cell, and find the right cell size. • The probability to find a point in a sphere A centered on x with volume v is P = ∫_A p(x′) dx′. • If the density is smooth and approximately constant in a small region, then P ≈ p(x) v. • Alternatively, estimate P from the fraction of training data falling in the sphere around x: P ≈ k / N. • Combining the two gives the estimate p(x) ≈ k / (N v).

  11. k-nearest-neighbor estimation method • Method in practice: choose k; for a given x, compute the volume v of the sphere that contains k samples; estimate the density as p(x) ≈ k / (N v). • The volume of a sphere with radius r in d dimensions is v = π^(d/2) r^d / Γ(d/2 + 1). • What effect does k have? (Data sampled from a mixture of Gaussians, plotted in green.) A larger k gives a larger region and a smoother estimate. • Selection of k: use leave-one-out cross-validation, and select the k that maximizes the data log-likelihood (a small sketch of the estimator follows below).
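A minimal sketch of the k-NN density estimate in Python, assuming numpy and scipy; the function and variable names are illustrative.

import numpy as np
from scipy.special import gamma

def knn_density(x, X_train, k):
    """p(x) ~= k / (N * v), with v the volume of the smallest sphere
       around x that contains k training points."""
    N, d = X_train.shape
    r = np.sort(np.linalg.norm(X_train - x, axis=1))[k - 1]   # distance to k-th neighbor
    v = np.pi ** (d / 2) * r ** d / gamma(d / 2 + 1)          # volume of d-dim sphere
    return k / (N * v)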

  12. k-nearest-neighbor classification rule • Use k-nearest-neighbor density estimation to find p(x|category), and apply Bayes’ rule for classification. • k-nearest-neighbor classification: find the sphere volume v that captures the k data points nearest to x, and use the same sphere for each class, giving p(x|c) ≈ k_c / (N_c v), where k_c of the k neighbors belong to class c and N_c is the number of training points of class c. • Estimate the global class priors as p(c) = N_c / N. • The class posterior distribution then simplifies to p(c|x) = k_c / k (see the sketch below).
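A minimal sketch of the resulting classification rule in Python, assuming numpy; names are illustrative.

import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k):
    """Posterior p(c|x) = k_c / k: the fraction of the k nearest neighbors in class c."""
    order = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    votes = Counter(y_train[order].tolist())
    return {c: n / k for c, n in votes.items()}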

  13. k-nearest-neighbor classification rule • Effect of k on the classification boundary: a larger number of neighbors gives larger regions and smoother class boundaries.

  14. Kernel density estimation methods • Consider a simple estimator of the cumulative distribution function: F(x) = (1/N) Σ_n 1[x_n ≤ x]. • Its derivative gives an estimator of the density function, but this is just a set of delta peaks, since the derivative is defined as p(x) = lim_{h→0} ( F(x+h) − F(x−h) ) / (2h). • Instead, consider a non-limiting value of h: p(x) ≈ (1/(2hN)) Σ_n 1[|x − x_n| ≤ h]. • Each data point adds 1/(2hN) within distance h of itself; summing these “blocks” gives the estimate (see the sketch below).
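A minimal sketch of the 1-D “block” estimator in Python, assuming numpy; names are illustrative.

import numpy as np

def block_kde(x, x_train, h):
    """Each training point contributes 1/(2hN) if it lies within distance h of x."""
    N = len(x_train)
    inside = np.abs(x_train - x) <= h
    return inside.sum() / (2 * h * N)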

  15. Kernel density estimation methods • We can use a function other than the “block” to obtain a smooth estimator. A widely used kernel function is the (multivariate) Gaussian: p(x) = (1/N) Σ_n N(x; x_n, h²I), where the contribution of each data point decreases smoothly with its distance to x. • Choice of the smoothing parameter: a larger “kernel” size gives a smoother density estimator; use the average distance between samples, or use cross-validation. • The method can be used for multivariate data, or within a naïve Bayes model (see the sketch below).
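A minimal sketch of the multivariate Gaussian kernel density estimate in Python, assuming numpy and an isotropic bandwidth h; names are illustrative.

import numpy as np

def gaussian_kde(x, X_train, h):
    """Average of Gaussians with covariance h^2 * I centered on the training points."""
    N, d = X_train.shape
    sq_dist = np.sum((X_train - x) ** 2, axis=1)
    norm = (2 * np.pi * h ** 2) ** (d / 2)
    return np.mean(np.exp(-sq_dist / (2 * h ** 2)) / norm)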

  16. Summary of generative classification methods • (Semi-)parametric models (e.g. p(data|category) is a Gaussian or a mixture): no need to store the data, but possibly too strong assumptions on the data density, which can lead to a poor fit and poor classification results. • Non-parametric models: histograms are only practical in low-dimensional spaces (<5 dimensions or so), since a high-dimensional space leads to many cells, many of which are empty; use naïve Bayes modeling in higher-dimensional cases. • k-nearest-neighbor and kernel density estimation: need to store all training data, and need to find the nearest neighbors or the points with non-zero kernel evaluation (costly). • (Figures: example histogram, k-NN, and kernel density estimates.)

  17. Discriminative vs generative methods • Generative probabilistic methods: model the density of inputs x from each class, p(x|y); estimate the class prior probability p(y); use Bayes’ rule to infer the distribution over classes given the input. • Discriminative (probabilistic) methods (next week): directly estimate the class probability given the input, p(y|x); some methods do not have a probabilistic interpretation, e.g. they fit a function f(x) and assign to class 1 if f(x)>0 and to class 2 if f(x)<0. • Hybrid generative-discriminative models: fit a density model to the data, and use properties of this model as input for a classifier. Example: Fisher vectors for image representation.

  18. Clustering for visual vocabulary construction • Clustering of local image descriptors using k-means or a mixture of Gaussians. • Recap of the image representation pipeline: extract image regions at various locations and scales; compute a descriptor for each region (e.g. SIFT); (softly) assign each descriptor to the clusters; make a histogram for the complete image by summing the vector representations of each descriptor; use this as input to the image classification method (see the sketch below). • (Figure: image regions mapped to cluster indexes.)
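A minimal sketch of the histogram step in Python, assuming numpy and precomputed k-means centers; it uses hard assignment for brevity, whereas the slides also mention soft assignment, and all names are illustrative.

import numpy as np

def bow_histogram(descriptors, centers):
    """Hard-assign each local descriptor (e.g. SIFT) to its nearest cluster
       center and count assignments: one K-dim histogram per image."""
    K = centers.shape[0]
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    assign = d2.argmin(axis=1)
    return np.bincount(assign, minlength=K).astype(float)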

  19. Fisher Vector motivation • Feature vector quantization is computationally expensive in practice. • The run-time is linear in N (the number of feature vectors, ~10^3 per image), D (the number of dimensions, ~10^2 for SIFT) and K (the number of clusters, ~10^3 for recognition). • So in total on the order of 10^8 multiplications per image, just to assign the SIFT descriptors to visual words and build the histogram of visual word counts. • Can we do this more efficiently? • Reading material: “Fisher Kernels on Visual Vocabularies for Image Categorization”, F. Perronnin and C. Dance, CVPR 2007, Xerox Research Centre Europe, Meylan.

  20. Fisher vector image representation • A MoG / k-means histogram stores only the number of points per cell, so many clusters are needed to represent the distribution of descriptors in an image, which increases the computational cost. • The Fisher vector adds 1st and 2nd order moments: a more precise description of the regions assigned to each cluster, so fewer clusters are needed for the same accuracy. • The representation is (2D+1) times larger, at the same computational cost, since the required terms are already calculated when computing the soft-assignments. • (Figure: per-cell counts vs. counts plus per-cell moments.) • q_nk: soft-assignment of image region n to cluster k (Gaussian mixture component).

  21. Image representation using Fisher kernels • General idea of the Fisher vector representation: fit a probabilistic model to the data, and use the derivative of the data log-likelihood with respect to the model parameters, ∇_θ log p(X|θ), as the data representation, e.g. for classification. [Jaakkola & Haussler, “Exploiting generative models in discriminative classifiers”, in Advances in Neural Information Processing Systems 11, 1999.] • Here, we use a Mixture of Gaussians to cluster the region descriptors, and concatenate the derivatives to obtain the data representation.

  22. Image representation using Fisher kernels • Extended representation of image descriptors using a MoG: the displacement of the descriptor from the cluster center, and the squares of the displacement from the center. • This goes from 1 number per descriptor per cluster to 1 + D + D² numbers (D = data dimension). • A simplified version is obtained when using this representation in a linear classifier and using diagonal covariance matrices, with the variances in each dimension given by a vector v_k. • These terms are defined for a single image region descriptor; summed over all descriptors they give, per cluster: 1 value, the soft count of regions assigned to the cluster; D values, the weighted average of the assigned descriptors; and D values, the weighted variance of the descriptors in all dimensions (see the sketch below).
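A minimal sketch of these per-cluster statistics in Python, assuming numpy and given MoG parameters with diagonal covariances; the normalization is simplified relative to the Perronnin & Dance formulation, and all names are illustrative.

import numpy as np

def fisher_statistics(X, pi, mu, var):
    """X: (N, D) region descriptors; pi: (K,) mixing weights;
       mu: (K, D) means; var: (K, D) diagonal variances.
       Returns soft counts (K,), weighted 1st-order (K, D) and
       2nd-order (K, D) statistics, concatenated: K * (2D + 1) values."""
    # Soft assignments q_nk of each descriptor to each Gaussian.
    log_p = (-0.5 * ((X[:, None, :] - mu[None]) ** 2 / var[None]
                     + np.log(2 * np.pi * var[None])).sum(axis=2)
             + np.log(pi)[None])                               # (N, K)
    log_p -= log_p.max(axis=1, keepdims=True)
    q = np.exp(log_p)
    q /= q.sum(axis=1, keepdims=True)                          # (N, K)

    counts = q.sum(axis=0)                                     # (K,)   0th order
    diff = (X[:, None, :] - mu[None]) / np.sqrt(var[None])     # (N, K, D)
    first = (q[:, :, None] * diff).sum(axis=0)                 # (K, D) 1st order
    second = (q[:, :, None] * (diff ** 2 - 1)).sum(axis=0)     # (K, D) 2nd order
    return np.concatenate([counts, first.ravel(), second.ravel()])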

  23. Fisher vector image representation • A MoG / k-means histogram stores only the number of points per cell, so many clusters are needed to represent the distribution of descriptors in an image. • The Fisher vector adds 1st and 2nd order moments: a more precise description of the regions assigned to each cluster, so fewer clusters are needed for the same accuracy. • The representation is (2D+1) times larger, at the same computational cost, since the required terms are already calculated when computing the soft-assignments. • The computational cost is O(NKD), as we need the differences between all clusters and all data points.

  24. Images from the categorization task PASCAL VOC • A yearly “competition” for image classification (also object localization, segmentation, and body-part localization).

  25. Fisher Vector: results • BOV-supervised learns a separate mixture model for each image class, so that some of the visual words are class-specific. • MAP: assign an image to the class whose MoG assigns maximum likelihood to the region descriptors. • The other results are based on a linear classifier over the image representations. • Similar performance is obtained using 16x fewer Gaussians; the unsupervised/universal representation works well.

  26. Plan for this course • Introduction to machine learning • Clustering techniques • k-means, Gaussian mixture density • Gaussian mixture density continued • Parameter estimation with EM • Classification techniques 1 • Introduction, generative methods, semi-supervised • Reading for next week: • Previous papers (!), nothing new • Available on course website http://lear.inrialpes.fr/~verbeek/teaching • Classification techniques 2 • Discriminative methods, kernels • Decomposition of images • Topic models, …
