
Advanced Machine Learning & Perception


Presentation Transcript


  1. Tony Jebara, Columbia University Advanced Machine Learning & Perception Instructor: Tony Jebara

  2. Tony Jebara, Columbia University Topic 13 • Manifolds Continued and Spectral Clustering • Convex Invariance Learning (CoIL) • Kernel PCA (KPCA) • Spectral Clustering & N-Cuts

  3. Tony Jebara, Columbia University Manifolds Continued • PCA: linear manifold • MDS: get inter-point distances, find 2D data with the same distances • LLE: mimic neighborhoods using low-dimensional vectors • GTM: fit a grid of Gaussians to data via a nonlinear warp • Linear PCA after nonlinear normalization/invariance of the data • Manifold is linear PCA in Hilbert space (kernels) • Spectral clustering in Hilbert space

  4. Tony Jebara, Columbia University Convex Invariance Learning • PCA is appropriate for finding a linear manifold • Variation in the data is only modeled linearly • But, many problems are nonlinear • However, the nonlinear variations may be irrelevant: • Images: morph, rotate, translate, zoom… • Audio: pitch changes, ambient acoustics… • Video: motion, camera view, angles… • Genomics: proteins fold, insertions, deletions… • Databases: fields swapped, formats, scaled… • Imagine a “Gremlin” is corrupting your data by multiplying each input vector Xt by some type of matrix At to give AtXt • Idea: remove nonlinear irrelevant variations before PCA • But, make this part of the PCA optimization, not pre-processing

  5. Tony Jebara, Columbia University Convex Invariance Learning • Example of irrelevant variation in our data: permutation in image data… each image Xt has been multiplied by a permutation matrix At by the gremlin. We must clean it up. • When we convert images to a vector, we are assuming an arbitrary, meaningless ordering (like the Gremlin mixing the order) • This arbitrary ordering causes wild nonlinearities (manifold) • We should not trust the ordering; assume the gremlin has permuted it with an arbitrary permutation matrix…

  6. Tony Jebara, Columbia University Permutation Invariance • Permutation is an irrelevant variation in our data: the gremlin is permuting the fields in our input vectors • So, view a datum as a “Bag of Vectors” instead of a single vector • i.e. a grayscale image = a set of vectors, or “Bag of Pixels”: N pixels, each a D=3 XYI tuple • Treat each input as a permutable “Bag of Pixels”

  7. Tony Jebara, Columbia University Optimal Permutation • Vectorization / rasterization: uses the index in the image to sort pixels into a large vector • If we knew the “optimal” correspondence, we could sort the pixels in the bag into a large vector more appropriately • …but we don’t know it, so we must learn it…

  8. Tony Jebara, Columbia University PCA on Permuted Data • In non-permuted vector images, linear changes & eigenvectors are additions & deletions of intensities (bad!). Translating, raising eyebrows, etc. = erasing & redrawing • In a bag of pixels (vectorized only after knowing the optimal permutation), linear changes & eigenvectors are morphings and warpings: joint spatial & intensity changes

  9. Tony Jebara, Columbia University Permutation as a Manifold • Assume the order is unknown: “Set of Vectors” or “Bag of Pixels” • Get permutational invariance (order doesn’t matter) • Can’t represent the invariance by a single ‘X’ vector point in DxN space since we don’t know the ordering • Get permutation invariance by letting ‘X’ span all possible reorderings: multiply X by an unknown A matrix (permutation or doubly-stochastic)

  10. Tony Jebara, Columbia University Invariant Paths as Matrix Ops • Move a vector along the manifold by multiplying it by a matrix • Restrict A to be a permutation matrix (operator) • The resulting manifold of configurations is an “orbit” if the A’s form a group • Or, for a smooth manifold, make A a doubly-stochastic matrix • Endow each image in the dataset with its own transformation matrix At. Each image is now a bag or manifold.

  11. Tony Jebara, Columbia University A Dataset of Invariant Manifolds • E.g. assume the model is PCA, learn a 2D subspace of 3D data • Permutation lets points move independently along paths • Find PCA after moving to form a ‘tight’ 2D subspace • More generally, move along the manifolds to improve the fit of any model (PCA, SVM, probability density, etc.)

  12. Tony Jebara, Columbia University Optimizing the Permutations • Optimize: a modeling cost & linear constraints on the matrices • Estimate the transformation parameters and the model parameters (PCA, Gaussian, SVM) • The cost on the matrices A emerges from the modeling criterion • Typically, get a convex cost over a convex hull of constraints (unique solution!) • Since the A matrices are soft permutation matrices (doubly-stochastic), the constraints are linear: every row and column of each At sums to 1 and all entries are non-negative

  13. Tony Jebara, Columbia University Example Cost: Gaussian Mean • Maximum likelihood Gaussian mean model: the cost C(A) is the trace of the covariance of the transformed data • Theorem 1: C(A) is convex in A (convex program) • Can solve via a quadratic program on the A matrices • Minimizing the trace of a covariance tries to pull the data spherically towards a common mean
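A minimal sketch of this cost, assuming (consistent with the slide) that C(A) is the trace of the sample covariance of the transformed points AtXt; the function and variable names are illustrative, not from the original lecture code.

import numpy as np

def cost_gaussian_mean(A_list, X_list):
    """Trace-of-covariance cost: pulls the permuted data towards a common mean."""
    # Each X_t is an (N x D) "bag of pixels"; each A_t is an (N x N)
    # doubly-stochastic (soft permutation) matrix acting on the rows.
    Y = np.stack([(A @ X).ravel() for A, X in zip(A_list, X_list)])  # (T, N*D)
    mu = Y.mean(axis=0)
    return float(np.sum((Y - mu) ** 2)) / len(A_list)  # = trace of the covariance

Because each A_t enters linearly in the transformed data, this cost is quadratic in the entries of the A matrices, which is what makes the quadratic-program formulation possible.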

  14. Tony Jebara, Columbia University Example Cost: Gaussian Covariance • Theorem 2: The regularized log determinant of the covariance is convex; equivalently, minimize the regularized log-determinant of the covariance of the transformed data • Theorem 3: The cost is non-quadratic but upper-boundable by a quadratic; iteratively solve a QP with a variational bound • Minimizing the determinant flattens the data into a low-volume pancake
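A hedged sketch of this covariance cost, assuming “regularized log determinant” means log|Σ + εI| of the sample covariance of the transformed data; eps and all names here are my assumptions, not the lecture’s.

import numpy as np

def cost_gaussian_cov(A_list, X_list, eps=1e-3):
    """Regularized log-determinant of the covariance of the transformed points."""
    Y = np.stack([(A @ X).ravel() for A, X in zip(A_list, X_list)])  # (T, N*D)
    Yc = Y - Y.mean(axis=0)
    cov = Yc.T @ Yc / len(A_list)
    sign, logdet = np.linalg.slogdet(cov + eps * np.eye(cov.shape[0]))
    return logdet  # minimizing this flattens the data into a low-volume "pancake"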

  15. Tony Jebara, Columbia University Example Cost: Fisher Discriminant • Find the linear Fisher discriminant model w that maximizes the ratio of between- to within-class scatter • For discriminative invariance, the transformation matrices should increase the between-class scatter (numerator) and reduce the within-class scatter (denominator) • Minimizing the above permutes the data to make classification easy
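For concreteness, a small sketch of the Fisher criterion referred to here, J(w) = (w·S_B w)/(w·S_W w) with between- and within-class scatter matrices; variable names are mine. The slide’s cost chooses the permutations so that this ratio becomes easy to maximize.

import numpy as np

def fisher_ratio(w, X_pos, X_neg):
    """Ratio of between-class to within-class scatter along direction w."""
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    S_b = np.outer(mu_p - mu_n, mu_p - mu_n)                  # between-class scatter
    S_w = ((X_pos - mu_p).T @ (X_pos - mu_p) +
           (X_neg - mu_n).T @ (X_neg - mu_n))                 # within-class scatter
    return (w @ S_b @ w) / (w @ S_w @ w)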

  16. Tony Jebara, Columbia University Interpreting C(A) • Maximum likelihood mean • Permute data towards a common mean • Maximum likelihood mean & covariance • Permute data towards a flat subspace • Pushes energy into few eigenvectors • Great as pre-processing before PCA • Fisher discriminant • Permute data towards two flat subspaces while repelling them away from each other’s means

  17. Tony Jebara, Columbia University SMO Optimization of QP • Quadratic programming is used for all C(A) since: • Gaussian mean: quadratic • Gaussian covariance: upper-boundable by a quadratic • Fisher discriminant: upper-boundable by a quadratic • Use Sequential Minimal Optimization: axis-parallel optimization, pick axes to update, ensure constraints are not violated • Soft permutation matrix: each update touches 4 row/column constraints, so change 4 entries at a time
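A rough sketch of the “4 entries at a time” step described above (my own reconstruction, not the lecture’s code): adding +d to A[i,j] and A[k,l] while subtracting d from A[i,l] and A[k,j] leaves every row and column sum unchanged, so only non-negativity limits the step; the step size d itself would come from minimizing the quadratic cost along this direction.

import numpy as np

def smo_step(A, i, k, j, l, d):
    """Move mass around a 2x2 sub-block of a doubly-stochastic matrix A."""
    # Clip d so no entry goes negative (row/column sums are preserved
    # automatically by the +/- pattern below).
    d = float(np.clip(d, -min(A[i, j], A[k, l]), min(A[i, l], A[k, j])))
    A[i, j] += d; A[k, l] += d
    A[i, l] -= d; A[k, j] -= d
    return A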

  18. Tony Jebara, Columbia University XY Digits Permuted PCA • 20 images of ‘3’ and ‘9’; each is 70 (x,y) dots with no order on the ‘dots’ • PCA compresses with the same number of eigenvectors • The convex program first estimates the permutation → better reconstruction • (Figure: Original PCA vs. Permuted PCA reconstructions)

  19. Tony Jebara, Columbia University Interpolation • Intermediate images are smooth morphs • Points are nicely corresponded • Spatial morphing versus ‘redrawing’ • No ghosting

  20. Tony Jebara, Columbia University XYI Faces Permuted PCA • 2000 XYI pixels: compress to 20 dimensions • Improves the squared error of PCA by almost 3 orders of magnitude • (Figure: Original PCA vs. permuted bag-of-XYI-pixels PCA, error axis scale ×10³)

  21. Tony Jebara, Columbia University XYI Multi-Faces Permuted PCA • Top 5 eigenvectors, each shown with +/- scaling on the eigenvector • All are just linear variations in the bag of XYI pixels • Vectorization is nonlinear and needs a huge # of eigenvectors

  22. Tony Jebara, Columbia University XYI Multi-Faces Permuted PCA • Next 5 eigenvectors, each shown with +/- scaling on the eigenvector

  23. Tony Jebara, Columbia University Kernel PCA • Replace all dot-products in PCA with kernel evaluations • Recall, we could do PCA on the DxD covariance matrix of the data or on the NxN Gram matrix of the data • For nonlinearity, do PCA on feature expansions f(x) • Instead of doing an explicit feature expansion, use a kernel, e.g. a d-th order polynomial • As usual, the kernel must satisfy Mercer’s theorem • Assume, for simplicity, all feature data is zero-mean; the eigenvalues & eigenvectors then satisfy C-bar v = λ v, where C-bar is the covariance of the feature vectors
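A small sketch of the Gram-matrix computation, assuming the homogeneous polynomial kernel k(x, x') = (x·x')^d (the slide only says “d-th order polynomial”, so the exact form is an assumption on my part).

import numpy as np

def poly_gram(X, d=2):
    """N x N Gram matrix K[i, j] = (x_i . x_j)^d for data X of shape (N, D)."""
    return (X @ X.T) ** d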

  24. Tony Jebara, Columbia University Kernel PCA • Efficiently find & use the eigenvectors of C-bar: C-bar v = λ v with C-bar = (1/N) Σ_i f(x_i) f(x_i)ᵀ • Can dot either side of the above equation with a feature vector: f(x_k)·(C-bar v) = λ f(x_k)·v • The eigenvectors are in the span of the feature vectors: v = Σ_i α_i f(x_i) • Combine the equations: K K α = N λ K α, which is satisfied when K α = N λ α

  25. Tony Jebara, Columbia University Kernel PCA • From before, we had K α = N λ α: this is an eig equation! • Get the eigenvectors α and eigenvalues of K • The eigenvalues of K are N times λ • For each eigenvector α_k there is an eigenvector v_k = Σ_i α_i^k f(x_i) • Want the eigenvectors v to be normalized: ||v_k||² = α_k·K α_k = N λ_k (α_k·α_k) = 1 • Can now use the alphas only for doing PCA projection & reconstruction!
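A minimal sketch of these two steps, assuming a Gram matrix K that is already centered (zero-mean features) and positive semi-definite; the function name is illustrative.

import numpy as np

def kpca_fit(K, q):
    """Return the top-q alpha vectors, scaled so each Hilbert-space evec has unit norm."""
    evals, evecs = np.linalg.eigh(K)                  # ascending eigenvalues of K
    evals, evecs = evals[::-1][:q], evecs[:, ::-1][:, :q]
    # K alpha = (N lambda) alpha; requiring ||v_k|| = 1 means alpha_k . K alpha_k = 1,
    # i.e. divide each unit-norm eigenvector of K by the square root of its eigenvalue.
    alphas = evecs / np.sqrt(evals)
    return alphas, evals                              # evals here equal N * lambda_k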

  26. Tony Jebara, Columbia University Kernel PCA • To compute the k’th projection coefficient of a new point f(x): v_k·f(x) = Σ_i α_i^k k(x_i, x) • Reconstruction*: *pre-image problem, a linear combination in Hilbert space goes outside the image of the feature map • Can now do nonlinear PCA and do PCA on non-vectors • Nonlinear KPCA eigenvectors satisfy the same properties as usual PCA, but in Hilbert space. These evecs: • 1) Top q have max variance • 2) Top q give the reconstruction with min mean squared error • 3) Are uncorrelated/orthogonal • 4) Top have max mutual information with the inputs
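A sketch of the projection step only (the pre-image/reconstruction problem is harder and is not attempted here); kernel is any Mercer kernel function of two vectors, and the names are mine.

import numpy as np

def kpca_project(alphas, kernel, X_train, x_new):
    """k-th KPCA coefficient of x_new is sum_i alphas[i, k] * kernel(x_i, x_new)."""
    k_vec = np.array([kernel(xi, x_new) for xi in X_train])   # (N,)
    return alphas.T @ k_vec                                    # one coefficient per kept evec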

  27. Tony Jebara, Columbia University Centering Kernel PCA • So far, we had assumed the data was zero-mean • We want the Gram matrix of the mean-subtracted features, K-tilde • How to do this without touching the feature space? Use kernels… • Can get the α eigenvectors from K-tilde by adjusting the old K: K-tilde = K − 1_N K − K 1_N + 1_N K 1_N, where 1_N is the NxN matrix with every entry equal to 1/N
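A short sketch of that adjustment, using the standard centering identity written above; the function name is illustrative.

import numpy as np

def center_gram(K):
    """Gram matrix of mean-subtracted features, computed from kernel values only."""
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)   # the 1_N matrix: every entry is 1/N
    return K - one @ K - K @ one + one @ K @ one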

  28. Tony Jebara, Columbia University 2D KPCA • KPCA on a 2D dataset • Left-to-right: the kernel polynomial order goes from 1 to 3 (1 = linear = PCA) • Top-to-bottom: from the top eigenvector to weaker eigenvectors

  29. Tony Jebara, Columbia University Kernel PCA Results • Use the coefficients of the KPCA for training a linear SVM classifier to recognize chairs from their images • Use various polynomial kernel degrees, where 1 = linear as in regular PCA

  30. Tony Jebara, Columbia University Kernel PCA Results • Use the coefficients of the KPCA for training a linear SVM classifier to recognize characters from their images • Use various polynomial kernel degrees, where 1 = linear as in regular PCA (worst case in the experiments) • Inferior performance to nonlinear SVMs (why??)

  31. Tony Jebara, Columbia University Spectral Clustering • Typically, use EM or k-means to cluster N data points • Can imagine clustering the data points only from an NxN matrix capturing their proximity information: this is spectral clustering • Again compute the Gram matrix using, e.g., an RBF kernel • Example: we have N pixels from an image, each x = [xcoord, ycoord, intensity] • The eigenvectors of the K matrix (or a slight variant) seem to capture a segmentation or clustering of the data points! • A nonparametric form of clustering, since we didn’t assume a Gaussian distribution…
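A hedged sketch of the basic recipe on this slide, written in the style of normalized spectral clustering (normalized affinities, top eigenvectors, then k-means); the lecture’s N-cuts variant may differ in details, and sigma and all names here are my assumptions.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_cluster(X, n_clusters=2, sigma=1.0):
    """Cluster the rows of X (e.g. one [xcoord, ycoord, intensity] row per pixel)."""
    K = np.exp(-cdist(X, X, 'sqeuclidean') / (2.0 * sigma ** 2))   # RBF Gram matrix
    d = K.sum(axis=1)
    L = K / np.sqrt(np.outer(d, d))                   # normalized affinity D^-1/2 K D^-1/2
    evals, evecs = np.linalg.eigh(L)
    V = evecs[:, -n_clusters:]                        # top eigenvectors as new coordinates
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # normalize rows, then cluster them
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(V)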

  32. Tony Jebara, Columbia University Stability in Spectral Clustering • A standard problem when computing & using eigenvectors: small changes in the data can cause the eigenvectors to change wildly • Ensure the eigenvectors we keep are distinct & stable: look at the eigengap… • Some algorithms ensure the eigenvectors will have a safe eigengap: adjust or process the Gram matrix so the eigenvectors stay stable • (Figure: keeping 3 evecs is unsafe when the eigengap is small, safe when the gap is large)
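A tiny sketch of the eigengap check being described; the threshold value and names are my own, not from the lecture.

import numpy as np

def eigengap_is_safe(K, q, gap_tol=0.1):
    """Keep q eigenvectors only if eigenvalue q is well separated from eigenvalue q+1."""
    evals = np.sort(np.linalg.eigvalsh(K))[::-1]   # descending order
    return (evals[q - 1] - evals[q]) > gap_tol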

  33. Tony Jebara, Columbia University Stabilized Spectral Clustering • Stabilized spectral clustering algorithm:

  34. Tony Jebara, Columbia University Stabilized Spectral Clustering • Example results compared to other clustering algorithms (traditional k-means, unstable spectral clustering, connected components)
