
Principal Component Analysis


Presentation Transcript


  1. Principal Component Analysis CSE 4310 – Computer Vision Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

  2. Images as High-Dimensional Vectors • Consider these five images. • Each of them is a 100x100 grayscale image. • What is the dimensionality of each image?

  3. Do We Need That Many Dimensions? • Consider these five images. • Each of them is a 100x100 grayscale image. • What is the dimensionality of each image? 10,000. • However, each image is generated by: • Picking an original image (like the image on the left). • Translating (moving) the image up or down by a certain amount. • Translating the image left or right by a certain amount. • Rotating the image by a certain degree. • If we know the original image, to reconstruct any other image we just need three numbers: the vertical translation, the horizontal translation, and the rotation angle.
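
The generative process described on slide 3 can be sketched in a few lines of MATLAB. This is only an illustration, not code from the course: it assumes the Image Processing Toolbox (for imrotate and imtranslate) and a hypothetical file original_3.png holding the 100x100 original image.

% Hypothetical sketch: generate one of the five images from the original,
% using a rotation angle and a horizontal/vertical translation.
original = im2double(imread('original_3.png'));   % assumed 100x100 grayscale image

tx = 7;        % translation in x (pixels)
ty = -4;       % translation in y (pixels)
theta = 25;    % rotation angle (degrees)

rotated = imrotate(original, theta, 'bilinear', 'crop');   % rotate, keep the 100x100 size
generated = imtranslate(rotated, [tx, ty]);                % then translate

% As a vector, the generated image has 100*100 = 10,000 dimensions...
image_vector = reshape(generated, [], 1);
% ...but it is fully determined by just three numbers: tx, ty, theta.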

  4. Dimensionality Reduction • The goal of dimensionality reduction methods is to build models that allow representing high-dimensional vectors using a smaller number of dimensions. • Hopefully, a much smaller number of dimensions. • The model is built using training data. • In this example, the model consists of: • A projection function P. Given an input image, it outputs the corresponding translation and rotation parameters. • A backprojection function B. Given the translation and rotation parameters, it outputs the corresponding image.

  5. Dimensionality Reduction • The goal of dimensionality reduction methods is to build models that allow representing high-dimensional vectors using a smaller number of dimensions. • Hopefully, a much smaller number of dimensions. • The model is built using training data. • In this example, the model consists of: • A projection function P. Given an input image, it outputs the corresponding translation and rotation parameters. • A backprojection function B. Given the translation and rotation parameters, it outputs the corresponding image. • If we have a lossless projection function, then B(P(V)) = V for every input V. • Typically, projection functions are lossy. • We try to find P and B so that B(P(V)) tends to be close to V.

  6. Linear Dimensionality Reduction • Linear dimensionality reduction methods use linear functions for projection and backprojection. • Projection: P(V) = A * V, where: • V is a column vector of D dimensions. • A is a d x D matrix, where hopefully d << D (d is much smaller than D). • This way, P projects D-dimensional vectors to d-dimensional vectors. • Advantage of linear methods: there are well-known methods for finding a good A. We will study some of those methods. • Disadvantage: they cannot capture non-linear transformations, such as the image translations and rotations of our example.
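
A minimal sketch of what linear projection and backprojection look like in MATLAB, using a placeholder matrix (in PCA, the matrix will come from the eigenvectors computed later in these slides). The assumption that the transpose of the projection matrix can serve as the backprojection holds when its rows are orthonormal, which is the case for PCA.

D = 10000;               % original dimensionality
d = 3;                   % reduced dimensionality, hopefully d much smaller than D
A = orth(randn(D, d))';  % placeholder d x D projection matrix with orthonormal rows

V = randn(D, 1);                  % a D-dimensional column vector
projected = A * V;                % d-dimensional representation
backprojected = A' * projected;   % back in D dimensions; generally not equal to V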

  7. Intrinsic Dimensionality • Sometimes, high dimensional data is generated using some process that uses only a few parameters. • The translated and rotated images of the digit 3 are such an example. • In that case, the number of those few parameters is called the intrinsic dimensionality of the data. • It is desirable (but oftentimes hard) to discover the intrinsic dimensionality of the data.

  8. Lossy Dimensionality Reduction • Suppose we want to project all points to a single line. • This will be lossy. • What would be the best line?

  9. Lossy Dimensionality Reduction • Suppose we want to project all points to a single line. • This will be lossy. • What would be the best line? • Optimization problem. • The number of choices is infinite. • We must define an optimization criterion.

  10. Optimization Criterion • Consider a pair of 2-dimensional points: V1 and V2. • Let P map each 2D point to a point on a line. • So, P(V1) and P(V2) are points on that line. • Define D1 = ||V1 - V2||^2. • Squared distance from V1 to V2. • Define D2 = ||P(V1) - P(V2)||^2. • Define error function E(V1, V2) = D1 - D2. • Will E(V1, V2) ever be negative?

  11. Optimization Criterion • Consider a pair of 2-dimensional points: V1 and V2. • Let P map each 2D point to a point on a line. • So, P(V1) and P(V2) are points on that line. • Define D1 = ||V1 - V2||^2. • Squared distance from V1 to V2. • Define D2 = ||P(V1) - P(V2)||^2. • Define error function E(V1, V2) = D1 - D2. • Will E(V1, V2) ever be negative? • NO: E(V1, V2) >= 0 always. Projecting to fewer dimensions can only shrink distances.

  12. Optimization Criterion • Now, consider all the points in our dataset: V1, V2, ..., VN. • Define error function E(P) as: • E(P) = the sum of E(Vi, Vj) = ||Vi - Vj||^2 - ||P(Vi) - P(Vj)||^2 over all pairs of points Vi, Vj. • Interpretation: Error function E(P) measures how well projection P preserves distances.

  13. Optimization Criterion • Now, consider all the points in our dataset: V1, V2, ..., VN. • Define error function E(P) as: • E(P) = the sum of E(Vi, Vj) = ||Vi - Vj||^2 - ||P(Vi) - P(Vj)||^2 over all pairs of points Vi, Vj. • Suppose that P perfectly preserves distances. • Then, for every pair: ||P(Vi) - P(Vj)|| = ||Vi - Vj||. • In that case, E(P) = ???

  14. Optimization Criterion • Now, consider all the points in our dataset: V1, V2, ..., VN. • Define error function E(P) as: • E(P) = the sum of E(Vi, Vj) = ||Vi - Vj||^2 - ||P(Vi) - P(Vj)||^2 over all pairs of points Vi, Vj. • Suppose that P perfectly preserves distances. • Then, for every pair: ||P(Vi) - P(Vj)|| = ||Vi - Vj||. • In that case, E(P) = 0. • In the example shown in the figure, obviously E(P) > 0.
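
The error function above translates directly into MATLAB. A sketch, not from the slides, assuming points is the 2xN matrix of 2D points and projected is the corresponding 1xN row of projected values:

% E(P): sum, over all pairs of points, of the amount by which
% the projection shrinks their squared distance.
number = size(points, 2);
error_P = 0;
for i = 1:number
    for j = 1:number
        original_dist2 = sum((points(:, i) - points(:, j)) .^ 2);
        projected_dist2 = (projected(i) - projected(j)) ^ 2;
        error_P = error_P + (original_dist2 - projected_dist2);
    end
end
% error_P is always >= 0, and it is 0 only if P preserves all pairwise distances.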

  15. Optimization Criterion: Preserving Distances • We have defined an error function E(P) that tells us how good a linear projection P is. • Therefore, the best line projection is the one that minimizes E(P).

  16. Optimization Criterion: Preserving Distances • We have defined an optimization criterion that measures how well a projection preserves the pairwise distances of the original data. • Another criterion we could use: minimizing the sum of backprojection errors: • the sum of ||Vi - B(P(Vi))||^2 over all points Vi. • We will not prove it here, but the criterion of preserving distances is mathematically equivalent to minimizing backprojection errors.
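
For comparison, the backprojection-error criterion can be sketched the same way, assuming backprojected is a 2xN matrix holding B(P(Vi)) for each column Vi of points:

% Sum of backprojection errors: how far each point is from its reconstruction.
backprojection_error = 0;
for i = 1:size(points, 2)
    backprojection_error = backprojection_error + ...
        sum((points(:, i) - backprojected(:, i)) .^ 2);
end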

  17. Finding the Best Projection: PCA • First step: center the data. • (Figures: points, centered_points.)

  18. Finding the Best Projection: PCA • First step: center the data.

% Each column of points is a vector.
% Each vector in our dataset is a column in points.
number = size(points, 2);

% note that we are transposing twice
average = [mean(points')]';

centered_points = zeros(size(points));
for index = 1:number
    centered_points(:, index) = points(:, index) - average;
end

plot_points(centered_points, 2);

  19. Finding the Best Projection: PCA • Second step: compute the covariance matrix.

% Each column of centered_points is a vector from our
% centered dataset.
covariance_matrix = centered_points * centered_points';

  20. Finding the Best Projection: PCA • Second step: compute the covariance matrix.

% Each column of centered_points is a vector from our
% centered dataset.
covariance_matrix = centered_points * centered_points';

• Third step: compute the eigenvectors and eigenvalues of the covariance matrix.

[eigenvectors, eigenvalues] = eig(covariance_matrix);

  21. Eigenvectors and Eigenvalues

eigenvectors =
    0.4837   -0.8753
   -0.8753   -0.4837

eigenvalues =
    2.0217         0
         0   77.2183

• Each eigenvector v is a column that specifies a line going through the origin. • The importance of the i-th eigenvector is reflected by the i-th eigenvalue. • Second eigenvalue = 77, first eigenvalue = 2 => the second eigenvector is far more important.

  22. Eigenvectors and Eigenvalues

eigenvectors =
    0.4837   -0.8753
   -0.8753   -0.4837

eigenvalues =
    2.0217         0
         0   77.2183

• Suppose we want to find the optimal one-dimensional projection ("optimal" according to the criteria we defined earlier). • The eigenvector with the highest eigenvalue is the best line for projecting our data.

  23. Eigenvectors and Eigenvalues

eigenvectors =
    0.4837   -0.8753
   -0.8753   -0.4837

eigenvalues =
    2.0217         0
         0   77.2183

• In higher dimensions: • Suppose that each vector is D-dimensional. • Suppose that we want to project our vectors to a d-dimensional space (where d < D). • Then, the optimal subspace to project to is defined by the d eigenvectors with the highest eigenvalues.
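
As a sketch of the higher-dimensional case, assuming vectors is a DxN data matrix and that average and eigenvectors have been produced by the compute_pca function shown later (slide 26), so the eigenvectors are already sorted by decreasing eigenvalue:

d = 3;                                     % target dimensionality, d < D
top_eigenvectors = eigenvectors(:, 1:d);   % D x d matrix of the top d eigenvectors

% Project every (centered) vector to d dimensions.
centered = vectors - repmat(average, 1, size(vectors, 2));
projected = top_eigenvectors' * centered;  % d x N matrix of low-dimensional codes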

  24. Visualizing the Eigenvectors • black: v1 (eigenvalue = 2.02) • red: v2 (eigenvalue = 77.2) • Which of these two lines is better to project our points to?

plot_points(points, 1);
p1 = eigenvectors(:, 1);
p2 = eigenvectors(:, 2);
plot([0, p1(1)], [0, p1(2)], 'k-', 'linewidth', 3);
hold on;
plot([0, p2(1)], [0, p2(2)], 'r-', 'linewidth', 3);

  25. Visualizing the Eigenvectors • black: v1 (eigenvalue = 2.02) • red: v2 (eigenvalue = 77.2) • Which of these two lines is better to project our points to? • The red line clearly would preserve more information.

plot_points(points, 1);
p1 = eigenvectors(:, 1);
p2 = eigenvectors(:, 2);
plot([0, p1(1)], [0, p1(2)], 'k-', 'linewidth', 3);
hold on;
plot([0, p2(1)], [0, p2(2)], 'r-', 'linewidth', 3);

  26. PCA Code

function [average, eigenvectors, eigenvalues] = ...
    compute_pca(vectors)

number = size(vectors, 2);

% note that we are transposing twice
average = [mean(vectors')]';

centered_vectors = zeros(size(vectors));
for index = 1:number
    centered_vectors(:, index) = vectors(:, index) - average;
end

covariance_matrix = centered_vectors * centered_vectors';
[eigenvectors, eigenvalues] = eig(covariance_matrix);

% eigenvalues is a matrix, but only the diagonal
% matters, so we throw away the rest
eigenvalues = diag(eigenvalues);
[eigenvalues, indices] = sort(eigenvalues, 'descend');
eigenvectors = eigenvectors(:, indices);
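
A hypothetical usage example for the function above, assuming points is the 2xN matrix of 2D points used in the earlier figures:

% Run PCA on the 2D dataset and inspect the result.
[average, eigenvectors, eigenvalues] = compute_pca(points);

top_eigenvector = eigenvectors(:, 1);   % eigenvector with the highest eigenvalue
disp(eigenvalues);                      % sorted in descending order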

  27. PCA Projection of 2D Points to 1D • The compute_pca function shows how to compute: • All eigenvectors and corresponding eigenvalues. • The average of all vectors in our dataset. • Important: note that the eigenvectors returned by compute_pca are sorted in decreasing order of their eigenvalues. • Suppose that we have applied this function to our 2D point dataset. • How can we get the top eigenvector from the result? • The top eigenvector is simply the first column of eigenvectors (the second return value).

  28. PCA Projection of 2D Points to 1D • Suppose that we have computed the eigenvectors, and now we want to project our 2D points to 1D numbers. • Suppose that P1 is the first eigenvector (i.e., the eigenvector with the highest eigenvalue). • Projection: P(V) = <V - avg, P1> = P1' * (V - avg) • Dot product between (V - avg) and P1. • NOTE: The eigenvectors that Matlab returns have unit norm. • So, projection of a 2D vector to a 1D number is done in two steps: • First, center the vector by subtracting the average computed by compute_pca. • Second, take the dot product of the centered vector with the top eigenvector.

  29. Example: From 2 Dimensions to 1 • Our original set of 2D vectors.

  30. Example: From 2 Dimensions to 1 • We run compute_pca, and we compute the first eigenvector, which we call p1. • The black line shows the direction of p1.

  31. Example: From 2 Dimensions to 1 • We choose a point v4 = [-1.556, 0.576]'. • Shown in red. • We will compute the PCA projection of v4.

  32. Example: From 2 Dimensions to 1 • centered_v4 = v4 - average. • Shown in cyan.

  33. Example: From 2 Dimensions to 1 • projection = p1' * (centered_v4); • result: projection = 1.43

  34. Example: From 2 Dimensions to 1 • Note: the projection is a single number. • One way to visualize this number is shown in pink: it is the point on the x axis with x=projection.

  35. Example: From 2 Dimensions to 1 • A more intuitive way is to show the projection of the centered point (shown in cyan) on the black line. • This point is: projection * p1 = p1' * (centered_v4) * p1

  36. Example: From 2 Dimensions to 1 • b1 = projection * p1; • shown in red, on top of black line. • How are b1 and projection related?

  37. Example: From 1 Dimension to 2 • b1 = projection * p1; • shown in red, on top of black line. • projection = distance of b1 from the origin.

  38. Backprojection • In the previous slides we saw how to compute the projection P(V) from 2D to 1D: P(V) = <V - avg, P1> = P1' * (V - avg) • Another useful operation is the backprojection: • In backprojection, we are given P(V), and based on that we try to estimate V as best as we can.

  39. Backprojection • Obviously, it is impossible to estimate V with certainty given P(V). • In our example: • P(V) has how many dimensions? • V has how many dimensions?

  40. Backprojection • Obviously, it is impossible to estimate V with certainty given P(V). • In our example: • P(V) has how many dimensions? 1. • V has how many dimensions? 2. • An infinite number of points will project to P(V). • What backprojection gives us is the “best estimate”: the estimate with the smallest squared error (averaged over all vectors V in our dataset). • The backprojection formula for our 2D to 1D example is: B(P(V)) = P1 * P(V) + average
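
Putting the projection and backprojection formulas together, here is a small sketch. The assumptions: p1 and average come from compute_pca on the 2D dataset, and v is any 2D column vector.

% Project v to 1 dimension, then backproject to get the best 2D estimate.
projection = p1' * (v - average);       % P(V): a single number
estimate = p1 * projection + average;   % B(P(V)): a 2D point

% The backprojection error measures how much information was lost.
backprojection_error = sum((v - estimate) .^ 2);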

  41. Backprojection from 1D to 2D • Input: P(V) = 1.43, which is just a number.

  42. Backprojection from 1D to 2D • Step 1: map this number on the line corresponding to the top eigenvector: b1 = P(V) * p1. • The result is the red point on the black line.

  43. Backprojection from 1D to 2D • Step 2: add to b1 the average of our dataset. • The result is shown in green.

  44. Backprojection from 1D to 2D • Step 2: add to b1 the average of our dataset. • The result is shown in green. • This is it, the green point is B(P(V)), it is our best estimate.

  45. Example Application: PCA on Faces • In this example, the data are face images, like:

  46. Example Application: PCA on Faces • Each image has size 31x25 pixels. • Therefore, each image is represented as a 775-dimensional vector.

  47. PCA on Faces • Motivation: If a face is a 31x25 window, we need 775 numbers to describe the face. • With PCA, we can store (approximately) the same information with far fewer numbers. • One benefit is that we can do much faster computations, using fewer numbers. • Another benefit is that PCA provides useful information for face detection and face recognition. • How? Using the backprojection error. • The backprojection error measures the sum-of-squares error between a vector V and its backprojection B(P(V)). • It shows how much of the information in V is lost by P(V).
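
A sketch of the backprojection error computation for a single face vector. The names here are hypothetical; the assumptions are that face is a 775-dimensional column vector (a 31x25 window reshaped into a vector, as on slide 46), and that average and eigenvectors were computed by compute_pca on the face dataset.

d = 10;                                    % number of eigenvectors to keep
top_eigenvectors = eigenvectors(:, 1:d);   % 775 x 10 matrix

centered = face - average;
projection = top_eigenvectors' * centered;            % P(V): 10 numbers instead of 775
estimate = top_eigenvectors * projection + average;   % B(P(V)): 775-dimensional estimate

% Sum-of-squares backprojection error: how much of the information in V is lost by P(V).
backprojection_error = sum((face - estimate) .^ 2);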

  48. PCA vs Template Matching • If we use template matching to detect faces, what is the perfect face (easiest to be detected, gives the best score)? • How about PCA?

  49. PCA vs Template Matching • Template matching (assuming normalized correlation): • The template of the face is perfect. • The only other faces that are perfect are faces that, after we normalize for brightness and contrast, become equal to the normalized template. • This approach is very restrictive. • Out of all normalized images, only one would qualify as a “perfect face”.

  50. PCA vs Template Matching • Just to make it concrete: • As we said before, we have a face dataset where each face is a 775-dimensional vector. • Suppose that we use PCA to project each face to a 10-dimensional vector. • When we do face detection, for every image subwindow V that we consider, we compute its PCA projection P(V) to 10 dimensions. • Then what? Can you guess how we would compute a detection score? Can you guess what type of faces would give a perfect score?
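
One plausible answer, consistent with the use of the backprojection error mentioned on slide 47 (this is a sketch, not necessarily the exact scheme the slides have in mind): score each subwindow by how well it can be reconstructed from its 10-dimensional PCA projection. Windows with small backprojection error look face-like, and a window that the backprojection reproduces exactly would get a perfect score.

% Hypothetical detection score for one candidate subwindow V (775 x 1 vector),
% reusing top_eigenvectors and average from the previous sketch.
projection = top_eigenvectors' * (V - average);            % 10-dimensional P(V)
reconstruction = top_eigenvectors * projection + average;  % B(P(V))
score = -sum((V - reconstruction) .^ 2);   % higher score means smaller backprojection error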
