
Multivariate Statistics



Presentation Transcript


  1. Multivariate Statistics Principal Component Analysis W. M. van der Veld University of Amsterdam

  2. Overview • Eigenvectors and eigenvalues • Principal Component Analysis (PCA) • Visualization • Example • Practical issues

  3. Eigenvectors and eigenvalues • Let A be a square matrix of order n x n. • It can be shown that vectors exist such that Ak = λk, with λ some scalar, where k is the eigenvector and λ the eigenvalue. • The eigenvectors k and eigenvalues λ have many applications, but we will only use them in this course for principal component analysis. • They also play a role in cluster analysis, canonical correlations, and other methods.

  4. Eigenvectors and eigenvalues • So, for the system of equations Ak = λk, only A is known. • We have to solve for k and λ to find the eigenvector and eigenvalue. • It is not possible to solve this set of equations straightforwardly with the method described last week, since m < n. • The trivial solution k = 0 is excluded. • A solution can however be found under certain conditions. • First an example to get a feeling for the equation.

  5. Eigenvectors and eigenvalues • An example: Ak = λk. Let A be the 2 x 2 matrix shown on the slide. • One solution for k is the eigenvector belonging to λ1; another solution for k is the eigenvector belonging to λ2 (both shown on the slide).

  6. Eigenvectors and eigenvalues • How did we find the eigenvectors? • Before that we first have to find the eigenvalues! • From Ak = λk it follows that Ak - λk = 0, which is: • (A - λI)k = 0. • Since k = 0 is excluded, there seems to be no solution. • However, for homogeneous equations a solution can be found when rank(A - λI) < n, and this is only the case when |A - λI| = 0, which is called the characteristic equation. • This (|A - λI| = 0) is what I meant by certain conditions! • We can now easily solve for λ.

  7. Eigenvectors and eigenvalues • Expanding this determinant gives an equation in λ; for our 2 x 2 example it is a quadratic whose roots are λ1 = 5 and λ2 = 2.

  8. Eigenvectors and eigenvalues • It is now a matter of substituting λ into (A - λI)k = 0; start with λ1 = 5. • Note that any scalar multiple of k would also satisfy the equation!

  9. Eigenvectors and eigenvalues • The same for λ2 = 2.
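
  A quick numerical check of this 2 x 2 walkthrough. The slides' own matrix A was shown as an image and is not reproduced here, so the matrix below is a hypothetical stand-in with the same eigenvalues, 5 and 2:

    import numpy as np

    # Hypothetical 2x2 matrix with eigenvalues 5 and 2 (not the slides' matrix).
    A = np.array([[4.0, 1.0],
                  [2.0, 3.0]])

    # Characteristic equation |A - lambda*I| = 0  ->  lambda^2 - 7*lambda + 10 = 0
    eigenvalues, eigenvectors = np.linalg.eig(A)
    print(eigenvalues)                                  # 5 and 2 (order may differ)

    # Check Ak = lambda*k for each eigenvector (the columns of `eigenvectors`),
    # and that any scalar multiple of k also satisfies the equation.
    for lam, k in zip(eigenvalues, eigenvectors.T):
        print(np.allclose(A @ k, lam * k))              # True
        print(np.allclose(A @ (3 * k), lam * (3 * k)))  # True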

  10. Eigenvectors and eigenvalues • This was the 2 x 2 case, but in general the matrix A is of order n x n. • In that case we will find • n different eigenvalues, and • n different eigenvectors. • The eigenvectors can be collected in a matrix K, with k1, k2, …, kn as its columns. • The eigenvalues can be collected in a diagonal matrix Λ, with the eigenvalues λ1, λ2, …, λn on the diagonal. • Hence the generalized form of Ak = λk is: AK = KΛ.

  11. Principal Component Analysis

  12. Harold Hotelling (1895-1973) • PCA was introduced by Harold Hotelling (1933). • Harold Hotelling was appointed as a professor of economics at Columbia. • But he was a statistician first, economist second. • His work in mathematical statistics included his famous 1931 paper on the Student's t distribution for hypothesis testing, in which he laid out what has since been called "confidence intervals". • In 1933 he wrote "Analysis of a Complex of Statistical Variables into Principal Components" in the Journal of Educational Psychology.

  13. Principal Component Analysis • Principal components analysis (PCA) is a technique that can be used to simplify a dataset. • More formally, it is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on. • PCA can be used for reducing dimensionality in a dataset while retaining those characteristics of the dataset that contribute most to its variance, by eliminating the later principal components (a more or less heuristic decision). These characteristics may be the "most important" ones, but this is not necessarily the case; it depends on the application.

  14. Principal Component Analysis • The data reduction is accomplished via a linear transformation of the observed variables: yi = ai1x1 + ai2x2 + … + aipxp, where i = 1..p. • The y's are the principal components, which are uncorrelated with each other.
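
  As a minimal numeric illustration of this transformation (the numbers below are made up, not taken from the lecture's data), a component score is simply a weighted sum of a respondent's observed scores:

    import numpy as np

    # Hypothetical coefficients (a_i1, ..., a_ip) of one principal component
    a = np.array([0.5, 0.5, 0.5, 0.5])
    # Hypothetical observed scores (x_1, ..., x_p) of one respondent
    x = np.array([2.0, 3.0, 1.0, 4.0])

    y = a @ x      # y_i = a_i1*x_1 + a_i2*x_2 + ... + a_ip*x_p
    print(y)       # 5.0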

  15. Digression • The equation yi = ai1x1 + ai2x2 + … + aipxp: what does it say? • Let's assume that for a certain respondent the answers to items x1, x2, …, xp are known, and that we also know the coefficients aij. • What does that imply for yi? • The function is a prediction of y. • What does the path model for this equation look like? • What is so special about this equation? • In multiple regression we have an observed y and observed x's; this allows the estimation of the constants ai. • PCA is a different thing, although the equation is the same!

  16. Principal Component Analysis • The equation: yi = ai1x1 + ai2x2 + … + aipxp. • In the PCA case: • the y variables are not observed (unknown), and • the constants a are unknown. • So there are too many unknowns to solve the system. • We can do PCA, so we must be able to solve it. But how? • The idea is straightforward. • Choose the a's such that each principal component has maximum variance. • Express the variance of y in terms of the observed variables (x) and the unknown coefficients (a). • Impose a constraint to limit the number of solutions. • Then set the derivative to zero to find a maximum of the function.

  17. Principal Component Analysis • The basic equation: yi = ai1x1 + ai2x2 + … + aipxp; • Let x be a column vector with p random x variables • The x variables are, without loss of generality, expressed as deviations from the mean. • We usually worked with the data matrix, now suddenly a vector with p random variables?

  18. Digression • The vector x containing random variables is directly related to the data matrix X.

  19. Principal Component Analysis • The basic equation: yi = ai1x1 + ai2x2 + … + aipxp. • Let x be a column vector with p random x variables. • The x variables are, without loss of generality, expressed as deviations from the mean. • Let a be a p-component column vector; • then y = a'x. • Because this function is unbounded, we can always find a vector a' for which the variance of the principal component is larger and for which the equation is satisfied; hence: • impose a constraint on the unknowns so that a'a = 1. • This (= 1) is an arbitrary choice, • but it will turn out that this makes the algebra simpler. • Now the number of solutions for y is constrained (bounded).

  20. Principal Component Analysis • The variance of y is var(y) = var(a'x) = E((a'x)(a'x)'). • Which is: E((a'x)(x'a)) = a'E(xx')a = a'Σa; • because E(xx') = X'X/n = the variance-covariance matrix. • Thus f: a'Σa. • We have to find a maximum of this function.
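
  A small simulation sketch (assuming deviation scores and the Σ = X'X/n formula from this slide; the data are simulated, not the lecture's) confirms that var(a'x) equals a'Σa for any choice of a:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 1000, 3
    X = rng.normal(size=(n, p))
    X = X - X.mean(axis=0)                 # deviations from the mean

    Sigma = X.T @ X / n                    # E(xx') = X'X/n, the variance-covariance matrix
    a = np.array([0.5, -0.3, 0.8])         # an arbitrary weight vector

    y = X @ a                              # y = a'x for every respondent
    print(np.allclose(y @ y / n, a @ Sigma @ a))   # var(y) = a' Sigma a -> True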

  21. Principal Component Analysis • So, we have to set the derivative of the function equal to zero. • Don't forget the constraint (a'a = 1) in the function; it should be accounted for when finding the maximum. • This can be achieved using the Lagrange multiplier, a mathematical shortcut. • h: f – λg, where g: a'a – 1. • ∂h/∂a' = 2Σa – 2λa • This is the derivative we need to find the maximum of the variance function. • 2Σa – 2λa = 0 => divide both sides by 2 • Σa – λa = 0 => take the factor a out • (Σ – λI)a = 0 => this should look familiar!

  22. Principal Component Analysis • (Σ – λI)a = 0 can be solved via |Σ – λI| = 0, with a ≠ 0. • Here λ is the eigenvalue of the eigenvector a. • Rewrite (Σ – λI)a = 0 so that: Σa = λIa, i.e. Σa = λa. • If we premultiply both sides with a', then • a'Σa = a'λa = λa'a = λ, • because a'a = 1. • It follows that var(y) = λ, • because var(y) = a'Σa. • So the eigenvalues are the variances of the principal components. • And the largest eigenvalue is the variance of the first principal component, etc. • The elements of the eigenvector a, which are found by substitution of the largest λ, are called the loadings of y.
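
  A sketch of this result on simulated data (an assumed covariance structure, not the lecture's data): the eigenvalues of Σ are the variances of the principal component scores, with the largest eigenvalue belonging to the first component:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=5000)
    X = X - X.mean(axis=0)
    Sigma = X.T @ X / X.shape[0]

    lam, A = np.linalg.eigh(Sigma)         # eigh: Sigma is symmetric; eigenvalues ascending
    lam, A = lam[::-1], A[:, ::-1]         # reorder so the first column is the first PC

    Y = X @ A                              # principal component scores
    print(np.var(Y, axis=0))               # variances of the components ...
    print(lam)                             # ... equal the eigenvalues lambda_1, lambda_2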

  23. Principal Component Analysis • The next principal component is found after taking away the variance of the first principal component, which is var(y1) = λ1, with y1 = a1'x. • In order to find y2 we require that it is uncorrelated with y1, next to the other constraint that a2'a2 = 1. • Therefore: • cor(y2, y1) = 0 • E(y2y1') = 0 • E((a2'x)(a1'x)') = 0 • E(a2'xx'a1) = 0 • a2'E(xx')a1 = 0 • a2'Σa1 = 0 • a2'λ1a1 = λ1a2'a1 = 0 => because Σa1 = λ1a1, since (Σ – λ1I)a1 = 0

  24. Principal Component Analysis • Since y2 = a2'x, • the variance of y2 is f2: a2'Σa2. • So, we have to set the derivative of this function equal to zero. • Don't forget to take the constraints into account: • a2'a2 = 1, and • λ1a2'a1 = 0, • when finding the maximum. • This can be achieved using the Lagrange multiplier, a mathematical shortcut.

  25. Principal Component Analysis • The result: ∂h2/∂a2' = 2Σa2 – 2λ2a2 – 2ν2Σa1 = 0 • ν2 = 0 as a consequence of the constraints. • Thus: • 2Σa2 – 2λ2a2 = 0 => (Σ – λ2I)a2 = 0 • Which can be solved via |Σ – λ2I| = 0, with a2 ≠ 0. • We solve this equation, then take the largest remaining eigenvalue (λ2), • and solve for the eigenvector (a2) that corresponds to this eigenvalue. • Et cetera, for the other components.
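
  The "taking away the variance of the first component" idea can be made concrete with a small deflation sketch on simulated data (my own illustration of the idea, not code from the lecture): subtracting λ1·a1a1' from Σ and solving the same eigenproblem again yields λ2 and a2, and the resulting component scores are uncorrelated:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(2000, 3)) @ np.array([[2.0, 0.3, 0.1],
                                               [0.0, 1.0, 0.2],
                                               [0.0, 0.0, 0.5]])
    X = X - X.mean(axis=0)
    Sigma = X.T @ X / X.shape[0]

    lam, A = np.linalg.eigh(Sigma)
    lam, A = lam[::-1], A[:, ::-1]                  # eigenvalues in descending order

    # "Take away" the first component's variance, then solve again:
    Sigma_defl = Sigma - lam[0] * np.outer(A[:, 0], A[:, 0])
    lam_d, A_d = np.linalg.eigh(Sigma_defl)
    j = lam_d.argmax()
    print(np.isclose(lam_d[j], lam[1]))             # largest remaining eigenvalue is lambda_2
    print(np.allclose(np.abs(A_d[:, j]), np.abs(A[:, 1])))  # its eigenvector is a_2 (up to sign)

    Y = X @ A                                       # component scores
    print(np.allclose(np.corrcoef(Y, rowvar=False), np.eye(3)))  # y1, y2, y3 uncorrelated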

  26. Principal Component Analysis • Estimation is now rather straightforward. • We use the ML estimate of Σ, which is the sample variance-covariance matrix S = X'X/n. • Then we simply have to solve for λ̂ in |S – λ̂I| = 0, • where λ̂ is the ML estimate of λ. • Then we simply have to solve for â in (S – λ̂I)â = 0 by substitution of the solution for λ̂, • where â is the ML estimate of a.
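
  For the 2 x 2 case this whole estimation procedure can be carried out "by hand", mirroring slides 6-9: estimate Σ, solve the characteristic equation for λ̂, and substitute to get â. A sketch on simulated data, not the lecture's code:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.multivariate_normal([0, 0], [[2.0, 0.8], [0.8, 1.0]], size=10000)
    X = X - X.mean(axis=0)
    S = X.T @ X / X.shape[0]               # ML estimate of Sigma (divisor n, not n-1)

    # |S - lambda*I| = 0 for a 2x2 matrix: lambda^2 - trace*lambda + det = 0
    tr, det = np.trace(S), np.linalg.det(S)
    lam1 = (tr + np.sqrt(tr**2 - 4 * det)) / 2
    lam2 = (tr - np.sqrt(tr**2 - 4 * det)) / 2

    # Substitute lambda_1 into (S - lambda_1*I) a_1 = 0 and normalize so a_1'a_1 = 1
    a1 = np.array([S[0, 1], lam1 - S[0, 0]])
    a1 = a1 / np.linalg.norm(a1)

    print(lam1, lam2)                      # estimated variances of the two components
    print(np.allclose(S @ a1, lam1 * a1))  # True: a1 is the first eigenvector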

  27. Visualization

  28. Visualization • Let's consider R, which is the standardized Σ. • And start simple with only two variables, v1 and v2. • How can this situation be visualized? • In a 2-dimensional space, one dimension for each variable. • Each variable is a vector with length 1. • The angle between the vectors represents the correlation.

  29. Visualization • Intuitive proof that "the angle between the vectors represents the correlation": cos(angle) = cor(v1, v2). • If there is no angle, then they are actually the same (except for a constant). • In that case cos(0) = 1, and cor(v1, v2) = 1. • Now if they are uncorrelated, then the correlation is zero. • In that case cor(v1, v2) = 0, • and if cos(angle) = cor(v1, v2), then cos(angle) = 0, • so angle = ½π, since cos(½π) = 0. • So we can visualize the correlation matrix. • The correlation between v1 and v2 is 0.7; thus the angle is ¼π.
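
  A simulation sketch of this equivalence (simulated variables, not the lecture's data): the cosine of the angle between two mean-centered variable vectors is exactly their correlation:

    import numpy as np

    rng = np.random.default_rng(4)
    v1 = rng.normal(size=500)
    v2 = 0.7 * v1 + rng.normal(size=500)           # two correlated variables

    v1c, v2c = v1 - v1.mean(), v2 - v2.mean()      # deviations from the mean
    cos_angle = (v1c @ v2c) / (np.linalg.norm(v1c) * np.linalg.norm(v2c))
    r = np.corrcoef(v1, v2)[0, 1]

    print(np.isclose(cos_angle, r))                # True: cos(angle) = cor(v1, v2)
    print(np.degrees(np.arccos(r)))                # the angle implied by the correlation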

  30. Visualization • [Figure: Variable 1, Variable 2, and the First Principal Component.]

  31. Visualization • [Figure: V1, V2, and the 1st PC. The projection of V1 on the principal component equals the constant a11 in the equation; the projection of V2 equals a12.]

  32. Visualization • [Figure: The total projection on the 1st PC, which is the variance of the 1st PC, and thus λ1.]

  33. Visualization • [Figure: The 2nd PC together with the 1st PC. The projection of V2 on the 2nd PC is shown; the projection of V1 on the 2nd PC is 0.]

  34. Visualization • [Figure: The total projection on the 1st PC, which is its variance and thus λ1, and the total projection on the 2nd PC, which is its variance and thus λ2.]

  35. Visualization • Of course, PCA is concerned with finding the largest variance for the first component, etc. • In this example, there is possibly a better alternative. • So, what I presented as solutions for the λ's and a were in fact non-optimal solutions. • Let's find an optimal solution.

  36. Visualization • [Figure: Variable 1, Variable 2, and the First Principal Component (optimal solution).]

  37. Visualization • [Figure: V1, V2, and the 1st PC. The maximized projection of V1 on the principal component is a11 in the equation; the maximized projection of V2 is a12.]

  38. Visualization • [Figure: The total maximum projection on the 1st PC, which is its variance and thus λ1; the 'minimized' projections of V1 and V2 on the 2nd PC.]

  39. Visualization • [Figure: The total maximum projection on the 1st PC (its variance, λ1) and on the 2nd PC (its variance, λ2).]

  40. Example

  41. An example • What does a PCA solution look like? • It might make sense to say that the weighted sum of these items (shown on the slide) is something that we could call loneliness.

  42. An example • The loneliness items not only seem to be related at face value, but the variables are also correlated.

  43. An example • What can we expect with PCA? • There are 8 items, so there will be 8 PCs. • At face value the items are related to one thing: loneliness. • So there should be one PC (interpretable as loneliness) that accounts for most variance in the eight observed variables.

  44. An example • This is the variance explained by the principal components. Note that the total is never 1, due to the fact that only the PCs with eigenvalue > 1 are used.

  45. An example • The complete PCA solution, with all 8 variables and 8 PCs, next to the 'practical' solution, which has thrown away all PCs with an eigenvalue smaller than 1. This is an arbitrary choice, called the Kaiser criterion. • The first 2 PCs have an eigenvalue > 1; together they absorbed 63.6% of all variance.
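
  The lecture's loneliness data are not available here, but the Kaiser criterion itself is easy to sketch on simulated stand-in data for 8 items: take the eigenvalues of the correlation matrix R, retain the components with eigenvalue > 1, and report the cumulative proportion of variance they absorb:

    import numpy as np

    rng = np.random.default_rng(5)
    # Simulated stand-in for 8 items driven by one common factor plus noise
    factor = rng.normal(size=(500, 1))
    X = factor @ rng.uniform(0.5, 0.9, size=(1, 8)) + rng.normal(scale=0.8, size=(500, 8))

    R = np.corrcoef(X, rowvar=False)               # correlation matrix of the 8 items
    lam, A = np.linalg.eigh(R)
    lam, A = lam[::-1], A[:, ::-1]                 # eigenvalues in descending order

    explained = lam / lam.sum()                    # lam.sum() = trace(R) = number of items
    keep = lam > 1                                 # Kaiser criterion
    print(np.round(lam, 2))                        # eigenvalues (= component variances)
    print(np.round(explained.cumsum(), 3))         # cumulative proportion of variance
    print("retained components:", keep.sum())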

  46. An example • These are the constants ai from the matrix A'; they are also the projections of the variables on the principal component axes. The first two components account for 46% and 17% of the variance. • The square of a loading is the variance contributed to the component by the observed variable. When you add the squared loadings of a component, you obtain its eigenvalue: Σ(squared loadings) = 3.7 for the first component.
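
  A sketch of the "squared loadings sum to the eigenvalue" property on simulated data. I assume here, as is common in packages such as SPSS, that the reported loadings are the eigenvector elements scaled by the square root of the eigenvalue, which also makes each loading the correlation between the item and the component:

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # 4 correlated items
    R = np.corrcoef(X, rowvar=False)

    lam, A = np.linalg.eigh(R)
    lam, A = lam[::-1], A[:, ::-1]                 # descending eigenvalues

    loadings = A * np.sqrt(lam)                    # column j: sqrt(lambda_j) * a_j
    print(np.allclose((loadings ** 2).sum(axis=0), lam))   # squared loadings sum to eigenvalue

    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardized items
    Y = Z @ A                                      # component scores
    corr = np.array([[np.corrcoef(Z[:, i], Y[:, j])[0, 1] for j in range(4)]
                     for i in range(4)])
    print(np.allclose(corr, loadings))             # loading = correlation item <-> component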

  47. An example • What can we expect with PCA? • There are 8 items, so there will be 8 PCs. • At face value the items are related to one thing: loneliness. • So there should be one PC (interpretable as loneliness) that accounts for most variance in the eight observed variables. • We find 1 PC that absorbs almost 50% of the variance; that one might be called loneliness. • However, we lose about 50% of the variance. The second PC hardly does anything, let alone the other components. • So the number of variables can be reduced from 8 to 1, • however, at a huge loss.

  48. Practical issues

  49. Practical issues • PCA is NOT factor analysis!! • Neither exploratory nor confirmatory factor analysis. • In factor analysis it is assumed that the structure in the data is the result of an underlying factor structure. So, there is a theory. • In PCA the original data are linearly transformed into a set of new uncorrelated variables with the maximum variance property. This is a mathematical optimization procedure that lacks a theory about the data. • Many people think they use PCA. However, they use a rotated version of the PCA solution, for which the maximum variance property does not necessarily hold any more. • The advantage is that such rotated solutions are often easier to interpret, because the PCA solution too often has no substantive meaning.

  50. Practical issues • In PCA the PCs are uncorrelated; beware of that when interpreting the PCs. • I have often seen the PCs interpreted as related constructs, e.g. loneliness and shyness; but I assume that such constructs are related, so a different interpretation should be found. • Many times the solutions are rotated, to obtain results that are easier to interpret.
