
Principal Component Analysis

Presentation Transcript


  1. Principal Component Analysis

  2. Philosophy of PCA • Introduced by Pearson (1901) and Hotelling (1933) to describe the variation in a set of multivariate data in terms of a set of uncorrelated variables • We typically have a data matrix of n observations on p correlated variables x1, x2, …, xp • PCA looks for a transformation of the xi into p new variables yi that are uncorrelated

  3. The data matrix

  4. Reduce dimension • The simplest way is to keep one variable and discard all others: not reasonable! • Weight all variables equally: not reasonable (unless they have the same variance) • Weighted average based on some criterion. • Which criterion?

  5. Let us write it first • Looking for a transformation of the data matrix X (n×p) such that Y = aᵀX = a1X1 + a2X2 + … + apXp • where a = (a1, a2, …, ap)ᵀ is a column vector of weights with a1² + a2² + … + ap² = 1
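A minimal sketch of this projection in R; the data matrix and the weight vector a below are simulated purely for illustration:

```r
# Simulated data matrix X: n = 100 observations on p = 3 correlated variables
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)

# An arbitrary weight vector, rescaled so that a1^2 + a2^2 + a3^2 = 1
a <- c(1, 1, 0.2)
a <- a / sqrt(sum(a^2))

# The new variable Y is the linear combination a1*X1 + a2*X2 + a3*X3
Y <- X %*% a
```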

  6. One good criterion • Maximize the variance of the projection of the observations on the Y variables • Find a so that Var(aᵀX) = aᵀ Var(X) a is maximal • The matrix C = Var(X) is the covariance matrix of the Xi variables
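A short, self-contained check in R (again on simulated data) that the variance of the projection equals aᵀ C a:

```r
set.seed(1)
X <- matrix(rnorm(300), ncol = 3)
X[, 2] <- 0.8 * X[, 1] + 0.5 * X[, 2]       # make the variables correlated

a <- c(1, 1, 0.2); a <- a / sqrt(sum(a^2))  # unit-norm weights
C <- cov(X)                                 # covariance matrix of the Xi

# Var(a'X) computed two ways: directly, and as a' C a
var(as.vector(X %*% a))
as.numeric(t(a) %*% C %*% a)                # same value
```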

  7. Let us see it on a figure • [Figure: two candidate projection directions, labelled "Good" and "Better"]

  8. Covariance matrix • C is the p×p matrix whose (i, j) entry is Cov(Xi, Xj); its diagonal entries are the variances Var(Xi)

  9. And so… We find that • The direction of a is given by the eigenvector a1 corresponding to the largest eigenvalue of matrix C • The second vector, orthogonal (uncorrelated) to the first, is the one with the second highest variance, which turns out to be the eigenvector corresponding to the second largest eigenvalue • And so on …
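A sketch in base R: the eigenvectors of the covariance matrix, in decreasing order of eigenvalue, give the successive orthogonal directions of maximal variance (the data are simulated only to have something to decompose):

```r
set.seed(2)
X <- matrix(rnorm(500), ncol = 5)
X[, 2] <- X[, 1] + 0.3 * X[, 2]         # introduce correlation

e <- eigen(cov(X))                      # eigen() returns values in decreasing order
e$values                                # variances along each principal direction
e$vectors[, 1]                          # direction of the first principal component

# Successive eigenvectors are orthogonal (uncorrelated directions)
round(t(e$vectors) %*% e$vectors, 10)   # identity matrix
```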

  10. So PCA gives • New variables Yi that are linear combinations of the original variables xi: • Yi = ai1x1 + ai2x2 + … + aipxp ; i = 1..p • The new variables Yi are derived in decreasing order of importance; • they are called ‘principal components’

  11. Calculating eigenvalues and eigenvectors • The eigenvalues λi are found by solving the equation det(C − λI) = 0 • Eigenvectors are the columns of the matrix A such that C = A D Aᵀ • where D = diag(λ1, λ2, …, λp) is the diagonal matrix of eigenvalues
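Both statements can be checked numerically with R's eigen(); the covariance matrix below is an arbitrary example:

```r
C <- matrix(c(4,   2,   1,
              2,   3,   0.5,
              1,   0.5, 2), nrow = 3)    # an arbitrary covariance matrix

e <- eigen(C)
A <- e$vectors                           # eigenvectors as columns
D <- diag(e$values)                      # D = diag(lambda_1, ..., lambda_p)

# Each eigenvalue solves det(C - lambda * I) = 0 (up to rounding error)
sapply(e$values, function(l) det(C - l * diag(3)))

# C is recovered from its eigen decomposition: C = A D A^T
A %*% D %*% t(A)
```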

  12. An example • Let us take two variables, each with variance 1 and covariance c > 0 • C = (1 c; c 1), so C − λI = (1−λ c; c 1−λ) and det(C − λI) = (1 − λ)² − c² • Solving this we find λ1 = 1 + c and λ2 = 1 − c < λ1

  13. and eigenvectors • Any eigenvector a satisfies the condition Ca = λa • Solving, we find a1 = (1/√2)(1, 1)ᵀ for λ1 = 1 + c and a2 = (1/√2)(1, −1)ᵀ for λ2 = 1 − c
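The same 2×2 example checked numerically, here with the arbitrary choice c = 0.6:

```r
c_ <- 0.6                                # covariance between the two variables
C  <- matrix(c(1,  c_,
               c_, 1), nrow = 2)

e <- eigen(C)
e$values                                 # 1 + c and 1 - c, i.e. 1.6 and 0.4
e$vectors                                # columns proportional to (1, 1) and (1, -1),
                                         # possibly up to a sign flip
```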

  14. PCA is sensitive to scale • If you multiply one variable by a scalar you get different results (can you show it?) • This is because PCA uses the covariance matrix (and not the correlation matrix) • PCA should be applied on data that have approximately the same scale in each variable
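A small demonstration of this sensitivity with simulated data: rescaling one variable changes the covariance-based loadings, while the correlation matrix is unaffected:

```r
set.seed(3)
X <- matrix(rnorm(200), ncol = 2)

p1 <- prcomp(X)                          # PCA on the original scale
X2 <- X; X2[, 1] <- 100 * X2[, 1]        # rescale the first variable only
p2 <- prcomp(X2)

p1$rotation[, 1]                         # loadings of PC1 before rescaling
p2$rotation[, 1]                         # after rescaling, PC1 is almost entirely variable 1

# Using the correlation matrix instead removes the effect:
cor(X); cor(X2)                          # identical, so correlation-based PCA is unchanged
```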

  15. Interpretation of PCA • The new variables (PCs) have a variance equal to their corresponding eigenvalue: Var(Yi) = λi for all i = 1…p • Small λi ⇒ small variance ⇒ the data change little in the direction of component Yi • The relative variance explained by each PC is given by λi / Σλi
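In R these quantities are just the eigenvalues of the covariance matrix, rescaled (simulated data for illustration):

```r
set.seed(4)
X <- matrix(rnorm(400), ncol = 4)
X[, 2] <- X[, 1] + 0.2 * X[, 2]          # some correlation

lambda <- eigen(cov(X))$values           # Var(Y_i) for each principal component
lambda / sum(lambda)                     # relative variance explained by each PC
cumsum(lambda) / sum(lambda)             # cumulative variance explained
```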

  16. How many components to keep? • Enough PCs to have a cumulative variance explained by the PCs that is > 50–70% • Kaiser criterion: keep PCs with eigenvalues > 1 • Scree plot: represents the ability of the PCs to explain the variation in the data

  17. Do it graphically
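A sketch of the three criteria with prcomp() on simulated, standardized data (the Kaiser rule is usually applied to the eigenvalues of the correlation matrix):

```r
set.seed(5)
X <- matrix(rnorm(600), ncol = 6)
X[, 2] <- X[, 1] + 0.3 * X[, 2]

p <- prcomp(X, scale. = TRUE)            # PCA on standardized variables
lambda <- p$sdev^2                       # eigenvalues of the correlation matrix

cumsum(lambda) / sum(lambda)             # keep enough PCs to pass ~50-70%
which(lambda > 1)                        # Kaiser criterion: eigenvalues > 1
screeplot(p, type = "lines")             # scree plot: look for the "elbow"
```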

  18. Interpretation of components • Look at the weights of the variables in each component • If Y1 = 0.89 X1 + 0.15 X2 − 0.77 X3 + 0.51 X4 • then X1 and X3 have the highest weights (in absolute value) and so are the most important variables in the first PC • Look at the correlations between the variables Xi and the PCs: circle of correlation
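A sketch of both interpretations in R, on simulated data: the loadings (weights) of each variable in each PC, and the correlations between the variables and the PC scores, which are the coordinates plotted on a correlation circle:

```r
set.seed(6)
X <- matrix(rnorm(400), ncol = 4)
colnames(X) <- paste0("X", 1:4)
X[, 3] <- X[, 1] + 0.5 * X[, 3]

p <- prcomp(X, scale. = TRUE)
p$rotation                               # weights of each variable in each PC
cor(X, p$x)                              # correlations between variables and PCs
                                         # (coordinates on the circle of correlation)
```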

  19. Circle of correlation

  20. Normalized (standardized) PCA • If the variables have very heterogeneous variances, we standardize them • The standardized variables are Xi* = (Xi − mean) / standard deviation • The new variables all have the same variance (1), so each variable has the same weight.
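In R this is a column-wise standardization, or equivalently prcomp(..., scale. = TRUE); a short sketch with simulated data of very different scales:

```r
set.seed(7)
X <- cbind(rnorm(100, sd = 1), rnorm(100, sd = 50))   # very heterogeneous variances

Xstar <- scale(X)                        # (Xi - mean(Xi)) / sd(Xi)
apply(Xstar, 2, var)                     # all equal to 1

# Equivalent shortcut: PCA on standardized data = PCA on the correlation matrix
prcomp(X, scale. = TRUE)$sdev^2
eigen(cor(X))$values                     # same eigenvalues
```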

  21. Application of PCA in Genomics • PCA is useful for finding new, more informative, uncorrelated features; it reduces dimensionality by rejecting low variance features • Analysis of expression data • Analysis of metabolomics data (Ward et al., 2003)

  22. However • PCA is only powerful if the biological question is related to the highest variance in the dataset • If not, other techniques are more useful: Independent Component Analysis (ICA) • Introduced by Jutten in 1987

  23. What is ICA?

  24. That looks like that

  25. The idea behind ICA

  26. How does it work?

  27. Rationale of ICA • Find the components Si that are as independent as possible, in the sense of maximizing some function F(s1, s2, …, sk) that measures independence • All ICs (except possibly one) should be non-Normal • The variance of all ICs is 1 • There is no hierarchy between ICs

  28. How to find ICs? • Many choices of objective function F • Mutual information • We use the kurtosis of the variables to approximate the distribution function • The number of ICs is chosen by the user

  29. Difference with PCA • It is not a dimensionality reduction technique • There is no single (exact) solution for the components; different algorithms give different results (in R: FastICA, PearsonICA, MLICA) • ICs are of course uncorrelated, but also as independent as possible • Uninteresting for Normally distributed variables
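A minimal sketch with the fastICA package (one of the R implementations listed above; install it first if needed); the mixed signals are simulated just to show the call and the shape of the output:

```r
# install.packages("fastICA")                  # if not already installed
library(fastICA)

set.seed(8)
s1 <- sin((1:1000) / 20)                       # two independent, non-Normal sources
s2 <- runif(1000, -1, 1)
S  <- cbind(s1, s2)
A  <- matrix(c(0.6, 0.4, 0.4, 0.6), nrow = 2)  # arbitrary mixing matrix
X  <- S %*% A                                  # observed mixtures

ica <- fastICA(X, n.comp = 2)                  # the user chooses the number of ICs
str(ica$S)                                     # estimated independent components
cor(ica$S)                                     # uncorrelated (and ~independent)
```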

  30. Example: Lee and Batzoglou (2003) • Microarray expression data on 7070 genes in 59 Normal human tissue samples (19 types) • We are not interested in reducing dimension but rather in looking for genes that show a tissue-specific expression profile (what makes the tissue types different)

  31. PCA vs ICA • Hsiao et al. (2002) applied PCA and, by visual inspection, observed three clusters of 425 genes: liver-specific, brain-specific and muscle-specific • ICA identified more tissue-specific genes than PCA
