Computational BioMedical Informatics SCE 5095: Special Topics Course Instructor: Jinbo Bi Computer Science and Engineering Dept.
Course Information • Instructor: Dr. Jinbo Bi • Office: ITEB 233 • Phone: 860-486-1458 • Email: jinbo@engr.uconn.edu • Web: http://www.engr.uconn.edu/~jinbo/ • Time: Mon / Wed. 2:00pm – 3:15pm • Location: CAST 204 • Office hours: Mon. 3:30-4:30pm • HuskyCT • http://learn.uconn.edu • Login with your NetID and password • Illustration
Review of previous classes • Introduced unsupervised learning, particularly cluster analysis techniques • Discussed one important application of cluster analysis: cardiac ultrasound view recognition • Reviewed papers on medical, health or public health topics • In this class, we start to discuss dimension reduction
Topics today • What is feature reduction? • Why feature reduction? • More general motivations of component analysis • Feature reduction algorithms • Principal component analysis • Canonical correlation analysis • Independent component analysis
What is feature reduction? • Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space. • The criterion for feature reduction differs with the problem setting. • Unsupervised setting: minimize the information loss • Supervised setting: maximize the class discrimination • Given a set of data points of p variables, $\{x_1, x_2, \dots, x_n\}$ with $x_i \in \mathbb{R}^p$, compute the linear transformation (projection) $G \in \mathbb{R}^{p \times d}$ that maps $x \in \mathbb{R}^p$ to $y = G^T x \in \mathbb{R}^d$, with $d < p$.
What is feature reduction? [Figure: a linear transformation maps the original data to the reduced data]
High-dimensional data • Gene expression • Face images • Handwritten digits
Feature reduction versus feature selection • Feature reduction • All original features are used • The transformed features are linear combinations of the original features. • Feature selection • Only a subset of the original features is used.
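The contrast between the two approaches can be sketched in a few lines of NumPy. This is only an illustration: the data are random, and both the chosen column indices and the projection matrix `G` are hypothetical placeholders rather than the output of any real selection or reduction method.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # 6 samples, p = 4 original features

# Feature selection: keep a subset of the original columns unchanged.
selected = X[:, [0, 2]]              # hypothetical choice of features 0 and 2

# Feature reduction: every new feature is a linear combination of all columns.
G = rng.normal(size=(4, 2))          # hypothetical p x d projection matrix
reduced = X @ G

print(selected.shape, reduced.shape)
```

Both results have two columns, but only the selected features can still be read as original measurements.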
Why feature reduction? • Most machine learning and data mining techniques may not be effective for high-dimensional data • Curse of Dimensionality • Query accuracy and efficiency degrade rapidly as the dimension increases. • The intrinsic dimension may be small. • For example, the number of genes responsible for a certain type of disease may be small.
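One way to see the curse of dimensionality is to measure how the gap between the nearest and the farthest neighbour of a query point shrinks as the dimension grows. The sketch below assumes uniformly distributed random points, and `distance_contrast` is a hypothetical helper name introduced only for this illustration.

```python
import numpy as np

def distance_contrast(dim, n_points=500, seed=0):
    """Relative gap between the farthest and nearest point from a random query."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 3))
```

When the contrast is small, the nearest neighbour is barely closer than any other point, which is one reason query accuracy degrades in high dimensions.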
Why feature reduction? • Understanding: reason about or obtain insights from data • Combinations of observed variables may be more effective bases for insights, even if physical meaning is obscure • Visualization: projection of high-dimensional data onto 2D or 3D • Data compression: reduce the data to a smaller set of factors for efficient storage and retrieval • Better representation of data without losing much information • Noise removal: there may be too much noise in the data • Can build more effective data analyses on the reduced-dimensional space: classification, clustering, pattern recognition (positive effect on query accuracy)
Application of feature reduction • Face recognition • Handwritten digit recognition • Text mining • Image retrieval • Microarray data analysis • Protein classification
More general motivations • We study phenomena that cannot be directly observed • Ego, personality, intelligence in psychology • Underlying factors that govern the observed data • We want to identify and operate with underlying latent factors rather than the observed data • E.g. topics in news articles • Transcription factors in genomics • We want to discover and exploit hidden relationships • “beautiful car” and “gorgeous automobile” are closely related • So are “driver” and “automobile” • But does your search engine know this? • Reduces noise and error in results
More general motivations • Discover a new set of factors/dimensions/axes against which to represent, describe or evaluate the data • For more effective reasoning, insights, or better visualization • Reduce noise in the data • Typically a smaller set of factors: dimension reduction • Better representation of data without losing much information • Can build more effective data analyses on the reduced-dimensional space: classification, clustering, pattern recognition • Factors are combinations of observed variables • May be more effective bases for insights, even if physical meaning is obscure • Observed data are described in terms of these factors rather than in terms of original variables/dimensions
Feature reduction algorithms • Unsupervised • Principal component analysis (PCA) • Canonical correlation analysis (CCA) • Independent component analysis (ICA) • Supervised • Linear discriminant analysis (LDA) • Sparse support vector machine (SSVM) • Semi-supervised • Research topic
Feature reduction algorithms • Linear • Principal Component Analysis (PCA) • Linear Discriminant Analysis (LDA) • Canonical Correlation Analysis (CCA) • Independent component analysis (ICA) • Nonlinear • Nonlinear feature reduction using kernels • Kernel PCA, kernel CCA, … • Manifold learning • LLE, Isomap
Basic Concept • Areas of variance in data are where items can be best discriminated and key underlying phenomena observed • Areas of greatest “signal” in the data • If two items or dimensions are highly correlated or dependent • They are likely to represent highly related phenomena • If they tell us about the same underlying variance in the data, combining them to form a single measure is reasonable • Parsimony • Reduction in Error • So we want to combine related variables, and focus on uncorrelated or independent ones, especially those along which the observations have high variance • We want a smaller set of variables that explain most of the variance in the original data, in more compact and insightful form
Basic Concept • What if the dependences and correlations are not so strong or direct? • And suppose you have 3 variables, or 4, or 5, or 10000? • Look for the phenomena underlying the observed covariance/co-dependence in a set of variables • Once again, phenomena that are uncorrelated or independent, and especially those along which the data show high variance • These phenomena are called “factors” or “principal components” or “independent components,” depending on the methods used • Factor analysis: based on variance/covariance/correlation • Independent Component Analysis: based on independence
What is Principal Component Analysis? • Principal component analysis (PCA) • Reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables • Retains most of the sample's information. • Useful for the compression and classification of data. • By information we mean the variation present in the sample, given by the correlations between the original variables.
Principal Component Analysis • Most common form of factor analysis • The new variables/dimensions • are linear combinations of the original ones • are uncorrelated with one another • Orthogonal in original dimension space • capture as much of the original variance in the data as possible • are called Principal Components • are ordered by the fraction of the total information each retains
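The properties listed above (orthogonal directions, uncorrelated new variables, ordering by variance) can be checked numerically. The following is a minimal sketch on synthetic correlated data, assuming the components are taken as eigenvectors of the sample covariance matrix; the names `W` and `scores` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))                 # hypothetical mixing matrix
X = rng.normal(size=(400, 3)) @ A           # correlated variables, rows = samples
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
eigvals, W = eigvals[order], eigvecs[:, order]

scores = Xc @ W                             # data expressed in the new variables
score_cov = np.cov(scores, rowvar=False)

print(np.round(score_cov, 6))               # off-diagonal entries are ~0
```

The covariance of the new variables is diagonal, with the eigenvalues on the diagonal in decreasing order.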
Some Simple Demos • http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html
What are the new axes? [Figure: scatter plot of the data over Original Variable A and Original Variable B, with the axes PC 1 and PC 2 overlaid] • Orthogonal directions of greatest variance in data • Projections along PC1 discriminate the data most along any one axis
Principal Components • First principal component is the direction of greatest variability (covariance) in the data • Second is the next orthogonal (uncorrelated) direction of greatest variability • So first remove all the variability along the first component, and then find the next direction of greatest variability • And so on …
Principal Components Analysis (PCA) • Principle • Linear projection method to reduce the number of parameters • Transfer a set of correlated variables into a new set of uncorrelated variables • Map the data into a space of lower dimensionality • Form of unsupervised learning • Properties • It can be viewed as a rotation of the existing axes to new positions in the space defined by original variables • New axes are orthogonal and represent the directions with maximum variability
Computing the Components • Data points are vectors in a multidimensional space • Projection of vector x onto an axis (dimension) u is $u \cdot x$ • Direction of greatest variability is that in which the average square of the projection is greatest • I.e. u such that $E((u \cdot x)^2)$ over all x is maximized • (we subtract the mean along each dimension, and center the original axis system at the centroid of all data points, for simplicity) • This direction of u is the direction of the first Principal Component
Computing the Components • $E((u^T x)^2) = E((u^T x)(u^T x)) = E(u^T x x^T u)$ • The matrix $C = x x^T$, taken over all centered data points, contains the covariances (similarities) of the original axes based on how the data values project onto them • So we are looking for u that maximizes $u^T C u$, subject to u being unit-length • It is maximized when u is the principal eigenvector of the matrix C, in which case • $u^T C u = u^T \lambda u = \lambda$ if u is unit-length, where $\lambda$ is the principal eigenvalue of the covariance matrix C • The eigenvalue denotes the amount of variability captured along that dimension
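A minimal NumPy sketch of this computation, on synthetic 2-D data that is deliberately stretched along the direction (1, 1). The data, the rotation matrix `R` and the variable names are assumptions made for illustration; the covariance here uses the usual 1/(n-1) sample normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data stretched along the direction (1, 1); rows are samples.
z = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
R = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2.0)
X = z @ R.T

Xc = X - X.mean(axis=0)               # center at the centroid
C = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns eigenvalues in ascending order
u = eigvecs[:, -1]                    # principal eigenvector = first PC
lam = eigvals[-1]                     # principal eigenvalue

proj_var = np.var(Xc @ u, ddof=1)     # variance of the projections onto u
print(u, lam, proj_var)
```

The variance of the projections onto u matches the principal eigenvalue, and u points (up to sign) along the stretched direction.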
Why the Eigenvectors? • Maximise $u^T x x^T u$ s.t. $u^T u = 1$ • Construct the Lagrangian $u^T x x^T u - \lambda u^T u$ • Set the vector of partial derivatives to zero: $x x^T u - \lambda u = (x x^T - \lambda I) u = 0$ • As $u \neq 0$, u must be an eigenvector of $x x^T$ with eigenvalue $\lambda$
Singular Value Decomposition • The first root is called the principal eigenvalue, which has an associated unit-length ($u^T u = 1$) eigenvector u • Subsequent roots are ordered such that $\lambda_1 > \lambda_2 > \dots > \lambda_M$ with rank(D) non-zero values • Eigenvectors form an orthonormal basis, i.e. $u_i^T u_j = \delta_{ij}$ • The eigenvalue decomposition: $C = x x^T = U \Sigma U^T$, where $U = [u_1, u_2, \dots, u_M]$ and $\Sigma = \mathrm{diag}[\lambda_1, \lambda_2, \dots, \lambda_M]$ • Similarly, the eigenvalue decomposition of $x^T x$ is $V \Sigma V^T$ • The SVD is closely related to the above: $x = U \Sigma^{1/2} V^T$ • U holds the left eigenvectors, V the right eigenvectors, and the singular values are the square roots of the eigenvalues
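The link between the eigenvalue decomposition of $x x^T$ and the SVD of x can be verified numerically. This sketch assumes a small random data matrix `X` whose columns are samples, matching the $C = x x^T$ convention of the slide; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 50))                      # columns are samples
X = X - X.mean(axis=1, keepdims=True)             # center each variable (row)

eigvals = np.linalg.eigvalsh(X @ X.T)[::-1]       # eigenvalues, descending
U, s, Vt = np.linalg.svd(X, full_matrices=False)  # singular values, descending

print(np.allclose(s**2, eigvals))                 # singular values squared = eigenvalues
```

In practice the SVD of the centered data matrix is often preferred because it avoids forming $x x^T$ explicitly.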
Computing the Components • Similarly for the next axis, etc. • So, the new axes are the eigenvectors of the covariance matrix of the original variables, which captures the similarities of the original variables based on how data samples project to them • Geometrically: centering followed by rotation • Linear transformation
PCs, Variance and Least-Squares • The first PC retains the greatest amount of variation in the sample • The kth PC retains the kth greatest fraction of the variation in the sample • The kth largest eigenvalue of the covariance matrix C is the variance in the sample along the kth PC • The least-squares view: PCs are a series of linear least squares fits to a sample, each orthogonal to all previous ones
How Many PCs? • For n original dimensions, the covariance matrix is n × n and has up to n eigenvectors, so there are up to n PCs. • Where does dimensionality reduction come from?
Dimensionality Reduction • Can ignore the components of lesser significance. • You do lose some information, but if the eigenvalues are small, you don’t lose much • p dimensions in original data • calculate p eigenvectors and eigenvalues • choose only the first d eigenvectors, based on their eigenvalues • final data set has only d dimensions
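The steps above can be sketched as a short function. Choosing d by a cumulative-variance threshold is one common heuristic and an assumption here, not something the slide prescribes; `pca_reduce` and `variance_to_keep` are hypothetical names.

```python
import numpy as np

def pca_reduce(X, variance_to_keep=0.95):
    """Project rows of X onto the fewest PCs that reach the variance target."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]             # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()    # cumulative fraction of variance
    d = int(np.searchsorted(ratio, variance_to_keep) + 1)
    return Xc @ eigvecs[:, :d], d

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                # two underlying factors
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(200, 10))
Y, d = pca_reduce(X)
print(X.shape, "->", Y.shape)
```

Because the ten observed variables are driven by only two latent factors plus a little noise, at most two components are needed to reach the threshold.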
Geometric picture of principal components • the 1st PC is a minimum distance fit to a line in X space • the 2nd PC is a minimum distance fit to a line in the plane perpendicular to the 1st PC • PCs are a series of linear least squares fits to a sample, each orthogonal to all the previous ones.
Optimality property of PCA [Figure: the original data is mapped to a lower-dimensional representation (dimension reduction) and then mapped back (reconstruction)]
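Optimality property of PCA • Main theoretical result: the matrix G consisting of the first d eigenvectors of the covariance matrix S solves the following minimization problem: $\min_{G \in \mathbb{R}^{p \times d}} \| X - G (G^T X) \|_F^2$ subject to $G^T G = I_d$, where X is the centered data matrix and the objective is the reconstruction error • The PCA projection minimizes the reconstruction error among all linear projections of size d.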
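The optimality claim can be illustrated numerically by comparing the PCA basis with an arbitrary orthonormal basis of the same size. This is a sketch under stated assumptions: the data are synthetic, the covariance uses a 1/n normalization, and the random basis `Q` is just one hypothetical competitor, so the comparison illustrates the result rather than proving it.

```python
import numpy as np

rng = np.random.default_rng(0)
scales = np.array([[5.0], [3.0], [1.0], [0.5], [0.1]])
X = rng.normal(size=(5, 300)) * scales     # p x n, columns are samples
X = X - X.mean(axis=1, keepdims=True)      # center each variable
d = 2

S = X @ X.T / X.shape[1]                   # covariance matrix (1/n convention)
eigvals, eigvecs = np.linalg.eigh(S)       # ascending eigenvalues
G = eigvecs[:, ::-1][:, :d]                # first d eigenvectors

def reconstruction_error(G, X):
    """Mean squared error of reconstructing X from its projection G^T X."""
    return np.sum((X - G @ (G.T @ X)) ** 2) / X.shape[1]

# Hypothetical baseline: a random orthonormal basis of the same size.
Q, _ = np.linalg.qr(rng.normal(size=(5, d)))

pca_err = reconstruction_error(G, X)
rand_err = reconstruction_error(Q, X)
print(pca_err, rand_err)
```

The PCA error equals the sum of the discarded eigenvalues, which is why small trailing eigenvalues mean little information is lost.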
Applications of PCA • Eigenfaces for recognition. Turk and Pentland. 1991. • Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo. 2001. • Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003.
PCA applications - Eigenfaces • the principal eigenface looks like a bland androgynous average human face http://en.wikipedia.org/wiki/Image:Eigenfaces.png
Eigenfaces – Face Recognition • When properly weighted, eigenfaces can be summed together to create an approximate gray-scale rendering of a human face. • Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces • Hence eigenfaces provide a means of applying data compression to faces for identification purposes. • Similarly, Expert Object Recognition in Video
PCA for image compression [Figure: an original image and its PCA reconstructions using d = 1, 2, 4, 8, 16, 32, 64 and 100 components]
Topics today • Feature reduction algorithms • Principal component analysis • Canonical correlation analysis • Independent component analysis
Canonical correlation analysis (CCA) • CCA was developed first by H. Hotelling. • H. Hotelling. Relations between two sets of variates. Biometrika, 28:321-377, 1936. • CCA measures the linear relationship between two multidimensional variables. • CCA finds two bases, one for each variable, that are optimal with respect to correlations. • Applications in economics, medical studies, bioinformatics and other areas.
Canonical correlation analysis (CCA) • Two multidimensional variables • Two different measurements on the same set of objects • Web images and associated text • Protein (or gene) sequences and related literature (text) • Protein sequence and corresponding gene expression • In classification: feature vector and class label • Two measurements on the same object are likely to be correlated. • May not be obvious on the original measurements. • Find the maximum correlation on transformed space.
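Canonical Correlation Analysis (CCA) [Figure: each of the two measurements is mapped by its own transformation to transformed data, and the correlation between the transformed data is computed]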
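Problem definition • Find two sets of basis vectors, one for x and the other for y, such that the correlations between the projections of the variables onto these basis vectors are maximized. • Given paired samples $\{(x_i, y_i)\}_{i=1}^{n}$, compute two basis vectors $w_x$ and $w_y$ and the projections $x \to w_x^T x$ and $y \to w_y^T y$.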
Problem definition • Compute the two basis vectors so that the correlations of the projections onto these vectors are maximized.
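Algebraic derivation of CCA • The optimization problem is equivalent to $\max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{(w_x^T C_{xx} w_x)(w_y^T C_{yy} w_y)}}$, where $C_{xx}$ and $C_{yy}$ are the covariance matrices of x and y, and $C_{xy}$ is the cross-covariance matrix between x and y.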
Geometric interpretation of CCA • The Geometry of CCA Maximization of the correlation is equivalent to the minimization of the distance.
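Algebraic derivation of CCA • The optimization problem is equivalent to $\max_{w_x, w_y} w_x^T C_{xy} w_y$ s.t. $w_x^T C_{xx} w_x = 1$ and $w_y^T C_{yy} w_y = 1$, where $C_{xx}$ and $C_{yy}$ are the covariance matrices of x and y, and $C_{xy}$ is the cross-covariance matrix between x and y.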
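One standard way to compute the first pair of canonical directions is to whiten each set of variables and take an SVD of the whitened cross-covariance. The sketch below follows that route and is not necessarily the derivation used in the lecture; the helper names `inv_sqrt` and `first_canonical_pair` and the synthetic data are assumptions, and the covariance matrices are assumed to be invertible.

```python
import numpy as np

def inv_sqrt(M):
    """Inverse symmetric square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def first_canonical_pair(X, Y):
    """Rows of X and Y are paired samples; returns (w_x, w_y, correlation)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = len(Xc)
    Cxx, Cyy, Cxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)      # whitening transforms
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)    # top singular value = max correlation
    return Kx @ U[:, 0], Ky @ Vt[0], s[0]

rng = np.random.default_rng(0)
shared = rng.normal(size=500)                  # hidden signal common to both views
X = np.column_stack([shared + 0.1 * rng.normal(size=500),
                     rng.normal(size=500)])
Y = np.column_stack([rng.normal(size=500),
                     -shared + 0.1 * rng.normal(size=500),
                     rng.normal(size=500)])

w_x, w_y, rho = first_canonical_pair(X, Y)
print(rho, np.corrcoef(X @ w_x, Y @ w_y)[0, 1])
```

The returned vectors satisfy the unit-variance constraints, so the top singular value is the correlation between the two projections.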
Applications in bioinformatics • CCA can be extended to multiple views of the data • Multiple (larger than 2) data sources • Two different ways to combine different data sources • Multiple CCA • Consider all pairwise correlations • Integrated CCA • Divide into two disjoint sources