This article introduces Principal Component Analysis (PCA) and its applications in neural networks for feature selection and dimensionality reduction. It discusses the Generalized Hebbian Algorithm, Adaptive Principal Components Extraction, and Kernel Principal Components Analysis. Conclusions are drawn from the analysis.
WK9 – Principal Component Analysis
CS 476: Networks of Neural Computation
Dr. Stathis Kasderidis
Dept. of Computer Science, University of Crete
Spring Semester, 2009
Contents
• Introduction to Principal Component Analysis
• Generalised Hebbian Algorithm
• Adaptive Principal Components Extraction
• Kernel Principal Components Analysis
• Conclusions
Principal Component Analysis
• PCA is a statistical method for feature selection and dimensionality reduction.
• Feature selection is a process whereby a data space is transformed into a feature space. In principle, both spaces have the same dimensionality.
• In PCA, however, the transformation is designed in such a way that the data set is represented by a reduced number of "effective" features and yet retains most of the intrinsic information contained in the data; in other words, the data set undergoes a dimensionality reduction.
Principal Component Analysis-1
• Suppose that we have a vector x of dimension m and we wish to transmit it using l numbers, where l < m. If we simply truncate the vector x, we cause a mean-square error equal to the sum of the variances of the elements eliminated from x.
• So we ask: does there exist an invertible linear transformation T such that the truncation of Tx is optimum in the mean-square sense?
• Clearly, the transformation T should have the property that some of the components of Tx have low variance.
Principal Component Analysis-2
• Principal Component Analysis maximises the rate of decrease of variance and is therefore the right choice.
• Before we present the Hebbian-based neural network algorithms that do this, we first present the statistical analysis of the problem.
• Let X be an m-dimensional random vector representing the environment of interest. We assume that the vector X has zero mean:
• $E[X] = 0$
• where E is the statistical expectation operator. If X does not have zero mean, we subtract the mean from X before proceeding with the rest of the analysis.
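As a quick numerical check of the truncation argument above, the following Python sketch compares the mean-square error of plain truncation with the sum of the variances of the discarded elements. NumPy is assumed, and the helper name `truncation_mse` and the synthetic data are illustrative only, not part of the lecture.

```python
import numpy as np

def truncation_mse(X, l):
    """Compare truncation error with the variances of the dropped elements.

    Rows of X are observations. Returns (mean-square error of keeping only
    the first l elements, sum of the variances of the remaining elements).
    """
    Xc = X - X.mean(axis=0)                 # enforce the zero-mean assumption
    dropped = Xc[:, l:]                     # elements eliminated by truncation
    mse = np.mean(np.sum(dropped ** 2, axis=1))
    var_sum = np.sum(Xc.var(axis=0)[l:])    # population variances (ddof=0)
    return mse, var_sum

# Hypothetical data: five independent elements with different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

mse, var_sum = truncation_mse(X, l=2)
print(mse, var_sum)                         # the two numbers should agree
```

The agreement holds for any data set once it is centred, which is why the zero-mean assumption is made before the analysis starts.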
Principal Component Analysis-3
• Let q denote a unit vector, also of dimension m, onto which the vector X is to be projected. This projection is defined by the inner product of the vectors X and q:
• $A = X^T q = q^T X$
• subject to the constraint:
• $\|q\| = (q^T q)^{1/2} = 1$
• The projection A is a random variable with a mean and variance related to the statistics of the vector X. Assuming that X has zero mean, the mean value of the projection A is:
• $E[A] = q^T E[X] = 0$
Principal Component Analysis-4
• The variance of A is therefore the same as its mean-square value, and so we can write:
• $\sigma^2 = E[A^2] = E[(q^T X)(X^T q)] = q^T E[X X^T] q = q^T R q$
• The m-by-m matrix R is the correlation matrix of the random vector X, formally defined as the expectation of the outer product of the vector X with itself:
• $R = E[X X^T]$
• We observe that the matrix R is symmetric, which means that:
• $R^T = R$
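The relation between the projection variance and the correlation matrix can be checked numerically. The sketch below is a minimal illustration assuming NumPy; the function names `correlation_matrix` and `projection_variance`, the mixing matrix and the direction q are all hypothetical choices made for the example. It estimates R from samples and compares the mean-square value of the projection A with the quadratic form in R.

```python
import numpy as np

def correlation_matrix(X):
    """Sample estimate of R = E[X X^T]; rows of X are zero-mean observations."""
    return X.T @ X / X.shape[0]

def projection_variance(R, q):
    """Quadratic form q^T R q for the projection onto the direction q."""
    q = q / np.linalg.norm(q)            # enforce the constraint ||q|| = 1
    return q @ R @ q

# Hypothetical correlated data, centred to satisfy the zero-mean assumption.
rng = np.random.default_rng(1)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
X = rng.normal(size=(2000, 3)) @ mixing
X = X - X.mean(axis=0)

R = correlation_matrix(X)
q = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
A = X @ q                                # one projection value per observation

symmetric = np.allclose(R, R.T)
matches = np.isclose(np.mean(A ** 2), projection_variance(R, q))
print(symmetric, matches)
```

Because the sample correlation matrix is built from the same observations, the two quantities agree up to rounding rather than only approximately.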
Principal Component Analysis-5
• From this property it follows that for any m-by-1 vectors a and b we have:
• $a^T R b = b^T R a$
• From the above we see that the variance $\sigma^2$ of A is a function of the unit vector q; we can thus write:
• $\psi(q) = \sigma^2 = q^T R q$
• We can therefore think of $\psi(q)$ as a variance probe.
• To find the directions along which the variance of A is extremal, we must find the vectors q that are the extremal points of $\psi(q)$, subject to the constraint of unit length.
Principal Component Analysis-6
• If q is a vector such that $\psi(q)$ has an extreme value, then for any small perturbation $\delta q$ of the unit vector q we find that, to first order in $\delta q$:
• $\psi(q + \delta q) = \psi(q)$
• Now, from the definition of the variance probe, we have:
• $\psi(q + \delta q) = (q + \delta q)^T R (q + \delta q) = q^T R q + 2 (\delta q)^T R q + (\delta q)^T R \, \delta q$
• where in the last step we have made use of the symmetry of the matrix R.
Principal Component Analysis-7
• Ignoring the second-order term $(\delta q)^T R \, \delta q$ and invoking the definition of $\psi(q)$, we may write:
• $\psi(q + \delta q) = q^T R q + 2 (\delta q)^T R q = \psi(q) + 2 (\delta q)^T R q$
• Since $\psi(q + \delta q) = \psi(q)$ to first order, the above relation implies that:
• $(\delta q)^T R q = 0$
• Note that not every perturbation $\delta q$ of q is admissible; we restrict ourselves to those for which the Euclidean norm of the perturbed vector $q + \delta q$ remains equal to unity:
• $\|q + \delta q\| = 1$
• or, equivalently: $(q + \delta q)^T (q + \delta q) = 1$
Principal Component Analysis-8
• Taking into account that q is already a vector of unit length, this means that, to first order in $\delta q$:
• $(\delta q)^T q = 0$
• That is, the perturbation $\delta q$ must be orthogonal to q, and therefore only a small change in the direction of q is permitted.
• Combining the previous two equations, we can now write:
• $(\delta q)^T R q - \lambda (\delta q)^T q = 0$, i.e. $(\delta q)^T (R q - \lambda q) = 0$
• where $\lambda$ is a scaling factor with the same dimensions as the elements of R.
Principal Component Analysis-9
• For this condition to hold, we require:
• $R q = \lambda q$
• This means that q is an eigenvector and $\lambda$ is an eigenvalue of R.
• The matrix R has real eigenvalues because it is symmetric, and they are non-negative because $q^T R q = E[A^2] \geq 0$ for every q. Let the eigenvalues of R be denoted by $\lambda_i$ and the corresponding eigenvectors by $q_i$, where the eigenvalues are arranged in decreasing order:
• $\lambda_1 > \lambda_2 > \ldots > \lambda_m$
• so that $\lambda_1 = \lambda_{\max}$.
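The eigenvalue result can be illustrated with a batch eigen-decomposition. The sketch below assumes NumPy and uses `np.linalg.eigh`, which is intended for symmetric matrices; the helper `sorted_eigh` and the synthetic data are illustrative assumptions. It evaluates the quadratic form q^T R q along each eigenvector and along a set of random unit vectors, so that the extremal property can be inspected directly.

```python
import numpy as np

def sorted_eigh(R):
    """Eigen-decomposition of a symmetric matrix, eigenvalues in decreasing order."""
    lam, Q = np.linalg.eigh(R)           # eigh returns eigenvalues in ascending order
    order = np.argsort(lam)[::-1]
    return lam[order], Q[:, order]

# Hypothetical zero-mean data with four distinct variances.
rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)
R = X.T @ X / X.shape[0]

lam, Q = sorted_eigh(R)

# Quadratic form q^T R q along each eigenvector (columns of Q) ...
probe_eig = np.array([Q[:, j] @ R @ Q[:, j] for j in range(4)])

# ... and along 100 random unit vectors.
random_q = rng.normal(size=(100, 4))
random_q /= np.linalg.norm(random_q, axis=1, keepdims=True)
probe_rand = np.einsum("ni,ij,nj->n", random_q, R, random_q)

print(lam)
print(probe_eig)
print(probe_rand.min(), probe_rand.max())
```

Along the eigenvectors the quadratic form reproduces the eigenvalues, and no random direction gives a value outside the interval between the smallest and the largest eigenvalue.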
Principal Component Analysis-10
• We can then write the matrix R in terms of its eigenvalues and eigenvectors as:
• $R = \sum_{i=1}^{m} \lambda_i q_i q_i^T$
• Combining the previous results, we see that the variance probes are the same as the eigenvalues:
• $\psi(q_j) = \lambda_j$, for j = 1, 2, …, m
• To summarise the previous analysis, we have two important results:
• The eigenvectors of the correlation matrix R pertaining to the zero-mean random vector X define the unit vectors $q_j$, representing the principal directions along which the variance probes $\psi(q_j)$ have their extreme values;
• The associated eigenvalues define the extremal values of the variance probes.
Principal Component Analysis-11
• We now want to investigate the representation of a data vector x which is a realisation of the random vector X.
• With m eigenvectors $q_j$ we have m possible projection directions. The projections of x onto the eigenvectors are given by:
• $\alpha_j = q_j^T x = x^T q_j$, j = 1, 2, …, m
• The numbers $\alpha_j$ are called the principal components.
Principal Component Analysis-12
• To reconstruct the original vector x from the projections, we combine all projections into a single vector:
• $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_m]^T = [x^T q_1, x^T q_2, \ldots, x^T q_m]^T = Q^T x$
• where Q is the matrix whose columns are the eigenvectors of R.
• Since Q is orthogonal, it follows that:
• $x = Q \alpha = \sum_{j=1}^{m} \alpha_j q_j$
Principal Component Analysis-13
• This is nothing more than a coordinate transformation from the input space of the vector x to the feature space of the vector $\alpha$.
• From the perspective of pattern recognition, the usefulness of the PCA method is that it provides an effective technique for dimensionality reduction.
• In particular, we may reduce the number of features needed for effective data representation by discarding those linear combinations in the previous formula that have small variances and retaining only those terms that have large variances.
Principal Component Analysis-14
• Let $\lambda_1, \lambda_2, \ldots, \lambda_l$ denote the largest l eigenvalues of R. We may then approximate the vector x by truncating the previous sum to the first l terms:
• $\hat{x} = \sum_{j=1}^{l} \alpha_j q_j$
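The truncation step can be sketched in a few lines of Python. This is a minimal batch illustration assuming NumPy; the function name `pca_reduce` and the synthetic data are hypothetical. It projects centred data onto the l leading eigenvectors, rebuilds the approximation, and compares the mean-square reconstruction error with the sum of the discarded eigenvalues.

```python
import numpy as np

def pca_reduce(X, l):
    """Project centred data onto the l leading eigenvectors and reconstruct it.

    Returns (eigenvalues in decreasing order, principal components,
    truncated reconstruction, centred data).
    """
    Xc = X - X.mean(axis=0)
    R = Xc.T @ Xc / Xc.shape[0]            # sample correlation matrix
    lam, Q = np.linalg.eigh(R)
    order = np.argsort(lam)[::-1]
    lam, Q = lam[order], Q[:, order]
    Ql = Q[:, :l]                          # eigenvectors of the l largest eigenvalues
    components = Xc @ Ql                   # projections, one row per observation
    X_hat = components @ Ql.T              # sum of the first l terms only
    return lam, components, X_hat, Xc

# Hypothetical data with variance concentrated in two directions.
rng = np.random.default_rng(7)
X = rng.normal(size=(1500, 5)) * np.array([4.0, 2.0, 0.5, 0.3, 0.1])

l = 2
lam, components, X_hat, Xc = pca_reduce(X, l)
mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
print(mse, lam[l:].sum())                  # error versus discarded eigenvalues
```

The two printed numbers agree, which is the sense in which discarding the small-variance terms loses the least information.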
Generalised Hebbian Algorithm
• We now present a neural network method which solves the PCA problem: the Generalised Hebbian Algorithm (GHA). It belongs to the class of PCA methods known as re-estimation algorithms.
• The network which solves the problem is shown below:
Generalised Hebbian Algorithm-1
• For the feedforward network shown, we make two structural assumptions:
• Each neuron in the output layer of the network is linear;
• The network has m inputs and l outputs, both of which are specified. Moreover, the network has fewer outputs than inputs (i.e. l < m).
• It can be shown that, under these assumptions and by using a special form of Hebbian learning, the network learns to calculate the principal components at its output nodes.
• The GHA can be summarised as follows:
Generalised Hebbian Algorithm-2
• Step 1: Initialise the synaptic weights of the network, $w_{ji}$, to small random values at time n = 1. Assign a small positive value to the learning rate parameter $\eta$;
• Step 2: For n = 1, j = 1, 2, …, l and i = 1, 2, …, m, compute:
• $y_j(n) = \sum_{i=1}^{m} w_{ji}(n) x_i(n)$
• $\Delta w_{ji}(n) = \eta \left[ y_j(n) x_i(n) - y_j(n) \sum_{k=1}^{j} w_{ki}(n) y_k(n) \right]$
• where $x_i(n)$ is the ith component of the m-by-1 input vector x(n) and l is the desired number of principal components;
• Step 3: Increment n by 1, go to step 2, and continue until the synaptic weights $w_{ji}$ reach their steady-state values.
Generalised Hebbian Algorithm-3
• For large n, the weight $w_{ji}$ of neuron j converges to the ith component of the eigenvector associated with the jth eigenvalue of the correlation matrix of the input vector x(n). The output neurons thus correspond to the eigenvalues of the correlation matrix in decreasing order, from $\lambda_1$ for neuron 1 to $\lambda_l$ for neuron l.
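The GHA steps can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: NumPy is assumed, the function name `gha`, the learning rate, the number of passes and the synthetic data are all illustrative, and repeated passes over a fixed sample stand in for the time index n. The weight update is the standard GHA (Sanger) rule, written in matrix form with a lower-triangular mask.

```python
import numpy as np

def gha(X, l, eta=0.002, epochs=30, seed=0):
    """Train an l-output linear network with the GHA; rows of X are zero-mean inputs."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    W = 0.01 * rng.normal(size=(l, m))      # step 1: small random weights w_ji
    for _ in range(epochs):                 # repeated passes play the role of n = 1, 2, ...
        for x in X:
            y = W @ x                       # y_j(n) = sum_i w_ji(n) x_i(n)
            # delta w_ji = eta * [y_j x_i - y_j * sum_{k<=j} w_ki y_k]
            W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

# Hypothetical data: three variances (2.25, 1.0, 0.25) in a randomly rotated frame.
rng = np.random.default_rng(3)
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X = (rng.normal(size=(1000, 3)) * np.array([1.5, 1.0, 0.5])) @ rotation.T
X = X - X.mean(axis=0)

W = gha(X, l=2)

# Batch eigenvectors of the sample correlation matrix, for comparison.
_, Q = np.linalg.eigh(X.T @ X / X.shape[0])
Q = Q[:, ::-1]                              # columns in decreasing-eigenvalue order
alignment = np.abs(np.sum(W * Q[:, :2].T, axis=1)) / np.linalg.norm(W, axis=1)
print(alignment)                            # |cosine| between each row of W and q_j
```

With a constant learning rate the weights keep fluctuating slightly around the eigenvectors, so the alignment is close to, but not exactly, one; a decaying learning rate is a common way to reduce this residual noise.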
Adaptive Principal Components Extraction
• Another algorithm for extracting the principal components is the adaptive principal components extraction (APEX) algorithm. Its network uses both feedforward and feedback connections.
• The algorithm is iterative in nature: given the first (j-1) principal components, the jth one can be easily computed.
• This algorithm belongs to the class of decorrelating algorithms.
• The network that implements the algorithm is shown next:
Adaptive Principal Components Extraction-1
• The network structure is defined as follows:
• Each neuron in the output layer is assumed to be linear;
• Feedforward connections exist from the input nodes to each of the neurons 1, 2, …, j, with j < m. The feedforward connections operate with a Hebbian rule. They are excitatory and therefore provide amplification. These connections are represented by the vector $w_j(n)$.
Adaptive Principal Components Extraction-2
• Lateral connections exist from the individual outputs of neurons 1, 2, …, j-1 to neuron j of the output layer, thereby applying feedback to the network. These connections are represented by the vector $a_j(n)$. The lateral connections operate with an anti-Hebbian learning rule, which has the effect of making them inhibitory.
• The algorithm is summarised as follows:
• Step 1: Initialise the feedforward weight vector $w_j$ and the feedback weight vector $a_j$ to small random numbers at time n = 1, where j = 1, 2, …, m. Assign a small positive value to the learning rate parameter $\eta$;
Adaptive Principal Components Extraction-3
• Step 2: Set j = 1, and for n = 1, 2, …, compute:
• $y_1(n) = w_1^T(n) x(n)$
• $w_1(n+1) = w_1(n) + \eta \left[ y_1(n) x(n) - y_1^2(n) w_1(n) \right]$
• where x(n) is the input vector. For large n, we have $w_1(n) \to q_1$, where $q_1$ is the eigenvector associated with the largest eigenvalue $\lambda_1$ of the correlation matrix of x(n);
• Step 3: Set j = 2, and for n = 1, 2, …, compute:
• $\mathbf{y}_{j-1}(n) = [y_1(n), y_2(n), \ldots, y_{j-1}(n)]^T$
• $y_j(n) = w_j^T(n) x(n) + a_j^T(n) \mathbf{y}_{j-1}(n)$
• $w_j(n+1) = w_j(n) + \eta \left[ y_j(n) x(n) - y_j^2(n) w_j(n) \right]$
• $a_j(n+1) = a_j(n) - \eta \left[ y_j(n) \mathbf{y}_{j-1}(n) + y_j^2(n) a_j(n) \right]$
Adaptive Principal Components Extraction-4
• Step 4: Increment j by 1, go to step 3, and continue until j = m, where m is the desired number of principal components. (Note that j = 1 corresponds to the eigenvector associated with the largest eigenvalue, which is taken care of in step 2.) For large n we have $w_j(n) \to q_j$ and $a_j(n) \to 0$, where $q_j$ is the eigenvector associated with the jth eigenvalue of the correlation matrix of x(n).
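The APEX procedure can be sketched in Python as below. This is a hedged illustration, not the lecture's own code: NumPy is assumed, the update equations are the commonly published APEX rules (Hebbian for the feedforward weights, anti-Hebbian for the lateral weights), and the function name `apex`, the learning rate, the fixed number of passes per neuron and the synthetic data are assumptions made for the example. Neurons are trained one at a time, with the earlier neurons held fixed.

```python
import numpy as np

def apex(X, num_components, eta=0.001, epochs=30, seed=0):
    """Extract principal directions one neuron at a time; rows of X are zero-mean inputs."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    W = np.zeros((num_components, m))
    for j in range(num_components):
        w = 0.01 * rng.normal(size=m)       # feedforward weights of the current neuron
        a = 0.01 * rng.normal(size=j)       # lateral weights from earlier neurons (empty for the first)
        for _ in range(epochs):             # fixed passes stand in for "n = 1, 2, ..."
            for x in X:
                y_prev = W[:j] @ x          # outputs of the neurons already trained
                y = w @ x + a @ y_prev      # neuron output with lateral feedback
                w += eta * (y * x - y ** 2 * w)        # Hebbian update
                a -= eta * (y * y_prev + y ** 2 * a)   # anti-Hebbian update
        W[j] = w
    return W

# Hypothetical data: three variances (2.25, 1.0, 0.25) in a randomly rotated frame.
rng = np.random.default_rng(4)
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X = (rng.normal(size=(1000, 3)) * np.array([1.5, 1.0, 0.5])) @ rotation.T
X = X - X.mean(axis=0)

W = apex(X, num_components=2)

_, Q = np.linalg.eigh(X.T @ X / X.shape[0])
Q = Q[:, ::-1]                              # columns in decreasing-eigenvalue order
alignment = np.abs(np.sum(W * Q[:, :2].T, axis=1)) / np.linalg.norm(W, axis=1)
print(alignment)                            # |cosine| between each row of W and q_j
```

A small learning rate is used here because any residual error in an earlier neuron's weights carries over to the later neurons through the lateral connections.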
Kernel Principal Components Analysis
• The last algorithm we present uses kernels (more on these in the SVM lecture). We simply summarise the algorithm.
• This algorithm can be considered a non-linear PCA method: we first map the input space into a feature space using a non-linear transform $\varphi(x)$ and then perform a linear PCA in the feature space. This differs from the previous methods, which calculate a linear transformation between the input and feature spaces.
• Summary of the kernel PCA method:
• Given the training examples $\{x_i\}_{i=1}^{N}$, compute the N-by-N kernel matrix $K = \{K(x_i, x_j)\}$, where:
• $K(x_i, x_j) = \varphi^T(x_i) \varphi(x_j)$
Kernel Principal Components Analysis-1
• Solve the eigenvalue problem:
• $K a = \lambda a$
• where $\lambda$ is an eigenvalue of the kernel matrix K and a is the associated eigenvector;
• Normalise the eigenvectors so computed by requiring that:
• $a_k^T a_k = 1 / \lambda_k$, k = 1, 2, …, p
• where $\lambda_p$ is the smallest nonzero eigenvalue of the matrix K, assuming that the eigenvalues are arranged in decreasing order;
Kernel Principal Components Analysis-2
• For the extraction of the principal components of a test point x, compute the projections:
• $\sum_{j=1}^{N} a_{k,j} K(x_j, x)$, k = 1, 2, …, p
• where $a_{k,j}$ is the jth element of the eigenvector $a_k$.
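The kernel PCA summary can be sketched as follows, assuming NumPy. The function names `kernel_pca_fit`, `kernel_pca_project` and `linear_kernel` are hypothetical, and, like the summary, the sketch treats the mapped data as zero-mean and does not centre the kernel matrix. A linear kernel is used in the example only because the result can then be compared with ordinary PCA; any other valid kernel function could be passed instead.

```python
import numpy as np

def kernel_pca_fit(X, kernel, p):
    """Return the p largest eigenvalues of K and eigenvectors with a_k^T a_k = 1/lambda_k."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # N-by-N kernel matrix
    lam, A = np.linalg.eigh(K)              # K is symmetric; eigenvalues ascending
    order = np.argsort(lam)[::-1][:p]       # keep the p largest (assumed nonzero)
    lam, A = lam[order], A[:, order]
    A = A / np.sqrt(lam)                    # unit-norm columns rescaled to norm^2 = 1/lambda_k
    return lam, A

def kernel_pca_project(X_train, A, kernel, x):
    """Projections of a test point x: sum_j a_{k,j} K(x_j, x) for each k."""
    k = np.array([kernel(xj, x) for xj in X_train])
    return A.T @ k

def linear_kernel(u, v):
    return float(u @ v)

# Hypothetical centred training set and test point.
rng = np.random.default_rng(5)
X_train = rng.normal(size=(60, 3)) * np.array([2.0, 1.0, 0.5])
X_train = X_train - X_train.mean(axis=0)
x_test = np.array([0.5, -1.0, 0.25])

lam, A = kernel_pca_fit(X_train, linear_kernel, p=2)
proj = kernel_pca_project(X_train, A, linear_kernel, x_test)

# With a linear kernel the projections should match ordinary PCA up to sign.
_, Q = np.linalg.eigh(X_train.T @ X_train)
Q = Q[:, ::-1]
linear_proj = Q[:, :2].T @ x_test
print(np.abs(proj), np.abs(linear_proj))
```

The normalisation by the square root of each eigenvalue is what makes the implied eigenvectors in feature space have unit length, so the projections are directly comparable with those of linear PCA.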
Conclusions
• Typically, we use PCA methods for dimensionality reduction as a pre-processing step before applying other methods, for example in a pattern recognition problem.
• There are batch and adaptive numerical methods for calculating the PCA. An example of the first class is the Singular Value Decomposition (SVD) method, while the GHA is an example of an adaptive method.
• PCA is used mainly for finding clusters in high-dimensional spaces, as it is difficult to visualise these clusters otherwise.
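As an illustration of the batch route mentioned above, the sketch below computes the principal directions with a singular value decomposition of the centred data matrix. NumPy is assumed, and the function name `pca_svd` and the synthetic data are illustrative. The squared singular values, divided by the number of observations, are compared with the eigenvalues of the sample correlation matrix.

```python
import numpy as np

def pca_svd(X, l):
    """Batch PCA via SVD: returns the l largest eigenvalues and their eigenvectors."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # singular values in decreasing order
    eigenvalues = s ** 2 / Xc.shape[0]                   # eigenvalues of Xc^T Xc / N
    return eigenvalues[:l], Vt[:l].T                     # eigenvectors as columns

# Hypothetical data with four distinct variances.
rng = np.random.default_rng(6)
X = rng.normal(size=(500, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

lam_svd, Q_svd = pca_svd(X, l=2)

# Reference values from the eigen-decomposition of the sample correlation matrix.
Xc = X - X.mean(axis=0)
lam_eig = np.linalg.eigh(Xc.T @ Xc / Xc.shape[0])[0][::-1][:2]
print(lam_svd, lam_eig)
```

Working on the data matrix directly avoids forming the correlation matrix explicitly, which is one reason SVD is a popular choice when all the data are available at once.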