
Independent Component Analysis



Presentation Transcript


  1. Independent Component Analysis Reference: Independent Component Analysis: A Tutorial by Aapo Hyvarinen, http://www.cis.hut.fi/projects/ica

  2. History • Early 1980s – Hérault, Jutten, Ans • Neurophysiological problem (muscle contraction) • 1990s – wider use • Infomax algorithm (Bell, Sejnowski) • Connection to maximum likelihood (Amari et al.) • FastICA algorithm (Hyvarinen, Karhunen, Oja) • 1999 – first international workshop

  3. ICA Conference • ICA1999, Aussois (France) • ICA2000, Helsinki (Finland) • ICA2001, San Diego (CA, USA) • ICA2003, Nara (Japan) • ICA2004, Granada (Spain) • ICA2006, Charleston (SC, USA) • ICA2007, London (UK) • ICA2009, Paraty (Brazil)

  4. International Conference on Latent Variable Analysis and Signal Separation • LVA/ICA 2010, St. Malo, France • LVA/ICA 2012, Tel Aviv, Israel

  5. Formulation of ICA Two speech signals s1(t) and s2(t) are received by two microphones; the mixed (recorded) signals are x1(t) = a11 s1(t) + a12 s2(t) and x2(t) = a21 s1(t) + a22 s2(t), where the aij are mixing parameters. It would be very useful if we could estimate the original signals s1(t) and s2(t) from only the recorded signals x1(t) and x2(t).

  6. Formulation of ICA • One approach is to use some information on the statistical properties of the signals si(t) to estimate the mixing parameters aij. • Assume s1(t) and s2(t) are statistically independent; then Independent Component Analysis techniques can retrieve s1(t) and s2(t) from the mixtures x1(t) and x2(t).

  7. Figure: original signals s1(t), s2(t); mixture signals x1(t), x2(t); recovered estimates of s1(t), s2(t)
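A minimal Python sketch of this two-microphone setup, assuming NumPy and scikit-learn are available; the synthetic source signals and the mixing matrix below are invented for illustration only.

    import numpy as np
    from sklearn.decomposition import FastICA   # one widely used ICA implementation

    t = np.linspace(0, 1, 2000)

    # Two synthetic stand-ins for the speech sources s1(t), s2(t)
    s1 = np.sin(2 * np.pi * 5 * t)               # smooth tone
    s2 = np.sign(np.sin(2 * np.pi * 3 * t))      # square wave
    S = np.c_[s1, s2]

    # Mixing matrix A (unknown in a real recording); x = As gives the microphone signals
    A = np.array([[1.0, 0.6],
                  [0.5, 1.2]])
    X = S @ A.T

    # Recover the sources from the mixtures alone
    ica = FastICA(n_components=2, random_state=0)
    S_hat = ica.fit_transform(X)   # estimated sources, up to order and scale
    A_hat = ica.mixing_            # estimated mixing matrix

The recovered signals in S_hat match s1 and s2 only up to ordering and scaling, which anticipates the ambiguities discussed in slides 13 and 14.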

  8. Definition of ICA The ICA model is x = As, where x is the observed vector of mixtures, s is the vector of independent components, and A is the mixing matrix. The independent components si are latent variables, meaning that they cannot be directly observed, and the mixing matrix A is assumed to be unknown.

  9. Definition of ICA (block diagram) Mixing: s → A → x = As. Separation: x → W → s = Wx, where W estimates the inverse of A.

  10. Illustration of ICA Two independent components with uniform distributions p(si) = 1/(2√3) if |si| ≤ √3, and 0 otherwise, so that each has zero mean and variance equal to one. Figure: the joint density of the original signals s1 and s2 is uniform on a square.

  11. Illustration of ICA Figure: joint density of the observed mixtures x1 and x2. The mixed data has a uniform distribution on a parallelogram.

  12. Illustration of ICA • An intuitive way of estimating A: the edges of the parallelogram are in the directions of the columns of A. That is, estimate the ICA model by • first estimating the joint density of x1 and x2, and then • locating the edges. • However, this only works for random variables with uniform distributions.
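A short NumPy sketch of slides 10–12, assuming unit-variance uniform sources and an arbitrary example mixing matrix:

    import numpy as np

    rng = np.random.default_rng(1)

    # Uniform sources on [-sqrt(3), sqrt(3)] have zero mean and unit variance
    S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(5000, 2))

    A = np.array([[2.0, 3.0],
                  [2.0, 1.0]])   # example mixing matrix
    X = S @ A.T                  # mixed data: uniform on a parallelogram

    # The edges of that parallelogram point along the columns of A;
    # its corners are the images of the corners of the original square
    corners = np.sqrt(3) * np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]])
    print(corners @ A.T)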

  13. Ambiguities of ICA We cannot determine the variances (energies) of the independent components: any scalar multiplier α in a source si can be cancelled by dividing the corresponding column ai of A by the same scalar, since x = Σi ai si = Σi (ai/α)(α si).

  14. Ambiguities of ICA We cannot determine the order of the independent components. Applying a permutation matrix P to x = As, i.e., writing x = AP^-1 Ps, • AP^-1 is just a new unknown mixing matrix, • Ps is still a vector of independent sources, • but the order of the components is changed.
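Both ambiguities can be checked numerically; a small sketch with made-up values of A and s:

    import numpy as np

    A = np.array([[1.0, 0.6],
                  [0.5, 1.2]])
    s = np.array([2.0, -1.0])

    # Scaling ambiguity: multiply s1 by alpha and divide column 1 of A by alpha
    alpha = 3.0
    A2, s2 = A.copy(), s.copy()
    A2[:, 0] /= alpha
    s2[0] *= alpha
    print(np.allclose(A @ s, A2 @ s2))   # True: x is unchanged

    # Permutation ambiguity: x = (A P^-1)(P s) for any permutation matrix P
    P = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
    print(np.allclose(A @ s, (A @ np.linalg.inv(P)) @ (P @ s)))   # True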

  15. Independence Let p(y1, y2) be the joint probability density function (pdf) of y1 and y2; then y1 and y2 are independent if and only if the joint pdf is factorizable: p(y1, y2) = p1(y1) p2(y2). Thus, given two functions h1 and h2, we always have E{h1(y1) h2(y2)} = E{h1(y1)} E{h2(y2)}.

  16. Uncorrelated variables are only partly independent • Two variables y1 and y2 are said to be uncorrelated if their covariance is zero: E{y1 y2} - E{y1} E{y2} = 0. • If the variables are independent, they are uncorrelated, but the reverse is not true! • For example, sin(x) and cos(x) are both functions of x and hence dependent on each other, yet cov(sin(x), cos(x)) = 0 when x is uniformly distributed.
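The sin/cos example can be checked with a few lines of NumPy (a sketch assuming x is uniform on [0, 2π]):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0.0, 2.0 * np.pi, size=1_000_000)
    y1, y2 = np.sin(x), np.cos(x)

    # Uncorrelated: the sample covariance is (numerically) zero
    print(np.cov(y1, y2)[0, 1])                 # close to 0

    # But not independent: e.g. E{y1^2 y2^2} != E{y1^2} E{y2^2}
    print(np.mean(y1**2 * y2**2))               # about 0.125
    print(np.mean(y1**2) * np.mean(y2**2))      # about 0.25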

  17. Gaussian variables are forbidden • The fundamental restriction in ICA is that the independent components must be nongaussian for ICA to be possible. • Assume the mixing matrix is orthogonal and the si are gaussian; then x1 and x2 are gaussian, uncorrelated, and of unit variance. • The joint pdf is p(x1, x2) = (1/(2π)) exp(-(x1^2 + x2^2)/2).

  18. This distribution is completely symmetric (shown in the figure), so it does not contain any information on the directions of the columns of the mixing matrix A. Thus A cannot be estimated. Figure: the distribution of two independent gaussian variables.

  19. ICA Basics • Source separation by ICA must go beyond second-order statistics. • Any time structure is ignored, because the information contained in the data is then exhaustively represented by the sample distribution of the observed vector. • Source separation can be obtained by optimizing a “contrast function”, i.e., a function that measures independence.

  20. Nongaussian is independent • The key to estimating the ICA model is nongaussianity. • The central limit theorem tells us that the distribution of a sum of independent random variables tends toward a gaussian distribution. In other words, • a mixture of two independent signals usually has a distribution that is closer to gaussian than either of the two original signals.
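This tendency toward gaussianity can be seen numerically; a quick sketch (the sample sizes and the use of scipy.stats.kurtosis as a gaussianity gauge are my own choices):

    import numpy as np
    from scipy.stats import kurtosis   # excess kurtosis; 0 for a gaussian

    rng = np.random.default_rng(3)

    # Unit-variance uniform variables; sums of more of them look more gaussian
    U = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(100_000, 8))
    for k in (1, 2, 4, 8):
        mix = U[:, :k].sum(axis=1) / np.sqrt(k)   # rescale to keep unit variance
        print(k, kurtosis(mix))                   # roughly -1.2, -0.6, -0.3, -0.15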

  21. Measures of independence • Suppose we want to estimate y, one of the independent components of s, from x. • Let us denote this by y = w^T x = Σi wi xi, where w is a vector to be determined. • How can we use the central limit theorem to determine w so that it equals one of the rows of the inverse of A? x = As, s = A^-1 x = Wx?

  22. Nongaussian is independent • Let us make a change of variables, z = A^T w; then we have y = w^T x = w^T As = z^T s = Σi zi si. • By the central limit theorem, y = z^T s is more gaussian than the original variables si. • y becomes least gaussian when it equals one of the si; a trivial way to achieve this is to let only one of the elements zi of z be nonzero. • Maximizing the nongaussianity of w^T x therefore gives us one of the independent components.
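A brute-force sketch of this idea in two dimensions (my own illustration, not an algorithm from the slides): whiten a mixture of two uniform sources, then sweep unit vectors w and watch where the nongaussianity of w^T x, measured here by |kurtosis|, peaks.

    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.default_rng(4)

    # Two unit-variance uniform (nongaussian) sources and their mixtures x = As
    S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(100_000, 2))
    A = np.array([[1.0, 0.6],
                  [0.5, 1.2]])
    X = S @ A.T

    # Whiten the mixtures (centering + EVD whitening, see slides 38-42)
    X = X - X.mean(axis=0)
    d, E = np.linalg.eigh(np.cov(X, rowvar=False))
    Z = X @ (E @ np.diag(d ** -0.5) @ E.T)

    # Sweep w = (cos t, sin t); |kurt(w^T z)| is largest when w^T z equals
    # (up to sign) one of the original sources
    angles = np.linspace(0.0, np.pi, 180)
    scores = [abs(kurtosis(Z @ np.array([np.cos(t), np.sin(t)]))) for t in angles]
    print(angles[int(np.argmax(scores))], max(scores))   # |kurt| near 1.2 at the peak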

  23. Measures of nongaussianity • To use nongaussianity in ICA, we must have a quantitative measure of the nongaussianity of a random variable y. • The classical measure of nongaussianity is kurtosis, or the fourth-order cumulant: kurt(y) = E{y^4} - 3(E{y^2})^2.

  24. Assume y is of unit variance; then kurt(y) = E{y^4} - 3. • For a gaussian y, the fourth moment equals 3(E{y^2})^2, so kurtosis is zero for a gaussian random variable. • Kurtosis can be both positive and negative.

  25. Subgaussian • RVs with negative kurtosis are called subgaussian. • Subgaussian RVs typically have a flat pdf, which is rather constant near zero and very small for larger values. • The uniform distribution is a typical example of a subgaussian distribution.

  26. Supergaussian • RVs with positive kurtosis are called supergaussian. • Supergaussian RVs have a spiky pdf with heavy tails. • The Laplace distribution is a typical example of a supergaussian distribution.
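These three cases can be verified directly; a short sketch using scipy.stats.kurtosis (which reports excess kurtosis, i.e. E{y^4} - 3 for unit-variance data):

    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.default_rng(5)
    n = 1_000_000

    gauss = rng.standard_normal(n)                    # kurtosis ~ 0
    uni = rng.uniform(-np.sqrt(3), np.sqrt(3), n)     # subgaussian, kurtosis ~ -1.2
    lap = rng.laplace(0.0, 1.0 / np.sqrt(2), n)       # supergaussian, kurtosis ~ +3

    print(kurtosis(gauss), kurtosis(uni), kurtosis(lap))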

  27. Kurtosis • Typically, nongaussianity is measured by the absolute value of kurtosis. • Kurtosis can be estimated from the fourth moments of the sample data. • If x1 and x2 are two independent RVs, it holds that kurt(x1 + x2) = kurt(x1) + kurt(x2) and kurt(αx1) = α^4 kurt(x1).

  28. Kurtosis-based approach In practice we could start from a weight vector w, compute the direction in which the kurtosis of y = w^T x is growing or decreasing most strongly, based on the available sample x(1), …, x(T) of the mixture vector x, and use a gradient method to find a new vector w.
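One way such a gradient scheme could look (a sketch under my own assumptions: the data Z are already centered and whitened, w is kept at unit norm, and kurt(w^T z) = E{(w^T z)^4} - 3, whose gradient with respect to w is then 4E{z(w^T z)^3} - 12w):

    import numpy as np

    def one_ica_direction(Z, n_iter=200, mu=0.1, seed=0):
        """Find one weight vector w by gradient ascent on |kurt(w^T z)|,
        for whitened data Z of shape (n_samples, n_dims)."""
        rng = np.random.default_rng(seed)
        w = rng.standard_normal(Z.shape[1])
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            y = Z @ w                                            # current projection
            kurt = np.mean(y ** 4) - 3.0                         # excess kurtosis
            grad = 4.0 * (Z * (y ** 3)[:, None]).mean(axis=0) - 12.0 * w
            w = w + mu * np.sign(kurt) * grad                    # step to increase |kurt|
            w /= np.linalg.norm(w)                               # stay on the unit sphere
        return w

Applied to the whitened Z from the sketch after slide 22, Z @ one_ica_direction(Z) recovers one source up to sign.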

  29. Drawbacks of Kurtosis The main problem is that kurtosis can be very sensitive to outliers; in other words, kurtosis is not a robust measure of nongaussianity.

  30. Entropy • The entropy of an RV is a measure of the degree of randomness of the observed variable: H(y) = -∫ f(y) log f(y) dy for a continuous RV with density f. • The more unpredictable and unstructured the variable is, the larger its entropy.

  31. Entropy A fundamental result of information theory is that a gaussian variable has the largest entropy among all random variables of equal variance.

  32. Negentropy To obtain a measure of nongaussianity that is zero for a gaussian variable and always nonnegative, one often uses negentropy J, defined as J(y) = H(ygauss) - H(y), where ygauss is a gaussian RV with the same covariance matrix as y.

  33. Negentropy J(y) = H(ygauss) - H(y) • The advantage of using negentropy is that it is in some sense the optimal estimator of nongaussianity, as far as statistical properties are concerned. • The problem with using negentropy is that it is computationally very difficult. • Thus simpler approximations of negentropy seem necessary and useful.

  34. Approximations of negentropy • The classical method of approximating negentropy uses higher-order moments, for example J(y) ≈ (1/12) E{y^3}^2 + (1/48) kurt(y)^2. • The RV y is assumed to be of zero mean and unit variance. • This approach still suffers from the same nonrobustness as kurtosis.

  35. Approximations of negentropy Another approximation was developed based on the maximum-entropy principle: J(y) ∝ [E{G(y)} - E{G(ν)}]^2, where ν is a gaussian variable of zero mean and unit variance, and G is a nonquadratic function.

  36. Contrast Function Suppose G is chosen to be slowly growing, as in the following contrast functions: G1(u) = (1/a1) log cosh(a1 u), with 1 ≤ a1 ≤ 2, and G2(u) = -exp(-u^2/2). This approximation is conceptually simple, fast to compute, and especially robust.
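A minimal sketch of this negentropy approximation with G1(u) = (1/a1) log cosh(a1 u); the proportionality constant is dropped, so only relative comparisons between variables are meaningful.

    import numpy as np

    def negentropy_proxy(y, a1=1.0, seed=6):
        """Approximate J(y) up to a constant as [E{G(y)} - E{G(nu)}]^2,
        with G(u) = (1/a1) log cosh(a1 u) and nu a standard gaussian sample.
        y is assumed to be zero-mean with unit variance."""
        rng = np.random.default_rng(seed)
        nu = rng.standard_normal(1_000_000)

        def G(u):
            return np.log(np.cosh(a1 * u)) / a1

        return (np.mean(G(y)) - np.mean(G(nu))) ** 2

    rng = np.random.default_rng(7)
    print(negentropy_proxy(rng.standard_normal(100_000)))             # ~ 0 (gaussian)
    print(negentropy_proxy(rng.laplace(0, 1 / np.sqrt(2), 100_000)))  # clearly > 0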

  37. Contrast Function (figure only)

  38. Preprocessing - centering • Center the variable x, i.e., subtract its mean vector m = E{x}, so as to make x a zero-mean variable. • This preprocessing is done solely to simplify the ICA algorithms. • After estimating the mixing matrix A with centered data, we can complete the estimation by adding the mean vector of s back to the centered estimates of s; this mean vector is given by A^-1 m.

  39. Preprocessing - whitening • Whitening means transforming the variable x linearly so that the new variable x~ is white, i.e., its components are uncorrelated and their variances equal unity: E{x~ x~^T} = I.

  40. Preprocessing - whitening • Although uncorrelated variables are only partly independent, decorrelation (using second-order information) can be used to reduce the problem to a simpler form.

  41. Fig 1, Fig 2, Fig 3: The graph in Fig 3 is the whitened result of the data in Fig 2. The square depicting the distribution is clearly a rotated version of the original square in Fig 1. All that is left is the estimation of a single angle that gives the rotation.

  42. Preprocessing - whitening • Whitening can be computed by the eigenvalue decomposition (EVD) of the covariance matrix E{xx^T} = EDE^T. • E is the orthogonal matrix of eigenvectors of E{xx^T}. • D is the diagonal matrix of its eigenvalues, D = diag(d1, …, dn). • The whitened variable is then x~ = ED^(-1/2)E^T x. • Note that E{xx^T} can be estimated in a standard way from the available sample x(1), …, x(T).
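A direct NumPy rendering of this EVD-based whitening (a sketch; it assumes the rows of X are observations and that X has already been centered as in slide 38):

    import numpy as np

    def whiten(X):
        """Whiten centered data X (n_samples x n_dims) via the EVD of the
        covariance matrix E{x x^T} = E D E^T; returns x~ = E D^(-1/2) E^T x."""
        C = np.cov(X, rowvar=False)            # sample covariance matrix
        d, E = np.linalg.eigh(C)               # d: eigenvalues, E: eigenvectors
        V = E @ np.diag(d ** -0.5) @ E.T       # whitening matrix
        return X @ V.T, V

    rng = np.random.default_rng(8)
    X = rng.standard_normal((10_000, 2)) @ np.array([[2.0, 0.0],
                                                     [1.5, 0.5]])
    X = X - X.mean(axis=0)
    Z, V = whiten(X)
    print(np.cov(Z, rowvar=False))             # approximately the identity matrix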

  43. ICA method • ICA method = objective function + optimization algorithm • The optimization algorithm minimizes or maximizes the objective function.

  44. Objective (contrast) functions • Multi-unit (all components at once): likelihood, entropy, mutual information • One-unit (one component at a time – projection pursuit): negentropy, kurtosis (a measure of nongaussianity)
