
Neural Computation 0368-4149-01

This seminar explores statistical methods such as PCA and ICA for unsupervised learning, focusing on maximizing data variance through linear transformations and invariant distances. Learn ways to extract features and understand the nature of noise in data.




Presentation Transcript


  1. Neural Computation 0368-4149-01. Prof. Nathan Intrator. Tuesday 16:00-19:00, Schreiber 7. Office hours: Wed 4-5. nin@tau.ac.il

  2. Outline • Goals for neural learning - Unsupervised • Goals for statistical/computational learning • PCA • ICA • Exploratory Projection Pursuit • Search for non-Gaussian distributions • Practical implementations

  3. Statistical Approach to Unsupervised Learning • Understanding the nature of data variability • Modeling the data (sometimes very flexible model) • Understanding the nature of the noise • Applying prior knowledge • Extracting features based on: • Prior knowledge • Class prediction • Unsupervised learning

  4. Principal Component Analysis. Włodzisław Duch SCE, NTU, Singapore http://www.ntu.edu.sg/home/aswduch

  5. Neuronal Goal. Map n-dimensional vectors to m-dimensional vectors, m < n. We look for axes which minimise projection errors and maximise the variance after projection. Ex: transform from 2 to 1 dimension.

  6. Algorithm (cont’d) • Preserve as much of the variance as possible: rotate, then project. [Figure: the retained axis carries more information (variance), the discarded axis less information]
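
As a rough illustration of the 2-to-1 dimension example, the sketch below scans unit directions in the plane and keeps the one with the largest projected variance. The data set, the random seed and the 1-degree grid are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D data: a cloud elongated along the 45-degree diagonal.
t = rng.normal(size=500)
noise = 0.2 * rng.normal(size=500)
X = np.column_stack([t + noise, t - noise])   # shape (n, 2)
X = X - X.mean(axis=0)                        # shift to zero mean

# Project onto each candidate unit direction and record the variance.
angles = np.linspace(0.0, np.pi, 181)         # 1-degree steps
variances = [np.var(X @ np.array([np.cos(a), np.sin(a)])) for a in angles]

best_angle = angles[int(np.argmax(variances))]    # most information kept
worst_angle = angles[int(np.argmin(variances))]   # least information kept
```

The best and worst directions come out roughly orthogonal, which is what the eigenvector treatment later in the lecture predicts without any scanning.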

  7. Linear transformations: example. 2D vectors X in a unit circle with mean (1,1); Y = A*X, where A is a 2x2 matrix. The shape is elongated and rotated, and the mean is shifted.
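
A minimal sketch of this example, assuming points drawn uniformly inside the unit circle and a hypothetical matrix A (the slide does not give its entries). It checks that the mean moves to A times the old mean and that the covariance becomes A C A^T, which is where the elongation and rotation come from.

```python
import numpy as np

rng = np.random.default_rng(1)

# Points uniformly inside the unit circle, shifted so the mean is near (1, 1).
r = np.sqrt(rng.uniform(size=2000))
phi = rng.uniform(0.0, 2.0 * np.pi, size=2000)
X = np.vstack([r * np.cos(phi), r * np.sin(phi)]) + 1.0   # shape (2, n)

A = np.array([[2.0, 1.0],
              [0.0, 1.0]])        # hypothetical 2x2 matrix
Y = A @ X

mean_X, mean_Y = X.mean(axis=1), Y.mean(axis=1)
cov_X, cov_Y = np.cov(X), np.cov(Y)   # rows are features, columns are vectors
```

Here `cov_X` is close to a multiple of the identity (a round cloud), while `cov_Y` has clearly non-zero off-diagonal entries.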

  8. Invariant distances. Euclidean distance is not invariant to general linear transformations $Y = AX$: $||Y^{(1)} - Y^{(2)}||^2 = (X^{(1)} - X^{(2)})^T A^T A (X^{(1)} - X^{(2)})$. It is invariant only for orthonormal matrices, $A^T A = I$, which make rigid rotations without stretching or shrinking distances. Idea: standardize the data in some way to create invariant distances.
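
The claim can be checked numerically. The sketch below compares the Euclidean distance between two points before and after a rotation (orthonormal) and after a general matrix; the points, the angle and the matrix are arbitrary choices for illustration.

```python
import numpy as np

def rotation(theta):
    """2x2 rotation matrix; it satisfies R.T @ R = I."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

x1 = np.array([1.0, 2.0])
x2 = np.array([4.0, 6.0])
d_original = np.linalg.norm(x1 - x2)

R = rotation(0.7)                          # orthonormal matrix
A = np.array([[2.0, 1.0], [0.0, 1.0]])     # general matrix: stretches and shears

d_rotated = np.linalg.norm(R @ x1 - R @ x2)
d_general = np.linalg.norm(A @ x1 - A @ x2)
```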

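  9. Data standardization. For each vector $X^{(j)T} = (X_1^{(j)}, \ldots, X_d^{(j)})$, $j = 1 \ldots n$, calculate the mean and std of every component; n is the number of vectors, d their dimension. Vector of mean feature values: $\bar{X} = \frac{1}{n} \sum_{j=1}^{n} X^{(j)}$. Averages over rows.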

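  10. Standard deviation. Calculate the standard deviation $\sigma_i$ of each feature i around its mean value $\bar{X}_i$. Variance = square of the standard deviation (std): the averaged sum of squared deviations from the mean value. Transform X => Z, standardized data vectors: $Z_i^{(j)} = (X_i^{(j)} - \bar{X}_i) / \sigma_i$.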

  11. Standardized data. Std data have zero mean and unit variance. Standardize the data after making a data transformation. Effect: the data are invariant to scaling only (diagonal transformation). Distances are invariant; is the data distribution the same? How can the data be made invariant to any linear transformation?
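
A small sketch of standardization and its limits, assuming the data are stored as a d x n matrix with vectors in columns and that the std uses the n-1 normalization (the slides do not show which normalization is used). The matrices `S` and `A` are hypothetical.

```python
import numpy as np

def standardize(X):
    """Z-score each feature (row) of a (d, n) matrix: zero mean, unit variance."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)   # assumption: n-1 normalization
    return (X - mean) / std

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 200))

S = np.diag([10.0, 0.5, 3.0])          # diagonal transformation: pure rescaling
A = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])        # general transformation: mixes features

Z_original = standardize(X)
Z_scaled = standardize(S @ X)          # same as Z_original
Z_mixed = standardize(A @ X)           # differs from Z_original
```

Standardization undoes the rescaling but not the mixing, which motivates the covariance-based distance introduced later in the lecture.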

  12. Terminology (Covariance) • How two dimensions vary from the mean with respect to each other • cov(X,Y)>0: Dimensions increase together • cov(X,Y)<0: One increases, one decreases • cov(X,Y)=0: Dimensions are uncorrelated (independent dimensions have zero covariance, but zero covariance alone does not imply independence)

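  13. Terminology (Covariance Matrix) • Contains covariance values between all possible pairs of dimensions. • Example for three dimensions (x,y,z) (always symmetric): $C = \begin{pmatrix} cov(x,x) & cov(x,y) & cov(x,z) \\ cov(y,x) & cov(y,y) & cov(y,z) \\ cov(z,x) & cov(z,y) & cov(z,z) \end{pmatrix}$ • cov(x,x) is the variance of component x.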

  14. Properties of the Cov matrix • Can be used for creating a distance that is not sensitive to linear transformation • Can be used to find directions which maximize the variance • Determines a Gaussian distribution uniquely (up to a shift)

  15. Data standardization example. Transformation; vector of mean feature values; variance (check it!). For our example Y = AX, assuming X means = 1 and variances = 1. How to make this invariant?

  16. Covariance matrix. Variance (spread around the mean value) + correlation between features. $C_X$ is d x d, where X is the d x n matrix of vectors shifted to their means. The covariance matrix is symmetric, $C_{ij} = C_{ji}$, and positive definite. Diagonal elements are variances (squares of the std), $\sigma_i^2 = C_{ii}$. Pearson correlation coefficient: $r_{ij} = C_{ij} / (\sigma_i \sigma_j)$. A spherical distribution of data has $C = I$ (unit matrix). Elongated ellipsoids: large off-diagonal elements, strong correlations between features.
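
The sketch below builds the covariance matrix and the Pearson correlation matrix by hand for a d x n data matrix and compares them with NumPy's `np.cov` and `np.corrcoef`. The synthetic data and the n-1 normalization are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # correlated with x1 by construction
X = np.vstack([x1, x2])                    # shape (d, n), vectors in columns

Xc = X - X.mean(axis=1, keepdims=True)     # shift vectors to their means
C = Xc @ Xc.T / (n - 1)                    # d x d covariance matrix

std = np.sqrt(np.diag(C))                  # diagonal elements are variances
R = C / np.outer(std, std)                 # Pearson correlation coefficients
```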

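  17. Mahalanobis distance. Linear combinations of features lead to rotations and scaling of the data. The Mahalanobis distance, $D_M^2(X^{(i)}, X^{(j)}) = (X^{(i)} - X^{(j)})^T C_X^{-1} (X^{(i)} - X^{(j)})$, is invariant to linear transformations $Y = AX$ with invertible A: since $C_Y = A C_X A^T$, the factors of A cancel.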
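
A quick numerical check of the invariance, using the squared Mahalanobis distance d^T C^{-1} d between two data vectors. The random data and the invertible matrix `A` are hypothetical.

```python
import numpy as np

def mahalanobis_sq(a, b, C):
    """Squared Mahalanobis distance between vectors a and b for covariance C."""
    diff = a - b
    return float(diff @ np.linalg.solve(C, diff))   # diff^T C^{-1} diff

rng = np.random.default_rng(4)
X = rng.normal(size=(2, 500))              # (d, n) data matrix
A = np.array([[2.0, 1.0], [0.0, 1.0]])     # hypothetical invertible matrix
Y = A @ X

d_X = mahalanobis_sq(X[:, 0], X[:, 1], np.cov(X))
d_Y = mahalanobis_sq(Y[:, 0], Y[:, 1], np.cov(Y))

e_X = np.linalg.norm(X[:, 0] - X[:, 1])    # Euclidean distance, for contrast
e_Y = np.linalg.norm(Y[:, 0] - Y[:, 1])
```

`np.linalg.solve` is used instead of an explicit inverse, which is the usual numerically safer choice.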

  18. Principal components. How to avoid correlated features? Correlations => the covariance matrix is non-diagonal! Solution: diagonalize it, then use the transformation that makes it diagonal to de-correlate the features: $Y = Z^T X$, where the columns of Z are the eigenvectors of $C_X$, so that $C_Y = Z^T C_X Z = \Lambda$. In matrix form, X and Y are d x n; Z, $C_X$ and $C_Y$ are d x d. C is a symmetric, positive definite matrix, $X^T C X > 0$ for $||X|| > 0$; its eigenvectors are orthonormal, $Z^T Z = I$, and its eigenvalues are all non-negative. Z, the matrix of orthonormal eigenvectors (orthonormal because C is real and symmetric), transforms X into Y with diagonal $C_Y$, i.e. decorrelated.

  19. Matrix form. Eigenproblem for the C matrix in matrix form: $C_X Z = Z \Lambda$, where $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_i$.
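
The diagonalization can be sketched in a few lines with `np.linalg.eigh`, which is meant for symmetric matrices and returns orthonormal eigenvectors. The synthetic correlated data are an assumption; the variable names follow the slides (Z for eigenvectors, Y for transformed data).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
x1 = rng.normal(size=n)
X = np.vstack([x1, 0.8 * x1 + 0.6 * rng.normal(size=n)])   # correlated features
X = X - X.mean(axis=1, keepdims=True)                      # (d, n), zero mean

C_X = np.cov(X)
eigvals, Z = np.linalg.eigh(C_X)        # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # reorder: largest variance first
eigvals, Z = eigvals[order], Z[:, order]

Y = Z.T @ X                             # decorrelated data
C_Y = np.cov(Y)                         # diagonal, entries are the eigenvalues
```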

  20. Principal components. Z: principal components; the vectors X are transformed using these eigenvectors of $C_X$. The covariance matrix of the transformed vectors is diagonal => ellipsoidal distribution of data. PCA: old idea, K. Pearson (1901), H. Hotelling (1933). Result: PCs are linear combinations of all features, providing new uncorrelated features, with a diagonal covariance matrix whose entries are the eigenvalues. Small $\lambda_i$ => small variance => data change little in direction $Y_i$. PCA minimizes C matrix reconstruction errors: the $Z_i$ vectors for large $\lambda_i$ are sufficient to get $C_X \approx \sum_{i=1}^{k} \lambda_i Z_i Z_i^T$, because vectors for small eigenvalues make only a very small contribution to the covariance matrix.

  21. Two components for visualization. Diagonalization methods: see Numerical Recipes, www.nr.com. New coordinate system: axes ordered according to variance = size of the eigenvalue. The first k dimensions account for the fraction $V_k = \sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{d} \lambda_i$ of all variance (please note that the $\lambda_i$ are variances); frequently 80-90% is sufficient for a rough description.
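
The fraction of variance explained by the first k components is a cumulative sum over sorted eigenvalues. A minimal sketch, with a hypothetical helper `components_needed` and made-up eigenvalues:

```python
import numpy as np

def components_needed(eigvals, threshold=0.9):
    """Smallest k whose leading eigenvalues reach `threshold` of the total variance."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # largest first
    fraction = np.cumsum(lam) / lam.sum()                   # fraction for k = 1..d
    k = int(np.searchsorted(fraction, threshold)) + 1
    return min(k, lam.size), fraction

# Hypothetical eigenvalues: cumulative fractions are 0.4, 0.7, 0.9, 1.0.
k, fraction = components_needed([4.0, 3.0, 2.0, 1.0], threshold=0.8)
```

With these values three components are needed to pass the 80% mark.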

  22. Solving for Eigenvalues & Eigenvectors • Vectors x having the same direction as Ax are called eigenvectors of A (A is an n by n matrix). • In the equation $Ax = \lambda x$, $\lambda$ is called an eigenvalue of A. • $Ax = \lambda x \Leftrightarrow (A - \lambda I)x = 0$ • How to calculate x and $\lambda$: • Calculate $\det(A - \lambda I)$, which yields a polynomial of degree n • Determine the roots of $\det(A - \lambda I) = 0$; the roots are the eigenvalues $\lambda$ • Solve $(A - \lambda I)x = 0$ for each $\lambda$ to obtain the eigenvectors x
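
The three steps can be followed literally for a 2x2 matrix, where the characteristic polynomial is $\lambda^2 - \mathrm{tr}(A)\lambda + \det(A)$. The matrix below is a hypothetical example; for larger matrices a library routine such as `np.linalg.eigh` is the practical choice, since root finding on the polynomial is numerically fragile.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])      # hypothetical symmetric example

# Step 1: det(A - lambda I) = lambda^2 - trace(A) * lambda + det(A) for 2x2.
det_A = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]
coeffs = [1.0, -np.trace(A), det_A]

# Step 2: the roots of the polynomial are the eigenvalues (real, A is symmetric).
eigenvalues = np.sort(np.roots(coeffs).real)[::-1]

# Step 3: solve (A - lambda I) x = 0. For a 2x2 matrix with A[0, 1] != 0,
# x = (A[0, 1], lambda - A[0, 0]) satisfies both rows.
eigenvectors = []
for lam in eigenvalues:
    x = np.array([A[0, 1], lam - A[0, 0]])
    eigenvectors.append(x / np.linalg.norm(x))
```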

  23. PCA properties PC Analysis (PCA) may be achieved by: • transformation making covariance matrix diagonal • projecting the data on a line for which the sums of squares of distances from original points to projections is minimal. • orthogonal transformation to new variables that have stationary variances True covariance matrices are usually not known, estimated from data. This works well on single-cluster data; more complex structure may require local PCA, separately for each cluster. PC is useful for: finding new, more informative, uncorrelated features; reducing dimensionality: reject low variance features, reconstructing covariance matrices from low-dim data.

  24. PCA Wisconsin example Wisconsin Breast Cancer data: • Collected at the University of Wisconsin Hospitals, USA. • 699 cases, 458 (65.5%) benign (red), 241 malignant (green). • 9 features: quantized 1, 2 .. 10, cell properties, ex: Clump Thickness, Uniformity of Cell Size, Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses. 2D scatterograms do not show any structure no matter which subspaces are taken!

  25. Example cont. PCA gives useful information already in 2D. Taking the first PCA component of the standardized data: If (Y1 > 0.41) then benign else malignant. 18 errors in 699 cases, i.e. 97.4% accuracy. Transformed vectors are not standardized; std’s are below. Eigenvalues converge slowly, but classes are separated well.

  26. PCA disadvantages. Useful for dimensionality reduction but: • Largest variance determines which components are used, but does not guarantee an interesting viewpoint for clustering data. • The meaning of features is lost when linear combinations are formed. Analysis of coefficients in Z1 and other important eigenvectors may show which original features are given much weight. PCA may also be done efficiently by performing singular value decomposition of the standardized data matrix. PCA is also called the Karhunen-Loève transformation. Many variants of PCA are described in A. Webb, Statistical pattern recognition, J. Wiley 2002.
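
A sketch of the SVD route, under the assumption that the data matrix is d x n and only mean-centred; standardizing each feature first, as the slide suggests, would give the PCA of the correlation matrix instead. The squared singular values divided by n-1 match the eigenvalues of the covariance matrix, and the left singular vectors are the principal directions.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(4, 300))               # (d, n) data matrix
X[1] += 0.9 * X[0]                          # induce some correlation
Xc = X - X.mean(axis=1, keepdims=True)      # mean-centred data
n = Xc.shape[1]

# Thin SVD: Xc = U diag(s) Vt, the columns of U are the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_eigvals = s**2 / (n - 1)

# Reference: eigenvalues of the covariance matrix, largest first.
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X)))[::-1]
```

The covariance matrix is never formed on the SVD path, which is one reason it is often preferred numerically.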

  27. Exercise (will be part of Ex. 1) • How would you calculate efficiently the PCA of data where the dimensionality d is much larger than the number of vector observations n?

  28. 2 skewed distributions. PCA transformation for 2D data: the first component will be chosen along the line of largest variance, both clusters will strongly overlap, and no interesting structure will be visible. In fact, projection onto the axis orthogonal to the first PCA component has much more discriminating power. Discriminant coordinates should be used to reveal class structure.

  29. Hebb Rule • Linear neuron • Hebb rule • Similar to LTP (but not quite…)

  30. Hebb Rule • Average Hebb rule = correlation rule • Q: correlation matrix of u

  31. Hebb Rule • Hebb rule with threshold = covariance rule • C: covariance matrix of u • Note that <(v - <v>)(u - <u>)> would be unrealistic because it predicts LTP when both u and v are low

  32. Hebb Rule • Main problem with Hebb rule: it’s unstable… Two solutions: • Bounded weights • Normalization of either the activity of the postsynaptic cells or the weights.

  33. BCM rule • Hebb rule with sliding threshold • The BCM rule implements competition because when a synaptic weight grows, it raises $\theta_v$ by $v^2$, making it more difficult for other weights to grow.
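
The slide's equations did not survive in this transcript. For orientation, one commonly used textbook form of the BCM rule with a sliding threshold is written below; the symbols $\tau_w$, $\tau_\theta$ and $\theta_v$ are assumptions and may differ from the notation on the original slide.

```latex
\begin{align}
  \tau_w \frac{d\mathbf{w}}{dt} &= v\,\mathbf{u}\,(v - \theta_v), \\
  \tau_\theta \frac{d\theta_v}{dt} &= v^2 - \theta_v, \qquad \tau_\theta < \tau_w .
\end{align}
```

In this form the threshold $\theta_v$ tracks a running average of $v^2$, so strong postsynaptic activity raises the bar for further potentiation at all synapses.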

  34. Weight Normalization • Subtractive Normalization:

  35. Weight Normalization • Multiplicative Normalization: • The squared norm of the weights, $|w|^2$, converges to $1/\alpha$
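
A sketch of multiplicative normalization, assuming it takes the form of Oja's rule, $\tau\, dw/dt = vu - \alpha v^2 w$, averaged over inputs so that $\langle vu \rangle = Qw$ and $\langle v^2 \rangle = w^T Q w$. The rule's form, the matrix Q, the value of $\alpha$ and the step size are all assumptions, since the slide's equation is not in the transcript. Under this form it is the squared norm that settles at $1/\alpha$.

```python
import numpy as np

Q = np.array([[1.0, 0.6],
              [0.6, 1.0]])        # hypothetical input correlation matrix
alpha = 4.0
step = 0.01                       # Euler step, dt / tau
w = np.array([0.3, -0.1])

for _ in range(5000):
    v_sq = w @ Q @ w              # <v^2> for the linear neuron v = w . u
    w = w + step * (Q @ w - alpha * v_sq * w)   # averaged Oja rule (assumed form)

eigvals, eigvecs = np.linalg.eigh(Q)
e1 = eigvecs[:, -1]               # eigenvector with the largest eigenvalue
norm_sq = float(w @ w)
alignment = abs(float(w @ e1)) / np.sqrt(norm_sq)   # |cos| of angle to e1
```

The weights end up along the first eigenvector with a fixed length, in contrast to the plain Hebb rule, whose weights keep growing.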

  36. Hebb Rule • Convergence properties: • Use an eigenvector decomposition: $w(t) = \sum_m c_m(t)\, e_m$ • where the $e_m$ are the eigenvectors of Q

  37. Hebb Rule [Figure: eigenvectors $e_1$ and $e_2$ of Q, with $\lambda_1 > \lambda_2$]

  38. Hebb Rule. Equations decouple because the $e_m$ are the eigenvectors of Q
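
The decoupling step can be written out as follows, assuming the averaged Hebb rule has the form $\tau\, dw/dt = Qw$ (the time constant $\tau$ and the coefficient names $c_m$ are notation chosen here, since the slide equations are missing from the transcript).

```latex
\begin{align}
  \mathbf{w}(t) &= \sum_m c_m(t)\,\mathbf{e}_m, \qquad Q\,\mathbf{e}_m = \lambda_m \mathbf{e}_m, \\
  \tau \frac{d\mathbf{w}}{dt} = Q\,\mathbf{w}
  \;&\Longrightarrow\; \tau \frac{dc_m}{dt} = \lambda_m c_m
  \;\Longrightarrow\; c_m(t) = c_m(0)\, e^{\lambda_m t/\tau}, \\
  \mathbf{w}(t) &= \sum_m c_m(0)\, e^{\lambda_m t/\tau}\,\mathbf{e}_m
  \;\approx\; c_1(0)\, e^{\lambda_1 t/\tau}\,\mathbf{e}_1
  \quad \text{for large } t, \text{ if } \lambda_1 > \lambda_m \ (m > 1) \text{ and } c_1(0) \neq 0 .
\end{align}
```

Each coefficient grows exponentially at its own rate, so the component with the largest eigenvalue eventually dominates the direction of $\mathbf{w}$ while its length diverges.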

  39. Hebb Rule

  40. Hebb Rule • The weights line up with first eigenvector and the postsynaptic activity, v, converges toward the projection of u onto the first eigenvector (unstable PCA)
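
A short simulation of this behaviour, assuming the averaged Hebb rule $\tau\, dw/dt = Qw$ integrated with a simple Euler step. The matrix Q, the initial weights and the step size are hypothetical.

```python
import numpy as np

Q = np.array([[1.0, 0.4],
              [0.4, 1.0]])           # hypothetical input correlation matrix
w = np.array([0.1, -0.05])           # small initial weights
step = 0.01                          # Euler step, dt / tau

norms = []
for _ in range(2000):
    w = w + step * (Q @ w)           # averaged Hebb rule: tau dw/dt = Q w
    norms.append(np.linalg.norm(w))

eigvals, eigvecs = np.linalg.eigh(Q)
e1 = eigvecs[:, -1]                  # eigenvector with the largest eigenvalue
alignment = abs(float(w @ e1)) / np.linalg.norm(w)   # |cos| of angle to e1
```

The direction of `w` converges to the first eigenvector while `norms` increases without bound, which is the sense in which the plain Hebb rule performs an unstable PCA.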

  41. Hebb Rule • Non zero mean distribution: correlation vs covariance

  42. Hebb Rule • Limiting weight growth affects the final state. First eigenvector: [1,-1]. [Figure: $w_2/w_{max}$ versus $w_1/w_{max}$, both axes from 0 to 1]

  43. Hebb Rule • Normalization also affects the final state. • Ex: multiplicative normalization. In this case, Hebb rule extracts the first eigenvector but keeps the norm constant (stable PCA).

  44. Hebb Rule • Normalization also affects the final state. • Ex: subtractive normalization.

  45. Hebb Rule

  46. Hebb Rule • The constraint does not affect the other eigenvector: • The weights converge to the second eigenvector (the weights need to be bounded to guarantee stability…)

  47. Ocular Dominance Column • One unit with one input from each of the right and left eyes: $Q = \begin{pmatrix} q_s & q_d \\ q_d & q_s \end{pmatrix}$, s: same eye, d: different eyes

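  48. Ocular Dominance Column • The eigenvectors are: $e_1 = (1, 1)/\sqrt{2}$ with $\lambda_1 = q_s + q_d$, and $e_2 = (1, -1)/\sqrt{2}$ with $\lambda_2 = q_s - q_d$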

  49. Ocular Dominance Column • Since $q_d$ is likely to be positive, $q_s + q_d > q_s - q_d$. As a result, the weights will converge toward the first eigenvector, which mixes the right and left eye equally. No ocular dominance...

  50. Ocular Dominance Column • To get ocular dominance we need subtractive normalization.
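
A sketch of how subtractive normalization can produce ocular dominance for one unit with two inputs. It assumes a correlation matrix with same-eye entries q_s on the diagonal and different-eye entries q_d off the diagonal, the averaged Hebb rule with the mean Hebbian drive subtracted from every weight, and saturation of the weights between 0 and a maximum. All numerical values are hypothetical.

```python
import numpy as np

q_s, q_d = 1.0, 0.4                  # hypothetical same-eye / different-eye correlations
Q = np.array([[q_s, q_d],
              [q_d, q_s]])
ones = np.ones(2)

w = np.array([0.52, 0.48])           # [w_right, w_left], nearly balanced start
w_max = 1.0
step = 0.01                          # Euler step, dt / tau

for _ in range(3000):
    hebb = Q @ w                                   # averaged Hebbian drive
    dw = hebb - (ones @ hebb) / ones.size * ones   # subtractive normalization
    w = np.clip(w + step * dw, 0.0, w_max)         # saturation bounds the weights

eigvals, eigvecs = np.linalg.eigh(Q)  # eigenvalues q_s - q_d and q_s + q_d
```

The subtraction removes the component along (1, 1), so only the difference between the two weights can grow, at a rate set by q_s - q_d. The eye with the small initial advantage ends up with all of the weight.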
