
Self Organization: Hebbian Learning

Presentation Transcript


  1. Self Organization: Hebbian Learning
  CS/CMPE 537 – Neural Networks (Sp 2004/2005) – Asim Karim @ LUMS

  2. Introduction
  • So far, we have studied neural networks that learn from their environment in a supervised manner
  • Neural networks can also learn in an unsupervised manner; this is known as self-organized learning
  • Self-organized learning discovers significant features or patterns in the input data through general rules that operate locally
  • Self-organizing networks typically consist of two layers with feedforward connections and elements that facilitate 'local' learning

  3. Self-Organization
  • "Global order can arise from local interactions" – Turing (1952)
  • The input signal produces activity patterns in the network, and these patterns in turn modify the weights (a feedback loop)
  • Principles of self-organization:
    • Modifications in weights tend to self-amplify
    • Limited resources lead to competition and the selection of the most active synapses, at the expense of less active ones
    • Modifications in weights tend to cooperate
    • Order and structure in the activation patterns represent redundant information

  4. Hebbian Learning
  • A self-organizing learning principle was proposed by Hebb in 1949 in the context of biological neurons
  • Hebb's principle: when a neuron repeatedly excites another neuron, the threshold of the latter neuron is decreased, or the synaptic weight between the neurons is increased, in effect making it more likely that the first neuron will cause the second to fire
  • Hebbian learning rule: Δw_ji = η y_j x_i
  • No desired or target signal is required in the Hebbian rule, hence it is unsupervised learning
  • The update rule is local to the weight
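As a concrete illustration of the rule above, here is a minimal sketch of one Hebbian update for a single linear neuron in Python/NumPy (the function name, numbers, and learning rate are illustrative choices, not from the slides):

```python
import numpy as np

def hebbian_step(w, x, eta=0.01):
    """One plain Hebbian update for a single linear neuron y = w^T x."""
    y = w @ x                  # post-synaptic activity
    return w + eta * y * x     # delta w_i = eta * y * x_i for every input i

# Example with made-up numbers
w = np.array([0.1, -0.2, 0.05])
x = np.array([1.0, 0.5, -1.0])
print(hebbian_step(w, x))
```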

  5. Hebbian Update
  • Consider the update of a single weight w (x and y are the pre- and post-synaptic activities):
    w(n + 1) = w(n) + η x(n) y(n)
  • For a linear activation function:
    w(n + 1) = w(n)[1 + η x²(n)]
  • The weight grows without bound: if the initial weight is negative it grows more negative; if it is positive it grows more positive
  • Hebbian learning is therefore intrinsically unstable, unlike error-correction learning with the backpropagation algorithm
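A small numeric sketch of this instability for a single weight (the constant input and the learning rate are arbitrary values chosen for the demonstration):

```python
# Single weight, linear activation: w(n+1) = w(n) * (1 + eta * x^2)
eta, x = 0.1, 1.0
for w0 in (0.01, -0.01):
    w = w0
    for n in range(100):
        y = w * x              # linear activation
        w = w + eta * x * y    # plain Hebbian update
    print(w0, "->", w)         # magnitude grows without bound, sign is preserved
```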

  6. Geometric Interpretation of Hebbian Learning
  • Consider a single linear neuron with p inputs:
    y = w^T x = x^T w  and  Δw = η[x_1 y  x_2 y  …  x_p y]^T
  • The dot product can be written as y = |w||x| cos(α), where α is the angle between the vectors x and w
  • If α is zero (x and w are 'close'), y is large; if α is 90° (x and w are 'far'), y is zero
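A quick numeric sketch of this geometric reading (the fixed weight vector and the chosen angles are arbitrary):

```python
import numpy as np

w = np.array([1.0, 0.0])                             # fixed weight vector
for alpha_deg in (0, 30, 60, 90):
    alpha = np.deg2rad(alpha_deg)
    x = np.array([np.cos(alpha), np.sin(alpha)])     # unit input at angle alpha to w
    print(alpha_deg, "deg -> y =", round(float(w @ x), 3))
# The output falls from 1.0 at 0 deg to 0.0 at 90 deg: y measures how aligned x is with w.
```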

  7. Similarity Measure
  • A network trained with Hebbian learning creates a similarity measure (the inner product) in its input space according to the information contained in the weights
  • The weights capture (memorize) the information in the data during training
  • During operation, when the weights are fixed, a large output y signifies that the present input is 'similar' to the inputs x that created the weights during training
  • Similarity measures:
    • Hamming distance
    • Correlation

  8. Hebbian Learning as Correlation Learning
  • Hebbian learning (pattern-by-pattern mode):
    Δw(n) = η x(n) y(n) = η x(n) x^T(n) w(n)
  • In batch mode:
    Δw = η [Σ_{n=1}^{N} x(n) x^T(n)] w(0)
  • The term Σ_{n=1}^{N} x(n) x^T(n) is a sample approximation of the autocorrelation of the input data
  • Thus Hebbian learning can be thought of as learning the autocorrelation of the input space
  • Correlation is a well-known operation in signal processing and statistics; in particular, it completely describes signals with Gaussian distributions
  • Applications in signal processing
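A sketch verifying that the batch Hebbian update equals the sample autocorrelation matrix applied to the initial weight vector (the random data, sizes, and learning rate are assumptions made for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))          # N = 500 samples of a p = 3 dimensional input
w0 = rng.normal(size=3)                # initial weights
eta = 0.01

# Batch Hebbian update: sum over n of eta * x(n) * y(n), with y(n) = x(n)^T w0
delta_w = eta * sum(x * (x @ w0) for x in X)

# The same quantity via the sample autocorrelation matrix R = sum_n x(n) x(n)^T
R = X.T @ X
print(np.allclose(delta_w, eta * R @ w0))   # True
```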

  9. Oja's Rule
  • The simple Hebbian rule causes the weights to increase (or decrease) without bound
  • The weights can be normalized to unit length:
    w_ji(n + 1) = [w_ji(n) + η x_i(n) y_j(n)] / √(Σ_i [w_ji(n) + η x_i(n) y_j(n)]²)
  • This normalization constrains the weight vector at each neuron to have unit norm
  • Oja approximated the normalization (for small η) as:
    w_ji(n + 1) = w_ji(n) + η y_j(n)[x_i(n) – y_j(n) w_ji(n)]
  • This is Oja's rule, a normalized form of the Hebbian rule
  • It contains a 'forgetting term' that prevents the weights from growing without bound
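A minimal sketch of Oja's rule for a single linear neuron (the function name, data layout, and hyperparameters are my own choices). On zero-mean data the returned weight vector points, up to sign, along the direction of maximum variance, as the following slides make precise:

```python
import numpy as np

def train_oja(X, eta=0.005, epochs=50, seed=0):
    """Train one linear neuron with Oja's rule on zero-mean data X of shape (N, p)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)                 # start from a random unit vector
    for _ in range(epochs):
        for x in X:
            y = w @ x                      # linear output
            w += eta * y * (x - y * w)     # Hebbian term minus the forgetting term
    return w                               # stays close to unit length
```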

  10. Oja's Rule – Geometric Interpretation
  • The simple Hebbian rule finds the weight vector pointing in the direction of largest variance in the input data; however, the magnitude of the weight vector grows without bound
  • Oja's rule has a similar interpretation: the normalization only changes the magnitude, while the direction of the weight vector is the same
  • The magnitude is equal to one
  • Oja's rule converges asymptotically, unlike the Hebbian rule, which is unstable

  11. (figure)

  12. The Maximum Eigenfilter
  • A linear neuron trained with Oja's rule produces a weight vector that is the principal eigenvector of the input autocorrelation matrix, and the variance of its output equals the largest eigenvalue
  • In other words, the neuron solves the eigenvalue problem:
    R e_1 = λ_1 e_1
    • R = autocorrelation matrix of the input data
    • e_1 = the eigenvector with the largest eigenvalue, which corresponds to the weight vector w obtained by Oja's rule
    • λ_1 = the largest eigenvalue, which equals the variance of the network's output
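A sketch checking this claim numerically against NumPy's eigendecomposition (the synthetic data, step size, and epoch count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
# Zero-mean 2-D data, stretched along the first coordinate axis
X = rng.normal(size=(1000, 2)) * np.array([3.0, 1.0])

# Single linear neuron trained with Oja's rule
w = rng.normal(size=2)
w /= np.linalg.norm(w)
for _ in range(20):
    for x in X:
        y = w @ x
        w += 0.001 * y * (x - y * w)

# Principal eigenvector / eigenvalue of the autocorrelation matrix R
R = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(R)       # eigenvalues in ascending order
e1, lam1 = eigvecs[:, -1], eigvals[-1]

print(abs(w @ e1))                  # close to 1: w equals e1 up to sign
print(np.mean((X @ w) ** 2), lam1)  # output variance is close to lambda_1
```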

  13. Principal Component Analysis (1)
  • Oja's rule, applied to a single neuron, creates a principal component of the input space in the form of the weight vector
  • How can we find the other directions in the input space with significant variance?
  • In statistics, PCA is used to obtain the significant components of data in the form of orthogonal principal axes
  • PCA is also known as the Karhunen–Loève (K-L) transform in signal processing
  • It was first proposed in 1901, with later developments in the 1930s, 1940s and 1960s
  • A Hebbian network with Oja's rule can perform PCA

  14. Principal Component Analysis (2)
  • Consider a set of vectors x with zero mean and unit variance. There exists an orthogonal transformation y = Q^T x such that the covariance matrix of y, Λ = E[y y^T], is diagonal:
    Λ_ij = λ_i if i = j, and Λ_ij = 0 otherwise
  • λ_1 > λ_2 > … > λ_p are the eigenvalues of the covariance matrix of x, C = E[x x^T]
  • The columns of Q are the corresponding eigenvectors
  • The components of y are the principal components; the first has the maximum variance, and each subsequent component has the maximum remaining variance while being uncorrelated with the preceding ones
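A compact sketch of this classical (matrix-based) PCA in NumPy, using the same symbols as above (the synthetic data is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)                        # enforce zero mean

C = X.T @ X / len(X)                       # covariance matrix C = E[x x^T]
eigvals, Q = np.linalg.eigh(C)             # columns of Q are eigenvectors of C
order = np.argsort(eigvals)[::-1]          # sort so that lambda_1 > lambda_2 > ...
eigvals, Q = eigvals[order], Q[:, order]

Y = X @ Q                                  # y = Q^T x, applied to every sample
Lambda = Y.T @ Y / len(Y)                  # approximately diag(lambda_1, ..., lambda_p)
print(np.round(Lambda, 2))
```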

  15. PCA – Example (figure)

  16. Hebbian Network for PCA (figure)

  17. Hebbian Network for PCA
  • Procedure:
    • Use Oja's rule to find the principal component
    • Project the data onto the subspace orthogonal to the principal component
    • Use Oja's rule on the projected data to find the next major component
    • Repeat the above for m <= p components (m = number of desired components; p = input space dimensionality)
  • How do we project onto the orthogonal subspace?
    • Deflation method: subtract the principal component from the input
    • Oja's rule can be modified to perform this operation; the result is Sanger's rule

  18. Sanger's Rule
  • Sanger's rule is a modification of Oja's rule that implements the deflation method for PCA
  • Classical PCA involves matrix operations; Sanger's rule implements PCA iteratively in a neural network
  • Consider p inputs and m outputs, where m < p:
    y_j(n) = Σ_{i=1}^{p} w_ji(n) x_i(n),  j = 1, …, m
  • The update (Sanger's rule) is:
    Δw_ji(n) = η [y_j(n) x_i(n) – y_j(n) Σ_{k=1}^{j} w_ki(n) y_k(n)]
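A minimal sketch of Sanger's rule (the shape conventions, names, and hyperparameters are my own choices). W has one row per output neuron, and after training row j matches, up to sign, the j-th principal direction of the data:

```python
import numpy as np

def train_sanger(X, m, eta=0.001, epochs=50, seed=0):
    """Sanger's rule on zero-mean data X of shape (N, p); returns W of shape (m, p)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(m, X.shape[1]))
    for _ in range(epochs):
        for x in X:
            y = W @ x                                  # y_j = sum_i w_ji x_i
            for j in range(m):
                # Deflation: remove the parts of x already explained by outputs 1..j
                residual = x - W[: j + 1].T @ y[: j + 1]
                W[j] += eta * y[j] * residual          # Sanger update for row j
    return W
```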

  19. PCA for Feature Extraction
  • PCA is the optimal linear feature extractor: no other linear system provides features that give a better reconstruction
  • PCA may or may not be the best preprocessing for pattern classification or recognition; classification requires good discrimination, which PCA might not provide
  • Feature extraction: transform the p-dimensional input space into an m-dimensional space (m < p) such that the m dimensions capture the information with minimal loss
  • The reconstruction error e is given by:
    e² = Σ_{i=m+1}^{p} λ_i
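A sketch that checks this error formula on synthetic data (eigendecomposition stands in for the trained network; the data and the choice of m are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X -= X.mean(axis=0)

C = X.T @ X / len(X)
eigvals, Q = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, Q = eigvals[order], Q[:, order]

m = 2                                        # keep the first m principal components
X_hat = (X @ Q[:, :m]) @ Q[:, :m].T          # project onto m axes, then reconstruct
e2 = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(e2, eigvals[m:].sum())                 # both close to the sum of discarded eigenvalues
```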

  20. PCA for Data Compression
  • PCA identifies an orthogonal coordinate system for the input data such that the variance of the projection on the principal axis is largest, followed by the next major axis, and so on
  • By discarding some of the minor components, PCA can be used for data compression: a p-dimensional input is encoded in an m < p dimensional space
  • The weights are computed by Sanger's rule on typical inputs
  • The de-compressor (receiver) must know the weights of the network to reconstruct the original signal:
    x' = W^T y
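A sketch of the compression/decompression step x' = W^T y. For brevity W is built here from the top-m eigenvectors; in the network it would come from Sanger's rule instead, and all sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 8)) * np.linspace(3.0, 0.1, 8)
X -= X.mean(axis=0)

C = X.T @ X / len(X)
eigvals, Q = np.linalg.eigh(C)
W = Q[:, np.argsort(eigvals)[::-1][:3]].T    # shape (m, p) with m = 3

x = X[0]
y = W @ x              # sender transmits 3 numbers instead of 8
x_rec = W.T @ y        # receiver reconstructs with x' = W^T y
print(np.round(x, 2))
print(np.round(x_rec, 2))
```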

  21. PCA for Classification (1)
  • Can PCA enhance classification?
  • In general, no. PCA is good for reconstruction, not for feature discrimination or classification

  22. PCA for Classification (2) (figure)

  23. PCA – Some Remarks
  • Practical uses of PCA:
    • Data compression
    • Cluster analysis
    • Feature extraction
    • Preprocessing for classification/recognition (e.g. preprocessing for MLP training)
  • Biological basis: it is unlikely that the processing performed by biological neurons in, say, perception involves PCA alone; more complex feature extraction processes are involved

  24. Anti-Hebbian Learning
  • The Hebbian rule can be modified by flipping its sign:
    Δw_ji(n) = –η x_i(n) y_j(n)
  • The anti-Hebbian rule finds the direction in the input space that has the minimum variance; in other words, it is the complement of the Hebbian rule
  • Anti-Hebbian learning performs de-correlation: it de-correlates the output from the input
  • The Hebbian rule is unstable because it tries to maximize the variance; the anti-Hebbian rule, on the other hand, is stable and converges
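On its own, the anti-Hebbian update shrinks the weights toward zero; the sketch below adds an explicit renormalization after each pass (my own addition, analogous to the normalization that motivates Oja's rule) so that the surviving direction, the minimum-variance one, becomes visible. The data and step size are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
# Zero-mean 2-D data: large variance along the first axis, small along the second
X = rng.normal(size=(1000, 2)) * np.array([3.0, 0.5])

w = rng.normal(size=2)
w /= np.linalg.norm(w)
eta = 0.001
for _ in range(50):
    for x in X:
        y = w @ x
        w -= eta * y * x           # anti-Hebbian step: shrinks w fastest along high-variance directions
    w /= np.linalg.norm(w)         # keep only the direction

print(np.round(w, 3))              # close to (0, +/-1): the minimum-variance direction
```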
