Neural Networks: Perceptron Training and Geometric Views

reinforcement learning world self world redundancy reduction in early sensory sys ? diversity diversity + errors errors

Neural Networks Courtesy of: Elena Marchiori R4.47 elena@cs.vu.nl Assistant: Kees Jong S2.22 cjong@cs.vu.nl

Neural Networks • A neural network (NN) is a machine learning approach inspired by the way in which the brain performs a particular learning task. • Knowledge about the learning task is given in the form of examples called training examples. • A NN is specified by: • an architecture: a set of neurons and links connecting neurons. Each link has a weight, • a neuron model: the information processing unit of the NN, • a learning algorithm:used for training the NN by modifying the weights in order to model the particular learning task correctly on the training examples. The aim is to obtain a NN that generalizes well, that is, that behaves correctly on new instances of the learning task.

Single Layer Feed-forward Input layer of source nodes Output layer of neurons

Multi layer feed-forward 3-4-2 Network Output layer Input layer Hidden Layer

b (bias) x1 w1 v y x2 w2 (v) wn xn Perceptron: Neuron Model • The (McCulloch-Pitts) perceptron is a single layer NN with a non-linear , the sign function

Perceptron for Classification • The perceptron is used for binary classification. • Given training examples of classes C1, C2 train the perceptron in such a way that it classifies correctly the training examples: • If the output of the perceptron is +1 then the input is assigned to class C1 • If the output is -1 then the input is assigned to C2

Perceptron Training • How can we train a perceptron for a classification task? • We try to find suitable values for the weights in such a way that the training examples are correctly classified. • Geometrically, we try to find a hyper-plane that separates the examples of the two classes.

Perceptron Geometric View The equation below describes a (hyper-)plane in the input space consisting of real valued m-dimensional vectors. The plane splits the input space into two regions, each of them describing one class. decision region for C1 x2 w1x1 + w2x2 + w0 >= 0 decision boundary C1 x1 C2 w1x1 + w2x2 + w0 = 0

The fixed-increment learning algorithm n=1; initializew(n) randomly; while (there are misclassified training examples) Select a misclassified augmented example (x(n),d(n)) w(n+1) = w(n) + d(n)x(n); n = n+1; end-while;  = learning rate parameter (real number)

Example Consider the 2-dimensional training set C1 C2, C1 = {(1,1), (1, -1), (0, -1)} with class label 1 C2 = {(-1,-1), (-1,1), (0,1)} with class label -1 Train a perceptron on C1 C2

A possible implementation Consider the augmented training set C’1 C’2, with first entry fixed to 1 (to deal with the bias as extra weight): (1, 1, 1), (1, 1, -1), (1, 0, -1) ,(1,-1, -1), (1,-1, 1), (1,0, Replace x with -x for all x C2’ and use the following update rule: Epoch = the application of the update rule to each example of the training set. Then terminate the execution of the learning algorithm if the weights do not change after one epoch.

Execution • the execution of the perceptron learning algorithm for each epoch is illustrated below, with w(1)=(1,0,0),  =1, and transformed inputs (1, 1, 1), (1, 1, -1), (1,0, -1), (-1,1, 1), (-1,1, -1), (-1,0, -1) End epoch 1

Execution End epoch 2 At epoch 3 no weight changes. (check!)  stop execution of algorithm. Final weight vector: (0, 2, -1).  decision hyperplane is 2x1 - x2 = 0.

x2 1 - - + Decision boundary: 2x1 - x2 = 0 C2 x1 -1 1 1/2 -1 C1 - + + Result

Termination of the learning algorithm Suppose the classes C1, C2are linearly separable (that is, there exists a hyper-plane that separates them). Then the perceptron algorithm applied to C1  C2 terminates successfully after a finite number of iterations. Proof: Consider the set C containing the inputs of C1 C2transformed by replacing x with -x for each x with class label -1. For simplicity assume w(1) = 0,  = 1. Let x(1) … x(k) Cbe the sequence of inputs that have been used after k iterations. Then w(2) = w(1) + x(1) w(3) = w(2) + x(2) w(k+1) = x(1) + … + x(k) w(k+1) = w(k) + x(k)

Convergence theorem (proof) • Since C1 and C2 are linearly separable then there exists w* such that w*Tx > 0 x C. • Let  = min w*T x • Then w*T w(k+1) = w*Tx(1) + … + w*Tx(k)  k • By the Cauchy-Schwarz inequality we get: ||w*||2 ||w(k+1)||2 [w*T w(k+1)]2 • ||w(k+1)||2 (A) k2 2 ||w*||2

Convergence theorem (proof) Now we consider another route: w(k+1) = w(k) + x(k) || w(k+1)||2= || w(k)||2+ ||x(k)||2 +2 wT(k)x(k)euclidean norm   0 becausex(k) is misclassified  ||w(k+1)||2 ||w(k)||2+ ||x(k)||2 =0 ||w(2)||2 ||w(1)||2+ ||x(1)||2 ||w(3)||2 ||w(2)||2+ ||x(2)||2  ||w(k+1)||2

convergence theorem (proof) • Let  = max ||x(n)||2 x(n)  C • ||w(k+1)||2 k  (B) • For sufficiently large values of k: (B)becomes in conflict with(A).Then k cannot be greater than kmaxsuch that (A)and (B) are both satisfied with the equality sign. • Then the algorithm terminates successfully in at most iterations.  ||w*||2 2

Perceptron: Limitations • The perceptron can only model linearly separable classes, like (those described by) the following Boolean functions: • AND • OR • COMPLEMENT • It cannot model the XOR. • You will experiment with these functions in the Matlab practical lessons.

Adaline: Adaptive Linear Element • When the two classes are not linearly separable, it may be desirable to obtain a linear separator that minimizes the mean squared error. • Adaline (Adaptive Linear Element): • uses a linear neuron model and • the Least-Mean-Square (LMS) learning algorithm • useful for robust linear classification and regression For an example (x,d) the error e(w) of the network is and the squared error is

Adaline • The total error E_tot is the mean of the squared errors of all the examples. • E_tot is a quadratic function of the weights whose derivative exists everywhere. • Then incremental gradient descent may be used to minimize E_tot. (see Sec. 3.1,3.2 of the online Gurney book for an explanation of gradient descent). • At each iteration LMS algorithm selects an example and decreases the network error E of that example, even when the example is correctly classified by the network.

Incremental Gradient Descent • start from an arbitrary point in the weight space • the direction in which the error E of an example (as a function of the weights) is decreasing most rapidly is the opposite of the gradient of E: • take a small step (of size ) in that direction

Weights Update Rule • Computation of Gradient(E): • Delta rule for weight update:

LMS learning algorithm n=1; initializew(n) randomly; while (E_tot unsatisfactory and n<max_iterations) Select an example (x(n),d(n)) n = n+1; end-while;  = learning rate parameter (real number) A modification uses

Comparison Perceptron and Adaline

Multi layer feed-forward NNFFNN We consider a more general network architecture: between the input and output layers there are hidden layers, as illustrated below. Hidden nodes do not directly receive inputs nor send outputs to the external environment. FFNNs overcome the limitation of single-layer NN: they can handle non-linearly separable learning tasks. Input layer Output layer Hidden Layer

XOR problem A typical example of non-linealy separable function is the XOR. This function takes two input arguments with values in {-1,1} and returns one output in {-1,1}, as specified in the following table: If we think at -1 and 1 as encoding of the truth values false and true, respectively, then XOR computes the logical exclusive or, which yields true if and only if the two inputs have different truth values.

x1 1 -1 1 x2 -1 -1 0.1 +1 +1 -1 -1 +1 +1 XOR problem In this graph of the XOR, input pairs giving output equal to 1 and -1 are depicted with green and red circles, respectively. These two classes (green and red) cannot be separated using a line. We have to use two lines, like those depicted in blue. The following NN with two hidden nodes realizes this non-linear separation, where each hidden node describes one of the two blue lines. This NN uses the sign activation function. The two green arrows indicate the directions of the weight vectors of the two hidden nodes, (1,-1) and (-1,1). They indicate the regions where the network output will be 1. The output node is used to combine the outputs of the two hidden nodes. x1 x2 -1

1 w0 x1 w1 x2 w2 1 L1 1 L2 1 Convex region x1 1 L4 L3 1 x2 1 1 1 x1 1 1 x2 1 Types of decision regions Network with a single node One-hidden layer network that realizes the convex region: each hidden node realizes one of the lines bounding the convex region -3.5 P1 two-hidden layer network that realizes the union of three convex regions: each box represents a one hidden layer network realizing one convex region P2 P3 1.5

1 Increasing a -10 -8 -6 -4 -2 2 4 6 8 10 FFNN NEURON MODEL • The classical learning algorithm of FFNN is based on the gradient descent method. For this reason the activation function used in FFNN are continuous functions of the weights, differentiable everywhere. • A typical activation function that can be viewed as a continuous approximation of the step (threshold) function is the Sigmoid Function. The activation function for node j is: • when a tends to infinity then  becomes the step function

Training: Backprop algorithm • The Backprop algorithm searches for weight values that minimize the total error of the network over the set of training examples (training set). • Backprop consists of the repeated application of the following two passes: • Forward pass: in this step the network is activated on one example and the error of (each neuron of) the output layer is computed. • Backward pass:in this step the network error is used for updating the weights (credit assignment problem). This process is more complex than the LMS algorithm for Adaline, because hidden nodes are linked to the error not directly but by means of the nodes of the next layer.Therefore,starting at the output layer, the error is propagated backwards through the network, layer by layer. This is done by recursively computing the local gradient of each neuron.

Backprop • Back-propagation training algorithm • Backprop adjusts the weights of the NN in order to minimize the network total mean squared error. Network activation Forward Step Error propagation Backward Step

Total Mean Squared Error • The error of output neuron j after the activation of the network on the n-th training example is: • The network error is the sum of the squared errors of the output neurons: • The total mean squared error is the average of the network errors of the training examples.

Weight Update Rule The Backprop weight update rule is based on the gradient descent method: take a step in the direction yielding the maximum decrease of the network error E. This direction is the opposite of the gradient of E.

Weight Update Rule Input of neuron j is: Using the chain rule we can write: Moreover if we define the local gradientof neuron j as follows: Then from we get

Weight update of output neuron In order to compute the weight change we need to know the local gradient of neuron j . There are two cases, depending whether j is an output or an hidden neuron. If j is an output neuron then using the chain rule we obtain: and because So if j is an output node then the weight from neuron i to neuron j is updated of:

Weight updateof hidden neuron If j is a hidden neuron then its local gradient is computed using the local gradients of all the neurons of the next layer. Using the chain rule we have: Observe that Then Moreover So if j is a hidden node then the weight from neuron i to neuron j is updated of:

Error backpropagation The flow-graph below illustrates how errors are back-propagated to hidden neuron j w1j e1 ’(v1) 1 j ’(vj) wkj ek ’(vk) k wm j em m ’(vm)

Summary: Delta Rule • Delta rulewji = j yi IF j output node IF j hidden node where

Generalized delta rule • If  is small then the algorithm learns the weights very slowly, while if  is large then the large changes of the weights may cause an unstable behavior with oscillations of the weight values. • A technique for tackling this problem is the introduction of a momentum term in the delta rule which takes into account previous updates. We obtain the following generalized Delta rule:  momentum constant the momentum accelerates the descent in steady downhill directions. the momentum has a stabilizing effect in directions that oscillate in time.

Other techniques: adaptation Other heuristics for accelerating the convergence of the back-prop algorithm through  adaptation: • Heuristic 1: Every weight has its own . • Heuristic 2: Every is allowed to vary from one iteration to the next.

Backprop learning algorithm(incremental-mode) n=1; initializew(n) randomly; while (stopping criterion not satisfied and n<max_iterations) for each example (x,d) - run the network with input x and compute the output y - update the weights in backward order starting from those of the output layer: with computed using the (generalized) Delta rule end-for n = n+1; end-while;

Backprop algorithm • In the batch-mode the weights are updated only after all examples have been processed, using the formula • The learning process continues on an epoch-by-epoch basis until the stopping condition is satisfied. • In the incremental mode from one epoch to the next choose a randomized ordering for selecting the examples in the training set in order to avoid poor performance.

Stopping criterions • Sensible stopping criterions: • total mean squared error change: Back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]). • generalization based criterion: After each epoch the NN is tested for generalization. If the generalization performance is adequate then stop. If this stopping criterion is used then the part of the training set used for testing the network generalization will not be used for updating the weights.

Metric Striatal Networks

Caudate Putamen GPe GPi

STN SN (r & c)

Basal Ganglia: 3 circuits • Sensorimotor: Putamen to GPi • Associative: Caudate to SNr • Limbic: Ventral striatum to ventral pallidum

Neural Networks: Perceptron Training and Geometric Views