Artificial Intelligence Methods

Artificial Intelligence Methods. Neural Networks Lecture 4. Rakesh K. Bissoondeeal. Learning in Multilayer Networks: Backpropagation Learning.

Presentation Transcript


  1. Artificial Intelligence Methods Neural Networks Lecture 4 Rakesh K. Bissoondeeal

  2. Learning in Multilayer Networks • Backpropagation Learning • A multilayer neural network trained using the backpropagation learning algorithm is one of the most powerful forms of supervised neural network. • Training such a network involves three stages: 1) the feedforward of the input training pattern, 2) the calculation and backpropagation of the associated error, and 3) the adjustment of the weights.

  3. Architecture of Network • In a typical Multilayer network, the input units (Xi) are fully connected to all hidden layer units (Yj) and the hidden layer units are fully connected to all output layer units (Zk).

  4. Architecture of Network • Each connection between the input and hidden layer units, and between the hidden and output layer units, has an associated weight (Wij or Vjk respectively). • The hidden and output layer units also receive signals through weighted connections (biases) from units whose values are always 1.

  5. Architecture of Network • Activation Functions • The choice of activation function to use in a backpropagation network is limited to functions that are continuous, differentiable and monotonically non-decreasing. • Furthermore, for computational efficiency, it is desirable that its derivative is easy to compute. Usually the function is also expected to saturate, i.e. approach finite maximum and minimum values asymptotically. • One of the most typical activation functions used is the binary sigmoid function: • f(x) = 1 / (1 + exp(-x)) • whose derivative is given by: f'(x) = f(x)[1 - f(x)]
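A minimal sketch of this activation function and its derivative in NumPy (the function names are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    # Binary sigmoid: f(x) = 1 / (1 + exp(-x)); saturates at 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = f(x) * [1 - f(x)]; cheap to compute because it reuses the forward value
    fx = sigmoid(x)
    return fx * (1.0 - fx)
```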

  6. Backpropagation Learning Algorithm • During the feedforward phase, each of the input units (Xi) is set to its given input pattern value • Xi = input[i] • Each input unit is then multiplied by the weight of its connection. The weighted inputs are then fed into the hidden units (Y1 to Yj). • Each hidden unit then sums the incoming signals and applies an activation function to produce an output: Yj = f(bj + Σi XiWij)

  7. Backpropagation Learning Algorithm • Each of the outputs of the hidden units is then multiplied by the weight of its connection and the weighted signals are fed into the output units (Z1 to Zk). • Each output unit then sums the incoming signals from the hidden units and applies an activation function to form the response of the net for a given input pattern: • Zk = f(bk + Σj YjVjk)
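A minimal sketch of this feedforward phase (slides 6 and 7), assuming a single input pattern x as a NumPy vector, an input-to-hidden weight matrix W, a hidden-to-output weight matrix V, and a bias vector for each layer; all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    # Binary sigmoid from slide 5
    return 1.0 / (1.0 + np.exp(-x))

def feedforward(x, W, b_hidden, V, b_output):
    # Hidden layer: Y_j = f(b_j + sum_i X_i * W_ij)
    y = sigmoid(b_hidden + x @ W)
    # Output layer: Z_k = f(b_k + sum_j Y_j * V_jk)
    z = sigmoid(b_output + y @ V)
    return y, z
```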

  8. Backpropagation Learning Algorithm • Backpropagation of errors • During training, each output unit compares its output (Zk) with the required target value (dk) to determine the associated error for that pattern. Based on this error, a factor δk is computed and used to distribute the error at Zk back to all units in the previous layer: • δk = f'(Zk)(dk - Zk) • Each hidden unit then computes a similar factor δj, a weighted sum of the backpropagated delta terms from the units in the output layer multiplied by the derivative of the activation function for that unit: • δj = f'(Yj) Σk δk Vjk
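A hedged sketch of these two delta calculations, reusing the feedforward values y and z from the sketch above and the sigmoid identity f'(·) = f(·)[1 - f(·)]:

```python
import numpy as np

def backprop_deltas(target, y, z, V):
    # Output layer: delta_k = f'(Z_k) * (d_k - Z_k); for the sigmoid, f'(Z_k) = Z_k * (1 - Z_k)
    delta_out = z * (1.0 - z) * (target - z)
    # Hidden layer: delta_j = f'(Y_j) * sum_k delta_k * V_jk
    delta_hidden = y * (1.0 - y) * (V @ delta_out)
    return delta_out, delta_hidden
```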

  9. Weight adjustment • After all the delta terms have been calculated, each hidden and output layer unit updates its connection weights and bias weights accordingly. • Output layer: bk(new) = bk(old) + αδk, Vjk(new) = Vjk(old) + αδkYj • Hidden layer: bj(new) = bj(old) + αδj, Wij(new) = Wij(old) + αδjXi • where α is a learning rate coefficient that is given a value between 0 and 1 at the start of training.
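A minimal sketch of these update rules, applied in place to NumPy weight arrays (the default alpha and the function name are illustrative choices, not fixed by the lecture):

```python
import numpy as np

def update_weights(W, b_hidden, V, b_output, x, y, delta_hidden, delta_out, alpha=0.25):
    # Output layer: V_jk += alpha * delta_k * Y_j and b_k += alpha * delta_k
    V += alpha * np.outer(y, delta_out)
    b_output += alpha * delta_out
    # Hidden layer: W_ij += alpha * delta_j * X_i and b_j += alpha * delta_j
    W += alpha * np.outer(x, delta_hidden)
    b_hidden += alpha * delta_hidden
    return W, b_hidden, V, b_output
```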

  10. Test stopping condition • After each epoch of training (one epoch = one cycle through the entire training set) the performance of the network is measured by computing the average (Root Mean Square (RMS)) error of the network for all of the patterns in the training set and for all of the patterns in a validation set; these two sets are disjoint. • Training is terminated when the RMS value for the training set is continuing to decrease but the RMS value for the validation set is starting to increase. This prevents the network from being OVERTRAINED (i.e. memorising the training set) and ensures that the ability of the network to GENERALISE (i.e. correctly classify non-trained patterns) is at its maximum.
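One possible sketch of this stopping test, assuming the per-epoch RMS values are kept in lists; the patience window is an added assumption, since the slide only says the validation error is "starting to increase":

```python
import numpy as np

def rms_error(targets, outputs):
    # Root Mean Square error over all patterns and all output units
    return np.sqrt(np.mean((np.asarray(targets) - np.asarray(outputs)) ** 2))

def should_stop(train_rms, val_rms, patience=3):
    # Stop when training RMS is still falling but validation RMS has risen
    # for `patience` consecutive epochs (a sign of overtraining)
    if len(val_rms) <= patience:
        return False
    recent = val_rms[-(patience + 1):]
    validation_rising = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    training_falling = train_rms[-1] < train_rms[-2]
    return validation_rising and training_falling
```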

  11. A simple example of overfitting (overtraining) • Which model is better? The complicated model fits the data better, but it is not economical. A model is convincing when it fits a lot of data surprisingly well.

  12. Validation • [Figure: error E plotted against the amount of training (parameter adjustment), with one curve for the training set and one for the validation set; training is stopped where the validation error starts to rise.]

  13. Problems with basic Backpropagation • One of the problems with the basic backpropagation algorithm is that it is possible for the network to get ‘stuck’ in a local minimum area on the error surface rather than in the desired global minimum. • The weight updating therefore ceases in a local minimum and the network becomes trapped because it cannot alter the weights to get out of the local minimum.

  14. Local Minima • [Figure: error surface showing a local minimum alongside the deeper global minimum.]

  15. Backpropagation with Momentum • One solution to the problems with the basic backpropagation algorithm is to use a slightly modified weight updating procedure. In backpropagation with momentum, the weight change is in a direction that is a combination of the current error gradient and the previous error gradient. • The modified weight updating procedures are: • Wij(t+1) = Wij(t) + αδjXi + μ[Wij(t) - Wij(t - 1)] • Vjk(t+1) = Vjk(t) + αδkYj + μ[Vjk(t) - Vjk(t - 1)] • where μ is a momentum coefficient that is given a value between 0 and 1 at the start of training. • The use of the extra momentum term can help the network to ‘climb out’ of local minima and can also help speed up the network training.
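A short sketch of one momentum step for the input-to-hidden weights, following the slide's update rule (the variable names and the default mu are illustrative):

```python
import numpy as np

def momentum_step(W, prev_change, x, delta_hidden, alpha=0.25, mu=0.9):
    # Current gradient step: alpha * delta_j * X_i
    grad_step = alpha * np.outer(x, delta_hidden)
    # New weight change = current step + mu * previous weight change [W(t) - W(t-1)]
    change = grad_step + mu * prev_change
    # Return the updated weights and the change, to be reused as prev_change next step
    return W + change, change
```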

  16. Momentum • Adds a percentage of the last movement to the current movement

  17. Choice of Parameters • Initial weight set • Normally, the network weights are initialised to small random values before training is started. However, the choice of starting weight set can affect whether or not the network can find the global error minimum. This is due to the presence of local minima within the error surface. Some starting weight sets may therefore set the network off on a path that leads to a given local minimum whilst other starting weight sets avoid the local minimum. • It may therefore be necessary for several training runs to be performed using different random starting weight sets in order to determine whether or not the network has achieved the desired global minimum.
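A minimal sketch of small random weight initialisation; changing the seed gives a different starting weight set, which is how several training runs from different starting points can be performed (the ±0.5 range and the names are assumptions):

```python
import numpy as np

def init_weights(n_inputs, n_hidden, n_outputs, scale=0.5, seed=None):
    # Small random values in [-scale, scale] for all weights and biases
    rng = np.random.default_rng(seed)
    W = rng.uniform(-scale, scale, size=(n_inputs, n_hidden))   # input-to-hidden
    V = rng.uniform(-scale, scale, size=(n_hidden, n_outputs))  # hidden-to-output
    b_hidden = rng.uniform(-scale, scale, size=n_hidden)
    b_output = rng.uniform(-scale, scale, size=n_outputs)
    return W, b_hidden, V, b_output
```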

  18. Choice of Parameters • Number of hidden neurons • Usually determined by experimentation • Too many – network will memorise training set and will not generalise well • Too few – risk that network may not be able to learn the pattern in the training set • Learning rate • Value between 0 and 1 • Too low – training will be very slow • Too high – network may never reach a global minimum • It is often necessary to train the network with different learning rates to find the optimum value for the problem under investigation

  19. Choice of Parameters • Training, validation and test sets • Training set – The choice of training set can also affect the ability of the network to reach the global minimum. The aim is to have a set of patterns that are representative of the whole population of patterns that the network is expected to encounter. • Example • Training set – 75% • Validation set – 10% • Test set – 5%

  20. Pre-processing and Post-processing • Pre-process data → Train network → Post-process data • Why pre-process? • Input variables sometimes differ by several orders of magnitude, and the sizes of the variables do not necessarily reflect their importance in finding the required output. • Types of pre-processing • Input normalisation – normalised inputs will fall in the range [-1, 1] • Normalise the mean and standard deviation of the training set so that the input variables have zero mean and a standard deviation of 1
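Two possible pre-processing sketches matching the slide: range normalisation into [-1, 1] and standardisation to zero mean and unit standard deviation (the statistics are taken from the training set, passed here as a 2-D array of patterns by variables):

```python
import numpy as np

def normalise_range(X):
    # Scale each input variable into [-1, 1] using its training-set minimum and maximum
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - x_min) / (x_max - x_min) - 1.0

def standardise(X):
    # Shift and scale each input variable to zero mean and standard deviation 1
    return (X - X.mean(axis=0)) / X.std(axis=0)
```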

  21. Recommended Reading • Fundamentals of Neural Networks: Architectures, Algorithms and Applications, L. Fausett, 1994. • Artificial Intelligence: A Modern Approach, S. Russell and P. Norvig, 1995. • An Introduction to Neural Networks, 2nd Edition, I. M. Morton.
