
WK5 – Dynamic Networks






Presentation Transcript


  1. WK5 – Dynamic Networks CS 476: Networks of Neural Computation WK5 – Dynamic Networks: Time Delayed & Recurrent Networks Dr. Stathis Kasderidis Dept. of Computer Science University of Crete Spring Semester, 2009

  2. Contents • Sequence Learning • Time Delayed Networks I: Implicit Representation • Time Delayed Networks II: Explicit Representation • Recurrent Networks I: Elman + Jordan Networks • Recurrent Networks II: Back Propagation Through Time • Conclusions

  3. Sequence Learning • MLP & RBF networks are static networks, i.e. they learn a mapping from a single input signal to a single output response, for an arbitrarily large number of pairs. • Dynamic networks learn a mapping from a single input signal to a sequence of response signals, for an arbitrary number of (signal, sequence) pairs. • Typically the input signal to a dynamic network is an element of the sequence, and the network then produces as a response the rest of the sequence. • To learn sequences we need to include some form of (short-term) memory in the network.

  4. Sequence Learning II • We can introduce memory effects in two principal ways: • Implicit: e.g. a time-lagged signal as input to a static network, or recurrent connections; • Explicit: e.g. the Temporal Backpropagation method. • In the implicit form, we assume that the environment from which we collect examples of (input signal, output sequence) is stationary. In the explicit form the environment can be non-stationary, i.e. the network can track changes in the structure of the signal.

  5. Time Delayed Networks I • The time-delayed approach includes two basic types of networks: • Implicit Representation of Time: we combine a memory structure in the input layer with a static network model; • Explicit Representation of Time: we explicitly allow the network to code time, by generalising the network weights from scalars to vectors, as in TBP (Temporal Backpropagation). • Typical forms of memory are the Tapped Delay Line and the Gamma Memory family.

  6. Time Delayed Networks I • The Tapped Delay Line form of memory stores the p most recent samples of an input signal x(n), giving the state vector [x(n), x(n-1), …, x(n-p)]. • The Gamma form of memory is defined by the recursion: • xk(n) = (1-μ) xk(n-1) + μ xk-1(n-1), k = 1, …, p, with x0(n) = x(n), • where μ is an adjustable memory parameter; for μ = 1 the gamma memory reduces to the tapped delay line.
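The two memory structures above can be sketched in a few lines of Python (a hedged illustration, not from the slides; the zero initial conditions are an assumption of this sketch):

```python
def tapped_delay_line(x, p):
    """State vectors [x(n), x(n-1), ..., x(n-p)] for each n; samples before
    time 0 are taken as 0 (an assumed initial condition)."""
    return [[x[n - l] if n - l >= 0 else 0.0 for l in range(p + 1)]
            for n in range(len(x))]

def gamma_memory(x, p, mu):
    """Gamma memory recursion: x_0(n) = x(n),
    x_k(n) = (1 - mu) * x_k(n-1) + mu * x_{k-1}(n-1) for k = 1..p."""
    state = [0.0] * (p + 1)
    out = []
    for xn in x:
        # each stage is a leaky integrator fed by the previous stage
        state = [xn] + [(1 - mu) * state[k] + mu * state[k - 1]
                        for k in range(1, p + 1)]
        out.append(list(state))
    return out
```

For μ = 1 the gamma stages become pure unit delays, so the two memories coincide; for 0 < μ < 1 the gamma memory trades temporal resolution for a longer effective memory depth.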

  7. Time Delayed Networks I • The Gamma Memory is shown below:

  8. Time Delayed Networks I • In the implicit representation approach we combine a static network (e.g. MLP or RBF) with a memory structure (e.g. a tapped delay line). • An example is shown below (the NETtalk network):

  9. Time Delayed Networks I • We present the data through a sliding window. For example, in NETtalk the middle group of input neurons presents the letter in focus; the remaining input groups, three before and three after, present context. • The purpose is to predict, for example, the next symbol in the sequence. • The NETtalk model (Sejnowski & Rosenberg, 1987) has: • 203 input nodes • 80 hidden neurons • 26 output neurons • 18629 weights • It was trained with the BP method.
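The sliding-window input encoding can be sketched as follows (a hedged illustration: the 29-symbol alphabet with punctuation and a space used for padding is an assumption, chosen so that 7 × 29 = 203 matches the input size quoted above):

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz ,."  # 29 symbols (assumed set)

def one_hot(ch):
    """One unit per symbol, as in a locally-encoded input group."""
    v = [0.0] * len(ALPHABET)
    v[ALPHABET.index(ch)] = 1.0
    return v

def window_input(text, focus, half=3):
    """Encode the focus letter plus 3 letters of context on each side;
    positions outside the text are padded with a space (an assumption)."""
    vec = []
    for i in range(focus - half, focus + half + 1):
        ch = text[i] if 0 <= i < len(text) else " "
        vec.extend(one_hot(ch))
    return vec
```

Each of the 7 groups activates exactly one of its 29 units, giving the 203-dimensional input vector fed to the static MLP.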

  10. Time Delayed Networks II • In the explicit time representation, neurons have a spatio-temporal structure: a synapse arriving at a neuron is not a scalar weight but a vector of weights, which is convolved with the time-delayed output signal of the previous neuron. • A schematic representation of such a neuron is given below:

  11. Time Delayed Networks II • The output of the neuron in this case is given by: • yj(n) = φ( Σi Σl=0…p wji(l) xi(n−l) + bj ) • In the case of a whole network, for example assuming a single output node and a linear output layer, the response is given by: • y(n) = Σj wj φ( Σi Σl=0…p wji(l) xi(n−l) + bj ) + b0 • where p is the depth of the memory and b0 is the bias of the output neuron.

  12. Time Delayed Networks II • In the more general case, where we have multiple neurons at the output layer, we have for neuron j of any layer: • yj(n) = φ(vj(n)), with induced local field vj(n) = Σi sji(n) + bj • The output of any synapse is given by the convolution sum: • sji(n) = Σl=0…p wji(l) xi(n−l) = wjiT xi(n)

  13. Time Delayed Networks II • where the state vector xi(n) and weight vector wji for synapse i are defined as follows: • xi(n) = [xi(n), xi(n−1), …, xi(n−p)]T • wji = [wji(0), wji(1), …, wji(p)]T
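The synapse convolution and the resulting neuron output can be written directly from these definitions (a minimal sketch; the tanh activation is an assumption, since the slides leave φ unspecified):

```python
from math import tanh

def synapse_output(w, x):
    """Convolution sum s_ji(n) = sum_{l=0}^{p} w_ji(l) x_i(n-l), i.e. the
    inner product of the weight vector with the state vector."""
    return sum(wl * xl for wl, xl in zip(w, x))

def neuron_output(weight_vectors, state_vectors, bias):
    """y_j(n) = phi(v_j(n)) with v_j(n) = sum_i w_ji^T x_i(n) + b_j;
    phi is taken here to be tanh (an assumption)."""
    v = sum(synapse_output(w, x)
            for w, x in zip(weight_vectors, state_vectors)) + bias
    return tanh(v)
```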

  14. Time Delayed Networks II: Learning Law • To train such a network we use the Temporal BackPropagation algorithm, presented below. • Assume that neuron j lies in the output layer; its response at time n is denoted by yj(n) and its desired response by dj(n). • We can define an instantaneous value for the sum of squared errors produced by the network as follows: • E(n) = ½ Σj ej2(n)

  15. Time Delayed Networks II: Learning Law-1 • The error signal at the output layer is defined by: • ej(n) = dj(n) − yj(n) • The idea is to minimise an overall cost function, calculated over all time: • Etotal = Σn E(n) • We could proceed as usual by calculating the gradient of the cost function with respect to the weights. This implies that we need to calculate the instantaneous gradient ∂E(n)/∂wji.

  16. Time Delayed Networks II: Learning Law-2 • However, for this approach to work we need to unfold the network in time (i.e. convert it to an equivalent static network and then calculate the gradient). This option presents a number of disadvantages: • A loss of symmetry between the forward and backward passes for the calculation of the instantaneous gradient; • No nice recursive formula for the propagation of error terms; • The need for global bookkeeping to keep track of which static weights are actually the same in the equivalent network.

  17. Time Delayed Networks II: Learning Law-3 • For these reasons we prefer to calculate the gradient of the cost function as follows: • ∂Etotal/∂wji = Σn (∂Etotal/∂vj(n)) (∂vj(n)/∂wji) • Note that in general: • (∂Etotal/∂vj(n)) (∂vj(n)/∂wji) ≠ ∂E(n)/∂wji • The equality is correct only when we take the sum over all time.

  18. Time Delayed Networks II: Learning Law-4 • To calculate the weight update we use the steepest descent method: • wji(n+1) = wji(n) − η (∂Etotal/∂vj(n)) (∂vj(n)/∂wji) • where η is the learning rate. • We calculate the second factor from the definition of the induced local field vj(n) = Σi wjiT xi(n), which gives: • ∂vj(n)/∂wji = xi(n)

  19. Time Delayed Networks II: Learning Law-5 • We define the local gradient as: • δj(n) = −∂Etotal/∂vj(n) • Thus we can write the weight update equation in the familiar form: • wji(n+1) = wji(n) + η δj(n) xi(n) • We need to calculate δj(n) for the cases of the output and hidden layers.

  20. Time Delayed Networks II: Learning Law-6 • For the output layer the local gradient is given by: • δj(n) = ej(n) φ'(vj(n)) • For a hidden layer we assume that neuron j is connected to a set A of neurons in the next layer (hidden or output). Then we have, by the chain rule: • δj(n) = −Σr∈A Σn' (∂Etotal/∂vr(n')) (∂vr(n')/∂vj(n))

  21. Time Delayed Networks II: Learning Law-7 • By re-writing we get the following: • δj(n) = φ'(vj(n)) Σr∈A Σl=0…p δr(n+l) wrj(l) • Finally, putting it all together, we get the update equations for every layer:

  22. Time Delayed Networks II: Learning Law-8 • where l denotes the layer level in the update equations.
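For a single linear neuron, φ'(•) = 1 and δ(n) = e(n), so the temporal backpropagation update collapses to the LMS rule for an adaptive FIR filter. The sketch below (a hedged special case, not the full multilayer algorithm) learns the taps of an unknown FIR target:

```python
def train_fir(xs, ds, p, eta, epochs):
    """Gradient-descent training of a linear FIR neuron:
    y(n) = sum_l w(l) x(n-l); w(l) <- w(l) + eta * e(n) * x(n-l)."""
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        for n in range(len(xs)):
            # tapped-delay-line state vector, zero-padded before time 0
            x = [xs[n - l] if n - l >= 0 else 0.0 for l in range(p + 1)]
            y = sum(wl * xl for wl, xl in zip(w, x))
            e = ds[n] - y  # instantaneous error e(n) = d(n) - y(n)
            w = [wl + eta * e * xl for wl, xl in zip(w, x)]
    return w
```

With a realizable target (the desired signal generated by a fixed FIR filter), the learned taps converge to the target coefficients.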

  23. Recurrent I • A network is called recurrent when there are connections that feed back to previous layers or neurons, including self-connections. An example is shown next. • Successful early models of recurrent networks are: • the Jordan Network • the Elman Network

  24. Recurrent I • The Jordan Network has the structure of an MLP plus additional context units. The output neurons feed back to the context neurons in a one-to-one fashion, and the context units also feed back to themselves. • The network is trained using the Backpropagation algorithm. • A schematic is shown in the next figure:

  25. Recurrent I • The Elman Network also has the structure of an MLP plus additional context units. The hidden neurons feed back to the context neurons in a one-to-one fashion; these connections are constant and equal to 1. The model is also called the Simple Recurrent Network (SRN). • The network is trained using the Backpropagation algorithm. • A schematic is shown in the next figure:
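One time step of an Elman (SRN) forward pass can be sketched as follows (a hedged illustration: tanh hidden units and a linear output layer are assumptions; the context is simply a copy of the previous hidden vector, matching the fixed one-to-one connections of weight 1):

```python
from math import tanh

def elman_step(u, context, W_in, W_ctx, W_out, b_h, b_o):
    """h(n) = tanh(W_in u(n) + W_ctx c(n) + b_h); y(n) = W_out h(n) + b_o.
    The returned h becomes the context c(n+1) for the next step."""
    h = [tanh(sum(w * ui for w, ui in zip(r_in, u)) +
              sum(w * ci for w, ci in zip(r_ctx, context)) + b)
         for r_in, r_ctx, b in zip(W_in, W_ctx, b_h)]
    y = [sum(w * hi for w, hi in zip(row, h)) + b
         for row, b in zip(W_out, b_o)]
    return y, h
```

Only the ordinary weights (W_in, W_ctx, W_out) are adapted by backpropagation; the hidden-to-context copy stays fixed at 1, which is what makes standard BP applicable.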

  26. Recurrent II • More complex forms of recurrent networks are possible, using an MLP as a basic building block. • Typical paradigms of complex recurrent models are: • the Nonlinear Autoregressive with Exogenous Inputs Network (NARX) • the State Space Model • the Recurrent Multilayer Perceptron (RMLP) • Schematic representations of these networks are given in the next slides.

  27. Recurrent II-1 • The structure of the NARX model includes: • an MLP static network; • a current input u(n) and its delayed versions up to a lag q; • a time-delayed version of the current output y(n) which feeds back to the input layer; the memory depth of the delayed output vector is in general p. • The output is calculated as: • y(n+1) = F(y(n), …, y(n−p+1), u(n), …, u(n−q+1))
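The closed-loop operation of the equation above can be sketched directly, with the static network F left as a pluggable function (a hedged sketch: zero initial conditions for the missing past values are an assumption):

```python
def simulate_narx(F, u, p, q):
    """y(n+1) = F(y(n), ..., y(n-p+1), u(n), ..., u(n-q+1)).
    F receives the p delayed outputs followed by the q delayed inputs."""
    y = [0.0]  # assumed initial output y(0) = 0
    for n in range(len(u) - 1):
        y_taps = [y[n - k] if n - k >= 0 else 0.0 for k in range(p)]
        u_taps = [u[n - k] if n - k >= 0 else 0.0 for k in range(q)]
        y.append(F(y_taps + u_taps))
    return y
```

For example, with p = q = 1 and a linear F(z) = 0.5 z[0] + z[1], the model reduces to the first-order recursion y(n+1) = 0.5 y(n) + u(n).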

  28. Recurrent II-2 • A schematic of the NARX model is as follows:

  29. Recurrent II-3 • The structure of the State Space model includes: • an MLP network with a single hidden layer; • the hidden neurons define the state of the network; • a linear output layer; • a feedback of the hidden layer to the input layer, assuming a memory of q lags. • The output is determined by the coupled equations: • x(n+1) = f(x(n), u(n)) • y(n+1) = C x(n+1)

  30. Recurrent II-4 • where f is a suitable nonlinear function characterising the hidden layer, x is the state vector produced by the hidden layer (q components), y is the output vector (p components), and u is the input vector (m components). • A schematic representation of the network is given below:
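One update of the coupled state-space equations can be sketched as follows (a hedged sketch: tanh hidden units realising f are an assumption; the matrices are plain lists of rows):

```python
from math import tanh

def state_space_step(x, u, W_a, W_b, C):
    """x(n+1) = tanh(W_a x(n) + W_b u(n)); y(n+1) = C x(n+1).
    W_a: q-by-q recurrent weights, W_b: q-by-m input weights, C: p-by-q readout."""
    x_next = [tanh(sum(a * xi for a, xi in zip(row_a, x)) +
                   sum(b * ui for b, ui in zip(row_b, u)))
              for row_a, row_b in zip(W_a, W_b)]
    y_next = [sum(c * xi for c, xi in zip(row, x_next)) for row in C]
    return x_next, y_next
```

The linear readout y = C x matches the linear output layer of the model; all the nonlinearity lives in the state transition.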

  31. Recurrent II-5 • The structure of the RMLP includes: • one or more hidden layers; • feedback around each layer; • the general structure of a static MLP network. • The output is calculated as follows (assuming that xI, xII and xO are the outputs of the first, second and output layers): • xI(n+1) = φI(xI(n), u(n)) • xII(n+1) = φII(xII(n), xI(n+1)) • xO(n+1) = φO(xO(n), xII(n+1))

  32. Recurrent II-6 • where the functions φI(•), φII(•) and φO(•) denote the activation functions of the corresponding layers. • A schematic representation is given below:

  33. Recurrent II-7 • Some theorems on the computational power of recurrent networks: • Thm 1: All Turing machines may be simulated by fully connected recurrent networks built on neurons with sigmoid activation functions. • Thm 2: NARX networks with one layer of hidden neurons with bounded, one-sided saturated (BOSS) activation functions and a linear output neuron can simulate fully connected recurrent networks with bounded, one-sided saturated activation functions, except for a linear slowdown.

  34. Recurrent II-8 • Corollary: NARX networks with one hidden layer of neurons with BOSS activation functions and a linear output neuron are Turing equivalent.

  35. Recurrent II-9 • The training of recurrent networks can be done with two methods: • BackPropagation Through Time • Real-Time Recurrent Learning • We can train a recurrent network with either epoch-based or continuous training. However, an epoch in recurrent networks does not mean the presentation of all learning patterns; rather, it denotes the length of a single sequence that we use for training. So an epoch in a recurrent network corresponds to presenting only one pattern (sequence) to the network. • At the end of an epoch the network stabilises.

  36. Recurrent II-10 • Some useful heuristics for training are given below: • A lexicographic order of training samples should be followed, with the shortest strings of symbols presented to the network first; • Training should begin with a small training sample, whose size is then incrementally increased as training proceeds; • The synaptic weights should be updated only if the absolute error on the training sample currently being processed is greater than some prescribed criterion;

  37. Recurrent II-11 • The use of weight decay during training is recommended; weight decay was discussed in WK3. • The BackPropagation Through Time algorithm proceeds by unfolding the network in time. To be more specific: • Assume that we have a recurrent network N which is required to learn a temporal task, starting from time n0 and going all the way to time n. • Let N* denote the feedforward network that results from unfolding the temporal operation of the recurrent network N.

  38. Recurrent II-12 • The network N* is related to the original network N as follows: • For each time step in the interval (n0, n], the network N* has a layer containing K neurons, where K is the number of neurons contained in network N; • In every layer of network N* there is a copy of each neuron in network N; • For every time step l ∈ [n0, n], the synaptic connection from neuron i in layer l to neuron j in layer l+1 of network N* is a copy of the synaptic connection from neuron i to neuron j in network N. • The following example explains the idea of unfolding:

  39. Recurrent II-13 • We assume that we have a network with two neurons, which is unfolded for a number of steps n:
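The equivalence between the recurrent network N and its unfolded counterpart N* can be checked numerically (a hedged two-neuron sketch with tanh units; the external input entering additively at every step is an assumption about the example's wiring):

```python
from math import tanh

def run_recurrent(W, u_seq, y0):
    """Recurrent network N: y(n+1) = tanh(W y(n) + u(n)), iterated in place."""
    y = list(y0)
    for u in u_seq:
        y = [tanh(sum(w * yi for w, yi in zip(row, y)) + ui)
             for row, ui in zip(W, u)]
    return y

def run_unfolded(W, u_seq, y0):
    """Unfolded network N*: one layer per time step, each layer a copy of W."""
    layers = [W] * len(u_seq)  # every layer shares the same weight copies
    y = list(y0)
    for W_l, u in zip(layers, u_seq):
        y = [tanh(sum(w * yi for w, yi in zip(row, y)) + ui)
             for row, ui in zip(W_l, u)]
    return y
```

The two functions compute identical outputs; the point of the unfolded view is that N* is feedforward, so ordinary backpropagation applies, provided the weight copies in all layers are constrained to stay equal.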

  40. Recurrent II-14 • We present now the method of Epochwise BackPropagation Through Time. • Let the dataset used for training the network be partitioned into independent epochs, with each epoch representing a temporal pattern of interest. Let n0 denote the start time of an epoch and n1 its end time. • We can define the following cost function: • Etotal(n0, n1) = ½ Σn=n0…n1 Σj∈A ej2(n)

  41. Recurrent II-15 • where A is the set of indices j pertaining to those neurons in the network for which desired responses are specified, and ej(n) is the error signal at the output of such a neuron, measured with respect to some desired response.

  42. Recurrent II-16 • The algorithm proceeds as follows: • For a given epoch, the recurrent network starts running from some initial state until it reaches a new state, at which point the training is stopped and the network is reset to an initial state for the next epoch. The initial state doesn't have to be the same for each epoch of training; rather, what is important is that the initial state for the new epoch is different from the state reached by the network at the end of the previous epoch;

  43. Recurrent II-17 • First, a single forward pass of the data through the network for the interval (n0, n1) is performed. The complete record of input data, network state (i.e. synaptic weights), and desired responses over this interval is saved; • A single backward pass over this past record is performed to compute the values of the local gradients δj(n) for all j ∈ A and n0 < n ≤ n1. This computation is performed by the formula: • δj(n) = φ'(vj(n)) ej(n), for n = n1 • δj(n) = φ'(vj(n)) [ej(n) + Σk∈A wjk δk(n+1)], for n0 < n < n1

  44. Recurrent II-18 • where φ'(•) is the derivative of the activation function with respect to its argument, and vj(n) is the induced local field of neuron j. • The use of the above formula is repeated, starting from time n1 and working back, step by step, to time n0; the number of steps involved here is equal to the number of time steps contained in the epoch.

  45. Recurrent II-19 • Once the computation of back-propagation has been performed back to time n0+1, the following adjustment is applied to the synaptic weight wji of neuron j: • Δwji = η Σn=n0+1…n1 δj(n) xi(n−1) • where η is the learning-rate parameter and xi(n−1) is the input applied to the ith synapse of neuron j at time n−1.
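The complete epochwise procedure — forward pass, backward recursion for the local gradients, then one accumulated weight adjustment — can be sketched for a single recurrent tanh neuron y(n) = tanh(w y(n−1) + v u(n)) (a hedged minimal instance of the algorithm; the zero initial state and the tanh activation are assumptions):

```python
from math import tanh

def bptt_epoch(w, v, u_seq, d_seq, eta):
    """One epoch of epochwise BPTT for y(n) = tanh(w*y(n-1) + v*u(n))."""
    # forward pass over the epoch: record states and induced fields
    ys, fields = [0.0], []          # ys[0] is the assumed initial state
    for u in u_seq:
        vn = w * ys[-1] + v * u
        fields.append(vn)
        ys.append(tanh(vn))
    # backward pass: delta(n1) = phi'(v) e; delta(n) = phi'(v) (e + w delta(n+1))
    N = len(u_seq)
    deltas = [0.0] * N
    for n in reversed(range(N)):
        e = d_seq[n] - ys[n + 1]
        dphi = 1.0 - tanh(fields[n]) ** 2
        ahead = w * deltas[n + 1] if n + 1 < N else 0.0
        deltas[n] = dphi * (e + ahead)
    # single accumulated update: dw = eta * sum_n delta(n) * y(n-1), etc.
    w += eta * sum(d * y for d, y in zip(deltas, ys[:-1]))
    v += eta * sum(d * u for d, u in zip(deltas, u_seq))
    return w, v
```

Repeating this over many epochs on a realizable target sequence drives the error down; the backward recursion also makes the vanishing-gradient issue discussed next visible, since each step back multiplies by w φ'(v(n)).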

  46. Recurrent II-20 • There is a potential problem with the method, called the Vanishing Gradients Problem: the corrections calculated for the weights become too small when error signals are propagated back over many time steps using methods based on steepest descent. • This is currently a research problem, and one has to consult the literature for details.

  47. Conclusions • Dynamic networks learn sequences, in contrast to the static mappings of MLP and RBF networks. • Time representation takes place explicitly or implicitly. • The implicit form uses time-delayed versions of the input vector followed by a static network model, or the use of recurrent networks. • The explicit form uses a generalisation of the MLP model where a synapse is now modelled as a weight vector and not as a single number. The synapse activation is no longer the product of the synapse's weight with the output of a previous neuron, but rather the inner product of the synapse's weight vector with the time-delayed state vector of the previous neuron.

  48. Conclusions I • The extended MLP networks with explicit temporal structure are trained with the Temporal BackPropagation algorithm. • Recurrent networks include a number of simple and complex architectures. In the simpler cases we train the networks using the standard BackPropagation algorithm. • In the more complex cases we first unfold the network in time and then train it using the BackPropagation Through Time algorithm.
