
Presentation Transcript


  1. 580.691 Learning Theory Reza Shadmehr & Jörn Diedrichsen Reinforcement Learning 3: TD(λ) and eligibility traces

  2. N-step. We have seen the possible virtue of backing up the temporal difference error by more than one state. The n-step backup rule was given on the slide. In your homework you should have come up with a graph like this: [figure: performance as a function of learning rate].
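The slide's equation image is not reproduced in the transcript; in one standard notation (following Sutton & Barto), the n-step return and the corresponding backup read:

\[
R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n}),
\qquad
\Delta V(s_t) = \alpha \left[ R_t^{(n)} - V(s_t) \right].
\]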

  3. Complex backups. The idea of complex backups is not to use just a single n-step backup, but a mixture of backups. For example, we could average a 2-step backup and a 4-step backup. One way of doing this is to use a geometric series with parameter λ to average all possible n-step backups. This is the forward view of the important algorithm TD(λ). It is called the forward view because, for each state, we look forward in time to see how the value function of that state is updated. If λ = 0, we use only the 1-step backup, so TD(0) is the temporal difference learning we have been using so far. If λ = 1, we learn only from the final return, which means we are doing Monte Carlo learning.
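The averaged backup the slide describes is the λ-return; the equation is not in the transcript, but its standard form is:

\[
R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} R_t^{(n)},
\]

and for an episode that terminates at time T,

\[
R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{\,n-1} R_t^{(n)} + \lambda^{\,T-t-1} R_t,
\]

where R_t is the full (Monte Carlo) return. Setting λ = 0 keeps only the 1-step backup, and setting λ = 1 keeps only the full return, matching the two limits described above.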

  4. Backward view of TD(λ): eligibility traces. How do you best implement TD(λ)? The forward view tells us how every state is updated, but we would have to wait until the end of the episode before we could update anything. The backward view tells us how to broadcast the current temporal difference error back to previously visited states. The key idea is the eligibility trace, which keeps track of how much each past state should learn from the current TD error. With it, the algorithm is very easy to implement:
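The slide's update equations and pseudocode are not in the transcript. A minimal tabular sketch of TD(λ) with accumulating eligibility traces follows; the function name, variable names, and episode format are illustrative, not from the slides.

```python
import numpy as np

def td_lambda_episode(V, episode, alpha=0.1, gamma=1.0, lam=0.6):
    """Run one episode of tabular TD(lambda) with accumulating eligibility traces.

    V       : 1-D float array of state values (updated in place)
    episode : list of (state, reward, next_state) transitions;
              next_state is None for the terminal transition
    """
    e = np.zeros_like(V)                      # eligibility trace per state
    for s, r, s_next in episode:
        v_next = 0.0 if s_next is None else V[s_next]
        delta = r + gamma * v_next - V[s]     # one-step TD error
        e[s] += 1.0                           # current state becomes fully eligible
        V += alpha * delta * e                # broadcast the error to all eligible states
        e *= gamma * lam                      # decay every trace
    return V
```

The trace update implements e_t(s) = γλ e_{t-1}(s) + 1 for the visited state, so each state keeps learning from TD errors for a while after it was visited.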

  5. This is how TD(λ) does on the homework problem. Even though the 1-step backup, at its optimal learning rate, was better than any other n-step backup, TD(0.6) beats TD(0) when each is run at its optimal learning rate.

  6. Forward view: after the whole series has been run, the net change in state s is given by the first sum below (using the indicator function I(a,b), which is 1 if a = b and 0 otherwise). Backward view: after the whole series has been run, the net change in state s is given by the second sum below. On every step, only δt is broadcast; it is weighted by the eligibility trace, which can also be written in closed form.
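The slide's equations are not reproduced in the transcript; in one standard notation they read as follows. Forward view:

\[
\Delta V(s) = \sum_{t=0}^{T-1} \alpha \left[ R_t^{\lambda} - V(s_t) \right] I(s_t, s).
\]

Backward view, with the closed-form eligibility trace:

\[
\Delta V(s) = \sum_{t=0}^{T-1} \alpha \, \delta_t \, e_t(s),
\qquad
e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{\,t-k} I(s_k, s).
\]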

  7. Aligning the forward and backward views. We can represent the backward view by a grid over time steps t and visit times k: at every time step t we go through all states visited so far and broadcast the error δt back. By changing the order of summation, from summing over the rows to summing over the columns, we get something that already looks very much like the forward view. To finally establish the equivalence, we need to show the identity below.
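These equations are not in the transcript; written out, swapping the order of summation in the backward view gives

\[
\sum_{t=0}^{T-1} \delta_t \sum_{k=0}^{t} (\gamma\lambda)^{\,t-k} I(s_k, s)
= \sum_{k=0}^{T-1} I(s_k, s) \sum_{t=k}^{T-1} (\gamma\lambda)^{\,t-k} \delta_t ,
\]

so the equivalence with the forward view follows once we show (with the values held fixed during the episode, i.e. offline updating) that

\[
R_k^{\lambda} - V(s_k) = \sum_{t=k}^{T-1} (\gamma\lambda)^{\,t-k} \delta_t .
\]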

  8. Now we pull out all the terms that depend on t+1, then everything that depends on t+2, and so on. The result is exactly the weighted sum of TD errors that forms the last term in the backward view:
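The algebra itself is not in the transcript; a standard reconstruction of the step is the recursion

\[
R_t^{\lambda} - V(s_t)
= r_{t+1} + \gamma V(s_{t+1}) - V(s_t) + \gamma\lambda \left[ R_{t+1}^{\lambda} - V(s_{t+1}) \right]
= \delta_t + \gamma\lambda \left[ R_{t+1}^{\lambda} - V(s_{t+1}) \right],
\]

which, unrolled to the end of the episode, gives

\[
R_t^{\lambda} - V(s_t) = \sum_{k=t}^{T-1} (\gamma\lambda)^{\,k-t} \delta_k ,
\]

exactly the inner sum of discounted TD errors in the backward view.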

  9. Advantages of eligibility traces:
  • They are a mixture between temporal difference learning (which assumes a specific Markov model structure) and Monte Carlo methods (which do not assume any model structure). They can exploit model structure for better learning (like TD(0)), but if the Markov property is violated they also provide some robustness (like Monte Carlo).
  • Eligibility traces are an elegant way of going from discrete to continuous time. In continuous time, state transitions form a labeled point process and eligibility traces decay exponentially.
  Discussion question: what if the states are continuous? That is, what if there is a vector of continuous variables x that describes the state space? One example would be the position and velocity of the car in the car parking problem. How would we approximate V(x)? (One common answer is sketched below.)
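The slides do not give an answer to the discussion question; one common approach is a linear approximation V(x) = wᵀφ(x) over a fixed feature map φ, with the eligibility trace kept over features rather than states. The sketch below is illustrative; the function names, feature map, and episode format are assumptions, not the course's own solution.

```python
import numpy as np

def td_lambda_linear(episode, phi, n_features, alpha=0.01, gamma=1.0, lam=0.6):
    """TD(lambda) with a linear value-function approximation V(x) = w . phi(x).

    episode    : list of (x, reward, x_next) with continuous state vectors x;
                 x_next is None for the terminal transition
    phi        : feature map from a state vector to a length-n_features vector
    """
    w = np.zeros(n_features)
    e = np.zeros(n_features)                  # trace over features, not states
    for x, r, x_next in episode:
        f = phi(x)
        v = w @ f
        v_next = 0.0 if x_next is None else w @ phi(x_next)
        delta = r + gamma * v_next - v        # TD error under the approximation
        e = gamma * lam * e + f               # trace accumulates the gradient of V w.r.t. w
        w += alpha * delta * e
    return w
```

With a tabular one-hot feature map, this reduces to the tabular algorithm above; richer feature maps (e.g. tilings or radial basis functions over position and velocity) let it generalize across nearby continuous states.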

  10. Schultz, Dayan, & Montague, Science 1997

  11. Schultz, Dayan, & Montague, Science 1997

  12. Schultz, Dayan, & Montague, Science 1997

  13. Schultz, Dayan, & Montague, Science 1997
