Reinforcement Learning

Based on slides by Avi Pfeffer and David Parkes

Closed Loop Interactions • [Figure: agent-environment loop. Labels: Environment (Mechanism, State), Agent (Sensors, Actuators), Percepts, Actions, Reward]

Reinforcement Learning • When mechanism (= model) is unknown • When mechanism is known, but model is too hard to solve

Basic Idea • Select an action using some sort of action selection process • If it leads to a reward, reinforce taking that action in future • If it leads to a punishment, avoid taking that action in future

But It’s Not So Simple • Rewards and punishments may be delayed • credit assignment problem: how do you figure out which actions were responsible? • How do you choose an action? • exploration versus exploitation • What if the state space is very large so you can’t visit all states?

Model-Based RL

Model-Based Reinforcement Learning • Mechanism is an MDP • Approach: • learn the MDP • solve it to determine the optimal policy • Works when model is unknown, but it is not too large to store and solve

Learning the MDP • We need to learn the parameters of the reward and transition models • We assume the agent plays every action in every state a number of times • Let $R^a_i$ = total reward received for playing a in state i • Let $N^a_i$ = number of times a was played in state i • Let $N^a_{ij}$ = number of times j was reached when a was played in state i • $R(i,a) = R^a_i / N^a_i$ • $T^a_{ij} = N^a_{ij} / N^a_i$
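The count-based estimates above can be sketched in a few lines of Python. This is a minimal illustration, assuming experience is stored as `(i, a, r, j)` tuples; the function name `learn_mdp` and the dictionary layout are hypothetical choices, not part of the slides.

```python
from collections import defaultdict

def learn_mdp(experience):
    """Estimate R(i,a) and T^a_ij from (i, a, r, j) tuples using counts."""
    total_reward = defaultdict(float)  # R^a_i: summed reward for a in i
    visits = defaultdict(int)          # N^a_i: times a was played in i
    transitions = defaultdict(int)     # N^a_ij: times j followed (i, a)
    for i, a, r, j in experience:
        total_reward[(i, a)] += r
        visits[(i, a)] += 1
        transitions[(i, a, j)] += 1
    # Average reward and empirical transition frequencies.
    R = {ia: total_reward[ia] / n for ia, n in visits.items()}
    T = {(i, a, j): n / visits[(i, a)] for (i, a, j), n in transitions.items()}
    return R, T

experience = [(0, "a", 1.0, 1), (0, "a", 0.0, 0), (0, "a", 1.0, 1), (1, "b", 2.0, 0)]
R, T = learn_mdp(experience)
print(R[(0, "a")], T[(0, "a", 1)])
```

Pairs that were never played simply do not appear in `R` or `T`, which is why the slide assumes every action is played in every state a number of times.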

Note • Learning and solving the MDP need not be a one-off thing • Instead, we can repeatedly solve the MDP to get better and better policies • How often should we solve the MDP? • depends how expensive it is compared to acting in the world

Model-Based Reinforcement Learning Algorithm
  Let $\pi_0$ be arbitrary
  k ← 0
  Experience ← ∅
  Repeat:
    k ← k + 1
    Begin in state i
    For a while:
      Choose action a based on $\pi_{k-1}$
      Receive reward r and transition to j
      Experience ← Experience ∪ <i, a, r, j>
      i ← j
    Learn MDP M from Experience
    Solve M to obtain $\pi_k$
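The slides do not say how the "solve" step is carried out. One common choice is value iteration, sketched below under that assumption. The function name `solve_mdp`, the discount `gamma`, the fixed number of sweeps, and the two-state toy model are all illustrative, not taken from the slides.

```python
def solve_mdp(states, actions, R, T, gamma=0.9, sweeps=200):
    """Value iteration on an estimated model; returns (V, greedy policy)."""
    V = {s: 0.0 for s in states}

    def q_value(i, a):
        # Expected one-step return of playing a in i under the model.
        future = sum(T.get((i, a, j), 0.0) * V[j] for j in states)
        return R.get((i, a), 0.0) + gamma * future

    for _ in range(sweeps):
        # Synchronous backup: the new V is built entirely from the old V.
        V = {i: max(q_value(i, a) for a in actions) for i in states}
    policy = {i: max(actions, key=lambda a: q_value(i, a)) for i in states}
    return V, policy

# Toy model: staying in state 1 pays 1 per step; everything else pays 0.
states, actions = [0, 1], ["stay", "go"]
R = {(1, "stay"): 1.0}
T = {(0, "stay", 0): 1.0, (0, "go", 1): 1.0, (1, "stay", 1): 1.0, (1, "go", 0): 1.0}
V, policy = solve_mdp(states, actions, R, T)
print(policy)
```

In the full loop, `R` and `T` would be re-estimated from the growing experience set before each call.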

Credit Assignment • How does model-based RL deal with the credit assignment problem? • By learning the MDP, the agent knows which states lead to which other states • Solving the MDP ensures that the agent plans ahead and takes the long run effects of actions into account • So the problem is solved optimally with respect to the learned model

Action Selection • The line in the algorithm "Choose action a based on $\pi_{k-1}$" is not specific • How do we choose the action? • Obvious answer: the policy tells us the action to perform • But is that always what we want to do?

Exploration versus Exploitation • Exploit: use your learning results to play the action that maximizes your expected utility, relative to the model you have learned • Explore: play an action that will help you learn the model better

Questions • When to explore • How to explore • simple answer: play an action you haven’t played much yet in the current state • more sophisticated: play an action that will probably lead you to part of the space you haven’t explored much • How to exploit • we know the answer to this: follow the learned policy

Conditions for Optimality To ensure that the optimal policy will eventually be reached, we need to ensure that • Every action is taken in every state infinitely often in the long run • The probability of exploitation tends to 1

Possible Exploration Strategies: 1 • Explore until time T, then exploit • Why is this bad? • We may not explore long enough to get an accurate model • As a result, the optimal policy will not be reached • But it works well if we’re planning to learn the MDP once, then solve it, then play according to the learned policy • Works well for learning from simulation and performing in the real world

Possible Exploration Strategies: 2 • Explore with a fixed probability of p • Why is this bad? • Does not fully exploit when learning has converged to optimal policy • When could this approach be useful? • If world is changing gradually
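Exploring with a fixed probability p is the scheme usually called ε-greedy (with ε = p). A minimal sketch, assuming action values for the current state are held in a dictionary; the name `choose_action` is hypothetical.

```python
import random

def choose_action(q_values, p, rng=random):
    """With probability p explore uniformly; otherwise play the best-looking action."""
    actions = list(q_values)
    if rng.random() < p:
        return rng.choice(actions)         # explore
    return max(actions, key=q_values.get)  # exploit

rng = random.Random(0)
picks = [choose_action({"a": 1.0, "b": 2.0}, 0.2, rng) for _ in range(10000)]
print(picks.count("a") / len(picks))  # roughly p / 2, since exploring picks "a" half the time
```

Because p never shrinks, the agent keeps taking non-greedy actions at this rate forever, which is the drawback the slide points out.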

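Boltzmann Exploration • In state i, choose action a with probability $P(a \mid i) = \frac{e^{Q(i,a)/T}}{\sum_b e^{Q(i,b)/T}}$, where Q(i,a) is the current estimate of the value of playing a in state i • T is a temperature • High temperature: more exploration • T should be cooled down to reduce the amount of exploration over time • Sensitive to the cooling schedule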
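A small sketch of Boltzmann action selection, assuming the usual softmax form in which the probability of action a is proportional to exp(Q(i,a)/T). The function names are hypothetical, and subtracting the maximum is only a numerical-stability trick that leaves the probabilities unchanged.

```python
import math
import random

def boltzmann_probabilities(q_values, temperature):
    """Softmax over action values; higher temperature gives a flatter distribution."""
    m = max(q_values.values())  # subtract the max to avoid overflow
    weights = {a: math.exp((q - m) / temperature) for a, q in q_values.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

def boltzmann_action(q_values, temperature, rng=random):
    """Sample one action according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, temperature)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]

q = {"a": 1.0, "b": 0.0}
print(boltzmann_probabilities(q, 100.0))  # hot: close to uniform
print(boltzmann_probabilities(q, 0.01))   # cold: almost always "a"
```

A cooling schedule would lower `temperature` between calls; how fast to lower it is the sensitive design choice the slide mentions.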

Guarantee • If: • every action is taken in every state infinitely often • probability of exploration tends to zero • Then: • Model-based reinforcement learning will converge to the optimal policy with probability 1

Pros and Cons • Pro: • makes maximal use of experience • solves model optimally given experience • Con: • assumes model is small enough to solve • requires expensive solution procedure

R-Max • Assume R(s,a) = R-max (the maximal possible reward) • Called optimism bias • Assume any transition probability • Solve and act optimally • When $N^a_i > c$, update R(i,a) • After each update, re-solve • If you choose c properly, converges to the optimal policy
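The optimistic reward model can be sketched as follows. This covers only the reward estimates, not the transition model or the re-solving step, and the function name, argument names, and toy numbers are hypothetical.

```python
def rmax_reward_estimates(total_reward, visits, pairs, r_max, c):
    """Keep R(i,a) at r_max until (i,a) has been played more than c times."""
    estimates = {}
    for pair in pairs:
        n = visits.get(pair, 0)
        if n > c:
            estimates[pair] = total_reward[pair] / n  # enough data: use the average
        else:
            estimates[pair] = r_max                   # too little data: stay optimistic
    return estimates

pairs = [(0, "a"), (0, "b")]
visits = {(0, "a"): 5, (0, "b"): 2}
total_reward = {(0, "a"): 2.5, (0, "b"): 0.0}
print(rmax_reward_estimates(total_reward, visits, pairs, r_max=1.0, c=3))
```

Under-sampled pairs look maximally rewarding, so acting greedily on the solved model drives the agent to try them.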

Model-Free RL

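Monte Carlo Sampling • If we want to estimate $y = E_{x \sim D}[f(x)]$ we can • Generate random samples $x_1, \ldots, x_N$ from D • Estimate $\hat{y} = \frac{1}{N} \sum_{n=1}^{N} f(x_n)$ • Guaranteed to converge to the correct estimate with sufficient samples • Requires keeping count of the number of samples • Alternative, update a running average: • Generate random samples $x_1, \ldots, x_N$ from D • Estimate $\hat{y} \leftarrow (1-\alpha)\hat{y} + \alpha f(x_n)$ after each sample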
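The two ways of averaging can be compared directly. The sketch below assumes the running average uses a constant step size α, in the same form as the value updates later in the deck. The example distribution (uniform on [0, 1]) and f(x) = x² are arbitrary illustrations.

```python
import random

def batch_estimate(values):
    """Plain sample mean; needs the number of samples N."""
    return sum(values) / len(values)

def running_estimate(values, alpha, initial=0.0):
    """Constant-step running average; needs no sample count."""
    estimate = initial
    for fx in values:
        estimate = (1 - alpha) * estimate + alpha * fx
    return estimate

rng = random.Random(0)
values = [rng.random() ** 2 for _ in range(10000)]  # f(x) = x^2, x ~ Uniform(0, 1)
print(batch_estimate(values))          # close to the true mean 1/3
print(running_estimate(values, 0.01))  # noisier, weighted toward recent samples
```

The constant-step version never fully settles, but it forgets old samples, which matters when the quantity being estimated is itself changing.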

Estimating the Value of a Policy • Fix a policy π • When starting in state i, taking action a according to π, getting reward r and transitioning to j, we get a sample of the value of state i: $r + \gamma V(j)$ • So we can update $V(i) \leftarrow (1-\alpha)V(i) + \alpha(r + \gamma V(j))$ • But where does V(j) come from? • Guess (this is called bootstrapping)

Temporal Difference Algorithm
  For each state i: V(i) ← 0
  Begin in state i
  Repeat:
    Apply action a based on current policy
    Receive reward r and transition to j
    $V(i) \leftarrow (1-\alpha)V(i) + \alpha(r + \gamma V(j))$
    i ← j
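A minimal sketch of temporal-difference policy evaluation, assuming the update V(i) ← (1-α)V(i) + α(r + γV(j)) and assuming the transitions produced by the fixed policy are already recorded as `(i, r, j)` triples. The three-state chain is a made-up example.

```python
def td0(episodes, n_states, alpha=0.1, gamma=1.0):
    """TD(0) evaluation of a fixed policy from recorded (i, r, j) transitions."""
    V = [0.0] * n_states
    for episode in episodes:
        for i, r, j in episode:
            # Move V(i) toward the bootstrapped sample r + gamma * V(j).
            V[i] = (1 - alpha) * V[i] + alpha * (r + gamma * V[j])
    return V

# Chain 0 -> 1 -> 2, reward 1 on entering the terminal state 2.
episode = [(0, 0.0, 1), (1, 1.0, 2)]
print(td0([episode], n_states=3))        # only V(1) has moved after one episode
print(td0([episode] * 500, n_states=3))  # V(0) and V(1) both approach 1
```

After a single episode the reward has reached only the state next to the goal; it takes further episodes to propagate one more step back.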

Credit Assignment • By linking values to those of the next state, rewards and punishments are eventually propagated backwards • We wait until end of game and then propagate backwards in reverse order

But How Do We Learn to Act? • To improve our policy, we need to have an idea of how good it is to use a different policy • TD learns the value function • Similar to the value determination step of policy iteration, but without a model • To improve, we need an estimate of the Q function: $Q(i,a) = R(i,a) + \gamma \sum_j T^a_{ij} V(j)$

TD for Control: SARSA
  Initialize Q(s,a) arbitrarily
  Repeat (for each episode):
    Initialize s
    Choose a from s using policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of episode):
      Take action a, observe r, s'
      Choose a' from s' using policy derived from Q (e.g., ε-greedy)
      $Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha(r + \gamma Q(s',a'))$
      s ← s', a ← a'
    until s is terminal
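A runnable sketch of SARSA on a made-up four-state corridor, assuming the update Q(s,a) ← (1-α)Q(s,a) + α(r + γQ(s',a')). The environment, the parameter values, and the random tie-breaking are illustrative assumptions.

```python
import random

GOAL = 3                     # corridor 0-1-2-3, state 3 is terminal
ACTIONS = ("left", "right")

def step(state, action):
    """Deterministic move; reward 1 on reaching the goal, else 0."""
    next_state = max(0, state + (1 if action == "right" else -1))
    return (1.0 if next_state == GOAL else 0.0), next_state

def epsilon_greedy(Q, s, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(s, a)] == best])  # random tie-break

def sarsa(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        while s != GOAL:
            r, s_next = step(s, a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # On-policy target: uses the action a_next that will actually be played.
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * Q[(s_next, a_next)])
            s, a = s_next, a_next
    return Q

Q = sarsa()
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)})
```

Because the target uses the next action actually chosen, the learned values reflect the ε-greedy behaviour, including its occasional exploratory steps.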

Off-Policy vs. On-Policy • On-policy learning: learn only the value of actions used in the current policy. SARSA is an example of an on-policy method. Learning the value of the policy is combined with gradual change of the policy • Off-policy learning: can learn the value of a policy/action different from the one used, separating learning from control. Q-learning is an example. It learns about the optimal policy while following a different policy (e.g., an ε-greedy policy).

Recursive Formulation of Q Function • $Q(i,a) = R(i,a) + \gamma \sum_j T^a_{ij} \max_b Q(j,b)$

Learning the Q Values • We don’t know $T^a_{ij}$ and we don’t want to learn it • If only we knew that our future Q values were accurate… • …every time we played a in state i and transitioned to j, receiving reward r, we would get a sample of $R(i,a) + \gamma \max_b Q(j,b)$ • So we pretend that they are accurate • (after all, they get more and more accurate)

Q Learning Update Rule • On transitioning from i to j, taking action a, receiving reward r, update $Q(i,a) \leftarrow (1-\alpha)Q(i,a) + \alpha(r + \gamma \max_b Q(j,b))$ • α is the learning rate • Large α: • learning is quicker • but may not converge • α is often decreased over the course of learning

Q Learning Algorithm
  For each state i and action a: Q(i,a) ← 0
  Begin in state i
  Repeat:
    Choose action a based on the Q values for state i for all actions
    Receive reward r and transition to j
    $Q(i,a) \leftarrow (1-\alpha)Q(i,a) + \alpha(r + \gamma \max_b Q(j,b))$
    i ← j
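A runnable sketch of the Q-learning loop, assuming the update Q(i,a) ← (1-α)Q(i,a) + α(r + γ max_b Q(j,b)) and ε-greedy action choice. The four-state corridor and the parameter values are made up for illustration.

```python
import random

GOAL = 3                     # corridor 0-1-2-3, state 3 is terminal
ACTIONS = ("left", "right")

def step(state, action):
    """Deterministic move; reward 1 on reaching the goal, else 0."""
    next_state = max(0, state + (1 if action == "right" else -1))
    return (1.0 if next_state == GOAL else 0.0), next_state

def q_learning(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        i = 0
        while i != GOAL:
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)                  # explore
            else:
                best = max(Q[(i, b)] for b in ACTIONS)   # exploit, random tie-break
                a = rng.choice([b for b in ACTIONS if Q[(i, b)] == best])
            r, j = step(i, a)
            # Off-policy target: best action in j, whatever is played next.
            target = r + gamma * max(Q[(j, b)] for b in ACTIONS)
            Q[(i, a)] = (1 - alpha) * Q[(i, a)] + alpha * target
            i = j
    return Q

Q = q_learning()
print([round(Q[(s, "right")], 3) for s in range(GOAL)])
```

In this deterministic corridor the optimal values for moving right are γ², γ, and 1 from states 0, 1, and 2, so the printed numbers can be checked by hand.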

Choosing Which Action to Take • Once you have learned the Q function, you can use it to determine the policy • in state i, choose action a that has highest estimated Q(i,a) • But we need to combine exploitation with exploration • same methods as before

Guarantee • If: • every action is taken in every state infinitely often • α is sufficiently small • Then Q learning will converge to the optimal Q values with probability 1 • If also: • probability of exploration tends to zero • Then Q learning will converge to the optimal policy with probability 1

Credit Assignment • By linking Q values to those of the next state, rewards and punishments are eventually propagated backwards • But may take a long time • Idea: wait until end of game and then propagate backwards in reverse order

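Q-learning ( = 1) • [Figure: state diagram. From S1, action a leads into the upper chain S2 → S3 → S4 → S5 and action b leads into the lower chain S6 → S7 → S8 → S9; edges inside each chain are labelled a,b. Edge rewards shown are 0, 1, and -1.] • After playing aaaa: Q(S4,a) = 1, Q(S4,b) = 0, Q(S3,a) = 1, Q(S3,b) = 0, Q(S2,a) = 1, Q(S2,b) = 0, Q(S1,a) = 1, Q(S1,b) = 0 • After playing bbbb: Q(S8,a) = 0, Q(S8,b) = -1, Q(S7,a) = 0, Q(S7,b) = 0, Q(S6,a) = 0, Q(S6,b) = 0, Q(S1,a) = 1, Q(S1,b) = 0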
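The Q values listed on this slide can be reproduced with a short script, under several assumptions that the extracted slide does not state explicitly: α = 1 and γ = 1, updates applied in reverse order at the end of each game, reward +1 for playing a in S4, reward -1 for playing b in S8, and reward 0 everywhere else. The transition table is likewise inferred from the state names and the two action sequences.

```python
ALPHA = 1.0   # assumed learning rate
GAMMA = 1.0   # assumed discount factor
ACTIONS = ("a", "b")

# Assumed dynamics: S1 branches on the action; inside a chain both actions advance.
NEXT = {("S1", "a"): "S2", ("S1", "b"): "S6"}
for chain in (["S2", "S3", "S4", "S5"], ["S6", "S7", "S8", "S9"]):
    for s, s_next in zip(chain, chain[1:]):
        for act in ACTIONS:
            NEXT[(s, act)] = s_next
REWARD = {("S4", "a"): 1.0, ("S8", "b"): -1.0}  # all other rewards are 0

def play(Q, start, actions):
    """Play one game, then apply Q-learning updates in reverse order."""
    trajectory, s = [], start
    for act in actions:
        s_next = NEXT[(s, act)]
        trajectory.append((s, act, REWARD.get((s, act), 0.0), s_next))
        s = s_next
    for i, act, r, j in reversed(trajectory):
        best_next = max(Q.get((j, b), 0.0) for b in ACTIONS)
        Q[(i, act)] = (1 - ALPHA) * Q.get((i, act), 0.0) + ALPHA * (r + GAMMA * best_next)
    return Q

Q = {}
play(Q, "S1", "aaaa")
play(Q, "S1", "bbbb")
for s in ("S1", "S4", "S7", "S8"):
    print(s, [Q.get((s, act), 0.0) for act in ACTIONS])
```

The +1 travels all the way back to S1 in one game, while the -1 stops at S8: the max over next actions still sees Q(S8,a) = 0, so Q(S7,b) stays at 0.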

Bottom Line • Q learning makes an optimistic assumption about the future • Rewards will be propagated back in linear time, but punishments may take exponential time to be propagated • But eventually, Q learning will converge to the optimal policy
