Meeting 9 - RL Wrap Up and B&W Tournament • Course logistics notes • Reinforcement Learning Wrap Up (30 minutes) • Temporal Difference Learning and Variants • Examples • Continuous state-action spaces • Tournament time – move to 5110
Course Logistics • Course website: www.cim.mcgill.ca/~yon/ai • WebCT Vista: • Course, project, and assignment related discussion • Schedule office hours and assignment demo with the TA (or write dcasto4@cim.mcgill.ca) • Trottier 5110 Lab (assignment 1): • Graduate / exchange students: if you don't already have access, you will need to submit your ID / DAS numbers to admin@cim.mcgill.ca. Mention this class.
Reinforcement Learning Problem • At each discrete time step t, the agent observes state s_t ∈ S and chooses action a_t ∈ A • Then it receives immediate reward r_{t+1} and the state changes to s_{t+1}
Models for MDPs • r_s^a is the expected immediate reward when the agent is in state s and performs action a: r_s^a = E[r_{t+1} | s_t = s, a_t = a] • p^a_{ss'} is the probability of going from s to s' when doing action a: p^a_{ss'} = Pr(s_{t+1} = s' | s_t = s, a_t = a) • Together these form the model of the environment, and they may be unknown
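As a concrete sketch (not from the slides), a small tabular model can be written down in Python with nested dictionaries; the names R and P and the example states and actions below are purely illustrative.

```python
# Hypothetical tabular MDP model (illustrative names only).
# R[s][a]     : expected immediate reward r_s^a
# P[s][a][s2] : transition probability p_{ss'}^a
R = {
    "s0": {"left": 0.0, "right": 1.0},
    "s1": {"left": 0.0, "right": 0.0},
}
P = {
    "s0": {"left": {"s0": 1.0}, "right": {"s1": 1.0}},
    "s1": {"left": {"s0": 1.0}, "right": {"s1": 1.0}},
}
```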
Policy • Execute actions in the environment, observe the results, and learn a policy π: S × A → [0, 1] • Deterministic policy: π: S → A with π(s) = a
Returns • Sequence of rewards received after time t: r_{t+1}, r_{t+2}, r_{t+3}, ... • The return R_t is the total reward received in the long run • Finite-length tasks: R_t = r_{t+1} + r_{t+2} + ... + r_T • Continuing tasks (0 ≤ γ < 1 is the discount factor): R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^∞ γ^k r_{t+k+1}
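A minimal sketch of computing the discounted return from a finite list of rewards; the example rewards and the γ value are illustrative assumptions.

```python
def discounted_return(rewards, gamma=0.9):
    """Return R_t = r_{t+1} + gamma*r_{t+2} + ... for a finite reward list."""
    R = 0.0
    for r in reversed(rewards):   # accumulate from the last reward backwards
        R = r + gamma * R
    return R

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81
```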
Value and Action-Value Functions • Value V of a state: the expected return starting from that state, when following the policy: V^π(s) = E_π[R_t | s_t = s] • Action-value Q of a state-action pair: the expected return when starting in that state, taking that action, and following π afterwards: Q^π(s,a) = E_π[R_t | s_t = s, a_t = a]
Learning • The main problems of RL can be stated as: • What is the value function V^π or Q^π of a given policy π? • How should we change π to improve the value function (bring it closer to the optimal V*)?
Action Value from Value Function • A simple relation computes the action-value function Q from the value function V, provided we have a model (r, p) of the environment: Q^π(s,a) = r_s^a + γ Σ_{s'} p^a_{ss'} V^π(s')
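A one-step backup computing Q^π from V^π, assuming the hypothetical R and P dictionaries sketched earlier and a value table V keyed by state; this is a sketch, not a definitive implementation.

```python
def q_from_v(V, R, P, s, a, gamma=0.9):
    """Q(s,a) = r_s^a + gamma * sum_{s'} p_{ss'}^a * V(s')."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
```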
Convergence of Bellman Equations • The error is controlled by γ: each Bellman backup is a contraction, so the maximum error shrinks by at least a factor of γ per iteration, and iterative evaluation converges geometrically
Monte-Carlo Value Estimation • The agent behaves according to π for a while, generating several trajectories • V^π(s) is computed by averaging the observed returns after s on trajectories in which s was visited • The average may be computed as each new trajectory is seen, using a "leaky accumulator": V(s_t) ← V(s_t) + α (R_t − V(s_t)) • This can be regarded as modifying V with each new trajectory by an amount α in the direction of the observed return R_t • Typically choose α_t = 1 / n(s_t), where n(s_t) counts visits to s_t, to guarantee convergence.
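A sketch of the incremental Monte-Carlo update, assuming each trajectory is a list of (state, reward) pairs where the reward is the one received after leaving that state; every-visit bookkeeping is used for brevity, and all names are illustrative.

```python
from collections import defaultdict

def mc_update(V, n, trajectory, gamma=0.9):
    """Every-visit Monte-Carlo update: V(s) <- V(s) + alpha * (R_t - V(s)),
    with alpha_t = 1 / n(s) so the table holds a running average of returns."""
    G = 0.0
    # Walk the trajectory backwards, accumulating the return G after each state.
    for s, r in reversed(trajectory):
        G = r + gamma * G
        n[s] += 1
        alpha = 1.0 / n[s]
        V[s] += alpha * (G - V[s])

V, n = defaultdict(float), defaultdict(int)
mc_update(V, n, [("s0", 0.0), ("s1", 1.0)])
```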
Optimal Value Functions • Value functions give a partial order on policies: π ≥ π' iff V^π(s) ≥ V^π'(s) ∀s • The optimal state-value function is the value function shared by all optimal policies: V*(s) = max_π V^π(s) • Likewise, the optimal action-value function: Q*(s,a) = max_π Q^π(s,a) • Q*(s,a) is the expected return for taking action a in state s and following an optimal policy afterwards
Bellman Equation for the Optimal Value Function V* • The value of a state under the optimal policy must equal the expected return for the best action in that state: V*(s) = max_a [ r_s^a + γ Σ_{s'} p^a_{ss'} V*(s') ] • V* is the unique solution to this system of non-linear equations
Bellman Equation for the Optimal Action-Value Function Q* • Q*(s,a) = r_s^a + γ Σ_{s'} p^a_{ss'} max_{a'} Q*(s',a') • Q* is the unique solution to this system of non-linear equations
Why Care About Optimal Value Functions? • Any policy that is greedy with respect to V* is an optimal policy: π*(s) = argmax_a [ r_s^a + γ Σ_{s'} p^a_{ss'} V*(s') ] • If we know V* and the model of the environment, a one-step-ahead search tells us the optimal action (and so determines the policy), just as in heuristic search • If we know Q*, even the one-step search is not needed; just set: π*(s) = argmax_a Q*(s,a)
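A minimal sketch of greedy action selection from a tabular Q, assuming a hypothetical dictionary keyed by (state, action) pairs:

```python
def greedy_action(Q, s, actions):
    """pi*(s) = argmax_a Q*(s, a): pick the action with the largest action value."""
    return max(actions, key=lambda a: Q[(s, a)])
```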
Policy Improvement • Given V^π for some deterministic policy π, how do we find a better policy? • Best clue: sometimes it is better to do an action a ≠ π(s), namely when Q^π(s,a) > V^π(s) • If we make such a change at all states, we get a policy that is greedy with respect to Q^π: π'(s) = argmax_a Q^π(s,a) • Then V^π'(s) ≥ V^π(s) ∀s ⇒ the new policy is at least as good!
Policy Improvement • If at some point V^π' = V^π, then we have V^π(s) = max_a [ r_s^a + γ Σ_{s'} p^a_{ss'} V^π(s') ] • This is the Bellman optimality equation! • Thus, if the value does not change at some point in the process of greedy improvement, both π and π' are optimal.
Policy Iteration (for policy improvement) • Start with an initial policy π_0 • Repeat: • Compute V^{π_k} using policy evaluation • Compute a new policy π_{k+1} that is greedy with respect to V^{π_k}: π_{k+1}(s) = argmax_a [ r_s^a + γ Σ_{s'} p^a_{ss'} V^{π_k}(s') ] • Until V^{π_k} = V^{π_{k+1}}
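A compact sketch of tabular policy iteration under the hypothetical R/P model from earlier; policy evaluation here is done by a fixed number of iterative sweeps rather than by solving the linear system exactly, and all names are illustrative.

```python
def policy_iteration(states, actions, R, P, gamma=0.9, eval_sweeps=100):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    pi = {s: actions[0] for s in states}          # arbitrary initial policy
    while True:
        # Policy evaluation: repeatedly apply the Bellman backup for the fixed policy pi.
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {s: R[s][pi[s]] +
                    gamma * sum(p * V[s2] for s2, p in P[s][pi[s]].items())
                 for s in states}
        # Policy improvement: act greedily with respect to V^pi.
        new_pi = {s: max(actions,
                         key=lambda a: R[s][a] +
                                       gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
                  for s in states}
        if new_pi == pi:                          # no change anywhere -> pi is optimal
            return pi, V
        pi = new_pi
```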
Generalized Policy Iteration (for policy improvement) • Any combination of policy evaluation and policy improvement steps, even if neither is run to completion (e.g., a single evaluation sweep followed by a greedy improvement step)
Value Iteration • Main idea: turn the Bellman optimality equation into an update rule (the same as was done in policy evaluation): • Start with an arbitrary initial approximation V_0 • V_{k+1}(s) ← max_a [ r_s^a + γ Σ_{s'} p^a_{ss'} V_k(s') ], ∀s
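A corresponding sketch of value iteration with the same hypothetical tabular model; the stopping tolerance is an illustrative choice.

```python
def value_iteration(states, actions, R, P, gamma=0.9, tol=1e-6):
    """Turn the Bellman optimality equation into an update rule and iterate to a fixed point."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(R[s][a] +
                        gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                        for a in actions)
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```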
Monte-Carlo vs. Dynamic Programming • Monte-Carlo backup: V(s_t) ← V(s_t) + α (R_t − V(s_t)), sampling the full return R_t from experience • Dynamic Programming backup: V(s_t) ← E_π[ r_{t+1} + γ V(s_{t+1}) ], using the model to take the expectation over successors
Dynamic Programming vs. Monte Carlo Value Estimation • Can we combine the advantages of both methods?
Value Estimation and Greedy Policy Improvement: Exercise • The exercise will be posted in elaborated form on Friday 27 January on the course website. • The exercise will be due on Tuesday 7 February at 16h00. • This will count as one quiz toward your grade.
Combining Monte-Carlo and Dynamic Programming • Idea: use the sampling idea of Monte-Carlo, but instead of nudging V in the direction of the full observed return R_t, use an estimate of the return: r_{t+1} + γ V(s_{t+1}) • The value-function update formula becomes: V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ] • Note: if V(s_{t+1}) were a perfect estimate, this would be a DP update • This value-estimation method is called Temporal-Difference (TD) learning
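A sketch of this TD(0) update applied after each observed transition (s, r, s_next); the α and γ values are illustrative, and V can be any mapping from states to floats (e.g. a defaultdict).

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)]."""
    td_target = r + gamma * V[s_next]     # sampled, bootstrapped estimate of the return
    V[s] += alpha * (td_target - V[s])    # nudge V(s) toward the target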
Temporal-Difference (TD) Learning • Like DP, it bootstraps (computes the value of a state based on estimates of its successors) • Like MC, it estimates expected values by sampling
TD Learning: Advantages • No model of the environment (rewards, transition probabilities) is needed; TD only needs experience with the environment • On-line, incremental learning • Both TD and MC converge; in practice, TD usually converges faster • Learns before knowing the final outcome • Less memory and peak computation required
TD for Learning Action Values • TD for value-function learning: V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ] • The same construction gives a TD version of action-value learning • Bellman equation for Q: Q^π(s,a) = r_s^a + γ Σ_{s'} p^a_{ss'} Σ_{a'} π(s',a') Q^π(s',a') • Dynamic-programming update for Q: Q(s,a) ← r_s^a + γ Σ_{s'} p^a_{ss'} Σ_{a'} π(s',a') Q(s',a') • TD update for Q(s,a): Q(s_t,a_t) ← Q(s_t,a_t) + α [ r_{t+1} + γ Q(s_{t+1},a_{t+1}) − Q(s_t,a_t) ]
On-Policy SARSA Learning • Policy improvement: at each time step, choose an action that is mostly greedy (e.g. ε-greedy) with respect to Q • After receiving the reward r_{t+1}, seeing s_{t+1}, and choosing a_{t+1}, update the action values: Q(s_t,a_t) ← Q(s_t,a_t) + α [ r_{t+1} + γ Q(s_{t+1},a_{t+1}) − Q(s_t,a_t) ]
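A minimal SARSA episode loop, assuming a hypothetical environment object env whose reset() returns a state and whose step(a) returns (next_state, reward, done); the interface, the ε-greedy parameters, and the use of a defaultdict for Q are illustrative assumptions, not a specific library.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Mostly-greedy action selection: explore with probability eps."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.9, eps=0.1):
    """One episode of on-policy SARSA; Q is e.g. a defaultdict(float) keyed by (s, a)."""
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, eps)
    done = False
    while not done:
        s_next, r, done = env.step(a)                     # hypothetical env interface
        a_next = epsilon_greedy(Q, s_next, actions, eps)
        # On-policy target: bootstrap from the action actually chosen at s_next.
        target = r if done else r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s_next, a_next
```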
Continuous (or Large) State Spaces • So far we have assumed values are held in a table • This is impractical for continuous or large (multi-dimensional) state spaces • Examples: chess (~10^43 states), robot manipulator (continuous) • Basic approaches: • Quantize continuous state spaces: discretize the continuous state variables, perhaps coarsely • Example: the pole angle in the cart-pole problem quantized to (-90, -30, 0, 30, 90)° • Tilings: impose several overlapping coarse grids • More general approach: function approximation • Simple case: represent V(s) with a set of weights w_i and basis functions f_i(s): V(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s) • Other methods of function approximation (e.g. neural networks) in coming weeks • Added benefit: the approximation can predict the value of states not yet visited
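A sketch of linear value-function approximation combined with a semi-gradient TD(0) update: V(s) is a weighted sum of basis-function outputs, and the weights move in the direction of the TD error. The particular basis functions and step sizes below are illustrative assumptions.

```python
import numpy as np

def features(s):
    """Hypothetical basis functions f_i(s) for a 1-D continuous state."""
    return np.array([1.0, s, s * s])          # constant, linear, quadratic terms

def v_hat(w, s):
    """V(s) ~ w . f(s) -- generalizes to states never visited."""
    return float(np.dot(w, features(s)))

def td0_linear_update(w, s, r, s_next, alpha=0.01, gamma=0.9):
    """Semi-gradient TD(0): w <- w + alpha * [r + gamma*V(s') - V(s)] * f(s)."""
    delta = r + gamma * v_hat(w, s_next) - v_hat(w, s)
    return w + alpha * delta * features(s)
```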
Robot that learns to stand • There is no static solution to the stand-up problem: the robot must use momentum to stand up • RL used: • Two-layer hierarchy • TD learning for action values Q(s,a) applied to plan a sequence of sub-goals that may lead to success • A continuous version of TD learning applied to learn to achieve each sub-goal • The robot was trained for several hundred iterations in simulation plus about 100 trials on the physical robot • Details: • J. Morimoto and K. Doya, "Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning", Robotics and Autonomous Systems 36 (2001)
Robot that learns to stand • After several hundred attempts in simulation
Robot that learns to stand • After ~100 additional trials on the physical robot
Wrap up • Next time: Wrap up of RL • Temporal Difference Learning • TD-Learning for Action-Values • Examples
Wrap up • Required readings: • Russell and Norvig, Chapters 17 and 21: Reinforcement Learning • Optional but highly recommended (via the course outline): • Sutton and Barto, Reinforcement Learning: An Introduction, Part 1 (MIT Press, 1998) • (Link to HTML version from course readings) • Kaelbling et al., "Reinforcement Learning: A Survey" (1996) • Tesauro, "Temporal Difference Learning and TD-Gammon" (1995) • Acknowledgements: • Rich Sutton and Doina Precup, for RL materials used here • Jeremy Cooperstock, AI course materials (2003)