
Meeting 9 - RL Wrap Up and B&W Tournament


Presentation Transcript


  1. Meeting 9 - RL Wrap Up and B&W Tournament • Course logistic notes • Reinforcement Learning Wrap Up (30 minutes) • Temporal Difference Learning and Variants • Examples • Continuous state-action spaces • Tournament time – move to 5110

  2. Course Logistics • Course website: www.cim.mcgill.ca/~yon/ai • WebCT Vista: • Course, project, assignment related discussion • Schedule office hours and assignment demo with TA (or write dcasto4@cim.mcgill.ca ) • Trottier 5110 Lab (assignment 1): • Graduate / exchange students: If you don’t already have access, you will need to submit your ID / DAS numbers to admin@cim.mcgill.ca . Mention this class.

  3. Reinforcement Learning

  4. Reinforcement Learning Problem • At each discrete time t, the agent observes state st ∈ S and chooses action at ∈ A • Then it receives immediate reward rt+1 and the state changes to st+1

  5. Models for MDPs • r_s^a is the expected value of the immediate reward if the agent is in s and performs action a: r_s^a = E[ rt+1 | st = s, at = a ] • p^a_ss' is the probability of going from s to s' when doing action a: p^a_ss' = P[ st+1 = s' | st = s, at = a ] • These form the model of the environment, and may be unknown

  6. Example: Grad School Options

  7. Policy • Execute actions in the environment, observe the results, and learn a policy π: S × A → [0, 1] • Deterministic policy: π: S → A with π(s) = a

  8. Returns • Sequence of rewards received after time t: rt+1, rt+2, rt+3, ... • The return Rt is the net reward received in the long run: • Finite-length tasks: Rt = rt+1 + rt+2 + ... + rT • Continuing tasks (0 ≤ γ ≤ 1 is the discount factor): Rt = rt+1 + γ rt+2 + γ^2 rt+3 + ... = ∑k≥0 γ^k rt+k+1
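
  As a concrete illustration of the discounted return (a sketch, not from the slides), assuming γ = 0.9:

    def discounted_return(rewards, gamma=0.9):
        """Sum of gamma^k * r_{t+k+1} over a finite reward sequence."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    # The third reward in the sequence is discounted by gamma^2:
    print(discounted_return([0.0, 0.0, 1.0]))   # ~0.81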

  9. Value and Action-Value Functions • Value V of a state: the expected return starting from that state, when following the policy: Vπ(s) = Eπ[ Rt | st = s ] • Action-value Q of a state-action pair: the expected return when starting in that state, taking that action, and following π afterwards: Qπ(s, a) = Eπ[ Rt | st = s, at = a ]

  10. Learning • The main problems of RL can be stated as: • What is the value function Vπ or Qπ of a given policy π? • How should we change π to improve the value function (bring it closer to the optimal V*)?

  11. Action Value from Value Function • Simple relation to compute the action-value function Q from the value function V, provided we have a model (r, p) of the environment: Qπ(s, a) = r_s^a + γ ∑s' p^a_ss' Vπ(s')
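
  A minimal Python sketch of this relation, assuming a tabular model stored as nested dicts r[s][a] and p[s][a][s'] (the dict layout is an illustration choice, not from the course):

    def q_from_v(V, r, p, gamma=0.9):
        """Q(s, a) = r[s][a] + gamma * sum over s' of p[s][a][s'] * V[s']."""
        return {
            (s, a): r[s][a] + gamma * sum(p[s][a][s2] * V[s2] for s2 in p[s][a])
            for s in r for a in r[s]
        }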

  12. Bellman Policy Equation for π • Vπ(s) = ∑a π(s, a) [ r_s^a + γ ∑s' p^a_ss' Vπ(s') ], ∀s

  13. Dynamic Programming Based Value Estimation
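
  A minimal sketch of DP-based (iterative) policy evaluation under a known model, reusing the dict-based model layout assumed above:

    def policy_evaluation(states, actions, policy, r, p, gamma=0.9, theta=1e-6):
        """Sweep the Bellman policy equation until V stops changing.
        policy[s][a] is the probability of taking action a in state s."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_new = sum(
                    policy[s][a] * (r[s][a] + gamma * sum(p[s][a][s2] * V[s2] for s2 in p[s][a]))
                    for a in actions
                )
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V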

  14. Convergence of Bellman Equations • Error controlled by gamma: each Bellman backup is a contraction with factor γ, so ‖Vk+1 − Vπ‖∞ ≤ γ ‖Vk − Vπ‖∞

  15. Monte-Carlo Value Estimation

  16. Monte-Carlo Value Estimation • Agent behaves according to π for a while, generating several trajectories • Vπ(s) computed by averaging observed returns after s on trajectories in which s was visited • Average may be computed as each new trajectory is seen, using a “leaky accumulator”: • This can be regarded as modifying V with each new trajectory by an amount α in the direction of the true value R • Typically choose αt = 1 / n(st) to guarantee convergence.
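
  A sketch of the leaky-accumulator update described above, with the αt = 1/n(st) step size; the (state, return) pair format is an assumption for illustration.

    from collections import defaultdict

    def mc_update(V, n, visits):
        """visits: (state, observed return after that state) pairs from one trajectory."""
        for s, R in visits:
            n[s] += 1
            alpha = 1.0 / n[s]           # running-average step size
            V[s] += alpha * (R - V[s])   # nudge V(s) toward the observed return

    V, n = defaultdict(float), defaultdict(int)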

  17. Optimal Value Functions

  18. Optimal Value Functions • Value functions give a partial order on policies: π ≥ π' iff Vπ(s) ≥ Vπ'(s) ∀s • The optimal state-value function is the value function shared by all optimal policies: V*(s) = maxπ Vπ(s) • Likewise, the optimal action-value function: Q*(s, a) = maxπ Qπ(s, a) • Q*(s, a) is the expected return for taking action a in state s and following an optimal policy afterwards

  19. Bellman Equation for Optimal Value Function V* • The value of a state under the optimal policy must equal the expected return for the best action in that state: V*(s) = maxa [ r_s^a + γ ∑s' p^a_ss' V*(s') ] • V* is the unique solution to this system of non-linear equations

  20. Bellman Equation for Optimal Action-Value Function Q* • Q*(s, a) = r_s^a + γ ∑s' p^a_ss' maxa' Q*(s', a') • Q* is the unique solution to this system of non-linear equations

  21. Why Care About Optimal Value Functions? • Any policy that is greedy with respect to V* is an optimal policy: π*(s) = argmaxa [ r_s^a + γ ∑s' p^a_ss' V*(s') ] • If we know V* and the model of the environment, a one-step lookahead search tells us what the optimal action is (and so determines the policy), just as we discussed for heuristic search. • If we know Q*, even the one-step search is unnecessary; just set π*(s) = argmaxa Q*(s, a)
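
  A one-line sketch of the last point: with Q* in hand, the optimal action is an argmax over actions, with no lookahead and no model.

    def greedy_action(Q, s, actions):
        return max(actions, key=lambda a: Q[(s, a)])   # pi*(s) = argmax_a Q*(s, a)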

  22. Example: Agent in Simple Grid

  23. Policy Improvement

  24. Policy Improvement • Given Vπ for some deterministic policy π, how do we find a better policy? • Best clue: sometimes it is better to do an action a ≠ π(s), namely when Qπ(s, a) > Vπ(s) • If we make such a change at every state, we get a policy that is greedy with respect to Qπ: π'(s) = argmaxa Qπ(s, a) • Then Vπ'(s) ≥ Vπ(s) ∀s ⇒ the new policy is better!

  25. Policy Improvement • If at some point Vπ' = Vπ, then we have Vπ(s) = maxa [ r_s^a + γ ∑s' p^a_ss' Vπ(s') ], ∀s • This is the Bellman optimality equation! • Thus, if the value does not change at some point in the process of greedy improvement, both π and π' are optimal.

  26. Policy Iteration (for policy improvement) • Start with an initial policy π0 • Repeat: • Compute Vπk using policy evaluation • Compute a new policy πk+1 that is greedy with respect to Vπk: πk+1(s) = argmaxa [ r_s^a + γ ∑s' p^a_ss' Vπk(s') ] • Until Vπk = Vπk+1
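
  A hedged sketch of this loop, reusing the policy_evaluation and q_from_v sketches from earlier slides; storing deterministic policies as dicts s → a is an illustration choice.

    def policy_iteration(states, actions, r, p, gamma=0.9):
        pi = {s: actions[0] for s in states}    # arbitrary initial policy
        while True:
            # evaluate the current deterministic policy
            policy = {s: {a: 1.0 if a == pi[s] else 0.0 for a in actions} for s in states}
            V = policy_evaluation(states, actions, policy, r, p, gamma)
            # greedy improvement with respect to V (via Q)
            Q = q_from_v(V, r, p, gamma)
            new_pi = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
            if new_pi == pi:                    # greedy policy unchanged: optimal
                return pi, V
            pi = new_pi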

  27. Generalized Policy Iteration (for policy improvement) • Any combination of policy evaluation and policy improvement steps, even if they are not complete:

  28. Value Iteration • Main idea: turn the Bellman optimality equation into an update rule (same as done in policy evaluation): • Start with an arbitrary initial approximation V0 • Vk+1(s) ← maxa [ r_s^a + γ ∑s' p^a_ss' Vk(s') ], ∀s
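
  The same update rule as code, again assuming the dict-based tabular model used in the earlier sketches:

    def value_iteration(states, actions, r, p, gamma=0.9, theta=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_new = max(
                    r[s][a] + gamma * sum(p[s][a][s2] * V[s2] for s2 in p[s][a])
                    for a in actions
                )
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V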

  29. Monte-Carlo vs. Dynamic Programming • Monte-Carlo: • Dynamic Programming:

  30. Dynamic Programming vs. Monte Carlo Value Estimation • Can we combine the advantages of both methods?

  31. Value Estimation and Greedy Policy Improvement: Exercise • The exercise will be posted in elaborated form on Friday, 27 January, on the course website. • The exercise will be due on Tuesday, 7 February, at 16h00. • This will count as one quiz toward your grade.

  32. Monte-Carlo vs. Dynamic Programming • Monte-Carlo: • Dynamic Programming:

  33. Combining Monte-Carlo and Dynamic Programming • Idea: use the sampling idea of Monte-Carlo, but instead of nudging V in the direction of the true observed return Rt, use an estimate of the return: rt+1 + γ V(st+1) • The value-function update formula becomes: V(st) ← V(st) + α [ rt+1 + γ V(st+1) − V(st) ] • Note: if V(st+1) were a perfect estimate, this would be a DP update • This value-estimation method is called Temporal-Difference (TD) learning
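
  The update on this slide as a one-line sketch: V(st) moves toward the bootstrapped target rt+1 + γ V(st+1) rather than the full Monte-Carlo return.

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
        V[s] += alpha * (r + gamma * V[s_next] - V[s])   # TD(0) value update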

  34. Temporal-Difference (TD) Learning • Like DP, it bootstraps (computes the value of a state based on estimates of its successors) • Like MC, it estimates expected values by sampling

  35. TD Learning: Advantages • No model of the environment (rewards, probabilities) is needed • TD only needs experience with the environment. • On-line, incremental learning • Both TD and MC converge. TD converges faster • Learn before knowing final outcome • Less memory and peak computation required

  36. TD for learning action values • TD for value-function learning (from the previous slide): V(st) ← V(st) + α [ rt+1 + γ V(st+1) − V(st) ] • The same easy ansatz gives a TD version of action-value learning: • Bellman equation for Q: Qπ(s, a) = r_s^a + γ ∑s' p^a_ss' ∑a' π(s', a') Qπ(s', a') • Dynamic-programming update for Q: Q(s, a) ← r_s^a + γ ∑s' p^a_ss' ∑a' π(s', a') Q(s', a') • TD update for Q(s, a): Q(st, at) ← Q(st, at) + α [ rt+1 + γ Q(st+1, at+1) − Q(st, at) ]
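
  The same one-line sketch for action values, given a transition (st, at, rt+1, st+1, at+1):

    def td_q_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])   # TD update for Q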

  37. On-Policy SARSA learning • Policy improvement: at each time step, choose an action mostly greedily with respect to Q (ε-greedy) • After getting reward rt+1, seeing st+1, and choosing at+1, update the action values: Q(st, at) ← Q(st, at) + α [ rt+1 + γ Q(st+1, at+1) − Q(st, at) ]
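
  A hedged sketch of one SARSA episode with ε-greedy action selection; the env interface (reset()/step() returning next state, reward, done) is an assumption for illustration, not part of the course materials.

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(actions)              # explore
        return max(actions, key=lambda a: Q[(s, a)])   # exploit (mostly greedy)

    def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # on-policy TD update
            s, a = s_next, a_next
        return Q

    Q = defaultdict(float)   # action values default to 0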

  38. Continuous (or large) state spaces • Previously we assumed values are held in a table • Impractical for continuous or large (multi-dimensional) state spaces • Examples: chess (~10^43 states), robot manipulator (continuous) • Basic approaches: • Quantize continuous state spaces • Quantization of continuous state variables, perhaps coarse • Example: angle in the cart-pole problem quantized to (-90, -30, 0, 30, 90)° • Tilings: impose overlapping grids • More general approach: function approximation • Simple case: represent V(s) with a set of weights wi and basis functions fi(s): V(s) = w1 f1(s) + w2 f2(s) + ... + wn fn(s) • Other methods of function approximation (e.g., neural networks) in coming weeks • Added benefit: V(s) can predict the value of states not yet visited
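
  A minimal sketch of the linear case above, V(s) = w1 f1(s) + ... + wn fn(s), with a semi-gradient TD(0) update on the weights; the feature functions are placeholders for illustration.

    def v_approx(w, features, s):
        return sum(wi * fi(s) for wi, fi in zip(w, features))

    def td0_weight_update(w, features, s, r, s_next, alpha=0.01, gamma=0.9):
        td_error = r + gamma * v_approx(w, features, s_next) - v_approx(w, features, s)
        return [wi + alpha * td_error * fi(s) for wi, fi in zip(w, features)]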

  39. Robot that learns to stand

  40. Robot that learns to stand

  41. Robot that learns to stand

  42. Robot that learns to stand • There is no static solution to the stand-up problem: the robot must use momentum to stand up • RL used: • Two-layer hierarchy • TD learning for action values Q(s, a) applied to plan a sequence of sub-goals that may lead to success • A continuous version of TD learning applied to learn to achieve each sub-goal • The robot was trained for several hundred iterations in simulation, plus roughly 100 trials on the physical robot • Details: • J. Morimoto and K. Doya, "Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning", Robotics and Autonomous Systems 36 (2001)

  43. Robot that learns to stand • After several hundred attempts in simulation

  44. Robot that learns to stand • After ~100 additional trials on the physical robot

  45. Wrap up

  46. Wrap up • Next time: Wrap up of RL • Temporal Difference Learning • TD-Learning for Action-Values • Examples

  47. Wrap up • Required readings: • Russell and Norvig, Chapters 17 and 21 (reinforcement learning) • Optional but highly recommended (via the course outline): • Sutton and Barto, Reinforcement Learning: An Introduction, Part I (MIT Press, 1998) (link to HTML version from course readings) • Kaelbling et al., "Reinforcement Learning: A Survey" (1996) • Tesauro, "Temporal Difference Learning and TD-Gammon" (1995) • Acknowledgements: • Rich Sutton and Doina Precup, whose RL materials are used here • Jeremy Cooperstock, AI course materials (2003)
