Meeting 9 - RL Wrap Up and B&W Tournament • Course logistic notes • Reinforcement Learning Wrap Up (30 minutes) • Temporal Difference Learning and Variants • Examples • Continuous state-action spaces • Tournament time – move to 5110
Course Logistics • Assignment 1: • Tournament code due several minutes ago • Final code and paper due Thursday. This is a result of the McGill open house last weekend, which prevented lab access for some. • Tournament after short conclusion of RL
Value Estimation and Greedy Policy Improvement: Exercise • The exercise was posted in elaborated form on Friday 27 January – See the course website. • The exercise is due on Tuesday 7 February at 16h00. • This will count as one quiz toward your grade.
Value and Action-Value Functions • Value V(s) of a state under the policy: $V^\pi(s) = E_\pi[R_t \mid s_t = s]$ • Action-value Q(s,a): take any action a and follow the policy thereafter: $Q^\pi(s,a) = E_\pi[R_t \mid s_t = s, a_t = a]$
Generalized Policy Iteration (for policy improvement) • Iterative cycle of value estimation and improvement of policy:
Value Estimation via Temporal Difference Learning • Idea: use the sampling idea of Monte Carlo, but instead of adjusting V to better match the observed return $R_t$, use a revised estimate for the return: $r_{t+1} + \gamma V(s_{t+1})$ • The value function update formula becomes (with step size $\alpha$): $V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$ • Note: if $V(s_{t+1})$ were a perfect estimate, this would be a DP update • This value estimation method is called Temporal-Difference (TD) learning
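A minimal sketch of tabular TD value estimation may help make the update concrete. The environment here is an assumption for illustration only (a five-state random walk with reward 1 on exiting to the right, not taken from the slides), as are the function name `td0_value_estimation` and the parameter values.

```python
import random

def td0_value_estimation(episodes=2000, alpha=0.05, gamma=1.0, n_states=5, seed=0):
    """Tabular TD(0) on a hypothetical random-walk chain (illustrative only)."""
    rng = random.Random(seed)
    # Indices 0 and n_states + 1 are terminal states whose value stays 0.
    V = [0.0] * (n_states + 2)
    for _ in range(episodes):
        s = (n_states + 1) // 2  # start in the middle of the chain
        while 0 < s < n_states + 1:
            s_next = s + rng.choice((-1, 1))  # random policy: step left or right
            r = 1.0 if s_next == n_states + 1 else 0.0
            # TD error: revised return estimate minus the current estimate.
            delta = r + gamma * V[s_next] - V[s]
            V[s] += alpha * delta  # update after every step, not at episode end
            s = s_next
    return V[1:-1]  # values of the non-terminal states
```

For this chain the true values are 1/6, 2/6, ..., 5/6, so the returned estimates should land near those numbers, with some residual noise because the step size is constant.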
TD Learning: Advantages • No model of the environment (rewards, probabilities) is needed • TD only needs experience with the environment • On-line, incremental learning • Both TD and MC converge; in practice TD usually converges faster • Learn before knowing final outcome • Less memory and peak computation required
TD for learning action values (“Q-Learning”) • An easy ansatz provides a TD version of action-value learning • Bellman equation for Qπ: • Dynamic programming update for Q: • TD update for Q(s,a):
On-Policy SARSA learning • Policy improvement: At each time-step, choose an action that is mostly greedy with respect to Q: • After getting reward, seeing st+1, and choosing at+1, update action values according to the action-value TD formula:
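The on-policy loop can be sketched as follows, taking "mostly greedy" to mean ε-greedy and using the standard SARSA update Q(s,a) ← Q(s,a) + α[r + γQ(s',a') − Q(s,a)]. The corridor task (reward −1 per step, goal at the right end), the function names, and all parameter values are assumptions made for illustration, not part of the lecture.

```python
import random

ACTIONS = (-1, +1)  # hypothetical 1-D corridor: step left or step right

def epsilon_greedy(Q, s, epsilon, rng):
    """Mostly greedy action choice: explore with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(s, a)] == best])  # random tie-break

def sarsa(n_states=6, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    goal = n_states - 1
    # Q at the goal is never updated, so it stays 0 as a terminal value.
    Q = {(s, a): 0.0 for s in range(n_states) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        while s != goal:
            s_next = min(max(s + a, 0), goal)  # walls clip the move
            r = -1.0                           # cost per step
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # SARSA uses the action actually chosen in the next state.
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q

def greedy_policy(Q, n_states=6):
    """Greedy action in each non-goal state."""
    return [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)]
```

Calling `greedy_policy(sarsa())` should give a step to the right in every state. Because the update bootstraps from the action the agent really takes next, the learned values reflect the exploring policy rather than the purely greedy one.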
Grid World – Example • Goal – Prey runs about at random • Agent – Predator chasing prey
Grid World Example • Pursuit before learning
Grid World Example • Upon further learning trials
Grid World Example • Learned pursuit task
Grid World Example • State space and learned value function
Grid World Example • State sequence st • Action sequence at • Reward sequence rt • Value sequence V(st) • Delta sequence rt+1 + γV(st+1) - V(st)
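As a small worked illustration, the delta (TD error) sequence can be computed from a recorded reward sequence and value sequence, taking delta as r(t+1) + γV(s(t+1)) − V(s(t)) in line with the TD update. The function name `td_errors` and the sample numbers in the usage note are made up for the example.

```python
def td_errors(rewards, values, gamma=0.9):
    """Delta sequence for one recorded trajectory.

    rewards[t] holds r_{t+1}, the reward received after leaving s_t.
    values[t] holds V(s_t); values has one more entry than rewards,
    the last being the value of the final state (0 if it is terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    return [r + gamma * v_next - v
            for r, v, v_next in zip(rewards, values[:-1], values[1:])]
```

For rewards `[0.0, 0.0, 1.0]` and values `[0.5, 0.6, 0.8, 0.0]` with γ = 0.9, the deltas are approximately 0.04, 0.12, and 0.2. Positive deltas mean each state turned out better than its current estimate.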
Continuous (or large) state spaces • Previously implied table implementation: Value[state] • Impractical for continuous or large (multi-dimensional) state spaces • Examples: Chess (~10^43), Robot manipulator (continuous) • B&W state space size? • Both storage problems and generalization problems • Generalization problem: Agent will not have a chance to explore most states of a sufficiently large state space
State Space Generalization: Approaches • Quantize continuous state spaces • Circumvents the generalization problem: forces a small number of states • Quantization of continuous state variables, perhaps coarse • Example: Angle in cart-pole problem quantized to (-90, -30, 0, 30, 90)° • Tilings: impose overlapping grids • More general approach: function approximation • Simple case: represent V(s) as a set of weights wi and basis functions fi(s) • V(s) = w1 f1(s) + w2 f2(s) + ... + wn fn(s) • More refined methods of function approximation (e.g. neural networks) in coming weeks • Added benefit: V(s) generalizes to predict the value of states not yet visited (interpolating between the values of visited states, for example)
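Both approaches can be sketched in a few lines. The sketch assumes the listed cart-pole angles are representative levels with nearest-level assignment (the slide does not say whether they are levels or bin edges), and it picks Gaussian bumps as the basis functions fi(s). That choice and all function names are illustrative assumptions.

```python
import math

def quantize(angle_deg, levels=(-90, -30, 0, 30, 90)):
    """Map a continuous angle to the nearest quantization level (assumed scheme)."""
    return min(levels, key=lambda level: abs(angle_deg - level))

def make_gaussian_features(centers, width):
    """Build basis functions f_i(s) = exp(-((s - c_i) / width)^2)."""
    # c=c binds each center at definition time, avoiding the late-binding pitfall.
    return [lambda s, c=c: math.exp(-((s - c) / width) ** 2) for c in centers]

def linear_value(weights, features, s):
    """V(s) = w1 f1(s) + w2 f2(s) + ... + wn fn(s)."""
    return sum(w * f(s) for w, f in zip(weights, features))
```

For instance, `quantize(25)` returns `30`, while `linear_value` gives a value for any real-valued s, including states never visited, because neighbouring basis functions overlap and blend their weights.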
Robot that learns to stand • No static solution here to stand-up problem: robot must use momentum to stand up • RL used: • Two-layer hierarchy • TD learning for action values Q(s,a) applied to plan a sequence of sub-goals that may lead to success • Continuous version of TD learning applied to learn to achieve sub-goals • Robot trained several hundred iterations in simulation + 100 or so trials on the physical robot • Details: • J. Morimoto, K. Doya, “Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning”. J. Robotics Auto. Sys. 36 (2001)
Robot that learns to stand • After several hundred attempts in simulation
Robot that learns to stand • After ~100 additional trials on the robot
Wrap up • Next time: • Brief overview of planning in AI • End of 1st section of course devoted to problem solving, search, and RL
Wrap up • Required readings • Russell and Norvig • Chapter 11: Planning • Acknowledgements: • Rich Sutton, Doina Precup, RL materials used here • Kenji Doya, Standing robot materials
Inaugural B&W Computer Tournament • Number of competitors? • Duration of typical game? • t ≈ 50 (total) moves × 10 sec / move ≈ 8 minutes • Stage 1: • Round-robin play • 3 games against randomly selected opponents • t ≈ 35 minutes • Top 8 agents advance. Scoring: Draw = 0, Win = +1, Loss = -1. • Stage 2: • Single elimination seeded bracket play: (((1 8),(4 5)),((2 7),(3 6))) • Top four competitors receive bonus (will deal fairly with drawn agents) • Draws: game drawn after 50 (total) moves, or by referee decision
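The tournament arithmetic and the seeded bracket can be checked with a short sketch. The function names are hypothetical, and the seeding routine is one common way to generate such a bracket (it assumes the number of players is a power of two), not necessarily how the course organizers produced theirs.

```python
def game_minutes(total_moves=50, seconds_per_move=10):
    """Upper bound on game length in minutes: moves x seconds per move / 60."""
    return total_moves * seconds_per_move / 60

def score(results):
    """Stage 1 scoring from the slide: win = +1, draw = 0, loss = -1."""
    points = {"win": 1, "draw": 0, "loss": -1}
    return sum(points[r] for r in results)

def seeded_bracket(n_players):
    """First-round pairings of a seeded single-elimination bracket.

    Assumes n_players is a power of two. Each doubling pairs seed s
    with seed (size + 1 - s), so the top seeds can only meet late.
    """
    order = [1]
    while len(order) < n_players:
        size = 2 * len(order)
        order = [seed for s in order for seed in (s, size + 1 - s)]
    return list(zip(order[0::2], order[1::2]))
```

`game_minutes()` gives about 8.3 minutes, matching the estimate of roughly 8 minutes per game, and `seeded_bracket(8)` returns `[(1, 8), (4, 5), (2, 7), (3, 6)]`, the same pairings as the Stage 2 bracket.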