
Reinforcement Learning: A Tutorial


Presentation Transcript


  1. Reinforcement Learning: A Tutorial Rich Sutton AT&T Labs -- Research www.research.att.com/~sutton

  2. Outline • Core RL Theory • Value functions • Markov decision processes • Temporal-Difference Learning • Policy Iteration • The Space of RL Methods • Function Approximation • Traces • Kinds of Backups • Lessons for Artificial Intelligence • Concluding Remarks

  3. RL is Learning from Interaction • complete agent • temporally situated • continual learning & planning • object is to affect environment • environment stochastic & uncertain [Diagram: the Agent sends an action to the Environment, which returns a state and reward.]

  4. Ball-Balancing Demo • RL applied to a real physical system • Illustrates online learning • An easy problem...but learns as you watch! • Courtesy of GTE Labs: Hamid Benbrahim Judy Franklin John Doleac Oliver Selfridge • Same learning algorithm as early pole-balancer

  5. Markov Decision Processes (MDPs) In RL, the environment is modeled as an MDP, defined by: S – set of states of the environment; A(s) – set of actions possible in state s ∈ S; P(s,s',a) – probability of transition from s to s' given a; R(s,s',a) – expected reward on transition s to s' given a; γ – discount rate for delayed reward; discrete time, t = 0, 1, 2, . . ., so the interaction unfolds as s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, . . .
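For concreteness, here is one way the pieces S, A(s), P, R, and γ could be laid out as plain Python data; the two-state example and all its numbers are invented for illustration only.

```python
# A tiny finite MDP written as plain data structures (illustrative only;
# states, transition probabilities, and rewards are made up).
states = ["low", "high"]
actions = {"low": ["wait", "search"], "high": ["wait", "search"]}

# P[(s, a)] is a list of (next_state, probability);
# R[(s, s2, a)] is the expected reward on that transition.
P = {
    ("low", "wait"):    [("low", 1.0)],
    ("low", "search"):  [("low", 0.7), ("high", 0.3)],
    ("high", "wait"):   [("high", 1.0)],
    ("high", "search"): [("high", 0.8), ("low", 0.2)],
}
R = {
    ("low", "low", "wait"): 0.0,
    ("low", "low", "search"): -1.0,
    ("low", "high", "search"): 2.0,
    ("high", "high", "wait"): 1.0,
    ("high", "high", "search"): 3.0,
    ("high", "low", "search"): 0.0,
}
gamma = 0.9  # discount rate for delayed reward
```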

  6. The Objective is to Maximize Long-term Total Discounted Reward Find a policy π : s ∈ S → a ∈ A(s) (could be stochastic) that maximizes the value (expected future reward) of each s and of each s,a pair: $V^{\pi}(s) = E\{\, r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s,\ \pi \,\}$ and $Q^{\pi}(s,a) = E\{\, r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s,\ a_t = a,\ \pi \,\}$. These are called value functions - cf. evaluation functions

  7. Optimal Value Functions and Policies There exist optimal value functions, and corresponding optimal policies: π* is the greedy policy with respect to Q*
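In standard notation, the optimal value functions and the corresponding greedy policy the slide refers to are:

```latex
\begin{aligned}
V^*(s)   &= \max_{\pi} V^{\pi}(s), \\
Q^*(s,a) &= \max_{\pi} Q^{\pi}(s,a)
          = E\{\, r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s,\ a_t = a \,\}, \\
\pi^*(s) &= \arg\max_{a \in A(s)} Q^*(s,a).
\end{aligned}
```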

  8. 1-Step Tabular Q-Learning On each state transition, update a table entry using the TD error. Optimal behavior found without a model of the environment! (Watkins, 1989) Assumes finite MDP
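A minimal sketch of that update, assuming the Q table is a collections.defaultdict(float) and alpha is a step-size parameter (the default values are illustrative):

```python
from collections import defaultdict

Q = defaultdict(float)  # table entry Q[(s, a)], default 0

def q_learning_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """1-step tabular Q-learning: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    best_next = max(Q[(s_next, a2)] for a2 in next_actions)
    td_error = r + gamma * best_next - Q[(s, a)]   # the TD error
    Q[(s, a)] += alpha * td_error                  # update a table entry
    return td_error
```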

  9. Example of Value Functions

  10. 4x4 Grid Example

  11. Policy Iteration
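A compact sketch of tabular policy iteration, alternating policy evaluation and greedy improvement; it assumes the dict-based MDP representation (P, R, gamma) sketched above, and the convergence threshold theta is arbitrary:

```python
def policy_iteration(states, actions, P, R, gamma, theta=1e-8):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    V = {s: 0.0 for s in states}
    policy = {s: actions[s][0] for s in states}   # arbitrary initial policy

    def q(s, a):
        return sum(p * (R.get((s, s2, a), 0.0) + gamma * V[s2])
                   for s2, p in P[(s, a)])

    while True:
        # Policy evaluation: sweep until V approximates V^pi.
        while True:
            delta = 0.0
            for s in states:
                v_new = q(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement ("greedification").
        stable = True
        for s in states:
            best = max(actions[s], key=lambda a: q(s, a))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```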

  12. Summary of RL Theory: Generalized Policy Iteration [Diagram: the Value Function and the Policy improve each other in a loop; policy evaluation (value learning) updates the value function for the current policy, and policy improvement (greedification) makes the policy greedy with respect to the value function.]

  13. Outline • Core RL Theory • Value functions • Markov decision processes • Temporal-Difference Learning • Policy Iteration • The Space of RL Methods • Function Approximation • Traces • Kinds of Backups • Lessons for Artificial Intelligence • Concluding Remarks

  14. Sarsa Backup (Rummery & Niranjan; Holland; Sutton) On each transition, update: the value of this (state, action) pair is backed up from the value of the next pair, using the TD error.
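A minimal sketch of the Sarsa update in the same tabular setting; the value of the current (state, action) pair is moved toward the reward plus the discounted value of the next pair:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On transition (s, a) -> r, s', a': back up from the value of the next pair."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error
```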

  15. Dimensions of RL algorithms 1. Action Selection and Exploration 2. Function Approximation 3. Eligibility Trace 4. On-line Experience vs Simulated Experience 5. Kind of Backups . . .

  16. 1. Action Selection (Exploration) Ok, you've learned a value function; how do you pick actions? Greedy Action Selection: Always select the action that looks best. ε-Greedy Action Selection: Be greedy most of the time, occasionally take a random action. Surprisingly, this is the state of the art. Exploitation vs Exploration!
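A sketch of ε-greedy selection over a tabular Q; epsilon is the exploration probability and its default value here is illustrative:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Be greedy most of the time; with probability epsilon take a random action."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit: action that looks best
```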

  17. 2. Function Approximation Could be: • table • Backprop Neural Network • Radial-Basis-Function Network • Tile Coding (CMAC) • Nearest Neighbor, Memory Based • Decision Tree [Diagram: a gradient-descent method adjusts the Function Approximator from targets or errors.]

  18. Adjusting Network Weights e.g., gradient-descent Sarsa: $\vec{\theta} \leftarrow \vec{\theta} + \alpha\,[\, r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \,]\, \nabla_{\vec{\theta}} Q(s_t,a_t)$, where $\vec{\theta}$ is the weight vector, $r_{t+1} + \gamma Q(s_{t+1},a_{t+1})$ is the target value, $Q(s_t,a_t)$ is the estimated value, and $\nabla_{\vec{\theta}} Q$ is the standard backprop gradient.
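A sketch of the corresponding update with a linear function approximator, where the gradient of Q with respect to the weights is simply the feature vector; NumPy arrays and the parameter names are assumptions for this example:

```python
import numpy as np

def linear_sarsa_update(theta, phi, phi_next, r, alpha=0.1, gamma=0.9, done=False):
    """Gradient-descent Sarsa with Q(s,a) = theta . phi(s,a).

    theta    -- weight vector (NumPy array, updated in place)
    phi      -- feature vector of the current (s, a) pair
    phi_next -- feature vector of the next (s', a') pair
    """
    target = r if done else r + gamma * theta @ phi_next   # target value
    td_error = target - theta @ phi                        # minus the estimated value
    theta += alpha * td_error * phi                        # gradient of a linear Q is phi
    return td_error
```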

  19. Sparse Coarse Coding [Diagram: a fixed, expansive re-representation maps the input into many features, followed by a linear last layer.] Coarse: Large receptive fields Sparse: Few features present at one time

  20. The Mountain Car Problem (Moore, 1990) SITUATIONS: car's position and velocity ACTIONS: three thrusts: forward, reverse, none REWARDS: always –1 until car reaches the goal No Discounting Minimum-Time-to-Goal Problem [Diagram: the car in a valley with the Goal at the top of the slope; "Gravity wins" against the car's engine.]

  21. RL Applied to the Mountain Car 1. TD Method = Sarsa Backup 2. Action Selection = Greedy with initial optimistic values 3. Function approximator = CMAC 4. Eligibility traces = Replacing Video Demo (Sutton, 1995)

  22. Tile Coding applied to Mountain Car, a.k.a. CMACs (Albus, 1980)
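A toy sketch of a tile coder for the two-dimensional mountain-car state (position, velocity): several offset tilings each contribute one active tile, giving the sparse, coarse binary features described above. The tile counts, ranges, and offset scheme here are simplified assumptions, not the CMAC used in the demo:

```python
def active_tiles(x, y, n_tilings=8, tiles_per_dim=9,
                 x_range=(-1.2, 0.6), y_range=(-0.07, 0.07)):
    """Return one active tile index per tiling for a 2-D input (toy tile coder)."""
    features = []
    for t in range(n_tilings):
        offset = t / n_tilings  # each tiling is shifted by a fraction of a tile
        xi = int((x - x_range[0]) / (x_range[1] - x_range[0]) * tiles_per_dim + offset)
        yi = int((y - y_range[0]) / (y_range[1] - y_range[0]) * tiles_per_dim + offset)
        xi = min(max(xi, 0), tiles_per_dim)   # clip to this tiling's grid
        yi = min(max(yi, 0), tiles_per_dim)
        features.append(t * (tiles_per_dim + 1) ** 2 + yi * (tiles_per_dim + 1) + xi)
    # The approximate Q(s,a) is the sum of the weights of these active tiles.
    return features
```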

  23. The Acrobot Problem e.g., Dejong & Spong, 1994 Sutton, 1995 Minimum–Time–to–Goal: 4 state variables: 2 joint angles 2 angular velocities CMAC of 48 layers RL same as Mountain Car

  24. 3. Eligibility Traces (*) • The λ in TD(λ), Q(λ), Sarsa(λ) . . . • A more primitive way of dealing with delayed reward • The link to Monte Carlo methods • Passing credit back farther than one action • The general form: the change in a specific parameter at time t is proportional to (the global TD error at time t) × (the eligibility of that parameter at time t), e.g., was it taken at t? or shortly before t?

  25. Eligibility Traces [Diagram: visits to a state over TIME, with the resulting accumulating trace and replacing trace.]
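A sketch of the two trace-update rules applied each time step in a tabular method, with lam as the trace-decay parameter; this is a simplified illustration, not code from the slides:

```python
def decay_and_bump(e, visited, lam=0.9, gamma=0.9, accumulating=True):
    """Decay all eligibility traces, then bump the trace of the visited entry."""
    for key in e:
        e[key] *= gamma * lam                   # every trace fades each step
    if accumulating:
        e[visited] = e.get(visited, 0.0) + 1.0  # accumulating trace: add 1 on a visit
    else:
        e[visited] = 1.0                        # replacing trace: reset to 1 on a visit
    return e
```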

  26. Complex TD Backups (*) e.g. TD(λ) incrementally computes a weighted mixture of the primitive TD backups (1-step, 2-step, 3-step, ..., out to the end of the trial) as states are visited, with weights (1–λ), (1–λ)λ, (1–λ)λ², ..., which sum to 1. λ = 0 gives Simple TD; λ = 1 gives Simple Monte Carlo.

  27. Kinds of Backups [Diagram: a two-dimensional space of backups. One axis runs from sample backups to full backups; the other, bootstrapping (λ), runs from shallow backups to deep backups. The corners are Temporal-difference learning (sample, shallow), Dynamic programming (full, shallow), Monte Carlo (sample, deep), and Exhaustive search (full, deep).]

  28. Outline • Core RL Theory • Value functions • Markov decision processes • Temporal-Difference Learning • Policy Iteration • The Space of RL Methods • Function Approximation • Traces • Kinds of Backups • Lessons for Artificial Intelligence • Concluding Remarks

  29. World-Class Applications of RL • TD-Gammon and Jellyfish Tesauro, Dahl World's best backgammon player • Elevator Control Crites & Barto World's best down-peak elevator controller • Inventory Management Van Roy, Bertsekas, Lee & Tsitsiklis 10-15% improvement over industry standard methods • Dynamic Channel Assignment Singh & Bertsekas, Nie & Haykin World's best assigner of radio channels to mobile telephone calls

  30. Lesson #1: Approximate the Solution, Not the Problem • All these applications are large, stochastic, optimal control problems • Too hard for conventional methods to solve exactly • Require the problem to be simplified • RL just finds an approximate solution • . . . which can be much better!

  31. Backgammon (Tesauro, 1992, 1995) SITUATIONS: configurations of the playing board (about 10^20) ACTIONS: moves REWARDS: win: +1 lose: –1 else: 0 Pure delayed reward

  32. TD-Gammon (Tesauro, 1992–1995) Start with a random network Play millions of games against self Learn a value function from this simulated experience Action selection by 2–3 ply search [Diagram: the network's Value estimate is adjusted by the TD error.] This produces arguably the best player in the world

  33. Lesson #2: The Power of Learning from Experience Expert examples are expensive and scarce Experience is cheap and plentiful! And teaches the real solution [Chart (Tesauro, 1992): performance against gammontool (50%–70%) vs. number of hidden units (0, 10, 20, 40, 80), comparing TD-Gammon trained by self-play with Neurogammon, the same network trained from 15,000 expert-labeled examples.]

  34. Lesson #3: The Central Role of Value Functions ...of modifiable moment-by-moment estimates of how well things are going All RL methods learn value functions All state-space planners compute value functions Both are based on "backing up" value Recognizing and reacting to the ups and downs of life is an important part of intelligence

  35. Lesson #4: Learning and Planning can be radically similar Historically, planning and trial-and-error learning have been seen as opposites But RL treats both as processing of experience [Diagram: the value/policy is updated by learning from experience generated by on-line interaction with the environment, and by planning from experience simulated from an environment model.]

  36. 1-Step Tabular Q-Planning 1. Generate a state, s, and action, a 2. Consult model for next state, s', and reward, r 3. Learn from the experience, s,a,r,s' : 4. go to 1 With function approximation and cleverness in search control (Step 1), this is a viable, perhaps even superior, approach to state-space planning
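A minimal sketch of this loop, assuming the model is a Python dict mapping (s, a) to a sampled (s', r) and Q is a defaultdict(float); step 3 is the ordinary Q-learning update:

```python
import random

def q_planning_step(Q, model, alpha=0.1, gamma=0.9):
    """One iteration of 1-step tabular Q-planning from a stored model."""
    s, a = random.choice(list(model.keys()))            # 1. generate a state and action
    s_next, r = model[(s, a)]                           # 2. consult the model for s', r
    next_actions = [a2 for (s2, a2) in model if s2 == s_next]
    best = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)]) # 3. learn from s, a, r, s'
```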

  37. Lesson #5: Accretive Computation "Solves" Planning Dilemmas Reactivity/Deliberation dilemma "solved" simply by not opposing search and memory Intractability of planning "solved" by anytime improvement of solution Processes only loosely coupled Can proceed in parallel, asynchronously Quality of solution accumulates in value & model memories [Diagram (Dyna): acting produces experience; direct RL and model learning update the value/policy and the model from experience; planning updates the value/policy from the model.]

  38. Dyna Algorithm 1. s ← current state 2. Choose an action, a, and take it 3. Receive next state, s', and reward, r 4. Apply RL backup to s, a, s', r e.g., Q-learning update 5. Update Model(s, a) with s', r 6. Repeat k times: - select a previously seen state-action pair s, a - s', r ← Model(s, a) - Apply RL backup to s, a, s', r 7. Go to 1
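Putting the steps together, a sketch of one Dyna-Q cycle; env.step, the actions helper, and the epsilon_greedy function from the earlier sketch are assumed interfaces, not part of the original slides:

```python
import random

def dyna_q_step(env, Q, model, s, actions, alpha=0.1, gamma=0.9, eps=0.1, k=10):
    """One Dyna cycle: act, learn directly, update the model, then plan k times.

    Q is a defaultdict(float); model maps (s, a) -> (s', r).
    """
    def backup(s, a, s2, r):                    # the same RL (Q-learning) backup
        best = max(Q[(s2, b)] for b in actions(s2))
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

    a = epsilon_greedy(Q, s, actions(s), eps)   # 2. choose an action and take it
    s2, r = env.step(s, a)                      # 3. receive next state and reward
    backup(s, a, s2, r)                         # 4. apply the RL backup to s, a, s', r
    model[(s, a)] = (s2, r)                     # 5. update Model(s, a) with s', r
    for _ in range(k):                          # 6. repeat k times on simulated experience
        ps, pa = random.choice(list(model.keys()))
        backup(ps, pa, *model[(ps, pa)])
    return s2                                   # 7. go to 1 from the new state
```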

  39. Actions –> Behaviors MDPs seem too flat, too low-level Need to learn/plan/design agents at a higher level at the level of behaviors rather than just low-level actions • e.g., open-the-door rather than twitch-muscle-17 • walk-to-work rather than 1-step-forward • Behaviors are based on entire policies, • directed toward subgoals • enable abstraction, hierarchy, modularity, transfer...

  40. Rooms Example 4 rooms 4 hallways 4 unreliable primitive actions (up, down, left, right; fail 33% of the time) 8 multi-step options (to each room's 2 hallways) Given goal location, quickly plan shortest route Goal states are given a terminal value of 1; all other rewards zero; γ = .9 [Diagram: a 2x2 grid of ROOMs connected by HALLWAYS, with two goal locations marked.]

  41. Planning by Value Iteration

  42. Behaviors are much like Primitive Actions • Define a higher level Semi-MDP, overlaid on the original MDP • Can be used with same planning methods - faster planning - guaranteed correct results • Interchangeable with primitive actions • Models and policies can be learned by TD methods Provide a clear semantics for multi-level, re-usable knowledge for the general MDP context Lesson #6: Generality is no impediment to working with higher-level knowledge

  43. Summary of Lessons 1. Approximate the Solution, Not the Problem 2. The Power of Learning from Experience 3. The Central Role of Value Functions in finding optimal sequential behavior 4. Learning and Planning can be Radically Similar 5. Accretive Computation "Solves Dilemmas" 6. A General Approach need not be Flat, Low-level All stem from trying to solve MDPs

  44. Outline • Core RL Theory • Value functions • Markov decision processes • Temporal-Difference Learning • Policy Iteration • The Space of RL Methods • Function Approximation • Traces • Kinds of Backups • Lessons for Artificial Intelligence • Concluding Remarks

  45. Strands of History of RL Three strands converged: trial-and-error learning, temporal-difference learning, and optimal control / value functions. [Timeline of contributors: Hamilton (physics), 1800s; Thorndike, 1911; secondary reinforcement; Shannon; Samuel; Minsky; Bellman/Howard (OR); Holland; Klopf; Witten; Werbos; Barto et al.; Sutton; Watkins.]

  46. Radical Generality of the RL/MDP problem Classical: determinism, goals of achievement, complete knowledge, closed world, unlimited deliberation. RL/MDPs: general stochastic dynamics, general goals, uncertainty, reactive decision-making.

  47. Getting the degree of abstraction right • Actions can be low level (e.g. voltages to motors), high level (e.g., accept a job offer), "mental" (e.g., shift in focus of attention) • Situations: direct sensor readings, or symbolic descriptions, or "mental" situations (e.g., the situation of being lost) • An RL agent is not like a whole animal or robot, which consist of many RL agents as well as other components • The environment is not necessarily unknown to the RL agent, only incompletely controllable • Reward computation is in the RL agent's environment because the RL agent cannot change it arbitrarily

  48. Some of the Open Problems • Incomplete state information • Exploration • Structured states and actions • Incorporating prior knowledge • Using teachers • Theory of RL with function approximators • Modular and hierarchical architectures • Integration with other problem–solving and planning methods

  49. Dimensions of RL Algorithms • TD --- Monte Carlo (λ) • Model --- Model free • Kind of Function Approximator • Exploration method • Discounted --- Undiscounted • Computed policy --- Stored policy • On-policy --- Off-policy • Action values --- State values • Sample model --- Distribution model • Discrete actions --- Continuous actions

  50. A Unified View • There is a space of methods, with tradeoffs and mixtures - rather than discrete choices • Value functions are key - all the methods are about learning, computing, or estimating values - all the methods store a value function • All methods are based on backups - different methods just have backups of different shapes and sizes, sources and mixtures
