ECE 517: Reinforcement Learning in Artificial Intelligence Lecture 14: Planning and Learning


Presentation Transcript


  1. ECE 517: Reinforcement Learning in Artificial Intelligence Lecture 14: Planning and Learning October 27, 2015 Dr. Itamar Arel College of Engineering Department of Electrical Engineering and Computer Science The University of Tennessee Fall 2015

  2. Final projects - logistics • Projects can be done in groups of up to 3 students • Details on projects will be posted soon • Students are encouraged to propose a topic • Please email me your top three choices for a project along with a preferred date for your presentation • Presentation dates: Nov. 17, 19, 24, and Dec. 1 + an additional time slot (TBD) • Format: 20 min presentation + 5 min Q&A • ~5 min for background and motivation • ~15 min for description of your work, results, and conclusions • Written report due: Monday, Dec. 7 • Format similar to project report

  3. Final projects – sample topics • DQN – Playing Atari games using RL • Tetris player using RL (and NN) • Curiosity-based TD learning* • Reinforcement Learning of Local Shape in the Game of Go • AIBO learning to walk • Study of value function definitions for TD learning • Imitation learning in RL

  4. Outline • Introduction • Use of environment models • Integration of planning and learning methods

  5. Introduction • Earlier we discussed Monte Carlo and temporal-difference methods as distinct alternatives • We then showed how they can be seamlessly integrated using eligibility traces, as in TD(λ) • Planning methods: e.g. Dynamic Programming and heuristic search • Rely on knowledge of a model • Model – any information that helps the agent predict how the environment will behave • Learning methods: Monte Carlo and Temporal Difference learning • Do not require a model • Our goal: explore the extent to which the two kinds of methods can be intermixed

  6. The original idea

  7. The original idea (cont.)

  8. Models • Model: anything the agent can use to predict how the environment will respond to its actions • Distribution models: provide a description of all possibilities (of next states and rewards) and their probabilities • e.g. Dynamic Programming • Example – the sum of a dozen dice: produce all possible sums and their probabilities of occurring • Sample models: produce just one sample experience • In our example – produce individual sums drawn according to this probability distribution • Both types of model can be used to mimic the environment and produce simulated experience • Sample models are often much easier to come by
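
To make the distinction concrete, here is a minimal Python sketch of the dice example; the function names and structure are illustrative, not from the lecture. The distribution model enumerates every possible sum with its exact probability, while the sample model just draws one sum.

```python
# Sketch (not from the lecture): distribution model vs. sample model for the
# "sum of a dozen dice" example.
import random

def distribution_model(n_dice=12):
    """Distribution model: every possible sum and its exact probability."""
    dist = {0: 1.0}
    for _ in range(n_dice):                       # convolve one die at a time
        new = {}
        for s, p in dist.items():
            for face in range(1, 7):
                new[s + face] = new.get(s + face, 0.0) + p / 6.0
        dist = new
    return dist                                   # keys 12..72 for a dozen dice

def sample_model(n_dice=12):
    """Sample model: produce just one sum, drawn from that same distribution."""
    return sum(random.randint(1, 6) for _ in range(n_dice))
```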

  9. Planning • Planning: any computational process that uses a model to create or improve a policy • Planning in AI: • State-space planning (such as in RL) – search for policy • Plan-space planning (e.g., partial-order planner) • e.g. evolutionary methods • We take the following (unusual) view: • All state-space planning methods involve computing value functions, either explicitly or implicitly • They all apply backups to simulated experience

  10. Planning (cont.) • Classical DP methods are state-space planning methods • Heuristic search methods are state-space planning methods • Learning methods require only experience as input, and in many cases they can be applied to simulated experience just as well as to real experience • Example of a planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning
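
A minimal Python sketch of random-sample one-step tabular Q-planning, assuming a sample model callable as model(s, a) -> (reward, next_state) and small discrete state and action sets; the interface and names are assumptions, not from the slides.

```python
# Sketch: random-sample one-step tabular Q-planning on simulated experience.
import random
from collections import defaultdict

def q_planning(model, states, actions, n_updates=10000, alpha=0.1, gamma=0.95):
    Q = defaultdict(float)                        # tabular Q[(s, a)], initialized to 0
    for _ in range(n_updates):
        s = random.choice(states)                 # 1. pick a state and action at random
        a = random.choice(actions)
        r, s_next = model(s, a)                   # 2. ask the sample model for r, s'
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        # 3. apply a one-step tabular Q-learning backup to the simulated experience
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```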

  11. Learning, Planning, and Acting • Two uses of real experience: • Model learning: to improve the model • Direct RL: to directly improve the value function and policy • Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning. • Q: What are the advantages/disadvantages of each?

  12. Direct vs. Indirect RL • Indirect methods: make fuller use of experience; get a better policy with fewer environment interactions • Direct methods: simpler; not affected by bad models • But the two are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel • Q: Which scheme do you think applies to humans?

  13. The Dyna-Q Architecture (Sutton 1990)

  14. The Dyna-Q Algorithm • Planning is done with the random-sample single-step tabular Q-planning method • Each iteration of the algorithm combines direct RL, model learning (update), and planning
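
Below is a minimal tabular Dyna-Q sketch tying the three components together, assuming a deterministic environment (as in the maze examples) and an agent-side model stored as a dictionary; the class and method names are illustrative, not the lecture's code.

```python
# Sketch of tabular Dyna-Q: direct RL + model learning + planning per real step.
import random
from collections import defaultdict

class DynaQ:
    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1, n_planning=5):
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.n_planning = n_planning
        self.Q = defaultdict(float)      # tabular Q[(s, a)]
        self.model = {}                  # model[(s, a)] = (r, s'), deterministic

    def policy(self, s):
        """Epsilon-greedy action selection."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def update(self, s, a, r, s_next):
        """One real step: direct RL, model learning, then planning."""
        self._q_backup(s, a, r, s_next)                     # direct RL
        self.model[(s, a)] = (r, s_next)                    # model learning (update)
        for _ in range(self.n_planning):                    # planning
            (ps, pa), (pr, ps_next) = random.choice(list(self.model.items()))
            self._q_backup(ps, pa, pr, ps_next)

    def _q_backup(self, s, a, r, s_next):
        best_next = max(self.Q[(s_next, a2)] for a2 in self.actions)
        self.Q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.Q[(s, a)])
```

On each real step the agent would call policy(s), act, observe (r, s'), and then call update(s, a, r, s'), which performs the direct RL backup, records the transition in the model, and runs n_planning simulated backups.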

  15. Dyna-Q on a Simple Maze • Rewards are 0 until the goal is reached, when the reward is +1

  16. Dyna-Q Snapshots: Midway in 2nd Episode • Recall that in a planning context … • Exploration – trying actions that improve the model • Exploitation – Behaving in the optimal way given the current model • Balance between the two is always a key challenge!

  17. Variations on the Dyna-Q agent • (Regular) Dyna-Q • Soft exploration/exploitation with constant rewards • Dyna-Q+ • Encourages exploration of state-action pairs that have not been visited in a long time (in real interaction with the environment) • If n is the number of steps elapsed between two consecutive visits to (s,a), the reward used in planning grows with n • Dyna-AC • Actor-Critic learning rather than Q-learning

  18. More on Dyna-Q+ • Uses an “exploration bonus”: • Keeps track of the time since each state-action pair was tried for real • An extra reward, which grows with how long ago a pair was last tried, is added to transitions involving that pair during planning: the longer a pair goes unvisited, the larger the reward for visiting it • The agent (indirectly) “plans” how to visit long-unvisited states
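
A minimal sketch of the exploration bonus, assuming the r + κ√τ form used in Sutton & Barto's treatment, where τ is the number of real time steps since (s, a) was last tried; the function name and the value of kappa are illustrative.

```python
# Sketch: Dyna-Q+ exploration bonus added to the modeled reward during planning.
import math

def planning_reward(r, last_tried_step, current_step, kappa=1e-3):
    """Reward used for (s, a) in planning backups: modeled reward plus bonus."""
    tau = current_step - last_tried_step    # real steps since (s, a) was tried for real
    return r + kappa * math.sqrt(tau)
```

During planning, when the model returns (r, s') for (s, a), the backup would use planning_reward(r, ...) in place of r, so long-unvisited pairs look increasingly attractive and the agent is driven to re-test them.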

  19. When the Model is Wrong • The maze example was oversimplified; in reality many things could go wrong • The environment could be stochastic • The model can be imperfect (local minima, stochasticity, or no convergence) • Partial experience could be misleading • When the model is incorrect, the planning process will compute a suboptimal policy • This is actually a learning opportunity: the discovery and correction of the modeling error

  20. When the Model is Wrong: Blocking Maze • The changed environment is harder

  21. Shortcut Maze • The changed environment is easier

  22. Prioritized Sweeping • In the Dyna agents presented, simulated transitions are started in uniformly chosen state-action pairs • This is probably not optimal • Which states or state-action pairs should be generated during planning? • Work backwards from states whose values have just changed: • Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change • When a pair is backed up and its value changes, insert its predecessors into the queue according to their priorities • Always perform backups from the first pair in the queue • (Moore and Atkeson, 1993; Peng and Williams, 1993)
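
A minimal sketch of the prioritized-sweeping planning loop, with the real-experience part (acting, updating the model, and pushing the just-visited pair) omitted. The data-structure layout is an assumption, not the lecture's code: a deterministic model dict model[(s, a)] = (r, s'), a predecessors dict mapping each state s to the (s_prev, a_prev, r_prev) triples known to lead to s, and a tabular Q stored as defaultdict(float).

```python
# Sketch: prioritized-sweeping planning backups ordered by a priority queue.
import heapq
import itertools

_tie = itertools.count()    # tie-breaker so the heap never compares states directly

def push(queue, priority, s, a, theta=1e-4):
    """Insert (s, a) with the given priority if it exceeds the threshold theta."""
    if priority > theta:
        heapq.heappush(queue, (-priority, next(_tie), (s, a)))

def planning_sweep(Q, model, predecessors, actions, queue,
                   n=5, alpha=0.1, gamma=0.95):
    for _ in range(n):
        if not queue:
            break
        _, _, (s, a) = heapq.heappop(queue)       # always back up the top-priority pair
        r, s_next = model[(s, a)]
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        # the value of s changed, so its predecessors may now need backing up too
        best_s = max(Q[(s, a2)] for a2 in actions)
        for (sp, ap, rp) in predecessors.get(s, ()):
            push(queue, abs(rp + gamma * best_s - Q[(sp, ap)]), sp, ap)
```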

  23. Prioritized Sweeping

  24. Prioritized Sweeping vs. Dyna-Q Both use N = 5 backups per environmental interaction

  25. Trajectory Sampling • Trajectory sampling: perform backups along simulated trajectories • This samples from the on-policy distribution • Distribution constructed from experience (visits) • Advantages when function approximation is used • Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored • (Figure labels: initial states; irrelevant states; states reachable under optimal control)
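
A minimal sketch of the idea, assuming a sample model callable as model(s, a) -> (reward, next_state, done), an epsilon-greedy policy(s, Q) function, and a tabular Q stored as defaultdict(float); these interfaces and names are assumptions. Backups are applied only to state-action pairs that the simulated trajectories actually visit, which is what focuses computation on the on-policy distribution.

```python
# Sketch: planning by backing up along trajectories simulated under the current policy.
import random

def trajectory_sampling_planning(Q, model, policy, start_states, actions,
                                 n_trajectories=100, max_steps=200,
                                 alpha=0.1, gamma=0.95):
    for _ in range(n_trajectories):
        s = random.choice(start_states)           # trajectories begin at start states
        for _ in range(max_steps):
            a = policy(s, Q)                      # follow the current policy in simulation
            r, s_next, done = model(s, a)
            target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # back up only visited pairs
            if done:
                break
            s = s_next
    return Q
```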

  26. Summary • Discussed close relationship between planning and learning • Important distinction between distribution models and sample models • Looked at some ways to integrate planning and learning • synergy among planning, acting, model learning • Distribution of backups: focus of the computation • prioritized sweeping • trajectory sampling: backup along trajectories
