
Introduction to Reinforcement Learning


Presentation Transcript


  1. Introduction to Reinforcement Learning
Dr Kathryn Merrick, k.merrick@adfa.edu.au
2008 Spring School on Optimisation, Learning and Complexity
Friday 7th November, 15:30-17:00

  2. Reinforcement Learning is… … learning from trial-and-error and reward by interaction with an environment.

  3. Today’s Lecture
• A formal framework: Markov Decision Processes
• Optimality criteria
• Value functions
• Solution methods: Q-learning
• Examples and exercises
• Alternative models
• Summary and applications

  4. Markov Decision Processes
The reinforcement learning problem can be represented as:
• A set S of states {s1, s2, s3, …}
• A set A of actions {a1, a2, a3, …}
• A transition function T: S x A → S (deterministic) or T: S x A x S → [0, 1] (stochastic)
• A reward function R: S x A → Real or R: S x A x S → Real
• A policy π: S → A (deterministic) or π: S x A → [0, 1] (stochastic)
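As a concrete illustration, a small stochastic MDP can be written down directly as dictionaries. This is a minimal sketch only; the state and action names, probabilities and rewards below are made up for the example and are not from the lecture.

import random

# A toy stochastic MDP in the notation of the slide above.
states = ["s1", "s2"]
actions = ["a1", "a2"]

# T[(s, a)] maps each possible successor s' to a probability in [0, 1].
T = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s2": 1.0},
}

# R[(s, a, s')] is the real-valued reward for that transition
# (here: 1 for arriving in s2, 0 otherwise - purely illustrative).
R = {(s, a, s_next): (1.0 if s_next == "s2" else 0.0)
     for (s, a), probs in T.items() for s_next in probs}

# A deterministic policy maps each state to an action.
policy = {"s1": "a1", "s2": "a2"}

def step(s, a):
    """Sample a successor state from T and return (s', reward)."""
    probs = T[(s, a)]
    s_next = random.choices(list(probs), weights=list(probs.values()))[0]
    return s_next, R[(s, a, s_next)]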

  5. Optimality Criteria
Suppose an agent receives a reward rt at time t. Then optimal behaviour might:
• Maximise the sum of expected future reward
• Maximise over a finite horizon
• Maximise over an infinite horizon
• Maximise over a discounted infinite horizon
• Maximise average reward
(The standard forms of these criteria are written out below.)
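In the standard notation of the Kaelbling, Littman and Moore (1996) survey listed in the references, with the expectation taken over the agent's experience, these criteria are usually written as:

E\left[\sum_{t=0}^{h} r_t\right]  (finite horizon of h steps)
E\left[\sum_{t=0}^{\infty} r_t\right]  (infinite horizon)
E\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right], \quad 0 \le \gamma < 1  (discounted infinite horizon)
\lim_{h \to \infty} \frac{1}{h}\, E\left[\sum_{t=0}^{h} r_t\right]  (average reward)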

  6. Value Functions
• State value function Vπ: S → Real, written Vπ(s): the expected sum of discounted reward for following the policy π from state s to the end of time.
• State-action value function Qπ: S x A → Real, written Qπ(s, a): the expected sum of discounted reward for starting in state s, taking action a once, then following the policy π from state s' to the end of time.
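Written out for the discounted setting, these two definitions take the standard form:

V^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \mid s_0 = s,\ \pi\right]
Q^{\pi}(s, a) = E\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \mid s_0 = s,\ a_0 = a,\ \pi\right]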

  7. Optimal State Value Function
V*(s) = max_a E{ R(s, a, s') + γ V*(s') | s, a } = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]
• A Bellman equation
• Can be solved using dynamic programming
• Requires knowledge of the transition function T
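A minimal dynamic-programming sketch of this idea is value iteration, assuming the same dictionary-style T and R as in the hypothetical example after slide 4 (so T and the reward structure are illustrative, not the lecture's):

def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Solve the Bellman optimality equation for V* by repeated sweeps."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # max over actions of the expected one-step reward plus discounted V*
            best = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in T[(s, a)].items())
                for a in actions if (s, a) in T
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V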

  8. Optimal State-Action Value Function
Q*(s, a) = E{ R(s, a, s') + γ max_a' Q*(s', a') | s, a } = Σ_s' T(s, a, s') [ R(s, a, s') + γ max_a' Q*(s', a') ]
• Also a Bellman equation
• Also requires knowledge of the transition function T to solve using dynamic programming
• Can now define action selection: π*(s) = argmax_a Q*(s, a)
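Continuing the same hypothetical example, once V* is known Q* can be read off by a one-step lookahead, and the greedy policy is the argmax of Q* in each state:

def q_from_v(states, actions, T, R, V, gamma=0.9):
    """Compute Q*(s, a) by a one-step lookahead on V*."""
    return {
        (s, a): sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in T[(s, a)].items())
        for s in states for a in actions if (s, a) in T
    }

def greedy_policy(states, actions, Q):
    """pi*(s) = argmax_a Q*(s, a)."""
    return {s: max((a for a in actions if (s, a) in Q), key=lambda a: Q[(s, a)])
            for s in states}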

  9. A Possible Application…

  10. Solution Methods
• Model based:
  • For example dynamic programming
  • Require a model (transition function) of the environment for learning
• Model free:
  • Learn from interaction with the environment without requiring a model
  • For example Q-learning…

  11. Q-Learning by Example: Driving in Canberra
[State-transition diagram: four states (Parked Clean, Driving Clean, Parked Dirty, Driving Dirty) connected by the actions Park, Drive and Clean.]

  12. Formulating the Problem
• States: s1 Park clean, s2 Park dirty, s3 Drive clean, s4 Drive dirty
• Actions: a1 Drive, a2 Clean, a3 Park
• Reward: rt = 1 for transitions to a ‘clean’ state, 0 otherwise
• State-Action Table or Q-Table
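One possible encoding of this formulation as a Q-table is sketched below. The table layout (one row per state, one column per action) and the identifier names are my own choices for illustration; only the state/action lists and the reward rule come from the slide.

import numpy as np

# States and actions from the slide; the indices are arbitrary.
STATES = ["park_clean", "park_dirty", "drive_clean", "drive_dirty"]
ACTIONS = ["drive", "clean", "park"]

# Q-table: one row per state, one column per action, initialised to zero.
Q = np.zeros((len(STATES), len(ACTIONS)))

def reward(next_state):
    """rt = 1 for transitions into a 'clean' state, 0 otherwise."""
    return 1.0 if next_state.endswith("clean") else 0.0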

  13. A Q-Learning Agent
[Agent-environment loop: the environment sends state st and reward rt to the agent; the agent performs a learning update to πt, selects action at from πt, and sends at back to the environment.]

  14. Q-Learning Algorithmic Components
• Learning update (to Q-Table):
Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_a' Q(s', a') ]
or equivalently
Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]
• Action selection (from Q-Table): a = f(Q(s, a)), for example greedy or ε-greedy selection
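A minimal sketch of these two components in Python, assuming a tabular Q stored in a dictionary and ε-greedy action selection; the environment interface (env.reset and env.step returning next state, reward and a done flag) is a hypothetical API, not something defined in the lecture.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy action selection."""
    Q = defaultdict(float)                      # Q[(s, a)], defaults to 0.0

    def select_action(s):
        # Explore with probability epsilon, otherwise act greedily on Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = select_action(s)
            s_next, r, done = env.step(a)       # hypothetical environment API
            # Learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a')
            target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q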

  15. Matlab Code Available on Request

  16. Exercise
You need to program a small robot to learn to find food.
• What assumptions will you make about the robot’s sensors and actuators to represent the environment?
• How could you model the problem as an MDP?
• Calculate a few learning iterations in your domain by hand.

  17. Alternatives
• Function approximation of the Q-table:
  • Neural networks
  • Decision trees
  • Gradient descent methods
• Reinforcement learning variants:
  • Relational reinforcement learning
  • Hierarchical reinforcement learning
  • Intrinsically motivated reinforcement learning
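As one illustration of the function-approximation idea (a sketch of my own, not material from the lecture), Q can be represented by a linear model over state-action features and trained with a semi-gradient version of the Q-learning update; the feature vectors here are assumed to be supplied by some placeholder feature function phi(s, a).

import numpy as np

def q_hat(w, features):
    """Approximate Q(s, a) as a linear function of the features phi(s, a)."""
    return float(np.dot(w, features))

def gradient_q_update(w, phi_sa, r, phi_next_best, alpha=0.01, gamma=0.9):
    """One semi-gradient Q-learning step on the weight vector w.
    phi_sa        : feature vector phi(s, a) for the visited state-action pair
    phi_next_best : phi(s', a') for the greedy action a' in the next state
    """
    target = r + gamma * q_hat(w, phi_next_best)
    td_error = target - q_hat(w, phi_sa)
    return w + alpha * td_error * phi_sa   # gradient of q_hat w.r.t. w is phi_sa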

  18. A final application…

  19. References and Further Reading
• Sutton, R., Barto, A., (2000) Reinforcement Learning: An Introduction, The MIT Press. http://www.cs.ualberta.ca/~sutton/book/the-book.html
• Kaelbling, L., Littman, M., Moore, A., (1996) Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research, 4:237-285
• Barto, A., Mahadevan, S., (2003) Recent Advances in Hierarchical Reinforcement Learning, Discrete Event Dynamic Systems: Theory and Applications, 13(4):41-77
