
Chap. 13 Reinforcement Learning (RL)


Presentation Transcript


  1. Chap. 13 Reinforcement Learning (RL) Machine Learning Tom M. Mitchell

  2. Outline • What is Reinforcement Learning? • Methods Used in Reinforcement Learning • Temporal Difference Methods • Applications

  3. Introduction • What is reinforcement learning? • History • What can reinforcement learning do? • The elements of reinforcement learning

  4. What is Reinforcement Learning? Reinforcement learning addresses the question of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals. In reinforcement learning, the computer is simply given a goal to achieve. The computer then learns how to achieve that goal by trial-and-error interactions with its environment.

  5. What is Reinforcement Learning? • To provide the intuition behind reinforcement learning, consider the problem of learning to ride a bicycle. • The goal given to the RL system is simply to ride the bicycle without falling over. • In the first trial, the RL system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right. At this point there are two possible actions: turn the handlebars left or turn them right.

  6. What is Reinforcement Learning? • The RL system turns the handlebars to the left and immediately crashes to the ground, thus receiving a negative reinforcement. • The RL system has just learned not to turn the handlebars left when tilted 45 degrees to the right. • In the next trial, the RL system knows not to turn the handlebars to the left, so it performs the only other possible action: turn right when tilted 45 degrees to the right.

  7. What is Reinforcement Learning? • It immediately crashes to the ground, again receiving a strong negative reinforcement. • At this point the RL system has not only learned that turning the handlebars right or left when tilted 45 degrees to the right is bad, but that the "state" of being tilted 45 degrees to the right is bad. • Again, the RL system begins another trial and performs a series of actions that result in the bicycle being tilted 40 degrees to the right. ……

  8. What is Reinforcement Learning? • RL systems learn a mapping from situations to actions by trial-and-error interactions with a dynamic environment. The "goal" of the RL system is defined using the concept of a reward function, which is the exact function of the future reinforcements (rewards) that the agent seeks to maximize. • In other words, there exists a mapping from state/action pairs to rewards; after performing an action in a given state, the RL agent receives some reward in the form of a scalar value. The RL agent learns to perform actions that maximize the sum of the rewards received when starting from some initial state and proceeding to a terminal state. [Figure: the agent–environment interaction loop. The agent observes states s0, s1, s2, …, chooses actions a0, a1, a2, …, and receives rewards r0, r1, r2, …; it seeks to maximize the discounted return r0 + γ r1 + γ² r2 + …, where 0 ≤ γ < 1.] The discount factor γ is used to exponentially decrease the weight of reinforcements received in the future.
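
A minimal sketch of the discounted return described on this slide, assuming a finite reward sequence and γ = 0.9; the function name and the example values are illustrative, not taken from the slides.

```python
# Discounted return r0 + g*r1 + g^2*r2 + ... for a finite reward sequence.
def discounted_return(rewards, gamma=0.9):
    """Weight each reward by an increasing power of the discount factor and sum."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

print(discounted_return([0, 0, 100]))  # 0 + 0.9*0 + 0.81*100 = 81.0
```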

  9. RL vs. Other Function Approximation • RL is similar in some respects to the function approximation problems discussed in other chapters. The target function to be learned in this case is a control policy π: S → A. A policy determines which action should be performed in each state; a policy is a mapping from states to actions. • The value of a state is defined as the sum of the rewards received when starting in that state and following some fixed policy to a terminal state. • The optimal policy would therefore be the mapping from states to actions that maximizes the sum of the rewards when starting in an arbitrary state and performing actions until a terminal state is reached.

  10. RL vs. Other Function Approximation • This reinforcement learning problem differs from other function approximation tasks in several important respects. • Delayed reward: In RL, a direct correspondence between states and the correct actions is not available. The trainer provides only a sequence of immediate reward values as the agent executes its actions, so the agent faces the problem of temporal credit assignment. • Exploration: The agent influences the distribution of training examples through the action sequences it chooses. The question is which experimentation strategy produces the most effective learning.
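
The slide leaves the choice of experimentation strategy open; the sketch below shows ε-greedy action selection, one common answer, purely as an illustration. The function name, the (state, action)-keyed `q_values` table, and ε = 0.1 are assumptions, not from the slides.

```python
import random

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise exploit the
    action whose current Q estimate for this state is highest."""
    if random.random() < epsilon:
        return random.choice(actions)                        # explore
    return max(actions, key=lambda a: q_values[(state, a)])  # exploit
```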

  11. The Learning Task • Ways of formulation • Markov Decision Process (MDP) • Precise Task Definition • An Example

  12. Ways of Formulation • There are many ways to formulate the problem of learning sequential control strategies. • The agent's actions may be deterministic or nondeterministic. • The agent may or may not be able to predict the next state that will result from each action. • The trainer of the agent may be an expert (who shows it examples of optimal action sequences) or the agent itself (which trains itself by performing actions of its own choice).

  13. States & Transitions [Figure: a state-transition graph in which actions a1, a2, a3 lead from state s0 to states s1, s2, s3, and actions a4–a7 lead on to states s5–s8.]

  14. Markov Decision Process [Figure: timeline of states st, st+1, st+2, …, actions at, at+1, at+2, …, and rewards rt, rt+1, rt+2, …] • Finite set of states S; set of actions A • t: discrete time step • st: the state at time t • at: the action at time t • At each discrete time step, the agent observes the state st ∈ S and chooses an action at ∈ A. • It then receives the immediate reward rt, and the state changes to st+1. • Markov assumption: st+1 = δ(st, at) and rt = r(st, at), i.e., rt and st+1 depend only on the current state and action. • The functions δ and r may be nondeterministic. • The functions δ and r are not necessarily known to the agent.
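
A minimal sketch of the deterministic case of this formulation, representing δ and r as lookup tables keyed by (state, action). The state and action names here are placeholders, not taken from the slides.

```python
# Deterministic MDP: next state and reward depend only on (state, action).
delta = {("s0", "a1"): "s1", ("s0", "a2"): "s2", ("s1", "a1"): "s0"}   # s' = delta(s, a)
reward = {("s0", "a1"): 0, ("s0", "a2"): 100, ("s1", "a1"): 0}         # r  = r(s, a)

def step(state, action):
    """One time step under the Markov assumption."""
    return delta[(state, action)], reward[(state, action)]
```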

  15. Learning Task [Figure: the same timeline of states, actions, and rewards] • Execute actions in the environment, observe the results, and • Learn a policy π: S → A from states st ∈ S to actions at ∈ A that maximizes the accumulated reward Vπ(st) = rt + γ rt+1 + γ² rt+2 + … from any starting state st, where 0 ≤ γ < 1 is the discount factor for future rewards. • The target function is π: S → A. • But there are no direct training examples of the form <s, a>. • Training examples are of the form <<s, a>, r>.

  16. State Value Function • Consider deterministic environments, namely δ(s, a) and r(s, a) are deterministic functions of s and a. • For each policy π: S → A the agent might adopt, we define an evaluation function: Vπ(s) = rt + γ rt+1 + γ² rt+2 + … = Σi=0..∞ γ^i rt+i   (13.1) where rt, rt+1, … are generated by following the policy π from start state s. • Task: learn the optimal policy π* that maximizes Vπ(s): π* = argmaxπ Vπ(s), ∀s   (13.2)
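
A sketch of evaluating Vπ(s) in a deterministic environment by rolling the policy forward and discounting, per Eq. (13.1). The table-based `delta` and `reward` (as in the earlier sketch) and the truncation horizon are assumptions for illustration.

```python
def evaluate_policy(policy, delta, reward, start, gamma=0.9, horizon=100):
    """V^pi(start): discounted sum of rewards obtained by following `policy`
    from `start` in a deterministic environment (truncated at `horizon`)."""
    state, value, discount = start, 0.0, 1.0
    for _ in range(horizon):
        if state not in policy:            # e.g. an absorbing terminal state
            break
        action = policy[state]
        value += discount * reward[(state, action)]
        discount *= gamma
        state = delta[(state, action)]
    return value
```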

  17. Action Value Function • The state value function denotes the reward for starting in state s and following policy π: Vπ(s) = rt + γ rt+1 + γ² rt+2 + … = Σi=0..∞ γ^i rt+i • The action value function denotes the reward for starting in state s, taking action a, and following policy π afterwards: Qπ(s, a) = r(s, a) + γ rt+1 + γ² rt+2 + … = r(s, a) + γ Vπ(δ(s, a))
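
The recursive relation on this slide, Qπ(s, a) = r(s, a) + γ Vπ(δ(s, a)), written directly as code; `v_pi` is an assumed dictionary of already-computed state values, alongside the table-based `delta` and `reward` from the earlier sketch.

```python
def q_value(state, action, v_pi, delta, reward, gamma=0.9):
    """Q^pi(s, a) = r(s, a) + gamma * V^pi(delta(s, a))."""
    return reward[(state, action)] + gamma * v_pi[delta[(state, action)]]
```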

  18. Optimal Value Functions • Concept of V*: V*(s) = maxπ Vπ(s) • Concept of π*: the policy that maximizes Vπ(s) for all states s: π* = argmaxπ Vπ(s), ∀s   (13.2) π*(s) = argmaxa [ r(s, a) + γ V*(δ(s, a)) ]   (13.3)
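
Eq. (13.3) says the optimal policy acts greedily with respect to V*; below is a one-function sketch of that greedy extraction, assuming the same table-based `delta` and `reward` plus a `v_star` dictionary of optimal state values.

```python
def greedy_action(state, actions, v_star, delta, reward, gamma=0.9):
    """pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]."""
    return max(actions,
               key=lambda a: reward[(state, a)] + gamma * v_star[delta[(state, a)]])
```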

  19. Example [Figure: the six-state grid world with absorbing goal state G in the upper-right corner and γ = 0.9. Left grid: the r(s, a) immediate reward values, 100 for each action entering G and 0 for all other actions. Middle grid: the resulting V*(s) values, 90, 100, 0 across the top row and 81, 90, 100 across the bottom row. Right grids: one optimal policy and all optimal policies, drawn as arrows.]
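
A hedged reconstruction of the grid world behind this figure: two rows of three cells, goal G in the upper-right corner, reward 100 for any action entering G and 0 otherwise, γ = 0.9. Running value iteration on it reproduces the V* values shown on the slide; the state numbering and the iteration count are my own choices.

```python
GAMMA = 0.9
# State layout (2 is the absorbing goal G):   0 1 2
#                                             3 4 5
delta = {
    (0, "right"): 1, (0, "down"): 3,
    (1, "left"): 0, (1, "right"): 2, (1, "down"): 4,
    (3, "up"): 0, (3, "right"): 4,
    (4, "left"): 3, (4, "up"): 1, (4, "right"): 5,
    (5, "left"): 4, (5, "up"): 2,
}
reward = {sa: (100 if nxt == 2 else 0) for sa, nxt in delta.items()}

V = {s: 0.0 for s in range(6)}      # V*(G) stays 0: no actions leave G
for _ in range(50):                 # value iteration to a fixed point
    for s in range(6):
        acts = [a for (s2, a) in delta if s2 == s]
        if acts:
            V[s] = max(reward[(s, a)] + GAMMA * V[delta[(s, a)]] for a in acts)

print(V)   # ≈ {0: 90, 1: 100, 2: 0, 3: 81, 4: 90, 5: 100}, matching the slide's V* values
```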

  20. What to Learn

  21. Q-Function

  22. Training Rule to Learn APPROXIMATE Q

  23. A simple deterministic world

  24. Q Learning for Deterministic Worlds
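
The transcript drops the body of this slide; what follows is a hedged sketch of the standard tabular Q-learning loop for a deterministic world that the title refers to: Q̂ initialized to zero, update Q̂(s, a) ← r + γ maxa' Q̂(s', a'). The `env_step` and `choose_action` helpers are hypothetical and would be supplied by the caller.

```python
from collections import defaultdict

def q_learning(actions, env_step, choose_action, start, gamma=0.9, steps=1000):
    """Tabular Q-learning sketch for a deterministic environment."""
    Q = defaultdict(float)                   # Q_hat(s, a), initialized to 0
    s = start
    for _ in range(steps):
        a = choose_action(Q, s, actions)     # e.g. epsilon-greedy (see earlier sketch)
        s_next, r = env_step(s, a)           # execute a, observe reward and new state
        Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        s = s_next
    return Q
```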

  25. Example

  26. Explore or Exploit?

  27. Homework • 13.1 • 13.2

  28. Example [Figure: the agent moves right from state s1 to state s2, toward the reward state R; the actions leaving s2 carry the current Q̂ estimates 66, 81, and 100.] Q̂(s1, aright) ← r + γ maxa' Q̂(s2, a') = 0 + 0.9 × max{66, 81, 100} = 90 [Figure: the initial Q̂(s, a) values and the Q̂(s, a) values after training, shown on the grid world with goal state G.]
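
The single update on this slide, checked numerically; the neighbor estimates 66, 81, and 100 are read from the figure, and the variable names are my own.

```python
gamma = 0.9
r = 0                                  # immediate reward for moving right from s1
q_s2 = [66, 81, 100]                   # current Q_hat(s2, a') estimates from the figure
q_s1_right = r + gamma * max(q_s2)     # Q_hat(s1, a_right) <- r + gamma * max_a' Q_hat(s2, a')
print(q_s1_right)                      # 90.0, as on the slide
```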

  29. Thank you
