Reinforcement Learning. Vimal. The learner here is a decision-making agent that keeps making decisions to take actions in the environment and receives rewards (or penalties). Over a set of trial-and-error runs, the agent is expected to learn the best policy, the one that maximizes the total reward.
Example: Playing Chess • Could a supervised learner play the game? • We would need a costly teacher to teach the game • In many cases, there is no single best move • The goodness of a move may depend on the moves that follow • A sequence of moves is good if we win
Example: Robot in a maze • Task: the robot can move in one of four compass directions and must find a good sequence of moves to reach the exit. • What is the best move here? • As long as the robot is in the maze, there is no feedback. • There is no opponent here (except the environment itself) • We may play against time?
From these examples • A decision maker (agent) is placed in an environment • The decision maker should learn to make decisions • At any time, the environment is in a certain state (one of many states) • The decision maker has a set of actions available • An action taken by the agent changes the state of the environment • There is a reward (or penalty) for choosing an action in a state • At times, rewards come late, after the complete sequence of actions has been carried out
The learner here is the agent (learning to make decisions) • It keeps making decisions to take actions in the environment and receives rewards (or penalties) • Over a set of trial-and-error runs, the agent is expected to learn the best policy, the one that maximizes the total reward.
Reinforcement Learning vs. Supervised Learning
Supervised Learning (learning with a teacher) • Reinforcement Learning (learning with a critic) • The critic can tell us how well we have been doing in the past • The critic never tells us anything in advance • The agent learns an internal value for intermediate states, reflecting how good they are on the path leading to the goal and the real reward • With these values, the agent can learn to take local actions that work towards maximizing rewards.
The Big Picture • Your action influences the state of the world, which determines the reward you receive
Let's consolidate… • The goal is to learn successful control policies by experimenting in the environment • Perform sequences of actions, observe their consequences, learn a control policy • Control policy: a policy that chooses actions that maximize the reward accumulated over time by the agent from a given initial state • Control policy, π : S → A
Episode (Trial) • The sequence of actions that takes us from an initial state to a final state • There may or may not be designated start states, depending on the nature of the problem • Trials are repeated (in principle, infinitely many times) to arrive at a policy (we will come to it later)
Markov Decision Process (MDP) • An MDP is a formal model of the RL problem • At each discrete time point t: • The agent observes state $s_t$ and chooses action $a_t$ • It receives reward $r_t$ from the environment, and the state changes to $s_{t+1}$ • Markov assumption: $r_t = r(s_t, a_t)$ and $s_{t+1} = \delta(s_t, a_t)$, i.e. $r_t$ and $s_{t+1}$ depend only on the current state and action • In general, the functions r and δ may be nondeterministic and are not necessarily known to the agent
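The interaction loop above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the three-state corridor, the action names, and the reward of 100 for entering the goal are all assumptions chosen for the example.

```python
# Hypothetical deterministic world: a corridor s0 - s1 - G, where G is the goal.
STATES = ["s0", "s1", "G"]
ACTIONS = ["left", "right"]

def delta(s, a):
    """Transition function: the next state depends only on (s, a)."""
    if s == "G":                      # the goal is absorbing
        return "G"
    i = STATES.index(s)
    j = min(i + 1, 2) if a == "right" else max(i - 1, 0)
    return STATES[j]

def r(s, a):
    """Reward function: 100 for the move that enters G, otherwise 0."""
    return 100 if s != "G" and delta(s, a) == "G" else 0

def run_episode(policy, s="s0", max_steps=10):
    """Roll out one episode and record (s_t, a_t, r_t) at each time step."""
    trace = []
    for _ in range(max_steps):
        if s == "G":
            break
        a = policy(s)                 # agent observes s_t and chooses a_t
        trace.append((s, a, r(s, a))) # environment returns r_t
        s = delta(s, a)               # and moves to s_{t+1}
    return trace

trace = run_episode(lambda s: "right")
```

Here the policy is simply "always go right"; any function from states to actions could be passed instead.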
Agent’s Learning Task • Learn an action policy π that produces the greatest possible cumulative reward for the agent over time • The cumulative reward starting from a state $s_t$ is $V^{\pi}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$ • Here $\gamma$ ($0 \le \gamma < 1$) is the discount factor for future rewards: delayed rewards are discounted exponentially
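As a small worked illustration of exponential discounting, the sketch below sums a finite reward sequence with each reward weighted by a power of the discount factor. The function name, the value 0.9, and the reward sequence are assumptions made for the example.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_{t+i} over a finite sequence of rewards."""
    return sum((gamma ** i) * reward for i, reward in enumerate(rewards))

# A reward of 100 that arrives two steps late is worth 0.9**2 * 100 now.
late = discounted_return([0, 0, 100])
immediate = discounted_return([100])
```

The later the same reward arrives, the less it contributes, so `late` is smaller than `immediate`.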
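Alternative rewards • Finite horizon reward: $\sum_{i=0}^{h} r_{t+i}$ • Average reward (over the entire lifetime): $\lim_{h \to \infty} \frac{1}{h} \sum_{i=0}^{h} r_{t+i}$
Agent’s Learning Task (contd) • Optimal policy: $\pi^* = \arg\max_{\pi} V^{\pi}(s)$ for all $s$ • We use $V^*(s)$ to denote the value function of the optimal policy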
Example: Grid world environment • Six possible states • Arrows represent possible actions • G: goal state • One optimal policy, denoted π*: the best thing to do in each state • Compute the values of the states for this policy, denoted V*
Example: TD-Gammon • Immediate reward: +100 for a win, -100 for a loss, 0 for all other states • Trained by playing 1.5 million games against itself • Now plays approximately as well as the best human player
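Value function • We will consider deterministic worlds first • Given a policy π (adopted by the agent), define an evaluation function over states: $V^{\pi}(s_t) = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$ • Property: $V^{\pi}(s_t) = r_t + \gamma V^{\pi}(s_{t+1})$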
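The recursive property (the value of a state is the immediate reward plus the discounted value of the next state) can be used directly to evaluate a fixed policy in a deterministic world. The sketch below assumes a hypothetical four-state corridor, a policy that always moves towards the goal G, a reward of 100 for entering G, and a discount factor of 0.9; none of these come from the slides.

```python
GAMMA = 0.9

# Under the fixed (assumed) policy, each state has one successor and one reward.
NEXT = {"s0": "s1", "s1": "s2", "s2": "G"}   # next state under the policy
REWARD = {"s0": 0.0, "s1": 0.0, "s2": 100.0} # immediate reward under the policy

def value(s):
    """Value of s under the policy: immediate reward + discounted next value."""
    if s == "G":                      # no further reward once the goal is reached
        return 0.0
    return REWARD[s] + GAMMA * value(NEXT[s])

values = {s: value(s) for s in ["s0", "s1", "s2", "G"]}
```

Up to floating-point rounding, the states one, two and three moves from the goal receive values of 100, 90 and 81.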
Example: Grid world environment • Six possible states • Arrows represent possible actions • G: goal state • One optimal policy, denoted π*: what is the best thing to do in each state? • Compute the values of the states for this policy, denoted V*
The Q Function • Q(s, a) is the maximum discounted cumulative reward that can be achieved starting from state s and applying action a as the first action: $Q(s, a) = r(s, a) + \gamma V^*(\delta(s, a))$ • The optimal policy now is $\pi^*(s) = \arg\max_{a} Q(s, a)$
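To make the definition concrete, the sketch below computes Q for every state-action pair of a small world from known r, δ and V*, then reads off the greedy policy. The corridor, its rewards, the discount factor 0.9, and the V* values are assumptions for illustration only.

```python
GAMMA = 0.9
ACTIONS = ["left", "right"]

# Hypothetical corridor s0 - s1 - G; entering G pays 100.
DELTA = {("s0", "left"): "s0", ("s0", "right"): "s1",
         ("s1", "left"): "s0", ("s1", "right"): "G"}
R = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
     ("s1", "left"): 0.0, ("s1", "right"): 100.0}
V_STAR = {"s0": 90.0, "s1": 100.0, "G": 0.0}   # optimal values for this world

def q(s, a):
    """Immediate reward for (s, a) plus the discounted optimal value that follows."""
    return R[(s, a)] + GAMMA * V_STAR[DELTA[(s, a)]]

def greedy(s):
    """Optimal action in s: the one with the largest Q value."""
    return max(ACTIONS, key=lambda a: q(s, a))

policy = {s: greedy(s) for s in ["s0", "s1"]}
```

Note that the largest Q value in each state equals V* of that state, which is why a table of Q values is enough to act optimally.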
Why do we need the Q function? • If the agent learns the Q function instead of the V* function, it can select optimal actions even when it has no knowledge of the functions r and δ • The Q function enables decisions to be made without lookahead
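Q Learning • Now, let $\hat{Q}$ denote the agent’s current approximation to Q • Consider the iterative update rule $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$, where s’ is the state reached by taking action a in state s • Under some assumptions (every pair <s,a> visited infinitely often), $\hat{Q}$ will converge to the true Q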
Q Learning algorithm (in deterministic worlds)
• For each (s, a), initialise the table entry $\hat{Q}(s, a) \leftarrow 0$
• Observe the current state s
• Do forever:
  • Select an action a and execute it
  • Receive immediate reward r
  • Observe the new state s’
  • Update the table entry as follows: $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
  • s := s’
Example: updating Q, given the Q values from a previous iteration (shown on the arrows)
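The table-based algorithm above can be sketched as follows. This is an illustration under stated assumptions rather than the slides' own code: the four-state corridor, the reward of 100 for entering the goal G, the discount factor 0.9, purely random action selection, and restarting from a random state whenever G is reached (a finite number of episodes stands in for "do forever") are all choices made for the example.

```python
import random

GAMMA = 0.9
ACTIONS = ["left", "right"]
STATES = ["s0", "s1", "s2", "G"]   # hypothetical corridor; G is the absorbing goal

def delta(s, a):
    """Deterministic transition: step left or right, staying inside the corridor."""
    if s == "G":
        return "G"
    i = STATES.index(s)
    return STATES[i + 1] if a == "right" else STATES[max(i - 1, 0)]

def reward(s, a):
    """100 for the move that enters G, 0 otherwise."""
    return 100.0 if s != "G" and delta(s, a) == "G" else 0.0

def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    # Initialise every table entry to zero.
    q_hat = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):              # finite stand-in for "do forever"
        s = rng.choice(STATES[:-1])        # observe a (random) start state
        while s != "G":
            a = rng.choice(ACTIONS)        # select an action and execute it
            r = reward(s, a)               # receive the immediate reward
            s_next = delta(s, a)           # observe the new state
            # Deterministic update: reward plus discounted best next estimate.
            q_hat[(s, a)] = r + GAMMA * max(q_hat[(s_next, b)] for b in ACTIONS)
            s = s_next
    return q_hat

q_hat = q_learning()
```

The agent never reads `delta` or `reward` as functions; it only sees the reward and next state returned for the action it actually took, which is the point of learning Q rather than V*.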
Maze example (states in the figure: Start, S2, S4, S3, S8, S7, S5, Goal) • Arrows indicate the associative strength between two problem states; associative strength = line width
Start maze …
The first response leads to S2 … The next state is chosen by randomly sampling from the possible next states, weighted by their associative strength
Suppose the randomly sampled response leads to S3 …
At S3, choices lead to either S2, S4, or S7. S7 was picked (randomly)
By chance, S3 was picked next…
Next response is S4
And S5 was chosen next (randomly)
And the goal is reached …
The goal is reached: strengthen the associative connection between the goal state and the last response. Next time S5 is reached, part of the associative strength is passed back to S4...
Start maze again…
Let’s suppose that after a couple of moves, we end up at S5 again
S5 is likely to lead to the Goal through the strengthened route. In reinforcement learning, strength is also passed back to the last state. This paves the way for the next time going through the maze
The situation after lots of restarts …
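The sampling step in this walkthrough (choose the next state at random, weighted by associative strength) can be sketched as below. The numeric strengths are made up for illustration; only the three options available from S3 are taken from the walkthrough.

```python
import random

def sample_next(strengths, rng):
    """Pick a successor state with probability proportional to its strength."""
    states = list(strengths)
    weights = [strengths[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

rng = random.Random(0)

# From S3 the options are S2, S4 and S7; equal (assumed) strengths at the start.
strengths_from_s3 = {"S2": 1.0, "S4": 1.0, "S7": 1.0}
first_pick = sample_next(strengths_from_s3, rng)

# After many restarts one connection dominates (assumed values).
trained = {"S2": 0.0, "S4": 5.0, "S7": 0.0}
later_pick = sample_next(trained, rng)
```

With equal strengths every neighbour is equally likely; once the strength of one connection dominates, that route is taken almost every time.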
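Convergence Theorem: Q Learning • In a deterministic world with bounded rewards and $0 \le \gamma < 1$, if each state-action pair is visited infinitely often, then $\hat{Q}_n(s, a)$, the estimate after n updates, converges to $Q(s, a)$ as n tends to infinity, for all s, a. • Proof: TB-1 Page 379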
Exploration versus Exploitation • The Q-learning algorithm doesn’t say how we should choose an action • If we always choose the action that maximises our estimate of Q, we could end up never exploring better alternatives • To converge on the true Q values we must favour actions with higher estimated Q values but still leave a chance of choosing actions with lower estimated Q values, for exploration • An action selection function of the following form may be employed, where k>0: $P(a_i \mid s) = \frac{k^{\hat{Q}(s, a_i)}}{\sum_j k^{\hat{Q}(s, a_j)}}$
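One way to realise such a selection function is to weight each action by k raised to its estimated Q value and then sample. The sketch below is an assumption-laden illustration: the function names, the default k = 2, and the example Q estimates are not from the slides.

```python
import random

def action_probabilities(q_values, k=2.0):
    """Probability of each action, proportional to k ** (estimated Q value)."""
    weights = {a: k ** q for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def select_action(q_values, k=2.0, rng=random):
    """Sample one action according to action_probabilities."""
    probs = action_probabilities(q_values, k)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]

estimates = {"left": 0.0, "right": 1.0}          # hypothetical Q estimates
probs = action_probabilities(estimates, k=2.0)   # right is twice as likely as left
uniform = action_probabilities(estimates, k=1.0) # k = 1 ignores the estimates
choice = select_action(estimates, rng=random.Random(0))
```

Larger values of k push the choice towards the action with the highest estimate (exploitation), while k close to 1 makes all actions nearly equally likely (exploration). For large Q values, `k ** q` can overflow, so a practical version would rescale the estimates first.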
Nondeterministic case • What if the reward and the state transition are not deterministic? For example, in Backgammon, learning and playing depend on rolls of the dice! • Then V and Q need to be redefined by taking expected values • Similar reasoning and a convergent update iteration will apply • Will continue next week.
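For reference, the redefinition by expected values is commonly written as follows. This is a sketch, not taken from the slides: $\gamma$ is the discount factor, and $P(s' \mid s, a)$, the probability of reaching $s'$ after taking action $a$ in state $s$, is a symbol introduced here.

```latex
V^{\pi}(s_t) = E\left[ \sum_{i=0}^{\infty} \gamma^{i} r_{t+i} \right]
\qquad
Q(s, a) = E\left[ r(s, a) \right] + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')
```

When every transition has probability 0 or 1 and rewards are fixed, both expressions reduce to the deterministic definitions.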
Summary • Reinforcement learning is suitable for learning in uncertain environments where rewards may be delayed and subject to chance • The goal of a reinforcement learning program is to maximise the eventual reward • Q-learning is a form of reinforcement learning that doesn’t require that the learner has prior knowledge of how its actions affect the environment