
Q Learning and SARSA: Off-policy and On-policy RL


Presentation Transcript


  1. Q Learning and SARSA: Off-policy and On-policy RL -Tanya Mishra CS 5368 Intelligent Systems FALL 2010

  2. Brief Recap: • What is a Markov Decision Process? MDPs provide a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker. At each time step, the process is in some state s, and the decision maker may choose any action a that is available in state s. The process then moves randomly to a new state s', giving the decision maker a corresponding reward R_a(s, s'). The probability that the process moves to s' as its new state is influenced by the chosen action; specifically, it is given by the state transition function T_a(s, s'). Thus, the next state s' depends on the current state s and the decision maker's action a.
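
To make the pieces concrete, here is a minimal sketch of an MDP written as plain Python data. The three states, two actions, transition probabilities, and rewards are hypothetical, invented only to show where T_a(s, s') and R_a(s, s') live; they are not taken from the slides.

import random

# Hypothetical 3-state, 2-action MDP (illustration only).
STATES = ["s0", "s1", "s2"]
ACTIONS = ["left", "right"]

# T[(s, a)] is a list of (next_state, probability) pairs: the function T_a(s, s').
T = {
    ("s0", "right"): [("s1", 0.8), ("s0", 0.2)],
    ("s0", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("s2", 0.9), ("s1", 0.1)],
    ("s1", "left"):  [("s0", 1.0)],
    ("s2", "right"): [("s2", 1.0)],
    ("s2", "left"):  [("s1", 1.0)],
}

# R[(s, a, s')] is the immediate reward R_a(s, s'); here, 1 for reaching s2, else 0.
R = {key + (s_next,): (1.0 if s_next == "s2" else 0.0)
     for key, outcomes in T.items() for s_next, _ in outcomes}

def step(s, a):
    # Sample s' from T_a(s, .) and return it together with the reward R_a(s, s').
    next_states, probs = zip(*T[(s, a)])
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[(s, a, s_next)]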

  3. What is Reinforcement Learning? Reinforcement learning is an area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of reward. The environment is typically formulated as an MDP. An optimal policy is a policy that maximizes the expected reward/reinforcement/feedback from a state. Thus, the task of RL is to use observed rewards to find an optimal policy for the environment.

  4. The Three Agents: • Utility-based agent: A utility-based agent learns a utility function on states and selects actions that maximize the expected outcome utility. (Model-based) • The Q-learning agent: It learns an action-utility function, or Q-function, giving the expected utility of taking a given action in a given state. (Model-free) • Reflex agent: The agent function is based on the condition-action rule: if condition then action. Model-based agents can handle partially observable environments: the current state is stored inside the agent, which maintains some kind of structure describing the part of the world that cannot be seen.

  5. Passive Learning: The agent's policy is fixed and the task is to learn how good that policy is, i.e. to learn the utility function U. • Active Learning: Here the agent must also learn what actions to take. The principal issue is “exploration”: the agent must experience as much as possible of the environment in order to learn how to behave in it. In “exploitation”, the agent maximizes its reward as reflected in its current utility estimates. • Passive RL - Temporal Difference Learning: TD resembles a Monte Carlo method because it learns by sampling the environment according to some policy. TD is related to dynamic programming techniques because it approximates its current estimate based on previously learned estimates.
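
As a small illustration of passive TD learning, here is a sketch of a TD(0) utility update under a fixed policy. The names policy, step, alpha, gamma, and horizon are assumptions made for this example (step could be the sampler from the MDP sketch above); only the update line itself comes from the TD rule.

from collections import defaultdict

def td0_evaluate(policy, step, start_state, episodes=1000,
                 alpha=0.1, gamma=0.9, horizon=50):
    # Passive learning: the policy is fixed; we only estimate U(s) for it.
    U = defaultdict(float)                    # utility estimates, initially 0
    for _ in range(episodes):
        s = start_state
        for _ in range(horizon):
            a = policy(s)                     # the fixed policy chooses the action
            s_next, r = step(s, a)            # observe reward and next state
            # TD(0): nudge U(s) toward the sampled target r + gamma * U(s')
            U[s] += alpha * (r + gamma * U[s_next] - U[s])
            s = s_next
    return U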

  6. Active RL: Learning the Q-Function • Q(s,a): the value of doing action a in state s. Q-values are related to utility values as follows: U(s) = max_a Q(s,a). • A TD agent that learns a Q-function does not need a model of the form P(s'|s,a), either for learning or for action selection. For this reason, Q-learning is model-free. • The update equation for TD Q-learning is: Q(s,a) ← Q(s,a) + α [R(s) + γ max_a' Q(s',a') − Q(s,a)], which is applied whenever action a is executed in state s leading to state s'.
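
The relation U(s) = max_a Q(s,a) is a one-liner if Q is stored as a table keyed by (state, action), as in the sketches above (an assumed representation, not prescribed by the slides):

def utility(Q, s, actions):
    # U(s) = max over the available actions a of Q(s, a)
    return max(Q[(s, a)] for a in actions)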

  7. Algorithm: • The problem model: an agent, a set of states S, and a number of actions per state A. • By performing an action a, where a is in A, the agent can move from state to state. • Each state provides the agent a reward (a positive reward) or a punishment (a negative reward). • The goal of the agent is to maximize its total reward. • It does this by learning which action is optimal for each state.

  8. The algorithm therefore has a function which calculates the quality of a state-action combination: Q: S × A → R. • Before learning has started, Q returns a fixed value, chosen by the designer. • Then, each time the agent is given a reward (the state has changed), new values are calculated for each combination of a state s from S and an action a from A. • The core of the algorithm is a simple “value iteration update”: it takes the old value and makes a correction based on the new information.

  9. Q(s,a) ← Q(s,a) + α [R(s) + γ max_a' Q(s',a') − Q(s,a)], where: Q(s,a) on the RHS is the old value; α is the learning rate; R(s) is the reward; γ is the discount factor; max_a' Q(s',a') is the maximum future value; and R(s) + γ max_a' Q(s',a') is the expected discounted reward.
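
As a sketch, the update rule translates almost line for line into code. The table initialization mirrors slide 8's fixed, designer-chosen starting value; alpha, gamma, and the action set are assumed parameters for illustration.

from collections import defaultdict

def make_q_table(initial_value=0.0):
    # Before learning starts, Q returns a fixed value chosen by the designer.
    return defaultdict(lambda: initial_value)

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Q(s,a) <- Q(s,a) + alpha * (R(s) + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])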

  10. Learning rate • The learning rate determines to what extent the newly acquired information will override the old information. A factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information. Discount factor • The discount factor determines the importance of future rewards. A factor of 0 will make the agent "opportunistic" by only considering current rewards, while a factor approaching 1 will make it strive for a long-term high reward.
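
One way to see the learning rate's role: the update can be rewritten as a weighted average of the old value and the new target, (1 − α) · old + α · target, which is algebraically the same formula. A tiny check with made-up numbers:

def blended_update(q_old, target, alpha):
    # Same update, rewritten: a weighted average of the old value and the new target.
    return (1 - alpha) * q_old + alpha * target

print(blended_update(2.0, 5.0, 0.0))   # 2.0 -> alpha = 0: the agent learns nothing
print(blended_update(2.0, 5.0, 1.0))   # 5.0 -> alpha = 1: only the newest information counts
print(blended_update(2.0, 5.0, 0.1))   # 2.3 -> a small step toward the target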

  11. SARSA • SARSA (state-action-reward-state-action) uses an update equation very similar to the one we just saw: Q(s,a) ← Q(s,a) + α [R(s) + γ Q(s',a') − Q(s,a)], • where a' is the action actually taken in state s'. • The rule is applied at the end of each s, a, r, s', a' quintuple. • Difference from Q-learning: Q-learning backs up the best Q-value from the state reached, while SARSA waits until an action is actually taken and then backs up the Q-value of that action.
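
A sketch of one SARSA episode makes the s, a, r, s', a' pattern explicit. The step function, the epsilon-greedy action selection, and all parameter values are assumptions for illustration; only the update line is the SARSA rule itself.

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # Exploration vs. exploitation: mostly greedy, occasionally a random action.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_episode(Q, step, start_state, actions,
                  alpha=0.1, gamma=0.9, epsilon=0.1, horizon=50):
    s = start_state
    a = epsilon_greedy(Q, s, actions, epsilon)
    for _ in range(horizon):
        s_next, r = step(s, a)                                 # observe r and s'
        a_next = epsilon_greedy(Q, s_next, actions, epsilon)   # choose a' before updating
        # SARSA backs up the Q-value of the action actually taken in s':
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next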

  12. On-Policy vs. Off-Policy • Q-learning does not pay attention to what policy is being followed; instead, it just uses the best Q-value. Thus, it is an off-policy learning algorithm. • It is called off-policy because the policy being learned can be different from the policy being executed. • SARSA is an on-policy learning algorithm. It updates value functions strictly on the basis of the experience gained from executing some (possibly non-stationary) policy.

  13. Can Q-Learning be done On-Policy? • No, but we can use similar techniques. I can offer two ways of thinking to support this: • SARSA is identical to Q-learning except that it waits for an action to be taken and then backs up the Q-value of that action. For a greedy agent that always takes the action with the best Q-value, the two are identical. • We can use QV-learning as an on-policy counterpart of Q-learning. QV-learning is a natural extension of Q-learning and SARSA.

  14. QV-learning is an on-policy learning algorithm and it learns Q-values. Instead of using the maximum future value, it uses the state value function V of the state at time t+1.
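
Based on my reading of reference [2], the QV-learning updates can be sketched as the pair below: Q is updated toward a target that uses the learned state value V(s') rather than max_a' Q(s', a'), and V is updated by ordinary TD(0). The separate learning rate beta and the table representations are assumptions of this sketch, not details given in the slides.

def qv_update(Q, V, s, a, r, s_next, alpha=0.1, beta=0.1, gamma=0.9):
    # Both targets use the learned state value V(s') instead of max_a' Q(s', a').
    target = r + gamma * V[s_next]
    Q[(s, a)] += alpha * (target - Q[(s, a)])   # Q update toward the shared target
    V[s] += beta * (target - V[s])              # TD(0) update of the state values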

  15. References • [1] Artificial Intelligence: A Modern Approach (Third Edition) by Stuart Russell and Peter Norvig. • [2] QV(λ)-learning: A New On-policy Reinforcement Learning Algorithm by Marco A. Wiering. • [3] www.wikipedia.com

  16. Thank You!
