Reinforcement Learning

AI – Week 22
Sub-symbolic AI Two: An Introduction to Reinforcement Learning
Lee McCluskey, room 3/10
Email lee@hud.ac.uk
http://scom.hud.ac.uk/scomtlm/cha2555/

Resources
Support resources:
• Introduction (5 minutes): http://www.youtube.com/watch?v=m2weFARriE8
• See the first 10-20 minutes of this one: http://www.youtube.com/watch?v=ifma8G7LegE
• Longer video sequence (homework): http://videolectures.net/mlss08au_szepesvari_rele/
To read:
• http://www.nbu.bg/cogs/events/2000/Readings/Petrov/rltutorial.pdf

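Reinforcement Learning
• Reinforcement learning is defined by characterizing a learning problem, not by characterizing learning methods. Any method that is well suited to solving that problem counts as a reinforcement learning method.
• Reinforcement learning differs from supervised learning, the kind of learning studied in most current research in machine learning, statistical pattern recognition, and artificial neural networks. Supervised learning learns from labelled examples supplied by an external supervisor, whereas a reinforcement learning agent learns from the rewards produced by its own interaction with the environment.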

Definition of Terms
• Policy
• Reward function
• Value function
• Model of the environment (optional)

Policy
• A policy defines the learning agent's way of behaving at a given time. It is a mapping from perceived states of the environment to actions to be taken when in those states.
• It corresponds to what in psychology would be called a set of stimulus-response rules or associations.
• The policy is the core of any reinforcement learning agent.
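To make the idea of a policy as a state-to-action mapping concrete, here is a minimal sketch in Python. The state and action names are hypothetical, invented for illustration only; a real agent would usually learn this table rather than have it written by hand.

```python
# Hypothetical states and actions for a tiny corridor world (illustration only).
policy = {
    "left_end": "move_right",
    "middle": "move_right",
    "right_end": "stay",
}

def act(state):
    """Return the action the policy prescribes for a perceived state."""
    return policy[state]

print(act("middle"))  # move_right
```

A table like this is the simplest form of stimulus-response rule: one perceived state in, one action out.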

Reward Function
• A reward function defines the goal in a reinforcement learning problem.
• It maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state.
• A reinforcement learning agent's sole objective is to maximize the accumulated reward over a period of time.

Value Function
• The value of a state is the total reward an agent can expect to accumulate over the future, starting from that state. The function that gives this value for every state is known as the value function.
• Whereas a reward function indicates what is good in an immediate sense, a value function specifies what is good in the long run.
• Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow, and the rewards available in those states.
• Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward.
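The difference between immediate reward and long-run value can be illustrated with a short Python sketch. It sums the rewards received from each step onward along one sample episode; a true value is the expectation of such sums, so this is only a single-sample estimate. The reward sequence and the `discount` parameter are assumptions of this sketch, not part of the slides.

```python
def returns_to_go(rewards, discount=1.0):
    """For each time step, sum the rewards received from that step onward.

    discount < 1 weights later rewards less; it is an assumption of this sketch.
    """
    values = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the total already computed for the next one.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        values[t] = running
    return values

# Hypothetical episode: two neutral steps, one small penalty, then a large reward.
rewards = [0.0, 0.0, -1.0, 10.0]
print(returns_to_go(rewards))  # [9.0, 9.0, 9.0, 10.0]
```

The third step has a negative immediate reward but a high value, because it leads to the large reward that follows.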

Model of the Environment
• A model represents and mimics the behaviour of the environment.
• Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced.
• Models are optional.
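As a rough illustration of planning with a model, the Python sketch below looks ahead through a small deterministic model before any action is actually taken. The states, actions, and rewards are hypothetical, and exhaustive lookahead is just one simple way to plan, chosen here for brevity.

```python
# Hypothetical deterministic model: (state, action) -> (next_state, reward).
model = {
    ("start", "left"): ("pit", -10.0),
    ("start", "right"): ("hall", 0.0),
    ("hall", "left"): ("start", 0.0),
    ("hall", "right"): ("goal", 10.0),
}

def plan(state, depth):
    """Return (best_total_reward, best_first_action) by exhaustive lookahead."""
    if depth == 0:
        return 0.0, None
    best_total, best_action = float("-inf"), None
    for (s, action), (next_state, reward) in model.items():
        if s != state:
            continue
        # Imagine the future using the model instead of experiencing it.
        future, _ = plan(next_state, depth - 1)
        total = reward + future
        if total > best_total:
            best_total, best_action = total, action
    if best_action is None:  # no actions known from this state (terminal)
        return 0.0, None
    return best_total, best_action

print(plan("start", 2))  # (10.0, 'right')
```

An agent without a model would have to fall into the pit at least once to learn that "left" is a bad choice; with a model it can rule that out in advance.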

General Idea
[Figure: agent-environment loop, with labels Environment, s, SENSE, EFFECT]
Represent the world as an agent interacting with the environment:
• Conduct “trial and error” or “sampling” experiments in order to achieve some goal.
• Sense a reward (or negative reinforcement) as a result of some behaviour that moves towards (or away from) the goal.
• Add more weight (or less weight) to using that behaviour and continue the trials.
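The trial-and-error loop described above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the two behaviour names, the rewards of +1 and -1, the step size, and the rule of choosing behaviours in proportion to their weights.

```python
import random

def environment(behaviour):
    """Hypothetical environment: one behaviour moves towards the goal, the other away."""
    return 1.0 if behaviour == "towards_goal" else -1.0

def run_trials(n_trials, step=0.1, seed=0):
    """Repeatedly try a behaviour, sense the reward, and adjust that behaviour's weight."""
    rng = random.Random(seed)
    weights = {"towards_goal": 1.0, "away_from_goal": 1.0}
    for _ in range(n_trials):
        behaviours = list(weights)
        # Trial: pick a behaviour with probability proportional to its weight.
        chosen = rng.choices(behaviours, weights=[weights[b] for b in behaviours])[0]
        # Sense: the environment returns a reward or a negative reinforcement.
        reward = environment(chosen)
        # Adjust: add (or remove) weight, keeping a small floor so the
        # behaviour can still be sampled occasionally.
        weights[chosen] = max(0.1, weights[chosen] + step * reward)
    return weights

weights = run_trials(200)
print(weights["towards_goal"] > weights["away_from_goal"])  # True
```

Over repeated trials the rewarded behaviour gains weight and is chosen more and more often, which is the core of the idea on this slide.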

RL – the idea is pervasive
[Figure: labels REWARD, ACTION]

Output of RL
The goal of RL is to learn a mapping SITUATION => ACTION which optimises the rewards obtained.
Note the connection with AI Planning:
• Situation = Goal, State, Actions
• Input to MetricFF, output solution
• Action is head(solution). Assuming the solution is optimal, this is the best action to take.
RL “comes into its own” when the conditions under which we can use a planner are not met, e.g. a partially observable state, or actions that are not well specified.

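Challenges of RL
• One of the challenges that arises in reinforcement learning is the trade-off between exploration and exploitation. To obtain reward the agent must prefer actions it has already found to be effective (exploitation), but to discover such actions it must try actions it has not selected before (exploration).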
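One common way to balance exploration and exploitation is an epsilon-greedy rule, sketched below in Python. The slides do not name a specific method, so this choice is an assumption, as are the three noise-free reward values and the value of `epsilon`.

```python
import random

def epsilon_greedy(arm_rewards, n_steps=1000, epsilon=0.1, seed=0):
    """Estimate the reward of each action while mostly choosing the best-looking one."""
    rng = random.Random(seed)
    n_arms = len(arm_rewards)
    estimates = [0.0] * n_arms  # current reward estimate for each action
    counts = [0] * n_arms       # how often each action has been tried
    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try any action at random
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = arm_rewards[arm]  # hypothetical noise-free reward
        counts[arm] += 1
        # Incremental average of the rewards seen so far for this action.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = epsilon_greedy([0.2, 0.5, 0.8])
best = max(range(len(estimates)), key=lambda a: estimates[a])
print(best, counts)
```

With `epsilon=0` the agent would settle on the first action that ever paid anything and never discover the better ones; with `epsilon=1` it would never use what it has learned.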

Tic-Tac-Toe

Tic-Tac-Toe
• Although it might look like a simple problem, it cannot readily be solved in a satisfactory way through classical techniques.
• For example, the classical "minimax" solution from game theory is not correct in this case because it assumes a particular way of playing by the opponent.

Tic-Tac-Toe
• This example has a relatively small, finite state set, whereas reinforcement learning can be used when the state set is very large, or even infinite.
• For example, Gerry Tesauro (1992, 1995) combined temporal-difference learning with an artificial neural network to learn to play backgammon, which has approximately 10^20 states. With this many states it is impossible ever to experience more than a small fraction of them.
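The slides do not spell out a learning rule for tic-tac-toe. A minimal Python sketch of the kind of value-table update often used for this example (a temporal-difference update, which moves the value of a state a little towards the value of the state that followed it) might look like this. The state labels, the default value of 0.5, and the step size are assumptions; a real player would use board positions as states.

```python
def td_update(values, state, next_state, step_size=0.1, default=0.5):
    """Move the value of `state` a fraction of the way towards the value of `next_state`."""
    v = values.get(state, default)
    v_next = values.get(next_state, default)
    values[state] = v + step_size * (v_next - v)
    return values[state]

def replay(values, episode, n_times, step_size=0.1):
    """Apply the update along the same sequence of states, n_times over."""
    for _ in range(n_times):
        for state, next_state in zip(episode[:-1], episode[1:]):
            td_update(values, state, next_state, step_size)
    return values

# Hypothetical labels standing in for board positions; a win is worth 1, a loss 0.
values = {"X_wins": 1.0, "O_wins": 0.0}
replay(values, ["opening", "midgame", "X_wins"], n_times=200)
print(round(values["opening"], 3), round(values["midgame"], 3))
```

States that reliably lead to a win drift towards a value of 1. A lookup table like this is feasible for tic-tac-toe but not for backgammon, which is why a neural network was used there to generalise across states.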

Summary
• Reinforcement learning uses a formal framework defined in terms of states, actions, and rewards.
• The concepts of value and value functions, and of maximising value, are the key features of reinforcement learning methods.
• Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making.
