Reinforcement Learning

Reinforcement Learning. Michael Roberts. With material from: Reinforcement Learning: An Introduction, Sutton & Barto (1998)

What is RL? • Trial & error learning • without model • with model • Structure [Figure: diagram of states s1, s2, s3, s4 and rewards r1, r2, r3]

RL vs. Supervised Learning • Evaluative vs. Instructional feedback • Role of exploration • On-line performance

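K-armed Bandit Problem [Figure: an agent choosing among actions, with the rewards observed for each action and their average] • Rewards 0, 0, 5, 10, 35: average 10 • Rewards 5, 10, -15, -15, -10: average -5 • Remaining values in the figure: 100, 0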

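K-armed Bandit Cont. • Greedy exploration • ε-greedy • Softmax
Average reward: $Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$
Incremental formula: $Q_{k+1} = Q_k + \alpha\,[r_{k+1} - Q_k]$, where: $\alpha = 1/(k+1)$
Probability of choosing action $a$ (softmax): $P(a) = \frac{e^{Q(a)/\tau}}{\sum_b e^{Q(b)/\tau}}$, where $\tau$ is a temperature parameter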
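The action-selection rules and the incremental average on this slide can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function names, the `epsilon` and `tau` parameters, and the choice to subtract the maximum before exponentiating (for numerical stability) are assumptions.

```python
import math
import random


def update_average(q, k, reward):
    """Incremental mean: q is the average of the first k rewards."""
    alpha = 1.0 / (k + 1)          # step size from the slide
    return q + alpha * (reward - q)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random arm, otherwise the greedy arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probs(q_values, tau=1.0):
    """Boltzmann selection probabilities; tau is an assumed temperature."""
    m = max(q_values)              # shift by the max to avoid overflow
    exps = [math.exp((q - m) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


# Feeding in the rewards 0, 0, 5, 10, 35 one at a time recovers their mean.
q = 0.0
for k, r in enumerate([0, 0, 5, 10, 35]):
    q = update_average(q, k, r)
```

With `epsilon = 0` the ε-greedy rule reduces to pure greedy selection; as `tau` grows, the softmax probabilities approach a uniform distribution.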

More General Problems • More than one state • Delayed rewards • Markov Decision Process (MDP) • Set of states • Set of actions • Reward function • State transition function • Table or Function Approximation
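The four MDP components listed above map naturally onto a small table-based data structure. The sketch below is one possible layout, not a standard API: the `MDP` class, its field names, and the two-state `toy` example (states "A" and "B", actions "stay" and "go", and all probabilities and rewards) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: list        # set of states
    actions: list       # set of actions
    # transitions[(s, a)] -> list of (probability, next_state, reward);
    # this one table holds both the transition and the reward function.
    transitions: dict

    def outcomes(self, s, a):
        """Possible results of taking action a in state s."""
        return self.transitions.get((s, a), [])

    def is_valid(self):
        """Outgoing probabilities must sum to 1 for every defined (s, a)."""
        return all(
            abs(sum(p for p, _, _ in outs) - 1.0) < 1e-9
            for outs in self.transitions.values()
        )


# Hypothetical two-state example, for illustration only.
toy = MDP(
    states=["A", "B"],
    actions=["stay", "go"],
    transitions={
        ("A", "stay"): [(1.0, "A", 0.0)],
        ("A", "go"): [(0.8, "B", 5.0), (0.2, "A", 0.0)],
        ("B", "stay"): [(1.0, "B", 1.0)],
        ("B", "go"): [(1.0, "A", 0.0)],
    },
)
```

A table like this is practical only for small state spaces; function approximation replaces the table when the states cannot be enumerated.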

Example: Recycling Robot

Recycling Robot: Transition Graph

Dynamic Programming

Backup Diagram [Figure: backup diagram with branch probabilities .25, .25, .25, .4, .6, .7, .3, .5, .5 and rewards 10, 5, 200, 200, -10, 1000]

Dynamic Programming: Optimal Policy

Backup for Optimal Policy

Performance Metrics • Eventual convergence to optimality • Speed of convergence to optimality • Regret (Kaelbling, L., Littman, M., & Moore, A. 1996)
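Regret can be made concrete with a small calculation. The sketch below takes regret to be the reward an always-optimal agent would expect over the same number of steps, minus the reward the learner actually received; the function name and the sample numbers are illustrative assumptions.

```python
def regret(received_rewards, optimal_mean_reward):
    """Expected reward of acting optimally from the start, minus reward received."""
    steps = len(received_rewards)
    return optimal_mean_reward * steps - sum(received_rewards)


# Hypothetical run: the best arm averages 10, but the learner explored.
lost = regret([5, 10, -15, 10, 10], optimal_mean_reward=10.0)
```

Unlike the two convergence criteria, this measure penalizes mistakes made at any point during learning.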

Gridworld Example

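Initialize $V$ arbitrarily, e.g. $V(s) = 0$, for all $s \in \mathcal{S}^+$
Repeat
    $\Delta \leftarrow 0$
    For each $s \in \mathcal{S}$:
        $v \leftarrow V(s)$
        $V(s) \leftarrow \max_a \sum_{s'} \mathcal{P}^a_{ss'} [\mathcal{R}^a_{ss'} + \gamma V(s')]$
        $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
until $\Delta < \theta$ (a small positive number)
Output a deterministic policy, $\pi$, such that: $\pi(s) = \arg\max_a \sum_{s'} \mathcal{P}^a_{ss'} [\mathcal{R}^a_{ss'} + \gamma V(s')]$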
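The procedure on this slide matches value iteration as described in Sutton & Barto (1998). A minimal runnable sketch follows. The table layout `transitions[(s, a)]`, the default `gamma` and `theta`, and the four-cell corridor used as a stand-in gridworld are assumptions, not the gridworld from the slide.

```python
def value_iteration(states, actions, transitions, gamma=0.9, theta=1e-8):
    """transitions[(s, a)] -> list of (probability, next_state, reward)."""
    V = {s: 0.0 for s in states}                 # initialize V arbitrarily

    def backup(s, a):
        # Expected one-step return of taking a in s, then following V.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[(s, a)])

    def available(s):
        return [a for a in actions if (s, a) in transitions]

    while True:
        delta = 0.0
        for s in states:
            acts = available(s)
            if not acts:                         # terminal state keeps value 0
                continue
            v = V[s]
            V[s] = max(backup(s, a) for a in acts)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:                        # until delta < small positive number
            break

    # Output a deterministic policy that is greedy with respect to V.
    policy = {}
    for s in states:
        acts = available(s)
        if acts:
            policy[s] = max(acts, key=lambda a: backup(s, a))
    return V, policy


# Hypothetical corridor: cells 0..3, cell 3 is terminal, reward 1 for entering it.
states = [0, 1, 2, 3]
actions = ["left", "right"]
transitions = {}
for s in (0, 1, 2):
    transitions[(s, "left")] = [(1.0, max(s - 1, 0), 0.0)]
    transitions[(s, "right")] = [(1.0, s + 1, 1.0 if s + 1 == 3 else 0.0)]

V, policy = value_iteration(states, actions, transitions)
```

The values are updated in place during each sweep, so later states in a sweep already see the new values of earlier ones.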

Temporal Difference Learning • RL without a model • Issue of: temporal credit assignment • Bootstraps like DP • TD(0): $V(s_t) \leftarrow V(s_t) + \alpha\,[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$
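In TD(0), as given in Sutton & Barto (1998), the value of the current state moves a fraction α toward the bootstrapped target: the observed reward plus the discounted value of the next state. A minimal sketch, with the function name, the dictionary-based value table, and the default `alpha` and `gamma` as assumptions:

```python
def td0_update(V, s, reward, s_next, alpha=0.5, gamma=0.9):
    """One TD(0) backup for the observed transition s -> s_next."""
    td_error = reward + gamma * V[s_next] - V[s]   # target minus current estimate
    V[s] += alpha * td_error
    return td_error


# Hypothetical two-state value table.
V = {"A": 0.0, "B": 1.0}
err = td0_update(V, "A", reward=1.0, s_next="B")
```

No transition model is needed: the update uses only the sampled transition, which is what separates TD from the DP backups earlier in the deck.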

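TD Learning • Again, TD(0): $V(s_t) \leftarrow V(s_t) + \alpha\,[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$ • TD(λ): $V(s) \leftarrow V(s) + \alpha\,\delta_t\,e_t(s)$ for all $s$, with $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$, where $e$ is called an eligibility trace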
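The eligibility trace lets a single TD error update every recently visited state, which is how TD(λ) addresses temporal credit assignment. The sketch below uses accumulating traces that decay by γλ each step, one of the variants described in Sutton & Barto (1998); the function name, the dictionary tables, and the parameter defaults are assumptions.

```python
def td_lambda_step(V, e, s, reward, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One TD(lambda) step with accumulating eligibility traces."""
    delta = reward + gamma * V[s_next] - V[s]   # same TD error as in TD(0)
    e[s] += 1.0                                 # mark the visited state
    for x in V:
        V[x] += alpha * delta * e[x]            # credit in proportion to the trace
        e[x] *= gamma * lam                     # decay every trace
    return delta


# Hypothetical episode A -> B -> C, with reward 1 on the last transition.
V = {"A": 0.0, "B": 0.0, "C": 0.0}
e = {"A": 0.0, "B": 0.0, "C": 0.0}
td_lambda_step(V, e, "A", 0.0, "B")
td_lambda_step(V, e, "B", 1.0, "C")
```

After the second step both A and B have gained value, even though the reward arrived only after leaving B. With `lam=0` the traces vanish after one step and the update reduces to TD(0).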

Backup Diagram for TD(λ)

TD-Gammon (Tesauro)

Additional Work • POMDPs • Macros • Multi-agent RL • Multiple reward structures
