
The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Presentation Transcript


  1. The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks Gerhard Neumann, Master Thesis 2005, Institut für Grundlagen der Informationsverarbeitung (IGI) www.igi.tu-graz.ac.at/ril-toolbox

  2. Master Thesis: • Reinforcement Learning Toolbox • General Software Tool for Reinforcement Learning • Benchmark tests of Reinforcement Learning algorithms on three Optimal Control Problems • Pendulum Swing Up • Cart-Pole Swing Up • Acro-Bot Swing Up www.igi.tu-graz.ac.at/ril-toolbox

  3. RL Toolbox: Features • Software: • C++ Class System • Open Source / Non Commercial • Homepage: www.igi.tu-graz.ac.at/ril-toolbox • Class Reference, Manual • Runs under Linux and Windows • > 40,000 lines of code, > 250 classes www.igi.tu-graz.ac.at/ril-toolbox

  4. RL Toolbox: Features • Learning in discrete or continuous State Space • Learning in discrete or continuous Action Space • Different kinds of Learning Algorithms • TD(λ) learning • Actor-critic learning • Dynamic Programming, Model based learning, planning methods • Continuous time RL • Policy search algorithms • Residual / Residual-gradient algorithms • Use Different Function Approximators • RBF-Networks • Linear Interpolation • CMAC-Tile Coding • Feed Forward Neural Networks • Learning from other (self coded) Controllers • Hierarchical Reinforcement Learning www.igi.tu-graz.ac.at/ril-toolbox

  5. Structure of the Learning System • The Agent and the environment • The agent tells the environment which action to execute; the environment performs the internal state transitions • The environment defines the learning problem www.igi.tu-graz.ac.at/ril-toolbox

  6. Structure of the learning system • Linkage to the learning algorithms • All algorithms need the transition tuple <s_t, a_t, s_{t+1}> for learning • The algorithms are implemented as listeners (see the sketch below) • The algorithms adapt the agent's controller to learn the optimal policy • The agent informs all listeners about each step and about the start of a new episode www.igi.tu-graz.ac.at/ril-toolbox
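The listener mechanism described on this slide can be illustrated with a small, self-contained C++ sketch. All class and method names below are hypothetical and chosen for illustration; they are not the actual RL Toolbox API.

```cpp
#include <vector>

// Hypothetical illustration of the listener pattern described above;
// class and method names are invented, not the real RL Toolbox API.
struct State  { std::vector<double> values; };
struct Action { std::vector<double> values; };

// Every learning algorithm implements this interface and is registered with
// the agent, which reports each step <s_t, a_t, s_{t+1}> and episode boundaries.
class StepListener {
public:
    virtual ~StepListener() = default;
    virtual void onStep(const State& st, const Action& at, const State& stPlus1) = 0;
    virtual void onNewEpisode() = 0;
};

class Agent {
public:
    void addListener(StepListener* l) { listeners.push_back(l); }

    // Called once per interaction step: notify all registered algorithms.
    void notifyStep(const State& st, const Action& at, const State& stPlus1) {
        for (StepListener* l : listeners) l->onStep(st, at, stPlus1);
    }
    void notifyNewEpisode() {
        for (StepListener* l : listeners) l->onNewEpisode();
    }
private:
    std::vector<StepListener*> listeners;
};
```

The point of this pattern is that the agent only knows the listener interface, so any number of algorithms (or loggers) can observe the same stream of transitions without the agent depending on them.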

  7. Reinforcement Learning: • Agent: • State Space S • Action Space A • Transition Function • The agent has to optimize the expected future discounted reward • Many possibilities to solve the optimization task: • Value-based Approaches • Genetic Search • Other Optimization algorithms www.igi.tu-graz.ac.at/ril-toolbox
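The "future discounted reward" mentioned above is the standard discounted return; in the usual formulation (e.g. [Sutt_1999]), with discount factor γ ∈ [0, 1):

```latex
V^{\pi}(s) \;=\; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r_{t} \;\middle|\; s_0 = s,\; a_t \sim \pi(s_t) \right]
```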

  8. Short Overview of the Algorithms: • Value-based algorithms • Calculate the goodness of each state • Policy-search algorithms • Represent the policy directly, search in the policy parameter space • Hybrid Methods • Actor-Critic Learning www.igi.tu-graz.ac.at/ril-toolbox

  9. Value Based Algorithms • Calculate either: • Action value function (Q-Function): • Directly used for action selection • Value Function (V-Function): • Needs the transition function for action selection • E.g. do state prediction or use the derivative of the transition function • The representation of the V- or Q-Function is in most cases independent of the learning algorithm • We can use any function approximator for the value function • Independent V-Function and Q-Function interfaces • Different Algorithms: TD-Learning, Advantage Learning, Continuous Time RL www.igi.tu-graz.ac.at/ril-toolbox
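As a generic illustration of TD learning with a function approximator (not the Toolbox's own classes; names and structure are invented), a TD(λ) value update for a linear approximator V(s) = wᵀφ(s) could look like this:

```cpp
#include <vector>

// Generic TD(lambda) update for a linear value function V(s) = w^T phi(s).
// Illustrative only; names and structure are not taken from the Toolbox.
struct TDLambdaLearner {
    std::vector<double> w;   // value-function weights
    std::vector<double> e;   // eligibility traces
    double alpha, gamma, lambda;

    double value(const std::vector<double>& phi) const {
        double v = 0.0;
        for (size_t i = 0; i < w.size(); ++i) v += w[i] * phi[i];
        return v;
    }

    // One update from the transition <s_t, r_t, s_{t+1}>, given both feature vectors.
    void update(const std::vector<double>& phiT, double reward,
                const std::vector<double>& phiT1, bool terminal) {
        double vT1 = terminal ? 0.0 : value(phiT1);
        double delta = reward + gamma * vT1 - value(phiT);   // TD error
        for (size_t i = 0; i < w.size(); ++i) {
            e[i] = gamma * lambda * e[i] + phiT[i];           // accumulate traces
            w[i] += alpha * delta * e[i];
        }
    }
};
```

Because only the feature vector φ(s) enters the update, the same code works with RBF networks, tile coding, or other linear features.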

  10. Policy Search / Policy Gradient Algorithms • Directly climb the value of a parameterized policy in its parameter space • Calculate the values of N given initial states per simulation (PEGASUS, [NG_2000]) • Use standard optimization techniques like gradient ascent, simulated annealing or genetic algorithms • Gradient ascent is used in the Toolbox www.igi.tu-graz.ac.at/ril-toolbox
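A minimal sketch of this idea, assuming a user-supplied policyValue function that returns the average return over the N fixed initial states with fixed random seeds (the PEGASUS trick that makes the value a deterministic function of the parameters). The finite-difference gradient and all names are illustrative, not the thesis's implementation:

```cpp
#include <functional>
#include <vector>

// policyValue(theta): deterministic average return of the policy with
// parameters theta, evaluated from N fixed initial states with fixed
// random seeds, as in PEGASUS [NG_2000]. Supplied by the user / simulator.
using PolicyValue = std::function<double(const std::vector<double>&)>;

// Finite-difference gradient ascent on the policy parameters (illustrative only).
std::vector<double> policySearch(PolicyValue policyValue,
                                 std::vector<double> theta,
                                 double stepSize, double eps, int iterations) {
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> grad(theta.size(), 0.0);
        double base = policyValue(theta);
        for (size_t i = 0; i < theta.size(); ++i) {
            std::vector<double> shifted = theta;
            shifted[i] += eps;
            grad[i] = (policyValue(shifted) - base) / eps;  // numerical derivative
        }
        for (size_t i = 0; i < theta.size(); ++i)
            theta[i] += stepSize * grad[i];                 // gradient ascent step
    }
    return theta;
}
```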

  11. Actor Critic Methods: • Learn the value function and an extra policy representation • Discrete actor critic • Stochastic Policies • Directly represent the action-selection probabilities • Similar to TD-Q Learning • Continuous actor critic • Directly outputs the continuous control vector • The policy can be represented by any function approximator • Stochastic Real-Valued (SRV) Algorithm ([Gullapalli, 1992]) • Policy-Gradient Actor-Critic (PGAC) algorithm www.igi.tu-graz.ac.at/ril-toolbox
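A rough, simplified sketch of an SRV-style continuous actor: Gaussian exploration around the actor's output, with the mean pushed toward actions that produced a positive TD error from the critic. This illustrates the idea only; it is not Gullapalli's exact update rules or the Toolbox implementation.

```cpp
#include <random>
#include <vector>

// Simplified SRV-style continuous actor update (illustrative, one action dimension).
struct SRVActor {
    std::vector<double> w;      // linear actor weights: mean action = w^T phi(s)
    double sigma;               // exploration noise standard deviation
    double beta;                // actor learning rate
    std::mt19937 rng{0};

    double mean(const std::vector<double>& phi) const {
        double m = 0.0;
        for (size_t i = 0; i < w.size(); ++i) m += w[i] * phi[i];
        return m;
    }

    // Sample an exploratory action around the current mean output.
    double selectAction(const std::vector<double>& phi) {
        std::normal_distribution<double> noise(0.0, sigma);
        return mean(phi) + noise(rng);
    }

    // If the critic's TD error is positive, move the mean toward the executed action.
    void update(const std::vector<double>& phi, double action, double tdError) {
        double diff = action - mean(phi);
        for (size_t i = 0; i < w.size(); ++i)
            w[i] += beta * tdError * diff * phi[i];
    }
};
```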

  12. Policy-Gradient Actor-Critic Algorithm • Learn the V-Function with a standard algorithm • Calculate the gradient of the value within a certain time window (k steps in the past, l steps in the future) • The gradient with respect to the policy parameters is then estimated from these model and value derivatives over the window • Again, the exact model is needed www.igi.tu-graz.ac.at/ril-toolbox

  13. Second Part: Benchmark Tests • Pendulum Swing Up • Easy Task • CartPole Swing Up • Medium Task • AcroBot Swing Up • Hard Task www.igi.tu-graz.ac.at/ril-toolbox

  14. Benchmark Problems • Common problems in non-linear control • The goal is to reach an unstable fixed point (the upright position) • 2 or 4 continuous state variables • 1 continuous control variable • Reward: height of the end point at each time step www.igi.tu-graz.ac.at/ril-toolbox
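For the pendulum, the per-step reward described above can be sketched as the height of the pole tip. The sign convention (angle measured from the upright position) is an assumption for illustration, not taken from the thesis:

```cpp
#include <cmath>

// Per-step reward for the pendulum swing-up: the height of the pole tip.
// theta is assumed to be measured from the upright position (assumption),
// so the reward is maximal (+poleLength) when the pendulum is balanced upright.
double swingUpReward(double theta, double poleLength) {
    return poleLength * std::cos(theta);
}
```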

  15. Benchmark Tests: • Test the algorithms on the benchmark problems with different parameter settings • Compare the sensitivity to the parameter settings • Use different Function Approximators (FA) • Linear FAs (e.g. RBF-Networks) • Typical local representation • Curse of dimensionality (see the sketch below) • Non-Linear FAs (e.g. Feed-Forward Neural Networks): • No exponential dependency on the input state dimension • Harder to learn (no local representation) • Compare the algorithms with respect to their features and requirements • Is the exact transition function needed? • Can the algorithm produce continuous actions? • How much computation time is needed? • Use hierarchical RL, directed exploration strategies or planning methods to boost learning www.igi.tu-graz.ac.at/ril-toolbox
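To make the locality and the curse of dimensionality of linear FAs concrete, here is a generic sketch of normalized Gaussian RBF features on a regular grid (not the Toolbox's RBF network class). With c centers per dimension and d state dimensions, the feature vector has c^d entries, which is exactly the exponential dependency mentioned above:

```cpp
#include <cmath>
#include <vector>

// Normalized Gaussian RBF features over a set of grid centers.
// Only a few features near the current state are significantly non-zero,
// which gives the local representation typical for linear FAs.
std::vector<double> rbfFeatures(const std::vector<double>& state,
                                const std::vector<std::vector<double>>& centers,
                                double width) {
    std::vector<double> phi(centers.size());
    double sum = 0.0;
    for (size_t j = 0; j < centers.size(); ++j) {
        double dist2 = 0.0;
        for (size_t d = 0; d < state.size(); ++d) {
            double diff = state[d] - centers[j][d];
            dist2 += diff * diff;
        }
        phi[j] = std::exp(-dist2 / (2.0 * width * width));
        sum += phi[j];
    }
    for (double& p : phi) p /= (sum > 0.0 ? sum : 1.0);  // normalize to sum to 1
    return phi;
}
```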

  16. Benchmark Tests: Cart-Pole Task, RBF-Network • Planning boosts performance significantly • Very time intensive (at search depth 5, roughly 120 times longer computation time) • The PG-AC approach can compete with the standard V-Learning approach • Cannot represent sharp decision boundaries www.igi.tu-graz.ac.at/ril-toolbox

  17. Benchmark: PG-AC vs. V-Planning, Feed-Forward NN • Learning with a FF-NN using the standard planning approach is almost impossible (very unstable performance) • PG-AC with an RBF critic (time window = 7 time steps) manages to learn the task in roughly one tenth of the episodes needed by the standard planning approach www.igi.tu-graz.ac.at/ril-toolbox

  18. V-Planning • Cart-Pole Task: Higher search depths could improve performance significantly, but at exponentially growing computation time www.igi.tu-graz.ac.at/ril-toolbox
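The exponential cost can be seen directly in a generic depth-limited lookahead sketch (illustrative names only, not the Toolbox API): each additional level of search depth multiplies the work by the number of discretized actions.

```cpp
#include <functional>
#include <limits>
#include <vector>

using State = std::vector<double>;

// model: deterministic transition s' = f(s, a); value: learned V-function;
// reward: immediate reward. All supplied by the user; names are illustrative.
struct Planner {
    std::function<State(const State&, double)> model;
    std::function<double(const State&)> value;
    std::function<double(const State&, double)> reward;
    std::vector<double> actions;   // discretized continuous action
    double gamma;

    // Depth-limited lookahead: cost grows as |actions|^depth.
    double planValue(const State& s, int depth) {
        if (depth == 0) return value(s);
        double best = -std::numeric_limits<double>::infinity();
        for (double a : actions) {
            double q = reward(s, a) + gamma * planValue(model(s, a), depth - 1);
            if (q > best) best = q;
        }
        return best;
    }

    // Greedy action with respect to the lookahead values.
    double selectAction(const State& s, int depth) {
        double best = -std::numeric_limits<double>::infinity();
        double bestA = actions.front();
        for (double a : actions) {
            double q = reward(s, a) + gamma * planValue(model(s, a), depth - 1);
            if (q > best) { best = q; bestA = a; }
        }
        return bestA;
    }
};
```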

  19. Hierarchical RL • Cart-Pole Task: The Hierarchical Sub-Goal Approach (alpha = 0.6) outperforms the flat approach (alpha = 1.0) www.igi.tu-graz.ac.at/ril-toolbox

  20. Other general results • The Acro-Bot Task could not be learned with the flat architecture • The hierarchical architecture manages to swing up, but could not stay on top • Nearly all algorithms managed to learn the first two tasks with linear function approximation (RBF networks) • Non-linear function approximators are very hard to learn • Feed-Forward NNs have a very poor performance (no locality), but can be used for larger state spaces • Very restrictive parameter settings • Approaches which use the transition function typically outperform the model-free approaches • The Policy Gradient algorithm (PEGASUS) only worked with the linear FAs; with non-linear FAs it got stuck in local maxima www.igi.tu-graz.ac.at/ril-toolbox

  21. Literature • [Sutt_1999] R. Sutton and A. Barto: Reinforcement Learning: An Introduction. MIT Press. • [NG_2000] A. Ng and M. Jordan: PEGASUS: A policy search method for large MDPs and POMDPs. • [Doya_1999] K. Doya: Reinforcement Learning in Continuous Time and Space. • [Baxter_1999] J. Baxter: Direct Gradient-Based Reinforcement Learning: II. Gradient Ascent Algorithms and Experiments. • [Baird_1999] L. Baird: Reinforcement Learning Through Gradient Descent. PhD thesis. • [Gulla_1992] V. Gullapalli: Reinforcement Learning and its Application to Control. • [Coulom_2000] R. Coulom: Reinforcement Learning Using Neural Networks. PhD thesis. www.igi.tu-graz.ac.at/ril-toolbox
