
The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Presentation Transcript


  1. The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks Gerhard Neumann, Master Thesis 2005, Institut für Grundlagen der Informationsverarbeitung (IGI) www.igi.tu-graz.ac.at/ril-toolbox

  2. Master Thesis: • Reinforcement Learning Toolbox • General Software Tool for Reinforcement Learning • Benchmark tests of Reinforcement Learning algorithms on three Optimal Control Problems • Pendulum Swing Up • Cart-Pole Swing Up • Acro-Bot Swing Up www.igi.tu-graz.ac.at/ril-toolbox

  3. RL Toolbox: Features • Software: • C++ Class System • Open Source / Non Commercial • Homepage: www.igi.tu-graz.ac.at/ril-toolbox • Class Reference, Manual • Runs under Linux and Windows • > 40,000 lines of code, > 250 classes www.igi.tu-graz.ac.at/ril-toolbox

  4. RL Toolbox: Features • Learning in discrete or continuous State Space • Learning in discrete or continuous Action Space • Different kinds of Learning Algorithms • TD(λ) learning • Actor-critic learning • Dynamic Programming, Model based learning, planning methods • Continuous time RL • Policy search algorithms • Residual / Residual-gradient algorithms • Use Different Function Approximators • RBF-Networks • Linear Interpolation • CMAC-Tile Coding • Feed Forward Neural Networks • Learning from other (self coded) Controllers • Hierarchical Reinforcement Learning www.igi.tu-graz.ac.at/ril-toolbox

  5. Structure of the Learning System • The Agent and the environment • The agent tells the environment which action to execute; the environment performs the internal state transitions • The environment defines the learning problem www.igi.tu-graz.ac.at/ril-toolbox

  6. Structure of the learning system • Linkage to the learning algorithms • All algorithms need the transition tuple <s_t, a_t, s_{t+1}> for learning • The algorithms are implemented as listeners (see the sketch below) • The algorithms adapt the agent's controller to learn the optimal policy • The agent informs all listeners about each step and about the start of a new episode www.igi.tu-graz.ac.at/ril-toolbox
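The listener mechanism described on this slide can be illustrated with a small, self-contained C++ sketch. All class and method names below are hypothetical and chosen for illustration; they are not the actual RL Toolbox API.

```cpp
#include <vector>

// Hypothetical illustration of the listener pattern described above;
// class and method names are invented, not the real RL Toolbox API.
struct State  { std::vector<double> values; };
struct Action { std::vector<double> values; };

// Every learning algorithm implements this interface and is registered with
// the agent, which reports each step <s_t, a_t, s_{t+1}> and episode boundaries.
class StepListener {
public:
    virtual ~StepListener() = default;
    virtual void onStep(const State& st, const Action& at, const State& stPlus1) = 0;
    virtual void onNewEpisode() = 0;
};

class Agent {
public:
    void addListener(StepListener* l) { listeners.push_back(l); }

    // Called once per interaction step: notify all registered algorithms.
    void notifyStep(const State& st, const Action& at, const State& stPlus1) {
        for (StepListener* l : listeners) l->onStep(st, at, stPlus1);
    }
    void notifyNewEpisode() {
        for (StepListener* l : listeners) l->onNewEpisode();
    }
private:
    std::vector<StepListener*> listeners;
};
```

The point of this pattern is that the agent only knows the listener interface, so any number of algorithms (or loggers) can observe the same stream of transitions without the agent depending on them.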

  7. Reinforcement Learning: • Agent: • State Space S • Action Space A • Transition Function • The agent has to optimize the expected future discounted reward • Many possibilities to solve the optimization task: • Value-based Approaches • Genetic Search • Other Optimization algorithms www.igi.tu-graz.ac.at/ril-toolbox
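The "future discounted reward" mentioned above is the standard discounted return; in the usual formulation (e.g. [Sutt_1999]), with discount factor γ ∈ [0, 1):

```latex
V^{\pi}(s) \;=\; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r_{t} \;\middle|\; s_0 = s,\; a_t \sim \pi(s_t) \right]
```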

  8. Short Overview of the Algorithms: • Value-based algorithms • Calculate the goodness of each state • Policy-search algorithms • Represent the policy directly, search in the policy parameter space • Hybrid Methods • Actor-Critic Learning www.igi.tu-graz.ac.at/ril-toolbox

  9. Value Based Algorithms • Calculate either: • Action value function (Q-Function): • Directly used for action selection • Value Function (V-Function): • Needs the transition function for action selection • E.g. do state prediction or use the derivative of the transition function • The representation of the V- or Q-Function is in most cases independent of the learning algorithm • We can use any function approximator for the value function • Independent V-Function and Q-Function interfaces • Different Algorithms: TD-Learning, Advantage Learning, Continuous Time RL www.igi.tu-graz.ac.at/ril-toolbox
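As a generic illustration of TD learning with a function approximator (not the Toolbox's own classes; names and structure are invented), a TD(λ) value update for a linear approximator V(s) = wᵀφ(s) could look like this:

```cpp
#include <vector>

// Generic TD(lambda) update for a linear value function V(s) = w^T phi(s).
// Illustrative only; names and structure are not taken from the Toolbox.
struct TDLambdaLearner {
    std::vector<double> w;   // value-function weights
    std::vector<double> e;   // eligibility traces
    double alpha, gamma, lambda;

    double value(const std::vector<double>& phi) const {
        double v = 0.0;
        for (size_t i = 0; i < w.size(); ++i) v += w[i] * phi[i];
        return v;
    }

    // One update from the transition <s_t, r_t, s_{t+1}>, given both feature vectors.
    void update(const std::vector<double>& phiT, double reward,
                const std::vector<double>& phiT1, bool terminal) {
        double vT1 = terminal ? 0.0 : value(phiT1);
        double delta = reward + gamma * vT1 - value(phiT);   // TD error
        for (size_t i = 0; i < w.size(); ++i) {
            e[i] = gamma * lambda * e[i] + phiT[i];           // accumulate traces
            w[i] += alpha * delta * e[i];
        }
    }
};
```

Because only the feature vector φ(s) enters the update, the same code works with RBF networks, tile coding, or other linear features.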

  10. Policy Search / Policy Gradient Algorithms • Directly climb the value of a parameterized policy in its parameter space • Calculate the values of N given initial states per simulation (PEGASUS, [NG_2000]) • Use standard optimization techniques like gradient ascent, simulated annealing or genetic algorithms • Gradient ascent is used in the Toolbox www.igi.tu-graz.ac.at/ril-toolbox
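A minimal sketch of this idea, assuming a user-supplied policyValue function that returns the average return over the N fixed initial states with fixed random seeds (the PEGASUS trick that makes the value a deterministic function of the parameters). The finite-difference gradient and all names are illustrative, not the thesis's implementation:

```cpp
#include <functional>
#include <vector>

// policyValue(theta): deterministic average return of the policy with
// parameters theta, evaluated from N fixed initial states with fixed
// random seeds, as in PEGASUS [NG_2000]. Supplied by the user / simulator.
using PolicyValue = std::function<double(const std::vector<double>&)>;

// Finite-difference gradient ascent on the policy parameters (illustrative only).
std::vector<double> policySearch(PolicyValue policyValue,
                                 std::vector<double> theta,
                                 double stepSize, double eps, int iterations) {
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> grad(theta.size(), 0.0);
        double base = policyValue(theta);
        for (size_t i = 0; i < theta.size(); ++i) {
            std::vector<double> shifted = theta;
            shifted[i] += eps;
            grad[i] = (policyValue(shifted) - base) / eps;  // numerical derivative
        }
        for (size_t i = 0; i < theta.size(); ++i)
            theta[i] += stepSize * grad[i];                 // gradient ascent step
    }
    return theta;
}
```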

  11. Actor Critic Methods: • Learn the value function and an extra policy representation • Discrete actor critic • Stochastic Policies • Directly represent the action-selection probabilities • Similar to TD-Q Learning • Continuous actor critic • Directly outputs the continuous control vector • The policy can be represented by any function approximator • Stochastic Real-Valued (SRV) Algorithm ([Gullapalli, 1992]) • Policy-Gradient Actor-Critic (PGAC) algorithm www.igi.tu-graz.ac.at/ril-toolbox
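A rough, simplified sketch of an SRV-style continuous actor: Gaussian exploration around the actor's output, with the mean pushed toward actions that produced a positive TD error from the critic. This illustrates the idea only; it is not Gullapalli's exact update rules or the Toolbox implementation.

```cpp
#include <random>
#include <vector>

// Simplified SRV-style continuous actor update (illustrative, one action dimension).
struct SRVActor {
    std::vector<double> w;      // linear actor weights: mean action = w^T phi(s)
    double sigma;               // exploration noise standard deviation
    double beta;                // actor learning rate
    std::mt19937 rng{0};

    double mean(const std::vector<double>& phi) const {
        double m = 0.0;
        for (size_t i = 0; i < w.size(); ++i) m += w[i] * phi[i];
        return m;
    }

    // Sample an exploratory action around the current mean output.
    double selectAction(const std::vector<double>& phi) {
        std::normal_distribution<double> noise(0.0, sigma);
        return mean(phi) + noise(rng);
    }

    // If the critic's TD error is positive, move the mean toward the executed action.
    void update(const std::vector<double>& phi, double action, double tdError) {
        double diff = action - mean(phi);
        for (size_t i = 0; i < w.size(); ++i)
            w[i] += beta * tdError * diff * phi[i];
    }
};
```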

  12. Policy-Gradient Actor-Critic Algorithm • Learn the V-Function with a standard algorithm • Calculate the gradient of the value within a certain time window (k steps in the past, l steps in the future) • The gradient with respect to the policy parameters is then estimated from these model and value derivatives over the window • Again, the exact model is needed www.igi.tu-graz.ac.at/ril-toolbox

  13. Second Part: Benchmark Tests • Pendulum Swing Up • Easy Task • CartPole Swing Up • Medium Task • AcroBot Swing Up • Hard Task www.igi.tu-graz.ac.at/ril-toolbox

  14. Benchmark Problems • Common problems in non-linear control • The goal is to reach an unstable fixed point (the upright position) • 2 or 4 continuous state variables • 1 continuous control variable • Reward: height of the end point at each time step www.igi.tu-graz.ac.at/ril-toolbox
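For the pendulum, the per-step reward described above can be sketched as the height of the pole tip. The sign convention (angle measured from the upright position) is an assumption for illustration, not taken from the thesis:

```cpp
#include <cmath>

// Per-step reward for the pendulum swing-up: the height of the pole tip.
// theta is assumed to be measured from the upright position (assumption),
// so the reward is maximal (+poleLength) when the pendulum is balanced upright.
double swingUpReward(double theta, double poleLength) {
    return poleLength * std::cos(theta);
}
```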

  15. Benchmark Tests: • Test the algorithms on the benchmark problems with different parameter settings • Compare the sensitivity to the parameter settings • Use different Function Approximators (FA) • Linear FAs (e.g. RBF-Networks) • Typical local representation • Curse of dimensionality (see the sketch below) • Non-Linear FAs (e.g. Feed-Forward Neural Networks): • No exponential dependency on the input state dimension • Harder to learn (no local representation) • Compare the algorithms with respect to their features and requirements • Is the exact transition function needed? • Can the algorithm produce continuous actions? • How much computation time is needed? • Use hierarchical RL, directed exploration strategies or planning methods to boost learning www.igi.tu-graz.ac.at/ril-toolbox
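To make the locality and the curse of dimensionality of linear FAs concrete, here is a generic sketch of normalized Gaussian RBF features on a regular grid (not the Toolbox's RBF network class). With c centers per dimension and d state dimensions, the feature vector has c^d entries, which is exactly the exponential dependency mentioned above:

```cpp
#include <cmath>
#include <vector>

// Normalized Gaussian RBF features over a set of grid centers.
// Only a few features near the current state are significantly non-zero,
// which gives the local representation typical for linear FAs.
std::vector<double> rbfFeatures(const std::vector<double>& state,
                                const std::vector<std::vector<double>>& centers,
                                double width) {
    std::vector<double> phi(centers.size());
    double sum = 0.0;
    for (size_t j = 0; j < centers.size(); ++j) {
        double dist2 = 0.0;
        for (size_t d = 0; d < state.size(); ++d) {
            double diff = state[d] - centers[j][d];
            dist2 += diff * diff;
        }
        phi[j] = std::exp(-dist2 / (2.0 * width * width));
        sum += phi[j];
    }
    for (double& p : phi) p /= (sum > 0.0 ? sum : 1.0);  // normalize to sum to 1
    return phi;
}
```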

  16. Benchmark Tests: Cart-Pole Task, RBF-Network • Planning boosts performance significantly • Very time intensive (at search depth 5, roughly 120 times longer computation time) • The PG-AC approach can compete with the standard V-Learning approach • Cannot represent sharp decision boundaries www.igi.tu-graz.ac.at/ril-toolbox

  17. Benchmark: PG-AC vs. V-Planning, Feed-Forward NN • Learning with a FF-NN using the standard planning approach is almost impossible (very unstable performance) • PG-AC with an RBF critic (time window = 7 time steps) manages to learn the task in roughly one tenth of the episodes needed by the standard planning approach www.igi.tu-graz.ac.at/ril-toolbox

  18. V-Planning • Cart-Pole Task: Higher search depths could improve performance significantly, but at exponentially growing computation time www.igi.tu-graz.ac.at/ril-toolbox
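The exponential cost can be seen directly in a generic depth-limited lookahead sketch (illustrative names only, not the Toolbox API): each additional level of search depth multiplies the work by the number of discretized actions.

```cpp
#include <functional>
#include <limits>
#include <vector>

using State = std::vector<double>;

// model: deterministic transition s' = f(s, a); value: learned V-function;
// reward: immediate reward. All supplied by the user; names are illustrative.
struct Planner {
    std::function<State(const State&, double)> model;
    std::function<double(const State&)> value;
    std::function<double(const State&, double)> reward;
    std::vector<double> actions;   // discretized continuous action
    double gamma;

    // Depth-limited lookahead: cost grows as |actions|^depth.
    double planValue(const State& s, int depth) {
        if (depth == 0) return value(s);
        double best = -std::numeric_limits<double>::infinity();
        for (double a : actions) {
            double q = reward(s, a) + gamma * planValue(model(s, a), depth - 1);
            if (q > best) best = q;
        }
        return best;
    }

    // Greedy action with respect to the lookahead values.
    double selectAction(const State& s, int depth) {
        double best = -std::numeric_limits<double>::infinity();
        double bestA = actions.front();
        for (double a : actions) {
            double q = reward(s, a) + gamma * planValue(model(s, a), depth - 1);
            if (q > best) { best = q; bestA = a; }
        }
        return bestA;
    }
};
```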

  19. Hierarchical RL • Cart-Pole Task: The Hierarchical Sub-Goal Approach (alpha = 0.6) outperforms the flat approach (alpha = 1.0) www.igi.tu-graz.ac.at/ril-toolbox

  20. Other general results • The Acro-Bot Task could not be learned with the flat architecture • The hierarchical architecture manages to swing up, but could not stay on top • Nearly all algorithms managed to learn the first two tasks with linear function approximation (RBF networks) • Non-linear function approximators are very hard to learn • Feed-Forward NNs have a very poor performance (no locality), but can be used for larger state spaces • Very restrictive parameter settings • Approaches which use the transition function typically outperform the model-free approaches • The Policy Gradient algorithm (PEGASUS) only worked with the linear FAs; with non-linear FAs it got stuck in local maxima www.igi.tu-graz.ac.at/ril-toolbox

  21. Literature • [Sutt_1999] R. Sutton and A. Barto: Reinforcement Learning: An Introduction. MIT Press. • [NG_2000] A. Ng and M. Jordan: PEGASUS: A policy search method for large MDPs and POMDPs. • [Doya_1999] K. Doya: Reinforcement Learning in Continuous Time and Space. • [Baxter_1999] J. Baxter: Direct Gradient-Based Reinforcement Learning: II. Gradient Ascent Algorithms and Experiments. • [Baird_1999] L. Baird: Reinforcement Learning Through Gradient Descent. PhD thesis. • [Gulla_1992] V. Gullapalli: Reinforcement Learning and its Application to Control. • [Coulom_2000] R. Coulom: Reinforcement Learning Using Neural Networks. PhD thesis. www.igi.tu-graz.ac.at/ril-toolbox
