
Policy Gradient in Continuous Time



Presentation Transcript


1. Policy Gradient in Continuous Time, by Remi Munos, JMLR 2006. Presented by Hui Li, Duke University Machine Learning Group, May 30, 2007.

2. Outline • Introduction • Discretized Stochastic Processes Approximation • Model-free Reinforcement Learning (RL) Algorithm • Example Results

3. Introduction of the Problem [diagram: a controller applies the control u_t to the system, which evolves the continuous state x_t] • Consider an optimal control problem with continuous state. System dynamics: a deterministic process with continuous state. • Objective: find an optimal control (u_t) that maximizes the objective functional.
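The formulas for the dynamics and the objective did not survive extraction. A hedged reconstruction of the standard formulation the slide appears to use (the symbols f, r, T and the terminal-reward form are assumptions, consistent with the terminal reward mentioned in the experiment later, not taken verbatim from the slide):

    \frac{dx_t}{dt} = f(x_t, u_t), \quad t \in [0, T], \quad x_0 \text{ given}    (deterministic, continuous state)
    J\big(x_0; (u_t)_{t \in [0,T]}\big) = r(x_T)    (terminal reward at the fixed horizon T)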

4. Introduction of the Problem • Consider a class of parameterized policies, with parameters to be tuned. • Find the parameter value that maximizes the performance measure. • The standard approach is the gradient ascent method; computing the required gradient of the performance measure is the object of the paper.
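Writing the policy parameter as \alpha and the performance measure as J(\alpha) (notation assumed here, since the slide's own symbols are missing from the transcript), the gradient-ascent update referred to above is

    \alpha_{k+1} = \alpha_k + \eta_k \, \nabla_\alpha J(\alpha_k),

with \eta_k a step size; the rest of the talk is about estimating \nabla_\alpha J(\alpha_k) from observed trajectories.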

5. Introduction of the Problem How to compute the gradient of the performance measure? • Finite-difference method: requires a large number of trajectories to estimate the gradient. • Pathwise estimation of the gradient: computes the gradient using one trajectory only.
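As a rough illustration of why the finite-difference route is expensive, a minimal sketch in Python; rollout is a hypothetical function (not defined in the paper or the slides) that runs one trajectory under parameter alpha and returns the observed performance:

    import numpy as np

    def finite_difference_gradient(rollout, alpha, eps=1e-2):
        # Central differences: 2 * len(alpha) rollouts per gradient estimate,
        # which is what motivates the single-trajectory pathwise estimator.
        grad = np.zeros(len(alpha))
        for i in range(len(alpha)):
            e = np.zeros(len(alpha))
            e[i] = eps
            grad[i] = (rollout(alpha + e) - rollout(alpha - e)) / (2 * eps)
        return grad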

6. Introduction of the Problem Pathwise estimation of the gradient • Define z_t, the sensitivity of the state with respect to the policy parameter. • Dynamics of z_t: the equation contains a known part (the dependence of the control on the parameter) and an unknown part (the dependence of the dynamics on the state). • In reinforcement learning the dynamics are unknown; how can z_t be approximated?
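A hedged sketch of the pathwise quantities this slide refers to, in standard sensitivity-analysis notation (the slide's own formulas are missing, so the exact form below is an assumption): with u_t produced by the parameterized policy,

    z_t := \nabla_\alpha x_t, \qquad z_0 = 0,
    \frac{dz_t}{dt} = \nabla_x f(x_t, u_t)\, z_t + \nabla_u f(x_t, u_t)\, \nabla_\alpha u_t,
    \nabla_\alpha J(\alpha) = \nabla_x r(x_T)\, z_T.

Here \nabla_x f and \nabla_u f come from the model and are unknown in the model-free setting, while \nabla_\alpha u_t comes from the policy and is known, which is why z_t has to be approximated from observed trajectories.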

7. Discretized Stochastic Processes Approximation • A general convergence result (Theorem 3): conditions under which a discrete-time stochastic process converges to the solution of a deterministic differential equation as the time step goes to zero.
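The statement itself is missing from the transcript. A hedged paraphrase of the kind of condition such convergence results use (not the slide's exact wording): if the discrete process x^\Delta_n with time step \Delta t satisfies

    E[x^\Delta_{n+1} - x^\Delta_n \mid x^\Delta_n = x] = \Delta t \, \bar f(x) + o(\Delta t)   and   E[\|x^\Delta_{n+1} - x^\Delta_n\|^2 \mid x^\Delta_n = x] = o(\Delta t),

where \bar f is the drift of the limiting dynamics (here, the policy-averaged state derivative), then the interpolated discrete process converges to the solution of dx_t/dt = \bar f(x_t) as \Delta t \to 0. The proof slides that follow fit this pattern: compute the average jump, then invoke Theorem 3.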

8. Discretization of the state • Use a stochastic policy over the admissible actions. • Build a stochastic discrete-time state process. Initialization: start from the initial state. Jump in state: at each discrete step, sample an action from the stochastic policy and move the state accordingly.
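A minimal simulation sketch in Python, assuming an Euler-style jump x_{n+1} = x_n + \Delta t \, f(x_n, u_n) with the action drawn from the stochastic policy at each step (the Euler form and the helper names policy_probs and f are assumptions standing in for the slide's missing transition formula):

    import numpy as np

    def simulate(x0, policy_probs, f, actions, dt, n_steps, rng):
        # policy_probs(x) -> probability of each action in `actions` at state x
        # f(x, u)         -> state derivative given by the system dynamics
        x = np.array(x0, dtype=float)
        trajectory = [x.copy()]
        for _ in range(n_steps):
            u = actions[rng.choice(len(actions), p=policy_probs(x))]
            x = x + dt * np.asarray(f(x, u))   # jump in state
            trajectory.append(x.copy())
        return np.array(trajectory)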

9. Proof of Proposition 5: from Taylor's formula, compute the average jump of the discrete state process; it matches the continuous dynamics to first order in the time step, so Theorem 3 applies directly and Proposition 5 is proved.

10. Discretization of the state gradient • Define a stochastic discrete-time gradient process that tracks the sensitivity of the state with respect to the policy parameter. Initialization: the gradient process starts at zero, since the initial state does not depend on the parameter.

11. Proof of Proposition 6: the same argument applies to the gradient process; since its average jump matches the continuous dynamics of z_t to first order, Theorem 3 applies directly and Proposition 6 is proved.

12. Model-free Reinforcement Learning Algorithm In this stochastic approximation, the state transitions are observed and the policy (together with its dependence on the parameter) is given; the only remaining quantity to approximate is the unknown term coming from the dynamics.

13. Least-Squares Approximation Define the set of past discrete times s, within a recent window t − c ≤ s ≤ t, at which the action u_t has been taken. From Taylor's formula, for all such discrete times s, we deduce an approximately linear relation between the state offsets and the observed increments.

14. We may derive an approximation of the unknown term by solving the least-squares problem over those relations; here the overbar denotes the average value over that set of past times.
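A hedged sketch of this least-squares step: from the past times at which the same action was taken, fit a local linear model relating state offsets to observed increments, and use it as the estimate of the unknown term. The names, shapes, and the use of numpy.linalg.lstsq below are illustrative choices, not the paper's exact computation:

    import numpy as np

    def local_linear_model(xs, dxs):
        # xs:  (m, d) past states at which the current action was taken
        # dxs: (m, d) observed increments (x_{s+1} - x_s) / dt at those states
        # Fits dx ~ b + A (x - x_bar) in the least-squares sense.
        x_bar = xs.mean(axis=0)                          # average value over the past times
        X = np.hstack([xs - x_bar, np.ones((len(xs), 1))])
        coeffs, *_ = np.linalg.lstsq(X, dxs, rcond=None)
        A = coeffs[:-1].T    # Jacobian-like term (the unknown model dependence)
        b = coeffs[-1]       # drift estimate at the average state
        return A, b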

  15. Algorithm

16. Experimental Results Six continuous state variables: x0, y0: hand position; x, y: mass position; vx, vy: mass velocity. Four control actions: U = {(1,0), (0,1), (-1,0), (0,-1)}. Goal: reach a target (xG, yG) with the mass at a specified time T. Terminal reward function.

17. The system dynamics of the hand-mass model. Consider a Boltzmann-like stochastic policy in which each action's probability is proportional to the exponential of a parameterized score.
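A minimal sketch of such a Boltzmann (softmax) policy over the four control actions, using the raw state as the feature vector and a temperature parameter; both of these choices, and the parameter shape, are assumptions, since the slide's exact parameterization is not in the transcript:

    import numpy as np

    def boltzmann_policy(x, alpha, temperature=1.0):
        # alpha: (n_actions, n_features) parameter matrix (assumed shape)
        # Returns the probability of selecting each of the four actions at state x.
        scores = alpha @ np.asarray(x) / temperature
        scores = scores - scores.max()     # subtract max for numerical stability
        probs = np.exp(scores)
        return probs / probs.sum()

This is the usual meaning of "Boltzmann-like": action probabilities proportional to exp(score / temperature), which keeps the policy stochastic and differentiable in its parameters.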

  18. Conclusion • Described a reinforcement learning method for approximating the gradient of a continuous-time deterministic problem with respect to the control parameters • Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process
