This presentation by Hui Li at Duke University delves into the intricacies of applying policy gradient methods to continuous-time reinforcement learning problems. The session focuses on discretized stochastic process approximations, touching on model-free reinforcement learning algorithms. Notably, it covers techniques for finding optimal control policies that maximize performance measures while addressing challenges such as gradient estimation via finite-difference and pathwise methods. Experimental results illustrate the effectiveness of the proposed methods in reaching designated targets in a continuous state environment.
Policy Gradient in Continuous Time, by Remi Munos, JMLR 2006 • Presented by Hui Li, Duke University Machine Learning Group • May 30, 2007
Outline • Introduction • Discretized Stochastic Processes Approximation • Model-free Reinforcement Learning (RL) Algorithm • Example Results
Introduction of the Problem • Consider an optimal control problem with continuous state • System dynamics, a function of the state and the control: • Deterministic process • Continuous state • Objective: find an optimal control (ut) that maximizes the functional • Objective function:
Introduction of the Problem • Consider a class of parameterized policies with • Find the parameter that maximizes the performance measure • The standard approach is gradient ascent; computing the gradient of the performance measure is the object of the paper
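A sketch of this setup in symbols may help. The names f, r, π_α, V and the step size η are assumed notation, and the objective is assumed to be a terminal reward, in line with the terminal reward used in the experiment later in the talk.

```latex
\begin{align*}
\frac{dx_t}{dt} &= f(x_t, u_t), \qquad x_0 = x, \\
J\bigl(x; (u_t)_{0 \le t \le T}\bigr) &= r(x_T), \\
u_t &= \pi_\alpha(t, x_t), \qquad V(\alpha) = J\bigl(x; \pi_\alpha(t, x_t)\bigr), \\
\alpha &\leftarrow \alpha + \eta \, \nabla_\alpha V(\alpha).
\end{align*}
```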
Introduction of the Problem • How to compute the gradient of the performance measure? • Finite-difference method: requires a large number of trajectories to compute the gradient of the performance measure. • Pathwise estimation of the gradient: computes the gradient using one trajectory only.
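The two estimators can be compared on a toy problem. The sketch below is only an illustration under assumptions that are not in the slides: scalar dynamics f(x, u) = -x + u, a linear policy u = alpha * x, a terminal reward r(x) = -(x - 1)^2, and Euler integration. It also assumes the model is known, which is exactly what the model-free setting of the talk does not have.

```python
# Toy comparison of finite-difference and pathwise gradient estimates.
# Assumed (not from the slides): f(x, u) = -x + u, policy u = alpha * x,
# terminal reward r(x) = -(x - target)^2, Euler integration with step dt.

def rollout(alpha, x0=1.0, T=1.0, dt=1e-3):
    """Integrate x and z = dx/dalpha along one trajectory; return (x_T, z_T)."""
    x, z = x0, 0.0
    for _ in range(int(round(T / dt))):
        # Closed-loop dynamics: f_alpha(x) = (alpha - 1) * x
        dx = (alpha - 1.0) * x
        # dz/dt = (d f_alpha / dx) * z + d f_alpha / d alpha
        dz = (alpha - 1.0) * z + x
        x, z = x + dt * dx, z + dt * dz
    return x, z


def reward(x, target=1.0):
    return -(x - target) ** 2


def reward_grad(x, target=1.0):
    return -2.0 * (x - target)


def finite_difference_gradient(alpha, eps=1e-5):
    """Central difference: needs two extra trajectories per parameter."""
    v_plus = reward(rollout(alpha + eps)[0])
    v_minus = reward(rollout(alpha - eps)[0])
    return (v_plus - v_minus) / (2.0 * eps)


def pathwise_gradient(alpha):
    """Single trajectory: gradient = r'(x_T) * z_T."""
    x_T, z_T = rollout(alpha)
    return reward_grad(x_T) * z_T


if __name__ == "__main__":
    print(finite_difference_gradient(0.5), pathwise_gradient(0.5))
```

With one scalar parameter the finite-difference cost is small; it grows with the number of parameters, while the pathwise estimate still uses one trajectory.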
Introduction of the Problem • Pathwise estimation of the gradient • Define • Dynamics of zt: • Gradient (with one unknown and one known term): • In reinforcement learning, the state dynamics are unknown. How can zt be approximated?
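The pathwise quantities follow from the chain rule. The sketch below assumes the closed-loop dynamics are written f_α(x) := f(x, π_α(x)), the initial state does not depend on α, and the performance measure is a terminal reward V(α) = r(x_T); this notation is an assumption, not taken from the slide.

```latex
\begin{align*}
z_t &:= \nabla_\alpha x_t, \qquad z_0 = 0, \\
\frac{dz_t}{dt} &= \nabla_x f_\alpha(x_t)\, z_t + \nabla_\alpha f_\alpha(x_t), \\
\nabla_\alpha V(\alpha) &= \nabla_x r(x_T)\, z_T .
\end{align*}
```

Under these assumptions the reward gradient can be evaluated directly, while z_T depends on derivatives of the dynamics, which is why an approximation of zt is needed when the dynamics are unknown.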
Discretized Stochastic Processes Approximation • A General Convergence Result If
Discretization of the state • Stochastic policy • Stochastic discrete state process Initialization: Jump in state
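The idea of replacing the continuous system by a stochastic discrete process can be illustrated with a small simulation. Everything concrete here is an assumption made for the sketch: the toy dynamics dx/dt = u, the two actions -1 and +1, the logistic policy, and the helper names. Each sampled action is held for a time step delta, and the resulting state is compared with an Euler solution of the policy-averaged dynamics.

```python
import math
import random


def policy_prob_plus(alpha, x):
    """Probability of choosing u = +1 (hypothetical logistic policy)."""
    return 1.0 / (1.0 + math.exp(-(alpha - x)))


def f(x, u):
    """Assumed toy dynamics dx/dt = u."""
    return u


def simulate_discrete(alpha, x0, T, delta, rng):
    """Stochastic discrete process: sample an action, hold it for time delta."""
    x = x0
    for _ in range(int(round(T / delta))):
        u = 1.0 if rng.random() < policy_prob_plus(alpha, x) else -1.0
        x += delta * f(x, u)  # jump in state
    return x


def simulate_mean_ode(alpha, x0, T, delta):
    """Euler solution of the policy-averaged dynamics sum_u pi(u|x) f(x, u)."""
    x = x0
    for _ in range(int(round(T / delta))):
        p = policy_prob_plus(alpha, x)
        x += delta * (p * f(x, 1.0) + (1.0 - p) * f(x, -1.0))
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    print(simulate_discrete(1.0, 0.0, 1.0, 1e-4, rng))
    print(simulate_mean_ode(1.0, 0.0, 1.0, 1e-4))
```

As delta shrinks, the random fluctuations of the discrete process around the averaged trajectory shrink as well, which is the behavior a convergence result of this kind formalizes.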
Proof of Proposition 5: • From Taylor’s formula: • The average jump: • Applying Theorem 3 directly proves Proposition 5.
Discretization of the state gradient • Stochastic discrete state gradient process Initialization: With
Proof of Proposition 6: • Since • then • Applying Theorem 3 directly proves Proposition 6.
Model-free Reinforcement Learning Algorithm • Let • In this stochastic approximation, is observed and is given, so we only need to approximate
Least-Squares Approximation of • Define the set of past discrete times s, with t - c ≤ s ≤ t, at which action ut has been taken. • From Taylor’s formula, for all discrete times s, • We deduce
where • We may derive an approximation of by solving the least-squares problem: • Then we have • Here denotes the average value of
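The flavor of this estimation step can be sketched as a centered least-squares fit of observed state jumps against the states where the same action was taken. This is a generic one-dimensional illustration, not necessarily the exact estimator of the paper; the toy dynamics dx/dt = -2x + u and all function names are assumptions.

```python
import math


def exact_jump(x, u, delta):
    """State change after holding u for time delta under dx/dt = -2x + u (assumed toy dynamics)."""
    decay = math.exp(-2.0 * delta)
    return x * decay + 0.5 * u * (1.0 - decay) - x


def least_squares_slope(states, jumps, delta):
    """Estimate d f / d x from jumps observed under one fixed action.

    Fits jumps ~ delta * (a + b * x) by centering both variables on their
    averages and returns the slope b.
    """
    n = len(states)
    x_bar = sum(states) / n
    y_bar = sum(jumps) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(states, jumps))
    den = sum((x - x_bar) ** 2 for x in states)
    return num / (den * delta)


if __name__ == "__main__":
    delta = 0.01
    states = [0.1 * k for k in range(10)]
    jumps = [exact_jump(x, 1.0, delta) for x in states]
    print(least_squares_slope(states, jumps, delta))  # close to -2 for small delta
```

Only observed states and jumps enter the fit, so no model of the dynamics is needed; the estimate differs from the true derivative by a term that vanishes with delta.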
Experimental Results • Six continuous state variables: x0, y0: hand position; x, y: mass position; vx, vy: mass velocity • Four control actions: U = {(1,0), (0,1), (-1,0), (0,-1)} • Goal: reach a target (xG, yG) with the mass at a specific time T • Terminal reward function
The system dynamics: Consider a Boltzmann-like stochastic policy where
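A Boltzmann-like policy over the four actions can be sketched as a softmax over action preferences. The linear preference function and the parameter layout (one weight row per action) are hypothetical choices made for this illustration; the slide's own definition of the preferences is not reproduced here.

```python
import math
import random

# The four control actions listed on the experiment slide.
ACTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def q_value(alpha, state, action_index):
    """Hypothetical linear preference: weight row of the action dotted with the state."""
    return sum(w * s for w, s in zip(alpha[action_index], state))


def boltzmann_probs(alpha, state):
    """Softmax over action preferences, shifted by the max for numerical stability."""
    qs = [q_value(alpha, state, i) for i in range(len(ACTIONS))]
    q_max = max(qs)
    weights = [math.exp(q - q_max) for q in qs]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(alpha, state, rng):
    """Draw one action according to the Boltzmann probabilities."""
    probs = boltzmann_probs(alpha, state)
    return rng.choices(ACTIONS, weights=probs, k=1)[0]


if __name__ == "__main__":
    state = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]  # x0, y0, x, y, vx, vy
    alpha = [[0.0] * 6 for _ in ACTIONS]     # all-zero weights give a uniform policy
    print(boltzmann_probs(alpha, state))
    print(sample_action(alpha, state, random.Random(0)))
```

Because every action keeps a strictly positive probability and the probabilities vary smoothly with the parameters, a policy of this form is differentiable in alpha, which a gradient method requires.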
Conclusion • Described a reinforcement learning method for approximating the gradient of a continuous-time deterministic problem with respect to the control parameters • Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process