Policy Gradient in Continuous Time

Policy Gradient in Continuous Time, by Remi Munos, JMLR 2006. Presented by Hui Li, Duke University Machine Learning Group, May 30, 2007.

Outline • Introduction • Discretized Stochastic Processes Approximation • Model-free Reinforcement Learning (RL) Algorithm • Example Results

Introduction of the Problem • Consider an optimal control problem with continuous state. System dynamics, written in terms of the state and the control: • Deterministic process • Continuous state • Objective: find an optimal control (u_t) that maximizes the functional. Objective function:
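
A minimal sketch of this set-up, assuming the objective is a terminal reward of the final state. The dynamics `f`, the reward `r`, and the function names are illustrative assumptions, not taken from the paper.

```python
def simulate(f, control, x0, T, dt=1e-3):
    """Euler-integrate dx/dt = f(x, u_t) on [0, T] and return the final state."""
    x = x0
    for k in range(int(round(T / dt))):
        u = control(k * dt)      # open-loop control u_t
        x = x + dt * f(x, u)     # deterministic state update
    return x

def objective(f, r, control, x0, T, dt=1e-3):
    """Assumed functional: terminal reward r(x_T) of the trajectory."""
    return r(simulate(f, control, x0, T, dt))

# Hypothetical 1-D example: dx/dt = -x + u, reward peaks at x = 1.
f = lambda x, u: -x + u
r = lambda x: -(x - 1.0) ** 2

J_on = objective(f, r, lambda t: 1.0, x0=0.0, T=5.0)   # constant control u = 1
J_off = objective(f, r, lambda t: 0.0, x0=0.0, T=5.0)  # constant control u = 0
print(J_on > J_off)
```

In this toy case the control u = 1 drives the state toward the reward peak, so it scores higher than u = 0.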

Introduction of the Problem • Consider a class of parameterized policies. • Find the parameter that maximizes the performance measure. • The standard approach is gradient ascent; computing the gradient of the performance measure is the object of the paper.
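
The gradient ascent step can be sketched as follows, assuming a gradient oracle `grad_V` is available. The quadratic performance measure, the step size, and the iteration count are hypothetical choices for illustration only.

```python
def gradient_ascent(grad_V, alpha0, step=0.1, iterations=200):
    """Repeatedly move the parameter vector along the gradient of V."""
    alpha = list(alpha0)
    for _ in range(iterations):
        g = grad_V(alpha)
        alpha = [a + step * gi for a, gi in zip(alpha, g)]   # move uphill
    return alpha

# Hypothetical performance measure V(alpha) = -(a0 - 2)^2 - (a1 + 1)^2.
grad_V = lambda alpha: [-2.0 * (alpha[0] - 2.0), -2.0 * (alpha[1] + 1.0)]

alpha_star = gradient_ascent(grad_V, [0.0, 0.0])
print(alpha_star)
```

The whole difficulty in practice lies in obtaining `grad_V`, which is why the gradient estimate is the focus.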

Introduction of the Problem • How to compute the gradient of the performance measure? • Finite-difference method: requires a large number of trajectories to compute the gradient of the performance measure. • Pathwise estimation of the gradient: computes the gradient using one trajectory only.
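
To illustrate why finite differences need many trajectories, here is a small sketch using central differences. The objective and the `CountingObjective` helper are hypothetical stand-ins: each call plays the role of one simulated trajectory.

```python
class CountingObjective:
    """Toy performance measure that counts how often it is evaluated."""

    def __init__(self):
        self.rollouts = 0

    def __call__(self, alpha):
        self.rollouts += 1   # one call stands for one trajectory
        return -(alpha[0] - 2.0) ** 2 - (alpha[1] + 1.0) ** 2 + alpha[2]

def finite_difference_gradient(V, alpha, eps=1e-4):
    """Central-difference gradient: two evaluations of V per parameter."""
    grad = []
    for i in range(len(alpha)):
        up, down = list(alpha), list(alpha)
        up[i] += eps
        down[i] -= eps
        grad.append((V(up) - V(down)) / (2.0 * eps))
    return grad

V = CountingObjective()
g = finite_difference_gradient(V, [0.0, 0.0, 0.0])
print(g, V.rollouts)
```

With three parameters this already costs six evaluations, and the count grows linearly with the number of parameters.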

Introduction of the Problem • Pathwise estimation of the gradient • Define • Dynamics of z_t: • Gradient (one part unknown, one part known): • In reinforcement learning, the state dynamics are unknown. How can z_t be approximated?
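
A sketch of the pathwise idea on a toy model, assuming z_t is the sensitivity of the state to the policy parameters and the objective is a terminal reward. The dynamics, the linear policy, and the reward are hypothetical, and the model is used explicitly here, which is exactly what a model-free method cannot do.

```python
def pathwise_gradient(alpha, x0=0.0, T=2.0, dt=1e-3):
    """Integrate the state x_t and its sensitivity z_t along one trajectory.

    Assumed toy model: dx/dt = -x + u with policy u = a0 + a1 * x,
    terminal reward r(x) = -(x - 1)^2.
    Returns (gradient of r(x_T) w.r.t. alpha, final state x_T).
    """
    a0, a1 = alpha
    x = x0
    z = [0.0, 0.0]                      # z_0 = 0: x0 does not depend on alpha
    for _ in range(int(round(T / dt))):
        dfdx = -1.0 + a1                # derivative of closed-loop dynamics in x
        dfda = [1.0, x]                 # derivative of closed-loop dynamics in alpha
        z = [zi + dt * (da + dfdx * zi) for zi, da in zip(z, dfda)]
        x = x + dt * (-x + a0 + a1 * x)
    drdx = -2.0 * (x - 1.0)             # derivative of the terminal reward
    return [drdx * zi for zi in z], x

grad, x_T = pathwise_gradient([0.5, -0.5])
print(grad, x_T)
```

A single pass yields the whole gradient, in contrast with the two trajectories per parameter that finite differences require.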

Discretized Stochastic Processes Approximation • A General Convergence Result If

Discretization of the state • Stochastic policy • Stochastic discrete state process Initialization: Jump in state
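
The discretization can be illustrated with a sketch in which a stochastic policy draws an action at every time-step and the state jumps accordingly. The dynamics f(x, u) = u - x, the two-action set, and the first-order jump are assumptions made for this example; the averaged dynamics they induce are dx/dt = p - x.

```python
import math
import random

def stochastic_discrete_process(p_one, x0, T, delta, rng):
    """Simulate X_{t+delta} = X_t + delta * f(X_t, u_t) with random actions.

    Assumptions: f(x, u) = u - x, actions U = {0, 1}, P(u_t = 1) = p_one.
    """
    x = x0
    for _ in range(int(round(T / delta))):
        u = 1.0 if rng.random() < p_one else 0.0   # action drawn from the policy
        x = x + delta * (u - x)                    # jump in state
    return x

def mean_ode_solution(p_one, x0, T):
    """Closed-form solution of the averaged dynamics dx/dt = p_one - x."""
    return p_one + (x0 - p_one) * math.exp(-T)

rng = random.Random(0)
target = mean_ode_solution(0.7, 0.0, 3.0)
for delta in (0.1, 0.01, 0.001):
    x_T = stochastic_discrete_process(0.7, 0.0, 3.0, delta, rng)
    print(delta, abs(x_T - target))
```

As the time-step shrinks, the random process should stay closer to the deterministic averaged trajectory, which is the kind of consistency the convergence result formalizes.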

Proof of Proposition 5: From Taylor’s formula: The average jump: Applying Theorem 3 directly then proves Proposition 5.

Discretization of the state gradient • Stochastic discrete state gradient process Initialization: With

Proof of Proposition 6: Since then Applying Theorem 3 directly proves Proposition 6.

Model-free Reinforcement Learning Algorithm Let In this stochastic approximation, is observed and is given, so we only need to approximate

Least-Squares Approximation of Define the set of past discrete times s, with t − c ≤ s ≤ t, at which action u_t has been taken. From Taylor’s formula, for all such discrete times s, We deduce

where We may derive an approximation of by solving the least-squares problem: Then we have Here denotes the average value of
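
A one-dimensional sketch of this kind of least-squares fit, assuming the quantity being estimated is the derivative of the dynamics with respect to the state. Observed jumps divided by the time-step are regressed on the visited states, with both centered on their averages. The dynamics `sin(x) + u` and all names are hypothetical.

```python
import math

def least_squares_slope(xs, increments, delta):
    """Estimate df/dx from states x_s and observed jumps x_{s+delta} - x_s."""
    ys = [dx / delta for dx in increments]   # Taylor: jump / delta ~ f(x_s, u)
    x_bar = sum(xs) / len(xs)                # average state
    y_bar = sum(ys) / len(ys)                # average scaled jump
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Hypothetical dynamics for one fixed action u: f(x, u) = sin(x) + u.
u, delta, x_t = 0.3, 1e-3, 0.5
xs = [x_t + 0.01 * k for k in range(-5, 6)]            # nearby past states
increments = [delta * (math.sin(x) + u) for x in xs]   # first-order jumps
estimate = least_squares_slope(xs, increments, delta)
print(estimate, math.cos(x_t))
```

Only states and jumps enter the estimate, so no model of the dynamics is needed; the fit is accurate only when the past states stay close to the current one.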

Algorithm

Experimental Results • Six continuous state variables: x0, y0: hand position; x, y: mass position; vx, vy: mass velocity • Four control actions: U = {(1,0), (0,1), (-1,0), (0,-1)} • Goal: reach a target (xG, yG) with the mass at a specific time T • Terminal reward function:

The system dynamics: Consider a Boltzmann-like stochastic policy where
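
A Boltzmann-like policy over the four control actions can be sketched as a softmax over per-action scores. The linear scoring rule and the feature vector below are assumptions for illustration; the slide's own parameterization is not shown here.

```python
import math
import random

ACTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]   # control set U from the slides

def boltzmann_policy(alpha, features):
    """Return action probabilities proportional to exp(score).

    Assumption: the score of action i is the dot product of alpha[i]
    with the feature vector.
    """
    scores = [sum(a * phi for a, phi in zip(row, features)) for row in alpha]
    top = max(scores)                            # subtract max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def sample_action(probs, rng):
    """Draw one action from U according to the given probabilities."""
    return rng.choices(ACTIONS, weights=probs, k=1)[0]

features = [1.0, 0.5]                            # hypothetical state features
alpha = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
probs = boltzmann_policy(alpha, features)
print(probs, sample_action(probs, random.Random(0)))
```

Because every action keeps a positive probability, the policy is differentiable in its parameters, which a policy gradient method relies on.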

Conclusion • Described a reinforcement learning method for approximating the gradient of a continuous-time deterministic problem with respect to the control parameters • Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process
