## The Right Way to do Reinforcement Learning with Function Approximation


Rich Sutton, AT&T Labs
*with thanks to Satinder Singh, David McAllester, and Mike Kearns*

## The Prize

- To find the "Right Way" to do RL with FA:
  - sound (stable, non-divergent)
  - ends up with a good policy
  - gets there quickly, efficiently
  - applicable to any (discrete-time, finite-state) MDP
  - compatible with (at least) linear FA
  - online and incremental
- To prove that it is so
- Tensions:
  - Proof and practice often pull in different directions
  - Lack of knowledge is not negative knowledge; we have not handled this well as a field, and it is critical to the viability of RL!

## Outline

- Questions
- History: from policy to value and back to policy
- Problem definition: why function approximation changes everything
- REINFORCE
- Policy gradient theory
- Do we need values? Do we need TD?
  - Return baselines: using values without bias
  - TD/bootstrapping/truncation: may not be possible without bias, but seems essential for reducing variance

## Questions

| Question | Answer |
| --- | --- |
| Is RL theory fundamentally different/harder with FA? | yes |
| Are value methods unsound with FA? | absolutely not |
| Should we prefer policy methods for other reasons? | probably |
| Is it sufficient to learn just a policy, not values? | apparently not |
| Didn't we already do all this policy stuff in the 1980s? | only some of it |
| Can values be used without introducing bias? | yes |
| Can TD (bootstrapping) be done without bias? | I wish |
| Is TD much more efficient than Monte Carlo? | apparently |
| Is it TD that makes FA hard? | yes and no, but mostly no |
| So are we stuck with dual, "actor-critic" methods? | maybe so |
| Are we talking about genetic algorithms? | no! |
| Are "heuristic" or "relative" values policy methods or value methods? | policy |

## The Swing towards Value Functions

- Early RL methods all used parameterized policies
- But adding value functions seemed key to efficiency
- Why not just learn action-value functions and compute policies from them?
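The value-method recipe just described, learn action values and derive the policy by maximizing over them, can be sketched in a few lines. This is an illustrative sketch, not code from the talk; `Q` is assumed to be an |S| x |A| array of estimated action values.

```python
import numpy as np

def greedy_policy(Q):
    """Deterministic policy: pick the argmax action in each state."""
    return np.argmax(Q, axis=1)

def epsilon_greedy(Q, state, epsilon=0.1, rng=np.random.default_rng(0)):
    """Exploring variant used by Q-learning and Sarsa in practice."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

Q = np.array([[1.0, 2.0],   # toy action values: 2 states, 2 actions
              [0.5, 0.1]])
print(greedy_policy(Q))     # prints [1 0]
```

Note the argmax: it is exactly this discontinuous maximization, applied to a continuously changing approximate Q, that the later slides identify as the source of trouble with FA.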
- A prediction problem, almost supervised
- Fewer parameters
- Enabled the first proofs of convergence to the optimal policy
- Impressive applications using FA
- So successful that early policy work was bypassed

Q-learning (Watkins, 1989): cleaner, simpler, easier to use.

## The Swing away from Value Functions

- Theory hit a brick wall for RL with FA:
  - Q-learning was shown to diverge with linear FA
  - Many counterexamples to convergence
  - Widespread scepticism about any argmax-VF solution, that is, about any way to get conventional convergence
- But is this really a problem?
  - In practice, on-policy methods perform well
  - Is this only a problem for our theory?
- With Gordon's latest result, these concerns now seem to have been hasty, and are invalid

## Why FA makes RL hard

- All the states interact and must be balanced, traded off
- Which states are visited is affected by the policy
- A small change (or error) in the VF estimate can cause a large, discontinuous change in the policy, which in turn can cause a large change in the VF estimate

## Diagram of What Happens in Value Function Space

(Figure: a map of value-function space, contrasting the true $V^*$, the region of inadmissible value functions, and the value functions consistent with the best admissible policy. The original naïve hope was guaranteed convergence to a good policy. Sarsa, TD(λ), and other on-policy methods chatter without divergence; residual gradient et al. have guaranteed convergence to a less desirable policy; for Q-learning, DP, and other off-policy methods, divergence is possible.)

## …and towards Policy Parameterization

- A parameterized policy (PP) can be changed continuously
- A PP can find stochastic policies
  - with FA, the optimal policy will often be stochastic
- A PP can omit the argmax (the action-space search)
  - necessary for large/continuous action spaces
- A PP can be more direct, simpler
- Prior knowledge is often better expressed as a PP
- A PP method can be proven convergent to a local optimum for general differentiable FA (REINFORCE; Williams, 1988)

## Defining the Problem (RL with FA), Part I: Parameterized Policies

- Finite state and action sets $\mathcal{S}$, $\mathcal{A}$, with $|\mathcal{S}| = N$
- Discrete time $t = 0, 1, 2, 3, \ldots$
- Transition probabilities $p^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s,\, a_t = a\}$
- Expected rewards $r^a_s = E\{r_{t+1} \mid s_t = s,\, a_t = a\}$
- Stochastic policy $\pi(s, a) = \Pr\{a_t = a \mid s_t = s\}$, possibly parameterized by $\theta \in \Re^n$, with $n \ll N$ (w.l.o.g.)

## Examples of Policy Parameterizations

- Ranking numbers, one per action, computed from weights and features of $s$; action probabilities obtained by Gibbs exponentiation or $\Sigma = 1$ normalization
- Or a single ranking number computed from features of $s$ and $a$, repeated for each $a \in \mathcal{A}$, again mapped to action probabilities by Gibbs or $\Sigma = 1$ normalization
- Ranking numbers are mechanistically like action values, but do not have value semantics
- Many "heuristic" or "relative" values are better viewed as ranking numbers

## More Policy Parameterizations

- Continuous actions work too, e.g., a Gaussian sampler whose mean and standard deviation are computed from weights and features of $s$; the weights implicitly determine the continuous distribution
- Much stranger parameterizations are possible, e.g., cascades of interacting stochastic processes, such as in a communications network or factory
- We require only that our policy process produces actions and the gradients $\partial \pi(s_t, a_t) / \partial \theta$

## Defining the Problem (RL with FA), Part II

Choose $\pi$ to maximize a measure of total future reward, called the return $R_t$. Values are expected returns:

$$V^\pi(s) = E\{R_t \mid s_t = s, \pi\} \qquad Q^\pi(s,a) = E\{R_t \mid s_t = s, a_t = a, \pi\}$$

Optimal policies:

$$\pi^* = \arg\max_\pi V^\pi(s) \;\; \forall s \in \mathcal{S} \qquad \pi^*(s) = \arg\max_a Q^*(s,a)$$

Value methods maintain a parameterized approximation to a value function ($V^\pi$, $Q^\pi$, $V^*$, or $Q^*$) and then compute their policy from it, e.g.,

$$\pi(s) = \arg\max_a \hat{Q}_\theta(s,a)$$

## FA Breaks the Standard Problem Definition!

Discounted case: one infinite, ergodic episode $s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots$ with return

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

- Let $\Pi$ be the space of all policies, and let $\Pi_\theta$ be all policies consistent with the parameterization
- Problem: $V^\pi(s)$ depends on $s$! No one policy in $\Pi_\theta$ is best for all states; states compete for control of $\theta$
- We need an overall (not per-state) measure of policy quality, e.g., weighting states by $d^\pi(s)$, the asymptotic fraction of time spent in $s$ under $\pi$
- But! Thm: $J(\pi)$ is then independent of $\gamma$! $J(\pi)$ = average reward per step

## RL Cases Consistent with FA

- Average-reward case: one infinite, ergodic episode $s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots$
- Episodic case: many episodes, all starting from $s_0$

## Outline

- Questions
- History: from policy to value and back to policy
- Problem definition: why function approximation changes everything
- REINFORCE
- Policy gradient theory
- Do we need values? Do we need TD?
  - Return baselines: using values without bias
  - TD/bootstrapping/truncation: may not be possible without bias, but seems essential for reducing variance

## Do we need Values at all?

Extended REINFORCE (Williams, 1988), episodic case, offline updating:

$$\Delta\theta = \alpha \sum_t R_t \frac{\nabla_\theta \pi(s_t, a_t)}{\pi(s_t, a_t)}$$

- Thm: there is also an online, incremental implementation using eligibility traces
- Converges to a local optimum of $J$ for general differentiable FA!
- Simple, clean, a single parameter... Why didn't we love this algorithm in 1988?
- No TD/bootstrapping (it is a Monte Carlo method), so it was thought to be inefficient
- Extended to the average-reward case (Baxter and Bartlett, 1999)

## Policy Gradient Theorem

(Marbach & Tsitsiklis '98; Jaakkola, Singh & Jordan '95; Cao & Chen '97; Sutton, McAllester, Singh & Mansour '99; Konda & Tsitsiklis '99; Williams '88)

Thm:

$$\frac{\partial J}{\partial \theta} = \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s,a)}{\partial \theta}\, Q^\pi(s,a)$$

where $d^\pi(s)$ is how often $s$ occurs under $\pi$. Note that the gradient does not involve $\partial d^\pi(s) / \partial \theta$.

## Policy Gradient Theory

Since $d^\pi(s)\,\pi(s,a)$ is how often the pair $(s,a)$ occurs under $\pi$, sampling gives updates of the form

$$\Delta\theta_t = \alpha\, (\cdot)\, \frac{\nabla_\theta \pi(s_t, a_t)}{\pi(s_t, a_t)}$$

where the error term $(\cdot)$ can be:

- the return $R_t$ (REINFORCE)
- the return minus a baseline, $R_t - b(s_t)$ (REINFORCE with baseline)
- a learned action value $\hat{Q}(s_t, a_t)$ (actor-critic)
- a general λ form, which includes all of the above, with possible TD/bootstrapping

What is the ideal baseline?

## The Ideal Baseline

Conjecture: the ideal baseline is $b(s) = V^\pi(s)$, in which case our error term is an advantage (Baird '93):

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$

No bias is introduced by an approximation here: replacing $V^\pi$ with a learned $\hat{V}$ leaves the expected update unchanged. How important is a baseline to the efficiency of REINFORCE? Apparently very important, but previous tests were flawed.

## Random MDP Testbed

- 50 randomly constructed episodic MDPs
- 50 states, uniform starting distribution
- 2 actions per state; 2 possible next states per action
- expected rewards (1, 1); actual rewards + (0, 0.1)
- 0.1 probability of termination on each step
- State-aggregation FA: 5 groups of 10 states each
- Gibbs action selection
- Baseline learned by gradient descent
- Parameters initially zero; step-size parameters $\alpha$ (policy) and $\beta$ (baseline)

## Effect of Learned Baseline

(Figure: $J(\pi)$ after 50 episodes vs. policy step-size $\alpha$ from 0.01 to 1, for REINFORCE with per-episode updating; curves for baseline step-sizes $\beta$ = 0.01, 0.02, 0.04, 0.1, 0.2 and for no baseline.)

It is much better to learn a baseline approximating $V^\pi$.

## Can We TD without Introducing Bias?

Thm: an approximation $\hat{Q}$ can replace $Q^\pi$ without bias if it is of the compatible form

$$\hat{Q}_w(s,a) = w^\top \frac{\nabla_\theta \pi(s,a)}{\pi(s,a)}$$

and has converged to a local optimum (Sutton et al. '99; Konda & Tsitsiklis '99).

However! Thm: under batch updating, such a $\hat{Q}$ results in exactly the same updates as REINFORCE; there is no useful bootstrapping. Empirically, there is also no win with per-episode updating (Singh, McAllester & Sutton, unpublished).

## Effect of Unbiased Linear $\hat{Q}$

(Figure: $J(\pi)$ after 50 episodes vs. step-size $\alpha$, per-episode updating; the unbiased $\hat{Q}$ is at best equal to REINFORCE.)

## TD Creates Bias; Must We TD?

Is TD really more efficient than Monte Carlo? Apparently "yes", but this question deserves a better answer.

## Is it TD that makes FA hard?

(Diagram: a small change in value causes a large, discontinuous change in policy, which causes a large change in the state distribution.)

- Yes: TD prediction with FA is trickier than Monte Carlo
  - even the linear case converges only to near an optimum
  - nonlinear cases can even diverge
- No: TD is not the reason the control case is hard
  - this problem is intrinsic to control + FA
  - it happens even with Monte Carlo methods

## Small Sample Importance Sampling: A Superior Eligibility Term?

Thm (restated): $\frac{\partial J}{\partial \theta} = \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s,a)}{\partial \theta} Q^\pi(s,a)$, where $d^\pi(s)$ is how often $s$ occurs under $\pi$ and $d^\pi(s)\,\pi(s,a)$ is how often the pair $(s,a)$ occurs under $\pi$. Could a small-sample importance-sampling correction give a superior eligibility term in $\Delta\theta_t$?

## Questions

| Question | Answer |
| --- | --- |
| Is RL theory fundamentally different/harder with FA? | yes |
| Are value methods unsound with FA? | absolutely not |
| Should we prefer policy methods for other reasons? | probably |
| Is it sufficient to learn just a policy, not values? | apparently not |
| Can values be used without introducing bias? | yes |
| Can TD (bootstrapping) be done without bias? | I wish |
| Is TD much more efficient than Monte Carlo? | apparently |
| Is it TD that makes FA hard? | yes and no, but mostly no |
| So are we stuck with dual, "actor-critic" methods? | maybe so |
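The pieces above, a Gibbs (softmax) policy parameterization, per-episode REINFORCE updates, and a state-value baseline learned by gradient descent, can be combined into a minimal sketch. This is not the talk's code: the 2-state toy MDP, the `step` dynamics, and the constants are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))  # policy ranking numbers (weights)
v = np.zeros(n_states)                   # learned baseline, approximates V^pi
alpha, beta, gamma = 0.1, 0.1, 1.0       # policy step, baseline step, discount

def pi(s):
    """Gibbs/softmax action probabilities from ranking numbers theta[s]."""
    prefs = theta[s] - theta[s].max()    # subtract max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def step(s, a):
    """Hypothetical toy dynamics: action 0 terminates with reward 0;
    action 1 yields reward 1, moves to the other state, and terminates
    with probability 0.5."""
    if a == 0:
        return None, 0.0
    return (None if rng.random() < 0.5 else 1 - s), 1.0

for episode in range(2000):
    # Generate one episode under the current stochastic policy.
    s, traj = 0, []
    while s is not None:
        a = int(rng.choice(n_actions, p=pi(s)))
        s2, r = step(s, a)
        traj.append((s, a, r))
        s = s2
    # Offline (per-episode) REINFORCE update with baseline b(s) = v[s].
    G = 0.0
    for (s, a, r) in reversed(traj):
        G = r + gamma * G                # return R_t from time t onward
        delta = G - v[s]                 # advantage estimate R_t - b(s_t)
        v[s] += beta * delta             # baseline learned by gradient descent
        grad_log = -pi(s)                # grad of log pi(s,a) w.r.t. theta[s]
        grad_log[a] += 1.0
        theta[s] += alpha * delta * grad_log

print(pi(0))  # action 1 earns reward, so its probability should be high
```

The `grad_log` line is the softmax identity $\nabla \ln \pi(s,a) = \mathbf{1}_a - \pi(s,\cdot)$, i.e., the $\nabla_\theta \pi / \pi$ eligibility term from the slides; subtracting the learned `v[s]` is exactly the baseline whose effect the testbed above measures.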