Introduction to Reinforcement Learning

Presentation Transcript


  1. Introduction to Reinforcement Learning Gerry Tesauro IBM T.J.Watson Research Center http://www.research.ibm.com/infoecon http://www.research.ibm.com/massdist

  2. Outline • Statement of the problem: • What RL is all about • How it’s different from supervised learning • Mathematical Foundations • Markov Decision Problem (MDP) framework • Dynamic Programming: Value Iteration, ... • Temporal Difference (TD) and Q Learning • Applications: Combining RL and function approximation

  3. Acknowledgement • Lecture material shamelessly adapted from: R. S. Sutton and A. G. Barto, “Reinforcement Learning” • Book published by MIT Press, 1998 • Available on the web at: RichSutton.com • Many slides shamelessly stolen from web site

  4. Basic RL Framework • 1. Learning with evaluative feedback • Learner’s output is “scored” by a scalar signal (“Reward” or “Payoff” function) saying how well it did • Supervised learning: Learner is told the correct answer! • May need to try different outputs just to see how well they score (exploration …)

  7. Basic RL Framework • 2. Learning to Act: Learning to manipulate the environment • Supervised learning is passive: Learner doesn’t affect the distribution of exemplars or the class labels

  9. Basic RL Framework • Learner has to figure out which action is best, and which actions lead to which states. Might have to try all actions!  • Exploration vs. Exploitation: when to try a “wrong” action vs. sticking to the “best” action

  10. Basic RL Framework • 3. Learning Through Time: • Reward is delayed (Act now, reap the reward later) • Agent may take long sequence of actions before receiving reward • “Temporal Credit Assignment” Problem: Given sequence of actions and rewards, how to assign credit/blame for each action?

  14. Agent’s objective is to maximize the expected value of the “return” R_t: the discounted sum of future rewards: R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … = Σ_{k≥0} γ^k r_{t+k+1} • γ is a “discount parameter” (0 ≤ γ ≤ 1) • Example: Cart-Pole Balancing Problem: • reward = -1 at failure, else 0 • expected return = -γ^k for k steps to failure • return is maximized by making k → ∞
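
To make the return concrete, here is a minimal Python sketch (not part of the original slides) computing a discounted return for the cart-pole reward scheme above; the function name and the toy reward sequences are illustrative.

```python
# Illustrative sketch, not from the slides: the discounted return
# R_t = r_{t+1} + gamma*r_{t+2} + ... for a cart-pole-style reward sequence.

def discounted_return(rewards, gamma):
    """Sum of gamma^k * rewards[k] over a reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Reward is 0 until failure, then -1; the longer failure is postponed,
# the closer the discounted return gets to 0 (its maximum here).
for k in (5, 20, 100):
    rewards = [0.0] * (k - 1) + [-1.0]
    print(k, discounted_return(rewards, gamma=0.9))
```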

  15. We consider non-deterministic environments: • Action a_t in state s_t • Probability distribution of rewards r_{t+1} • Probability distribution of new states s_{t+1} • Some environments have a nice property: the distributions are history-independent and stationary. These are called Markov environments, and the agent’s task is a Markov Decision Problem (MDP)

  16. An MDP specification consists of: • a list of states s ∈ S • the legal action set A(s) for every s • a set of transition probabilities for every (s, a, s’): P^a_{ss’} = Pr( s_{t+1} = s’ | s_t = s, a_t = a ) • a set of expected rewards for every (s, a, s’): R^a_{ss’} = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s’ ]
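
As a concrete illustration (not from the slides), an MDP specification of this kind can be stored as plain Python dictionaries; the two-state toy MDP and the (next_state, probability, expected_reward) layout below are assumptions for the sketch and are reused by the code sketches that follow.

```python
# Hypothetical encoding of an MDP specification.
# P[s][a] is a list of (next_state, probability, expected_reward) triples,
# i.e. it bundles P^a_{ss'} and R^a_{ss'} together.
toy_mdp = {
    "states": ["s0", "s1"],
    "actions": {"s0": ["stay", "go"], "s1": ["stay", "go"]},
    "P": {
        "s0": {"stay": [("s0", 1.0, 0.0)],
               "go":   [("s1", 0.9, 1.0), ("s0", 0.1, 0.0)]},
        "s1": {"stay": [("s1", 1.0, 2.0)],
               "go":   [("s0", 1.0, 0.0)]},
    },
}
```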

  17. Given an MDP specification: • Agent learns a policy π: • deterministic policy: π(s) = action to take in state s • non-deterministic policy: π(s, a) = probability of choosing action a in state s • Agent’s objective is to learn the policy that maximizes the expected value of the return R_t • The “Value Function” associated with a policy tells us how good the policy is. Two types of value functions ...

  18. State-Value Function: V^π(s) = expected return starting in state s and following policy π: V^π(s) = E_π[ R_t | s_t = s ] • Action-Value Function: Q^π(s, a) = expected return starting from action a in state s, and then following policy π: Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ]

  19. Bellman Equation for a Policy π • The basic idea: R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … = r_{t+1} + γ R_{t+1} • Apply the expectation for state s under policy π: V^π(s) = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ] = Σ_a π(s, a) Σ_{s’} P^a_{ss’} [ R^a_{ss’} + γ V^π(s’) ] • This is a linear system of equations for V^π, with a unique solution
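
A minimal sketch of iterative policy evaluation, assuming the P[s][a] = [(s', prob, expected reward), ...] layout of the toy MDP above and a stochastic policy given as policy[s][a] = π(s, a); it repeatedly applies the Bellman equation until the values stop changing, which is equivalent to solving the linear system.

```python
def evaluate_policy(P, policy, gamma=0.9, tol=1e-8):
    """Iteratively apply the Bellman equation for V^pi until convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(policy[s][a] * sum(p * (r + gamma * V[s2])
                                       for s2, p, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v                      # in-place (sweep-by-sweep) update
        if delta < tol:
            return V

# e.g. evaluate_policy(toy_mdp["P"], {"s0": {"stay": 0.5, "go": 0.5},
#                                     "s1": {"stay": 0.5, "go": 0.5}})
```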

  21. Why V*, Q* are useful • Any policy π that is greedy w.r.t. V* or Q* is an optimal policy π*. • One-step lookahead using V*: π*(s) = argmax_a Σ_{s’} P^a_{ss’} [ R^a_{ss’} + γ V*(s’) ] • Zero-step lookahead using Q*: π*(s) = argmax_a Q*(s, a)
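
A sketch of both kinds of greedy policy extraction, again under the assumed P[s][a] layout above; the helper names are illustrative, not from the slides.

```python
def greedy_from_V(P, V, gamma=0.9):
    """One-step lookahead: pi*(s) = argmax_a sum_s' P^a_ss' [R^a_ss' + gamma V*(s')]."""
    return {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                           for s2, p, r in P[s][a]))
            for s in P}

def greedy_from_Q(Q):
    """Zero-step lookahead: pi*(s) = argmax_a Q*(s, a), with Q[s] a dict over actions."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}
```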

  22. Two methods to solve for V*, Q* • Policy improvement: given a policy π, find a better policy π’. • Policy Iteration: keep repeating the above and ultimately you will get to π*. • Value Iteration: directly solve Bellman’s optimality equation, without explicitly writing down the policy.

  23. Policy Improvement • Evaluate the policy: given π, compute V^π(s) and Q^π(s, a) (from the linear Bellman equations). • For every state s, construct a new policy: do the best initial action, and then follow policy π thereafter: π’(s) = argmax_a Q^π(s, a) • The new policy is greedy w.r.t. Q^π(s, a) and V^π(s) ⟹ V^π’(s) ≥ V^π(s) ⟹ π’ ≥ π in our partial ordering.

  24. Policy Improvement, contd. • What if the new policy has the same value as the old policy? (V^π’(s) = V^π(s) for all s) • Then V^π(s) = max_a Σ_{s’} P^a_{ss’} [ R^a_{ss’} + γ V^π(s’) ] • But this is the Bellman Optimality equation: if V^π solves it, then it must be the optimal value function V*.
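
Putting slides 22–24 together, here is a compact policy iteration sketch under the same assumed P[s][a] layout; the stopping test is exactly the “new policy equals old policy” condition above.

```python
def policy_iteration(P, gamma=0.9, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: next(iter(P[s])) for s in P}          # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: iterate the Bellman equation for the current policy.
        V = {s: 0.0 for s in P}
        while True:
            delta = 0.0
            for s in P:
                v = sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: act greedily w.r.t. V^pi in every state.
        improved = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                                   for s2, p, r in P[s][a]))
                    for s in P}
        if improved == policy:                         # same policy => Bellman optimality holds
            return policy, V
        policy = improved
```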

  26. Value Iteration • Use the Bellman Optimality equation to define an iterative “bootstrap” calculation: V_{k+1}(s) = max_a Σ_{s’} P^a_{ss’} [ R^a_{ss’} + γ V_k(s’) ] • This is guaranteed to converge to the unique V* (the backup is a contraction mapping)
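
A minimal value iteration sketch under the same assumed P[s][a] layout; it applies the bootstrap backup above until the largest change falls below a tolerance.

```python
def value_iteration(P, gamma=0.9, tol=1e-8):
    """Iterate V_{k+1}(s) = max_a sum_{s'} P^a_ss' [R^a_ss' + gamma V_k(s')]."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V
```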

  27. Summary of DP methods • Guaranteed to converge to π* in polynomial time (in the size of the state space); in practice often faster than linear • The method of choice if you can do it. • Why it might not be doable: • your problem is not an MDP • the transition probabilities and rewards are unknown or too hard to specify • Bellman’s “curse of dimensionality”: the state space is too big (>> O(10^6) states) • RL may be useful in these cases

  28. Monte Carlo Methods • Estimate V^π(s) by sampling • Perform a trial: run the policy starting from s until a termination state is reached; measure the actual return R_t • N trials: the average R_t is accurate to ~ 1/sqrt(N) • No “bootstrapping”: not using V(s’) to estimate V(s) • Two important advantages of Monte Carlo: • can learn online without a model of the environment • can learn in a simulated environment
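
A sketch of the Monte Carlo estimate; run_episode is an assumed callable that executes the fixed policy from s in the (possibly simulated) environment and returns the observed return R_t.

```python
def mc_estimate_value(s, run_episode, n_trials=1000):
    """Estimate V^pi(s) as the average of sampled returns; no bootstrapping,
    i.e. V(s') is never used to estimate V(s)."""
    returns = [run_episode(s) for _ in range(n_trials)]
    return sum(returns) / len(returns)   # standard error shrinks like 1/sqrt(N)
```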

  30. Temporal Difference Learning • Error signal: the difference between the current estimate and an improved estimate; it drives the change of the current estimate • Supervised learning error: error(x) = target_output(x) - learner_output(x) • Bellman error (DP): error(s) = E_π[ r_{t+1} + γ V(s_{t+1}) | s_t = s ] - V(s) = “1-step full-width lookahead” - “0-step lookahead” • Monte Carlo error: error(s) = <R_t> - V(s) = “many-step sample lookahead” - “0-step lookahead”

  31. TD error signal • Temporal Difference Error Signal: take one step using the current policy, observe r and s’, then: error(s) = r + γ V(s’) - V(s) = “1-step sample lookahead” - “0-step lookahead” • In particular, for undiscounted sequences with no intermediate rewards, we have simply: error(s) = V(s’) - V(s) • Self-consistent prediction goal: predicted returns should be self-consistent from one time step to the next (true of both TD and DP)

  32. Learning using the Error Signal: we could just do a reassignment: V(s) ← r + γ V(s’) • But it’s often a good idea to learn incrementally: V(s) ← V(s) + α [ r + γ V(s’) - V(s) ], where α is a small “learning rate” parameter (either constant, or decreasing with time) • The above algorithm is known as “TD(0)”; convergence to be discussed later...
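
The TD(0) rule above as a code sketch; V is assumed to be a lookup table (a dict) and the function name is illustrative.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """One TD(0) backup: V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error
```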

  33. Advantages of TD Learning • Combines the “bootstrapping” (1-step self-consistency) idea of DP with the “sampling” idea of MC; maybe the best of both worlds • Like MC, doesn’t need a model of the environment, only experience • TD, but not MC, can be fully incremental • you can learn before knowing the final outcome • you can learn without the final outcome (from incomplete sequences) • Bootstrapping ⟹ TD has reduced variance compared to Monte Carlo, but possibly greater bias

  38. The point of the λ parameter • (My view): λ in TD(λ) is a knob to twiddle: it provides a smooth interpolation between λ = 0 (pure TD) and λ = 1 (pure MC) • For many toy grid-world type problems, one can show that intermediate values of λ work best. • For real-world problems, the best λ will be highly problem-dependent.

  39. Convergence of TD(λ) • TD(λ) converges to the correct value function V^π(s) with probability 1 for all λ. Requires: • a lookup-table representation (V(s) is a table), • all states must be visited an infinite # of times, • a certain schedule for decreasing α(t) (usually α(t) ~ 1/t) • BUT: TD(λ) converges only for a fixed policy π. What if we want to learn π as well as V? We still have more work to do ...

  40. Q-Learning: TD Idea to Learn π* • Q-Learning (Watkins, 1989): a one-step sample backup to learn the action-value function Q(s, a). The most important RL algorithm in use today. Uses the one-step error: error = r + γ max_{a’} Q(s’, a’) - Q(s, a) to define an incremental learning algorithm: Q(s, a) ← Q(s, a) + α(t) [ r + γ max_{a’} Q(s’, a’) - Q(s, a) ], where α(t) follows the same schedule as in the TD algorithm.
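
The Q-learning backup as a code sketch, with Q assumed to be a nested dict Q[s][a]; terminal-state handling is omitted for brevity.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_error = r + gamma * max(Q[s_next].values()) - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error
```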

  41. Nice properties of Q-learning • Q guaranteed to converge to Q* with probability 1. • Greedy π guaranteed to converge to π*. • But (amazingly), don’t need to follow a fixed policy, or the greedy policy, during learning! Virtually any policy will do, as long as all (s, a) pairs are visited infinitely often. • As with TD, don’t need a model, can learn online, both bootstraps and samples.

  42. RL and Function Approximation • DP infeasible for many real applications due to the curse of dimensionality: |S| too big. • FA may provide a way to “lift the curse”: • the complexity D of the FA needed to capture the regularity in the environment may be << |S| • no need to sweep thru the entire state space: train on N “plausible” samples and then generalize to similar samples drawn from the same distribution. • PAC learning tells us generalization error ~ D/N ⟹ N need only scale linearly with D.

  43. RL + Gradient Parameter Training • Recall the incremental training of lookup tables: V(s) ← V(s) + α [ R - V(s) ] • If instead V(s) = V_θ(s), adjust θ to reduce the MSE (R - V_θ(s))^2 by gradient descent: θ ← θ + α [ R - V_θ(s) ] ∇_θ V_θ(s)
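
A sketch of the gradient step for a linear approximator V_θ(s) = θ·x(s) (the slide allows any differentiable V_θ; a linear model keeps ∇_θ V_θ = x(s) explicit). The function name and argument layout are assumptions.

```python
import numpy as np

def gradient_value_update(theta, x, target, alpha=0.01):
    """One gradient-descent step on (target - V_theta(s))^2 for V_theta(s) = theta . x."""
    theta = np.asarray(theta, dtype=float)
    x = np.asarray(x, dtype=float)
    error = target - theta @ x          # R - V_theta(s)
    return theta + alpha * error * x    # grad_theta V_theta(s) = x for the linear model
```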

  44. Example: TD(λ) training of neural networks (episodic; γ = 1 and intermediate r = 0): Δw_t = α ( V_{t+1} - V_t ) Σ_{k=1..t} λ^{t-k} ∇_w V_k
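
A sketch of that update using an eligibility trace, written for a linear value model rather than the slide's neural net (the trace accumulates the λ-discounted gradients, which for the linear model are just the feature vectors); xs and z stand for the episode's input patterns and final reward.

```python
import numpy as np

def td_lambda_episode(xs, z, theta, alpha=0.01, lam=0.7):
    """One episode of TD(lambda) with gamma = 1 and intermediate rewards 0,
    for V(x) = theta . x; the final prediction target is the outcome z."""
    theta = np.asarray(theta, dtype=float)
    xs = [np.asarray(x, dtype=float) for x in xs]
    e = np.zeros_like(theta)                       # eligibility trace: sum_k lam^(t-k) grad V_k
    for t, x in enumerate(xs):
        v = theta @ x
        e = lam * e + x                            # grad_theta V(x) = x for the linear model
        v_next = z if t == len(xs) - 1 else theta @ xs[t + 1]
        theta = theta + alpha * (v_next - v) * e   # delta w_t = alpha (V_{t+1} - V_t) e_t
    return theta
```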

  45. Case-Study Applications • Several commonalities: • Problems are more-or-less MDPs • |S| is enormous ⟹ can’t do DP • State-space representation critical: use of “features” based on domain knowledge • FA is reasonably simple (linear or NN) • Train in a simulator! Need lots of experience, but still << |S| • Only visit plausible states; only generalize to plausible states

  47. Learning backgammon using TD(λ) • The neural net observes a sequence of input patterns x_1, x_2, x_3, …, x_f: the sequence of board positions occurring during a game • Representation: raw board description (# of White or Black checkers at each location) using a simple truncated unary encoding. (“Hand-crafted features” were added in later versions) • At the final position x_f, a reward signal z is given: • z = 1 if White wins • z = 0 if Black wins • Train the neural net using the gradient version of TD(λ) • The trained NN output V_t = V(x_t, w) should estimate prob( White wins | x_t )

  49. Q: Who makes the moves?? • A: Let the neural net make the moves itself, using its current evaluator: score all legal moves, and pick max V_t for White, or min V_t for Black. • Hopelessly non-theoretical and crazy: • Training V using a non-stationary π (no convergence proof) • Training V using a nonlinear function approximator (no convergence proof) • Random initial weights ⟹ random initial play! An extremely long sequence of random moves and random outcomes ⟹ learning seems hopeless to a human observer • But what the heck, let’s just try and see what happens...
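
A sketch of that self-play move selection; value_net(pos) and successor(pos, move) are hypothetical helpers standing in for the network evaluation and the board-update rule.

```python
def choose_move(position, legal_moves, value_net, successor, white_to_move):
    """Score every legal successor with the current evaluator and pick
    max V for White, min V for Black (value_net estimates prob White wins)."""
    pick = max if white_to_move else min
    return pick(legal_moves, key=lambda m: value_net(successor(position, m)))
```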
