
Reinforcement Learning’s Computational Theory of Mind


Presentation Transcript


  1. Reinforcement Learning’s Computational Theory of Mind Rich Sutton, with thanks to: Andy Barto, Satinder Singh, Doina Precup

  2. Outline • Computational Theory of Mind • Reinforcement Learning • Some vivid examples • Intuition of RL’s Computational Theory of Mind • Reward, policy, value (prediction of reward) • Some of the Math • Policy iteration (values & policies build on each other) • Discounted reward • TD error • Speculative extensions • Reason • Knowledge

  3. Honeybee Brain & VUM Neuron (Hammer & Menzel)

  4. Dopamine Neurons Signal “Error/Change” in Prediction of Reward

  5. Marr’s Three Levels at which any information processing system can be understood • Computational Theory Level (what and why?) • What are the goals of the computation? • What is being computed? • Why are these the right things to compute? • What overall strategy is followed? • Representation and Algorithm Level (how?) • How are these things computed? • What representations and algorithms are used? • Hardware Implementation Level (really how?) • How is this implemented physically?

  6. Cash Register • Computational Theory • Adding numbers • Making change • Computing tax • Controlling access to the cash drawer • Representations and Algorithms • Are numbers stored in decimal, binary, or BCD? • Is multiplication done by repeated adding? • Hardware Implementation • Silicon or gears? • Motors or springs?

  7. Word Processor • Computational Theory • The whole “application” level • To display the document as it will appear on paper • To make it easy to change, enhance • Representations and Algorithms • How is the document stored? • What algorithms are used to maintain its display? • The underlying C code • Hardware Implementation • How does the display work? • How does the silicon implement the computer?

  8. Flight • Computational Theory • Aerodynamics • Lift, propulsion, airfoil shape • Representations and Algorithms • Fixed wings or flapping? • Hardware Implementation • Steel, wood, or feathers?

  9. Importance of Computational Theory • The levels are loosely coupled • Each has an internal logic, coherence of its own • But computational theory drives all lower levels • We have so often gone wrong by mixing CT with the other levels • Imagined neuro-physiological constraints • The meaning of connectionism, neural networks • Constraints within the CT level have more force • We have little computational theory for AI • No clear problem definition • Many methods for knowledge representation, but no theory of knowledge

  10. Outline • Computational Theory of Mind • Reinforcement Learning • Some vivid examples • Intuition of RL’s Computational Theory of Mind • Reward, policy, value (prediction of reward) • Some of the Math • Policy iteration (values & policies build on each other) • Discounted reward • TD error • Speculative extensions • Reason • Knowledge

  11. Reinforcement learning: learning from interaction to achieve a goal (figure: an agent and its environment exchanging state, action, and reward) • complete agent • temporally situated • continual learning & planning • object is to affect the environment • environment stochastic & uncertain
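A minimal sketch of the interaction loop the slide describes, assuming a generic environment with reset/step methods and a hypothetical policy function (all names here are illustrative, not from the talk):

```python
# Minimal agent-environment interaction loop (illustrative sketch).
# `env` and `policy` are hypothetical stand-ins for any RL environment and agent.

def run_episode(env, policy, max_steps=1000):
    """One episode: the agent observes a state, acts, and receives a reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                    # agent chooses an action
        state, reward, done = env.step(action)    # environment responds with next state and reward
        total_reward += reward                    # the agent's object: cumulative reward
        if done:
            break
    return total_reward
```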

  12. Strands of History of RL (timeline figure). Three threads: trial-and-error learning; temporal-difference learning; optimal control and value functions. Names on the timeline: Hamilton (physics, 1800s), Thorndike (1911), secondary reinforcement, Shannon, Samuel, Minsky, Bellman/Howard (OR), Holland, Klopf, Witten, Werbos, Barto et al., Sutton, Watkins

  13. Hiroshi Kimura’s RL Robots (videos: before learning, after learning, walking backward; a new robot, same algorithm)

  14. The RoboCup Soccer Competition

  15. The Acrobot Problem (e.g., DeJong & Spong, 1994; Sutton, 1995). A two-link pendulum on a fixed base, with torque applied only at the joint between the two links (joint angles q1, q2). Goal: raise the tip above the line (minimum time to goal; reward = –1 per time step). 4 state variables: 2 joint angles, 2 angular velocities. Tile coding with 48 layers.
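The tile coding mentioned on the slide can be sketched roughly as below; the 48 layers follow the slide, but the grid resolution, offsets, and the code itself are illustrative assumptions rather than the original implementation:

```python
import numpy as np

# Illustrative tile coder (not the original implementation): each of the 48 tilings
# lays a coarse grid over the normalized state space, offset slightly from the others;
# the features of a state are the indices of its active tiles, one per tiling.

def tile_indices(state, num_tilings=48, tiles_per_dim=8):
    """state: values (e.g., joint angles and velocities) already normalized to [0, 1)."""
    state = np.asarray(state, dtype=float)
    active = []
    for t in range(num_tilings):
        offset = t / (num_tilings * tiles_per_dim)              # shift each tiling by a fraction of a tile
        coords = np.floor((state + offset) * tiles_per_dim).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)
        index = t                                               # flatten (tiling, grid coords) to one index
        for c in coords:
            index = index * tiles_per_dim + c
        active.append(index)
    return active   # the approximate value is then the sum of the weights of these tiles
```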

  16. Examples of Reinforcement Learning • RoboCup Soccer Teams Stone & Veloso, Riedmiller et al. • World’s best player of simulated soccer, 1999; Runner-up 2000 • Inventory Management Van Roy, Bertsekas, Lee & Tsitsiklis • 10-15% improvement over industry standard methods • Dynamic Channel Assignment Singh & Bertsekas, Nie & Haykin • World's best assigner of radio channels to mobile telephone calls • Elevator Control Crites & Barto • (Probably) world's best down-peak elevator controller • Many Robots • navigation, bi-pedal walking, grasping, switching between skills... • TD-Gammon and Jellyfish Tesauro, Dahl • World's best backgammon player

  17. New Applications of RL (real-world applications using on-line learning) • CMUnited RoboCup Soccer Team Stone & Veloso • World’s best player of RoboCup simulated soccer, 1998 • KnightCap and TDleaf Baxter, Tridgell & Weaver • Improved chess play from intermediate to master in 300 games • Inventory Management Van Roy, Bertsekas, Lee & Tsitsiklis • 10-15% improvement over industry standard methods • Walking Robot Benbrahim & Franklin • Learned critical parameters for bipedal walking

  18. Backgammon SITUATIONS: configurations of the playing board (about 10^20) ACTIONS: moves REWARDS: win: +1, lose: –1, else: 0. Pure delayed reward

  19. TD-Gammon (Tesauro, 1992-1995). Start with a random network. Play millions of games against itself. Learn a value function from this simulated experience. Action selection by 2-3 ply search over the learned values. TD error: V_{t+1} − V_t. This produces arguably the best player in the world.
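The self-play loop on this slide can be written schematically as below; the `game` and `value` interfaces are hypothetical placeholders, and TD-Gammon itself used a neural-network value function trained with TD(λ) rather than this bare one-step form:

```python
# Schematic self-play TD learning (illustrative; not Tesauro's actual code).
# `value.predict(position)` returns the current estimate of winning from a position;
# `value.update(position, delta)` nudges that estimate by `delta`.

def self_play_td(game, value, num_games=1_000_000, alpha=0.1):
    for _ in range(num_games):
        position = game.initial_position()
        while not game.over(position):
            # pick the move whose resulting position the value function likes best
            # (TD-Gammon searched 2-3 ply ahead; this sketch looks 1 ply)
            next_position = max(game.successors(position), key=value.predict)
            # TD error: the change in the prediction from one position to the next
            td_error = value.predict(next_position) - value.predict(position)
            value.update(position, alpha * td_error)
            position = next_position
        # ground the final position in the actual game outcome (win/lose)
        value.update(position, alpha * (game.outcome(position) - value.predict(position)))
```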

  20. Outline • Computational Theory of Mind • Reinforcement Learning • Some vivid examples • Intuition of RL’s Computational Theory of Mind • Reward, policy, value (prediction of reward) • Some of the Math • Policy iteration (values & policies build on each other) • Discounted reward • TD error • More speculative extensions • Reason • Knowledge

  21. The Reward Hypothesis The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward)

  22. The Reward Hypothesis The mind’s goal (i.e., what the mind can be conceived of, or understood, as pursuing) is to maximize the cumulative sum of a received scalar signal (reward)

  23. The Reward Hypothesis The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward). “Received”: it must come from outside, not be under the mind’s direct control

  24. The Reward Hypothesis The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward). “Scalar”: a simple, single number (not a vector or symbol structure)

  25. The Reward Hypothesis • Obvious? Brilliant? Demeaning? Inevitable? Trivial? • Simple, but not trivial • May be adequate, may be completely satisfactory • A good null hypothesis The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward)

  26. Policies • A policy maps each state to an action to take • Like a stimulus–response rule • We seek a policy that maximizes cumulative reward • The policy is a subgoal to achieving reward (diagram: Reward, Policy)

  27. Value Functions (diagram: Reward, Policy, Value Function) • Value functions = predictions of expected reward following states: Value: States → Expected future reward • Moment-by-moment estimates of how well it’s going • All efficient methods for finding optimal policies first estimate value functions • RL methods, state-space planning methods, dynamic programming • Recognizing and reacting to the ups and downs of life is an important part of intelligence

  28. The Mountain Car Problem (Moore, 1990), a minimum-time-to-goal problem. SITUATIONS: car's position and velocity. ACTIONS: three thrusts: forward, reverse, none. REWARDS: always –1 until the car reaches the goal. No discounting. Gravity wins over the engine, so the car must back away from the goal before it can reach it.
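For concreteness, the standard mountain-car dynamics can be sketched as below; the constants follow the commonly published version of the task (e.g., Sutton & Barto) and are assumptions here, not details taken from the talk:

```python
import math

# Standard mountain-car dynamics sketch (constants as commonly published; assumed).
# Actions: 0 = reverse thrust, 1 = no thrust, 2 = forward thrust.

def step(position, velocity, action):
    velocity += 0.001 * (action - 1) - 0.0025 * math.cos(3 * position)   # weak engine vs. gravity
    velocity = max(-0.07, min(0.07, velocity))
    position = max(-1.2, min(0.6, position + velocity))
    if position == -1.2:              # hitting the left wall stops the car
        velocity = 0.0
    reward = -1.0                     # -1 per step until the goal is reached
    done = position >= 0.5            # goal at the top of the right hill
    return position, velocity, reward, done
```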

  29. Value Functions Learned while solving the Mountain Car problem Lower is better

  30. Honeybee Brain & VUM Neuron (Hammer & Menzel)

  31. Dopamine Neurons Signal “Error/Change” in Prediction of Reward

  32. Outline • Computational Theory of Mind • Reinforcement Learning • Some vivid examples • Intuition of RL’s Computational Theory of Mind • Reward, policy, value (prediction of reward) • Some of the Math • Policy iteration (values & policies build on each other) • Discounted reward • TD error • Speculative extensions • Reason • Knowledge

  33. Notation • Reward: r : States → Pr(ℝ), a scalar signal, possibly stochastic • Policies: π : States → Pr(Actions); π* is optimal • Value functions: V^π(s) = E_π[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ⋯ | s_t = s ], with discount factor γ ≈ 1 but < 1

  34. Policy Iteration

  35. Generalized Policy Iteration (diagram: the policy π and the value function V build on each other; policy evaluation makes V consistent with π, greedification makes π greedy with respect to V; together they converge to π* and V*)
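A small tabular policy-iteration sketch of the idea on slides 34-35; the MDP format (P[s][a] as a list of (probability, next_state, reward) triples) and the parameters are illustrative assumptions:

```python
import numpy as np

# Tabular policy iteration: policy evaluation and greedification alternate until
# neither changes anything. P[s][a] = list of (probability, next_state, reward).

def policy_iteration(P, num_states, num_actions, gamma=0.95, theta=1e-8):
    policy = np.zeros(num_states, dtype=int)
    V = np.zeros(num_states)
    while True:
        # Policy evaluation: make V consistent with the current policy
        while True:
            delta = 0.0
            for s in range(num_states):
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Greedification: make the policy greedy with respect to V
        stable = True
        for s in range(num_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(num_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```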

  36. Reward is a time series. Total future reward: r_{t+1} + r_{t+2} + r_{t+3} + ⋯ . Discounted (imminent) reward: r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ⋯

  37. Discounting Example (figure: a reward time series illustrating discounting)
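Since the original figure is not recoverable, here is a small worked example of discounting; γ = 0.9 and the reward sequence are assumed values chosen only for illustration:

```latex
% Illustrative only: gamma = 0.9 and the rewards are assumed, not from the slide.
\[
r_{t+1}=0,\qquad r_{t+2}=0,\qquad r_{t+3}=1,\qquad r_{t+4}=0,\;\ldots
\]
\[
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots
    = 0 + 0.9\cdot 0 + 0.81\cdot 1 + \cdots = 0.81,
\qquad
R_{t+1} = r_{t+2} + \gamma r_{t+3} + \cdots = 0.9 .
\]
% The same future reward is worth 0.81 two steps away and 0.9 one step away:
% the prediction grows as the reward becomes imminent.
```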

  38. What should the prediction error be at time t?
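The deck's answer is the temporal-difference (TD) error; written out in the standard form, consistent with the notation on slide 33:

```latex
\[
\delta_t \;=\; r_{t+1} \;+\; \gamma\, V(s_{t+1}) \;-\; V(s_t)
\]
% positive when things go better than predicted, negative when worse,
% and zero when the reward was already fully predicted.
```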

  39. Computation-Theoretical TD Errors (figure, three panels: reward unexpected; reward expected after a cue; reward absent. Each panel shows the reward, the value, and the resulting TD error over time)

  40. Honeybee Brain & VUM Neuron (Hammer & Menzel)

  41. Dopamine Neurons Signal TD Error in Prediction of Reward

  42. Representation & Algorithm Level: the many methods for approximating value (figure: a space of methods spanned by two dimensions, full vs. sample backups and shallow (bootstrapping) vs. deep backups, with dynamic programming, exhaustive search, temporal-difference learning, and Monte Carlo at the corners, and λ spanning shallow to deep)

  43. Outline • Computational Theory of Mind • Reinforcement Learning • Some vivid examples • Intuition of RL’s Computational Theory of Mind • Reward, policy, value (prediction of reward) • Some of the Math • Policy iteration (values & policies build on each other) • Discounted reward • TD error • Speculative extensions • Reason • Knowledge

  44. Tolman & Honzik, 1930, “Reasoning in Rats” (figure: a maze with a start box, a food box, three paths of different lengths, and blocks at A and B)

  45. Reason as RL over Imagined Experience 1. Learn a model of the world’s transition dynamics (transition probabilities, expected immediate rewards): a “1-step model” of the world 2. Use the model to generate imaginary experiences: internal thought trials, mental simulation (Craik, 1943) 3. Apply RL as if the experience had really happened (diagram: Reward, Policy, Value Function, 1-Step Model)
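This is the combination usually called Dyna; a minimal tabular Dyna-Q-style sketch is below (the environment interface and all parameters are illustrative assumptions, not code from the talk):

```python
import random
from collections import defaultdict

# Dyna-Q-style sketch: learn a 1-step model from real experience, then apply the
# same RL update to imagined transitions drawn from that model ("planning").
# `env` (reset/step/num_actions) and all parameters are assumed for illustration.

def dyna_q(env, num_episodes=100, planning_steps=20, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)        # action values: Q[(state, action)]
    model = {}                    # 1-step model: (state, action) -> (reward, next_state, done)

    def greedy(s):
        return max(range(env.num_actions), key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = random.randrange(env.num_actions) if random.random() < epsilon else greedy(s)
            s2, r, done = env.step(a)
            # direct RL from real experience
            target = r + (0.0 if done else gamma * Q[(s2, greedy(s2))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            model[(s, a)] = (r, s2, done)
            # planning: RL over imagined experience generated by the model
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pdone else gamma * Q[(ps2, greedy(ps2))])
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```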

  46. Mind is About Predictions • Hypothesis: Knowledge is predictive • About what-leads-to-what, under what ways of behaving • What will I see if I go around the corner? • Objects: What will I see if I turn this over? • Active vision: What will I see if I look at my hand? • Value functions: What is the most reward I know how to get? • Such knowledge is learnable, chainable • Hypothesis: Mental activity is working with predictions • Learning them • Combining them to produce new predictions (reasoning) • Converting them to action (planning, reinforcement learning) • Figuring out which are most useful

  47. An old, simple, appealing idea • Mind as prediction engine! • Predictions are learnable, combinable • They represent cause and effect, and can be pieced together to yield plans • Perhaps this old idea is essentially correct. • Just needs • Development, revitalization in modern forms • Greater precision, formalization, mathematics • The computational perspective to make it respectable • Imagination, determination, patience • Not rushing to performance
