Presentation Transcript


  1. Prediction, Control and Decisions • Kenji Doya (doya@irp.oist.jp) • Initial Research Project, OIST • ATR Computational Neuroscience Laboratories • CREST, Japan Science and Technology Agency • Nara Institute of Science and Technology

  2. Outline • Introduction • Cerebellum, basal ganglia, and cortex • Meta-learning and neuromodulators • Prediction time scale and serotonin

  3. Learning to Walk (Doya & Nakano, 1985) • Action: cycle of 4 postures • Reward: speed sensor output • Multiple solutions: creeping, jumping,…

  4. Learning to Stand Up (Morimoto & Doya, 2001) • [Video panels: early trials vs. after learning] • Reward: height of the head • No desired trajectory

  5. Reinforcement Learning (RL) • Framework for learning a state-action mapping (policy) by exploration and reward feedback • Critic: reward prediction • Actor: action selection • Learning: external reward r; internal reward δ = difference from prediction • [Diagram: agent (critic, actor) interacting with the environment via state s, action a, reward r, TD error δ]

  6. Reinforcement Learning Methods • Model-free methods • Episode-based: parameterize policy P(a|s; θ) • Temporal difference: state value function V(s), (state-)action value function Q(s,a) • Model-based methods • Dynamic programming: forward model P(s'|s,a)

  7. Temporal Difference Learning • Predict reward: value function • V(s) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s ] • Q(s,a) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a ] • Select action • greedy: a = argmax_a Q(s,a) • Boltzmann: P(a|s) ∝ exp[ β Q(s,a) ] • Update prediction: TD error • δ(t) = r(t) + γV(s(t+1)) − V(s(t)) • ΔV(s(t)) = αδ(t) • ΔQ(s(t),a(t)) = αδ(t)
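
As a concrete illustration of the update rules on this slide, here is a minimal tabular sketch in Python: Boltzmann action selection plus the TD error and the updates ΔV = αδ, ΔQ = αδ. The tabular arrays Q and V and the metaparameter defaults are illustrative assumptions, not part of the original presentation.

```python
import numpy as np

def softmax_policy(Q, s, beta):
    """Boltzmann action selection: P(a|s) proportional to exp(beta * Q(s,a))."""
    prefs = beta * Q[s]
    prefs -= prefs.max()                      # subtract max for numerical stability
    p = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(p), p=p)

def td_update(Q, V, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """TD error and updates from the slide:
    delta = r + gamma*V(s') - V(s); V and Q both move by alpha*delta."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    Q[s, a] += alpha * delta
    return delta
```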

  8. Dynamic Programming and RL • Dynamic Programming • model-based, off-line • solve the Bellman equation • V(s) = max_a Σ_s' P(s'|s,a) { r(s,a,s') + γV(s') } • Reinforcement Learning • model-free, on-line • learn by TD error • δ(t) = r(t) + γV(s(t+1)) − V(s(t)) • ΔV(s(t)) = αδ(t) • ΔQ(s(t),a(t)) = αδ(t)
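
A minimal value-iteration sketch of the Bellman equation on this slide, assuming a hypothetical tabular MDP given as transition and reward arrays; the convergence tolerance is an arbitrary choice.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Solve V(s) = max_a sum_s' P(s'|s,a) * ( R(s,a,s') + gamma*V(s') ).
    P: array [S, A, S'] of transition probabilities (hypothetical tabular MDP).
    R: array [S, A, S'] of rewards."""
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        # Q(s,a) = sum_s' P(s'|s,a) * (R(s,a,s') + gamma*V(s'))
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)    # optimal values and a greedy policy
        V = V_new
```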

  9. Discrete vs. Continuous RL (Doya, 2000) • Discrete time • Continuous time

  10. Questions • Computational Questions • How to learn: • direct policy P(a|s) • value functions V(s), Q(s,a) • forward models P(s’|s,a) • When to use which method? • Biological Questions • Where in the brain? • How are they represented/updated? • How are they selected/coordinated?

  11. Brain Hierarchy • Forebrain • Cerebral cortex (a) • neocortex • paleocortex: olfactory cortex • archicortex: basal forebrain, hippocampus • Basal nuclei (b) • neostriatum: caudate, putamen • paleostriatum: globus pallidus • archistriatum: amygdala • Diencephalon • thalamus (c) • hypothalamus (d) • Brain stem & Cerebellum • Midbrain (e) • Hindbrain • pons (f) • cerebellum (g) • Medulla (h) • Spinal cord (i)

  12. Just for Motor Control? (Middleton & Strick, 1994) • Basal ganglia (globus pallidus) • Prefrontal cortex (area 46) • Cerebellum (dentate nucleus)

  13. Specialization by Learning Algorithms (Doya, 1999) • Cerebral cortex: unsupervised learning (input → output) • Basal ganglia: reinforcement learning (input + reward → output) • Cerebellum: supervised learning (input + target/error → output) • [Anatomical diagram: cortex, thalamus, basal ganglia with SN, cerebellum with IO]

  14. Cerebellum • Purkinje cells • ~10⁵ parallel fibers • single climbing fiber • long-term depression • Supervised learning • perceptron hypothesis • internal models

  15. Internal Models in the Cerebellum (Imamizu et al., 2000) • Learning to use a 'rotated' mouse • [fMRI panels: early learning vs. after learning]

  16. Motor Imagery (Luft et al., 1998) • [Panels: finger movement vs. imagery of movement]

  17. Basal Ganglia • Striatum • striosome & matrix • dopamine-dependent plasticity • Dopamine neurons • reward-predictive response • TD learning

  18. Dopamine Neurons and TD Error (Schultz et al., 1997) • δ(t) = r(t) + γV(s(t+1)) − V(s(t)) • [Recordings with traces of r, V, δ: before learning, after learning, reward omitted]

  19. Reward-predicting Activities of Striatal Neurons • Delayed saccade task (Kawagoe et al., 1998) • Not just actions, but the resulting rewards • [Raster plots by target direction (right, up, left, down) and rewarded direction (right, up, left, down, all)]

  20. Cerebral Cortex • Recurrent connections • Hebbian plasticity • Unsupervised learning, e.g., PCA, ICA

  21. Replicating V1 Receptive Fields (Olshausen & Field, 1996) • Infomax and sparseness • Hebbian plasticity and recurrent inhibition

  22. Specialization by Learning? • Cerebellum: supervised learning • error signal by climbing fibers • forward model s' = f(s,a) and policy a = g(s) • Basal ganglia: reinforcement learning • reward signal by dopamine fibers • value functions V(s) and Q(s,a) • Cerebral cortex: unsupervised learning • Hebbian plasticity and recurrent inhibition • representation of state s and action a • But how are they recruited and combined?

  23. Multiple Action Selection Schemes • Model-free: a = argmax_a Q(s,a) • Model-based: a = argmax_a [ r + V(f(s,a)) ], using a forward model f(s,a) • Encapsulation: a = g(s) • [Diagrams of the three schemes: action value Q; forward model f with value V; policy g]
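
The three action-selection schemes on this slide can be sketched as follows; the forward model f, reward function r, value table V, and policy g are placeholders standing in for learned components.

```python
import numpy as np

def model_free_action(Q, s):
    """Model-free: a = argmax_a Q(s, a) from a learned action-value table."""
    return int(np.argmax(Q[s]))

def model_based_action(s, actions, f, r, V):
    """Model-based: look one step ahead with a forward model s' = f(s, a)
    and choose a = argmax_a [ r(s, a) + V(f(s, a)) ]."""
    return max(actions, key=lambda a: r(s, a) + V[f(s, a)])

def encapsulated_action(s, g):
    """Encapsulation: a fixed learned policy a = g(s), no search at decision time."""
    return g(s)
```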

  24. Lectures at OCNC 2005 • Internal models / Cerebellum: Reza Shadmehr, Stefan Schaal, Mitsuo Kawato • Reward / Basal ganglia: Andrew G. Barto, Bernard Balleine, Peter Dayan, John O’Doherty, Minoru Kimura, Wolfram Schultz • State coding / Cortex: Nathaniel Daw, Leo Sugrue, Daeyeol Lee, Jun Tanji, Anitha Pasupathy, Masamichi Sakagami

  25. Outline • Introduction • Cerebellum, basal ganglia, and cortex • Meta-learning and neuromodulators • Prediction time scale and serotonin

  26. Reinforcement Learning (RL) • Framework for learning a state-action mapping (policy) by exploration and reward feedback • Critic: reward prediction • Actor: action selection • Learning: external reward r; internal reward δ = difference from prediction • [Diagram: agent (critic, actor) interacting with the environment via state s, action a, reward r, TD error δ]

  27. Reinforcement Learning • Predict reward: value function • V(s) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s ] • Q(s,a) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a ] • Select action • greedy: a = argmax_a Q(s,a) • Boltzmann: P(a|s) ∝ exp[ β Q(s,a) ] • Update prediction: TD error • δ(t) = r(t) + γV(s(t+1)) − V(s(t)) • ΔV(s(t)) = αδ(t) • ΔQ(s(t),a(t)) = αδ(t)

  28. Cyber Rodent Project • Robots under the same constraints as biological agents • What is the origin of rewards? • What should be learned, and what should be evolved? • Self-preservation: capture batteries • Self-reproduction: exchange programs through IR ports

  29. Cyber Rodent: Hardware • camera • range sensor • proximity sensors • gyro • battery latch • two wheels • IR port • speaker • microphones • R/G/B LED

  30. Evolving Robot Colony • Survival: catch battery packs • Reproduction: copy ‘genes’ through IR ports

  31. Discounting Future Reward • [Simulation snapshots: behavior with a large γ vs. a small γ]

  32. Setting of the Reward Function • Reward r = r_main + r_supp − r_cost • e.g., supplementary reward for vision of a battery
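
A small sketch of the composite reward on this slide; the example components (battery in view as the supplementary term, energy use as the cost) and their gains are illustrative assumptions.

```python
def total_reward(r_main, battery_visible, energy_used,
                 supp_gain=0.1, cost_gain=0.05):
    """Composite reward from the slide: r = r_main + r_supp - r_cost.
    r_supp: small bonus while a battery is in the camera image (assumed).
    r_cost: penalty proportional to energy used (assumed)."""
    r_supp = supp_gain if battery_visible else 0.0
    r_cost = cost_gain * energy_used
    return r_main + r_supp - r_cost
```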

  33. Reinforcement Learning of Reinforcement Learning (Schweighofer & Doya, 2003) • Fluctuations in the metaparameters correlate with the average reward • [Traces: reward, γ, β, α]

  34. Randomness Control by Battery Level • Greedier action at both extremes • [Plot: learned β as a function of battery level (0–1)]

  35. Neuromodulators for Metalearning (Doya, 2002) • Metaparameter tuning is critical in RL • How does the brain tune them? • Dopamine: TD error δ • Acetylcholine: learning rate α • Noradrenaline: inverse temperature β • Serotonin: discount factor γ

  36. Learning Rate α • ΔV(s(t−1)) = αδ(t) • ΔQ(s(t−1),a(t−1)) = αδ(t) • small α → slow learning • large α → unstable learning • Acetylcholine (basal forebrain) • regulates memory update and retention (Hasselmo et al.) • LTP in cortex, hippocampus • top-down and bottom-up information flow
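
A toy illustration of the learning-rate trade-off listed above, assuming a simple constant-reward estimation problem (not from the original slides): a small α converges slowly, a large α keeps fluctuating with the noise.

```python
import numpy as np

def track_value(alpha, n_steps=200, true_v=1.0, noise=0.5, seed=0):
    """Estimate a constant reward (true_v) from noisy samples with V += alpha*delta.
    Small alpha -> slow convergence; large alpha -> noisy, unstable estimate."""
    rng = np.random.default_rng(seed)
    V = 0.0
    for _ in range(n_steps):
        r = true_v + noise * rng.standard_normal()
        V += alpha * (r - V)        # delta = r - V (no future term in this toy case)
    return V

# e.g., compare track_value(0.01), track_value(0.1), track_value(0.9)
```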

  37. Inverse Temperature β • Greediness in action selection • P(a_i|s) ∝ exp[ β Q(s,a_i) ] • small β → exploration • large β → exploitation • Noradrenaline (locus coeruleus) • correlation with performance accuracy (Aston-Jones et al.) • modulation of cellular I/O gain (Cohen et al.)
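
A small sketch of how β controls greediness in the Boltzmann rule above; the Q-values are made-up numbers and the printed probabilities are approximate.

```python
import numpy as np

def boltzmann_probs(q_values, beta):
    """P(a_i|s) proportional to exp(beta * Q(s, a_i)), as on the slide."""
    prefs = beta * np.asarray(q_values, dtype=float)
    prefs -= prefs.max()            # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

# Small beta -> near-uniform choice (exploration);
# large beta -> almost always the best action (exploitation).
print(boltzmann_probs([1.0, 0.8, 0.2], beta=0.1))   # ~[0.34, 0.34, 0.32]
print(boltzmann_probs([1.0, 0.8, 0.2], beta=20.0))  # ~[0.98, 0.02, 0.00]
```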

  38. Discount Factor γ • V(s(t)) = E[ r(t+1) + γr(t+2) + γ²r(t+3) + … ] • Balance between short- and long-term results • Serotonin (dorsal raphe) • low activity associated with impulsivity • depression, bipolar disorders • aggression, eating disorders
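
A worked example (with made-up reward sequences) of how γ balances short- and long-term results: a small γ favors the immediate reward, a large γ favors the larger delayed reward.

```python
def discounted_value(rewards, gamma):
    """V = r(t+1) + gamma*r(t+2) + gamma^2*r(t+3) + ..., as on the slide."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical choice: small immediate reward vs. larger delayed reward.
immediate = [1.0, 0.0, 0.0, 0.0]
delayed   = [0.0, 0.0, 0.0, 2.0]
print(discounted_value(immediate, 0.3), discounted_value(delayed, 0.3))  # 1.0 vs 0.054
print(discounted_value(immediate, 0.9), discounted_value(delayed, 0.9))  # 1.0 vs 1.458
```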

  39. TD Error δ • δ(t) = r(t) + γV(s(t)) − V(s(t−1)) • Global learning signal • reward prediction: ΔV(s(t−1)) = αδ(t) • reinforcement: ΔQ(s(t−1),a(t−1)) = αδ(t) • Dopamine (substantia nigra, VTA) • responds to errors in reward prediction • reinforcement of actions • addiction

  40. TD Model of the Basal Ganglia (Houk et al. 1995; Montague et al. 1996; Schultz et al. 1997; …) • Striosome: state value V(s) • Matrix: action value Q(s,a) • DA neurons: TD error δ • SNr/GPi: action selection, Q(s,a) → a • [Circuit diagram with candidate neuromodulator labels: ACh?, NA?, 5-HT?]

  41. Possible Control of the Discount Factor • Modulation of the TD error • Selection/weighting of parallel networks • [Diagram: striatal modules with discounts γ1, γ2, γ3 learning values V1, V2, V3; dopamine neurons compute δ(t) from V(s(t)) and V(s(t+1))]
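
A sketch of the second idea on this slide, parallel networks with different discount factors combined by a weighting; the module structure and the assumption that the weights w are set externally (e.g., by serotonin level) are interpretations, not the presenter's implementation.

```python
import numpy as np

class ParallelValueModules:
    """Parallel striatal modules, each learning V_i with its own gamma_i;
    a weighting w combines them into one value estimate (assumed scheme)."""
    def __init__(self, n_states, gammas=(0.3, 0.6, 0.9), alpha=0.1):
        self.gammas = np.array(gammas)
        self.V = np.zeros((len(gammas), n_states))
        self.alpha = alpha

    def update(self, s, r, s_next):
        # Each module computes its own TD error with its own discount factor.
        deltas = r + self.gammas * self.V[:, s_next] - self.V[:, s]
        self.V[:, s] += self.alpha * deltas
        return deltas

    def value(self, s, w):
        # Weighted combination of module values (w sums to 1).
        return float(np.dot(w, self.V[:, s]))
```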

  42. Markov Decision Task (Tanaka et al., 2004) • State transition and reward functions • Stimulus and response

  43. Behavior Results • All subjects successfully learned optimal behavior

  44. Block-Design Analysis • Different brain areas are involved in immediate and future reward prediction • SHORT vs. NO (p < 0.001 uncorrected): OFC, insula, striatum, cerebellum • LONG vs. SHORT (p < 0.0001 uncorrected): DLPFC, VLPFC, IPC, PMd, striatum, cerebellum, dorsal raphe

  45. Ventro-Dorsal Difference • [Activation maps: lateral PFC, insula, striatum]

  46. Model-based Regressor Analysis • Estimate V(t) and δ(t) from the subjects’ performance data • Regression analysis of the fMRI data • [Diagram: agent (value function V(s), TD error δ(t), policy) interacting with the environment (state s(t), action a(t), reward r(t), e.g. 20 yen); V(t) and δ(t) regressed against the fMRI data]
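
A simplified sketch of the regressor construction described on this slide: run a TD learner with a given γ over the subject's own state/reward sequence to obtain V(t) and δ(t), then regress a voxel time course on each trace. The learning rate, the plain least-squares fit, and the omission of HRF convolution are simplifying assumptions.

```python
import numpy as np

def value_and_td_regressors(states, rewards, n_states, gamma, alpha=0.1):
    """Estimate V(t) and delta(t) trial by trial from the subject's own
    state/reward sequence (a simplified stand-in for fitting the agent model)."""
    V = np.zeros(n_states)
    v_trace, d_trace = [], []
    for t in range(len(states) - 1):
        delta = rewards[t] + gamma * V[states[t + 1]] - V[states[t]]
        V[states[t]] += alpha * delta
        v_trace.append(V[states[t]])
        d_trace.append(delta)
    return np.array(v_trace), np.array(d_trace)

def regress(voxel_ts, regressor):
    """Ordinary least-squares fit of one regressor (plus intercept) to a voxel
    time course; a real analysis would also convolve the regressor with an HRF."""
    X = np.column_stack([np.ones_like(regressor), regressor])
    beta, *_ = np.linalg.lstsq(X, voxel_ts, rcond=None)
    return beta[1]
```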

  47. Explanatory Variables (subject NS) • [Traces over trials 1–312: reward prediction V(t) and reward prediction error δ(t) for γ = 0, 0.3, 0.6, 0.8, 0.9, 0.99]

  48. Regression Analysis • Reward prediction V: mPFC (x = −2 mm), insula (x = −42 mm) • Reward prediction error δ: striatum (z = 2)

  49. Tryptophan Depletion/Loading • Tryptophan is the precursor of serotonin; depletion/loading affects central serotonin levels (e.g., Bjork et al. 2001; Luciana et al. 2001) • 100 g amino acid drink; experiments 6 hours later • Day 1: Tr− (depletion, no tryptophan); Day 2: Tr0 (control, 2.3 g of tryptophan); Day 3: Tr+ (loading, 10.3 g of tryptophan)

  50. Blood Tryptophan Levels • [Blood tryptophan measurements; N.D. = not detectable (< 3.9 μg/ml)]
