Explore the differences between habitual and goal-directed actions in the context of model-based reinforcement learning. Learn how actions are chosen based on previous stimuli, and understand the role of the sensorimotor cortices and the dorsolateral striatum. Discover the mechanisms behind habit formation and the chunking of action sequences over the course of training. Delve into concepts such as exploration vs. exploitation, variability vs. stereotypy, and error reduction. Gain insights into when actions should get chunked or unchunked based on environmental changes. Simulate tasks like the SRTT and instrumental conditioning to understand reinforcer devaluation and non-contingent omission.
Model-based RL (+ action sequences): maybe it can explain everything
Niv lab meeting, 6/11/2012
Stephanie Chan
Goal-directed vs. habitual instrumental actions
• Habitual
  • After extensive training
  • Choose action based on previous actions/stimuli
  • Sensorimotor cortices + DLS (putamen)
  • Not sensitive to: reinforcer devaluation, changes in action-outcome contingency
  • Usually modeled with: model-free RL
• Goal-directed
  • After moderate training
  • Choose action based on expected outcome
  • PFC & DMS (caudate)
  • Usually modeled with: model-based RL
Goal-directed vs. habitual instrumental actions
• What do real animals do?
Model-free RL
• Explains resistance to devaluation:
  • Devaluation is tested in "extinction": no feedback, so no TD error
• Does NOT explain resistance to changes in action-outcome contingency
  • In fact, habitual behavior should be MORE sensitive to changes in contingency
  • Maybe: learning rates become small after extended training
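To make the first point concrete, here is a minimal sketch of the cached-value account, assuming a toy lever-pressing task with a tabular learner; the state/action names, learning rate, and training length are illustrative assumptions, not the presenter's model:

```python
# Minimal sketch: why cached (model-free) values resist reinforcer devaluation.
# The task, names, and parameters here are illustrative assumptions.

ALPHA = 0.1                      # learning rate
Q = {"press": 0.0, "hold": 0.0}  # cached action values at the lever

def q_update(action, reward):
    """TD-style update of a cached value; no model of the outcome's identity."""
    td_error = reward - Q[action]
    Q[action] += ALPHA * td_error
    return td_error

# Extensive training: pressing earns the (still-valued) outcome.
for _ in range(1000):
    q_update("press", reward=1.0)

# Reinforcer devaluation (e.g. outcome paired with illness) happens outside the
# instrumental context: no (state, action) pair is visited, so Q never changes.

# The devaluation probe is then run in extinction: the devalued outcome is never
# experienced contingent on pressing, so no TD error carries the new, lower
# outcome value back into Q["press"]. Responding at test is driven by the stale
# cached value, hence the apparent insensitivity to devaluation.
print(Q["press"])   # still ~1.0 at the start of the probe
```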
Alternative explanation
• We don't need model-free RL
• Habit formation = association of individual actions into "action sequences"
• More parsimonious
• A means of modeling action sequences
Over the course of training
• Exploration -> exploitation
• Variability -> stereotypy
• Errors and RTs decrease
• Individual actions -> "chunked" sequences
• PFC + associative striatum -> sensorimotor striatum
• "Closed loop" -> "open loop"
When should actions get chunked?
• Q-learning with dwell time: Q(s,a) = R(s) + E[V(s')] - D(s)<R>
• Chunk when the costs (possible mistakes) are outweighed by the benefits (decreased decision time); see the sketch below
• Cost: C(s,a,a') = E[Q(s',a') - V(s')] = E[A(s',a')]
• Efficient way to compute this: TD_t = [r_t - d_t<R> + V(s_{t+1})] - V(s_t) is a sample of A(s_t, a_t)
• Benefit: (# timesteps saved) × <R>
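A minimal sketch of this bookkeeping, assuming a tabular average-reward learner; the function names, the learning rate ETA, the fixed avg_reward value, and the one-timestep saving are illustrative assumptions rather than the presenter's exact implementation:

```python
# Sketch of the chunking criterion (names and constants are assumptions).
from collections import defaultdict

ETA = 0.05           # learning rate for the running cost estimate
avg_reward = 0.5     # running estimate of <R>, the reward per timestep

# C[(s, a, a_next)] estimates E[A(s', a_next)] = E[Q(s', a_next) - V(s')]  (<= 0):
# the expected loss from committing to a_next in advance instead of re-deciding.
C = defaultdict(float)

def td_error(r_t, d_t, v_next, v_cur):
    """Average-reward TD error with dwell time d_t:
    TD_t = [r_t - d_t*<R> + V(s_{t+1})] - V(s_t), a sample of A(s_t, a_t)."""
    return (r_t - d_t * avg_reward + v_next) - v_cur

def update_cost(prev_sa, a_taken, delta):
    """If the step just taken followed the candidate pair prev_sa = (s, a), then
    its TD error delta samples A(s', a_taken), i.e. the cost C(s, a, a_taken)."""
    key = prev_sa + (a_taken,)
    C[key] += ETA * (delta - C[key])

def should_chunk(s, a, a_next, timesteps_saved=1):
    """Chunk (s, a, a_next) into a macro action once the benefit of skipping a
    decision step outweighs the expected cost of acting open-loop."""
    benefit = timesteps_saved * avg_reward
    cost = -C[(s, a, a_next)]    # magnitude of the (negative) expected advantage
    return benefit > cost        # in practice, only after C has enough samples
```

The appeal of this scheme, as the slide notes, is that the cost estimate needs nothing beyond ordinary TD errors, so no machinery is required on top of standard value learning.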
When do they get unchunked?
• C(s,a,a') is insensitive to changes in the environment
  • Primitive actions are no longer evaluated: no TD errors, so no new samples for C
• But <R> IS sensitive to changes
• Action sequences get unchunked when the environment changes in a way that decreases <R> (see the sketch below)
• No unchunking if the environment changes to present a better alternative that would increase <R>
• Ostlund et al. 2009: rats are immediately sensitive to devaluation of the state that the macro action lands on, but not of the intermediate states
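Continuing the sketch above (still an illustrative assumption, and reusing its cost table C), the asymmetry in these bullets follows from re-applying the same criterion: the stored cost receives no new samples while the sequence runs open-loop, but the running estimate of <R> keeps tracking experience:

```python
def recheck_chunk(s, a, a_next, avg_reward, timesteps_saved=1):
    """Re-apply the chunking criterion with the current estimate of <R>.
    C[(s, a, a_next)] gets no new advantage samples while the sequence executes
    open-loop, so only <R> can move."""
    benefit = timesteps_saved * avg_reward
    cost = -C[(s, a, a_next)]
    return benefit > cost      # False: unchunk and deliberate step by step again

# If the environment gets worse (e.g. the outcome is devalued or omitted), the
# experienced reward rate <R> falls, the benefit drops below the frozen cost, and
# the macro action is broken back into primitive actions.
# If instead a better alternative appears elsewhere, the open-loop agent never
# samples it, <R> does not change, and the sequence stays chunked.
```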
Simulations II: instrumental conditioning
• Reinforcer devaluation
• Non-contingent omission