Presentation Transcript


  1. Unconditioned stimulus (food) causes unconditioned response (saliva). Conditioned stimulus (bell) causes conditioned response (saliva).

  2. Rescorla-Wagner rule • v = wu, with u the stimulus (0 or 1), w the weight, and v the predicted response. Adapt w to minimize the quadratic prediction error.
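
A minimal sketch of the update implied by slide 2, adapting w to reduce the prediction error; the learning rate eps and the number of trials are illustrative choices, not taken from the slides:

```python
# Rescorla-Wagner rule: v = w * u; adapt w to reduce the error r - v.
eps = 0.1            # learning rate (illustrative)
w = 0.0              # association weight for the bell
for trial in range(50):
    u = 1.0          # conditioned stimulus (bell) present
    r = 1.0          # unconditioned stimulus (food) delivered
    v = w * u        # predicted response
    w += eps * (r - v) * u   # delta rule: gradient step on the quadratic error
print(w)             # approaches 1: the bell fully predicts the food
```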

  3. The Rescorla-Wagner rule for multiple inputs can predict various phenomena. Blocking: a learned association s1 → r prevents learning of the association s2 → r. Inhibition: s2 reduces the prediction when combined with any predicting stimulus.
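
A sketch of the multi-input form (v = w · u) illustrating blocking as described above; the two-phase schedule, learning rate, and reward values are illustrative assumptions:

```python
import numpy as np

eps = 0.1
w = np.zeros(2)                          # weights for stimuli s1, s2

def rw_update(w, u, r):
    v = np.dot(w, u)                     # prediction from all present stimuli
    return w + eps * (r - v) * u         # vector delta rule

# Phase 1: s1 alone comes to predict the reward.
for _ in range(100):
    w = rw_update(w, np.array([1.0, 0.0]), r=1.0)

# Phase 2: s1 and s2 together with the same reward; the error is already ~0,
# so s2 acquires almost no weight -- blocking.
for _ in range(100):
    w = rw_update(w, np.array([1.0, 1.0]), r=1.0)

print(np.round(w, 3))                    # w[0] ~ 1 (learned), w[1] ~ 0 (blocked)
```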

  4. Temporal difference learning • Interpret v(t) as the 'total future expected reward' • v(t) is predicted from the past stimulus history
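
Written out, the quantities used on slides 4-6 take the following form; this is a reconstruction consistent with the slides' references to Eqs. 9.6 and 9.7, in the notation of the surrounding text:

```latex
% Total future expected reward and its prediction from past stimuli (cf. Eq. 9.6):
v(t) = \Big\langle \sum_{\tau \ge 0} r(t+\tau) \Big\rangle ,
\qquad
v(t) = \sum_{\tau=0}^{t} w(\tau)\, u(t-\tau)

% Temporal-difference learning rule and prediction error (cf. Eq. 9.7):
w(\tau) \;\to\; w(\tau) + \epsilon\, \delta(t)\, u(t-\tau),
\qquad
\delta(t) = r(t) + v(t+1) - v(t)
```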

  5. After learning, δ(t) = 0 implies: v(0) is the sum of expected future reward; if v(t) is constant, the expected reward r(t) = 0; if v(t) is decreasing, the expected reward is positive.

  6. Explanation of Fig. 9.2 • Since u(t) = δ(t,0), Eq. 9.6 becomes v(t) = w(t), and Eq. 9.7 becomes Δw(t) = ε δ(t). Thus Δv(t) = ε (r(t) + v(t+1) − v(t)). With r(t) = δ(t,T): step 1, the only change is v(T) → v(T) + ε; step 2, v(T−1) and v(T) change; etc.
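
A sketch of this trial-by-trial propagation (stimulus at t = 0, reward only at t = T); T, ε and the number of trials are illustrative:

```python
import numpy as np

T, eps, trials = 10, 0.5, 200
v = np.zeros(T + 2)                  # v(t) = w(t) since u(t) = delta(t,0); v(T+1) = 0
r = np.zeros(T + 2); r[T] = 1.0      # reward r(t) = delta(t,T)

for _ in range(trials):
    for t in range(T + 1):
        delta = r[t] + v[t + 1] - v[t]   # TD prediction error delta(t)
        v[t] += eps * delta              # first trial: only v(T) changes,
                                         # second trial: v(T-1) and v(T), etc.
print(np.round(v[:T + 1], 2))        # after learning: v(t) ~ 1 for t <= T, delta ~ 0
```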

  7. Dopamine • The monkey releases a button and presses another after the stimulus to receive a reward. A: VTA cells respond to the reward in early trials and to the stimulus in late trials, similar to δ in the TD rule of Fig. 9.2.

  8. Dopamine • Dopamine neurons encode the reward prediction error δ. B: withholding the reward reduces neural firing, in agreement with the δ interpretation.

  9. Static action choice • Rewards result from actions • Bees visit flowers whose color (blue or yellow) predicts the reward (sugar) • The m are action values that encode the expected reward; β controls the amount of exploration
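
One common way to realize the β-controlled exploration is a softmax over the action values; the softmax form itself is an assumption here, sketched with illustrative values:

```python
import numpy as np

def choose_flower(m_b, m_y, beta, rng):
    """Softmax choice between two flowers; larger beta means less exploration."""
    p_blue = 1.0 / (1.0 + np.exp(-beta * (m_b - m_y)))   # P[blue] from action values
    return 'blue' if rng.random() < p_blue else 'yellow'

rng = np.random.default_rng(0)
print(choose_flower(m_b=1.0, m_y=2.0, beta=1.0, rng=rng))
```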

  10. The indirect actor model • Learn the average nectar volume for each flower and act accordingly, implemented by on-line learning: when visiting the blue flower, update m_b toward the delivered reward and leave the yellow estimate unchanged. Fig: r_b = 1, r_y = 2 for t = 1..100 and reversed for t = 101..200. A: m_y, m_b; B-D: cumulated reward for low β (B) and high β (C, D).
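
A sketch of this indirect actor on the schedule given above (reversal after t = 100); the softmax choice, ε and β are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, beta = 0.1, 1.0
m = {'blue': 0.0, 'yellow': 0.0}             # estimated average nectar volumes

for t in range(1, 201):
    r = {'blue': 1.0, 'yellow': 2.0} if t <= 100 else {'blue': 2.0, 'yellow': 1.0}
    p_blue = 1.0 / (1.0 + np.exp(-beta * (m['blue'] - m['yellow'])))
    a = 'blue' if rng.random() < p_blue else 'yellow'
    m[a] += eps * (r[a] - m[a])              # update only the visited flower
print(m)                                     # estimates track the reversal after t = 100
```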

  11. Bumble bees • Blue: r = 2 for all flowers; yellow: r = 6 for 1/3 of the flowers. When the contingencies are switched at t = 15, the bees adapt quickly.

  12. Bumble bees • Model with m = ⟨f(r)⟩, where f is concave, so that m_b = f(2) is larger than m_y = (1/3) f(6).

  13. Direct actor (policy gradient)

  14. Direct actor • Stochastic gradient ascent on the expected reward. Fig: two sessions as in Fig. 9.4, with good and bad behaviour. Problem: large values of m prevent exploration.
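
A sketch of the direct actor's stochastic gradient ascent; the specific update form (m_a ← m_a + ε(1[a = a'] − P[a])(r − r̄), with r̄ a running mean-reward baseline) and all parameters are assumptions consistent with the slide's description rather than quoted from it:

```python
import numpy as np

rng = np.random.default_rng(1)
eps, beta, rbar = 0.1, 1.0, 0.0
m = np.zeros(2)                               # action parameters (blue, yellow)
r_true = np.array([1.0, 2.0])                 # illustrative rewards

for t in range(500):
    p = np.exp(beta * m) / np.sum(np.exp(beta * m))   # softmax policy
    a = rng.choice(2, p=p)
    # stochastic gradient ascent on expected reward, with baseline rbar
    m += eps * ((np.arange(2) == a) - p) * (r_true[a] - rbar)
    rbar += eps * (r_true[a] - rbar)          # running estimate of the mean reward
print(np.round(m, 2), np.round(p, 2))         # large |m| makes the policy nearly deterministic
```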

  15. Sequential action choice • Reward obtained after sequence of actions • Credit assignment problem.

  16. Sequential action choice • Policy iteration: • Critic: use TD to evaluate v(state) under the current policy • Actor: improve the policy m(state)

  17. Policy evaluation • The policy is random left/right at each turn. • Implemented as TD: v(u) → v(u) + ε δ, with δ = r + v(u′) − v(u).
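
A sketch of this TD evaluation under the random policy; the maze layout and reward values below are hypothetical placeholders standing in for the chapter's maze figure:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical binary maze: from A go to B or C; each of those leads to a
# terminal arm delivering the (placeholder) rewards below.
next_state = {'A': ['B', 'C'], 'B': ['bL', 'bR'], 'C': ['cL', 'cR']}
reward = {'bL': 5.0, 'bR': 0.0, 'cL': 0.0, 'cR': 2.0}

eps = 0.1
v = {s: 0.0 for s in ['A', 'B', 'C', 'bL', 'bR', 'cL', 'cR']}

for trial in range(2000):
    u = 'A'
    while u in next_state:
        u_next = next_state[u][rng.integers(2)]              # random left/right policy
        delta = reward.get(u_next, 0.0) + v[u_next] - v[u]   # TD error
        v[u] += eps * delta                                  # v(u) -> v(u) + eps * delta
        u = u_next
print({s: round(v[s], 2) for s in ['A', 'B', 'C']})   # ~ {'A': 1.75, 'B': 2.5, 'C': 1.0}
```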

  18. Policy improvement • Can be understood as a policy gradient rule in which r_a − r̄ is replaced by the TD error δ, and m becomes state dependent. Example: the current state is A.

  19. Policy improvement • Policy improvement changes the policy, so the policy must be re-evaluated for proven convergence • Interleaving policy improvement and policy evaluation is called actor-critic • Fig: actor-critic learning of the maze. NB: learning at C is slow.

  20. Generalizations • Discounted reward: v(t) = ⟨Σ_τ γ^τ r(t + τ)⟩ • The TD rule changes to δ(t) = r(t) + γ v(t+1) − v(t) • TD(λ): apply the TD rule not only to the value of the current state but also to recently visited states. TD(0) = TD; TD(1) updates all past states.
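
A sketch of a TD(λ) update using accumulating eligibility traces, one standard way to realize "also updating recently visited states"; the chain task and all parameters are illustrative:

```python
import numpy as np

def td_lambda_episode(v, states, rewards, eps=0.1, gamma=0.9, lam=0.5):
    """One episode of TD(lambda) with accumulating eligibility traces.
    lam=0 reduces to plain TD; lam=1 updates all previously visited states."""
    e = np.zeros_like(v)
    for t in range(len(rewards)):
        u, u_next, r = states[t], states[t + 1], rewards[t]
        delta = r + gamma * v[u_next] - v[u]   # discounted TD error
        e[u] += 1.0                            # mark the current state as eligible
        v += eps * delta * e                   # update current and recent states
        e *= gamma * lam                       # decay the traces
    return v

v = np.zeros(5)
# hypothetical 5-state chain visited left to right, reward only at the end
v = td_lambda_episode(v, states=[0, 1, 2, 3, 4], rewards=[0, 0, 0, 1.0])
print(np.round(v, 3))
```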

  21. Water maze • The state u is represented by 493 place cells, with 8 possible actions • The same actor-critic rules are applied to this representation.

  22. Comparing rats and the model • RL predicts initial learning well, but not the change to a new task.

  23. Markov decision process • State transitions P(u′|u,a) • Absorbing states end a trial • Find the policy M such that the expected total future reward is maximal • Solution: solve the Bellman equation
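
The Bellman equation referred to here can be written as follows (a reconstruction in the slide's notation; ⟨r_a(u)⟩ denotes the expected immediate reward for action a in state u):

```latex
% Bellman equation for the optimal values (reconstruction):
v^{*}(u) \;=\; \max_{a}\Big[\,\langle r_{a}(u)\rangle \;+\; \sum_{u'} P(u' \mid u, a)\, v^{*}(u')\,\Big]
```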

  24. Policy iteration • Is policy evaluation + policy improvement • Evaluation step: find the value of a policy M • RL evaluates the right-hand side stochastically: v(u) → v(u) + ε δ(t)

  25. Improvement step: maximize {...} with respect to a. This requires knowledge of P(u′|u,a). The earlier formula can be derived as a stochastic version of this step.
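
A minimal tabular sketch of the evaluation/improvement loop of slides 24-25; the 3-state, 2-action MDP and the use of a discount factor (rather than absorbing states) are simplifying assumptions:

```python
import numpy as np

# Hypothetical MDP: transition probabilities P[a, u, u'] and expected rewards R[u, a].
P = np.zeros((2, 3, 3))
P[0] = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]
P[1] = [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
gamma = 0.9
policy = np.zeros(3, dtype=int)

for _ in range(20):
    # Evaluation step: solve v = R_pi + gamma * P_pi v exactly (a linear system).
    P_pi = P[policy, np.arange(3)]
    R_pi = R[np.arange(3), policy]
    v = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
    # Improvement step: greedy action w.r.t. the evaluated values; needs P(u'|u,a).
    q = R.T + gamma * np.einsum('auw,w->au', P, v)     # q[a, u]
    policy = np.argmax(q, axis=0)
print(policy, np.round(v, 2))
```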
