Reinforcement Learning Part 2 Temporal Difference Learning, Actor-Critics, and the brain
Outline of Lecture • Review • Temporal Difference learning • An efficient way to estimate the value function • Actor-Critic Methods • A way to learn if you can estimate the value function • The relationship between TD(λ) and the brain.
(REVIEW) Example: GridWorld • (Figure: a grid world, with labels for the states, the current state, the actions, the terminal state, and the (optional) initial or start state.)
(REVIEW) Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • S is the set of possible states. • A is the set of possible actions. • P describes the transition dynamics. • We use t to denote the (integer) time step. • R describes the rewards. • d0 is the distribution over the states at time t = 0. • γ is a real-valued discount parameter in the interval [0,1].
(REVIEW) Episodes • An episode is one run of an MDP, starting at t=0, and running until a terminal state is reached.
(REVIEW) Trajectory • If you use a policy, π, on an MDP, M, for one episode, you get a trajectory.
(REVIEW) Softmax Policy • One policy parameter per state-action pair.
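The formula on this slide did not survive extraction. A minimal sketch of a tabular softmax policy, assuming the usual form in which the probability of action a in state s is proportional to exp(θ[s][a]). The names `theta`, `softmax_probs`, and `sample_action` are illustrative and do not come from the lecture.

```python
import math
import random

def softmax_probs(theta_s):
    """Action probabilities for one state from its row of policy parameters."""
    m = max(theta_s)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in theta_s]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(theta, s, rng=random):
    """Sample an action index in state s under the softmax policy."""
    probs = softmax_probs(theta[s])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical example: 2 states, 3 actions, one parameter per state-action pair.
theta = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
```

Equal parameters give a uniform distribution over actions; raising one parameter makes its action more likely in that state only.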
(REVIEW) Parameterized Gaussian Policy • Let φ(s) be a vector of features associated with the state s.
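The policy definition itself is missing from the extracted slide. One common form, sketched here as an assumption, draws a real-valued action from a normal distribution whose mean is the dot product of the policy parameters θ and the features φ(s), with a fixed standard deviation `sigma`. The function name and the fixed `sigma` are illustrative choices.

```python
import random

def gaussian_policy_action(theta, phi_s, sigma=1.0, rng=random):
    """Sample a real-valued action from N(theta . phi(s), sigma^2)."""
    mean = sum(t * f for t, f in zip(theta, phi_s))
    return rng.gauss(mean, sigma)
```

With this form the parameters only shift the mean action, so the number of policy parameters equals the number of features rather than the number of states.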
(REVIEW) Solving an MDP using Local Search • Given policy parameters, we can estimate how good they are by generating many trajectories using them and then averaging the returns. • Use hill-climbing, simulated annealing, a genetic algorithm, or any other local search method to find the θ that maximizes J.
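The two steps above can be sketched as follows, assuming J is the average discounted return. `run_episode` is a hypothetical callback that runs one episode with parameters `theta` and returns the list of rewards; the Gaussian perturbation and the step counts are illustrative, not the lecture's settings.

```python
import random

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def estimate_J(theta, run_episode, gamma, n_episodes=20):
    """Average the discounted returns of several episodes run with theta."""
    returns = [discounted_return(run_episode(theta), gamma)
               for _ in range(n_episodes)]
    return sum(returns) / n_episodes

def hill_climb(theta, run_episode, gamma, steps=200, step_size=0.1, rng=random):
    """Keep a random perturbation of theta only if its estimated J is higher."""
    best_j = estimate_J(theta, run_episode, gamma)
    for _ in range(steps):
        candidate = [t + rng.gauss(0.0, step_size) for t in theta]
        j = estimate_J(candidate, run_episode, gamma)
        if j > best_j:
            theta, best_j = candidate, j
    return theta, best_j
```

Because each estimate of J is an average over sampled episodes, a noisy environment can make a candidate look better than it is; using more episodes per estimate reduces that risk at the cost of more simulation.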
Temporal Difference Learning • Temporal difference learning (TD) is an algorithm for estimating the value function. • Let v(s) be our estimate of the true value of the state s. • We can initialize it randomly or to zero. • If we take action a in state s, go to state s’, and receive a reward of r, how can we update v(s)?
TD-Error • Other names for the same quantity: Temporal Difference Error (TD Error), Bellman Error, Reward Prediction Error.
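TD-Error • A positive TD-error means that reality was better than our expectation. • We should increase v(s), our estimate of the value of state s. • A negative TD-error means that reality was worse than our expectation. • We should decrease v(s). • Here "reality" is the reward plus the discounted value estimate of the next state, and "expectation" is the current estimate v(s).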
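The update rule on these slides was lost in extraction. A minimal sketch of the standard one-step TD update, assuming the TD error δ = r + γ·v(s’) − v(s) and a step size `alpha` (the step size is an assumption; the slides do not name one). The table `v` is a plain dictionary from states to value estimates.

```python
def td_error(v, s, r, s_next, gamma, terminal=False):
    """delta = r + gamma * v(s') - v(s); a terminal next state has value 0."""
    target = r if terminal else r + gamma * v[s_next]
    return target - v[s]

def td_update(v, s, r, s_next, gamma, alpha, terminal=False):
    """Move v(s) a fraction alpha toward the one-step target; return delta."""
    delta = td_error(v, s, r, s_next, gamma, terminal)
    v[s] += alpha * delta
    return delta
```

A positive δ raises v(s) and a negative δ lowers it, which matches the two cases described above.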
TD(λ) • Say that the TD-error at time t + 4 is positive: we would increase our estimate of the value of s_{t+4}. • If we already updated the value of s_{t+3} before, then we underestimated its value, because that update used the old value of s_{t+4}.
TD(λ) • Idea: If we observe a positive TD-error (things worked out better than expected), then we could increase the value of many of the recent states. • Note: There are many ways of viewing TD(λ).
TD(λ) • Allows observed rewards to update value estimates for many states immediately.
TD(λ) • Each state has an eligibility trace, which tracks how much a positive TD error should increase its value and a negative TD error should decrease it. • As time passes, the eligibility of a state decays. • When a state occurs, its eligibility is set to 1 (it is very responsible for the TD error).
TD(λ) • Let e_t(s) be the eligibility of state s at time t. • When state s occurs, set e_t(s) = 1. • Otherwise set e_t(s) = γλ e_{t-1}(s), where λ is a parameter between 0 and 1. • (Figure: eligibility versus time since the state occurred, for γλ = 0.8.)
TD(λ) Notice that λ = 0 results in TD.
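A minimal sketch of one tabular TD(λ) step using the eligibility rule described above (set the trace of the visited state to 1, decay all others by γλ). The step size `alpha` and the dictionary layout of `v` and `e` are assumptions made for illustration; both dictionaries are expected to hold an entry for every state.

```python
def td_lambda_step(v, e, s, r, s_next, gamma, lam, alpha, terminal=False):
    """One TD(lambda) step with traces reset to 1 on visit; returns delta."""
    # Decay every trace, then mark the state that just occurred as fully eligible.
    for state in e:
        e[state] *= gamma * lam
    e[s] = 1.0
    target = r if terminal else r + gamma * v[s_next]
    delta = target - v[s]
    # Every state moves in proportion to its current eligibility.
    for state in e:
        v[state] += alpha * delta * e[state]
    return delta
```

With `lam = 0` every trace except that of the current state is zero, so only v(s) changes and the step reduces to the one-step TD update.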
Actor-Critic • Use TD(λ) to estimate the value function. • Use a parameterized policy to select actions. • When the TD error is positive, increase the probability of the action that was chosen. • When the TD error is negative, decrease the probability of the action that was chosen.
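The four bullets above can be sketched for a tabular problem. This is one possible instance, not necessarily the lecture's exact update: the critic uses a one-step TD update for brevity (rather than TD(λ)), the actor is a softmax policy, and the actor moves its parameters along δ times the gradient of the log-probability of the chosen action. The step sizes `alpha` and `beta` and all function names are assumptions.

```python
import math

def softmax(prefs):
    """Softmax over one state's action parameters."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [x / z for x in exps]

def actor_critic_step(theta, v, s, a, r, s_next, gamma, alpha, beta,
                      terminal=False):
    """Update critic v and actor theta in place from one transition."""
    target = r if terminal else r + gamma * v[s_next]
    delta = target - v[s]
    v[s] += alpha * delta  # critic: one-step TD update
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        # Gradient of log pi(s, a) with respect to theta[s][b] for a softmax.
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += beta * delta * grad_log  # actor
    return delta
```

When δ is positive the parameter of the chosen action rises and the others fall, so that action becomes more probable in state s; a negative δ has the opposite effect.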
Other Actor-Critics • We could use something like an artificial neural network to represent the policy. • We could add eligibility traces to the policy update. • We could make sure that the policy updates result in the policy following the gradient of J(θ) • Called “Policy Gradient” • We could make sure that the policy updates result in the policy following the natural gradient of J(θ). • Called “Natural Policy Gradient” • My research involves adding safety constraints, safety guarantees, and convergence guarantees to natural policy gradient methods.
TD and the Brain • Preliminaries • RL and TD are not meant to model the brain. • There are many details that we will not discuss. • We will ignore controversies and discuss the leading hypotheses only.
TD and the Brain • “If the problems that animals face are well modeled as [MDPs/POMDPs]—as we think they are—it would be surprising if effective algorithms bore no relationship to the methods that have evolved enabling animals to deal with the problems they face over their lifetimes.” -Andy Barto
Dopamine • Dopamine is a neurotransmitter.
Dopamine • Dopamine is manufactured in the ventral tegmental area (VTA) and substantia nigra and broadcast to several parts of the brain. • Evidence suggests that dopamine might encode TD error.
Dopamine • Prefrontal Cortex: • Planning complex cognitive behavior • Decision making • Striatum: • Motor control (Parkinson’s) • Nucleus accumbens: • Pleasure • Laughter • Reward • Reinforcement learning • Fear, aggression • Addiction • Hippocampus: • Long-term memory • Short-term memory • Spatial memory and navigation
Return Midterms • Next time: • TD(λ) with function approximation • More!