QUIZ!!

Presentation Transcript


  1. QUIZ!! • T/F: Optimal policies can be defined from an optimal value function. TRUE • T/F: “Pick the MEU action first, then follow the optimal policy” is optimal. TRUE • T/F: π*(s) = max_{s’} V*(s’). FALSE • T/F: The Bellman equation can be satisfied by sub-optimal value functions. FALSE • T/F: Value Iteration: The policy cannot converge before the value function. FALSE • Explain the difference between Policy Iteration and Value Iteration. • Why can Policy Iteration be faster than Value Iteration?

  2. CS 511a: Artificial Intelligence, Spring 2013. Lecture 11: MDPs / Reinforcement Learning. Feb 25, 2013. Robert Pless. Course adapted from Kilian Weinberger, with many slides from Dan Klein, Stuart Russell, or Andrew Moore.

  3. Announcements • Project 2 due Thursday night. • HW 1 due Friday 5pm* • * Accepted with no penalty or late-day charge until Monday 10am.

  4. Policy Iteration • Why do we compute V* or Q*, if all we care about is the best policy π*?

  5. Utilities for Fixed Policies • Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy • Define the utility of a state s under a fixed policy π: $V^{\pi}(s)$ = expected total discounted rewards (return) starting in s and following π • Recursive relation (one-step look-ahead / Bellman equation): $V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s')\,[R(s, \pi(s), s') + \gamma V^{\pi}(s')]$

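  6. Policy Evaluation • How do we calculate the V’s for a fixed policy π? • Idea one: modify Bellman updates by dropping the max over actions: $V^{\pi}_{i+1}(s) \leftarrow \sum_{s'} T(s, \pi(s), s')\,[R(s, \pi(s), s') + \gamma V^{\pi}_{i}(s')]$ • Idea two: the solution is a stationary point of that update (it holds with equality). Without the max, the equations are linear in the values, so it’s just a linear system; solve with Matlab (or whatever)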
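
The two ideas on this slide can be compared on a toy problem. The sketch below is a minimal illustration, assuming a hypothetical two-state MDP whose transition and reward tables (`T_pi`, `R_pi`) are made up for the example and already restricted to the fixed policy. It applies idea one (repeated simplified Bellman updates); the comments show the answer that idea two (solving the linear equations) gives by hand.

```python
# Hypothetical two-state MDP under a fixed policy; all numbers are illustrative.
# T_pi[s][s2] = T(s, pi(s), s2) and R_pi[s][s2] = R(s, pi(s), s2).
GAMMA = 0.9
T_pi = {"A": {"A": 0.5, "B": 0.5}, "B": {"B": 1.0}}
R_pi = {"A": {"A": 1.0, "B": 0.0}, "B": {"B": 0.0}}


def evaluate_policy(T_pi, R_pi, gamma, tol=1e-10):
    """Idea one: repeat the simplified Bellman update (no max over actions)."""
    V = {s: 0.0 for s in T_pi}
    while True:
        V_new = {
            s: sum(p * (R_pi[s][s2] + gamma * V[s2]) for s2, p in T_pi[s].items())
            for s in T_pi
        }
        if max(abs(V_new[s] - V[s]) for s in V) < tol:
            return V_new
        V = V_new


V = evaluate_policy(T_pi, R_pi, GAMMA)

# Idea two, by hand, for this example:
#   V(B) = 0.9 * V(B)                    =>  V(B) = 0
#   V(A) = 0.5 * (1 + 0.9 * V(A)) + 0    =>  V(A) = 0.5 / 0.55
print(V)
```

For larger state spaces the same linear system would normally be handed to a linear-algebra routine rather than solved by hand.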

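  7. Policy Iteration • Policy evaluation: with fixed current policy π, find values with simplified Bellman updates: $V^{\pi}_{i+1}(s) \leftarrow \sum_{s'} T(s, \pi(s), s')\,[R(s, \pi(s), s') + \gamma V^{\pi}_{i}(s')]$ • Iterate until values converge • Policy improvement: with fixed utilities, find the best action according to one-step look-ahead: $\pi_{\text{new}}(s) = \arg\max_a \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^{\pi}(s')]$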
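
The evaluation and improvement steps alternate until the policy stops changing. The following is a rough sketch of that loop, not the course's reference implementation. The two-state MDP (`MDP`, with states `low`/`high` and actions `wait`/`work`) is hypothetical, and a fixed number of sweeps stands in for "iterate until values converge".

```python
GAMMA = 0.9
# Hypothetical MDP: MDP[s][a] is a list of (probability, next_state, reward).
MDP = {
    "low": {"wait": [(1.0, "low", 0.0)],
            "work": [(0.8, "high", 0.0), (0.2, "low", 0.0)]},
    "high": {"wait": [(1.0, "high", 1.0)],
             "work": [(1.0, "low", 2.0)]},
}


def q_value(s, a, V):
    """One-step look-ahead value of taking action a in state s."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in MDP[s][a])


def policy_iteration(sweeps=1000):
    pi = {s: next(iter(MDP[s])) for s in MDP}  # arbitrary initial policy
    V = {s: 0.0 for s in MDP}
    while True:
        # Policy evaluation: simplified Bellman updates with the policy frozen.
        for _ in range(sweeps):
            V = {s: q_value(s, pi[s], V) for s in MDP}
        # Policy improvement: greedy one-step look-ahead on the fixed values.
        new_pi = {s: max(MDP[s], key=lambda a: q_value(s, a, V)) for s in MDP}
        if new_pi == pi:
            return pi, V
        pi = new_pi


pi, V = policy_iteration()
print(pi, V)
```

In this example the policy changes once and is then stable, so only two evaluation phases are needed.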

  8. Comparison • In value iteration: • Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy) • Policy might not change between updates (wastes computation) • In policy iteration: • Several passes to update utilities with frozen policy • Occasional passes to update policies • Value update can be solved as linear system • Can be faster, if policy changes infrequently • Hybrid approaches (asynchronous policy iteration): • Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

  9. Asynchronous Value Iteration • In value iteration, we update every state in each iteration • Actually, any sequence of Bellman updates will converge if every state is visited infinitely often • In fact, we can update the policy as seldom or often as we like, and we will still converge • Idea: Update states whose value we expect to change: If $|V_{i+1}(s) - V_i(s)|$ is large then update predecessors of s
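
One way to act on the "update predecessors" idea is a worklist: back up one state at a time and re-queue its predecessors only when its value moved. The sketch below assumes a hypothetical deterministic four-state chain (`MDP`) and a plain FIFO queue. A priority queue ordered by the size of the change would be closer to prioritized sweeping, which is not shown here.

```python
from collections import deque

GAMMA = 0.9
# Hypothetical deterministic chain: MDP[s][a] = (next_state, reward).
MDP = {
    0: {"right": (1, 0.0)},
    1: {"left": (0, 0.0), "right": (2, 0.0)},
    2: {"left": (1, 0.0), "right": (3, 1.0)},
    3: {},  # terminal state, value stays 0
}


def backup(s, V):
    """Bellman update for a single state."""
    if not MDP[s]:
        return 0.0
    return max(r + GAMMA * V[s2] for s2, r in MDP[s].values())


def async_value_iteration(theta=1e-9):
    # preds[s2] = states that can reach s2 in one step.
    preds = {s: set() for s in MDP}
    for s, actions in MDP.items():
        for s2, _ in actions.values():
            preds[s2].add(s)
    V = {s: 0.0 for s in MDP}
    queue = deque(MDP)  # visit every state at least once
    while queue:
        s = queue.popleft()
        new_v = backup(s, V)
        if abs(new_v - V[s]) > theta:
            V[s] = new_v
            queue.extend(preds[s])  # value changed: predecessors may change too
    return V


V = async_value_iteration()
print(V)
```

States whose successors never change are never revisited, which is where the savings over full sweeps come from.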

  10. Reinforcement Learning

  11. Reinforcement Learning • Reinforcement learning: • Still have an MDP: • A set of states s ∈ S • A set of actions (per state) A • A model T(s,a,s’) • A reward function R(s,a,s’) • Still looking for a policy π(s) • New twist: don’t know T or R • I.e. don’t know which states are good or what the actions do • Must actually try actions and states out to learn • [Demo]

  12. Example: Animal Learning • RL studied experimentally for more than 60 years in psychology • Rewards: food, pain, hunger, drugs, etc. • Mechanisms and sophistication debated • Example: foraging • Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies • Bees have a direct neural connection from nectar intake measurement to motor planning area

  13. Passive Learning • Simplified task • You don’t know the transitions T(s,a,s’) • You don’t know the rewards R(s,a,s’) • You are given a policy π(s) • Goal: learn the state values • … what policy evaluation did • In this case: • Learner “along for the ride” • No choice about what actions to take • Just execute the policy and learn from experience • We’ll get to the active case soon • This is NOT offline planning! You actually take actions in the world and see what happens…

  14. Passive Model-Based Learning • Idea: • Learn the model empirically through experience • Solve for values as if the learned model were correct • Simple empirical model learning • Count outcomes for each s,a • Normalize to give estimate of T(s,a,s’) • Discover R(s,a,s’) when we experience (s,a,s’) • Solving the MDP with the learned model • Iterative policy evaluation, for example

  15. Example: Model-Based Learning • [Figure: grid world with x and y axes; exit rewards +100 and -100] • γ = 1 • Episodes: • Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done) • Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done) • Estimates: T(<3,3>, right, <4,3>) = 1 / 3 • T(<2,3>, right, <3,3>) = 2 / 2
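
The two transition estimates on this slide can be reproduced by counting. The sketch below encodes the two episodes as lists of (state, action, reward) steps, an encoding chosen for this example. It counts the observed successor of each (state, action) pair and normalizes. The final `exit` steps have no successor in the data, so they contribute no transition counts.

```python
from collections import Counter, defaultdict

# Each step is (state, action, reward); the episodes are copied from the slide.
EPISODES = [
    [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 2), "up", -1),
     ((1, 3), "right", -1), ((2, 3), "right", -1), ((3, 3), "right", -1),
     ((3, 2), "up", -1), ((3, 3), "right", -1), ((4, 3), "exit", 100)],
    [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
     ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "up", -1),
     ((4, 2), "exit", -100)],
]


def estimate_transitions(episodes):
    """Count outcomes for each (s, a), then normalize to estimate T(s, a, s')."""
    counts = defaultdict(Counter)
    for episode in episodes:
        for (s, a, _), (s2, _, _) in zip(episode, episode[1:]):
            counts[(s, a)][s2] += 1
    return {
        sa: {s2: n / sum(c.values()) for s2, n in c.items()}
        for sa, c in counts.items()
    }


T_hat = estimate_transitions(EPISODES)
print(T_hat[((3, 3), "right")])  # (3,2) seen twice, (4,3) seen once
print(T_hat[((2, 3), "right")])  # (3,3) seen twice
```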

  16. Passive Model-Free Learning • Big idea: why bother learning T? • 1. Direct Estimation: • Average V(s) value directly and compute expected discounted reward for each state. • No need to compute T or R.

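  17. Model-Free Learning • Want to compute an expectation weighted by P(x): $E[f(x)] = \sum_x P(x)\, f(x)$ • Model-based: estimate P(x) from samples, compute expectation: $\hat{P}(x) = \text{count}(x)/N$, $E[f(x)] \approx \sum_x \hat{P}(x)\, f(x)$ • Model-free: estimate expectation directly from samples: $E[f(x)] \approx \frac{1}{N} \sum_i f(x_i)$ • Why does this work? Because samples appear with the right frequencies!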
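
A small numerical check makes the "right frequencies" argument concrete. The sample list and the function `f` below are invented for illustration. The model-based route first builds empirical probabilities and then takes the weighted sum; the model-free route averages `f` over the raw samples. On the same data the two agree.

```python
from collections import Counter

samples = [2, 2, 5, 7, 2, 5]  # hypothetical draws of x


def f(x):
    return x * x


N = len(samples)

# Model-based: estimate P(x) from counts, then compute the expectation.
P_hat = {x: c / N for x, c in Counter(samples).items()}
model_based = sum(P_hat[x] * f(x) for x in P_hat)

# Model-free: average f over the samples directly, never forming P(x).
model_free = sum(f(x) for x in samples) / N

print(model_based, model_free)
```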

  18. Example: Model-Free Estimation • [Figure: grid world with x and y axes; exit rewards +100 and -100] • γ = 1, R = -1 • Episodes: • Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done) • Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done) • Estimates: V(2,3) ~ (96 + -103) / 2 = -3.5 • V(3,3) ~ (99 + 97 + -102) / 3 = 31.3
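
The two averages on this slide correspond to averaging the observed return over every visit to a state. The sketch below uses an episode encoding and function name of its own choosing. It walks each episode backwards to accumulate the return from each step, then averages per state with γ = 1.

```python
from collections import defaultdict

# Each step is (state, action, reward); the episodes are copied from the slide.
EPISODES = [
    [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 2), "up", -1),
     ((1, 3), "right", -1), ((2, 3), "right", -1), ((3, 3), "right", -1),
     ((3, 2), "up", -1), ((3, 3), "right", -1), ((4, 3), "exit", 100)],
    [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
     ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "up", -1),
     ((4, 2), "exit", -100)],
]


def direct_estimate(episodes, gamma=1.0):
    """Average the observed discounted return over every visit to each state."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        for s, _, r in reversed(episode):
            G = r + gamma * G  # return from this step to the end of the episode
            returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


V = direct_estimate(EPISODES)
print(V[(2, 3)], V[(3, 3)])
```

State (3,3) is visited twice in the first episode, which is why three returns (97, 99, and -102) enter its average.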

  19. Sample-Based Policy Evaluation? • Update V without building T or R. • [Figure: state s, action π(s), and sampled successors s1’, s2’, s3’]

  20. Passive Model-Free Learning • Big idea: why bother learning T? • 1. Direct Estimation: • Average V(s) value directly and compute expected discounted reward for each state. • No need to compute T or R. • 2. Temporal-Difference Learning: • Update value function towards whatever successor occurs, maintaining a running average.

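  21. Temporal-Difference Learning • Big idea: learn from every experience! • Update V(s) each time we experience (s,a,s’,r) • Likely successors s’ will contribute updates more often • Temporal difference learning • Policy still fixed! • Move values toward value of whatever successor occurs: running average! • Sample of V(s): $\text{sample} = R(s, \pi(s), s') + \gamma V^{\pi}(s')$ • Update to V(s): $V^{\pi}(s) \leftarrow (1-\alpha)\, V^{\pi}(s) + \alpha \cdot \text{sample}$ • Same update: $V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha\,(\text{sample} - V^{\pi}(s))$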
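
The update on this slide takes only a few lines of code. The sketch below is a minimal version, assuming values are kept in a dictionary that treats unseen states as 0. The state names `B`, `C`, `done` and the step size α = 0.5 are hypothetical choices for the example.

```python
def td_update(V, s, r, s2, alpha=0.5, gamma=1.0):
    """Move V[s] toward the one-step sample r + gamma * V[s2]."""
    sample = r + gamma * V.get(s2, 0.0)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    return V


V = {}
td_update(V, "B", -1, "C")      # V[B] = 0.5 * 0    + 0.5 * (-1 + 0)  = -0.5
td_update(V, "C", 100, "done")  # V[C] = 0.5 * 0    + 0.5 * (100 + 0) = 50.0
td_update(V, "B", -1, "C")      # V[B] = 0.5 * -0.5 + 0.5 * (-1 + 50) = 24.25
print(V)
```

The second visit to `B` already reflects the reward seen after `C`, without waiting for a full episode return.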

  22. Exponential Moving Average • Exponential moving average • Makes recent samples more important • Forgets about the past (distant past values were wrong anyway) • Easy to compute from the running average • Decreasing learning rate can give converging averages
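
For reference, the running-average update with a constant rate α can be unrolled to show why recent samples count more. The notation below ($\bar{x}_n$ for the average after n samples, $\bar{x}_0$ for the initial guess) is introduced here and may differ from the original slide.

```latex
\bar{x}_n = (1-\alpha)\,\bar{x}_{n-1} + \alpha\,x_n
          = \alpha \sum_{k=0}^{n-1} (1-\alpha)^{k}\, x_{n-k} \;+\; (1-\alpha)^{n}\,\bar{x}_0
```

A sample that is k steps old is weighted by $(1-\alpha)^k$, so old samples fade geometrically. With the decreasing rate $\alpha_n = 1/n$ the same update reduces to the ordinary mean, $\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$, which is one example of a decreasing learning rate giving a converging average.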

  23. Problems with TD Value Learning • TD value learning is a model-free way to do policy evaluation • However, if we want to turn values into a (new) policy, we’re sunk: $\pi(s) = \arg\max_a Q^*(s,a)$ with $Q^*(s,a) = \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma V^*(s')]$, which needs T and R • Idea: learn Q-values directly • Makes action selection model-free too!

  24. Active Learning • Full reinforcement learning • You don’t know the transitions T(s,a,s’) • You don’t know the rewards R(s,a,s’) • You can choose any actions you like • Goal: learn the optimal policy • … what value iteration did! • In this case: • Learner makes choices! • Fundamental tradeoff: exploration vs. exploitation • This is NOT offline planning! You actually take actions in the world and find out what happens…

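  25. Detour: Q-Value Iteration • Value iteration: find successive approximations of the optimal values • Start with $V^*_0(s) = 0$, which we know is right (why?) • Given $V^*_i$, calculate the values for all states for depth i+1: $V^*_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma V^*_i(s')]$ • But Q-values are more useful! • Start with $Q^*_0(s,a) = 0$, which we know is right (why?) • Given $Q^*_i$, calculate the q-values for all q-states for depth i+1: $Q^*_{i+1}(s,a) \leftarrow \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma \max_{a'} Q^*_i(s',a')]$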
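
Q-value iteration can be sketched directly from the depth-(i+1) update. The MDP below is hypothetical (two states, two actions, made-up rewards), and a fixed iteration count is used in place of a convergence test.

```python
GAMMA = 0.9
# Hypothetical MDP: MDP[s][a] is a list of (probability, next_state, reward).
MDP = {
    "low": {"wait": [(1.0, "low", 0.0)],
            "work": [(0.8, "high", 0.0), (0.2, "low", 0.0)]},
    "high": {"wait": [(1.0, "high", 1.0)],
             "work": [(1.0, "low", 2.0)]},
}


def q_value_iteration(mdp, gamma, iterations=500):
    Q = {(s, a): 0.0 for s in mdp for a in mdp[s]}  # Q_0 = 0 everywhere
    for _ in range(iterations):
        # Each new Q(s, a) looks one step ahead and takes the max over a'.
        Q = {
            (s, a): sum(
                p * (r + gamma * max((Q[(s2, a2)] for a2 in mdp[s2]), default=0.0))
                for p, s2, r in mdp[s][a]
            )
            for s in mdp
            for a in mdp[s]
        }
    return Q


Q = q_value_iteration(MDP, GAMMA)
# The greedy policy needs only Q: no transition model is used at this point.
greedy = {s: max(MDP[s], key=lambda a: Q[(s, a)]) for s in MDP}
print(Q, greedy)
```

Reading the policy off `Q` with a plain argmax is what makes Q-values more convenient than state values for action selection.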

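  26. Q-Learning [DEMO – Grid Q’s] • Q-Learning: sample-based Q-value iteration • Learn Q*(s,a) values • Receive a sample (s,a,s’,r) • Consider your old estimate: $Q(s,a)$ • Consider your new sample estimate: $\text{sample} = r + \gamma \max_{a'} Q(s',a')$ • Incorporate the new estimate into a running average: $Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \cdot \text{sample}$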
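
A single Q-learning step can be written as a small function. The sketch below assumes Q-values live in a `defaultdict` that returns 0 for unseen pairs and that every state offers the same action list. The state and action names are hypothetical, and no exploration strategy is included.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s2, actions, alpha=0.5, gamma=0.9):
    """Blend the old estimate Q[(s, a)] with a new one-step sample."""
    old = Q[(s, a)]
    sample = r + gamma * max((Q[(s2, a2)] for a2 in actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * old + alpha * sample


ACTIONS = ["left", "right"]
Q = defaultdict(float)

# Observe the same transition (s1, right, reward 1.0, s2) twice.
q_learning_update(Q, "s1", "right", 1.0, "s2", ACTIONS)  # 0.5 * 0   + 0.5 * 1 = 0.5
q_learning_update(Q, "s1", "right", 1.0, "s2", ACTIONS)  # 0.5 * 0.5 + 0.5 * 1 = 0.75
print(Q[("s1", "right")])
```

In a full agent this update would run inside a loop that also chooses actions, for example with some randomness to keep exploring.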

  28. Examples • Tom Erez, Hopper: http://www.youtube.com/watch?feature=player_embedded&v=kUfmnoobTHQ
