
Presentation Transcript


  1. ECE-517: Reinforcement Learning in Artificial Intelligence Lecture 6: Optimality Criterion in MDPs September 8, 2011 Dr. Itamar Arel College of Engineering Department of Electrical Engineering and Computer Science The University of Tennessee Fall 2011

  2. Outline • Optimal value functions (cont.) • Implementation considerations • Optimality and approximation

  3. Recap on Value Functions • We define the state-value function for policy \pi as V^\pi(s) = E_\pi\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \} • Similarly, we define the action-value function for policy \pi as Q^\pi(s,a) = E_\pi\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a \} • The Bellman equation for V^\pi is V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right] • The value function V^\pi(s) is the unique solution to its Bellman equation
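Because V^\pi(s) is the unique solution of a linear system, it can be computed exactly for a small finite MDP. The following is a minimal sketch, not taken from the lecture, assuming the dynamics are given as NumPy arrays P[a, s, s'] = P^a_{ss'} and R[a, s, s'] = R^a_{ss'} and the policy as probabilities pi[s, a]; all of these names are illustrative.

    import numpy as np

    def evaluate_policy(P, R, pi, gamma):
        """Solve the Bellman equation V = r_pi + gamma * P_pi V exactly.

        P[a, s, s2] -- transition probabilities P^a_{ss'}
        R[a, s, s2] -- expected rewards R^a_{ss'}
        pi[s, a]    -- probability of taking action a in state s
        """
        n_actions, n_states, _ = P.shape
        # Policy-averaged transition matrix and expected one-step reward
        P_pi = np.einsum('sa,ask->sk', pi, P)        # sum_a pi(s,a) P^a_{ss'}
        r_pi = np.einsum('sa,ask,ask->s', pi, P, R)  # sum_a pi(s,a) sum_s' P^a_{ss'} R^a_{ss'}
        # Bellman equation in matrix form: (I - gamma * P_pi) V = r_pi
        return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

For more than a few thousand states the direct solve becomes expensive, which is one reason iterative and approximate methods are needed.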

  4. Optimal Value Functions • A policy \pi is defined to be better than or equal to a policy \pi' if its expected return is greater than or equal to that of \pi' for all states, i.e. \pi \geq \pi' iff V^\pi(s) \geq V^{\pi'}(s) for all s • There is always at least one policy (a.k.a. an optimal policy \pi^*) that is better than or equal to all other policies • Optimal policies share the same optimal state-value function V^*(s) = \max_\pi V^\pi(s), and also the same optimal action-value function, defined as Q^*(s,a) = \max_\pi Q^\pi(s,a)

  5. Optimal Value Functions (cont.) • The latter gives the expected return for taking action a in state s and thereafter following an optimal policy • Thus, we can write Q^*(s,a) = E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \} • Since V^*(s) is the value function for a policy, it must satisfy the Bellman equation; because the policy is optimal, the expectation over actions becomes a maximum: V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right] • This is called the Bellman optimality equation • Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state
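To make the Bellman optimality equation concrete, here is a minimal sketch, not from the slides, of the corresponding backup operator; it reuses the assumed array convention P[a, s, s'], R[a, s, s'] from the earlier snippet. The point is only to show the max over actions that replaces the policy-weighted expectation.

    import numpy as np

    def bellman_optimality_backup(V, P, R, gamma):
        """One application of the Bellman optimality operator to a value estimate V.

        Returns V'(s) = max_a sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma * V(s') ].
        """
        # Expected return of each (action, state) pair under the current estimate V
        q = np.einsum('ask,ask->as', P, R) + gamma * np.einsum('ask,k->as', P, V)
        return q.max(axis=0)  # best action in every state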

  6. Optimal Value Functions (cont.)

  7. Optimal Value Functions (cont.) • The Bellman optimality equation for Q^* is Q^*(s,a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s',a') \right] • In the corresponding backup diagrams, arcs have been added at the agent's choice points to represent that the maximum over that choice is taken, rather than the expected value under some given policy
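The same sketch can be written for the action-value form; the arrays and names are again assumptions carried over from the earlier snippets, not code from the lecture.

    import numpy as np

    def bellman_optimality_backup_q(Q, P, R, gamma):
        """One application of the Bellman optimality operator for action values.

        Returns Q'(s,a) = sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma * max_{a'} Q(s',a') ].
        """
        V_greedy = Q.max(axis=1)  # max_{a'} Q(s', a') for every next state s'
        q = np.einsum('ask,ask->as', P, R) + gamma * np.einsum('ask,k->as', P, V_greedy)
        return q.T  # reorder from [a, s] to Q[s, a]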

  8. Optimal Value Functions (cont.) • For finite MDPs, the Bellman optimality equation has a unique solution, independent of the policy • The Bellman optimality equation is actually a system of equations, one for each state • N equations (one for each state) • N unknowns – the values V*(s) • This assumes you know the dynamics of the environment • Once one has V*(s), it is relatively easy to determine an optimal policy … • For each state there will be one or more actions for which the maximum is attained in the Bellman optimality equation • Any policy that assigns nonzero probability only to these actions is an optimal policy • This translates to a one-step search, i.e. greedy decisions will be optimal (a sketch of this one-step search is given below)
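As an illustration of the one-step search mentioned above, the following sketch (illustrative only, using the same assumed P and R arrays) extracts a greedy, and hence optimal, deterministic policy once V*(s) is available.

    import numpy as np

    def greedy_policy_from_v(V_star, P, R, gamma):
        """One-step lookahead: for each state, pick an action attaining the max
        in the Bellman optimality equation. Requires knowing the dynamics P, R."""
        q = np.einsum('ask,ask->as', P, R) + gamma * np.einsum('ask,k->as', P, V_star)
        return q.argmax(axis=0)  # policy[s] = an optimal action in state s

Ties can be broken arbitrarily: any distribution supported only on the maximizing actions is optimal.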

  9. Optimal Value Functions (cont.) • With Q*, the agent does not even have to do a one-step-ahead search • For any state s, the agent can simply pick any action that maximizes Q*(s,a), as sketched below • The action-value function effectively embeds the results of all one-step-ahead searches • It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair • The agent does not need to know anything about the dynamics of the environment • Q: What are the implementation tradeoffs here?
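With Q* the lookup is even simpler, since no model of the dynamics is needed; a sketch assuming a table Q_star[s, a]:

    import numpy as np

    def act_greedily(Q_star, s):
        """Pick an action that maximizes Q*(s, a); no knowledge of P or R required."""
        return int(np.argmax(Q_star[s]))

The price is memory: a Q table stores one entry per state-action pair rather than one per state, which is part of the implementation tradeoff asked about above.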

  10. Implementation Considerations • Computational complexity • How complex is it to evaluate the state-value and action-value functions? • In software • In hardware • Data flow constraints • Which part of the data needs to be globally vs. locally available? • Impact of memory bandwidth limitations

  11. Recycling Robot revisited • A transition graph is a useful way to summarize the dynamics of a finite MDP • State node for each possible state • Action node for each possible state-action pair
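One way to realize the transition graph in code is as a lookup from state-action pairs to (probability, next state, reward) edges. The sketch below encodes the recycling-robot dynamics this way, in the usual Sutton & Barto parameterization (search/wait rewards R_search and R_wait, battery probabilities alpha and beta, and a reward of -3 when the robot must be rescued); the numeric values are illustrative assumptions, not numbers from the lecture.

    # Transition graph as {(state, action): [(prob, next_state, reward), ...]}
    alpha, beta = 0.9, 0.6           # assumed battery-survival probabilities
    R_search, R_wait = 2.0, 1.0      # assumed expected rewards for searching / waiting

    recycling_robot = {
        ('h', 'search'):   [(alpha,     'h', R_search), (1 - alpha, 'l', R_search)],
        ('h', 'wait'):     [(1.0,       'h', R_wait)],
        ('l', 'search'):   [(beta,      'l', R_search), (1 - beta,  'h', -3.0)],  # rescued: reward -3
        ('l', 'wait'):     [(1.0,       'l', R_wait)],
        ('l', 'recharge'): [(1.0,       'h', 0.0)],
    }

Each key corresponds to an action node in the graph and each tuple to one of its outgoing arcs.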

  12. Bellman Optimality Equations for the Recycling Robot • To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge, by h, l, s, w, and re, respectively
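The equations themselves did not survive the transcript. As a reconstruction, written in the usual Sutton & Barto form with the abbreviations above (so not text from the slide), the Bellman optimality equations for the two states are:

    V^*(h) = \max\{\, \mathcal{R}^{s} + \gamma[\alpha V^*(h) + (1-\alpha) V^*(l)],\;
                      \mathcal{R}^{w} + \gamma V^*(h) \,\}

    V^*(l) = \max\{\, \beta \mathcal{R}^{s} - 3(1-\beta) + \gamma[(1-\beta) V^*(h) + \beta V^*(l)],\;
                      \mathcal{R}^{w} + \gamma V^*(l),\;
                      \gamma V^*(h) \,\}

Here \mathcal{R}^{s} and \mathcal{R}^{w} are the expected rewards for searching and waiting, and \alpha, \beta are the probabilities that the battery level stays high or low, respectively.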

  13. Optimality and Approximation • Clearly, an agent that learns an optimal policy has done very well, but in practice this rarely happens • Usually involves heavy computational load • Typically agents perform approximations to the optimal policy • A critical aspect of the problem facing the agent is always the computational resources available to it • In particular, the amount of computation it can perform in a single time step • Practical considerations are thus: • Computational complexity • Memory available • Tabular methods apply for small state sets • Communication overhead (for distributed implementations) • Hardware vs. software

  14. Are approximations good or bad? • RL typically relies on approximation mechanisms (see later) • This could be an opportunity • Efficient “feature-extraction” types of approximation may actually reduce “noise” • They make it practical for us to address large-scale problems • In general, making “bad” decisions in RL results in learning opportunities (online) • The online nature of RL encourages learning more effectively from events that occur frequently • This is supported by what we observe in nature • Capturing regularities is a key property of RL
