
Value and Planning in MDPs



  1. Value and Planning in MDPs

  2. Administrivia • Reading 3 assigned today • Mahadevan, S., “Representation Policy Iteration”. In Proc. of the 21st Conference on Uncertainty in Artificial Intelligence (UAI-2005). • http://www.cs.umass.edu/~mahadeva/papers/uai-final-paper.pdf • Due: Apr 20 • Groups assigned this time

  3. Where we are • Last time: • Expected value of policies • Principle of maximum expected utility • The Bellman equation • Today: • A little intuition (pictures) • Finding π*: the policy iteration algorithm • The Q function • On to actual learning (maybe?)

  4. The Bellman equation • The final recursive equation is known as the Bellman equation: Vπ(s) = R(s) + γ Σs′ T(s,π(s),s′) Vπ(s′) • Unique solution to this equation gives the value of a fixed policy π when operating in a known MDP M=〈S,A,T,R〉 • When state/action spaces are discrete, can think of V and R as vectors and Tπ as a matrix, and get the matrix equation: Vπ = R + γ Tπ Vπ

  5. Exercise • Solve the matrix Bellman equation (i.e., find V): • I formulated the Bellman equations for “state-based” rewards: R(s) • Formulate & solve the B.E. for: • “state-action” rewards (R(s,a)) • “state-action-state” rewards (R(s,a,s’))

  6. Exercise • Solve the matrix Bellman equation (i.e., find V): • Formulate & solve the B.E. for: • “state-action” rewards (R(s,a)) • “state-action-state” rewards (R(s,a,s’))
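Since the matrix Bellman equation is a linear system, solving it amounts to one call to a linear solver. Below is a minimal numpy sketch; the 3-state transition matrix, reward vector, and discount are made-up stand-ins for the instance shown on the slide, not the actual exercise numbers.

```python
import numpy as np

# Hypothetical 3-state instance standing in for the one on the slide.
# T_pi[i, j] = probability of moving from state i to state j under policy pi.
T_pi = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.7, 0.2],
                 [0.0, 0.1, 0.9]])
R = np.array([0.0, 0.0, 1.0])   # state-based rewards R(s)
gamma = 0.9                     # discount factor

# Matrix Bellman equation: V = R + gamma * T_pi @ V  =>  (I - gamma*T_pi) V = R
V = np.linalg.solve(np.eye(len(R)) - gamma * T_pi, R)
print(V)
```

For the reward variants in the exercise, only the reward vector changes: with R(s,a), use Rπ(s) = R(s,π(s)); with R(s,a,s′), use Rπ(s) = Σs′ T(s,π(s),s′) R(s,π(s),s′). The matrix equation keeps the same form, V = Rπ + γ Tπ V.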

  7. Policy values in practice “Robot” navigation in a grid maze (figure label: Goal state)

  8. The MDP formulation • State space: • Action space: • Reward function: • Transition function: ...

  9. The MDP formulation • Transition function: • If desired direction is unblocked • Move in desired direction with probability 0.7 • Stay in same place w/ prob 0.1 • Move “forward right” w/ prob 0.1 • Move “forward left” w/ prob 0.1 • If desired direction is blocked (wall) • Stay in same place w/ prob 1.0
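A sketch of that transition model in Python. The grid encoding, the coordinate convention, and the reading of “forward right/left” as drifting to the cell to the right/left of the intended heading are assumptions made for illustration; the slide's figure may define them slightly differently.

```python
import numpy as np

# Illustrative grid: 0 = free cell, 1 = wall (not the maze from the slides).
GRID = np.array([[0, 0, 0],
                 [0, 1, 0],
                 [0, 0, 0]])

MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
# "Forward left"/"forward right" interpreted relative to each heading.
LEFT  = {'N': 'W', 'W': 'S', 'S': 'E', 'E': 'N'}
RIGHT = {'N': 'E', 'E': 'S', 'S': 'W', 'W': 'N'}

def blocked(state, direction):
    """True if one step in `direction` from `state` leaves the grid or hits a wall."""
    r, c = state
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    return not (0 <= nr < GRID.shape[0] and 0 <= nc < GRID.shape[1]) or GRID[nr, nc] == 1

def step(state, direction):
    """Resulting state after one deterministic step (no-op if blocked)."""
    if blocked(state, direction):
        return state
    dr, dc = MOVES[direction]
    return (state[0] + dr, state[1] + dc)

def transition(state, action):
    """T(s, a, .) as a dict {next_state: probability}, following the slide's model."""
    if blocked(state, action):                      # desired direction blocked: stay put
        return {state: 1.0}
    outcomes = [(step(state, action), 0.7),          # intended move
                (state, 0.1),                        # stay in place
                (step(state, RIGHT[action]), 0.1),   # drift "forward right"
                (step(state, LEFT[action]), 0.1)]    # drift "forward left"
    probs = {}
    for s_next, p in outcomes:
        probs[s_next] = probs.get(s_next, 0.0) + p
    return probs

print(transition((0, 0), 'E'))
```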

  10. Policy values in practice Optimal policy, π* (figure legend: EAST, SOUTH, WEST, NORTH)

  11. Policy values in practice Value function for optimal policy, V* Why does it look like this?

  12. A harder “maze”... (figure labels: Walls, Doors)

  13. A harder “maze”... Optimal policy, π*

  14. A harder “maze”... Value function for optimal policy, V*

  15. A harder “maze”... Value function for optimal policy, V*

  16. Still more complex...

  17. Still more complex... Optimal policy, π*

  18. Still more complex... Value function for optimal policy, V*

  19. Still more complex... Value function for optimal policy, V*

  20. Planning: finding π* • So we know how to evaluate a single policy, π • How do you find the best policy? • Remember: still assuming that we know M=〈S,A,T,R〉

  21. Planning: finding π* • So we know how to evaluate a single policy, π • How do you find the best policy? • Remember: still assuming that we know M=〈S,A,T,R〉 • Non-solution: iterate through all possible π, evaluating each one; keep best

  22. Policy iteration & friends • Many different solutions available. • All exploit some characteristics of MDPs: • For infinite-horizon discounted reward in a discrete, finite MDP, there exists at least one optimal, stationary policy (may exist more than one equivalent policy) • The Bellman equation expresses recursive structure of an optimal policy • Leads to a series of closely related policy solutions: policy iteration, value iteration, generalized policy iteration, etc.

  23. The policy iteration alg. • Function: policy_iteration • Input: MDP M=〈S,A,T,R〉, discount γ • Output: optimal policy π*; opt. value func. V* • Initialization: choose π0 arbitrarily • Repeat { • Vi = eval_policy(M, πi, γ) // from Bellman eqn • πi+1 = local_update_policy(πi, Vi) • } Until (πi+1 == πi) • Function: π’ = local_update_policy(π, V) • for i = 1..|S| { • π’(si) = argmaxa∈A( sumj( T(si,a,sj) * V(sj) ) ) • }
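A minimal runnable sketch of this loop in Python, assuming a tabular MDP given as a |S|×|A|×|S| transition tensor T, a length-|S| state-based reward vector R, and a discount gamma (these array names and shapes are illustrative conventions, not from the slides):

```python
import numpy as np

def eval_policy(T, R, pi, gamma):
    """Policy evaluation via the matrix Bellman equation: V = (I - gamma*T_pi)^-1 R."""
    n_states = len(R)
    T_pi = T[np.arange(n_states), pi]          # |S| x |S| transition matrix under pi
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)

def local_update_policy(T, V):
    """Greedy one-step-lookahead improvement: argmax_a sum_j T(s,a,j) V(j)."""
    return np.argmax(T @ V, axis=1)

def policy_iteration(T, R, gamma):
    """T: |S|x|A|x|S| transition tensor, R: length-|S| reward vector."""
    pi = np.zeros(len(R), dtype=int)           # arbitrary initial policy
    while True:
        V = eval_policy(T, R, pi, gamma)
        pi_new = local_update_policy(T, V)
        if np.array_equal(pi_new, pi):         # policy unchanged: converged
            return pi, V
        pi = pi_new
```

Note that the greedy update can ignore R(s): with state-based rewards the reward term is the same for every action, so argmaxa Σj T(s,a,j)V(j) matches the slide's local_update_policy exactly.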

  24. Why does this work? • Two explanations: • Theoretical: • The local update w.r.t. the policy value is a contractive mapping, ergo a fixed point exists and will be reached • See “contraction mapping”, “Banach fixed-point theorem”, etc. • http://math.arizona.edu/~restrepo/475A/Notes/sourcea/node22.html • http://planetmath.org/encyclopedia/BanachFixedPointTheorem.html • Contracts w.r.t. the Bellman error (see the statement below)
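For reference, a standard statement of the contraction property behind this argument (textbook form, not taken from the slide):

```latex
% Bellman backup operator for a fixed policy pi:
\[
  (B^{\pi} V)(s) \;=\; R(s) \;+\; \gamma \sum_{s'} T\bigl(s,\pi(s),s'\bigr)\, V(s')
\]
% B^pi is a gamma-contraction in the max norm, so by the Banach fixed-point
% theorem it has a unique fixed point V^pi and iterating it converges there:
\[
  \bigl\lVert B^{\pi} V - B^{\pi} V' \bigr\rVert_{\infty}
  \;\le\; \gamma \,\bigl\lVert V - V' \bigr\rVert_{\infty}
\]
```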

  25. Why does this work? • The intuitive explanation • It’s doing a dynamic-programming “backup” of reward from reward “sources” • At every step, the policy is locally updated to take advantage of new information about reward that is propagated back by the evaluation step • Value “propagates away” from sources and the policy is able to say “hey! there’s reward over there! I can get some of that action if I change a bit!”

  26. P.I. in action Iteration 0 (figure panels: Policy, Value)

  27. P.I. in action Iteration 1 (figure panels: Policy, Value)

  28. P.I. in action Iteration 2 (figure panels: Policy, Value)

  29. P.I. in action Iteration 3 (figure panels: Policy, Value)

  30. P.I. in action Iteration 4 (figure panels: Policy, Value)

  31. P.I. in action Iteration 5 (figure panels: Policy, Value)

  32. P.I. in action Iteration 6: done (figure panels: Policy, Value)

  33. Properties • Policy iteration • Known to converge (provable) • Observed to converge exponentially quickly • # iterations is O(ln(|S|)) • Empirical observation; strongly believed but no proof (yet) • O(|S|3) time per iteration (policy evaluation)

  34. Variants • Other methods possible • Linear program (poly-time solution exists) • Value iteration (see the sketch below) • Generalized policy iteration (often best in practice)
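A minimal sketch of the value iteration variant, using the same tabular T, R, gamma conventions as the policy iteration sketch above (again illustrative conventions, not from the slides):

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-6):
    """Repeatedly apply the Bellman optimality backup until the largest change
    in the value function (the Bellman error) falls below `tol`."""
    V = np.zeros(len(R))
    while True:
        Q = R[:, None] + gamma * (T @ V)   # Q(s,a) = R(s) + gamma * sum_j T(s,a,j) V(j)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new  # greedy policy and its value estimate
        V = V_new
```

Instead of fully evaluating each intermediate policy, value iteration applies the Bellman optimality backup directly and stops once the Bellman error is small, which is why it is often cheaper per iteration than policy iteration.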
