
Optimal Policies for POMDP

Presentation Transcript


  1. Optimal Policies for POMDP Presented by Alp Sardağ

  2. As Much Reward As Possible? Greedy Agent

  3. How long does the agent make decisions? • Finite horizon • Infinite horizon (with a discount factor) • Values will converge. • A good model if the number of decision steps is not given.

  4. Policy • General plan • Deterministic: one action for each state • Stochastic: probability distribution over the set of actions • Stationary: can be applied at any time • Non-stationary: depends on time • Memoryless: uses no history

  5. Finite Horizon • The agent has to make k decisions; the policy is non-stationary.

  6. Infinite Horizon • We do not need a different policy for each time step. • Discount factor: 0 < γ < 1. • Infiniteness helps us find a stationary policy: π = {π_0, π_1, ..., π_t} becomes π = {π_i, π_i, ..., π_i}.

  7. MDP • Finite horizon: solved with dynamic programming. • Infinite horizon: |S| equations in |S| unknowns, solvable by LP.
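
The dynamic programming step for the finite-horizon MDP can be sketched as follows. This is a minimal illustration, not code from the presentation: the array layout `T[a][s][s2]` and `R[a][s]`, the function name, and the two-state example are all assumptions made for the sketch.

```python
def finite_horizon_values(T, R, k):
    """Backward induction for a k-step MDP.

    T[a][s][s2] is Pr(s2 | s, a) and R[a][s] is the immediate reward
    (hypothetical layout chosen for this sketch).
    Returns the k-step values and one decision rule per step,
    which is why the resulting policy is non-stationary.
    """
    n_actions, n_states = len(R), len(R[0])
    V = [0.0] * n_states          # value with zero decisions left
    policy = []                   # policy[i][s]: best action with i+1 decisions left
    for _ in range(k):
        Q = [[R[a][s] + sum(T[a][s][s2] * V[s2] for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        policy.append([max(range(n_actions), key=lambda a: Q[s][a])
                       for s in range(n_states)])
        V = [max(Q[s]) for s in range(n_states)]
    return V, policy


# Hypothetical example: action 0 stays put, action 1 swaps the two states;
# only staying in state 1 pays a reward of 1.
T_example = [[[1.0, 0.0], [0.0, 1.0]],
             [[0.0, 1.0], [1.0, 0.0]]]
R_example = [[0.0, 1.0],
             [0.0, 0.0]]
values, rules = finite_horizon_values(T_example, R_example, k=2)
```

With two decisions left, the best rule from state 0 is to swap first and then stay, so the decision rule differs between time steps.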

  8. MDP • Actions may be stochastic. • Do you know which state you end up in? • Dealing with uncertainty in observations.

  9. POMDP Model • Finite set of states • Finite set of actions • Transition probabilities (as in MDP) • Observation model • Reinforcement

  10. POMDP Model • Immediate reward for performing action a in state i.

  11. POMDP Model • Belief state: probability distribution over states, b = {b_0, b_1, ..., b_|S|}. • Drawback: computing the next belief state requires the world model. From Bayes' rule:
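
The slide's Bayes-rule formula did not survive in the transcript. A common form of the update is b'(s') = O(a, s', o) Σ_s T(a, s, s') b(s) / Pr(o | a, b), and the sketch below implements that form. The array layout, the function name, and the 0.85 listening accuracy in the example are assumptions for illustration, not values taken from the slides.

```python
def belief_update(b, a, o, T, O):
    """Bayes-rule belief update (a sketch under assumed array layouts).

    b[s]        : current belief
    T[a][s][s2] : Pr(s2 | s, a)
    O[a][s2][o] : Pr(o | s2, a)
    """
    n = len(b)
    # Unnormalized posterior: predict with T, then weight by the observation model.
    unnorm = [O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n))
              for s2 in range(n)]
    prob_o = sum(unnorm)  # Pr(o | a, b), the normalizing constant
    if prob_o == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [x / prob_o for x in unnorm]


# Hypothetical "listen" action: the state does not change, and the
# observation is correct with an assumed probability of 0.85.
T_listen = [[[1.0, 0.0], [0.0, 1.0]]]
O_listen = [[[0.85, 0.15], [0.15, 0.85]]]
b1 = belief_update([0.5, 0.5], 0, 0, T_listen, O_listen)
b2 = belief_update(b1, 0, 0, T_listen, O_listen)
```

Each consistent observation pushes the belief further toward one state, which is the sense in which the world model is needed to compute the next belief.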

  12. POMDP Model • Control dynamics for a POMDP

  13. Policies for POMDP • Belief states are infinite in number, so storing value functions in tables is infeasible. • For horizon length 1: • No control over observations (an issue not found in MDPs), so weigh all observations.

  14. Value functions for POMDPs • The formula is complex; however, if the VF is piecewise linear (a way of representing a VF over a continuous space), it can be written:

  15. Value functions for POMDPs

  16. Value Functions for POMDPs • Given V_{t-1}, V_t can be calculated. • Keep the action which gives rise to each specific α vector. • To find the optimal policy at a belief state, just perform a maximization over all α vectors and take the associated action.
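
The maximization over α vectors can be sketched as below. The vectors and action names are made up for illustration; only the idea (dot product of belief and vector, then take the action attached to the best vector) comes from the slides.

```python
def dot(b, alpha):
    """Value of one alpha vector at belief b."""
    return sum(bi * ai for bi, ai in zip(b, alpha))


def best_vector(b, vectors):
    """Return the (action, alpha) pair maximizing b . alpha.

    vectors is a list of (action, alpha) pairs; this pairing is an
    assumed representation for the sketch.
    """
    return max(vectors, key=lambda pair: dot(b, pair[1]))


def value(b, vectors):
    """Piecewise linear value function: max over all alpha vectors."""
    return dot(b, best_vector(b, vectors)[1])


# Hypothetical two-state value function made of three vectors.
example_vectors = [("a0", (1.0, 0.0)), ("a1", (0.0, 1.0)), ("a2", (0.6, 0.6))]
action_mid = best_vector((0.5, 0.5), example_vectors)[0]
action_edge = best_vector((0.9, 0.1), example_vectors)[0]
```

Near the middle of the belief simplex the flat vector wins, while near a corner the vector that is steep toward that corner wins, so each vector defines a region of beliefs where its action is optimal.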

  17. Geometric Interpretation of VF • Belief simplex: • 2 dimensional case:

  18. Geometric Interpretation of VF • 3 dimensional case :

  19. Alternate VF Interpretation • A decision tree could enumerate each possible policy for a k-horizon problem, if the initial belief state is given.

  20. Alternate VF Interpretation • The number of nodes for each action: • The number of possible trees (|A| possible actions for each node): • If we somehow generate only the useful trees, the complexity will be greatly reduced. • Previously, to create the entire VF we had to generate an α for every belief point b, too many for the algorithm to work.
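
The two counts on this slide were lost in the transcript. Assuming each node of a policy tree branches once per observation, a k-step tree has 1 + |Ω| + ... + |Ω|^(k-1) nodes, and with |A| choices per node there are |A| raised to that number of distinct trees. The sketch below computes both under that assumption; the function names are hypothetical.

```python
def tree_nodes(n_obs, k):
    """Nodes in a k-step policy tree that branches on every observation."""
    return sum(n_obs ** depth for depth in range(k))


def tree_count(n_actions, n_obs, k):
    """Distinct k-step policy trees: one action choice per node."""
    return n_actions ** tree_nodes(n_obs, k)


# Sizes for a problem with 3 actions and 2 observations.
nodes_h2 = tree_nodes(2, 2)
trees_h2 = tree_count(3, 2, 2)
trees_h4 = tree_count(3, 2, 4)
```

Even for 3 actions and 2 observations the count grows from 27 trees at horizon 2 to over fourteen million at horizon 4, which is why generating only the useful trees matters.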

  21. POMDP Solutions • For finite horizon: • Iterate over time steps. Given V_{t-1}, compute V_t. • Retain all intermediate solutions. • For finitely transient policies, the same idea applies to finding the infinite horizon solution. • Iterate until the optimal value functions are the same for two consecutive time steps. • Once the infinite horizon solution is found, discard all intermediate results.

  22. POMDP Solutions • Given V_{t-1}, V_t can be calculated for one belief point b from the previous formula, but with no knowledge of the region over which it is optimal. (Sondik) • Too many belief points b to construct the VF; one possible solution: • Choose random points. • If the number of points is large, one is unlikely to miss any of the true vectors. • How many points to choose? No guarantee. • Find optimal policies by developing a systematic algorithm to explore the entire continuous space of beliefs.

  23. Tiger Problem • Actions: open left door, open right door, listen. • Listening is not accurate. • s0: tiger on the left, s1: tiger on the right. • Rewards: +10 for opening the correct door, -100 for the wrong door, -1 for listening. • Initially: b = (0.5, 0.5)

  24. Tiger Problem

  25. Tiger Problem • First action, intuitively: • Opening a door: (-100 + 10) / 2 = -45; listening: -1. • For horizon length 1:
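
The arithmetic for the first action can be checked with a few lines. The rewards are the ones listed on the Tiger Problem slide; the dictionary layout and function names are assumptions for this sketch.

```python
# Immediate rewards per action as (reward in s0, reward in s1),
# where s0 = tiger on the left and s1 = tiger on the right.
REWARDS = {
    "listen":     (-1.0, -1.0),
    "open-left":  (-100.0, 10.0),   # opening left is wrong when the tiger is left
    "open-right": (10.0, -100.0),
}


def expected_reward(b, action):
    """Expected immediate reward of an action at belief b = (Pr s0, Pr s1)."""
    return sum(p * r for p, r in zip(b, REWARDS[action]))


def best_action(b):
    """Horizon-1 policy: pick the action with the largest expected reward."""
    return max(REWARDS, key=lambda action: expected_reward(b, action))


open_value = expected_reward((0.5, 0.5), "open-left")
first_action = best_action((0.5, 0.5))
confident_action = best_action((0.95, 0.05))
```

Setting 10p - 100(1 - p) equal to -1 gives p = 0.9, so with one step left, opening a door only beats listening once the belief that the tiger is behind the other door exceeds 0.9.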

  26. Tiger Problem • For Horizon length 2:

  27. Tiger Problem • For horizon length 4, nice features: • Belief states, for the same action and observation, are transformed to a single belief state. • The observations made precisely define the nodes in the graph that would be traversed.

  28. Infinite Horizon • Finite horizon is cumbersome: a different policy for the same belief point at each time step. • A different set of vectors for each time step. • Add a discount factor to the tiger problem; after the 56th step the underlying vectors are only slightly different:

  29. Infinite Horizon for Tiger Problem • In this way the finite horizon algorithms can be used for infinite horizon problems. • Advantage of infinite horizon: keep only the last policy.

  30. Policy Graphs • A way to encode the policy without keeping vectors; no dot products needed. [Figure: policy graph with a beginning state and an end state]

  31. Finite Transience • All the belief states within a particular partition element will be transformed to another element for a particular action and observation. • For non-finitely transient policies, policy graphs that are exactly optimal cannot be constructed.

  32. Overview of Algorithms • All are performed iteratively. • All try to find the set of vectors that define both the value function and the optimal policy at each time step. • Two separate classes: • Given V_{t-1}, generate a superset of V_t, then reduce that set until the optimal V_t is found (Monahan and Eagle). • Given V_{t-1}, construct a subset of the optimal V_t. These subsets grow larger until the optimal V_t is found.

  33. Monahan Algorithm • Easy to implement. • Do not expect to solve anything but the smallest of problems. • Provides background for understanding the other algorithms.

  34. Monahan Enumeration Phase • Generate all vectors: number of generated vectors = |A| M^|Ω|, where M is the number of vectors from the previous step and Ω is the set of observations.
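
A minimal sketch of the enumeration phase follows: for every action and every way of assigning one previous vector to each observation, build one new vector. The backup formula used here is the standard one and the array layouts are assumptions; the tiger dynamics in the example (listening accuracy 0.85, and opening a door resetting the problem with uninformative observations) are also assumed, since the slides give only the rewards.

```python
from itertools import product


def enumerate_vectors(prev, T, O, R, gamma=1.0):
    """Generate all |A| * M**|Omega| candidate vectors (a sketch).

    prev        : list of M vectors from the previous step
    T[a][s][s2] : Pr(s2 | s, a)
    O[a][s2][o] : Pr(o | s2, a)
    R[a][s]     : immediate reward
    """
    n_actions, n_states, n_obs = len(R), len(R[0]), len(O[0][0])
    out = []
    for a in range(n_actions):
        # choice[o] is the index of the previous vector followed after observing o.
        for choice in product(range(len(prev)), repeat=n_obs):
            alpha = tuple(
                R[a][s] + gamma * sum(
                    T[a][s][s2] * O[a][s2][o] * prev[choice[o]][s2]
                    for o in range(n_obs) for s2 in range(n_states))
                for s in range(n_states))
            out.append((a, alpha))
    return out


# Assumed tiger model. Actions: 0 = listen, 1 = open left, 2 = open right.
uniform = [[0.5, 0.5], [0.5, 0.5]]
T_tiger = [[[1.0, 0.0], [0.0, 1.0]], uniform, uniform]
O_tiger = [[[0.85, 0.15], [0.15, 0.85]], uniform, uniform]
R_tiger = [[-1.0, -1.0], [-100.0, 10.0], [10.0, -100.0]]
horizon1 = [(-1.0, -1.0), (-100.0, 10.0), (10.0, -100.0)]
vectors = enumerate_vectors(horizon1, T_tiger, O_tiger, R_tiger)
```

With 3 actions, 2 observations, and 3 previous vectors this produces 3 x 3^2 = 27 candidates, most of which the reduction phase will discard.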

  35. Monahan Reduction Phase • All vectors can be kept: • Each time, maximize over all vectors. • A lot of excess baggage. • The number of vectors in the next step will be even larger. • LP is used to trim away useless vectors.

  36. Monahan Reduction Phase • For a vector to be useful, there must be at least one belief point at which it gives a larger value than all the others:
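
The usefulness test can be illustrated without an LP solver in the special case of two states, where a belief is a single number p. This is a swapped-in technique, not Monahan's LP: the difference between a vector and the upper surface of the others is concave and piecewise linear in p, so it is enough to check the endpoints and the points where pairs of the other vectors cross. All names below are hypothetical.

```python
def is_useful_2state(alpha, others, eps=1e-9):
    """True if alpha is strictly best at some belief (p, 1 - p).

    Two-state substitute for the LP test; assumes others excludes alpha.
    """
    def val(v, p):
        return p * v[0] + (1.0 - p) * v[1]

    if not others:
        return True
    candidates = {0.0, 1.0}
    for i, u in enumerate(others):
        for w in others[i + 1:]:
            denom = (u[0] - u[1]) - (w[0] - w[1])
            if abs(denom) > eps:                 # skip parallel vectors
                p = (w[1] - u[1]) / denom        # belief where u and w cross
                if 0.0 <= p <= 1.0:
                    candidates.add(p)
    return any(val(alpha, p) - max(val(v, p) for v in others) > eps
               for p in candidates)


def prune_2state(vectors):
    """Keep only the vectors that are useful against all the others."""
    unique = list(dict.fromkeys(vectors))        # drop exact duplicates first
    return [v for i, v in enumerate(unique)
            if is_useful_2state(v, unique[:i] + unique[i + 1:])]


kept = prune_2state([(1.0, 0.0), (0.0, 1.0), (0.6, 0.6), (0.4, 0.4)])
```

For more than two states the crossing points are no longer a finite list that is easy to enumerate, which is where the linear program comes in.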

  37. Monahan Algorithm

  38. Monahan’s LP Complication

  39. Future Work • Eagle’s Variant of Monahan’s Algorithm. • Sondik’s One-Pass Algorithm. • Cheng’s Relaxed Region Algorithm. • Cheng’s Linear Support Algorithm.
