
An Introduction to PO-MDP


Presentation Transcript


  1. An Introduction to PO-MDP Presented by Alp Sardağ

  2. MDP • Components: • State • Action • Transition • Reinforcement • Problem: • choose the action that makes the right tradeoff between immediate rewards and future gains, to yield the best possible solution • Solution: • Policy: a mapping from states to actions, derived from a value function (see the sketch below)
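A minimal value-iteration sketch for a fully observable, tabular MDP, assuming hypothetical dictionaries T[s][a][s2] (transition probabilities) and R[s][a] (immediate rewards); none of these names come from the slides:

    # Value iteration for a small, fully observable MDP (illustrative sketch).
    def value_iteration(states, actions, T, R, gamma=0.95, eps=1e-6):
        """T[s][a][s2] = P(s2 | s, a), R[s][a] = immediate reward."""
        V = {s: 0.0 for s in states}            # start from the zero value function
        while True:
            V_new = {
                s: max(R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in states)
                       for a in actions)
                for s in states
            }
            if max(abs(V_new[s] - V[s]) for s in states) < eps:
                return V_new                    # acting greedily w.r.t. this V gives the policy
            V = V_new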

  3. Definition • Horizon length • Value Iteration • Temporal Difference Learning: Q(x,a) ← Q(x,a) + α(r + γ max_b Q(y,b) − Q(x,a)), where α is the learning rate and γ the discount rate. • Adding partial observability (PO) to a CO-MDP is not trivial: • These methods require complete observability of the state. • PO clouds the current state.
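A minimal sketch of the temporal-difference (Q-learning) update above, using a tabular Q stored in a dictionary; the function and variable names are illustrative, not from the slides:

    from collections import defaultdict

    # Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_b Q(y,b) - Q(x,a))
    # alpha: learning rate, gamma: discount rate.
    def td_update(Q, x, a, r, y, actions, alpha=0.1, gamma=0.95):
        best_next = max(Q[(y, b)] for b in actions)      # max_b Q(y, b)
        Q[(x, a)] += alpha * (r + gamma * best_next - Q[(x, a)])

    Q = defaultdict(float)    # tabular Q-values, default 0.0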

  4. PO-MDP • Components: • States • Actions • Transitions • Reinforcement • Observations

  5. Mapping in CO-MDP & PO-MDP • In CO-MDPs, mapping is from states to actions. • In PO-MDPs, mapping is from probability distributions (over states) to actions.

  6. VI in CO-MDP & PO-MDP • In a CO-MDP, • Track our current state • Update it after each action • In a PO-MDP, • Probability distribution over states • Perform an action and make an observation, then update the distribution

  7. Belief State and Space • Belief State: a probability distribution over states. • Belief Space: the entire probability space. • Example: • Assume a two-state PO-MDP. • P(s1) = p and P(s2) = 1 − p, so the belief space is a line segment. • The line becomes a hyperplane in higher dimensions.

  8. Belief Transform • Assumption: • Finite action set • Finite observation set • Next belief state = T(cbf, a, o), where cbf: current belief state, a: action, o: observation • Finite number of possible next belief states (see the sketch below)
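A sketch of the belief transform T(cbf, a, o) above, assuming tabular models T_model[a][s][s2] = P(s2 | s, a) and O_model[a][s2][o] = P(o | s2, a); these model names are assumptions, not from the slides:

    def belief_update(cbf, a, o, states, T_model, O_model):
        """Next belief state = T(cbf, a, o); cbf maps each state to its probability."""
        new_b = {
            s2: O_model[a][s2][o] * sum(T_model[a][s][s2] * cbf[s] for s in states)
            for s2 in states
        }
        norm = sum(new_b.values())      # = P(o | cbf, a)
        return {s2: p / norm for s2, p in new_b.items()}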

  9. PO-MDP into continuous CO-MDP • The process is Markovian: the next belief state depends only on • the current belief state • the current action • the observation • A discrete PO-MDP problem can therefore be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.

  10. Problem • Using VI in a continuous state space. • There is no nice tabular representation as before.

  11. PWLC • Restrictions on the form of the solutions to the continuous-space CO-MDP: • The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length. • Each linear segment can be represented as a vector, and the value of a belief point is simply the dot product of the belief vector and the segment's vector. • GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.

  12. Steps in VI • Represent the value function for each horizon as a set of vectors. • This overcomes the problem of representing a value function over a continuous space. • To evaluate a belief state, find the vector that has the largest dot product with it (see the sketch below).
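A sketch of this representation: the value function is stored as a set of vectors (one per linear segment), and the value at a belief point is the largest dot product; the names are illustrative:

    def value_at(belief, vectors):
        """PWLC value function: `vectors` holds one vector per linear segment;
        the value at `belief` is the maximal dot product."""
        return max(sum(b_i * v_i for b_i, v_i in zip(belief, v)) for v in vectors)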

  13. PO-MDP Value Iteration Example • Assumption: • Two states (s1, s2) • Two actions (a1, a2) • Three observations • Example: horizon length is 1, b = [0.25 0.75]. • Immediate rewards: R(s1,a1) = 1, R(s2,a1) = 0; R(s1,a2) = 0, R(s2,a2) = 1.5. • V(a1,b) = 0.25x1 + 0.75x0 = 0.25; V(a2,b) = 0.25x0 + 0.75x1.5 = 1.125. • The belief space is partitioned into a region where a1 is the best action and a region where a2 is the best; at b, a2 is the best.
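The horizon-1 computation above as a short sketch (reusing the slide's numbers; variable names are illustrative):

    b = [0.25, 0.75]                 # belief over [s1, s2]
    alpha_a1 = [1.0, 0.0]            # immediate rewards of a1 in s1, s2
    alpha_a2 = [0.0, 1.5]            # immediate rewards of a2 in s1, s2

    V_a1 = sum(p * r for p, r in zip(b, alpha_a1))   # 0.25*1 + 0.75*0   = 0.25
    V_a2 = sum(p * r for p, r in zip(b, alpha_a2))   # 0.25*0 + 0.75*1.5 = 1.125
    best = "a1" if V_a1 >= V_a2 else "a2"            # a2 is the best at this belief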

  14. PO-MDP Value Iteration Example • The value of a belief state for horizon length 2, given b, a1, z1, is: • the immediate reward of the action plus the value of the best next action. • Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.

  15. PO-MDP Value Iteration Example • Find the value for all the belief points given this fixed action and observation. • The transformed value function is also PWLC.

  16. PO-MDP Value Iteration Example • How do we compute the value of a belief state given only the action? • The horizon-2 value of the belief state is the observation-probability-weighted sum of the values for each observation: • Values for each observation: z1: 0.7, z2: 0.8, z3: 1.2 • P(z1|b,a1) = 0.6; P(z2|b,a1) = 0.25; P(z3|b,a1) = 0.15 • Expected value: 0.6x0.8 + 0.25x0.7 + 0.15x1.2 = 0.835
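As a sketch, this is a probability-weighted sum over observations; the pairing of probabilities with values below follows the slide's arithmetic:

    # Expected horizon-2 value given only the action:
    # sum over observations z of P(z | b, a1) * V(b, a1, z).
    probs  = [0.60, 0.25, 0.15]     # P(z | b, a1), from the slide
    values = [0.80, 0.70, 1.20]     # per-observation values, paired as in the slide's sum
    V_b_a1 = sum(p * v for p, v in zip(probs, values))   # = 0.835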

  17. Transformed Value Functions • Each of these transformed functions partitions the belief space differently. • Best next action to perform depends upon the initial belief state and observation.

  18. Best Value For Belief States • The value of every single belief point is the sum of: • the immediate reward, and • the line segments from the S() functions for each observation's future strategy. • Since adding lines gives you lines, the result is linear (see the sketch below).
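A sketch of this construction in vector form: because each term is linear in the belief, the sum can be built by adding the corresponding vectors; the per-observation choice of future strategy is assumed fixed, and the names are illustrative:

    def backed_up_vector(immediate_reward_vec, chosen_future_vecs):
        """Add the immediate-reward vector to one transformed vector per observation;
        adding linear functions of the belief keeps the result linear."""
        result = list(immediate_reward_vec)
        for vec in chosen_future_vecs:
            result = [r + v for r, v in zip(result, vec)]
        return result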

  19. Best Strategy for any Belief Points • All the useful future strategies are easy to pick out:

  20. Value Function and Partition • For the specific action a1, the value function and corresponding partitions:

  21. Value Function and Partition • For the specific action a2, the value function and corresponding partitions:

  22. Which Action to Choose? • Put the value functions for each action together to see where each action gives the highest value (see the sketch below).
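A sketch of this step: tag each vector with the action that generated it, evaluate all vectors at the belief point, and return the action of the winning vector; the data layout is an assumption, not from the slides:

    def best_action(belief, action_vectors):
        """action_vectors: maps each action to its list of vectors.
        Returns the action whose value function is highest at `belief`."""
        def dot(b, v):
            return sum(b_i * v_i for b_i, v_i in zip(b, v))
        scores = {a: max(dot(belief, v) for v in vecs) for a, vecs in action_vectors.items()}
        return max(scores, key=scores.get)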

  23. Compact Horizon 2 Value Function

  24. Value Function for Action a1 with a Horizon of 3

  25. Value Function for Action a2 with a Horizon of 3

  26. Value Function for Both Actions with a Horizon of 3

  27. Value Function for Horizon of 3
