An Introduction to PO-MDP Presented by Alp Sardağ
MDP • Components: • State • Action • Transition • Reinforcement • Problem: • choose the action that makes the right tradeoffs between the immediate rewards and the future gains, to yield the best possible solution • Solution: • Policy: value function
Definition • Horizon length • Value Iteration: V_t(x) = max_a [R(x,a) + γ Σ_y P(y|x,a) V_{t−1}(y)] • Temporal Difference Learning: Q(x,a) ← Q(x,a) + α(r + γ max_b Q(y,b) − Q(x,a)), where α is the learning rate and γ the discount rate. • Adding PO to CO-MDP is not trivial: • VI and TD require complete observability of the state. • PO clouds the current state.
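A minimal sketch of the temporal-difference update above, assuming a tabular Q stored as a dictionary keyed by (state, action); the names alpha, gamma, and actions are illustrative stand-ins for the learning rate, discount rate, and action set:

```python
def td_update(Q, x, a, r, y, actions, alpha=0.1, gamma=0.95):
    """Apply Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_b Q(y,b) - Q(x,a))."""
    best_next = max(Q.get((y, b), 0.0) for b in actions)   # max_b Q(y,b)
    old = Q.get((x, a), 0.0)
    Q[(x, a)] = old + alpha * (r + gamma * best_next - old)
    return Q
```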
PO-MDP • Components: • States • Actions • Transitions • Reinforcement • Observations
Mapping in CO-MDP & PO-MDP • In CO-MDPs, mapping is from states to actions. • In PO-MDPs, mapping is from probability distributions (over states) to actions.
VI in CO-MDP & PO-MDP • In a CO-MDP, • Track our current state • Update it after each action • In a PO-MDP, • Probability distribution over states • Perform an action and make an observation, then update the distribution
Belief State and Space • Belief State: probability distribution over states. • Belief Space: the entire probability space. • Example: • Assume a two-state PO-MDP. • P(s1) = p & P(s2) = 1−p, so the belief space is a line segment. • The line becomes a hyperplane in higher dimensions.
Belief Transform • Assumption: • Finite actions • Finite observations • Next belief state = T(cbf, a, o), where cbf: current belief state, a: action, o: observation • Finite number of possible next belief states
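A sketch of the belief transform T(cbf, a, o), assuming the standard update b'(s') ∝ P(o | s', a) · Σ_s P(s' | s, a) · b(s); the dictionaries trans and obs are hypothetical stand-ins for the transition and observation models:

```python
def belief_update(b, a, o, states, trans, obs):
    """Return the next belief state T(b, a, o).

    b[s]              : current belief, one probability per state s
    trans[(s, a, s2)] : P(s2 | s, a)
    obs[(a, s2, o)]   : P(o | s2, a)
    """
    new_b = {}
    for s2 in states:
        # P(o | s2, a) * sum_s P(s2 | s, a) * b(s)
        new_b[s2] = obs[(a, s2, o)] * sum(trans[(s, a, s2)] * b[s] for s in states)
    norm = sum(new_b.values())  # equals P(o | b, a)
    return {s2: v / norm for s2, v in new_b.items()}
```

Because the actions and observations are finite, only finitely many such next belief states can follow any given belief state.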
PO-MDP into continuous CO-MDP • The process remains Markovian; the next belief state depends only on: • Current belief state • Current action • Observation • A discrete PO-MDP problem can therefore be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.
Problem • Using VI in continuous state space. • No nice tabular representation as before.
PWLC • Restrictions on the form of the solutions to the continuous-space CO-MDP: • The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length. • The value of a belief point is simply the largest dot product between the belief vector and the vectors representing the linear segments. • GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.
Steps in VI • Represent the value function for each horizon as a set of vectors. • This overcomes the problem of representing a value function over a continuous space. • To evaluate a belief state, find the vector that has the largest dot product with it.
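A small sketch of that representation, assuming each linear segment is stored as one vector of per-state coefficients:

```python
def value_at(belief, vectors):
    """Value of a belief point under a PWLC value function.

    belief  : list of probabilities, one entry per state
    vectors : list of linear segments, each a list of per-state coefficients
    """
    return max(sum(p * c for p, c in zip(belief, vec)) for vec in vectors)
```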
PO-MDP Value Iteration Example • Assumption: • Two states (s1, s2) • Two actions (a1, a2) • Three observations • Ex: horizon length is 1, b = [0.25 0.75]
• Immediate rewards: R(s1,a1) = 1, R(s2,a1) = 0, R(s1,a2) = 0, R(s2,a2) = 1.5
• V(a1,b) = 0.25x1 + 0.75x0 = 0.25
• V(a2,b) = 0.25x0 + 0.75x1.5 = 1.125
• [Figure: the two value lines partition the belief space; a1 is the best in one region, a2 is the best in the other.]
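The horizon-1 computation reduces to a dot product of the belief with each action's reward column; a sketch reproducing the numbers above:

```python
b = [0.25, 0.75]                      # belief over (s1, s2)
reward = {"a1": [1.0, 0.0],           # R(s1,a1) = 1,  R(s2,a1) = 0
          "a2": [0.0, 1.5]}           # R(s1,a2) = 0,  R(s2,a2) = 1.5

values = {a: sum(p * r for p, r in zip(b, reward[a])) for a in reward}
# values == {'a1': 0.25, 'a2': 1.125}, so a2 is best for this belief
best_action = max(values, key=values.get)
```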
PO-MDP Value Iteration Example • The value of a belief state for horizon length 2, given b, a1, z1: • the immediate reward plus the value of the next action, taken at the resulting belief state. • Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
PO-MDP Value Iteration Example • Find the value for all belief points given this fixed action and observation. • The transformed value function is also PWLC.
PO-MDP Value Iteration Example • How do we compute the value of a belief state given only the action? • The horizon-2 value of the belief state weights each observation's value by the probability of that observation: • Best values for each observation: z1: 0.8, z2: 0.7, z3: 1.2 • P(z1|b,a1) = 0.6; P(z2|b,a1) = 0.25; P(z3|b,a1) = 0.15 • V(b,a1) = 0.6x0.8 + 0.25x0.7 + 0.15x1.2 = 0.835
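A sketch of this weighting step, using the numbers from the slide; value_after_obs holds the best achievable value for the belief state reached after each observation:

```python
# Best achievable value after each observation (from the example above)
value_after_obs = {"z1": 0.8, "z2": 0.7, "z3": 1.2}
# Probability of each observation given belief b and action a1
p_obs = {"z1": 0.6, "z2": 0.25, "z3": 0.15}

value_b_a1 = sum(p_obs[z] * value_after_obs[z] for z in p_obs)
# 0.6*0.8 + 0.25*0.7 + 0.15*1.2 == 0.835
```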
Transformed Value Functions • Each of these transformed functions partitions the belief space differently. • Best next action to perform depends upon the initial belief state and observation.
Best Value For Belief States • The value of every single belief point is the sum of: • the immediate reward, and • the line segments from the S() functions for each observation's future strategy. • Since adding lines gives you lines, the result is linear.
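A sketch of why the result stays linear, assuming the chosen S(a, z) segments and the immediate-reward vector are all given as per-state coefficient lists (the names here are illustrative):

```python
def build_segment(immediate_reward, chosen_S):
    """Add the immediate-reward vector and one chosen S(a, z) vector per
    observation, component-wise.  Lines added to lines stay lines, so the
    result is again a single linear segment of the value function."""
    segment = list(immediate_reward)
    for s_vec in chosen_S:            # one vector per observation z
        segment = [x + y for x, y in zip(segment, s_vec)]
    return segment
```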
Best Strategy for any Belief Points • All the useful future strategies are easy to pick out:
Value Function and Partition • For the specific action a1, the value function and corresponding partitions:
Value Function and Partition • For the specific action a2, the value function and corresponding partitions:
Which Action to Choose? • Put the value functions for each action together to see where each action gives the highest value.
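One way to sketch this final step: tag each segment with the action that generated it, so the best action at a belief point falls out of the same largest-dot-product test (the pairing scheme is illustrative):

```python
def best_action(belief, tagged_vectors):
    """tagged_vectors: (action, vector) pairs gathered from every action's
    value function; returns the action whose segment scores highest at b."""
    best_a, best_v = None, float("-inf")
    for action, vec in tagged_vectors:
        v = sum(p * c for p, c in zip(belief, vec))
        if v > best_v:
            best_a, best_v = action, v
    return best_a, best_v
```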