An Introduction to PO-MDP Presented by Alp Sardağ
MDP • Components: • State • Action • Transition • Reinforcement • Problem: • choose the action that makes the right tradeoffs between the immediate rewards and the future gains, to yield the best possible solution • Solution: • Policy: value function
Definition • Horizon length • Value Iteration: V_t(x) = max_a [R(x,a) + γ Σ_y P(y|x,a) V_{t−1}(y)] • Temporal Difference Learning: Q(x,a) ← Q(x,a) + α(r + γ max_b Q(y,b) − Q(x,a)), where α is the learning rate and γ the discount rate. • Adding PO to CO-MDP is not trivial: • VI and TD require complete observability of the state. • PO clouds the current state.
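A minimal sketch of the temporal-difference update above, assuming a tabular Q stored as a dictionary keyed by (state, action); the names alpha, gamma, and actions are illustrative stand-ins for the learning rate, discount rate, and action set:

```python
def td_update(Q, x, a, r, y, actions, alpha=0.1, gamma=0.95):
    """Apply Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_b Q(y,b) - Q(x,a))."""
    best_next = max(Q.get((y, b), 0.0) for b in actions)   # max_b Q(y,b)
    old = Q.get((x, a), 0.0)
    Q[(x, a)] = old + alpha * (r + gamma * best_next - old)
    return Q
```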
PO-MDP • Components: • States • Actions • Transitions • Reinforcement • Observations
Mapping in CO-MDP & PO-MDP • In CO-MDPs, mapping is from states to actions. • In PO-MDPs, mapping is from probability distributions (over states) to actions.
VI in CO-MDP & PO-MDP • In a CO-MDP, • Track our current state • Update it after each action • In a PO-MDP, • Probability distribution over states • Perform an action and make an observation, then update the distribution
Belief State and Space • Belief State: probability distribution over states. • Belief Space: the entire probability space. • Example: • Assume a two-state PO-MDP. • P(s1) = p & P(s2) = 1−p, so the belief space is a line segment. • The line becomes a hyperplane in higher dimensions.
Belief Transform • Assumption: • Finite actions • Finite observations • Next belief state = T(cbf, a, o), where cbf: current belief state, a: action, o: observation • Finite number of possible next belief states
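A sketch of the belief transform T(cbf, a, o), assuming the standard update b'(s') ∝ P(o | s', a) · Σ_s P(s' | s, a) · b(s); the dictionaries trans and obs are hypothetical stand-ins for the transition and observation models:

```python
def belief_update(b, a, o, states, trans, obs):
    """Return the next belief state T(b, a, o).

    b[s]              : current belief, one probability per state s
    trans[(s, a, s2)] : P(s2 | s, a)
    obs[(a, s2, o)]   : P(o | s2, a)
    """
    new_b = {}
    for s2 in states:
        # P(o | s2, a) * sum_s P(s2 | s, a) * b(s)
        new_b[s2] = obs[(a, s2, o)] * sum(trans[(s, a, s2)] * b[s] for s in states)
    norm = sum(new_b.values())  # equals P(o | b, a)
    return {s2: v / norm for s2, v in new_b.items()}
```

Because the actions and observations are finite, only finitely many such next belief states can follow any given belief state.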
PO-MDP into continuous CO-MDP • The process remains Markovian; the next belief state depends only on: • Current belief state • Current action • Observation • A discrete PO-MDP problem can therefore be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.
Problem • Using VI in continuous state space. • No nice tabular representation as before.
PWLC • Restrictions on the form of the solutions to the continuous-space CO-MDP: • The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length. • The value of a belief point is simply the largest dot product between the belief vector and the vectors representing the linear segments. • GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.
Steps in VI • Represent the value function for each horizon as a set of vectors. • This overcomes the problem of representing a value function over a continuous space. • To evaluate a belief state, find the vector that has the largest dot product with it.
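A small sketch of that representation, assuming each linear segment is stored as one vector of per-state coefficients:

```python
def value_at(belief, vectors):
    """Value of a belief point under a PWLC value function.

    belief  : list of probabilities, one entry per state
    vectors : list of linear segments, each a list of per-state coefficients
    """
    return max(sum(p * c for p, c in zip(belief, vec)) for vec in vectors)
```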
PO-MDP Value Iteration Example • Assumption: • Two states (s1, s2) • Two actions (a1, a2) • Three observations • Ex: horizon length is 1, b = [0.25 0.75]
• Immediate rewards: R(s1,a1) = 1, R(s2,a1) = 0, R(s1,a2) = 0, R(s2,a2) = 1.5
• V(a1,b) = 0.25x1 + 0.75x0 = 0.25
• V(a2,b) = 0.25x0 + 0.75x1.5 = 1.125
• [Figure: the two value lines partition the belief space; a1 is the best in one region, a2 is the best in the other.]
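The horizon-1 computation reduces to a dot product of the belief with each action's reward column; a sketch reproducing the numbers above:

```python
b = [0.25, 0.75]                      # belief over (s1, s2)
reward = {"a1": [1.0, 0.0],           # R(s1,a1) = 1,  R(s2,a1) = 0
          "a2": [0.0, 1.5]}           # R(s1,a2) = 0,  R(s2,a2) = 1.5

values = {a: sum(p * r for p, r in zip(b, reward[a])) for a in reward}
# values == {'a1': 0.25, 'a2': 1.125}, so a2 is best for this belief
best_action = max(values, key=values.get)
```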
PO-MDP Value Iteration Example • The value of a belief state for horizon length 2, given b, a1, z1: • the immediate reward plus the value of the next action, taken at the resulting belief state. • Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
PO-MDP Value Iteration Example • Find the value for all belief points given this fixed action and observation. • The transformed value function is also PWLC.
PO-MDP Value Iteration Example • How do we compute the value of a belief state given only the action? • The horizon-2 value of the belief state weights each observation's value by the probability of that observation: • Best values for each observation: z1: 0.8, z2: 0.7, z3: 1.2 • P(z1|b,a1) = 0.6; P(z2|b,a1) = 0.25; P(z3|b,a1) = 0.15 • V(b,a1) = 0.6x0.8 + 0.25x0.7 + 0.15x1.2 = 0.835
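A sketch of this weighting step, using the numbers from the slide; value_after_obs holds the best achievable value for the belief state reached after each observation:

```python
# Best achievable value after each observation (from the example above)
value_after_obs = {"z1": 0.8, "z2": 0.7, "z3": 1.2}
# Probability of each observation given belief b and action a1
p_obs = {"z1": 0.6, "z2": 0.25, "z3": 0.15}

value_b_a1 = sum(p_obs[z] * value_after_obs[z] for z in p_obs)
# 0.6*0.8 + 0.25*0.7 + 0.15*1.2 == 0.835
```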
Transformed Value Functions • Each of these transformed functions partitions the belief space differently. • Best next action to perform depends upon the initial belief state and observation.
Best Value For Belief States • The value of every single belief point is the sum of: • the immediate reward, and • the line segments from the S() functions for each observation's future strategy. • Since adding lines gives you lines, the result is linear.
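A sketch of why the result stays linear, assuming the chosen S(a, z) segments and the immediate-reward vector are all given as per-state coefficient lists (the names here are illustrative):

```python
def build_segment(immediate_reward, chosen_S):
    """Add the immediate-reward vector and one chosen S(a, z) vector per
    observation, component-wise.  Lines added to lines stay lines, so the
    result is again a single linear segment of the value function."""
    segment = list(immediate_reward)
    for s_vec in chosen_S:            # one vector per observation z
        segment = [x + y for x, y in zip(segment, s_vec)]
    return segment
```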
Best Strategy for any Belief Points • All the useful future strategies are easy to pick out:
Value Function and Partition • For the specific action a1, the value function and corresponding partitions:
Value Function and Partition • For the specific action a2, the value function and corresponding partitions:
Which Action to Choose? • Put the value functions for each action together to see where each action gives the highest value.
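One way to sketch this final step: tag each segment with the action that generated it, so the best action at a belief point falls out of the same largest-dot-product test (the pairing scheme is illustrative):

```python
def best_action(belief, tagged_vectors):
    """tagged_vectors: (action, vector) pairs gathered from every action's
    value function; returns the action whose segment scores highest at b."""
    best_a, best_v = None, float("-inf")
    for action, vec in tagged_vectors:
        v = sum(p * c for p, c in zip(belief, vec))
        if v > best_v:
            best_a, best_v = action, v
    return best_a, best_v
```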