Optimal Policies for POMDP

Optimal Policies for POMDP Presented by Alp Sardağ

As Much Reward As Possible? Greedy Agent

How long does the agent keep making decisions? • Finite horizon • Infinite horizon (with a discount factor) • Values will converge. • A good model if the number of decision steps is not given.

Policy • General plan • Deterministic : one action for each state • Stochastic : pdf over the set of actions • Stationary : can be applied at any time • Non-stationary : dependent on time • Memoryless : no history

Finite Horizon • Agent has to make k decisions, non-stationary

Infinite Horizon • We do not need a different policy for each time step. • Discount factor: $0 < \gamma < 1$ • Infiniteness helps us find a stationary policy: $\pi = \{\pi_0, \pi_1, \ldots, \pi_t\}$ becomes $\pi = \{\pi_i, \pi_i, \ldots, \pi_i\}$
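
Why a discount factor makes infinite-horizon values converge can be seen with a small sketch. The function name `discounted_return` and the constant reward stream are illustrative assumptions, not part of the slides; the symbol `gamma` stands for the discount factor.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence (gamma is the discount factor)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Assumed example: a constant reward of 1 per step with gamma = 0.9.
# The geometric series approaches 1 / (1 - gamma) = 10 as the horizon grows.
short_run = discounted_return([1.0] * 10, 0.9)
long_run = discounted_return([1.0] * 200, 0.9)
```

With these numbers the long run is already within rounding distance of the limit, which is the sense in which values "will converge".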

MDP • Finite horizon: solved with dynamic programming. • Infinite horizon: |S| equations in |S| unknowns; can be solved with linear programming (LP).
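
The dynamic-programming solution for a finite-horizon MDP can be sketched as below. This is a minimal illustration, not the presenter's code: the table layout (`R[a][s]`, `T[a][s][s2]`) and the two-state example MDP are assumptions made up for the sketch.

```python
def finite_horizon_values(R, T, horizon):
    """Backward induction for an MDP.

    R[a][s]     : immediate reward for action a in state s
    T[a][s][s2] : probability of moving from s to s2 under action a
    Returns the optimal value of each state with `horizon` decisions left.
    """
    num_actions = len(R)
    num_states = len(R[0])
    values = [0.0] * num_states  # no decisions left: value 0
    for _ in range(horizon):
        values = [
            max(
                R[a][s] + sum(T[a][s][s2] * values[s2] for s2 in range(num_states))
                for a in range(num_actions)
            )
            for s in range(num_states)
        ]
    return values


# Hypothetical MDP: action 0 always leads to state 0, action 1 to state 1.
R = [[1, 0], [0, 2]]
T = [[[1, 0], [1, 0]], [[0, 1], [0, 1]]]
v2 = finite_horizon_values(R, T, 2)
```

Each pass of the loop computes the values for one more remaining decision from the values of the previous pass, which is why the resulting policy is non-stationary.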

MDP • Actions may be stochastic. • Do you know which state you end up in? • Dealing with uncertainty in observations.

POMDP Model • Finite set of states • Finite set of actions • Transition probabilities (as in MDP) • Observation model • Reinforcement

POMDP Model • Immediate reward for performing action a in state i.

POMDP Model • Belief state: a probability distribution over states, $b = \{b_0, b_1, \ldots, b_{|S|-1}\}$ • Drawback: a world model is needed to compute the next belief state. From Bayes' rule:
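
A minimal sketch of the Bayes-rule belief update, assuming the usual form $b'(s') \propto O(s', a, o) \sum_s T(s, a, s')\, b(s)$. The table layout and the listening accuracy of 0.85 are assumptions for illustration; the slides only say that listening is not accurate.

```python
def belief_update(belief, action, observation, T, O):
    """Return the next belief after taking `action` and seeing `observation`.

    T[a][s][s2] : P(s2 | s, a)   (transition model)
    O[a][s2][o] : P(o | s2, a)   (observation model)
    """
    n = len(belief)
    unnormalized = [
        O[action][s2][observation]
        * sum(T[action][s][s2] * belief[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalized)  # probability of the observation
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [p / total for p in unnormalized]


# Assumed model with a single action (0 = listen): the state does not change,
# and the observation matches the true state with probability 0.85.
T = [[[1.0, 0.0], [0.0, 1.0]]]
O = [[[0.85, 0.15], [0.15, 0.85]]]
b1 = belief_update([0.5, 0.5], 0, 0, T, O)
```

Starting from the uniform belief, one observation pointing at state 0 shifts the belief toward state 0 by exactly the assumed accuracy.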

POMDP Model • Control dynamics for a POMDP

Policies for POMDP • The belief space is infinite, so storing value functions in tables is infeasible. • For horizon length 1. • No control over observations (an issue not found in MDPs), so weigh all observations.

Value functions for POMDPs • The formula is complex; however, if the VF is piecewise linear (a way of representing a VF over a continuous space), it can be written:

Value functions for POMDPs

Value Functions for POMDPs • Given $V_{t-1}$, $V_t$ can be calculated. • Keep the action that gives rise to each specific $\alpha$-vector. • To find the optimal policy at a belief state, just maximize over all $\alpha$-vectors and take the associated action.
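
The maximization over vectors can be sketched as follows. The three vectors and the action names `a0`, `a1`, `a2` are hypothetical; each vector is stored together with the action that produced it.

```python
def best_action(belief, alpha_vectors):
    """Return (value, action) for the vector with the largest dot product.

    alpha_vectors: list of (action, vector) pairs, one value per state.
    """
    return max(
        (sum(b * a for b, a in zip(belief, vector)), action)
        for action, vector in alpha_vectors
    )


# Hypothetical two-state value function made of three vectors.
alpha_vectors = [
    ("a0", [1.0, 0.0]),
    ("a1", [0.0, 1.0]),
    ("a2", [0.6, 0.6]),
]
value_mid, action_mid = best_action([0.5, 0.5], alpha_vectors)
value_edge, action_edge = best_action([0.9, 0.1], alpha_vectors)
```

Near the middle of the belief space the flat vector wins; near a corner the vector specialized for that state wins, so the chosen action changes with the belief.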

Geometric Interpretation of VF • Belief simplex: • 2 dimensional case:

Geometric Interpretation of VF • 3 dimensional case :

Alternate VF Interpretation • A decision tree could enumerate each possible policy for a k-horizon problem, if the initial belief state is given.

Alternate VF Interpretation • The number of nodes for each action: • The number of possible trees (|A| possible actions for each node) • If only the useful trees are generated, the complexity will be greatly reduced. • Previously, creating the entire VF meant generating an $\alpha$-vector for every belief point $b$: too many for the algorithm to work.
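
A rough sketch of how fast the number of policy trees grows, assuming the standard count: a k-horizon tree branches on every observation, so it has $1 + |\Omega| + \cdots + |\Omega|^{k-1}$ nodes, and each node can hold any of the $|A|$ actions. The function names and the symbol $\Omega$ for the observation set are assumptions of this sketch.

```python
def policy_tree_nodes(num_observations, horizon):
    """Nodes in one policy tree: one level per decision, branching on observations."""
    return sum(num_observations ** depth for depth in range(horizon))


def num_policy_trees(num_actions, num_observations, horizon):
    """Every node independently picks one of the actions."""
    return num_actions ** policy_tree_nodes(num_observations, horizon)


# Tiger-sized problem: 3 actions, 2 observations.
nodes_h4 = policy_tree_nodes(2, 4)    # 1 + 2 + 4 + 8
trees_h4 = num_policy_trees(3, 2, 4)  # 3 ** 15
```

Even for this tiny problem a horizon of 4 already gives millions of trees, which is why generating only the useful ones matters.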

POMDP Solutions • For finite horizon: • Iterate over time steps: given $V_{t-1}$, compute $V_t$. • Retain all intermediate solutions. • For finitely transient policies, the same idea applies to the infinite horizon. • Iterate until the optimal value functions are the same for two consecutive time steps. • Once the infinite-horizon solution is found, discard all intermediate results.

POMDP Solutions • Given $V_{t-1}$, a vector of $V_t$ can be calculated for one belief point $b$ from the previous formula, but with no knowledge of the region over which it is optimal. (Sondik) • Too many belief points to construct the VF; one possible solution: • Choose random points. • If the number of points is large, one is unlikely to miss any of the true vectors. • How many points to choose? No guarantee. • Find optimal policies by developing a systematic algorithm to explore the entire continuous space of beliefs.

Tiger Problem • Actions: open left door, open right door, listen. • Listening is not accurate. • s0: tiger on the left, s1: tiger on the right. • Rewards: +10 for opening the correct door, -100 for the wrong door, -1 for listening. • Initially: $b = (0.5, 0.5)$

Tiger Problem

Tiger Problem • First action, intuitively: • Opening a door: (-100 + 10)/2 = -45; listening: -1. • For horizon length 1:
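
The horizon-1 reasoning can be checked with a short sketch that uses only the rewards stated for the tiger problem. The action labels and the function name `horizon_one` are illustrative; `p_left` is the belief that the tiger is behind the left door (state s0).

```python
# Reward per action as (reward if tiger is left, reward if tiger is right).
REWARDS = {
    "listen": (-1.0, -1.0),
    "open-left": (-100.0, 10.0),
    "open-right": (10.0, -100.0),
}


def horizon_one(p_left):
    """Best single action and its expected reward for belief (p_left, 1 - p_left)."""
    belief = (p_left, 1.0 - p_left)
    values = {
        action: sum(b * r for b, r in zip(belief, rewards))
        for action, rewards in REWARDS.items()
    }
    best = max(values, key=values.get)
    return best, values[best]


at_start = horizon_one(0.5)     # uniform initial belief
fairly_sure = horizon_one(0.05) # tiger is very likely on the right
```

Under these rewards, opening a door only beats listening once the belief in one side exceeds 0.9, since 10p - 100(1 - p) > -1 requires p > 0.9.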

Tiger Problem • For Horizon length 2:

Tiger Problem • For horizon length 4, nice features: • For the same action & observation, a belief state is transformed into a single belief state. • The observations made precisely define the nodes in the graph that will be traversed.

Infinite Horizon • The finite horizon is cumbersome: a different policy for the same belief point at each time step. • A different set of vectors for each time step. • Add a discount factor to the tiger problem: after the 56th step, the underlying vectors are only slightly different:

Infinite Horizon for Tiger Problem • In this way, the finite horizon algorithms can be used for infinite horizon problems. • Advantage of the infinite horizon: keep only the last policy.

Policy Graphs • A way to encode the policy without keeping vectors or computing dot products. • [Figure: policy graph; labels: beginning state, end state]
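
How a policy graph is executed without vectors or dot products can be sketched as below. The graph itself is hypothetical (it is not the graph from the slides): each node stores an action, and the observation received selects the next node.

```python
# Hypothetical policy graph: node -> (action, {observation: next node}).
POLICY_GRAPH = {
    0: ("listen", {"hear-left": 1, "hear-right": 2}),
    1: ("listen", {"hear-left": 3, "hear-right": 0}),
    2: ("listen", {"hear-left": 0, "hear-right": 4}),
    3: ("open-right", {"hear-left": 0, "hear-right": 0}),
    4: ("open-left", {"hear-left": 0, "hear-right": 0}),
}


def run_policy_graph(graph, start_node, observations):
    """Follow the graph: act at the current node, then move on the observation."""
    node = start_node
    actions = []
    for obs in observations:
        action, transitions = graph[node]
        actions.append(action)
        node = transitions[obs]
    return actions, node


actions, final_node = run_policy_graph(
    POLICY_GRAPH, 0, ["hear-left", "hear-left", "hear-right"]
)
```

The agent only needs to remember the current node; in this made-up graph, two consecutive matching observations lead to opening the opposite door and a return to the start node.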

Finite Transience • All the belief states within a particular partition element will be transformed to another element for a particular action and observation. • For non-finitely transient policies, policy graphs that are exactly optimal cannot be constructed.

Overview of Algorithms • All are performed iteratively. • All try to find the set of vectors that defines both the value function and the optimal policy at each time step. • Two separate classes: • Given $V_{t-1}$, generate a superset of $V_t$ and reduce that set until the optimal $V_t$ is found (Monahan and Eagle). • Given $V_{t-1}$, construct a subset of the optimal $V_t$; these subsets grow larger until the optimal $V_t$ is found.

Monahan Algorithm • Easy to implement. • Do not expect to solve anything but the smallest of problems. • Provides background for understanding the other algorithms.

Monahan Enumeration Phase • Generate all vectors: number of generated vectors = $|A| \cdot M^{|\Omega|}$, where $M$ is the number of vectors at the previous step and $|\Omega|$ is the number of observations.
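
A minimal sketch of the enumeration phase: for every action, pick one previous vector per observation and build the resulting candidate vector. The table layout, the 0.85 listening accuracy, and the assumption that opening a door resets the tiger uniformly are not given in the slides; they are the commonly used tiger parameters, adopted here only for illustration.

```python
from itertools import product


def enumerate_vectors(prev_vectors, R, T, O, gamma=1.0):
    """Generate all |A| * M**|Omega| candidate vectors as (action, vector) pairs.

    R[a][s], T[a][s][s2] = P(s2 | s, a), O[a][s2][o] = P(o | s2, a).
    """
    num_states = len(R[0])
    num_obs = len(O[0][0])
    generated = []
    for a in range(len(R)):
        # choice[o] is the previous vector followed after observation o.
        for choice in product(prev_vectors, repeat=num_obs):
            vector = [
                R[a][s] + gamma * sum(
                    T[a][s][s2] * O[a][s2][o] * choice[o][s2]
                    for s2 in range(num_states)
                    for o in range(num_obs)
                )
                for s in range(num_states)
            ]
            generated.append((a, vector))
    return generated


# Assumed tiger model. Actions: 0 = listen, 1 = open-left, 2 = open-right.
R = [[-1.0, -1.0], [-100.0, 10.0], [10.0, -100.0]]
same = [[1.0, 0.0], [0.0, 1.0]]    # listening leaves the tiger in place
reset = [[0.5, 0.5], [0.5, 0.5]]   # opening a door restarts the problem
T = [same, reset, reset]
O = [[[0.85, 0.15], [0.15, 0.85]], reset, reset]

horizon_1 = [R[0], R[1], R[2]]
horizon_2_candidates = enumerate_vectors(horizon_1, R, T, O)
```

With 3 actions, 2 observations and 3 previous vectors this yields 3 * 3**2 = 27 candidates, most of which the reduction phase then has to discard.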

Monahan Reduction Phase • All vectors can be kept: • Each time, maximize over all vectors. • Lots of excess baggage. • The number of vectors at the next step will be even larger. • LP is used to trim away useless vectors.

Monahan Reduction Phase • For a vector to be useful, there must be at least one belief point at which it gives a larger value than all the others:
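
The usefulness test can be illustrated without an LP solver in the two-state case, where a belief is a single number p. The sketch below is a swapped-in exact check for two states only, not Monahan's LP: it collects the points where two vectors cross and tests each vector at the midpoint of every resulting interval. It assumes the vectors are distinct; the example vectors are the tiger horizon-1 vectors plus one made-up dominated vector.

```python
def useful_vectors_two_states(vectors, eps=1e-9):
    """Indices of vectors that are strictly best at some belief (p, 1 - p)."""
    def value(vec, p):
        return p * vec[0] + (1.0 - p) * vec[1]

    # Breakpoints: both ends of the belief interval plus every interior
    # point where two vectors take the same value.
    points = {0.0, 1.0}
    for i, u in enumerate(vectors):
        for v in vectors[i + 1:]:
            slope_diff = (u[0] - u[1]) - (v[0] - v[1])
            if abs(slope_diff) > eps:
                p = (v[1] - u[1]) / slope_diff
                if 0.0 < p < 1.0:
                    points.add(p)
    points = sorted(points)

    useful = []
    for idx, vec in enumerate(vectors):
        # Between two neighbouring breakpoints the ranking cannot change,
        # so testing the midpoint of each interval is enough.
        for lo, hi in zip(points, points[1:]):
            mid = (lo + hi) / 2.0
            if all(
                value(vec, mid) > value(other, mid) + eps
                for j, other in enumerate(vectors) if j != idx
            ):
                useful.append(idx)
                break
    return useful


candidates = [[-1.0, -1.0], [-100.0, 10.0], [10.0, -100.0], [-50.0, -50.0]]
kept = useful_vectors_two_states(candidates)
```

For more than two states the same question is posed as a linear program over the belief simplex, which is what the reduction phase uses.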

Monahan Algorithm

Monahan’s LP Complication

Future Work • Eagle’s Variant of Monahan’s Algorithm. • Sondik’s One-Pass Algorithm. • Cheng’s Relaxed Region Algorithm. • Cheng’s Linear Support Algorithm.
