
Partially Observable MDP

Partially Observable MDP. An MDP assumes perfect observation: we know the state of the world at each stage; in essence, we have perfect sensors. In practice, sensors are imperfect, so we can only have partial information about the state.



Presentation Transcript


  1. Partially Observable MDP Automated Planning and Decision Making 2007

  2. MDP = Perfect Observation • Basic assumption: we know the state of the world at each stage • In essence: we have perfect sensors • Typically: we have imperfect sensors → we can only have partial information about the state • When we have imperfect information, we sometimes take actions simply to gain information

  3. POMDP • ⟨S, A, Tr, R, Ω, O⟩ • S – state space. • A – set of actions. • Tr: S×A→Π(S), a probability distribution over S. Tr(s, a, s′) = p is the probability of reaching s′ from s using a (s – the state before, a – the action, s′ – the state after). • R: S×A→ℝ, the reward for doing a∈A in state s∈S. • Ω – set of possible observations. • O: S×A→Π(Ω). O(s, a, o) = p is the probability of observing o∈Ω after performing a in s (alternatively: the probability of observing o∈Ω after doing a and reaching s).
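The tuple can be written down directly as arrays. A minimal sketch in Python/NumPy, assuming a hypothetical problem with two states, two actions, and two observations; all names and numbers below are invented for illustration and are not taken from the slides.

```python
import numpy as np

# Hypothetical 2-state / 2-action / 2-observation POMDP (numbers are illustrative).
S = [0, 1]          # state space
A = [0, 1]          # action set
Omega = [0, 1]      # observation set

# Tr[a, s, s2] = probability of reaching s2 from s using action a
Tr = np.array([[[0.9, 0.1],
                [0.2, 0.8]],
               [[0.5, 0.5],
                [0.5, 0.5]]])

# R[s, a] = reward for doing a in state s
R = np.array([[ 1.0, 0.0],
              [-1.0, 0.0]])

# O[a, s2, o] = probability of observing o after doing a and reaching s2
O = np.array([[[0.85, 0.15],
               [0.15, 0.85]],
              [[0.5, 0.5],
               [0.5, 0.5]]])

# Sanity check: every transition / observation row is a probability distribution.
assert np.allclose(Tr.sum(axis=2), 1.0)
assert np.allclose(O.sum(axis=2), 1.0)
```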

  4. POMDP: Value of Information • A robot with a wall sensor starts at one of the marked initial cells with equal probability. • After a move, it senses the walls around it. • By moving up, the observed wall configuration will be the same for both options. • By moving down, we get a different configuration for each option, so the observation tells the robot where it is.

  5. Solving a POMDP • As in an MDP, because of uncertainty, we need a policy, not a plan • But what does the policy depend on, given that we don't know the state? • Option 1: History • How much history do we need to remember? • How big is our policy? • Problem: highly non-uniform, hard to work with

  6. POMDP [Figure: a history tree rooted at a belief b over s0, branching on actions a1, a2 and then on observations o1…o6, reaching states s1…s4] A much harder tree – each node is different from the others, as it is reached through a different sequence of actions and observations. The history of observations defines a state.

  7. Option 2: Belief State • What matters about the future is the current state • We don't know what the current state is • Instead, we can maintain a probability distribution over the current state • Called the belief state

  8. POMDP: Belief State [Figure: a tree of belief states b0…b8, branching on actions a1, a2 and the resulting observations; each next belief is evaluated from the action taken and the observation received] • How do we compute the next belief state?

  9. POMDP: Updating the belief state • Let b be the current belief state. • We calculate b′, the belief state that results from b by applying a and observing o. • b(s) = the probability of s according to b. • b′(s′) = Pr(s′ | b, a, o) = O(s′, a, o) · Σ_{s∈S} Tr(s, a, s′) · b(s) / Pr(o | b, a). • Pr(o | b, a) is a normalizing factor: ignore it in the calculations, and normalize to 1 later. • Total probability: Pr(x) = Σ_{y∈Y} Pr(x|y)·Pr(y). Bayes' rule: Pr(x|y) = Pr(y|x)·Pr(x) / Pr(y).
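The update above maps almost directly to code. A sketch, reusing the hypothetical array layout from the earlier snippet (Tr[a, s, s2] and O[a, s2, o]); the belief is simply a vector over S.

```python
import numpy as np

def update_belief(b, a, o, Tr, O):
    """Compute b2(s2) proportional to O(s2, a, o) * sum_s Tr(s, a, s2) * b(s)."""
    predicted = Tr[a].T @ b            # sum_s Tr(s, a, s2) * b(s), for every s2
    unnormalized = O[a, :, o] * predicted
    norm = unnormalized.sum()          # Pr(o | b, a), the normalizing factor
    if norm == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return unnormalized / norm

# Example with the hypothetical arrays defined earlier:
# b0 = np.array([0.5, 0.5])
# b1 = update_belief(b0, a=0, o=1, Tr=Tr, O=O)
```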

  10. POMDP  MDP • We can reduce the pomdp to an MDP over belief states • State = belief states • Actions = same actions • R(b,a) = Σsb(s)*R(s,a) • (b,a,b`)=Pr(b`|a,b)=oΩPr(b`|a,o,b)•Pr(o|a,b) Automated Planning and Decision Making

  11. POMDP: Belief State MDP's Value Function [Figure: value as a function of p, where b(s1) = p and b(s2) = 1−p; each action a gives a line p·R(s1,a) + (1−p)·R(s2,a), a weighted average of R(s1,a) and R(s2,a); the lines for a and a′ cross at the policy switch point, p ≈ 0.3 in the figure] • At every belief state, choose the action that maximizes the value. • The best value, v*(b), is the upper envelope of these linear functions.
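In this two-state picture the switch point can be computed directly. A small sketch with invented rewards (the 0.3 in the figure is whatever the slide's own numbers give; the values below are hypothetical):

```python
import numpy as np

# Hypothetical immediate rewards R[s, a] for two states (s1, s2) and two actions (a, a2).
R = np.array([[4.0, 1.0],    # R(s1, a), R(s1, a2)
              [0.0, 3.0]])   # R(s2, a), R(s2, a2)

def value_line(a, p):
    """Expected immediate reward p*R(s1, a) + (1 - p)*R(s2, a) as a function of p = b(s1)."""
    return p * R[0, a] + (1 - p) * R[1, a]

# Policy switch point: solve p*R(s1,a) + (1-p)*R(s2,a) = p*R(s1,a2) + (1-p)*R(s2,a2) for p.
p_switch = (R[1, 1] - R[1, 0]) / ((R[0, 0] - R[0, 1]) + (R[1, 1] - R[1, 0]))
print(p_switch)                                     # 0.5 with these invented numbers
print(max(value_line(0, 0.7), value_line(1, 0.7)))  # the upper envelope at p = 0.7
```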

  12. POMDP: Belief State MDP [Figure: a belief simplex over three states with Pr(s1) = p, Pr(s2) = q, Pr(s3) = 1−p−q, and the reward lines R(s1, a), R(s2, a)] • For more than one action, v*(b) is the best expected value. • v_n^ρ = R(s1, ρ(s1)) + Σ_{o∈Ω} Pr(o | s1, a) · v_{n−1}^{ρ/o}(b_a^o)

  13. POMDP: Belief State MDP [Figure: a policy tree ρ with root action a(ρ) and, for each observation o1…ok, a sub-policy ρ_{o1}…ρ_{ok}] • α_a = [R(s1, a), R(s2, a)] in our example. • More generally, α_ρ is a vector of size |S| whose entry for each state is the value obtained from that state. • v_a(b) = b·α_a • v_ρ(b) = b·α_ρ • ρ1…ρn are policies of length m. • P = {α1, …, αn} • ρ*_P(b) = argmax_{αi∈P} αi·b, i∈{1..n} • v*_P(b) = max_{α∈P} α·b • v_ρ(b) = Σ_{s∈S} b(s)·r(s, a) + δ·Σ_{o∈Ω} Pr(o | b, a)·v_{ρ/o}(b_a^o), where the first term is the immediate reward for applying the first action a = a(ρ), ρ/o is the policy at the subtree of ρ matching o, and b_a^o is the belief state that results from applying a at b and observing o.
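The α-vector machinery on this slide reduces to dot products. A sketch, assuming a hypothetical set of α-vectors over two states, each paired with the action at the root of its policy tree:

```python
import numpy as np

# Hypothetical alpha vectors (one per candidate policy tree) over two states,
# together with the action taken at the root of each tree.
alphas = np.array([[4.0, 0.0],
                   [1.0, 3.0]])
root_actions = [0, 1]

def value(b, alphas):
    """v*_P(b) = max_{alpha in P} alpha . b  -- the upper envelope of the linear functions."""
    return float(np.max(alphas @ b))

def best_root_action(b, alphas, root_actions):
    """rho*_P(b): follow the root action of the maximizing alpha vector."""
    return root_actions[int(np.argmax(alphas @ b))]

b = np.array([0.4, 0.6])
print(value(b, alphas), best_root_action(b, alphas, root_actions))
```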

  14. POMDP: Value Iteration for belief states • Init: for every a∈A, v_1^a(b) = Σ_{s∈S} b(s)·R(s, a). • Build a value function for k+1 steps, given the functions for k steps: 1. Let ρ1…ρn be the depth-k policies that are not dominated by the rest of the depth-k policies (there exists a belief state b for which they are optimal). 2. Build all the depth-(k+1) policy trees of the form: a root action a∈A and, for each observation o1…ok, a depth-k subtree ρ_{i1}…ρ_{ik}, where i∈{1..n}. 3. Calculate v_ρ(b) = Σ_{s∈S} b(s)·r(s, a) + δ·Σ_{o∈Ω} Pr(o | b, a)·v_{ρ/o}(b_a^o) for each of the trees.
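A minimal sketch of this backup under the earlier hypothetical array layout (Tr[a, s, s2], R[s, a], O[a, s2, o]), with the discount δ written as gamma; it enumerates every tree of the stated form and leaves out the pruning of dominated vectors:

```python
import itertools
import numpy as np

def exact_backup(Gamma_k, Tr, R, O, gamma):
    """Build all |A| * |Gamma_k|^|Omega| candidate alpha vectors for depth-(k+1)
    policy trees, given the alpha vectors Gamma_k of the depth-k policies."""
    n_actions, n_states, _ = Tr.shape
    n_obs = O.shape[2]
    new_vectors = []
    for a in range(n_actions):
        # g[o][i][s] = gamma * sum_{s2} Tr(s, a, s2) * O(s2, a, o) * Gamma_k[i][s2]
        g = [[gamma * (Tr[a] @ (O[a, :, o] * alpha)) for alpha in Gamma_k]
             for o in range(n_obs)]
        # one depth-k subtree per observation (the cross-sum over observations)
        for choice in itertools.product(range(len(Gamma_k)), repeat=n_obs):
            alpha_new = R[:, a] + sum(g[o][choice[o]] for o in range(n_obs))
            new_vectors.append(alpha_new)
    return new_vectors

# Initialization, as on the slide: Gamma_1 = [R[:, a] for a in range(n_actions)].
```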

  15. POMDP: Point-Based Value Iteration • Idea: maintain a fixed-size set α1, …, αn of vectors. Each vector αi matches a belief state bi and an action a(αi). • v(b) = max_{i∈[1..n]} b·αi • Advantage: the number of vectors is bounded by n. • Disadvantage: only an approximation of the optimal value. [Figure: two vectors α1, α2 over the belief space]

  16. POMDP: Point-Based Value Iteration • The method: • Same as in value iteration, but initialize the α's to match the optimal actions for b1, …, bn. • In the iterative part, build all the trees for step k+1 given the functions α1, …, αn of the kth step; use the value function from step k only. • Keep only the new policies, and the matching αi, that are optimal for the points bi. • Number of possible trees: |A|·n^|Ω|.
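A sketch of that point-based backup under the earlier hypothetical array layout; B is the fixed set of belief points b1, …, bn, Gamma holds the current α-vectors, and one new vector is kept per belief point:

```python
import numpy as np

def pbvi_backup(B, Gamma, Tr, R, O, gamma):
    """Point-based backup: for each belief point b in B, keep the single best
    depth-(k+1) alpha vector, choosing subtrees only by their value at b."""
    n_actions = Tr.shape[0]
    n_obs = O.shape[2]
    new_Gamma = []
    for b in B:
        best_vec, best_val = None, -np.inf
        for a in range(n_actions):
            alpha_ab = R[:, a].copy()
            for o in range(n_obs):
                # projections of every current vector through (a, o)
                g = [gamma * (Tr[a] @ (O[a, :, o] * alpha)) for alpha in Gamma]
                # pick the projection that is best at this particular belief point
                alpha_ab = alpha_ab + max(g, key=lambda v: float(v @ b))
            val = float(alpha_ab @ b)
            if val > best_val:
                best_vec, best_val = alpha_ab, val
        new_Gamma.append(best_vec)
    return new_Gamma
```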

  17. POMDP: Representing a policy as an automaton • Automaton + initial state → policy • The idea: • Based on the solution of m vector equations. • Match every state of the automaton with a value function represented by the corresponding α-vector. • For every state of the automaton, evaluate the best value function under the assumption that, after executing the first action, we continue to one of the value functions evaluated above.
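A sketch of the evaluation step, assuming the earlier array layout and a hypothetical controller described by node_action[q] (the action executed in automaton state q) and node_succ[q][o] (the automaton state entered after observing o); the m vector equations are solved as a single linear system:

```python
import numpy as np

def evaluate_controller(node_action, node_succ, Tr, R, O, gamma):
    """Solve alpha_q(s) = R(s, a_q)
                 + gamma * sum_{s2, o} Tr(s, a_q, s2) * O(s2, a_q, o) * alpha_{succ(q,o)}(s2)
    for the alpha vector of every automaton state q."""
    m = len(node_action)        # number of automaton states
    n = Tr.shape[1]             # number of world states
    n_obs = O.shape[2]
    A_mat = np.eye(m * n)
    c = np.zeros(m * n)
    for q, a in enumerate(node_action):
        for s in range(n):
            row = q * n + s
            c[row] = R[s, a]
            for s2 in range(n):
                for o in range(n_obs):
                    col = node_succ[q][o] * n + s2
                    A_mat[row, col] -= gamma * Tr[a, s, s2] * O[a, s2, o]
    return np.linalg.solve(A_mat, c).reshape(m, n)   # row q = alpha vector of state q

# For a start belief b, begin in the automaton state whose alpha vector maximizes alpha_q . b.
```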
