
Partially Observable MDP

Partially Observable MDP. An MDP assumes perfect observation: we know the state of the world at each stage; in essence, we have perfect sensors. In practice, sensors are imperfect, so we can only have partial information about the state.



Presentation Transcript


  1. Partially Observable MDP Automated Planning and Decision Making 2007

  2. MDP = Perfect Observation • Basic assumption: we know the state of the world at each stage • In essence: we have perfect sensors • Typically: we have imperfect sensors → we can only have partial information about the state • When we have imperfect information, we sometimes take actions simply to gain information

  3. POMDP • ⟨S, A, Tr, R, Ω, O⟩ • S – state space. • A – set of actions. • Tr: S×A→Π(S), a probability distribution over S. Tr(s, a, s′) = p is the probability of reaching s′ from s using a (s – the state before, a – the action, s′ – the state after). • R: S×A→ℝ, the reward for doing a∈A in state s∈S. • Ω – set of possible observations. • O: S×A→Π(Ω). O(s, a, o) = p is the probability of observing o∈Ω after performing a in s (alternatively: the probability of observing o∈Ω after doing a and reaching s).
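The tuple can be written down directly as arrays. A minimal sketch in Python/NumPy, assuming a hypothetical problem with two states, two actions, and two observations; all names and numbers below are invented for illustration and are not taken from the slides.

```python
import numpy as np

# Hypothetical 2-state / 2-action / 2-observation POMDP (numbers are illustrative).
S = [0, 1]          # state space
A = [0, 1]          # action set
Omega = [0, 1]      # observation set

# Tr[a, s, s2] = probability of reaching s2 from s using action a
Tr = np.array([[[0.9, 0.1],
                [0.2, 0.8]],
               [[0.5, 0.5],
                [0.5, 0.5]]])

# R[s, a] = reward for doing a in state s
R = np.array([[ 1.0, 0.0],
              [-1.0, 0.0]])

# O[a, s2, o] = probability of observing o after doing a and reaching s2
O = np.array([[[0.85, 0.15],
               [0.15, 0.85]],
              [[0.5, 0.5],
               [0.5, 0.5]]])

# Sanity check: every transition / observation row is a probability distribution.
assert np.allclose(Tr.sum(axis=2), 1.0)
assert np.allclose(O.sum(axis=2), 1.0)
```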

  4. POMDP: Value of Information • A robot with a wall sensor starts at one of the marked initial cells with equal probability. • After a move, it senses the walls around it. • By moving up, the observed wall configuration will be the same for both options. • By moving down, we get a different configuration for each option, so the observation tells the robot where it is.

  5. Solving a POMDP • As in an MDP, because of uncertainty, we need a policy, not a plan • But what does the policy depend on, given that we don't know the state? • Option 1: History • How much history do we need to remember? • How big is our policy? • Problem: highly non-uniform, hard to work with

  6. POMDP [Figure: a history tree rooted at a belief b over s0, branching on actions a1, a2 and then on observations o1…o6, reaching states s1…s4] A much harder tree – each node is different from the others, as it is reached through a different sequence of actions and observations. The history of observations defines a state.

  7. Option 2: Belief State • What matters about the future is the current state • We don't know what the current state is • Instead, we can maintain a probability distribution over the current state • Called the belief state

  8. POMDP: Belief State [Figure: a tree of belief states b0…b8, branching on actions a1, a2 and the resulting observations; each next belief is evaluated from the action taken and the observation received] • How do we compute the next belief state?

  9. POMDP: Updating the belief state • Let b be the current belief state. • We calculate b′, the belief state that results from b by applying a and observing o. • b(s) = the probability of s according to b. • b′(s′) = Pr(s′ | b, a, o) = O(s′, a, o) · Σ_{s∈S} Tr(s, a, s′) · b(s) / Pr(o | b, a). • Pr(o | b, a) is a normalizing factor: ignore it in the calculations, and normalize to 1 later. • Total probability: Pr(x) = Σ_{y∈Y} Pr(x|y)·Pr(y). Bayes' rule: Pr(x|y) = Pr(y|x)·Pr(x) / Pr(y).
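The update above maps almost directly to code. A sketch, reusing the hypothetical array layout from the earlier snippet (Tr[a, s, s2] and O[a, s2, o]); the belief is simply a vector over S.

```python
import numpy as np

def update_belief(b, a, o, Tr, O):
    """Compute b2(s2) proportional to O(s2, a, o) * sum_s Tr(s, a, s2) * b(s)."""
    predicted = Tr[a].T @ b            # sum_s Tr(s, a, s2) * b(s), for every s2
    unnormalized = O[a, :, o] * predicted
    norm = unnormalized.sum()          # Pr(o | b, a), the normalizing factor
    if norm == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return unnormalized / norm

# Example with the hypothetical arrays defined earlier:
# b0 = np.array([0.5, 0.5])
# b1 = update_belief(b0, a=0, o=1, Tr=Tr, O=O)
```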

  10. POMDP  MDP • We can reduce the pomdp to an MDP over belief states • State = belief states • Actions = same actions • R(b,a) = Σsb(s)*R(s,a) • (b,a,b`)=Pr(b`|a,b)=oΩPr(b`|a,o,b)•Pr(o|a,b) Automated Planning and Decision Making

  11. POMDP: Belief State MDP's Value Function [Figure: value as a function of p, where b(s1) = p and b(s2) = 1−p; each action a gives a line p·R(s1,a) + (1−p)·R(s2,a), a weighted average of R(s1,a) and R(s2,a); the lines for a and a′ cross at the policy switch point, p ≈ 0.3 in the figure] • At every belief state, choose the action that maximizes the value. • The best value, v*(b), is the upper envelope of these linear functions.
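In this two-state picture the switch point can be computed directly. A small sketch with invented rewards (the 0.3 in the figure is whatever the slide's own numbers give; the values below are hypothetical):

```python
import numpy as np

# Hypothetical immediate rewards R[s, a] for two states (s1, s2) and two actions (a, a2).
R = np.array([[4.0, 1.0],    # R(s1, a), R(s1, a2)
              [0.0, 3.0]])   # R(s2, a), R(s2, a2)

def value_line(a, p):
    """Expected immediate reward p*R(s1, a) + (1 - p)*R(s2, a) as a function of p = b(s1)."""
    return p * R[0, a] + (1 - p) * R[1, a]

# Policy switch point: solve p*R(s1,a) + (1-p)*R(s2,a) = p*R(s1,a2) + (1-p)*R(s2,a2) for p.
p_switch = (R[1, 1] - R[1, 0]) / ((R[0, 0] - R[0, 1]) + (R[1, 1] - R[1, 0]))
print(p_switch)                                     # 0.5 with these invented numbers
print(max(value_line(0, 0.7), value_line(1, 0.7)))  # the upper envelope at p = 0.7
```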

  12. POMDP: Belief State MDP [Figure: a belief simplex over three states with Pr(s1) = p, Pr(s2) = q, Pr(s3) = 1−p−q, and the reward lines R(s1, a), R(s2, a)] • For more than one action, v*(b) is the best expected value. • v_n^ρ = R(s1, ρ(s1)) + Σ_{o∈Ω} Pr(o | s1, a) · v_{n−1}^{ρ/o}(b_a^o)

  13. POMDP: Belief State MDP [Figure: a policy tree ρ with root action a(ρ) and, for each observation o1…ok, a sub-policy ρ_{o1}…ρ_{ok}] • α_a = [R(s1, a), R(s2, a)] in our example. • More generally, α_ρ is a vector of size |S| whose entry for each state is the value obtained from that state. • v_a(b) = b·α_a • v_ρ(b) = b·α_ρ • ρ1…ρn are policies of length m. • P = {α1, …, αn} • ρ*_P(b) = argmax_{αi∈P} αi·b, i∈{1..n} • v*_P(b) = max_{α∈P} α·b • v_ρ(b) = Σ_{s∈S} b(s)·r(s, a) + δ·Σ_{o∈Ω} Pr(o | b, a)·v_{ρ/o}(b_a^o), where the first term is the immediate reward for applying the first action a = a(ρ), ρ/o is the policy at the subtree of ρ matching o, and b_a^o is the belief state that results from applying a at b and observing o.
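The α-vector machinery on this slide reduces to dot products. A sketch, assuming a hypothetical set of α-vectors over two states, each paired with the action at the root of its policy tree:

```python
import numpy as np

# Hypothetical alpha vectors (one per candidate policy tree) over two states,
# together with the action taken at the root of each tree.
alphas = np.array([[4.0, 0.0],
                   [1.0, 3.0]])
root_actions = [0, 1]

def value(b, alphas):
    """v*_P(b) = max_{alpha in P} alpha . b  -- the upper envelope of the linear functions."""
    return float(np.max(alphas @ b))

def best_root_action(b, alphas, root_actions):
    """rho*_P(b): follow the root action of the maximizing alpha vector."""
    return root_actions[int(np.argmax(alphas @ b))]

b = np.array([0.4, 0.6])
print(value(b, alphas), best_root_action(b, alphas, root_actions))
```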

  14. POMDP: Value Iteration for belief states • Init: for every a∈A, v_1^a(b) = Σ_{s∈S} b(s)·R(s, a). • Build a value function for k+1 steps, given the functions for k steps: 1. Let ρ1…ρn be the depth-k policies that are not dominated by the rest of the depth-k policies (there exists a belief state b for which they are optimal). 2. Build all the depth-(k+1) policy trees of the form: a root action a∈A and, for each observation o1…ok, a depth-k subtree ρ_{i1}…ρ_{ik}, where i∈{1..n}. 3. Calculate v_ρ(b) = Σ_{s∈S} b(s)·r(s, a) + δ·Σ_{o∈Ω} Pr(o | b, a)·v_{ρ/o}(b_a^o) for each of the trees.
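A minimal sketch of this backup under the earlier hypothetical array layout (Tr[a, s, s2], R[s, a], O[a, s2, o]), with the discount δ written as gamma; it enumerates every tree of the stated form and leaves out the pruning of dominated vectors:

```python
import itertools
import numpy as np

def exact_backup(Gamma_k, Tr, R, O, gamma):
    """Build all |A| * |Gamma_k|^|Omega| candidate alpha vectors for depth-(k+1)
    policy trees, given the alpha vectors Gamma_k of the depth-k policies."""
    n_actions, n_states, _ = Tr.shape
    n_obs = O.shape[2]
    new_vectors = []
    for a in range(n_actions):
        # g[o][i][s] = gamma * sum_{s2} Tr(s, a, s2) * O(s2, a, o) * Gamma_k[i][s2]
        g = [[gamma * (Tr[a] @ (O[a, :, o] * alpha)) for alpha in Gamma_k]
             for o in range(n_obs)]
        # one depth-k subtree per observation (the cross-sum over observations)
        for choice in itertools.product(range(len(Gamma_k)), repeat=n_obs):
            alpha_new = R[:, a] + sum(g[o][choice[o]] for o in range(n_obs))
            new_vectors.append(alpha_new)
    return new_vectors

# Initialization, as on the slide: Gamma_1 = [R[:, a] for a in range(n_actions)].
```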

  15. POMDP: Point-Based Value Iteration • Idea: maintain a fixed-size set α1, …, αn of vectors. Each vector αi matches a belief state bi and an action a(αi). • v(b) = max_{i∈[1..n]} b·αi • Advantage: the number of vectors is bounded by n. • Disadvantage: only an approximation of the optimal value. [Figure: two vectors α1, α2 over the belief space]

  16. POMDP: Point-Based Value Iteration • The method: • Same as in value iteration, but initialize the α's to match the optimal actions for b1, …, bn. • In the iterative part, build all the trees for step k+1 given the functions α1, …, αn of the kth step; use the value function from step k only. • Keep only the new policies, and the matching αi, that are optimal for the points bi. • Number of possible trees: |A|·n^|Ω|.
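A sketch of that point-based backup under the earlier hypothetical array layout; B is the fixed set of belief points b1, …, bn, Gamma holds the current α-vectors, and one new vector is kept per belief point:

```python
import numpy as np

def pbvi_backup(B, Gamma, Tr, R, O, gamma):
    """Point-based backup: for each belief point b in B, keep the single best
    depth-(k+1) alpha vector, choosing subtrees only by their value at b."""
    n_actions = Tr.shape[0]
    n_obs = O.shape[2]
    new_Gamma = []
    for b in B:
        best_vec, best_val = None, -np.inf
        for a in range(n_actions):
            alpha_ab = R[:, a].copy()
            for o in range(n_obs):
                # projections of every current vector through (a, o)
                g = [gamma * (Tr[a] @ (O[a, :, o] * alpha)) for alpha in Gamma]
                # pick the projection that is best at this particular belief point
                alpha_ab = alpha_ab + max(g, key=lambda v: float(v @ b))
            val = float(alpha_ab @ b)
            if val > best_val:
                best_vec, best_val = alpha_ab, val
        new_Gamma.append(best_vec)
    return new_Gamma
```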

  17. POMDP: Representing a policy as an automaton • Automaton + initial state → policy • The idea: • Based on the solution of m vector equations. • Match every state of the automaton with a value function represented by the corresponding α-vector. • For every state of the automaton, evaluate the best value function under the assumption that, after executing the first action, we continue to one of the value functions evaluated above.
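A sketch of the evaluation step, assuming the earlier array layout and a hypothetical controller described by node_action[q] (the action executed in automaton state q) and node_succ[q][o] (the automaton state entered after observing o); the m vector equations are solved as a single linear system:

```python
import numpy as np

def evaluate_controller(node_action, node_succ, Tr, R, O, gamma):
    """Solve alpha_q(s) = R(s, a_q)
                 + gamma * sum_{s2, o} Tr(s, a_q, s2) * O(s2, a_q, o) * alpha_{succ(q,o)}(s2)
    for the alpha vector of every automaton state q."""
    m = len(node_action)        # number of automaton states
    n = Tr.shape[1]             # number of world states
    n_obs = O.shape[2]
    A_mat = np.eye(m * n)
    c = np.zeros(m * n)
    for q, a in enumerate(node_action):
        for s in range(n):
            row = q * n + s
            c[row] = R[s, a]
            for s2 in range(n):
                for o in range(n_obs):
                    col = node_succ[q][o] * n + s2
                    A_mat[row, col] -= gamma * Tr[a, s, s2] * O[a, s2, o]
    return np.linalg.solve(A_mat, c).reshape(m, n)   # row q = alpha vector of state q

# For a start belief b, begin in the automaton state whose alpha vector maximizes alpha_q . b.
```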
