Incremental Pruning

CSE 574, May 9, 2003. Stanley Kok.


  1. Incremental Pruning CSE 574 May 9, 2003 Stanley Kok

  2. Value-Iteration (Recap) • DP update – a step in value-iteration • MDP • S – finite set of states in the world • A – finite set of actions • T: S × A -> Π(S) (e.g. T(s,a,s’) = 0.2) • R: S × A -> ℝ (e.g. R(s,a) = 10) • Algorithm
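The algorithm on this slide did not survive in the transcript. As a minimal sketch of what a DP update inside value iteration usually looks like, the Python below backs up each state through its best action until the values stop changing. The two-state MDP, the action names `stay` and `go`, the discount `gamma`, and the tolerance `eps` are all hypothetical and are not taken from the slides.

```python
def value_iteration(S, A, T, R, gamma=0.9, eps=1e-6):
    """Repeat the DP update until the value function changes by less than eps."""
    V = {s: 0.0 for s in S}
    while True:
        # One DP update: back up every state through its best action.
        V_new = {
            s: max(
                R[s, a] + gamma * sum(T[s, a, s2] * V[s2] for s2 in S)
                for a in A
            )
            for s in S
        }
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new


# Hypothetical two-state MDP: "stay" keeps the state, "go" switches it.
S = ["s1", "s2"]
A = ["stay", "go"]
T = {(s, a, s2): 0.0 for s in S for a in A for s2 in S}
T["s1", "stay", "s1"] = T["s2", "stay", "s2"] = 1.0
T["s1", "go", "s2"] = T["s2", "go", "s1"] = 1.0
R = {(s, a): 0.0 for s in S for a in A}
R["s2", "stay"] = 1.0  # only staying in s2 pays a reward

V = value_iteration(S, A, T, R, gamma=0.5)
```

With these numbers the fixed point can be checked by hand: V(s2) = 1 / (1 - 0.5) = 2 and V(s1) = 0.5 × V(s2) = 1.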

  3. POMDP • <S, A, T, R, Ω, O> tuple • S, A, T, R of MDP • Ω – finite set of observations • O: S × A -> Π(Ω) • Belief state • - information state • - b, probability distribution over S • - b(s1)

  4. POMDP - SE • SE – State Estimator • updates the belief state based on the previous belief state, the last action, and the current observation • SE(b,a,o) = b’
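A minimal sketch of SE(b,a,o) = b’ as a Bayes update, assuming O is read as O(s’, a, o), the probability of observing o after action a lands in s’. The nested-list layout (`T[a][s][s2]`, `O[a][s2][o]`) and all numbers are hypothetical.

```python
def state_estimator(b, a, o, T, O):
    """Return b' = SE(b, a, o) for a belief b given as a list of probabilities."""
    n = len(b)
    # Unnormalised posterior: O(s', a, o) * sum_s T(s, a, s') * b(s)
    unnorm = [
        O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    norm = sum(unnorm)  # this is Pr(o | a, b)
    if norm == 0.0:
        raise ValueError("observation has zero probability under b and a")
    return [u / norm for u in unnorm]


# Hypothetical model: one action (index 0), two states, two observations.
T = [[[0.9, 0.1], [0.2, 0.8]]]   # T[a][s][s2]
O = [[[0.7, 0.3], [0.4, 0.6]]]   # O[a][s2][o]

b_next = state_estimator([0.5, 0.5], a=0, o=0, T=T, O=O)
```

Here the predicted state distribution is (0.55, 0.45), so the unnormalised posterior is (0.385, 0.18) and b’(s1) = 0.385 / 0.565.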

  5. POMDP - SE

  6. POMDP - Π • Focus on Π component • POMDP -> “Belief MDP” • MDP parameters: • S => B, set of belief states • A => same • T => τ(b,a,b’) • R => ρ(b, a) • Solve with value-iteration algorithm

  7. POMDP - Π • τ(b,a,b’) • ρ(b, a)
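The transcript keeps only the names τ(b,a,b’) and ρ(b, a). The usual belief-MDP definitions, assumed here from the standard POMDP formulation rather than recovered from the slide, and reading O as O(s’, a, o), can be written as:

```latex
\begin{align*}
\rho(b,a) &= \sum_{s \in S} b(s)\, R(s,a) \\
\tau(b,a,b') &= \sum_{\{o \in \Omega \,:\, SE(b,a,o) = b'\}} \Pr(o \mid a, b) \\
\Pr(o \mid a, b) &= \sum_{s' \in S} O(s',a,o) \sum_{s \in S} T(s,a,s')\, b(s)
\end{align*}
```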

  8. Two Problems • How to represent value function over continuous belief space? • How to update value function from Vt-1 to Vt? • POMDP -> MDP • S => B, set of belief states • A => same • T => τ(b,a,b’) • R => ρ(b, a)

  9. Running Example • POMDP with • Two states (s1 and s2) • Two actions (a1 and a2) • Three observations (z1, z2, z3) • Figure: 1D belief space for a 2-state POMDP; the axis is the probability that the state is s1

  10. First Problem Solved • Key insight: value function is piecewise linear & convex (PWLC) • Convexity makes intuitive sense • Middle of belief space – high entropy, can’t select actions appropriately, less long-term reward • Near corners of simplex – low entropy, take actions more likely to be appropriate for current world state, gain more reward • Each line (hyperplane) represented with a vector • Coefficients of line (hyperplane) • e.g. V(b) = c1 × b(s1) + c2 × (1 - b(s1)) • To find value function at b, find the vector with the largest dot product with b
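The last bullet, "find the vector with the largest dot product with b", can be sketched directly. The three vectors below are hypothetical and only illustrate the lookup; the function names are illustrative as well.

```python
def dot(v, b):
    """Dot product of a value vector with a belief state."""
    return sum(c * p for c, p in zip(v, b))


def best_vector(b, vectors):
    """Vector whose hyperplane is highest at belief b."""
    return max(vectors, key=lambda v: dot(v, b))


def value_at(b, vectors):
    """PWLC value function: the upper surface of all the hyperplanes."""
    return dot(best_vector(b, vectors), b)


# Hypothetical value function over a two-state belief space.
vectors = [(2.0, 0.0), (0.0, 2.0), (1.2, 1.2)]

v_mid = value_at((0.5, 0.5), vectors)        # the flat vector wins in the middle
v_corner = value_at((0.9, 0.1), vectors)     # a steep vector wins near a corner
```

Because the value is a maximum over linear functions, the resulting surface is convex, which is the PWLC property named on the slide.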

  11. Second Problem • Can’t iterate over all belief states (infinite) for value-iteration but… • Given vectors representing Vt-1, generate vectors representing Vt

  12. Horizon 1 • No future • Value function consists only of immediate reward • e.g. • R(s1, a1) = 1, R(s2, a1) = 0, • R(s1, a2) = 0, R(s2, a2) = 1.5 • b = <0.25, 0.75> • Value of doing a1 • = 1 × b(s1) + 0 × b(s2) • = 1 × 0.25 + 0 × 0.75 = 0.25 • Value of doing a2 • = 0 × b(s1) + 1.5 × b(s2) • = 0 × 0.25 + 1.5 × 0.75 = 1.125
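The horizon 1 arithmetic can be reproduced in a few lines. This sketch assumes the reward assignment that matches the slide's arithmetic: a1 pays 1 in s1 and 0 in s2, while a2 pays 0 in s1 and 1.5 in s2.

```python
# Immediate rewards as one vector per action, ordered (s1, s2).
# Assumption: these follow the arithmetic on the slide.
R = {"a1": (1.0, 0.0), "a2": (0.0, 1.5)}
b = (0.25, 0.75)

# With no future, the value of an action is its expected immediate reward.
values = {a: sum(r * p for r, p in zip(vec, b)) for a, vec in R.items()}
best_action = max(values, key=values.get)
```

At this belief a2 is the better action (1.125 against 0.25); the two lines cross where b(s1) = 1.5 × (1 - b(s1)), that is at b(s1) = 0.6.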

  13. Second Problem • Break problem down into 3 steps • -Compute value of belief state given action and observation • -Compute value of belief state given action • -Compute value of belief state

  14. Horizon 2 – Given action & obs • If in belief state b, what is the best value of doing action a1 and seeing z1? • Best value = best value of immediate action + best value of next action • Best value of immediate action = horizon 1 value function

  15. Horizon 2 – Given action & obs • Assume best immediate action is a1 and obs is z1 • What’s the best action for the b’ that results from the initial b when we perform a1 and observe z1? • Not feasible to do this for all belief states (infinitely many)

  16. Horizon 2 – Given action & obs • Construct function over entire (initial) belief space • from horizon 1 value function • with belief transformation built in

  17. Horizon 2 – Given action & obs • S(a1, z1) corresponds to paper’s • S() built in: • - horizon 1 value function • - belief transformation • - “Weight” of seeing z after performing a • - Discount factor • - Immediate Reward • S() PWLC
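A hedged sketch of how the listed ingredients can be folded into one vector per previous-horizon vector. It assumes the common incremental-pruning form τ(α,a,z)(s) = R(s,a)/|Z| + γ Σ_s’ α(s’) O(s’,a,z) T(s,a,s’), which is not spelled out in the transcript. The model numbers, the nested-list layout, and the function names are hypothetical.

```python
def transform(alpha, a, z, T, O, R, gamma, n_obs):
    """Back up one previous-horizon vector alpha through action a and observation z."""
    n = len(alpha)
    return tuple(
        # Immediate reward is split evenly across observations; the second term
        # carries the discount, the observation weight, and the transition.
        R[a][s] / n_obs
        + gamma * sum(alpha[s2] * O[a][s2][z] * T[a][s][s2] for s2 in range(n))
        for s in range(n)
    )


def S(a, z, V_prev, T, O, R, gamma, n_obs):
    """Set of vectors representing the value of doing a and then seeing z."""
    return {transform(alpha, a, z, T, O, R, gamma, n_obs) for alpha in V_prev}


# Hypothetical model: one action (index 0), two states, three observations.
T = [[[0.9, 0.1], [0.2, 0.8]]]             # T[a][s][s2]
O = [[[0.5, 0.3, 0.2], [0.1, 0.3, 0.6]]]   # O[a][s2][z]
R = [(1.0, 0.0)]                           # R[a][s]
gamma = 0.9

alpha = (1.0, 0.0)
per_obs = [transform(alpha, 0, z, T, O, R, gamma, n_obs=3) for z in range(3)]
# Summing over observations recovers R(s,a) + gamma * sum_s' T(s,a,s') * alpha(s').
summed = tuple(sum(v[s] for v in per_obs) for s in range(2))
```

Each transformed vector is still linear in b, so the maximum over the set stays PWLC.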

  18. Second Problem • Break problem down into 3 steps • -Compute value of belief state given action and observation • -Compute value of belief state given action • -Compute value of belief state

  19. Horizon 2 – Given action • What is the horizon 2 value of a belief state given immediate action is a1? • Horizon 2, do action a1 • Horizon 1, do action…?

  20. Horizon 2 – Given action • What’s the best strategy at b? • How to compute line (vector) representing best strategy at b? (easy) • How many strategies are there in figure? • What’s the max number of strategies (after taking immediate action a1)?

  21. Horizon 2 – Given action • How can we represent the 4 regions (strategies) as a value function? • Note: each region is a strategy

  22. Horizon 2 – Given action • Sum up vectors representing region • Sum of vectors = vectors (add lines, get lines) • Corresponds to paper’s transformation
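"Add lines, get lines" can be sketched as summing one vector from each set in every possible combination, an operation often called the cross sum in the POMDP literature (the name is supplied here, not on the slide). The two input sets are hypothetical.

```python
from itertools import product


def cross_sum(A, B):
    """All pairwise sums of a vector from A with a vector from B."""
    return {tuple(x + y for x, y in zip(a, b)) for a, b in product(A, B)}


# Hypothetical per-observation vector sets for a two-state problem.
A = {(1.0, 0.0), (0.0, 1.0)}
B = {(0.5, 0.5), (0.0, 2.0)}

combined = cross_sum(A, B)
```

Two sets of two vectors give up to four sums, one per strategy that picks a next action for each observation.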

  23. Horizon 2 – Given action • What does each region represent? • Why is this step hard (alluded to in paper)?

  24. Second Problem • Break problem down into 3 steps • -Compute value of belief state given action and observation • -Compute value of belief state given action • -Compute value of belief state

  25. Horizon 2 • a1 ∪ a2

  26. Horizon 2 • This tells you how to act!

  27. Purge

  28. Second Problem • Break problem down into 3 steps • -Compute value of belief state given action and observation • -Compute value of belief state given action • -Compute value of belief state • Use horizon 2 value function to update horizon 3’s ...

  29. The Hard Step • Easy to visually inspect to obtain different regions • But in higher-dimensional space, with many actions and observations, this is a hard problem

  30. Naïve way - Enumerate • How does Incremental Pruning do it?

  31. Incremental Pruning • How does IP improve naïve method? • Will IP ever do worse than naïve method? • Figure labels: Combinations, Purge/Filter
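The contrast between enumerating every combination and pruning as you go can be sketched for the two-state case. One loud assumption: `purge` here replaces the usual linear-program test with interval arithmetic over p = b(s1), which is only valid for two states. The function names and the vector sets are hypothetical.

```python
from itertools import product


def cross_sum(A, B):
    """All pairwise sums of two sets of two-state vectors."""
    return {(a[0] + b[0], a[1] + b[1]) for a, b in product(A, B)}


def is_useful(w, vectors):
    """True if w is strictly best on some interval of p = b(s1) in [0, 1]."""
    lo, hi = 0.0, 1.0
    for u in vectors:
        if u == w:
            continue
        # (w - u) at belief (p, 1 - p) equals intercept + slope * p.
        intercept = w[1] - u[1]
        slope = (w[0] - u[0]) - intercept
        if slope == 0:
            if intercept <= 0:
                return False          # u is at least as good everywhere
        elif slope > 0:
            lo = max(lo, -intercept / slope)
        else:
            hi = min(hi, -intercept / slope)
    return lo < hi


def purge(vectors):
    """Keep only the vectors that appear on the upper surface."""
    vectors = set(vectors)
    return {w for w in vectors if is_useful(w, vectors)}


def naive(sets):
    """Enumerate every combination, then purge once at the end."""
    combos = set(sets[0])
    for nxt in sets[1:]:
        combos = cross_sum(combos, nxt)
    return purge(combos), len(combos)


def incremental_pruning(sets):
    """Purge after every cross sum so intermediate sets stay small."""
    kept = purge(sets[0])
    largest = len(kept)
    for nxt in sets[1:]:
        candidates = cross_sum(kept, purge(nxt))
        largest = max(largest, len(candidates))
        kept = purge(candidates)
    return kept, largest


# Hypothetical vector sets, one per observation z1, z2, z3.
sets = [
    {(1.0, 0.0), (0.0, 1.0), (0.25, 0.25)},
    {(2.0, 0.0), (0.0, 1.0)},
    {(0.5, 0.5), (1.0, 0.0)},
]
naive_result, naive_size = naive(sets)
ip_result, ip_size = incremental_pruning(sets)
```

On this input both routes return the same three vectors, but the naïve route purges a set of 12 candidates while the incremental route never holds more than 6. That illustrates the idea on one example and is not a general bound.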

  32. Incremental Pruning • What other novel idea(s) are in IP? • RR: Come up with smaller set D as argument to Dominate() • RR has more linear programs but fewer constraints in the worst case. • Empirically, the time saved by fewer constraints outweighs the time the extra linear programs require

  33. Incremental Pruning • What other novel idea(s) are in IP? • RR: Come up with smaller set D as argument to Dominate() • Why are the terms after ∪ needed?

  34. Identifying Witness • Witness Thm: • - Let Ua be a set of vectors representing value function • - Let u be in Ua (e.g. u = α(z1,a2) + α(z2,a1) + α(z3,a1)) • - If there is a vector v which differs from u in one observation (e.g. v = α(z1,a1) + α(z2,a1) + α(z3,a1)) and there is a b such that b·v > b·u, • - then Ua is not equal to the true value function
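A small sketch of the neighbor idea in the theorem: a vector u is a sum of one chosen vector per observation, and its neighbors change that choice for exactly one observation. The code tests the slide's condition b·v > b·u at a single given belief. The vector sets, the index encoding of choices, and the function names are hypothetical.

```python
def build(choice, S_sets):
    """Sum the chosen vector from each observation's set."""
    n_states = len(S_sets[0][0])
    return tuple(
        sum(S_sets[z][k][s] for z, k in enumerate(choice))
        for s in range(n_states)
    )


def neighbors(choice, S_sets):
    """Yield every choice that differs from `choice` at exactly one observation."""
    for z, vectors in enumerate(S_sets):
        for k in range(len(vectors)):
            if k != choice[z]:
                yield choice[:z] + (k,) + choice[z + 1:]


def dot(v, b):
    return sum(x * p for x, p in zip(v, b))


# Hypothetical sets for observations z1, z2, z3; choice[z] indexes into S_sets[z].
S_sets = [
    [(1.0, 0.0), (0.0, 1.0)],
    [(2.0, 0.0), (0.0, 1.0)],
    [(0.5, 0.5), (1.0, 0.0)],
]
u_choice = (1, 0, 0)
u = build(u_choice, S_sets)
b = (0.9, 0.1)

# Neighbors that beat u at b: each one is evidence that u alone is not enough.
improving = [
    c for c in neighbors(u_choice, S_sets)
    if dot(build(c, S_sets), b) > dot(u, b)
]
```

With three observations and two vectors per set, u has three neighbors; at this b two of them score higher than u.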

  35. Witness Algorithm • Randomly choose a belief state b • Compute vector representing best value at b (easy) • Add vector to agenda • While agenda is not empty • Get vector Vtop from top of agenda • b’ = Dominate(Vtop, Ua) • If b’ is not null (there is a witness), • compute vector u for best value at b’ and add it to Ua • compute all vectors v that differ from u at one observation and add them to agenda

  36. Linear Support • If value function is incorrect, the biggest difference is at the edges (convexity)

  37. Linear Support

  38. Experiments • Comments???

  39. Important Ideas • Purge()

  40. Flaws • Insufficient background/motivation

  41. Future Research • Better best-case/worst-case analyses • Precision parameter ε

  42. Variants • Reactive Policy • - st = zt; • - π(z) = a • - branch & bound search • - gradient ascent search • - perceptual aliasing problem • Finite History Window • - π(z1…zk) = a • - Suffix tree to represent observation, leaf action • Recurrent Neural Nets • - use neural nets to maintain some state (so information about past is not forgotten)

  43. Variants – Belief State MDP • Exact V, exact b • Approximate V, exact b • - Discretize b into a grid and interpolate • Exact V, approximate b • - Use particle filters to sample b • - Track approximate belief state using DBN • Approximate V, approximate b • - Combine previous two

  44. Variants - Pegasus • Policy Evaluation of Goodness And Search Using Scenarios • Convert POMDP to another POMDP with deterministic state transitions • Search for policy of transformed POMDP with highest estimated value

  45. That’s it!
