
Approximate POMDP planning: Overcoming the curse of history!

Presented by: Joelle Pineau. Joint work with: Geoff Gordon and Sebastian Thrun. Machine Learning Lunch - March 10, 2003.


Presentation Transcript


  1. Approximate POMDP planning: Overcoming the curse of history! Presented by: Joelle Pineau. Joint work with: Geoff Gordon and Sebastian Thrun. Machine Learning Lunch - March 10, 2003

  2. To use or not to use a POMDP • POMDPs provide a rich framework for sequential decision-making, which can model: • varying rewards across actions and goals • uncertainty in the action effects • uncertainty in the state of the world

  3. Existing applications of POMDPs • Maintenance scheduling (Puterman, 1994) • Robot navigation (Koenig & Simmons, 1995; Roy & Thrun, 1999) • Helicopter control (Bagnell & Schneider, 2001; Ng et al., 2002) • Dialogue modeling (Roy, Pineau & Thrun, 2000; Paek & Horvitz, 2000) • Preference elicitation (Boutilier, 2002)

  4. Graphical Model Representation • POMDP is an n-tuple { S, A, Ω, b, T, O, R }: S = state set; A = action set; Ω = observation set; b(s) = initial belief; T(s,a,s') = state-to-state transition probabilities; O(s,a,o) = observation generation probabilities; R(s,a) = reward function
     [Diagram: "What goes on" row with states s_{t-1}, s_t]

  5. Graphical Model Representation • POMDP is an n-tuple { S, A, Ω, b, T, O, R }: S = state set; A = action set; Ω = observation set; b(s) = initial belief; T(s,a,s') = state-to-state transition probabilities; O(s,a,o) = observation generation probabilities; R(s,a) = reward function
     [Diagram: "What goes on" row with states s_{t-1}, s_t; policy π(s) selecting actions a_{t-1}, a_t; rewards r_{t-1}, r_t]

  6. Graphical Model Representation • POMDP is an n-tuple { S, A, Ω, b, T, O, R }: S = state set; A = action set; Ω = observation set; b(s) = initial belief; T(s,a,s') = state-to-state transition probabilities; O(s,a,o) = observation generation probabilities; R(s,a) = reward function
     [Diagram: "What goes on" row with states s_{t-1}, s_t and policy π(s); "What we see" row with observations o_{t-1}, o_t, rewards r_{t-1}, r_t and actions a_{t-1}, a_t]

  7. Graphical Model Representation • POMDP is an n-tuple { S, A, Ω, b, T, O, R }: S = state set; A = action set; Ω = observation set; b(s) = initial belief; T(s,a,s') = state-to-state transition probabilities; O(s,a,o) = observation generation probabilities; R(s,a) = reward function
     [Diagram: "What goes on" row with states s_{t-1}, s_t; "What we see" row with observations o_{t-1}, o_t, rewards r_{t-1}, r_t and actions a_{t-1}, a_t; "What we infer" row with beliefs b_{t-1}, b_t and policy π(b)]
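
The "what we infer" row can be made concrete with the standard Bayes-filter belief update, b'(s') ∝ O(s',a,o) Σ_s T(s,a,s') b(s). The sketch below is a minimal Python illustration, not code from the talk. It assumes the observation probabilities are indexed by the state reached after the action (the slide only writes O(s,a,o)), and the function name and toy numbers are made up.

```python
def belief_update(b, a, o, T, O):
    """Return the belief after taking action a and then observing o.

    b[s]        : current probability of state s
    T[s][a][s2] : probability of moving from s to s2 under action a
    O[s2][a][o] : probability of observing o once action a lands in s2
                  (indexing by the arrival state is an assumption)
    """
    n = len(b)
    unnormalized = [
        O[s2][a][o] * sum(T[s][a][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [p / total for p in unnormalized]


# Toy model with 2 states, 1 action, 2 observations (made-up numbers).
T = [[[0.9, 0.1]], [[0.2, 0.8]]]
O = [[[0.8, 0.2]], [[0.3, 0.7]]]
b1 = belief_update([0.5, 0.5], 0, 0, T, O)
print(b1)
```

Observation 0 is more likely in state 0, so the updated belief shifts toward state 0.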

  8. Understanding the belief state • A belief is a probability distribution over states, where Dim(B) = |S|-1 • E.g. let S = {s1, s2}
     [Figure: belief space as a line segment, axis P(s1) from 0 to 1]

  9. Understanding the belief state • A belief is a probability distribution over states, where Dim(B) = |S|-1 • E.g. let S = {s1, s2, s3}
     [Figure: belief space as a triangle, axes P(s1) and P(s2) from 0 to 1]

  10. Understanding the belief state • A belief is a probability distribution over states, where Dim(B) = |S|-1 • E.g. let S = {s1, s2, s3, s4}
     [Figure: belief space as a tetrahedron, axes P(s1), P(s2) and P(s3) from 0 to 1]

  11. The first curse of POMDP planning • The curse of dimensionality: • dimension of the belief = # of states • dimension of planning problem = # of states • related to the MDP curse of dimensionality

  12. Planning for POMDPs • Learning a value function V(b), ∀b∈B: • Learning an action-selection policy π(b), ∀b∈B:

  13. Exact value iteration for POMDPs • Simple problem: |S|=2, |A|=3, |Ω|=2
     Iteration   # hyper-planes
     0           1
     [Plot: V0(b) over P(s1)]

  14. Exact value iteration for POMDPs • Simple problem: |S|=2, |A|=3, |Ω|=2
     Iteration   # hyper-planes
     0           1
     1           3
     [Plot: V1(b) over P(s1)]

  15. Exact value iteration for POMDPs • Simple problem: |S|=2, |A|=3, |Ω|=2
     Iteration   # hyper-planes
     0           1
     1           3
     [Plot: V1(b) over P(s1)]

  16. Exact value iteration for POMDPs • Simple problem: |S|=2, |A|=3, |Ω|=2
     Iteration   # hyper-planes
     0           1
     1           3
     2           27
     [Plot: V2(b) over P(s1)]

  17. Exact value iteration for POMDPs • Simple problem: |S|=2, |A|=3, |Ω|=2
     Iteration   # hyper-planes
     0           1
     1           3
     2           27
     3           2187
     [Plot: V2(b) over P(s1)]

  18. Exact value iteration for POMDPs • Simple problem: |S|=2, |A|=3, |Ω|=2
     Iteration   # hyper-planes
     0           1
     1           3
     2           27
     3           2187
     4           14,348,907
     [Plot: V2(b) over P(s1)]
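
The counts in this table follow the usual unpruned growth rule for exact value iteration: each backup builds one hyper-plane per action and per assignment of an existing hyper-plane to every observation, so the count goes from N to |A| · N^|Ω|. The short Python sketch below reproduces the column; the function name is illustrative.

```python
def unpruned_hyperplane_counts(num_actions, num_observations, iterations):
    """Count hyper-planes per iteration when nothing is pruned."""
    counts = [1]  # iteration 0 starts from a single hyper-plane
    for _ in range(iterations):
        counts.append(num_actions * counts[-1] ** num_observations)
    return counts


counts = unpruned_hyperplane_counts(num_actions=3, num_observations=2, iterations=4)
print(counts)  # [1, 3, 27, 2187, 14348907]
```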

  19. Properties of exact value iteration • Value function is always piecewise-linear convex • Many hyper-planes can be pruned away • |S|=2, |A|=3, |Ω|=2
     Iteration   # hyper-planes
     0           1
     1           3
     2           5
     3           9
     4           7
     5           13
     10          27
     15          47
     20          59
     …
     [Plot: V2(b) over P(s1)]
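
Because the value function is piecewise-linear and convex, it can be stored as a set of hyper-planes (alpha vectors) and evaluated as V(b) = max over vectors of α·b. The Python sketch below shows that evaluation together with the cheapest kind of pruning, a pointwise-dominance check. The function names and sample vectors are illustrative assumptions. Exact pruning usually requires solving linear programs, which this sketch does not attempt.

```python
def value(b, alphas):
    """Evaluate a piecewise-linear convex value function at belief b."""
    return max(sum(a_i * b_i for a_i, b_i in zip(alpha, b)) for alpha in alphas)


def prune_pointwise_dominated(alphas):
    """Drop vectors that another vector matches or beats in every state."""
    kept = []
    for i, alpha in enumerate(alphas):
        dominated = False
        for j, other in enumerate(alphas):
            if i == j:
                continue
            weakly = all(o >= a for o, a in zip(other, alpha))
            strictly = any(o > a for o, a in zip(other, alpha))
            # Remove strictly dominated vectors and later copies of duplicates.
            if weakly and (strictly or j < i):
                dominated = True
                break
        if not dominated:
            kept.append(alpha)
    return kept


alphas = [[1.0, 0.0], [0.0, 1.0], [0.4, 0.4], [0.0, 0.5]]
kept = prune_pointwise_dominated(alphas)
print(kept)
print(value([0.5, 0.5], kept))
```

Here `[0.0, 0.5]` is removed because `[0.0, 1.0]` is at least as good in every state. `[0.4, 0.4]` survives even though it is never the maximizer at any belief, which is the kind of vector only the more expensive linear-program test would catch.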

  20. Is pruning sufficient? • |S|=20, |A|=6, |Ω|=8
     Iteration   # hyper-planes
     0           1
     1           5
     2           213
     3           ?????
     …
     Not for this problem!

  21. The second curse of POMDP planning • The curse of dimensionality: • the dimension of each hyper-plane = # of states • The curse of history: • the number of hyper-planes grows exponentially with the planning horizon

  22. The second curse of POMDP planning • The curse of dimensionality: • the dimension of each hyper-plane = # of states • The curse of history: • the number of hyper-planes grows exponentially with the planning horizon • Complexity of POMDP value iteration: [Formula with one factor labeled "dimensionality" and one labeled "history"]

  23. Possible approximation approaches
     • Ignore the belief: overcomes both curses; very fast; performs poorly in high-entropy beliefs [Littman et al., 1995]
     • Discretize the belief: overcomes the curse of history (sort of); scales exponentially with # states [Lovejoy, 1991; Brafman, 1997; Hauskrecht, 1998; Zhou & Hansen, 2001]
     • Compress the belief: overcomes the curse of dimensionality [Poupart & Boutilier, 2002; Roy & Gordon, 2002]
     • Plan for trajectories: can diminish both curses; requires restricted policy class; local minimum, slow-changing gradients [Baxter & Bartlett, 2000; Ng & Jordan, 2002]
     [Figure: states s0, s1, s2]
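
A commonly cited heuristic in the "ignore the belief" family is QMDP, which appears as a baseline in the results on slide 43: solve the fully observable MDP, then weight its Q-values by the current belief. The Python sketch below illustrates that idea under assumptions of our own; the array layout, function names, discount and toy model are not taken from the talk.

```python
def mdp_q_values(T, R, gamma=0.95, sweeps=200):
    """Approximate Q(s, a) of the underlying MDP by value iteration.

    T[s][a][s2] : transition probability, R[s][a] : immediate reward.
    """
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        V = [max(q) for q in Q]
        Q = [
            [
                R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return Q


def qmdp_action(b, Q):
    """Pick the action with the best belief-weighted MDP Q-value."""
    n_actions = len(Q[0])
    scores = [sum(b[s] * Q[s][a] for s in range(len(b))) for a in range(n_actions)]
    return max(range(n_actions), key=scores.__getitem__)


# Toy model: action 0 stays put, action 1 switches state; only staying in
# state 0 pays a reward (made-up numbers).
T = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
R = [[1.0, 0.0], [0.0, 0.0]]
Q = mdp_q_values(T, R)
print(qmdp_action([1.0, 0.0], Q), qmdp_action([0.0, 1.0], Q))
```

The policy never values information gathering, since it assumes the state becomes known after one step. That is consistent with the slide's remark about poor performance in high-entropy beliefs.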

  24. A new algorithm: Point-based value iteration • Main idea: • Select a small set of belief points
     [Plot: V(b) over P(s1), with belief points b0, b1, b2]

  25. A new algorithm: Point-based value iteration • Main idea: • Select a small set of belief points • Plan for those belief points only
     [Plot: V(b) over P(s1), with belief points b0, b1, b2]

  26. A new algorithm: Point-based value iteration • Main idea: • Select a small set of belief points → Focus on reachable beliefs • Plan for those belief points only
     [Plot: V(b) over P(s1), with belief points b0, b1, b2 linked by (a,o) transitions]

  27. A new algorithm: Point-based value iteration • Main idea: • Select a small set of belief points → Focus on reachable beliefs • Plan for those belief points only → Learn value and its gradient
     [Plot: V(b) over P(s1), with belief points b0, b1, b2 linked by (a,o) transitions]

  28. Point-based value update
     [Plot: V(b) over P(s1), with belief points b0, b1, b2]

  29. Point-based value update • Initialize the value function (…and skip ahead a few iterations)
     [Plot: Vn(b) over P(s1), with belief points b0, b1, b2]

  30. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B:
     [Plot: Vn(b) over P(s1), with one belief point b]

  31. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: • For each (a,o): Project forward b → b^{a,o} and find best value:
     [Plot: Vn(b) over P(s1), with b and its successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]

  32. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: • For each (a,o): Project forward b → b^{a,o} and find best value:
     [Plot: Vn(b) over P(s1), with b, its successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}, and the best hyper-plane marked for each successor]

  33. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: • For each (a,o): Project forward b → b^{a,o} and find best value: • Sum over observations:
     [Plot: Vn(b) over P(s1), with b, its successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}, and the best hyper-plane marked for each successor]

  34. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: • For each (a,o): Project forward b → b^{a,o} and find best value: • Sum over observations:
     [Plot: Vn(b) over P(s1), with b and the best hyper-planes for b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]

  35. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: • For each (a,o): Project forward b → b^{a,o} and find best value: • Sum over observations:
     [Plot: Vn+1(b) over P(s1), with b and the summed hyper-planes for a1 and a2]

  36. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: • For each (a,o): Project forward b → b^{a,o} and find best value: • Sum over observations: • Max over actions:
     [Plot: Vn+1(b) over P(s1), with b and the summed hyper-planes for a1 and a2]

  37. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: • For each (a,o): Project forward b → b^{a,o} and find best value: • Sum over observations: • Max over actions:
     [Plot: Vn+1(b) over P(s1), with belief points b0, b1, b2]
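
The three steps on this slide (project, sum over observations, max over actions) can be sketched as a single backup that returns one new hyper-plane per belief point. The Python below is a minimal illustration under stated assumptions, not the presenters' implementation: the array layout (`T[s][a][s2]`, `O[s2][a][o]`, `R[s][a]`), the discount `gamma`, the function names and the toy model are all hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def point_based_backup(b, alphas, T, O, R, gamma):
    """Return the best new hyper-plane at belief b."""
    n_states, n_actions, n_obs = len(R), len(R[0]), len(O[0][0])
    best_vector, best_value = None, float("-inf")
    for a in range(n_actions):
        alpha_a = [R[s][a] for s in range(n_states)]  # immediate reward
        for o in range(n_obs):
            # Projection: push every current hyper-plane back through (a, o).
            projected = [
                [
                    gamma * sum(T[s][a][s2] * O[s2][a][o] * alpha[s2]
                                for s2 in range(n_states))
                    for s in range(n_states)
                ]
                for alpha in alphas
            ]
            # Keep only the projection with the best value at b.
            best = max(projected, key=lambda g: dot(g, b))
            # Sum over observations.
            alpha_a = [x + y for x, y in zip(alpha_a, best)]
        # Max over actions.
        v = dot(alpha_a, b)
        if v > best_value:
            best_vector, best_value = alpha_a, v
    return best_vector


def pbvi_update(B, alphas, T, O, R, gamma):
    """One point-based value update: one hyper-plane per belief point."""
    return [point_based_backup(b, alphas, T, O, R, gamma) for b in B]


# Toy model with 2 states, 2 actions, 2 observations (made-up numbers).
T = [[[0.9, 0.1], [0.5, 0.5]], [[0.1, 0.9], [0.5, 0.5]]]
O = [[[0.8, 0.2], [0.8, 0.2]], [[0.3, 0.7], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.9, 0.1], [0.1, 0.9]]

alphas = [[0.0, 0.0]]
for _ in range(3):
    alphas = pbvi_update(B, alphas, T, O, R, gamma=0.95)
print(alphas)
```

Each update produces at most |B| hyper-planes, so the solution size stays fixed from one iteration to the next instead of growing with the horizon.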

  38. Complexity of value update
                        Exact update    Point-based update
     I - Projection     S² A Ω Γn       S² A Ω B
     II - Sum           S² A Γn^Ω       S A Ω B²
     III - Max          S A Γn^Ω        S A B
     where: S = # states, A = # actions, Ω = # observations, Γn = # solution vectors at iteration n, B = # belief points

  39. Theoretical properties of point-based updates • Theorem: For any belief set B and any horizon n, the error of the PBVI algorithm εn = ||Vn^B - Vn*|| is bounded by:

  40. Back to the full algorithm • Main idea: • Select a small set of belief points → PART II • Plan for those belief points only → PART I
     [Plot: V(b) over P(s1), with belief points b0, b1, b2 linked by (a,o) transitions]

  41. Experimental results: Lasertag domain • State space = RobotPosition × OpponentPosition • Observable: RobotPosition, always; OpponentPosition, only if same as Robot • Action space = {North, South, East, West, Tag} • Opponent strategy: Move away from robot w/ Pr=0.8 • |S|=870, |A|=5, |Ω|=30

  42. Performance of PBVI on Lasertag domain
     [Two result panels: "Opponent tagged 59% of trials" and "Opponent tagged 17% of trials"]

  43. Performance on well-known POMDPs

     Maze33: |S|=36, |A|=5, |Ω|=17
     Method     QMDP    Grid    PBUA    PBVI
     Reward     0.198   0.94    2.30    2.25
     Time(s)    0.19    n.v.    12166   3448
     B          n.a.    174     660     470

     Hallway: |S|=60, |A|=5, |Ω|=20
     Method     QMDP    Grid    PBUA    PBVI
     %Goal      47      n.v.    100     95
     Reward     0.261   n.v.    0.53    0.53
     Time(s)    0.51    n.v.    450     288
     B          n.a.    n.a.    300     86

     Hallway2: |S|=92, |A|=5, |Ω|=17
     Method     QMDP    Grid    PBUA    PBVI
     %Goal      22      98      100     98
     Reward     0.109   n.v.    0.35    0.34
     Time(s)    1.44    n.v.    27898   360
     B          n.a.    337     1840    95

  44. Back to the full algorithm • Main idea: • Select a small set of belief points → PART II • Plan for those belief points only → PART I
     [Plot: V(b) over P(s1), with belief points b0, b1, b2 linked by (a,o) transitions]

  45. Selecting good belief points • What can we learn from policy search methods? • Focus on reachable beliefs.

  46. Selecting good belief points • What can we learn from policy search methods? • Focus on reachable beliefs.
     [Figure: belief b on the P(s1) axis and its successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}, with arrows labeled by (a,o)]

  47. Selecting good belief points • What can we learn from policy search methods? • Focus on reachable beliefs. • What can we learn from MDP exploration techniques? • Select widely-spaced beliefs, rather than nearby beliefs.
     [Figure: belief b on the P(s1) axis and its successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]

  48. Selecting good belief points • What can we learn from policy search methods? • Focus on reachable beliefs. • What can we learn from MDP exploration techniques? • Select widely-spaced beliefs, rather than nearby beliefs.
     [Figure: belief b on the P(s1) axis and its successors b^{a1,o1}, b^{a1,o2}, b^{a2,o1}, b^{a2,o2}]

  49. How does PBVI actually select belief points? • Start with B = {b0} • For any belief point b∈B:
     [Figure: belief b on the P(s1) axis]

  50. How does PBVI actually select belief points? • Start with B = {b0} • For any belief point b∈B: • For each action a∈A: • Generate a new belief b^a by applying a and stochastically picking an observation o.
     [Figure: belief b on the P(s1) axis and a new belief b^{a1} reached via (a1,o2)]
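
The expansion step described here can be sketched in Python as follows. The transcript of this slide stops after the candidate beliefs are generated, so the final choice in the sketch (keep, for each b, the candidate farthest in L1 distance from the points already held) is an assumption, motivated by the earlier "select widely-spaced beliefs" slide. The observation is sampled directly from P(o | b, a). The array layout, function names and toy model are hypothetical.

```python
import random


def belief_update(b, a, o, T, O):
    """Bayes update of belief b after action a and observation o."""
    n = len(b)
    unnormalized = [
        O[s2][a][o] * sum(T[s][a][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalized)
    return [p / total for p in unnormalized]


def l1_distance(u, v):
    return sum(abs(x - y) for x, y in zip(u, v))


def expand_beliefs(B, T, O, rng):
    """Grow B by at most one new reachable belief per existing point."""
    n_states, n_actions, n_obs = len(T), len(T[0]), len(O[0][0])
    new_points = []
    for b in B:
        candidates = []
        for a in range(n_actions):
            # P(o | b, a), used to pick an observation stochastically.
            probs = [
                sum(O[s2][a][o] * sum(T[s][a][s2] * b[s] for s in range(n_states))
                    for s2 in range(n_states))
                for o in range(n_obs)
            ]
            o = rng.choices(range(n_obs), weights=probs)[0]
            candidates.append(belief_update(b, a, o, T, O))
        # Assumption: keep the candidate farthest from all points held so far.
        current = B + new_points
        gap = lambda c: min(l1_distance(c, x) for x in current)
        farthest = max(candidates, key=gap)
        if gap(farthest) > 0.0:
            new_points.append(farthest)
    return B + new_points


# Toy model with 2 states, 2 actions, 2 observations (made-up numbers).
T = [[[0.9, 0.1], [0.5, 0.5]], [[0.1, 0.9], [0.5, 0.5]]]
O = [[[0.8, 0.2], [0.8, 0.2]], [[0.3, 0.7], [0.3, 0.7]]]
expanded = expand_beliefs([[0.5, 0.5]], T, O, random.Random(0))
print(expanded)
```

Calling `expand_beliefs` repeatedly at most doubles the belief set each time, and every added point is reachable from the initial belief by construction.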
