A Hybridized Planner for Stochastic Domains • Mausam and Daniel S. Weld, University of Washington, Seattle • Piergiorgio Bertoli, ITC-IRST, Trento
Planning under Uncertainty (ICAPS’03 Workshop) • Qualitative (disjunctive) uncertainty • Which real problem can you solve? • Quantitative (probabilistic) uncertainty • Which real problem can you model?
The Quantitative View • Markov Decision Process • models uncertainty with probabilistic outcomes • general decision-theoretic framework • algorithms are slow • do we need the full power of decision theory? • is an unconverged partial policy any good?
The Qualitative View • Conditional Planning • Model uncertainty as logical disjunction of outcomes • exploits classical planning techniques → FAST • ignores probabilities → poor solutions • how bad are pure qualitative solutions? • can we improve the qualitative policies?
HybPlan: A Hybridized Planner • combine probabilistic + disjunctive planners • produces good solutions in intermediate times • anytime: makes effective use of resources • bounds termination with quality guarantee • Quantitative View • completes partial probabilistic policy by using qualitative policies in some states • Qualitative View • improves qualitative policies in more important regions
Outline • Motivation • Planning with Probabilistic Uncertainty (RTDP) • Planning with Disjunctive Uncertainty (MBP) • Hybridizing RTDP and MBP (HybPlan) • Experiments • Conclusions and Future Work
Markov Decision Process ⟨S, A, Pr, C, s0, G⟩ • S: a set of states • A: a set of actions • Pr: probabilistic transition model • C: cost model • s0: start state • G: a set of goals • Find a policy (S → A) that minimizes expected cost to reach a goal, for an indefinite horizon, for a fully observable Markov decision process • Optimal cost function J* ⇒ optimal policy
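As a reminder (standard MDP theory, not spelled out on the slide), the optimal cost function of such a goal-directed MDP satisfies the Bellman optimality equation, and acting greedily with respect to J* gives the optimal policy:

```latex
J^*(s) =
\begin{cases}
0 & \text{if } s \in G,\\[4pt]
\displaystyle \min_{a \in A} \Big[\, C(s,a) + \sum_{s' \in S} \Pr(s' \mid s, a)\, J^*(s') \Big] & \text{otherwise,}
\end{cases}
\qquad
\pi^*(s) = \operatorname*{arg\,min}_{a \in A} \Big[\, C(s,a) + \sum_{s' \in S} \Pr(s' \mid s, a)\, J^*(s') \Big].
```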
Example [grid-world figure: a start state s0 and a Goal; annotations mark a longer path, a region where all states are dead-ends, and a wrong direction from which the goal is still reachable]
Optimal State Costs [grid-world figure: the optimal expected cost to reach the Goal for each state; the Goal has cost 0]
Optimal Policy [grid-world figure: the greedy action at each state under the optimal cost function]
Real Time Dynamic Programming (Barto et al. ’95; Bonet & Geffner ’03) • Bellman backup: create a better approximation to the cost function at state s • Trial = simulate the greedy policy & update visited states • Repeat trials until the cost function converges
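A minimal RTDP sketch in Python, assuming a hypothetical `mdp` object with `actions(s)`, `cost(s, a)`, `successors(s, a)` returning (probability, next_state) pairs, and a `goals` set; initializing J at 0 plays the role of an admissible heuristic for nonnegative costs. This is an illustration of the trial loop, not the implementation used in the paper:

```python
import random

def bellman_backup(mdp, J, s):
    """Update J[s] with the best one-step lookahead cost and return the greedy action."""
    best_a, best_q = None, float("inf")
    for a in mdp.actions(s):
        q = mdp.cost(s, a) + sum(p * J.get(s2, 0.0) for p, s2 in mdp.successors(s, a))
        if q < best_q:
            best_a, best_q = a, q
    J[s] = best_q
    return best_a

def rtdp_trial(mdp, J, s0, max_steps=1000):
    """One trial: follow the greedy policy from s0, backing up every state it visits."""
    s, steps = s0, 0
    while s not in mdp.goals and steps < max_steps:
        a = bellman_backup(mdp, J, s)
        probs, succs = zip(*[(p, s2) for p, s2 in mdp.successors(s, a)])
        s = random.choices(succs, weights=probs, k=1)[0]   # sample an outcome
        steps += 1

def rtdp(mdp, s0, num_trials=1000):
    """Repeat trials; with an admissible initialization J stays a lower bound on J*."""
    J = {}
    for _ in range(num_trials):
        rtdp_trial(mdp, J, s0)
    return J
```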
Planning with Disjunctive Uncertainty ⟨S, A, T, s0, G⟩ • S: a set of states • A: a set of actions • T: disjunctive transition model • s0: the start state • G: a set of goals • Find a strong-cyclic policy (S → A) that guarantees reaching a goal, for an indefinite horizon, for a fully observable planning problem
Model Based Planner (Bertoli et al.) • States, transitions, etc. represented logically • Uncertainty → multiple possible successor states • Planning algorithm: iteratively removes “bad” states • Bad = states that don’t reach anywhere or reach other bad states
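A rough explicit-state sketch of the “iteratively remove bad states” idea, assuming hypothetical `actions(s)` and `succ(s, a)` helpers over an enumerated state set; MBP itself works symbolically over logically represented state sets, so this only illustrates the fixpoint, not MBP’s actual algorithm:

```python
def good_states(states, goals, actions, succ):
    """Keep a state only if, using actions whose outcomes all stay among the kept
    states, the goal is still reachable from it; drop 'bad' states and repeat."""
    alive = set(states)
    while True:
        reach = set(goals) & alive            # states known to reach the goal safely
        grew = True
        while grew:                           # inner fixpoint: backward reachability
            grew = False
            for s in alive - reach:
                for a in actions(s):
                    nxt = set(succ(s, a))     # disjunctive set of possible successors
                    if nxt <= alive and nxt & reach:
                        reach.add(s)
                        grew = True
                        break
        if reach == alive:
            return alive                      # every surviving state is "good"
        alive = reach                         # remove the bad states and iterate
```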
MBP Policy [grid-world figure: the strong-cyclic policy found by MBP; it reaches the Goal but is a sub-optimal solution]
Outline • Motivation • Planning with Probabilistic Uncertainty (RTDP) • Planning with Disjunctive Uncertainty (MBP) • Hybridizing RTDP and MBP (HybPlan) • Experiments • Conclusions and Future Work
HybPlan Top Level Code
0. run MBP to find a solution to the goal
1. run RTDP for some time
2. compute the partial greedy policy (rtdp)
3. compute the hybridized policy (hyb): hyb(s) = rtdp(s) if visited(s) > threshold, hyb(s) = mbp(s) otherwise
4. clean hyb by removing dead-ends and probability-1 cycles
5. evaluate hyb
6. save the best policy obtained so far
repeat steps 1–6 until 1) resources are exhausted or 2) a satisfactory policy is found
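A compact Python sketch of this top-level loop. The helpers `run_rtdp`, `greedy_policy`, `clean`, `evaluate`, and `lower_bound` are hypothetical stand-ins for the subroutines described on the slides and are passed in as functions, which keeps the sketch agnostic to how RTDP and MBP are actually implemented:

```python
import time

def hybridize(greedy, mbp_policy, visits, threshold=0):
    """Step 3: RTDP's greedy action where the state was visited often enough,
    MBP's action everywhere else."""
    hyb = dict(mbp_policy)
    for s, a in greedy.items():
        if visits.get(s, 0) > threshold:
            hyb[s] = a
    return hyb

def hybplan(run_rtdp, greedy_policy, mbp_policy, clean, evaluate, lower_bound,
            s0, time_budget, max_error, threshold=0):
    """Top-level loop; planner-specific subroutines are supplied by the caller."""
    best_hyb, best_cost = dict(mbp_policy), float("inf")
    start = time.time()
    while time.time() - start < time_budget:                    # resource limit
        J, visits = run_rtdp()                                   # step 1
        greedy = greedy_policy(J, visits)                        # step 2
        hyb = hybridize(greedy, mbp_policy, visits, threshold)   # step 3
        clean(hyb)                                               # step 4: dead ends, prob-1 cycles
        cost = evaluate(hyb, s0)                                 # step 5
        if cost < best_cost:                                     # step 6: keep the best policy
            best_hyb, best_cost = hyb, cost
        if best_cost - lower_bound(J, s0) <= max_error:          # quality guarantee reached
            break
    return best_hyb, best_cost
```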
First RTDP Trial [grid-world figure: all state costs start at 0; step 1: run RTDP for some time]
Bellman Backup [grid-world figure] • Q1(s,N) = 1 + 0.5 × 0 + 0.5 × 0 = 1 • Q1(s,S) = Q1(s,W) = Q1(s,E) = 1 • J1(s) = 1 • Let the greedy action be North
Simulation of Greedy Action [grid-world figure: the greedy action is simulated and the backed-up state’s cost becomes 1]
Continuing First Trial [grid-world figures: Bellman backups continue along the simulated trajectory]
Finishing First Trial [grid-world figure: the trial ends when the Goal is reached]
Cost Function after First Trial [grid-world figure: the costs of states visited during the first trial have been updated]
Partial Greedy Policy [grid-world figure; step 2: compute the partial greedy policy (rtdp) over the states visited so far]
Construct Hybridized Policy w/ MBP [grid-world figure; step 3: compute the hybridized policy (hyb), with threshold = 0]
Evaluate Hybridized Policy [grid-world figure; steps 5 and 6: evaluate hyb and store it; after the first trial, J(hyb) = 5]
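Step 5 amounts to computing the expected cost-to-goal of the fixed policy hyb from s0. One standard way to do this (a generic method, not necessarily the authors’), reusing the hypothetical `mdp` interface from the RTDP sketch:

```python
def evaluate(policy, mdp, s0, eps=1e-6, max_iters=100_000):
    """Iterative policy evaluation: expected cost-to-goal of a fixed (proper) policy."""
    V = {s: 0.0 for s in policy}
    for _ in range(max_iters):
        delta = 0.0
        for s, a in policy.items():
            if s in mdp.goals:
                continue                                   # goal states cost nothing
            v = mdp.cost(s, a) + sum(p * V.get(s2, 0.0) for p, s2 in mdp.successors(s, a))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < eps:
            break
    return V.get(s0, float("inf"))
```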
Second Trial [grid-world figure: state costs after a second RTDP trial]
Partial Greedy Policy [grid-world figure: the partial greedy policy after the second trial]
Absence of MBP Policy [grid-world figure: for one visited state an MBP policy doesn’t exist, because there is no path from it to the goal]
Third Trial [grid-world figure: state costs after a third RTDP trial]
Partial Greedy Policy [grid-world figure: the partial greedy policy after the third trial]
Probability 1 Cycles [grid-world figures: the hybridized policy contains a cycle that never reaches the Goal] • repeat: find a state s in the cycle, set hyb(s) = mbp(s), until the cycle is broken
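One possible rendering of this clean-up loop in Python, again over the hypothetical `mdp` interface: it treats dead ends and probability-1 cycles together as states from which the policy can never reach a goal, and patches them with MBP’s action. This is a hedged reconstruction of step 4, not the authors’ code:

```python
def break_cycles_and_dead_ends(hyb, mdp, mbp_policy):
    """Step 4: find states from which the hybridized policy can never reach a goal
    (dead ends and probability-1 cycles) and switch them to MBP's action."""
    while True:
        reach = set(mdp.goals)                   # states that reach a goal under hyb
        grew = True
        while grew:
            grew = False
            for s, a in hyb.items():
                if s not in reach and any(s2 in reach for _, s2 in mdp.successors(s, a)):
                    reach.add(s)
                    grew = True
        trapped = [s for s in hyb if s not in reach]
        if not trapped:
            return hyb
        changed = False
        for s in trapped:
            if s in mbp_policy and hyb[s] != mbp_policy[s]:
                hyb[s] = mbp_policy[s]           # patch the state with MBP's action
                changed = True
        if not changed:
            for s in trapped:
                del hyb[s]                       # no usable MBP action: drop the state
            return hyb
```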
Error Bound [grid-world figure] • J*(s0) ≤ 5 and J*(s0) ≥ 1 ⇒ Error(hyb) = 5 − 1 = 4 • After the 1st trial, J(hyb) = 5
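Reading the numbers as the usual sandwich bound (the upper bound being the evaluated cost of the hybridized policy, the lower bound RTDP’s current admissible estimate at s0; this interpretation is standard but not spelled out on the slide):

```latex
J_{\mathrm{RTDP}}(s_0) \;\le\; J^*(s_0) \;\le\; J^{\mathrm{hyb}}(s_0)
\;\Longrightarrow\;
J^{\mathrm{hyb}}(s_0) - J^*(s_0) \;\le\; J^{\mathrm{hyb}}(s_0) - J_{\mathrm{RTDP}}(s_0)
\;=\; \mathrm{Error}(\mathrm{hyb}) \;=\; 5 - 1 \;=\; 4 .
```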
Termination • when a policy with the required error bound is found • when the planning time is exhausted • when the available memory is exhausted • Properties • outputs a proper policy • anytime algorithm (once MBP terminates) • HybPlan = RTDP if infinite resources are available • HybPlan = MBP if resources are extremely limited • HybPlan = better than both, otherwise
Outline • Motivation • Planning with Probabilistic Uncertainty (RTDP) • Planning with Disjunctive Uncertainty (MBP) • Hybridizing RTDP and MBP (HybPlan) • Experiments • Anytime Properties • Scalability • Conclusions and Future Work
Domains • NASA Rover Domain • Factory Domain • Elevator Domain
Anytime Properties [experimental plots; RTDP shown for comparison]
Conclusions • First algorithm that integrates disjunctive and probabilistic planners • Experiments show that HybPlan • is anytime • scales better than RTDP • produces better quality solutions than MBP • can interleave planning and execution
Hybridized Planning: A General Notion • Hybridize other pairs of planners: an optimal or close-to-optimal planner + a sub-optimal but fast planner, to yield a planner that produces good-quality solutions in intermediate running times • Examples • POMDP: RTDP/PBVI with POND/MBP/BBSP • Oversubscription Planning: A* with greedy solutions • Concurrent MDP: Sampled RTDP with single-action RTDP