Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes
Anthony Cassandra, Computer Science Dept., Brown University, Providence, RI 02912, arc@cs.brown.edu
Michael L. Littman, Dept. of Computer Science, Duke University, Durham, NC 27708-0129, mlittman@cs.duke.edu
Nevin L. Zhang, Computer Science Dept., The Hong Kong U. of Sci. & Tech., Clear Water Bay, Kowloon, HK, lzhang@cs.ust.hk
Presented by Costas Djouvas

POMDPs: Who Needs Them?
Tony Cassandra, St. Edwards University, Austin, TX
http://www.cassandra.org/pomdp/talks/who-needs-pomdps/index.shtml

Markov Decision Processes (MDP)
• A discrete model for decision making under uncertainty.
• The four components of the MDP model:
  • States: The world is divided into states.
  • Actions: Each state has a finite number of actions to choose from.
  • Transition Function: The probabilistic relationship between states and the actions available in each state.
  • Reward Function: The expected reward of taking action a in state s.

MDP More Formally
• S = a set of possible world states.
• A = a set of possible actions.
• Transition Function: a real-valued function T(s, a, s') = Pr(s' | s, a).
• Reward Function: a real-valued function R(s, a).

MDP Example (1/2)
• S = {OK, DOWN}.
• A = {NO-OP, ACTIVE-QUERY, RELOCATE}.
• Reward Function [table not preserved]

MDP Example (2/2)
• Transition Functions [tables not preserved]

Best Strategy
• Value Iteration Algorithm:
  • Input: actions, states, reward function, probabilistic transition function.
  • Derives a mapping from states to "best" actions for a given time horizon.
  • Starts with horizon length 1 and iteratively finds the value function for the desired horizon.
• Optimal Policy:
  • Maps states to actions (S → A).
  • Depends only on the current state (Markov property).
  • To apply it, we must know the agent's state.

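The value iteration steps above can be sketched as follows. This is a minimal illustration, not the presenter's code: the slide's reward and transition tables did not survive, so every number below is made up, the ACTIVE-QUERY action is left out for brevity, and the dictionary layout for T and R is an assumption.

```python
# Finite-horizon value iteration for a small MDP.
# All rewards and probabilities are hypothetical placeholders.

def value_iteration(states, actions, T, R, horizon, gamma=1.0):
    """Return (V, policy) for the given horizon.

    T[s][a] is a dict {next_state: probability}; R[s][a] is the immediate reward.
    """
    V = {s: 0.0 for s in states}  # value with zero steps to go
    policy = {}
    for _ in range(horizon):
        new_V = {}
        for s in states:
            best_action, best_value = None, float("-inf")
            for a in actions:
                # Immediate reward plus expected value of the successor state.
                q = R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                if q > best_value:
                    best_action, best_value = a, q
            new_V[s] = best_value
            policy[s] = best_action
        V = new_V
    return V, policy


states = ["OK", "DOWN"]
actions = ["NO-OP", "RELOCATE"]
T = {
    "OK": {"NO-OP": {"OK": 0.9, "DOWN": 0.1}, "RELOCATE": {"OK": 1.0}},
    "DOWN": {"NO-OP": {"DOWN": 1.0}, "RELOCATE": {"OK": 1.0}},
}
R = {
    "OK": {"NO-OP": 1.0, "RELOCATE": -2.0},
    "DOWN": {"NO-OP": -5.0, "RELOCATE": -2.0},
}
V, policy = value_iteration(states, actions, T, R, horizon=2)
print(policy)
```

With these placeholder numbers the horizon-2 policy keeps doing nothing while the state is OK and relocates when it is DOWN. The policy is indexed by the true state, which is exactly what becomes unavailable in the partially observable setting.
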
Partially Observable Markov Decision Processes
• Domains in which only partial information about the current state is available (we cannot observe the current state directly).
• The observation can be probabilistic.
• We need an observation function.
• Uncertainty about the current state.
• Non-Markovian process: requires keeping track of the entire history.

Partially Observable Markov Decision Processes
• In addition to the MDP model we have:
  • Observations: Z = a set of observations of the state.
  • Observation Function: the relation between the state and the observation, O(s, a, z) = Pr(z | s, a).

POMDP Example
• In addition to the definitions of the MDP example, we must define the observation set and the observation probability function.
• Z = {ping-ok, ping-timeout, active-ok, active-down}.

Background on Solving POMDPs
• We have to find a mapping from probability distributions over states to actions.
• Belief State: a probability distribution over states.
• Belief Space: the space of all such probability distributions.
• Assuming a finite number of possible actions and observations, there is a finite number of possible next belief states.
• Given the action taken and the observation received, the next belief state is fully determined and depends only on the current belief state (Markov property).

Background on Solving POMDPs
• Next Belief State [figure not preserved]

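The next-belief computation can be sketched with the standard Bayes update. The slide's own formula was not preserved, so this is a reconstruction under assumptions: the observation probability is taken to depend on the state reached after the action, the nested-dictionary layout is hypothetical, and the numbers are invented.

```python
def belief_update(b, a, z, states, T, O):
    """Return the next belief after taking action a and observing z.

    b[s] is the current probability of state s,
    T[s][a][s2] = Pr(s2 | s, a), and O[s2][a][z] = Pr(z | s2, a).
    """
    unnormalized = {
        s2: O[s2][a][z] * sum(T[s][a][s2] * b[s] for s in states)
        for s2 in states
    }
    total = sum(unnormalized.values())  # equals Pr(z | b, a)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: p / total for s2, p in unnormalized.items()}


# Hypothetical two-state model with one action and two observations.
states = ["s1", "s2"]
T = {
    "s1": {"a1": {"s1": 0.8, "s2": 0.2}},
    "s2": {"a1": {"s1": 0.0, "s2": 1.0}},
}
O = {
    "s1": {"a1": {"z1": 0.9, "z2": 0.1}},
    "s2": {"a1": {"z1": 0.2, "z2": 0.8}},
}
b = {"s1": 0.5, "s2": 0.5}
b_next = belief_update(b, "a1", "z1", states, T, O)
print(b_next)
```

Because the update needs only the current belief, the action, and the observation, the belief state can replace the full history.
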
Background on Solving POMDPs
• Start from belief state b (yellow dot).
• Two states: s1, s2.
• Two actions: a1, a2.
• Three observations: z1, z2, z3.
[Figure: belief states in the belief space]

Policies for POMDPs
• An optimal POMDP policy maps belief states to actions.
• To use a computed policy, start with some a priori belief about where you are in the world. Then continually:
  • Use the policy to select an action for the current belief state;
  • Execute the action;
  • Receive an observation;
  • Update the belief state using the current belief, action, and observation;
  • Repeat.

Example for Optimal Policy
[Figure: the belief space from 0 to 1 divided into regions labeled ACTIVE, NO-OP, and RELOCATE, shown together with the value function]

Value Function
• The optimal policy computation is based on value iteration.
• The main problem with using value iteration here is that the space of all belief states is continuous.

Value Function
• Each belief state gets a single expected value.
• Find the expected value of every belief state.
• This yields a value function defined over the entire belief space.

Value Iteration Example
• Two states, two actions, three observations.
• We will use a figure to represent the belief space and the transformed value function.
• We will use the S(a, z) function to transform the continuous-space value function.
[Figure labels: dot product, transformed value, belief space]

Value Iteration Example
• Start from belief state b.
• One available action, a1, for the first decision, and then two, a1 and a2, for the second.
• Three possible observations: z1, z2, z3.

Value Iteration Example
• For each of the three new belief states, compute the new value function for all actions.
[Figure: transformed value functions for all observations; partition for action a1]

Value Iteration Example
[Figure panels:]
• Value function and partition for action a1
• Value function and partition for action a2
• Combined a1 and a2 value functions
• Value function for horizon 2

Transformed Value Example
[Figure: MDP example]

Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes
• The agent is not aware of its current state.
• It only knows its information (belief) state x, a probability distribution over possible states.
• After taking action a and observing z, the agent moves to a new information state $x^a_z$.
Notation
• S: a finite set of states
• A: a finite set of possible actions
• Z: a finite set of possible observations
• $a \in A$, $s \in S$, $z \in Z$
• Reward: $r^a(s) \in \mathbb{R}$
• Transition function: $\Pr(s' \mid s, a) \in [0, 1]$
• Observation function: $\Pr(z \mid s', a) \in [0, 1]$

Introduction
• Algorithms for POMDPs use a form of dynamic programming called the dynamic programming update (DPU).
• One value function is transformed into another.
• Some of the algorithms using DPU:
  • One pass (Sondik 1971)
  • Exhaustive (Monahan 1982)
  • Linear support (Cheng 1988)
  • Witness (Littman, Cassandra & Kaelbling 1996)
  • Dynamic Pruning (Zhang & Liu 1996)

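Dynamic Programming Updates
• Idea: define a new value function V' in terms of a given value function V.
• With value iteration in the infinite-horizon case, repeated updates make V' an approximation that is very close to the optimal value function.
• V' is defined by: $V'(x) = \max_{a \in A} \bigl( \sum_{s \in S} x(s)\, r^a(s) + \gamma \sum_{z \in Z} \Pr(z \mid x, a)\, V(x^a_z) \bigr)$, where $\gamma$ is the discount factor.
• The value functions involved can each be expressed as the maximum of dot products $x \cdot \alpha$ over a finite set of |S|-vectors: $S^a_z$, $S^a$, and $S'$.
• The transformations preserve piecewise linearity and convexity (Smallwood & Sondik, 1973).
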
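To make the vector representation concrete, here is a small sketch of evaluating a value function stored as a finite set of |S|-vectors: the value at an information state is the largest dot product. The function name and the three vectors are illustrative assumptions, not taken from the slides.

```python
def value(x, vectors):
    """Evaluate a piecewise-linear convex value function at information state x.

    x is a tuple of state probabilities; each vector holds one value per state.
    """
    return max(sum(xi * ai for xi, ai in zip(x, alpha)) for alpha in vectors)


# Made-up |S|-vectors for a two-state problem.
S_example = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6)]

print(value((0.5, 0.5), S_example))  # the flat vector is best in the middle
print(value((1.0, 0.0), S_example))  # the first vector is best at this corner
```

Taking a maximum over linear functions is what makes the value function piecewise linear and convex.
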
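Dynamic Programming Updates
Some more notation:
• Vector comparison: $\alpha_1 > \alpha_2$ if and only if $\alpha_1(s) > \alpha_2(s)$ for all $s \in S$.
• Vector dot product: $\alpha \cdot \beta = \sum_s \alpha(s)\,\beta(s)$.
• Cross sum: $A \oplus B = \{\alpha + \beta \mid \alpha \in A,\ \beta \in B\}$.
• Set subtraction: $A \setminus B = \{\alpha \in A \mid \alpha \notin B\}$.
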
Dynamic Programming Updates
• Using this notation, each of the "S" sets described earlier can be characterized as purge(·) applied to a set of candidate vectors.
• purge(·) takes a set of vectors and reduces it to its unique minimum form.

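The slide's equations for the "S" sets did not survive extraction. As a hedged reconstruction of the standard formulation, they can be written as below. The symbol $\tau(\alpha, a, z)$ for the back-projected vector and the discount factor $\gamma$ are notation assumed here, not defined on the slides.

```latex
\begin{align*}
S'   &= \mathrm{purge}\Bigl(\bigcup_{a \in A} S^a\Bigr), \\
S^a  &= \mathrm{purge}\Bigl(\bigoplus_{z \in Z} S^a_z\Bigr), \\
S^a_z &= \mathrm{purge}\bigl(\{\tau(\alpha, a, z) \mid \alpha \in S\}\bigr), \\
\tau(\alpha, a, z)(s) &= \frac{1}{|Z|}\, r^a(s)
  + \gamma \sum_{s' \in S} \alpha(s')\, \Pr(z \mid s', a)\, \Pr(s' \mid s, a).
\end{align*}
```

The middle line is the expensive one: the cross sum over all observations can contain $\prod_z |S^a_z|$ vectors before purging.
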
Pruning Sets of Vectors
• Given a set of |S|-vectors A and a vector α, define $R(\alpha, A) = \{x \mid x \cdot \alpha > x \cdot \alpha' \text{ for all } \alpha' \in A \setminus \{\alpha\}\}$. This is called the "witness region": the set of information states for which vector α is the clear "winner" (has the largest dot product) compared to all the other vectors of A.
• Using the definition of R, we can define $\mathrm{purge}(A) = \{\alpha \in A \mid R(\alpha, A) \neq \emptyset\}$, the set of vectors in A that have a non-empty witness region. This is precisely the minimum-size set.

Pruning Sets of Vectors
• Implementation of purge(F):
  • DOMINATE(α, A): returns an information state x for which α gives a larger dot product than any vector in A.
  • FILTER(F): returns the vectors in F with a non-empty witness region.

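The two routines can be sketched for the special case of two states, where an information state is a single number p (the probability of the second state) and the witness test reduces to intersecting intervals. This is an illustrative simplification: the general routine solves a linear program, and the function names, the tie-breaking rule, and the example vectors are assumptions.

```python
from fractions import Fraction


def dot(p, v):
    """Value of vector v at the belief that puts probability p on the second state."""
    p = Fraction(p)
    return (1 - p) * Fraction(v[0]) + p * Fraction(v[1])


def dominate(alpha, others):
    """Return a belief p in (0, 1) where alpha strictly beats every vector in
    others, or None if no such belief exists."""
    lo, hi = Fraction(0), Fraction(1)
    for w in others:
        # alpha beats w exactly where c0 + slope * p > 0.
        c0 = Fraction(alpha[0]) - Fraction(w[0])
        slope = (Fraction(alpha[1]) - Fraction(w[1])) - c0
        if slope == 0:
            if c0 <= 0:
                return None  # parallel and never strictly above w
        elif slope > 0:
            lo = max(lo, -c0 / slope)
        else:
            hi = min(hi, -c0 / slope)
    return (lo + hi) / 2 if lo < hi else None


def filter_vectors(F):
    """Return the vectors of F that have a non-empty witness region."""
    candidates = list(dict.fromkeys(tuple(v) for v in F))  # drop exact duplicates
    if not candidates:
        return []
    W = []
    # Seed W with the best vector at each corner of the belief space.
    for corner, other in ((0, 1), (1, 0)):
        best = max(candidates, key=lambda v: (v[corner], v[other]))
        if best not in W:
            W.append(best)
    remaining = [v for v in candidates if v not in W]
    while remaining:
        phi = remaining[0]
        p = dominate(phi, W)
        if p is None:
            remaining.remove(phi)  # phi never beats the current winners
        else:
            # Add the best remaining vector at p; ties go to the larger first
            # component, which wins just to the left of p.
            best = max(remaining, key=lambda v: (dot(p, v), v[0]))
            W.append(best)
            remaining.remove(best)
    return W


F = [(4, 0), (0, 4), (3, 3), (1, 1), (2, 2)]
print(filter_vectors(F))
```

In this example (1, 1) and (2, 2) are discarded because (3, 3) is at least as good everywhere, while (3, 3) survives because it wins for p between 1/4 and 3/4.
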
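Incremental Pruning
• Computes $S^a$ efficiently by interleaving cross sums and purges: $S^a = \mathrm{purge}(\cdots \mathrm{purge}(\mathrm{purge}(S^a_{z_1} \oplus S^a_{z_2}) \oplus S^a_{z_3}) \cdots \oplus S^a_{z_k})$.
• Conceptually easier than witness.
• Superior performance and asymptotic complexity.
• Let $A = \mathrm{purge}(A)$, $B = \mathrm{purge}(B)$, and $W = \mathrm{purge}(A \oplus B)$. Then $|W| \geq \max(|A|, |B|)$.
• The intermediate set never grows explosively compared to its final size.
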
Incremental Pruning
• We first construct all of the S(a, z) sets.
• We then form all combinations (the cross sum) of the S(a, z1) and S(a, z2) vectors.

Incremental Pruning
• This yields the new value function.
• We then eliminate all useless (light blue) vectors.

Incremental Pruning
• We are left with just three vectors.
• We then combine these three with the vectors in S(a, z3).
• This is repeated for the other action.

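The combine-then-prune sequence walked through above can be sketched as a fold over the per-observation sets. This is a two-state illustration only: the purge here finds the upper envelope by probing between line intersections instead of solving linear programs, and the three input sets are invented, so the intermediate counts differ from the slides' figures.

```python
from fractions import Fraction
from functools import reduce


def cross_sum(A, B):
    """All pairwise sums of a vector from A and a vector from B."""
    return [(a[0] + b[0], a[1] + b[1]) for a in A for b in B]


def purge(vectors):
    """Two-state purge: keep the vectors that are the unique maximum somewhere."""
    vectors = list(dict.fromkeys(vectors))  # drop exact duplicates
    if not vectors:
        return []
    points = {Fraction(0), Fraction(1)}
    for i, u in enumerate(vectors):
        for v in vectors[i + 1:]:
            # u and v have equal value where c0 + slope * p == 0.
            c0 = Fraction(u[0]) - Fraction(v[0])
            slope = Fraction(u[1]) - Fraction(v[1]) - c0
            if slope != 0:
                p = -c0 / slope
                if 0 < p < 1:
                    points.add(p)
    grid = sorted(points)
    keep = []
    # Between consecutive breakpoints the ranking of the vectors is fixed,
    # so probing each midpoint finds every piece of the upper envelope.
    for lo, hi in zip(grid, grid[1:]):
        p = (lo + hi) / 2
        best = max(vectors, key=lambda v: (1 - p) * v[0] + p * v[1])
        if best not in keep:
            keep.append(best)
    return keep


def incremental_pruning(sets):
    """Compute purge(S1 (+) S2 (+) ... (+) Sk), purging after every cross sum."""
    return reduce(lambda acc, nxt: purge(cross_sum(acc, nxt)),
                  sets[1:], purge(sets[0]))


# Hypothetical S(a, z) sets for one action and three observations.
S_z1 = [(4, 0), (0, 4)]
S_z2 = [(3, 0), (0, 3), (2, 2)]
S_z3 = [(1, 0), (0, 1)]

partial = purge(cross_sum(S_z1, S_z2))
result = incremental_pruning([S_z1, S_z2, S_z3])
print(len(partial), sorted(result))
```

Here the first cross sum has 6 vectors and is purged to 4 before it meets S_z3, so the second cross sum has 8 candidates instead of 12, and the final set is the same as purging the full cross sum at once.
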
Generalizing Incremental Pruning
• Modify FILTER to take advantage of the fact that the set of vectors has a great deal of regularity.
• Replace x ← DOMINATE(φ, W) with x ← DOMINATE(φ, D \ {φ}).
• Recall:
  • A ⊕ B: the set of vectors being filtered.
  • W: the set of winning vectors.
  • φ: the "winner" vector being tested against W.
  • D ⊆ A ⊕ B.

Generalizing Incremental Pruning
• D must satisfy any one of the following properties: (1) (2) (3) (4) (5) [equations not preserved]
• Different choices of D result in different incremental pruning algorithms.
• The smaller the set D, the more efficient the algorithm.

Generalizing Incremental Pruning
• The IP algorithm uses equation 1.
• A variation of the incremental pruning method that uses a combination of equations 4 and 5 is referred to as the restricted region (RR) algorithm.
• The asymptotic total number of linear programs does not change; RR actually requires slightly more linear programs than IP in the worst case.
• Empirically, however, the reduction in the total number of constraints usually saves more time than the extra linear programs cost.

Generalizing Incremental Pruning
[Figure: complete RR algorithm]

Empirical Results
[Figures: total execution time; total time spent constructing the $S^a$ sets]

Conclusions
• We examined the incremental pruning method for performing dynamic programming updates in partially observable Markov decision processes.
• It compares favorably in terms of ease of implementation to the simplest of the previous algorithms.
• It has asymptotic performance as good as or better than the most efficient of the previous algorithms and is empirically the fastest algorithm of its kind.

Conclusion
• In any event, even the slowest variation of the incremental pruning method that we studied is a consistent improvement over earlier algorithms.
• This algorithm will make it possible to greatly expand the set of POMDP problems that can be solved efficiently.
• Issues to be explored:
  • All algorithms studied have a precision parameter ε, whose meaning differs from algorithm to algorithm.
  • Develop better best-case and worst-case analyses for RR.