This document reviews classical AI planning methods and focuses on Markov Decision Processes (MDPs) and dynamic programming. It discusses key concepts such as state representation, action choices, transition functions, and reward structures, emphasizing uncertainty management in planning. It explores advanced techniques like Bayesian Networks and Decision Diagrams, along with state space abstraction methods that optimize computations in MDPs. The text delves into strategies aimed at achieving efficient planning through abstraction, aggregation, and decomposition to manage the curse of dimensionality in AI planning frameworks.
Logistics • Reading for Mon • No class Wed 11/26 (c) 2002-3, C. Boutilier, E. Hansen, D. Weld
Classical AI planning vs. Operations Research • Classical AI planning: no uncertainty; achieve goals; knowledge-based representation; heuristic search • Operations research: uncertainty; maximize utility; Markov decision process; dynamic programming • Active research area: where the two meet
Review • MDPs • Bayesian Networks • DBNs • Factored MDPs • BDDs & ADDs
Markov Decision Processes • S = set of states (|S| = n) • A = set of actions (|A| = m) • Pr = transition function Pr(s,a,s’) • represented by a set of m n x n stochastic matrices • each row of a matrix is a distribution over S • R(s) = bounded, real-valued reward function • represented by an n-vector
Planning • Plan? • Objective? • Policy? • Objective?
Dynamic programming (DP) • Value iteration [Bellman, 1957]: initial value function → DP improves value function → ε-optimal value function • Policy iteration [Howard, 1960]: initial policy → evaluate policy → DP improves policy → ε-optimal policy
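The value-iteration loop can be sketched in a few lines of Python. This is a minimal illustration rather than the lecture's own code: the two-state, two-action MDP, the discount factor `gamma`, and the stopping rule are assumptions chosen for the example. The transition function is stored as m row-stochastic n x n matrices and the reward as an n-vector, following the MDP definition earlier in the deck.

```python
def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """P[a][s][t] = Pr(s, a, t); R[s] = reward. Returns (V, greedy policy)."""
    n, m = len(R), len(P)
    V = [0.0] * n  # initial value function
    while True:
        # One DP sweep: a Bellman backup at every state and action.
        Q = [[R[s] + gamma * sum(P[a][s][t] * V[t] for t in range(n))
              for a in range(m)] for s in range(n)]
        V_new = [max(q) for q in Q]
        # Common stopping rule for an eps-optimal result (assumed here).
        if max(abs(x - y) for x, y in zip(V_new, V)) < eps * (1 - gamma) / (2 * gamma):
            return V_new, [q.index(max(q)) for q in Q]
        V = V_new

# Hypothetical toy MDP: action 0 stays put, action 1 tries to switch state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],   # action 0
    [[0.2, 0.8], [1.0, 0.0]],   # action 1
]
R = [0.0, 1.0]                  # only state 1 is rewarding
V, policy = value_iteration(P, R)
```

Every sweep touches all n states and all m actions, which is the cost that the abstraction and reachability methods later in the deck try to avoid.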
Bellman’s Curse of Dimensionality
Bayes Nets: Compact Rep’n of Joint Prob. Distribution • Nodes: Earthquake, Radio, Burglary, Alarm, Nbr1Calls, Nbr2Calls • Pr(B=t) = 0.05, Pr(B=f) = 0.95 • Pr(A|E,B), one row per truth assignment to (E,B), with the complementary probability in parentheses: 0.9 (0.1), 0.2 (0.8), 0.85 (0.15), 0.01 (0.99)
DBN Representation: DelC • Variables at t and t+1: RHM, M, T, L, CR, RHC • fT(Tt,Tt+1): Tt = T gives Tt+1 = T with 0.91, F with 0.09; Tt = F gives 0.0, 1.0 • fRHM(RHMt,RHMt+1): RHMt = T gives 1.0, 0.0; RHMt = F gives 0.0, 1.0 • fCR(Lt,CRt,RHCt,CRt+1), rows (L, CR, RHC) with Pr(CRt+1 = T), Pr(CRt+1 = F): O T T: 0.2, 0.8 • E T T: 1.0, 0.0 • O F T: 0.0, 1.0 • E F T: 0.0, 1.0 • O T F: 1.0, 0.0 • E T F: 1.0, 0.0 • O F F: 0.0, 1.0 • E F F: 0.0, 1.0
Benefits of DBN Representation • Explicit transition matrix over states s1 ... s160 (rows shown: s1: 0.9 0.05 ... 0.0; s2: 0.0 0.20 ... 0.1; s160: 0.1 0.0 ... 0.0) • DBN over RHM, M, T, L, CR, RHC: only 48 parameters vs. 25440 for matrix
Binary decision tree and OBDD • Example: (x3 and x2) or not x1 • [Figure: binary decision tree and OBDD for the example, over x1, x2, x3]
Action Representation – DBN/ADD • Algebraic Decision Diagram (ADD) for fCR(Lt,CRt,RHCt,CRt+1) • [Figure: DBN over RHM, M, T, L, CR, RHC at t and t+1; ADD testing CR, RHC, L (e/o) and CR(t+1), with leaves 0.0, 1.0, 0.2, 0.8]
Today – Solving the curse • Abstraction • Approximation • Reachability
Structured Computation • Given compact representation, can we solve MDP without explicit state space enumeration? • Can we avoid O(|S|)-computations by exploiting regularities made explicit by propositional or first-order representations? • Two general schemes: • abstraction/aggregation • decomposition
State Space Abstraction • General method: state aggregation • group states, treat aggregate as single state • commonly used in OR [SchPutKin85, BertCast89] • viewed as automata minimization [DeanGivan96] • Abstraction is a specific aggregation technique • aggregate by ignoring details (features) • ideally, focus on relevant features
Dimensions of Abstraction • Uniform vs. nonuniform • Exact vs. approximate • Adaptive vs. fixed • [Figure: example partitions of the state space over variables A, B, C; exact abstraction groups states with equal values (5.3, 2.9, 9.3), approximate abstraction groups states with nearby values (5.3, 5.2, 5.5; 2.9, 2.7; 9.3, 9.0)]
A Fixed, Uniform Approximate Abstraction Method • Uniformly delete features from domain [BD94/AIJ97] • Ignore features based on degree of relevance • rep’n used to determine importance to sol’n quality • Allows tradeoff between abstract MDP size and solution quality • [Figure: example over variables A, B, C with transition probabilities 0.5, 0.8, 0.5, 0.2]
Immediately Relevant Variables • Rewards determined by particular variables • impact on reward clear from STRIPS/ADD rep’n of R • e.g., difference between CR/-CR states is 10, while difference between T/-T states is 3, MW/-MW is 5 • Approximate MDP: focus on “important” goals • e.g., we might only plan for CR • we call CR an immediately relevant variable (IR) • generally, IR-set is a subset of reward variables
Relevant Variables • We want to control the IR variables • must know which actions influence these and under what conditions • A variable is relevant if it is a parent, in the DBN for some action a, of some relevant variable • ground (fixed pt) definition by making IR vars relevant • analogous def’n for PSTRIPS • e.g., CR (directly/indirectly) influenced by L, RHC, CR • Simple “backchaining” algorithm to construct set • linear in domain descr. size, number of relevant vars
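The backchaining step can be sketched as a fixed-point computation over the DBN parent sets. This is an illustrative sketch, not the authors' implementation: the function name, the `parents` dictionary layout, and the `Move` action are hypothetical, and only the parents of CR under DelC (L, CR, RHC) are taken from the slides.

```python
def relevant_variables(ir_vars, parents):
    """Backchain from the immediately relevant (IR) variables.

    parents[action][var] is the set of time-t parents of var at t+1
    in that action's DBN (assumed layout).
    """
    relevant = set(ir_vars)          # ground the definition with the IR set
    frontier = list(ir_vars)
    while frontier:
        var = frontier.pop()
        for dbn in parents.values():
            for p in dbn.get(var, ()):
                if p not in relevant:
                    relevant.add(p)  # a parent of a relevant variable is relevant
                    frontier.append(p)
    return relevant

# Hypothetical domain description; only CR's parents come from the slides.
parents = {
    "DelC": {"CR": {"L", "CR", "RHC"}, "RHC": {"RHC"}, "L": {"L"},
             "T": {"T"}, "M": {"M"}, "RHM": {"RHM"}},
    "Move": {"L": {"L"}},
}
rel = relevant_variables({"CR"}, parents)
```

Each variable enters the frontier at most once, so the work is proportional to the size of the parent tables that are actually visited.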
Constructing an Abstract MDP • Simply delete all irrelevant atoms from domain • state space S’: set of assts to relevant vars • transitions: let Pr(s’,a,t’) = Σ_{t ∈ t’} Pr(s,a,t) for any s ∈ s’ • construction ensures this is identical for all s ∈ s’ • reward: R(s’) = (max {R(s): s ∈ s’} + min {R(s): s ∈ s’}) / 2 • midpoint gives tight error bounds • Construction of DBN/PSTRIPS with these properties involves little more than simplifying action descriptions by deletion
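The midpoint reward can be illustrated with a small sketch that aggregates concrete states by their assignment to the relevant variables. The additive reward function below is an assumption built from the differences quoted earlier (10 for CR, 3 for T, 5 for MW); the function name and the restriction to boolean variables are also hypothetical, and only the reward part of the construction is shown.

```python
from itertools import product

def abstract_rewards(variables, relevant, reward):
    """Return (midpoint reward per abstract state, max reward span)."""
    groups = {}
    for values in product([False, True], repeat=len(variables)):
        s = dict(zip(variables, values))
        key = tuple(s[v] for v in relevant)      # abstract state s'
        groups.setdefault(key, []).append(reward(s))
    midpoints = {k: (max(r) + min(r)) / 2 for k, r in groups.items()}
    span = max(max(r) - min(r) for r in groups.values())
    return midpoints, span

# Assumed additive reward: penalties for CR, for not T, and for MW.
def reward(s):
    return -10 * s["CR"] - 3 * (not s["T"]) - 5 * s["MW"]

midpoints, span = abstract_rewards(["CR", "T", "MW"], ["CR"], reward)
```

Under these assumptions the abstract reward still separates CR from not-CR, while the T and MW penalties are folded into the midpoint; `span` is the largest reward spread inside any abstract state.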
Example • Abstract MDP • only 3 variables • 20 states instead of 160 • some actions become identical, so action space is simplified • reward distinguishes only CR and –CR (but “averages” penalties for MW and –T) • [Figure: DBN for the DelC action over L, CR, RHC, and the abstract reward]
Solving Abstract MDP • Abstract MDP can be solved using std methods • Error bounds on policy quality derivable • Let δ be max reward span over abstract states • Let V’ be optimal VF for M’, V* for original M • Let π’ be optimal policy for M’ and π* for original M
FUA Abstraction: Relative Merits • FUA easily computed (fixed polynomial cost) • FUA prioritizes objectives nicely • a priori error bounds computable (anytime tradeoffs) • can refine online (heuristic search) [DeaBou97] • FUA is inflexible • can’t capture conditional relevance • approximate (may want exact solution) • can’t be adjusted during computation • may ignore the only achievable objectives
Constructing Abstract MDPs • Many ways to abstract an MDP • methods will exploit the logical representation • Abstraction can be viewed as a form of automaton minimization • general minimization schemes require state space enumeration • Instead, exploit the logical structure of the domain (state, actions, rewards) to construct logical descriptions of abstract states, avoiding state enumeration
Decision-Theoretic Regression • Abstraction based on analog of regression • as abstraction: dynamic, nonuniform, exact/approx. • exploits logical representation of MDP • Overview • value iteration as variable elimination • propositional decision-theoretic regression • approximate decision-theoretic regression • first-order decision-theoretic regression
Classical Regression • Goal regression is a classical abstraction method • Regression of a logical condition/formula G through action a is a weakest logical formula C = Regr(G,a) such that: G is guaranteed to be true after doing a if C is true before doing a • Weakest precondition for G wrt a • [Figure: do(a) maps the C states into the G states]
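For conjunctive goals and STRIPS-style actions, regression has a simple closed form. The sketch below assumes that setting (the slide defines regression for general formulas); the function name and the deterministic delivery action with its atoms are hypothetical.

```python
def regress(goal, pre, add, delete):
    """Regress a conjunctive goal (a set of atoms) through a STRIPS action.

    Returns the weakest conjunction C such that the goal holds after the
    action whenever C holds before it, or None if the action deletes a
    goal atom (no such C exists).
    """
    if goal & delete:
        return None
    # Atoms the action adds need not hold beforehand; its preconditions must.
    return (goal - add) | pre

# Hypothetical deterministic delivery action.
pre, add, delete = {"InOffice", "HasCoffee"}, {"Delivered"}, {"HasCoffee"}
C = regress({"Delivered", "Tidy"}, pre, add, delete)
```

Here `C` groups every state satisfying InOffice, HasCoffee and Tidy into one abstract state, without enumerating those states.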
Example: Regression in SitCalc • For the situation calculus • Regr(G(do(a,s))): logical condition C(s) under which a leads to G (aggregates C states and ~C states) • Regression in sitcalc straightforward • Regr(F(x, do(a,s))) ≡ Φ_F(x,a,s) • Regr(¬ψ1) ≡ ¬Regr(ψ1) • Regr(ψ1 ∧ ψ2) ≡ Regr(ψ1) ∧ Regr(ψ2) • Regr(∃x.ψ1) ≡ ∃x.Regr(ψ1)
Decision-Theoretic Regression • In MDPs, we don’t have goals, but regions of distinct value • Decision-theoretic analog: given “logical description” of Vt+1, produce such a description of Vt or optimal policy (e.g., using ADDs) • Cluster together states at any point in calculation with same best action (policy), or with same value (VF)
Decision-Theoretic Regression • Decision-theoretic complications: • multiple formulae G describe fixed value partitions • a can lead to multiple partitions (stochastically) • [Figure: action a maps region C1 to value partitions G1, G2, G3 with probabilities p1, p2, p3; labels Qt+1(a) and Vt]
Functional View of DTR • Generally, Vt+1 depends on only a subset of variables @ t (usually in a structured way) • What is value of action a at time t (at any s)? • [Figure: DBN over RHM, M, T, L, CR, RHC with factors fRm(Rmt,Rmt+1), fM(Mt,Mt+1), fT(Tt,Tt+1), fL(Lt,Lt+1), fCr(Lt,Crt,Rct,Crt+1), fRc(Rct,Rct+1); Vt+1 shown as a tree over CR and M with leaves including -10 and 0]
Bellman Backup (Regression) • [Figure: decision diagrams combined in a single Bellman backup]
A Simple Action/Reward Example • Network Rep’n for Action A • Reward Function R • [Figure: DBN for action A over W, X, Y, Z with conditional probability trees (leaves 1.0, 0.9, 0.0), and reward tree R with leaves 10 and 0]
Example: Generation of V1 • V0 = R • V1 = R(s) + Maxa … P(Z|a,s) V0 • [Figure: trees over Y and Z for P(Z|a,s), P(Z|a,s)V0 and V1; values shown include 10.0, 9.0, 8.1, 0.0 and 19.0]
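The numbers in this example can be reproduced with one Bellman backup. The sketch below rests on a reading of the flattened figure, so treat every detail as an assumption: reward 10 when Z holds and 0 otherwise, Z persisting once true, Z becoming true with probability 0.9 when Y holds, a single action A, and a discount of 0.9 (chosen because it yields the 19.0 and 8.1 values that appear on the slide).

```python
GAMMA = 0.9  # assumed discount factor

def reward(z):
    """Assumed reward tree: 10 if Z holds, else 0."""
    return 10.0 if z else 0.0

def pr_z_next(y, z):
    """Assumed Pr(Z at t+1 | Y, Z at t) under action A."""
    if z:
        return 1.0
    return 0.9 if y else 0.0

def backup(v0):
    """One Bellman backup for action A; the value depends only on (Y, Z)."""
    v1 = {}
    for y in (True, False):
        for z in (True, False):
            p = pr_z_next(y, z)
            expected = p * v0(True) + (1 - p) * v0(False)
            v1[(y, z)] = reward(z) + GAMMA * expected
    return v1

V1 = backup(reward)  # V0 = R
```

Although four (Y, Z) assignments are enumerated here, only three distinct values arise (Z; not Z and Y; not Z and not Y), which is the structure a tree or ADD representation of V1 keeps explicit.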
Example: Generation of V2 • [Figure: V1 and the probability trees P(Y|a,s) and P(Z|a,s) over X, Y, Z used to compute V2]
Some Results: Natural Examples
Some Results: Worst-case
Some Results: Best-case
DTR: Relative Merits • Adaptive, nonuniform, exact abstraction method • provides exact solution to MDP • much more efficient on certain problems (time/space) • 400 million state problems (ADDs) in a couple hrs • Some drawbacks • produces piecewise constant VF • some problems admit no compact solution representation (though ADD overhead “minimal”) • approximation may be desirable or necessary
Criticisms of SPUDD
Future Work
Approximate DTR • Easy to approximate solution using DTR • Simple pruning of value function • Can prune trees [BouDearden96] or ADDs [StaubinHoeyBou00] • Gives regions of approximately same value
A Pruned Value ADD • [Figure: value ADD over HCU, HCR, W, R, U and Loc with leaves 9.00, 10.00, 5.19, 6.19, 5.62, 7.45, 8.45, 8.36, 6.64, 7.64, 6.81; after pruning, leaves become the ranges [9.00, 10.00], [5.19, 6.19], [7.45, 8.45], [6.64, 7.64]]
Approximate Structured VI • Run normal SVI using ADDs/DTs • at each leaf, record range of values • At each stage, prune interior nodes whose leaves all have values within some threshold δ • tolerance can be chosen to minimize error or size • tolerance can be adjusted to magnitude of VF • Convergence requires some care • If max span over leaves < δ and term. tol. < ε:
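The pruning step can be sketched on a decision-tree value function. This is an illustrative sketch only: the `Node` type, the tree shape, and the tolerance 1.05 are hypothetical, and a plain tree is used instead of an ADD to keep the example short. Leaves are numbers; a pruned region becomes a (low, high) range leaf.

```python
from collections import namedtuple

# Interior node testing a boolean variable (hypothetical representation).
Node = namedtuple("Node", ["var", "if_false", "if_true"])

def leaf_range(tree):
    """Return (min, max) over all leaf values under this node."""
    if isinstance(tree, Node):
        lo1, hi1 = leaf_range(tree.if_false)
        lo2, hi2 = leaf_range(tree.if_true)
        return min(lo1, lo2), max(hi1, hi2)
    if isinstance(tree, tuple):      # already a (low, high) range leaf
        return tree
    return tree, tree                # exact leaf

def prune(tree, delta):
    """Collapse every subtree whose leaf values span at most delta."""
    if not isinstance(tree, Node):
        return tree
    lo, hi = leaf_range(tree)
    if hi - lo <= delta:
        return (lo, hi)              # record the range of values at the new leaf
    return Node(tree.var, prune(tree.if_false, delta), prune(tree.if_true, delta))

# Hypothetical value tree using a few leaf values of the kind shown in the figure.
vf = Node("HCU",
          Node("W", 5.19, Node("R", 6.19, 5.62)),
          Node("W", 9.00, 10.00))
pruned = prune(vf, delta=1.05)
```

With this tolerance the five exact leaves collapse into two range leaves under HCU, which mirrors the kind of ranges that appear in a pruned value ADD.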
Approximate DTR: Relative Merits • Relative merits of ADTR • fewer regions implies faster computation • can provide leverage for optimal computation • 30-40 billion state problems in a couple hours • allows fine-grained control of time vs. solution quality with dynamic (a posteriori) error bounds • technical challenges: variable ordering, convergence, fixed vs. adaptive tolerance, etc. • Some drawbacks • (still) produces piecewise constant VF • doesn’t exploit additive structure of VF at all
Reachability
DP vs. heuristic search • Each iteration, DP improves solution for each state • DP solves problem for all possible starting states • Given a start state, heuristic search can find an optimal solution without evaluating all states • Implicit graph: all states • Explicit graph: states evaluated during search • Solution graph: all states reachable by optimal solution • [Figure: start state with nested solution, explicit and implicit graphs]
Solution structures • Solution path • Acyclic solution graph • Cyclic solution graph
DP vs. heuristic search • Heuristic search = dynamic programming + starting state + forward expansion of solution + admissible heuristic