Logistics
• Reading for Mon
• No class Wed 11/26
(c) 2002-3, C. Boutilier, E. Hansen, D. Weld
Classical AI planning vs. Operations Research
• Classical AI planning: no uncertainty, achieve goals, knowledge-based representation, heuristic search
• Operations Research: uncertainty, maximize utility, Markov decision process, dynamic programming
• Active research area
Review
• MDPs
• Bayesian Networks
• DBNs
• Factored MDPs
• BDDs & ADDs
Markov Decision Processes
• S = set of states (|S| = n)
• A = set of actions (|A| = m)
• Pr = transition function Pr(s,a,s’)
  • represented by a set of m n x n stochastic matrices
  • each row of a matrix defines a distribution over S
• R(s) = bounded, real-valued reward function
  • represented by an n-vector
Planning
• Plan?
  • Objective?
• Policy?
  • Objective?
Dynamic programming (DP)
• Value iteration [Bellman, 1957]: initial value function → DP improves value function → ε-optimal value function
• Policy iteration [Howard, 1960]: initial policy → evaluate policy → DP improves policy → ε-optimal policy
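
The value iteration loop named on this slide can be sketched as follows. This is a minimal illustration, not the lecture's own code: the 3-state, 2-action MDP, the discount factor GAMMA and the tolerance EPSILON are all assumptions made up for the example.

```python
# Value iteration sketch on a hypothetical 3-state, 2-action MDP.
# P[a][s][t] is Pr(s, a, t); each row of each matrix sums to 1.
GAMMA = 0.9      # assumed discount factor
EPSILON = 1e-6   # assumed stopping tolerance

P = [
    [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]],  # action 0
    [[1.0, 0.0, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]],  # action 1
]
R = [0.0, 0.0, 10.0]  # reward vector R(s)


def backup(V, s):
    """One Bellman backup at state s: returns (value, greedy action)."""
    q = [sum(P[a][s][t] * V[t] for t in range(len(V))) for a in range(len(P))]
    best = max(range(len(P)), key=lambda a: q[a])
    return R[s] + GAMMA * q[best], best


def value_iteration():
    V = list(R)  # initial value function V0 = R
    while True:
        new_V = [backup(V, s)[0] for s in range(len(V))]
        residual = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if residual < EPSILON:
            # Read off the greedy policy from the converged value function
            policy = [backup(V, s)[1] for s in range(len(V))]
            return V, policy


V, policy = value_iteration()
```

Every sweep touches every state, which is exactly the O(|S|) cost that the rest of the lecture tries to avoid.
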
Bellman’s Curse of Dimensionality
Bayes Nets: Compact Rep’n of Joint Prob. Distribution
[Figure: network over Earthquake, Radio, Burglary, Alarm, Nbr1Calls, Nbr2Calls]
Pr(B=t) = 0.05, Pr(B=f) = 0.95
Pr(A|E,B):
  E, B      Pr(A=t)  (Pr(A=f))
  e, b      0.9      (0.1)
  e, ¬b     0.2      (0.8)
  ¬e, b     0.85     (0.15)
  ¬e, ¬b    0.01     (0.99)
DBN Representation: DelC
[Figure: two-slice DBN over RHM, M, T, L, CR, RHC at times t and t+1]
fT(Tt,Tt+1):
  T(t)   T(t+1)=T   T(t+1)=F
  T      0.91       0.09
  F      0.0        1.0
fRHM(RHMt,RHMt+1):
  RHM(t)   RHM(t+1)=T   RHM(t+1)=F
  T        1.0          0.0
  F        0.0          1.0
fCR(Lt,CRt,RHCt,CRt+1):
  L   CR   RHC   CR(t+1)=T   CR(t+1)=F
  O   T    T     0.2         0.8
  E   T    T     1.0         0.0
  O   F    T     0.0         1.0
  E   F    T     0.0         1.0
  O   T    F     1.0         0.0
  E   T    F     1.0         0.0
  O   F    F     0.0         1.0
  E   F    F     0.0         1.0
Benefits of DBN Representation
[Figure: the two-slice DBN over RHM, M, T, L, CR, RHC next to the explicit transition matrix]
         s1     s2     ...   s160
  s1     0.9    0.05   ...   0.0
  s2     0.0    0.20   ...   0.1
  ...
  s160   0.1    0.0    ...   0.0
• Only 48 parameters vs. 25440 for the matrix
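
The matrix figure on this slide can be checked with a few lines of arithmetic. The domain sizes below are an assumption (five boolean variables plus a location L with five values), inferred only from the 160-state total. The DBN's 48 parameters depend on the exact shape of the conditional probability tables and are not recomputed here.

```python
from math import prod

# Assumed domain sizes: five boolean variables and a five-valued location L.
# This is an inference from the 160-state total, not something the slide states.
domain_sizes = {"RHM": 2, "M": 2, "T": 2, "L": 5, "CR": 2, "RHC": 2}

n_states = prod(domain_sizes.values())

# Each of the n rows of a stochastic matrix has n - 1 free entries,
# because the row must sum to 1.
matrix_params = n_states * (n_states - 1)

print(n_states, matrix_params)
```
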
OBDD and binary decision tree
[Figure: binary decision tree over x3, x2, x1 and the equivalent OBDD]
Example: (x3 and x2) or not x1
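
To make the tree-versus-OBDD contrast concrete, here is a small sketch that builds a reduced ordered decision diagram for the slide's example function by Shannon expansion. The variable order, the node numbering and the helper names are assumptions chosen for illustration. It is not a production BDD package, and it enumerates all assignments, which real packages avoid.

```python
from itertools import product

ORDER = ["x3", "x2", "x1"]  # assumed variable order, top to bottom


def f(env):
    """The slide's example function: (x3 and x2) or not x1."""
    return (env["x3"] and env["x2"]) or not env["x1"]


unique = {}  # (var, low, high) -> node id; identical subgraphs are shared


def build(level=0, env=None):
    """Return a node id: 0 and 1 are terminals, larger ids are test nodes."""
    env = env or {}
    if level == len(ORDER):
        return int(bool(f(env)))
    var = ORDER[level]
    low = build(level + 1, {**env, var: False})
    high = build(level + 1, {**env, var: True})
    if low == high:  # redundant test: skip this node
        return low
    key = (var, low, high)
    if key not in unique:
        unique[key] = len(unique) + 2
    return unique[key]


root = build()
nodes = {node_id: key for key, node_id in unique.items()}


def evaluate(node, env):
    """Follow one path from node down to a terminal."""
    while node > 1:
        var, low, high = nodes[node]
        node = high if env[var] else low
    return bool(node)


# The diagram agrees with f on all 8 assignments
agrees = all(
    evaluate(root, dict(zip(ORDER, bits))) == bool(f(dict(zip(ORDER, bits))))
    for bits in product([False, True], repeat=3)
)
```

Under this order the diagram needs 3 test nodes, whereas a complete binary decision tree over three variables has 7.
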
Action Representation – DBN/ADD
[Figure: two-slice DBN over RHM, M, T, L, CR, RHC, with fCR(Lt,CRt,RHCt,CRt+1) drawn as an Algebraic Decision Diagram (ADD) that tests CR, RHC, L (e/o) and CR(t+1), with leaves 0.0, 1.0, 0.2, 0.8]
Today – Solving the curse
• Abstraction
• Approximation
• Reachability
Structured Computation
• Given a compact representation, can we solve the MDP without explicit state space enumeration?
• Can we avoid O(|S|)-computations by exploiting regularities made explicit by propositional or first-order representations?
• Two general schemes:
  • abstraction/aggregation
  • decomposition
State Space Abstraction
• General method: state aggregation
  • group states, treat aggregate as single state
  • commonly used in OR [SchPutKin85, BertCast89]
  • viewed as automata minimization [DeanGivan96]
• Abstraction is a specific aggregation technique
  • aggregate by ignoring details (features)
  • ideally, focus on relevant features
Dimensions of Abstraction
[Figure: states described by features A, B, C, grouped along three dimensions]
• Uniform vs. nonuniform
• Exact vs. approximate (exact: states in a group share a value, e.g. 5.3 5.3 5.3 5.3 2.9 2.9 9.3 9.3; approximate: values in a group differ slightly, e.g. 5.3 5.2 5.5 5.3 2.9 2.7 9.3 9.0)
• Adaptive vs. fixed
A Fixed, Uniform Approximate Abstraction Method
• Uniformly delete features from domain [BD94/AIJ97]
• Ignore features based on degree of relevance
  • rep’n used to determine importance to sol’n quality
• Allows tradeoff between abstract MDP size and solution quality
[Figure: states over features A, B, C aggregated into abstract states, with transition probabilities 0.5, 0.8, 0.5, 0.2]
Immediately Relevant Variables
• Rewards determined by particular variables
  • impact on reward clear from STRIPS/ADD rep’n of R
  • e.g., difference between CR/-CR states is 10, while difference between T/-T states is 3, MW/-MW is 5
• Approximate MDP: focus on “important” goals
  • e.g., we might only plan for CR
  • we call CR an immediately relevant variable (IR)
  • generally, IR-set is a subset of reward variables
Relevant Variables
• We want to control the IR variables
  • must know which actions influence these and under what conditions
• A variable is relevant if, for some action a, it is a parent in the DBN of some relevant variable
  • grounded (fixed-point) definition: start by making the IR vars relevant
  • analogous def’n for PSTRIPS
  • e.g., CR is (directly/indirectly) influenced by L, RHC, CR
• Simple “backchaining” algorithm to construct the set
  • linear in domain description size and number of relevant vars
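
The backchaining idea can be sketched as a fixed-point computation over DBN parent sets. The `parents` dictionary below is hypothetical. It is loosely modelled on the coffee-delivery example, and the "Move" action is invented for illustration.

```python
def relevant_variables(parents, immediately_relevant):
    """Backchain from the IR variables through DBN parents.

    parents[action][var] is the set of time-t variables that var at
    time t+1 depends on under that action.
    """
    relevant = set(immediately_relevant)
    frontier = list(immediately_relevant)
    while frontier:
        var = frontier.pop()
        for action_parents in parents.values():
            for parent in action_parents.get(var, ()):
                if parent not in relevant:
                    relevant.add(parent)
                    frontier.append(parent)
    return relevant


# Hypothetical parent sets; only DelC's entry for CR follows the slides
parents = {
    "DelC": {"CR": {"L", "CR", "RHC"}, "RHC": {"RHC"}, "L": {"L"},
             "T": {"T"}, "M": {"M"}, "RHM": {"RHM"}},
    "Move": {"L": {"L"}, "CR": {"CR"}, "RHC": {"RHC"},
             "T": {"T"}, "M": {"M"}, "RHM": {"RHM"}},
}

result = sorted(relevant_variables(parents, {"CR"}))
print(result)  # ['CR', 'L', 'RHC']
```

Each variable enters the frontier at most once, so the work is bounded by the size of the parent sets of the relevant variables.
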
Constructing an Abstract MDP
• Simply delete all irrelevant atoms from domain
• state space S’: set of assignments to relevant vars
• transitions: let Pr(s’,a,t’) = Σ_{t ∈ t’} Pr(s,a,t) for any s ∈ s’
  • construction ensures this is identical for all s ∈ s’
• reward: R(s’) = (max {R(s): s ∈ s’} + min {R(s): s ∈ s’}) / 2
  • midpoint gives tight error bounds
• Construction of DBN/PSTRIPS with these properties involves little more than simplifying action descriptions by deletion
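
The reward part of this construction can be sketched in a few lines. The example below only illustrates the midpoint rule and the reward span, and it enumerates states, which the structured method avoids. The additive reward function is hypothetical, built from the differences quoted earlier (10 for CR, 3 for T, 5 for MW), and the function names are not from the lecture.

```python
from itertools import product


def abstract_rewards(variables, relevant, reward):
    """Group states by their relevant-variable values.

    Returns (midpoint reward, reward span) per abstract state.
    """
    groups = {}
    for values in product([False, True], repeat=len(variables)):
        state = dict(zip(variables, values))
        key = tuple(state[v] for v in relevant)
        groups.setdefault(key, []).append(reward(state))
    midpoint = {k: (max(r) + min(r)) / 2 for k, r in groups.items()}
    span = {k: max(r) - min(r) for k, r in groups.items()}
    return midpoint, span


def reward(state):
    """Hypothetical additive reward: CR, -T and MW are each penalized."""
    return ((0 if state["CR"] else 10)
            + (3 if state["T"] else 0)
            + (0 if state["MW"] else 5))


midpoint, span = abstract_rewards(["CR", "T", "MW"], ["CR"], reward)
```

Keeping only CR yields two abstract states, and the largest span over them is the quantity that the error bounds on the abstract solution depend on.
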
Example
• Abstract MDP
  • only 3 variables
  • 20 states instead of 160
  • some actions become identical, so action space is simplified
  • reward distinguishes only CR and -CR (but “averages” penalties for MW and -T)
[Figure: DelC action restricted to L, CR, RHC at times t and t+1, and the abstract reward]
Solving Abstract MDP
• Abstract MDP can be solved using standard methods
• Error bounds on policy quality derivable
  • Let δ be max reward span over abstract states
  • Let V’ be optimal VF for M’, V* for original M
  • Let π’ be optimal policy for M’ and π* for original M
FUA Abstraction: Relative Merits
• FUA easily computed (fixed polynomial cost)
• FUA prioritizes objectives nicely
  • a priori error bounds computable (anytime tradeoffs)
  • can refine online (heuristic search) [DeaBou97]
• FUA is inflexible
  • can’t capture conditional relevance
  • approximate (may want exact solution)
  • can’t be adjusted during computation
  • may ignore the only achievable objectives
Constructing Abstract MDPs
• Many ways to abstract an MDP
  • methods will exploit the logical representation
• Abstraction can be viewed as a form of automaton minimization
  • general minimization schemes require state space enumeration
• Instead, exploit the logical structure of the domain (state, actions, rewards) to construct logical descriptions of abstract states, avoiding state enumeration
Decision-Theoretic Regression
• Abstraction based on analog of regression
  • as abstraction: dynamic, nonuniform, exact/approx.
  • exploits logical representation of MDP
• Overview
  • value iteration as variable elimination
  • propositional decision-theoretic regression
  • approximate decision-theoretic regression
  • first-order decision-theoretic regression
Classical Regression
• Goal regression is a classical abstraction method
• Regression of a logical condition/formula G through action a is a weakest logical formula C = Regr(G,a) such that G is guaranteed to be true after doing a if C is true before doing a
• C is the weakest precondition for G wrt a
[Figure: do(a) maps the region C into the region G]
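
For a deterministic STRIPS-style action, regression of a conjunctive goal reduces to set operations. The sketch below assumes the action applies only when its precondition holds. The action and atom names are hypothetical and do not come from the lecture.

```python
def regress(goal, pre, add, delete):
    """Regress a conjunctive goal (a set of atoms) through a STRIPS action.

    Returns None when the action deletes part of the goal, since then no
    condition guarantees the goal after acting.
    """
    if goal & delete:
        return None
    # Atoms the action adds need not hold beforehand; the precondition must.
    return frozenset((goal - add) | pre)


# Hypothetical deterministic coffee-delivery action
pre = frozenset({"in_office", "has_coffee"})
add = frozenset({"request_served"})
delete = frozenset({"has_coffee"})

condition = regress(frozenset({"request_served", "tidy"}), pre, add, delete)
blocked = regress(frozenset({"has_coffee"}), pre, add, delete)
```

Here `condition` describes every state from which the action achieves the goal, without listing those states one by one.
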
Example: Regression in SitCalc
• For the situation calculus
  • Regr(G(do(a,s))): logical condition C(s) under which a leads to G (aggregates C states and ~C states)
• Regression in sitcalc is straightforward
  • Regr(F(x, do(a,s))) ≡ Φ_F(x,a,s)
  • Regr(¬φ1) ≡ ¬Regr(φ1)
  • Regr(φ1 ∧ φ2) ≡ Regr(φ1) ∧ Regr(φ2)
  • Regr(∃x.φ1) ≡ ∃x.Regr(φ1)
Decision-Theoretic Regression
• In MDPs, we don’t have goals, but regions of distinct value
• Decision-theoretic analog: given a “logical description” of Vt+1, produce such a description of Vt or of the optimal policy (e.g., using ADDs)
• Cluster together states at any point in the calculation with the same best action (policy), or with the same value (VF)
Decision-Theoretic Regression
• Decision-theoretic complications:
  • multiple formulae G describe fixed value partitions
  • a can lead to multiple partitions (stochastically)
[Figure: from region C1, action a reaches regions G1, G2, G3 of Vt with probabilities p1, p2, p3, giving Qt+1(a)]
Functional View of DTR
• Generally, Vt+1 depends on only a subset of variables at time t (usually in a structured way)
• What is the value of action a at time t (at any s)?
[Figure: two-slice DBN over RHM, M, T, L, CR, RHC with factors fRm(Rmt,Rmt+1), fM(Mt,Mt+1), fT(Tt,Tt+1), fL(Lt,Lt+1), fCr(Lt,Crt,Rct,Crt+1), fRc(Rct,Rct+1), and Vt+1 drawn as a tree over CR and M with leaves including -10 and 0]
Bellman Backup (Regression)
[Figure: the action ADD over CR, RHC, L and CR(t+1), with leaves 0.0, 1.0, 0.2, 0.8, combined with a value ADD to perform one Bellman backup]
A Simple Action/Reward Example
[Figure: network rep’n for action A over variables W, X, Y, Z, and reward function R with values 10 and 0]
Example: Generation of V1
[Figure: starting from V0 = R, compute P(Z|a,s), then P(Z|a,s)V0, then V1 = R(s) + Maxa …; the trees test Y and Z, and leaf values such as 19.0, 8.1 and 0.0 appear]
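
One way to read this example is as a Bellman backup that only ever mentions Y and Z. The sketch below rests on several assumptions that the extracted slide does not confirm: a discount factor of 0.9, a reward of 10 when Z holds and 0 otherwise, and a CPT in which Z persists once true and otherwise becomes true with probability 0.9 when Y holds. Under those assumptions it reproduces the values 19.0, 8.1 and 0.0 seen in the figure.

```python
from itertools import product

GAMMA = 0.9  # assumed discount factor


def reward(z):
    """Assumed reward: depends on Z only."""
    return 10.0 if z else 0.0


def prob_z_next(y, z):
    """Assumed CPT for Z under action A."""
    if z:
        return 1.0
    return 0.9 if y else 0.0


def backup_v1():
    """V1(y, z) = R(z) + GAMMA * sum over z' of P(z' | y, z) * V0(z'), V0 = R."""
    v1 = {}
    for y, z in product([True, False], repeat=2):
        p = prob_z_next(y, z)
        expected_v0 = p * reward(True) + (1 - p) * reward(False)
        v1[(y, z)] = round(reward(z) + GAMMA * expected_v0, 6)
    return v1


v1 = backup_v1()
regions = sorted(set(v1.values()))
```

With a single action there is no maximization step. W and X never enter the computation, so V1 splits the state space into three regions of equal value regardless of how many other variables the domain has.
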
Example: Generation of V2
[Figure: V1 regressed through action a using P(Y|a,s) and P(Z|a,s); the trees test X, Y and Z]
Some Results: Natural Examples
Some Results: Worst-case
Some Results: Best-case
DTR: Relative Merits
• Adaptive, nonuniform, exact abstraction method
  • provides exact solution to MDP
  • much more efficient on certain problems (time/space)
  • 400 million state problems (ADDs) in a couple hrs
• Some drawbacks
  • produces piecewise constant VF
  • some problems admit no compact solution representation (though ADD overhead “minimal”)
  • approximation may be desirable or necessary
Criticisms of SPUDD
Future Work
Approximate DTR
• Easy to approximate solution using DTR
• Simple pruning of value function
  • Can prune trees [BouDearden96] or ADDs [StaubinHoeyBou00]
• Gives regions of approximately same value
A Pruned Value ADD
[Figure: a value ADD over HCU, HCR, W, R, U with leaves 9.00, 10.00, 5.19, 7.45, 6.64, 6.19, 5.62, 8.45, 8.36, 7.64, 6.81, and its pruned version over HCU, HCR and Loc with range leaves [9.00, 10.00], [5.19, 6.19], [7.45, 8.45], [6.64, 7.64]]
Approximate Structured VI
• Run normal SVI using ADDs/DTs
  • at each leaf, record range of values
• At each stage, prune interior nodes whose leaves all have values within some threshold δ
  • tolerance can be chosen to minimize error or size
  • tolerance can be adjusted to magnitude of VF
• Convergence requires some care
• If max span over leaves < δ and term. tol. < ε:
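
The pruning step can be sketched on a plain decision tree. The tuple encoding and the example tree are assumptions for illustration. The tree reuses a few leaf values from the pruned-ADD figure, but its shape is invented. Real implementations prune ADDs, which also share subgraphs, and that is not modelled here.

```python
def leaf_range(node):
    """Return (min, max) over all numeric leaves under node.

    A node is either a number (leaf) or a tuple (var, low, high).
    """
    if isinstance(node, (int, float)):
        return (float(node), float(node))
    _, low, high = node
    lo1, hi1 = leaf_range(low)
    lo2, hi2 = leaf_range(high)
    return (min(lo1, lo2), max(hi1, hi2))


def prune(node, delta):
    """Collapse any subtree whose leaves span at most delta into a range leaf."""
    lo, hi = leaf_range(node)
    if isinstance(node, (int, float)) or hi - lo <= delta:
        return (lo, hi)  # range leaf: records [min, max] of merged values
    var, low, high = node
    return (var, prune(low, delta), prune(high, delta))


# Hypothetical tree shape; leaf values borrowed from the figure
tree = ("HCU", ("W", 5.19, 6.19), ("HCR", 9.00, 10.00))

pruned = prune(tree, 1.05)
```

Each range leaf keeps the minimum and maximum of the values it replaced, so the recorded spans can be compared against δ later on.
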
Approximate DTR: Relative Merits
• Relative merits of ADTR
  • fewer regions implies faster computation
  • can provide leverage for optimal computation
  • 30-40 billion state problems in a couple hours
  • allows fine-grained control of time vs. solution quality with dynamic (a posteriori) error bounds
  • technical challenges: variable ordering, convergence, fixed vs. adaptive tolerance, etc.
• Some drawbacks
  • (still) produces piecewise constant VF
  • doesn’t exploit additive structure of VF at all
Reachability
DP vs. heuristic search
• Each iteration, DP improves the solution for each state
• DP solves the problem for all possible starting states
• Given a start state, heuristic search can find an optimal solution without evaluating all states
[Figure: nested graphs around the start state]
• Solution graph: all states reachable by optimal solution
• Explicit graph: states evaluated during search
• Implicit graph: all states
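
The role of the start state can be illustrated with plain forward expansion. This sketch is not a heuristic search algorithm. It only computes the states reachable from the start under any action, which is a superset of the solution graph. The transition table is hypothetical.

```python
from collections import deque


def reachable_states(start, successors):
    """Breadth-first forward expansion from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen


# Hypothetical successor lists (union over all actions) for five states
transitions = {0: [1, 2], 1: [2], 2: [2], 3: [0], 4: [3]}

reachable = reachable_states(0, lambda s: transitions[s])
```

States 3 and 4 are never visited from start state 0, so a planner that commits to that start state never has to evaluate them, whereas DP updates all five.
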
Solution structures
• Solution path
• Acyclic solution graph
• Cyclic solution graph
DP vs. heuristic search
Heuristic search = dynamic programming
  + starting state
  + forward expansion of solution
  + admissible heuristic