Dynamic Programming Applications Lecture 6 Infinite Horizon
Infinite horizon
Rules of the game:
• Infinite number of stages
• Stationary system
• Finite number of states
Why do we care?
• Good approximation for problems with many stages
• Analysis is elegant and insightful
• Implementation of the optimal policy is simple
• Stationary policy: π = {μ, μ, …}
Total Cost Problems
Jπ(x0) = lim_{N→∞} E{ Σ_{k=0}^{N−1} α^k g(xk, μk(xk), wk) }
J*(x0) = min_π Jπ(x0)
• Stochastic Shortest Paths (SSP): α = 1
Objective: reach a cost-free termination state
• Discounted problems w/bounded cost/stage: α < 1
|g| < M, so Jπ(x0) < M/(1−α) is well defined (e.g. if the state and control sets are finite)
• Discounted problems w/unbounded cost/stage: α ≤ 1
Hard: we don't do it here.
Average Cost Problems
• Average cost/stage:
Jπ(x0) = lim_{N→∞} (1/N) E{ Σ_{k=0}^{N−1} g(xk, μk(xk), wk) }
• This is well defined and finite for all feasible policies π and states x0
• LATER
Preview
• Convergence: J*(x) = lim_{N→∞} J*_N(x), for all x
• Limiting solution (Bellman equation):
J*(x) = min_u E_w{ g(x,u,w) + J*(f(x,u,w)) }
• Optimal stationary policy: the μ(x) that attains the minimum above.
SSP
• Finite constraint set U(i) for all i
• Cost-free termination state 0: p00(u) = 1, g(0,u,0) = 0 for all u ∈ U(0)
Special cases:
• Deterministic shortest paths
• Finite horizon problems
Shorthand
• J = (J(1), …, J(n)); J(0) = 0
• TJ(i) = min_{u∈U(i)} Σ_j p_ij(u) ( g(i,u,j) + J(j) )
TJ: optimal cost-to-go for the one-stage problem w/cost per stage g and terminal cost J.
• TμJ(i) = Σ_j p_ij(μ(i)) ( g(i,μ(i),j) + J(j) )
TμJ: cost-to-go under μ for the one-stage problem w/cost per stage g and terminal cost J.
Shorthand
TμJ = gμ + PμJ
• where gμ(i) = Σ_j p_ij(μ(i)) g(i,μ(i),j)
• and Pμ = ( p_ij(μ(i)) ) for i,j = 1,…,n (terminal state 0 excluded)
TJ = g + PJ
• where g(i) = Σ_j p_ij(μ*(i)) g(i,μ*(i),j)
• and P = Pμ*, with μ* the minimizing policy in TJ
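To make the shorthand concrete, here is a minimal Python sketch of T and Tμ for a small SSP, assuming transitions are stored as an array P[u][i][j] = p_ij(u) and stage costs as G[u][i][j] = g(i,u,j), with state 0 as the terminal state. All names (P, G, bellman_T, bellman_T_mu) are illustrative, not from the lecture.

```python
import numpy as np

def bellman_T(J, P, G):
    """TJ(i) = min_u sum_j p_ij(u) * (g(i,u,j) + J(j)); J(0) stays pinned at 0."""
    num_u, n = P.shape[0], P.shape[1]
    TJ = np.zeros(n)
    for i in range(1, n):  # skip the cost-free terminal state 0
        TJ[i] = min(np.dot(P[u, i], G[u, i] + J) for u in range(num_u))
    return TJ

def bellman_T_mu(J, P, G, mu):
    """T_mu J(i) = sum_j p_ij(mu(i)) * (g(i,mu(i),j) + J(j))."""
    n = P.shape[1]
    TJ = np.zeros(n)
    for i in range(1, n):
        u = mu[i]
        TJ[i] = np.dot(P[u, i], G[u, i] + J)
    return TJ
```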
Value Iteration
• T^k J = T(T^{k−1} J), T^0 J = J
• T^k J: optimal cost-to-go for the k-stage problem w/cost per stage g and terminal cost J
• …and similarly for Tμ
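A sketch of value iteration as repeated application of the operator above, reusing bellman_T from the previous block; the sup-norm stopping tolerance is an assumed convenience, since in general only the limit J* is guaranteed.

```python
def value_iteration(P, G, J0=None, tol=1e-8, max_iter=10_000):
    """Iterate J_{k+1} = T J_k until successive iterates agree to within tol."""
    J = np.zeros(P.shape[1]) if J0 is None else J0.copy()
    for _ in range(max_iter):
        J_next = bellman_T(J, P, G)
        if np.max(np.abs(J_next - J)) < tol:  # sup-norm stopping test
            return J_next
        J = J_next
    return J
```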
T Properties
Monotonicity Lemma: If J ≤ J′ and μ is stationary, then
T^k J ≤ T^k J′ and Tμ^k J ≤ Tμ^k J′.
Subadditivity: If μ is stationary, e = (1,1,…,1), and r > 0, then
T^k(J + re)(i) ≤ T^k J(i) + r and Tμ^k(J + re)(i) ≤ Tμ^k J(i) + r.
Assumptions
Define: Proper stationary policy μ: the terminal state is reachable from every state w.p. > 0 (within n stages).
Assumptions:
1. There exists at least one proper μ.
2. The cost-to-go Jπ(i) of every improper μ is infinite for some i.
2′. Expected cost/stage: g(i,u) = Σ_j p_ij(u) g(i,u,j) ≥ 0 for all i ≠ 0 and u ∈ U(i).
What do these mean in the deterministic case?
Alternative Assumption
• There exists an integer m such that, for every policy π and initial state x, the probability of reaching the terminal state from x within m stages under π is positive. (Assumption 3)
• This is a stronger assumption than 1 & 2.
Main Theorem
Under Assumptions 1 and 2 (or under 3):
1. lim_{k→∞} T^k J = J*, for every vector J.
2. J* = TJ*, and J* is the only solution of J = TJ.
3. For any proper policy μ and every vector J, lim_{k→∞} Tμ^k J = Jμ; moreover Jμ = TμJμ and Jμ is its only solution.
4. A stationary μ is optimal iff TμJ* = TJ*.
Lemma
Suppose all stationary policies are proper. Then there exists a weight vector ξ > 0 s.t. for all stationary μ, T and Tμ are contraction mappings w.r.t. the weighted max-norm ||·||ξ:
• weighted max-norm: ||J||ξ = max_i |J(i)| / ξ(i)
• contraction mapping: ||TJ − TJ′||ξ ≤ β ||J − J′||ξ for some β < 1
How to find J* and μ*?
• Value iteration
• Policy iteration
• Variants
Asynchronous Value Iteration
• Start with an arbitrary J0.
• Stage k: pick a state ik and iterate Jk+1(ik) = TJk(ik) (all the rest stays the same: Jk+1(i) = Jk(i) for i ≠ ik).
• Assume each state is chosen as ik infinitely often.
• Then Jk → J*.
This is also called the Gauss-Seidel method.
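A minimal Gauss-Seidel sweep under the same P, G conventions as before: states are updated one at a time, in place, so later updates within a sweep already see the newer values. Sweeping in a fixed order is one simple way to guarantee each state is chosen infinitely often.

```python
def gauss_seidel_sweep(J, P, G):
    """One in-place sweep: each non-terminal state gets one asynchronous update."""
    num_u, n = P.shape[0], P.shape[1]
    for i in range(1, n):  # later states see the values just written above
        J[i] = min(np.dot(P[u, i], G[u, i] + J) for u in range(num_u))
    return J
```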
Decomposition
• Suppose S can be partitioned into S1, S2, …, SM so that if i ∈ Sm then, under any policy, the successor state is j = 0 or j ∈ Sm−k for some k with m−1 ≥ k ≥ 0.
• Then the solution decomposes into the sequential solution of M SSPs, each solved using the optimal solutions of the preceding subproblems.
• If k > 0 above (successors always lie in strictly lower-indexed sets), then the Gauss-Seidel method that iterates on states in order of their membership in Sm needs only one iteration per state to reach the optimum (e.g. finite horizon problems).
Policy Iteration
• Start with a given policy μk:
• Policy evaluation step: compute Jμk(i) by solving the linear system (with J(0) = 0):
J = gμk + PμkJ
• Policy improvement step: compute the new policy μk+1 as the solution of Tμk+1Jμk = TJμk, that is:
μk+1(i) = arg min_u Σ_j p_ij(u) ( g(i,u,j) + Jμk(j) )
• Terminate when Jμk = Jμk+1 (no improvement): μ = μk
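A sketch of both steps under the earlier P, G conventions: evaluation by a direct linear solve of J = gμ + PμJ over the non-terminal states (J(0) = 0 lets us drop the terminal row and column), then greedy improvement. The function and variable names are illustrative.

```python
import numpy as np

def policy_iteration(P, G, mu0):
    """Exact policy iteration for the SSP; state 0 is terminal, J(0) = 0."""
    num_u, n = P.shape[0], P.shape[1]
    mu = mu0.copy()
    while True:
        # Evaluation: solve (I - P_mu) J = g_mu on states 1..n-1.
        P_mu = np.array([P[mu[i], i] for i in range(n)])
        g_mu = np.array([np.dot(P[mu[i], i], G[mu[i], i]) for i in range(n)])
        J = np.zeros(n)
        J[1:] = np.linalg.solve(np.eye(n - 1) - P_mu[1:, 1:], g_mu[1:])
        # Improvement: mu'(i) = argmin_u sum_j p_ij(u)(g(i,u,j) + J(j)).
        mu_next = mu.copy()
        for i in range(1, n):
            q = [np.dot(P[u, i], G[u, i] + J) for u in range(num_u)]
            mu_next[i] = int(np.argmin(q))
        if np.array_equal(mu_next, mu):  # no improvement: mu is optimal
            return J, mu
        mu = mu_next
```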
Policy Iteration Theorem
• The algorithm generates an improving sequence of proper policies, that is, for all i and k:
Jμk+1(i) ≤ Jμk(i)
and terminates with an optimal policy.
Multistage Look-ahead
• start at state i
• make m subsequent decisions & incur the corresponding costs
• end up in state j and pay terminal cost Jμ(j)
Multistage policy iteration terminates w/an optimal policy under the same conditions.
Value vs. Policy Iteration
• In general, value iteration requires an infinite number of iterations to obtain the optimal cost-to-go
• Policy iteration always terminates finitely
• Each value iteration is a cheaper operation than a policy iteration
• Idea: combine them.
Modified Policy Iteration
• Let J0 be s.t. TJ0 ≤ J0, and generate J1, J2, … and μ0, μ1, μ2, … s.t.
TμkJk = TJk and Jk+1 = (Tμk)^{mk}(Jk)
• if mk = 1 for all k: value iteration
• if mk = ∞ for all k: policy iteration, with the evaluation step done iteratively via value iteration
• heuristic choices mk > 1 make sense, keeping in mind that TμJ is much cheaper to compute than TJ
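A hedged sketch, reusing bellman_T_mu from the operator block: the current greedy policy is evaluated only approximately, with mk applications of Tμ instead of an exact linear solve. The caller must supply a J0 with TJ0 ≤ J0 (e.g. J0 = Jμ0 for a proper μ0), and the fixed mk and iteration count here are purely illustrative.

```python
def modified_policy_iteration(P, G, J0, m_k=5, iters=100):
    """J0 must satisfy T J0 <= J0 componentwise, per the slide's condition."""
    num_u, n = P.shape[0], P.shape[1]
    J = J0.copy()
    for _ in range(iters):
        # Greedy step: pick mu_k achieving T_mu_k J_k = T J_k.
        mu = np.zeros(n, dtype=int)
        for i in range(1, n):
            q = [np.dot(P[u, i], G[u, i] + J) for u in range(num_u)]
            mu[i] = int(np.argmin(q))
        # Partial evaluation: m_k value-iteration steps under mu_k.
        for _ in range(m_k):
            J = bellman_T_mu(J, P, G, mu)
    return J
```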
Asynchronous Policy Iteration
• Generate a sequence of costs-to-go Jk and stationary policies μk.
• Given (Jk, μk): select a subset Sk of states and generate the new (Jk+1, μk+1) by alternately updating:
a) Jk+1(i) = TμkJk(i) if i ∈ Sk, Jk(i) otherwise; and μk+1 = μk
b) μk+1(i) = arg min_u Σ_j p_ij(u) ( g(i,u,j) + Jk(j) ) if i ∈ Sk, μk(i) otherwise; and Jk+1 = Jk
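One asynchronous step might look like the following sketch, with the same P, G conventions as above; the subset Sk and the choice between update a) and update b) are left to the caller.

```python
def async_pi_step(J, mu, P, G, S_k, update_values=True):
    """Update either the values (a) or the policy (b) on the subset S_k only."""
    num_u = P.shape[0]
    if update_values:
        for i in S_k:
            J[i] = np.dot(P[mu[i], i], G[mu[i], i] + J)  # T_mu J on S_k
    else:
        for i in S_k:
            q = [np.dot(P[u, i], G[u, i] + J) for u in range(num_u)]
            mu[i] = int(np.argmin(q))                    # greedy update on S_k
    return J, mu
```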
Convergence
• If both the value update and the policy update are executed infinitely often for all states, and
• if the initial conditions J0 and μ0 are s.t. Tμ0J0 ≤ J0 (for example, select μ0 and set J0 = Jμ0),
• then Jk converges to J*.
Linear Programming
• Since lim_{k→∞} T^k J = J* for all J: if J ≤ TJ, then J ≤ J* = TJ*
• So J* = arg max{ J | J ≤ TJ }, that is:
maximize Σ_i λi
subject to λi ≤ Σ_j p_ij(u) ( g(i,u,j) + λj ), i = 1,…,n, u ∈ U(i)
Problem: very big when n is big!
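A sketch of this LP using scipy.optimize.linprog (which minimizes, so the objective is negated). Each constraint row encodes λi − Σ_j p_ij(u) λj ≤ ḡ(i,u) for one state-control pair, with the same P, G array layout assumed above.

```python
import numpy as np
from scipy.optimize import linprog

def solve_ssp_lp(P, G):
    """Maximize sum_i lambda_i s.t. lambda_i <= sum_j p_ij(u)(g(i,u,j) + lambda_j)."""
    num_u, n = P.shape[0], P.shape[1]
    m = n - 1  # number of non-terminal states (lambda for state 0 is fixed at 0)
    A_ub, b_ub = [], []
    for i in range(1, n):
        for u in range(num_u):
            row = np.zeros(m)
            row[i - 1] = 1.0
            row -= P[u, i, 1:]                     # lambda_i - sum_j p_ij(u) lambda_j
            A_ub.append(row)
            b_ub.append(np.dot(P[u, i], G[u, i]))  # expected stage cost g(i,u)
    res = linprog(c=-np.ones(m),                   # negate to maximize sum lambda_i
                  A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * m)
    return res.x                                   # J*(1), ..., J*(n-1)
```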
Discounted Problems
• Let α < 1. No termination state.
• Prove as a special case of SSP
• modify definitions and proofs:
TJ(i) = min_{u∈U(i)} Σ_j p_ij(u) ( g(i,u,j) + α J(j) )
TμJ(i) = Σ_j p_ij(μ(i)) ( g(i,μ(i),j) + α J(j) )
TμJ = gμ + α PμJ
T α-Properties
Monotonicity Lemma: If J ≤ J′ and μ is stationary, then
T^k J ≤ T^k J′ and Tμ^k J ≤ Tμ^k J′.
α-Subadditivity: If μ is stationary and r > 0, then
T^k(J + re)(i) ≤ T^k J(i) + α^k r and Tμ^k(J + re)(i) ≤ Tμ^k J(i) + α^k r
Contraction
• For any J and J′ and any policy μ, the following contraction properties hold:
||TJ − TJ′|| ≤ α ||J − J′||
||TμJ − TμJ′|| ≤ α ||J − J′||
• max-norm: ||J|| = max_i |J(i)|
Convergence Theorem
• Convert to SSP
• Define a new terminal state 0 and transition probabilities:
p̃(j|i,u) = α p(j|i,u)
p̃(0|i,u) = 1 − α
• All policies are proper
• All previous algorithms & convergence properties carry over.
• A separate proof handles an infinite number of states
• Can extend to compact control sets w/continuous transition probabilities.
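A sketch of the conversion under the earlier array layout. The one subtlety (an implementation choice, not spelled out on the slide): the stage cost must be charged as the expected cost ḡ(i,u) regardless of the successor, so that the α^k probability of still being alive after k stages reproduces the discounting exactly.

```python
def discounted_to_ssp(P_disc, G_disc, alpha):
    """P_disc, G_disc: (num_u, n, n) arrays for the discounted problem (no terminal)."""
    num_u, n = P_disc.shape[0], P_disc.shape[1]
    P = np.zeros((num_u, n + 1, n + 1))
    G = np.zeros((num_u, n + 1, n + 1))
    P[:, 0, 0] = 1.0                 # new terminal state 0 is absorbing
    P[:, 1:, 1:] = alpha * P_disc    # p~(j|i,u) = alpha * p(j|i,u)
    P[:, 1:, 0] = 1.0 - alpha        # p~(0|i,u) = 1 - alpha
    g_bar = np.einsum('uij,uij->ui', P_disc, G_disc)  # expected cost g(i,u)
    G[:, 1:, :] = g_bar[:, :, None]  # charge g(i,u) whatever the successor is
    return P, G
```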
Applications
1. Asset selling w/infinite horizon (continued)
2. Inventory w/batch processing, infinite horizon:
• An order is placed at each time t w.p. p
• Given the current backlog j, the manufacturer can either
– process the whole batch at a fixed cost K, or
– postpone and incur a cost c per backlogged unit
• The maximum backlog is n
• What policy minimizes the expected total cost?
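A hedged sketch of the batch-processing example as value iteration. The discount factor α and the convention that processing clears the backlog before the next arrival are assumptions made for illustration; the slide only specifies K, c, p, and the maximum backlog n.

```python
import numpy as np

def batch_processing_vi(n, K, c, p, alpha=0.95, iters=5000):
    """States j = 0..n = current backlog; one new order arrives each period w.p. p."""
    J = np.zeros(n + 1)
    for _ in range(iters):
        J_new = np.empty(n + 1)
        for j in range(n + 1):
            # Process: fixed cost K, backlog cleared, then a new order w.p. p.
            process = K + alpha * (p * J[1] + (1 - p) * J[0])
            # Postpone: cost c per backlogged unit; backlog grows w.p. p (capped at n).
            postpone = c * j + alpha * (p * J[min(j + 1, n)] + (1 - p) * J[j])
            J_new[j] = min(process, postpone) if j > 0 else postpone
        J = J_new
    return J  # the minimizer at each j indicates when to process vs. postpone
```

Inspecting which action attains the minimum at each backlog level suggests the expected structure: postpone while the backlog is small and process once it exceeds a threshold.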