
Dynamic Programming Applications


Presentation Transcript


  1. Dynamic Programming Applications: Lecture 6, Infinite Horizon

  2. Infinite horizon
  Rules of the game:
  • Infinite number of stages
  • Stationary system
  • Finite number of states
  Why do we care?
  • Good approximation for problems with many stages
  • Analysis is elegant and insightful
  • Implementation of the optimal policy is simple
  • Stationary policy: π = {μ, μ, …}

  3. Total Cost Problems
  J_π(x_0) = lim_{N→∞} E{ Σ_{k=0}^{N−1} α^k g(x_k, μ_k(x_k), w_k) }
  J*(x_0) = min_π J_π(x_0)
  • Stochastic Shortest Paths (SSP): α = 1.
    Objective: reach a cost-free termination state.
  • Discounted problems with bounded cost per stage: α < 1,
    |g| < M, so J_π(x_0) < M/(1−α) is well defined
    (e.g. if the state and control sets are finite).
  • Discounted problems with unbounded cost per stage: α ≤ 1.
    Hard: we don't do it here.

  4. Average cost problems
  • J_π(x_0) = ∞ for all feasible policies π and states x_0
  • Average cost per stage:
    lim_{N→∞} (1/N) E{ Σ_{k=0}^{N−1} g(x_k, μ_k(x_k), w_k) }
  • This is well defined and finite
  • LATER

  5. Preview
  • Convergence: J*(x) = lim_{N→∞} J*_N(x), for all x
  • Limiting solution (Bellman equation):
    J*(x) = min_u E_w{ g(x,u,w) + J*(f(x,u,w)) }
  • Optimal stationary policy: the μ(x) that attains the minimum above.

  6. SSP
  • Finite constraint set U(i) for all i
  • Zero-cost termination state: p_00(u) = 1, g(0,u,0) = 0, for all u ∈ U(0)
  Special cases:
  • Deterministic shortest paths
  • Finite horizon problems

  7. Shorthand
  • J = (J(1), …, J(n)); J(0) = 0
  • (TJ)(i) = min_{u∈U(i)} Σ_j p_ij(u) ( g(i,u,j) + J(j) )
    TJ: optimal cost-to-go for the one-stage problem with cost per stage g
    and initial cost J.
  • (T_μ J)(i) = Σ_j p_ij(μ(i)) ( g(i,μ(i),j) + J(j) )
    T_μ J: cost-to-go under μ for the one-stage problem with cost per stage g
    and initial cost J.

  8. Shorthand
  T_μ J = g_μ + P_μ J
  • where g_μ(i) = Σ_j p_ij(μ(i)) g(i, μ(i), j)
  • and P_μ = ( p_ij(μ(i)) ) for i, j = 1, …, n (state 0 excluded)
  TJ = g + PJ
  • where g(i) = Σ_j p_ij(μ*(i)) g(i, μ*(i), j)
  • and P = P_{μ*}

  9. Value iteration
  • T^k J = T(T^{k−1} J), T^0 J = J
  • T^k J: optimal cost-to-go for the k-stage problem with cost per stage g
    and initial cost J
  • …and similarly for T_μ
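To make the operator concrete, here is a minimal runnable sketch of T and of value iteration on a toy SSP. The problem data (`P`, `g`), the three-state example, and the name `bellman_T` are invented for illustration; only the update rule itself comes from the slides.

```python
import numpy as np

# Toy SSP: states 1..n plus a cost-free absorbing terminal state 0.
# P[u][i][j] = p_ij(u); g[u][i] = expected cost per stage g(i, u).
n, n_u = 3, 2
rng = np.random.default_rng(0)
P = np.zeros((n_u, n + 1, n + 1))
P[:, 0, 0] = 1.0                                # terminal state absorbs
P[:, 1:, :] = rng.random((n_u, n, n + 1)) + 0.1
P[:, 1:, :] /= P[:, 1:, :].sum(axis=2, keepdims=True)  # rows sum to 1
g = np.zeros((n_u, n + 1))                      # g(0, u) = 0
g[:, 1:] = rng.random((n_u, n))                 # positive costs elsewhere

def bellman_T(J):
    """(TJ)(i) = min_u [ g(i,u) + sum_j p_ij(u) J(j) ], with J(0) = 0."""
    Q = g + np.einsum('uij,j->ui', P, J)        # Q[u, i]
    return Q.min(axis=0), Q.argmin(axis=0)

# Value iteration: J_{k+1} = T J_k = T^{k+1} J_0, starting from J_0 = 0.
J = np.zeros(n + 1)
for k in range(1000):
    J_new, mu = bellman_T(J)
    if np.max(np.abs(J_new - J)) < 1e-10:
        break
    J = J_new
print("J* ~", J[1:], " optimal stationary policy:", mu[1:])
```

In this toy example every state reaches the terminal state with positive probability in one step, so all stationary policies are proper and the convergence guarantees below apply.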

  10. T Properties
  Monotonicity Lemma: If J ≤ J′ and μ is stationary, then
  T^k J ≤ T^k J′ and T_μ^k J ≤ T_μ^k J′.
  Subadditivity: If μ is stationary, e = (1, 1, …, 1), r > 0, then
  T^k(J + re)(i) ≤ T^k J(i) + r and T_μ^k(J + re)(i) ≤ T_μ^k J(i) + r

  11. Property
  Define a proper stationary policy μ: the terminal state is reachable
  from any state w.p. > 0 (in ≤ n stages).
  Assumptions:
  1. There exists at least one proper μ.
  2. The cost-to-go J_μ(i) of any improper μ is infinite for some i.
  2′. Expected cost per stage: g(i,u) = Σ_j p_ij(u) g(i,u,j) ≥ 0
      for all i ≠ 0 and u ∈ U(i).
  What do these mean in the deterministic case?

  12. Alternative assumption
  • Assumption 3: There exists an integer m such that for any policy π
    and initial state x, the probability of reaching the terminal state
    from x in m stages under π is nonzero.
  • This is a stronger assumption than Assumptions 1 and 2.

  13. Main Theorem
  Under Assumptions 1 and 2 (or under 3):
  1. lim_{k→∞} T^k J = J*, for every vector J.
  2. J* = TJ*, and J* is the only solution of J = TJ.
  3. For any proper policy μ and for every vector J,
     lim_{k→∞} T_μ^k J = J_μ; J_μ = T_μ J_μ, and J_μ is the only solution
     of J = T_μ J.
  4. A stationary μ is optimal iff T_μ J* = TJ*.

  14. Lemma
  Suppose all stationary policies are proper. Then there exists a weight
  vector ξ with ξ(i) > 0 s.t. for all stationary μ, T and T_μ are
  contraction mappings w.r.t. the weighted max-norm ||·||_ξ.
  • weighted max-norm: ||J||_ξ = max_i |J(i)| / ξ(i)
  • contraction mapping: ||TJ − TJ′||_ξ ≤ β ||J − J′||_ξ for some β < 1

  15. How to find J* and μ*?
  • Value iteration
  • Policy iteration
  • Variants

  16. Asynchronous Value Iteration
  • Start with an arbitrary J_0.
  • Stage k: pick i_k and iterate J_{k+1}(i_k) ← TJ_k(i_k)
    (all the rest stays the same: J_{k+1}(i) = J_k(i) for i ≠ i_k).
  • Assume each i_k is chosen infinitely often.
  • Then J_k → J*. This is also called the Gauss-Seidel method.
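A minimal sketch of the Gauss-Seidel variant on the same kind of toy data (again, `P`, `g` and the example sizes are invented): J is overwritten in place, one state per step, so later updates within a sweep immediately see the new values.

```python
import numpy as np

# Same toy SSP layout as in the value-iteration sketch above.
n, n_u = 3, 2
rng = np.random.default_rng(0)
P = np.zeros((n_u, n + 1, n + 1))
P[:, 0, 0] = 1.0
P[:, 1:, :] = rng.random((n_u, n, n + 1)) + 0.1
P[:, 1:, :] /= P[:, 1:, :].sum(axis=2, keepdims=True)
g = np.zeros((n_u, n + 1))
g[:, 1:] = rng.random((n_u, n))

# Gauss-Seidel sweeps: every state i is picked once per sweep, hence
# infinitely often, which is what the convergence result requires.
J = np.zeros(n + 1)
for sweep in range(1000):
    delta = 0.0
    for i in range(1, n + 1):
        old = J[i]
        J[i] = min(g[u, i] + P[u, i] @ J for u in range(n_u))
        delta = max(delta, abs(J[i] - old))
    if delta < 1e-10:
        break
print("J* ~", J[1:])
```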

  17. Decomposition
  • Suppose S can be partitioned into S_1, S_2, …, S_M so that if i ∈ S_m,
    then under any policy the successor state is j = 0 or j ∈ S_{m−k},
    for some m−1 ≥ k ≥ 0.
  • Then the solution decomposes into the sequential solution of M SSPs,
    each solved using the optimal solutions of the preceding subproblems.
  • If k > 0 above, then the Gauss-Seidel method that iterates on states
    in order of their membership in S_m needs only one iteration per state
    to reach the optimum (e.g. finite horizon problems).

  18. Policy Iteration
  • Start with a given policy μ^k:
  • Policy evaluation step: compute J_{μ^k}(i) by solving the linear
    system (with J(0) = 0): J = g_{μ^k} + P_{μ^k} J
  • Policy improvement step: compute the new policy μ^{k+1} as the
    solution of T_{μ^{k+1}} J_{μ^k} = T J_{μ^k}, that is
    μ^{k+1}(i) = arg min_u Σ_j p_ij(u) ( g(i,u,j) + J_{μ^k}(j) )
  • Terminate iff J_{μ^k} = J_{μ^{k+1}} (no improvement): μ* = μ^k
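A runnable sketch of the two steps on the same invented toy data: evaluation solves the linear system (I − P_μ)J = g_μ over the non-terminal states, and improvement takes a greedy argmin.

```python
import numpy as np

# Toy SSP data as in the earlier sketches.
n, n_u = 3, 2
rng = np.random.default_rng(0)
P = np.zeros((n_u, n + 1, n + 1))
P[:, 0, 0] = 1.0
P[:, 1:, :] = rng.random((n_u, n, n + 1)) + 0.1
P[:, 1:, :] /= P[:, 1:, :].sum(axis=2, keepdims=True)
g = np.zeros((n_u, n + 1))
g[:, 1:] = rng.random((n_u, n))

mu = np.zeros(n + 1, dtype=int)                 # arbitrary starting policy
for k in range(100):
    # Policy evaluation: J = g_mu + P_mu J with J(0) = 0, i.e. solve
    # (I - P_mu) J = g_mu restricted to states 1..n.
    P_mu = np.array([P[mu[i], i, 1:] for i in range(1, n + 1)])
    g_mu = np.array([g[mu[i], i] for i in range(1, n + 1)])
    J = np.zeros(n + 1)
    J[1:] = np.linalg.solve(np.eye(n) - P_mu, g_mu)

    # Policy improvement: greedy policy w.r.t. J_mu.
    Q = g + np.einsum('uij,j->ui', P, J)
    mu_new = Q.argmin(axis=0)
    if np.array_equal(mu_new[1:], mu[1:]):      # no improvement: optimal
        break
    mu = mu_new
print("optimal policy:", mu[1:], " J* ~", J[1:])
```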

  19. Policy Iteration Theorem
  • The algorithm generates an improving sequence of proper policies,
    that is, for all i, k: J_{μ^{k+1}}(i) ≤ J_{μ^k}(i),
    and terminates with an optimal policy.

  20. Multistage Look-ahead
  • start at state i
  • make m subsequent decisions and incur the costs
  • end up in state j and pay terminal cost J_μ(j)
  Multistage policy iteration terminates with an optimal policy under the
  same conditions.

  21. Value vs. Policy iteration
  • In general, value iteration requires an infinite number of iterations
    to obtain the optimal cost-to-go.
  • Policy iteration always terminates finitely.
  • A value iteration step is a cheaper operation than a policy iteration
    step.
  • Idea: we should combine them.

  22. Modified policy iteration
  • Let J_0 be s.t. TJ_0 ≤ J_0, and generate
  • J_1, J_2, … and μ^0, μ^1, μ^2, … s.t.
    T_{μ^k} J_k = TJ_k and J_{k+1} = (T_{μ^k})^{m_k} (J_k)
  • if m_k = 1 for all k: value iteration
  • if m_k = ∞ for all k: policy iteration, with the evaluation step done
    iteratively via value iteration
  • heuristic choices use m_k > 1, keeping in mind that T_μ J is much
    cheaper to compute than TJ
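A sketch of the compromise, again on invented toy data: one T step picks the greedy μ^k, then m_k − 1 extra (cheap) T_μ steps partially evaluate it. Setting m_k = 1 recovers value iteration; a large m_k approaches policy iteration.

```python
import numpy as np

# Toy SSP data as in the earlier sketches.
n, n_u = 3, 2
rng = np.random.default_rng(0)
P = np.zeros((n_u, n + 1, n + 1))
P[:, 0, 0] = 1.0
P[:, 1:, :] = rng.random((n_u, n, n + 1)) + 0.1
P[:, 1:, :] /= P[:, 1:, :].sum(axis=2, keepdims=True)
g = np.zeros((n_u, n + 1))
g[:, 1:] = rng.random((n_u, n))

m_k = 5                                         # inner evaluation steps
J = np.full(n + 1, 1e3)                         # large J_0 gives T J_0 <= J_0
J[0] = 0.0                                      # for this positive-cost data
for k in range(1000):
    Q = g + np.einsum('uij,j->ui', P, J)
    mu = Q.argmin(axis=0)                       # greedy: T_mu J = T J
    J_new = Q.min(axis=0)                       # first application is T J
    for _ in range(m_k - 1):                    # m_k - 1 cheap T_mu steps
        J_new = np.array([g[mu[i], i] + P[mu[i], i] @ J_new
                          for i in range(n + 1)])
    if np.max(np.abs(J_new - J)) < 1e-10:
        break
    J = J_new
print("J* ~", J[1:], " policy:", mu[1:])
```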

  23. Asynchronous Policy Iteration
  • Generate a sequence of costs-to-go J_k and stationary policies μ^k.
  • Given (J_k, μ^k): select S_k and generate the new (J_{k+1}, μ^{k+1})
    by alternately updating:
  a) J_{k+1}(i) = T_{μ^k} J_k(i) if i ∈ S_k, and J_{k+1}(i) = J_k(i)
     otherwise; μ^{k+1} = μ^k
  b) μ^{k+1}(i) = arg min_u Σ_j p_ij(u) ( g(i,u,j) + J_k(j) ) if i ∈ S_k,
     and μ^{k+1}(i) = μ^k(i) otherwise; J_{k+1} = J_k

  24. Convergence
  • If both the value update and the policy update are executed infinitely
    often for all states, and
  • if the initial conditions J_0 and μ^0 are s.t. T_{μ^0} J_0 ≤ J_0
    (for example, select μ^0 and set J_0 = J_{μ^0}),
  • then J_k converges to J*.

  25. Linear programming
  • Since lim_{k→∞} T^k J = J* for all J, we have: J ≤ TJ ⇒ J ≤ J* = TJ*
  • So J* = arg max { J | J ≤ TJ }, that is:
  • maximize Σ_i λ_i
    subject to λ_i ≤ Σ_j p_ij(u) ( g(i,u,j) + λ_j ), i = 1, …, n, u ∈ U(i)
  Problem: very big when n is big!
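A sketch of this LP on the toy data, using scipy's linprog (all data invented as before). linprog minimizes, so we maximize Σ λ_i by minimizing −Σ λ_i; there is one inequality per (state, control) pair, which is exactly what blows up for large n.

```python
import numpy as np
from scipy.optimize import linprog

# Toy SSP data as in the earlier sketches; lambda_0 = 0 is eliminated.
n, n_u = 3, 2
rng = np.random.default_rng(0)
P = np.zeros((n_u, n + 1, n + 1))
P[:, 0, 0] = 1.0
P[:, 1:, :] = rng.random((n_u, n, n + 1)) + 0.1
P[:, 1:, :] /= P[:, 1:, :].sum(axis=2, keepdims=True)
g = np.zeros((n_u, n + 1))
g[:, 1:] = rng.random((n_u, n))

# One constraint per (i, u):
#   lambda_i - sum_{j>=1} p_ij(u) lambda_j <= g(i, u)
A_ub, b_ub = [], []
for u in range(n_u):
    for i in range(1, n + 1):
        row = -P[u, i, 1:].copy()
        row[i - 1] += 1.0
        A_ub.append(row)
        b_ub.append(g[u, i])

res = linprog(c=-np.ones(n),                    # maximize sum of lambdas
              A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n)
print("J* ~", res.x)                            # matches value iteration
```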

  26. Discounted problems
  • Let α < 1. No termination state.
  • Proved as a special case of SSP:
  • modify the definitions and proofs
  • (TJ)(i) = min_u Σ_j p_ij(u) ( g(i,u,j) + α J(j) )
  • (T_μ J)(i) = Σ_j p_ij(μ(i)) ( g(i,μ(i),j) + α J(j) )
  • T_μ J = g_μ + α P_μ J

  27. T α-Properties
  Monotonicity Lemma: If J ≤ J′ and μ is stationary, then
  T^k J ≤ T^k J′ and T_μ^k J ≤ T_μ^k J′.
  α-Subadditivity: If μ is stationary and r > 0, then
  T^k(J + re)(i) ≤ T^k J(i) + α^k r and T_μ^k(J + re)(i) ≤ T_μ^k J(i) + α^k r

  28. Contraction
  • For any J and J′ and any policy μ, the following contraction
    properties hold:
    ||TJ − TJ′|| ≤ α ||J − J′||
    ||T_μ J − T_μ J′|| ≤ α ||J − J′||
  • max-norm: ||J|| = max_i |J(i)|
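A quick numerical check of the α-contraction in the (unweighted) max-norm on random discounted data; all names and sizes here are invented for illustration.

```python
import numpy as np

# Random discounted problem: P[u][i][j], expected stage costs g[u][i].
n, n_u, alpha = 5, 3, 0.9
rng = np.random.default_rng(1)
P = rng.random((n_u, n, n))
P /= P.sum(axis=2, keepdims=True)
g = rng.random((n_u, n))

def T(J):
    """(TJ)(i) = min_u [ g(i,u) + alpha * sum_j p_ij(u) J(j) ]."""
    return (g + alpha * np.einsum('uij,j->ui', P, J)).min(axis=0)

for _ in range(5):
    J1, J2 = rng.normal(size=n) * 10, rng.normal(size=n) * 10
    lhs = np.max(np.abs(T(J1) - T(J2)))         # ||TJ - TJ'||
    rhs = alpha * np.max(np.abs(J1 - J2))       # alpha ||J - J'||
    print(f"{lhs:.4f} <= {rhs:.4f}: {lhs <= rhs}")
```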

  29. Convergence Theorem
  • Convert to SSP
  • Define a new terminal state 0 and transition probabilities:
    p̃(j|i,u) = α p(j|i,u), p̃(0|i,u) = 1 − α
  • All policies are proper
  • All previous algorithms and convergence properties apply.
  • A separate proof is needed for an infinite number of states.
  • Can be extended to compact control sets with continuous transition
    probabilities.
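A sketch of the reduction on invented data: scale the transition probabilities by α, send the leftover mass 1 − α to a new terminal state 0, keep the expected stage costs, and check that undiscounted value iteration on the SSP reproduces the discounted J*.

```python
import numpy as np

# A random discounted problem (P, g, alpha); all data invented.
n, n_u, alpha = 4, 2, 0.9
rng = np.random.default_rng(2)
P = rng.random((n_u, n, n))
P /= P.sum(axis=2, keepdims=True)
g = rng.random((n_u, n))

# Equivalent SSP: p~(j|i,u) = alpha p(j|i,u), p~(0|i,u) = 1 - alpha.
P_ssp = np.zeros((n_u, n + 1, n + 1))
P_ssp[:, 0, 0] = 1.0                            # cost-free absorbing terminal
P_ssp[:, 1:, 1:] = alpha * P
P_ssp[:, 1:, 0] = 1.0 - alpha
g_ssp = np.zeros((n_u, n + 1))
g_ssp[:, 1:] = g                                # same expected cost per stage

# Value iteration on both problems yields the same optimal costs.
J_d, J_s = np.zeros(n), np.zeros(n + 1)
for _ in range(2000):
    J_d = (g + alpha * np.einsum('uij,j->ui', P, J_d)).min(axis=0)
    J_s = (g_ssp + np.einsum('uij,j->ui', P_ssp, J_s)).min(axis=0)
print("max difference:", np.max(np.abs(J_d - J_s[1:])))   # ~0
```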

  30. Applications
  1. Asset selling with infinite horizon (continued)
  2. Inventory with batch processing, infinite horizon
     (a numerical sketch follows below):
  • An order is placed at time t w.p. p.
  • Given the current backlog j, the manufacturer can either
    process the whole batch at a fixed cost K, or
    postpone and incur a cost of c per unit.
  • The maximum backlog is n.
  • Which policy minimizes the expected total cost?
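A minimal sketch of problem 2. The slide leaves the model details open, so the following fills them with assumptions: a discount factor α < 1 so the total expected cost is finite, one potential order per period, and forced processing at the maximum backlog n. All parameter values are invented.

```python
import numpy as np

# Assumed model: backlog j in {0, ..., n}; each period an order arrives
# w.p. p; either process the whole batch (fixed cost K, backlog resets
# to 0) or postpone (cost c per backlogged unit); discount factor alpha.
n, p, K, c, alpha = 10, 0.5, 5.0, 1.0, 0.95

def q_values(J, j):
    """Costs of the two actions at backlog j, given cost-to-go J."""
    process = K + alpha * (p * J[1] + (1 - p) * J[0])
    postpone = c * j + alpha * (p * J[min(j + 1, n)] + (1 - p) * J[j])
    return process, postpone

# Value iteration until the discounted fixed point is reached.
J = np.zeros(n + 1)
for _ in range(10_000):
    J_new = np.array([min(q_values(J, j)) if j < n      # forced to process
                      else q_values(J, j)[0]            # at j = n
                      for j in range(n + 1)])
    if np.max(np.abs(J_new - J)) < 1e-12:
        break
    J = J_new

# Under these assumptions the optimal policy comes out as a threshold
# rule: postpone while the backlog is small, process once it is large.
policy = ["process" if j == n or q_values(J, j)[0] <= q_values(J, j)[1]
          else "postpone" for j in range(n + 1)]
print(dict(enumerate(policy)))
```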
