
Markov Decision Processes: A Survey


Presentation Transcript


  1. Markov Decision Processes: A Survey Adviser: Yeong-Sung Lin Graduate Student: Cheng-Ta Lee Network Optimization Research Group March 22, 2004

  2. Outline • Introduction • Markov Theory • Markov Decision Processes • Conclusion • Future Work

  3. Introduction Decision Theory • Probability Theory: describes what an agent should believe based on evidence. • Utility Theory: describes what an agent wants. • Probability Theory + Utility Theory = Decision Theory: describes what an agent should do.

  4. Introduction • Markov decision process (MDP) theory has developed substantially in the last three decades and has become an established topic within operational research. • Modeling of (infinite) sequences of recurring decision problems (general behavioral strategies) • MDPs defined • Objective functions • Policies

  5. Markov Theory • Markov process • A mathematical model that is useful in the study of complex systems. • The basic concepts of the Markov process are the “state” of a system and state “transitions”. • A graphic example of a Markov process is presented by a frog in a lily pond. • State transition system • Discrete-time process • Continuous-time process

  6. Markov Theory • To study the discrete-time process • Suppose that there are N states in the system numbered from 1 to N. If the system is a simple Markov process, then the probability of a transition to state j during the next time interval, given that the system now occupies state i, is a function only of i and j and not of any history of the system before its arrival in i. • In other words, we may specify a set of conditional probabilities pij, • where pij ≧ 0 and Σj pij = 1 for every i.
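A minimal sketch of the object this slide describes, using illustrative numbers that are assumptions rather than values from the deck: a row-stochastic transition matrix whose entries pij depend only on the current state i and the next state j.

```python
import numpy as np

# Illustrative 2-state transition matrix (assumed values, not taken from the slides):
# P[i, j] = probability of a transition to state j given the system now occupies state i.
P = np.array([[0.5, 0.5],
              [0.4, 0.6]])

# Every row of conditional probabilities pij must be non-negative and sum to 1.
assert np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0)
```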

  7. The Toymaker Example • First state: the toy is in great favor. • Second state: the toy is out of favor. • Matrix form • Transition diagram

  8. The Toymaker Example • πi(n), the probability that the system will occupy state i after n transitions, if its state at n = 0 is known. • It follows that π(n + 1) = π(n)P.
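As a rough illustration of this recurrence, the sketch below iterates π(n + 1) = π(n)P from a known starting state; the transition matrix is an assumption, so the printed numbers are not the values in Tables 1.1 and 1.2.

```python
import numpy as np

P = np.array([[0.5, 0.5],   # assumed toymaker-style transition matrix
              [0.4, 0.6]])

pi = np.array([1.0, 0.0])   # state known at n = 0: start in state 1
for n in range(5):
    print(n, pi)
    pi = pi @ P             # pi(n + 1) = pi(n) P
```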

  9. The Toymaker Example • If the toymaker starts with a successful toy, then π1(0) = 1 and π2(0) = 0, so that the successive state probabilities follow from π(n + 1) = π(n)P.

  10. The Toymaker Example • Table 1.1 Successive State Probabilities of Toymaker Starting with a Successful Toy • Table 1.2 Successive State Probabilities of Toymaker Starting without a Successful Toy

  11. The Toymaker Example • The limiting-state-probability row vector π, with components πi, is thus the limit of π(n) as n approaches infinity.

  12. z-Transformation For the study of transient behavior and for theoretical convenience, it is useful to study the Markov process from the point of view of the generating function or, as we shall call it, the z-transform. Consider a time function f(n) that takes on arbitrary values f(0), f(1), f(2), and so on, at nonnegative, discrete, integrally spaced points of time and that is zero for negative time. Such a time function is shown in Fig. 2.4, an arbitrary discrete-time function.

  13. z-Transformation The z-transform F(z) of f(n) is defined such that F(z) = Σn≧0 f(n)z^n. Table 1.3 lists z-transform pairs.

  14. z-Transformation • Consider first the step function f(n) = 1, n ≧ 0; its z-transform is F(z) = Σn≧0 z^n, or F(z) = 1/(1 − z). • For the geometric sequence f(n) = α^n, n ≧ 0, F(z) = Σn≧0 α^n z^n, or F(z) = 1/(1 − αz).

  15. z-Transformation • We shall now use the z-transform to analyze Markov processes. Taking z-transforms of π(n + 1) = π(n)P yields Π(z) = π(0)(I − zP)^-1. In this expression I is the identity matrix.

  16. z-Transformation • Let us investigate the toymaker’s problem by z-transformation. Let the matrix H(n) be the inverse transform of (I − zP)^-1 on an element-by-element basis.
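One way to see what H(n) is, numerically: for |z| < 1 the matrix (I − zP)^-1 equals the power series Σn (zP)^n, so its element-by-element inverse transform is H(n) = P^n. The check below uses an assumed transition matrix.

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [0.4, 0.6]])   # assumed transition matrix
z = 0.3                      # any |z| < 1

closed_form = np.linalg.inv(np.eye(2) - z * P)

# Partial sum of sum_n (zP)^n; the coefficient of z^n is H(n) = P^n.
series = sum(np.linalg.matrix_power(z * P, n) for n in range(200))

assert np.allclose(closed_form, series)
```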

  17. z-Transformation • If the toymaker starts in the successful state 1, then π(0) = [1 0] and π(n) = π(0)H(n). • If the toymaker starts in the unsuccessful state 2, then π(0) = [0 1] and π(n) = π(0)H(n). • We have now obtained analytic forms for the data in Tables 1.1 and 1.2.

  18. Laplace Transformation We shall extend our previous work to the case in which the process may make transitions at random time intervals. The Laplace transform of a time function f(t) which is zero for t < 0 is defined by F(s) = ∫0∞ f(t)e^(−st) dt. Table 2.4 lists Laplace transform pairs.

  19. Laplace Transformation

  20. Laplace Transformation • We shall now use the Laplace transform to analyze Markov processes, paralleling the z-transform treatment of the discrete process.

  21. Laplace Transformation • Recall the toymaker’s initial policy, for which the transition-probability matrix was

  22. Laplace Transformation • Let the matrix H(t) be the inverse transform of (sI − A)^-1. • Then, by means of inverse transformation, π(t) = π(0)H(t).
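For a continuous-time chain with an infinitesimal generator (rate) matrix A, the inverse Laplace transform of (sI − A)^-1 is the matrix exponential, so H(t) = e^(At). A brief sketch with an assumed generator (the toymaker's actual rates are not given here):

```python
import numpy as np
from scipy.linalg import expm

# Assumed 2-state generator matrix: off-diagonal rates are non-negative, rows sum to 0.
A = np.array([[-2.0,  2.0],
              [ 1.0, -1.0]])

t = 0.7
H_t = expm(A * t)            # H(t) = exp(At), the inverse transform of (sI - A)^-1

pi0 = np.array([1.0, 0.0])   # start in state 1
print(pi0 @ H_t)             # pi(t) = pi(0) H(t)
```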

  23. Laplace Transformation • If the toymaker starts in the successful state 1, then π(0) = [1 0] and π(t) = π(0)H(t). • If the toymaker starts in the unsuccessful state 2, then π(0) = [0 1] and π(t) = π(0)H(t). • We have now obtained analytic forms for the data in Tables 1.1 and 1.2.

  24. Markov Decision Processes • MDP theory applies dynamic programming to the solution of a stochastic decision process with a finite number of states. • The transition probabilities between the states are described by a Markov chain. • The reward structure of the process is described by a matrix that represents the revenue (or cost) associated with movement from one state to another. • Both the transition and revenue matrices depend on the decision alternatives available to the decision maker. • The objective of the problem is to determine the optimal policy that maximizes the expected revenue over a finite or infinite number of stages.

  25. Markov Process with Rewards • Suppose that an N-state Markov process earns rij dollars when it makes a transition from state i to j. • We call rij the “reward” associated with the transition from i to j. • The rewards need not be in dollars; they could be voltage levels, units of production, or any other physical quantity relevant to the problem. • Let us define vi(n) as the expected total earnings in the next n transitions if the system is now in state i.

  26. Markov Process with Rewards • Recurrence relation: v(n) = q + Pv(n − 1), where qi = Σj pij rij is the expected immediate reward in state i.
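A short sketch of this recurrence under assumed toymaker-style numbers (the reward matrix is an assumption, not quoted from the deck): q collects the expected immediate rewards and v(n) is built up from v(0) = 0.

```python
import numpy as np

P = np.array([[0.5, 0.5],    # assumed transition probabilities
              [0.4, 0.6]])
R = np.array([[9.0,  3.0],   # assumed rewards r_ij earned on each transition
              [3.0, -7.0]])

q = (P * R).sum(axis=1)      # expected immediate reward q_i = sum_j p_ij r_ij

v = np.zeros(2)              # v(0) = 0
for n in range(1, 5):
    v = q + P @ v            # v(n) = q + P v(n - 1)
    print(n, v)
```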

  27. The Toymaker Example • Table 3.1. Total Expected Reward for Toymaker as a Function of State and Number of Weeks Remaining

  28. Toymaker’s problem: total expected reward in each state as a function of weeks remaining

  29. z-Transform Analysis of the Markov Process with Rewards • The z-transform of the total-value vector v(n) will be called V(z), where v(0) = 0.

  30. z-Transform Analysis of the Markov Process with Rewards Let the matrix F(n) be the inverse transform of the matrix multiplying q in V(z). The total-value vector v(n) is then F(n)q by inverse transformation. We see that, as n becomes very large, both v1(n) and v2(n) have slope 1 and v1(n) − v2(n) = 10.
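The asymptotic behavior can also be checked numerically. With the same assumed numbers as in the earlier sketches, running the recurrence for many stages shows each component growing with a common slope (here 1 per stage) and the difference v1(n) − v2(n) settling at 10, which is the behavior the slide describes.

```python
import numpy as np

P = np.array([[0.5, 0.5], [0.4, 0.6]])   # assumed data, as above
R = np.array([[9.0, 3.0], [3.0, -7.0]])
q = (P * R).sum(axis=1)

v = np.zeros(2)
for n in range(1, 51):
    v = q + P @ v

v_next = q + P @ v
print("slope per stage:", v_next - v)    # -> approximately [1, 1] with these numbers
print("v1(n) - v2(n) :", v[0] - v[1])    # -> approximately 10 with these numbers
```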

  31. Optimization Techniques in General Markov Decision Processes • Value Iteration • Exhaustive Enumeration • Policy Iteration • Linear Programming • Lagrangian Relaxation

  32. Value Iteration • Original • Advertising? No / Yes • Research? No / Yes

  33. Diagram of States and Alternatives

  34. The Toymaker’s Problem Solved by Value Iteration • The quantity qik is the expected reward from a single transition from state i under alternative k. Thus, qik = Σj pijk rijk. • The alternatives for the toymaker are presented in Table 3.1.

  35. The Toymaker’s Problem Solved by Value Iteration • We call di(n) the “decision” in state i at the nth stage. When di(n) has been specified for all i and all n, a “policy” has been determined. The optimal policy is the one that maximizes the total expected return for each i and n. • To analyze this problem, let us redefine vi(n) as the total expected return in n stages starting from state i if an optimal policy is followed. It follows that for any n, vi(n) = maxk [qik + Σj pijk vj(n − 1)]. • “Principle of optimality” of dynamic programming: in an optimal sequence of decisions or choices, each subsequence must also be optimal.
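A sketch of this value-iteration recurrence with two alternatives per state; the probabilities and rewards are illustrative assumptions in the spirit of the toymaker example, not necessarily the entries of Table 3.1.

```python
import numpy as np

# Assumed alternatives per state: P[i][k] is the transition row and R[i][k] the reward
# row for alternative k in state i (illustrative numbers).
P = [[np.array([0.5, 0.5]), np.array([0.8, 0.2])],
     [np.array([0.4, 0.6]), np.array([0.7, 0.3])]]
R = [[np.array([9.0, 3.0]), np.array([4.0, 4.0])],
     [np.array([3.0, -7.0]), np.array([1.0, -19.0])]]

v = np.zeros(2)                                   # v_i(0) = 0
for n in range(1, 5):
    new_v, decisions = np.zeros(2), [0, 0]
    for i in range(2):
        # q_i^k + sum_j p_ij^k v_j(n - 1) for every alternative k in state i
        returns = [P[i][k] @ R[i][k] + P[i][k] @ v for k in range(len(P[i]))]
        decisions[i] = int(np.argmax(returns))
        new_v[i] = max(returns)
    v = new_v
    print(n, v, "decisions d_i(n):", decisions)
```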

  36. The Toymaker’s Problem Solved by Value Iteration Table 3.6 Toymaker’s Problem Solved by Value Iteration

  37. The Toymaker’s Problem Solved by Value Iteration • Note that for n=2, 3, and 4, the second alternative in each state is to be preferred. This means that the toymaker is better advised to advertise and to carry on research in spite of the costs of these activities. • For this problem the convergence seems to have taken place at n=2, and the second alternative in each state has been chosen. However, in many problems it is difficult to tell when convergence has been obtained.

  38. Evaluation of the Value-Iteration Approach • The value-iteration method is not particularly suited to long-duration processes.

  39. Exhaustive Enumeration • Exhaustive enumeration is one of the methods for solving the infinite-stage problem. • The method calls for evaluating all possible stationary policies of the decision problem. • This is equivalent to an exhaustive enumeration process and can be used only if the number of stationary policies is reasonably small.

  40. Exhaustive Enumeration • Suppose that the decision problem has S stationary policies, and assume that Ps and Rs are the (one-step) transition and revenue matrices associated with policy s, s = 1, 2, …, S.

  41. Exhaustive Enumeration • The steps of the exhaustive enumeration method are as follows. • Step 1. Compute vsi, the expected one-step (one-period) revenue of policy s given state i, i = 1, 2, …, m. • Step 2. Compute πsi, the long-run stationary probabilities of the transition matrix Ps associated with policy s. These probabilities, when they exist, are computed from the equations πsPs = πs and πs1 + πs2 + … + πsm = 1, where πs = (πs1, πs2, …, πsm). • Step 3. Determine Es, the expected revenue of policy s per transition step (period), by using the formula Es = Σi πsi vsi. • Step 4. The optimal policy s* is determined such that Es* = maxs Es. A sketch of these steps follows.
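The sketch below walks through the steps for a small set of assumed stationary policies (the matrices are illustrative, not the gardener data): each policy's stationary probabilities are found by solving πP = π together with Σi πi = 1, and the policy with the largest expected revenue per period is selected.

```python
import numpy as np

def stationary(P):
    """Step 2: solve pi P = pi together with sum(pi) = 1."""
    m = P.shape[0]
    A = np.vstack([P.T - np.eye(m), np.ones(m)])
    b = np.append(np.zeros(m), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def expected_revenue(P, R):
    """Steps 1 and 3: one-step revenues v_i = sum_j p_ij r_ij, then E = sum_i pi_i v_i."""
    return stationary(P) @ (P * R).sum(axis=1)

# Step 4 over an assumed (illustrative) set of stationary policies.
policies = {
    "policy 1": (np.array([[0.2, 0.5, 0.3], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]),
                 np.array([[7.0, 6.0, 3.0], [0.0, 5.0, 1.0], [0.0, 0.0, -1.0]])),
    "policy 2": (np.array([[0.3, 0.6, 0.1], [0.1, 0.6, 0.3], [0.1, 0.4, 0.5]]),
                 np.array([[6.0, 5.0, -1.0], [7.0, 4.0, 0.0], [6.0, 3.0, -2.0]])),
}
best = max(policies, key=lambda s: expected_revenue(*policies[s]))
print("optimal policy:", best)
```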

  42. Exhaustive Enumeration • We illustrate the method by solving the gardener problem for an infinite-period planning horizon. • The gardener problem has a total of eight stationary policies, as the following table shows:

  43. Exhaustive Enumeration • The matrices Ps and Rs for policies 3 through 8 are derived from those of policies 1 and 2 and are given as

  44. Exhaustive Enumeration • Step 1: • The values of vsi can thus be computed as given in the following table.

  45. Exhaustive Enumeration • Step 2: • The computations of the stationary probabilities are achieved by using the equations πsPs = πs and πs1 + πs2 + … + πsm = 1. • As an illustration, consider s = 2. The associated equations are solved for π2, and from the solution the expected yearly revenue follows.

  46. Exhaustive Enumeration • Step 3 & 4: • The following table summarizes πs and Es for all the stationary policies. • Policy 2 yields the largest expected yearly revenue (E2 = 2.256). The optimum long-range policy calls for applying fertilizer regardless of the state of the system.

  47. Policy Iteration • The system is completely ergodic, the limiting state probabilities πi are independent of the starting state, and the gain g of the system is g = Σi πi qi, where qi = Σj pij rij is the expected immediate return in state i.

  48. Policy Iteration • A possible five-state problem. • The alternative thus selected is called the “decision” for that state; it is no longer a function of n. The set of X’s or the set of decisions for all states is called a “policy”.

  49. Policy Iteration • It is possible to describe the policy by a decision vector d whose elements represent the number of the alternative selected in each state. In this case • An optimal policy is defined as a policy that maximizes the gain, or average return per transition.

  50. Policy Iteration • In the five-state problem diagrammed, there are 120 different policies. • However feasible this may be for 120 policies, it becomes infeasible for very large problems. • For example, a problem with 50 states and 50 alternatives in each state contains 50^50 (≒ 10^85) policies. • The policy-iteration method that will be described finds the optimal policy in a small number of iterations. • It is composed of two parts, the value-determination operation and the policy-improvement routine; a sketch of the cycle follows.
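A sketch of the two-part cycle under assumed toymaker-style data (illustrative numbers): the value-determination operation solves g + vi = qi + Σj pij vj for the current policy with the last relative value fixed at zero, and the policy-improvement routine then picks, in each state, the alternative that maximizes qik + Σj pijk vj.

```python
import numpy as np

# Assumed alternatives per state (illustrative numbers): transition rows and the
# expected immediate returns q_i^k = sum_j p_ij^k r_ij^k.
P = [[np.array([0.5, 0.5]), np.array([0.8, 0.2])],
     [np.array([0.4, 0.6]), np.array([0.7, 0.3])]]
q = [[6.0, 4.0],
     [-3.0, -5.0]]

N = 2
policy = [0, 0]                                  # arbitrary starting policy
while True:
    # Value-determination operation: solve g + v_i = q_i + sum_j p_ij v_j
    # for v_0 .. v_{N-2} and g, with the relative value v_{N-1} fixed at 0.
    A = np.zeros((N, N))
    b = np.zeros(N)
    for i in range(N):
        p = P[i][policy[i]]
        A[i, :N - 1] = np.eye(N)[i, :N - 1] - p[:N - 1]
        A[i, N - 1] = 1.0                        # coefficient of the gain g
        b[i] = q[i][policy[i]]
    x = np.linalg.solve(A, b)
    v, g = np.append(x[:N - 1], 0.0), x[N - 1]

    # Policy-improvement routine: maximize q_i^k + sum_j p_ij^k v_j in each state.
    new_policy = [int(np.argmax([q[i][k] + P[i][k] @ v for k in range(len(P[i]))]))
                  for i in range(N)]
    if new_policy == policy:
        break
    policy = new_policy

print("optimal policy:", policy, "gain per transition:", g)
```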
