Neuro-Dynamic Programming. José A. Ramírez, Yan Liao. Advanced Decision Processes, ECECS 841, Spring 2003, University of Cincinnati.
Outline
1. Neuro-Dynamic Programming (NDP): motivation.
2. Introduction to Infinite Horizon Problems:
• Minimization of Total Cost,
• Discounted Problems,
• Finite-State Systems,
• Value Iteration and Error Bounds,
• Policy Iteration,
• The Role of Contraction Mappings.
3. Stochastic Control Overview:
• State Equation (system model),
• Value Function,
• Stationary policies and value function,
• Introductory example: Tetris (game).
Neuro-Dynamic Programming - ECECS 841- Advanced Decision Processes - J.Ramirez & Y. Liao
Outline (cont.)
4. Control of Complex Systems:
• Motivation for the use of NDP in complex systems,
• Examples of complex systems where NDP could be applied.
5. Value Function Approximation:
• Linear parameterization: parameter vector and basis functions,
• Continuation of the Tetris example.
Outline (cont.)
6. Temporal-Difference Learning (TD(λ)):
• Introduction: autonomous systems, general TD(λ) algorithm,
• Controlled systems, TD(λ) for more general systems:
  • Approximate policy iteration,
  • Controlled TD,
  • Q-functions, and approximating the Q-function (Q-learning),
• Comments about the relationship with Approximate Value Iteration.
7. Actors and Critics:
• Averaged Rewards,
• Independent actors,
• Using critic feedback.
1. Neuro-Dynamic Programming (NDP): motivation
Study of decision-making:
-How decisions are made (psychologists, economists): “rational” and “irrational” behavior.
-How decisions ought to be made, i.e. “rational decision-making” (engineers and management scientists): clear objectives, strategic behavior.
Rational decision problems:
-Development of mathematical theory: understanding of dynamic models, uncertainties, and objectives, and characterization of optimal decision strategies.
-If optimal strategies do exist, computational methods are used as a complement (e.g., for implementation).
1. Neuro-Dynamic Programming (NDP): motivation
-In contrast to rational decision-making, there is no clear-cut mathematical theory about decisions made by participants in natural systems (theories are speculative, and ideas are refined by experimentation).
-One approach: hypothesize that behavior is in some sense rational, then use ideas from the study of rational decision-making to characterize such behavior, e.g., utility and equilibrium theory in financial economics.
-The study of animal behavior is also of interest: evolutionary theory and its popular precept, “survival of the fittest”, support the possibility that behavior to some extent concurs with that of a rational agent.
-Where the study of natural systems can contribute to the science of rational decision-making: the computational complexity of decision problems and the lack of systematic approaches for dealing with it.
1. Neuro-Dynamic Programming (NDP): motivation
-For example: practical problems addressed by the theory of dynamic programming (DP) can rarely be solved using DP algorithms, because the computational time required to generate optimal strategies typically grows exponentially in the number of variables involved. This is the curse of dimensionality.
-This calls for an understanding of suboptimal solutions and of decision-making under computational constraints. Problem: no satisfactory theory has been developed to this end.
-Interesting: the fact that biological mechanisms facilitate the efficient synthesis of adequate strategies motivates the possibility that understanding such mechanisms can inspire new and computationally feasible methodologies for strategic decision-making.
1. Neuro-Dynamic Programming (NDP): motivation
-Reinforcement Learning (RL): over two decades, RL algorithms, originally conceived as descriptive models for phenomena observed in animal behavior, have grown within the field of artificial intelligence and have been applied to complex sequential decision problems.
-The success of RL in solving large-scale problems has generated special interest among operations researchers and control theorists, with research devoted to understanding those methods and their potential.
-Developments from operations research and control theory focus on a normative view. Acknowledging the relative disconnect from descriptive models of animal behavior, some operations researchers and control theorists have come to refer to this area of research as Neuro-Dynamic Programming (NDP) instead of RL.
1. Neuro-Dynamic Programming (NDP): motivation
-During these lectures we will present a sample of recent developments and open research issues in NDP.
-Specifically, we will focus on the two algorithmic ideas of greatest use in NDP, for which there has been significant theoretical progress in recent years:
  -Temporal-Difference learning,
  -Actor-Critic methods.
-First, we provide some background and perspective on the methodology and the problems it may address.
Comments about references.
2. Introduction to Infinite Horizon Problems
Material taken from “Dynamic Programming and Optimal Control”, vol. I and II, by Dimitri P. Bertsekas, and “Neuro-Dynamic Programming” by Dimitri P. Bertsekas and John Tsitsiklis.
Dynamic programming problems with an infinite horizon are characterized by the following aspects:
a) The number of stages is infinite*.
b) The system is stationary: the system equation, the cost per stage, and the random disturbance statistics do not change from one stage to the next.
Why infinite horizon problems?
-They are interesting because their analysis is elegant and insightful.
-Implementation of optimal policies is often simple. Optimal policies are typically stationary, i.e., the optimal rule used to choose controls does not change from stage to stage.
-NDP: complex systems.
*This assumption is never satisfied in practice, but it is a reasonable approximation for problems with a finite but very large number of stages.
2. Introduction to Infinite Horizon Problems
• -They require more sophisticated analysis than finite horizon problems: the limiting behavior as the horizon tends to infinity must be analyzed.
• -We consider four principal classes of infinite horizon problems. The first two classes try to minimize $J_\pi(x_0)$, the total cost over an infinite number of stages under policy $\pi = \{\mu_0, \mu_1, \ldots\}$ and discount factor $\alpha$:
$J_\pi(x_0) = \lim_{N \to \infty} E\left\{ \sum_{k=0}^{N-1} \alpha^k g(x_k, \mu_k(x_k), w_k) \right\}$
• i) Stochastic shortest path problems: here $\alpha = 1$ and we assume that there is an additional state 0, a cost-free termination state; once the system reaches the termination state it remains there at no additional cost. Objective: reach the termination state with minimal cost.
• ii) Discounted problems with bounded cost per stage: here $\alpha < 1$, and the absolute one-stage cost $|g(x,u,w)|$ is bounded from above by some constant $M$. Thus $J_\pi(x_0)$ is well defined, because it is the infinite sum of a sequence of numbers that are bounded in absolute value by the decreasing geometric progression $M\alpha^k$, whose sum is $M/(1-\alpha)$.
2. Introduction to Infinite Horizon Problems
• iii) Discounted problems with unbounded cost per stage: here the discount factor $\alpha$ may or may not be less than 1, and the cost per stage is unbounded. This problem is difficult to analyze because of the possibility of infinite cost for some policies (more details in chap. 3, “Dynamic Programming”, vol. II, by Bertsekas).
• iv) Average cost problems: in some problems we have $J_\pi(x_0) = \infty$ for every policy $\pi$ and initial state $x_0$. In many such problems the average cost per stage, given by
$\lim_{N \to \infty} \frac{1}{N} J_\pi^N(x_0)$,
where $J_\pi^N(x_0)$ is the N-stage cost-to-go of policy $\pi$ starting at state $x_0$, is well defined as a limit and is finite.
2. Introduction to Infinite Horizon Problems
• A Preview of Infinite Horizon Results:
• Let $J^*$ be the optimal cost-to-go function of the infinite horizon problem, and consider the case $\alpha = 1$, with $J_N(x)$ the optimal cost of the problem involving N stages, initial state $x$, cost per stage $g(x,u,w)$, and zero terminal cost. The N-stage cost can be computed after N iterations of the DP algorithm*:
$J_{k+1}(x) = \min_{u \in U(x)} E_w\{ g(x,u,w) + J_k(f(x,u,w)) \}, \quad k = 0, 1, \ldots, \quad J_0(x) = 0.$
• Thus, we can speculate the following:
• i) The optimal infinite horizon cost is the limit of the corresponding N-stage optimal costs as $N \to \infty$: $J^*(x) = \lim_{N \to \infty} J_N(x)$.
*Note that the time indexing has been reversed from the original DP algorithm, so the optimal finite horizon cost functions can be computed with a single DP recursion (more details in chap. 1, “Dynamic Programming”, vol. II, by D.P. Bertsekas).
2. Introduction to Infinite Horizon Problems
• ii) The following limiting form of the DP algorithm should hold for all states:
$J^*(x) = \min_{u \in U(x)} E_w\{ g(x,u,w) + J^*(f(x,u,w)) \}.$
This is not an algorithm but a system of equations (one equation per state), whose solution is the costs-to-go of all states. It can also be viewed as a functional equation for the cost-to-go function $J^*$, and it is called Bellman’s equation.
• iii) If $\mu(x)$ attains the minimum in the right-hand side of Bellman’s equation for each $x$, then the policy $\pi = \{\mu, \mu, \ldots\}$ should be optimal. This is true for most infinite horizon problems of interest, and it points to stationary policies.
• Most of the analysis of infinite horizon problems centers on the above three issues and on efficient methods to compute $J^*$ and optimal policies.
2. Introduction to Infinite Horizon Problems
Stationary Policy: A stationary policy is an admissible policy of the form $\pi = \{\mu, \mu, \ldots\}$, with corresponding cost function $J_\mu(x)$. The policy is optimal if $J_\mu(x) = J^*(x)$ for all states $x$.
Some Shorthand Notation: The use of single recursions in the DP algorithm to compute optimal costs over a finite horizon motivates the introduction of two mappings that play an important theoretical role and give a convenient shorthand for expressions that are complicated to write. For any function $J: S \to \mathbb{R}$, where $S$ is the state space, we consider the function $TJ$ obtained by applying the DP mapping to $J$:
$(TJ)(x) = \min_{u \in U(x)} E_w\{ g(x,u,w) + \alpha J(f(x,u,w)) \}, \quad x \in S.$
$T$ can be viewed as a mapping that transforms the function $J$ on $S$ into the function $TJ$ on $S$. $TJ$ represents the optimal cost function for the one-stage problem that has stage cost $g$ and terminal cost $\alpha J$ (simply $J$ when $\alpha = 1$).
2. Introduction to Infinite Horizon Problems
Similarly, for any control function $\mu: S \to C$, where $C$ is the space of controls, we have:
$(T_\mu J)(x) = E_w\{ g(x,\mu(x),w) + \alpha J(f(x,\mu(x),w)) \}, \quad x \in S.$
Also, we denote by $T^k$ the composition of the mapping $T$ with itself k times:
$(T^k J)(x) = (T(T^{k-1} J))(x), \quad x \in S.$
Then, for k = 0 we have
$(T^0 J)(x) = J(x), \quad x \in S.$
2. Introduction to Infinite Horizon Problems
Some Basic Properties
Monotonicity Lemma: For any functions $J: S \to \mathbb{R}$ and $J': S \to \mathbb{R}$ such that
$J(x) \le J'(x)$ for all $x \in S$,
and for any function $\mu: S \to C$ with $\mu(x) \in U(x)$ for all $x \in S$, we have
$(T^k J)(x) \le (T^k J')(x), \quad (T_\mu^k J)(x) \le (T_\mu^k J')(x), \quad$ for all $x \in S$, $k = 1, 2, \ldots$
2. Introduction to Infinite Horizon Problems
The Role of Contraction Mappings (Dynamic Programming, vol. II, Bertsekas)
Definition 1.4.1: A mapping $H: B(S) \to B(S)$ is said to be a contraction mapping if there exists a scalar $\rho < 1$ such that
$\|HJ - HJ'\| \le \rho \|J - J'\|$ for all $J, J' \in B(S)$,
where $\|\cdot\|$ is the norm $\|J\| = \sup_{x \in S} |J(x)|$.
It is said to be an m-stage contraction mapping if there exists a positive integer $m$ and some $\rho < 1$ such that
$\|H^m J - H^m J'\| \le \rho \|J - J'\|$ for all $J, J' \in B(S)$,
where $H^m$ denotes the composition of $H$ with itself m times.
Note: $B(S)$ is the set of all bounded real-valued functions on $S$, that is, functions $J: S \to \mathbb{R}$ with $\|J\| < \infty$.
2. Introduction to Infinite Horizon Problems
The Role of Contraction Mappings (Dynamic Programming, vol. II, Bertsekas)
Proposition 1.4.1 (Contraction Mapping Fixed-Point Theorem): If $H: B(S) \to B(S)$ is a contraction mapping or an m-stage contraction mapping, then there exists a unique fixed point of $H$; i.e., there exists a unique function $J^* \in B(S)$ such that
$HJ^* = J^*.$
Furthermore, if $J$ is any function in $B(S)$ and $H^k$ is the composition of $H$ with itself k times, then
$\lim_{k \to \infty} \|H^k J - J^*\| = 0.$
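A minimal sketch of the fixed-point theorem at work, assuming a discounted problem with α < 1, for which the DP mapping T is a contraction with modulus α in the sup norm. The two-state example, the tables `P` and `G`, and all function names are hypothetical and chosen only for illustration; they do not come from the lecture material.

```python
# Toy discounted problem; every number here is an illustrative assumption.
ALPHA = 0.9  # discount factor, strictly less than 1

# P[x][u] lists (next_state, probability) pairs; G[x][u] is the expected stage cost.
P = {
    0: {0: [(0, 0.8), (1, 0.2)], 1: [(1, 1.0)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(0, 1.0)]},
}
G = {0: {0: 2.0, 1: 0.5}, 1: {0: 1.0, 1: 3.0}}


def T(J):
    """DP mapping: (TJ)(x) = min_u [ g(x,u) + alpha * E{ J(next state) } ]."""
    return [
        min(G[x][u] + ALPHA * sum(p * J[y] for y, p in P[x][u]) for u in P[x])
        for x in sorted(P)
    ]


def sup_norm(J1, J2):
    """Sup-norm distance between two cost functions stored as lists."""
    return max(abs(a - b) for a, b in zip(J1, J2))


def value_iteration(J, tol=1e-10, max_iter=10_000):
    """Apply T repeatedly until successive iterates differ by less than tol."""
    for _ in range(max_iter):
        TJ = T(J)
        if sup_norm(TJ, J) < tol:
            return TJ
        J = TJ
    return J


# Two very different starting points converge to the same fixed point J*.
J_star = value_iteration([0.0, 0.0])
J_other = value_iteration([100.0, -50.0])
```

Because the mapping contracts distances by at least the factor α at every application, the starting function only affects how many iterations are needed, not the limit.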
3. Stochastic Control Overview
State Equation: Consider a discrete-time dynamic system that, at each time $t$, takes on a state $x_t$ and evolves according to
$x_{t+1} = f(x_t, a_t, w_t),$
where $w_t$ is a disturbance (iid) and $a_t$ is a control decision. We restrict attention to finite state, disturbance, and control spaces, denoted by $X$, $W$, and $A$, respectively.
Value Function: Let $r: X \times A \to \mathbb{R}$ associate a reward $r(x_t, a_t)$ with a decision $a_t$ made at state $x_t$. Let $\mu$ be a stationary policy with $\mu: X \to A$. For each policy $\mu$ we define a value function $v(\cdot, \mu): X \to \mathbb{R}$:
$v(x, \mu) = E\left[ \sum_{t=0}^{\infty} \alpha^t r(x_t, \mu(x_t)) \,\middle|\, x_0 = x \right],$
where $\alpha \in [0, 1)$ is a discount factor.
3. Stochastic Control Overview
Optimal Value Function: we define the optimal value function $V$ as follows:
$V(x) = \max_{\mu} v(x, \mu).$
From dynamic programming, we have that any stationary policy $\mu^*$ given by
$\mu^*(x) \in \arg\max_{a \in A} E_w\left[ r(x, a) + \alpha V(f(x, a, w)) \right]$
is optimal in the sense that
$V(x) = v(x, \mu^*)$ for every state $x \in X$.
3. Stochastic Control Overview
Example 1: Tetris
The video arcade game of Tetris can be viewed as an instance of stochastic control. In particular, we can view the state $x_t$ as an encoding of the current “wall of bricks” and the shape of the current “falling piece.” The decision $a_t$ identifies an orientation and horizontal position for placement of the falling piece onto the wall. Though the arcade game employs a more complicated scoring system, consider for simplicity a reward $r(x_t, a_t)$ equal to the number of rows eliminated by placing the piece in the position described by $a_t$. Then, a stationary policy that maximizes the value essentially optimizes a combination of present and future row elimination, with decreasing emphasis placed on rows to be eliminated at times farther into the future.
3. Stochastic Control Overview
Example 1: Tetris, cont.
Tetris was first programmed by Alexey Pajitnov, Dmitry Pavlovsky, and Vadim Gerasimov, computer engineers at the Computer Center of the Russian Academy of Sciences, in 1985-86.
[Figure: Tetris board of width w = 10 and height h = 20, with the standard shapes and the number of states.]
3. Stochastic Control Overview
Dynamic programming algorithms compute the optimal value function $V$. The result is stored in a “look-up” table with one entry $V(x)$ per state $x \in X$. When required, the value function is used to generate optimal decisions. For example, given a current state $x_t \in X$, a decision $a_t$ is selected according to
$a_t \in \arg\max_{a \in A} E_w\left[ r(x_t, a) + \alpha V(f(x_t, a, w)) \right].$
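The look-up-table scheme can be sketched on a toy system as follows. This is only an illustration under stated assumptions: the three-state chain, the "slip" disturbance, the reward, the discount factor 0.9, and every name in the code (`f`, `r`, `q_value`, `greedy_action`) are hypothetical. The table `V` is filled by repeated DP sweeps and then queried to pick a decision that maximizes the expected reward plus discounted next-state value.

```python
ALPHA = 0.9                          # discount factor (assumed)
STATES = [0, 1, 2]                   # finite state space X
ACTIONS = [-1, 1]                    # finite control space A: move left or right
DISTURBANCES = [(0, 0.8), (1, 0.2)]  # (w, probability); w = 1 means the move "slips"


def f(x, a, w):
    """State equation x_{t+1} = f(x_t, a_t, w_t): move by a unless the move slips."""
    step = 0 if w == 1 else a
    return min(max(x + step, 0), 2)


def r(x, a):
    """Reward: 1 whenever a decision is made in state 2, otherwise 0."""
    return 1.0 if x == 2 else 0.0


def q_value(V, x, a):
    """Expected reward plus discounted next-state value for decision a at state x."""
    return sum(p * (r(x, a) + ALPHA * V[f(x, a, w)]) for w, p in DISTURBANCES)


def greedy_action(V, x):
    """Decision selected from the look-up table V at state x."""
    return max(ACTIONS, key=lambda a: q_value(V, x, a))


# Fill the look-up table by repeated DP sweeps (value iteration).
V = {x: 0.0 for x in STATES}
for _ in range(500):
    V = {x: max(q_value(V, x, a) for a in ACTIONS) for x in STATES}

policy = {x: greedy_action(V, x) for x in STATES}
```

Note that the table needs one entry per state, which is exactly what becomes infeasible when the state space is very large.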
4. Control of Complex Systems
• The main objective is the development of a methodology for the control of “complex systems”.
• Two common characteristics of this type of system are:
• i- An intractable state space: intractable state spaces preclude the use of classical DP algorithms, which compute and store one numerical value per state.
• ii- Severe nonlinearities: methods of traditional linear control, which are applicable even in large state spaces, are ruled out by severe nonlinearities.
• Let’s review some examples of complex systems where NDP could be, and has been, applied.
4. Control of Complex Systems
a) Call Admission and Routing
With rising demand in telecommunication network resources, effective management is as important as ever. Admission (deciding which calls to accept/reject) and routing (allocating links in the network to particular calls) are examples of decisions that must be made at any point in time. The objective is to make the “best” use of limited network resources. In principle, such sequential decision problems can be addressed by dynamic programming. Unfortunately, the enormous state spaces involved render dynamic programming algorithms inapplicable, and heuristic control strategies are used in lieu.
b) Strategic Asset Allocation
Strategic asset allocation is the problem of distributing an investor’s wealth among assets in the market in order to take on a combination of risk and expected return that best suits the investor’s preferences. In general, the optimal strategy involves dynamic rebalancing of wealth among assets over time. If each asset offers a fixed rate of risk and return, and some additional simplifying assumptions are made, the only state variable is wealth, and the problem can be solved efficiently by dynamic programming algorithms. There are even closed form solutions in cases involving certain types of investor preferences. However, in the more realistic setting involving risks and returns that fluctuate with economic conditions, economic indicators must be taken into account as state variables, and this quickly leads to an intractable state space. The design of effective strategies in such situations constitutes an important challenge in the growing field of financial engineering.
4. Control of Complex Systems
c) Supply–Chain Management
With today’s tight vertical integration, increased production complexity, and diversification, the inventory flow within and among corporations can be viewed as a complex network – called a supply chain – consisting of storage, production, and distribution sites. In a supply chain, raw materials and parts from external vendors are processed through several stages to produce finished goods. Finished goods are then transported to distributors, then to wholesalers, and finally retailers, before reaching customers. The goal in supply–chain management is to achieve a particular level of product availability while minimizing costs. The solution is a policy that decides how much to order or produce at various sites given the present state of the company and the operating environment.
d) Emissions Reductions
The threat of global warming that may result from accumulation of carbon dioxide and other “greenhouse gases” poses a serious dilemma. In particular, cuts in emission levels bear a detrimental short–term impact on economic growth. At the same time, a depleting environment can severely hurt the economy – especially the agricultural sector – in the longer term. To complicate the matter further, scientific evidence on the relationship between emission levels and global warming is inconclusive, leading to uncertainty about the benefits of various cuts. One systematic approach to considering these conflicting goals involves the formulation of a dynamic system model that describes our understanding of economic growth and environmental science. Given such a model, the design of environmental policy amounts to dynamic programming. Unfortunately, classical algorithms are inapplicable due to the size of the state space.
4. Control of Complex Systems
e) Semiconductor Wafer Fabrication
The manufacturing floor at a semiconductor wafer fabrication facility is organized into service stations, each equipped with specialized machinery. There is a single stream of jobs arriving on a production floor. Each job follows a deterministic route that revisits the same station multiple times. This leads to a scheduling problem where, at any time, each station must select a job to service such that (long term) production capacity is maximized. Such a system can be viewed as a special class of queueing networks, which are models suitable for a variety of applications in manufacturing, telecommunications, and computer systems. Optimal control of queueing networks is notoriously difficult, and this reputation is strengthened by formal characterizations of computational complexity.
Other systems: parking lots, football, games strategy, combinatorial optimization, maintenance and repair, dynamic channel allocation, backgammon.
Some papers on applications:
-Tsitsiklis, J. and Van Roy, B. “Neuro-Dynamic Programming: Overview and a Case Study in Optimal Stopping.” IEEE Proceedings of the 36th Conference on Decision & Control, San Diego, California, pp. 1181-1186, December 1997.
-Van Roy, B., Bertsekas, D.P., Lee, Y., and Tsitsiklis, J. “A Neuro-Dynamic Programming Approach to Retailer Inventory Management.” IEEE Proceedings of the 36th Conference on Decision & Control, San Diego, California, pp. 4052-4057, December 1997.
-Marbach, P. and Tsitsiklis, J. “A Neuro-Dynamic Programming Approach to Admission Control in ATM Networks: The Single Link Case.” Technical Report LIDS-P-2402, Laboratory for Information and Decision Systems, M.I.T., November 1997.
-Marbach, P., Mihatsch, O., and Tsitsiklis, J. “Call Admission Control and Routing in Integrated Services Networks Using Reinforcement Learning.” IEEE Proceedings of the 37th Conference on Decision & Control, Tampa, Florida, pp. 563-568, December 1998.
-Bertsekas, D.P. and Homer, M.L. “Missile Defense and Interceptor Allocation by Neuro-Dynamic Programming.” IEEE Transactions on Systems, Man and Cybernetics, vol. 30, pp. 101-110, 2000.
4. Control of Complex Systems
-For the examples presented, state spaces are intractable as a consequence of the “curse of dimensionality”: state spaces grow exponentially in the number of state variables. It is therefore difficult (if not impossible) to compute and store one value per state, as classical DP requires.
-An additional shortcoming of classical DP: computations require the use of transition probabilities. For many complex systems, such probabilities are not readily accessible. On the other hand, it is often easier to develop simulation models for the system and generate sample trajectories.
-Objective of NDP: to overcome the curse of dimensionality through the use of parameterized (value) function approximators and of output generated by simulators, rather than explicit transition probabilities.
5. Value Function Approximation
-Intractability of state spaces motivates value function approximation.
-Two important preconditions for the development of an effective approximation:
i- Choosing a parameterization* that yields a good approximation.
ii- Algorithms for computing appropriate parameter values.
*Note: the choice of a suitable parameterization requires some practical experience or theoretical analysis that provides rough information about the shape of the function to be approximated.
5. Value Function Approximation
Linear parameterization
-General classes of parameterizations have been used in NDP. To keep the exposition simple, let us focus on linear parameterizations, which take the form
$\tilde{V}(x, u) = \sum_{k=1}^{K} u(k)\, \phi_k(x),$
where $\phi_1, \ldots, \phi_K$ are “basis functions” mapping $X$ to $\mathbb{R}$ and $u = (u(1), \ldots, u(K))'$ is a vector of scalar weights. As in statistical regression, the basis functions $\phi_1, \ldots, \phi_K$ are selected by a human, based on intuition or analysis of the problem at hand.
Hint: one interpretation that is useful for the construction of basis functions involves viewing each function $\phi_k$ as a “feature”, that is, a numerical value capturing a salient characteristic of the state that may be pertinent to effective decision making.
5. Value Function Approximation
Example 2: Tetris, continuation.
In our stochastic control formulation of Tetris, the state is an encoding of the current wall configuration and the current falling piece. There are clearly too many states for exact dynamic programming algorithms to be applicable. However, we may believe that most information relevant to game-playing decisions can be captured by a few intuitive features. In particular, one feature, say $\phi_1$, may map states to the height of the wall. Another, say $\phi_2$, could map states to a measure of “jaggedness” of the wall. A third might provide a scalar encoding of the type of the current falling piece (there are seven different shapes in the arcade game). Given a collection of such features, the next task is to select weights $u(1), \ldots, u(K)$ such that
$\sum_{k=1}^{K} u(k)\, \phi_k(x) \approx V(x)$
for all states $x$. This approximation could then be used to generate a game-playing strategy.
5. Value Function Approximation
• Example 2: Tetris, continuation.
• A similar approach is presented in the book “Neuro-Dynamic Programming” (chapter 8, case studies) by D.P. Bertsekas and J. Tsitsiklis, with the following parameterization, obtained after some experimentation:
• The height $h_k$ of the kth column of the wall. There are w such features, where w is the wall’s width.
• The absolute difference $|h_k - h_{k+1}|$ between the heights of the kth and the (k+1)st columns, k = 1, …, w-1.
• The maximum wall height $\max_k h_k$.
• The number of holes $L$ in the wall, that is, the number of empty positions of the wall that are surrounded by full positions.
[Figure: Tetris board of width w = 10 and height h = 20.]
5. Value Function Approximation
Example 2: Tetris, continuation.
Thus, there are 2w+1 features, which together with a constant offset $u(0)$ require 2w+2 weights in a linear architecture of the form
$\tilde{V}(x, u) = u(0) + \sum_{k=1}^{w} u(k)\, h_k + \sum_{k=1}^{w-1} u(k+w)\, |h_k - h_{k+1}| + u(2w) \max_k h_k + u(2w+1)\, L.$
Using this parameterization with w = 10 (21 features, 22 weights), a strategy is generated by NDP that eliminates an average of 3554 rows per game, reflecting performance comparable to that of an expert player.
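A minimal sketch of how the 2w+1 features and the linear architecture might be computed from a board. The board encoding (a list of rows of 0/1 cells, top row first), the function names, and the hole definition are assumptions made for this illustration: a hole is counted here as any empty cell lying below the topmost full cell of its column, which is a simplification of “surrounded by full positions”.

```python
def tetris_features(board):
    """Return [h_1..h_w, |h_k - h_{k+1}| for k=1..w-1, max_k h_k, L] for a 0/1 board.

    board is a list of rows (top row first); each row is a list of 0/1 cells.
    """
    n_rows, width = len(board), len(board[0])
    heights = []
    holes = 0
    for col in range(width):
        column = [board[row][col] for row in range(n_rows)]
        if 1 in column:
            top = column.index(1)              # topmost full cell in this column
            heights.append(n_rows - top)
            holes += column[top:].count(0)     # empty cells below the top (simplified holes)
        else:
            heights.append(0)
    diffs = [abs(heights[k] - heights[k + 1]) for k in range(width - 1)]
    return heights + diffs + [max(heights), holes]


def approx_value(board, u):
    """Linear architecture: u[0] is the constant offset, u[1:] weight the features."""
    phi = [1.0] + tetris_features(board)
    if len(u) != len(phi):
        raise ValueError("expected 2w+2 weights")
    return sum(weight * feature for weight, feature in zip(u, phi))


# A tiny 4 x 3 board (w = 3), used only to exercise the functions.
example_board = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 1],
]
example_features = tetris_features(example_board)
```

For a board of width w the feature list has 2w+1 entries, so `approx_value` expects 2w+2 weights, matching the count given above.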
6. Temporal-Difference Learning
Introduction to Temporal-Difference Learning
Material from: Richard Sutton, “Learning to Predict by the Methods of Temporal Differences”, Machine Learning, 3: 9-44, 1988. This paper provides the first formal results in the theory of temporal-difference (TD) methods.
-“Learning to predict”:
  -Use of past experience (historical information) with an incompletely known system to predict its future behavior.
  -“Learning to predict” is one of the most basic and prevalent kinds of learning.
  -In prediction learning, training examples can be taken directly from the temporal sequence of ordinary sensory input; no special supervisor or teacher is required.
-Conventional prediction-learning methods (Widrow-Hoff, LMS, Delta Rule, Backpropagation):
  -Driven by the error between predicted and actual outcomes.
6. Temporal-Difference Learning
• -TD methods:
• Driven by the error or difference between temporally successive predictions: learning occurs whenever there is a change in prediction over time.
• -Advantages of TD methods over conventional methods:
• They are more incremental, and therefore easier to compute.
• They tend to make more efficient use of their experience: they converge faster and produce better predictions.
• -TD approach:
• Predictions are based on numerical features combined using adjustable parameters or “weights”, similar to connectionist models (neural networks).
6. Temporal-Difference Learning
• -TD and supervised-learning approaches to prediction:
• Historically, the most important learning paradigm has been supervised learning: the learner is asked to associate pairs of items (input, output).
• Supervised learning has been used in pattern classification, concept acquisition, learning from examples, system identification, and associative memory.
[Figure: supervised-learning block diagram. The same input feeds the real system A and the estimator; the error between the real output (+) and the estimated output (-) drives a learning algorithm that adjusts the estimator parameters.]
6. Temporal-Difference Learning
• -Single-step and multi-step prediction problems:
• Single-step: all information about the correctness of each prediction is revealed at once. This is the setting of supervised learning methods.
• Multi-step: correctness is not revealed until more than one step after the prediction is made, but partial information relevant to its correctness is revealed at each step. This is the setting of TD learning methods.
• -Computational issues:
• Sutton introduces a particular TD procedure by formally relating it to a classical supervised-learning procedure, the Widrow-Hoff rule (also known as the “delta rule”, the ADALINE, or Adaptive Linear Element, and the Least Mean Squares, or LMS, filter).
• We consider multi-step prediction problems in which experience comes in observation-outcome sequences of the form $x_1, x_2, x_3, \ldots, x_m, z$, where each $x_t$ is a vector of observations available at time $t$ in the sequence, and $z$ is the outcome of the sequence. Also, $x_t \in \mathbb{R}^n$ and $z \in \mathbb{R}$.
6. Temporal-Difference Learning
• -Computational issues (cont.):
• For each observation-outcome sequence, the learner produces a corresponding sequence of predictions $P_1, P_2, P_3, \ldots, P_m$, each of which is an estimate of $z$.
• Predictions $P_t$ are based on a vector of modifiable parameters $w$: $P_t = P(x_t, w)$.
• All learning procedures are expressed as rules for updating $w$. For each observation, an increment to $w$, denoted $\Delta w_t$, is determined. After a complete sequence has been processed, $w$ is changed by (the sum of) all the sequence’s increments:
$w \leftarrow w + \sum_{t=1}^{m} \Delta w_t.$
• The supervised-learning approach treats each sequence of observations and its outcome as a sequence of observation-outcome pairs: $(x_1, z), (x_2, z), \ldots, (x_m, z)$. In this case the increment due to time $t$ depends on the error between $P_t$ and $z$, and on how changing $w$ will affect $P_t$.
6. Temporal-Difference Learning
• -Computational issues (cont.):
• Then, a prototypical supervised-learning update procedure is
$\Delta w_t = \alpha (z - P_t) \nabla_w P_t,$
where $\alpha$ is a positive parameter affecting the rate of learning, and the gradient $\nabla_w P_t$ is the vector of partial derivatives of $P_t$ with respect to each component of $w$.
• Special case: consider $P_t$ a linear function of $x_t$ and $w$, that is, $P_t = w^T x_t = \sum_i w(i)\, x_t(i)$, where $w(i)$ and $x_t(i)$ are the ith components of $w$ and $x_t$. Then $\nabla_w P_t = x_t$, and thus
$\Delta w_t = \alpha (z - w^T x_t)\, x_t,$
which corresponds to the Widrow-Hoff rule.
• This equation depends critically on $z$, and thus cannot be evaluated until the end of the sequence, when $z$ becomes known. All observations and predictions made during a sequence must be remembered until its end: $\Delta w_t$ cannot be computed incrementally.
6. Temporal-Difference Learning
-TD Procedure:
There is a temporal-difference procedure that produces exactly the same result and can be computed incrementally. The key is to represent the error $z - P_t$ as a sum of changes in predictions, as follows:
$z - P_t = \sum_{k=t}^{m} (P_{k+1} - P_k), \quad$ where $P_{m+1} = z$.
Using this equation and the prototypical supervised-learning equation, we have
$w \leftarrow w + \sum_{t=1}^{m} \alpha (z - P_t) \nabla_w P_t = w + \sum_{t=1}^{m} \alpha \sum_{k=t}^{m} (P_{k+1} - P_k) \nabla_w P_t = w + \sum_{t=1}^{m} \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \nabla_w P_k,$
so that
$\Delta w_t = \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \nabla_w P_k.$
This equation can be computed incrementally, because it depends only on a pair of successive predictions and on the sum of all past values of the gradient. We refer to this procedure as TD(1).
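The equivalence between the supervised (Widrow-Hoff) update and TD(1) can be checked numerically. The sketch below assumes a linear predictor $P_t = w^T x_t$ with $w$ held fixed during the sequence, so that $\nabla_w P_t = x_t$; the function names and the sample data are hypothetical and serve only this illustration.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def supervised_increment(w, xs, z, alpha):
    """Sum over t of alpha * (z - P_t) * x_t; needs the outcome z for every term."""
    total = [0.0] * len(w)
    for x in xs:
        error = z - dot(w, x)
        total = [ti + alpha * error * xi for ti, xi in zip(total, x)]
    return total


def td1_increment(w, xs, z, alpha):
    """Sum over t of alpha * (P_{t+1} - P_t) * (sum of gradients up to t), P_{m+1} = z."""
    total = [0.0] * len(w)
    grad_sum = [0.0] * len(w)          # running sum of past gradients x_1 + ... + x_t
    for t, x in enumerate(xs):
        grad_sum = [gi + xi for gi, xi in zip(grad_sum, x)]
        p_t = dot(w, x)
        p_next = dot(w, xs[t + 1]) if t + 1 < len(xs) else z
        total = [ti + alpha * (p_next - p_t) * gi for ti, gi in zip(total, grad_sum)]
    return total


# Hypothetical observation-outcome sequence x_1, x_2, x_3, z.
w0 = [0.5, -1.0]
observations = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
outcome = 2.0
dw_supervised = supervised_increment(w0, observations, outcome, alpha=0.1)
dw_td1 = td1_increment(w0, observations, outcome, alpha=0.1)
```

The TD(1) version only ever looks at the current and the next prediction plus one running vector, whereas the supervised version needs the outcome before any term can be evaluated.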
6. Temporal-Difference Learning
• The TD(λ) family of learning procedures:
• The "hallmark" of temporal-difference methods is their sensitivity to changes in successive predictions rather than to the overall error between predictions and the final outcome.
• In response to an increase (decrease) in prediction from $P_t$ to $P_{t+1}$, an increment $\Delta w_t$ is determined that increases (decreases) the predictions for some or all of the preceding observation vectors $x_1, \ldots, x_t$.
• TD(1) is the special case where all the predictions are altered to an equal extent.
• Now consider the case where greater alterations are made to more recent predictions. We consider an exponential weighting with recency, in which alterations to the predictions of observation vectors occurring $k$ steps in the past are weighted according to $\lambda^k$ for $0 \le \lambda \le 1$:
$$\Delta w_t = \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{t-k} \nabla_w P_k$$
[Figure: the weight $\lambda^{t-k}$, between 0 and 1, as $k$ increases, with cases $\lambda = 1$ (TD(1)), $0 < \lambda < 1$, and $\lambda = 0$ (TD(0)).]
6. Temporal-Difference Learning
• The TD(λ) family of learning procedures:
• For $\lambda = 0$ we have the TD(0) procedure:
$$\Delta w_t = \alpha (P_{t+1} - P_t) \nabla_w P_t$$
• For $\lambda = 1$ we have the TD(1) procedure, which is equivalent to the Widrow-Hoff rule, except that TD(1) is incremental:
$$\Delta w_t = \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \nabla_w P_k$$
• Alterations of past predictions can be weighted in ways other than the exponential form given previously. For the exponential form, let
$$e_t = \sum_{k=1}^{t} \lambda^{t-k} \nabla_w P_k$$
so that $\Delta w_t = \alpha (P_{t+1} - P_t)\, e_t$. The vectors $e_t$ are also referred to in the literature as eligibility vectors.
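A minimal sketch of TD(λ) for a linear predictor follows. It relies on the standard recursion $e_{t} = \lambda e_{t-1} + \nabla_w P_t$ for the exponentially weighted gradient sum, which is what makes the procedure incremental. The function name and the test data are assumptions for illustration.

```python
import numpy as np

def td_lambda_increment(w, xs, z, alpha, lam):
    """Total increment over one sequence for linear P_t = w^T x_t.

    Hypothetical helper: w is held fixed during the sequence, and the
    weighted gradient sum e_t is carried forward as e_t = lam * e_{t-1} + x_t.
    """
    xs = np.asarray(xs, dtype=float)
    preds = np.append(xs @ w, z)      # P_1..P_m followed by P_{m+1} = z
    e = np.zeros_like(w)              # exponentially weighted gradient sum
    total = np.zeros_like(w)
    for t in range(len(xs)):
        e = lam * e + xs[t]
        total += alpha * (preds[t + 1] - preds[t]) * e
    return total

rng = np.random.default_rng(1)
w_demo = rng.normal(size=3)
xs_demo = rng.normal(size=(6, 3))
z_demo, alpha_demo = 1.5, 0.1

inc_td1 = td_lambda_increment(w_demo, xs_demo, z_demo, alpha_demo, lam=1.0)
inc_td0 = td_lambda_increment(w_demo, xs_demo, z_demo, alpha_demo, lam=0.0)
```

With `lam=1.0` the result matches the supervised (Widrow-Hoff) total, and with `lam=0.0` only the most recent gradient is credited with each change in prediction.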
6. Temporal-Difference Learning
• (Material taken from Neuro-Dynamic Programming, Chapter 5)
• Monte Carlo simulation: brief overview
• Suppose $v$ is a random variable with unknown mean $m$ that we want to estimate.
• Using Monte Carlo simulation to estimate $m$: generate a number of samples $v_1, v_2, \ldots, v_N$, and then estimate $m$ by forming the sample mean
$$M_N = \frac{1}{N} \sum_{k=1}^{N} v_k$$
• We can also compute the sample mean recursively:
$$M_{N+1} = M_N + \frac{1}{N+1} (v_{N+1} - M_N)$$
with $M_1 = v_1$.
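The recursive form of the sample mean can be sketched in a few lines. The function name and the sample values are illustrative assumptions.

```python
def running_mean(samples):
    """Sample mean via M_{N+1} = M_N + (v_{N+1} - M_N) / (N + 1), M_1 = v_1."""
    it = iter(samples)
    mean = next(it)                       # M_1 = v_1
    for n, v in enumerate(it, start=1):   # n samples absorbed so far
        mean += (v - mean) / (n + 1)
    return mean

demo_samples = [2.0, 4.0, 9.0]
demo_mean = running_mean(demo_samples)
```

Only the current estimate and the sample count are kept, so the samples can be discarded as soon as they are used.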
6. Temporal-Difference Learning
• Monte Carlo simulation: case of i.i.d. samples
• Suppose the $N$ samples $v_1, v_2, \ldots, v_N$ are independent and identically distributed, with mean $m$ and variance $\sigma^2$. Then we have
$$E[M_N] = \frac{1}{N} \sum_{k=1}^{N} E[v_k] = m$$
so $M_N$ is said to be an unbiased estimator of $m$. Also, its variance is given by
$$\mathrm{Var}(M_N) = \frac{1}{N^2} \sum_{k=1}^{N} \mathrm{Var}(v_k) = \frac{\sigma^2}{N}$$
• As $N \to \infty$, the variance of $M_N$ converges to zero, so $M_N$ converges to $m$ in the mean-square sense.
• The strong law of large numbers provides an additional property: the sequence $M_N$ converges to $m$ with probability one. The estimator is consistent.
6. Temporal-Difference Learning
• Policy evaluation by Monte Carlo simulation
• Consider the stochastic shortest path problem with state space $\{0, 1, 2, \ldots, n\}$, where 0 is an absorbing, cost-free state. Let $g(i, j)$ be the cost of a transition from $i$ to $j$, which occurs with probability $p_{ij}(\mu(i))$ under the control action $\mu(i)$. Suppose that we have a fixed, proper stationary policy $\mu$ and we want to calculate, using simulation, the corresponding cost-to-go vector $J^\mu = (J^\mu(1), J^\mu(2), \ldots, J^\mu(n))'$.
• Approach: generate, starting from each $i$, many sample state trajectories and average the corresponding costs to obtain an approximation to $J^\mu(i)$. Instead of doing this separately for each state $i$, use each trajectory to obtain cost samples for all states visited by the trajectory, by considering the cost of the trajectory portion that starts at each intermediate state.
6. Temporal-Difference Learning
• Policy evaluation by Monte Carlo simulation
• Suppose that a number of simulation runs are performed, each ending at the termination state 0.
• Consider the $m$-th time a given state $i_0$ is encountered, and let $(i_0, i_1, \ldots, i_N)$ be the remainder of the corresponding trajectory, where $i_N = 0$.
• Let $c(i_0, m)$ be the cumulative cost up to reaching state 0:
$$c(i_0, m) = g(i_0, i_1) + g(i_1, i_2) + \cdots + g(i_{N-1}, i_N)$$
• Some assumptions: different simulated trajectories are statistically independent, and each trajectory is generated according to the Markov process determined by the policy $\mu$. Then we have
$$J^\mu(i_0) = E[c(i_0, m)]$$
6. Temporal-Difference Learning
• Policy evaluation by Monte Carlo simulation
• The estimate of $J^\mu(i)$ is obtained by forming the sample mean
$$J(i) = \frac{1}{K} \sum_{m=1}^{K} c(i, m)$$
subsequent to the $K$th encounter with state $i$.
• The sample mean can be expressed in iterative form:
$$J(i) := J(i) + \gamma_m \big( c(i, m) - J(i) \big), \qquad m = 1, 2, \ldots$$
where $\gamma_m = 1/m$, starting with $J(i) = 0$.
6. Temporal-Difference Learning
• Policy evaluation by Monte Carlo simulation
• Consider the trajectory $(i_0, i_1, \ldots, i_N)$, and let $k$ be an integer such that $1 \le k \le N$. The trajectory contains the subtrajectory $(i_k, i_{k+1}, \ldots, i_N)$. This is a sample trajectory with initial state $i_k$, so it can be used to update $J(i_k)$ using the iterative equation previously presented.
• Algorithm*: run a simulation and generate the state trajectory $(i_0, i_1, \ldots, i_N)$, then update the estimates $J(i_k)$ for each $k = 0, 1, \ldots, N-1$, using the formula
$$J(i_k) := J(i_k) + \gamma(i_k) \big( g(i_k, i_{k+1}) + g(i_{k+1}, i_{k+2}) + \cdots + g(i_{N-1}, i_N) - J(i_k) \big)$$
• The step size $\gamma(i_k)$ can change from one iteration to the next.
*Additional details: Neuro-Dynamic Programming by Bertsekas and Tsitsiklis, Chapter 5.
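The trajectory-based update can be sketched as follows. The helper name, the dictionary-based storage, and the step-size choice $\gamma(i) = 1/(\text{number of visits to } i)$ are assumptions made for illustration; the slides leave the step-size schedule open.

```python
from collections import defaultdict

def mc_update(J, visits, states, costs):
    """Update cost-to-go estimates J from one simulated trajectory.

    states = (i_0, ..., i_N) with i_N = 0 (termination state);
    costs[k] = g(i_k, i_{k+1}).  Hypothetical helper: every visit to a
    state contributes one cost sample, with step size 1 / visit count.
    """
    tail_cost = 0.0
    # Walk backward so the cost of each tail accumulates in a single pass.
    for k in range(len(costs) - 1, -1, -1):
        tail_cost += costs[k]             # g(i_k,i_{k+1}) + ... + g(i_{N-1},i_N)
        i = states[k]
        visits[i] += 1
        gamma = 1.0 / visits[i]
        J[i] += gamma * (tail_cost - J[i])

J_est = defaultdict(float)                # J(i) = 0 initially
visit_count = defaultdict(int)
mc_update(J_est, visit_count, states=[1, 2, 0], costs=[1.0, 2.0])
mc_update(J_est, visit_count, states=[2, 0], costs=[4.0])
```

With this step size, each `J_est[i]` is the plain average of the tail costs observed from state `i`, so the order in which the positions `k` are processed does not change the result.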
6. Temporal-Difference Learning
• Monte Carlo simulation using temporal differences
• Here we consider an implementation of the Monte Carlo policy evaluation algorithm that incrementally updates the cost-to-go estimates $J(i)$.
• First, assume that for any trajectory $i_0, i_1, \ldots, i_N$ with $i_N = 0$, we set $i_k = 0$ for $k > N$, $g(i_k, i_{k+1}) = 0$ for $k \ge N$, and $J(0) = 0$. Also, the policy under consideration is proper.
• Let us rewrite the previous formula in the following form:
$$J(i_k) := J(i_k) + \gamma \Big( \big( g(i_k, i_{k+1}) + J(i_{k+1}) - J(i_k) \big) + \big( g(i_{k+1}, i_{k+2}) + J(i_{k+2}) - J(i_{k+1}) \big) + \cdots + \big( g(i_{N-1}, i_N) + J(i_N) - J(i_{N-1}) \big) \Big)$$
• Note that we use the property $J(i_N) = 0$: the intermediate terms $J(i_{k+1}), \ldots, J(i_{N-1})$ cancel in pairs, which recovers the original update.
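The rewriting above rests on a telescoping sum: the one-step differences $g(i_l, i_{l+1}) + J(i_{l+1}) - J(i_l)$ from position $k$ onward add up to the tail cost minus $J(i_k)$. A minimal numerical check, with hypothetical names and arbitrary illustrative values for $J$, is:

```python
def temporal_differences(J, states, costs):
    """One-step differences d_l = g(i_l, i_{l+1}) + J(i_{l+1}) - J(i_l)."""
    return [costs[l] + J[states[l + 1]] - J[states[l]]
            for l in range(len(costs))]

# Arbitrary current estimates, with J(0) = 0 at the termination state.
J_demo = {0: 0.0, 1: 0.5, 2: -1.0}
states_demo = [1, 2, 1, 0]                # trajectory ending at state 0
costs_demo = [1.0, 2.0, 3.0]              # g(i_l, i_{l+1}) along the trajectory

d = temporal_differences(J_demo, states_demo, costs_demo)

# For each k: sum of differences from k onward vs. tail cost minus J(i_k).
lhs = [sum(d[k:]) for k in range(len(costs_demo))]
rhs = [sum(costs_demo[k:]) - J_demo[states_demo[k]]
       for k in range(len(costs_demo))]
```

Here `lhs` and `rhs` agree entry by entry, which is the identity that allows each difference to be applied as soon as the corresponding transition is observed.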