BAMS 517: Introduction to Markov Decision Processes • Eric Cope (mostly!) and Martin L. Puterman, Sauder School of Business
Markov Decision Processes (MDPs) • We’ve been looking so far at decision problems that require a choice of only one or a few actions • The complexity of these decisions was small enough so that we could write down a decision tree and solve it • Many decision problems require that actions be taken repeatedly over time. The more decisions and uncertain events we have to consider, the more tedious it becomes to write down a decision tree for the problem • Decision trees are not a very parsimonious way to represent or solve very complex decision problems • MDPs provide a rich analytical framework for studying complex decision problems • A convenient and economical method of representing a decision problem • Can be used to study problems involving infinite sequences of decisions • MDPs can be easily stored and solved on a computer • Allow us to further explore the structure of optimal decisions
General Approach • Formulate a problem as an MDP by identifying its decision epochs, states, actions, transition probabilities and rewards. • Determine an optimality criterion – initially expected total reward. • Solve it using backward induction. • Determine the optimal policy.
The parking problem • Suppose you’re driving to a theatre that is at the end of a one-way street • You’ll need to park in one of the parking spots along the street. Naturally, you want to park as close as possible to the theatre • If however you drive past the theatre, you’ll have to park in the pay lot, and pay $c • You are a distance of x from the theatre and you see an open spot. Do you take it, or try to get closer to the theatre?
The parking problem • Some simplifying modeling assumptions: • You can only see whether the spot you’re driving past is occupied or not (it’s nighttime) • Each spot has a probability p (0 < p < 1) of being vacant, and vacant spaces occur independently • You assign a cost of $a∙x for parking a distance of x from the theatre (measured in parking spot widths) • There are N total spots • Suppose N = 100. Try to imagine what a decision tree would look like for this problem • The tree is extremely unwieldy and complex: a long series of similar nodes forming a complex network of branches • Much of the redundancy in the tree can be eliminated using the MDP representation
The parking problem • You might imagine that the optimal solution to the problem is of the form: drive past the first k spots, and then park in the first open spot after that • As it turns out, for some value of k this will be the optimal rule, regardless of the values of the other problem parameters (c, N, etc.) • The structure of MDPs often allows you to prove that such general properties hold of the optimal solution • The parking problem is an instance of an “optimal stopping” problem • Once you park, you stop having to decide • The problem is deciding when to park given your current limited information
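The threshold structure described above can be checked numerically. The following is a minimal sketch, not the authors' own formulation: it assumes spots are indexed by their distance x = N, …, 1 from the theatre, that parking at distance x costs a∙x, and that driving past the last spot costs exactly c (any extra walking cost from the pay lot is ignored). The function name `solve_parking` and the parameter values are illustrative only.

```python
def solve_parking(N, p, a, c):
    """Backward induction for the parking problem (illustrative sketch).

    W[x] is the minimal expected cost just before the vacancy of the spot at
    distance x is observed. W[0] = c is the pay-lot cost once every spot
    has been passed (an assumption about how the problem ends).
    """
    W = [0.0] * (N + 1)
    W[0] = c
    park = [False] * (N + 1)  # park[x]: take the spot at distance x if vacant?
    for x in range(1, N + 1):  # nearest spot first, i.e. backwards in time
        cost_if_park = a * x
        cost_if_continue = W[x - 1]
        park[x] = cost_if_park <= cost_if_continue
        best_if_vacant = min(cost_if_park, cost_if_continue)
        # The spot is vacant with probability p; otherwise we must drive on.
        W[x] = p * best_if_vacant + (1 - p) * cost_if_continue
    return W, park


# Hypothetical parameter values, chosen only for illustration.
W, park = solve_parking(N=100, p=0.3, a=1.0, c=20.0)
threshold = max(x for x in range(1, 101) if park[x])
print("park in the first vacant spot within distance", threshold)
print("expected cost from the start of the street:", W[100])
```

Under these assumptions W[x] can only decrease as x grows (more spots remain ahead), while the parking cost a∙x increases, which is why the set of distances at which parking is optimal forms a single block next to the theatre.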
An inventory management problem • Suppose you are the manager of a small retail fashion outlet. You sell a popular type of dress (that remains “in fashion” for a year) at a price $p • You have only limited shelf and storage space for dresses. Customers come at random intervals and buy the dress. If you have no dresses in stock, they leave the store and you lose the sale • You can order new dresses from your supplier at any time. Ordering costs $K plus $c per dress • When do you order new dresses, and how many do you order?
An inventory management problem • Some simplifying modeling assumptions • Every day, a random number D of customers arrive, where D ∈ {0, 1, 2, …} • The demand distribution is the same every day, and the number of customers wanting to buy is independent from day to day • You place orders first thing in the morning, and they arrive immediately • You can only carry N dresses at any time due to storage limitations • The dresses will be sold for a year, after which they will be replaced by a new line. Unsold dresses will be disposed of at the end of the year at a reduced price • Objective: maximize expected total profit over the year • What is the key information needed to make a decision? • Constant information: space limitations, probability distribution of customer arrivals, ordering process • Changing information: inventory on hand, number of days until new line arrives
An inventory management problem • Imagine the decision tree for this problem • It will be extremely large, but will include many redundant nodes • For example: consider the following scenarios for day 100: • Each of these scenarios leads to the same situation on day 101 • In a decision tree, you would have to write separate branches for each of these scenarios, even though you would face essentially the same decision on day 101 in each case • Decisions only depend on the present state of affairs, and not on past events or decisions
An inventory management problem • It is better to simply consider the decision problem that you would face on day 101, with 10 units of inventory, only once • In the MDP modeling framework, we talk about the “state” of having 10 units of inventory on day 101, and consider the decision problem faced by someone in this state • We can fully consider the decision problem by considering all the possible “states” we might find ourselves in • There will be a state for every combination of day and inventory level • Note that the states here correspond to the possible values of the “changing information” we might have about the problem at any time that is relevant to the overall objective of maximizing total profit • Each state incorporates in its description all the problem information needed to make a good decision when in that state
An inventory management problem • Note that in each possible state, different sets of actions are available to us • If in the current state there is an inventory of n items, then we can only order up to N − n items, due to space limitations • Our choice of action will lead us into new states with different probabilities • Suppose the demand D realized on each day is such that P(D=0) = P(D=1) = 1/3, P(D=2) = P(D=3) = 1/6, P(D > 3) = 0 • Suppose the current state is 10 items in inventory on day 100. Here are the probabilities that the next state will be 12 items on day 101 for different order values:
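The probabilities asked for above follow directly from the demand distribution, since the next inventory level is the stock after ordering minus the demand. A minimal sketch is given below; it assumes the capacity N is large enough to allow each order shown, and that unmet demand is lost (as stated earlier in the deck). The function name `transition_prob` is illustrative.

```python
from fractions import Fraction

# Demand distribution from the slide: P(D=0)=P(D=1)=1/3, P(D=2)=P(D=3)=1/6.
DEMAND = {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 6), 3: Fraction(1, 6)}


def transition_prob(s, a, s_next, demand=DEMAND):
    """P(next inventory = s_next | current inventory s, order quantity a).

    Orders arrive immediately, then demand is served from s + a units.
    Unmet demand is lost, so inventory never drops below zero.
    """
    total = Fraction(0)
    for d, prob in demand.items():
        if max(s + a - d, 0) == s_next:
            total += prob
    return total


# Current state: 10 items on day 100; target next state: 12 items on day 101.
for a in range(0, 7):
    print("order", a, "->", transition_prob(10, a, 12))
```

Reaching 12 items from 10 requires D = a − 2, so only orders of 2, 3, 4 or 5 units give a positive probability.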
An inventory management problem • In addition, different actions will cause us to gain different profits • Daily profit = min{D, s+a}∙p − c∙a − K if a > 0, and min{D, s}∙p if a = 0, where s is the inventory on hand and a is the number ordered • In order to choose the best action in any particular state, we need to understand: • The possible future states that are attainable through each action, and the probabilities of reaching those states • The possible future profits that we gain from each action • If we have to consider the evolution of states and total profits gained over the entire future, this could be quite complicated • Instead, we’ll only consider, for any given state and action, what the next state could be, and how much profit could be gained before moving to the next state • From these “one-step” state transitions and profits, we can analyze the entire decision problem
Elements of MDPs • Decision epochs: the times at which decisions may be made. The intervals between successive decision epochs are referred to as periods • We first consider problems with a finite number N of decision epochs • The Nth decision epoch is a terminal point: no decision is made at it • States: a state describes all the relevant available information necessary to take an optimal action at any given decision epoch. We denote the set of all possible states by S • Action sets: for each state s ∈ S, the action set A(s) denotes the set of allowable actions that can be taken in state s • Transition probabilities: for any given state and action, the probabilities of moving (transitioning) to any other state at the next decision epoch • If s is the current state at time t and action a ∈ A(s) is taken, then the probability of transitioning to state s′ is denoted \(p_t(s' \mid s, a)\) • Assume Markovian dynamics: transitions depend only on the current state and action
Elements of MDPs; timeline • Rewards: for any given state and action, the random benefits (costs) that are incurred before or during the next state transition • The reward received after taking action a in state s at time t and arriving in state s′ is denoted \(r_t(s, a, s')\) • Note that the random rewards may depend on the next state s′. Usually we will only consider the expected reward \(r_t(s, a) = \sum_{s' \in S} r_t(s, a, s')\, p_t(s' \mid s, a)\) • There may be terminal rewards \(r_N(s)\) at the Nth decision epoch • [Figure: timeline of epochs 1, 2, …, N showing the states \(s_1, \dots, s_N\), actions \(a_1, \dots, a_{N-1}\), rewards \(r_1, \dots, r_N\), and transition probabilities \(p_n(s_{n+1} \mid s_n, a_n)\)]
But Who’s Counting • http://www.youtube.com/watch?v=KjZJ3TV-MyM • This can be formulated as an MDP • States – unoccupied slots and number to be placed • Actions – which unoccupied slot to place the number in • Rewards – the value of placing the number in the space • Goal – maximize total
MDPs as decision trees • [Figure: a decision tree spanning epochs N-3, N-2, N-1 and N, annotated in successive slides with: states (decision nodes); terminal states; actions (decision branches); rewards / transitions (uncertainty nodes and branches)]
Specifying states • As we mentioned, the state descriptor should provide all the relevant problem information that is necessary for making a decision • Normally, we don’t include problem information that doesn’t change from epoch to epoch in the state description • For example, in the parking problem, the cost of parking in the parking lot, the total number of parking spaces, etc. is constant at all times. Therefore, we don’t bother including this information in the state description • We would, however, include information about the state of the current parking space (vacant, occupied) • The number of epochs remaining also changes from epoch to epoch (for finite-horizon problems). However, we often won’t include this information in the state description, because it is implicitly present in the specification of rewards and transition probabilities
Deterministic dynamic programs • A special type of MDP (which are sometimes also called dynamic programs) is one in which all transition probabilities are either 0 or 1 • These are known as deterministic dynamic programs (DDPs) • Such problems arise in several applications: • finding shortest paths in networks • critical path analysis • sequential allocation • inventory problems with known demands
Shortest path through a network • Nodes represent states, arcs represent actions/transitions, numbers represent arc lengths / costs • Goal: find the shortest route from node 1 to node 8 • [Figure: network on nodes 1 to 8 with arc lengths 1→2: 2, 1→3: 4, 1→4: 3, 2→7: 2, 3→5: 5, 3→6: 6, 3→7: 1, 4→5: 4, 4→6: 5, 5→8: 1, 6→8: 2, 7→8: 6]
Formulation of shortest path problem • Let u(s) denote the shortest distance from node s to node 8 • We compute u(s) just as we did previously for MDPs • For any state s, for each arc s→s′ out of state s, add the length of the arc to the shortest distance u(s′) from s′ to node 8. Let u(s) be the minimum such value over all these arcs • u(8) = 0 (“terminal state”) • u(7) = 6 + u(8) = 6 • u(6) = 2 + u(8) = 2 • u(5) = 1 + u(8) = 1 • u(4) = min{ 4 + u(5), 5 + u(6) } = min{ 4 + 1, 5 + 2 } = 5 • u(3) = min{ 5 + u(5), 6 + u(6), 1 + u(7) } = min{ 5 + 1, 6 + 2, 1 + 6 } = 6 • u(2) = 2 + u(7) = 2 + 6 = 8 • u(1) = min{ 2 + u(2), 4 + u(3), 3 + u(4) } = min{ 2 + 8, 4 + 6, 3 + 5 } = 8
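The hand computation above can be reproduced in a few lines. This is a minimal sketch: the arc lengths are read off the u(s) calculations on the slide, and the code relies on the observation that every arc in this network goes from a lower-numbered node to a higher-numbered one, so processing nodes in descending order is a valid backward pass. The names `ARCS` and `shortest_distances` are illustrative.

```python
# Arc lengths taken from the u(s) computations on the slide: ARCS[s][s2] = length.
ARCS = {
    1: {2: 2, 3: 4, 4: 3},
    2: {7: 2},
    3: {5: 5, 6: 6, 7: 1},
    4: {5: 4, 6: 5},
    5: {8: 1},
    6: {8: 2},
    7: {8: 6},
    8: {},
}


def shortest_distances(arcs, target):
    """Backward induction: u[s] = min over arcs s->s2 of length + u[s2]."""
    u = {target: 0}
    best_next = {}
    # Assumes arcs always lead to higher-numbered nodes (true for this network).
    for s in sorted(arcs, reverse=True):
        if s == target:
            continue
        s2, dist = min(
            ((j, length + u[j]) for j, length in arcs[s].items()),
            key=lambda pair: pair[1],
        )
        u[s] = dist
        best_next[s] = s2
    return u, best_next


u, best_next = shortest_distances(ARCS, target=8)

# Recover the shortest route from node 1 by following the minimizing arcs.
route = [1]
while route[-1] != 8:
    route.append(best_next[route[-1]])
print(u[1], route)
```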
Critical path models • A critical path network is a graphical method of analyzing a complex project with many tasks, having precedence constraints • In the graph of this network, nodes represent states of completion, and arcs represent tasks to complete • The node from which an arc originates represents the project’s state of completion that is needed to begin that task • All other tasks that logically precede that task must be done first • The arcs are numbered according to the length of time they require for completion • The critical path is a list of tasks forming a path through the network from the project start node to the project end node • If the completion of any task on the critical path is delayed, then the overall project must be delayed as well • The critical path is the longest path through the network
Critical path: launching a new product • It was determined that in order to launch a new product, the following activities needed to be completed:
Activity  Description          Predecessor  Duration
A         Product Design       --           5 mos.
B         Market Research      --           1 mo.
C         Production Analysis  A            2 mos.
D         Product Model        A            3 mos.
E         Sales Brochure       A            2 mos.
F         Cost Analysis        C            3 mos.
G         Product Testing      D            4 mos.
H         Sales Training       B, E         2 mos.
I         Pricing              H            1 mo.
J         Project Report       F, G, I      1 mo.
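Critical path: launching a new product • [Figure: project network on nodes 1 to 8; each arc is an activity labelled with its duration in months: A (5), B (1), C (2), D (3), E (2), F (3), G (4), H (2), I (1), J (1)]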
Critical path: launching a new product • We use the backward induction algorithm to find the longest path • u(s) = longest path from node s to the project completion node 8 • The critical path is marked in red • This is not really a decision problem per se, but an illustration of the backward induction algorithm applied to a network similar to a DDP • [Figure: the project network annotated with u(8) = 0, u(7) = 1, u(6) = 2, u(5) = 4, u(4) = 5, u(3) = 4, u(2) = max{8, 6, 6} = 8, u(1) = max{13, 5} = 13]
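The longest-path computation can be sketched as follows. The arc endpoints below are an assumption: the figure did not survive, so they were inferred from the predecessor table and checked against the u(s) values quoted on the slide. Names such as `ACTIVITIES` and `longest_paths` are illustrative.

```python
# activity: (from_node, to_node, duration in months).
# Endpoints are inferred (assumption); durations come from the activity table.
ACTIVITIES = {
    "A": (1, 2, 5), "B": (1, 3, 1), "C": (2, 5, 2), "D": (2, 4, 3),
    "E": (2, 3, 2), "F": (5, 7, 3), "G": (4, 7, 4), "H": (3, 6, 2),
    "I": (6, 7, 1), "J": (7, 8, 1),
}


def longest_paths(activities, end):
    """Backward induction with max instead of min: u[s] = longest path s -> end."""
    outgoing = {}
    for name, (i, j, dur) in activities.items():
        outgoing.setdefault(i, []).append((name, j, dur))
    u = {end: 0}
    choice = {}
    # Assumes every arc leads to a higher-numbered node, as in the inferred network.
    for s in sorted(outgoing, reverse=True):
        name, j, dur = max(outgoing[s], key=lambda arc: arc[2] + u[arc[1]])
        u[s] = dur + u[j]
        choice[s] = (name, j)
    return u, choice


u, choice = longest_paths(ACTIVITIES, end=8)

# Follow the maximizing arcs from the start node to list the critical activities.
node, critical = 1, []
while node != 8:
    name, node = choice[node]
    critical.append(name)
print(u[1], critical)
```

With these inferred endpoints the computed values agree with the node labels quoted on the slide, and the critical path has length 13 months.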
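Backward induction algorithm • 1. Set \(u_N(s) = r_N(s)\) for all s ∈ S. Set n = N. • 2. Solve the optimality equations one epoch earlier, using the values \(u_n\) already computed. • 3. If n − 1 = 0, stop; otherwise replace n by n − 1 and return to step 2.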
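A generic version of the algorithm might look like the sketch below. It is written under stated assumptions rather than taken from the slides: epochs are numbered 1, …, N with decisions at epochs 1, …, N − 1 and a terminal reward at epoch N, rewards are given in expected form r_n(s, a), every state has at least one action, and the interface (`actions`, `reward`, `trans`, `terminal`) is hypothetical.

```python
def backward_induction(N, states, actions, reward, trans, terminal):
    """Finite-horizon backward induction (illustrative interface).

    actions(s)      -> iterable of actions allowed in state s (non-empty)
    reward(n, s, a) -> expected one-period reward r_n(s, a)
    trans(n, s, a)  -> dict {next_state: probability}
    terminal(s)     -> terminal reward r_N(s)
    Returns value functions u[n][s] and a policy policy[n][s].
    """
    u = {N: {s: terminal(s) for s in states}}
    policy = {}
    for n in range(N - 1, 0, -1):  # epochs N-1, N-2, ..., 1
        u[n], policy[n] = {}, {}
        for s in states:
            best_a, best_v = None, float("-inf")
            for a in actions(s):
                v = reward(n, s, a) + sum(
                    prob * u[n + 1][s2] for s2, prob in trans(n, s, a).items()
                )
                if v > best_v:
                    best_a, best_v = a, v
            u[n][s], policy[n][s] = best_v, best_a
    return u, policy


# Toy usage: two states; "stay" in state 1 earns 1, "move" flips the state.
toy_u, toy_policy = backward_induction(
    N=3,
    states=[0, 1],
    actions=lambda s: ["stay", "move"],
    reward=lambda n, s, a: 1.0 if (s == 1 and a == "stay") else 0.0,
    trans=lambda n, s, a: {s: 1.0} if a == "stay" else {1 - s: 1.0},
    terminal=lambda s: 0.0,
)
print(toy_u[1], toy_policy[1])
```

In the toy problem the optimal first move from state 0 is to switch to state 1 and then stay there, which is what the returned policy shows.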
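Solution to the inventory problem • N = 10, K = 30, c = 20, p = 40 • \(d_0 = d_1 = d_2 = 1/4\); \(d_3 = d_4 = 1/8\); \(d_k = 0\) for k > 4, where \(d_k = P(D = k)\) • The Bellman equations to solve for n = 1, …, 365; s = 0, …, 10 are: \(u_n(s) = \max_{0 \le a \le N - s} \{ -(K\,\mathbf{1}\{a > 0\} + c\,a) + p \sum_{k} d_k \min\{k, s + a\} + \sum_{k} d_k\, u_{n+1}(\max\{s + a - k, 0\}) \}\), whose three terms are the ordering costs, the expected sales revenue, and the expected value of the next state • We set the terminal rewards \(u_{366}(s) = 0\) for all s and again solve by working backwards for \(u_{365}(s), \dots, u_1(s)\)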
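The parameters above are enough to solve the problem numerically. The sketch below is one possible implementation, assuming the daily profit formula given earlier in the deck (lost sales, orders arriving before demand) and zero terminal rewards; it is not claimed to reproduce the authors' table exactly. `CAP` stands for the storage limit N = 10, renamed here to avoid confusion with the number of epochs.

```python
CAP, K, c, p, DAYS = 10, 30, 20, 40, 365
DEMAND = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.125, 4: 0.125}  # d_k = P(D = k)


def solve_inventory():
    """Backward induction over days 365, ..., 1 and stock levels 0, ..., CAP."""
    # u[n][s]: maximal expected profit from day n onward with s units in stock.
    u = [[0.0] * (CAP + 1) for _ in range(DAYS + 2)]  # u[DAYS + 1][s] = 0
    order = [[0] * (CAP + 1) for _ in range(DAYS + 2)]
    for n in range(DAYS, 0, -1):
        for s in range(CAP + 1):
            best_a, best_v = 0, float("-inf")
            for a in range(CAP - s + 1):  # cannot exceed storage capacity
                stock = s + a
                v = -(K + c * a) if a > 0 else 0.0  # ordering costs
                for d, prob in DEMAND.items():
                    sold = min(d, stock)
                    # expected sales revenue + expected value of next state
                    v += prob * (p * sold + u[n + 1][stock - sold])
                if v > best_v + 1e-12:  # ties go to the smaller order
                    best_a, best_v = a, v
            u[n][s], order[n][s] = best_v, best_a
    return u, order


u, order = solve_inventory()
print("day 365 orders:", order[365])
print("day 1 orders:  ", order[1])
```

Printing `order[n]` for an early day and for the last few days gives a direct way to compare the computed policy with the order-up-to structure discussed in the deck.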
Optimal order quantities • The optimal order quantities are listed at left • In the last period (day 365) you don’t want to order anything • You never order anything if you have 3 or more items in stock • On days 1, …, 359, if you have fewer than 3 items in stock, you “order up” to a full inventory level • This is known as an (s,S) inventory policy: • If your inventory falls below a level s, you order up to level S • This is well-known (Scarf, 1959) to be the form of an optimal inventory policy for this problem
An investment/consumption problem • We consider a (simplified) approach to investment planning for your life: • You will make M dollars per year until age 65 • Each year, you can choose to spend a portion of this money and invest the rest • Invested money brings a return of r% per year • Your utility for spending x dollars per year is log(x/10000) • You are currently d years old, and you will live to the age of D • Let \(c_n\) be the amount of money you consume in year n. We require that \(c_n < w_n\), your level of wealth in year n (which includes your year n income) • Your current level of wealth is \(w_d\) • Your lifetime utility is \(\sum_{n=d}^{D} \log(c_n/10000)\) • The value of any remaining wealth at your death is 0
An investment/consumption problem • We formulate this as a DDP. The equations to solve are: • At right is a graph of the optimal spending policy (along with total wealth) for a problem with the following parameters: • d = 40, D = 80, r = 10%, M = $50K, initial wealth w40 = $50K
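The equations themselves did not survive extraction. One plausible form, offered only as a sketch under stated assumptions, is shown below: the state is the wealth level w in year n, unspent wealth earns the return r (written as a decimal fraction), income \(M_n\) equals M up to age 65 and 0 afterwards, and the constraint is written as \(c \le w\) so that the maximum is attained. The timing of income and the treatment of the boundary are assumptions, not taken from the slides.

```latex
% Sketch of the deterministic dynamic program (assumed form, see lead-in).
\begin{aligned}
u_{D+1}(w) &= 0, \\
u_n(w) &= \max_{0 < c \le w}
  \Bigl\{ \log\!\bigl(c / 10000\bigr)
          + u_{n+1}\bigl((1 + r)(w - c) + M_{n+1}\bigr) \Bigr\},
  \qquad n = D, D-1, \dots, d, \\
M_n &= \begin{cases} M, & n \le 65, \\ 0, & n > 65. \end{cases}
\end{aligned}
```

Because the transition to next year's wealth is deterministic given w and c, no expectation appears, which is what makes this a DDP.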
The time value of money • When sequential decisions are made over the course of many months or years, it is often desirable to consider the “time value of money” • Receiving a dollar today is worth more to you than receiving a dollar tomorrow, since you have the added option of spending that dollar today • It is customary to “discount” the values of dollars received in the future by an appropriate factor • Let \(\lambda(t)\) denote the discount factor applied to money received t periods in the future • \(0 < \lambda(t) < 1\) • Thus, $x received t periods in the future is worth \(\lambda(t)\)∙$x to you now • Typically, we let \(\lambda(t) = \lambda^t\), for some fixed \(\lambda\), \(0 < \lambda < 1\) • The choice of \(\lambda\) depends on the length of the period
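Discount factors in Bellman’s equation • The choice \(\lambda(t) = \lambda^t\) is convenient because we can then easily include discounting in Bellman’s equation • \(u_n(s) = \max_a \{ r_n(s, a) + \lambda \sum_{s'} p(s' \mid s, a)\, u_{n+1}(s') \}\) • We simply apply the discount factor to the expected value of the next state • We regard the expected value of the next state as a “certain equivalent” value of the next decision we will make one period in the future • This certain equivalent value is discounted by \(\lambda\) • Quick proof (optional): Let \(\pi(n) = (d_n, \dots, d_{N-1})\). Then
\[
\begin{aligned}
u_n(s) &= \max_{\pi(n)} E_{\pi(n)}\Bigl[ \sum_{t=n}^{N} \lambda^{t-n}\, r_t(s_t, d_t(s_t)) \Bigm| s_n = s \Bigr] \\
&= \max_{\pi(n)} E_{\pi(n)}\Bigl[ r_n(s, d_n(s)) + \lambda \sum_{t=n+1}^{N} \lambda^{t-n-1}\, r_t(s_t, d_t(s_t)) \Bigm| s_n = s \Bigr] \\
&= \max_a \Bigl\{ r_n(s, a) + \lambda \sum_{s'} p(s' \mid s, a) \cdot \max_{\pi(n+1)} E_{\pi(n+1)}\Bigl[ \sum_{t=n+1}^{N} \lambda^{t-n-1}\, r_t(s_t, d_t(s_t)) \Bigm| s_{n+1} = s' \Bigr] \Bigr\} \\
&= \max_a \Bigl\{ r_n(s, a) + \lambda \sum_{s'} p(s' \mid s, a)\, u_{n+1}(s') \Bigr\}
\end{aligned}
\]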
The secretary problem • You are hiring for the position of secretary. You will interview N prospective candidates • After each interview, you can either decide to hire the person, or interview the next candidate • If you don’t offer the job to the current interviewee, they go and find work elsewhere, and can no longer be hired • The goal is to find a decision policy which maximizes the probability of hiring the best person • You don’t know the rank order of the candidates • You can only rank the people you’ve interviewed so far • For example, you know if the current candidate is better than any of the previous candidates
The secretary problem • If the current interviewee is not the best one you’ve seen so far, then the probability that this person is the best is zero • If there are more people to interview, then you might as well continue: there is at least a chance that the best is yet to come • If the current interviewee is the best one so far, then you might consider hiring him or her • What is the probability that this person is the best of all? This depends on the number you have interviewed so far • If the nth candidate is the best that you have seen so far, then the probability that this person is the best out of all N candidates is n/N
The secretary problem • Because the only information relevant to your hiring decision is whether the current person is the best you’ve seen so far, we let the state s ∈ {0, 1}, with s = 1 if the current person is the best so far and s = 0 otherwise • If you decide to hire the current candidate, the “reward” is the probability that that person is the best of all N • If you decide not to hire the current candidate, there is no immediate reward and you interview the next candidate • The probability that the next, (n+1)st, person will be the best you’ve seen so far is equal to 1/(n+1) • Let \(u_n(s)\) be the maximal probability of eventually selecting the best candidate if the current state is s at time n. Then • \(u_{N+1}(s) = 0\), s = 0, 1 • For n = 1, …, N: \(u_n(0) = \frac{n}{n+1}\, u_{n+1}(0) + \frac{1}{n+1}\, u_{n+1}(1)\) and \(u_n(1) = \max\{ \frac{n}{n+1}\, u_{n+1}(0) + \frac{1}{n+1}\, u_{n+1}(1),\; n/N \}\)
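The recursion above is short enough to run directly. The following sketch implements it as written, with state 1 meaning "best so far"; the function name `solve_secretary` and the way the cutoff t is extracted from the policy are illustrative choices.

```python
def solve_secretary(N):
    """Backward induction for the secretary problem.

    Returns (probability of selecting the best candidate under the optimal
    policy, number t of candidates to interview before hiring becomes optimal).
    """
    u0 = [0.0] * (N + 2)  # u0[n] = u_n(0): current candidate is not best so far
    u1 = [0.0] * (N + 2)  # u1[n] = u_n(1): current candidate is best so far
    hire = [False] * (N + 2)
    for n in range(N, 0, -1):
        # Value of continuing: next candidate is best so far w.p. 1/(n+1).
        cont = (n * u0[n + 1] + u1[n + 1]) / (n + 1)
        u0[n] = cont
        hire[n] = n / N >= cont
        u1[n] = max(cont, n / N)
    first_hire = min(n for n in range(1, N + 1) if hire[n])
    # The first candidate is always "best so far", so the start state is (1, 1).
    return u1[1], first_hire - 1


for N in (4, 10, 100):
    prob, t = solve_secretary(N)
    print(N, t, round(prob, 5))
```

For N = 4 the recursion can be followed by hand and gives a success probability of 11/24 with t = 1, a convenient check on the implementation.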
The secretary problem • The optimal policy is of the form “Interview the first t candidates, and then hire the first candidate that is better than all the previous ones” • The graph at right shows how the probability of selecting the best candidate and the optimal proportion t/N of candidates to initially interview vary with N • Both curves approach 1/e ≈ 0.36788 in the limit
Controlled queueing systems • A queue is a model for a congested resource • “Jobs” line up in the queue waiting for “service” • New jobs arrive to the end of the queue at random times • First-come-first-serve • Each job requires a random amount of service • Servers complete job service requirements at a given rate • Example: lining up for a bank teller • Queueing models are useful for estimating average waiting times, queue lengths, server utilization, etc
Controlled queueing models • [Figure: arriving jobs enter a queue buffer; jobs in queue wait for the server; the job in service leaves as a completed job] • We model N discrete time periods • In each period, at most one job can arrive (with probability λ) and at most one job can complete service (with probability μ). Assume μ > λ • Imagine that we can control the rate at which the server completes jobs • That is, we can choose μ within the range (λ, 1) • Cost c(μ) associated with choosing rate μ (c increasing in μ) • State s = number of jobs in system • Total buffer size B; states s ∈ S = {0, …, B+1} • Reward R for every completed job • Penalty b for every job blocked due to full buffer • Holding cost h(s) depending on number of jobs in system (h increasing)
Controlled queueing models • Timeline for one period: • Time n, state s • Choose rate \(\mu_n\); pay cost \(c(\mu_n)\) • a new jobs arrive; incur blocking penalty b∙max{0, s+a−B−1} • k jobs complete service; receive reward R∙k • Incur holding cost h(s + min{a, B+1−s} − k) • Time n+1, state s + min{a, B+1−s} − k
Controlled queueing models • Optimality equations: • 0 < s < B +1: • s = 0: • s = B +1
‘Two-armed bandit’ problems • A ‘one-armed bandit’ is another name for a slot machine • If you play it long enough, it’s as good as being robbed • There are two slot machines you can play. You plan to play the machines N times • Every time you play, you can choose which arm to pull • You pay a dollar, and either win $2 or nothing on any given pull • You don’t know the probabilities of winning on either machine • The more you play either machine, the more you learn about the probabilities of winning • How do you decide which arms to pull?
‘Two-armed bandit’ problems • To simplify the problem, suppose you already know that machine 1 has a probability of winning of 0.5 • You don’t know the probability p of winning on machine 2 • Recall the coin and thumbtack example: • You can choose to either flip the coin or the thumbtack in each of N plays • Every time the outcome is heads / pin up, you win $1, otherwise you lose $1 • You know the probability of heads is 0.5, but you are unsure of the probability of the tack landing pin up • The question then becomes if and for how long you should play machine 2 • You may suspect that machine 2 has a slightly worse chance of winning • However, it might be worthwhile trying machine 2 for a while to see if it appears to be better than machine 1 • If machine 2 doesn’t appear to be better, then you can revert to machine 1 and continue playing that until the end • As long as you play machine 1 you don’t learn anything about p
‘Two-armed bandit’ problems • Let the prior probability for p be P(p = x), where x can be any value in {0, 0.01, 0.02, …, 0.99, 1} • Suppose at some point in time you have played machine 2 a total of n times, and you have won k times out of those n • It is possible to show that the posterior probabilities for p can be determined just from knowing n and k • We don’t need to know the entire sequence of wins and losses, only the totals • Denote the posterior as P(p = x | n, k) • By Bayes’ rule, \(P(p = x \mid n, k) \propto P(k \text{ wins out of } n \mid p = x) \cdot P(p = x) = \frac{n!}{k!\,(n-k)!}\, x^k (1 - x)^{n-k}\, P(p = x)\) • Let q(n, k) denote the probability that you assign to the tack landing pin up on the next throw, after observing k wins out of n flips • \(q(n, k) = \sum_x x \cdot P(p = x \mid n, k)\)
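The posterior update and the predictive probability q(n, k) can be sketched as below. The uniform prior over the grid is an assumption made for the example (the slides leave the prior general), and the function names are illustrative. The binomial coefficient is omitted because it cancels when the posterior is normalized.

```python
def posterior(prior, n, k):
    """Posterior P(p = x | n, k) on a discrete grid, via Bayes' rule.

    prior: dict {x: P(p = x)}. Raises ZeroDivisionError if the data are
    impossible under the prior (all posterior weights are zero).
    """
    weights = {x: (x ** k) * ((1 - x) ** (n - k)) * w for x, w in prior.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def q(prior, n, k):
    """Predictive probability of a win on the next play of machine 2."""
    post = posterior(prior, n, k)
    return sum(x * w for x, w in post.items())


# Assumed example prior: uniform over {0, 0.01, ..., 1}.
grid = [i / 100 for i in range(101)]
uniform = {x: 1 / len(grid) for x in grid}

print(q(uniform, 0, 0))    # before any plays
print(q(uniform, 1, 1))    # after one win in one play
print(q(uniform, 10, 7))   # after seven wins in ten plays
```

Only the counts n and k enter the computation, which mirrors the point made above that the full sequence of wins and losses is not needed.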
‘Two-armed bandit’ problems • We can formulate the problem as the following MDP: • States: s = (n, k), where n ∈ {0, 1, …, N} and k ∈ {0, 1, …, n} • Actions: a ∈ {1, 2}, i.e. which machine you play • Rewards: • r((n, k), 1) = 0.5(1) + 0.5(−1) = 0 • r((n, k), 2) = q(n, k)(1) + (1 − q(n, k))(−1) = 2q(n, k) − 1 • All terminal rewards are 0 • Transitions: • p((n, k) | (n, k), 1) = 1 • p((n+1, k+1) | (n, k), 2) = q(n, k) • p((n+1, k) | (n, k), 2) = 1 − q(n, k) • Optimality equations: • \(u_t(n, k) = \max\{ u_{t+1}(n, k),\; 2q(n, k) - 1 + q(n, k)\, u_{t+1}(n+1, k+1) + (1 - q(n, k))\, u_{t+1}(n+1, k) \}\)
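The optimality equations can be solved by recursion with memoization for small horizons. The sketch below assumes a uniform prior on the grid {0, 0.01, …, 1} by default and adds an optional `discount` parameter (default 1, i.e. the undiscounted equations shown above); `make_solver` and its arguments are hypothetical names. For horizons in the hundreds or more, an iterative table would be preferable to recursion.

```python
from functools import lru_cache

GRID = [i / 100 for i in range(101)]


def make_solver(N, prior=None, discount=1.0):
    """Return u(t, n, k) -> (value, best machine) for an N-play problem."""
    prior = prior or [1 / len(GRID)] * len(GRID)  # assumed uniform prior

    @lru_cache(maxsize=None)
    def q(n, k):
        # Predictive win probability on machine 2 after k wins in n plays.
        w = [(x ** k) * ((1 - x) ** (n - k)) * pr for x, pr in zip(GRID, prior)]
        return sum(x * wi for x, wi in zip(GRID, w)) / sum(w)

    @lru_cache(maxsize=None)
    def u(t, n, k):
        if t > N:  # terminal rewards are 0
            return 0.0, None
        # Machine 1: expected reward 0, nothing is learned about p.
        safe = discount * u(t + 1, n, k)[0]
        # Machine 2: expected reward 2q - 1, state moves to (n+1, k+1) or (n+1, k).
        qq = q(n, k)
        risky = 2 * qq - 1 + discount * (
            qq * u(t + 1, n + 1, k + 1)[0] + (1 - qq) * u(t + 1, n + 1, k)[0]
        )
        return (safe, 1) if safe >= risky else (risky, 2)

    return u


u = make_solver(N=20)
value, machine = u(1, 0, 0)
print(value, machine)
```

With a two-play horizon and the uniform prior, trying machine 2 first is worth 0.17 in expectation even though its immediate expected reward is 0, a small-scale illustration of the value of learning.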
‘Two-armed bandit’ problems • A sample optimal policy is pictured at right • N = 1000 • uniform prior probability • optimal policy at time 100 • applied discount factor λ = 0.98 • x axis = n • y axis = k • red region ⇒ play machine 1 • green region ⇒ play machine 2 • blue region ⇒ infeasible
‘Two-armed bandit’ problems • Bandit problems are canonical models for sequential choice problems • Research project selection • Oil exploration • Clinical trials • Sequential search • Etc • Bandit problems also capture a fundamental dilemma in problems of incomplete information • How best to balance learning with maximizing reward • “Exploration / Exploitation” trade-off
Structured policies • One of the advantages of the MDP framework is that it is often possible to prove that the optimal policy for a given problem type has a special structure, for example: • threshold policies of the sort we saw in the inventory problem: • if s ≤ s*, then order up to level S • if s > s*, then order 0 • monotone policies such as are optimal for the queueing problem: • the larger the value of s (more jobs in the queue), the higher the service rate μ you should choose • Establishing such a structure for the optimal policy is desirable because: • It provides general managerial insight into the problem • A simple decision rule can be easier to implement • Computation of the optimal policy can often be simplified