
MDP Problems and Exact Solutions I


Presentation Transcript


  1. MDP Problems and Exact Solutions I • Ryan Christiansen • Department of Mechanical Engineering and Materials Science • Rice University • Slides adapted from Mausam and Andrey Kolobov

  2. MDP Problems: At a Glance • MDP Definition (2.1) • Solutions of an MDP (2.2) • Solution Existence (2.3) • Stochastic Shortest-Path MDPs (2.4) • Complexity of Solving MDPs (2.6)

  3. MDP Definition • MDP: an MDP is a tuple <S, A, D, T, R> • S is a finite state space • A is a finite action set • D is a sequence of discrete decision epochs (time steps) • T: S x A x S x D → [0, 1] is a transition function (probability) • R: S x A x S x D → ℝ is a reward function
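To make the flat representation concrete, here is a minimal Python sketch of the tuple <S, A, D, T, R>; the class name, field layout, and the dictionary encoding of T and R are illustrative assumptions, not notation from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action, Epoch = int, str, int

@dataclass
class FlatMDP:
    """A flat MDP <S, A, D, T, R> (names and layout are illustrative)."""
    S: List[State]                                       # finite state space
    A: List[Action]                                      # finite action set
    D: List[Epoch]                                       # decision epochs 1..L
    T: Dict[Tuple[State, Action, State, Epoch], float]   # transition probabilities
    R: Dict[Tuple[State, Action, State, Epoch], float]   # rewards

    def successors(self, s: State, a: Action, t: Epoch):
        """Yield (s', probability, reward) triples with nonzero probability."""
        for s2 in self.S:
            p = self.T.get((s, a, s2, t), 0.0)
            if p > 0.0:
                yield s2, p, self.R.get((s, a, s2, t), 0.0)
```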

  4. An MDP Problem • How does an MDP problem work? • Initial Conditions: the starting state • Actions: actions are chosen at each decision epoch to traverse the MDP • Termination: reach a terminating state or the final decision epoch • The goal is to end up with the highest net reward at termination

  5. The Policy • Policy: a rule for choosing actions • Global/Complete: a policy must always be applicable for the entire MDP • In general, policies will be • Probabilistic: able to choose between multiple actions randomly • History-Dependent: able to utilize the execution history, or the set of state and action pairs previously traversed • π: H x A → [0, 1]

  6. Markovian Policy • Markovian Policy: a history-dependent policy that only depends on the current state and time step • For any two histories hs,t and h′s,t, both of which end at the same state s and timestep t, and for any action a, a Markovian policy π will satisfy π(hs,t, a) = π(h′s,t, a) • In practice, it functions as a history-independent policy • π: S x D x A → [0, 1] • For several important types of MDPs, at least one optimal solution is necessarily Markovian

  7. Stationary Markovian Policy • Stationary Markovian Policy: a Markovian policy that does not depend on time • For any two timesteps t1 and t2, any state s, and any action a, a stationary Markovian policy π will satisfy π(s, t1, a) = π(s, t2, a) • π: S x A → [0, 1]

  8. Evaluate a Policy with the Value Function • Value Function: a function mapping the domain of the policy, excluding the action set, to a scalar value • History-dependent: V: H → [-∞, ∞] • Markovian: V: S x D → [-∞, ∞] • Stationary Markovian: V: S → [-∞, ∞] • Value Function of a Policy: the utility of the reward sequence obtained by executing the policy, i.e., the total utility of all rewards received • Vπ(hs,t) = u(Rt, Rt+1, …), where Ri is the reward received at step i when executing π starting from history hs,t

  9. Solutions of an MDP • A solution to an MDP is an optimal policy, or a policy that maximizes utility. • Policy π* is optimal if the value function, V*, is greater than or equal to the value function of any other policy. • V*(h) ≥ Vπ(h) for all h and π • Need to be careful when defining utility u(R1, R2, …) • For the same h, utility can be different across policy executions • Existence and uniqueness are not guaranteed for many types of MDPs.

  10. Expected Linear Additive Utility (ELAU) • u(R1, R2, …) = E[R1 + γR2 + γ²R3 + …], where γ is the discount factor • Assume γ = 1 unless stated otherwise • 0 ≤ γ < 1: more immediate rewards are more valuable • γ = 1: rewards are equally valuable, independently of time • γ > 1: more distant rewards are more valuable
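A small sketch (my own helper, not from the slides) of the quantity inside the expectation: the discounted linear additive utility of one sampled reward sequence. The ELAU is the expectation of this value over policy executions.

```python
def linear_additive_utility(rewards, gamma=1.0):
    """Discounted linear additive utility of one sampled reward sequence:
    R1 + gamma*R2 + gamma^2*R3 + ...  The ELAU is the expectation of this
    quantity over executions of the policy."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# The same rewards valued by an impatient vs. a patient agent:
print(linear_additive_utility([1, 1, 10], gamma=0.5))  # 1 + 0.5 + 2.5 = 4.0
print(linear_additive_utility([1, 1, 10], gamma=1.0))  # 12
```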

  11. The Optimality Principle • The Optimality Principle: if every policy’s quality can be measured by its ELAU, then there exists a policy that is optimal at every timestep • There are some situations where it may not apply: • When the agent gets stuck in a repeating sequence of states (a loop) • Infinite decision epochs • Infinite utility

  12. The Optimality Principle does not Hold • Oscillating Utility • Unbounded Utility

  13. Further Utility Considerations • Utility may be risk-averse or risk-taking (ELAU is risk-neutral) • $1 million guaranteed (risk averse) • 50% chance of $2 million, 50% chance of $0 (risk taking) • The expected value is the same, so a risk-neutral agent would choose either

  14. 3 Models with Well-Defined Policy ELAU • Finite-horizon MDPs • Infinite-horizon discounted-reward MDPs • Stochastic shortest-path MDPs

  15. Finite-Horizon MDPs: Motivation • Assume the agent acts for a finite # of time steps, L • Example applications: • Inventory management: “How much X to order from the supplier every day ‘til the end of the season?” • Maintenance scheduling: “When to schedule disruptive maintenance jobs by their deadline?”

  16. Finite-Horizon MDPs: Definition (Puterman, 1994) • FH MDP: an FH MDP is a tuple <S, A, D, T, R> • S is a finite state space • A is a finite action set • D is a sequence of discrete decision epochs (time steps) up to a finite horizon L • T: S x A x S x D → [0, 1] is a transition function (probability) • R: S x A x S x D → ℝ is a reward function

  17. Finite-Horizon MDPs: Optimality Principle • For an FH MDP with horizon |D| = L < ∞, let: • Vπ(hs,t) = Eπhs,t[R1 + … + RL–t] for all 1 ≤ t ≤ L • Vπ(hs,L+1) = 0 • Then • V* exists and is Markovian, π* exists and is deterministic Markovian • For all s and 1 ≤ t ≤ L: V*(s,t) = maxa in A [Σs′ in S T(s, a, s′, t) [ R(s, a, s′, t) + V*(s′, t+1) ] ] π*(s,t) = argmaxa in A [Σs′ in S T(s, a, s′, t) [ R(s, a, s′, t) + V*(s′, t+1) ] ]

  18. Finite-Horizon MDPs: Optimality Principle (continued) • Why the ELAU is well-defined here: for every history, the value of every policy is well-defined, because each E[Ri] is finite and the number of terms in the series is finite

  19. Finite-Horizon MDPs: Optimality Principle (continued) • Reading the equation: V*(s,t) is the highest utility derivable from s at time t • The sum over s′ weighted by T(s, a, s′, t) takes the expectation • R(s, a, s′, t) is the immediate utility of the next action • V*(s′, t+1) is the highest utility derivable from the next state if you act optimally from now on

  20. Perks of the FH MDP Optimality Principle • If V* and π* are Markovian, then we only need to consider Markovian V and π • Easy to compute π* • For all s, compute V*(s, t) and π*(s, t) for t = L, …, 1
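A minimal sketch of that backward sweep, assuming the FlatMDP container from the earlier sketch (T and R keyed by (s, a, s′, t) and a proper distribution for every action); it fills in V*(s, t) and a deterministic Markovian π*(s, t) for t = L, …, 1 using the finite-horizon optimality equations.

```python
def solve_finite_horizon(mdp, L):
    """Backward induction: set V*(s, L+1) = 0, then sweep t = L, ..., 1 using
    V*(s,t) = max_a sum_{s'} T(s,a,s',t) * (R(s,a,s',t) + V*(s',t+1))."""
    V = {(s, L + 1): 0.0 for s in mdp.S}
    pi = {}
    for t in range(L, 0, -1):
        for s in mdp.S:
            best_a, best_q = None, float("-inf")
            for a in mdp.A:
                q = sum(p * (r + V[(s2, t + 1)])
                        for s2, p, r in mdp.successors(s, a, t))
                if q > best_q:
                    best_a, best_q = a, q
            V[(s, t)], pi[(s, t)] = best_q, best_a
    return V, pi
```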

  21. To Infinity and Beyond • Why go beyond the finite horizon? • Autonomous agents with long lifespans (elevators, investments, airplanes, etc.) • Infinite Horizon • Known to be infinite (can continue indefinitely) • Indefinite Horizon • Known to be unbounded (finite processes that can be delayed or extended but will eventually reach a terminal state)

  22. Analyzing MDPs with In(de)finite Horizon • Due to the infinite nature of D, we must define stationary (time-independent) functions: • T: S x A x S → [0, 1] is a transition function (probability) • R: S x A x S → ℝ is a reward function • π: S → A (this is also Markovian) • V: S → [-∞, ∞] (this is also Markovian)

  23. Infinite-Horizon Discounted-Reward MDPs: Definition • IHDR MDP: an IHDR MDP is a tuple <S, A, T, R, γ> • S is a finite state space • A is a finite action set • T: S x A x S → [0, 1] is a transition function (probability) • R: S x A x S → ℝ is a reward function • γ is a discount factor, 0 ≤ γ < 1 (favors more immediate rewards) • Policy value = discounted ELAU over infinite time steps

  24. Infinite-Horizon Discounted-Reward MDPs: Optimality Principle • For an IHDR MDP, let: • Vπ(h) = Eπh[R1 + γR2 + γ²R3 + …] for all h • Then • V* exists and is stationary Markovian, π* exists and is stationary deterministic Markovian • For all s: V*(s) = maxa in A [Σs′ in S T(s, a, s′) [ R(s, a, s′) + γV*(s′) ] ] π*(s) = argmaxa in A [Σs′ in S T(s, a, s′) [ R(s, a, s′) + γV*(s′) ] ]

  25. Infinite-Horizon Discounted-Reward MDPs: Optimality Principle (continued) • Why the ELAU is well-defined here: for every history, the value of a policy is well-defined thanks to 0 ≤ γ < 1, since each term γ^(i–1)E[Ri] is bounded by K·γ^(i–1) for some finite K and the series converges geometrically

  26. Infinite-Horizon Discounted-Reward MDPs: Optimality Principle (continued) • Reading the equation: the optimal utility V*(s) is time-independent • Future utility is discounted by γ • Otherwise the structure mirrors the finite-horizon equation, with the expectation taken over successor states s′
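Purely as an illustration of the equation itself (the solution algorithms come later), here is a sketch of a single Bellman backup for one state of a stationary flat MDP; the dictionary layout of T and R is an assumption of the sketch.

```python
def bellman_backup(S, A, T, R, V, s, gamma):
    """Evaluate max_a sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V[s'])
    for one state s, returning (best value, argmax action).
    T and R are dicts keyed by (s, a, s'); V is a dict keyed by state."""
    best_a, best_q = None, float("-inf")
    for a in A:
        q = sum(T.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                for s2 in S)
        if q > best_q:
            best_a, best_q = a, q
    return best_q, best_a
```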

  27. Perks of the IHDR MDP Optimality Principle • If V* and π* are stationary Markovian, then we only need to consider stationary Markovian V and π

  28. The Meaning of γ • γ can affect the optimal policy significantly • γ = 0 + ε: yields myopic policies for impatient agents • γ = 1 - ε: yields far-sighted policies, inefficient to compute • How to set it? • Sometimes suggested by data (inflation rate, interest rate, tax rate) • Often set arbitrarily to a value that gives a reasonable policy

  29. Stochastic Shortest-Path MDPs: Motivation • Assume the agent pays a cost to achieve a goal • Example applications: • Controlling a Mars rover: “How to collect scientific data without damaging the rover?” • Navigation: “What’s the fastest way to get to a destination, taking into account the traffic jams?” • Cost is often time or a physical resource

  30. Stochastic Shortest-Path MDPs: Definition • SSP MDP: an SSP MDP is a tuple <S, A, T, C, G> • S is a finite state space • A is a finite action set • T: S x A x S → [0, 1] is a stationary transition function (probability) • C: S x A x S → ℝ is a stationary cost function (R = -C) • G ⊆ S is a set of absorbing cost-free goal states • Under two conditions: • There is at least one proper policy (a policy that reaches the goal with P = 1 from all states) • Every improper policy incurs a cost of infinity from every state from which it does not reach the goal with P = 1
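A small sketch of the SSP tuple in flat form, with a helper that checks the "absorbing, cost-free goal" requirement; the class name and dictionary layout are illustrative assumptions, and the two conditions on proper/improper policies are properties of the whole model that this sketch does not verify.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

@dataclass
class SSPMDP:
    """Flat SSP MDP <S, A, T, C, G> (illustrative layout)."""
    S: List[int]
    A: List[str]
    T: Dict[Tuple[int, str, int], float]   # stationary transition probabilities
    C: Dict[Tuple[int, str, int], float]   # stationary costs (C = -R)
    G: Set[int]                            # absorbing, cost-free goal states

    def goals_are_absorbing_and_cost_free(self) -> bool:
        """Check that every action in a goal state self-loops with cost 0."""
        return all(
            self.T.get((g, a, g), 0.0) == 1.0 and self.C.get((g, a, g), 0.0) == 0.0
            for g in self.G for a in self.A
        )
```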

  31. SSP MDP Details • In SSP, maximizing ELAU = minimizing expected cost • Every cost-minimizing policy is proper • Thus, an optimal policy is the cheapest way to reach the goal • Why are SSP MDPs called “indefinite-horizon”? • An optimal policy takes a finite, but a priori unknown, time to reach the goal (the time to goal depends on the stochastic outcomes of the policy’s execution) • In the limit as t approaches infinity, the probability that a goal state has been reached approaches 1

  32.–37. SSP MDP Example (six figure-only slides building up an example SSP MDP; the graphics are not included in the transcript)

  38. SSP MDPs: Optimality Principle • For an SSP MDP, let: • Vπ(h) = Eπh[C1 + C2 + …] for all h • Then • V* exists and is stationary Markovian, π* exists and is stationary deterministic Markovian • For all s: V*(s) = mina in A [Σs′ in S T(s, a, s′) [ C(s, a, s′) + V*(s′) ] ] π*(s) = argmina in A [Σs′ in S T(s, a, s′) [ C(s, a, s′) + V*(s′) ] ]

  39. SSP MDPs: Optimality Principle (continued) • Why the ELAU is well-defined here: for every history, the value of a policy is well-defined, because every policy either takes a finite expected number of steps to reach a goal or has infinite cost
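Mirroring the earlier backup sketches, the SSP version minimizes expected cost instead of maximizing reward; the argument layout below is an assumption of the sketch, not notation from the slides.

```python
def ssp_backup(S, A, T, C, V, s):
    """Evaluate min_a sum_{s'} T(s,a,s') * (C(s,a,s') + V[s']) for one state s,
    returning (best cost-to-goal estimate, argmin action).
    T and C are dicts keyed by (s, a, s'); V is a dict keyed by state."""
    best_a, best_q = None, float("inf")
    for a in A:
        q = sum(T.get((s, a, s2), 0.0) * (C.get((s, a, s2), 0.0) + V[s2])
                for s2 in S)
        if q < best_q:
            best_a, best_q = a, q
    return best_q, best_a
```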

  40. The MDP Hierarchy • FH => SSP: turn all states (s, L) into goals • IHDR => SSP: add (1 – γ)-probability transitions to goal • Focusing on SSP allows us to develop one set of algorithms to solve all three classes of MDPs.

  41. Flat vs. Factored Representation of MDPs • We are only concerned with the flat representation here • This is the name for the representation already introduced on the definition slides • It is easier to solve MDPs in flat representation, while factored representations make it much easier to describe large MDPs compactly • If you are interested in factored representation, read Section 2.5

  42. Computational Complexity of MDPs • Solving IHDR and SSP MDPs in flat representation is P-complete • Solving FH MDPs in flat representation is P-hard • P-completeness suggests these problems are unlikely to benefit from parallelization, but they are solvable in polynomial time

  43. MDP Exact Solutions I: At a Glance • Brute-Force Algorithm (3.1) • Policy Evaluation (3.2)

  44. Brute Force Algorithm • Go over all policies π • How many? |A|^|S|, a finite number • Evaluate each policy • Vπ(s), the expected cost of reaching the goal from s • Choose the best, π* • The SSP optimality principle tells us that a best policy exists • Vπ*(s) ≤ Vπ(s) for all s and all π
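A sketch of this enumeration, assuming an evaluate_policy(pi) callable that returns Vπ as a dictionary over states; computing Vπ is exactly the policy-evaluation problem the next slides address, and the function names here are illustrative.

```python
import itertools

def brute_force(S, A, evaluate_policy, s0):
    """Enumerate all |A|^|S| deterministic stationary policies, evaluate each,
    and keep the one with the lowest expected cost-to-goal. The optimality
    principle guarantees the best policy dominates at every state; this sketch
    compares values at the start state s0 for simplicity."""
    best_pi, best_V = None, None
    for choice in itertools.product(A, repeat=len(S)):
        pi = dict(zip(S, choice))          # one candidate policy: S -> A
        V = evaluate_policy(pi)            # V_pi(s) for every state s
        if best_V is None or V[s0] < best_V[s0]:
            best_pi, best_V = pi, V
    return best_pi, best_V
```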

  45. Policy Evaluation • Given a policy π, compute Vπ • To start out, assume that π is proper • Execution of π reaches a goal from any state

  46. Deterministic SSPs • Policy graph for π • π(s0) = a0; π(s1) = a1 • Vπ(s1) = 1 • Vπ(s0) = 5 + 1 = 6

  47. Acyclic SSPs • Policy graph for π • Vπ(s1) = 1 • Vπ(s2) = 4 • Vπ(s0) = 0.6(5 + 1) + 0.4(2 + 4) = 6

  48. Cyclic SSPs • Policy graph for π • Vπ(s1) = 1 • Vπ(s2) = 0.7(4) + 0.3(3 + Vπ(s0)) • Vπ(s0) = 0.6(5 + 1) + 0.4(2 + 0.7(4) + 0.3(3 + Vπ(s0)))

  49. Cyclic SSPs • Generalized system of equations • Vπ(sg) = 0 • Vπ(s1) = 1 + Vπ(sg) • Vπ(s2) = 0.7(4 + Vπ(sg)) + 0.3(3 + Vπ(s0)) • Vπ(s0) = 0.6(5 + Vπ(s1)) + 0.4(2 + Vπ(s2))

  50. Policy Evaluation with a System of Equations • Constructing the system of equations • Vπ(s) = 0 if s ∈ G • Vπ(s) = ∑s′ ∈ S T(s, π(s), s′) [C(s, π(s), s′) + Vπ(s′)] otherwise • |S| variables • O(|S|³) running time
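A sketch of this construction applied to the system from slide 49, using numpy's linear solver in place of explicit O(|S|³) Gaussian elimination; the matrix below simply rearranges the slide-49 equations into A·V = b.

```python
import numpy as np

# Variables ordered V = [V(s0), V(s1), V(s2), V(sg)]. Each row rearranges one
# slide-49 equation so that the unknowns are on the left and constants on the right.
A = np.array([
    [ 1.0, -0.6, -0.4,  0.0],   # V(s0) = 0.6*(5 + V(s1)) + 0.4*(2 + V(s2))
    [ 0.0,  1.0,  0.0, -1.0],   # V(s1) = 1 + V(sg)
    [-0.3,  0.0,  1.0, -0.7],   # V(s2) = 0.7*(4 + V(sg)) + 0.3*(3 + V(s0))
    [ 0.0,  0.0,  0.0,  1.0],   # V(sg) = 0
])
b = np.array([
    0.6 * 5 + 0.4 * 2,          # constant part of the s0 equation
    1.0,
    0.7 * 4 + 0.3 * 3,          # constant part of the s2 equation
    0.0,
])

V = np.linalg.solve(A, b)
print(V)   # approximately [6.68, 1.00, 5.70, 0.00]
```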
