## MDP Problems and Exact Solutions I

**MDP Problems and Exact Solutions I**
• Ryan Christiansen
• Department of Mechanical Engineering and Materials Science
• Rice University
• Slides adapted from Mausam and Andrey Kolobov

**MDP Problems: At a Glance**
• MDP Definition (2.1)
• Solutions of an MDP (2.2)
• Solution Existence (2.3)
• Stochastic Shortest-Path MDPs (2.4)
• Complexity of Solving MDPs (2.6)

**MDP Definition**
• MDP: an MDP is a tuple $\langle S, A, D, T, R \rangle$
  • $S$ is a finite state space
  • $A$ is a finite action set
  • $D$ is a sequence of discrete decision epochs (time steps)
  • $T: S \times A \times S \times D \to [0, 1]$ is a transition function (probability)
  • $R: S \times A \times S \times D \to \mathbb{R}$ is a reward function

**An MDP Problem**
• How does an MDP problem work?
  • Initial Conditions: the starting state
  • Actions: actions are chosen at each decision epoch to traverse the MDP
  • Termination: reach a terminating state or the final decision epoch
• The goal is to end up with the highest net reward at termination

**The Policy**
• Policy: a rule for choosing actions
• Global/Complete: a policy must always be applicable for the entire MDP
• In general, policies will be
  • Probabilistic: able to choose between multiple actions randomly
  • History-Dependent: able to utilize the execution history, or the set of state and action pairs previously traversed
• $\pi: H \times A \to [0, 1]$

**Markovian Policy**
• Markovian Policy: a history-dependent policy that only depends on the current state and time step
• For any two histories $h_{s,t}$ and $h'_{s,t}$, both of which end at the same state $s$ and timestep $t$, and for any action $a$, a Markovian policy $\pi$ will satisfy $\pi(h_{s,t}, a) = \pi(h'_{s,t}, a)$
• In practice, it functions as a history-independent policy
• $\pi: S \times D \times A \to [0, 1]$
• For several important types of MDPs, at least one optimal solution is necessarily Markovian

**Stationary Markovian Policy**
• Stationary Markovian Policy: a Markovian policy that does not depend on time
• For any two timesteps $t_1$ and $t_2$ and state $s$, and for any action $a$, a stationary Markovian policy $\pi$ will satisfy $\pi(s, t_1, a) = \pi(s, t_2, a)$
• $\pi: S \times A \to [0, 1]$

**Evaluate a Policy with the Value Function**
• Value Function: a function mapping the domain of the policy, excluding the action set, to a scalar value
  • History dependent: $V: H \to [-\infty, \infty]$
  • Markovian: $V: S \times D \to [-\infty, \infty]$
  • Stationary Markovian: $V: S \to [-\infty, \infty]$
• Value Function of a Policy: the utility function of the reward sequence returned from executing the policy, or the total utility of the total reward
• $V^\pi(h_{s,t}) = u(R_t^{\pi h_{s,t}}, R_{t+1}^{\pi h_{s,t}}, \ldots)$

**Solutions of an MDP**
• A solution to an MDP is an optimal policy, or a policy that maximizes utility
• Policy $\pi^*$ is optimal if its value function, $V^*$, is greater than or equal to the value function of any other policy
  • $V^*(h) \ge V^\pi(h)$ for all $h$ and $\pi$
• Need to be careful when defining utility $u(R_1, R_2, \ldots)$
  • For the same $h$, utility can be different across policy executions
• Existence and uniqueness are not guaranteed for many types of MDPs

**Expected Linear Additive Utility (ELAU)**
• $u(R_1, R_2, \ldots) = E[R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots]$, where $\gamma$ is the discount factor
• Assume $\gamma = 1$ unless stated otherwise
  • $0 \le \gamma < 1$: more immediate rewards are more valuable
  • $\gamma = 1$: rewards are equally valuable, independently of time
  • $\gamma > 1$: more distant rewards are more valuable

**The Optimality Principle**
• The Optimality Principle: if every policy’s quality can be measured by this policy’s ELAU, there exists a policy that is optimal at every timestep
• There are some situations where it may not apply:
  • When stuck in a repeating sequence of states (a loop)
  • Infinite decision epochs
  • Infinite utility

**The Optimality Principle does not Hold**
• Oscillating Utility
• Unbounded Utility

**Further Utility Considerations**
• Risk averse, or risk taking (ELAU is risk neutral)
  • \$1 million guaranteed (risk averse)
  • 50% chance of \$2 million, 50% chance of \$0 (risk taking)
  • Expected value is the same, so risk neutral would choose either

**3 Models with Well-Defined Policy ELAU**
• Finite-horizon MDPs
• Infinite-horizon discounted-reward MDPs
• Stochastic shortest-path MDPs

**Finite-Horizon MDPs: Motivation**
• Assume the agent acts for a finite number of time steps, $L$
• Example applications:
  • Inventory management: “How much X to order from the supplier every day ‘til the end of the season?”
  • Maintenance scheduling: “When to schedule disruptive maintenance jobs by their deadline?”

**Finite-Horizon MDPs: Definition** (Puterman, 1994)
• FH MDP: an FH MDP is a tuple $\langle S, A, D, T, R \rangle$
  • $S$ is a finite state space
  • $A$ is a finite action set
  • $D$ is a sequence of discrete decision epochs (time steps) up to a finite horizon $L$
  • $T: S \times A \times S \times D \to [0, 1]$ is a transition function (probability)
  • $R: S \times A \times S \times D \to \mathbb{R}$ is a reward function

**Finite-Horizon MDPs: Optimality Principle**
• For an FH MDP with horizon $|D| = L < \infty$, let:
  • $V^\pi(h_{s,t}) = E^\pi_{h_{s,t}}[R_1 + \cdots + R_{L-t}]$ for all $1 \le t \le L$
  • $V^\pi(h_{s,L+1}) = 0$
• Then
  • $V^*$ exists and is Markovian, $\pi^*$ exists and is deterministic Markovian
  • For all $s$ and $1 \le t \le L$:

$$V^*(s,t) = \max_{a \in A} \sum_{s' \in S} T(s, a, s', t)\,\bigl[R(s, a, s', t) + V^*(s', t+1)\bigr]$$

$$\pi^*(s,t) = \arg\max_{a \in A} \sum_{s' \in S} T(s, a, s', t)\,\bigl[R(s, a, s', t) + V^*(s', t+1)\bigr]$$

• Why the value is well-defined:
  • $V^\pi$ is the ELAU; for every history, the value of every policy is well-defined
  • Each $E[R_i]$ is finite
  • The number of terms in the series is finite
• Reading the equation:
  • $V^*(s,t)$: highest utility derivable from $s$ at time $t$
  • $\max_{a \in A}$: if you act optimally now
  • $\sum_{s' \in S} T(s, a, s', t)$: in expectation
  • $R(s, a, s', t)$: immediate utility of the next action
  • $V^*(s', t+1)$: highest utility derivable from the next state

**Perks of the FH MDP Optimality Principle**
• If $V^*$ and $\pi^*$ are Markovian, then we only need to consider Markovian $V$ and $\pi$
• Easy to compute $\pi^*$
  • For all $s$, compute $V^*(s, t)$ and $\pi^*(s, t)$ for $t = L, \ldots, 1$

**To Infinity and Beyond**
• Why go beyond the finite horizon?
  • Autonomous agents with long lifespans (elevators, investments, airplanes, etc.)
• Infinite Horizon
  • Known to be infinite (can continue indefinitely)
• Indefinite Horizon
  • Known to be unbounded (finite processes that can be delayed or extended but will eventually reach a terminal state)

**Analyzing MDPs with In(de)finite Horizon**
• Due to the infinite nature of $D$, we must define stationary, or time-independent, functions:
  • $T: S \times A \times S \to [0, 1]$ is a transition function (probability)
  • $R: S \times A \times S \to \mathbb{R}$ is a reward function
  • $\pi: S \to A$ (this is also Markovian)
  • $V: S \to [-\infty, \infty]$ (this is also Markovian)

**Infinite-Horizon Discounted-Reward MDPs: Definition**
• IHDR MDP: an IHDR MDP is a tuple $\langle S, A, T, R, \gamma \rangle$
  • $S$ is a finite state space
  • $A$ is a finite action set
  • $T: S \times A \times S \to [0, 1]$ is a transition function (probability)
  • $R: S \times A \times S \to \mathbb{R}$ is a reward function
  • $\gamma$ is a discount factor between 0 and 1 (favors immediate rewards)
• Policy value = discounted ELAU over infinite time steps

**Infinite-Horizon Discounted-Reward MDPs: Optimality Principle**
• For an IHDR MDP, let:
  • $V^\pi(h) = E^\pi_h[R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots]$ for all $h$
• Then
  • $V^*$ exists and is stationary Markovian, $\pi^*$ exists and is stationary deterministic Markovian
  • For all $s$:

$$V^*(s) = \max_{a \in A} \sum_{s' \in S} T(s, a, s')\,\bigl[R(s, a, s') + \gamma V^*(s')\bigr]$$

$$\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} T(s, a, s')\,\bigl[R(s, a, s') + \gamma V^*(s')\bigr]$$

• Why the value is well-defined:
  • $V^\pi$ is the ELAU; for every history, the value of a policy is well-defined thanks to $0 \le \gamma < 1$
  • Each $E[R_i]$ is bounded by some finite $K$, so the terms $\gamma^{i-1} E[R_i]$ shrink geometrically and the series converges
• Reading the equation:
  • Future utility $V^*(s')$ is discounted by $\gamma$
  • Optimal utility is time independent

**Perks of the IHDR MDP Optimality Principle**
• If $V^*$ and $\pi^*$ are stationary Markovian, then we only need to consider stationary Markovian $V$ and $\pi$

**The Meaning of γ**
• $\gamma$ can affect the optimal policy significantly
  • $\gamma = 0 + \varepsilon$: yields myopic policies for impatient agents
  • $\gamma = 1 - \varepsilon$: yields far-sighted policies, inefficient to compute
• How to set it?
  • Sometimes suggested by data (inflation rate, interest rate, tax rate)
  • Often set arbitrarily to a value that gives a reasonable policy

**Stochastic Shortest-Path MDPs: Motivation**
• Assume the agent pays a cost to achieve a goal
• Example applications:
  • Controlling a Mars rover: “How to collect scientific data without damaging the rover?”
  • Navigation: “What’s the fastest way to get to a destination, taking into account the traffic jams?”
• Cost is often time or a physical resource

**Stochastic Shortest-Path MDPs: Definition**
• SSP MDP: an SSP MDP is a tuple $\langle S, A, T, C, G \rangle$
  • $S$ is a finite state space
  • $A$ is a finite action set
  • $T: S \times A \times S \to [0, 1]$ is a stationary transition function (probability)
  • $C: S \times A \times S \to \mathbb{R}$ is a stationary cost function ($R = -C$)
  • $G \subseteq S$ is a set of absorbing cost-free goal states
• Under two conditions:
  • There is at least one proper policy (reaches the goal with $P = 1$ from all states)
  • Every improper policy incurs a cost of infinity from every state from which it does not reach the goal with $P = 1$

**SSP MDP Details**
• In SSP, maximizing ELAU = minimizing expected cost
• Every cost-minimizing policy is proper
• Thus, an optimal policy is the cheapest way to reach the goal
• Why are SSP MDPs called “indefinite-horizon”?
  • If a policy is optimal, it will take a finite, but a priori unknown, time to reach the goal (the time to goal depends on how the execution of the policy unfolds)
  • In the limit as $t$ approaches infinity, the probability that a goal state has been reached approaches 1

**SSP MDPs: Optimality Principle**
• For an SSP MDP, let:
  • $V^\pi(h) = E^\pi_h[C_1 + C_2 + \cdots]$ for all $h$
• Then
  • $V^*$ exists and is stationary Markovian, $\pi^*$ exists and is stationary deterministic Markovian
  • For all $s$ (a minimum, because $V$ here measures expected cost):

$$V^*(s) = \min_{a \in A} \sum_{s' \in S} T(s, a, s')\,\bigl[C(s, a, s') + V^*(s')\bigr]$$

$$\pi^*(s) = \arg\min_{a \in A} \sum_{s' \in S} T(s, a, s')\,\bigl[C(s, a, s') + V^*(s')\bigr]$$

• Why the value is well-defined:
  • $V^\pi$ is the ELAU; for every history, the value of a policy is well-defined
  • Every policy either takes a finite expected number of steps to reach a goal, or has infinite cost

**The MDP Hierarchy**
• FH => SSP: turn all states $(s, L)$ into goals
• IHDR => SSP: add $(1 - \gamma)$-probability transitions to goal
• Focusing on SSP allows us to develop one set of algorithms to solve all three classes of MDPs

**Flat vs. Factored Representation of MDPs**
• We are only concerned with using flat representation
  • This is the name for the representation already introduced on the definition slides
  • It is easier to solve MDPs in flat representation, whereas larger MDPs are much easier to describe in factored representation
• If you are interested in factored representation, read Section 2.5

**Computational Complexity of MDPs**
• Solving IHDR, SSP in flat representation is P-complete
• Solving FH in flat representation is P-hard
• They are solvable in polynomial time, but are not expected to benefit much from parallelization

**MDP Exact Solutions I: At a Glance**
• Brute-Force Algorithm (3.1)
• Policy Evaluation (3.2)

**Brute-Force Algorithm**
• Go over all policies $\pi$
  • How many? $|A|^{|S|}$, a finite amount
• Evaluate each policy
  • $V^\pi(s)$, the expected cost of reaching the goal from $s$
• Choose the best, $\pi^*$
  • The SSP optimality principle tells us that a best policy exists
  • $V^{\pi^*}(s) \le V^\pi(s)$ for all $s$ and $\pi$

**Policy Evaluation**
• Given a policy $\pi$, compute $V^\pi$
• To start out, assume that $\pi$ is proper
  • Execution of $\pi$ reaches a goal from any state

**Deterministic SSPs**
• Policy graph for $\pi$
  • $\pi(s_0) = a_0$; $\pi(s_1) = a_1$
• $V^\pi(s_1) = 1$
• $V^\pi(s_0) = 5 + 1 = 6$

**Acyclic SSPs**
• Policy graph for $\pi$
• $V^\pi(s_1) = 1$
• $V^\pi(s_2) = 4$
• $V^\pi(s_0) = 0.6(5 + 1) + 0.4(2 + 4) = 6$

**Cyclic SSPs**
• Policy graph for $\pi$
• $V^\pi(s_1) = 1$
• $V^\pi(s_2) = 0.7(4) + 0.3(3 + V^\pi(s_0))$
• $V^\pi(s_0) = 0.6(5 + 1) + 0.4(2 + 0.7(4) + 0.3(3 + V^\pi(s_0)))$

**Cyclic SSPs**
• Generalized system of equations
  • $V^\pi(s_g) = 0$
  • $V^\pi(s_1) = 1 + V^\pi(s_g)$
  • $V^\pi(s_2) = 0.7(4 + V^\pi(s_g)) + 0.3(3 + V^\pi(s_0))$
  • $V^\pi(s_0) = 0.6(5 + V^\pi(s_1)) + 0.4(2 + V^\pi(s_2))$

**Policy Evaluation with a System of Equations**
• Constructing the system of equations
  • $V^\pi(s) = 0$ if $s \in G$
  • $V^\pi(s) = \sum_{s' \in S} T(s, \pi(s), s')\,\bigl[C(s, \pi(s), s') + V^\pi(s')\bigr]$ otherwise
• $|S|$ variables
• $O(|S|^3)$ running time
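
The system of equations for evaluating a proper policy can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own implementation: the function name `evaluate_policy`, the `outcomes` dictionary layout, and the state labels are assumptions made for the example, and the edge probabilities and costs are read off the cyclic SSP equations in the slides. It rearranges the equations into a linear system (moving every non-goal $V^\pi(s')$ to the left-hand side) and solves it with plain Gauss-Jordan elimination, which matches the cubic running time quoted in the slides.

```python
from typing import Dict, Hashable, List, Set, Tuple

State = Hashable
# outcomes[s] lists (probability, cost, next_state) for the action pi(s).
# This layout is an assumption made for the sketch.
Outcomes = Dict[State, List[Tuple[float, float, State]]]


def evaluate_policy(outcomes: Outcomes, goals: Set[State]) -> Dict[State, float]:
    """Solve V(s) = sum_s' T * (C + V(s')) with V(goal) = 0 for a proper policy."""
    states = [s for s in outcomes if s not in goals]
    index = {s: i for i, s in enumerate(states)}
    n = len(states)

    # Augmented matrix [A | b] for the rearranged system (I - P) V = c.
    aug = [[0.0] * (n + 1) for _ in range(n)]
    for s, i in index.items():
        aug[i][i] += 1.0
        for prob, cost, nxt in outcomes[s]:
            aug[i][n] += prob * cost        # expected immediate cost
            if nxt not in goals:
                aug[i][index[nxt]] -= prob  # move V(s') to the left-hand side

    # Gauss-Jordan elimination with partial pivoting: O(n^3).
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: the policy may be improper")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]

    values = {s: aug[i][n] / aug[i][i] for s, i in index.items()}
    values.update({g: 0.0 for g in goals})
    return values


# The cyclic SSP from the slides, restricted to the actions chosen by pi.
cyclic = {
    "s0": [(0.6, 5.0, "s1"), (0.4, 2.0, "s2")],
    "s1": [(1.0, 1.0, "sg")],
    "s2": [(0.7, 4.0, "sg"), (0.3, 3.0, "s0")],
}
v = evaluate_policy(cyclic, goals={"sg"})
print({s: round(x, 4) for s, x in v.items()})
```

Substituting by hand gives the same answer: $V^\pi(s_0) = 5.88 + 0.12\,V^\pi(s_0)$, so $V^\pi(s_0) = 5.88 / 0.88 \approx 6.68$ and $V^\pi(s_2) = 3.7 + 0.3\,V^\pi(s_0) \approx 5.70$. In practice a linear-algebra library would normally replace the hand-written elimination; it is spelled out here only to keep the sketch dependency-free.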