
Reinforcement Learning: Dynamic Programming








  1. Reinforcement Learning: Dynamic Programming. Csaba Szepesvári, University of Alberta. Kioloa, MLSS’08. Slides: http://www.cs.ualberta.ca/~szepesva/MLSS08/

  2. Reinforcement Learning
  RL = “Sampling-based methods to solve optimal control problems”
  • Contents:
    Defining AI
    Markovian Decision Problems
    Dynamic Programming
    Approximate Dynamic Programming
    Generalizations (Rich Sutton)

  3. Literature • Books • Richard S. Sutton, Andrew G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998 • Dimitri P. Bertsekas, John Tsitsiklis: Neuro-Dynamic Programming, Athena Scientific, 1996 • Journals • JMLR, MLJ, JAIR, AI • Conferences • NIPS, ICML, UAI, AAAI, COLT, ECML, IJCAI

  4. Some More Books • Martin L. Puterman. Markov Decision Processes. Wiley, 1994. • Dimitri P. Bertsekas: Dynamic Programming and Optimal Control. Athena Scientific. Vol. I (2005), Vol. II (2007). • James S. Spall: Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley, 2003.

  5. Resources • RL-Glue • http://rlai.cs.ualberta.ca/RLBB/top.html • RL-Library • http://rlai.cs.ualberta.ca/RLR/index.html • The RL Toolbox 2.0 • http://www.igi.tugraz.at/ril-toolbox/general/overview.html • OpenDP • http://opendp.sourceforge.net • RL-Competition (2008)! • http://rl-competition.org/ • June 1st, 2008: Test runs begin! • Related fields: • Operations research (MOR, OR) • Control theory (IEEE TAC, Automatica, IEEE CDC, ECC) • Simulation optimization (Winter Simulation Conference)

  6. Abstract Control Model [diagram] The environment sends sensations (and reward) to the controller (= agent), and the controller sends actions back to the environment: the “perception-action loop”.

  7. Zooming in… [diagram] Inside the agent: memory, agent state, internal sensations; external sensations and reward come in, actions go out.

  8. A Mathematical Model
  • Plant (controlled object):
    x_{t+1} = f(x_t, a_t, v_t), x_t: state, v_t: noise
    z_t = g(x_t, w_t), z_t: sensation/observation, w_t: noise
  • State: a sufficient statistic for the future,
    either independently of what we measure (“objective state”),
    or relative to the measurements (“subjective state”)
  • Controller:
    a_t = F(z_1, z_2, …, z_t), a_t: action/control
    ⇒ PERCEPTION-ACTION LOOP, “CLOSED-LOOP CONTROL”
  • Design problem: F = ?
  • Goal: Σ_{t=1}^T r(z_t, a_t) → max
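  The closed-loop structure above can be made concrete in a few lines of code. The following Python sketch uses toy, made-up choices for the dynamics f, the sensor g, the reward r, and a purely reactive controller F; it only illustrates how the perception-action loop is wired.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, a, v):          # plant dynamics: next state from state, action, noise
    return 0.9 * x + a + v

def g(x, w):             # sensor: noisy observation of the state
    return x + w

def F(z_history):        # toy feedback controller: act on the latest observation
    return -0.5 * z_history[-1]

def r(z, a):             # toy reward: penalize large observations and actions
    return -(z ** 2 + 0.1 * a ** 2)

x, z_history, total_reward = 1.0, [], 0.0
for t in range(100):     # the perception-action loop
    z = g(x, rng.normal(scale=0.01))
    z_history.append(z)
    a = F(z_history)
    total_reward += r(z, a)
    x = f(x, a, rng.normal(scale=0.01))

print("return over 100 steps:", total_reward)
```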

  9. A Classification of Controllers
  • Feedforward:
    a_1, a_2, … is designed ahead of time
    ???
  • Feedback:
    Purely reactive systems: a_t = F(z_t). Why is this bad?
    Feedback with memory: m_t = M(m_{t-1}, z_t, a_{t-1}) ~ interpreting sensations;
    a_t = F(m_t) ~ decision making (deliberative vs. reactive)

  10. Feedback Controllers
  • Plant:
    x_{t+1} = f(x_t, a_t, v_t)
    z_{t+1} = g(x_t, w_t)
  • Controller:
    m_t = M(m_{t-1}, z_t, a_{t-1})
    a_t = F(m_t)
  • m_t ≈ x_t: state estimation, “filtering”; difficulties: noise, unmodelled parts
  • How do we compute a_t?
    With a model (f'): model-based control; assumes (some kind of) state estimation
    Without a model: model-free control

  11. Markovian Decision Problems

  12. Markovian Decision Problems
  • (X, A, p, r, γ)
  • X – set of states
  • A – set of actions (controls)
  • p – transition probabilities p(y|x,a)
  • r – rewards r(x,a,y), or r(x,a), or r(x)
  • γ – discount factor, 0 ≤ γ < 1
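  For the finite case, such an MDP is fully specified by a transition array, a reward array, and γ. Below is a minimal, hypothetical 3-state, 2-action example in Python (the numbers are arbitrary), just to fix a concrete representation.

```python
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9                                   # discount factor, 0 <= gamma < 1

# P[a, x, y] = p(y | x, a); each row P[a, x, :] must sum to 1
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]],   # action 1
])

# R[a, x, y] = r(x, a, y); here rewards depend only on the next state y
R = np.zeros((n_actions, n_states, n_states))
R[:, :, 2] = 1.0                              # reaching state 2 pays +1

# expected immediate reward r(x, a) = sum_y p(y|x,a) r(x,a,y)
r_xa = (P * R).sum(axis=2)                    # shape (n_actions, n_states)
```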

  13. The Process View
  • (X_t, A_t, R_t)
  • X_t – state at time t
  • A_t – action at time t
  • R_t – reward at time t
  • Laws:
    X_{t+1} ~ p(·|X_t, A_t)
    A_t ~ π(·|H_t), π: policy
    H_t = (X_t, A_{t-1}, R_{t-1}, …, A_1, R_1, X_0) – history
    R_t = r(X_t, A_t, X_{t+1})

  14. The Control Problem
  • Value functions: V^π(x) = E^π[ Σ_{t=0}^∞ γ^t R_t | X_0 = x ]
  • Optimal value function: V*(x) = sup_π V^π(x)
  • Optimal policy: π* such that V^{π*} = V*

  15. Applications of MDPs: operations research, econometrics, optimal investments, replacement problems, option pricing, logistics, inventory management, active vision, production scheduling, dialogue control, control, statistics, games, AI, bioreactor control, robotics (RoboCup Soccer), driving, real-time load balancing, design of experiments (medical tests)

  16. Variants of MDPs • Discounted • Undiscounted: Stochastic Shortest Path • Average reward • Multiple criteria • Minimax • Games

  17. MDP Problems
  • Planning: The MDP (X, A, p, r, γ) is known. Find an optimal policy π*!
  • Learning: The MDP is unknown. You are allowed to interact with it. Find an optimal policy π*!
  • Optimal learning: While interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning.

  18. Solving MDPs – Dimensions
  • Which problem? (planning, learning, optimal learning)
  • Exact or approximate?
  • Uses samples?
  • Incremental?
  • Uses value functions?
    Yes: value-function based methods
      Planning: DP, Random Discretization Method, FVI, …
      Learning: Q-learning, actor-critic, …
    No: policy search methods
      Planning: Monte-Carlo tree search, likelihood ratio methods (policy gradient), sample-path optimization (Pegasus), …
  • Representation
    Structured state: factored states, logical representation, …
    Structured policy space: hierarchical methods

  19. Dynamic Programming

  20. Richard Bellman (1920-1984) • Control theory • Systems Analysis • Dynamic Programming:RAND Corporation, 1949-1955 • Bellman equation • Bellman-Ford algorithm • Hamilton-Jacobi-Bellman equation • “Curse of dimensionality” • invariant imbeddings • Grönwall-Bellman inequality

  21. Bellman Operators
  • Let π: X → A be a stationary policy
  • B(X) = { V | V: X → ℝ, ||V||_∞ < ∞ }
  • T_π: B(X) → B(X), (T_π V)(x) = Σ_y p(y|x,π(x)) [ r(x,π(x),y) + γ V(y) ]
  • Theorem: T_π V^π = V^π
  • Note: this is a linear system of equations: r_π + γ P_π V^π = V^π, i.e. V^π = (I - γ P_π)^{-1} r_π
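  Because T_π is affine, V^π can be computed exactly by solving the linear system above. A minimal NumPy sketch on a small randomly generated MDP (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(A, S))     # P[a, x, :] = p(. | x, a)
R = rng.normal(size=(A, S, S))                 # R[a, x, y] = r(x, a, y)

pi = np.array([0, 1, 1, 0])                    # a stationary deterministic policy pi(x)

# Build P_pi (S x S) and r_pi (S,) for the chosen policy
P_pi = P[pi, np.arange(S), :]                  # row x is p(. | x, pi(x))
r_pi = (P_pi * R[pi, np.arange(S), :]).sum(axis=1)

# V^pi = (I - gamma P_pi)^{-1} r_pi
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
print(V_pi)
```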

  22. Proof of T_π V^π = V^π
  • What you need to know:
    Linearity of expectation: E[A+B] = E[A] + E[B]
    Law of total expectation: E[Z] = Σ_x P(X=x) E[Z | X=x], and E[Z | U=u] = Σ_x P(X=x | U=u) E[Z | U=u, X=x]
    Markov property: E[ f(X_1, X_2, …) | X_1=y, X_0=x ] = E[ f(X_1, X_2, …) | X_1=y ]
  • V^π(x) = E^π[ Σ_{t=0}^∞ γ^t R_t | X_0 = x ]
    = Σ_y P(X_1=y | X_0=x) E^π[ Σ_{t=0}^∞ γ^t R_t | X_0=x, X_1=y ]   (by the law of total expectation)
    = Σ_y p(y|x,π(x)) E^π[ Σ_{t=0}^∞ γ^t R_t | X_0=x, X_1=y ]   (since X_1 ~ p(·|X_0, π(X_0)))
    = Σ_y p(y|x,π(x)) { E^π[ R_0 | X_0=x, X_1=y ] + γ E^π[ Σ_{t=0}^∞ γ^t R_{t+1} | X_0=x, X_1=y ] }   (by the linearity of expectation)
    = Σ_y p(y|x,π(x)) { r(x,π(x),y) + γ V^π(y) }   (using the definitions of r and V^π, and the Markov property)
    = (T_π V^π)(x).   (using the definition of T_π)

  23. The Banach Fixed-Point Theorem
  • B = (B, ||.||) Banach space
  • T: B_1 → B_2 is L-Lipschitz (L > 0) if for any U, V, ||T U - T V|| ≤ L ||U - V||
  • T is a contraction if B_1 = B_2 and L < 1; L is then a contraction coefficient of T
  • Theorem [Banach]: Let T: B → B be a γ-contraction. Then T has a unique fixed point V and, for every V_0 ∈ B, the iteration V_{k+1} = T V_k satisfies V_k → V and ||V_k - V|| = O(γ^k)
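  The theorem also suggests an algorithm: start anywhere and iterate the contraction. A toy illustration with a made-up scalar 0.5-contraction:

```python
# Fixed-point iteration for a toy contraction T(v) = 0.5*v + 1
# (Lipschitz constant 0.5 < 1, so the unique fixed point is v = 2).
T = lambda v: 0.5 * v + 1.0

v = 10.0                      # arbitrary starting point V_0
for k in range(30):
    v = T(v)                  # V_{k+1} = T V_k
print(v)                      # ~2.0; the error shrinks like 0.5**k
```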

  24. An Algebra for Contractions
  • Prop: If T_1: B_1 → B_2 is L_1-Lipschitz and T_2: B_2 → B_3 is L_2-Lipschitz, then T_2 T_1 is L_1 L_2-Lipschitz.
  • Def: If T is 1-Lipschitz, T is called a non-expansion.
  • Prop: M: B(X × A) → B(X), (M Q)(x) = max_a Q(x,a) is a non-expansion.
  • Prop: Mul_c: B → B, Mul_c V = c V is |c|-Lipschitz.
  • Prop: Add_r: B → B, Add_r V = r + V is a non-expansion.
  • Prop: K: B(X) → B(X), (K V)(x) = Σ_y K(x,y) V(y) is a non-expansion if K(x,y) ≥ 0 and Σ_y K(x,y) = 1.

  25. Policy Evaluations are Contractions
  • Def: ||V||_∞ = max_x |V(x)|, the supremum norm; written ||.|| below.
  • Theorem: Let T_π be the policy evaluation operator of some policy π. Then T_π is a γ-contraction.
  • Corollary: V^π is the unique fixed point of T_π; V_{k+1} = T_π V_k → V^π for every V_0 ∈ B(X), and ||V_k - V^π|| = O(γ^k).

  26. The Bellman Optimality Operator
  • Let T: B(X) → B(X) be defined by (T V)(x) = max_a Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }
  • Def: π is greedy w.r.t. V if T_π V = T V.
  • Prop: T is a γ-contraction.
  • Theorem (Bellman optimality equation): T V* = V*.
  • Proof: Let V be the fixed point of T. Since T_π ≤ T for every π, V* ≤ V. Now let π be greedy w.r.t. V; then T_π V = T V = V, hence V^π = V, so V ≤ V*. Therefore V = V*.

  27. Value Iteration
  • Theorem: For any V_0 ∈ B(X), with V_{k+1} = T V_k, V_k → V*, and in particular ||V_k - V*|| = O(γ^k).
  • What happens when we stop “early”?
  • Theorem: Let π be greedy w.r.t. V. Then ||V^π - V*|| ≤ 2 ||T V - V|| / (1 - γ).
  • Proof: ||V^π - V*|| ≤ ||V^π - V|| + ||V - V*|| …
  • Corollary: In a finite MDP the number of policies is finite, so we can stop when ||V_k - T V_k|| ≤ Δ(1 - γ)/2, where Δ = min{ ||V* - V^π|| : V^π ≠ V* } ⇒ pseudo-polynomial complexity.
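  A minimal sketch of value iteration on a small random MDP, stopping when the Bellman residual ||V_k - T V_k||_∞ falls below a tolerance (in the spirit of the early-stopping bound above); the MDP and the tolerance are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(A, S))      # P[a, x, :] = p(. | x, a)
R = rng.normal(size=(A, S, S))                  # R[a, x, y] = r(x, a, y)
r_xa = (P * R).sum(axis=2)                      # expected reward, shape (A, S)

def bellman(V):
    # (T V)(x) = max_a sum_y p(y|x,a) [ r(x,a,y) + gamma V(y) ]
    return (r_xa + gamma * P @ V).max(axis=0)

V, tol = np.zeros(S), 1e-8
while True:
    TV = bellman(V)
    if np.max(np.abs(TV - V)) <= tol:           # stop on a small Bellman residual
        break
    V = TV

greedy = (r_xa + gamma * P @ V).argmax(axis=0)  # a policy greedy w.r.t. V
print(V, greedy)
```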

  28. Policy Improvement [Howard ’60]
  • Def: For U, V ∈ B(X), V ≥ U if V(x) ≥ U(x) holds for all x ∈ X.
  • Def: For U, V ∈ B(X), V > U if V ≥ U and there exists x ∈ X such that V(x) > U(x).
  • Theorem (Policy Improvement): Let π' be greedy w.r.t. V^π. Then V^{π'} ≥ V^π. If T V^π > V^π, then V^{π'} > V^π.

  29. Policy Iteration
  PolicyIteration(π):
    V ← V^π
    do  {improvement}
      V' ← V
      let π be such that T_π V = T V   (greedy policy)
      V ← V^π
    while (V > V')
    return π
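  The loop above, written out as a short NumPy sketch with exact evaluation (a linear solve) and greedy improvement; the random MDP is illustrative, and the loop is stopped when the greedy policy no longer changes.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(A, S))          # p(. | x, a)
R = rng.normal(size=(A, S, S))                      # r(x, a, y)
r_xa = (P * R).sum(axis=2)                          # expected rewards, shape (A, S)

def evaluate(pi):
    # V^pi = (I - gamma P_pi)^{-1} r_pi  (exact policy evaluation)
    P_pi = P[pi, np.arange(S), :]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_xa[pi, np.arange(S)])

pi = np.zeros(S, dtype=int)                         # arbitrary initial policy
while True:
    V = evaluate(pi)
    new_pi = (r_xa + gamma * P @ V).argmax(axis=0)  # greedy improvement: T_pi' V = T V
    if np.array_equal(new_pi, pi):                  # no further improvement -> optimal
        break
    pi = new_pi

print(pi, V)
```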

  30. Policy Iteration Theorem • Theorem: In a finite, discounted MDP policy iteration stops after a finite number of steps and returns an optimal policy. • Proof: Follows from the Policy Improvement Theorem.

  31. Linear Programming
  • V ≥ T V ⇒ V ≥ V* = T V*.
  • Hence V* is the smallest V that satisfies V ≥ T V.
  • V ≥ T V ⟺ (*) V(x) ≥ Σ_y p(y|x,a) { r(x,a,y) + γ V(y) } for all x, a
  • LinProg(V): Σ_x V(x) → min subject to V satisfying (*).
  • Theorem: LinProg returns the optimal value function V*.
  • Corollary: pseudo-polynomial complexity.
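  A sketch of this LP using scipy.optimize.linprog on an illustrative random MDP: minimize Σ_x V(x) subject to one constraint per state-action pair.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(A, S))   # p(. | x, a)
R = rng.normal(size=(A, S, S))               # r(x, a, y)
r_xa = (P * R).sum(axis=2)                   # expected reward, shape (A, S)

# One constraint per (x, a):  sum_y p(y|x,a) (r + gamma V(y)) - V(x) <= 0
A_ub = np.zeros((S * A, S))
b_ub = np.zeros(S * A)
row = 0
for a in range(A):
    for x in range(S):
        A_ub[row] = gamma * P[a, x]
        A_ub[row, x] -= 1.0
        b_ub[row] = -r_xa[a, x]
        row += 1

# Minimize sum_x V(x) over unbounded V subject to the constraints above
res = linprog(c=np.ones(S), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
print(res.x)                                 # the optimal value function V*
```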

  32. Variations of a Theme

  33. Approximate Value Iteration
  • AVI: V_{k+1} = T V_k + ε_k
  • AVI Theorem: Let ε = max_k ||ε_k||. Then limsup_{k→∞} ||V_k - V*|| ≤ ε / (1 - γ).
  • Proof: Let a_k = ||V_k - V*||. Then a_{k+1} = ||V_{k+1} - V*|| = ||T V_k - T V* + ε_k|| ≤ γ ||V_k - V*|| + ε = γ a_k + ε. Hence a_k is bounded. Taking limsup of both sides: a ≤ γ a + ε; reorder. ∎ (e.g., [BT96])

  34. Fitted Value Iteration – Non-expansion Operators
  • FVI: Let A be a non-expansion and iterate V_{k+1} = A T V_k. Where does this converge to?
  • Theorem: Let U, V be such that A T U = U and T V = V. Then ||V - U|| ≤ ||A V - V|| / (1 - γ).
  • Proof: Let U' be the fixed point of T A. Then ||U' - V|| ≤ γ ||A V - V|| / (1 - γ). Since A U' = A T (A U'), U = A U'. Hence ||U - V|| = ||A U' - V|| ≤ ||A U' - A V|| + ||A V - V|| … [Gordon ’95]
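  A sketch of FVI where A averages values within the cells of a fixed, made-up state partition; such an averaging operator is a non-expansion in the supremum norm, so the iteration converges to the fixed point U of A T.

```python
import numpy as np

rng = np.random.default_rng(0)
S, nA, gamma = 6, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(nA, S))      # P[a, x, :] = p(. | x, a)
R = rng.normal(size=(nA, S, S))                  # r(x, a, y)
r_xa = (P * R).sum(axis=2)

def T(V):                                        # Bellman optimality operator
    return (r_xa + gamma * P @ V).max(axis=0)

# Aggregation operator: average V uniformly over the cell containing x
cells = [np.array([0, 1, 2]), np.array([3, 4, 5])]   # a made-up partition of X
Agg = np.zeros((S, S))
for cell in cells:
    Agg[np.ix_(cell, cell)] = 1.0 / len(cell)    # mu(.; S(x)) uniform on S(x)

V = np.zeros(S)
for _ in range(200):
    V = Agg @ T(V)                               # FVI: V_{k+1} = A T V_k
print(V)                                         # approximates the fixed point U of A T
```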

  35. Application to Aggregation
  • Let Π be a partition of X and S(x) the unique cell that x belongs to.
  • Let A: B(X) → B(X) be (A V)(x) = Σ_z μ(z; S(x)) V(z), where μ(·; S(x)) is a distribution over S(x).
  • p'(C|B,a) = Σ_{x∈B} μ(x;B) Σ_{y∈C} p(y|x,a),  r'(B,a,C) = Σ_{x∈B} μ(x;B) Σ_{y∈C} p(y|x,a) r(x,a,y).
  • Theorem: Take the aggregated MDP (Π, A, p', r'), let V' be its optimal value function, and set V'_E(x) = V'(S(x)). Then ||V'_E - V*|| ≤ ||A V* - V*|| / (1 - γ).

  36. Action-Value Functions
  • L: B(X) → B(X × A), (L V)(x,a) = Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }. “One-step lookahead.”
  • Note: π is greedy w.r.t. V if (L V)(x, π(x)) = max_a (L V)(x,a).
  • Def: Q* = L V*.
  • Def: Let Max: B(X × A) → B(X), (Max Q)(x) = max_a Q(x,a).
  • Note: Max L = T.
  • Corollary: Q* = L Max Q*.
  • Proof: Q* = L V* = L T V* = L Max L V* = L Max Q*.
  • L Max is a γ-contraction on B(X × A) ⇒ value iteration, policy iteration, … carry over to action-value functions.
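  The same fixed-point machinery applies directly to action-value functions: iterate Q ← L Max Q. A minimal sketch on an illustrative random MDP:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(A, S))        # p(. | x, a)
R = rng.normal(size=(A, S, S))                    # r(x, a, y)
r_xa = (P * R).sum(axis=2)                        # expected reward, shape (A, S)

Q = np.zeros((A, S))                              # Q(x, a), stored as Q[a, x]
for _ in range(500):
    V = Q.max(axis=0)                             # (Max Q)(x)
    Q = r_xa + gamma * P @ V                      # (L Max Q)(x, a)

greedy = Q.argmax(axis=0)                         # greedy policy: argmax_a Q*(x, a)
print(Q.max(axis=0), greedy)
```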

  37. Changing Granularity
  • Asynchronous Value Iteration:
    At every time step, update only a few states
    AsyncVI Theorem: If all states are updated infinitely often, the algorithm converges to V*.
    How to use this?
  • Prioritized Sweeping
    IPS [McMahan & Gordon ’05]:
      Instead of updating a state, put it on a priority queue
      When a state is popped from the queue, update it
      Put its predecessors on the queue
    Theorem: Equivalent to Dijkstra’s algorithm on shortest path problems, provided that rewards are non-positive.
  • LRTA* [Korf ’90] ~ RTDP [Barto, Bradtke & Singh ’95]
    Focussing on the parts of the state space that matter
    Constraints: the same problem is solved from several initial positions; decisions have to be fast
    Idea: update values along the visited paths
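  A sketch of asynchronous value iteration: single-state backups in an arbitrary (here random) order; as long as every state keeps being selected, the iterates converge to V*. Prioritized sweeping would replace the random choice by a priority queue keyed by the magnitude of the last change. The MDP is again illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(A, S))       # p(. | x, a)
R = rng.normal(size=(A, S, S))                   # r(x, a, y)
r_xa = (P * R).sum(axis=2)

V = np.zeros(S)
for t in range(20000):
    x = rng.integers(S)                          # pick one state (every state i.o.)
    V[x] = (r_xa[:, x] + gamma * P[:, x, :] @ V).max()   # single-state backup

print(V)
```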

  38. Changing Granularity
  • Generalized Policy Iteration:
    Partial evaluation and partial improvement of policies
    Multi-step lookahead improvement
  • AsyncPI Theorem: If both evaluation and improvement happen at every state infinitely often, then the process converges to an optimal policy. [Williams & Baird ’93]

  39. Variations of a Theme [SzeLi99]
  • Game against nature [Heger ’94]: inf_w Σ_t γ^t R_t(w) with X_0 = x
  • Risk-sensitive criterion: log E[ exp( Σ_t γ^t R_t ) | X_0 = x ]
  • Stochastic shortest path
  • Average reward
  • Markov games:
    Simultaneous action choices (rock-paper-scissors)
    Sequential action choices
    Zero-sum (or not)

  40. References
  • [Howard ’60] R. A. Howard: Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA, 1960.
  • [Gordon ’95] G. J. Gordon: Stable function approximation in dynamic programming. ICML, pp. 261–268, 1995.
  • [Watkins ’90] C. J. C. H. Watkins: Learning from Delayed Rewards. PhD Thesis, 1990.
  • [McMahan & Gordon ’05] H. B. McMahan and G. J. Gordon: Fast Exact Planning in Markov Decision Processes. ICAPS, 2005.
  • [Korf ’90] R. Korf: Real-Time Heuristic Search. Artificial Intelligence 42, 189–211, 1990.
  • [Barto, Bradtke & Singh ’95] A. G. Barto, S. J. Bradtke and S. Singh: Learning to act using real-time dynamic programming. Artificial Intelligence 72, 81–138, 1995.
  • [Williams & Baird ’93] R. J. Williams and L. C. Baird: Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions. Northeastern University Technical Report NU-CCS-93-14, November 1993.
  • [SzeLi99] Cs. Szepesvári and M. L. Littman: A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms. Neural Computation 11, 2017–2059, 1999.
  • [Heger ’94] M. Heger: Consideration of risk in reinforcement learning. ICML, pp. 105–111, 1994.
