
Presentation Transcript


  1. Reinforcement Learning (1) Lirong Xia Tue, March 18, 2014

  2. Reminder • Midterm • mean 82/99 • Problem 5 (b) 2 is now worth 8 points • your grade is on LMS • Project 2 is due and Project 3 goes out this Friday

  3. Last time • Markov decision processes • Computing the optimal policy • value iteration • policy iteration

  4. Grid World • The agent lives in a grid • Walls block the agent’s path • The agent’s actions do not always go as planned: • 80% of the time, the action North takes the agent North (if there is no wall there) • 10% of the time, North takes the agent West; 10% of the time, East • If there is a wall in the direction the agent would have moved, the agent stays put for that turn • Small “living” reward each step • Big rewards come at the end • Goal: maximize the sum of rewards
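
Stated as probabilities (a direct restatement of the 80/10/10 rule above, ignoring walls):

```latex
p(\text{move North} \mid s, \text{North}) = 0.8, \qquad
p(\text{move West} \mid s, \text{North}) = 0.1, \qquad
p(\text{move East} \mid s, \text{North}) = 0.1
```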

  5. Markov Decision Processes • An MDP is defined by: • A set of states s ∈ S • A set of actions a ∈ A • A transition function T(s,a,s’) • Probability that a from s leads to s’ • i.e., p(s’|s,a) • sometimes called the model • A reward function R(s, a, s’) • Sometimes just R(s) or R(s’) • A start state (or distribution) • Maybe a terminal state • MDPs are a family of nondeterministic search problems • Reinforcement learning (next class): MDPs where we don’t know the transition or reward functions
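
Collecting the ingredients above into symbols (just a restatement of the slide’s definition):

```latex
\text{MDP} = (S,\, A,\, T,\, R,\, \gamma,\, s_0), \qquad
T(s,a,s') = p(s' \mid s, a), \qquad
R : S \times A \times S \to \mathbb{R}
```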

  6. Recap: Defining MDPs • Markov decision processes: • States S • Start state s0 • Actions A • Transition p(s’|s,a) (or T(s,a,s’)) • Reward R(s,a,s’) (and discount γ) • MDP quantities so far: • Policy = Choice of action for each (MAX) state • Utility (or return) = sum of discounted rewards

  7. Optimal Utilities • Fundamental operation: compute the values (optimal expectimax utilities) of states s • Define the value of a state s: • V*(s) = expected utility starting in s and acting optimally • Define the value of a Q-state (s,a): • Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally • Define the optimal policy: • π*(s) = optimal action from state s
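
In symbols (a standard formulation consistent with the definitions above, with rewards r_t and discount γ):

```latex
\begin{align*}
V^*(s)   &= \mathbb{E}\Big[\sum_{t\ge 0}\gamma^{t} r_t \;\Big|\; s_0 = s,\ \text{acting optimally}\Big] \\
Q^*(s,a) &= \mathbb{E}\Big[\sum_{t\ge 0}\gamma^{t} r_t \;\Big|\; s_0 = s,\ a_0 = a,\ \text{then acting optimally}\Big] \\
\pi^*(s) &= \arg\max_a Q^*(s,a)
\end{align*}
```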

  8. The Bellman Equations • One-step lookahead relationship amongst optimal utility values:
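
In the T, R, γ notation of these slides, the standard Bellman optimality equations read:

```latex
\begin{align*}
V^*(s)   &= \max_a Q^*(s,a) \\
Q^*(s,a) &= \sum_{s'} T(s,a,s')\,\big[R(s,a,s') + \gamma V^*(s')\big]
\end{align*}
```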

  9. Solving MDPs • We want to find the optimal policy • Proposal 1: modified expectimax search, starting from each state s:

  10. Value Iteration • Idea: • Start with V1(s) = 0 • Given Vi, calculate the values for all states for depth i+1: • Repeat until convergence • Use Vi as the evaluation function when computing Vi+1
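
The update referenced above is Vi+1(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ Vi(s') ]. A minimal Python sketch of the loop follows; the dictionary-based interface (states, actions, T, R, gamma, tol) is an illustrative assumption, not code from the lecture:

```python
def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    """Minimal value-iteration sketch.

    states      : iterable of states
    actions(s)  : available actions in state s (empty for terminal states)
    T(s, a)     : dict mapping next state s2 -> probability
    R(s, a, s2) : reward for the transition
    (Hypothetical interfaces, assumed for illustration.)
    """
    V = {s: 0.0 for s in states}              # start with V_0(s) = 0
    while True:
        V_new = {}
        for s in states:
            acts = list(actions(s))
            if not acts:                      # terminal state: no further value
                V_new[s] = 0.0
                continue
            # V_{i+1}(s) = max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V_i(s') ]
            V_new[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2])
                    for s2, p in T(s, a).items())
                for a in acts
            )
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new                      # converged
        V = V_new
```

The convergence test here is a simple max-difference threshold; the slide only says "repeat until convergence", so any reasonable stopping rule could be substituted.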

  11. Example: Value Iteration • Information propagates outward from terminal states and eventually all states have correct value estimates

  12. Policy Iteration • Alternative approach: • Step 1: policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) • Step 2: policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values • Repeat steps until policy converges
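
In equations, step 1 iterates the fixed-policy Bellman update until it converges, and step 2 is a one-step lookahead with those converged values (standard forms, in the slides’ notation):

```latex
\begin{align*}
\text{Evaluation: }  & V^{\pi_k}(s) \leftarrow \sum_{s'} T\big(s,\pi_k(s),s'\big)\,\big[R(s,\pi_k(s),s') + \gamma V^{\pi_k}(s')\big] \\
\text{Improvement: } & \pi_{k+1}(s) = \arg\max_a \sum_{s'} T(s,a,s')\,\big[R(s,a,s') + \gamma V^{\pi_k}(s')\big]
\end{align*}
```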

  13. Today: Reinforcement learning • Still have an MDP: • A set of states S • A set of actions (per state) A • A model T(s,a,s’) • A reward function R(s,a,s’) • Still looking for an optimal policy π*(s) • New twist: we don’t know T and/or R, but we can observe the rewards we receive • Learn by doing • can have multiple episodes (trials)

  14. Example: animal learning • Studied experimentally for more than 60 years in psychology • Rewards: food, pain, hunger, drugs, etc. • Example: foraging • Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies

  15. What can you do with RL • Stanford autonomous helicopter http://heli.stanford.edu/

  16. Reinforcement learning methods • Model-based learning • learn the model of the MDP (transition probabilities and rewards) • compute the optimal policy as if the learned model were correct • Model-free learning • learn the optimal policy without explicitly learning the transition probabilities • Q-learning: learn the Q-values Q(s,a) directly

  17. Model-Based Learning • Idea: • Learn the model empirically in episodes • Solve for values as if the learned model were correct • Simple empirical model learning • Count outcomes for each s,a • Normalize to give estimate of T(s,a,s’) • Discover R(s,a,s’) when we experience (s,a,s’) • Solving the MDP with the learned model • Iterative policy evaluation, for example
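
A minimal Python sketch of this count-and-normalize procedure; the (s, a, s2, r) tuple format and function names are illustrative assumptions, not code from the lecture:

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T(s,a,s') and R(s,a,s') from observed (s, a, s2, r) tuples."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s2: count}
    rewards = {}                                     # (s, a, s2) -> observed reward
    for s, a, s2, r in transitions:
        counts[(s, a)][s2] += 1                      # count outcomes for each s, a
        rewards[(s, a, s2)] = r                      # reward discovered on experiencing (s,a,s')
    T_hat = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T_hat[(s, a)] = {s2: n / total for s2, n in outcomes.items()}  # normalize to get T estimate
    return T_hat, rewards
```

The learned T_hat and rewards can then be plugged into value iteration or iterative policy evaluation as if they were the true model.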

  18. Example: Model-Based Learning • Episodes: • Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done) • Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done) • γ = 1 • T(<3,3>, right, <4,3>) = 1/3 • T(<2,3>, right, <3,3>) = 2/2

  19. Model-based vs. Model-Free • Want to compute an expectation weighted by probability p(x) (e.g. expected utility): • Model-based: estimate p(x) from samples, compute expectation • Model-free: estimate expectation directly from samples • Why does this work? Because samples appear with the right frequencies!
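
Written out, the two estimates from samples x_1, …, x_N are:

```latex
\text{Model-based: } \hat{p}(x) = \frac{\mathrm{count}(x)}{N}, \quad
\mathbb{E}[f(x)] \approx \sum_x \hat{p}(x)\, f(x)
\qquad
\text{Model-free: } \mathbb{E}[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)
```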

  20. Example • Flip a biased coin • if head, you get $10 • if tail, you get $1 • you don’t know the probability of head/tail • What is your expected gain? • 8 episodes • h, t, t, t, h, t, t, h • Model-based: p(h)=3/8, so E(gain) = 10*3/8+1*5/8=35/8 • Model-free: (10+1+1+1+10+1+1+10)/8=35/8

  21. Sample-Based Policy Evaluation? • Approximate the expectation with samples (drawn from an unknown T!) • Almost! But we cannot rewind time to get samples from s in the same episode
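
The sample-based estimate being considered here, for samples s'_1, …, s'_n obtained by actually taking action π(s) in s (standard form):

```latex
V^{\pi}_{i+1}(s) \;\approx\; \frac{1}{n}\sum_{k=1}^{n} \big[\, R(s,\pi(s),s'_k) + \gamma\, V^{\pi}_i(s'_k) \,\big]
```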

  22. Temporal-Difference Learning • Big idea: learn from every experience! • Update V(s) each time we experience (s,a,s’,R) • Likely s’ will contribute updates more often • Temporal difference learning • Policy still fixed
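
The update referred to above, in the standard TD(0) form with learning rate α:

```latex
\text{sample} = R(s,\pi(s),s') + \gamma\, V^{\pi}(s'), \qquad
V^{\pi}(s) \leftarrow (1-\alpha)\, V^{\pi}(s) + \alpha \cdot \text{sample}
```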

  23. Exponential Moving Average • Exponential moving average • Makes recent samples more important • Forgets about the past (distant past values were wrong anyway) • Easy to compute from the running average • Decreasing learning rate can give converging averages
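
The running-average form referred to above; unrolling it shows that sample x_{n−k} receives weight proportional to (1−α)^k, so recent samples dominate and the distant past is forgotten:

```latex
\bar{x}_n = (1-\alpha)\,\bar{x}_{n-1} + \alpha\, x_n
```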

  24. Problems with TD Value Learning • TD value learning is a model-free way to do policy evaluation • However, if we want to turn values into a (new) policy, we’re sunk: • Idea: learn Q-values directly • Makes action selection model-free too!
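
The missing step: extracting a greedy policy from V alone requires a one-step lookahead through the (unknown) model, whereas Q-values give the action directly:

```latex
\pi(s) = \arg\max_a \sum_{s'} T(s,a,s')\,\big[R(s,a,s') + \gamma V(s')\big]
\quad\text{(needs $T$ and $R$)}
\qquad\text{vs.}\qquad
\pi(s) = \arg\max_a Q(s,a)
\quad\text{(model-free)}
```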

  25. Active Learning • Full reinforcement learning • You don’t know the transitions T(s,a,s’) • You don’t know the rewards R(s,a,s’) • You can choose any actions you like • Goal: learn the optimal policy • In this case: • Learner makes choices! • Fundamental tradeoff: exploration vs. exploitation • exploration: try new actions • exploitation: focus on optimal actions based on the current estimation • This is NOT offline planning! You actually take actions in the world and see what happens…

  26. Detour: Q-Value Iteration • Value iteration: compute successive approximations of the optimal values • Start with V0*(s)=0 • Given Vi*, calculate the values for all states for depth i+1: • But Q-values are more useful • Start with Q0*(s,a)=0 • Given Qi*, calculate the Q-values for all Q-states for depth i+1:
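
The two update rules referred to above, in standard form:

```latex
\begin{align*}
V^*_{i+1}(s)   &= \max_a \sum_{s'} T(s,a,s')\,\big[R(s,a,s') + \gamma V^*_i(s')\big] \\
Q^*_{i+1}(s,a) &= \sum_{s'} T(s,a,s')\,\Big[R(s,a,s') + \gamma \max_{a'} Q^*_i(s',a')\Big]
\end{align*}
```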

  27. Q-Learning • Q-Learning: sample-based Q-value iteration • Learn Q*(s,a) values • Receive a sample (s,a,s’,R) • Consider your old estimate: Q(s,a) • Consider your new sample estimate: • Incorporate the new estimate into a running average
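
In symbols the running-average update is Q(s,a) ← (1−α) Q(s,a) + α [ R + γ max_{a'} Q(s',a') ]. A minimal Python sketch, with an illustrative table representation (not code from the lecture):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, s2, r, actions_in_s2, alpha, gamma):
    """One Q-learning step for an observed sample (s, a, s2, r)."""
    if actions_in_s2:                          # non-terminal next state
        # new sample estimate: r + gamma * max_a' Q(s', a')
        sample = r + gamma * max(Q[(s2, a2)] for a2 in actions_in_s2)
    else:                                      # terminal next state: no future value
        sample = r
    # incorporate into the running average: Q(s,a) <- (1 - alpha) Q(s,a) + alpha * sample
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

# usage sketch:
#   Q = defaultdict(float)   # Q-values default to 0
#   q_learning_update(Q, s, a, s2, r, ['north', 'east'], alpha=0.5, gamma=1.0)
```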

  28. Q-Learning Properties • Amazing result: Q-learning converges to optimal policy • If you explore enough • If you make the learning rate small enough • …but not decrease it too quickly! • Basically doesn’t matter how you select actions (!) • Neat property: off-policy learning • Learn optimal policy without following it (some caveats)
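
A standard way to make "small enough, but not decreased too quickly" precise is the usual stochastic-approximation condition on the learning rates α_t (not stated explicitly on the slide):

```latex
\sum_{t}\alpha_t = \infty, \qquad \sum_{t}\alpha_t^2 < \infty
\qquad\text{(satisfied, e.g., by } \alpha_t = 1/t\text{)}
```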

  29. Q-Learning • Q-learning produces tables of Q-values:

  30. Exploration / Exploitation • Several schemes for forcing exploration • Simplest: random actions (ε greedy) • Every time step, flip a coin • With probability ε, act randomly • With probability 1-ε, act according to current policy • Problems with random actions? • You do explore the space, but keep thrashing around once learning is done • One solution: lower ε over time • Another solution: exploration functions
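
A minimal sketch of the ε-greedy rule described above; representing Q as a dictionary keyed by (state, action) is an illustrative choice, not part of the slides:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon act randomly; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore: random action
    return max(actions, key=lambda a: Q[(s, a)])      # exploit: current best estimate
```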

  31. Exploration Functions • When to explore • Random actions: explore a fixed amount • Better idea: explore areas whose badness is not (yet) established • Exploration function • Takes a value estimate and a count, and returns an optimistic utility, e.g. (exact form not important) sample =
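
The exact exploration function is left unspecified on the slide; one common choice, taken here as an assumption, adds a visit-count bonus k/n and plugs the optimistic value into the Q-learning sample:

```latex
f(u, n) = u + \frac{k}{n}, \qquad
\text{sample} = R(s,a,s') + \gamma \max_{a'} f\big(Q(s',a'),\, N(s',a')\big)
```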
