
Reinforcement Learning


Presentation Transcript


  1. Reinforcement Learning. Slides for this part are adapted from those of Dan Klein@UCB.

  2. Main Dimensions
  • Passive vs. Active
    • Passive: Assume the agent is already following a fixed policy, so there is no action choice to be made; you just need to learn the state values (and maybe the action model)
    • Active: Need to learn both the optimal policy and the state values (and maybe the action model)
  • Model-based vs. Model-free
    • Model-based: Have/learn action models (i.e. transition probabilities). E.g. approximate DP
    • Model-free: Skip them and directly learn what action to do when (without necessarily finding out the exact model of the action). E.g. Q-learning

  3. Main Dimensions (contd.): Extent of Backup
  • Full DP
    • Adjust a state's value based on the values of all its neighbors (as predicted by the transition model)
    • Can only be done when a transition model is present
  • Temporal difference
    • Adjust a state's value based only on the actual transitions observed
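To make the contrast concrete, here is a minimal Python sketch of the two backup styles for policy evaluation; the dictionary-based model layout (T, R, policy) and the function names are illustrative assumptions, not from the slides.

```python
GAMMA = 0.9   # discount factor
ALPHA = 0.1   # learning rate for the TD backup

def full_dp_backup(V, s, policy, T, R):
    """Full backup: needs the transition model T[(s, a)] -> {s': prob} and reward R[s];
    averages over all successor states of s."""
    a = policy[s]
    return sum(p * (R[s] + GAMMA * V[s2]) for s2, p in T[(s, a)].items())

def td_backup(V, s, r, s2):
    """TD backup: uses only one observed transition (s, r, s'); no model needed."""
    return V[s] + ALPHA * (r + GAMMA * V[s2] - V[s])
```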

  4. Does self-learning through a simulator. [Infants don't get to "simulate" the world, since they have neither T(.) nor R(.) of their world]

  5. We are basically doing EMPIRICAL policy evaluation! But we know this will be wasteful, since it misses the correlation between the values of neighboring states. Do DP-based policy evaluation instead!

  6. 10/7

  7. Learning/Planning/Acting

  8. World ---- Model (and Simulator)

  9. Relating TD and Monte Carlo
  • Both Monte Carlo and TD learn from samples (traces)
  • Monte Carlo waits until the trace hits a sink state, and then (discount-)adds up all the rewards of the trace
  • TD, on the other hand, considers only the current state s and the next experienced state s'
  • You can think of what TD is doing as "truncating" the experience and summarizing the aggregated reward of the entire trace starting from s' in terms of the current value estimate of s'
  • Why truncate at the very first state s'? How about going from s to s0, s1, s2, …, sk and truncating the remaining trace there (by assuming that its aggregate reward is just the current value of sk)?
    • (sort of like how deep down you go in game trees before applying an evaluation function)
  • In this generalized view, TD corresponds to k=0 and Monte Carlo corresponds to k=infinity
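A minimal Python sketch of this k-step truncation, assuming the trace is recorded as a list of rewards plus current value estimates for the visited states; the function name and data layout are illustrative, not from the slides.

```python
GAMMA = 0.9  # discount factor

def k_step_target(rewards, values, k):
    """Truncated return for a trace s -> s0 -> s1 -> ... :
    sum the first k+1 discounted rewards, then bootstrap with the current
    estimate of state s_k.  rewards[i] is the reward on the transition into s_i,
    values[i] is the current estimate V(s_i).
    k=0 recovers the TD target r + gamma*V(s'); letting k run to the end of the
    trace (where V(sink)=0) recovers the Monte Carlo return."""
    g = sum((GAMMA ** i) * rewards[i] for i in range(k + 1))
    return g + (GAMMA ** (k + 1)) * values[k]

# e.g. a short trace with unit rewards and value estimates for s0, s1, s2
print(k_step_target([1.0, 1.0, 1.0], [5.0, 4.0, 3.0], k=1))  # 1 + 0.9*1 + 0.81*4.0
```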

  10. Generalizing TD to TD(λ)
  • TD(λ) can be thought of as doing 1, 2, …, k step predictions of the value of the state, and taking their weighted average
  • Weighting is done in terms of λ such that
    • λ=0 corresponds to TD
    • λ=1 corresponds to Monte Carlo
  • Note that the last backup doesn't have a (1 - λ) factor
    • Reason: after the T'th state, the remaining infinite # of backups all aggregate to the same value, but each successive one is weighted by a further factor of λ; summing them gives a 1/(1 - λ) factor that cancels out the (1 - λ)
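Written out explicitly (a standard statement of the λ-return, not shown in the transcript itself), the weighted average over the k-step returns G_t^{(k)} for a trace that terminates at time T is:

```latex
G_t^{\lambda} \;=\; (1-\lambda)\sum_{k=1}^{T-t-1} \lambda^{k-1}\, G_t^{(k)}
\;+\; \lambda^{T-t-1}\, G_t
```

The final (Monte Carlo) term carries weight λ^(T-t-1) with no (1 - λ) factor, because the infinite tail of identical backups sums to λ^(T-t-1) (1 - λ)(1 + λ + λ² + …) = λ^(T-t-1).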

  11. 10/12

  12. Full vs. Partial Backups

  13. Dimensions of Reinforcement Learning

  14. 10/14
  • Factored TD and Q-learning
  • Policy search (has to be factored…)

  15. Large State Spaces
  • When a problem has a large state space, we can no longer represent the V or Q functions as explicit tables
  • Even if we had enough memory:
    • Never enough training data!
    • Learning takes too long
  • What to do?? [Slides from Alan Fern]

  16. Function Approximation
  • Never enough training data!
    • Must generalize what is learned from one situation to other "similar" new situations
  • Idea:
    • Instead of using a large table to represent V or Q, use a parameterized function
      • The number of parameters should be small compared to the number of states (generally exponentially fewer parameters)
    • Learn the parameters from experience
    • When we update the parameters based on observations in one state, our V or Q estimate also changes for other similar states
      • I.e. the parameterization facilitates generalization of experience

  17. Linear Function Approximation
  • Define a set of state features f1(s), …, fn(s)
    • The features are used as our representation of states
    • States with similar feature values will be considered to be similar
  • A common approximation is to represent V(s) as a weighted sum of the features (i.e. a linear approximation): V(s) = θ0 + θ1 f1(s) + … + θn fn(s)
  • The approximation accuracy is fundamentally limited by the information provided by the features
  • Can we always define features that allow for a perfect linear approximation?
    • Yes. Assign each state an indicator feature. (I.e. the i'th feature is 1 iff the i'th state is the current state, and θi represents the value of the i'th state)
    • Of course this requires far too many features and gives no generalization.
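A minimal Python sketch of such a linear approximator; the class name and the particular feature functions are illustrative assumptions, not from the slides.

```python
class LinearV:
    """V(s) ~= theta[0] + sum_i theta[i+1] * f_i(s): a weighted sum of state features."""
    def __init__(self, feature_fns):
        self.feature_fns = feature_fns                  # list of functions f_i(s)
        self.theta = [0.0] * (len(feature_fns) + 1)     # weights; theta[0] is the bias

    def features(self, s):
        return [1.0] + [f(s) for f in self.feature_fns]

    def value(self, s):
        return sum(w * x for w, x in zip(self.theta, self.features(s)))

# Features for the grid example on the next slide: s = (x, y), f1(s) = x, f2(s) = y
V = LinearV([lambda s: s[0], lambda s: s[1]])
V.theta = [10.0, -1.0, -1.0]    # theta_0 = 10, theta_1 = theta_2 = -1
print(V.value((3, 2)))          # 10 - 3 - 2 = 5.0
```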

  18. Example
  • Consider a grid problem with no obstacles, deterministic actions U/D/L/R, 49 states (a 7x7 grid with coordinates 0–6), and goal reward 10
  • Features for state s=(x,y): f1(s)=x, f2(s)=y (just 2 features)
  • V(s) = θ0 + θ1 x + θ2 y
  • Is there a good linear approximation?
    • Yes: θ0 = 10, θ1 = -1, θ2 = -1
    • (note the upper right is the origin)
    • V(s) = 10 - x - y subtracts the Manhattan distance to the goal from the goal reward

  19. But What If We Change the Reward…
  • V(s) = θ0 + θ1 x + θ2 y
  • Is there a good linear approximation?
    • No.

  20. But What If We Add a Feature…
  • Include a new feature z = |3-x| + |3-y|
    • z is the distance to the goal location
  • Does this allow a good linear approximation?
    • Yes: V(s) = θ0 + θ1 x + θ2 y + θ3 z, with θ0 = 10, θ1 = θ2 = 0, θ3 = -1
  • Feature Engineering….

  21. Linear Function Approximation
  • Define a set of features f1(s), …, fn(s)
    • The features are used as our representation of states
    • States with similar feature values will be treated similarly
    • More complex functions require more complex features
  • Our goal is to learn good parameter values (i.e. feature weights) that approximate the value function well
    • How can we do this?
    • Use TD-based RL and somehow update the parameters based on each experience.

  22. TD-based RL for Linear Approximators
  1. Start with initial parameter values
  2. Take action according to an explore/exploit policy (should converge to a greedy policy, i.e. GLIE)
  3. Update estimated model
  4. Perform TD update for each parameter
  5. Goto 2
  What is a "TD update" for a parameter?

  23. Aside: Gradient Descent
  • Given a function f(θ1, …, θn) of n real values θ = (θ1, …, θn), suppose we want to minimize f with respect to θ
  • A common approach to doing this is gradient descent
  • The gradient of f at point θ, denoted ∇f(θ), is an n-dimensional vector that points in the direction in which f increases most steeply at point θ
  • Vector calculus tells us that ∇f(θ) is just the vector of partial derivatives (∂f/∂θ1, …, ∂f/∂θn), so we can decrease f by moving in the negative gradient direction: θi ← θi - α ∂f(θ)/∂θi
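A minimal numeric sketch of gradient descent; the quadratic test function, step size, and iteration count are illustrative assumptions.

```python
def gradient_descent(grad_f, theta, alpha=0.1, steps=100):
    """Repeatedly move theta a small step in the negative gradient direction."""
    for _ in range(steps):
        g = grad_f(theta)
        theta = [t - alpha * gi for t, gi in zip(theta, g)]
    return theta

# Example: minimize f(t1, t2) = (t1 - 3)^2 + (t2 + 1)^2, whose gradient is
# (2*(t1 - 3), 2*(t2 + 1)); the minimum is at (3, -1).
print(gradient_descent(lambda th: [2 * (th[0] - 3), 2 * (th[1] + 1)], [0.0, 0.0]))
```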

  24. Aside: Gradient Descent for Squared Error
  • Suppose that we have a sequence of states sj and a target value v(sj) for each state
    • E.g. produced by the TD-based RL loop
  • Our goal is to minimize the sum of squared errors between our estimated function Vθ and each target value, where the error on example j is
    Ej(θ) = ½ (v(sj) - Vθ(sj))²    (v(sj) is the target value for the j'th state; Vθ(sj) is our estimated value for the j'th state)
  • After seeing the j'th state, the gradient descent rule tells us that we can decrease the error by updating the parameters by:
    θi ← θi + α (v(sj) - Vθ(sj)) ∂Vθ(sj)/∂θi    (α is the learning rate)

  25. Aside: continued
  • How the update looks depends on the form of the approximator
  • For a linear approximation function, Vθ(s) = θ0 + θ1 f1(s) + … + θn fn(s), we have ∂Vθ(sj)/∂θi = fi(sj)
  • Thus the update becomes: θi ← θi + α (v(sj) - Vθ(sj)) fi(sj)
  • For linear functions this update is guaranteed to converge to the best approximation for a suitable learning rate schedule
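A minimal Python sketch of this update for the linear case; the fixed learning rate and the example numbers are illustrative assumptions (the convergence guarantee needs a decaying learning rate schedule).

```python
ALPHA = 0.05  # learning rate (fixed here for simplicity)

def linear_value(theta, feats):
    """V_theta(s) = sum_i theta[i] * f_i(s), with feats[0] = 1.0 as the bias feature."""
    return sum(w * x for w, x in zip(theta, feats))

def squared_error_update(theta, feats, target):
    """One gradient step on the squared error (target - V_theta(s))^2:
    since dV_theta(s)/dtheta_i = f_i(s), each weight moves by alpha * error * f_i(s)."""
    error = target - linear_value(theta, feats)
    return [w + ALPHA * error * x for w, x in zip(theta, feats)]

# Example: push the estimate for s = (3, 2), with features [1, x, y], toward target 5.0
theta = squared_error_update([0.0, 0.0, 0.0], [1.0, 3.0, 2.0], 5.0)
print(theta)  # [0.25, 0.75, 0.5]
```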

  26. TD-based RL for Linear Approximators
  1. Start with initial parameter values
  2. Take action according to an explore/exploit policy (should converge to a greedy policy, i.e. GLIE), transitioning from s to s'
  3. Update estimated model
  4. Perform TD update for each parameter
  5. Goto 2
  What should we use for the "target value" v(s)? Use the TD prediction based on the next state s': v(s) = R(s) + γ Vθ(s'). This is the same as the previous TD method, only with approximation.
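Plugging that target into the squared-error update gives the TD update for a linear approximator, sketched below; the feature map and example numbers are illustrative assumptions, and action selection (step 2) is left out since, as the next slide notes, choosing a greedy action with respect to V needs a model.

```python
GAMMA, ALPHA = 0.9, 0.05

def features(s):
    # illustrative features for a grid state s = (x, y): bias, x, y
    x, y = s
    return [1.0, float(x), float(y)]

def v(theta, s):
    return sum(w * f for w, f in zip(theta, features(s)))

def td_update(theta, s, r, s_next):
    """theta_i += alpha * (r + gamma * V(s') - V(s)) * f_i(s):
    the squared-error update with the TD prediction as the target value."""
    td_error = r + GAMMA * v(theta, s_next) - v(theta, s)
    return [w + ALPHA * td_error * f for w, f in zip(theta, features(s))]

theta = td_update([0.0, 0.0, 0.0], s=(3, 2), r=-1.0, s_next=(2, 2))
```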

  27. TD-based RL for Linear Approximators
  1. Start with initial parameter values
  2. Take action according to an explore/exploit policy (should converge to a greedy policy, i.e. GLIE)
  3. Update estimated model
  4. Perform TD update for each parameter
  5. Goto 2
  • Step 2 requires a model to select the greedy action
    • For applications such as Backgammon it is easy to get a simulation-based model
    • For others it is difficult to get a good model
    • But we can do the same thing with model-free Q-learning

  28. Q-learning with Linear Approximators
  • Features are now a function of states and actions: f1(s,a), …, fn(s,a)
  1. Start with initial parameter values
  2. Take action a according to an explore/exploit policy (should converge to a greedy policy, i.e. GLIE), transitioning from s to s'
  3. Perform TD update for each parameter: θi ← θi + α (R(s,a) + γ max_a' Qθ(s',a') - Qθ(s,a)) fi(s,a)
  4. Goto 2
  • For both Q and V, these algorithms converge to the closest linear approximation to the optimal Q or V.
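A minimal Python sketch of that Q-learning update; the state-action feature map, the small discrete action set, and the example numbers are illustrative assumptions, not from the slides.

```python
GAMMA, ALPHA = 0.9, 0.05
ACTIONS = ["U", "D", "L", "R"]

def features(s, a):
    # illustrative state-action features for a grid state s = (x, y):
    # bias, x, y, plus one indicator per action
    x, y = s
    return [1.0, float(x), float(y)] + [1.0 if a == b else 0.0 for b in ACTIONS]

def q(theta, s, a):
    return sum(w * f for w, f in zip(theta, features(s, a)))

def q_update(theta, s, a, r, s_next):
    """theta_i += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)) * f_i(s,a)."""
    target = r + GAMMA * max(q(theta, s_next, b) for b in ACTIONS)
    td_error = target - q(theta, s, a)
    return [w + ALPHA * td_error * f for w, f in zip(theta, features(s, a))]

theta = [0.0] * 7
theta = q_update(theta, s=(3, 2), a="U", r=-1.0, s_next=(3, 1))
```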
