
Passive Reinforcement Learning






Presentation Transcript


  1. Passive Reinforcement Learning Ruti Glick, Bar-Ilan University

  2. Passive Reinforcement Learning • We will assume full observability • The agent has a fixed policy π and always executes π(s) • Goal: to learn how good the policy is • Similar to policy evaluation • But the agent does not have all the knowledge: • it does not know the transition model T(s, a, s’) • it does not know the reward function R(s)

  3. Example • [Figure: the familiar 4x3 grid world, with the start state at (1,1) and terminal states labelled +1 and -1, annotated with the fixed policy.] • Our familiar 4x3 world • The policy is known • The agent executes trials using the policy • Each trial starts at (1,1) and passes through a sequence of states until a terminal state is reached

  4. Example (cont.) • Typical trials, written as the visited states annotated with the reward received in each, may be: • (1,1)-.04 → (1,2)-.04 → (1,3)-.04 → (1,2)-.04 → (1,3)-.04 → (2,3)-.04 → (3,3)-.04 → (4,3)+1 • (1,1)-.04 → (1,2)-.04 → (1,3)-.04 → (2,3)-.04 → (3,3)-.04 → (3,2)-.04 → (3,3)-.04 → (4,3)+1 • (1,1)-.04 → (2,1)-.04 → (3,1)-.04 → (3,2)-.04 → (4,2)-1
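To make the trial-generation process concrete, here is a minimal Python sketch of a simulator for this 4x3 world. It assumes the standard dynamics for this example (0.8 probability of the intended move, 0.1 for each perpendicular slip, walls and the obstacle at (2,2) block movement, step reward -0.04, terminals +1 at (4,3) and -1 at (4,2)). The fixed policy below is hypothetical, since the policy arrows in the slide figure are not recoverable from this transcript, so the generated trials have the same form as, but need not match, the ones above.

import random

STEP_REWARD = -0.04                              # reward in every non-terminal state
TERMINAL_REWARD = {(4, 3): +1.0, (4, 2): -1.0}   # terminal states and their rewards
OBSTACLE = (2, 2)
MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
PERP = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
        'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}

# Hypothetical fixed policy pi(s); the slides only state that the policy is known.
POLICY = {(1, 1): 'Up', (1, 2): 'Up', (1, 3): 'Right', (2, 1): 'Left',
          (2, 3): 'Right', (3, 1): 'Left', (3, 2): 'Up', (3, 3): 'Right',
          (4, 1): 'Left'}

def step(state, action):
    """Sample the next state: intended direction with probability 0.8, each
    perpendicular direction with 0.1; bumping into a wall or the obstacle
    leaves the state unchanged."""
    roll = random.random()
    direction = action if roll < 0.8 else PERP[action][0 if roll < 0.9 else 1]
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt == OBSTACLE or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return state
    return nxt

def run_trial(start=(1, 1)):
    """Execute the fixed policy from the start state and return the observed
    sequence of (state, reward) pairs, in the same form as the trials above."""
    state, trial = start, []
    while True:
        reward = TERMINAL_REWARD.get(state, STEP_REWARD)
        trial.append((state, reward))
        if state in TERMINAL_REWARD:
            return trial
        state = step(state, POLICY[state])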

  5. The goal • Estimate the utility Uπ(s): • the expected sum of discounted rewards obtained by following policy π from state s • Learning may also include building a model of the environment
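Written out in the notation used on the later slides (discount factor γ), the quantity to be estimated is the expected discounted sum of rewards obtained by starting in s and executing π:

  Uπ(s) = E[ Σt≥0 γ^t R(st) ],  where s0 = s and each subsequent state is reached by executing π.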

  6. Algorithms • Direct utility estimation (DUE) • Adaptive dynamic programming (ADP) • Temporal difference learning (TD)

  7. Direct utility estimation • Idea: • the utility of a state is the expected total reward from that state onward • each trial supplies one or more samples of this value for every visited state • Reward-to-go (of a state): • the sum of the rewards from that state until a terminal state is reached

  8. Example • (1,1)-.04 → (1,2)-.04 → (1,3)-.04 → (1,2)-.04 → (1,3)-.04 → (2,3)-.04 → (3,3)-.04 → (4,3)+1 • Observed rewards-to-go: • U(1,1) = 0.72 • U(1,2) = 0.76, 0.84 (two visits) • U(1,3) = 0.80, 0.88 (two visits) • U(2,3) = 0.92 • U(3,3) = 0.96

  9. Algorithm • Run through each sequence of states generated by the policy • Compute the observed “reward to go” for every visited state • Keep a running average of the utility of each state in a table
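As a concrete illustration, here is a minimal Python sketch of direct utility estimation under the conventions of these slides: no discounting in the reward-to-go sums, and trials represented as lists of (state, reward) pairs such as those produced by the run_trial sketch above.

from collections import defaultdict

def direct_utility_estimation(trials):
    """Average the observed rewards-to-go of every visited state."""
    samples = defaultdict(list)                  # state -> observed rewards-to-go
    for trial in trials:
        reward_to_go = 0.0
        for state, reward in reversed(trial):    # accumulate from the end backwards
            reward_to_go += reward
            samples[state].append(reward_to_go)
    return {s: sum(v) / len(v) for s, v in samples.items()}

# The first trial from slide 8 reproduces the values listed there:
trial1 = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
          ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), +1.0)]
print(direct_utility_estimation([trial1]))
# (1,1) -> 0.72, (2,3) -> 0.92, (3,3) -> 0.96; (1,2) and (1,3), visited twice,
# report the average of their two reward-to-go samples.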

  10. Properties • After an infinite number of trials, the sample average converges to the true expected utility • Advantages: • easy to compute • no special actions needed • Disadvantage: • it is really just an instance of supervised learning (expanded on the next slide)

  11. The disadvantage, expanded • Similarity to supervised learning: • each example has an input (the state) and an output (the observed reward-to-go) • reinforcement learning is reduced to inductive learning • What is lacking: • the dependency between neighboring states is missed • utility of s = reward of s + expected utility of its successors • the connection between states is not used for learning • the hypothesis space searched is larger than necessary • As a result the algorithm converges very slowly

  12. Example • Second trial: (1,1)-.04 → (1,2)-.04 → (1,3)-.04 → (2,3)-.04 → (3,3)-.04 → (3,2)-.04 → (3,3)-.04 → (4,3)+1 • (3,2) has not been seen before • (3,3) has been visited before and received a high utility • Yet we learn about (3,2) only at the end of the sequence • The algorithm searches among far too many possibilities

  13. Adaptive dynamic programming • Takes advantage of the connections between states • Learns the transition model and solves the Markov decision process • While running the known policy: • learns T(s, π(s), s’) from the observed transition frequencies • gets R(s) from the observed states • Calculates the utilities of the states: • plugs T(s, π(s), s’) and R(s) into the Bellman equations • solves the resulting linear equations • alternatively, uses a simplified value iteration

  14. Example • In our three trials, the action Right is performed 3 times in (1,3) • In 2 of these cases the resulting state is (2,3) • So T((1,3), Right, (2,3)) is estimated as 2/3

  15. The algorithm
  function PASSIVE_ADP_AGENT(percept) returns an action
    inputs: percept, a percept indicating the current state s’ and reward signal r’
    static: π, a fixed policy
            mdp, an MDP with model T, rewards R, discount γ
            U, a table of utilities, initially empty
            Nsa, a table of frequencies for state-action pairs, initially zero
            Nsas’, a table of frequencies for state-action-state triples, initially zero
            s, a, the previous state and action, initially null
    if s’ is new then U[s’] ← r’; R[s’] ← r’
    if s is not null then
        increment Nsa[s,a] and Nsas’[s,a,s’]
        for each t such that Nsas’[s,a,t] is nonzero do
            T[s,a,t] ← Nsas’[s,a,t] / Nsa[s,a]
    U ← VALUE_DETERMINATION(π, U, mdp)
    if TERMINAL?[s’] then s, a ← null else s, a ← s’, π[s’]
    return a
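Below is a minimal Python rendering of this pseudocode, under a few assumptions not stated on the slides: states are hashable, percepts arrive as (state, reward) pairs, and VALUE_DETERMINATION is replaced by simplified iterative policy evaluation (the "simplified value iteration" option from slide 13) rather than an exact solution of the linear system.

from collections import defaultdict

class PassiveADPAgent:
    def __init__(self, policy, gamma=1.0, terminals=()):
        self.pi = policy                  # fixed policy: state -> action
        self.gamma = gamma
        self.terminals = set(terminals)
        self.U = {}                       # utility estimates
        self.R = {}                       # observed rewards
        self.Nsa = defaultdict(int)       # counts of (s, a)
        self.Nsas = defaultdict(int)      # counts of (s, a, s')
        self.s = self.a = None            # previous state and action

    def T(self, s, a, t):
        """Estimate T(s, a, t) from the observed frequencies."""
        return self.Nsas[(s, a, t)] / self.Nsa[(s, a)] if self.Nsa[(s, a)] else 0.0

    def value_determination(self, iterations=50):
        """Simplified policy evaluation: repeatedly apply the Bellman equation
        for the fixed policy using the learned model."""
        for _ in range(iterations):
            for s in list(self.U):
                if s in self.terminals:
                    self.U[s] = self.R[s]
                    continue
                a = self.pi[s]
                successors = {t for (s2, a2, t) in self.Nsas if s2 == s and a2 == a}
                self.U[s] = self.R[s] + self.gamma * sum(
                    self.T(s, a, t) * self.U.get(t, 0.0) for t in successors)

    def __call__(self, percept):
        s1, r1 = percept                  # current state s' and reward r'
        if s1 not in self.U:              # first visit to s'
            self.U[s1] = r1
            self.R[s1] = r1
        if self.s is not None:            # update the counts, i.e. the model
            self.Nsa[(self.s, self.a)] += 1
            self.Nsas[(self.s, self.a, s1)] += 1
        self.value_determination()
        if s1 in self.terminals:
            self.s = self.a = None
        else:
            self.s, self.a = s1, self.pi[s1]
        return self.a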

  16. Properties • Learning the model can be seen as supervised learning: • input = state-action pair • output = resulting state • Learning the model is easy because the environment is fully observable • The algorithm does as well as possible, given its ability to learn the model • It provides a standard against which to measure other reinforcement learning algorithms • But it scales poorly to large state spaces • in backgammon it would have to solve 10^50 equations in 10^50 unknowns • Disadvantage: a lot of work at each iteration

  17. Performance in 4x3 world

  18. Temporal difference learning • The best of both worlds: • the constraint equations are only approximated • there is no need to solve the equations for all possible states • Method: • run according to the policy π • use the observed transitions to adjust the utilities so that they agree with the constraint equations

  19. Example • As a result of the first trial: • Uπ(1,3) = 0.84 • Uπ(2,3) = 0.92 • If the transition (1,3) → (2,3) always occurred, we would expect U(1,3) = -0.04 + U(2,3) = 0.88 • So the current estimate of 0.84 is a bit low and should be increased

  20. In practice • When a transition occurs from s to s’, apply the update: • Uπ(s) ← Uπ(s) + α (R(s) + γ Uπ(s’) − Uπ(s)) • α is the learning-rate parameter • This is called temporal-difference learning because the update rule uses the difference in utilities between successive states
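A small sketch of this update rule, applied to the estimates from slide 19 with γ = 1 and a hypothetical fixed learning rate α = 0.1 (the slides leave α unspecified at this point):

def td_update(U, s, s_next, reward, alpha=0.1, gamma=1.0):
    """Move U[s] a fraction alpha toward the one-step target R(s) + gamma*U[s']."""
    U[s] = U[s] + alpha * (reward + gamma * U[s_next] - U[s])
    return U[s]

U = {(1, 3): 0.84, (2, 3): 0.92}          # estimates after the first trial (slide 19)
td_update(U, (1, 3), (2, 3), -0.04)       # nudges U[(1,3)] from 0.84 toward 0.88 (-> 0.844)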

  21. The algorithm
  function PASSIVE_TD_AGENT(percept) returns an action
    inputs: percept, a percept indicating the current state s’ and reward signal r’
    static: π, a fixed policy
            U, a table of utilities, initially empty
            Ns, a table of frequencies for states, initially zero
            s, a, r, the previous state, action and reward, initially null
    if s’ is new then U[s’] ← r’
    if s is not null then
        increment Ns[s]
        U[s] ← U[s] + α(Ns[s]) (r + γ U[s’] − U[s])
    if TERMINAL?[s’] then s, a, r ← null else s, a, r ← s’, π[s’], r’
    return a
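And a minimal Python rendering of this pseudocode, with the learning rate taken to be the decreasing function α(n) = 1/n discussed on slide 23; as in the ADP sketch, percepts are assumed to arrive as (state, reward) pairs.

from collections import defaultdict

class PassiveTDAgent:
    def __init__(self, policy, gamma=1.0, terminals=()):
        self.pi = policy                      # fixed policy: state -> action
        self.gamma = gamma
        self.terminals = set(terminals)
        self.U = {}                           # utility estimates
        self.Ns = defaultdict(int)            # visit counts
        self.s = self.a = self.r = None       # previous state, action and reward

    def alpha(self, n):
        return 1.0 / n                        # decreasing learning rate (slide 23)

    def __call__(self, percept):
        s1, r1 = percept                      # current state s' and reward r'
        if s1 not in self.U:
            self.U[s1] = r1
        if self.s is not None:
            self.Ns[self.s] += 1
            a = self.alpha(self.Ns[self.s])
            self.U[self.s] += a * (self.r + self.gamma * self.U[s1] - self.U[self.s])
        if s1 in self.terminals:
            self.s = self.a = self.r = None
        else:
            self.s, self.a, self.r = s1, self.pi[s1], r1
        return self.a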

  22. Properties • An update involves only the observed successor s’ • it does not take all possible successors into account • this makes each update cheap, so the method is efficient over a large number of transitions • The model is not learned: • the environment supplies the connection between neighboring states in the form of observed transitions • The average value of Uπ(s) converges to the correct value

  23. Quality • The average value of Uπ(s) converges to the correct value • Moreover, if α is a function that decreases as the number of times a state has been visited increases, then U(s) itself converges to the correct value • We require: Σn α(n) = ∞ and Σn α²(n) < ∞ • The function α(n) = 1/n satisfies these conditions

  24. Performance in 4x3 world

  25. TD vs. ADP • TD: • does not learn as fast as ADP • shows higher variability than ADP • is simpler than ADP • needs much less computation per observation than ADP • does not need a model to perform its updates • makes its updates agree with the observed successor (instead of with all successors, as ADP does) • TD can be viewed as a crude, yet efficient, first approximation to ADP

  26. TD vs. ADP: the two update steps side by side
  PASSIVE_ADP_AGENT (full listing on slide 15):
    if s’ is new then U[s’] ← r’; R[s’] ← r’
    if s is not null then
        increment Nsa[s,a] and Nsas’[s,a,s’]
        for each t such that Nsas’[s,a,t] is nonzero do T[s,a,t] ← Nsas’[s,a,t] / Nsa[s,a]
    U ← VALUE_DETERMINATION(π, U, mdp)
    if TERMINAL?[s’] then s, a ← null else s, a ← s’, π[s’]
    return a
  PASSIVE_TD_AGENT (full listing on slide 21):
    if s’ is new then U[s’] ← r’
    if s is not null then
        increment Ns[s]
        U[s] ← U[s] + α(Ns[s]) (r + γ U[s’] − U[s])
    if TERMINAL?[s’] then s, a, r ← null else s, a, r ← s’, π[s’], r’
    return a
