
Reinforcement Learning

Introduction to Reinforcement Learning: Q-Learning and evolutions (Aspect numérique d'Intelligence Artificielle)



Presentation Transcript


  1. Introduction to Reinforcement Learning: Q-Learning and evolutions (Aspect numérique d'Intelligence Artificielle)

  2. [Diagram: Reinforcement Learning as part of Machine Learning (ML), itself within AI]

  3. [Diagram: within ML, Reinforcement Learning alongside Supervised and Unsupervised learning]

  4. [Diagram: ANNs and data mining added under the Supervised/Unsupervised branches of ML]

  5. [Diagram: Reinforcement Learning placed under Active Learning, next to Supervised learning, within ML]

  6. [Diagram: a learning autonomous agent sending Actions to and receiving Perceptions from its Environment]

  7. [Diagram repeated: the agent-environment loop of Perceptions and Actions]

  8. [Diagram: the agent-environment loop with a Delayed Reward signal added]

  9. [Diagram: the agent-environment loop with Delayed Reward producing New behavior]

  10. 2-Armed Bandit Problem

  11. 2-Armed Bandit Problem: 1º. Insert a coin,

  12. 2-Armed Bandit Problem: 1º. Insert a coin, 2º. Pull an arm,

  13. 2-Armed Bandit Problem: 1º. Insert a coin, 2º. Pull an arm, 3º. Receive a random reward: 0 or $.

  14. Exploration vs. Exploitation [Diagram: two slot machines with uncertain payoffs, illustrating the dilemma]
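A minimal sketch of the exploration/exploitation trade-off on the 2-armed bandit. The payout probabilities and the ε-greedy rule below are illustrative assumptions, not something specified on the slides:

```python
import random

# Hypothetical 2-armed bandit: each arm pays 1 with a fixed (unknown) probability.
PAYOUT_PROBS = [0.4, 0.6]          # assumption for illustration
EPSILON = 0.1                       # exploration rate (assumption)

def pull(arm):
    """Insert a coin, pull an arm, receive a random reward (0 or 1)."""
    return 1.0 if random.random() < PAYOUT_PROBS[arm] else 0.0

estimates = [0.0, 0.0]              # estimated mean reward per arm
counts = [0, 0]

for t in range(10_000):
    if random.random() < EPSILON:            # explore: try a random arm
        arm = random.randrange(2)
    else:                                     # exploit: play the best-looking arm
        arm = max(range(2), key=lambda a: estimates[a])
    r = pull(arm)
    counts[arm] += 1
    # incremental average of the rewards observed for this arm
    estimates[arm] += (r - estimates[arm]) / counts[arm]

print("estimated payout rates:", estimates)   # should approach [0.4, 0.6]
```

With ε = 0 the agent exploits only and may lock onto the worse arm; with ε = 1 it explores only and never cashes in on what it has learned.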

  15. Q-learning and evolutions • A method of RL introduced by Watkins (89) • The world is modeled as a Markov Decision Process (MDP) • Evolutions: Q(λ)-learning, HQ-learning, Bayesian Q-learning, W-learning, fuzzy learning.

  16. Markov Decision Processes

  17. [Diagram: an MDP state]

  18. [Diagram: a state and the action taken in it]

  19. [Diagram: state, action, and the resulting new state]

  20. [Diagram repeated: state, action, new state]

  21. [Diagram: state, action, new state, and the reward received]

  22. [Diagram repeated: state, action, new state, reward]

  23. Markov Decision Processes

  24. Markov Decision Processes • Transitions are probabilistic: y and r are drawn from the stationary probability distributions P_xa(y) and P_xa(r). • We have Σ_r P_xa(r) = 1 and Σ_y P_xa(y) = 1. • The Markov stationary property is expressed as: P(x_{t+1} = y | x_t = x, a_t = a) = P_xa(y) for all t.

  25. Markov Decision Processes • Special case, deterministic worlds: P_xa(y) = 1 if y = y_xa, 0 otherwise; P_xa(r) = 1 if r = r_xa, 0 otherwise. • Special case, different action sets: P_xa(x) = 1 for all actions a in the 'unavailable' set for x.

  26. Markov Decision Processes • Expected reward: when we take action a in state x, the reward we expect to receive is E(r) = Σ_r r P_xa(r). • Typically, r is a function of the transition from x to y. Writing r = r(x,y), the probability of a particular reward is P_xa(r) = Σ_{y | r(x,y) = r} P_xa(y), and the expected reward becomes E(r) = Σ_y r(x,y) P_xa(y). • Rewards r are bounded by r_min and r_max. Hence, for a given x and a, r_min ≤ E(r) ≤ r_max.
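A minimal sketch of these definitions on a tiny hand-made MDP. The states, actions, probabilities, and the reward function r(x,y) below are invented purely to illustrate the notation P_xa(y) and E(r):

```python
# Hypothetical 2-state MDP used only to illustrate the notation.
# P[(x, a)] is the transition distribution P_xa(y): new state -> probability.
P = {
    ("s0", "left"):  {"s0": 0.8, "s1": 0.2},
    ("s0", "right"): {"s0": 0.1, "s1": 0.9},
    ("s1", "left"):  {"s0": 0.5, "s1": 0.5},
    ("s1", "right"): {"s0": 0.0, "s1": 1.0},
}

def r(x, y):
    """Reward attached to the transition x -> y (assumption)."""
    return 1.0 if y == "s1" else 0.0

# Each P_xa must be a proper distribution: sum_y P_xa(y) = 1.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in P.values())

def expected_reward(x, a):
    """E(r) = sum_y r(x, y) * P_xa(y)."""
    return sum(r(x, y) * p for y, p in P[(x, a)].items())

print(expected_reward("s0", "right"))   # ~0.9
print(expected_reward("s0", "left"))    # ~0.2
```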

  27. Markov Decision Processes • The task: the agent acts according to a policy π. • Deterministic policy: a unique action a = π(x). • Stochastic policy: a distribution P_x^π, choosing action a with probability P_x^π(a). • Stationary or memory-less policy: no concept of time. • Non-stationary policy: the agent must possess memory. • Following a stationary deterministic policy π, at time t, the agent observes state x_t, takes action a_t = π(x_t), observes new state x_{t+1}, and receives reward r_t with expected value E(r_t) = Σ_r r P_{x_t a_t}(r).

  28. Markov Decision Processes

  29. [Diagram: a sequence of states S leading to a delayed reward]

  30. [Diagram: the same sequence, with the rewards at times t-1, t, t+1 weighted by the discount factor γ]

  31. Markov Decision Processes • The agent is interested in the total discounted reward: R = r_t + γ r_{t+1} + γ² r_{t+2} + …, where 0 ≤ γ < 1. • Special case γ = 0, where we only try to maximize the immediate reward. • A low/high γ means paying little/great attention to the future. • The expected total discounted reward if we follow policy π, starting from x_t, is: V^π(x_t) = E(R) = E(r_t) + γ E(r_{t+1}) + γ² E(r_{t+2}) + … = Σ_r r P_{x_t a_t}(r) + γ Σ_y V^π(y) P_{x_t a_t}(y).
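A minimal sketch of this recursive definition, reusing the shape of the hypothetical 2-state MDP from the earlier snippet and an arbitrary fixed deterministic policy; iterating the equation until it settles is one way to evaluate V^π (the discount value and the policy are assumptions):

```python
# Hypothetical 2-state MDP (same shape as the earlier snippet).
P = {
    ("s0", "left"):  {"s0": 0.8, "s1": 0.2},
    ("s0", "right"): {"s0": 0.1, "s1": 0.9},
    ("s1", "left"):  {"s0": 0.5, "s1": 0.5},
    ("s1", "right"): {"s0": 0.0, "s1": 1.0},
}
def r(x, y):                            # reward attached to the transition x -> y
    return 1.0 if y == "s1" else 0.0

GAMMA = 0.9                             # discount factor gamma (assumption)
policy = {"s0": "right", "s1": "left"}  # an arbitrary deterministic policy pi

# Iterate V(x) <- sum_y P_x,pi(x)(y) * ( r(x,y) + gamma * V(y) ) until it converges.
V = {x: 0.0 for x in ("s0", "s1")}
for _ in range(1000):
    V = {x: sum(p * (r(x, y) + GAMMA * V[y])
                for y, p in P[(x, policy[x])].items())
         for x in V}

print(V)   # V_pi(x): expected total discounted reward from each state under pi
```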

  32. Markov Decision Processes • V^π(x): value of state x under policy π. • The agent must find an optimal policy π* that maximizes the total discounted expected reward. • DP theory assures us of the existence of a stationary and deterministic optimal policy π* for an MDP, which satisfies: V*(x) = max_{b ∈ A} [Σ_r r P_xb(r) + γ Σ_y V*(y) P_xb(y)] for all x. • π* may be non-unique, but V*(x) is unique and is the best that an agent can do from x. • All optimal policies have V^{π*}(x) = V*(x).
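A minimal sketch of the optimality equation on the same hypothetical MDP: repeatedly applying the max over actions (value iteration, used here for illustration) converges to V*, and the optimal policy reads off the maximizing action:

```python
# Hypothetical 2-state MDP (same shape as the earlier snippets).
P = {
    ("s0", "left"):  {"s0": 0.8, "s1": 0.2},
    ("s0", "right"): {"s0": 0.1, "s1": 0.9},
    ("s1", "left"):  {"s0": 0.5, "s1": 0.5},
    ("s1", "right"): {"s0": 0.0, "s1": 1.0},
}
ACTIONS = ("left", "right")
GAMMA = 0.9                             # assumption

def r(x, y):
    return 1.0 if y == "s1" else 0.0

# Value iteration: V*(x) = max_b sum_y P_xb(y) * ( r(x,y) + gamma * V*(y) )
V = {"s0": 0.0, "s1": 0.0}
for _ in range(1000):
    V = {x: max(sum(p * (r(x, y) + GAMMA * V[y])
                    for y, p in P[(x, b)].items())
                for b in ACTIONS)
         for x in V}

# A (possibly non-unique) optimal policy picks the action achieving the max.
pi_star = {x: max(ACTIONS,
                  key=lambda b: sum(p * (r(x, y) + GAMMA * V[y])
                                    for y, p in P[(x, b)].items()))
           for x in V}
print(V, pi_star)
```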

  33. Markov Decision Processes • The strategy: build up a Q-value Q(x,a) for each pair (x,a). • The Q-learning agent must find an optimal policy when P_xa(y) and P_xa(r) are initially unknown; it must interact with the world to learn these probabilities. • In 1-step Q-learning, after each experience we observe state y, receive reward r and update: Q(x,a) := r + γ max_{b ∈ A} Q(y,b).

  34. Markov Decision Processes • In the discrete case, where we store each Q(x,a) explicitly in a lookup table, we update: Q(x,a) := (1 - α) Q(x,a) + α (r + γ max_{b ∈ A} Q(y,b)). • 0 ≤ α ≤ 1 indicates the weight given to the new experience. • Start with α = 1, which favours exploration. • Finish with α → 0, which favours exploitation.
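A minimal sketch of the tabular update. The environment interface (`reset()`/`step()`), the ε-greedy action choice, and the hyper-parameter values are illustrative assumptions, not part of the slides:

```python
import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.1   # discount, learning rate, exploration (assumptions)
ACTIONS = ["left", "right"]

Q = defaultdict(float)                  # lookup table Q[(x, a)], 0 by default

def choose_action(x):
    """Epsilon-greedy over the current Q-values."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(x, a)])

def q_update(x, a, r, y):
    """Discrete 1-step Q-learning: Q(x,a) := (1-alpha) Q(x,a) + alpha (r + gamma max_b Q(y,b))."""
    target = r + GAMMA * max(Q[(y, b)] for b in ACTIONS)
    Q[(x, a)] = (1 - ALPHA) * Q[(x, a)] + ALPHA * target

# Usage with a hypothetical environment exposing reset() and step(action):
# x = env.reset()
# for t in range(10_000):
#     a = choose_action(x)
#     y, r = env.step(a)                # new state and reward
#     q_update(x, a, r, y)
#     x = y
```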

  35. Undirected Exploration • Semi-uniform distribution: P(a|s) = p_best + (1 - p_best)/|A| if a is estimated optimal in s, and (1 - p_best)/|A| otherwise. • Boltzmann law: P(a|s) = e^{Q(s,a)/T} / Σ_b e^{Q(s,b)/T}.

  36. [Slide repeated: the same semi-uniform and Boltzmann exploration rules]
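A minimal sketch of both exploration rules, assuming a Q-table keyed by (state, action) and treating p_best and the temperature T as free parameters (the exact symbols used on the original slides are not recoverable, so these names are assumptions):

```python
import math
import random

ACTIONS = ["left", "right", "up", "down"]

def semi_uniform(Q, s, p_best=0.7):
    """With probability p_best pick the estimated-optimal action,
    otherwise pick uniformly among all actions."""
    best = max(ACTIONS, key=lambda a: Q.get((s, a), 0.0))
    if random.random() < p_best:
        return best
    return random.choice(ACTIONS)

def boltzmann(Q, s, T=1.0):
    """Pick action a with probability exp(Q(s,a)/T) / sum_b exp(Q(s,b)/T)."""
    prefs = [math.exp(Q.get((s, a), 0.0) / T) for a in ACTIONS]
    total = sum(prefs)
    return random.choices(ACTIONS, weights=[p / total for p in prefs])[0]

# Example: a state where "right" looks best; a high temperature T keeps exploration alive.
Q = {("s0", "right"): 1.0, ("s0", "left"): 0.2}
print(semi_uniform(Q, "s0"), boltzmann(Q, "s0", T=0.5))
```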

  37. Hierarchical Q-learning • As the complexity of problems scales up, both the size of the state space and the complexity of the reward function increase. • Lin (93) suggests breaking a complex problem into sub-problems and having a collection of Q-learning agents A1, …, An learn the sub-problems. • A single controlling Q-learning agent learns Q(x,i), where i is which agent to choose in state x. • When the creature observes state x, each agent Ai suggests an action ai. The switch chooses winner k and executes ak.
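A minimal sketch of the switch idea, assuming each sub-agent exposes a `suggest(x)` method and that the controller is itself a small Q-table over (state, agent-index) pairs; all names, classes, and parameter values here are illustrative:

```python
import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.9, 0.3, 0.1     # assumptions

class SubAgent:
    """Stand-in for a Q-learning agent A_i trained on one sub-problem."""
    def __init__(self, preferred_action):
        self.preferred_action = preferred_action
    def suggest(self, x):
        return self.preferred_action       # a real agent would consult its own Q-table

agents = [SubAgent("go_to_food"), SubAgent("avoid_predator")]
Q_switch = defaultdict(float)              # controller's table Q(x, i)

def choose_agent(x):
    """Epsilon-greedy choice of which sub-agent wins in state x."""
    if random.random() < EPSILON:
        return random.randrange(len(agents))
    return max(range(len(agents)), key=lambda i: Q_switch[(x, i)])

def switch_update(x, i, r, y):
    """The controller is itself trained by 1-step Q-learning over agent choices."""
    target = r + GAMMA * max(Q_switch[(y, j)] for j in range(len(agents)))
    Q_switch[(x, i)] = (1 - ALPHA) * Q_switch[(x, i)] + ALPHA * target

# On each step: observe x, pick winner k, execute its suggested action.
# k = choose_agent(x); action = agents[k].suggest(x); y, r = env.step(action)
# switch_update(x, k, r, y)
```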

  38. Q-learning and evolutions • A method of RL introduced by Watkins (89) • The world is modeled as a Markov Decision Process (MDP) • Q(λ)-learning (Peng and Williams 96): combines Q-learning (Watkins 89) and TD(λ) (Sutton 88, Tesauro 92). • HQ-learning (Wiering and Schmidhuber 97): a hierarchical extension of Q-learning. • Bayesian Q-learning: use of beliefs and observations. • W-learning: Q-learning with multiple independent agents.

  39. Speeding-up Q(λ)-learning • In discrete Q-learning, we update: Q(x_t,a_t) := (1 - α_k) Q(x_t,a_t) + α_k (r_t + γ max_{b ∈ A} Q(x_{t+1},b)). • Rewrite as: Q(x_t,a_t) := Q(x_t,a_t) + α_k e_t, with e_t = r_t + γ max_{b ∈ A} Q(x_{t+1},b) - Q(x_t,a_t). • In Q(λ)-learning, we update every pair (s,a): Q(s,a) := Q(s,a) + α_k [e_t η_t(s,a) + e'_t ι_t(s,a)], with: e_t = r_t + γ max_{b ∈ A} Q(x_{t+1},b) - Q(x_t,a_t); e'_t = r_t + γ max_{b ∈ A} Q(x_{t+1},b) - max_{b ∈ A} Q(x_t,b); η_t(s,a) returns 1 if (s,a) occurred at time t, 0 otherwise; ι_t(s,a) = γλ (ι_{t-1}(s,a) + η_{t-1}(s,a)) (eligibility trace).
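A minimal sketch of one online step in this spirit, with the eligibility traces kept in a dictionary. The hyper-parameter values and the environment interface are assumptions, and this is a simplified reading of the update above rather than Peng and Williams' exact published algorithm:

```python
from collections import defaultdict

GAMMA, LAMBDA, ALPHA = 0.9, 0.8, 0.3      # assumptions
ACTIONS = ["left", "right"]

Q = defaultdict(float)                     # Q[(s, a)]
trace = defaultdict(float)                 # eligibility trace iota[(s, a)]

def max_q(s):
    return max(Q[(s, b)] for b in ACTIONS)

def q_lambda_update(x, a, r, y):
    """One online Q(lambda) step: the visited pair gets the full error e_t,
    previously visited (traced) pairs get e'_t measured against max_b Q(x, b)."""
    e = r + GAMMA * max_q(y) - Q[(x, a)]
    e_prime = r + GAMMA * max_q(y) - max_q(x)

    # iota_t(s,b) = gamma * lambda * (iota_{t-1}(s,b) + eta_{t-1}(s,b));
    # the previous step's visit was folded into `trace` at the end of that step.
    for key in list(trace):
        trace[key] *= GAMMA * LAMBDA
        Q[key] += ALPHA * e_prime * trace[key]   # traced pairs get e'_t
    Q[(x, a)] += ALPHA * e                        # the visited pair gets e_t
    trace[(x, a)] += 1.0                          # eta_t: mark (x, a) as just visited

# Usage per step, with (x, a, r, y) coming from the environment:
# q_lambda_update(x, a, r, y)
```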

  40. Speeding-up Q(λ)-learning • Peng and Williams' algorithm for online Q(λ): [algorithm listing shown as a figure on the slide]

  41. Speeding-up Q(λ)-learning • Notes: • There are other possible variants, e.g. (Rummery and Niranjan 94). • There is also a fast Q(λ)-learning algorithm based on the fact that the only Q-values needed at any given time are those for the possible actions in the current state. The algorithm relies on two procedures: • the Local Update procedure calculates exact Q-values only when they are required; • the Global Update procedure updates the global variables and the current Q-values. • Use of "lazy learning" is possible.
