
Optimal Tuning of Continual Online Exploration in Reinforcement Learning


Presentation Transcript


  1. Optimal Tuning of Continual Online Exploration in Reinforcement Learning Youssef Achbany, Francois Fouss, Luh Yen, Alain Pirotte & Marco Saerens Information Systems Research Unit (ISYS) Université de Louvain, Belgium

  2. Outline • Introduction • Mathematical concepts • Modelling exploration by entropy • Optimal policy • Preliminary experiments • Conclusion and further work

  3. Introduction • One of the challenges of reinforcement learning is to manage the tradeoff between exploration and exploitation. • Exploitation aims to capitalize on already well-established solutions. • Exploration aims to continually try new ways of solving the problem, and is relevant when the environment is changing.

  4. Introduction • Simple routing problem: the goal is to reach a destination node (13) from an initial node (1) while minimizing costs. • Each node has a set of admissible actions, each with an associated weight (cost). • We define a probability distribution on the set of admissible actions.

  5. Mathematical concepts • We have a set of states, S = {1, 2, …, n} • s_t = k means that the system is in state k at time t • In each state s = k, we have a set of admissible control actions, U(k) • So that u(k) ∈ U(k) is a control action available at state k

  6. Mathematical concepts • When we choose action u(s_t) at state s_t, • A bounded cost C(u(s_t) | s_t) < ∞ is incurred • The system jumps to state s_{t+1} = f(u(s_t) | s_t) • Where f is a deterministic transition function • We suppose the network of states does not contain any negative cycle

  7. Mathematical concepts • For each state s, we define a probability distribution on the set of admissible actions, P(u(s) | s) • Meaning that the choice of action is randomized • This introduces exploration, not only exploitation • This is the main contribution of our work

  8. Mathematical concepts [Figure: a state k with three admissible actions u_k1, u_k2, u_k3 and their probabilities P(u_k1 | k), P(u_k2 | k), P(u_k3 | k)] • For instance, if in state s = k there are three admissible actions, • The probability distribution P(u(k) | s = k) involves three values

  9. Mathematical concepts • The policy π is defined as the set of all probability distributions for all states

  10. Mathematical concepts • The goal is to reach a destination state, s = d • From an initial state, s_0 = k_0 • While minimizing the total expected cost • The expectation is taken over the policy, that is, over all the random variables u(k) associated with the states
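
One way to write this total expected cost explicitly (a reconstruction from the definitions above; the stopping time T_d is not named in the transcript):

    V^{\pi}(k_0) = \mathrm{E}_{\pi}\left[ \sum_{t=0}^{T_d - 1} C(u(s_t) \mid s_t) \;\Big|\; s_0 = k_0 \right]

where T_d denotes the first time at which the destination state d is reached.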

  11. Mathematical concepts • In other words, we have to determine the best policy π that minimizes V^π(k_0) • That is, the best probability distributions • This is standard, except for the fact that we introduce choice randomisation

  12. Mathematical concepts • We now introduce a way to control exploration • We introduce the degree of exploration, E_k, defined on each state k • Which is the entropy of the probability distribution of actions in state k

  13. Modelling exploration by entropy • The degree of exploration, E_k, is defined as the entropy at state k • The minimum is 0 (no exploration) • The maximum is log(n_k), where n_k is the number of admissible actions in state k (full exploration)
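
In standard notation (the formula itself is not reproduced in the transcript), this degree of exploration is the Shannon entropy of the action distribution at state k:

    E_k = -\sum_{i \in U(k)} P(i \mid k)\, \log P(i \mid k)

which is indeed 0 for a deterministic choice and log(n_k) for a uniform choice over the n_k admissible actions.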

  14. Modelling exploration by entropy • The exploration rate is defined from E_k (see the reconstruction below) • and takes its value between 0 (no exploration) • and 1 (full exploration).
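
The definition referred to here is presumably the entropy normalized by its maximum (a reconstruction consistent with the stated bounds, not copied from the slide):

    \text{exploration rate at } k = \frac{E_k}{\log n_k} \in [0, 1]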

  15. Modelling exploration by entropy • The goal now is to determine the optimal policy under exploration constraints • That is, seek the policy π*, among all admissible policies, • for which the expected cost V^π(k_0) is minimal • while guaranteeing a given degree of exploration (entropy) in each state k

  16. Modelling exploration by entropy • In other words, solve the constrained problem written out below, • where the E_k are provided/fixed by the user/designer • They control the degree of exploration at each node k
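
The constrained problem described on this slide can be written as follows (reconstructed from the surrounding text):

    \pi^{*} = \arg\min_{\pi} V^{\pi}(k_0) \quad \text{subject to} \quad -\sum_{i \in U(k)} P(i \mid k)\, \log P(i \mid k) = E_k \quad \text{for every state } k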

  17. Modelling exploration by entropy • Thus, we route the agents as fast as possible, while exploring the network

  18. Optimal policy • Here are the necessary optimality conditions (for a local minimum), very similar to Bellman's equations (reconstructed below) • V*(k) is the optimal expected cost from state k • P(i | k) is the probability of choosing action i, satisfying the entropy constraint through the parameter θ_k
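
A reconstruction of these conditions, assuming the Boltzmann (softmax) form used in the authors' technical report, where θ_k is the parameter adjusted so that the entropy constraint E_k is satisfied:

    P(i \mid k) = \frac{\exp[-\theta_k\, (C(i \mid k) + V^{*}(f(i \mid k)))]}{\sum_{j \in U(k)} \exp[-\theta_k\, (C(j \mid k) + V^{*}(f(j \mid k)))]}

    V^{*}(k) = \sum_{i \in U(k)} P(i \mid k)\, [\, C(i \mid k) + V^{*}(f(i \mid k)) \,], \qquad V^{*}(d) = 0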

  19. Optimal policy • These conditions lead to the following updating rules (a sketch is given below) • Convergence has been proved in a stationary environment
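
A minimal Python sketch of this kind of updating rule on a toy routing graph. The graph, the costs and the fixed value of theta are illustrative assumptions; in the actual method each θ_k would additionally be tuned (for instance by a one-dimensional search) so that the entropy of P(·|k) equals the prescribed E_k.

    import numpy as np

    # graph[k] = list of (cost, next_state) pairs, one per admissible action at state k
    graph = {
        0: [(1.0, 1), (4.0, 2)],
        1: [(1.0, 2), (6.0, 3)],
        2: [(2.0, 3)],
        3: [],             # destination state: no actions, V(3) stays 0
    }
    destination = 3
    theta = 2.0            # large theta -> nearly greedy; theta -> 0 -> blind exploration

    V = {k: 0.0 for k in graph}

    for _ in range(200):                    # iterate the updating rule until it stabilizes
        for k, actions in graph.items():
            if k == destination or not actions:
                continue
            q = np.array([c + V[nxt] for c, nxt in actions])  # cost-to-go of each action
            p = np.exp(-theta * (q - q.min()))                 # softmax, shifted for numerical stability
            p /= p.sum()
            V[k] = float(p @ q)             # expected cost under the randomized policy

    print({k: round(v, 3) for k, v in V.items()})

With a large theta the values approach the shortest-path costs; with theta close to 0 they approach the expected cost of a blind random walk.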

  20. Optimal policy • This updating rule has a nice interpretation: • Route the agents preferentially (with probability P(i | k)) to the state from which the expected cost, including the direct cost for reaching this state, is minimal

  21. Optimal policy • If θ_k is large (zero entropy: no exploration), we obtain the recurrence shown below, • which is the common value iteration algorithm, or Bellman's equation, • for finding the shortest path
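
In that limit the randomized update concentrates on the best action, giving the familiar Bellman recurrence for the shortest path:

    V^{*}(k) = \min_{i \in U(k)} [\, C(i \mid k) + V^{*}(f(i \mid k)) \,]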

  22. Optimal policy • If θ_k is zero (maximum entropy: full exploration), • We perform a blind exploration • We estimate the "average first passage time" • Without taking the costs into consideration when choosing actions (see the recurrence below, where n_k is the number of admissible actions in state k)
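
The limiting update alluded to here is presumably the uniform-policy recurrence (a reconstruction; with all costs equal to 1 it is exactly the average first-passage time to the destination):

    P(i \mid k) = \frac{1}{n_k}, \qquad V^{*}(k) = \frac{1}{n_k} \sum_{i \in U(k)} [\, C(i \mid k) + V^{*}(f(i \mid k)) \,]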

  23. Advantages of our algorithm • Our strategy is of interest when the environment is changing • and there is a need for continual exploration • Indeed, if no exploration is performed, • the agent will not notice the changes unless they occur on the shortest path, • so the policy will not be adjusted • In other words, we propose an optimal exploration/exploitation trade-off

  24. Preliminary experiments [Figure: a simple network routing problem in a dynamic and uncertain environment]

  25. Preliminary experiments • Exploration rate of 0% for all nodes (no exploration)

  26. Preliminary experiments • Exploration rate of 30% for all nodes

  27. Preliminary experiments • Exploration rate of 60% for all nodes

  28. Preliminary experiments • Exploration rate of 90% for all nodes

  29. Preliminary experiments • Other experimental simulations are provided in: • Tuning continual exploration in reinforcement learning (technical report submitted for publication). • http://www.isys.ucl.ac.be/staff/francois/Articles/Achbany2005a.pdf

  30. Conclusion • In this work, • we presented a model integrating both exploration and exploitation in a common framework. • The exploration rate is controlled by the entropy of the choice probability distribution defined on the states of the system. • When no exploration is performed (zero entropy at each node), the model reduces to the common value iteration algorithm computing the minimum-cost policy. • On the other hand, when full exploration is performed (maximum entropy at each node), the model reduces to a "blind" exploration that does not take the costs into account.

  31. Further work • This model has been extended to • stochastic shortest-path problems • discounted problems • acyclic graphs • edit distances between strings • We are also developing links with Q-learning

  32. Thank you!
