
Q-Learning for Policy Improvement in Network Routing


Presentation Transcript


  1. Q-Learning for Policy Improvement in Network Routing. W. Chen & S. Meyn, Dept. of ECE & CSL, University of Illinois

  2. Q-learning Techniques for Network Optimization (W. Chen & S. Meyn)
  ACHIEVEMENT DESCRIPTION
  • Implementation: consensus algorithms and information distribution
  • Adaptation: kernel-based TD-learning and Q-learning
  • Integration with Network Coding projects: code around network hot-spots
  MAIN RESULT: Q-learning for network routing: by observing the network behavior, we can approximate the optimal policy over a class of policies with restricted complexity and restricted information.
  HOW IT WORKS:
  Step 1: Generate a state trajectory.
  Step 2: Learn the best value function approximation by stochastic approximation.
  Step 3: Policy for routing: the h-MW policy derived from the value function approximation.
  STATUS QUO
  • What is the state of the art and what are its limitations? Control by learning: obtain a routing/scheduling policy for the network that is approximately optimal with respect to delay and throughput. Can these techniques be extended to wireless models?
  IMPACT
  • Near-optimal performance with a simple solution: Q-learning with steepest descent or the Newton-Raphson method via stochastic approximation.
  • Theoretical analysis of convergence.
  • Simulation experiments are ongoing.
  KEY NEW INSIGHTS
  • Extend to wireless? Yes.
  • Complexity is similar to MaxWeight; policies are distributed and throughput optimal.
  • Learning an approximately optimal solution by Q-learning is feasible, even for complex networks.
  • New application: Q-learning and TD-learning for power control.
  NEXT-PHASE GOALS
  • Un-consummated union challenge: integrate coding and resource allocation.
  • Generally, solutions to complex decision problems should offer insight: algorithms for dynamic routing, visualization, and optimization.

  3. Problem: Decentralized resource allocation for a dynamic network is complex. How do we achieve near-optimal performance (in delay or power) while maintaining throughput optimality and a distributed implementation?
  Motivation: Optimal control theory is intrinsically based on the associated value functions. We can approximate the value function using a relaxation technique or other methods.
  Solution: Learn an approximation of the value function over a parametric family (which can be chosen using prior knowledge) and obtain an approximately optimal routing policy for the network.

  4. h-MaxWeight Method (Meyn 2008)
  • h-MaxWeight uses a perturbation technique to generate a rich class of universally stabilizing policies. Any monotone, convex function can be used as the function h.
  • Relaxation-based h-MaxWeight: find the function h through a relaxation of the fluid value function.
  • Q-learning-based h-MaxWeight: approximate the optimal function h over a parametric family by local learning.
  How do we approximate the optimal function h?
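To make the decision rule concrete, the following is a minimal sketch (not taken from the slides) of an h-MaxWeight action choice for a single-hop scheduling setting: the ordinary MaxWeight queue-length weights are replaced by the partial derivatives of a monotone, convex function h. The name h_maxweight_action, the two-queue example, and the service vectors are illustrative assumptions.

```python
import numpy as np

def h_maxweight_action(x, service_vectors, grad_h):
    """Choose the feasible service vector mu that maximizes <grad_h(x), mu>.

    grad_h(x) returns the per-queue weights dh/dx_i; with h(x) = 0.5*||x||^2
    (so grad_h(x) = x) this reduces to ordinary MaxWeight.
    """
    weights = grad_h(x)
    scores = [float(weights @ mu) for mu in service_vectors]
    return service_vectors[int(np.argmax(scores))]

# Illustrative two-queue example: serve exactly one queue per time slot.
service_vectors = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
x = np.array([5.0, 2.0])
print(h_maxweight_action(x, service_vectors, grad_h=lambda q: q))  # ordinary MaxWeight
```

The same rule is reused with a learned h in the sketches that follow.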

  5. Find the Function h by Local Learning
  Approximate the value function by a general quadratic; the approximation is only required to hold in a sub-region of the state space.
  Architecture: choose the basis using
  • the fluid value function approximation,
  • the diffusion value function approximation,
  • shortest-path information.
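One way such an architecture could be realized, as a hedged sketch: parameterize h linearly over a small set of quadratic basis functions, each of which might encode fluid-value, diffusion, or shortest-path structure. The names make_quadratic_h and basis_matrices, and the specific basis choice, are assumptions for illustration only.

```python
import numpy as np

def make_quadratic_h(theta, basis_matrices):
    """Build h_theta(x) = sum_k theta_k * x^T Q_k x and its gradient.

    Each symmetric matrix Q_k is a quadratic basis element, e.g. derived from a
    fluid value function, a diffusion approximation, or shortest-path information.
    """
    def h(x):
        return sum(t * float(x @ Q @ x) for t, Q in zip(theta, basis_matrices))
    def grad_h(x):
        return sum(t * (Q + Q.T) @ x for t, Q in zip(theta, basis_matrices))
    return h, grad_h

# Illustrative basis for two queues: per-queue quadratics plus a coupling term.
basis_matrices = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0]),
                  np.array([[0.0, 0.5], [0.5, 0.0]])]
h, grad_h = make_quadratic_h(theta=np.array([1.0, 1.0, 0.2]),
                             basis_matrices=basis_matrices)
```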

  6. Q-learning-based h-MaxWeight Method: Main Idea and Procedure
  Goal: Minimize the Bellman error over a parametric family of value function approximations.
  Procedure:
  Step 1: Generate a state trajectory.
  Step 2: Learn the best value function approximation by stochastic approximation.
  Step 3: Policy for routing: the h-MW policy derived from the value function approximation.
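A minimal sketch of Step 2, assuming a discounted-cost criterion and a linearly parameterized approximation h_theta(x) = theta @ psi(x); the slides' actual objective (for example an average-cost Bellman error) and parametric family may differ, and the naive Bellman-residual gradient below ignores the double-sampling issue. The name learn_theta, the feature map psi, and the step-size schedule are illustrative.

```python
import numpy as np

def learn_theta(trajectory, cost, psi, gamma=0.99, a0=0.1):
    """Stochastic approximation that shrinks the sampled Bellman residual.

    trajectory: states x_0, x_1, ... generated under a fixed policy (Step 1).
    cost:       one-step cost c(x), e.g. total queue length as a proxy for delay.
    psi:        feature map returning a 1-D numpy array.
    """
    theta = np.zeros(len(psi(trajectory[0])))
    for n, (x, x_next) in enumerate(zip(trajectory[:-1], trajectory[1:]), start=1):
        d = cost(x) + gamma * theta @ psi(x_next) - theta @ psi(x)  # Bellman residual
        grad = gamma * psi(x_next) - psi(x)                         # d(residual)/d(theta)
        theta -= (a0 / n) * d * grad                                # decreasing step size
    return theta
```

Step 3 then plugs the learned theta into the h-MaxWeight rule sketched earlier, using the gradient of h_theta as the queue weights.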

  7. Define the value function over a parametric subspace, h_θ(x) = θᵀψ(x), and minimize the Bellman error E(θ) over θ.
  Steepest descent method: θ_{n+1} = θ_n − a_n ∇E(θ_n).
  Stochastic approximation: replace the gradient by its sampled estimate along the state trajectory; the recursion approximates the optimal parameter θ*.
  The resulting routing policy: h_{θ*}-MaxWeight.

  8. Example: local approximation for a simple network
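As an illustration of what such an experiment could look like (the slide's actual network and results are not reproduced here), the sketch below wires together h_maxweight_action and learn_theta from the earlier sketches on a toy two-queue system: generate a trajectory under plain MaxWeight, learn a quadratic h_theta, and use its gradient as the new weights. The arrival rates, features, and step size are assumptions, and a learned quadratic is not guaranteed to be monotone and convex, so this is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x, mu, arrival_probs=(0.3, 0.3)):
    """One slot of a toy two-queue system: Bernoulli arrivals, then service mu."""
    arrivals = rng.random(2) < np.asarray(arrival_probs)
    return np.maximum(x + arrivals - mu, 0.0)

# Step 1: trajectory under ordinary MaxWeight (reuses h_maxweight_action above).
service_vectors = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
x, trajectory = np.zeros(2), []
for _ in range(50_000):
    trajectory.append(x)
    x = step(x, h_maxweight_action(x, service_vectors, grad_h=lambda q: q))

# Step 2: learn theta for a quadratic feature map (reuses learn_theta above);
# the step size a0 may need tuning for stability.
psi = lambda q: np.array([q[0]**2, q[1]**2, q[0]*q[1], q[0], q[1]])
theta = learn_theta(trajectory, cost=lambda q: q.sum(), psi=psi, a0=1e-4)

# Step 3: h_theta-MaxWeight uses the gradient of the learned quadratic as weights.
grad_h = lambda q: np.array([2*theta[0]*q[0] + theta[2]*q[1] + theta[3],
                             2*theta[1]*q[1] + theta[2]*q[0] + theta[4]])
action = h_maxweight_action(np.array([5.0, 2.0]), service_vectors, grad_h)
```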

  9. Summary and Challenges
  Challenges:
  • CAN WE IMPLEMENT? How do we construct implementable approximations of the optimal routing/scheduling policies?
  • CAN WE CODE? How do we combine the routing/scheduling policies with network coding in ITMANET?
  KEY CONCLUSION: Approximate the optimal scheduling/routing policies for the network by local learning.
  References
  • S. Meyn. Stability and asymptotic optimality of generalized MaxWeight policies. SIAM J. Control Optim., 47(6):3259–3294, 2009.
  • W. Chen et al. Approximate Dynamic Programming using Fluid and Diffusion Approximations with Applications to Power Management. Accepted to the 48th IEEE Conference on Decision and Control, 2009.
  • W. Chen et al. Coding and Control for Communication Networks. Invited paper, to appear in the special issue of Queueing Systems.
  • S. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
  • P. Mehta and S. Meyn. Q-learning and Pontryagin's Minimum Principle. Accepted to the 48th IEEE Conference on Decision and Control, 2009.
