
Restless Multi-Arm Bandits Problem (RMAB): An Empirical Study


Presentation Transcript


  1. Restless Multi-Arm Bandits Problem (RMAB): An Empirical Study. Anthony Bonifonte and Qiushi Chen. ISYE8813 Stochastic Processes and Algorithms, 4/18/2014

  2. Agenda • Restless multi-arm bandits problem • Algorithms and policies • Numerical experiments • Simulated problem instances • Real application: the capacity management problem

  3. Restless Multi-arm Bandits Problem (figure: N arms evolving in parallel; in each period some arms are made Active and the rest remain Passive)

  4. Objective • Discounted rewards (finite, infinite horizon) • Time average • A general modeling framework • N-choose-M problem • Limited capacity (production capacity, service capacity) • Connection with the multi-arm bandit problem: in the classical MAB, passive arms earn nothing and do not change state, whereas restless arms keep evolving

  5. Exact Optimal Solution: Dynamic Programming • Markov decision process (MDP) • State: the joint state of all N arms • Action: the active set (an N-choose-M subset of arms) • Transition matrix: product of the per-arm transitions • Rewards: sum of the per-arm rewards • Algorithm: • Finite horizon: backward induction • Infinite horizon (discounted): value iteration, policy iteration • Problem size: becomes a disaster quickly (S^N joint states, N-choose-M actions)
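The sketch below makes the exact approach concrete on a toy instance, assuming (as placeholders) randomly generated per-arm transition matrices P[n][a] and reward vectors R[n][a]; all names and sizes are illustrative, not taken from the talk.

```python
# Exact value iteration on the joint RMAB MDP for a tiny instance.
# P[n][a] (S x S) and R[n][a] (length S) are made-up per-arm data.
import itertools
import numpy as np

S, N, M, beta = 3, 3, 1, 0.9                 # states per arm, arms, capacity, discount
rng = np.random.default_rng(0)
P = [[rng.dirichlet(np.ones(S), size=S) for a in range(2)] for n in range(N)]
R = [[rng.random(S) for a in range(2)] for n in range(N)]

joint_states = list(itertools.product(range(S), repeat=N))    # S^N joint states
actions = list(itertools.combinations(range(N), M))           # N-choose-M active sets
V = {s: 0.0 for s in joint_states}

for _ in range(100):                                          # value-iteration sweeps
    V_new = {}
    for s in joint_states:
        best = -np.inf
        for act in actions:
            a = [1 if n in act else 0 for n in range(N)]
            reward = sum(R[n][a[n]][s[n]] for n in range(N))
            # arms transition independently, so the joint probability is a product
            ev = sum(np.prod([P[n][a[n]][s[n], t[n]] for n in range(N)]) * V[t]
                     for t in joint_states)
            best = max(best, reward + beta * ev)
        V_new[s] = best
    V = V_new
```

Even this toy loop touches S^N x C(N, M) x S^N terms per sweep, which is why the exact method "becomes a disaster quickly".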

  6. Lagrangian Relaxation: Upper Bound • Let A(t) = number of active arms at time t • Original requirement: A(t) = M in every period • Relaxed requirement: an "average" version, requiring only that the expected (discounted) number of activations per period equals M • Solving the relaxed problem gives an upper bound on the optimal value • Occupancy measures • Using the dual LP formulation of the MDP
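A minimal sketch of the relaxed problem as a linear program over occupancy measures, solved with scipy.optimize.linprog. The per-arm data layout (P[n][a], R[n][a]) matches the earlier sketch, alpha[n] is an assumed initial state distribution for arm n, and the coupling constraint uses the standard discounted form of the relaxed requirement (expected discounted number of activations = M/(1-beta)); the variable names do not come from the slides.

```python
# LP relaxation of the RMAB via occupancy measures x[n, s, a] >= 0.
import numpy as np
from scipy.optimize import linprog

def solve_relaxation(P, R, alpha, M, beta):
    N, S = len(P), len(R[0][0])
    nvar = N * S * 2
    idx = lambda n, s, a: (n * S + s) * 2 + a      # flat index of x[n, s, a]

    c = np.zeros(nvar)                             # linprog minimizes, so negate rewards
    for n in range(N):
        for s in range(S):
            for a in range(2):
                c[idx(n, s, a)] = -R[n][a][s]

    A_eq, b_eq = [], []
    for n in range(N):                             # per-arm flow-balance constraints
        for s2 in range(S):
            row = np.zeros(nvar)
            for a in range(2):
                row[idx(n, s2, a)] += 1.0
                for s in range(S):
                    row[idx(n, s, a)] -= beta * P[n][a][s, s2]
            A_eq.append(row)
            b_eq.append(alpha[n][s2])

    row = np.zeros(nvar)                           # relaxed ("average") capacity constraint
    for n in range(N):
        for s in range(S):
            row[idx(n, s, 1)] = 1.0
    A_eq.append(row)
    b_eq.append(M / (1.0 - beta))

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * nvar, method="highs")
    return res        # -res.fun is the upper bound; res.x holds the occupancy measures
```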

  7. Index Policies • Philosophy: decomposition • 1 huge problem with S^N joint states → N small problems with S states each • Index policy: • Compute the index for each arm separately • Rank the indices • Choose the M arms with the smallest/largest indices • Easy to compute/implement • Intuitive structure
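The generic skeleton below shows how little is needed at run time once per-arm indices have been precomputed; index_table is a hypothetical array, and this version picks the M largest indices (some policies rank by smallest instead).

```python
# Generic index policy: activate the M arms whose current state has the largest index.
import numpy as np

def index_policy(index_table, current_states, M):
    """index_table[n, s] = precomputed index of arm n in state s."""
    scores = np.array([index_table[n, s] for n, s in enumerate(current_states)])
    return np.argsort(scores)[-M:]               # arm ids of the M largest indices
```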

  8. The Whittle Index Policy (Discounted Rewards) • W-subsidy problem: for a fixed arm, add a "subsidy" W to the passive reward in every state • The Whittle index W(s): the subsidy that makes the passive and active actions indifferent at state s • If W is too small / too large, the active / passive action is strictly better • Closed-form solutions depend on the specific model

  9. Numerical Algorithm for Solving the Whittle Index • STEP 1: Find the plausible range of W • Start from an initial W and an initial step size; run value iteration on the W-subsidy problem and evaluate V(Passive) - V(Active) • Update W: reduce it if the passive action is better, increase it if the active action is better • STOP when V(Passive) - V(Active) reverses its sign for the first time; the range of W is identified as [L, U] • STEP 2: Use binary search within the range [L, U]
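The following is a sketch of the two-step procedure above for a single arm, assuming the same per-arm layout as before (P = [P_passive, P_active], R = [R_passive, R_active]); the step sizes, iteration counts, and tolerance are arbitrary choices, not the values used in the study.

```python
# Whittle index of one arm at state s: bracket the subsidy W, then binary-search it.
import numpy as np

def passive_minus_active(W, P, R, beta, s, iters=300):
    """V(passive) - V(active) at state s in the single-arm W-subsidy problem."""
    V = np.zeros(len(R[0]))
    for _ in range(iters):                       # value iteration with subsidy W
        Q_passive = R[0] + W + beta * P[0] @ V   # passive action also earns the subsidy W
        Q_active = R[1] + beta * P[1] @ V
        V = np.maximum(Q_passive, Q_active)
    return Q_passive[s] - Q_active[s]

def whittle_index(P, R, beta, s, tol=1e-4):
    f = lambda W: passive_minus_active(W, P, R, beta, s)
    # STEP 1: expand the search until the sign of V(passive) - V(active) reverses.
    W0, step = 0.0, 1.0
    d0 = f(W0)
    direction = 1.0 if d0 < 0 else -1.0          # raise W while the active action is better
    W_prev, W = W0, W0 + direction * step
    while f(W) * d0 > 0:
        step *= 2.0
        W_prev, W = W, W + direction * step
    lo, hi = (W_prev, W) if direction > 0 else (W, W_prev)
    f_lo = f(lo)
    # STEP 2: binary search within the identified range [lo, hi].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) * f_lo > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```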

  10. The Primal-Dual Index Policy • Solve the Lagrangian relaxation formulation • Input: • Optimal primal solutions (occupancy measures: the total expected discounted time spent selecting arm n in state s) • Optimal reduced costs (the rate of decrease in the objective value as the corresponding variable increases by 1 unit; they measure how harmful it is to switch an arm from passive to active, or from active to passive) • Let p = number of arms whose active occupancy measure is > 0 in their current state • Policy: • (1) p = M: choose them! • (2) p < M: add (M - p) more arms; among the remaining arms, choose the (M - p) with the smallest passive-to-active reduced cost • (3) p > M: choose M out of the p arms; among the p arms, kick out the (p - M) with the smallest active-to-passive reduced cost
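Below is a sketch of only the selection step, assuming the LP relaxation has already been solved; x, gamma, and delta are hypothetical arrays holding, for each arm's current state, the active occupancy measure and the two reduced costs (passive-to-active and active-to-passive), respectively.

```python
# Primal-dual selection rule: start from the arms the relaxation keeps active,
# then pad or trim to exactly M arms using the (least harmful) reduced costs.
import numpy as np

def primal_dual_select(x, gamma, delta, M):
    x, gamma, delta = map(np.asarray, (x, gamma, delta))
    active = set(np.flatnonzero(x > 0))                  # p arms with positive occupancy
    p = len(active)
    if p < M:                                            # case (2): add (M - p) arms
        rest = [n for n in range(len(x)) if n not in active]
        active |= set(sorted(rest, key=lambda n: gamma[n])[: M - p])
    elif p > M:                                          # case (3): kick out (p - M) arms
        active -= set(sorted(active, key=lambda n: delta[n])[: p - M])
    return sorted(active)                                # case (1): p == M falls through
```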

  11. Heuristic Index Policies • Absolute-greedy policy: choose the M arms with the largest active rewards • Relative-greedy policy: choose the M arms with the largest marginal rewards (active minus passive) • Rolling-horizon policy (H-period look-ahead): choose the M arms with the largest marginal value-to-go, where the value-to-go is the optimal value function over the following H periods
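A sketch of the two greedy heuristics, reusing the per-arm reward layout from the earlier sketches; the rolling-horizon variant would replace the one-step margin with an H-period value-to-go and is not shown.

```python
# Greedy heuristics: rank arms by one-step rewards in their current states.
import numpy as np

def absolute_greedy(R, states, M):
    """The M arms with the largest active reward."""
    scores = np.array([R[n][1][s] for n, s in enumerate(states)])
    return np.argsort(scores)[-M:]

def relative_greedy(R, states, M):
    """The M arms with the largest active-minus-passive (marginal) reward."""
    margins = np.array([R[n][1][s] - R[n][0][s] for n, s in enumerate(states)])
    return np.argsort(margins)[-M:]
```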

  12. Agenda • Restless multi-arm bandits problem • Algorithms and policies • Numerical experiments • Simulated problem instances • Real application: the capacity management problem

  13. Experiment Settings • Assume active rewards are larger than passive rewards • Non-identical arms • Structures in transition dynamics: • Uniformly sampled transition matrix • IFR (increasing failure rate) matrix with non-increasing rewards • P1 stochastically smaller than P2 • Less-connected chain • Evaluation: • Small instances: exact optimal solution • Large instances: upper bound & Monte-Carlo simulation • Performance measure: average gap from optimality or from the upper bound
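One plausible way to generate the "uniformly sampled" instances is sketched below; the flat Dirichlet draw (uniform over each row of the transition matrix) and the additive shift that keeps active rewards above passive rewards are assumptions for illustration, not the exact generator used in the experiments.

```python
# Random instance generator: non-identical arms with uniformly sampled rows
# and active rewards that dominate passive rewards.
import numpy as np

def sample_uniform_arm(S, rng):
    P = [rng.dirichlet(np.ones(S), size=S) for _ in range(2)]   # [passive, active]
    R_passive = rng.random(S)
    R_active = R_passive + rng.random(S)                        # active > passive
    return P, [R_passive, R_active]

rng = np.random.default_rng(1)
arms = [sample_uniform_arm(8, rng) for _ in range(6)]           # e.g. S = 8, N = 6
```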

  14. 5 Questions of Interest 1. How do different policies compare under different problem structures? 2. How do different policies compare under various problem sizes? 3. How do different policies compare under different discount factors? 4. How does a multi-period look-ahead improve a myopic policy? 5. How do different policies compare under different time horizons?

  15. Question 1: Does problem structure help? • Uniformly sampled transition matrix and rewards • Increasing failure rate matrix and non-increasing rewards • Less-connected Markov chain • P1 stochastically smaller than P2, non-increasing rewards

  16. Question 1: Does problem structure help?

  17. Question 2: Does problem size matter? • Optimality gap: Fixed N and M , increasing S

  18. Question 2: Does problem size matter? • Optimality gap: fixed M and S, increasing N (the gap is decreasing)

  19. Question 3: Does discount factor matter? • Infinite horizon: discount factors

  20. Question 4: Does look ahead help a myopic policy? • Greedy policies vs. rolling-horizon policies with different H • Problem size: S=8, N=6, M=2 • Problem structure: uniform vs. less-connected • Discount factor β = 0.4

  21. Question 4: Does look ahead help a myopic policy? • Greedy policies vs. rolling-horizon policies with different H • Problem size: S=8, N=6, M=2 • Problem structure: uniform vs. less-connected • Discount factors β = 0.4 and 0.7

  22. Question 4: Does look ahead help a myopic policy? • Greedy policies vs. rolling-horizon policies with different H • Problem size: S=8, N=6, M=2 • Problem structure: uniform vs. less-connected • Discount factors β = 0.4 and 0.9

  23. Question 4: Does look ahead help a myopic policy? • Greedy policies vs. rolling-horizon policies with different H • Problem size: S=8, N=6, M=2 • Problem structure: uniform vs. less-connected • Discount factors β = 0.4 and 0.98

  24. Agenda • Restless multi-arm bandits problem • Algorithms and policies • Numerical experiments • Simulated problem instances • Real application: the capacity management problem

  25. Clinical Capacity Management Problem (Deo et al. 2013) • School-based asthma care for children: a van with limited capacity visits schools, and a scheduling policy uses patients' medical records to decide who to schedule (treat) • RMAB formulation: state (h, n), capacity M, population N; active set: choose M out of N; h = health state at the last appointment, n = time since the last appointment • OBJECTIVE: maximize the total benefit of the community • Policies compared: current guidelines (fixed-duration policy), Whittle's index policy, primal-dual index policy, greedy (myopic) policy, rolling-horizon policy, H-N priority policy, N-H priority policy, no-schedule [baseline]
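Purely as an illustration of the (h, n) state, the snippet below enumerates one patient arm's state space; the split into 4 health levels and 24 elapsed-period values is an assumption chosen only so the count matches the 96 states per arm quoted on the next slide.

```python
# Hypothetical encoding of one patient "arm" as (h, n) pairs.
from itertools import product

HEALTH_LEVELS = range(4)        # h: health state recorded at the last appointment (assumed 4 levels)
PERIODS_SINCE = range(1, 25)    # n: periods elapsed since the last appointment (assumed up to 24)

arm_states = list(product(HEALTH_LEVELS, PERIODS_SINCE))
assert len(arm_states) == 96    # matches the per-arm state count on the next slide
```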

  26. How Large Is It? • Horizon: 24 periods (2 years) • Population size N ~ 50 patients • State space: 96 states per arm; in total about 1.3 × 10^99 joint states • Decision space: choose 10 out of 50 ≈ 1.2 × 10^10; choose 15 out of 50 ≈ 2.3 × 10^12 • Actual computation time: • Whittle's indices: 96 states/arm × 50 arms = 4800 indices, 1.5 - 3 hours • Presolving the LP relaxation for the primal-dual indices: 4 - 60 seconds
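A quick sanity check of the orders of magnitude quoted above, using Python's standard library (exact counts for N = 50; the slide's population size is approximate).

```python
# Order-of-magnitude check for the joint state space and the active-set counts.
from math import comb

print(f"{96 ** 50:.2e}")       # joint state space: 96 states per arm, 50 arms -> ~1.3e99
print(f"{comb(50, 10):.2e}")   # active sets when choosing 10 of 50 patients -> ~1.0e10
print(f"{comb(50, 15):.2e}")   # active sets when choosing 15 of 50 patients -> ~2.3e12
```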

  27. Performance of Policies (results chart; annotation: "Improvement")

  28. Performance of Policies (results chart; annotation: "Improvement")

  29. Performance of Policies (results chart; annotation: "Improvement")

  30. Whittle's Index vs. Gittins' Index • (S, N, M=1) vs. (S, N, M=2) • Sample 20 instances for each problem size • Whittle's index policy vs. DP exact solution • Optimality tolerance = 0.002 • (Chart: percentage of time when the Whittle index policy is NOT optimal)

  31. Summary • Whittle's index and the primal-dual index work well and efficiently • The relative-greedy policy can work well depending on the problem structure • Policies perform worse on the less-connected Markov chain • All policies tend to work better when capacity is tight • Look-ahead policies have limited marginal benefit for small discount factors

  32. Q&A

  33. Question 5: Does decision horizon matter? • Finite horizon: # of periods
