
Response Regret



Presentation Transcript


  1. Response Regret Martin Zinkevich AAAI Fall Symposium November 5th, 2005 This work was supported by NSF Career Grant #IIS-0133689.

  2. Outline • Introduction • Repeated Prisoners’ Dilemma • Tit-for-Tat • Grim Trigger • Traditional Regret • Response Regret • Conclusion

  3. The Prisoner’s Dilemma • Two prisoners (Alice and Bob) are caught for a small crime. They make a deal not to squeal on each other for a large crime. • Then, the authorities meet with each prisoner separately, and offer a pardon for the small crime if the prisoner turns his/her partner in for the large crime. • Each has two options: • Cooperate with his/her fellow prisoner, or • Defect from the deal.

  4. Bimatrix Game

                      Bob Cooperates                  Bob Defects
  Alice Cooperates    Alice: 1 year,  Bob: 1 year     Alice: 6 years, Bob: 0 years
  Alice Defects       Alice: 0 years, Bob: 6 years    Alice: 5 years, Bob: 5 years

  5. Bimatrix Game [figure: the same payoffs, written as utilities]

  6. Nash Equilibrium [figure: the matrix with the mutual-defection equilibrium marked]

  7. The Problem • Each player, acting to slightly improve his/her own circumstances, hurts the other; if both acted “irrationally” and cooperated, they would both do better.

  8. A Better Model for Real Life • In real life there are consequences for misbehavior • These consequences improve life • A better model: infinitely repeated games

  9. The Goal • Can we come up with algorithms whose performance guarantees, in the presence of other intelligent agents, take into account delayed consequences? • Side effect: a goal for reinforcement learning in infinite POMDPs.

  10. Regret Versus Standard RL • Guarantees of performance during learning. • No guarantee for the “final” policy… for now.

  11. A New Measure of Regret • Traditional Regret • measures immediate consequences • Response Regret • measures delayed effects

  12. Outline • Introduction • Repeated Prisoners’ Dilemma • Tit-for-Tat • Grim Trigger • Traditional Regret • Response Regret • Conclusion

  14. Repeated Bimatrix Game

                      Bob Cooperates    Bob Defects
  Alice Cooperates    -1, -1            -6, 0
  Alice Defects       0, -6             -5, -5

  15. Finite State Machine (for Bob) [figure: an example FSM; each state is labeled with Bob’s action (cooperate or defect), each edge with the Alice action that triggers the transition, and “Alice: *” means any action by Alice]

  16. Grim Trigger [figure: a two-state FSM; Bob starts in “Bob cooperates” and stays there while Alice cooperates; once Alice defects, Bob moves to “Bob defects” and remains there on any Alice action (Alice: *)]

  17. Always Cooperate [figure: a single state “Bob cooperates” with a self-loop on any Alice action (Alice: *)]

  18. Always Defect [figure: a single state “Bob defects” with a self-loop on any Alice action (Alice: *)]

  19. Tit-for-Tat [figure: a two-state FSM; Bob moves to “Bob cooperates” whenever Alice cooperates and to “Bob defects” whenever Alice defects; Bob simply repeats Alice’s last action]
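The four machines above are easy to write down. A minimal sketch (my own encoding, not from the talk): each machine’s state is simply the action it will play next, and a transition reads the opponent’s last move.

```python
def tit_for_tat(state, opp_action):
    # Repeat the opponent's last action.
    return opp_action

def grim_trigger(state, opp_action):
    # Defect forever once the opponent has defected.
    return "D" if state == "D" or opp_action == "D" else "C"

def always_cooperate(state, opp_action):
    return "C"

def always_defect(state, opp_action):
    return "D"

# Payoffs (Alice, Bob) from the repeated bimatrix game on slide 14.
PAYOFF = {("C", "C"): (-1, -1), ("C", "D"): (-6, 0),
          ("D", "C"): (0, -6),  ("D", "D"): (-5, -5)}

# Each machine's initial state (the action it plays first).
START = {tit_for_tat: "C", grim_trigger: "C",
         always_cooperate: "C", always_defect: "D"}

def play(alice, bob, rounds=10):
    a_state, b_state = START[alice], START[bob]
    history = []
    for _ in range(rounds):
        a_act, b_act = a_state, b_state
        history.append(PAYOFF[(a_act, b_act)])
        a_state, b_state = alice(a_state, b_act), bob(b_state, a_act)
    return history

print(play(always_defect, grim_trigger, rounds=4))
# [(0, -6), (-5, -5), (-5, -5), (-5, -5)]
```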

  20. Discounted Utility [figure: the tit-for-tat FSM annotated with a sample run; after each round, play continues (GO) with probability 2/3 and ends (STOP) with probability 1/3; the sampled rounds shown carry payoffs such as C: -1, D: 0, C: -6]

  21. Discounted Utility • The expected value of that process: $u_\gamma = \sum_{t=1}^{\infty} \gamma^{t-1} u_t$
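A quick illustration of the two equivalent views of discounted utility, the weighted sum and the GO/STOP process of the previous slide (a sketch; γ = 2/3 as in the example, and the payoff stream is illustrative):

```python
import random

def discounted_utility(utilities, gamma):
    # u_gamma = sum over t >= 1 of gamma^(t-1) * u_t
    return sum(gamma ** t * u for t, u in enumerate(utilities))

def go_stop_episode(utilities, gamma, rng):
    # After each round, continue (GO) with probability gamma, otherwise
    # STOP. The expected total (undiscounted) payoff of this random-length
    # episode equals the discounted sum above.
    total = 0.0
    for u in utilities:
        total += u
        if rng.random() >= gamma:  # STOP with probability 1 - gamma
            break
    return total

rng = random.Random(0)
us = [-1, -1, 0, -6, 0, -6]  # a sample payoff stream for Alice
exact = discounted_utility(us, 2 / 3)
approx = sum(go_stop_episode(us, 2 / 3, rng) for _ in range(100_000)) / 100_000
# exact and approx agree up to Monte Carlo noise
```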

  22. Optimal Value Functions for FSMs • V*(s): discounted utility of the OPTIMAL policy from state s • V0*(s): immediate maximum utility at state s • V*(B): discounted utility of the OPTIMAL policy given belief B over states • V0*(B): immediate maximum utility given belief B over states • $\Pr[\text{GO}] = \gamma$, $\Pr[\text{STOP}] = 1 - \gamma$
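When Bob’s machine is known, V*(s) can be computed by value iteration over Bob’s states. A minimal sketch, assuming Bob plays grim trigger (the machine from slide 16) and using Alice’s payoffs from slide 14:

```python
# Bob's grim-trigger FSM: a state is Bob's current action, and the
# transition depends on Alice's action this round.
TRANS = {("C", "C"): "C", ("C", "D"): "D",
         ("D", "C"): "D", ("D", "D"): "D"}
# Alice's payoff, keyed by (Alice's action, Bob's action).
U_ALICE = {("C", "C"): -1, ("C", "D"): -6, ("D", "C"): 0, ("D", "D"): -5}

def value_iteration(trans, u, gamma, sweeps=500):
    states = {s for s, _ in trans}
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: max(u[(a, s)] + gamma * V[trans[(s, a)]] for a in "CD")
             for s in states}
    return V

V = value_iteration(TRANS, U_ALICE, gamma=0.5)
# gamma = 0.5 > 1/5, so the optimal policy cooperates forever:
# V["C"] = -1 / (1 - 0.5) = -2.0, while triggering gives at best -5.
```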

  23. Best Responses, Discounted Utility • If γ > 1/5, a policy is a best response to grim trigger iff it always cooperates when playing grim trigger. [figure: the grim-trigger FSM from slide 16]
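The threshold is quick to verify (a worked check, not spelled out on the slide): compare cooperating forever against grim trigger with defecting once and then facing permanent defection.

```latex
% Cooperate forever against grim trigger:
\sum_{t=1}^{\infty} \gamma^{t-1}(-1) \;=\; \frac{-1}{1-\gamma}
% Defect once, then best-respond (defect) against permanent defection:
0 + \sum_{t=2}^{\infty} \gamma^{t-1}(-5) \;=\; \frac{-5\gamma}{1-\gamma}
% Cooperation is strictly better iff -1 > -5\gamma, i.e. iff \gamma > 1/5.
```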

  24. Best Responses, Discounted Utility • Similarly, if γ > 1/5, a policy is a best response to tit-for-tat iff it always cooperates when playing tit-for-tat. [figure: the tit-for-tat FSM from slide 19]

  25. Knowing Versus Learning • Given a known FSM for the opponent, we can determine the optimal policy (for some γ) from an initial state. • However, if it is an unknown FSM, by the time we learn what it is, it will be too late to act optimally.

  26. Grim Trigger or Always Cooperate? [figure: the two FSMs side by side; both start in “Bob cooperates” and behave identically until Alice first defects] • The two machines cannot be told apart without defecting, and defecting against grim trigger is irreversible. For learning, optimality from the initial state is a bad goal.

  27. Deterministic Infinite SMs • Represent any deterministic policy • De-randomization [figure: an infinite tree of states, each labeled with an action (C or D), one branch per opponent history]

  28. New Goal • Can a measure of regret allow us to play like tit-for-tat in the Infinitely Repeated Prisoner’s Dilemma? • In addition, it should be possible for one algorithm to minimize regret against all possible opponents (finite and infinite SMs).

  29. Outline • Introduction • Repeated Prisoners’ Dilemma • Tit-for-Tat • Grim Trigger • Traditional Regret • Response Regret • Conclusion

  30. Traditional Regret: Rock-Paper-Scissors [figure only in the original slides]

  31. Traditional Regret: Rock-Paper-Scissors [figure only in the original slides]

  32-34. Rock-Paper-Scissors [three animation frames: Bob plays BR (best response) to Alice’s last action]

  35. Utility of the Algorithm • Define $u_t$ to be the utility of ALG at time t. • Define $u_0^{ALG}$ to be: $u_0^{ALG} = \frac{1}{T}\sum_{t=1}^{T} u_t$ • Here: $u_0^{ALG} = \frac{1}{5}(0 + 1 + (-1) + 1 + 0) = 1/5$

  36. Rock-Paper-Scissors [figure: visit counts for Bob’s internal states: 3 visits, 1 visit, 1 visit; $u_0^{ALG} = 1/5$]

  37. Rock-Paper-Scissors [figure: the same visit counts as frequencies: 3/5, 1/5, 1/5; $u_0^{ALG} = 1/5$]

  38. Rock-Paper-Scissors [figure: Alice’s fixed actions evaluated against those frequencies earn 0, 2/5, and -2/5; $u_0^{ALG} = 1/5$]

  39. Traditional Regret • Consider B to be the empirical frequency with which states were visited. • Define $u_0^{ALG}$ to be the average utility of the algorithm. • Traditional regret of ALG is: $R = V_0^*(B) - u_0^{ALG}$ • Here: $R = 2/5 - 1/5 = 1/5$
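Spelling the computation out (a sketch; the visit counts 3, 1, 1 and the utility stream match the slides, but which concrete rock/paper/scissors actions carry which counts is my own guess):

```python
# Alice's utility, keyed by (Alice's action, Bob's action).
RPS = ["rock", "paper", "scissors"]
U = {("rock", "rock"): 0,      ("rock", "paper"): -1,    ("rock", "scissors"): 1,
     ("paper", "rock"): 1,     ("paper", "paper"): 0,    ("paper", "scissors"): -1,
     ("scissors", "rock"): -1, ("scissors", "paper"): 1, ("scissors", "scissors"): 0}

def traditional_regret(bob_plays, alice_utilities):
    T = len(bob_plays)
    # B: empirical frequency of Bob's states (here, his actions).
    freq = {b: bob_plays.count(b) / T for b in set(bob_plays)}
    # V0*(B): value of the best fixed action against the empirical mix.
    v0_star = max(sum(p * U[(a, b)] for b, p in freq.items()) for a in RPS)
    u0_alg = sum(alice_utilities) / T  # average realized utility
    return v0_star - u0_alg

bob_plays = ["rock", "rock", "rock", "paper", "scissors"]  # counts 3, 1, 1
alice_utilities = [0, 1, -1, 1, 0]                         # as on slide 35
print(traditional_regret(bob_plays, alice_utilities))      # 2/5 - 1/5 = 0.2
```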

  40. Traditional Regret • Goal: regret approaches zero almost surely. • There exists an algorithm that achieves this against all opponents.

  41. What Algorithm? • Gradient Ascent with Euclidean Projection (Zinkevich, 2003): [formula on the original slide] • (when the p_i are strictly positive)
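The update formula appeared as an image; what follows is a sketch of the standard form of the algorithm: step in the utility-gradient direction, then project back onto the probability simplex. The projection here is the usual sorting-based routine, and the step size eta is a free parameter.

```python
import numpy as np

def project_to_simplex(v):
    # Euclidean projection of v onto {p : p >= 0, sum(p) = 1}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def gradient_step(p, utility_gradient, eta):
    # When every p_i stays strictly positive, the projection reduces to
    # re-centering the step (the slide's parenthetical special case).
    return project_to_simplex(p + eta * utility_gradient)

p = np.ones(3) / 3  # start uniform over 3 actions
p = gradient_step(p, np.array([1.0, 0.0, -1.0]), eta=0.1)
```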

  42. What Algorithm? • Exponentially Weighted Experts (Littlestone and Warmuth, 1994): [formula on the original slide] • And a close relative: [formula on the original slide]
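The formulas here were also images; this sketch is the standard exponentially weighted update, written for rewards rather than losses (one natural reading of the slide; the “close relative” may refer to Hedge of Freund and Schapire):

```python
import numpy as np

def exponential_weights(utility_rows, eta=0.1):
    # utility_rows: iterable of length-N arrays; row t gives each
    # expert's utility at time t. Returns the sequence of mixtures played.
    w, mixtures = None, []
    for u_t in utility_rows:
        u_t = np.asarray(u_t, dtype=float)
        if w is None:
            w = np.ones_like(u_t)
        mixtures.append(w / w.sum())
        w = w * np.exp(eta * u_t)  # multiplicative update on each weight
    return mixtures
```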

  43. What Algorithm? • Regret Matching: [formula on the original slide]
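Regret matching’s formula was an image too; the standard rule (Hart and Mas-Colell) plays each action with probability proportional to its positive cumulative regret. A sketch:

```python
import numpy as np

def regret_matching_mixture(cumulative_regret):
    # Probability of each action is proportional to its positive regret;
    # with no positive regret anywhere, play uniformly.
    pos = np.maximum(cumulative_regret, 0.0)
    total = pos.sum()
    n = len(cumulative_regret)
    return pos / total if total > 0 else np.full(n, 1.0 / n)

def update_cumulative_regret(cumulative_regret, action_utilities, realized):
    # Regret of each action: what it would have earned minus what we earned.
    return cumulative_regret + (np.asarray(action_utilities) - realized)
```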

  44. What Algorithm? • Lots of them!

  45. Extensions to Traditional Regret (Foster and Vohra, 1997) • Into the past: condition on a short history • Optimal against BR to Alice’s Last.

  46. Extensions to Traditional Regret • (Auer et al.) • Only see $u_t$, not $u_{i,t}$ • Use an unbiased estimator of $u_{i,t}$: [formula on the original slide]
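The estimator on the slide was an image; the standard importance-weighted construction (as in Auer et al.’s Exp3) divides the observed utility by the probability of the action actually chosen:

```python
def estimate_utilities(chosen, u_t, p):
    # u_hat[i] = u_t / p[i] if i was chosen, else 0. Unbiased, since
    # E[u_hat[i]] = p[i] * (u_t / p[i]) + (1 - p[i]) * 0 = u_{i,t}.
    return [u_t / p[i] if i == chosen else 0.0 for i in range(len(p))]
```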

  47. Outline • Introduction • Repeated Prisoners’ Dilemma • Tit-for-Tat • Grim Trigger • Traditional Regret • Response Regret • Conclusion

  48. This Talk • Do you want to (minimize traditional regret)? • Even then, is it possible?

  49. Traditional Regret: Prisoner’s Dilemma [figure: the tit-for-tat FSM with a sample history (CC, DC, DD, DD, DD, …) in which play collapses into permanent mutual defection]

  50. Traditional Regret: Prisoner’s Dilemma [figure: tit-for-tat with an empirical belief over Bob’s states: Bob cooperates with frequency 0.2 and defects with frequency 0.8] • Against that belief: Alice defects: -4, Alice cooperates: -5 (since 0.2·0 + 0.8·(-5) = -4 versus 0.2·(-1) + 0.8·(-6) = -5), so traditional regret recommends defection, ignoring the delayed effect of making tit-for-tat defect.
