
Response Regret



Presentation Transcript


  1. Response Regret Martin Zinkevich AAAI Fall Symposium November 5th, 2005 This work was supported by NSF Career Grant #IIS-0133689.

  2. Outline • Introduction • Repeated Prisoners’ Dilemma • Tit-for-Tat • Grim Trigger • Traditional Regret • Response Regret • Conclusion

  3. The Prisoner’s Dilemma • Two prisoners (Alice and Bob) are caught for a small crime. They make a deal not to squeal on each other for a large crime. • Then, the authorities meet with each prisoner separately, and offer a pardon for the small crime if the prisoner turns his/her partner in for the large crime. • Each has two options: • Cooperate with his/her fellow prisoner, or • Defect from the deal.

  4. Bimatrix Game

                      Bob Cooperates                  Bob Defects
  Alice Cooperates    Alice: 1 year,  Bob: 1 year     Alice: 6 years, Bob: 0 years
  Alice Defects       Alice: 0 years, Bob: 6 years    Alice: 5 years, Bob: 5 years

  5. Bimatrix Game [figure: the same payoffs, written as utilities]

  6. Nash Equilibrium [figure: the matrix with the mutual-defection equilibrium marked]

  7. The Problem • Each player, acting to slightly improve his/her own circumstances, hurts the other; if both acted “irrationally” and cooperated, they would both do better.

  8. A Better Model for Real Life • In real life there are consequences for misbehavior • These consequences improve life • A better model: infinitely repeated games

  9. The Goal • Can we come up with algorithms whose performance guarantees, in the presence of other intelligent agents, take into account delayed consequences? • Side effect: a goal for reinforcement learning in infinite POMDPs.

  10. Regret Versus Standard RL • Guarantees of performance during learning. • No guarantee for the “final” policy… for now.

  11. A New Measure of Regret • Traditional Regret • measures immediate consequences • Response Regret • measures delayed effects

  12. Outline • Introduction • Repeated Prisoners’ Dilemma • Tit-for-Tat • Grim Trigger • Traditional Regret • Response Regret • Conclusion

  14. Repeated Bimatrix Game

                      Bob Cooperates    Bob Defects
  Alice Cooperates    -1, -1            -6, 0
  Alice Defects       0, -6             -5, -5

  15. Finite State Machine (for Bob) [figure: an example FSM; each state is labeled with Bob’s action (cooperate or defect), each edge with the Alice action that triggers the transition, and “Alice: *” means any action by Alice]

  16. Grim Trigger [figure: a two-state FSM; Bob starts in “Bob cooperates” and stays there while Alice cooperates; once Alice defects, Bob moves to “Bob defects” and remains there on any Alice action (Alice: *)]

  17. Always Cooperate [figure: a single state “Bob cooperates” with a self-loop on any Alice action (Alice: *)]

  18. Always Defect [figure: a single state “Bob defects” with a self-loop on any Alice action (Alice: *)]

  19. Tit-for-Tat [figure: a two-state FSM; Bob moves to “Bob cooperates” whenever Alice cooperates and to “Bob defects” whenever Alice defects; Bob simply repeats Alice’s last action]
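The four machines above are easy to write down. A minimal sketch (my own encoding, not from the talk): each machine’s state is simply the action it will play next, and a transition reads the opponent’s last move.

```python
def tit_for_tat(state, opp_action):
    # Repeat the opponent's last action.
    return opp_action

def grim_trigger(state, opp_action):
    # Defect forever once the opponent has defected.
    return "D" if state == "D" or opp_action == "D" else "C"

def always_cooperate(state, opp_action):
    return "C"

def always_defect(state, opp_action):
    return "D"

# Payoffs (Alice, Bob) from the repeated bimatrix game on slide 14.
PAYOFF = {("C", "C"): (-1, -1), ("C", "D"): (-6, 0),
          ("D", "C"): (0, -6),  ("D", "D"): (-5, -5)}

# Each machine's initial state (the action it plays first).
START = {tit_for_tat: "C", grim_trigger: "C",
         always_cooperate: "C", always_defect: "D"}

def play(alice, bob, rounds=10):
    a_state, b_state = START[alice], START[bob]
    history = []
    for _ in range(rounds):
        a_act, b_act = a_state, b_state
        history.append(PAYOFF[(a_act, b_act)])
        a_state, b_state = alice(a_state, b_act), bob(b_state, a_act)
    return history

print(play(always_defect, grim_trigger, rounds=4))
# [(0, -6), (-5, -5), (-5, -5), (-5, -5)]
```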

  20. Discounted Utility [figure: the tit-for-tat FSM annotated with a sample run; after each round, play continues (GO) with probability 2/3 and ends (STOP) with probability 1/3; the sampled rounds shown carry payoffs such as C: -1, D: 0, C: -6]

  21. Discounted Utility • The expected value of that process: $u_\gamma = \sum_{t=1}^{\infty} \gamma^{t-1} u_t$
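A quick illustration of the two equivalent views of discounted utility, the weighted sum and the GO/STOP process of the previous slide (a sketch; γ = 2/3 as in the example, and the payoff stream is illustrative):

```python
import random

def discounted_utility(utilities, gamma):
    # u_gamma = sum over t >= 1 of gamma^(t-1) * u_t
    return sum(gamma ** t * u for t, u in enumerate(utilities))

def go_stop_episode(utilities, gamma, rng):
    # After each round, continue (GO) with probability gamma, otherwise
    # STOP. The expected total (undiscounted) payoff of this random-length
    # episode equals the discounted sum above.
    total = 0.0
    for u in utilities:
        total += u
        if rng.random() >= gamma:  # STOP with probability 1 - gamma
            break
    return total

rng = random.Random(0)
us = [-1, -1, 0, -6, 0, -6]  # a sample payoff stream for Alice
exact = discounted_utility(us, 2 / 3)
approx = sum(go_stop_episode(us, 2 / 3, rng) for _ in range(100_000)) / 100_000
# exact and approx agree up to Monte Carlo noise
```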

  22. Optimal Value Functions for FSMs • V*(s): discounted utility of the OPTIMAL policy from state s • V0*(s): immediate maximum utility at state s • V*(B): discounted utility of the OPTIMAL policy given belief B over states • V0*(B): immediate maximum utility given belief B over states • $\Pr[\text{GO}] = \gamma$, $\Pr[\text{STOP}] = 1 - \gamma$
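When Bob’s machine is known, V*(s) can be computed by value iteration over Bob’s states. A minimal sketch, assuming Bob plays grim trigger (the machine from slide 16) and using Alice’s payoffs from slide 14:

```python
# Bob's grim-trigger FSM: a state is Bob's current action, and the
# transition depends on Alice's action this round.
TRANS = {("C", "C"): "C", ("C", "D"): "D",
         ("D", "C"): "D", ("D", "D"): "D"}
# Alice's payoff, keyed by (Alice's action, Bob's action).
U_ALICE = {("C", "C"): -1, ("C", "D"): -6, ("D", "C"): 0, ("D", "D"): -5}

def value_iteration(trans, u, gamma, sweeps=500):
    states = {s for s, _ in trans}
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: max(u[(a, s)] + gamma * V[trans[(s, a)]] for a in "CD")
             for s in states}
    return V

V = value_iteration(TRANS, U_ALICE, gamma=0.5)
# gamma = 0.5 > 1/5, so the optimal policy cooperates forever:
# V["C"] = -1 / (1 - 0.5) = -2.0, while triggering gives at best -5.
```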

  23. Best Responses, Discounted Utility • If γ > 1/5, a policy is a best response to grim trigger iff it always cooperates when playing grim trigger. [figure: the grim-trigger FSM from slide 16]
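The threshold is quick to verify (a worked check, not spelled out on the slide): compare cooperating forever against grim trigger with defecting once and then facing permanent defection.

```latex
% Cooperate forever against grim trigger:
\sum_{t=1}^{\infty} \gamma^{t-1}(-1) \;=\; \frac{-1}{1-\gamma}
% Defect once, then best-respond (defect) against permanent defection:
0 + \sum_{t=2}^{\infty} \gamma^{t-1}(-5) \;=\; \frac{-5\gamma}{1-\gamma}
% Cooperation is strictly better iff -1 > -5\gamma, i.e. iff \gamma > 1/5.
```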

  24. Best Responses, Discounted Utility • Similarly, if γ > 1/5, a policy is a best response to tit-for-tat iff it always cooperates when playing tit-for-tat. [figure: the tit-for-tat FSM from slide 19]

  25. Knowing Versus Learning • Given a known FSM for the opponent, we can determine the optimal policy (for some γ) from an initial state. • However, if it is an unknown FSM, by the time we learn what it is, it will be too late to act optimally.

  26. Grim Trigger or Always Cooperate? [figure: the two FSMs side by side; both start in “Bob cooperates” and behave identically until Alice first defects] • The two machines cannot be told apart without defecting, and defecting against grim trigger is irreversible. For learning, optimality from the initial state is a bad goal.

  27. Deterministic Infinite SMs • Represent any deterministic policy • De-randomization [figure: an infinite tree of states, each labeled with an action (C or D), one branch per opponent history]

  28. New Goal • Can a measure of regret allow us to play like tit-for-tat in the Infinitely Repeated Prisoner’s Dilemma? • In addition, it should be possible for one algorithm to minimize regret against all possible opponents (finite and infinite SMs).

  29. Outline • Introduction • Repeated Prisoners’ Dilemma • Tit-for-Tat • Grim Trigger • Traditional Regret • Response Regret • Conclusion

  30. Traditional Regret: Rock-Paper-Scissors [figure only in the original slides]

  31. Traditional Regret: Rock-Paper-Scissors [figure only in the original slides]

  32-34. Rock-Paper-Scissors [three animation frames: Bob plays BR (best response) to Alice’s last action]

  35. Utility of the Algorithm • Define $u_t$ to be the utility of ALG at time t. • Define $u_0^{ALG}$ to be: $u_0^{ALG} = \frac{1}{T}\sum_{t=1}^{T} u_t$ • Here: $u_0^{ALG} = \frac{1}{5}(0 + 1 + (-1) + 1 + 0) = 1/5$

  36. Rock-Paper-Scissors [figure: visit counts for Bob’s internal states: 3 visits, 1 visit, 1 visit; $u_0^{ALG} = 1/5$]

  37. Rock-Paper-Scissors [figure: the same visit counts as frequencies: 3/5, 1/5, 1/5; $u_0^{ALG} = 1/5$]

  38. Rock-Paper-Scissors [figure: Alice’s fixed actions evaluated against those frequencies earn 0, 2/5, and -2/5; $u_0^{ALG} = 1/5$]

  39. Traditional Regret • Consider B to be the empirical frequency with which states were visited. • Define $u_0^{ALG}$ to be the average utility of the algorithm. • Traditional regret of ALG is: $R = V_0^*(B) - u_0^{ALG}$ • Here: $R = 2/5 - 1/5 = 1/5$
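Spelling the computation out (a sketch; the visit counts 3, 1, 1 and the utility stream match the slides, but which concrete rock/paper/scissors actions carry which counts is my own guess):

```python
# Alice's utility, keyed by (Alice's action, Bob's action).
RPS = ["rock", "paper", "scissors"]
U = {("rock", "rock"): 0,      ("rock", "paper"): -1,    ("rock", "scissors"): 1,
     ("paper", "rock"): 1,     ("paper", "paper"): 0,    ("paper", "scissors"): -1,
     ("scissors", "rock"): -1, ("scissors", "paper"): 1, ("scissors", "scissors"): 0}

def traditional_regret(bob_plays, alice_utilities):
    T = len(bob_plays)
    # B: empirical frequency of Bob's states (here, his actions).
    freq = {b: bob_plays.count(b) / T for b in set(bob_plays)}
    # V0*(B): value of the best fixed action against the empirical mix.
    v0_star = max(sum(p * U[(a, b)] for b, p in freq.items()) for a in RPS)
    u0_alg = sum(alice_utilities) / T  # average realized utility
    return v0_star - u0_alg

bob_plays = ["rock", "rock", "rock", "paper", "scissors"]  # counts 3, 1, 1
alice_utilities = [0, 1, -1, 1, 0]                         # as on slide 35
print(traditional_regret(bob_plays, alice_utilities))      # 2/5 - 1/5 = 0.2
```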

  40. Traditional Regret • Goal: regret approaches zero almost surely. • There exists an algorithm that achieves this against all opponents.

  41. What Algorithm? • Gradient Ascent with Euclidean Projection (Zinkevich, 2003): [formula on the original slide] • (when the p_i are strictly positive)
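The update formula appeared as an image; what follows is a sketch of the standard form of the algorithm: step in the utility-gradient direction, then project back onto the probability simplex. The projection here is the usual sorting-based routine, and the step size eta is a free parameter.

```python
import numpy as np

def project_to_simplex(v):
    # Euclidean projection of v onto {p : p >= 0, sum(p) = 1}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def gradient_step(p, utility_gradient, eta):
    # When every p_i stays strictly positive, the projection reduces to
    # re-centering the step (the slide's parenthetical special case).
    return project_to_simplex(p + eta * utility_gradient)

p = np.ones(3) / 3  # start uniform over 3 actions
p = gradient_step(p, np.array([1.0, 0.0, -1.0]), eta=0.1)
```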

  42. What Algorithm? • Exponentially Weighted Experts (Littlestone and Warmuth, 1994): [formula on the original slide] • And a close relative: [formula on the original slide]
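The formulas here were also images; this sketch is the standard exponentially weighted update, written for rewards rather than losses (one natural reading of the slide; the “close relative” may refer to Hedge of Freund and Schapire):

```python
import numpy as np

def exponential_weights(utility_rows, eta=0.1):
    # utility_rows: iterable of length-N arrays; row t gives each
    # expert's utility at time t. Returns the sequence of mixtures played.
    w, mixtures = None, []
    for u_t in utility_rows:
        u_t = np.asarray(u_t, dtype=float)
        if w is None:
            w = np.ones_like(u_t)
        mixtures.append(w / w.sum())
        w = w * np.exp(eta * u_t)  # multiplicative update on each weight
    return mixtures
```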

  43. What Algorithm? • Regret Matching: [formula on the original slide]
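Regret matching’s formula was an image too; the standard rule (Hart and Mas-Colell) plays each action with probability proportional to its positive cumulative regret. A sketch:

```python
import numpy as np

def regret_matching_mixture(cumulative_regret):
    # Probability of each action is proportional to its positive regret;
    # with no positive regret anywhere, play uniformly.
    pos = np.maximum(cumulative_regret, 0.0)
    total = pos.sum()
    n = len(cumulative_regret)
    return pos / total if total > 0 else np.full(n, 1.0 / n)

def update_cumulative_regret(cumulative_regret, action_utilities, realized):
    # Regret of each action: what it would have earned minus what we earned.
    return cumulative_regret + (np.asarray(action_utilities) - realized)
```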

  44. What Algorithm? • Lots of them!

  45. Extensions to Traditional Regret (Foster and Vohra, 1997) • Into the past: condition on a short history • Optimal against BR to Alice’s Last.

  46. Extensions to Traditional Regret • (Auer et al.) • Only see $u_t$, not $u_{i,t}$ • Use an unbiased estimator of $u_{i,t}$: [formula on the original slide]
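The estimator on the slide was an image; the standard importance-weighted construction (as in Auer et al.’s Exp3) divides the observed utility by the probability of the action actually chosen:

```python
def estimate_utilities(chosen, u_t, p):
    # u_hat[i] = u_t / p[i] if i was chosen, else 0. Unbiased, since
    # E[u_hat[i]] = p[i] * (u_t / p[i]) + (1 - p[i]) * 0 = u_{i,t}.
    return [u_t / p[i] if i == chosen else 0.0 for i in range(len(p))]
```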

  47. Outline • Introduction • Repeated Prisoners’ Dilemma • Tit-for-Tat • Grim Trigger • Traditional Regret • Response Regret • Conclusion

  48. This Talk • Do you want to (minimize traditional regret)? • Even then, is it possible?

  49. Traditional Regret: Prisoner’s Dilemma [figure: the tit-for-tat FSM with a sample history (CC, DC, DD, DD, DD, …) in which play collapses into permanent mutual defection]

  50. Traditional Regret: Prisoner’s Dilemma [figure: tit-for-tat with an empirical belief over Bob’s states: Bob cooperates with frequency 0.2 and defects with frequency 0.8] • Against that belief: Alice defects: -4, Alice cooperates: -5 (since 0.2·0 + 0.8·(-5) = -4 versus 0.2·(-1) + 0.8·(-6) = -5), so traditional regret recommends defection, ignoring the delayed effect of making tit-for-tat defect.
