
Presentation Transcript


  1. Outline • MDP (brief) • Background • Learning MDP • Q learning • Game theory (brief) • Background • Markov games (2-player) • Background • Learning Markov games • Littman’s Minimax Q learning (zero-sum) • Hu & Wellman’s Nash Q learning (general-sum)

  2. Stochastic games (SG) • Partially observable stochastic games (POSG)

  3. The Bellman backup: the immediate reward, plus the expectation over next states of the value of the next state.
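
The labels above annotate the Bellman optimality equation for the MDP value function; a standard way to write it (a reconstruction, not copied from the slide) is:

```latex
v(s) \;=\; \max_{a \in A} \Big[\,
    \underbrace{R(s,a)}_{\text{immediate reward}}
    \;+\; \gamma \underbrace{\sum_{s' \in S} T(s,a,s')}_{\text{expectation over next states}}
    \underbrace{v(s')}_{\text{value of next state}} \,\Big]
```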

  4. Model-based reinforcement learning: • Learn the reward function and the state transition function • Solve for the optimal policy • Model-free reinforcement learning: • Directly learn the optimal policy without knowing the reward function or the state transition function

  5. Estimating the model from experience: the transition function is estimated as (#times action a causes state transition s → s′) divided by (#times action a has been executed in state s), and the reward function as (total reward accrued when applying a in s) divided by (#times action a has been executed in state s). (A sketch follows below.)
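
A minimal sketch of these maximum-likelihood model estimates, assuming tabular counts (the variable names are illustrative, not from the slides):

```python
from collections import defaultdict

# Counts gathered while acting in the environment.
n_sas = defaultdict(int)         # n(s, a, s'): times action a caused the transition s -> s'
n_sa = defaultdict(int)          # n(s, a): times action a was executed in state s
reward_sum = defaultdict(float)  # total reward accrued when applying a in s

def record(s, a, r, s_next):
    """Update the counts after observing one transition (s, a, r, s')."""
    n_sas[(s, a, s_next)] += 1
    n_sa[(s, a)] += 1
    reward_sum[(s, a)] += r

def T_hat(s, a, s_next):
    """Estimated transition probability: n(s, a, s') / n(s, a)."""
    return n_sas[(s, a, s_next)] / n_sa[(s, a)] if n_sa[(s, a)] else 0.0

def R_hat(s, a):
    """Estimated reward: total reward for (s, a) / n(s, a)."""
    return reward_sum[(s, a)] / n_sa[(s, a)] if n_sa[(s, a)] else 0.0
```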

  6. [Equation slide: the value update written in terms of the next-state value v(s′).]

  7. Start with arbitrary initial values of Q(s,a), for all s ∈ S, a ∈ A • At each time t the agent chooses an action and observes its reward r_t • The agent then updates its Q-values based on the Q-learning rule (a sketch follows below) • The learning rate α_t needs to decay over time in order for the learning algorithm to converge
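
A minimal tabular sketch of that update; the 1/n(s,a) learning-rate schedule is one common choice that satisfies the decay condition, not necessarily the one used in the slides:

```python
from collections import defaultdict

Q = defaultdict(float)     # Q(s, a), arbitrary initial values (here 0)
visits = defaultdict(int)  # visit counts, used to decay the learning rate
gamma = 0.9                # discount factor

def q_update(s, a, r, s_next, actions):
    """One Q-learning step:
    Q(s,a) <- Q(s,a) + alpha_t * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    visits[(s, a)] += 1
    alpha_t = 1.0 / visits[(s, a)]  # decaying learning rate
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha_t * (r + gamma * best_next - Q[(s, a)])
```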

  8. Famous game theory example

  9. A co-operative game

  10. A Markov game is a generalization of the MDP to multiple agents; its policies are mixed strategies.
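
One common way to write the two-player Markov game this slide refers to (notation assumed here, not taken from the slide):

```latex
% Two-player Markov game (stochastic game):
\langle S,\; A,\; O,\; T,\; R_1,\; R_2 \rangle, \qquad
T : S \times A \times O \to \Delta(S), \qquad
R_i : S \times A \times O \to \mathbb{R}

% In the zero-sum case considered by Minimax-Q, R_2 = -R_1.
% A mixed strategy for the agent in state s is a probability distribution
% over its actions, \pi(s) \in \Delta(A), rather than a single fixed action.
```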

  11. • Stationary: the agent’s policy does not change over time • Deterministic: the same action is always chosen whenever the agent is in state s

  12. Example [diagram of a two-state game: State 1 and State 2]

  13. v(s,*)  v(s,) for all s  S,  

  14. Finding the optimal mixed strategy for rock–paper–scissors as a linear program: maximize V such that rock + paper + scissors = 1, each probability is non-negative, and the expected payoff against every opponent action is at least V. (A worked example follows below.)
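
A small sketch of that linear program for rock–paper–scissors, solved with SciPy's linprog; the payoff matrix is the standard one (win = +1, loss = -1, tie = 0) and is assumed rather than taken from the slide:

```python
import numpy as np
from scipy.optimize import linprog

# Row player's payoffs: rows = our action (rock, paper, scissors),
# columns = the opponent's action.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

# Decision variables x = (pi_rock, pi_paper, pi_scissors, V).
# Maximize V subject to: expected payoff against every opponent action >= V,
# and the probabilities summing to 1.
c = np.array([0, 0, 0, -1])                  # linprog minimizes, so minimize -V
A_ub = np.hstack([-A.T, np.ones((3, 1))])    # V - (A^T pi)_o <= 0 for each opponent action o
b_ub = np.zeros(3)
A_eq = np.array([[1, 1, 1, 0]])              # rock + paper + scissors = 1
b_eq = np.array([1.0])
bounds = [(0, 1)] * 3 + [(None, None)]       # probabilities in [0, 1], V unbounded

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x[:3], res.x[3])  # approximately [1/3, 1/3, 1/3] and V = 0
```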

  15. Worst case • expectation over all actions • best response — labels on the minimax value function (written out below).
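
Under Littman's Minimax-Q notation these labels belong to the minimax value function, which in its usual form reads:

```latex
v(s) \;=\; \max_{\pi \in \Delta(A)} \;
    \underbrace{\min_{o \in O}}_{\text{worst case}} \;
    \underbrace{\sum_{a \in A} \pi_a \, Q(s, a, o)}_{\text{expectation over the agent's actions}}
```

The maximizing π is the agent's best response to that worst-case opponent play.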

  16. Q(s,a,o) is the quality of a state-action pair: the discounted value of all succeeding states weighted by their likelihood. The learning rule updates Q toward the reward plus the discounted value of the succeeding state, and converges to the correct values of Q and v. (Both are written out below.)
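
A standard reconstruction of the two quantities described above (α is the learning rate; notation follows the minimax value function just shown):

```latex
% Quality of a state-action pair against opponent action o:
Q(s,a,o) \;=\; R(s,a,o) \;+\; \gamma \sum_{s' \in S} T(s,a,o,s')\, v(s')

% Learning rule after observing reward r and next state s':
Q(s,a,o) \;\leftarrow\; (1-\alpha)\, Q(s,a,o) \;+\; \alpha\,\bigl(r + \gamma\, v(s')\bigr)
```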

  17. Q(s,a,o): the expected reward for taking action a when the opponent chooses o from state s • explor controls how often the agent will deviate from its current policy (a sketch of this action selection follows below)
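
A small sketch of how an explor parameter of this kind is typically used for action selection (the policy representation and names here are illustrative, not Littman's code):

```python
import random

def choose_action(policy_s, actions, explor):
    """With probability `explor`, deviate from the current policy and explore
    uniformly at random; otherwise sample from the current mixed policy pi(s)."""
    if random.random() < explor:
        return random.choice(list(actions))
    weights = [policy_s[a] for a in actions]   # pi(s, a) for each action
    return random.choices(list(actions), weights=weights)[0]
```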

  18. Hu & Wellman: general-sum Markov games as a framework for RL • Theorem (Nash, 1951): there exists a mixed-strategy Nash equilibrium for any finite bimatrix game
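
For a finite bimatrix game with payoff matrices A (row player) and B (column player), the condition guaranteed by Nash's theorem is the standard one:

```latex
(\pi_1^*, \pi_2^*) \text{ is a Nash equilibrium iff } \quad
{\pi_1^*}^{\top} A\, \pi_2^* \;\ge\; \pi_1^{\top} A\, \pi_2^*
\;\; \forall\, \pi_1 \in \Delta(A_1),
\qquad
{\pi_1^*}^{\top} B\, \pi_2^* \;\ge\; {\pi_1^*}^{\top} B\, \pi_2
\;\; \forall\, \pi_2 \in \Delta(A_2)
```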
