## Reinforcement Learning


Chapter 21
Vassilis Athitsos

**Reinforcement Learning**
- In previous chapters: learning from examples.
- Reinforcement learning: learning what to do.
  - Learning to fly (a helicopter).
  - Learning to play a game.
  - Learning to walk.
- Learning based on rewards.

**Relation to MDPs**
- Feedback can be provided at the end of the sequence of actions, or more frequently.
  - Compare chess and ping-pong.
- No complete model of the environment:
  - Transitions may be unknown.
  - The reward function may be unknown.

**Agents**
- Utility-based agent: learns a utility function on states.
- Q-learning agent: learns a utility function on (action, state) pairs.
- Reflex agent: learns a function mapping states directly to actions.

**Passive Reinforcement Learning**
- Assume a fully observable environment.
- Passive learning: the policy is fixed (behavior does not change); the agent learns how good each state is.
- Similar to policy evaluation, except that the transition function and reward function are unknown.
- Why is it useful? For future policy revisions.

**Direct Utility Estimation**
- For each state the agent ever visits, and for each visit to that state, keep track of the accumulated rewards from that visit onwards.
- Similar to inductive learning: learning a function on states from samples.
- Weaknesses:
  - Ignores correlations between the utilities of neighboring states.
  - Converges very slowly.

**Adaptive Dynamic Programming**
- Learns the transition probabilities and state utilities.
- Plugs the learned values into the Bellman equations.
- Solves the equations with linear algebra or policy iteration.
- Problem: intractable for a large number of states.
  - Example: backgammon, with roughly 10^50 equations in 10^50 unknowns.

**Temporal Difference**
- Every time we make a transition from state s to state s':
  - Update the utility of s': U[s'] = the currently observed reward.
  - Update the utility of s: U[s] ← (1 - α) U[s] + α (r + γ U[s']), where:
    - α: learning rate.
    - r: reward observed in s.
    - γ: discount factor.

**Properties of Temporal Difference**
- What happens when an unlikely transition occurs?
  - U[s] becomes a bad approximation of the true utility.
  - However, U[s] is rarely a bad approximation, and the average value of U[s] converges to the correct value.
  - If α decreases over time, U[s] itself converges to the correct value.

**Hybrid Methods**
- ADP: more accurate, but slower and intractable for large numbers of states.
- TD: less accurate, but faster and tractable.
- An intermediate approach: pseudo-experiences.
  - Imagine transitions that have not happened.
  - Update utilities according to those imagined transitions.
- Making ADP more efficient:
  - Do a limited number of adjustments after each transition.
  - Use the estimated transition probabilities to identify the most useful adjustments.

**Active Reinforcement Learning**
- Passive reinforcement learning yields utilities of states and transition probabilities, which can be plugged into the Bellman equations.
- Problem: the Bellman equations give optimal solutions given the correct utility and transition functions, but passive reinforcement learning only produces approximate estimates of those functions.
- Solutions?

**Exploration/Exploitation**
- The goal is to maximize utility, but the utility function is only approximately known.
- Dilemma: should the agent
  - maximize utility based on its current knowledge, or
  - try to improve its current knowledge?
- Answer: a little of both.

**Exploration Function**
- U[s] = R[s] + γ max_a f(Q(a, s), N(a, s)), where:
  - R[s]: current reward.
  - γ: discount factor.
  - Q(a, s): estimated utility of performing action a in state s.
  - N(a, s): number of times action a has been performed in state s.
  - f(u, n): preference according to the utility and the degree of exploration so far for (a, s).
- Initialization: U[s] = an optimistically large value.

**Q-learning**
- Learns the utility of state-action pairs: U[s] = max_a Q(a, s).
- Learning can be done using TD: Q(a, s) ← (1 - β) Q(a, s) + β (R(s) + γ max_{a'} Q(a', s')), where:
  - β: learning rate.
  - γ: discount factor.
  - s': next state.
  - a': a possible action at the next state.

**Generalization in Reinforcement Learning**
- How do we apply reinforcement learning to problems with huge numbers of states (chess, backgammon, ...)?
- Solution, similar to estimating the probabilities of a huge number of events: learn parametric functions, where the parameters are features of each state.
- Example: chess, where about 20 features are adequate for describing the current board.

**Learning Parametric Utility Functions for Backgammon**
- First approach:
  - Design weighted linear functions of 16 terms.
  - Collect a training set of board states and ask human experts to evaluate the training states.
  - Result: the program was not competitive with human experts, and collecting the training data was very tedious.
- Second approach:
  - Design weighted linear functions of 16 terms.
  - Let the system play against itself, with the reward provided at the end of each game.
  - Result (after 300,000 games, taking a few weeks): the program was competitive with the best players in the world.
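The passive temporal-difference update from the slides can be sketched in a few lines of Python. Everything about the environment here is an illustrative assumption, not part of the lecture: a four-state chain under a fixed "move right" policy, a 10% chance of slipping backwards, a step reward of -0.04, and a reward of 1 at the goal. The update line itself is the rule U[s] ← (1 - α)U[s] + α(r + γU[s']), with a learning rate that decays as 1/N(s) so the estimates converge.

```python
import random

# Toy 4-state chain: 0 -> 1 -> 2 -> 3 (goal). The chain, the slip
# probability, and the rewards are illustrative assumptions.
GOAL = 3

def step(s):
    """Follow the fixed policy 'move right'; slip backwards 10% of the time."""
    s2 = max(s - 1, 0) if random.random() < 0.1 else s + 1
    r = 1.0 if s2 == GOAL else -0.04
    return s2, r

def td_evaluate(episodes=5000, gamma=0.9):
    """TD(0) policy evaluation: learn U without a transition model."""
    U = [0.0] * (GOAL + 1)
    counts = [0] * (GOAL + 1)
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            s2, r = step(s)
            counts[s] += 1
            alpha = 1.0 / counts[s]  # decaying learning rate -> convergence
            # TD update: U[s] <- (1 - alpha) U[s] + alpha (r + gamma U[s'])
            U[s] = (1 - alpha) * U[s] + alpha * (r + gamma * U[s2])
            s = s2
    return U
```

Note that, exactly as the slides say, no transition probabilities are ever estimated: the observed transitions themselves supply the samples, which is what makes TD tractable where ADP is not.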
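The Q-learning update Q(a, s) ← (1 - β)Q(a, s) + β(R + γ max_{a'} Q(a', s')) can likewise be sketched in tabular form. The five-state chain, the constants, and the use of simple ε-greedy exploration (swapped in here for the slides' exploration function f, as a cruder way of mixing exploitation and exploration) are all illustrative assumptions.

```python
import random

# Tabular Q-learning on a toy 5-state chain (state 4 is the goal).
# Actions: 0 = left, 1 = right. Environment and constants are
# illustrative assumptions, not from the lecture.
N_STATES, GOAL = 5, 4

def step(s, a):
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == GOAL else -0.04
    return s2, r

def q_learn(episodes=2000, beta=0.5, gamma=0.9, eps=0.1):
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            # epsilon-greedy: explore with probability eps, else exploit Q
            if random.random() < eps:
                a = random.randrange(2)
            else:
                a = max((0, 1), key=lambda x: Q[s][x])
            s2, r = step(s, a)
            # Q(a,s) <- (1 - beta) Q(a,s) + beta (r + gamma max_a' Q(a',s'))
            Q[s][a] = (1 - beta) * Q[s][a] + beta * (r + gamma * max(Q[s2]))
            s = s2
    return Q
```

Because the agent learns Q(a, s) directly, it can act greedily via U[s] = max_a Q(a, s) without ever learning a transition model, which is the practical appeal of Q-learning noted in the slides.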