
Tópicos Especiais em Aprendizagem


Presentation Transcript


  1. Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2012

  2. Goal of this Lecture • Reinforcement Learning: • Eligibility Traces. • Generalization and Function Approximation. • Today's lecture: chapters 7 and 8 of Sutton & Barto.

  3. Generalization and Function Approximation Chapter 8 of Sutton & Barto.

  4. Objectives • Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part. • Overview of function approximation (FA) methods and how they can be adapted to RL. • Not covered as deeply as in the book (Bianchi's comment).

  5. Value Prediction with Function Approximation As usual: Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function. In earlier chapters, value functions were stored in lookup tables.
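
Following Sutton & Barto's notation, the state-value function being predicted and its parameterized approximation can be written as:

  V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\Big|\; s_t = s \Big\},
  \qquad V_t(s) \approx V^\pi(s), \quad V_t \text{ parameterized by a weight vector } \vec{\theta}_t .

The key change from the tabular case is that adjusting \vec{\theta}_t at one state now changes the estimated values of many other states.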

  6. Adapt Supervised Learning Algorithms A supervised learning system maps inputs to outputs; the training info is the desired (target) outputs. Training example = {input, target output}. Error = (target output – actual output).

  7. Backups as Training Examples Each backup can be viewed as a training example: the input is a description of the state being backed up, and the target output is the backed-up value.
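
For example, a TD(0) backup of state s_t yields, in the book's notation, the training example

  \big\{ \underbrace{\text{description of } s_t}_{\text{input}},\;\; \underbrace{r_{t+1} + \gamma V_t(s_{t+1})}_{\text{target output}} \big\}.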

  8. Any FA Method? • In principle, yes: • artificial neural networks • decision trees • multivariate regression methods • etc. • But RL has some special requirements: • usually want to learn while interacting • ability to handle nonstationarity • other?

  9. Gradient Descent Methods Assume the approximate value function V_t is a (sufficiently smooth) differentiable function of a column parameter vector \vec{\theta}_t = (\theta_t(1), \theta_t(2), \ldots, \theta_t(n))^{\top}.

  10. Performance Measures • A common and simple one is the mean-squared error (MSE) over a distribution P : • Let us assume that P is always the distribution of states at which backups are done. • The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
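
In the book's notation, this performance measure is

  MSE(\vec{\theta}_t) = \sum_{s \in S} P(s)\,\big[ V^\pi(s) - V_t(s) \big]^2 .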

  11. Gradient Descent Iteratively move down the gradient:
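
For any differentiable function f of the parameter vector, the gradient and the generic descent step are

  \nabla_{\vec{\theta}} f(\vec{\theta}) = \Big( \tfrac{\partial f}{\partial \theta(1)}, \tfrac{\partial f}{\partial \theta(2)}, \ldots, \tfrac{\partial f}{\partial \theta(n)} \Big)^{\!\top},
  \qquad
  \vec{\theta}_{t+1} = \vec{\theta}_t - \alpha \, \nabla_{\vec{\theta}_t} f(\vec{\theta}_t) .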

  12. Gradient Descent Cont. For the MSE given above and using the chain rule:
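
Applying this step to the MSE above and using the chain rule gives, in the book's notation,

  \vec{\theta}_{t+1} = \vec{\theta}_t - \tfrac{1}{2}\alpha \, \nabla_{\vec{\theta}_t} \sum_{s \in S} P(s)\,\big[ V^\pi(s) - V_t(s) \big]^2
                     = \vec{\theta}_t + \alpha \sum_{s \in S} P(s)\,\big[ V^\pi(s) - V_t(s) \big]\, \nabla_{\vec{\theta}_t} V_t(s) .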

  13. Gradient Descent Cont. Use just the sample gradient instead: Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if α decreases appropriately with t.
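
The sample-gradient update, evaluated only at the state visited at time t, is

  \vec{\theta}_{t+1} = \vec{\theta}_t - \tfrac{1}{2}\alpha \, \nabla_{\vec{\theta}_t} \big[ V^\pi(s_t) - V_t(s_t) \big]^2
                     = \vec{\theta}_t + \alpha \,\big[ V^\pi(s_t) - V_t(s_t) \big]\, \nabla_{\vec{\theta}_t} V_t(s_t) .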

  14. But We Don’t have these Targets
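
Since V^\pi(s_t) is unknown, some target v_t is used in its place:

  \vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \,\big[ v_t - V_t(s_t) \big]\, \nabla_{\vec{\theta}_t} V_t(s_t) .

If v_t is an unbiased estimate of V^\pi(s_t), e.g. the Monte Carlo return v_t = R_t, the same local-convergence argument still applies.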

  15. What about TD(λ) Targets?
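
Taking the λ-return as the target, v_t = R_t^\lambda, gives

  \vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \,\big[ R_t^\lambda - V_t(s_t) \big]\, \nabla_{\vec{\theta}_t} V_t(s_t) .

For λ < 1 this target bootstraps and is therefore biased, but it is the basis of the gradient-descent TD(λ) algorithm on the next slide.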

  16. On-Line Gradient-Descent TD(λ)
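
A minimal Python sketch of the backward view of this algorithm (eligibility-trace vector e and TD error δ), assuming hypothetical helpers v_fn(theta, s) and grad_fn(theta, s) for the approximate value and its gradient, a fixed policy(s), and an env whose step(a) returns (next_state, reward, done):

    import numpy as np

    def gradient_td_lambda(env, policy, v_fn, grad_fn, theta,
                           alpha=0.01, gamma=0.99, lam=0.9, episodes=100):
        """On-line gradient-descent TD(lambda) for policy evaluation (sketch)."""
        theta = np.asarray(theta, dtype=float).copy()
        for _ in range(episodes):
            s = env.reset()
            e = np.zeros_like(theta)                         # eligibility trace vector
            done = False
            while not done:
                s_next, r, done = env.step(policy(s))
                v_next = 0.0 if done else v_fn(theta, s_next)
                delta = r + gamma * v_next - v_fn(theta, s)  # TD error
                e = gamma * lam * e + grad_fn(theta, s)      # accumulate the trace
                theta = theta + alpha * delta * e            # gradient step along the trace
                s = s_next
        return theta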

  17. Linear Methods
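
In linear methods, the value estimate is a weighted sum of a feature vector \vec{\phi}_s associated with each state:

  V_t(s) = \vec{\theta}_t^{\,\top} \vec{\phi}_s = \sum_{i=1}^{n} \theta_t(i)\, \phi_s(i) .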

  18. Nice Properties of Linear FA Methods • The gradient is very simple: • For MSE, the error surface is simple: a quadratic surface with a single minimum. • Linear gradient descent TD(λ) converges: • Step size decreases appropriately • On-line sampling (states sampled from the on-policy distribution) • Converges to a parameter vector with the property:
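
In the book's notation, the simple gradient and the convergence guarantee referred to above are

  \nabla_{\vec{\theta}_t} V_t(s) = \vec{\phi}_s ,
  \qquad
  MSE(\vec{\theta}_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma}\, \min_{\vec{\theta}} MSE(\vec{\theta}) ;

that is, the asymptotic error is within a bounded factor of the smallest error achievable by any linear approximator.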

  19. Most commonly used linear methods • Coarse Coding • Tile Coding (CMAC) • Radial Basis Functions • Kanerva Coding

  20. Coarse Coding Generalization from state X to state Y depends on the number of their features whose receptive fields overlap.

  21. Coarse Coding Generalization in linear function approximation methods is determined by the sizes and shapes of the features' receptive fields. All three of these cases have roughly the same number and density of features.

  22. Coarse Coding

  23. Learning and Coarse Coding Example of feature width's strong effect on initial generalization (first row) and weak effect on asymptotic accuracy.

  24. Tile Coding • Binary feature for each tile • Number of features present at any one time is constant • Binary features mean the weighted sum is easy to compute • Easy to compute the indices of the features present

  25. Tile Coding
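
A minimal Python sketch of the idea on the two slides above: several slightly offset grid tilings over a 2-D input, each contributing exactly one active binary feature. The tiling sizes, offsets, and flat indexing below are illustrative simplifications, not the CMAC implementation used in the book.

    import numpy as np

    def active_tiles(x, y, n_tilings=8, tiles_per_dim=10,
                     x_range=(0.0, 1.0), y_range=(0.0, 1.0)):
        """Return one active (binary) feature index per tiling for the point (x, y)."""
        # Normalise both inputs to [0, 1] and clamp points that fall outside the range
        x = min(max((x - x_range[0]) / (x_range[1] - x_range[0]), 0.0), 1.0)
        y = min(max((y - y_range[0]) / (y_range[1] - y_range[0]), 0.0), 1.0)
        features = []
        for t in range(n_tilings):
            offset = t / (n_tilings * tiles_per_dim)      # each tiling is shifted slightly
            col = min(int((x + offset) * tiles_per_dim), tiles_per_dim - 1)
            row = min(int((y + offset) * tiles_per_dim), tiles_per_dim - 1)
            features.append(t * tiles_per_dim ** 2 + row * tiles_per_dim + col)
        return features

    def tile_value(theta, features):
        """With binary features, the weighted sum reduces to summing a few weights."""
        return float(np.sum(theta[features]))

With n_tilings tilings, exactly n_tilings features are active for any input, so the number of features present at any one time is constant, as the slide notes.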

  26. Example: Simulated Soccer • How does the agent decide what to do with the ball? • Complexities: • Continuous inputs • High dimensionality • From the paper: Reinforcement Learning in Simulated Soccer with Kohonen Networks, by Chris White and David Brogan (University of Virginia)

  27. Problems • State space explodes exponentially in terms of dimensionality • Current methods of managing state space explosion lack automation • RL does not scale well to problems with the complexities of simulated soccer… Reinforcement Learning in Simulated Soccer with Kohonen Networks, by Chris White and David Brogan (University of Virginia)

  28. Quantization • Divide the state space into regions of interest • Tile Coding (Sutton & Barto, 1998) • No automated method for region granularity, heterogeneity, or location • Prefer a learned abstraction of the state space Reinforcement Learning in Simulated Soccer with Kohonen Networks, by Chris White and David Brogan (University of Virginia)

  29. Kohonen Networks • Clustering algorithm • Data driven • Example clusters: no nearby opponents; agent near opponent goal; teammate near opponent goal Reinforcement Learning in Simulated Soccer with Kohonen Networks, by Chris White and David Brogan (University of Virginia)

  30. State Space Reduction • 90 continuous-valued inputs describe the state of a soccer game • Naïve discretization → 2^90 states • Filter out unnecessary inputs → still 2^18 states • Clustering algorithm → only 5000 states • Big Win!!! Reinforcement Learning in Simulated Soccer with Kohonen Networks, by Chris White and David Brogan (University of Virginia)

  31. Two Pass Algorithm • Pass 1: • Use a Kohonen Network and a large training set to learn the state space • Pass 2: • Use Reinforcement Learning to learn utilities for the states (SARSA) Reinforcement Learning in Simulated Soccer with Kohonen Networks, by Chris White and David Brogan (University of Virginia)

  32. Fragility of Learned Actions What happens to the attacker's utility if the goalie crosses the dotted line? Reinforcement Learning in Simulated Soccer with Kohonen Networks, by Chris White and David Brogan (University of Virginia)

  33. Results • Evaluate three systems: • Control – random action selection • SARSA • Forcing Function • Evaluation criteria: • Goals scored • Time of possession Reinforcement Learning in Simulated Soccer with Kohonen Networks, by Chris White and David Brogan (University of Virginia)

  34. Cumulative Score Reinforcement Learning in Simulated Soccer with Kohonen Networks, by Chris White and David Brogan (University of Virginia)

  35. Team with Forcing Functions Reinforcement Learning in Simulated Soccer with Kohonen Networks, by Chris White and David Brogan (University of Virginia)

  36. Can you beat the “curse of dimensionality”? • Can you keep the number of features from going up exponentially with the dimension? • “Lazy learning” schemes: • Remember all the data • To get new value, find nearest neighbors and interpolate • e.g., locally-weighted regression
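
A minimal sketch of the lazy-learning idea above, using plain k-nearest-neighbour inverse-distance averaging (a simpler relative of locally-weighted regression); the array layout and the small epsilon constant are illustrative assumptions.

    import numpy as np

    def knn_value(query, states, values, k=5):
        """Estimate the value of `query` from the k nearest stored states.

        states : (N, d) array of previously visited states
        values : (N,)   array of the values recorded for those states
        """
        dists = np.linalg.norm(states - query, axis=1)   # distance to every stored state
        nearest = np.argsort(dists)[:k]                  # indices of the k closest
        weights = 1.0 / (dists[nearest] + 1e-8)          # inverse-distance weighting
        return float(np.dot(weights, values[nearest]) / weights.sum())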

  37. Can you beat the “curse of dimensionality”? • Function complexity, not dimensionality, is the problem. • Kanerva coding: • Select a bunch of binary prototypes • Use Hamming distance as the distance measure • Dimensionality is no longer a problem, only complexity
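
A minimal Python sketch of Kanerva coding as described here: random binary prototypes, Hamming distance, and a binary feature that is active when the state lies within a fixed Hamming radius of its prototype. The prototype count, bit width, and radius are illustrative choices.

    import numpy as np

    def make_prototypes(n_prototypes=1024, n_bits=256, seed=0):
        """Random binary prototypes scattered over the binary state space."""
        rng = np.random.default_rng(seed)
        return rng.integers(0, 2, size=(n_prototypes, n_bits))

    def kanerva_features(state_bits, prototypes, radius=100):
        """Binary features: 1 where the Hamming distance to a prototype is <= radius."""
        hamming = np.sum(prototypes != state_bits, axis=1)
        return (hamming <= radius).astype(float)

    def kanerva_value(theta, state_bits, prototypes, radius=100):
        """Linear value estimate: weighted sum over the active prototype features."""
        return float(theta @ kanerva_features(state_bits, prototypes, radius))

The number of features, and hence the complexity of the approximator, is set by the number of prototypes, independently of the dimensionality of the state.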

  38. Algorithms using Function Approximators • We now extend value prediction methods using function approximation to control methods, following the pattern of GPI. • First we extend the state-value prediction methods to action-value prediction methods, then we combine them with policy improvement and action selection techniques. • As usual, the problem of ensuring exploration is solved by pursuing either an on-policy or an off-policy approach.

  39. Control with FA • Learning state-action values • Training examples of the form: • The general gradient-descent rule:
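
In the book's notation, the training examples and the general update referred to above are

  \big\{ \text{description of } (s_t, a_t),\; v_t \big\},
  \qquad
  \vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \,\big[ v_t - Q_t(s_t, a_t) \big]\, \nabla_{\vec{\theta}_t} Q_t(s_t, a_t) .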

  40. Control with FA • Gradient-descent Sarsa(λ) (backward view):
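
In the book's notation, the backward-view updates are

  \delta_t = r_{t+1} + \gamma\, Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) ,
  \qquad
  \vec{e}_t = \gamma\lambda\, \vec{e}_{t-1} + \nabla_{\vec{\theta}_t} Q_t(s_t, a_t) ,
  \qquad
  \vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\, \delta_t\, \vec{e}_t .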

  41. Linear Gradient Descent Sarsa(λ)
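
A minimal Python sketch of this algorithm for binary features (e.g. tile-coding indices), using accumulating traces and ε-greedy action selection. The helper feature_indices(s, a) and the env interface (reset() and step(a) returning (next_state, reward, done)) are assumptions for illustration.

    import numpy as np

    def linear_sarsa_lambda(env, feature_indices, n_features, n_actions,
                            alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1,
                            episodes=500):
        """Linear gradient-descent Sarsa(lambda) with accumulating traces (sketch)."""
        theta = np.zeros(n_features)

        def q(s, a):
            # Binary features: Q(s, a) is the sum of the weights of the active features
            return theta[feature_indices(s, a)].sum()

        def choose(s):
            # Epsilon-greedy action selection over the approximate action values
            if np.random.rand() < epsilon:
                return np.random.randint(n_actions)
            return int(np.argmax([q(s, a) for a in range(n_actions)]))

        for _ in range(episodes):
            e = np.zeros(n_features)                  # eligibility traces
            s = env.reset()
            a = choose(s)
            done = False
            while not done:
                s_next, r, done = env.step(a)
                delta = r - q(s, a)                   # start of the TD error
                e[feature_indices(s, a)] += 1.0       # accumulating traces
                if done:
                    theta += alpha * delta * e        # terminal update: no bootstrap term
                    break
                a_next = choose(s_next)
                delta += gamma * q(s_next, a_next)    # complete the TD error
                theta += alpha * delta * e
                e *= gamma * lam                      # decay the traces
                s, a = s_next, a_next
        return theta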

  42. Linear Gradient Descent Q()

  43. Mountain-Car Task

  44. Mountain-Car Results The effect of alpha, lambda and the kind of traces on early performance on the mountain-car task. This study used five 9 x 9 tilings.

  45. Summary • Generalization • Adapting supervised-learning function approximation methods • Gradient-descent methods • Linear gradient-descent methods • Radial basis functions • Tile coding • Kanerva coding

  46. Summary • Nonlinear gradient-descent methods? Backpropagation? • Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction

  47. Conclusion

  48. Conclusion • We saw two important methods in today's lecture: • Eligibility traces, which provide a temporal generalization of learning. • Function approximators, which generalize the learned value function. • Both generalize learning.

  49. The End.
