
Searching in the Right Space


Presentation Transcript


  1. Searching in the Right Space Perspectives on Computational Reinforcement Learning Andrew G. Barto Autonomous Learning Laboratory Department of Computer Science University of Massachusetts Amherst Barto@cs.umass.edu

  2. Computational Reinforcement Learning "Reinforcement learning (RL) bears a tortuous relationship with historical and contemporary ideas in classical and instrumental conditioning." —Dayan 2001 [diagram: Computational Reinforcement Learning (RL) at the intersection of Artificial Intelligence (machine learning), Control Theory and Operations Research, Psychology, Neuroscience, and Artificial Neural Networks]

  3. The Plan • High-level intro to RL • Part I: The personal odyssey • Part II: The modern view • Part III: Intrinsically Motivated RL

  4. The View from Machine Learning • Unsupervised Learning • recode data based on some given principle • Supervised Learning • “Learning from examples”, “Learning with a teacher”, related to Classical (or Pavlovian) Conditioning • Reinforcement Learning • “Learning with a critic”, related to Instrumental (or Thorndikian) Conditioning

  5. Classical Conditioning (Pavlov, 1927) Tone (CS: Conditioned Stimulus) paired with Food (US: Unconditioned Stimulus), which elicits Salivation (UR: Unconditioned Response); after pairing, the tone alone produces anticipatory salivation (CR: Conditioned Response)

  6. Edward L. Thorndike (1874-1949) Learning by “Trial-and-Error” puzzle box

  7. Trial-and-Error = Error Correction Artificial Neural Network: learns from a set of examples via error-correction

  8. "Least-Mean-Square" (LMS) Learning Rule "delta rule", Adaline, Widrow and Hoff, 1960 [figure: an input pattern x1, ..., xn with adjustable weights w1, ..., wn produces actual output V, which is compared with the desired output z to adjust the weights] Δwi = α [ z − V ] xi
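A minimal Python sketch of the LMS/delta-rule update on this slide; the function and variable names are mine, not from the slide.

import numpy as np

def lms_update(w, x, z, alpha=0.1):
    """One LMS ("delta rule") step: move the weights toward the desired output.

    w: weight vector, x: input pattern, z: desired output, alpha: step size.
    """
    v = np.dot(w, x)                # actual output V = sum_i w_i x_i
    return w + alpha * (z - v) * x  # Delta w_i = alpha (z - V) x_i

# Usage: learn to reproduce a target linear map from random input patterns.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(2000):
    x = rng.normal(size=3)
    w = lms_update(w, x, z=float(np.dot(w_true, x)), alpha=0.05)
print(w)  # approaches w_true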

  9. Trial-and-Error? • "The boss continually seeks a better worker by trial and error experimentation with the structure of the worker. Adaptation is a multidimensional performance feedback process. The `error' signal in the feedback control sense is the gradient of the mean square error with respect to the adjustment." Widrow and Hoff, "Adaptive Switching Circuits" 1960 IRE WESCON Convention Record

  10. MENACE (Michie, 1961) "Matchbox Educable Noughts and Crosses Engine" [figure: an array of noughts-and-crosses (tic-tac-toe) positions, one matchbox of beads per position]
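A small Python sketch of the MENACE idea: one "matchbox" of beads per board position, moves drawn in proportion to bead counts, beads added after wins and removed after losses. The bead counts and reinforcement amounts here are illustrative, not Michie's actual values.

import random
from collections import defaultdict

class Matchbox:
    """MENACE-style learner: one 'matchbox' of beads per board position."""

    def __init__(self, initial_beads=3):
        # beads[position][move] = number of beads for that move in that box
        self.beads = defaultdict(lambda: defaultdict(lambda: initial_beads))

    def choose(self, position, legal_moves):
        # Draw a move with probability proportional to its bead count.
        box = self.beads[position]
        weights = [box[m] for m in legal_moves]
        return random.choices(legal_moves, weights=weights, k=1)[0]

    def reinforce(self, history, outcome):
        # history: list of (position, move) pairs played during one game.
        # outcome: +1 for a win (add a bead), -1 for a loss (remove one), 0 for a draw.
        for position, move in history:
            self.beads[position][move] = max(1, self.beads[position][move] + outcome)

The cached bead counts are the memory: the next game starts from whatever worked before, which is the "search + memory" point of slide 11.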

  11. Essence of RL (for me at least!): Search + Memory • Search: Trial-and-Error, Generate-and-Test, Variation-and-Selection, . . . • Memory: remember what worked best for each situation and start from there next time RL is about caching search results (so you don’t have to keep searching!)

  12. Generate-and-Test • Generator should be smart: • Generate lots of things that are likely to be good based on prior knowledge and prior experience • But also take chances … • Tester should be smart too: • Evaluate based on real criteria, not convenient surrogates • But be able to recognize partial success

  13. The Plan • High-level intro to RL • Part I: The personal odyssey • Part II: The modern view • Part III: Intrinsically Motivated RL

  14. Key Players • Harry Klopf • Rich Sutton • Me

  15. Arbib, Kilmer, and Spinelli, "Neural Models and Memory," in Neural Mechanisms of Learning and Memory, Rosenzweig and Bennett (eds.), 1974

  16. A. Harry Klopf “Brain Function and Adaptive Systems -- A Heterostatic Theory” Air Force Cambridge Research Laboratories Technical Report 3 March 1972 “…it is a theory which assumes that living adaptive systems seek, as their primary goal, a maximal condition (heterostasis), rather than assuming that the primary goal is a steady-state condition (homeostasis). It is further assumed that the heterostatic nature of animals, including man, derives from the heterostatic nature of neurons. The postulate that the neuron is a heterostat (that is, a maximizer) is a generalization of a more specific postulate, namely, that the neuron is a hedonist.”

  17. Klopf’s theory (very briefly!) • Inspiration: The nervous system is a society of self-interested agents. • Nervous Systems = Social Systems • Neuron = Man • Man = Hedonist • Neuron = Hedonist • Depolarization = Pleasure • Hyperpolarization = Pain • A neuronal model: • A neuron “decides” when to fire based on comparing a spatial and temporal summation of weighted inputs with a threshold. • A neuron is in a condition of heterostasis from time t to t + Δt if it maximizes the amount of depolarization and minimizes the amount of hyperpolarization over this interval. • Two ways to adapt weights to do this: • Push excitatory weights to upper limits; zero out inhibitory weights • Make neuron control its input.

  18. Heterostatic Adaptation • When a neuron fires, all of its synapses that were active during the summation of potentials leading to the response become eligible to undergo changes in their transmittances. • The transmittance of an eligible excitatory synapse increases if the generation of an action potential is followed by further depolarization for a limited time after the response. • The transmittance of an eligible inhibitory synapse increases if the generation of an action potential is followed by further hyperpolarization for a limited time after the response. • Add a mechanism that prevents synapses that participate in the reinforcement from undergoing changes due to that reinforcement (“zerosetting”).
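A schematic Python sketch of this mechanism as I read the slide (recently active synapses become eligible, and eligible synapses are strengthened if the firing is followed by further depolarization); this is my own rendering, not Klopf's equations.

def heterostatic_step(w, eligibility, x, fired, later_depolarization,
                      decay=0.9, rate=0.1):
    """One update of a list of excitatory weights w driven by inputs x.

    Inputs active when the neuron fires become eligible; eligible synapses are
    strengthened in proportion to any depolarization that follows the response.
    """
    eligibility = [decay * e + (xi if fired else 0.0)
                   for e, xi in zip(eligibility, x)]
    w = [wi + rate * later_depolarization * e
         for wi, e in zip(w, eligibility)]
    return w, eligibility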

  19. Key Components of Klopf’s Theory • Eligibility • Closed-loop control by neurons • Extremization (e.g., maximization) as goal instead of zeroing something • “Generalized Reinforcement”: reinforcement is not delivered by a specialized channel The Hedonistic Neuron: A Theory of Memory, Learning, and Intelligence A. Harry Klopf Hemisphere Publishing Corporation 1982

  20. Eligibility Traces (Klopf, 1972) [figure: the eligibility trace shaped like a histogram of the lengths of the feedback pathways in which the neuron is embedded; it is the same curve as the reinforcement-effectiveness curve in conditioning, peaking at the optimal ISI of about 400 ms and falling to zero after approximately 4 s]

  21. Later Simplified Eligibility Traces [figure: time courses of an accumulating trace vs. a replacing trace over repeated visits to state s]
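A sketch of the two simplified trace types named on this slide, in the form they later took in TD(λ)-style algorithms; the λ and γ decay parameters are my labels, not the slide's.

import numpy as np

def update_traces(trace, visited_state, kind="accumulating", lam=0.9, gamma=1.0):
    """One step of per-state eligibility traces."""
    trace = gamma * lam * trace          # every trace decays each time step
    if kind == "accumulating":
        trace[visited_state] += 1.0      # repeated visits add up
    else:                                # "replacing"
        trace[visited_state] = 1.0       # repeated visits reset the trace to 1
    return trace

# Repeated visits to state 0: the accumulating trace exceeds 1, the replacing
# trace never does.
t_acc, t_rep = np.zeros(3), np.zeros(3)
for _ in range(5):
    t_acc = update_traces(t_acc, 0, "accumulating")
    t_rep = update_traces(t_rep, 0, "replacing")
print(t_acc[0], t_rep[0])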

  22. Rich Sutton • BA Psychology, Stanford, 1978 • As an undergrad, discovered Klopf’s 1972 tech report • Two unpublished undergraduate reports: • “Learning Theory Support for a Single Channel Theory of the Brain” 1978 • “A Unified Theory of Expectation in Classical and Instrumental Conditioning” 1978 (?) • Rich’s first paper: • “Single Channel Theory: A Neuronal Theory of Learning” Brain Theory Newsletter, 1978.

  23. Sutton’s Theory • Aj: level of activation of mode j at time t • Vij: sign and magnitude of association from mode i to mode j at time t • Eij: eligibility of Vij for undergoing changes at time t. It is proportional to the average of the product Ai(t)Aj(t) over some small past time interval (or an average of the logical AND). • Pj: expected level of activation of mode j at time t (a prediction of level of activation of mode j) • Cij: a constant depending on the particular association being changed

  24. What exactly is Pj? • Based on recent activation of the mode: The higher the activation within the last few seconds, the higher the level expected for the present . . . • Pj(t) is proportional to the average of the activation level over some small time interval (a few seconds or less) before t.

  25. Sutton’s theory • Contingent Principle: based on reinforcement a neuron receives after firings and the synapses which were involved in the firings, the neuron modifies its synapses so that they will cause it to fire when the firing causes an increase in the neuron’s expected reinforcement after the firing. • Basis of Instrumental, or Thorndikian, conditioning • Predictive Principle: if a synapse’s activity predicts (frequently precedes) the arrival of reinforcement at the neuron, then that activity will come to have an effect on the neuron similar to that of reinforcement. • Basis of Classical, or Pavlovian, conditioning
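One plausible way the quantities defined on slide 23 combine into an update; this is my reading of the slides, not a transcription of Sutton's 1978 equations.

def sutton_update(V, E, A, P, C):
    """Change each association V[i][j] in proportion to its eligibility E[i][j]
    and to how much mode j's activation A[j] exceeds its expected level P[j],
    scaled by the constant C[i][j]."""
    for i in range(len(V)):
        for j in range(len(V[i])):
            V[i][j] += C[i][j] * (A[j] - P[j]) * E[i][j]
    return V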

  26. Sutton’s Theory • Main addition to Klopf’s theory: addition of the difference term (a temporal difference term) • Showed relationship to the Rescorla-Wagner model (1972) of Classical Conditioning • Blocking • Overshadowing • Sutton’s model was a real-time model of both classical and instrumental conditioning • Emphasized conditioned reinforcement

  27. Rescorla-Wagner Model, 1972 “Organisms only learn when events violate their expectations.” ΔVA = αA (λ − ΣV) • ΔVA: change in associative strength of CS A • αA: parameter related to CS intensity • λ: parameter related to US intensity • ΣV: sum of associative strengths of all CSs present (“composite expectation”) A “trial-level” model
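A trial-level Python sketch of the Rescorla-Wagner update (the usual US-salience parameter β is folded into alpha here for simplicity):

def rescorla_wagner_trial(V, present_cs, lam, alpha=0.3):
    """One trial: learning is driven by the difference between the US strength
    (lambda) and the composite expectation (sum of the associative strengths
    of all CSs present)."""
    composite = sum(V[cs] for cs in present_cs)
    error = lam - composite                 # lambda - sum(V)
    for cs in present_cs:
        V[cs] += alpha * error              # Delta V_A = alpha (lambda - sum V)
    return V

# Blocking: pretrain A alone, then train the compound AB.
V = {"A": 0.0, "B": 0.0}
for _ in range(50):
    V = rescorla_wagner_trial(V, ["A"], lam=1.0)
for _ in range(50):
    V = rescorla_wagner_trial(V, ["A", "B"], lam=1.0)
print(V)  # V["B"] stays near 0: A already predicts the US, so B is "blocked"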

  28. Conditioned Reinforcement • Stimuli associated with reinforcement take on reinforcing properties themselves • Follows immediately from the predictive principle: “By the predictive principle we propose that the neurons of the brain are learning to have predictors of stimuli have the same effect on them as the stimuli themselves” (Sutton, 1978) • “In principle this chaining can go back for any length …” (Sutton, 1978) • Equated Pavlovian conditioned reinforcement with instrumental higher-order conditioning

  29. Where was I coming from? • Studied at the University of Michigan: at the time a hotbed of genetic algorithm activity due to John Holland’s influence (PhD in 1975) • Holland talked a lot about the exploration/exploitation tradeoff • But I studied dynamic system theory, relationship between state-space and input/output representations of systems, convolution and harmonic analysis, finally cellular automata • Fascinated by how simple local rules can generate complex global behavior: • Dynamic systems • Cellular automata • Self-organization • Neural networks • Evolution • Learning

  30. Sutton and Barto, 1981 • “Toward a Modern Theory of Adaptive Networks: Expectation and Prediction” Psych Review 88, 1981 • Drew on Rich’s earlier work, but clarified the math and simplified the eligibility term to be non-contingent: just a trace of x instead of xy. • Emphasized the anticipatory nature of the CR • Related to “Adaptive System Theory”: • Other neural models (Hebb, Widrow & Hoff’s LMS, Uttley’s “Informon”, Anderson’s associative memory networks) • Pointed out relationship between Rescorla-Wagner model and Adaline, or LMS algorithm • Studied algorithm stability • Reviewed possible neural mechanisms: e.g. eligibility = intracellular Ca ion concentrations

  31. “SB Model” of Classical Conditioning
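A minimal sketch of an SB-style unit assembled from slides 30, 36, and 38 (variable names and constants are mine): the US activates the unit and directs learning, eligibility is a non-contingent trace of the CS inputs, and learning is driven by the change in output y(t) − y(t−1).

import numpy as np

def sb_step(V, x, us, x_trace, y_prev, c=0.1, trace_decay=0.7):
    """One time step of the unit.

    V: associative strengths, x: current CS inputs, us: US input,
    x_trace: stimulus (eligibility) trace, y_prev: previous output.
    """
    y = us + float(np.dot(V, x))           # US and CS associations both drive the output
    V = V + c * (y - y_prev) * x_trace     # Delta V_i = c [y(t) - y(t-1)] x_trace_i(t)
    x_trace = trace_decay * x_trace + x    # simple decaying trace of the CS inputs
    return V, x_trace, y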

  32. Temporal Primacy Overrides Blocking in SB Model [figure: our simulation compared with the experimental result of Kehoe, Schreurs, and Graham, 1987]

  33. Intratrial Time Courses (part 2 of blocking)

  34. Adaline Learning Rule (LMS rule, Widrow and Hoff, 1960) [figure: Adaline unit with an input pattern, adjustable weights, and a target output]

  35. “Rescorla–Wagner Unit” [figure: a unit with a US-to-UR pathway and CS-to-CR pathways, a vector of “associative strengths”, and their sum as the “composite expectation”]

  36. Important Notes • The “target output” of LMS corresponds to the US input to Rescorla-Wagner model • In both cases, this input is specialized in that it does not directly activate the unit but only directs learning • The SB model is different, with the US input activating the unit and directing learning • Hence, SB model can do secondary reinforcement • SB model stayed with Klopf’s idea of “generalized reinforcement”

  37. One Neural Implementation of S-B Model

  38. A Major Problem: US offset • e.g., if a CS has the same time course as the US, the weights would change so that the US is cancelled out. [figure: time courses of the US, the CS, and the final result] Why? Because the rule is trying to zero out y(t) − y(t−1)

  39. Associative Memory Networks Kohonen et al. 1976, 1977; Anderson et al. 1977

  40. Associative Search Network Barto, Sutton, & Brouwer 1981

  41. Associative Search Network Barto, Sutton, Brouwer, 1981 Problem of context transitions: add a predictor “one-step-ahead LMS predictor”

  42. Relation to Klopf/Sutton Theory Did not include generalized reinforcement since z(t) is a specialized reward input Associative version of the ALOPEX algorithm of Harth & Tzanakou, and later Unnikrishnan
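A rough Python sketch of an ASN-style learner assembled from slides 40-42: noisy units supply the variability, a one-step-ahead LMS predictor supplies a reward baseline, and reward above that baseline reinforces whatever the units just did. The exact update in Barto, Sutton & Brouwer (1981) may differ; this form is an assumption on my part.

import numpy as np

rng = np.random.default_rng(0)

def asn_step(W, P, x_prev, y_prev, z, x, alpha=0.1, beta=0.1, noise=0.3):
    """W: unit weights (units x inputs), P: reward-predictor weights,
    x_prev, y_prev: previous context and noisy unit outputs,
    z: reward just received, x: current context."""
    p_prev = float(np.dot(P, x_prev))      # predicted reward for the previous context
    # Reward above prediction reinforces the actions the noisy units just took.
    W = W + alpha * (z - p_prev) * np.outer(y_prev, x_prev)
    P = P + beta * (z - p_prev) * x_prev   # one-step-ahead LMS predictor update
    y = np.sign(W @ x + noise * rng.normal(size=W.shape[0]))  # new noisy outputs
    return W, P, y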

  43. Associative Search Network

  44. “Landmark Learning” Barto & Sutton 1981 An illustration of associative search

  45. “Landmark Learning”

  46. “Landmark Learning” swap E and W landmarks

  47. Note: Diffuse Reward Signal [figure: units y1, y2, y3 each receive the same inputs x1, x2, x3 and the same scalar reward] Units can learn different things despite receiving identical inputs . . .

  48. Provided there is variability • ASN just used noisy units to introduce variability • Variability drives the search • Needs to have an element of “blindness”, as in “blind variation”: i.e. outcome is not completely known beforehand • BUT does not have to be random • IMPORTANT POINT: Blind Variation does not have to be random, or dumb

  49. Pole Balancing Barto, Sutton, & Anderson 1983 Widrow & Smith, 1964 “Pattern Recognizing Control Systems” Michie & Chambers, 1968 “Boxes: An Experiment in Adaptive Control”
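A compact actor-critic sketch in the spirit of this pole-balancing work, not the original ASE/ACE implementation itself: the state is coarsely discretized into "boxes", a critic supplies an internal (TD) reinforcement signal, and the actor's action preferences are adjusted by it. Gymnasium's CartPole-v1 is used as a stand-in simulator, and the bin boundaries and learning rates are my own choices.

import numpy as np
import gymnasium as gym   # modern stand-in for the cart-pole simulator

def box_index(obs, bins=6):
    """Coarse "boxes"-style discretization of the 4-dimensional cart-pole state."""
    lows  = np.array([-2.4, -3.0, -0.21, -3.0])
    highs = np.array([ 2.4,  3.0,  0.21,  3.0])
    idx = np.clip(((obs - lows) / (highs - lows) * bins).astype(int), 0, bins - 1)
    return int(np.ravel_multi_index(idx, (bins,) * 4))

n_states = 6 ** 4
actor  = np.zeros((n_states, 2))   # action preferences (ASE-like)
critic = np.zeros(n_states)        # state values (ACE-like)
alpha, beta, gamma = 0.5, 0.1, 0.95
rng = np.random.default_rng(0)

env = gym.make("CartPole-v1")
for episode in range(200):
    obs, _ = env.reset(seed=episode)
    s = box_index(obs)
    done = False
    while not done:
        # Softmax exploration over the two push directions supplies variability.
        prefs = actor[s] - actor[s].max()
        probs = np.exp(prefs) / np.exp(prefs).sum()
        a = int(rng.choice(2, p=probs))
        obs, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        s2 = box_index(obs)
        # TD error: the critic's "internal reinforcement" signal.
        delta = r + (0.0 if terminated else gamma * critic[s2]) - critic[s]
        critic[s] += beta * delta          # improve the value estimate
        actor[s, a] += alpha * delta       # reinforce (or punish) the action taken
        s = s2
env.close()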

  50. MENACE (Michie, 1961) “Matchbox Educable Noughts and Crosses Engine” [figure: the same array of noughts-and-crosses positions as slide 10, one matchbox per position]
