ECE-517: Reinforcement Learning in Artificial Intelligence – Lecture 1: Course Logistics, Introduction • August 23, 2012 • Dr. Itamar Arel • College of Engineering, Electrical Engineering and Computer Science Department, The University of Tennessee • Fall 2012 • But first, a quick anonymous survey …
Outline • Course logistics and requirements • Course roadmap & outline • Introduction
Course Objectives • Introduce the concepts & principles governing reinforcement-based machine learning systems • Review fundamental theory • Markov Decision Processes (MDPs) • Dynamic Programming (DP) • Practical systems; the role of Neural Networks in Neuro-Dynamic Programming (NDP) • RL learning schemes (Q-Learning, TD-Learning, etc.) • Limitations of existing techniques and how they can be improved • Discuss software and hardware implementation considerations Long-term goal: to contribute to your understanding of the formalism, trends and challenges in constructing RL-based agents
Course Prerequisites • A course on probability theory or background in probability theory is required • Matlab/C/C++ competency • A Matlab tutorial has been posted on the course website (under the schedule page) • Open-mindedness & imagination …
Course Assignments • 2 small projects • Main goal – provide students with basic hands-on experience in ML behavioral simulation and result interpretation • Analysis complemented by simulation • MATLAB programming oriented • Reports should include all background, explanations and results • 5 problem assignment sets • Will cover the majority of the topics discussed in class • Assignments should be handed in before the beginning of the class • Final project • Each student/group is assigned a topic • Project report & in-class presentation
Sony AIBO Lab • Located at MK 630 • 6 Sony AIBO dog robots (3rd generation) • Local wireless network (for communicating with the dogs) • Code for lab project(s) will be written in Matlab • Interface has been prepared • Time slots should be coordinated with Instructor & TA
Textbooks & Reference Material • Lecture notes will be posted weekly on the course website (web.eecs.utk.edu/~itamar/courses/ECE-517) as well as … • Updated schedule • Assignment sets, sample codes • Grades • General announcements • Reading assignments will be posted on schedule page • Textbook: • R. Sutton and A. Barto, “Reinforcement Learning: An Introduction,” 1998. (available online!)
Grading policy & office hours • 2 Small Projects – 25% (12.5 points each) • 5 Assignment sets – 25% (5 points each) • Midterm (in-class) – 20% • Final project – 30% • Instructor: Dr. Itamar Arel • Office Hours: T/Tr 2:00 – 3:00 PM (MK 608) • TA: Derek Rose (derek@utk.edu) – office @ MK 606 • Office hours: contact TA • My email: itamar@eecs.utk.edu Students are strongly encouraged to visit the course website (web.eecs.utk.edu/~itamar/courses/ECE-517) for announcements, lecture notes, updates etc.
UTK Academic Honesty Statement An essential feature of the University of Tennessee, Knoxville, is a commitment to maintaining an atmosphere of intellectual integrity and academic honesty. As a student of the university, I pledge that I will neither knowingly give nor receive any inappropriate assistance in academic work, thus affirming my own personal commitment to honor and integrity. Bottom line: DO YOUR OWN WORK!
Understanding and constructing intelligence • What is intelligence? How do we define/evaluate it? • How can we design adaptive systems that optimize their performance as time goes by? • What are the limitations of RL-based algorithms? • How can artificial Neural Networks help us scale intelligent systems? • In what ways can knowledge be efficiently represented? • This course is NOT about … • Robotics • Machine learning (in the general sense) • Legacy AI – symbolic reasoning, logic, etc. • Image/vision/signal processing • Control systems theory • Why the course name "RL in AI"?
Course Outline & Roadmap • Introduction • Review of basic probability theory • Discrete-time/space probability theory • Discrete Markov Chains • Dynamic Programming • Markov Decision Processes (MDPs) • Partially Observable Markov Decision Processes (POMDPs) • Approximate Dynamic Programming (a.k.a. Reinforcement Learning) • Temporal Difference (TD) Learning, Planning • Midterm – Tuesday, Oct 9, 2012 • Neuro-Dynamic Programming • Feedforward & Recurrent Neural Networks • Neuro-dynamic RL architecture • Applications and case studies • Final project presentations – Nov 27 – Dec 4, 2012 A detailed schedule is posted at the course website
Outline • Course logistics and requirements • Course outline & roadmap • Introduction
What is Machine Learning? • Discipline focusing on computer algorithms that learn to perform “intelligent” tasks • Learning is based on observation of data • Generally: learning to do better in the future based on what has been observed/experienced in the past • ML is a core subarea of AI, which also intersects with physics, statistics, theoretical CS, etc. • Examples of “ML” Problems • Optical character recognition • Face detection • Spoken language understanding • Customer segmentation • Weather prediction, etc.
Introduction • Why do we need good ML technology? • Human beings are lazy creatures … • Service robotics • $10B market in 2015 – Japan only! • Pattern recognition (speech, vision) • Data mining • Military applications • … many more • Many ML problems can be formulated as RL problems …
Introduction (cont.) • Learning by interacting with our environment is probably the first idea to occur to us when we think about the nature of learning • Humans have no direct teachers • We do have a direct sensorimotor connection to the environment • We learn as we go along • Interaction with the environment teaches us what “works” and what doesn’t • We construct a “model” of our environment This course explores a computational approach to learning from interaction with the environment
What is Reinforcement Learning Reinforcement learning is learning what to do – how to map situations to actions – in order to maximize a long-term objective function driven by rewards. • It is a form of unsupervised learning (no explicit teacher provides the correct actions) • Two key components at the core of RL: • Trial-and-error – adapting internal representation, based on experience, to improve future performance • Delayed reward – actions are produced so as to yield long-term (not just short-term) rewards • The “agent” must be able to: • Sense its environment • Produce actions that can affect the environment • Have a goal (“momentary” cost metric) relating to its state
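To make the agent requirements above concrete, here is a minimal Matlab sketch (Matlab being the course's simulation language) of the sense–act–reward interaction loop. The functions choose_action and env_step are hypothetical placeholders for the agent's decision rule and the environment's response; they are not part of Matlab or of any library.

```matlab
% Minimal agent-environment interaction loop (illustrative sketch).
% choose_action and env_step are hypothetical placeholder functions.
num_steps = 100;            % length of the interaction episode
state = 1;                  % initial sensed state
total_reward = 0;
for t = 1:num_steps
    action = choose_action(state);                    % agent acts on its environment
    [next_state, reward] = env_step(state, action);   % environment returns new state & reward
    total_reward = total_reward + reward;             % goal: maximize accumulated reward
    state = next_state;                               % sense the new situation
end
```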
What is Reinforcement Learning (cont’) • RL attempts to solve the Credit Assignment problem: what is the long-term impact of an action taken now? • Unique to RL systems – a major challenge in ML • Necessitates an accurate model of the environment being controlled/interacted with • Something animals and humans do very well, and computers do very poorly • We’ll spend most of the semester formalizing solutions to this problem • Philosophically • Computationally • Practically (implementation considerations)
The Big Picture (Diagram: Reinforcement Learning ⊂ Machine Learning ⊂ Artificial Intelligence) • Types of Machine Learning • Supervised Learning: learn from labeled examples • Unsupervised Learning: process unlabeled examples • Example: clustering data into groups • Reinforcement Learning: learn from interaction • Defined by the problem • Many approaches are possible (including evolutionary) • Here we will focus on a particular family of approaches • Autonomous learning
Software vs. Hardware • Historically, ML has been in CS turf • Confinement to the Von Neumann architecture • Software limits scalability … • Human brain has ~10^11 processors (neurons) operating at once • However, each runs at ~150 Hz • It’s the massive parallelism that gives it power • Even 256 processors is not “massive parallelism” • “Computer Engineering” perspective • FPGA devices (reconfigurable computing) • GPUs • ASIC prospects • UTK/MIL group focus
Exploration vs. Exploitation • A fundamental trade-off in RL • Exploitation of what worked in the past (to yield high reward) • Exploration of new, alternative action paths so as to learn how to make better action selections in the future • The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task • On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward • We will review mathematical methods proposed to address this basic issue
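One common way to trade off the two is an ε-greedy rule: exploit the action with the highest estimated value most of the time, and explore a random action with small probability ε. Below is a minimal Matlab sketch; the action-value estimates Q and the value of ε are made-up numbers for illustration, not part of the course material.

```matlab
% Epsilon-greedy action selection (illustrative sketch).
Q = [0.2 0.5 0.1 0.7];        % assumed estimates of each action's expected reward
epsilon = 0.1;                % probability of taking an exploratory action
if rand() < epsilon
    action = randi(numel(Q)); % explore: pick an action uniformly at random
else
    [~, action] = max(Q);     % exploit: pick the action with the highest estimate
end
```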
Some Examples of RL • A master chess player • Planning – anticipating replies and counter-replies • Immediate, intuitive judgment • A mobile robot decides whether to enter a room or try to find its way back to a battery-charging station • Playing backgammon • Obviously, a strategy is necessary • Some luck involved (stochastic game) • In all cases, the agent tries to achieve a goal despite uncertainty about its environment • The effect of an action cannot be fully predicted • Experience allows the agent to improve its performance over time
Origins of Reinforcement Learning • Artificial Intelligence • Control Theory (MDP) • Operations Research • Cognitive Science and Psychology • More recently, Neuroscience • RL has solid foundations and is a well-established research field
Elements of Reinforcement Learning • Beyond the agent and the environment, we have the following four main elements in RL • Policy – defines the learning agent's way of behaving at any given time. Roughly speaking, a policy is a mapping from (perceived) states of the environment to actions to be taken when in those states. • Usually stochastic (adapts as you go along) • Enough to determine the agent’s behavior • Reward function – defines the goal in an RL problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state • Agent’s goal is to maximize the reward over time • May be stochastic • Drives the policy employed and its adaptation
Elements of Reinforcement Learning (cont.) • Value function – whereas a reward function indicates what is good in an immediate sense, a value function specifies what is good in the long run. • Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. • It allows the agent to look over the “horizon” • Actions are derived from value estimations, not rewards • We measure rewards, but we estimate and act upon values – corresponds to strategic/long-term thinking • Intuitively a prerequisite for intelligence/intelligent control (plants vs. animals) • Obtaining a good value function is a key challenge in designing good RL systems
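As a small numerical illustration of the reward/value distinction, the Matlab sketch below computes the discounted return accumulated from one state along a sampled trajectory; the reward sequence and the discount factor are assumptions chosen only for this example.

```matlab
% Value vs. reward: a state's value reflects rewards that arrive later (sketch).
rewards = [0 0 0 1];     % assumed rewards observed after leaving the state
gamma = 0.9;             % assumed discount factor
value = 0;
for k = 1:numel(rewards)
    value = value + gamma^(k-1) * rewards(k);   % discounted sum of future rewards
end
% value = 0.9^3 = 0.729: the state yields no immediate reward,
% yet its value is high because of what follows it.
```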
Elements of Reinforcement Learning (cont.) • Model – an entity that mimics the (observable) behavior of the environment. • For example, given a state and action, the model might predict the resultant next state and next reward • As we will later discuss, predictability and auto-associative memory are key attributes of the mammalian brain • Models are used for planning – any way of deciding on a course of action by considering possible future scenarios before they actually occur • Note that RL can work (sometimes very well) with an incomplete model • We’ll go over a range of model platforms to achieve the above As a side note: RL is essentially an optimization problem. However, it is one of the many optimization problems that are extremely hard to (optimally) solve.
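The sketch below illustrates how a model supports planning: the agent queries the model for the predicted outcome of each action and picks the one with the best predicted long-run value, without actually executing any of them. The function model, the value table V, the current state, and gamma are hypothetical placeholders for this illustration, not quantities defined in the lecture.

```matlab
% One-step planning with a model of the environment (illustrative sketch).
% model, V, state and gamma are hypothetical placeholders.
actions = 1:4;                % candidate actions in the current state
best_value = -inf;
for a = actions
    [s_next, r] = model(state, a);        % predicted next state and reward (not executed)
    candidate = r + gamma * V(s_next);    % predicted immediate reward plus future value
    if candidate > best_value
        best_value  = candidate;
        best_action = a;                  % remember the most promising action
    end
end
```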
An Extended Example: Tic-Tac-Toe • Consider a classical tic-tac-toe game, whereby the winner places three marks in a row, horizontally, vertically or diagonally • Let’s assume: • We are playing against an imperfect player • Draws and losses are equally bad for us • Q: Can we design a player that’ll find imperfections in the opponent’s play and learn to maximize chances of winning? • Classical machine learning schemes would never visit a state that has the potential to lead to a loss • We want to exploit the weaknesses of the opponent, so we may decide to visit a state that has the potential of leading to a loss
An Extended Example: Tic-Tac-Toe (cont.) • Using dynamic programming (DP), we can compute an optimal solution for any opponent • However, we would need specifications of the opponent (e.g. state-action probabilities) • Such information is usually unavailable to us • In RL we estimate this information from experience • We later apply DP, or other sequential decision-making schemes, based on the model we obtained from experience • A policy tells the agent how to make its next move based on the state of the board • Winning probabilities can be derived by knowing the opponent’s behavior
An Extended Example: Tic-Tac-Toe (cont.) • How do we solve this in RL … • Set up a table of numbers – one for each state of the game • This number will reflect the probability of winning from that particular state • This is treated as the state’s value, and the entire learned table denotes the value function • If V(a) > V(b), state a is preferred over state b • All states with three X’s in a row have win prob. of 1 • All states with three O’s in a row have win prob. of 0 • All other states are preset to prob. 0.5 • When playing the game, we make a move that we predict would result in a state with the highest value (exploitation) • Occasionally, we choose randomly among the non-zero-valued states (exploratory moves)
An Extended Example: Tic-Tac-Toe (cont.) • While playing, we update the values of the states • The current value of the earlier state is adjusted to be closer to the value of the later state: V(s) ← V(s) + α[V(s′) − V(s)], where 0 < α < 1 is a learning parameter (step-size param.), s is the state before the move and s′ the state after the move • This update rule is an example of a Temporal-Difference learning method • This method performs quite well – it converges to the optimal policy (for a fixed opponent) • Can be adjusted to allow for slowly-changing opponents
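A minimal Matlab sketch of this update applied to a table of state values; the table size, the step-size α, and the state indices s and s′ are illustrative assumptions, not quantities fixed by the tic-tac-toe example.

```matlab
% Temporal-difference update of a state-value table (illustrative sketch).
V = 0.5 * ones(1, 5000);   % value table, every state preset to 0.5 (assumed size)
alpha = 0.1;               % step-size parameter, 0 < alpha < 1
s  = 42;                   % hypothetical index of the state before the move
s2 = 137;                  % hypothetical index of the state after the move
V(s) = V(s) + alpha * (V(s2) - V(s));   % move V(s) toward V(s'), as in the rule above
```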
An Extended Example: Tic-Tac-Toe (cont.) • RL features encountered … • Emphasis on learning from interaction – in this case with the opponent • Clear goal – correct planning takes into account delayed rewards • For example, setting up traps for a shortsighted opponent • No model of the opponent exists a priori • Although this example is a good one, RL methods can … • be applied to infinite-horizon problems (no terminal states) • be applied to cases where there is no external adversary (e.g. “game against nature”) • Backgammon example – ~10^20 states, tackled using Neural Nets
For next class … • Read chapter 1 in SB (up to and including section 1.5) • Read chapter 2 in SB