Active & Reinforcement Learning. Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn. Students. Active. Passive. Lazy. Active Learning. What is AL? Techniques that help classifiers learn better with fewer training samples. Why: Data are cheap but labeling can be expensive.
Active & Reinforcement Learning
Lecturer: Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn
Students
• Active
• Passive
• Lazy
Active Learning
• What is AL?
  • Techniques that help classifiers learn better with fewer training samples.
• Why:
  • Data are cheap but labeling can be expensive.
  • For example: Speech Recognition, Information Extraction …
• Key idea:
  • Let the classifier choose the samples from which it learns.
[Figure panels: Original Data, Supervised Learning, Active Learning]
Example 1
• Supervised Learning: N labels for ε ≤ 1/N
• Active Learning: log(N) labels for ε ≤ 1/N
[Figure labels: w, a, b]
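The gap between N and log(N) labels can be illustrated with a small sketch. It assumes the example is a one-dimensional threshold classifier (the figure labels w, a, b suggest this, but the slide text does not say so). The oracle, the pool size, and the true threshold 0.37 are all hypothetical choices made for illustration.

```python
import random

def make_oracle(w_true):
    """Hypothetical labeling oracle: label is 1 if x >= w_true, else 0."""
    def oracle(x):
        return 1 if x >= w_true else 0
    return oracle

def passive_threshold(points, oracle):
    """Supervised baseline: label every point, then take the first positive one."""
    labels = [(x, oracle(x)) for x in sorted(points)]
    positives = [x for x, y in labels if y == 1]
    estimate = positives[0] if positives else float("inf")
    return estimate, len(labels)

def active_threshold(points, oracle):
    """Active learner: binary search over the sorted pool for the first positive point."""
    xs = sorted(points)
    lo, hi = 0, len(xs)  # the index of the first positive point lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) == 1:
            hi = mid        # first positive is at mid or to its left
        else:
            lo = mid + 1    # first positive is strictly to the right
    estimate = xs[lo] if lo < len(xs) else float("inf")
    return estimate, queries

random.seed(0)
pool = [random.random() for _ in range(1024)]
oracle = make_oracle(0.37)
w_passive, n_passive = passive_threshold(pool, oracle)
w_active, n_active = active_threshold(pool, oracle)
print(n_passive, n_active, w_passive == w_active)
```

Both learners end up with the same threshold estimate, but the active learner only asks for labels at the points that halve the remaining uncertainty.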
Query Points
• Many samples are redundant or irrelevant to the decision boundary.
• AL typically works by:
  • Randomly querying a few samples.
  • Heuristically querying additional samples.
• Synthesized Query Points
  • The learner may request labels for any point in the space.
  • Just as students may pop up with all kinds of questions.
  • Could be awkward.
• Pool-Based Sampling
  • Samples are cheap and are collected all at once.
  • An informativeness measure is used to select samples.
• The key question: which samples should be labeled?
Uncertainty Sampling
• Find out what we are not sure about.
• Create an initial classifier.
• While the teacher is willing to label samples:
  • (a) Apply the current classifier to each unlabeled sample.
  • (b) Find the samples whose class membership the classifier is least certain about.
  • (c) Have the teacher label the selected samples.
  • (d) Train a new classifier on all labeled samples.
• The classifier needs to output both class membership and certainty.
  • KNN, NB, NN …
• Extensions for Multi-Class Problems
  • Margin
  • Entropy
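The loop above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes one-dimensional inputs, a plain k-nearest-neighbor vote as the classifier, one query per round, and a fixed label budget standing in for the teacher's willingness. The function names and the three scoring functions (least confidence, small margin, entropy) are illustrative choices.

```python
import math
import random

def knn_proba(x, labeled, k=3):
    """Class-probability estimate from the k nearest labeled points (1-D inputs)."""
    nearest = sorted(labeled, key=lambda item: abs(item[0] - x))[:k]
    counts = {}
    for _, y in nearest:
        counts[y] = counts.get(y, 0) + 1
    return {y: c / len(nearest) for y, c in counts.items()}

# Uncertainty scores: in all three, a HIGHER value means LESS certain.
def least_confidence(proba):
    return 1.0 - max(proba.values())

def small_margin(proba):
    """One minus the gap between the two most probable classes."""
    top = sorted(proba.values(), reverse=True)
    second = top[1] if len(top) > 1 else 0.0
    return 1.0 - (top[0] - second)

def entropy(proba):
    return -sum(p * math.log(p) for p in proba.values() if p > 0)

def uncertainty_sampling(pool, oracle, n_initial=4, budget=10, score=entropy):
    """Pool-based loop: query the sample the current classifier is least sure about."""
    unlabeled = list(pool)
    random.shuffle(unlabeled)
    # Initial classifier: k-NN over a few randomly chosen labeled samples.
    labeled = [(x, oracle(x)) for x in unlabeled[:n_initial]]
    unlabeled = unlabeled[n_initial:]
    for _ in range(budget):  # the budget stands in for the teacher's patience
        if not unlabeled:
            break
        # (a) + (b): score every unlabeled sample, pick the most uncertain one.
        query = max(unlabeled, key=lambda x: score(knn_proba(x, labeled)))
        unlabeled.remove(query)
        # (c): ask the teacher; (d): k-NN "retrains" by extending the labeled set.
        labeled.append((query, oracle(query)))
    return labeled

random.seed(1)
pool = [random.random() for _ in range(200)]
labeled = uncertainty_sampling(pool, oracle=lambda x: int(x > 0.5))
print(len(labeled))
```

Passing `score=small_margin` or `score=least_confidence` switches the selection criterion without changing the loop; for two classes the three scores rank samples identically, and they differ only in multi-class problems.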
Sampling Bias
[Figure labels: w*, w]
S. Dasgupta and D. Hsu: Hierarchical Sampling for Active Learning. ICML 2008
Exploiting Clustering Structure
• Find a clustering of the data.
• Randomly sample a few points from each cluster and query their labels.
• Assign each cluster its majority label.
• Use this fully labeled data to build a classifier.
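A minimal sketch of the sample-and-propagate steps, assuming the clustering is already available. The `cluster_of` function, the oracle, and the two-blob toy data are hypothetical stand-ins; any clustering method could supply the assignments, and the final classifier can be any supervised learner trained on the returned pairs.

```python
import random
from collections import Counter

def label_by_cluster(points, cluster_of, oracle, per_cluster=3, rng=None):
    """Query a few labels per cluster, then give every member the majority label."""
    rng = rng or random.Random(0)
    clusters = {}
    for p in points:
        clusters.setdefault(cluster_of(p), []).append(p)

    labeled, queries = [], 0
    for members in clusters.values():
        sample = rng.sample(members, min(per_cluster, len(members)))
        votes = Counter(oracle(p) for p in sample)  # only these points cost a label
        queries += len(sample)
        majority = votes.most_common(1)[0][0]
        labeled.extend((p, majority) for p in members)
    return labeled, queries

# Toy data: two well-separated 1-D blobs (an assumption made for illustration).
data_rng = random.Random(0)
points = [data_rng.gauss(0, 1) for _ in range(50)] + [data_rng.gauss(10, 1) for _ in range(50)]
cluster_of = lambda x: 0 if x < 5 else 1          # stand-in for a real clustering step
oracle = lambda x: "neg" if x < 5 else "pos"      # stand-in for the human labeler

labeled, queries = label_by_cluster(points, cluster_of, oracle)
print(queries, len(labeled))
```

The scheme is only as good as the clustering: if a cluster mixes classes, every minority point in it receives the wrong label.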
Summary
• Data Cheap, Labeling Expensive.
• AL improves the efficiency of training by selectively querying the most informative samples.
• Two Flavors:
  • Explore the hypothesis space efficiently.
  • Exploit the clustering structure.
• Related Areas:
  • Semi-Supervised Learning
  • Design of Experiments
  • Optimization
Reinforcement Learning
• An agent takes a series of actions and receives rewards from the environment.
  • For example: a robot walking through a maze
• Delayed Reward
  • Usually the reward is given only after a number of actions.
  • Lots of unrewarded intermediate actions
  • For example: winning or losing a game
• Goal
  • To learn to choose actions that maximize long-term rewards.
• Supervised? Unsupervised?
  • Learning Based on Experience
• Credit Assignment
  • Apportion credit and blame to each action.
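One common way to apportion credit for a delayed reward is to discount it backwards through the episode. The slide does not name a specific scheme, so the discount factor `gamma` and the function below are assumptions made for illustration.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step sees the already-discounted future.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A game-like episode: three unrewarded moves, then a win worth 1.
episode = [0, 0, 0, 1]
print(discounted_returns(episode, gamma=0.5))  # [0.125, 0.25, 0.5, 1.0]
```

Actions closer to the final reward receive more of the credit, while earlier actions still receive a smaller share.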
Terminology
• State (S): Room
• Action (A): Moving from one room to another
• Reward Table (R): [table not recovered]
Matrix Q
• The agent learns from experience or training by exploring the environment.
• Q is the agent's memory of the environment.
• Given a state diagram, find the shortest path from any initial state to the goal state.
Q Learning
• Set the learning rate and R.
• Initialize Q as a zero matrix.
• For each trial
  • Select a random initial state.
  • While the goal state is not reached
    • Select one possible action A for the current state S.
    • Get the maximum Q value of the new state S′.
    • Update Q(S, A).
    • Set the new state as the current state.
  • End While
• End For
• After learning, the agent moves by selecting the action with the maximum value in Q.
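The procedure above can be sketched in a few lines. Several details are assumptions, since the slides do not give them: the update is taken to be the simplified deterministic rule Q(S, A) = R(S, A) + γ · max Q(S′, ·), with the slide's "learning rate" playing the role of γ = 0.8; the reward is 100 for moving into goal room F and 0 otherwise; and the moves for rooms A, C and E are filled in by symmetry and by analogy with the well-known six-room example (only the moves for B, D and F appear in these slides).

```python
import random

GOAL = "F"

# Moves for B, D and F follow the slide examples; A, C and E are assumed.
NEIGHBORS = {
    "A": ["E"],
    "B": ["D", "F"],
    "C": ["D"],
    "D": ["B", "C", "E"],
    "E": ["A", "D", "F"],
    "F": ["B", "E", "F"],
}

def reward(state, action):
    """Assumed reward table R: 100 for moving into the goal room, 0 otherwise."""
    return 100.0 if action == GOAL else 0.0

def train(trials=500, gamma=0.8, seed=0):
    rng = random.Random(seed)
    # Q starts as all zeros; an action is named after the room it leads to.
    q = {s: {a: 0.0 for a in acts} for s, acts in NEIGHBORS.items()}
    states = list(NEIGHBORS)
    for _ in range(trials):
        state = rng.choice(states)                  # random initial state
        while state != GOAL:
            action = rng.choice(NEIGHBORS[state])   # one possible action
            next_state = action
            best_next = max(q[next_state].values()) # max Q value of the new state
            q[state][action] = reward(state, action) + gamma * best_next
            state = next_state
    return q

def greedy_path(q, start, max_steps=20):
    """After learning: always take the action with the largest Q value."""
    path = [start]
    while path[-1] != GOAL and len(path) <= max_steps:
        state = path[-1]
        path.append(max(q[state], key=q[state].get))
    return path

q = train()
print(greedy_path(q, "C"))
```

Because the inner loop stops as soon as the goal is reached, the row of Q for room F is never updated in this sketch and stays at zero.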
Examples
• Initial State: B
• Next Possible States: D & F
• From F, the agent can go to B, E & F
• Stop!
Examples
• Initial State: D
• Next Possible States: B, C & E
• From B, the agent can go to D & F
• Continue …
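The two examples can be worked through numerically under some assumptions the slides do not state: the agent moves B → F in the first example and D → B in the second, the reward is 100 for entering goal room F and 0 otherwise, γ = 0.8, Q starts at zero, and the update is the simplified rule Q(S, A) = R(S, A) + γ · max Q(S′, ·). The second update uses the value learned in the first.

```latex
\begin{align*}
Q(B, F) &= R(B, F) + \gamma \max\{Q(F, B),\, Q(F, E),\, Q(F, F)\}
         = 100 + 0.8 \cdot 0 = 100 \\
Q(D, B) &= R(D, B) + \gamma \max\{Q(B, D),\, Q(B, F)\}
         = 0 + 0.8 \cdot \max\{0,\, 100\} = 80
\end{align*}
```

In the first example the trial stops because F is the goal; in the second, B is not the goal, so the trial continues from B.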
Temporal Difference Learning
[Figure: Agent's Move, Opponent's Move, Agent's Move]
Summary
• RL trains an agent to take a series of appropriate actions to achieve long-term goals through trial and error.
• Credit assignment is implemented largely through the back propagation of rewards with decay.
  • What are the most important choices that lead you to success?
• RL takes into account the details of the interaction.
  • By contrast: How to use GAs + ANN to learn chess?
• There is a trade-off between exploration and exploitation.
• Humans still perform better by truly understanding problems through analysis and reasoning.
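The exploration/exploitation trade-off mentioned in the summary is often handled with an ε-greedy rule. The slides do not name a strategy, so the sketch below is one possible choice; `epsilon` and the dictionary of Q values are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))      # explore
    return max(q_values, key=q_values.get)     # exploit

# Hypothetical Q values for the actions available in one state.
q_values = {"B": 80.0, "C": 51.2, "E": 64.0}
print(epsilon_greedy(q_values, epsilon=0.0))   # always the greedy action: B
```

A larger `epsilon` means more exploration; with `epsilon=0` the agent never tries anything other than the action it currently believes is best.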