A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting By Yoav Freund and Robert E. Schapire Presented by David Leach Original Slides by Glenn Rachlin

Outline: • Background • On-line allocation of resources • Introduction • The Problem • The Hedge Algorithm • Analysis • Boosting • Introduction • The Problem • The AdaBoost Algorithm • Analysis • Applications • Extensions • Conclusions • Questions for Final exam

Useful Definitions: • On-Line Learning – Information comes in one step at a time, learner must apply model, make prediction, observe true value, then adjust model accordingly. • Weak Learner – Algorithm has higher accuracy than random guessing, but is impractical by itself for most real-world applications. • PAC – Probably Approximately Correct; most of the time the prediction returned will be close to the actual result.

Ensemble Learning: A machine learning paradigm where multiple learners are used to solve the problem Ensemble: Previously: Problem Problem … ... Learner … ... Learner Learner Learner • The generalization ability of the ensemble is usually significantly better than that of an individual learner • Boosting is one of the most important families of ensemble methods

Boosting: a background • Significant advantages: • Solid theoretical foundation • High level of accuracy • Simple to implement • Wide range of applications • R. Schapire and Y. Freund won the 2003 Godel Prize (one of the most prestigious awards in theoretical computer science) • Prize winning paper (which introduced AdaBoost): "A decision theoretic generalization of on-line learning and an application to Boosting,“ Journal of Computer and System Sciences, 1997, 55: 119-139.

How was Adaboost born? • In 1988, M. Kearns and L.G. Valiant posed an interesting question: • Can a “weak” learning algorithm that performs just slightly better than random guess can be “boosted” into an arbitrarily accurate “strong” learning algorithm? • More simply, can we transform one or more weak learners into a single strong learner?

How was Adaboost born? • In R. Schapire’s MLJ90 paper, Rob said “yes” and gave a proof to the question. The proof is a construction, which is the first Boosting algorithm (“Recursive Majority Gate Formulation”) • Then, in Y. Freund’s Phd thesis (1993), Yoav gave a scheme of combining weak learners by “Majority Vote” • Though theoretically strong, both algorithms relied on knowledge of each weak learner’s accuracy • Later, at AT&T Bell Labs, they published the 1997 paper (in fact the work was done in 1995), which proposed the AdaBoost algorithm, a practical, “adaptable” algorithm

Boosting Timeline • 1990 – Boost-by-majority algorithm (Freund) • 1995 – AdaBoost (Freund &Schapire) • 1997 – Generalized version of AdaBoost (Schapire& Singer) • 2001 – AdaBoost in Face Detection (Viola & Jones)

On-line Allocation of Resources: Introduction • Problem: “... dynamically apportioning resources among a set of options...” • In other words, “Given a set of individual predictions, how much should we value each one?” • The Gambler Example (A recurring theme)

The Gambler: A Gambler wants to make money on horse-racing by consulting a group of experts. • He discovers that experts tend to use certain “rules of thumb” for races that dictate results to some degree (“Horse with the best odds”, etc.). • Hard to find one particular rule that works for multiple circumstances. • How can he use the network of various predictions (each of which tends to use a given rule of thumb) to win money? • More specifically, how should he split his money among the experts?

On-line Allocation of Resources: Problem Formulation The on-line allocation model: • Allocation agent: A - the gambler • A certain strategy: i – one expert’s behavior • # of options/strategies: {1,2,3, ... ,N} - the # of experts to choose from • # of time steps: {1,2,3, ... ,T} - the # of races • distribution over strategies: pt -how much money he spends on each expert • loss: l –money lost (or not gained)

On-line Allocation of Resources: Hedge(β) • Basis: “The algorithm and its analysis are direct generalizations of Littlestone and Warmuth’s weighted majority algorithm” • Assumptions: • The loss suffered by any strategy be bounded • All weights be nonnegative • Initial weights sum up to 1 (optional)

Algorithm Hedge (β) • Parameters: β∈[0,1] • initial weight vector: ω1∈ [0,1]N with • number of trials T • Do for t = 1,2, ..., T • Choose allocation from environment • Receive loss vector • Suffer loss • Set new weights vector to be Goal: minimize difference between expected total loss and minimal total loss of repeating one action

The Gambler Revisited • The gambler uses his fancy new algorithm as follows: • 1. The gambler splits his money evenly between 3 experts, giving $5 to each • p1 = <.33,.33,.33> • 2. The gambler records the loss to each expert • Expert 1 loses $2 • Expert 2 loses $1 • Expert 3 loses $4 • loss vector lt = <2,1,4> • total loss = .33x2 + .33x1 + .33x4 = 2.33

The Gambler Revisited • 3. The gambler sets new weights using this data and a beta of .5 • Expert 1 is weighted .33 x .52= .083 • Expert 2 is now weighted .33 x .51 = .167 • Expert 3 is now weighted .33 x .54 = .063 • Total weight = .083 + .167 + .063 = .313 • 4. The gambler repeats the process, now “hedging” his bets as follows: • p2 = <.083/.313, .167/.313, .063/.313> = <.265, .533, .202>

On-line Allocation of Resources: Bounds of Hedge(β)

Bounds, Con’t • β = given parameter • T = total number of time steps or trials • N = number of options

Choosing beta Set Where L~ is the bound on the best strategy, and Then: And if we know T:

On-line Allocation of Resources: Evaluation • The authors show that the Hedge(β) algorithm “yield[s] bounds that are slightly weaker in some cases, [than those produced by the algorithm proposed by Littlestone and Warmuth, 1994] but applicable to a considerably more general class of learning problems.” • Not only binary decisions • Not only discrete loss

More Gambling • The gambler now wants to avoid the experts, and opts to write a program that predicts the winner. • He must take input data • Odds • Previous results • Track conditions • And predict the outcome • Win or loss • He notices that “rules of thumb” once again emerge, where simple heuristics can provide some predictive accuracy, but not enough • How can he use this information to make money?

Boosting: Introduction • Aim: “.. converting a weak learning algorithm that performs just slightly better than random guessing into one with arbitrarily high accuracy.” • Example: Constructing an expert computer program • Two problems: • Choosing data • Combining rules • “Boosting refers to this general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules-of-thumb.”

Traditional Boosting • Split a training data set into multiple overlapping subsets. • Train a weak learner on one equally weighted example set, until accuracy is > 50%. • Train a weak learner on a new example set, now weighted to focus on errors. • Repeat until all example sets are exhausted. • Apply all learners to test set to determine final hypothesis.

The Problem • Previous algorithms by the same authors “work by calling a given weak learning algorithm WeakLearn multiple times, each time presenting it with a different distribution [of examples], and finally combining all the generated hypotheses into a single hypothesis.” • Problems • Too much has to be known in advance • Improvement of the overall performance depends on the weakest rules

Adaboost: Adaptive Boosting • Instead of sampling, re-weight • Can be used to train weak classifiers • Final classification based on weighted vote of weak classifiers

AdaBoost: If the underlying classifiers are linear networks, then AdaBoost builds multilayer perceptrons one node at a time. However, the underlying classifier can be anything, decision trees, neural networks, etc…

The AdaBoost Algorithm:

Initial Analysis: • The weight of each example is adjusted so that the multiplier will be beta if correct (<1), or 1 if incorrect (beta^0). Remember that the weight will be normalized, so no decrease is effectively an increase. • Each learner gets a vote inversely proportional to the logarithm of its beta, which in turn was proportional to its error.

Beta • What implications does this have? • 1. If the error is .5, or equivalent to random guessing, no information is gained and the time step isn’t used. • 2. For a common error <.5, we can weight examples proportionally to error, and weight votes inversely proportional to error. • 3. In the final hypothesis generation, if error was > .5, the time step will actually have an inverse vote.

Still More Gambling • The gambler now has a pretty good scheme to make money, and downloads the entire race history from the track’s database • 1. He finds that odds are a PAC predictor, and comes up with hypotheses accordingly. • 2. He calculates the error of using this predictor. • 3. He looks at the data in a different way, focusing on examples that odds could not easily predict, and comes up with a new heuristic (when the track is muddy, the horse with the most experienced jockey wins) • 4. He repeats this process until no more viable heuristics can be determined. • 5. When enough of the heuristics indicate a win for a given horse, he places a bet.

Thus, if each base classifier is slightly better than random so that for some , then the training error drops exponentially fast in T since the above bound is at most Theoretical Properties: • Y. Freund and R. Schapire[JCSS97] have proved that the training error of AdaBoost is bounded by: where

Theoretical Properties Con’t • Y. Freund and R. Schapire[JCSS97] have tried to bound the generalization error as: where denotes empirical probability on training sample, s is the sample size, d is the VC-dim of base learner The above bounds suggest that Boosting will overfit if T is large. However, empirical studies show that Boosting often does not overfit • R. Schapire et al. [AnnStat98] gave a margin-based bound: for any > 0 with high probability where

+1 ( ) yt = -1 ( ) Toy Example – taken from Antonio Torralba @MIT Each data point has a class label: and a weight: wt =1 Weak learners from the family of lines h => p(error) = 0.5 it is at chance

+1 ( ) yt = -1 ( ) Toy example Each data point has a class label: and a weight: wt =1 This one seems to be the best This is a ‘weak classifier’: It performs slightly better than chance.

+1 ( ) yt = -1 ( ) Toy example Each data point has a class label: We update the weights: wtwtexp{-ytHt} We set a new problem for which the previous weak classifier performs better than chance again

+1 ( ) yt= -1 ( ) Toy example Each data point has a class label: We update the weights: wtwtexp{-ytHt} We set a new problem for which the previous weak classifier performs better than chance again

+1 ( ) yt = -1 ( ) Toy example Each data point has a class label: We update the weights: wt wt exp{-yt Ht} We set a new problem for which the previous weak classifier performs at chance again

+1 ( ) yt = -1 ( ) Toy example Each data point has a class label: We update the weights: wt wt exp{-yt Ht} We set a new problem for which the previous weak classifier performs better than chance again

Toy example f1 f2 f4 f3 The strong (non- linear) classifier is built as the combination of all the weak (linear) classifiers.

Formal Procedure of AdaBoost

Procedure of Adaboost:

Error on Training Set:

Overfitting • Will Adaboost screw up with a fat complex classifier finally? Occam’s razor – simple is the best Over fitting Shall we stop before over fitting? If only over fitting happens.

Actual Typical Run

An explanation by margin • This margin is not the margin in SVM

Margin Distribution Although final classifier is getting larger, margins are still increasing Final classifier is actually getting to simpler classifer

Practical Advantages of AdaBoost: • Simple and easy to program. • No parameters to tune (except T). • Effective, provided it can consistently find rough rules of thumb. • Goal is to find hypothesis barely better than guessing. • Can combine with any (or many) classifiers to find weak hypotheses: neural networks, decision trees, simple rules of thumb, nearest-neighbor classifiers, etc.

Extensions • AdaBoost.M1: First Multiclass • AdaBoost.M2: Second Multiclass • AdaBoost.R: Weak Regression

AdaBoost.M1 • We modify error calculation as follows: • With the caveat: • And come up with a final hypothesis by:

Concluding Remarks • This paper was the introduction of AdaBoost, an award-winning, widely used algorithm featured in the top 10 algorithms. • The paper also included the Hedge algorithm, a much less widely known algorithm. • It should be noted that Hedge was a solution to a problem that prevented adaptable boosting for a long time, and has therefore had a significant impact on data mining since.

Exam Questions: • What are we seeking to minimize in resource allocation? • What is the goal of boosting? • What makes Adaboost adaptable?

A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Presentation Transcript

Information Theoretic Learning

A Comparison of On-Line and Classroom Learning

On-Line Application Processing

Application of an On-Line Diagnostic English Language Test

Creativity and Collaboration: An on-line learning experience

Application of Stacked Generalization to a Protein Localization Prediction Task

On-line learning and Boosting

Group Theoretic Decision Problems

Decision Trees and Boosting

Line Drawing and Generalization

Application of machine learning to RCF decision procedures

On-Line Application Processing

On-Line Learning

A decision-theoretic view of image retrieval

Welcome to an on-line

Welcome to an on-line

On-line learning and Boosting

Welcome to an on-line

Decision Trees and Boosting

Welcome to an on-line