Learn about the convergence properties of AdaBoost and the margin theory, challenging popular beliefs. Explore key theoretical insights and practical implications in classification algorithms such as boosting.
Dynamics of AdaBoost Cynthia Rudin, PhD NSF Postdoc, BIO Division Center for Neural Science, NYU Joint work with Ingrid Daubechies and Robert E. Schapire Princeton University May 2005, TTI
A Story about AdaBoost • AdaBoost was introduced in 1997 by Yoav Freund and Robert E. Schapire. It is a classification algorithm. • AdaBoost often tends not to overfit. (Breiman 96, Cortes and Drucker 97, etc.) • As a result, the margin theory (Schapire, Freund, Bartlett and Lee 98) was developed, which is based on loose generalization bounds. • Note: the margin for boosting is not the same as the margin for SVMs. • Remember, AdaBoost was invented before the margin theory. The question remained (until recently): Does AdaBoost maximize the margin? (The margin is between -1 and 1.)
The question remained (until recently): Does AdaBoost maximize the margin? • Empirical results on convergence of AdaBoost: • AdaBoost seemed to maximize the margin in the limit (Grove and Schuurmans 98, and others) Seems very much like “yes”…
The question remained (until recently): Does AdaBoost maximize the margin? • Theoretical results on convergence of AdaBoost: • 1) AdaBoost generates a margin that is at least ½ρ, where ρ is the maximum margin. (Schapire, Freund, Bartlett, and Lee 98) • seems like “yes” [Diagram: AdaBoost’s margin is at least ρ/2 (Schapire et al. 98), compared with the true maximum margin ρ.]
The question remained (until recently): Does AdaBoost maximize the margin? • Theoretical results on convergence of AdaBoost: • 2) AdaBoost generates a margin that is at least Υ(ρ) ≥ ½ρ. (Rätsch and Warmuth 02) • even closer to “yes” [Diagram: the guaranteed lower bound improves from ρ/2 (Schapire et al. 98) to Υ(ρ) (Rätsch & Warmuth 02), compared with the true maximum margin ρ.]
The question remained (until recently): Does AdaBoost maximize the margin? 2) AdaBoost generates a margin that is at least Υ(ρ) ≥ ½ρ. (Rätsch and Warmuth 02). • Two cases of interest: • “optimal case”: • the weak learning algorithm chooses the best weak classifier at each iteration. • e.g., BoosTexter • “non-optimal case”: • the weak learning algorithm is only required to choose a “sufficiently good” weak classifier at each iteration, not necessarily the best one. • e.g., weak learning algorithm is a decision tree or neural network
The question remained (until recently): Does AdaBoost maximize the margin? 2) AdaBoost generates a margin that is at least Υ(ρ) ≥ ½ρ. (Rätsch and Warmuth 02) This bound was conjectured to be tight for the “non-optimal case”, based on numerical evidence. (Rätsch and Warmuth 02) Perhaps “yes” for the optimal case, but “no” for the non-optimal case.
The question remained (until recently): Does AdaBoost maximize the margin? • Hundreds of papers using AdaBoost were published between 1997 and 2004, even though its fundamental convergence properties were not understood! Even after 7 years, this problem was still open! • AdaBoost is difficult to analyze because the margin does not increase at every iteration… the usual tricks don’t work! • A new approach was needed in order to understand the convergence of this algorithm.
The question remained (until recently): Does AdaBoost maximize the margin? The answer is… Theorem (R, Daubechies, Schapire 04): AdaBoost may converge to a margin that is significantly below maximum. The answer is “no”. It’s the opposite of what everyone thought! ☺ Theorem (R, Daubechies, Schapire 04): The bound of (Rätsch and Warmuth 02) is tight, i.e., non-optimal AdaBoost will converge to a margin of Υ(ρ) whenever lim_{t→∞} r_t = ρ. (Note: this is a specific case of a more general theorem.)
Overview of Talk • History of Margin Theory for Boosting (done) • Introduction to AdaBoost • Proof of the Theorem: Reduce AdaBoost to a dynamical system to understand its convergence!
Say you have a database of news articles, where each article is labeled ‘+1’ if its category is “entertainment”, and ‘-1’ otherwise. Your goal is: given a new article, find its label. [Figure: example articles paired with labels +1 and -1.]
Examples of classification algorithms: • SVMs (Support Vector Machines – large margin classifiers) • Neural Networks • Decision Trees / Decision Stumps (CART) • RBF Networks • Nearest Neighbors • Bayes Nets • Boosting – used by itself via stumps (e.g., BoosTexter), or as a wrapper for another algorithm (e.g., boosted Decision Trees, boosted Neural Networks)
Training data: {(x_i, y_i)}_{i=1..m}, where each (x_i, y_i) is chosen i.i.d. from an unknown probability distribution on X × {-1,1}. Here X is the “space of all possible articles” and {-1,1} is the set of “labels”. [Figure: points in X marked + and -, with one unlabeled point marked ?.]
How do we construct a classifier? • Divide the space X into two regions, based on the sign of a function f : X → R. • The decision boundary is the zero-level set of f, i.e., {x : f(x) = 0}. [Figure: the space X split by the curve f(x) = 0, with + points on one side and - points on the other.]
Say we have a “weak” learning algorithm: • A weak learning algorithm produces weak classifiers. • (Think of a weak classifier as a “rule of thumb”.) Examples of weak classifiers for the “entertainment” application: h1(article) = +1 if the article contains the term “movie”, -1 otherwise; h2(article) = +1 if the article contains the term “actor”, -1 otherwise; h3(article) = +1 if the article contains the term “drama”, -1 otherwise. Wouldn’t it be nice to combine the weak classifiers?
Boosting algorithms combine weak classifiers in a meaningful way (Schapire ‘89). Example: f(article) = sign[.4 h1(article) + .3 h2(article) + .3 h3(article)]. So if the article contains the term “movie” and the word “drama”, but not the word “actor”, the value of f is sign[.4 - .3 + .3] = sign[.4] = +1, so we label it +1. A boosting algorithm takes as input: - the weak learning algorithm which produces the weak classifiers, - a large training database, and outputs: - the coefficients of the weak classifiers to make the combined classifier.
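A minimal sketch of this weighted vote, assuming hypothetical term-presence weak classifiers and the illustrative coefficients from the example above:

```python
# Minimal sketch of a combined classifier built from term-presence "rules of thumb".
# The terms and coefficients are the illustrative ones from the slide, not a trained model.

def make_term_classifier(term):
    """Weak classifier: +1 if the article contains the term, -1 otherwise."""
    return lambda article: 1 if term in article.lower() else -1

weak_classifiers = [make_term_classifier(t) for t in ("movie", "actor", "drama")]
coefficients = [0.4, 0.3, 0.3]

def combined_classifier(article):
    score = sum(c * h(article) for c, h in zip(coefficients, weak_classifiers))
    return 1 if score >= 0 else -1      # sign of the weighted vote

# Contains "movie" and "drama" but not "actor": sign(.4 - .3 + .3) = +1
print(combined_classifier("A new drama movie opens this week"))
```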
AdaBoost (Freund and Schapire ’96) • Start with a uniform distribution (“weights”) over training examples. (The weights tell the weak learning algorithm which examples are important.) • Obtain a weak classifier from the weak learning algorithm, h_{j_t} : X → {-1,1}. • Increase the weights on the training examples that were misclassified. • (Repeat) At the end, make (carefully!) a linear combination of the weak classifiers obtained at all iterations.
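A minimal sketch of these steps, in the standard textbook form of AdaBoost with ±1-valued weak classifiers (not code from the talk; the weak-classifier outputs are assumed to be precomputed on the training set):

```python
import numpy as np

def adaboost(H, y, n_rounds):
    """Minimal AdaBoost sketch (optimal case: best weak classifier each round).

    H: (m, n) array with H[i, j] = h_j(x_i) in {-1, +1}, precomputed weak-classifier outputs.
    y: (m,) array of labels in {-1, +1}.
    Returns lam, the coefficients on the n weak classifiers.
    Assumes no weak classifier is perfect on the weighted data (|r_t| < 1)."""
    m, n = H.shape
    d = np.ones(m) / m                  # uniform weights over training examples
    lam = np.zeros(n)                   # coefficients of the weak classifiers
    for _ in range(n_rounds):
        edges = (d * y) @ H             # edge of each weak classifier under d
        j = int(np.argmax(edges))       # pick the best weak classifier
        r = edges[j]
        alpha = 0.5 * np.log((1 + r) / (1 - r))
        lam[j] += alpha
        d = d * np.exp(-alpha * y * H[:, j])   # increase weights on misclassified examples
        d = d / d.sum()                 # renormalize to a distribution
    return lam
```

The final combined classifier is then sign(Σ_j lam[j]·h_j(x)); its margin is what the rest of the talk is about.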
AdaBoost Define: M := matrix of weak classifiers and data. Enumerate every possible weak classifier which can be produced by the weak learning algorithm: the columns of M correspond to the weak classifiers h_1, …, h_j, …, h_n (e.g., “movie”, “actor”, “drama”), the rows to the training examples i = 1, …, m, with one entry M_ij per pair. The matrix M has too many columns to actually be enumerated. M acts as the only input to AdaBoost.
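One natural convention for the entries, consistent with the later remark that M is binary (this is my reconstruction; the slide itself only labels the entry M_ij), records whether each weak classifier gets each training example right:

```latex
M_{ij} \;=\; y_i\, h_j(x_i) \;=\;
\begin{cases}
+1 & \text{if } h_j \text{ classifies example } i \text{ correctly},\\
-1 & \text{otherwise},
\end{cases}
\qquad M \in \{-1,+1\}^{m \times n}.
```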
AdaBoost Define: d_t := distribution (“weights”) over examples at time t, e.g., d_t = [ .25 .3 .2 .25 ] over examples 1, 2, 3, 4.
AdaBoost Define: λ_t := coefficients of the weak classifiers for the linear combination.
AdaBoost [Diagram: the AdaBoost pseudocode, annotated with M (matrix of weak classifiers and training instances), the weights on the training instances, the coefficients on the weak classifiers to form the combined classifier, and the coefficients for the final combined classifier.]
AdaBoost [Diagram: the same annotated pseudocode, now also marking r_t = the “edge”.] The d’s cycle; the λ’s converge.
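Written out in this notation, the updates in the diagram take roughly the following form, reconstructed from the standard definition of AdaBoost in the “optimal case” (best weak classifier chosen each round) rather than copied from the slide:

```latex
\begin{aligned}
j_t &= \operatorname*{argmax}_j\,(d_t^\top M)_j,
& r_t &= (d_t^\top M)_{j_t},
& \alpha_t &= \tfrac{1}{2}\ln\frac{1+r_t}{1-r_t},\\[4pt]
\lambda_{t+1} &= \lambda_t + \alpha_t\, e_{j_t},
& d_{t+1,i} &= \frac{d_{t,i}\, e^{-\alpha_t M_{i j_t}}}{\sum_{k=1}^{m} d_{t,k}\, e^{-\alpha_t M_{k j_t}}}.
\end{aligned}
```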
Does AdaBoost choose λ_final so that the margin µ(f) is maximized? That is, does AdaBoost maximize the margin? No! [Figure: labeled + and - points in X with a separating boundary.]
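For reference, the margin µ being asked about is the usual boosting margin, the normalized minimum over training examples (stated here for completeness; it is not spelled out on the slide):

```latex
\mu(\lambda) \;=\; \min_{i=1,\dots,m} \frac{(M\lambda)_i}{\|\lambda\|_1}
\;=\; \min_i \frac{y_i \sum_j \lambda_j h_j(x_i)}{\|\lambda\|_1},
\qquad \mu \in [-1,\,1].
```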
The question remained (until recently): Does AdaBoost maximize the margin? The answer is… Theorem (R, Daubechies, Schapire 04) AdaBoost may converge to a margin that is significantly below maximum. The answer is “no”. It’s the opposite of what everyone thought! ☺
About the proof… • AdaBoost is difficult to analyze… • We use a dynamical systems approach to study this problem. • Reduce AdaBoost to a dynamical system. • Analyze the dynamical system in simple cases… remarkably, we find stable cycles! • Convergence properties can be completely understood in these cases.
The key to answering this open question: A set of examples where AdaBoost’s convergence properties can be completely understood.
Analyzing AdaBoost using Dynamical Systems • Reduced Dynamics (compare to AdaBoost…): an iterated map for directly updating d_t. The reduction uses the fact that M is binary. The existence of this map enables the study of low-dimensional cases.
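The map itself is an image in the original slides; eliminating α_t from the update above and using M_{ij} ∈ {-1,+1} gives the following reconstruction, to be read as a sketch rather than a verbatim copy of the slide:

```latex
d_{t+1,i} \;=\; \frac{d_{t,i}}{1 + M_{i j_t}\, r_t},
\qquad j_t,\ r_t \text{ as defined above}.
```

Only the distribution d_t appears; the coefficients λ_t have been eliminated, which is what makes the low-dimensional analysis tractable.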
Smallest Non-Trivial Case [Plot: AdaBoost’s iterates from t=1 to t=50.]
[Plots: snapshots of the dynamics at iterations t=1 through t=6.]
Smallest Non-Trivial Case To solve: simply assume a 3-cycle exists. Convergence to the 3-cycle is *really* strong.
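To make “smallest non-trivial case” concrete, here is a sketch of what it presumably refers to, consistent with the generalization two slides below (m weak classifiers, each misclassifying exactly one point): three training examples and three weak classifiers, with

```latex
M \;=\;
\begin{pmatrix}
-1 & +1 & +1\\
+1 & -1 & +1\\
+1 & +1 & -1
\end{pmatrix},
```

and one then looks for a 3-cycle d_A → d_B → d_C → d_A of the reduced map above.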
Two possible stable cycles! [Plot: the iterates from t=1 to t=50, with the cycle points marked by x’s.] The maximum margin solution is attained! To solve: simply assume a 3-cycle exists. AdaBoost achieves the maximum margin here, so the conjecture is true in at least one case. The edge, r_t, is the golden ratio minus 1.
Generalization of the smallest non-trivial case • Case of m weak classifiers, each of which misclassifies one point. • Existence of at least (m-1)! stable cycles, each of which yields a maximum margin solution. • We can’t solve for the cycle exactly, but we can prove our equation has a unique solution for each cycle.
Generalization of smallest non-trivial case • Stable manifolds of 3-cycles.
Empirically Observed Cycles [Plots: AdaBoost’s iterates on several datasets, shown from t=1 to t=300 or t=400, and in one case to t=5500 (plotting only every 20th iterate).]
If AdaBoost cycles, we can calculate the margin it will asymptotically converge to in terms of the edge values
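The formula is an image in the slides; one way to reconstruct it from the reduced map above: over one cycle of length T, the weight of each example i with d_i > 0 returns to its starting value, so ∏_{t=1}^{T}(1 + M_{i j_t} r_t) = 1, and combining this with the definition of the margin gives

```latex
\mu_{\text{cycle}}
\;=\;
\frac{\displaystyle\sum_{t=1}^{T} \ln\!\Big(\frac{1}{1-r_t^{2}}\Big)}
     {\displaystyle\sum_{t=1}^{T} \ln\!\Big(\frac{1+r_t}{1-r_t}\Big)} .
```

As a sanity check, a cycle whose edges all equal the golden ratio minus 1 (so r_t² = 1 - r_t) gives µ = 1/3, which for the 3×3 matrix sketched earlier is exactly the maximum margin.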
The question remained (until recently): Does AdaBoost maximize the margin? AdaBoost does not always produce a maximum margin classifier! Proof: There exists an 8x8 matrix M where AdaBoost provably converges to a non-maximum margin solution. • Convergence to a manifold of strongly attracting stable 3-cycles. • The margin produced by AdaBoost is 1/3, but the maximum margin is 3/8!
Approximate Coordinate Ascent Boosting (R, Schapire, Daubechies, 04) [Plot: comparison with AdaBoost.]