Learn about the convergence properties of AdaBoost and the margin theory, challenging popular beliefs. Explore key theoretical insights and practical implications in classification algorithms such as boosting.
Dynamics of AdaBoost Cynthia Rudin, PhD NSF Postdoc, BIO Division Center for Neural Science, NYU Joint work with Ingrid Daubechies and Robert E. Schapire Princeton University May 2005, TTI
A Story about AdaBoost • AdaBoost was introduced in 1997 by Yoav Freund and Robert E. Schapire. It is a classification algorithm. • AdaBoost often tends not to overfit. (Breiman 96, Cortes and Drucker 97, etc.) • As a result, the margin theory (Schapire, Freund, Bartlett and Lee 98) was developed, which is based on loose generalization bounds. • Note: the margin for boosting is not the same as the margin for SVMs. • Remember, AdaBoost was invented before the margin theory. The question remained (until recently): Does AdaBoost maximize the margin? (The margin is between -1 and 1.)
The question remained (until recently): Does AdaBoost maximize the margin? • Empirical results on convergence of AdaBoost: • AdaBoost seemed to maximize the margin in the limit (Grove and Schuurmans 98, and others) Seems very much like “yes”…
The question remained (until recently): Does AdaBoost maximize the margin? • Theoretical results on convergence of AdaBoost: • 1) AdaBoost generates a margin that is at least ½ρ, where ρ is the maximum margin. (Schapire, Freund, Bartlett, and Lee 98) • seems like “yes” [Diagram: AdaBoost’s margin is at least ρ/2 (Schapire et al. 98), compared with the true maximum margin ρ.]
The question remained (until recently): Does AdaBoost maximize the margin? • Theoretical results on convergence of AdaBoost: • 2) AdaBoost generates a margin that is at least Υ(ρ) ≥ ½ρ. (Rätsch and Warmuth 02) • even closer to “yes” [Diagram: the guaranteed lower bound improves from ρ/2 (Schapire et al. 98) to Υ(ρ) (Rätsch & Warmuth 02), compared with the true maximum margin ρ.]
The question remained (until recently): Does AdaBoost maximize the margin? 2) AdaBoost generates a margin that is at least Υ(ρ) ≥ ½ρ. (Rätsch and Warmuth 02). • Two cases of interest: • “optimal case”: • the weak learning algorithm chooses the best weak classifier at each iteration. • e.g., BoosTexter • “non-optimal case”: • the weak learning algorithm is only required to choose a “sufficiently good” weak classifier at each iteration, not necessarily the best one. • e.g., weak learning algorithm is a decision tree or neural network
The question remained (until recently): Does AdaBoost maximize the margin? 2) AdaBoost generates a margin that is at least Υ(ρ) ≥ ½ρ. (Rätsch and Warmuth 02) This bound was conjectured to be tight for the “non-optimal case”, based on numerical evidence. (Rätsch and Warmuth 02) Perhaps “yes” for the optimal case, but “no” for the non-optimal case.
The question remained (until recently): Does AdaBoost maximize the margin? • Hundreds of papers using AdaBoost were published between 1997 and 2004, even though its fundamental convergence properties were not understood! Even after 7 years, this problem was still open! • AdaBoost is difficult to analyze because the margin does not increase at every iteration… the usual tricks don’t work! • A new approach was needed in order to understand the convergence of this algorithm.
The question remained (until recently): Does AdaBoost maximize the margin? The answer is… Theorem (R, Daubechies, Schapire 04): AdaBoost may converge to a margin that is significantly below maximum. The answer is “no”. It’s the opposite of what everyone thought! ☺ Theorem (R, Daubechies, Schapire 04): The bound of (Rätsch and Warmuth 02) is tight, i.e., non-optimal AdaBoost will converge to a margin of Υ(ρ) whenever lim_{t→∞} r_t = ρ. (Note: this is a specific case of a more general theorem.)
Overview of Talk • History of Margin Theory for Boosting (done) • Introduction to AdaBoost • Proof of the Theorem: Reduce AdaBoost to a dynamical system to understand its convergence!
Say you have a database of news articles, where each article is labeled ‘+1’ if its category is “entertainment”, and ‘-1’ otherwise. Your goal is: given a new article, find its label. [Figure: example articles paired with labels +1 and -1.]
Examples of classification algorithms: • SVMs (Support Vector Machines – large margin classifiers) • Neural Networks • Decision Trees / Decision Stumps (CART) • RBF Networks • Nearest Neighbors • Bayes Nets • Boosting – used by itself via stumps (e.g., BoosTexter), or as a wrapper for another algorithm (e.g., boosted Decision Trees, boosted Neural Networks)
Training data: {(x_i, y_i)}_{i=1..m}, where each (x_i, y_i) is chosen i.i.d. from an unknown probability distribution on X × {-1,1}. Here X is the “space of all possible articles” and {-1,1} is the set of “labels”. [Figure: points in X marked + and -, with one unlabeled point marked ?.]
How do we construct a classifier? • Divide the space X into two regions, based on the sign of a function f : X → R. • The decision boundary is the zero-level set of f, i.e., {x : f(x) = 0}. [Figure: the space X split by the curve f(x) = 0, with + points on one side and - points on the other.]
Say we have a “weak” learning algorithm: • A weak learning algorithm produces weak classifiers. • (Think of a weak classifier as a “rule of thumb”.) Examples of weak classifiers for the “entertainment” application: h1(article) = +1 if the article contains the term “movie”, -1 otherwise; h2(article) = +1 if the article contains the term “actor”, -1 otherwise; h3(article) = +1 if the article contains the term “drama”, -1 otherwise. Wouldn’t it be nice to combine the weak classifiers?
Boosting algorithms combine weak classifiers in a meaningful way (Schapire ‘89). Example: f(article) = sign[.4 h1(article) + .3 h2(article) + .3 h3(article)]. So if the article contains the term “movie” and the word “drama”, but not the word “actor”, the value of f is sign[.4 - .3 + .3] = sign[.4] = +1, so we label it +1. A boosting algorithm takes as input: - the weak learning algorithm which produces the weak classifiers, - a large training database, and outputs: - the coefficients of the weak classifiers to make the combined classifier.
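A minimal sketch of this weighted vote, assuming hypothetical term-presence weak classifiers and the illustrative coefficients from the example above:

```python
# Minimal sketch of a combined classifier built from term-presence "rules of thumb".
# The terms and coefficients are the illustrative ones from the slide, not a trained model.

def make_term_classifier(term):
    """Weak classifier: +1 if the article contains the term, -1 otherwise."""
    return lambda article: 1 if term in article.lower() else -1

weak_classifiers = [make_term_classifier(t) for t in ("movie", "actor", "drama")]
coefficients = [0.4, 0.3, 0.3]

def combined_classifier(article):
    score = sum(c * h(article) for c, h in zip(coefficients, weak_classifiers))
    return 1 if score >= 0 else -1      # sign of the weighted vote

# Contains "movie" and "drama" but not "actor": sign(.4 - .3 + .3) = +1
print(combined_classifier("A new drama movie opens this week"))
```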
AdaBoost (Freund and Schapire ’96) • Start with a uniform distribution (“weights”) over training examples. (The weights tell the weak learning algorithm which examples are important.) • Obtain a weak classifier from the weak learning algorithm, h_{j_t} : X → {-1,1}. • Increase the weights on the training examples that were misclassified. • (Repeat) At the end, make (carefully!) a linear combination of the weak classifiers obtained at all iterations.
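A minimal sketch of these steps, in the standard textbook form of AdaBoost with ±1-valued weak classifiers (not code from the talk; the weak-classifier outputs are assumed to be precomputed on the training set):

```python
import numpy as np

def adaboost(H, y, n_rounds):
    """Minimal AdaBoost sketch (optimal case: best weak classifier each round).

    H: (m, n) array with H[i, j] = h_j(x_i) in {-1, +1}, precomputed weak-classifier outputs.
    y: (m,) array of labels in {-1, +1}.
    Returns lam, the coefficients on the n weak classifiers.
    Assumes no weak classifier is perfect on the weighted data (|r_t| < 1)."""
    m, n = H.shape
    d = np.ones(m) / m                  # uniform weights over training examples
    lam = np.zeros(n)                   # coefficients of the weak classifiers
    for _ in range(n_rounds):
        edges = (d * y) @ H             # edge of each weak classifier under d
        j = int(np.argmax(edges))       # pick the best weak classifier
        r = edges[j]
        alpha = 0.5 * np.log((1 + r) / (1 - r))
        lam[j] += alpha
        d = d * np.exp(-alpha * y * H[:, j])   # increase weights on misclassified examples
        d = d / d.sum()                 # renormalize to a distribution
    return lam
```

The final combined classifier is then sign(Σ_j lam[j]·h_j(x)); its margin is what the rest of the talk is about.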
AdaBoost Define: M := matrix of weak classifiers and data. Enumerate every possible weak classifier which can be produced by the weak learning algorithm: the columns of M correspond to the weak classifiers h_1, …, h_j, …, h_n (e.g., “movie”, “actor”, “drama”), the rows to the training examples i = 1, …, m, with one entry M_ij per pair. The matrix M has too many columns to actually be enumerated. M acts as the only input to AdaBoost.
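One natural convention for the entries, consistent with the later remark that M is binary (this is my reconstruction; the slide itself only labels the entry M_ij), records whether each weak classifier gets each training example right:

```latex
M_{ij} \;=\; y_i\, h_j(x_i) \;=\;
\begin{cases}
+1 & \text{if } h_j \text{ classifies example } i \text{ correctly},\\
-1 & \text{otherwise},
\end{cases}
\qquad M \in \{-1,+1\}^{m \times n}.
```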
AdaBoost Define: d_t := distribution (“weights”) over examples at time t, e.g., d_t = [ .25 .3 .2 .25 ] over examples 1, 2, 3, 4.
AdaBoost Define: λ_t := coefficients of the weak classifiers for the linear combination.
AdaBoost [Diagram: the AdaBoost pseudocode, annotated with M (matrix of weak classifiers and training instances), the weights on the training instances, the coefficients on the weak classifiers to form the combined classifier, and the coefficients for the final combined classifier.]
AdaBoost [Diagram: the same annotated pseudocode, now also marking r_t = the “edge”.] The d’s cycle; the λ’s converge.
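Written out in this notation, the updates in the diagram take roughly the following form, reconstructed from the standard definition of AdaBoost in the “optimal case” (best weak classifier chosen each round) rather than copied from the slide:

```latex
\begin{aligned}
j_t &= \operatorname*{argmax}_j\,(d_t^\top M)_j,
& r_t &= (d_t^\top M)_{j_t},
& \alpha_t &= \tfrac{1}{2}\ln\frac{1+r_t}{1-r_t},\\[4pt]
\lambda_{t+1} &= \lambda_t + \alpha_t\, e_{j_t},
& d_{t+1,i} &= \frac{d_{t,i}\, e^{-\alpha_t M_{i j_t}}}{\sum_{k=1}^{m} d_{t,k}\, e^{-\alpha_t M_{k j_t}}}.
\end{aligned}
```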
Does AdaBoost choose λ_final so that the margin µ(f) is maximized? That is, does AdaBoost maximize the margin? No! [Figure: labeled + and - points in X with a separating boundary.]
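For reference, the margin µ being asked about is the usual boosting margin, the normalized minimum over training examples (stated here for completeness; it is not spelled out on the slide):

```latex
\mu(\lambda) \;=\; \min_{i=1,\dots,m} \frac{(M\lambda)_i}{\|\lambda\|_1}
\;=\; \min_i \frac{y_i \sum_j \lambda_j h_j(x_i)}{\|\lambda\|_1},
\qquad \mu \in [-1,\,1].
```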
The question remained (until recently): Does AdaBoost maximize the margin? The answer is… Theorem (R, Daubechies, Schapire 04) AdaBoost may converge to a margin that is significantly below maximum. The answer is “no”. It’s the opposite of what everyone thought! ☺
About the proof… • AdaBoost is difficult to analyze… • We use a dynamical systems approach to study this problem. • Reduce AdaBoost to a dynamical system. • Analyze the dynamical system in simple cases… remarkably, we find stable cycles! • Convergence properties can be completely understood in these cases.
The key to answering this open question: A set of examples where AdaBoost’s convergence properties can be completely understood.
Analyzing AdaBoost using Dynamical Systems • Reduced Dynamics (compare to AdaBoost…): an iterated map for directly updating d_t. The reduction uses the fact that M is binary. The existence of this map enables the study of low-dimensional cases.
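The map itself is an image in the original slides; eliminating α_t from the update above and using M_{ij} ∈ {-1,+1} gives the following reconstruction, to be read as a sketch rather than a verbatim copy of the slide:

```latex
d_{t+1,i} \;=\; \frac{d_{t,i}}{1 + M_{i j_t}\, r_t},
\qquad j_t,\ r_t \text{ as defined above}.
```

Only the distribution d_t appears; the coefficients λ_t have been eliminated, which is what makes the low-dimensional analysis tractable.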
Smallest Non-Trivial Case [Plot: AdaBoost’s iterates from t=1 to t=50.]
[Plots: snapshots of the dynamics at iterations t=1 through t=6.]
Smallest Non-Trivial Case To solve: simply assume a 3-cycle exists. Convergence to the 3-cycle is *really* strong.
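To make “smallest non-trivial case” concrete, here is a sketch of what it presumably refers to, consistent with the generalization two slides below (m weak classifiers, each misclassifying exactly one point): three training examples and three weak classifiers, with

```latex
M \;=\;
\begin{pmatrix}
-1 & +1 & +1\\
+1 & -1 & +1\\
+1 & +1 & -1
\end{pmatrix},
```

and one then looks for a 3-cycle d_A → d_B → d_C → d_A of the reduced map above.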
Two possible stable cycles! [Plot: the iterates from t=1 to t=50, with the cycle points marked by x’s.] The maximum margin solution is attained! To solve: simply assume a 3-cycle exists. AdaBoost achieves the maximum margin here, so the conjecture is true in at least one case. The edge, r_t, is the golden ratio minus 1.
Generalization of the smallest non-trivial case • Case of m weak classifiers, each of which misclassifies one point. • Existence of at least (m-1)! stable cycles, each of which yields a maximum margin solution. • We can’t solve for the cycle exactly, but we can prove our equation has a unique solution for each cycle.
Generalization of smallest non-trivial case • Stable manifolds of 3-cycles.
Empirically Observed Cycles [Plots: AdaBoost’s iterates on several datasets, shown from t=1 to t=300 or t=400, and in one case to t=5500 (plotting only every 20th iterate).]
If AdaBoost cycles, we can calculate the margin it will asymptotically converge to in terms of the edge values
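The formula is an image in the slides; one way to reconstruct it from the reduced map above: over one cycle of length T, the weight of each example i with d_i > 0 returns to its starting value, so ∏_{t=1}^{T}(1 + M_{i j_t} r_t) = 1, and combining this with the definition of the margin gives

```latex
\mu_{\text{cycle}}
\;=\;
\frac{\displaystyle\sum_{t=1}^{T} \ln\!\Big(\frac{1}{1-r_t^{2}}\Big)}
     {\displaystyle\sum_{t=1}^{T} \ln\!\Big(\frac{1+r_t}{1-r_t}\Big)} .
```

As a sanity check, a cycle whose edges all equal the golden ratio minus 1 (so r_t² = 1 - r_t) gives µ = 1/3, which for the 3×3 matrix sketched earlier is exactly the maximum margin.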
The question remained (until recently): Does AdaBoost maximize the margin? AdaBoost does not always produce a maximum margin classifier! Proof: There exists an 8x8 matrix M where AdaBoost provably converges to a non-maximum margin solution. • Convergence to a manifold of strongly attracting stable 3-cycles. • The margin produced by AdaBoost is 1/3, but the maximum margin is 3/8!
Approximate Coordinate Ascent Boosting (R, Schapire, Daubechies, 04) [Plot: comparison with AdaBoost.]