460 likes | 777 Vues
Boosting. LING 572 Fei Xia 02/01/06. Outline. Basic concepts Theoretical validity Case study: POS tagging Summary. Basic concepts. Overview of boosting. Introduced by Schapire and Freund in 1990s. “Boosting”: convert a weak learning algorithm into a strong one.
E N D
Boosting LING 572 Fei Xia 02/01/06
Outline • Basic concepts • Theoretical validity • Case study: • POS tagging • Summary
Overview of boosting • Introduced by Schapire and Freund in 1990s. • “Boosting”: convert a weak learning algorithm into a strong one. • Main idea: Combine many weak classifiers to produce a powerful committee. • Algorithms: • AdaBoost: adaptive boosting • Gentle AdaBoost • BrownBoost • …
Bagging ML Random sample with replacement f1 ML f2 f ML fT Random sample with replacement
Boosting Weighted Sample ML f1 Training Sample ML Weighted Sample f2 f … ML fT
Main ideas • Train a set of weak hypotheses: h1, …., hT. • The combined hypothesis H is a weighted majority vote of the T weak hypotheses. • Each hypothesis ht has a weight αt. • During the training, focus on the examples that are misclassified. At round t, example xi has the weight Dt(i).
Algorithm highlight • Training time: (h1, 1), …., (ht, t), … • Test time: for x, • Call each classifier ht, and calculate ht(x) • Calculate the sum: tt * ht(x)
Basic Setting • Binary classification problem • Training data: • Dt(i): the weight of xi at round t. D1(i)=1/m. • A learner L that finds a weak hypothesis ht: X Y given the training set and Dt • The error of a weak hypothesis ht:
The basic AdaBoost algorithm • For t=1, …, T • Train weak learner ht : X {-1, 1}using training data and Dt • Get the error rate: • Choose classifier weight: • Update the instance weights:
The new weights When When
An example o + o + +
Two iterations Initial weights: 1st iteration: 2nd iteration:
The basic and general algorithms • In the basic algorithm, it can be proven that • The hypothesis weight αt is decided at round t • Di (The weight distribution of training examples) is updated at every round t. • Choice of weak learner: • its error should be less than 0.5: • Ex: DT (C4.5), decision stump
Experiment results(Freund and Schapire, 1996) Error rate on a set of 27 benchmark problems
Training error of H(x) Final hypothesis: Training error is defined to be It can be proved that training error
Training error for basic algorithm Let Training error Training error drops exponentially fast.
Generalization error (expected test error) • Generalization error, with high probability, is at most T: the number of rounds of boosting m: the size of the sample d: VC-dimension of the base classifier space
Selecting weak hypotheses • Training error • Choose ht that minimize Zt. • See “case study” for details.
Two ways • Converting a multiclass problem to binary problem first: • One-vs-all • All-pairs • ECOC • Extending boosting directly • AdaBoost.M1 • AdaBoost.M2 Prob 2 in Hw5
Overview(Abney, Schapire and Singer, 1999) • Boosting applied to Tagging and PP attachment • Issues: • How to learn weak hypotheses? • How to deal with multi-class problems? • Local decision vs. globally best sequence
Weak hypotheses • In this paper, a weak hypothesis h simply tests a predicate (a.k.a. feature), Φ: h(x) = p1 if Φ(x) is true, h(x)=p0 o.w. h(x)=pΦ(x) • Examples: • POS tagging: Φ is “PreviousWord=the” • PP attachment: Φ is “V=accused, N1=president, P=of” • Choosing a list of hypotheses choosing a list of features.
Finding weak hypotheses • The training error of the combined hypothesis is at most where choose ht that minimizes Zt. • ht corresponds to a (Φt, p0, p1) tuple.
Schapire and Singer (1998) show that given a predicate Φ, Zt is minimized when where
Finding weak hypotheses (cont) • For each Φ, calculate Zt Choose the one with min Zt.
Sequential model • Sequential model: a Viterbi-style optimization to choose a globally best sequence of labels.
Main ideas • Boosting combines many weak classifiers to produce a powerful committee. • Base learning algorithms that only need to be better than random. • The instance weights are updated during training to put more emphasis on hard examples.
Strengths of AdaBoost • Theoretical validity: it comes with a set of theoretical guarantee (e.g., training error, test error) • It performs well on many tasks. • It can identify outliners: i.e. examples that are either mislabeled or that are inherently ambiguous and hard to categorize.
Weakness of AdaBoost • The actual performance of boosting depends on the data and the base learner. • Boosting seems to be especially susceptible to noise. • When the number of outliners is very large, the emphasis placed on the hard examples can hurt the performance. “Gentle AdaBoost”, “BrownBoost”
Other properties • Simplicity (conceptual) • Efficiency at training • Efficiency at testing time • Handling multi-class • Interpretability
Bagging vs. Boosting (Freund and Schapire 1996) • Bagging always uses resampling rather than reweighting. • Bagging does not modify the weight distribution over examples or mislabels, but instead always uses the uniform distribution • In forming the final hypothesis, bagging gives equal weight to each of the weak hypotheses
Relation to other topics • Game theory • Linear programming • Bregman distances • Support-vector machines • Brownian motion • Logistic regression • Maximum-entropy methods such as iterative scaling.
Sources of Bias and Variance • Bias arises when the classifier cannot represent the true function – that is, the classifier underfits the data • Variance arises when the classifier overfits the data • There is often a tradeoff between bias and variance
Effect of Bagging • If the bootstrap replicate approximation were correct, then bagging would reduce variance without changing bias. • In practice, bagging can reduce both bias and variance • For high-bias classifiers, it can reduce bias • For high-variance classifiers, it can reduce variance
Effect of Boosting • In the early iterations, boosting is primary a bias-reducing method • In later iterations, it appears to be primarily a variance-reducing method
How to choose αt for ht with range [-1,1]? • Training error • Choose αt that minimize Zt.
Issues • Given ht, how to choose αt? • How to select ht?