
Boosting


Presentation Transcript


  1. Boosting LING 572 Fei Xia 02/01/06

  2. Outline • Basic concepts • Theoretical validity • Case study: POS tagging • Summary

  3. Basic concepts

  4. Overview of boosting • Introduced by Schapire and Freund in the 1990s. • “Boosting”: convert a weak learning algorithm into a strong one. • Main idea: combine many weak classifiers to produce a powerful committee. • Algorithms: • AdaBoost: adaptive boosting • Gentle AdaBoost • BrownBoost • …

  5. Bagging [Diagram: T random samples are drawn with replacement from the training data; the learner ML is run on each sample to produce classifiers f1, …, fT, which are combined into the final classifier f.]
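To make the bagging picture concrete, here is a minimal sketch (not from the slides; the function names, `examples`, and `base_learner` are illustrative): each classifier is trained on a bootstrap sample, and the final prediction is an unweighted majority vote.

```python
import random
from collections import Counter

def bagging_train(examples, base_learner, T=10):
    """Train T classifiers f1, ..., fT, each on a random sample drawn
    with replacement from the training data (a bootstrap sample)."""
    models = []
    for _ in range(T):
        sample = [random.choice(examples) for _ in range(len(examples))]
        models.append(base_learner(sample))   # base_learner: [(x, y), ...] -> classifier
    return models

def bagging_predict(x, models):
    """Combine f1, ..., fT by an unweighted majority vote."""
    votes = Counter(f(x) for f in models)
    return votes.most_common(1)[0][0]
```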

  6. Boosting [Diagram: the learner ML is first run on the training sample to produce f1; before each later round the sample is reweighted, and ML is run on the weighted sample to produce f2, …, fT, which are combined into the final classifier f.]

  7. Main ideas • Train a set of weak hypotheses: h1, …, hT. • The combined hypothesis H is a weighted majority vote of the T weak hypotheses. • Each hypothesis ht has a weight αt. • During training, focus on the examples that are misclassified → at round t, example xi has the weight Dt(i).

  8. Algorithm highlight • Training time: produce (h1, α1), …, (ht, αt), … • Test time: for x, • Call each classifier ht and calculate ht(x) • Calculate the weighted sum: Σt αt ht(x)
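As a small sketch of the test-time computation (hypothetical names, assuming weak classifiers that return -1 or +1):

```python
from typing import Callable, List

def adaboost_predict(x, hypotheses: List[Callable], alphas: List[float]) -> int:
    """Weighted majority vote: the sign of sum_t alpha_t * h_t(x)."""
    score = sum(a * h(x) for h, a in zip(hypotheses, alphas))
    return 1 if score >= 0 else -1
```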

  9. Basic Setting • Binary classification problem • Training data: (x1, y1), …, (xm, ym), with yi ∈ {-1, +1} • Dt(i): the weight of xi at round t; D1(i) = 1/m. • A learner L that finds a weak hypothesis ht: X → Y given the training set and Dt • The error of a weak hypothesis ht: εt = Pr_{i ~ Dt}[ht(xi) ≠ yi] = Σ_{i: ht(xi) ≠ yi} Dt(i)

  10. The basic AdaBoost algorithm • For t = 1, …, T • Train weak learner ht: X → {-1, +1} using the training data and Dt • Get the error rate: εt = Σ_{i: ht(xi) ≠ yi} Dt(i) • Choose the classifier weight: αt = ½ ln((1 − εt) / εt) • Update the instance weights: Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt, where Zt normalizes Dt+1 to a distribution
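For concreteness, a minimal sketch of the basic algorithm, assuming a NumPy feature matrix X, labels y in {-1, +1}, and one-level decision stumps as the weak learner (all names are illustrative; the slides themselves give no code):

```python
import numpy as np

def train_stump(X, y, D):
    """Weak learner: the decision stump (feature, threshold, polarity)
    with the lowest weighted error under the current distribution D."""
    m, n = X.shape
    best, best_err = None, np.inf
    for j in range(n):
        for thresh in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - thresh) >= 0, 1, -1)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thresh, polarity)
    j, thresh, polarity = best
    h = lambda Z: np.where(polarity * (Z[:, j] - thresh) >= 0, 1, -1)
    return h, best_err

def adaboost_train(X, y, T=10):
    """Basic AdaBoost; y must be a NumPy array with entries in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                     # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for t in range(T):
        h, eps = train_stump(X, y, D)           # weak hypothesis and its weighted error
        eps = min(max(eps, 1e-10), 1 - 1e-10)   # keep the log finite
        alpha = 0.5 * np.log((1 - eps) / eps)   # alpha_t
        D = D * np.exp(-alpha * y * h(X))       # up-weight misclassified examples
        D = D / D.sum()                         # normalize (divide by Z_t)
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas

def adaboost_predict_batch(X, hypotheses, alphas):
    """H(x) = sign(sum_t alpha_t * h_t(x)) for every row of X."""
    score = sum(a * h(X) for h, a in zip(hypotheses, alphas))
    return np.where(score >= 0, 1, -1)
```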

  11. The new weights • Dt+1(i) = Dt(i) e^(−αt) / Zt when ht(xi) = yi (xi is classified correctly) • Dt+1(i) = Dt(i) e^(αt) / Zt when ht(xi) ≠ yi (xi is misclassified)
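For instance (made-up numbers, not from the slides): if εt = 0.2, then αt = ½ ln(0.8 / 0.2) = ½ ln 4 ≈ 0.693, so before normalization every correctly classified example has its weight multiplied by e^(−0.693) = 0.5, while every misclassified example has its weight multiplied by e^(0.693) = 2.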

  12. An example [Figure: a toy 2-D data set of positive (+) and negative (o) examples used to illustrate the algorithm.]

  13. Two iterations [The slide traces the toy example through two rounds: the initial weights, the weights after the 1st iteration, and the weights after the 2nd iteration.]
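Since the slide's actual numbers are not recoverable from the transcript, here is an illustrative first round with made-up figures: with m = 10 examples, D1(i) = 0.1. If h1 misclassifies 3 of them, ε1 = 0.3, α1 = ½ ln(0.7/0.3) ≈ 0.42, and Z1 = 2√(0.3 · 0.7) ≈ 0.917; after normalization each correctly classified example has weight 1/14 ≈ 0.071 and each misclassified example 1/6 ≈ 0.167, so the 7 correct and the 3 misclassified examples each carry half of the total weight. The 2nd iteration repeats the same computation starting from these weights.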

  14. The general AdaBoost algorithm

  15. The basic and general algorithms • In the basic algorithm, it can be proven that αt = ½ ln((1 − εt)/εt) is the value that minimizes Zt • The hypothesis weight αt is decided at round t • Dt (the weight distribution over training examples) is updated at every round t • Choice of weak learner: its error should be less than 0.5: εt < 0.5 • Examples: decision trees (C4.5), decision stumps

  16. Experiment results (Freund and Schapire, 1996) [Figure: error rates on a set of 27 benchmark problems.]

  17. Theoretical validity

  18. Training error of H(x) • Final hypothesis: H(x) = sign(f(x)), where f(x) = Σt αt ht(x) • Training error is defined to be the fraction of training examples with H(xi) ≠ yi • It can be proved that training error ≤ Πt Zt
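A one-line justification of the bound (not spelled out in the transcript, but standard): unraveling the weight update gives D_{T+1}(i) = exp(−yi f(xi)) / (m Πt Zt), and exp(−yi f(xi)) ≥ 1 whenever H(xi) ≠ yi, so (1/m) · #{i: H(xi) ≠ yi} ≤ (1/m) Σi exp(−yi f(xi)) = Πt Zt · Σi D_{T+1}(i) = Πt Zt.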

  19. Training error for basic algorithm • Let γt = 1/2 − εt • Training error ≤ Πt Zt = Πt 2√(εt(1 − εt)) = Πt √(1 − 4γt²) ≤ exp(−2 Σt γt²) → Training error drops exponentially fast.

  20. Generalization error (expected test error) • Generalization error, with high probability, is at most the training error plus Õ(√(T d / m)) • T: the number of rounds of boosting • m: the size of the sample • d: VC-dimension of the base classifier space

  21. Selecting weak hypotheses • Training error ≤ Πt Zt → choose the ht that minimizes Zt. • See “case study” for details.

  22. Multiclass boosting

  23. Two ways • Converting a multiclass problem to a binary problem first: • One-vs-all (see the sketch below) • All-pairs • ECOC • Extending boosting directly: • AdaBoost.M1 • AdaBoost.M2 → Prob 2 in Hw5
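A minimal one-vs-all sketch, reusing the hypothetical adaboost_train / stump hypotheses from the earlier sketch (NumPy arrays assumed; the AdaBoost.M1/M2 extensions are not shown):

```python
import numpy as np

def one_vs_all_train(X, y_labels, classes, T=10):
    """One binary AdaBoost model per class: class k is +1, everything else -1."""
    models = {}
    for k in classes:
        y_bin = np.where(y_labels == k, 1, -1)
        models[k] = adaboost_train(X, y_bin, T)   # (hypotheses, alphas) from the earlier sketch
    return models

def one_vs_all_predict(x_row, models):
    """Pick the class whose binary model assigns the largest weighted score."""
    def score(model):
        hypotheses, alphas = model
        return sum(a * h(x_row[None, :])[0] for h, a in zip(hypotheses, alphas))
    return max(models, key=lambda k: score(models[k]))
```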

  24. Case study

  25. Overview (Abney, Schapire and Singer, 1999) • Boosting applied to Tagging and PP attachment • Issues: • How to learn weak hypotheses? • How to deal with multi-class problems? • Local decision vs. globally best sequence

  26. Weak hypotheses • In this paper, a weak hypothesis h simply tests a predicate (a.k.a. feature) Φ: h(x) = p1 if Φ(x) is true, h(x) = p0 otherwise → h(x) = p_Φ(x) • Examples: • POS tagging: Φ is “PreviousWord=the” • PP attachment: Φ is “V=accused, N1=president, P=of” • Choosing a list of hypotheses → choosing a list of features.

  27. Finding weak hypotheses • The training error of the combined hypothesis is at most Πt Zt, where Zt = Σi Dt(i) exp(−αt yi ht(xi)) → choose the ht that minimizes Zt. • ht corresponds to a (Φt, p0, p1) tuple.

  28. Schapire and Singer (1998) show that, given a predicate Φ (and taking αt = 1, since the confidences p0, p1 absorb the hypothesis weight), Zt is minimized when pj = ½ ln(W(j, +1) / W(j, −1)) for j ∈ {0, 1}, where W(j, b) = Σ Dt(i) over the examples i with Φ(xi) = j and yi = b; the minimized value is Zt = 2 Σj √(W(j, +1) · W(j, −1)).

  29. Finding weak hypotheses (cont) • For each Φ, calculate Zt; choose the one with the minimum Zt (a sketch follows below).
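A sketch of this selection step under the formulas above (hypothetical helper; predicates are boolean functions of x, labels are in {-1, +1}, and a small smoothing constant guards against a zero denominator inside the log):

```python
import math

def best_predicate(predicates, examples, D, smooth=1e-8):
    """For each candidate predicate phi (x -> bool), compute the optimal
    confidence values p0, p1 and the resulting Z_t, and return the weak
    hypothesis h(x) = p_phi(x) for the predicate with the smallest Z_t.
    `examples` is a list of (x, y) pairs with y in {-1, +1}; D is the
    current weight distribution (same length as `examples`)."""
    best = None
    for phi in predicates:
        # W[j][b]: total weight of examples with phi(x) == j and label b
        W = {0: {+1: 0.0, -1: 0.0}, 1: {+1: 0.0, -1: 0.0}}
        for (x, y), d in zip(examples, D):
            W[1 if phi(x) else 0][y] += d
        # optimal confidence per block and the minimized Z_t
        p = {j: 0.5 * math.log((W[j][+1] + smooth) / (W[j][-1] + smooth)) for j in (0, 1)}
        Z = 2.0 * sum(math.sqrt(W[j][+1] * W[j][-1]) for j in (0, 1))
        if best is None or Z < best[0]:
            best = (Z, phi, p[0], p[1])
    Z, phi, p0, p1 = best
    h = lambda x: p1 if phi(x) else p0
    return h, Z
```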

  30. Boosting results on POS tagging?

  31. Sequential model • Sequential model: a Viterbi-style optimization to choose a globally best sequence of labels.

  32. Previous results

  33. Summary

  34. Main ideas • Boosting combines many weak classifiers to produce a powerful committee. • The base learning algorithm only needs to be better than random guessing. • The instance weights are updated during training to put more emphasis on hard examples.

  35. Strengths of AdaBoost • Theoretical validity: it comes with a set of theoretical guarantees (e.g., on training error and test error). • It performs well on many tasks. • It can identify outliers, i.e., examples that are either mislabeled or inherently ambiguous and hard to categorize.

  36. Weaknesses of AdaBoost • The actual performance of boosting depends on the data and the base learner. • Boosting seems to be especially susceptible to noise. • When the number of outliers is very large, the emphasis placed on the hard examples can hurt performance → “Gentle AdaBoost”, “BrownBoost”

  37. Other properties • Simplicity (conceptual) • Efficiency at training • Efficiency at testing time • Handling multi-class • Interpretability

  38. Bagging vs. Boosting (Freund and Schapire 1996) • Bagging always uses resampling rather than reweighting. • Bagging does not modify the weight distribution over examples or mislabels, but instead always uses the uniform distribution • In forming the final hypothesis, bagging gives equal weight to each of the weak hypotheses

  39. Relation to other topics • Game theory • Linear programming • Bregman distances • Support-vector machines • Brownian motion • Logistic regression • Maximum-entropy methods such as iterative scaling.

  40. Additional slides

  41. Sources of Bias and Variance • Bias arises when the classifier cannot represent the true function – that is, the classifier underfits the data • Variance arises when the classifier overfits the data • There is often a tradeoff between bias and variance

  42. Effect of Bagging • If the bootstrap replicate approximation were correct, then bagging would reduce variance without changing bias. • In practice, bagging can reduce both bias and variance • For high-bias classifiers, it can reduce bias • For high-variance classifiers, it can reduce variance

  43. Effect of Boosting • In the early iterations, boosting is primarily a bias-reducing method • In later iterations, it appears to be primarily a variance-reducing method

  44. How to choose αt for ht with range [-1, 1]? • Training error ≤ Πt Zt, with Zt = Σi Dt(i) exp(−αt yi ht(xi)) • Choose the αt that minimizes Zt.
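Since Zt(αt) = Σi Dt(i) exp(−αt yi ht(xi)) is convex in αt, one way to pick αt is a simple numerical line search; a common closed-form alternative from Schapire and Singer is αt = ½ ln((1 + rt)/(1 − rt)) with rt = Σi Dt(i) yi ht(xi), which minimizes an upper bound on Zt. A hypothetical sketch of the numerical route (the helper name and interval bounds are illustrative):

```python
import math

def choose_alpha(D, margins, lo=0.0, hi=10.0, iters=60):
    """Minimize Z_t(alpha) = sum_i D(i) * exp(-alpha * y_i * h_t(x_i)) over a
    bounded interval by ternary search (Z_t is convex in alpha).
    `margins` holds the products y_i * h_t(x_i), each in [-1, 1]."""
    def Z(alpha):
        return sum(d * math.exp(-alpha * m) for d, m in zip(D, margins))
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if Z(a) < Z(b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2.0
```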

  45. Issues • Given ht, how to choose αt? • How to select ht?

  46. How to choose αt when ht has range {-1, 1}? • In this case Zt = εt e^(αt) + (1 − εt) e^(−αt), which is minimized at αt = ½ ln((1 − εt)/εt), the weight used in the basic algorithm.
