Ensemble Classification Methods
Ensemble Classification Methods Rayid Ghani IR Seminar – 9/26/00
What is Ensemble Classification?
• A set of classifiers
• Their decisions are combined in "some" way
• Often more accurate than the individual classifiers
• What properties should the base learners have?
Why should it work?
• An ensemble is more accurate ONLY if the individual classifiers disagree
• Each classifier needs an error rate < 0.5, and the errors must be independent
• The ensemble's error rate is highly correlated with how correlated the errors of the different learners are (Ali & Pazzani)
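The independence condition can be made concrete with a binomial tail. This is a minimal sketch, assuming every classifier has the same error rate and that errors are fully independent (an idealization); the function name `majority_vote_error` is illustrative, not from the slides.

```python
from math import comb

def majority_vote_error(n_classifiers: int, error_rate: float) -> float:
    """Probability that a strict majority of n independent classifiers is wrong.

    Assumes an odd n (no ties) and an identical, independent error rate.
    """
    k_min = n_classifiers // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(
        comb(n_classifiers, k)
        * error_rate ** k
        * (1 - error_rate) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )

if __name__ == "__main__":
    for n in (1, 3, 11, 21):
        print(n, majority_vote_error(n, 0.3))
```

With an error rate below 0.5 the ensemble error shrinks as classifiers are added; above 0.5 it grows, which is why both conditions on the slide matter.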
Averaging Fails!
• Use delta functions as classifiers (predict +1 at a point and -1 everywhere else)
• For a training sample of size m, construct a set of at most 2m classifiers such that the majority vote is always correct on the training sample
  • Associate one delta function with every example
  • Add M+ (# of +ve examples) copies of the function that predicts +1 everywhere and M- (# of -ve examples) copies of the function that predicts -1 everywhere
• Applying boosting to this results in zero training error but poor generalization
• Applying the margin analysis: training error is zero, but the margin is small, O(1/m)
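The construction can be checked on a toy sample. The sketch below makes one assumption that the slide does not state: the delta function attached to a negative example is sign-flipped (it predicts -1 at its point and +1 elsewhere). Under that reading, the vote sum is +2 or -2 on every training point (a normalized margin of 1/m) and exactly 0 on any unseen point. The names `build_ensemble` and `vote_sum` are illustrative.

```python
def build_ensemble(train):
    """Build the 2m-classifier ensemble from (x, label) pairs, label in {+1, -1}.

    Assumes all x values are distinct.
    """
    ensemble = []
    for x_i, y_i in train:
        # Delta function: predicts the example's own label at x_i, the opposite elsewhere.
        # (Sign flip for negative examples is an assumption, see lead-in.)
        ensemble.append(lambda x, x_i=x_i, y_i=y_i: y_i if x == x_i else -y_i)
    n_pos = sum(1 for _, y in train if y == 1)
    n_neg = len(train) - n_pos
    ensemble += [lambda x: 1] * n_pos    # M+ copies of "always +1"
    ensemble += [lambda x: -1] * n_neg   # M- copies of "always -1"
    return ensemble

def vote_sum(ensemble, x):
    """Unnormalized vote: positive means the majority predicts +1."""
    return sum(h(x) for h in ensemble)

if __name__ == "__main__":
    train = [(0, 1), (1, -1), (2, 1), (3, 1), (4, -1)]
    ens = build_ensemble(train)
    for x, y in train:
        print(x, y, vote_sum(ens, x))
    print("unseen point:", vote_sum(ens, 99))
```

The tie on unseen points is the sense in which the ensemble memorizes the sample without learning anything that generalizes.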
Ideas?
• Subsampling training examples
  • Bagging, Cross-Validated Committees, Boosting
• Manipulating input features
  • Choose different features
• Manipulating output targets
  • ECOC and variants
• Injecting randomness
  • Neural networks (different initial weights), decision trees (pick different splits), injecting noise, MCMC
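As one illustration of the subsampling idea, bagging trains each ensemble member on a bootstrap sample (drawn with replacement) and combines them by unweighted vote. This is a minimal sketch, not a reference implementation: the 1-nearest-neighbour base learner `fit_1nn`, the one-dimensional inputs, and the default of 25 models are all assumptions chosen to keep the example short.

```python
import random
from collections import Counter

def fit_1nn(sample):
    """Toy base learner: 1-nearest-neighbour on 1-D inputs (illustrative only)."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bagging(train, fit, n_models=25, seed=0):
    """Train n_models base learners on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the training set, drawn with replacement.
        sample = [rng.choice(train) for _ in train]
        models.append(fit(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]
    return predict

if __name__ == "__main__":
    train = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (1.0, "b"), (1.1, "b"), (1.2, "b")]
    classify = bagging(train, fit_1nn)
    print(classify(0.05), classify(1.15))
```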
Combining Classifiers
• Unweighted voting
  • Bagging, ECOC, etc.
• Weighted voting
  • Weight by accuracy (on the training or a holdout set), LSR (weights of 1/variance)
• Bayesian model averaging
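The difference between the two voting schemes fits in a few lines. In this sketch, which is only illustrative, `weighted_vote` is a hypothetical helper and the weights stand in for something like holdout accuracies; unweighted voting is the special case where every weight is equal.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one predicted label per classifier
    weights: one non-negative weight per classifier (e.g. holdout accuracy)
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

if __name__ == "__main__":
    preds = ["a", "b", "b"]
    print(weighted_vote(preds, [1, 1, 1]))        # unweighted: plain majority
    print(weighted_vote(preds, [0.9, 0.3, 0.3]))  # one strong classifier outvotes two weak ones
```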
Bayesian Model Averaging (BMA)
• All models in the model space are used, each weighted by its probability of being the "correct" model
• Optimal given the correct model space and priors
• Not widely used, even though it was said not to overfit (Buntine, 1990)
BMA - Equations
• [Equations not preserved in this transcript; surviving labels: prior, likelihood, noise model]
Equations
• Posterior
• Uniform noise model
• Pure classification model
• Model space too large, so approximation is required
  • Model with the highest posterior, sampling
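The formulas on these two slides did not survive as text. A standard formulation consistent with the surviving labels (prior, likelihood, noise model, posterior, uniform noise model, pure classification model) is sketched here as an assumption, not as a transcription of the slides. The symbols are also assumptions: S is the training sample of n examples (x_i, c_i), H the model space, ε the noise rate, and s the number of training examples that h classifies correctly.

```latex
% Prediction: average over all models h in the model space H
P(c \mid x, S) = \sum_{h \in H} P(c \mid x, h)\, P(h \mid S)

% Posterior: prior P(h) times likelihood, where P(c_i | x_i, h) is the noise model
P(h \mid S) \propto P(h) \prod_{i=1}^{n} P(c_i \mid x_i, h)

% Uniform noise model: each training label is wrong with probability epsilon
P(c_i \mid x_i, h) =
\begin{cases}
  1 - \varepsilon & \text{if } h(x_i) = c_i \\
  \varepsilon     & \text{otherwise}
\end{cases}
\quad\Longrightarrow\quad
P(h \mid S) \propto P(h)\,(1 - \varepsilon)^{s}\,\varepsilon^{\,n - s}

% Pure classification model: h puts all probability on its predicted class
P(c \mid x, h) =
\begin{cases}
  1 & \text{if } h(x) = c \\
  0 & \text{otherwise}
\end{cases}
```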
BMA of Bagged C4.5 Rules
• Bagging as a form of importance sampling where all samples are weighted equally
• Experimental results
  • Every version of BMA performed worse than bagging on 19 out of 26 datasets
  • Posteriors are skewed: they are dominated by a single rule model, so BMA amounts to model selection rather than averaging
BMA of various learners
• RISE rule sets with partitioning
  • 8 databases from UCI
  • BMA worse than RISE in every domain
• Trading rules
  • Intuition: there is no single right rule, so BMA should help
  • BMA similar to choosing the single best rule
Overfitting in BMA
• The issue of overfitting is usually ignored (Freund et al. 2000)
• Is overfitting the explanation for the poor performance of BMA?
• Overfitting: preferring a hypothesis that does not truly have the lowest error of any hypothesis considered, but by chance has the lowest error on the training data
• Overfitting results from the likelihood's exponential sensitivity to random fluctuations in the sample, and it increases with the number of models considered
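The exponential sensitivity is easy to see numerically. The sketch below assumes a uniform prior over models and a uniform noise model in which a model that classifies s of n training examples correctly has likelihood (1 - eps)^s * eps^(n - s); the function name `posterior_weights` and the value eps = 0.1 are illustrative choices.

```python
import math

def posterior_weights(n_correct, n_examples, eps):
    """Normalized posterior weights under a uniform prior and uniform noise model.

    n_correct: number of training examples each model classifies correctly
    eps: assumed label-noise rate, 0 < eps < 0.5
    """
    log_w = [
        s * math.log(1 - eps) + (n_examples - s) * math.log(eps)
        for s in n_correct
    ]
    top = max(log_w)  # subtract the max before exponentiating, for numerical stability
    unnorm = [math.exp(lw - top) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]

if __name__ == "__main__":
    # Three models with training accuracy 90%, 88% and 85% on 100 examples.
    print(posterior_weights([90, 88, 85], 100, eps=0.1))
```

Each additional correctly classified example multiplies a model's weight by (1 - eps) / eps, so a gap of two training examples, which could easily be chance, already hands almost all the weight to one model.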
To BMA or not to BMA?
• The net effect depends on which effect prevails:
  • Increased overfitting (small if few models are considered)
  • Reduction in error obtained by giving some weight to alternative models (skewed weights => small effect)
• Ali & Pazzani (1996) report good results, but bagging wasn't tried
• Domingos (2000) used bootstrapping before BMA, so the models were built from less data
Why do they work?
• Bias / variance decomposition
• Training data insufficient for choosing a single best classifier
• Learning algorithms not "smart" enough!
• Hypothesis space may not contain the true function
Definitions
• Bias is the persistent, systematic error of a learner, independent of the training set. It is zero for a learner that always makes the optimal prediction.
• Variance is the error incurred by fluctuations in response to different training sets. It is independent of the true value of the predicted variable, and it is zero for a learner that always predicts the same class regardless of the training set.
Bias–Variance Decomposition
• Kong & Dietterich (1995): variance can be negative and noise is ignored
• Breiman (1996): undefined for any given example, and variance can be zero even when the learner's predictions fluctuate
• Tibshirani (1996)
• Hastie (1997)
• Kohavi & Wolpert (1996): allows the bias of the Bayes optimal classifier to be non-zero
• Friedman (1997): leaves bias and variance for zero-one loss undefined
Domingos (2000)
• Single definition of bias and variance
• Applicable to "any" loss function
• Explains the margin effect (Schapire et al. 1997) using the decomposition
• Incorporates variable misclassification costs
• Experimental study
Unified Decomposition
• Loss functions
  • Squared: L(t,y) = (t-y)^2
  • Absolute: L(t,y) = |t-y|
  • Zero-one: L(t,y) = 0 if y = t, else 1
• Goal: minimize the average L(t,y) over all weighted examples
• Expected loss at x decomposes as c1 N(x) + B(x) + c2 V(x), with noise N, bias B, and variance V
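For zero-one loss the decomposition can be checked on a single example. This is a hedged sketch that assumes two classes and no noise (so N(x) = 0), takes the main prediction to be the most frequent prediction across training sets, and uses c2 = +1 on unbiased examples and c2 = -1 on biased ones. The function name `decompose_zero_one` is hypothetical.

```python
from collections import Counter

def decompose_zero_one(predictions, target):
    """Bias, variance, c2 and average zero-one loss for one test example.

    predictions: labels predicted by models trained on different training sets
    target: the true label (two-class, noise-free setting assumed)
    """
    main = Counter(predictions).most_common(1)[0][0]  # main (most frequent) prediction
    bias = int(main != target)                         # loss of the main prediction
    variance = sum(p != main for p in predictions) / len(predictions)
    c2 = 1 if bias == 0 else -1                        # variance hurts unbiased examples, helps biased ones
    loss = sum(p != target for p in predictions) / len(predictions)
    return bias, variance, c2, loss

if __name__ == "__main__":
    preds = [1, 1, 1, 0, 1]
    for target in (1, 0):
        bias, variance, c2, loss = decompose_zero_one(preds, target)
        print(target, bias, variance, c2, loss, bias + c2 * variance)
```

On a biased example, any prediction that departs from the main prediction happens to be correct, which is why variance enters with a negative sign there.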
Properties of the unified decomposition
• Relation to order-correct learner
• Relation to margin of a learner
  • Maximizing margins is a combination of reducing the number of biased examples, decreasing variance on unbiased examples, and increasing it on biased ones
Experimental Study
• 30 UCI datasets
• Methodology
  • 100 bootstrap samples, averaged over the test set with uniform weights
  • Estimate bias, variance, zero-one loss
• Learners: decision trees, kNN, boosting
Boosting C4.5 - Results
• Decreases both bias and variance
• Bulk of the bias reduction happens in the first few rounds
• Variance reduction is more gradual and is the dominant effect
kNN results
• As k increases, the increase in bias dominates the reduction in variance
• However, increasing k has the effect of reducing variance on unbiased examples while increasing it on biased ones
Issues
• Does not work with "any" loss function, e.g. absolute loss
• Decomposition is not purely additive, unlike the original one for squared loss
Spectrum of ensembles
• [Figure: Boosting, Bagging, and BMA arranged by "Overfitting" and "Asymmetry of weights"]
Open Issues concerning ensembles
• Best way to construct ensembles?
• No extensive comparison done
• Computationally expensive
• Not easily comprehensible
Bibliography
• Overview
  • T. Dietterich
  • Bauer & Kohavi
• Averaging
  • Domingos
  • Freund, Mansour, Schapire
  • Ali, Pazzani
• Bias-Variance Decomposition
  • Kohavi & Wolpert
  • Domingos
  • Friedman
  • Kong & Dietterich