
Committee Machines: the art and science of combining learning machines


Presentation Transcript


  1. Committee Machines: the art and science of combining learning machines
  Gavin Brown, University of Birmingham
  g.brown@cs.bham.ac.uk
  http://www.cs.bham.ac.uk/~gxb/

  2. What will we cover?
  • Some theory:
    • Reasons to combine learning machines
    • Statistics: Bias, Variance, and Covariance
    • Error diversity? Art? … Science?
  • Some current methods:
    • Ensemble Methods: Bagging and Boosting
    • Mixtures of Experts
    • Dyn-Co: somewhere in-between…
    • General rules of thumb
  • Some open research questions and some conclusions

  3. Some theory > Reasons to Combine Learning Machines
  [Diagram: an input is fed to several machines f1…f5, whose outputs are combined into a final output.]
  Lots of different machines, lots of different combination methods… the most popular are averaging and majority voting.
  Intuitively, it seems as though it should work. We have parliaments of people who vote, and that works (sometimes). We average guesses of a quantity, and we'll probably be closer… But intuition is not enough for us! Let's formalise…
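
  Both popular combiners are simple to state in code. Below is a minimal sketch (not from the talk; it assumes NumPy arrays of stacked per-machine predictions, and integer class labels for the voting case):

    import numpy as np

    def average_combiner(predictions):
        # predictions: shape (n_machines, n_samples), real-valued outputs
        return np.asarray(predictions).mean(axis=0)

    def majority_vote(predictions):
        # predictions: shape (n_machines, n_samples), integer class labels
        predictions = np.asarray(predictions)
        n_classes = predictions.max() + 1
        counts = np.stack([np.bincount(col, minlength=n_classes)
                           for col in predictions.T])   # (n_samples, n_classes)
        return counts.argmax(axis=1)                    # most-voted class per sample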

  4. Some theory > Reasons to Combine Learning Machines
  When designing a learning machine, we generally make some choices: parameters of the machine, training data, representation, etc. These choices imply some variance in performance, forming an error distribution (hopefully centred around the target).
  If we sample widely enough from this distribution, won't the average be closer? So why not keep all the machines and average them?
  Still not formal enough? Then let's keep going…
  [Figure: distribution of individual estimates around the target value sin(2) ≈ 0.909.]
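
  A quick numerical illustration of this argument (not from the slides): draw ten noisy estimates of the target sin(2) ≈ 0.909 and compare their average against the typical individual estimate.

    import numpy as np

    rng = np.random.default_rng(0)
    target = np.sin(2.0)                                   # ~0.909
    estimates = target + rng.normal(0.0, 0.3, size=10)     # ten noisy "machines"

    print("mean individual squared error:", np.mean((estimates - target) ** 2))
    print("ensemble squared error:       ", (estimates.mean() - target) ** 2)

  The ensemble squared error is never larger than the mean individual squared error, which is exactly the gap the next slide quantifies.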

  5. Some theory > Reasons to Combine Learning Machines …why?
  After a little algebra, we can prove:
    ensemble error = average component error − "ambiguity"
  Since the ambiguity term is never negative, the error of the combination is guaranteed to be no higher than the average component error. (Krogh & Vedelsby, 1995)
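
  Written out, here is a sketch of the omitted algebra, in the notation of a weighted ensemble \bar{f} with target t (the weighting scheme is an assumption; the slide does not fix it):

    $$
    \bar{f}(x) = \sum_i w_i f_i(x), \qquad \sum_i w_i = 1, \; w_i \ge 0,
    $$
    $$
    \big(\bar{f}(x) - t(x)\big)^2
      = \underbrace{\sum_i w_i \big(f_i(x) - t(x)\big)^2}_{\text{average component error}}
      - \underbrace{\sum_i w_i \big(f_i(x) - \bar{f}(x)\big)^2}_{\text{ambiguity}}.
    $$

  The ambiguity term is a weighted spread of the components around the ensemble output, so it is never negative; that is where the guarantee comes from.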

  6. Some theory > Reasons to Combine Learning Machines …why?
  Binomial theorem says… …but only if they are independent!
  • One theory paper…
    • Tumer & Ghosh, 1996, "Error Correlation and Error Reduction in Ensemble Classifiers"
    • (makes some assumptions, like equal variances)
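
  The argument the slide alludes to is presumably the classic binomial one; here is a hedged sketch of it in code (an illustration, not the slide's own derivation). Under the strong assumption that M classifiers err independently, each with error rate p < 1/2, the majority vote is wrong only when more than half of them are wrong:

    from math import comb

    def majority_vote_error(M, p):
        # P(more than M/2 of M independent classifiers are wrong), for odd M
        return sum(comb(M, k) * p**k * (1 - p)**(M - k)
                   for k in range(M // 2 + 1, M + 1))

    for M in (1, 5, 11, 21):
        print(M, round(majority_vote_error(M, p=0.3), 4))

  The voted error drops rapidly as M grows, but the whole calculation leans on the independence assumption the slide warns about.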

  7. Some theory > Statistics: Bias, Variance, and Covariance
  For a single function f:
    MSE = bias² + var
  …and if f is a linear combination of other functions:
    MSE = bias² + var + covar
  …but if f is a voted combination (zero-one loss function) of other functions:
    MSE = ?
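
  For the linear case with uniform weights, one standard way to spell the decomposition out (a sketch, not the slide's own notation) is:

    $$
    E\Big[\big(\bar{f} - t\big)^2\Big]
      = \overline{\mathrm{bias}}^{\,2}
      + \frac{1}{M}\,\overline{\mathrm{var}}
      + \Big(1 - \frac{1}{M}\Big)\,\overline{\mathrm{covar}},
    \qquad
    \bar{f} = \frac{1}{M}\sum_{i=1}^{M} f_i,
    $$

  where the bars denote the bias, variance, and pairwise covariance averaged over the M components. As M grows the variance term is damped and the covariance term dominates, which is why low correlation between components matters so much.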

  8. Some theory > Error Diversity? Art? … Science?
  [Diagram: two combination architectures shown side by side, each feeding an input to machines f1…f5.]
  • What (not the end of the talk) conclusions can we make?
    • In practice, many success stories.
    • Heuristics, left, right, and centre…
    • Our theories have gaps.
    • Need a unifying framework.
    • Some efforts in this direction (e.g. Kleinberg's Stochastic Discrimination).
  …So given all this theory, how do we make use of it?

  9. Some current methods > Ensemble Methods
  Bagging
    take a training set D, of size N
    for each network / tree / k-nn / etc…
      - build a new training set by sampling N examples, randomly, with replacement, from D
      - train your machine with the new dataset
    endfor
    output is average/vote from all machines trained
  Each example may appear more than once in a given dataset.
  [Table: error rates on UCI datasets (10-fold cross-validation). Source: Opitz & Maclin, 1999.]
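
  A minimal runnable sketch of this loop (an illustration, not the procedure used in the cited study; it assumes base learners with scikit-learn-style fit/predict and integer class labels):

    import numpy as np

    def bagging(make_learner, X, y, n_learners=10, seed=None):
        rng = np.random.default_rng(seed)
        N = len(X)
        learners = []
        for _ in range(n_learners):
            idx = rng.integers(0, N, size=N)   # sample N examples with replacement
            model = make_learner()
            model.fit(X[idx], y[idx])          # train this machine on the bootstrap sample
            learners.append(model)
        return learners

    def bagged_predict(learners, X):
        # Output is the majority vote over all machines trained.
        preds = np.stack([m.predict(X) for m in learners])
        n_classes = preds.max() + 1
        counts = np.stack([np.bincount(col, minlength=n_classes) for col in preds.T])
        return counts.argmax(axis=1)

  For instance, make_learner could be lambda: DecisionTreeClassifier() from scikit-learn; as slide 13 notes, unstable base learners benefit most from this treatment.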

  10. Some current methods > Ensemble Methods
  Boosting
    take a training set D, of size N
    do M times
      train a network on D
      find all examples in D that the network gets wrong
      emphasize those patterns, de-emphasize the others, in a new dataset D2
      set D = D2
    loop
    output is average/vote from all machines trained
  A general method – different variants exist in the literature, by filtering, sub-sampling or re-weighting; see Haykin Ch. 7 for details. It is still not agreed exactly why it works.
  [Table: error rates on UCI datasets (10-fold cross-validation). Source: Opitz & Maclin, 1999.]
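
  For concreteness, here is one well-known re-weighting variant, an AdaBoost-style sketch (an assumption, since the slide describes boosting generically rather than this exact algorithm). It uses binary labels y in {-1, +1} and base learners whose fit() accepts per-example weights, as scikit-learn estimators do:

    import numpy as np

    def boost(make_learner, X, y, n_rounds=10):
        N = len(X)
        w = np.full(N, 1.0 / N)                    # start with uniform example weights
        learners, alphas = [], []
        for _ in range(n_rounds):
            model = make_learner()
            model.fit(X, y, sample_weight=w)
            wrong = (model.predict(X) != y)
            eps = float(np.dot(w, wrong))          # weighted training error
            if eps <= 0.0 or eps >= 0.5:
                break                              # perfect, or too weak to help
            alpha = 0.5 * np.log((1 - eps) / eps)  # confidence of this round's machine
            w *= np.exp(alpha * np.where(wrong, 1.0, -1.0))   # emphasize the mistakes
            w /= w.sum()                           # renormalize: this plays the role of D2
            learners.append(model)
            alphas.append(alpha)
        return learners, np.array(alphas)

    def boosted_predict(learners, alphas, X):
        # Output is a confidence-weighted vote; the sign gives the {-1, +1} prediction.
        scores = sum(a * m.predict(X) for a, m in zip(alphas, learners))
        return np.sign(scores)

  The exponential emphasis on repeatedly misclassified examples is also why boosting can suffer on noisy data, as slide 13 notes.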

  11. Some current methods > Mixtures of Experts (Jacobs et al., 1991)
  [Diagram: the input feeds experts f1…f5; a Combine block produces the final output.]
  • The gating net learns the combination weights.
  • The gating net uses a softmax activation, so the weights sum to one.
  • The input space is 'carved up' between the experts.
  • Has a nice probabilistic interpretation as a mixture model.
  • Many variations in the literature: Gaussian mixtures, training with the Expectation-Maximization algorithm, etc.
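
  A minimal forward-pass sketch of the softmax gate (an illustration only, not the Jacobs et al. training procedure; the names and shapes here are assumptions):

    import numpy as np

    def softmax(z):
        z = z - z.max()                  # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def mixture_output(x, experts, gate_weights):
        # experts: list of callables f_i(x), each returning a scalar prediction
        # gate_weights: array of shape (n_experts, len(x)), the gating net's parameters
        g = softmax(gate_weights @ x)    # non-negative, sums to one, depends on x
        outputs = np.array([f(x) for f in experts])
        return float(np.dot(g, outputs)) # input-dependent weighted combination

  Because the gate output g depends on x, different experts can dominate in different regions of the input space, which is how the space gets 'carved up'.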

  12. Some current methods > Dyn-Co: somewhere in-between… (Hansen, 2000)
  [Diagram: experts f1…f5 combined into a final output, with a gating-error term Gating-Err(x) (formula not reproduced); a spectrum runs from a 'pure' Ensemble at one end, through Dyn-Co, to a 'pure' Mixture of Experts at the other.]

  13. Some current methods > General Rules of Thumb
  • Components should exhibit low correlation - understood well for regression, not so well for classification. "Overproduce-and-choose" is a good strategy.
  • Unstable estimators (e.g. networks) benefit most from ensemble methods. Stable estimators like k-NN tend not to benefit.
  • Techniques manipulate either the training data, the architecture of the learner, the initial configuration, or the learning algorithm. Training data is seen as the most successful route; initial configuration is the least successful.
  • Uniform weighting is almost never optimal. A good strategy is to set the weighting for a component in inverse proportion to its error on a validation set (see the sketch below).
  • Boosting tends to suffer on noisy data.
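
  A small sketch of that validation-weighting rule (the exact recipe is an assumption, since the slide leaves it open): weight each component by the reciprocal of its validation error, then normalize so the weights sum to one.

    import numpy as np

    def validation_weights(components, X_val, y_val, eps=1e-12):
        # Misclassification rate of each component on a held-out validation set.
        errors = np.array([np.mean(m.predict(X_val) != y_val) for m in components])
        raw = 1.0 / (errors + eps)       # lower validation error -> larger weight
        return raw / raw.sum()           # weights sum to one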

  14. Some open questions and some conclusions
  Open Questions
  • What taxonomy best describes the structures / methods?
    • need to be careful not to get lost in making combinations of combinations of combinations…
  • What is 'diversity' for classification problems? A bias-variance-covariance decomposition for 0-1 loss?
  • How to match components to combination rules?
  • How to match data to methods?
  Conclusions
  • Powerful, easy to implement, easy to be naïve.
  • Dominating techniques are still heuristic-based – theory is slowly growing…
  • Current research emphasis is on random perturbations (cf. bagging) – we need more deterministic methods like Mixtures of Experts.

  15. References
  Thomas G. Dietterich, "Ensemble Methods in Machine Learning" (2000), Proceedings of the First International Workshop on Multiple Classifier Systems.
  David Opitz and Richard Maclin, "Popular Ensemble Methods: An Empirical Study" (1999), Journal of Artificial Intelligence Research, volume 11, pages 169-198.
  R. A. Jacobs, M. I. Jordan, S. J. Nowlan and G. E. Hinton, "Adaptive Mixtures of Local Experts" (1991), Neural Computation, volume 3, number 1, pages 79-87.
  Simon Haykin, Neural Networks: A Comprehensive Foundation (Chapter 7).
  Other players of the game: Lucy Kuncheva, Nathan Intrator, Tin Kam Ho, David Wolpert, Amanda Sharkey, David Opitz, Jakob Hansen.
  Ensemble bibliography: http://www.cs.bham.ac.uk/~gxb/ensemblebib.php
  Int'l Workshop on Multiple Classifier Systems: http://www.diee.unica.it/mcs/
  Boosting resources: http://www.boosting.org
  Citeseer: http://citeseer.nj.nec.com/cs
