Diverse Ensembles for Active Learning

Diverse Ensembles for Active Learning • Prem Melville and Raymond J. Mooney • June 27, 2004

Motivation • Actively selecting most useful training examples is an important approach to reducing amount of supervision • Pool-based sample selection is the most popular approach • Learner chooses best instance for labeling from a set of unlabeled examples • Query by Committee (QBC) is a theoretically well motivated approach to sample selection [Seung et al. 92] • Committee of consistent hypotheses is learned • Examples that cause maximum disagreement amongst this committee are selected for labeling • Bagging and AdaBoost have been used to learn effective committees for QBC [Abe & Mamitsuka 98] • Known as Query by Bagging (QBag) and Query by Boosting (QBoost)

Motivation • A good ensemble for QBC should be diverse • i.e., consistent hypotheses that are very different from each other • Only a committee that effectively samples the version space is productive for sample selection [Cohn 94] • Decorate is a recently-developed ensemble method that explicitly builds diverse ensembles [Melville & Mooney 03,04] • It’s more accurate than Bagging & AdaBoost when training data is limited • And does at least as well as AdaBoost when training sets are large • How effective are Decorate ensembles for sample selection? • Can the added diversity help select more informative examples than QBag and QBoost?

Outline • Background on DECORATE • Active-DECORATE • Experimental Evaluation • Additional Experiments • Future Work and Conclusions

Ensemble Diversity • Combining classifiers is only useful if they disagree on some inputs • Diversity refers to a measure of disagreement (ambiguity) • Increasing diversity while maintaining error of ensemble members → decreases ensemble error [Krogh & Vedelsby 95] • We use disagreement with ensemble prediction as a measure of diversity • If Ci(x) is the prediction of the i-th classifier for the label of x • C*(x) is the prediction of the entire ensemble • Diversity of the i-th classifier on example x is given by: $d_i(x) = 0$ if $C_i(x) = C^*(x)$, and $1$ otherwise • Diversity of an ensemble of size m on a training set of size n: $\frac{1}{nm} \sum_{i=1}^{m} \sum_{j=1}^{n} d_i(x_j)$ • Our approach: build ensembles consistent with training data while maximizing diversity
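
The diversity measure described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the ensemble prediction C*(x) is a plain majority vote, and the function name `ensemble_diversity` and its input layout are invented for the example.

```python
from collections import Counter

def ensemble_diversity(member_preds):
    """member_preds[i][j] is classifier i's label for training example j.

    Returns the fraction of (classifier, example) pairs in which the member
    disagrees with the ensemble prediction (assumed here: majority vote).
    """
    m, n = len(member_preds), len(member_preds[0])
    disagreements = 0
    for j in range(n):
        votes = [member_preds[i][j] for i in range(m)]
        ensemble_label = Counter(votes).most_common(1)[0][0]
        # Count the members whose label differs from the ensemble's label.
        disagreements += sum(v != ensemble_label for v in votes)
    return disagreements / (n * m)
```

With three classifiers and two examples, a single dissenting vote gives a diversity of 1/6.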

DECORATE: Basic Approach • The ensemble is generated iteratively • Artificially constructed examples are added to training set when building new members • Artificial examples are given labels that disagree with current ensemble’s decisions • The new classifier is trained on this augmented data • Thereby forcing it to differ from the current ensemble • Adding it to the ensemble will therefore increase diversity • While forcing diversity we still maintain accuracy • Reject new classifier if adding it to existing ensemble decreases its accuracy • To produce predictions we take the majority vote of the ensemble
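
The iterative procedure on this slide can be sketched as follows. This is a hedged outline rather than the reference implementation: `base_learner` (data → classifier) and `make_artificial` (training data, ensemble → oppositely labeled examples) are hypothetical caller-supplied functions, and the `max_iters` cap is an assumption of the sketch.

```python
from collections import Counter

def majority_vote(ensemble, x):
    """Ensemble prediction: the label most members vote for."""
    return Counter(clf(x) for clf in ensemble).most_common(1)[0][0]

def ensemble_error(ensemble, data):
    """Fraction of (x, y) pairs the majority vote gets wrong."""
    return sum(majority_vote(ensemble, x) != y for x, y in data) / len(data)

def decorate(train, base_learner, make_artificial, size=15, max_iters=50):
    """Sketch of the DECORATE loop over a list of (x, y) training pairs."""
    ensemble = [base_learner(train)]
    error = ensemble_error(ensemble, train)
    iters = 1
    while len(ensemble) < size and iters < max_iters:
        # Artificial examples carry labels that disagree with the ensemble.
        artificial = make_artificial(train, ensemble)
        candidate = base_learner(train + artificial)
        ensemble.append(candidate)
        new_error = ensemble_error(ensemble, train)
        if new_error <= error:
            error = new_error   # keep: training error did not get worse
        else:
            ensemble.pop()      # reject: the candidate hurt ensemble accuracy
        iters += 1
    return ensemble
```

Because a candidate is kept only when the ensemble's training error does not rise, the returned ensemble is never worse on the training data than its first member.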

Overview of DECORATE • [Figure, shown in three steps: Training Examples and Artificial Examples are passed to the Base Learner, and the resulting classifiers C1, C2 and C3 are added in turn to the Current Ensemble]

Artificial Data • Examples are generated at each iteration • Number of examples is proportional to training size (1:1) • Randomly pick points from approx. training data distribution • For numeric attributes • compute mean and std dev & generate values from the Gaussian • For nominal attributes • Compute prob. of occurrence of each distinct value & generate values from this distribution • To label examples • Find class membership probabilities predicted by current ensemble • Select labels s.t. probability of selection is inversely proportional to ensemble predictions
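
The generation and labeling steps above might look like this in code. It is a sketch under stated assumptions: examples are represented as dicts, attributes are sampled independently, and the small `eps` floor that avoids division by zero is an addition of this example. The function names are hypothetical.

```python
import random
import statistics

def sample_artificial_example(rows, numeric_cols, rng=random):
    """Draw one artificial example, treating attributes as independent.

    rows: list of dicts (the training examples, without labels).
    numeric_cols: names of the numeric attributes; the rest are nominal.
    """
    example = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        if col in numeric_cols:
            # Numeric: Gaussian with the training mean and std dev.
            example[col] = rng.gauss(statistics.mean(values),
                                     statistics.pstdev(values))
        else:
            # Nominal: sample a value according to its observed frequency.
            example[col] = rng.choice(values)
    return example

def pick_opposing_label(class_probs, rng=random, eps=1e-3):
    """Pick a label with probability inversely proportional to the
    ensemble's predicted class membership probabilities."""
    labels = list(class_probs)
    # eps (an assumption of this sketch) keeps zero probabilities usable.
    weights = [1.0 / max(class_probs[label], eps) for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]
```

A class the ensemble considers unlikely for an artificial example is therefore the label that example is most likely to receive.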

Active-DECORATE • [Figure: DECORATE builds the Current Ensemble (C1–C4) from the Training Examples; the ensemble assigns a utility to each of the Unlabeled Examples (values shown: 0.1, 0.3, 0.2, 0.5, 0.9), and a label is acquired for the selected example, which is added to the Training Examples] • QBag/QBoost are implemented similarly, using Bagging/AdaBoost in place of Decorate

Measure of Utility • To evaluate the expected utility of unlabeled examples we use the margins on the examples • Similar to [Abe and Mamitsuka 98] • Given the class membership probabilities predicted by the committee • The margin is defined as diff between highest and second highest predicted class probability • Smaller margins imply greater uncertainty in the class label • Other measures of utility will be discussed later
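
A minimal sketch of margin-based selection, assuming the committee exposes class membership probabilities as a list per example. The names `select_for_labeling`, `active_learning_step`, `train_committee` and `oracle` are hypothetical and chosen only for illustration.

```python
def margin(class_probs):
    """Difference between the highest and second-highest class probability."""
    top, second = sorted(class_probs, reverse=True)[:2]
    return top - second

def select_for_labeling(pool_probs, batch_size):
    """Indices of the batch_size pool examples with the smallest margins.

    pool_probs[i] is the committee's class-probability list for example i.
    """
    ranked = sorted(range(len(pool_probs)), key=lambda i: margin(pool_probs[i]))
    return ranked[:batch_size]

def active_learning_step(labeled, pool, train_committee, oracle, batch_size=1):
    """One pool-based round: train, rank the pool by margin, label the picks.

    train_committee(labeled) must return a function x -> class probabilities;
    oracle(x) returns the true label of x.
    """
    predict_probs = train_committee(labeled)
    picks = set(select_for_labeling([predict_probs(x) for x in pool], batch_size))
    new_labeled = labeled + [(x, oracle(x)) for i, x in enumerate(pool) if i in picks]
    new_pool = [x for i, x in enumerate(pool) if i not in picks]
    return new_labeled, new_pool
```

Passing a DECORATE, Bagging or AdaBoost trainer as `train_committee` would correspond to Active-Decorate, QBag or QBoost respectively.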

Summary of Data Sets

Experimental Methodology • Compared Active-Decorate with QBag, QBoost and Decorate (using random sampling) • Used ensembles of size 15 • Used J48 as the base learner • J48 is a Java implementation of C4.5 decision tree induction • 2x10-fold cross-validations were run on 15 UCI datasets • In each fold, learning curves were generated • The set of available examples is treated as an unlabeled pool • At each iteration, the active learner selects a sample of points to be labeled and added to the training set • For the passive learner, Decorate, examples are selected randomly • At the end of the learning curve, all algorithms see the same examples • The curves evaluate how well an active learner orders the set of examples in terms of utility

Metrics – Data Utilization Ratio • Primary aim of active learning: reduce the amount of data needed to induce an accurate model • Define the target error rate as the error that Decorate can achieve on a given dataset • Error averaged over the points of the learning curve corresponding to the last 50 examples • Record the smallest number of examples required by a learner to achieve the same or lower error • Data utilization ratio: (number of examples required by the active learner) / (number of examples required by Decorate) • Reflects how efficiently the active learner is using data • Similar to the measure used by Abe & Mamitsuka [98] • [Figure: learning curves for Active and Random, accuracy vs. number of training examples, with the examples saved marked]
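
The data utilization ratio can be computed from two learning curves roughly as follows. This sketch assumes each curve is a list of (number of training examples, error) points; the function names are hypothetical and the handling of a learner that never reaches the target (returning `None`) is a choice made here, not something the slides specify.

```python
def target_error(baseline_curve, window=50):
    """Average baseline error over curve points within the last `window` examples."""
    max_n = max(n for n, _ in baseline_curve)
    tail = [err for n, err in baseline_curve if n > max_n - window]
    return sum(tail) / len(tail)

def examples_to_reach(curve, target):
    """Smallest number of training examples at which error <= target, or None."""
    reached = [n for n, err in curve if err <= target]
    return min(reached) if reached else None

def data_utilization_ratio(active_curve, baseline_curve, target):
    """Examples needed by the active learner divided by examples needed by
    the baseline (Decorate with random sampling) to reach the target error."""
    active_n = examples_to_reach(active_curve, target)
    baseline_n = examples_to_reach(baseline_curve, target)
    if active_n is None or baseline_n is None:
        return None  # at least one learner never reached the target
    return active_n / baseline_n
```

A ratio below 1 means the active learner reached the target error with fewer labeled examples than random sampling.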

Metrics - Percentage Error Reduction • How much an active learner improves accuracy over random sampling given a fixed amount of labeled data • Compute the % reduction in error over Decorate • Average over points on the learning curve • Towards the end of the learning curve all methods see almost the same examples • Hence, the main impact of active learning is lower on the curve • Capture this by reporting % error reduction on the 20% of points on the curve where the largest improvements are produced • Similar to a measure used by Saar-Tsechansky & Provost [01] • Error reduction is considered significant if the difference in error of the 2 systems, averaged across the selected points of the curve, is statistically significant (p<0.05) • [Figure: learning curves for Active and Random, accuracy vs. number of training examples, with the error reduction marked]
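
One plausible reading of this metric is sketched below. The slides do not say whether per-point reductions are averaged or whether errors are averaged first; this example assumes the former, and the function name is hypothetical.

```python
def percent_error_reduction(active_errors, baseline_errors, top_fraction=0.2):
    """Mean % error reduction over the top_fraction of curve points where
    the active learner improves most.

    Both lists hold errors at the same training-set sizes, in the same order.
    """
    reductions = [
        100.0 * (base - act) / base
        for act, base in zip(active_errors, baseline_errors)
        if base > 0  # skip points where the baseline error is already zero
    ]
    if not reductions:
        return 0.0
    reductions.sort(reverse=True)
    keep = max(1, round(top_fraction * len(reductions)))
    top = reductions[:keep]
    return sum(top) / len(top)
```

Setting `top_fraction=1.0` gives the plain average over all points of the curve.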

Results – Data Utilization • On all but one dataset Active-Decorate produces improvements over Decorate • On average it requires 78% of the number of examples that Decorate needs • With as few as 29% of examples on soybean • On breast-w we notice a ceiling effect where none of the active methods improve on Decorate • Active-Decorate outperforms both QBag and QBoost on 10 datasets • On some datasets (vowel & primary), QBag & QBoost failed to achieve the target error • Decorate itself achieves the target error with far fewer examples than are available • e.g. on breast-w it achieves the target error with only 30 of the available 630 examples • Hence improving on the data utilization of Decorate is fairly challenging

Results – Error Reduction • On all datasets Active-Decorate produces significant reductions in error over Decorate • On 8 datasets Active-Decorate produces higher reductions than other active methods • It produces a wide range of improvements • From moderate (4.2% on credit-g) to high (70.68% on vowel) • With an average reduction of 21.2%

Learning Curve for Soybean

Measures of Utility • There are two main aspects of any QBC approach • The method employed to construct the committee • Measure used to rank utility of unlabeled examples • We compared different methods for constructing committees • Ranked examples based on margins • Alternate approach – use Jensen-Shannon (JS) divergence [Cover & Thomas 91] • JS-div is a measure of similarity between probability distributions

Jensen-Shannon Divergence • If Pi(x) is the class probability distribution given by the i-th classifier for example x, then the JS-div of an ensemble of size n is defined as: $JS(P_1, \ldots, P_n) = H\left(\sum_{i=1}^{n} w_i P_i\right) - \sum_{i=1}^{n} w_i H(P_i)$, where $w_i$ is the vote weight of the i-th classifier • H(P) is the Shannon entropy of distribution P = {pj, j=1,…,K}, defined as: $H(P) = -\sum_{j=1}^{K} p_j \log p_j$ • Higher values of JS-div indicate greater spread in the predicted class probability distributions • Zero iff the distributions are identical • A similar measure was used by [McCallum & Nigam 98] • We ran experiments, as before, comparing JS-div with margins
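
As a sketch, the generalized JS divergence, H(Σ wᵢPᵢ) − Σ wᵢH(Pᵢ), can be computed as below. The uniform default weights and the natural logarithm are assumptions of this example; the slides do not fix either.

```python
import math

def entropy(dist):
    """Shannon entropy H(P) = -sum_j p_j * log(p_j), taking 0 * log 0 as 0."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def js_divergence(dists, weights=None):
    """Generalized Jensen-Shannon divergence of n class distributions:
    the entropy of the weighted mixture minus the weighted mean entropy.

    dists: list of probability lists, one per committee member.
    weights: vote weights summing to 1 (uniform if omitted).
    """
    n = len(dists)
    if weights is None:
        weights = [1.0 / n] * n
    num_classes = len(dists[0])
    mixture = [sum(w * d[k] for w, d in zip(weights, dists))
               for k in range(num_classes)]
    return entropy(mixture) - sum(w * entropy(d) for w, d in zip(weights, dists))
```

Identical member distributions give a divergence of zero, while two members that are each certain of a different class give log 2 under uniform weights.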

Results – Utility Measures • In terms of data utilization, both methods are equally matched • On error reduction, using margins is more effective • JS-div selects examples to reduce uncertainty in the predicted class membership probabilities • Which indirectly helps improve accuracy • Margins focus more directly on determining the decision boundary • Cost-sensitive decisions require accurate class probability estimates • Using JS-div could be more effective in such cases

Learning Curve for Vowel • Often both measures achieve target error with comparable number of examples • But error reduction produced by margins is higher

Committees for Sample Selection vs. Prediction • All active methods described use committees to select examples • In addition to sample selection, they also use the committees for prediction • We are evaluating the combination of sample selection and ensemble method • Active-Decorate does better than QBag • Could just be because Decorate is better than Bagging • Claim: Decorate not only produces accurate committees, but the committees it produces are also more effective for sample selection

Committees for Sample Selection vs. Prediction • Implemented variant of Active-Decorate • At each iteration a committee constructed by Bagging is used to select examples given to Decorate • Thus separating evaluation of selector from predictor • Similarly, implemented a variant using AdaBoost as the selector • Compared the 3 variants on 4 datasets • On 3 of 4 datasets, using any selector with Decorate as predictor performed better than random selection • On the 4th dataset, the trends are same, but not statistically significant • Compared to AdaBoost and Bagging, Decorate committees select more informative examples for training Decorate

Learning Curve for Soybean

Related Work • Dagan & Engelson [95] measure utility of examples using vote entropy • i.e. the entropy of the class distribution based on majority votes of each committee member • [McCallum & Nigam 98] showed that it does not perform as well as JS-div • Another committee-based active learner – Co-Testing [Muslea et al. 00] • Requires 2 redundant views of the data • Hence limited applicability • Expected-error reduction methods [Cohn et al. 96, Roy & McCallum 01, Zhu et al. 03] • Select examples that are expected to minimize error on the actual test distribution • Is computationally intense, and must be tailored to specific learners • Active meta-learners like Active-Decorate can be applied to any learner

Future Work & Conclusions • Active-Decorate is a simple, yet effective approach to active learning • Produces significant improvements over Decorate • In general, it leads to more effective sample selection than QBag and QBoost • Using JS-divergence to evaluate effectiveness of examples is less effective for improving classification accuracy than margins • JS-div may be a better measure when the objective is improving class probability estimates • Active-Decorate is a meta-learning scheme – so it can be applied to other base learners • We can compare with other active learners, such as approaches for SVMs [Tong et al. 01]

Questions? DECORATE is now available as part of the Weka ML package. Machine Learning Group, UT-Austin www.cs.utexas.edu/users/ml

Ensemble Diversity • Combining classifiers is only useful if they disagree on some inputs • Diversity refers to a measure of disagreement (ambiguity) • For regression • Using mean squared error to measure accuracy • Using variance to measure diversity • Ensemble generalization error [Krogh & Vedelsby ′95]: $E = \bar{E} - \bar{D}$ • $\bar{E}$ – average error of the ensemble members • $\bar{D}$ – average diversity of the ensemble • Increasing diversity while maintaining error of ensemble members → decreases ensemble error
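
The regression relation (ensemble error equals average member error minus average diversity) can be checked numerically. The sketch below assumes a uniformly weighted ensemble, squared error and a single example; the function name is hypothetical.

```python
def ambiguity_decomposition(member_preds, target):
    """Return (ensemble_error, mean_member_error, mean_diversity) for one
    example under squared error, with the ensemble predicting the mean."""
    m = len(member_preds)
    ensemble_pred = sum(member_preds) / m
    ensemble_error = (ensemble_pred - target) ** 2
    mean_member_error = sum((p - target) ** 2 for p in member_preds) / m
    # Diversity: variance of the members around the ensemble prediction.
    mean_diversity = sum((p - ensemble_pred) ** 2 for p in member_preds) / m
    return ensemble_error, mean_member_error, mean_diversity
```

For any inputs the first value equals the second minus the third, which is why raising diversity at fixed member error lowers the ensemble error.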

Diversity for Classification • For classification the simple linear relation doesn’t hold • We still have reason to believe that diversity is related to error reduction [Cunningham ′00] • Many measures of diversity have been used in the literature • [Kuncheva et al. ′03] compared different measures • They show that most of these measures are highly correlated • No conclusive study points to which measure of diversity is the best to use

Learning Curve for Soybean (Full)

Learning Curve for Vowel (Full)

Learning Curve for Soybean (Full)

Related Work • There have been other ensemble methods that focus on diversity • [Liu & Yao ′99], [Rosen ′96], [Opitz & Shavlik ′96], [Zenobi & Cunningham ′01], [Tumer and Ghosh ′96], [Opitz ′99] • How our work differs from others: • Other methods attempt to optimize accuracy and diversity of individual ensemble members • We try to minimize error of the entire ensemble by increasing diversity • Some methods are dependent on the underlying learner (e.g. NN) • DECORATE is a general meta-learner applicable to any base learner • We compare with standard ensemble methods; others don’t • Except for [Opitz ′99] • We present learning curves, which evaluate performance with varying amounts of data

Modeling Artificial Data • We use a very crude approximation of the data distribution • Assume independence of features • Assume Gaussian distribution for numeric attributes • We can do a better job of modeling the data • But, we get good results with the current method • It is unclear that a better model will improve results • It will however increase run time

Artificial vs. Unlabeled Data • The way we use artificial examples may appear counterintuitive compared to the way unlabeled data is used in semi-supervised learning • Where the labels given to the unlabeled data by the supervised learner are preserved (instead of being flipped) • Why does semi-supervised learning work? • Unlabeled data provides more information about the data distribution • Artificial data does not • Why does flipping labels not hurt Decorate? • If the current ensemble is accurate, aren’t we forcing subsequent members to not be accurate? • No – we make sure that the error of the ensemble never increases

When Should You Use DECORATE? • When you have few training examples • Or acquiring labeled data is expensive • For large amt. of training data you may still do better than Boosting • DECORATE performs better on 6 of 15 datasets given 100% of the data • For your dataset there is a good chance that DECORATE will outperform Boosting even with large amounts of data • When your base classifier cannot handle weighted examples • Boosting can be done with resampling – but might not be desirable • When you have noisy data • Boosting often increases error due to overfitting noisy data [Dietterich 00] • DECORATE is resilient to noise in data [Melville et al. 04]

Other Ensemble Methods • There are other ensemble methods that we can compare to • Error-Correcting Output Coding [Dietterich & Bakiri ′95] • Injecting randomness into the learning algorithm • We chose to compare to Bagging and Boosting • They are the most widely used and studied • We also compared to Random Forests • Which is not a meta-learner • But since we use decision trees we also compared with RFs

[Figure panels: Labor, Iris, Heart-C, Breast-W]

Bagging [Breiman ′96] • Each classifier is trained on a set of m training examples • Examples drawn randomly with replacement from the original set of size m • Such a set is called a bootstrap replicate • Predictions are made by taking the majority vote of the ensemble • Ensemble members differ because they’re trained on different subsets of the data • Bagging reduces error due to variance of the base classifier
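
A compact sketch of Bagging as described above. `base_learner` (data → classifier) is a hypothetical caller-supplied function, and the default ensemble size of 15 simply mirrors the size used in the experiments.

```python
import random
from collections import Counter

def bootstrap_replicate(data, rng=random):
    """Draw len(data) examples uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

def bagging(data, base_learner, size=15, rng=random):
    """Train `size` classifiers, each on its own bootstrap replicate."""
    return [base_learner(bootstrap_replicate(data, rng)) for _ in range(size)]

def bagged_predict(ensemble, x):
    """Majority vote of the ensemble members."""
    return Counter(clf(x) for clf in ensemble).most_common(1)[0][0]
```

Each replicate has the same size as the original set but typically omits some examples and repeats others, which is what makes the members differ.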

Boosting (AdaBoost.M1) [Freund & Schapire ′96] • Maintains a set of weights over the training examples • In each iteration classifier Ci is trained to minimize the weighted error • The weighted error of Ci is used to update the distribution of weights • Weights of misclassified examples are increased • Weights of correctly classified examples are decreased • Next classifier is trained on examples with updated distribution • This process is repeated for a specified number of iterations • Ensemble predictions are made using a weighted vote of individual classifiers • Weight of each classifier is computed according to its training accuracy
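
The weight update at the heart of AdaBoost.M1 can be sketched as a single function. The function name and the decision to raise an error when the weighted error falls outside (0, 0.5) are choices of this example; a full implementation would instead stop boosting at that point.

```python
import math

def adaboost_m1_update(weights, correct):
    """One AdaBoost.M1 reweighting step.

    weights: current example weights, summing to 1.
    correct: booleans, True where classifier C_i labeled the example correctly.
    Returns (new_weights, classifier_vote_weight).
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0 < error < 0.5:
        raise ValueError("AdaBoost.M1 needs a weighted error in (0, 0.5)")
    beta = error / (1 - error)
    # Shrink the weights of correctly classified examples, then renormalize,
    # which raises the relative weight of the misclassified ones.
    scaled = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled], math.log(1 / beta)
```

With four equally weighted examples and one mistake, the misclassified example ends up carrying half of the total weight for the next round.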
