
Sequential Genetic Search for Ensemble Feature Selection



1. IJCAI'2005, Edinburgh, Scotland, August 1-5, 2005
Sequential Genetic Search for Ensemble Feature Selection
Alexey Tsymbal, Padraig Cunningham, Department of Computer Science, Trinity College Dublin, Ireland
Mykola Pechenizkiy, Department of Computer Science, University of Jyväskylä, Finland

2. Contents
• Introduction
• Classification and ensemble classification
• Ensemble feature selection: strategies; sequential genetic search
• Our GAS-SEFS strategy: Genetic Algorithm-based Sequential Search for Ensemble Feature Selection
• Experiment design
• Experimental results
• Conclusions and future work

3. The Task of Classification
Given $n$ training instances $(x_i, y_i)$, where $x_i$ are attribute values and $y_i$ is the class label ($J$ classes, $n$ training observations, $p$ features).
Goal: given a new instance $x_0$, predict its class $y_0$.
Examples: prognosis of breast cancer recurrence; diagnosis of thyroid diseases; antibiotic resistance prediction.
[Diagram: a training set is used to build a classifier, which assigns a class membership to each new instance to be classified.]

4. Ensemble Classification
How do we prepare inputs for the generation of the base classifiers?

5. Ensemble Classification
How do we combine the predictions of the base classifiers?

6. Ensemble Feature Selection
How to prepare inputs for the generation of the base classifiers?
• Sampling the training set
• Manipulation of input features
• Manipulation of output targets (class values)
Goal of traditional feature selection: find and remove features that are unhelpful or misleading to learning, producing one feature subset for a single classifier.
Goal of ensemble feature selection:
• find and remove features that are unhelpful or destructive to learning, producing a different feature subset for each of a number of classifiers;
• find feature subsets that promote diversity (disagreement) between the classifiers.

7. Search in EFS
Search space: $2^{\#Features} \cdot \#Classifiers$
Search strategies include:
• Ensemble Forward Sequential Selection (EFSS)
• Ensemble Backward Sequential Selection (EBSS)
• Hill-Climbing (HC)
• Random Subspace Method (RSM)
• Genetic Ensemble Feature Selection (GEFS)
Fitness function: $fitness_i = acc_i + \alpha \cdot div_i$, where $acc_i$ is the accuracy of base classifier $i$, $div_i$ is its diversity with respect to the rest of the ensemble, and $\alpha$ is a coefficient balancing accuracy against diversity.
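
As a minimal illustration of this fitness function, here is a one-line Python sketch; the function name is ours, and the accuracy and diversity estimators are those discussed on the neighboring slides:

```python
def ensemble_fitness(accuracy: float, diversity: float, alpha: float) -> float:
    """Fitness of one candidate feature subset: the validation accuracy of
    the base classifier built on it, plus alpha times its diversity with
    respect to the other ensemble members."""
    return accuracy + alpha * diversity
```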

8. Measuring Diversity
The fail/non-fail disagreement measure: the percentage of test instances for which the classifiers make different predictions but for which one of them is correct:
$div_{i,j} = \frac{N^{10} + N^{01}}{N}$,
where $N^{10}$ is the number of instances classified correctly by classifier $i$ but not by classifier $j$, $N^{01}$ is the converse, and $N$ is the total number of instances.
The kappa statistic:
$\kappa = \frac{\theta_1 - \theta_2}{1 - \theta_2}$,
where $\theta_1$ is the observed agreement between the two classifiers and $\theta_2$ is the agreement expected by chance; low kappa indicates high diversity.
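
Both measures can be computed directly from the classifiers' predictions on the validation set. A minimal sketch, assuming predictions and true labels come as NumPy arrays (function names are ours):

```python
import numpy as np

def fail_nonfail_disagreement(pred_i, pred_j, y_true):
    """Fraction of instances on which the two classifiers predict
    differently and exactly one of them is correct."""
    pred_i, pred_j, y_true = map(np.asarray, (pred_i, pred_j, y_true))
    exactly_one_correct = (pred_i == y_true) ^ (pred_j == y_true)
    return float(np.mean(exactly_one_correct))

def kappa_statistic(pred_i, pred_j):
    """Kappa agreement between two classifiers: theta1 is the observed
    agreement, theta2 the agreement expected by chance (assumed < 1);
    lower kappa means higher diversity."""
    pred_i, pred_j = np.asarray(pred_i), np.asarray(pred_j)
    theta1 = float(np.mean(pred_i == pred_j))
    classes = np.union1d(pred_i, pred_j)
    theta2 = sum(float(np.mean(pred_i == c)) * float(np.mean(pred_j == c))
                 for c in classes)
    return (theta1 - theta2) / (1.0 - theta2)
```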

9. Random Subspace Method
• RSM itself is a simple but effective technique for EFS:
• the lack of accuracy of the individual ensemble members is compensated for by their diversity;
• it does not suffer from the curse of dimensionality.
• RSM is used as the basis of other EFS strategies, including Genetic Ensemble Feature Selection:
• generation of the initial feature subsets using RSM;
• a number of refining passes over each feature subset while there is improvement in fitness.
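
A minimal sketch of the RSM initialization step; the per-feature inclusion probability of 0.5 is our assumption, while the ban on full feature sets comes from slide 15:

```python
import random

def random_subspace(n_features: int, rng: random.Random) -> list:
    """Draw one feature subset for the initial population: each feature is
    included independently with probability 0.5; empty subsets are useless
    and full subsets are disallowed, so both are redrawn."""
    while True:
        subset = [f for f in range(n_features) if rng.random() < 0.5]
        if 0 < len(subset) < n_features:
            return subset

# Example: an initial population of 10 subsets over 9 features.
rng = random.Random(42)
population = [random_subspace(9, rng) for _ in range(10)]
```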

10. Genetic Ensemble Feature Selection
• Genetic search is an important direction in FS research; the GA is an effective global optimization technique.
• GA for EFS:
• [Kuncheva, 1993]: ensemble accuracy used instead of the accuracies of the base classifiers:
• the fitness function is biased towards a particular integration method;
• preventive measures are needed to avoid overfitting.
• Alternative: use individual accuracy and diversity:
• overfitting of an individual is more desirable than overfitting of the ensemble.
• [Opitz, 1999]: explicitly used diversity in the fitness function:
• RSM for the initial population;
• new candidates generated by crossover and mutation;
• roulette-wheel selection (selection probability proportional to fitness; see the sketch below).
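
A minimal sketch of roulette-wheel selection, assuming non-negative fitness values (GAS-SEFS later replaces raw fitness with log(1 + fitness), see slide 15):

```python
import random

def roulette_wheel_select(population, fitnesses, rng: random.Random):
    """Pick one individual with probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]
```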

11. Genetic Ensemble Feature Selection
[Diagram: a population of genotypes (bit strings encoding the base classifiers' feature subsets, e.g. 10011, 01001) is mapped by the coding scheme to the phenotype space, where fitness $f$ is evaluated; selection, recombination, and mutation produce new genotypes, forming the current ensemble of base classifiers.]

12. Basic Idea behind GA for EFS
[Diagram: RSM initializes the current population of feature subsets, one per base classifier $BC_1 \dots BC_{EnsSize}$; the GA evolves the current population (evaluated with diversity) into a new population (evaluated with fitness), and one generation yields the whole ensemble.]

13. Basic Idea behind GAS-SEFS
[Diagram: RSM initializes the population of each genetic process; process $GA_{i+1}$ evolves the current population (evaluated by accuracy) into a new population (evaluated by fitness, with diversity measured against the already selected base classifiers $BC_1 \dots BC_i$), and its best individual becomes the new base classifier $BC_{i+1}$ in the ensemble.]

14. GAS-SEFS (1 of 2)
• GAS-SEFS (Genetic Algorithm-based Sequential Search for Ensemble Feature Selection):
• instead of maintaining a set of feature subsets in each generation as GA does, it applies a series of genetic processes, one for each base classifier, sequentially;
• after each genetic process, one base classifier is selected into the ensemble.
• GAS-SEFS uses the same fitness function, but:
• diversity is calculated with respect to the base classifiers already formed by the previous genetic processes;
• in the first genetic process, fitness is based on accuracy only.
• GAS-SEFS uses the same genetic operators as GA. A sketch of the sequential loop follows below.
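
A minimal sketch of this sequential loop; `run_genetic_process` and `build_classifier` are hypothetical caller-supplied stand-ins for the genetic search of slides 10 and 15 and for training the base learner:

```python
def gas_sefs(ensemble_size, run_genetic_process, build_classifier,
             train, val, alpha):
    """GAS-SEFS skeleton: one full genetic process per base classifier,
    run sequentially; diversity is computed against the members already
    selected into the ensemble."""
    ensemble = []
    for _ in range(ensemble_size):
        # In the first process there is nothing to be diverse from,
        # so fitness reduces to accuracy alone.
        effective_alpha = alpha if ensemble else 0.0
        best_subset = run_genetic_process(train, val, ensemble,
                                          effective_alpha)
        ensemble.append(build_classifier(train, best_subset))
    return ensemble
```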

15. GAS-SEFS (2 of 2)
GA and GAS-SEFS peculiarities:
• full feature sets are not allowed in the RSM initialization;
• the crossover operator may not produce a full feature subset;
• individuals for crossover are selected randomly, with probability proportional to log(1 + fitness) instead of raw fitness;
• the generation of children identical to their parents is prohibited;
• to provide better diversity in the length of the feature subsets, two different mutation operators are used (sketched below):
• Mutate1_0 deletes features randomly with a given probability;
• Mutate0_1 adds features randomly with a given probability.
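
A minimal sketch of the two mutation operators, assuming mutation acts on each feature independently with the given probability (the slide does not fix the exact per-feature scheme):

```python
import random

def mutate_1_0(subset: set, p: float, rng: random.Random) -> set:
    """Mutate1_0: delete each included feature with probability p."""
    return {f for f in subset if rng.random() >= p}

def mutate_0_1(subset: set, p: float, n_features: int,
               rng: random.Random) -> set:
    """Mutate0_1: add each currently excluded feature with probability p."""
    return subset | {f for f in range(n_features)
                     if f not in subset and rng.random() < p}
```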

16. Computational Complexity
The complexity of GA-based search does not depend on the number of features.
GAS-SEFS: $O(S \cdot S' \cdot N_{gen})$ evaluated feature subsets; GA: $O(S' \cdot N_{gen})$,
where $S$ is the number of base classifiers, $S'$ is the number of individuals (feature subsets) evaluated in one generation, and $N_{gen}$ is the number of generations.
EFSS and EBSS: $O(S \cdot N' \cdot N)$, where $S$ is the number of base classifiers, $N$ is the total number of features, and $N'$ is the number of features included or deleted on average in an FSS or BSS search.
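
A quick check against the experimental settings on slide 18 ($S = 10$ base classifiers, $S' = 40$ feature subsets per generation, $N_{gen} = 10$ generations) reproduces the reported search lengths:

```latex
S' \cdot N_{gen} = 40 \cdot 10 = 400 \ \text{subsets (GA)}, \qquad
S \cdot S' \cdot N_{gen} = 10 \cdot 40 \cdot 10 = 4000 \ \text{subsets (GAS-SEFS)}
```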

17. Integration of Classifiers
Selection and combination methods can be static or dynamic:
• Selection: Static Selection (CVM) | Dynamic Selection (DS)
• Combination: Weighted Voting (WV, sketched below) | Dynamic Voting (DV) and Dynamic Voting with Selection (DVS)
Motivation for the dynamic integration: each classifier is best in some sub-areas of the whole data set, where its local error is comparatively smaller than the corresponding errors of the other classifiers.
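
As one concrete example of a static combination method, a minimal sketch of Weighted Voting; using validation accuracy as the weight is our assumption:

```python
def weighted_voting(predictions, weights):
    """Weighted Voting (WV): each base classifier votes for its predicted
    class with its weight (e.g. validation accuracy); the class with the
    largest total weight wins."""
    scores = {}
    for pred, w in zip(predictions, weights):
        scores[pred] = scores.get(pred, 0.0) + w
    return max(scores, key=scores.get)

# Example: three base classifiers, two of them voting for class "a".
print(weighted_voting(["a", "b", "a"], [0.7, 0.9, 0.6]))  # -> "a"
```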

18. Experimental Design
Parameter settings for GA and GAS-SEFS:
• mutation rate: 50%;
• population size: 10;
• search length of 40 feature subsets (individuals) per generation:
• 20 are offspring of the current population of 10 classifiers, generated by crossover;
• 20 are mutated offspring (10 with each mutation operator);
• 10 generations of individuals were produced;
• 400 (GA) and 4,000 (GAS-SEFS) feature subsets in total.
To evaluate GA and GAS-SEFS:
• 5 integration methods;
• Simple Bayes as the base classifier;
• stratified random sampling with 60%/20%/20% of instances in the training/validation/test sets (see the sketch below);
• 70 test runs on each of 21 UCI data sets for each strategy and diversity measure.
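
A sketch of one such stratified 60/20/20 split, using scikit-learn as a modern stand-in for the original experimental code:

```python
from sklearn.model_selection import train_test_split

def stratified_60_20_20(X, y, seed):
    """One stratified train/validation/test split with 60%/20%/20% of
    the instances; the experiments repeat this 70 times per data set."""
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, train_size=0.6, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)
```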

19. GA vs GAS-SEFS on Two Groups of Data Sets
[Chart: ensemble accuracies for GA and GAS-SEFS on two groups of data sets, (1) fewer than 9 features and (2) 9 or more features, for four ensemble sizes; DVS integration, fail/non-fail disagreement diversity; x-axis: ensemble size.]

20. GA vs GAS-SEFS for Five Integration Methods
[Chart: ensemble accuracies for the five integration methods on the Tic-Tac-Toe data set; ensemble size = 10.]

21. Conclusions and Future Work
• Diversity in an ensemble of classifiers is very important.
• We have considered two genetic search strategies for EFS.
• The new strategy, GAS-SEFS, consists in employing a series of genetic search processes, one for each base classifier.
• GAS-SEFS results in better ensembles with greater accuracy, especially for data sets with relatively large numbers of features:
• one reason is that each of the core GA processes leads to significant overfitting of the corresponding ensemble member.
• GAS-SEFS is significantly more time-consuming than GA: its complexity is ensemble_size times that of GA.
• [Oliveira et al., 2003] obtained better results for single FSS based on Pareto-front dominating solutions; adapting this technique to EFS is an interesting topic for further research.

22. Thank you!
Alexey Tsymbal, Padraig Cunningham, Department of Computer Science, Trinity College Dublin, Ireland (Alexey.Tsymbal@cs.tcd.ie, Padraig.Cunningham@cs.tcd.ie)
Mykola Pechenizkiy, Department of Computer Science and Information Systems, University of Jyväskylä, Finland (mpechen@cs.jyu.fi)

  23. Additional Slides

24. References
• [Kuncheva, 1993] Ludmila I. Kuncheva. Genetic algorithm for feature selection for parallel classifiers. Information Processing Letters 46: 163-168, 1993.
• [Kuncheva and Jain, 2000] Ludmila I. Kuncheva and Lakhmi C. Jain. Designing classifier fusion systems by genetic algorithms. IEEE Transactions on Evolutionary Computation 4(4): 327-336, 2000.
• [Oliveira et al., 2003] Luiz S. Oliveira, Robert Sabourin, Flavio Bortolozzi, and Ching Y. Suen. A methodology for feature selection using multi-objective genetic algorithms for handwritten digit string recognition. International Journal of Pattern Recognition and Artificial Intelligence 17(6): 903-930, 2003.
• [Opitz, 1999] David Opitz. Feature selection for ensembles. In Proceedings of the 16th National Conference on Artificial Intelligence, pages 379-384, AAAI Press, 1999.

25. GAS-SEFS Algorithm
[Slide shows the GAS-SEFS algorithm pseudocode.]

26. Other Interesting Findings
• The best values of alpha (the diversity coefficient in the fitness function):
• were different for different data sets;
• for both GA and GAS-SEFS, alpha for the dynamic integration methods is bigger than for the static ones (2.2 vs 0.8 on average);
• GAS-SEFS needs slightly higher values of alpha than GA (1.8 vs 1.5 on average): GAS-SEFS always starts with a classifier based on accuracy only, and the subsequent classifiers need more diversity than accuracy.
• The number of selected features falls as the ensemble size grows:
• this is especially clear for GAS-SEFS, as its base classifiers need more diversity.
• Integration methods (for both GA and GAS-SEFS):
• the static methods, SS and WV, and the dynamic DS start to overfit the validation set after only 5 generations and show lower accuracies;
• the accuracies of DV and DVS continue to grow up to 10 generations.

27. Paper Summary
• A new strategy for genetic ensemble feature selection, GAS-SEFS, is introduced.
• In contrast with the previously considered algorithm (GA), it is sequential: a series of genetic processes, one for each base classifier.
• It is more time-consuming, but yields better accuracy.
• Each base classifier has a considerable level of overfitting with GAS-SEFS, but the ensemble accuracy grows.
• Experimental comparisons demonstrate clear superiority on 21 UCI data sets, especially for data sets with many features (group 1 vs group 2).

28. Simple Bayes as Base Classifier
• Bayes theorem: $P(C|X) = P(X|C) \cdot P(C) / P(X)$
• Naïve assumption of attribute independence: $P(x_1, \dots, x_k|C) = P(x_1|C) \cdot \dots \cdot P(x_k|C)$
• If the i-th attribute is categorical: $P(x_i|C)$ is estimated as the relative frequency of samples having value $x_i$ as the i-th attribute in class C.
• If the i-th attribute is continuous: $P(x_i|C)$ is estimated through a Gaussian density function.
• Computationally easy in both cases.
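
A minimal sketch of Simple (naïve) Bayes for continuous attributes with Gaussian class-conditional densities; extending it with relative frequencies for categorical attributes is straightforward, and all names are ours:

```python
import numpy as np

def fit_gaussian_bayes(X, y):
    """Per class: prior P(C) plus per-attribute mean and std."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # small floor on std avoids division by zero for constant attributes
        model[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.std(axis=0) + 1e-9)
    return model

def predict_gaussian_bayes(model, x):
    """argmax_C P(C) * prod_i P(x_i | C), with Gaussian densities."""
    x = np.asarray(x, dtype=float)
    def score(c):
        prior, mu, sigma = model[c]
        dens = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
        return prior * np.prod(dens)
    return max(model, key=score)
```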

29. Data Set Characteristics
[Slide shows a table with the characteristics of the 21 UCI data sets.]

30. GA vs GAS-SEFS for Five Integration Methods
[Charts: ensemble accuracies for GA (left) and GAS-SEFS (right) for the five integration methods and four ensemble sizes on Tic-Tac-Toe; x-axis: ensemble size.]
