
Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi


Presentation Transcript


  1. Feature Selection Methods Part-I • By: Dr. Rajeev Srivastava, IIT(BHU), Varanasi

  2. Introduction • A feature is a function of one or more measurements, each of which specifies some quantifiable property of an image, and is computed so that it quantifies some significant characteristic of the object. • Feature selection is the process of selecting a subset of relevant features for use in model construction. • The features removed should be useless, redundant, or of the least possible use. • The goal of feature selection is to find the subset of features that produces the best target detection and recognition performance while requiring the least computational effort.

  3. Reasons for Feature Selection • Feature selection is important to target detection and recognition systems mainly for three reasons: • First, using more features can increase system complexity, yet it may not always lead to higher detection/recognition accuracy. Sometimes many features are available to a detection/recognition system; these features are not independent and may be correlated, and a bad feature may greatly degrade the performance of the system. Thus, selecting a subset of good features is important. • Second, selecting many features means that a more complicated model is used to approximate the training data. According to the minimum description length principle (MDLP), a simple model is better than a complex model. • Third, using fewer features reduces the computational cost, which is important for real-time applications; it may also lead to better classification accuracy due to the finite sample size effect. • Feature selection techniques provide three main benefits when constructing predictive models: • Improved model interpretability • Shorter computation times • Enhanced generalisation by reducing overfitting

  4. Advantages of Feature Selection • It reduces the dimensionality of the feature space, limiting storage requirements and increasing algorithm speed. • It removes redundant, irrelevant, or noisy data. • The immediate effects for data analysis tasks are speeding up the running time of the learning algorithms, improving data quality, and increasing the accuracy of the resulting model. • Feature set reduction saves resources in the next round of data collection or during utilization. • Performance improvement gives a gain in predictive accuracy. • Data understanding provides knowledge about the process that generated the data, or simply helps visualize the data.

  5. Taxonomy of Feature Selection • [Figure: taxonomy of feature selection methods in statistical pattern recognition; deterministic methods produce the same subset on a given problem every time.]

  6. Feature Selection Approaches • There are two basic approaches to feature selection: • Forward Selection: Start with no variables and add them one by one, at each step adding the one that decreases the error the most, until any further addition does not significantly decrease the error. • Backward Selection: Start with all the variables and remove them one by one, at each step removing the one that decreases the error the most (or increases it only slightly), until any further removal increases the error significantly. • To reduce overfitting, the error referred to above is the error on a validation set that is distinct from the training set.
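The following is a minimal sketch of the forward-selection loop described above, using a held-out validation set and a stopping tolerance. Scikit-learn, the logistic-regression classifier, the synthetic dataset, and the tolerance value are illustrative assumptions, not part of the original slides.

```python
# Forward selection sketch driven by validation error (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def val_error(features):
    """Validation error of a classifier trained on the given feature indices."""
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, features], y_tr)
    return 1.0 - clf.score(X_val[:, features], y_val)

selected, best_err, tol = [], 1.0, 1e-3     # best_err = error before any feature is added
while len(selected) < X.shape[1]:
    # Try adding each remaining feature; keep the one that lowers error most.
    remaining = [f for f in range(X.shape[1]) if f not in selected]
    errs = {f: val_error(selected + [f]) for f in remaining}
    f_best = min(errs, key=errs.get)
    if best_err - errs[f_best] < tol:       # no significant improvement -> stop
        break
    selected.append(f_best)
    best_err = errs[f_best]

print("selected features:", selected, "validation error:", round(best_err, 3))
```

Backward selection is the mirror image: start from all features and repeatedly drop the one whose removal leaves the lowest validation error.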

  7. Schemes for Feature Selection • The relationship between an FSA (feature selection algorithm) and the inducer chosen to evaluate the usefulness of the feature selection process can take three main forms: • Filter Methods: These methods select features based on discriminating criteria that are relatively independent of classification. • The minimum redundancy-maximum relevance (mRMR) method is an example of a filter method. It supplements the maximum-relevance criterion with a minimum-redundancy criterion, choosing additional features that are maximally dissimilar to those already identified.
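A minimal sketch of the filter idea, scoring each feature by its mutual information with the class label independently of any classifier; the redundancy term of mRMR is sketched with slide 17. Scikit-learn and the synthetic dataset are assumptions, not part of the original slides.

```python
# Filter-style relevance scoring: rank features by mutual information with
# the class label, without training any classifier (sketch).
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)

relevance = mutual_info_classif(X, y, random_state=0)   # one score per feature
ranking = relevance.argsort()[::-1]                     # most relevant first
print("feature ranking by relevance:", ranking.tolist())
```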

  8. Wrapper Methods: These methods select features by evaluating candidate subsets with the classifier (inducer) itself, so the discriminating criterion is the predictive performance of that classifier rather than a classifier-independent measure. • Embedded Methods: The inducer has its own FSA (either explicit or implicit). The methods that induce logical conjunctions provide an example of this embedding; other traditional machine learning tools such as decision trees or artificial neural networks also fall into this scheme.
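A minimal sketch of the wrapper idea, scoring a candidate subset by the cross-validated accuracy of the chosen inducer; scikit-learn, the logistic-regression inducer, and the 5-fold setup are assumptions, not part of the original slides.

```python
# Wrapper-style evaluation: score a feature subset by the cross-validated
# accuracy of the chosen classifier (sketch).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)

def wrapper_score(subset):
    """Mean 5-fold CV accuracy of the inducer restricted to `subset`."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, subset], y, cv=5).mean()

print("score of features [0, 1, 2]:", round(wrapper_score([0, 1, 2]), 3))
```

Any search strategy (SFS, SBS, GA, ...) can drive this evaluation; the subset found is then specific to the classifier used.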

  9. Filters vs Wrappers • Filters: • Fast execution (+): Filters generally involve a non-iterative computation on the dataset, which can execute much faster than a classifier training session. • Generality (+): Since filters evaluate the intrinsic properties of the data, rather than their interactions with a particular classifier, their results exhibit more generality: the solution will be “good” for a larger family of classifiers. • Tendency to select large subsets (-): Since the filter objective functions are generally monotonic, the filter tends to select the full feature set as the optimal solution. This forces the user to select an arbitrary cutoff on the number of features to be selected. • Wrappers: • Accuracy (+): Wrappers generally achieve better recognition rates than filters, since they are tuned to the specific interactions between the classifier and the dataset. • Ability to generalize (+): Wrappers have a mechanism to avoid overfitting, since they typically use cross-validation measures of predictive accuracy. • Slow execution (-): Since the wrapper must train a classifier for each feature subset (or several classifiers if cross-validation is used), the method can become infeasible for computationally intensive classifiers. • Lack of generality (-): The solution lacks generality, since it is tied to the bias of the classifier used in the evaluation function; the “optimal” feature subset will be specific to the classifier under consideration.

  10. Naïve method • Sort the given d features in order of their probability of correct recognition • Select the top m features from this sorted list • Disadvantage: feature correlation is not considered; the best pair of features may not even contain the best individual feature • Sequential methods • Begin with a single solution (feature subset) and iteratively add or remove features until some termination criterion is met • Bottom-up (forward method): begin with an empty set and add features • Top-down (backward method): begin with a full set and delete features • These “greedy” methods do not examine all possible subsets, so there is no guarantee of finding the optimal subset
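A minimal sketch of the naïve method: rank features by their individual recognition rate and keep the top m. Scikit-learn, the single-feature cross-validated accuracy used as the score, and m = 3 are assumptions, not part of the original slides.

```python
# Naive method sketch: score each feature individually, sort, keep the top m.
# Correlation between features is ignored, which is exactly its weakness.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)

def single_feature_accuracy(f):
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, [f]], y, cv=5).mean()

m = 3
scores = [(single_feature_accuracy(f), f) for f in range(X.shape[1])]
top_m = [f for _, f in sorted(scores, reverse=True)[:m]]
print("top", m, "individually best features:", top_m)
```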

  11. Sequential Forward Selection • Algorithm: 1. Start with the empty set Y0 = ∅ 2. Select the next best feature x+ 3. Update Yk+1 = Yk + x+; k = k + 1 4. Go to step 2 • SFS performs best when the optimal subset is small. When the search is near the empty set, a large number of states can potentially be evaluated; towards the full set, the region examined by SFS is narrower, since most features have already been selected. • The search space is often drawn as an ellipse to emphasize that there are fewer states towards the full or empty sets. • Disadvantage: once a feature is retained, it cannot be discarded (the nesting problem).
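A minimal sketch of the SFS loop with an abstract criterion function J(subset); any subset-quality measure (for example the wrapper score sketched earlier) could be plugged in, and the toy J used here is purely illustrative.

```python
# Sequential Forward Selection skeleton with an abstract criterion J(subset).
def sfs(n_features, J, subset_size):
    Y = []                                                # step 1: Y0 = empty set
    while len(Y) < subset_size:
        remaining = [f for f in range(n_features) if f not in Y]
        x_plus = max(remaining, key=lambda f: J(Y + [f])) # step 2: best next feature
        Y.append(x_plus)                                  # step 3: Y_{k+1} = Y_k + x+
    return Y                                              # features are never dropped (nesting)

if __name__ == "__main__":
    # Toy criterion: pretend features 2, 5, 7 are the useful ones.
    toy_J = lambda S: len(set(S) & {2, 5, 7}) - 0.01 * len(S)
    print(sfs(n_features=10, J=toy_J, subset_size=3))     # -> [2, 5, 7]
```

Because features are only ever appended, the nesting problem noted above is visible directly in the loop.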

  12. Sequential Backward Selection • Algorithm: 1. Start with the full set Y0 = X 2. Remove the worst feature x- 3. Update Yk+1 = Yk - x-; k = k + 1 4. Go to step 2 • Sequential Backward Selection (also called sequential backward elimination) works in the opposite direction to SFS. • SBS works best when the optimal feature subset is large, since SBS spends most of its time visiting large subsets. • The main limitation of SBS is its inability to re-evaluate the usefulness of a feature after it has been discarded.
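A minimal sketch of SBS, mirroring the SFS sketch above: start from the full set and repeatedly drop the feature whose removal hurts the criterion J the least; the toy J is illustrative.

```python
# Sequential Backward Selection skeleton with an abstract criterion J(subset).
def sbs(n_features, J, subset_size):
    Y = list(range(n_features))                   # step 1: Y0 = X (full set)
    while len(Y) > subset_size:
        # step 2: x- = the feature whose removal leaves the best J
        x_minus = max(Y, key=lambda f: J([g for g in Y if g != f]))
        Y.remove(x_minus)                         # step 3: Y_{k+1} = Y_k - x-
    return Y                                      # a dropped feature is never re-evaluated

if __name__ == "__main__":
    toy_J = lambda S: len(set(S) & {2, 5, 7}) - 0.01 * len(S)
    print(sbs(n_features=10, J=toy_J, subset_size=3))   # -> [2, 5, 7]
```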

  13. Generalized Sequential Forward Selection • Start with the empty set, X = ∅ • Repeatedly add the most significant m-subset of (Y - X), found through exhaustive search • Generalized Sequential Backward Selection • Start with the full set, X = Y • Repeatedly delete the least significant m-subset of X, found through exhaustive search
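A minimal sketch of generalized SFS: at each step, every m-subset of the remaining features is evaluated exhaustively and the best one is added (generalized SBS is the mirror image). The itertools-based enumeration and the toy J are illustrative assumptions.

```python
# Generalized SFS sketch: add the most significant m-subset at each step.
from itertools import combinations

def gsfs(n_features, J, m, subset_size):
    Y = []
    while len(Y) < subset_size:
        remaining = [f for f in range(n_features) if f not in Y]
        # Exhaustive search over all m-subsets of the remaining features.
        best_block = max(combinations(remaining, m),
                         key=lambda block: J(Y + list(block)))
        Y.extend(best_block)
    return Y

if __name__ == "__main__":
    toy_J = lambda S: len(set(S) & {2, 5, 7}) - 0.01 * len(S)
    print(gsfs(n_features=10, J=toy_J, m=2, subset_size=4))
```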

  14. Bidirectional Search (BDS) • BDS is a parallel implementation of SFS and SBS: SFS is performed from the empty set, and SBS is performed from the full set. • To guarantee that SFS and SBS converge to the same solution: features already selected by SFS are not removed by SBS, and features already removed by SBS are not selected by SFS. • Algorithm: Start SFS with YF = ∅ and start SBS with YB = X. Repeatedly: select the best feature x+ and update YF(k+1) = YFk + x+; then remove the worst feature x- and update YB(k+1) = YBk - x-; k = k + 1.
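A minimal sketch of BDS under the two constraints above, so that the forward set grown by SFS and the backward set shrunk by SBS meet at the same subset; the toy J is illustrative.

```python
# Bidirectional Search sketch: SFS grows Y_F while SBS shrinks Y_B, under the
# constraints that SFS only picks from Y_B and SBS never removes members of Y_F.
def bds(n_features, J):
    Y_F = []                                  # forward set (grown by SFS)
    Y_B = list(range(n_features))             # backward set (shrunk by SBS)
    while len(Y_F) < len(Y_B):
        # SFS step: add the best feature not yet removed by SBS.
        candidates = [f for f in Y_B if f not in Y_F]
        x_plus = max(candidates, key=lambda f: J(Y_F + [f]))
        Y_F.append(x_plus)
        if len(Y_F) == len(Y_B):
            break
        # SBS step: remove the worst feature that SFS has not already selected.
        removable = [f for f in Y_B if f not in Y_F]
        x_minus = max(removable, key=lambda f: J([g for g in Y_B if g != f]))
        Y_B.remove(x_minus)
    return Y_F                                # Y_F == Y_B at convergence

if __name__ == "__main__":
    toy_J = lambda S: len(set(S) & {2, 5, 7}) - 0.01 * len(S)
    print(bds(n_features=10, J=toy_J))
```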

  15. Sequential Floating Selection (SFFS & SFBS) • There are two floating methods: • Sequential floating forward selection (SFFS) starts from the empty set; after each forward step, SFFS performs backward steps as long as the objective function increases. • Sequential floating backward selection (SFBS) starts from the full set; after each backward step, SFBS performs forward steps as long as the objective function increases. • SFFS algorithm (J is the criterion function): 1. Y0 = ∅ 2. Select the best feature x+ and update Yk+1 = Yk + x+; k = k + 1 3. Select the worst feature x- in Yk 4. If J(Yk - x-) > J(Yk), then Yk+1 = Yk - x-; k = k + 1 and go to step 3; else go to step 2. • Some book-keeping is needed to avoid an infinite loop.
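A minimal sketch of SFFS with an abstract criterion J. The book-keeping mentioned in the slide is handled here by remembering the best J seen at each subset size, which is one simple way to avoid the add/remove loop; the toy J is illustrative.

```python
# Sequential Floating Forward Selection sketch: after each forward step,
# take backward steps as long as dropping a feature strictly improves on the
# best subset previously seen at that size.
def sffs(n_features, J, subset_size):
    Y, best = [], {}                              # best[k] = best J seen at size k
    while len(Y) < subset_size:
        # Forward step: add the best remaining feature.
        remaining = [f for f in range(n_features) if f not in Y]
        x_plus = max(remaining, key=lambda f: J(Y + [f]))
        Y.append(x_plus)
        best[len(Y)] = max(best.get(len(Y), float("-inf")), J(Y))
        # Conditional backward steps while they improve the smaller subset.
        while len(Y) > 2:
            x_minus = max(Y, key=lambda f: J([g for g in Y if g != f]))
            reduced = [g for g in Y if g != x_minus]
            if J(reduced) > best.get(len(reduced), float("-inf")):
                Y = reduced
                best[len(Y)] = J(Y)
            else:
                break
    return Y

if __name__ == "__main__":
    toy_J = lambda S: len(set(S) & {2, 5, 7}) - 0.01 * len(S)
    print(sffs(n_features=10, J=toy_J, subset_size=3))
```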

  16. Genetic Algorithm Feature Selection • In a GA approach, a given feature subset is represented as a binary string, a “chromosome” of length n, with a zero or one in position i denoting the absence or presence of feature i in the set (n = total number of available features). • A population of chromosomes is maintained. • Each chromosome is evaluated with an evaluation function to determine its “fitness”, which determines how likely the chromosome is to survive and breed into the next generation. • New chromosomes are created from old chromosomes by two processes: • Crossover, where parts of two different parent chromosomes are mixed to create offspring • Mutation, where the bits of a single parent are randomly perturbed to create a child • Choosing an appropriate evaluation function is an essential step for successful application of GAs to any problem domain.
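A minimal sketch of GA-based feature selection with binary chromosomes, single-point crossover, bit-flip mutation, and truncation selection; the toy fitness function, population size, and rates are illustrative assumptions, not settings from the original slides.

```python
# Genetic-algorithm feature selection sketch: chromosomes are binary strings
# of length n; fitness evaluates the encoded subset (toy fitness here).
import random

random.seed(0)
N_FEATURES, POP_SIZE, N_GENERATIONS, MUTATION_RATE = 10, 20, 30, 0.05

def fitness(chrom):
    # Toy evaluation: reward including features 2, 5, 7; penalize subset size.
    subset = [i for i, bit in enumerate(chrom) if bit]
    return len(set(subset) & {2, 5, 7}) - 0.01 * len(subset)

def crossover(a, b):
    cut = random.randint(1, N_FEATURES - 1)     # single-point crossover
    return a[:cut] + b[cut:]

def mutate(chrom):
    return [1 - bit if random.random() < MUTATION_RATE else bit for bit in chrom]

population = [[random.randint(0, 1) for _ in range(N_FEATURES)]
              for _ in range(POP_SIZE)]
for _ in range(N_GENERATIONS):
    # Truncation selection: only the fitter half of the population breeds.
    parents = sorted(population, key=fitness, reverse=True)[:POP_SIZE // 2]
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print("best subset found:", [i for i, bit in enumerate(best) if bit])
```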

  17. Minimum Redundancy Maximum Relevance Feature Selection • This approach is based on recognizing that combinations of individually good variables do not necessarily lead to good classification. • To maximize the joint dependency of the top-ranking variables on the target variable, the redundancy among them must be reduced, so we select maximally relevant variables while avoiding redundant ones. • First, the mutual information (MI) between the candidate variable and the target variable is calculated (the relevance term). • Then the average MI between the candidate variable and the variables that are already selected is computed (the redundancy term). • The entropy-based mRMR score (the higher it is for a feature, the more that feature is needed) is obtained by subtracting the redundancy from the relevance. • Both relevance and redundancy estimation are low-dimensional problems (each involves only two variables). This is much easier than directly estimating multivariate density or mutual information in the high-dimensional space. • The method only measures the quantity of redundancy between the candidate variables and the selected variables; it does not deal with the type of this redundancy.
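A minimal sketch of greedy mRMR selection on discretized data: relevance is the MI between a candidate feature and the target, redundancy is its average MI with the already-selected features, and the candidate with the largest (relevance - redundancy) score is added. Scikit-learn's mutual_info_score and the synthetic discrete data are assumptions, not part of the original slides.

```python
# mRMR sketch on discrete data: greedily add the feature maximizing
# relevance minus average redundancy with the already-selected features.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(500, 8))          # 8 discrete features
y = (X[:, 0] + X[:, 3] > 3).astype(int)        # target depends on features 0 and 3
X = np.column_stack([X, X[:, 0]])              # feature 8 duplicates feature 0 (redundant)

n = X.shape[1]
# Start with the single most relevant feature.
selected = [int(np.argmax([mutual_info_score(X[:, j], y) for j in range(n)]))]
while len(selected) < 3:
    scores = {}
    for j in range(n):
        if j in selected:
            continue
        relevance = mutual_info_score(X[:, j], y)
        redundancy = np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
        scores[j] = relevance - redundancy     # mRMR score
    selected.append(max(scores, key=scores.get))

print("mRMR-selected features:", selected)     # the duplicated feature is penalized
```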

  18. References • L. Ladha, "Feature Selection Methods and Algorithms," Department of Computer Science, Sri Ramakrishna College of Arts and Science for Women, Coimbatore, Tamil Nadu, India. • A. Jain and D. Zongker, "Feature Selection: Evaluation, Application, and Small Sample Performance," Michigan State University, USA. • O. Kursun, C. O. Sakar, and O. Favorov, "Using Covariates for Improving the Minimum Redundancy Maximum Relevance Feature Selection Method."
