
Near-Optimal Scalable Feature Selection


Presentation Transcript


  1. Near-Optimal Scalable Feature Selection Siggi Olafsson and Jaekyung Yang Iowa State University INFORMS Annual Conference October 24, 2004

  2. Feature Selection • Eliminate redundant/irrelevant features • Reduced dimensionality • Potential benefits: • Simpler models • Faster induction • More accurate prediction/classification • Knowledge gained about which features are important

  3. Measuring Feature Quality • Find the subset F of features that maximizes some objective, e.g., • Correlation measures (filter) • Accuracy of a classification model (wrapper) • Information gain, gain ratio, etc. (filter) • No single measure always works best
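
To make the filter/wrapper distinction concrete, here is a minimal sketch of two subset-scoring functions. It assumes scikit-learn and NumPy; the function names, the use of mutual information as the filter measure, and the decision tree as the wrapped classifier are illustrative choices, not part of the original talk.

```python
# Hypothetical sketch of the two kinds of subset-quality measures on slide 3:
# a filter score (no classifier involved) and a wrapper score (classifier accuracy).
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def filter_score(X, y, mask):
    """Filter objective: sum of per-feature mutual information with the class."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0
    return mutual_info_classif(X[:, cols], y, random_state=0).sum()

def wrapper_score(X, y, mask, cv=5):
    """Wrapper objective: cross-validated accuracy of a classifier on the subset."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[:, cols], y, cv=cv).mean()
```

Either function can then serve as the objective f(F) maximized by the search methods on the following slides.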

  4. Optimization Approach • Combinatorial optimization problem • Feasible region is {0,1}^m, where m is the number of features • NP-hard • Previous optimization methods applied: • Branch-and-bound • Genetic algorithms & evolutionary search • Single-pass heuristics • Has also been formulated as a mathematical programming problem
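
As a simple baseline in the spirit of the heuristics listed above, a greedy sequential forward search over the {0,1}^m feasible region can be sketched as follows. This is not the NP method, and all names are illustrative; it accepts any subset-scoring function, such as the hypothetical wrapper_score above.

```python
# Greedy sequential forward search over the {0,1}^m space of feature masks.
# `score(X, y, mask)` is any subset objective, e.g. the wrapper sketch above.
import numpy as np

def greedy_forward_selection(X, y, score):
    m = X.shape[1]
    mask = np.zeros(m, dtype=int)              # start with the empty subset
    best = score(X, y, mask)
    improved = True
    while improved:
        improved = False
        best_j = None
        for j in np.flatnonzero(mask == 0):    # try adding each unselected feature
            trial = mask.copy()
            trial[j] = 1
            s = score(X, y, trial)
            if s > best:
                best, best_j, improved = s, j, True
        if improved:
            mask[best_j] = 1                   # keep the single best addition
    return mask, best
```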

  5. New Approach: NP Method • Nested Partitions (NP) method: • Developed for simulation optimization • Particularly effective for large-scale combinatorial type optimization problems • Accounts for noisy performance measures

  6. NP Method • Maintains a subset called the most promising region • Partitioning • Most promising region partitioned into subsets • Remaining feasible solutions aggregated • Random Sampling • Random sample of solutions from each subset • Used to select the next most promising region

  7. Partitioning Tree • [Tree diagram] The root node represents all subsets; each level splits on whether a feature is included (a1 included / a1 not included, then a2, then a3, ...); the current most promising region is one node of this tree, and the search either moves to its best subregion or backtracks to the previous region
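
The following is a simplified sketch of this search loop for feature selection, under one reading of slides 6-7: regions are partial assignments over a fixed feature ordering, masks are sampled uniformly within each subregion and within the surrounding region, and the search moves or backtracks based on the best sampled value. All names, the uniform sampling, and the one-level backtracking are illustrative assumptions, not the authors' implementation.

```python
# Simplified Nested Partitions loop for feature selection (illustrative sketch).
# A region is a partial assignment `fixed` over the first len(fixed) features
# of a chosen ordering.
import numpy as np

rng = np.random.default_rng(0)

def sample_in_region(fixed, m):
    """Draw a full 0/1 mask consistent with the partial assignment `fixed`."""
    mask = rng.integers(0, 2, size=m)
    mask[:len(fixed)] = fixed
    return mask

def sample_in_surrounding(fixed, m):
    """Draw a mask outside the current region by flipping one fixed decision."""
    mask = rng.integers(0, 2, size=m)
    mask[:len(fixed)] = fixed
    j = rng.integers(0, len(fixed))
    mask[j] = 1 - mask[j]
    return mask

def np_feature_selection(X, y, score, n_samples=10, max_iters=50):
    m = X.shape[1]
    fixed = []                                   # current most promising region
    best_val = -np.inf
    for _ in range(max_iters):
        if len(fixed) == m:                      # maximum depth: a single subset remains
            break
        regions = {0: fixed + [0], 1: fixed + [1]}        # two subregions
        region_val = {}
        for key, part in regions.items():
            vals = [score(X, y, sample_in_region(part, m)) for _ in range(n_samples)]
            region_val[key] = max(vals)
        if fixed:                                # aggregated surrounding region
            vals = [score(X, y, sample_in_surrounding(fixed, m)) for _ in range(n_samples)]
            region_val['surround'] = max(vals)
        winner = max(region_val, key=region_val.get)
        if winner == 'surround':
            fixed = fixed[:-1]                   # backtrack to the previous region
        else:
            fixed = regions[winner]              # move to the best subregion
        best_val = max(best_val, region_val[winner])
    return fixed, best_val
```

The partitioning order (which feature is fixed at each depth) is exactly what slide 8 addresses next.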

  8. Intelligent Partitioning • For NP in general • Partitioning imposes a structure on the search space • Done well, the algorithm converges quickly • For NP for feature selection • Partitioning is defined by the order of the features • Select the most important feature first, etc. • E.g., rank according to the information gain of the features (entropy partitioning)
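
A minimal sketch of this ranking step, computing information gain per feature on discrete data and returning the partitioning order; the names are illustrative, and continuous features would need to be discretized first.

```python
# Entropy-based partitioning order (slide 8): rank features by information gain
# and fix the highest-gain features first in the partitioning tree.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(x, y):
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature column x."""
    h_y = entropy(y)
    h_y_given_x = 0.0
    for v in np.unique(x):
        idx = (x == v)
        h_y_given_x += idx.mean() * entropy(y[idx])
    return h_y - h_y_given_x

def partitioning_order(X, y):
    """Feature indices sorted by decreasing information gain."""
    gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(-gains)
```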

  9. Test Data Sets • Test data sets from UCI Repository

  10. How Well Does it Work? • Comparison between NP and another well-known heuristic, namely a genetic algorithm (GA)

  11. How Close to Optimal? • So far, this is a heuristic random search with no performance guarantee • However, the Two-Stage Nested Partitions (TSNP) method can be shown to obtain near-optimal solutions with high probability • Assure that the 'correct choice' is made with probability at least ψ each time • Correct choice means within an indifference zone δ of the optimal performance

  12. Two-Stage Sampling • Instead of taking a fixed number of samples from each subregion, use statistical selection, e.g. Rinott's procedure: N_j(k) = max{ n0, ⌈ h² · S_j²(k) / δ² ⌉ }, where N_j(k) is the number of samples needed from the j-th region in iteration k, n0 is the number of sample points in the 1st phase, S_j²(k) is the sample variance estimated from the 1st phase, δ is the indifference zone, and h is a constant determined by the desired probability ψ of selecting the correct region
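
A sketch of this second-stage sample-size rule in code. It assumes the standard Rinott form shown above; the constant h (which depends on ψ, n0, and the number of regions compared) would come from Rinott's tables or a numerical integral and is simply passed in here.

```python
# Rinott-style second-stage sample size for region j (illustrative sketch).
# n0: first-stage sample points; s2_j: first-stage sample variance of region j;
# delta: indifference zone; h: constant for the desired probability psi.
import math

def second_stage_size(n0, s2_j, delta, h):
    return max(n0, math.ceil(h * h * s2_j / (delta * delta)))

# Example: with n0 = 10 first-stage points, sample variance 0.04,
# indifference zone 0.02, and h = 3.0, roughly 900 samples are prescribed.
n_j = second_stage_size(n0=10, s2_j=0.04, delta=0.02, h=3.0)
```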

  13. Performance Guarantee • When the maximum depth of the partitioning tree is reached, the selected feature subset is within the indifference zone δ of the optimal performance with a probability that can be bounded below in terms of the per-iteration selection probability ψ

  14. Scalability • The NP and TSNP were originally conceived for simulation optimization • Can handle noisy performance • More samples prescribed in noisy regions • Incorrect moves are corrected through the backtracking element (both NP and TSNP) • Can we use a (small) subset of instances instead of all instances? • This is a common approach to increase scalability of data mining algorithms, but is it worthwhile here?
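
One way to read the instance-sampling idea: each candidate feature subset is evaluated on a random fraction R of the training instances rather than on all of them, trading extra noise for speed. A minimal sketch, again assuming scikit-learn; the fraction R, the row-sampling scheme, and the names are illustrative.

```python
# Evaluate a feature subset on a random fraction R of the instances (slide 14).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def sampled_wrapper_score(X, y, mask, R=0.2, cv=5):
    """Cross-validated accuracy of the subset, estimated on a fraction R of instances."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0
    n = X.shape[0]
    n_rows = max(cv * 2, int(R * n))          # keep at least a couple of rows per fold
    rows = rng.choice(n, size=n_rows, replace=False)
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[rows][:, cols], y[rows], cv=cv).mean()
```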

  15. Numerical Results: Original NP

  16. Observations • Using a random sample of instances can improve performance considerably • Evaluation of each sampled feature subset becomes faster • A very small sample degrades performance • There is now too much noise and the method backtracks excessively → more steps • The TSNP would prescribe more samples! • The expected number of steps is constant • What is the best fraction R of instances to use in the TSNP?

  17. Optimal Sample for TSNP • If we decrease the sample size, then the computation for each sample point decreases • However, the sample variance increases and more sample points will be needed • To find an approximate R*, we thus minimize the per-step effort: the number of sample points needed in each step times the computation time given that number of sample points
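
This trade-off can be illustrated numerically under assumed functional forms: the Rinott-style rule from slide 12 for the number of sample points, a sample variance that grows as R shrinks, and an evaluation time with a fixed overhead plus a part proportional to R. The forms and constants below are illustrative assumptions, not the derivation in the paper.

```python
# Illustrative numerical version of the trade-off on slide 17: choose the instance
# sampling ratio R that minimizes (samples needed per step) x (time per sample).
import math
import numpy as np

def samples_needed(R, h=3.0, delta=0.02, v0=0.01, v1=0.002, n0=10):
    s2 = v0 + v1 / R                            # assumed: variance grows as R shrinks
    return max(n0, math.ceil(h * h * s2 / (delta * delta)))   # Rinott-style rule

def cost_per_step(R, t0=0.2, t1=1.0):
    return samples_needed(R) * (t0 + t1 * R)    # assumed: per-sample time ~ t0 + t1*R

grid = np.linspace(0.05, 1.0, 96)
R_star = min(grid, key=cost_per_step)           # crude grid search for the best ratio
```

Slide 19 replaces this grid search with a closed-form expression for R* in terms of constants estimated from the data.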

  18. Approximating the Variance • Now it can be shown that the sample variance is well approximated by a simple function of the sampling ratio R, with constants that can be estimated from the data

  19. Optimal Sampling Ratio • The optimal sampling ratio R* now follows from this approximation • The constants c0, c1, c2 are estimated from the data, and h, l and δ are determined by user preferences

  20. Numerical Results * Statistically better than TSNP w/sampling

  21. Conclusions • Feature selection is integral to data mining • Inherently a combinatorial optimization problem • From a scalability standpoint it is desirable to be able to deal with noisy data • Nested partitions method: • Flexible performance guarantees • Allows for effective use of random sampling • Very good performance on test problems

  22. References • Full papers available: • S. Ólafsson and J. Yang (2004). “Intelligent Partitioning for Feature Selection,” INFORMS Journal on Computing, in print. • S. Ólafsson (2004). “Two-Stage Nested Partitions Method for Stochastic Optimization,” Methodology and Computing in Applied Probability, 6, 5-27. • J. Yang and S. Ólafsson (2004). “Optimization-Based Feature Selection with Adaptive Instance Sampling,” Computers and Operations Research, to appear.
