Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution

Presented by Jingting Zeng, 11/26/2007

Outline • Introduction to Feature Selection • Feature Selection Models • Fast Correlation-Based Filter (FCBF) Algorithm • Experiment • Discussion • Reference

Introduction to Feature Selection • Definition • A process that chooses an optimal subset of features according to an objective function • Objectives • To reduce dimensionality and remove noise • To improve mining performance • Speed of learning • Predictive accuracy • Simplicity and comprehensibility of mined results

An Example of an Optimal Subset • Data set (whole set) • Five Boolean features F1, ..., F5 • C = F1 ∨ F2 • F3 = ¬F2, F5 = ¬F4 • Optimal subset: • {F1, F2} or {F1, F3}
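The example above can be checked by brute force. The following is a minimal sketch, not part of the original slides: it enumerates every assignment of the free features and tests whether a candidate subset determines the class. The helper name `consistent` and the 0-based column indexing are assumptions made for illustration.

```python
from itertools import product

def consistent(subset, rows):
    """Return True if the chosen feature columns determine the class in every row."""
    seen = {}
    for features, label in rows:
        key = tuple(features[i] for i in subset)
        # The same feature values must always map to the same class label.
        if seen.setdefault(key, label) != label:
            return False
    return True

# Enumerate all assignments of the free features F1, F2, F4.
# F3 and F5 are derived as negations, and C = F1 or F2.
rows = []
for f1, f2, f4 in product([False, True], repeat=3):
    features = (f1, f2, not f2, f4, not f4)
    rows.append((features, f1 or f2))

# 0-based column indices: F1 -> 0, F2 -> 1, F3 -> 2, F4 -> 3, F5 -> 4
print(consistent((0, 1), rows))  # {F1, F2}
print(consistent((0, 2), rows))  # {F1, F3}
print(consistent((3, 4), rows))  # {F4, F5}
print(consistent((0,), rows))    # {F1} alone
```

Both {F1, F2} and {F1, F3} determine C because F3 carries the same information as F2, while F4 and F5 carry none, and F1 alone is not sufficient.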

Models of Feature Selection • Filter model • Separating feature selection from classifier learning • Relying on general characteristics of data (information, distance, dependence, consistency) • No bias toward any learning algorithm, fast • Wrapper model • Relying on a predetermined classification algorithm • Using predictive accuracy as goodness measure • High accuracy, computationally expensive

Filter Model

Wrapper Model

Two Aspects of Feature Selection • How to decide whether a feature is relevant to the class or not • How to decide whether such a relevant feature is redundant or not when compared with other features

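Linear Correlation Coefficient • For a pair of variables (x, y): $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$ • However, it may not be able to capture non-linear correlations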
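To illustrate the limitation noted on this slide, here is a small sketch (an illustration added here, not taken from the slides) that computes the linear correlation coefficient for a linear and a quadratic relationship. The function name `pearson_r` and the sample data are assumptions.

```python
import math

def pearson_r(xs, ys):
    """Linear (Pearson) correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # y depends linearly on x
quadratic = [x * x for x in xs]    # y is fully determined by x, but not linearly

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)
print(r_linear, r_quadratic)
```

For the symmetric quadratic data the coefficient is zero even though y is a deterministic function of x, which is the motivation for moving to entropy-based measures.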

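Information Measures • Entropy of variable X: $H(X) = -\sum_i P(x_i) \log_2 P(x_i)$ • Entropy of X after observing Y: $H(X|Y) = -\sum_j P(y_j) \sum_i P(x_i|y_j) \log_2 P(x_i|y_j)$ • Information Gain: $IG(X|Y) = H(X) - H(X|Y)$ • Symmetrical Uncertainty: $SU(X,Y) = 2\left[\frac{IG(X|Y)}{H(X) + H(Y)}\right]$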
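The four measures on this slide can be sketched for discrete variables as follows. This is an illustrative implementation under stated assumptions: probabilities are estimated from empirical frequencies, logarithms are base 2, and symmetrical uncertainty is taken as twice the information gain divided by the sum of the two entropies. The function names and the zero-entropy convention are choices made here, not something the slides specify.

```python
import math
from collections import Counter

def entropy(xs):
    """H(X), estimated from the empirical frequencies of a discrete sequence."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """H(X|Y): entropy of X within each group of equal Y, weighted by P(y)."""
    n = len(xs)
    total = 0.0
    for y_value, y_count in Counter(ys).items():
        x_given_y = [x for x, y in zip(xs, ys) if y == y_value]
        total += (y_count / n) * entropy(x_given_y)
    return total

def information_gain(xs, ys):
    """IG(X|Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), in the range [0, 1]."""
    denom = entropy(xs) + entropy(ys)
    if denom == 0:
        return 0.0  # assumption: two constant variables are treated as uncorrelated
    return 2.0 * information_gain(xs, ys) / denom

x = [0, 0, 1, 1]
negated = [1, 1, 0, 0]      # fully determined by x
independent = [0, 1, 0, 1]  # carries no information about x

su_negated = symmetrical_uncertainty(x, negated)
su_independent = symmetrical_uncertainty(x, independent)
print(su_negated, su_independent)
```

Unlike the linear correlation coefficient, the measure reaches its maximum for any one-to-one relationship between the two variables and drops to zero when they are independent.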

Fast Correlation-Based Filter (FCBF) Algorithm • How to decide whether a feature is relevant to the class C or not • Find a subset $S'$ such that $SU_{i,c} \ge \delta$ for every $F_i \in S'$, $1 \le i \le N$, where $\delta$ is a relevance threshold • How to decide whether such a relevant feature is redundant • Use the correlation between features and the class as a reference

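Definitions • Predominant Correlation • The correlation between a feature $F_i$ and the class C is predominant iff $SU_{i,c} \ge \delta$ and there is no $F_j \in S'$ ($j \ne i$) such that $SU_{j,i} \ge SU_{i,c}$ • Redundant peer (RP) • If there is an $F_j$ with $SU_{j,i} \ge SU_{i,c}$, $F_j$ is a RP of $F_i$ • Use $S_{P_i}$ to denote the set of RPs for $F_i$, split into $S_{P_i}^{+} = \{F_j \in S_{P_i} : SU_{j,c} > SU_{i,c}\}$ and $S_{P_i}^{-} = \{F_j \in S_{P_i} : SU_{j,c} \le SU_{i,c}\}$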

Three Heuristics • If $S_{P_i}^{+} = \emptyset$, treat $F_i$ as a predominant feature, remove all features in $S_{P_i}^{-}$, and skip identifying redundant peers for them • If $S_{P_i}^{+} \ne \emptyset$, process all the features in $S_{P_i}^{+}$ first. If none of them becomes predominant, follow the first heuristic • The feature with the largest $SU_{i,c}$ value is always a predominant feature and can be a starting point to remove other features.

FCBF Algorithm • Time complexity: O(N)

FCBF Algorithm (cont.) • Time complexity: O(N log N)
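The algorithm listing from these two slides did not survive extraction. The following is a simplified sketch of the two phases described earlier in the deck (relevance filtering by a threshold, then removal of redundant peers starting from the most relevant feature). It is an illustration under stated assumptions, not the authors' exact pseudocode: the names `fcbf`, `su`, and `delta` are chosen here, features are assumed discrete, and plain Python lists are used, so the stated complexity bounds are not reproduced.

```python
import math
from collections import Counter
from itertools import product

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def su(xs, ys):
    """Symmetrical uncertainty between two discrete sequences."""
    n = len(xs)
    h_x_given_y = 0.0
    for y_value, y_count in Counter(ys).items():
        group = [x for x, y in zip(xs, ys) if y == y_value]
        h_x_given_y += (y_count / n) * entropy(group)
    denom = entropy(xs) + entropy(ys)
    if denom == 0:
        return 0.0  # assumption: constant variables are treated as uncorrelated
    return 2.0 * (entropy(xs) - h_x_given_y) / denom

def fcbf(features, labels, delta):
    """features: dict of name -> column; returns the names of retained features."""
    # Phase 1: keep features whose correlation with the class reaches delta,
    # ordered by decreasing relevance.
    relevance = {name: su(col, labels) for name, col in features.items()}
    ranked = sorted((n for n in features if relevance[n] >= delta),
                    key=lambda n: relevance[n], reverse=True)
    # Phase 2: the first remaining feature is predominant; drop every later
    # feature that correlates with it at least as strongly as with the class.
    selected = []
    while ranked:
        fp = ranked.pop(0)
        selected.append(fp)
        ranked = [fq for fq in ranked
                  if su(features[fp], features[fq]) < relevance[fq]]
    return selected

# Data from the earlier optimal-subset example: C = F1 or F2, F3 = not F2, F5 = not F4.
rows = list(product([0, 1], repeat=3))
features = {
    "F1": [f1 for f1, f2, f4 in rows],
    "F2": [f2 for f1, f2, f4 in rows],
    "F3": [1 - f2 for f1, f2, f4 in rows],
    "F4": [f4 for f1, f2, f4 in rows],
    "F5": [1 - f4 for f1, f2, f4 in rows],
}
labels = [int(f1 or f2) for f1, f2, f4 in rows]

result = fcbf(features, labels, delta=0.1)
print(result)
```

On this toy data, F4 and F5 fall below the threshold in the first phase, and the second phase keeps F1 together with one member of the redundant pair F2/F3. The threshold value 0.1 is arbitrary and chosen only for the example.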

Experiments • FCBF is compared with ReliefF, CorrSF and ConsSF • Summary of the 10 data sets

Results

Results (cont.)

Pros and Cons • Advantages • Very fast • Selects fewer features with higher accuracy • Disadvantage • Cannot detect some features • Example: with 4 features generated by 4 Gaussian functions plus 4 additional redundant features, FCBF selected only 3 features

Discussion • FCBF compares only individual features with each other • Possible extension: use PCA first to capture groups of features, then apply FCBF to the result

References • L. Yu and H. Liu. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. In Proc. 20th Int. Conf. on Machine Learning (ICML-03), pages 856–863, 2003. • J. Biesiada and W. Duch. Feature Selection for High-Dimensional Data: A Kolmogorov-Smirnov Correlation-Based Filter Solution. In CORES'05, Advances in Soft Computing, Springer Verlag, pages 95–104, 2005. • www.cse.msu.edu/~ptan/SDM07/Yu-Ye-Liu.pdf • www1.cs.columbia.edu/~jebara/6772/proj/Keith.ppt

Thank you! Q and A
