# Feature Selection
1. Feature Selection

2. Goals • What is feature selection for classification? • Why is feature selection important? • What is the filter and what is the wrapper approach to feature selection? • Examples

3. What is Feature Selection for classification? • Given: a set of predictors (“features”) V and a target variable T • Find: minimum set F that achieves maximum classification performance of T (for a given set of classifiers and classification performance metrics)

4. Why is feature selection important? • May improve performance of the classification algorithm • The classification algorithm may not scale up to the size of the full feature set, in either sample size or time • Allows us to better understand the domain • Cheaper to collect a reduced set of predictors • Safer to collect a reduced set of predictors

5. Filters vs Wrappers: Wrappers Say we have predictors A, B, C and classifier M. We want to predict T given the smallest possible subset of {A,B,C}, while achieving maximal performance (accuracy):

| FEATURE SET | CLASSIFIER | PERFORMANCE |
|---|---|---|
| {A,B,C} | M | 98% |
| {A,B} | M | 98% |
| {A,C} | M | 77% |
| {B,C} | M | 56% |
| {A} | M | 89% |
| {B} | M | 90% |
| {C} | M | 91% |
| {} | M | 85% |

6. Filters vs Wrappers: Wrappers The set of all subsets is the power set and its size is 2^|V|. Hence for large V we cannot do this procedure exhaustively; instead we rely on heuristic search of the space of all possible feature subsets. [Figure: the lattice of subsets of {A,B,C}, from {} (85%) to {A,B,C} (98%), with each subset labeled by its classifier performance.]
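The exhaustive procedure over the subset table can be sketched in a few lines of Python. The subset performances below are the hypothetical values from the slide's table, not a trained classifier:

```python
from itertools import chain, combinations

# Subset performances from the slide's table (hypothetical classifier M)
perf = {frozenset(): 0.85, frozenset("A"): 0.89, frozenset("B"): 0.90,
        frozenset("C"): 0.91, frozenset("AB"): 0.98, frozenset("AC"): 0.77,
        frozenset("BC"): 0.56, frozenset("ABC"): 0.98}

def power_set(features):
    """All 2^|V| subsets of the feature set."""
    return chain.from_iterable(combinations(features, k)
                               for k in range(len(features) + 1))

def exhaustive_wrapper(features, perf):
    """Evaluate every subset; return the smallest one with maximal performance."""
    subsets = [frozenset(s) for s in power_set(features)]
    best = max(perf[s] for s in subsets)
    return min((s for s in subsets if perf[s] == best), key=len)
```

On the table above this returns {A,B}: tied with {A,B,C} at 98% but smaller. The point of the slide stands: with |V| features this loop evaluates 2^|V| subsets, which is why heuristic search is needed for large V.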

7. Filters vs Wrappers: Wrappers A common example of heuristic search is hill climbing: keep adding features one at a time until no further improvement can be achieved. [Figure: a hill-climbing path through the subset lattice, starting at {} (85%) and ending at {A,B} (98%).]

8. Filters vs Wrappers: Wrappers • A common example of heuristic search is hill climbing: keep adding features one at a time until no further improvement can be achieved (“forward greedy wrapping”) • Alternatively we can start with the full set of predictors and keep removing features one at a time until no further improvement can be achieved (“backward greedy wrapping”) • A third alternative is to interleave the two phases (adding and removing), starting either forward or backward (“forward-backward wrapping”) • Of course other forms of search can be used, most notably: • Exhaustive search • Genetic Algorithms • Branch-and-Bound (e.g., cost = # of features; the goal is to reach a performance threshold or better)
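Forward greedy wrapping can be sketched as follows, reusing the hypothetical subset performances from the earlier table. Note that on this table the greedy search halts at {C} (91%) and never reaches the optimal {A,B} (98%), which illustrates the local-optimum risk of hill climbing:

```python
# Subset performances from the slide's table (hypothetical classifier M)
perf = {frozenset(): 0.85, frozenset("A"): 0.89, frozenset("B"): 0.90,
        frozenset("C"): 0.91, frozenset("AB"): 0.98, frozenset("AC"): 0.77,
        frozenset("BC"): 0.56, frozenset("ABC"): 0.98}

def forward_greedy(features, perf):
    """Forward greedy wrapping: repeatedly add the single feature that most
    improves performance; stop when no addition helps."""
    current = frozenset()
    improved = True
    while improved:
        improved = False
        candidates = [current | {f} for f in features if f not in current]
        if candidates:
            best = max(candidates, key=perf.get)
            if perf[best] > perf[current]:
                current, improved = best, True
    return current
```

Here the first step picks {C} (91% beats {A} and {B}), but both two-feature extensions of {C} score worse, so the search stops short of {A,B}.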

9. Filters vs Wrappers: Filters In the filter approach we do not rely on running a particular classifier and searching in the space of feature subsets; instead we select features on the basis of statistical properties. A classic example is univariate associations:

| FEATURE | ASSOCIATION WITH TARGET |
|---|---|
| {A} | 89% |
| {B} | 90% |
| {C} | 91% |

Depending on where the selection threshold is placed on these associations, the resulting feature set may be optimal or suboptimal.

10. Example Feature Selection Methods in Biomedicine: Univariate Association Filtering • Order all predictors according to strength of association with the target • Choose the first k predictors and feed them to the classifier • Various measures of association may be used: χ², G², Pearson r, Fisher Criterion Scoring, etc. • How to choose k? • What if we have too many variables?
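A minimal sketch of univariate association filtering with Pearson r as the association measure; the feature names and data used with it are illustrative:

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def univariate_filter(X, y, k):
    """Order predictors by |r| with the target and keep the first k.
    X maps feature name -> column of values; y is the target column."""
    scores = {name: abs(pearson_r(col, y)) for name, col in X.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Any of the association measures on the slide (χ², G², Fisher criterion) could be dropped in place of `pearson_r`; the slide's open questions about choosing k apply to all of them.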

11. Example Feature Selection Methods in Biomedicine: Recursive Feature Elimination • Filter algorithm where feature selection is done as follows: 1. build a linear Support Vector Machine classifier using all V features 2. compute the weights of all features and keep the best V/2 3. repeat until 1 feature is left 4. choose the feature subset that gives the best performance (using cross-validation)
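The halving loop can be sketched as below; `fit_weights` and `evaluate` are hypothetical stand-ins for the linear-SVM training and cross-validated scoring steps described on the slide:

```python
def rfe(features, fit_weights, evaluate):
    """Recursive Feature Elimination skeleton: score the current subset,
    keep the best half by |weight|, repeat down to one feature, and return
    the best-scoring subset seen along the way."""
    subset, history = list(features), []
    while subset:
        history.append((evaluate(subset), list(subset)))
        if len(subset) == 1:
            break
        w = fit_weights(subset)  # feature -> linear-model weight
        subset = sorted(subset, key=lambda f: abs(w[f]),
                        reverse=True)[:len(subset) // 2]
    return max(history)[1]
```

Because the subset is halved each round, a run over V features trains only about log2(V) classifiers, which is what makes RFE practical on the 12,600-gene arrays discussed later in this deck.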

12. Example Feature Selection Methods in Bioinformatics: GA/KNN Wrapper approach whereby: • heuristic search=Genetic Algorithm, and • classifier=KNN

13. How do we approach the feature selection problem in our research? • Find the Markov Blanket • Why?

14. A fundamental property of the Markov Blanket • MB(T) is the minimal set of predictor variables needed for classification (diagnosis, prognosis, etc.) of the target variable T (given a powerful enough classifier and calibrated classification) [Figure: a network of many variables V surrounding the target T, with the members of MB(T) highlighted.]

15. HITON: An algorithm for feature selection that combines MB induction with wrapping C.F. Aliferis M.D., Ph.D., I. Tsamardinos Ph.D., A. Statnikov M.S. Department of Biomedical Informatics, Vanderbilt University AMIA Fall Conference, November 2003

16. HITON: An algorithm for feature selection that combines MB induction with wrapping

17. HITON: An algorithm for feature selection that combines MB induction with wrapping • Step #1: Find the parents and children of T; call this set PC(T) • Step #2: Find the PC(.) set of each member of PC(T); take the union of all these sets to be PCunion • Step #3: Run a special test to filter out from PCunion the non-members of MB(T) that can be identified as such (not all can); call the resultant set TMB (tentative MB) • Step #4: Apply heuristic search with a desired classifier/loss function and cross-validation to identify variables that can be dropped from TMB without loss of accuracy

18. HITON (Data D; Target T; Classifier A) “returns a minimal set of variables required for optimal classification of T using algorithm A”
MB(T) = HITON-MB(D, T) // Identify the Markov Blanket
Vars = Wrapper(MB(T), T, A) // Use wrapping to remove unnecessary variables
Return Vars

HITON-MB(Data D, Target T) “returns the Markov Blanket of T”
PC = parents and children of T returned by HITON-PC(D, T)
PCPC = parents and children of the parents and children of T
CurrentMB = PC ∪ PCPC
// Retain only parents of common children and remove false positives
∀ potential spouse X in CurrentMB and ∀ Y in PC: if not ∃ S ⊆ {Y} ∪ V − {T, X} so that (T ⊥ X | S), then retain X in CurrentMB, else remove it
Return CurrentMB

HITON-PC(Data D, Target T) “returns parents and children of T” (see next slide)

Wrapper(Vars, T, A) “returns a minimal set among variables Vars for predicting T using algorithm A and a wrapping approach”
Select and remove a variable. If the internally cross-validated performance of A remains the same, permanently remove the variable. Continue until all variables are considered.
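The Wrapper step at the end of the slide can be sketched as one backward pass over the variables; `evaluate` is a hypothetical stand-in for the internally cross-validated performance of classifier A:

```python
def wrapper(variables, evaluate, tol=0.0):
    """Backward wrapping per the slide: try removing each variable in turn;
    make the removal permanent if cross-validated performance does not drop
    (by more than tol)."""
    current = list(variables)
    for v in list(current):
        reduced = [x for x in current if x != v]
        if reduced and evaluate(reduced) >= evaluate(current) - tol:
            current = reduced
    return current
```

A small `tol` guards against discarding a variable only because of cross-validation noise; the slide's "performance remains the same" corresponds to tol = 0.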

19. HITON-PC(Data D, Target T) “returns parents and children of T”
CurrentPC = {}
Repeat
Find the variable Vi not in CurrentPC that maximizes association(Vi, T) and admit Vi into CurrentPC
If there is a variable X in CurrentPC and a subset S of CurrentPC s.t. (X ⊥ T | S), remove X from CurrentPC; mark X and do not consider it again
Until no more variables are left to consider
Return CurrentPC
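A minimal sketch of the HITON-PC loop above, with `assoc` and `independent` as hypothetical stand-ins for the association measure and the conditional-independence test (in practice the subset search over S is bounded, not exhaustive):

```python
from itertools import combinations

def all_subsets(items):
    """Every subset of items, including the empty set."""
    for k in range(len(items) + 1):
        yield from combinations(items, k)

def hiton_pc(candidates, assoc, independent):
    """HITON-PC sketch: admit variables in order of association with T;
    after each admission, drop any admitted X that some subset S of the
    other admitted variables renders conditionally independent of T."""
    current = []
    for v in sorted(candidates, key=assoc, reverse=True):
        current.append(v)
        for x in list(current):
            others = [u for u in current if u != x]
            if any(independent(x, set(s)) for s in all_subsets(others)):
                current.remove(x)  # X independent of T given some S
    return current
```

For example, a variable C that is associated with T only through A is admitted and then removed as soon as the test finds (C ⊥ T | {A}).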

20. Filters vs Wrappers: Which Is Best? • None over all possible classification tasks! • We can only prove that a specific filter (or wrapper) algorithm for a specific classifier (or class of classifiers), and a specific class of distributions yields optimal or sub-optimal solutions. Unless we provide such proofs we are operating on faith and hope…

21. A final note: What is the biological significance of selected features? • In MB-based feature selection and CPN-faithful distributions: causal neighborhood of target (i.e., direct causes, direct effects, direct causes of the direct effects of target). • In other methods: ???

22. Case Studies

23. Case Study: Categorizing Text Into Content Categories Automatic Identification of Purpose and Quality of Articles In Journals Of Internal Medicine Yin Aphinyanaphongs M.S., Constantin Aliferis M.D., Ph.D. (presented at AMIA 2003)

24. Case Study: Categorizing Text Into Content Categories • The problem: classify PubMed articles as [high quality & treatment specific] or not • Same function as the current Clinical Quality Filters of PubMed (in the treatment category)

25. Case Study: Categorizing Text Into Content Categories • Overview: • Select Gold Standard • Corpus Construction • Document representation • Cross-validation Design • Train classifiers • Evaluate the classifiers

26. Case Study: Categorizing Text Into Content Categories • Select Gold Standard: • ACP Journal Club. Expert reviewers strictly evaluate and categorize, in each medical area, articles from the top journals in internal medicine. • Their mission is “to select from the biomedical literature those articles reporting original studies and systematic reviews that warrant immediate attention by physicians.” • The treatment criteria - ACP Journal Club: • “Random allocation of participants to comparison groups.” • “80% follow up of those entering study.” • “Outcome of known or probable clinical importance.” • If an article is cited by the ACP, it is a high quality article.

27. Case Study: Categorizing Text Into Content Categories • Corpus construction: get all articles from the 49 journals in the study period (8/1998 to 12/2000), and review the ACP Journal Club from 8/1998 to 12/2000 for articles that are cited by the ACP. 15,803 total articles, 396 positives (high quality, treatment related).

28. Case Study: Categorizing Text Into Content Categories • Document representation: • “Bag of words” • Title, abstract, MeSH terms, publication type • Term extraction and processing, e.g. “The clinical significance of cerebrospinal.” • Term extraction • “The”, “clinical”, “significance”, “of”, “cerebrospinal” • Stop word removal • “Clinical”, “Significance”, “Cerebrospinal” • Porter Stemming (i.e. reducing words to their roots) • “Clinic*”, “Signific*”, “Cerebrospin*” • Term weighting • log frequency with redundancy.
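The term-processing pipeline can be sketched as below; simple truncation is used as a crude stand-in for Porter stemming, and the stop list and truncation length are illustrative, not the study's exact choices:

```python
import math
import re

# Illustrative stop list (the study used a standard stop-word list)
STOP = {"the", "of", "a", "an", "and", "in", "to"}

def preprocess(text, stem_len=8):
    """Bag-of-words sketch: tokenize, drop stop words, truncate tokens as a
    crude stand-in for Porter stemming, then weight terms by log frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = [t[:stem_len] for t in tokens if t not in STOP]
    counts = {}
    for t in terms:
        counts[t] = counts.get(t, 0) + 1
    return {t: 1.0 + math.log(c) for t, c in counts.items()}
```

Running this on the slide's example sentence keeps "clinical", "signific…", and "cerebros…" and drops "the" and "of", mirroring the stop-word and stemming steps shown above.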

29. Case Study: Categorizing Text Into Content Categories • Cross-validation design [Figure: 10-fold cross-validation to measure error over the 15,803 articles, with an 80% train/validation portion and a 20% reserved test portion.]

30. Case Study: Categorizing Text Into Content Categories • Classifier families • Naïve Bayes (no parameter optimization) • Decision Trees with Boosting (# of iterations = # of simple rules) • Linear & Polynomial Support Vector Machines (cost from {0.1, 0.2, 0.4, 0.7, 0.9, 1, 5, 10, 20, 100, 1000}, degree from {1,2,3,5,8})

31. Case Study: Categorizing Text Into Content Categories • Evaluation metrics (averaged over 10 cross-validation folds): • Sensitivity for fixed specificity • Specificity for fixed sensitivity • Area under ROC curve • Area under 11-point precision-recall curve • “Ranked retrieval”

32. Case Study: Categorizing Text Into Content Categories

33. Case Study: Categorizing Text Into Content Categories

34. Case Study: Categorizing Text Into Content Categories [Figure: Clinical Query Filter performance.]

35. Case Study: Categorizing Text Into Content Categories [Figure: Clinical Query Filter.]

36. Case Study: Categorizing Text Into Content Categories • Alternative/additional approaches? • Negation detection • Citation analysis • Sequence of words • Variable selection to produce user-understandable models • Analysis of ACPJ potential bias • Others???

37. Case Study: Diagnostic Model From Array Gene Expression Data Computational Models of Lung Cancer: Connecting Classification, Gene Selection, and Molecular Sub-typing C. Aliferis M.D., Ph.D., Pierre Massion M.D., I. Tsamardinos Ph.D., D. Hardin Ph.D.

38. Case Study: Diagnostic Model From Array Gene Expression Data • Specific Aim 1: “Construct computational models that distinguish between important cellular states related to lung cancer, e.g., (i) Cancerous vs Normal Cells; (ii) Metastatic vs Non-Metastatic cells; (iii) Adenocarcinomas vs Squamous carcinomas”. • Specific Aim 2: “Reduce the number of gene markers by application of biomarker (gene) selection algorithms such that small sets of genes can distinguish among the different states (and ideally reveal important genes in the pathophysiology of lung cancer).”

39. Case Study: Diagnostic Model From Array Gene Expression Data • Bhattacharjee et al. PNAS, 2001 • 12,600 gene expression measurements obtained using Affymetrix oligonucleotide arrays • 203 patients and normal subjects, 5 disease types (plus staging and survival information)

40. Case Study: Diagnostic Model From Array Gene Expression Data • Linear and polynomial-kernel Support Vector Machines (LSVM and PSVM respectively); C optimized via C.V. from {10^-8, 10^-7, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 0.1, 1, 10, 100, 1000} and degree from the set {1, 2, 3, 4} • K-Nearest Neighbors (KNN) (k optimized via C.V.) • Feed-forward Neural Networks (NNs): 1 hidden layer, number of units chosen (heuristically) from the set {2, 3, 5, 8, 10, 30, 50}, variable-learning-rate backpropagation, custom-coded early stopping with (limiting) performance goal = 10^-8 (i.e., an arbitrary value very close to zero), number of epochs in the range [100, …, 10000], and a fixed momentum of 0.001 • Stratified nested n-fold cross-validation (n=5 or 7 depending on task)

41. Case Study: Diagnostic Model From Array Gene Expression Data • Area under the Receiver Operator Characteristic (ROC) curve (AUC) computed with the trapezoidal rule (DeLong et al. 1998). • Statistical comparisons among AUCs were performed using a paired Wilcoxon rank sum test (Pagano et al. 2000). • Scale gene values linearly to [0,1] • Feature selection: • RFE (parameters as the ones used in Guyon et al 2002) • UAF (Fisher criterion scoring; k optimized via C.V.)
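The AUC computation via the trapezoidal rule can be sketched as follows (a simple version that assumes distinct scores; tied scores would need an averaged step):

```python
def auc_trapezoid(scores, labels):
    """Area under the ROC curve by the trapezoidal rule.
    scores are classifier outputs; labels are 1 (positive) / 0 (negative)."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    area = prev_fpr = prev_tpr = 0.0
    for _, label in pairs:  # sweep the decision threshold from high to low
        if label:
            tp += 1
        else:
            fp += 1
        fpr, tpr = fp / neg, tp / pos
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # trapezoid slice
        prev_fpr, prev_tpr = fpr, tpr
    return area
```

A perfectly separating classifier scores 1.0 and chance-level ranking about 0.5, which is the scale used in the comparisons on the following slides.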

42. Case Study: Diagnostic Model From Array Gene Expression Data • Classification Performance

43. Case Study: Diagnostic Model From Array Gene Expression Data • Gene selection

44. Case Study: Diagnostic Model From Array Gene Expression Data • A more detailed look: • Specific Aim 3:“Study how aspects of experimental design (including data set, measured genes, sample size, cross-validation methodology) determine the performance and stability of several machine learning (classifier and feature selection) methods used in the experiments”.

45. Case Study: Diagnostic Model From Array Gene Expression Data • Overfitting: we replace actual gene measurements by random values in the same range (while retaining the outcome variable values). • Target class rarity: we contrast performance in tasks with rare vs non-rare categories. • Sample size: we use sample sizes from the set {40, 80, 120, 160, 203} (as applicable in each task). • Predictor info redundancy: we replace the full set of predictors by random subsets with sizes in the set {500, 1000, 5000, 12600}.