
Learning from Imbalanced Data Prof. Haibo He Electrical Engineering

Learning from Imbalanced Data Prof. Haibo He Electrical Engineering University of Rhode Island, Kingston, RI 02881 Computational Intelligence and Self-Adaptive Systems (CISA) Laboratory http://www.ele.uri.edu/faculty/he/ Email: he@ele.uri.edu.

Presentation Transcript


  1. Learning from Imbalanced Data Prof. Haibo He Electrical Engineering University of Rhode Island, Kingston, RI 02881 Computational Intelligence and Self-Adaptive Systems (CISA) Laboratory http://www.ele.uri.edu/faculty/he/ Email: he@ele.uri.edu These lecture notes are based on the following paper: H. He and E. A. Garcia, “Learning from Imbalanced Data,” IEEE Trans. Knowledge and Data Engineering, vol. 21, issue 9, pp. 1263-1284, 2009

  2. Learning from Imbalanced Data • The problem: Imbalanced Learning • The solutions: State-of-the-art • The evaluation: Assessment Metrics • The future: Opportunities and Challenges

  3. The Nature of Imbalanced Learning Problem

  4. The Nature of Imbalance Learning The Problem • Explosive availability of raw data • Well-developed algorithms for data analysis Requirement? • Balanced distribution of data • Equal costs of misclassification What about data in reality?

  5. Imbalance is Everywhere

  6. Growing interest

  7. The Nature of Imbalance Learning Mammography Data Set: An example of between-class imbalance Imbalance can be on the order of 100 : 1 up to 10,000 : 1!

  8. Intrinsic and extrinsic imbalance Intrinsic: • Imbalance due to the nature of the dataspace Extrinsic: • Imbalance due to time, storage, and other factors • Example: Data transmission over a specific interval of time with interruption

  9. The Nature of Imbalance Learning Data Complexity

  10. Relative imbalance and absolute rarity? • The minority class may be outnumbered, but not necessarily rare • Therefore it can be accurately learned with little disturbance

  11. Imbalanced data with small sample size • Data with high dimensionality and small sample size • Face recognition, gene expression • Challenges with small sample size: • Embedded absolute rarity and within-class imbalances • Failure of generalizing inductive rules by learning algorithms • Difficulty in forming good classification decision boundary over more features but fewer samples • Risk of overfitting

  12. The Solutions to Imbalanced Learning Problem

  13. Solutions to imbalanced learning

  14. Sampling methods Create balance through sampling

  15. Sampling methods Random Sampling S: training data set; Smin: set of minority class samples, Smaj: set of majority class samples; E: generated samples
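
Random oversampling and undersampling can be sketched with the notation from this slide. This is a minimal illustration, assuming the goal is a 1:1 class ratio; the function names and the toy data are hypothetical, and E denotes the randomly selected samples that are added (oversampling) or removed (undersampling).

```python
import random


def random_oversample(s_min, s_maj, rng):
    """Append E: minority samples drawn with replacement until |Smin| + |E| = |Smaj|."""
    n_extra = max(0, len(s_maj) - len(s_min))
    e = [rng.choice(s_min) for _ in range(n_extra)]
    return list(s_min) + e, list(s_maj)


def random_undersample(s_min, s_maj, rng):
    """Remove E: keep only |Smin| majority samples, drawn without replacement."""
    n_keep = min(len(s_min), len(s_maj))
    return list(s_min), rng.sample(s_maj, n_keep)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
s_min = [("min", i) for i in range(5)]    # toy minority class, 5 samples
s_maj = [("maj", i) for i in range(50)]   # toy majority class, 50 samples

over_min, over_maj = random_oversample(s_min, s_maj, rng)
under_min, under_maj = random_undersample(s_min, s_maj, rng)
print(len(over_min), len(over_maj))    # 50 50
print(len(under_min), len(under_maj))  # 5 5
```

Oversampling only replicates existing minority samples, while undersampling discards majority samples that may have been informative; the later slides on informed and synthetic sampling address these two weaknesses.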

  16. Sampling methods Informed Undersampling • EasyEnsemble • Unsupervised: use random subsets of the majority class to create balance and form multiple classifiers • BalanceCascade • Supervised: iteratively create balance and pull out redundant samples in majority class to form a final classifier
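
The EasyEnsemble idea of building several balanced training sets from random majority subsets can be sketched as follows. This is only an illustration under stated assumptions: the function name, the number of subsets, and the toy data are hypothetical, and the step of training one classifier per subset and combining their outputs is omitted. BalanceCascade is not shown.

```python
import random


def easy_ensemble_subsets(s_min, s_maj, n_subsets, rng):
    """Return n_subsets balanced training sets: all of Smin plus a random draw from Smaj."""
    balanced_sets = []
    for _ in range(n_subsets):
        # Draw |Smin| majority samples without replacement for this subset.
        maj_subset = rng.sample(s_maj, len(s_min))
        balanced_sets.append(list(s_min) + maj_subset)
    return balanced_sets


rng = random.Random(42)
s_min = [(i, 1) for i in range(4)]     # (sample id, label) pairs, label 1 = minority
s_maj = [(i, 0) for i in range(100)]   # label 0 = majority

subsets = easy_ensemble_subsets(s_min, s_maj, n_subsets=5, rng=rng)
for subset in subsets:
    n_minority = sum(label for _, label in subset)
    print(len(subset), n_minority)  # each line prints: 8 4
```

Because every subset reuses all minority samples but a different slice of the majority class, the ensemble as a whole sees far more majority data than any single undersampled training set would.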

  17. Sampling methods Informed Undersampling

  18. Sampling methods Synthetic Sampling with Data Generation

  19. Sampling methods Synthetic Sampling with Data Generation • Synthetic minority oversampling technique (SMOTE)
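
The core SMOTE step, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched in plain Python. This is a simplified illustration rather than the reference algorithm: the base sample is picked at random instead of looping over every minority sample, and the names `smote`, `n_new`, `k`, and `delta` are assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k, rng):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x (excluding x itself), by Euclidean distance.
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        x_hat = rng.choice(neighbors)
        delta = rng.random()  # interpolation factor in [0, 1)
        # New point lies on the line segment between x and x_hat.
        synthetic.append(tuple(a + delta * (b - a) for a, b in zip(x, x_hat)))
    return synthetic


rng = random.Random(1)
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]  # toy 2-D minority samples
new_points = smote(minority, n_new=6, k=2, rng=rng)
print(len(new_points))  # 6
```

Every synthetic point lies between two existing minority samples, so the minority region is filled in rather than merely duplicated as in random oversampling.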

  20. Sampling methods Adaptive Synthetic Sampling

  21. Sampling methods Adaptive Synthetic Sampling • Overcomes overgeneralization in SMOTE algorithm • Borderline-SMOTE
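
Borderline-SMOTE is usually described as generating synthetic samples only from minority samples that sit near the class boundary. The sketch below selects that borderline ("danger") set. The selection rule used here (at least half, but not all, of the m nearest neighbors belong to the majority class) follows the common description of the method and is an assumption, since the transcript does not spell it out; the function name and toy data are hypothetical.

```python
import math


def danger_set(minority, majority, m):
    """Return minority samples whose m nearest neighbors are mostly, but not all, majority."""
    labeled = [(p, 0) for p in minority] + [(p, 1) for p in majority]  # 1 marks majority
    danger = []
    for x in minority:
        others = [(p, is_maj) for p, is_maj in labeled if p is not x]
        nearest = sorted(others, key=lambda item: math.dist(x, item[0]))[:m]
        n_maj = sum(is_maj for _, is_maj in nearest)
        if m / 2 <= n_maj < m:   # borderline: many majority neighbors, but not only majority
            danger.append(x)
        # n_maj == m would be treated as noise, n_maj < m / 2 as safe; both are skipped.
    return danger


minority = [
    (0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.1, 0.1),  # safe cluster, far from the majority
    (3.0, 0.0), (3.1, 0.0),                          # near the class boundary
    (6.0, 0.0),                                      # isolated inside the majority class
]
majority = [(3.3, 0.0), (3.0, 0.25), (3.2, 0.2), (6.1, 0.0), (5.9, 0.0), (6.0, 0.1)]

print(danger_set(minority, majority, m=3))  # [(3.0, 0.0), (3.1, 0.0)]
```

SMOTE-style interpolation would then be applied only to the returned samples, which concentrates new points where the classifier's decision boundary is actually contested.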

  22. Sampling methods Adaptive Synthetic Sampling

  23. Sampling methods Sampling with Data Cleaning

  24. Sampling methods Sampling with Data Cleaning

  25. Sampling methods Cluster-based oversampling (CBO) method

  26. Sampling methods CBO Method

  27. Sampling methods Integration of Sampling and Boosting

  28. Sampling methods Integration of Sampling and Boosting

  29. Cost-Sensitive methods

  30. Cost-Sensitive methods Cost-Sensitive Learning Framework

  31. Cost-Sensitive methods Cost-Sensitive Dataspace Weighting with Adaptive Boosting

  32. Cost-Sensitive methods Cost-Sensitive Dataspace Weighting with Adaptive Boosting

  33. Cost-Sensitive methods Cost-Sensitive Decision Trees • Cost-sensitive adjustments for the decision threshold • The final decision threshold shall yield the most dominant point on the ROC curve • Cost-sensitive considerations for split criteria • The impurity function shall be insensitive to unequal costs • Cost-sensitive pruning schemes • The probability estimate at each node needs improvement to reduce removal of leaves describing the minority concept • Laplace smoothing method and Laplace pruning techniques
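
The Laplace smoothing mentioned for leaf probability estimates can be illustrated with a small calculation. This sketch assumes the standard Laplace correction, (k + 1) / (n + C) for k minority samples among n samples at a leaf with C classes; the function names and the leaf counts are hypothetical.

```python
def raw_estimate(n_min, n_total):
    """Unsmoothed minority-class probability at a leaf: plain relative frequency."""
    return n_min / n_total


def laplace_estimate(n_min, n_total, n_classes=2):
    """Laplace-smoothed estimate: add one pseudo-count per class."""
    return (n_min + 1) / (n_total + n_classes)


# (minority samples at the leaf, total samples at the leaf)
leaves = {
    "small pure leaf": (2, 2),
    "large pure leaf": (20, 20),
    "mixed leaf": (3, 10),
}
for name, (n_min, n_total) in leaves.items():
    raw = raw_estimate(n_min, n_total)
    smoothed = laplace_estimate(n_min, n_total)
    print(f"{name}: raw={raw:.3f} laplace={smoothed:.3f}")
```

Both pure leaves have a raw estimate of 1.0, but the smoothed estimate for the two-sample leaf drops to 0.75 while the twenty-sample leaf stays near 0.95, so the estimate reflects how much evidence supports each leaf.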

  34. Cost-Sensitive methods Cost-Sensitive Neural Network Four ways of applying cost sensitivity in neural networks

  35. Kernel-Based Methods Kernel-based learning framework • Based on statistical learning and Vapnik-Chervonenkis (VC) dimensions • Problems with Kernel-based support vector machines (SVMs) • Support vectors from the minority concept may contribute less to the final hypothesis • Optimal hyperplane is also biased toward the majority class

  36. Kernel-Based Methods Integration of Kernel Methods with Sampling Methods • SMOTE with Different Costs (SDCs) method • Ensembles of over/under-sampled SVMs • SVM with asymmetric misclassification cost • Granular Support Vector Machines-Repetitive Undersampling (GSVM-RU) algorithm

  37. Kernel-Based Methods Kernel Modification Methods • Kernel classifier construction • Orthogonal forward selection (OFS) and Regularized orthogonal weighted least squares (ROWLSs) estimator • SVM class boundary adjustment • Boundary movement (BM), biased penalties (BP), class-boundary alignment (CBA), kernel-boundary alignment (KBA) • Integrated approach • Total margin-based adaptive fuzzy SVM (TAF-SVM) • K-category proximal SVM (PSVM) with Newton refinement • Support cluster machines (SCMs), Kernel neural gas (KNG), P2PKNNC algorithm, hybrid kernel machine ensemble (HKME) algorithm, Adaboost relevance vector machine (RVM), …

  38. Active Learning Methods • SVM-based active learning • Active learning with sampling techniques • Undersampling and oversampling with active learning for the word sense disambiguation (WSD) imbalanced learning • New stopping mechanisms based on maximum confidence and minimal error • Simple active learning heuristic (SALH) approach

  39. Active Learning Methods Additional methods

  40. The Evaluation of Imbalanced Learning Problem

  41. Assessment Metrics How to evaluate the performance of imbalanced learning algorithms?

  42. Assessment Metrics Singular assessment metrics • Limitations of accuracy: sensitivity to data distributions
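
The limitation of accuracy under imbalance can be made concrete with a small calculation. The sketch below uses the standard confusion-matrix definitions of accuracy, precision, recall, F1, and G-mean, with the minority class treated as positive; the counts describe a hypothetical 100:1 test set and are not taken from the lecture.

```python
import math


def metrics(tp, fn, fp, tn):
    """Standard confusion-matrix metrics; the positive class is the minority class."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Precision is reported as 0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    g_mean = math.sqrt(recall * specificity)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "g_mean": g_mean}


# Hypothetical test set: 1000 majority (negative) and 10 minority (positive) samples.
always_majority = metrics(tp=0, fn=10, fp=0, tn=1000)    # predicts "negative" for everything
imperfect_detector = metrics(tp=5, fn=5, fp=10, tn=990)  # finds half of the minority class

for name, result in [("always majority", always_majority),
                     ("imperfect detector", imperfect_detector)]:
    print(name, {key: round(value, 4) for key, value in result.items()})
```

The classifier that never detects a minority sample scores about 99% accuracy, slightly higher than the detector that finds half of them, while recall, F1, and G-mean all drop to zero for it. This is why accuracy alone is a poor guide on imbalanced data.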

  43. Assessment Metrics Singular assessment metrics • Insensitive to data distributions

  44. Assessment Metrics Singular assessment metrics

  45. Assessment Metrics Receiver Operating Characteristic (ROC) curves
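
An ROC curve plots the true positive rate against the false positive rate as the decision threshold is lowered. A minimal sketch of that construction follows, assuming classifier scores with no ties and labels where 1 marks the minority (positive) class; the function names and the scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the threshold one sample at a time."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a piecewise-linear curve given as (x, y) pairs."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # hypothetical classifier outputs
labels = [1, 1, 0, 1, 0, 0, 0, 0]                  # 3 positives, 5 negatives

curve = roc_points(scores, labels)
print(curve[0], curve[-1])                 # (0.0, 0.0) (1.0, 1.0)
print(round(area_under_curve(curve), 4))   # 0.9333
```

The area equals 14/15 here because exactly one of the 15 positive-negative pairs is ranked in the wrong order (the negative scored 0.7 sits above the positive scored 0.6).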

  46. Assessment Metrics Precision-Recall (PR) curves • Plotting the precision rate over the recall rate • A curve dominates in ROC space (resides in the upper-left-hand corner) if and only if it dominates (resides in the upper-right-hand corner) in PR space • PR space has all the analogous benefits of ROC space • Provides more informative representations of performance assessment under highly imbalanced data
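
Why PR space can be more informative under heavy imbalance can be seen by evaluating one fixed threshold on two test sets that differ only in the number of negatives. This is a sketch with hypothetical scores and function names: the second data set simply repeats every negative ten times.

```python
def pr_points(scores, labels):
    """Return (recall, precision) pairs, one per threshold, from the highest score down."""
    n_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = []
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_pos, tp / (tp + fp)))
    return points


def point_at_threshold(scores, labels, threshold):
    """Return (FPR, recall, precision) when scores >= threshold are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos, tp / (tp + fp)


pos_scores = [0.9, 0.8, 0.6]
neg_scores = [0.7, 0.5, 0.4, 0.3, 0.2]

base_scores = pos_scores + neg_scores
base_labels = [1] * len(pos_scores) + [0] * len(neg_scores)
skew_scores = pos_scores + neg_scores * 10           # ten times as many negatives
skew_labels = [1] * len(pos_scores) + [0] * (len(neg_scores) * 10)

base = point_at_threshold(base_scores, base_labels, threshold=0.55)
skew = point_at_threshold(skew_scores, skew_labels, threshold=0.55)
print(base)  # FPR 0.2, recall 1.0, precision 0.75
print(skew)  # same FPR and recall, precision falls to 3/13
print(pr_points(base_scores, base_labels)[:3])
```

The ROC coordinates (FPR and recall) are identical for both data sets, while precision falls from 0.75 to roughly 0.23 once the negatives outnumber the positives more heavily, which is the extra information a PR curve exposes.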

  47. Assessment Metrics Cost Curves

  48. The Future of Imbalanced Learning Problem

  49. Opportunities and Challenges

  50. Opportunities and Challenges Understanding the Fundamental Problem • What kind of assumptions will make imbalanced learning algorithms work better compared to learning from the original distributions? • To what degree should one balance the original data set? • How do imbalanced data distributions affect the computational complexity of learning algorithms? • What is the general error bound given an imbalanced data distribution?
