Active Learning for Class Imbalance Problem
Presentation Transcript
Problem to be addressed
• Motivation: the class imbalance problem
• This refers to the situation in which at least one class has significantly fewer training examples than the others
• In other words, the training examples belonging to one class heavily outnumber the examples in the other class
• Currently, most machine learning algorithms assume the training data to be balanced: support vector machines, logistic regression, the naïve Bayes classifier, etc.
• During the last few decades, some effective methods have been proposed to attack this problem, such as over-sampling, under-sampling, and asymmetric bagging
Problem to be addressed
• Detailed problem
• Traditional machine learning algorithms are often biased toward the majority class
• This is because the classifiers aim to minimize the training error and do not take the class distribution into consideration
• Consequently, examples from the majority class are well classified, while examples from the minority class tend to be misclassified
Several Common Approaches
• From the data perspective (a minimal code sketch follows this slide)
• Over-sampling
• Under-sampling
• Asymmetric bagging
• From the learning algorithm perspective
• Adjusting the cost function
• Tuning the related parameters
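To make the data-level approaches concrete, here is a minimal Python sketch of random over-sampling and under-sampling, assuming NumPy arrays X and y; the function names are illustrative and not from the paper.

```python
import numpy as np

def random_undersample(X, y, rng=None):
    """Randomly drop majority-class examples until every class matches the minority size."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    keep = np.concatenate([rng.choice(np.flatnonzero(y == c), size=target, replace=False)
                           for c in classes])
    return X[keep], y[keep]

def random_oversample(X, y, rng=None):
    """Randomly duplicate minority-class examples until every class matches the majority size."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = np.concatenate([rng.choice(np.flatnonzero(y == c), size=target, replace=True)
                           for c in classes])
    return X[keep], y[keep]
```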
Background Knowledge
• Active learning
• Similar to semi-supervised learning, the key idea is to use both labeled and unlabeled data for classifier training
• Active learning is composed of four components: a small set of labeled training data, a large pool of unlabeled data, a base learning algorithm, and an active learner (selection strategy)
• Active learning is not a machine learning algorithm itself; it can be seen as a wrapper method that enhances a base learner
• The difference between semi-supervised learning and active learning: semi-supervised learning assigns labels to selected unlabeled examples automatically, whereas active learning asks a human annotator to label them
Background Knowledge
• Active learning
• Goals of active learning
• Maximizing the learning performance while minimizing the number of required labeled training examples
• In other words: achieving better performance using the same amount of labeled training data, or needing fewer training examples to obtain the same learning performance
An Example
• SVM-based active learning
• A small set of labeled training examples
• A large pool of unlabeled data
• Base learning algorithm: SVM
• Active learner (selection strategy): the instances closest to the current separating hyperplane are selected and presented to a human annotator for labeling (see the sketch below)
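The loop described above can be sketched in Python with scikit-learn. SVC with a linear kernel and the oracle callback standing in for the human annotator are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np
from sklearn.svm import SVC

def svm_active_learning(X_labeled, y_labeled, X_pool, oracle, n_queries=50):
    """Classical SVM-based active learning: each iteration queries the pool
    instance closest to the current separating hyperplane."""
    X_labeled = np.asarray(X_labeled, dtype=float)
    y_labeled = np.asarray(y_labeled)
    X_pool = np.asarray(X_pool, dtype=float)
    clf = SVC(kernel="linear")
    for _ in range(n_queries):
        clf.fit(X_labeled, y_labeled)
        # |decision_function| is proportional to the distance to the hyperplane
        dist = np.abs(clf.decision_function(X_pool))
        i = int(np.argmin(dist))              # most informative = closest instance
        label = oracle(X_pool[i])             # stand-in for human annotation
        X_labeled = np.vstack([X_labeled, X_pool[i]])
        y_labeled = np.append(y_labeled, label)
        X_pool = np.delete(X_pool, i, axis=0)
    clf.fit(X_labeled, y_labeled)
    return clf
```

Note that the argmin over dist scans the entire pool on every iteration, which is exactly the inefficiency the paper targets on the next slides.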
Problems
• SVM-based active learning
• In classical active learning methods, the most informative samples are selected from the entire unlabeled pool
• In other words, each iteration of active learning involves computing the distance of every sample to the decision boundary
• For large-scale data sets, this is time-consuming and computationally inefficient
Paper Contribution
• Proposed method
• Instead of querying the whole unlabeled pool, a small random subset of L instances is selected first
• The sample closest to the hyperplane is then chosen from this subset, with L set so that, with probability P, the chosen sample is among the top p fraction of closest instances
Paper Contribution
• Proposed method
• The probability that at least one of the L random instances is among the top p fraction of closest instances is P = 1 - (1 - p)^L
• Solving for L, we have L = ceil( log(1 - P) / log(1 - p) )
Paper Contribution
• Proposed method
• For example, with P = 0.95 and p = 0.05, L = ceil( log(0.05) / log(0.95) ) = 59
• The active learner will pick one instance that, with 95% probability, is among the top 5% of instances closest to the separating hyperplane, by randomly sampling only 59 instances regardless of the training set size (see the sketch below)
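A minimal sketch of this subset-based query step, assuming a fitted scikit-learn SVM clf; the helper names are illustrative.

```python
import math
import numpy as np

def subset_size(P=0.95, p=0.05):
    """Smallest L such that, with probability P, at least one of L random
    draws is among the top-p fraction closest to the hyperplane."""
    return math.ceil(math.log(1 - P) / math.log(1 - p))

def query_from_random_subset(clf, X_pool, P=0.95, p=0.05, rng=None):
    """Score only a random subset of size L instead of the whole pool."""
    rng = np.random.default_rng(rng)
    L = subset_size(P, p)                           # 59 for the defaults
    subset = rng.choice(len(X_pool), size=min(L, len(X_pool)), replace=False)
    dist = np.abs(clf.decision_function(X_pool[subset]))
    return int(subset[np.argmin(dist)])             # index into X_pool

print(subset_size())  # -> 59
```

The cost of each query is thus constant in the pool size, which is where the reported training-time savings come from.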
Experiments
• Evaluation metric: g-means
• g-means = sqrt(sensitivity × specificity)
• where sensitivity and specificity are the accuracies on the positive and negative instances, respectively
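A minimal NumPy sketch of this metric, assuming 0/1 class labels:

```python
import numpy as np

def g_means(y_true, y_pred):
    """Geometric mean of sensitivity (positive-class accuracy) and
    specificity (negative-class accuracy)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sensitivity = np.mean(y_pred[y_true == 1] == 1)
    specificity = np.mean(y_pred[y_true == 0] == 0)
    return np.sqrt(sensitivity * specificity)

print(g_means([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))  # sqrt(0.5 * 2/3) ≈ 0.577
```

Unlike plain accuracy, g-means stays low if either class is poorly classified, which makes it a suitable metric for imbalanced data.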
Conclusions
• This paper proposes a method to address the class imbalance problem using active learning
• Experimental results show that this approach achieves a significant decrease in training time while maintaining the same or even higher g-means values with fewer training examples
• Active selection of informative examples from a randomly chosen subset avoids searching the whole unlabeled pool