This paper addresses the challenge of data stream classification, which is complicated by infinite data length and concept drift. We propose an ensemble classification framework that effectively trains classifiers using only a limited amount of labeled data within streaming environments. By employing semi-supervised clustering and focusing on the most relevant data, our method improves upon traditional techniques that require fully labeled data. The experimental results, utilizing both synthetic and real-world datasets, demonstrate that our approach provides significant advantages in accuracy and efficiency.
Data Stream Classification: Training with Limited Amount of Labeled Data
Mohammad Mehedy Masud, Latifur Khan, Bhavani Thuraisingham (University of Texas at Dallas); Jing Gao, Jiawei Han (University of Illinois at Urbana-Champaign)
To appear in the IEEE International Conference on Data Mining (ICDM), Pisa, Italy, Dec 15–19, 2008. Funded by: Air Force
Data Stream Classification Techniques
• Data stream classification is challenging for two reasons:
  • Infinite length – we cannot use all historical data for training
  • Concept-drift – old models become outdated
• Solutions:
  • Single-model classification with incremental learning
  • Ensemble classification
• Ensemble techniques can be updated more efficiently and handle concept-drift more effectively
• Our solution: an ensemble approach
Ensemble Classification
• Ensemble techniques build an ensemble of M classifiers
• The data stream is divided into equal-sized chunks
• A new classifier is trained from each labeled chunk
• The new classifier replaces one old classifier, if required (see the sketch below)
[Diagram: the last labeled data chunk trains a new classifier; the ensemble {L1, …, LM} is updated with it and then classifies the last unlabeled data chunk]
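To make the chunk-by-chunk update concrete, here is a minimal Python sketch. The slides do not specify the replacement policy, so dropping the member that scores worst on the newest labeled chunk is an assumption, and the classifier interface (fit/score/single-instance predict) is hypothetical:

```python
from collections import Counter

# Minimal sketch of chunk-based ensemble maintenance (illustrative only).
# Assumed, not in the slides: classifiers expose fit/score/predict, and the
# member scoring worst on the newest labeled chunk is the one replaced.

class ChunkEnsemble:
    def __init__(self, max_size, make_classifier):
        self.max_size = max_size            # M: ensemble size
        self.make_classifier = make_classifier
        self.members = []                   # current classifiers L1..LM

    def update(self, X_chunk, y_chunk):
        """Train a classifier on the latest labeled chunk; update the ensemble."""
        clf = self.make_classifier()
        clf.fit(X_chunk, y_chunk)
        self.members.append(clf)
        if len(self.members) > self.max_size:
            # Replace one old classifier: drop the worst on the newest chunk.
            worst = min(self.members, key=lambda c: c.score(X_chunk, y_chunk))
            self.members.remove(worst)

    def predict(self, x):
        """Majority vote over the ensemble members for a single instance x."""
        votes = Counter(clf.predict(x) for clf in self.members)
        return votes.most_common(1)[0][0]
```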
Limited Labeled Data Problem
• Ensemble techniques assume that the entire data chunk is labeled during training
• This assumption is impractical because:
  • Data labeling is both time-consuming and costly
  • It may not be possible to label all the data, especially in a streaming environment where data is produced at high speed
• Our solution:
  • Train classifiers with a limited amount of labeled data, assuming only a fraction of each data chunk is labeled
  • We obtain better results than other techniques that train classifiers on fully labeled data chunks
Training With a Partially-Labeled Chunk
• Train a new classifier using semi-supervised clustering
• If a new class has arrived in the stream, refine the existing classifiers
• Update the ensemble
[Diagram: (1) a classifier is trained from the last partially-labeled chunk using semi-supervised clustering; (2) the ensemble {L1, …, LM} classifies the last unlabeled data chunk; (3) the ensemble is refined; (4) the ensemble is updated]
Semi-Supervised Clustering
• Overview:
  • Only a few instances in the training data are labeled
  • We apply impurity-based clustering
  • The goal is to minimize cluster impurity
  • A cluster is completely pure if all the labeled data in that cluster is from the same class
• Objective function (reconstructed below), with:
  • K: number of clusters
  • X_i: data points belonging to cluster i
  • L_i: labeled data points belonging to cluster i
  • Imp_i: impurity of cluster i = Entropy × dissimilarity count
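The slide text preserves only the variable definitions, not the formula itself. A minimal LaTeX reconstruction consistent with those definitions follows; the multiplicative (1 + Imp_i) weighting of the k-means dispersion term, and the reading of "dissimilarity count" as the number of labeled pairs with differing labels, are assumptions on my part:

```latex
\min_{\{X_i\},\,\{\mu_i\}} \;\sum_{i=1}^{K} \sum_{x \in X_i} \lVert x - \mu_i \rVert^{2}\,\bigl(1 + \mathrm{Imp}_i\bigr),
\qquad
\mathrm{Imp}_i = \mathrm{Ent}_i \times \mathrm{DC}_i,
\qquad
\mathrm{Ent}_i = -\sum_{c} \frac{|L_{i,c}|}{|L_i|}\,\log\frac{|L_{i,c}|}{|L_i|}
```

Here μ_i is the centroid of cluster i, L_{i,c} is the set of labeled points of class C_c in cluster i, and DC_i is the dissimilarity count. Under this form a completely pure cluster has Imp_i = 0, and the objective reduces to the ordinary k-means dispersion.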
Semi-Supervised Clustering Using E-M
• Constrained initialization: initialize K seeds
  • For each class C_j, select k_j seeds from the labeled instances of C_j using the farthest-first traversal heuristic
  • k_j = (N_j / N) × K, where N_j = number of labeled instances in class C_j and N = total number of labeled instances
• Repeat E-step and M-step until convergence (a sketch follows this list):
  • E-step: assign each instance to a cluster so that the objective function is minimized; apply Iterated Conditional Modes (ICM) until convergence
  • M-step: re-compute the cluster centroids
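A minimal Python sketch of this procedure, under stated assumptions: the helper names are hypothetical, the impurity helpers implement the slide's Entropy × dissimilarity-count definition, and the E-step is simplified to nearest-centroid assignment rather than full ICM minimization of the objective:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Entropy of the class distribution among a cluster's labeled points."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    if counts.size == 0:
        return 0.0
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def dissimilarity_count(labels):
    """Number of labeled pairs in the cluster with differing labels."""
    n = len(labels)
    same = sum(c * (c - 1) // 2 for c in Counter(labels).values())
    return n * (n - 1) // 2 - same

def impurity(labels):
    # Imp_i = Entropy * dissimilarity count (per the slide's definition);
    # not used by the simplified E-step below, shown to ground the definition.
    return entropy(labels) * dissimilarity_count(labels)

def farthest_first(points, k):
    """Farthest-first traversal: pick k spread-out seeds."""
    seeds = [points[0]]
    while len(seeds) < k:
        d = np.min([np.linalg.norm(points - s, axis=1) for s in seeds], axis=0)
        seeds.append(points[int(np.argmax(d))])
    return seeds

def semi_supervised_em(X, y, K, n_iter=20):
    """X: (n, d) array; y: length-n labels, with -1 marking unlabeled points."""
    X = np.asarray(X, dtype=float)
    labeled = y != -1
    classes, counts = np.unique(y[labeled], return_counts=True)
    # Constrained initialization: k_j = (N_j / N) * K seeds from class C_j
    centroids = []
    for cj, nj in zip(classes, counts):
        kj = max(1, round(K * nj / counts.sum()))
        centroids.extend(farthest_first(X[y == cj], kj))
    centroids = np.array(centroids[:K])
    for _ in range(n_iter):
        # E-step (simplified): assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        # M-step: re-compute cluster centroids
        for i in range(len(centroids)):
            if np.any(assign == i):
                centroids[i] = X[assign == i].mean(axis=0)
    return centroids, assign
```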
Saving Cluster Summary as a Micro-Cluster
• For each of the K clusters created by the semi-supervised clustering, save the following (sketched below):
  • Centroid
  • n: total number of points
  • L: total number of labeled points
  • Lt[ ]: array containing the number of labeled points belonging to each class (e.g., Lt[j]: number of labeled points belonging to class C_j)
  • Sum[ ]: array containing the per-dimension sum of the attribute values of all the data points (e.g., Sum[r]: sum of the r-th dimension of all data points in the cluster)
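A small Python sketch of this summary structure; the field names mirror the slide, but the class itself and its constructor are illustrative:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MicroCluster:
    centroid: np.ndarray  # cluster centroid
    n: int                # total number of points
    L: int                # total number of labeled points
    Lt: np.ndarray        # Lt[j]: number of labeled points of class C_j
    Sum: np.ndarray       # Sum[r]: sum of the r-th dimension over all points

    @classmethod
    def from_points(cls, X, y, num_classes):
        """Summarize one cluster's raw points (y uses -1 for unlabeled)."""
        labeled = y != -1
        return cls(centroid=X.mean(axis=0),
                   n=len(X),
                   L=int(labeled.sum()),
                   Lt=np.bincount(y[labeled], minlength=num_classes),
                   Sum=X.sum(axis=0))
```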
Using the Micro-Clusters as a Classification Model
• We remove all the raw data points after saving the micro-clusters
• The set of K such micro-clusters built from a data chunk serves as a classification model
• To classify a test data point x using the model (a sketch follows this list):
  • Find the Q nearest micro-clusters, by computing the distance between x and their centroids
  • For each class C_j, compute the cumulative normalized frequency CNFrq[j] = sum of Lt[j]/L over the Q nearest micro-clusters
  • Output the class C_j with the highest CNFrq[j]
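This rule is direct to implement; a sketch reusing the illustrative MicroCluster class above:

```python
import numpy as np

def classify(x, micro_clusters, Q, num_classes):
    """Classify x by its Q nearest micro-clusters (illustrative sketch)."""
    # Distance from x to every micro-cluster centroid
    dists = [np.linalg.norm(x - mc.centroid) for mc in micro_clusters]
    nearest = np.argsort(dists)[:Q]
    # CNFrq[j] = sum of Lt[j] / L over the Q nearest micro-clusters
    cnfrq = np.zeros(num_classes)
    for i in nearest:
        mc = micro_clusters[i]
        if mc.L > 0:
            cnfrq += mc.Lt / mc.L
    return int(np.argmax(cnfrq))  # class with the highest CNFrq
```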
Experiments
• Data sets:
  • Synthetic data – simulates evolving data streams
  • Real data – botnet data, collected from real botnet traffic generated in a controlled environment
• Baseline: On-Demand Stream
  • C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for on-demand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 18(5):577–589, 2006.
• For training, we use 5% labeled data in each chunk
  • So, if there are 100 instances in a chunk, our technique (SmSCluster) uses only 5 labeled and 95 unlabeled instances for training, while On-Demand Stream uses all 100 labeled instances
• Environment: S/W: Java; OS: Windows XP; H/W: Pentium-IV 3 GHz dual-core, 2 GB RAM
Results
• Each time unit = 1,000 data points (botnet) and 1,600 data points (synthetic)
Conclusion
• We address a more practical approach to classifying evolving data streams – training with a limited amount of labeled data
• Our technique applies semi-supervised clustering to train classification models
• Our technique outperforms other state-of-the-art stream classification techniques that use 20 times more labeled data for training