This paper addresses the challenge of data stream classification, which is complicated by infinite data length and concept drift. We propose an ensemble classification framework that effectively trains classifiers using only a limited amount of labeled data within streaming environments. By employing semi-supervised clustering and focusing on the most relevant data, our method improves upon traditional techniques that require fully labeled data. The experimental results, utilizing both synthetic and real-world datasets, demonstrate that our approach provides significant advantages in accuracy and efficiency.
Data Stream Classification: Training with Limited Amount of Labeled Data
Mohammad Mehedy Masud, Latifur Khan, Bhavani Thuraisingham (University of Texas at Dallas); Jing Gao, Jiawei Han (University of Illinois at Urbana-Champaign)
To appear in the IEEE International Conference on Data Mining (ICDM), Pisa, Italy, Dec 15–19, 2008. Funded by: Air Force
Data Stream Classification Techniques
• Data stream classification is challenging for two reasons:
  • Infinite length – we cannot use all historical data for training
  • Concept-drift – old models become outdated
• Solutions:
  • Single-model classification with incremental learning
  • Ensemble classification
• Ensemble techniques can be updated more efficiently and handle concept-drift more effectively
• Our solution: an ensemble approach
Ensemble Classification
• Ensemble techniques build an ensemble of M classifiers
• The data stream is divided into equal-sized chunks
• A new classifier is trained from each labeled chunk
• The new classifier replaces one old classifier, if required (see the sketch below)
[Diagram: the last labeled data chunk trains a new classifier; the ensemble {L1, …, LM} is updated with it and then classifies the last unlabeled data chunk]
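To make the chunk-by-chunk update concrete, here is a minimal Python sketch. The slides do not specify the replacement policy, so dropping the member that scores worst on the newest labeled chunk is an assumption, and the classifier interface (fit/score/single-instance predict) is hypothetical:

```python
from collections import Counter

# Minimal sketch of chunk-based ensemble maintenance (illustrative only).
# Assumed, not in the slides: classifiers expose fit/score/predict, and the
# member scoring worst on the newest labeled chunk is the one replaced.

class ChunkEnsemble:
    def __init__(self, max_size, make_classifier):
        self.max_size = max_size            # M: ensemble size
        self.make_classifier = make_classifier
        self.members = []                   # current classifiers L1..LM

    def update(self, X_chunk, y_chunk):
        """Train a classifier on the latest labeled chunk; update the ensemble."""
        clf = self.make_classifier()
        clf.fit(X_chunk, y_chunk)
        self.members.append(clf)
        if len(self.members) > self.max_size:
            # Replace one old classifier: drop the worst on the newest chunk.
            worst = min(self.members, key=lambda c: c.score(X_chunk, y_chunk))
            self.members.remove(worst)

    def predict(self, x):
        """Majority vote over the ensemble members for a single instance x."""
        votes = Counter(clf.predict(x) for clf in self.members)
        return votes.most_common(1)[0][0]
```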
Limited Labeled Data Problem
• Ensemble techniques assume that the entire data chunk is labeled during training
• This assumption is impractical because:
  • Data labeling is both time-consuming and costly
  • It may not be possible to label all the data, especially in a streaming environment where data is produced at high speed
• Our solution:
  • Train classifiers with a limited amount of labeled data, assuming only a fraction of each data chunk is labeled
  • We obtain better results than other techniques that train classifiers on fully labeled data chunks
Training With a Partially-Labeled Chunk
• Train a new classifier using semi-supervised clustering
• If a new class has arrived in the stream, refine the existing classifiers
• Update the ensemble
[Diagram: (1) a classifier is trained from the last partially-labeled chunk using semi-supervised clustering; (2) the ensemble {L1, …, LM} classifies the last unlabeled data chunk; (3) the ensemble is refined; (4) the ensemble is updated]
Semi-Supervised Clustering
• Overview:
  • Only a few instances in the training data are labeled
  • We apply impurity-based clustering
  • The goal is to minimize cluster impurity
  • A cluster is completely pure if all the labeled data in that cluster is from the same class
• Objective function (reconstructed below), with:
  • K: number of clusters
  • X_i: data points belonging to cluster i
  • L_i: labeled data points belonging to cluster i
  • Imp_i: impurity of cluster i = Entropy × dissimilarity count
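The slide text preserves only the variable definitions, not the formula itself. A minimal LaTeX reconstruction consistent with those definitions follows; the multiplicative (1 + Imp_i) weighting of the k-means dispersion term, and the reading of "dissimilarity count" as the number of labeled pairs with differing labels, are assumptions on my part:

```latex
\min_{\{X_i\},\,\{\mu_i\}} \;\sum_{i=1}^{K} \sum_{x \in X_i} \lVert x - \mu_i \rVert^{2}\,\bigl(1 + \mathrm{Imp}_i\bigr),
\qquad
\mathrm{Imp}_i = \mathrm{Ent}_i \times \mathrm{DC}_i,
\qquad
\mathrm{Ent}_i = -\sum_{c} \frac{|L_{i,c}|}{|L_i|}\,\log\frac{|L_{i,c}|}{|L_i|}
```

Here μ_i is the centroid of cluster i, L_{i,c} is the set of labeled points of class C_c in cluster i, and DC_i is the dissimilarity count. Under this form a completely pure cluster has Imp_i = 0, and the objective reduces to the ordinary k-means dispersion.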
Semi-Supervised Clustering Using E-M
• Constrained initialization: initialize K seeds
  • For each class C_j, select k_j seeds from the labeled instances of C_j using the farthest-first traversal heuristic
  • k_j = (N_j / N) × K, where N_j = number of labeled instances in class C_j and N = total number of labeled instances
• Repeat E-step and M-step until convergence (a sketch follows this list):
  • E-step: assign each instance to a cluster so that the objective function is minimized; apply Iterated Conditional Modes (ICM) until convergence
  • M-step: re-compute the cluster centroids
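A minimal Python sketch of this procedure, under stated assumptions: the helper names are hypothetical, the impurity helpers implement the slide's Entropy × dissimilarity-count definition, and the E-step is simplified to nearest-centroid assignment rather than full ICM minimization of the objective:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Entropy of the class distribution among a cluster's labeled points."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    if counts.size == 0:
        return 0.0
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def dissimilarity_count(labels):
    """Number of labeled pairs in the cluster with differing labels."""
    n = len(labels)
    same = sum(c * (c - 1) // 2 for c in Counter(labels).values())
    return n * (n - 1) // 2 - same

def impurity(labels):
    # Imp_i = Entropy * dissimilarity count (per the slide's definition);
    # not used by the simplified E-step below, shown to ground the definition.
    return entropy(labels) * dissimilarity_count(labels)

def farthest_first(points, k):
    """Farthest-first traversal: pick k spread-out seeds."""
    seeds = [points[0]]
    while len(seeds) < k:
        d = np.min([np.linalg.norm(points - s, axis=1) for s in seeds], axis=0)
        seeds.append(points[int(np.argmax(d))])
    return seeds

def semi_supervised_em(X, y, K, n_iter=20):
    """X: (n, d) array; y: length-n labels, with -1 marking unlabeled points."""
    X = np.asarray(X, dtype=float)
    labeled = y != -1
    classes, counts = np.unique(y[labeled], return_counts=True)
    # Constrained initialization: k_j = (N_j / N) * K seeds from class C_j
    centroids = []
    for cj, nj in zip(classes, counts):
        kj = max(1, round(K * nj / counts.sum()))
        centroids.extend(farthest_first(X[y == cj], kj))
    centroids = np.array(centroids[:K])
    for _ in range(n_iter):
        # E-step (simplified): assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        # M-step: re-compute cluster centroids
        for i in range(len(centroids)):
            if np.any(assign == i):
                centroids[i] = X[assign == i].mean(axis=0)
    return centroids, assign
```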
Saving Cluster Summary as a Micro-Cluster
• For each of the K clusters created by the semi-supervised clustering, save the following (sketched below):
  • Centroid
  • n: total number of points
  • L: total number of labeled points
  • Lt[ ]: array containing the number of labeled points belonging to each class (e.g., Lt[j]: number of labeled points belonging to class C_j)
  • Sum[ ]: array containing the per-dimension sum of the attribute values of all the data points (e.g., Sum[r]: sum of the r-th dimension of all data points in the cluster)
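A small Python sketch of this summary structure; the field names mirror the slide, but the class itself and its constructor are illustrative:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MicroCluster:
    centroid: np.ndarray  # cluster centroid
    n: int                # total number of points
    L: int                # total number of labeled points
    Lt: np.ndarray        # Lt[j]: number of labeled points of class C_j
    Sum: np.ndarray       # Sum[r]: sum of the r-th dimension over all points

    @classmethod
    def from_points(cls, X, y, num_classes):
        """Summarize one cluster's raw points (y uses -1 for unlabeled)."""
        labeled = y != -1
        return cls(centroid=X.mean(axis=0),
                   n=len(X),
                   L=int(labeled.sum()),
                   Lt=np.bincount(y[labeled], minlength=num_classes),
                   Sum=X.sum(axis=0))
```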
Using the Micro-Clusters as a Classification Model
• We remove all the raw data points after saving the micro-clusters
• The set of K such micro-clusters built from a data chunk serves as a classification model
• To classify a test data point x using the model (a sketch follows this list):
  • Find the Q nearest micro-clusters, by computing the distance between x and their centroids
  • For each class C_j, compute the cumulative normalized frequency CNFrq[j] = sum of Lt[j]/L over the Q nearest micro-clusters
  • Output the class C_j with the highest CNFrq[j]
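This rule is direct to implement; a sketch reusing the illustrative MicroCluster class above:

```python
import numpy as np

def classify(x, micro_clusters, Q, num_classes):
    """Classify x by its Q nearest micro-clusters (illustrative sketch)."""
    # Distance from x to every micro-cluster centroid
    dists = [np.linalg.norm(x - mc.centroid) for mc in micro_clusters]
    nearest = np.argsort(dists)[:Q]
    # CNFrq[j] = sum of Lt[j] / L over the Q nearest micro-clusters
    cnfrq = np.zeros(num_classes)
    for i in nearest:
        mc = micro_clusters[i]
        if mc.L > 0:
            cnfrq += mc.Lt / mc.L
    return int(np.argmax(cnfrq))  # class with the highest CNFrq
```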
Experiments
• Data sets:
  • Synthetic data – simulates evolving data streams
  • Real data – botnet data, collected from real botnet traffic generated in a controlled environment
• Baseline: On-Demand Stream
  • C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for on-demand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 18(5):577–589, 2006.
• For training, we use 5% labeled data in each chunk
  • So, if there are 100 instances in a chunk, our technique (SmSCluster) uses only 5 labeled and 95 unlabeled instances for training, while On-Demand Stream uses all 100 labeled instances
• Environment: S/W: Java; OS: Windows XP; H/W: Pentium-IV 3 GHz dual-core, 2 GB RAM
Results
• Each time unit = 1,000 data points (botnet) and 1,600 data points (synthetic)
Conclusion
• We address a more practical approach to classifying evolving data streams – training with a limited amount of labeled data
• Our technique applies semi-supervised clustering to train classification models
• Our technique outperforms other state-of-the-art stream classification techniques that use 20 times more labeled data for training