
Incremental learning in data stream analysis

Presentation Transcript


  1. Incremental learning in data stream analysis Mario.Guarracino@cnr.it High Performance Computing and Networking Institute, National Research Council – Naples, ITALY. International workshop on Data Stream Management and Mining, Beijing, October 27-28, 2008

  2. Acknowledgements • Panos Pardalos, Onur Seref, Claudio Cifarelli • Davide Feminiano, Salvatore Cuciniello • Rosanna Verde

  3. Outline • Introduction • Challenges, applications and existing methods • ReGEC (Regularized Generalized Eigenvalue Classifier) • I-ReGEC (Incremental ReGEC) • I-ReGEC on data streams • Experiments

  4. Supervised learning • Supervised learning refers to the capability of a system to learn from examples • The trained system is able to answer new questions • Supervised means the desired answer for the training set is provided by an external teacher • Binary classification is among the most successful approaches to supervised learning

  5. Supervised learning • The Incremental Regularized Generalized Eigenvalue Classifier (I-ReGEC) is a supervised learning algorithm that uses a subset of the training data • The advantage is that the classification model can be incrementally updated • The algorithm decides online which points bring new information, and updates the model accordingly • Experiments and comparisons assess I-ReGEC classification accuracy and processing speed

  6. The challenges • Applications on massive data sets are emerging with increasing frequency • Data has to be analyzed as soon as it is produced • Legacy databases and warehouses with petabytes of data cannot be loaded in main memory, so data are accessed as streams • Classification algorithms have to deal with large amounts of data delivered in the form of streams

  7. The challenges • Classification has to be performed online • Data is processed on the fly, at transmission speed • Need for data sampling • Classification models must not overfit the data, yet must be detailed enough to describe the phenomena • Training behavior changes over time • The nature and distribution of the data change over time

  8. Applications on data streams • Sensor networks • Power grids, telecommunications, bio, seismic, security,… • Computer network traffic • spam, intrusion detection, IP traffic logs… • Bank transactions • Financial data, frauds, credit cards,… • Web • Browser clicks, user queries, link rating,…

  9. Support Vector Machines • The SVM algorithm is among the most successful methods for classification, and its variations have been applied to data streams • General-purpose methods are only suitable for small-size problems • For large problems, chunking, subset selection and decomposition methods use subsets of points • SVM-Light and libSVM are among the most widely used implementations based on chunking, subset selection and decomposition

  10. Support Vector Machines. Find two parallel lines with maximum margin that leave all points of a class on one side; the points lying on the margin are the support vectors:

$$\min_{w \neq 0} \frac{\|w\|^2}{2} \quad \text{s.t.} \quad Aw + be \geq e, \qquad Bw + be \leq -e$$

(figure: classes A and B separated by the maximum-margin line, with the support vectors on the margin)
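A minimal sketch of this formulation, using scikit-learn as an assumed stand-in (the slides themselves cite SVM-Light and libSVM); the data and parameters are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Two toy classes A and B in the plane (illustrative data)
rng = np.random.default_rng(0)
A = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))    # class +1
B = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))  # class -1

X = np.vstack([A, B])
y = np.hstack([np.ones(50), -np.ones(50)])

# Linear SVM: maximizes the margin 2/||w|| subject to the
# constraints Aw + be >= e and Bw + be <= -e from the slide
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("margin width:", 2.0 / np.linalg.norm(w))
print("support vectors:", int(clf.n_support_.sum()))
```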

  11. SVM for data streams • The batch technique uses an SVM model on the complete data set • The error-driven technique randomly stores k samples in a training set and the others in a test set; if a point is well classified, it remains in the training set • The fixed-partition technique divides the training set into batches of fixed size; this partition permits adding points to the current SVM according to the ones loaded in memory

  12. SVM for data streams • The exceeding-margin technique checks, at time t and for each new point, whether the new data exceeds the margin evaluated by the SVM • If so, the point is added to the incremental training set; otherwise it is discarded • The fixed-margin + errors technique adds the new point to the training set if it either exceeds the margin or is misclassified (a sketch of the margin test follows)
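A sketch of the exceeding-margin test, assuming a linear model (w, b) and labels in {-1, +1}; the stream and model below are illustrative stand-ins, not the implementations the slides compare.

```python
import numpy as np

rng = np.random.default_rng(1)
w, b = np.array([1.0, 1.0]), 0.0  # assumed current SVM model

def exceeds_margin(x, y, w, b):
    """Margin test: the point carries margin information
    (lies inside or beyond the margin) when y * (w.x + b) <= 1."""
    return y * (np.dot(w, x) + b) <= 1.0

# Toy stream of (point, label) pairs
stream = [(rng.normal(size=2), rng.choice([-1, 1])) for _ in range(100)]

incremental_set = []
for x, y in stream:
    if exceeds_margin(x, y, w, b):
        incremental_set.append((x, y))  # kept for incremental training
        # a real implementation would update/retrain the SVM here
    # otherwise the point is discarded

print(len(incremental_set), "of", len(stream), "points retained")
```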

  13. Pros of SVM-based methods • They require a minimal computational burden to build classification models • In case of kernel-based nonlinear classification, they reduce the size of the training set and, thus, of the related kernel • All of these methods show that a sensible data reduction is possible while maintaining a comparable level of classification accuracy

  14. A different approach: ReGEC. Find two lines, each the closest to one set and the furthest from the other:

$$\min_{(w,\gamma) \neq 0} \frac{\|Aw - e\gamma\|^2}{\|Bw - e\gamma\|^2}$$

(figure: classes A and B with the two planes $x'w_1 - \gamma_1 = 0$ and $x'w_2 - \gamma_2 = 0$, each closest to one class) M.R. Guarracino, C. Cifarelli, O. Seref, P. Pardalos. A Classification Method Based on Generalized Eigenvalue Problems, OMS, 2007.

  15. ReGEC formulation. Let $G = [A\;{-e}]'[A\;{-e}]$, $H = [B\;{-e}]'[B\;{-e}]$, $z = [w'\;\gamma]'$. The minimization above becomes the Rayleigh quotient

$$\min_{z \neq 0} \frac{z'Gz}{z'Hz}$$

of the generalized eigenvalue problem $Gz = \lambda Hz$.
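A minimal sketch of this training step with NumPy/SciPy; the Tikhonov-style regularization term below (the "R" in ReGEC) is a simplified assumption, not the exact scheme of the paper.

```python
import numpy as np
from scipy.linalg import eig

def regec_train(A, B, delta=1e-4):
    """Solve min_z z'Gz / z'Hz via the generalized eigenvalue
    problem G z = lambda H z, with G = [A -e]'[A -e] and
    H = [B -e]'[B -e]; delta is a simplified regularization."""
    GA = np.hstack([A, -np.ones((A.shape[0], 1))])  # [A -e]
    GB = np.hstack([B, -np.ones((B.shape[0], 1))])  # [B -e]
    I = np.eye(A.shape[1] + 1)
    G = GA.T @ GA + delta * I
    H = GB.T @ GB + delta * I

    vals, vecs = eig(G, H)              # generalized eigenproblem
    vals = vals.real
    z1 = vecs[:, np.argmin(vals)].real  # plane closest to A
    z2 = vecs[:, np.argmax(vals)].real  # plane closest to B
    # each z = [w' gamma]': split into normal vector and offset
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])
```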

  16. ReGEC classification of new points. A new point $x$ is assigned to the class described by the closest plane:

$$dist(x, P_i) = \frac{|x'w_i - \gamma_i|}{\|w_i\|}$$
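Continuing the sketch above, the assignment rule follows directly from this formula (the planes are assumed to come from a function such as the hypothetical regec_train above):

```python
import numpy as np

def regec_classify(x, plane1, plane2):
    """Assign x to the class of the closest plane, using
    dist(x, P_i) = |x'w_i - gamma_i| / ||w_i||."""
    (w1, g1), (w2, g2) = plane1, plane2
    d1 = abs(x @ w1 - g1) / np.linalg.norm(w1)
    d2 = abs(x @ w2 - g2) / np.linalg.norm(w2)
    return "A" if d1 <= d2 else "B"
```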

  17. ReGEC • Let $[w_1, \gamma_1]$ and $[w_2, \gamma_2]$ be the eigenvectors of the minimum and maximum eigenvalues of $Gx = \lambda Hx$ • $a \in A$ is closer to $x'w_1 - \gamma_1 = 0$ than to $x'w_2 - \gamma_2 = 0$ • $b \in B$ is closer to $x'w_2 - \gamma_2 = 0$ than to $x'w_1 - \gamma_1 = 0$

  18. Incremental learning • The purpose of incremental learning is to find a small and robust subset of the training set that provides comparable accuracy results • A smaller set of points reduces the probability of overfitting the problem • A classification model built from a smaller subset is computationally more efficient in predicting new points • As new points become available, the cost of retraining the algorithm decreases if the influence of the new points is only evaluated against the small subset. C. Cifarelli, M.R. Guarracino, O. Seref, S. Cuciniello, and P.M. Pardalos. Incremental Classification with Generalized Eigenvalues, JoC, 2007.

  19. Incremental learning algorithm (a Python sketch follows)
1: Γ0 = C \ C0
2: {M0, Acc0} = Classify(C; C0)
3: k = 1
4: while |Γk-1| > 0 do
5:   xk = argmax { dist(x, Pclass(x)) : x ∈ Mk-1 ∩ Γk-1 }
6:   {Mk, Acck} = Classify(C; Ck-1 ∪ {xk})
7:   if Acck > Acck-1 then
8:     Ck = Ck-1 ∪ {xk}
9:   end if
10:  Γk = Γk-1 \ {xk}
11:  k = k + 1
12: end while
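A sketch of this loop in Python, where classify_fn(C, S) returns (misclassified set, accuracy) and dist_fn(x) gives the distance of x from the plane of its own class; both names are assumptions standing in for ReGEC training and evaluation.

```python
def incremental_learning(C, C0, classify_fn, dist_fn):
    """Select a small incremental training set from C, starting
    from the initial set C0 (sketch of the slide's pseudocode).
    Points are assumed hashable, e.g. tuples of features."""
    Gamma = set(C) - set(C0)     # candidate points not yet examined
    Ck = set(C0)                 # current incremental training set
    M, acc = classify_fn(C, Ck)  # misclassified points, accuracy
    while Gamma:
        candidates = M & Gamma
        if not candidates:
            break                # no misclassified candidates left
        # pick the misclassified candidate farthest from its own plane
        xk = max(candidates, key=dist_fn)
        M_new, acc_new = classify_fn(C, Ck | {xk})
        if acc_new > acc:        # keep xk only if accuracy improves
            Ck, M, acc = Ck | {xk}, M_new, acc_new
        Gamma.discard(xk)
    return Ck
```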

  20. Incremental classification on data streams • Find a small and robust subset of the training set while accessing the data available in a window of size wsize • When the window is full, all its points are processed by the classifier (a sketch follows) M.R. Guarracino, S. Cuciniello, D. Feminiano. Incremental Generalized Eigenvalue Classification on Data Streams. MODULAD, 2007.
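A sketch of the window handling, with update_model standing in for the incremental classifier update (an assumed callback, not an API from the paper):

```python
def process_stream(stream, wsize, update_model):
    """Buffer stream points until the window holds wsize of them,
    then hand the whole window to the incremental classifier."""
    window = []
    for point in stream:
        window.append(point)
        if len(window) == wsize:
            update_model(window)  # all points in the window processed
            window.clear()
    if window:                    # flush a final, partially full window
        update_model(window)
```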

  21. I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams (figure: a stream window of size wsize)

  22. I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams (figure: new data entering the window, old data leaving) At each step, data in the window are processed with the incremental learning classifier… and hyperplanes are built

  23. I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams Step by step, new points are processed… and I-ReGEC updates the hyperplane configuration

  24. I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams But not all points are considered… Some of them are discarded if their information contribution is useless

  25. I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams New unknown incoming points are classified by their distance from the hyperplanes

  26. I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams An incremental learning technique based on ReGEC that determines the classification model from a very small sample of the data stream

  27. Experiments • Large-noisy-crossed-norm data set: 200,000 points with 20 features, equally divided into 2 classes • 100,000 training points • 100,000 test points • Each class is drawn from a multivariate normal distribution (an illustrative generation sketch follows)
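The slide does not give the distribution parameters, so the means and covariances below are pure assumptions for illustration; only the sizes (100,000 training points, 20 features, 2 balanced classes) come from the slide.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, d = 100_000, 20  # sizes from the slide

# Assumed parameters: the slide only says "multivariate normal"
mu_A, mu_B = np.full(d, 1.0), np.full(d, -1.0)
cov = np.eye(d)

A = rng.multivariate_normal(mu_A, cov, size=n_train // 2)
B = rng.multivariate_normal(mu_B, cov, size=n_train // 2)
X_train = np.vstack([A, B])
y_train = np.hstack([np.zeros(n_train // 2), np.ones(n_train // 2)])
```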

  28. Misclassification results SI-ReGEC has the lowest error and uses the smallest incremental set • B: Batch SVM • ED: Error-driven KNN • FP: Fixed-partition SVM • EM: Exceeding-margin SVM • EM+E: Fixed-margin + errors

  29. Window size Larger windows lead to smaller training subsets, while execution time increases with the window size. One billion elements per day can be processed on standard hardware.

  30. Conclusion and future work • The classification accuracy of I-ReGEC compares well with that of other methods • I-ReGEC produces small incremental training sets • Future work: investigate how to dynamically adapt the window size to the stream rate and to nonstationary data streams

  31. Incremental learning in data stream analysis Mario.Guarracino@cnr.it High Performance Computing and Networking Institute, National Research Council – Naples, ITALY http://www.na.icar.cnr.it/~mariog
