
Incremental learning in data stream analysis

Presentation Transcript


  1. Incremental learning in data stream analysis Mario.Guarracino@cnr.it High Performance Computing and Networking Institute, National Research Council – Naples, ITALY. International workshop on Data Stream Management and Mining, Beijing, October 27-28, 2008

  2. Acknowledgements • Panos Pardalos, Onur Seref, Claudio Cifarelli • Davide Feminiano, Salvatore Cuciniello • Rosanna Verde

  3. Outline • Introduction • Challenges, applications and existing methods • ReGEC (Regularized Generalized Eigenvalue Classifier) • I-ReGEC (Incremental ReGEC) • I-ReGEC on data streams • Experiments

  4. Supervised learning • Supervised learning refers to the capability of a system to learn from examples • The trained system is able to answer new questions • Supervised means the desired answer for the training set is provided by an external teacher • Binary classification is among the most successful approaches to supervised learning

  5. Supervised learning • The Incremental Regularized Generalized Eigenvalue Classifier (I-ReGEC) is a supervised learning algorithm that uses a subset of the training data • The advantage is that the classification model can be incrementally updated • The algorithm decides online which points bring new information, and updates the model accordingly • Experiments and comparisons assess I-ReGEC classification accuracy and processing speed

  6. The challenges • Applications on massive data sets are emerging with increasing frequency • Data has to be analyzed as soon as it is produced • Legacy databases and warehouses with petabytes of data cannot be loaded in main memory, so data are accessed as streams • Classification algorithms have to deal with large amounts of data delivered in the form of streams

  7. The challenges • Classification has to be performed online • Data is processed on the fly, at transmission speed • Need for data sampling • Classification models must not overfit the data, yet must be detailed enough to describe the phenomena • Training behavior changes over time • The nature and distribution of the data change over time

  8. Applications on data streams • Sensor networks • Power grids, telecommunications, bio, seismic, security,… • Computer network traffic • spam, intrusion detection, IP traffic logs… • Bank transactions • Financial data, frauds, credit cards,… • Web • Browser clicks, user queries, link rating,…

  9. Support Vector Machines • The SVM algorithm is among the most successful methods for classification, and its variations have been applied to data streams • General-purpose methods are only suitable for small-size problems • For large problems, chunking, subset selection and decomposition methods use subsets of points • SVM-Light and libSVM are among the most widely used implementations based on chunking, subset selection and decomposition

  10. Support Vector Machines. Find two parallel lines with maximum margin that leave all points of a class on one side; the points lying on the margin are the support vectors:

$$\min_{w \neq 0} \frac{\|w\|^2}{2} \quad \text{s.t.} \quad Aw + be \geq e, \qquad Bw + be \leq -e$$

(figure: classes A and B separated by the maximum-margin line, with the support vectors on the margin)
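A minimal sketch of this formulation, using scikit-learn as an assumed stand-in (the slides themselves cite SVM-Light and libSVM); the data and parameters are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Two toy classes A and B in the plane (illustrative data)
rng = np.random.default_rng(0)
A = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))    # class +1
B = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))  # class -1

X = np.vstack([A, B])
y = np.hstack([np.ones(50), -np.ones(50)])

# Linear SVM: maximizes the margin 2/||w|| subject to the
# constraints Aw + be >= e and Bw + be <= -e from the slide
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("margin width:", 2.0 / np.linalg.norm(w))
print("support vectors:", int(clf.n_support_.sum()))
```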

  11. SVM for data streams • The batch technique uses an SVM model on the complete data set • The error-driven technique randomly stores k samples in a training set and the others in a test set; if a point is well classified, it remains in the training set • The fixed-partition technique divides the training set into batches of fixed size; this partition permits adding points to the current SVM according to the ones loaded in memory

  12. SVM for data streams • The exceeding-margin technique checks, at time t and for each new point, whether the new data exceeds the margin evaluated by the SVM • If so, the point is added to the incremental training set; otherwise it is discarded • The fixed-margin + errors technique adds the new point to the training set if it either exceeds the margin or is misclassified (a sketch of the margin test follows)
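A sketch of the exceeding-margin test, assuming a linear model (w, b) and labels in {-1, +1}; the stream and model below are illustrative stand-ins, not the implementations the slides compare.

```python
import numpy as np

rng = np.random.default_rng(1)
w, b = np.array([1.0, 1.0]), 0.0  # assumed current SVM model

def exceeds_margin(x, y, w, b):
    """Margin test: the point carries margin information
    (lies inside or beyond the margin) when y * (w.x + b) <= 1."""
    return y * (np.dot(w, x) + b) <= 1.0

# Toy stream of (point, label) pairs
stream = [(rng.normal(size=2), rng.choice([-1, 1])) for _ in range(100)]

incremental_set = []
for x, y in stream:
    if exceeds_margin(x, y, w, b):
        incremental_set.append((x, y))  # kept for incremental training
        # a real implementation would update/retrain the SVM here
    # otherwise the point is discarded

print(len(incremental_set), "of", len(stream), "points retained")
```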

  13. Pros of SVM-based methods • They require a minimal computational burden to build classification models • In case of kernel-based nonlinear classification, they reduce the size of the training set and, thus, of the related kernel • All of these methods show that a sensible data reduction is possible while maintaining a comparable level of classification accuracy

  14. A different approach: ReGEC. Find two lines, each the closest to one set and the furthest from the other:

$$\min_{(w,\gamma) \neq 0} \frac{\|Aw - e\gamma\|^2}{\|Bw - e\gamma\|^2}$$

(figure: classes A and B with the two planes $x'w_1 - \gamma_1 = 0$ and $x'w_2 - \gamma_2 = 0$, each closest to one class) M.R. Guarracino, C. Cifarelli, O. Seref, P. Pardalos. A Classification Method Based on Generalized Eigenvalue Problems, OMS, 2007.

  15. ReGEC formulation. Let $G = [A\;{-e}]'[A\;{-e}]$, $H = [B\;{-e}]'[B\;{-e}]$, $z = [w'\;\gamma]'$. The minimization above becomes the Rayleigh quotient

$$\min_{z \neq 0} \frac{z'Gz}{z'Hz}$$

of the generalized eigenvalue problem $Gz = \lambda Hz$.
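A minimal sketch of this training step with NumPy/SciPy; the Tikhonov-style regularization term below (the "R" in ReGEC) is a simplified assumption, not the exact scheme of the paper.

```python
import numpy as np
from scipy.linalg import eig

def regec_train(A, B, delta=1e-4):
    """Solve min_z z'Gz / z'Hz via the generalized eigenvalue
    problem G z = lambda H z, with G = [A -e]'[A -e] and
    H = [B -e]'[B -e]; delta is a simplified regularization."""
    GA = np.hstack([A, -np.ones((A.shape[0], 1))])  # [A -e]
    GB = np.hstack([B, -np.ones((B.shape[0], 1))])  # [B -e]
    I = np.eye(A.shape[1] + 1)
    G = GA.T @ GA + delta * I
    H = GB.T @ GB + delta * I

    vals, vecs = eig(G, H)              # generalized eigenproblem
    vals = vals.real
    z1 = vecs[:, np.argmin(vals)].real  # plane closest to A
    z2 = vecs[:, np.argmax(vals)].real  # plane closest to B
    # each z = [w' gamma]': split into normal vector and offset
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])
```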

  16. ReGEC classification of new points. A new point $x$ is assigned to the class described by the closest plane:

$$dist(x, P_i) = \frac{|x'w_i - \gamma_i|}{\|w_i\|}$$
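Continuing the sketch above, the assignment rule follows directly from this formula (the planes are assumed to come from a function such as the hypothetical regec_train above):

```python
import numpy as np

def regec_classify(x, plane1, plane2):
    """Assign x to the class of the closest plane, using
    dist(x, P_i) = |x'w_i - gamma_i| / ||w_i||."""
    (w1, g1), (w2, g2) = plane1, plane2
    d1 = abs(x @ w1 - g1) / np.linalg.norm(w1)
    d2 = abs(x @ w2 - g2) / np.linalg.norm(w2)
    return "A" if d1 <= d2 else "B"
```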

  17. ReGEC • Let $[w_1, \gamma_1]$ and $[w_2, \gamma_2]$ be the eigenvectors of the minimum and maximum eigenvalues of $Gx = \lambda Hx$ • $a \in A$ is closer to $x'w_1 - \gamma_1 = 0$ than to $x'w_2 - \gamma_2 = 0$ • $b \in B$ is closer to $x'w_2 - \gamma_2 = 0$ than to $x'w_1 - \gamma_1 = 0$

  18. Incremental learning • The purpose of incremental learning is to find a small and robust subset of the training set that provides comparable accuracy results • A smaller set of points reduces the probability of overfitting the problem • A classification model built from a smaller subset is computationally more efficient in predicting new points • As new points become available, the cost of retraining the algorithm decreases if the influence of the new points is only evaluated against the small subset. C. Cifarelli, M.R. Guarracino, O. Seref, S. Cuciniello, and P.M. Pardalos. Incremental Classification with Generalized Eigenvalues, JoC, 2007.

  19. Incremental learning algorithm (a Python sketch follows)
1: Γ0 = C \ C0
2: {M0, Acc0} = Classify(C; C0)
3: k = 1
4: while |Γk-1| > 0 do
5:   xk = argmax { dist(x, Pclass(x)) : x ∈ Mk-1 ∩ Γk-1 }
6:   {Mk, Acck} = Classify(C; Ck-1 ∪ {xk})
7:   if Acck > Acck-1 then
8:     Ck = Ck-1 ∪ {xk}
9:   end if
10:  Γk = Γk-1 \ {xk}
11:  k = k + 1
12: end while
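A sketch of this loop in Python, where classify_fn(C, S) returns (misclassified set, accuracy) and dist_fn(x) gives the distance of x from the plane of its own class; both names are assumptions standing in for ReGEC training and evaluation.

```python
def incremental_learning(C, C0, classify_fn, dist_fn):
    """Select a small incremental training set from C, starting
    from the initial set C0 (sketch of the slide's pseudocode).
    Points are assumed hashable, e.g. tuples of features."""
    Gamma = set(C) - set(C0)     # candidate points not yet examined
    Ck = set(C0)                 # current incremental training set
    M, acc = classify_fn(C, Ck)  # misclassified points, accuracy
    while Gamma:
        candidates = M & Gamma
        if not candidates:
            break                # no misclassified candidates left
        # pick the misclassified candidate farthest from its own plane
        xk = max(candidates, key=dist_fn)
        M_new, acc_new = classify_fn(C, Ck | {xk})
        if acc_new > acc:        # keep xk only if accuracy improves
            Ck, M, acc = Ck | {xk}, M_new, acc_new
        Gamma.discard(xk)
    return Ck
```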

  20. Incremental classification on data streams • Find a small and robust subset of the training set while accessing the data available in a window of size wsize • When the window is full, all its points are processed by the classifier (a sketch follows) M.R. Guarracino, S. Cuciniello, D. Feminiano. Incremental Generalized Eigenvalue Classification on Data Streams. MODULAD, 2007.
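A sketch of the window handling, with update_model standing in for the incremental classifier update (an assumed callback, not an API from the paper):

```python
def process_stream(stream, wsize, update_model):
    """Buffer stream points until the window holds wsize of them,
    then hand the whole window to the incremental classifier."""
    window = []
    for point in stream:
        window.append(point)
        if len(window) == wsize:
            update_model(window)  # all points in the window processed
            window.clear()
    if window:                    # flush a final, partially full window
        update_model(window)
```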

  21. I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams (figure: a stream window of size wsize)

  22. I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams (figure: new data entering the window, old data leaving) At each step, data in the window are processed with the incremental learning classifier… and hyperplanes are built

  23. I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams Step by step, new points are processed… and I-ReGEC updates the hyperplane configuration

  24. I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams But not all points are considered… Some of them are discarded if their information contribution is useless

  25. I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams New unknown incoming points are classified by their distance from the hyperplanes

  26. I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams An incremental learning technique based on ReGEC that determines the classification model from a very small sample of the data stream

  27. Experiments • Large-noisy-crossed-norm data set: 200,000 points with 20 features, equally divided into 2 classes • 100,000 training points • 100,000 test points • Each class is drawn from a multivariate normal distribution (an illustrative generation sketch follows)
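The slide does not give the distribution parameters, so the means and covariances below are pure assumptions for illustration; only the sizes (100,000 training points, 20 features, 2 balanced classes) come from the slide.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, d = 100_000, 20  # sizes from the slide

# Assumed parameters: the slide only says "multivariate normal"
mu_A, mu_B = np.full(d, 1.0), np.full(d, -1.0)
cov = np.eye(d)

A = rng.multivariate_normal(mu_A, cov, size=n_train // 2)
B = rng.multivariate_normal(mu_B, cov, size=n_train // 2)
X_train = np.vstack([A, B])
y_train = np.hstack([np.zeros(n_train // 2), np.ones(n_train // 2)])
```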

  28. Misclassification results SI-ReGEC has the lowest error and uses the smallest incremental set • B: Batch SVM • ED: Error-driven KNN • FP: Fixed-partition SVM • EM: Exceeding-margin SVM • EM+E: Fixed-margin + errors

  29. Window size Larger windows lead to smaller training subsets, while execution time increases with the window size. One billion elements per day can be processed on standard hardware.

  30. Conclusion and future work • The classification accuracy of I-ReGEC compares well with that of other methods • I-ReGEC produces small incremental training sets • Future work: investigate how to dynamically adapt the window size to the stream rate and to nonstationary data streams

  31. Incremental learning in data stream analysis Mario.Guarracino@cnr.it High Performance Computing and Networking Institute, National Research Council – Naples, ITALY http://www.na.icar.cnr.it/~mariog
