

  1. Data Stream Mining and Incremental Discretization • John Russo • CS561 Final Project • April 26, 2007

  2. Overview • Introduction • Data Mining: A Brief Overview • Histograms • Challenges of Streaming Data to Data Mining • Using Histograms for Incremental Discretization of Data Streams • Fuzzy Histograms • Future Work

  3. Introduction • Data mining • Class of algorithms for knowledge discovery • Patterns, trends, predictions • Utilizes statistical methods, neural networks, genetic algorithms, decision trees, etc. • Streaming data presents unique challenges to traditional data mining • Non-persistence – only one opportunity to mine each element • High data rates • Continuous (non-discrete) values • Data distributions that change over time • Huge volumes of data

  4. Data Mining: Types of Relationships • Classes • Predetermined groups • Clusters • Groups of related data • Sequential Patterns • Used to predict behavior • Associations • Rules are built from associations between data

  5. Data Mining: Algorithms • K-means clustering • Unsupervised learning algorithm • Partitions a data set into a pre-defined number of clusters • Decision Trees • Used to generate rules for classification • Two common types: • CART • CHAID • Nearest Neighbor • Classifies a record based upon the most similar records in a historical dataset
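To make the clustering bullet concrete, here is a minimal 1-D k-means sketch in Python; the readings and the choice of k = 2 are invented for illustration:

```python
import random

def kmeans_1d(points, k, iterations=20):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    centroids = random.sample(points, k)          # initialize from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical sensor readings, grouped into k = 2 pre-defined clusters.
readings = [30.1, 31.5, 29.8, 80.2, 82.0, 79.5]
print(kmeans_1d(readings, k=2)[0])
```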

  6. Data Mining: Algorithms (continued) • Rule Induction • Uses statistical significance to find interesting rules • Data Visualization • Uses graphical representations to let the analyst explore the data

  7. Histograms and Data Mining

  8. Histograms and Supervised Learning – An Example

  9. Histograms and Supervised Learning – An Example • We have two classes: • Mortgage approval = “yes” • P(mortgage approval = "Yes") = 5/10 = .5 • Mortgage approval = “no” • P(mortgage approval = "No") = 5/10 = .5 • Let’s calculate some of the conditional probabilities from the training data: • P(age<=30|mortgage approval = "Yes") = 2/5 = .4 • P(age<=30|mortgage approval = "No") = 2/5 = .4 • P(income="Low"|mortgage approval = "Yes") = 2/5 = .4 • P(income="Low"|mortgage approval = "No") = 2/5 = .4 • P(income="Medium"|mortgage approval = "Yes") = 1/5 = .2 • P(income="Medium"|mortgage approval = "No") = 1/5 = .2 • P(marital status="Married"|mortgage approval = "Yes") = 3/5 = .6 • P(marital status="Married"|mortgage approval = "No") = 3/5 = .6 • P(credit rating="Good"|mortgage approval = "Yes") = 1/5 = .2 • P(credit rating="Good"|mortgage approval = "No") = 2/5 = .4

  10. Histograms and Supervised Learning – An Example • We will use Bayes’ rule and the naïve assumption that all attributes are independent • The evidence term P(A1=a1, ..., Ak=ak) can be ignored, since it is the same for every class • Now, let’s predict the class for one observation: • X = (age<=30, income="medium", marital status = "married", credit rating = "good")
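The formula itself appears to have been an image on the original slide; a standard reconstruction of the naïve Bayes decision rule it describes:

```latex
% Naive Bayes decision rule: pick the class c that maximizes the
% posterior. The evidence P(A_1=a_1,\dots,A_k=a_k) is dropped
% because it is identical for every class.
\[
  \hat{c} \;=\; \arg\max_{c} \; P(C=c) \prod_{i=1}^{k} P(A_i = a_i \mid C = c)
\]
```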

  11. Histograms and Supervised Learning – An Example • P(X|mortgage approval = "Yes") = .4 * .2 * .6 * .2 = 0.0096 • P(X|mortgage approval = "No") = .4 * .2 * .6 * .4 = 0.0192 • P(X|C=c) * P(C=c): • "Yes": 0.0096 * .5 = 0.0048 • "No": 0.0192 * .5 = 0.0096 • X belongs to the “no” class. • The probabilities are determined by frequency counts; the frequencies are tabulated in bins. • Two common types of histograms • Equal-width – the range of observed values is divided into k equal intervals • Equal-frequency – the frequencies are equal in all bins • The difficulty is determining the number of bins, k • Sturges’ rule • Scott’s rule • Determining k for a data stream is problematic, since the number of observations is unbounded
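For reference, the two bin-count rules named above, as given in [2] and [3] (n is the number of observations, s the sample standard deviation); both assume n is known in advance, which is exactly what a stream does not provide:

```latex
% Sturges' rule [2]: bin count k from the number of observations n.
\[
  k \;=\; \lceil 1 + \log_2 n \rceil
\]
% Scott's rule [3]: bin width h from the sample standard deviation s;
% the bin count then follows from the observed range.
\[
  h \;=\; 3.49\, s\, n^{-1/3},
  \qquad
  k \;=\; \left\lceil \frac{x_{\max} - x_{\min}}{h} \right\rceil
\]
```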

  12. Challenges of Streaming Data to Data Mining • Determining k for a histogram or machine learning • Concept drift • Data from the past is no longer valid for the model today • Several approaches • Incremental learning – CVFDT (Concept-adapting Very Fast Decision Tree) [6] • Ensemble classifiers [7] • Ambiguous decision trees [8] • What about the “ebb and flow” problem?

  13. Incremental Discretization • A way to create discrete intervals from a data stream • Partition Incremental Discretization (PID) algorithm (Gama and Pinto [9]) • Two-level algorithm • Level 1 creates intervals in only one pass over the stream • Level 2 aggregates the level-1 intervals into a smaller set of final intervals

  14. Incremental Discretization: Example

  15. Incremental Discretization: Example • Sensor data reporting air temperature, soil moisture, and flow of water in a sprinkler • The data shown in the previous slide is the training data • Once trained, the model can predict the sprinkler setting based upon current conditions • A 4-class problem

  16. Incremental Discretization: Example • We will walk through level 1 for the temperature attribute • Decide on an estimated range: 30–85 • Pick the number of intervals (11), so the step is 5 • Keep two vectors: breaks and counts • Set a threshold for splitting an interval: 33% of all observed values • Work through the training set one value at a time: • If a value falls below the lower bound of the range, add a new interval before the first interval • If a value falls above the upper bound of the range, add a new interval after the last interval • If an interval’s count reaches the threshold, split it evenly and divide its count between the old interval and the new one
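A minimal Python sketch of this level-1 update, using the slide's numbers (range 30–85, 11 intervals, step 5, 33% split threshold). The class name, the midpoint split, and the exact handling of out-of-range values are illustrative assumptions rather than the paper's pseudocode; see Gama and Pinto [9] for the definitive version:

```python
import bisect

class PIDLayer1:
    """Sketch of PID level 1: two vectors, 'breaks' holding interval
    upper bounds and 'counts' holding per-interval frequencies.
    Each stream value is processed exactly once."""

    def __init__(self, lo=30.0, hi=85.0, n_intervals=11, threshold=1/3):
        self.step = (hi - lo) / n_intervals          # 5 for the slide's numbers
        self.lo = lo                                 # global lower bound
        self.breaks = [lo + self.step * (i + 1) for i in range(n_intervals)]
        self.counts = [0.0] * n_intervals
        self.threshold = threshold                   # 33% of observed values
        self.total = 0

    def update(self, x):
        self.total += 1
        if x <= self.lo:                             # below range: prepend interval
            self.breaks.insert(0, self.lo)
            self.counts.insert(0, 1.0)
            self.lo = x - self.step
        elif x > self.breaks[-1]:                    # above range: append interval
            self.breaks.append(x)
            self.counts.append(1.0)
        else:
            i = bisect.bisect_left(self.breaks, x)   # interval containing x
            self.counts[i] += 1.0
            if self.counts[i] > self.threshold * self.total:
                self._split(i)

    def _split(self, i):
        # Split interval i at its midpoint and halve its count.
        left = self.lo if i == 0 else self.breaks[i - 1]
        self.breaks.insert(i, (left + self.breaks[i]) / 2)
        self.counts[i] /= 2.0
        self.counts.insert(i, self.counts[i])

# Train on a few hypothetical temperature readings.
layer1 = PIDLayer1()
for t in [55.2, 61.0, 58.4, 90.1, 28.7, 57.9]:
    layer1.update(t)
print(layer1.breaks)
print(layer1.counts)
```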

  17. Incremental Discretization: Example • Breaks vector for our sample after training • Counts vector for our sample after training

  18. Second Layer • The second layer is invoked whenever necessary: • By user intervention • By changes in the intervals of the first layer • Input • The breaks and counts vectors from layer 1 • The type of histogram to be generated

  19. Second Layer • The objective is to create a smaller number of intervals based upon the layer-1 intervals • For equal-width histograms: • Computes the interval boundaries from the range observed in layer 1 • Traverses the vector of breaks once, adding the counts of consecutive layer-1 intervals • For equal-frequency histograms: • Computes the exact number of data points each interval should hold • Traverses the counts vector, adding counts over consecutive intervals • Closes each layer-2 interval when the target frequency is reached
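A sketch of both layer-2 aggregations, under the simplifying assumption that each layer-1 interval is assigned whole to one layer-2 interval (so equal-frequency boundaries land on layer-1 breaks and are approximate); the function names and sample vectors are mine:

```python
def layer2_equal_width(breaks, counts, lo, k):
    """Single pass over the level-1 breaks: each level-1 interval's
    count is added to the equal-width level-2 bin containing its
    upper bound."""
    width = (breaks[-1] - lo) / k
    edges = [lo + width * (i + 1) for i in range(k)]
    out = [0.0] * k
    j = 0
    for b, c in zip(breaks, counts):
        while j < k - 1 and b > edges[j]:
            j += 1                      # advance to the bin containing b
        out[j] += c
    return edges, out

def layer2_equal_freq(breaks, counts, k):
    """Accumulate consecutive level-1 counts and close a level-2
    interval whenever the per-interval target frequency is reached."""
    target = sum(counts) / k            # exact number of points per interval
    edges, acc = [], 0.0
    for b, c in zip(breaks, counts):
        acc += c
        if acc >= target and len(edges) < k - 1:
            edges.append(b)             # level-2 boundary at a level-1 break
            acc = 0.0
    edges.append(breaks[-1])
    return edges

# Hypothetical layer-1 output, aggregated into 3 final intervals.
breaks = [35.0, 40.0, 45.0, 50.0, 55.0, 60.0]
counts = [1.0, 2.0, 5.0, 7.0, 4.0, 1.0]
print(layer2_equal_width(breaks, counts, lo=30.0, k=3))
print(layer2_equal_freq(breaks, counts, k=3))
```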

  20. Application of PID for Data Mining • Add a data structure to both layer 1 and layer 2 • A matrix: • Columns: intervals • Rows: classes • Naïve Bayesian classification can then be done directly from the interval counts
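A sketch of how such a matrix might back naïve Bayes classification; the class and method names are hypothetical, and only a single attribute is shown (a full system would keep one matrix per attribute and multiply the per-attribute likelihoods):

```python
from collections import defaultdict

class IntervalNB:
    """Matrix with one row per class and one column per interval,
    holding frequency counts, as described on the slide above."""

    def __init__(self, n_intervals):
        self.matrix = defaultdict(lambda: [0] * n_intervals)
        self.class_totals = defaultdict(int)

    def observe(self, cls, interval):
        """Record one training example falling in 'interval' with label 'cls'."""
        self.matrix[cls][interval] += 1
        self.class_totals[cls] += 1

    def predict(self, interval):
        """Naive Bayes: argmax over classes of prior * likelihood."""
        n = sum(self.class_totals.values())
        def score(cls):
            prior = self.class_totals[cls] / n
            likelihood = self.matrix[cls][interval] / self.class_totals[cls]
            return prior * likelihood
        return max(self.matrix, key=score)

# Hypothetical temperature intervals 0..3 and sprinkler classes.
nb = IntervalNB(n_intervals=4)
for cls, interval in [("off", 0), ("off", 0), ("low", 1), ("high", 3)]:
    nb.observe(cls, interval)
print(nb.predict(0))   # -> "off"
```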

  21. Example Matrix: Temperature Attribute

  22. Dealing with Concept Drift • What happens when the training is no longer valid (for example, in winter)? • Assume the sensors are still on in winter but the sprinklers are not

  23. Dealing with Concept Drift: Fuzzy Histograms • Fuzzy histograms are used for visual content representation [10] • A given attribute value can be a member of more than one interval • With varying degrees of membership • The degree of membership is determined by a membership function

  24. Fuzzy Histograms with PID • Use a membership function to build layer-2 intervals based upon a determinant in layer 1 • Sprinkler example • A soil-moisture reading is potentially a member of more than one interval • One of those intervals represents a high value • During winter, ensure that all moisture values fall into the highest end of the range
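One common way to realize such a membership function is a triangular shape; the sketch below shows a soil-moisture reading belonging to two intervals at once. The bin boundaries are invented for illustration and are not taken from the PID paper or from [10]:

```python
def triangular(x, left, peak, right):
    """Triangular membership function: 1.0 at the peak, falling
    linearly to 0.0 at the left and right edges."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical fuzzy intervals for soil moisture (percent).
fuzzy_bins = {
    "low":    (0.0, 15.0, 45.0),
    "medium": (25.0, 50.0, 75.0),
    "high":   (55.0, 85.0, 100.0),
}

reading = 70.0
memberships = {name: triangular(reading, *abc)
               for name, abc in fuzzy_bins.items()}
print(memberships)   # medium -> 0.2, high -> 0.5: member of two intervals
```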

  25. References
  • [1] Hand, D., Mannila, H., and Smyth, P. Principles of Data Mining. Cambridge, MA: MIT Press, 2001.
  • [2] Sturges, H. A. (1926). The choice of a class-interval. Journal of the American Statistical Association, 21, 65–66.
  • [3] Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66, 605–610.
  • [4] Freedman, D. and Diaconis, P. (1981). On the histogram as a density estimator: L2 theory. Probability Theory and Related Fields, 57(4), 453–476.
  • [5] Zhang, J., Liu, H., and Wang, P. P. (2006). Some current issues of streaming data mining. Information Sciences, 176(14), 1949–1951.
  • [6] Hulten, G., Spencer, L., and Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), San Francisco, CA. ACM Press, New York, NY, 97–106.
  • [7] Wang, H., Fan, W., Yu, P. S., and Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), Washington, D.C. ACM Press, New York, NY, 226–235.
  • [8] Natwichai, J. and Li, X. (2004). Knowledge maintenance on data streams with concept drifting. In Zhang, J., He, J., and Fu, Y. (eds.), Shanghai, China, 705–710.
  • [9] Gama, J. and Pinto, C. (2006). Discretization from data streams: applications to histograms and data mining. In Proceedings of the 2006 ACM Symposium on Applied Computing (SAC '06), Dijon, France. ACM Press, New York, NY, 662–667.
  • [10] Doulamis, A. and Doulamis, N. (2001). Fuzzy histograms for efficient visual content representation: application to content-based image retrieval. In IEEE International Conference on Multimedia and Expo (ICME '01), p. 227. IEEE Press.
  • [11] Gaber, M. M., Zaslavsky, A., and Krishnaswamy, S. (2005). Mining data streams: a review. SIGMOD Record, 34(2), 18–26.

  26. Questions?
