

  1. Data Stream Mining and Incremental Discretization • John Russo • CS561 Final Project • April 26, 2007

  2. Overview • Introduction • Data Mining: A Brief Overview • Histograms • Challenges of Streaming Data to Data Mining • Using Histograms for Incremental Discretization of Data Streams • Fuzzy Histograms • Future Work

  3. Introduction • Data mining • Class of algorithms for knowledge discovery • Patterns, trends, predictions • Utilizes statistical methods, neural networks, genetic algorithms, decision trees, etc. • Streaming data presents unique challenges to traditional data mining • Non-persistence – only one opportunity to mine each element • High data rates • Continuous (non-discrete) values • Data distributions that change over time • Huge volumes of data

  4. Data Mining: Types of Relationships • Classes • Predetermined groups • Clusters • Groups of related data • Sequential Patterns • Used to predict behavior • Associations • Rules are built from associations between data

  5. Data Mining: Algorithms • K-means clustering • Unsupervised learning algorithm • Partitions a data set into a pre-defined number of clusters • Decision Trees • Used to generate rules for classification • Two common types: • CART • CHAID • Nearest Neighbor • Classifies a record based upon the most similar records in a historical dataset
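To make the clustering bullet concrete, here is a minimal 1-D k-means sketch in Python; the readings and the choice of k = 2 are invented for illustration:

```python
import random

def kmeans_1d(points, k, iterations=20):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    centroids = random.sample(points, k)          # initialize from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical sensor readings, grouped into k = 2 pre-defined clusters.
readings = [30.1, 31.5, 29.8, 80.2, 82.0, 79.5]
print(kmeans_1d(readings, k=2)[0])
```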

  6. Data Mining: Algorithms (continued) • Rule Induction • Uses statistical significance to find interesting rules • Data Visualization • Uses graphical representations to let the analyst explore the data

  7. Histograms and Data Mining

  8. Histograms and Supervised Learning – An Example

  9. Histograms and Supervised Learning – An Example • We have two classes: • Mortgage approval = “yes” • P(mortgage approval = "Yes") = 5/10 = .5 • Mortgage approval = “no” • P(mortgage approval = "No") = 5/10 = .5 • Let’s calculate some of the conditional probabilities from the training data: • P(age<=30|mortgage approval = "Yes") = 2/5 = .4 • P(age<=30|mortgage approval = "No") = 2/5 = .4 • P(income="Low"|mortgage approval = "Yes") = 2/5 = .4 • P(income="Low"|mortgage approval = "No") = 2/5 = .4 • P(income="Medium"|mortgage approval = "Yes") = 1/5 = .2 • P(income="Medium"|mortgage approval = "No") = 1/5 = .2 • P(marital status="Married"|mortgage approval = "Yes") = 3/5 = .6 • P(marital status="Married"|mortgage approval = "No") = 3/5 = .6 • P(credit rating="Good"|mortgage approval = "Yes") = 1/5 = .2 • P(credit rating="Good"|mortgage approval = "No") = 2/5 = .4

  10. Histograms and Supervised Learning – An Example • We will use Bayes’ rule and the naïve assumption that all attributes are independent • The evidence term P(A1=a1, ..., Ak=ak) can be ignored, since it is the same for every class • Now, let’s predict the class for one observation: • X = (age<=30, income="medium", marital status = "married", credit rating = "good")
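The formula itself appears to have been an image on the original slide; a standard reconstruction of the naïve Bayes decision rule it describes:

```latex
% Naive Bayes decision rule: pick the class c that maximizes the
% posterior. The evidence P(A_1=a_1,\dots,A_k=a_k) is dropped
% because it is identical for every class.
\[
  \hat{c} \;=\; \arg\max_{c} \; P(C=c) \prod_{i=1}^{k} P(A_i = a_i \mid C = c)
\]
```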

  11. Histograms and Supervised Learning – An Example • P(X|mortgage approval = "Yes") = .4 * .2 * .6 * .2 = 0.0096 • P(X|mortgage approval = "No") = .4 * .2 * .6 * .4 = 0.0192 • P(X|C=c) * P(C=c): • "Yes": 0.0096 * .5 = 0.0048 • "No": 0.0192 * .5 = 0.0096 • X belongs to the “no” class. • The probabilities are determined by frequency counts; the frequencies are tabulated in bins. • Two common types of histograms • Equal-width – the range of observed values is divided into k equal intervals • Equal-frequency – the frequencies are equal in all bins • The difficulty is determining the number of bins, k • Sturges’ rule • Scott’s rule • Determining k for a data stream is problematic, since the number of observations is unbounded
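For reference, the two bin-count rules named above, as given in [2] and [3] (n is the number of observations, s the sample standard deviation); both assume n is known in advance, which is exactly what a stream does not provide:

```latex
% Sturges' rule [2]: bin count k from the number of observations n.
\[
  k \;=\; \lceil 1 + \log_2 n \rceil
\]
% Scott's rule [3]: bin width h from the sample standard deviation s;
% the bin count then follows from the observed range.
\[
  h \;=\; 3.49\, s\, n^{-1/3},
  \qquad
  k \;=\; \left\lceil \frac{x_{\max} - x_{\min}}{h} \right\rceil
\]
```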

  12. Challenges of Streaming Data to Data Mining • Determining k for a histogram or machine learning • Concept drift • Data from the past is no longer valid for the model today • Several approaches • Incremental learning – CVFDT (Concept-adapting Very Fast Decision Tree) [6] • Ensemble classifiers [7] • Ambiguous decision trees [8] • What about the “ebb and flow” problem?

  13. Incremental Discretization • A way to create discrete intervals from a data stream • Partition Incremental Discretization (PID) algorithm (Gama and Pinto [9]) • Two-level algorithm • Level 1 creates intervals in only one pass over the stream • Level 2 aggregates the level-1 intervals into a smaller set of final intervals

  14. Incremental Discretization: Example

  15. Incremental Discretization: Example • Sensor data reporting air temperature, soil moisture, and flow of water in a sprinkler • The data shown in the previous slide is the training data • Once trained, the model can predict the sprinkler setting based upon current conditions • A 4-class problem

  16. Incremental Discretization: Example • We will walk through level 1 for the temperature attribute • Decide on an estimated range: 30–85 • Pick the number of intervals (11), so the step is 5 • Keep two vectors: breaks and counts • Set a threshold for splitting an interval: 33% of all observed values • Work through the training set one value at a time: • If a value falls below the lower bound of the range, add a new interval before the first interval • If a value falls above the upper bound of the range, add a new interval after the last interval • If an interval’s count reaches the threshold, split it evenly and divide its count between the old interval and the new one
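A minimal Python sketch of this level-1 update, using the slide's numbers (range 30–85, 11 intervals, step 5, 33% split threshold). The class name, the midpoint split, and the exact handling of out-of-range values are illustrative assumptions rather than the paper's pseudocode; see Gama and Pinto [9] for the definitive version:

```python
import bisect

class PIDLayer1:
    """Sketch of PID level 1: two vectors, 'breaks' holding interval
    upper bounds and 'counts' holding per-interval frequencies.
    Each stream value is processed exactly once."""

    def __init__(self, lo=30.0, hi=85.0, n_intervals=11, threshold=1/3):
        self.step = (hi - lo) / n_intervals          # 5 for the slide's numbers
        self.lo = lo                                 # global lower bound
        self.breaks = [lo + self.step * (i + 1) for i in range(n_intervals)]
        self.counts = [0.0] * n_intervals
        self.threshold = threshold                   # 33% of observed values
        self.total = 0

    def update(self, x):
        self.total += 1
        if x <= self.lo:                             # below range: prepend interval
            self.breaks.insert(0, self.lo)
            self.counts.insert(0, 1.0)
            self.lo = x - self.step
        elif x > self.breaks[-1]:                    # above range: append interval
            self.breaks.append(x)
            self.counts.append(1.0)
        else:
            i = bisect.bisect_left(self.breaks, x)   # interval containing x
            self.counts[i] += 1.0
            if self.counts[i] > self.threshold * self.total:
                self._split(i)

    def _split(self, i):
        # Split interval i at its midpoint and halve its count.
        left = self.lo if i == 0 else self.breaks[i - 1]
        self.breaks.insert(i, (left + self.breaks[i]) / 2)
        self.counts[i] /= 2.0
        self.counts.insert(i, self.counts[i])

# Train on a few hypothetical temperature readings.
layer1 = PIDLayer1()
for t in [55.2, 61.0, 58.4, 90.1, 28.7, 57.9]:
    layer1.update(t)
print(layer1.breaks)
print(layer1.counts)
```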

  17. Incremental Discretization: Example • Breaks vector for our sample after training • Counts vector for our sample after training

  18. Second Layer • The second layer is invoked whenever necessary: • By user intervention • By changes in the intervals of the first layer • Input • The breaks and counts vectors from layer 1 • The type of histogram to be generated

  19. Second Layer • The objective is to create a smaller number of intervals based upon the layer-1 intervals • For equal-width histograms: • Computes the interval boundaries from the range observed in layer 1 • Traverses the vector of breaks once, adding the counts of consecutive layer-1 intervals • For equal-frequency histograms: • Computes the exact number of data points each interval should hold • Traverses the counts vector, adding counts over consecutive intervals • Closes each layer-2 interval when the target frequency is reached
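A sketch of both layer-2 aggregations, under the simplifying assumption that each layer-1 interval is assigned whole to one layer-2 interval (so equal-frequency boundaries land on layer-1 breaks and are approximate); the function names and sample vectors are mine:

```python
def layer2_equal_width(breaks, counts, lo, k):
    """Single pass over the level-1 breaks: each level-1 interval's
    count is added to the equal-width level-2 bin containing its
    upper bound."""
    width = (breaks[-1] - lo) / k
    edges = [lo + width * (i + 1) for i in range(k)]
    out = [0.0] * k
    j = 0
    for b, c in zip(breaks, counts):
        while j < k - 1 and b > edges[j]:
            j += 1                      # advance to the bin containing b
        out[j] += c
    return edges, out

def layer2_equal_freq(breaks, counts, k):
    """Accumulate consecutive level-1 counts and close a level-2
    interval whenever the per-interval target frequency is reached."""
    target = sum(counts) / k            # exact number of points per interval
    edges, acc = [], 0.0
    for b, c in zip(breaks, counts):
        acc += c
        if acc >= target and len(edges) < k - 1:
            edges.append(b)             # level-2 boundary at a level-1 break
            acc = 0.0
    edges.append(breaks[-1])
    return edges

# Hypothetical layer-1 output, aggregated into 3 final intervals.
breaks = [35.0, 40.0, 45.0, 50.0, 55.0, 60.0]
counts = [1.0, 2.0, 5.0, 7.0, 4.0, 1.0]
print(layer2_equal_width(breaks, counts, lo=30.0, k=3))
print(layer2_equal_freq(breaks, counts, k=3))
```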

  20. Application of PID for Data Mining • Add a data structure to both layer 1 and layer 2 • A matrix: • Columns: intervals • Rows: classes • Naïve Bayesian classification can then be done directly from the interval counts
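A sketch of how such a matrix might back naïve Bayes classification; the class and method names are hypothetical, and only a single attribute is shown (a full system would keep one matrix per attribute and multiply the per-attribute likelihoods):

```python
from collections import defaultdict

class IntervalNB:
    """Matrix with one row per class and one column per interval,
    holding frequency counts, as described on the slide above."""

    def __init__(self, n_intervals):
        self.matrix = defaultdict(lambda: [0] * n_intervals)
        self.class_totals = defaultdict(int)

    def observe(self, cls, interval):
        """Record one training example falling in 'interval' with label 'cls'."""
        self.matrix[cls][interval] += 1
        self.class_totals[cls] += 1

    def predict(self, interval):
        """Naive Bayes: argmax over classes of prior * likelihood."""
        n = sum(self.class_totals.values())
        def score(cls):
            prior = self.class_totals[cls] / n
            likelihood = self.matrix[cls][interval] / self.class_totals[cls]
            return prior * likelihood
        return max(self.matrix, key=score)

# Hypothetical temperature intervals 0..3 and sprinkler classes.
nb = IntervalNB(n_intervals=4)
for cls, interval in [("off", 0), ("off", 0), ("low", 1), ("high", 3)]:
    nb.observe(cls, interval)
print(nb.predict(0))   # -> "off"
```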

  21. Example Matrix: Temperature Attribute

  22. Dealing with Concept Drift • What happens when the training is no longer valid (for example, in winter)? • Assume the sensors are still on in winter but the sprinklers are not

  23. Dealing with Concept Drift: Fuzzy Histograms • Fuzzy histograms are used for visual content representation [10] • A given attribute value can be a member of more than one interval • With varying degrees of membership • The degree of membership is determined by a membership function

  24. Fuzzy Histograms with PID • Use a membership function to build layer-2 intervals based upon a determinant in layer 1 • Sprinkler example • A soil-moisture reading is potentially a member of more than one interval • One of those intervals represents a high value • During winter, ensure that all moisture values fall into the highest end of the range
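One common way to realize such a membership function is a triangular shape; the sketch below shows a soil-moisture reading belonging to two intervals at once. The bin boundaries are invented for illustration and are not taken from the PID paper or from [10]:

```python
def triangular(x, left, peak, right):
    """Triangular membership function: 1.0 at the peak, falling
    linearly to 0.0 at the left and right edges."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical fuzzy intervals for soil moisture (percent).
fuzzy_bins = {
    "low":    (0.0, 15.0, 45.0),
    "medium": (25.0, 50.0, 75.0),
    "high":   (55.0, 85.0, 100.0),
}

reading = 70.0
memberships = {name: triangular(reading, *abc)
               for name, abc in fuzzy_bins.items()}
print(memberships)   # medium -> 0.2, high -> 0.5: member of two intervals
```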

  25. References
  • [1] Hand, D., Mannila, H., and Smyth, P. Principles of Data Mining. Cambridge, MA: MIT Press, 2001.
  • [2] Sturges, H. A. (1926). The choice of a class-interval. Journal of the American Statistical Association, 21, 65–66.
  • [3] Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66, 605–610.
  • [4] Freedman, D. and Diaconis, P. (1981). On the histogram as a density estimator: L2 theory. Probability Theory and Related Fields, 57(4), 453–476.
  • [5] Zhang, J., Liu, H., and Wang, P. P. (2006). Some current issues of streaming data mining. Information Sciences, 176(14), 1949–1951.
  • [6] Hulten, G., Spencer, L., and Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), San Francisco, CA. ACM Press, New York, NY, 97–106.
  • [7] Wang, H., Fan, W., Yu, P. S., and Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), Washington, D.C. ACM Press, New York, NY, 226–235.
  • [8] Natwichai, J. and Li, X. (2004). Knowledge maintenance on data streams with concept drifting. In Zhang, J., He, J., and Fu, Y. (eds.), Shanghai, China, 705–710.
  • [9] Gama, J. and Pinto, C. (2006). Discretization from data streams: applications to histograms and data mining. In Proceedings of the 2006 ACM Symposium on Applied Computing (SAC '06), Dijon, France. ACM Press, New York, NY, 662–667.
  • [10] Doulamis, A. and Doulamis, N. (2001). Fuzzy histograms for efficient visual content representation: application to content-based image retrieval. In IEEE International Conference on Multimedia and Expo (ICME '01), p. 227. IEEE Press.
  • [11] Gaber, M. M., Zaslavsky, A., and Krishnaswamy, S. (2005). Mining data streams: a review. SIGMOD Record, 34(2), 18–26.

  26. Questions?
