Clustering Data Streams

Clustering Data Streams Chun Wei Dept Computer & Information Technology Advisor: Dr. Sprague

Data Stream • Massive data sets accumulated at an astonishing rate. • Examples: • Tracking network data to study change in traffic patterns and possible intrusions • Tracking meteorological data, such as temperatures

NASA MISR satellite Collects several TB of satellite imagery data daily

Challenges • Compactness of data representation • Fast, incremental processing of new data points (one-pass and linear access of data) • Clear and fast identification of changes in evolving clustering models

Compactness • Utilize a data structure that summarizes a group of data points, minimizing the storage space • The space required does not grow appreciably with the number of points processed

Incremental Processing of data • When clustering new data points, the algorithm should not require comparison with all the points processed in the past • The data must be processed as they are produced. Linear scan is required, random access is prohibitively expensive.

Identification of Changes The algorithm must be able to: • diagnose changes in evolving data streams • distinguish outliers from data points that represent a new cluster

Current Algorithms • BIRCH • STREAM • CLU-STREAM • …

BIRCH • Use CF vectors to store data • CF = (N, ∑Xi2 , ∑ Xi ) Xiis a vector • Store the number of points, the linear sum and the square sum of all data points in a micro-cluster • Sufficient to calculate centroids, radius, diameter and distances

B-Tree Root 29 7 16 22 39* 3* 5* 19* 20* 22* 24* 27* 38* 2* 7* 14* 16* 29* 33* 34*

CF3 CF6 Non-Leaf node CF1 CF2 CF3 CF4 CF5 CF6 Leaf node Building of CF Tree • B-Tree, with a branch factor B, threshold T and L maximum number of entries in a leave node

Adjusting CF Tree • Increases the threshold T so that each leaf entry to absorb more points. T can be set as radius or diameter. • Leaf entries with “far fewer” points are regarded as “outliers” and written back to disk.

STREAM • Process data streams in batches of points • Use weighted centroids Ci to represent ith batch of points. • Recursively cluster the weighted centroids until k-clusters

Problems with BIRCH & STREAM • Old data points are equally important as new data points • May not be able to detect new trends in evolving data stream

CLU-STREAM • Also use CF vectors to store data summary • Use time stamps to record the elapsed time from the beginning • Take snapshots at different time stamp, favoring the most recent data (Snapshot: micro-clusters stored at particular moments in the stream)

CLU-STREAM (continue) • A snapshot contains q micro-clusters, q depends on the memory available • New data points will be assigned to one of the micro-clusters in previous snapshot if it falls within the maximum boundary of that micro-cluster.

CLU-STREAM (continue) • If a new data points fails to fit into any current cluster, a new cluster is created, and an existing one is deleted or two merged. • A cluster is removed if the average time-stamp when it absorbs m new data points is least recent.

Detect New Trends • Comparing clustering results from snapshots to snapshots reveals trends in evolving data stream.

References • Aggarwal, C. C., Han J., Wang, J. & Yu, P. S. (2003). A Framework for Clustering Evolving Data Stream. In Proc. of the 29th VLDB Conference. • Barbara, D. (2003). Requirements for Clustering Data Streams. SIGKDD Explorations, 3 (2), 23-27. • Ester M., Kriegel H.-P., Sander J. & Xu X (1998). Clustering for Mining in Large Spatial Databases. Special Issue on Data Mining, KI-Journal, ScienTec Publishing, No. 1. • Guha, S., Meyerson, A., Mishra, N. & Motwani, R., Callaghan, L. (2003). Clustering Data Streams: Theory and Practice. IEEE Transactions on Knowledge and Data Engineering, 15 (3), 515-528. • Zhang, T., Ramakrishnan R. & Livny, M. (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proc. of ACM SIGMOD International Conference on Management of Data.

Clustering Data Streams