220 likes | 524 Vues
A Framework for Clustering Evolving Data Streams. Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad. Outline. Background of Clustering Motivation for Clustering over Streaming Data. Overall Solution Micro Clusters Pyramid Time Frame
E N D
A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad
Outline • Background of Clustering • Motivation for Clustering over Streaming Data. • Overall Solution • Micro Clusters • Pyramid Time Frame • Macro Cluster • Cluster Maintenance
Background of Clustering • Definition of Clustering • For a given set of data points, partitioning them into one or more groups of similar objects. • “Similarity” is often defined with the use of some distance measure. • Difference between “group by” queries and clustering.
Background of Clustering • Some of the most popular clustering algorithms: • K- Means, BIRCH, CURE, Density Based Clustering. • Clustering has many applications in data bases, information visualization, data mining. • What are Oultiers?
Motivation • Challenge in Streaming Environment: • Clustering is an expensive process. • Resource constraints. • Infinite streams. • Can simply extending one pass algorithms for static databases to stream processing suffice?
Motivation • Requirements of clustering for stream processing: • Statistical summary information storage. • Efficient update process. • Ability to cluster for a specific time horizon,
Overall Solution of the Paper • Divide the clustering process to two phases Online Component: periodically stores detailed summary statistics Offline Component uses only the summary statistics to do clustering
Micro-Clusters • What is a Micro-Cluster A Micro-Cluster is a set of individual data points that are close to each other and will be treated as a single unit in further offline Macro-clustering. View of Micro-Cluster View of Macro-Cluster
Micro-Clusters • What to Store in a Micro-Cluster = Key idea: Additivity Property
Pyramidal Time Frame • The micro-clusters are stored at snapshots. • The snapshots follow a pyramidal pattern … … Snapshot • When should we make the snapshot?
Pyramidal Time Frame • Snapshots are classified into different orders which can vary from 1 to log α(T). For example, T is 55, α=2, then we have orders 0 with interval 2^0=1, order 1 with interval 2^1=2, order 2 with interval 2^2=4, order 3 with interval 2^3=8, order 4 with interval 2^4=16, order 5 with interval 2^5=32. • For a data stream the maximum number of snap- shots maintained at T time units since the beginning of the stream mining process is (α + 1) log α(T). (α + 1 for each order)
Why Pyramidal Pattern? • For any user-specified time window of h, at least one stored snapshot can be found within 2 h units of the current time. Please Note: Only Approximate Answers!!!
Micro Cluster Creation • It is assumed that a total of q micro-clusters are maintained at any moment by the algorithm. • This is done using an offline process (k-means) at the very beginning of the data stream computation process.
Online Micro Cluster Maintenance • How to deal with a new coming point? • Join one of the old cluster • Create a new cluster by its own • How to deal with the old clusters • Delete them(based on relevance stamp) • Merge them (merge the closest two) A merged cluster will have all the IDs its components have
Macro-Cluster Creation • Based on the Additivity Property of cluster feature vector
Macro-Cluster Creation Current Time T, the window size is h. That means the user want to find the clusters formed in (T-h, T). Approach: • 1st step: Find the snapshot for T, get the micro-cluster set S(T). • 2nd step: Find the snapshot for T-h, get the micro-cluster set S(T-h). • Use S(T)-S(T-h) Specifically, we have a merged cluster with Id list (C1, C2, C3) in S(T) and a cluster with Id C1 in S(T-h). Then the we use CFT(C1,C2,C3)-CFT(C1)=CFT(C2,C3), because C1 are formed before T-h, thus should not contribute to the micro-cluster formed in (T-h,T)
Example C_ID: [C1, C2, C3] C_ID: [C1] C_ID: [C2, C3] Time: T-h Result: T-h Time: T
Macro-Cluster Creation • Run K-means on Micro-Clusters
How do you feel about this paper? • My feeling: Quite Fuzzy Results: Approximation is every where. Nothing New: Micro-Clusters, K-means, Cluster Feature Vectors, Pyramidal Time Frame are all old stuffs.
Counter Example C_ID: [C1, C2, C3] C_ID: [C2] C_ID: [C1, C3] Result Time: T Time: T-h
Advertisement • Di and Charu’s project deals with: • Deterministic Clusters • Clusters with Arbitrary Shapes • RealExpirations • Disk Version • Outlier Detection by Free