Clustering over Multiple Evolving Streams by Events and Correlations

Mi-Yen Yeh, Bi-Ru Dai, Ming-Syan Chen
Electrical Engineering, National Taiwan University
IEEE Transactions on Knowledge and Data Engineering (TKDE), 2007

Outline
• Introduction
• Data Summarization
• Similarity Measurement
• COMET-CORE Framework
• Empirical Studies
• Conclusion

Introduction (1)
• Good clustering puts similar objects together and separates dissimilar ones into different clusters.
• Useful information from clusters
  • Data collection in sensor networks
  • Stock market trades
[Figure: objects A, B, C, D, E, F, G grouped into clusters]

Introduction (2)
• Online data summarization with offline clustering
• Periodic online clustering
[Figure callouts: "User", "Waste!!", "Lose Information!!"]

Introduction (3)
• COMET-CORE
  • Approximate the original data online with piecewise linear line segments
  • Update correlations when a stream encounters a new end point
  • Update clusters using the updated correlations
[Figure legend: data point, end point, update stream correlations]

Data Summarization (1)
• Problem Model
  • Γ = {S1, S2, …, Sn}: the set of data streams
  • Si = Si[1, …, t, …]: the i-th stream
  • Si[t]: arriving data of Si at time t
  • Si^app[t]: approximated data of Si at time t
  • End-point summary of stream Si
• Objective: given a set of data streams Γ and the threshold parameters, monitor the stream clusters online.

Data Summarization (2)
• Approximation Line Formulation
  • For a sub-stream Si[ts, …, te]
  • The parameters:
[Figure labels: segment end points (ts, Si[ts]) and (te, Si[te])]

Data Summarization (3)
• Error Function
• Error Threshold
  • It may not be easy to choose a proper absolute error threshold
  • Relative error threshold (e.g., an error of 2% of the square sum of the original data stream)

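Data Summarization (4)
• Online Linear Line Segment Approximation
[Figure: value vs. time. While the error is below the threshold δl, the current segment continues; when the error exceeds δl, a new end point is generated. End points are marked at times tv1, …, tvk]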
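The online segmentation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each segment is the straight line joining its first and last points, that the error is the sum of squared deviations from that line, and that the relative threshold is taken against the square sum of the points in the current segment. The names `interpolate`, `segment_error`, `online_end_points`, and `rel_threshold` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, float]


def interpolate(start: Point, end: Point, t: int) -> float:
    """Value at time t on the line through the two end points (assumed form)."""
    (ts, vs), (te, ve) = start, end
    if te == ts:
        return vs
    return vs + (ve - vs) / (te - ts) * (t - ts)


def segment_error(points: Sequence[Point]) -> float:
    """Sum of squared deviations from the line joining first and last point."""
    return sum((v - interpolate(points[0], points[-1], t)) ** 2 for t, v in points)


def online_end_points(stream: Sequence[float], rel_threshold: float = 0.02) -> List[Point]:
    """Scan the stream once and emit an end point whenever the error of the
    current segment exceeds rel_threshold times the segment's square sum."""
    end_points: List[Point] = []
    buffer: List[Point] = []
    for t, v in enumerate(stream):
        buffer.append((t, v))
        if len(buffer) == 1:
            end_points.append((t, v))  # the first point always opens a segment
            continue
        square_sum = sum(val * val for _, val in buffer)
        if segment_error(buffer) > rel_threshold * square_sum:
            previous = buffer[-2]  # the last point that still fitted the segment
            end_points.append(previous)
            buffer = [previous, (t, v)]
    if buffer and end_points[-1] != buffer[-1]:
        end_points.append(buffer[-1])  # close the final, still-open segment
    return end_points
```

For a stream that rises linearly and then falls linearly, such as `[0, 1, 2, 3, 4, 3, 2, 1, 0]`, this sketch keeps only the start, the peak, and the last point as end points.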

Similarity Measurement (1)
• Use Pearson correlation as the similarity measure
• Regard two streams as two different random variables

Similarity Measurement (2)
• Definition 4.2. Given two streams Si and Sj, and a weight function w(t), the weighted correlation coefficient between these two streams is defined as:
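The formula for Definition 4.2 is not preserved in this transcript. As a rough illustration only, the sketch below computes the standard weighted Pearson correlation of two equally long sequences; it is an assumption that the paper's definition has this general shape. The function name `weighted_corr` and the convention of returning 0.0 for a constant sequence are hypothetical choices.

```python
import math
from typing import Callable, Sequence


def weighted_corr(x: Sequence[float], y: Sequence[float],
                  w: Callable[[int], float]) -> float:
    """Standard weighted Pearson correlation with weights w(t), t = 0..len-1."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    weights = [w(t) for t in range(len(x))]
    total = sum(weights)
    mean_x = sum(wt * a for wt, a in zip(weights, x)) / total
    mean_y = sum(wt * b for wt, b in zip(weights, y)) / total
    cov = sum(wt * (a - mean_x) * (b - mean_y) for wt, a, b in zip(weights, x, y))
    var_x = sum(wt * (a - mean_x) ** 2 for wt, a in zip(weights, x))
    var_y = sum(wt * (b - mean_y) ** 2 for wt, b in zip(weights, y))
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_x * var_y)
```

A weight function such as `lambda t: 0.5 ** (n - 1 - t)` (with `n` the sequence length) would emphasize recent points; a constant weight reduces the result to the ordinary Pearson correlation.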

Similarity Measurement (3)
• Definition 4.3. Given two streams Si and Sj, and a weight function w(t), the WC vector of Si and Sj is defined as:

Similarity Measurement (4)
• Similarity Update
  • Update the WC vector when a new end point is generated
  • Linear scan of data streams → incremental update
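The idea of replacing a linear scan with an incremental update can be illustrated with running sums. The exact contents of the WC vector are not preserved in this transcript, so the class below is an assumption: it keeps the weighted sums that are sufficient to recompute a Pearson correlation in constant time per new point. `RunningStats`, `add`, `decay`, and `correlation` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Weighted running sums for one pair of streams (assumed summary)."""
    w: float = 0.0    # sum of weights
    sx: float = 0.0   # weighted sum of x
    sy: float = 0.0   # weighted sum of y
    sxy: float = 0.0  # weighted sum of x * y
    sxx: float = 0.0  # weighted sum of x squared
    syy: float = 0.0  # weighted sum of y squared

    def add(self, x: float, y: float, weight: float = 1.0) -> None:
        """Fold one new pair of values into the sums in O(1)."""
        self.w += weight
        self.sx += weight * x
        self.sy += weight * y
        self.sxy += weight * x * y
        self.sxx += weight * x * x
        self.syy += weight * y * y

    def decay(self, factor: float) -> None:
        """Scale all sums, so that older points count less than later ones."""
        self.w *= factor
        self.sx *= factor
        self.sy *= factor
        self.sxy *= factor
        self.sxx *= factor
        self.syy *= factor

    def correlation(self) -> float:
        """Pearson correlation recomputed from the sums alone."""
        if self.w == 0:
            return 0.0
        cov = self.sxy - self.sx * self.sy / self.w
        var_x = self.sxx - self.sx ** 2 / self.w
        var_y = self.syy - self.sy ** 2 / self.w
        if var_x <= 0 or var_y <= 0:
            return 0.0  # undefined for a constant stream
        return cov / math.sqrt(var_x * var_y)
```

Calling `decay` before each `add` gives an exponentially fading weight, which is one possible choice of w(t) and not necessarily the one used in the paper.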

COMET-CORE Framework (1)
• Definition 5.1. Assume that the centers of two clusters Ci and Cj are represented by the end-point sequences of streams Si and Sj, respectively. Then the WC vector of the two clusters is equal to the WC vector of Si and Sj, and the weighted correlation between Ci and Cj, denoted by wcorr(Ci, Cj), is equal to wcorr(Si, Sj).
• COMET-CORE
  • A stream encounters a new end point → split cluster → merge cluster

COMET-CORE Framework (2)
• Split cluster
[Figure: cluster Ck, containing trigger streams and non-trigger streams, is split. Steps shown: update weighted correlation; compare correlation with δa; compare the correlation between each non-trigger stream and the representative stream with δa. Labels: new trigger groups, Ctmp, Cnew1, Cnew2, Cnew3 (three new groups)]
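The split figure is only partly recoverable, so the following is a loose sketch rather than the procedure from the paper. It assumes that each group has a representative stream (here simply the first stream placed in the group) and that a stream joins a group only if its correlation with that representative is at least δa; otherwise it starts a new group. `split_cluster`, `corr`, and `delta_a` are hypothetical names.

```python
from typing import Callable, List, Sequence


def split_cluster(members: Sequence[str],
                  corr: Callable[[str, str], float],
                  delta_a: float) -> List[List[str]]:
    """Partition the streams of one cluster into groups whose members all
    correlate with the group's representative by at least delta_a."""
    groups: List[List[str]] = []
    for stream in members:
        for group in groups:
            representative = group[0]  # assumption: first member represents the group
            if corr(representative, stream) >= delta_a:
                group.append(stream)
                break
        else:
            groups.append([stream])  # no group fits: open a new one
    return groups
```

If every stream still correlates with the first one above δa, the function returns a single group, meaning the cluster is left unsplit.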

COMET-CORE Framework (3)
• Assign WC vectors to newly generated clusters
  • Type 1: Ci and Cj originally belonged to the same cluster.
  • Type 2: Ci and Cj originally belonged to different clusters.
  • Type 3: Ci is a newly generated cluster; Coo is an originally existing one.
[Figure: panels (a) Type 1, (b) Type 2, (c) Type 3. Labels: clusters Cx and Cy with streams S1, S2, S3, S4, S5, S6, S7 and S11, S12, S13, S14; clusters C1, C4, C6, C11, C14; stream groups S1,S2,S3; S4,S5; S6,S7; S11,S12; S13,S14]

COMET-CORE Framework (4)
• Merge Cluster
  • Performed after splitting and updating the inter-cluster correlations
  • Two clusters are merged if their correlation ≥ δe; merging repeats until no such cluster pair exists.
  • When C1 and C2 with wcorr(C1, C2) ≥ δe are merged into Cnew: wcorr(Cnew, Ck) = min(wcorr(C1, Ck), wcorr(C2, Ck))
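The merge rule can be sketched as a simple loop. The threshold test and the min rule for the merged cluster's correlation follow the slide; the order in which qualifying pairs are merged is not stated there, so picking the most correlated pair first is an assumption. `merge_clusters`, `names`, `corr`, and `delta_e` are hypothetical names, and a missing pair is treated as correlation -1.0.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple


def merge_clusters(names: Sequence[str],
                   corr: Dict[Tuple[str, str], float],
                   delta_e: float) -> List[List[str]]:
    """Merge clusters while some pair has correlation >= delta_e."""
    clusters: Dict[str, Set[str]] = {name: {name} for name in names}
    sim = {frozenset(pair): value for pair, value in corr.items()}

    def lookup(a: str, b: str) -> float:
        return sim.get(frozenset((a, b)), -1.0)

    while True:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            value = lookup(a, b)
            if value >= delta_e and (best is None or value > best[0]):
                best = (value, a, b)  # assumption: most correlated pair first
        if best is None:
            break  # no pair reaches the threshold any more
        _, a, b = best
        merged = a + "+" + b
        members = clusters.pop(a) | clusters.pop(b)
        for other in clusters:
            # min rule from the slide for the merged cluster's correlation
            sim[frozenset((merged, other))] = min(lookup(a, other), lookup(b, other))
        clusters[merged] = members
    return sorted(sorted(group) for group in clusters.values())
```

Because of the min rule, a merged cluster is joined with a third cluster only if both of its parts were already correlated with that cluster above δe.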

Empirical Studies (1)
• Clustering algorithms
  • Basic: periodic agglomerative clustering
  • ODAC: periodic hierarchical clustering
  • COMET-CORE
[Figure: flow from all streams to the clustering result, with the conditions "2Dis(P) - (Dis(C1) + Dis(C2)) < Threshold" and "Dissimilarity > Threshold"]

Empirical Studies (2)
• Clustering quality measurement
  • Silhouette Validation
    • a(Si) is the average dissimilarity of stream Si to all other streams in the same cluster
    • b(Si) is the average dissimilarity of stream Si to all streams in the closest other cluster
  • Cluster Silhouette
  • Global Silhouette
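The silhouette formulas are not preserved in this transcript. The sketch below uses the standard silhouette width s = (b - a) / max(a, b) with a and b as described on the slide, and assumes that the cluster silhouette is the mean over a cluster's streams and the global silhouette is the mean over clusters. The function name `silhouette`, the dissimilarity matrix `dis`, and the convention that a single-stream cluster scores 0.0 are hypothetical choices.

```python
from typing import List, Sequence, Tuple


def silhouette(dis: Sequence[Sequence[float]],
               clusters: Sequence[Sequence[int]]) -> Tuple[List[float], float]:
    """Return (cluster silhouettes, global silhouette) for stream indices
    grouped in `clusters`, given a square dissimilarity matrix `dis`."""
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    def average(i: int, members: Sequence[int]) -> float:
        others = [j for j in members if j != i]
        return sum(dis[i][j] for j in others) / len(others) if others else 0.0

    cluster_scores: List[float] = []
    for index, members in enumerate(clusters):
        scores = []
        for i in members:
            if len(members) == 1:
                scores.append(0.0)  # convention for single-stream clusters
                continue
            a = average(i, members)
            b = min(average(i, other)
                    for k, other in enumerate(clusters) if k != index)
            largest = max(a, b)
            scores.append((b - a) / largest if largest > 0 else 0.0)
        cluster_scores.append(sum(scores) / len(scores))
    return cluster_scores, sum(cluster_scores) / len(cluster_scores)
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below indicate streams that sit between clusters or in the wrong one.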

Empirical Studies (3)
• Evaluation on Real Data
  • δa = δe = 0.5
[Table: data sets]

Empirical Studies (4)
• Evaluation on Cylinder-Bell-Funnel Data Set
  • δa = δe = 0.8
  • 6 types of streams, each 128 points long
  • 100 streams for each type (600 streams in total)
  • Normally distributed random numbers ranging from 0 to 1 are added to each stream

Empirical Studies (5)
• Evaluation on Random Walk Data Set
  • δa = δe = 0.7
  • Period = 200 data points (Basic & ODAC)
  1. Number of streams
  2. Number of clusters
[Figure notes: 20000 points in each stream; fixed 500 streams; almost independent of the number of clusters]

Conclusion
• The paper proposes COMET-CORE, a novel and efficient online framework for clustering over multiple streams.
• COMET-CORE uses efficient split and merge algorithms to update clusters while keeping good clustering quality.
