Capturing Subspace Correlation with δ-Clusters in Large Datasets

δ-Clusters: Capturing Subspace Correlation in a Large Data Set
Authors: Jiong Yang, Wei Wang, et al. (ICDE 2002)
Presenter: Xuehua Shen (xshen@uiuc.edu)
Data Mining: Concepts and Techniques

Presentation Layout
• Overview of Clustering
• Related Work on δ-Clusters
• δ-Clusters Model
• FLOC Algorithm

Clustering
• Clustering: the process of grouping a set of objects into classes of similar objects
• Objects within the same cluster are similar to one another
• Objects in different clusters are dissimilar

Major Clustering Methods
• Partitioning algorithms
• Hierarchical algorithms
• Density-based
• Grid-based
• Model-based

Similarity
• Clustering: the process of grouping a set of objects into classes of similar objects
• But how do we define similarity?

Similarity cont.
• Traditional clustering model: based on distance functions
• A popular choice is the Minkowski distance: d(i, j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + … + |xip - xjp|^q)^(1/q), where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects and q is a positive integer
• But strong correlations may still exist among a set of objects even if the distance function places them far apart
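
The last bullet can be illustrated with a small sketch. The vectors below are made-up values, not data from the paper: `shifted` follows exactly the same rise-and-fall pattern as `base`, yet the Minkowski distance places it far away.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# Hypothetical objects: `shifted` is `base` plus a constant offset of 100.
base = [1.0, 5.0, 3.0, 8.0]
shifted = [v + 100.0 for v in base]
unrelated = [3.0, 4.0, 5.0, 6.0]  # different shape, but numerically close to `base`

far = minkowski(base, shifted, 2)     # far apart, although the pattern is identical
near = minkowski(base, unrelated, 2)  # close, although the pattern differs
print(far, near)
```

With q = 2 this is the Euclidean distance, and with q = 1 the Manhattan distance.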

Similarity cont. • -Clusters model: similar when exhibiting a coherent pattern on a subset of dimensions • Can cluster objects which show shifting pattern or scaling pattern Data Mining: Concepts and Techniques

Similarity cont.
• Example of coherent patterns (figure panels: Shifting Pattern, Scaling Pattern)
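
As a rough illustration of the two patterns, the sketch below checks whether one object is a shifted or a scaled copy of another. The helper names and the tolerance `tol` are assumptions made for this example, and the check is exact-match only; the δ-cluster model itself tolerates noise through a residue threshold.

```python
def is_shifted(x, y, tol=1e-9):
    """True if y[k] - x[k] is the same constant on every dimension."""
    diffs = [b - a for a, b in zip(x, y)]
    return max(diffs) - min(diffs) <= tol

def is_scaled(x, y, tol=1e-9):
    """True if y[k] / x[k] is the same constant on every dimension.

    Assumes every x[k] is nonzero.
    """
    ratios = [b / a for a, b in zip(x, y)]
    return max(ratios) - min(ratios) <= tol

x = [1.0, 4.0, 2.0, 8.0]
shifted = [v + 5.0 for v in x]  # shifting pattern: constant offset
scaled = [v * 3.0 for v in x]   # scaling pattern: constant factor

print(is_shifted(x, shifted), is_scaled(x, scaled), is_shifted(x, scaled))
```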

Subspace Clustering
• From high-dimensional clustering (problematic) to subspace clustering
• Unlike patterns in time-series data, not restricted to a fixed ordering of columns
• Challenge: the curse of dimensionality

Subspace Clustering cont.
• Example of subspace clustering (figure)

Applications
• Microarray Data Analysis in Biology
• E-Commerce

Microarray Data Analysis
• Matrix (dense)
• Rows: genes
• Columns: various samples (experimental conditions or tissues)
• Values in the matrix: expression level, i.e. the relative abundance of the mRNA of a gene under a specific condition

Microarray Data Analysis cont.
• From scaling pattern to shifting pattern (figure; red: gene of interest, green: control gene)
• Investigations show that several genes often contribute to one disease, which motivates researchers to identify a subset of genes whose expression levels rise and fall coherently under a subset of conditions
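
The slide does not spell out how a scaling pattern becomes a shifting pattern. One standard way, assumed here, is to take logarithms, since log(c * x) = log(c) + log(x). The expression values below are invented for illustration.

```python
import math

# Hypothetical expression levels; gene_b is gene_a scaled by a factor of 3.
gene_a = [2.0, 8.0, 4.0, 16.0]
gene_b = [3.0 * v for v in gene_a]

log_a = [math.log(v) for v in gene_a]
log_b = [math.log(v) for v in gene_b]

# After the log transform the two genes differ by a constant offset, log(3),
# so the scaling pattern has become a shifting pattern.
offsets = [lb - la for la, lb in zip(log_a, log_b)]
print(offsets)
```

This only works for strictly positive values, which is another assumption of the sketch.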

E-Commerce
• Example: movie ratings (1: lowest rating, 10: highest rating)
• Shifting pattern
• For a new movie, if the 1st viewer rates it 7 and the 3rd viewer rates it 9, the 2nd viewer will probably like it too
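
The prediction in the last bullet can be sketched as follows. The past ratings are hypothetical (the slide's rating table is not reproduced here), and the `predict` helper assumes that each pair of viewers differs by a constant offset, i.e. a perfect shifting pattern.

```python
# Hypothetical past ratings (1 = lowest, 10 = highest), one list per viewer.
past = {
    "viewer1": [1, 2, 3, 5],
    "viewer2": [2, 3, 4, 6],  # always viewer1 + 1
    "viewer3": [3, 4, 5, 7],  # always viewer1 + 2
}

def mean(xs):
    return sum(xs) / len(xs)

def predict(target, known, past):
    """Predict `target`'s rating of a new movie from the ratings in `known`,
    assuming a constant offset between any two viewers."""
    estimates = []
    for other, rating in known.items():
        offset = mean(past[target]) - mean(past[other])
        estimates.append(rating + offset)
    return mean(estimates)

# Viewer 1 rates the new movie 7 and viewer 3 rates it 9.
prediction = predict("viewer2", {"viewer1": 7, "viewer3": 9}, past)
print(prediction)
```

Under these made-up ratings both known viewers lead to the same estimate for viewer 2, which is what a shifting pattern implies.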

Presentation Layout
• Overview of Clustering
• Related Work on δ-Clusters
• δ-Clusters Model
• FLOC Algorithm

Related Work
• CLIQUE, ORCLUS, PROCLUS (subspace clustering)
• These capture neither the shifting pattern nor the scaling pattern
• The bicluster model was proposed as a measure of the coherence of genes and conditions in a submatrix of a DNA array

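Bicluster
• Model: the mean squared residue score of a submatrix, H(I, J) = (1 / (|I| |J|)) Σ (aij - aiJ - aIj + aIJ)^2, summed over i in I and j in J, where aiJ, aIj, and aIJ are the row mean, the column mean, and the overall mean of the submatrix
• A submatrix AIJ is called a δ-bicluster if H(I, J) ≤ δ
• Algorithm: a randomized algorithm that gives an approximate answer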
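
A minimal sketch of the mean squared residue score, assuming a fully specified matrix stored as a list of lists. The function name and the sample matrix are illustrative only.

```python
def mean_squared_residue(a, rows, cols):
    """H(I, J) for the submatrix of `a` selected by index lists rows (I) and cols (J)."""
    n_r, n_c = len(rows), len(cols)
    row_mean = {i: sum(a[i][j] for j in cols) / n_c for i in rows}
    col_mean = {j: sum(a[i][j] for i in rows) / n_r for j in cols}
    all_mean = sum(row_mean.values()) / n_r
    total = 0.0
    for i in rows:
        for j in cols:
            total += (a[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
    return total / (n_r * n_c)

a = [
    [1.0, 2.0, 4.0],
    [3.0, 4.0, 6.0],  # row 0 shifted by 2
    [9.0, 1.0, 5.0],  # unrelated row
]

coherent = mean_squared_residue(a, [0, 1], [0, 1, 2])  # essentially zero
mixed = mean_squared_residue(a, [0, 1, 2], [0, 1, 2])  # clearly positive
print(coherent, mixed)
```

A submatrix would then qualify as a δ-bicluster when the returned score does not exceed the chosen δ.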

Weaknesses of Bicluster
• Missing values
• Constraints

Presentation Layout
• Overview
• Related Work
• δ-Clusters Model
• FLOC Algorithm

Occupancy Threshold
• A parameter α that controls the percentage of missing values allowed in a submatrix
• For every object i in the δ-cluster: |J’i| / |J| ≥ α
• |J’i| is the number of specified attributes for object i in the δ-cluster
• |J| is the number of attributes in the δ-cluster

Occupancy Threshold cont.
• The same occupancy threshold α applies to every attribute j in the δ-cluster
• Example: α = 0.6
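
The occupancy test can be sketched as below. Representing a missing value as `None` and the function name `satisfies_occupancy` are assumptions of this example; the matrix is invented.

```python
def satisfies_occupancy(d, rows, cols, alpha):
    """True if every object and every attribute of the submatrix (rows, cols)
    has at least a fraction `alpha` of its entries specified (not None)."""
    for i in rows:
        specified = sum(1 for j in cols if d[i][j] is not None)
        if specified / len(cols) < alpha:
            return False
    for j in cols:
        specified = sum(1 for i in rows if d[i][j] is not None)
        if specified / len(rows) < alpha:
            return False
    return True

# Hypothetical matrix: every row and every column has 2 of 3 entries specified.
d = [
    [1.0, None, 3.0],
    [2.0, 4.0, None],
    [None, 5.0, 6.0],
]

print(satisfies_occupancy(d, [0, 1, 2], [0, 1, 2], 0.6))  # 2/3 meets 0.6
print(satisfies_occupancy(d, [0, 1, 2], [0, 1, 2], 0.7))  # 2/3 falls short of 0.7
```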

Volume
• The volume of a δ-cluster (I, J) is the number of specified entries dij in (I, J)
• Example: a fully specified cluster of 3 objects and 3 attributes has volume 3 × 3 = 9

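Base
• Object base diJ: the average of the specified entries of object i over the attributes in J
• Attribute base dIj: the average of the specified entries of attribute j over the objects in I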

Base cont. • -Clusters Base • For perfect -Clusters Data Mining: Concepts and Techniques

Residue
• Entry residue: rij = dij - diJ - dIj + dIJ if dij is specified, and 0 otherwise

Residue cont. • -Clusters Residue • r-residue -Clusters if -clusters residue is equal to or smaller than r Data Mining: Concepts and Techniques

Presentation Layout • Overview of Clustering • Related Work of -Clusters • -Clusters Model • FLOC algorithm(Flexible Overlapping Clustering) Data Mining: Concepts and Techniques

Flow Chart
1) Generate initial clusters
2) Determine the best action for each row and each column
3) Perform the best actions sequentially
4) Improved? Y: return to step 2. N: stop.

Initial Clusters
• Randomly generate k initial clusters
• Different values of the size parameter yield clusters of different sizes
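
One plausible way to generate the initial clusters is sketched below. The slide only says that a parameter controls cluster size; treating that parameter as an inclusion probability `p` for each row and column is an assumption of this example, as is the function name.

```python
import random

def initial_clusters(n_rows, n_cols, k, p, seed=None):
    """Generate k random clusters as (row set, column set) pairs.

    Each row and each column joins a cluster independently with
    probability p (hypothetical size parameter), so larger p gives
    larger clusters on average.
    """
    rng = random.Random(seed)
    clusters = []
    for _ in range(k):
        rows = {i for i in range(n_rows) if rng.random() < p}
        cols = {j for j in range(n_cols) if rng.random() < p}
        clusters.append((rows, cols))
    return clusters

for rows, cols in initial_clusters(n_rows=6, n_cols=4, k=3, p=0.5, seed=1):
    print(sorted(rows), sorted(cols))
```

Clusters generated this way may overlap, and some may start out empty for small `p`.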

Choose Best Actions
• For every object or attribute there are k possible actions, one per cluster: change its membership in that cluster
• Choose the best action among the k candidates according to gain
• Gain is the difference between the cluster's original residue and its residue assuming the action is performed
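
The gain computation for one row can be sketched as below, following only the definition on this slide (residue before minus residue after). The residue is taken as the mean absolute entry residue, an action is modeled as toggling the row's membership, and all names and data are hypothetical.

```python
def residue(d, rows, cols):
    """Mean absolute entry residue of (rows, cols); 0.0 for an empty cluster."""
    def mean(values):
        values = [v for v in values if v is not None]
        return sum(values) / len(values) if values else 0.0

    obj = {i: mean(d[i][j] for j in cols) for i in rows}
    att = {j: mean(d[i][j] for i in rows) for j in cols}
    base = mean(d[i][j] for i in rows for j in cols)
    return mean(
        abs(d[i][j] - obj[i] - att[j] + base)
        for i in rows for j in cols
        if d[i][j] is not None
    )

def toggle(members, x):
    """Remove x if present, otherwise insert it; returns a new set."""
    return members - {x} if x in members else members | {x}

def best_row_action(d, clusters, i):
    """Return (cluster index, gain) of the best of the k candidate actions for row i."""
    best = None
    for c, (rows, cols) in enumerate(clusters):
        gain = residue(d, rows, cols) - residue(d, toggle(rows, i), cols)
        if best is None or gain > best[1]:
            best = (c, gain)
    return best

d = [
    [1.0, 2.0, 4.0],
    [3.0, 4.0, 6.0],
    [6.0, 7.0, 9.0],
    [9.0, 1.0, 5.0],  # does not follow the shifting pattern
]
clusters = [({0, 1, 3}, {0, 1, 2}), ({0, 1}, {0, 1, 2})]

# Removing row 3 from cluster 0 lowers its residue; adding it to cluster 1 raises it.
print(best_row_action(d, clusters, 3))
```

The same loop applies to columns by toggling membership in the column set instead.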

Choose Best Actions cont.
• An action with negative gain is sometimes still performed, so that the search can move past a local optimum toward the global optimum

Do the Actions Sequentially
• Generate the action sequence in one of three ways:
1) the same order in all iterations
2) a random order
3) a weighted random order
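
The three orderings can be sketched as follows. The weighting used in `weighted_random_order` (weights proportional to shifted gains, sampled without replacement) is an assumption of this sketch and not necessarily the scheme used by FLOC; the action labels and gains are made up.

```python
import random

def fixed_order(actions):
    """1) The same order in every iteration."""
    return list(actions)

def random_order(actions, rng):
    """2) A uniformly random permutation."""
    order = list(actions)
    rng.shuffle(order)
    return order

def weighted_random_order(actions, gains, rng):
    """3) Random permutation in which higher-gain actions tend to come earlier."""
    if not actions:
        return []
    low = min(gains)
    # Shift the gains so every weight is strictly positive (sketch assumption).
    weights = [g - low + 1e-9 for g in gains]
    remaining = list(range(len(actions)))
    order = []
    while remaining:
        pick = rng.choices(remaining, weights=[weights[r] for r in remaining])[0]
        remaining.remove(pick)
        order.append(actions[pick])
    return order

actions = ["row0", "row1", "col0"]  # hypothetical action labels
gains = [0.5, -0.2, 0.1]            # hypothetical gains
rng = random.Random(0)
print(fixed_order(actions))
print(random_order(actions, rng))
print(weighted_random_order(actions, gains, rng))
```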

Output the Best Clusters
• When several iterations bring no further improvement in the minimum residue, the algorithm stops and outputs the best k clusters

End
• Thank you!
