This presentation covers the δ-Clusters model for subspace clustering, focusing on identifying coherent patterns within large datasets, such as those found in microarray data analysis. Traditional clustering methods often struggle due to the curse of dimensionality. The δ-Clusters approach effectively captures both shifting and scaling patterns among objects, offering a new perspective on similarity. We also discuss the FLOC algorithm and its relevance to clustering in various applications, including e-commerce and biology.
δ-Clusters: Capturing Subspace Correlation in a Large Data Set Authors: Yang Jiong, Wei Wang, et al. (ICDE 2002) Presenter: Xuehua Shen xshen@uiuc.edu Data Mining: Concepts and Techniques
Presentation Layout • Overview of Clustering • Related Work of δ-Clusters • δ-Clusters Model • FLOC algorithm
Clustering • Clustering: the process of grouping a set of objects into classes of similar objects • Objects are similar to one another within the same cluster • Objects are dissimilar to the objects in other clusters
Major Clustering Methods • Partitioning algorithms • Hierarchical algorithms • Density-based • Grid-based • Model-based
Similarity • Clustering: the process of grouping a set of objects into classes of similar objects • But how do we define similarity?
Similarity cont. • Traditional clustering model: based on distance functions • A popular one is the Minkowski distance: $d(i,j) = \left(\sum_{k=1}^{p} |x_{ik} - x_{jk}|^{q}\right)^{1/q}$, where $i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer • But strong correlations may still exist among a set of objects even if they are far apart from each other as measured by the distance function
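A minimal Python sketch of the Minkowski distance, to make the last bullet concrete. The vectors `a`, `b` and `c` are made-up values, not data from the slides: `b` is `a` shifted by a constant, so the two follow exactly the same pattern and yet are far apart under the distance, while `c` is close to `a` without following its pattern.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical objects (illustration only).
a = [1.0, 5.0, 3.0, 8.0]
b = [v + 100.0 for v in a]   # same pattern as a, shifted by +100
c = [2.0, 4.0, 4.0, 7.0]     # near a, but a different pattern

d_ab = minkowski(a, b, 2)    # Euclidean distance (q = 2)
d_ac = minkowski(a, c, 2)
print(d_ab, d_ac)            # d_ab is much larger than d_ac
```

A distance-based method would group `a` with `c` rather than with `b`, which is the limitation the slide points at.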
Similarity cont. • δ-Clusters model: objects are similar when they exhibit a coherent pattern on a subset of dimensions • Can cluster objects that show a shifting pattern or a scaling pattern
Similarity cont. • Examples of coherent patterns: shifting pattern, scaling pattern
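The two patterns can be stated as simple checks. The sketch below assumes a shifting pattern means a constant difference between two objects across the chosen dimensions, and a scaling pattern means a constant ratio; the function names and the sample vectors are illustrative, not taken from the slides.

```python
def is_shifting(x, y, tol=1e-9):
    """True if y - x is (nearly) the same offset on every dimension."""
    diffs = [b - a for a, b in zip(x, y)]
    return max(diffs) - min(diffs) <= tol


def is_scaling(x, y, tol=1e-9):
    """True if y / x is (nearly) the same factor on every dimension.

    Assumes no entry of x is zero.
    """
    ratios = [b / a for a, b in zip(x, y)]
    return max(ratios) - min(ratios) <= tol


# Hypothetical objects (illustration only).
x = [1.0, 5.0, 3.0, 8.0]
shifted = [v + 10.0 for v in x]
scaled = [v * 3.0 for v in x]

print(is_shifting(x, shifted), is_scaling(x, scaled))
```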
Subspace Clustering • From high-dimensional clustering (problematic) to subspace clustering • Not restricted to a fixed ordering of columns, in contrast to patterns in time-series data • Challenge: the curse of dimensionality!
Subspace Clustering cont. • Example of subspace clustering
Applications • Microarray Data Analysis in Biology • E-Commerce
Microarray Data Analysis • Matrix (dense) • Rows: genes • Columns: samples (experimental conditions or tissues) • Values in the matrix: expression level, i.e. the relative abundance of the mRNA of a gene under a specific condition
Microarray Data Analysis cont. • From scaling pattern to shifting pattern • Red: gene of interest, Green: control gene • Investigations show that several genes contribute to a disease, which motivates researchers to identify a subset of genes whose expression levels rise and fall coherently under a subset of conditions
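One standard way to turn a scaling pattern into a shifting pattern is to take logarithms, since a constant ratio becomes a constant difference: log(c·x) = log c + log x. The slide does not spell out the transformation, so treat this as an assumption; the expression values below are invented for illustration.

```python
import math

# Hypothetical expression levels: gene_b is gene_a scaled by a factor of 3.
gene_a = [1.0, 5.0, 3.0, 8.0]
gene_b = [3.0 * v for v in gene_a]

log_a = [math.log(v) for v in gene_a]
log_b = [math.log(v) for v in gene_b]

# After the log transform the two genes differ by a constant offset, log(3).
offsets = [b - a for a, b in zip(log_a, log_b)]
print(offsets)
```

This only works for strictly positive values, which holds for intensity ratios but would need care for other data.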
E-Commerce • Example: movie ratings (1: lowest rating, 10: highest rating) • Shifting pattern • If, for a new movie, the 1st viewer rates it 7 and the 3rd viewer rates it 9, the 2nd viewer will probably like this movie too
Presentation Layout • Overview of clustering • Related Work of δ-Clusters • δ-Clusters Model • FLOC algorithm
Related Work • CLIQUE, ORCLUS, PROCLUS (subspace clustering) • They can capture neither the shifting pattern nor the scaling pattern • Bicluster model: proposed as a measure of the coherence of genes and conditions in a submatrix of a DNA array
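Bicluster • Model: mean squared residue score of a submatrix: $H(I,J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (d_{ij} - d_{iJ} - d_{Ij} + d_{IJ})^2$, where $d_{iJ}$, $d_{Ij}$ and $d_{IJ}$ are the row mean, column mean and overall mean of the submatrix; a submatrix $A_{IJ}$ is called a δ-bicluster if $H(I,J) \le \delta$ • Algorithm: a randomized algorithm that gives an approximate answer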
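A short sketch of the mean squared residue score for a fully specified submatrix, assuming the usual bicluster definition (row mean, column mean and overall mean of the submatrix). The function name and the two small matrices are illustrative only.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue H(I, J) of the submatrix picked by rows x cols.

    Assumes every selected entry is present (no missing values).
    """
    n_rows, n_cols = len(rows), len(cols)
    row_mean = {i: sum(matrix[i][j] for j in cols) / n_cols for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / n_rows for j in cols}
    all_mean = sum(row_mean.values()) / n_rows
    total = sum(
        (matrix[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in rows
        for j in cols
    )
    return total / (n_rows * n_cols)


# Hypothetical data: every row is the first row plus a constant offset.
coherent = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0], [4.0, 5.0, 6.0]]
# Same data with one entry perturbed.
noisy = [[1.0, 2.0, 3.0], [11.0, 20.0, 13.0], [4.0, 5.0, 6.0]]

h_coherent = mean_squared_residue(coherent, [0, 1, 2], [0, 1, 2])
h_noisy = mean_squared_residue(noisy, [0, 1, 2], [0, 1, 2])
print(h_coherent, h_noisy)
```

A perfectly shifting submatrix scores (numerically) zero, so it is a δ-bicluster for any δ ≥ 0; the perturbed one needs a larger δ.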
Weakness of bicluster • Missing Values • Constraints
Presentation Layout • Overview • Related Work • δ-Clusters Model • FLOC algorithm
Occupancy Threshold • A parameter α that controls the percentage of missing values in a submatrix: for each object i, $|J'_i| / |J| \ge \alpha$ • $|J'_i|$ is the number of specified attributes for object i in the δ-cluster • $|J|$ is the number of attributes in the δ-cluster
Occupancy Threshold cont. • A similar occupancy threshold applies to each attribute j in the δ-cluster: $|I'_j| / |I| \ge \alpha$ • Example: α = 0.6
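The occupancy condition can be sketched as a check over rows and columns. This assumes missing values are stored as `None` and that the threshold must hold for every object and every attribute of the cluster; the matrix is hypothetical, chosen so that α = 0.6 is just satisfied.

```python
def satisfies_occupancy(matrix, rows, cols, alpha):
    """True if every row and column of the cluster is at least alpha specified."""
    for i in rows:
        specified = sum(1 for j in cols if matrix[i][j] is not None)
        if specified / len(cols) < alpha:
            return False
    for j in cols:
        specified = sum(1 for i in rows if matrix[i][j] is not None)
        if specified / len(rows) < alpha:
            return False
    return True


# Hypothetical 3 x 5 cluster with missing entries marked as None.
data = [
    [1.0, None, 3.0, 4.0, 5.0],
    [2.0, 3.0, None, 5.0, 6.0],
    [None, 4.0, 5.0, 6.0, None],
]
rows, cols = [0, 1, 2], [0, 1, 2, 3, 4]

ok_at_06 = satisfies_occupancy(data, rows, cols, 0.6)
ok_at_07 = satisfies_occupancy(data, rows, cols, 0.7)
print(ok_at_06, ok_at_07)
```

The last row has 3 of 5 attributes specified, so it passes at α = 0.6 and fails at α = 0.7.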
Volume • The volume of a δ-cluster (I,J) is the number of specified entries $d_{ij}$ in (I,J) • Example: the volume is 3*3=9
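Base • Object base: $d_{iJ} = \frac{1}{|J'_i|} \sum_{j \in J'_i} d_{ij}$, the average of the specified entries of object i • Attribute base: $d_{Ij} = \frac{1}{|I'_j|} \sum_{i \in I'_j} d_{ij}$, the average of the specified entries of attribute j
Base cont. • δ-cluster base: $d_{IJ}$, the average of all specified entries in the δ-cluster • For a perfect δ-cluster: $d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$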
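A sketch of the three bases and the volume, assuming each base is the average over specified (non-`None`) entries only. The helper name `bases` and the 3 x 3 matrix are illustrative; the matrix is a perfect shifting cluster, so every entry should equal object base + attribute base - cluster base.

```python
def bases(matrix, rows, cols):
    """Return (object_base, attribute_base, cluster_base, volume).

    Assumes every chosen row and column has at least one specified entry.
    """
    def mean(values):
        return sum(values) / len(values)

    object_base = {
        i: mean([matrix[i][j] for j in cols if matrix[i][j] is not None])
        for i in rows
    }
    attribute_base = {
        j: mean([matrix[i][j] for i in rows if matrix[i][j] is not None])
        for j in cols
    }
    specified = [
        matrix[i][j] for i in rows for j in cols if matrix[i][j] is not None
    ]
    return object_base, attribute_base, mean(specified), len(specified)


# Hypothetical perfect cluster: rows differ only by a constant offset.
data = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0], [4.0, 5.0, 6.0]]
rows, cols = [0, 1, 2], [0, 1, 2]
obj, att, base, volume = bases(data, rows, cols)

is_perfect = all(
    abs(data[i][j] - (obj[i] + att[j] - base)) < 1e-9
    for i in rows
    for j in cols
)
print(volume, is_perfect)
```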
Residue • Entry residue: $r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}$ if $d_{ij}$ is specified, and $r_{ij} = 0$ otherwise, where $d_{iJ}$, $d_{Ij}$ and $d_{IJ}$ are the object, attribute and δ-cluster bases
Residue cont. • δ-cluster residue • r-residue δ-cluster: a δ-cluster whose residue is equal to or smaller than r
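A sketch of the entry residue and the cluster residue. The slide's formula for the cluster residue was not preserved, so the code assumes it is the mean of the absolute entry residues over the specified entries; other averages (for example a squared mean) would fit the same structure. Function names and data are illustrative.

```python
def entry_residues(matrix, rows, cols):
    """Map (i, j) -> residue; unspecified (None) entries get residue 0."""
    def mean(values):
        return sum(values) / len(values)

    spec = [(i, j) for i in rows for j in cols if matrix[i][j] is not None]
    obj = {i: mean([matrix[i][j] for j in cols if matrix[i][j] is not None])
           for i in rows}
    att = {j: mean([matrix[i][j] for i in rows if matrix[i][j] is not None])
           for j in cols}
    base = mean([matrix[i][j] for i, j in spec])
    residues = {(i, j): 0.0 for i in rows for j in cols}
    for i, j in spec:
        residues[(i, j)] = matrix[i][j] - obj[i] - att[j] + base
    return residues


def cluster_residue(matrix, rows, cols):
    """ASSUMPTION: mean absolute entry residue over specified entries."""
    volume = sum(1 for i in rows for j in cols if matrix[i][j] is not None)
    residues = entry_residues(matrix, rows, cols)
    return sum(abs(r) for r in residues.values()) / volume


def is_r_residue_cluster(matrix, rows, cols, r):
    return cluster_residue(matrix, rows, cols) <= r


# Hypothetical cluster with one missing value.
data = [[1.0, 2.0, 3.0], [11.0, None, 13.0], [4.0, 5.0, 7.0]]
rows, cols = [0, 1, 2], [0, 1, 2]
res = cluster_residue(data, rows, cols)
print(res, is_r_residue_cluster(data, rows, cols, 2.0))
```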
Presentation Layout • Overview of Clustering • Related Work of δ-Clusters • δ-Clusters Model • FLOC algorithm (Flexible Overlapping Clustering)
Flow Chart • Generate initial clusters • Determine the best action for each row and each column • Perform the best actions sequentially • Improved? Y: determine the best actions again; N: stop
Initial Cluster • Randomly generate k initial clusters • Different parameters produce clusters of different sizes
Choose Best Actions • For every object or attribute, there are k possible actions, one for each of the k clusters • Choose the best action among the k candidates according to gain • Gain is the difference between the original residue and the residue the cluster would have after the action is performed
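The gain of a single action can be sketched as follows. The code assumes an action toggles the membership of a row in a cluster (remove it if present, add it otherwise) and that the residue is the mean absolute entry residue; both are assumptions of this sketch, and the names and data are illustrative.

```python
from collections import defaultdict


def residue(matrix, rows, cols):
    """ASSUMPTION: mean absolute entry residue over specified entries."""
    spec = [(i, j) for i in rows for j in cols if matrix[i][j] is not None]
    if not spec:
        return 0.0
    row_vals, col_vals = defaultdict(list), defaultdict(list)
    for i, j in spec:
        row_vals[i].append(matrix[i][j])
        col_vals[j].append(matrix[i][j])
    obj = {i: sum(v) / len(v) for i, v in row_vals.items()}
    att = {j: sum(v) / len(v) for j, v in col_vals.items()}
    base = sum(matrix[i][j] for i, j in spec) / len(spec)
    total = sum(abs(matrix[i][j] - obj[i] - att[j] + base) for i, j in spec)
    return total / len(spec)


def row_gain(matrix, rows, cols, x):
    """Residue before minus residue after toggling row x in the cluster."""
    toggled_rows = set(rows) ^ {x}
    return residue(matrix, rows, cols) - residue(matrix, toggled_rows, cols)


# Hypothetical data: rows 0-2 shift coherently, row 3 does not fit.
data = [
    [1.0, 2.0, 3.0],
    [11.0, 12.0, 13.0],
    [4.0, 5.0, 6.0],
    [9.0, 1.0, 7.0],
]
cols = {0, 1, 2}
gain_remove = row_gain(data, {0, 1, 2, 3}, cols, 3)  # positive: helps
gain_add = row_gain(data, {0, 1, 2}, cols, 3)        # negative: hurts
print(gain_remove, gain_add)
```

With k clusters, the same computation would be repeated once per cluster and the action with the largest gain kept.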
Choose Best Actions cont. • Even if the gain is negative, we sometimes still perform the action in order to reach the global optimum
Do the Actions Sequentially • Generate the action sequence: 1) the same order in all iterations 2) a random order 3) a weighted random order
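The three ordering options can be sketched as below. The slide does not say how the weighted random order is built, so the weighting used here (sampling without replacement, with weight growing with gain) is an assumption of this sketch and not necessarily the authors' scheme. The action labels and gains are made up.

```python
import random


def fixed_order(actions):
    """Option 1: the same order in every iteration."""
    return list(actions)


def random_order(actions, rng):
    """Option 2: a uniformly random permutation."""
    order = list(actions)
    rng.shuffle(order)
    return order


def weighted_random_order(actions, gains, rng):
    """Option 3 (ASSUMED scheme): higher-gain actions tend to come earlier."""
    remaining = list(actions)
    if not remaining:
        return []
    low = min(gains[a] for a in remaining)
    order = []
    while remaining:
        # Shift gains so that every weight is strictly positive.
        weights = [gains[a] - low + 1.0 for a in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(pick)
        order.append(pick)
    return order


# Hypothetical actions and gains (illustration only).
actions = [("row", 0), ("row", 1), ("col", 0), ("col", 1)]
gains = {("row", 0): 0.9, ("row", 1): -0.2, ("col", 0): 0.1, ("col", 1): 0.4}
rng = random.Random(0)  # seeded only to make runs repeatable

print(fixed_order(actions))
print(random_order(actions, rng))
print(weighted_random_order(actions, gains, rng))
```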
Output the Best Clusters • When an iteration brings no further improvement in the minimum residue, the algorithm stops and outputs the k best clusters
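Putting the flow chart together, a simplified loop might look like the sketch below. It is not the authors' implementation: it assumes the residue is the mean absolute entry residue, treats an action as toggling a row or column in one cluster, blocks actions that would shrink a cluster below `min_size` (a stand-in for the constraints, without which the residue is trivially minimized by tiny clusters), and performs actions in descending order of gain as a deterministic stand-in for the weighted random order. All names and data are illustrative.

```python
from collections import defaultdict


def residue(matrix, rows, cols):
    """ASSUMPTION: mean absolute entry residue over specified entries."""
    spec = [(i, j) for i in rows for j in cols if matrix[i][j] is not None]
    if not spec:
        return 0.0
    row_vals, col_vals = defaultdict(list), defaultdict(list)
    for i, j in spec:
        row_vals[i].append(matrix[i][j])
        col_vals[j].append(matrix[i][j])
    obj = {i: sum(v) / len(v) for i, v in row_vals.items()}
    att = {j: sum(v) / len(v) for j, v in col_vals.items()}
    base = sum(matrix[i][j] for i, j in spec) / len(spec)
    total = sum(abs(matrix[i][j] - obj[i] - att[j] + base) for i, j in spec)
    return total / len(spec)


def toggled(cluster, kind, x):
    rows, cols = cluster
    return (rows ^ {x}, cols) if kind == "row" else (rows, cols ^ {x})


def allowed(cluster, min_size):
    return len(cluster[0]) >= min_size and len(cluster[1]) >= min_size


def floc_sketch(matrix, clusters, max_iter=20, min_size=2):
    def score(cs):
        return sum(residue(matrix, r, c) for r, c in cs) / len(cs)

    best = [(set(r), set(c)) for r, c in clusters]
    best_score = score(best)
    items = [("row", i) for i in range(len(matrix))]
    items += [("col", j) for j in range(len(matrix[0]))]
    for _ in range(max_iter):
        current = [(set(r), set(c)) for r, c in best]
        # Step 1: for each row/column, pick the cluster with the best gain.
        plan = []
        for kind, x in items:
            options = []
            for k, cl in enumerate(current):
                new = toggled(cl, kind, x)
                if allowed(new, min_size):
                    gain = residue(matrix, *cl) - residue(matrix, *new)
                    options.append((gain, k))
            if options:
                gain, k = max(options)
                plan.append((gain, kind, x, k))
        plan.sort(key=lambda a: a[0], reverse=True)
        # Step 2: perform the actions one by one, keep the best snapshot.
        improved = False
        for _, kind, x, k in plan:
            new = toggled(current[k], kind, x)
            if not allowed(new, min_size):
                continue
            current[k] = new
            s = score(current)
            if s < best_score - 1e-12:
                best = [(set(r), set(c)) for r, c in current]
                best_score, improved = s, True
        if not improved:
            break
    return best, best_score


# Hypothetical data: rows 0-2 shift coherently, row 3 is noise.
data = [
    [1.0, 2.0, 3.0],
    [11.0, 12.0, 13.0],
    [4.0, 5.0, 6.0],
    [9.0, 1.0, 7.0],
]
result, final_score = floc_sketch(data, [({0, 1, 2, 3}, {0, 1, 2})])
print(result, final_score)
```

Here a single cluster starts with all four rows; the first action performed removes the noisy row, and the loop stops once a full pass yields no lower residue.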
End • Thank you!