
Presentation Transcript


  1. College of Computing and Information Technology, Arab Academy for Science and Technology. FINDING CLUSTERS OF ARBITRARY SHAPES, SIZES AND DENSITIES WITHIN NATURAL CLUSTERS USING THE CONCEPT OF MUTUAL NEAREST NEIGHBORS, by Mohamed Ali Abbas Mohamed. Thesis Advisors. Committee Referees.

  2. Agenda 1. Introduction 2. Contribution (CSHARP: A Clustering Using Shared Reference Points Algorithm) • Concept of Reference Points • Cluster Propagation • CSHARP Algorithm • Speed Performance 3. Validation Indexes • F-measure • Jaccard Coefficient • Adjusted Rand Index 4. Experimentation (15 data sets were tested) • Two 2-D Chameleon data sets • Eleven low-dimensional data sets • Two high-dimensional data sets 5. Conclusion & Future Work 6. References

  3. 1. INTRODUCTION • Definition of Clustering • Natural Cluster • Clustering vs. Classification • Clustering Process • Feature Extraction • Algorithm Design • Cluster Validation • Result Interpretation • Proximity Measures • Inter-distance vs. Intra-distance • Distance Measures • Similarity Measures • Related Work • K-means • DBScan • Chameleon • Mitosis • Jarvis-Patrick • SNN • Spectral Clustering

  4. CLUSTERING • Clustering of data is an important step in data analysis. The main goal of clustering is to partition data objects into well-separated groups so that objects lying in the same group are more similar to one another than to objects in other groups. • Clusters can be described in terms of internal homogeneity and external separation.

  5. Inter/Intra Cluster Distances • A good clustering is one where: • (Intra-cluster distance) the sum of distances between objects in the same cluster is minimized, • (Inter-cluster distance) while the distances between different clusters are maximized.

  6. Top-Down & Bottom-Up Approaches • Top-down: partition clusters according to external separation. • Bottom-up: merge points according to internal homogeneity.

  7. Natural Clusters A natural cluster is a cluster of any shape, size and density, and it should not be restricted to a globular shape, as a wide number of classical algorithms assume, or to a specific user-defined density, as some density-based algorithms require.

  8. Classification vs. Clustering • Objects are characterized by one or more features. • Classification (Supervised Learning) • We have labels for data objects/patterns. • Search for a “rule” that will accurately assign labels to new (unseen) patterns. • Clustering (Unsupervised Learning) • No labeling of data. • Group points into clusters based on how “near” they are to one another. • Identify hidden structure in data.

  9. Clustering Process Diagram: Data Samples → (1) Feature Extraction → (2) Clustering Algorithm Design → (3) Cluster Validation → (4) Result Interpretation → Knowledge.

  10. Clustering Inference • We can infer the properties of a specific object based on the category to which it belongs. For instance, when we see a seal lying lazily on the ground, we know immediately that it is a good swimmer without actually seeing it swim.

  11. Proximity Measures 1- Distance Measures: Proximity is the generalization of both dissimilarity and similarity. The Minkowski distance is typically used with p being 1 or 2. The latter is the Euclidean distance, while the former is sometimes known as the Manhattan distance. Euclidean distance is probably the most common distance that has ever been used for numerical data.
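As a concrete illustration of these measures, here is a minimal sketch (the function and variable names are mine, not from the thesis) of the Minkowski distance, with p = 1 giving the Manhattan distance and p = 2 the Euclidean distance:

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance between two numeric vectors:
    p = 1 -> Manhattan distance, p = 2 -> Euclidean distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(a, b, p=1))  # Manhattan distance: 7.0
print(minkowski(a, b, p=2))  # Euclidean distance: 5.0
```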

  12. Proximity Measures 2- Similarity Measures: Cosine similarity is a measure of similarity between two vectors of n dimensions obtained by taking the cosine of the angle between them, and is often used to compare documents in text mining. Given two vectors of attributes, A and B, the cosine similarity is expressed using a dot product and magnitudes as cos(θ) = (A · B) / (‖A‖ ‖B‖). Jaccard similarity: the Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets: J(A, B) = |A ∩ B| / |A ∪ B|.
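Both measures are easy to state in code; the small sketch below (names are mine) mirrors the two formulas above:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||) for two numeric vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| for two sample sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(cosine_similarity([1, 0, 1], [1, 1, 0]))   # 0.5
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 0.5
```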

  13. How hard is clustering? • One idea is to consider all possible clusterings, and pick the one that has the best inter- and intra-cluster distance properties. • Suppose we are given n points and would like to cluster them into k clusters. How many possible clusterings are there? The count is the Stirling number of the second kind, S(n, k), which grows exponentially with n. • Too hard to do by brute force or optimally. • Solution: iterative optimization algorithms • Start with a clustering and iteratively improve it (e.g. K-means).
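To see how quickly the count explodes, a hypothetical helper (the standard recurrence, not material from the thesis) can enumerate S(n, k):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: number of ways to partition
    n labelled points into exactly k non-empty clusters.
    Recurrence: S(n, k) = k * S(n-1, k) + S(n-1, k-1)."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(10, 3))   # 9330 clusterings for only 10 points
print(stirling2(100, 5))  # astronomically large -> brute force is hopeless
```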

  14. Related Work • K-means (Forgy, 1965) • Jarvis-Patrick (1973) • DBScan (Martin Ester, 1996) • Chameleon (Karypis G., 1999) • SNN (Ertoz, 2003) • Mitosis (Noha Yousri, 2009) • Spectral Clustering (Chen, W.-Y., Song, Y., 2011)

  15. K-Means Example (K=2): pick seeds → assign points to clusters → compute centroids → reassign clusters → recompute centroids → … → converged! [From Mooney]
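The loop the example steps through (pick seeds, assign points, recompute centroids, repeat until converged) can be sketched in a few lines; this is a toy illustration with my own names, assuming no cluster goes empty:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # pick seeds
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids; stop when they no longer move
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```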

  16. K-Means Example (K=10): K-means clustering on the Digits data set (PCA-reduced data). Centroids are marked with a white cross.
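A comparable experiment can be reproduced with scikit-learn; the slide does not state the exact settings, so the number of PCA components and the random seed below are assumptions:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

digits = load_digits()
X = PCA(n_components=2).fit_transform(digits.data)   # PCA-reduced data
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
print(labels[:20])
```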

  17. Density-Based Clustering: Background • Two parameters: • Eps: maximum radius of the neighborhood • MinPts: minimum number of points in an Eps-neighborhood of that point • Key notions: density-reachable and density-connected points. • A cluster is defined as a maximal set of density-connected points.
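The role of the two parameters can be illustrated with a small sketch that flags core points (names are mine; a full implementation is available as sklearn.cluster.DBSCAN with eps and min_samples):

```python
import numpy as np

def core_points(X, eps, min_pts):
    """Flag points whose Eps-neighborhood contains at least MinPts points
    (the point itself is counted, as in the usual DBScan formulation)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    neighbour_counts = (d <= eps).sum(axis=1)
    return neighbour_counts >= min_pts
```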

  18. Chameleon: example data sets exhibiting a variety of sizes, a variety of shapes, a variety of densities, and an increase in densities.

  19. Closeness vs. Interconnectivity: an example where closeness is misleading and an example where interconnectivity is misleading.

  20. Chameleon Algorithm • Two-phase approach • Phase I • Uses a graph-partitioning algorithm (e.g. hMETIS) to divide the data set into a set of individual clusters. • Phase II • Uses an agglomerative hierarchical clustering algorithm to merge the clusters: Chameleon selects the pair of clusters that maximizes the product of their relative interconnectivity and relative closeness (see the formula below).
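The merge criterion itself is not restated on the slide; as usually written following Karypis et al. (1999), the pair maximizing RI(Ci, Cj) · RC(Ci, Cj)^α is merged, with

```latex
RI(C_i, C_j) = \frac{\left|EC_{\{C_i, C_j\}}\right|}{\tfrac{1}{2}\left(|EC_{C_i}| + |EC_{C_j}|\right)}
\qquad
RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}
  {\frac{|C_i|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_j}}}
```

where EC_{Ci,Cj} denotes the edges connecting the two clusters, EC_{Ci} the edge-cut that bisects cluster Ci, and the bar denotes the average weight of those edges.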

  21. Chameleon DS5 Data Set

  22. Chameleon Inconsistencies Inconsistencies of distances between patterns of Chameleon's initial clusters (bordered clusters).

  23. Mitosis Concept • The first example (p1, p2, d1) has two patterns with a distance d1 related to each pattern's neighborhood distances, yet the two neighborhood distance averages are not related to each other. • The second case (p3, p4, d2) gives an example of two related neighborhood distance averages, but an unrelated in-between distance d2. • The two patterns do not merge in either case. • The same rationale is used when merging two clusters.

  24. KNN Graph & Mutual KNN Graph
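To make the distinction concrete, a small sketch (names are mine) that builds both graphs: the k-NN relation is generally asymmetric, while the mutual k-NN graph keeps only reciprocated edges, which is exactly the situation slide 28 illustrates.

```python
import numpy as np

def knn_graph(X, k):
    """Boolean adjacency: knn[i, j] is True when j is among the k nearest
    neighbours of i (the relation is not symmetric in general)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    nearest = np.argsort(d, axis=1)[:, :k]
    adj = np.zeros(d.shape, dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    adj[rows, nearest.ravel()] = True
    return adj

def mutual_knn_graph(X, k):
    """Keep an edge only when i and j are each in the other's k-NB."""
    adj = knn_graph(X, k)
    return adj & adj.T
```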

  25. Jarvis-Patrick Algorithm • Compute the similarity matrix. (This corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points.) • Sparsify the similarity matrix by keeping only the k most similar neighbors. (This corresponds to keeping only the k strongest links of the similarity graph.) • Construct the shared nearest neighbor graph from the sparsified similarity matrix. At this point, we could apply a similarity threshold and find the connected components to obtain the clusters.
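A sketch of these first three steps follows, assuming the common SNN convention that an edge is kept only between points that appear in each other's k-NN lists; shared_nn_graph and the variable names are mine:

```python
import numpy as np

def shared_nn_graph(X, k):
    """Sparsify to the k strongest links, then weight each remaining edge by
    the number of nearest neighbours the two endpoints share."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]                 # k most similar neighbours
    neighbour_sets = [set(row) for row in knn]
    n = len(X)
    snn = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in neighbour_sets[i]:
            if i in neighbour_sets[j]:                 # keep only reciprocated links
                snn[i, j] = snn[j, i] = len(neighbour_sets[i] & neighbour_sets[j])
    return snn
```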

  26. SNN Algorithm • 4. Find the SNN density of each point. Using a user-specified parameter, Eps, find the number of points that have an SNN similarity of Eps or greater to each point. This is the SNN density of the point. • 5. Find the core points. Using a user-specified parameter, MinPts, find the core points, i.e. all points that have an SNN density greater than MinPts. • 6. Form clusters from the core points. If two core points are within a radius, Eps, of each other, they are placed in the same cluster. • 7. Discard all noise points. All non-core points that are not within a radius of Eps of a core point are discarded. • 8. Assign all non-noise, non-core points to clusters. This can be done by assigning such points to the nearest core point. (Note that steps 4-8 are DBScan.)
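Continuing the previous sketch (snn is the matrix returned by shared_nn_graph above), steps 4 and 5 reduce to a couple of lines; eps and min_pts are the user parameters named on the slide:

```python
def snn_density_and_cores(snn, eps, min_pts):
    """SNN density = number of points with SNN similarity >= Eps;
    core points are those whose SNN density exceeds MinPts."""
    density = (snn >= eps).sum(axis=1)
    return density, density > min_pts
```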

  27. Spectral Bi-partitioning Algorithm • Pre-processing: build the Laplacian matrix L of the graph. • Decomposition: find the eigenvalues Λ and eigenvectors X of the matrix L; map each vertex to its corresponding component of the second eigenvector λ2. • Clustering: use the k-means algorithm to cluster the n rows of X into k groups.
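A minimal numpy sketch of the bi-partitioning step (unnormalised Laplacian; the matrix values shown on the original slide are not recoverable, and the names below are mine):

```python
import numpy as np

def spectral_bipartition(A):
    """Split a graph in two using the Fiedler vector of its Laplacian.
    A is a symmetric adjacency (or affinity) matrix."""
    D = np.diag(A.sum(axis=1))
    L = D - A                              # unnormalised graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                # eigenvector for lambda_2
    return fiedler >= 0                    # sign gives the two-way partition
```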

  28. 2. CONTRIBUTION • Concept of Mutual Neighboring: “A” is in the 4-NB of “B”; however, “B” is not in the 4-NB of “A”.

  29. Concept of Reference Point/ Reference Block Reference Point ‘A’ and the Reference Block it represents.

  30. Cluster Propagation Example A B

  31. CSHARP’s Concept A new clustering algorithm, CSHARP, is presented for the purpose of finding clusters of arbitrary shapes and arbitrary densities in high-dimensional feature spaces. It can be considered a variation of the Shared Nearest Neighbor algorithm (SNN), in which each data point votes for the points in its k-nearest neighborhood. Sets of points sharing a common mutual nearest neighbor are considered dense regions/blocks. These blocks are the seeds from which clusters may grow. Therefore, CSHARP is not a point-to-point clustering algorithm; rather, it is a block-to-block clustering technique. Much of its advantage comes from two facts: noise points and outliers correspond to blocks of small sizes, and homogeneous blocks highly overlap.
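To illustrate the idea of a reference block, here is a toy sketch (not the thesis' implementation) under one plausible reading of the definition: the block of reference point p is the set of points that are mutual k-nearest neighbours with p. All names are mine.

```python
import numpy as np

def reference_blocks(X, k):
    """For every point p, return the set of points q such that p and q
    are mutual k-nearest neighbours (q in p's k-NB and p in q's k-NB)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    knn = [set(row) for row in np.argsort(d, axis=1)[:, :k]]
    return [{q for q in knn[p] if p in knn[q]} for p in range(len(X))]
```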

  32. CSHARP Data Model

  33. Representative Points • There are three possible types of representative points: • Strong points, representing blocks of size greater than a pre-defined threshold parameter. • Noise points, representing empty Reference-Lists. These points are initially excluded from the final clusters. • Weak points, which are neither strong points nor noise points. These points may be merged into an existing cluster if they are members of another strong point's block.
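A toy sketch of this classification rule, using the blocks from the previous sketch; the threshold is the user parameter mentioned on the slide, and the labels are mine:

```python
def classify_points(blocks, threshold):
    """Label each reference point by the size of its block:
    'strong' above the threshold, 'noise' for an empty block, else 'weak'."""
    labels = []
    for block in blocks:
        if len(block) == 0:
            labels.append("noise")
        elif len(block) > threshold:
            labels.append("strong")
        else:
            labels.append("weak")
    return labels
```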

  34. Number of strong, weak and noise points found in the tested data sets, and their percentages relative to the size of the corresponding data set, at the given parameter settings.

  35. CSHARP Flowchart

  36. CSHARP Flowchart

  37. CSHARP Algorithm

  38. CSHARP Algorithm

  39. Homogeneity Factor α: three cases with different homogeneity factors, (a) 0.98, (b) 0.80 and (c) 0.55.

  40. Greedy Merging in CMune Ci greedily chooses Cl to merge with, as it has the maximum intersection among all reference blocks.

  41. Significance Statistics of Results • In general, the pattern of behavior of an index measure, such as the F-measure, versus k (the size of the nearest neighborhood) has been found to be as shown below.

  42. Preprocessing Phase’s Complexity The time complexity of the preprocessing phase, which computes the similarity matrix, is O(N²), where N is the number of data points. This can be reduced to O(N log N) by the use of a data structure such as a k-d tree. The space complexity of the preprocessing stage, which computes distances between features, is O(N · F), where N is the number of data points and F is the number of feature dimensions.

  43. Speed Performance The overall time complexity of CSHARP is O(N · K), where N is the number of data points and K is the number of nearest neighbors. CSHARP has a space complexity of O(N · K), where N is the number of data points and K is the number of nearest neighbors used, since only the K nearest neighbors of each data point are required.

  44. Speed Performance Speed of CSHARP and Adjusted CSHARP using Chameleon’s DS5 data set, compared to DBScan and K-means

  45. Speed Performance Speed of CSHARP and Adjusted CSHARP using Chameleon’s DS5 data set, compared to DBScan, K-means, and Chameleon.

  46. Speed Performance Speed of CSHARP and Adjusted CSHARP using Chameleon’s DS5 data set, compared to DBScan

  47. Cluster Prototype • Some recent clustering algorithms have a notion of a prototype (i.e. a data point or a set of points) to start with. • This prototype can be a single data point, selected arbitrarily as in K-means or according to some prior criteria as in K-medoids, DBScan, and Cure. • On the other hand, the prototype can be a set of points representing very tiny clusters, to be merged as in Chameleon or propagated around as in CSHARP.

  48. Cluster seed representations: (1) centroid-based, e.g. k-means; (2) medoid-based, e.g. k-medoids; (3) core-point-based, e.g. DBScan; (4) well-scattered-points-based, e.g. Cure; (5) block-of-points-based, e.g. Chameleon; and (6) reference-points-based, e.g. CSHARP.

  49. 3- Results Validation

  50. Results Validation • For every pair of points, compare the clustering result with the ground-truth partition: a counts pairs placed in the same group by both, b pairs grouped together only by the clustering, c pairs grouped together only by the ground truth, and d pairs separated by both. • Because the total number of pairs of points is N(N-1)/2, denoted as M, we have a + b + c + d = M. • Example: for a data set of seven data points with a = 2, b = 3, c = 7 and d = 9, the sum is 2 + 3 + 7 + 9 = 21 = 7·6/2 = M.
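A minimal sketch with made-up label vectors: pair_counts is my helper that counts the four cases and derives the Rand index and Jaccard coefficient, while adjusted_rand_score is scikit-learn's implementation of the Adjusted Rand Index.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def pair_counts(truth, pred):
    """Count the four pair cases a, b, c, d used by the validation indexes."""
    a = b = c = d = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            a += 1
        elif same_truth:
            c += 1        # together in the ground truth only
        elif same_pred:
            b += 1        # together in the clustering only
        else:
            d += 1
    return a, b, c, d

truth = [0, 0, 0, 1, 1, 2, 2]
pred  = [0, 0, 1, 1, 1, 2, 2]
a, b, c, d = pair_counts(truth, pred)
M = a + b + c + d                      # = N(N-1)/2
print("Rand index   :", (a + d) / M)
print("Jaccard coeff:", a / (a + b + c))
print("Adjusted Rand:", adjusted_rand_score(truth, pred))
```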
