
Parallel Density-based Hybrid Clustering

This paper presents a parallel algorithm for density-based hybrid clustering on parallel machines. The algorithm combines partitioning clustering and hierarchical clustering techniques to improve clustering results. Experimental results show that the parallel approach significantly speeds up sequential hierarchical clustering while producing good clustering results.


Presentation Transcript


  1. Parallel Density-based Hybrid Clustering Baoying Wang October 18th, 2008 The Sixth Virtual Conference on Genomics and Bioinformatics

  2. Outline • Introduction • Sequential DH-Clustering • The Parallel Clustering Approach • Experimental Results • Conclusions

  3. Introduction • Data clustering is a common data mining technique. Clustering techniques partition the data set into groups such that similar items fall into the same group [2]. • Clustering methods fall into two broad categories: partitioning clustering and hierarchical clustering. Hierarchical clustering is more flexible than partitioning clustering, but it is very computationally expensive for large data sets. • Scalable parallel computers can be used to speed up hierarchical clustering, and recently there has been increasing interest in parallel implementations of data clustering algorithms. • However, most existing parallel approaches have been developed for traditional agglomerative clustering.

  4. Introduction (cont.) • In this paper, we propose a parallel algorithm that implements density-based hybrid clustering on MIMD (Multiple Instruction stream, Multiple Data stream) parallel machines using MPI. • The DH-Clustering method clusters a dataset into a set of preliminary attractor trees and then merges the attractor trees until the whole dataset becomes one tree. DH-Clustering is a hybrid method: it combines partitioning clustering and hierarchical clustering. It is faster than traditional hierarchical clustering but still does not scale well as the data size increases. • Experiments demonstrate that our parallel approach speeds up the sequential hierarchical clustering tremendously while producing comparably good clustering results.

  5. Sequential DH-Clustering • The basic idea of DH-Clustering is to partition the data set into clusters in terms of local density attractor trees. • Given a data point x, if we follow the steepest density-ascending path, the path will finally lead to a local density attractor. All points whose steepest ascending paths lead to the same local attractor form a local attractor tree. If x has no such path, it is either an attractor itself or a noise point. The local attractor trees are the preliminary clusters, and the resulting graph is a collection of local attractor trees with local attractors as the roots. • After the local attractor trees (preliminary clusters) are built, the cluster merging process combines the most similar cluster pair level by level based on the similarity measure. When two clusters are merged, their two local attractor trees are combined into a new attractor tree. Merging proceeds recursively until only one tree is left.
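
To make the steepest-ascent idea concrete, here is a minimal sketch in C++ (our own illustration; the paper gives no code, and the 1-D toy data, densities, and step radius below are assumptions): the routine keeps jumping from a point to the densest point within the step radius until no denser point is in range, i.e., until a local density attractor is reached.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Follow the steepest density-ascending path from point i: keep jumping to the
// densest point within the step radius r until no denser point is in range.
// The point where the climb stops is the local density attractor of i.
int climb_to_attractor(int i, const std::vector<double>& pos,
                       const std::vector<double>& density, double r) {
    int cur = i;
    for (;;) {
        int best = cur;
        for (int j = 0; j < (int)pos.size(); ++j)
            if (std::fabs(pos[j] - pos[cur]) <= r && density[j] > density[best])
                best = j;
        if (best == cur) return cur;      // local attractor reached
        cur = best;
    }
}

int main() {
    // Toy 1-D data with two density peaks (indices 1 and 4): points 0-2 climb to
    // attractor 1 and points 3-5 climb to attractor 4, giving two attractor trees.
    std::vector<double> pos     = {0.0, 1.0, 2.0, 6.0, 7.0, 8.0};
    std::vector<double> density = {1.0, 5.0, 2.0, 1.5, 4.0, 2.5};
    for (int i = 0; i < (int)pos.size(); ++i)
        std::printf("point %d -> attractor %d\n", i,
                    climb_to_attractor(i, pos, density, 1.5));
    return 0;
}
```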

  6. Analysis of the DH-Clustering • DH-Clustering consists of five major steps: • loading data • calculating density • building local attractor trees • merging process • outputting results. • We will focus on the three major computational steps: (2), (3) and (4).

  7. Analysis of the DH-Clustering (cont.) • Step 2 calculates the density of each data point. To find the density of a data point, the number of neighbors falling in each equal-interval ring (EINring neighbors) needs to be counted. • If the dataset size is n, the time to find the EINring neighbors within one ring is O(n). If we divide the neighborhood into m rings, the time to calculate the density of one data point is O(m·n). • Therefore, the total time to calculate the densities of n points is O(m·n²).
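
As a rough sketch of the per-point density computation (our own, not the paper's code; the ring width, number of rings m, and ring weighting are illustrative assumptions), the function below bins a point's neighbors into m equal-interval rings and combines the counts into a density value. Scanning each ring separately, as described above, costs O(m·n) per point; the sketch bins all neighbors in a single O(n) pass.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Density of point i based on equal interval neighborhood rings (EINrings):
// count how many neighbors fall into each of the m rings of width w around
// the point, then weight counts from inner rings more heavily.  The ring
// width, the number of rings, and the weighting are illustrative choices.
double einring_density(int i, const std::vector<double>& pos, int m, double w) {
    std::vector<int> ring_count(m, 0);
    for (int j = 0; j < (int)pos.size(); ++j) {
        if (j == i) continue;
        double d = std::fabs(pos[j] - pos[i]);
        int ring = (int)(d / w);              // ring 0 is the innermost interval
        if (ring < m) ++ring_count[ring];
    }
    double density = 0.0;
    for (int ring = 0; ring < m; ++ring)
        density += ring_count[ring] / double(ring + 1);   // closer neighbors count more
    return density;
}

int main() {
    std::vector<double> pos = {0.0, 0.4, 0.9, 1.6, 5.0};
    for (int i = 0; i < (int)pos.size(); ++i)
        std::printf("density(%d) = %.3f\n", i,
                    einring_density(i, pos, /*m=*/3, /*w=*/1.0));
    return 0;
}
```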

  8. Analysis of the DH-Clustering (cont.) • Step 3 builds the local attractor trees. The process starts with an arbitrary point and builds the path to the densest point within a specified step range. The process continues recursively until it reaches the peak (density attractor). • There are two extreme cases: • when the step is so large that the whole data set is attached to a single attractor tree; • when the step is so small that every point is an attractor. • In case (1), it takes O(n²) time to build the attractor tree, and in case (2) the time to build n attractor trees is also O(n²). Therefore, the average time to build the local attractor trees is O(n²).
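
The following sketch (our own construction on an assumed 1-D toy data set) builds the whole forest of local attractor trees at once: each point's parent is the densest point within the step range, and points that are their own parents are the attractors (tree roots). Varying the step shows the two extreme cases: a very small step makes every point an attractor, and a very large step yields a single tree.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Build the forest of local attractor trees: parent[i] is the densest point
// within the step range of point i (or i itself if no denser point is in
// range).  Points with parent[i] == i are local density attractors, i.e.
// the roots of the attractor trees.
std::vector<int> build_forest(const std::vector<double>& pos,
                              const std::vector<double>& density, double step) {
    std::vector<int> parent(pos.size());
    for (int i = 0; i < (int)pos.size(); ++i) {
        int best = i;
        for (int j = 0; j < (int)pos.size(); ++j)
            if (std::fabs(pos[j] - pos[i]) <= step && density[j] > density[best])
                best = j;
        parent[i] = best;
    }
    return parent;
}

int count_roots(const std::vector<int>& parent) {
    int roots = 0;
    for (int i = 0; i < (int)parent.size(); ++i)
        if (parent[i] == i) ++roots;
    return roots;
}

int main() {
    std::vector<double> pos     = {0.0, 1.0, 2.0, 6.0, 7.0, 8.0};
    std::vector<double> density = {1.0, 5.0, 2.0, 1.5, 4.0, 2.5};
    // A tiny step makes every point an attractor; a huge step gives one tree.
    for (double step : {0.1, 1.5, 100.0})
        std::printf("step = %5.1f -> %d attractor tree(s)\n",
                    step, count_roots(build_forest(pos, density, step)));
    return 0;
}
```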

  9. Analysis of the DH-Clustering (cont.) • Step 4 is the merging process. It starts with k attractor trees and merges the closest pair of attractor trees until only one tree is left. • The time complexity of finding the closest pair among k attractor trees is O(k²), and (k-1) merge steps are needed to reduce k trees to one. Therefore, the time of the whole merging process is O((k-1)·k²), i.e., O(k³). • The time of the merging process thus depends only on the value of k, and in our experiments k is much smaller than the data size. • Hence, the merging process is generally not as expensive as the previous two steps.
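
A minimal sketch of the merging loop (our own; the paper's similarity measure is not spelled out here, so the distance between tree attractors stands in for it): at each level the closest pair of trees is found in O(k²) and merged, and (k-1) such merges reduce k trees to one, which gives the O(k³) bound.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// A local attractor tree, reduced for this sketch to its attractor position
// and the indices of the points it contains.
struct Tree {
    double attractor;            // position of the root (local density attractor)
    std::vector<int> members;    // points belonging to this tree
};

// Merge the closest pair of trees repeatedly until one tree is left.
// Distance between attractors stands in for the paper's similarity measure.
Tree merge_all(std::vector<Tree> trees) {
    while (trees.size() > 1) {                       // (k-1) merge steps
        size_t a = 0, b = 1;
        double best = std::fabs(trees[0].attractor - trees[1].attractor);
        for (size_t i = 0; i < trees.size(); ++i)    // O(k^2) closest-pair search
            for (size_t j = i + 1; j < trees.size(); ++j) {
                double d = std::fabs(trees[i].attractor - trees[j].attractor);
                if (d < best) { best = d; a = i; b = j; }
            }
        // Combine tree b into tree a; for brevity tree a's attractor is kept
        // as the root of the merged tree.
        trees[a].members.insert(trees[a].members.end(),
                                trees[b].members.begin(), trees[b].members.end());
        trees.erase(trees.begin() + b);
    }
    return trees.front();
}

int main() {
    std::vector<Tree> trees = {{1.0, {0, 1, 2}}, {7.0, {3, 4, 5}}, {2.5, {6}}};
    Tree root = merge_all(trees);
    std::printf("final tree holds %zu points\n", root.members.size());
    return 0;
}
```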

  10. Parallel DH-Clustering • Our parallel algorithm is designed to run on p parallel machines. Our first thought was to divide the dataset into p parts and assign one part to each machine. • However, a problem arose during density calculation: if each data part were isolated from the others when densities were calculated, the density values would be incorrect, especially when the data portion on each machine covered points from all over the data space. • To solve this problem, we load the whole data set onto each machine, but each machine calculates the densities only of the points assigned to it. • In this way, we still achieve high efficiency without loss of accuracy. The complexity of density calculation on p machines is reduced from the sequential O(m·n²) to O(m·n²/p).
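
A minimal MPI sketch of this scheme (our own, with a hard-coded toy data set and an illustrative EINring-style density): every rank holds the whole data set but computes densities only for its own contiguous block of roughly n/p points.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>
#include <mpi.h>

// Illustrative EINring-style density of point i over the whole data set.
double density(int i, const std::vector<double>& pos, int m, double w) {
    std::vector<int> ring_count(m, 0);
    for (int j = 0; j < (int)pos.size(); ++j) {
        if (j == i) continue;
        int ring = (int)(std::fabs(pos[j] - pos[i]) / w);
        if (ring < m) ++ring_count[ring];
    }
    double d = 0.0;
    for (int r = 0; r < m; ++r) d += ring_count[r] / double(r + 1);
    return d;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    // Every rank holds the whole data set (toy 1-D points here) ...
    std::vector<double> pos = {0.0, 0.4, 0.9, 1.6, 5.0, 5.3, 6.1, 9.0};
    int n = (int)pos.size();

    // ... but computes densities only for its own block of roughly n/p points,
    // so the density step costs O(m*n^2/p) per machine instead of O(m*n^2).
    int lo = rank * n / p, hi = (rank + 1) * n / p;
    std::vector<double> my_density(hi - lo);
    for (int i = lo; i < hi; ++i)
        my_density[i - lo] = density(i, pos, /*m=*/3, /*w=*/1.0);

    std::printf("rank %d computed densities for points [%d, %d)\n", rank, lo, hi);
    MPI_Finalize();
    return 0;
}
```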

  11. Parallel DH-Clustering (cont.) • After the densities are calculated, each machine works on its own data to build local attractor trees. • Since the data size on each machine is n/p, the parallel complexity of this step is reduced from the sequential O(n²) to O(n²/p²). • Generally, the number of local attractor trees built on an individual machine is no larger than the number produced by the sequential approach. For example, if the sequential approach produces k attractor trees, then with the data divided among p machines each machine may produce up to k local attractor trees, so the total number of attractor trees might be as large as p·k. • This does not affect the final clustering result. It only adds more levels to the clustering structure, since the merging process starts with more local attractors.

  12. Parallel DH-Clustering (cont.) • There are two ways to merge the attractor trees. • One way is to let the local attractor trees stay on their own machines and assign one machine to do the merging. In this case, the closest tree pair is chosen by polling all the machines at each merging step. • The other way is to collect all local attractor trees onto one machine before the merging process is carried out on that machine. • Our experiments showed that, although the first approach seemed better load-balanced and more computationally efficient, it turned out to be very slow because of the communication time. • We therefore adopted the second approach.

  13. Some Technical Issues • Since the number of local attractor trees differs from machine to machine and the trees themselves have different sizes, it is very inefficient to send one tree at a time. To solve this problem, we pack all the local attractor trees into large sending buffers. A special delimiter is inserted between neighboring trees, and each tree begins with its attractor. • Another issue is that the sending buffers on different machines have different sizes because the numbers of attractors differ, so it is hard for the root machine to gather data/trees of different sizes. To solve this problem, we let the root gather the size of the sending buffer from each machine first. We then prepare an array of displacements into the receiving buffer based on the different sending sizes, so that the data from each machine goes directly to the position determined by its displacement.
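
A minimal MPI sketch of the gather step (our own; the buffer layout, with point IDs packed as integers, the attractor first within each tree, and -1 as the delimiter, is an assumed encoding): the root first gathers each machine's buffer size with MPI_Gather, builds the displacement array, and then collects all the buffers with MPI_Gatherv.

```cpp
#include <cstdio>
#include <vector>
#include <mpi.h>

// Delimiter inserted between neighboring trees in the packed buffer.
const int DELIM = -1;

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    // Pack this machine's local attractor trees into one sending buffer.
    // Each tree starts with its attractor's point ID, followed by its members,
    // and trees are separated by DELIM (toy trees derived from the rank here).
    std::vector<int> send = {10 * rank, 10 * rank + 1, DELIM,   // tree 1
                             10 * rank + 2, DELIM};             // tree 2
    int my_size = (int)send.size();

    // Step 1: the root gathers the buffer size from every machine.
    std::vector<int> sizes(p);
    MPI_Gather(&my_size, 1, MPI_INT, sizes.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

    // Step 2: the root builds displacements so each machine's data lands in
    // the right position of the receiving buffer, then gathers the buffers.
    std::vector<int> displs(p, 0);
    std::vector<int> recv;
    if (rank == 0) {
        int total = 0;
        for (int i = 0; i < p; ++i) { displs[i] = total; total += sizes[i]; }
        recv.resize(total);
    }
    MPI_Gatherv(send.data(), my_size, MPI_INT,
                recv.data(), sizes.data(), displs.data(), MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("root received %zu packed values from %d machines\n",
                    recv.size(), p);
    MPI_Finalize();
    return 0;
}
```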

  14. Parallel DH-Clustering (cont.) • When all attractor trees have been gathered in a receiving buffer at the root machine, all machines stop except the root. • The root machine then dissects the receiving buffer based on the delimiters and recovers the attractor trees from the received data. • With all the newly recovered attractor trees, the merging process starts just as in the sequential algorithm. The figure on slide 15 illustrates the parallel DH-Clustering process.
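
Continuing the assumed encoding from the previous sketch, the root can dissect the receiving buffer as follows (our own illustration): each delimiter-terminated segment is one attractor tree, and the first value of a segment is the tree's attractor.

```cpp
#include <cstdio>
#include <vector>

const int DELIM = -1;   // separator between neighboring trees in the packed buffer

// One recovered attractor tree: the attractor's ID followed by its members.
struct PackedTree {
    int attractor;
    std::vector<int> members;
};

// Dissect the receiving buffer: each delimiter-terminated segment is one tree,
// and the first value of a segment is the attractor (the tree's root).
std::vector<PackedTree> unpack(const std::vector<int>& recv) {
    std::vector<PackedTree> trees;
    std::vector<int> segment;
    for (int v : recv) {
        if (v == DELIM) {
            PackedTree t;
            t.attractor = segment.front();
            t.members.assign(segment.begin() + 1, segment.end());
            trees.push_back(t);
            segment.clear();
        } else {
            segment.push_back(v);
        }
    }
    return trees;
}

int main() {
    // A receiving buffer as it might look after the gather in the previous sketch.
    std::vector<int> recv = {0, 1, DELIM, 2, DELIM, 10, 11, DELIM, 12, DELIM};
    for (const PackedTree& t : unpack(recv))
        std::printf("tree rooted at attractor %d with %zu member(s)\n",
                    t.attractor, t.members.size());
    return 0;
}
```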

  15. [Figure: the parallel DH-Clustering process. The whole data set is loaded onto machines M0, M1, …, Mp; each machine performs density calculation on its own data portion D0, D1, …, Dp, builds local attractor trees, and packs the trees into a sending buffer; all sending buffers are gathered into a receiving buffer, where the merging process takes place.]

  16. Experimental Results • We implemented both the sequential DH-Clustering and the parallel DH-Clustering in C++ on BigBen at the Pittsburgh Supercomputing Center. The parallel DH-Clustering is implemented using MPI. • BigBen is a Cray XT3 MPP system with 2,068 compute nodes. Each compute node has two 2.6 GHz AMD Opteron processors and 2 GB of memory. • In the experiments, we compared run times and clustering results between the sequential DH-Clustering and the parallel DH-Clustering.

  17. Run Time Comparison

  18. Comparison of Clustering Results

  19. Conclusions • In this paper, we presented a parallel density-based hybrid clustering algorithm (parallel DH-Clustering). • The algorithm was implemented on the BigBen supercomputer using MPI. Our experiments show that the parallel DH-Clustering is much faster than the sequential approach, and the improvement is especially large for big data sets. • The clustering results of the parallel approach remain comparable to those of the sequential approach. • In the future, we would like to test our method on larger datasets, including real data sets, and to determine experimentally the optimal number of machines for clustering data sets of different sizes. These parameters will be a useful guide for users of parallel clustering algorithms.

  20. Thank you! Questions?
