
Clustering


Presentation Transcript


    1. Clustering Dragoljub Pokrajac 2003

    2. Clustering and Unsupervised Learning Clustering is one of the algorithms for unsupervised learning Class labels are not known in advance Unlike in classification, clustering models are learned purely from attribute values

    3. What is Clustering? Given: A set of unlabeled patterns Each pattern belongs to one or more groups Goal: Group the patterns so that patterns similar to each other are in the same group and dissimilar patterns are in distinct groups Such distinguished groups are called clusters

    4. Example Group these figures according to some criteria Attributes: Number of edges Color

    5. Clustering by color

    6. Clustering by number of edges

    7. Some Issues in Clustering The actual number of clusters is not known Potential lack of a priori knowledge about the data and cluster shapes Clustering may have to be performed on-line Time complexity when working with large amounts of data

    8. Types of Clustering Algorithms Partitioning Hierarchical Clustering for large data sets

    9. Partitioning Methods Only one set of clusters is created at the output of the algorithm The number of clusters is usually specified The dataset is partitioned into several groups, and the groups are updated through iterations Examples: K-means, EM, PAM, CLARA, CLARANS

    10. K-Means Algorithm Randomly initialize K cluster centers Repeat: Assign each point to the nearest cluster center Re-estimate the cluster centers by averaging the coordinates of the points assigned to each cluster
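As a rough illustration of the loop above, here is a minimal MATLAB sketch of k-means; it is not the author's code, and the function name kmeans_simple, the fixed number of iterations, and the squared Euclidean distance are assumptions made for the example (X is an N-by-d data matrix).

    function [labels, centers] = kmeans_simple(X, K, n_iterations)
    % Minimal k-means sketch (illustrative): random initialization,
    % assignment to the nearest center, and re-estimation of the centers.
    N = size(X, 1);
    centers = X(randperm(N, K), :);          % randomly pick K points as initial centers
    for j = 1:n_iterations
        % Assignment step: squared Euclidean distance to every center
        D = zeros(N, K);
        for h = 1:K
            D(:, h) = sum((X - repmat(centers(h, :), N, 1)).^2, 2);
        end
        [~, labels] = min(D, [], 2);
        % Update step: average the coordinates of the points in each cluster
        for h = 1:K
            if any(labels == h)
                centers(h, :) = mean(X(labels == h, :), 1);
            end
        end
    end
    end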

    32. Problems With K-Means The number of clusters must be prespecified The algorithm is sensitive to initialization and may not converge to the proper clusters The algorithm does not care about the shape of the clusters The algorithm does not care about densities (in the example, the blue cluster is much denser than the other clusters)
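The sensitivity to initialization can be illustrated with the Statistics Toolbox function kmeans, if it is available; the synthetic three-cluster data below, with one much denser cluster, is made up for this example, and different random seeds can yield different total within-cluster distances, i.e., different local optima.

    % Illustrative only: the data and seeds are made up for this example.
    X = [randn(100,2); ...
         randn(100,2) + repmat([4 4], 100, 1); ...
         0.2*randn(100,2) + repmat([2 -2], 100, 1)];   % third cluster is much denser
    rng(1);  [~, ~, sumd1] = kmeans(X, 3);              % one random initialization
    rng(2);  [~, ~, sumd2] = kmeans(X, 3);              % another initialization
    disp([sum(sumd1) sum(sumd2)]);                      % totals can differ: local optima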

    33. EM Algorithm Idea: Each point came from one of several Gaussian distributions Goal: Estimate the parameters of the Gaussian distributions

    34. Mixture of Gaussian Distributions With probability p1 the data came from the distribution D1 determined by: mean μ1, covariance matrix Σ1, conditional density function p(x|D1) With probability p2 the data came from the distribution D2 determined by: mean μ2, covariance matrix Σ2, conditional density function p(x|D2) … With probability pK the data came from the distribution DK determined by: mean μK, covariance matrix ΣK, conditional density function p(x|DK)

    35. Gaussian Mixture - Formula Similar to the well-known formula of total probability, we have a formula for the total probability density: p(x) = p1·p(x|D1) + p2·p(x|D2) + … + pK·p(x|DK) Since the conditional distributions here are Gaussians, each component density is p(x|Dh) = 1/((2π)^(d/2) |Σh|^(1/2)) · exp(−(x−μh)' Σh^(−1) (x−μh)/2), for h = 1, …, K, where d is the number of attributes
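For concreteness, the total density above can be evaluated directly; the sketch below uses the Statistics Toolbox function mvnpdf for the Gaussian conditional densities, and the two-component parameter values are purely illustrative.

    % p(x) = p1*p(x|D1) + p2*p(x|D2) for a two-component mixture (illustrative values)
    mu1 = [0 0];  S1 = eye(2);    p1 = 0.6;
    mu2 = [3 3];  S2 = 2*eye(2);  p2 = 0.4;
    x  = [1 1];
    px = p1*mvnpdf(x, mu1, S1) + p2*mvnpdf(x, mu2, S2);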

    36. EM Algorithm Details We need to set the number of clusters K in advance The algorithm consists of two phases Expectation: For every pattern, compute the probability that it came from each of the K clusters, conditioned on the observed attributes Maximization: Update the estimated values of the means of the clusters, the covariance matrices of the clusters, and the cluster priors (the probability that a random point belongs to a given cluster)

    37. EM in Matlab
    for j=1:n_iterations
        % Expectation phase: compute for each pattern the probability that it
        % came from each of the K clusters (multinorm_distr_value evaluates
        % the multivariate normal density at the rows of X)
        for h=1:K
            Pmat(:,h)=p(h)*multinorm_distr_value(mu(h,:),sigma{h},X);
        end
        Pmat=Pmat./repmat(sum(Pmat,2),1,K);
        % Maximization phase
        % Compute new prior probabilities first, so that p(h)*N equals the
        % effective number of patterns currently assigned to cluster h
        for h=1:K
            p(h)=sum(Pmat(:,h))/N;          % N patterns
        end
        % Compute new means
        for h=1:K
            mu(h,:)=Pmat(:,h)'*X/p(h)/N;
        end
        % Compute new sigmas
        for h=1:K
            XM=(X-repmat(mu(h,:),N,1)).*repmat(Pmat(:,h).^0.5,1,size(X,2));
            sigma{h}=XM'*XM/p(h)/N;
        end
    end

    38. EM Algorithm - Example 3500 points from a mixture of three two-dimensional Gaussian distributions The EM algorithm is initialized with: distribution means close to the true means, covariance matrices equal to the covariance matrix of all the data, and equal priors Each point is colored by a mixture of three primary colors A clearer color means more certainty in the cluster membership
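One possible way to set up the variables used by the slide-37 loop, following the initialization described above (rough initial means, the covariance matrix of all the data, equal priors); using mvnpdf in place of the slide's multinorm_distr_value helper, and the specific choices below, are assumptions made for the example, not the author's code.

    K = 3;  N = size(X, 1);
    mu = X(randperm(N, K), :);          % rough initial means (here: K points from the data)
    sigma = cell(1, K);
    for h = 1:K
        sigma{h} = cov(X);              % covariance matrix of all the data
    end
    p = ones(1, K) / K;                 % equal priors
    Pmat = zeros(N, K);
    n_iterations = 50;
    % then run the loop from slide 37, with the expectation step written e.g. as
    %   Pmat(:,h) = p(h) * mvnpdf(X, mu(h,:), sigma{h});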

    51. Problems with EM Algorithm Slow Convergence depends on the initialization Assumes Gaussian clusters

    52. Hierarchical Clustering A set of nested clusters is created The way the clusters are created is depicted by dendrograms Two variants: Agglomerative Divisive

    53. Agglomerative Clustering Put each pattern into one separate cluster While there are more than c clusters: Merge the two clusters that are closest according to some distance criterion Output the c clusters
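With the Statistics Toolbox, the procedure above can be sketched with pdist, linkage, cluster, and dendrogram; the choice c = 3 and the 'single' link are just examples, and the 'single'/'complete'/'average' options correspond to the distance criteria on the next slide.

    c = 3;
    Z = linkage(pdist(X), 'single');     % merge tree; 'complete' or 'average' also possible
    labels = cluster(Z, 'maxclust', c);  % cut the tree to output c clusters
    dendrogram(Z);                       % the dendrogram depicts how clusters were merged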

    54. Distance Criterion SINGLE LINK Minimal distance between points in two clusters COMPLETE LINK Maximal distance between points in two clusters AVERAGE LINK Average distance between points in two clusters

    58. Properties of Single Link Distance Favors elongated clusters

    59. Properties of Complete Link Distance Favors compact clusters

    67. Main Problems Slow Distances must be recomputed O(n²) time complexity

    68. Divisive Clustering We start from a single cluster and successively split clusters into smaller ones E.g., using a Minimal Spanning Tree (MST) A minimal spanning tree is a tree connecting all vertices of the graph such that the sum of the edge lengths is minimal Note: The MST can also be used to perform single-link clustering

    69. Divisive Clustering Using MST Consider the patterns as vertices of a fully connected graph Consider each pair of vertices as connected by an edge whose length equals the distance between the corresponding points Compute the MST Sort the edges of the MST in decreasing order While there are remaining edges: Form a new cluster by deleting the longest remaining edge
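A possible MATLAB sketch of this procedure, assuming a release that provides the graph, minspantree, rmedge, and conncomp functions; stopping after c clusters (rather than deleting all edges) and the value of c are assumptions made for the example.

    c = 3;
    D = squareform(pdist(X));                 % pairwise distances = edge lengths
    G = graph(D);                             % fully connected weighted graph
    T = minspantree(G);                       % minimal spanning tree over the patterns
    [~, order] = sort(T.Edges.Weight, 'descend');
    T = rmedge(T, order(1:c-1));              % delete the c-1 longest MST edges
    labels = conncomp(T)';                    % connected components are the clusters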

    75. Clustering Large Datasets - Issues Time complexity vs. the number of patterns and the number of attributes Spatial complexity What if the whole dataset cannot fit in main memory?

    76. DBSCAN Density-based clusters Time complexity, using a special data structure (R*-trees), is O(N log N), where N is the number of patterns

    77. A Couple of Definitions Core point: a pattern in whose neighborhood there are more than Nmin patterns Nmin: the minimal number of patterns in the neighborhood Example: Nmin = 9

    78. Density Reachable Points

    79. Density Reachable Points - Formally Point q is density reachable from p1 if: p1 is a core point There are some core points p2, p3, …, pM such that: p2 is in the neighborhood of p1 p3 is in the neighborhood of p2 p4 is in the neighborhood of p3 … pM is in the neighborhood of pM-1 q is in the neighborhood of pM NOTE: q does not need to be a core point!

    80. Density- Based Cluster A density-based cluster contains all the points density reachable from an arbitrary core point in the cluster!

    81. Idea of DBSCAN Initially, all patterns of the database are unlabeled For each pattern, we check whether it is labeled (it can be labeled if it was in some previously detected cluster) If the pattern is not labeled, we check whether it is a core point, so that it may initiate a new cluster If the pattern already has a cluster label, we do nothing and instead process the next pattern

    82. Idea of DBSCAN - Cont'd If the examined point is a core point, it seeds a new cluster, and we examine its neighbors If a neighbor is already labeled, it has already been examined, so we do not need to assign a label or re-examine it Otherwise (the neighbor is unlabeled): The neighbor is assigned the label of the new cluster We recursively examine all core points in the neighborhood

    83. DBSCAN - Algorithm
    DBSCAN:
    FOR each pattern in dataset
        IF the pattern is not already assigned to a cluster
            IF CORE_POINT(pattern)==Yes
                ASSIGN new cluster label to the pattern
                EXAMINE(pattern.neighbors)

    84. Important Note In a practical realization, we can avoid recursive calls: maintain and update a list of all nodes from the various neighborhoods that still need to be examined NOTE: Instead of a list, we could improve performance by using a set (sets do not contain duplicates) This leads to the following practical, non-recursive version of DBSCAN

    85. Non-Recursive DBSCAN
    FOR each pattern in dataset
        IF the pattern is not already assigned to a cluster
            IF CORE_POINT(pattern)==Yes
                ASSIGN new cluster label to the pattern
                ADD pattern.neighbors to the list
                WHILE list is not empty
                    TAKE neighbor from the beginning of the list (and remove it from the list)
                    IF neighbor is not already assigned to a cluster
                        ASSIGN the new cluster label to the neighbor
                        IF CORE_POINT(neighbor)==Yes
                            ADD neighbor.neighbors to the list
                END WHILE

    86. Remark In addition to the functionality described in the algorithm above, DBSCAN may assign a NOISE label to a pattern A pattern is NOISE if it is not a core point and is not density reachable from some core point NOISE patterns do not belong to any cluster
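Putting slides 85 and 86 together, here is a compact MATLAB sketch of the non-recursive version with the NOISE convention; it is illustrative only: the function name dbscan_sketch, the use of a full distance matrix instead of an R*-tree, and the exact core-point count are assumptions made for the example.

    function labels = dbscan_sketch(X, eps, Nmin)
    % Non-recursive DBSCAN sketch (illustrative). labels(i) = cluster index,
    % 0 = NOISE / never reached from a core point.
    N = size(X, 1);
    labels = zeros(N, 1);
    D = squareform(pdist(X));                    % full distance matrix (no R*-tree here)
    cl = 0;                                      % current cluster label
    for i = 1:N
        if labels(i) == 0                        % pattern not assigned to a cluster yet
            neighbors = find(D(i, :) <= eps);
            if numel(neighbors) > Nmin           % CORE_POINT(pattern) == Yes (slide 77)
                cl = cl + 1;
                labels(i) = cl;
                list = neighbors(:);             % neighbors still to be examined
                while ~isempty(list)
                    q = list(1);  list(1) = [];  % take from the beginning of the list
                    if labels(q) == 0
                        labels(q) = cl;
                        qn = find(D(q, :) <= eps);
                        if numel(qn) > Nmin      % the neighbor is itself a core point
                            list = [list; qn(:)];
                        end
                    end
                end
            end
        end
    end
    end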

    87. DBSCAN - Example

    111. Problems with DBSCAN How to choose optimal parameters (size of neighborhood and minimal number of points) May not work well with non-uniform and/or Gaussian clusters How does it scale with the number of attributes?

    112. Choice of Optimal Parameter Values (Ester et al., 1996) Use Nmin = 4 For each point in the dataset, compute the distance to its 4th nearest neighbor Sort these distances and plot them Choose the distance threshold so that: it is situated at the knee of the curve, or the percentage of noise is pre-specified
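A short MATLAB sketch of this heuristic (assuming pdist and squareform from the Statistics Toolbox); the threshold itself is still read off the plot by eye or from a prespecified noise percentage.

    Nmin = 4;
    D = sort(squareform(pdist(X)), 2);       % row i: distances from point i, ascending
    d4 = D(:, Nmin + 1);                     % 4th nearest neighbor (column 1 is the point itself)
    plot(sort(d4, 'descend'));               % choose the threshold at the knee of this curve
    xlabel('points sorted by 4-NN distance'); ylabel('distance to 4th nearest neighbor');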

    114. Example Let's detect the same Gaussian clusters we successfully discovered with k-means and with the EM algorithm

    118. Problem of Dimensionality The speed of DBSCAN depends on an efficient search for neighbors We need special indexing structures R*-trees work well up to 6 attributes X-trees work well up to 12 attributes Is there an indexing structure that scales well with the number of attributes?
