X-means: Extending K-means with Efficient Estimation of the Number of Clusters
Dan Pelleg, Andrew Moore, Carnegie Mellon University
Published: ICML 2000
Presentation by: Payam Refaeilzadeh
Problems with K-means
• K must be known in advance
• Searching for K is expensive
• Even K-means with a fixed K scales poorly: every round computes the distance from each point to each centroid to find the new cluster assignments
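To make the cost in the last bullet concrete, here is a minimal sketch of one naive K-means round in plain Python. The function name `naive_kmeans_round` is hypothetical, and keeping the old centroid for an empty cluster is an assumption of this sketch, not something the slides specify.

```python
import math

def naive_kmeans_round(points, centroids):
    """One naive K-means round: n * k distance computations."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    labels = []
    for p in points:
        # Distance from this point to every centroid.
        dists = [math.dist(p, c) for c in centroids]
        j = dists.index(min(dists))  # index of the nearest centroid
        labels.append(j)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    # New centroid = mean of the points assigned to it.
    # Assumption: an empty cluster keeps its old centroid.
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(len(centroids))
    ]
    return labels, new_centroids
```

The double loop over points and centroids is the part that the kd-tree remedy later in the talk tries to avoid.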
Remedies
• Search forward for the appropriate value of K within a given range
• Recursively split each cluster and use the BIC score to decide whether to keep each split
• Use kd-trees to accelerate the individual rounds of K-means
Splitting
• Use the local BIC score (computed on a single cluster's points) to decide whether to keep a split
• Use the global BIC score (computed on the whole data set) to decide which K to output at the end
BIC (Bayesian Information Criterion)
• The log-likelihood of the model, penalized for the number of model parameters
• The likelihood measures how well the data is “explained by” the clusters under the spherical-Gaussian assumption of K-means
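As a rough illustration of such a score, the sketch below computes a BIC for a hard clustering under identical spherical Gaussians. The function name `bic_spherical`, the variance estimate SSE / (R - K), and the parameter count (K - 1) + M*K + 1 are assumptions of this sketch (one common formulation), not details taken from the slides. It also assumes R > K and a non-zero within-cluster error.

```python
import math

def bic_spherical(points, labels, centroids):
    """BIC = log-likelihood - (number of parameters / 2) * log(R)."""
    R = len(points)        # number of points
    M = len(points[0])     # number of dimensions
    K = len(centroids)     # number of clusters
    # Sum of squared distances from each point to its own centroid.
    sse = sum(math.dist(p, centroids[j]) ** 2 for p, j in zip(points, labels))
    variance = sse / (R - K)  # assumption: shared spherical variance
    counts = [labels.count(j) for j in range(K)]
    # Mixing-weight term: each point contributes log(R_n / R).
    log_lik = sum(n * math.log(n / R) for n in counts if n > 0)
    # Gaussian normalisation and exponent terms, summed over all points.
    log_lik -= 0.5 * R * M * math.log(2 * math.pi * variance)
    log_lik -= sse / (2 * variance)
    # Assumed free parameters: K-1 weights, M*K coordinates, 1 variance.
    n_params = (K - 1) + M * K + 1
    return log_lik - 0.5 * n_params * math.log(R)

# Two well-separated blobs: the two-centroid model should score higher.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
bic_one = bic_spherical(data, [0] * 8, [(5.5, 5.5)])
bic_two = bic_spherical(data, [0] * 4 + [1] * 4, [(0.5, 0.5), (10.5, 10.5)])
print(bic_two > bic_one)
```

A split test in this spirit would compare the score of a parent cluster modelled with one centroid against the same points modelled with two child centroids, and keep the split only if the two-centroid score is higher.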
Kd-trees
• Points to be clustered are put into a binary hierarchical structure
• Each node represents a subset of the points and stores
  • The minimal hyper-rectangle enclosing all points in the subset
  • The vector sum of all the points in the subset
  • The number of points in the subset
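The node layout described above could look roughly like this. The names `KDNode` and `build_kdtree`, the `leaf_size` parameter, and the median split along the widest dimension are all assumptions of this sketch; the slides do not say how nodes are split.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

@dataclass
class KDNode:
    lo: List[float]        # lower corner of the enclosing hyper-rectangle
    hi: List[float]        # upper corner of the enclosing hyper-rectangle
    vec_sum: List[float]   # vector sum of all points in the subset
    count: int             # number of points in the subset
    points: Optional[List[Sequence[float]]] = None  # kept only at leaves
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None

def build_kdtree(points, leaf_size=2):
    dim = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]
    vec_sum = [sum(p[d] for p in points) for d in range(dim)]
    node = KDNode(lo, hi, vec_sum, len(points))
    widths = [hi[d] - lo[d] for d in range(dim)]
    split_dim = widths.index(max(widths))
    # Stop at small nodes, or when all points coincide.
    if len(points) <= leaf_size or widths[split_dim] == 0:
        node.points = list(points)
        return node
    # Assumption: split at the median along the widest dimension.
    ordered = sorted(points, key=lambda p: p[split_dim])
    mid = len(ordered) // 2
    node.left = build_kdtree(ordered[:mid], leaf_size)
    node.right = build_kdtree(ordered[mid:], leaf_size)
    return node
```

Because every node caches its vector sum and count, a whole subtree can be credited to a centroid without visiting its points.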
Using kd-trees
• For each centroid, keep a counter holding the vector sum and the number of the points belonging to it
• Update these counters by scanning the kd-tree only once
  • Start with the root node and all centroids
  • As you walk down the tree, centroids get black-listed (when no point in that node could possibly belong to them)
  • When only one centroid remains, update its counter using the statistics stored in the node
• At the end of the scan we have enough information to recalculate the centroid coordinates
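The scan above can be sketched as follows. This is an illustrative version under stated assumptions, not the authors' code: the helper names (`build`, `dominates`, `update`, `kmeans_round`), the dict-based nodes, and the median split are hypothetical. The black-listing test used here checks the corner of the node's hyper-rectangle that lies furthest in the direction of the competing centroid; if even that corner is closer to the best candidate, the competitor cannot own any point in the node.

```python
import math

def build(points, leaf_size=2):
    dim = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]
    node = {"lo": lo, "hi": hi, "count": len(points),
            "sum": [sum(p[d] for p in points) for d in range(dim)],
            "points": None, "children": ()}
    widths = [h - l for l, h in zip(lo, hi)]
    d = widths.index(max(widths))
    if len(points) <= leaf_size or widths[d] == 0:
        node["points"] = list(points)
    else:
        ordered = sorted(points, key=lambda p: p[d])  # assumed median split
        mid = len(ordered) // 2
        node["children"] = (build(ordered[:mid], leaf_size),
                            build(ordered[mid:], leaf_size))
    return node

def dist_to_box(c, lo, hi):
    # Distance from centroid c to the nearest point of the hyper-rectangle.
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(c, lo, hi)))

def dominates(best, other, lo, hi):
    # Corner of the box furthest in the direction from best towards other.
    corner = [h if o > b else l for b, o, l, h in zip(best, other, lo, hi)]
    return math.dist(corner, best) < math.dist(corner, other)

def update(node, centroids, candidates, sums, counts):
    lo, hi = node["lo"], node["hi"]
    best = min(candidates, key=lambda j: dist_to_box(centroids[j], lo, hi))
    # Black-list every candidate that `best` dominates over this node.
    alive = [j for j in candidates
             if j == best or not dominates(centroids[best], centroids[j], lo, hi)]
    if len(alive) == 1:
        # One centroid left: credit the whole node from its cached statistics.
        counts[best] += node["count"]
        for d, s in enumerate(node["sum"]):
            sums[best][d] += s
    elif node["points"] is not None:
        # Leaf with several candidates: fall back to per-point assignment.
        for p in node["points"]:
            j = min(alive, key=lambda j: math.dist(p, centroids[j]))
            counts[j] += 1
            for d, x in enumerate(p):
                sums[j][d] += x
    else:
        for child in node["children"]:
            update(child, centroids, alive, sums, counts)

def kmeans_round(root, centroids):
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    update(root, centroids, list(range(len(centroids))), sums, counts)
    # Assumption: an empty cluster keeps its old centroid.
    new_centroids = [[s / counts[j] for s in sums[j]] if counts[j]
                     else list(centroids[j]) for j in range(len(centroids))]
    return new_centroids, counts
```

The tree is built once and reused for every round; only `kmeans_round` is repeated as the centroids move.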