1 / 20

Model-based evaluation of clustering validation measures

Model-based evaluation of clustering validation measures. Advisor : Dr. Hsu Presenter : Zih-Hui Lin Author :Marcel Brun, Chao Sima, Jianping Hua, James Lowey, Brent Carroll, Edward Suh, Edward R. Dougherty. Pattern Recognition, 2007. Outline. Motivation Objective

jatherton
Télécharger la présentation

Model-based evaluation of clustering validation measures

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Model-based evaluation of clustering validation measures Advisor : Dr. Hsu Presenter : Zih-Hui Lin Author :Marcel Brun, Chao Sima, Jianping Hua, James Lowey, Brent Carroll, Edward Suh, Edward R. Dougherty Pattern Recognition, 2007

  2. Outline • Motivation • Objective • Model-based analysis • Conclusions

  3. Motivation • Historically, a host of “validity” measures have been proposed for evaluating clustering results based on a single realization of the random-point-set process. • No doubt one would like to measure the accuracy of a cluster operator based on a single application. But is this feasible?

  4. Objective • In this paper we consider a number of proposed validity measures and we examine how well they correlate with error rates across a number of clustering algorithms and random-point-set models

  5. Model-based analysis • Specification of labeled point processes • Generation of samples from the processes • Application of clustering algorithms to the data • Estimation of the error of several algorithms from these samples • Computation of the several validation measures for these algorithms on the same samples: • Quantification of the quality of the indices

  6. Model-based analysis (1/2) • Specification of labeled point processes • requires determining some labeled point process with sufficient variability to obtain a broad range of error values • avoids overly simple models that may be beneficial for some specific measures. • Generation of samples from the processes: • This step involves generating 100 sample sets (sets with their labels) for each process. • Application of clustering algorithms to the data

  7. Model-based analysis (2/2) • Estimation of the error of several algorithms from these samples • Computation of the several validation measures for these algorithms on the same samples: • Internal indices • Relative indices • External indices • Quantification of the quality of the indices

  8. Conclusions • when investigating the performance of a proposed clustering algorithm, it is best to consider varied models and use the true clustering error.

  9. Data set

  10. Clustering algorithms

  11. Error measure

  12. validation measures • Internal validation • is based on calculating properties of the resulting clusters; • relative validation • is based on comparisons of partitions generated by the same algorithm with different parameters or different subsets of the data; • external validation • compares the partition generated by the clustering algorithm and a given partition of the data.

  13. Internal validation - Trace criterion, determinant criterion and invariant criterion Model 1 Model 2 Below 0.5 Model 3 Model 4 Model 5

  14. Internal validation –Dunn’s indices • the ratio between the minimum distance between two clusters and the size of the largest cluster Model 2 Model 5 Model 1

  15. Internal validation –Silhouette index • The silhouette is the average, over all clusters, of the silhouette width of their points 1 -1

  16. Internal validation –Hubert’s correlation with distance matrix M = n(n − 1)/2 be the number of pairs of different vectors

  17. Relative validation indices–Figure of merit • when used on microarray data, the clusters represent different biological groups, and therefore, points (genes) in the same cluster will possess similar pattern vectors (expression profiles) for additional features (arrays).

  18. Relative validation indices–Stability • the ability of a clustered data set to predict the clustering of another data set sampled from the same source.

  19. External validation indices–Hubert’s correlation • The Hubert statistic is based on the fact that the more similar the partitions, the more similar the matrices would be, and this similarity can be measured by their correlation. xi and xj belong to the same cluster→ d(i,j)=1 xi and xj belong to different cluster→ d(i,j)=0

  20. External validation indices–Rand statistics, Jaccard coefficient and Folkes and Mallows index

More Related