Model-Based Evaluation of Clustering Validation Measures in Pattern Recognition

Model-based evaluation of clustering validation measures • Advisor: Dr. Hsu • Presenter: Zih-Hui Lin • Authors: Marcel Brun, Chao Sima, Jianping Hua, James Lowey, Brent Carroll, Edward Suh, Edward R. Dougherty • Pattern Recognition, 2007

Outline • Motivation • Objective • Model-based analysis • Conclusions

Motivation • Historically, a host of “validity” measures have been proposed for evaluating clustering results based on a single realization of the random-point-set process. • No doubt one would like to measure the accuracy of a cluster operator based on a single application. But is this feasible?

Objective • In this paper we consider a number of proposed validity measures, and we examine how well they correlate with error rates across a number of clustering algorithms and random-point-set models.

Model-based analysis • Specification of labeled point processes • Generation of samples from the processes • Application of clustering algorithms to the data • Estimation of the error of several algorithms from these samples • Computation of several validation measures for these algorithms on the same samples • Quantification of the quality of the indices

Model-based analysis (1/2) • Specification of labeled point processes • Requires choosing labeled point processes with sufficient variability to obtain a broad range of error values • Avoids overly simple models that may favor some specific measures • Generation of samples from the processes • This step involves generating 100 sample sets (point sets with their labels) for each process • Application of clustering algorithms to the data

Model-based analysis (2/2) • Estimation of the error of several algorithms from these samples • Computation of several validation measures for these algorithms on the same samples • Internal indices • Relative indices • External indices • Quantification of the quality of the indices
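The six steps above can be sketched end to end in a few lines. This is only an illustrative sketch, not the setup used in the paper: the two-Gaussian model, the plain 2-means clusterer, the `separation_index` validity measure and all function names are assumptions introduced here, and the error is taken as the misclassification rate under the better of the two label matchings.

```python
import random
from math import dist, sqrt

def sample_model(sep, n_per_class, rng):
    """Assumed labeled point process: two unit-variance 2-D Gaussians, centres `sep` apart."""
    pts, labels = [], []
    for label, cx in ((0, 0.0), (1, sep)):
        for _ in range(n_per_class):
            pts.append((rng.gauss(cx, 1.0), rng.gauss(0.0, 1.0)))
            labels.append(label)
    return pts, labels

def assign_to(pts, centres):
    """Assign every point to the nearer of the two centres."""
    return [0 if dist(p, centres[0]) <= dist(p, centres[1]) else 1 for p in pts]

def two_means(pts, iters=20):
    """Plain 2-means, seeded with the leftmost and rightmost points."""
    centres = [min(pts), max(pts)]
    for _ in range(iters):
        assign = assign_to(pts, centres)
        for k in (0, 1):
            members = [p for p, a in zip(pts, assign) if a == k]
            if members:  # keep the old centre if a cluster empties
                centres[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assign_to(pts, centres), centres

def clustering_error(true, pred):
    """Misclassification rate, minimised over the two possible label matchings."""
    wrong = sum(t != p for t, p in zip(true, pred))
    return min(wrong, len(true) - wrong) / len(true)

def separation_index(pts, assign, centres):
    """Hypothetical internal index: centre distance over mean point-to-centre distance."""
    spread = sum(dist(p, centres[a]) for p, a in zip(pts, assign)) / len(pts)
    return dist(centres[0], centres[1]) / spread

def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

rng = random.Random(0)
errors, scores = [], []
for _ in range(100):                      # 100 sample sets, as on the slide
    sep = rng.uniform(0.5, 5.0)           # assumed way of varying difficulty
    pts, labels = sample_model(sep, 30, rng)
    assign, centres = two_means(pts)
    errors.append(clustering_error(labels, assign))
    scores.append(separation_index(pts, assign, centres))

corr = pearson(scores, errors)
print(f"correlation between index and true error: {corr:.3f}")
```

The last line is the "quantification" step: a validity measure is useful only to the extent that its value tracks the true clustering error across samples.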

Conclusions • When investigating the performance of a proposed clustering algorithm, it is best to consider varied models and use the true clustering error.

Data set

Clustering algorithms

Error measure

Validation measures • Internal validation • is based on calculating properties of the resulting clusters • Relative validation • is based on comparisons of partitions generated by the same algorithm with different parameters or different subsets of the data • External validation • compares the partition generated by the clustering algorithm with a given partition of the data

Internal validation – Trace criterion, determinant criterion and invariant criterion • [Figure: panels for Models 1 to 5; slide annotation: "Below 0.5"]

Internal validation – Dunn’s indices • The ratio between the minimum distance between two clusters and the size of the largest cluster • [Figure: panels for Model 2, Model 5 and Model 1]
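The ratio described above can be computed directly once a cluster-to-cluster distance and a cluster size are chosen. The sketch below assumes one common choice: the distance between two clusters is the smallest point-to-point distance between them, and the size of a cluster is its diameter (largest within-cluster distance). Other choices give the other Dunn-type indices; the function name `dunn_index` is introduced here for illustration.

```python
from itertools import combinations
from math import dist

def dunn_index(clusters):
    """Dunn-type index: min between-cluster distance / max cluster diameter.

    `clusters` is a list of clusters, each a list of coordinate tuples.
    At least one cluster must contain two distinct points.
    """
    min_between = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    max_diameter = max(
        dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_between / max_diameter

# Two tight clusters five units apart, each of diameter one.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 0.0), (5.0, 1.0)]]
print(dunn_index(clusters))
```

Larger values indicate compact, well-separated clusters; moving the two clusters closer together or stretching either one lowers the index.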

Internal validation – Silhouette index • The silhouette is the average, over all clusters, of the silhouette width of their points; values range from −1 to 1
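A minimal sketch of this averaging, assuming the usual definition of the silhouette width of a point: with a the mean distance to the other points of its own cluster and b the smallest mean distance to the points of any other cluster, the width is (b − a) / max(a, b). The function names and the convention of width 0 for a single-point cluster are assumptions made for this example.

```python
from math import dist

def silhouette_width(p, own, others):
    """Silhouette width of point p, given its own cluster and the other clusters."""
    if len(own) == 1:
        return 0.0  # assumed convention for single-point clusters
    # Distance from p to itself is 0, so summing over all of `own` is safe.
    a = sum(dist(p, q) for q in own) / (len(own) - 1)
    b = min(sum(dist(p, q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)

def silhouette(clusters):
    """Average over clusters of the mean silhouette width of their points."""
    per_cluster = []
    for i, own in enumerate(clusters):
        others = clusters[:i] + clusters[i + 1:]
        widths = [silhouette_width(p, own, others) for p in own]
        per_cluster.append(sum(widths) / len(widths))
    return sum(per_cluster) / len(per_cluster)

clusters = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 0.0), (5.0, 1.0)]]
print(round(silhouette(clusters), 3))
```

Values near 1 mean points sit much closer to their own cluster than to any other; values near −1 suggest points would fit better in a neighbouring cluster.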

Internal validation – Hubert’s correlation with distance matrix • Let M = n(n − 1)/2 be the number of pairs of different vectors

Relative validation indices – Figure of merit • When used on microarray data, the clusters represent different biological groups, and therefore points (genes) in the same cluster will possess similar pattern vectors (expression profiles) for additional features (arrays).

Relative validation indices – Stability • The ability of a clustered data set to predict the clustering of another data set sampled from the same source.

External validation indices – Hubert’s correlation • The Hubert statistic is based on the fact that the more similar the partitions, the more similar their matrices will be, and this similarity can be measured by their correlation. • x_i and x_j belong to the same cluster → d(i,j) = 1 • x_i and x_j belong to different clusters → d(i,j) = 0
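As a sketch of this idea, the code below builds the 0/1 entries d(i,j) for each partition over all n(n − 1)/2 pairs and takes their Pearson correlation. The function names are introduced here for illustration, and the use of the plain Pearson coefficient as the normalisation is an assumption; the statistic is undefined when either partition places all points in one cluster or every point in its own cluster, since the entries then have zero variance.

```python
from itertools import combinations
from math import sqrt

def comembership(labels):
    """d(i,j) for every pair i < j: 1 if same cluster, 0 otherwise."""
    return [1 if labels[i] == labels[j] else 0
            for i, j in combinations(range(len(labels)), 2)]

def hubert_correlation(labels_a, labels_b):
    """Pearson correlation between the co-membership entries of two partitions."""
    x, y = comembership(labels_a), comembership(labels_b)
    m = len(x)  # number of pairs, n(n - 1)/2
    mx, my = sum(x) / m, sum(y) / m
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

reference = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]   # same partition, different label names
crossed = [0, 1, 0, 1]      # splits every reference cluster

print(hubert_correlation(reference, relabelled))
print(hubert_correlation(reference, crossed))
```

Because only co-membership is compared, the statistic does not depend on how the clusters are named, which is why the relabelled partition scores as identical to the reference.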

External validation indices – Rand statistic, Jaccard coefficient and Fowlkes and Mallows index
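All three indices are functions of the same four pair counts, so a short sketch may help. Using the standard definitions, with a the number of pairs placed together in both partitions, b together only in the first, c together only in the second, and d apart in both: Rand = (a + d) / (a + b + c + d), Jaccard = a / (a + b + c), and Fowlkes and Mallows = a / sqrt((a + b)(a + c)). The function names and the example labels are illustrative, not taken from the paper.

```python
from itertools import combinations
from math import sqrt

def pair_counts(labels_a, labels_b):
    """Count point pairs by whether each partition puts them in the same cluster."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            a += 1
        elif same_a:
            b += 1
        elif same_b:
            c += 1
        else:
            d += 1
    return a, b, c, d

def rand_statistic(labels_a, labels_b):
    a, b, c, d = pair_counts(labels_a, labels_b)
    return (a + d) / (a + b + c + d)

def jaccard_coefficient(labels_a, labels_b):
    a, b, c, _ = pair_counts(labels_a, labels_b)
    return a / (a + b + c)

def fowlkes_mallows(labels_a, labels_b):
    a, b, c, _ = pair_counts(labels_a, labels_b)
    return a / sqrt((a + b) * (a + c))

given = [0, 0, 0, 1, 1, 1]       # given partition of the data
clustered = [0, 0, 1, 1, 1, 1]   # partition produced by a clustering algorithm

print(pair_counts(given, clustered))
print(rand_statistic(given, clustered))
print(jaccard_coefficient(given, clustered))
print(fowlkes_mallows(given, clustered))
```

The Rand statistic counts agreements of both kinds (together and apart), whereas the Jaccard coefficient and the Fowlkes and Mallows index ignore pairs that are apart in both partitions, so they are less inflated when there are many small clusters.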
