This paper explores information-theoretic distance measures applied to clustering validation, comparing clustering outputs with true partitions. It addresses challenges such as the dimensionality, size, and sparsity of datasets, which can affect distance measures. The methodology emphasizes normalization, quasi-distance computation, symmetry, and the triangle law for effective clustering performance analysis. The experiments demonstrate that normalized distance measures outperform original measures, with normalized Shannon distance achieving the best performance among several evaluated metrics.
Information-theoretic distance measures for clustering validation: Generalization and normalization Presenter: Lin, Shu-Han • Authors: Ping Luo, Hui Xiong, Guoxing Zhan, Junjie Wu, and Zhongzhi Shi IEEE Transactions on Knowledge and Data Engineering (TKDE), 2009
Outline • Motivation • Objective • Methodology • Experiments • Conclusion • Comments
Motivation σ: the “true” partition π: clustering output • External criteria for clustering validation: • Information-theoretic distance measures are used to compare the clustering output with the “true” partition • Clustering ability of algorithms: compare different clustering algorithms on a given dataset • Clustering difficulty of datasets: compare different datasets under a given algorithm
Objectives • The dimension, size, and sparseness of the data and the scales of the attributes differ from dataset to dataset • so the ranges of the distance measures differ as well • For a fair comparison: distance normalization
Methodology – Conditional Entropy π: group label σ: class label The equality C1 = C2 yields the Shannon entropy
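The formula on this slide did not survive extraction. As a minimal sketch of the Shannon special case only, the conditional entropy of the class labels σ given the group labels π can be computed from label counts. The function name `conditional_entropy` and the toy label lists are assumptions made for illustration, not taken from the paper, and the generalized form with C1 and C2 is not reproduced here.

```python
import math
from collections import Counter


def conditional_entropy(target, given):
    """Shannon conditional entropy H(target | given), in nats.

    Both arguments are equal-length sequences of labels, one per object.
    """
    n = len(target)
    joint = Counter(zip(given, target))   # counts of (given, target) label pairs
    marginal = Counter(given)             # counts of each 'given' label
    h = 0.0
    for (g, _t), count in joint.items():
        # p(g, t) * log p(t | g)
        h -= (count / n) * math.log(count / marginal[g])
    return h


# Hypothetical toy data: four objects, two classes.
sigma = ["a", "a", "b", "b"]     # class labels (the "true" partition)
pi_same = [0, 0, 1, 1]           # clustering that matches sigma
pi_single = [0, 0, 0, 0]         # clustering that puts everything in one group

h_same = conditional_entropy(sigma, pi_same)      # groups determine classes
h_single = conditional_entropy(sigma, pi_single)  # groups carry no information
```

When the groups determine the classes exactly the value is 0; when all objects fall into one group the value equals the entropy of σ itself, here log 2.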
Methodology – Quasi-Distance σ: the “true” partition π: clustering output Minimum reachable: d(π, σ) reaches its minimum over both π and σ iff π = σ Symmetry: d(π, σ) = d(σ, π) Triangle law: d(π, σ) + d(σ, τ) ≥ d(π, τ)
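The three properties can be checked numerically on small partitions. The sketch below assumes the Shannon distance is the sum of the two conditional entropies, H(π|σ) + H(σ|π) (the variation of information); the slide's own definition is not recoverable from the text, so this form, the function names, and the toy partitions are illustrative assumptions.

```python
import math
from collections import Counter


def conditional_entropy(target, given):
    """Shannon conditional entropy H(target | given), in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    marginal = Counter(given)
    h = 0.0
    for (g, _t), count in joint.items():
        h -= (count / n) * math.log(count / marginal[g])
    return h


def distance(pi, sigma):
    """Assumed Shannon distance: H(pi | sigma) + H(sigma | pi)."""
    return conditional_entropy(pi, sigma) + conditional_entropy(sigma, pi)


# Hypothetical partitions of six objects.
pi = [0, 0, 1, 1, 2, 2]
sigma = [0, 0, 0, 1, 1, 1]
tau = [0, 1, 0, 1, 0, 1]

# Minimum reachable: the distance is 0 when the partitions coincide,
# even if the labels themselves are renamed.
d_identical = distance(pi, [5, 5, 7, 7, 9, 9])

# Symmetry: swapping the arguments does not change the value.
symmetric = abs(distance(pi, sigma) - distance(sigma, pi)) < 1e-12

# Triangle law: going through sigma is never shorter than going directly.
triangle = distance(pi, sigma) + distance(sigma, tau) >= distance(pi, tau) - 1e-12
```

A check like this only tests the properties on the chosen examples; it does not replace the proof.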
Methodology – Normalization Issue How to get it?
Methodology – Computation of The worst result of π (m groups) Generate a π0 ∈ PART(A) such that σ: n
Methodology – Computation of There is an difference between and
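The construction of the worst-case partition is lost from these slides, so the following is not the authors' procedure. It is a brute-force sketch of the same idea for very small data: enumerate every labeling of the objects into exactly m groups, take the largest distance to σ as the maximum, and divide by it. The distance form (sum of the two Shannon conditional entropies) and all function names are assumptions, and the enumeration grows as m^n, so it is only usable on toy inputs.

```python
import itertools
import math
from collections import Counter


def conditional_entropy(target, given):
    """Shannon conditional entropy H(target | given), in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    marginal = Counter(given)
    h = 0.0
    for (g, _t), count in joint.items():
        h -= (count / n) * math.log(count / marginal[g])
    return h


def distance(pi, sigma):
    """Assumed Shannon distance: H(pi | sigma) + H(sigma | pi)."""
    return conditional_entropy(pi, sigma) + conditional_entropy(sigma, pi)


def max_distance(sigma, m):
    """Largest distance to sigma over all labelings with exactly m groups.

    Exhaustive search over m ** len(sigma) labelings: toy sizes only.
    """
    best = 0.0
    for labels in itertools.product(range(m), repeat=len(sigma)):
        if len(set(labels)) == m:          # keep labelings that use all m groups
            best = max(best, distance(labels, sigma))
    return best


def normalized_distance(pi, sigma):
    """Distance divided by the worst case for the same number of groups."""
    d_max = max_distance(sigma, len(set(pi)))
    return distance(pi, sigma) / d_max if d_max > 0 else 0.0


# Hypothetical example: six objects, two true classes, three found groups.
sigma = [0, 0, 0, 1, 1, 1]
pi = [0, 0, 1, 1, 2, 2]
score = normalized_distance(pi, sigma)   # lies in [0, 1]
```

Because the score is a fraction of the worst reachable value for the same σ and m, it is comparable across datasets whose raw distance ranges differ, which is the purpose of normalization stated in the objectives.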
Experiments Shannon Entropy Gini Index Goodman-Kruskal Pal Entropy
Experiments
Conclusions • Quasi-distance: external measure for clustering validation • Symmetry • Triangle law • Minimum reachable • Normalization: maximum value of a distance measure • Compare clustering performances of an algorithm on different datasets • The normalized distance measures outperform the original distance measures • The normalized Shannon distance has the best performance among the 4 observed distance measures
Comments • Advantage • The idea is intuitive • Theoretical analysis • Drawback • Describe why they think quasi-distance is better than DCV. • Application • The same use as DCV?