
On clustering tree structured data with categorical nature






Presentation Transcript


  1. On clustering tree structured data with categorical nature Presenter: Wu, Jia-Hao Authors: B. Boutsinas, T. Papastergiou. PR (2008)

  2. Outline • Motivation • Objective • Methodology • Experiments • Conclusion • Personal Comments

  3. Motivation • Clustering methods have been widely studied in various scientific areas, including machine learning, neural networks and statistics. • Traditionally, clustering methods deal with numerical data. • Nowadays, commercial and scientific databases usually contain categorical data, called nominal, non-metric or symbolic in the literature (e.g. occupation, sex, symptom).

  4. Objective • Present a dissimilarity measure capable of dealing with tree structured categorical data. • It can be used to extend the various versions of the very popular k-means clustering algorithm to deal with such data. [Figure: example ontology tree — Taiwan → North/Middle/South Taiwan → cities such as KeeLung, TaiPei, TaoYuan, TaiChung, ChangHua, YunLin (DouLiou), TaiNan, KaoHsiung, PingTung]

  5. Methodology – k-means • (1) Selection of the initial k means. • (2) Assignment of each object to the cluster with the nearest mean/medoid. • (3) Recalculation of the k means/medoids of the clusters. • (4) Computation of the quality function. [Figure: k = 2 example using the Euclidean distance q(x, y) = ||x − y||²; quality function: 0.96]
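Steps (1)–(4) describe the classic numerical k-means loop, not yet the paper's categorical extension. A minimal sketch of that loop (the points, seed, and iteration count are illustrative choices, not from the paper):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch following steps (1)-(4) above."""
    rng = random.Random(seed)
    means = [list(p) for p in rng.sample(points, k)]   # (1) initial k means

    def sqdist(p, m):
        return sum((a - b) ** 2 for a, b in zip(p, m))

    for _ in range(iters):
        clusters = [[] for _ in range(k)]              # (2) nearest-mean assignment
        for p in points:
            clusters[min(range(k), key=lambda j: sqdist(p, means[j]))].append(p)
        for j, c in enumerate(clusters):               # (3) recompute each mean
            if c:
                means[j] = [sum(xs) / len(c) for xs in zip(*c)]
    # (4) quality function: total within-cluster squared distance
    quality = sum(min(sqdist(p, m) for m in means) for p in points)
    return means, quality

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
means, q = kmeans(pts, k=2)
```

Swapping the Euclidean distance and the mean update for a categorical dissimilarity and a medoid update is exactly the extension the paper pursues.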

  6. Methodology – Measures for categorical data • Overlap measure • When ai and aj are identical → 0, otherwise → 1. • Weakness: all attribute values are at equal distance from each other. • Conceptual measure • A matrix built by asking the user to judge the closeness between attribute values. • Based on a tree structure. • Semantic similarity measure • Based on the shortest path between two concepts X and Y in an ontology. • Weakness: the links in an ontology represent uniform distances.
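The overlap measure and its weakness fit in a few lines (the example attribute values are illustrative):

```python
def overlap(ai, aj):
    """Overlap measure: 0 when the attribute values are identical, 1 otherwise."""
    return 0 if ai == aj else 1

# The weakness in action: every pair of distinct values is equally far apart,
# so 'doctor' is as dissimilar to 'nurse' as it is to 'truck driver'.
same = overlap("doctor", "doctor")
near = overlap("doctor", "nurse")
far = overlap("doctor", "truck driver")
```

A tree structured measure is meant to recover exactly the semantic gradation that `near == far` throws away.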

  7. Methodology – Proposed dissimilarity measure • The proposed dissimilarity measure is based on a set of user-defined ontologies, implemented by tree structures. • The distance between the corresponding nodes of the tree structure is defined in terms of: • fl(X, Y): the level of the nearest common father node of nodes X and Y. • l(X): the level of node X. • max(p(X)): the length of the maximum path starting from the root to a leaf and containing node X. • p(X, Y): the length of the directed path (number of edges) connecting X and Y.
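The formula that combines these four quantities appeared as an image on the slide and is not preserved in the transcript, but the building blocks themselves can be computed directly from a child → parent representation of the ontology. A minimal sketch on an abbreviated version of the Taiwan tree from slide 4:

```python
# Abbreviated Taiwan ontology from slide 4, stored as child -> parent.
parent = {
    "North": "Taiwan", "Middle": "Taiwan", "South": "Taiwan",
    "KeeLung": "North", "TaiPei": "North",
    "TaiChung": "Middle",
    "TaiNan": "South", "KaoHsiung": "South",
}

def ancestors(x):
    """Path from x up to the root, inclusive."""
    path = [x]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def l(x):
    """l(X): level of node X, with the root at level 0."""
    return len(ancestors(x)) - 1

def father(x, y):
    """Nearest common father node of X and Y."""
    ax = set(ancestors(x))
    return next(a for a in ancestors(y) if a in ax)

def fl(x, y):
    """fl(X, Y): level of the nearest common father node."""
    return l(father(x, y))

def p(x, y):
    """p(X, Y): number of edges on the path connecting X and Y."""
    return (l(x) - fl(x, y)) + (l(y) - fl(x, y))

def max_p(x):
    """max(p(X)): longest root-to-leaf path (in edges) containing X."""
    children = {}
    for c, f in parent.items():
        children.setdefault(f, []).append(c)
    def below(n):  # deepest chain of edges under n
        return 1 + max(map(below, children[n])) if n in children else 0
    return l(x) + below(x)
```

For instance, fl(KeeLung, TaiPei) = 1 because their nearest common father, North, sits at level 1, while fl(TaiPei, TaiNan) = 0 because only the root covers both.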

  8. Methodology – Proposed dissimilarity measure • Suggesting that the distance between any X and Y nodes must be smaller as they are closer to their father node. • Suggesting that the distance between X and Y must be smaller as the maximum path containing X and Y becomes larger.

  9. Methodology- four criteria • An attribute value has distance zero to itself. • Distances are symmetric. • An attribute value has positive distance to all other values. • Distances obey the triangle inequality.

  10. Methodology – Question about the criteria • The triangle inequality does not hold for the proposed measure. • E.g. in a case similar to the path A → B → E, where A ≡ X, B ≡ Z, E ≡ Y, the requirement 0 ≤ −d(X, Y) + d(X, Z) + d(Z, Y) is violated. [Figure: tree path with nodes labelled X = 1, Z = 2, Y = 3]
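Such a failure can be demonstrated mechanically: enumerate all triples and report where d(X, Y) > d(X, Z) + d(Z, Y). The dissimilarity values below are hypothetical numbers chosen only to exhibit a violation of the kind described for the path A → B → E; they are not the paper's actual values.

```python
def triangle_violations(d, names):
    """All (X, Z, Y) triples with d(X, Y) > d(X, Z) + d(Z, Y)."""
    return [(x, z, y) for x in names for z in names for y in names
            if d(x, y) > d(x, z) + d(z, y) + 1e-12]

# Hypothetical dissimilarities on the path A -> B -> E: the direct
# distance d(A, E) exceeds the two-hop sum d(A, B) + d(B, E).
table = {frozenset(("A", "B")): 0.3,
         frozenset(("B", "E")): 0.4,
         frozenset(("A", "E")): 0.9}

def d(x, y):
    return 0.0 if x == y else table[frozenset((x, y))]

bad = triangle_violations(d, ["A", "B", "E"])
```

Here `bad` contains the triple (A, B, E) and its mirror image, confirming that these values form no metric.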

  11. Methodology – Medoid • (1) Define the mean level of a set of categorical attribute values within a set of objects O. • (2) Calculate the path whose attribute values appear in the majority of the objects in O. • (3) The medoid attribute value is the attribute value on this path at the average level of the nodes in O.
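The steps above are the paper's tree-based medoid construction. For comparison, the conventional medoid update used by standard k-medoids simply picks the object with the smallest total dissimilarity to the rest of the cluster — a minimal sketch, paired here with the overlap measure from slide 6 rather than the proposed tree measure:

```python
def overlap(ai, aj):
    # overlap measure from slide 6: 0 if identical, 1 otherwise
    return 0 if ai == aj else 1

def medoid(objects, d=overlap):
    """Conventional medoid: the object minimizing total dissimilarity
    to all objects in the cluster (not the paper's tree-based rule)."""
    return min(objects, key=lambda o: sum(d(o, other) for other in objects))

center = medoid(["nurse", "doctor", "doctor", "teacher"])
```

The tree-based rule avoids this O(|O|²) scan by reading the medoid straight off the majority path and the mean level.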

  12. Experiments [Figure: clustering of questions J1–J20 — left: Chi-Square measure; right: proposed measure]

  13. Experiments (Cont.) [Figure: clustering of questions J1–J20, second comparison — left: Chi-Square measure; right: proposed measure]

  14. Conclusion • Provides a new dissimilarity measure for categorical objects, based on ontologies. • It is proved that the proposed dissimilarity measure is not a metric, because it does not obey the triangle inequality. • It enables very efficient clustering of tree structured categorical data. • However, the computation of the proposed dissimilarity measure is more time consuming, although it remains efficient even for very large tree structures: calculating each of fl(X, Y) and l(X) is O(h) in the worst case, and calculating max(p(X)) is O(|N|) in the worst case.

  15. Comments • Advantage • A good clustering method for categorical data. • Drawback • … • Application • Clustering questions.
