
On clustering tree structured data with categorical nature






Presentation Transcript


  1. On clustering tree structured data with categorical nature Presenter: Wu, Jia-Hao Authors: B. Boutsinas, T. Papastergiou. PR (2008)

  2. Outline • Motivation • Objective • Methodology • Experiments • Conclusion • Personal Comments

  3. Motivation • Clustering methods have been widely studied in various scientific areas, including machine learning, neural networks and statistics. • Traditionally, clustering methods deal with numerical data. • Nowadays, commercial and scientific databases usually contain categorical data, called nominal, non-metric or symbolic in the literature (e.g. occupation, sex, symptom).

  4. Objective • Present a dissimilarity measure capable of dealing with tree structured categorical data. • It can be used to extend the various versions of the very popular k-means clustering algorithm to deal with such data. [Figure: example ontology tree — Taiwan → North/Middle/South Taiwan → cities such as KeeLung, TaiPei, TaoYuan, TaiChung, ChangHua, YunLin (DouLiou), TaiNan, KaoHsiung, PingTung]

  5. Methodology – k-means • (1) Selection of the initial k means. • (2) Assignment of each object to the cluster with the nearest mean/medoid. • (3) Recalculation of the k means/medoids of the clusters. • (4) Computation of the quality function. [Figure: k = 2 example using the Euclidean distance q(x, y) = ||x − y||²; quality function: 0.96]
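Steps (1)–(4) describe the classic numerical k-means loop, not yet the paper's categorical extension. A minimal sketch of that loop (the points, seed, and iteration count are illustrative choices, not from the paper):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch following steps (1)-(4) above."""
    rng = random.Random(seed)
    means = [list(p) for p in rng.sample(points, k)]   # (1) initial k means

    def sqdist(p, m):
        return sum((a - b) ** 2 for a, b in zip(p, m))

    for _ in range(iters):
        clusters = [[] for _ in range(k)]              # (2) nearest-mean assignment
        for p in points:
            clusters[min(range(k), key=lambda j: sqdist(p, means[j]))].append(p)
        for j, c in enumerate(clusters):               # (3) recompute each mean
            if c:
                means[j] = [sum(xs) / len(c) for xs in zip(*c)]
    # (4) quality function: total within-cluster squared distance
    quality = sum(min(sqdist(p, m) for m in means) for p in points)
    return means, quality

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
means, q = kmeans(pts, k=2)
```

Swapping the Euclidean distance and the mean update for a categorical dissimilarity and a medoid update is exactly the extension the paper pursues.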

  6. Methodology – Measures for categorical data • Overlap measure • When ai and aj are identical → 0, otherwise → 1. • Weakness: all attribute values are at equal distance from each other. • Conceptual measure • A matrix built by asking the user to judge the closeness between attribute values. • Based on a tree structure. • Semantic similarity measure • Based on the shortest path between two concepts X and Y in an ontology. • Weakness: the links in an ontology represent uniform distances.
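The overlap measure and its weakness fit in a few lines (the example attribute values are illustrative):

```python
def overlap(ai, aj):
    """Overlap measure: 0 when the attribute values are identical, 1 otherwise."""
    return 0 if ai == aj else 1

# The weakness in action: every pair of distinct values is equally far apart,
# so 'doctor' is as dissimilar to 'nurse' as it is to 'truck driver'.
same = overlap("doctor", "doctor")
near = overlap("doctor", "nurse")
far = overlap("doctor", "truck driver")
```

A tree structured measure is meant to recover exactly the semantic gradation that `near == far` throws away.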

  7. Methodology – Proposed dissimilarity measure • The proposed dissimilarity measure is based on a set of user-defined ontologies, implemented by tree structures. • The distance between the corresponding nodes of the tree structure is defined in terms of: • fl(X, Y): the level of the nearest common father node of nodes X and Y. • l(X): the level of node X. • max(p(X)): the length of the maximum path starting from the root to a leaf and containing node X. • p(X, Y): the length of the directed path (number of edges) connecting X and Y.
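The formula that combines these four quantities appeared as an image on the slide and is not preserved in the transcript, but the building blocks themselves can be computed directly from a child → parent representation of the ontology. A minimal sketch on an abbreviated version of the Taiwan tree from slide 4:

```python
# Abbreviated Taiwan ontology from slide 4, stored as child -> parent.
parent = {
    "North": "Taiwan", "Middle": "Taiwan", "South": "Taiwan",
    "KeeLung": "North", "TaiPei": "North",
    "TaiChung": "Middle",
    "TaiNan": "South", "KaoHsiung": "South",
}

def ancestors(x):
    """Path from x up to the root, inclusive."""
    path = [x]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def l(x):
    """l(X): level of node X, with the root at level 0."""
    return len(ancestors(x)) - 1

def father(x, y):
    """Nearest common father node of X and Y."""
    ax = set(ancestors(x))
    return next(a for a in ancestors(y) if a in ax)

def fl(x, y):
    """fl(X, Y): level of the nearest common father node."""
    return l(father(x, y))

def p(x, y):
    """p(X, Y): number of edges on the path connecting X and Y."""
    return (l(x) - fl(x, y)) + (l(y) - fl(x, y))

def max_p(x):
    """max(p(X)): longest root-to-leaf path (in edges) containing X."""
    children = {}
    for c, f in parent.items():
        children.setdefault(f, []).append(c)
    def below(n):  # deepest chain of edges under n
        return 1 + max(map(below, children[n])) if n in children else 0
    return l(x) + below(x)
```

For instance, fl(KeeLung, TaiPei) = 1 because their nearest common father, North, sits at level 1, while fl(TaiPei, TaiNan) = 0 because only the root covers both.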

  8. Methodology – Proposed dissimilarity measure • Suggesting that the distance between any X and Y nodes must be smaller as they are closer to their father node. • Suggesting that the distance between X and Y must be smaller as the maximum path containing X and Y becomes larger.

  9. Methodology- four criteria • An attribute value has distance zero to itself. • Distances are symmetric. • An attribute value has positive distance to all other values. • Distances obey the triangle inequality.

  10. Methodology – Question about the criteria • The triangle inequality does not hold for the proposed measure. • E.g. in a case similar to the path A → B → E, where A ≡ X, B ≡ Z, E ≡ Y, the requirement 0 ≤ −d(X, Y) + d(X, Z) + d(Z, Y) is violated. [Figure: tree path with nodes labelled X = 1, Z = 2, Y = 3]
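Such a failure can be demonstrated mechanically: enumerate all triples and report where d(X, Y) > d(X, Z) + d(Z, Y). The dissimilarity values below are hypothetical numbers chosen only to exhibit a violation of the kind described for the path A → B → E; they are not the paper's actual values.

```python
def triangle_violations(d, names):
    """All (X, Z, Y) triples with d(X, Y) > d(X, Z) + d(Z, Y)."""
    return [(x, z, y) for x in names for z in names for y in names
            if d(x, y) > d(x, z) + d(z, y) + 1e-12]

# Hypothetical dissimilarities on the path A -> B -> E: the direct
# distance d(A, E) exceeds the two-hop sum d(A, B) + d(B, E).
table = {frozenset(("A", "B")): 0.3,
         frozenset(("B", "E")): 0.4,
         frozenset(("A", "E")): 0.9}

def d(x, y):
    return 0.0 if x == y else table[frozenset((x, y))]

bad = triangle_violations(d, ["A", "B", "E"])
```

Here `bad` contains the triple (A, B, E) and its mirror image, confirming that these values form no metric.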

  11. Methodology – Medoid • (1) Define the mean level of a set of categorical attribute values within a set of objects O. • (2) Calculate the path whose attribute values appear in the majority of the objects in O. • (3) The medoid attribute value is the attribute value on this path at the average level of the nodes in O.
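The steps above are the paper's tree-based medoid construction. For comparison, the conventional medoid update used by standard k-medoids simply picks the object with the smallest total dissimilarity to the rest of the cluster — a minimal sketch, paired here with the overlap measure from slide 6 rather than the proposed tree measure:

```python
def overlap(ai, aj):
    # overlap measure from slide 6: 0 if identical, 1 otherwise
    return 0 if ai == aj else 1

def medoid(objects, d=overlap):
    """Conventional medoid: the object minimizing total dissimilarity
    to all objects in the cluster (not the paper's tree-based rule)."""
    return min(objects, key=lambda o: sum(d(o, other) for other in objects))

center = medoid(["nurse", "doctor", "doctor", "teacher"])
```

The tree-based rule avoids this O(|O|²) scan by reading the medoid straight off the majority path and the mean level.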

  12. Experiments [Figure: clustering of questions J1–J20 — left: Chi-Square measure; right: proposed measure]

  13. Experiments (Cont.) [Figure: clustering of questions J1–J20, second comparison — left: Chi-Square measure; right: proposed measure]

  14. Conclusion • Provides a new dissimilarity measure for categorical objects, based on ontologies. • It is proved that the proposed dissimilarity measure is not a metric, because it does not obey the triangle inequality. • It enables very efficient clustering of tree structured categorical data. • However, the computation of the proposed dissimilarity measure is more time consuming, although it remains efficient even for very large tree structures: calculating each of fl(X, Y) and l(X) is O(h) in the worst case, and calculating max(p(X)) is O(|N|) in the worst case.

  15. Comments • Advantage • A good clustering method for categorical data. • Drawback • … • Application • Clustering questions.
