CACTUS-Clustering Categorical Data Using Summaries

Advisor: Dr. Hsu. Graduate: Min-Hung Lin. IDSL seminar, 2001/10/30.

Presentation Transcript


  1. CACTUS-Clustering Categorical Data Using Summaries • Advisor: Dr. Hsu • Graduate: Min-Hung Lin • IDSL seminar 2001/10/30

  2. Outline • Motivation • Objective • Related Work • Definitions • CACTUS • Performance Evaluation • Conclusions • Comments

  3. Motivation • Clustering with categorical attributes has received attention • Previous algorithms do not give a formal description of the clusters they discover • Some of them need to post-process the output of the algorithm to identify the final clusters.

  4. Objective • Introduce a novel formalization of a cluster for categorical attributes. • Describe a fast summarization-based algorithm CACTUS that discovers clusters. • Evaluate the performance of CACTUS on synthetic and real datasets.

  5. Related Work • EM algorithm [Dempster et al., 1977] • Iterative clustering technique • STIRR algorithm [Gibson et al., 1998] • Iterative algorithm based on non-linear dynamical systems • ROCK algorithm [Guha et al., 1999] • Hierarchical clustering algorithm

  6. DEF:Support

  7. DEF:Strongly Connected

  8. DEF:Strongly Connected(cont’d)

  9. Formal Definition of a Cluster

  10. Formal Definition of a Cluster (cont’d) • C_i is the cluster-projection of C on A_i • C is called a sub-cluster if it satisfies conditions (1) and (3) • A cluster C over a subset S of all attributes is called a subspace cluster on S; if |S| = k then C is called a k-cluster

  11. DEF:Similarity

  12. Inter-attribute Summaries

  13. Intra-attribute Summaries

  14. Experiments

  15. Result • STIRR fails to discover • clusters consisting of overlapping cluster-projections on any attribute • clusters where two or more clusters share the same cluster projection • CACTUS correctly discovers all clusters

  16. CACTUS • Three-phase clustering algorithm • Summarization Phase • Compute the summary information • Clustering Phase • Discover a set of candidate clusters • Validation Phase • Determine the actual set of clusters

  17. Summarization Phase • Inter-attribute Summaries • Intra-attribute Summaries
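A minimal sketch of how an inter-attribute summary could be built in one pass over the data. It assumes the usual CACTUS criterion that two values from different attributes are strongly connected when their co-occurrence count exceeds α times the count expected under attribute independence, |D| / (|D_i| · |D_j|). The function name, the `alpha` default, and the use of observed values as attribute domains are illustrative assumptions, not part of the slides.

```python
from collections import Counter
from itertools import combinations

def inter_attribute_summaries(tuples, alpha=2.0):
    """Return {(i, j): {(a_i, a_j): count}} with only strongly connected pairs.

    Hypothetical helper: alpha and the domain estimate are assumptions.
    """
    n_tuples = len(tuples)
    n_attrs = len(tuples[0])
    # Assumption: each attribute's domain is the set of values seen in the data.
    domains = [{t[i] for t in tuples} for i in range(n_attrs)]
    # Single scan: count co-occurrences for every pair of attributes.
    counts = {pair: Counter() for pair in combinations(range(n_attrs), 2)}
    for t in tuples:
        for i, j in counts:
            counts[(i, j)][(t[i], t[j])] += 1
    summaries = {}
    for (i, j), counter in counts.items():
        # Expected co-occurrence count if attributes i and j were independent.
        expected = n_tuples / (len(domains[i]) * len(domains[j]))
        summaries[(i, j)] = {
            pair: c for pair, c in counter.items() if c > alpha * expected
        }
    return summaries

data = [("a", "x")] * 6 + [("a", "y"), ("b", "x"), ("b", "y")]
summary = inter_attribute_summaries(data, alpha=2.0)
```

In this toy dataset only the pair ("a", "x") co-occurs often enough to be kept. Intra-attribute summaries (similarities between values of the same attribute) are not covered by this sketch.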

  18. Clustering Phase • Computing cluster-projections on attributes • Level-wise synthesis of clusters

  19. Computing Cluster-Projections on Attributes • Step 1: pairwise cluster-projection • Step 2: intersection
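The intersection step can be sketched as follows, assuming Step 1 has already produced, for one attribute, a collection of value sets with respect to each other attribute. The names `intersection_join` and `cluster_projections`, and the `min_size` cutoff that drops single-value results, are assumptions made for illustration. Step 1 itself is not shown.

```python
def intersection_join(s1, s2, min_size=2):
    """Intersect every set in s1 with every set in s2, keeping non-trivial results.

    min_size is an assumed cutoff that discards intersections with fewer values.
    """
    result = set()
    for a in s1:
        for b in s2:
            common = frozenset(a) & frozenset(b)
            if len(common) >= min_size:
                result.add(common)
    return result

def cluster_projections(pairwise):
    """Fold the intersection join over the pairwise projections of one attribute.

    pairwise: non-empty list with one collection of value sets per other attribute.
    """
    it = iter(pairwise)
    acc = {frozenset(s) for s in next(it)}
    for proj in it:
        acc = intersection_join(acc, proj)
    return acc

# Pairwise projections of one attribute with respect to two other attributes.
pairwise = [[{"a", "b", "c"}], [{"b", "c", "d"}, {"a", "e"}]]
projections = cluster_projections(pairwise)
```

Here {"a", "b", "c"} and {"b", "c", "d"} share two values, so {"b", "c"} survives, while the single shared value with {"a", "e"} is dropped.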

  20. Computing Cluster-Projections on Attributes (cont’d) • Cluster-projection

  21. Level-wise synthesis of clusters

  22. Level-wise synthesis of clusters (cont’d) • Generation procedure

  23. Level-wise synthesis of clusters (cont’d) Candidate cluster
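One way to picture the level-wise generation step: a candidate over the first k attributes is extended with a cluster-projection on attribute k+1 only if that projection forms a 2-cluster with every component already in the candidate. This is a hedged sketch; `extend_candidates` and the `is_two_cluster` predicate are hypothetical names, and the toy lookup table stands in for a real check against the inter-attribute summaries.

```python
def extend_candidates(candidates, next_projections, is_two_cluster):
    """Extend each k-candidate with every compatible projection on attribute k.

    candidates: list of tuples of frozensets, one frozenset per attribute so far.
    is_two_cluster(i, c_i, j, c_j): assumed predicate for attributes i and j.
    """
    extended = []
    for cand in candidates:
        k = len(cand)
        for proj in next_projections:
            # Keep the extension only if proj pairs up with every earlier component.
            if all(is_two_cluster(i, cand[i], k, proj) for i in range(k)):
                extended.append(cand + (proj,))
    return extended

# Toy table of known 2-clusters (illustrative data, not from the slides).
ab, xy, pq, rs = (frozenset(s) for s in ("ab", "xy", "pq", "rs"))
known = {(0, ab, 1, xy), (0, ab, 2, pq), (1, xy, 2, pq), (1, xy, 2, rs)}

def is_two(i, ci, j, cj):
    return (i, ci, j, cj) in known

level2 = extend_candidates([(ab,)], [xy], is_two)
level3 = extend_candidates(level2, [pq, rs], is_two)
```

The projection `rs` is rejected at the third level because it forms a 2-cluster with `xy` but not with `ab`.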

  24. Validation • Some candidate clusters may not have enough support, because the 2-clusters they were built from may be supported by different sets of tuples. • Check whether the support of each candidate cluster is at least the threshold: α times the expected support of the cluster. • Only clusters whose support on D passes the threshold are retained.

  25. Validation Procedure • Set the supports of all candidate clusters to zero. • For each tuple t in D, increment the support of the candidate cluster to which t belongs. • At the end of the scan, delete all candidate clusters whose support is less than the threshold.
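The three steps of the validation procedure can be sketched as a single scan. The sketch assumes that the expected support of a candidate is computed under attribute independence, as |D| times the product of |C_i| / |D_i| over all attributes. The function name `validate`, the `alpha` default, and passing `domain_sizes` explicitly are illustrative assumptions.

```python
from math import prod

def validate(tuples, candidates, domain_sizes, alpha=2.0):
    """Keep candidates whose support is at least alpha times the expected support.

    candidates: list of tuples of value sets, one set per attribute.
    domain_sizes: number of distinct values per attribute (assumed known).
    """
    n = len(tuples)
    # Step 1: set the supports of all candidate clusters to zero.
    support = [0] * len(candidates)
    # Step 2: one scan of the data, incrementing each candidate that contains t.
    for t in tuples:
        for idx, cand in enumerate(candidates):
            if all(v in c for v, c in zip(t, cand)):
                support[idx] += 1
    # Step 3: delete candidates whose support is below the threshold.
    kept = []
    for idx, cand in enumerate(candidates):
        # Assumed expected support under attribute independence.
        expected = n * prod(len(c) / d for c, d in zip(cand, domain_sizes))
        if support[idx] >= alpha * expected:
            kept.append(cand)
    return kept

data = [("a", "x")] * 6 + [("b", "y")] * 4
cands = [({"a"}, {"x"}), ({"a"}, {"y"})]
valid = validate(data, cands, domain_sizes=[2, 2], alpha=2.0)
```

With 10 tuples and two values per attribute, each single-value candidate has an expected support of 2.5, so the threshold is 5. The first candidate has support 6 and is kept; the second has support 0 and is deleted.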

  26. Extensions • Large Attribute Value Domains • Clusters in Subspaces

  27. Performance Evaluation • Evaluation of CACTUS on Synthetic and Real Datasets • Compared the performance of CACTUS with the performance of STIRR

  28. Synthetic Datasets • The test datasets were generated using the data generator developed by Gibson et al. (1 million tuples, 10 attributes, 100 attribute values for each attribute)

  29. Real Datasets • Two sets of bibliographic entries • 7766 entries are database-related • 30919 entries are theory-related • Four attributes: the first author, the second author, the conference, and the year. • Attribute domain sizes are {3418, 3529, 1631, 44}, {8043, 8190, 690, 42}, {10212, 10527, 2315, 52}

  30. Real Datasets (cont’d) • Database-related • Theory-related • Mixture

  31. Results • CACTUS is very fast and scalable (only two scans of the dataset) • CACTUS outperforms STIRR by a factor between 3 and 10

  32. Conclusions • Formalized the definition of a cluster for categorical attributes. • Introduced a fast summarization-based algorithm CACTUS for discovering such clusters in categorical data. • Evaluated the algorithm on both synthetic and real datasets.

  33. Future Work • Relax the cluster definition by allowing sets of attribute values that are “almost” strongly connected to each other. • Inter-attribute summaries can be maintained incrementally, which makes it possible to derive an incremental clustering algorithm • Rank the clusters based on a measure of interestingness

  34. Comments • Pairwise cluster-projection is an NP-complete problem • The large number of candidate clusters is still a problem
