CACTUS-Clustering Categorical Data Using Summaries

Presentation Transcript

  1. CACTUS-Clustering Categorical Data Using Summaries Advisor: Dr. Hsu Graduate: Min-Hung Lin IDSL seminar 2001/10/30

  2. Outline • Motivation • Objective • Related Work • Definitions • CACTUS • Performance Evaluation • Conclusions • Comments

  3. Motivation • Clustering with categorical attributes has received attention • Previous algorithms do not give a formal description of the clusters • Some of them need to post-process the output of the algorithm to identify the final clusters.

  4. Objective • Introduce a novel formalization of a cluster for categorical attributes. • Describe a fast summarization-based algorithm CACTUS that discovers clusters. • Evaluate the performance of CACTUS on synthetic and real datasets.

  5. Related Work • EM algorithm [Dempster et al., 1977] • Iterative clustering technique • STIRR algorithm [Gibson et al., 1998] • Iterative algorithm based on non-linear dynamical systems • ROCK algorithm [Guha et al., 1999] • Hierarchical clustering algorithm

  6. DEF:Support

  7. DEF:Strongly Connected

  8. DEF:Strongly Connected(cont’d)

  9. Formal Definition of a Cluster

  10. Formal Definition of a Cluster (cont’d) • C_i is the cluster-projection of C on A_i • C is called a sub-cluster if it satisfies conditions (1) and (3) • A cluster C over a subset S of all attributes is called a subspace cluster on S; if |S| = k then C is called a k-cluster

  11. DEF:Similarity

  12. Inter-attribute Summaries

  13. Intra-attribute Summaries

  14. Experiments

  15. Result • STIRR fails to discover • clusters consisting of overlapping cluster-projections on any attribute • clusters where two or more clusters share the same cluster projection • CACTUS correctly discovers all clusters

  16. CACTUS • Three-phase clustering algorithm • Summarization Phase • Compute the summary information • Clustering Phase • Discover a set of candidate clusters • Validation Phase • Determine the actual set of clusters

  17. Summarization Phase • Inter-attribute Summaries • Intra-attribute Summaries
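
The inter-attribute summaries can be illustrated with a minimal sketch. It assumes that a pair of values from two different attributes is kept when its co-occurrence count exceeds `alpha` times the count expected if the attributes were independent and uniformly distributed; the function name, the `alpha` parameter, and the data layout are illustrative assumptions, not taken from the slides. Intra-attribute summaries are not covered here.

```python
from collections import Counter
from itertools import combinations


def inter_attribute_summaries(tuples, domain_sizes, alpha):
    """Count co-occurring value pairs for every attribute pair in one scan.

    Returns {(i, j): {(a_i, a_j): support}} and keeps only pairs whose
    support exceeds alpha times the expected support (assumed criterion).
    """
    n_tuples = len(tuples)
    n_attrs = len(domain_sizes)
    counts = {pair: Counter() for pair in combinations(range(n_attrs), 2)}

    # Single scan over the dataset: update every attribute-pair counter.
    for t in tuples:
        for i, j in counts:
            counts[(i, j)][(t[i], t[j])] += 1

    summaries = {}
    for (i, j), counter in counts.items():
        # Expected count of one value pair under independence and uniformity.
        expected = n_tuples / (domain_sizes[i] * domain_sizes[j])
        summaries[(i, j)] = {
            pair: s for pair, s in counter.items() if s > alpha * expected
        }
    return summaries
```

Because only the counters are held in memory, a sketch like this depends on the attribute domains being small relative to the dataset.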

  18. Clustering Phase • Computing cluster-projections on attributes • Level-wise synthesis of clusters

  19. Computing Cluster-Projections on Attributes • Step 1 :pairwise cluster-projection • Step 2 :intersection
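
Step 2 can be sketched as follows, assuming Step 1 has already produced, for one attribute, a set of pairwise cluster-projections with respect to each other attribute. The sketch intersects those sets pairwise and keeps only intersections with more than one value; that size filter and the function names are assumptions made for illustration.

```python
from functools import reduce


def intersect_projection_sets(s1, s2):
    """Intersect every projection in s1 with every projection in s2.

    Keeps only intersections with more than one value (assumed filter).
    """
    out = set()
    for a in s1:
        for b in s2:
            common = a & b
            if len(common) > 1:
                out.add(common)
    return out


def cluster_projections(pairwise):
    """Fold the intersection over the pairwise projection sets of one attribute.

    `pairwise` is a non-empty list of sets of frozensets, one set per
    other attribute.
    """
    return reduce(intersect_projection_sets, pairwise)
```

For example, intersecting `{abc, de}` with `{abd, dex}` leaves `{ab, de}`: the value sets that are consistent with both pairwise views.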

  20. Computing Cluster-Projections on Attributes (cont’d) Cluster- projection

  21. Level-wise synthesis of clusters

  22. Level-wise synthesis of clusters (cont’d) • Generation procedure

  23. Level-wise synthesis of clusters (cont’d) Candidate cluster

  24. Validation • Some candidate clusters may not have enough support, because the 2-clusters that combine to form a candidate may be supported by different sets of tuples. • Check whether the support of each candidate cluster is at least the threshold: α times the expected support of the cluster. • Only clusters whose support on D passes the threshold are retained.

  25. Validation Procedure • Set the supports of all candidate clusters to zero. • For each tuple t in D, increment the support of each candidate cluster to which t belongs. • At the end of the scan, delete all candidate clusters whose support is less than the threshold.
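
The validation scan can be sketched as below. The sketch assumes a candidate cluster is a tuple of frozensets (one cluster-projection per attribute) and that the expected support is the dataset size times the fraction of the attribute-value space the cluster covers, i.e. an attribute-independence assumption. The names `validate`, `expected_support`, and `alpha` are hypothetical.

```python
from math import prod


def expected_support(cluster, domain_sizes, n_tuples):
    """Expected number of tuples in the cluster under attribute independence."""
    return n_tuples * prod(len(c) / d for c, d in zip(cluster, domain_sizes))


def validate(tuples, candidates, domain_sizes, alpha):
    """Return {cluster: support} for candidates that pass the threshold."""
    # Set the supports of all candidate clusters to zero.
    support = {c: 0 for c in candidates}

    # One scan: a tuple belongs to a candidate if each of its values
    # lies in the corresponding cluster-projection.
    for t in tuples:
        for c in candidates:
            if all(v in proj for v, proj in zip(t, c)):
                support[c] += 1

    # Delete candidates whose support is less than the threshold.
    n = len(tuples)
    return {
        c: s
        for c, s in support.items()
        if s >= alpha * expected_support(c, domain_sizes, n)
    }
```

The inner loop over all candidates is the simplest possible membership test; with many candidates, an index over cluster-projections would be needed to keep the scan cheap.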

  26. Extensions • Large Attribute Value Domains • Clusters in Subspaces

  27. Performance Evaluation • Evaluation of CACTUS on Synthetic and Real Datasets • Compared the performance of CACTUS with the performance of STIRR

  28. Synthetic Datasets • The test datasets were generated using the data generator developed by Gibson et al. (1 million tuples, 10 attributes, 100 attribute values for each attribute)

  29. Real Datasets • Two sets of bibliographic entries • 7766 entries are database-related • 30919 entries are theory-related • Four attributes: the first author, the second author, the conference, and the year. • Attribute domain sizes are {3418, 3529, 1631, 44}, {8043, 8190, 690, 42}, and {10212, 10527, 2315, 52} for the database-related, theory-related, and mixed datasets, respectively.

  30. Real Datasets (cont’d) Database-related Theory-related Mixture

  31. Results • CACTUS is very fast and scalable (only two scans of the dataset) • CACTUS outperforms STIRR by a factor between 3 and 10

  32. Conclusions • Formalized the definition of a cluster for categorical attributes. • Introduced a fast summarization-based algorithm, CACTUS, for discovering such clusters in categorical data. • Evaluated the algorithm on both synthetic and real datasets.

  33. Future Work • Relax the cluster definition by allowing sets of attribute values that are “almost” strongly connected to each other. • Inter-attribute summaries can be incrementally maintained => Derive an incremental clustering algorithm • Rank the clusters based on a measure of interestingness

  34. Comments • Computing pairwise cluster-projections is an NP-complete problem • A large number of candidate clusters is still a problem