# CACTUS-Clustering Categorical Data Using Summaries

##### Presentation Transcript

1. CACTUS-Clustering Categorical Data Using Summaries • Advisor: Dr. Hsu • Graduate: Min-Hung Lin • IDSL seminar, 2001/10/30

2. Outline • Motivation • Objective • Related Work • Definitions • CACTUS • Performance Evaluation • Conclusions • Comments

3. Motivation • Clustering with categorical attributes has received attention • Previous algorithms do not give a formal description of the clusters • Some of them need to post-process the output of the algorithm to identify the final clusters.

4. Objective • Introduce a novel formalization of a cluster for categorical attributes. • Describe a fast summarization-based algorithm CACTUS that discovers clusters. • Evaluate the performance of CACTUS on synthetic and real datasets.

5. Related Work • EM algorithm [Dempster et al., 1977] • Iterative clustering technique • STIRR algorithm [Gibson et al., 1998] • Iterative algorithm based on non-linear dynamical systems • ROCK algorithm [Guha et al., 1999] • Hierarchical clustering algorithm

6. DEF:Support

7. DEF:Strongly Connected

8. DEF:Strongly Connected(cont’d)
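The definition slides above survive only as titles. As a rough illustration, the sketch below assumes the usual CACTUS reading: the support of a value pair is its co-occurrence count, and a pair is strongly connected when that count exceeds `alpha` times the support expected if the two attributes were independent and uniformly distributed. The function names, the toy rows, and the choice to take each domain from the observed values are assumptions made for this example.

```python
from collections import Counter

def pair_support(rows, i, j):
    """Count how often each (value of attribute i, value of attribute j) pair co-occurs."""
    return Counter((row[i], row[j]) for row in rows)

def expected_pair_support(rows, i, j):
    """Expected support of a value pair under the attribute-independence assumption.

    Assumption: the domain of each attribute is the set of values seen in `rows`.
    """
    domain_i = {row[i] for row in rows}
    domain_j = {row[j] for row in rows}
    return len(rows) / (len(domain_i) * len(domain_j))

def strongly_connected_pairs(rows, i, j, alpha):
    """Keep the value pairs whose support exceeds alpha times the expected support."""
    threshold = alpha * expected_pair_support(rows, i, j)
    return {pair: s for pair, s in pair_support(rows, i, j).items() if s > threshold}

# Hypothetical toy dataset with two categorical attributes.
rows = [("a1", "b1"), ("a1", "b1"), ("a1", "b1"),
        ("a2", "b2"), ("a2", "b1"), ("a3", "b3")]
strong = strongly_connected_pairs(rows, 0, 1, alpha=2.0)
```

With 6 tuples and 3 values per attribute, the expected support is 6/9, so with `alpha=2.0` only a pair seen at least twice qualifies.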

9. Formal Definition of a Cluster

10. Formal Definition of a Cluster (cont’d) • C_i is the cluster-projection of C on A_i • C is called a sub-cluster if it satisfies conditions (1) and (3) • A cluster C over a subset S of all attributes is called a subspace cluster on S; if |S| = k then C is called a k-cluster

11. DEF:Similarity

12. Inter-attribute Summaries

13. Intra-attribute Summaries

14. Experiments

15. Result • STIRR fails to discover • clusters consisting of overlapping cluster-projections on any attribute • clusters where two or more clusters share the same cluster projection • CACTUS correctly discovers all clusters

16. CACTUS • Three-phase clustering algorithm • Summarization Phase • Compute the summary information • Clustering Phase • Discover a set of candidate clusters • Validation Phase • Determine the actual set of clusters

17. Summarization Phase • Inter-attribute Summaries • Intra-attribute Summaries
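The two kinds of summary named on this slide can be sketched as follows. This is a minimal sketch under assumptions: the inter-attribute summary is taken to hold the strongly connected value pairs of two attributes with their supports, and the intra-attribute summary is taken to record, for each pair of values of one attribute, how many values of the other attribute are strongly connected to both. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def inter_attribute_summary(rows, i, j, alpha):
    """Strongly connected (a_i, a_j) pairs with their supports, built in one pass over rows."""
    counts = Counter((r[i], r[j]) for r in rows)
    dom_i = {r[i] for r in rows}   # assumption: domains are the observed values
    dom_j = {r[j] for r in rows}
    threshold = alpha * len(rows) / (len(dom_i) * len(dom_j))
    return {pair: c for pair, c in counts.items() if c > threshold}

def intra_attribute_summary(inter_summary):
    """Similarity between values of attribute i, derived from the inter-attribute summary alone.

    The similarity of (a1, a2) is the number of attribute-j values
    strongly connected to both a1 and a2.
    """
    neighbours = defaultdict(set)
    for a, b in inter_summary:
        neighbours[a].add(b)
    sims = {}
    for a1, a2 in combinations(sorted(neighbours), 2):
        shared = len(neighbours[a1] & neighbours[a2])
        if shared > 0:
            sims[(a1, a2)] = shared
    return sims

# Hypothetical toy data: 10 tuples over two attributes.
rows = ([("a1", "b1")] * 3 + [("a2", "b1")] * 3 +
        [("a2", "b2")] * 3 + [("a3", "b3")])
inter = inter_attribute_summary(rows, 0, 1, alpha=2.0)
intra = intra_attribute_summary(inter)
```

Note that the intra-attribute step never touches `rows`; it works from the summary only, which is the point of a summarization-based approach.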

18. Clustering Phase • Computing cluster-projections on attributes • Level-wise synthesis of clusters

19. Computing Cluster-Projections on Attributes • Step 1: pairwise cluster-projection • Step 2: intersection
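Step 2 can be illustrated in isolation. The sketch below assumes that Step 1 has already produced, for one attribute, a collection of pairwise cluster-projections with respect to each other attribute, and that the intersection step keeps every pairwise intersection containing at least `min_size` values. The default of 2 and the helper name are assumptions for this example; Step 1 itself is not shown.

```python
def intersect_projections(s1, s2, min_size=2):
    """Intersect every set in s1 with every set in s2, keeping large enough results.

    s1, s2: iterables of sets of attribute values (pairwise cluster-projections
    of the same attribute with respect to two different attributes).
    """
    result = set()
    for x in s1:
        for y in s2:
            common = frozenset(x) & frozenset(y)
            if len(common) >= min_size:
                result.add(common)
    return result

# Hypothetical projections of attribute A with respect to attributes B and C.
wrt_b = [{"a1", "a2", "a3"}, {"a4", "a5"}]
wrt_c = [{"a1", "a2"}, {"a3", "a4", "a5"}]
projections_on_a = intersect_projections(wrt_b, wrt_c)
```

For more than two other attributes, the same function can be folded over the list of pairwise projections, for instance with `functools.reduce`.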

20. Computing Cluster-Projections on Attributes (cont’d) • Cluster-projection

21. Level-wise synthesis of clusters (cont’d) • Generation procedure

22. Level-wise synthesis of clusters (cont’d) Candidate cluster
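The generation procedure itself is not preserved in the transcript. One plausible reading of level-wise synthesis is sketched here: start from the cluster-projections on the first attribute and extend each candidate by one attribute at a time, keeping an extension only if the new projection forms a 2-cluster with every projection already in the candidate. The function `synthesize`, the predicate `is_two_cluster`, and the toy data are all hypothetical.

```python
def synthesize(projections, is_two_cluster):
    """Build candidate clusters attribute by attribute.

    projections: list with one entry per attribute, each a list of frozensets.
    is_two_cluster(i, c_i, j, c_j): True if (c_i, c_j) is a 2-cluster
    on attributes i and j (assumed to be known from the earlier steps).
    """
    candidates = [(c,) for c in projections[0]]
    for k in range(1, len(projections)):
        extended = []
        for cand in candidates:
            for c_new in projections[k]:
                # Keep the extension only if it is consistent with every earlier attribute.
                if all(is_two_cluster(i, cand[i], k, c_new) for i in range(k)):
                    extended.append(cand + (c_new,))
        candidates = extended
    return candidates

fs = frozenset
projections = [[fs({"a1", "a2"}), fs({"a3"})], [fs({"b1"}), fs({"b2"})], [fs({"c1"})]]
known_two_clusters = {
    (0, fs({"a1", "a2"}), 1, fs({"b1"})),
    (0, fs({"a1", "a2"}), 2, fs({"c1"})),
    (1, fs({"b1"}), 2, fs({"c1"})),
    (0, fs({"a3"}), 1, fs({"b2"})),
}
candidates = synthesize(
    projections, lambda i, ci, j, cj: (i, ci, j, cj) in known_two_clusters
)
```

In this toy run the pair ({a3}, {b2}) survives the second level but is dropped at the third, because {a3} and {c1} do not form a 2-cluster.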

23. Validation • Some of the candidate clusters may not have enough support, because the 2-clusters that combine to form a candidate may be supported by different sets of tuples. • Check whether the support of each candidate cluster is greater than the threshold: α times the expected support of the cluster. • Only clusters whose support on D passes the threshold are retained.

24. Validation Procedure • Set the supports of all candidate clusters to zero. • For each tuple t in D, increment the support of the candidate cluster to which t belongs. • At the end of the scan, delete all candidate clusters whose support is less than the threshold.
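The three steps above map onto a single scan of the dataset. In this minimal sketch the expected support of a candidate is assumed to be |D| times the product of |C_i| / |D_i| over all attributes (the attribute-independence assumption), and a tuple is counted toward every candidate that contains it. The names `validate` and `domain_sizes` and the toy data are hypothetical.

```python
from math import prod

def validate(rows, candidates, domain_sizes, alpha):
    """Return (candidate, support) pairs whose support reaches alpha times the expected support.

    candidates: list of tuples of frozensets, one frozenset per attribute.
    domain_sizes: number of distinct values in each attribute's domain.
    """
    support = [0] * len(candidates)            # step 1: all supports start at zero
    for row in rows:                           # step 2: one scan over the dataset
        for idx, cand in enumerate(candidates):
            if all(value in proj for value, proj in zip(row, cand)):
                support[idx] += 1
    kept = []
    for cand, s in zip(candidates, support):   # step 3: drop under-supported candidates
        expected = len(rows) * prod(len(p) / d for p, d in zip(cand, domain_sizes))
        if s >= alpha * expected:
            kept.append((cand, s))
    return kept

fs = frozenset
rows = [("a1", "b1")] * 4 + [("a2", "b1")] * 2 + [("a3", "b2"), ("a4", "b3")]
candidates = [(fs({"a1", "a2"}), fs({"b1"})), (fs({"a3", "a4"}), fs({"b1"}))]
clusters = validate(rows, candidates, domain_sizes=[4, 3], alpha=2.0)
```

Here the first candidate covers 6 of the 8 tuples against an expected 8 × (2/4) × (1/3) ≈ 1.33, so it is kept; the second covers none and is deleted.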

25. Extensions • Large Attribute Value Domains • Clusters in Subspaces

26. Performance Evaluation • Evaluation of CACTUS on synthetic and real datasets • Comparison of the performance of CACTUS with that of STIRR

27. Synthetic Datasets • The test datasets were generated using the data generator developed by Gibson et al. (1 million tuples, 10 attributes, 100 attribute values for each attribute)

28. Real Datasets • Two sets of bibliographic entries • 7766 entries are database-related • 30919 entries are theory-related • Four attributes: the first author, the second author, the conference, and the year. • Attribute domain sizes (first author, second author, conference, year) for the three datasets are {3418, 3529, 1631, 44}, {8043, 8190, 690, 42}, and {10212, 10527, 2315, 52}

29. Real Datasets (cont’d) • Database-related • Theory-related • Mixture

30. Results • CACTUS is very fast and scalable (only two scans of the dataset) • CACTUS outperforms STIRR by a factor between 3 and 10

31. Conclusions • Formalized the definition of a cluster for categorical attributes. • Introduced a fast summarization-based algorithm, CACTUS, for discovering such clusters in categorical data. • Evaluated the algorithm on both synthetic and real datasets.

32. Future Work • Relax the cluster definition by allowing sets of attribute values that are “almost” strongly connected to each other. • Inter-attribute summaries can be incrementally maintained, which leads to an incremental clustering algorithm • Rank the clusters based on a measure of interestingness

33. Comments • Computing pairwise cluster-projections is an NP-complete problem • A large number of candidate clusters is still a problem