
Efficiently Clustering Transactional data with Weighted Coverage Density


Presentation Transcript


  1. Efficiently Clustering Transactional Data with Weighted Coverage Density M. Hua Yan, Keke Chen, and Ling Liu Proceedings of the 15th International Conference on Information and Knowledge Management (ACM CIKM), 2006 Presenter: 吳建良

  2. Outline • Motivation • SCALE Framework • BKPlot Method • WCD Clustering Algorithm • Cluster Validity Evaluation • Experimental Results

  3. Motivation • Transactional data is a special kind of categorical data • e.g., t1={milk, bread, beer}, t2={milk, bread} • It can be transformed into a row-by-column table with Boolean values • The large volume and high dimensionality of the transformed data make existing algorithms inefficient • Existing transactional clustering algorithms: LargeItem, CLOPE, CCCD • They require users to manually tune at least one or two parameters • The appropriate parameter settings differ from dataset to dataset

  4. SCALE Framework • ACE & BkPlot (SSDBM’05) • ACE: Agglomerative Categorical clustering with Entropy criterion • BkPlot: • Examines the entropy difference between clustering structures as K varies • Reports the Ks at which the clustering structure changes dramatically • Evaluation Metrics • LISR: Large Item Size Ratio • AMI: Average pair-clusters Merging Index

  5. ACE Algorithm • Bottom-up process • Initially, each record is a cluster • Iteratively, find the most similar pair of clusters Cp and Cq, and merge them • Incremental entropy: Im(Cp, Cq) = (Np+Nq)·Ĥ(Cp∪Cq) − Np·Ĥ(Cp) − Nq·Ĥ(Cq), where Ĥ(C) is the expected entropy of cluster C • The most similar pair of clusters is the one whose Im(Cp, Cq) is minimum among all possible pairs • I(K) denotes the Im value in forming the K-cluster partition from the (K+1)-cluster partition
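
A minimal Python sketch of the merge criterion described above, assuming Ĥ(C) is the column-averaged empirical entropy of a cluster of categorical records; the function names are illustrative, not from the paper:

```python
from collections import Counter
from math import log2

def expected_entropy(cluster):
    """Column-averaged empirical entropy of a cluster of categorical records
    (each record is an equal-length tuple of attribute values)."""
    n, d = len(cluster), len(cluster[0])
    total = 0.0
    for j in range(d):
        counts = Counter(rec[j] for rec in cluster)
        total -= sum((c / n) * log2(c / n) for c in counts.values())
    return total / d

def incremental_entropy(cp, cq):
    """Im(Cp, Cq): entropy increase caused by merging clusters Cp and Cq.
    ACE repeatedly merges the pair with the smallest Im."""
    merged = cp + cq
    return (len(merged) * expected_entropy(merged)
            - len(cp) * expected_entropy(cp)
            - len(cq) * expected_entropy(cq))
```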

  6. BkPlot • Increasing rate of entropy: I(K) = H(K) − H(K+1), where H(K) is the expected entropy of the best K-cluster partition (N: total records, d: columns) • Small increasing rate • Merging does not introduce any impurity into the clusters • Clustering structure is not significantly changed • Large increasing rate • Merging introduces considerable impurity into the partitions • Clustering structure can be changed significantly

  7. BkPlot (contd.) • Relative changes • Use relative changes to determine whether a globally significant clustering structure emerges at K: I(K) ≈ I(K+1), but I(K−1) > I(K)

  8. BkPlot (contd.) • Entropy Characteristic Graph (ECG) • BkPlot: the second-order differential of the ECG
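
As a rough illustration of how candidate Ks are read off, here is a hedged Python sketch; it assumes a dict mapping each K to the expected entropy of its best K-cluster partition (variable names are mine), computes the increasing rate I(K) = H(K) − H(K+1), and takes peaks of its second-order difference as candidate best Ks:

```python
def bkplot(expected_entropy_by_k):
    """Given expected entropy H(K) for a range of K values, compute the
    increasing rate I(K) = H(K) - H(K+1) and the BkPlot values
    delta2I(K) = I(K-1) - I(K); peaks suggest candidate best Ks."""
    H = expected_entropy_by_k
    ks = sorted(H)
    I = {k: H[k] - H[k + 1] for k in ks if k + 1 in H}
    delta2 = {k: I[k - 1] - I[k] for k in I if k - 1 in I}
    # Candidate best Ks: local peaks of the second-order difference.
    candidates = [k for k in delta2
                  if all(delta2[k] >= delta2.get(k + s, float("-inf"))
                         for s in (-1, 1))]
    return I, delta2, candidates
```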

  9. WCD Clustering Algorithm • Notations • D: transactional dataset • N: size of the dataset • I={I1, I2,…, Im}: a set of items • tj={Ij1, Ij2,…, Ijl}: a transaction • A transaction clustering result CK={C1, C2,…, CK} is a partition of D, where C1∪C2∪…∪CK = D and Ci∩Cj = ∅ for i ≠ j

  10. Intra-cluster Similarity Measure • Coverage Density (CD) • Given a cluster Ck • Mk: number of distinct items in Ck • Ik: item set of Ck • Nk: number of transactions in Ck • Sk: sum of the occurrences of all items in Ck • CD(Ck) = Sk / (Nk × Mk), i.e., the filled-cell ratio of the Nk × Mk transaction-item grid • CD↑, compactness↑
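
A short Python sketch of the coverage-density computation under the reconstruction above (CD as the filled-cell ratio Sk / (Nk × Mk)); transactions are represented as Python sets of items:

```python
def coverage_density(cluster):
    """CD(Ck) = Sk / (Nk * Mk): how densely the Nk-by-Mk item grid is filled.
    `cluster` is a non-empty list of transactions, each a set of items."""
    nk = len(cluster)                          # Nk: number of transactions
    items = set().union(*cluster)              # distinct items in Ck
    mk = len(items)                            # Mk
    sk = sum(len(t) for t in cluster)          # Sk: total item occurrences
    return sk / (nk * mk)
```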

  11. Intra-cluster Similarity Measure (contd.) • Drawbacks of CD • Insufficient to measure the density of frequent itemsets • Each item contributes equally within a cluster • Two clusters may have the same CD but different filled-cell distributions (figure: two example clusters over items a, b, c with equal CD but different filled-cell layouts)

  12. Intra-cluster Similarity Measure (contd.) • Weighted Coverage Density (WCD) • Focuses on high-frequency items • Define the weight of item Ij as Wj = occur(Ij) / Sk, so WCD(Ck) = Σj occur(Ij)·Wj / (Nk × Mk) (figure: the two example clusters over items a, b, c from the previous slide, with their CD and WCD values compared)
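
A corresponding sketch for WCD, assuming the weight Wj = occur(Ij) / Sk as reconstructed above (the exact weighting is my reading of the slide, so treat it as an assumption):

```python
from collections import Counter

def weighted_coverage_density(cluster):
    """WCD(Ck): like CD, but each item's cells are weighted by
    Wj = occur(Ij) / Sk, so high-frequency items dominate."""
    nk = len(cluster)                                       # Nk
    counts = Counter(item for t in cluster for item in t)   # occur(Ij)
    mk = len(counts)                                        # Mk
    sk = sum(counts.values())                               # Sk
    return sum(c * (c / sk) for c in counts.values()) / (nk * mk)
```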

  13. Clustering Criterion • Expected Weighted Coverage Density (EWCD): EWCD(CK) = Σk (Nk/N) × WCD(Ck) • The clustering algorithm tries to maximize EWCD • EWCD reaches its maximum of 1 when every individual transaction is treated as its own cluster • Therefore, use the BkPlot method to generate a set of candidate “best Ks”
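
EWCD is then the size-weighted average of the per-cluster WCD values; a minimal sketch, reusing the weighted_coverage_density function from the previous block:

```python
def expected_wcd(clusters):
    """EWCD(CK) = sum_k (Nk / N) * WCD(Ck); the clustering phase tries to
    maximize this quantity (uses the weighted_coverage_density sketch above)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * weighted_coverage_density(c) for c in clusters)
```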

  14. WCD Clustering Algorithm
      Input: Dataset D, number of clusters K, initial K seeds
      Output: K clusters
      /* Phase 1 – Initialization */
      K seeds form the initial K clusters;
      while not end of D do
          read one transaction t from D;
          add t to the cluster Ci that maximizes EWCD;
          write <t, i> back to D;
      /* Phase 2 – Iteration */
      moveMark = true;
      while moveMark = true do
          moveMark = false;
          randomly generate the access sequence R;
          while not all transactions have been checked do
              read <t, i>;
              if moving t to cluster Cj increases EWCD and i ≠ j then
                  moveMark = true;
                  write <t, j> back to D;
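
Below is a hedged, in-memory Python sketch of the two-phase procedure, built on the expected_wcd / weighted_coverage_density sketches above; unlike the pseudocode, it does not write <t, i> pairs back to disk, and all helper names are mine:

```python
import random

def wcd_cluster(transactions, k, seed_idx):
    """Two-phase WCD clustering sketch.
    `transactions` is a list of item sets; `seed_idx` lists K seed positions."""
    clusters = [[transactions[i]] for i in seed_idx]   # seed clusters
    labels = {idx: j for j, idx in enumerate(seed_idx)}

    def gain(t, j):
        """Change in EWCD if transaction t were placed into cluster j."""
        before = expected_wcd([c for c in clusters if c])
        clusters[j].append(t)
        after = expected_wcd([c for c in clusters if c])
        clusters[j].pop()
        return after - before

    # Phase 1: one scan, assign each transaction where EWCD grows the most.
    for i, t in enumerate(transactions):
        if i in labels:
            continue
        best = max(range(k), key=lambda j: gain(t, j))
        clusters[best].append(t)
        labels[i] = best

    # Phase 2: keep moving single transactions while some move raises EWCD.
    moved = True
    while moved:
        moved = False
        order = list(range(len(transactions)))
        random.shuffle(order)                  # the random access sequence R
        for i in order:
            t, cur = transactions[i], labels[i]
            clusters[cur].remove(t)            # temporarily take t out
            best = max(range(k), key=lambda j: gain(t, j))
            clusters[best].append(t)
            labels[i] = best
            if best != cur:
                moved = True
    return labels, clusters
```

In the SCALE setting, the candidate values of K would come from the BkPlot step, and the K seeds are supplied as input, as in the pseudocode above.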

  15. Cluster Validity Evaluation • LISR (Large Item Size Ratio) • Measures how well frequent itemsets are preserved • Defined in terms of LSk, the number of large items in Ck • A high concurrence of items implies a high possibility of finding more frequent itemsets at the user-specified minimum support
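
The slide does not spell out the LISR formula, so the sketch below only illustrates one plausible reading: it treats LISR as the share of all item occurrences that belong to items which are large (i.e., meet the user-specified minimum support) within their own cluster. Both the normalization and the helper names are assumptions, not the paper's definition:

```python
from collections import Counter

def lisr(clusters, min_support):
    """Assumed LISR variant: fraction of all item occurrences covered by
    'large' items, where an item is large in Ck if it appears in at least
    min_support * Nk transactions of Ck."""
    large_occ = total_occ = 0
    for cluster in clusters:                      # cluster: list of item sets
        counts = Counter(item for t in cluster for item in t)
        threshold = min_support * len(cluster)
        total_occ += sum(counts.values())         # Sk
        large_occ += sum(c for c in counts.values() if c >= threshold)
    return large_occ / total_occ
```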

  16. Cluster Validity Evaluation (contd.) • Inter-cluster dissimilarity d(Ci, Cj) between Ci and Cj, which simplifies to an expression in Mi, Mj, and Mij, where Mij is the number of distinct items after merging the two clusters; thus Mij ≥ max{Mi, Mj} • Because Mij ≥ max{Mi, Mj} and Mij ≤ Mi + Mj, d(Ci, Cj) is a real number between 0 and 1

  17. Cluster Validity Evaluation (contd.) • Example • If Mi = Mj = Mij (e.g., Ci and Cj both contain only items {a, b, c}), then d(Ci, Cj) = 0 • If Ci contains items {a, b, c} and Cj contains items {c, d, e}, then Mi = Mj = 3 and Mij = 5

  18. Cluster Validity Evaluation (contd.) • AMI (Average pair-clusters Merging Index) • Evaluates the overall inter-cluster dissimilarity of a clustering result with K clusters by averaging d(Ci, Cj) over cluster pairs • The larger the AMI, the better the clustering quality
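
A small sketch of the averaging step, with the pairwise dissimilarity d(Ci, Cj) passed in as a function, since only its boundary properties (0 for identical item sets, at most 1) are stated above:

```python
from itertools import combinations

def ami(clusters, d):
    """Average of d(Ci, Cj) over all C(K, 2) cluster pairs; per the slides,
    a larger AMI indicates better inter-cluster dissimilarity."""
    pairs = list(combinations(clusters, 2))
    return sum(d(ci, cj) for ci, cj in pairs) / len(pairs)
```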

  19. Experiments • Datasets • Tc30a6r1000 • 1000 records, 30 columns, 6 possible attribute values • Zoo • 101 records, 18 attributes • Mushroom • 8124 instances, 22 attributes • Mushroom100k • Samples the mushroom data with duplicates • 100,000 instances • TxI4Dx • Generated by the IBM Data Generator

  20. Experimental Results • Tc30a6r1000 • The repulsion parameter r of CLOPE controls the number of clusters (figures: clustering results with 5 clusters and with 9 clusters)

  21. Experimental Results (contd.) • Zoo: K = 7 is the best (figures: clustering results with 2, 4, and 7 clusters)

  22. Experimental Results (contd.) • Mushroom: K=19 is the best

  23. Experimental Results (contd.) • Performance evaluation on mushroom100k (figures: results for r = 0.5–4.0 and for r = 2.0)

  24. Experimental Results (contd.) • Performance evaluation on TxI4Dx (figures: T10I4Dx and TxI4D100k)
