
Summarizing Itemset Patterns: A Profile-Based Approach


Presentation Transcript


  1. Summarizing Itemset Patterns: A Profile-Based Approach. Xifeng Yan, Hong Cheng, Jiawei Han, Dong Xin. ACM KDD '05. Advisor: Jia-Ling Koh. Speaker: Yu-Jiun Liu. 2006/01/06

  2. Introduction Ⅰ • Closed frequent pattern: a frequent pattern with no super-pattern of the same support. • Maximal frequent pattern: a frequent pattern with no frequent super-pattern. • Top-K (most frequent) patterns vs. K representative patterns (see the sketch below).
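As a rough illustration of these definitions, the sketch below enumerates the frequent itemsets of a small toy transaction database by brute force and marks which of them are closed and which are maximal. The toy transactions, the minimum support, and the function names are assumptions made for this illustration; they are not taken from the paper.

```python
from itertools import combinations

# Toy transaction database (illustrative only).
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c", "d"},
    {"a", "b"},
    {"b", "c", "d"},
]
min_sup = 2  # absolute minimum support count

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Brute-force enumeration of all frequent itemsets (fine for toy data).
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(set(combo))
        if s >= min_sup:
            frequent[frozenset(combo)] = s

for itemset, s in sorted(frequent.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    supersets = [g for g in frequent if itemset < g]
    closed = all(frequent[g] != s for g in supersets)   # no super-pattern with equal support
    maximal = not supersets                             # no frequent super-pattern at all
    tags = [name for name, flag in (("closed", closed), ("maximal", maximal)) if flag]
    print(sorted(itemset), "support=%d" % s, ", ".join(tags))
```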

  3. Introduction Ⅱ • What form should these representatives take? • How can they be found? • How should their quality be measured?

  4. Definition • Bernoulli Distribution Vector • Pattern Profile

  5. Equations • The relative frequency of item oi in D’: p(oi) = |{ t ∈ D’ : oi ∈ t }| / |D’|. • Estimated support of a pattern β ⊆ ψ: ŝ(β) = ρ · ∏ p(oi), the product taken over the items oi in β.

  6. Pattern Profile Example • Both example datasets (D1 and D2, shown on the slide) can be summarized by <abcd>, but the summary quality is better for D1. • p(a) = (50+1000)/(50+100+1000) = 0.91 • Mabc = <[0.91, 0.96, 1], abcd, 0.87> • M = <[0.91, 0.96, 1, 1], abcd, 1>
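The sketch below reproduces the profile machinery from the two previous slides: it builds a profile M = <p, ψ, ρ> for a group of member patterns and estimates a sub-pattern's support as ρ times the product of the item probabilities. The toy transactions, the member patterns, and the helper names are illustrative assumptions; they are not the paper's D1 or D2.

```python
from math import prod

def build_profile(transactions, member_patterns):
    """Sketch of a pattern profile M = (p, psi, rho).

    psi -- master pattern: the union of the member patterns
    D'  -- the union of the members' supporting transaction sets
    p   -- Bernoulli distribution vector: relative frequency of each
           item of psi inside D'
    rho -- |D'| / |D|
    """
    psi = frozenset().union(*member_patterns)
    d_prime = [t for t in transactions if any(m <= t for m in member_patterns)]
    p = {item: sum(1 for t in d_prime if item in t) / len(d_prime)
         for item in sorted(psi)}
    rho = len(d_prime) / len(transactions)
    return p, psi, rho

def estimated_support(profile, beta):
    """Estimated support of beta (a subset of the master pattern):
    s_hat(beta) = rho * prod of p(o) over the items o in beta."""
    p, psi, rho = profile
    assert beta <= psi
    return rho * prod(p[o] for o in beta)

# Toy data (illustrative, not the datasets from the slide).
transactions = [frozenset("acd")] * 2 + [frozenset("bcd")] * 3 + [frozenset("abcd")] * 5
profile = build_profile(transactions, [frozenset("ac"), frozenset("bd")])
print(profile)                                         # p = {'a': 0.7, 'b': 0.8, 'c': 1.0, 'd': 1.0}, rho = 1.0
print(estimated_support(profile, frozenset("abcd")))   # ~0.56, vs. a true support of 0.5
```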

  7. Pattern Summarization • First, construct a singleton profile for each pattern, containing only that pattern itself. • Then use the Kullback-Leibler divergence to merge the profiles of similar patterns (a sketch follows below). • KL-divergence: D(p ‖ q) = Σx p(x) log( p(x) / q(x) ).
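One way to realize the KL-divergence step is to treat each item of a profile's distribution vector as an independent Bernoulli variable and sum the per-item divergences, as sketched below. This particular instantiation (per-item Bernoulli KL with an epsilon clip) is an assumption made for illustration; the paper defines the exact divergence it uses.

```python
from math import log

def bernoulli_kl(p, q, eps=1e-9):
    """KL divergence KL(Ber(p) || Ber(q)) for a single item; probabilities
    are clipped by eps to avoid log(0) and division by zero."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

def profile_divergence(p_vec, q_vec):
    """Sum of per-item Bernoulli KL divergences between two distribution
    vectors; items missing from one vector are treated as probability 0."""
    items = set(p_vec) | set(q_vec)
    return sum(bernoulli_kl(p_vec.get(i, 0.0), q_vec.get(i, 0.0)) for i in items)

# Two similar profiles give a small divergence, dissimilar ones a large one.
print(profile_divergence({"a": 0.91, "b": 0.96, "c": 1.0},
                         {"a": 0.90, "b": 0.95, "c": 1.0}))   # small
print(profile_divergence({"a": 0.91, "b": 0.96, "c": 1.0},
                         {"c": 1.0, "d": 0.9, "e": 1.0}))     # large
```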

  8. Hierarchical Agglomerative Clustering
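A minimal sketch of the hierarchical agglomerative idea: start with one single-pattern cluster per pattern, then repeatedly merge the pair of clusters whose profiles are closest until K clusters remain. Everything below (the toy data, the greedy merging loop, the use of a summed Bernoulli KL as the distance) is an illustrative assumption, not the paper's exact Algorithm 1.

```python
from math import log

def kl(p, q, eps=1e-9):
    """Per-item Bernoulli KL divergence (clipped to avoid log(0))."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

def dist(p_vec, q_vec):
    """Summed per-item Bernoulli KL divergence (see the previous sketch)."""
    items = set(p_vec) | set(q_vec)
    return sum(kl(p_vec.get(i, 0.0), q_vec.get(i, 0.0)) for i in items)

def cluster_profile(transactions, patterns):
    """Distribution vector of a cluster of patterns over the union of
    their supporting transactions (see the earlier profile sketch)."""
    d_prime = [t for t in transactions if any(m <= t for m in patterns)]
    psi = frozenset().union(*patterns)
    return {i: sum(1 for t in d_prime if i in t) / len(d_prime) for i in psi}

def agglomerate(transactions, patterns, k):
    """Greedy bottom-up merging: one cluster per pattern, then repeatedly
    merge the closest pair of clusters until only k remain."""
    clusters = [[p] for p in patterns]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(cluster_profile(transactions, clusters[i]),
                         cluster_profile(transactions, clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Toy run: the ab-like patterns end up in one cluster, the cd-like in another.
transactions = [frozenset("abc")] * 4 + [frozenset("abd")] * 4 + [frozenset("cde")] * 4
patterns = [frozenset("ab"), frozenset("abc"), frozenset("abd"),
            frozenset("cd"), frozenset("cde")]
print(agglomerate(transactions, patterns, 2))
```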

  9. K-means Clustering
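The alternative algorithm can be sketched in a K-means style: pick K seed vectors, assign every pattern's distribution vector to the nearest seed, and recompute each center, repeating for a fixed number of rounds. Recomputing a center as the item-wise mean of its members (rather than rebuilding a full profile from the data) and the naive initialization are simplifying assumptions of this sketch.

```python
from math import log

def kl(p, q, eps=1e-9):
    """Per-item Bernoulli KL divergence (clipped to avoid log(0))."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

def dist(p_vec, q_vec):
    items = set(p_vec) | set(q_vec)
    return sum(kl(p_vec.get(i, 0.0), q_vec.get(i, 0.0)) for i in items)

def kmeans(vectors, k, iters=20):
    """K-means-style clustering of pattern distribution vectors: assign each
    vector to the nearest center (by KL), then recompute each center as the
    item-wise mean of its members."""
    centers = vectors[:k]                  # naive initialization: first k vectors
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist(v, centers[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:                    # keep the old center if the cluster emptied
                items = set().union(*members)
                centers[c] = {i: sum(m.get(i, 0.0) for m in members) / len(members)
                              for i in items}
    return assign

# Toy vectors: two a/b-like profiles and two c/d-like profiles.
vectors = [{"a": 1.0, "b": 0.9}, {"a": 0.95, "b": 1.0},
           {"c": 1.0, "d": 0.8}, {"c": 0.9, "d": 1.0}]
print(kmeans(vectors, 2))
```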

  10. Optimization Heuristics • Closed itemsets vs. frequent itemsets: given patterns α ⊆ β, if their supports are equal, then Dα = Dβ, so it suffices to summarize the closed itemsets. • Approximate profiles: use the two approximation equations (one for Algorithm 1, one for Algorithm 2, shown on the slide) in place of the original profile updating; a hedged sketch of the idea follows below.
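The transcript omits the two approximation equations themselves. Purely as an illustration of the idea of approximate profile updating (merging profiles without rescanning the data), the sketch below blends two profiles with a size-weighted average of their distribution vectors; this heuristic is an assumption of this sketch and should not be read as the paper's actual update rule.

```python
def merge_profiles_approx(p1, rho1, n1, p2, rho2, n2):
    """Approximate merge of two profiles without rescanning the data.
    This size-weighted blend is an assumed heuristic, not the paper's
    exact approximation formula.

    p1, p2     -- distribution vectors (dict: item -> probability)
    rho1, rho2 -- supports |D1'| / |D| and |D2'| / |D|
    n1, n2     -- sizes |D1'| and |D2'| of the supporting sets
    """
    items = set(p1) | set(p2)
    n = n1 + n2
    # Weight each item probability by the size of its supporting set; this
    # double-counts transactions shared by both sets, hence "approximate".
    p = {i: (n1 * p1.get(i, 0.0) + n2 * p2.get(i, 0.0)) / n for i in items}
    # |D1' union D2'| is unknown without a rescan; assume little overlap and
    # cap the merged support at 1.0.
    rho = min(rho1 + rho2, 1.0)
    return p, rho

print(merge_profiles_approx({"a": 0.91, "b": 0.96, "c": 1.0}, 0.40, 1000,
                            {"a": 1.00, "b": 0.90, "d": 1.0}, 0.25, 625))
```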

  11. Quality Evaluation • Definition (Restoration Error) • T is a testing pattern set. • T’ is the collection of the itemsets generated by the master patterns in the profiles.

  12. Quality Evaluation • J tests “frequent patterns”, some of which may be estimated as “infrequent”. • Jc tests “estimated frequent patterns”, some of which are actually “infrequent”. • Therefore J and Jc are complementary to each other.
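To make the evaluation concrete, the sketch below computes a restoration-error style measure over a test pattern set T: each test pattern's support is estimated from the profiles that cover it and compared with its true support. The average-relative-error form, the toy numbers, and the helper names are assumptions of this sketch; the exact definitions of J and Jc are those given in the paper.

```python
from math import prod

def estimated_support(profiles, beta):
    """Estimate beta's support: among the profiles M = (p, psi, rho) whose
    master pattern psi contains beta, take the largest estimate
    rho * prod(p[o] for o in beta); return 0.0 if no profile covers beta."""
    best = 0.0
    for p, psi, rho in profiles:
        if beta <= psi:
            best = max(best, rho * prod(p[o] for o in beta))
    return best

def restoration_error(test_patterns, true_support, profiles):
    """Average relative difference between true and estimated supports over
    the test pattern set T (an assumed form of the restoration error)."""
    errors = [abs(true_support[t] - estimated_support(profiles, t)) / true_support[t]
              for t in test_patterns]
    return sum(errors) / len(errors)

# Toy example with a single profile (all numbers are illustrative).
profiles = [({"a": 0.91, "b": 0.96, "c": 1.0, "d": 1.0}, frozenset("abcd"), 0.5)]
T = [frozenset("ab"), frozenset("abc"), frozenset("abcd")]
true_sup = {frozenset("ab"): 0.45, frozenset("abc"): 0.44, frozenset("abcd"): 0.40}
print(restoration_error(T, true_sup, profiles))   # ~0.043
```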

  13. Quality Evaluation • Lemma • For any frequent itemset π, there must exist a profile Mk such that π ⊆ ψk, where ψk is the master itemset of Mk.

  14. Optimal Number of Profiles • How to determine K? • A profile is M = (p, ψ, ρ). • Ex: require a closeness condition (given on the slide) to hold for every item i. • If p ≈ q, then α ≈ β and Dα ≈ Dβ ≈ Dα∪Dβ, so the corresponding profiles can be merged. • Check how the quality changes as K varies (its derivative over K): if J increases sharply when K drops from K* to K* - 1, then K* is likely to be a good choice (see the sketch below).
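The knee-style selection described in the last bullet can be sketched as follows: evaluate the restoration error J for a range of K values and pick the K whose reduction to K - 1 causes the largest jump in error. The J(K) numbers below are made up for illustration.

```python
def pick_k(quality):
    """Given quality[k] = restoration error J when summarizing with k profiles,
    pick the k whose reduction to k - 1 causes the largest jump in error
    (the 'knee' heuristic described on the slide)."""
    best_k, best_jump = None, -1.0
    for k in sorted(quality):
        if k - 1 in quality:
            jump = quality[k - 1] - quality[k]
            if jump > best_jump:
                best_k, best_jump = k, jump
    return best_k

# Illustrative J(K) values (assumed numbers): the error grows slowly until K
# drops below 4, where it jumps sharply, so K* = 4.
quality = {1: 0.60, 2: 0.41, 3: 0.38, 4: 0.12, 5: 0.10, 6: 0.09}
print(pick_k(quality))   # -> 4
```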

  15. Optimal Number of Profiles

  16. Experiment • Three real datasets and a series of synthetic datasets. • Language: Visual C++ • CPU: Intel 3.2 GHz • Memory: 1 GB • OS: Windows XP

  17. Mushroom ※688 closed patterns

  18. BMS-Webview1 & Replace ※ BMS-Webview1: threshold = 0.1%, 4195 closed patterns, many small frequent itemsets ※ Replace: threshold = 3%, 4315 closed patterns, many small frequent itemsets

  19. Synthetic Datasets • Provided by IBM • Seven datasets, each with 10,000 transactions; the top-500 patterns are chosen. • K = 50 and 100
