
Sequential PAttern Mining using A Bitmap Representation


Presentation Transcript


  1. Sequential PAttern Mining using A Bitmap Representation. Jay Ayres, Johannes Gehrke, Tomi Yiu, Jason Flannick. ACM SIGKDD 2002

  2. Outline • Introduction • The SPAM Algorithm • Data Representation • Experimental Evaluation • Conclusion

  3. Introduction • Sequential pattern mining: given a database of customer purchase transactions, analyze it to find whether there are sequential (before-and-after) relationships among the items customers buy. • Input: a dataset over the items I = {a, b, c, d}, sorted by CID and TID. Output: frequent sequences such as ({b}, {d, c}, {c}).

  4. Introduction (Frequent) • Sa: a sequence, e.g. ({a}, {c}). • supD(Sa): the support of Sa in D, i.e. the number of customer sequences that contain Sa. Example: for Sa = ({a}, {b, c}), supD(Sa) = 2. • Given a support threshold minSup, a sequence Sa is called a frequent sequential pattern on D if supD(Sa) ≥ minSup. • Example: with minSup = 2, Sa is deemed frequent. [Figure: the sequence for each customer]
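To make the support definition concrete, here is a minimal sketch in Python (not from the paper; the helper names and the toy dataset are our own illustration):

```python
# A sequence is a tuple of itemsets; it is contained in a customer's
# sequence if its itemsets appear, in order, as subsets of distinct
# transactions.

def contains(customer_seq, pattern):
    """True if `pattern` is a subsequence of `customer_seq`."""
    i = 0
    for transaction in customer_seq:
        if i < len(pattern) and pattern[i] <= transaction:
            i += 1
    return i == len(pattern)

def support(dataset, pattern):
    """supD: the number of customer sequences containing `pattern`."""
    return sum(contains(seq, pattern) for seq in dataset.values())

# Toy dataset (CID -> ordered list of transactions); illustrative only.
D = {
    1: [{'a'}, {'b', 'c'}],
    2: [{'a'}, {'b', 'c', 'd'}],
    3: [{'b'}, {'d'}],
}
Sa = ({'a'}, {'b', 'c'})
print(support(D, Sa))  # 2 -> frequent for minSup = 2
```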

  5. The SPAM Algorithm (Lexicographic Tree for Sequences) • Consider all sequences arranged in a sequence tree with the following structure: the root of the tree is labeled with {}. • [Figure 1: The Lexicographic Sequence Tree for I = {a, b}; the root {} has child a, whose children are the sequences (a, a), (a, b) and the itemset extension ({a, b}); node (a, a) in turn has children (a, a, a), (a, a, b), and (a, {a, b}).]

  6. The SPAM Algorithm (Lexicographic Tree for Sequences) • Each sequence in the sequence tree can be considered as either a sequence-extended sequence or an itemset-extended sequence. [Figure 1: The Lexicographic Sequence Tree, as above]

  7. The SPAM Algorithm (Lexicographic Tree for Sequences) • For example, let Sa = ({a, b, c}, {a, b}). • S-Step: appending a new single-item itemset yields a sequence-extended sequence of Sa, e.g. ({a, b, c}, {a, b}, {d}). • I-Step: adding an item to the last itemset yields an itemset-extended sequence of Sa, e.g. ({a, b, c}, {a, b, d}).
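A minimal sketch of the two extension operations, assuming a sequence is represented as a tuple of frozensets (the representation and function names are our assumptions):

```python
def s_step(seq, item):
    """Sequence-extended sequence: append the new itemset {item}."""
    return seq + (frozenset({item}),)

def i_step(seq, item):
    """Itemset-extended sequence: add `item` to the last itemset."""
    return seq[:-1] + (seq[-1] | {item},)

Sa = (frozenset({'a', 'b', 'c'}), frozenset({'a', 'b'}))
print(s_step(Sa, 'd'))  # ({a, b, c}, {a, b}, {d})
print(i_step(Sa, 'd'))  # ({a, b, c}, {a, b, d})
```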

  8. The SPAM Algorithm (Lexicographic Tree for Sequences) • Sn: the set of candidate items considered for possible S-step extensions of node n (abbreviated s-extensions). Example: S({a}) = {a, b, c, d}, giving the s-extensions (a, a), (a, b), (a, c), and (a, d). [Figure: the sequence for each customer]

  9. The SPAM Algorithm (Lexicographic Tree for Sequences) • In: the set of candidate items considered for possible I-step extensions of node n (abbreviated i-extensions). Example: I({a}) = {b, c, d}, giving the i-extensions ({a, b}), ({a, c}), and ({a, d}). [Figure: the sequence for each customer]

  10. The SPAM Algorithm (Depth-First Tree Traversal) • The tree is traversed in a standard depth-first manner. • If supD(S) ≥ minSup, store S and repeat the DFS on S's children. • If supD(S) < minSup, there is no need to repeat the DFS on S's children, by the Apriori principle. • Even with this cutoff, the search space is huge. [Figure 1: The Lexicographic Sequence Tree for I = {a, b}]

  11. The SPAM Algorithm (Pruning) • S-step Pruning: suppose we are at node ({a}) in the tree, with S({a}) = {a, b, c, d} and I({a}) = {b, c, d}, and suppose that ({a}, {c}) and ({a}, {d}) are not frequent. By the Apriori principle, ({a}, {a}, {c}), ({a}, {b}, {c}), ({a}, {a, c}), ({a}, {b, c}), ({a}, {a}, {d}), ({a}, {b}, {d}), ({a}, {a, d}), and ({a}, {b, d}) are not frequent either. Hence, when we are at node ({a}, {a}) or ({a}, {b}), we do not have to perform an I-step or S-step using items c and d, i.e. S({a},{a}) = S({a},{b}) = {a, b}, I({a},{a}) = {b}, and I({a},{b}) = ∅.

  12. [Figure: the subtree rooted at node (a) with S({a}) = {a, b, c, d}; after S-step pruning, the extensions involving items c and d below nodes (a, a) and (a, b) are never generated.]

  13. The SPAM Algorithm (Pruning) • I-step Pruning (Example): let us consider the same node ({a}) described in the previous section. The possible itemset-extended sequences are ({a, b}), ({a, c}), and ({a, d}). If ({a, c}) is not frequent, then ({a, b, c}) must also not be frequent by the Apriori principle. Hence, I({a, b}) = {d}, S({a, b}) = {a, b}, and S({a, d}) = {a, b}. (A sketch combining the traversal with both pruning rules follows the next figure.)

  14. [Figure: the subtree rooted at node (a) for the I-step pruning example, showing children of ({a, b}), ({a, c}), and ({a, d}) such as ({a, b}, a) through ({a, b}, d) and ({a, b, d}), with the pruned extension ({a, b, c}) removed.]
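Combining the depth-first traversal with both pruning rules, the following is a sketch in the spirit of the paper's DFS pseudocode, reusing the `support`, `s_step`, and `i_step` helpers sketched above (all names are ours, not the paper's):

```python
def dfs(seq, S_n, I_n, dataset, min_sup, results):
    """Depth-first search with S-step and I-step pruning.
    S_n / I_n: candidate items for s-/i-extensions of node `seq`."""
    # S-step: keep only items whose s-extension is frequent. By the
    # Apriori principle, only these survivors can extend any child.
    S_freq = [i for i in S_n
              if support(dataset, s_step(seq, i)) >= min_sup]
    for i in S_freq:
        child = s_step(seq, i)
        results.append(child)
        dfs(child, S_freq, [j for j in S_freq if j > i],
            dataset, min_sup, results)
    # I-step: analogous pruning for itemset extensions.
    I_freq = [i for i in I_n
              if support(dataset, i_step(seq, i)) >= min_sup]
    for i in I_freq:
        child = i_step(seq, i)
        results.append(child)
        dfs(child, S_freq, [j for j in I_freq if j > i],
            dataset, min_sup, results)
```

Seeding the recursion with each frequent single item, e.g. `dfs((frozenset({'a'}),), items, [i for i in items if i > 'a'], D, 2, results)`, would enumerate all frequent sequences.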

  15. Data Representation (Data Structure) • Our algorithm uses a vertical bitmap representation of the data. • A vertical bitmap is created for each item in the dataset, and each bitmap has a bit corresponding to each transaction in the dataset. • If item i appears in transaction j, then the bit corresponding to transaction j of the bitmap for item i is set to one; otherwise, the bit is set to zero.

  16. Data Representation (Data Structure) • If the size of a sequence is between 2^k + 1 and 2^(k+1), we consider it as a 2^(k+1)-bit sequence. • Example: if the size of a customer's sequence is 3, then k = 1 and it is stored as a 2^(1+1) = 4-bit section, with the unused bits left as zero. • [Table: dataset sorted by CID and TID, with transactions CID 1: TIDs 1, 3, 6; CID 2: TIDs 2, 4; CID 3: TIDs 5, 7; each customer's section is padded to 4 bits.]
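As a rough illustration, the following sketch builds vertical bitmaps with Python integers standing in for bit vectors; the CID/TID layout follows the slide, while the itemset contents are made up for illustration:

```python
from collections import defaultdict

SECTION = 4  # customers here have at most 3 transactions -> 4-bit sections

# (CID, TID, itemset) rows sorted by CID and TID; the CID/TID pairs are
# from the slide, the itemset contents are illustrative assumptions.
rows = [
    (1, 1, {'a', 'b'}), (1, 3, {'b', 'c', 'd'}), (1, 6, {'c'}),
    (2, 2, {'b'}),      (2, 4, {'a', 'c'}),
    (3, 5, {'a', 'b'}), (3, 7, {'d'}),
]

bitmaps = defaultdict(int)  # item -> vertical bitmap as a Python int
seen = defaultdict(int)     # CID -> number of transactions seen so far
for cid, tid, items in rows:
    slot = SECTION * (cid - 1) + seen[cid]  # global bit index, zero-padded
    seen[cid] += 1
    for item in items:
        bitmaps[item] |= 1 << slot

print(format(bitmaps['b'], '012b'))  # sections: customer 3 | 2 | 1
```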

  17. Data Representation (Candidate Generation) • S-step Process: we first generate a bitmap from B(sa) such that, within each customer's section, all bits less than or equal to k are set to zero and all bits after k are set to one, where k is the index of the first set bit in that section. • We call this bitmap a transformed bitmap. • We then AND the transformed bitmap with the item bitmap. • The resultant bitmap has the properties described above, and it is exactly the bitmap for the generated sequence.

  18. Data Representation (Candidate Generation) • S-step Process: [Figure: worked S-step example showing B(sa), the transformed bitmap, the item bitmap, and the ANDed result.]

  19. Data Representation (Candidate Generation) • I-step Process: no transformation is needed; the bitmap of the itemset-extended sequence is simply the AND of B(sa) with the item bitmap. [Figure: worked I-step example showing the two bitmaps and their ANDed result.]
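A sketch of both candidate-generation steps over such integer bitmaps (the section width of 4 and the function names are ours; the transform follows the slide's description):

```python
def transform(seq_bitmap, n_customers, section=4):
    """S-step transform: within each customer's section, zero all bits up
    to and including the first set bit k and set every later bit to one."""
    out = 0
    mask = (1 << section) - 1
    for c in range(n_customers):
        sec = (seq_bitmap >> (c * section)) & mask
        if sec:
            k = (sec & -sec).bit_length() - 1   # index of the first set bit
            sec = (mask >> (k + 1)) << (k + 1)  # ones strictly after bit k
        out |= sec << (c * section)
    return out

def s_step_bitmap(seq_bitmap, item_bitmap, n_customers, section=4):
    """Bitmap for the sequence-extended sequence: transform, then AND."""
    return transform(seq_bitmap, n_customers, section) & item_bitmap

def i_step_bitmap(seq_bitmap, item_bitmap):
    """Bitmap for the itemset-extended sequence: a plain AND suffices."""
    return seq_bitmap & item_bitmap

def bitmap_support(seq_bitmap, n_customers, section=4):
    """A customer supports the sequence iff its section is nonzero."""
    mask = (1 << section) - 1
    return sum(1 for c in range(n_customers)
               if (seq_bitmap >> (c * section)) & mask)
```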

  20. Experimental Evaluation (Comparison with SPADE and PrefixSpan) • We compared the three algorithms on several small, medium, and large datasets for various minimum support values. • This set of tests shows that SPAM outperforms SPADE by about a factor of 2.5 on small datasets and by more than an order of magnitude on reasonably large datasets. • PrefixSpan outperforms SPAM slightly on very small datasets, but on large datasets SPAM outperforms PrefixSpan by over an order of magnitude.

  21. Conclusion • We presented an algorithm to quickly find all frequent sequences in a list of transactions. • The algorithm utilizes a depth-first traversal of the search space combined with a vertical bitmap representation to store each sequence. • Experimental results demonstrated that our algorithm outperforms SPADE and PrefixSpan on large datasets.
