Mining Sequential Patterns

Presentation Transcript


  1. Mining Sequential Patterns
     Presenters: Qian Bai, Jiguo Jiang

  2. Mining Sequential Patterns
     • Introduction
     • The Algorithm
       • AprioriAll, AprioriSome, DynamicSome
     • Performance
     • Conclusions

  3. Introduction
     • Background
     • Problem Statement
     • An Example
     • Related Work

  4. Background
     • Customer purchase patterns
       • Buy a computer, then buy software
       • Rent “Star Wars”, then “Empire Strikes Back”, and then “Return of the Jedi”
       • Buy “fitted sheet, flat sheet and pillow cases”, followed by “comforter”, followed by “drapes and ruffles”
     • Web access patterns
       • Open www.yorku.ca, then open www.cs.yorku.ca/mail

  5. Background (Continued)
     • The sequential pattern mining problem was first introduced by Agrawal and Srikant
     • Definition: Given a set of sequences, where each sequence is a list of elements and each element is a set of items, and given a user-specified min-support threshold, sequential pattern mining finds all frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is no less than min-support

  6. Problem Statement
     • After reading the three papers about “Mining Sequential Patterns”, we focus on a database D of customer transactions
     • Each transaction consists of the following fields:
       • Customer-id
       • Transaction-time
       • Items purchased in the transaction
     Note:
     • No customer has more than one transaction with the same transaction time.
     • Quantities of items bought in a transaction are not considered.

  7. Problem Statement (Continued)
     • Terminology:
       • Itemset: a non-empty set of items, e.g. (30, 40, 50), (60)
       • Sequence: an ordered list of itemsets, e.g. < (30, 40, 50) (60) >
       • Sequence length: the number of itemsets in a sequence
       • Contained: a sequence < a1 a2 … aN > is contained in another sequence < b1 b2 … bM > if there exist integers i1 < i2 < … < iN such that a1 ⊆ b(i1), a2 ⊆ b(i2), …, aN ⊆ b(iN)
         • < (30) (40 50) > is contained in < (70) (30 80) (40 50 60) >
         • < (30) (50) > is NOT contained in < (30 50) >, because the two items must come from different transactions
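
The containment test above can be sketched in a few lines of Python. This is a minimal illustration, assuming a sequence is represented as a list of sets; the function name `is_contained` is hypothetical and does not come from the slides.

```python
def is_contained(a, b):
    """Return True if sequence a is contained in sequence b.

    Both sequences are lists of sets (itemsets). Greedy left-to-right
    matching is enough: using the earliest itemset of b that is a
    superset of the current itemset of a never blocks a later match.
    """
    pos = 0
    for itemset in a:
        # Advance until an itemset of b contains the current itemset of a.
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next itemset of a must match strictly later in b
    return True


# The two examples from the slide.
print(is_contained([{30}, {40, 50}], [{70}, {30, 80}, {40, 50, 60}]))  # True
print(is_contained([{30}, {50}], [{30, 50}]))                          # False
```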

  8. Problem Statement (Continued)
     • Terminology (continued):
       • Maximal sequence: a sequence in a set of sequences is maximal if it is not contained in any other sequence of that set
       • Support: a customer supports a sequence s if s is contained in the customer-sequence of that customer. The support of s is the fraction of all customers who support s
       • Litemset (large itemset): an itemset that satisfies the minimum support
       • Large sequence: a sequence that satisfies the minimum support constraint
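
Support as defined above can be computed by counting each customer at most once. The sketch below assumes a customer-sequence database stored as a list of lists of sets. The five customer sequences are illustrative data (the tables from the slides did not survive in this transcript), and the names `support` and `is_contained` are hypothetical.

```python
def is_contained(a, b):
    """True if sequence a (list of sets) is contained in sequence b."""
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def support(db, seq):
    """Fraction of customer sequences in db that contain seq."""
    return sum(is_contained(seq, customer) for customer in db) / len(db)


# Illustrative customer sequences (assumed, not taken from the slides).
db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

print(support(db, [{30}]))            # 0.8
print(support(db, [{30}, {90}]))      # 0.4
print(support(db, [{30}, {40, 70}]))  # 0.4
```

With a minimum support of 0.4 on this data, < (30) > is large but not maximal, since it is contained in the large sequence < (30) (90) >.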

  9. Problem Statement (Continued)
     Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such maximal sequence represents a sequential pattern.

  10. An Example
     • A database sorted by customer ID and transaction time

  11. An Example (Continued)
     • Customer-sequence version of the database
     Note:
     • Patterns are not necessarily contiguous.
     • Some sequences, such as < (30) > and < (30) (40) >, have minimum support but are not in the answer because they are not maximal.

  12. Related Work
     • Differences between association rule mining in a customer transaction database and sequential pattern mining
     • Association rule mining:
       • Finds which items are bought together
       • Finds intra-transaction patterns
       • Patterns are unordered sets of items
     • Sequential pattern mining:
       • Finds which items are bought across different transactions
       • Finds inter-transaction patterns
       • Patterns are ordered lists of sets of items

  13. Algorithm
     • Sort phase
       • Sort the database with customer-id as the major key and transaction-time as the minor key
     • Litemset phase
       • Scan the database to find the set L1 of all large 1-sequences (litemsets) for the given minimum support
       • Map the litemsets to a set of contiguous integers, treating each litemset as a single entity. Example: {30} {40} {70} {40 70} {90} can be mapped to {1} {2} {3} {4} {5}
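
The litemset phase can be sketched by brute force: enumerate every non-empty subset of every transaction and count each customer at most once per itemset. This is an illustration under assumptions, not the presenters' implementation. The data is illustrative, `find_litemsets` is a hypothetical name, and subset enumeration is only practical for small transactions.

```python
from collections import Counter
from itertools import combinations


def find_litemsets(db, min_count):
    """Return itemsets (sorted tuples) supported by >= min_count customers."""
    counts = Counter()
    for customer in db:
        seen = set()
        for transaction in customer:
            items = sorted(transaction)
            for r in range(1, len(items) + 1):
                seen.update(combinations(items, r))
        counts.update(seen)  # one vote per customer per itemset
    return {s for s, n in counts.items() if n >= min_count}


# Illustrative customer sequences (assumed, not taken from the slides).
db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

litemsets = find_litemsets(db, min_count=2)
# Any consistent numbering works; this one sorts the tuples, so the
# integers may differ from the numbering shown on the slide.
mapping = {s: i for i, s in enumerate(sorted(litemsets), start=1)}
print(mapping)
```

On this data the litemsets are the same five itemsets named on the slide, although the integer assigned to each one depends on the ordering chosen.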

  14. Algorithm (Continued)
     • Transformation phase
       • Replace each transaction by the set of litemsets (large 1-sequences) that it contains
       • Delete transactions that contain no litemset, and delete customer sequences that contain no litemset
       • Deleted customers still count toward the total number of customers
       • Example: given that (30) (90) (40) (70) (40 70) are the litemsets
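
A minimal sketch of the transformation phase follows, assuming the litemset-to-integer mapping from the previous slide and illustrative customer sequences (the slide's own table is not in this transcript). The name `transform` is hypothetical.

```python
def transform(db, mapping):
    """Replace each transaction by the set of litemset ids it contains.

    Transactions without any litemset are dropped, and so are customer
    sequences that end up empty. The caller should keep len(db) as the
    customer total when computing support.
    """
    transformed = []
    for customer in db:
        new_seq = []
        for transaction in customer:
            ids = {i for litemset, i in mapping.items() if litemset <= transaction}
            if ids:
                new_seq.append(ids)
        if new_seq:
            transformed.append(new_seq)
    return transformed


# Mapping in the order used on the slides.
mapping = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

# Illustrative customer sequences (assumed, not taken from the slides).
db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

for seq in transform(db, mapping):
    print(seq)
```

The second customer loses the transaction (10 20) because it contains no litemset, while (40 60 70) becomes the set of ids {2, 3, 4}.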

  15. Algorithm (Continued)
     • Sequence phase
       • Find the frequent (large) sequences
       • Three algorithms: AprioriAll, AprioriSome, DynamicSome
     • Maximal phase
       • Delete sequences that are subsequences of other large sequences
       • In AprioriSome and DynamicSome this phase is combined with the sequence phase
       • Example: given the sequences {1} {2} {3} {4} {1 2} {1 3} {1 2 3}, the maximal sequences are {4} and {1 2 3}
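
The maximal phase can be sketched as a filter over the set of large sequences. The sketch assumes each sequence is a tuple of litemset ids, so {1 2 3} on the slide becomes `(1, 2, 3)`, and it compares ids only. It does not account for one litemset being a subset of another, which a full implementation would need to handle. Both function names are hypothetical.

```python
def is_subsequence(a, b):
    """True if tuple a appears in tuple b in order (not necessarily contiguous)."""
    it = iter(b)
    return all(x in it for x in a)


def maximal_sequences(large):
    """Keep only sequences not contained in another sequence of the set."""
    large = set(large)
    return {
        s for s in large
        if not any(s != t and is_subsequence(s, t) for t in large)
    }


# The example from the slide.
large = {(1,), (2,), (3,), (4,), (1, 2), (1, 3), (1, 2, 3)}
print(maximal_sequences(large))
```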

  16. Algorithm AprioriAll
     • Main idea
       • Every subsequence of a frequent sequence must also be frequent
       • If a sequence is not frequent, no sequence that contains it can be frequent
     • Example
       • If {1 2 3} is a frequent sequence, then {1} {2} {3} {1 2} {1 3} {2 3} must be frequent sequences.
       • If {1} is not a frequent sequence, then {1 2}, {1 3}, … are not frequent sequences.

  17. AprioriAll (Continued)
     • Step 1: Set k = 2 (L1 is known from the litemset phase)
     • Step 2: Form Ck from Lk-1 using the Apriori-generate function
     • Step 3: Scan the database and generate Lk from Ck based on the minimum support
     • Step 4: If Lk is not empty, set k = k+1 and repeat steps 2 and 3
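
The loop above can be sketched as follows. This is an illustration under several assumptions: the database is already transformed (each customer sequence is a list of sets of litemset ids), candidates are tuples of ids, the join never pairs a sequence with itself, and `min_count` is an absolute number of customers. The five customer sequences are illustrative data, and all function names are hypothetical.

```python
def contains(customer_seq, candidate):
    """True if the candidate (tuple of ids) occurs in order in the customer sequence."""
    pos = 0
    for lit in candidate:
        while pos < len(customer_seq) and lit not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def generate(prev):
    """Compact join-and-prune candidate generation from the large (k-1)-sequences."""
    joined = {p + (q[-1],) for p in prev for q in prev
              if p != q and p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def count(candidates, db, min_count):
    """Keep the candidates supported by at least min_count customers."""
    return {c for c in candidates
            if sum(contains(s, c) for s in db) >= min_count}


def apriori_all(db, min_count):
    ids = {lit for seq in db for t in seq for lit in t}
    current = count({(i,) for i in ids}, db, min_count)  # L1
    large, k = {}, 1
    while current:
        large[k] = current
        current = count(generate(current), db, min_count)  # L(k+1)
        k += 1
    return large


# Illustrative transformed customer sequences (assumed data).
db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]

for k, sequences in apriori_all(db, min_count=2).items():
    print(k, sorted(sequences))
```

On this data the loop stops after k = 4, where the only large sequence is `(1, 2, 3, 4)`.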

  18. AprioriAll (Continued)
     • Apriori-generate
       • Join pairs of sequences in Lk-1 to generate Ck
       • Step 1 (join): for each pair of sequences in Lk-1 that share the same first k-2 litemsets, take all k-1 litemsets of the first sequence and append the last litemset of the other sequence
       • Step 2 (prune): delete every sequence in Ck that has a (k-1)-subsequence not in Lk-1
     • Example: given L3 = {1 2 3} {2 3 4} {1 2 4} {1 3 4} {1 3 5}
       • Step 1: C4 = {1 2 3 4} {1 3 4 5} {1 3 5 4} {1 2 4 3}
       • Step 2: C4 = {1 2 3 4}
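
The join and prune steps can be sketched separately so that the intermediate result matches the example above. Sequences are assumed to be tuples of litemset ids. The sketch also assumes a sequence is never joined with itself, which agrees with the candidates listed on the slide. The names `join` and `prune` are hypothetical.

```python
def join(prev):
    """Step 1: extend each sequence with the last element of every other
    sequence that shares its first k-2 elements."""
    return {p + (q[-1],) for p in prev for q in prev
            if p != q and p[:-1] == q[:-1]}


def prune(candidates, prev):
    """Step 2: drop candidates with a (k-1)-subsequence that is not large.

    Each (k-1)-subsequence is obtained by deleting one position.
    """
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


# The example from the slide.
L3 = {(1, 2, 3), (2, 3, 4), (1, 2, 4), (1, 3, 4), (1, 3, 5)}
joined = join(L3)
print(sorted(joined))             # candidates after step 1
print(sorted(prune(joined, L3)))  # candidates after step 2
```

For instance, `(1, 2, 4, 3)` is pruned because its subsequence `(2, 4, 3)` is not in L3.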

  19. AprioriAll (Continued)
     • Example: min_sup = 3
     • Large sequences: {1 2 3}, {1 4}

  20. AprioriSome
     • Intuition: a subsequence of a frequent sequence cannot be in the final set of maximal sequences, so counting longer sequences first can avoid counting some of their subsequences
     • Example: suppose {2 3} {3 4} {1 2} {1 2 3} are the frequent sequences; then the final maximal sequences are {3 4} and {1 2 3}

  21. AprioriSome (Continued)
     • Step 1: Set C1 = L1, last = 1, k = 2
     • Step 2: Forward phase
       • Step 2.1: Generate Ck from Lk-1 if it is known, otherwise from Ck-1
       • Step 2.2: If k = next(last), scan the database to generate Lk based on the minimum support, and set last = k
       • Step 2.3: If both Ck and Llast are not empty, increase k by 1 and repeat from step 2.1
     • Step 3: Backward phase
       • Step 3.1: Decrease k by 1. If Lk was not computed in the forward phase, delete the sequences in Ck that are contained in some Li with i > k, then scan the database again to generate Lk based on the given minimum support. If Lk was already computed, delete the sequences in Lk that are contained in some Li with i > k.
       • Step 3.2: If k > 1, repeat from step 3.1.
     • Step 4: Take the union of the sequences in all Lk
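
The forward and backward phases can be sketched as below. This is a simplified illustration, not the presenters' code. It assumes a transformed database (lists of sets of litemset ids), tuples of ids as candidates, a join that never pairs a sequence with itself, and containment checked on ids only. The customer sequences are illustrative data, and every function name is hypothetical.

```python
def contains(customer_seq, candidate):
    pos = 0
    for lit in candidate:
        while pos < len(customer_seq) and lit not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def is_subsequence(a, b):
    it = iter(b)
    return all(x in it for x in a)


def generate(prev):
    joined = {p + (q[-1],) for p in prev for q in prev
              if p != q and p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def count(candidates, db, min_count):
    return {c for c in candidates
            if sum(contains(s, c) for s in db) >= min_count}


def apriori_some(db, min_count, next_k):
    ids = {lit for seq in db for t in seq for lit in t}
    large = {1: count({(i,) for i in ids}, db, min_count)}
    cand = {1: large[1]}
    last, k = 1, 2
    # Forward phase: count only the lengths selected by next_k.
    while cand[k - 1] and large[last]:
        cand[k] = generate(large[k - 1] if k - 1 in large else cand[k - 1])
        if k == next_k(last):
            large[k] = count(cand[k], db, min_count)
            last = k
        k += 1
    # Backward phase: count skipped lengths, dropping non-maximal sequences.
    longer = set()
    for k in range(max(cand), 0, -1):
        source = large[k] if k in large else cand[k]
        kept = {s for s in source
                if not any(is_subsequence(s, t) for t in longer)}
        large[k] = kept if k in large else count(kept, db, min_count)
        longer |= large[k]
    return set().union(*large.values())


# Illustrative transformed customer sequences (assumed data).
db = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]

print(sorted(apriori_some(db, min_count=2, next_k=lambda k: 2 * k)))
```

With next(k) = 2k the forward phase counts only lengths 2 and 4. Length 3 is counted in the backward phase, after the candidates contained in `(1, 2, 3, 4)` have been removed.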

  22. AprioriSome (Continued)
     • Efficiency depends heavily on the next(k) function
     • Tradeoff: counting non-maximal sequences versus counting extensions of small candidate sequences
     • A special case: with next(k) = k+1 every length is counted in the forward phase, as in AprioriAll
     • Example: next(k) can be chosen from the ratio of the number of sequences in Lk to the number of candidates in Ck
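
A ratio-based next(k) could look like the sketch below. The idea is that a high hit ratio |Lk| / |Ck| suggests that most candidates are large, so longer jumps are less likely to waste counting effort. The thresholds and jump sizes are illustrative assumptions, not values taken from the slides, and `next_length` is a hypothetical name.

```python
def next_length(k, num_large, num_candidates):
    """Pick the next length to count from the hit ratio |Lk| / |Ck|.

    Thresholds and jump sizes are illustrative assumptions.
    """
    hit = num_large / num_candidates
    if hit < 0.666:
        return k + 1
    if hit < 0.75:
        return k + 2
    if hit < 0.80:
        return k + 3
    if hit < 0.85:
        return k + 4
    return k + 5


print(next_length(2, num_large=1, num_candidates=10))  # 3
print(next_length(2, num_large=9, num_candidates=10))  # 7
```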

  23. AprioriSome (Continued)
     • Example: next(k) = 2k, min_sup = 2
     • Answers: {1 2 3 4} {1 3 5} {4 5}

  24. DynamicSome
     • Intuition: same idea as AprioriSome
     • Differences between the two algorithms

  25. DynamicSome (Continued)
     • Step 1: Generate L1 to Lstep as in the Apriori algorithm
     • Step 2: Forward phase
       • Step 2.1: Set k = step
       • Step 2.2: Scan the database to generate Ck+step using otf-generate(Lk, Lstep, c) for each customer sequence c, and then generate Lk+step from Ck+step based on the given minimum support
       • Step 2.3: If Lk+step is not empty, set k = k+step and repeat from step 2.2
     • Step 3: Intermediate phase
       • Generate all the missing Ck from Lk-1 if it is known, otherwise from Ck-1
     • Step 4: Backward phase, the same as that of AprioriSome

  26. DynamicSome (Continued)
     • On-the-fly candidate generation
       • Input: a customer sequence c = < c1 c2 … cn >, Lk and Lj
       • Xk = subseq(Lk, c), the sequences of Lk contained in c
       • For all sequences x in Xk:
         • x.end = min{ j | x is contained in < c1 c2 … cj > }
       • Xj = subseq(Lj, c), the sequences of Lj contained in c
       • For all sequences x in Xj:
         • x.start = max{ j | x is contained in < cj cj+1 … cn > }
       • Answer = join of Xk with Xj over the pairs where Xk.end < Xj.start

  27. DynamicSome (Continued)
     • Example
       c = < {1} {2} {3 7} {4} >
       L2 = < 1 2 > < 1 3 > < 3 4 >
       Thus, result = < 1 2 3 4 >
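
The on-the-fly generation and the example above can be sketched as follows. The sketch assumes the customer sequence is a list of sets of litemset ids and that the sequences in Lk and Lj are tuples of ids. Positions are 1-based, as on the slide. The function names are hypothetical.

```python
def earliest_end(seq, c):
    """Smallest j such that seq is contained in c1..cj, or None."""
    pos = 0
    for lit in seq:
        while pos < len(c) and lit not in c[pos]:
            pos += 1
        if pos == len(c):
            return None
        pos += 1
    return pos  # 1-based position of the last matched element


def latest_start(seq, c):
    """Largest j such that seq is contained in cj..cn, or None."""
    pos = len(c) - 1
    for lit in reversed(seq):
        while pos >= 0 and lit not in c[pos]:
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    return pos + 2  # 1-based position of the first matched element


def otf_generate(large_k, large_j, c):
    """Concatenate x from Lk and y from Lj when x ends before y starts in c."""
    ends = {x: earliest_end(x, c) for x in large_k}
    starts = {y: latest_start(y, c) for y in large_j}
    return {x + y
            for x, e in ends.items() if e is not None
            for y, s in starts.items() if s is not None and e < s}


# The example from the slide.
c = [{1}, {2}, {3, 7}, {4}]
L2 = {(1, 2), (1, 3), (3, 4)}
print(otf_generate(L2, L2, c))
```

Here `(1, 2)` ends at position 2 and `(3, 4)` starts at position 3, so they are joined. `(1, 3)` ends at position 3, which is not strictly before that start, so it produces no candidate.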

  28. DynamicSome (Continued)
     • Example: step = 2, min_sup = 2
     • Answers: {1 2 3 4} {1 3 5} {4 5}

  29. Performance

  30. Performance (Continued)
     Note: The results of DynamicSome were not plotted for low values of minimum support because it generated too many candidates and ran out of memory.

  31. Performance (Continued)

  32. Performance (Continued)

  33. Performance (Continued)

  34. Conclusions
     • The problem of mining sequential patterns from a database of customer transactions was introduced, and three algorithms for solving it were presented.
     • Two of the algorithms, AprioriSome and AprioriAll, have comparable performance, although AprioriSome performs a little better for the lower values of the minimum support.
     • Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions.
     Questions?