Spring 2014 Presentation by: Thomas Little

Mining Sequential Patterns Rakesh Agrawal & Ramakrishnan Srikant Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995. Spring 2014 Presentation by: Thomas Little *with slides adapted from Dan Brown’s 2011 presentation.

Outline • Introduction • Problem Description • Finding Sequential Patterns • Performance • Conclusion • Final Exam Questions 1

Introduction • Bar-code technology allows the collection of massive amounts of sales data (basket data). • A typical data record consists of: • transaction date • items bought • customer-id 3

Introduction - Cont. • The problem of mining sequential patterns over this data is introduced. • So far, we have seen frequent pattern mining in the context of association rules, where we were interested in what items were purchased in the same transaction. These are intra-transactional patterns. 4

Introduction - Cont. • The problem of sequential pattern mining is concerned with inter-transactional patterns. • A pattern in the first case consists of a set of unordered items: {a,c,d,g} • A pattern in the second case is an ordered list of sets of items: <{a},{c,d},{g}> 5

Introduction - Cont. • An example of a sequential pattern: Customers typically rent* “Star Wars”, then “The Empire Strikes Back”, followed by “Return of the Jedi”. *Note that these rentals do not need to be consecutive. Customers who rent other videos in between also support this sequential pattern. 6

Introduction - Cont. • Elements of a sequential pattern can be sets of items as well. For example: “Fitted sheet, flat sheet, and pillow cases”, followed by “comforter”, followed by “drapes and ruffles”. 7

Problem Description • We are given a database Dof customer transactions. • Each transaction consists of the fields: • customer-id • transaction-time • items purchased in the transaction 9

Problem Description • No customer has more than one transaction with the same transaction-time. • Quantities of items bought are not considered: each item is a binary variable representing whether an item was bought or not. 10

Problem Description (Terminology and definitions) • Itemset: non-empty set of items. Each itemset is mapped to an integer. • Sequence: Ordered list of itemsets. • Customer Sequence: List of customer transactions ordered by increasing transaction time. • A customer supports a sequence if the sequence is contained in the customer-sequence. • Support for a Sequence: Fraction of total customers that support a sequence. 11

Problem Description (Terminology and definitions) - Cont. • Maximal Sequence: A sequence that is not contained in any other sequence. • Large Sequence: Sequence that meets minisup. • Lengthof a sequence: The # of itemsets in the sequence. A sequence of length k is called a k-sequence. • The support for an itemseti is defined as the fraction of customers who bought the items in i in a single transaction. • an itemset with minimum support is called a large itemset or Litemset. 12

Problem Description (Terminology and definitions) - Cont. • Note that each itemset in a large sequence must have minimum support. Therefore, any large sequence must be a list of Litemsets. 13

Problem Description - Cont. • Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain specified minimum support. • Each such sequence represents a sequential pattern. 14

Problem Description Example: Seq with minimum support Note: Use Minisup of 25%, no less than two customers must support the sequence < (10 20) (30) > Does not have enough support (Only by Customer #2) < (30) >, < (70) >, < (30) (40) > … are not maximal. 15

Finding Sequential Patterns • The problem of finding sequential patterns is split into five phases: • Sort Phase • Large itemset (Litemset) Phase • Transformation Phase • Sequence Phase • Maximal Phase 17

Finding Sequential Patterns: 1. Sort Phase • The DB is sorted, with customer-id as the major key and transaction-time as the minor-key. • This step implicitly converts the original transaction DB into a DB of customer sequences. • Recall, a Customer Sequence is a list of customer transactions ordered by increasing transaction time. 18

Finding Sequential Patterns: 2. Litemset Phase • In this phase we find the set of all Large itemsets (Litemsets) L. • We are also simultaneously finding the set of large 1-sequences, since this set is just: {< l > | l ∈ L } 19

Finding Sequential Patterns: 2. Litemset Phase - Cont. • In Apriori, the support for an itemset was defined as the fraction of transactions in which an itemset is present. • In the sequential pattern finding problem, the support is the fraction of customers who bought the itemset in any one of their possibly many transactions. 20

Finding Sequential Patterns: 2. Litemset Phase - Cont. • The set of Litemsets is mapped to a set of contiguous integers. • By treating Litemsets as single entities, two Litemsets can be compared for equality in constant time, reducing the time required to check if a sequence is contained in a customer sequence. 21

Finding Sequential Patterns: 2. Litemset Phase - Cont. • Example with the minimum support 40% 22

Finding Sequential Patterns: 3. Transformation Phase • As we shall see later, we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. In order to make this test fast, the customer sequences are transformed into an alternative representation. 23

Finding Sequential Patterns: 3. Transformation Phase - Cont. • Each transaction is replaced by the set of all Litemsets contained in the transaction. • Transactions with no Litemsets are dropped. (But empty customer sequences still contribute to the total customer count) • A customer sequence is now represented by a list of sets of Litemsets 24

Finding Sequential Patterns: 3. Transformation Phase - Cont. Note: (10 20) dropped because of lack of support. (40 60 70) replaced with set of litemsets {(40),(70),(40 70)} (60 does not have minisup). 25

Finding Sequential Patterns 4. Sequence Phase Overview Seed set of large sequences Create candidate sequences Scan data to find support of candidate sequences Determine large sequences 26

Finding Sequential Patterns 4. Sequence Phase • Use the set of Litemsets to find the desired sequences. • Two families of algorithms are presented: • Count-all • Count-some 27

Finding Sequential Patterns 4. Sequence Phase • Count-all algorithms count all the large sequences, including non-maximal sequences, which are pruned out in the maximal phase. 28

Finding Sequential Patterns 4. Sequence Phase • Count-some algorithms try to avoid counting non-maximal sequences by first counting longer sequences in a forward phase, then counting the sequences skipped in a backward phase. 29

Finding Sequential Patterns 4. Sequence Phase: AprioriAll L1 = {large 1-sequences}; //result of Litemset phase for (k = 2; Lk-1 ≠ {}; k++) do begin Ck = New candidates generated from Lk-1 foreach customer-sequence c in the database do Increment the count of all candidates in Ck that are contained in c. Lk = Candidates in Ck with minimum support. end Answer = Maximal Sequences in ∪k Lk Notation: Lk: Set of all large k-sequences , Ck: Set of candidate k-sequences. 30

Finding Sequential Patterns 4. AprioriAll Candidate Generation • The apriori-generate function takes as argument Lk-1, the set of all large (k-1)-sequences. The function works as follows: First, join Lk-1 with Lk-1: insert into Ck select p.litemset1 , ..., p.litemsetk-1, q.litemsetk-1 from Lk-1 p, Lk-1 q where p.litemset1 = q.litemset1, . . ., p.litemsetk-2 = q.litemsetk-2 ; • Next, delete all sequences c ∈ Ck such that some (k-1)-subsequence of c is not in Lk-1 31

Finding Sequential Patterns 4. AprioriAll Candidate Generation Example • <1 2 4 3> is pruned out because the subsequence <2 4 3> is not in L3. The authors cite a previous paper for the proof of correctness of the candidate generation procedure. 32

Finding Sequential Patterns 4. AprioriAll Maximal Phase • Having found the set S of all large sequences in the sequence phase, the following algorithm can be used to find the maximal sequences. Let n = length of the longest sequence for ( k = n; k > 1; k --) foreach k-sequence skdo Delete from S all subsequences of sk • Authors claim data-structures and an algorithm exist to do this efficiently (hash-trees), citing two of their earlier papers. 33

Finding Sequential Patterns 4. AprioriAll Example 34

Finding Sequential Patterns 4. AprioriSome • AprioriSome uses the function “next” to determine which sequences to skip. Let hitk = |Lk| / |Ck| (i.e., ratio of large k-sequences to candidate k-sequences) function next(k: integer) //k is the length of seq counted last pass begin if (hitk < 0.666) return k + 1; elseif (hitk < 0.75) return k + 2; elseif (hitk < 0.80) return k + 3; elseif (hitk < 0.85) return k + 4; elsereturn k + 5; end • next returns the length of sequences to count in the next pass. 35

Finding Sequential Patterns 4. AprioriSome Forward Phase L1 = {large 1-sequences}; //Result of Litemset phase C1 = L1; last = 1; //We last counted Clast for (k = 2; Ck-1 ≠ {} and Llast ≠ {}; k++) do begin if (Lk-1 known) then Ck = New candidates generated from Lk-1 else Ck = New candidates generated from Ck-1 if (k== next(last) ) thenbegin // (next k to count?) foreach customer-sequence c in the database do Increment the count of all candidates in Ck that are contained in c. Lk = Candidates in Ck with minimum support. last = k; end end 36

Finding Sequential Patterns 4. AprioriSome Backward Phase for (k--; k>=1; k--) do if (Lk not found in forward phase) then begin Delete all sequences in Ck contained in some Li , i>k; foreach customer-sequence c in DTdo Increment the count of all candidates in Ck that are contained in c Lk = Candidates in Ck with minimum support end else // Lk already known Delete all sequences in Lk contained in some Li , i>k; Answer = Uk Lk //(Maximal Phase not Needed) *Notation: DT; Transformed database 37

Finding Sequential Patterns 4. AprioriSome Example 38

Finding Sequential Patterns 4. AprioriSome Example - Cont. Minimum Support = 40% (2 customer sequences). 39

Finding Sequential Patterns 4. AprioriSome Example Answer: <1 2 3 4> , <1 3 5> , <4 5> 40

Finding Sequential Patterns 4. DynamicSome • Like AprioriSome, it skips counting candidate sequences of certain lengths in the forward phase. • AprioriSome generates Ck from Lk-1 or Ck-1 • DynamicSome generates Ck “on the fly” based on large sequences found from the previous passes and the customer sequences read from the database. 41

Finding Sequential Patterns 4. DynamicSome • In the initialization phase, count only sequences up to and including step variable length If step is 3, count sequences of length 1, 2 and 3 • In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on-the-fly based on previous passes and customer sequences in the database • While generating sequences of length 9 with a step size 3: While passing the data, if sequences s6∈ L6 and s3∈ L3 are both contained in the customer sequence c in hand, and they do not overlap in c, then < s6.s3 > is a candidate (6+3)-sequence 42

Finding Sequential Patterns 4. DynamicSome • In the intermediate phase, generate the candidate sequences for the skipped lengths • If we have counted L6 and L3 , and L9 turns out to be empty: we generate C7 and C8 , count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5. • The backward phase is identical to AprioriSome. 43

Finding Sequential Patterns 4. DynamicSome Initialization Phase // step is an integer ≥ 1 L1 = {large 1-sequences}; // Result of litemset phase for ( k = 2; k <= step and Lk􀀀-1 ≠ {}; k++ ) do begin Ck = New candidates generated from Lk􀀀-1; foreach customer-sequence c in DTdo Increment the count of all candidates in Ck that are contained in c. Lk = Candidates in Ck with minimum support. end 44

Finding Sequential Patterns 4. DynamicSome Forward Phase for ( k = step; Lk􀀀 ≠ {}; k+= step ) do begin //find Lk+step from Lk and Lstep Ck+step ={}; foreach customer-sequence c in DTdo begin X = otf-generate(Lk , Lstep , c) // foreach sequence x ∈ X’, increment its count in Ck+step (adding it to Ck+step if necessary). end Lk+step = Candidates in Ck+step with minimum support. end 45

Finding Sequential Patterns 4. DynamicSome OTF-Generate // c is the customer sequence < c1c2...cn > Xk = subseq(Lk, c); forall sequences x ∈ Xkdo x.end = min{ j | x ⊆< c1c2...cj > }; Xj = subseq(Lj , c); forall sequences x ∈Xjdo x.start = max{ j | x ⊆< cjcj+1...cn >}; Answer = join of Xkwith Xj with the join condition: Xk.end < Xj.start; 46

Finding Sequential Patterns 4. DynamicSome OTF-Generate cont Let otf-generate be called with L2 and let c = <{1} {2} {3 7} {4}>. Thus, c1 = {1}, c2= {2}, etc. Thus, the result of the join with the join condition: X2.end < X2.start (where X2 denotes the seq. of length 2) is the single sequence: <1 2 3 4> 47

Finding Sequential Patterns 4. DynamicSome intermediate Phase for ( k--; k > 1; k-- ) do if (Lk not yet determined) then if (Lk-1 known) then Ck = New candidates generated from Lk-1; else Ck = New candidates generated from Ck-1; 48

Finding Sequential Patterns 4. DynamicSome Example • Let step = 2 use L2 and L2 as argument in otf-generate to get C4 49

Spring 2014 Presentation by: Thomas Little