ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis)

Sequential Pattern Mining
Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business

Sequential Patterns
Given: a transaction database { cid, tid, date, item }
Find: inter-transaction patterns among customers
Example: customers typically rent “Star Wars”, then “Empire Strikes Back”, and then “Return of the Jedi”.

Sequential Patterns
cid  tid  date        item
1    1    01/01/2000  30
1    2    01/02/2000  90
2    3    01/01/2000  40,70
2    4    01/02/2000  30
2    5    01/03/2000  40,60,70
3    6    01/01/2000  30,50,70
4    7    01/01/2000  30
4    8    01/02/2000  40,70
4    9    01/03/2000  90
5    10   01/01/2000  90

Sequential Patterns
Itemset: a non-empty set of items, e.g., {30}, {40,70}.
Sequence: an ordered list of itemsets, e.g., <{30} {40,70}>, <{40,70} {30}>.
Size of a sequence: the number of itemsets in that sequence.

Sequential Patterns
Each transaction of a customer in the transaction table above can be viewed as an itemset.
A customer's sequence contains that customer's itemsets, ordered by transaction date.

Sequential Patterns
cid  customer sequence
1    <{30} {90}>
2    <{40,70} {30} {40,60,70}>
3    <{30,50,70}>
4    <{30} {40,70} {90}>
5    <{90}>
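The step from the transaction table to the customer sequences can be sketched as follows. This is a minimal illustration, not the course's implementation: the function name `customer_sequences` and the tuple layout are hypothetical, and the dates are assumed to be in MM/DD/YYYY form.

```python
from collections import defaultdict
from datetime import date

# (cid, tid, date, items) rows from the example table.
# Assumption: dates in the table are MM/DD/YYYY.
transactions = [
    (1, 1, date(2000, 1, 1), {30}),
    (1, 2, date(2000, 1, 2), {90}),
    (2, 3, date(2000, 1, 1), {40, 70}),
    (2, 4, date(2000, 1, 2), {30}),
    (2, 5, date(2000, 1, 3), {40, 60, 70}),
    (3, 6, date(2000, 1, 1), {30, 50, 70}),
    (4, 7, date(2000, 1, 1), {30}),
    (4, 8, date(2000, 1, 2), {40, 70}),
    (4, 9, date(2000, 1, 3), {90}),
    (5, 10, date(2000, 1, 1), {90}),
]

def customer_sequences(rows):
    """Group transactions by customer and order each customer's itemsets by date."""
    by_customer = defaultdict(list)
    for cid, tid, day, items in rows:
        by_customer[cid].append((day, tid, frozenset(items)))
    return {
        cid: [items for _, _, items in sorted(entries, key=lambda e: (e[0], e[1]))]
        for cid, entries in by_customer.items()
    }

sequences = customer_sequences(transactions)
for cid, seq in sequences.items():
    print(cid, [sorted(itemset) for itemset in seq])
```

Ties on the date are broken by tid here, which is a design choice of this sketch rather than something the slides specify.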

Sequential Patterns
Sequence <a1 a2 … an> is contained in sequence <b1 b2 … bm> if there exist indexes i1 < i2 < … < in such that a1 ⊆ b_i1, a2 ⊆ b_i2, …, and an ⊆ b_in.
E.g., <{3} {4,5} {8}> is contained in <{3,8} {4,5,6} {8}>.
Is <{3} {4,5} {8}> contained in <{7} {3,8} {9} {4,5,6} {8}>?
Is <{3} {4,5} {8}> contained in <{7} {9} {4,5,6} {3,8} {8}>?
Is <{3} {4,5} {8}> contained in <{7} {9} {3,8} {4,5,6}>?
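The containment test can be checked mechanically. The sketch below is one possible way to code it, assuming a sequence is a Python list of sets; the name `is_contained` is hypothetical. It scans the longer sequence left to right and matches each itemset of the shorter one to the earliest later itemset that is a superset of it.

```python
def is_contained(a, b):
    """Return True if sequence a is contained in sequence b.

    Each itemset of a must be a subset of some itemset of b,
    and the matched itemsets of b must appear in the same order.
    """
    pos = 0
    for itemset in a:
        # Skip itemsets of b that do not contain the current itemset of a.
        while pos < len(b) and not set(itemset) <= set(b[pos]):
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next itemset of a must match a strictly later itemset of b
    return True

pattern = [{3}, {4, 5}, {8}]
print(is_contained(pattern, [{3, 8}, {4, 5, 6}, {8}]))            # True
print(is_contained(pattern, [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
print(is_contained(pattern, [{7}, {9}, {4, 5, 6}, {3, 8}, {8}]))  # False
print(is_contained(pattern, [{7}, {9}, {3, 8}, {4, 5, 6}]))       # False
```

The third case fails because {3} only appears after {4,5,6}, and the fourth because no itemset containing 8 follows {4,5,6}.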

Sequential Patterns
A customer supports sequence s if s is contained in the sequence for this customer.
E.g., customers 1 and 4 support sequence <{30} {90}>.

Sequential Patterns
The support for a sequence s is defined as the fraction of total customers who support s.
E.g., customers 1 and 4 support sequence <{30} {90}>, so Supp(<{30} {90}>) = 2/5 = 40%.
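The support calculation can be reproduced with a short sketch. The names `db`, `is_contained`, and `support` are hypothetical, and the customer sequences are typed in directly from the example.

```python
from fractions import Fraction

# Example customer sequences: customer id -> ordered list of itemsets.
db = {
    1: [{30}, {90}],
    2: [{40, 70}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}

def is_contained(a, b):
    """True if every itemset of a fits, in order, inside a later itemset of b."""
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True

def support(seq, db):
    """Return the supporting customers and the fraction of all customers they make up."""
    supporters = [cid for cid, customer_seq in db.items() if is_contained(seq, customer_seq)]
    return supporters, Fraction(len(supporters), len(db))

supporters, supp = support([{30}, {90}], db)
print(supporters, supp)  # [1, 4] 2/5
```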

Sequential Patterns
Supp(<{40,70}>) = 2/5 = 40% (sequence support: customers 2 and 4 out of 5 customers)
Supp({40,70}) = 3/10 = 30% (itemset support: transactions 3, 5, and 8 out of 10 transactions)
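The two numbers differ because they count different things: sequence support counts customers, while itemset support counts transactions. A minimal sketch of both counts, with hypothetical function names and the example data typed in directly:

```python
from fractions import Fraction

# Example database: customer id -> that customer's transactions in date order.
db = {
    1: [{30}, {90}],
    2: [{40, 70}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}

def itemset_support(itemset, db):
    """Fraction of all transactions that contain the itemset."""
    transactions = [t for seq in db.values() for t in seq]
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def one_sequence_support(itemset, db):
    """Fraction of customers with at least one transaction containing the itemset."""
    hits = sum(1 for seq in db.values() if any(itemset <= t for t in seq))
    return Fraction(hits, len(db))

print(itemset_support({40, 70}, db))       # 3/10
print(one_sequence_support({40, 70}, db))  # 2/5
```

Customer 2 bought {40,70} together twice, which counts twice at the transaction level but only once at the customer level.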

Sequential Patterns Mining
Given: a transaction database { cid, tid, date, item }
Find: all sequences whose support is at least the user-specified minimum support (such sequences are called large)
Apriori property: if a sequence is large, then all sequences contained in that sequence must also be large.

Sequential Patterns Mining
1. Identify all large 1-sequences.
2. Repeat until there are no more candidate k-sequences:
   2-1. Identify all candidate k-sequences using large (k-1)-sequences.
        Join: two large (k-1)-sequences, L1 and L2, that are joinable must satisfy the following conditions:
        L1(1) = L2(1) and L1(2) = L2(2) and … and L1(k-2) = L2(k-2)
        L1(k-1) ≠ L2(k-1)
   2-2. Prune: prune candidate k-sequences generated in step 2-1 that have sub-sequences that are not large.
   2-3. Determine large k-sequences from candidate k-sequences.
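The join and prune steps can be sketched as below. This is an illustration under stated assumptions, not a definitive implementation: a sequence is a tuple of frozensets, two sequences are joined when they agree on all but their last itemset and their last itemsets differ, and "sub-sequences" in the prune step is taken to mean the (k-1)-sequences obtained by dropping one itemset. The names `join`, `prune`, and `fs` are hypothetical; the input lists are the large 1- and 2-sequences of the worked example.

```python
def fs(*items):
    """Shorthand for building an itemset."""
    return frozenset(items)

def join(large_prev):
    """Join step: extend L1 with the last itemset of L2 when L1 and L2
    share their first k-2 itemsets and differ in the last one."""
    return [
        l1 + (l2[-1],)
        for l1 in large_prev
        for l2 in large_prev
        if l1[:-1] == l2[:-1] and l1[-1] != l2[-1]
    ]

def prune(candidates, large_prev):
    """Prune step: keep a candidate only if every sequence obtained by
    dropping one of its itemsets is large."""
    large = set(large_prev)
    return [
        c for c in candidates
        if all(c[:i] + c[i + 1:] in large for i in range(len(c)))
    ]

large_1 = [(fs(30),), (fs(40),), (fs(70),), (fs(90),), (fs(40, 70),)]
large_2 = [(fs(30), fs(40)), (fs(30), fs(70)), (fs(30), fs(90)), (fs(30), fs(40, 70))]

candidates_2 = prune(join(large_1), large_1)
candidates_3_joined = join(large_2)
candidates_3 = prune(candidates_3_joined, large_2)
print(len(candidates_2), len(candidates_3_joined), len(candidates_3))  # 20 12 0
```

Pruning removes nothing at k = 2 because every single itemset involved is already large; at k = 3 it removes every candidate.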

Sequential Patterns Mining
cid  customer sequence
1    <{30} {90}>
2    <{40,70} {30} {40,60,70}>
3    <{30,50,70}>
4    <{30} {40,70} {90}>
5    <{90}>
Minimum Support: 40%

Sequential Patterns Mining
Minimum Support: 40%
Large 1-sequences:
<{30}>     support = 4/5 = 80%
<{40}>     support = 2/5 = 40%
<{70}>     support = 3/5 = 60%
<{90}>     support = 3/5 = 60%
<{40,70}>  support = 2/5 = 40%
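One way to obtain the large 1-sequences is to enumerate every non-empty subset of every transaction and count the customers in which it occurs. The sketch below does exactly that by brute force, which is only reasonable for small transactions like those in the example; the function name `large_1_sequences` is hypothetical, and the slides do not say how this step is implemented.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

db = {
    1: [{30}, {90}],
    2: [{40, 70}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}

def large_1_sequences(db, min_support):
    """Return {itemset: support} for itemsets supported by enough customers."""
    counts = Counter()
    for customer_seq in db.values():
        seen = set()  # each customer counts at most once per itemset
        for transaction in customer_seq:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(frozenset(c) for c in combinations(items, size))
        counts.update(seen)
    n = len(db)
    return {
        itemset: Fraction(count, n)
        for itemset, count in counts.items()
        if Fraction(count, n) >= min_support
    }

large_1 = large_1_sequences(db, Fraction(2, 5))
for itemset, supp in sorted(large_1.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), supp)
```

Exact fractions are used instead of floats so that a support of exactly 40% compares cleanly against the 40% threshold.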

Sequential Patterns Mining
Candidate 2-sequences (generated from the large 1-sequences <{30}>, <{40}>, <{70}>, <{90}>, <{40,70}>):
<{30} {40}>     <{30} {70}>     <{30} {90}>     <{30} {40,70}>
<{40} {30}>     <{40} {70}>     <{40} {90}>     <{40} {40,70}>
<{70} {30}>     <{70} {40}>     <{70} {90}>     <{70} {40,70}>
<{90} {30}>     <{90} {40}>     <{90} {70}>     <{90} {40,70}>
<{40,70} {30}>  <{40,70} {40}>  <{40,70} {70}>  <{40,70} {90}>

Sequential Patterns Mining
Large 2-sequences (candidate 2-sequences that meet the 40% minimum support):
<{30} {40}>     support = 2/5 = 40%
<{30} {70}>     support = 2/5 = 40%
<{30} {90}>     support = 2/5 = 40%
<{30} {40,70}>  support = 2/5 = 40%
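The counting step that turns the 20 candidate 2-sequences into the 4 large 2-sequences can be sketched as follows. The names are hypothetical, the candidates are built as ordered pairs of distinct large 1-sequences, and the data comes straight from the example.

```python
from fractions import Fraction
from itertools import permutations

db = {
    1: [{30}, {90}],
    2: [{40, 70}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
large_1 = [frozenset({30}), frozenset({40}), frozenset({70}),
           frozenset({90}), frozenset({40, 70})]

def is_contained(a, b):
    """True if every itemset of a fits, in order, inside a later itemset of b."""
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True

def support(seq, db):
    """Fraction of customers whose sequence contains seq."""
    return Fraction(sum(is_contained(seq, s) for s in db.values()), len(db))

# Ordered pairs of distinct large 1-sequences: 5 * 4 = 20 candidates.
candidates = list(permutations(large_1, 2))
supports = {c: support(c, db) for c in candidates}
large_2 = {c: s for c, s in supports.items() if s >= Fraction(2, 5)}

for first, second in large_2:
    print(sorted(first), sorted(second), large_2[(first, second)])
```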

Sequential Patterns Mining
Candidate 3-sequences (joined from the large 2-sequences <{30} {40}>, <{30} {70}>, <{30} {90}>, <{30} {40,70}>):
<{30} {40} {70}>     <{30} {40} {40,70}>  <{30} {70} {40}>     <{30} {70} {40,70}>
<{30} {40,70} {40}>  <{30} {40,70} {70}>  <{30} {40} {90}>     <{30} {90} {40}>
<{30} {70} {90}>     <{30} {90} {70}>     <{30} {90} {40,70}>  <{30} {40,70} {90}>
Prune: all sub-sequences of a candidate k-sequence should be large. Every candidate above has a sub-sequence that is not large, namely its last two itemsets (e.g., <{40} {70}> in <{30} {40} {70}>), because every large 2-sequence starts with {30}. All candidates are therefore pruned.
Candidate 3-sequences after pruning: none. Stop.

Summary
• What is a sequential pattern?
• What is support for a sequential pattern?
• How to mine sequential patterns?
• What are the similarities and dissimilarities between association rules and sequential patterns mining?
