1 / 34

Mining Sequential Patterns

Mining Sequential Patterns. Rakesh Agrawal & Ramakrishnan Srikant Presented By: Saleha Rizvi Sadia Masud. Outline. Introduction Terminology Formal Definition of MSP Finding Sequential Patterns 3 Algorithms Performance Evaluation Conclusion. Introduction.

aduck
Télécharger la présentation

Mining Sequential Patterns

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Mining Sequential Patterns Rakesh Agrawal & Ramakrishnan Srikant Presented By: Saleha Rizvi Sadia Masud

  2. Outline • Introduction • Terminology • Formal Definition of MSP • Finding Sequential Patterns • 3 Algorithms • Performance Evaluation • Conclusion

  3. Introduction • X->Y i.e. if ‘x’ exists ‘y’ will also follow in future with specified support. (laptop-> battery) • When things happen over time and in a sequence All transactions have a timestamp

  4. Introduction Assumptions • No more than one transaction at same time. • Quantity not important, occurrence of an item is. • If a customer bought same item in different transaction but in same sequence its counted only once. Properties: • Find inter-transaction patterns • Patterns consist of orderedlist of items (a, b) !=(b ,a) • Patterns need not be consecutive/contiguous e.g. T= (a, b, c, d) all (a, c, d) , (a, d), (b, d) are valid sequences of T • Element of patterns could be a single item or sets of items

  5. Terminology • Transaction: set of items bought at a time <customer id, time, items bought > • Itemset: non-empty set of items • eg. itemset I: (i1i2…im), each ij = an item • Sequence: ordered list of itemsets • eg. sequence s <s1s2…sn >, each si= an itemset Example For John Time 1: <(laptop, battery)> Time 2 : <(laptop bag)> Time 3 : <(Norton Antivirus)> D – database of transactions ( t2 ) ~ <(laptop bag)> Sequence for John becomes <(laptop, battery) (laptop bag) (Norton Antivirus)>

  6. Terminology • Sequence containership: a sequence a1a2…an is contained in another sequence b1b2…bm iff each itemset ‘ai’ is a subset of any itemset of ‘bi’ a = < (4) (7 9 10) (1 3 6) > b = <(8) (4 2 )(2 7 8 9 10) ( 2 1 4 3 7 9 6)> c = < (4 11) (7 9 10) (1 3 6) > a contained in b. c !contained in b d= <(4)(2)> 4 and 2 bought in different transactions e= <(4,2)> 4 and 2 bought in same transaction d not contained in e and vice versa

  7. Terminology • Maximal Sequence : not contained in other sequence. a = < (4) (7 9 10) (1 3 6) > not maximal -contained in b c= < (4) (7 9 10) (1 3 6 5) > -maximal b = <(8) (4 2 )(2 7 8 9 10) ( 2 1 4 3 7 9 6)> • Customer’s support for sequence : customer supports ‘s’ if ‘s’ contained in customer sequence e.g. John supports sequence ‘r’ but not ‘s’ John’s sequence <(laptop, battery) (laptop bag) (Norton Antivirus)> r=<(laptop )(laptop bag)> s=<(laptop )(laptop bag) (notebook cam)>

  8. Terminology • Support for sequence : Number of customers who support that sequence • e.g. database - 100 customer • 10 customer support sequence ‘r’ • Then, support for sequence r is 10% • different from ARM where support is in term of occurrences!! • Large Sequence: Sequence satisfying minimum support • Length of a sequence: No. of itemsets in a sequence i.e. k-sequence => sequence with k itemsets.

  9. Formal definition of MSP • Given D (Database of transactions) “Find maximal sequences that meet user specified minimum support”. • Involves 2 steps • Find all maximal sequences • Eliminate those which do not meet minimum support .

  10. Finding Sequential Patterns • Solution has 5 Phases • Sort Phase • Litemset phase • Transformation phase • Sequence Phase • Maximal phase

  11. Sort Phase • Sort the Database in ascending order • customer id -> primary key • transaction time ->secondary key • All transactions belonging to one customer in contiguous tuples. • Express the database as a set of customer-sequences.

  12. An Example

  13. Example: Customer Sequence Minimum support = 40% ~ 2 customers < (30)(90)> and <(30)(40 70)> => have min support & maximal => Desired Sequential Patterns <(30)>, <(30) (40)>, <(40)(70)>, etc. => have min support but not maximal customer sequences with no minimum support : <(10 20 )(30)>, <(30 50 70)>

  14. Litemset Phase • Some Definitions: • Support for an itemset: fraction of customers who bought the items in that itemset in a singletransaction; • litemset or large itemset: An itemset with minimum support • Note: A large sequence must be a list of litemsets (i.e., each itemset in a large sequence must have minimum support) • Find the set of all litemsets L (I.e., all large 1-sequences) • Using Apriori Algorithm • Note: support count should be incremented only once per customer • Map the set of litemsets to a set of contiguous integers (Optimizes time)

  15. Transformation Phase • Each transaction is replaced by the set of all litemsets in that transaction. • If no litemsets in a transaction => delete transaction. • If a customer sequence does not contain any large itemset => delete • Note: it still counts towards total number of customers. • Each litemset is then replaced by its mapped integer from litemset phase

  16. Transformation Phase Not a large itemset Only large itemsets A large itemset

  17. Sequence Phase • Use the litemsets from transformed DB to find the desired sequences • To achieve this: • Multiple passes over data • In each pass use the large sequences from the previous pass to generate new potentially large sequences/candidate sequences (like ARM). • Find support count for candidate sequences • End of pass find which of these candidate sequences are actually large/meet minimum support.

  18. Maximal Phase • Find the maximal sequences among the set of large sequences • Often combined with the Sequence Phase • S = set of all large sequences, n = length of the longest sequence • Algorithm to find maximal sequences:

  19. Sequence Phase: Detailed • Two Approaches • Count All : Includes Counting of non-maximal sequences => Pruned out in Maximal Phase • Count Some : Count Only maximal sequences; avoid counting non-maximal sequences. (No Maximal Phase).

  20. Sequence Phase (Contd.) • Three Algorithms: • AprioriAll (Count All) • AprioriSome (Count Some) • DynamicSome (Count Some)

  21. Algorithm AprioriAll • In the first pass, output of litemset phase is used to initialize the set of large 1-sequences

  22. AprioriAll Candidate Generation • Apriori-generate function: • Join step: join Lk-1 with Lk-1 • Prune step: delete all sequences c in Ck such that some (k-1)-subsequence of c is not in Lk-1 • Eg. <1 2 4 3> is pruned because <2 4 3> is not in L3

  23. Another Example … • Minimum support 40% (i.e., 2 customer sequences) • Get L1, L2, L3, L4 from Sequence Phase • Maximal Phase: Get the maximal large sequences: <1 2 3 4>, <1 3 5> and <4 5> => Seq. Patterns

  24. AprioriSome • Two Phases • Forward • Count candidate sequences of certain length • Backward • Delete non-maximal sequences • Count candidates of lengths not in forward phase • Delete non-maximal sequences from forward phase

  25. DynamicSome • Three Phases • Forward • Count candidate sequences of certain length • Intermediate Phase • Generate candidates from large sequences and database (customer sequences) –On the fly candidate generation • Backward • Delete non-maximal candidates from forward phase • Count candidate sequences of lengths not used in forward phase. • Delete non-maximal candidate sequences from backward phase

  26. Candidate Generation • Difference in Candidate Generation • AprioriAll- counts all length candidates/sequences (non-maximal too) • AprioriSome- counts some length sequences based on previous large ones. Prunes non-maximal sequences • DynamicSome-generate candidates using large sequences and database (customer sequences). Doesn’t prune the non-maximal sequences

  27. Performance Evaluation-Execution Time(ET) w.r.t. minimum support • ET increases with decrease in min support for all algorithms due to increase in number of large sequences. • DynamicSome performs worse • Generates and counts larger no of candidates (no pruning of non-maximal sequences) • AprioriSome and AprioriAll are comparable. • AprioriSome performs a little better for lower minimum support values

  28. Performance EvaluationContd.. • AprioriAll counts all large sequences,including non-maximal sequences (which are later pruned out); • AprioriSome avoids counting many non-maximal sequences by counting longer sequences first (major advantage over AprioriAll)

  29. Performance Evaluation- ET Vs Scale Up • Scale up-Number of customers-Number of transactions per customer Levels of minimum Support • Excellent scale-up property: Execution Time scales linearly with the increase in number of customers.

  30. Performance Evaluation- Scale Up Support : No of transactions • Number of transactions per customer increased keeping no. of items per transaction fixed • Execution time increases gradually • Large sequences increase with increase in customer sequence

  31. MSP vs ARM • Similarities • In both the cases we mine patterns of the form • X->Y which means if X exists Y also exists with certain support • Dissimilarities • Order is important in MSP but not in ARM • ARM => intra transaction MSP => inter transaction (over a period) • ARM => no timestamp

  32. Application of MSP • Discovering frequent buying patterns • Large Retail Organizations, Catalog Companies, etc. • Analysis of patients medical records /predict sequential symptoms of a disease • Web-access pattern

  33. Conclusion • Advantage • Complete: finds ALL sequential patterns • Scaling: can handle very large databases with millions of customer sequences • Disadvantage • Can’t find patterns related to items of interest (i.e. CAR) • Future Extensions • Sequential Patterns across item categories • Transposition of constraints into the discovery • item constraints: sequential patterns with specific items • Time constraints: minimum and maximum gap between adjacent elements of the patterns

  34. References • R.Agrawal and R.Srikant. Mining Sequential Patterns.In Proc. Of thr 11th Int’l Conference on Data Engineering, Taipei, Taiwan, March 1995. • R.Agrawal and R.Srikant .Mining Sequential Patterns: Generalizations and Performance ImprovementProceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology, ,Avignon, France, March 25-29, 1996 • M Wojciechowski . Discovering and Processing Sequential Patterns in Databases.VII Conference on Extending Database Technology - PhD Workshop, Konstanz, Germany, 2000.

More Related