
Association Rule Mining and Its Applications





Presentation Transcript


  1. Association Rule Mining and Its Applications. Jong Soo Park, Dept. of Computer Science, Sungshin Women's University, jpark@cs.sungshin.ac.kr

  2. Contents • Data Mining in the KDD Process • Definition of Association Rules • Mining Association Rules in Transaction Databases • Algorithms Apriori & DHP • Generalized Association Rules • Cyclic Association Rules and Negative Associations • Interestingness Measurement • Sequential Patterns and Path Traversal Patterns • Research Directions and Reference Homepages

  3. Overview of the steps constituting the KDD process: Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge

  4. Types of Data-Mining Problems • Prediction: Classification, Regression, Time Series • Knowledge Discovery: Deviation Detection, Database Segmentation, Clustering, Association Rules, Summarization, Visualization, Text Mining

  5. Association Rule. Ex: the statement that 90% of transactions that purchase bread and butter also purchase milk: [Bread], [Butter] → [Milk] (12.5%, 90%). 90%: the confidence factor of the rule (not 100%). 12.5%: the support for the rule, the fraction of transactions in the database. The left-hand side is the antecedent, the right-hand side the consequent. Example queries: • Find all rules that have "Diet Coke" as consequent. • Find all rules that have "bagels" in the antecedent. • Find the "best" k rules that have "bagels" in the consequent.

  6. Definition of Association Rules. I: a set of literals called items. T: a set of items such that T ⊆ I (a transaction). An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. Written X ⇒ Y [support, confidence].

  7. Mining Association Rules in Transaction Databases • Applications: pattern association, market analysis, etc. • Given data of transactions, where each transaction has a list of items purchased • Find all association rules: the presence of one set of items implies the presence of another set of items, e.g., people who purchased hammers also purchased nails. • Measurement of rule strength: Confidence: X & Y ⇒ Z has 90% confidence if 90% of customers who bought X and Y also bought Z. Support: useful rules (for business decisions) should have some minimum transaction support.

  8. Two Steps for Association Rules • Determining "large itemsets": find all combinations of items that have transaction support above the minimum support. Research has focused on this phase. • Generating rules:
for each large itemset L do
  for each subset c of L do
    if (support(L) / support(L - c) ≥ minimum confidence) then
      output the rule (L - c) ⇒ c,
      with confidence = support(L) / support(L - c)
      and support = support(L);
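The rule-generation loop above can be sketched in Python (a minimal sketch, not the paper's implementation; itemsets are frozensets mapped to support counts, and the table is assumed to contain every large itemset and all of their subsets):

```python
from itertools import combinations

def generate_rules(supports, min_conf):
    """Generate rules (L - c) => c from every large itemset L."""
    rules = []
    for L, sup_L in supports.items():
        if len(L) < 2:
            continue
        # enumerate every non-empty proper subset c of L
        for r in range(1, len(L)):
            for c in map(frozenset, combinations(L, r)):
                conf = sup_L / supports[L - c]
                if conf >= min_conf:
                    rules.append((L - c, c, conf))
    return rules

# Large itemsets of the example database on slide 10 (minimum support 2):
supports = {frozenset(s): n for s, n in
            [("A", 2), ("B", 3), ("C", 3), ("E", 3),
             ("AC", 2), ("BC", 2), ("BE", 3), ("CE", 2), ("BCE", 2)]}
rules = generate_rules(supports, min_conf=1.0)   # only 100%-confidence rules
```

With min_conf = 1.0 this yields the five perfect-confidence rules of that example, among them {B,C} ⇒ {E} and {A} ⇒ {C}.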

  9. Association Rules. Scan Database → Candidate Itemsets → (minimum support) → Large Itemsets → (minimum confidence) → rules. How to generate candidate itemsets: the Apriori method, join step + prune step. Focus on data structures that speed up scanning the database: hash tree, trie, hash table, etc.

  10. Apriori example (minimum support = 2).
Database D: TID 100: A C D; TID 200: B C E; TID 300: A B C E; TID 400: B E.
Scan D → C1 counts: {A}:2 {B}:3 {C}:3 {D}:1 {E}:3 → L1 = {A}:2 {B}:3 {C}:3 {E}:3.
C2 = {A B} {A C} {A E} {B C} {B E} {C E}; scan D → counts: {A B}:1 {A C}:2 {A E}:1 {B C}:2 {B E}:3 {C E}:2 → L2 = {A C}:2 {B C}:2 {B E}:3 {C E}:2.
C3 = {B C E}; scan D → {B C E}:2 → L3 = {B C E}:2.
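The trace above can be reproduced with a compact illustrative implementation (a sketch: candidate generation here joins any two large (k-1)-itemsets whose union has size k, then prunes on subsets, rather than the ordered join of slide 13):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: count} for every large itemset (count >= minsup)."""
    transactions = [frozenset(t) for t in transactions]
    items = set().union(*transactions)
    # L1: large 1-itemsets
    Lk = {frozenset([i]): c for i in items
          if (c := sum(1 for t in transactions if i in t)) >= minsup}
    large, k = {}, 1
    while Lk:
        large.update(Lk)
        k += 1
        prev = set(Lk)
        # join two (k-1)-itemsets, prune candidates with a small (k-1)-subset
        Ck = {p | q for p in prev for q in prev
              if len(p | q) == k
              and all(frozenset(s) in prev for s in combinations(p | q, k - 1))}
        Lk = {c: n for c in Ck
              if (n := sum(1 for t in transactions if c <= t)) >= minsup}
    return large

D = [("A", "C", "D"), ("B", "C", "E"), ("A", "B", "C", "E"), ("B", "E")]
L = apriori(D, minsup=2)
```

On the four-transaction database of this slide it returns the nine large itemsets of the trace: four 1-itemsets, four 2-itemsets, and {B, C, E}.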

  11. Algorithms for Mining Association Rules
• AIS (Agrawal et al., ACM SIGMOD, May '93)
• SETM (Swami et al., IBM Tech. Rep., Oct '93)
• Apriori (Agrawal et al., VLDB, Sept '94)
• OCD (Mannila et al., AAAI Workshop on KDD, July '94)
• DHP (Park et al., ACM SIGMOD, May '95)
• PARTITION (Savasere et al., VLDB, Sept '95)
• Mining Generalized Association Rules (Srikant et al., VLDB, Sept '95)
• Sampling Approach (Toivonen, VLDB, Sept '96)
• DIC (dynamic itemset counting, Brin et al., ACM SIGMOD, May '97)
• Cyclic Association Rules (Özden et al., IEEE ICDE, Feb '98)
• Negative Associations (Savasere et al., IEEE ICDE, Feb '98)

  12. Algorithm Apriori. Lk: set of large k-itemsets. Ck: set of candidate k-itemsets. Steps: C1 → L1 → C2 → L2 → … → Ck → Lk. Input: transaction file; output: large itemsets.
L1 = {large 1-itemsets}
for (k = 2; Lk-1 ≠ ∅; k++) do begin
  Ck = apriori-gen(Lk-1);
  forall transactions t ∈ D do begin
    Ct = subset(Ck, t);
    forall candidates c ∈ Ct do c.count++;
  end
  Lk = {c ∈ Ck | c.count ≥ minsup}
end
Answer = ∪k Lk;

  13. apriori-gen(Lk-1). Join step:
insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Prune step:
forall itemsets c ∈ Ck do
  forall (k-1)-subsets s of c do
    if (s ∉ Lk-1) then delete c from Ck;
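A direct Python transcription of apriori-gen (a sketch; itemsets are kept as sorted tuples so the join condition "equal first k-2 items, last item of p smaller than last item of q" can be applied literally):

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Join + prune: build Ck from L_prev, a set of sorted tuples,
    each one a large (k-1)-itemset."""
    Ck = set()
    for p in L_prev:
        for q in L_prev:
            # join step: equal first k-2 items, p's last item < q's last item
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune step: every (k-1)-subset of c must itself be large
                if all(s in L_prev for s in combinations(c, len(c) - 1)):
                    Ck.add(c)
    return Ck

# Slide 14's example:
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = apriori_gen(L3)
```

The join also produces (1, 3, 4, 5), but the prune step deletes it because (1, 4, 5) and (3, 4, 5) are not large, leaving C4 = {(1, 2, 3, 4)}.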

  14. Ex: Generation of Candidate Itemsets: generating C4 from L3.
Join step: given L3 = {{1,2,3}, {1,2,4}, {1,3,4}, {1,3,5}, {2,3,4}}, the candidate 4-itemsets are {{1,2,3,4}, {1,3,4,5}}.
Prune step: the 3-subsets of {1,2,3,4} are {{1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}}, all in L3; the 3-subsets of {1,3,4,5} are {{1,3,4}, {1,3,5}, {1,4,5}, {3,4,5}}, and since {1,4,5} and {3,4,5} ∉ L3, {1,3,4,5} is pruned.
C4 = {{1,2,3,4}}.

  15. Data Structure for Ck. A hash tree is built over the candidate set at each level. Ex: the hash tree for C2 = {{A,B}, {A,C}, {A,T}, {B,C}, {B,D}, {C,D}}: level 1 holds the interior nodes A, B, C; level 2 holds the leaf nodes {A,B}, {A,C}, {A,T} under A, {B,C}, {B,D} under B, and {C,D} under C.

  16. Ex: Generating the hash table H2 and the candidate 2-itemsets C2 (DHP).
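The DHP hash-filtering idea can be sketched as follows (illustrative only: the bucket count and hash function are arbitrary choices of this sketch, and items are small integers so Python's built-in tuple hash is deterministic):

```python
from itertools import combinations

def dhp_c2(transactions, minsup, nbuckets=8):
    """DHP sketch: while counting 1-items, hash every 2-itemset of each
    transaction into a bucket.  A pair becomes a candidate in C2 only if
    both its items are large AND its bucket count reaches minsup, which
    filters out many pairs before the second database scan."""
    item_cnt, buckets = {}, [0] * nbuckets
    for t in transactions:
        t = sorted(set(t))
        for i in t:
            item_cnt[i] = item_cnt.get(i, 0) + 1
        for pair in combinations(t, 2):
            buckets[hash(pair) % nbuckets] += 1
    large = {i for i, c in item_cnt.items() if c >= minsup}
    return {pair for pair in combinations(sorted(large), 2)
            if buckets[hash(pair) % nbuckets] >= minsup}

# Slide 10's database with items A..E coded as 1..5:
C2 = dhp_c2([(1, 3, 4), (2, 3, 5), (1, 2, 3, 5), (2, 5)], minsup=2)
```

Bucket collisions can only add candidates, never lose one, so every truly frequent pair is guaranteed to survive the filter.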

  17. Counting support in a hash tree: example of L2 and D3 (DHP), minimum support s = 2.
TID 100 (A C D): {A C} → transaction discarded for D3.
TID 200 (B C E): {B C} {B E} {C E} → keep <B C E>.
TID 300 (A B C E): {A C} {B C} {B E} {C E} → keep <B C E>.
TID 400 (B E): {B E} → discarded.
D3 = {<200, B C E>, <300, B C E>}.
C2 counts: {A C}:2 {B C}:2 {B E}:3 {C E}:2 → L2 = {{A C}, {B C}, {B E}, {C E}}.

  18. Generalized Association Rules: finding associations between items at any level of the taxonomy.
Taxonomy: Clothes → {Outerwear, Shirts}; Outerwear → {Jackets, Ski Pants}; Footwear → {Shoes, Hiking Boots}.
Rules: People who buy clothes tend to buy shoes. (✗) People who buy outerwear tend to buy shoes. (○) People who buy jackets tend to buy shoes. (✗)

  19. Problem Statement. Given I = {i1, i2, …, im}: a set of literals, D: a set of transactions, and T: a taxonomy, a DAG (Directed Acyclic Graph): a rule is X ⇒ Y [confidence, support], where X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X (X and Y may come from any level of taxonomy T). Steps: 1. Find all sets of items whose support is greater than the minimum support. 2. Generate association rules whose confidence is greater than the minimum confidence. 3. Prune all uninteresting rules from this set with respect to R-interestingness.

  20. Interestingness of Generalized Rules. Using the new interest measure, R-interesting, prunes out 40% to 60% of the rules as "redundant" rules.
Example. Assumptions: the taxonomy says Skim milk is-a Milk; Milk ⇒ Cereal holds with 8% support and 70% confidence; skim milk sales are 1/4 of milk sales.
Then for Skim milk ⇒ Cereal: expectation: 2% support, 70% confidence; actual support & confidence: about 2% support, 70% confidence ⇒ redundant & uninteresting!
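The expected values in this example are simple arithmetic over the taxonomy (a quick check using only the numbers given on this slide):

```python
# Milk => Cereal: 8% support, 70% confidence (given).
milk_support, milk_conf = 0.08, 0.70
skim_share = 0.25          # skim milk sales are 1/4 of milk sales

# If "Skim milk => Cereal" behaved exactly like its ancestor rule, its
# support would scale with skim milk's share of milk sales, while its
# confidence would be inherited unchanged:
expected_support = milk_support * skim_share   # 2%
expected_conf = milk_conf                      # 70%

# The actual values (~2%, 70%) match the expectation, so the specialized
# rule is redundant under the R-interesting measure and gets pruned.
```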

  21. Cyclic Association Rules. Beer and chips are sold together primarily between 6PM and 9PM. Association rules can also display regular hourly, daily, weekly, etc., variation that has the appearance of cycles. An association rule X ⇒ Y holds in time unit ti if the support of X ∪ Y in D[i] exceeds MinSup and the confidence of X ⇒ Y in D[i] exceeds MinConf. It has a cycle c = (l, o), with length l and offset o. "coffee ⇒ doughnuts" has a cycle (24, 7) if the unit of time is an hour and "coffee ⇒ doughnuts" holds during the interval 7AM-8AM every day (i.e., every 24 hours).
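Cycle membership is just modular arithmetic on the time-unit index (a sketch; the units are assumed to be hours numbered from 0 = midnight of the first day):

```python
def holds_in_cycle(time_unit, cycle):
    """A rule with cycle c = (l, o) holds in time unit i iff i mod l == o."""
    length, offset = cycle
    return time_unit % length == offset

# "coffee => doughnuts" with cycle (24, 7): holds at 7AM-8AM every day.
coffee_cycle = (24, 7)
```

Unit 7 (7AM of day 0) and unit 31 (7AM of day 1) match the cycle; unit 8 does not.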

  22. Negative Association Rules. A rule: "60% of the customers who buy potato chips do not buy bottled water." A negative rule X ⇏ Y is one such that (a) support(X) and support(Y) are greater than the minimum support MinSup; and (b) the rule interest measure is greater than MinRI. The interest measure RI of a negative association rule X ⇏ Y is defined in terms of E[support(X)], the expected support of an itemset X.

  23. Incremental Updating, Parallel and Distributed Algorithms
• An incremental evaluation technique for mining association rules in databases (Kim et al., Proc. Korea Information Science Society Fall Conference, '95).
• Fast updating algorithms, FUP (Cheung et al., IEEE ICDE, '96).
• Partitioned derivation and incremental updating.
• PDM (Park et al., ACM CIKM, '95): uses a hashing technique (DHP-like) to identify candidate k-itemsets from the local databases.
• Count Distribution (Agrawal & Shafer, IEEE TKDE, Vol 8, No 6, '96): an extension of the Apriori algorithm; may require a lot of messages in count exchange.
• FDM (Cheung et al., IEEE TKDE, Vol 8, No 6, '96). Observation: if an itemset X is globally large, there exists a partition Di such that X and all its subsets are locally large at Di. Candidate sets are those which are also local candidates in some component database, plus some message-passing optimizations.

  24. When is Market Basket Analysis useful? The following three rules are examples of real rules generated from real data:
• On Thursdays, grocery store consumers often purchase diapers and beer together. (Useful rule: high-quality, actionable information.)
• Customers who purchase maintenance agreements are very likely to purchase large appliances. (Trivial rule.)
• When a new hardware store opens, one of the most commonly sold items is toilet rings. (Inexplicable rule.)

  25. Interestingness Measurement for Association Rules (I)
• Two popular measurements: support and confidence. The longer the itemset, the lower the support.
• Use taxonomy information for pruning redundant rules: a rule is "redundant" if its support and confidence are close to their expected values based on an ancestor of the rule. Example: "milk ⇒ cereal" vs. "skim milk ⇒ cereal". More effective than pruning based on statistical significance.
• Interestingness of patterns: if a pattern contradicts the set of hard beliefs of the user, then this pattern is always interesting to the user. The more a pattern "affects" the belief system, the more interesting it is.

  26. Interestingness Measurement (II)
• Improvement (interest): how much better a rule is at predicting the result than just assuming the result in the first place. It measures co-occurrence rather than implication, and it is symmetric.
• Conviction: how far "condition and result" deviates from independence.

  27. Range of measurement.
Improvement: = 1: the condition and result items are completely independent; < 1: worse rule; > 1: better rule.
Conviction: = 1: the condition and result items are completely unrelated; > 1: better rule; = ∞: completely related rule.
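Both measures can be computed directly from transaction counts (a sketch; `improvement` is the quantity also called lift, and conviction is taken to be ∞ when confidence is exactly 1):

```python
def improvement(n, n_cond, n_res, n_both):
    """P(cond & res) / (P(cond) * P(res)); 1 means independent items."""
    return (n_both / n) / ((n_cond / n) * (n_res / n))

def conviction(n, n_cond, n_res, n_both):
    """P(cond) * P(not res) / P(cond & not res); 1 means unrelated items."""
    confidence = n_both / n_cond
    if confidence == 1.0:
        return float("inf")            # completely related rule
    return (1 - n_res / n) / (1 - confidence)

# Slide 10's database (4 transactions): rule B => E has counts
# |B| = 3, |E| = 3, |B and E| = 3.
imp_BE = improvement(4, 3, 3, 3)       # 4/3 > 1: better than independence
conv_BE = conviction(4, 3, 3, 3)       # infinity: B never occurs without E
```

For C ⇒ E in the same database (|C| = 3, |C and E| = 2), conviction is 0.75 < 1, marking it a worse-than-independent rule.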

  28. Sequential Patterns. Examples of such patterns:
• Customers typically rent "Star Wars", then "Empire Strikes Back", and then "Return of the Jedi". Note that these rentals need not be consecutive.
• Course registration: Tourism and Leisure (1st semester) → The Capital Region and Housing Problems (2nd semester) → The Stock Market (3rd semester).
• Stock price movement patterns: Samsung Electronics rises → LG Electronics rises → Bohae Brewery rises.
• Purchase patterns: suit → dress shirt → black shoes?
• Patterns in the order of disease occurrence in medical diagnosis.
• Treatment and medication patterns in patient care.

  29. Mining Sequential Patterns. An itemset is a non-empty set of items. A sequence is an ordered list of itemsets.
Customer Id / Customer Sequence:
1: <(30) (90)>
2: <(10 20) (30) (40 60 70)>
3: <(30 50 70)>
4: <(30) (40 70) (90)>
5: <(90)>
Sequential patterns with support > 25%: <(30) (90)> and <(30) (40 70)>.
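Sequence containment and the support computation for this table can be sketched as follows (a sketch: sequences are lists of tuples, and containment is the usual greedy subsequence match where each pattern itemset must be a subset of some later sequence itemset):

```python
def contains(seq, pattern):
    """True if pattern is a subsequence of seq: each element itemset of
    pattern is contained in some element itemset of seq, in order."""
    i = 0
    for elem in seq:
        if i < len(pattern) and set(pattern[i]) <= set(elem):
            i += 1
    return i == len(pattern)

def support(db, pattern):
    return sum(contains(s, pattern) for s in db) / len(db)

db = [
    [(30,), (90,)],                      # customer 1
    [(10, 20), (30,), (40, 60, 70)],     # customer 2
    [(30, 50, 70)],                      # customer 3
    [(30,), (40, 70), (90,)],            # customer 4
    [(90,)],                             # customer 5
]
```

Both patterns on the slide are supported by exactly 2 of the 5 customers, i.e. 40% > 25%; note that customer 2 supports <(30) (40 70)> because (40 70) is contained in (40 60 70).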

  30. The Algorithm for Sequential Patterns by Agrawal and Srikant, 1995 ICDE.
Sort phase: major key: customer-id; minor key: transaction-time.
Litemset phase: a litemset is an itemset with minimum support.
Transformation phase: each customer sequence is represented by a list of sets of litemsets.
Sequence phase (an application of the Apriori algorithm): candidate sequences ⇒ large sequences.
Maximal phase: a sequence s is maximal if s is not contained in any other sequence.

  31. Mining Path Traversal Patterns. Understanding user access patterns in a distributed information-providing environment such as the WWW, Hitel, etc. helps improve the system design and leads to better marketing decisions. Capturing user access patterns, by mining path traversal patterns, captures user traveling behavior and improves the quality of such services.

  32. Traversal patterns. [Figure: a traversal tree over pages A, B, C, D, E, G, H, O, U, V, W, with edges numbered 1-15 in traversal order.] Maximal forward references = {ABCD, ABEGH, ABEGW, AOU, AOV}. 1. Find large reference sequences. 2. Find maximal reference sequences.
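Step 1's input, the set of maximal forward references, is obtained by cutting the raw click sequence at every backward move (a sketch of the standard cut-at-backward-move procedure; the sample path below is an assumed traversal order consistent with the figure's references, not taken from the slide):

```python
def maximal_forward_references(path):
    """Cut a traversal path into maximal forward references: whenever the
    user moves backward to a page already on the current path, emit the
    forward reference accumulated so far."""
    refs, cur, forward = [], [path[0]], True
    for node in path[1:]:
        if node in cur:                  # backward move
            if forward:                  # first backward step ends a reference
                refs.append("".join(cur))
            while cur[-1] != node:       # backtrack to the revisited page
                cur.pop()
            forward = False
        else:                            # forward move
            cur.append(node)
            forward = True
    if forward:                          # path ended while moving forward
        refs.append("".join(cur))
    return refs

# An assumed click sequence consistent with the slide's figure:
mfr = maximal_forward_references(list("ABCDCBEGHGWGEBAOUOV"))
```

Consecutive backward moves (e.g. W → G → E → B → A) produce no extra references; only the first backward step after a forward run emits one.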

  33. Research Directions.
Association rule mining: research on sampling approaches, parallel methods, and distributed algorithms; data structures that manage candidate itemsets efficiently and make scanning effective; measuring the interestingness or importance of rules; concrete ways to apply association rules in practice.
Other patterns: problems in defining and applying patterns; similarity search; path traversal patterns on the WWW.

  34. Some Data Mining Systems and Homepages
• Quest (IBM Almaden: Agrawal, et al.): large DB-oriented association, classification, sequential patterns, similar sequences, etc. "http://www.almaden.ibm.com/cs/quest/"
• DBMiner (SFU: Han, et al.): interactive, multi-level characterization, classification, association & prediction. "http://db.cs.sfu.ca/DBMiner/"
• KDD (GTE: Piatetsky-Shapiro, et al.): multi-strategy, strong rules, statistical approaches, etc. KD Mine: "http://info.gte.com/~kdd/index.html"
• Other homepages for data mining: Rakesh Agrawal: "http://www.almaden.ibm.com/cs/people/ragrawal/"; Usama Fayyad: "http://www.research.microsoft.com/~fayyad/"; Heikki Mannila: "http://www.cs.Helsinki.FI/~mannila/"; Jiawei Han: "http://fas.sfu.ca/cs/people/Faculty/Han/"
• Editorial Board of the Data Mining and Knowledge Discovery Journal: "http://www.research.microsoft.com/research/datamine/"
