Bing Liu, Wynne Hsu, Yiming Ma

Integrating Classification and Association Rule Mining Bing Liu, Wynne Hsu, Yiming Ma Presented By: Salil Kulkarni Muhammad Talha

Introduction • Classification rule mining: • Discover a small set of rules in the database to form an accurate classifier • There is a pre-determined target • Association rule mining: • Find all rules in the database that satisfy some minimum support and minimum confidence • No pre-determined target

Associative Classification Framework • Aims to integrate the above data mining techniques efficiently while preserving the accuracy of the classifier • How? • The algorithm focuses only on those association rules whose right-hand side is restricted to the classification class attribute; these are referred to as CARs (class association rules) • CARs are generated based on the Apriori algorithm • Thus data mining in this framework involves three steps: • Discretizing continuous attributes • Generating all the CARs • Building a classifier based on the CARs

Contributions of the Framework • New way to build accurate classifiers • Applies association rule mining to classification tasks • Solve the understandability problem • Helps discover rules not discovered by existing classification systems • No need to load the database into memory

Definitions • Dataset D: A normal relational table with N cases described by l distinct attributes • Attributes can be continuous or categorical • An item: an (attribute, integer-value) pair • Datacase d: a set of items with a class label • Let I be the set of all items in D • Let Y be the set of all class labels in D

Definitions [contd.] • A datacase d contains X(a set of items), where X is a subset of I, if X is also a subset of d • CAR : Class association rule is an implication of the form X -> y, where X is a subset of I, and y belongs to Y

CBA-RG Algorithm (Basic Concepts) • Finds all ruleitems that satisfy the specified minsup condition • A ruleitem is represented as <condset, y>, where condset is a set of items and y is the class label • A k-ruleitem is a ruleitem that has k items in its condset • The support count of the condset (condsupCount) is the number of cases in the dataset that contain the condset • The support count of the ruleitem (rulesupCount) is the number of cases that contain the condset and are labeled with class y

CBA-RG Algorithm (Basic Concepts) [contd.] • Support(ruleitem) = (rulesupCount / |D|) * 100% • Confidence(ruleitem) = (rulesupCount / condsupCount) * 100% • Among all ruleitems that have the same condset, the ruleitem with the highest confidence is chosen as the possible rule (PR); in case of a tie, a ruleitem is selected at random • Each frequent ruleitem in the set of frequent ruleitems is of the form <(condset, condsupCount), (y, rulesupCount)>
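The counts and ratios above can be illustrated with a minimal Python sketch. The function name `ruleitem_stats`, the toy dataset, and the representation of a case as an (items, class label) pair are assumptions made for illustration. Support is taken here as rulesupCount / |D|, which matches the frequency test on rulesupCount in the CBA-RG pseudocode.

```python
def ruleitem_stats(condset, y, dataset):
    """Return (condsupCount, rulesupCount, support %, confidence %) for <condset, y>.

    dataset is a list of (items, class_label) pairs, where items is a frozenset
    of (attribute, integer-value) pairs. This layout is an illustrative assumption.
    """
    # condsupCount: cases that contain the condset
    cond_sup_count = sum(1 for items, _ in dataset if condset <= items)
    # rulesupCount: cases that contain the condset and are labeled with class y
    rule_sup_count = sum(
        1 for items, label in dataset if condset <= items and label == y
    )
    support = 100.0 * rule_sup_count / len(dataset)
    confidence = 100.0 * rule_sup_count / cond_sup_count if cond_sup_count else 0.0
    return cond_sup_count, rule_sup_count, support, confidence


# Hypothetical toy data: two attributes A and B, two classes
toy = [
    (frozenset({("A", 1), ("B", 1)}), "yes"),
    (frozenset({("A", 1), ("B", 2)}), "yes"),
    (frozenset({("A", 1), ("B", 1)}), "no"),
    (frozenset({("A", 2), ("B", 1)}), "no"),
]
stats = ruleitem_stats(frozenset({("A", 1)}), "yes", toy)
print(stats)
```

For the ruleitem <{(A, 1)}, yes>, three cases contain the condset and two of them are labeled yes, so the support is 50% and the confidence is about 66.7%.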

CBA-RG Algorithm • The algorithm first finds all frequent 1-ruleitems, denoted by F1 • From F1, the function genRules(F1) generates CAR1 • CAR1 is subjected to pruning; however, pruning is optional • While Fk-1 is non-empty, candidateGen(Fk-1) generates the candidate ruleitems Ck from the frequent ruleitems Fk-1 found in the (k-1)th pass over the data • The algorithm then scans the database and updates the condsupCount of every candidate contained in a data case. It also updates the rulesupCount if the class of the data case matches the class of the ruleitem

CBA-RG Algorithm
F1 = {large 1-ruleitems};
CAR1 = genRules(F1);
prCAR1 = pruneRules(CAR1);
for (k = 2; Fk-1 ≠ ∅; k++) do
    Ck = candidateGen(Fk-1);
    for each data case d ∈ D do
        Cd = ruleSubset(Ck, d);
        for each candidate c ∈ Cd do
            c.condsupCount++;
            if d.class = c.class then c.rulesupCount++
        end
    end
    Fk = {c ∈ Ck | c.rulesupCount ≥ minsup};
    CARk = genRules(Fk);
    prCARk = pruneRules(CARk);
end
CARs = ∪k CARk;
prCARs = ∪k prCARk;
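The level-wise loop can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the name `cba_rg`, the data layout, the use of an absolute count for minsup, and the deterministic tie-breaking (the slides specify a random choice) are all assumptions, and the optional pruneRules step is left out.

```python
from collections import defaultdict
from itertools import combinations


def cba_rg(dataset, minsup_count, minconf):
    """Return CARs as (condset, y, confidence, rulesupCount) tuples.

    dataset: list of (frozenset of items, class_label).
    minsup_count: minimum rulesupCount (absolute count, an assumption).
    minconf: minimum confidence as a fraction in [0, 1].
    """
    def count(candidates):
        # One pass over the data: update condsupCount and rulesupCount
        cond_counts, rule_counts = defaultdict(int), defaultdict(int)
        condsets = {condset for condset, _ in candidates}
        for items, label in dataset:
            for condset in condsets:
                if condset <= items:
                    cond_counts[condset] += 1
                    if (condset, label) in candidates:
                        rule_counts[(condset, label)] += 1
        frequent = {ri: n for ri, n in rule_counts.items() if n >= minsup_count}
        return frequent, cond_counts

    def gen_rules(frequent, cond_counts):
        # Per condset, keep the most confident class if it reaches minconf.
        # Ties keep the first ruleitem seen (the slides pick one at random).
        best = {}
        for (condset, y), rule_sup in frequent.items():
            conf = rule_sup / cond_counts[condset]
            if conf >= minconf and (condset not in best or conf > best[condset][2]):
                best[condset] = (condset, y, conf, rule_sup)
        return list(best.values())

    classes = {label for _, label in dataset}
    all_items = {item for items, _ in dataset for item in items}
    candidates = {(frozenset([i]), y) for i in all_items for y in classes}
    cars = []
    while candidates:
        frequent, cond_counts = count(candidates)
        cars.extend(gen_rules(frequent, cond_counts))
        # candidateGen: join frequent ruleitems of the same class that differ
        # in one item, keeping only those whose sub-ruleitems are all frequent
        candidates = set()
        for (c1, y1), (c2, y2) in combinations(frequent, 2):
            union = c1 | c2
            if y1 == y2 and len(union) == len(c1) + 1:
                if all((union - {i}, y1) in frequent for i in union):
                    candidates.add((union, y1))
    return cars


toy = [
    (frozenset({("A", 1), ("B", 1)}), "yes"),
    (frozenset({("A", 1), ("B", 1)}), "yes"),
    (frozenset({("A", 1), ("B", 2)}), "no"),
    (frozenset({("A", 2), ("B", 1)}), "no"),
]
cars = cba_rg(toy, minsup_count=2, minconf=0.6)
for condset, y, conf, rule_sup in cars:
    print(sorted(condset), "->", y, round(conf, 3), rule_sup)
```

On this toy data the sketch yields three rules, all predicting yes: one for (A, 1), one for (B, 1), and a 2-ruleitem rule for the pair with confidence 1.0.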

Building the Classifier • Definition: Total order on rules. Given two rules ri and rj, ri ≻ rj (also read as "ri has higher precedence than rj") if • ri has higher confidence than rj, or • Their confidences are the same but the support of ri is greater than that of rj, or • Their confidences and supports are the same, but ri was generated earlier than rj
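The total order maps naturally onto a sort key. The sketch below is only an illustration: the `Rule` class, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    confidence: float
    support: float
    generated: int  # position in generation order (smaller = earlier)


def precedence_key(rule):
    # Higher confidence first, then higher support, then earlier generation
    return (-rule.confidence, -rule.support, rule.generated)


rules = [
    Rule("r1", 0.80, 0.10, generated=0),
    Rule("r2", 0.90, 0.05, generated=1),
    Rule("r3", 0.90, 0.20, generated=2),
    Rule("r4", 0.90, 0.20, generated=3),
]
ranked = [r.name for r in sorted(rules, key=precedence_key)]
print(ranked)
```

Here r3 and r4 tie on confidence and support, so the earlier-generated r3 comes first; r2 follows because of its lower support, and r1 is last because of its lower confidence.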

Building the Classifier • Basic Idea: • Choose a set of high precedence rules from the set of all generated rules to cover the dataset D • Format of Classifier: • <r1, r2, …, rn, default_class> where ri ∈ R, and ra ≻ rb if b > a
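A classifier in this format can be applied to a new case by taking the first rule, in precedence order, whose conditions the case satisfies, and falling back to the default class otherwise. A minimal sketch, where the function name `classify` and the data layout are assumptions:

```python
def classify(rules, default_class, items):
    """rules: list of (condset, label) already sorted by precedence."""
    for condset, label in rules:
        if condset <= items:  # the case satisfies the rule's conditions
            return label
    return default_class


# Hypothetical two-rule classifier
rules = [
    (frozenset({("A", 1), ("B", 1)}), "yes"),
    (frozenset({("A", 1)}), "no"),
]
prediction_first = classify(rules, "no", frozenset({("A", 1), ("B", 1)}))
prediction_second = classify(rules, "no", frozenset({("A", 1), ("B", 2)}))
prediction_default = classify(rules, "maybe", frozenset({("A", 2), ("B", 2)}))
print(prediction_first, prediction_second, prediction_default)
```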

Building the Classifier • Stages in the naïve version of the classifier builder (M1): • Stage 1: Sort the rules in R according to the precedence relation. Purpose: ensure that the rules with the highest precedence are chosen for the classifier • Stage 2: For every r in R in the sorted sequence • Go through every case d in the dataset • If r covers d, i.e. d satisfies the conditions of r, then store d.id (the unique id of the data case) in a temporary set

Building the Classifier [contd.] • Stage 2 contd.: • If r correctly classifies at least one case, then r is marked, as it could be a potential rule in the final classifier. All data cases that are covered by r are then removed from the dataset • The majority class of the remaining training data is selected as the default class • The total number of errors is computed for the classifier • Halts when it runs out of rules or training cases

Building the Classifier [contd.] • Stage 3: • All rules that fail to improve the accuracy of the classifier are discarded • The first rule at which the least number of errors is recorded acts as the cut-off rule, and the rules after it are deleted from the classifier • The set of undiscarded rules, together with the default class associated with the last remaining rule, forms the classifier

R = sort(R);
for each rule r ∈ R in sequence do
    temp = ∅;
    for each case d ∈ D do
        if d satisfies the conditions of r then
            store d.id in temp and mark r if it correctly classifies d;
    if r is marked then
        insert r at the end of C;
        delete all the cases with the ids in temp from D;
        select a default class for the current C;
        compute the total number of errors of C;
    end
end
Find the first rule p in C with the lowest total number of errors and drop all the rules after p in C;
Add the default class associated with p to the end of C, and return C (our classifier).
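The M1 procedure can be sketched in Python roughly as follows. The function name `build_classifier_m1`, the rule tuple layout, and the fallback default class used when no training cases remain are assumptions of this sketch; the dataset is assumed to be non-empty.

```python
from collections import Counter


def build_classifier_m1(rules, dataset):
    """rules: list of (condset, label, confidence, support) in generation order.
    dataset: non-empty list of (items, class_label).
    Returns (selected_rules, default_class)."""
    # Assumption: if no cases remain, fall back to the overall majority class
    overall_default = Counter(cls for _, cls in dataset).most_common(1)[0][0]
    # Stage 1: sort by precedence (confidence, support, generation order)
    order = sorted(range(len(rules)),
                   key=lambda i: (-rules[i][2], -rules[i][3], i))
    remaining = list(dataset)
    entries = []      # (rule, default_class, total_errors)
    rule_errors = 0   # errors made by the rules selected so far
    # Stage 2: select rules in sequence
    for i in order:
        if not remaining:
            break     # ran out of training cases
        condset, label = rules[i][0], rules[i][1]
        covered = [case for case in remaining if condset <= case[0]]
        if not any(cls == label for _, cls in covered):
            continue  # r is not marked: it classifies no case correctly
        rule_errors += sum(1 for _, cls in covered if cls != label)
        remaining = [case for case in remaining if not condset <= case[0]]
        counts = Counter(cls for _, cls in remaining)
        default = counts.most_common(1)[0][0] if counts else overall_default
        default_errors = len(remaining) - counts[default]
        entries.append((rules[i], default, rule_errors + default_errors))
    if not entries:
        return [], overall_default
    # Stage 3: cut at the first rule with the lowest total number of errors
    best = min(range(len(entries)), key=lambda j: entries[j][2])
    return [e[0] for e in entries[:best + 1]], entries[best][1]


toy = [
    (frozenset({("A", 1), ("B", 1)}), "yes"),
    (frozenset({("A", 1), ("B", 1)}), "yes"),
    (frozenset({("A", 1), ("B", 2)}), "no"),
    (frozenset({("A", 2), ("B", 1)}), "no"),
]
r0 = (frozenset({("A", 1), ("B", 1)}), "yes", 1.0, 0.5)
r1 = (frozenset({("A", 1)}), "yes", 2 / 3, 0.5)
r2 = (frozenset({("B", 1)}), "yes", 2 / 3, 0.5)
classifier = build_classifier_m1([r1, r2, r0], toy)
print(classifier)
```

On the toy data only r0 is kept: once it has covered the two yes cases, r1 and r2 cover only no cases and are never marked, and the default class no handles the rest with zero errors.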

Building the Classifier[contd.] • Two main conditions satisfied by the algorithm • Condition 1: Each data case is covered by the highest precedence rule among all the rules that can cover the case • Condition 2: Every rule in C correctly classifies at least one training case

Performance concern • M1 is a simple algorithm, but it makes multiple passes over the dataset • For a large dataset resident on the disk, it may be very inefficient to use M1 • Next, the authors propose a version of the algorithm that takes slightly more than 1 pass over the dataset

CBA-CB M2 • An improved version of M1 algorithm • M1 makes one pass over the remaining data for each rule • M2 finds the best rule in R to cover each case • Only slightly more than one pass • M2 consists of three stages

Stage 1 • For each d ∈ D • Find the two highest precedence rules that cover d • cRule: the highest precedence rule that correctly classifies d • wRule: the highest precedence rule that wrongly classifies d • U: {set of all cRules} • classCasesCovered[class]: attribute of a rule, the number of cases it covers in each class

Stage 1 • Update cRule.classCasesCovered[d.class]++ • Add cRule to U: {set of all cRules} • If cRule ≻ wRule • Mark d to be covered by cRule (Condition 1) • Mark cRule to indicate that it classifies d correctly (Condition 2) • Add cRule to Q: {set of cRules that have higher precedence than their corresponding wRules} • Else • Store <d.id, d.class, cRule, wRule> • Add it to A: {set of the above data structures}
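Stage 1 can be sketched as a single pass over the data. In the sketch below, rules are identified by their rank in the precedence-sorted list (a smaller rank means higher precedence); the function name `m2_stage1` and the data layout are assumptions. The handling of a case that has a wRule but no cRule (it is recorded in A with cRule set to None) is also an assumption, since the slides do not cover that situation.

```python
from collections import Counter, defaultdict


def m2_stage1(sorted_rules, dataset):
    """sorted_rules: list of (condset, label) in precedence order.
    dataset: list of (items, class_label); the list index serves as d.id.
    Returns (U, Q, A, class_cases_covered)."""
    U, Q, A = set(), set(), []
    class_cases_covered = defaultdict(Counter)  # rule rank -> class -> count
    for d_id, (items, cls) in enumerate(dataset):
        c_rule = w_rule = None
        for rank, (condset, label) in enumerate(sorted_rules):
            if condset <= items:
                if label == cls and c_rule is None:
                    c_rule = rank   # highest precedence correct rule
                elif label != cls and w_rule is None:
                    w_rule = rank   # highest precedence wrong rule
            if c_rule is not None and w_rule is not None:
                break
        if c_rule is not None:
            U.add(c_rule)
            class_cases_covered[c_rule][cls] += 1
        if c_rule is not None and (w_rule is None or c_rule < w_rule):
            Q.add(c_rule)           # cRule has precedence: d is covered by it
        elif w_rule is not None:
            A.append((d_id, cls, c_rule, w_rule))  # resolved in Stage 2
    return U, Q, A, class_cases_covered


# Hypothetical rules, already sorted by precedence
sorted_rules = [
    (frozenset({("A", 1), ("B", 1)}), "yes"),
    (frozenset({("A", 1)}), "no"),
    (frozenset({("B", 1)}), "no"),
]
toy = [
    (frozenset({("A", 1), ("B", 1)}), "yes"),
    (frozenset({("A", 1), ("B", 2)}), "no"),
    (frozenset({("A", 2), ("B", 1)}), "no"),
    (frozenset({("A", 1), ("B", 1)}), "no"),
]
U, Q, A, class_cases_covered = m2_stage1(sorted_rules, toy)
print(U, Q, A)
```

Only the last case ends up in A: its cRule (rank 1) has lower precedence than its wRule (rank 0), so it has to be resolved in Stage 2.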

Stage 2 • Handle cases d not covered in Stage 1 • Second pass over the Database • Only slightly more than one pass • Determine all rules that classify the remaining data cases wrongly with higher precedence than cRule of d

Stage 2 • For each <d.id, d.class, cRule, wRule> ∈ A • If wRule is marked, i.e. it is the cRule of at least one data case (condition 2) • Mark d to be covered by wRule (condition 1) • wRule.classCasesCovered[d.class]++ • cRule.classCasesCovered[d.class]-- • wRule is already in Q because it is the cRule of some case • Else find all rules in U: {set of all cRules} that classify d wrongly and have higher precedence than cRule (scan D) • For each such rule w, • Store <d.id, d.class, cRule> in w.replace, since w may replace cRule to cover d • Update w.classCasesCovered[d.class]++ • Add w to Q

Stage 3 • Choose final set of rules for classifier C • Step 1: Choose set of potential rules to form classifier • Sort Q according to precedence (condition 1)

Stage 3: Step 1 • For each r ∈ Q • Discard any rule r that no longer classifies any case correctly (condition 2) • For each entry <cRule, d.id, d.class> in r.replace • If d.id is covered by a previous (higher precedence) rule then r does not replace cRule • Else replace cRule by r • r.classCasesCovered[d.class]++ • cRule.classCasesCovered[d.class]--

Stage 3: Step 1 • For each r ∈ Q (continued) • Compute ruleErrors • Number of errors made by the rules selected so far • Compute defaultClass • Majority class in the remaining data cases • Compute defaultError • totalErrors = ruleErrors + defaultError • Insert <r, defaultClass, totalErrors> at the end of C

Stage 3: Step 2 • Find the first rule p in C with the least totalErrors and discard all rules after p, since they only introduce more errors • Add the defaultClass of p to the end of C • Return the final classifier C without totalErrors

Empirical Evaluation • CBA was compared with the C4.5 classifiers (tree and rules) • 26 datasets from the UCI ML Repository were used • minconf was set to 50% • minsup has a strong impact on the accuracy of the classifier • If minsup is too high, rules with high confidence may be discarded and the CARs may fail to cover all cases • Experiments showed that with a minsup of 1-2%, the classifier built is more accurate than C4.5

Empirical Evaluation • In reported experiments, minsup set to 1% • Limit rules in memory to 80,000 • Continuous attributes discretized using the Entropy method

Results

Observations • CBA is superior to C4.5rules on 16 out of the 26 datasets • No notable difference between the results with and without rule pruning • M2 is much more efficient than M1

Two Important Results • In 16 of the 26 datasets, not all rules could be found • due to the 80,000 limit • the classifiers built are still quite accurate • When the limit reaches 60,000 in the 26 datasets, the accuracy of the resulting classifiers starts to stabilize • CBA was also run with the dataset cases on disk, increasing the number of cases by up to 32 times (e.g., to 160,000 cases) • The experimental results show that both CBA-RG and CBA-CB (M2) scale up linearly

Related Work • Several researchers have tried to build classifiers with extensive search • None of them use rule mining • CBA-CB is related to Michalski's approach, which • Finds the best rule for each class and removes the cases it covers • Is applied recursively until no cases are left • Uses heuristic search • Gave no encouraging results

Michalski vs. CBA-CB • Michalski: best rules are local, because cases are removed after a rule is found • Results not good • Local rules overfit the data • CBA-CB: best rules are global, because they are generated using all cases • Better results

Conclusion • Presents a new framework for constructing an accurate classifier based on class association rules.
