Association Rule Mining

Muhammad Ali Yousuf, http://dsc.itmorelia.edu.mx/~maliyusuf/, Ali_morelia@yahoo.com

Contents • Part (A): Maximal Frequent Itemsets, Generation of Association Rules • Part (B): AR Miner Software • Part (C): Horizontal vs. Vertical Database Layout • Part (D): Frequent Closed Itemsets

Rules: • 1. If an itemset (a set of items) is not frequent, no superset of it can be frequent. • 2. If an itemset is frequent, every subset of it is also frequent. • Now we define the concept of a "maximal frequent itemset": • A maximal frequent itemset is a frequent itemset that is not a proper subset of any other frequent itemset, hence the term maximal.

Because every subset of a maximal frequent itemset is frequent, the maximal frequent itemsets compactly represent all frequent itemsets. Modifying the Apriori algorithm to identify them directly would therefore be a significant optimisation. • The following example extracts the maximal frequent itemsets from a small database.

We have the following transactions:

    Trans Id   Items
    1          A C T W
    2          C D W
    3          A C T W
    4          A C D W
    5          A C D T W
    6          C D T

min_support = 50% (i.e. a pattern must occur in at least 3 of the 6 transactions to be considered frequent) • Now we generate the frequent itemset lattice bottom-up. (For simplicity, we have omitted the lines showing how each k-itemset, an itemset with k members, is formed.)

Explanation • We start from the empty set and move up level by level, retaining the itemsets that are frequent (at least 3 occurrences out of 6) and discarding the rest. • Following the Apriori algorithm, we discard the following itemsets because their support count is below 3: • AD and DT at level 2, and CDT at level 3.
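The level-wise search described above can be sketched in Python. This is a minimal illustration on the six example transactions, not the code of any particular tool; the names `support_count` and `frequent_itemsets_by_level` are made up for this sketch. Note that the prune step drops CDT before its support is even counted, because its subset DT is already known to be infrequent.

```python
from itertools import combinations

# The six example transactions (transaction id -> items).
transactions = {
    1: set("ACTW"),
    2: set("CDW"),
    3: set("ACTW"),
    4: set("ACDW"),
    5: set("ACDTW"),
    6: set("CDT"),
}
MIN_COUNT = 3  # min_support = 50% of 6 transactions


def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)


def frequent_itemsets_by_level():
    """Return a list of levels; level k holds the frequent k-itemsets."""
    items = sorted(set().union(*transactions.values()))
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i])) >= MIN_COUNT]
    levels = []
    while level:
        levels.append(level)
        # Join step: union pairs of frequent k-itemsets that differ in one item.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        known = set(level)
        candidates = {c for c in candidates
                      if all(frozenset(s) in known
                             for s in combinations(c, len(c) - 1))}
        # Count step: keep only the candidates that reach the minimum count.
        level = sorted((c for c in candidates if support_count(c) >= MIN_COUNT),
                       key=sorted)
    return levels


for k, itemsets in enumerate(frequent_itemsets_by_level(), start=1):
    print(k, ["".join(sorted(s)) for s in itemsets])
```

On this data the sketch keeps 5, 8, 5 and 1 itemsets at levels 1 to 4, with ACTW as the only frequent 4-itemset.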

Rule • At each level, whenever a frequent itemset is a superset of a frequent itemset at a lower level, we drop the lower-level itemset from the list of maximal candidates: it is still frequent, but it is already covered by its superset. • Hence, when generating the association rules, we consider all non-empty proper subsets of each remaining (maximal) frequent itemset.

Rule • For example, at level 3 the itemset "ACT" is formed by joining AC and AT, so we can rule out the itemsets "AC" and "AT" at the lower level. • If we continue this process, we are left with only two maximal frequent itemsets: ACTW and CDW.
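The elimination of covered subsets can be checked with a short Python sketch. For brevity it finds the frequent itemsets by brute force over all item combinations (an assumption made for this illustration, feasible only because there are five items) and then keeps those with no frequent proper superset. The function names are hypothetical.

```python
from itertools import combinations

# The six example transactions.
transactions = [set("ACTW"), set("CDW"), set("ACTW"),
                set("ACDW"), set("ACDTW"), set("CDT")]
MIN_COUNT = 3  # min_support = 50% of 6 transactions


def frequent_itemsets():
    """Brute force: test every non-empty combination of items."""
    items = sorted(set().union(*transactions))
    result = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            if sum(1 for t in transactions if itemset <= t) >= MIN_COUNT:
                result.append(itemset)
    return result


def maximal_itemsets(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return [x for x in frequent if not any(x < y for y in frequent)]


maximal = sorted("".join(sorted(m)) for m in maximal_itemsets(frequent_itemsets()))
print(maximal)  # ['ACTW', 'CDW']
```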

Rule • To summarise what we have done so far: • Step 1: Find the frequent itemsets. We have found the maximal frequent itemsets; every subset of these is frequent. • Step 2: Generate all possible association rules: • (A) Look at all frequent itemsets. • (B) For each frequent itemset X, look at every proper subset Y of X that is not empty.

Observation: Each frequent k-itemset has 2^k - 2 non-empty proper subsets, • so that many rules must be tested. • To test a rule Y ==> X - Y, check whether the confidence of the rule is >= min_confidence. • Let us take an example: • Let X = CDW, so k = 3. • Number of non-empty proper subsets = 2^3 - 2 = 6 • We generate all possible association rules with their confidence:
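As a sketch of this step, the Python below enumerates the six candidate rules Y ==> X - Y for X = CDW on the example transactions and computes each confidence as support(X) / support(Y). The helper names are hypothetical, and no min_confidence threshold is applied because the example does not fix one.

```python
from itertools import combinations

# The six example transactions.
transactions = [set("ACTW"), set("CDW"), set("ACTW"),
                set("ACDW"), set("ACDTW"), set("CDT")]


def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def rules_from(itemset):
    """All rules Y ==> X - Y for non-empty proper subsets Y of X."""
    x = frozenset(itemset)
    rules = []
    for k in range(1, len(x)):  # sizes 1 .. k-1 exclude the empty set and X
        for combo in combinations(sorted(x), k):
            y = frozenset(combo)
            confidence = support_count(x) / support_count(y)
            rules.append(("".join(sorted(y)), "".join(sorted(x - y)), confidence))
    return rules


rules = rules_from("CDW")
for antecedent, consequent, confidence in rules:
    print(f"{antecedent} ==> {consequent}  confidence = {confidence:.2f}")
```

On this data the confidences are 0.50 for C ==> DW, 0.75 for D ==> CW, 0.60 for W ==> CD, 0.75 for CD ==> W, 0.60 for CW ==> D and 1.00 for DW ==> C. A rule is kept only if its confidence reaches the chosen min_confidence.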

Thus we have seen why it is useful to identify the maximal frequent itemsets and how to generate all possible association rules from them.

Introduction to AR Miner

Introduction to Data Storage • We will take a small example of supermarket data. • Suppose the items sold by a (very, very small) shop are green apples, red apples, oranges, bananas, and grapes. • Also suppose that this morning you had three customers: one bought green apples and grapes, one bought only oranges, and the last one bought oranges and grapes.

This activity can be represented in the .asc format as follows:

    1 green apples
    2 red apples
    3 oranges
    4 bananas
    5 grapes
    BEGIN_DATA
    1 5
    3
    3 5
    END_DATA

There are two distinct parts in this file. The first one contains a listing of all the items you can sell, in other words, all the items that could take part in a transaction. • This part is:

    1 green apples
    2 red apples
    3 oranges
    4 bananas
    5 grapes

The format is simple. Each line must consist of a positive number followed by a string (which can contain blank spaces). It is important that the numbers be assigned in increasing order starting from 1. • Empty lines are allowed in this section. This section enumerates all the entities described by the data, among which ARMiner will later look for association rules.

The second part consists of the actual data:

    BEGIN_DATA
    1 5
    3
    3 5
    END_DATA

In our case we had 3 transactions, each represented on a separate line. The first transaction involved green apples and grapes, which are represented by the numbers assigned in the first section, that is, 1 for green apples and 5 for grapes. • Note that this section must be enclosed between a BEGIN_DATA line and an END_DATA line. Anything appearing after the END_DATA line will be ignored. • Blank lines are allowed in this section.

Note that although the numbers appearing in each line are sorted, this is not required by the format. • You can list the numbers in any order and the file can still be processed correctly. However, we suggest always listing the numbers of a transaction in increasing order, so that asc2db can process the file more efficiently.
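To make the layout concrete, here is a minimal Python sketch of a reader for the format as described above. It is not ARMiner's asc2db, and `parse_asc` is a hypothetical name; the sketch only follows the rules stated in the text (item lines of the form "number name", blank lines allowed, data enclosed between BEGIN_DATA and END_DATA, everything after END_DATA ignored).

```python
SAMPLE = """\
1 green apples
2 red apples
3 oranges
4 bananas
5 grapes
BEGIN_DATA
1 5
3
3 5
END_DATA
"""


def parse_asc(text):
    """Return (items, transactions) read from .asc-style text."""
    items = {}          # item number -> item name
    transactions = []   # one sorted list of item numbers per transaction
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                      # blank lines are allowed
        if line == "BEGIN_DATA":
            in_data = True
        elif line == "END_DATA":
            break                         # anything after END_DATA is ignored
        elif in_data:
            # Sorting is optional in the format; it is done here for tidiness.
            transactions.append(sorted(int(tok) for tok in line.split()))
        else:
            number, name = line.split(maxsplit=1)
            items[int(number)] = name
    return items, transactions


items, transactions = parse_asc(SAMPLE)
for transaction in transactions:
    print([items[n] for n in transaction])
```

For the sample file this yields the three transactions of the shop example: green apples with grapes, oranges alone, and oranges with grapes.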

Census data

    SSN#   Age   Sex   Married   Num_kids   Income
    006    26    M     No        0          25000$
    345    54    F     Yes       2          55000$
    743    37    M     Yes       1          80000$

What Can You Do With It? Let's Look at Each Column: • SSN#: this is unique for each entry. There is no sense in looking for association rules involving SSN#, at least not in this data, since each SSN# appears only once in the whole data set. So we can simply ignore this field for mining purposes.

• Age: this attribute can take a variety of values. ARMiner cannot handle such attributes easily; in fact, it only considers binary attributes. We need to discretize this attribute, replacing for example ages 0-21 with "very young age", 22-35 with "young age", 36-55 with "middle age", etc.

• Sex: this has two values, "male" and "female", so we could create two attributes out of it.

• Married: again we can create two attributes: "married" and "not married".

• Num_kids: this also has to be discretized, perhaps into "no kids", "one kid", and "several kids".

• Income: we could also discretize this into "small", "average", and "high".

• The discretization should be chosen so that it clearly identifies the ranges that are of interest to the person who will mine this data.

• With these changes we could represent the above data in .asc format as:

    1 very young age
    2 young age
    3 middle age
    4 old age
    5 male
    6 female
    7 married
    8 not married
    9 no kids
    10 one kid
    11 several kids
    12 small income
    13 average income
    14 high income
    BEGIN_DATA
    2 5 8 9 12
    3 6 7 11 13
    3 5 7 10 14
    END_DATA
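The mapping from a census row to a line of item numbers can be sketched as follows. The function `encode` is hypothetical, and several cut-offs are assumptions made only for this illustration: the text gives no numeric income ranges, so the thresholds 40000 and 70000 were picked to reproduce the three encoded rows, "several kids" is taken to mean two or more, and ages above 55 are taken to be "old age".

```python
def encode(age, sex, married, num_kids, income):
    """Map one census record to the item numbers of the .asc item list."""
    ids = []
    # Age: 1 very young, 2 young, 3 middle, 4 old (upper bound 55 is assumed).
    ids.append(1 if age <= 21 else 2 if age <= 35 else 3 if age <= 55 else 4)
    # Sex: 5 male, 6 female.
    ids.append(5 if sex == "M" else 6)
    # Married: 7 married, 8 not married.
    ids.append(7 if married else 8)
    # Num_kids: 9 no kids, 10 one kid, 11 several kids (assumed: two or more).
    ids.append(9 if num_kids == 0 else 10 if num_kids == 1 else 11)
    # Income: 12 small, 13 average, 14 high (thresholds are assumptions).
    ids.append(12 if income < 40000 else 13 if income < 70000 else 14)
    return ids


census = [
    (26, "M", False, 0, 25000),
    (54, "F", True, 2, 55000),
    (37, "M", True, 1, 80000),
]
encoded = [encode(*row) for row in census]

print("BEGIN_DATA")
for ids in encoded:
    print(" ".join(str(i) for i in ids))
print("END_DATA")
```

The SSN# column is left out of the records, as discussed above. With the assumed cut-offs, the three rows encode to 2 5 8 9 12, 3 6 7 11 13 and 3 5 7 10 14, matching the data section.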

How To Start AR Miner • For a quick start, first launch the server: • java -jar Server.jar • and then launch the client application: • java -jar Client.jar • You can log in as admin with the password renimra.

Support and Confidence • Support: The fraction of the transactions T in database D that contain an itemset X is called the support of X, supp(X) = |{T ∈ D | X ⊆ T}| / |D|

• The support of a rule X => Y is defined as supp(X => Y) = supp(X ∪ Y)

• Confidence: The confidence of this rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X)
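These definitions translate directly into code. The following Python sketch applies them to the three supermarket transactions from the data storage example (1 = green apples, 3 = oranges, 5 = grapes); the function names `supp` and `conf` simply mirror the notation above, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Database D: the three transactions of the small shop example.
D = [{1, 5}, {3}, {3, 5}]


def supp(x, db):
    """supp(X) = |{T in D | X is a subset of T}| / |D|"""
    x = set(x)
    return Fraction(sum(1 for t in db if x <= t), len(db))


def conf(x, y, db):
    """conf(X => Y) = supp(X u Y) / supp(X)"""
    return supp(set(x) | set(y), db) / supp(x, db)


print(supp({3}, D))         # 2/3: oranges appear in two of three transactions
print(supp({3, 5}, D))      # 1/3: oranges and grapes appear together once
print(conf({3}, {5}, D))    # 1/2: half of the orange buyers also bought grapes
```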
