
Mining Approximate Frequent Itemsets in the Presence of Noise


Presentation Transcript


  1. Mining Approximate Frequent Itemsets in the Presence of Noise • By: J. Liu, S. Paulsen, X. Sun, W. Wang, A. Nobel and J. Prins • Presentation by: Apurv Awasthi

  2. Title Statement • This paper introduces an approach to noise-tolerant frequent itemset mining on the binary matrix representation of a database

  3. Index • Introduction to Frequent Itemset Mining • Frequent Itemset Mining • Binary Matrix Representation Model • Problems • Motivation • Proposed Model • Proposed Algorithm • AFI Mining vs. Exact Frequent Itemset Mining • Related Works • Experimental Results • Discussion • Conclusion

  4. Introduction to Frequent Itemset Mining • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set • Originally developed to discover association rules • Applications • Bio-molecular applications: • DNA sequence analysis, protein structure analysis • Business applications: • Market basket analysis, sale campaign analysis

  5. The Binary Matrix Representation Model • Model for representing relational databases • Rows correspond to objects • Columns correspond to attributes of the objects • ‘1’ indicates presence • ‘0’ indicates absence • Frequent Itemset mining is a key technique for analyzing such data • Apply Apriori algorithm
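A minimal sketch of this representation (the transactions and item names below are made up for illustration): rows correspond to objects (transactions), columns to attributes (items), and a 1 marks presence.

```python
# Minimal sketch: encoding a transaction database as a binary matrix.
# The transactions and item names are made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

# Columns correspond to attributes (items), rows to objects (transactions).
items = sorted(set().union(*transactions))

# 1 indicates presence of an item in a transaction, 0 indicates absence.
matrix = [[1 if item in t else 0 for item in items] for t in transactions]

for row in matrix:
    print(row)
```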

  6. Problem with Frequent Itemset Mining • The traditional model for mining frequent itemsets requires that every item must occur in each supporting transaction • NOT a practical assumption! • Real data is typically subject to noise • Reasons for noise • Human error • Measurement error • Vagaries of human behavior • Stochastic nature of studied biological behavior

  7. Effect of Noise • Fragmentation of Patterns by Noise • Discover multiple small fragments of the true itemset • Miss the true itemset itself! • Example • Exact frequent itemset mining algorithm will miss the main itemset ‘A’ • Observe three fragmented itemsets – Itemset 1,2 and 3 • Fragmented itemsets may not satisfy the minimum support criteria and will therefore be discarded

  8. Mathematical Proof of Fragmentation • From: “Significance and Recovery of Block Structures in Binary Matrices with Noise” by X. Sun & A.B. Nobel • With probability 1, M(Y) ≤ 2 log_a(n) − 2 log_a(log_a(n)) when n is sufficiently large, i.e. in the presence of noise only a fraction of the initial block of 1s can be recovered • Where: matrix X contains the actual values recorded in the absence of any noise; matrix Z is a binary noise matrix whose entries are independent Bernoulli random variables with Z ~ Bern(p), 0 ≤ p ≤ 0.5; Y = X xor Z; M(Y) is the largest k such that Y contains k transactions having k common items; and a = (1 − p)^(−1)

  9. Motivation • The failure of classical frequent itemset mining to detect simple patterns in the presence of random errors (i.e. noise) compromises the ability of these algorithms to detect associations, cluster items, or build classifiers when such errors are present

  10. Possible Solutions • Let the matrix contain a small fraction of 0s • DRAWBACK: “free riders” such as column h (in matrix C) and row 6 (in matrix B) • SOLUTION: limit the number of 0s in each row and column

  11. Proposed Model • Use Approximate Frequent Itemsets (AFI) • AFI characteristics • The sub-matrix contains a large fraction of 1s • Each supporting transaction should contain most of the items, i.e. the fraction of 0s in every row must fall below a user-defined threshold (єr) • Each supporting item should occur in most of the transactions, i.e. the fraction of 0s in every column must fall below a user-defined threshold (єc) • Number of rows > minimum support

  12. AFI • Mathematical definition • For a given binary matrix D with item set I0 and transaction set T0, an itemset I ⊆ I0 is an approximate frequent itemset AFI(єr, єc) if there exists a set of transactions T ⊆ T0 with |T| ≥ |T0| · minsup such that, in the sub-matrix restricted to T and I, the fraction of 0s in every row is at most єr and the fraction of 0s in every column is at most єc • A weak AFI(є) is defined similarly, requiring only that the overall fraction of 0s in the sub-matrix is at most є
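As a concrete reading of this definition, here is a minimal sketch of the AFI(єr, єc) membership test; the function name, the sub-matrix representation (a list of 0/1 rows restricted to the candidate transactions and items) and parameter names are illustrative assumptions, not the paper's code.

```python
# Sketch of the AFI(eps_r, eps_c) membership test from the definition above.
# `sub` is the sub-matrix of D restricted to the candidate transactions T
# and items I (rows = transactions, columns = items); names are illustrative.
def is_afi(sub, eps_r, eps_c, n_total_transactions, minsup):
    n_rows, n_cols = len(sub), len(sub[0])

    # Support condition: |T| >= |T0| * minsup
    if n_rows < n_total_transactions * minsup:
        return False

    # Row condition: fraction of 0s in every supporting transaction <= eps_r
    for row in sub:
        if row.count(0) > eps_r * n_cols:
            return False

    # Column condition: fraction of 0s in every supporting item <= eps_c
    for j in range(n_cols):
        zeros = sum(1 for row in sub if row[j] == 0)
        if zeros > eps_c * n_rows:
            return False

    return True
```

For instance, with єr = єc = 1/3 and minsup = 0.5 over 8 transactions, a 4×3 sub-matrix with at most one 0 per row and per column passes this test.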

  13. AFI example • A, B and C are weak AFI (0.25) • A: valid AFI(0.25,0.25) • B: weak AFI(*,0.25) • C: weak AFI(0.25,*)

  14. Drawback of AFI • The AFI criteria violate the Apriori property! • Apriori property: all sub-itemsets of a frequent itemset must be frequent • But a sub-itemset of an AFI need not be an AFI, e.g. A is a valid AFI for minSupport = 4, but {b,c,e}, {b,c,d}, etc. are not valid AFIs • PROBLEM – minimum support can no longer be used as a pruning technique • SOLUTION – a generalization of the Apriori property for noisy conditions (called Noise-Tolerant Support Pruning)

  15. Proposed Model • Use Approximate Frequent Itemsets (AFI) • Noise-Tolerant Support Pruning – to prune and generate candidate itemsets • 0/1 Extension – to count the support of a noise-tolerant itemset based on the support sets of its sub-itemsets

  16. Noise Tolerant Support Pruning • For a given єr, єc and minsup, a noise-tolerant pruning support threshold is derived for each itemset length k; length-k itemsets whose support falls below this threshold cannot be sub-itemsets of a valid AFI and can therefore be pruned • Proof in the extra slides

  17. 0/1 Extensions • Starting from singleton itemsets, generate (k+1)-itemsets from k-itemsets in a sequential manner • The number of 0s allowed per supporting transaction grows with the length of the itemset in a discrete manner (⌊k·єr⌋ 0s are allowed for a length-k itemset) • 1 Extension: if ⌊(k+1)·єr⌋ = ⌊k·єr⌋, then the transaction set of a (k+1)-itemset I is the intersection of the transaction sets of its length-k subsets • 0 Extension: if ⌊(k+1)·єr⌋ = ⌊k·єr⌋ + 1, then the transaction set of a (k+1)-itemset I is the union of the transaction sets of its length-k subsets • Proof in the extra slides
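A small sketch of how the 0/1 extension rule above can be applied when growing an itemset from length k to k+1, assuming the per-transaction zero allowance for a length-k itemset is ⌊k·єr⌋; the function and variable names are illustrative, not the paper's implementation.

```python
from functools import reduce
from math import floor

# Sketch of 0/1 extension, assuming the per-transaction zero allowance for a
# length-k itemset is floor(k * eps_r), as described above. Names illustrative.
def extend_transaction_set(k_subset_transaction_sets, k, eps_r):
    """Combine the transaction sets of the length-k subsets of a (k+1)-itemset."""
    if floor((k + 1) * eps_r) == floor(k * eps_r):
        # 1-extension: the allowance does not grow, so a supporting transaction
        # must support every length-k subset -> intersection.
        return reduce(set.intersection, k_subset_transaction_sets)
    else:
        # 0-extension: one more zero is allowed per transaction, so it suffices
        # to support some length-k subset -> union.
        return reduce(set.union, k_subset_transaction_sets)
```

For example, growing a 3-itemset to a 4-itemset with єr = 0.25 gives ⌊4·0.25⌋ = 1 > ⌊3·0.25⌋ = 0, so this step is a 0-extension and the transaction sets are combined by union.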

  18. Proposed Algorithm

  19. AFI vs. Exact Frequent Itemset • AFI Mining: єr = єc = 1/3; n = 8; minsup = 1

  20. AFI vs. Exact Frequent Itemset • Exact Frequent Itemset Mining: MinSup = 0.5, i.e. 4 of n = 8 transactions • 1-candidates → frequent 1-itemsets → 2-candidates → frequent 2-itemsets

  21. AFI vs. Exact Frequent Itemset - Result

  22. Related Works • Yang et al. (2001) proposed two error-tolerant models, termed weak error-tolerant itemsets (weak ETI, equivalent to weak AFI) and strong ETI (equivalent to AFI(єr, *)) • DRAWBACKS: no efficient pruning technique – they rely on heuristics and sampling techniques; they do not preclude columns of all 0s • Steinbach et al. (2004) proposed the “support envelope”, a tool for exploration and visualization of the high-level structures of association patterns; a symmetric ETI model is proposed in which the same fraction of errors is allowed in both rows and columns • DRAWBACKS: uses the same error coefficient for rows and columns, i.e. єr = єc; admits only a fixed number of 0s in the itemset, so the fraction of noise does not vary with the size of the itemset sub-matrix

  23. Related Works • Seppänen and Mannila (2004) proposed mining dense itemsets in the presence of noise, where a dense itemset is an itemset with a sufficiently large sub-matrix that exceeds a given density threshold of attributes present • DRAWBACKS: enforces the constraint that all sub-itemsets of a dense itemset must be frequent, so it fails to identify larger itemsets that have sufficient support when not all of their sub-itemsets do; requires repeated scans of the database

  24. Experimental Results – Scalability • Database of 10,000 transactions and 100 items • Run time increases as noise tolerance increases • Reducing the item-wise error constraint leads to a greater reduction in run time than reducing the transaction-wise error constraint

  25. Experimental Results – Synthetic Data • Quality Testing for single cluster • Create data with an embedded pattern • Add noise by flipping each entry with probability p where 0.01 ≤ p ≤ 0.2
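A minimal sketch of the noise-injection step described above, assuming a NumPy 0/1 data matrix; the matrix shape, random seed and noise level are illustrative choices, not the paper's setup.

```python
import numpy as np

# Sketch of the noise-injection step: flip each entry of a binary data matrix
# independently with probability p (0.01 <= p <= 0.2 in the experiments above).
rng = np.random.default_rng(0)

data = np.ones((100, 20), dtype=int)        # illustrative embedded pattern of 1s
p = 0.1                                     # illustrative noise level
flips = rng.random(data.shape) < p          # True where an entry gets flipped
noisy = np.where(flips, 1 - data, data)     # XOR the data with the noise mask
```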

  26. Experimental Results – Synthetic Data • Quality Testing for multiple clusters • Create data with multiple embedded patterns • Add noise by flipping each entry with probability p where 0.01 ≤ p ≤ 0.2

  27. Experimental Results – Real World Data • Zoo Data Set • The database contains 101 instances and 18 attributes • All instances are classified into 7 classes, e.g. mammals, fish, etc.

  28. Discussion • Advantages • Flexibility of placing constraints independently along rows and columns • Generalized Apriori technique for pruning • Avoids repeated scans of the database by using 0/1 extension

  29. Summary • The paper outlines an algorithm for mining approximate frequent itemsets from noisy data • It introduces an AFI model and a generalized Apriori property for pruning • The proposed algorithm generates more useful itemsets than existing algorithms and is also computationally more efficient

  30. Thank You!

  31. Extra Slides for Questionnaire

  32. Applying the Apriori Algorithm • Database D, min_sup = 2 • Scan D and count 1-candidates → frequent 1-itemsets • Generate 2-candidates, scan D and count → frequent 2-itemsets • Generate 3-candidates, scan D → frequent 3-itemsets
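A minimal sketch of the level-wise flow on this slide (candidate generation, scanning D, counting, keeping itemsets with support ≥ min_sup = 2); the toy database is made up for illustration and the candidate-generation join is simplified compared to full Apriori.

```python
from itertools import combinations

# Sketch of the level-wise Apriori flow: generate k-candidates, scan D to
# count them, keep those meeting min_sup, then extend to (k+1)-candidates.
# The toy database D is made up for illustration; min_sup = 2 as on the slide.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

def support(itemset):
    """Count transactions in D that contain every item of the itemset."""
    return sum(1 for t in D if itemset <= t)

# Frequent 1-itemsets
items = sorted(set().union(*D))
frequent = [frozenset({i}) for i in items if support(frozenset({i})) >= min_sup]
k = 1
while frequent:
    print(f"Frequent {k}-itemsets:", [sorted(s) for s in frequent])
    # Generate (k+1)-candidates by joining frequent k-itemsets (simplified join)
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k + 1}
    # Scan D and keep only candidates meeting min_sup
    frequent = [c for c in candidates if support(c) >= min_sup]
    k += 1
```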

  33. Noise Tolerant Support Pruning - Proof

  34. 0/1 Extensions Proof • Number of zeroes allowed in an itemset grows with the length of the itemset
