A Tree-based Approach for Frequent Pattern Mining from Uncertain Data

Carson Kai-Sang Leung, Mark Anthony F. Mateo, and Dale A. Brajczuk • PAKDD 2008

Outline • Motivation • UF-Growth algorithm • Construction of the UF-Tree • Mining of Frequent Patterns from the UF-Tree • Improvements to the UF-Growth algorithm • Experimental Results • Conclusion

Motivation • Over the past decade, there have been numerous studies on mining frequent patterns from precise data. • However, there are situations in which users are uncertain about the presence or absence of some items.

UF-Growth Algorithm • The algorithm consists of two operations: • Construction of the UF-tree • Mining of frequent patterns from the UF-tree

Construction of the UF-Tree • minsup = 1 • [Figure: the UF-tree is built with two scans of the DB]
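The construction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `UFNode`, the function `insert_transaction`, the toy database, and the item order are all assumptions made for the example. It assumes each node stores an item, its expected support, and an occurrence count, and that a path is shared only when both the item and its expected support match.

```python
from dataclasses import dataclass, field


@dataclass
class UFNode:
    """One UF-tree node: an item, its expected support, and a count."""
    item: str
    exp_sup: float
    count: int = 0
    children: dict = field(default_factory=dict)  # keyed by (item, exp_sup)


def insert_transaction(root, transaction, order):
    """Insert one uncertain transaction, given as {item: expected support}."""
    node = root
    # Visit items in a fixed global order so that common prefixes line up.
    for item in sorted(transaction, key=order.index):
        key = (item, transaction[item])
        # A branch is reused only if item AND expected support both match.
        child = node.children.get(key)
        if child is None:
            child = UFNode(item, transaction[item])
            node.children[key] = child
        child.count += 1
        node = child


# Hypothetical toy database (not the one shown on the slide).
db = [
    {"a": 0.9, "d": 0.72, "e": 0.71875},
    {"a": 0.9, "d": 0.71875, "e": 0.72},
    {"a": 0.9, "d": 0.72, "e": 0.71875},
]
order = ["a", "d", "e"]  # assumed item order

root = UFNode(item="", exp_sup=1.0)
for t in db:
    insert_transaction(root, t, order)

a_node = root.children[("a", 0.9)]
print(a_node.count)          # 3
print(len(a_node.children))  # 2
```

All three transactions share the node for a:0.9, but the tree then splits into two branches because d appears with two different expected supports.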

Mining of Frequent Patterns from the UF-Tree • In the {e}-projected DB: • expSup({a,e}) = (1 * 0.72 * 0.9) + (2 * 0.71875 * 0.9) = 1.94175 • expSup({d,e}) = (1 * 0.72 * 0.71875) + (2 * 0.71875 * 0.72) = 1.5525 • Both values are at least minsup = 1, so {a,e} and {d,e} are frequent
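The arithmetic above can be reproduced with a short sketch. The two weighted paths below are read off the slide's own sums (one path with count 1, one with count 2); the layout of `projected_e` and the helper name `exp_sup` are assumptions for illustration only.

```python
import math

# {e}-projected DB as weighted paths: (count, {item: expected support}).
# Values are reconstructed from the arithmetic on the slide.
projected_e = [
    (1, {"e": 0.72, "a": 0.9, "d": 0.71875}),
    (2, {"e": 0.71875, "a": 0.9, "d": 0.72}),
]


def exp_sup(pattern, paths):
    """Sum over paths of count * product of the items' expected supports."""
    total = 0.0
    for count, probs in paths:
        if all(item in probs for item in pattern):
            total += count * math.prod(probs[item] for item in pattern)
    return total


minsup = 1
for pattern in (("a", "e"), ("d", "e")):
    value = exp_sup(pattern, projected_e)
    print(pattern, round(value, 5), value >= minsup)
```

Each pattern's expected support is a count-weighted sum of products, which is then compared against minsup.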

(Cont.) • In the {d,e}-projected DB, each of the 3 occurrences of {d,e} has expected support 0.5175 = 0.71875 * 0.72 • expSup({a,d,e}) = 3 * 0.5175 * 0.9 = 1.39725 • The frequent patterns are {a}, {a,d}, {a,d,e}, {a,e}, {b}, {b,c}, {c}, {d}, {d,e}, and {e}

Improvements to UF-Growth Algorithm • The UF-tree above may appear to require a large amount of memory, because a path is shared only when both the item and its expected support match • Improvement 1 • To increase the chance of path sharing, we discretize and round the expected support of each tree node to k decimal places
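The effect of rounding on path sharing can be illustrated with a small sketch. The two paths use the expected supports that appear in the slide arithmetic; their item order, the helper name `round_path`, and the choice k = 2 are assumptions for the example.

```python
def round_path(path, k):
    """Round every expected support on a path to k decimal places."""
    return tuple((item, round(p, k)) for item, p in path)


# Two branches that differ only in the fifth decimal place.
paths = [
    (("a", 0.9), ("d", 0.71875), ("e", 0.72)),
    (("a", 0.9), ("d", 0.72), ("e", 0.71875)),
]

distinct_exact = set(paths)
distinct_rounded = {round_path(p, 2) for p in paths}
print(len(distinct_exact), len(distinct_rounded))  # 2 1
```

With k = 2 the two branches collapse into one. The price is a small shift in the computed expected supports: for {a,e}, the rounded tree gives 3 * 0.72 * 0.9 = 1.944 instead of 1.94175.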

(Cont.) • The improved UF-growth does not need to build subsequent UF-trees for any non-singleton patterns. • Instead, for each tree path in the {e}-projected DB, it enumerates all subsets of the path and accumulates their expected supports. • After the first path, {a,e}, {a,d,e}, and {d,e} have expected supports 0.648, 0.46575, and 0.5175 so far. • After all paths, their accumulated expected supports are 1.94175, 1.39725, and 1.5525.
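A rough sketch of this subset enumeration follows. It is not the authors' code: the path representation and the function name `accumulate` are assumptions, and the path values are reconstructed from the slide arithmetic.

```python
import math
from collections import defaultdict
from itertools import combinations

# Paths of the {e}-projected DB:
# (count, expected support of e, {prefix item: expected support}).
paths_e = [
    (1, 0.72, {"a": 0.9, "d": 0.71875}),
    (2, 0.71875, {"a": 0.9, "d": 0.72}),
]


def accumulate(paths, suffix):
    """Enumerate every non-empty prefix subset per path and sum expSup."""
    totals = defaultdict(float)
    for count, suffix_sup, prefix in paths:
        items = sorted(prefix)
        for r in range(1, len(items) + 1):
            for subset in combinations(items, r):
                pattern = subset + (suffix,)
                product = math.prod(prefix[i] for i in subset)
                totals[pattern] += count * suffix_sup * product
    return dict(totals)


so_far = accumulate(paths_e[:1], "e")  # after the first path only
final = accumulate(paths_e, "e")       # after all paths
for pattern in final:
    print(pattern, round(so_far[pattern], 5), round(final[pattern], 5))
```

One pass over the paths yields the expected supports of {a,e}, {d,e}, and {a,d,e} without constructing a {d,e}-projected tree.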

Experimental Results

Conclusion • Improvement 1 (rounding the expected supports) may cause false positives.
