Explore the extraction of discriminative frequent patterns for feature construction from semi-structured data, aiming at enhancing classification accuracy. Learn how to identify good features using a model-based search tree and the challenges faced in finding predictive features with a high information gain. Discover how discriminative patterns can optimize the feature space and improve classification models. Uncover the techniques for direct mining and selection of these patterns to build compact and powerful feature sets. Dive into the complexities of computational issues, scalability, and transforming data spaces for efficient pattern mining.
Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
How to find good features from semi-structured raw data for classification
Wei Fan, Kun Zhang, Hong Cheng, Jing Gao, Xifeng Yan, Jiawei Han, Philip S. Yu, Olivier Verscheure
Feature Construction
• Most data mining and machine learning models assume the following structured data:
• (x1, x2, ..., xk) -> y
• where the xi's are independent variables and y is the dependent variable
• y drawn from a discrete set: classification
• y drawn from a continuous range: regression
• When the feature vectors are good, differences in accuracy among learners are small.
• Question: where do good features come from?
Frequent Pattern-Based Feature Extraction
• Data are not in pre-defined feature vectors:
• Transactions
• Biological sequences
• Graph databases
• Frequent patterns are good candidates for discriminative features. So, how do we mine them?
A Discovered Pattern
[Figure: a frequent sub-graph (FP) shared by compounds NSC 4960, NSC 699181, NSC 40773, NSC 164863, and NSC 191370; example borrowed from a George Karypis presentation]
Frequent Pattern Feature Vector Representation
[Figure: mined patterns feed a binary feature table that any classifier can consume: NN, DT (e.g., Petal.Length < 2.45 -> setosa; Petal.Width < 1.75 -> versicolor vs. virginica), SVM, LR, any classifier you can name]
        P1  P2  P3
Data1    1   1   0
Data2    1   0   1
Data3    1   1   0
Data4    0   0   1
...
• Mining these predictive features is an NP-hard problem: 100 examples can yield up to 10^10 patterns, and most are useless.
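The table above maps each example to a binary vector of pattern indicators. Below is a minimal sketch of that mapping; the toy transactions and patterns are hypothetical, not data from the slides.

```python
# Turn mined frequent patterns into binary features: column i is 1 iff the
# example contains pattern Pi. A sketch, not the authors' code.

def patterns_to_features(transactions, patterns):
    """transactions: list of item sets; patterns: list of mined item sets."""
    return [[1 if p <= t else 0 for p in patterns]   # p <= t: pattern is contained in the transaction
            for t in transactions]

# Hypothetical toy data mirroring the table above
transactions = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c"}, {"d"}]
patterns = [{"a"}, {"a", "b"}, {"d"}]                # P1, P2, P3
print(patterns_to_features(transactions, patterns))
# [[1, 1, 0], [1, 1, 1], [1, 0, 0], [0, 0, 1]]
```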
Example
• 192 examples
• 12% support (at least 12% of the examples contain the pattern): 8,600 patterns returned by itemset mining
• 192 examples vs. 8,600 patterns?
• 4% support: 92,000 patterns
• 192 vs. 92,000??
• Most patterns have no predictive power and cannot be used to construct features.
• Our algorithm
• Finds only 20 highly predictive patterns
• Can construct a decision tree with about 90% accuracy
Data in a "Bad" Feature Space
• Discriminative patterns are a non-linear combination of single features
• They increase the expressive and discriminative power of the feature space
• An example:
[Figure: binary points labeled 0/1 plotted over x and y]
• The data is not linearly separable in (x, y)
New Feature Space: Map Data to a Different Space
• Solving the problem: mine the itemset F: {x=0, y=0} (association rule F: x=0 => y=0), then transform the data by adding F as a new feature (Mine & Transform)
[Figure: the same points re-plotted in the (x, y, F) space]
• The data is linearly separable in (x, y, F)
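A small self-contained sketch of this idea with toy XOR-like points; the labels and the particular linear separator are illustrative choices, not taken from the paper.

```python
# After adding the pattern feature F = [x=0 AND y=0], a simple linear rule
# separates XOR-style labels that no line in (x, y) alone can separate.

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [1, 0, 0, 1]                       # not linearly separable in (x, y)

def mapped(p):
    x, y = p
    F = 1 if (x == 0 and y == 0) else 0     # indicator of the mined itemset {x=0, y=0}
    return (x, y, F)

# One linear separator in (x, y, F): predict 1 iff x + y + 2*F - 1.5 > 0
w, b = (1.0, 1.0, 2.0), -1.5
preds = [1 if sum(wi * vi for wi, vi in zip(w, mapped(p))) + b > 0 else 0
         for p in points]
print(preds == labels)                      # True: linearly separable in (x, y, F)
```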
Computational Issues
• Patterns are measured by their "frequency" or support, e.g., frequent subgraphs with sup >= 10% (at least 10% of the examples contain these patterns).
• "Ordered" enumeration: cannot enumerate patterns with sup = 10% without first enumerating all patterns with sup > 10%.
• NP-hard problem: easily up to 10^10 patterns for a realistic problem.
• Most patterns are non-discriminative, while low-support patterns can have high discriminative power. Bad!
• Random sampling does not work since it is not exhaustive: most patterns are useless, so randomly sampling patterns (or blindly enumerating without considering frequency) is useless.
• Mining on a small number of examples:
• If only a subset of the vocabulary is used, the search is incomplete.
• If the complete vocabulary is used, it does not help much but introduces a sample selection bias problem, in particular missing low-support but high-information-gain patterns.
Conventional Procedure: Two-Step Batch Method (Feature Construction and Selection)
• Mine frequent patterns (support > sup);
• Select the most discriminative patterns;
• Represent data in the feature space using such patterns;
• Build classification models (NN, DT, SVM, LR, any classifier you can name).
[Figure: DataSet -> mine -> Frequent Patterns (1, 2, ..., 7) -> select -> Mined Discriminative Patterns (1, 2, 4) -> represent -> binary feature table (F1, F2, F4 over Data1..Data4) -> classifier]
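A high-level sketch of this two-step batch procedure. Here mine_frequent_patterns and info_gain are hypothetical helpers standing in for an off-the-shelf frequent-pattern miner and a standard information-gain scorer; this illustrates the baseline, it is not the authors' code.

```python
# Two-step batch method: (1) mine ALL frequent patterns above min_sup,
# (2) rank them by information gain on the WHOLE dataset and keep the top k,
# then represent the data as binary vectors and hand them to any classifier.

def two_step_batch(transactions, labels, min_sup, k):
    # Step 1: exhaustive mining -- this is where the combinatorial explosion happens
    patterns = mine_frequent_patterns(transactions, min_sup)            # hypothetical miner

    # Step 2: batch selection against the complete dataset
    scored = sorted(patterns,
                    key=lambda p: info_gain(p, transactions, labels),   # hypothetical scorer
                    reverse=True)
    selected = scored[:k]

    # Feature representation over the selected patterns
    X = [[1 if p <= t else 0 for p in selected] for t in transactions]
    return X, selected
```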
Two Problems: Mine Step
• 1. Exponential / combinatorial explosion of the pattern space;
• 2. Patterns are not considered at all if min_support is not small enough.
[Figure: DataSet -> mine -> Frequent Patterns (1, 2, ..., 7)]
Two Problems: Select Step
• Issue of discriminative power:
• 3. Information gain is evaluated against the complete dataset, NOT on subsets of examples;
• 4. Correlation among patterns is not directly evaluated on their joint predictability.
[Figure: Frequent Patterns (1, 2, ..., 7) -> select -> Mined Discriminative Patterns (1, 2, 4)]
Direct Mining & Selection via Model-based Search Tree
• Basic flow: at each tree node, mine frequent patterns on that node's data and select the most discriminative feature F based on information gain (Mine & Select, local support P: 20%); split the examples on F (Y/N branches) and recurse until only few data remain.
• Divide-and-conquer based frequent pattern mining: because support is local to each node, the effective global support can be extremely low, e.g., 10 * 20% / 10000 = 0.02%.
• The tree is both a feature miner and a classifier, returning a compact set of highly discriminative patterns (1, 2, 3, 4, 5, 6, 7, ...).
[Figure: model-based search tree with a "Mine & Select, P: 20%" step at each internal node and Y/N branches down to "Few Data" leaves]
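A minimal recursive sketch of this basic flow. The stopping threshold and the helpers mine_frequent_patterns and info_gain are hypothetical stand-ins; this is a sketch of the divide-and-conquer idea, not the authors' implementation.

```python
# Model-based search tree, sketched: mine on the node's own data, keep the
# single most discriminative pattern, split on it, and recurse.

MIN_NODE_SIZE = 10   # illustrative "few data" stopping threshold

def build_mbt(transactions, labels, node_sup=0.2, found=None):
    found = [] if found is None else found
    if len(transactions) < MIN_NODE_SIZE or len(set(labels)) <= 1:
        return found                                   # leaf: stop growing

    # Mine only on this node's examples; node_sup is local, so the effective
    # global support can be extremely low (e.g. 10 * 20% / 10000 = 0.02%)
    patterns = mine_frequent_patterns(transactions, node_sup)               # hypothetical miner
    if not patterns:
        return found
    best = max(patterns, key=lambda p: info_gain(p, transactions, labels))  # hypothetical scorer
    found.append(best)

    # Split: examples that contain the pattern (Y branch) vs. those that do not (N branch)
    yes = [(t, y) for t, y in zip(transactions, labels) if best <= t]
    no  = [(t, y) for t, y in zip(transactions, labels) if not best <= t]
    for branch in (yes, no):
        if branch:
            ts, ys = zip(*branch)
            build_mbt(list(ts), list(ys), node_sup, found)
    return found    # compact set of highly discriminative patterns
```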
Analyses (I)
• Scalability (Theorem 1)
• Upper bound
• "Scale down" ratio needed to obtain extremely low-support patterns
• Bound on the number of returned features (Theorem 2)
Analyses (II)
• Subspaces are important for discriminative patterns
• Original set: a pattern α has no information gain if P1 / C1 = P0 / C0 (equivalently, P1 / P0 = C1 / C0), where
• C1 and C0: number of examples belonging to class 1 and class 0
• P1: number of examples in C1 that contain the pattern α
• P0: number of examples in C0 that contain the same pattern α
• Subsets could still have information gain
• Non-overfitting
• Optimality under exhaustive search
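A small numeric check of the zero-gain condition above, using the standard entropy and information-gain formulas; the counts in the example calls are illustrative only.

```python
# Information gain of a binary pattern feature on a dataset with C1 positive
# and C0 negative examples, where the pattern occurs in P1 positives and P0 negatives.

from math import log2

def entropy(pos, neg):
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -p * log2(p) - (1 - p) * log2(1 - p)

def info_gain(C1, C0, P1, P0):
    total, with_p = C1 + C0, P1 + P0
    return (entropy(C1, C0)
            - (with_p / total) * entropy(P1, P0)
            - ((total - with_p) / total) * entropy(C1 - P1, C0 - P0))

print(info_gain(C1=100, C0=100, P1=50, P0=50))   # 0.0: P1/C1 == P0/C0, no gain on the full set
print(info_gain(C1=40,  C0=10, P1=30, P0=1))     # > 0: the same pattern can be informative on a subset
```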
Experimental Studies: Itemset Mining (I)
• Scalability comparison
[Figure: scalability comparison, with the model-based search tree flow diagram repeated for reference]
Experimental Studies: Itemset Mining (II)
• Accuracy of mined itemsets: 4 wins, 1 loss, with a much smaller number of patterns
Experimental Studies: Itemset Mining (III) • Convergence
Experimental Studies: Graph Mining (I)
• 9 NCI anti-cancer screen datasets
• The PubChem Project. URL: pubchem.ncbi.nlm.nih.gov
• Active (positive) class: around 1% - 8.3%
• 2 AIDS anti-viral screen datasets
• URL: http://dtp.nci.nih.gov
• H1: CM+CA – 3.5%
• H2: CA – 1%
Experimental Studies: Graph Mining (II)
• Scalability
[Figure: scalability results, with the model-based search tree flow diagram repeated for reference]
Experimental Studies: Graph Mining (III)
• AUC and accuracy: AUC 11 wins; accuracy 10 wins, 1 loss
Experimental Studies: Graph Mining (IV)
• AUC of MbT vs. DT
• MbT vs. benchmarks: 7 wins, 4 losses
Summary
• Model-based Search Tree
• Integrates feature mining and construction
• Dynamic support: can mine patterns with extremely small support
• Both a feature constructor and a classifier
• Not limited to one type of frequent pattern: plug and play
• Experimental results
• Itemset mining
• Graph mining
• Software and datasets available from: www.cs.columbia.edu/~wfan