Explore the extraction of discriminative frequent patterns for feature construction from semi-structured data, aiming at enhancing classification accuracy. Learn how to identify good features using a model-based search tree and the challenges faced in finding predictive features with a high information gain. Discover how discriminative patterns can optimize the feature space and improve classification models. Uncover the techniques for direct mining and selection of these patterns to build compact and powerful feature sets. Dive into the complexities of computational issues, scalability, and transforming data spaces for efficient pattern mining.
Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
How to find good features from semi-structured raw data for classification
Wei Fan, Kun Zhang, Hong Cheng, Jing Gao, Xifeng Yan, Jiawei Han, Philip S. Yu, Olivier Verscheure
Feature Construction
• Most data mining and machine learning models assume the following structured data:
• (x1, x2, ..., xk) -> y
• where the xi's are independent variables and y is the dependent variable
• y drawn from a discrete set: classification
• y drawn from a continuous range: regression
• When the feature vectors are good, differences in accuracy among learners are small.
• Question: where do good features come from?
Frequent Pattern-Based Feature Extraction
• Data are not in pre-defined feature vectors:
• Transactions
• Biological sequences
• Graph databases
• Frequent patterns are good candidates for discriminative features. So, how do we mine them?
A Discovered Pattern
[Figure: a frequent sub-graph (FP) shared by compounds NSC 4960, NSC 699181, NSC 40773, NSC 164863, and NSC 191370; example borrowed from a George Karypis presentation]
Frequent Pattern Feature Vector Representation
[Figure: mined patterns feed a binary feature table that any classifier can consume: NN, DT (e.g., Petal.Length < 2.45 -> setosa; Petal.Width < 1.75 -> versicolor vs. virginica), SVM, LR, any classifier you can name]
        P1  P2  P3
Data1    1   1   0
Data2    1   0   1
Data3    1   1   0
Data4    0   0   1
...
• Mining these predictive features is an NP-hard problem: 100 examples can yield up to 10^10 patterns, and most are useless.
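The table above maps each example to a binary vector of pattern indicators. Below is a minimal sketch of that mapping; the toy transactions and patterns are hypothetical, not data from the slides.

```python
# Turn mined frequent patterns into binary features: column i is 1 iff the
# example contains pattern Pi. A sketch, not the authors' code.

def patterns_to_features(transactions, patterns):
    """transactions: list of item sets; patterns: list of mined item sets."""
    return [[1 if p <= t else 0 for p in patterns]   # p <= t: pattern is contained in the transaction
            for t in transactions]

# Hypothetical toy data mirroring the table above
transactions = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c"}, {"d"}]
patterns = [{"a"}, {"a", "b"}, {"d"}]                # P1, P2, P3
print(patterns_to_features(transactions, patterns))
# [[1, 1, 0], [1, 1, 1], [1, 0, 0], [0, 0, 1]]
```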
Example
• 192 examples
• 12% support (at least 12% of the examples contain the pattern): 8,600 patterns returned by itemset mining
• 192 examples vs. 8,600 patterns?
• 4% support: 92,000 patterns
• 192 vs. 92,000??
• Most patterns have no predictive power and cannot be used to construct features.
• Our algorithm
• Finds only 20 highly predictive patterns
• Can construct a decision tree with about 90% accuracy
Data in a "Bad" Feature Space
• Discriminative patterns are a non-linear combination of single features
• They increase the expressive and discriminative power of the feature space
• An example:
[Figure: binary points labeled 0/1 plotted over x and y]
• The data is not linearly separable in (x, y)
New Feature Space: Map Data to a Different Space
• Solving the problem: mine the itemset F: {x=0, y=0} (association rule F: x=0 => y=0), then transform the data by adding F as a new feature (Mine & Transform)
[Figure: the same points re-plotted in the (x, y, F) space]
• The data is linearly separable in (x, y, F)
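A small self-contained sketch of this idea with toy XOR-like points; the labels and the particular linear separator are illustrative choices, not taken from the paper.

```python
# After adding the pattern feature F = [x=0 AND y=0], a simple linear rule
# separates XOR-style labels that no line in (x, y) alone can separate.

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [1, 0, 0, 1]                       # not linearly separable in (x, y)

def mapped(p):
    x, y = p
    F = 1 if (x == 0 and y == 0) else 0     # indicator of the mined itemset {x=0, y=0}
    return (x, y, F)

# One linear separator in (x, y, F): predict 1 iff x + y + 2*F - 1.5 > 0
w, b = (1.0, 1.0, 2.0), -1.5
preds = [1 if sum(wi * vi for wi, vi in zip(w, mapped(p))) + b > 0 else 0
         for p in points]
print(preds == labels)                      # True: linearly separable in (x, y, F)
```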
Computational Issues
• Patterns are measured by their "frequency" or support, e.g., frequent subgraphs with sup >= 10% (at least 10% of the examples contain these patterns).
• "Ordered" enumeration: cannot enumerate patterns with sup = 10% without first enumerating all patterns with sup > 10%.
• NP-hard problem: easily up to 10^10 patterns for a realistic problem.
• Most patterns are non-discriminative, while low-support patterns can have high discriminative power. Bad!
• Random sampling does not work since it is not exhaustive: most patterns are useless, so randomly sampling patterns (or blindly enumerating without considering frequency) is useless.
• Mining on a small number of examples:
• If only a subset of the vocabulary is used, the search is incomplete.
• If the complete vocabulary is used, it does not help much but introduces a sample selection bias problem, in particular missing low-support but high-information-gain patterns.
Conventional Procedure: Two-Step Batch Method (Feature Construction and Selection)
• Mine frequent patterns (support > sup);
• Select the most discriminative patterns;
• Represent data in the feature space using such patterns;
• Build classification models (NN, DT, SVM, LR, any classifier you can name).
[Figure: DataSet -> mine -> Frequent Patterns (1, 2, ..., 7) -> select -> Mined Discriminative Patterns (1, 2, 4) -> represent -> binary feature table (F1, F2, F4 over Data1..Data4) -> classifier]
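A high-level sketch of this two-step batch procedure. Here mine_frequent_patterns and info_gain are hypothetical helpers standing in for an off-the-shelf frequent-pattern miner and a standard information-gain scorer; this illustrates the baseline, it is not the authors' code.

```python
# Two-step batch method: (1) mine ALL frequent patterns above min_sup,
# (2) rank them by information gain on the WHOLE dataset and keep the top k,
# then represent the data as binary vectors and hand them to any classifier.

def two_step_batch(transactions, labels, min_sup, k):
    # Step 1: exhaustive mining -- this is where the combinatorial explosion happens
    patterns = mine_frequent_patterns(transactions, min_sup)            # hypothetical miner

    # Step 2: batch selection against the complete dataset
    scored = sorted(patterns,
                    key=lambda p: info_gain(p, transactions, labels),   # hypothetical scorer
                    reverse=True)
    selected = scored[:k]

    # Feature representation over the selected patterns
    X = [[1 if p <= t else 0 for p in selected] for t in transactions]
    return X, selected
```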
Two Problems: Mine Step
• 1. Exponential / combinatorial explosion of the pattern space;
• 2. Patterns are not considered at all if min_support is not small enough.
[Figure: DataSet -> mine -> Frequent Patterns (1, 2, ..., 7)]
Two Problems: Select Step
• Issue of discriminative power:
• 3. Information gain is evaluated against the complete dataset, NOT on subsets of examples;
• 4. Correlation among patterns is not directly evaluated on their joint predictability.
[Figure: Frequent Patterns (1, 2, ..., 7) -> select -> Mined Discriminative Patterns (1, 2, 4)]
Direct Mining & Selection via Model-based Search Tree
• Basic flow: at each tree node, mine frequent patterns on that node's data and select the most discriminative feature F based on information gain (Mine & Select, local support P: 20%); split the examples on F (Y/N branches) and recurse until only few data remain.
• Divide-and-conquer based frequent pattern mining: because support is local to each node, the effective global support can be extremely low, e.g., 10 * 20% / 10000 = 0.02%.
• The tree is both a feature miner and a classifier, returning a compact set of highly discriminative patterns (1, 2, 3, 4, 5, 6, 7, ...).
[Figure: model-based search tree with a "Mine & Select, P: 20%" step at each internal node and Y/N branches down to "Few Data" leaves]
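A minimal recursive sketch of this basic flow. The stopping threshold and the helpers mine_frequent_patterns and info_gain are hypothetical stand-ins; this is a sketch of the divide-and-conquer idea, not the authors' implementation.

```python
# Model-based search tree, sketched: mine on the node's own data, keep the
# single most discriminative pattern, split on it, and recurse.

MIN_NODE_SIZE = 10   # illustrative "few data" stopping threshold

def build_mbt(transactions, labels, node_sup=0.2, found=None):
    found = [] if found is None else found
    if len(transactions) < MIN_NODE_SIZE or len(set(labels)) <= 1:
        return found                                   # leaf: stop growing

    # Mine only on this node's examples; node_sup is local, so the effective
    # global support can be extremely low (e.g. 10 * 20% / 10000 = 0.02%)
    patterns = mine_frequent_patterns(transactions, node_sup)               # hypothetical miner
    if not patterns:
        return found
    best = max(patterns, key=lambda p: info_gain(p, transactions, labels))  # hypothetical scorer
    found.append(best)

    # Split: examples that contain the pattern (Y branch) vs. those that do not (N branch)
    yes = [(t, y) for t, y in zip(transactions, labels) if best <= t]
    no  = [(t, y) for t, y in zip(transactions, labels) if not best <= t]
    for branch in (yes, no):
        if branch:
            ts, ys = zip(*branch)
            build_mbt(list(ts), list(ys), node_sup, found)
    return found    # compact set of highly discriminative patterns
```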
Analyses (I)
• Scalability (Theorem 1)
• Upper bound
• "Scale down" ratio needed to obtain extremely low-support patterns
• Bound on the number of returned features (Theorem 2)
Analyses (II)
• Subspaces are important for discriminative patterns
• Original set: a pattern α has no information gain if P1 / C1 = P0 / C0 (equivalently, P1 / P0 = C1 / C0), where
• C1 and C0: number of examples belonging to class 1 and class 0
• P1: number of examples in C1 that contain the pattern α
• P0: number of examples in C0 that contain the same pattern α
• Subsets could still have information gain
• Non-overfitting
• Optimality under exhaustive search
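A small numeric check of the zero-gain condition above, using the standard entropy and information-gain formulas; the counts in the example calls are illustrative only.

```python
# Information gain of a binary pattern feature on a dataset with C1 positive
# and C0 negative examples, where the pattern occurs in P1 positives and P0 negatives.

from math import log2

def entropy(pos, neg):
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -p * log2(p) - (1 - p) * log2(1 - p)

def info_gain(C1, C0, P1, P0):
    total, with_p = C1 + C0, P1 + P0
    return (entropy(C1, C0)
            - (with_p / total) * entropy(P1, P0)
            - ((total - with_p) / total) * entropy(C1 - P1, C0 - P0))

print(info_gain(C1=100, C0=100, P1=50, P0=50))   # 0.0: P1/C1 == P0/C0, no gain on the full set
print(info_gain(C1=40,  C0=10, P1=30, P0=1))     # > 0: the same pattern can be informative on a subset
```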
Experimental Studies: Itemset Mining (I)
• Scalability comparison
[Figure: scalability comparison, with the model-based search tree flow diagram repeated for reference]
Experimental Studies: Itemset Mining (II)
• Accuracy of mined itemsets: 4 wins, 1 loss, with a much smaller number of patterns
Experimental Studies: Itemset Mining (III) • Convergence
Experimental Studies: Graph Mining (I)
• 9 NCI anti-cancer screen datasets
• The PubChem Project. URL: pubchem.ncbi.nlm.nih.gov
• Active (positive) class: around 1% - 8.3%
• 2 AIDS anti-viral screen datasets
• URL: http://dtp.nci.nih.gov
• H1: CM+CA – 3.5%
• H2: CA – 1%
Experimental Studies: Graph Mining (II)
• Scalability
[Figure: scalability results, with the model-based search tree flow diagram repeated for reference]
Experimental Studies: Graph Mining (III)
• AUC and accuracy: AUC 11 wins; accuracy 10 wins, 1 loss
Experimental Studies: Graph Mining (IV)
• AUC of MbT vs. DT
• MbT vs. benchmarks: 7 wins, 4 losses
Summary
• Model-based Search Tree
• Integrates feature mining and construction
• Dynamic support: can mine patterns with extremely small support
• Both a feature constructor and a classifier
• Not limited to one type of frequent pattern: plug and play
• Experimental results
• Itemset mining
• Graph mining
• Software and datasets available from: www.cs.columbia.edu/~wfan