Three Challenges in Data Mining
Three Challenges in Data Mining Anne Denton Department of Computer Science NDSU
Why Data Mining? • Parkinson’s Law of Data: Data expands to fill the space available for storage • Disk-storage version of Moore’s law: Capacity ∝ 2^(t / 18 months) • Available data grows exponentially!
Outline • Motivation of 3 challenges • More records (rows) • More attributes (columns) • More subject domains • Some answers to the challenges • Thesis work • Generalized P-Tree structure • Kernel-based semi-naïve Bayes classification • KDD-cup 02/03 and work with CSci 366 students • Data with graph relationship • Outlook: Data with time dependence
Examples • More records • Many stores save each transaction • Data warehouses keep historic data • Monitoring network traffic • Micro sensors / sensor networks • More attributes • Items in a shopping cart • Keywords in text • Properties of a protein (multi-valued categorical) • More subject domains • Data mining hype increases audience
Algorithmic Perspective • More records • Standard scaling problem • More attributes • Different algorithms needed for 1000 vs. 10 attributes • More subject domains • New techniques needed • Joining of separate fields • Algorithms should be domain-independent • Need for experts does not scale well • Twice as many data sets • Twice as many domain experts?? • Ignore domain knowledge? • No! Formulate it systematically
Some Answers to Challenges • Large data quantity (Thesis) • Many records • P-Tree concept and its generalization to non-spatial data • Many attributes • Algorithm that defies curse of dimensionality • New techniques / Joining separate fields • Mining data on a graph • Outlook: Mining data with time dependence
Challenge 1: Many Records • Typical question • How many records satisfy given conditions on attributes? • Typical answer • In record-oriented database systems • Database scan: O(N) • Sorting / indexes? • Unsuitable for most problems • P-Trees • Compressed bit-column-wise storage • Bit-wise AND replaces database scan
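The counting idea on this slide can be illustrated with a minimal sketch. It assumes plain, uncompressed bit vectors held in Python integers (one per bit position of an attribute); the tree-based compression of actual P-Trees is not modeled, and the names `bit_columns` and `count_equal` are hypothetical.

```python
# Sketch only: each bit position of an attribute is stored as one column-wise
# bit vector, held in a Python int (bit i of the int belongs to record i).
# Real P-Trees additionally compress these vectors; that is not shown here.

def bit_columns(values, width):
    """Split a list of non-negative ints into `width` bit vectors (LSB first)."""
    columns = [0] * width
    for row, value in enumerate(values):
        for bit in range(width):
            if (value >> bit) & 1:
                columns[bit] |= 1 << row
    return columns

def count_equal(columns, target, n_rows):
    """Count records whose value equals `target` using only bit-wise operations."""
    all_rows = (1 << n_rows) - 1
    mask = all_rows
    for bit, column in enumerate(columns):
        if (target >> bit) & 1:
            mask &= column                # bit must be 1 in matching records
        else:
            mask &= ~column & all_rows    # bit must be 0 in matching records
    return bin(mask).count("1")

values = [3, 5, 3, 7, 3, 1]               # made-up attribute values
columns = bit_columns(values, width=3)
print(count_equal(columns, 3, len(values)))  # prints 3
```

The condition is answered by combining whole columns with AND rather than visiting records one by one; with compressed columns, long runs of equal bits can be skipped, which is where the scaling benefit would come from.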
P-Trees: Ordering Aspect • Compression relies on long sequences of 0 or 1 • Images • Neighboring pixels are probably similar • Peano-ordering • Other data? • Peano-ordering can be generalized • Peano-order sorting
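One way to picture Peano-order sorting of non-spatial records is the sketch below. It assumes that "Peano ordering" refers to the bit-interleaved (Z-order) key commonly used with P-Trees and that records have two integer attributes; `interleave_bits` and `peano_order_sort` are hypothetical names.

```python
# Sketch only: sort two-attribute records by a bit-interleaved key so that
# records with similar attribute values end up next to each other.

def interleave_bits(x, y, width):
    """Interleave the bits of x and y; y supplies the higher bit of each pair."""
    key = 0
    for bit in range(width):
        key |= ((x >> bit) & 1) << (2 * bit)
        key |= ((y >> bit) & 1) << (2 * bit + 1)
    return key

def peano_order_sort(records, width):
    """Return the records sorted by their interleaved key."""
    return sorted(records, key=lambda r: interleave_bits(r[0], r[1], width))

points = [(3, 3), (0, 0), (1, 0), (0, 1), (2, 2), (1, 1)]  # made-up records
print(peano_order_sort(points, width=2))
# prints [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2), (3, 3)]
```

After sorting, the high-order bits of neighboring records tend to agree, which produces the long runs of 0 or 1 that the compression relies on.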
Impact of Peano-Order Sorting • Speed improvement especially for large data sets • Less than O(N) scaling for all algorithms
So Far • Answer to challenge 1: Many records • P-Tree concept allows scaling better than O(N) for AND (equivalent to database scan) • Introduced effective generalization to non-spatial data (thesis) • Challenge 2: Many attributes • Focus: Classification • Curse of dimensionality • Some algorithms suffer more than others
Curse of Dimensionality • Many standard classification algorithms • E.g., decision trees, rule-based classification • Each attribute splits the data into 2 halves: relevant and irrelevant • How often can we divide by 2 before the small size of the “relevant” part makes results insignificant? • Inverse of the chess board problem: double the number of rice grains for each square of the chess board • Many domains have hundreds of attributes • Occurrence of terms in text mining • Properties of genes
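The halving argument can be made concrete with a small calculation. The data set size and the minimum of 30 records for a "significant" part are assumptions chosen for illustration only.

```python
# Sketch only: count how many times a data set can be halved before the
# "relevant" part drops below a minimum size (threshold is an assumption).

def max_splits(n_records, min_records):
    """Number of halvings that still leave at least `min_records` records."""
    splits = 0
    while n_records // 2 >= min_records:
        n_records //= 2
        splits += 1
    return splits

print(max_splits(1_000_000, 30))  # prints 15
```

Under these assumptions, even a million records support only about 15 halving attributes, far fewer than the hundreds of attributes found in text mining or genomics.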
Possible Solution • Additive models • Each attribute contributes to a sum • Techniques exist (statistics) • Computationally intensive • Simplest: Naïve Bayes, P(C | x) ∝ P(C) ∏_k P(x^(k) | C) • x^(k) is value of kth attribute • Considered additive model • Logarithm of probability is additive: log P(C | x) = log P(C) + Σ_k log P(x^(k) | C) + const
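The additive view of Naïve Bayes can be sketched as follows for categorical attributes. This is a generic textbook formulation with add-one smoothing, not the thesis implementation; the function names and the toy data are hypothetical.

```python
# Sketch only: Naive Bayes written as an additive model. The class score is
# log P(C) plus one log-probability term per attribute.
import math
from collections import Counter, defaultdict

def train(records, labels):
    """Collect the counts needed for class priors and per-attribute likelihoods."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, k) -> counts of attribute values
    values_seen = defaultdict(set)        # k -> distinct values of attribute k
    for x, c in zip(records, labels):
        for k, v in enumerate(x):
            value_counts[(c, k)][v] += 1
            values_seen[k].add(v)
    return class_counts, value_counts, values_seen

def log_score(x, c, model):
    """log P(c) + sum over k of log P(x^(k) | c), with add-one smoothing."""
    class_counts, value_counts, values_seen = model
    score = math.log(class_counts[c] / sum(class_counts.values()))
    for k, v in enumerate(x):
        numerator = value_counts[(c, k)][v] + 1
        denominator = class_counts[c] + len(values_seen[k])
        score += math.log(numerator / denominator)   # one additive term per attribute
    return score

def predict(x, model):
    """Return the class with the highest additive score."""
    return max(model[0], key=lambda c: log_score(x, c, model))

records = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train(records, labels)
print(predict(("rainy", "cool"), model))  # prints yes
```

Because each attribute only adds a term to the sum, the data are never partitioned by attribute value, so adding attributes does not shrink the number of records behind each estimate.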
Semi-Naïve Bayes Classifier • Correlated attributes are joined • Has been done for categorical data • Kononenko ’91, Pazzani ’96 • Previously: Continuous data discretized • New (thesis) • Kernel-based evaluation of correlation
Results • Error decrease in units of standard deviation for different parameter sets • Improvement for wide range of correlation thresholds: 0.05 (white) to 1 (blue)
So Far • Answer to challenge 1: More records • Generalized P-tree structure • Answer to challenge 2: More attributes • Additive algorithms • Example: Kernel-based semi-naïve Bayes • Challenge 3: More subject domains • Data on a graph • Outlook: Data with time dependence
Standard Approach to Data Mining • Conversion to a relation (table) • Domain knowledge goes into table creation • Standard table can be mined with standard tools • Does that solve the problem? • To some degree, yes • But we can do better
“Everything should be made as simple as possible, but not simpler” Albert Einstein
Claim: Representation as single relation is not rich enough • Example: Contribution of a graph structure to standard mining problems • Genomics • Protein-protein interactions • WWW • Link structure • Scientific publications • Citations (Scientific American 05/03)
Data on a Graph: Old Hat? • Common Topics • Analyze edge structure • Google • Biological Networks • Sub-graph matching • Chemistry • Visualization • Focus on graph structure • Our work • Focus on mining node data • Graph structure provides connectivity
Protein-Protein Interactions • Protein data • From Munich Information Center for Protein Sequences (also KDD-cup 02) • Hierarchical attributes • Function • Localization • Pathways • Gene-related properties • Interactions • From experiments • Undirected graph
Questions • Prediction of a property (KDD-cup 02: AHR*) • Which properties in neighbors are relevant? • How should we integrate neighbor knowledge? • What are interesting patterns? • Which properties say more about neighboring nodes than about the node itself? But not: • *AHR: Aryl Hydrocarbon Receptor Signaling Pathway
Possible Representations • OR-based • At least one neighbor has property • Example: Neighbor essential true • AND-based • All neighbors have property • Example: Neighbor essential false • Path-based (depends on maximum hops) • One record for each path • Classification: weighting? • Association Rule Mining: Record base changes • [Figure labels: AHR essential; AHR essential; AHR not essential]
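The OR-based and AND-based representations can be sketched for an undirected interaction graph as below. The node names and the tiny graph are made up, `neighbor_features` is a hypothetical helper, and treating a node without neighbors as AND = false is an assumption.

```python
# Sketch only: derive OR-based and AND-based neighbor attributes for one
# boolean node property on an undirected graph.

def neighbor_features(edges, has_property):
    """Map each node to {"or": any neighbor has it, "and": all neighbors have it}."""
    neighbors = {node: set() for node in has_property}
    for a, b in edges:                    # undirected: register both directions
        neighbors[a].add(b)
        neighbors[b].add(a)
    features = {}
    for node, adjacent in neighbors.items():
        flags = [has_property[n] for n in adjacent]
        features[node] = {
            "or": any(flags),
            "and": bool(flags) and all(flags),   # assumption: no neighbors -> False
        }
    return features

essential = {"P": False, "N1": True, "N2": True, "N3": False}   # made-up data
edges = [("P", "N1"), ("P", "N2"), ("P", "N3")]
print(neighbor_features(edges, essential)["P"])
# prints {'or': True, 'and': False}
```

Both variants yield one fixed-length record per node, so standard table-based tools apply; the path-based representation instead produces one record per path and changes the record base.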
Association Rule Mining • OR-based representation • Conditions • Association rule involves AHR • Support across a link greater than within a node • Conditions on minimum confidence and support • Top 3 with respect to support: (Results by Christopher Besemann, project CSci 366)
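The condition "support across a link greater than within a node" can be illustrated with a minimal sketch. The records are made up and are not the results referred to on the slide; the `node:` and `nbr:` item prefixes and the function name are hypothetical conventions for an OR-based representation.

```python
# Sketch only: support and confidence of a rule over OR-based records, where
# "node:" items describe the node itself and "nbr:" items its neighbors.

def support_and_confidence(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n_antecedent = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

records = [                                # made-up records, one per node
    {"node:AHR", "nbr:essential"},
    {"node:AHR", "nbr:essential", "node:essential"},
    {"node:AHR"},
    {"nbr:essential"},
]
across_link = support_and_confidence(records, {"node:AHR"}, {"nbr:essential"})
within_node = support_and_confidence(records, {"node:AHR"}, {"node:essential"})
print(across_link[0] > within_node[0])  # prints True
```

A rule passing this filter says more about a node's neighbors than about the node itself, which is the kind of pattern the graph structure is meant to expose.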
Classification Results • Problem (especially path-based representation) • Varying amount of information per record • Many algorithms unsuitable in principle • E.g., algorithms that divide domain space • KDD-cup 02 • Very simple additive model • Based on visually identifying relationship • Number of interacting essential genes adds to probability of predicting protein as AHR
KDD-Cup 02: Honorable Mention NDSU Team
Outlook: Time-Dependent Data • KDD-cup 03 • Prediction of citations of scientific papers • Old: Time-series prediction • New: Combination with similarity-based prediction
Conclusions and Outlook • Many exciting problems in data mining • Various challenges • Scaling of existing algorithms (more records) • Different properties in algorithms become relevant (more attributes) • Identifying and solving new domain-independent challenges (more subject areas) • Examples of general structural components that apply to many domains • Graph-structure • Time-dependence • Relationships between attributes