
Graph Mining Applications to Machine Learning Problems



  1. Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda

  2. Graphs…

  3. Graph Structures in Biology • Compounds (figure: chemical structure diagram) • DNA sequences • RNA (figure: RNA secondary structure with paired bases) • Texts in literature (e.g., "Amitriptyline inhibits adenosine uptake")

  4. Substructure Representation • 0/1 vector of pattern indicators • Huge dimensionality! • Graph mining is needed to select features • Patterns are better than paths (as used in marginalized graph kernels)

  5. Overview • Quick Review on Graph Mining • EM-based Clustering algorithm • Mixture model with L1 feature selection • Graph Boosting • Supervised Regression for QSAR Analysis • Linear programming meets graph mining

  6. Quick Review of Graph Mining

  7. Graph Mining • Analysis of graph databases • Find all patterns satisfying predetermined conditions • Frequent substructure mining • Combinatorial, exhaustive • Recently developed: AGM (Inokuchi et al., 2000), gSpan (Yan & Han, 2002), Gaston (Nijssen & Kok, 2004)

  8. Graph Mining • Frequent substructure mining: enumerate all patterns that occur in at least m graphs • x_ik ∈ {0,1}: indicator of pattern k in graph i • support(k) = Σ_i x_ik: the number of graphs containing pattern k
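
As a toy sketch of these definitions (with hypothetical data; `X` is the pattern-indicator matrix), support counting is a single sum once the indicators are available:

```python
import numpy as np

# Toy pattern-indicator matrix (hypothetical data):
# X[i, k] = 1 iff pattern k occurs in graph i
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]])

m = 3                              # minimum-support threshold
support = X.sum(axis=0)            # support(k) = number of graphs containing k
print(np.where(support >= m)[0])   # frequent patterns: [0]
```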

  9. gSpan (Yan & Han, 2002) • An efficient frequent substructure mining method • DFS codes: canonical codes enabling efficient detection of isomorphic patterns • We extend gSpan for the work presented here

  10. Enumeration on a Tree-shaped Search Space • Each node of the search tree holds a pattern • Nodes are generated from the root, adding one edge at each step

  11. Tree Pruning • support(g): the number of graphs containing pattern g • Anti-monotonicity: extending a pattern can only shrink its support • If support(g) < m, stop exploring: the descendants of g are not generated
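
A minimal Python sketch of this pruned depth-first enumeration; `children` (one-edge extensions, which gSpan generates canonically via DFS codes) and `support` are placeholders for the real mining machinery:

```python
def mine_frequent(root, children, support, m):
    """Depth-first pattern enumeration with anti-monotone pruning.

    children(p): one-edge extensions of pattern p (gSpan generates
                 these canonically via DFS codes).
    support(p):  number of database graphs containing p.
    """
    frequent = []
    stack = [root]
    while stack:
        p = stack.pop()
        if support(p) < m:
            # Anti-monotonicity: no extension of p can reach support m,
            # so the whole subtree below p is pruned (never generated).
            continue
        frequent.append(p)
        stack.extend(children(p))
    return frequent
```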

  12. Discriminative Patterns: Weighted Substructure Mining • w_i > 0 for graphs of the positive class, w_i < 0 for the negative class • Weighted substructure mining: find patterns with a large weighted frequency difference |Σ_i w_i x_ik| • This score is not anti-monotonic, so a bound is used for pruning (see the sketch below)
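
The bound itself is not spelled out on the slide; the sketch below uses the standard observation, in the style of the boosting bound of Kudo et al. (treat the exact form as an assumption), that a superpattern can only occur in a subset of the graphs containing the current pattern:

```python
def gain_bound(w, occ):
    """Upper bound on |sum_i w_i * x_i| over all superpatterns of k.

    w:   per-graph weights (positive / negative class)
    occ: occurrence indicators of the current pattern k, one per graph
    A superpattern of k occurs only in a subset of the graphs that
    contain k, so its weighted frequency difference is bounded by the
    larger of the positive-weight and negative-weight mass on k.
    """
    pos = sum(w_i for w_i, o in zip(w, occ) if o and w_i > 0)
    neg = sum(-w_i for w_i, o in zip(w, occ) if o and w_i < 0)
    return max(pos, neg)

# During the tree search, prune the node for pattern k whenever
# gain_bound(w, occ_k) falls below the best score found so far.
```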

  13. Multiclass Version • Multiple weight vectors w^(c), one per class: w_i^(c) is positive if graph i belongs to class c, negative otherwise • Search for patterns overrepresented in a class

  14. EM-based Clustering of Graphs. Tsuda, K. and Kudo, T.: Clustering Graphs by Weighted Substructure Mining. ICML 2006, 953-960.

  15. EM-based graph clustering • Motivation • Learning a mixture model in the feature space of patterns • Basis for more complex probabilistic inference • L1 regularization & Graph Mining • E-step -> Mining -> M-step

  16. Probabilistic Model • x ∈ {0,1}^d: feature vector of a graph (pattern indicators) • Binomial mixture: p(x) = Σ_ℓ α_ℓ p(x; θ_ℓ) • Each component: p(x; θ_ℓ) = Π_k θ_ℓk^{x_k} (1 − θ_ℓk)^{1−x_k} • α_ℓ: mixing weight for cluster ℓ • θ_ℓ: parameter vector for cluster ℓ

  17. Function to Minimize • L1-regularized negative log-likelihood: the penalty Σ_{ℓ,k} |θ_ℓk − t_k| pulls each parameter toward a baseline constant t_k • Baseline t_k: the ML estimate under a single binomial distribution (no clustering) • In the solution, most parameters are exactly equal to the baseline constants

  18. E-step • Active pattern: a pattern k whose parameter θ_ℓk differs from the baseline t_k in some cluster • The E-step can be computed using only the active patterns, which makes it tractable

  19. M-step • The E-step yields a putative (soft) cluster assignment • Each parameter is then solved for separately • Graph mining finds the active patterns, and the update is solved only for those
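
A compact numpy sketch of one E-step/M-step round for this Bernoulli ("binomial") mixture; the L1 pull toward the baseline constants and the mining of active patterns are deliberately omitted, so this shows only the plain EM skeleton:

```python
import numpy as np

def em_step(X, alpha, theta, eps=1e-9):
    """One EM round for a Bernoulli mixture over 0/1 pattern indicators.

    X:     (n_graphs, n_patterns) pattern-indicator matrix
    alpha: (n_clusters,) mixing weights
    theta: (n_clusters, n_patterns) per-cluster occurrence probabilities
    The actual algorithm restricts all sums to the *active* patterns
    (theta != baseline) and re-mines them after every M-step.
    """
    # E-step: responsibilities from per-component log-likelihoods
    log_p = (X @ np.log(theta + eps).T
             + (1 - X) @ np.log(1 - theta + eps).T
             + np.log(alpha + eps))
    log_p -= log_p.max(axis=1, keepdims=True)   # stabilize before exp
    r = np.exp(log_p)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: weighted maximum-likelihood updates (no L1 pull shown here)
    alpha = r.mean(axis=0)
    theta = (r.T @ X) / r.sum(axis=0)[:, None]
    return alpha, theta
```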

  20. Solution • Each θ_ℓk trades off the occurrence probability of pattern k within cluster ℓ against the overall occurrence probability across all graphs

  21. Important Observation: for an active pattern k, the occurrence probability within a graph cluster differs significantly from the overall average

  22. Mining for Active Patterns • The activeness criterion F can be rewritten as a weighted support over the graphs • Hence active patterns can be found by (multiclass) weighted substructure mining

  23. Experiments: RNA Graphs • Each stem is a node • Secondary structure predicted by RNAfold • 0/1 vertex labels (self-loop or not)

  24. Clustering RNA graphs • Three Rfam families • Intron GP I (Int, 30 graphs) • SSU rRNA 5 (SSU, 50 graphs) • RNase bact a (RNase, 50 graphs) • Three bipartition problems • Results evaluated by ROC scores (Area under the ROC curve)

  25. Examples of RNA Graphs

  26. ROC Scores

  27. Number of Patterns & Time

  28. Found Patterns

  29. Summary (EM) • Probabilistic clustering based on substructure representation • Inference helped by graph mining • Many possible extensions • Naïve Bayes • Graph PCA, LFD, CCA • Semi-supervised learning • Applications in Biology?

  30. Graph Boosting. Saigo, H., Kadowaki, T. and Tsuda, K.: A Linear Programming Approach for Molecular QSAR Analysis. International Workshop on Mining and Learning with Graphs, 85-96, 2006.

  31. Graph Regression Problem • Known as the QSAR problem in chemoinformatics • Quantitative Structure-Activity Relationship • Given a graph, predict a real value • Typically, features (descriptors) are given

  32. QSAR with conventional descriptors

  33. Motivation of Graph Boosting • Descriptors are not always available • New features by obtaining informative patterns (i.e., subgraphs) • Greedy pattern discovery by Boosting + gSpan • Linear Programming (LP) Boosting for reducing the number of graph mining calls • Accurate prediction & interpretable results

  34. Molecule as a labeled graph (figure: atoms as labeled vertices, bonds as edges)

  35. QSAR with patterns (figure: molecules represented by 0/1 occurrences of subgraph patterns)

  36. Sparse Regression in a Very High Dimensional Space • G: the set of all possible patterns (intractably large) • Each molecule has a |G|-dimensional feature vector x • Linear regression: f(x) = Σ_{j∈G} α_j x_j • An L1 regularizer makes α sparse • Only a tractable number of patterns is selected

  37. Problem Formulation (LP1-Primal) • ε-insensitive loss with an L1 regularizer: minimize Σ_j |α_j| + C Σ_i (ξ_i^+ + ξ_i^−) subject to f(x_i) − y_i ≤ ε + ξ_i^+ and y_i − f(x_i) ≤ ε + ξ_i^−, with ξ^± ≥ 0 • m: number of training graphs • d = |G| • ξ^+, ξ^−: slack variables • ε: insensitivity parameter
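
A hedged scipy sketch of this LP, with α split into nonnegative parts so the L1 norm becomes linear; `X`, `y`, `C`, and `eps` are illustrative inputs rather than the paper's setup:

```python
import numpy as np
from scipy.optimize import linprog

def lp1_fit(X, y, C=1.0, eps=0.1):
    """L1-regularized eps-insensitive regression solved as an LP.

    Variables: [a_plus (d), a_minus (d), xi_plus (m), xi_minus (m)],
    all nonnegative, with alpha = a_plus - a_minus, so that
    sum|alpha_j| = sum(a_plus + a_minus) at the optimum.
    """
    m, d = X.shape
    c = np.concatenate([np.ones(2 * d), C * np.ones(2 * m)])
    I, Z = np.eye(m), np.zeros((m, m))
    # X @ alpha - y <= eps + xi_plus   and   y - X @ alpha <= eps + xi_minus
    A_ub = np.block([[ X, -X, -I,  Z],
                     [-X,  X,  Z, -I]])
    b_ub = np.concatenate([y + eps, eps - y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    return res.x[:d] - res.x[d:2 * d]   # recover alpha
```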

  38. Dual LP (LP1-Dual) • Primal: a huge number of weight variables, one per pattern • Dual: one variable u_i per graph, but a huge number of constraints, one per pattern j: |Σ_i u_i x_ij| ≤ 1

  39. Column Generation Algorithm for LPBoost (Demiriz et al., 2002) • Start from the dual with no pattern constraints • Add the most violated constraint at each iteration • Guaranteed to converge in finitely many steps (figure: only the used part of the constraint matrix is ever materialized)

  40. Finding the Most Violated Constraint • Constraint for a pattern j (shown again): |Σ_i u_i x_ij| ≤ 1 • The most violated one maximizes |Σ_i u_i x_ij| • This is exactly weighted substructure mining with weights u_i

  41. Algorithm Overview • Iterate: find a new pattern by graph mining with weights u; if all constraints are satisfied, break; otherwise add the new constraint and update u by re-solving LP1-Dual • Finally, convert the dual solution into the primal weights α (see the sketch below)
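
A control-flow sketch of this loop; `mine_best_pattern` and `solve_restricted_dual` are hypothetical stand-ins for the gSpan-based weighted miner and the LP solver:

```python
def graph_boost(graphs, y, mine_best_pattern, solve_restricted_dual,
                tol=1e-6, max_iter=100):
    """Column-generation loop for graph boosting (LPBoost-style sketch).

    mine_best_pattern(graphs, u) -> (pattern, score): weighted
        substructure mining; score = max_j |sum_i u_i x_ij|.
    solve_restricted_dual(patterns, graphs, y) -> u: solves LP1-Dual
        restricted to the patterns collected so far.
    Both callables are placeholders; this shows only the control flow.
    """
    patterns = []
    u = solve_restricted_dual(patterns, graphs, y)  # dual, no constraints yet
    for _ in range(max_iter):
        pattern, score = mine_best_pattern(graphs, u)
        if score <= 1 + tol:          # every dual constraint is satisfied
            break                     # finite convergence reached
        patterns.append(pattern)      # add the most violated constraint
        u = solve_restricted_dual(patterns, graphs, y)
    return patterns, u                # primal alpha follows from the final LP
```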

  42. Speed-up by Adding Multiple Patterns (Multiple Pricing) • So far, only the single most violated pattern is added per iteration • Instead, mine and include the top k patterns at each iteration • This reduces the number of mining calls

  43. Speed-up by multiple pricing

  44. Clearly negative data

  45. Inclusion of Clearly Negative Data (LP2-Primal) • Clearly negative examples are constrained to predict below a predetermined upper bound z, with slack ξ′ • l: number of clearly negative data • z: predetermined upper bound • ξ′: slack variables

  46. Experiments • Data from the Endocrine Disruptors Knowledge Base • 59 compounds labeled with a real value and 61 compounds labeled with a large negative number • The label (target) is the log-transformed relative proliferative potency, log(RPP), normalized between −1 and 1 • Comparison with • Marginalized graph kernel + ridge regression • Marginalized graph kernel + kNN regression

  47. Results with or without clearly negative data (figure comparing LP1 and LP2)

  48. Extracted Patterns • Interpretable, in contrast to the implicit features of the marginalized graph kernel

  49. Summary (Graph Boosting) • Graph Boosting generates patterns and learns their weights simultaneously • Finite convergence by column generation • Results are potentially interpretable by chemists • LP allows flexible constraints and speed-ups

  50. Concluding Remarks • Graph mining can serve as a component inside machine learning algorithms • Weights are essential: please support example weights when implementing item-set/tree/graph mining algorithms • Make implementations available on the web, so that ML researchers can use them
