Graph Mining Applications in Machine Learning Problems




Presentation Transcript


  1. Graph Mining Applications in Machine Learning Problems. Max Planck Institute for Biological Cybernetics. Koji Tsuda

  2. Motivations for graph analysis • Existing methods assume "tables" such as:

  Serial Num | Name | Age | Sex    | Address | …
  0001       | ○○   | 40  | Male   | Tokyo   | …
  0002       | ××   | 31  | Female | Osaka   | …

  • Structured data lie beyond this framework → New methods for analysis are needed

  3. Graphs..

  4. Graph Structures in Biology • Compounds • DNA sequences • RNA secondary structures • Texts in literature (e.g., "Amitriptyline inhibits adenosine uptake") [figure: an RNA base-pairing diagram and a chemical structure diagram]

  5. Overview • Path representation • Graph Kernels … and their disadvantages • Substructure representation • Graph Mining • EM-based Graph Clustering (Tsuda and Kudo, ICML 2006)

  6. Path Representations & Marginalized Graph Kernels

  7. Marginalized Graph Kernels (Kashima, Tsuda, Inokuchi, ICML 2003) • We define a kernel function between graphs • Both vertices and edges are labeled

  8. Label path • Sequence of alternating vertex and edge labels • Generated by a random walk • Uniform initial, transition, and terminal probabilities
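
The random-walk generation of label paths can be sketched as follows. The toy graph, its vertex/edge labels, and the quit probability are illustrative assumptions, not values from the slides.

```python
import random

# Toy labeled graph (assumed for illustration): vertex labels plus, for
# each vertex, a list of (neighbour, edge label) pairs.
vertex_label = {0: "A", 1: "B", 2: "A"}
edges = {0: [(1, "a"), (2, "b")],
         1: [(0, "a"), (2, "c")],
         2: [(0, "b"), (1, "c")]}

def sample_label_path(p_quit=0.3, rng=random):
    """One random walk: uniform initial vertex, uniform transition to a
    neighbour, and a fixed termination probability p_quit at every step."""
    v = rng.choice(sorted(vertex_label))
    path = [vertex_label[v]]                  # a label path starts with a vertex label
    while rng.random() > p_quit:              # continue with probability 1 - p_quit
        v, edge_lab = rng.choice(edges[v])    # uniform over neighbours
        path += [edge_lab, vertex_label[v]]   # alternate edge and vertex labels
    return path

print(sample_label_path())
```

Each sample is an alternating vertex/edge label sequence of odd length; the kernel on the next slides averages a path kernel over this distribution.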

  9. Path-probability vector

  10. Kernel definition • Kernels for paths • Take the expectation over all possible paths! • Marginalized kernels for graphs [figure: two example label paths, A–c–D–b–E and B–c–D–a–A]

  11. Computation • Transition probabilities as above; initial and terminal probabilities omitted for simplicity • A(v): set of paths ending at v • KV: kernel computed from the paths ending at (v, v′) • KV can be written recursively • So the kernel is computed by solving linear equations (polynomial time)
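
The recursive system on this slide can be sketched as one linear solve over the product graph. The uniform probabilities, the delta kernel on vertex labels, and the omission of edge labels are simplifying assumptions made here for brevity.

```python
import numpy as np

def mgk(adj1, lab1, adj2, lab2, p_quit=0.3):
    """Marginalized graph kernel via a product-graph linear system (sketch).

    adj*: 0/1 adjacency matrices (no isolated vertices assumed),
    lab*: vertex label lists. Uniform initial/transition probabilities,
    a shared quit probability, a delta kernel on vertex labels.
    """
    n1, n2 = len(lab1), len(lab2)
    # Transitions: uniform over neighbours, scaled so the walk continues
    # with probability 1 - p_quit at each step.
    t1 = (1 - p_quit) * adj1 / adj1.sum(axis=1, keepdims=True)
    t2 = (1 - p_quit) * adj2 / adj2.sum(axis=1, keepdims=True)
    # match[(v, v')] = 1 iff the two vertex labels agree (delta kernel).
    match = np.array([[float(a == b) for b in lab2] for a in lab1]).ravel()
    # Product-graph transition matrix; each column weighted by label match.
    M = np.einsum("vu,wx->vwux", t1, t2).reshape(n1 * n2, n1 * n2) * match
    s = match / (n1 * n2)                      # uniform initial, matched labels
    # Geometric series over path lengths: K = p_quit^2 * s^T (I - M)^{-1} 1
    r = np.linalg.solve(np.eye(n1 * n2) - M, np.ones(n1 * n2))
    return p_quit ** 2 * s @ r
```

Because each row of M sums to at most (1 − p_quit)², the series converges and the system is always solvable, which is the "polynomial time" claim on the slide.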

  12. Graph Kernel Applications • Chemical Compounds (Mahe et al., 2005) • Protein 3D structures (Borgwardt et al, 2005) • RNA graphs (Karklin et al., 2005) • Pedestrian detection • Signal Processing

  13. Strong points of MGK • Polynomial time computation O(n^3) • Positive definite kernel • Support Vector Machines • Kernel PCA • Kernel CCA • And so on…

  14. Drawbacks of graph kernels • Global similarity measure • Fails to capture subtle differences • Long paths are suppressed • Results are not interpretable • Structural features ignored (e.g., loops) • Without labels, the kernel is always 1

  15. Substructure Representation & Graph Mining

  16. Substructure Representation • 0/1 vector of pattern indicators • Huge dimensionality! • Graph mining is needed to select feature patterns

  17. Graph Mining • Subfield of data mining • Popular at KDD, ICDM, PKDD • Less common at ICML, NIPS • Analysis of graph databases • Frequent substructure mining • Combinatorial algorithms, developed relatively recently • AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004)

  18. Graph Mining • Frequent substructure mining • Enumerate all patterns occurring in at least m graphs • x_i^(k) ∈ {0, 1}: indicator of pattern k in graph i • Support(k): # of graphs in which pattern k occurs
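
With the 0/1 indicators collected into a matrix, the support count is just a column sum; the matrix below is a made-up toy example.

```python
import numpy as np

# x[i, k] = 1 iff pattern k occurs in graph i (toy values, assumed).
x = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 0]])

support = x.sum(axis=0)            # Support(k): # of graphs containing pattern k
m = 2                              # minimum support threshold
frequent = np.flatnonzero(support >= m)
print(support.tolist(), frequent.tolist())  # [3, 2, 1] [0, 1]
```

The mining algorithm never materializes this matrix over all patterns; it enumerates only the patterns whose support reaches m, as the next slides show.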

  19. Enumeration on a Tree-shaped Search Space • Each node has a pattern • Generate nodes from the root, adding one edge at each step

  20. Tree Pruning • Support(g): # of occurrences of pattern g • Anti-monotonicity: a pattern's support never exceeds that of its subpatterns • If support(g) < m, stop exploring! The whole subtree below g is not generated
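
A minimal sketch of the pruning rule, with patterns simplified to sets of labeled edges (real gSpan grows connected subgraphs via DFS codes; the toy database and edge names here are assumptions):

```python
# Toy database (assumed): each graph is represented as a set of labeled edges.
graphs = [{"AB", "BC", "AC"}, {"AB", "BC"}, {"AB", "CD"}]
all_edges = sorted(set().union(*graphs))

def support(pattern):
    """# of database graphs containing every edge of the pattern."""
    return sum(pattern <= g for g in graphs)

def mine(pattern, candidates, m, found):
    """DFS over the pattern tree: add one edge per step, prune by support."""
    for i, e in enumerate(candidates):
        child = pattern | {e}
        if support(child) < m:
            continue                # anti-monotone: no extension of child can
                                    # be frequent, so its whole subtree is skipped
        found.append(child)
        mine(child, candidates[i + 1:], m, found)

found = []
mine(frozenset(), all_edges, m=2, found=found)
print(sorted(tuple(sorted(p)) for p in found))
# [('AB',), ('AB', 'BC'), ('BC',)]
```

The cut-off is sound precisely because support can only shrink as edges are added, which is the anti-monotonicity stated on the slide.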

  21. gSpan (Yan and Han, 2002) • Efficient frequent substructure mining method • DFS codes • Efficient detection of isomorphic patterns • We extend gSpan for our work

  22. Depth First Search (DFS) Code • Each edge of a labeled graph G is encoded as a tuple [i, j, l_i, l_(i,j), l_j], e.g. [0,1,A,a,B] • The DFS Code Tree on G grows codes edge by edge, e.g. [0,1,A,a,B] → [1,2,B,c,A] → [0,2,A,b,A], or alternatives such as [2,0,A,b,A] and [0,3,A,b,C] • Isomorphic patterns (G0, G1) share the same minimum DFS code; any non-minimum DFS code is pruned [figure: a labeled graph G and its DFS Code Tree]

  23. Discriminative patterns • w_i > 0: positive class • w_i < 0: negative class • Weighted substructure mining: find patterns with a large frequency difference between classes • The weighted score is not anti-monotonic: use a bound instead
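
The weighted frequency-difference score can be sketched as a weighted column sum of the indicator matrix; the weights and indicators below are toy assumptions.

```python
import numpy as np

w = np.array([1, 1, -1, -1])   # +1: positive-class graph, -1: negative-class
x = np.array([[1, 1],          # x[i, k]: pattern k occurs in graph i (toy)
              [1, 0],
              [0, 1],
              [0, 0]])

# |sum_i w_i x_i^(k)|: large when a pattern's frequency differs between classes
score = np.abs(w @ x)
print(score.tolist())  # [2, 0]: pattern 0 is discriminative, pattern 1 is not
```

Unlike plain support, this score can grow again when a pattern is extended, which is why the slide notes it is not anti-monotonic and a bound is needed during mining.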

  24. Multiclass version • Multiple weight vectors, one per class • w_i^(ℓ) > 0 if graph i belongs to class ℓ, w_i^(ℓ) < 0 otherwise • Search for patterns overrepresented in a class

  25. Summary of Graph Mining • Efficient way of searching for patterns satisfying predetermined conditions • NP-hard in the worst case, but the actual speed depends on the data • Faster for sparse graphs and diverse kinds of labels

  26. EM-based clustering of graphs (Tsuda and Kudo, ICML 2006)

  27. EM-based graph clustering • Motivation • Learning a mixture model in the feature space of patterns • Basis for more complex probabilistic inference • L1 regularization & Graph Mining • E-step -> Mining -> M-step

  28. Probabilistic Model • x_i: feature vector of graph i (0/1 pattern indicators) • Binomial mixture model • Each component is a product of binomial (0/1) distributions over patterns • π_ℓ: mixing weight for cluster ℓ • θ_ℓ: parameter vector for cluster ℓ

  29. Ordinary EM algorithm • Maximizes the log likelihood • E-step: compute posterior probabilities • M-step: estimate parameters using the posteriors • Both are computationally prohibitive in the huge pattern space (!)
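
For intuition, here is a plain EM loop for a binomial (Bernoulli) mixture over explicit 0/1 feature vectors; run over the full pattern space this is exactly what becomes prohibitive. The smoothing constant and random initialization are my assumptions.

```python
import numpy as np

def em_bernoulli_mixture(x, k, n_iter=50, eps=1e-3, seed=0):
    """EM for a mixture of multivariate Bernoulli components (sketch).

    x: (n, d) 0/1 matrix; d plays the role of the pattern dimension,
    tiny here but astronomically large in the slides' setting.
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    pi = np.full(k, 1.0 / k)                      # mixing weights
    theta = rng.uniform(0.25, 0.75, size=(k, d))  # per-cluster parameters
    for _ in range(n_iter):
        # E-step: posterior responsibilities, computed in the log domain.
        log_p = (np.log(pi)
                 + x @ np.log(theta).T
                 + (1 - x) @ np.log(1 - theta).T)  # shape (n, k)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and Bernoulli parameters.
        nk = resp.sum(axis=0)
        pi = nk / n
        theta = (resp.T @ x + eps) / (nk[:, None] + 2 * eps)  # smoothed
    return pi, theta, resp
```

The L1-regularized version on the following slides avoids touching every coordinate of θ by pinning most of them to baseline constants, so both steps only ever involve the "active" patterns.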

  30. Regularization • L1-regularized log likelihood • Parameters are pulled toward baseline constants: the ML estimates from a single binomial distribution • In the solution, most parameters are exactly equal to these constants

  31. E-step • Active pattern: one whose parameter deviates from its baseline constant • The E-step can be computed with the active patterns only (computable!)

  32. M-step • Putative cluster assignment from the posteriors • Each parameter is solved for separately • Naïve way: solve for all parameters, then identify the active ones • Instead, use graph mining to find the active patterns

  33. Solution • Occurrence probability in a cluster • Overall occurrence probability

  34. Solution

  35. Important Observation • For an active pattern k, the occurrence probability in a graph cluster differs significantly from the average

  36. Mining for Active Patterns • The active-pattern condition can be rewritten as a weighted frequency criterion • So the active patterns can be found by (multiclass) weighted graph mining!
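
The deviation criterion from the two previous slides can be sketched directly: compare each pattern's occurrence probability within a cluster against its overall probability. The responsibilities, indicators, and threshold below are toy assumptions.

```python
import numpy as np

r = np.array([[1., 0.],        # r[i, c]: putative assignment of graph i
              [1., 0.],        # to cluster c (hard assignments, toy values)
              [0., 1.],
              [0., 1.]])
x = np.array([[1, 0],          # x[i, k]: pattern indicators (toy values)
              [1, 0],
              [0, 1],
              [0, 1]])

eta_cluster = r.T @ x / r.sum(axis=0)[:, None]    # occurrence prob. per cluster
eta_overall = x.mean(axis=0)                      # overall occurrence prob.
active = np.abs(eta_cluster - eta_overall) > 0.3  # deviation threshold (assumed)
print(eta_cluster.tolist(), bool(active.all()))
```

In the slides the active set itself is found by multiclass weighted graph mining rather than by scanning all patterns, since enumerating every pattern explicitly is infeasible.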

  37. Experiments: RNA graphs • Each stem becomes a node • Secondary structure predicted by RNAfold • 0/1 vertex label (self-loop or not)

  38. Clustering RNA graphs • Three Rfam families • Intron GP I (Int, 30 graphs) • SSU rRNA 5 (SSU, 50 graphs) • RNase bact a (RNase, 50 graphs) • Three bipartition problems • Results evaluated by ROC scores (Area under the ROC curve)

  39. Examples of RNA Graphs

  40. ROC Scores

  41. No of Patterns & Time

  42. Found Patterns

  43. Conclusion • Substructure representation is better than paths • Probabilistic inference helped by graph mining • Many possible extensions • Naïve Bayes • Graph PCA, LFD, CCA • Semi-supervised learning • Applications in Biology?

  44. Ongoing work..
