
Efficient Mining of Graph-Based Data

Efficient Mining of Graph-Based Data. Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook University of Texas at Arlington Department of Computer Science and Engineering http://cygnus.uta.edu/subdue. Motivation. Structural/relational data Ease of graph representation.


Presentation Transcript


  1. Efficient Mining of Graph-Based Data Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook University of Texas at Arlington Department of Computer Science and Engineering http://cygnus.uta.edu/subdue SRL Workshop

  2. Motivation • Structural/relational data • Ease of graph representation

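  3. Graph-Based Discovery [Figure: three panels labeled Input Database, Substructure S1 (graph form), and Compressed Database. The input contains triangles T1-T4, boxes B1-B4, circle C1, and rectangle R1; S1 is drawn as two object vertices with shapes triangle and square joined by an "on" edge; the compressed database shows S1 vertices in place of its instances.]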

  4. Algorithm • Create substructure for each unique vertex label • Substructures: triangle (4), square (4), circle (1), rectangle (1) [Figure: example graph of triangle, square, circle, and rectangle vertices joined by "on" edges]
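The seeding step on this slide can be sketched as a count of vertex labels. This is a minimal illustration, not Subdue's code: the list `vertex_labels` and the function `initial_substructures` are hypothetical names, and the label counts are taken from the slide.

```python
from collections import Counter

# Vertex labels of the slide's example graph (assumed from the counts on the
# slide; the "on" edges are ignored at this stage).
vertex_labels = ["triangle"] * 4 + ["square"] * 4 + ["circle", "rectangle"]


def initial_substructures(labels):
    """Return one single-vertex candidate per unique label, with its instance count."""
    return Counter(labels)


seeds = initial_substructures(vertex_labels)
for label, count in seeds.most_common():
    print(f"{label} ({count})")
```

Each entry in `seeds` plays the role of a one-vertex substructure that later steps would try to expand.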

  5. Algorithm • Expand best substructure by an edge or an edge plus a neighboring vertex [Figure: the example graph and the substructures produced by expansion, built from triangle, square, circle, and rectangle vertices joined by "on" edges]

  6. Algorithm • Keep only best beam-width substructures on queue • Terminate when queue is empty or #discovered substructures >= limit • Compress graph and repeat to generate hierarchical description • Note: polynomially constrained
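The search loop described on the algorithm slides can be sketched as a generic beam search. This is a simplified illustration under stated assumptions: `beam_search`, `expand`, and `score` are hypothetical names, candidates are any hashable values, and the toy usage below works on strings rather than graphs.

```python
def beam_search(seeds, expand, score, beam_width, limit):
    """Expand the best candidate repeatedly, keeping only `beam_width` candidates queued.

    Stops when the queue is empty or `limit` candidates have been discovered.
    """
    queue = sorted(seeds, key=score, reverse=True)[:beam_width]
    seen = set(queue)
    discovered = []
    while queue and len(discovered) < limit:
        best = queue.pop(0)          # the queue is kept sorted, best first
        discovered.append(best)
        for child in expand(best):   # e.g. add one edge or edge plus vertex
            if child not in seen:
                seen.add(child)
                queue.append(child)
        # Prune: keep only the best beam_width candidates on the queue.
        queue = sorted(queue, key=score, reverse=True)[:beam_width]
    return sorted(discovered, key=score, reverse=True)


# Toy usage (hypothetical): candidates are strings, expansion appends a letter,
# and the score counts occurrences of "a".
def toy_expand(s):
    return [s + ch for ch in "ab"] if len(s) < 3 else []


result = beam_search(["a", "b"], toy_expand, lambda s: s.count("a"),
                     beam_width=2, limit=4)
print(result)
```

In the graph setting, `expand` would return the one-edge extensions of a substructure and `score` would be the compression value from the evaluation metric.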

  7. Evaluation Metric • Substructures evaluated based on ability to compress input graph • Compression measured using minimum description length (DL) • Best substructure S in graph G minimizes: DL(S) + DL(G|S)
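To make the quantity DL(S) + DL(G|S) concrete, the sketch below uses a crude stand-in for description length: the number of vertices plus the number of edges. That proxy is an assumption for illustration only (the slide measures DL with minimum description length, not counts), as are the names `GraphSize` and `description_length_proxy` and the example sizes. It also assumes non-overlapping instances, each replaced by a single vertex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphSize:
    vertices: int
    edges: int

    @property
    def total(self) -> int:
        # Size proxy: vertices + edges (NOT a true bit-level description length).
        return self.vertices + self.edges


def description_length_proxy(graph: GraphSize, sub: GraphSize, instances: int) -> int:
    """Proxy for DL(S) + DL(G|S) when each instance of S collapses to one vertex."""
    compressed = graph.total - instances * (sub.total - 1)
    return sub.total + compressed


g = GraphSize(vertices=10, edges=9)          # hypothetical input graph
pair = GraphSize(vertices=2, edges=1)        # e.g. triangle "on" square
single = GraphSize(vertices=1, edges=0)      # e.g. a lone triangle vertex

print(description_length_proxy(g, pair, instances=4))    # 3 + (19 - 8) = 14
print(description_length_proxy(g, single, instances=4))  # 1 + 19 = 20
```

Under this proxy the two-vertex substructure wins because replacing its four instances shrinks the graph, while a single-vertex substructure compresses nothing.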

  8. Examples

  9. Inexact Graph Match • Some variations may occur between instances • Want to abstract over minor differences • Difference = cost of transforming one graph into a graph isomorphic to the other • Match if cost/size < threshold
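The match rule on this slide can be sketched for very small graphs by brute force. Everything below is an illustrative assumption rather than Subdue's matcher: unit cost per vertex or edge edit, `size` taken as the vertices plus edges of the larger graph, exhaustive search over vertex mappings (factorial time), and the names `transformation_cost`, `graph_size`, and `is_match`.

```python
from itertools import permutations


def transformation_cost(labels_a, edges_a, labels_b, edges_b):
    """Minimum number of unit edits over all vertex mappings (brute force).

    Vertices are label lists; edges are (source, target, label) triples.
    A relabeled edge counts as one deletion plus one insertion.
    """
    n = max(len(labels_a), len(labels_b))
    a = list(labels_a) + [None] * (n - len(labels_a))  # pad the smaller graph
    b = list(labels_b) + [None] * (n - len(labels_b))
    target_edges = set(edges_b)
    best = None
    for perm in permutations(range(n)):
        cost = sum(1 for i in range(n) if a[i] != b[perm[i]])
        mapped = {(perm[u], perm[v], lab) for (u, v, lab) in edges_a}
        cost += len(mapped ^ target_edges)  # edges to delete or insert
        if best is None or cost < best:
            best = cost
    return best


def graph_size(labels, edges):
    return len(labels) + len(edges)


def is_match(labels_a, edges_a, labels_b, edges_b, threshold):
    cost = transformation_cost(labels_a, edges_a, labels_b, edges_b)
    size = max(graph_size(labels_a, edges_a), graph_size(labels_b, edges_b))
    return cost / size < threshold


tri_on_square = (["triangle", "square"], [(0, 1, "on")])
tri_on_circle = (["triangle", "circle"], [(0, 1, "on")])
print(transformation_cost(*tri_on_square, *tri_on_circle))      # one relabeling
print(is_match(*tri_on_square, *tri_on_circle, threshold=0.5))
```

With a threshold of 0.5 the two graphs match (cost 1 over size 3); tightening the threshold to 0.2 rejects the same pair.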

  10. Parallel/Distributed Discovery • Divide graph into P partitions using Metis, distribute to P processors • Each processor performs serial Subdue on local partition • Broadcast best substructures, evaluate on other processors • Master processor stores best global substructures • Close to linear speedup
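The propose, broadcast, and evaluate pattern on this slide can be simulated sequentially. This sketch makes several simplifying assumptions: partitions are plain lists of vertex labels (no Metis, no real processors), "serial discovery" is replaced by picking the most frequent label, and `local_best` and `global_best` are hypothetical names.

```python
from collections import Counter


def local_best(partition):
    """Stand-in for serial discovery: the most frequent vertex label in one partition."""
    return Counter(partition).most_common(1)[0][0]


def global_best(partitions):
    # Each (simulated) processor proposes its local best candidate.
    broadcast = {local_best(p) for p in partitions}
    # Every processor evaluates every broadcast candidate on its own partition;
    # the master sums the local values and keeps the best global candidate.
    totals = {c: sum(Counter(p)[c] for p in partitions) for c in broadcast}
    best = max(totals, key=lambda c: (totals[c], c))  # ties broken by name
    return best, totals[best]


partitions = [
    ["triangle", "triangle", "square"],
    ["square", "square", "circle"],
    ["square", "triangle"],
]
print(global_best(partitions))
```

Here "triangle" is the best candidate in the first partition, but "square" wins once every partition has evaluated it, which is why local results have to be re-scored globally.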

  11. Graph-Based Concept Learning • One graph stores positive examples • One graph stores negative examples • Find substructure that compresses positive graph but not negative graph • (PosEgsNotCovered) + (NegEgsCovered) • Multiple iterations implement a set-covering approach
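The set-covering loop can be sketched with a greedy selection that minimizes the quantity shown on the slide, (PosEgsNotCovered) + (NegEgsCovered). The representation is an assumption made for brevity: examples are sets of features rather than graphs, a candidate "covers" an example when its features are a subset of the example's, and `error` and `set_cover` are hypothetical names.

```python
def error(pattern, positives, negatives):
    """Positive examples not covered plus negative examples covered."""
    pos_not_covered = sum(1 for e in positives if not pattern <= e)
    neg_covered = sum(1 for e in negatives if pattern <= e)
    return pos_not_covered + neg_covered


def set_cover(candidates, positives, negatives):
    """Greedily pick patterns until every positive is covered or no pattern helps."""
    remaining = list(positives)
    chosen = []
    while remaining:
        name = min(candidates, key=lambda n: error(candidates[n], remaining, negatives))
        pattern = candidates[name]
        if not any(pattern <= e for e in remaining):
            break  # the best pattern covers nothing new; stop
        chosen.append(name)
        # Remove the positives just covered and repeat on the rest.
        remaining = [e for e in remaining if not pattern <= e]
    return chosen, remaining


positives = [{"a", "b"}, {"a", "c"}, {"d"}]
negatives = [{"b"}, {"c"}]
candidates = {"A": {"a"}, "B": {"b"}, "D": {"d"}}

print(set_cover(candidates, positives, negatives))
```

The first iteration picks "A" (one positive left uncovered, no negatives covered); the second picks "D" for the remaining positive, so the learned concept is the disjunction of the two.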

  12. Concept-Learning Example [Figure: example graph of object vertices with shape attributes triangle and square, joined by "on" edges]

  13. Concept-Learning Results • Chess endgames (19,257 examples) • Black King is (+) or is not (-) in check • 99.8% FOIL, 99.21% Subdue

  14. More Concept-Learning Results • Tic-Tac-Toe endgames • + is win for X (958 examples) • 100% Subdue, 92.35% FOIL • Bach chorales • Musical sequences (20 sequences) • 100% Subdue, 85.71% FOIL

  15. Graph-Based Clustering • Iterate Subdue until single vertex • Each cluster (substructure) inserted into a classification lattice [Figure: classification lattice with a node labeled Root]

  16. Clustering Example: Animals
      Name       Body Cover      Heart Chamber   Body Temp.    Fertilization
      mammal     hair            four            regulated     internal
      bird       feathers        four            regulated     internal
      reptile    cornified-skin  imperfect-four  unregulated   internal
      amphibian  moist-skin      three           unregulated   external
      fish       scales          two             unregulated   external
      [Figure: graph form of the mammal example: an animal vertex with edges Name, BodyCover, HeartChamber, BodyTemp, and Fertilization leading to mammal, hair, four, regulated, and internal]

  17. Graph-Based Clustering Results
      Animals
        HeartChamber: four, BodyTemp: regulated, Fertilization: internal
          Name: mammal, BodyCover: hair
          Name: bird, BodyCover: feathers
        BodyTemp: unregulated
          Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
          Fertilization: external
            Name: amphibian, BodyCover: moist-skin, HeartChamber: three
            Name: fish, BodyCover: scales, HeartChamber: two

  18. Cobweb Results [Figure: Cobweb hierarchy: animals splits into amphibian/fish, mammal/bird, and reptile; amphibian/fish splits into fish and amphibian; mammal/bird splits into mammal and bird] • Comparison of Subdue and Cobweb results • Subdue lattice produced better generalization, resulting in fewer clusters at higher levels • Subdue lattice identifies overlap between (reptile) and (amphibian/fish)

  19. Clustering Example: DNA

  20. Graph-Based Clustering Results [Figure: clustering rooted at DNA, with discovered substructures drawn as chemical fragments made of C, N, O, P, OH, and CH2 groups] • Coverage • 61% • 68% • 71%

  21. Evaluation of Clusterings • Traditional evaluation: • Not applicable to hierarchical domains • Does not make sense to compare clusters in different subtrees • Not applicable to relational clusterings

  22. Properties of Good Clusterings • Small number of clusters • Large coverage → good generality • Big cluster descriptions • More features → more inferential power • Minimal or no overlap between clusters • More distinct clusters → better defined concepts

  23. New Evaluation Heuristic for Hierarchical Clusterings • Clustering rooted at C with c children H_i, each having |H_i| instances H_{i,k} • distance() measured by inexact graph match • Animals: Subdue CQ = 2.6, Cobweb CQ = 1.7

  24. Graph-Based Data Mining: Application Domains • Biochemical domains • Protein data • DNA data • Toxicology (cancer) data • Spatial-temporal domains • Earthquake data • Aircraft Safety and Reporting System • Telecommunications data • Program source code • Web topology [Figure: web graph of web_page vertices, including a home page, joined by hyperlink edges]

  25. Theoretical Analysis • Galois lattice [Lequiere et al.] • Conceptual graphs [Sowa et al.] • PAC analysis [Jappy et al.]

  26. Graph-based Data Mining • Pattern (substructure) discovery • Hierarchical discovery • Distributed discovery • Concept learning • Clustering • Compression heuristic based on minimum description length

  27. Future Work • Concept learning • Theoretical analysis • Comparison to ILP systems • Clustering • Classification lattice • Hierarchical relational conceptual clustering evaluation metric • Probabilistic substructures • Domains: WWW, source code

  28. Subdue Source Code and Data http://cygnus.uta.edu/subdue
