Mining Scientific Data Sets Using Graphs

Mining Scientific Data Sets Using Graphs George Karypis Department of Computer Science & Engineering University of Minnesota (Michihiro Kuramochi & Mukund Deshpande)

Outline • Mining Scientific Data-sets • Opportunities & Challenges • Using Graphs and Mining them • Pattern Discovery in Graphs • Putting Patterns to Good Use • Going Forward NGDM-02

Data Mining In Scientific Domain • Data mining has emerged as a critical tool for knowledge discovery in large data sets. • It has been extensively used to analyze business, financial, and textual data sets. • The success of these techniques has renewed interest in applying them to various scientific and engineering fields. • Astronomy • Life sciences • Ecosystem modeling • Fluid dynamics • Structural mechanics • … NGDM-02

Challenges in Scientific Data Mining • Most of existing data mining algorithms assume that the data is represented via • Transactions (set of items) • Sequence of items or events • Multi-dimensional vectors • Time series • Scientific datasets with structures, layers, hierarchy, geometry, and arbitrary relations can not be accurately modeled using such frameworks. • e.g., Numerical simulations, 3D protein structures, chemical compounds, etc. Need algorithms that operate on scientific datasets in their native representation NGDM-02

How to Model Scientific Datasets? • There are two basic choices • Treat each dataset/application differently and develop custom representations/algorithms. • Employ a new way of modeling such datasets and develop algorithms that span across different applications! • What should be the properties of this general modeling framework? • Abstract compared with the original raw data. • Yet powerful enough to capture the important characteristics. Labeled directed/undirected topological/geometric graphs and hypergraphs NGDM-02

Modeling Data With Graphs…Going Beyond Transactions Graphs are suitable for capturing arbitrary relations between the various elements. Data Instance Graph Instance Element Vertex Element’s Attributes Vertex Label Relation Between Two Elements Edge Type Of Relation Edge Label Relation between a Set of Elements Hyper Edge Provide enormous flexibility for modeling the underlying data as they allow the modeler to decide on what the elementsshould be and the type of relationsto be modeled NGDM-02

Example: Protein 3D Structure PDB; 1MWP N-Terminal Domain Of The Amyloid Precursor Protein Alzheimer's disease amyloid A4 protein precursor NGDM-02

Example: Fluid Dynamics • Vertices  Vortices • Edges  Proximity NGDM-02

Graph Mining • Goal: Develop algorithms to mine and analyze graph data sets. • Finding patterns in these graphs • Finding groups of similar graphs (clustering) • Building predictive models for the graphs (classification) • Applications • Structural motif discovery • High-throughput screening • Protein fold recognition • VLSI reverse engineering A lot more … • Beyond Scientific Applications • Semantic web • Mining relational profiles • Behavioral modeling • Intrusion detection • Citation analysis • … NGDM-02

Finding Frequent Patterns in Graphs • A pattern is a relation between the object’s elements that is recurringover and over again. • Common structures in a family of chemical compounds or proteins. • Similar arrangements of vortices at different “instances” of turbulent fluid flows. • … • There are two general ways to formally define a pattern in the context of graphs • Arbitrary subgraphs (connected or not) • Induced subgraphs (connected or not) • Frequent pattern discovery translates to frequent subgraph discovery… NGDM-02

Finding Frequent Subgraphs:Input and Output Input • Database of graph transactions. • Undirected simple graph (no loops, no multiples edges). • Each graph transaction has labels associated with its vertices and edges. • Transactions may not be connected. • Minimum support threshold σ. Output • Frequent subgraphs that satisfy the minimum support constraint. • Each frequent subgraph is connected. NGDM-02

FSG Frequent Subgraph Discovery Algorithm Follows an Apriori-style level-by-level approach and grows the patterns one edge-at-a-time. NGDM-02

Computational Challenges Simple operations become complicated & expensive when dealing with graphs… • Candidate generation • To determine if we can join two candidates, we need to perform subgraph isomorphism to determine if they have a common subgraph. • There is no obvious way to reduce the number of times that we generate the same subgraph. • Need to perform graph isomorphism for redundancy checks. • The joining of two frequent subgraphs can lead to multiple candidate subgraphs. • Candidate pruning • To check downward closure property, we need subgraph isomorphism. • Frequency counting • Subgraph isomorphism for checking containment of a frequent subgraph • Key to FSG’s computational efficiency: • Uses an efficient algorithm to determine a canonical labelingof a graph and use these “strings” to perform identity checks. • Uses a sophisticated candidate generation algorithm that reduces the number of times each candidate is generated. • Uses an augmented TID-list based approach to speedup frequency counting. NGDM-02

Candidate Generation Based On Core Detection Multiple candidates for the same core! NGDM-02

Candidate Generation Based On Core Detection Multiple cores between two (k-1)-subgraphs NGDM-02

B B B A A v1 v2 B v0 Canonical Labeling v4 v3 v5 Label = “1 11 100 1000 01000” Label = “1 01 011 0001 00010” NGDM-02

DTP Dataset (chemical compounds)(Random 100K transactions) NGDM-02

DTP Dataset NGDM-02

Topology Is Not Enough (Sometimes) • Graphs arising from physical domains have a strong geometric nature. • This geometry must be taken into account by the data-mining algorithms. • Geometric graphs. • Vertices have physical 2D and 3D coordinates associated with them. NGDM-02

C B A gFSG—Geometric Extension Of FSG • Same input and same output as FSG • Finds frequent geometric connected subgraphs • Geometric version of (sub)graph isomorphism • The mapping of vertices can be translation, rotation, and/or scaling invariant. • The matching of coordinates can be inexact as long as they are within a tolerance radius of r. • R-tolerant geometric isomorphism. NGDM-02

Use Of Geometry • Transformation-invariant signatures enable quick identity checks • Normalized sum of distances from the center to each vertex • A sorted list of edge angles • To compare signatures, use a certain threshold • No canonical labeling • For the subgraph isomorphism, coordinates of vertices also work as strong constraints to narrow down the search space of combination. • Not only the vertex and edge labels, now the coordinates must be matched. • R-tolerance makes the problem of finding all patterns extremely hard. • Patterns that are 2R-isomorphic to each other will not be counted properly NGDM-02

Performance of gFSG • Different number of transactions randomly sampled from DTP dataset • Average transaction size about 23 • Minimum support 1.0% NGDM-02

A discovered pattern NSC 4960 NSC 699181 NSC 40773 NSC 164863 NSC 191370 Example NGDM-02

Putting Patterns to Good Use… NGDM-02

Laboratory and animal testing Production for clinical trials File IND Drug Development Cycle Drug screening/ rational drug design/ direct synthesis Idea for drug target Small scale production NGDM-02

1 2 Discover Frequent Select Discriminating Sub-graphs Features Graph Databases 3 4 Transform Graphs Learn a Classification in Feature Model Representation Graph Classification Approach NGDM-02

Chemical Compound Datasets • Predictive Toxicology Challenge (PTC) • Predicting toxicity (carcinogenicity) of compounds. • Bio assays on four kinds of rodents • 4 Classification Problems -- Approx 400 chemical compounds. • DTP AIDS Antiviral Screen (AIDS) • Predicting anti-HIV activity of compounds. • Assay to measure protection of human cells against HIV infection. • 3 Classification problems -- Approx 40,000 chemical compounds. • Anthrax • Predicting binding ability of compounds with the anthrax toxin. • Expensive molecular dynamics simulation • Collaboration with Dr. Frank Lebeda, USAMRIID • Approx 35,000 chemical compounds NGDM-02

Comparison with PTC and HIV (female mice) (HIV screening) NGDM-02

Anthrax NGDM-02

Comparison of Topological & Geometric Features on Anthrax Dataset NGDM-02

Most Discriminating Subbgraphs (a) On Toxicology (PTC) Dataset (b) On AIDS Dataset (c) On Anthrax Dataset NGDM-02

Moving Forwards • Graphs provide a powerful mechanism to represent relational and physical datasets. • Can be used as a quick prototyping tool to test out whether or not data-mining techniques can help a particular application area and problem. • Their benefits can be realized if there exists an extensive set of efficient and scalable algorithms to mine them… NGDM-02

Research on Pattern Discovery • Robust algorithms for mining 3D geometric graphs • extensive applications in proteomics • Algorithms for finding approximate patterns • allow for a limited number of changes • there is always variation in the physical world • Algorithms to find patterns in single large graphs • what is a pattern? • what is its support? • do we allow for overlap? • … NGDM-02

Research on Classification • Position specific models • a substructure at the surface of the protein is in general more important than the same substructure being buried • Efficient Graph-based kernel methods • algebra of graphs? • … NGDM-02

Research on Clustering • Efficient methods to compute graph similarities • spectral properties? • graph edit distance? • Graph consensus representations • Multiple graph “alignments” • … NGDM-02

Graphs, graphs, and more graphs… • Graphs with multi-dimensional labels • Stream graphs • phone-network connections • Hypergraphs • compact representation of set relations • Benchmarks and real-life test cases! NGDM-02

Thank you! http://www.cs.umn.edu/~karypis NGDM-02

Mining Scientific Data Sets Using Graphs