
EECS 800 Research Seminar Mining Biological Data




Presentation Transcript


  1. EECS 800 Research Seminar: Mining Biological Data • Instructor: Luke Huan • Fall 2006

  2. Graph Data Analysis Overview • Methods for Mining Frequent Subgraphs • Mining Variant and Constrained Substructure Patterns • Graph Classification • Graph Clustering • Summary

  3. Why Graph Mining? • Graphs are ubiquitous • Chemical compounds (Cheminformatics) • Protein structures, biological pathways/networks (Bioinformatics) • Program control flow, traffic flow, and workflow analysis • XML databases, Web, and social network analysis • Graphs are a general model • Trees, lattices, sequences, and items are degenerate graphs • Diversity of graphs • Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D) • Complexity of algorithms: many problems are of high complexity

  4. Graph Everywhere • [Figure: yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001)), aspirin, the Internet, and a co-author network]

  5. Labeled Graphs • A labeled graph is a graph where each node and each edge has a label. • [Figure: three example labeled graphs G1, G2, G3]

  6. Pattern Matching • A graph G is subgraph isomorphic to a graph G’, denoted by G ⊆ G’, if there exists a 1-1 mapping from nodes in G to nodes in G’ such that node labels, edges, and edge labels are preserved under the mapping. • A pattern is a graph. Pattern G matches G’ (G occurs in G’) if G ⊆ G’. • Given a label set, a graph space is the collection of graphs whose labels are drawn from that set. • [Figure: a pattern G matched against graphs G1, G2, G3]

  7. Graph Pattern Mining • Frequent subgraphs • A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
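The support definition above can be sketched in Python. This is a toy brute-force subgraph isomorphism check, only practical for very small graphs; all names (`subgraph_iso`, `support`, the graph encoding as node-label dicts plus labeled undirected edges) are illustrative, not taken from any of the systems cited in these slides:

```python
from itertools import permutations

def subgraph_iso(pattern, target):
    """Brute-force test: is `pattern` subgraph-isomorphic to `target`?
    A graph is (node_labels, edges): node_labels maps node -> label,
    edges maps frozenset({u, v}) -> edge label."""
    p_nodes, p_edges = pattern
    t_nodes, t_edges = target
    pn, tn = list(p_nodes), list(t_nodes)
    if len(pn) > len(tn):
        return False
    for image in permutations(tn, len(pn)):
        m = dict(zip(pn, image))          # candidate 1-1 node mapping
        if any(p_nodes[u] != t_nodes[m[u]] for u in pn):
            continue                      # node labels must be preserved
        if all(t_edges.get(frozenset({m[u], m[v]})) == lab
               for (u, v), lab in ((tuple(e), l) for e, l in p_edges.items())):
            return True                   # all edges and edge labels preserved
    return False

def support(pattern, dataset):
    """Occurrence frequency: fraction of database graphs containing `pattern`."""
    return sum(subgraph_iso(pattern, g) for g in dataset) / len(dataset)

# Tiny example: the pattern a-x-b occurs in both database graphs.
g1 = ({1: 'a', 2: 'b', 3: 'c'}, {frozenset({1, 2}): 'x', frozenset({2, 3}): 'y'})
g2 = ({1: 'a', 2: 'b'}, {frozenset({1, 2}): 'x'})
pat = ({0: 'a', 1: 'b'}, {frozenset({0, 1}): 'x'})
print(support(pat, [g1, g2]))   # 1.0
```

With a minimum support threshold of, say, 0.5, the pattern above is frequent; real miners avoid this factorial-time matching via the canonical codes discussed later.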

  8. Examples • Minimum support σ = 2/3 • Induced subgraph isomorphism penalizes any unmatched edges • [Figure: candidate patterns P1–P6 with frequencies f = 3/3, 2/3, 1/3, and 0/3 over graphs G1–G3; patterns marked “+” are induced frequent subgraphs]

  9. Examples • Maximal frequent subgraphs are frequent subgraphs none of whose supergraphs are frequent • Other criteria for selecting subgraphs may be incorporated • σ = 2/3 • [Figure: patterns P1–P6 over graphs G1–G3; the pattern marked “!” (P3) is the maximal frequent subgraph]

  10. Example: Frequent Subgraphs • [Figure: a graph dataset (A), (B), (C) and the frequent patterns (1), (2) at minimum support 2]

  11. Applications • Mining biomolecular structures • Program control flow analysis • Mining XML structures or Web communities • Building blocks for graph classification, clustering, compression, comparison, and correlation analysis

  12. Graph Mining Algorithms • Incomplete beam search – Greedy (Subdue) • Inductive logic programming (WARMR) • Graph theory-based approaches • Edge based • Path based • Tree based

  13. SUBDUE (Holder et al. KDD’94) • Start with single vertices • Expand the best substructures with a new edge • Limit the number of best substructures • Substructures are evaluated by their ability to compress the input graphs • Using minimum description length (DL) • The best substructure S in graph G minimizes DL(S) + DL(G\S) • Terminate when no new substructure is discovered
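The greedy beam-search loop that SUBDUE uses can be sketched generically. This is only the control flow: `extend` and `score` are stand-ins supplied by the caller, and the toy `score` below (occurrences × symbols saved, negated so lower is better) merely imitates the role of the real MDL measure DL(S) + DL(G\S); none of this is SUBDUE's actual code.

```python
def beam_search(seeds, extend, score, beam_width=4, steps=5):
    """SUBDUE-style greedy search: repeatedly expand the current best
    substructures and keep only the top `beam_width` by score (lower is
    better). Stops when no new substructure is discovered."""
    beam = sorted(seeds, key=score)[:beam_width]
    best = beam[0]
    for _ in range(steps):
        candidates = [c for s in beam for c in extend(s)]
        if not candidates:
            break                         # no new substructure discovered
        beam = sorted(candidates, key=score)[:beam_width]
        if score(beam[0]) < score(best):
            best = beam[0]
    return best

# Toy instance: substructures are substrings of a "database" string; the
# stand-in score rewards substrings whose repetition would compress it.
text = "ababab"

def extend(s):
    return [s + c for c in "ab"]          # grow by one symbol, cf. one edge

def score(s):
    # NOT real MDL: each occurrence of s saves len(s) - 1 symbols.
    return -text.count(s) * (len(s) - 1)

print(beam_search(["a", "b"], extend, score))   # 'ababab'
```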

  14. WARMR(Dehaspe et al. KDD’98) • Graphs are represented by Datalog facts • atomel(C, A1, c), bond (C, A1, A2, BT), atomel(C, A2, c) : a carbon atom bound to a carbon atom with bond type BT • WARMR: the first general purpose ILP system • Level-wise search • Simulate Apriori for frequent pattern discovery

  15. Frequent Subgraph Mining Approaches • Edge-based approach • AGM/AcGM: Inokuchi, et al. (PKDD’00) • FSG: Kuramochi and Karypis (ICDM’01) • MoFa, Borgelt and Berthold (ICDM’02) • gSpan: Yan and Han (ICDM’02) • FFSM: Huan, et al. (ICDM’03) • Path-based approach • PATH#: Vanetik and Gudes (ICDM’02, ICDM’04) • Tree-based approach • Gaston: Nijssen and Kok (KDD’04) • SPIN: Huan, et al. (KDD’04)

  16. Properties of Graph Mining Algorithms • Search order • breadth vs. depth • Generation of candidate subgraphs • apriori vs. pattern growth • Elimination of duplicate subgraphs • passive vs. active • Support calculation • embedding store or not • Discovery order of patterns • path  tree  graph

  17. Search DAG • Task: identify all frequently occurring subgraphs from a group of graphs, or a graph database • Support anti-monotonicity • Any supergraph of an infrequent subgraph is infrequent • Known as the Apriori property • Level-wise search • Keeps all patterns of the same size in memory (poor memory utilization) • Depth-first search • Better memory utilization • May repeatedly search patterns in the DAG (redundant candidates)

  18. Apriori-Based Approach • [Figure: k-edge graphs G1, G2, …, Gn are JOINed to form (k+1)-edge candidates G, G’, G’’]

  19. Apriori-Based, Breadth-First Search • Methodology: breadth-search, joining two graphs • AGM (Inokuchi, et al. PKDD’00) • generates new graphs with one more node • FSG (Kuramochi and Karypis ICDM’01) • generates new graphs with one more edge

  20. FSG Algorithm • K = 1 • F_1 = all frequent edges • Repeat • K = K + 1 • C_K = join(F_{K-1}) • F_K = frequent patterns in C_K • Until F_K is empty
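The level-wise loop above can be sketched in Python under one heavy simplification: if every node label in the dataset is distinct (as the later slide on protein networks suggests is often true in practice), a graph reduces to a set of labeled edges and subgraph isomorphism reduces to set containment. Real FSG instead joins two (k−1)-edge graphs sharing a (k−2)-edge core and needs canonical labeling to eliminate duplicates; `fsg_levelwise` and the edge encoding are illustrative names only.

```python
from itertools import combinations

def fsg_levelwise(graphs, minsup):
    """Apriori-style level-wise mining in the shape of FSG's loop.
    Each graph is a frozenset of (node_label, edge_label, node_label)
    edges; distinct node labels are assumed throughout."""
    def sup(pattern):
        return sum(pattern <= g for g in graphs)   # containment = occurrence

    all_edges = set().union(*graphs)
    frequent = {frozenset({e}) for e in all_edges
                if sup(frozenset({e})) >= minsup}  # F_1: frequent edges
    result = set(frequent)
    k = 1
    while frequent:
        k += 1
        # join: unions of two frequent (k-1)-patterns that overlap in k-2 edges
        candidates = {p | q for p, q in combinations(frequent, 2)
                      if len(p | q) == k}
        frequent = {c for c in candidates if sup(c) >= minsup}
        result |= frequent
    return result

g1 = frozenset({('a', 'x', 'b'), ('b', 'y', 'c')})
g2 = frozenset({('a', 'x', 'b'), ('b', 'y', 'c'), ('c', 'z', 'd')})
g3 = frozenset({('a', 'x', 'b')})
pats = fsg_levelwise([g1, g2, g3], minsup=2)
print(len(pats))   # 3 frequent patterns
```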

  21. Join: Key Operation • Join(L) = ∪ join(P, Q) for all P, Q ∈ L • join(P, Q) = {G | P ⊆ G, Q ⊆ G, |G| = |P| + 1, |P| = |Q|} • Two graphs P and Q are joinable if joining them produces a non-empty set • Theorem: two graphs P and Q are joinable if P ∩ Q is a graph of size |P| − 1, i.e., they share a common “core” of size |P| − 1

  22. Multiplicity of Candidates • Case 1: identical vertex labels • [Figure: joining two graphs whose vertices share the label a yields several distinct candidates]

  23. Multiplicity of Candidates • Case 2: the core contains identical labels • Core: the (k−1)-edge subgraph shared by the two joined graphs • [Figure: one join producing several candidates]

  24. Multiplicity of Candidates • Case 3: core multiplicity • [Figure: a join where the two graphs share more than one common core]

  25. PATH (Vanetik and Gudes ICDM’02, ’04) • Apriori-based approach • Building blocks: edge-disjoint paths • Identify all frequent paths • Construct frequent graphs with 2 edge-disjoint paths • Construct graphs with k+1 edge-disjoint paths from graphs with k edge-disjoint paths • Repeat • [Figure: a graph with 3 edge-disjoint paths]

  26. PATH Algorithm • K = 1 • F_1 = all frequent paths • Repeat • K = K + 1 • C_K = join(F_{K-1}) • F_K = frequent patterns in C_K • Until F_K is empty

  27. Challenges • Graph isomorphism • Two graphs may have the same topology though their layouts are different • Subgraph isomorphism • How to compute the support value of a pattern

  28. Graph Isomorphism • Two graphs are isomorphic if they are topologically equivalent

  29. Why Redundant Candidates? • All the algorithms may propose the same candidate several times. • We need to keep track of the identical candidates to • Avoid redundancy in results • Avoid redundant search

  30. Intuitions for Graph Normalization • [Figure: a 1-1 mapping from a graph space to an arbitrary set Σ, with a partial order defined on Σ]

  31. GSPAN • A graph normalization is a 1-1 mapping from a graph space to an arbitrary space (usually a string space) • Deals with graph isomorphism using the DFS code • Start with a single edge • Depth-first enumeration of a pattern space • Add one edge at a time • Yan & Han, ICDM’02

  32. DFS Code • Flatten a graph into a sequence using depth-first search • [Figure: a graph with DFS discovery order 0–4 and edges e0: (0,1), e1: (1,2), e2: (2,0), e3: (2,3), e4: (3,1), e5: (1,4)] • DFS code: (0, 1, x, a, y), (1, 2, y, b, x), (2, 0, x, a, x), (2, 3, x, c, z), (3, 1, z, b, y), (1, 4, y, d, z)
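The flattening step can be sketched in Python: one fixed depth-first traversal emits a 5-tuple (i, j, label_i, edge_label, label_j) per edge, where i and j are discovery indices, forward edges (i < j) introduce a new vertex, and backward edges (i > j) close a cycle. This produces *one* DFS code of the graph (gSpan's canonical form is the minimum over all traversals, shown on a later slide); the encoding via adjacency lists and the function name are illustrative.

```python
def dfs_code(adj, labels, start):
    """Flatten a labeled graph into one DFS code.
    adj: node -> list of (neighbor, edge_label); labels: node -> label."""
    disc = {start: 0}          # vertex -> discovery index
    code = []
    emitted = set()            # undirected edges already output

    def visit(u):
        for v, elab in sorted(adj[u]):        # deterministic neighbor order
            e = frozenset({u, v})
            if e in emitted:
                continue
            emitted.add(e)
            if v in disc:                     # backward edge: closes a cycle
                code.append((disc[u], disc[v], labels[u], elab, labels[v]))
            else:                             # forward edge: new vertex
                disc[v] = len(disc)
                code.append((disc[u], disc[v], labels[u], elab, labels[v]))
                visit(v)

    visit(start)
    return code

# Toy triangle; its code matches the first three tuples of the slide's example.
labels = {0: 'x', 1: 'y', 2: 'x'}
adj = {0: [(1, 'a'), (2, 'a')], 1: [(0, 'a'), (2, 'b')], 2: [(0, 'a'), (1, 'b')]}
code = dfs_code(adj, labels, 0)
print(code)   # [(0, 1, 'x', 'a', 'y'), (1, 2, 'y', 'b', 'x'), (2, 0, 'x', 'a', 'x')]
```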

  33. DFS Code

  34. DFS Lexicographic Order • Let Z be the set of DFS codes of all graphs. Two DFS codes a = (x0, x1, …, xm) and b = (y0, y1, …, yn) satisfy a ≤ b (DFS lexicographic order in Z) if and only if one of the following holds: • there exists t, 0 ≤ t ≤ min(m, n), such that xk = yk for all k < t and xt <e yt (where <e is the order on code edges), or • xk = yk for 0 ≤ k ≤ m and m ≤ n (a is a prefix of b)

  35. DFS Code Example • We have γ < β < α

  36. DFS Code Extension • Let a be the minimum DFS code of a graph G and b be a non-minimum DFS code of G. For any graph G’ ⊇ G, we have • minDFS(G’) < b • a is a prefix of minDFS(G’) or minDFS(G’) < a • There is a 1-1 mapping from a graph to its minimum DFS code. • For every graph G’, there exists a G such that G’ ⊇ G and minDFS(G) is a prefix of minDFS(G’)
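The minimum DFS code can be computed, for tiny graphs, by enumerating every DFS traversal and taking the smallest resulting code. The sketch below uses plain Python tuple comparison as a stand-in for gSpan's DFS lexicographic order (which has extra precedence rules between forward and backward edges), so it illustrates the idea rather than reproducing gSpan's exact canonical form; all names are illustrative.

```python
def all_dfs_codes(adj, labels, start):
    """Enumerate every DFS code of a small connected labeled graph whose
    traversal begins at `start`."""
    results = []

    def step(stack, disc, emitted, code):
        while stack:
            u = stack[-1]
            options = [(v, el) for v, el in adj[u]
                       if frozenset({u, v}) not in emitted]
            if not options:
                stack.pop()                # backtrack: u is fully explored
                continue
            for v, el in options:          # branch on every next edge choice
                e = frozenset({u, v})
                if v in disc:              # backward edge: stay at u
                    step(stack[:], dict(disc), emitted | {e},
                         code + [(disc[u], disc[v], labels[u], el, labels[v])])
                else:                      # forward edge: descend into v
                    d2 = dict(disc); d2[v] = len(disc)
                    step(stack + [v], d2, emitted | {e},
                         code + [(disc[u], d2[v], labels[u], el, labels[v])])
            return
        results.append(tuple(code))        # every edge emitted: a full code

    step([start], {start: 0}, frozenset(), [])
    return results

def min_dfs_code(adj, labels):
    """Canonical form: the smallest DFS code over all start vertices."""
    return min(c for s in labels for c in all_dfs_codes(adj, labels, s))

labels = {0: 'x', 1: 'y', 2: 'x'}
adj = {0: [(1, 'a'), (2, 'a')], 1: [(0, 'a'), (2, 'b')], 2: [(0, 'a'), (1, 'b')]}
print(min_dfs_code(adj, labels))
```

Because a pattern is kept only when the code just built equals this minimum, each graph is enumerated exactly once, which is how gSpan avoids the redundant-candidate problem of slide 29.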

  37. gSpan Code • Input: a graph database and a support threshold t • Output: all frequent patterns F • gSpan: F_1 = {frequent node labels}; K = 1; gSpan_enumeration(F_1, K, F) • gSpan_enumeration(F_K, K, F): K = K + 1; for each pattern P in F_K: C = Candidates(P, K); F = F ∪ C; gSpan_enumeration(C, K, F)

  38. How to Propose Candidates • Generate all supergraphs: • Candidate = {G | P ⊆ G, sup(G) >= t, |G| = k} • gSpan method: • Candidate = {G | P ⊆ G, sup(G) >= t, |G| = k, minDFS(P) is a prefix of minDFS(G)} • Right-most expansion

  39. Graph Canonical Code in FFSM • The canonical code θ maps a graph G to a string • [Figure: a pattern P with three adjacency matrices M1, M2, M3] • Code(M1): (1, 1, a)(2, 1, x)(2, 2, b)(3, 1, x)(3, 2, y)(3, 3, b)(4, 2, x)(4, 4, c) < Code(M2): (1, 1, a)(2, 1, x)(2, 2, b)(3, 2, x)(3, 3, c)(4, 1, x)(4, 2, y)(4, 4, b) < Code(M3): (1, 1, a)(2, 2, c)(3, 1, x)(3, 2, x)(3, 3, b)(4, 1, x)(4, 3, y)(4, 4, b) • θ(P) = (1, 1, a)(2, 1, x)(2, 2, b)(3, 1, x)(3, 2, y)(3, 3, b)(4, 2, x)(4, 4, c) • Entry order: (i, j, M_{i,j}) ≤ (k, l, M_{k,l}) if • i < k, or • i = k, j < l, or • i = k, j = l, M_{i,j} ≤ M_{k,l}

  40. The Power of Graph Normalization • [Figure: a 1-1 mapping from a graph space, with a partial order defined on the graph space, to an arbitrary set Σ with a partial order defined on Σ]

  41. MoFa (Borgelt and Berthold ICDM’02) • Extend graphs by adding a new edge • Store embeddings of discovered frequent graphs • Fast support calculation • Also used in later-developed algorithms such as FFSM and GASTON • Local structural pruning

  42. GASTON (Nijssen and Kok KDD’04) • Extend graphs directly • Store embeddings • Separate the discovery of different types of graphs • path  tree  graph • Simple structures are easier to mine and duplication detection is much simpler

  43. Graph Pattern Explosion Problem • If a graph is frequent, all of its subgraphs are frequent ─ the Apriori property • An n-edge frequent graph may have 2^n subgraphs • Among 422 chemical compounds confirmed active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%
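The 2^n figure is just the count of edge subsets, which a quick enumeration confirms; note that 2^20 ≈ 10^6, so a single frequent graph of around 20 edges is already enough to account for output on the scale of the 1,000,000 patterns quoted above. The function name is illustrative.

```python
from itertools import combinations

def count_edge_subsets(n_edges):
    """Count all subsets of an n-edge graph's edges. By the Apriori
    property, each subset of a frequent graph's edges induces a frequent
    subgraph, so a frequent n-edge graph forces up to 2^n patterns."""
    return sum(1 for k in range(n_edges + 1)
               for _ in combinations(range(n_edges), k))

print(count_edge_subsets(10))   # 1024 = 2**10
print(count_edge_subsets(20))   # 1048576 = 2**20, about a million
```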

  44. Closed Frequent Graphs • Motivation: handling the graph pattern explosion problem • Closed frequent graph • A frequent graph G is closed if no supergraph of G carries the same support as G • If some of G’s subgraphs have the same support as G, it is unnecessary to output those subgraphs (non-closed graphs) • Lossless compression: still ensures that the mining result is complete
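The closedness filter can be sketched over a precomputed pattern-to-support map. To keep it runnable, patterns are modeled as frozensets of edges with set containment standing in for subgraph isomorphism; the function name is illustrative.

```python
def closed_patterns(supports):
    """Keep only closed patterns: those with no proper super-pattern of
    equal support. Dropping the rest is lossless, since each non-closed
    pattern's support is recoverable from a closed super-pattern."""
    return {p: s for p, s in supports.items()
            if not any(p < q and s == supports[q] for q in supports)}

supports = {
    frozenset({'e1'}): 3,          # same support as {e1, e2}: not closed
    frozenset({'e2'}): 4,          # higher support than any superset: closed
    frozenset({'e1', 'e2'}): 3,    # closed
}
print(sorted(len(p) for p in closed_patterns(supports)))   # [1, 2]
```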

  45. CLOSEGRAPH (Yan & Han, KDD’03) • A pattern-growth approach • Under what condition can we stop searching a pattern’s children, i.e., terminate early? • Suppose G and G’ are frequent and G is a subgraph of G’. If in every part of a dataset graph where G occurs, G’ also occurs, then we need not grow G, since none of G’s children will be closed except those of G’. • [Figure: k-edge graphs G1, G2, …, Gn growing into (k+1)-edge graphs]

  46. Handling Tricky Exception Cases • [Figure: pattern 1 and pattern 2 matched against graph 1 and graph 2]

  47. Experimental Result • The AIDS antiviral screen compound dataset from NCI/NIH • The dataset contains 43,905 chemical compounds • Among these 43,905 compounds, 423 belong to class CA, 1,081 to class CM, and the remaining are in class CI

  48. Discovered Patterns • [Figure: patterns discovered at minimum supports of 20%, 10%, and 5%]

  49. Do the Odds Beat the Curse of Complexity? • Potentially exponential number of frequent patterns • The worst-case complexity vs. the expected probability • Ex.: suppose Walmart has 10^4 kinds of products • The chance to pick one product: 10^-4 • The chance to pick a particular set of 10 products: 10^-40 • What is the chance this particular set of 10 products is frequent 10^3 times in 10^9 transactions? • Have we solved the NP-hard problem of subgraph isomorphism testing? • No. But the real graphs in biology/chemistry are not so bad • A carbon atom has only 4 bonds, and most proteins in a network have distinct labels
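The Walmart arithmetic works out as follows (a back-of-the-envelope reading of the slide, assuming each of the 10 products is drawn independently with probability 10^-4):

```python
# Probability that one fixed 10-product set appears in a single transaction:
p_single = (10.0 ** -4) ** 10       # = 10^-40

# Expected number of occurrences over 10^9 transactions:
expected = p_single * 10 ** 9       # ≈ 10^-31

print(expected)
# With an expected count around 10^-31, observing the set 10^3 times is
# astronomically unlikely: the worst case is exponential, but the expected
# behavior on realistic data is benign.
```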

  50. Constrained Patterns • Density • Diameter • Connectivity • Degree • Min, Max, Avg
