Steve Reinhardt, Interactive Supercomputing sreinhardt@interactivesupercomputing

A Multi-Level Parallel Implementation of a Program for Finding Frequent Patterns in a Large Sparse Graph
Steve Reinhardt, Interactive Supercomputing, sreinhardt@interactivesupercomputing.com
George Karypis, Dept. of Computer Science, University of Minnesota

Outline
• Problem definition
• Prior work
• Problem and Approach
• Results
• Issues and Conclusions

Graph Datasets
• Flexible and powerful representation
• Evidence extraction and link discovery (EELD)
• Social networks/Web graphs
• Chemical compounds
• Protein structures
• Biological pathways
• Object recognition and retrieval
• Multi-relational datasets

Finding Patterns in Graphs: Many Dimensions
• Structure of the graph dataset
  • many small graphs: graph transaction setting
  • one large graph: single-graph setting
• Type of patterns
  • connected subgraphs
  • induced subgraphs
• Nature of the algorithm
  • Finds all patterns that satisfy the minimum support requirement: complete
  • Finds some of the patterns: incomplete
• Nature of the pattern’s occurrence
  • The pattern occurs exactly in the input graph: exact algorithms
  • There is a sufficiently similar embedding of the pattern in the graph: inexact algorithms
• MIS calculation for frequency
  • exact
  • approximate
  • upper bound
• Algorithm
  • vertical (depth-first)
  • horizontal (breadth-first)
M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. In SIAM International Conference on Data Mining (SDM-04), 2004. http://citeseer.ist.psu.edu/article/kuramochi04finding.html

Single Graph Setting
• Find all frequent subgraphs from a single sparse graph.
• Choice of frequency definition
[Figure: input graph with two example patterns: size 6, frequency = 1; size 7, frequency = 6]

vSIGRAM: Vertical Solution
• Candidate generation by extension
  • Add one more edge to a current embedding.
• Solve MIS on embeddings in the same equivalence class.
• No downward-closure-based pruning
• Two important components
  • Frequency-based pruning of extensions
  • Treefication based on canonical labeling

vSIGRAM: Connection Table
• Frequency-based pruning.
• Trying every possible extension is expensive and inefficient.
• A particular extension might have been tested before.
• Categorize extensions into equivalence classes (in terms of isomorphism), and record whether each class is frequent or not.
• If a class becomes infrequent, never try it in later exploration.
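The bookkeeping described on this slide can be sketched as a small lookup table. This is only an illustration, not the data structure used in vSIGRAM: the key type ExtClass (a triple of labels), the fixed capacity CT_CAPACITY, the linear scan, and all function names are assumptions made for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

#define CT_CAPACITY 256   /* hypothetical fixed capacity, for the sketch only */

/* Hypothetical key for an equivalence class of extensions. */
typedef struct { int src_label, edge_label, dst_label; } ExtClass;

typedef enum { CT_UNKNOWN, CT_FREQUENT, CT_INFREQUENT } CtStatus;

typedef struct {
    ExtClass key[CT_CAPACITY];
    CtStatus status[CT_CAPACITY];
    size_t   n;                    /* number of classes recorded so far */
} ConnTable;

static bool same_class(ExtClass a, ExtClass b)
{
    return a.src_label == b.src_label &&
           a.edge_label == b.edge_label &&
           a.dst_label == b.dst_label;
}

/* Returns the recorded status of class c, or CT_UNKNOWN if never tested. */
CtStatus ct_lookup(const ConnTable *ct, ExtClass c)
{
    for (size_t i = 0; i < ct->n; i++)
        if (same_class(ct->key[i], c))
            return ct->status[i];
    return CT_UNKNOWN;
}

/* Records (or updates) the outcome of testing class c.
 * Returns false only if the table is full. */
bool ct_record(ConnTable *ct, ExtClass c, bool frequent)
{
    CtStatus s = frequent ? CT_FREQUENT : CT_INFREQUENT;
    for (size_t i = 0; i < ct->n; i++) {
        if (same_class(ct->key[i], c)) {
            ct->status[i] = s;
            return true;
        }
    }
    if (ct->n == CT_CAPACITY)
        return false;
    ct->key[ct->n] = c;
    ct->status[ct->n] = s;
    ct->n++;
    return true;
}

/* An extension is worth trying unless its class is known to be infrequent. */
bool ct_should_try(const ConnTable *ct, ExtClass c)
{
    return ct_lookup(ct, c) != CT_INFREQUENT;
}
```

A caller would check ct_should_try before computing the frequency of an extension and call ct_record afterwards. A real implementation would likely hash the key rather than scan linearly.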

Parallelization
• Two clear sources of parallelism in the algorithm
  • Amount of parallelism from each source not known in advance
• The code is typical C code
  • structs, pointers, frequent mallocs/frees of small areas, etc.
  • nothing like the “Fortran”-like (dense linear algebra) examples shown for many parallel programming methods
• Parallel structures need to accommodate dynamic parallelism
  • Dynamic specification of parallel work
  • Dynamic allocation of processors to work
• Chose OpenMP taskq/task constructs
  • Proposed extensions to OpenMP standard
  • Support parallel work being defined in multiple places in a program, but placed on a single conceptual queue and executed accordingly
• ~20 lines of code changes in ~15,000 line program
• Electric Fence was very useful in finding coding errors

Algorithmic Parallelism

vSiGraM(G, MIS_type, f)
    𝓕 ← ∅
    𝓕1 ← all frequent size-1 subgraphs in G
    for each F1 in 𝓕1 do
        M(F1) ← all embeddings of F1
    for each F1 in 𝓕1 do                        // high-level parallelism
        𝓕 ← 𝓕 ∪ vSiGraM-Extend(F1, G, f)
    return 𝓕

vSiGraM-Extend(Fk, G, f)
    𝓕 ← ∅
    for each embedding m in M(Fk) do             // low-level parallelism
        𝓒k+1 ← 𝓒k+1 ∪ {all (k+1)-subgraphs of G containing m}
    for each Ck+1 in 𝓒k+1 do
        if Fk is not the generating parent of Ck+1 then
            continue
        compute Ck+1.freq from M(Ck+1)
        if Ck+1.freq < f then
            continue
        𝓕 ← 𝓕 ∪ vSiGraM-Extend(Ck+1, G, f)
    return 𝓕

Simple Taskq/Task Example

int fib(int n);

int main(void)
{
    int val;
    #pragma intel omp taskq
    val = fib(12345);
    return 0;
}

int fib(int n)
{
    int i;
    int partret[2];
    if (n > 2) {
        #pragma intel omp task
        for (i = n-2; i < n; i++) {
            // slot 0 holds fib(n-2), slot 1 holds fib(n-1)
            partret[i-(n-2)] = fib(i);
        }
        return (partret[0] + partret[1]);
    } else {
        return 1;
    }
}
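The taskq/task pragmas on this slide are Intel-specific. For comparison only, and not as part of the original program, the same recursive pattern can be sketched with the task and taskwait constructs that were standardized in OpenMP 3.0. The names fib_task and fib_parallel are made up for this sketch. Compiled without OpenMP support, the pragmas are ignored and the code runs serially.

```c
#include <stdio.h>

/* Recursive Fibonacci: each recursive call is spawned as its own task. */
long fib_task(int n)
{
    long a, b;

    if (n <= 2)
        return 1;

    /* a and b must be shared so the child tasks write into this frame. */
    #pragma omp task shared(a) firstprivate(n)
    a = fib_task(n - 1);

    #pragma omp task shared(b) firstprivate(n)
    b = fib_task(n - 2);

    /* Wait for both child tasks before reading a and b. */
    #pragma omp taskwait
    return a + b;
}

/* Starts a thread team; one thread seeds the recursion, and the
 * other threads pick up the tasks it generates. */
long fib_parallel(int n)
{
    long result = 0;

    #pragma omp parallel
    {
        #pragma omp single
        result = fib_task(n);
    }
    return result;
}
```

For example, fib_parallel(10) returns 55. One task per call is far too fine-grained for real use; a practical version would switch to a serial loop below some cutoff.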

High-Level Parallelism with taskq/task

// At the bottom of expand_subgraph, after all child
// subgraphs have been identified, start them all.
#pragma intel omp taskq
for (ii=0; ii<sg_set_size(child); ii++) {
    #pragma intel omp task captureprivate(ii)
    {
        SubGraph *csg = sg_set_at(child,ii);
        expand_subgraph(csg, csg->ct, lg, ls, o);
    } // end-task
}

Low-Level Parallelism with taskq/task

#pragma omp parallel shared(nt, priv_es)
{
    #pragma omp master
    {
        nt = omp_get_num_threads(); // #threads in par
        priv_es = (ExtensionSet **)kmp_calloc(nt, sizeof(ExtensionSet *));
    }
    #pragma omp barrier
    #pragma intel omp taskq
    {
        for (i = 0; i < sg_vmap_size(sg); i++) {
            #pragma intel omp task captureprivate(i)
            {
                int th = omp_get_thread_num();
                if (priv_es[th] == NULL) {
                    priv_es[th] = exset_init(128);
                }
                expand_map(sg, ct, ams, i, priv_es[th], lg);
            }
        }
    }
} // end parallel section; next loop is serial reduction
for (i=0; i < nt; i++) {
    if (priv_es[i] != NULL) {
        exset_merge(priv_es[i],es);
    }
}
kmp_free(priv_es);
}

Implementation due to Grant Haab and colleagues from Intel OpenMP library group
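The pattern on this slide (one private result slot per thread, filled by tasks, then merged serially) can be sketched in a self-contained form. This sketch swaps in the standard OpenMP 3.0 task construct for Intel's taskq/task, and replaces ExtensionSet with a plain long accumulator. The functions work_item and sum_with_private_slots are hypothetical stand-ins, not part of the original code. It also builds without OpenMP, in which case it runs serially with a single slot.

```c
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical stand-in for expand_map: the work done for item i. */
static long work_item(int i) { return (long)i * i; }

/* Sums work_item(1..n) using one private accumulator per thread,
 * followed by a serial reduction. Returns -1 on allocation failure. */
long sum_with_private_slots(int n)
{
    int   nt    = 1;
    long *priv  = NULL;
    long  total = 0;

    #pragma omp parallel shared(nt, priv)
    {
        #pragma omp single
        {
            nt   = omp_get_num_threads();
            priv = calloc((size_t)nt, sizeof *priv);
        } /* implicit barrier: every thread now sees nt and priv */

        #pragma omp single
        {
            for (int i = 1; priv != NULL && i <= n; i++) {
                #pragma omp task firstprivate(i)
                {
                    /* Each task writes only to the slot of the thread
                     * that happens to execute it, so no lock is needed. */
                    priv[omp_get_thread_num()] += work_item(i);
                }
            }
        } /* implicit barrier: all queued tasks have finished */
    }

    if (priv == NULL)
        return -1;

    /* Serial reduction over the per-thread slots. */
    for (int t = 0; t < nt; t++)
        total += priv[t];
    free(priv);
    return total;
}
```

For example, sum_with_private_slots(10) returns 385. Adjacent long slots share cache lines, so a tuned version would pad each slot to avoid false sharing.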

Experimental Results
• SGI Altix™, 32 Itanium2™ sockets (64 cores), 1.6 GHz
• 64 GBytes (though not memory limited)
• Linux
• No special dplace/cpuset configuration
• Minimum frequencies chosen to illuminate scaling behavior, not provide maximum performance

Dataset 1 - Chemical

Dataset 2 - Aviation

Performance of High-level Parallelism
• When there is a sufficient quantity of work (i.e., the frequency threshold is low enough)
  • Good speed-ups to 16P
  • Reasonable speed-ups to 30P
  • Little or no benefit above 30P
• No insight into performance plateau

Poor Performance of Low-level Parallelism
• Several possible effects ruled out
  • Granularity of data allocation
  • Barrier before master-only reduction
• Source: highly variable times for register_extension
  • ~100X slower in parallel than serial, …
  • but different instances from execution to execution
• Apparently due to highly variable run-times for malloc
  • Not understood

Issues and Conclusions
• OpenMP taskq/task were straightforward to use in this program and implemented the desired model
• Performance was good to a medium range of processor counts (best 26X on 30P)
• Difficult to gain insight into lack of performance
  • High-level parallelism 30P and above
  • Low-level parallelism

Backup

Datasets

Aviation Dataset
• Generally, vSIGRAM is 2-5 times faster than hSIGRAM (with exact and upper bound MIS)
• Largest pattern contained 13 edges.

Citation Dataset
• However, hSIGRAM can be more efficient, especially with upper bound MIS (ub).
• Largest pattern contained 16 edges.

VLSI Dataset
• Exact MIS never finished.
• Longest pattern contained 5 edges (constraint).

Comparison with SUBDUE • Similar results with SEuS

Summary
• With approximate and exact MIS, vSIGRAM is 2-5 times faster than hSIGRAM.
• With upper bound MIS, however, hSIGRAM can prune a larger number of infrequent patterns.
  • The downward closure property is what makes this pruning possible.
• For some datasets, using exact MIS for frequency counting is simply intractable.
• Compared to SUBDUE, SIGRAM finds more and longer patterns in less runtime.

Thank You!
• A slightly longer version of this paper is also available as a technical report.
• SIGRAM executables will be available for download soon from http://www.cs.umn.edu/~karypis/pafi/

Complete Frequent Subgraph Mining: Existing Work So Far
• Input: A set of graphs (transactions) + support threshold
• Goal: Find all frequently occurring subgraphs in the input dataset.
  • AGM (Inokuchi et al., 2000), vertex-based, may not be connected.
  • FSG (Kuramochi et al., 2001), edge-based, only connected subgraphs
  • AcGM (Inokuchi et al., 2002), gSpan (Yan & Han, 2002), FFSM (Huan et al., 2003), etc. follow FSG’s problem definition.
• Frequency of each subgraph ⇔ The number of supporting transactions.
  • Does not matter how many embeddings are in each transaction.

What is the reasonable frequency definition?
• Two reasonable choices:
• The frequency is determined by the total number of embeddings.
  • Not downward closed.
  • Too many patterns.
  • Artificially high frequency of certain patterns.
• The frequency is determined by the number of edge-disjoint embeddings (Vanetik et al., ICDM 2002).
  • Downward closed.
  • Since each occurrence utilizes different sets of edges, occurrence frequencies are bounded.
  • Solved by finding the maximum independent set (MIS) of the embedding overlap graph.

Embedding Overlap and MIS
• Create an overlap graph and solve MIS
  • Vertex ⇔ Embedding
  • Edge ⇔ Overlap
• Edge-disjoint embeddings: { E1, E2, E3 }, { E1, E2, E4 }
[Figure: overlap graph over the embeddings E1, E2, E3, E4]

OK. Definition is Fine, but …
• MIS-based frequency seems reasonable.
• Next question: how to develop mining algorithms for the single graph setting?

How to Handle Single Graph Setting?
• Issue 1: Frequency counting
  • Exact MIS is often intractable.
• Issue 2: Choice of search scheme
  • Horizontal (breadth-first)
  • Vertical (depth-first)

Issue 1: MIS-Based Frequency
• We considered approximate (greedy) and upper bound MIS too.
  • Approximate MIS may underestimate the frequency.
  • Upper bound MIS may overestimate the frequency.
• MIS is NP-complete and hard to approximate in the general case.
  • In practice, a simple greedy scheme works pretty well.
  • Halldórsson and Radhakrishnan. Greed is good, 1997.
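A greedy frequency estimate on the embedding overlap graph can be sketched as follows. This is a minimal sketch under stated assumptions, not the SIGRAM implementation. The Embedding type, the cap MAX_EMB, and the function names are hypothetical, and the minimum-degree selection rule is assumed as the greedy scheme. Two embeddings are treated as overlapping when they share at least one edge id.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_EMB 64   /* hypothetical cap on embeddings, for the sketch only */

/* An embedding, represented by the ascending ids of the graph edges it uses. */
typedef struct {
    const int *edges;
    size_t     n_edges;
} Embedding;

/* True if the two embeddings share at least one edge. */
bool embeddings_overlap(const Embedding *a, const Embedding *b)
{
    size_t i = 0, j = 0;
    while (i < a->n_edges && j < b->n_edges) {
        if (a->edges[i] == b->edges[j]) return true;
        if (a->edges[i] < b->edges[j]) i++; else j++;
    }
    return false;
}

/* Greedy independent set on the overlap graph: repeatedly keep the
 * remaining embedding with the fewest remaining overlaps, then discard
 * everything it overlaps. The result is a lower bound on the exact MIS
 * size. Returns -1 if n exceeds MAX_EMB. */
int greedy_mis_frequency(const Embedding *emb, int n)
{
    bool adj[MAX_EMB][MAX_EMB] = {{false}};
    bool alive[MAX_EMB];
    int  count = 0;

    if (n > MAX_EMB)
        return -1;

    /* Build the overlap graph: vertex = embedding, edge = overlap. */
    for (int i = 0; i < n; i++) {
        alive[i] = true;
        for (int j = i + 1; j < n; j++)
            adj[i][j] = adj[j][i] = embeddings_overlap(&emb[i], &emb[j]);
    }

    for (;;) {
        int best = -1, best_deg = n + 1;
        for (int i = 0; i < n; i++) {
            int deg = 0;
            if (!alive[i]) continue;
            for (int j = 0; j < n; j++)
                if (alive[j] && adj[i][j]) deg++;
            if (deg < best_deg) { best = i; best_deg = deg; }
        }
        if (best < 0)
            break;                    /* no embeddings left */
        count++;                      /* keep this embedding */
        alive[best] = false;
        for (int j = 0; j < n; j++)   /* drop everything it overlaps */
            if (adj[best][j]) alive[j] = false;
    }
    return count;
}
```

Because the greedy set is a valid independent set but not necessarily a maximum one, the count can only underestimate the exact MIS-based frequency, which matches the behavior noted for approximate MIS.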

Issue 2: Search Scheme
• Frequent subgraph mining ⇔ Exploration in the lattice of subgraphs
• Horizontal
  • Level-wise
  • Candidate generation and pruning
    • Joining
    • Downward closure property
  • Frequency counting
• Vertical
  • Traverse the lattice as if it were a tree.

hSIGRAM: Horizontal Method
• Natural extension of FSG to the single graph setting.
• Candidate generation and pruning.
  • Downward closure property ⇔ Tighter pruning than vertical method
• Two-phase frequency counting
  • All embeddings by subgraph isomorphism
    • Anchor edge list intersection, instead of TID list intersection.
    • Localize subgraph isomorphism
  • MIS for the embeddings
    • Approximate and upper bound MIS give subset and superset respectively.

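TID List Recap
• TID lists of the three size-k subgraphs: { T1, T3 }, { T1, T2, T3 }, { T1, T2, T3 }
• TID(size-(k+1) candidate) ⊆ TID(subgraph 1) ∩ TID(subgraph 2) ∩ TID(subgraph 3) = { T1, T3 }
[Figure: lattice of subgraphs, size k and size k + 1, over transactions T1, T2, T3]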
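The TID list intersection in this recap can be sketched as a merge over sorted lists. The function name tid_intersect and the encoding of transactions as ascending integer ids are assumptions made for illustration. Since the TID list of a size-(k+1) candidate is contained in the intersection of the TID lists of its size-k subgraphs, only the transactions that survive the intersection need to be checked by subgraph isomorphism.

```c
#include <stddef.h>

/* Intersects two ascending TID lists a and b. The common ids are written
 * to out, which must have room for min(na, nb) entries. Returns the number
 * of ids written. */
size_t tid_intersect(const int *a, size_t na,
                     const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, k = 0;

    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            i++;                 /* a[i] cannot be in b */
        } else if (a[i] > b[j]) {
            j++;                 /* b[j] cannot be in a */
        } else {
            out[k++] = a[i];     /* id present in both lists */
            i++;
            j++;
        }
    }
    return k;
}
```

With T1, T2, T3 encoded as 1, 2, 3, intersecting { 1, 3 } with { 1, 2, 3 } twice leaves { 1, 3 }, the candidate set of transactions for the size-(k+1) pattern.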

Anchor Edges
• Each subgraph must appear close enough together.
• Keep one edge for each.
  • Complete embeddings require too much memory.
• Localize subgraph isomorphism.
[Figure: lattice of subgraphs, size k and size k + 1]

Treefication
• Each node in the search space is a subgraph.
• Based on subgraph/supergraph relation
• Avoid visiting the same node in the lattice more than once.
[Figure: lattice of subgraphs (sizes k - 1, k, k + 1) next to the treefied lattice]
