Structure discovery in PPI networks using pattern-based network decomposition

Philip Bachman and Ying Liu • Bioinformatics, Systems biology, Vol. 25 no. 14, 2009 • May 15, 2009

Outline • Introduction • Graph-Theoretic concepts • The algorithm • Algorithm and results • Pattern-based network decomposition • Conclusion

Introduction • The large, complex networks of interactions between proteins provide a lens through which one can examine the structure and function of biological systems. • Previous analyses of these networks: • Large-scale statistical analysis of holistic network properties • Small-scale analysis of local topological features

Introduction • Investigation of meso-scale network structure has been hindered by the computational complexity of structure search in networks. • This article presents an efficient algorithm for performing sub-graph isomorphism queries on a network and demonstrates its computational advantage.

Graph-Theoretic concepts • A graph G is defined as G=(V,E) • Given an ordering o of vertices, one can produce the adjacency matrix Mo • If two graphs Gi and Gj are isomorphic, then there exist orderings oi of Gi and oj of Gj such that Moi=Moj • A canonical labeling:

Graph-Theoretic concepts • One canonical label: consider all possible orderings of a graph’s vertices and select the ordering omax such that Momax is maximized.
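The brute-force labeling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "maximized" means the adjacency matrix is read row by row as a bit string and compared lexicographically, and the function name `canonical_label` is hypothetical. It is only practical for very small graphs, since it tries every ordering.

```python
from itertools import permutations

def canonical_label(n, edges):
    """Return the largest adjacency bit string over all vertex orderings."""
    edge_set = {frozenset(e) for e in edges}
    best = None
    for order in permutations(range(n)):
        # Read the adjacency matrix row by row under this ordering.
        bits = tuple(
            1 if frozenset((order[i], order[j])) in edge_set else 0
            for i in range(n) for j in range(n)
        )
        if best is None or bits > best:
            best = bits
    return best

# Two differently numbered 3-vertex paths receive the same label,
# while a triangle receives a different one.
path_a = canonical_label(3, [(0, 1), (1, 2)])
path_b = canonical_label(3, [(0, 2), (2, 1)])
triangle = canonical_label(3, [(0, 1), (1, 2), (0, 2)])
```

Isomorphic graphs always share a label under this scheme, which is what makes the label usable as a database key.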

Graph-Theoretic concepts • An automorphism of a graph G • a mapping of the graph’s vertices onto each other • any two oi and oj such that Moi=Moj • Aut(G) : the set of all automorphisms of G • The automorphism orbits, A, of a graph G • the minimal non-empty sets of vertices in G that are closed under all mappings in Aut(G) • Refer to the automorphism orbit to which a vertex v belongs as AutOrb(v).

Graph-Theoretic concepts • Define a canonical ID for each automorphism orbit of a graph G: examine the canonical matrix for G and number the orbits by their order of first appearance in the matrix
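As a small illustration of Aut(G) and automorphism orbits, the sketch below enumerates automorphisms by brute force and groups vertices into orbits. The function names are hypothetical, the approach is only feasible for tiny graphs, and it does not assign the canonical orbit IDs described above (orbits are simply listed by their smallest vertex).

```python
from itertools import permutations

def automorphisms(n, edges):
    """Return every vertex permutation that maps the edge set onto itself."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        mapped = {frozenset((perm[a], perm[b])) for a, b in edges}
        if mapped == edge_set:
            result.append(perm)
    return result

def orbits(n, edges):
    """Group vertices that some automorphism maps onto each other."""
    autos = automorphisms(n, edges)
    seen, result = set(), []
    for v in range(n):
        if v not in seen:
            orbit = frozenset(p[v] for p in autos)
            seen |= orbit
            result.append(orbit)
    return result

# Star graph: centre 0 joined to leaves 1, 2 and 3.
star_edges = [(0, 1), (0, 2), (0, 3)]
star_orbits = orbits(4, star_edges)
```

For the star graph the centre forms one orbit and the three leaves form another, so AutOrb(1) = AutOrb(2) = AutOrb(3).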

The algorithm • To solve the following problem: Given a query graph Gq, find all sub-graphs in a source graph Gs that are isomorphic to Gq • Three components of this algorithm : • First, the basic backtracking search used in previous motif discovery algorithms. • Second, an enhancement using the symmetry-breaking technique presented by (Grochow and Kellis, 2007). • Third, a constraint set added for each vertex v.

The algorithm • Constraint set : • For each vertex v in Gq, the set of all vertices in Gs that are potentially mappable onto v under some mapping of Gq onto an isomorphic sub-graph of Gs. • Allows quick elimination of candidate vertex pairs during the backtracking search. • Generating these constraint sets requires a constraint database that is populated by performing an initial set of specific sub-graph queries.

The algorithm : Backtracking • Find all sub-graphs in a source graph Gs that are isomorphic to a query graph Gq • For each vertex v in Gs and each vertex u in Gq • A mapping, I, of Gq onto Gs is initialized for each such (v,u) pair • Recursively extend I using each pair of compatible vertices v’ in Gs and u’ in Gq such that v’ and u’ are both adjacent to vertices already in I. • An isomorphism has been found once I maps all vertices in Gq onto compatible vertices in Gs.
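The backtracking search above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' code: graphs are dictionaries mapping each vertex to its set of neighbours, the query is assumed to be connected, "compatible" is taken to mean that every query edge lands on a source edge, and the name `find_mappings` is hypothetical.

```python
def find_mappings(query_adj, source_adj):
    """Return every injective mapping of query vertices that preserves query edges."""
    results = []

    def extend(mapping):
        if len(mapping) == len(query_adj):
            results.append(dict(mapping))
            return
        # Pick an unmapped query vertex adjacent to the partial mapping
        # (any vertex while the mapping is still empty).
        u = next(q for q in query_adj
                 if q not in mapping
                 and (not mapping or any(n in mapping for n in query_adj[q])))
        used = set(mapping.values())
        for v in source_adj:
            if v in used:
                continue
            # v is compatible if every mapped neighbour of u lands on a neighbour of v.
            if all(mapping[n] in source_adj[v] for n in query_adj[u] if n in mapping):
                mapping[u] = v
                extend(mapping)
                del mapping[u]

    extend({})
    return results

# Query: a triangle. Source: a triangle 0-1-2 with a pendant vertex 3.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
source = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
mappings = find_mappings(triangle, source)
```

The single triangle in the source is reported six times, once per automorphism of the query, which is the redundancy that symmetry-breaking removes.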

The algorithm : symmetry-breaking • Backtracking finds each matching sub-graph once per automorphism of Gq, so the number of mappings per match equals the size of Aut(Gq), which can grow factorially with the number of vertices in Gq. • These ‘repeated’ mappings can be avoided by using symmetry-breaking constraints
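The effect of symmetry-breaking is easiest to see for a clique query, where all query vertices lie in a single orbit and every permutation is an automorphism. The sketch below covers only that special case: it requires the images of successive query vertices to increase, so each clique is reported once. It does not reproduce the general conditions of Grochow and Kellis (2007), and the function name is hypothetical.

```python
from math import factorial

def clique_mappings(source_adj, k, break_symmetry):
    """Enumerate mappings of a k-clique query onto the source graph."""
    found = []

    def extend(images):
        if len(images) == k:
            found.append(tuple(images))
            return
        for v in sorted(source_adj):
            if v in images:
                continue
            # Every already-mapped vertex must be adjacent to v.
            if any(v not in source_adj[w] for w in images):
                continue
            # Symmetry-breaking condition: images must increase along the query order.
            if break_symmetry and images and v < images[-1]:
                continue
            images.append(v)
            extend(images)
            images.pop()

    extend([])
    return found

# Source: a complete graph on four vertices; query: a triangle (3-clique).
k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
all_mappings = clique_mappings(k4, 3, break_symmetry=False)
unique_mappings = clique_mappings(k4, 3, break_symmetry=True)
```

Without the condition, the four triangles of the source are each found 3! times; with it, each is found exactly once.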

The algorithm : Generating constraint sets for the source graph • The database used for per-vertex constraint generation contains entries for each query graph Gq: • The entry for Gq is referenced by lc(Gq) • The entry for each Gq is a list of constraint sets, one for each automorphism orbit of Gq • The entries for its automorphism orbits are referenced by their respective canonical IDs.

The algorithm : Generating constraint sets for the source graph • The constraint sets for each query graph Gq can be generated by performing a slightly modified version of the basic backtracking search • Check each vertex vs in Gs against a representative vq from each automorphism orbit A of Gq • If an isomorphic mapping exists which maps vq onto vs, then vs is added to the constraint set for A • This allows the skipping of any future checks for some vs against some vq where vs is already in the constraint set for AutOrb(vq).
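A minimal sketch of this constraint-set generation follows. It is an illustration under assumptions rather than the authors' procedure: the orbit assignment `orbit_of` is supplied by hand, the mapping search is a brute-force stand-in for the modified backtracking, and all names are hypothetical. When a mapping is found, every source vertex in it is recorded under the orbit of the query vertex it matched, which is what lets later checks be skipped.

```python
from itertools import permutations

def find_mapping(query_adj, source_adj, vq, vs):
    """Return one edge-preserving injective mapping sending vq to vs, or None."""
    others = [q for q in query_adj if q != vq]
    pool = [s for s in source_adj if s != vs]
    for images in permutations(pool, len(others)):
        mapping = dict(zip(others, images))
        mapping[vq] = vs
        if all(mapping[b] in source_adj[mapping[a]]
               for a in query_adj for b in query_adj[a]):
            return mapping
    return None

def constraint_sets(query_adj, source_adj, orbit_of):
    """Map each orbit to the source vertices that some orbit member can match."""
    sets = {orbit: set() for orbit in orbit_of.values()}
    representatives = {}
    for q, orbit in orbit_of.items():
        representatives.setdefault(orbit, q)
    for orbit, vq in representatives.items():
        for vs in source_adj:
            if vs in sets[orbit]:
                continue  # already known to be mappable: skip the search
            mapping = find_mapping(query_adj, source_adj, vq, vs)
            if mapping is not None:
                # Record every vertex of the found instance under its orbit.
                for q, s in mapping.items():
                    sets[orbit_of[q]].add(s)
    return sets

# Query: path a-b-c with orbits "end" (a, c) and "middle" (b).
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
orbit_of = {"a": "end", "b": "middle", "c": "end"}
# Source: star with centre 0 and leaves 1-3, plus an isolated vertex 4.
source = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}, 4: set()}
sets = constraint_sets(path, source, orbit_of)
```

In this toy case only the centre can serve as the middle of the path and only the leaves can serve as its ends; the isolated vertex appears in no constraint set and would be eliminated immediately during a later search.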

The algorithm : Generating constraint sets for the source graph • The database can be bootstrapped: if sub-graphs with n vertices will be sampled, start with all connected graphs of some small size k, then use the generated database to accelerate the generation of the database entries for all connected graphs of size k+1, and so on, until size n is reached.

The algorithm : Vertex constraints for a query sub-graph

The algorithm : Vertex constraints for a query sub-graph • The intersection of constraint sets performed in step 4 does not preclude any viable mapping of Gq onto Gs. • Source graph vertices fundamentally incompatible with a query graph vertex will fail to appear in one or more of the constraint sets associated with that vertex.

Algorithm and results • Queries comprising from 8 to 19 vertices • The ratio between the times for unconstrained and constrained searches across sets of 100 randomly generated queries at each edge density and query size

Algorithm and results The cumulative times for processing all 400 queries of various densities at each query size

Algorithm and results • Because some pathological queries required an impractical time to complete, a limit was imposed on search time. • In this plot, the cumulative time for unconstrained searches appears to plateau due to this limit: with the individual query time-out set to 1000 s, the maximum possible cumulative time at each graph size is 400,000 s. • It was common for queries with 15 or more nodes and with edge densities of 0.6 or 0.8 to time out during unconstrained search. • Only one constrained search out of the 3600 performed timed out. • Unconstrained search time was strongly affected by the edge density of a query. The effect of density was drastically reduced when using our constraints.

Algorithm and results • The time required to fill the database • CPU time for exhaustive search of all seven- and eight-vertex graphs. • The database was bootstrapped, and the time required to do so was 4477 s. This is significantly less than 37361 s.

Pattern-based network decomposition • An application of sub-graph queries that allows a PPI network to be decomposed into sub-networks exhibiting specific structural patterns. • Create generalizations of their topological structure • Search for all appearances of each generalization in the source network

Generating and using query patterns • Given a specific form of inter-module interaction, deriving generalized query patterns: • A query pattern may be created at the smallest size which adequately represents some desired structural features • The pattern is then allowed to ‘expand’ to cover larger sub-graphs that still fit the interaction pattern it was designed to represent.

Generating and using query patterns Top row : a four-node clique covering all edges and nodes in a dense eight-node graph Bottom row : overlapping four-node cliques expanding to fully cover overlapping dense six-node graphs

Generating and using query patterns • How a dense module may be covered by a small clique query pattern and how a pair of interacting/overlapping modules may be covered by a small pair of overlapping cliques • A pattern generalization created using this property of graphs can be used to filter the source network. Illustration of four sample pattern generalizations

Generating and using query patterns • Filtering a source network using a generalized query pattern : • All instances of the query in the source network are found • All edges and vertices from the source network that do not appear in some instance of the query are removed • Inspect all regions of the source network that have topologies matching the pattern that the query was designed to represent
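The filtering steps above can be sketched for the simplest pattern, a clique query. This is a toy illustration under assumptions, not the authors' pipeline: instances are found by exhaustive enumeration rather than by the constrained search, the network is a dictionary of neighbour sets, and the function names are hypothetical.

```python
from itertools import combinations

def clique_instances(adj, k):
    """Return every k-vertex set of adj whose members are pairwise adjacent."""
    return [c for c in combinations(sorted(adj), k)
            if all(b in adj[a] for a, b in combinations(c, 2))]

def filter_network(instances):
    """Keep only the vertices and edges that appear in some clique instance."""
    # Inside a clique instance every pair of vertices is an edge of the source.
    kept_edges = {frozenset(p) for inst in instances for p in combinations(inst, 2)}
    kept_vertices = {v for inst in instances for v in inst}
    return kept_vertices, kept_edges

# Toy source network: a four-node clique {0, 1, 2, 3} with a tail 3-4-5.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
       3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4}}
vertices, edges = filter_network(clique_instances(adj, 4))
```

The tail vertices 4 and 5 and their edges are removed, leaving only the region whose topology matches the four-node clique pattern.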

An application of pattern-based network • Core PPI network for Yeast • 17000 interactions between 5000 proteins • All of the generalizations that we searched for appeared as significantly enriched motifs with respect to an ensemble of 100 random networks, indicating potential biological significance. • We occasionally had to determine boundaries between overlapping groups of proteins • These boundaries were determined heuristically, by looking for ‘bottlenecks’ between groups of densely interacting proteins.

An application of pattern-based network • The most immediate results and the largest extracted network(EN) came from using a four-node clique as the query pattern.

An application of pattern-based network • These components likely represent the most evolutionarily conserved core of the Yeast PPI network. It was shown by Wuchty et al. (2003) that proteins participating in four-node cliques have an evolutionary conservation rate that is over 400 times higher than would be expected.

Conclusion • The article presents an algorithm for sub-graph isomorphism search that is more efficient than previous algorithms. • In concert with suitable query patterns that exploit some simple properties of graphs, query-based graph search can be used to examine network structure at a scale that reveals relationships within and between groups of interacting proteins.
