On finding clusters in undirected simple graphs: application to protein complex detection

Today’s lecture will cover the following topics • On finding clusters in undirected simple graphs: application to protein complex detection • DPClus software tool • Introduction to DPClusO • Concept of BiClustering • Concept of DNA sequencing

On finding clusters in undirected simple graphs: application to protein complex detection • Outline • Introduction • Some basic concepts • The proposed algorithm • The DPClus software • Results & Discussion • Conclusions

Introduction • There is no universal definition of a cluster. • But clustering is an important issue. • Consequently there are diverse definitions and various methods. • The major purpose of clustering is finding cohesive groups. • Here, we are going to discuss a graph clustering algorithm.

Introduction Regarding a graph, a cluster is a subgraph whose nodes are densely connected with each other compared to their connections with other nodes in the graph. This is a flexible definition of a cluster. Intuitively, we can recognize two clusters in this arbitrary graph. But it is difficult to draw a big graph revealing its clusters.

Introduction An E. coli protein-protein interaction network consisting of 3007 proteins and 11531 interactions (from Mori Lab, NAIST, Japan). An algorithm is needed to detect locally dense regions.

Introduction Md. Altaf-Ul-Amin, Yoko Shinbo, Kenji Mihara, Ken Kurokawa and Shigehiko Kanaya, “Development and implementation of an algorithm for detection of protein complexes in large interaction networks”, BMC Bioinformatics 7:207, April 2006.

Some basic concepts Two nodes that belong to the same cluster are likely to have more common neighbors than two nodes that do not.

Some basic concepts • The density d of a cluster is the ratio of the number of edges present in it to the maximum possible number of edges in it. • For a cluster with |N| nodes and |E| edges, d = |E|/|E|max = 2|E| / (|N|(|N|-1)). • d is a real number ranging from 0 to 1.
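The density formula can be checked with a few lines of Python. This is a minimal sketch; the function name `density` and the choice to return 0 for fewer than two nodes are assumptions made for illustration, not part of the slides.

```python
def density(num_nodes, num_edges):
    """Density d = 2|E| / (|N| (|N| - 1)) of a simple undirected graph."""
    if num_nodes < 2:
        # Assumption: with fewer than two nodes no edge is possible, report 0.
        return 0.0
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))


# A complete graph on 5 nodes has 10 edges; removing one edge leaves 9.
full = density(5, 10)
one_missing = density(5, 9)
print(full, one_missing)
```

A graph with 8 nodes and 14 edges gives 28/56, that is, a density of 0.5.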

Some basic concepts Density of the total graph = 0.241. The densities of the complexes are relatively higher (d=0.9 and d=1.0 for the two complexes shown).

Some basic concepts Considering density alone is not enough • Both the graphs consist of 8 nodes and both are of density 0.5 • But one of them seems to be a single cluster while the other is divided into two clusters Such situations can be tackled by keeping track of the periphery

Some basic concepts The cluster property of any node n with respect to any cluster k of density dk and size |Nk| is defined as follows: cpnk = |Enk| / (dk × |Nk|). Here, |Enk| is the total number of edges between the node n and the nodes of cluster k. In the two example figures, the cluster property of node f is 0.2 in one and 0.57 in the other.
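The definition of the cluster property translates directly into code. The following is a sketch under stated assumptions: the graph is stored as a dictionary of neighbor sets, the helper names are hypothetical, and a cluster with fewer than two nodes is treated as having density 1 (the slides do not say how that case is handled).

```python
def cluster_density(cluster, adj):
    """Density of the subgraph induced by the node set `cluster`."""
    n = len(cluster)
    if n < 2:
        return 1.0  # assumption: a single node counts as a fully dense cluster
    edges = sum(len(adj[u] & cluster) for u in cluster) // 2
    return 2.0 * edges / (n * (n - 1))


def cluster_property(node, cluster, adj):
    """cp_nk = |E_nk| / (d_k * |N_k|) for a node outside the cluster."""
    e_nk = len(adj[node] & cluster)      # edges between the node and the cluster
    d_k = cluster_density(cluster, adj)
    if d_k == 0:
        return 0.0  # assumption: avoid dividing by zero for an edgeless cluster
    return e_nk / (d_k * len(cluster))


# Toy graph: a triangle a-b-c, plus a node f attached to a and b.
adj = {
    "a": {"b", "c", "f"},
    "b": {"a", "c", "f"},
    "c": {"a", "b"},
    "f": {"a", "b"},
}
cp_f = cluster_property("f", {"a", "b", "c"}, adj)
print(cp_f)
```

Here the triangle has density 1 and f touches 2 of its 3 nodes, so the cluster property of f is 2/3.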

The proposed Algorithm • The proposed algorithm is a sequential constructive algorithm: • It initializes the complex/cluster by choosing a seed node. • It then repeatedly adds other nodes on the basis of priority and some conditions. • The major steps of the algorithm are: • Choosing a seed node. • Selecting a priority node. • Checking necessary conditions before adding a node to a complex.

The proposed Algorithm • Inputs to the algorithm are: • The associated (adjacency) matrix of the network. • A minimum threshold density for the generated clusters. • A parameter to determine how we separate a complex from its periphery. • Outputs of the algorithm are: • Overlapping/non-overlapping complexes whose densities are greater than or equal to the given density.

The proposed Algorithm [Figure: flowchart of the proposed algorithm]

The proposed Algorithm The associated matrix M of the example graph, where Muv = 1 if there is an edge between nodes u and v and 0 otherwise:

M =
0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 0 1 1 1 0 0 0 0 0 0 0 0
0 1 1 0 1 1 0 1 0 0 0 0 0 0
0 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 1 1 1 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 1 0 0 0
0 0 0 1 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1 0 1 0 0 1 1
0 0 0 0 0 0 0 0 1 0 1 0 1 1
0 0 0 0 0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 0 1 1 0 0 1 0

The proposed Algorithm The square of the associated matrix. For u ≠ v, (M2)uv is the number of common neighbors of the nodes u and v; the diagonal entry (M2)uu is the degree of node u.

M2 =
1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 4 2 2 3 2 1 1 0 0 0 0 0 0
1 2 4 3 2 3 1 1 0 0 0 0 0 0
1 2 3 5 2 3 1 0 1 0 0 0 0 0
0 3 2 2 3 2 1 1 0 0 0 0 0 0
1 2 3 3 2 5 0 1 0 0 1 0 0 0
0 1 1 1 1 0 2 0 0 1 0 0 0 0
0 1 1 0 1 1 0 2 0 1 0 0 1 1
0 0 0 1 0 0 0 0 4 2 1 1 2 2
0 0 0 0 0 0 1 1 2 4 0 1 2 2
0 0 0 0 0 1 0 0 1 0 2 0 1 1
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 1 2 2 1 0 4 2
0 0 0 0 0 0 0 1 2 2 1 1 2 3
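The relation between M and its square can be reproduced with plain Python lists. In this sketch the edge list is read off the matrix M of the example (nodes numbered 1 to 14 by row), and the helper names are illustrative only.

```python
# Edges of the 14-node example graph, read off the matrix M (1-based labels).
EDGES = [(1, 2), (2, 3), (2, 4), (2, 6), (3, 4), (3, 5), (3, 6), (4, 5),
         (4, 6), (4, 8), (5, 6), (6, 7), (7, 11), (8, 9), (9, 10), (9, 13),
         (9, 14), (10, 11), (10, 13), (10, 14), (12, 13), (13, 14)]
N = 14


def adjacency_matrix(n, edges):
    """Symmetric 0/1 matrix of an undirected simple graph."""
    m = [[0] * n for _ in range(n)]
    for u, v in edges:
        m[u - 1][v - 1] = 1
        m[v - 1][u - 1] = 1
    return m


def mat_square(m):
    """Plain matrix product m * m."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


M = adjacency_matrix(N, EDGES)
M2 = mat_square(M)

# Off-diagonal entries count common neighbors; the diagonal holds the degrees.
print(M2[1])                          # row of node 2
print([M2[i][i] for i in range(N)])   # degrees of nodes 1..14
```

For instance, nodes 2 and 5 share the neighbors 3, 4 and 6, so the entry in row 2, column 5 of M2 is 3.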

The proposed Algorithm [Figure: the example graph with a weight on each edge] The weights of edges are derived by squaring the associated matrix of the graph: the weight of an edge between nodes u and v is (M2)uv, the number of their common neighbors.

The proposed Algorithm [Figure: the example graph with edge weights and node weights] The weight of a node is the sum of the weights of its connecting edges.
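Edge and node weights for the example graph can be computed as follows. This is a sketch: the graph is held as a dictionary of neighbor sets, the function names are hypothetical, and breaking ties between equally heavy nodes by the smallest label is an assumption made only to get a deterministic seed.

```python
# Edges of the 14-node example graph (1-based labels, rows of the matrix M).
EDGES = [(1, 2), (2, 3), (2, 4), (2, 6), (3, 4), (3, 5), (3, 6), (4, 5),
         (4, 6), (4, 8), (5, 6), (6, 7), (7, 11), (8, 9), (9, 10), (9, 13),
         (9, 14), (10, 11), (10, 13), (10, 14), (12, 13), (13, 14)]


def build_adj(edges):
    """Dictionary mapping each node to the set of its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_weight(u, v, adj):
    """Weight of edge (u, v): the number of common neighbors of u and v."""
    return len(adj[u] & adj[v])


def node_weights(adj):
    """Weight of a node: sum of the weights of its connecting edges."""
    return {u: sum(edge_weight(u, v, adj) for v in adj[u]) for u in adj}


adj = build_adj(EDGES)
weights = node_weights(adj)

# The heaviest node becomes the seed; ties go to the smallest label (assumption).
seed = max(sorted(weights), key=weights.get)
print(weights, seed)
```

Three nodes of the example reach the maximum weight of 10 and six nodes have weight 6, which agrees with the node labels in the figure.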

The proposed Algorithm [Figure: the weighted example graph with the seed node and its neighbors marked]

The proposed Algorithm [Figure: a neighbor is examined; cp of P3 = 1]

The proposed Algorithm [Figure: the cluster after adding the node; d=1.0; its neighbors are marked]

The proposed Algorithm [Figure: d=1.0; cp of P5 = 1]

The proposed Algorithm [Figure: d=1.0; cp of P1 = 1]

The proposed Algorithm [Figure: d=1.0; cp of P4 = 0.75]

The proposed Algorithm [Figure: the remaining graph with its edge and node weights; a new seed is marked]

The proposed Algorithm [Figure: a cluster in the remaining graph; d=1.0]

The proposed Algorithm The remaining graph

The proposed Algorithm Clustering by the proposed algorithm
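The walkthrough above can be imitated with a simplified sketch in non-overlapping mode. It is not the DPClus implementation: the tie-breaking by smallest label, the fallback to node degree when all weights are zero, the density of 1 for a single-node cluster, and the rule of trying every neighbor in priority order before stopping are all assumptions, and the names `d_in` and `cp_in` simply mirror the din and cpin parameters.

```python
EDGES = [(1, 2), (2, 3), (2, 4), (2, 6), (3, 4), (3, 5), (3, 6), (4, 5),
         (4, 6), (4, 8), (5, 6), (6, 7), (7, 11), (8, 9), (9, 10), (9, 13),
         (9, 14), (10, 11), (10, 13), (10, 14), (12, 13), (13, 14)]


def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_weight(u, v, adj):
    return len(adj[u] & adj[v])          # number of common neighbors


def density(nodes, adj):
    n = len(nodes)
    if n < 2:
        return 1.0                       # assumption for a single-node cluster
    edges = sum(len(adj[u] & nodes) for u in nodes) // 2
    return 2.0 * edges / (n * (n - 1))


def cluster_property(v, cluster, adj):
    return len(adj[v] & cluster) / (density(cluster, adj) * len(cluster))


def grow_cluster(seed, adj, d_in, cp_in):
    cluster = {seed}
    while True:
        frontier = set().union(*(adj[u] for u in cluster)) - cluster

        def priority(v):
            links = adj[v] & cluster
            # Heavier total edge weight first, then more edges, then label.
            return (-sum(edge_weight(u, v, adj) for u in links), -len(links), v)

        for v in sorted(frontier, key=priority):
            if (density(cluster | {v}, adj) >= d_in
                    and cluster_property(v, cluster, adj) >= cp_in):
                cluster.add(v)
                break
        else:
            return cluster               # no neighbor passed both checks


def dpclus_sketch(edges, d_in=0.7, cp_in=0.5):
    adj = build_adj(edges)
    clusters = []
    while any(adj.values()):             # stop when no edges remain
        w = {u: sum(edge_weight(u, v, adj) for v in adj[u]) for u in adj}
        if max(w.values()) == 0:
            w = {u: len(adj[u]) for u in adj}   # assumption: fall back to degree
        seed = max(sorted(w), key=w.get)
        cluster = grow_cluster(seed, adj, d_in, cp_in)
        clusters.append(cluster)
        for u in cluster:                # remove the cluster from the graph
            for v in adj.pop(u):
                if v in adj:
                    adj[v].discard(u)
    return clusters


clusters = dpclus_sketch(EDGES)
print(clusters)
```

On the example graph this sketch finds a five-node cluster of density 0.9 and a four-node cluster of density 1.0, followed by a leftover two-node cluster that a minimum-size filter would normally discard.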

Results: Complexes in the E. coli PPI Network
Interaction data from http://dip.mbi.ucla.edu/, shown as pairs of interacting proteins:
DIP:339N GroEL      DIP:1081N PrnP
DIP:1025N CarB      DIP:1026N CarA
DIP:539N MalG       DIP:508N MalE
DIP:124N XerD       DIP:726N XerC
DIP:367N PntB       DIP:366N PntA
DIP:342N SbcC       DIP:572N Gam
The network of E. coli proteins consists of 363 interactions involving a total of 336 proteins.

Results: Complexes in the E. coli PPI Network Components of RNA polymerase (RpoA, RpoB, RpoC, Rsd, RpoZ, RpoD, RpoN, FliA)

Results: Complexes in the E. coli PPI Network Components of ATP synthetase (AtpA, AtpB, AtpE, AtpF, AtpG, AtpH, AtpL)

Results: Complexes in the E. coli PPI Network Proteins involved in cell division (FtsQ, FtsI, FtsW, FtsN, FtsK and FtsL)

Results: Complexes in the E. coli PPI Network Components of DNA polymerase (DnaX, HolA, HolB, HolD and HolC)

Results: Complexes in the S. cerevisiae PPI Network We extracted a set of 12487 unique binary interactions involving 4648 proteins by discarding self-interactions from the PPI data obtained from ftp://ftpmips.gsf.de/yeast/PPI/.

Results: Details of a Group of Predicted Complexes Information on the complexes of size 6 from the set generated using din=0.7 and cpin=0.50 in non-overlapping mode. We considered 15 functional classes: (1) Cell cycle and DNA processing, (2) Protein with binding function or cofactor requirement (structural or catalytic), (3) Protein fate (folding, modification, destination), (4) Biogenesis of cellular components, (5) Cellular transport, transport facilitation and transport routes, (6) Metabolism, (7) Interaction with the cellular environment, (8) Transcription, (9) Energy, (10) Cell rescue, defense and virulence, (11) Cell type differentiation, (12) Cellular communication/signal transduction mechanism, (13) Protein activity regulation, (14) Protein synthesis, and (15) Transposable elements, viral and plasmid proteins.

Results: Hypergeometric distribution • N = Total number of proteins in the network • F = Number of proteins of a functional group in the network • C = Number of proteins in a cluster • k = Number of proteins of a functional group in a cluster. The p-value of a cluster is the probability that C proteins selected at random from the network would contain at least k proteins of the functional group. The lower the p-value, the higher the statistical significance.

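P-value & Hypergeometric distribution Put 3 green and 4 red balls in a box and randomly choose any 3. Writing C(n, r) for the number of ways to choose r items from n: P0(# of red balls is 0) = C(4,0)·C(3,3)/C(7,3) = 1/35; P1(# of red balls is 1) = C(4,1)·C(3,2)/C(7,3) = 12/35; P2(# of red balls is 2) = C(4,2)·C(3,1)/C(7,3) = 18/35; P3(# of red balls is 3) = C(4,3)·C(3,0)/C(7,3) = 4/35. Notice that P0 + P1 + P2 + P3 = 1.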

P-value & Hypergeometric distribution [Figure: the probabilities P0, P1, P2 and P3 plotted against the number of red balls drawn (0, 1, 2, 3)]

P-value & Hypergeometric distribution For the ball example, N=7, F=4, C=3. P(# of red balls ≤ 1) = P0 + P1. P(# of red balls ≥ 2) = 1 - (P0 + P1). In general, P(# of red balls ≥ k) = 1 - (P0 + P1 + … + Pk-1).
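The ball example and the tail probability can be evaluated with the standard library. This is a minimal sketch: the function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def hypergeom_pmf(i, N, F, C):
    """Probability that exactly i of the C drawn items belong to the group of size F."""
    return comb(F, i) * comb(N - F, C - i) / comb(N, C)


def p_value(k, N, F, C):
    """P(at least k group members among C draws) = 1 - (P0 + ... + P(k-1))."""
    return 1.0 - sum(hypergeom_pmf(i, N, F, C) for i in range(k))


# Ball example: N = 7 balls in total, F = 4 red, C = 3 drawn.
probs = [hypergeom_pmf(i, 7, 4, 3) for i in range(4)]
at_least_two_red = p_value(2, 7, 4, 3)
print(probs, at_least_two_red)
```

For a predicted complex, N, F, C and k take the meanings given for the network, the functional group and the cluster, and a small result indicates that the enrichment is unlikely to arise by chance.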

Results: Details of a Group of Predicted Complexes Information on the complexes of size 6 from the set generated using din=0.7 and cpin=0.50 in non-overlapping mode. Protein YDR425w of complex 19 is related to cellular transport, and YIP1, YGL198w, YGL161c and GCS1 are related to vesicular transport. Hence, we predict that YPL095c, the protein of unknown function in this complex, is a transport-related protein, most likely involved in vesicular transport.

Conclusions • In this work, we present an algorithm to detect locally dense regions in undirected simple graphs. • The algorithm can be used to detect protein complexes in large protein-protein interaction networks or co-expressed gene clusters based on microarray data. • It can also be used for protein/gene function prediction by finding complexes/clusters in networks consisting of proteins of known and unknown function. • Also, DPClus can be applied to other networks where finding cohesive groups is of interest. The DPClus software is available at http://kanaya.naist.jp/DPClus/

2. The DPClus Software The DPClus software has been developed based on the proposed algorithm. Md. Altaf-Ul-Amin, Hisashi Tsuji, Ken Kurokawa, Hiroko Asahi, Yoko Shinbo, Shigehiko Kanaya, “DPClus: A Density-periphery Based Graph Clustering Software Mainly Focused on Detection of Protein Complexes in Interaction Networks”, Journal of Computer Aided Chemistry , Vol.7, 150-156, 2006. The DPClus software is available at http://kanaya.naist.jp/DPClus/

The DPClus Software The main window of DPClus

The DPClus Software The input file format

Adjacency matrix:
0 0 1 0 1
0 0 0 1 1
1 0 0 0 1
0 1 0 0 1
1 1 1 1 0

List of edges:
AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH

Adjacency list:
AtpA    AtpB, AtpH
AtpB    AtpA, AtpH
AtpH    AtpB, AtpA, AtpG, AtpE
AtpG    AtpH, AtpE
AtpE    AtpG

[Figure: the corresponding network]
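The three representations describe the same network, and one can be derived from another. The sketch below converts the list of edges into an adjacency list and an adjacency matrix. The function names are hypothetical, the exact file syntax expected by DPClus is not assumed, and the node order used for the matrix is an assumption chosen because it reproduces the matrix shown on the slide.

```python
EDGE_LIST = """AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH"""


def parse_edge_list(text):
    """One edge per line, two whitespace-separated node names."""
    edges = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            edges.append((parts[0], parts[1]))
    return edges


def to_adjacency_list(edges):
    """Map every node to the set of all its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def to_adjacency_matrix(edges, order):
    """Symmetric 0/1 matrix with rows and columns in the given node order."""
    index = {name: i for i, name in enumerate(order)}
    m = [[0] * len(order) for _ in order]
    for u, v in edges:
        m[index[u]][index[v]] = 1
        m[index[v]][index[u]] = 1
    return m


edges = parse_edge_list(EDGE_LIST)
adj_list = to_adjacency_list(edges)
# Assumed node order; the slide does not state which row is which protein.
matrix = to_adjacency_matrix(edges, ["AtpA", "AtpG", "AtpB", "AtpE", "AtpH"])
for row in matrix:
    print(" ".join(str(x) for x in row))
```

Unlike the adjacency list on the slide, which appears to record each edge only where it is first needed, this version lists every neighbor of every node.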
