Biological Networks – Graph Theory and Matrix Theory. Ka-Lok Ng Department of Bioinformatics Asia University. Content. Topological Statistics of the protein interaction networks How to characterize a network ?
Biological Networks – Graph Theory and Matrix Theory Ka-Lok Ng Department of Bioinformatics Asia University
Content
• Topological statistics of the protein interaction networks
• How to characterize a network?
  – Graph theory and topological parameters (node degree, average path length, clustering coefficient, and node degree correlation)
  – Random graph, scale-free network, hierarchical network
  – Evolution of biological networks
Biological Networks - metabolic networks Metabolism is the most basic network of biochemical reactions, which generate energy for driving various cell processes, and degrade and synthesize many different bio-molecules.
Biological Networks - Protein-protein interaction network (PIN)
Proteins perform distinct and well-defined functions, but little is known about how the interactions among them are structured at the cellular level. Protein-protein interactions account for binding interactions and the formation of protein complexes.
• Experiments: yeast two-hybrid method or co-immunoprecipitation
• Limitation: no subcellular location or temporal information
• Do cliques correspond to protein complexes?
www.utoronto.ca/boonelab/proteomics.htm
Biological Networks - PIN
Yeast protein-protein interaction network
• Protein-protein interactions are not random: highly connected proteins are unlikely to interact with each other
• Data from the high-throughput two-hybrid experiment (T. Ito et al., PNAS (2001))
• The full set contains 4549 interactions among 3278 yeast proteins
• 87% of the nodes are in the largest component
• kmax ~ 285
• The figure shows nuclear proteins only
Biological Networks – Gene regulation networks In a gene regulatory network, the protein encoded by a gene can regulate the expression of other genes, for instance, by activating or inhibiting DNA transcription. These genes in turn produce new regulatory proteins that control other genes. Example of a genetic regulatory network of two genes (a and b), each coding for a regulatory protein (A and B).
Biological Networks – Gene regulation networks
• Transcription regulatory network in H. sapiens: data courtesy of Ariadne Genomics, obtained from a literature search; 1449 regulations among 689 proteins
• Transcription regulatory network in E. coli: data (courtesy of Uri Alon) curated from the Regulon database; 606 interactions between 424 operons (by 116 TFs)
• Transcription regulatory network in yeast: from the YPD database; 1276 regulations among 682 proteins by 125 transcription factors (~10 regulated genes per TF). It is part of a bigger genetic regulatory network of 1772 regulations among 908 proteins
Biological Networks – Signal transduction networks
• Signalling steps shown: hormones (first message), receptor, cAMP and Ca++ (second message), phosphorylation
• Nuclear transcription factor NF-kB: control of apoptosis (cell suicide), development of B and T cells, anti-viral and anti-bacterial responses
[Figure: oxidant-induced activation of NF-kB signal transduction]
Biological Networks
• Biological networks are not randomly connected
• Underlying architecture: clustering
• How can they be characterized?
• Are there universal features across different species?
Graph theory – The Bridge Obsession Problem
Find a tour that crosses every bridge exactly once. Leonhard Euler (Switzerland, 1707-1783) showed in 1735 that this is impossible for the bridges of Königsberg.
Eulerian Cycle Problem
• Find a cycle that visits every edge exactly once
• Solvable in linear time
[Figure: a more complicated Königsberg]
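The linear-time claim rests on Euler's criterion: a connected graph has an Eulerian cycle exactly when every vertex has even degree. The following is a minimal sketch of that check, not code from the lecture; the function name and the labels A to D for the four land masses of Königsberg are assumptions made for illustration.

```python
from collections import defaultdict

def has_eulerian_cycle(edges):
    """Return True if the multigraph given by `edges` is connected
    and every vertex has even degree (Euler's criterion)."""
    degree = defaultdict(int)
    adjacency = defaultdict(set)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacency[u].add(v)
        adjacency[v].add(u)
    if not degree:
        return False
    # Depth-first search to check that all vertices with edges are connected.
    start = next(iter(degree))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    connected = len(seen) == len(degree)
    return connected and all(d % 2 == 0 for d in degree.values())

# Seven bridges of Königsberg; A is the island, B and C the river banks,
# D the eastern land mass (labels are hypothetical).
KONIGSBERG = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
              ("A", "D"), ("B", "D"), ("C", "D")]
konigsberg_has_tour = has_eulerian_cycle(KONIGSBERG)
triangle_has_tour = has_eulerian_cycle([(1, 2), (2, 3), (3, 1)])
```

All four land masses have odd degree (5, 3, 3, 3), so the check fails for Königsberg, while a simple triangle passes.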
Hamiltonian Cycle Problem
• Find a cycle that visits every vertex exactly once
• NP-complete
• Game invented by Sir William Hamilton in 1857: a tour around 20 famous cities of the world
Sir W. Hamilton (Irish mathematician), 1805 - 1865
Mapping Problems to Graphs • Arthur Cayley (English mathematician) studied chemical structures of hydrocarbons in the mid-1800s • He used trees (acyclic connected graphs) to enumerate structural isomers Arthur Cayley
Real networks
Many networks show scale-free (power-law) behavior:
• World-Wide Web
• Internet
• Ecological networks (food webs)
• Science collaboration network
• Movie actor collaboration network
• Cellular networks
• Networks in linguistics
• Power and neural networks
• Sexual contacts within a population (important for disease prevention!)
• etc.
Graph Theory
Network representation. A network (graph) consists of a set of elements (vertices) and a set of binary relations (edges).
[Figure: binary relation (A, B), pathway, cluster, hierarchical tree]
Graph Theory – Basic concepts
Different systems call for different graphs:
• Graph: G = (N, E), N = {n1, n2, ..., nN}, E = {e1, e2, ..., eM}, ek = {ni, nj}. Nodes: proteins. Edges: protein interactions.
• Multigraph: ek = {ni, nj} plus duplicate edges, i.e. em = {ni, nj}. Nodes: proteins. Edges: interactions of different sorts (binding and similarity).
• Hypergraph: hyperedge ex = {ni, nj, nk, ...}. Nodes: proteins. Edges: protein complexes.
• Directed hypergraph: hyperedge ex = {ni, nj, ... | nk, nl, ...}. Nodes: substances. Edges: chemical reactions, e.g. A + B → C + D gives eX = {A, B, ... | C, D, ...}.
• Directed graph: ek = (ni, nj), an ordered pair. Nodes: genes and their products. An edge from A to B means gene regulation: gene A regulates the expression of gene B.
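Graph Theory – Basic concepts
• Node degree
• Components
• Complete graph (clique)
• Shortest path length
• Clustering coefficient C_i: if A-B and B-C are linked, then it is highly probable that A-C are linked as well
Two equivalent ways to compute C_i for a node i of degree k_i:
– E_i actual connections among the neighbours of i, out of $\binom{k_i}{2}$ possible connections: $C_i = \frac{2E_i}{k_i(k_i-1)}$
– the number of triangles that include i, divided by $k_i(k_i-1)/2$
Average clustering coefficient: $C = \frac{1}{N}\sum_{i=1}^{N} C_i$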
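As an illustration of the clustering coefficient, here is a small sketch that counts the links among the neighbours of a node and divides by the number of possible links, k(k-1)/2. The adjacency-dictionary layout, the function names, the example graph, and the convention C_i = 0 for nodes of degree below 2 are all assumptions, not taken from the slides.

```python
def clustering_coefficient(adj, i):
    """Fraction of pairs of neighbours of node i that are themselves linked."""
    neighbours = adj[i]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention assumed here for degree-0 and degree-1 nodes
    links = sum(1 for u in neighbours for v in neighbours
                if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean of the clustering coefficients over all nodes."""
    return sum(clustering_coefficient(adj, i) for i in adj) / len(adj)

# Hypothetical example: triangle 1-2-3 with an extra node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
c1 = clustering_coefficient(adj, 1)   # both neighbours are linked
c3 = clustering_coefficient(adj, 3)   # one link among three neighbours
c_avg = average_clustering(adj)
```

Node 3 has three neighbours and only the pair (1, 2) is linked, so its coefficient is 1/3; the average over the four nodes is 7/12.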
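Graph Theory – Vertex adjacency matrix
• For an undirected graph the matrix is symmetric
• ∞ means not directly connected
• Connectivity of node i: k_i = count_j(m_ij = 1)
[Figure: undirected graph on nodes 1-4 with its matrix; node degrees k_i = 1, 3, 1, 1]
[Figure: bipartite graph]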
Graph Theory – Edge adjacency matrix
The edge adjacency matrix E of a graph G is identical to the vertex adjacency matrix A of the line graph of G, L(G): A(L(G)) = E(G). The edges of G are replaced by vertices in L(G), and two vertices of L(G) are connected whenever the corresponding edges of G are adjacent. Like A, the matrix E is symmetric.
Two labelings G1 and G2 of the same graph G are related by a similarity transformation, $P^{-1}A(G_1)P = A(G_2)$, where P is a permutation matrix.
[Figure: graph G with vertices 1-4 and edges a-d, and its line graph L(G) with vertices a-d]
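The construction can be sketched in a few lines: two edges of G are adjacent when they share an endpoint, and that relation is the adjacency matrix of L(G). The edge list below is a hypothetical example (the figure's actual edges are not recoverable from the text), and the function name is an assumption.

```python
def edge_adjacency_matrix(edges):
    """Edge adjacency matrix of a graph given as a list of (u, v) edges.

    Entry (p, q) is 1 when edges p and q share an endpoint, which is the
    vertex adjacency matrix of the line graph L(G).
    """
    m = len(edges)
    E = [[0] * m for _ in range(m)]
    for p in range(m):
        for q in range(p + 1, m):
            if set(edges[p]) & set(edges[q]):
                E[p][q] = E[q][p] = 1
    return E

# Hypothetical graph on vertices 1-4 with edges a, b, c, d.
edges = [(1, 2),  # a
         (2, 3),  # b
         (2, 4),  # c
         (3, 4)]  # d
E = edge_adjacency_matrix(edges)
```

In this example edges a and d share no vertex, so they are the only non-adjacent pair in L(G).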
Graph Theory – average network distance
Interaction path length, or average network distance, d:
• the average of the distances between all pairs of nodes
• f(j) is the frequency of the shortest interaction path length j
• the shortest paths are determined with Floyd's algorithm
The average network distance (diameter) d is given by $d = \frac{\sum_j j\, f(j)}{\sum_j f(j)}$, where j is the shortest path length between two nodes.
Network diameter (global); average network distance (local)
Graph Theory – the shortest path
• Shortest paths are found with the Floyd algorithm, an O(N^3) algorithm
• At iteration n, given three nodes i, j and k, test whether it is shorter to reach j from i by passing through k: $M^{n}_{ij} = \min\{M^{n-1}_{ij},\ M^{n-1}_{ik} + M^{n-1}_{kj}\}$
• This searches all possible paths, e.g. 1-2, 1-2-3, 1-2-4, 2-3, 2-4
[Figure: graph on nodes 1-4, with generic nodes i, j, k]
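The update rule can be tried on the four-node example implied by the paths listed on the slide (node 2 linked to nodes 1, 3 and 4). This is a rough sketch with 0-indexed nodes; the function names are assumptions, and the average distance is taken to be the frequency-weighted mean of the shortest path lengths, which is an assumed reading of the lecture's definition.

```python
import math

def floyd_warshall(n, edges):
    """All-pairs shortest path lengths for an unweighted, undirected graph."""
    M = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v in edges:
        M[u][v] = M[v][u] = 1
    for k in range(n):            # intermediate node
        for i in range(n):
            for j in range(n):
                if M[i][k] + M[k][j] < M[i][j]:
                    M[i][j] = M[i][k] + M[k][j]
    return M

def average_distance(M):
    """Mean shortest path length over all connected pairs: sum j*f(j) / sum f(j)."""
    freq = {}
    n = len(M)
    for i in range(n):
        for j in range(i + 1, n):
            if M[i][j] != math.inf:
                freq[M[i][j]] = freq.get(M[i][j], 0) + 1
    return sum(length * f for length, f in freq.items()) / sum(freq.values())

# Nodes 1-4 of the slide become 0-3 here; node 1 (slide node 2) is the hub.
M = floyd_warshall(4, [(0, 1), (1, 2), (1, 3)])
d = average_distance(M)
```

Three pairs are at distance 1 and three at distance 2, so d = 1.5 for this graph.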
Graph Theory – number of the shortest path in a graph
A nonvanishing element of A(G), A_ij = 1, represents a walk of length one between the vertices i and j. Therefore, in general,
$A_{ij} = 1$ if there is a walk of length one between vertices i and j, and 0 otherwise.
There are walks of various lengths in a given graph. Thus
$A_{ik}A_{kj} = 1$ if there is a walk of length two between vertices i and j passing through the vertex k, and 0 otherwise.
Therefore, the expression $(A^2)_{ij} = \sum_k A_{ik}A_{kj}$ represents the total number of walks of length 2 in G between the vertices i and j. For a walk of length L, we have
$A_{ir}A_{rs}\cdots A_{zj} = 1$ if there is a walk of length L between vertices i and j passing through the vertices r, s, ..., z, and 0 otherwise.
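Summing such products over all intermediate vertices is exactly matrix multiplication, so the entries of the L-th power of A count walks of length L. A small sketch under that reading follows; the helper names and the example graph (one hub linked to three other nodes, 0-indexed) are assumptions.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def count_walks(A, length):
    """Return the matrix power A**length; entry (i, j) counts walks of that
    length between vertices i and j."""
    n = len(A)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(length):
        result = mat_mul(result, A)
    return result

# Hypothetical example: node 1 is linked to nodes 0, 2 and 3.
A = [[0, 1, 0, 0],
     [1, 0, 1, 1],
     [0, 1, 0, 0],
     [0, 1, 0, 0]]
walks2 = count_walks(A, 2)
walks3 = count_walks(A, 3)
```

Here walks2[0][2] is 1 (the single walk 0-1-2) and walks3[0][1] is 3 (0-1-x-1 for x = 0, 2, 3).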
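Graph Theory – Trace of a matrix
Trace of the N x N matrix A: $\mathrm{Tr}\,A = \sum_{i=1}^{N} A_{ii}$
For the adjacency matrix of a graph without loops, Tr A = 0.
The traces of powers of A are graph invariants: $\mathrm{Tr}\,A^2 = 2M$ and $\mathrm{Tr}\,A^3 = 6C_3$, where M is the number of edges and C_3 is the number of three-membered cycles.
For a graph with n loops, Tr A = n (each loop contributing 1 to the diagonal).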
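These invariants are easy to check numerically for a simple graph without loops: half the trace of A squared gives the edge count and one sixth of the trace of A cubed gives the number of triangles. The sketch below uses a hypothetical graph (a triangle with one pendant node); the helper names are assumptions.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(A):
    """Sum of the diagonal entries."""
    return sum(A[i][i] for i in range(len(A)))

# Hypothetical example: triangle 0-1-2 plus node 3 attached to node 2.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
A2 = mat_mul(A, A)
A3 = mat_mul(A2, A)

n_edges = trace(A2) // 2       # Tr A^2 = 2M
n_triangles = trace(A3) // 6   # Tr A^3 = 6 C3
```

For this graph the traces are 0, 8 and 6, which correspond to 4 edges and 1 triangle.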
Random Graph Theory = Graph Theory +Probability
Random Graph Theory = Graph Theory + Probability
Random graph (Erdős and Rényi, 1960): N labeled nodes connected by n edges
• $\binom{N}{2} = N(N-1)/2$ possible edges
• $\binom{N(N-1)/2}{n}$ possible graphs with N nodes and n edges
• Example: for N = 4 there are $\binom{6}{n}$ possible graphs
[Figure: example graphs with N = 4 nodes]
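The counting argument can be made concrete with a short sketch: choose n of the N(N-1)/2 possible edges, either to count the graphs or to draw one at random. The function names and the fixed seed are assumptions for illustration only.

```python
import math
import random

def count_random_graphs(N, n):
    """Number of labeled graphs with N nodes and exactly n edges."""
    return math.comb(N * (N - 1) // 2, n)

def sample_random_graph(N, n, seed=0):
    """Draw one such graph uniformly by picking n distinct node pairs."""
    rng = random.Random(seed)
    possible_edges = [(i, j) for i in range(N) for j in range(i + 1, N)]
    return rng.sample(possible_edges, n)

# For N = 4 there are 6 possible edges, so C(6, n) graphs with n edges.
counts = {n: count_random_graphs(4, n) for n in range(7)}
graph = sample_random_graph(4, 3)
```

For N = 4 the counts for n = 3, 4, 5, 6 are 20, 15, 6 and 1.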
Random Graph Theory – Random network, Scale free network
Connectivity distribution P(k)
In a random network the links are placed at random and most of the nodes have degrees close to <k> = 2E/N. The degree distribution P(k) is a Poisson distribution, $P(k) \sim \langle k\rangle^{k} e^{-\langle k\rangle}/k!$. In many real-life networks the degree distribution has no well-defined peak but follows a power law, $P(k) \sim k^{-\gamma}$, where γ is a constant. Such networks are known as scale-free networks.
• Random network: the Log[P(k)] vs Log[k] plot has a peak; homogeneous nodes; d ~ log N
• Scale-free network: the Log[P(k)] vs Log[k] plot is a line with negative slope; inhomogeneous nodes; d ~ log(log N)
Albert R. and Barabasi A.L. (2002) Rev. Mod. Phys. 74, 47
http://physicsweb.org/box/world/
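The contrast between the two distributions can be illustrated numerically. The sketch below evaluates the Poisson form, which peaks near the mean degree, and estimates a power-law exponent as the least-squares slope of log P(k) against log k. The function names, the mean degree 4.5 and the exponent 2.2 are illustrative assumptions; a plain log-log regression is only one of several ways to estimate the exponent.

```python
import math

def poisson_pmf(k, mean_k):
    """Poisson probability <k>^k e^{-<k>} / k! of degree k."""
    return mean_k ** k * math.exp(-mean_k) / math.factorial(k)

def loglog_slope(ks, ps):
    """Least-squares slope of log(p) versus log(k)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in ps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Random network: the distribution has a peak near the mean degree.
poisson_peak = max(range(40), key=lambda k: poisson_pmf(k, 4.5))

# Scale-free network: an exact power law is a straight line in log-log space.
ks = list(range(1, 51))
power_law = [k ** -2.2 for k in ks]
slope = loglog_slope(ks, power_law)   # close to -2.2
```

On real degree data the slope would be noisy, especially in the tail, so this should be read as a demonstration of the shapes rather than a fitting recipe.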
Example – metabolic pathways
• WIT database (43 organisms); node = substrate, edge = reaction
• Scale-free network, $P(k) \sim k^{-\gamma}$, with γ_in = 2.2 and γ_out = 2.2
• Similar scaling behavior of the connectivity distribution
• Fig. 2d: connectivity distribution averaged over the 43 organisms
• This suggests that metabolic networks belong to the class of scale-free networks
http://ergo.integratedgenomics.com/IGwit/
It is interesting to note that most real networks have 1 < γ < 3.
Random Graph, Scale-free network, Hierarchical network
• Hierarchical network: coexistence of modularity, local clustering, and scale-free behavior
• Scaling of the clustering coefficient: $C_{ave}(k) \sim k^{-\beta}$, with β = 1 for the deterministic hierarchical network model
[Figure panels: node degree distribution; clustering coefficient]
Graph Theory – Network motifs
• The abundance of small loops in the E. coli transcription regulatory network was compared with that in its randomized counterpart
• The transcription network is treated as a directed graph: node = operon (a group of contiguous genes); edge = from an operon that encodes a TF to an operon regulated by that TF
• Three types of motifs (feed-forward loops, single input modules, and dense overlapping regulons) occur much more frequently than in the randomized networks
• There are 13 types of 3-node connected, directed subgraphs
• Feed-forward loops (FFL) are significantly over-represented (40 in the real network vs 7 +/- 5 in the randomized ones)
Reference: S.S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, Nature Genetics, 31(1):64-68 (2002)
Graph Theory – Network motifs
• Feed-forward loop (FFL): a TF X regulates a second TF Y, and both jointly regulate one or more operons Z1, ..., Zn
• Single input module (SIM): a single TF X regulates a set of operons Z1, ..., Zn; X is usually autoregulatory
• Dense overlapping regulons (DOR): a set of operons Z1, ..., Zm are each regulated by a combination of a set of TFs X1, ..., Xn
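Counting the feed-forward loop in a directed graph amounts to finding triples with edges X to Y, Y to Z and X to Z. The following is a minimal sketch of such a count, not the procedure of the cited paper; the function name, the choice to ignore autoregulation edges, and the toy network are assumptions.

```python
def count_feed_forward_loops(edges):
    """Count ordered triples (X, Y, Z) with edges X->Y, Y->Z and X->Z."""
    successors = {}
    for source, target in edges:
        if source != target:  # autoregulation is ignored in this sketch
            successors.setdefault(source, set()).add(target)
    count = 0
    for x, targets_of_x in successors.items():
        for y in targets_of_x:
            for z in successors.get(y, ()):
                if z in targets_of_x:
                    count += 1
    return count

# Hypothetical toy network: one feed-forward loop plus a 3-node cycle.
ffl_edges = [("X", "Y"), ("Y", "Z"), ("X", "Z")]
cycle_edges = [("a", "b"), ("b", "c"), ("c", "a")]
n_ffl = count_feed_forward_loops(ffl_edges + cycle_edges)
```

Comparing such a count with the same count on randomized versions of the network is what indicates whether the motif is over-represented.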
Graph Theory – Node degree correlation
• In random graph models the node degrees are uncorrelated
• Count the frequency P(K1, K2) with which two proteins of connectivity K1 and K2 are connected to each other by a link
• Compare it with the same quantity PR(K1, K2) measured in a randomized version of the same network
• The average node connectivity for a fixed K1 is given by [equation missing in source], where <> denotes the multiple sampling average and the summation runs over all K2 with K1 fixed
• In the randomized version, the node degree of each protein is kept the same as in the original network, whereas its linking partners are chosen at random
[Figure: example graph with linked nodes of connectivity K1 = 2 and K2 = 5]
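One common way to build such a randomized network is repeated edge swapping: pick two links A-B and C-D and rewire them to A-D and C-B, which leaves every node degree unchanged. The sketch below follows that idea; it is an assumed procedure rather than the one used in the lecture, and the function names, the retry limit and the example graph are hypothetical.

```python
import random

def degrees(edges):
    """Degree of every node in an undirected edge list."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg

def degree_pair_counts(edges):
    """Number of links joining nodes of degree K1 and K2 (unordered pairs)."""
    deg = degrees(edges)
    counts = {}
    for a, b in edges:
        key = tuple(sorted((deg[a], deg[b])))
        counts[key] = counts.get(key, 0) + 1
    return counts

def randomize_preserving_degrees(edges, n_swaps, seed=0):
    """Rewire pairs of links (A-B, C-D -> A-D, C-B), keeping all degrees."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:
            c, d = d, c
        if a == d or c == b:
            continue  # would create a self-link
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # would create a duplicate link
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

# Hypothetical example: a ring of 10 nodes with two extra chords.
original = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7)]
shuffled = randomize_preserving_degrees(original, n_swaps=20)
```

Calling degree_pair_counts on the original and on many shuffled copies gives the raw counts behind P(K1, K2) and PR(K1, K2).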
Input - Database of Interacting Proteins (DIP)
DIP: http://dip.doe-mbi.ucla.edu
DIP is a database that documents experimentally determined protein-protein interactions. We analyze the protein-protein interactions of seven different species: C. elegans, D. melanogaster, E. coli, H. pylori, H. sapiens, M. musculus and S. cerevisiae.
• Look for general and species-specific features of the PINs
Results – scale-free network study
• Large standard deviation of k
• Coefficient of determination, r^2 = SSR/SST > 0.90
• To account for the flat plateau and the long tail, assume a short-length scale correction k0 and an exponential cut-off tail at kc
• Yeast: γ ~ 2.1; fly: γ ~ 1.9
[Figure panels: C. elegans, fly, H. sapiens, M. musculus, E. coli, H. pylori, yeast]
Results
Highly connected proteins (k ≥ 25): yeast (39 sequences) and fly (317 sequences)
Most of the sequences do not have high sequence similarity (E-value ≤ 0.01), pointing to different functions
Results
These highly connected proteins were compared pairwise in an all-against-all manner using gapped BLAST (16). None of the sequences showed significant sequence similarity (E-value < 0.001), except the tryptophan protein and SEC27 protein, nuclear pore protein, 26S proteasome regulatory particle chain, and DNA-directed RNA polymerase.
Results
Fig. 4. The logarithm of the normalized frequency distribution of connected paths vs the logarithm of their length for S. cerevisiae (CORE), H. pylori, E. coli, H. sapiens, M. musculus and D. melanogaster.
Results – node degree correlation
Highly connected proteins are unlikely to interact.
[Figure: node degree correlation plots; axis tick values omitted]
Results – Hierarchical structures
$C_{ave}(k) \sim k^{-\beta}$
The plots of Log C_ave(k) vs Log k for the seven species. All the species exhibit a rather flat plateau for small values of k, followed by a rapid fall for large k.
[Figure panels include: yeast, E. coli]
Results – identification of cliques
• Goal: identify protein complexes
• Compute the clustering coefficients
• Find the cliques or pseudo-cliques
Identification of cliques
Theorem: Let $(A^3)_{ij}$ be the (i,j)-th element of $A^3$, where $A_{ij} = 1$ if there is a walk of length one between vertices i and j and 0 otherwise. Then a vertex P_i belongs to some clique if and only if $(A^3)_{ii} \neq 0$.
Example [matrices A and A^3 not reproduced]: the non-zero diagonal entries of A^3 are $(A^3)_{11}$, $(A^3)_{22}$ and $(A^3)_{44}$. Consequently, nodes 1, 2 and 4 belong to cliques. Since a clique must contain at least three vertices, the graph has only one clique.
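The theorem can be tried on a small graph in which nodes 1, 2 and 4 form a triangle and node 3 hangs off node 4. This adjacency matrix is a hypothetical stand-in chosen to reproduce the stated outcome; it is not the matrix from the original example, and the helper names are assumptions.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def clique_members(A):
    """1-indexed vertices whose diagonal entry of A^3 is non-zero."""
    A3 = mat_mul(mat_mul(A, A), A)
    return [i + 1 for i in range(len(A)) if A3[i][i] != 0]

# Hypothetical graph with links 1-2, 1-4, 2-4 and 3-4.
A = [[0, 1, 0, 1],
     [1, 0, 0, 1],
     [0, 0, 0, 1],
     [1, 1, 1, 0]]
members = clique_members(A)
```

Node 3 has a zero diagonal entry in A^3 because no closed walk of length three passes through it, so only nodes 1, 2 and 4 are reported.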
Results - protein complexes
Identification of the cliques of highest degree with protein complexes
We identified all possible cliques within the seven PINs. To relate cliques to protein complexes, our preliminary study considered only the cliques with the largest number of connected proteins. Comparison with data from the BIND database showed that some of these cliques do correspond to protein complexes.
Evolution of Biological Networks
• Databases: DIP and MIPS
• Motif identification: detecting all n-node subgraphs, i.e. all 2-, 3-, 4- and some 5-node motifs (a set of 28 five-node motifs) in the yeast PIN
• The network, which consists of 3183 yeast proteins, contains 1000 to 1,000,000 copies of the specific motif types
Evolution of Biological Networks
• Studied the conservation of 678 (47% of 1443) yeast proteins with an ortholog in each of five higher eukaryotes (A. thaliana, C. elegans, D. melanogaster, M. musculus and H. sapiens) deposited in the InParanoid database
• 47% of the 1443 fully connected pentagons (#11) in yeast have each of their five protein components conserved in each of the five higher eukaryotes
• This result suggests that blocks of cohesive motifs tend to be evolutionarily conserved
Evolution of Biological Networks
Growth model of a scale-free PIN
• New protein nodes are added (gene duplication)
• Preferential attachment
• Redundant links are lost (in an asymmetric fashion)
Evolution of Biological Networks
Growth
1. Start with m0 nodes
2. Add a node with m edges
3. Connect these edges to existing nodes
At time step t there are t + m0 nodes and tm edges. Example: m0 = 3, m = 2.
Preferential attachment
The probability q of connection to node i depends on the degree k_i of this node.
This model leads to the power-law distribution $P(k) = 2m^2 k^{-3} \sim k^{-3}$.
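The growth and preferential attachment steps can be simulated in a few lines. This is a rough sketch under two assumptions that the slide does not state: the connection probability is taken to be proportional to k_i, and the m0 seed nodes start fully linked to one another so that every node has a non-zero degree (which adds m0(m0-1)/2 seed edges to the tm edges counted on the slide). The function name and seed value are also hypothetical.

```python
import random

def grow_network(m0, m, steps, seed=0):
    """Grow a network by adding one node with m links per time step."""
    if m0 < 2 or m > m0:
        raise ValueError("this sketch needs m0 >= 2 and m <= m0")
    rng = random.Random(seed)
    # Assumption: the seed nodes form a complete graph.
    edges = [(i, j) for i in range(m0) for j in range(i + 1, m0)]
    # Each node appears in `stubs` once per link it has, so a uniform
    # draw from `stubs` selects node i with probability proportional to k_i.
    stubs = [node for edge in edges for node in edge]
    for t in range(steps):
        new_node = m0 + t
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for old_node in targets:
            edges.append((new_node, old_node))
            stubs.extend((new_node, old_node))
    return edges

# Parameters from the slide example: m0 = 3, m = 2.
edges = grow_network(m0=3, m=2, steps=200)
nodes = {node for edge in edges for node in edge}
```

A histogram of the node degrees from a long run can then be compared with the predicted k^-3 tail.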
Summary
Protein-protein interaction network
• PINs are not random networks; they have rather heterogeneous structures
• blastp shows that the highly connected proteins do not share sequence similarity
• The plots of Log[Pcum(k)] vs Log[k] indicate that PINs are well approximated by scale-free networks
• a ~ 2: a general biological evolution mechanism across species (growth + preferential attachment model)
• The plots of Log[Pcum(k)] vs Log[k] for fly and yeast seem to deviate at small and large k, which calls for a modification of the growth + preferential attachment model
• Highly connected proteins are unlikely to interact
• The hierarchical network model is a better description for certain species' PINs
Matrix
Permutations
A one-to-one mapping of the set {1, 2, 3, ..., n} onto itself is called a permutation. We denote the permutation σ by σ = j1 j2 ... jn. The number of possible permutations is n!, and the set of them is denoted by S_n. For example, S_2 = {12, 21} and S_3 = {123, 132, 213, 231, 312, 321}.
Consider an arbitrary permutation σ in S_n: σ = j1 j2 ... jn. We say σ is even or odd according to whether there is an even or odd number of pairs (i, k) for which i > k but i precedes k in σ. We then define the sign of σ, written sgn σ, by
$\mathrm{sgn}\,\sigma = 1$ if σ is even, and $\mathrm{sgn}\,\sigma = -1$ if σ is odd.
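The definition translates directly into counting such out-of-order pairs (inversions). A minimal sketch follows; the function name and the tuple representation of a permutation are assumptions.

```python
from itertools import permutations

def sign(perm):
    """Sign of a permutation given as a sequence j1, j2, ..., jn.

    Counts the pairs (i, k) with i > k where i precedes k; the sign is
    +1 for an even count and -1 for an odd count.
    """
    inversions = sum(1
                     for a in range(len(perm))
                     for b in range(a + 1, len(perm))
                     if perm[a] > perm[b])
    return 1 if inversions % 2 == 0 else -1

# Signs of all six permutations in S3.
signs_s3 = {p: sign(p) for p in permutations((1, 2, 3))}
```

For example, 312 has the two inversions (3, 1) and (3, 2) and is therefore even, while 321 has three and is odd.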