Computational functional genomics

Lital Haham, Sivan Pearl

Introduction • Piles of information but only flakes of knowledge. • The existing information: collections of genomic sequences, expression profiles, protein-protein interactions, and many more…

Introduction • Computational biology strives to extract the maximal possible information from known sequences, by classifying them according to their homologous relationships, predicting their biochemical activity, cellular function, 3-dimensional structures and evolutionary origin.

The COG-Clusters of Orthologous Groups of proteins • Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes. • The purpose of COG is to serve as a platform for functional annotation of newly sequenced genomes and for study of genome evolution. • Reflects one-to-one, one-to-many and many-to-many relationships.

The COG-statistics • As of 2003, there were 3307 COGs including 74059 proteins from 43 genomes. • The genomes come from Bacteria, Archaea and Eukaryota. • The database includes 17 functional groups.

The COG- make on your own • The COG construction procedure is based on the notion that any group of at least 3 proteins from distant genomes that are more similar to each other than to any other protein from the same genomes most likely belongs to an orthologous family.

The COG- make on your own • All-against-all protein sequence comparison. • Detect and collapse paralogs. • Detect triangles of mutually genome-specific best hits. • Merge triangles with a common side to form a COG.
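The triangle steps above can be sketched in a few lines of Python. This is a minimal illustration, not the actual COG pipeline: it assumes the all-against-all comparison has already been reduced to a hypothetical `best_hit` table, it treats a best hit as valid only when it is reciprocated (a simplification), and it skips the paralog-collapsing step. All protein and genome names are made up.

```python
from itertools import combinations

def symmetric_best_hits(best_hit):
    """best_hit maps (protein, other_genome) -> best-scoring protein in other_genome.
    Proteins are (genome, name) tuples. Returns the set of reciprocal best-hit pairs."""
    pairs = set()
    for (protein, _genome), hit in best_hit.items():
        # Keep the pair only if the hit points back to the query protein.
        if best_hit.get((hit, protein[0])) == protein:
            pairs.add(frozenset((protein, hit)))
    return pairs

def triangles(pairs):
    """Triples of proteins from three different genomes that are pairwise best hits."""
    proteins = sorted({p for pair in pairs for p in pair})
    found = []
    for a, b, c in combinations(proteins, 3):
        distinct_genomes = len({a[0], b[0], c[0]}) == 3
        sides = ((a, b), (a, c), (b, c))
        if distinct_genomes and all(frozenset(s) in pairs for s in sides):
            found.append(frozenset((a, b, c)))
    return found

def merge_triangles(tris):
    """Union triangles that share a side (two proteins) into clusters."""
    parent = list(range(len(tris)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(tris)), 2):
        if len(tris[i] & tris[j]) == 2:
            parent[find(i)] = find(j)
    clusters = {}
    for i, tri in enumerate(tris):
        clusters.setdefault(find(i), set()).update(tri)
    return list(clusters.values())

# Toy data: four genomes A-D, one protein each (hypothetical names).
a1, b1, c1, d1 = ("A", "a1"), ("B", "b1"), ("C", "c1"), ("D", "d1")
best_hit = {
    (a1, "B"): b1, (a1, "C"): c1,
    (b1, "A"): a1, (b1, "C"): c1, (b1, "D"): d1,
    (c1, "A"): a1, (c1, "B"): b1, (c1, "D"): d1,
    (d1, "B"): b1, (d1, "C"): c1,
    (d1, "A"): a1,  # not reciprocated: a1 has no best hit in genome D
}
cogs = merge_triangles(triangles(symmetric_best_hits(best_hit)))
print(cogs)
```

In the toy data the triangles (a1, b1, c1) and (b1, c1, d1) share the side b1-c1, so they merge into a single four-member cluster even though a1 and d1 are not best hits of each other.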

The COG- adding new genomes • The COGNITOR program adds new proteins to pre-existing COGs on the basis of multiple best hits. • 60-80% of the proteins of prokaryotes can be included in COGs this way.
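The idea of assigning a protein by multiple best hits can be sketched as a simple vote. This is only an illustration of the principle, not the COGNITOR program itself: the function name, the `min_hits=2` threshold and the COG labels are assumptions made for the example.

```python
from collections import Counter

def assign_to_cog(best_hits, cog_of, min_hits=2):
    """best_hits: dict genome -> best-hit protein of the query in that genome.
    cog_of: dict protein -> COG label (proteins missing here belong to no COG).
    Returns the COG that collects at least min_hits best hits, else None."""
    votes = Counter(cog_of[p] for p in best_hits.values() if p in cog_of)
    if not votes:
        return None
    cog, count = votes.most_common(1)[0]
    return cog if count >= min_hits else None

# Hypothetical example: two of the three best hits fall into the same COG.
cog_of = {"p1": "COG_X", "p2": "COG_X", "p3": "COG_Y"}
query_hits = {"genome1": "p1", "genome2": "p2", "genome3": "p3"}
print(assign_to_cog(query_hits, cog_of))

# A protein with a single supporting hit is left unassigned.
print(assign_to_cog({"genome3": "p3"}, cog_of))
```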

The COG- more applications: • Detecting missed genes. • Convenient for a variety of evolution-oriented analyses of protein families.

Methods • Experimental method: biochemical and genetic experiments. • Computational methods: homology method (BLAST), mRNA expression, phylogenetic profile, fusion method (Rosetta Stone analysis), gene neighbour method.

Homology method • Homology method: searches for proteins whose amino acid sequences are similar. • 40-70% of the genes in a new genome can be assigned some function this way. • Involves identification of some molecular function.

mRNA expression • Analysis of correlated mRNA expression levels makes it possible to establish functional linkages, by detecting changes in mRNA expression across different cell types or different environments.
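A minimal sketch of how correlated expression could be turned into candidate links, assuming each gene has one expression value per condition. The gene names, the values and the 0.9 cutoff are illustrative assumptions, and Pearson correlation is just one possible similarity measure.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical expression levels over four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],   # rises and falls together with geneA
    "geneC": [4.0, 1.0, 3.0, 2.0],   # unrelated pattern
}

THRESHOLD = 0.9  # assumed cutoff for calling two genes co-expressed
linked = [
    (g1, g2)
    for g1, g2 in combinations(sorted(expression), 2)
    if pearson(expression[g1], expression[g2]) >= THRESHOLD
]
print(linked)
```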

Phylogenetic profile • Describes the pattern of presence or absence of a particular protein across a set of organisms. • Number of possible profiles: 2^n for n organisms, since each organism contributes one presence/absence bit. • This number far exceeds the number of protein families.

Phylogenetic profile • Why would two proteins always both be inherited into new species or neither inherited, unless the two function together? • If two proteins have the same phylogenetic profile, it is inferred that they have a functional link: engaged in a common pathway or complex.
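As a sketch of the inference above, a profile can be stored as a tuple of 0/1 flags, one per organism, and proteins with identical tuples grouped together. The protein names and profiles below are invented, and real analyses may also accept near-identical profiles, which this sketch does not.

```python
from collections import defaultdict

# Hypothetical profiles over four organisms: 1 = protein present, 0 = absent.
profiles = {
    "P1": (1, 1, 0, 1),
    "P2": (1, 1, 0, 1),
    "P3": (0, 1, 1, 0),
    "P4": (1, 1, 0, 1),
    "P5": (0, 1, 1, 1),
}

def group_by_profile(profiles):
    """Return the groups of proteins that share exactly the same profile."""
    groups = defaultdict(set)
    for protein, profile in profiles.items():
        groups[profile].add(protein)
    # Only groups with two or more members suggest a functional link.
    return [members for members in groups.values() if len(members) > 1]

n_organisms = len(next(iter(profiles.values())))
possible_profiles = 2 ** n_organisms  # one presence/absence bit per organism

print(group_by_profile(profiles))
print(possible_profiles)
```

With only four organisms there are 16 possible profiles, so a shared profile is weak evidence; the space of profiles doubles with every added genome, which makes chance matches rarer.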

Phylogenetic profile- example • Analysis of three proteins, RL7, FlgL and His5, according to their phylogenetic profiles. • RL7: more than half of the proteins with similar profiles have functions associated with the ribosome. • FlgL: more than half are flagellar proteins or cell-wall maintenance proteins. • His5: more than half are involved in amino acid metabolism.

Phylogenetic profile- example • PgsA: phospholipid synthesis • YGGH: hypothetical • YBEX: hypothetical • RL34: ribosome L34 • RL36: ribosome L36 • RL27: ribosome L27 • RL25: ribosome L25 • YQCB: hypothetical • YABO: hypothetical • YCEC: hypothetical • RFH: peptide release factor • ClpB: heat shock protein • RL7: ribosome L7 • RL15: ribosome L15 • RL17: ribosome L17 • PTH: peptidyl-tRNA hydrolase • RNC: ribonuclease III • YJFH: hypothetical • RS14: ribosome S14 • GidB: glucose-inhibited division • RL24: ribosome L24 • DEF: polypeptide deformylase • RL20: ribosome L20 • MesJ: cell cycle protein • RL19: ribosome L19 • RL21: ribosome L21 • RL9: ribosome L9 • SmpB: small protein B • G3P3: dehydrogenase • RL4: ribosome L4 • NONE: hypothetical • GrpE: co-chaperone

Phylogenetic profile • Phylogenetic profiles link proteins with similar keywords

Keyword                        No. proteins   No. neighbors in keyword group   No. neighbors in random group
Ribosome                       60             197                              27
Transcription                  36             17                               10
tRNA synthase and ligase       26             11                               5
Membrane proteins              25             89                               5
Flagellar                      21             89                               3
Iron, ferric, and ferritin     19             31                               2
Galactose metabolism           18             31                               2
Molybdopterin and molybdenum   12             6                                1
Hypothetical                   1084           108226                           8440

Fusion method or the Rosetta stone analysis • Some pairs of interacting proteins have homologs in another organism, fused into a single protein chain. • When two separate proteins in one organism, A and B, are expressed as a fused protein in some other species, there is a high probability that A and B are linked in function.
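The search for a fused "Rosetta Stone" protein can be sketched as an interval test: two query proteins are linked when both align to the same third protein in non-overlapping regions. This is a simplified illustration, assuming the alignments are already available as hypothetical `(query, target, start, end)` tuples; it does not check that the two queries are not homologous to each other.

```python
from itertools import combinations

def rosetta_links(hits):
    """hits: list of (query, target, start, end) alignments of a query onto a target.
    Returns {(pair_of_queries, target)} for queries that cover separate
    regions of the same target protein."""
    by_target = {}
    for query, target, start, end in hits:
        by_target.setdefault(target, []).append((query, start, end))
    links = set()
    for target, regions in by_target.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            non_overlapping = e1 < s2 or e2 < s1
            if q1 != q2 and non_overlapping:
                links.add((frozenset((q1, q2)), target))
    return links

# Hypothetical alignments: A and B cover different halves of fused protein C,
# while D overlaps both of them and therefore supports no link.
hits = [
    ("A", "C", 1, 200),
    ("B", "C", 220, 450),
    ("D", "C", 150, 300),
]
print(rosetta_links(hits))
```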

The Rosetta Stone model

Fusion method –what is it good for? • Predicts protein pairs that have related biological functions. • Predicts potential protein-protein interactions. • Can turn up complexes of proteins, or protein pathways.

Fusion method • The group searched the 4290 protein sequences of the E. coli genome. • The proteins could form at most (4290)(4289)/2 pairwise interactions, but we expect far fewer… • 6809 candidate pairwise interactions were found.
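To put the numbers above in proportion, the upper bound and the share of it covered by the candidates can be computed directly. The variable names are ours; the figures are the ones quoted on the slide.

```python
n_proteins = 4290
candidate_pairs = 6809

# Every unordered pair of distinct proteins: n(n-1)/2.
max_pairs = n_proteins * (n_proteins - 1) // 2
fraction = candidate_pairs / max_pairs

print(max_pairs)            # 9199905
print(f"{fraction:.4%}")    # share of all possible pairs that are candidates
```

The candidates make up well under a tenth of a percent of all possible pairs.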

Fusion method –validation • Looking for a similar function in existing annotations, which would imply at least a functional interaction. • Of the E. coli pairs found in the Rosetta Stone analysis, 68% share at least one keyword in their annotations, whereas among randomly selected E. coli protein pairs only 15% share a keyword.

Fusion method –validation • In a database of protein pairs that have been found experimentally to interact, 6.4% are linked by Rosetta Stone sequences. • The phylogenetic profile method was applied to the interactions predicted by the fusion method: more than 8 times as many of these interactions were also suggested by the phylogenetic profile method as for randomly chosen sets of interactions.

Fusion method –missing pairs • False negatives arise when: • there was no fusion of the interacting proteins; • the fused protein disappeared during the course of evolution.

Fusion method –False alarms • False positives: • A physical interaction is falsely predicted when the proteins are fused and co-regulated but do not interact. • The method cannot distinguish between homologs that bind and those that do not.

Fusion method –False alarms • The false positive rate in E. coli due to the inability to distinguish homologs is about 82%. • To reduce these errors, the “promiscuous” domains were identified and removed during the analysis. • By filtering out only 5% of all domains, we can remove the majority of falsely predicted interactions.
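One way to picture the filtering step is to rank domains by how many distinct partners they are linked to and drop the top few percent. This is a sketch under that assumption; the ranking criterion, the function name and the toy data are ours, not necessarily the procedure used in the analysis.

```python
import math
from collections import defaultdict

def filter_promiscuous(links, fraction=0.05):
    """links: list of (domain_a, domain_b) predicted links.
    Drops the `fraction` of domains with the most distinct partners and
    returns (remaining_links, dropped_domains)."""
    partners = defaultdict(set)
    for a, b in links:
        partners[a].add(b)
        partners[b].add(a)
    # Most-connected domains first; ties broken by name for reproducibility.
    ranked = sorted(partners, key=lambda d: (-len(partners[d]), d))
    n_drop = math.ceil(fraction * len(ranked))
    dropped = set(ranked[:n_drop])
    kept = [(a, b) for a, b in links if a not in dropped and b not in dropped]
    return kept, dropped

# Toy data: one hub domain linked to 19 others, plus one ordinary link.
links = [("hub", f"d{i}") for i in range(1, 20)] + [("d1", "d2")]
kept, dropped = filter_promiscuous(links)
print(dropped, kept)
```

In the toy data a single domain out of twenty (5%) accounts for 19 of the 20 links, so removing it leaves only the one ordinary link.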

Neighbour method • Functional links between genes can be identified by examining whether the proximity of the genes is conserved across multiple genomes. • Powerful in uncovering functional linkages in prokaryotes where operons are common.

Neighbour method- definitions • ‘Close’: two genes are proximate if they lie on the same strand within 300 bp of each other, and so are transcribed in the same direction. • Direct link: two proximate genes that are also proximate in at least two other genomes of different phylogenetic groups. • Inferred link: two genes that are not close but whose orthologs are close in at least three other genomes of different phylogenetic groups.
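The ‘close’ and direct-link definitions above can be sketched as follows. This is one reading of the definitions, with assumptions: genes are matched across genomes by a hypothetical ortholog label, the 300 bp gap is measured between the end of one gene and the start of the next, and "two other genomes of different phylogenetic groups" is taken to mean at least two phylogenetic groups other than the query's own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gene:
    ortholog: str  # label shared by orthologous genes (assumed given)
    strand: str    # "+" or "-"
    start: int
    end: int

def is_close(g1, g2, max_gap=300):
    """Same strand and at most max_gap bp between the two genes."""
    if g1.strand != g2.strand:
        return False
    first, second = sorted((g1, g2), key=lambda g: g.start)
    return second.start - first.end <= max_gap

def groups_with_close_pair(genomes, a, b):
    """genomes: dict name -> (phylogenetic group, list of Gene).
    Returns the groups in which orthologs a and b are close."""
    groups = set()
    for group, genes in genomes.values():
        genes_a = [g for g in genes if g.ortholog == a]
        genes_b = [g for g in genes if g.ortholog == b]
        if any(is_close(x, y) for x in genes_a for y in genes_b):
            groups.add(group)
    return groups

def is_direct_link(genomes, query, a, b, min_other_groups=2):
    """Close in the query genome and in enough other phylogenetic groups."""
    query_group, _ = genomes[query]
    if not groups_with_close_pair({query: genomes[query]}, a, b):
        return False
    others = {name: g for name, g in genomes.items() if name != query}
    other_groups = groups_with_close_pair(others, a, b) - {query_group}
    return len(other_groups) >= min_other_groups

# Hypothetical genomes: X and Y stay adjacent in groups G1, G2 and G3.
genomes = {
    "query": ("G1", [Gene("X", "+", 100, 1000), Gene("Y", "+", 1100, 2000),
                     Gene("Z", "+", 5000, 6000)]),
    "g2": ("G2", [Gene("X", "+", 5000, 6000), Gene("Y", "+", 6200, 7000)]),
    "g3": ("G3", [Gene("Y", "-", 100, 900), Gene("X", "-", 1000, 1800)]),
    "g4": ("G3", [Gene("X", "+", 100, 900), Gene("Y", "-", 1000, 1800)]),
}
print(is_direct_link(genomes, "query", "X", "Y"))
print(is_direct_link(genomes, "query", "X", "Z"))
```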

Neighbour method • Proximity between genes is maintained mostly because it facilitates their co-transfer to another organism. • Example: restriction-modification systems.

Neighbour method- validation • Identify the links whose genes are annotated in KEGG or COG, and calculate the fraction of those links in which both genes fall in the same functional pathway / category. • The functional correspondence is correlated with the minimal number of phylogenetic groups in which the proximity is detected.

Neighbour method- validation N tradeoff

Neighbour method- example

Happy end??? • The group analyzed the 6,217 proteins of the yeast Saccharomyces cerevisiae, combining several methods. • One can expect each protein to be functionally linked to perhaps 5–50 other proteins, giving 30,000–300,000 biologically meaningful links.

Networks • When methods of detecting functional linkages are applied to all the proteins of an organism, a network of interacting, functionally linked proteins can be traced. • As methods for detecting protein linkages improve, it seems likely that most of the proteins will be included in the network.
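Tracing such a network amounts to treating each predicted link as an edge and collecting the connected components. The sketch below assumes the links have already been pooled from whichever methods were used; the protein names are invented.

```python
from collections import defaultdict, deque

def connected_components(links):
    """links: iterable of (protein_a, protein_b) functional links.
    Returns a list of sets, one per connected group of proteins."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:  # breadth-first walk over one component
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components

# Hypothetical links pooled from several prediction methods.
links = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
networks = sorted(connected_components(links), key=len, reverse=True)
print(networks)
```

As more links are detected, separate components merge, which is the sense in which most proteins may eventually end up in a single network.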

Happy Purim
