Combining simplified Markov random fields with simulated evolution improves remote homology detection for beta-structural proteins. Learn the importance of homologous proteins, computational approaches to detecting homology, and how SMURFLite outperforms other methods in β-structural protein analysis.
SMURFLite: combining simplified Markov random fields with simulated evolution improves remote homology detection for beta-structural proteins into the twilight zone
Noah M. Daniels | Raghavendra Hosur | Bonnie Berger | Lenore Cowen
What are Homologous Proteins? Proteins that preserve related structure (and often function) because they have evolved from a common ancestor. Examples: pig insulin (PDB ID: 1m5a) and human insulin (PDB ID: 1mso).
Why is homology important? Common ancestor, similar structure, similar function.
Computational Approaches to Detecting Homology
Sequence-based methods work best when homology is not too distant.
S1  LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV  60
    L+P +K+ V A WGKV  +  E G EAL R+ + +P T+ +F  F DLS      G+ +V
S2  LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQV  55
These proteins, aligned by BLAST, have probably evolved from a common ancestor.
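To make "not too distant" concrete, here is a minimal Python sketch that counts identical columns in the BLAST alignment on this slide. The function name `alignment_identity` is illustrative and not part of BLAST, and counting only gap-free columns in the denominator is an assumption (other conventions for percent identity exist).

```python
def alignment_identity(a: str, b: str) -> tuple[int, int]:
    """Return (identical columns, gap-free columns) for two aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    gap_free = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        gap_free += 1
        if x == y:
            identical += 1
    return identical, gap_free


# The two aligned sequences from the slide (gaps written as '-').
S1 = "LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV"
S2 = "LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQV"

identical, gap_free = alignment_identity(S1, S2)
print(f"{identical}/{gap_free} identical = {100 * identical / gap_free:.0f}%")
```

For this pair the sketch counts 24 identical residues over 52 gap-free columns, roughly 46% identity.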
Structural Motifs Using Random Fields Can we get the benefit of pairwise correlations without having to throw away all sequence info?
The template is learned from solved structures in the PDB: Aligned with Matt
Two pairwise beta tables, one for exposed residues and one for buried residues, are learned from amphipathic beta sheets that are not propellers, taken from solved structures in the PDB. http://bcb.cs.tufts.edu/propellers/si/
Computing a Score
• Sequences are scored by computing their best “threading” or “parse” against the template as a sum of HMM(score) + pairwise(score)
• No longer polynomial time (multi-dimensional dynamic programming)
• Tractable on propellers because paired beta-strands don’t interleave too much
• See: Menke, Berger and Cowen, “Markov Random Fields Reveal an N-terminal double beta-propeller motif as part of a bacterial hybrid two-component sensor system”, PNAS, March 2010. http://smurf.cs.tufts.edu
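The sum of HMM(score) + pairwise(score) can be illustrated with a toy Python sketch. It scores one fixed, ungapped threading only; it does not perform the multi-dimensional dynamic programming search over threadings. The names `threading_score`, `emissions`, and `pair_table`, and all probability values, are hypothetical and not taken from SMURF.

```python
import math


def threading_score(seq, emissions, pairs, pair_table):
    """Score one fixed threading: per-position HMM term plus pairwise term.

    seq        : residues, one per template position (toy: no gaps)
    emissions  : list of dicts, residue -> emission probability per position
    pairs      : list of (i, j) template positions that are beta-paired
    pair_table : dict, (residue_i, residue_j) -> pairwise probability
    """
    # HMM part: independent log-probability of each residue at its position.
    hmm = sum(math.log(emissions[i][res]) for i, res in enumerate(seq))
    # Pairwise part: extra term for every pair of beta-paired positions.
    pairwise = sum(math.log(pair_table[(seq[i], seq[j])]) for i, j in pairs)
    return hmm + pairwise


# Hypothetical three-position template whose first and last positions pair.
emissions = [{"V": 0.5, "A": 0.5}] * 3
pair_table = {("V", "V"): 2.0, ("V", "A"): 0.5, ("A", "V"): 0.5, ("A", "A"): 1.0}
print(threading_score("VAV", emissions, [(0, 2)], pair_table))
```

Because each pairwise term couples two positions that may be far apart in sequence, the best threading can no longer be found with an ordinary one-dimensional HMM recurrence.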
What makes SMURF turn blue? :( Deeply interleaved β-strand pairs. SMURF's running time is exponential in the interleave number!
Interleave number is a measure of β-strand complexity. (Figure: β-strand pairs labeled with interleave numbers 2 and 1.)
How do we make SMURF happy?
• Simplify the dependency graph
• Only consider beta-strands up to an interleave threshold of i
• i=0: ordinary HMM
• i=1: fast
• i=2: still fast
• i=3: still tractable
• i=4: getting too slow
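The thresholding idea on this slide can be sketched in a few lines of Python. Two assumptions are made here and are not taken from the slides: the interleave number of a strand pair is approximated as the number of other strands lying between the two partners in sequence order, and a pair is kept only if its interleave number is strictly below the threshold (chosen so that a threshold of 0 keeps no pairs and reduces to an ordinary HMM). The function names are hypothetical.

```python
def interleave_number(pair):
    """Assumed definition: strands lying between the two paired strands."""
    a, b = sorted(pair)
    return b - a - 1


def simplify_pairs(pairs, threshold):
    """Keep only strand pairs whose interleave number is below the threshold."""
    return [p for p in pairs if interleave_number(p) < threshold]


# Strands are numbered 0..5 in sequence order; each tuple is one paired couple.
pairs = [(0, 1), (1, 2), (0, 3), (2, 5)]
for i in range(4):
    print(i, simplify_pairs(pairs, i))
```

With this toy input, a threshold of 0 drops every pair, a threshold of 1 keeps only the adjacent (hairpin-like) pairs, and a threshold of 3 keeps all four.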
SMURFLite ignores highly interleaved β-strand pairs Can we somehow weakly capture the pairwise information discarded?
Simulated Evolution (Kumar and Cowen, 2010)
• An HMM is “only as good as the training data”, but is it?
• Leverage our knowledge of evolution to construct new, artificial training data
• Kumar and Cowen, Bioinformatics 2009 and 2010.
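As a rough illustration of constructing artificial training data, the Python sketch below generates mutated copies of each training sequence. It is a stand-in, not the Kumar and Cowen procedure: substitutions are drawn uniformly at random, whereas simulated evolution relies on an evolutionary substitution model. The names `simulate_evolution`, `n_copies`, and `rate` are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simulate_evolution(seqs, n_copies, rate, rng):
    """Return n_copies point-mutated variants of every input sequence.

    Each residue is replaced, with probability `rate`, by a residue drawn
    uniformly from the 20 amino acids (a simplifying assumption).
    """
    variants = []
    for seq in seqs:
        for _ in range(n_copies):
            mutated = [
                rng.choice(AMINO_ACIDS) if rng.random() < rate else res
                for res in seq
            ]
            variants.append("".join(mutated))
    return variants


# Augment a tiny hypothetical training set; a fixed seed keeps runs repeatable.
training = ["VTVKVEY", "ATFRVDW"]
augmented = training + simulate_evolution(training, 3, 0.15, random.Random(0))
print(len(augmented), "sequences after augmentation")
```

The augmented set would then be used to train the profile in place of the original sequences alone.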
SE pipeline. We showed that this improved performance for HMMs. How about for our new MRFs?
Pairwise evolution model, with separate tables for exposed and buried residues. http://bcb.cs.tufts.edu/propellers/si/
SMURFLite: simplified, augmented MRF
• Identify beta-strand pairs
• Count their interleave number
• Augment the training profile with simulated evolution on beta-strand paired residues
• Exclude beta-strand pairs that are too interleaved from the MRF
Results: The Dataset
• 5-bladed, 6-bladed, 7-bladed, and 8-bladed propeller folds.
• All 11 SCOP superfamilies in the mainly-beta class that contain the word “barrel” in the description (excluding 2 that are not structurally consistent).
SMURFLite handles β barrels and sandwiches Translation proteins
Interleaving still matters! Barwin-like endoglucanases
All this lets us search whole genomes
• Thermotoga maritima: 1852 genes, 207 β-structural templates
• We find 139 “hits”
• 28 have solved structures in PDB
• 8 predictions agree with Zhang et al.; none contradicted
• Gene Q9X087 (“putative uncharacterized protein”) has only 20% identity with its closest solved BLAST hit (rhoptry protein from Plasmodium yoelii yoelii). We predict it belongs to “beta-Galactosidase/glucuronidase domain” with a p-value of 0.0006
Credit: K.O. Stetter & R. Rachel, Univ. Regensburg
In the end...
• MRFs outperform the competition on β structures
• They get too complicated on some structures
• We can “fix” this by snipping out the hard parts
• But we don’t want to lose all that information, so we use Simulated Evolution to partially preserve it
• This lets us do whole-genome searches quickly
• We found some possible annotations in Thermotoga!
Where do we go from here?
• Dynamic programming is too slow
• Stochastic search for beta-strand positions
• In between, standard dynamic programming to solve the HMM
SMURFLite Thanks to Matt Menke, Anoop Kumar, and Jinbo Xu. This work was funded in part by NIH grant 1R01GM080330 (to Lenore Cowen) and 1R01GM081871 (to Bonnie Berger). bcb.cs.tufts.edu