SMURFLite: Enhancing Homology Detection for Beta-Structural Proteins Using Random Fields

SMURFLite: combining simplified Markov random fields with simulated evolution improves remote homology detection for beta-structural proteins into the twilight zone
Noah M. Daniels | Raghavendra Hosur | Bonnie Berger | Lenore Cowen

What are Homologous Proteins?
Proteins that preserve related structure (and often function) because they have evolved from a common ancestor.
Pig insulin (PDB ID: 1m5a)
Human insulin (PDB ID: 1mso)

Why is homology important?
Common ancestor → similar structure → similar function

Computational Approaches to Detecting Homology
Sequence-based methods work best when homology is not too distant.
S1 LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 60
   L+P +K+ V A WGKV  +  E G EAL R+ + +P T+ +F  F DLS      G+ +V
S2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQV 55
These proteins, aligned by BLAST, have probably evolved from a common ancestor.
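As a rough illustration of what "not too distant" looks like, the identity of the BLAST alignment on this slide can be counted directly. This is a minimal sketch: the function name `percent_identity` is invented for the example, and identity is taken over all alignment columns (gaps included), which is only one of several conventions in use.

```python
# The two aligned sequences from the slide ("-" marks a gap).
S1 = "LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV"
S2 = "LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQV"


def percent_identity(a, b):
    """Return (identical columns, total columns, percent identity).

    Assumption for this sketch: the denominator is the full alignment
    length, gap columns included.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return identical, len(a), 100.0 * identical / len(a)


identical, columns, pct = percent_identity(S1, S2)
print(f"{identical} of {columns} columns identical ({pct:.0f}%)")
```

For this pair the count is 24 identical columns out of 60, comfortably above the range where sequence-only methods start to fail.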

A Greater Challenge: Detect Remote Homologs

Sequence data: how will we keep up?

HMM is trained from Sequence Alignment of Known Structures

Profile HMM

HMMs cannot capture nonlocal interactions

Markov random fields add nonlocality to HMMs

Let’s look at what this would mean for propeller folds

Structural Motifs Using Random Fields (SMURF)
Can we get the benefit of pairwise correlations without having to throw away all sequence info?

The template is learned from solved structures in the PDB

The template is learned from solved structures in the PDB, aligned with Matt


Two pairwise beta tables, one for exposed residues and one for buried residues, are learned from amphipathic beta sheets that are not propellers, taken from solved structures in the PDB.
http://bcb.cs.tufts.edu/propellers/si/

Computing a Score
Sequences are scored by computing their best “threading” or “parse” against the template, as the sum HMM(score) + pairwise(score).
• No longer polynomial time (multi-dimensional dynamic programming)
• Tractable on propellers because paired beta-strands don’t interleave too much
See: Menke, Berger and Cowen, “Markov Random Fields Reveal an N-terminal double beta-propeller motif as part of a bacterial hybrid two-component sensor system”, PNAS, March 2010. http://smurf.cs.tufts.edu
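The HMM(score) + pairwise(score) sum on this slide can be sketched for one fixed threading. This is a toy illustration, not SMURF's implementation: the names `threading_score`, `emission_logp` and `pair_logp` and every number are hypothetical, HMM transition terms are left out for brevity, and the search over all threadings (the expensive multi-dimensional dynamic programming) is omitted.

```python
def threading_score(seq, states, emission_logp, pair_logp, pairs):
    """Score one fixed threading of `seq` against a template.

    seq           -- residue string
    states        -- template state index for each residue (same length as seq)
    emission_logp -- emission_logp[state][residue], toy profile-HMM log-probabilities
    pair_logp     -- pair_logp[(res_i, res_j)], toy log-odds for paired beta residues
    pairs         -- (i, j) sequence positions that the template places opposite
                     each other in paired beta-strands
    """
    # Local term: each residue scored against the state it is threaded onto.
    hmm = sum(emission_logp[s][r] for r, s in zip(seq, states))
    # Nonlocal term: each pair of residues on paired strands; unknown pairs score 0.
    pairwise = sum(pair_logp.get((seq[i], seq[j]), 0.0) for i, j in pairs)
    return hmm + pairwise


# Hypothetical three-state template in which positions 0 and 2 are paired.
toy_emissions = [{"V": -1.0}, {"A": -0.5}, {"V": -1.0}]
toy_pair_table = {("V", "V"): 0.5}
score = threading_score("VAV", [0, 1, 2], toy_emissions, toy_pair_table, [(0, 2)])
print(score)
```

The pairwise term is what an ordinary profile HMM cannot express: it depends on two positions that may be far apart in sequence.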

What makes SMURF turn blue? :(
Deeply interleaved β-strand pairs
SMURF's running time is exponential in the interleave number!

Interleave number is β-strand complexity

β propellers have a maximum interleave of 3

β barrels range from an interleave of 4 to 8

How do we make SMURF happy?
Simplify the dependency graph: only consider beta-strand pairs up to an interleave threshold of i.
• i=0: ordinary HMM
• i=1: fast
• i=2: still fast
• i=3: still tractable
• i=4: getting too slow
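The thresholding idea can be sketched as a filter over strand pairs. The slides do not give a formal definition of the interleave number, so this sketch assumes a simple stand-in: strands are numbered in sequence order and the interleave number of a pair is the difference of the two strand indices. The function names are likewise invented for the example.

```python
def interleave_number(pair):
    """Assumed stand-in definition: distance between the two strands'
    indices in sequence order (adjacent strands score 1)."""
    a, b = sorted(pair)
    return b - a


def simplify_dependency_graph(pairs, threshold):
    """Keep only the strand pairs whose interleave number is within the threshold."""
    return [p for p in pairs if interleave_number(p) <= threshold]


# Hypothetical four-strand sheet: neighbours pair up, and strand 0 also pairs with strand 3.
sheet_pairs = [(0, 1), (1, 2), (2, 3), (0, 3)]
for i in range(4):
    print(i, simplify_dependency_graph(sheet_pairs, i))
```

Under this assumed definition a threshold of 0 removes every pair, which leaves an ordinary HMM, and raising the threshold restores progressively longer-range pairs at a growing computational cost.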

SMURFLite ignores highly interleaved β-strand pairs

Can we somehow weakly capture the pairwise information that is discarded?

Simulated Evolution (Kumar and Cowen, 2010)
• An HMM is “only as good as the training data”, but is it?
• Leverage our knowledge of evolution to construct new, artificial training data
• Kumar and Cowen, Bioinformatics 2009 and 2010.

β-strand Mutation Model

SE pipeline

We showed that the SE pipeline improved performance for HMMs. How about for our new MRFs?

Pairwise evolution model, with separate tables for exposed and buried residues
http://bcb.cs.tufts.edu/propellers/si/

SMURFLite: simplified, augmented MRF
• Identify beta-strand pairs
• Count their interleave number
• Augment the training profile with simulated evolution on beta-strand paired residues
• Exclude beta-strand pairs that are too interleaved from the MRF
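The augmentation step in the list above can be sketched as resampling one residue of a beta-strand pair conditioned on its partner, then adding the mutated copies to the training set. This is only an illustration of the idea: the function names, the conditional table `toy_partner_table`, the mutation rate and the sequences are all hypothetical, and the actual mutation model of Kumar and Cowen is not reproduced here.

```python
import random


def mutate_paired(seq, pairs, partner_table, rate, rng):
    """Return a copy of `seq` in which, for each paired position (i, j),
    residue i is resampled with probability `rate`, conditioned on residue j."""
    residues = list(seq)
    for i, j in pairs:
        if rng.random() < rate:
            choices = partner_table.get(residues[j])
            if choices:  # leave the residue alone if the table has no entry
                letters = list(choices)
                weights = [choices[c] for c in letters]
                residues[i] = rng.choices(letters, weights=weights)[0]
    return "".join(residues)


def augment(training, pairs, partner_table, rate=0.5, copies=3, seed=0):
    """Original sequences followed by `copies` simulated mutants of each."""
    rng = random.Random(seed)
    mutants = [mutate_paired(s, pairs, partner_table, rate, rng)
               for s in training for _ in range(copies)]
    return list(training) + mutants


# Hypothetical table: P(residue at i | residue at its paired position j).
toy_partner_table = {"V": {"V": 0.6, "I": 0.4}, "I": {"I": 0.7, "V": 0.3}}
toy_training = ["VKVTV", "IRIEI"]
augmented = augment(toy_training, [(0, 4)], toy_partner_table)
print(augmented)
```

Because the mutants respect the pairing statistics, a profile trained on the augmented set can retain some of the pairwise signal even for strand pairs that are later excluded from the MRF.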

The SMURFLite Pipeline

Results: The Dataset

• 5-bladed, 6-bladed, 7-bladed, and 8-bladed propeller folds.
• All 11 SCOP superfamilies in the mainly-beta class that contain the word “barrel” in the description (excluding 2 that are not structurally consistent).

SMURFLite compared to HMMer, Raptor and HHpred

SMURFLite handles β barrels and sandwiches
Translation proteins

SMURFLite compared to other programs

Interleaving still matters!
Barwin-like endoglucanases

All this lets us search whole genomes
Thermotoga maritima
• 1852 genes
• 207 β-structural templates
• We find 139 “hits”
• 28 have solved structures in PDB
• 8 predictions agree with Zhang et al.; none contradicted
Credit: K.O. Stetter & R. Rachel, Univ. Regensburg
Gene Q9X087 (“putative uncharacterized protein”) has only 20% identity with its closest solved BLAST hit (Rhoptry protein from Plasmodium yoelii yoelii). We predict it belongs to the “beta-Galactosidase/glucuronidase domain” with a p-value of 0.0006.

In the end...
• MRFs outperform the competition on β structures
• They get too complicated on some structures
• We can “fix” this by snipping out the hard parts
• But we don’t want to lose all that information, so we use Simulated Evolution to partially preserve it
• This lets us do whole-genome searches quickly
• We found some possible annotations in Thermotoga!

Where do we go from here?
• Dynamic programming is too slow
• Stochastic search for beta-strand positions
• In between, standard dynamic programming to solve the HMM

SMURFLite
Thanks to Matt Menke, Anoop Kumar, and Jinbo Xu.
This work was funded in part by NIH grants 1R01GM080330 (to Lenore Cowen) and 1R01GM081871 (to Bonnie Berger).
bcb.cs.tufts.edu
