
Identifying Structural Motifs in Proteins



Presentation Transcript


  1. Identifying Structural Motifs in Proteins. Rohit Singh. Joint work with Mitul Saha.

  2. The Big Picture: small motifs. Active sites are preserved across proteins with similar functions.

  3. The Big Picture: large motifs. Even bigger motifs are often conserved.

  4. Oh, BTW… There are two different issues here:
     • Find the best match for the motif in the protein.
       • Extensively studied in vision/graphics.
     • Is the match “significant”?
       • For small motifs a good match is more likely.
       • What is the probability of a match against a random protein being this good? (cf. BLAST)

  5. What’s in it for a CS guy?
     • The problem of matching two point sets has many applications.
     • Most current algorithms are geared toward points that are indistinguishable (e.g. points on a mesh).
     • There are few rigorous results on the significance of matches.

  6. So what have we done?
     • Worked toward a more rigorous approach for scoring the quality of a match (between motif and protein).
     • Provided a method that is capable of finding the optimum match based on these criteria.

  7. Problem Description
     • Given a motif and a protein, for each point in the motif, find a corresponding point in the protein.
     • Given these correspondences, find the best transformation (rotation and translation only) of the motif that aligns it to the protein.
     • Optimize over all possible correspondences.
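
One way to write the three steps of this slide as a single objective is sketched below. The notation is an assumption introduced here, not taken from the talk: $m_1,\dots,m_k$ are the motif points, $p_1,\dots,p_n$ the protein points, $\pi$ an assignment of each motif point to a distinct protein point, $R$ a rotation, and $t$ a translation. The inner minimum is the alignment step for fixed correspondences; the outer minimum is the search over correspondences.

```latex
\min_{\pi : \{1,\dots,k\} \hookrightarrow \{1,\dots,n\}} \;
\min_{R \in SO(3),\; t \in \mathbb{R}^3} \;
\sqrt{\frac{1}{k} \sum_{i=1}^{k} \bigl\lVert R\, m_i + t - p_{\pi(i)} \bigr\rVert^{2}}
```

Requiring $\pi$ to be injective is a modelling choice made for this sketch; the slide itself only asks for "a corresponding point" per motif point.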

  8. Oh, BTW…
     • Given two sets of k points with known correspondences, it is easy to find the optimal rotation and translation that minimize the sum-of-squared error (equivalently, the RMSD).
     • Boils down to finding the largest eigenvalue (and its eigenvector) of a 4x4 matrix.
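
The 4x4 eigenproblem mentioned on this slide matches the quaternion formulation commonly attributed to Horn; the talk does not say which formulation was used, so the following is a minimal sketch under that assumption. The function name `horn_align` and the use of NumPy are illustrative choices, not part of the original material.

```python
import numpy as np

def horn_align(P, Q):
    """Rigidly align paired point sets.

    Returns (R, t, rmsd) minimizing sum_i ||R @ P[i] + t - Q[i]||^2,
    where row i of P corresponds to row i of Q.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    X, Y = P - p_mean, Q - q_mean

    # 3x3 cross-covariance: S[a, b] = sum_i X[i, a] * Y[i, b]
    S = X.T @ Y
    Sxx, Sxy, Sxz = S[0]
    Syx, Syy, Syz = S[1]
    Szx, Szy, Szz = S[2]

    # Symmetric 4x4 matrix whose top eigenvector is the optimal unit quaternion.
    N = np.array([
        [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx],
        [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz],
        [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy],
        [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz],
    ])

    # eigh returns eigenvalues in ascending order, so the last column
    # is the eigenvector of the largest eigenvalue.
    _, eigvecs = np.linalg.eigh(N)
    w, x, y, z = eigvecs[:, -1]

    # Convert the unit quaternion (w, x, y, z) to a rotation matrix.
    R = np.array([
        [w*w + x*x - y*y - z*z, 2*(x*y - w*z),         2*(x*z + w*y)],
        [2*(x*y + w*z),         w*w - x*x + y*y - z*z, 2*(y*z - w*x)],
        [2*(x*z - w*y),         2*(y*z + w*x),         w*w - x*x - y*y + z*z],
    ])
    t = q_mean - R @ p_mean

    diff = (P @ R.T + t) - Q
    rmsd = np.sqrt((diff ** 2).sum() / len(P))
    return R, t, rmsd
```

Calling `horn_align(motif_points, matched_protein_points)` on two `(k, 3)` arrays returns the rotation, the translation, and the residual RMSD for that fixed set of correspondences.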

  9. Previous Work
     • Brute force approach: match edges of the same length.
     • Geometric hashing: Pennec & Ayache, Bioinformatics, 1998.

  10. What is missing?
      • Ad hoc: existing methods try to minimize a quantity that is only indirectly related to the least-squares error or RMSD.
      • It is hard to evaluate the quality of partial matches.
      • Brute force methods are infeasible for larger motifs.
      • Geometric hashing requires significant preprocessing.

  11. Estimating the error. Model the alignment problem as a regression problem, with Y = model set (protein), T = data set (motif), g = transformation (rotation + translation).
      • Which error criterion to use?
      • Least Mean Squared Error (LSE; also RMSD)
      • LSE is not good when you have outliers.
      • What to do?

  12. Robust error estimation
      • LSE: larger error terms have disproportionate influence.
      • Use a function to reduce the effect of larger error terms (M-estimators).
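
The slide does not name a specific M-estimator, so the sketch below uses the Huber function purely as an assumed example. The threshold `delta` and both function names are illustrative. The point is only to show how the penalty grows linearly instead of quadratically once a residual exceeds the threshold, which caps the pull of a single outlier.

```python
import numpy as np

def huber_loss(residuals, delta=1.0):
    """Quadratic penalty for |r| <= delta, linear penalty beyond it."""
    r = np.abs(np.asarray(residuals, dtype=float))
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def huber_weight(residuals, delta=1.0):
    """Weight that the Huber function assigns to each residual
    in a reweighted least-squares fit: 1 for small residuals,
    delta / |r| for large ones."""
    r = np.abs(np.asarray(residuals, dtype=float))
    return np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))

# Three well-aligned points and one outlier (made-up distances).
residuals = np.array([0.1, 0.3, 0.2, 5.0])
squared = 0.5 * residuals ** 2      # least-squares penalty per point
robust = huber_loss(residuals)      # Huber penalty per point
weights = huber_weight(residuals)   # the outlier gets weight 0.2
```

For the three small residuals the two penalties coincide; only the outlier is treated differently.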

  13. It's an optimization problem! Consider the case of full matching:
      • Domain: the set of all possible correspondences between points on the motif and points on the protein.
      • Range: given a particular set of corresponding points, the minimum error in aligning those point sets.
      • Goal: find the global minimum of this function!

  14. Looking for the global minimum. Our approach:
      • Prune the search space to a small and plausible sub-space.
      • Find most of the local minima in this sub-space quickly.
      • Choose the minimum over these local minima.

  15. Finding local minima is easy: ICP. Iterative Closest Point (Besl-McKay).

  16. ICP contd…
      • ICP is guaranteed to converge to a local minimum.
      • But it depends a lot on initial seeding.
      • Convergence is quick: ~4-5 iterations.
      (ICP movie)
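
The ICP loop described on these two slides can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses a brute-force nearest-neighbour search, lets several motif points pick the same protein point, and computes each rigid fit with an SVD-based (Kabsch-style) solver rather than the quaternion form. The names `rigid_fit` and `icp` and the stopping tolerance are assumptions.

```python
import numpy as np

def rigid_fit(P, Q):
    """Least-squares rotation R and translation t with R @ P[i] + t ~ Q[i]."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, q_mean - R @ p_mean

def icp(motif, protein, R=None, t=None, max_iter=50, tol=1e-9):
    """Iterate: match each motif point to its nearest protein point,
    then refit the rigid transform, until the RMSD stops improving.

    R and t are the initial seed; the result depends on them.
    Returns (R, t, nearest, rmsd), where nearest[i] is the index of the
    protein point matched to motif point i.
    """
    motif = np.asarray(motif, dtype=float)
    protein = np.asarray(protein, dtype=float)
    R = np.eye(3) if R is None else np.asarray(R, dtype=float)
    t = np.zeros(3) if t is None else np.asarray(t, dtype=float)
    prev_err = np.inf
    for _ in range(max_iter):
        moved = motif @ R.T + t
        # Squared distances between every moved motif point and every protein point.
        d2 = ((moved[:, None, :] - protein[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        R, t = rigid_fit(motif, protein[nearest])
        diff = (motif @ R.T + t) - protein[nearest]
        err = np.sqrt((diff ** 2).sum(axis=1).mean())
        if prev_err - err < tol:
            break
        prev_err = err
    return R, t, nearest, err
```

Because each step can only lower the error for the current matches, the loop settles into a local minimum; which one it finds is determined by the seed passed in as `R` and `t`.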

  17. Pruning the search space
      • Every point in the motif/protein has some features:
        • Amino acid type, element type, secondary structure, hydrophobic/polar, ‘substitutable’
      • Assume: a point with feature X can only match another point with feature X (or {Y,Z,W}).
      • Assume: some features are more frequent than others.

  18. Our Approach
      • Find the feature that is least frequent in the protein.
      • For each occurrence of the feature:
        • Seed ICP appropriately. Find the local minimum.
        • Look around a few more times.
      • Return the best answer you have.
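
The first two steps above (pick the rarest feature, then enumerate its occurrences as seeds) can be sketched as follows. This is a hypothetical illustration: the function name `seed_candidates`, the restriction to features that also occur in the motif, the alphabetical tie-break, and the residue labels in the example are all assumptions, and how each seed is turned into an initial transform for ICP is not specified in the talk.

```python
from collections import Counter

def seed_candidates(motif_features, protein_features):
    """Pick the motif feature that is rarest in the protein and list
    every (motif index, protein index) pair sharing that feature.

    Each pair is a candidate anchor from which a local search can be seeded.
    """
    counts = Counter(protein_features)
    # Only a feature that appears in the motif can anchor a seed;
    # ties are broken by feature name to keep the result deterministic.
    rarest = min(set(motif_features), key=lambda f: (counts.get(f, 0), f))
    motif_idx = [i for i, f in enumerate(motif_features) if f == rarest]
    protein_idx = [j for j, f in enumerate(protein_features) if f == rarest]
    return rarest, [(i, j) for i in motif_idx for j in protein_idx]

# Made-up example: a three-residue motif against a short residue list.
motif = ["HIS", "ASP", "SER"]
protein = ["SER", "GLY", "HIS", "ASP", "SER", "ASP", "GLY"]
feature, seeds = seed_candidates(motif, protein)
```

Anchoring on the rarest feature keeps the number of seeds, and therefore the number of local searches, as small as the data allows.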

  19. Observations
      • Will always find a perfect match, if it exists.
      • Moreover, will find such a match quickly.
      • The error is directly interpretable in RMSD terms.

  20. Does it work?

  21. …contd. Trypsin active site against trypsin-like proteins.

  22. …contd. Trypsin active site against kinases.

  23. What about partial matching?
      • Basic idea is the same: pruning + ICP.
      • Replace least-squares error estimates with M-estimator based errors.
      • Problem: how to find the optimal rotation/translation that minimizes this new error criterion?
      • Answer: weighted LSE? Is there a better way?
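
One common way to read "weighted LSE" here is iteratively reweighted least squares: fit, compute residuals, turn them into weights, and refit. The sketch below follows that reading as an assumption; the talk does not confirm the exact scheme. It uses Huber-style weights and an SVD-based weighted fit, and the names `weighted_rigid_fit`, `irls_align`, `delta`, and `n_iter` are illustrative.

```python
import numpy as np

def weighted_rigid_fit(P, Q, w):
    """Rotation R and translation t minimizing sum_i w[i] * ||R @ P[i] + t - Q[i]||^2."""
    w = w / w.sum()
    p_mean, q_mean = w @ P, w @ Q          # weighted centroids
    H = (P - p_mean).T @ ((Q - q_mean) * w[:, None])
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, q_mean - R @ p_mean

def irls_align(P, Q, delta=1.0, n_iter=20):
    """Alternate weighted fits with Huber-style reweighting.

    Points whose residual exceeds delta are down-weighted by delta / residual,
    so badly matching points lose influence on the transform.
    Returns (R, t, w) with the final weights.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    w = np.ones(len(P))
    for _ in range(n_iter):
        R, t = weighted_rigid_fit(P, Q, w)
        r = np.linalg.norm(P @ R.T + t - Q, axis=1)
        w = np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))
    return R, t, w
```

The returned weights double as a soft indicator of which motif points were treated as outliers, which is one way to interpret a partial match.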

  24. RANSAC. The choice of parameters has statistical justification.
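
The slide does not list the parameters, but the standard statistical argument for RANSAC's trial count is easy to state: if a fraction w of the points are inliers and each trial draws s points, a trial is all-inlier with probability w^s, so N trials contain at least one all-inlier sample with probability 1 - (1 - w^s)^N. The sketch below solves that for N. The function name and the default sample size of 3 (three point pairs are enough to fix a rigid transform in 3D) are assumptions for illustration.

```python
import math

def ransac_trials(inlier_fraction, sample_size=3, confidence=0.99):
    """Smallest number of random samples such that, with probability at
    least `confidence`, one of them consists only of inliers."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    good_sample = inlier_fraction ** sample_size
    if good_sample <= 0.0:
        raise ValueError("inlier_fraction must be positive")
    if good_sample >= 1.0:
        return 1  # every sample is all-inlier
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - good_sample))

trials = ransac_trials(0.5)  # half the points are inliers -> 35 trials
```

The count grows quickly as the inlier fraction drops, which is why pruning candidate matches by feature before sampling keeps the search affordable.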

  25. Plain Vanilla (Least Squares):

  26. M-estimator + weighted LSE:

  27. M-estimator + RANSAC:

  28. …contd. Data for a distorted trypsin active site against ten different trypsins:

  29. Future Work
      • Test on larger motifs: secondary structure elements.
      • Choice of better features.
      • A theoretical guarantee about the quality of results.
      • Explore different criteria for partial matching.

  30. Thanks!
