
Similarity Measures for Protein Binding Sites based on Fuzzy Histogram Comparison

Similarity Measures for Protein Binding Sites based on Fuzzy Histogram Comparison. Thomas Fober Eyke Hüllermeier Knowledge Engineering & Bioinformatics Group Department of Mathematics and Computer Science Marburg University, Germany.


Presentation Transcript


  1. Similarity Measures for Protein Binding Sites based on Fuzzy Histogram Comparison. Thomas Fober, Eyke Hüllermeier. Knowledge Engineering & Bioinformatics Group, Department of Mathematics and Computer Science, Marburg University, Germany. World Congress on Computational Intelligence, Barcelona, 21.07.2010

  2. Protein Binding Sites • a protein consists of amino acids • binding sites are small cavities on the surface of a protein • small molecules can bind to these sites and cause a reaction of the protein • inhibit the binding site or cause a reaction of the protein • no sequence or fold information is used • model: 7 physico-chemical properties defined by certain amino acids • abstraction: summarize each patch into a spatial point (pseudocenter)

  3. Applications • cross reactivities • binding site of target is known • search for proteins with similar binding site • these proteins may also be influenced by the ligand • prediction of the function • protein with unknown function but known structure • search for proteins with known function and similar binding site • goal: find a similarity measure between protein binding sites • how to store / represent a protein binding site • how to define a similarity measure on a certain representation

  4. Representation of Protein Binding Sites

  5. Existing Approaches • Point-cloud based • labeled point cloud superposition • geometric hashing • Hausdorff distance • Graph based • graph isomorphism • subgraph isomorphism • maximum common subgraph • minimum common supergraph • graph kernels • random walk • shortest path • graph edit distance • iterative approaches • quadratic programming • evolutionary algorithms

  6. Cavbase • database containing protein binding sites • 61,516 protein structures were used to extract 248,686 protein binding sites → for large-scale studies such approaches are not feasible

  7. Histogram Representation I • a binding site consists of a set of pseudocenters and a set of distances between them → two distributions → two histograms • the first histogram represents distances • use a set of bins • h(b) is the percentage of distances whose length lies in the interval [b - 1, b[ • the second histogram represents pseudocenters • use one bin per pseudocenter type and count for each type its relative number of occurrences
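The two histograms described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the type labels `t1` to `t7`, and the choice to put distances beyond the last bin into the last bin are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def distance_histogram(points, n_bins):
    """h(b) = fraction of pairwise distances in [b - 1, b[, for b = 1..n_bins."""
    counts = [0] * n_bins
    total = 0
    for p, q in combinations(points, 2):
        d = math.dist(p, q)
        # A distance in [b - 1, b[ belongs to bin b (stored at index b - 1).
        # Assumption: distances beyond the last bin are put into the last bin.
        index = min(int(d), n_bins - 1)
        counts[index] += 1
        total += 1
    return [c / total for c in counts] if total else [0.0] * n_bins


def type_histogram(types, all_types):
    """Relative number of occurrences of each pseudocenter type."""
    counts = Counter(types)
    n = len(types)
    return [counts[t] / n if n else 0.0 for t in all_types]


# Hypothetical labels standing in for the 7 physico-chemical properties.
ALL_TYPES = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]

# Toy binding site: three pseudocenters on a line (made-up coordinates).
points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (2.5, 0.0, 0.0)]
types = ["t1", "t2", "t1"]

h_dist = distance_histogram(points, n_bins=3)
h_type = type_histogram(types, ALL_TYPES)
```

The pairwise distances in the toy site are 0.5, 2.0, and 2.5, so one distance falls into bin 1 and two into bin 3. Both histograms sum to 1, which makes binding sites of different sizes comparable.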

  8. Histogram Representation I (2) [Figure: worked example. Recoverable labels: values 1.5, 0.8, 0.9, 0.6; categories red, green, blue; bin axis 1, 2, 3]

  9. Histogram Representation II • considering pseudocenters and distances separately comes with a high loss of information • instead of two histograms, represent a protein binding site by 28 histograms (one per unordered pair of the 7 pseudocenter types, including i = j) • consider the sets of all distances between pseudocenters of type i and j • calculate for each set a histogram as described in (I) • normalize the histograms to give all of them the same weight • advantage here: pseudocenters, their types, and distances are combined
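A minimal sketch of this second representation follows, under stated assumptions: the type labels `t1` to `t7` are placeholders, each histogram is normalized to sum to 1, and a type pair with no distances keeps an all-zero histogram. The slide does not specify the last two details.

```python
import math
from itertools import combinations, combinations_with_replacement

# Hypothetical labels standing in for the 7 physico-chemical properties.
ALL_TYPES = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]


def pairwise_type_histograms(centers, all_types, n_bins):
    """centers: list of (type, (x, y, z)). Returns {(type_i, type_j): histogram}."""
    rank = {t: k for k, t in enumerate(all_types)}
    # One histogram per unordered type pair, including i == j: 7 * 8 / 2 = 28.
    hists = {pair: [0.0] * n_bins
             for pair in combinations_with_replacement(all_types, 2)}
    for (ta, pa), (tb, pb) in combinations(centers, 2):
        key = (ta, tb) if rank[ta] <= rank[tb] else (tb, ta)
        # Index of the bin [b - 1, b[; assumption: long distances go to the last bin.
        index = min(int(math.dist(pa, pb)), n_bins - 1)
        hists[key][index] += 1.0
    # Normalize each non-empty histogram so that all of them carry the same weight.
    for h in hists.values():
        total = sum(h)
        if total > 0:
            for k in range(n_bins):
                h[k] /= total
    return hists


# Toy binding site with made-up coordinates.
centers = [("t1", (0.0, 0.0, 0.0)), ("t1", (1.5, 0.0, 0.0)), ("t2", (0.0, 0.5, 0.0))]
hists = pairwise_type_histograms(centers, ALL_TYPES, n_bins=3)
```

In the toy site, the single t1/t1 distance (1.5) lands in the second bin, while the two t1/t2 distances (0.5 and about 1.58) split evenly over the first and second bins.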

  10. Histogram Representation II (2) [Figure: worked example. Recoverable labels: values 1.5, 0.8, 0.9, 0.6; several histograms over bins 1, 2, 3]

  11. Calculation of Similarity [Figure: not recoverable from the transcript]

  12. Distance Measures on Histograms • bin-by-bin measures • Minkowski distance • histogram intersection • χ², Kullback-Leibler divergence, ... • cross-bin measures • quadratic form distance, where a_ij gives the similarity between bins i and j
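The formulas on this slide did not survive in the transcript. The sketch below uses the common textbook forms of these measures, which may differ in detail from the ones shown in the talk. The similarity matrix built by `linear_similarity` is a hypothetical choice for illustration only.

```python
import math


def minkowski(h, g, p=2):
    """Bin-by-bin: (sum_b |h(b) - g(b)|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(h, g)) ** (1.0 / p)


def intersection_distance(h, g):
    """Bin-by-bin: 1 - sum_b min(h(b), g(b)), for histograms that sum to 1."""
    return 1.0 - sum(min(a, b) for a, b in zip(h, g))


def quadratic_form(h, g, A):
    """Cross-bin: sqrt((h - g)^T A (h - g)), with A[i][j] the similarity of bins i, j."""
    d = [a - b for a, b in zip(h, g)]
    n = len(d)
    q = sum(d[i] * A[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(max(q, 0.0))  # guard against tiny negative rounding errors


def linear_similarity(n):
    """Hypothetical similarity matrix: a_ij = 1 - |i - j| / (n - 1), for n > 1."""
    return [[1.0 - abs(i - j) / (n - 1) for j in range(n)] for i in range(n)]


h = [1.0, 0.0, 0.0]
g_near = [0.0, 1.0, 0.0]   # all mass shifted by one bin
g_far = [0.0, 0.0, 1.0]    # all mass shifted by two bins
A = linear_similarity(3)

bin_near, bin_far = minkowski(h, g_near, p=1), minkowski(h, g_far, p=1)
cross_near, cross_far = quadratic_form(h, g_near, A), quadratic_form(h, g_far, A)
```

The bin-by-bin measure assigns the same distance to `g_near` and `g_far`, because it only sees that the occupied bins differ. The quadratic form distance ranks `g_near` as closer, since neighbouring bins count as partly similar.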

  13. Distance Measures on Histograms (2) • distances based on cumulative distributions • match distance • Kolmogorov-Smirnov distance • earth mover's distance
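As with the previous slide, the formulas are missing from the transcript. The sketch below uses the usual definitions on cumulative histograms and may not match the slide's notation exactly. For one-dimensional histograms of equal total mass with ground distance |i - j|, the earth mover's distance coincides with the match distance, so no separate function is given for it here.

```python
from itertools import accumulate


def match_distance(h, g):
    """Sum over bins of |H(b) - G(b)|, where H and G are the cumulative histograms."""
    return sum(abs(a - b) for a, b in zip(accumulate(h), accumulate(g)))


def kolmogorov_smirnov(h, g):
    """Maximum over bins of |H(b) - G(b)|."""
    return max(abs(a - b) for a, b in zip(accumulate(h), accumulate(g)))


h = [1.0, 0.0, 0.0]
g_near = [0.0, 1.0, 0.0]   # all mass shifted by one bin
g_far = [0.0, 0.0, 1.0]    # all mass shifted by two bins

md_near, md_far = match_distance(h, g_near), match_distance(h, g_far)
ks_near, ks_far = kolmogorov_smirnov(h, g_near), kolmogorov_smirnov(h, g_far)
```

In this toy case the match distance grows with the size of the shift (1 versus 2), whereas the Kolmogorov-Smirnov distance is 1 in both cases because it only records the largest cumulative gap.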

  14. Fuzzy Histograms • the problem of discontinuity is known for histograms • 2.99 belongs to bin [2, 3[ whereas 3.0 belongs to bin [3, 4[ • 2.1 and 2.9 both belong to bin [2, 3[ • use triangular fuzzy sets with core {b} and support ]b - 1, b + 1[ • for fuzzy histograms, replace counts by sigma-counts [Figure: example with a new distance of 2.9 and the bins at 2 and 3]
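A triangular fuzzy set with core {b} and support ]b - 1, b + 1[ has the membership function max(0, 1 - |d - b|), and the sigma-count of a bin is the sum of the memberships of all distances in it. The sketch below follows that reading. The function names, the bin centres 1..n_bins, and the normalization by the number of distances are assumptions for illustration.

```python
def triangular_membership(d, b):
    """Membership of distance d in the fuzzy bin with core {b}, support ]b - 1, b + 1[."""
    return max(0.0, 1.0 - abs(d - b))


def fuzzy_histogram(distances, n_bins):
    """Sigma-count per fuzzy bin b = 1..n_bins, divided by the number of distances."""
    n = len(distances)
    if n == 0:
        return [0.0] * n_bins
    return [sum(triangular_membership(d, b) for d in distances) / n
            for b in range(1, n_bins + 1)]


# A single distance of 2.9 contributes about 0.1 to bin 2 and about 0.9 to bin 3.
h_29 = fuzzy_histogram([2.9], n_bins=4)

# 2.99 and 3.0 now produce almost identical histograms instead of jumping bins.
h_299 = fuzzy_histogram([2.99], n_bins=4)
h_300 = fuzzy_histogram([3.0], n_bins=4)
l1_gap = sum(abs(a - b) for a, b in zip(h_299, h_300))
```

With crisp bins, 2.99 and 3.0 would give histograms at the maximal L1 distance of 2. With fuzzy bins the gap shrinks to about 0.02. Note that with centres starting at 1, distances below 1 lose part of their mass in this sketch. Adding a bin centred at 0 would avoid that.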

  15. Experiment • classification of a two-class data set • the higher the classification rates, the better the similarity measure • ATP/NADH data set contains 355 protein binding sites • 214 bind NADH substrates • 141 bind ATP substrates • test procedure: • leave-one-out cross-validation • k-NN classifier
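The test procedure can be sketched as a leave-one-out loop around a k-NN classifier that takes any histogram distance as a parameter. The data below are made-up toy histograms, not the ATP/NADH data set, and the function name and tie-breaking rule are assumptions for illustration.

```python
from collections import Counter


def loo_knn_accuracy(items, labels, dist, k=1):
    """Leave-one-out classification rate of a k-NN classifier under distance `dist`."""
    correct = 0
    for i, query in enumerate(items):
        # Rank all other items by their distance to the held-out item.
        ranked = sorted((dist(query, items[j]), j)
                        for j in range(len(items)) if j != i)
        votes = Counter(labels[j] for _, j in ranked[:k])
        # On a tie, most_common returns the label that was seen first,
        # which here is the label of the nearer neighbour.
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(items)


def l1_distance(h, g):
    return sum(abs(a - b) for a, b in zip(h, g))


# Made-up two-bin histograms for four binding sites, two per class.
toy_items = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["ATP", "ATP", "NADH", "NADH"]

rate = loo_knn_accuracy(toy_items, toy_labels, l1_distance, k=1)
```

Because the distance is passed in as a function, the same loop can compare several histogram distances on equal terms. The measure with the higher classification rate is considered the better similarity measure.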

  16. Results

  17. Conclusions • histograms can be used to represent a protein binding site in a simple way • loss of information • still enough information to retrieve meaningful similarity values • runtime much lower than for the approaches used so far (on the NADH/ATP data set by a factor of about 800) • cross-bin measures try to solve the problem of discontinuity at bin boundaries • fuzzy histograms achieve the highest classification rate
