
Support Vector Machine-based Transmembrane Protein Topology Prediction Tim Nugent


Presentation Transcript


  1. Support Vector Machine-based Transmembrane Protein Topology Prediction. Tim Nugent

  2. Alpha-helical Transmembrane Proteins • Transmembrane proteins fulfil many critical cellular functions. • They comprise about 30% of the human proteome. • Composed of hydrophobic, membrane-spanning alpha-helices connected by loop regions. • Poorly represented in structural databases. • Predicting their structure and topology is therefore an important challenge for bioinformatics.

  3. Machine Learning-based Approaches

  4. Using Support Vector Machines for TM Topology Prediction • Recently, more advanced methods using machine learning algorithms such as hidden Markov models (e.g. TMHMM, PHOBIUS) and neural networks (MEMSAT3) have been developed. • They have achieved significant improvements in prediction accuracy (~80%). • However, none of the top-scoring methods use SVMs. • While hidden Markov models and neural networks may have multiple outputs, SVMs are binary classifiers. • To handle TM topology prediction, multiple SVMs must therefore be combined, e.g. • TM helix / Loop • Inside Loop / Outside Loop • Signal Peptide / ¬Signal Peptide • Re-entrant Loop / ¬Re-entrant Loop

  5. Assembling a Novel Data Set of Transmembrane Proteins • To study and predict features of transmembrane (TM) proteins, a high-quality data set containing sequences with experimentally confirmed TM regions is essential for both training and validation. • Based on the Möller set and the MPTOPO database. Novel TM sequences parsed from SWISS-PROT and searched with BLAST against the PDB. • Fragments, chain breaks, colicins, venoms etc. removed. • Homology-reduced at 40% sequence identity. • Topologies determined by OPM or PDB_TM. • Since PDB structures of TM proteins contain no lipid, theoretical approaches are used to predict the position of the membrane relative to the structure, and thus the TM helix boundaries. • OPM uses water-lipid transfer energy minimisation. • PDB_TM uses hydrophobicity/structural feature analysis.

  6. Data Set Composition • Based on the Möller set and the MPTOPO database. Novel TM sequences parsed from SWISS-PROT and searched with BLAST against the PDB. • Fragments, chain breaks, colicins, venoms etc. removed. • Homology-reduced at 40% sequence identity. • Topologies determined by OPM or PDB_TM.
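
The homology reduction step on this slide can be illustrated with a minimal sketch. It assumes a greedy filter (keep a sequence only if it is at most 40% identical to every sequence already kept); the talk does not say which procedure or tool was used. `naive_identity` is a hypothetical toy stand-in that compares unaligned positions, whereas a real pipeline would derive identity from a proper sequence alignment.

```python
from typing import Callable, Dict, List


def naive_identity(a: str, b: str) -> float:
    """Toy stand-in (assumption): matching positions divided by the longer length.

    No alignment is performed, so this is for illustration only.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def homology_reduce(
    seqs: Dict[str, str],
    identity: Callable[[str, str], float] = naive_identity,
    threshold: float = 0.40,
) -> List[str]:
    """Greedily keep names whose identity to every kept sequence is <= threshold."""
    kept: List[str] = []
    for name, seq in seqs.items():
        if all(identity(seq, seqs[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


example = {
    "a": "MKTLLVAGAV",
    "b": "MKTLLVAGAI",  # 9 of 10 positions match "a", so it is dropped
    "c": "GSPQEDRWHY",  # no positions match "a", so it is kept
}
print(homology_reduce(example))
```

A greedy filter depends on input order: whichever member of a homologous group appears first is the one retained.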

  7. Novel Data Set • Theoretical membrane placement onto the mechanosensitive channel protein MscS crystal structure (PDB code 2oau) by OPM (left) and PDB_TM (right). The membrane region is between the red and blue bars.

  8. Re-entrant Helices • Re-entrant helices in Aquaporin Z (left) from Escherichia coli (PDB code 1rc2) and Potassium channel (right) from Bacillus cereus (PDB code 2ahy) marked with black arrows.

  9. Support Vector Machine Training • Data set of 131 non-redundant protein sequences. • Jackknife cross-validation: sequences with >25% sequence identity removed from training sets. • Signal peptide SVM: 10-fold cross-validation plus additional data from the Phobius set and SWISS-PROT (2654 sequences). • PSI-BLAST profiles vs UniRef90. E-value threshold for inclusion = 0.001. • Normalise by Z-score. • 27-35 (update: 41) residue sliding window. • Transduction. • Optimise window size, kernel choice and parameters using the Matthews correlation coefficient (MCC).
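
The feature preparation and scoring steps listed on this slide can be sketched as follows. The Matthews correlation coefficient has the standard definition MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Everything else here is an assumption for illustration: the talk does not state how windows are padded at the sequence ends (zeros are used below), nor over which values the Z-score is computed, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def zscore(values: Sequence[float]) -> List[float]:
    """Normalise values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        return [0.0] * n
    return [(v - mean) / sd for v in values]


def windows(profile: Sequence[Sequence[float]], size: int) -> List[List[float]]:
    """Build one flattened window of profile rows centred on each residue.

    Positions beyond either end of the sequence are zero-padded (assumption).
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd")
    half = size // 2
    pad = [0.0] * len(profile[0])
    out: List[List[float]] = []
    for i in range(len(profile)):
        feats: List[float] = []
        for j in range(i - half, i + half + 1):
            feats.extend(profile[j] if 0 <= j < len(profile) else pad)
        out.append(feats)
    return out


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Tiny one-column "profile" with a 3-residue window, for illustration only.
print(windows([[1.0], [2.0], [3.0]], 3))
print(mcc(tp=50, tn=50, fp=0, fn=0))
```

With a real PSI-BLAST profile each row would hold one score per amino acid type, so a 33-residue window would yield a correspondingly longer feature vector per residue.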

  10. Window Size
      H/L SVM
      Split 1:  37    39    35    33    31
      MCC 1:    0.79  0.79  0.79  0.79  0.79
      Split 2:  37    35    39    33    31
      MCC 2:    0.82  0.82  0.82  0.82  0.81
      * * *
      I/O SVM
      Split 1:  43    45    41    39    37    35    33
      MCC 1:    0.66  0.66  0.66  0.66  0.65  0.64  0.63
      Split 2:  45    43    41    39    37    35    33
      MCC 2:    0.55  0.55  0.55  0.55  0.54  0.54  0.52
      * * * * *
      Max TM helix length = 33 residues
      Average TM helix length = 21 residues
      Average topogenic loop (< 60 residues) length = 19 residues
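
One way to read the figures on this slide is to average the MCC per window size across the two splits. The sketch below does that under a stated assumption: each MCC value is paired with the window size at the same position in its row, which the flattened slide text suggests but does not confirm. Averaging is also an illustrative choice, not necessarily how the final window size was selected.

```python
from typing import Dict

Splits = Dict[str, Dict[int, float]]

# Window size -> MCC, paired by position within each row (assumption).
HL_SVM: Splits = {
    "split1": {37: 0.79, 39: 0.79, 35: 0.79, 33: 0.79, 31: 0.79},
    "split2": {37: 0.82, 35: 0.82, 39: 0.82, 33: 0.82, 31: 0.81},
}
IO_SVM: Splits = {
    "split1": {43: 0.66, 45: 0.66, 41: 0.66, 39: 0.66, 37: 0.65, 35: 0.64, 33: 0.63},
    "split2": {45: 0.55, 43: 0.55, 41: 0.55, 39: 0.55, 37: 0.54, 35: 0.54, 33: 0.52},
}


def mean_mcc(splits: Splits) -> Dict[int, float]:
    """Average the MCC over splits for every window size present in all splits."""
    shared = set.intersection(*(set(s) for s in splits.values()))
    return {w: sum(s[w] for s in splits.values()) / len(splits) for w in sorted(shared)}


for name, table in (("H/L", HL_SVM), ("I/O", IO_SVM)):
    for window, score in mean_mcc(table).items():
        print(f"{name} window {window}: mean MCC {score:.3f}")
```

On these numbers the MCC is nearly flat across the larger window sizes, dropping only slightly at the smallest windows tested.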

  11. Per Residue SVM Prediction Accuracy

  12. Dynamic Programming • Modified version of the original MEMSAT algorithm, treating TM helices as discrete units rather than separating them into inside, outside and middle components. • Re-entrant helix and signal peptide states were added. • Residues were therefore predicted to lie in one of five different topological regions: inside loop, outside loop, TM helix, re-entrant helix and signal peptide. • For evaluating signal peptide preference, positive signal peptide scores for residues up to position 30 in a target sequence were added to the outside loop score and subtracted from the inside loop score, in order to direct prediction towards a non-cytoplasmic amino terminus. • The value was also scaled by a factor of 10 and subtracted from the TM helix SVM score to prevent TM helix prediction. • For the same reason, positive re-entrant helix scores were scaled by a factor of 10 and subtracted from the TM helix SVM score.
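
The score adjustments described on this slide can be sketched as below. This covers only the pre-processing of per-residue scores, not the dynamic programming itself. The data layout (separate per-residue lists for inside loop, outside loop and TM helix scores) and the names `adjust_scores`, `SP_LIMIT` and `SCALE` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

SP_LIMIT = 30   # signal peptide scores are considered up to position 30 (1-based)
SCALE = 10.0    # factor applied before subtracting from the TM helix score


def adjust_scores(
    inside: Sequence[float],
    outside: Sequence[float],
    helix: Sequence[float],
    signal: Sequence[float],
    reentrant: Sequence[float],
) -> Tuple[List[float], List[float], List[float]]:
    """Return adjusted copies of the inside, outside and TM helix scores."""
    new_in, new_out, new_helix = list(inside), list(outside), list(helix)
    for i in range(len(new_helix)):
        # Positive signal peptide scores near the N-terminus favour an
        # outside (non-cytoplasmic) start and suppress TM helix prediction.
        if i < SP_LIMIT and signal[i] > 0:
            new_out[i] += signal[i]
            new_in[i] -= signal[i]
            new_helix[i] -= SCALE * signal[i]
        # Positive re-entrant helix scores also suppress TM helix prediction.
        if reentrant[i] > 0:
            new_helix[i] -= SCALE * reentrant[i]
    return new_in, new_out, new_helix


print(adjust_scores(
    inside=[0.0, 0.0],
    outside=[0.0, 0.0],
    helix=[1.0, 1.0],
    signal=[0.5, -0.2],
    reentrant=[0.0, 0.25],
))
```

Negative signal peptide and re-entrant scores leave the other scores untouched, matching the slide's restriction to positive scores.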

  13. Overall Prediction Accuracy • Benchmark results for the SVM-based method ('TMSVM') against a selection of leading topology predictors. 'Correct signal peptide' and 'correct re-entrant helix' refer to correct topology prediction for proteins containing these features. TMSVM was able to detect signal peptides with 92% accuracy, and re-entrant helices with 39% accuracy. No false positives of either class were predicted. • OCTOPUS results were not cross-validated and are therefore likely to be overestimated, as there is considerable overlap between its test and training sets. • Tested against the Möller (low resolution) data set: scores 77%, the same as MEMSAT3.

  14. Formate Dehydrogenase

  15. Ubiquinol Oxidase

  16. Glycerol uptake facilitator

  17. ABC transporter BtuCD

  18. Photosystem I

  19. Discriminating between TM and Globular Proteins • For SVM training, we used 416 randomly chosen proteins from the MEMSAT3 [11] set, which consists of 2685 non-redundant chains from globular proteins of known structure, combined with our novel set of 131 TM proteins. • The remaining 2269 sequences were used as test cases. PSI-BLAST profiles were generated for all sequences and 10-fold cross-validation was used to assess performance, again removing sequences from the training fold with greater than 25% sequence identity to any sequence in the test fold. • Window size = 33, Kernel = RBF, MCC = 0.78
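
The cross-validation scheme on this slide, in which training sequences with more than 25% identity to any test-fold sequence are removed, can be sketched as follows. The round-robin fold assignment and the unaligned `naive_identity` measure are assumptions for illustration; the talk does not describe how folds were built or how identity was computed.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def naive_identity(a: str, b: str) -> float:
    """Toy stand-in (assumption): matching positions divided by the longer length."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def cv_splits(
    seqs: Sequence[str],
    k: int,
    identity: Callable[[str, str], float] = naive_identity,
    max_identity: float = 0.25,
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train, test) pairs with homologues of the test fold removed from train."""
    folds = [list(seqs[i::k]) for i in range(k)]  # round-robin assignment
    for i, test in enumerate(folds):
        candidates = [s for j, fold in enumerate(folds) if j != i for s in fold]
        train = [
            s for s in candidates
            if all(identity(s, t) <= max_identity for t in test)
        ]
        yield train, test


seqs = ["MKTLLVAGAV", "MKTLLVAGAI", "GSPQEDRWHY", "WHYGSPQEDR"]
for train, test in cv_splits(seqs, k=2):
    print("train:", train, "test:", test)
```

In the toy run, the two near-identical sequences land in different folds, so each is dropped from the training set whenever the other is being tested.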

  20. Whole Genome Analysis

  21. Conclusions • Novel SVM-based approach predicts correct topology with 88% accuracy, 9% higher than the next best method, OCTOPUS. • Incorporates signal peptide and re-entrant helix prediction. • Signal peptide-containing proteins correctly predicted with 92% accuracy. • Re-entrant helix-containing proteins correctly predicted with 55% accuracy, leaving room for improvement. • Good TM/globular protein discrimination: combined with signal peptide prediction, highly suited to whole genome analysis. • Further work • SVM to predict amphipathic/pore-forming helices.
