
Effective Statistical Energy Function Based Protein Un/Structure Prediction

Effective Statistical Energy Function Based Protein Un/Structure Prediction. Doctoral Dissertation Defense Date: 06/05/2019. Avdesh Mishra Computer Science Department, University of New Orleans Supervisor: Dr. Md Tamjidul Hoque. Research Summary.


Presentation Transcript


  1. Effective Statistical Energy Function Based Protein Un/Structure Prediction. Doctoral Dissertation Defense. Date: 06/05/2019. Avdesh Mishra, Computer Science Department, University of New Orleans. Supervisor: Dr. Md Tamjidul Hoque

  2. Research Summary • Sequence and Structure Based Energy Function • 3DIGARS: Extraction of an energy score from atom-atom pair distances, considering hydrophobic-hydrophilic interactions, to discriminate native protein structures from decoys. • 3DIGARS2.0: Extraction of an energy score by modeling the error between the actual accessible surface area and the one predicted from sequence. • 3DIGARS3.0: Extraction of an energy score by mining torsion angles ubiquitously from structure. • Publications • A. Mishra and M.T. Hoque, “Three-Dimensional Ideal Gas Reference State based Energy Function”, Current Bioinformatics, 2017 • S. Iqbal, A. Mishra and M.T. Hoque, “Improved Prediction of Accessible Surface Area Results in Efficient Energy Function Application”, Journal of Theoretical Biology, 2015 • A. Mishra, S. Iqbal and M.T. Hoque, “Discriminate Protein Decoys from Native by using a Scoring Function based on Ubiquitous Phi and Psi Angles Computed from All Atom”, Journal of Theoretical Biology, 2016 • A. Mishra, S. Iqbal and M.T. Hoque, “An Eclectic Energy Function to Discriminate Native from Decoys”, The 4th Annual LA Conference on Computational Biology and Bioinformatics, 2016, New Orleans, LA Avdesh Mishra, PhD Candidate Computer Science Department, UNO

  3. Research Summary (Cont.) • An Optimal Energy Function for ab initio Protein Structure Prediction • sDIGARS: An energy function that is an optimized linear combination of DFIRE energy, polar-polar and polar-nonpolar energy, and energy obtained by modeling the error between predicted torsion angle and accessible surface area. • o3DIGARS: An energy function that optimally combines sequence and structure based energetic terms collected from four different datasets using two different reference states, in addition to previously developed energy functions. • Publications • M.T. Hoque, Y. Yang, A. Mishra and Y. Zhou, “sDFIRE: Sequence-specific Statistical Energy Function for Protein Structure Prediction by Decoy Selection”, Journal of Computational Chemistry, 2016 • A. Mishra and M.T. Hoque, “3DIGARS-PSP: A Novel Statistical Energy Function and Effective Conformational Search Strategy based ab initio Protein Structure Prediction” (Submitted)

  4. Research Summary (Cont.) • Ab Initio Protein Structure Prediction • 3DIGARS-PSP: Features an appropriate combination of the o3DIGARS energy function and a sampling algorithm to predict the 3D structure of a protein from its sequence. • Publications • A. Mishra and M.T. Hoque, “3DIGARS-PSP: A Novel Statistical Energy Function and Effective Conformational Search Strategy based ab initio Protein Structure Prediction” (Submitted) • A. Mishra and M.T. Hoque, “A Novel Statistical Energy Function and Effective Conformational Search Strategy based ab initio Protein Structure Prediction”, CASP13 Proceedings, 2018 • A. Mishra and M.T. Hoque, “Improved Protein Structure Prediction using Advanced Scoring Function and Effective Sampling”, The 5th Annual LA Conference on Computational Biology and Bioinformatics, 2017, New Orleans, LA • A. Mishra and M.T. Hoque, “Next Generation Evolutionary Sampling and Energy Function Guided Ab Initio Protein Structure Prediction”, Biophysical Journal (Abstract Issue), 2017

  5. Research Summary (Cont.) • Disulfide Connectivity Pattern Prediction • disBPred: GPLSC1GGVC2IPIRC3PVPGTC4FGKC5C6RW → C1-C3, C2-C5, C4-C6 • Publications • A. Mishra and M.T. Hoque, “Prediction of Disulfide (S-S) Bonds using Machine Learning Methods” (Manuscript Under Preparation)

  6. Research Summary (Cont.) • Statistical Energy Function Based Conformational Ensemble Generator for Unstructured Proteins • flexEgy: Extraction of an energy score from atom-atom pair distances (collected from NMR structures), considering hydrophobic-hydrophilic interactions, to discriminate the native structure of unstructured/flexible proteins from decoys. • aiCEG: Features an appropriate combination of the flexEgy energy function, disBPred, and a sampling algorithm to generate an ensemble of conformations for unstructured/flexible proteins. • Publications • A. Mishra and M.T. Hoque, “Three-Dimensional Ideal Gas Reference State based Energy Function for Flexible Proteins”, The 7th Annual LA Conference on Computational Biology and Bioinformatics, 2019, New Orleans, LA • A. Mishra and M.T. Hoque, “An Effective Energy Function and Conformational Search Based Ab Initio Conformational Ensemble Generator” (Manuscript Under Preparation)

  7. Research Summary (Cont.) • Protein and Small Molecules (DNA/RNA/Carbohydrate) Interaction • StackDPPred: …GPLSGGVCIR… → [DNA-binding(0.95), Non DNA-binding(0.05)] • AIRBP: …MPLGGAVCIS… → [RNA-binding(0.95), Non RNA-binding(0.05)] • StackCBPred: …MPLT… → …N(0.48)B(0.88)B(0.95)B(0.78)… • Publications • A. Mishra, P. Pokhrel and M.T. Hoque, “StackDPPred: A Stacking based Prediction of DNA-binding Proteins from Sequence”, Bioinformatics, 2018 • A. Mishra, R. Khanal and M.T. Hoque, “AIRBP: Accurate Identification of RNA-binding Proteins Using Machine Learning Techniques” (Submitted) • S. Gattani, A. Mishra, and M.T. Hoque, “StackCBPred: A Stacking based Prediction of Protein-Carbohydrate Binding Sites from Sequence” (Submitted)

  8. Research Summary (Cont.) • Mapping of Protein Sequence to its Supersecondary Structure (SSS) • StackSSSPred: Two separate binary machine learning models to predict Beta-Hairpin (BH) and Beta-Alpha-Beta (BHB) SSS types. • Example output per residue: M: BH(0.88), R: BH(0.90), S: BH(0.80), T: Non-BH(0.30), L: Non-BH(0.05), … • Publications • M. Flot, A. Mishra, A.S. Kuchi and M.T. Hoque, “StackSSSPred: A Stacking based Prediction of Supersecondary Structure from Sequence”, Book Chapter (Chapter 5, pp 101-122), Protein Supersecondary Structures, Methods in Molecular Biology, vol 1958, Humana Press, New York, NY, 2019 • Figure panels: Beta-Hairpin, Beta-Alpha-Beta

  9. Research Summary (Cont.) • Additional Projects • Oyster Vessel Behavior Prediction • Multiclass Patent Document Classification • Hierarchical Classification of Transposable Elements • Hierarchical Classification of File Fragments • Functional Morphology Prediction • Torsion Angle Fluctuation Prediction • Peptide Binding Residue Prediction • Publications • D.J. Frey, A. Mishra, M.T. Hoque, M. Abdelguerfi and T. Soniat, “A Machine Learning Approach to Determine Oyster Vessel Behavior”, Machine Learning and Knowledge Extraction of MDPI, 2018 • C. Anne, A. Mishra, M.T. Hoque and S. Tu, “Multiclass Patent Document Classification”, Artificial Intelligence Research Journal, 2018 • A. Mishra, M. Panta, M.T. Hoque and J. Atallah, “Prediction of Hierarchical Classification of Transposable Elements using Machine Learning Approach”, The 6th Annual Conference on Computational Biology and Bioinformatics, New Orleans, LA, 2018 • M. Bhatt, A. Mishra, R. Rajendra, S.E. Blake-Gatto, M.T. Hoque, I. Ahmed, “Hierarchical Classification Approach: A new take to File Fragment Classification Problem” (Submitted) • P. Pun, A. Mishra, S. Lailvaux and M.T. Hoque, “A Machine Learning Approach to Functional Morphology and Performance Prediction”, The 7th Annual Conference on Computational Biology and Bioinformatics, New Orleans, LA, 2019 • M.K. Ahmed, A. Mishra and M.T. Hoque, “TAFPred: An Efficient Torsion Angular Fluctuation Predictor of a Protein from its Sequence”, The 6th Annual Conference on Computational Biology and Bioinformatics, New Orleans, LA, 2018 • S. Gattani, A. Mishra and M.T. Hoque, “Sequence and Structure based Protein Peptide Binding Residue Prediction”, The 6th Annual Conference on Computational Biology and Bioinformatics, New Orleans, LA, 2018

  10. Main Focus of This Talk • The First: Prediction of Structured Proteins • The Second: Disulfide Connectivity Pattern Prediction • The Third: Conformational Ensemble Generator for Unstructured Proteins

  11. Existing Gap • There is a huge gap between the number of available sequences and the number of corresponding 3D structures. • There are about 156 million sequences in UniProtKB, but only 151,754 protein structures in PDB. https://www.rcsb.org/stats/growth/overall https://www.uniprot.org/statistics/TrEMBL

  12. Structure → Sequence → Structure • [Diagram: a structure database holds sequences with known structure, from which useful information is mined; for a sequence with unknown structure, computational prediction yields a predicted structure.]

  13. Protein Un/Structure Prediction

  14. Introduction: Energy Function

  15. Role of Energy Function • In general terms, an energy function is used to score structures. • [Figure: energy funnel in which the energy function assigns scores ranging from -10 near the unfolded state, through decoys, down to -140 at the native state.] • Dill, K. A. and H. S. Chan (1997). "From Levinthal to pathways to funnels." Nature Structural Biology 4: 10-19.

  16. Design of Energy Function

  17. Construction of Energy Library • The Boltzmann distribution relates potential energy to probability. So, to obtain energy from probability, the inverse of the Boltzmann distribution is used. • Terms of the equation shown on the slide: the energy score of atom types i and j in distance bin r; the observed frequency of atom types i and j in distance bin r; the expected frequency of atom types i and j in distance bin r (a.k.a. the reference state). • Samudrala, R. and J. Moult (1998). "An All-atom Distance-dependent Conditional Probability Discriminatory Function for Protein Structure Prediction." Journal of Molecular Biology 275: 895-916.
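
The inverse-Boltzmann relation described on this slide is commonly written as E(i, j, r) = -RT ln(N_obs(i, j, r) / N_exp(i, j, r)). The slide's own equation did not survive in the transcript, so the sketch below assumes that standard form. The function name `energy_score`, the default `rt=1.0`, and the small pseudocount that guards against empty bins are illustrative choices, not taken from the presentation.

```python
import math


def energy_score(n_obs, n_exp, rt=1.0, pseudo=1e-6):
    """Inverse-Boltzmann energy for one (i, j, r) cell of the energy library.

    n_obs: observed frequency of atom types i and j in distance bin r
    n_exp: expected frequency from the reference state
    rt, pseudo: hypothetical defaults (RT scale and empty-bin guard)
    """
    return -rt * math.log((n_obs + pseudo) / (n_exp + pseudo))


# Pairs seen more often than expected get a favorable (negative) score.
favorable = energy_score(20, 10)
unfavorable = energy_score(5, 10)
```

Under this form, a contact observed exactly as often as the reference state predicts contributes zero energy.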

  18. Construction of Energy Library (Cont.) (e.g. from pair-wise distance) • Structure based/all-atom energy is extracted from pair-wise distances. • For sequence based energy, the rows hold amino acids instead of atoms. • Rows: atom pairs (i, j), e.g. (ALA N, ALA N), (ALA N, ALA CA), …, (TYR OH, ALA OH); 14028 rows collected from 167 residue-specific atom types. • Columns: distance bins r = 1, 2, …, 30; the maximum distance (cutoff distance) of 15 Å is divided into 30 bins of equal size 0.5 Å. • Each cell holds a count for atom pair (i, j) in distance bin r (e.g. 2, 1).
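
A minimal sketch of how such a count table could be filled, assuming unordered atom-type pairs (which matches 167 × 168 / 2 = 14028 rows) and 0-based bin indices, whereas the slide numbers bins from 1. The names `bin_index` and `count_pairs` are hypothetical, and details such as skipping intra-residue pairs are not stated on the slide and are left out.

```python
import math

CUTOFF = 15.0      # maximum pair distance in angstroms (from the slide)
BIN_SIZE = 0.5     # bin width in angstroms (from the slide)
N_BINS = int(CUTOFF / BIN_SIZE)   # 30 columns


def bin_index(distance):
    """Return the 0-based distance bin, or None at or beyond the cutoff."""
    if distance < 0 or distance >= CUTOFF:
        return None
    return int(distance // BIN_SIZE)


def count_pairs(atoms):
    """atoms: list of (atom_type, (x, y, z)). Returns {pair: [counts per bin]}."""
    counts = {}
    for a in range(len(atoms)):
        for b in range(a + 1, len(atoms)):
            (type_a, pos_a), (type_b, pos_b) = atoms[a], atoms[b]
            k = bin_index(math.dist(pos_a, pos_b))
            if k is None:
                continue
            key = tuple(sorted((type_a, type_b)))   # unordered pair
            counts.setdefault(key, [0] * N_BINS)[k] += 1
    return counts


table = count_pairs([("ALA N", (0.0, 0.0, 0.0)), ("ALA CA", (1.4, 0.0, 0.0))])
```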

  19. Reference State • State when there is no interaction between the atoms of a protein • Different types of reference states are derived from known structures, e.g. • Averaging Reference State: Col. Sum / Total Sum • Samudrala, R. and J. Moult (1998). "An All-atom Distance-dependent Conditional Probability Discriminatory Function for Protein Structure Prediction." Journal of Molecular Biology 275: 895-916.

  20. Introduction: Sampling Algorithm

  21. Sampling Algorithm • Adopted and extended the idea of the Kite-Genetic Algorithm (KGA) for sampling 3D structures. • Flowchart: Generate Initial Population → Evaluate Fitness → Termination Condition Reached? → if yes, report the “best chromosome” and End; otherwise apply Associated Memory-Crossover (Segment Translation) and Mutation (Angular Rotation) to form the New Population, then evaluate fitness again. • Hoque, M. T. and S. Iqbal (2017). "Genetic algorithm-based improved sampling for protein structure prediction." International Journal of Bio-Inspired Computation 9(3): 129-141.

  22. o3DIGARS: An Optimal Energy Function for ab initio Protein Structure Prediction • An optimal linear combination of sequence and structure based energy features

  23. Dataset and Reference States • Two Averaging Reference States • R1: proposed by Hoque et al. • R2: proposed by Samudrala et al. • Terms shown in the equations on the slide: Row Sum (R1), Col. Sum (R2), Total Sum • Samudrala, R. and J. Moult (1998). "An All-atom Distance-dependent Conditional Probability Discriminatory Function for Protein Structure Prediction." Journal of Molecular Biology 275: 895-916. • Hoque, M. T., et al. (2016). "sDFIRE: Sequence-specific statistical energy function for protein structure prediction by decoy selections." Journal of Computational Chemistry 37(12): 1119-1124.

  24. Collection of Energetic Terms • Heffernan, R., et al. (2015). "Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning." Scientific Reports 5.

  25. Selection of Useful Energetic Terms • Optimization Dataset Hierarchy • Obj_Fxn = [equation not recoverable from the transcript]

  26. o3DIGARS Energy Function • Terms of the energy function: • 3DIGARS energy • ASA energy computed from training dataset (TDS3) and reference state (RS1) • ASA energy computed using REGAd3p, extracted from 3DIGARS2.0 • Psi energy computed from TDS4 and RS1 • Psi energy computed from TDS2 and RS1 • Psi triplet energy computed from TDS4 and RS1 • Mishra, A. and M. T. Hoque (2018). "3DIGARS-PSP: A Novel Statistical Energy Function and Effective Conformational Search Strategy based ab initio Protein Structure Prediction." 2018, from http://cs.uno.edu/~tamjid/TechReport/AbInitio_TR2018_1.pdf.

  27. Evaluation Measures • We utilize four different evaluation measures: • Native Count: count of correctly selected native structures out of decoys • Z-score: score, based on the energy score, that shows how well the energy function can separate the native structure from decoys (a more negative Z-score is better) • Pearson Correlation Coefficient (between the model's energy and TM-score) • TM-score: model structural accuracy score, in the range (0, 1] • Xu, J. and Y. Zhang (2010). "How significant is a protein structure similarity with TM-score=0.5?" Bioinformatics 26(7): 889-895.
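
The first three measures can be sketched in a few lines. The slide does not give formulas, so this assumes the usual definitions: the Z-score is the native energy minus the mean decoy energy, divided by the decoy standard deviation; a native structure counts as correctly selected when its energy is lower than every decoy's. The function names are hypothetical.

```python
import math
import statistics


def z_score(native_energy, decoy_energies):
    """Separation of the native from the decoys; more negative is better."""
    mu = statistics.mean(decoy_energies)
    sigma = statistics.pstdev(decoy_energies)
    return (native_energy - mu) / sigma


def native_selected(native_energy, decoy_energies):
    """True when the native has the lowest energy in its decoy set."""
    return native_energy < min(decoy_energies)


def pearson(xs, ys):
    """Pearson correlation, e.g. between model energies and TM-scores."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


z = z_score(-140.0, [-100.0, -110.0, -120.0])
```

The native count is then the number of decoy sets for which `native_selected` returns True. With lower energy meaning a better model, a well-behaved energy function is expected to correlate negatively with TM-score.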

  28. Performance Evaluation

  29. 3DIGARS-PSP: Ab Initio Protein Structure Predictor • A predictor that appropriately combines the o3DIGARS energy function and KGA based sampling.

  30. 3DIGARS-PSP Overview • Sampling and scoring are carried out in an iterative fashion using a GA to obtain the final predicted structure. • Inputs: Protein Sequence → Initial Seed Models → Backbone Models; Experimentally Validated Structures → Ramachandran Distribution and SS Propensities for 20 Different Amino Acids. • Flowchart: Initialize GA Population using Single Point Angular Mutation → Obtain Full Model using Oscar-star and Calculate Fitness using o3DIGARS → Rank and Save Best Model in Memory → Select 5% Elite Models → Perform Associated Memory Crossover @ 70% → Fill Rest using Single Point Angular Mutation → Perform Angular Mutation @ 60% → Obtain Full Model using Oscar-star and Calculate Fitness using o3DIGARS → Rank Models in Ascending Order based on Fitness → Save Models → repeat while Gen < 300 → End (Best Model). • Crossover involves protein segment translation followed by torsion angle rotation. • Torsion angle rotation follows the principle of rotation about an arbitrary axis. • Figure: Partial View of the GA Chromosome. • Yang, J., et al. (2015). "The I-TASSER Suite: Protein structure and function prediction." Nature Methods 12: 7-8. • University of Washington. "Robetta Full-chain Protein Structure Prediction Server." Retrieved February 2017, from http://robetta.bakerlab.org/.

  31. Phi/Psi Angle Rotation (Mutation) • Arbitrary axis: the line through the connected backbone atoms P1 and P2. • Initial Position • Step 2: Translate P1 to the origin • Step 3: Rotate P2 onto the z axis • Step 4: Rotate the segment of the structure after point P2 around the z axis • Step 5: Rotate the axis passing through points P1 and P2 back to the original orientation • Step 6: Translate the structure back to the original position • The phi angle involves atoms C(O)n-1–Nn–C(α)n–C(O)n; thus, P1 and P2 for phi angle rotation are Nn and C(α)n. • The psi angle involves atoms Nn–C(α)n–C(O)n–Nn+1; thus, P1 and P2 for psi angle rotation are C(α)n and C(O)n.
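
The translate, align, rotate, and undo sequence on this slide has a closed-form equivalent in Rodrigues' rotation formula, which the sketch below uses instead of the explicit step-by-step matrices. It is an illustration rather than the presenter's implementation; `rotate_about_axis` and `rotate_segment` are hypothetical names, and coordinates are assumed to be listed in chain order.

```python
import math


def rotate_about_axis(point, p1, p2, angle_deg):
    """Rotate `point` by angle_deg about the axis running from p1 to p2."""
    axis = [b - a for a, b in zip(p1, p2)]
    norm = math.sqrt(sum(c * c for c in axis))
    k = [c / norm for c in axis]                    # unit axis vector
    v = [c - a for a, c in zip(p1, point)]          # translate P1 to the origin
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    k_cross_v = [k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0]]
    k_dot_v = sum(a * b for a, b in zip(k, v))
    rotated = [v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1 - cos_t)
               for i in range(3)]
    return tuple(r + a for r, a in zip(rotated, p1))  # translate back


def rotate_segment(coords, p1_idx, p2_idx, angle_deg):
    """Rotate every atom after P2 about the P1-P2 bond; earlier atoms stay put."""
    p1, p2 = coords[p1_idx], coords[p2_idx]
    return [c if i <= p2_idx else rotate_about_axis(c, p1, p2, angle_deg)
            for i, c in enumerate(coords)]


moved = rotate_segment([(0, 0, 0), (0, 0, 1), (1, 0, 1)], 0, 1, 90.0)
```

For a phi rotation, P1 and P2 would be the N and Cα atoms of the residue; for psi, the Cα and carbonyl C atoms, as listed on the slide.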

  32. Phi/Psi Angle Rotation (Mutation) (Guided by Ramachandran Distribution) • For a guided change of the phi or psi angle: • We collected the Ramachandran distribution (phi-psi distribution) for the 20 standard amino acid types from 4,332 high-resolution experimental structures. • To change the phi or psi angle of a certain amino acid type (aa_type), a zone_index belonging to aa_type is selected randomly. • Then, using the roulette wheel selection method, the most probable torsion angle is obtained and the current torsion angle is rotated to this torsion angle. • Ramachandran, G.N.; Ramakrishnan, C.; Sasisekharan, V. (1963). "Stereochemistry of polypeptide chain configurations". Journal of Molecular Biology. 7: 95–9.
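
Roulette wheel selection picks a candidate with probability proportional to its weight, so frequently observed torsion angles are chosen most often without rarer ones being excluded. A minimal sketch, assuming each zone supplies a list of candidate angles with observed frequencies; the function name and the data layout are hypothetical.

```python
import random


def roulette_wheel(angles, frequencies, rng=random):
    """Pick one angle with probability proportional to its frequency."""
    total = sum(frequencies)
    pick = rng.random() * total        # uniform in [0, total)
    running = 0.0
    for angle, freq in zip(angles, frequencies):
        running += freq
        if pick < running:
            return angle
    return angles[-1]                  # guard against float rounding


rng = random.Random(1)
draws = [roulette_wheel([-60.0, 120.0], [10, 90], rng) for _ in range(1000)]
```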

  33. Segment Translation (Crossover) • While creating child#1 from parent#1 and parent#2, instead of simply copying the values, we perform translation of the points. • To perform translation: • The difference between the Cartesian coordinates of the crossover points of the two parents is calculated. • Then, the difference is subtracted from the Cartesian coordinates of every point after the crossover point to obtain the translated points. • Figures: Kite-GA Overview; Translated Points. • Hoque, M. T. and S. Iqbal (2017). "Genetic algorithm-based improved sampling for protein structure prediction." International Journal of Bio-Inspired Computation 9(3): 129-141.
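
A small sketch of the translation step, under the assumption that child#1 keeps parent#1 up to and including the crossover point and takes the remainder from parent#2, shifted so that the two chains meet at that point. The function name and the choice of which parent supplies the head are illustrative, not confirmed by the slide.

```python
def segment_translation_crossover(parent1, parent2, cp):
    """Build a child from parent1[:cp + 1] plus the translated tail of parent2.

    parent1, parent2: lists of (x, y, z) points in chain order
    cp: index of the crossover point
    """
    # Difference between the crossover points of the two parents.
    diff = tuple(b - a for a, b in zip(parent1[cp], parent2[cp]))
    head = list(parent1[:cp + 1])
    # Subtract the difference from every point after the crossover point.
    tail = [tuple(c - d for c, d in zip(pt, diff)) for pt in parent2[cp + 1:]]
    return head + tail


child = segment_translation_crossover(
    [(0, 0, 0), (1, 0, 0), (2, 0, 0)],
    [(5, 5, 5), (6, 5, 5), (6, 6, 5)],
    cp=1,
)
```

Because the tail is shifted rigidly, the distance between the crossover point and the first translated point is the same as it was in parent#2.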

  34. Crossover • During crossover: • We first perform the translation. • Next, the translation is followed by rotation of the phi and psi angles, to ensure that the secondary structure (SS) before and after the translation operation is preserved. • Important question to ask: • How do we identify the SS type of an amino acid before and after the translation operation? • It can be obtained from the SS propensities.

  35. SS Propensities • We constructed SS propensities for the 20 standard amino acid types from the 4,332 high-resolution experimental structures. • The phi axis is divided into bins of equal size 3° (e.g. -180 ≤ phi < -177, -177 ≤ phi < -174), and the psi axis is divided into bins of equal size 3° (e.g. 177 < psi ≤ 180). • Each cell holds frequency counts per SS type (e.g. H, T, E, U). • Based on the phi and psi angles, the cell in the SS propensity table is identified. Then, the SS type with the largest frequency count is assigned to the given amino acid. • Figure: partial view of SS propensities for amino acid “ALA”.
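
The lookup can be sketched as a two-step process: map phi and psi to bin indices, then take the SS type with the largest count in that cell. This assumes 3° bins (as the example bin edges on the slide suggest), a single half-open edge convention for both axes, and "U" as the fallback for an empty cell; the table layout and function names are hypothetical.

```python
BIN_WIDTH = 3                   # degrees; assumed from the example bin edges
N_BINS = 360 // BIN_WIDTH       # 120 bins per axis


def angle_bin(angle_deg):
    """Map an angle in [-180, 180] to a bin index in 0..119."""
    idx = int((angle_deg + 180) // BIN_WIDTH)
    return min(idx, N_BINS - 1)  # fold +180 into the last bin


def assign_ss(table, phi, psi):
    """table: {(phi_bin, psi_bin): {ss_type: count}} for one amino acid type."""
    counts = table.get((angle_bin(phi), angle_bin(psi)), {})
    if not counts:
        return "U"               # assumption: no data means undefined
    return max(counts, key=counts.get)


ala_table = {(0, 119): {"H": 2, "E": 7, "T": 1}}
ss = assign_ss(ala_table, phi=-179.0, psi=179.0)
```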

  36. SS Propensities (Cont.) • Further, • We extract phi-psi angle pairs belonging to SS types “H” and “E”, and group them as the helix and beta groups. • Next, we utilize phi-psi angle pairs from the helix and beta groups to update phi or psi angles that result in a clash, as well as while performing beta smoothing.

  37. Beta Smoothing • Random changes of phi or psi angles could destroy the conserved beta sheet regions. • To overcome this, we apply beta smoothing. • An amino acid (AAi) is considered to satisfy the beta condition if: • AAi-1 and AAi+1 both have SS type “E” • AAi-1 and AAi-2 both have SS type “E” • AAi+1 and AAi+2 both have SS type “E” • Figure: algorithm followed to change the torsion angles constrained by the beta condition.
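
The beta condition can be expressed as a short predicate. The slide lists three cases without saying how they combine; the sketch assumes any one of them is sufficient, and treats positions beyond the chain ends as not “E”. The function name is hypothetical.

```python
def satisfies_beta_condition(ss, i):
    """ss: string of per-residue SS types; i: 0-based index of residue AAi."""
    def is_e(k):
        return 0 <= k < len(ss) and ss[k] == "E"

    return ((is_e(i - 1) and is_e(i + 1))      # flanked by E on both sides
            or (is_e(i - 1) and is_e(i - 2))   # two E residues before
            or (is_e(i + 1) and is_e(i + 2)))  # two E residues after


flags = [satisfies_beta_condition("HEEHEEH", i) for i in range(7)]
```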

  38. Handling Clashes • A change in phi or psi angles could result in a clash between atoms within the structure. • To prevent a clash: • The distance between all alpha-carbon atom pairs (Cα-Cα) within the structure is validated. • If any Cα-Cα pair is closer than 3.6 Å, the change is discarded, and either a new angle for rotation is selected or a different residue is selected for the phi or psi angle change.
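
The clash test reduces to a pairwise distance check over the Cα coordinates. A minimal sketch with a hypothetical function name follows; it checks every pair, which works because consecutive Cα atoms sit roughly 3.8 Å apart and so do not trip the 3.6 Å threshold on their own.

```python
import math
from itertools import combinations

CA_CLASH_CUTOFF = 3.6   # angstroms, from the slide


def has_clash(ca_coords, cutoff=CA_CLASH_CUTOFF):
    """True if any two alpha-carbon atoms are closer than the cutoff."""
    return any(math.dist(a, b) < cutoff for a, b in combinations(ca_coords, 2))


extended = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
folded_back = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (1.0, 0.0, 0.0)]
```

A sampling move would be kept only when `has_clash` returns False for the updated coordinates.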

  39. 3DIGARS-PSP Test Dataset • We collected a set of 16 proteins for which the models submitted by the Rosetta and I-TASSER predictors in the CASP8 challenge have a low TM-score (TM-score < 0.5). • Further, we collected 3 challenging proteins from the E. coli genome. • Two of the proteins (PDB id: 1k4nA and zp7vA) have residue length > 150. • The other protein (PDB id: 2z9hA) has residue length < 150.

  40. Performance of 3DIGARS-PSP • Mishra, A. and M. T. Hoque (2018). "3DIGARS-PSP: A Novel Statistical Energy Function and Effective Conformational Search Strategy based ab initio Protein Structure Prediction." 2018, from http://cs.uno.edu/~tamjid/TechReport/AbInitio_TR2018_1.pdf.

  41. Performance of 3DIGARS-PSP

  42. disBPred: A Disulfide Bond Predictor • A framework using an optimized RBF kernel SVM and Depth First Search

  43. Background • Disulfide bonds are covalent bonds formed during post-translational modification by the oxidation of a pair of cysteines. • Output from disulfide prediction can be applied in aiPSP or aiCEG • to reduce the conformational search space • and can be incorporated with the energy function to generate a new score that favors the close orientation of cysteine residues involved in disulfide bonding. • Figures: disulfide bond derived from two thiol groups through oxidation; schematic view of a protein with disulfide bonds.

  44. Disulfide Bonds Prediction • We carry out disulfide bond prediction in three phases: • Individual cysteine bonding state prediction • Cysteine pair bonding state prediction • Cysteine connectivity pattern prediction • Flow: the prediction probability from the individual cysteine phase is used as an input feature for the cysteine pair phase, and the prediction probability from the cysteine pair phase is used as input for the connectivity pattern phase.
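
The third phase can be illustrated with a depth-first search over candidate pairings. The presentation names Depth First Search but does not describe how it is applied, so the sketch below is only one plausible reading: it assumes every cysteine is bonded to exactly one partner and scores a pattern by the sum of its pair probabilities. The function name and the scoring rule are assumptions.

```python
def best_connectivity(cysteines, pair_prob):
    """Enumerate pairings depth-first and keep the highest-scoring pattern.

    cysteines: list of cysteine positions (an even number is assumed)
    pair_prob: {(a, b): probability} with a before b in `cysteines`
    """
    best = {"score": float("-inf"), "pattern": []}

    def dfs(remaining, pattern, score):
        if len(remaining) < 2:
            if score > best["score"]:
                best["score"], best["pattern"] = score, list(pattern)
            return
        first, rest = remaining[0], remaining[1:]
        for idx, partner in enumerate(rest):
            pattern.append((first, partner))
            dfs(rest[:idx] + rest[idx + 1:], pattern,
                score + pair_prob.get((first, partner), 0.0))
            pattern.pop()

    dfs(list(cysteines), [], 0.0)
    return best["pattern"], best["score"]


probs = {(1, 2): 0.2, (3, 4): 0.3, (1, 3): 0.9,
         (2, 4): 0.8, (1, 4): 0.1, (2, 3): 0.1}
pattern, score = best_connectivity([1, 2, 3, 4], probs)
```

Exhaustive enumeration grows quickly with the number of cysteines, so a practical version would likely prune branches or restrict the search to cysteines predicted as bonded in the first phase.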

  45. Benchmark/Training and Case Study Dataset • Benchmark/training dataset sources: Benchmark Set-A (benchmark dataset established by Niu et al.) and Benchmark Set-B (UniProt Knowledge Base, https://www.uniprot.org/). • Sequence filters: remove sequences < 50 amino acids long; remove sequences containing the keyword “alternate”; remove sequences not containing at least one disulfide bond; remove sequences which contain a “?” or “>” character; filter sequences at the 25% similarity level using Blastclust (https://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html). • Result: DBD1866 (Disulfide Bond Dataset, 1866 proteins) with 8056 bonding cysteine pairs, 487514 non-bonding cysteine pairs (Imb_DBD), 16104 bonding cysteines and 7083 non-bonding cysteines. • Case study dataset source: SL477 dataset established by Zhang et al. • Protein filters: remove proteins not solved by NMR; remove proteins not containing disulfide bonds. • Result: DP20 (20 disordered proteins), each containing a disulfide bond and having either a partial or full structure solved by NMR. • Niu, S., Huang, T., Feng, K.Y., He, Z., Cui, W. Inter- and intra-chain disulfide bond prediction based on optimal feature selection. Protein Pept Lett. 2013; 20: 324–35 • Zhang, T., et al. (2012). "SPINE-D: accurate prediction of short and long disordered regions by a single neural-network based method." Journal of Biomolecular Structure and Dynamics 29(4): 799-813.

  46. Feature Set for Individual Cys Bonding Prediction • Amino acid (1): characterizes the specific amino acid type • Disordered Probability (1): disordered probabilities computed using DisPredict2.0 • Position Specific Scoring Matrix (20): evolutionary information obtained from sequence alignment computed using PSI-BLAST • Physical Properties (5): polarity, secondary structure, molecular volume, codon diversity, electrostatic charge • Torsion Angle Fluctuation (2): phi and psi torsion angle fluctuations computed using DAVAR • Position Specific Estimated Energy (1): energy of each amino acid computed using DisPredict2.0 • Monogram, Bigram (1, 20): conserved amino acid subsequence information computed from PSSM • Terminal Indicator (1): flexible terminal region residue indicator • Secondary Structure Probabilities (6): three different secondary structure (helix, beta and coil) probabilities obtained from DisPredict2.0 and BalancedSSP • Accessible Surface Area (1): predicted real-valued accessible surface area computed using REGAd3p • Some features (indicated on the slide) are ignored for cysteine sites and only extracted for neighboring residues while windowing. • Islam et al., A balanced secondary structure predictor, Journal of Theoretical Biology, 2016, 389, 60-71 • Iqbal et al., Estimation of Position Specific Energy as a Feature of Protein Residues from Sequence Alone for Structural Classification, PLOS ONE, 11(9) • Zhang et al., Fluctuations of backbone torsion angles obtained from NMR-determined structures and their prediction. Proteins: Structure, Function, and Bioinformatics. 2010;78(16):3353–62

  47. Feature Set for Cysteine Pair Bonding Prediction • Individual Cysteine Bonding Probability, Cysteine Distance (2): predicted individual cysteine bonding probability and cysteine-cysteine sequence distance • Some features (indicated on the slide) are ignored for cysteine sites and only extracted for neighboring residues while windowing. • Sharma A, Lyons J, Dehzangi A, Paliwal K. A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. J Theor Biol. 2013;320:41–6. PMID: 23246717 • William R. Atchley, Jieping Zhao, Andrew D. Fernandes, and Tanja Drüke, Solving the protein sequence metric problem, Proc. Natl. Acad. Sci. USA, 2005, 102(18), 6395-400

  48. Cysteine Pair Prediction – Feature Representation • How do we represent features for cysteine pairs? • Features for individual cysteines: the cysteine at residue index 42 has features f1, f2, f3, f4, f5, f6, f7, f8, …; the cysteine at residue index 56 has features f1’, f2’, f3’, f4’, f5’, f6’, f7’, f8’, … • Features for the cysteine pair: Cys42–Cys56 = abs(f1 – f1’), abs(f1 + f1’), abs(f2 – f2’), abs(f2 + f2’), …
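
The pair representation above can be written directly as code. A minimal sketch with a hypothetical function name follows; because both the difference and the sum are wrapped in abs(), the resulting vector is the same whichever cysteine is listed first.

```python
def pair_features(f_a, f_b):
    """Interleave abs(f - f') and abs(f + f') for two equal-length vectors."""
    feats = []
    for x, y in zip(f_a, f_b):
        feats.append(abs(x - y))
        feats.append(abs(x + y))
    return feats


cys42 = [1.0, -2.0]
cys56 = [3.0, 2.0]
cys42_cys56 = pair_features(cys42, cys56)
```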

  49. Individual Cysteine Prediction Model • Sliding window over the training sequences (e.g. M G A A A C C A G P) • Applied windowing to include neighboring residue information • Tested window sizes 1 to 41, and found 37 to be the best performing window size • 53 + 59 × 36 = 2177 features • Support Vector Machine (SVM) classifier • SVM identifies the best decision boundary • Maximizes the margin of the separating hyperplane • To classify overlapping classes accurately, misclassification is softly penalized (cost parameter C) • To classify nonlinearly separable classes accurately, the feature space is transformed into a higher dimension using the Radial Basis Function (RBF) kernel • Grid search was used to optimize the RBF parameter and the cost parameter, C. • The individual cysteine bonding prediction model is applied to a test sequence (e.g. G A G C P C): each cysteine receives a Binding/Non Binding annotation and probability, e.g. NB (0.34) and B (0.75).
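
The feature count on this slide follows from the window layout: the central cysteine contributes 53 features and each of the other 36 positions in a 37-residue window contributes 59. A short sketch checks that arithmetic and shows one way to enumerate window positions. Marking positions that fall off the sequence ends with None is an assumption, since the slide does not say how ends are handled, and both function names are hypothetical.

```python
def window_feature_count(window, n_center, n_neighbor):
    """Total features for a window with one central residue."""
    return n_center + n_neighbor * (window - 1)


def window_indices(center, window, seq_len):
    """Residue indices covered by the window; None marks off-sequence slots."""
    half = window // 2
    return [k if 0 <= k < seq_len else None
            for k in range(center - half, center + half + 1)]


total = window_feature_count(window=37, n_center=53, n_neighbor=59)
slots = window_indices(center=1, window=5, seq_len=4)
```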

  50. Cysteine Pair Prediction Model • Sliding window over the training sequences (e.g. M G A A A C C A G P) • First, applied windowing to include neighboring residue information • Then, the absolute values of the sum and the difference of the features of cysteine pairs are used to create the model • The best window size was found to be 9 • 116 × 8 + 106 + 2 + 1 + 2 × 8 = 1053 features • Support Vector Machine classifier • Grid search was used to optimize the RBF parameter and the cost parameter, C. • The cysteine pair bonding prediction model is applied to a test sequence (e.g. G A G C P C): each cysteine pair receives a Binding/Non Binding annotation and probability, e.g. Cys42-Cys56: B (0.75), Cys42-Cys88: NB (0.34).
