Improving Protein-Ligand Binding Affinity Prediction using Random Forest
Presentation Transcript

  1. Improving Protein-Ligand Binding Affinity Prediction using Random Forest Hongjian Li, Department of Computer Science and Engineering, Chinese University of Hong Kong, 1 April 2014. http://istar.cse.cuhk.edu.hk JackyLeeHongJian@Gmail.com

  2. Protein-Ligand Docking • Protein: large pharmaceutical target • Ligand: small drug candidate Hongjian Li, Kwong-Sak Leung, Takanori Nakane and Man-Hon Wong. iview: an interactive WebGL visualizer for protein-ligand complex. BMC Bioinformatics, 15(1):56, 2014.

  3. Docking Problem Definition • Given a protein and a ligand, predict • binding conformation (pose generation) • binding affinity (scoring) • (Figure: two docked poses with predicted affinities of 8.274 pKd and 8.011 pKd)

  4. Redocking Success Rates • idock and Vina predicted up to 9 conformations • RMSDi: RMSD of the conformation ranked ith by predicted binding affinity, i = 1, 2, …, 9 • RMSDmin: RMSD of the closest conformation, min(RMSDi) Hongjian Li, Kwong-Sak Leung, Pedro J. Ballester and Man-Hon Wong. istar: A Web Platform for Large-Scale Protein-Ligand Docking. PLoS ONE, 9(1):e85678, 2014.

  5. Redocking Success Rates • PDBbind v2013 refined set (N = 2959) • N = 2959 − 66 (size_xyz < 3) − 2 (Sr, Cs) = 2891

  6. Accuracy of Vina Scoring Function • SD = 2.85 kcal/mol on training set (N = 1300) • SD = 2.75 kcal/mol on test set (N = 116) • i.e. a factor of ~104 in the dissociation constant Kd Oleg Trott and Arthur J. Olson. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 31(2):455–461, 2010.

  7. Scoring Functions • Force field based • AMBER • CHARMM • MedusaScore • VDWScore • Knowledge based • ITScore/SE • DrugScore • SMoG • STScore • Empirical • GoldScore • ChemScore • ASP • X-Score • AutoDock Vina • ID-Score • Cyscore Martiniano Bello, Marlet Martínez-Archundia, and José Correa-Basurto. Automated docking for novel drug discovery. Expert Opinion on Drug Discovery, pages 1–14, 2013.

  8. Assumed Functional Form • Vina
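The equation on this slide was an image and did not survive transcription. A reconstruction from the AutoDock Vina paper (Trott & Olson, 2010) — a sketch of the published form, not a verbatim copy of the slide — is:

```latex
e = \sum_{i<j} f_{t_i t_j}(d_{ij}), \qquad
f(d) = w_1\,\mathrm{gauss}_1(d) + w_2\,\mathrm{gauss}_2(d)
     + w_3\,\mathrm{repulsion}(d)
     + w_4\,\mathrm{hydrophobic}(d) + w_5\,\mathrm{hbond}(d)
```

where d is the surface distance between an atom pair, gauss1(d) = exp(−(d/0.5 Å)²), gauss2(d) = exp(−((d − 3 Å)/2 Å)²), repulsion(d) = d² for d < 0 (else 0), and hydrophobic/hbond are piecewise-linear terms. The predicted affinity of the top conformer is then damped by ligand flexibility:

```latex
e' = \frac{e}{1 + w_6 N_{\mathrm{rot}}}
```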

  9. Assumed Functional Form • Predetermined, theory-inspired, additive

  10. Machine-Learning Scoring Functions • No modelling assumptions, non-parametric • Implicitly capture binding effects that are hard to model explicitly • Random forest • RF-Score • Support vector regression • SVR-Score, MD-SVR Pedro J. Ballester and John B. O. Mitchell. A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics, 26(9):1169–1175, 2010. Liwei Li, Bo Wang, and Samy O. Meroueh. Support Vector Regression Scoring of Receptor–Ligand Complexes for Rank-Ordering and Virtual Screening of Chemical Libraries. Journal of Chemical Information and Modeling, 51(9):2132-2138, 2011. Wenhu Zhan, Daqiang Li, Jinxin Che, Liangren Zhang, Bo Yang, Yongzhou Hu, Tao Liu, Xiaowu Dong. Integrating docking scores, interaction profiles and molecular descriptors to improve the accuracy of molecular docking: Toward the discovery of novel Akt1 inhibitors. European Journal of Medicinal Chemistry, 75:11-20, 2014.

  11. The Question • Q1: Does circumventing an assumed functional form lead to better predictions?

  12. Four Models • AutoDock Vina • Undisclosed nonlinear optimization • MLR::Vina • Multiple linear regression with Vina features • RF::Vina • Random forest with Vina features • RF::VinaElem • Random forest with Vina and RF-Score features

  13. Model 1: AutoDock Vina • Vina’s score for the kth conformer • k = 1 for the crystal pose only

  14. Model 2: MLR::Vina • Quasi-linear • Grid search on w6 • 101 values sampled from 0 to 1 with a step size of 0.01 • 16 values sampled from 0.005 to 0.020 with a step size of 0.001
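The grid search above can be sketched as follows. This is a minimal stand-in, assuming (as one reading of MLR::Vina) that for each candidate w6 the five pairwise Vina terms are damped by 1/(1 + w6·Nrot) and the remaining weights are refitted by ordinary least squares; the function name `fit_mlr_vina` and the choice of training RMSE as the selection criterion are illustrative, not the paper's code:

```python
import numpy as np

def fit_mlr_vina(X_terms, nrot, y, w6_grid=None):
    """X_terms: (n, 5) pairwise Vina interaction terms per complex.
    nrot: (n,) rotatable-bond counts. y: (n,) measured affinities (pKd).
    Returns (rmse, w6, weights-with-intercept) for the best w6 on the grid."""
    if w6_grid is None:
        # Grid from the slide: 0 to 1 in steps of 0.01 (101 values),
        # refined by 0.005 to 0.020 in steps of 0.001 (16 values).
        w6_grid = np.concatenate([np.arange(0.0, 1.01, 0.01),
                                  np.arange(0.005, 0.0201, 0.001)])
    best = None
    for w6 in w6_grid:
        # Damp the interaction terms by ligand flexibility, then fit OLS.
        Xd = X_terms / (1.0 + w6 * nrot)[:, None]
        A = np.hstack([Xd, np.ones((len(y), 1))])   # intercept column
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        rmse = np.sqrt(np.mean((A @ w - y) ** 2))
        if best is None or rmse < best[0]:
            best = (rmse, float(w6), w)
    return best
```

Selecting w6 on training error is the simplest choice; cross-validation would be the more cautious variant.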

  15. Model 3: RF::Vina • 6 Vina features • P=500 trees • mtry = 1 to 6 • Best mtry on Out-of-Bag (OOB) data

  16. RF Construction & Prediction • RF trains binary trees using the CART algorithm • RF grows each tree without pruning from a bootstrap sample of the training data • RF selects the best split at each node from a small number (mtry) of randomly chosen features, mtry ∈ {1,…,36} • RF stops splitting a node with ≤ 5 samples • Prediction from an individual tree is the arithmetic mean of the training samples in its leaf node • With P = 500 trees, the RF prediction is the arithmetic mean of the tree predictions

  17. RF Construction & Prediction • Out-of-bag (OOB) data as internal validation • Possible mtry values cover all the feature subset sizes, i.e. {1,…,36}
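The protocol on these two slides can be sketched with scikit-learn as a stand-in for the original randomForest R package. This is a hypothetical reimplementation, not the paper's code: for each candidate mtry, grow a forest of unpruned trees on bootstrap samples and keep the model with the best out-of-bag (OOB) score; `min_samples_split=6` loosely mirrors "stop splitting a node with ≤ 5 samples":

```python
from sklearn.ensemble import RandomForestRegressor

def train_rf(X, y, n_trees=500, seed=99789):
    """Scan mtry over {1, ..., p} and select by OOB R^2 (internal validation,
    no separate tuning set needed)."""
    best_model, best_oob = None, float("-inf")
    for mtry in range(1, X.shape[1] + 1):
        rf = RandomForestRegressor(n_estimators=n_trees,
                                   max_features=mtry,      # features per split
                                   min_samples_split=6,
                                   bootstrap=True,
                                   oob_score=True,          # OOB validation
                                   random_state=seed)
        rf.fit(X, y)
        if rf.oob_score_ > best_oob:
            best_model, best_oob = rf, rf.oob_score_
    return best_model, best_oob
```

With the 6 Vina features this scans mtry ∈ {1,…,6} (RF::Vina); with the combined feature set it scans up to the full feature count.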

  18. Model 4: RF::VinaElem • 6 Vina features + 36 RF-Score features • 4 and 9 atom types for protein and ligand, respectively • Occurrence count for a particular j–i atom type pair • dcutoff = 12 Å, Z(C) = 6, Z(N) = 7, Z(O) = 8, Z(P) = 15, Z(S) = 16
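The 36 occurrence features can be sketched as follows: for each of the 4 protein × 9 ligand element types, count the protein-ligand atom pairs closer than dcutoff = 12 Å. The element lists below follow the slide plus the halogens of RF-Score; the exact type sets and ordering should be checked against Ballester & Mitchell (2010):

```python
import numpy as np

PROTEIN_ELEMENTS = [6, 7, 8, 16]                     # C, N, O, S
LIGAND_ELEMENTS = [6, 7, 8, 15, 16, 9, 17, 35, 53]  # C, N, O, P, S, F, Cl, Br, I

def rf_score_features(prot_z, prot_xyz, lig_z, lig_xyz, dcutoff=12.0):
    """prot_z/lig_z: atomic numbers; prot_xyz/lig_xyz: (n, 3) coordinates.
    Returns a length-36 vector of contact counts, one per atom-type pair."""
    # All protein-ligand pairwise distances, shape (n_prot, n_lig).
    d = np.linalg.norm(prot_xyz[:, None, :] - lig_xyz[None, :, :], axis=-1)
    features = np.zeros(len(PROTEIN_ELEMENTS) * len(LIGAND_ELEMENTS), dtype=int)
    for i, zp in enumerate(PROTEIN_ELEMENTS):
        for j, zl in enumerate(LIGAND_ELEMENTS):
            pair = (prot_z[:, None] == zp) & (lig_z[None, :] == zl)
            features[i * len(LIGAND_ELEMENTS) + j] = np.sum(pair & (d < dcutoff))
    return features
```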

  19. Visualization of Feature Vector • Protein-ligand atomic contacts as blue lines Pedro J. Ballester. Machine Learning Scoring Functions Based on Random Forest and Support Vector Regression. Pattern Recognition in Bioinformatics, Lecture Notes in Computer Science, 7632:14–25, 2012.

  20. PDBbind v2012 • A: experimentally determined structures as of 2012 • B: protein-ligand, DNA/RNA-ligand, protein-DNA/RNA, protein-protein complexes • C: complexes with Kd, Ki, or IC50 • 7,121 protein-ligand complexes • D: protein-ligand complexes with Kd or Ki • E: 67 non-redundant clusters • BLAST cutoff of 90% sequence identity • each cluster represented by its highest-, median-, and lowest-affinity complexes Renxiao Wang, Xueliang Fang, Yipin Lu, and Shaomeng Wang. The PDBbind Database: Collection of Binding Affinities for Protein-Ligand Complexes with Known Three-Dimensional Structures. Journal of Medicinal Chemistry, 47(12):2977–2980, 2004. Renxiao Wang, Xueliang Fang, Yipin Lu, Chao-Yie Yang, and Shaomeng Wang. The PDBbind Database: Methodologies and Updates. Journal of Medicinal Chemistry, 48(12):4111–4119, 2005.

  21. PDBbind v2007 Benchmark • Refined set (N = 1300) • Training set: 1105 complexes • Test set: 195 complexes • Measured binding affinities span >12 orders of magnitude • 65 non-redundant clusters

  22. PDBbind v2013 Blind Benchmark • Test set: refined13 \ refined12, i.e. complexes new to the v2013 refined set (N = 382) • 4 training sets • refined02 (N = 792) • refined07 (N = 1300) • refined10 (N = 2059) • refined12 (N = 2897) • | Train ∩ Test | = 0

  23. Performance Measures • Root Mean Square Error RMSE • Standard deviation in linear correlation SD • Pearson correlation Rp • Spearman rank-correlation Rs
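The four measures can be sketched as below. "SD" is taken here as the residual standard deviation about the least-squares line fitted between predicted and measured affinity, which is the usual PDBbind-benchmark convention; the exact normalisation (n − 1 is used here) may differ from the paper's:

```python
import numpy as np
from scipy import stats

def evaluate(y_pred, y_true):
    """Return the four slide metrics for one scoring function on one set."""
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    # Linear fit measured-vs-predicted, then residual standard deviation.
    slope, intercept = np.polyfit(y_pred, y_true, 1)
    resid = y_true - (slope * y_pred + intercept)
    sd = np.sqrt(np.sum(resid ** 2) / (len(y_true) - 1))
    rp = stats.pearsonr(y_pred, y_true)[0]    # Pearson correlation Rp
    rs = stats.spearmanr(y_pred, y_true)[0]   # Spearman rank-correlation Rs
    return {"RMSE": rmse, "SD": sd, "Rp": rp, "Rs": rs}
```

Note that RMSE penalises any systematic offset, while SD, Rp, and Rs are invariant to linear rescaling of the predictions.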

  24. Results and Discussion • Vina vs. MLR::Vina vs. RF::Vina vs. RF::VinaElem • Overfitting • MLSFs are remarkably more accurate than empirical SFs • The impact of data representation on the applicability domain of SFs • Machine-learning SFs assimilate data better than empirical SFs • Achieved improvement over AutoDock Vina using Random Forest • MLSFs are a largely unexplored research area

  25. MLR::Vina Better than Vina • w6=0.010 for MLR::Vina

  26. RF::Vina Better than MLR::Vina • w6=0.010 for MLR::Vina • seed=99789 for RF::Vina

  27. RF::VinaElem Better than RF::Vina • seed=99789 for RF::Vina • seed=72551 for RF::VinaElem

  28. Ligand- and Protein-Only Features • With and without Nrot • QSAR models

  29. Overfitting • Claim: machine-learning SFs tend to overfit the training data • Challenged by Cao and Li • MLR::Vina • SD = 1.71 on training set (N = 1105) • SD = 1.91 on test set (N = 195) • RF::Vina • SD = 0.60 on training set (N = 1105) • SD = 1.62 on test set (N = 195) Yang Cao and Lei Li. Improved protein–ligand binding affinity prediction by using a curvature-dependent surface-area model. Bioinformatics, 2014.

  30. MLSFs More Accurate • AutoDock Vina with and without coupling a linear model

  31. MLSFs More Accurate • RF::VinaElem with and without coupling a linear model

  32. MLSFs More Accurate • 22 SFs on the PDBbind v2007 benchmark

  33. Applicability Domain • Same training set • Same test set • Same features • … implies • Same applicability domain • For predictive purposes

  34. MLSFs Assimilate Data Better

  35. Rescoring Vina • Available as rf-score-3.tgz for Linux & Windows • N = 1300 vs N = 2897

  36. Largely Unexplored Research Area • Protein family-specific SFs • Pose generation error • Data selection criteria • Waters and ions • Fingerprint descriptors

  37. Conclusion • Q1: Does circumventing an assumed functional form lead to better predictions? • A1: Yes.

  38. Ongoing Directions • Q2: Under what conditions? • Q3: How to improve predictions of docked poses?