Validating New Techniques for HTS data analysis
Alain Calvet, Kjell Johnson and George S. Cowan
Pfizer Global Research and Development, Ann Arbor Laboratories
alain.calvet@pfizer.com
Outline
• 1. Problems in High Throughput Screening results analysis
  “Lasciate ogni speranza, voi ch’entrate” (Abandon all hope, ye who enter; Dante Alighieri, ca. 1306)
• 2. Solutions to these problems: information analysis vs. the chemist's approach
  “Wer nur organische Chemie versteht, versteht auch die nicht recht” (He who understands only organic chemistry does not understand even that well; G. C. Lichtenberg, ca. 1790)
• 3. Results, Validation, Conclusions
  “Fiat Lux” (Let there be light; Genesis 1:3)
Screening Process
[Figure: Chemical Library → First Pass (% Inhibition) → Second Pass (IC50, µM). Example values shown on the slide: 10.2, 93.6, -5.3, ..., 140.4, 76.8, 0.4, 32.1, 2.9, ..., >100]
Lasciate Ogni Speranza ... (Abandon all hope ...)
• Volume of data
• Noise in Y column (% Inhibition)
• High proportion of false positives
• Some false negatives
• Nonsense inhibition percentages
• Systematic errors in the measurements
• Noise in X matrix (Descriptors)
• Bad compounds (e.g. promiscuous compounds)
• Old library (instability of compounds)
• Combinatorial libraries (often impure)
Wer nur Organische Chemie ... (He who understands only organic chemistry ...)
• Can we improve the quality of information obtained from screening, e.g. by:
  • Looking for consistency
  • Filtering out what is not consistent
  • …
• Then build an activity model in “chemistry/biology space”
Classification Methods for Biochemical and ADME HTS
• Recursive Partitioning: Next_Firm, CART (www.salford-systems.com)
• Kohonen Maps, Learning Vector Quantization: www.cis.hut.fi/research
• Neural Nets: www.partek.com, www.neurocolt.org
• Support Vector Machine: svm.first.gmd.de (see next slide)
• Version Space SAR: in-house developed software
• PLS: www.sas.com
• And others ...: Partek, MOE
Here I wish to spare you 15 slides of mathematics about SVM
• Based on the Statistical Learning Theory of Vapnik and Chervonenkis (late 1960s)
• First described as SVM in 1993
• www.kernel-machines.org
• ais.gmd.de/~thorsten/svm_light (SVMLight)
• www.support-vector.net
• svm.first.gmd.de
• An Introduction to Support Vector Machines. Nello Cristianini and John Shawe-Taylor. Cambridge University Press, 2000
Support Vector Machine
[Figure: mapping x → Φ(x) from descriptor space to feature space]
• Defines a hyperplane/hypersurface in R^p for purposes of discrimination
• Incorporates kernel functions to define the best possible planar separation between actives and inactives in a “feature” space
• Designed to minimize the total error of the classification
Support Vector Machine. Ideal case: separable instances
• A boundary between actives and inactives is defined along with a unit error margin
• A classifier value, the distance to the boundary, is computed for each compound
[Figure: Active and Not Active compounds on either side of the boundary and its “Margin” in feature space]
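The classifier value described on this slide can be illustrated with a minimal sketch. It assumes a linear boundary f(x) = w·x + b in feature space, with the margin at f(x) = ±1; the weights, offset and example points are made up for illustration and are not taken from the study. Whether the slides report the raw decision value or the Euclidean distance is not stated, so both are shown.

```python
import math

def decision_value(x, w, b):
    """Decision value f(x) = w.x + b; the margin lies at f(x) = +1 and -1."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_distance(x, w, b):
    """Signed Euclidean distance from x to the boundary f(x) = 0."""
    return decision_value(x, w, b) / math.sqrt(sum(wi * wi for wi in w))

def classify(x, w, b):
    """Label by the sign of f(x) and report whether x falls inside the margin."""
    f = decision_value(x, w, b)
    label = "Act" if f >= 0 else "Not_Act"
    inside_margin = abs(f) < 1.0
    return label, inside_margin

# Hypothetical 2-D boundary x1 + x2 - 3 = 0 (weights are made up).
w, b = [1.0, 1.0], -3.0
far_active = [3.0, 2.0]      # f = 2.0, outside the margin
near_inactive = [1.0, 1.5]   # f = -0.5, inside the margin

for point in (far_active, near_inactive):
    print(point, decision_value(point, w, b), classify(point, w, b))
```

In a kernel SVM the same quantity is computed as a weighted sum of kernel evaluations against the support vectors rather than an explicit w·x, but the interpretation of the sign and of the ±1 margin is the same.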
Support Vector Machine. Real life: instances not separable
• In practice a perfect separation is rarely obtained
• Instead, some compounds fall inside the margin, and possibly mislabeled compounds appear on both sides of the boundary
[Figure: Active and Not Active compounds overlapping around the boundary in feature space]
Distribution of the classifier value, two real examples
[Figure: two panels, “Not Separable” and “Separable”]
Data set: Kinase Inhibition
• Training set:
  • 26751 compounds with % inhibition
  • 804 were labeled as “active”
• Validation set, “Ground truth”:
  • 1456 compounds with IC50s: 878 active and 578 not active
  • 606 had been tested in first pass
  • 850 had not been tested in first pass
Descriptors, Chemical Space
• BCI fragments (Barnard Chemical Information Ltd., Sheffield, UK):
  • Augmented atoms
  • Atom sequences
  • Atom pairs
  • Ring fragments
• In this study we use 6395 BCI fragments, which is equivalent to a 6395-dimensional space
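To make the "6395-dimensional space" concrete, here is a minimal sketch of a binary fragment fingerprint. It assumes each compound is described only by the presence or absence of each fragment; the fragment indices below are invented for illustration and do not correspond to the actual BCI dictionary.

```python
N_FRAGMENTS = 6395  # dimensionality reported on the slide

def to_dense(fragment_ids, n=N_FRAGMENTS):
    """Expand a set of fragment indices into a 0/1 vector of length n."""
    bits = [0] * n
    for i in fragment_ids:
        bits[i] = 1
    return bits

def linear_kernel(a, b):
    """Dot product of two binary fingerprints = number of shared fragments."""
    return len(a & b)

# Hypothetical compounds, each stored sparsely as the set of fragments present.
compound_a = {12, 407, 5120}
compound_b = {12, 5120, 6001}

shared = linear_kernel(compound_a, compound_b)
dense_dot = sum(x * y for x, y in zip(to_dense(compound_a), to_dense(compound_b)))
print(shared, dense_dot)
```

Storing only the indices of the fragments present keeps the representation small, since a given compound typically contains a small fraction of all fragments.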
Looking for measurement errors
• Different analyses were run using different parameters in the SVM algorithm
• Compounds identified as mislabeled (false positive or false negative) by a vote between the results of the different runs were removed
• Analyses were rerun using the “improved” data set to build an activity model
• An average prediction was then computed
Looking for measurement errors
Votes over 9 runs (rows) vs. activity label (columns)

Votes                  Label Act   Label Not_Act   Total
Act=0, Not_Act=9              21           25867   25888
Act=1, Not_Act=8               4              22      26
Act=2, Not_Act=7              23              12      35
Act=3, Not_Act=6              10               9      19
Act=4, Not_Act=5              72               8      80
Act=5, Not_Act=4              19               4      23
Act=6, Not_Act=3              58              10      68
Act=7, Not_Act=2             162               4     166
Act=8, Not_Act=1              38               6      44
Act=9, Not_Act=0             397               5     402
Total                        804           25947   26751
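The voting step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the slides do not say how many dissenting votes were required before a compound was removed, so `min_votes_against` is a hypothetical parameter, and the toy data use 3 runs instead of 9.

```python
def count_active_votes(run_predictions):
    """For each compound, count how many runs predicted 'Act'."""
    n_compounds = len(run_predictions[0])
    return [sum(run[i] == "Act" for run in run_predictions)
            for i in range(n_compounds)]

def flag_mislabeled(labels, run_predictions, min_votes_against):
    """Indices of compounds whose label is contradicted by enough runs."""
    n_runs = len(run_predictions)
    votes = count_active_votes(run_predictions)
    flagged = []
    for i, (label, act_votes) in enumerate(zip(labels, votes)):
        # Votes against the label: Not_Act votes for an 'Act' label, and vice versa.
        against = n_runs - act_votes if label == "Act" else act_votes
        if against >= min_votes_against:
            flagged.append(i)
    return flagged

# Toy data: 4 compounds, 3 runs (the study used 9 runs).
labels = ["Act", "Act", "Not_Act", "Not_Act"]
runs = [
    ["Act", "Not_Act", "Not_Act", "Act"],
    ["Act", "Not_Act", "Not_Act", "Act"],
    ["Act", "Not_Act", "Act", "Act"],
]

votes = count_active_votes(runs)
suspect = flag_mislabeled(labels, runs, min_votes_against=3)
cleaned = [i for i in range(len(labels)) if i not in suspect]
print(votes, suspect, cleaned)
```

In the table above, the 21 compounds labeled Act that received no Act vote and the 5 compounds labeled Not_Act that received 9 Act votes are the clearest candidates for this kind of filtering.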
Fiat Lux!! (Let there be light)
• Prediction of second pass activity (IC50s) based on first pass screening information
• Data mining in the screening library (330678 compounds)
Prediction of second pass: Statistics
Each cell: Count / Column % / Row %

Activity Label   Predicted Act          Predicted Not_Act      Total
Act              617 / 93.63 / 70.27    261 / 32.75 / 29.73      878
Not_Act           42 /  6.37 /  7.27    536 / 67.25 / 92.73      578
Total            659                    797                     1456

Accuracy = 0.792
Precision = 0.936
Recall = Sensitivity = 0.703
Specificity = 0.927
Kappa = 0.592 (Std Err = 0.0200)
Kappa = (Obs - Exp) / (1 - Exp)

Redman, C. E. Screening Compounds for Clinically Active Drugs, in Statistics for the Pharmaceutical Industry, 36, 19-42, 1981
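The statistics on this slide follow directly from the four cell counts. The sketch below recomputes them, assuming the usual definitions and taking Obs as the observed agreement (accuracy) and Exp as the agreement expected by chance from the row and column totals; function and variable names are illustrative.

```python
def confusion_stats(tp, fn, fp, tn):
    """Summary statistics for a 2x2 table of activity label vs. predicted class."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n                      # Obs: fraction classified correctly
    expected = ((tp + fn) * (tp + fp)             # Exp: chance agreement from
                + (fp + tn) * (fn + tn)) / (n * n)  # the row and column totals
    return {
        "accuracy": observed,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),                 # same as sensitivity
        "specificity": tn / (tn + fp),
        "kappa": (observed - expected) / (1 - expected),
    }

# Counts from the slide: rows are activity labels, columns are predictions.
stats = confusion_stats(tp=617, fn=261, fp=42, tn=536)
for name, value in stats.items():
    print(f"{name} = {value:.3f}")
```

Rounded to three decimals, these agree with the accuracy, precision, recall, specificity and kappa values reported on the slide.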
Prediction of second pass screening, R.O.C. curve
[Figure: ROC curve, True Positive rate vs. False Positive rate, both from 0 to 1]
• 1456 compounds: 878 active, 578 not active
• AUC = 0.874
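For readers unfamiliar with the AUC figure, a minimal sketch of one standard way to compute it: the AUC equals the probability that a randomly chosen active receives a higher classifier value than a randomly chosen inactive, with ties counted as one half. The scores below are toy values, not data from the study, and the pairwise loop is written for clarity rather than speed.

```python
def roc_auc(scores, is_active):
    """Area under the ROC curve via pairwise comparison of actives and inactives."""
    pos = [s for s, a in zip(scores, is_active) if a]
    neg = [s for s, a in zip(scores, is_active) if not a]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy classifier values and ground-truth labels for six compounds.
scores = [2.1, 0.8, 0.3, -0.4, -1.2, -2.0]
is_active = [True, True, False, True, False, False]

auc = roc_auc(scores, is_active)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a ranking that places every active above every inactive.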
Virtual screening: 330678 compounds
[Figure: number of actives retrieved (0 to 800) vs. number tested (logarithmic scale, 1 to 100000); curves: Upper_Reference, Number_Act, Random]
• 827 known actives in 330678 compounds
• ROC curve: AUC = 0.978
• Classifier values marked on the plot: 2.35, -0.17, -3.79

Number tested (n)   Classifier value
1                    2.35
318                  1.00
901                  0.00
11897               -1.00
330678              -3.79
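The retrieval curve on this slide can be sketched as follows, assuming compounds are tested in decreasing order of classifier value. The function names and toy scores are illustrative; only the totals (827 known actives, 330678 compounds, 318 compounds at or above a classifier value of 1.00) come from the slide.

```python
def actives_retrieved(scores, is_active):
    """Cumulative count of actives found when testing by decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    found, curve = 0, []
    for i in order:
        if is_active[i]:
            found += 1
        curve.append(found)
    return curve

def random_expectation(n_tested, n_actives, n_total):
    """Expected actives found when n_tested compounds are picked at random."""
    return n_tested * n_actives / n_total

def upper_reference(n_tested, n_actives):
    """Best possible curve: every tested compound is active until none remain."""
    return min(n_tested, n_actives)

# Toy example with six compounds.
scores = [2.1, 0.8, 0.3, -0.4, -1.2, -2.0]
is_active = [True, True, False, True, False, False]
curve = actives_retrieved(scores, is_active)

# With the slide's totals, random selection of 318 compounds finds under one active.
baseline = random_expectation(318, 827, 330678)
print(curve, round(baseline, 2), upper_reference(318, 827))
```

Plotting `curve` against the number tested on a logarithmic axis, together with the random and upper reference lines, gives the kind of chart described in the figure.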
And More!!
• Not presented here for lack of time:
  • Pooling techniques, i.e. SVM, PLS, VSSAR
  • Validation of the method by checking false positives and false negatives for compounds present in both pass 1 and pass 2 screening
  • Subsetting compounds from the second pass into:
    • Present / not present in first pass
    • Similar / dissimilar to actives in first pass (in fact, the learning set)
Conclusions (1)
• Support Vector Machine has been applied to high throughput screening data in a high dimensional binary space
• First pass results were filtered and reanalyzed to predict activity found in confirmation screening (IC50 measurements)
• This may be useful to:
  • Prioritize compounds for retesting
  • Optimize the design of combinatorial libraries
  • Perform virtual screening
Conclusions (2)
• Results depend highly upon the quality of the information obtained from High Throughput Screening
  • Some screens are good
  • Some are disastrous
• BCI descriptors are biased towards structural description of molecules
  • Pharmacophore descriptors
  • But not zillions of them