Validating New Techniques for HTS data analysis

Validating New Techniques for HTS data analysis • Alain Calvet, Kjell Johnson and George S. Cowan • Pfizer Global Research and Development, Ann Arbor Laboratories • alain.calvet@pfizer.com

Outline • 1. Problems in High Throughput Screening results analysis • Lasciate ogni speranza, voi ch’entrate (Dante Alighieri, ca. 1306) • 2. Solutions to these problems • Wer nur organische Chemie versteht, versteht auch die nicht recht (G.C. Lichtenberg, ca. 1790) • 3. Results, Validation, Conclusions • Fiat Lux (Genesis, I. 3)

Outline • Give up every hope, all ye who enter • High Throughput Screening results analysis • He who understands only organic chemistry does not understand even that well (G.C. Lichtenberg, ca. 1790) • Information analysis vs Chemist approach • Let there be light • Results, Validation, Conclusions

Screening Process • Chemical Library • First Pass: % Inhibition (e.g. 10.2, 93.6, -5.3, . . ., 140.4, 76.8) • Second Pass: IC50 in µM (e.g. 0.4, 32.1, 2.9, . . ., >100)

Lasciate Ogni Speranza ... • Volume of data • Noise in Y column (% Inhibition) • High proportion of false positives • Some false negatives • Nonsensical inhibition percentages • Systematic errors in the measurements • Noise in X matrix (Descriptors) • Bad compounds (e.g. promiscuous compounds) • Old library (instability of compounds) • Combinatorial libraries (often impure)

Wer nur Organische Chemie ... • Can we improve the quality of information obtained from screening? • e.g. by • Looking for consistency • Filtering out what is not consistent • … • Then build an activity model in “chemistry/biology space”

Classification Methods for Biochemical and ADME HTS • Recursive Partitioning: Next_Firm, CART (www.salford-systems.com) • Kohonen Maps, Learning Vector Quantization: www.cis.hut.fi/research • Neural Nets: www.partek.com, www.neurocolt.org • Support Vector Machine: svm.first.gmd.de, next slide • Version Space SAR: in-house developed software • PLS: www.sas.com • And others ...: Partek, MOE

Here I wish to spare you 15 slides of mathematics about SVM • Based on the Statistical Learning Theory of Vapnik and Chervonenkis (late 1960’s) • First described as SVM in 1993 • www.kernel-machines.org • ais.gmd.de/~thorsten/svm_light (SVMLight) • www.support-vector.net • svm.first.gmd.de • Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000

Support Vector Machine • Defines a hyperplane/hypersurface in R^p for purposes of discrimination • Incorporates kernel functions to define the best possible planar separation between actives and inactives in a “feature” space • Designed to minimize the total error of the classification • [Figure: mapping of a compound x in descriptor space to Φ(x) in feature space]
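
The role of a kernel function can be sketched as follows. This is a minimal illustration, not the method used in the study: the quadratic kernel, the explicit map `phi`, and the two fingerprints are assumptions chosen only to show that a kernel evaluated in descriptor space equals a dot product in a larger feature space.

```python
from itertools import product


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def quadratic_kernel(x, z):
    """Homogeneous quadratic kernel K(x, z) = (x . z)^2 (assumed for illustration)."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit feature map for that kernel: all ordered products x_i * x_j."""
    return [a * b for a, b in product(x, repeat=2)]


# Two hypothetical binary fragment fingerprints (illustrative only).
x = [1, 0, 1, 1]
z = [1, 1, 0, 1]

k_implicit = quadratic_kernel(x, z)  # computed in the 4-dimensional descriptor space
k_explicit = dot(phi(x), phi(z))     # computed in the 16-dimensional feature space
```

Both values agree, which is why a planar separation in feature space can be found without ever building the feature vectors explicitly.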

Support Vector Machine • Ideal case: separable instances • A boundary between actives and inactives is defined along with a unit error margin • A classifier value, the distance to the boundary, is computed for each compound • [Figure: Active and Not Active compounds in feature space, separated by the boundary and its “Margin”]
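
A minimal sketch of how a classifier value locates a compound relative to the boundary and the unit margin, assuming a linear decision function f(x) = w . x + b. The weights, bias, compound names and fingerprints below are hypothetical; a kernel SVM would compute the value from support vectors instead.

```python
def decision_value(w, b, x):
    """Linear decision value f(x) = w . x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def region(value):
    """Locate a compound relative to the boundary (0) and the unit margin (+/-1)."""
    if value >= 1.0:
        return "active, outside margin"
    if value > 0.0:
        return "active, inside margin"
    if value > -1.0:
        # A value of exactly 0 (on the boundary) also falls in this branch.
        return "not active, inside margin"
    return "not active, outside margin"


# Hypothetical weights, bias and fingerprints, for illustration only.
w = [1.5, -2.0, 0.5]
b = -0.25
compounds = {"cpd_A": [1, 0, 1], "cpd_B": [0, 1, 0], "cpd_C": [0, 0, 1]}

results = {name: region(decision_value(w, b, x)) for name, x in compounds.items()}
```

Compounds that land inside the margin are the ones whose predicted class is least certain.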

Support Vector Machine • Real life: non-separable instances • In practice a perfect separation is rarely obtained • Instead one obtains compounds inside the margin, as well as possibly mislabeled compounds on both sides of the boundary • [Figure: Active and Not Active compounds overlapping in feature space]

Distribution of the classifier value, two real examples • [Figure: two panels, Not Separable and Separable]

Data set: Kinase Inhibition • Training set: • 26751 compounds with % inhibition • 804 were labeled as “active” • Validation set, “Ground truth”: • 1456 compounds with IC50’s: • 878 active and 578 not active • 606 had been tested in first pass • 850 had not been tested in first pass

Descriptors, Chemical Space • BCI fragments • Augmented Atoms • Atom sequences • Atom pairs • Ring fragments • In this study we use 6395 BCI fragments, which is equivalent to a 6395-dimensional space • Barnard Chemical Information Ltd., Sheffield (UK)
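
As a rough illustration of how fragment descriptors define a binary space, the sketch below encodes a compound as one bit per fragment. The five-entry vocabulary and the fragment strings are hypothetical stand-ins; they are not actual BCI fragment codes.

```python
def fingerprint(fragments_present, vocabulary):
    """Binary vector: 1 where a vocabulary fragment occurs in the compound, else 0."""
    present = set(fragments_present)
    return [1 if frag in present else 0 for frag in vocabulary]


# Hypothetical vocabulary; the study used 6395 BCI fragments instead of 5.
vocabulary = ["C-N", "C=O", "c1ccccc1", "N-H", "S=O"]

# Fragments found in one hypothetical compound (repeats collapse to a single bit).
compound = ["C=O", "c1ccccc1", "C=O"]

fp = fingerprint(compound, vocabulary)
```

Each compound becomes one point in a space with as many dimensions as there are fragments in the vocabulary.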

Looking for measurement errors • Different analyses were run using different parameters in the SVM algorithm • Compounds were identified as mislabeled (false positive or false negative) based on a vote between the results of the different runs, and were removed • Analyses were rerun on the “improved” data set to build an activity model • An average prediction was then computed

Looking for measurement errors • Votes from the 9 runs (rows) against the screening Activity Label (columns):

Class (votes)           Act   Not_Act   Total
Act=0, Not_Act=9         21     25867   25888
Act=1, Not_Act=8          4        22      26
Act=2, Not_Act=7         23        12      35
Act=3, Not_Act=6         10         9      19
Act=4, Not_Act=5         72         8      80
Act=5, Not_Act=4         19         4      23
Act=6, Not_Act=3         58        10      68
Act=7, Not_Act=2        162         4     166
Act=8, Not_Act=1         38         6      44
Act=9, Not_Act=0        397         5     402
Total                   804     25947   26751
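
The vote between runs can be sketched as below. This is an illustration under stated assumptions: the threshold `min_agree`, the compound names and the vote lists are hypothetical, and the slides do not say which cut-off was used to call a compound mislabeled.

```python
def tally(votes):
    """Count how many runs predicted 'Act' for one compound."""
    return sum(1 for v in votes if v == "Act")


def suspect(label, votes, min_agree=5):
    """Flag a compound whose screening label is outvoted by the runs.

    min_agree is a hypothetical threshold: the label is kept when at least
    this many runs agree with it.
    """
    act_votes = tally(votes)
    agree = act_votes if label == "Act" else len(votes) - act_votes
    if agree >= min_agree:
        return None
    return "false positive" if label == "Act" else "false negative"


# Hypothetical screening labels and predictions from 9 runs.
data = {
    "cpd_1": ("Act", ["Act"] * 9),
    "cpd_2": ("Act", ["Act"] + ["Not_Act"] * 8),
    "cpd_3": ("Not_Act", ["Act"] * 8 + ["Not_Act"]),
}

flags = {name: suspect(label, votes) for name, (label, votes) in data.items()}

# The "improved" training set keeps only compounds that were not flagged.
cleaned = [name for name, flag in flags.items() if flag is None]
```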


Fiat Lux!! • Prediction of second pass activity (IC50’s) based on first pass screening information • Data mining in screening library (330678 compounds)

Prediction of second pass: Statistics • Activity Label (rows) against Predicted Class (columns); each cell gives Count (Column %, Row %):

                  Predicted Act          Predicted Not_Act      Total
Act               617 (93.63, 70.27)     261 (32.75, 29.73)       878
Not_Act            42 (6.37, 7.27)       536 (67.25, 92.73)       578
Total             659                    797                     1456

• Accuracy = 0.792 • Precision = 0.936 • Recall = 0.703 • Sensitivity = 0.703 • Specificity = 0.927 • Kappa = 0.592, Std Err = 0.0200 • Kappa = (Obs - Exp) / (1 - Exp) • Redman, C. E. Screening Compounds for Clinically Active Drugs, in Statistics for the Pharmaceutical Industry, 36, 19-42, 1981
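
The reported statistics can be recomputed from the four counts of the confusion matrix. The sketch below uses the standard definitions; the variable names are ours, and the standard error of Kappa is not recomputed.

```python
# Counts from the slide: rows are the IC50-based label, columns the prediction.
tp, fn = 617, 261  # labeled Act: predicted Act / predicted Not_Act
fp, tn = 42, 536   # labeled Not_Act: predicted Act / predicted Not_Act
n = tp + fn + fp + tn

accuracy = (tp + tn) / n        # observed agreement (Obs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)         # same quantity as sensitivity
specificity = tn / (tn + fp)

# Agreement expected by chance (Exp), from the row and column totals.
expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n**2
kappa = (accuracy - expected) / (1 - expected)
```

Rounded to three decimals, these reproduce the values shown on the slide.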

Prediction of second pass screening, R.O.C. curve • [Figure: ROC curve, True Positive rate against False Positive rate] • 1456 Compounds: 878 Active, 578 Not Active • AUC = 0.874
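
One way to compute the area under an ROC curve is as the fraction of (active, not active) pairs that the classifier ranks in the right order. The sketch below uses that pairwise definition on a small hypothetical data set; it is not the software used to obtain the AUC on the slide.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = active) and classifier values.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic in the number of compounds, so for a large library a rank-based formulation would be preferable.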

Virtual screening: 330678 compounds • [Figure: number of actives retrieved against number tested (logarithmic scale); curves: Upper_Reference, Number_Act, Random] • 827 known actives in 330678 compounds • ROC curve: AUC = 0.978 • Classifier values marked on the plot: 2.35, -0.17, -3.79

Virtual screening: 330678 compounds • [Figure: number of actives retrieved against number tested (logarithmic scale); curves: Upper_Reference, Number_Act, Random] • 827 known actives in 330678 compounds • ROC curve: AUC = 0.978 • Classifier value at rank n: n = 1: 2.35 • n = 318: 1.00 • n = 901: 0.00 • n = 11897: -1.00 • n = 330678: -3.79
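
A retrieval curve of this kind can be sketched as follows: compounds are tested in order of decreasing classifier value, and the cumulative number of actives found is compared with what random selection would give. The labels and scores are hypothetical; only the library size (330678), the number of known actives (827) and the rank 901 come from the slide.

```python
def actives_retrieved(labels, scores):
    """Cumulative number of actives found when testing compounds by decreasing score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found, curve = 0, []
    for _, is_active in ranked:
        found += is_active
        curve.append(found)
    return curve


def random_expectation(n_tested, n_actives, n_total):
    """Expected number of actives after testing n_tested compounds picked at random."""
    return n_tested * n_actives / n_total


# Hypothetical mini-library: 1 = known active, scores are classifier values.
labels = [0, 1, 0, 1, 0, 0, 1, 0]
scores = [-1.2, 2.1, -0.3, 0.8, -2.5, 0.1, 1.4, -0.9]
curve = actives_retrieved(labels, scores)

# Random baseline at rank 901 for the library described on the slide.
baseline_901 = random_expectation(901, 827, 330678)
```

With the slide's numbers, random selection of 901 compounds would be expected to find only about two of the 827 known actives.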

And More!! • Not presented here for lack of time: • Pooling techniques, i.e. SVM, PLS, VSSAR • Validation of the method by checking false positives and false negatives for compounds present in both pass 1 and pass 2 screening • Subsetting compounds from the second pass between • Present/not present in first pass • Similar/dissimilar to actives in first pass (in fact the learning set)

Conclusions (1) • Support Vector Machine has been applied to high throughput screening data in a high dimensional binary space • First pass results were filtered and reanalyzed to predict activity found in confirmation screening (IC50 measurements) • This may be useful to • Prioritize compounds for retesting • Optimize the design of combinatorial libraries • Perform virtual screening

Conclusions (2) • Results depend highly upon the quality of the information obtained from High Throughput Screening • Some screens are good • Some are disastrous • BCI descriptors are biased towards structural description of molecules • Pharmacophore descriptors • But not zillions of them
