
Testing the Limits of a QSAR Model: How many cases are actually needed to develop a reliable predictive model?
C. Matthew Sundling, Curt M. Breneman, Mark J. Embrechts, Changjian Huang, Xiaohua Wu, N. Sukumar
April 9th, 2008

Related RECCR Presentations
• Dr. Dominic Ryan - Stability of rank order prediction models of top ten testing molecules. QSAR model stability: How much information is in the data? (COMP, Morial Convention Center rm. 347, Wednesday, 3:05 pm)
• Prof. Mark Embrechts - One-class SVM for outlier detection and applicability of a model to a specific testing set. Testing the validity range of QSAR models using one-class support vector machines. (COMP, Morial Convention Center rm. 347, Thursday, 1:30 pm)

Objective
• How do models function as training data is reduced?
• Hypothesis: if (1) a model is "stable" and (2) the descriptors are appropriate to the effect being modeled, then a great deal of training information could be removed without the testing predictions degrading significantly.

Datasets
• BP - boiling points (298 compounds)
• ACE - Angiotensin-Converting Enzyme inhibitors (112 compounds)
• AChE - acetylcholinesterase inhibitors (60 compounds)
• Lombardo - blood-brain barrier (BBB) partitioning (70 compounds)
• Artemisinin - anti-malarial compounds (179 compounds)
[Slide annotation alongside the list: difficulty]

Descriptors
• MOE - “classic” 2D descriptors
• TAE - electron density derived surface property distributions
• SS - surface statistics of TAE property distributions
• Wavelets - alternative representation of TAE information
• PEST - shape-property 3D hybrid of TAE property distributions
• ALL - combination of all descriptors

Testing Procedure
[Diagram: the Dataset is split into a Training Set (70%) and a Testing Set (30%). A 90% Training Set Subset is then drawn from the Training Set, and the Testing Set (30%) is shown again at each stage.]
PLS models of five components were used throughout the study.
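The data-reduction part of this procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function names `split_dataset` and `shrinking_subsets`, the `min_size` stopping rule, and the choice to make each subset a random 90% of the previous one are all hypothetical readings of the diagram. Fitting the five-component PLS model on each subset is left out.

```python
import random

def split_dataset(n_cases, test_fraction=0.30, seed=0):
    """Randomly split case indices into (train_idx, test_idx), 70/30 by default."""
    rng = random.Random(seed)
    idx = list(range(n_cases))
    rng.shuffle(idx)
    n_test = round(n_cases * test_fraction)
    return idx[n_test:], idx[:n_test]

def shrinking_subsets(train_idx, keep_fraction=0.90, min_size=10, seed=0):
    """Yield successively smaller training subsets.

    Assumption: each subset is a random 90% of the previous one, and the
    reduction stops once fewer than `min_size` cases remain.
    """
    rng = random.Random(seed)
    current = list(train_idx)
    while len(current) >= max(min_size, 1):
        yield list(current)
        # Always drop at least one case so the loop makes progress.
        n_keep = min(int(len(current) * keep_fraction), len(current) - 1)
        current = rng.sample(current, n_keep)

# Example with the size of the BP dataset (298 compounds).
train_idx, test_idx = split_dataset(298)
subsets = list(shrinking_subsets(train_idx))
print([len(s) for s in subsets])
```

Each yielded subset would be used to train a model that is scored on the same `test_idx`, so any change in performance reflects the lost training cases rather than a different testing set.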

Typical Results
[Plots: four panels, labeled AChE, BP, Artemisinin, and ACE]

Repeating Testing Procedure
[Diagram: the same layout as the testing procedure (Dataset, Training Set (70%), 90% Training Set Subset, Testing Set (30%)), but with multiple Training Set boxes stacked at each stage.]

Multiple Training Sets

Performance Instability

Multiple Testing Sets
[Diagram: the Dataset is split repeatedly into different Training Set (70%) / Testing Set (30%) pairs, and the modeling study is repeated for each split.]
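A minimal sketch of repeating the study over many random 70/30 splits follows. It is an illustration only: the data are synthetic, the names `repeated_split_scores`, `fit_line`, and `r_squared` are hypothetical, and a one-descriptor least-squares line stands in for the five-component PLS models used in the study so that the example needs nothing beyond the standard library.

```python
import random
import statistics

def r_squared(y_true, y_pred):
    """Coefficient of determination on a held-out testing set."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(x, y):
    """Stand-in model (NOT PLS): least-squares line through one descriptor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def repeated_split_scores(x, y, n_repeats=20, test_fraction=0.30, seed=0):
    """Score the model on `n_repeats` different random train/test splits."""
    rng = random.Random(seed)
    n_test = round(len(x) * test_fraction)
    scores = []
    for _ in range(n_repeats):
        idx = list(range(len(x)))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        preds = [slope * x[i] + intercept for i in test]
        scores.append(r_squared([y[i] for i in test], preds))
    return scores

# Synthetic data: 60 cases, one descriptor, noisy linear response.
data_rng = random.Random(1)
x = [data_rng.uniform(0, 10) for _ in range(60)]
y = [2.0 * xi + data_rng.gauss(0, 1.0) for xi in x]

scores = repeated_split_scores(x, y)
print("mean R2:", statistics.mean(scores), "spread:", statistics.stdev(scores))
```

The spread of the scores across splits, rather than any single value, is what indicates how strongly the result depends on which cases landed in the testing set.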

Multiple Testing Sets

Distance(trainingset,testset)
Can I understand more about the relationship between the training data and the testing data?

Distance(trainingset,testset)
Distance function = the sum, over all testing molecules, of the Euclidean distance from each testing molecule to its nearest-neighbor training molecule
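The distance function above can be written directly. The sketch below assumes each molecule is represented as a tuple of numeric descriptor values; the name `train_test_distance` is hypothetical, and any descriptor scaling applied before computing distances is not specified on the slide, so none is applied here.

```python
import math

def train_test_distance(training_set, testing_set):
    """Sum, over testing molecules, of the Euclidean distance to the
    nearest training molecule in descriptor space."""
    total = 0.0
    for test_mol in testing_set:
        # Nearest-neighbor distance from this testing molecule to the training set.
        total += min(math.dist(test_mol, train_mol) for train_mol in training_set)
    return total

# Tiny two-descriptor example (made-up values).
training_set = [(0.0, 0.0), (3.0, 4.0)]
testing_set = [(0.0, 1.0), (3.0, 0.0)]
print(train_test_distance(training_set, testing_set))
```

Because Euclidean distance is sensitive to units, descriptors on very different scales would likely need to be standardized first for the sum to be meaningful.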

Distance(trainingset,testset)

Conclusion: Stable?

Conclusion: Data vs. Information?

Conclusion: Potential applications? Extend ensemble models to include stability analysis?

Conclusion: Your model’s performance is dependent on your training data. (Duh!) It’s hard to know when you have enough.

Thanks!
Curt M. Breneman, Mark J. Embrechts, Changjian Huang, Xiaohua Wu, N. Sukumar
