M.Pavan, P.Gramatica, F.Consolaro, V.Consonni, R.Todeschini

QSAR MODELLING OF THE AROMATIC AMINES MUTAGENICITY BY GENETIC ALGORITHM - VARIABLE SUBSET SELECTION

QSAR Research Unit, Dept. of Structural and Functional Biology, University of Insubria, Varese, Italy
e-mail: manuela.pavan@libero.it
Web: http://fisio.dipbsf.uninsubria.it/dbsf/qsar/QSAR.html

TRAINING SET SELECTION
To assess the predictive capability of the models, both internal and external validation were performed. An experimental design based on the Todeschini-Marengo algorithm was used to select the most representative training set of amines: models were developed on the selected training set, and predictions were made for the molecules excluded from the model generation step (test set).

[Workflow diagram: the data set (146 amines, 657 molecular descriptors, 2 experimental responses: TA98 and TA100) is split into a training set and a test set. Variable subset selection on the training set yields regression models (OLS), internally validated by Q2LOO, Q2LMO and Zr, and classification models (CART, K-NN, RDA, CP-ANN), internally validated by ERcv. Predictions for the test set give the external validation parameters Q2ext and ERext.]

Training set = 55 compounds; test set = 60 compounds.

NOMER%   ER%   ERext%
14.5     9.1   6.7

Training set = 46 compounds; test set = 30 compounds.

[Poster panel: strain TA98. Frameshift mutation; aromatic amines as intercalating agents. Classification tree split: BEPp1 < 3.77, with leaves labelled 2 and 1.]

INTRODUCTION
Aromatic and heteroaromatic amines are widespread chemicals of considerable industrial and environmental relevance, as they are carcinogenic to humans. QSAR studies have been used to develop models that estimate and predict mutagenicity by relating it to chemical structure. In mutagenicity QSAR applications, investigators focus either on the molecular determinants that discriminate between active and inactive chemicals, or on the modulators of the relative potency of the active chemicals.
Developing a model to predict mutagenicity requires a test system that provides reproducible and quantitative estimates of toxic activity; the most widely used is the bacterial test introduced by Ames, based on Salmonella typhimurium strains (TA98: frameshift mutation; TA100: base-substitution mutation). The data set consists of 146 aromatic and heteroaromatic amines collected by Debnath et al. [1]; mutagenicity data are expressed as the mutation rate in log (revertants/nmol).

[1] A.K. Debnath et al., A QSAR Investigation of the Role of Hydrophobicity in Regulating Mutagenicity in the Ames Test: 1. Mutagenicity of Aromatic and Heteroaromatic Amines in Salmonella typhimurium TA98 and TA100, Environmental and Molecular Mutagenesis, 19 (1992) 37-52.

MOLECULAR DESCRIPTORS
The molecular structure was represented by a wide set of 657 molecular descriptors calculated with the software DRAGON [2]:
• constitutional descriptors (56)
• BCUT descriptors (7)
• walk counts (20)
• 2D autocorrelation descriptors (96)
• Galvez index (21)
• aromaticity descriptors (4)
• charge descriptors (7)
• geometrical descriptors (18)
• molecular profiles (40)
• WHIM descriptors (99) [3]
• 3D-MoRSE descriptors (160)
• empirical descriptors (3)
• topological descriptors (69)

[2] R. Todeschini and V. Consonni, DRAGON - Software for the calculation of molecular descriptors, version 1.0 for Windows (2000), Milano Chemometric and QSAR Research Group. Free download available at: http://www.disat.unimib.it/chm
[3] R. Todeschini and P. Gramatica, 3D-modelling and prediction by WHIM descriptors. Part 5. Theory development and chemical meaning of the WHIM descriptors, Quant. Struct.-Act. Relat., 16 (1997) 113-119.

REGRESSION MODELS
Mutagenic potency was modelled by the Ordinary Least Squares (OLS) method, using a subset of variables selected from the 657 molecular descriptors; the best subset of variables was selected by a Genetic Algorithm (GA-VSS).
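The GA-VSS step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the population size, crossover and mutation scheme, the use of Q2LOO as the fitness function, and the names `q2_loo` and `ga_vss` are all assumptions made for illustration, and the data are synthetic.

```python
import numpy as np

def q2_loo(X, y):
    """Leave-one-out Q2 of an OLS model with intercept, via the hat matrix."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                     # hat matrix
    resid = y - H @ y
    h = np.clip(np.diag(H), None, 1.0 - 1e-8)     # guard against leverage = 1
    press = np.sum((resid / (1.0 - h)) ** 2)      # sum of squared LOO residuals
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def ga_vss(X, y, n_vars=4, pop_size=30, n_gen=200, p_mut=0.3, seed=0):
    """Toy genetic algorithm searching for the n_vars columns with best Q2LOO.

    Assumes pop_size >= 4 and that at least pop_size distinct subsets exist.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    scores = {}                                   # subset -> Q2LOO cache

    def evaluate(cols):
        key = tuple(sorted(int(c) for c in cols))
        if key not in scores:
            scores[key] = q2_loo(X[:, list(key)], y)
        return key

    pop = set()
    while len(pop) < pop_size:                    # random initial population
        pop.add(evaluate(rng.choice(p, n_vars, replace=False)))

    for _ in range(n_gen):
        ranked = sorted(pop, key=scores.get, reverse=True)
        # Crossover: draw the child from the union of two top-half parents.
        i, j = rng.choice(pop_size // 2, size=2, replace=False)
        pool = np.union1d(ranked[i], ranked[j])
        child = set(rng.choice(pool, n_vars, replace=False).tolist())
        # Mutation: swap one variable for one not currently in the child.
        if rng.random() < p_mut:
            child.remove(int(rng.choice(sorted(child))))
            outside = np.setdiff1d(np.arange(p), sorted(child))
            child.add(int(rng.choice(outside)))
        key = evaluate(child)
        # Replacement: a new, better child takes the place of the worst subset.
        worst = ranked[-1]
        if key not in pop and scores[key] > scores[worst]:
            pop.remove(worst)
            pop.add(key)

    best = max(pop, key=scores.get)
    return best, scores[best]

# Synthetic stand-in for a descriptor matrix (60 objects, 20 descriptors).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
y = 3.0 * X[:, 3] - 2.0 * X[:, 7] + X[:, 11] + 0.1 * rng.normal(size=60)
best, q2 = ga_vss(X, y)
print("selected columns:", best, "Q2LOO: %.3f" % q2)
```

Fixing the subset size at four mirrors the four-variable models reported on the poster; a fuller GA would normally let the number of variables vary as well.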
The models obtained were validated by leave-one-out (Q2LOO), leave-more-out (Q2LMO), y-scrambling (Zr) and an external test set (Q2ext), and show satisfactory predictive performance, considering the uncertainty of the biological end-points.

• TA98
LogTA98 = -3.98 + 2.40 MWC07 + 0.56 MATS7m + 2.44 Mor27u + 1.12 Mor15m

• TA100
LogTA100 = 14.86 - 0.36 nN - 16.34 ATS2e + 12.43 ATS4p - 1.66 GATS4p

Training set = 60 compounds; test set = 39 compounds.

Model statistics (rows in the order given on the poster):

n. variables   Q2LOO   Q2LMO   Q2ext   R2
4              76.6    75.9    69.0    80.3
4              80.9    80.7    66.7    83.9

Descriptors in the TA98 model:
MWC07 = molecular walk count of order 7
MATS7m = Moran autocorrelation - lag 7 / weighted by atomic masses
Mor27u = 3D-MoRSE - signal 27 / unweighted
Mor15m = 3D-MoRSE - signal 15 / weighted by atomic masses

Descriptors in the TA100 model:
nN = number of nitrogen atoms
ATS2e = Broto-Moreau autocorrelation of a topological structure - lag 2 / weighted by atomic Sanderson electronegativities
ATS4p = Broto-Moreau autocorrelation of a topological structure - lag 4 / weighted by atomic polarizabilities
GATS4p = Geary autocorrelation - lag 4 / weighted by atomic polarizabilities

CLASSIFICATION MODELS
Several classification methods (CART, K-NN, RDA and CP-ANN) were applied to this data set in order to distinguish between activity classes. The best subset of variables was selected by the experimental design based on the Todeschini-Marengo algorithm. The models were validated internally (ER) and externally (ERext). The classification models for TA100 showed worse predictive power than those for TA98 and are therefore not presented here.

• TA98
• CART (classification and regression tree)
1 = mutagenic compounds
2 = non-mutagenic compounds
BEPp1 = positive eigenvalue n. 1 / weighted by atomic polarizabilities

CONCLUSIONS
• molecular dimension; molecular branching
• Strain TA100 (base-substitution mutation): molecular dimension; electronic properties; aromatic amines; complex base-pair substitution mutation
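The external validation parameters named above (Q2ext, ER, ERext, NOMER) can be sketched as follows. The coefficients of the TA98 equation are those printed on the poster; everything else is an assumption for illustration: the definition of Q2ext relative to the training-set mean, the definition of NOMER as the error made by assigning every compound to the largest class, the function names, and all descriptor values and class labels, which are invented toy numbers.

```python
import numpy as np

# Coefficients of the TA98 regression model as printed on the poster.
TA98_INTERCEPT = -3.98
TA98_COEFS = {"MWC07": 2.40, "MATS7m": 0.56, "Mor27u": 2.44, "Mor15m": 1.12}

def predict_log_ta98(desc):
    """Evaluate the TA98 equation for one compound (dict of descriptor values)."""
    return TA98_INTERCEPT + sum(c * desc[name] for name, c in TA98_COEFS.items())

def q2_ext(y_train, y_test, y_test_pred):
    """External Q2 (assumed definition): 1 - PRESS_test / SS_test about the training mean."""
    y_test = np.asarray(y_test, dtype=float)
    y_test_pred = np.asarray(y_test_pred, dtype=float)
    press = np.sum((y_test - y_test_pred) ** 2)
    ss = np.sum((y_test - np.mean(y_train)) ** 2)
    return 1.0 - press / ss

def error_rate(true_cls, pred_cls):
    """Percentage of misclassified compounds (ER% or ERext%, depending on the set)."""
    return 100.0 * np.mean(np.asarray(true_cls) != np.asarray(pred_cls))

def nomer(true_cls):
    """No-model error rate (assumed definition): error of always predicting the largest class."""
    _, counts = np.unique(true_cls, return_counts=True)
    return 100.0 * (1.0 - counts.max() / counts.sum())

# Hypothetical descriptor values for two test compounds (not from the poster).
test_desc = [
    {"MWC07": 1.0, "MATS7m": 0.0, "Mor27u": 0.0, "Mor15m": 0.0},
    {"MWC07": 1.5, "MATS7m": 0.2, "Mor27u": 0.1, "Mor15m": -0.3},
]
y_pred = [predict_log_ta98(d) for d in test_desc]
print("predicted LogTA98:", y_pred)

# Hypothetical class labels: 1 = mutagenic, 2 = non-mutagenic.
true_cls = [1, 1, 1, 2]
pred_cls = [1, 1, 2, 2]
print("ER%%: %.1f  NOMER%%: %.1f" % (error_rate(true_cls, pred_cls), nomer(true_cls)))
```

A classification model is only informative when its error rate is clearly below NOMER, which is why the two values are reported side by side.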

