SMILES 2 C371 Lecture Based on Dr. David Wild’s C571 Presentations Fall 2004
Linear Notations • Represent the atoms, bonds, and connectivity as a linear text string • SMILES • Concise • Originally designed for manual command-line entry into text-only systems • Now widely used • Can be entered in a spreadsheet cell, on one line of a text file, or in an Oracle database text field • A canonical form of SMILES can be generated algorithmically
Review of SMILES • Atoms represented by normal chemical symbols (uppercase for aliphatic, lowercase for aromatic) • Adjacent atoms imply single bonds • Use = for double, # for triple bonds • Hydrogens usually implicit • Parentheses indicate branching • Ring closure indicated by matching digits
SMILES Review (cont’d) • Hydrogens can be made explicit • Atoms outside the organic subset are put in square brackets, e.g., [Xe] • Charged species also in square brackets with a + or -, e.g., [Na+] or [O-] • Unknown atoms indicated by a * • Tetrahedral stereochemistry represented by @ or @@
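The notation rules reviewed above can be made concrete with a small tokenizer. This is a minimal sketch, not a validating SMILES parser: the function name tokenize and the regular expression are illustrative assumptions, and only the symbols covered in this review are recognized.

```python
import re

# Illustrative token pattern (an assumption, not a full SMILES grammar).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atom, e.g. [Na+], [nH], [Xe]
    r"|Br|Cl"          # two-letter organic-subset atoms
    r"|[BCNOPSFI]"     # aliphatic organic-subset atoms (uppercase)
    r"|[bcnops]"       # aromatic atoms (lowercase)
    r"|\*"             # unknown atom
    r"|[=#\-:/\\]"     # bond symbols
    r"|[()]"           # branch open / close
    r"|%\d{2}|\d"      # ring-closure labels
    r"|\."             # disconnection between components
)


def tokenize(smiles):
    """Split a SMILES string into atom, bond, branch and ring-closure tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so compare against the input.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


if __name__ == "__main__":
    print(tokenize("NC(Cc1ccc(O)cc1)C(=O)O"))  # tyrosine
    print(tokenize("[Na+].[O-]C"))
```

Putting the bracket-atom and two-letter alternatives first keeps "Cl" from being read as a carbon followed by an unknown character.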
SMILES for Tyrosine NC(Cc1ccc(O)cc1)C(=O)O
SMILES for Acetaminophen (Tylenol) CC(=O)Nc1ccc(O)cc1
SMILES for Isatin O=c2[nH]c1ccccc1c2=O
Canonicalizing SMILES – Morgan Algorithm • Each atom has a connectivity value: how many atoms it is connected to • That value is replaced by the sum of the connectivity values of its neighbors • Continues iteratively, until the number of different values is maximized • Atoms are numbered in decreasing order of connectivity value • In case of a tie, other properties are used (e.g., atomic number, bond order, etc.)
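The iteration described above can be sketched in a few lines of Python. This is an illustrative sketch of the basic extended-connectivity step only: the function name morgan_values and the adjacency-dictionary input are assumptions, and tie-breaking by atomic number or bond order is left out.

```python
def morgan_values(adjacency):
    """Return Morgan extended-connectivity values for a molecular graph.

    adjacency maps each atom index to the list of its neighbor indices
    (heavy atoms only; this input format is an assumption of the sketch).
    """
    # Start with the simple connectivity: number of neighbors per atom.
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        # Replace each value by the sum of the neighbors' current values.
        new_values = {
            atom: sum(values[n] for n in nbrs) for atom, nbrs in adjacency.items()
        }
        new_classes = len(set(new_values.values()))
        # Stop once an iteration no longer increases the number of distinct
        # values; keep the values from the last iteration that did.
        if new_classes <= n_classes:
            return values
        values, n_classes = new_values, new_classes


if __name__ == "__main__":
    # n-pentane as a chain of five carbons: 0-1-2-3-4
    pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    values = morgan_values(pentane)
    # Number atoms in decreasing order of value (ties left unresolved here).
    order = sorted(values, key=lambda atom: -values[atom])
    print(values, order)
```

For the pentane chain the central carbon ends up with the highest value, so it would be numbered first; the symmetric pairs remain tied, which is where the additional properties mentioned above come in.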
Canonicalizing SMILES – CANGEN • Two-stage procedure used by Daylight • First stage, CANON, generates a canonical connection table using a modified version of the Morgan Algorithm that produces a tree structure • Second stage, GENES, creates a unique SMILES using a depth-first search of the molecular graph tree output by CANON • More information: J. Chem. Inf. Comput. Sci. 1989, 29, 97-101
Representing reactions CH4 + 2O2 → CO2 + 2H2O • Need to identify the 2D arrangement of products and reagents (and distinguish them) • Possibly record which starting material atoms map to which product atoms • Other information (e.g., yield, equilibrium constants, conditions) generally stored separately • Not all reactions are specified stoichiometrically
Simple Reaction SMILES • Each reagent and product represented as SMILES • Reagents on the left of a “>>”; products on the right • Individual reagents and products are separated by a “.” CH4 + 2O2 → CO2 + 2H2O Reaction SMILES: C.O=O>>O=C=O.O
Reaction SMILES example • Agents specified between the two “>” characters Reaction SMILES: C.O=O>O=[O+]-[O-]>O=C=O.O
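As a rough illustration of the layout, a reaction SMILES can be split into its three parts with plain string operations. This is a minimal sketch; the function name split_reaction_smiles is hypothetical, and no chemical validation of the individual SMILES is attempted.

```python
def split_reaction_smiles(rxn):
    """Split 'reactants>agents>products' into three lists of SMILES strings."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    # Within each part, individual molecules are separated by '.'.
    reactants, agents, products = (
        [smiles for smiles in part.split(".") if smiles] for part in parts
    )
    return reactants, agents, products


if __name__ == "__main__":
    print(split_reaction_smiles("C.O=O>O=[O+]-[O-]>O=C=O.O"))
    # With no agents, the middle part is simply empty.
    print(split_reaction_smiles("C(=O)Cl.NC>>C(=O)NC.Cl"))
```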
Reaction SMILES example • Note implicit hydrogens Reaction SMILES: C(=O)Cl.NC>>C(=O)NC.Cl
Atom-mapping SMIRKS representation • Each reactant atom gets a tag (e.g., “C” becomes “[C:1]”) which maps to the same tag on the product side • Hydrogens are explicit SMIRKS: [C:1](=[O:2])[Cl:3].[H:99][N:4]([H:100])[C:0]>>[C:1](=[O:2])[N:4]([H:100])[C:0].[Cl:3][H:99]
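One simple consistency check on a mapped reaction is that every tag on the reactant side reappears on the product side attached to the same element. The following is a minimal sketch using a regular expression rather than a real SMIRKS parser; the names atom_maps and maps_balanced are hypothetical, and the split on ">>" assumes no agents are present.

```python
import re

# Matches a bracket atom carrying a map tag, e.g. [C:1], [Cl:3], [nH:5].
# Group 1 is the element symbol, group 2 the map number (illustrative only).
ATOM_MAP = re.compile(r"\[([A-Za-z][a-z]?)[^\]:]*:(\d+)\]")


def atom_maps(side):
    """Return {map_number: element_symbol} for one side of the reaction."""
    return {int(number): symbol for symbol, number in ATOM_MAP.findall(side)}


def maps_balanced(smirks):
    """True if both sides carry the same map numbers on the same elements."""
    reactants, products = smirks.split(">>")  # assumes an empty agent part
    return atom_maps(reactants) == atom_maps(products)


if __name__ == "__main__":
    smirks = (
        "[C:1](=[O:2])[Cl:3].[H:99][N:4]([H:100])[C:0]"
        ">>[C:1](=[O:2])[N:4]([H:100])[C:0].[Cl:3][H:99]"
    )
    print(atom_maps(smirks.split(">>")[0]))
    print(maps_balanced(smirks))
```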
Daylight RS/SMIRKS Sites • Basic reaction representation (Reaction SMILES) • http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html • SMIRKS introduction • http://www.daylight.com/dayhtml_tutorials/languages/smirks/index.html • SMIRKS theory • http://www.daylight.com/dayhtml/doc/theory/theory.rxn.html • SMIRKS depicter • http://www.daylight.com/daycgi_tutorials/react.cgi
Representing generic structures • A generic structure is one which, by ambiguity, represents a (possibly infinite) set of possible structures • Ambiguity usually takes the form of “R” groups • Originally used for representing patents • Now used for representing combinatorial libraries too • Also known as Markush Structures
Specifying a substructure query with SMARTS • SMARTS: a superset of SMILES extended to allow partial structures (substructures) and optional parts of molecules to be represented • Simple example *C(=O)O where the * is a wildcard that matches any atom, here marking the attachment point • More information: • http://www.daylight.com/meetings/summerschool01/course/basics/smarts.html • http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
Try out a SMARTS search • DepictMatch: • http://www.daylight.com/cgi-bin/contrib/depictmatch.cgi • Enter a set of SMILES and a SMARTS, and any part of each structure that matches the SMARTS is highlighted • As an example, we’ll use the sample dataset described on the following two slides, with *C(=O)O (carboxyl group) and [R]C(=O)O (carboxyl attached to a ring atom) as our SMARTS
Sample dataset Acetaminophen Alprenolol Amphetamine Captopril Chlorpromazine Diclofenac Gabapentin Salicylate
Sample Dataset SMILES file
CC(=O)Nc1ccc(O)cc1 Acetaminophen
CC(C)NCC(O)COc1ccccc1CC=C Alprenolol
CC(N)Cc1ccccc1 Amphetamine
CC(CS)C(=O)N1CCCC1C(=O)O Captopril
CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13 Chlorpromazine
OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl Diclofenac
NCC1(CC(=O)O)CCCCC1 Gabapentin
COC(=O)c1ccccc1O Salicylate
Web / Oracle Systems • Advantages • Single database for structures and data • No software to install on client machines (except maybe plug-ins like Chime) • Not dependent on (expensive) contract with MDL • Highly customizable • Disadvantages • Requires extensive web-based interface software to be written, for registration, searching, etc • Company will have to maintain system internally • Requires current ISIS system to be abandoned
Chemistry Cartridges • Daylight DayCart • http://www.daylight.com/products/daycart.html • Tripos Auspyx • http://www.tripos.com/sciTech/inSilicoDisc/chemInfo/auspyx.html • Accelrys Accord for Oracle • http://www.accelrys.com/accord/oracle.html • MDL Direct • http://www.mdl.com/products/framework/rel_chemistry_server/index.jsp • IDBS ActivityBase • http://www.id-bs.com/products/abase/ • JChem Cartridge • http://www.jchem.com
Example - DayCart • Store SMILES as a string (VARCHAR2) in an Oracle database • Cartridge provides extra functions and extensions to functions for searching based on chemical structures • Exact structure search implemented by EXACT function • Substructure search implemented by MATCHES function • Similarity search implemented by TANIMOTO and EUCLID functions
Measuring similarity between molecules • Similar Property Principle: “Molecules with similar structure are likely to have similar biological activity” • Generally the Tanimoto Coefficient or Euclidean Distance between fingerprints is used
Fingerprint Similarity – Tanimoto • Also known as the Jaccard coefficient • ‘1’s in common divided by ‘1’s set in either fingerprint • 0’s are treated as not significant • Similarity is between 0 (dissimilar) and 1 (same) • A good cutoff for likely biologically similar molecules is 0.7 or 0.8
$$\text{Tanimoto similarity} = \frac{c}{\#a + \#b - c}$$
c = ‘1’s in common, #a = ‘1’s in fingerprint A, #b = ‘1’s in fingerprint B
Example:
A 101101011
B 011101101
c = 4, #a = 6, #b = 6
Tanimoto similarity = 4 / (6 + 6 - 4) = 0.5
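The worked example can be reproduced with a short function. This is a minimal sketch that takes fingerprints as strings of '0' and '1' characters (an assumption made for readability; real fingerprints are normally packed bit vectors), and returning 0.0 for two all-zero fingerprints is a convention chosen here, not part of the definition.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two equal-length bit strings."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = fp_a.count("1")                                          # '1's in A
    b = fp_b.count("1")                                          # '1's in B
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == y == "1")      # '1's in common
    union = a + b - c
    # Convention for this sketch: two empty fingerprints score 0.0.
    return c / union if union else 0.0


if __name__ == "__main__":
    print(tanimoto("101101011", "011101101"))
```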
Fingerprint similarity – Euclidean • Pythagorean distance • For binary dimensions, equivalent to the square root of the Hamming distance (i.e. square root of the number of bits that are different) • 0’s are treated as significant • Smaller values mean more similar
Example:
A          101101011
B          011101101
Different? xx    xx
Euclidean distance = sqrt(4) = 2.0
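The same pair of fingerprints can be run through a Euclidean distance function. As before, this is only a sketch: the bit-string input and the function name euclidean_distance are assumptions for illustration.

```python
import math


def euclidean_distance(fp_a, fp_b):
    """Euclidean distance between two equal-length bit strings.

    For binary fingerprints this equals the square root of the Hamming
    distance, i.e. of the number of positions where the bits differ.
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    hamming = sum(1 for x, y in zip(fp_a, fp_b) if x != y)
    return math.sqrt(hamming)


if __name__ == "__main__":
    print(euclidean_distance("101101011", "011101101"))
```

Unlike the Tanimoto coefficient, positions where both bits are 0 count as agreement here, which is why 0's are described as significant.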
Oracle table Test for sample dataset
Smiles                              Name            LogP
------                              ----            ----
CC(=O)Nc1ccc(O)cc1                  Acetaminophen    0.27
CC(C)NCC(O)COc1ccccc1CC=C           Alprenolol       2.81
CC(N)Cc1ccccc1                      Amphetamine      1.76
CC(CS)C(=O)N1CCCC1C(=O)O            Captopril        0.84
CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13    Chlorpromazine   5.20
OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl      Diclofenac       4.02
NCC1(CC(=O)O)CCCCC1                 Gabapentin      -1.37
COC(=O)c1ccccc1O                    Salicylate       2.60
DayCart structure search using SQL
select * from Test where exact(Smiles, 'CC(N)Cc1ccccc1') = 1;
Smiles                              Name            LogP
------                              ----            ----
CC(N)Cc1ccccc1                      Amphetamine      1.76
DayCart substructure search
select * from Test where matches(Smiles, '*C(=O)O') = 1;
Smiles                              Name            LogP
------                              ----            ----
CC(CS)C(=O)N1CCCC1C(=O)O            Captopril        0.84
OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl      Diclofenac       4.02
NCC1(CC(=O)O)CCCCC1                 Gabapentin      -1.37
COC(=O)c1ccccc1O                    Salicylate       2.60
Substructure search for carboxylic acid Acetaminophen Alprenolol Amphetamine Captopril Chlorpromazine Diclofenac Gabapentin Salicylate
DayCart substructure / value search
select * from Test where (matches(Smiles, '*C(=O)O') = 1) AND (LogP > 1.0);
Smiles                              Name            LogP
------                              ----            ----
OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl      Diclofenac       4.02
COC(=O)c1ccccc1O                    Salicylate       2.60
DayCart similarity search (query structure: Aspirin)
select * from TEST where tanimoto(SMILES, 'CC(=O)Oc1ccccc1C(=O)O') > 0.6;
SMILES                              NAME            LOGP
------                              ----            ----
COC(=O)c1ccccc1O                    Salicylate       2.60
CC(=O)Nc1ccc(O)cc1                  Acetaminophen    0.27
CC(N)Cc1ccccc1                      Amphetamine      1.76
Similarity search for carboxylic acid Acetaminophen Alprenolol Amphetamine Captopril Chlorpromazine Diclofenac Gabapentin Salicylate
More examples of DayCart http://www.daylight.com/meetings/summerschool02/course/admin/daycart_hints.html