SIMILARITY & DIVERSITY SEARCHING OF CHEMICAL DATABASES

Naomie Salim, Universiti Teknologi Malaysia

Drug Discovery Value Chain • The organisation and processing of chemical data

Drug Development Process • Pipeline: develop assay → 10,000s of compounds → lead identification → lead optimisation → clinical trials → 1 drug to market • Time from synthesis to product: 14.9 years in the 1990s (through 1996) [S.R. Shulman & M. Manocchia, Pharmacoeconomics, September 1997] • Associated cost of bringing a new drug to market: estimated at well over $500 million [M.L. Lee & K.M. Payne, American Pharmaceutical Review, 1999, 1:55]

Computer-Aided Drug Design (LI = lead identification, LO = lead optimisation) • 3-D target structure unknown: random screening if no actives are known; similarity searching; pharmacophore mapping (LI); pattern recognition methods (LI); QSAR (2D & 3D) (LO); combinatorial library design (LI & LO) • Structure-based drug design (LI): docking; de novo design

Cheminformatics research at UTM • Mainly concerned with the development & application of computer techniques • Similarity searching • Retrieval of diverse compounds from chemical libraries

Drug Lead Optimization • When a promising drug molecule has been found in a drug discovery program, the next step is to optimize the structure and properties of the potential drug. • Search for chemical compounds with similar structure or properties to a known compound.

Rationale For Using Similarity Information • Similar property principle: structurally similar molecules are likely to have similar properties • Given an active target molecule, a similarity search can identify further molecules in the database for testing • [Figure: molecules plotted against Property P1 and Property P2]

Representing chemicals: e.g. connection tables (2D)
Atom  Element  Bonded atoms (s = single bond, d = double bond)
1     O        2 d
2     C        1 d, 3 s, 7 s
3     N        2 s, 4 s
4     C        3 s, 5 d
5     N        4 d, 6 s
6     C        5 s, 7 d, 10 s
7     C        2 s, 6 d, 8 s
8     C        7 s, 9 d
9     N        8 d, 10 s
10    N        6 s, 9 s
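A connection table of this kind maps directly onto an adjacency structure. The sketch below is a minimal illustration, assuming a row-per-atom reading of the table (atom number, element, then each bonded atom with its bond type, s = single and d = double). The names ATOMS, BONDS, is_symmetric and unique_bonds are hypothetical, not taken from the slides.

```python
# Atom number -> element symbol, as listed in the connection table.
ATOMS = {1: "O", 2: "C", 3: "N", 4: "C", 5: "N",
         6: "C", 7: "C", 8: "C", 9: "N", 10: "N"}

# Atom number -> list of (neighbour, bond type); "s" = single, "d" = double.
# Assumption: each row of the table lists the neighbours of one atom.
BONDS = {
    1: [(2, "d")],
    2: [(1, "d"), (3, "s"), (7, "s")],
    3: [(2, "s"), (4, "s")],
    4: [(3, "s"), (5, "d")],
    5: [(4, "d"), (6, "s")],
    6: [(5, "s"), (7, "d"), (10, "s")],
    7: [(2, "s"), (6, "d"), (8, "s")],
    8: [(7, "s"), (9, "d")],
    9: [(8, "d"), (10, "s")],
    10: [(6, "s"), (9, "s")],
}


def is_symmetric(bonds):
    """True if every bond i-j is also recorded as j-i with the same type."""
    return all((i, t) in bonds[j] for i, nbrs in bonds.items() for j, t in nbrs)


def unique_bonds(bonds):
    """Collapse the redundant table into a set of (low atom, high atom, type)."""
    return {(min(i, j), max(i, j), t) for i, nbrs in bonds.items() for j, t in nbrs}


if __name__ == "__main__":
    print("symmetric:", is_symmetric(BONDS))
    print("number of bonds:", len(unique_bonds(BONDS)))
```

A redundant table (each bond stored twice) makes neighbour lookup cheap, while the collapsed set is the more compact form for storage.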

Bit string similarity measure • The bit string similarity measure is currently the most widely used approach for database searching [Downs and Willett, 1996]. • Sub-structural descriptors encoded as bit strings are capable of encapsulating the activity and physical properties of the molecules they characterise [Martin et al., 1998].

The Tanimoto coefficient as the coefficient of choice • Tanimoto and Cosine coefficients performed better than distance measures [Willett and Winterman, 1986] • The Tanimoto coefficient is faster to calculate because it does not involve a square root • The Tanimoto coefficient includes a normalisation factor that helps lessen molecular size effects • Coefficients compared: Cosine, Euclidean distance, Tanimoto, where n = total bit positions in the bit strings, a = bits set in both, and b, c = bits set in only one of the two
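In the a, b, c notation, the textbook binary forms of the three measures are a/(a+b+c) for Tanimoto, a/sqrt((a+b)(a+c)) for Cosine and sqrt(b+c) for the Euclidean distance. The sketch below assumes these standard forms, in particular an unnormalised Euclidean distance, which may differ from the variant used in the study. The function names are illustrative.

```python
import math


def tanimoto(a, b, c):
    """Tanimoto similarity: shared bits over bits set in either string."""
    total = a + b + c
    # Convention chosen here: two empty bit strings score 0.0.
    return a / total if total else 0.0


def cosine(a, b, c):
    """Cosine similarity: shared bits over the geometric mean of the bit counts."""
    denom = math.sqrt((a + b) * (a + c))
    return a / denom if denom else 0.0


def euclidean_distance(b, c):
    """Unnormalised Euclidean distance between two binary vectors."""
    return math.sqrt(b + c)


if __name__ == "__main__":
    a, b, c = 6, 2, 4  # toy counts, not data from the study
    print(tanimoto(a, b, c), cosine(a, b, c), euclidean_distance(b, c))
```

Only the Tanimoto form avoids the square root, which is the speed argument made in the bullet list.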

Is Tanimoto the best coefficient to use for molecular similarity ? • The binary Tanimoto coefficient shows a marked preference for certain values, around 0.3 [Godden et al., 1999] • The distribution of binary Tanimoto coefficient values tends to shift towards lower values as the number of bits set in the query bit string decreases [Lajiness, 1997; Flower, 1998; Dixon and Koehler, 1999] • Rankings of coefficients vary widely between datasets [Willett and Winterman, 1986]

Approaches taken to overcome these problems • Modification of the Tanimoto coefficient [Filimonov et al., 1999; Fligner et al., 2002] • Combination of different similarity coefficients into new coefficients [Dixon and Koehler, 1999] • Data fusion [Salim et al., 2003]

Approach taken in our study • Performance comparison of several coefficients taken from the general literature of information retrieval • Fusion of the rankings obtained from those coefficients
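Fusion of rankings can be sketched as follows. The slides do not say which fusion rule was used, so this example assumes the simple sum-of-rank-positions rule. It also assumes every coefficient scores the same set of compounds. The compound ids and scores are toy values, and the function names are hypothetical.

```python
def ranks_from_scores(scores):
    """Map each compound id to its rank (1 = most similar) for one coefficient."""
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return {cid: pos for pos, cid in enumerate(ordered, start=1)}


def fuse_by_rank_sum(score_tables):
    """Fuse several rankings by summing rank positions (lower sum = better)."""
    totals = {}
    for scores in score_tables:
        for cid, rank in ranks_from_scores(scores).items():
            totals[cid] = totals.get(cid, 0) + rank
    # Ties in the summed rank are broken by compound id for reproducibility.
    return sorted(totals, key=lambda cid: (totals[cid], cid))


if __name__ == "__main__":
    # Toy similarity scores from two different coefficients.
    coeff_1 = {"m1": 0.9, "m2": 0.5, "m3": 0.4, "m4": 0.1}
    coeff_2 = {"m1": 0.3, "m2": 0.8, "m3": 0.7, "m4": 0.1}
    print(fuse_by_rank_sum([coeff_1, coeff_2]))
```

Working on ranks rather than raw scores avoids having to rescale coefficients that live on different ranges.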

Similarity measures studied • Notation: n = total bit positions in the bit strings, a = bits set in both molecules, b = bits set in only one of the molecules, c = bits set in only the other, d = bits not set in either
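The four counts can be obtained in a single pass over two bit strings. The sketch below is a minimal illustration using toy 10-bit strings, not real fingerprints. It also evaluates Tanimoto, a/(a+b+c), and Simple Matching, (a+d)/n, in their standard forms to show how d enters some coefficients and not others. The function name bit_counts is hypothetical.

```python
def bit_counts(x, y):
    """Return (a, b, c, d) for two equal-length strings of '0'/'1' characters."""
    if len(x) != len(y):
        raise ValueError("bit strings must have the same length")
    a = sum(1 for p, q in zip(x, y) if p == "1" and q == "1")  # set in both
    b = sum(1 for p, q in zip(x, y) if p == "1" and q == "0")  # only in x
    c = sum(1 for p, q in zip(x, y) if p == "0" and q == "1")  # only in y
    d = sum(1 for p, q in zip(x, y) if p == "0" and q == "0")  # set in neither
    return a, b, c, d


if __name__ == "__main__":
    x = "1101000110"  # toy bit strings
    y = "1001100100"
    a, b, c, d = bit_counts(x, y)
    n = a + b + c + d
    print("a, b, c, d =", (a, b, c, d))
    print("Tanimoto        =", a / (a + b + c))
    print("Simple Matching =", (a + d) / n)
```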

Clusters of coefficients • {Jaccard/Tanimoto, Dice, Sokal/Sneath(1), Kulczynski(1)} • {Russell/Rao} • {Simple Matching, Hamann, Sokal/Sneath(2), Rogers/Tanimoto, Sokal/Sneath(3), Mean Manhattan} • {Baroni-Urbani/Buser} • {Ochiai/Cosine} • {Kulczynski(2), McConnaughey} • {Forbes} • {Fossum} • {Simpson} • {Pearson} • {Yule} • {Stiles} • {Dennis}

Datasets used: • 30 activities from MDDR • 21 activities from ID Alert. Bit strings used: • BCI bit strings • Daylight fingerprints • UNITY 2D bit strings. Performance measure: • Average number of actives in the top 400 structures
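The performance measure can be sketched as a count of known actives among the top-ranked structures. This example assumes the average is taken over several searches, for instance one per active target structure, since the slides do not spell this out. The ids are toy values and the function names are hypothetical.

```python
def actives_in_top_k(ranked_ids, active_ids, k=400):
    """Count known actives among the first k structures of one ranking."""
    return sum(1 for cid in ranked_ids[:k] if cid in active_ids)


def mean_actives_in_top_k(rankings, active_ids, k=400):
    """Average the top-k active count over several searches."""
    if not rankings:
        return 0.0
    counts = [actives_in_top_k(r, active_ids, k) for r in rankings]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    actives = {"c1", "c2", "c9"}                       # toy set of actives
    search_1 = ["c3", "c1", "c7", "c2", "c9"]          # toy ranked output
    search_2 = ["c1", "c2", "c3", "c4", "c5"]
    print(mean_actives_in_top_k([search_1, search_2], actives, k=3))
```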

What coefficient is the best for similarity searching ?

Rankings of coefficients (MDDR)

Rankings of coefficients (ID Alert)

Summarising ranks from both databases

Is any coefficient consistently better than Tanimoto ? How do coefficients perform when compared to Tanimoto ?

Overall comparison with Tanimoto (1372 cases, across 51 activities, 2 databases, 3 fingerprints)

Average Improvement over Tanimoto (number of actives among top 400 in 51 activities, 2 databases, 3 fingerprints)

Is there any relationship between performance of coefficients and number of bits set in active compounds ?

Considering activities with the lowest z-scores in terms of number of bits set • Best three coefficients for each activity are highlighted • Forbes appears in this best-three list in 11 out of 17 cases

Considering activities with the highest z-scores in terms of number of bits set • Best three coefficients for each activity are highlighted • Russell/Rao appears in this best-three list in 10 out of 17 cases

Considering activities with medium z-scores in terms of number of bits set • Best three coefficients for each activity are highlighted • Cos appears in this best-three list in 8 out of 17 cases, Tan and Fos in 7/11

Distribution of number of bits set in top 5% structures obtained through similarity searching with 21 5HT4 Agonist targets

Sample 5HT4 Agonists with very different ranks using the Russell/Rao and the Forbes coefficients.
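One way to see why these two coefficients can rank the same structure so differently is to compare their standard definitions, Russell/Rao = a/n and Forbes = n·a/((a+b)(a+c)). The sketch below uses hypothetical toy fingerprints, written as Python sets of bit positions in a 64-bit space. It compares a small structure whose bits are all shared with the query against a large structure that contains the whole query. Nothing here is data from the study.

```python
def russell_rao(query, mol, n_bits):
    """Russell/Rao: shared bits divided by the total number of bit positions."""
    return len(query & mol) / n_bits


def forbes(query, mol, n_bits):
    """Forbes: n * a / ((a + b) * (a + c)), i.e. n * a / (|query| * |mol|)."""
    denom = len(query) * len(mol)
    return n_bits * len(query & mol) / denom if denom else 0.0


if __name__ == "__main__":
    N_BITS = 64
    query = set(range(0, 20))   # 20 bits set
    small = set(range(0, 10))   # 10 bits set, all shared with the query
    large = set(range(0, 40))   # 40 bits set, contains the whole query

    for name, mol in (("small", small), ("large", large)):
        print(name,
              "Russell/Rao =", russell_rao(query, mol, N_BITS),
              "Forbes =", forbes(query, mol, N_BITS))
```

In this toy case Russell/Rao scores the large structure higher because it only rewards shared bits, while Forbes scores the small one higher because it divides by the number of bits set in each structure.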

Average number of bits set in different similarity percentiles

Can combinations of coefficients give better performance?

Combinations of coefficients show an improvement over the use of single coefficients

What coefficients to include in combinations?

Three main scenarios …. • Case 1: For, SM the best; Rus the worst • Case 2: Tan, Fos, Cos, Bar the best • Case 3: Rus the best; SM, For the worst • Example plots (2-fusion): Enkephalinase inhibitor; Leukotriene D4 antagonist; Endothelin ETA antagonists

Corresponding to three different situations ….

Typical Scenario ….

The best coefficient over all fusions (51 activities, 2 databases, 3 bit strings) ? • Panels: 2-coefficient fusion; 3-coefficient fusion

What combinations of coefficients are the best (based on ordinal values) ? • Panels: overall 10 best 2-combinations; overall 10 best 3-combinations; overall 10 best 4-combinations

How do combinations compare with single coefficients ?

Average G-H Scores • Panels: overall 10 best singles; overall 10 best 2-fusions; overall 10 best 3-fusions; overall 10 best 4-fusions

How do combinations compare with Tanimoto ?

Which give the best overall improvement over Tanimoto: 2-fusions, 3-fusions, 4-fusions or certain single coefficients ?

Improvement Over Tanimoto (best 10 of singles and fusions) • Panels: overall 10 best singles; overall 10 best 2-fusions; overall 10 best 3-fusions; overall 10 best 4-fusions

Which combinations are best ?

Probability-based similarity searching • The vector space similarity models (VSM) most widely used at present do not take into account the importance of a particular fragment, as learned from previously known active and inactive compounds • In the probability-based models (PM), formal probability theory and statistics are used to estimate the probability that a structure is active (relevant) or non-active (non-relevant) with respect to the query • In PM, structures whose relevance probability exceeds their non-relevance probability are ranked in decreasing order of relevance

Probability-based models • The Binary Independence Retrieval (BIR) model: based on the presence or absence of independently distributed bits in active and inactive structures; the probability of any given bit occurring in a structure is independent of the probability of occurrence of any other bit, whether in active or inactive structures • The Binary Dependence (BD) model: assumes that the probability of any given bit occurring in an active structure depends on the probability of any other bit occurring in an active structure, and similarly for inactive structures

Query fusion in probability models • Results and information obtained from previous queries are used in subsequent queries • Based on these compounds, the probability that bit b_i appears in an active structure and the probability that it appears in an inactive structure are computed for each bit i • This information is used to obtain the ranking score function (RSV) for the second set • The same procedure is repeated for subsequent searches on other datasets
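The per-bit probabilities and the resulting ranking score can be sketched as below. The slides do not give the scoring formula, so this example assumes the standard BIR term weight from information retrieval, log[p_i(1 - q_i) / (q_i(1 - p_i))], summed over the bits set in a structure. Here p_i and q_i are the probabilities of bit i in active and inactive structures. The add-0.5 smoothing, the toy fingerprints (sets of bit positions) and all function names are assumptions made for illustration.

```python
import math


def bit_probabilities(fingerprints, n_bits):
    """Estimate P(bit i is set) from a list of fingerprints, with 0.5 smoothing."""
    total = len(fingerprints)
    return [(sum(1 for fp in fingerprints if i in fp) + 0.5) / (total + 1.0)
            for i in range(n_bits)]


def bir_weights(actives, inactives, n_bits):
    """Per-bit weight: log odds of the bit in actives versus inactives."""
    p = bit_probabilities(actives, n_bits)
    q = bit_probabilities(inactives, n_bits)
    return [math.log(p[i] * (1 - q[i]) / (q[i] * (1 - p[i])))
            for i in range(n_bits)]


def rsv(fingerprint, weights):
    """Ranking score of one structure: sum of the weights of its set bits."""
    return sum(weights[i] for i in fingerprint)


if __name__ == "__main__":
    # Toy training data from a "previous query" (assumed, not from the study).
    actives = [{0, 1}, {0, 2}, {0, 1, 3}]
    inactives = [{2, 3}, {3}, {1, 3}]
    weights = bir_weights(actives, inactives, n_bits=4)

    # Rank a second set of structures by decreasing score.
    second_set = {"x1": {0, 1}, "x2": {2, 3}, "x3": {0, 3}}
    ranked = sorted(second_set, key=lambda cid: -rsv(second_set[cid], weights))
    print(ranked)
```

The smoothing keeps every estimated probability strictly between 0 and 1, so the logarithm is always defined even when a bit never (or always) occurs in one class.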

Results for probability-based searching (AIDS dataset)

Results for probability-based searching (cont.)

Compound selection • More compounds are available than can be screened cost-effectively • Compound selection techniques can be used to • select compounds for screening • choose compounds to purchase from external suppliers • design combinatorial libraries
