Similarity and Diversity Alexandre Varnek, University of Strasbourg, France


What is similar?

Different “spaces”, classified by: colour, shape, pattern, size

16 diverse aldehydes...

...sorted by common scaffold

...sorted by functional groups

The “Similarity Principle”: structurally similar molecules are assumed to have similar biological properties
Example: compounds active at opioid receptors

Structural Spectrum of Thrombin Inhibitors
Structural similarity “fading away” from the reference compounds …
[Figure: inhibitor structures labelled with the values 0.56, 0.72, 0.53, 0.84, 0.67, 0.52, 0.82, 0.64, 0.39]

Key features in similarity/diversity calculations:
• Properties to describe elements (descriptors, fingerprints)
• Distance measure (“metrics”)

N-Dimensional Descriptor Space
molecule $M_i = (\mathrm{descriptor}_1(i), \mathrm{descriptor}_2(i), \ldots, \mathrm{descriptor}_n(i))$
• Each chosen descriptor adds a dimension to the reference space
• Calculation of n descriptor values produces an n-dimensional coordinate vector in descriptor space that determines the position of a molecule
[Figure: coordinate axes descriptor 1, descriptor 2, descriptor 3, …, descriptor n]

Chemical Reference Space
• Distance in chemical space is used as a measure of molecular “similarity” and “dissimilarity”
• “Molecular similarity” covers not only chemical similarity but also property similarity, including biological activity
[Figure: molecules A and B in descriptor space, separated by the distance $D_{AB}$]

Distance Metrics in n-D Space
• If two molecules have comparable values in all n descriptors of the space, they are located close to each other in the n-D space
• How can “closeness” in space be defined as a measure of molecular similarity?
• Distance metrics

Descriptor-based Similarity
• When two molecules A and B are projected into an n-D space, two vectors, A and B, represent their descriptor values:
• $A = (a_1, a_2, \ldots, a_n)$
• $B = (b_1, b_2, \ldots, b_n)$
• The similarity between A and B, $S_{AB}$, is negatively correlated with the distance $D_{AB}$
• shorter distance ~ more similar molecules
• for a normalized distance (value range [0,1]), similarity = 1 - distance
[Figure: molecules A, B and C in descriptor space; $D_{AB} > D_{BC}$, hence $S_{AB} < S_{BC}$]

Metric Properties
• Distance values are non-negative: $d_{AB} \ge 0$; $d_{AA} = d_{BB} = 0$
• Symmetry: $d_{AB} = d_{BA}$
• Triangle inequality: $d_{AB} \le d_{AC} + d_{BC}$
[Figure: molecules A, B and C in descriptor space with the distances $D_{AB}$ and $D_{BC}$]

Euclidean Distance in n-D Space
• Given two n-dimensional vectors, A and B:
• $A = (a_1, a_2, \ldots, a_n)$
• $B = (b_1, b_2, \ldots, b_n)$
• The Euclidean distance $D_{AB}$ is defined as: $D_{AB} = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
• Example: A = (3,0,1); B = (5,2,0)
• $D_{AB} = \sqrt{(3-5)^2 + (0-2)^2 + (1-0)^2} = \sqrt{9} = 3$
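The worked example on this slide can be checked with a few lines of Python. This is only a minimal sketch; the function name `euclidean_distance` is an illustrative choice, not part of the lecture.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of descriptors")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Vectors taken from the slide example
A = (3, 0, 1)
B = (5, 2, 0)
d_ab = euclidean_distance(A, B)
print(d_ab)  # 3.0
```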

Manhattan Distance in n-D Space
• Given two n-dimensional vectors, A and B:
• $A = (a_1, a_2, \ldots, a_n)$
• $B = (b_1, b_2, \ldots, b_n)$
• The Manhattan distance $D_{AB}$ is defined as: $D_{AB} = \sum_{i=1}^{n} |a_i - b_i|$
• Example: A = (3,0,1); B = (5,2,0)
• $D_{AB} = |3-5| + |0-2| + |1-0| = 5$
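The same example vectors can be run through a small Manhattan-distance helper. This is a sketch only, and the name `manhattan_distance` is chosen here for illustration.

```python
def manhattan_distance(a, b):
    """Manhattan (city block) distance: sum of absolute coordinate differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of descriptors")
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


# Vectors taken from the slide example
A = (3, 0, 1)
B = (5, 2, 0)
d_ab = manhattan_distance(A, B)
print(d_ab)  # 5
```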

Distance Measures (“Metrics”) for two points $(x_{11}, x_{12})$ and $(x_{21}, x_{22})$:
• Euclidean distance: $[(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2]^{1/2} = (4^2 + 2^2)^{1/2} = 4.472$
• Manhattan (Hamming) distance: $|x_{11} - x_{21}| + |x_{12} - x_{22}| = 4 + 2 = 6$
• Sup distance: $\max(|x_{11} - x_{21}|, |x_{12} - x_{22}|) = \max(4, 2) = 4$
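The three measures can be compared side by side on one pair of points. The slide gives only the coordinate differences (4 and 2), so the points (5, 3) and (1, 1) below are an assumption chosen to reproduce those differences. The sketch also shows the general ordering sup ≤ Euclidean ≤ Manhattan.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def manhattan(p, q):
    return sum(abs(x - y) for x, y in zip(p, q))


def sup_distance(p, q):
    # Largest absolute difference along any single descriptor axis
    return max(abs(x - y) for x, y in zip(p, q))


# Hypothetical points whose coordinate differences are 4 and 2
x1 = (5, 3)
x2 = (1, 1)

d_euc = euclidean(x1, x2)
d_man = manhattan(x1, x2)
d_sup = sup_distance(x1, x2)
print(round(d_euc, 3), d_man, d_sup)  # 4.472 6 4
```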

Binary Fingerprint

Popular Similarity/Distance Coefficients
• Similarity metrics: Tanimoto coefficient, Dice coefficient, Cosine coefficient
• Distance metrics: Euclidean distance, Hamming distance, Soergel distance

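Tanimoto Coefficient (Tc)
• Definition: $Tc = \dfrac{c}{a + b - c}$, where a and b are the numbers of bits set in fingerprints A and B, and c is the number of bits set in both
• value range: [0,1]
• Tc is also known as the Jaccard coefficient
• Tc is the most popular similarity coefficient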

Example Tc Calculation
Binary fingerprints A and B with a = 4, b = 4, c = 2
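A possible way to reproduce such a calculation in Python is sketched below. Two things are assumed. First, a and b are read as the numbers of bits set in A and B and c as the number of common bits, so Tc = c / (a + b - c). Second, the two 8-bit fingerprints are hypothetical and chosen only to give a = 4, b = 4, c = 2, since the original bit strings were not recoverable.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length binary fingerprints (lists of 0/1)."""
    a = sum(fp_a)                                # bits set in A
    b = sum(fp_b)                                # bits set in B
    c = sum(x & y for x, y in zip(fp_a, fp_b))   # bits set in both
    if a + b - c == 0:
        return 0.0  # both fingerprints empty; returning 0 is a convention assumed here
    return c / (a + b - c)


# Hypothetical fingerprints with a = 4, b = 4, c = 2
A = [1, 0, 1, 1, 0, 1, 0, 0]
B = [0, 0, 1, 1, 1, 0, 1, 0]

tc = tanimoto(A, B)
print(round(tc, 3))  # 0.333
```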

Dice Coefficient
• Definition: $Dice = \dfrac{2c}{a + b}$ (a, b: bits set in A, B; c: bits set in both)
• value range: [0,1]
• monotonic with the Tanimoto coefficient
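The monotonic relation with Tanimoto can be made explicit: with a and b the bits set in A and B and c the common bits (the convention assumed here), Dice = 2·Tc / (1 + Tc). A small sketch on the bit counts a = 4, b = 4, c = 2:

```python
def tanimoto(a, b, c):
    return c / (a + b - c)


def dice(a, b, c):
    return 2 * c / (a + b)


a, b, c = 4, 4, 2
tc = tanimoto(a, b, c)
dc = dice(a, b, c)

# Dice expressed as a strictly increasing function of Tc
dc_from_tc = 2 * tc / (1 + tc)
print(dc)  # 0.5
```

Because 2·Tc / (1 + Tc) increases with Tc, both coefficients rank database compounds in the same order.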

Cosine Coefficient
• Definition: $Cosine = \dfrac{c}{\sqrt{a\,b}}$ (a, b: bits set in A, B; c: bits set in both)
• Properties:
• value range: [0,1]
• correlated with the Tanimoto coefficient but not strictly monotonic with it
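That the Cosine and Tanimoto coefficients can rank pairs differently is easy to show with bit counts. The two (a, b, c) triples below are hypothetical, constructed only to produce a rank reversal, and assume the convention a, b = bits set in A, B and c = common bits.

```python
import math


def tanimoto(a, b, c):
    return c / (a + b - c)


def cosine(a, b, c):
    return c / math.sqrt(a * b)


# Pair X: small fingerprint fully contained in a larger one
# Pair Y: two equally sized fingerprints with partial overlap
pair_x = (1, 3, 1)
pair_y = (7, 7, 4)

tc_x, tc_y = tanimoto(*pair_x), tanimoto(*pair_y)
cos_x, cos_y = cosine(*pair_x), cosine(*pair_y)

print(round(tc_x, 3), round(tc_y, 3))    # Tanimoto ranks Y above X
print(round(cos_x, 3), round(cos_y, 3))  # Cosine ranks X above Y
```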

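Hamming Distance
• Definition: $D_{Hamming} = a + b - 2c$, the number of bit positions in which A and B differ (a, b: bits set in A, B; c: bits set in both)
• value range: [0,N] (N, length of the fingerprint)
• also called Manhattan/City Block distance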
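For binary fingerprints the Hamming distance is simply the count of differing bit positions, which equals a + b - 2c when a and b are the bits set in A and B and c the common bits. The fingerprints below are hypothetical (a = 4, b = 4, c = 2), used only for illustration.

```python
def hamming_distance(fp_a, fp_b):
    """Number of positions at which two equal-length binary fingerprints differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(fp_a, fp_b))


# Hypothetical 8-bit fingerprints
A = [1, 0, 1, 1, 0, 1, 0, 0]
B = [0, 0, 1, 1, 1, 0, 1, 0]

a = sum(A)
b = sum(B)
c = sum(x & y for x, y in zip(A, B))

d_h = hamming_distance(A, B)
print(d_h, a + b - 2 * c)  # 4 4
```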

Soergel Distance
• Definition: $D_{Soergel} = \dfrac{a + b - 2c}{a + b - c}$ (a, b: bits set in A, B; c: bits set in both)
• Properties:
• value range: [0,1]
• equivalent to (1 - Tc) for binary fingerprints
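The equivalence with 1 - Tc can be verified numerically. The sketch below uses the general vector form of the Soergel distance, Σ|x_i - y_i| / Σ max(x_i, y_i), and applies it to two hypothetical binary fingerprints; the names and bit strings are illustrative assumptions.

```python
def soergel_distance(x, y):
    """Soergel distance for non-negative vectors: sum|x-y| / sum(max(x, y))."""
    denom = sum(max(xi, yi) for xi, yi in zip(x, y))
    if denom == 0:
        return 0.0  # two all-zero vectors; treated as identical (assumed convention)
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / denom


def tanimoto(fp_a, fp_b):
    a, b = sum(fp_a), sum(fp_b)
    c = sum(p & q for p, q in zip(fp_a, fp_b))
    return c / (a + b - c)


# Hypothetical 8-bit fingerprints (a = 4, b = 4, c = 2)
A = [1, 0, 1, 1, 0, 1, 0, 0]
B = [0, 0, 1, 1, 1, 0, 1, 0]

d_s = soergel_distance(A, B)
tc = tanimoto(A, B)
print(round(d_s, 3), round(1 - tc, 3))  # 0.667 0.667
```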

Similarity coefficients

Properties of Similarity and Distance Coefficients
Metric properties:
(1) Distance values are non-negative: $d_{AB} \ge 0$; $d_{AA} = d_{BB} = 0$
(2) Symmetry: $d_{AB} = d_{BA}$
(3) Triangle inequality: $d_{AB} \le d_{AC} + d_{BC}$
The Euclidean and Hamming distances, and the complement of the Tanimoto coefficient for dichotomous (binary) variables, obey all three properties. The complements of the Dice and Cosine coefficients, and of the Tanimoto coefficient for continuous variables, do not obey inequality (3).
Coefficients are monotonic with each other if they produce the same similarity ranking.
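A small counterexample makes the triangle-inequality statement concrete. The sketch below represents fingerprints as Python sets of "on" bit positions and uses 1 - coefficient as the distance; the three tiny fingerprints are hypothetical and chosen only because they expose the violation for Dice.

```python
def tanimoto_distance(x, y):
    c = len(x & y)
    return 1 - c / (len(x) + len(y) - c)


def dice_distance(x, y):
    c = len(x & y)
    return 1 - 2 * c / (len(x) + len(y))


def obeys_triangle(dist, a, b, c, tol=1e-12):
    """Check d(A,B) <= d(A,C) + d(C,B) for one triple of fingerprints."""
    return dist(a, b) <= dist(a, c) + dist(c, b) + tol


# Hypothetical fingerprints: A and B share no bits, C contains both
A = {1}
B = {2}
C = {1, 2}

print(obeys_triangle(tanimoto_distance, A, B, C))  # True:  1 <= 1/2 + 1/2
print(obeys_triangle(dice_distance, A, B, C))      # False: 1 >  1/3 + 1/3
```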

Similarity search
Using bit strings to encode molecular size. A biphenyl query is compared to a series of analogues of increasing size. The Tanimoto coefficient, which is shown next to the corresponding structure, decreases with increasing size, until a limiting value is reached.
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386

Similarity search
Molecular similarity at a range of Tanimoto coefficient values.
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386

Similarity search
The distribution of Tanimoto coefficient values found in database searches with a range of query molecules of increasing size and complexity.
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386

Molecular Similarity
A comparison of the Soergel and Hamming distance values for two pairs of structures to illustrate the effect of molecular size.
A. R. Leach and V. J. Gillet, “An Introduction to Chemoinformatics”, Kluwer Academic Publishers, 2003

Molecular Similarity
The maximum common subgraph (MCS) between the two molecules is shown in bold.
$\mathrm{Similarity} = N_{bonds}(\mathrm{MCS}) / N_{bonds}(\mathrm{query})$
A. R. Leach and V. J. Gillet, “An Introduction to Chemoinformatics”, Kluwer Academic Publishers, 2003

Activity landscape

How important is the choice of descriptors?
Inhibitors of acyl-CoA:cholesterol acyltransferase represented with MACCS (a), TGT (b), and Molprint2D (c) fingerprints.

• Continuous SARs: gradual changes in structure result in moderate changes in activity (“rolling hills”, G. Maggiora)
• Discontinuous SARs: small changes in structure have dramatic effects on activity (“cliffs” in activity landscapes)
• Structure-Activity Landscape Index: $SALI_{ij} = \Delta A_{ij} / \Delta S_{ij}$, where $\Delta A_{ij} = |A_i - A_j|$ is the difference between the activities of molecules i and j, and $\Delta S_{ij} = 1 - \mathrm{sim}(i, j)$ is their dissimilarity
R. Guha et al., J. Chem. Inf. Model., 2008, 48, 646
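A rough sketch of a SALI calculation follows. It assumes that the denominator is the dissimilarity 1 - sim(i, j), that activities are expressed on a negative-log scale (nM potencies converted with 9 - log10), and that a pair with identical fingerprints but different activities is reported as an infinite cliff. The function names and the numeric test values are illustrative; only the 6 nM / 2390 nM pair with MACCS Tc 1.00 comes from this deck.

```python
import math


def p_activity(potency_nm):
    """Convert a potency in nM to a negative-log (molar) scale."""
    return 9.0 - math.log10(potency_nm)


def sali(act_i, act_j, sim_ij):
    """SALI = |A_i - A_j| / (1 - sim_ij); infinite for identical fingerprints."""
    delta_a = abs(act_i - act_j)
    dissim = 1.0 - sim_ij
    if dissim == 0.0:
        # Convention assumed here: an undefined ratio is reported as inf (or 0 if no activity gap)
        return math.inf if delta_a > 0 else 0.0
    return delta_a / dissim


# Hypothetical pairs: a smooth region and a steeper one
smooth = sali(6.0, 5.5, 0.5)
steep = sali(8.0, 5.0, 0.75)

# VEGFR-2 pair quoted in the deck: 6 nM vs 2390 nM at MACCS Tc = 1.00
cliff = sali(p_activity(6), p_activity(2390), 1.0)
print(smooth, steep, cliff)  # 1.0 12.0 inf
```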

Discontinuous SARs: VEGFR-2 tyrosine kinase inhibitors
[Figure: a 6 nM inhibitor and its 2390 nM analog; MACCS Tc: 1.00]
• small changes in structure have dramatic effects on activity
• “cliffs” in activity landscapes
• lead optimization, QSAR
Bad news for molecular similarity analysis...

Example of a “Classical” Discontinuous SAR: adenosine deaminase inhibitors
Any similarity method must recognize these compounds as being “similar” ... (MACCS Tanimoto similarity)

Library design
Goal: to select a representative subset from a large database

Chemical Space
• Overlapping similarity radii → redundancy
• “Void” regions → lack of information

Chemical Space
• “Void” regions → lack of information

Chemical Space
• No redundancy, no “voids” → optimally diverse compound library

Subset selection from the libraries
• Clustering
• Dissimilarity-based methods
• Cell-based methods
• Optimisation techniques

Clustering in chemistry

What is clustering?
• Clustering is the separation of a set of objects into groups such that items in one group are more like each other than items in a different group
• A technique to understand, simplify and interpret large amounts of multidimensional data
• Classification without labels (“unsupervised learning”)

Where is clustering used?
General: data mining, statistical data analysis, data compression, image segmentation, document classification (information retrieval)
Chemical:
• representative sampling
• subset selection
• classification of new compounds

Overall strategy
• Select descriptors
• Generate descriptors for all items
• Scale descriptors
• Define a similarity measure (“metric”)
• Apply an appropriate clustering method to group the items on the basis of the chosen descriptors and similarity measure
• Analyse results
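The steps of this strategy can be sketched end to end in plain Python. Everything specific below is an assumption made for illustration: the two descriptors (molecular weight and logP), their values, autoscaling as the scaling step, Euclidean distance as the metric, and single-link agglomerative clustering as the method.

```python
import math
from statistics import mean, pstdev


def autoscale(matrix):
    """Scale each descriptor column to zero mean and unit variance."""
    cols = list(zip(*matrix))
    mus = [mean(col) for col in cols]
    sds = [pstdev(col) or 1.0 for col in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in matrix]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_link(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                # Single link: distance between the closest members of two clusters
                d = min(euclidean(points[i], points[j])
                        for i in clusters[p] for j in clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Steps 1-2: hypothetical descriptors (molecular weight, logP) for five molecules
descriptors = [
    [180.0, 1.2],
    [182.0, 1.3],
    [178.0, 1.1],
    [450.0, 4.8],
    [455.0, 5.0],
]

scaled = autoscale(descriptors)               # step 3: scale descriptors
clusters = single_link(scaled, n_clusters=2)  # steps 4-5: metric + clustering
print(sorted(sorted(c) for c in clusters))    # step 6: inspect the groups
```

Scaling matters here because molecular weight would otherwise dominate the Euclidean distance simply through its larger numeric range.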

Data Presentation
• The library contains n molecules; each molecule is described by p descriptors
• Pattern matrix: molecules × descriptors (n × p)
• Proximity matrix: molecules × molecules (n × n), with $d_{ii} = 0$; $d_{ij} = d_{ji}$
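How a proximity matrix follows from a pattern matrix can be shown in a few lines. The descriptor values are hypothetical and Euclidean distance is assumed as the proximity measure; `math.dist` requires Python 3.8 or later.

```python
import math


def proximity_matrix(pattern):
    """Build the n x n Euclidean distance matrix from an n x p pattern matrix."""
    n = len(pattern)
    return [[math.dist(pattern[i], pattern[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical pattern matrix: n = 3 molecules, p = 2 descriptors
pattern = [
    [1.0, 2.0],
    [4.0, 6.0],
    [1.0, 2.0],
]

prox = proximity_matrix(pattern)
for row in prox:
    print(row)
```

The diagonal is zero and the matrix is symmetric, so only n(n - 1)/2 values actually need to be stored.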

Clustering methods
• Hierarchical, agglomerative: Single Link, Complete Link, Group Average, Weighted Group Average, Centroid, Median, Ward
• Hierarchical, divisive: Monothetic, Polythetic
• Non-hierarchical: Single Pass, Nearest Neighbour (Jarvis-Patrick), Relocation, Mixture Model, Topographic, Others
