Chemical Diversity

Chemical Diversity
Qualify and/or quantify the extent of variety within a set of compounds. Try to define the extent of chemical space. In combinatorial chemistry, we are interested in the diversity of a library.

Example 1: here we are looking at compounds that can possess up to 2 functional groups. How do we define libraries that have different numbers of these cells occupied? How do we quantify those that have duplicates within cells?

Chemical Diversity based on properties
Example 2: We can try to define the diversity based on properties of the compounds. For example, we could look at the naturally occurring amino acids and span the space defined by their pI. This gives a poor spread, so try pI and MW. We could go to higher dimensions by also looking at the number of H-bonds they make, the number of OH groups, their dipole moment, etc.

Why is Diversity Important?
• Similar Property Principle
  • Structurally similar compounds will exhibit similar physicochemical and biological properties
  • Test only representative compounds, eliminate redundancies
  • For lead discovery, want a diverse space to locate all possible hits (actives): called a diverse library
  • For refining a lead into a drug (lead optimization), want to survey a range of similar compounds: called a focused library
• Diversity hypothesis
  • Diverse reactants will lead to diverse products
  • Potentially useful for library design
  • Quantify whether a library can be supplemented by additions of other compounds, other libraries
Beno, Drug Discovery Today, 2001, 6, 251
Brown, JCICS, 1996, 36, 572
Gillet, JCICS, 1997, 37, 731

Types of Diversity
• A library with members that sample chemical space evenly: an ideal situation for lead discovery.
• A library that covers the same chemical space, but the compounds cluster and leave large holes.
• A library with even sampling of space, but only with limited diversity: useful for modification of a lead.
From Rose, Drug Discovery Today, 2002, 7, 133.

Quantifying Diversity
• Need to define how similar (or dissimilar) two compounds are to each other
  • Similarity indices
• Then need to determine the spread of the compounds throughout space
  • Distance-based
  • Cell-based partitioning
  • Clustering
Agrafiotis, Mol. Diversity, 1999, 4, 1

Defining Similarity
• Descriptors
• Property-based
• Structure-based
• 2D
• 3D
• Pharmacophore
• Structural keys
• Fingerprints
• Similarity/Distance Coefficients
Beno, Drug Discovery Today, 2001, 6, 251
Willett, Curr. Opin. Biotechnology, 2000, 11, 85
Willett, JCICS, 1998, 38, 983
Daylight, http://www.daylight.com/dayhtml/doc/theory/theory.finger.html

Structural Keys
• Boolean array expressing whether a pattern is present (TRUE) or not (FALSE) within a molecule
• This array is usually represented as a string of 1s (TRUE) and 0s (FALSE): a bitmap
• So create a list of structural features and then set the corresponding bit to 1 if the feature is present
Martin, J. Med. Chem., 1995, 38, 1431
Flower, JCICS, 1998, 38, 379

Fingerprints
• Problems with structural keys
  • Lack of generality
  • Choice of structural keys is arbitrary and may not be appropriate for the search or question at hand
  • List of structural keys can be very long and unwieldy to generate and test
• Solution: the fingerprint
  • Also a bitmap, but with NO assigned meaning to any particular bit!
  • Your fingerprint is characteristic of you, but there is no meaning to any particular fragment of it
• Generate patterns from the molecule itself, such as a pattern for
  • Each atom
  • Each atom with its nearest neighbors
  • Each group of atoms and bonds connected by paths up to 2 bonds long
  • Continuing with paths up to 3, 4, 5, 6, and 7 bonds long (seven seems to be the longest typically employed)
• This list of patterns is exhaustive, meaning all are generated for every molecule

Fingerprints. II.
• Since the number of patterns is huge, it is not possible to assign a particular bit to each pattern
• Instead, each pattern is the input to a hash function that creates a number of set bits (typically 4-5 bits). These set bits are then added (with logical OR) to the fingerprint.
• Note that bit sets for different patterns may have some bits in common
  • This conflict is not a problem, since every bit set by some pattern (substructure) will be set in the molecule's fingerprint.
  • Each pattern (substructure) generates its particular set of bits, and it is unlikely that another pattern will set those exact same bits. So a search for that substructure simply means looking to see if those bits have been set.
• Fingerprint advantages
  • No predefined set of patterns (structural keys)
  • Structural keys are usually quite sparse; fingerprints are much more dense
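The hash-and-OR step can be sketched in a few lines. This is only an illustration under stated assumptions: the fingerprint length (1024 bits), the use of SHA-256 as the hash, and the hand-written path strings standing in for enumerated patterns are all hypothetical choices, not the scheme of any particular toolkit.

```python
import hashlib

N_BITS = 1024          # fingerprint length (assumed for illustration)
BITS_PER_PATTERN = 4   # the slides quote 4-5 set bits per pattern


def pattern_bits(pattern: str) -> int:
    """Hash one pattern string into a mask with up to BITS_PER_PATTERN bits set."""
    digest = hashlib.sha256(pattern.encode("utf-8")).digest()
    mask = 0
    for i in range(BITS_PER_PATTERN):
        # Two digest bytes choose each bit position; positions may coincide.
        position = int.from_bytes(digest[2 * i:2 * i + 2], "big") % N_BITS
        mask |= 1 << position
    return mask


def fingerprint(patterns) -> int:
    """OR the bit sets of all patterns into one fingerprint."""
    fp = 0
    for pattern in patterns:
        fp |= pattern_bits(pattern)
    return fp


def may_contain(molecule_fp: int, query_fp: int) -> bool:
    """Screen: True if every bit set by the query is also set in the molecule."""
    return (molecule_fp & query_fp) == query_fp


# Hand-written path patterns (hypothetical stand-ins for real enumeration).
ethanol = fingerprint(["C", "O", "CC", "CO", "CCO"])
c_o_query = fingerprint(["C", "O", "CO"])
print(may_contain(ethanol, c_o_query))  # True
```

Because bits are shared between patterns, a screen like `may_contain` can report a substructure that is not actually there, but it cannot miss one that is, so it is normally followed by an exact match.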

Similarity Coefficients
For $n$-bit strings A and B with bits $x_{jA}$ and $x_{jB}$:
$a = \sum_j x_{jA}$, the number of on bits in A
$b = \sum_j x_{jB}$, the number of on bits in B
$c = \sum_j x_{jA} x_{jB}$, the number of on bits in both A and B
$D(A,B)$ is the coefficient for A and B computed from bits; $S(A,B)$ is the coefficient computed from continuous variables.
• Euclidean Distance
  $D(A,B) = [a + b - 2c]^{1/2}$, range 0 to $\sqrt{n}$
  $S(A,B) = [\sum_j (x_{jA} - x_{jB})^2]^{1/2}$, range 0 to infinity
• Tanimoto Coefficient
  $D(A,B) = c / [a + b - c]$, range 0 to 1
  $S(A,B) = \sum_j x_{jA} x_{jB} / [\sum_j x_{jA}^2 + \sum_j x_{jB}^2 - \sum_j x_{jA} x_{jB}]$, range -0.333 to 1
• Cosine Coefficient
  $D(A,B) = c / [ab]^{1/2}$, range 0 to 1
  $S(A,B) = \sum_j x_{jA} x_{jB} / [\sum_j x_{jA}^2 \sum_j x_{jB}^2]^{1/2}$, range -1 to 1
Willett, JCICS, 1998, 38, 983
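For continuous descriptors, the three coefficients can be sketched as below. The function names are illustrative, and the Tanimoto form used is the one with the cross term subtracted in the denominator, which is the form that gives the quoted lower bound of -1/3.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def tanimoto(x, y):
    """Continuous Tanimoto coefficient: dot / (|x|^2 + |y|^2 - dot)."""
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (sum(p * p for p in x) + sum(q * q for q in y) - dot)


def cosine(x, y):
    """Cosine coefficient: dot / (|x| |y|)."""
    dot = sum(p * q for p, q in zip(x, y))
    return dot / math.sqrt(sum(p * p for p in x) * sum(q * q for q in y))


# Two exactly opposite vectors reach the lower bounds of both similarities.
x, y = [1.0, 0.0], [-1.0, 0.0]
print(euclidean(x, y))  # 2.0
print(tanimoto(x, y))   # -1/3
print(cosine(x, y))     # -1.0
```

Feeding 0/1 vectors into the same functions reproduces the bit-count forms, since the sums then reduce to a, b, and c.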

Example: Bitmaps
2,2-dimethylbutane   1111011000000   a = 6
Ethylcyclobutane     1111110011100   b = 9
Bits on in both: c = 5
Euclidean distance = (6 + 9 - 10)^(1/2) = 2.24
Tanimoto coefficient = 5/(6 + 9 - 5) = 0.5
Cosine coefficient = 5/(6 × 9)^(1/2) = 0.68
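The arithmetic in this example can be checked with a short script. The bit strings are the two from the example; the function names are illustrative only.

```python
import math


def bit_counts(bits_a: str, bits_b: str):
    """Return (a, b, c): on bits in A, on bits in B, on bits common to both."""
    a = bits_a.count("1")
    b = bits_b.count("1")
    c = sum(1 for p, q in zip(bits_a, bits_b) if p == "1" and q == "1")
    return a, b, c


def euclidean_distance(a, b, c):
    return math.sqrt(a + b - 2 * c)


def tanimoto(a, b, c):
    return c / (a + b - c)


def cosine(a, b, c):
    return c / math.sqrt(a * b)


a, b, c = bit_counts("1111011000000", "1111110011100")
print(a, b, c)                                  # 6 9 5
print(round(euclidean_distance(a, b, c), 2))    # 2.24
print(round(tanimoto(a, b, c), 2))              # 0.5
print(round(cosine(a, b, c), 2))                # 0.68
```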

Problems with Tanimoto and related similarity indices Flower, JCICS, 1998, 38, 379

Quantifying Diversity
Rules for a diversity function
• Adding redundant molecules does not change the value of the diversity
• Adding non-redundant molecules always increases the value of the diversity
• Space-filling behavior should be preferred
• Perfect filling of space gives a finite value of the diversity
• As the dissimilarity of a pair of compounds increases, the diversity should increase asymptotically
Waldman, J. Mol. Graph. Model., 2000, 18, 412

Diversity definition 1
Where SIM(J,K) is some similarity measurement between compounds J and K.
• Can use this to build up a compound selection procedure for creating the sublibrary with maximal diversity
  • Find the similarities of all compounds in the library
  • Select the compound that is most dissimilar from all others
  • Select the 2nd compound that is most dissimilar from the first
  • Select the 3rd compound that is most dissimilar from the first 2
  • Continue until you have selected as many compounds as you desire
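The selection procedure can be sketched as a greedy loop. Two choices here are assumptions rather than something the slides specify: dissimilarity is taken as 1 minus the Tanimoto coefficient on sets of on-bit positions, and "most dissimilar from those already chosen" is read as maximizing the smallest dissimilarity to any selected compound (a MaxMin rule). Summing dissimilarities instead is an equally plausible reading.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def select_diverse(fps, k):
    """Greedily pick k compound indices that are mutually dissimilar."""
    n = len(fps)
    dis = [[1.0 - tanimoto(fps[i], fps[j]) for j in range(n)] for i in range(n)]
    # Seed: the compound with the largest total dissimilarity to all others.
    selected = [max(range(n), key=lambda i: sum(dis[i]))]
    while len(selected) < min(k, n):
        remaining = [i for i in range(n) if i not in selected]
        # MaxMin (assumed): maximize the distance to the nearest selected compound.
        nxt = max(remaining, key=lambda i: min(dis[i][s] for s in selected))
        selected.append(nxt)
    return selected


library = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({1, 2, 3, 4}),
    frozenset({3, 7, 8, 9}),
]
print(select_diverse(library, 3))  # [3, 1, 0]
```

Compound 2 is left out because it is a near-duplicate of compounds 0 and 1, which is the redundancy the procedure is meant to avoid.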

Cell-based Partitioning
• Divide each dimension into a number of parts
• These divisions are called cells or bins
• Place each compound into the appropriate bin based on the values of its properties and/or descriptors
• Can now create a sublibrary by choosing one compound from each bin, usually the one nearest the center of the bin
Figure: Schematic representation of different sampling of diversity space. (a) Maximize Euclidean distance to create maximum diversity. (b) Cell-based selection, choosing the compound nearest the center of each cell. From Rose, Drug Discovery Today, 2002, 7, 133
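A minimal sketch of cell-based selection follows. The two-descriptor space, the 0 to 1 descriptor ranges, the bin count, and the compound names are all hypothetical; real descriptors on different scales would need to be normalized before distances to a cell center mean much.

```python
import math
from collections import defaultdict


def cell_index(point, lows, highs, n_bins):
    """Map a descriptor vector to the tuple of bin indices of its cell."""
    index = []
    for x, lo, hi in zip(point, lows, highs):
        i = int((x - lo) / ((hi - lo) / n_bins))
        index.append(min(max(i, 0), n_bins - 1))  # clamp so x == hi stays in range
    return tuple(index)


def cell_center(index, lows, highs, n_bins):
    """Coordinates of the center of a cell."""
    return tuple(lo + (i + 0.5) * (hi - lo) / n_bins
                 for i, lo, hi in zip(index, lows, highs))


def select_per_cell(points, lows, highs, n_bins):
    """For each occupied cell, choose the compound nearest the cell center."""
    cells = defaultdict(list)
    for name, p in points.items():
        cells[cell_index(p, lows, highs, n_bins)].append(name)
    chosen = {}
    for index, names in cells.items():
        center = cell_center(index, lows, highs, n_bins)
        chosen[index] = min(names, key=lambda n: math.dist(points[n], center))
    return chosen


compounds = {"A": (0.1, 0.1), "B": (0.2, 0.3), "C": (0.8, 0.9), "D": (0.7, 0.2)}
picked = select_per_cell(compounds, lows=(0.0, 0.0), highs=(1.0, 1.0), n_bins=2)
print(picked)  # {(0, 0): 'B', (1, 1): 'C', (1, 0): 'D'}
```

Cell (0, 1) is empty here, so nothing is chosen for it; counting such empty cells is one way to see holes in the coverage.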

Diversity definition 2 and 3
Suppose 10 molecules are divided into 2 cells.
Distribution 1: (5,5), $D_{\chi^2}$ = 0
Distribution 2: (7,3), $D_{\chi^2}$ = -8
So the more even distribution is scored as being more diverse. But this may actually go too far:
$D_{\chi^2}$(2,2,2) > $D_{\chi^2}$(4,1,1) = $D_{\chi^2}$(3,3,0)
This makes the last two equivalent, even though (4,1,1) is intuitively more diverse because it occupies all three cells. The entropy-like definition ranks the three sets as
$D_{entropy}$(2,2,2) > $D_{entropy}$(4,1,1) > $D_{entropy}$(3,3,0)
Waldman, J. Mol. Graph. Model., 2000, 18, 412
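The formulas themselves do not survive in the text, so the following is a reconstruction under stated assumptions. The first score is taken as the negative sum of squared deviations of the cell counts from uniform occupancy, which reproduces the quoted values of 0 for (5,5) and -8 for (7,3). The second is taken as the Shannon entropy of the cell occupancies, which reproduces the quoted ranking; the normalization in Waldman's paper may differ.

```python
import math


def d_chi2(counts):
    """Assumed form: minus the squared deviation from even occupancy."""
    mean = sum(counts) / len(counts)
    return -sum((c - mean) ** 2 for c in counts)


def d_entropy(counts):
    """Assumed form: Shannon entropy of the occupancy fractions."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


print(d_chi2([7, 3]))                         # -8.0
print(d_chi2([4, 1, 1]), d_chi2([3, 3, 0]))   # -6.0 -6.0
for cells in ([2, 2, 2], [4, 1, 1], [3, 3, 0]):
    print(cells, round(d_entropy(cells), 3))
```

The entropy form separates (4,1,1) from (3,3,0) because an empty cell contributes nothing, while the squared-deviation form only sees how far each count is from the mean.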

Clustering. I.
• Hierarchical clusters
  • Small clusters within larger clusters
  • Typically some relationship between clusters
• Two procedures
  • Agglomerative: start with singletons and move upwards
    (a) Calculate the similarities of all pairs
    (b) Merge the two most similar into a cluster
    (c) Continue until only one cluster remains
  • Divisive: start with one cluster and break it into smaller clusters
    (a) Calculate the dissimilarities of all pairs
    (b) Take the pair of most dissimilar structures and assign all other structures to the least dissimilar of these initial cluster centers
    (c) Recursively select the cluster with the largest diameter and partition it into two such that the largest resulting cluster has the smallest diameter
    (d) Repeat step (c) a maximum of n-1 times
Brown, JCICS, 1996, 36, 572
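The agglomerative procedure can be sketched as follows. The slides do not say how similarity between two multi-member clusters is measured, so single linkage (the distance between the closest pair of members) is assumed here; complete or average linkage would slot into the same loop. The input is a plain distance matrix, and the one-dimensional test data are hypothetical.

```python
def agglomerate(dist):
    """Merge clusters bottom-up; return the list of (left, right, distance) merges."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (assumed): distance of the closest member pair.
                d = min(dist[p][q] for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


values = [0.0, 1.0, 5.0, 6.5]  # hypothetical one-dimensional descriptor
dist = [[abs(p - q) for q in values] for p in values]
for left, right, d in agglomerate(dist):
    print(left, right, d)
# [0] [1] 1.0
# [2] [3] 1.5
# [0, 1] [2, 3] 4.0
```

The merge list is the hierarchy: cutting it at any distance threshold gives the clusters at that level, and n compounds always produce n-1 merges.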

Clustering. II.
• Nonhierarchical clusters
  • No relation between clusters
• Jarvis-Patrick method
  • Calculate the similarities of all pairs
  • Record the top K most similar structures to each structure (nearest-neighbor list)
  • Assign compounds to clusters. A and B are in the same cluster if all of the following hold:
    • A is in the top K nearest-neighbor list of B
    • B is in the top K nearest-neighbor list of A
    • A and B have at least Kmin of their top K nearest neighbors in common
  • Tends to produce lots of small clusters (singletons) under strict conditions or a few very large clusters under less strict conditions
Brown, JCICS, 1996, 36, 572
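A compact sketch of the Jarvis-Patrick rules is given below. It assumes a distance matrix as input, does not count a compound as its own neighbor (some implementations do), and joins qualifying pairs transitively with a small union-find. The parameter names `k` and `k_min` mirror K and Kmin above; the test data are hypothetical.

```python
def jarvis_patrick(dist, k, k_min):
    """Cluster by shared nearest neighbors; returns sorted lists of indices."""
    n = len(dist)
    # Nearest-neighbor list: the k closest other compounds for each compound.
    neighbors = [
        set(sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k])
        for i in range(n)
    ]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a in range(n):
        for b in range(a + 1, n):
            mutual = a in neighbors[b] and b in neighbors[a]
            shared = len(neighbors[a] & neighbors[b])
            if mutual and shared >= k_min:
                parent[find(a)] = find(b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


values = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(p - q) for q in values] for p in values]
print(jarvis_patrick(dist, k=2, k_min=1))  # [[0, 1, 2], [3, 4, 5]]
print(jarvis_patrick(dist, k=2, k_min=2))  # [[0], [1], [2], [3], [4], [5]]
```

The second call shows the sensitivity noted above: tightening Kmin by one turns two clean clusters into six singletons.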

Goals for Diversity Metrics
• Ensure that exploratory libraries are broad enough to locate active molecules
• Ensure that focused (directed) libraries are both broad enough to sample space and compact enough to maintain activity
• Need to keep libraries small enough to manage readily, so want to ensure that sublibraries separate actives from inactives

Other Diversity Comments
• Krchnak, Mol. Diversity, 1996, 1, 193 (http://www.5z.com/moldiv/publish/MD023/md_023.html)
  • General comments on combinatorial methods and diversity
• Good, J. Med. Chem., 1997, 40, 3926
  • Use of 3D pharmacophores demands selection of products, not reagents, since they are not additive
• Martin, J. Comb. Chem., 1999, 1, 32
  • Beyond diversity, library construction should include MW, lipophilicity, ease of synthesis, pharmacophore features, reagent cost, solubility, and complementarity to other libraries
  • Distance measures assess redundancy; coverage of space is better assessed with maps or binning procedures
  • Diversity functions often overweight edges
• Oprea, J. Comb. Chem., 2001, 3, 157
  • Big numbers (lots of compounds) and serendipity are not enough
• Martin, J. Comb. Chem., 2001, 3, 231
  • Chemical similarity is not always a good predictor of bioproperties
  • Unlikely that a few thousand compounds can span all of chemical space
  • Just how much diversity is enough?
