
Selecting Diverse Sets of Compounds

vartan
Presentation Transcript


  1. Selecting Diverse Sets of Compounds C371 Fall 2004

  2. Review • Similar Property Principle: If structurally similar compounds are likely to exhibit similar activity, then maximum coverage of the activity space should be achieved by selecting a structurally diverse set of compounds.

  3. Techniques • High-Throughput Screening (HTS) • Combinatorial Chemistry • Early attempts led to large libraries, but little variability in the molecules created • Need a way to identify subsets of compounds for synthesis, purchase, or testing

  4. Chemical Diversity • No unambiguous definition • Need to quantify the degree of diversity of a subset of compounds • Four main approaches: • Cluster analysis • Dissimilarity-based methods • Cell-based methods • Use of optimization techniques

  5. CLUSTER ANALYSIS • Aim is to divide a set of compounds into clusters such that objects within a cluster are similar to one another and dissimilar to objects in other clusters • Many algorithms for doing this • Hierarchical methods seem to be better than non-hierarchical • Sometimes called a “distance-based” approach to compound selection, because distance is measured between pairs of compounds

  6. Key Steps in Cluster Analysis • Generate descriptors for each compound • Calculate the similarity or distance between all compounds • Use a clustering algorithm to group the compounds • Select a representative subset by taking one or more compounds from each cluster

  7. “Distance” • 1-S, where S is the similarity coefficient • When molecules are represented by binary descriptors • Euclidean distance • When molecules are represented by physicochemical properties
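A minimal sketch of the two distance measures on this slide. It assumes that S is the Tanimoto coefficient (the slide only says "similarity coefficient") and that a binary fingerprint is stored as the set of its "on" bit positions; the function names are illustrative, not taken from any library.

```python
import math


def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    union = len(a | b)
    if union == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return len(a & b) / union


def tanimoto_distance(a: set, b: set) -> float:
    """Distance for binary descriptors: 1 - S."""
    return 1.0 - tanimoto(a, b)


def euclidean_distance(x, y) -> float:
    """Distance for vectors of physicochemical properties."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Two fingerprints sharing 2 of 4 distinct bits.
print(tanimoto_distance({1, 2, 3}, {2, 3, 4}))
# Two compounds described by two (already scaled) properties.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```

In practice the properties are usually scaled before a Euclidean distance is taken, so that no single property dominates; that step is left out here.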

  8. Characteristics of Clustering Methods • Non-overlapping: each object in one cluster only (Most use this approach) • Hierarchical methods • Non-hierarchical methods • Overlapping: object can be in more than one cluster • Efficiency and effectiveness issues: some approaches have very intensive computational requirements

  9. Hierarchical Clustering • Clusters are nested in a hierarchy of increasing size, with each compound in its own cluster (a singleton) at one extreme and all compounds in a single cluster at the other • Agglomerative methods start at the bottom and merge similar clusters • Ward’s method: clusters are formed to minimize the variance (i.e., the sum of the squared deviations from the mean) • Others: centroid method and the median method • Divisive hierarchical clustering starts with all compounds in a single cluster and partitions the data

  10. Selecting the Appropriate Number of Clusters • Need a cutoff value at which you are going to examine the molecules • Jaccard statistic of two clusters, C1 and C2: $J(C_1, C_2) = \frac{a}{a + b + c}$, where a is the number of compounds found in both clusters, b is the number in C1 but not C2, and c is the number in C2 but not C1 • Same as the Tanimoto coefficient
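The Jaccard statistic can be computed directly from its definition. This is a small sketch that assumes each cluster is held as a set of compound identifiers; the identifiers are made up for illustration.

```python
def jaccard(c1: set, c2: set) -> float:
    """Jaccard statistic a / (a + b + c) for two clusters of compound IDs."""
    a = len(c1 & c2)  # compounds found in both clusters
    b = len(c1 - c2)  # compounds in C1 but not C2
    c = len(c2 - c1)  # compounds in C2 but not C1
    if a + b + c == 0:
        return 1.0  # assumption: two empty clusters count as identical
    return a / (a + b + c)


cluster_1 = {"m1", "m2", "m3"}
cluster_2 = {"m2", "m3", "m4", "m5"}
print(jaccard(cluster_1, cluster_2))  # a = 2, b = 1, c = 2
```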

  11. Non-Hierarchical Clustering • Compounds are clustered without forming a hierarchical relationship • Methods: • single-pass assigns a compound to a cluster according to a cut-off value • Problem: results depend on the order in which the molecules are processed, so the same data can give different clusters • nearest neighbor: Jarvis-Patrick clustering • relocation: K-means
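The order dependence of single-pass clustering is easy to see in a toy case. The sketch below is one simple variant, assumed for illustration: each compound is a single number, the first member of a cluster acts as its representative, and a compound joins the first cluster whose representative lies within the cut-off. Other variants compare against a cluster centroid instead.

```python
def single_pass(values, cutoff):
    """Single-pass clustering of 1-D values, using first members as leaders."""
    leaders = []   # representative value of each cluster
    clusters = []  # members of each cluster, in processing order
    for v in values:
        for i, leader in enumerate(leaders):
            if abs(v - leader) <= cutoff:
                clusters[i].append(v)
                break
        else:
            # No existing cluster is close enough: start a new one.
            leaders.append(v)
            clusters.append([v])
    return clusters


# The same three compounds, processed in two different orders.
print(single_pass([0.0, 1.0, 2.0], cutoff=1.0))
print(single_pass([1.0, 0.0, 2.0], cutoff=1.0))
```

With the first ordering, 2.0 is too far from the leader 0.0 and starts its own cluster; with the second, every value is within the cut-off of the leader 1.0 and a single cluster results.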

  12. DISSIMILARITY-BASED SELECTION METHODS • Attempt to identify a diverse set of compounds directly • Based on calculating distances or dissimilarities between compounds

  13. Basic Algorithm for Dissimilarity-Based Selection Methods • Decide on a desired size, n, of a final subset • Select a compound and place it in the subset • Calculate the dissimilarity between each of the other compounds and those in the subset • Choose the next compound as the one most dissimilar to those in the subset • If the subset contains fewer than n compounds, repeat the dissimilarity calculation and selection steps until n is reached • Complexity varies as the square of n
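The steps above can be sketched as follows. The slide does not say how "most dissimilar to those in the subset" is measured; this sketch assumes the MaxMin rule, where a candidate's dissimilarity to the subset is its distance to the nearest selected compound. The function name, the `seed_index` parameter, and the 1-D toy data are all illustrative.

```python
def maxmin_select(items, n, dist, seed_index=0):
    """Pick n items by repeatedly adding the one farthest from the subset."""
    if not 1 <= n <= len(items):
        raise ValueError("n must be between 1 and len(items)")
    selected = [seed_index]
    # Distance from every item to its nearest selected item so far.
    nearest = [dist(items[seed_index], x) for x in items]
    while len(selected) < n:
        candidates = (i for i in range(len(items)) if i not in selected)
        nxt = max(candidates, key=lambda i: nearest[i])
        selected.append(nxt)
        # Only the newly added item can lower a nearest-selected distance.
        for i, x in enumerate(items):
            nearest[i] = min(nearest[i], dist(items[nxt], x))
    return [items[i] for i in selected]


# Toy data: each "compound" is one property value.
pool = [0.0, 1.0, 5.0, 10.0]
print(maxmin_select(pool, 3, dist=lambda a, b: abs(a - b)))
```

Caching each compound's nearest-selected distance means every round needs only one pass over the data set. Summing the distances to all selected compounds (MaxSum) is another common choice and would change only the update rule.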

  14. CELL-BASED METHODS • Operate within a pre-defined low-dimensional chemistry space, not dependent on the particular set of molecules being examined • Compounds are allocated to cells according to their molecular properties • Methods are very fast with a time complexity of O(N), but restricted to low-dimensional space • Good for very large data sets • Example properties used to define the space: MW, logP, polarity, shape, hydrogen bonding, aromatic interactions
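A minimal sketch of cell-based partitioning, assuming a two-dimensional space of MW and logP divided into equal-width bins. The property ranges, bin counts, compound names, and values are invented for illustration. Each compound is visited once, which is where the O(N) behavior comes from.

```python
from collections import defaultdict


def assign_cells(compounds, bins):
    """Map each compound to a cell of a fixed grid.

    compounds: dict of name -> tuple of property values
    bins: one (low, high, n_bins) triple per property
    """
    cells = defaultdict(list)
    for name, props in compounds.items():
        key = []
        for value, (low, high, n_bins) in zip(props, bins):
            idx = int((value - low) / (high - low) * n_bins)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            key.append(idx)
        cells[tuple(key)].append(name)
    return dict(cells)


def pick_one_per_cell(cells):
    """Take the first compound from every occupied cell."""
    return sorted(members[0] for members in cells.values())


# Hypothetical (MW, logP) values.
compounds = {
    "A": (120.0, 1.2),
    "B": (150.0, 1.8),
    "C": (420.0, 4.5),
    "D": (310.0, 0.4),
}
grid = [(0.0, 500.0, 5), (0.0, 5.0, 5)]  # MW in 5 bins, logP in 5 bins

cells = assign_cells(compounds, grid)
print(cells)
print(pick_one_per_cell(cells))
```

Because the grid is fixed in advance, empty cells are also visible, which shows where the collection has no coverage.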

  15. BCUT Descriptors • Matrix representation of molecules • Atomic properties used for diagonal • Atomic charges, polarizabilities, hydrogen bonding • Connectivity used for the off-diagonals • 2D graph or interatomic distances from 3D

  16. Partitioning Using Pharmacophore Keys • Each potential 3- or 4-point pharmacophore is considered to constitute a cell • A given molecule could be in more than one cell • Promiscuous molecules: those that contain a large number of pharmacophores, e.g., very flexible molecules

  17. OPTIMIZATION METHODS • Techniques for sampling large sets of molecules • May want to spread the compounds evenly in space • Techniques: Monte Carlo, simulated annealing • Selective replacement
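One way to picture selective replacement with simulated annealing is the sketch below. It is an illustration under stated assumptions, not a published protocol: the diversity function is the sum of pairwise distances in the subset, each move swaps one selected compound for one unselected compound, worse subsets are accepted with the Metropolis probability, and the temperature falls geometrically. All names and default parameters are hypothetical.

```python
import math
import random


def diversity(subset, dist):
    """Assumed diversity function: sum of all pairwise distances."""
    return sum(dist(a, b) for i, a in enumerate(subset) for b in subset[i + 1:])


def anneal_select(pool, n, dist, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Search for a diverse n-subset of pool; returns (best subset, its score)."""
    if not 1 <= n <= len(pool):
        raise ValueError("n must be between 1 and len(pool)")
    rng = random.Random(seed)
    indices = list(range(len(pool)))
    rng.shuffle(indices)
    inside, outside = indices[:n], indices[n:]
    score = diversity([pool[k] for k in inside], dist)
    best, best_score = list(inside), score
    if not outside:
        return [pool[k] for k in best], best_score  # nothing to swap in
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i, j = rng.randrange(len(inside)), rng.randrange(len(outside))
        inside[i], outside[j] = outside[j], inside[i]  # trial replacement
        new_score = diversity([pool[k] for k in inside], dist)
        delta = new_score - score
        if delta >= 0 or rng.random() < math.exp(delta / t):
            score = new_score
            if score > best_score:
                best, best_score = list(inside), score
        else:
            inside[i], outside[j] = outside[j], inside[i]  # reject: undo swap
    return [pool[k] for k in best], best_score


pool = [0.0, 0.1, 0.2, 5.0, 9.9, 10.0]
subset, score = anneal_select(pool, 2, dist=lambda a, b: abs(a - b))
print(subset, score)
```

Setting the temperature very low throughout turns this into a plain Monte Carlo search that accepts almost only improving replacements. A different diversity function, such as one that rewards even spread across cells, could be substituted for `diversity`.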

  18. CONCLUSIONS • Some research suggests that compounds with a Tanimoto similarity of 0.85 or higher have between a 30% and 80% chance of sharing the same biological activity • No clear consensus on which screening approach is best • Faster computer techniques (e.g., parallel computing) may help • Descriptors used must be related to biological activity
