Molecular Similarity and Chemical Families: The Homogeneity Approach

C.A. Nicolaou, B.P. Kelley, D.W. Miller, T.K. Brunck
11th April, 2000
2nd Sheffield Chemoinformatics Conference, Sheffield, UK
Bioreason, Inc

Presentation Outline
• Introduction
  • Molecular similarity
  • Observations on chemical data
• Analyzing screening data
  • Using a traditional approach
• The Homogeneity Approach
  • Definitions
  • Implementation and experimental results
• Conclusions

Molecular Similarity
• Widely used throughout the drug discovery process
• Sample applications:
  • Assessing the diversity of a chemical dataset
  • Picking a representative dataset from a compound library
  • Given a compound and a compound library, identifying the subset of similar compounds
  • Analyzing screening data
    • Major step: organizing screening data into chemical families

Typical Drug Discovery Process
[Figure: flow diagram with the labels Library, Assay, Screening, Data, Data Analysis, Further exploration, Start Chemistry, and Drug Candidates; Screening and Data Analysis are emphasized.]

Technology Employed
• Compound representation methods
  • Fingerprints/bit vectors, graph-based, ...
  • 2D keys vs. 3D keys, fragment-based vs. distance-based, ...
• Similarity and distance measures
  • Tanimoto, Euclidean, ..., graph-based, ...
• Clustering methods
• Classification methods
• Substructure searching/(sub)graph matching
• ...
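The two similarity and distance measures named on this slide can be sketched for binary fingerprints as follows. This is a minimal illustration, not the implementation used in the talk; the function names and the example fingerprints are assumptions made for the sketch.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length 0/1 fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    # Convention assumed here: two all-zero fingerprints count as identical.
    return both / either if either else 1.0


def euclidean(a, b):
    """Euclidean distance between two equal-length fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 8-bit fingerprints, for illustration only.
fp_a = [1, 0, 1, 1, 1, 0, 0, 0]
fp_b = [1, 0, 1, 0, 1, 1, 0, 0]

print(tanimoto(fp_a, fp_b))   # 3 shared bits out of 5 set bits
print(euclidean(fp_a, fp_b))  # square root of the number of differing bits
```

For binary vectors the squared Euclidean distance equals the number of differing bits, so it counts shared absences as agreement, whereas Tanimoto ignores them.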

Analyzing Chemical Compounds (1)
[Figure: a dictionary of keys (O, N-N, Q-QH, Q-C(-N)-C, CH3-A-CH3, Q-N, N-A-A-O, N-C-O, O not % A % A, N-A-O, Q-Q, QH > 1, CH3 > 1, N > 1, NH, ...) is applied to an example structure, yielding the bit vector 10111000001...]
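The idea of turning a dictionary of keys into a bit vector can be sketched with a toy example. Real structural keys require substructure matching on the molecular graph; the sketch below skips that step and assumes each molecule has already been reduced to a hypothetical table of feature counts. The key subset, the `fingerprint` function, and both example molecules are illustrative assumptions.

```python
# Toy subset of the key dictionary: each key is a name plus a yes/no test.
# Assumption: molecules arrive as precomputed feature counts, not as graphs.
KEYS = [
    ("O",       lambda m: m.get("O", 0) > 0),
    ("N > 1",   lambda m: m.get("N", 0) > 1),
    ("CH3 > 1", lambda m: m.get("CH3", 0) > 1),
    ("NH",      lambda m: m.get("NH", 0) > 0),
]


def fingerprint(feature_counts, keys=KEYS):
    """Return one bit per key: 1 if the key hits the molecule, else 0."""
    return [1 if test(feature_counts) else 0 for _, test in keys]


# Two hypothetical molecules with different feature counts.
mol_1 = {"O": 2, "N": 2, "NH": 2}
mol_2 = {"O": 5, "N": 3, "NH": 1}

print(fingerprint(mol_1))
print(fingerprint(mol_2))
```

Both hypothetical molecules map to the same bits even though their counts differ, because each bit records only whether a key hits, not how often or where.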

Analyzing Chemical Compounds (2)
• Compounds are multi-domain:
  • Multiple occurrences of a key/substructure
  • Members of more than one chemical family

Analyzing Chemical Compounds (3)
• Information loss!
  • E.g., “how” does a key hit?

Dataset Used
• Derived from the NCI anti-HIV program
  • Latest release, Oct. 1999: 43,382 compounds
  • Cell-based assay, EC50 (effective concentration at which the test compound protects the cells by 50%)
• Pre-processing:
  • Molecular weight ≤ 500
  • Where a compound had multiple EC50 values, the highest concentration was kept
  • 33,245 compounds left
• Activities: converted from molar concentrations to -log values
• Activity threshold used: 5.5
• Training set size (actives): 503
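The pre-processing steps on this slide can be sketched as below. This is a sketch under stated assumptions: the record layout, function names, and example values are hypothetical, the logarithm is taken as base 10, and a compound is treated as active when its -log activity is at or above the threshold (the slide does not say whether the comparison is inclusive).

```python
from math import log10

MAX_MOL_WEIGHT = 500.0     # from the slide: molecular weight <= 500
ACTIVITY_THRESHOLD = 5.5   # from the slide


def preprocess(records):
    """records: iterable of (compound_id, mol_weight, ec50_molar) tuples.

    Returns {compound_id: -log10(EC50)}, keeping the highest concentration
    (the least potent measurement) when a compound was measured repeatedly.
    """
    highest = {}
    for cid, mol_weight, ec50 in records:
        if mol_weight > MAX_MOL_WEIGHT or ec50 <= 0:
            continue  # drop heavy compounds and unusable measurements
        if cid not in highest or ec50 > highest[cid]:
            highest[cid] = ec50
    return {cid: -log10(ec50) for cid, ec50 in highest.items()}


def actives(activities, threshold=ACTIVITY_THRESHOLD):
    """Compound ids at or above the threshold (inclusive by assumption)."""
    return sorted(cid for cid, act in activities.items() if act >= threshold)


# Hypothetical records, for illustration only.
records = [
    ("A", 320.0, 1e-6),
    ("A", 320.0, 1e-7),   # duplicate measurement; the 1e-6 value is kept
    ("B", 410.0, 1e-5),
    ("C", 612.0, 1e-8),   # removed by the molecular weight filter
    ("D", 250.0, 2e-6),
]
activities = preprocess(records)
print(activities)
print(actives(activities))
```

Keeping the highest concentration is the conservative choice: a compound is only called active if its weakest measurement still clears the threshold.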

Analyzing Screening Data: Typical Approach
• Goal: data reduction
  • To a manageable size
  • In an organized fashion
  • With minimal information loss
• Represent molecules as vectors, often binary
• Similarity/distance measure
• Clustering algorithm
• Metacluster selection method (e.g. cluster level selection methods for hierarchical clustering)

Hierarchical Agglomerative Clustering Method
• NCI anti-HIV dataset
  • Subset of 503 compounds selected by activity
• Clustered using Ward's method with Euclidean distance, on bit vectors obtained by applying MACCS-like keys
• Cluster level selection using the Kelley method
• Results:
  • 70 (meta)clusters
  • Complete coverage of the dataset, no singletons!
  • Average metacluster size: 7.2 compounds
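The clustering step can be sketched with a naive, pure-Python version of Ward's criterion on bit vectors. This is not the implementation used in the talk: it is a cubic-time illustration, and the number of clusters is passed in directly rather than chosen by the Kelley method. The function name and the example fingerprints are assumptions.

```python
def ward_clusters(vectors, n_clusters):
    """Agglomerative clustering with Ward's criterion (naive sketch).

    At each step, merge the pair of clusters whose union gives the smallest
    increase in within-cluster sum of squares:
        cost = |A| * |B| / (|A| + |B|) * ||centroid_A - centroid_B||^2
    Returns the clusters as sorted lists of input indices.
    """
    clusters = [([i], [float(x) for x in v]) for i, v in enumerate(vectors)]
    while len(clusters) > max(n_clusters, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (mi, ci), (mj, cj) = clusters[i], clusters[j]
                ni, nj = len(mi), len(mj)
                sq_dist = sum((x - y) ** 2 for x, y in zip(ci, cj))
                cost = ni * nj / (ni + nj) * sq_dist
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (mi, ci), (mj, cj) = clusters[i], clusters[j]
        ni, nj = len(mi), len(mj)
        # Size-weighted mean of the two centroids.
        merged = [(ni * x + nj * y) / (ni + nj) for x, y in zip(ci, cj)]
        clusters[i] = (mi + mj, merged)
        del clusters[j]  # safe: j > i, so index i is unaffected
    return sorted(sorted(members) for members, _ in clusters)


# Hypothetical 6-bit fingerprints, for illustration only.
fps = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
print(ward_clusters(fps, 2))
```

Note that this procedure always assigns every compound to some cluster at the chosen level, which is consistent with the "no singletons" observation on the slide.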

Method Evaluation - Chemists
• Results validation by comparing to known truth:
  • Some known chemical families were detected, e.g. AZTs, pyrimidine nucleosides, ...
  • Smaller, less well-represented families were not always detected, e.g. stilbenes, ...
• Results validation by assessing their quality:
  • On average, chemists approved only 20-30 of the 70 clusters as chemical families of related compounds
  • The remaining clusters (~2/3) were difficult to interpret:
    • Compounds that shouldn’t be in some clusters
    • Compounds missing from clusters they should have been in (misclassified or not)
    • Clusters made up of dissimilar/diverse compounds
  • Experts were puzzled by the absence of singletons

Method Evaluation - Computational
• Analyzed the 70 groups of compounds
• Simple method:
  • Average nearest neighbor distance within a set of compounds
  • Distance computed using the bit vectors of the compounds
• Results:
  • 43/70: low average nearest neighbor distance
  • 22/70: moderate average nearest neighbor distance
  • 5/70: high average nearest neighbor distance
• Overall, most of the groups had low diversity; this is expected, since the metaclusters were built using bit vectors
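The average nearest neighbor distance within a set can be sketched as follows. The slide does not name the distance measure used for this evaluation; Euclidean distance is assumed here because it was the measure used for clustering. The function names and the example sets are illustrative.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length bit vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def avg_nearest_neighbor_distance(fps):
    """Mean, over all compounds, of the distance to the closest other compound."""
    if len(fps) < 2:
        raise ValueError("need at least two compounds")
    total = 0.0
    for i, a in enumerate(fps):
        total += min(euclidean(a, b) for j, b in enumerate(fps) if j != i)
    return total / len(fps)


# Hypothetical groups: one tight, one diverse.
tight = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 0, 1]]
loose = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]

print(avg_nearest_neighbor_distance(tight))
print(avg_nearest_neighbor_distance(loose))
```

A low value only says that each compound has at least one close neighbor in fingerprint space; it does not show that the group shares a common scaffold.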

The Problem
• Confusing?
  • The method functioned correctly from a computational perspective
  • But the results were not as satisfying to the human expert
• Clustering results often don’t:
  • Match expectations
  • Make chemical sense
• Why?
  • Clustering is performed on molecular representations, often based on small keys, not on the molecules themselves
  • No chemical “common sense” influences the clustering process

The road ahead… (1)
• What is the end goal of screening data analysis?
  • Finding the chemical families of interest, i.e. those that exhibit favorable biological characteristics
• How are we attempting to do it?
  • Clustering and classification methods using vector encoding representations of molecules
• But:
  • Clustering only gives groups of compounds that have similar vector representations, and
  • A successful classification session requires that one knows the chemical families of interest a priori

The road ahead… (2)
• So, what do we do now that we are aware of the loose coupling between clusters obtained traditionally and human experts’ expectations?
  • Discover what the experts want
  • Adapt our process to match results and expectations

Definitions
• Chemical family:
  • A set of highly similar compounds sharing a common scaffold; alternatively, a set of compounds with high homogeneity
• Homogeneity:
  • High structural similarity
  • Based not only on the similarity of molecular vectors but also on the presence of a significant common scaffold
• Scaffold:
  • A substructure defined as a specific configuration of atom types and bond types

Processing Traditional Method Results
• Processing the results of traditional methods:
  • Is easier than a complete re-design/re-implementation
  • Will “remove” results that are not chemically sensible
  • Will make life easier for human analysts by allowing them to focus on easily recognizable and interpretable pieces of knowledge
• Approach:
  • Compute structural homogeneity on the results of traditional methods and use it; in essence, construct “chemically sensible” methods for selecting the important compound groups

Identifying Scaffolds
• Maximum Common Substructure (MCS) extraction:
  • Using our own fast, efficient implementations
• Highlights of analysis:
  • 7 out of 70 compound sets: common scaffold size < 2!
  • 5 MCSs appeared multiple times
    • Range: 2-6, mostly benzene rings
  • A total of 53 different scaffolds
• MCS size:
  • Ranged from fewer than 2 atoms to more than 14 atoms

Introducing Homogeneity
• Cluster homogeneity:
  • Fingerprint homogeneity:
    • Overall, quite good average nearest neighbor distance
  • Structural homogeneity:
    • Measure used: number of atoms in the MCS / average number of atoms in the molecules of the set
• Structural homogeneity threshold: 1/3
  • I.e. an MCS covering at least a third of the average molecule size
• Results:
  • 23/70 clusters below threshold
  • 47/70 clusters above threshold
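The structural homogeneity measure and its threshold can be sketched directly from the slide's definition. The MCS size is taken as a given input here (computing it requires a maximum common substructure search, which is not shown), and whether hydrogens are counted is left to the caller. The function names and example numbers are assumptions.

```python
HOMOGENEITY_THRESHOLD = 1 / 3  # from the slide


def structural_homogeneity(mcs_atoms, molecule_atom_counts):
    """Number of atoms in the MCS divided by the average molecule size."""
    if not molecule_atom_counts:
        raise ValueError("cluster must contain at least one molecule")
    average_size = sum(molecule_atom_counts) / len(molecule_atom_counts)
    return mcs_atoms / average_size


def is_homogeneous(mcs_atoms, molecule_atom_counts,
                   threshold=HOMOGENEITY_THRESHOLD):
    """True if the MCS covers at least `threshold` of the average molecule."""
    return structural_homogeneity(mcs_atoms, molecule_atom_counts) >= threshold


# Hypothetical clusters: a lone benzene ring shared by large molecules,
# and a 14-atom scaffold shared by molecules of about 22 atoms.
print(structural_homogeneity(6, [24, 30, 18]), is_homogeneous(6, [24, 30, 18]))
print(structural_homogeneity(14, [20, 22, 24]), is_homogeneous(14, [20, 22, 24]))
```

In the first hypothetical cluster the shared ring covers a quarter of the average molecule, so the cluster falls below the 1/3 threshold even if its fingerprints are close.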

Method Assessment (1)
• Results were used to assign priority to clusters:
  • Low priority (low likelihood of making chemical sense):
    • Clusters with small scaffolds and low structural homogeneity
    • Clusters with insignificant scaffolds and low-to-moderate structural homogeneity
  • High priority (high likelihood of making chemical sense):
    • Well-defined clusters with high structural homogeneity and big, significant scaffolds
• The approach did make life easier for human analysts
  • Ability to find important information faster
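The prioritization rule can be sketched as a simple two-level ranking. The slides give the 1/3 homogeneity threshold but no numeric cutoff for a "big, significant" scaffold, so `MIN_SCAFFOLD_ATOMS` below is a purely hypothetical value (chosen to exclude a lone benzene ring), as are the function names and the example clusters.

```python
HOMOGENEITY_THRESHOLD = 1 / 3  # from the slides
MIN_SCAFFOLD_ATOMS = 7         # hypothetical cutoff, not given in the slides


def priority(mcs_atoms, homogeneity):
    """'high' if the cluster is homogeneous AND its scaffold is large enough."""
    if homogeneity >= HOMOGENEITY_THRESHOLD and mcs_atoms >= MIN_SCAFFOLD_ATOMS:
        return "high"
    return "low"


def prioritize(clusters):
    """clusters: {name: (mcs_atoms, homogeneity)}.

    Returns cluster names with high-priority clusters first, each group
    ordered by decreasing structural homogeneity.
    """
    return sorted(
        clusters,
        key=lambda name: (priority(*clusters[name]) != "high",
                          -clusters[name][1]),
    )


# Hypothetical clusters, for illustration only.
clusters = {
    "c1": (14, 0.64),  # large scaffold, high homogeneity
    "c2": (6, 0.25),   # small scaffold, low homogeneity
    "c3": (6, 0.40),   # small scaffold, moderate homogeneity
    "c4": (10, 0.45),  # large scaffold, moderate homogeneity
}
print(prioritize(clusters))
```

An analyst would then review clusters in the returned order, reaching the well-defined families first.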

Method Assessment (2)
• Prioritization assessment:
  • The 23 clusters that were not structurally homogeneous were uninteresting to chemists
  • The 47 structurally homogeneous clusters included all those (20-30) previously approved by chemists as chemical families
• However, experts complained about:
  • The low information content of the clustering results
    • Too many clusters, too little knowledge
  • The amount of information never found!
    • High-priority clusters contained only 2/3 of the compounds analyzed!
    • Clusters approved as chemical families, from which knowledge could be derived easily, contained only 1/3 of the compounds!
    • Known knowledge was never found

The road ahead… (3)
• Do traditionally obtained clusters relate to chemical families?
• Do we need a different approach?
• Introduce chemically “aware” methods
  • No simple clustering methods
  • Take into account structural homogeneity
  • Accommodate the multi-domain nature of molecules
  • Present results in a format that facilitates interpretation and knowledge discovery by chemists

A different approach: Can it work?
• We have been working on “chemically aware” screening data analysis methods
• Results on the same dataset with a typical Bioreason analysis:
  • 102 classes, all with high structural homogeneity
  • All classes were easy to interpret
  • Only 10% of classes were not interesting to chemists (~50 compounds)
  • 47 singletons (~10% of dataset)
• Information content much higher than with the traditional approach:
  • 90% of compounds placed in homogeneous clusters (vs. 66% with the traditional method)
  • 80% of compounds placed in clusters approved as structural families (vs. 34% with the traditional method)
• Multi-domain nature is accommodated

Conclusions
• Molecular fingerprint similarity is not a reliable indicator of high structural similarity between molecules
• Most traditional chemical data analysis methods make heavy use of molecular fingerprint similarity
• As a consequence, relations obtained via traditional methods, including clusters, often don’t make chemical sense
• Structural homogeneity may be employed to enable the formation of clusters and the identification of chemical relations closer to chemists’ expectations

Acknowledgements
• Patricia Bacha
• Bobi Den Hartog
• Info:
  • nicolaou@bioreason.com
  • www.bioreason.com
