
Luddite: An Information Theoretic Library Design Tool






Presentation Transcript


  1. Luddite: An Information Theoretic Library Design Tool Jennifer L. Miller, Erin K. Bradley, and Steven L. Teig July 18, 2002

  2. Outline • Overview • Search Strategy • Cost Function • Algorithms • Algorithm Extensions • Implementation Details • Results

  3. Overview • Genomics and proteomics provide many novel targets • Need to find drugs for these targets • Which compound should be screened against which target? • Methods for answering this (e.g., QSAR) have been debated for many years • Recently, combinatorial and parallel synthesis techniques have transformed the question from which single compound to analyze into which collection of compounds (library) to analyze.

  4. Overview • Develop an algorithm for designing libraries • Discrete – a collection of individual compounds • Combinatorial – a collection of compounds synthesized in a parallel or combinatorial fashion • Based on information-theoretic techniques

  5. Overview • Idea – use molecules to “interrogate” the target receptor about which chemical features are required for binding • Objective – compose a library that maximizes the conclusions that can be drawn from the “answers” across all possible experimental outcomes • Goal – design the library that allows discovery of the most information about the optimization target

  6. Search Strategy • Strategies used in “20 Questions” are applicable • Binary Search • With every guess eliminate half the search space • Codeword Search • Every outcome corresponds to a single codeword • Optimal set of questions can be asked simultaneously • Same set of optimal questions can be used every time
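The codeword idea above can be sketched in a few lines of Python. The point of the slide is that for a search space of 2^n items, the same fixed set of n yes/no questions can be asked simultaneously, and the vector of answers forms a codeword that uniquely identifies any item. The function names below are illustrative, not from the original tool.

```python
def codeword(secret: int, n_bits: int) -> tuple:
    """Answers to the fixed questions 'is bit k of the secret set?'."""
    return tuple((secret >> k) & 1 for k in range(n_bits))

def decode(answers: tuple) -> int:
    """Recover the secret from the answer vector (the codeword)."""
    return sum(bit << k for k, bit in enumerate(answers))

# Every outcome in 0..15 has a distinct 4-bit codeword, so the same
# four questions, asked at once, decode any secret.
n_bits = 4
codes = {codeword(s, n_bits) for s in range(2 ** n_bits)}
assert len(codes) == 2 ** n_bits           # all codewords distinct
assert decode(codeword(11, n_bits)) == 11  # the questions decode the secret
```

Unlike a sequential binary search, no question here depends on an earlier answer, which is what lets the whole question set run as one parallel experiment.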

  7. Search Strategy

  8. Search Strategy • Library design analogous to “20 Questions” • Searching for features required for ligand binding, desired phenotype, and/or good pharmacokinetic properties instead of a number • “feature” – four-point pharmacophore

  9. Search Strategy - Example

  10. Search Strategy - Assumptions • “20 Questions” Analogy useful but assumes • Every compound tests half of possible features • Can synthesize any compound in design space • Every assay value is accurate • Goal is a single feature

  11. Search Strategy - Remedies • Eliminating the assumptions • 1. A minimum of log2(F) bits is needed to decode F outcomes • Loose upper bound on the number of compounds • 2. The ability of a set of questions to decode a message is invariant to column reordering – therefore it is not necessary that every compound in the design space be obtainable in order to find a maximally efficient set of questions
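Remedy 1's log2(F) figure is easy to check numerically; a minimal sketch (the helper name is my own):

```python
import math

def min_bits(n_outcomes: int) -> int:
    """Minimum number of yes/no answers (bits) needed to distinguish
    n_outcomes different outcomes: ceil(log2(n_outcomes))."""
    return math.ceil(math.log2(n_outcomes))

assert min_bits(8) == 3      # 3 binary questions separate 8 outcomes
assert min_bits(1000) == 10  # 2**10 = 1024 >= 1000
```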

  12. Search Strategy - Remedies • 3. Error-correcting codes (ECC) based on Hamming distance • 4. Adjust the probability of features in an iterative process and prune unlikely features • Likely to lead to convergence • Enhances efficiency • Improves the probability of success
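Remedy 3 relies on the Hamming distance between codewords; the standard fact is that a code with minimum distance d can correct floor((d - 1) / 2) errors, which is what makes the design robust to an occasional inaccurate assay value. A minimal sketch:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length codewords differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# A code whose codewords are pairwise at distance >= 3 tolerates one
# flipped bit: the corrupted word is still closest to the true codeword.
assert hamming("10110", "10011") == 2
```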

  13. Cost Function • Given a set of features, search for a set of compounds that allows each individual feature to be decoded • If that is not possible, seek to decode as many features as possible with the flattest possible distribution of feature-class sizes • Feature class – the subset of features that share the same codeword • Entropy is well suited to this calculation

  14. Cost Function - Entropy • Entropy – a measure of uncertainty • All codewords the same – no uncertainty -> minimal entropy • All codewords different -> maximum entropy • Wish to optimize the following equation • M is the library measure • H is the entropy of the feature classes • C is the # of distinct classes • ||ci|| is the size of feature class i • F is the # of features
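The equation itself appears only as an image on the original slide, but the entropy H of the feature classes it defines is the standard Shannon entropy over the class-size fractions ||ci||/F. A sketch under that assumption (the exact library measure M in the paper may combine H with other terms):

```python
import math
from collections import Counter

def class_entropy(codewords: list) -> float:
    """Shannon entropy (bits) of the feature-class distribution, where a
    feature class is the set of features sharing one codeword:
    H = -sum_i (||ci||/F) * log2(||ci||/F)."""
    F = len(codewords)
    sizes = Counter(codewords).values()  # ||ci|| for each distinct class
    return -sum((s / F) * math.log2(s / F) for s in sizes)

# All codewords identical: one class, no uncertainty -> entropy 0.
assert class_entropy(["00", "00", "00", "00"]) == 0.0
# All codewords distinct: maximum uncertainty -> entropy log2(F) = 2 bits.
assert class_entropy(["00", "01", "10", "11"]) == 2.0
```

This matches the slide's two limiting cases: identical codewords give minimal entropy, fully distinct codewords give the maximum, log2(F).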

  15. Cost Function – Entropy Example

  16. Algorithm - Overview • Start with a list of synthesized compounds • Goal – select the subset that maximizes entropy • State – the set of compounds whose entropy can be calculated • Note: from the entropy calculation, the state is a function of the classes, but moves through the state space are a function of the compounds • In general the entropy cannot be updated incrementally and must be completely reevaluated whenever the state changes • Stark contrast with other library design methods • Despite this seeming limitation, the method is very efficient

  17. Algorithm - Details • The approaches to discrete and combinatorial designs are very similar • Both use a greedy build-up of the library to the desired number of compounds • Greedy – make the locally optimal choice at each step in the hope of reaching a global optimum • Followed by a second phase that reevaluates each of the library components, looking for a better selection • Repeat until no improvement
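The two phases above can be sketched for the discrete case as follows. This is an illustrative toy, not the paper's implementation: compounds are modeled as sets of features they test, a feature's codeword is the pattern of selected compounds containing it, and all names are my own.

```python
import math
from collections import Counter

def entropy_of(library, features):
    """Entropy of the feature classes induced by a candidate library: each
    feature's codeword is the pattern of compounds that contain it."""
    F = len(features)
    codes = Counter(tuple(f in c for c in library) for f in features)
    return -sum((s / F) * math.log2(s / F) for s in codes.values())

def greedy_design(pool, features, k):
    """Phase 1: greedy build-up to k compounds. Phase 2: revisit each pick
    looking for a better selection; repeat until no improvement."""
    lib = []
    for _ in range(k):  # greedy build-up
        lib.append(max(pool, key=lambda c: entropy_of(lib + [c], features)))
    improved = True
    while improved:     # reevaluation / swap phase
        improved = False
        for i in range(k):
            rest = lib[:i] + lib[i + 1:]
            best = max(pool, key=lambda c: entropy_of(rest + [c], features))
            if entropy_of(rest + [best], features) > entropy_of(lib, features) + 1e-12:
                lib[i] = best
                improved = True
    return lib

features = list(range(4))
pool = [frozenset(s) for s in ({0, 1}, {0, 2}, {1, 3}, {2, 3}, {0})]
lib = greedy_design(pool, features, 2)
# Two well-chosen compounds give all four features distinct codewords,
# reaching the maximum entropy log2(4) = 2 bits.
assert abs(entropy_of(lib, features) - 2.0) < 1e-9
```

Note how `entropy_of` is recomputed from scratch at every move, mirroring the slide's point that the entropy cannot in general be updated incrementally.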

  18. Algorithm - Extensions • 1. Guarantee that certain items are included in the library • 2. Subsample the source pool during the build-up and optimization phases • Dramatically decreases run time • Only slightly impacts the quality of the designs • 3. Define a minimum Tanimoto fingerprint similarity between any two compounds in a discrete library • 1 is implemented for both the discrete and combinatorial algorithms • 2 and 3 are implemented only for the discrete algorithm
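Extension 3's pairwise Tanimoto constraint can be sketched as below, treating fingerprints as sets of on-bits (the function names and the threshold-check helper are illustrative assumptions, not the tool's API):

```python
def tanimoto(fp1: set, fp2: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |intersection| / |union|."""
    if not fp1 and not fp2:
        return 1.0
    return len(fp1 & fp2) / len(fp1 | fp2)

def pairwise_ok(library, threshold):
    """True if every compound pair meets the Tanimoto constraint."""
    return all(tanimoto(a, b) >= threshold
               for i, a in enumerate(library) for b in library[i + 1:])

a, b = {1, 2, 3, 4}, {3, 4, 5, 6}
assert tanimoto(a, b) == 2 / 6  # 2 shared bits, 6 bits in the union
assert pairwise_ok([a, b], threshold=0.25)
```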

  19. Implementation Details • C++ • Microsoft Windows NT • 500 MHz Intel Pentium III • 500 MB RAM

  20. Results • 9 different libraries selected with the algorithm • 273,373-compound source pool • 3-component reaction A + B + C -> D • Monomer lists of length 33,436 • 19 4-point pharmacophore signatures calculated for all compounds in the source pool • Compared final measures to the optimal result and to a random result

  21. Results

  22. Results - Entropy • The combinatorial algorithm lags behind the discrete one in performance • A discrete library of 91 compounds has the same measure as the optimal combinatorial library of 250 compounds • It may still be more cost-effective to synthesize the combinatorial library • General rule – twice as many compounds are required in a combinatorial library to achieve the same information as a discrete library • Iterative setting • Use the combinatorial algorithm early in discovery • Use the discrete algorithm later to cherry-pick specific compounds
