
Machine Learning in Drug Design

Machine Learning in Drug Design. David Page, Dept. of Biostatistics and Medical Informatics and Dept. of Computer Sciences. Collaborators: Michael Waddell, Paul Finn, Ashwin Srinivasan, John Shaughnessy, Bart Barlogie, Frank Zhan, Stephen Muggleton, Arno Spatola, Sean McIlwain, Brian Kay.


Presentation Transcript


  1. Machine Learning in Drug Design • David Page, Dept. of Biostatistics and Medical Informatics and Dept. of Computer Sciences

  2. Collaborators • Michael Waddell • Paul Finn • Ashwin Srinivasan • John Shaughnessy • Bart Barlogie • Frank Zhan • Stephen Muggleton • Arno Spatola • Sean McIlwain • Brian Kay

  3. Outline • Overview of Drug Design • How Machine Learning Fits Into the Process • Target Search: Single Nucleotide Polymorphisms (SNPs) • Machine Learning from Feature Vectors • Decision Trees • Support Vector Machines • Voting/Ensembles • Predicting Molecular Activity: Learning from Structure

  4. Drugs Typically Are… • Small organic molecules that… • Modulate disease by binding to some target protein… • At a location that alters the protein’s behavior (e.g., antagonist or agonist). • Target protein might be human (e.g., ACE for blood pressure) or belong to invading organism (e.g., surface protein of a bacterium).

  5. Example of Binding

  6. So To Design a Drug: • Identify Target Protein (knowledge of proteome/genome; relevant biochemical pathways). • Determine Target Site Structure (crystallography, NMR; difficult if membrane-bound). • Synthesize a Molecule that Will Bind (imperfect modeling of structure; structures may change at binding). • And even then…

  7. Molecule Binds Target But May: • Bind too tightly or not tightly enough. • Be toxic. • Have other effects (side-effects) in the body. • Break down as soon as it gets into the body, or not leave the body soon enough. • Not get to where it should in the body (e.g., fail to cross the blood-brain barrier). • Not diffuse from gut to bloodstream.

  8. And Every Body is Different: • Even if a molecule works in the test tube and works in animal studies, it may not work in people (will fail in clinical trials). • A molecule may work for some people but not others. • A molecule may cause harmful side-effects in some people but not others.

  9. Outline • Overview of Drug Design • How Machine Learning Fits Into the Process • Target Search: Single Nucleotide Polymorphisms (SNPs) • Machine Learning from Feature Vectors • Decision Trees • Support Vector Machines • Voting/Ensembles • Predicting Molecular Activity: Learning from Structure

  10. Places to use Machine Learning • Finding target proteins. • Inferring target site structure. • Predicting who will respond positively/negatively.

  11. Places to use Machine Learning • Finding target proteins. • Inferring target site structure. • Predicting who will respond positively/negatively.

  12. Healthy vs. Disease (figure panels: Healthy, Diseased)

  13. If We Could Sequence DNA Quickly and Cheaply, We Could: • Sequence DNA of people taking a drug, and use ML to identify consistent differences between those who respond well and those who do not. • Sequence DNA of cancer cells and healthy cells, and use ML to detect dangerous mutations… proteins these genes code for may be useful targets. • Sequence DNA of people who get a disease and those who don’t, and use ML to determine genes related to susceptibility… proteins these genes code for may be useful targets.

  14. Problem: Can’t Sequence Quickly • Can quickly test single positions where variation is common: Single Nucleotide Polymorphisms (SNPs). • Can quickly test degree to which every gene is being transcribed: Gene Expression Microarrays (e.g., Affymetrix Gene Chips™). • Can (moderately) quickly test which proteins are present in a sample (Proteomics).

  15. Outline • Overview of Drug Design • How Machine Learning Fits Into the Process • Target Search: Single Nucleotide Polymorphisms (SNPs) • Machine Learning from Feature Vectors • Decision Trees • Support Vector Machines • Voting/Ensembles • Predicting Molecular Activity: Learning from Structure

  16. Example of SNP Data

  17. Problem: SNPs are not Genes • If we find a predictive SNP, it may not be part of a gene… we can only infer that the SNP is “near” a gene that may be involved in the disease. • Even if the SNP is part of a gene, it may be another nearby gene that is the key gene.

  18. Problem: Even SNPs are Costly • Typically cannot use all known SNPs. • Can focus on a particular chromosome and area if knowledge permits that. • Can use a scattering of SNPs, since SNPs that are very close together may be redundant… use one SNP per haplotype block, or region where recombination is rare.

  19. Why Machine Learning? • There may be no single SNP in our data that distinguishes disease vs. healthy. • Some combination of SNPs may still be predictive, and that combination can offer insight.

  20. Outline • Overview of Drug Design • How Machine Learning Fits Into the Process • Target Search: Single Nucleotide Polymorphisms (SNPs) • Machine Learning from Feature Vectors • Decision Trees • Support Vector Machines • Voting/Ensembles • Predicting Molecular Activity: Learning from Structure

  21. Decision Trees in One Picture

  22. Naïve Bayes in One Picture (figure nodes: Age, SNP 1, SNP 2, …, SNP 3000)

  23. Voting Approach • Score SNPs using information gain. • Choose top 1% scoring SNPs. • To classify a new case, let these SNPs vote (majority or weighted majority vote). • We use majority vote here.
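
The voting approach on slide 23 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the 0/1/2 genotype coding, and the toy data are assumptions, and so is the rule that each selected SNP votes for the majority class seen among training patients with the same genotype (the slide does not say how a single SNP casts its vote).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in label entropy obtained by splitting on one SNP's values."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def train_voters(X, y, fraction=0.01):
    """Keep the top-scoring fraction of SNPs; each one votes by genotype."""
    n_snps = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_snps)]
    ranked = sorted(range(n_snps),
                    key=lambda j: information_gain(columns[j], y),
                    reverse=True)
    keep = ranked[:max(1, round(fraction * n_snps))]
    default = Counter(y).most_common(1)[0][0]  # used for unseen genotypes
    voters = {}
    for j in keep:
        table = {}
        for v in set(columns[j]):
            subset = [lab for x, lab in zip(columns[j], y) if x == v]
            table[v] = Counter(subset).most_common(1)[0][0]
        voters[j] = table
    return voters, default

def predict(voters, default, row):
    """Unweighted majority vote over the selected SNPs."""
    votes = Counter(table.get(row[j], default) for j, table in voters.items())
    return votes.most_common(1)[0][0]

# Invented toy data: 6 patients x 4 SNPs, genotypes coded 0/1/2.
# Only SNP 0 carries any information about the label.
X = [[0, 1, 2, 0], [0, 2, 1, 1], [0, 0, 0, 2],
     [2, 1, 2, 0], [2, 2, 1, 1], [2, 0, 0, 2]]
y = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]

voters, default = train_voters(X, y, fraction=0.25)
prediction = predict(voters, default, [2, 1, 1, 0])
```

A weighted majority vote, the alternative the slide mentions, would replace the unweighted `Counter` in `predict` with per-SNP weights such as the information-gain scores.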

  24. Task: Predict Early Onset Disease From SNP Data • Only 3000 SNPs, coarsely sampled over entire genome. • 80 patients (examples), 40 with early onset. • Using technology from Orchid. • Can a predictor be learned that performs significantly better than chance on unseen data?

  25. Results • Use all data, only top 1% of features, or only top 10% of features (according to decision tree’s purity measure). • Use Trees, SVMs, Voting. • SVMs with top 10% achieve 71% accuracy. Significantly better than chance (50%).
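
The slides do not say how significance was assessed. One simple way to check the "better than chance" claim is an exact one-sided binomial test, sketched below. It assumes that 71% accuracy means 57 of the 80 patients were classified correctly and that the predictions are independent, which cross-validation only approximates; `binomial_tail` is a hypothetical helper name.

```python
from math import comb

def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_patients = 80
# Assumption: the reported 71% applies to all 80 patients.
n_correct = round(0.71 * n_patients)
p_value = binomial_tail(n_correct, n_patients)
print(f"{n_correct}/{n_patients} correct, one-sided p = {p_value:.2e}")
```

Under these assumptions the probability of doing this well by guessing is far below the usual 0.05 threshold.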

  26. Lessons • Feature selection is important for performance. • Methodology note for machine learning specialists: must repeat this entire process on each fold of cross-validation or results will be overly optimistic. • SNP approach is promising… get funding to measure more SNPs. • More work is needed on SVM comprehensibility.
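
The methodology note can be demonstrated on pure noise. The sketch below builds 80 "patients" with 3000 random binary features and labels unrelated to them, then runs 10-fold cross-validation twice: once selecting the top 1% of features inside each training fold, and once selecting them a single time on all the data. Everything here (the scoring rule, the simple voting classifier, the function names, the random data) is an assumption chosen to keep the example short, not the pipeline used in the study.

```python
import random

def score(column, labels):
    """Hypothetical relevance score: distance of label agreement from 50%."""
    agree = sum(1 for x, lab in zip(column, labels) if x == lab)
    return abs(agree / len(labels) - 0.5)

def select_top(X, y, n_keep):
    """Indices of the n_keep highest-scoring features."""
    n_features = len(X[0])
    return sorted(range(n_features),
                  key=lambda j: score([r[j] for r in X], y),
                  reverse=True)[:n_keep]

def fit_vote(X, y, features):
    """For each feature, record whether it votes as-is or inverted."""
    model = []
    for j in features:
        agree = sum(1 for r, lab in zip(X, y) if r[j] == lab)
        model.append((j, agree * 2 >= len(y)))
    return model

def predict_vote(model, row):
    """Majority vote of the selected features; ties go to class 0."""
    votes = sum(row[j] if same else 1 - row[j] for j, same in model)
    return 1 if votes * 2 > len(model) else 0

def cv_accuracy(X, y, k, n_keep, select_inside_fold=True):
    idx = list(range(len(y)))
    folds = [idx[i::k] for i in range(k)]
    # The leaky variant picks features once, using the held-out rows too.
    global_features = None if select_inside_fold else select_top(X, y, n_keep)
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        features = (select_top(Xtr, ytr, n_keep) if select_inside_fold
                    else global_features)
        model = fit_vote(Xtr, ytr, features)
        correct += sum(predict_vote(model, X[i]) == y[i] for i in test)
    return correct / len(y)

random.seed(0)
n_patients, n_snps = 80, 3000
y = [i % 2 for i in range(n_patients)]  # labels independent of the features
X = [[random.randint(0, 1) for _ in range(n_snps)] for _ in range(n_patients)]

honest = cv_accuracy(X, y, k=10, n_keep=30, select_inside_fold=True)
leaky = cv_accuracy(X, y, k=10, n_keep=30, select_inside_fold=False)
print(f"selection inside each fold: {honest:.2f}")
print(f"selection on all data:      {leaky:.2f}")
```

Because the features are noise, an honest estimate should hover near 50%. Selecting on all the data lets the held-out patients influence which features are kept, so the second number comes out well above chance even though nothing real was learned.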

  27. Outline • Overview of Drug Design • How Machine Learning Fits Into the Process • Target Search: Single Nucleotide Polymorphisms (SNPs) • Machine Learning from Feature Vectors • Decision Trees • Support Vector Machines • Voting/Ensembles • Predicting Molecular Activity: Learning from Structure

  28. Places to use Machine Learning • Finding target proteins. • Inferring target site structure. • Predicting who will respond positively/negatively.

  29. Typical Practice when Target Structure is Unknown • Test many molecules (1,000,000) to find some that bind to target (ligands). • Infer (induce) shape of target site from 3D structural similarities. • Shared 3D substructure is called a pharmacophore. • Perfect example of a machine learning task with spatial target.

  30. An Example of Structure Learning (figure panels: Inactive, Active)

  31. Inductive Logic Programming • Represents data points in mathematical logic • Uses Background Knowledge • Returns results in logic

  32. The Logical Representation of a Pharmacophore

  33. Background Knowledge I • Information about atoms and bonds in the molecules • atm(m1,a1,o,3,5.915800,-2.441200,1.799700). • atm(m1,a2,c,3,0.574700,-2.773300,0.337600). • atm(m1,a3,s,3,0.408000,-3.511700,-1.314000). • bond(m1,a1,a2,1). • bond(m1,a2,a3,1).

  34. Background knowledge II • Definition of distance equivalence:
        dist(Drug,Atom1,Atom2,Dist,Error) :-
            number(Error),
            coord(Drug,Atom1,X1,Y1,Z1),
            coord(Drug,Atom2,X2,Y2,Z2),
            euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1),
            Diff is Dist1-Dist,
            absolute_value(Diff,E1),
            E1 =< Error.
        euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D) :-
            Dsq is (X1-X2)^2+(Y1-Y2)^2+(Z1-Z2)^2,
            D is sqrt(Dsq).
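
For readers less familiar with Prolog, the distance test above can be rendered in Python as follows. This is a sketch under assumptions: the coordinates are copied from the `atm` facts on slide 33, it is assumed that `coord/5` simply exposes the last three fields of each `atm` fact, and the 1.8 target distance in the usage lines is made up for illustration.

```python
import math

# Coordinates taken from the atm facts for molecule m1 on slide 33.
coords = {
    ("m1", "a1"): (5.915800, -2.441200, 1.799700),
    ("m1", "a2"): (0.574700, -2.773300, 0.337600),
    ("m1", "a3"): (0.408000, -3.511700, -1.314000),
}

def euc_dist(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dist(drug, atom1, atom2, target, error):
    """True if the two atoms lie within `error` of the `target` distance."""
    actual = euc_dist(coords[(drug, atom1)], coords[(drug, atom2)])
    return abs(actual - target) <= error

# Illustrative queries with a made-up target distance of 1.8.
bonded_pair_ok = dist("m1", "a2", "a3", 1.8, 0.1)
distant_pair_ok = dist("m1", "a1", "a2", 1.8, 0.1)
```

Unlike the Prolog predicate, this function only checks a given target distance; it cannot be run "backwards" to enumerate atom pairs, which is part of what makes the logical representation convenient for search.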

  35. Central Idea: Generalize by searching a lattice

  36. Conformational model • Conformational flexibility modelled as multiple conformations: • Sybyl randomsearch • Catalyst

  37. Pharmacophore description • Atom and site centred • Hydrogen bond donor • Hydrogen bond acceptor • Hydrophobe • Site points (limited at present) • User definable • Distance based

  38. Example 1: Dopamine agonists • Agonists taken from Martin data set on QSAR society web pages • Examples (5-50 conformations/molecule)

  39. Pharmacophore identified • Molecule A has the desired activity if: • in conformation B molecule A contains a hydrogen acceptor at C, and • in conformation B molecule A contains a basic nitrogen group at D, and • the distance between C and D is 7.05966 +/- 0.75 Angstroms, and • in conformation B molecule A contains a hydrogen acceptor at E, and • the distance between C and E is 2.80871 +/- 0.75 Angstroms, and • the distance between D and E is 6.36846 +/- 0.75 Angstroms, and • in conformation B molecule A contains a hydrophobic group at F, and • the distance between C and F is 2.68136 +/- 0.75 Angstroms, and • the distance between D and F is 4.80399 +/- 0.75 Angstroms, and • the distance between E and F is 2.74602 +/- 0.75 Angstroms.

  40. Example II: ACE inhibitors • 28 angiotensin converting enzyme inhibitors taken from literature • D. Mayer et al., J. Comput.-Aided Mol. Design, 1, 3-16, (1987)

  41. Experiment 1 • Attempt to identify pharmacophore using original Mayer et al. data (final conformations). • Initial failed attempt traced to “bugs” in background knowledge definition. • 4 pharmacophores found with corrected code (variations on common theme)

  42. ACE pharmacophore • Molecule A is an ACE inhibitor if: • molecule A contains a zinc-site B, • molecule A contains a hydrogen acceptor C, • the distance between B and C is 7.899 +/- 0.750 Angstroms, • molecule A contains a hydrogen acceptor D, • the distance between B and D is 8.475 +/- 0.750 Angstroms, • the distance between C and D is 2.133 +/- 0.750 Angstroms, • molecule A contains a hydrogen acceptor E, • the distance between B and E is 4.891 +/- 0.750 Angstroms, • the distance between C and E is 3.114 +/- 0.750 Angstroms, • the distance between D and E is 3.753 +/- 0.750 Angstroms.
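
A learned rule of this kind can be applied to a new conformation by searching for an assignment of its annotated sites to the labels B, C, D and E. The Python sketch below does this by brute force. The six distances and the 0.75 tolerance come from the slide; the data layout, the function name `find_match`, and both sets of site coordinates are invented for illustration. It checks a given hypothesis only and is not the ILP search that produced the rule.

```python
import math
from itertools import permutations

TOLERANCE = 0.75  # Angstroms, as on the slide

ACE_PHARMACOPHORE = {
    "types": {"B": "zinc_site", "C": "acceptor",
              "D": "acceptor", "E": "acceptor"},
    "distances": {("B", "C"): 7.899, ("B", "D"): 8.475, ("C", "D"): 2.133,
                  ("B", "E"): 4.891, ("C", "E"): 3.114, ("D", "E"): 3.753},
}

def find_match(sites, pharmacophore, tol=TOLERANCE):
    """Return a {label: site index} assignment meeting every constraint, or None."""
    labels = list(pharmacophore["types"])
    for combo in permutations(range(len(sites)), len(labels)):
        assign = dict(zip(labels, combo))
        # Each label must land on a site of the required type.
        if any(sites[i][0] != pharmacophore["types"][lab]
               for lab, i in assign.items()):
            continue
        # Every pairwise distance must be within the tolerance.
        if all(abs(math.dist(sites[assign[a]][1], sites[assign[b]][1]) - d) <= tol
               for (a, b), d in pharmacophore["distances"].items()):
            return assign
    return None

# Invented conformation: (site type, xyz) pairs, not real molecular data.
matching_sites = [
    ("acceptor", (0.0, 0.0, 0.0)),
    ("acceptor", (2.13, 0.0, 0.0)),
    ("acceptor", (0.04, 3.11, 0.0)),
    ("zinc_site", (-1.14, 7.75, 1.03)),
    ("hydrophobe", (5.0, 5.0, 5.0)),
]
# Same sites with the zinc site moved too far from every acceptor.
shifted_sites = matching_sites[:3] + [("zinc_site", (-1.14, 12.0, 1.03))]

match = find_match(matching_sites, ACE_PHARMACOPHORE)
no_match = find_match(shifted_sites, ACE_PHARMACOPHORE)
```

Exhaustive permutation is acceptable for a handful of sites per conformation; a molecule with many conformations would repeat the check once per conformation and accept if any of them matches.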

  43. Pharmacophore discovered (figure: points A, B, C; legend: zinc site, H-bond acceptor)

  44. Experiment 2 • Definition of “zinc ligand” added to background knowledge • based on crystallographic data • Multiple conformations • Sybyl RandomSearch

  45. Experiment 2 • Original pharmacophore rediscovered plus one other • different zinc ligand position • similar to alternative proposed by Ciba-Geigy (figure distance labels: 4.0, 3.9, 7.3)

  46. Example III: Thermolysin inhibitors • 10 inhibitors for which crystallographic data is available in PDB • Conformationally challenging molecules • Experimentally observed superposition

  47. Key binding site interactions (figure labels: Asn112-NH, O=C Asn112, S2’, Arg203-NH, S1’, O=C Ala113, Zn)

  48. Interactions made by inhibitors

  49. Pharmacophore Identification • Structures considered 1HYT 1THL 1TLP 1TMN 2TMN 4TLN 4TMN 5TLN 5TMN 6TMN • Conformational analysis using “Best” conformer generation in Catalyst • 98-251 conformations/molecule
