
Building a Chemical Informatics Grid

This presentation provides an overview of the problem of connecting various resources in chemical informatics and the solution of building a distributed computing environment. It explores the use of web services and grid resources, as well as domain-specific tools and standards.

Presentation Transcript


  1. Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University

  2. Acknowledgments • CICC researchers and developers who contributed to this presentation: • Prof. Geoffrey Fox, Prof. David Wild, Prof. Mookie Baik, Prof. Gary Wiggins, Dr. Jungkee Kim, Dr. Rajarshi Guha, Sima Patel, Smitha Ajay, Xiao Dong • Thanks also to Prof. Peter Murray Rust and the WWMM group at Cambridge University • More info: www.chembiogrid.org and www.chembiogrid.org/wiki.

  3. Chemical Informatics and the Grid An overview of the basic problem and solution

  4. Chemical Informatics as a Grid Application • Chemical Informatics is the application of information technology to problems in chemistry. • Example problems: managing data in large scale drug discovery and molecular modeling • Building Blocks: Chemical Informatics Resources: • Chemical databases maintained by various groups • NIH PubChem, NIH DTP • Application codes (both commercial and open source) • Data mining, clustering • Quantum chemistry and molecular modeling • Visualization tools • Web resources: journal articles, etc. • A Chemical Informatics Grid will need to integrate these into a common, loosely coupled, distributed computing environment.

  5. Problem: Connecting It Together • The problem is defining an architecture for tying all of these pieces into a distributed computing system. • A “Grid” • How can I combine application codes, web resources, and databases to solve a particular problem that interests me? • Specifically, how do I build a runtime environment that can connect the distributed services I need to solve an interesting problem? • For academic and government researchers, how can I do all of this in an open fashion? • Data and services can come from anywhere • That is, I must avoid proprietary infrastructure.

  6. NIH Roadmap for Medical Research http://nihroadmap.nih.gov/ • The NIH recognizes chemical and biological information management as critical to medical research. • Federally funded high throughput screening centers. • 100-200 HTS assays per year on small molecules. • Hundreds of thousands of small molecules analyzed • Data published, publicly available through NIH PubChem online database. • What do you do with all of this data?

  7. High-Throughput Screening Testing perhaps millions of compounds in a corporate collection to see if any show activity against a certain disease protein

  8. High-Throughput Screening • Traditionally, small numbers of compounds were tested for a particular project or therapeutic area • About 10 years ago, technology developed that enabled large numbers of compounds to be assayed quickly • High-throughput screening can now test 100,000 compounds a day for activity against a protein target • Maybe tens of thousands of these compounds will show some activity for the protein • The chemist needs to intelligently select the 2 - 3 classes of compounds that show the most promise for being drugs to follow-up

  9. Informatics Implications • Need to be able to store chemical structure and biological data for millions of data points • Computational representation of 2D structure • Need to be able to organize thousands of active compounds into meaningful groups • Group similar structures together and relate to activity • Need to learn as much information as possible (data mining) • Apply statistical methods to the structures and related information • Need to use molecular modeling to gain direct chemical insight into reactions.

  10. The Solution, Part I: Web Services • Web Services provide the means for wrapping databases, applications, web scavengers, etc, with programming interfaces. • WSDL definitions define how to write clients to talk with databases, applications, etc. • Web Service messaging through SOAP • Discovery services such as UDDI, MDS, and so on. • Many toolkits available • Axis, .NET, gSOAP, SOAP::Lite, etc. • Web Services can be combined with each other into workflows • Workflow==use case scenario • More about this later.
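To make the SOAP messaging step concrete, here is a minimal sketch of how a client request could be wrapped in a SOAP 1.1 envelope and unpacked on the service side, using only the Python standard library. The operation name `clusterCompounds`, its parameters, and the `urn:example:chem-services` namespace are hypothetical; a real service would publish its own names in its WSDL, and a toolkit such as Axis would normally generate this plumbing.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
CHEM_NS = "urn:example:chem-services"                  # hypothetical service namespace


def build_request(operation, params):
    """Wrap an operation call and its parameters in a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{CHEM_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{CHEM_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def read_request(message):
    """Return (operation, params) from a message built by build_request."""
    root = ET.fromstring(message)
    call = root.find(f"{{{SOAP_NS}}}Body")[0]
    operation = call.tag.split("}", 1)[1]              # strip the namespace part
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params


# Hypothetical call: ask a clustering service to group one compound set.
message = build_request(
    "clusterCompounds",
    {"smiles": "CN1C=NC2=C1C(=O)N(C)C(=O)N2C", "clusters": 2},
)
operation, params = read_request(message)
print(operation, params)
```

All values travel as text inside the envelope, which is why `clusters` comes back as a string; typed bindings are what the WSDL and toolkit layer add on top.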

  11. Basic Architectures: Servlets/CGI and Web Services [Diagram not reproduced. Labels: Browser, GUI Client, Web Server, HTTP GET/POST, WSDL, SOAP, JDBC, DB]

  12. Solution Part II: Grid Resources • Many Grid tools provide powerful backend services • Globus: uniform, secure access to computing resources (like TeraGrid) • File management, resource allocation management, etc. • Condor: job scheduling on computer clusters and collections • SRB: data grid access • OGSA-DAI: uniform Grid interface to databases. • These have Web Service as well as other interfaces (or equivalently, protocols).

  13. Solution, Part III: Domain Specific Tools and Standards -->More Services • For Chemical Informatics, we have a number of tools and standards. • Chemical string representations • SMILES, InChI • Chemistry Markup Language • XML language for describing, exchanging data. • JUMBO 5: a CML parser and library • Glue Tools and Applications • Chemistry Development Kit (CDK) • OpenBabel • These are the basis for building interoperable Chemical Informatics Web Services • Analogous situations exist for other domains • Astronomy, Geosciences, Biology/Bioinformatics

  14. Solution Part IV: Workflows • Workflow engines allow you to connect services together into interesting composite applications. • This allows you to directly encode your scientific use case scenario as a graph of interacting services. • There are many workflow tools • We’ll briefly cover these later. • General guidance is to build web services first and then use workflow tools on top of these services. • Don’t get married to a particular workflow technology yet, unless someone pays you.
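The idea of encoding a use case as a graph of interacting services can be sketched as follows. This is only an illustration: the four step names and the plain functions standing in for remote services are hypothetical, and the sketch uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) in place of a real workflow engine.

```python
from graphlib import TopologicalSorter


# Hypothetical stand-ins for remote services. Each receives a dict holding
# the results of the steps it depends on and returns its own result.
def fetch_compounds(inputs):
    return ["CCO", "CCN", "c1ccccc1"]


def fingerprint(inputs):
    # Placeholder descriptor (string length), not a real chemical fingerprint.
    return {smiles: len(smiles) for smiles in inputs["fetch"]}


def cluster(inputs):
    groups = {}
    for smiles, value in inputs["fingerprint"].items():
        groups.setdefault(value, []).append(smiles)
    return sorted(groups.values())


def report(inputs):
    return f"{len(inputs['cluster'])} clusters"


SERVICES = {
    "fetch": fetch_compounds,
    "fingerprint": fingerprint,
    "cluster": cluster,
    "report": report,
}

# The workflow graph: each step maps to the set of steps it depends on.
WORKFLOW = {
    "fetch": set(),
    "fingerprint": {"fetch"},
    "cluster": {"fingerprint"},
    "report": {"cluster"},
}


def run_workflow(workflow, services):
    """Run every step after its dependencies and collect all results."""
    results = {}
    for step in TopologicalSorter(workflow).static_order():
        inputs = {dep: results[dep] for dep in workflow[step]}
        results[step] = services[step](inputs)
    return results


results = run_workflow(WORKFLOW, SERVICES)
print(results["report"])
```

Because the graph is kept separate from the service implementations, the same services could be rewired into a different workflow without being changed, which is the reason for building services first and adding workflow tools on top.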

  15. Solution Part V: User Interfaces • Web Services allow you to cleanly separate user interfaces from backend services. • Model-view-controller pattern for web applications • Client environments include • Grid and web service scripting environments • Desktop tools like Taverna and Kepler • Portlet-based Web portal systems • Typically, desktop tools like Taverna are used by power users to define interesting workflows. • Portals are for running canned workflows.

  16. Next steps • Next we will review the online database resources that are available to us. • Databases come in two varieties • Journal databases • Data databases • As we will discuss, it is useful to build services and workflows for automatically interacting with both types.

  17. Online Chemical Journal and Data Resources

  18. MEDLINE: Online Journal Database • MEDLINE (Medical Literature Analysis and Retrieval System Online) is an international literature database of life sciences and biomedical information. • It covers the fields of medicine, nursing, dentistry, veterinary medicine, and health care. • MEDLINE covers much of the literature in biology and biochemistry, and fields with no direct medical connection, such as molecular evolution. • It is accessed via PubMed. http://en.wikipedia.org/wiki/Medline

  19. PubMed: Journal Search Engine • PubMed is a free search engine offered by the United States National Library of Medicine as part of the Entrez information retrieval system. • The PubMed service allows searching the MEDLINE database. • MEDLINE covers over 4,800 journals published in the United States and more than 70 other countries primarily from 1966 to the present. • In addition to MEDLINE, PubMed also offers access to: • OLDMEDLINE for pre-1966 citations. • Citations to articles that are out-of-scope (e.g., general science and chemistry) from certain MEDLINE journals • In-process citations which provide a record for an article before it is indexed with MeSH and added to MEDLINE • Citations that precede the date that a journal was selected for MEDLINE indexing • Some life science journals http://www.ncbi.nlm.nih.gov/entrez/query/static/overview.html

  20. PubChem: Chemical Database • PubChem is a database of chemical molecules. • The system is maintained by the National Center for Biotechnology Information (NCBI) which belongs to the United States National Institutes of Health (NIH). • PubChem can be accessed for free through a web user interface. • And Web Services for programmatic access • PubChem contains mostly small molecules with a molecular mass below 500. • Anyone can contribute • The database is free to use, but it is not curated, so the value of the information on a specific compound could be questionable. • NIH funded HTS results are (intended to be) available through PubChem. http://pubchem.ncbi.nlm.nih.gov/

  21. NIH DTP Database • Part of NIH’s Developmental Therapeutics Program. • Screens up to 3,000 compounds per year for potential anticancer activity. • Utilizes 59 different human tumor cell lines, representing leukemia, melanoma and cancers of the lung, colon, brain, ovary, breast, prostate, and kidney. • DTP screening results are part of PubChem and also available as a separate database. http://dtp.nci.nih.gov/

  22. Example screening results. Positive results (red bar to right of vertical line) indicate greater than average toxicity of cell line to tested agent. http://dtp.nci.nih.gov/docs/compare/compare.html

  23. DTP and COMPARE • COMPARE is an algorithm for mining DTP result data to find and rank order compounds with similar DTP screening results. • Why COMPARE? • Discovered compounds may be less toxic to humans but just as effective against cancer cell lines. • May be much easier/safer to manufacture. • May be a guide to deeper understanding of experiments http://dtp.nci.nih.gov/docs/compare/compare_methodology.html

  24. Many Other Online Databases • Complementary protein information • Indiana University: Varuna project • Discussed in this presentation • University of Michigan: Binding MOAD • “Mother of All Databases” • Largest curated database of protein-ligand complexes • Subset of protein databank • Prof. Heather Carlson • University of Michigan: PDBBind • Provides a collection of experimentally measured binding affinity data (Kd, Ki, and IC50) exclusively for the protein-ligand complexes available in the Protein Data Bank (PDB) • Dr. Shaomeng Wang

  25. The Point Is… • All of these databases can be accessed on line with human-usable interfaces. • But that’s not so important for our purposes • More importantly, many of them are beginning to define Web Service interfaces that let other programs interact with them. • Plenty of tools and libraries can simulate browsers, so you can also build your own service. • This allows us to remotely analyze databases with clustering and other applications without modifying the databases themselves. • Can be combined with text mining tools and web robots to find out who else is working in the area.

  26. Encoding chemistry

  27. Chemical Machine Languages • Interestingly, chemistry has defined three simple languages for encoding chemical information. • InChI, SMILES, CML • Can generate these by hand or automatically • InChIs and SMILES can represent molecules as a single string/character array. • Useful as keys for databases and for search queries in Google. • You can convert between SMILES and InChIs • OpenBabel, OELib, JOELib • CML is an XML format, and more verbose, but benefits from XML community tools

  28. SMILES: Simplified Molecular Input Line Entry Specification • Language for describing the structure of chemical molecules using ASCII strings. http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html

  29. InChI: International Chemical Identifier • IUPAC and NIST Standard similar to SMILES • Encodes structural information about compounds • Based on an open standard and algorithms. http://wwmm.ch.cam.ac.uk/inchifaq/

  30. InChI in Public Chemistry Databases • US National Institute of Standards and Technology (NIST) - 150,000 structures • NIH/NCBI/PubChem project - >3.2 million structures • Thomson ISI - 2+ million structures • US National Cancer Institute (NCI) Database - 23+ million structures • US Environmental Protection Agency (EPA) DSSToX Database - 1450 structures • Kyoto Encyclopaedia of Genes and Genomes (KEGG) database - 9584 structures • University of California at San Francisco ZINC - >3.3 million structures • BRENDA enzyme information system (University of Cologne) - 36,000 structures • Chemical Entities of Biological Interest (ChEBI) database of the European Bioinformatics Institute - 5000 structures • University of California Carcinogenic Potency Project - 1447 structures • Compendium of Pesticide Common Names - 1437 (2005-03-03) structures

  31. Journals and Software Using InChI • Journals • Nature Chemical Biology. • Beilstein Journal of Organic Chemistry • Software • ACD/Labs ACD/ChemSketch. • ChemAxon Marvin. • SciTegic Pipeline Pilot. • CACTVS Chemoinformatics Toolkit by Xemistry, GmbH. http://wwmm.ch.cam.ac.uk/inchifaq/

  32. Chemistry Markup Language • CML is an XML markup language for encoding chemical information. • Developed by Peter Murray Rust, Henry Rzepa and others. • Actually dates from the SGML days before XML • More verbose than InChI and SMILES • But inherits XML schema, namespaces, parsers, XPATH, language binding tools like XML Beans, etc. • Not limited to structural information • Has OpenBabel support. http://cml.sourceforge.net/, http://cml.sourceforge.net/wiki/index.php/Main_Page

  33. InChI Compared to SMILES • SMILES is proprietary and different algorithms can give different results. • Seven different unique SMILES for caffeine on Web sites: • [c]1([n+]([CH3])[c]([c]2([c]([n+]1[CH3])[n][cH][n+]2[CH3]))[O-])[O-] • CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 • Cn1cnc2n(C)c(=O)n(C)c(=O)c12 • Cn1cnc2c1c(=O)n(C)c(=O)n2C • N1(C)C(=O)N(C)C2=C(C1=O)N(C)C=N2 • O=C1C2=C(N=CN2C)N(C(=O)N1C)C • CN1C=NC2=C1C(=O)N(C)C(=O)N2C On the other hand, some claim SMILES are more intuitive for human readers. http://wwmm.ch.cam.ac.uk/inchifaq/
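The seven caffeine strings above look unrelated as text, which is exactly why they make poor database keys without canonicalization. As a rough illustration, the sketch below counts heavy atoms in each string with a small regular expression and shows that all seven agree on composition even though no two strings are equal. This is a toy check, not a SMILES parser: it handles only the single-letter bracket atoms and organic-subset atoms that occur in these strings, and equal atom counts are necessary but not sufficient for two strings to describe the same molecule. A real comparison would canonicalize with a toolkit such as OpenBabel or CDK.

```python
import re
from collections import Counter

# The seven caffeine SMILES strings quoted on the slide.
CAFFEINE_SMILES = [
    "[c]1([n+]([CH3])[c]([c]2([c]([n+]1[CH3])[n][cH][n+]2[CH3]))[O-])[O-]",
    "CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12",
    "Cn1cnc2n(C)c(=O)n(C)c(=O)c12",
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
    "N1(C)C(=O)N(C)C2=C(C1=O)N(C)C=N2",
    "O=C1C2=C(N=CN2C)N(C(=O)N1C)C",
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
]

# Either a bracket atom such as [CH3] or [n+], or a bare organic-subset atom.
ATOM = re.compile(r"\[([^\]]+)\]|(Cl|Br|[BCNOPSFI]|[bcnops])")
# Element symbol at the start of a bracket atom (after an optional isotope).
LEADING_ELEMENT = re.compile(r"\d*([A-Z][a-z]?|[a-z])")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element; aromatic 'c' is counted as 'C'."""
    counts = Counter()
    for bracket, bare in ATOM.findall(smiles):
        symbol = LEADING_ELEMENT.match(bracket).group(1) if bracket else bare
        if symbol != "H":
            counts[symbol.capitalize()] += 1
    return counts


for smiles in CAFFEINE_SMILES:
    print(dict(heavy_atom_counts(smiles)), smiles)
```

Every line reports the same eight carbons, four nitrogens, and two oxygens, consistent with caffeine's formula C8H10N4O2, while a plain string comparison would treat the seven entries as seven different compounds.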

  34. A CML Example http://www.medicalcomputing.net/xml_biosciences.html
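The figure from this slide is not reproduced in the transcript. As a stand-in, the following sketch embeds a small hand-written CML fragment for water and reads it with a generic XML parser, which illustrates the point that CML inherits ordinary XML tooling. The fragment is an assumption written for this illustration, not the example from the slide; it uses the `molecule`, `atomArray`, `atom`, `bondArray`, and `bond` elements and the `http://www.xml-cml.org/schema` namespace as they are commonly used in CML documents.

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hand-written illustrative fragment (not taken from the slide).
WATER_CML = """\
<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
"""

molecule = ET.fromstring(WATER_CML)

# Map atom ids to element symbols.
elements = {
    atom.get("id"): atom.get("elementType")
    for atom in molecule.findall("cml:atomArray/cml:atom", CML_NS)
}

# Resolve each bond's two atom references to element symbols.
bonds = [
    tuple(elements[ref] for ref in bond.get("atomRefs2").split())
    for bond in molecule.findall("cml:bondArray/cml:bond", CML_NS)
]

print(molecule.get("id"), elements, bonds)
```

The same structure written as SMILES is the single character `O`, which shows the verbosity trade-off: CML is longer but can be queried, validated, and extended with standard XML tools.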

  35. Clustering Techniques, Computing Requirements, and Clustering Services Computational techniques for organizing data

  36. The Story So Far • We’ve discussed managing screening assay output as the key problem we face • Must sift through mountains of data in PubChem and DTP to find interesting compounds. • NIH funded High Throughput Screening will make this very important in the near future. • Need now a way to organize and analyze the data.

  37. Clustering and Data Analysis • Clustering is a technique that can be applied to large data sets to find similarities • Popular technique in chemical informatics • Data sets are segmented into groups (clusters) in which members of the same cluster are similar to each other. • Clustering is distinct from classification, • There are no pre-determined characteristics used to define the membership of a cluster, • Although items in the same cluster are likely to have many characteristics in common. • Clustering can be applied to chemical structures, for example, in the screening of combinatorial or Markush compound libraries in the quest for new active pharmaceuticals. • We also note that these techniques are fairly primitive • More interesting clustering techniques exist but apparently are not well known by the chemical informatics community.

  38. Non-Hierarchical Clustering • Clusters form around centroids. • The number of which can be specified by the user. • All clusters rank equally and there is no particular relationship between them. http://www.digitalchemistry.co.uk/prod_clustering.html
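K-means, which appears later in these slides, is one common example of centroid-based non-hierarchical clustering. A minimal sketch follows, assuming points are plain coordinate tuples, Euclidean distance is used, and the caller supplies the initial seeds (one per desired cluster); real chemical data would use fingerprint vectors and typically a different distance measure.

```python
import math


def kmeans(points, seeds, max_iter=100):
    """Assign each point to its nearest centroid until assignments settle.

    Returns (labels, centroids); the number of clusters is len(seeds).
    """
    centroids = [tuple(seed) for seed in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:        # converged: nothing moved
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:                 # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Two obvious groups in 2D; seeds are one point taken from each group.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, seeds=[points[0], points[3]])
print(labels, centroids)
```

The resulting clusters are peers: nothing in the output relates cluster 0 to cluster 1, which is the contrast with the hierarchical methods on the next slide.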

  39. Hierarchical Clustering • Clusters are arranged in hierarchies • Smaller clusters are contained within larger ones; the bottom of the hierarchy consists of individual objects in "singleton" clusters, while the top of it consists of one cluster containing all the objects in the dataset. • Such hierarchies can be built either from the bottom up (agglomerative) or the top downwards (divisive) http://www.digitalchemistry.co.uk/prod_clustering.html

  40. Fingerprinting and Dictionaries--What Is Your Parameter Space? • Clustering algorithms require a parameter space • Clusters defined along coordinate axes. • Coordinate axes defined by a dictionary of chemical structures. • Use binary on/off for fingerprinting a particular compound against a dictionary. http://www.digitalchemistry.co.uk/prod_fingerprint.html
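A minimal sketch of binary fingerprinting against a dictionary is shown below. The five dictionary entries and the per-compound feature sets are hypothetical and are supplied by hand; in practice a toolkit would detect which dictionary structures occur in each compound. The Tanimoto coefficient used for comparison is a common choice for binary fingerprints, but it is an assumption here, since the slide does not name a similarity measure.

```python
# Hypothetical structure dictionary: one coordinate axis per entry.
DICTIONARY = ["aromatic ring", "carbonyl", "hydroxyl", "amine", "halogen"]


def fingerprint(features, dictionary=DICTIONARY):
    """Return one on/off bit per dictionary entry."""
    return [1 if entry in features else 0 for entry in dictionary]


def tanimoto(fp_a, fp_b):
    """Bits set in both fingerprints divided by bits set in either."""
    both = sum(a & b for a, b in zip(fp_a, fp_b))
    either = sum(a | b for a, b in zip(fp_a, fp_b))
    return both / either if either else 0.0   # two empty fingerprints: 0.0 in this sketch


# Feature sets are assumed to come from an upstream substructure search.
phenol = fingerprint({"aromatic ring", "hydroxyl"})
aniline = fingerprint({"aromatic ring", "amine"})
acetone = fingerprint({"carbonyl"})

print(phenol, aniline, acetone)
print(tanimoto(phenol, aniline), tanimoto(phenol, acetone))
```

Each fingerprint is a point in a five-dimensional binary space, so the clustering methods that follow can operate on these vectors without knowing any chemistry.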

  41. Cluster Analysis and Chemical Informatics • Used for organizing datasets into chemical series, to build predictive models, or to select representative compounds • Clustering Methods • Jarvis-Patrick and variants • O(N²), single partition • Ward’s method • Hierarchical, regarded as best, but at least O(N²) • K-means • < O(N²), requires a set number of clusters, a little “messy” • Sphere-exclusion (Butina) • Fast, simple, similar to JP • Kohonen network • Clusters arranged in 2D grid, ideal for visualization

  42. Limitations of Ward’s method for large datasets (>1m) • Best algorithms have O(N²) time requirement (RNN) • Requires random access to fingerprints • hence substantial memory requirements (O(N)) • Problem of selection of best partition • can select desired number of clusters • Easily hit 4GB memory addressing limit on 32 bit machines • Approximately 2m compounds
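The link between the 4GB addressing limit and roughly 2m compounds can be checked with simple arithmetic. The sketch below computes the per-compound memory budget those two figures imply; it is only the implied budget, since the slide does not state the actual fingerprint size or how much of the address space is usable.

```python
ADDRESS_SPACE_BYTES = 2 ** 32   # 4 GB addressable with 32-bit pointers
COMPOUNDS = 2_000_000           # "approximately 2m compounds" from the slide

# Memory grows as O(N), so the limit divides evenly across compounds.
bytes_per_compound = ADDRESS_SPACE_BYTES / COMPOUNDS
bits_per_compound = bytes_per_compound * 8

print(f"{bytes_per_compound:.0f} bytes (about {bits_per_compound:.0f} bits) per compound")
```

The result is a little over 2 KB per compound, which would have to cover the fingerprint plus any per-compound bookkeeping held in memory.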

  43. Scaling up clustering methods • Parallelization • Clustering algorithms can be adapted for multiple processors • Some algorithms more appropriate than others for particular architectures • Ward’s has been parallelized for shared memory machines, but overhead considerable • New methods and algorithms • Divisive (“bisecting”) K-means method • Hierarchical Divisive • Approx. O(NlogN)

  44. Divisive K-means Clustering • New hierarchical divisive method • Hierarchy built from top down, instead of bottom up • Divide complete dataset into two clusters • Continue dividing until all items are singletons • Each binary division done using K-means method • Originally proposed for document clustering • “Bisecting K-means” • Steinbach, Karypis and Kumar (Univ. Minnesota) http://www-users.cs.umn.edu/~karypis/publications/Papers/PDF/doccluster.pdf • Found to be more effective than agglomerative methods • Forms more uniformly-sized clusters at given level
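The top-down procedure described above can be sketched in a few lines. This is an illustrative toy, not the BCI implementation: it assumes points are coordinate tuples compared with Euclidean distance, and it picks seeds deterministically (the first point and the point farthest from it), which is an assumption made so the example is reproducible.

```python
import math


def centroid(points):
    return tuple(sum(x) / len(points) for x in zip(*points))


def two_means(points, max_iter=50):
    """One binary division with K-means (K=2).

    Returns (left, right), or None if the points cannot be split.
    """
    seed_a = points[0]
    seed_b = max(points, key=lambda p: math.dist(p, seed_a))
    centers = [seed_a, seed_b]
    groups = None
    for _ in range(max_iter):
        new_groups = ([], [])
        for p in points:
            nearer = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            new_groups[nearer].append(p)
        if not new_groups[0] or not new_groups[1]:
            break                       # keep the last valid split, if any
        groups = new_groups
        new_centers = [centroid(groups[0]), centroid(groups[1])]
        if new_centers == centers:      # converged
            break
        centers = new_centers
    return groups


def bisecting_kmeans(points):
    """Build the full hierarchy: lists are clusters, tuples are single points."""
    if len(points) == 1:
        return points[0]
    split = two_means(points)
    if split is None:                   # e.g. all points identical
        return list(points)
    left, right = split
    return [bisecting_kmeans(left), bisecting_kmeans(right)]


def leaves(tree):
    """Flatten a hierarchy back into its points."""
    if isinstance(tree, tuple):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (50.0, 50.0)]
tree = bisecting_kmeans(points)
print(tree)
```

A partition with a chosen number of clusters is then obtained by cutting this hierarchy at some level rather than by rerunning the clustering.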

  45. BCI Divkmeans • Several options for detailed operation • Selection of next cluster for division • size, variance, diameter • affects selection of partitions from hierarchy, not shape of hierarchy • Options within each K-means division step • distance measure • choice of seeds • batch-mode or continuous update of centroids • termination criterion • Have developed parallel version for Linux clusters / grids in conjunction with BCI • For more information, see Barnard and Engels talks at: http://cisrg.shef.ac.uk/shef2004/conference.htm

  46. Comparative execution times: NCI subsets, 2.2 GHz Intel Celeron processor [Chart not reproduced. Times shown: 7h 27m, 3h 06m, 2h 25m, 44m]

  47. Divisive K-means: Conclusions • Much faster than Ward’s, speed comparable to K-means, suitable for very large datasets (millions) • Time requirements approximately O(N log N) • Current implementation can cluster 1m compounds in under a week on a low-power desktop PC • Cluster 1m compounds in a few hours with a 4-node parallel Linux cluster • Better balance of cluster sizes than Ward’s or K-means • Visual inspection of clusters suggests better assembly of compound series than other methods • Better clustering of actives together than previously-studied methods • Memory requirements minimal • Experiments using AVIDD cluster and TeraGrid forthcoming (50+ nodes)
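To give a feel for why O(N log N) matters at this scale, the sketch below compares the growth of N² and N log N for a few dataset sizes. It counts abstract operations only, assuming a base-2 logarithm and ignoring constant factors, so the ratios indicate how the gap widens with N and are not predictions of the measured run times quoted in the slides.

```python
import math


def growth_ratio(n):
    """How many times larger N^2 is than N * log2(N)."""
    return n ** 2 / (n * math.log2(n))


for n in (10_000, 100_000, 1_000_000):
    print(
        f"N={n:>9,}  N^2={n ** 2:.1e}  "
        f"N log2 N={n * math.log2(n):.1e}  ratio={growth_ratio(n):,.0f}"
    )
```

The ratio itself grows with N, which is why a method that is acceptable for tens of thousands of compounds can become impractical for millions.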

  48. Conclusions • Effective exploitation of large volumes and diverse sources of chemical information is a critical problem to solve, with a potentially huge impact on the drug discovery process • Most information needs of chemists and drug discovery scientists are conceptually straightforward, but complex to implement • All of the technology is now in place to implement many of these information-need “use cases”: the four level model using service-oriented architectures together with smart clients looks like a neat way of doing this • In conjunction with grid computing, rapid and effective organization and visualization of large chemical datasets is feasible in a web service environment • Some pieces are missing: • Chemical structure search of journals (wait for InChI) • Automated patent searching • Effective dataset organization • Effective interfaces, especially visualization of large numbers of 2D structures

  49. Divisive K-Means as a Web Service • The previous exercise was intended to show that Divisive K-Means is a classic example of a Grid application. • Needs to be parallelized • Should run on TeraGrid • How do you make this into a service? • We’ll go on a small tour before getting back to our problem.

  50. Wrapping Science Applications as Services • Science Grid services typically must wrap legacy applications written in C or Fortran. • You must handle such problems as • Specifying several input and output files • These may need to be staged in • Launching executables and monitoring their progress. • Specifying environment variables • Often these also have shell scripts to do some miscellaneous tasks. • How do you convert this to WSDL? • Or (equivalently) how do you automatically generate the XML job description for WS-GRAM?
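The wrapping tasks listed above (staging input files, setting environment variables, launching the executable, and collecting output) can be sketched as a single function that a service operation might call. Everything here is illustrative: `run_wrapped`, its parameters, and the `JOB_LABEL` variable are hypothetical names, and a one-line Python script stands in for the legacy C or Fortran code so the example can run anywhere. Monitoring is reduced to waiting for the process and reporting its exit code.

```python
import os
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_wrapped(command, input_files, output_name, env_vars=None, timeout=60):
    """Stage inputs into a scratch directory, run the command, return results."""
    workdir = Path(tempfile.mkdtemp(prefix="job_"))
    try:
        for src in input_files:                       # stage in
            shutil.copy(src, workdir / Path(src).name)
        env = dict(os.environ, **(env_vars or {}))    # job-specific environment
        proc = subprocess.run(
            command, cwd=workdir, env=env,
            capture_output=True, text=True, timeout=timeout,
        )
        output_path = workdir / output_name           # stage out
        output = output_path.read_text() if output_path.exists() else None
        return {
            "exit_code": proc.returncode,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            "output": output,
        }
    finally:
        shutil.rmtree(workdir, ignore_errors=True)    # clean up scratch space


# Stand-in for a legacy executable: reads input.txt, writes output.txt in
# upper case, and appends a setting taken from the environment.
FAKE_CODE = (
    "from pathlib import Path; import os; "
    "Path('output.txt').write_text("
    "Path('input.txt').read_text().upper() + os.environ['JOB_LABEL'])"
)

with tempfile.TemporaryDirectory() as scratch:
    source = Path(scratch) / "input.txt"
    source.write_text("ccO")
    result = run_wrapped(
        [sys.executable, "-c", FAKE_CODE],
        input_files=[source],
        output_name="output.txt",
        env_vars={"JOB_LABEL": "-demo"},
    )

print(result["exit_code"], result["output"])
```

The arguments of such a function (command, input files, output file, environment) are the same pieces of information a WSDL operation or a job description would need to carry, which is what makes automatic generation plausible.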
