Collaborative Cheminformatics and Cyberinfrastructure Workshop

Joint meeting of the Molecular Libraries Screening Centers Network (MLSCN) and the Exploratory Centers for Cheminformatics Research (ECCR): Talk II July 18 2006 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington IN 47401 gcf@indiana.edu http://www.infomall.org http://www.chembiogrid.org

Chemical Informatics and Cyberinfrastructure Collaboratory • Collaboration between School of Informatics (Cheminformatics, Bioinformatics, Computer Science), departments of Biology and Chemistry at Indiana University Bloomington and Indianapolis (IUPUI) • Thrusts are Education, use of Cyberinfrastructure for Cheminformatics and Computational Chemistry and Tool research • NSF has an Office of Cyberinfrastructure running (roughly) TeraGrid (100 TF distributed supercomputers) and eScience • eScience describes “modern Science as a team sport” with distributed Computers, Databases, Instruments, Sensors and People (>100 such projects worldwide) • eScience builds applications as Grids using large scale managed Web services

Training People for your Centers!Cheminformatics Education at IU • Linked to bioinformatics in an Indiana University’s School of Informatics • http://www.informatics.indiana.edu • School of Informatics degree programs • BS, MS, PhD • Programs offered at both the Indianapolis (IUPUI) and Bloomington (IUB) campuses • Bioinformatics MS and track on PhD • Chemical InformaticsMS and track on PhD • Informatics Undergraduates can choose a chemistry cognate • PhD in Informatics started in August 2005 and offers tracks in • bioinformatics; chemical informatics; health informatics; human-computer interaction design; social and organizational informatics; more to come!

Formal Cheminformatics Courses • I571 Chemical Information Technology (3 cr.) • Distance Ed section had 10 students in Fall 2005, from California to Connecticut • I572 Computational Chemistry and Molecular Modeling (3 cr.) • I573 Programming Techniques for Chemical and Life Science Informatics (3 cr.) • I553 Independent Study in Chemical Informatics (3 cr.) • Above courses required for the new Graduate Certificate Program in Chemical Informatics • I533 Seminar in Chemical Informatics • Spring 2006 Topic: Molecular Informatics, the Data Grid, and an Introduction to eScience • http://www.indiana.edu/~cheminfo/I533/533home.html • I647 Seminar in Chemical Informatics • Fall 2006 Topic: Bridging Bioinformatics and Chemical Informatics • http://www.indiana.edu/~cheminfo/I647/647home.html

Related Courses • L519 Bioinformatics: Theory and Application (3 cr.) (at IUPUI: CSCI 548) • L529 Bioinformatics in Molecular Biology and Genetics: Practical Applications (4 cr.) (not offered at IUPUI) • I619 Structural Bioinformatics (3 cr.) • I617 Informatics in Life Sciences and Chemistry (3 cr.) (for non-majors) • B649 Topics in Systems: Service Architectures and Science (3 cr.) • I590 Topics in Informatics: Scientific Applications of XML (IUPUI)

Other Educational Activities • Graduate Certificate Program in Chemical Informatics (4 courses by Distance Education) • Required courses: I571, I572, I573, I553 • Enrollees pay in-state graduate fees regardless of location • Special section of I571 will be taught as CIC CourseShare offering with Michigan, Fall 2006 • University of Michigan School of Pharmaceutical Engineering ChE531 Introduction to Chemoinformatics • Experiments with teleconferencing as a distance education tool (Raindance, Macromedia Breeze) • Mesa Analytics Cheminformatics Virtual Classroom • http://www.chemvc.com:8020/ • Workshop, July 30, 2006 at the Biennial Conference on Chemical Education

Some Grid Concepts I • Services are “just” (distributed) programs sending and receiving messages with well defined syntax • Interfaces (input-output) must be open; innards can be open source (allowing you to modify) or proprietary • Services can be any language from Fortran, Shell scripts, C, C#, C++, Java, Python, Perl – your choice!! • Web Services supported by all vendors (IBM, Microsoft …) • Service overhead will be just a few milliseconds (more now) which is < typical network transit time • Any program that is distributed can be a Web service • Any program taking execution time ≥ 20ms can be an efficient Web service

Some Grid Concepts II • Systems are built from contributions from many different groups – you do not need one “vendor” for all components as Web services allow interoperability between components • One reason DoD likes Grids (called Net-Centric computing) • Grids are distributed in services and data allowing anybody to store their data and to produce “their” view • Some think that University Library of future will curate/store data of their faculty • “2 level programming model”: Classic programming of services and services are composed using workflow consistent with industry standards (BPEL) • Grid of Grids: (System of Systems) Realistically Grid-like systems will be built using multiple technologies and “standards” – use wrapping to integrate Pipeline Pilot, CBIS, Chembank,ChemBench, ChemModLab, PowerMV etc. with OGSA(Open Grid Service Architecture from OGF) systems into a single Grid

Grid Capabilities for Science (Cheminformatics) • Open technologies for any large scale distributed system that is adopted by industry, many sciences and many countries (including UK, EU, USA, Asia) • Security, Reliability, Management and state standards • Many bioinformatics grids including BIRN, caBIG, MyGrid • Also computational chemistry and related (materials) grids • Service and messaging specifications • User interfaces via portals and portlets virtualizing to desktops, email, PDA’s etc. • ~20 TeraGrid Science Gateways including RENCI Bio portal • OGCE Portal technology effort led by Indiana • Uniform approach to access distributed (super)computers supporting single (large) jobs and spawning lots of related jobs • Data and meta-data architecture supporting real-time and archives as well as federation • Links to Semantic web and annotation • Grid (Web service) workflow with standards and several successful instantiations (such as Taverna and MyLead)

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg Taverna • Taverna is typical Grid workflow developed in UK for bioinformatics in MyGrid project • Not maybe better than well known tools like Pipeline Pilot but links to “all Grid services” • Taverna being robustified and extended by UK eScience program

Streaming Data Support Transformations Data Checking Hidden MarkovDatamining (JPL) Display (GIS) Grid Workflow Datamining in Earth Science NASA GPS • Work with Scripps Institute • Grid services controlled by workflow process real time data from ~70 GPS Sensors in Southern California Earthquake

Use a Portlet-based user portal to access and control services and workflow Grid Workflow Data Assimilation in Earth Science • Grid services triggered by abnormal events and controlled by workflow process real time data from radar and high resolution simulations for tornado forecasts

Next steps? • Define WSDL interfaces to enable global production of compatible Web services; refine CML • Ready to try “Prototype Production” • Develop more training material • Refine/go into production with key services including both tools, workflows and TeraGrid style simulations in capacity and capability modes • In-house algorithm work for new services in clustering, diversity analysis, QSAR methodologies CICC Prototype Web Services Basic cheminformatics Key Ideas Molecular weights Molecular formulae Tanimoto similarity 2D Structure diagrams Molecular descriptors 3D structures InChi generation/search CMLRSS • Add value to PubChem with additional distributed services and databases • Wrapping existing code in web services is not difficult • Provide “core” (CDK) services and exemplars of typical tools • Provide access to key databases via a web service interface • Provide access to major Compute Grids Application based services Compare (NIH) Toxicity predictions (ToxTree) Literature extraction (OSCAR3) Clustering (BCI Toolkit) Docking, filtering, ... (OpenEye)Varuna simulation

Web Service Locations Cambridge University • InChi generation / search • CMLRSS • OpenBabel Indiana University • Clustering • VOTables • OSCAR3 • Toxicity classification • Database services SDSCTypical TeraGrid Site InfoChem • SPRESI database NIH PubChem ….. Compare ….. Penn State University CDK based services • Fingerprints • Similarity calculations • 2D structure diagrams • Molecular descriptors

Workflows Using Chemical Literature Find similar documents Bulk download of Pubmed abstracts Find similar molecules All of PubMed “just” takes about a day to run through OSCAR3 on 2048 node Big Red PDBBind OSCAR3 Service OSCAR3 program PubChem Local DTP database Extract chemical structures SMILES NAME Pubmed ID CCC propane 1425356 CC ethane 3546453 ..... ............. ............. Searchable (structure/similarity) Grid database Clustering of documents linked to clustering of chemicals

Large Scale Calculations on “All of PubChem/Med” • TeraGrid: 100 Teraflop now to 1000 Teraflop next year • IU 2048 node Big Red supercomputer: 20 Teraflop today • The CDK can currently calculate approx. 107 Descriptors • Whole of PubChem (6M compounds) – 276 hours, 1 CPU • On IU's Big Red, 2048 CPU's, 20 TF: < 7 minutes • Even increasing the descriptor count by 5 times gives us < 35 minutes of compute time on Big Red • OSCAR3 takes a few seconds per abstract to text-mine all compounds in it • All of PubMed would take < a day on Big Red • Cleanup and Iteration would take some time • Can pre-calculate properties of smaller compounds using CDK (logP, BCUT, CPSA, …) and programs likes GAMESS • 100,000 compounds take < a week each on a single CPU and would be a practical computation over next year

Web Service to generate custom force fields Prototype CICC Project: Controlling the TGFb pathwayCollaboration between Baik & Zhang at IU Simulations in-house Molecules in Varuna QM Database AutoGeFF Can afford few ms overhead! TeraGridSupercomputers“Flocks” VARUNA Conceptual Understanding of TGFb Inhibition Inactive TGFb Active TGFb With inhibitor 1IAS • Questions: • - What molecular feature controls inhibitor binding? • - How do mutations impact binding? PubChem Experimentsin the Zhang Lab PDB

Simulating the Structure and Reactivity of Cu-Ab Complex • One of the speculation about the pathogenesis of Alzheimer’s Disease involves complexes of Cu-ions and b-Amyloid plaques. • We will test the hypothesis that a Cu-Ab complex can catalytically activate dioxygen to give hydrogen peroxide in a molecular modeling study: • Unfortunately, the structure of the Cu-Ab complex is currently not known.(a) We will carry out a combined QM/QM-MM/MD study to propose a structure(b) We will evaluate the plausibility of the catalysis by constructing a reaction profile(c) We will use PubChem in combination with our in-house Quantum Chemical Database to identify small molecules and molecule classes that may inhibit the catalysis. • This study serves as a Prototype application where • an unknown protein fragment structure is computed a priori and saved in an in-house database. • the diversity of small molecule targets are derived from clustering PubChem and other federated databases, including an in-house structural database

Other Chemical Projects Utilizing the Cyberinfrastructure • Mechanistic studies on • how the anticancer drug cisplatin interacts with DNA to kill cancer cells. • how xanthine oxidase catalyzes the oxidation of xanthine. • how the bacterial enzyme methane monooxygenase catalyzes methane oxidation. • electrocyclization of small organic molecules that are relevant for natural product synthesis. • stereoselective carbocyclizations that are catalyzed by Rhodium complexes. • how molecular probes for in vivo detection of Zinc, Mercury and Lead can be designed in a rational fashion. • Utilizing a molecular modeling database • By registering and saving the structures, charge distributions and molecular orbitals of computer simulations, we can conduct a new kind of similarity searches and recognize trends. • The molecular modeling database will allow for curating the structural information of other databases, such as PubChem, by providing more detailed simulated information.

MLSCN Post-HTS Biology Decision Support Percent Inhibition or IC50 data is retrieved from HTS Grids can link data analysis ( e.g image processing developed in existing Grids), traditional Chem-informatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis A Grid of Grids linking collections of services atPubChem ECCR centers MLSCN centers Workflows encoding plate & control well statistics, distribution analysis, etc Question: Was this screen successful? Workflows encoding distribution analysis of screening results Question: What should the active/inactive cutoffs be? Question: What can we learn about the target protein or cell line from this screen? Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding, with activity, literature search of active compounds, etc Compounds submitted to PubChem PROCESS CHEMINFORMATICS GRIDS

MLSCN Data - How services and workflows are used PubChem interfaces to workflows via SOAP Data is stored in Pubchem MLSCN submits HTS data to Pubchem and/or sends directly to workflow for real-time feedback Workflows perform different kinds of analysis on the MLSCN data, including SAR, clustering, literature searching, protein searching, toxicity testing, etc… End-user applications and interfaces utilize the information streams from the workflows for human interaction with the data and analysis

Example HTS workflow: finding cell-protein relationships A protein implicated in tumor growth with known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex) The screening data from a cellular HTS assay is similarity searched for compounds with similar 2D structures to the ligand. Docking results and activity patterns fed into R services for building of activity models and correlations LeastSquares Regression RandomForests NeuralNets Similar structures are filtered for drugability, are converted to 3D, and are automatically passed to the OpenEye FRED docking program for docking into the target protein. Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the JMOL applet. SImilar structures to the ligand can be browsed using client portlets.

Next steps in workflows • Expansion of HTS Workflows • Inclusion of ToxTree for toxicity flagging • Prediction of protein binding through PDB ligand similarity search • Inclusion of literature text mining (OSCAR) • Using PubChem data instead of tumor cell dataset • More workflows • Incorporating VARUNA, PubChem, PDBBind and other services • Workflows from Cambridge collaboration • Making workflows available in other systems • Taverna SCUFL <-> BPEL conversion • Use of workflows in other execution environments (starting with myLEAD supporting triggering)

Methods Development at the CICC • Tagging methods for web-based annotation exploiting del.icio.us and Connotea • Development of QSAR model interpretability and applicability methods • RNN-Profiles for exploration of chemical spaces • VisualiSAR - SAR through visual analysis • See http://www.daylight.com/meetings/mug99/Wild/Mug99.html • Visual Similarity Matrices for High Volume Datasets • See http://www.osl.iu.edu/~chemuell/new/bioinformatics.php • Fast, accurate clustering using parallel Divisive K-means • Mapping of Natural Language queries to use cases and workflows • Advanced data mining models for drug discovery information

MPI Parallel Divkmeans clustering of PubChem AVIDD Linux cluster, 5,273,852 structures (Pubchem compound collection, Nov 2005)

Exploring Chemical Spaces • The problem • Thousands of compounds • 10's to 100's of descriptors • Requirements • In my chemistry space what are the outliers? Which compounds are in the dense regions of space? • I don't want to / can't do descriptor selection • I don't want to squash things into a lower dimensional space • I want a simple way to view all this • Our approach (Guha, R. et al;, J. Chem. Inf. Model., 2006) : Use the R-NN profile technique

R-NN Profiles & Exploring Chemical Space • 4337 molecules • <MW> = 240, 5 descriptors • 2 known outliers • Molecules at the top are in sparse regions • Molecules at the bottom are in dense regions • Drill down into specific regions (GGgobi, VOPlot ...), annotate with activity, ... • Simple & intuitive, can be very fast

R-NN Profiles & HPC • R-NN profiles require a pairwise distance matrix • Can be sped up with approximate NN methods • R-NN profiles can be trivially parallelized • 1000 x 100 data matrix => 1000 x 1000 distance matrix -> 2.1 sec (P4 1GhZ laptop) • Evaluating R-NN profiles for 1000 compounds -> 43 sec • The current parameters allow a 100x speedup if we use 100 CPU's

Measuring Model Applicability • We have many ways to build multiple models • We perform validation • But can we use a stored model for a new molecule(s)? • Trivially, yes • But does it really make sense to do this? • Depends on similarities to the training set • Also depends on a global chemistry space • We can provide a component that attempts to answer this question for arbitrary model types Guha, R. et al., J. Chem. Inf. Model., 2005, 45, 65-73

Measuring Model Applicability Stored OLS/CNN/SVM/... Model • Our initial approach • Considers regression models • Considers similarity to the TSET • We and P&G are working on more robust methods that try and take into account a global chemistry space • Alternate methods can be easily included in workflows Predict property Training set residuals Choose cutoff New (unseen) molecule Obtain applicability Auxillary classification model Decide whether it makes sense to go with this prediction

More detailed Slides not used

Run Workflow Load Workflow Result Output URL Result Output Current Process

Preliminary Results • Shown is a fully equilibrated structure (highest population in a 10 ns MD @ room temp) of theCu-Ab structure. • The Cu-(peptide) bonds require special attention,as standard force-fields do not allow simulationsof this type. • We use a new tool AutoGeFF (to be implented as a Webservice) that recognizes bonds for which no force field parameters exist. AutoGeFF can generate appropriate force fields by automatically carrying out a QM calculation on a small model system and fitting new forces to computed vibrational frequencies. We use a guided Monte Carlo approach to iteratively derive these forces. Currently, AutoGeFF recognizes most of the transition metals. Future work will include organic moieties (such as drug candidates).

Example HTS workflow: organization & flagging A biological screen is selected. The activity results for all the compounds is extracted from the database (currently using DTP Tumor Cell Line database) OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs The compounds are clustered on chemical structure similarity, to group similar compounds together The compounds along with property and cluster information are converted to VOTABLES format and displayed in VOPLOT

Example of workflow output - LogP vs GI50 Plotting XLogP against GI50 can help identify highly active compounds with good logP profiles (1 - 4 range)

Example of workflow output - Cluster # vs GI50 Plotting Cluster against GI50 can help identify groups of highly active, structurally similar compounds, and also clusters which might yield good QSAR information

Example workflow output - docked complexes NSC_ID 685478Docking score -29.74 NSC_ID 719175Docking score -30.78 NSC_ID 725806Docking score -32.15 NSC_ID 685477 Docking score -35.51 Example output of most similar compounds to PDB 1Y4 complex ligands docked into the target protein using OpenEye FRED

Collaborative Cheminformatics and Cyberinfrastructure Workshop

Collaborative Cheminformatics and Cyberinfrastructure Workshop

Presentation Transcript

Mathew Palakal, Computer Science Susan Tennant, Informatics

July 18-20, 2006

June 8 2004 APAC Meeting Sydney Australia Geoffrey Fox Community Grids Lab, Pervasive Technologies Laboratories Indian

MapReduce TG11 BOF FutureGrid Team (Geoffrey Fox)

February 11 2013 Geoffrey Fox gcf@indiana.edu

Geoffrey Fox gcf@indiana.edu

July 17 2006 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories

Computer Science and Engineering Geoffrey M. Voelker Assistant Professor

PTLIU Laboratory for Community Grids Geoffrey Fox Computer Science, Informatics, Physics

CEN Network Technology Briefing – July 2006

Presented at Modeling Subcommittee July 18, 2006

Shrideep Pallickara and Geoffrey Fox

INTERACTIVE TECHNOLOGY LABORATORIES

Geoffrey Fox gcf@indiana infomall School of Informatics and Computing

Berlin 18 July 2006

Geoffrey Fox Indiana University gcf@indiana infomall

Western Pacific Geophysics Meeting (WPGM) Beijing Convention Center July 26 2006 Geoffrey Fox

Geoffrey Fox, Shrideep Pallickara and Xi Rao

July 17 2006 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories

INTERACTIVE TECHNOLOGY LABORATORIES

Jan 16 2004 Los Angeles Geoffrey Fox Community Grids Lab, Pervasive Technologies Laboratories

Geoffrey Fox Indiana University gcf@indiana infomall