caGrid 1.0 Reference Implementations: How we got on the Grid. Patrick McConnell (1), Rakesh Nagarajan (2), Tony Pan (3), Martin Morgan (4), Ted Liefeld (5), Kiran Keshav (6), Ram Chilukuri (7), Scott Oster (3). (1) Duke Comprehensive Cancer Center, (2) Washington University, (3) Ohio State University, (4) Fred Hutchinson Cancer Research Center, (5) Broad Institute, (6) Columbia University, (7) SemanticBits
Agenda • Background (Patrick, Scott) • caGrid overview, background, landscape, features • Reference implementations • caTRIP (Ram) • caB2B (Rakesh) • geWorkbench (Kiran) • Bioconductor (Martin) • GenePattern (Ted) • GeneConnect (Rakesh) • GridIMAGE (Tony) • Summary (Patrick) • Panel discussion (All)
Background: What is caBIG? • Common, widely distributed infrastructure that permits the cancer research community to focus on innovation • Shared, harmonized set of terminology, data elements, and data models that facilitate information exchange • Collection of interoperable applications developed to common standards • Cancer research data is available for mining and integration
Background: What is caGrid? • What is Grid? • Evolution of distributed computing to support sciences and engineering • Sharing of resources (computational, storage, data, etc.) • Secure Access (global authentication, local authorization, policies, trust, etc.) • Open Standards • Virtualization • What is caGrid? • Development project of Architecture Workspace • Helping define and implement Gold Compliance • Implementation of Grid technology • Leverages open standards, community open source projects • No requirements on implementation technology necessary for compliance • Specifications will be created defining requirements for interoperability • caGrid provides core infrastructure, and tooling to provide “a way” to achieve Gold compliance • Gold compliance creates the G in caBIG™ • Gold => Grid => connecting Silver Systems
Background: caGrid community involvement • caGrid itself provides no real “data” or “analysis” to caBIG™; it is the enabling infrastructure that allows the community to do so • Community members add value to the grid as applications, services, and processes (for example: shared workflows) • caGrid provides the necessary core services, APIs, and tooling • The real “value” of the grid comes from bringing this information to the “end user” • Community members develop end-user applications that consume the resources provided by the grid
Background: Reference implementation overview • Ref - er - ence im'ple - men - ta'tion (rěf'ər-əns ĭm'plə-mənt-tā'shən), noun: Applications that expose caGrid services and leverage caGrid core services before the full release of the caGrid toolkit. • Also “early adopters” • Objectives • Gather feedback on features • Identify bugs in the caGrid software • Provide grid-based examples for other caBIG development projects • Means • Select at least four reference implementations • Two analytical services and two data services
Background: Reference implementation landscape (diagram) • caGrid: Dorian, IndexService, GME, Workflow, GTS, caDSR, Portal • Columbia: geWorkbench • Fred Hutch: Bioconductor • Duke: caTRIP • OSU: GridIMAGE • Wash U: GeneConnect • Broad: GenePattern • NCICB: caArray • Grid-Enabled Client • Grid Services Infrastructure (Secure Communication, Service Invocation, Data Transfer)
caGrid 1.0 Reference Implementations caTRIP Ram Chilukuri, SemanticBits
caTRIP: Application background • Cancer Translational Research Informatics Platform • Motivating use case: outcomes analysis • Use data from existing patients to inform the treatment of another patient • Provides a mechanism to query across caGrid data services • Tumor Registry: diagnosis, treatment, recurrence, follow-up, etc. • CAE: pathology biomarkers, other annotations • caTissue CORE: tissue banking data • caTIES: pathology reports • caIntegrator: SNP data • Uses the MRN to join data across services • Provides a Java-based graphical interface to select services, filters, and query results
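The MRN join mentioned above can be pictured with a minimal in-memory sketch. It is not caTRIP's query engine: the `ServiceRecord` class, the `joinOnMrn` method, and the sample values are hypothetical, and a real federated query would fetch each side from a grid data service rather than from local lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MrnJoinSketch {

    /** Hypothetical flattened row returned by one data service. */
    static final class ServiceRecord {
        final String mrn;     // medical record number, used as the join key
        final String source;  // which service produced the row
        final String value;   // the payload of interest

        ServiceRecord(String mrn, String source, String value) {
            this.mrn = mrn;
            this.source = source;
            this.value = value;
        }
    }

    /**
     * Inner join on MRN: only patients present in both result sets are kept,
     * with one output row per matching pair of records.
     */
    static List<String> joinOnMrn(List<ServiceRecord> left, List<ServiceRecord> right) {
        // Index the right-hand side by MRN so each left row needs one map lookup.
        Map<String, List<ServiceRecord>> rightByMrn = new HashMap<>();
        for (ServiceRecord r : right) {
            rightByMrn.computeIfAbsent(r.mrn, k -> new ArrayList<>()).add(r);
        }
        List<String> joined = new ArrayList<>();
        for (ServiceRecord l : left) {
            List<ServiceRecord> matches =
                    rightByMrn.getOrDefault(l.mrn, Collections.<ServiceRecord>emptyList());
            for (ServiceRecord r : matches) {
                joined.add(l.mrn + ": " + l.source + "=" + l.value
                        + ", " + r.source + "=" + r.value);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        List<ServiceRecord> registry = Arrays.asList(
                new ServiceRecord("MRN-001", "TumorRegistry", "recurrence"),
                new ServiceRecord("MRN-002", "TumorRegistry", "no recurrence"));
        List<ServiceRecord> tissue = Arrays.asList(
                new ServiceRecord("MRN-001", "caTissueCORE", "frozen specimen"),
                new ServiceRecord("MRN-003", "caTissueCORE", "fixed specimen"));
        for (String row : joinOnMrn(registry, tissue)) {
            System.out.println(row);
        }
    }
}
```

Only MRN-001 appears in both lists, so the sample run prints a single joined row. Indexing one side by key keeps the join linear in the number of records, which matters once each side comes back from a separate service.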
caTRIP: caGrid integration • Grid data services • Provides common way to query information systems • Grid metadata • Queries are object oriented and tied to CDEs • Services advertise which CDEs they expose • Grid client • Query builder is metadata-driven, semantically rich • Grid security (coming soon in caTRIP) • Communication is secured at the transport layer • Authentication can be performed using institutional credentials • Authorization can be enforced at the data element level • Leverage core caGrid services/tools • Introduce, GME, caDSR Service, Index Service, Federated Query Processor, Authentication Service, Grid Grouper
caTRIP: caGrid integration (architecture diagram) • Components: GUI, Distributed Query Engine • Interactions: query, authenticate, discover, authorize • Domain Grid Services: CGEMS SNP, caTissue CORE, CAE, caTIES, TR • Core Grid Services: GridGrouper, IdP Service, IndexService • Duke systems: caTIES, TR, caTissue CORE, CAE, caIntegrator, Domain Controller, Illumina, MAW3, Tumor Registry
caTRIP: Process of silver compliance and grid enabling • The process of silver compliance • Most models already registered: caTissue CORE, CAE, caTIES, caIntegrator • Tumor Registry is a new model • Went through process of model creation, curation, and registration • CAE required some modifications • Extended model, added classes, curated, and registered • Grid data service APIs do not require any further registration • The process of grid enabling • Leveraged CQLProcessor out of the box, then extended • Implemented first pass at the caGrid 1.0 Federated Query Processor • Initially leveraged Introduce-built Java clients for invoking the services • Migrated to handling all domain data at the XML level • caGrid team provided this functionality in the FQP • We implemented it in the graphical client
caTRIP: Technical difficulties and wish-list • Technical difficulties • Working with infrastructure as it is being developed can be challenging • Metadata registration process is long and arduous • Worked around requirement for model to be registered • Developing the Federated Query Processor was complex/challenging • Building a metadata-driven system was complex/challenging • Overcame various bugs in the CQLProcessor • Political difficulties • IT staff very supportive in caTRIP leveraging/deploying caGrid • Many challenges around data sharing and deidentification • IRBs do not understand grid technology • Wish-list • More features in data services and the Federated Query Processor • Selecting attributes to return
caTRIP: Lessons learned • Silver compatibility • Register metadata early in the project • Take time to carefully craft class/attribute names and definitions • It will pay off in time and semantics, trust me • Building services • Use workarounds to build services in parallel with metadata registration • Understand the underlying mechanisms that Introduce employs • When things go wrong, you will need to manually modify services • Leverage metadata • Leveraging metadata can be technically challenging • Makes for very semantically rich applications • Garbage-in, garbage-out • The semantics are only as good as you make them • Data sharing and deidentification • Approach IRB early and draft proposals in terms that they will understand
caGrid 1.0 Reference Implementations caBench-to-Bedside (caB2B) Rakesh Nagarajan, Washington University
caB2B: Application background • caBench-to-Bedside (caB2B) is a client-server (3-tier) application developed in Java (Swing + J2EE) • Leverages caGrid services to facilitate translational research • Goals (Iteration 1): • Search for biospecimens and associated microarray expression profiling data across the caGrid • Annotate microarray data sets using caGrid services • Co-analyze microarray data sets and clinicopathology information using caGrid services • Graphically capture and orchestrate the above steps using the caGrid workflow engine • Examine the results using a rich set of visualization windows
caB2B: Architecture (diagram) • Tiers: Client, Server • Components: EJB Locator, Query Engine, CQL Generator, Query UI, Metadata Search Engine, Async Job Manager, Results Viewers, Experiment UI, Path Resolver, Data Cache, Metadata Loader, caGrid Service Locator, Metadata Search UI, Local database • Legend: SC: Static cache; EXP: Experiment data; MDR: Metadata repository; TMP: Temporary cache
caB2B • Metadata based search, forming CQL/DCQL(s). • Analysis of data obtained by querying grid services. • caGrid core services utilized • GME • IndexService
caB2B-caGrid Interaction (diagram) • Wash U: caTissueCore1, caTissueCore2, caFE, GeneConnect • NCICB: caArray • Grid Services Infrastructure (Secure Communication, Service Invocation, Data Transfer) • caB2B
caB2B: Technical difficulties and wish-list • Wish-list • Deploying multiple services on the same container. • Allow multiple target objects in DCQL. • Allow for selection of limited attributes of a target object in DCQL. • Documentation on how to create grid services for non-caCORE SDK generated systems.
caB2B: Lessons learned • Get technical advice from the caGrid team at an early stage of the grid integration process. • Using caCORE and caGrid infrastructure decreases the time needed to integrate disparate applications.
caGrid 1.0 Reference Implementations geWorkbench Kiran Keshav Columbia University Center for Computational Biology and Bioinformatics
geWorkbench: Application background • geWorkbench: Java-based integrative genomics platform. • Component-based architecture allows for complex applications to be added/removed as individual plugins. • caGrid is no exception • A caGrid component can invoke grid services, retrieve the results, and use other components of the framework (e.g., for visualization). • Function • Discover services registered at a given IndexService and allow service invocation.
geWorkbench caGrid integration • Registered analytical services. • Exposes two clustering algorithms, and one mutual information algorithm. • Exposed services: • Hierarchical Clustering • SOM Clustering • Aracne • Core services reused: • GME • IndexService • Clients • geWorkbench – caGrid Explorer • Introduce created clients • Benefits of Grid Integration • Resource intensive tasks can be run remotely. • Consolidation of effort - 3 groups have access to the same service, as opposed to each group creating their own.
geWorkbench caGrid integration geWorkbench caGrid Component Workflow • Analytical Services • HierarchicalClustering • SomClustering • Aracne geWorkbench – caGrid Type Converter • Core Services • Index Service
geWorkbench Process of silver compliance and grid enabling • Silver compliance • Models registered • MAGE • StatML • Microarray • These can be viewed at • http://cdebrowser.nci.nih.gov/CDEBrowser/ • APIs exposed • Hierarchical Clustering • public HierarchicalCluster execute(MicroarraySet microarraySet, HierarchicalClusteringParameter hierarchicalClusteringParameter); • public HierarchicalCluster execute(BioAssayData bioAssayData, HierarchicalClusteringParameter hierarchicalClusteringParameter); • Self Organized Maps Clustering • public SomCluster execute(MicroarraySet microarraySet, SomClusteringParameter somClusteringParameter); • public SomCluster execute(BioAssayData bioAssayData, SomClusteringParameter somClusteringParameter); • Aracne • public AdjacencyMatrix execute(MicroarraySet microarraySet, AracneParameter aracneParameter); • public AdjacencyMatrix execute(BioAssayData bioAssayData, AracneParameter aracneParameter); • Grid Enablement Process • Generated analytical service plumbing code with the Introduce Toolkit. • Added “business logic” to this. • Created caGrid clients with the Introduce toolkit, and wrapped these in a geWorkbench plug-in to connect to grid APIs. • Wrapped the Discovery API in a geWorkbench plug-in to discover services. • Created a series of converters to convert between caGrid and geWorkbench types.
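The "data object plus parameter object" shape of the exposed APIs, and the converter step that follows from it, can be sketched as below. This is a minimal illustration under stated assumptions: every class here is a placeholder standing in for the registered model types, the `metric` field, `WorkbenchDataset`, `toGridType`, and `localStub` are hypothetical, and a local stub replaces the remote call that an Introduce-generated client would make.

```java
public class AnalyticalServiceSketch {

    /** Placeholder for the grid-side data type named on the slide. */
    static final class MicroarraySet {
        final String[] markerNames;
        final double[][] values; // values[array][marker]

        MicroarraySet(String[] markerNames, double[][] values) {
            this.markerNames = markerNames;
            this.values = values;
        }
    }

    /** Placeholder parameter object; its single field is hypothetical. */
    static final class HierarchicalClusteringParameter {
        final String metric;

        HierarchicalClusteringParameter(String metric) {
            this.metric = metric;
        }
    }

    /** Placeholder result object; a real result would describe a dendrogram. */
    static final class HierarchicalCluster {
        final String summary;

        HierarchicalCluster(String summary) {
            this.summary = summary;
        }
    }

    /** Same shape as the slide's signature: data + parameters in, result out. */
    interface HierarchicalClusteringService {
        HierarchicalCluster execute(MicroarraySet microarraySet,
                HierarchicalClusteringParameter hierarchicalClusteringParameter);
    }

    /** Hypothetical client-side dataset, standing in for a geWorkbench type. */
    static final class WorkbenchDataset {
        final String[] markers;
        final double[][] matrix;

        WorkbenchDataset(String[] markers, double[][] matrix) {
            this.markers = markers;
            this.matrix = matrix;
        }
    }

    /** Converter from the client type to the grid type (deep copy, no aliasing). */
    static MicroarraySet toGridType(WorkbenchDataset dataset) {
        double[][] copy = new double[dataset.matrix.length][];
        for (int i = 0; i < dataset.matrix.length; i++) {
            copy[i] = dataset.matrix[i].clone();
        }
        return new MicroarraySet(dataset.markers.clone(), copy);
    }

    /** Local stub used instead of a remote service; it only reports its inputs. */
    static HierarchicalClusteringService localStub() {
        return (set, param) -> new HierarchicalCluster(
                set.values.length + " arrays x " + set.markerNames.length
                        + " markers, metric=" + param.metric);
    }

    public static void main(String[] args) {
        WorkbenchDataset dataset = new WorkbenchDataset(
                new String[] {"g1", "g2", "g3"},
                new double[][] {{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}});
        HierarchicalCluster result = localStub().execute(
                toGridType(dataset), new HierarchicalClusteringParameter("euclidean"));
        System.out.println(result.summary);
    }
}
```

Keeping the conversion in one method means the plug-in code never touches grid types directly, which is the role the slide assigns to the converters.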
geWorkbench Technical difficulties and wish-list • Technical Difficulties • Inconsistencies with some of the tools. • UML • Model inputs and outputs. • Create .xsd from model for testing. • Generate code with Introduce based on types defined in the xsd file. • The @documentation tags added to the attributes, methods, classes in the model are ignored by the Introduce Toolkit. We have to add things like javadoc comments manually. • Why model services in UML if it is not picked up by Introduce? • Why model services in UML if it is not picked up by Semantic Integration Workbench (SIW)? • Model Registration in caDSR • The need to map CDEs to every attribute for models already registered in the caDSR is a little confusing. Goes against model reuse. • The process between submitting a model for registration and having it registered is manual and lengthy (although Claire Wolfe has helped with this process considerably). • Large Scale Data Transfer • Resolution • “Manually” overcame UML and caDSR related issues. • Scott Oster had some useful suggestions for large scale data transfer. • Political Difficulties • None in house thus far. Could be more of an issue as we get more traffic. • Wish-list • Lower level services, typical of other grid systems like job control/monitoring, job queuing, packet interleaving. • The caGrid team has improved their documentation. Consolidation of technical guides, process, etc. would be useful. Since this is a WIP, I recommend wikis as opposed to white papers.
geWorkbench Lessons learned • Technical • Many of the tools have previously been used with data services. • Creating analytical services exercises a suite of infrastructure tooling. These tools should be used as much as possible. • Infrastructure/toolkit • Introduce was very helpful in creating service “plumbing” code. Seeing that it creates a suite of files to supplement the generated code, is anyone really building services without this? • Political • Reuse should be a central theme, not just applied to data types. That is, groups undergoing similar tasks should be able to consolidate efforts. • geWorkbench, Bioconductor, and GenePattern came across this when trying to register the MAGE model. • Benefits of using caGrid infrastructure • Brings together a wide variety of tools. • Allows end users to seek data in existing data repositories (as opposed to having to do a data dump). • Allows end users to reuse analytical algorithms (as opposed to having to create their own). • caBIG Compatibility • Again, the central theme is reuse. This should be applied at all levels of the development process. • An integral part of all client code is that you have to write converters to/from grid types.
caGrid 1.0 Reference Implementations Bioconductor Martin Morgan, Fred Hutchinson Cancer Research Center, Seattle, WA email@example.com bioconductor.org
Bioconductor: Application background • Open source statistical software • >150 user-contributed packages • R programming language • High-throughput gene expression analysis • Expression array pre-processing, linear models, clustering and machine learning, expression pathways • Flexible ad hoc analyses • A very different project • Interpreted & interactive language • Weakly typed • Object-oriented programming concepts • Object / method separation • Multiple inheritance • C implementation; not thread-safe
Bioconductor Approaching caGrid integration • Motivation • Access: web service • Ability: computing resources • Exchange: data sharing • Implementation: Analytic caGrid services • Microarray pre-processing – caAffy • DNA copy number variation – caDNAcopy • Mass spec. peak finding – caPROcess • Three challenges • Speaking the language • Serving the service • Exposing appropriate services
Bioconductor Challenge I: Speaking the language • The R language • Weakly typed • Implemented in C • Solution: existing and new software infrastructure • TypeInfo: apply strong typing to R functions and methods • SJava: native R / Java data and method interface • RWebServices: augment and integrate SJava and web services
Bioconductor Challenge II: Serving the service • Technical challenges • Non-threadable R • Specialized computational needs • Multiple users • Solution: Java implementation layer • Messaging protocol (activeMQ) • One or many persistent R ‘workers’ • Specialized tasks
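One way to picture a persistent worker in front of a non-thread-safe engine is the sketch below. It is an illustration only, not the RWebServices implementation: an in-process single-thread executor stands in for the ActiveMQ messaging layer, `UnsafeEngine` stands in for an R session, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SingleWorkerSketch {

    /** Stand-in for a non-thread-safe engine such as an embedded R session. */
    static final class UnsafeEngine {
        int callsServed = 0; // unsynchronized state: only safe on one thread

        String evaluate(String expression) {
            callsServed++;
            return "result#" + callsServed + " for " + expression;
        }
    }

    /** Owns one engine and one thread; requests are queued and run one at a time. */
    static final class Worker implements AutoCloseable {
        private final UnsafeEngine engine = new UnsafeEngine();
        private final ExecutorService queue = Executors.newSingleThreadExecutor();

        Future<String> submit(final String expression) {
            Callable<String> job = () -> engine.evaluate(expression);
            return queue.submit(job);
        }

        /** Reads the counter on the worker thread so the engine is never shared. */
        int callsServed() throws InterruptedException, ExecutionException {
            Callable<Integer> read = () -> engine.callsServed;
            return queue.submit(read).get();
        }

        @Override
        public void close() {
            queue.shutdown();
        }
    }

    /** Submits jobs from several client threads; returns how many the engine served. */
    static int runDemo(int clients, final int requestsPerClient) {
        try (Worker worker = new Worker()) {
            final List<Future<String>> pending =
                    Collections.synchronizedList(new ArrayList<Future<String>>());
            List<Thread> threads = new ArrayList<>();
            for (int c = 0; c < clients; c++) {
                final int clientId = c;
                Thread t = new Thread(() -> {
                    for (int r = 0; r < requestsPerClient; r++) {
                        pending.add(worker.submit("client" + clientId + "-job" + r));
                    }
                });
                threads.add(t);
                t.start();
            }
            for (Thread t : threads) {
                t.join(); // every client has finished submitting
            }
            for (Future<String> f : pending) {
                f.get(); // every queued job has completed
            }
            return worker.callsServed();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("jobs served: " + runDemo(4, 25));
    }
}
```

Because only the worker thread ever touches the engine, no job is lost or interleaved even though the clients submit concurrently. Scaling to "one or many" workers would mean several such worker instances behind a shared queue, which is where a message broker earns its place.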
Bioconductor Challenge III: Exposing appropriate services • From research to established protocol • Transition from exploratory research methods to established analytic protocols • Standardizing data and services for interoperability • Semantic types • Method signatures • Solutions • Identify appropriate workflows & granularity of exposure • Map (portions of) original R objects to registered types, e.g., ExpressionSet to MAGE-OM • Standardize method signature to data + parameter objects • Several open issues: NA values; composite results
Bioconductor Lessons learned and future directions • A use case for local grid service deployment? • Benefits of grid-enabled packages • Control of internal resources, sensitive data • Exploiting caGrid and Globus facilities • Implementing available ‘big data’ transfer • Implementing stateful services • Service replication
caGrid 1.0 Reference Implementations GenePattern Ted Liefeld, The Broad Institute of MIT and Harvard
GenePattern: Application background • An analysis workflow tool designed to support multidisciplinary genomic research programs and to encourage the rapid integration and dissemination of new analytical techniques • Provides a means to • Integrate and share computational tools in any programming language • Generate workflows to capture methodologies • Easily add new analysis and visualization tools • Support reproducible in silico research in genomics, proteomics, SNP analysis and other domains • GenePattern is used by • Lab technicians – automated processing of raw data files • Biologists – perform sophisticated analyses and data visualization • Bioinformaticians/Comp. Biologists – develop and share new methodologies and analyses
GenePattern: caGrid integration • GenePattern integrates with caGrid in two ways • As a client • As a service provider • Three services available and published (IndexService and GME) • PreprocessDataset • Consensus Clustering • Comparative Marker Selection • Grid integration permits us • As a client – to provide access to additional analyses and data sources from within GenePattern • As a service provider – to make GenePattern’s ~70 analysis services available to people who already have caGrid-enabled clients
GenePattern: caGrid integration (diagram) • Clients: caGrid clients; GenePattern clients (web browser client, graphical client) • caGrid services, reached over SOAP • caGrid proxy • GenePattern Engine: Analysis Task Manager, caGrid client, PPD algorithm, CMS algorithm, Consensus Clustering algorithm • Protocols shown: HTTP, SOAP
GenePattern: Process of silver compliance and grid enabling • Achieving Silver compliance • We registered a model that covered each service’s input data types, input parameter sets, and output data types • Input data types were reused from other projects (STAT-ML, MAGE) • Parameters (and some outputs) are unique to each service • GenePattern’s pre-existing SOAP APIs are based on a loosely typed system • New, strongly typed APIs were created for each of these services using the modeled data types • Grid enablement • Introduce was used to connect our APIs to our server • A new ‘grid proxy’ was written to map between GenePattern’s native API and the caGrid published API
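The mapping role of such a proxy can be sketched as a pair of conversions between a typed parameter object and a string-keyed map. This is an assumption-laden illustration, not GenePattern's code: the `PreprocessParameters` class, its two fields, and the keys `floor` and `log.transform` are hypothetical, and the loosely typed side is modeled as a plain `Map<String, String>`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GridProxySketch {

    /** Hypothetical strongly typed parameter set for one published service. */
    static final class PreprocessParameters {
        final double floor;
        final boolean logTransform;

        PreprocessParameters(double floor, boolean logTransform) {
            this.floor = floor;
            this.logTransform = logTransform;
        }
    }

    /** Typed to loose: what a proxy might hand to a string-keyed native API. */
    static Map<String, String> toNative(PreprocessParameters p) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("floor", Double.toString(p.floor));
        raw.put("log.transform", Boolean.toString(p.logTransform));
        return raw;
    }

    /** Loose to typed: missing or malformed values are rejected at the boundary. */
    static PreprocessParameters fromNative(Map<String, String> raw) {
        String floor = raw.get("floor");
        String log = raw.get("log.transform");
        if (floor == null || log == null) {
            throw new IllegalArgumentException("missing parameter");
        }
        if (!log.equals("true") && !log.equals("false")) {
            throw new IllegalArgumentException("log.transform must be true or false");
        }
        // Double.parseDouble throws NumberFormatException, a subtype of
        // IllegalArgumentException, when the text is not a number.
        return new PreprocessParameters(Double.parseDouble(floor), Boolean.parseBoolean(log));
    }

    public static void main(String[] args) {
        Map<String, String> raw = toNative(new PreprocessParameters(20.0, true));
        System.out.println(raw);
        PreprocessParameters back = fromNative(raw);
        System.out.println(back.floor + " " + back.logTransform);
    }
}
```

Validating in `fromNative` keeps type errors at the proxy boundary, so the strongly typed side never sees an unparsed string.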