caBIG Data Structures CS584 Lecture on 4/6/2007 Patrick McConnell Duke Comprehensive Cancer Center patrick.email@example.com
Agenda
• caBIG background (5 min, 8 slides)
  • Goals, program structure, organizations
• caTRIP background (5 min, 6 slides)
  • Background, use cases, architecture
• caBIG compatibility (30 min, 21 slides + demonstration)
  • Interoperability, compatibility, syntactics, and semantics
• Building caBIG compatible systems (10 min, 7 slides)
  • Interoperability, compatibility, syntactics, and semantics
• caGrid (10 min, 8 slides)
  • Background, service creation, metadata
• caTRIP demonstration (10 min, 2 slides + demo)
  • Demonstration
• Discussion/questions (5 min + throughout)
caBIG Background Goals, program structure, organizations
caBIG background Biomedical information tsunami
• Overwhelming volume of data
• Multitude of sources
caBIG background Informatics tower of Babel
• Each cancer research community speaks its own scientific “dialect”
• Integration is critical to achieve the promise of molecular medicine
caBIG background Goals and principles
• Under the National Cancer Institute Center for Bioinformatics (NCICB), 50 Cancer Centers are working through the cancer Biomedical Informatics Grid (caBIG™) towards a common goal of integrated data, tools, and methodologies to accelerate cancer research
• The goal of caBIG™ is to create a virtual web of interconnected data, individuals, and organizations which will redefine how:
  • research is conducted
  • care is provided
  • patients/participants interact with the biomedical research enterprise
• The principles driving caBIG™ are:
  • Open Source
  • Open Access
  • Open Development
  • Federated Model
caBIG background Workspaces
• DOMAIN WORKSPACE 1, Clinical Trial Management Systems: addresses the need for consistent, open, and comprehensive tools for clinical trials management.
• DOMAIN WORKSPACE 2, Integrative Cancer Research: provides tools and systems to enable integration and sharing of information.
• DOMAIN WORKSPACE 3, Tissue Banks & Pathology Tools: provides for the integration, development, and implementation of tissue and pathology tools.
• DOMAIN WORKSPACE 4, Imaging: provides for the sharing and analysis of in vivo imaging data.
• CROSS CUTTING WORKSPACE 1, Vocabularies & Common Data Elements: responsible for evaluating, developing, and integrating systems for vocabulary and ontology content, standards, and software systems for content delivery.
• CROSS CUTTING WORKSPACE 2, Architecture: developing architectural standards and architecture necessary for other workspaces.
caBIG background Communities
• Ohio State University-Arthur G. James/Richard Solove; Oregon Health and Science University; Roswell Park Cancer Institute; St Jude Children's Research Hospital; Thomas Jefferson University-Kimmel; Translational Genomics Research Institute; Tulane University School of Medicine; University of Alabama at Birmingham; University of Arizona; University of California Irvine-Chao Family; University of California, San Francisco; University of California-Davis; University of Chicago; University of Colorado; University of Hawaii; University of Iowa-Holden; University of Michigan; University of Minnesota; University of Nebraska; University of North Carolina-Lineberger; University of Pennsylvania-Abramson; University of Pittsburgh; University of South Florida-H. Lee Moffitt; University of Southern California-Norris; University of Vermont; University of Wisconsin; Vanderbilt University-Ingram; Velos; Virginia Commonwealth University-Massey; Virginia Tech; Wake Forest University; Washington University-Siteman; Wistar; Yale University
• Northwestern University-Robert H. Lurie; 9Star Research; Albert Einstein; Ardais; Argonne National Laboratory; Burnham Institute; California Institute of Technology-JPL; City of Hope; Clinical Trial Information Service (CTIS); Cold Spring Harbor; Columbia University-Herbert Irving; Consumer Advocates in Research and Related Activities (CARRA); Dartmouth-Norris Cotton; Data Works Development; Department of Veterans Affairs; Drexel University; Duke University; EMMES Corporation; First Genetic Trust; Food and Drug Administration; Fox Chase; Fred Hutchinson; GE Global Research Center; Georgetown University-Lombardi; IBM; Indiana University; Internet 2; Jackson Laboratory; Johns Hopkins-Sidney Kimmel; Lawrence Berkeley National Laboratory; Massachusetts Institute of Technology; Mayo Clinic; Memorial Sloan Kettering; Meyer L. Prentis-Karmanos; New York University
caBIG background Duke’s role in caBIG
• Pankaj Agarwal; Bob Annechiarico; Bill Banks; Vijaya Chadaram; Jamie Cuticchia; Raj Dash; Mohammad Farid; Seth Fehrs; Patrick McConnell; Salvatore Mungal; Mark Peedin
• CALGB; CCR; Coalition of Cooperative Groups; Dana Farber; Georgetown; Mayo; Oregon Health Sciences University; SemanticBits LLC; University of Pennsylvania; Wake Forest; Yale
• Integrative Cancer Research Workspace participant; RProteomics developer; caTRIP developer; Architecture Workspace participant; caGrid developer; caGrid scientific liaison; Guide to Mentors; Vocabularies and Common Data Elements Workspace participant; Guide to Mentors; Clinical Trials Management Systems Workspace participant; C3PR developer; CTMS Interoperability architect; C3D developer; Tissue Banking and Pathology Tools Workspace participant; caTissue adopter; Strategic Planning Workspace participant
The Cancer Translational Research Informatics Platform (caTRIP) Background, use cases, architecture
caTRIP Who is involved?
• Duke Bioinformatics: Jamie Cuticchia (PI); Patrick McConnell (lead architect)
• Duke Information Systems: Bob Annechiarico (PM); Wilma Stanley (developer); Mark Peedin (developer); Mohamad Farid (DBA); Jeff Allred (IT manager)
• Duke Pathology: Raj Dash (domain expert); Chris Hubbard (developer)
• Duke Oncology: Kelley Marcom (domain expert); Gretchen Kimmick (domain expert); Kimberly Blackwell (domain expert); Lee Wilke (domain expert)
• Duke CALGB: Kimberly Johnson (DataMart liaison)
• SemanticBits: Ram Chilukuri (lead developer); Srini Akkala (developer); Sanjeev Agarwal (developer)
• 5 AM Solutions: Bill Mason (developer)
• NCI: Julie Klemm (ICR WS lead); Carl Shaefer (NCI rep); Subha Madhavan (caIntegrator PM)
• BAH: Curtis Lockshin; Mehul Shah (tech support)
• Legend: Managers and Architects; Software Developers; Database Developers and IT; NCI/BAH; Domain Experts
caTRIP What is translational research?
• Bench-to-Bedside
• Wikipedia (the source of all knowledge): “Translational medicine is a branch of medical research that attempts to more directly connect basic research to patient care.”
  • Basic research occurs in the lab
  • Patient care occurs in the clinic
• Translational research broadened: “Translational medicine can also have a much broader definition, referring to the development and application of new technologies in a patient driven environment - where the emphasis is on early patient testing and evaluation. … facilitate the interaction between basic research and clinical medicine, particularly in clinical trials.”
caTRIP Initial focus
• Our initial focus will be on connecting existing data systems, including basic science data, to enhance patient care
• Initial problem scenario: outcomes analysis
  • Use data from existing patients to inform the treatment of another patient
  • Leverage clinical, pathology, tissue, and basic science data
• Scenario: Patient A enters the clinic. What treatments were applied with success on other patients with similar characteristics (race, sex, symptoms, pathology results, adverse events, biomarkers)?
caTRIP Broadened focus: scientific use cases
• Find available tumor tissue
  • What are all the tissue specimens from HER2/neu positive patients that have a primary tumor in the breast and are BRCA1 positive?
• Find factors of survival
  • What are all the ER positive patients that have survived breast cancer after radiation treatment?
• Find patients for trials
  • What are all the patients that are triple negative (ER, PR, and HER2/neu negative)?
• Determine the distribution of disease factors over time
  • Does a change in pathology biomarkers over time contribute to recurrence or death?
• Determine correlation of factors pre and post surgery
  • Does a change in ER or PR status before and after surgery correlate with other factors?
• Find pathology reports of interest
  • Show me all of the pathology reports for HER2/neu positive patients with a lobular carcinoma.
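As an illustration, the “find patients for trials” question can be read as a filter over patient records. The sketch below is only that: `PatientRecord` and `TrialCandidateFinder` are hypothetical names, not caTRIP classes, and in practice the biomarker data would sit behind separate services rather than in one list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical record type for illustration only; not part of caTRIP.
class PatientRecord {
    final String id;
    final boolean erPositive;
    final boolean prPositive;
    final boolean her2Positive;

    PatientRecord(String id, boolean erPositive, boolean prPositive, boolean her2Positive) {
        this.id = id;
        this.erPositive = erPositive;
        this.prPositive = prPositive;
        this.her2Positive = her2Positive;
    }

    // Triple negative: ER, PR, and HER2/neu are all negative.
    boolean isTripleNegative() {
        return !erPositive && !prPositive && !her2Positive;
    }
}

class TrialCandidateFinder {
    // Returns the patients whose three biomarkers are all negative.
    static List<PatientRecord> tripleNegative(List<PatientRecord> patients) {
        List<PatientRecord> matches = new ArrayList<PatientRecord>();
        for (PatientRecord p : patients) {
            if (p.isTripleNegative()) {
                matches.add(p);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample records.
        List<PatientRecord> patients = Arrays.asList(
            new PatientRecord("P1", false, false, false),
            new PatientRecord("P2", true, false, false),
            new PatientRecord("P3", false, false, true));
        for (PatientRecord p : tripleNegative(patients)) {
            System.out.println(p.id);
        }
    }
}
```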
caTRIP Connecting disparate data systems
[Diagram: caTRIP connects the following systems; the label MRN also appears]
• CAE: Pathology Biomarkers
• Tumor Registry: Diagnosis, Treatment, Recurrence, Follow-up
• caTissue CORE: Tissue Bank
• caIntegrator: SNP Data
• caTIES: Pathology Reports
caTRIP Architecture overview
[Architecture diagram labels]
• GUI; Distributed Query Engine
• Actions: query; authenticate; discover; authorize
• Domain Grid Services: CGEMS SNP; caTissue CORE; CAE; caTIES; TR
• Core Grid Services: GridGrouper; IdPService; IndexService
• Duke: caTIES; TR; caTissue CORE; CAE; caIntegrator; Domain Controller; Illumina; MAW3; Tumor Registry
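One way to picture what a distributed query engine does is to query each domain service separately and then combine the partial results on a shared key. The sketch below makes several assumptions and is not caTRIP's implementation: `DomainService`, `InMemoryService`, and `DistributedQuerySketch` are hypothetical stand-ins, and using the MRN as the join key is inferred from the MRN label in the data-systems diagram.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for one domain grid service.
interface DomainService {
    // Returns the MRNs of patients matching a service-specific criterion.
    Set<String> findMrns(String criterion);
}

// In-memory fake used only to make the sketch runnable.
class InMemoryService implements DomainService {
    private final Map<String, Set<String>> mrnsByCriterion = new HashMap<String, Set<String>>();

    InMemoryService with(String criterion, String... mrns) {
        mrnsByCriterion.put(criterion, new HashSet<String>(Arrays.asList(mrns)));
        return this;
    }

    public Set<String> findMrns(String criterion) {
        Set<String> found = mrnsByCriterion.get(criterion);
        return found == null ? new HashSet<String>() : new HashSet<String>(found);
    }
}

class DistributedQuerySketch {
    // Queries both services, then keeps only MRNs present in both results.
    static Set<String> patientsMatchingAll(DomainService first, String firstCriterion,
                                           DomainService second, String secondCriterion) {
        Set<String> result = new TreeSet<String>(first.findMrns(firstCriterion));
        result.retainAll(second.findMrns(secondCriterion));
        return result;
    }

    public static void main(String[] args) {
        // Made-up identifiers and criteria.
        DomainService pathology = new InMemoryService()
            .with("HER2/neu positive", "MRN-1", "MRN-2");
        DomainService tissueBank = new InMemoryService()
            .with("breast primary tumor", "MRN-2", "MRN-3");
        System.out.println(patientsMatchingAll(
            pathology, "HER2/neu positive", tissueBank, "breast primary tumor"));
    }
}
```

A real engine would also have to handle authentication, discovery, and partial failures, which this sketch leaves out.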
caBIG Compatibility Interoperability, compatibility, syntactics, and semantics
caBIG compatibility Interoperability defined
• Interoperability: the ability of a system to access and use the parts or equipment of another system
  • Syntactic interoperability
  • Semantic interoperability
• Courtesy: Charlie Mead
caBIG compatibility How does this apply to caBIG?
• Connect scientists and practitioners through a shareable and interoperable infrastructure
• Develop standard rules and a common language to more easily share information (compatibility guidelines)
• Build or adapt tools for collecting, analyzing, integrating, and disseminating information associated with cancer research and care
“The cancer community is united in its mission to eliminate suffering and death due to cancer. It is now connected by caBIG™.”
caBIG compatibility What is compatibility in caBIG?
The four areas of the caBIG compatibility guidelines:
• Information Models - Individual types of data are rarely collected or presented in isolation. Rather, they are assembled into a contextual environment that includes closely and more distantly associated data and information. These associations and relationships can be presented in the form of an information model.
• Common Data Elements (CDEs) - Data that is collected on a given study or trial must be defined and described such that remote users of that data can understand what it means. These metadata descriptions are referred to as data elements.
• Vocabularies and Ontologies - Biomedical information includes a substantial body of specialized concepts that are represented by terms. Agreement upon the basic concepts, terms, and definitions that are inherent in all biomedical information is essential for achieving semantic interoperability.
• Programming and Messaging Interfaces - Computer programs and the people who write them are able to access resources from other programs through programming and messaging interfaces. Each of these interfaces responds to a particular syntax for its communications. Agreement upon standards for these interfaces is necessary to overcome barriers to syntactic interoperability.
caBIG compatibility Levels of compatibility
The four levels of the caBIG™ compatibility guidelines:
• Legacy - Implies no interoperability with an external system or resource. A system that was designed without awareness of or prior to the availability of these compatibility guidelines, and which does not meet any of the requirements for interoperability.
• Bronze - Classifies the minimum requirements that must be met to achieve a basic degree of interoperability.
• Silver - A rigorous set of requirements that, when met, significantly reduce the barrier to use of a resource by a remote party who was not involved in the development of that resource.
• Gold - Currently being defined by caBIG. Is expected to provide for a formalized grid architecture and data standards that will enable standardized advertising, discovery, and use of all federated caBIG resources.
caBIG compatibility caBIG compatibility guidelines Syntactic Semantic Semantic & Syntactic
caBIG compatibility Syntactic interoperability
• The solution for syntactic interoperability in caBIG at the silver level of compatibility is for all systems to provide an Object Oriented Application Programmer Interface (API).
• Object Oriented Interfaces can be implemented in many programming languages.
• This interface can be connected to the caGrid so that the local data repository is globally accessible in a language independent way.
• The interface is described by an information model, which acts as the junction between the syntactic components and the semantic components.
[UML class] Gene: + name: String; + hugoGeneSymbol: String; + sequence: String
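To make the object-oriented API idea concrete, the Gene class from the slide could be rendered as a plain Java bean along the following lines. This is a minimal sketch: the three attribute names come from the slide, while the getter/setter style, the `GeneApiDemo` class, and the sample values are assumptions for illustration.

```java
// Plain Java bean mirroring the Gene class shown on the slide.
class Gene {
    private String name;
    private String hugoGeneSymbol;
    private String sequence;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public String getHugoGeneSymbol() { return hugoGeneSymbol; }
    public void setHugoGeneSymbol(String hugoGeneSymbol) { this.hugoGeneSymbol = hugoGeneSymbol; }

    public String getSequence() { return sequence; }
    public void setSequence(String sequence) { this.sequence = sequence; }
}

class GeneApiDemo {
    public static void main(String[] args) {
        Gene gene = new Gene();
        gene.setName("example gene");     // placeholder value
        gene.setHugoGeneSymbol("BRCA1");
        gene.setSequence("ACGT");         // placeholder, not a real sequence
        System.out.println(gene.getHugoGeneSymbol() + " (" + gene.getName() + ")");
    }
}
```

Because the class carries only typed attributes and accessors, the same shape can be described by an information model and exposed in other languages or over a messaging protocol.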
caBIG compatibility Programming and messaging interfaces
• Types of APIs
  • Client APIs in a programming language
  • Messaging APIs via a messaging protocol
• Types of systems
  • Data services provide access to an information model
    • Query method
    • Associations are “traversable”
  • Analytical services provide methods to manipulate data
  • Hybrid services provide methods to manipulate information models
  • Analytical tools consume silver-compatible data but don’t produce it
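The two properties listed for data services, a query method and traversable associations, can be sketched as follows. All names here are hypothetical (`DataService`, `InMemoryParticipantService`, and the simplified `Participant` and `Diagnosis` classes); this is not the caGrid or caCORE API, only an illustration of the pattern.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

class Diagnosis {
    private final String description;
    Diagnosis(String description) { this.description = description; }
    String getDescription() { return description; }
}

class Participant {
    private final String id;
    private final List<Diagnosis> diagnoses = new ArrayList<Diagnosis>();

    Participant(String id) { this.id = id; }
    String getId() { return id; }

    Participant addDiagnosis(Diagnosis diagnosis) {
        diagnoses.add(diagnosis);
        return this;
    }

    // Association getter: this is what makes Participant -> Diagnosis traversable.
    List<Diagnosis> getDiagnoses() { return Collections.unmodifiableList(diagnoses); }
}

// Hypothetical data-service contract: a single query method over the model.
interface DataService<T> {
    List<T> query(Predicate<T> criteria);
}

// In-memory fake that stands in for a database-backed service.
class InMemoryParticipantService implements DataService<Participant> {
    private final List<Participant> store;

    InMemoryParticipantService(List<Participant> store) { this.store = store; }

    public List<Participant> query(Predicate<Participant> criteria) {
        List<Participant> matches = new ArrayList<Participant>();
        for (Participant p : store) {
            if (criteria.test(p)) {
                matches.add(p);
            }
        }
        return matches;
    }
}

class DataServiceDemo {
    public static void main(String[] args) {
        List<Participant> store = new ArrayList<Participant>();
        store.add(new Participant("A").addDiagnosis(new Diagnosis("lobular carcinoma")));
        store.add(new Participant("B"));
        DataService<Participant> service = new InMemoryParticipantService(store);

        // Query first, then traverse the association on each result.
        for (Participant p : service.query(x -> !x.getDiagnoses().isEmpty())) {
            for (Diagnosis d : p.getDiagnoses()) {
                System.out.println(p.getId() + ": " + d.getDescription());
            }
        }
    }
}
```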
caBIG compatibility caTRIP API Hyperlinks to caTRIP API docs
caBIG compatibility caTRIP grid service WSDL Hyperlinks to caTRIP API WSDL
caBIG compatibility caTRIP grid service WSDL Hyperlinks to caTRIP FQP UML
caBIG compatibility Semantic interoperability
• The solution for semantic interoperability lies in object oriented UML design of the service, an unambiguous description of elements within the system, and storage of the description in a publicly accessible repository (metadata).
  • UML model
  • Use of publicly accessible terminologies/vocabularies/ontologies (EVS-NCI Thesaurus)
  • Use of publicly accessible metadata repository (caDSR)
Enterprise Vocabulary Services caBIG compatibility Metadata stored in caDSR • Storage of Metadata • caDSR = cancer Data Standards Repository • Common Data Elements = CDEs • Enable end-users to access information about data and services without having to access human developers • = Fusion of UML models + Concepts/Definitions
caBIG compatibility caTRIP CDEs Hyperlinks to caTRIP CDEs
Enterprise Vocabulary Services
caBIG compatibility Publicly accessible terminologies
• Controlled vocabulary resources for the cancer research community
• Vocabulary Products and Services
  • NCI Thesaurus
  • NCI Metathesaurus
  • External Vocabularies
• NCI Thesaurus - controlled vocabulary source for metadata
  • Has excellent coverage of cancer terminology
  • Expands based on needs for additional terminology
  • Based on concepts rather than terms
  • Each concept has a unique identifier or CUI with definitions and synonyms
  • Housed by the Enterprise Vocabulary Service (EVS)
• LexBIG
  • A caBIG-funded vocabulary server to enable a Federated Vocabulary environment
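“Based on concepts rather than terms” means that several terms resolve to one concept with a single identifier. A minimal sketch of that idea follows, assuming hypothetical `VocabularyConcept` and `ConceptIndex` classes; the identifier, definition, and synonyms are illustrative values, not entries copied from the NCI Thesaurus, and this is not the EVS or LexBIG API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical concept record: one identifier, many terms.
class VocabularyConcept {
    final String code;            // unique identifier (placeholder value in the demo)
    final String preferredName;
    final String definition;
    final List<String> synonyms;

    VocabularyConcept(String code, String preferredName, String definition, List<String> synonyms) {
        this.code = code;
        this.preferredName = preferredName;
        this.definition = definition;
        this.synonyms = new ArrayList<String>(synonyms);
    }
}

// Hypothetical index that resolves any known term to its concept.
class ConceptIndex {
    private final Map<String, VocabularyConcept> byTerm = new HashMap<String, VocabularyConcept>();

    void add(VocabularyConcept concept) {
        byTerm.put(normalize(concept.preferredName), concept);
        for (String synonym : concept.synonyms) {
            byTerm.put(normalize(synonym), concept);
        }
    }

    // Returns the concept for a term, or null if the term is unknown.
    VocabularyConcept lookup(String term) {
        return byTerm.get(normalize(term));
    }

    private static String normalize(String term) {
        return term.trim().toLowerCase(Locale.ROOT);
    }
}

class ConceptLookupDemo {
    public static void main(String[] args) {
        ConceptIndex index = new ConceptIndex();
        index.add(new VocabularyConcept(
            "X-0001",                                  // made-up identifier
            "Breast Carcinoma",
            "Illustrative definition text.",
            Arrays.asList("Carcinoma of the Breast")));

        // Different terms resolve to the same concept identifier.
        System.out.println(index.lookup("breast carcinoma").code);
        System.out.println(index.lookup("Carcinoma of the Breast").code);
    }
}
```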
caBIG compatibility caTRIP CDEs Hyperlinks to a caTRIP concept
caBIG compatibility Domain information modeling
• A Domain Information Model is a representation of our understanding of an area of knowledge.
• Domain Information Models consist of ‘Classes’ that represent ‘things’ in the real world.
• Classes contain ‘attributes’ that are characteristics of different instances of things in the real world.
• Relationships between the classes are described by ‘associations’ and indicated by lines with directionality and cardinality.
• Each class plus attribute creates one Common Data Element (CDE).
caBIG compatibility Tumor Registry model
[UML diagram areas: Diagnosis; Participant; Collaborative Staging; Follow up and Recurrence; Treatment]
Hyperlinks to caTRIP UML
Building caBIG compatible systems Steps for creating an analytical system
• Step 1: model and register metadata
  • Model the domain objects
  • Register metadata
• Step 2: implement the analytical system
  • Implement an interface
  • Map data objects to existing inputs
  • Plug-in analytics
• Step 3: create the data service
  • Create an XML Schema
  • Use the caGrid 1.0 Introduce toolkit to create a service
  • Configure the service
  • Deploy
• Step 4: invoke the service
  • Java-based client
  • Use caTRIP
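Step 2 for an analytical system (implement an interface, map data objects to existing inputs, plug in analytics) can be sketched as a thin adapter. Everything named below is hypothetical: `ExpressionProfile` stands in for a modeled domain object, `ExistingAnalytics.mean` for a pre-existing routine, and `AnalyticalService` for the interface; none of these are caGrid or Introduce types.

```java
// Hypothetical domain object that would be modeled and registered in Step 1.
class ExpressionProfile {
    private final String sampleId;
    private final double[] values;

    ExpressionProfile(String sampleId, double[] values) {
        this.sampleId = sampleId;
        this.values = values.clone();
    }

    String getSampleId() { return sampleId; }
    double[] getValues() { return values.clone(); }
}

// Stand-in for an existing analytic routine that only understands raw arrays.
class ExistingAnalytics {
    static double mean(double[] data) {
        if (data.length == 0) {
            throw new IllegalArgumentException("no data");
        }
        double sum = 0.0;
        for (double d : data) {
            sum += d;
        }
        return sum / data.length;
    }
}

// "Implement an interface": the contract the analytical system exposes.
interface AnalyticalService<I, O> {
    O analyze(I input);
}

// "Map data objects to existing inputs" and "plug-in analytics":
// unwrap the domain object, then delegate to the existing routine.
class MeanExpressionService implements AnalyticalService<ExpressionProfile, Double> {
    public Double analyze(ExpressionProfile profile) {
        return ExistingAnalytics.mean(profile.getValues());
    }
}

class AnalyticalServiceDemo {
    public static void main(String[] args) {
        ExpressionProfile profile = new ExpressionProfile("S1", new double[] {1.0, 2.0, 3.0});
        System.out.println(new MeanExpressionService().analyze(profile));
    }
}
```

Steps 3 and 4 (schema, service creation, deployment, invocation) depend on the caGrid tooling and are not covered by this sketch.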
Building caBIG compatible systems Steps for creating a data system
• Step 1: model and register metadata
  • Model the domain objects
  • Register metadata
• Step 2: implement the information system
  • Model the databases (via scripts or EA)
  • Build the database
  • Generate Java beans
  • Create Hibernate mappings
  • Jar it all up
• Step 3: create the data service
  • Create an XML Schema
  • Use the caGrid 1.0 Introduce toolkit to create a service
  • Configure the service
  • Deploy
• Step 4: invoke the service
  • Java-based client
  • Use caTRIP
Building caBIG compatible systems N-tier architecture
[Diagram labels: Index Service; advertise; Distributed Query Engine; CQL Query; caGrid Data Service; caCORE SDK; CQL Engine; domain model; Object-relational mapping; database]
Building caBIG compatible systems caCORE SDK
[Diagram labels: Vocabularies; Info Model; Common Data Elements; Messaging Interfaces/API]
caBIG compatibility Mapping UML to CDEs example
• UML: Class: Gene; Attribute: entrezGeneID; Datatype: String
• Created Data Element: Gene Entrez Gene Genomic Identifier java.lang.String
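The mapping on this slide can be sketched in code: one class-plus-attribute pair becomes one data element made of an object class, a property term, and a datatype. This is a simplified illustration under stated assumptions: `UmlAttribute`, `DataElement`, and `CdeMapper` are hypothetical names, the property term is supplied by hand to stand in for curation, and the rule that `String` is recorded as `java.lang.String` is read off the slide rather than taken from caDSR documentation.

```java
import java.util.HashMap;
import java.util.Map;

// One attribute of one class in the UML model.
class UmlAttribute {
    final String className;
    final String attributeName;
    final String datatype;

    UmlAttribute(String className, String attributeName, String datatype) {
        this.className = className;
        this.attributeName = attributeName;
        this.datatype = datatype;
    }
}

// Simplified data element: object class + property + representation.
class DataElement {
    final String objectClass;
    final String property;
    final String representation;

    DataElement(String objectClass, String property, String representation) {
        this.objectClass = objectClass;
        this.property = property;
        this.representation = representation;
    }

    String longName() {
        return objectClass + " " + property + " " + representation;
    }
}

// Hypothetical mapper; the curated terms stand in for human curation.
class CdeMapper {
    private final Map<String, String> propertyTerms = new HashMap<String, String>();

    CdeMapper define(String className, String attributeName, String propertyTerm) {
        propertyTerms.put(className + "." + attributeName, propertyTerm);
        return this;
    }

    DataElement toDataElement(UmlAttribute attribute) {
        String key = attribute.className + "." + attribute.attributeName;
        String property = propertyTerms.get(key);
        if (property == null) {
            throw new IllegalArgumentException("No curated property term for " + key);
        }
        // Assumption: the UML type String is recorded as java.lang.String, as on the slide.
        String representation =
            "String".equals(attribute.datatype) ? "java.lang.String" : attribute.datatype;
        return new DataElement(attribute.className, property, representation);
    }
}

class CdeMappingDemo {
    public static void main(String[] args) {
        CdeMapper mapper = new CdeMapper()
            .define("Gene", "entrezGeneID", "Entrez Gene Genomic Identifier");
        DataElement element = mapper.toDataElement(
            new UmlAttribute("Gene", "entrezGeneID", "String"));
        System.out.println(element.longName());
    }
}
```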
caGrid Background, service creation, metadata
caGrid What is caGrid?
• What is Grid?
  • Evolution of distributed computing to support sciences and engineering
  • Sharing of resources (computational, storage, data, etc.)
  • Secure Access (global authentication, local authorization, policies, trust, etc.)
  • Open Standards
  • Virtualization
• What is caGrid?
  • Development project of Architecture Workspace
    • Helping define and implement Gold Compliance
  • Implementation of Grid technology
    • Leverages open standards, community open source projects
  • No requirements on implementation technology necessary for compliance
    • Specifications will be created defining requirements for interoperability
    • caGrid provides core infrastructure, and tooling to provide “a way” to achieve Gold compliance
  • Gold compliance creates the G in caBIG™
    • Gold => Grid => connecting Silver Systems