Andrew Hart NASA Jet Propulsion Laboratory David Kale Whittier VPICU, Children’s Hospital LA

Distributed, Modular Grid Software for Management and Exploration of Data in Patient-Centric Healthcare IT • Andrew Hart • NASA Jet Propulsion Laboratory • David Kale • Whittier VPICU, Children’s Hospital LA • Heather Kincaid • NASA Jet Propulsion Laboratory

Agenda • Health Care Data Challenges for Large-scale Research • Intro to Object Oriented Data Technology (OODT) • Applications of OODT in distributed scientific data systems • NASA’s Planetary Data System • NCI’s Early Detection Research Network • Whittier Virtual Pediatric Intensive Care Unit (VPICU) • OODT as Open Source • Learning More & Keeping in Touch

Health care research • Increasingly collaborative • Increasingly geographically distributed • Scale, Complexity, Cost drive cooperation • Opportunities for discovery emerge through larger data sets • Increasing need for technology to support “virtual organizations” carrying out distributed scientific research

OODT – What Is It? “A data grid software infrastructure for constructing large-scale, distributed data-intensive systems” • Reference Architecture • Software Product Line • Reusable Components • Common Patterns

A Brief History of OODT • Funded out of NASA’s Office of Space Science in 1998 • Funded to address critical software engineering challenges affecting the design of mission science data systems • Designed, implemented, and refined over the past 7 years across multiple scientific domains: • Planetary Science, • Earth Science, • Cancer Research, • Space Physics, • Modeling and Simulation, • Pediatric Intensive Care • Runner-up for NASA Software of the Year in 2003

Principles behind OODT • Division of Labor: Avoid making one component the workhorse, configurable • Technology Independence: Guard against unexpected changes in the technology landscape • Metadata as a first-class citizen: Descriptions of resources come in handy • Separation of software and data models: Allow each to evolve independently • Modular, domain-agnostic: Pick and choose from adaptable components with defined interfaces

OODT Core Framework Services • Archive Service: Ingest data + metadata, processing algorithms, workflow support • Profile Service: Deliver metadata from an underlying data store • Product Service: Deliver data from an underlying data store • Query Service: Manage sets of profile servers • Data Grid Service: Interfaces and tools for connecting distributed resources over the web
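
The division of labor among the profile, product, and query services can be sketched in a few lines of Python. This is an illustrative sketch only, not the OODT API: the class names `ProfileServer`, `ProductServer`, and `QueryService`, the in-memory stores, and the sample records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProfileServer:
    """Answers metadata queries against one underlying store (hypothetical)."""
    name: str
    records: dict = field(default_factory=dict)  # product_id -> metadata dict

    def query(self, key, value):
        # Return (server name, product id) for every record whose metadata matches.
        return [(self.name, pid) for pid, meta in self.records.items()
                if meta.get(key) == value]


@dataclass
class ProductServer:
    """Delivers the data itself, separately from its metadata (hypothetical)."""
    name: str
    blobs: dict = field(default_factory=dict)  # product_id -> bytes

    def retrieve(self, product_id):
        return self.blobs[product_id]


class QueryService:
    """Fans one query out to a set of profile servers and merges the hits."""

    def __init__(self, profile_servers):
        self.profile_servers = list(profile_servers)

    def query(self, key, value):
        hits = []
        for server in self.profile_servers:
            hits.extend(server.query(key, value))
        return sorted(hits)


# Made-up sample data for two sites.
site_a = ProfileServer("site-a", {"p1": {"target": "Mars"}, "p2": {"target": "Venus"}})
site_b = ProfileServer("site-b", {"p9": {"target": "Mars"}})
products = ProductServer("site-a", {"p1": b"raw image bytes"})

hits = QueryService([site_a, site_b]).query("target", "Mars")
```

Keeping metadata lookup and data delivery in separate classes mirrors the "division of labor" principle: a client first finds where a product lives, then asks that site's product server for the bytes.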

Applications of OODT: PDS • Planetary Data System • National Aeronautics and Space Administration • http://pds.nasa.gov

NASA Planetary Data System • Official NASA archive for all planetary data • 9 Nodes with data located at discipline sites • All missions must add their data (required as part of mission Announcement of Opportunity) • Prior to October 2002, no ability to find and share data between PDS nodes

PDS Data Key Challenges • Challenges to building a science data system for the PDS: • NASA often flies unique, one-of-a-kind missions • A static infrastructure won’t work: Nodes and models change • Data stored at PDS nodes differs dramatically in structure • Missions are required to share science data results with the research community

PDS Data Architecture • Distributed data system environment with federated governance: Each site maintains its own database and infrastructure • Common domain information model (regularly updated) used to drive system implementations: Ontology and Common Data Elements (based on ISO/IEC 11179) • Common query interface to distributed services: Implemented with OODT Query Handlers • Software services that wrap existing data systems to share data: Implemented with OODT Product & Profile servers • Publishing of data products to a common portal: Implemented using Resource Description Framework (RDF)
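
As a rough illustration of publishing product descriptions as RDF, the sketch below serializes one metadata record as N-Triples using only the standard library. The slides do not say which RDF vocabulary or serialization PDS uses; the Dublin Core element namespace, the `example.org` product URI, and the function names here are assumptions chosen for illustration.

```python
# Dublin Core element namespace (an assumed vocabulary, not stated in the slides).
DC = "http://purl.org/dc/elements/1.1/"


def escape_literal(text):
    # Escape backslash first, then double quote and line breaks, as N-Triples requires.
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriples(subject_uri, metadata):
    """Render one product description as N-Triples, one line per metadata field."""
    lines = []
    for field_name in sorted(metadata):
        literal = escape_literal(str(metadata[field_name]))
        lines.append(f'<{subject_uri}> <{DC}{field_name}> "{literal}" .')
    return "\n".join(lines)


# Hypothetical product URI and metadata, for illustration only.
doc = to_ntriples(
    "http://example.org/products/p1",
    {"title": 'Image "A"', "creator": "Example Node"},
)
```

A portal could harvest such statements from each node and merge them, since every triple names its subject by a global URI.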

PDS Architecture Decomposition

Applications of OODT: EDRN • Early Detection Research Network • Division of Cancer Prevention, National Cancer Institute • http://cancer.gov/edrn

EDRN Overview • Focus: investigator-initiated, collaborative research on molecular, genetic and other biomarkers for cancer detection and risk assessment. • Funded since 2000 by the Division of Cancer Prevention in the National Cancer Institute (NCI) • 40+ geographically distributed centers performing parallel, complementary studies • Strong emphasis on the role of informatics

EDRN Participants • Biomarker Development Laboratories: Responsible for the development and characterization of new biomarkers or the refinement of existing biomarkers. • Biomarker Reference Laboratories: Serve as a Network resource for clinical and laboratory validation of biomarkers, which includes technological development, quality control, refinement, and high throughput. • Clinical Epidemiology and Validation Centers: Conduct clinical and epidemiological research regarding the clinical application of biomarkers. • Data Management and Coordinating Center: Coordinate EDRN research activities, provide logistic support, conduct statistical and computational research for data analysis, and analyze data for validation.

OODT and EDRN • OODT’s success led to interagency agreements with both NIH and NCI, resulting in: • EDRN Informatics Center: Support EDRN's efforts through the development of software systems for information management. Located at NASA Jet Propulsion Laboratory, Pasadena, CA. • Principal Investigator: Dan Crichton, JPL.

EDRN Data • EDRN collects, generates, analyzes, and stores a wide variety of different data, including: • Specimen Inventories: Map specimens collected (blood, sputum, etc.) to patient characteristics • Studies and Publications: Information about studies conducted in the EDRN as well as published results (publications, outputs) • Biomarkers: Information about indicators of early disease • Science Data: Outputs of experiments on specimens, regarding biomarkers, driven by particular studies and protocols

EDRN Data Flow • Moving beyond the local laboratory • Scalability, interoperability

Case Study: ERNE • ERNE: EDRN Resource Network Exchange • Challenge: Overcome differences in local schema to develop a national distributed specimen information infrastructure • All sites running different software and following own procedures • Rely on a common information model for distributed querying, and provide site-specific mappings at each participant
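
The idea of a common information model with site-specific mappings can be sketched as a small translation step. This is a minimal sketch under stated assumptions, not ERNE's implementation: the field names, local codes, and site identifiers below are all invented for illustration.

```python
# Hypothetical common-model query: field -> value, using common data elements.
common_query = {"specimen_type": "blood", "organ": "lung"}

# Hypothetical per-site mappings from the common model to each local schema.
SITE_MAPPINGS = {
    "site_a": {
        "fields": {"specimen_type": "SPEC_TYP", "organ": "ORGAN_CD"},
        "values": {"specimen_type": {"blood": "BLD"}, "organ": {"lung": "LU"}},
    },
    "site_b": {
        "fields": {"specimen_type": "sampleKind", "organ": "site"},
        "values": {"specimen_type": {"blood": "whole blood"}},
    },
}


def translate(query, mapping):
    """Rewrite a common-model query into one site's local field names and codes."""
    local = {}
    for field, value in query.items():
        local_field = mapping["fields"][field]
        # Fall back to the common value when the site defines no local code.
        local_value = mapping["values"].get(field, {}).get(value, value)
        local[local_field] = local_value
    return local


local_queries = {site: translate(common_query, m) for site, m in SITE_MAPPINGS.items()}
```

Because each mapping lives with its site, a participant can change its local schema and update only its own mapping, while everyone else keeps querying in the common model.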

ERNE Architecture

Connecting Research • Designing the EDRN informatics architecture as a collection of well-defined components via OODT has simplified the process of building interfaces to non-EDRN systems • Wrappers can be built to link non-EDRN systems • Translators can be developed to deal with different semantic architectures • caBIG • ERNE/caTissue Wrapper • EDRN-Canary Collaboration • A cloud computing effort that shares raw science data via Amazon S3 between EDRN and the Canary group which uses software from GenoLogics Life Sciences

EDRN Knowledge Environment • Building a Semantic Bioinformatics Grid for the EDRN

Lessons From EDRN • Architecture and a vision have been critical • Technology hasn’t been as critical • Keep it simple • Science support has been critical • Getting buy-in and participation from domain experts is key • Incremental development and deployment • Starting with a few sites was very helpful in understanding the issues • We had both development sites and observer sites initially • The IRB process has been a big schedule driver • Distributed architecture can be a challenge • Not all sites are up to maintaining the implementation • Loosely coupled architecture with simple interfaces helped

Applications of OODT: VPICU • Whittier Virtual Pediatric Intensive Care Unit • Children’s Hospital Los Angeles • http://picu.net • Collaboration between 85 multi-disciplinary pediatric intensive care units across the U.S.

Collaboration with VPICU • Laura P. and Leland K. Whittier Virtual Pediatric Intensive Care Unit (VPICU), founded in 1998 by clinicians at CHLA • Leverage advances in technology to: • Improve patient care • Educate practitioners • Conduct research • Reduce cost of providing care

VPICU Research Data • Secondary use of observational clinical (EHR, monitor, annotations) data • Real Health Care Data Set • Massive, grows continuously • Heterogeneous formats, types, etc. • Incomplete, proprietary descriptions • Fragmented across stores, organizational boundaries • Incomplete, inconsistent • Highly restricted (legal, privacy, ethical considerations) • Ideal Research Data Set • Manageable size, static • Homogeneous • Complete, standardized descriptions and annotations • Available as single unit • Complete, consistent • Minimal usage restrictions

VPICU Project Areas • Data extraction and management: Take data from proprietary stores, make it accessible • Transformation of data into knowledge: Process (and re-process) the data to extract insight • Data-driven decision support: Develop tools that learn continuously from the data • Distributed data-sharing over a national network: Enable research on scales previously impossible while maintaining security, privacy, compliance

Principles behind VPICU • Decouple from (proprietary) vendor databases • Integrate disparate data sources into a single model • Dynamically (re)generate research database(s) • We don’t know for sure what queries will be most useful at the outset • Provide web services for multi-faceted access to the data to enable discovery & analysis • Support federation among multiple PICU sites

“Algorithm” for VPICU Data System • Develop a common Domain Ontology to describe the information space • Develop compute services that support extraction of data from existing databases • Identify mechanisms to integrate information objects from disparate repositories and map them to the common domain ontology • Construct a set of online research databases to enable data mining and analysis • Deploy a “data grid” infrastructure of hardware & software to facilitate utilization of the data environment at CHLA and beyond (external entities and applications) • Deploy a set of compute services to support data mining and analysis • Develop an architectural plan and roadmap for scaling and integrating other PICUs

VPICU Architecture • [Figure: VPICU architecture diagram; visible label: File-based storage]

VPICU Architecture • Original data sources/stores at backend • Proprietary schema • Hardware that we don’t “own” or control • Production systems (very load-sensitive) • Legacy technologies (sometimes) • Unreliable (can’t guarantee always available) • Includes: • Hospital-wide commercial EHR system(s) • Homegrown critical care database • Specialized clinical applications • Raw bedside monitor data

VPICU Architecture • Regular extraction of new data • VPICU-controlled resources (our hardware and software) • Transform to VPICU schema • Link data belonging to same patient • May contain PHI: must be highly secure • Data at this stage is normalized, stored in a format suitable for ingestion into any number of research databases • [Figure labels: File-based storage; VPICU-owned resources]
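
The transform-and-link step can be sketched as follows. This is a simplified sketch, not the VPICU pipeline: the source field names, the sample rows, and the assumption that both systems carry the same patient identifier value are hypothetical, and real record linkage is usually harder.

```python
from collections import defaultdict

# Hypothetical extracts from two source systems with different local schemas.
ehr_rows = [{"MRN": "123", "dx": "asthma"}, {"MRN": "456", "dx": "sepsis"}]
monitor_rows = [{"patient": "123", "hr": 110}, {"patient": "123", "hr": 98}]


def normalize(row, id_field, source):
    """Map a source row onto a shared shape: patient_id, source, payload."""
    payload = {k: v for k, v in row.items() if k != id_field}
    return {"patient_id": row[id_field], "source": source, "payload": payload}


def link_by_patient(records):
    """Group normalized records so each patient's data sits together."""
    linked = defaultdict(list)
    for record in records:
        linked[record["patient_id"]].append(record)
    return dict(linked)


normalized = ([normalize(r, "MRN", "ehr") for r in ehr_rows] +
              [normalize(r, "patient", "monitor") for r in monitor_rows])
patients = link_by_patient(normalized)
```

Once every record shares one shape, the same normalized store can feed any number of downstream research databases without going back to the production systems.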

VPICU Architecture • Research databases • Application-specific • Optimized • Contain de-identified or anonymized data • VPICU ontology, schema • Access via configurable web services
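
One small piece of preparing de-identified research data can be sketched as dropping direct identifiers and replacing the patient ID with a keyed pseudonym. This is an illustration of that single step under stated assumptions, not a complete de-identification procedure and not the VPICU method: the identifier list, field names, and key below are hypothetical.

```python
import hashlib
import hmac

# Hypothetical list of direct identifiers to drop; a real list would be longer.
DIRECT_IDENTIFIERS = {"name", "mrn", "address"}


def pseudonym(patient_id, secret_key):
    """Derive a stable pseudonym from the patient ID using HMAC-SHA256."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(record, secret_key):
    """Drop direct identifiers and replace the patient ID with a pseudonym."""
    clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    clean["patient_key"] = pseudonym(record["mrn"], secret_key)
    return clean


key = b"example-only-secret"  # placeholder; a real key would be managed securely
row = {"mrn": "123", "name": "Jane Doe", "age_months": 30, "dx": "asthma"}
research_row = deidentify(row, key)
```

Because the pseudonym is deterministic for a given key, records for the same patient still join across research databases, while the raw identifier stays in the secure tier.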

What are “research databases?” • Designed for specific research questions, analytical techniques • Need not always be relational or databases at all • Available via web interfaces and software services: Researcher using R can connect directly through R bindings • Examples: • Relational database for traditional retrospective studies • Search engine over free text clinical notes, etc. • Patient/patient comparison, retrieval (find patients like this one) • Data-backed patient simulator for “testing” interventions
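
The "find patients like this one" example can be illustrated with a nearest-neighbor lookup. This is a toy sketch, not the VPICU retrieval method: the feature choice, the sample values, and the use of unscaled Euclidean distance are assumptions, and a real system would at least normalize features first. It requires Python 3.8+ for `math.dist`.

```python
import math

# Hypothetical feature vectors: (heart rate, respiratory rate, age in months).
cohort = {
    "pt-a": (120.0, 30.0, 24.0),
    "pt-b": (90.0, 18.0, 120.0),
    "pt-c": (118.0, 28.0, 30.0),
}


def most_similar(query, patients, k=1):
    """Return the k patient keys closest to the query by Euclidean distance."""
    ranked = sorted(patients, key=lambda pid: math.dist(query, patients[pid]))
    return ranked[:k]


matches = most_similar((119.0, 29.0, 26.0), cohort, k=2)
```

Such an index is an example of a research "database" that is not relational at all: it is built for one analytical question and regenerated from the normalized store as needed.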


OODT and the VPICU Data System • Develop an Information Model (Ontology) to describe the domain • Develop compute services that support extraction of data from existing CHLA databases (OODT Query Handlers) • Identify mechanisms to integrate information objects from disparate repositories and map them to the common domain ontology (OODT CAS crawler, catalog services) • Construct a set of online research databases to enable data mining and analysis (OODT Catalog and Archive Services) • Deploy a “data grid” infrastructure of hardware & software to facilitate utilization of the data environment at CHLA and beyond (external entities and applications) (OODT Data Grid Services) • Deploy a set of compute services to support data mining and analysis • Develop an architectural plan and roadmap for scaling and integrating other PICUs

OODT as Open Source • Jan 2010: OODT Accepted as a podling in the Apache Software Foundation (ASF) Incubator • First NASA software licensed and incubating within the ASF • Learn more and track our progress at: • http://incubator.apache.org/projects/oodt.html • Join the mailing list: • oodt-dev@incubator.apache.org • Chat on IRC: • #oodt on irc.freenode.net

Acknowledgements • Jet Propulsion Laboratory: Dan Crichton, Chris Mattmann, Sean Kelly, Steve Hughes, Amy Braverman, Thuy Tran • National Cancer Institute: Sudhir Srivastava, Christos Patriotis, Don Johnsey • Fred Hutchinson Cancer Research Center: Mark Thornquist, Ziding Feng, Jackie Dalhgren, Suzanna Reid • Children’s Hospital Los Angeles: Randall Wetzel, Robinder Khemani, Paul Vee, Jeff Terry, Robert Kaptan, Doug Hallam
