Developing a Reference Architecture for Scientific Data Systems

Developing a Reference Architecture for Scientific Data Systems Dan Crichton April 2009

Background • Employed by Jet Propulsion Laboratory since 1995; prior software engineering positions at Hughes Aircraft Company and in private industry • MS in Computer Science, USC 1996 • Program Manager for • Planetary Data System Engineering in Solar System Exploration Directorate • Data Systems and Technology in Earth and Technology Directorate • Principal Investigator for • Informatics Center, Early Detection Research Network, National Cancer Institute • Facilitating Integration of NASA and Earth System Grid, NASA • Object Oriented Data Technology

Science data systems • Covers a wide variety of science disciplines • Solar system exploration • Astrophysics • Earth science • Biomedicine • etc • Each has its own communities, standards and systems • How do you define a reference architecture vs a point solution?

Context: Space data systems Relay Satellite Simple Information Object Spacecraft and Scientific Instruments Spacecraft / lander Science Data Archive External Science Community Primitive Information Object Primitive Information Object Science Information Package Science Information Package Science Data Processing Science Products - Information Objects Telemetry Information Package Science Information Package Data Analysis and Modeling Science Information Package Planning Information Object Instrument Planning Information Object Science Team Data Acquisition and Command Mission Operations Instrument /Sensor Operations • Common Meta Models for Describing Space Information Objects • Common Data Dictionary end-to-end

Increasing data volumes Increased emphasis on usability and analysis of the data across the end-to-end system Mining/discovery Increasing diversity of data sets and complexity for integrating across missions/experiments (E.g., common information model for describing the data) Increasing distribution of coordinated processing and operations (E.g., federation) Increased pressure to reduce cost of supporting new missions Increasing desire for PIs to have integrated tool sets to work with data products with their own environments (E.g. perform their own generation and distribution) Challenges in science data systems Planetary Science Archive

Architecture: why do I care? • Data system costs per mission, project, investigation, etc is high • Technology infusion is limited • Need to capture and leverage domain knowledge and experience across projects

Architecture: what is it? • The fundamental organization of a system embodied in its components, their relationships to each other, and to the environment, and the principles guiding its design and evolution. (ANSI/IEEE Std. 1471-2000)

Architects: what are they? • Effective Architects have… • Years of experience • Holistic view of domain • Look at both aesthetics and practical details • Variable technical depth • Lifecycle roles • Strong involvement up-front • May oversee development • Chooses stable steps in development • Effective Architects are not… • Lone inventors or scientists • The architect is a good communicator and politician -- architectures must be sold and explained and their integrity maintained • Architecting is not a science, but depends on science • Purely technologists • Architecture is a strategy • “Top level only” designers • Details are often critical • Collaborators • A coherent vision is critical; they drive it

The view is what you see The viewpoint is where you look from Architecture frameworks • A viewpoint is a template for constructing a view • A view is a description of the entire system from the perspective of a set of related concerns. A view is composed of one or more models. • A model is an abstraction or representation of some aspect of a thing • Examples: RM-ODP, FEAF, TOGAF, etc (Project Managers, Engineers, Scientists, Business Analysts, …)

Planetary science example

Defining the reference architecture • In science data systems, construction of multiple architecture views of a system are critical • Process • Information/Data • Technology • We find the “views” are similar, but models can be domain specific • This is the opportunity to develop a reusable reference architecture if the “patterns” can be extracted

Domain Specific Software Architectures* • Domain model • Leverage experts who have the “holistic” view and can drive the need for product lines • An unambiguous view is critical (in fact, this has been a problem in science arenas) • Reference requirements • Drives the reference architecture • However, it is critical to map domain models to reference requirements in order to understand the solution space • Reference architecture • Satisfies an abstracted set of functions from the reference requirements • It’s engineered for the “ilities” reusability, extensibility and configurability • It demonstrates the separation of functional elements of the architecture * Tracz, Will, Domain-Specific Software Architecture, ACM SIGSOFT, 1995

Planetary Data System Decomposition PDS-2010 System Architecture Process Architecture Data Architecture Technology Architecture Ingest (Receive, Validate, Accept) Archive (APG, PAG) Information Model Catalog/Data Mgmt Preservation Planning Archive Storage Data Node Integration Distributed Infrastructure Query/Access Data Standards Portal Data Formats Search Technology Standards Data Dictionary Data Distribution Administration Grammar Data Movement Archive Tools Peer Review Archive Organization User Tools/Services Deep Archive

Separation of the architectures • Data/Information Architecture • Components, middleware, and communication • NOTE: Process is implicit here

Software product lines • This is about strategy more than technology • Goal is a software product line that • Implements our reference architecture • Allows for construction of core software components that can be reused across projects and science disciplines • Can demonstrate sufficient cost and schedule benefits without sacrificing flexibility in meeting requirements and adapting to technology change

Object Oriented Data Technology • Represents both a reference architecture AND a software product line for science data systems • Exploits common patterns • Delivers reusable software components as building blocks for construction of higher order data systems • Applied to multiple science disciplines • Funded originally back in 1998; runner up for NASA Software of the Year in 2003 • Heavily used by NASA and NIH projects

Architectural principles* • Separate the technology and the information architecture • Encapsulate the messaging layer to support different messaging implementations • Encapsulate individual data systems to hide uniqueness • Provide data system location independence • Require that communication between distributed systems use metadata • Define a model for describing systems and their resources • Provide scalability in linking both number of nodes and size of data sets • Allow systems using different data dictionaries and metadata implementations to be integrated • Leverage existing software, where possible (e.g., open source, etc)` * Crichton, D, Hughes, J. S, Hyon, J, Kelly, S. “Science Search and Retrieval using XML”, Proceedings of the 2nd National Conference on Scientific and Technical Data, National Academy of Science, Washington DC, 2000.

Architectural focus • Consistent distributed capabilities • Resource discovery (data, metadata, services, etc), “grid-ing” loosely coupled science system, workflow management • On-demand, shared services (E.g. processing, translation, etc) • Processing • Translation • Deploy high throughput data movement mechanisms • End-to-end capabilities across the science environment • Reduce local software solutions that do not scale • Increasing importance in developing an “enterprise” approach with common services • Build value-added services and capabilities on top of the infrastructure

Exploiting common patterns • How data is managed (registry/repository, information objects themselves)… • How data is generated, captured, etc (e.g., workflow and data processing)… • How data is accessed (metadata, data)… • How information is discovered … • How data is distributed (e.g., transformed)… • How data is visualized…

What does OODT do? • Tie together loosely coupled distributed heterogeneous data systems into a virtual data grid • Support critical functions • Data Production and workflow • Data Distribution • Data Discovery (including query optimization across highly distributed systems) • Data Access • An architectural approach first, an implementation second • Adapt to different distributed computing deployments • Promotes a REST-style architectural pattern for search and retrieval • Scalability in linking together large, distributed data sets

OODT data architecture focus • On types of and relationships among a software system’s data • Decomposition of data within a software system to its logical components and interactions • Components: Data Elements, Data Dictionary, Data Models of individual data sources • Interactions: Mappings between Data Dictionary to Data Models, Data Element structural comparison • Some standards currently exist for data architecture • ISO: ISO-11179 Standardization and Specification of Data Elements • Dublin Core Metadata Initiative: Dublin Core Data Elements to describe any electronic resource • Specifications for the Data Architecture • Common XML schema for managing information about data resources • Common XML schema for messaging between distributed services • Methods for integrating existing domain models within architecture

nasa.pds.xmlquery XMLQuery XMLQuery 1 fromSet - - resultModeId: String resultModeId: String - - propogationType: String propogationType: String QueryElement QueryElement selectSet 1 - - propogationLevels: String propogationLevels: String - - role: String role: String - - maxResults: int maxResults: int whereSet 1 - - value: String value: String - - kwqString: String kwqString: String - - numResults: int numResults: int - - mimeAccept: List mimeAccept: List 1 result 1 queryHeader 1 QueryHeader QueryHeader - - id: String id: String QueryResult QueryResult 1 - - title: String title: String - - list: List list: List - - description: String description: String - - type: String type: String - - statusID: String statusID: String - - securityType: String securityType: String - - revisionNote: String revisionNote: String - - dataDictID: String dataDictID: String OODT data architecture models Based on Dublin Core Request/Response Model Resource Metadata Model Based on ISO/IEC 11179

OODT software components • Profile Service – A server-based registry that is able to either serve local XML profiles or plug-into an existing catalog. This component provides resource discovery. • Product Service – A server component that plugs into existing repositories and serves products. This includes translation serves, etc • Catalog and Archive Service – Transaction-based server that catalogs and archives products providing profile and product servers for discovery and distribution • Query Service – Provides query management across distributed services to enable discovery.

Distributed architecture 1. Science data tools and applications use “APIs” to connect to a virtual data repository 2. Middleware creates the data grid infrastructure connecting distributed heterogeneous systems and data 3. Repositories for storing and retrieving many types of data Mission Data Repositories OODT Reusable Data Grid Framework OODT API Visualization Tools Biomedical Data Repositories OODT API Web Search Tools Engineering Data Repositories OODT API Analysis Tools

Technology architecture Service Registry Name Server Name Server Registry Server Node 1 Profile Server WSDL WSDL Web I/F Node 1 Profile Server Query Integration Node 1 Profile Server XML Request Information Object Product Catalogs XML Request Repository Product Server XML Request Desktop I/F Information Object Information Object Science Products XML Request Repository Product Server Info Object Information Object Science Products … XML Request Repository/Archive Server • Common Meta Models for Describing Space Information Objects • Common Data Dictionary end-to-end Science Products

OODT software implementation • OODT is Open Source • Developed using open source software (i.e. Java/J2EE and XML) • Implemented reusable, extensible Java-based software components • Core software for building and connecting data management systems • Provided messaging as a “plug-in” component that can be replaced independent of the other core components. Messaging components include: • CORBA, Java RMI, JXTA, Web Services, etc • REST seems to have prevailed • Provided client APIs in Java, C++, HTTP, Python, IDL • Simple installation on a variety of platforms (Windows, Unix, Mac OS X, etc) • Used international data architecture standards • ISO/IEC 11179 – Specification and Standardization of Data Elements • Dublin Core Metadata Initiative • W3C’s Resource Description Framework (RDF) from Semantic Web Community

EDRN Knowledge Environment • EDRN has been a pioneer in the use of informatics technologies to support biomarker research • EDRN has developed a comprehensive infrastructure to support biomarker data management across EDRN’s distributed cancer centers • Twelve institutions are sharing data • Same architectural framework as planetary science • It supports capture and access to a diverse set of information and results • Biomarkers • Proteomics • Biospecimens • Various technologies and data products (image, micro-satellite, …) • Study Management

Application to planetary science • Often unique, one of a kind missions • Can drive technological changes • Instruments are competed and developed by academic, industry and industrial partners • Highly distributed acquisition and processing across partner organizations • Highly diverse data sets given heterogeneity of the instruments and the targets (i.e. solar system) • Missions are required to share science data results with the research community requiring: • Common domain information model used to drive system implementations • Expert scientific help to the user community on using the data • Peer-review of data results to ensure quality • Distribution of data to the community • Planetary science data from NASA (and some international) missions is deposited into the Planetary Data System

Typical pipeline architecture One or More Instruments One or More Spacecraft A Space Tracking Network An Instrument Control Center Commodity Space Communications Systems Commodity Space Navigation Systems A Spacecraft Control Center A Ground Tracking Network A Science Facility Source: A. Hooke, NASA/JPL

Planetary Data System • NASA’s official archive for research results from solar system exploration • Distributed across the United States at “PDS Nodes” • 8 nodes including both science nodes and support nodes • Data and Services reside at each node • Unified by a common data architecture and broad technical architecture

Planetary Data System Mars Odyssey THEMIS/ASU Data Node Geosciences/Washington University Rings/SETI Radio Science/Stanford Small Bodies/UMD Planetary Plasma/UCLA Imaging/JPL Engineering/JPL MRO-HiRISE/UofA Data Node Imaging/USGS NAIF/JPL Atmospheres/New Mexico State

PDS Image Class (Object-Oriented) PDS Image Label (ODL) Describes An Image The data architecture is key • The planetary community has developed a diverse model, that is enforced and used in data management • NASA-led, but ESA, ISRO, JAXA, etc are leveraging planetary science data standards • Core “information” model that has been used to describe every type of data from NASA’s planetary exploration missions and instruments • ~4000 different types of data • Unique to planetary, but the concept of models and how they apply to science data is not

2001 Mars Odyssey: A paradigm change • Pre-Oct 2002, no unified view across distributed operational planetary science data repositories • Science data distributed across the country • Science data distributed on physical media • Planetary data archive increasing from 4 TBs in 2001 to 100 TBs in 2009 • Traditional distribution infeasible due to cost and system constraints • Mars Odyssey could not be distributed using traditional method • Current work with the OODT Data Grid Framework has provided the technology for NASA’s planetary data management infrastructure to • Support online distribution of science data to planetary scientists • Enable interoperability between nine institutions • Support real-time access to data products • Provided uniform software interfaces to all Mars Odyssey data allowing scientists and developers to link in their own tools • Operational October 1, 2002 • Moving to multi-terrabyte online data movement in 2009 2001 Mars Odyssey

The architecture reuse opportunity • While planetary has unique constraints and requirements , the broader architecture patterns are exhibited in other science areas • Planetary can be very unforgiving when it comes to system failures • Biology and Earth, for example, are • Distributed • Have similar pipelines and processes • Focus on instruments that perform observations and then analysis of those instruments • Work with data in similar ways • Are PI and science-driven

Explosion of data in Biomedical Research • “To thrive, the field that links biologists and their data urgently needs structure, recognition and support. The exponential growth in the amount of biological data means that revolutionary measures are needed for data management, analysis and accessibility. Online databases have become important avenues for publishing biological data.” – Nature Magazine, September 2008 • The capture and sharing of data to support collaborative research is leading to new opportunities to examine data in many sciences • NASA routinely releases “data analysis programs” to analyze and process existing data EDRN Data Repositories 9-Aug-14 35

National Cancer Institute Early Detection Research Network (EDRN) • Initiated in 2000, renewed in 2005 • 100+ Researchers (both members and associated members) • ~40 + Research Institutions • Mission of EDRN • Discover, develop and validate biomarkers for cancer detection, diagnosis and risk assessment • Conduct correlative studies/trials to validate biomarkers as indicators of early cancer, pre-invasive cancer, risk, or as surrogate endpoints • Develop quality assurance programs for biomarker testing and evaluation • Forge public-private partnerships • Leverage building distributed planetary science data systems for biomedicine

EDRN Science Pipeline Data Distribution (EDRN Public Portal) eCAS - EDRN Biorepository Laboratory Biorepository External Science Community Instrument Publish Data Sets EDRN Bioinformatics Tools Analysis Team Instrument Operations Science Data Processing Local Laboratory Science Data System EDRN Researchers

EDRN’s Ontology Model • EDRN has developed a High level ontology model for biomarker research which provides standards for the capture of biomarker information across the enterprise • Specific models are derived from this high level model • Model of biospecimens • Model for each class of science data • EDRN is specifically focusing on a granular model for annotating biomarkers, studies and scientific results • EDRN has a set of EDRN Common Data Elements which is used to provide standard data elements and values for the capture and exchange of data EDRN CDE Tools EDRN Biomarker Ontology Model

EDRN Model Mapping to Applications ERNE -- EDRN Resource Network Exchange BMDB -- NCI Biomarker DB ESIS -- EDRN Study Information System eCAS -- EDRN Catalog and Archive System The EDRN Knowledge Environment

Reuse of access pattern

Moving to an integrated semanticarchitecture • Semantic science portal driven by the EDRN ontology • Schema loaded into the ontology via RDFS (and Protégé) • Metadata from distributed applications dumped into the portal via RDF • Moving EDRN towards a “pure” model-driven environment

Distributed knowledge portal

Other science areas • Earth Science • Leveraged OODT software framework for constructing ground data systems for earth science missions • Used OODT Catalog and Archive Service software • Constructed “workflows” • Execution of “processors” based on a set of rules • Medical Research • Support for distributed analysis of pediatric intensive care units • Climate Research • Support for distributed modeling SeaWinds on ADEOS II (Launched Dec 2002)

Related work…. • The plethora of middleware, e-science and grid efforts… • Major agency efforts in physical and life sciences… • Standards efforts…. • All the technology support (but see my message on next slide as an architect!)

My message… • Distributed service architectures • Not anything new (my experience with them goes back to the early 1990s) • But, often, newer technologies and approaches are seen as a panacea • Technology is not a replacement for a conceptual architecture • My experience is that definition of the architecture independent of technology is critical • The goal should be stability in the architecture model; the selection of appropriate technology will change over time • This is why an architect is much more of a strategist than a technologist

More preaching… • Think about the entire system and identify the abstractions • You need the holistic view • What are the patterns • Will an architecture framework help? (separation of process, data, technology, etc views)? Can these evolve independently?

Resources • (1) Tracz, Will. Domain-Specific Software Architecture. ACM SIGSOFT, 1995. • (2) D. Crichton, S. Kelly, C. Mattmann, Q. Xiao, J. S. Hughes, J. Oh, M. Thornquist, D. Johnsey, S. Srivastava, L. Esserman, and B. Bigbee. A Distributed Information Services Architecture to Support Biomarker Discovery in Early Detection of Cancer. In Proceedings of the 2nd IEEE International Conference on e-Science and Grid Computing, pp. 44, Amsterdam, the Netherlands, December 4th-6th, 2006. • (3) C. Mattmann, D. Crichton, N. Medvidovic and S. Hughes. A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications. In Proceedings of the 28th International Conference on Software Engineering (ICSE06), pp. 721-730, Shanghai, China, May 20th-28th, 2006. • ,

Developing a Reference Architecture for Scientific Data Systems

Developing a Reference Architecture for Scientific Data Systems

Presentation Transcript

A Reference Architecture for Web Servers

Reference Architecture for IT Optimization

SEMANTIC AGENT SYSTEMS Towards a Reference Architecture for Semantic Agent Systems Applied to Symposium Planning

SEMANTIC AGENT SYSTEMS Towards a Reference Architecture for Semantic Agent Systems Applied to Symposium Planning

Data Encapsulating Services : A New Paradigm for Developing Enterprise Systems

Kepler + Hadoop A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems

Reference Architecture for NASA’s Earth Science Data Systems

Developing a Layered Reference Model for Information

Developing a Learning Technology Architecture

PROSA, a reference architecture for holonic manufacturing systems

An Architecture for Real-Time Warehousing of Scientific Data

Developing a Reference Model of ePortfolio

ESDS Reference Architecture

A Software Architecture for Highly Data-Intensive Systems

A Systems Architecture for Ubiquitous Video

A Reference Architecture for All IP Wireless Networks

REFERENCE ARCHITECTURE

IBM Reference Architecture for Education

SP99 Reference Architecture

Developing a Layered Reference Model for Information

ERA Reference Architecture

BROADWAY: A SOFTWARE ARCHITECTURE FOR SCIENTIFIC COMPUTING