Space Data Integration Tools and Strategies

Flexible tools for integrating observations and models
Johan De Keyser, Emmanuel Gamby
Belgian Institute for Space Aeronomy

Objective
• Several packages exist for processing and visualizing space-related data. Some are meant to be general (e.g. QSAS, MIM, NSSDC), and some are mission-specific (e.g. Cluster Science Data System, Cluster Active Archive, Themis Data Analysis System).
• The goal is to draw general conclusions about the needs of such software infrastructure, and to offer useful recommendations for data and modeling services in the space science and space weather arena.
ESWW 2007

Space weather clients and servers
• The commercial model:
• The scientist’s model:
  • user, service provider, and data provider coincide

Business model
[Diagram labels: external repository, local repository, instrument data, model data, science input, empirical models, physical models, visualization and processing, interpretation]
• User: looks at various spacecraft, ground-based, or model data provided by colleagues or external data sources
• Service provider: offers know-how in the form of models to colleagues, or as model output to end users
• Data provider: offers processed data to colleagues and end users

Sharing scientific know-how
• Scientists are end users: they derive knowledge by bringing together different kinds of information.
  • What are the observations? Observational data
  • What is the interpretation? Model data
  • How do you bring it together? Algorithms
• All of this is brought together by the data processing and visualization tool.
• Scientists turn into service providers
  • by making observational data available,
  • by offering their model data, or
  • by publishing their algorithms.

Examples
• Scientists develop algorithms to bring together data from different sources. The algorithm proposes a model, possibly parameterized, to compute model output.
• Example 1: Gradients
  • Gradients are computed from measurements on the 4 Cluster spacecraft.
  • The model assumes locally constant gradients.
  • Model input parameters: e.g. an estimate of the distance over which the gradients can safely be considered constant
  • Model output: the computed gradient vector with its error margins
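As an illustration of Example 1, the following minimal sketch estimates a gradient under the locally constant gradient assumption. It is not the authors' algorithm: the function names are hypothetical, the input is synthetic, and the error margins mentioned above are not computed. It solves (r_i - r_0) · g = f_i - f_0 for three spacecraft relative to a reference one, which is only possible when the 4 points are not coplanar.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def gradient_from_four_points(positions, values):
    """Estimate a locally constant gradient from 4 non-coplanar points.

    positions: four (x, y, z) tuples; values: four scalar measurements.
    Solves (r_i - r_0) . g = f_i - f_0 for i = 1..3 by Cramer's rule.
    """
    r0, f0 = positions[0], values[0]
    a = [[p[k] - r0[k] for k in range(3)] for p in positions[1:]]
    d = [f - f0 for f in values[1:]]
    det = det3(a)
    # Simplification: an absolute threshold; real data would need a
    # scale-aware test for near-coplanar configurations.
    if abs(det) < 1e-12:
        raise ValueError("points are (nearly) coplanar; gradient is undetermined")
    grad = []
    for col in range(3):
        # Replace one column of the matrix by the right-hand side.
        m = [row[:] for row in a]
        for row in range(3):
            m[row][col] = d[row]
        grad.append(det3(m) / det)
    return tuple(grad)


# Synthetic check: sample f = 2x - y + 0.5z + 3 at the corners of a tetrahedron.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
values = [2 * x - y + 0.5 * z + 3 for x, y, z in positions]
print(gradient_from_four_points(positions, values))  # (2.0, -1.0, 0.5)
```

For a linear field the result is exact; for real measurements it is only as good as the assumption that the gradient is constant over the spacecraft separation.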


Example 2: Modeling of cometary comae
• Computing the chemical composition in a cometary coma.
• The model assumes thermodynamic equilibrium and computes how the composition evolves due to chemistry.
• Model input: chemical reaction constants, neutral gas production rates, numerical parameters
• Model output: particle abundances throughout the coma


Sharing data
• Access type
  • Manual: interactively look up and download data through a human-oriented graphical interface (e.g. web browser access to CSDS, CAA, NSSDC)
  • Automatic: automated, machine-based data access procedure
    • Definition of “channels”: a generic specification of where and how to find spacecraft data for a given time (TDAS, MIM)
• Physical access is always based on some protocol
  • NFS access: for a local repository
  • FTP access: NSSDC, Themis repository
  • Web access: Cluster Active Archive
• Access restrictions require a login and password
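The idea of a “channel” can be sketched as a small record that turns a date into a location. This is a hypothetical illustration only: the `Channel` class, its fields, the host name, and the directory layout are assumptions made for the example and do not describe how TDAS or MIM actually define channels.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Channel:
    """Hypothetical channel: where and how to find data for a given day."""
    protocol: str        # e.g. "ftp", "https", or "file" for an NFS mount
    host: str            # empty for local (NFS) access
    path_template: str   # str.format template with a {day} field

    def locate(self, day: date) -> str:
        """Return the location of the data file covering the given day."""
        path = self.path_template.format(day=day)
        if self.protocol == "file":
            return path
        return f"{self.protocol}://{self.host}{path}"


# Made-up example entry; host and directory layout are placeholders.
fgm = Channel("ftp", "archive.example.org",
              "/cluster/c1/fgm/{day:%Y}/c1_fgm_{day:%Y%m%d}.cdf")
print(fgm.locate(date(2007, 11, 5)))
```

Keeping the protocol, host, and path pattern as data rather than code means a new data source can be added by writing a new channel entry instead of a new access routine.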

• Automated access downloads data to a local repository or cache.
  • Cache management is based on a reserved cache size and a minimum guaranteed lifetime of files. File removal is based on the time of last use.
• Automated access can lead to significant wait times.
  • E.g. transferring a 20 Mb data set over a 0.1 Mb/s connection takes 200 s, i.e. more than three minutes; cache hits are therefore important.
  • A high cache hit rate can be achieved because scientists often work for a prolonged time with a limited set of events (provided the cache is big enough to hold that set).
  • Caching is of little help when scanning the whole archive, e.g. for statistical studies.
• Access may be done as a background activity.
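The cache policy described above (reserved size, minimum guaranteed lifetime, removal by time of last use) might be sketched as follows. The class and method names are hypothetical, and the sketch only does the bookkeeping: it performs no file I/O and takes the current time as an argument so that it is easy to test.

```python
class FileCache:
    """Bookkeeping for a local data cache (no real file I/O in this sketch)."""

    def __init__(self, max_bytes, min_lifetime_s):
        self.max_bytes = max_bytes            # reserved cache size
        self.min_lifetime_s = min_lifetime_s  # guaranteed lifetime per file
        self.entries = {}                     # name -> [size, created, last_used]

    def used_bytes(self):
        return sum(size for size, _, _ in self.entries.values())

    def touch(self, name, now):
        """Record a cache hit; returns False on a miss."""
        if name not in self.entries:
            return False
        self.entries[name][2] = now
        return True

    def add(self, name, size, now):
        """Insert a file, evicting least recently used files if needed."""
        self.entries[name] = [size, now, now]
        evicted = []
        # Candidates: files past their guaranteed lifetime, oldest use first.
        candidates = sorted(
            (n for n, (_, created, _) in self.entries.items()
             if n != name and now - created >= self.min_lifetime_s),
            key=lambda n: self.entries[n][2],
        )
        for victim in candidates:
            if self.used_bytes() <= self.max_bytes:
                break
            del self.entries[victim]
            evicted.append(victim)
        return evicted


cache = FileCache(max_bytes=100, min_lifetime_s=60)
cache.add("a", 60, now=0)
cache.add("b", 30, now=10)
cache.touch("a", now=100)           # "a" is now the most recently used
print(cache.add("c", 40, now=120))  # ['b']
```

One consequence of the guaranteed lifetime is that the cache can temporarily exceed its reserved size when every file is still protected; a real implementation would have to decide whether to refuse the download or tolerate the overshoot.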

There is a plethora of available data formats
• Archived data may be structured in ways that reflect their origin:
  • time series of scalar or vector values, of particle distribution functions, or of wave spectra; multi-dimensional spatial fields; images …
  • data might be grouped in a particular way, e.g. particle distribution moments are usually provided together on a common timescale
• Archived data may be stored in a common file exchange format, such as ASCII, CDF, or HDF files.
  • NSSDC offers data in these formats; ASCII only for low time resolution data.
• Archived data might be compressed.
  • NSSDC compresses ASCII data files.

• Data fed into a visualization/processing tool need a specific format to load quickly.
  • MIM expresses time in Julian Days and enforces SI units.
• Therefore there is a need to convert various archive formats into the desired input format.
  • MIM uses a generic data format description to steer a data translator. This process maintains and provides metadata.
  • QSAS uses the QTRAN data format translator.
• The formatted data volume is usually bigger than that of the archived data. It is the formatted data that are stored in the local cache, while the archive data from which they are derived exist only as transient downloaded copies.
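A conversion step of the kind described above could look like the sketch below, which turns one archive record into a Julian Day and an SI value. The record layout (an ISO time string and a magnetic field value in nT) and the function names are assumptions for illustration; this is not the MIM translator or QTRAN. Times are treated as UTC and leap seconds are ignored, as in POSIX timestamps.

```python
from datetime import datetime, timezone

UNIX_EPOCH_JD = 2440587.5  # Julian Day of 1970-01-01T00:00:00 UTC
NT_TO_T = 1e-9             # nanotesla to tesla


def to_julian_day(t: datetime) -> float:
    """Convert a timezone-aware datetime to a Julian Day number."""
    return t.timestamp() / 86400.0 + UNIX_EPOCH_JD


def translate_record(iso_time: str, b_nt: float):
    """Turn one archive record (ISO time, field in nT) into (JD, field in T)."""
    # Assumption: the archive time string carries no zone and means UTC.
    t = datetime.fromisoformat(iso_time).replace(tzinfo=timezone.utc)
    return to_julian_day(t), b_nt * NT_TO_T


jd, b_tesla = translate_record("2000-01-01T12:00:00", 5.0)
print(jd)  # 2451545.0
```

A double-precision Julian Day near 2.45 million resolves time to a few tens of microseconds, which is worth keeping in mind for high time resolution data.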

Recommendations
• Even if you offer a sophisticated web protocol with graphical data selection and preview possibilities, also make the data accessible via an FTP server: it is the easiest solution for automated access.
• Make data available in ASCII table form, or a compressed version of it, or in CDF or HDF.
• Do not invent a new ad hoc format, such as CEF (Cluster Exchange Format).
• Data should always be accompanied by error estimates, covering both systematic and random errors.
• Offer adequate metadata.
• Provide documentation.
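To show how little client code a compressed ASCII table with error columns requires, here is a minimal reader sketch. The column names and the table layout are made up for the example and do not correspond to any particular archive; the compressed file is built in memory so that the sketch is self-contained.

```python
import gzip
import io


def read_table(stream):
    """Read whitespace-separated columns; first non-comment line is the header."""
    header, rows = None, []
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if header is None:
            header = fields
        else:
            rows.append(dict(zip(header, map(float, fields))))
    return rows


# Build a small compressed table in memory to stand in for an archive file.
TABLE = """\
# hypothetical layout: time, value, random error, systematic error
t_jd        bx_T     bx_rand_T  bx_sys_T
2454410.50  5.0e-9   1.0e-10    2.0e-10
2454410.51  5.2e-9   1.0e-10    2.0e-10
"""
blob = gzip.compress(TABLE.encode("ascii"))

with gzip.open(io.BytesIO(blob), "rt", encoding="ascii") as stream:
    rows = read_table(stream)
print(len(rows), rows[0]["bx_rand_T"])  # 2 1e-10
```

Carrying the random and systematic errors as ordinary columns keeps them attached to each value through every later processing step.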

Sharing model data
• Sharing model output is similar to sharing observations (calibrated observations are, after all, the output of an instrument model).
• It is essential to specify the systematic and random errors on the model output.
• Example: gradient computation from 4 non-coplanar data points, as is often done with Cluster, cannot provide an estimate of the total error on the gradient. The specified error margins usually reflect only the effect of measurement errors. Such limitations should be clearly stated when publishing model output.

• Sharing model parameters may warrant even more attention, since the meaning of the parameters might be less obvious.
• Example: modeling the chemistry in cometary comae is complicated. Among the input parameters is a database containing a compilation of relevant reactions and temperature-dependent reaction rates, including their uncertainties.
• Sharing model parameters is essential for comparing
  • model output obtained with different sets of model parameters;
  • model output obtained from different models, in order to be sure that the same input is used.

Recommendations
• Try to parameterize your models as much as possible. Do not hardcode model parameters.
• Offer the model parameter sets and the model results in a readable form; ASCII will often be preferred for the model parameters.
• Provide clear documentation about the model input parameters.
• Model output should be treated in the same way as observational data.
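One possible way to follow the first two recommendations is to keep the parameters in a small commented ASCII file and parse it at run time. The sketch below assumes a simple `name = value` layout with `#` comments; the format, the parameter names, and the values are all made up for illustration and are not taken from any actual coma model.

```python
def parse_parameters(text):
    """Parse 'name = value' lines; '#' starts a comment. Values become floats."""
    params = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"line {lineno}: expected 'name = value'")
        params[name.strip()] = float(value)
    return params


EXAMPLE = """\
# Hypothetical coma model parameter set (names and values are made up)
gas_production_rate = 1.0e28   # molecules per second
outflow_speed       = 800.0    # m/s
max_radius          = 1.0e8    # m
"""

params = parse_parameters(EXAMPLE)
print(params["outflow_speed"])  # 800.0
```

Because the comments travel with the values, the parameter file doubles as part of the documentation, and two model runs can be compared by comparing their parameter files.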

Sharing algorithms
• Sharing algorithms is still in its infancy.
  • There is no standard interface; it depends on the software environment into which the algorithm is to be incorporated.
  • Programming language and portability are an issue.
  • Documentation must be provided.
• Preference for high-level languages
  • Matlab, IDL routines: offer features that assist in defining and documenting the interface, and automatically ensure portability over a range of platforms
  • C++ library: also a portable format

• Sharing algorithms can be avoided if the algorithm is run on demand as a web service.
• Advantages
  • No portability issues
  • Version control is easy
  • Secrecy to safeguard commercial interests
• Disadvantages
  • The data have to be imported and the results have to be exported over the web: slow
  • The server must be powerful enough to run the service for all clients
  • The algorithm is not open for critical review; no improvements or extensions from other parties

Provide interactive on-line documentation for your algorithms, e.g. through a hypertext-based documentation system.

Recommendations: Algorithms
• Publish your algorithms and have them reviewed by as many people as possible.
• Describe algorithms in a high-level language, in terms of a number of simpler primitive operations, to ease implementation on different platforms.
• Carefully compare different algorithms to establish correctness, efficiency, and error propagation properties.
• Provide detailed documentation as well as test examples.

Conclusions
• There is a need for general-purpose packages for processing and visualizing space-related data. Data interpretation is a multi-instrument and multi-spacecraft activity, so mission-specific packages are too limited (though they can be useful for mission-specific archiving).
• Portability across a variety of platforms is desirable.
• Such packages should be well documented, easy to install, and have an intuitive graphical user interface.
• Computational efficiency is a must, since data volumes keep growing.

Such a package should support
• manual and automated data access;
• conversion of various formats;
• simultaneous processing of data from various sources, always including error estimates;
• commanding from an interactive graphical user interface as well as running batch jobs, i.e. it must implement some scripting language;
• documentation of observational data and model output data sets, including access to metadata;
• interactive definition, manipulation, and documentation of model input parameter sets;
• implementation and documentation of new algorithms.
