Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Presentation Transcript


  1. Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis • Chad Berkley, National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara • Collaborating institutions: Long Term Ecological Research Network Office (University of New Mexico), University of Kansas, San Diego Supercomputer Center • http://seek.ecoinformatics.org • December 4, 2003, Edinburgh, Scotland

  2. Outline • Quick history • SEEK overview • Ecological Metadata Language • Using workflows in Ecology • Workflow editing with Kepler • Future visions

  3. History • Late 1990s – patterns noticed in the problems surrounding data synthesis at NCEAS • 1999 – Michener et al. paper on ecological metadata • 2000 – Knowledge Network for Biocomplexity (KNB): Morpho, Metacat, Ecological Metadata Language; some footholds into workflow creation and execution • 2003 – Scientific Environment for Ecological Knowledge (SEEK) grant: continues the work done under the KNB grant, with an emphasis on using metadata for advanced data processing

  4. SEEK approach • A general approach to specific ecological problems • Data described with adequate metadata in a grid-accessible repository • An ontology-based reasoning engine to locate and extract data and processes • A modeling system to put it all together and control execution flow

  5. SEEK Components • Ecogrid • Analysis Library • Metadata and data repository • Semantic Mediation System • Controlled semantic vocabulary • Ontological discovery system • Analysis and Modeling System (Kepler) • Workflow control system • Utilizes resources from other components

  6. SEEK Architecture

  7. Ecological Metadata Language • A common language for archiving and transporting datasets • XML-based • Designed by and for the ecological community • Describes the physical and logical structure of data • Also includes project, literature, and software information • SEEK will add semantic information

  8. Workflows in SEEK • In the SEEK model, data ingestion/cleaning is metadata-driven (specifically, driven by EML) • Output generation includes creating the appropriate metadata • The analysis pipeline itself becomes metadata

  9. Metadata-driven data ingestion • The key information needed to read and machine-process a data file is in the metadata • File descriptors (CSV, Excel, RDBMS, etc.) • Entity (table) and Attribute (column) descriptions • Name • Type (integer, float, string, etc.) • Codes (missing values, nulls, etc.) • In the future, this will include semantic typing
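
To make the metadata-driven approach concrete, the following is a minimal Java sketch (hypothetical code, not the SEEK implementation) in which attribute descriptions of the kind EML carries — attribute name, type, and missing-value code — drive how each column of a delimited data file is parsed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: attribute descriptions (as EML would supply them)
// drive how each column of a delimited record is parsed.
public class MetadataDrivenParser {

    /** A column description, mirroring the kind of information EML carries. */
    record Attribute(String name, String type, String missingValueCode) {}

    private final List<Attribute> attributes;
    private final String delimiter;

    public MetadataDrivenParser(List<Attribute> attributes, String delimiter) {
        this.attributes = attributes;
        this.delimiter = delimiter;
    }

    /** Parse one data row into typed values keyed by attribute name. */
    public Map<String, Object> parseRow(String row) {
        String[] fields = row.split(delimiter, -1);
        Map<String, Object> values = new HashMap<>();
        for (int i = 0; i < attributes.size(); i++) {
            Attribute att = attributes.get(i);
            String raw = fields[i].trim();
            if (raw.equals(att.missingValueCode())) {
                values.put(att.name(), null);             // missing-value code -> null
            } else if (att.type().equals("integer")) {
                values.put(att.name(), Integer.parseInt(raw));
            } else if (att.type().equals("float")) {
                values.put(att.name(), Double.parseDouble(raw));
            } else {
                values.put(att.name(), raw);              // fall back to string
            }
        }
        return values;
    }

    public static void main(String[] args) {
        var parser = new MetadataDrivenParser(
            List.of(new Attribute("site", "string", "NA"),
                    new Attribute("count", "integer", "-999"),
                    new Attribute("area_m2", "float", "-999")),
            ",");
        System.out.println(parser.parseRow("plot7,42,25.0"));   // typed values
        System.out.println(parser.parseRow("plot8,-999,25.0")); // count -> null
    }
}
```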

  10. Metadata revision • Metadata is revised following any transformation • Versioning of metadata and data is very important • This process results in a lineage of the data file as it has been transformed
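
As a rough sketch of what such a lineage could be recorded as (an illustration only; the identifiers and field names are assumptions, not the KNB/SEEK schema), each transformation step can be captured as a record pointing back at the versions it was derived from:

```java
import java.time.Instant;
import java.util.List;

// Hypothetical sketch of a lineage record: each transformation step produces a
// new data/metadata version that points back at its inputs, so the full
// history of a derived data file can be reconstructed.
public record LineageEntry(
        String dataPackageId,      // e.g. "knb.123.2" (identifier.revision)
        List<String> derivedFrom,  // identifiers of the input versions
        String transformation,     // description of the workflow step applied
        Instant createdAt) {}
```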

  11. Typical ecological workflow example • Workflows can automate the integration process if the data are described with adequate, structured metadata

  12. Homogeneous data integration • Integration of homogeneous or mostly homogeneous data via EML metadata is relatively straightforward

  13. Heterogeneous data integration • Integration of heterogeneous data requires much more advanced metadata and processing • Attributes must be semantically typed • Collection protocols must be known • Units and measurement scale must be known • Measurement mechanics must be known (e.g., that Density = Count / Area)
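
A small, hypothetical Java example of why this extra knowledge matters: two sites report abundance differently (a raw count over a known plot area versus a density per hectare), and only by knowing the measurement mechanics (Density = Count / Area) and the units can both be brought onto a common scale before integration. The helper names are illustrative, not part of SEEK.

```java
// Hypothetical sketch: knowing the measurement mechanics and units lets two
// heterogeneous abundance measurements be expressed on a common scale
// (individuals per square metre) before they are integrated.
public class DensityExample {

    static final double SQUARE_METRES_PER_HECTARE = 10_000.0;

    /** Derive density from a count and the sampled area in square metres. */
    static double densityPerSquareMetre(double count, double areaSquareMetres) {
        return count / areaSquareMetres;
    }

    /** Convert a density reported per hectare to per square metre. */
    static double perHectareToPerSquareMetre(double densityPerHectare) {
        return densityPerHectare / SQUARE_METRES_PER_HECTARE;
    }

    public static void main(String[] args) {
        double siteA = densityPerSquareMetre(42, 25.0);      // count + plot area
        double siteB = perHectareToPerSquareMetre(16_800.0); // density per hectare
        System.out.printf("site A: %.2f /m^2, site B: %.2f /m^2%n", siteA, siteB);
    }
}
```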

  14. Semantic typing and ontologies • Label data with semantic types • Label the inputs and outputs of analytical components with semantic types • Use the Semantic Mediation System (SMS) to generate transformation steps • Beware analytical constraints • Use SMS to discover relevant components • Ontology – a specification of a conceptualization (a knowledge map) • (Diagram: Data – Ontology – Workflow Components)
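
The following is a deliberately simplified, hypothetical illustration of the idea (it is not the SEEK Semantic Mediation System): ports carry semantic type labels, and a small registry of known transformations is consulted when the output type of one step does not match the input type of the next.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.DoubleUnaryOperator;

// Hypothetical illustration only: ports are labelled with semantic types drawn
// from an ontology, and a registry of known transformations is consulted when
// an output type does not match the input type of the next workflow step.
public class SemanticMatcher {

    /** Registry of known transformations between semantic types. */
    private static final Map<String, DoubleUnaryOperator> CONVERTERS = Map.of(
        "TemperatureFahrenheit->TemperatureCelsius", f -> (f - 32.0) * 5.0 / 9.0
    );

    /**
     * Return the transformation needed to connect two ports: an identity
     * function if the types already match, or empty if no path is known.
     */
    static Optional<DoubleUnaryOperator> mediate(String outputType, String inputType) {
        if (outputType.equals(inputType)) {
            return Optional.of(DoubleUnaryOperator.identity());
        }
        return Optional.ofNullable(CONVERTERS.get(outputType + "->" + inputType));
    }

    public static void main(String[] args) {
        mediate("TemperatureFahrenheit", "TemperatureCelsius")
            .ifPresent(f -> System.out.println(f.applyAsDouble(98.6))); // 37.0
    }
}
```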

  15. Measurement Ontology • Density is part of a larger measurement ontology • SEEK’s intent is to create one or more community-created ecological ontologies • This creates a controlled vocabulary for ecological metadata • More about this in Bertram’s talk

  16. About Kepler • Kepler is the name of the SEEK/SDM additions to the Ptolemy modeling system • Ptolemy was designed by the UC Berkeley EECS department • Its primary use is modeling electrical engineering (EE) circuits • Free, open-source, pure Java • Flexible design GUI for building workflows

  17. Kepler • A Kepler model consists of linked “actors” (which correspond to workflow steps) • Timing is controlled by a “director” • All actors are written in Java but can call other applications (such as SAS and MATLAB) or native-language code via JNI • Actors can call arbitrary Web (or Grid) services • Ptolemy already has a very large inventory of actors • Easy-to-use drag-and-drop interface
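
As a rough illustration of what an actor looks like in code, here is a minimal atomic actor written in the style of the Ptolemy II API (TypedAtomicActor, TypedIOPort, DoubleToken). The class and method names follow the publicly documented Ptolemy II framework, but the exact signatures should be treated as assumptions that may vary between releases; this is a sketch, not code from the talk.

```java
import ptolemy.actor.TypedAtomicActor;
import ptolemy.actor.TypedIOPort;
import ptolemy.data.DoubleToken;
import ptolemy.data.type.BaseType;
import ptolemy.kernel.CompositeEntity;
import ptolemy.kernel.util.IllegalActionException;
import ptolemy.kernel.util.NameDuplicationException;

/** Sketch of an actor that reads a count and an area and emits their ratio. */
public class DensityActor extends TypedAtomicActor {

    public TypedIOPort count;
    public TypedIOPort area;
    public TypedIOPort density;

    public DensityActor(CompositeEntity container, String name)
            throws IllegalActionException, NameDuplicationException {
        super(container, name);
        // Two input ports and one output port, all typed as double.
        count = new TypedIOPort(this, "count", true, false);
        area = new TypedIOPort(this, "area", true, false);
        density = new TypedIOPort(this, "density", false, true);
        count.setTypeEquals(BaseType.DOUBLE);
        area.setTypeEquals(BaseType.DOUBLE);
        density.setTypeEquals(BaseType.DOUBLE);
    }

    @Override
    public void fire() throws IllegalActionException {
        super.fire();
        // Read one token from each input channel and emit the derived density.
        double c = ((DoubleToken) count.get(0)).doubleValue();
        double a = ((DoubleToken) area.get(0)).doubleValue();
        density.send(0, new DoubleToken(c / a));
    }
}
```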

  18. SEEK Contributions to Kepler (so far) • EML data ingestion actor • Actor design tool

  19. EML data ingestion actor • Ingests any data format described by EML metadata • Converts raw data to Kepler format • Data can then be operated on with other actors • Produces one output port for each attribute in the dataset • Individual attributes can then be mapped to other actors
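
The sketch below illustrates the one-port-per-attribute idea only; it is not the actual SEEK EML ingestion actor, and the constructor taking a list of attribute names is an assumption made for brevity.

```java
import java.util.List;
import ptolemy.actor.TypedAtomicActor;
import ptolemy.actor.TypedIOPort;
import ptolemy.kernel.CompositeEntity;
import ptolemy.kernel.util.IllegalActionException;
import ptolemy.kernel.util.NameDuplicationException;

// Sketch of the idea only (not the actual SEEK EML ingestion actor): after the
// EML metadata has been parsed, one output port is created per attribute so
// that individual columns can be wired to downstream actors.
public class EmlStyleIngestionSketch extends TypedAtomicActor {

    public EmlStyleIngestionSketch(CompositeEntity container, String name,
                                   List<String> attributeNames)
            throws IllegalActionException, NameDuplicationException {
        super(container, name);
        for (String attributeName : attributeNames) {
            // One output port per attribute described in the metadata.
            new TypedIOPort(this, attributeName, false, true);
        }
    }

    // fire() would read the raw data file as described by its EML metadata,
    // convert each record, and send one token per attribute on its port.
}
```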

  20. Ptolemy model with EML ingestion actor

  21. SEEK Contributions to Kepler (so far) • EML data ingestion actor • Actor design tool

  22. Actor design tool • Allows “place-holder” actors to be defined on the fly by non-programmers during workflow creation • Domain scientists can thereby create workflows without programming knowledge • Workflows created with these actors can be executed once their functionality is implemented by a programmer • Allows quick prototyping of workflows by domain scientists • “Place-holder” actors can still be linked to other working actors
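
A hypothetical illustration of what a “place-holder” actor amounts to (not the output of the SEEK actor design tool): the ports the domain scientist defined exist and can be wired into a workflow, but firing the actor reports that the implementation is still pending.

```java
import ptolemy.actor.TypedAtomicActor;
import ptolemy.actor.TypedIOPort;
import ptolemy.kernel.CompositeEntity;
import ptolemy.kernel.util.IllegalActionException;
import ptolemy.kernel.util.NameDuplicationException;

// Hypothetical "place-holder" actor: ports are defined so the actor can be
// linked into a workflow, but firing it signals that a programmer still has
// to supply the implementation.
public class PlaceholderActor extends TypedAtomicActor {

    public TypedIOPort input;
    public TypedIOPort output;

    public PlaceholderActor(CompositeEntity container, String name)
            throws IllegalActionException, NameDuplicationException {
        super(container, name);
        input = new TypedIOPort(this, "input", true, false);
        output = new TypedIOPort(this, "output", false, true);
    }

    @Override
    public void fire() throws IllegalActionException {
        throw new IllegalActionException(this,
            "Place-holder actor: functionality not yet implemented.");
    }
}
```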

  23. Ptolemy and dynamically created actor

  24. How domain scientists will benefit • More fully automated integration systems • A library of pre-defined analytical processes that can be executed on heterogeneous data • Semantic data discovery and processing • Automated unit and measurement-scale conversions • A fuller understanding of the implications of cross-site research

  25. Acknowledgements • More info: http://seek.ecoinformatics.org • Questions? IRC: irc.ecoinformatics.org #seek • This material is based upon work supported by: the National Science Foundation under Grant Numbers 9980154, 9904777, and 0225676 to NCEAS and its collaborators; and the National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus • Primary collaborators: University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)
