Cloud Computing for Ecological Modeling in the D4Science Infrastructure

A. Manzi (CERN), L. Candela, D. Castelli, G. Coro, P. Pagano, F. Sinibaldi (ISTI-CNR)
EGI Community Forum 2013, Manchester, 8-12 Apr 2013

Overview

Species Distribution Modeling
Species distribution models, which aim to estimate the presence of a species in a given area, are essential instruments in developing strategies and policies for the management and the sustainable and equitable use of living resources.
Two main issues to face:
• Need for large computing capabilities and appropriate modeling tools
• Need for both a sufficient amount of good-quality occurrence point datasets and suitable environmental datasets

The AquaMaps scenario
Model-based, large-scale predictions of the known natural occurrence of marine species. Predictions are made by matching species tolerances against local environmental conditions (e.g. salinity, temperature).
Computation is based on algorithms such as AquaMaps:
• Developed by Kaschner et al. (2006) to predict global distributions of marine mammals
• Produces color-coded species range maps on a half-degree latitude/longitude grid
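The matching of tolerances against local conditions can be illustrated with a minimal sketch. It assumes a trapezoidal response per environmental parameter (zero outside the tolerated range, one inside the preferred range, linear in between) and a product across parameters; the slides do not spell out the formula, and the names `trapezoid` and `suitability` are hypothetical.

```python
from typing import Dict, Tuple

# Assumed envelope layout for one parameter:
# (min tolerated, preferred min, preferred max, max tolerated)
Envelope = Tuple[float, float, float, float]


def trapezoid(value: float, env: Envelope) -> float:
    """Return a 0..1 response for one environmental parameter."""
    lo, pref_lo, pref_hi, hi = env
    if value < lo or value > hi:
        return 0.0  # outside the tolerated range
    if value < pref_lo:
        return (value - lo) / (pref_lo - lo)  # rising edge
    if value > pref_hi:
        return (hi - value) / (hi - pref_hi)  # falling edge
    return 1.0  # inside the preferred range


def suitability(cell: Dict[str, float], envelopes: Dict[str, Envelope]) -> float:
    """Multiply the per-parameter responses for one cell (assumed combination rule)."""
    p = 1.0
    for name, env in envelopes.items():
        p *= trapezoid(cell[name], env)
    return p


# Illustrative values only, not taken from any real species record.
example_envelopes = {
    "temperature": (5.0, 10.0, 20.0, 25.0),
    "salinity": (30.0, 33.0, 36.0, 38.0),
}
example_cell = {"temperature": 7.5, "salinity": 34.0}
example_probability = suitability(example_cell, example_envelopes)
```

In this sketch a cell scores zero as soon as one parameter falls outside the tolerated range, which is what makes a per-cell, per-species evaluation easy to run independently.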

The AquaMaps scenario: HCAF, HSPEN, HSPEC
• Species Environmental Envelope (HSPEN): range of environmental tolerance and preference of a species
• Cells Authority File (HCAF): metadata about half-degree cells (membership, physical attributes)
• Cells Species Assignments (HSPEC): probability of occurrence of a species in a given cell
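How the three tables relate can be sketched as follows. The record fields and the placeholder scoring rule (1.0 when every parameter is within range, else 0.0) are assumptions made for illustration; the real tables carry many more columns and the real scoring is graded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class HcafCell:
    """One half-degree cell with its environmental conditions (hypothetical fields)."""
    cell_id: str
    conditions: Dict[str, float]


@dataclass
class HspenEnvelope:
    """Tolerated (min, max) range per parameter for one species (hypothetical fields)."""
    species_id: str
    ranges: Dict[str, Tuple[float, float]]


@dataclass
class HspecRow:
    """Probability of occurrence of one species in one cell."""
    species_id: str
    cell_id: str
    probability: float


def in_range_score(cell: HcafCell, env: HspenEnvelope) -> float:
    """Placeholder scoring: 1.0 if all parameters are within range, else 0.0."""
    ok = all(lo <= cell.conditions[p] <= hi for p, (lo, hi) in env.ranges.items())
    return 1.0 if ok else 0.0


def generate_hspec(
    cells: Sequence[HcafCell],
    envelopes: Sequence[HspenEnvelope],
    score: Callable[[HcafCell, HspenEnvelope], float] = in_range_score,
) -> List[HspecRow]:
    """Score every species against every cell and keep the non-zero rows."""
    rows: List[HspecRow] = []
    for env in envelopes:
        for cell in cells:
            p = score(cell, env)
            if p > 0.0:
                rows.append(HspecRow(env.species_id, cell.cell_id, p))
    return rows
```

The nested loop makes the cost visible: the work grows with the number of species times the number of cells, which is why the HSPEC output is so much larger than either input table.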

The AquaMaps scenario
11,549 species (from FishBase): 2 days of sequential computation
Very large volume of input and output data [Eli E. Agbayani, FishBase Project/INCOFISH WP1, WorldFish Center]
• Fewer than 7,000 species:
• HSPEC native range = 56,468,301
• HSPEC suitable range = 114,989,360
• Estimate for 50,000 species:
• HSPEC native range = 350,000,000
• HSPEC suitable range = 715,000,000
Very large number of computations [N. Bailly, WorldFish Center]
• One multispecies map computed on 6,188 half-degree cells (out of over 170k) and 2,540 species requires 125 million computations
• One global map (extended to all species and cells around the world) requires about 400 billion computations
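The multispecies figure can be put in perspective by counting species-cell pairs. The sketch below uses only the numbers on the slide; the ratio it derives (roughly eight computations per pair) is an inference from those numbers, not something the slide states.

```python
def species_cell_pairs(cells: int, species: int) -> int:
    """Number of (species, cell) combinations to evaluate."""
    return cells * species


# Figures quoted on the slide for one multispecies map.
pairs = species_cell_pairs(cells=6_188, species=2_540)
reported_computations = 125_000_000

# Implied number of elementary computations per pair (inferred, not stated).
computations_per_pair = reported_computations / pairs
```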

D4Science Infrastructure
[Diagram labels: VRE (x5), VENUS-C, GeoNetwork, GENESI-DEC, AquaMaps, GBIF, EGI, OBIS, DRIVER]
Production-level infrastructure deployed and maintained during the D4Science (2007) and D4Science II (2009) projects

D4Science Infrastructure
D4Science hosts biodiversity communities federated by the iMarine and the EUBrazilOpenBio initiatives
D4Science will provide ENVRI RIs with seed resources
• Well suited for typical biodiversity processes like Ecological Modeling
• Provides access to:
• computational and storage resources offered by commercial cloud providers
• new storage technologies generally identified as NoSQL databases
• several algorithms for performing data analysis and mining
• Offers scalable platforms for data interoperability and efficient data management
• Offers a scalable infrastructure for efficient spatial data access, processing, and visualization

D4Science: examples of communities
• 1,920 collaborators
• 33 M hits/month
• 50 K unique visitors/month from 26 countries
• 400 experts
[Figure labels: AquaMaps, Observation Data, OpenModeller, Operational Data, Cloud]

gCube Framework
gCube is a Java service-oriented framework managing:
• creation and interconnection of e-Infrastructures in a controlled and highly configurable environment
• deployment of dynamic Virtual Research Environments
Enabling Layer
• Allows deployment of native components on Tomcat (hot deployments) and of gCube components on the Axis container (dynamic deployments)
• Implements optimal deployment and allocation of infrastructure components (automatic or admin-driven)

gCube Framework: Main Components
• Information System: a key service in a gCube-based infrastructure, since it offers functionality for publishing, monitoring, discovering and accessing the set of resources forming the infrastructure.
• Storage Manager: file storage is managed through a network of distributed storage nodes running document-oriented database software. In its current implementation the Storage Manager Library supports two document stores: MongoDB and Terrastore.
• Message Queue: this service is based on the Apache ActiveMQ message broker and supports a queue-based mechanism for distributing messages to consumers.
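The queue-based distribution of messages to consumers can be sketched with Python's standard library. This is an in-process stand-in, not the broker used by gCube: a `queue.Queue` plays the role of the broker queue, threads play the role of consumers, and the function name `run_consumers` is hypothetical.

```python
import queue
import threading
from typing import Dict, List


def run_consumers(messages: List[str], n_consumers: int = 3) -> Dict[str, List[str]]:
    """Deliver each message to exactly one of several competing consumers."""
    q: "queue.Queue" = queue.Queue()
    handled: Dict[str, List[str]] = {}
    lock = threading.Lock()
    stop = None  # sentinel telling a consumer to exit

    def consume(name: str) -> None:
        while True:
            msg = q.get()
            if msg is stop:
                break
            with lock:
                handled.setdefault(name, []).append(msg)

    workers = [
        threading.Thread(target=consume, args=(f"consumer-{i}",))
        for i in range(n_consumers)
    ]
    for w in workers:
        w.start()
    for m in messages:
        q.put(m)
    for _ in workers:
        q.put(stop)  # one sentinel per consumer
    for w in workers:
        w.join()
    return handled
```

Which consumer receives which message is not deterministic; the only guarantee in this sketch is that every message is handled once.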

gCube Framework: Main Components
• Executor: a key component for endowing a gCube-based infrastructure with cloud processing. It acts as a container for gCube tasks (implemented as plugins of the service), which can be dynamically deployed into the service and executed through its interface.
• Generic Worker: a task of the Executor used for cloud computations. It can execute "processes", either binary executables or scripts, along with their dependencies, in a sandbox.
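The idea of running a script together with its dependencies in an isolated working area can be sketched as below. This is an illustration only: the "sandbox" here is just a temporary directory that is deleted afterwards, with no security isolation, and `run_in_sandbox` is a hypothetical name rather than part of gCube.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def run_in_sandbox(
    command: List[str], files: Dict[str, str], timeout: float = 30.0
) -> Tuple[int, str]:
    """Stage files in a scratch directory, run the command there, then clean up.

    `files` maps plain file names (no sub-directories) to their text content.
    Returns the process exit code and its captured standard output.
    """
    workdir = Path(tempfile.mkdtemp(prefix="worker-"))
    try:
        for name, content in files.items():
            (workdir / name).write_text(content)
        result = subprocess.run(
            command, cwd=workdir, capture_output=True, text=True, timeout=timeout
        )
        return result.returncode, result.stdout
    finally:
        shutil.rmtree(workdir, ignore_errors=True)


# Usage: a script plus the input file it depends on.
exit_code, output = run_in_sandbox(
    [sys.executable, "task.py"],
    {
        "task.py": "print(open('input.txt').read().strip().upper())",
        "input.txt": "gadus morhua\n",
    },
)
```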

gCube Framework: Main Components
• Geospatial Data Manager: a service for discovering and accessing distributed environmental data and maps. It relies on maps stored on several GeoServer instances. A set of PostGIS databases stores the concrete values and geometries, and GeoServer distributes them through standard Open Geospatial Consortium (OGC) protocols such as Web Map Service (WMS), Web Coverage Service (WCS) and Web Feature Service (WFS). A GeoNetwork instance provides an OGC CSW-based search engine for retrieving metadata.
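As an illustration of the OGC access path, the sketch below builds a WMS 1.1.1 `GetMap` request URL using the standard request parameters. The host name and layer name are placeholders, not real D4Science endpoints, and `wms_getmap_url` is a hypothetical helper.

```python
from typing import Sequence
from urllib.parse import urlencode


def wms_getmap_url(
    base_url: str,
    layer: str,
    bbox: Sequence[float],
    width: int = 720,
    height: int = 360,
    fmt: str = "image/png",
) -> str:
    """Build a WMS 1.1.1 GetMap URL; bbox is (min_lon, min_lat, max_lon, max_lat)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value requests the default style
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": fmt,
    }
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint and layer name, for illustration only.
url = wms_getmap_url(
    "https://geoserver.example.org/wms",
    "species_distribution",
    (-180, -90, 180, 90),
)
```

With the default 720 x 360 image size and a global bounding box, each pixel corresponds to one half-degree cell.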

gCube Statistical Manager
[Diagram labels: D4Science Workspace, D4Science DB Source, External Features Sources, User’s DB, Storage Manager, Statistical Manager, CSV, Java Objects, HTTP Calls, SDMX]

Scope
The Statistical Manager is able to:
• Generate geographical probability models for species (e.g. AquaMaps)
• Perform transformations on data (e.g. interpolations)
• Perform data mining operations (e.g. modeling, clustering)
• Evaluate models, distributions and experiments (e.g. ROC curve, AUC, accuracy)
• Perform data quality analysis (e.g. Habitat Representativeness Score)
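The evaluation measures named above can be illustrated with a small, dependency-free sketch. It computes AUC as the fraction of presence/absence pairs that the model ranks correctly (ties count as half), which is equivalent to the area under the ROC curve, and accuracy at a fixed threshold. The function names and the sample data are illustrative and not part of the Statistical Manager.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise ranking (1 = presence, 0 = absence)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one presence and one absence")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Share of points classified correctly when predicting presence at score >= threshold."""
    correct = sum((s >= threshold) == bool(l) for l, s in zip(labels, scores))
    return correct / len(labels)


# Illustrative data: two presence points and two absence points.
sample_labels = [1, 1, 0, 0]
sample_scores = [0.9, 0.4, 0.6, 0.1]
sample_auc = auc(sample_labels, sample_scores)
sample_accuracy = accuracy(sample_labels, sample_scores)
```

The pairwise form is quadratic in the number of points, so it suits small validation sets; a rank-based computation would be preferable for large ones.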

Architecture

Advanced Graphical Interfaces

D4Science Cloud Processing

Statistical Manager & AquaMaps
The Statistical Manager is instantiated with the AquaMaps algorithm:
• Data generation is up to 50 times faster on the D4Science cloud
• Adds the generation and publication of GIS layers representing the species distribution
• Supports generation of transects
• Supports dataset management facilities
• Solves scalability issues
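A speed-up of this kind typically comes from splitting the species list into chunks that are processed independently. The sketch below shows that pattern under the assumption that each species can be computed without reference to the others; local threads stand in for remote worker nodes, and `process_in_parallel` is a hypothetical name, not the D4Science scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def chunked(items: Sequence, size: int) -> List[list]:
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def process_in_parallel(
    species_ids: Sequence,
    process_chunk: Callable[[list], list],
    chunk_size: int = 100,
    workers: int = 4,
) -> list:
    """Run `process_chunk` on each chunk concurrently and merge results in order."""
    chunks = chunked(species_ids, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_chunk, chunks))  # map preserves chunk order
    return [row for chunk_rows in results for row in chunk_rows]


# Usage with a trivial stand-in for the per-chunk computation.
doubled = process_in_parallel(list(range(10)), lambda chunk: [x * 2 for x in chunk], chunk_size=3)
```

Chunk size is the main tuning knob in such a design: small chunks balance load better, large chunks reduce scheduling and data-transfer overhead.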

Conclusions
Ecological Modeling in D4Science:
• Performs modeling using Cloud Computing in a way that is transparent to users
• Takes care of parallelization issues
• Evaluates model performance
Next step:
• Transparent generation of geospatial features at different resolutions, by implementing geospatial data processing on cloud computing facilities endowed with a WPS protocol interface

AppliFish
iMarine application for iOS and Android to discover over 500 marine species from around the world and stay informed on iMarine news & activities.
Try AppliFish! Go mobile with iMarine.

Thanks for your attention
Questions?
www.i-marine.org
www.d4science.org
https://portal.i-marine.d4science.org
