This overview discusses the use of cloud computing for ecological modeling in the D4Science infrastructure. It highlights the main issues faced in species distribution modeling and introduces the AquaMaps scenario as a model-based prediction tool. The D4Science infrastructure, powered by the gCube framework, provides a scalable platform for efficient data management, analysis, and visualization.
 
                
Cloud Computing for Ecological Modeling in the D4Science Infrastructure
A. Manzi (CERN), L. Candela, D. Castelli, G. Coro, P. Pagano, F. Sinibaldi (ISTI-CNR)
EGI Community Forum 2013, Manchester, 8-12 Apr 2013
Species Distribution Modeling
Species distribution models, which aim at estimating the presence of a species in a given area, are essential instruments in the development of strategies and policies for the management and the sustainable and equitable use of living resources.
Two main issues to face:
• Need for large computing capabilities and appropriate modeling tools
• Need for both a sufficient amount of good-quality occurrence point datasets and suitable environmental datasets
The AquaMaps scenario
Model-based, large-scale predictions of the known natural occurrence of marine species.
Predictions are made by matching species tolerances against local environmental conditions (e.g. salinity, temperature).
Computation is based on algorithms such as AquaMaps:
• Developed by Kaschner et al. (2006) to predict global distributions of marine mammals
• Produces color-coded species range maps on a grid of half-degree latitude/longitude cells
The AquaMaps scenario: HCAF, HSPEN, HSPEC
• Species Environmental Envelope (HSPEN): range of environmental tolerance and preference of a species
• Cells Authority File (HCAF): metadata about half-degree cells (membership, physical attributes)
• Cells Species Assignments (HSPEC): probability of occurrence of a species in a given cell
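The matching of species tolerances against cell conditions can be sketched as below. This is a minimal illustration, assuming a trapezoidal response per environmental variable (0 outside the tolerated range, 1 inside the preferred range, linear in between) and a multiplicative combination across variables. The `Envelope` fields, the sample envelope values, and the sample cells are hypothetical and are not taken from the real HSPEN or HCAF data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Tolerance of one species for one variable (hypothetical HSPEN-like record)."""
    minimum: float
    pref_min: float
    pref_max: float
    maximum: float


def suitability(value: float, env: Envelope) -> float:
    """Trapezoidal response: 0 outside [minimum, maximum], 1 inside the preferred range."""
    if value < env.minimum or value > env.maximum:
        return 0.0
    if value < env.pref_min:
        return (value - env.minimum) / (env.pref_min - env.minimum)
    if value > env.pref_max:
        return (env.maximum - value) / (env.maximum - env.pref_max)
    return 1.0


def cell_probability(cell: dict, envelopes: dict) -> float:
    """Multiply the per-variable suitabilities of one cell (HCAF-like record)."""
    probability = 1.0
    for variable, env in envelopes.items():
        probability *= suitability(cell[variable], env)
    return probability


# Hypothetical envelope of one species and three hypothetical half-degree cells.
envelopes = {
    "temperature": Envelope(5.0, 10.0, 20.0, 25.0),
    "salinity": Envelope(30.0, 33.0, 36.0, 38.0),
}
cells = {
    "cell_a": {"temperature": 15.0, "salinity": 35.0},  # inside preferred ranges
    "cell_b": {"temperature": 7.5, "salinity": 35.0},   # halfway up the temperature ramp
    "cell_c": {"temperature": 30.0, "salinity": 35.0},  # too warm
}

# HSPEC-like output: one probability per (species, cell) pair.
hspec = {cell_id: cell_probability(cell, envelopes) for cell_id, cell in cells.items()}
```

Each cell is independent of the others, which is what makes this kind of computation easy to split across many workers.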
The AquaMaps scenario
• 11,549 species (from FishBase)
• 2 days of sequential computation
Very large volume of input and output data [Eli E. Agbayani, FishBase Project/INCOFISH WP1, WorldFish Center]:
• Fewer than 7,000 species: HSPEC native range = 56,468,301; HSPEC suitable range = 114,989,360
• Estimate for 50,000 species: HSPEC native range = 350,000,000; HSPEC suitable range = 715,000,000
Very large number of computations [N. Bailly, WorldFish Center]:
• One multispecies map computed on 6,188 half-degree cells (out of over 170k) and 2,540 species requires 125 million computations
• One global map (extended to all species and cells around the world) requires about 400 billion computations
D4Science Infrastructure
[Diagram labels: VRE (×5), VENUS-C, GeoNetwork, GENESI-DEC, AquaMaps, GBIF, EGI, OBIS, DRIVER]
Production-level infrastructure deployed and maintained during the D4Science (2007) and D4Science II (2009) projects
D4Science Infrastructure
D4Science hosts biodiversity communities federated by the iMarine and EUBrazilOpenBio initiatives.
D4Science will provide ENVRI RIs with seed resources.
• Well suited for typical biodiversity processes like ecological modeling
• Provides access to:
  • computational and storage resources offered by commercial cloud providers
  • new storage technologies generally identified as NoSQL databases
  • several algorithms for performing data analysis and mining
• Offers scalable platforms for data interoperability and efficient data management
• Offers a scalable infrastructure for efficient spatial data access, processing, and visualization
D4Science: example of communities
[Figure labels: AquaMaps, Observation Data, OpenModeller, Operational Data, Cloud]
• 1920 collaborators
• 33 M hits/month
• 50 K unique visitors/month from 26 countries
• 400 experts
gCube Framework
gCube is a Java service-oriented framework managing:
• creation and interconnection of e-Infrastructures in a controlled and highly configurable environment
• deployment of dynamic Virtual Research Environments
Enabling Layer
• Allows deployment of:
  • native components on Tomcat (hot deployments)
  • gCube components on the Axis container (dynamic deployments)
• Implements optimal deployment and allocation of infrastructure components (automatic or admin-driven)
gCube Framework: Main Components
• Information System: a key service in a gCube-based infrastructure, offering functionality for publishing, monitoring, discovering, and accessing the resources that form the infrastructure.
• Storage Manager: file storage is managed over a network of distributed storage nodes running document-oriented database software. In its current implementation, the Storage Manager Library supports two document stores: MongoDB and Terrastore.
• Message Queue: based on the Apache ActiveMQ message broker, this service provides a queue-based mechanism for distributing messages to consumers.
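The queue-based distribution of messages to consumers can be illustrated with a small in-process sketch. It uses Python's standard `queue.Queue` and threads purely as a stand-in for a message broker; it does not use or imitate the broker's API, and the function name, consumer names, and message payloads are hypothetical.

```python
import queue
import threading


def run_consumers(messages, n_consumers=3):
    """Distribute messages to competing consumers; return (consumer, message) pairs."""
    q = queue.Queue()
    delivered = []
    lock = threading.Lock()

    def consume(name):
        while True:
            message = q.get()
            if message is None:  # sentinel: no more work for this consumer
                q.task_done()
                break
            with lock:
                delivered.append((name, message))
            q.task_done()

    workers = [
        threading.Thread(target=consume, args=(f"consumer-{i}",))
        for i in range(n_consumers)
    ]
    for worker in workers:
        worker.start()
    for message in messages:
        q.put(message)
    for _ in workers:  # one sentinel per consumer
        q.put(None)
    q.join()
    for worker in workers:
        worker.join()
    return delivered


# Each message is handled by exactly one consumer; which one is not deterministic.
delivered = run_consumers([f"task-{i}" for i in range(10)])
```

The producer never addresses a specific consumer, so consumers can be added or removed without changing the producing side.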
gCube Framework: Main Components
• Executor: a key component for endowing a gCube-based infrastructure with cloud processing. It acts as a container for gCube tasks (deployed as plugins of the service), which can be dynamically deployed into the service and executed through its interface.
• Generic Worker: an Executor task used in cloud computations. It executes "processes", either binary executables or scripts, along with their dependencies, in a sandbox.
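The idea of running a script together with its dependencies in an isolated working area can be sketched as follows. This is only an illustration of the pattern, not the Generic Worker's implementation: the "sandbox" here is merely a temporary scratch directory with a timeout, which provides no real security isolation, and the function name and sample files are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_scratch_dir(script_source, dependencies, timeout=30):
    """Stage a script and its dependency files in a temporary directory and run it there."""
    with tempfile.TemporaryDirectory() as workdir:
        work = Path(workdir)
        # Stage the dependencies next to the script.
        for name, content in dependencies.items():
            (work / name).write_text(content)
        script = work / "task.py"
        script.write_text(script_source)
        completed = subprocess.run(
            [sys.executable, str(script)],
            cwd=work,             # the process sees only the staged files by relative path
            capture_output=True,
            text=True,
            timeout=timeout,      # avoid runaway tasks
            check=True,           # raise if the task exits with a non-zero status
        )
        return completed.stdout


# Hypothetical task: read a staged input file and print it in upper case.
output = run_in_scratch_dir(
    "print(open('species.txt').read().strip().upper())",
    {"species.txt": "gadus morhua\n"},
)
```

The scratch directory is deleted when the call returns, so a task cannot leave files behind for the next one.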
gCube Framework: Main Components
• Geospatial Data Manager: a service for discovering and accessing distributed environmental data and maps. It relies on maps stored on several GeoServer instances. A set of PostGIS databases stores the concrete values and geometries, and GeoServer distributes them through standard Open Geospatial Consortium (OGC) protocols such as Web Map Service (WMS), Web Coverage Service (WCS), and Web Feature Service (WFS). A GeoNetwork instance provides an OGC CSW-based search engine for retrieving metadata.
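As an illustration of how a client might request a map through WMS, the sketch below builds a WMS 1.1.1 `GetMap` URL from the standard query parameters. The host name and the layer name are hypothetical placeholders, not actual D4Science endpoints or layers.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def wms_getmap_url(base_url, layer, bbox, width, height):
    """Build a WMS 1.1.1 GetMap request URL for a single layer in EPSG:4326."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value requests the default style
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and layer; a 720x360 image gives one pixel per half-degree cell.
url = wms_getmap_url(
    "https://geoserver.example.org/wms",
    "aquamaps:example_species",
    (-180, -90, 180, 90),
    720,
    360,
)

# Parse the query string back to inspect what would be sent.
query = parse_qs(urlsplit(url).query)
```

A 720 by 360 pixel image over the whole globe corresponds to one pixel per half-degree cell, the resolution used by AquaMaps.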
gCube Statistical Manager
[Diagram labels: D4Science Workspace, D4Science DB Source, External Features Sources, User's DB, Storage Manager, Statistical Manager, CSV, Java Objects, HTTP calls, SDMX]
Scope
The Statistical Manager is able to:
• Generate geographical probability models for species (e.g. AquaMaps)
• Perform transformations on data (e.g. interpolations)
• Perform data mining operations (e.g. modeling, clustering)
• Evaluate models, distributions, and experiments (e.g. ROC curve, AUC, accuracy)
• Perform data quality analysis (e.g. Habitat Representativeness Score)
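To make the model evaluation item concrete, the sketch below computes the AUC of a distribution model from predicted probabilities at known presence and absence points. It uses the rank-based definition (the fraction of presence/absence pairs in which the presence point scores higher, with ties counted as one half). The function name and the sample scores are hypothetical; this is not the Statistical Manager's own evaluation code.

```python
def auc(scores_presence, scores_absence):
    """Area under the ROC curve via pairwise comparison of presence and absence scores."""
    if not scores_presence or not scores_absence:
        raise ValueError("both presence and absence scores are required")
    wins = 0.0
    for p in scores_presence:
        for a in scores_absence:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5  # ties count as half a win
    return wins / (len(scores_presence) * len(scores_absence))


# Hypothetical predicted probabilities at three presence and three absence points.
presence = [0.9, 0.8, 0.4]
absence = [0.7, 0.3, 0.1]
model_auc = auc(presence, absence)  # 8 of the 9 pairs are ranked correctly
```

An AUC of 0.5 corresponds to a model that ranks presences no better than chance, and 1.0 to a model that ranks every presence above every absence.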
Statistical Manager & AquaMaps
The Statistical Manager is instantiated with the AquaMaps algorithm:
• Data generation is up to 50 times faster on the D4Science cloud
• Adds the generation and publication of GIS layers representing the species distribution
• Supports generation of transects
• Supports dataset management facilities
• Solves scalability issues
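One common way to obtain this kind of speed-up is to split the species list into chunks and process the chunks in parallel. The sketch below shows that pattern only; it is an assumption about the general approach, not a description of how D4Science partitions the work. A local thread pool stands in for remote cloud workers, and `process_chunk` is a hypothetical placeholder for the per-species computation.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(species_ids):
    """Placeholder for the real work: compute the output rows for each species."""
    return {sid: f"HSPEC rows for species {sid}" for sid in species_ids}


def run_partitioned(species_ids, chunk_size, max_workers=4):
    """Process chunks concurrently and merge the partial results."""
    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(process_chunk, chunks(species_ids, chunk_size)):
            merged.update(partial)
    return merged


# Ten hypothetical species identifiers, processed three at a time.
result = run_partitioned(list(range(10)), chunk_size=3)
```

Because each species is computed independently, the partial results can be merged in any order without coordination between workers.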
Conclusions
• Ecological modeling in D4Science:
  • Performs modeling using cloud computing in a way that is transparent to users
  • Takes care of parallelization issues
  • Evaluates model performance
• Next step:
  • Transparent generation of geospatial features at different resolutions, by implementing geospatial data processing on cloud computing facilities exposed through a WPS protocol interface
AppliFish
iMarine application for iOS and Android to discover over 500 marine species worldwide and stay informed on iMarine news and activities.
Try AppliFish! (iOS, Android)
Go mobile with iMarine
Thanks for your attention
Questions?
www.i-marine.org
www.d4science.org
https://portal.i-marine.d4science.org