Taxonomy Techniques in Cloud Computing for Scientific Applications

Abstract • In the last few years, cloud computing has emerged as a computational paradigm that enables scientists to build more complex scientific applications to manage large data sets or high-performance applications, based on distributed resources. By following this paradigm, scientists may use distributed resources (infrastructure, storage, database, and applications) without having to deal with implementation or configuration details.

Abstract • In fact, there are many cloud computing environments available for use. Despite the fast growth and adoption of cloud computing, there is no consensus on its definition. This makes it very difficult to comprehend the cloud computing field as a whole and to correlate, classify, and compare the various existing proposals. Over the years, taxonomy techniques have been used to create models that allow for the classification of concepts within a domain.

Abstract • The main objective of this chapter is to apply taxonomy techniques in the cloud computing domain. This chapter discusses many aspects involved with cloud computing that are important from a scientific perspective. It contributes by proposing a taxonomy based on characteristics that are fundamental for scientific applications typically associated with the cloud paradigm.

3.1 Introduction • The evolution of computer science in the last decade enabled the advent of e-Science, which is entirely carried out in computational environments. The term “e-Science” is strictly related to in silico experiments.

3.1 Introduction • The development of technologies such as grids fostered the popularity of e-Science and, consequently, of in silico experiments. In silico experiments are commonly found in many scientific domains, such as oil exploration. An in silico experiment is conducted by a scientist, who is responsible for managing the entire experiment, which comprises composing, executing, and analyzing it.

3.1 Introduction • Currently, most of the work of scientists during an in silico experiment is related to the execution of a sequence of programs. Each program produces a collection of data with certain semantics. These data are used as input to the next program to be executed in the chain. The chaining of these programs may become unfeasible without systematic computational support.

3.1 Introduction • A scientific workflow may be defined, following Mattoso et al., as an abstraction that allows the structured composition of programs and data as a sequence of operations aiming at a desired result.

3.1 Introduction • Simultaneously, in the last few years, cloud computing emerged as a new computation paradigm in which web-based services enable different kinds of users to obtain a huge variety of capabilities in infrastructure, software, and hardware, without having to deal with configuration and implementation details.

3.1 Introduction • The programs and data (that are fundamental parts of scientific workflows) are moving from local environments to the cloud. Foster et al. examined the differences between grid and cloud computing, offering a good foundation to categorize the existing cloud computing projects and/or services. They define cloud computing as “A large-scale distributed computing paradigm that is driven by economies of scale, in which a pool of abstracted, virtualized, dynamically-scalable, managed computing power, storage, platforms, and services are delivered on demand to external customers over the Internet.”

3.1 Introduction • The main advantage of cloud computing is that the average user is able to access a great variety of resources without having to acquire or configure the whole infrastructure. This is a fundamental need for scientific applications, since scientists can be isolated from the complexity of the environment and focus only on their in silico experiments.

3.1 Introduction • The volume of published white papers and scientific papers shows that cloud computing has emerged and is already being adopted by some scientific projects. Several technologies, platforms, applications, infrastructures, and standards have been proposed. However, the concepts involved with cloud computing are not fully detailed or explained. Considering the growing interest in cloud computing and the difficulty in finding organized definitions of concepts associated with this paradigm, we present in this chapter a taxonomy for cloud computing from an e-Science perspective.

3.1 Introduction • Taxonomies are a particular classification structure where concepts are arranged in a hierarchical way. The proposed cloud taxonomy provides an understanding of the domain and aims to help scientists when comparing different cloud computing environments. The cloud computing e-Science taxonomy presented in this chapter is useful for the scientific community to classify and compare the different cloud computing environments that are available for use.

3.1 Introduction • By consulting this taxonomy, scientists may consider the features that meet their needs, which may vary depending on the scientific experiment being conducted. The taxonomy considers a broad view of cloud computing, comprising all its major issues. Using the proposed taxonomy as a common vocabulary may help scientists find common characteristics of the existing environments and choose the most adequate cloud environment.

3.2 Scientific Workflows and e-Science • This section presents the main definitions regarding e-Science and scientific workflow concepts. These concepts are presented along with some important aspects to be considered when modeling or executing scientific experiments using cloud computing. These aspects are used as a basis for elaborating the classes of the cloud computing taxonomy.

3.2.1 Scientific Workflows • According to the Workflow Management Coalition, a workflow may be defined as “the automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules.” A workflow defines the order of task invocation or conditions under which tasks must be invoked and the task synchronization. This definition is related to business workflow; however, it can be exploited in the scientific domain, where tasks will be related to scientific applications instead of business ones. An example of scientific workflow is presented in Fig. 3.1. This workflow is part of a real deep water oil exploitation scientific experiment.
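
3.2.1 Scientific Workflows • As a minimal sketch of "order of task invocation" and "task synchronization", a workflow can be written as a dependency graph and sorted so that every task runs after the tasks it depends on. The task names below are hypothetical and are not taken from Fig. 3.1; the sketch uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
# "merge_results" is a synchronization point: it waits for both model runs.
workflow = {
    "prepare_input": set(),
    "run_model_a": {"prepare_input"},
    "run_model_b": {"prepare_input"},
    "merge_results": {"run_model_a", "run_model_b"},
}


def invocation_order(tasks):
    """Return one valid invocation order that respects every dependency."""
    return list(TopologicalSorter(tasks).static_order())


order = invocation_order(workflow)
print(order)
```

3.2.1 Scientific Workflows • The two model runs have no dependency on each other, so their relative order is not fixed; a workflow engine would be free to run them in parallel.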

3.2.1 Scientific Workflows

3.2.2 Scientific Workflow Management Systems • Scientific Workflow Management Systems (SWfMSs) are responsible for coordinating the invocation of programs, either locally or in remote environments. Many different SWfMSs can be found in the literature. Although current SWfMSs have many important characteristics and evolutions, according to Weske et al., these SWfMSs need to offer adequate support for the scientist throughout the experimentation process, including: (i) designing the workflow through a guided interface; (ii) controlling several variations of workflows; (iii) executing the workflow in an efficient way; (iv) handling failures; and (v) accessing, storing, and managing data.

3.2.2 Scientific Workflow Management Systems • Most of this support can be achieved using the cloud computing paradigm. More specifically, efficient execution of scientific experiments, as well as management of the large amount of scientific data produced by the experiment, is provided by the computational infrastructure of cloud computing environments. The next section presents some important aspects for scientific experiments to be considered when choosing a cloud computing environment.

3.2.3 Important Aspects of In Silico Experiments • In silico experiments (that are usually modeled as scientific workflows) have some important aspects to be considered when being modeled or executed. Many of these aspects should be taken into account when choosing a supporting cloud computing environment. Cloud computing environments present some important characteristics that are related to those aspects and may influence which cloud environment scientists choose. This section presents these aspects (business model, privacy, pricing, technological infrastructure, architecture, access, and standards), as they guide the choice of the classes of the proposed taxonomy.

3.2.3 Important Aspects of In Silico Experiments • One of the most important aspects for scientific experiments is reproducibility. To reproduce and validate an experiment, scientists must have all available information related to the experiment, including which parameter values were used in each instance of execution and the results (both final and intermediate) produced during its execution. This type of information is called provenance.

3.2.3 Important Aspects of In Silico Experiments • This data is stored in databases or via specialized services to store provenance, thus handling failure and retaining data integrity. Therefore, to achieve experiment reproducibility, the supporting cloud computing environment should provide two fundamental features: data storage and environment configuration. Data storage is required to store provenance data. Preferably, there should be a service that provides storage or database mechanisms to enable scientists to access provenance data and track how the results of an experiment execution were obtained. Environment configuration is required since it should be possible to reconfigure the whole environment used to execute the experiment. Those characteristics are related to the business model followed by a cloud computing environment.
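
3.2.3 Important Aspects of In Silico Experiments • To make the idea of provenance concrete, the following is a minimal sketch of a provenance store that records, for each execution, the parameter values used and the results produced, and then traces how a result was obtained. The table layout, function names, and sample values are assumptions made for illustration; an in-memory SQLite database stands in for whatever storage or database service a cloud environment would offer.

```python
import json
import sqlite3


def open_provenance_store(path=":memory:"):
    """Open (or create) a provenance database with a single illustrative table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS provenance (
               execution_id TEXT NOT NULL,
               activity     TEXT NOT NULL,
               parameters   TEXT NOT NULL,   -- JSON-encoded parameter values
               result       TEXT NOT NULL,   -- e.g. name of the file produced
               is_final     INTEGER NOT NULL -- 0 = intermediate, 1 = final
           )"""
    )
    return conn


def record(conn, execution_id, activity, parameters, result, is_final=False):
    """Store one step of an execution; the transaction rolls back on error."""
    with conn:
        conn.execute(
            "INSERT INTO provenance VALUES (?, ?, ?, ?, ?)",
            (execution_id, activity, json.dumps(parameters, sort_keys=True),
             result, int(is_final)),
        )


def trace(conn, execution_id):
    """Return the recorded steps of one execution, in insertion order."""
    rows = conn.execute(
        "SELECT activity, parameters, result, is_final FROM provenance "
        "WHERE execution_id = ? ORDER BY rowid",
        (execution_id,),
    ).fetchall()
    return [(a, json.loads(p), r, bool(f)) for a, p, r, f in rows]


# Hypothetical execution with one intermediate and one final result.
conn = open_provenance_store()
record(conn, "run-1", "preprocess", {"threshold": 0.5}, "cleaned.dat")
record(conn, "run-1", "simulate", {"steps": 100}, "output.dat", is_final=True)
history = trace(conn, "run-1")
```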

3.2.3 Important Aspects of In Silico Experiments • Privacy is also a major issue for the scientific community. Usually, provenance data and programs related to scientific experiments are considered intellectual property and, because of that, they are not made public until the research is published in a scientific paper. For this reason, the privacy aspect of cloud environments must be analyzed when dealing with scientific experiments.

3.2.3 Important Aspects of In Silico Experiments • Another important aspect to be considered is related to pricing. Scientists frequently use open-source and community environments. These programs and environments are freely available for general use, thus contributing to the reproducibility of experiment executions. The open-software culture of the scientific community must be considered, since most cloud environments are commercial, which means that the service is paid for. Thus, scientists should take into account the pricing of environments.

3.2.3 Important Aspects of In Silico Experiments • The architecture characteristics of the environment chosen to execute the experiment should also be taken into account. Scientific experiments need to be monitored and controlled by scientists. This way, the chosen cloud environment should provide characteristics such as monitoring, as well as individual control of an experiment execution independent from others’ executions.

3.2.3 Important Aspects of In Silico Experiments • In many scenarios, the execution of a whole experiment requires running programs on different technological platforms (operating systems, database servers), requiring that the cloud computing environment deal with heterogeneity.

3.2.3 Important Aspects of In Silico Experiments • Another important aspect is related to performance. These experiments usually need high-performance computational environments to run. Even using these environments, experiments may need days, weeks, or even months to finish. It is important to know (and classify) the technology infrastructure involved with the experiment to discover whether this technology is able to offer the necessary computational resources to execute the entire experiment.

3.2.3 Important Aspects of In Silico Experiments • Another important topic is related to how scientists access the cloud environment to run experiments. An in silico scientific experiment must be able to access cloud environments in different ways. For example, in one experiment, results must be provided in a web page through a web browser; in another experiment, there must be an API to control the execution of the experiment, and so on.

3.2.3 Important Aspects of In Silico Experiments • In silico scientific experiments should be based on standards, ideally ones already used in the experiment domain or recommended by entities such as W3C. These standards are important when modeling an in silico scientific experiment. Scientific experiments are usually based on open standards. The next section presents the proposed taxonomy for cloud computing, which takes into account the aspects listed in this section.

3.3 A Taxonomy for Cloud Computing • A taxonomy is a particular classification arranged in a hierarchical structure. It is typically organized by a parent-child relationship. Originally the term “taxonomy” referred only to the classification of living organisms. However, it has become popular in certain domains of science to apply the term in a wider, more general sense, where it may refer to a classification of things or concepts.

3.3 A Taxonomy for Cloud Computing • The cloud computing taxonomy presented in this chapter provides the classification of the components of the cloud computing domain into categories based on different aspects of this field and the requirements of scientific experiments. This section describes a cloud computing taxonomy (presented in Fig. 3.2), which is decomposed into eight subtaxonomies.

3.3 A Taxonomy for Cloud Computing • The proposed taxonomy classifies the characteristics of cloud computing in terms of architectural characteristics, business model, technology infrastructure, privacy, standards, pricing, orientation, and access. Many of the classes of the taxonomy are interrelated. In Fig. 3.2, these relations are represented by orange arrows. Each one of these relations is explained throughout the chapter.
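
3.3 A Taxonomy for Cloud Computing • One way to picture the taxonomy as a common vocabulary is a small parent-child structure against which environment descriptions are checked and compared. This is only a sketch: the classes are filled in solely where the chapter names them, the flat lists ignore the specialization relations between classes, and the two environments are hypothetical.

```python
# The eight subtaxonomies named in the text. Classes are listed only where the
# chapter names them; the empty lists are deliberately left unfilled.
CLOUD_TAXONOMY = {
    "architecture": [],
    "business model": ["SaaS", "PaaS", "IaaS", "StaaS", "DaaS"],
    "technology infrastructure": [],
    "privacy": ["private", "public", "mixed"],
    "standards": [],
    # component-based pricing is described as a specialization of pay-per-use
    "pricing": ["free", "pay-per-use", "component-based"],
    "orientation": [],
    "access": [],
}


def unknown_entries(environment):
    """Return the (subtaxonomy, value) pairs that the taxonomy does not contain."""
    problems = []
    for subtaxonomy, value in environment.items():
        classes = CLOUD_TAXONOMY.get(subtaxonomy)
        if classes is None:
            problems.append((subtaxonomy, value))  # not one of the eight branches
        elif classes and value not in classes:
            problems.append((subtaxonomy, value))  # branch exists, class does not
    return problems


def common_characteristics(env_a, env_b):
    """Return the subtaxonomies on which two environments are classified alike."""
    return {k: v for k, v in env_a.items() if env_b.get(k) == v}


# Two hypothetical environments described with the shared vocabulary.
env_a = {"business model": "IaaS", "privacy": "public", "pricing": "pay-per-use"}
env_b = {"business model": "DaaS", "privacy": "public", "pricing": "pay-per-use"}

shared = common_characteristics(env_a, env_b)
```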

3.3.1 Business Model • According to the business model adopted, clouds are usually classified into three major categories (Fig. 3.3): Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS), creating a model named SPI.

3.3.1 Business Model

3.3.1 Business Model • In SaaS, the software is deployed by a service provider (just like an application for the end-user) for commercial or free use as a service on demand. In IaaS, the provider delivers a computational infrastructure (such as a supercomputer) to the end-user over the web, and the end-user is usually responsible for configuring the environment to use. PaaS is the delivery of a programming environment as a service. The process of delivering platforms as services facilitates the deployment of applications into the cloud.

3.3.1 Business Model • However, these three categories are too generic. More classification levels are needed. For example, in the e-Science field, the generated data is one of the most valuable resources, yet this classification does not take into account services that are based on storage or databases.

3.3.1 Business Model • The business model subtaxonomy should include the following areas: Storage as a Service (StaaS) and Database as a Service (DaaS), which are fundamental for e-Science and scientific workflows. We may define Storage as a Service as a service that provides structured ways to access and maintain a storage facility that is remotely located. However, this kind of business model provides only the space and structure to store data.

3.3.1 Business Model • In scientific experiments, scientists usually need a database to store provenance data, because a database provides features such as indexing and concurrency control that simple storage does not provide.

3.3.1 Business Model • For this purpose, Database as a Service (DaaS) provides the operations and functions of a remotely hosted database, sharing it with other users while having it function logically as if the database were local. Thus, we may see Database as a Service as one specialization of Storage as a Service.
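
3.3.1 Business Model • The specialization relation can be sketched with two toy in-memory classes: a storage service that only keeps and returns records, and a database service that inherits from it and adds an index for lookups by field value. Both classes and their method names are hypothetical stand-ins rather than any provider's API, and concurrency control is left out to keep the sketch short.

```python
class StorageService:
    """Toy stand-in for Storage as a Service: space and structure, nothing more."""

    def __init__(self):
        self._objects = {}

    def put(self, key, record):
        self._objects[key] = record

    def get(self, key):
        return self._objects[key]


class DatabaseService(StorageService):
    """Toy stand-in for Database as a Service: storage plus an index on one field."""

    def __init__(self, indexed_field):
        super().__init__()
        self._field = indexed_field
        self._index = {}  # field value -> set of keys holding that value

    def put(self, key, record):
        # Drop the stale index entry when an existing key is overwritten.
        if key in self._objects:
            old_value = self._objects[key][self._field]
            self._index[old_value].discard(key)
        super().put(key, record)
        self._index.setdefault(record[self._field], set()).add(key)

    def find(self, value):
        """Return the keys whose indexed field equals `value`, sorted."""
        return sorted(self._index.get(value, set()))


# Hypothetical provenance records indexed by the activity that produced them.
db = DatabaseService(indexed_field="activity")
db.put("r1", {"activity": "simulate", "steps": 100})
db.put("r2", {"activity": "simulate", "steps": 200})
db.put("r3", {"activity": "preprocess", "threshold": 0.5})
simulate_runs = db.find("simulate")
```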

3.3.1 Business Model • The business model directly influences the orientation of the cloud environment. For example, an IaaS business model allows a user-centric environment, since the user is in control. On the other hand, a SaaS business model does not. This class of the taxonomy is essential to guarantee the reproducibility of scientific experiments, because the business model directly defines whether the cloud environment offers data, infrastructure, or applications as a service.

3.3.1 Business Model • For example, there should be a way to store provenance data to be further analyzed; thus, the cloud computing environment should follow the DaaS model to allow data storage.

3.3.2 Privacy • According to the privacy aspect, we may classify cloud environments as private, public, and mixed (Fig. 3.4). Public clouds may be considered the most traditional of all types. In this kind of cloud, the various resources are dynamically provided over the Internet, via web applications or web services, to any user. Private clouds are environments that emulate cloud computing on private networks, inside a corporation or a scientific institution.

3.3.2 Privacy

3.3.2 Privacy • A mixed cloud environment is one that is composed of multiple public and/or private clouds. The concept of a mixed cloud is still not settled. Some authors also call a mixed cloud a hybrid cloud. Although this term is not wrong, it is also used to define clouds that are implemented with different technologies, which may cause confusion.

3.3.2 Privacy • This class of the taxonomy is important for e-Science because of the importance of privacy levels in scientific experiments. Programs and data are usually not public, and scientists may prefer not to install programs or store data in public environments.

3.3.3 Pricing • Since it is important for scientific experiments to deal with costs, we must classify cloud environments according to a pricing criterion. This subtaxonomy (Fig. 3.5) is composed of three main types of pricing. Free pricing is the pricing model applied when you are using your own cloud environment, where the resources are freely available for authorized users.

3.3.3 Pricing • The pay-per-use model is the one where the user pays a specific value related to his or her resource utilization. It can also be specialized into component-based pricing, where each component (storage, CPU, and so on) has a different price and the real-time bill is broken down by the exact usage of each component.
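
3.3.3 Pricing • The arithmetic behind a component-based bill can be sketched as follows: each component has its own unit price, and the bill lists the cost of each component next to the total. The component names, unit prices, and usage figures are invented for illustration and do not reflect any provider's rates.

```python
from decimal import Decimal

# Hypothetical unit prices per component; real providers publish their own.
PRICES = {
    "cpu_hours": Decimal("0.10"),
    "storage_gb_month": Decimal("0.02"),
    "transfer_gb": Decimal("0.05"),
}


def bill(usage, prices=PRICES):
    """Return (line_items, total) for a usage dict of component -> quantity."""
    items = {
        component: prices[component] * Decimal(str(quantity))
        for component, quantity in usage.items()
    }
    return items, sum(items.values(), Decimal("0"))


# Hypothetical usage of one experiment execution.
items, total = bill({"cpu_hours": 120, "storage_gb_month": 500, "transfer_gb": 40})
for component, cost in items.items():
    print(f"{component}: {cost}")
print(f"total: {total}")
```

3.3.3 Pricing • `Decimal` is used instead of floating point so that monetary amounts are not affected by binary rounding.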

3.3.3 Pricing • The pay-per-use model is usually applied in both commercial clouds and scientific clouds. Science users pay for cloud usage in the same way as commercial users do. To our knowledge, there are no scientific institutions that share their resources at no cost. Pricing is influenced by access characteristics: when a cloud environment offers more access methods, each one of them is a component that can be priced by the provider.

3.3.4 Architecture • This subtaxonomy (Fig. 3.6) classifies the main architectural characteristics of a cloud computing environment. One fundamental architectural aspect of a cloud is heterogeneity. A cloud must support the aggregation of heterogeneous hardware and software resources, as happens with scientific experiments. The concept of virtualization is also a key aspect for clouds.
