
Capacity Planning


Presentation Transcript


  1. Capacity Planning • Capacity planning examines what systems are in place, measures their performance, and determines patterns in usage that enable the planner to predict demand. • Resources are then provisioned and allocated to meet demand. • The goal of capacity planning is to accommodate the workload, not to improve efficiency. • Performance tuning and optimization are not primary goals of capacity planners. • A system uses resources to satisfy cloud computing demand: processor, memory, storage, and network capacity. • Each of these resources has a utilization rate, and as demand increases, one or more of them reaches a ceiling that limits performance.

  2. Capacity Planning for Cloud • The reality of cloud computing is rather different from what the ideal might suggest: cloud computing is neither ubiquitous nor limitless. Performance can be highly variable, and you pay for what you use. • “Capacity and performance are two different system attributes. With capacity, you are concerned about how much work a system can do, whereas with performance, you are concerned with the rate at which work gets done.”

  3. Capacity planning is an iterative process with the following steps: 1. Determine the characteristics of the present system. 2. Measure the workload for the different resources in the system: CPU, RAM, disk, network, and so forth. 3. Load the system until it is overloaded, determine when it breaks, and specify what is required to maintain acceptable performance. Knowing when systems fail under load, and which factor(s) are responsible for the failure, is the critical step in capacity planning. 4. Predict the future based on historical trends and other factors (a small trend-fitting sketch follows this list). 5. Deploy or tear down resources to meet your predictions. 6. Repeat Steps 1 through 5.
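  A hypothetical illustration of Step 4: the Python sketch below fits a least-squares trend line to invented weekly peak workloads and extrapolates it forward. The sample values, units, and horizon are assumptions for illustration, not data from the text.

    # Fit y = a + b*x by least squares over equally spaced samples.
    def linear_trend(samples):
        n = len(samples)
        xs = range(n)
        mean_x = sum(xs) / float(n)
        mean_y = sum(samples) / float(n)
        var_x = sum((x - mean_x) ** 2 for x in xs)
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
        b = cov_xy / var_x
        a = mean_y - b * mean_x
        return a, b

    history = [120, 135, 150, 149, 170, 188, 205]  # hypothetical weekly peaks (hits/s)
    a, b = linear_trend(history)
    weeks_ahead = 4
    forecast = a + b * (len(history) - 1 + weeks_ahead)
    print("Forecast peak in %d weeks: %.0f hits/s" % (weeks_ahead, forecast))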

  4. Defining Baseline and Metrics • The first item of business is to determine the current system capacity or workload as a measurable quantity over time. Because many developers create cloud-based applications and Web sites based on a LAMP solution stack (Linux, Apache, MySQL, and PHP/Perl/Python), a LAMP site makes a useful example. • These four technologies are open source products, although the distributions used may vary from cloud to cloud and from machine instance to machine instance. • On Amazon Web Services, machine instances are offered for both Red Hat Linux and Ubuntu. LAMP stacks based on Red Hat Linux are the most common, but SUSE Linux and Debian GNU/Linux are also widely used.

  5. Baseline measurements • Let's assume that a capacity planner is working with a system that has a Web site based on Apache, and that the site processes database transactions using MySQL. There are two important overall workload metrics in this LAMP system: • Page views or hits on the Web site, measured in hits per second • Transactions completed on the database server, measured in transactions per second or perhaps queries per second • The figure graphs the historical record of Web server page views over a hypothetical day, week, and year.
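  The hits-per-second baseline can be derived directly from the Web server's access log. A minimal Python sketch, assuming the standard Apache combined log format and a placeholder log path:

    import re
    from collections import Counter

    TS = re.compile(r'\[([^ \]]+)')  # captures e.g. 10/Oct/2023:13:55:36

    hits = Counter()
    with open('/var/log/apache2/access.log') as log:  # hypothetical path
        for line in log:
            m = TS.search(line)
            if m:
                hits[m.group(1)] += 1  # one bucket per second of timestamp

    peak_ts, peak = hits.most_common(1)[0]
    print('Peak: %d hits/s at %s' % (peak, peak_ts))
    print('Mean over active seconds: %.2f hits/s' % (sum(hits.values()) / float(len(hits))))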

  6. Contd.. • A number of important characteristics are determined by these baseline studies: • WT, the total workload for the system per unit time. To obtain WT, you integrate the area under the curve for the time period of interest. • WAVG, the average workload over multiple units of time. To obtain WAVG, you sum the various WT values and divide by the number of unit times involved. You may also want to draw a curve that represents the mean work done. • WMAX, the highest amount of work recorded by the system: the highest recorded system utilization.
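  Given sampled workload data, all three quantities are straightforward to compute. A small Python sketch with invented hourly samples: WT by the trapezoidal rule (the area under the curve), WAVG as the mean of per-day totals, and WMAX as the highest recorded sample.

    def w_total(samples, dt):
        """WT: area under the sampled workload curve (trapezoidal rule)."""
        return sum((samples[i] + samples[i + 1]) / 2.0 * dt
                   for i in range(len(samples) - 1))

    day1 = [5, 7, 30, 80, 90, 60, 20, 8]   # hypothetical hits/s, hourly samples
    day2 = [6, 9, 35, 85, 95, 55, 25, 10]
    dt = 3600                               # seconds between samples

    wts = [w_total(d, dt) for d in (day1, day2)]
    w_avg = sum(wts) / len(wts)             # WAVG over the two days
    w_max = max(day1 + day2)                # WMAX: highest recorded value

    print('WT per day:', wts)
    print('WAVG = %.0f hits, WMAX = %d hits/s' % (w_avg, w_max))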

  7. Load testing • Measuring system metrics on a server under its normal workload isn't enough information for meaningful capacity planning; you also need to observe the system under load. Load testing seeks to answer the following questions: • 1. What is the maximum load that my current system can support? • 2. Which resource(s) represent the bottleneck in the current system that limits the system's performance? • 3. Can I alter the configuration of my server in order to increase capacity? • 4. How does this server's performance relate to that of my other servers that might have different characteristics?
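  Dedicated tools such as ab, siege, or JMeter are the usual way to answer these questions; purely as a sketch of the idea, the following Python script ramps up concurrent requests against a placeholder URL and reports throughput and latency at each load level.

    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = 'http://test-server.example.com/'  # hypothetical test endpoint

    def fetch(_):
        start = time.time()
        urllib.request.urlopen(URL, timeout=10).read()
        return time.time() - start

    for workers in (1, 5, 10, 25, 50):  # increasing load levels
        with ThreadPoolExecutor(max_workers=workers) as pool:
            t0 = time.time()
            latencies = list(pool.map(fetch, range(workers * 20)))
            elapsed = time.time() - t0
        print('%3d workers: %5.1f req/s, avg latency %4.0f ms' %
              (workers, len(latencies) / elapsed, 1000 * sum(latencies) / len(latencies)))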

  8. Resource Ceilings • Whatever performance measurement tool you use, the goal is to create a set of resource utilization curves, similar to the ones shown in the figure, for the individual server types in your infrastructure. To do this, you must examine the server at different load levels and measure utilization rates. • The graphs in the figure indicate that over a certain load for a particular server, the CPU (A), RAM (B), and Disk I/O (C) utilization rates rise but do not reach their resource ceilings. • In this instance, the Network I/O (D) reaches its maximum 100-percent utilization at about 50 percent of the tested load, and this factor is the current system resource ceiling.
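  One way to collect the data behind such curves is to sample all four utilization rates while the load test runs. A sketch using the third-party psutil package (an assumption; any system monitor such as sar or vmstat would serve):

    import psutil  # pip install psutil

    def sample(interval=1.0):
        n0, d0 = psutil.net_io_counters(), psutil.disk_io_counters()
        cpu = psutil.cpu_percent(interval=interval)  # blocks for `interval` seconds
        n1, d1 = psutil.net_io_counters(), psutil.disk_io_counters()
        return {
            'cpu_pct': cpu,
            'ram_pct': psutil.virtual_memory().percent,
            'disk_B_per_s': (d1.read_bytes + d1.write_bytes -
                             d0.read_bytes - d0.write_bytes) / interval,
            'net_B_per_s': (n1.bytes_sent + n1.bytes_recv -
                            n0.bytes_sent - n0.bytes_recv) / interval,
        }

    for _ in range(60):  # one line per second during the load test
        print(sample())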

  9. Contd.. • WT is the sum over all the Web servers in your infrastructure: WT = Σ(WSnP + WSnV) • In this equation, WSnP represents the workload of your physical server(s) and WSnV the workload of the virtual servers (cloud-based server instances) in your infrastructure. • Database servers are known to exhibit resource ceilings for either their file caches or their Disk I/O performance. To build high-performance applications, many developers replicate their master MySQL database and create a number of slave MySQL databases. All READ operations are performed on the slave databases, and all WRITE operations on the master. • The figure shows this sort of database architecture.
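  A minimal sketch of that read/write split, assuming the MySQLdb DB-API driver and placeholder host names and credentials; a production setup would add connection pooling, failover, and replication-lag handling.

    import random
    import MySQLdb  # pip install mysqlclient

    master = MySQLdb.connect(host='db-master.example.com', user='app',
                             passwd='secret', db='appdb')
    slaves = [MySQLdb.connect(host=h, user='app', passwd='secret', db='appdb')
              for h in ('db-slave-1.example.com', 'db-slave-2.example.com')]

    def execute(sql, args=None):
        """Route WRITEs to the master and READs to a random slave."""
        is_read = sql.lstrip().upper().startswith('SELECT')
        conn = random.choice(slaves) if is_read else master
        cur = conn.cursor()
        cur.execute(sql, args)
        return cur

    execute("INSERT INTO hits (page) VALUES (%s)", ('/index.html',))  # -> master
    count = execute("SELECT COUNT(*) FROM hits").fetchone()           # -> a slave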

  10. Resource contention in a database server

  11. Network Capacity • If any cloud-computing system resource is difficult to plan for, it is network capacity. There are three aspects to assessing network capacity: • Network traffic to and from the network interface at the server, whether that interface (or the server itself) is physical or virtual • Network traffic from the cloud to the network interface • Network traffic from the cloud through your ISP to your local network interface (your computer)

  12. Contd.. • This makes analysis complicated. You can measure factor 1, the network I/O at the server's interface with system utilities, as you would any other server resource. • For a cloud-based virtual computer, the network interface may be a highly variable resource as the cloud vendor moves virtual systems around on physical systems or reconfigures its network pathways on the fly to accommodate demand. • But at least it is measurable in real time.
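  On a Linux server, factor 1 can be sampled straight from the kernel's interface counters. A minimal sketch; the interface name is an assumption:

    import time

    def iface_bytes(name='eth0'):  # interface name is a placeholder
        with open('/proc/net/dev') as f:
            for line in f:
                if line.strip().startswith(name + ':'):
                    fields = line.split(':', 1)[1].split()
                    return int(fields[0]), int(fields[8])  # rx_bytes, tx_bytes
        raise ValueError('interface not found: ' + name)

    rx0, tx0 = iface_bytes()
    time.sleep(5)
    rx1, tx1 = iface_bytes()
    print('rx %.1f KB/s, tx %.1f KB/s' %
          ((rx1 - rx0) / 5120.0, (tx1 - tx0) / 5120.0))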

  13. Contd.. • Factor 2 is the cloud's network performance, which is a measurement of WAN traffic. A WAN's capacity is a function of many factors: • Overall system traffic (competing services) • Routing and switching protocols • Traffic types (transfer protocols) • Network interconnect technologies (wiring) • The amount of bandwidth that the cloud vendor purchased from an Internet backbone provider.

  14. Scaling • Scaling, from an IT resource perspective, represents the ability of the IT resource to handle increased or decreased usage demands. • The following are types of scaling: • Horizontal Scaling - scaling out and scaling in • Vertical Scaling - scaling up and scaling down

  15. Horizontal Scaling • The allocating or releasing of IT resources that are of the same type is referred to as horizontal scaling (Figure 1). The horizontal allocation of resources is referred to as scaling out and the horizontal releasing of resources is referred to as scaling in. Horizontal scaling is a common form of scaling within cloud environments.

  16. Vertical Scaling • When an existing IT resource is replaced by another with higher or lower capacity, vertical scaling is considered to have occurred. Specifically, replacing an IT resource with another that has a higher capacity is referred to as scaling up, and replacing an IT resource with another that has a lower capacity is considered scaling down. Vertical scaling is less common in cloud environments due to the downtime required while the replacement is taking place.
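  The decision rule behind horizontal scaling can be stated in a few lines. A toy sketch; the thresholds, limits, and metric are invented for illustration:

    HIGH, LOW = 75.0, 25.0          # percent CPU, hypothetical thresholds
    MIN_SERVERS, MAX_SERVERS = 2, 20

    def plan(current_servers, avg_cpu_pct):
        """Return the desired server count for the next interval."""
        if avg_cpu_pct > HIGH and current_servers < MAX_SERVERS:
            return current_servers + 1   # scale out
        if avg_cpu_pct < LOW and current_servers > MIN_SERVERS:
            return current_servers - 1   # scale in
        return current_servers           # hold steady

    print(plan(4, 82.0))  # -> 5 (scale out)
    print(plan(4, 12.0))  # -> 3 (scale in)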

  17. Using Microsoft Cloud Services • Microsoft has a very extensive cloud computing portfolio under active development. • Efforts to extend Microsoft products and third-party applications into the cloud are centered on adding more capabilities to existing Microsoft tools. • Microsoft calls its cloud operating system the Windows Azure Platform. You can think of Azure as a combination of virtualized infrastructure to which the .NET Framework has been added as a set of .NET Services. • The Windows Azure service itself is a hosted environment of virtual machines enabled by a fabric called Windows Azure AppFabric.

  18. Contd.. • The Windows Azure service is an Infrastructure as a Service offering. • A number of services interoperate with Windows Azure, including SQL Azure (a version of SQL Server), SharePoint Services, Azure Dynamic CRM, and many of the Windows Live Services; together these comprise the Windows Azure Platform, which is a Platform as a Service cloud computing model.

  19. Contd.. • Azure and its related services were built to allow developers to extend their applications into the cloud. Azure is a virtualized infrastructure on top of which a set of additional enterprise services has been layered, including: • A virtualization service called Azure AppFabric that creates an application hosting environment. AppFabric (formerly .NET Services) is a cloud-enabled version of the .NET Framework. • A high-capacity non-relational storage facility called Storage. • A set of virtual machine instances called Compute. • A cloud-enabled version of SQL Server called SQL Azure Database.

  20. Contd.. (figure-only slide)

  21. Google Web Services • Google applications are cloud-based applications. The range of applications Google offers spans productivity applications, mobile applications, media delivery, social interactions, and many more. • Google has a very large program for developers that spans its entire range of applications and services. • Among the services highlighted are Google's AJAX APIs, the Google Web Toolkit, and in particular Google's relatively new Google App Engine hosting service. • Using Google App Engine, you can create Web applications in Java and Python that are deployed on Google's infrastructure and scaled to a large size.
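  For flavor, a "hello world" handler of the kind the App Engine Python runtime hosted at the time, using the webapp2 framework that shipped with it; the deployment descriptor (app.yaml) is omitted.

    import webapp2

    class MainPage(webapp2.RequestHandler):
        def get(self):
            self.response.headers['Content-Type'] = 'text/plain'
            self.response.write('Hello from App Engine')

    # App Engine serves this WSGI application at the mapped URL.
    app = webapp2.WSGIApplication([('/', MainPage)], debug=True)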

  22. Amazon Web Services • Amazon.com is one of the most important and most heavily trafficked Web sites in the world. It provides a vast selection of products using an infrastructure based on Web services. • Amazon Web Services is based on SOA standards, including the HTTP, REST, and SOAP transfer protocols, open source and commercial operating systems, application servers, and browser-based access. • Virtual private servers can provision virtual private clouds connected through virtual private networks, providing reasonable security and control for the system administrator.

  23. Working with the Elastic Compute Cloud (EC2) • Amazon Elastic Compute Cloud (EC2) is a virtual server platform that allows users to create and run virtual machines on Amazon's server farm. • With EC2, you can launch and run server instances called Amazon Machine Images (AMIs) running different operating systems such as Red Hat Linux and Windows on servers that have different performance profiles. • You can add or subtract virtual servers elastically as needed; cluster, replicate, and load balance servers; and locate your different servers in different data centers or “zones” throughout the world to provide fault tolerance.
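  A minimal sketch of launching and later terminating an instance with the boto3 SDK; the region, AMI ID, and instance type are placeholder assumptions.

    import boto3  # pip install boto3; AWS credentials must be configured

    ec2 = boto3.client('ec2', region_name='us-east-1')
    resp = ec2.run_instances(
        ImageId='ami-12345678',   # hypothetical AMI ID
        InstanceType='t2.micro',
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp['Instances'][0]['InstanceId']
    print('Launched', instance_id)

    # Release the capacity when it is no longer needed (scaling in):
    ec2.terminate_instances(InstanceIds=[instance_id])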

  24. Contd.. • Consider a situation where you want to create an Internet platform that provides the following: • A high transaction level for a Web application • A system that optimizes performance between servers in your system • Data-driven information services • Network security • The ability to grow your service on demand

  25. Contd.. • Implementing that type of service might require a rack of components that includes the following: • An application server with access to a large RAM allocation • A load balancer, usually in the form of a hardware appliance such as F5's BIG-IP • A database server • Firewalls and network switches • Additional rack capacity at the ISP

  26. Contd.. • A physical implementation of these components might cost you something in the neighborhood of $25,000 depending upon the scale of your application. • With AWS, you might be able to have an equivalent service for as little as $1,000 and have a high level of availability and reliability to boot. • This difference may surprise you, but it is understandable when you consider that AWS can run its services with a much greater efficiency than your company would alone and therefore amortize its investment in hardware over several customers.

  27. Working with Amazon Storage Systems • When you create an Amazon Machine Instance, you provision it with a certain amount of storage. That storage is temporary; it exists only for as long as your instance is running. • All of the data contained in that storage is lost when the instance is suspended or terminated, as the storage is reassigned to the pool for other AWS users to use. For this and other reasons, you need to have access to persistent storage.

  28. Amazon Simple Storage System (S3) • Amazon S3's cloud-based storage system allows you to store data objects ranging in size from 1 byte up to 5GB in a flat namespace. In S3, storage containers are referred to as buckets, and buckets serve the function of a directory, although a bucket has no object hierarchy; you save objects, not files, to it. It is important that you do not associate the concept of a filesystem with S3, because files are not supported; only objects are stored. Additionally, you do not “mount” a bucket as you do a filesystem. • The S3 system allows you to assign a name to a bucket, but that name must be unique in the S3 namespace across all AWS customers. Access to an S3 bucket is through the S3 Web API (either SOAP or REST) and is slow relative to a real-world disk storage system.

  29. Contd.. • You can do the following with S3 buckets through the APIs (a short code sketch follows this list): • Create, edit, or delete existing buckets • Upload new objects to a bucket and download them • Search for and find objects and buckets • Find metadata associated with objects and buckets • Specify where a bucket should be stored • Make buckets and objects available for public access
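  A short sketch of a few of these operations with the boto3 SDK; the bucket name is a placeholder and, as noted above, must be globally unique.

    import boto3  # pip install boto3; AWS credentials must be configured

    s3 = boto3.client('s3')
    bucket = 'my-unique-bucket-name-12345'  # placeholder, must be globally unique

    s3.create_bucket(Bucket=bucket)  # outside us-east-1, also pass
                                     # CreateBucketConfiguration={'LocationConstraint': region}
    s3.put_object(Bucket=bucket, Key='notes/hello.txt',
                  Body=b'stored as an object, not a file')
    body = s3.get_object(Bucket=bucket, Key='notes/hello.txt')['Body'].read()

    for item in s3.list_objects_v2(Bucket=bucket).get('Contents', []):
        print(item['Key'], item['Size'])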

  30. Check Your Knowledge • Explain the essential characteristics of Cloud Computing. • How is Cloud beneficial compared to the traditional IT model? • Explain the different Cloud services models. • Describe the various Cloud deployment models. • What are the challenges of Cloud Computing?
