
Private Cloud Sample Architectures for >1000 VM Singapore, Oct 2011

Presentation Transcript


  1. Private Cloud Sample Architectures for >1000 VM. Singapore, Oct 2011. Iwan ‘e1’ Rahabok | virtual-red-dot.blogspot.com | tinyurl.com/SGP-User-Group | M: +65 9119-9226 | e1@vmware.com | VCAP-DCD

  2. Purpose of This Document • There is a lot of talk about Cloud Computing. But what does it look like at the technical level? • How do we really assure the SLA and have 3 tiers of service? • If I have 1000 VM, what does the architecture look like? • This is my personal opinion. • Please don’t take it as an official and formal VMware Inc recommendation. I’m not authorised to give one. • Also, we should generally judge the content rather than the organisation/person behind it. A technical fact is a technical fact, regardless of who said it :) • Technology changes • SSD disks, >10-core CPUs, FCoE, CNA, vStorage API, storage virtualisation, etc will impact the design. A lot of new innovation is coming within the next 2 years. • New modules/products from VMware & Ecosystem Partners will also impact the design. • This is just a sample • Not a Reference Architecture, let alone a Detailed Blueprint. • So please don’t print it and follow it to the dot. It is for you to think about and tailor. • It is written for hands-on vSphere Admins who have attended the Design Workshop & ICM courses • You should be at least a VCP 5, preferably a VCAP-DCD • No explanation of features. • A lot of the design considerations are covered in the vSphere Design Workshop. Folks, some disclaimers, since I am an employee of VMware

  3. Table of Contents • Introduction • Requirements, Assumptions, Considerations, and Design Summary • vSphere Design: Data Center • Data Center, Cluster (DRS, HA, DPM, Resource Pool) • vSphere Design: Server • ESXi, physical host • vSphere Design: Network • vSphere Design: Storage • vSphere Design: Security • vCenter roles/permissions • vSphere Design: VM • vSphere Design: Management • Performance troubleshooting

  4. Design Methodology • Architecting a Private Cloud is not a sequential process • There are 6 components. • The components are inter-linked, like a mesh. • In the >1000 VM category, where it takes >2 years to implement, new vSphere releases will change the design. • Even the bigger picture is not sequential • Sometimes you may even have to leave Design and go back to Requirements or Budgeting. • Again, there is no perfect answer. Below is one example. • This entire document is about Design only. Operations is another big space. • I have not taken into account Audit, Change Control, ITIL, etc. The steps are more like this

  5. Introduction

  6. Assumptions • Assumptions are needed to avoid the infamous “It depends…” answer. • The architecture for 50 VM differs from that for 500 VM, which in turn differs from that for 5000 VM. • A design for large VMs (16 vCPU, 128 GB) differs from a design for small VMs (1 vCPU, 1 GB) • A design for a Server farm differs from one for a Desktop farm. • This assumes 100% virtualised, not 99% • It is easier to have 1 platform than 2. • Certain things in a company you should only have 1 of (email, directory, office suite, backup). Something as big as a “platform” should be standardised. That’s why they are called platforms :) • Out of the 1000 VM, we assume some will be… • Huge. 10 vCPU, 96 GB RAM, 10 TB storage • Latency sensitive. 0.01 ms end-to-end latency • Secret. Holding company-secret data. • We assume it will have… • 50 databases, a mix of Oracle and SQL Server • Other Oracle software (they are charged per “cluster”) • The design is “forward looking” • Based on a 10 GE network. • Assumes the Security team can be convinced on mixed-mode.

  7. Assumptions used in this example

  8. Application consideration

  9. Application consideration

  10. Architecting a private cloud: what to consider • Architecture is an Art • Balancing a lot of things, some not even technical. • It considers the future (unknown requirements). • Trying to stay close to best practice • In no particular order, below is what I consider in this vSphere-based architecture • My personal principle: do not design something you cannot troubleshoot. • A good IT Architect does not set up potential risk for the Support person down the line. • Not all counters/metrics/info are visible in vSphere. • Consideration • Upgradability • This is unique to the virtual world. A key component of cloud that people have not talked much about. • After all my apps run on the virtual infrastructure, how do I upgrade the virtualisation layer itself? • Based on historical data, VMware releases a major upgrade every 2-3 years. vSphere 4.0 was released in May 2009, 5.0 in Sep 2011. • If you are laying down an architecture, check with your VMware rep for an NDA roadmap presentation. • Debugability • Troubleshooting in a virtual environment is harder than in physical, as boundaries are blurred and physical resources are shared. • 3 types of troubleshooting: • Configuration. This does not normally happen in production, as once it is configured, it is not normally changed. • Stability. Stability means something hangs or crashes (BSOD, PSOD, etc) or gets corrupted • Performance. This is the hardest of the 3, especially if the slow performance is short-lived and in most cases it is performing well.

  11. Architecting a private cloud: what to consider • Consideration • Supportability • This is related to, but not the same as, Debug-ability. Support relates to things that make day-to-day support easier: monitoring counters, reading logs, setting up alerts, etc. For example, centralising the logs via syslog and providing intelligent search (e.g. using Splunk or Integrien) improves Supportability • A good design makes it harder for the Support team to make human errors. Virtualisation makes tasks easy, sometimes way too easy relative to the physical world. Consider this operational/psychological impact in your design. • Support also means using components that are supported by the vendors. For example, SAP support starts from certain versions onwards (old versions are not supported) • Availability • Software has Bugs. Hardware has Faults. We cater mostly for hardware faults. What about software bugs? • Cater for software bugs, which is why the design has 2 VMware clusters with 2 vCenters. This lets you test cluster-related features in one cluster, while keeping your critical VMs on the other cluster. • A Tier 0 can be added that uses Fault Tolerant hardware (e.g. Stratus) • Reliability • Related to availability, but not the same. Availability is normally achieved by redundancy. Reliability is normally achieved by keeping things simple, using proven components, separating things, standardising. • You will notice a lot of standardisation in the design. The drawback of standardisation is overhead, as we have to round up to the next bracket. A VM with 6 GB RAM ends up getting 8 GB. • Performance • Storage, Network, VMkernel, VMM, Guest OS, etc are considered. • We are aiming for <1% CPU Ready Time and near 0 Memory Ballooning in Tier 1. In Tier 3, we can and should have higher ready time and some ballooning, as long as it still meets the SLA. • Scalability • Includes both horizontal and vertical. Includes both hardware and software.

  12. Architecting a private cloud: what to consider • Consideration • Cost • An even bigger cost is ISV licensing. Dedicating a cluster for them is cheaper. • The DR Site serves multiple purposes. • VMs from different Business Units are mixed in 1 cluster. If they can share the same Production LAN and SAN, the same reasoning can apply to the hypervisor. • Windows, Linux and Solaris VMs are mixed in 1 cluster. • Security • The vSphere Security Hardening Guide splits security into 3 levels: Production, DMZ and SSLF • vShield is used to complement vSphere. • Changing the paradigm in security: from “Hypervisor as another point to secure” to “Hypervisor giving an unfair advantage to the security team”. • Skills of the IT team • Skills include both internal and external (preferred vendors who complement the IT team) • Improvement • Besides meeting current requirements, can we improve things? • Moving toward “1 VM, 1 OS, 1 App”. In the physical world, some servers serve multiple purposes. In the virtual world, they can afford to, and should, run 1 App per VM. • We consider Desktop Virtualisation in the overall architecture.

  13. Data Center Design Data Center, Cluster, Resource Pool, DRS, DPM

  14. Methodology • Define how many physical data centers are required • DR requirements normally dictate 2 • For each physical DC, define how many vCenters are required • Desktop and Server should be separated by vCenter • View 5 comes with bundled vSphere (unless you are buying the add-on) • Security requirements, not scalability, drive this one • In our sample scenario, it does not warrant separation. • Different vCenter Admins drive this segregation • For each vCenter, define how many virtual data centers are required • A Virtual Data Center serves as a naming boundary. • Different paying customers drive this segregation • For each vDC, define how many Clusters are required • For each Cluster, define how many ESXi hosts are required • Preferably 4 – 8. 2 is too small a size. Adjust according to workload • Standardise the host spec across clusters. While each cluster can have its own host type, this adds complexity. A PowerCLI sketch of these steps follows below.
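
  As an illustration of the methodology above, here is a minimal PowerCLI sketch, assuming PowerCLI is installed and connected to vCenter; the server, data center and cluster names and the host credentials are placeholders, not part of the original design:

      # Connect to the vCenter that will own this physical DC (name is illustrative)
      Connect-VIServer -Server vc01.corp.local

      # One virtual data center per naming boundary / paying customer
      $dc = New-Datacenter -Location (Get-Folder -NoRecursion) -Name 'SG-DC1'

      # One cluster per tier: HA with admission control, DRS fully automated
      $cluster = New-Cluster -Location $dc -Name 'Tier1-Prod' `
          -HAEnabled -HAAdmissionControlEnabled -HAFailoverLevel 1 `
          -DrsEnabled -DrsAutomationLevel FullyAutomated

      # Standardised hosts are then added to the cluster (credentials are placeholders)
      Add-VMHost -Name 'esx01.corp.local' -Location $cluster -User root -Password 'changeme' -Force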

  15. DataCenter and Cluster • When do you decide to use a separate… • Cluster? • Datacenter? • vCenter? • Input needed to decide the above • Application licensing • Application workload • Application hardware requirements • Group

  16. Special scenarios • There are scenarios where you might need to create a separate cluster, or even vCenter. • Large VMs (>6 vCPU, >36 GB RAM) • Separate cluster, as the ESXi host spec is different? • Databases. Do you… • …group them into 1 cluster (to save licences, give DBAs more access to the cluster, vShield group)? • …or put them together with the apps they support? • …put DBs used by IT in the same cluster as DBs used by the Business? • Oracle VM • Separate cluster or sub-cluster? • VMs that need a hardware dongle • Use a network-based dongle. • Separate sub-cluster. Will also need the same at the DR Site. • VMs holding company secrets • Do you put them in a separate cluster? Can you trust the vCenter Admin? • Do you put them in a separate datastore? Do you use VSA if you can’t trust the SAN Admin? • Enhance the security with vShield • VMs with 0.1 ms network latency • Do you put them in a separate cluster, as your ESXi has to be configured differently? • VMs with 5 ms disk latency • VMs in the DMZ zone • Same cluster. We will use vShield

  17. Overall Architecture This shows an example of a Cloud for >500 VM. It also uses Active/Passive data centers. The overall architecture remains similar for a Large Cloud.

  18. Cluster Design (1 DC)

  19. The need for a Non-Prod Cluster • This is unique to the virtual data center. • Well, we don’t have a “Cluster” to begin with in a physical DC. • The Non-Prod Cluster serves multiple purposes • Run Non-Production VMs • In our design, all Non-Production runs on the DR Site to save cost. • A consequence of our design is that migrating from/to Production can mean copying a large amount of data across the WAN. • Disaster Recovery • Test-bed for infrastructure patching or updates. • Test-bed for infrastructure upgrade or expansion • Evaluating or implementing new features • In a Virtual Data Centre, a lot of enhancements can impact the entire data centre • e.g. Distributed Switch, Nexus 1000V, Fault Tolerance, vShield • All the above need proper testing. • The Non-Prod Cluster should provide a sufficiently large scope to make testing meaningful • Upgrade of the core virtual infrastructure • e.g. from vSphere 5 to a future version (major release) • This needs extensive testing and a roll-back plan. • Even with all the above… • How are you going to test SRM properly? • An SRM test needs 2 vCenters, 2 arrays, 2 SRM servers. If all are used in production, then where is the test environment for SRM? This new layer does not exist in the physical world. It is software, hence it needs its own Non-Prod environment.

  20. The need for an IT Cluster • Special-purpose cluster • Running all the IT VMs used to manage the virtual DC or provide core services • The Central Management will reside here too • Separated for ease of management & security. This separation keeps the Business Cluster clean, “strictly for business”.

  21. Cluster Size • I recommend 8 nodes per cluster. Why 8, not 4 or 12 or 16 or 32? • A balance between too small (4 hosts) and too large (>12 hosts) • DRS: 8 gives DRS sufficient hosts to “maneuver”. 4 is rather small from the DRS scheduler’s point of view. • With vSphere 4.1, having 4 hosts does not give enough hosts to do a “sub-cluster” • For cost reasons, some clusters can be as small as 2 nodes. But the DPM benefit can’t be used. • Best practice for a cluster is the same hardware spec with the same CPU frequency. • Eliminates the risk of incompatibility • Consistent performance (from the user’s point of view) • Complies with Fault Tolerance & VMware View best practices • So more than 8 means it’s more difficult/costly to keep them all the same. You need to buy 8 hosts at a time. • Upgrading >8 servers at a time is expensive ($$) and complex. A lot of VMs will be impacted when you upgrade >8 hosts. • Manageability • Too many hosts are harder to manage (patching, performance troubleshooting, too many VMs per cluster, HW upgrades) • Allows us to isolate 1 host for VM-troubleshooting purposes. At 4 nodes, we can’t afford such a “luxury” • Too many paths to a LUN can be complex to manage and troubleshoot • Normally, a LUN is shared by 2 clusters, which are “adjacent” clusters. • 1 ESX is 4 paths. So 8 ESX is 32 paths. 2 clusters is 64 paths. This is a rather high number (if you compare with the physical world) • N+2 for Tier 1 and N+1 for others • With 8 hosts, you can withstand 2 host failures if you design it to. • At 4 nodes, it is too expensive as the payload is only 50% at N+2 • Small cluster size • From an Availability and Performance point of view, this is rather risky. • Say you have a 3-node cluster… You are doing maintenance on Host 1 and suddenly Host 2 goes down… you are exposed with just 1 node. Assuming HA Admission Control is enabled (which it should be), the affected VMs may not even boot. When a host is placed into maintenance mode, or disconnected for that matter, it is taken out of the admission control calculation. • Cost: too few hosts result in overhead (the “spare” host) • See slide notes for more details

  22. 3-Tier cluster • The host spec can be identical, but the service can be very different. • Below is an example of a 3-tier cluster.

  23. ESXi Host: CPU Sizing • ESXi Host: CPU • 2 - 4 vCPU per physical core • This is a general guideline. • Not meant for sizing Tier 1 Applications. Tier 1 Apps should be given 1:1 sizing. • More applicable for Test/Dev or Tier 3 • A 12-core box gives 24 – 48 vCPU • Design with ~10 VM per box in Production and ~15 VM per box in Non-Production. • ~10 VM per box means the impact of downtime when a host fails is capped at ~10 Production VMs. • ~10 VM per box in an 8-node cluster means ~10 VMs may be able to boot on the remaining 7 hosts in the event of HA, hence reducing downtime. • Based on a 10:1 consolidation ratio, if all your VMs are 3 vCPU, then you need 30 vCPU, which means a 12-core ESX gives 2.5:1 CPU oversubscription. • Based on a 15:1 consolidation ratio, if all your VMs are 2 vCPU, then you need 30 vCPU. • Buffer for the following: • HA events • Performance isolation • Hardware maintenance • Peaks: month end, quarter end, year end • Future requirements: within 12 months • DR, if your cluster needs to run VMs from the Production site. A worked example of this arithmetic follows below.
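
  A back-of-the-envelope version of the sizing arithmetic above, written as a PowerShell sketch. Every number is an assumption taken from the guideline (12-core host, ~2.5 vCPU per core, 3-vCPU average VM, ~20% buffer); it illustrates the maths, it is not a sizing tool:

      $coresPerHost = 12                                    # 2 sockets x 6 cores (e.g. Xeon X5650)
      $vCpuPerCore  = 2.5                                   # within the 2-4 vCPU per core band
      $avgVcpuPerVm = 3

      $vCpuCapacity = $coresPerHost * $vCpuPerCore          # 30 schedulable vCPU per host
      $vmPerHost    = [math]::Floor($vCpuCapacity / $avgVcpuPerVm)   # 10 VM per host, i.e. the 10:1 ratio

      # Keep headroom for HA events, peaks, maintenance and future growth (assumed 20%)
      $vmPerHostWithBuffer = [math]::Floor($vmPerHost * 0.8)
      "{0} VM per host before buffer, {1} after buffer" -f $vmPerHost, $vmPerHostWithBuffer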

  24. ESXi: Sample host specification • Estimated hardware cost: US$ 8K per ESXi host. • Configuration included in the above price: • 2x Xeon X5650. The E series has different performance & price attributes • 72 GB RAM (18 slots x 4 GB) or 96 GB RAM (12 slots x 8 GB) • 2 x 10 GE ports (no hardware iSCSI) • 2 x 8 Gb FC HBA • 5-year warranty (next business day) • 2x 50 GB SSD • Swap-to-host-cache feature in ESXi 5 • Running agent VMs that are IO intensive • Could be handy during troubleshooting. Only need 1 disk as it’s for troubleshooting purposes. • PXE boot • No need for local disk • Installation service • Lights-Out Management. Avoid using WoL. Use IPMI or HP iLO.

  25. Blade or Rack • Both are good. Both have pros and cons. The table below is a relative comparison, not absolute. • Consult the principal (vendor) for specific models. Below is just a guideline. • The comparison below is only for vSphere purposes. Not for other use cases, say HPC or non-VMware.

  26. Server Selection • All Tier 1 vendors (HP, Dell, IBM, Cisco, etc) make great ESXi hosts. • Hence the following guidelines are relatively minor compared to the base spec. • Additional guidelines for selecting an ESXi server: • Does it have Embedded ESXi? • How much local SSD (capacity and IOPS) can it handle? This is useful for stateless desktop architecture, and when using local SSD as cache or virtual storage. • Does it have built-in 2x 10 GE ports? • Does the built-in NIC have hardware iSCSI capability? • Memory cost. Most ESXi servers have around 64 – 128 GB of RAM, mostly around 72 GB. With 4 GB DIMMs, that needs a lot of DIMM slots. • What are the server’s unique features for ESXi? • Management integration. The majority of server vendors have integrated management with vCenter. Most are free. Dell’s is not free, although it has more features. • DPM support?

  27. SAN Boot • 4 methods of ESXi boot • Local Compact Flash • Local Disk • LAN Boot (PXE) with Auto Deploy • SAN Boot • For the 3 sample sizes, we use ESXi Embedded. • Environments with >20 ESXi hosts should consider Auto Deploy. • Auto Deploy is also good for environments where you need to prove to the security team that your ESXi has not been tampered with (you can simply reboot it and it is back to “normal”) • Advantages of Local Disk over SAN boot • No SAN complexity • No need to label a boot LUN properly (a SAN-boot chore). • Disadvantages of Local Disk compared to SAN boot • Needs 2 local disks, mirrored. • Certain organisations do not like local disks. • Disk is a moving part: lower MTBF. • Going diskless saves power/cooling • SAN Boot is a step toward stateless ESXi • An ideal ESX host is just pure CPU and RAM. No disk, no PCI card, no identity.

  28. Storage Design

  29. Methodology • Define the standard (Storage-Driven Profile) • Define the Datastore profiles • Map Clusters to Datastores • For each VM, gather: • Capacity (GB) • Performance (IOPS) requirements • Importance to business: Tier 1, 2, 3 • Map each VM to a datastore • Create another datastore if the existing ones are insufficient (either capacity or performance) • See next slide for detail • Once the mapping is done, turn on QoS if needed • Turn on Storage IO Control if a particular VM needs a certain guarantee. • Turn on Storage IO Control if we want fairness among all VMs within the datastore. • Storage IO Control is per datastore. If the underlying LUN shares spindles with other LUNs, it may not achieve the desired result. Consult your storage vendor on this, as they have visibility/control of the entire array. A hedged example of enabling it follows below.
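
  If the decision is to enable Storage IO Control, a hedged PowerCLI sketch is shown below. The datastore name, VM name and 30 ms congestion threshold are illustrative only; confirm the threshold and the per-disk share values with your storage vendor:

      # Enable Storage IO Control on one datastore (names and threshold are examples)
      Get-Datastore -Name 'Tier1-DS01' |
          Set-Datastore -StorageIOControlEnabled $true -CongestionThresholdMillisecond 30

      # A VM that needs a certain guarantee then gets higher disk shares
      $vm = Get-VM -Name 'CRM-DB01'
      Set-VMResourceConfiguration -Configuration (Get-VMResourceConfiguration -VM $vm) `
          -Disk (Get-HardDisk -VM $vm) -DiskSharesLevel High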

  30. SLA: Types of Datastores • Not all datastores are equal. • Always know the underlying IOPS & SLA that the array can provide for a given datastore • You should always know where to place a VM. • Use datastore groups • Always have a mental picture of where your Tier 1 VMs reside. It can’t be “somewhere in the cloud” • Types of datastore • Business VM • Tier 1 VM, Tier 2 VM, Tier 3 VM, Single VM • Each Tier may have multiple datastores. • DMZ VM • Mounted only by ESX hosts that have the DMZ network? • IT VM • Isolated VM • Template • Desktop VM • SRM Placeholder • Datastore Heartbeat? • Do we dedicate datastores for it? • 1 datastore = 1 LUN • Relative to “1 LUN = many VMFS”, it gives better performance due to fewer SCSI reservations

  31. Special Purpose Datastores • 1 low-cost datastore for ISOs and Templates • Need 1 per vCenter • Need 1 per physical Data Center. Else you will transfer GBs of data across the WAN. • Around 500 GB • ISO directory structure: \ISO\OS\Windows, \ISO\OS\Linux, \ISO\Non OS\ (store things like anti-virus, utilities, etc) • 1 staging/troubleshooting datastore • To isolate a VM. Proves to the Apps team that the datastore is not affected by other VMs. • For storage performance studies or issues. Makes it easier to correlate with data from the array. • The underlying spindles should have enough IOPS & size for the single VM • Our sizing: 500 GB • 1 SRM Placeholder datastore • So you always know where it is. • Sharing with another datastore may confuse others. • Used in SRM 5 to place the VMs’ metadata so they can be seen in vCenter. • 10 GB is enough. Low performance.

  32. SLA: 3-Tier pools of storage (consult your storage vendor for array-specific design) • Create 3 Tiers of Storage. • This becomes the type of Storage Pool provided to VMs • Paves the way for standardisation • 1 size for each Tier. Keep it consistent. Choose an easy number. • 20% free capacity for VM swap files, snapshots, logs, thin-volume growth, and Storage vMotion (inter-tier). • Use Thin Provisioning at the array level, not the ESX level. • Separate Production and Non-Production • Example • Replication is to the DR Site via array replication, not to the same building. • Snapshot = protected with array-level snapshots for fast restore • RAID level does not matter so much if the array has sufficient cache (battery-backed, naturally) • RDM will be used for data drives of 1 TB and above. Virtual-compatibility mode is used unless the Apps team says otherwise. • VMDKs larger than 1 TB will be provisioned as RDM. Virtual-compatibility mode is used. A sketch of attaching such an RDM follows below.
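
  For the large data drives mentioned above, a minimal PowerCLI sketch for attaching a LUN as a virtual-compatibility RDM might look like the following; the VM name and the LUN canonical name are made-up placeholders:

      # Attach a LUN as a virtual-compatibility RDM (the naa ID below is fictitious)
      $vm  = Get-VM -Name 'FileServer01'
      $lun = Get-ScsiLun -VmHost (Get-VMHost -VM $vm) -CanonicalName 'naa.60060160a0b1230000000000000000'

      New-HardDisk -VM $vm -DiskType RawVirtual -DeviceName $lun.ConsoleDeviceName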

  33. 3-Tier Storage? • Below is a sample diagram, showing disk grouping inside an array. • The array has 48 disks. Hot spares are not shown for simplicity • This example only has 1 RAID Group (2+2) for simplicity • Design consideration • Datastore 1 and Datastore 2 performance can impact one another, as they share physical spindles. • The only way they don’t impact each other is if there are “Share” and “Reservation” concepts at the “meta slice” level. • Datastore 3, 4, 5, 6 performance can impact one another. • DS 1 and DS 3 can impact each other since they share the same Controller (or SP). This contention happens if the shared component becomes the bottleneck (e.g. cache, RAM, CPU). • The only way to prevent this is to implement “Share” or “Reservation” at the SP level.

  34. Mapping: Cluster - Datastore • Always know which cluster mounts which datastores • Keep the diagram simple. Not too much info. The idea is to have a mental picture that you can remember. • If your diagram has too many lines, too many datastores, too many clusters, then it may be too complex. Create a Pod when that happens. Modularisation can be good.

  35. Mapping: Datastore Replication

  36. Mapping: Datastore – VM • Criteria to use when placing a VM into a Tier: • How critical is the VM? Importance to business. • What are its performance and availability requirements? • What are its point-in-time restoration requirements? • What are its backup requirements? • What are its replication requirements? • Have a document that lists which VM resides on which datastore group • The content can be generated using PowerCLI or Orchestrator, showing datastores and their VMs (see the sketch below). • While it rarely happens, you can’t rule out datastore metadata getting corrupted. • A VM normally changes tiers throughout its life cycle • Criticality is relative and might change for a variety of reasons, including changes in the organization, operational processes, regulatory requirements, disaster planning, and so on. • Be prepared to do Storage vMotion.
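
  The datastore-to-VM list mentioned above can be produced with a few lines of PowerCLI; a sketch is below (the output path is arbitrary):

      # Dump every datastore and the VMs it holds into a CSV for the design document
      $report = foreach ($ds in Get-Datastore) {
          Get-VM -Datastore $ds | Select-Object `
              @{Name = 'Datastore';     Expression = { $ds.Name }},
              @{Name = 'VM';            Expression = { $_.Name }},
              @{Name = 'ProvisionedGB'; Expression = { [math]::Round($_.ProvisionedSpaceGB, 1) }}
      }
      $report | Export-Csv -Path 'C:\Reports\datastore-vm-mapping.csv' -NoTypeInformation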

  37. Storage Calculation • We will split the System Drive and Data Drive • This enables changing the OS by swapping the C:\ vmdk file • We use 10 GB for C:\ to cater for Win08 and give space for defragmentation. • We use Thin Provisioning, preferably at the array level. • The sample calculation below is for our small cloud (spelled out in the sketch that follows) • 30 Production VMs: 26 non-DB + 3 DB + 1 File Server • Non-DB VM: 100 GB on average • DB VM: 500 GB on average • File server VM: 2 TB • 15 Non-Production
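
  The same calculation spelled out as a PowerShell sketch, using the assumed per-VM figures above plus the 20% free-capacity rule from the storage tier slide:

      $nonDbGB   = 26 * 100        # non-DB VMs at ~100 GB each
      $dbGB      =  3 * 500        # DB VMs at ~500 GB each
      $fileSrvGB =  1 * 2048       # one 2 TB file server

      $capacityGB    = $nonDbGB + $dbGB + $fileSrvGB   # 6,148 GB of provisioned capacity
      $withFreeSpace = $capacityGB * 1.2               # keep ~20% free for swap, snapshots, logs, growth
      "{0:N0} GB provisioned, roughly {1:N1} TB with headroom" -f $capacityGB, ($withFreeSpace / 1024)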

  38. Reasons for FC (partial list) • A network issue does not create a storage issue • Troubleshooting storage does not mean troubleshooting the network too • 8 Gb vs 1 Gb. 16 Gb vs 2 Gb in redundant mode • 10 GE is still expensive and needs the uplinks to change too • HP or Cisco blades may provide a good alternative here. Consider the total TCO and not just the cost per box. • FC vs IP • The FC protocol is more efficient & scalable than the IP protocol for storage • Path failover is <30 seconds, compared with <60 seconds for iSCSI • Lower CPU cost • See the chart. FC has the lowest CPU hit to process the IO, followed by hardware iSCSI • Storage vMotion • You can estimate the time taken to move 100 GB over a 1 Gb path… (see the estimate below)
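
  A rough, best-case estimate of that last point, ignoring protocol overhead and contention:

      # Time to move 100 GB at wire speed; real-world figures will be worse
      $dataGB    = 100
      $timeOn1Gb = ($dataGB * 8) / 1      # ~800 seconds, i.e. >13 minutes on a 1 Gb path
      $timeOn8Gb = ($dataGB * 8) / 8      # ~100 seconds on an 8 Gb FC path
      "1 Gb: {0} s, 8 Gb FC: {1} s (theoretical minimum)" -f $timeOn1Gb, $timeOn8Gb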

  39. Network Design

  40. Methodology • Define how many VLANs you need • Decide if you will use 10 GE or 1 GE • If you use 10 GE, define how you will use Network IO Control • Decide if you use IP storage or FC storage • Decide which vSwitch to use: local (standard), distributed, Nexus • Decide when to use Load Based Teaming • Select blade or rack mount • This has an impact on NIC ports and switches • Define the detailed design with the vendor. A small configuration sketch follows below.
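
  For the simpler (standard vSwitch) end of those decisions, a PowerCLI sketch of a VLAN-tagged port group on one host is shown below; the host name, vmnic numbers and VLAN ID are illustrative assumptions:

      # Create a standard vSwitch backed by two uplinks and add a tagged port group
      $vmhost = Get-VMHost -Name 'esx01.corp.local'
      $vsw    = New-VirtualSwitch -VMHost $vmhost -Name 'vSwitch1' -Nic 'vmnic2','vmnic3'
      New-VirtualPortGroup -VirtualSwitch $vsw -Name 'Prod-VM-VLAN100' -VLanId 100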

  41. Network Architecture

  42. ESXi Network configuration

  43. Security Design

  44. Separation of Duties with vSphere • VMware Admin vs AD Admin • The AD Admin has access to NTFS. This can be too powerful if it holds confidential data • Segregate the virtual world • Split vSphere access into 3: • Storage • Server • Network • Give Network to the Network team. • Give Storage to the Storage team. • The role with full access to vSphere should be rarely used. • VM owners can be given some access that they don’t have in the physical world. They will like the empowerment (self service). An example of carving out such a role follows below.
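
  One way to express this split, sketched in PowerCLI: build a role containing only the network privileges and grant it to the network team (the AD group and data center names are placeholders):

      # Role with only the Network.* privileges, assigned to the network team
      $netPrivs = Get-VIPrivilege | Where-Object { $_.Id -like 'Network.*' }
      $netRole  = New-VIRole -Name 'NetworkTeam-Role' -Privilege $netPrivs

      New-VIPermission -Entity (Get-Datacenter -Name 'SG-DC1') `
          -Principal 'CORP\NetworkTeam' -Role $netRole -Propagate:$true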

  45. Folder • Use it properly • Do not use Resource Pools to organise VMs. • Caveat: the Host/Cluster view + VM is the only view where you can see both ESX hosts and VMs. • Study the hierarchy on the right • It is Folders everywhere. • Folders are the way to limit access. • Certain objects don’t have their own access control. They rely on folders. • E.g. you cannot set permissions directly on a vNetwork Distributed Switch. To set permissions, create a folder on top of it. An example of folder-scoped permissions follows below.
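
  A small PowerCLI illustration of folder-based delegation, in line with the point above; the folder, AD group and role names are examples only:

      # Folders are the unit of delegation: one per application team, with permissions scoped to it
      $parent    = Get-Folder -Name 'vm' -Type VM          # the hidden root VM folder
      $appFolder = New-Folder -Name 'Finance-Apps' -Location $parent

      # The Finance apps team only sees and manages VMs inside its folder
      # (pick an existing or custom role; the name here is an example)
      New-VIPermission -Entity $appFolder -Principal 'CORP\Finance-AppAdmins' `
          -Role (Get-VIRole -Name 'Virtual machine power user (sample)') -Propagate:$true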

  46. Compliance • How do we track changes made in vCenter by authorised staff? • vCenter does not track configuration drift. • Tools like vCenter Ops Enterprise provide some level of configuration management, but not all (see the sketch below).
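
  vCenter's event log does record who did what, even if it does not track drift; a hedged PowerCLI sketch for pulling the last week of VM reconfiguration events:

      # List who reconfigured which VM over the last 7 days (MaxSamples is an arbitrary cap)
      Get-VIEvent -Start (Get-Date).AddDays(-7) -MaxSamples 5000 |
          Where-Object { $_.GetType().Name -eq 'VmReconfiguredEvent' } |
          Select-Object CreatedTime, UserName, @{N='VM'; E={ $_.Vm.Name }}, FullFormattedMessage |
          Sort-Object CreatedTime -Descending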

  47. VM Design

  48. Standard VM sizing: Follow McDonald’s • 1 VM = 1 App = 1 purpose. No bundling of services. • Having multiple applications or services in 1 OS tends to create more problems. The Apps team knows this better. • Start with the Small size, especially for CPU & RAM (see the sketch below). • Use as few virtual CPUs (vCPUs) as possible. • Extra vCPUs impact the scheduler, hence performance • Hard to take back once you give them. Also, the app might be configured to match the processor count (you will not know unless you ask the application team). • Maintaining a consistent memory view among multiple vCPUs consumes resources. • There is a licensing impact if you assign more CPUs. vSphere 4.1 multi-core can help (always verify with the ISV) • Virtual CPUs that are not used still consume timer interrupts and execute the idle loop of the guest OS • In the physical world, CPUs tend to be oversized. Right-size them in the virtual world. • RAM • RAM starts with 1 GB, not 512 MB. Patches can be large (330 MB for XP SP3) and need RAM • Size impacts vMotion, ballooning, etc, so you want to trim the fat • The Tier 1 Cluster should use Large Pages. • Anything above XL needs to be discussed case by case. Utilise Hot Add to start small (needs Datacenter edition) • See speaker notes for more info
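
  Standard sizes can be enforced at provisioning time. The sketch below defines an illustrative T-shirt catalogue (these sizes, and the cluster/datastore names, are this document's assumptions) and creates a "Small" VM; it assumes a PowerCLI version that accepts -MemoryGB/-DiskGB (older builds use -MemoryMB/-DiskMB):

      # Illustrative T-shirt sizes; adjust to your own standard catalogue
      $sizes = @{
          Small  = @{ Cpu = 1; MemGB = 1 }
          Medium = @{ Cpu = 2; MemGB = 4 }
          Large  = @{ Cpu = 4; MemGB = 8 }
      }

      $pick = $sizes['Small']
      New-VM -Name 'web01' -ResourcePool (Get-Cluster -Name 'Tier3-Prod') `
          -NumCpu $pick.Cpu -MemoryGB $pick.MemGB `
          -DiskGB 10 -Datastore (Get-Datastore -Name 'Tier3-DS01')     # 10 GB C:\ per the storage slide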

  49. Operational Excellence

  50. Ownership • Where do you draw the line between Storage and vSphere? Who owns the following: • VMFS and RDM • VMDK • Storage DRS • Who can initiate Storage vMotion? • Virtual disk SCSI controller • Who decides the storage-related design in vSphere? • Where do you draw the line between Network and vSphere? • Who decides which one to use: vSwitch, vDS, Nexus? • Who decides on the network-related design in vSphere? • Who troubleshoots network problems in vSphere? • Where do you draw the line between Security and vSphere?
