vSphere 4 Design Considerations, Jan 2010 Iwan ‘e1’ Rahabok Senior Systems Consultant VMware M: +65 9119 9226 | email@example.com | IM: e1_ang (Yahoo) | FaceBook.com/e1ang | tinyurl.com/user-group-SG | tinyurl.com/user-group-ID VCP, VTSP
Folks, some disclaimers… • This is my personal opinion. • Please don’t take it as an official and formal VMware Inc recommendation. I’m not authorised to give one. • Also, we should generally judge the content, rather than the organisation/person behind the content. • Technology changes • 10 Gb Ethernet, SSD disks, 8-core CPUs, FCoE, CNA, vStorage API, storage virtualisation, etc. will impact the design • New modules/products from VMware will also impact the design. • With that, let’s have a professional *Design* discussion * Not an emotional & religious discussion
Table of Contents • Background • Before you design… • vSphere Design: Overview • vSphere Design: Server • vSphere Design: Network & Security • vSphere Design: Storage • vSphere Design: VM • vSphere Design: Miscellaneous • Upgrade from VI3 to vSphere
Wait! Before you design… Make sure you know some basics. So let’s test the waters first
Quiz • Which of the following statements on Storage vMotion are true in vSphere: • You can do it, but only via the command line. There is no graphical interface provided directly by VMware. • You can do it from the vSphere client now. One of the enhancements in Storage vMotion is the ability to do it via the GUI. • It no longer requires 2X the amount of RAM/CPU, as it does not create an image of the VM. It no longer does a self-vMotion. • It uses a fast suspend-resume approach, which is different from the vMotion technique used in VI3 • You cannot use it to save storage, as you cannot convert from Thick Disk to Thin Disk. To convert, downtime is required.
Quiz • Which of the following statements on DPM are true in vSphere: • Unlike VI3, it is no longer experimental. VMware provides support for DPM. • It works on all hardware as it’s hardware independent. Any model from any vendor. • It can save the company cost (electricity cost) • It supports a variety of technologies, such as Wake-On-LAN, IPMI and HP iLO. • It does not work with DRS, as DRS will try to load balance across as many ESX hosts as possible, while DPM tries to limit that.
Quiz • Which of the following statements on vDS are true in vSphere: • To use it, you need to use the Cisco Nexus 1000V switch. VMware does not provide its own variant of vDS. • Migration of a VM from a local vSwitch to vDS requires downtime on the VM. • It allows for better segregation of duties (network team and server team) in Data Center operations. • It helps you set up Private VLANs, which allow a single VLAN to be split into smaller networks. • It is much easier to design and configure relative to the local vSwitch, and you don’t need to discuss it with the network team • vDS is a cluster-level object. It is not a datacenter-level object. So a single vDS does not span multiple clusters.
Quiz • Which of the following are enhancements in vSphere • Ability to do hot-add CPU for most guest OSes, such as Windows 2003 Standard Edition and Solaris 10 • Ability to hot-remove CPU for some guest OSes. E.g. you can reduce CPUs from 8 to 4 in Windows 2008 64-bit Server Datacenter Edition • Ability to run an odd number of CPUs, such as 3 or 5 CPUs, in a VM. You are no longer bound by 1, 2, and 4 CPU configurations. • Ability to have thin-provisioned disks. Thin Provisioning allows you to save on storage. • Ability to perform ESX host upgrades using Update Manager 4. • Ability to use paravirtualised SCSI disks
Quiz • Which of the following are enhancements in vSphere: • vSphere has better storage performance due to features such as the para-virtualised SCSI driver, Asynchronous I/O, and an updated iSCSI stack • vSphere has better I/O performance as it can assign PCI Devices natively to a VM. • vSphere has better networking performance due to features such as enhanced network drivers & jumbo frames (10 Gb support) • vSphere has better CPU performance due to improvements in the VMkernel scheduler. The “cell lock” has been removed. • The VMkernel is now 64-bit, instead of 32-bit in VI3. • vSphere assists the VI Admin in troubleshooting performance via its improved charts.
Quiz • Which of the following are enhancements in vSphere in the area of management: • You can now link multiple vCenters, providing you with a global view. • You can now search across multiple vCenters from the vCenter client. • You can see more info about your datastores. Info such as snapshots, multi-pathing, and storage connectivity is now available more easily. • You can create higher-level scripts using vCenter Orchestrator. • You can ensure compliance of hosts by using Host Profiles. • You can now manage the storage array (e.g. EMC, NetApp) natively via the vCenter client. You no longer need the storage vendor admin client. • You can group related applications (e.g. a 3-tier system of web, app, DB) into 1 logical group using the vApp feature. • Licencing is much simpler. Also, there is no more licence file, as it has been replaced by a simple 25-character licence key.
Assumptions • Assumptions are needed to avoid the infamous “It depends…” answer • A design for 10 VMs differs from a design for 100 VMs, which in turn differs from a design for 1000 VMs. • A design for large VMs (8 vCPU, 64 GB) differs from a design for small VMs • A design for desktop VMs (WinXP) differs from one for server VMs • To eliminate a vague design, we need to make assumptions. • The Design in this presentation has these assumptions: • 2 physical Data Centers, connected via SAN replication. • 100 – 300 VMs for Servers. • 1:3 ratio between Production and Non-Production (Dev, Test, UAT, Production Fix, etc.) • MS Active Directory is used as the corporate directory • 1000 VMs for Desktops. • But this is not the focus of this document. • SAN environment, not iSCSI/NFS. • VMs are mostly 1 vCPU, with a few 2 or 3 vCPU. 4 vCPU will be suitable when Xeon 7500 is released. • VMs are mostly 2 GB to 6 GB, with a few 8 – 16 GB VMs • Cisco Nexus is not yet incorporated. Waiting for Cisco input on Nexus best practices • Separate people for Network, Server and Storage.
Now that we have set the broad parameters… • The rest of the slides cover a design that is bound by these parameters. • The focus is on the Overall Design, as opposed to detailed Configuration. Settings that do not impact design are generally not covered here. • This is a Best Practice (read: ideal) • It is not the minimum requirement. It is much higher than the minimum. • For example, it specifies 10-12 GE NIC ports per ESXi box, which is much higher than most configurations. • This is a Design document • In your work, there are many other documents/areas you need to produce. • Examples: • Operation Guide: covers things like standard operating procedures, monitoring • Configuration Settings: documents the actual settings • Installation Guide • Test Plan
DataCenter and Cluster • In our design, we will have only 1 Datacenter for Production • But we will have 2 clusters for Server VMs + multiple clusters for Desktop VMs • Inventory objects can interact within datacenters, but have only limited interaction across datacenters. • For example, you can vMotion from one cluster to another within a datacenter, but not to another datacenter. On the other hand, you can clone a virtual machine within a datacenter and to a different datacenter • 3 types of clusters that you should have: • Production • Server • Desktop • Non Prod • Dev, Test, DR, UAT, Staging, etc. • Sandpit: Lab for the VI Admin. • Used for testing/evaluating Updates/Upgrades or new features • 2 hosts are enough • Complemented by ESXi on top of ESXi • If there is not enough budget, then merge Sandpit with Non Prod • Physically separate production and non-production VMs • Allows patching on the Non-Prod cluster first • Shares are used instead of fixed Reservations & Limits
Cluster Design (diagram): a vSphere Client and a Web Browser Client connect to the vCenter Server, which manages Prod Cluster 1, Prod Cluster 2 and a Desktop Cluster of ESXi hosts running VMs, connected via LAN switches and FC switches to Tier 1, Tier 2 and Tier 3 Storage.
Explanation of previous 2 slides • They show the Overall Architecture. • Physical Sites and Farms are shown. • It has a Production Site and a DR Site. • DR servers double up as Test/Dev/UAT servers. • In this approach, we choose not to mix Prod and Non-Prod in the same VMware Cluster • Just FYI, mixing Prod and Non-Prod has the following advantages: • Prod is spread over more ESX hosts. A host failure impacts fewer VMs • Non-Prod normally has much less workload, so Prod gets more resources • DR and vCenter Linked Mode are shown • Array-level replication is used. This is needed by SRM 4 • The Desktop farm is shown • Notice it has its own vCenter, and it is based on iSCSI
Production Cluster • 8 hosts per cluster. Why 8, not 4 or 12 or 16 or 32? • Best practice for a cluster is the same hardware spec with the same CPU frequency. • Eliminates risk of incompatibility • Complies with Fault Tolerance best practices • So more than 8 means it’s more difficult/costly to keep them all the same. You need to buy 8 hosts at a time. • Cost: Too few hosts result in overhead (the “spare” host) • Mgmt: Too many hosts are harder to manage (patching, performance troubleshooting, too many VMs per cluster, HW upgrades) • Some cluster changes in the Advanced Attributes require the cluster to be disabled and re-enabled • Harder/longer to do this when there are many hosts • DRS: 8 gives DRS sufficient hosts to “maneuver” • The #VMs-per-host limit decreases by 4x in clusters with >8 hosts: • 160 VMs per host if <= 8 hosts in the cluster • 40 VMs per host if > 8 hosts in the cluster • We should avoid being near the limit. 40 VMs/host is easily reached in a Lab Manager or View environment. • Availability: Able to withstand 1-2 host failures • A balance between too small (4 hosts) and too large (>12 hosts) • Allows us to isolate 1 host for VM-troubleshooting purposes • Upgrading >8 servers at a time is expensive ($$) and complex • 8 is a number “known” to the Storage team and Network team. • Storage: 8 hosts/LUN gives a safe value of 16 paths to a LUN • Mgmt: 8 is an easy number to remember. And a lucky one. And we all know that production needs luck, not just experience
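The path arithmetic in the last storage bullet can be sketched in a few lines of Python. Note the dual-port HBA and single visible target per port are illustrative assumptions, not a mandate from the slide:

```python
def paths_per_lun(hosts, hba_ports_per_host=2, targets_per_port=1):
    """Total initiator-target paths to one shared LUN across a cluster.

    Each host contributes hba_ports * targets paths; many arrays cap
    the total paths per LUN, so cluster size feeds into storage design.
    """
    return hosts * hba_ports_per_host * targets_per_port

# 8 hosts with dual-port HBAs -> the 16 paths/LUN the slide calls safe
print(paths_per_lun(8))  # 16
```

Growing the cluster to 12 hosts would push the same LUN to 24 paths, which is why the host count and the storage design have to be agreed together.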
Production Cluster • Sizing is based on 7 ESX hosts. ESX host no. 8 serves 4 purposes: • Improves performance, as it is also active. This is not an N+1 concept, as all nodes are active. • Buffer for sudden workload increases, or predictable increases (month end, year end, special events, etc.) • Hardware failure or maintenance. A failure on Friday evening does not trigger IT to work on the weekend. • Isolating a VM for performance troubleshooting. • Mix VMs from different BUs and different Guest OSes • At this scale of implementation, it may not make financial sense to have separate clusters for different Business Units or Divisions or different Guest OSes (e.g. Linux, Windows). • Mixing VMs normally results in a relatively more balanced/diverse workload • Mixing VMs means the VMkernel can’t deliver the maximum memory sharing. This is one reason why we recommend 32 GB or more. • If the networking infra is mixed across business units, then the same logic can be applied to the virtualization layer • VMs are isolated from each VM’s point of view. • VLANs in virtual switches (layer 2) • Deploy vShield Zones to protect VMs further • Resource Pools • Tier 1 apps: High shares. • Tier 2 apps: default shares • Tier 3 apps: Low shares • Don’t put a VM and an RP as “siblings” at the same level
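The tiered-shares idea above can be illustrated with a small sketch of how shares divide contended capacity proportionally. The 28,000 MHz capacity and the 8000/4000/2000 share values are hypothetical figures for illustration, not VMware defaults:

```python
def allocate_by_shares(capacity_mhz, shares):
    """Divide contended CPU capacity among sibling resource pools in
    proportion to their share values (shares only matter under contention)."""
    total = sum(shares.values())
    return {pool: capacity_mhz * s / total for pool, s in shares.items()}

# Hypothetical tiers in a 4:2:1 ratio, contending for 28 GHz
alloc = allocate_by_shares(28000, {"Tier1": 8000, "Tier2": 4000, "Tier3": 2000})
print(alloc)  # Tier1 gets half, Tier2 a quarter, Tier3 an eighth... of 28000
```

This also shows why a VM placed as a sibling of a resource pool is a mistake: its individual share value competes directly against a whole pool's total.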
Non Prod Cluster • More relaxed on the # of Hosts • Can be more than 8. • But if you need >12, keep them as 2 clusters. • Able to withstand 1 host failure • Resource Pools • Either group them by Environment, or by Projects • Environment: Staging, Test, Dev, Training • Projects: Intranet, ERP, <project name>, etc. • Can be cascaded: by projects, then by environment • Folders • Use to organise VMs by business units, etc. • Not used to separate VMs by performance requirements. Use Resource Pools for this.
The need for a Non Prod Cluster • VM “Testing” • Sometimes you need to bring up a clone or restore a backup. • Patching of ESXi • “Thou shalt not patch directly in production” is a cardinal rule. • So you do not want to patch your Production Cluster right away. • Updating of ESX • e.g. from 4.0 to 4.0 Update 1 • This is even bigger than a patch, as more changes are involved. • Evaluating or Implementing new features • e.g. vDS, Nexus 1000V, Fault Tolerance, vShield Zones, VMsafe • All the above need proper testing. For example, you should not evaluate Nexus in your Production Cluster. • Upgrade • e.g. from 4.0 to 4.x or 5.0 • This needs extensive testing and a roll-back plan.
Explanation of previous slide • Drilling down a bit more to show more details • Showing both Production and DR sites. • But this slide now shows • Which LUNs are replicated. • We do not need to replicate the Infra LUNs • More details on the clusters • The # of VMs varies to give flexibility of workload
DR Site • This is for the DR Site. • This is for the Non-Production environment. • We only have 1 cluster
Explanation of previous slides • This is the same example • Drilling a bit more in depth, now adding the Infra VMs • It shows • Admin Client • For all administration purposes to manage the vSphere infrastructure, not just tied to vSphere. • Infra VMs all reside on Cluster 2. This keeps Cluster 1 clean, “strictly for business”. • vShield Zones VMs. 1 per ESX host • VC + DB as 1 VM. Makes snapshotting easier. • SRM 4 + SRM 4 DB • VMware Converter • vMA: only powered on when we need Remote CLI access • View 4 management VMs are placed outside the View 4 Clusters • The Non-Production Cluster needs its own set of Infra VMs as it is a separate vCenter. • Storage Management VM • Some storage (e.g. EMC) needs to have some “servers” to manage it. These can be virtualised too.
Cluster • Settings • DRS fully automated. Sensitivity: Moderate • Use anti-affinity or affinity rules only when needed. • More things for you to remember. • Gives DRS less room to maneuver • DPM enabled. Choose hosts that support DPM • VM Monitoring enabled. • VM monitoring sensitivity: Medium • HA will restart the VM if the heartbeat between the host and the VM has not been received within a 60-second interval • EVC enabled. Enables you to upgrade in future. • Prevent VMs from being powered on if they violate availability constraints → better availability • Host isolation response: Power off VM • Prevents data/transaction integrity risk
The "70% Rule" • Generally, it is best to avoid exceeding 70% of the rated capacity of any component • Except for tape drives, where it’s a long-running stream. • Except for long-running batch jobs, as they tend to consume 100% • Manufacturer throughput and performance specifications are normally based on a theoretical environment • Never achieved in the real world • Response time increases significantly after the 70% utilization threshold is exceeded • Specific to ESX • We need the extra resources for vMotion/DRS/Storage vMotion/etc.
Throughput vs. Response Time (chart): as throughput approaches 100% of maximum, responsiveness decreases to 14% of optimal. Y-axis: Response Time* (latency – ms); X-axis: Percentage of Maximum Throughput (bandwidth). *Hennessy & Patterson, "Computer Architecture – A Quantitative Approach"
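The knee in that curve is the classic queueing-theory result. A minimal M/M/1-style sketch (my own illustration, not taken from the cited book) shows why planned utilisation stops at 70%:

```python
def response_time(service_ms, utilization):
    """M/M/1 approximation: response time = service time / (1 - U).

    Latency stays near the raw service time at low load, then blows
    up as utilization approaches 100%.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1.0 - utilization)

# At 70% load latency is ~3.3x the idle service time; at 90% it is 10x
print(round(response_time(1.0, 0.7), 2))  # 3.33
print(round(response_time(1.0, 0.9), 2))  # 10.0
```

The jump from 3.3x at 70% to 10x at 90% is the practical argument behind the 70% rule: the last 30% of "capacity" is bought at a disproportionate latency cost.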
ESX or ESXi • Whenever possible • Use ESXi over ESX. • The long-term roadmap is ESXi, not ESX. • Simpler to manage. No need to learn & patch Linux. • No need to worry about CPU 0, where the Service Console runs. • Heavy SC utilisation (CPU 0 or RAM) may trigger overall slowness, or a crash. Ex: KB 1009525 (for 3.5 only) • Certain features like vScsiStats are not ported yet. The esxcfg-* commands are being ported to vicfg-* • Some 3rd-party agents may not yet be ported to ESXi. Always check with the 3rd-party vendor for the latest info. • Use ESXi Embedded over ESXi Installable. • Allows us to do away with local disks in production. • Ask the OEM vendor for integration details with their drivers. • Boot from SAN is not possible with ESXi 4.0. • ESXi Host Specification • Get a compatible host from http://www.vmware.com/resources/compatibility/search.php • Click the Server name to get detailed requirements. Normally, it specifies the BIOS version. • 2 sockets of quad-core Xeon 5500 (8 cores, 16 threads; turn Hyper-Threading on) • DPM support (IPMI or HP iLO). You may need to pay the server vendor extra for this feature
ESXi Host Specification • 32 – 64 GB RAM. • Spread equally across the sockets. • The VMkernel has a Home Node concept on NUMA systems • Lights-out management • So you don’t have to be in front of the physical server to do certain things (e.g. go into the command line as requested by VMware Support) • 2 FC ports and 10-12 GE NIC ports • Get 2x quad-port NICs. Since the built-in NICs number 2-4, the box will end up with 10 – 12 ports. • Hardware agent is properly configured • Very important to monitor hardware health due to the many VMs in 1 box. • Use a network adapter that supports the following: • Checksum offload • Capability to handle high-memory DMA (64-bit DMA addresses) • Capability to handle multiple scatter/gather elements per Tx frame
Estimated Hardware Cost: S$10K per ESXi • Configuration: • 2 Xeon 5520 • 36 GB RAM (18 x 2). Add $1400 for 48 GB • 2 FC HBA • 10 GE ports (no hardware iSCSI) • 5 year warranty (next business day) • 2 local HD with controller card • Embedded ESXi • Installation service • Source: www.dell.com.sg, SMB section, 12 Jan 2009
ESXi has a smaller set of services One reason why ESXi is the platform of choice moving forward is that ESXi has a much smaller set of services that need to be considered in terms of security. (Screenshots: ESX 4 services vs. ESXi 4 services)
RAM sizing • For ideal performance, fit a VM within 1 CPU-RAM “pair” (NUMA node) to avoid the “remote memory” effect. • Also, populate all the DIMM slots, instead of leaving some empty. This brings the benefit of memory interleaving. • For best performance of VMs on NUMA systems: • # of vCPUs + 1 <= # of cores in 1 socket. So running a 5-vCPU VM on a quad-core forces a remote memory situation • VM memory <= memory of one node • (Diagram: 16 GB DIMM banks attached to each socket) Note: VMware does not officially recommend Intel or AMD. This is just for illustration. Use VMmark for server selection.
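Those two rules of thumb are easy to encode as a pre-deployment check. A sketch, with the example figures (quad-core socket, 32 GB per node) as assumptions:

```python
def fits_one_numa_node(vcpus, vm_mem_gb, cores_per_socket, node_mem_gb):
    """Check the slide's NUMA rules of thumb:
    vCPUs + 1 <= cores per socket, and VM memory <= one node's memory."""
    return vcpus + 1 <= cores_per_socket and vm_mem_gb <= node_mem_gb

# A 5-vCPU VM on a quad-core socket forces remote memory access
print(fits_one_numa_node(5, 16, 4, 32))  # False
# A 3-vCPU, 16 GB VM fits comfortably within one node
print(fits_one_numa_node(3, 16, 4, 32))  # True
```

Running this check against the proposed VM inventory during design flags the handful of VMs that would spill across NUMA nodes on the chosen server model.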
ESX: sizing • Rules of Thumb: • 4 GB RAM per core • 2 – 4 vCPUs per physical core. • An 8-core box → 16 – 32 vCPUs • Design with ~10 VMs per box in Production. • This allows you to have some 2-vCPU and 3-vCPU VMs • ~10 VMs per box means the impact of downtime when a host fails is capped at ~10 Production VMs. • ~10 VMs per box in an 8-node cluster means the ~10 VMs may be able to boot on the remaining 7 hosts in the event of HA, hence reducing downtime. • Aim for 70% utilisation • It will be 80% when HA occurs (a box fails) • PCI Slots on the motherboard • Since we are using 8 Gb FC HBAs, make sure the physical PCI-E slot has sufficient bandwidth. • A single dual-port FC HBA makes more sense if the saving is high and you need the slot. But there is a risk of bus failure. Also, double-check that the chip can handle the throughput of both ports. • If you are using blades, and have to settle for a single 2-port HBA (instead of two 1-port HBAs), then ensure the PCI slot has bandwidth for 16 Gb. When using a dual-port HBA, ensure the chip & bus in the HBA can handle the peak load of 16 Gb.
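The rules of thumb above combine into a simple host-capacity calculator. The defaults (3 vCPUs per core as the midpoint of 2-4, 4 GB RAM per core, 70% ceiling) are planning assumptions from this deck, not VMware limits:

```python
def host_capacity(cores, ram_gb, target_util=0.70,
                  vcpu_per_core=3, ram_per_core_gb=4):
    """Derive plannable vCPUs and RAM for one host from the rules of
    thumb: 2-4 vCPUs/core (3 as midpoint), 4 GB RAM/core, 70% ceiling."""
    usable_vcpus = int(cores * vcpu_per_core * target_util)
    usable_ram_gb = min(ram_gb, cores * ram_per_core_gb) * target_util
    return usable_vcpus, usable_ram_gb

# An 8-core, 32 GB host -> roughly 16 plannable vCPUs and ~22 GB of RAM
print(host_capacity(8, 32))
```

With mostly 1-2 vCPU, 2-6 GB VMs (the deck's assumption), both figures land near the ~10 VMs per box this design targets.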
VMware VMmark • Use VMmark as the basis for CPU selection • It is the official benchmark for VMware, and it uses multiple workloads • Other benchmarks are not run on vSphere, and typically test 1 workload • Use it as a guide only • Your environment is not the same. • You need headroom and HA. • How it’s done • 1 Tile = 6 VMs: • MS Exchange, MySQL, Apache, J2EE, File Server, Idle VM • Results page: • www.vmware.com/products/vmmark/results.html • VMmark 1.1 results are directly comparable to VMmark 1.0 results. The underlying virtual hardware definitions and load levels for each workload have not changed.
Sample Results • Look at the Tiles number, not so much the score. • This tells us that the DL380 G6 can run 17 Tiles, at 100% utilisation. Each Tile has 6 VMs, but 1 is idle. 17 x 5 VMs = 85 active VMs! At 70% ESX host utilisation, that’s around 59 VMs. • The DL385 G6 runs 11 tiles. • The 2 boxes have similar submission dates, around mid 2009. VMware does not recommend 1 OEM over the other. I use 1 OEM here so we don’t compare between OEMs
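The tile-to-VM arithmetic worked out on this slide generalises to a one-liner, with the 70% derating taken from the earlier rule:

```python
def plannable_vms(tiles, active_vms_per_tile=5, target_util=0.70):
    """Convert a VMmark tile count (measured at 100% load) into a
    planning figure: active VMs per tile, derated to 70% utilisation."""
    return int(tiles * active_vms_per_tile * target_util)

print(plannable_vms(17))  # 59, the DL380 G6 figure worked out above
print(plannable_vms(11))  # 38, the same calculation for the DL385 G6
```

Remember this is a guide only: the VMmark tile VMs are small and uniform, so real mixed workloads will land lower.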
Time Keeping and Time Drift • Critical to have the same time on all ESX hosts and VMs. • All VMs & ESX hosts get time from the same 1 internal NTP server • The internal NTP server gets time from a reliable external server. • Do not virtualise the NTP server • As a VM, it may experience time drift. • Physical candidate for the NTP Server: • VCB Proxy server
Network • At least 4 vSwitches or dvPort groups per ESX host. If your licence is Enterprise Plus, then use vDS
ESX: Network Configuration (diagram). 2 VLANs required: Mgmt & Prod
ESX: Network Configuration (no FC) (diagram). 2 VLANs required: Mgmt & Prod
Explanation • The diagram shows why we need 8 – 12 GE ports. • Alternatively, we can also use 2x 10 GE ports. This gives higher performance. • Additional NIC ports can be deployed if a VM needs > 2 Gb. Choose a server with sufficient port expansion • 2x 10 GE ports give flexibility & higher throughput • Future scalability • Consider 1-2 years ahead when looking at NIC ports. You may need to give more network bandwidth to VMs as you run more VMs, or run network-demanding VMs. • You may want to boot ESXi from the network (PXE boot) when such a feature becomes feasible in a future release. This may require a separate physical NIC from a best-practice point of view. • Once wired, it is hard (and expensive) to rewire. All cables are already connected and labelled properly to each physical switch port. • In a Blade scenario, it could be worse. If all the PCI slots are occupied, then there may be no way to expand. • The diagram includes vShield Zones • It adds 1 hidden vSwitch per vSwitch for VMs. The Management network does not require vShield Zones protection. • Reasons for isolation: • Availability • Performance • Scalability (for future demand, so we don’t have to rewire) • Re-wiring is expensive. “Messy” cabling creates complexity in the data center • Security • If you use VLANs, then physical isolation is less of a concern • Use of Jumbo Frames is recommended for best vMotion performance. The physical switch must support Jumbo Frames too.
Explanation • Assumptions • All VMs in the ESX host share the same vSwitch. So VST VLAN tagging is required. • If the physical switches are dedicated per VLAN, and there are not enough ports in the core switch, then we need to add 2 NIC ports per VLAN. Example: 4 VLANs = 4 x 2 ports = 8 ports required for the VM LAN. • Why so many NICs are needed: • Recovering from dropped network packets results in large performance degradation. • In addition to the time spent determining that data was dropped, the retransmission uses network bandwidth that could otherwise be used for current transactions • It is possible to drop packets when a link is near saturation. The picture below shows 9922 dropped packets while processing near 1 Gbps.
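A deliberately first-order sketch of the retransmission cost described above. It counts only the resent bytes and ignores TCP's congestion-window backoff and detection delay, which make real drops far more expensive:

```python
def goodput_fraction(drop_rate):
    """Fraction of link bandwidth left for useful data when every
    dropped packet must be sent again (first-order estimate only)."""
    return 1.0 / (1.0 + drop_rate)

# Even a 1% drop rate wastes ~1% of the wire before counting the
# detection delay and TCP backoff the bullets above warn about
print(round(goodput_fraction(0.01), 3))  # 0.99
```

This is the floor of the penalty; on a near-saturated link the latency penalty of the earlier response-time curve compounds it, which is the argument for spare NIC headroom.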