BCO2874 vSphere High Availability 5.0 and SMP Fault Tolerance – Technical Overview and Roadmap


Presentation Transcript


  1. BCO2874 vSphere High Availability 5.0 and SMP Fault Tolerance – Technical Overview and Roadmap. Name, Title, Company

  2. Disclaimer • This session may contain product features that are currently under development. • This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product. • Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. • Technical feasibility and market demand will affect final delivery. • Pricing and packaging for any new technologies or features discussed or presented have not been determined.

  3. vSphere HA and FT Today: Minimize downtime without the cost/complexity of traditional solutions • vSphere HA provides rapid recovery from outages • vSphere Fault Tolerance provides continuous availability [Diagram: Coverage axis (Application, Guest OS, VM, Hardware); Downtime axis (minutes, none); features shown: App Monitoring APIs, Partner solutions, Guest Monitoring, Infrastructure HA, Fault Tolerance]

  4. This Talk 1. Technical overview of vSphere HA 5.0 • Presented by Keith Farkas 2. Technical preview of vSphere Fault Tolerance SMP • Presented by Jim Chow [Diagram: Coverage axis (Application, Guest OS, VM, Hardware); Downtime axis (minutes, none); features shown: App Monitoring APIs, Partner solutions, Guest Monitoring, Infrastructure HA, Fault Tolerance; highlighted: HA 5.0, Multiple vCPU FT]

  5. vSphere HA 5.0 Objectives • Learn about the enhancements in vSphere HA 5.0 • Understand the new architecture • Identify questions for the breakout / expert sessions

  6. vSphere HA 5.0: vSphere HA was completely rewritten in 5.0 to: • Simplify setting up HA clusters and managing them • Enable more flexible and larger HA deployments • Make HA more robust and easier to troubleshoot • Support network partitions. The 5.0 architecture is fundamentally different. This talk: • Describes the three key concepts • Summarizes host failure responses. To learn more, see other VMworld HA venues

  7. 5.0 Architecture: New vSphere HA agent • Called the Fault Domain Manager (FDM) • Provides all the HA on-host functionality. As in previous releases: • vCenter Server (VC) manages the cluster • Failover operations are independent of VC • FDMs communicate over the management network [Diagram: four FDM agents and vCenter Server (VC)]

  8. Key Concepts – Part 1 • FDM roles and responsibilities • Inter-FDM communication

  9. FDM Master: One FDM is chosen to be the master • Normally, one master per cluster • All others assume the role of FDM slaves. Any FDM can be chosen as master • No longer a primary / secondary role concept • Selection done using an election. Master-specific responsibilities: • Monitors availability of hosts / VMs in cluster • Manages VM restarts after VM/host failures • Reports cluster state / failover actions to VC • Manages persisted state [Diagram: one master and three slaves, with vCenter Server (VC)]

  10. FDM Slave and Shared Responsibilities: Slave-specific responsibilities: • Forwards critical state changes to the master • Restarts VMs when directed by the master • If the master should fail, participates in master election. Each FDM (master or slave): • Monitors the state of local VMs and the host • Implements the VM/App Monitoring feature [Diagram: one master and three slaves]

  11. The Master Election: An election is held when: • vSphere HA is enabled • Master’s host becomes inactive • HA is reconfigured on master’s host • A management network partition occurs. If multiple masters can communicate, all but one will abdicate. Master-election algorithm: • Takes 15 to 25 s (depends on reason for election) • Elects participating host with the greatest number of mounted datastores [Diagram: FDM agents on hosts ESX 1 to ESX 4]
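
The election rule on this slide can be illustrated with a minimal Python sketch. It is not the FDM implementation: the `Host` type, the host names, and in particular the tie-break on host name are assumptions made for illustration, since the slide states only that the participating host with the most mounted datastores wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str                 # hypothetical identifier, e.g. "esx1"
    mounted_datastores: int   # number of datastores the host has mounted


def elect_master(participants):
    """Pick the participant with the most mounted datastores.

    The tie-break (highest name, compared as a string) is an assumption
    added so the result is deterministic; the slide does not specify one.
    """
    if not participants:
        raise ValueError("no participating hosts")
    return max(participants, key=lambda h: (h.mounted_datastores, h.name))


hosts = [Host("esx1", 4), Host("esx2", 6), Host("esx3", 6), Host("esx4", 2)]
master = elect_master(hosts)
print(master.name)
```

With the sample data, `esx2` and `esx3` tie on datastore count, so the assumed tie-break decides between them.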

  12. Agent Communication: FDMs communicate over the • Management networks • Datastores. Datastores are used when the network is unavailable • Used when hosts are isolated or partitioned. Network communication: • All communication is point to point • Election is conducted using UDP • All master-slave communication is via SSL-encrypted TCP [Diagram: one master and three slaves]

  13. Questions Answered Using Datastore Communication FDM FDM

  14. Questions Answered Using Datastore Communication FDM FDM

  15. Heartbeat Datastores • VC chooses (by default) two datastores for each host • You can override the selection or provide preferences • Use the cluster “edit settings” dialog for this purpose

  16. Responses to Network or Host Failures

  17. Host Is Declared Dead: Master declares a host dead when: • Master can’t communicate with it over the network • Host is not connected to master • Host does not respond to ICMP pings • Master observes no storage heartbeats. Results in: • Master attempts to restart all VMs from host • Restarts on network-reachable hosts and its own host [Diagram: FDM agents on hosts ESX 1 to ESX 4]

  18. Host Is Network Partitioned: Master declares a host partitioned when: • Master can’t communicate with it over the network • Master can see its storage heartbeats. Results in: • One master exists in each partition • VC reports one master’s view of the cluster • Only one master “owns” any one VM • A VM running in the “other” partition will be monitored via the heartbeat datastores and restarted if it fails (in master’s partition) • When partition is resolved, all but one master abdicates [Diagram: FDM agents on hosts ESX 1 to ESX 4]
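
The conditions on slides 17 and 18 can be read as a small decision procedure run from the master's point of view. The sketch below is illustrative only: the function and enum names are hypothetical, the two network conditions on slide 17 are collapsed into a single `network_ok` flag, and the `UNDETERMINED` outcome is an assumption covering a combination the slides do not address. The isolated state is left out because the slides do not say how the master detects it.

```python
from enum import Enum


class HostState(Enum):
    REACHABLE = "reachable"
    PARTITIONED = "partitioned"
    DEAD = "dead"
    UNDETERMINED = "undetermined"  # assumption: case not covered by the slides


def classify_host(network_ok: bool,
                  responds_to_ping: bool,
                  storage_heartbeats_seen: bool) -> HostState:
    """Classify a host from the master's observations."""
    if network_ok:
        return HostState.REACHABLE
    if storage_heartbeats_seen:
        # No network contact, but heartbeats still appear on the datastores.
        return HostState.PARTITIONED
    if not responds_to_ping:
        # No network contact, no ICMP reply, no storage heartbeats.
        return HostState.DEAD
    return HostState.UNDETERMINED


print(classify_host(False, False, True).value)
print(classify_host(False, False, False).value)
```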

  19. Host Is Network Isolated: A host is isolated when: • It sees no vSphere HA network traffic • It cannot ping the isolation addresses. Results in: • Host invokes (improved) isolation response: checks first if a master “owns” a VM; applied if VM is owned or datastore is inaccessible; default is now Leave Powered On • Master: restarts those VMs powered off or that fail later; reports host isolated if both can access its heartbeat datastores, otherwise dead [Diagram: FDM agents on hosts ESX 1 to ESX 4, and the isolation addresses]
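
The host-side checks on this slide can be sketched as two small predicates. This is a simplified illustration, not the agent's code: the function names, the boolean inputs, and the string `"leave_powered_on"` are hypothetical stand-ins for whatever the agent actually observes and for the configured response.

```python
from typing import Iterable, Optional

# Assumption: a string token standing in for the "Leave Powered On" default.
DEFAULT_ISOLATION_RESPONSE = "leave_powered_on"


def host_is_isolated(sees_ha_traffic: bool,
                     isolation_ping_results: Iterable[bool]) -> bool:
    """Isolated: no vSphere HA traffic and no isolation address answers a ping."""
    return not sees_ha_traffic and not any(isolation_ping_results)


def isolation_action(vm_owned_by_master: bool,
                     vm_datastore_accessible: bool,
                     configured_response: str = DEFAULT_ISOLATION_RESPONSE
                     ) -> Optional[str]:
    """Return the response to apply to one VM, or None if it is not applied.

    Per the slide, the response is applied if a master owns the VM or the
    VM's datastore is inaccessible.
    """
    if vm_owned_by_master or not vm_datastore_accessible:
        return configured_response
    return None


print(host_is_isolated(False, [False, False]))
print(isolation_action(vm_owned_by_master=False, vm_datastore_accessible=True))
```

Checking ownership first means an isolated host does not act on a VM that no master would restart elsewhere.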

  20. Key Concepts – Part 2 • HA Protection and failure-response guarantees

  21. vSphere HA Response to Failures

  22. HA Protected Workflow (in time order): User issues power on for a VM → Host powers on the VM → VC learns that the VM powered on → VC tells master to protect the VM → Master receives directive from VC → Master writes fact to a file → Write is done

  23. HA Restart Guarantee (in time order): User issues power on for a VM → Host powers on the VM → VC learns that the VM powered on → VC tells master to protect the VM → Master receives directive from VC → Master writes fact to a file → Write is done • Earlier in the workflow: an attempt may be made if a failure occurs now • Once the write is done: an attempt will be made for failures now and in future
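
The guarantee on this slide can be expressed as a function of how far the protection workflow has progressed. The sketch below is a minimal illustration under one stated assumption: every step before "Write is done" is treated as best effort, because the flattened slide does not show exactly where the "may be made" annotation begins. The `Step` enum and the returned strings are hypothetical names.

```python
from enum import IntEnum


class Step(IntEnum):
    """Workflow steps from slide 22, in time order."""
    USER_ISSUES_POWER_ON = 1
    HOST_POWERS_ON_VM = 2
    VC_LEARNS_POWERED_ON = 3
    VC_TELLS_MASTER = 4
    MASTER_RECEIVES_DIRECTIVE = 5
    MASTER_WRITES_FACT = 6
    WRITE_DONE = 7


def restart_guarantee(last_completed: Step) -> str:
    """Describe the restart guarantee if a failure occurs after this step."""
    if last_completed >= Step.WRITE_DONE:
        # The master has persisted the fact that the VM is protected.
        return "will be attempted"
    # Assumption: anything earlier is best effort.
    return "may be attempted"


print(restart_guarantee(Step.MASTER_RECEIVES_DIRECTIVE))
print(restart_guarantee(Step.WRITE_DONE))
```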

  24. vSphere HA Protection Property • Is a new per-VM property • Reports on whether a restart attempt is guaranteed • Is shown on the VM summary panel and optionally in VM lists

  25. Values of the HA Protection Property (value reported by VC along the workflow): User issues power on for a VM → Host powers on the VM → VC learns that the VM powered on → VC tells master to protect the VM → Master receives directive from VC → Master writes fact to a file → Write is done; master tells VC → VC learns VM has been protected. Values shown, in order: N/A, Unprotected, Protected

  26. Wrap Up

  27. vSphere HA Summary • vSphere HA feature provides organizations the ability to run their critical business applications with confidence • 5.0 enhancements provide • A solid, scalable foundation upon which to build to the cloud • Simpler management and troubleshooting • Additional and more robust responses to failures [Diagram: resource pool of three VMware ESXi servers, labeled Operating Server, Failed Server, Operating Server]

  28. To Learn More About HA and HA 5.0 • At VMworld • See demo in VMware booth in solutions exchange • Try it out in lab HOL04 – Reducing Unplanned Downtime • Attend group discussions GD15 and GD35 – vSphere HA and FT • Review panel session VSP1682 – vSphere Clustering Q&A • Talk with knowledge expert (EXPERTS-09) • Offline • Availability Guide • Best Practices Guide • Troubleshooting Guide • Release notes

  29. vSphere Fault Tolerance SMP Technical Preview: Objectives • Why Fault Tolerance? • What’s new: SMP

  30. vSphere Availability Portfolio [Diagram: Coverage axis (Application, Guest OS, VM, Hardware); Downtime axis (minutes, none); features shown: App Monitoring APIs, Guest Monitoring, Infrastructure HA, Fault Tolerance]

  31. Why Fault Tolerance? • Continuous Availability • Zero downtime • Zero data loss • No loss of TCP connections • Completely transparent to guest software • Simple UI: Turn On Fault Tolerance • Delegate all management to the virtual infrastructure [Diagram labels: Users, Apps, OS]

  32. Background • 2009: vSphere Fault Tolerance in vSphere 4.0 • 2010: Updates to vSphere Fault Tolerance in vSphere 4.1 • 2011: Updates to vSphere Fault Tolerance in vSphere 5.0 • Details: http://www.vmware.com/products/fault-tolerance/ • Problem: • FT only for uni-processor VMs • Is FT for multi-processor VMs possible? • An impressively hard problem • Concerted effort to find an approach • Reached milestone • We’d like to share it

  33. A Starting Point: vSphere FT [Diagram: two stacks of Application, Operating System, Virtualization Layer; labels: vLockstep, FT logging, Shared Disk]

  34. A Clean Slate [Diagram: two stacks of Application, Operating System, Virtualization Layer; labels: SMP protocol, vLockstep, FT logging, 10 GigE, Shared Disk]

  35. A Clean Slate • Spare you the details • See it in action [Diagram: two stacks of Application, Operating System, Virtualization Layer; labels: SMP protocol, FT logging, 10 GigE]

  36. Live Demo • Experimental setup, caveats [Diagram: two stacks of Application, Operating System, Virtualization Layer, plus a Client with its own Operating System; labels: SMP protocol, FT logging, 10 GigE]

  37. Live Demo Summary • SMP FT in action • Presented a good solution • Client oblivious to FT operation • SwingBench client • SSH client • Transparent failover • Zero downtime, zero data loss • Taste for performance / bandwidth • But that’s not all

  38. Performance Numbers • Similar configuration to vSphere 4 FT Performance Whitepaper • Models real-world workloads: 60% CPU utilization

  39. vSphere FT Summary • Why Fault Tolerance • Continuous availability • Fault Tolerance for multi-processor VMs • Good solution to impressively hard problem • A new design • Demonstrated similar experience to existing vSphere FT • But more vCPUs

  40. vSphere HA and FT Future Directions

  41. vSphere HA and FT – Technical Directions: Technical directions include • More comprehensive coverage of failures for more applications [Diagram: Coverage versus Downtime; labels: Multi-tier application, App Monitoring APIs, Application, VM/Guest Monitoring, Guest OS, Infrastructure HA, Fault Tolerance, Hardware/VM, Multiple vCPUs, Metro HA, Protection against host component failures]

  42. vSphere HA and FT – Technical Directions: Technical directions include • More comprehensive coverage of failures for more applications • Broader set of enablers for improving availability of applications [Diagram: Coverage versus Downtime; labels as on slide 41, plus: Building blocks for creating available apps, API extensions]

  43. vSphere HA and FT – Technical Directions: Technical directions include • More comprehensive coverage of failures for more applications • Broader set of enablers for improving availability of applications [Diagram: Coverage versus Downtime (minutes, none); labels as on slide 42, plus: Partner solutions]

  44. vSphere HA and FT – Technical Directions: Technical directions include • More comprehensive coverage of failures for more applications • Broader set of enablers for improving availability of applications • Solidifying vSphere as the platform for running all mission-critical applications [Diagram: Coverage versus Downtime (minutes, none); labels as on slide 43]

  45. Thank you! Questions?

  46. BCO2874 vSphere High Availability 5.0 and SMP Fault Tolerance – Technical Overview and Roadmap

  47. Additional vSphere HA 5.0 Details

  48. Troubleshooting

  49. Troubleshooting vSphere HA 5.0 • HA issues proactive warnings about possible future conditions • VMs not protected after powering on • Management network discontinuities • Isolation addresses stop working • HA host states provide granularity into error conditions • All HA conditions are reported via events; config issues/alarms for some • Event descriptions describe the problem and the actions to take • All event messages contain “vSphere HA”, so searching for HA issues is easier • HA alarms are more fine-grained and auto-clearing (where appropriate) • The 5.0 Troubleshooting Guide discusses the likely top issues, e.g., • Implications of each of the HA host states • Topics on heartbeat (HB) datastores, failovers, admission control • Will be updated periodically