VMS Clusters: Advanced Concepts

VMS Clusters: Advanced Concepts CETS2001 Seminar 1090, Sunday, September 9, 2001, 210A, Keith Parris

Speaker Contact Info Keith Parris E-mail: parris@encompasserve.org or keithparris@yahoo.com Web: http://www.geocities.com/keithparris/ and http://encompasserve.org/~kparris/ Integrity Computing, Inc. 2812 Preakness Way Colorado Springs, CO 80916-4375 (719) 392-6696

Topics to be covered • Large Clusters • Multi-Site Clusters • Disaster-Tolerant Clusters • Long-Distance Clusters • Performance-Critical Clusters • Recent Cluster Developments

Large Clusters • Large by what metric? • Large in node-count • Large in CPU power and/or I/O capacity • Large in geographical span

Large Node-Count Clusters • What is a “large” number of nodes for a cluster? • VMS Cluster Software SPD limit: • 96 VMS nodes total • 16 VMS nodes per CI Star Coupler

Large Node-Count Clusters • What does a typical large node-count cluster configuration consist of? • Several core boot and disk servers • Lots & lots of workstation satellite nodes

Large Node-Count Clusters • Why build a large node-count cluster? • Shared access to resources by a large number of users • Particularly workstation users • Easier system management of a large number of VMS systems • Managing two clusters is close to twice the work of managing one • Adding just 1 more node to an existing cluster is almost trivial in comparison

Large Node-Count Clusters • Challenges in building a large node-count cluster: • System (re)booting activity • LAN problems • System management • Hot system files

System (re)booting activity • Reboot sources: • Power failures • LAN problems • VMS or software upgrades • System tuning

System (re)booting activity • Fighting reboot pain: • Power failures: • UPS protection • LAN problems: • Redundancy • Sub-divide LAN to allow “Divide and conquer” troubleshooting technique • Monitoring

System (re)booting activity • Fighting reboot pain: • VMS or software upgrades; patches: • Try to target safe “landing zones” • Set up for automated reboots • System tuning: • Run AUTOGEN with FEEDBACK on regular schedule, pick up new parameters during periodic automated reboots

System (re)booting activity • Factors affecting (re)boot times: • System disk throughput • LAN bandwidth, latency, and quality • Boot and disk server horsepower

System disk throughput • Off-load work from system disk • Move re-directable files (SYSUAF, RIGHTSLIST, queue file, etc.) off system disk • Put page/swap files on another disk (preferably local to satellite nodes) • Install applications on non-system disks when possible • Dump files off system disk to conserve space • All this reduces write activity to system disk, making shadowset or mirrorset performance better

System disk throughput • Avoid disk rebuilds at boot time: • Set ACP_REBLDSYSD=0 to prevent boot-time rebuild of system disk • While you’re at it, use MOUNT/NOREBUILD on all disks mounted in startup • But remember to set up a batch job to run $ SET VOLUME/REBUILD commands during off-hours to free up disk blocks incorrectly left marked allocated when nodes crash (blocks which were in node’s free-extent cache)

System disk throughput • Faster hardware: • Caching in the disk controller • 10K rpm or 15K rpm magnetic drives • Solid-state disks • DECram disks (shadowed with non-volatile disks)

System disk throughput • Multiple system disk spindles • Host-based volume shadowing allows up to 3 copies of each system disk • Controller-based mirroring allows up to 6 spindles in each mirrorset • Controller-based striping or RAID-5 allows up to 14 spindles in a storageset • And you can layer these
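
As a rough illustration of how the layering multiplies out, the sketch below takes the three per-layer limits quoted above and computes the theoretical spindle count if all three layers are stacked. The function name is hypothetical, and the assumption that every layer can be used at its maximum at the same time is mine; real controllers and shadowing configurations may impose lower combined limits.

```python
def max_layered_spindles(shadow_copies=3, mirror_members=6, stripe_members=14):
    """Theoretical upper bound on spindles behind one system disk.

    Defaults are the per-layer limits quoted on the slide. Assumes
    (hypothetically) that all three layers can be stacked at their maximum.
    """
    return shadow_copies * mirror_members * stripe_members


print(max_layered_spindles())  # 3 * 6 * 14 = 252
```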

System disk throughput • Multiple separate system disks for groups of nodes • Use “cloning” technique to replicate system disks and avoid doing “n” upgrades for “n” system disks • Consider throttling satellite boot activity to limit demand

System disk “Cloning” technique • Create “Master” system disk with roots for all nodes. Use Backup to create Clone system disks. • Before an upgrade, save any important system-specific info from Clone system disks into the corresponding roots on Master system disk • Basically anything that’s in SYS$SPECIFIC:[*] • Examples: ALPHAVMSSYS.PAR, MODPARAMS.DAT, AGEN$FEEDBACK.DAT • Perform upgrade on Master disk • Use Backup to copy Master to Clone disks again.

LAN bandwidth, latency, and quality • Divide LAN into multiple segments • Connect systems with switches or bridges instead of contention-based hubs • Use full-duplex links when possible

LAN bandwidth, latency, and quality • Use faster LAN technology at concentration points like backbones and at servers: • e.g. if using Fast Ethernet for satellites, consider using Gigabit Ethernet for server LAN adapters • Provide redundant LANs for servers, backbone

LAN bandwidth, latency, and quality • Try to avoid saturation of any portion of LAN hardware • Bridge implementations must not drop small packets under heavy loads • SCS Hello packets are small packets • If two in a row get lost, a node without redundant LANs will see a Virtual Circuit closure; if failure lasts too long, node will do a CLUEXIT bugcheck

LAN bandwidth, latency, and quality • Riding through temporary LAN problems while you troubleshoot: • Raise RECNXINTERVAL parameter • Default is 20 seconds • It’s a dynamic parameter

LAN bandwidth, latency, and quality • Where redundant LAN hardware is in place, use the LAVC$FAILURE_ANALYSIS tool from SYS$EXAMPLES: • It monitors and reports, via OPCOM messages, LAN component failures and repairs • Described in Appendix D of the OpenVMS Cluster Systems Manual • Workshop 1257: Network Monitoring for LAVCs • Tuesday 1:00 pm, Room 208A • Thursday 8:00 am, Room 208A

LAN bandwidth, latency, and quality • VOTES: • Most configurations with satellite nodes give votes to disk/boot servers and set VOTES=0 on all satellite nodes • If the sole LAN adapter on a disk/boot server fails, and it has a vote, ALL satellites will CLUEXIT! • Advice: give at least as many votes to node(s) on the LAN as any single server has, or configure redundant LAN adapters

LAN redundancy and Votes [Diagram: five nodes with vote counts 0, 0, 0, 1, 1]

LAN redundancy and Votes [Diagram: nodes with vote counts 0, 0, 0, 1, 1, divided into Subset A and Subset B] Which subset of nodes does VMS select as the optimal subcluster?

LAN redundancy and Votes [Diagram: nodes with vote counts 0, 0, 0, 1, 1] One possible solution: redundant LAN adapters on servers

LAN redundancy and Votes [Diagram: nodes with vote counts 1, 1, 1, 2, 2] Another possible solution: enough votes on LAN to outweigh any single server node
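
The vote arithmetic behind these slides can be sketched in a few lines. This is a simplified model, not the actual VMS connection-manager algorithm: it only compares vote totals of two candidate subsets. The node names are hypothetical, and the topology is an assumption on my part: two servers that can still reach each other over a second interconnect, three LAN-only satellites, and a failed LAN adapter on one server (SRV1). Quorum is computed as (EXPECTED_VOTES + 2) / 2, rounded down.

```python
def quorum(expected_votes):
    """Quorum derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def surviving_subset(votes, subset_a, subset_b):
    """Return which candidate subset holds more votes (simplified model)."""
    total_a = sum(votes[node] for node in subset_a)
    total_b = sum(votes[node] for node in subset_b)
    if total_a == total_b:
        return "tie"
    return "A" if total_a > total_b else "B"


# Assumed scenario: SRV1's only LAN adapter has failed.
subset_a = ["SRV1", "SRV2"]                  # servers still see each other
subset_b = ["SRV2", "SAT1", "SAT2", "SAT3"]  # nodes still connected via the LAN

original = {"SAT1": 0, "SAT2": 0, "SAT3": 0, "SRV1": 1, "SRV2": 1}
revised = {"SAT1": 1, "SAT2": 1, "SAT3": 1, "SRV1": 2, "SRV2": 2}

print(quorum(sum(original.values())))                  # 2
print(surviving_subset(original, subset_a, subset_b))  # A: all satellites lose
print(surviving_subset(revised, subset_a, subset_b))   # B: only SRV1 loses
```

With the original votes, the server pair outvotes the LAN side 2 to 1, so every satellite is cut off. With the revised votes, the LAN side holds 5 votes against 4, so only the server with the failed adapter leaves the cluster.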

Boot and disk server horsepower • MSCP-serving is done in interrupt state on Primary CPU • Interrupts from LAN Adapters come in on CPU 0 (Primary CPU) • Multiprocessor system may have no more MSCP-serving capacity than a uniprocessor • Fast_Path on CI may help

Large Node-Count Cluster System Management • Console management software is very helpful for reboots & troubleshooting • If that’s not available, consider using console firmware’s MOP Trigger Boot function to trigger boots in small waves after a total shutdown • Alternatively, satellites can be shut down with auto-reboot and then MOP boot service can be disabled, either on an entire boot server, or for individual satellite nodes, to control rebooting
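
A minimal sketch of the "small waves" idea: split the satellite list into fixed-size groups and boot one group at a time. The function name, node names, and wave size are hypothetical; how each wave is actually triggered (console management software, MOP Trigger Boot, or enabling MOP service per node) is outside the sketch.

```python
def boot_waves(satellites, wave_size):
    """Split a list of satellite node names into consecutive boot waves."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    return [satellites[i:i + wave_size]
            for i in range(0, len(satellites), wave_size)]


# Hypothetical satellite names, booted two at a time.
for number, wave in enumerate(boot_waves(["SAT1", "SAT2", "SAT3", "SAT4", "SAT5"], 2), 1):
    print(number, wave)
```

A smaller wave size limits simultaneous demand on the boot and disk servers at the cost of a longer total restart.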

Hot system files • Standard multiple-spindle techniques also apply here: • Disk striping (RAID-0) • Volume Shadowing (host-based RAID-1) • Mirroring (controller-based RAID-1) • RAID-5 array (host- or controller-based) • Consider solid-state disk for hot system files, such as SYSUAF, queue file, etc.

High-Horsepower Clusters • Why build a large cluster, in terms of CPU and/or I/O capacity? • Handle high demand for same application(s) • Pool resources to handle several applications with lower overall costs and system management workload than separate clusters

High-Horsepower Clusters • Risks: • “All eggs in one basket” • Hard to schedule downtime • Too many applications, with potentially different availability requirements • System tuning and performance • Which application do you optimize for? • Applications may have performance interactions

High-Horsepower Clusters • Plan, configure, and monitor to avoid bottlenecks (saturation of any resource) in all areas: • CPU • Memory • I/O • Locking

High-Horsepower Clusters • Generally easiest to scale CPU by first adding CPUs within SMP boxes • within limits of VMS or applications’ SMP scalability • Next step is adding more systems • But more systems implies less local locking • Local locking code path length and latency are much lower than remote (order-of-magnitude)

High-Horsepower Clusters • Memory scaling is typically easy with 64-bit Alphas: Buy more memory • May require adding nodes eventually

High-Horsepower Clusters • I/O scalability is generally achieved by using: • Multiple I/O adapters per system • More disk controllers, faster controllers, more controller cache • More disks; faster disks (solid-state) • Disk striping, mirroring, shadowing

High-Horsepower Clusters • Challenges in I/O scalability: • CPU 0 interrupt-state saturation • Interconnect load balancing

CPU 0 interrupt-state saturation • VMS receives interrupts on CPU 0 (Primary CPU) • If interrupt workload exceeds capacity of primary CPU, odd symptoms can result • CLUEXIT bugchecks, performance anomalies • VMS has no internal feedback mechanism to divert excess interrupt load • e.g. node may take on more trees to lock-master than it can later handle • Use MONITOR MODES/CPU=0/ALL to track CPU 0 interrupt state usage and peaks

CPU 0 interrupt-state saturation • FAST_PATH capability can move some of interrupt activity to non-primary CPUs • Lock mastership workload can be heavy contributor to CPU 0 interrupt state • May have to control or limit this workload

Interconnect load balancing • SCS picks path in fixed priority order: • Galaxy Shared Memory Cluster Interconnect (SMCI) • Memory Channel • CI • DSSI • LANs, based on: • Maximum packet size, and • Lowest latency • PORT_CLASS_SETUP tool available from CSC to allow you to change order of priority if needed • e.g. to prefer Gigabit Ethernet over DSSI
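
The fixed-priority selection can be modeled as a simple ordered lookup. This is only an illustration of the ordering listed above, not SCS code: the names are hypothetical, all LANs are collapsed into a single "LAN" entry (so the packet-size and latency comparison among LAN paths is not modeled), and the reordered list stands in for the kind of change PORT_CLASS_SETUP is described as allowing.

```python
DEFAULT_PRIORITY = ["SMCI", "Memory Channel", "CI", "DSSI", "LAN"]


def pick_interconnect(available, priority=DEFAULT_PRIORITY):
    """Return the first interconnect in priority order that is available."""
    for kind in priority:
        if kind in available:
            return kind
    return None


# Default order: DSSI is chosen ahead of any LAN path.
print(pick_interconnect({"DSSI", "LAN"}))  # DSSI

# Hypothetical reordering that prefers the LAN (e.g. Gigabit Ethernet) over DSSI.
lan_first = ["SMCI", "Memory Channel", "CI", "LAN", "DSSI"]
print(pick_interconnect({"DSSI", "LAN"}, lan_first))  # LAN
```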

Interconnect load balancing • CI Port Load Sharing code didn’t get ported from VAX to Alpha • MOVE_REMOTENODE_CONNECTIONS tool available from CSC to allow you to statically balance VMS$VAXcluster SYSAP connections across multiple CI adapters
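
Static balancing amounts to deciding up front which adapter carries the connection to each remote node. The sketch below shows one simple policy, round-robin assignment; it does not reflect how MOVE_REMOTENODE_CONNECTIONS works internally, and the function, node, and adapter names are hypothetical.

```python
def balance_connections(remote_nodes, adapters):
    """Assign each remote node's connection to an adapter, round-robin."""
    if not adapters:
        raise ValueError("at least one adapter is required")
    assignment = {adapter: [] for adapter in adapters}
    for index, node in enumerate(remote_nodes):
        assignment[adapters[index % len(adapters)]].append(node)
    return assignment


# Hypothetical remote nodes spread across two CI adapters.
print(balance_connections(["NODE1", "NODE2", "NODE3"], ["PNA0", "PNB0"]))
```

Round-robin ignores how much traffic each connection carries; a real plan would weight the busiest remote nodes first.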

High-Horsepower Clusters • Locking performance scaling is generally done by: • Improving CPU speed (to avoid CPU 0 interrupt-state saturation) • Improving cluster interconnect performance (lower latency, higher bandwidth, and minimizing host CPU overhead) • Spreading locking workload across multiple systems

High-Horsepower Clusters • Locking performance scaling • Use SHOW CLUSTER/CONTINUOUS with ADD CONNECTIONS, ADD REM_PROC and ADD CR_WAITS to look for SCS credit waits. If counts are present and increasing over time, increase the SCS credits at the remote end as follows:

High-Horsepower Clusters • Locking performance scaling • For credit waits on VMS$VAXcluster SYSAP connections: • Increase CLUSTER_CREDITS parameter • Default is 10; maximum is 127

High-Horsepower Clusters • Locking performance scaling • For credit waits on VMS$DISK_CL_DRVR / MSCP$DISK connections: • For VMS server node, increase MSCP_CREDITS parameter. Default is 8; maximum is 128. • For HSJ/HSD controller, lower MAXIMUM_HOSTS from default of 16 to actual number of VMS systems on the CI/DSSI interconnect
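
The test "counts are present and increasing over time" can be made concrete with a small helper. This is a sketch that assumes you have noted the CR_WAITS value for one connection at several points in time; the function name and sample values are hypothetical, and collecting the samples from SHOW CLUSTER is left to the reader.

```python
def credit_waits_increasing(samples):
    """True if the credit-wait counter is nonzero and rose between the first and last sample."""
    if len(samples) < 2:
        return False
    return samples[-1] > 0 and samples[-1] > samples[0]


print(credit_waits_increasing([0, 0, 0]))     # False: no credit waits at all
print(credit_waits_increasing([12, 12, 12]))  # False: old waits, not growing
print(credit_waits_increasing([12, 40, 95]))  # True: candidate for more credits
```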

Multi-Site Clusters • Consist of multiple “Lobes” with one or more systems, in different locations • Systems in each “Lobe” are all part of the same VMS Cluster and can share resources • Sites typically connected by bridges (or bridge-routers; pure routers don’t pass SCS traffic)

Multi-Site Clusters • Sites linked by: • DS-3/T3 (E3 in Europe) or ATM Telco circuits • Microwave link: DS-3/T3 or Ethernet • “Dark fiber” where available: • FDDI: 40 km with single-mode fiber; 2 km multi-mode fiber • Ethernet over fiber (10 Mb, Fast, Gigabit) • Fiber links between Memory Channel switches; up to 3 km • Dense Wave Division Multiplexing (DWDM), then ATM

Multi-Site Clusters • Inter-site link minimum standards are in OpenVMS Cluster Software SPD: • 10 megabits per second minimum data rate • “Minimize” packet latency • Low SCS packet retransmit rate: • Less than 0.1% retransmitted. Implies: • Low packet-loss rate for bridges • Low bit-error rate for links
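
The 0.1% retransmit threshold is easy to check once you have packet counters for a link. The sketch below assumes you already have a count of packets sent and packets retransmitted over some interval; the function names and sample numbers are hypothetical, and where the counters come from is not covered here.

```python
def retransmit_percent(packets_sent, packets_retransmitted):
    """Retransmitted packets as a percentage of packets sent."""
    if packets_sent <= 0:
        raise ValueError("packets_sent must be positive")
    return 100.0 * packets_retransmitted / packets_sent


def meets_retransmit_standard(packets_sent, packets_retransmitted, limit_percent=0.1):
    """True if the retransmit rate is below the limit (0.1% by default)."""
    return retransmit_percent(packets_sent, packets_retransmitted) < limit_percent


print(meets_retransmit_standard(1_000_000, 500))   # 0.05% retransmitted: True
print(meets_retransmit_standard(1_000_000, 2000))  # 0.2% retransmitted: False
```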
