Performance and Reliability Issues – Network, Storage & Services
Shawn McKee / University of Michigan
OSG All-hands Meeting, March 8th 2010, FNAL
Outline
• I want to present a mix of topics related to performance and reliability for our sites
• Not “the answers”, but rather a set of topics and examples I consider important, followed by discussion
• I will cover Network, Storage and Services:
  • Configuration
  • Tuning
  • Monitoring
  • Management
General Goals for our Sites
• Goal: build a robust infrastructure
  • Consider physical and logical topologies
  • Provide alternate paths when feasible
  • Tune, test, monitor and manage
• Meta-goal: protect services while maintaining performance
  • Services should be configured to “fail gracefully” rather than crash; there are potentially many ways to do this
  • Tune, test, monitor and manage (as always)
Common Problems
• Power issues
• Site (mis)configurations
• Service failures: load related, bugs, configuration, updates
• Hardware failures: disks, memory, CPU, etc.
• Cooling failures
• Network failures
• Robust solutions are needed to minimize the impact of all of these
Site Infrastructures
• There are a number of areas to examine where we can add robustness (usually at the cost of money or complexity!)
  • Networking: physical and logical connectivity
  • Storage: physical and logical connectivity; filesystems, OS, software, services
  • Servers and services
  • Grid and VO software and middleware
Example Site-to-Site Diagram
Power Issues
• Power issues are frequently the cause of service loss in our infrastructure
• Redundant power supplies connected to independent circuits can minimize loss due to circuit or supply failure (verify that one circuit can support the required load!)
• UPS systems can bridge brown-outs or short-duration losses and protect equipment from power fluctuations
• Generators can provide longer-term bridging
Robust Network Connectivity
• Redundant network connectivity can help provide robust networking
• WAN resiliency is part of almost all WAN providers’ infrastructures
• Sites need to determine how best to provide both LAN and connector-level resiliency
• Basically, allow multiple paths for network traffic to flow in case of switch/router failure, cabling mishaps, NIC failure, etc.
Virtual Circuits in LHC (WAN)
• ESnet and Internet2 have helped the LHC sites in the US set up end-to-end circuits
• USATLAS has persistent circuits from BNL to 4 of the 5 Tier-2s
• The circuits are guaranteed 1 Gbps but may overflow to utilize the available bandwidth
• This simplifies traffic management and is transparent to the sites
• Future possibilities for dynamic management…
• Failover is back to default routing
LAN Options to Consider
• Utilize equipment of reasonable quality. Managed switches are typically more robust, as well as configurable, and support monitoring
• Within your LAN, use redundant switches with paths managed by spanning-tree to increase uptime
• Anticipate likely failure modes…
• At the host level you can utilize multiple NICs (“bonding”)
Example: Network Bonding
• You can configure multiple network interfaces on a host to cooperate as a single virtual interface via “bonding”
• Linux allows multiple “modes” for the bonding configuration (see next page)
• Trade-offs are based upon resiliency vs. performance, as well as hardware capabilities and topology
NIC Bonding Modes
• Mode 0 – Balance Round-Robin: the only mode allowing a single flow to balance over more than one NIC, BUT it reorders packets. Requires ‘etherchannel’ or ‘trunking’ on the switch
• Mode 1 – Active-Backup: allows connecting to different switches at different speeds. No throughput benefit, but redundant (a configuration sketch follows below)
• Mode 2 – Balance-XOR: selects NIC per destination based upon XOR of MAC addresses. Needs ‘etherchannel’ or ‘trunk’
• Mode 3 – Broadcast: transmits on all slaves. Needs distinct nets
• Mode 4 – 802.3ad: active-active; specific flows select a NIC based upon the chosen algorithm. Needs switch support for 802.3ad
• Mode 5 – Balance-tlb: adaptive transmit load balancing. Output is balanced based upon current slave loads. No special switch support required; NIC must support ‘ethtool’
• Mode 6 – Balance-alb: adaptive load balancing. Similar to 5 but also allows receive balancing via “arp” manipulation
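As an illustration, a minimal active-backup (mode 1) setup on a RHEL-style system might look like the following; the interface names, address and netmask are assumptions for the sketch:

  # /etc/modprobe.conf -- load the bonding driver, mode 1,
  # with link monitoring every 100 ms
  alias bond0 bonding
  options bond0 mode=1 miimon=100

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  IPADDR=192.168.1.10
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none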
Network Tuning (1/2)
• Typical “default” OS tunings for networking are not optimal for WAN data transmission
• Depending upon the OS, you can find particular tuning advice at: http://fasterdata.es.net/TCP-tuning/background.html
• Buffers are the primary tuning target: buffer size = bandwidth * RTT (worked example below)
• Good news: most OSes support autotuning now, so there is no need to set default buffer sizes
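For example, for a 1 Gbps path with a 70 ms RTT (numbers chosen purely for illustration), the bandwidth-delay product works out to roughly 8.4 MB:

  # BDP = bandwidth (bytes/sec) * RTT (sec)
  # 1 Gbps = 125,000,000 bytes/sec; RTT = 70 ms
  echo $(( 125000000 * 70 / 1000 ))   # 8750000 bytes, ~8.4 MB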
Network Tuning (2/2)
• To get maximal throughput it is critical to use optimal TCP buffer sizes
• If the buffers are too small, the TCP congestion window will never fully open up
• If the receiver buffers are too large, TCP flow control breaks: the sender can overrun the receiver, which will cause the TCP window to shut down. This is likely to happen if the sending host is faster than the receiving host
Linux TCP Tuning (1/2)
• Like all operating systems, the default maximum Linux TCP buffer sizes are far too small:

  # increase TCP max buffer size settable using setsockopt()
  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  # increase Linux autotuning TCP buffer limits
  # min, default, and max number of bytes to use
  # set max to at least 4MB, higher if you use very high BDP paths
  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216

• You should also verify that the following are all set to the default value of 1:

  sysctl net.ipv4.tcp_window_scaling
  sysctl net.ipv4.tcp_timestamps
  sysctl net.ipv4.tcp_sack

• Of course, TEST after changes. SACK may need to be off for large BDP paths (> 16 MB) or timeouts may result
Linux TCP Tuning (2/2)
• Tuning can be more complex for 10GE
• You can explore different congestion algorithms: BIC, CUBIC, HTCP, etc.
• A large MTU can improve throughput (a sketch of both follows below)
• There are a couple of additional sysctl settings for 2.6 kernels:

  # don't cache ssthresh from previous connection
  net.ipv4.tcp_no_metrics_save = 1
  net.ipv4.tcp_moderate_rcvbuf = 1
  # recommended to increase this for 1000BT or higher
  net.core.netdev_max_backlog = 2500  # for 10 GigE, use 30000
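A hedged sketch of trying an alternate congestion algorithm and a larger MTU; the interface name is an assumption, and jumbo frames must be supported end-to-end before raising the MTU:

  # list the algorithms compiled into the running kernel
  sysctl net.ipv4.tcp_available_congestion_control
  # switch to H-TCP (revert if testing shows no benefit)
  sysctl -w net.ipv4.tcp_congestion_control=htcp
  # enable a 9000-byte MTU on the data interface
  ifconfig eth2 mtu 9000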
Storage Connectivity
• Increase robustness for storage by providing resiliency at various levels:
  • Network: bonding (e.g. 802.3ad)
  • RAID/SCSI redundant cabling, multipathing (hardware specific; see the sketch below)
  • iSCSI (with redundant connections)
  • Single-host resiliency: redundant power, mirrored memory, RAID OS disks, multipath controllers
  • Clustered/failover storage servers
  • Multiple copies, multiple write locations
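As a sketch, Linux device-mapper multipathing can provide the redundant-path piece. The config below is deliberately minimal; real deployments add hardware-specific device sections per the array vendor’s recommendations:

  # /etc/multipath.conf -- minimal sketch
  defaults {
      user_friendly_names yes
  }
  # start the daemon and inspect the discovered paths:
  #   service multipathd start
  #   multipath -ll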
Example: Redundant Cabling Using Dell MD1000s
• New firmware for Dell RAID controllers supports redundant cabling of MD1000s
• Each MD1000 can have two EMMs, each capable of accessing all disks
• A PERC 6/E has two SAS channels
• You can now cable each channel to an EMM on a shelf. The connection shows up as one logical link (similar to a “bond” in networking)
• Can be daisy-chained to up to 3 MD1000s
Redundant Path With Static Load Balancing Support
From the Dell documentation:
“The PERC 6/E adapter can detect and use redundant paths to drives contained in enclosures. This provides the ability to connect two SAS cables between a controller and an enclosure for path redundancy. The controller is able to tolerate the failure of a cable or Enclosure Management Module (EMM) by utilizing the remaining path. When redundant paths exist, the controller automatically balances I/O load through both paths to each disk drive. This load balancing feature increases throughput to each drive and is automatically turned on when redundant paths are detected. To set up your hardware to support redundant paths, see Setting up Redundant Path Support on the PERC 6/E Adapter. NOTE: This support for redundant paths refers to path-redundancy only and not to controller-redundancy.”
http://support.dell.com/support/edocs/storage/RAID/PERC6/en/UG/HTML/chapterd.htm#wp1068896
Storage Tuning
• Have good hardware underneath the storage system!
• Pick an underlying filesystem that performs well. XFS is a common choice, supporting large numbers of directory entries and online defragmentation
• The following settings require the target to be mounted:
  • Set “readahead” to improve read speed (4096-16384):
      blockdev --setra 10240 $dev
  • Set up request queuing (allows optimizing):
      echo 512 > /sys/block/${sd}/queue/nr_requests
  • Pick an I/O scheduler suitable for your task:
      echo deadline > /sys/block/${sd}/queue/scheduler
• There are often hardware-specific tunings possible
• Remember to test for your expected workload to see if changes help
Robust Grid Services?
• Just a topic I wanted to mention. I would like to be able to configure virtual grid services (using multiple hosts, heartbeat, LVS, etc.) to create a robust infrastructure
• Primary targets: gatekeepers, job schedulers, GUMS servers, LFC, software servers, dCache admin servers
• A possible solution for NFS servers via heartbeat, LVS… others? (a rough LVS sketch follows)
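A rough sketch of the LVS piece using ipvsadm. The VIP, real-server addresses, the choice of the GRAM port and the NAT forwarding mode are all assumptions for illustration; heartbeat (or similar) would manage the VIP itself:

  # define a virtual service on the VIP (GRAM port 2119 as an example),
  # round-robin scheduling
  ipvsadm -A -t 192.168.1.100:2119 -s rr
  # add two real gatekeeper hosts behind it (masquerading/NAT mode)
  ipvsadm -a -t 192.168.1.100:2119 -r 192.168.1.11:2119 -m
  ipvsadm -a -t 192.168.1.100:2119 -r 192.168.1.12:2119 -m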
Virtualization of Service Nodes
• Our current grid infrastructure for ATLAS requires a number of services
• Virtualization technologies can be used to provide some of these services
• Depending upon the virtualization system, this can help with:
  • Backing up critical services
  • Increasing availability
  • Easing management
Example: VMware
• At AGLT2 we have VMware Enterprise running: LFC, 3 Squid servers, OSG gatekeeper, ROCKS headnodes (dev/prod), 2 of 3 Kerberos/AFS/NIS nodes, central syslog-ng host, muon splitter, 2 of 5 AFS file servers
• “HA” can ensure services run even if a server fails. Backup is easy as well
• Can “live-migrate” VMs between 3 servers, or migrate VM storage to an alternate back-end storage server
Example: AGLT2 VMware
(Not shown: the 10GE connections, one per server)
Example: Details for UMVM02
Backups
• “You do have backups, right?…”
• Scary question, huh?! Backups provide a form of resiliency against various hardware failures and unintentional acts of stupidity
• Could be anything from a full tape-based backup service to various cron scripts saving needed config info (a trivial example follows)
• Not always easy to get right… test!
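A trivial example of the cron-script end of that spectrum; the paths, schedule and retention are assumptions:

  # /etc/cron.d/config-backup -- nightly tarball of /etc, kept by date
  # (note: '%' must be escaped in crontab entries)
  0 3 * * * root tar czf /backup/etc-$(hostname)-$(date +\%F).tar.gz /etc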
System Tuning
• Lots of topics could go here, but I will just mention a few items
• You can install ‘ktune’ (yum install ktune). It provides some tunings for large-memory systems running disk- and network-intensive applications (as sketched below)
• See the related storage/network tunings
• Memory is a likely bottleneck in many cases… have lots!
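For example, on a RHEL/CentOS 5-era system (package and service names assumed from that era):

  yum install ktune
  chkconfig ktune on
  service ktune start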
Cluster Monitoring…
• This is a huge topic. In general, you can’t find problems if you don’t know about them, and you can’t effectively manage systems if you can’t monitor them
• I will list a few monitoring programs that I have found useful
• There are many options in this area that I won’t cover: Nagios is a prime example, being used very successfully
Ganglia
• Ganglia is a cluster monitoring program available from http://ganglia.sourceforge.net/ and also distributed as part of ROCKS
• Allows a quick view of CPU and memory use cluster-wide
• Can drill down into host-specific details
• Can easily extend to monitor additional data or aggregate sites (example below)
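Extending Ganglia can be as simple as publishing a value with gmetric from a cron job; the metric name and value here are made up for illustration:

  # inject a custom metric into the Ganglia channel
  gmetric --name "dcache_active_movers" --value 42 --type int32 --units "movers"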
Example Ganglia Interface
Cacti Monitoring
• Cacti (see http://www.cacti.net/) is a network graphing package using SNMP and RRDtool to record data
• Can be extended with plugins (threshold, monitoring, MAC lookup)
Example Cacti Graphs
• Outbound AGLT2 10GE bytes/sec
• Inbound AGLT2 10GE bytes/sec
• Aggregate ‘ntpd’ offset (ms)
• Space-token stats (put/get)
• Postgres DB stats
• NFS client statistics
Custom Monitoring
• Philippe Laurens (MSU) has developed a summary page for AGLT2 which quickly shows cluster status:
Automated Monitoring/Recovery
• Some types of problems can be easily “fixed” if we can just identify them
• The ‘monit’ software (‘yum install monit’) provides an easy way to test various system/software components and attempt to remediate problems
• Configure one file per item to watch/test
• Very configurable; it can fix problems at 3 AM! Some examples follow:
Monit Example for MySQL
• This describes the relevant MySQL info for this host (it resides in /etc/monit.d as mysqld.conf):

  # mysqld monitoring
  check process mysqld with pidfile /var/lib/mysql/dq2.aglt2.org.pid
    group database
    start program = "/etc/init.d/mysql start"
    stop program = "/etc/init.d/mysql stop"
    if failed host 127.0.0.1 port 3306 protocol mysql 3 cycles then restart
    if failed host 127.0.0.1 port 3306 protocol mysql 3 cycles then alert
    if failed unixsocket /var/lib/mysql/mysql.sock protocol mysql 4 cycles then alert
    if 5 restarts within 10 cycles then timeout

• Restarting and alerting are triggered based upon the tests
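After adding or editing a file under /etc/monit.d, reload the daemon (“monit reload”) and confirm the check is active (“monit status”).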
Other Monitoring/Management
• Lots of sites utilize “simple” scripts run via “cron” (or equivalent) that:
  • Perform regular maintenance
  • Check for “known” problems
  • Back up data or configurations
  • Extract monitoring data
  • Remediate commonly occurring failures
• These can be very helpful for increasing reliability and performance (a sketch follows)
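A sketch of the “remediate a known problem” variety; the mount point and recovery action are purely illustrative, and the GNU coreutils ‘timeout’ utility is assumed to be available:

  #!/bin/bash
  # cron job: if the data mount hangs, log it and try a lazy remount
  if ! timeout 10 ls /atlas/data >/dev/null 2>&1; then
      logger -t mountcheck "/atlas/data unresponsive, remounting"
      umount -l /atlas/data && mount /atlas/data
  fi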
Security Considerations
• Security is a whole separate topic… not appropriate to cover it here…
• The general issue is that unless security is also addressed, your otherwise high-performing, robust infrastructure may have large downtimes while you try to contain and repair system compromises!
• Good security practices are part of building robust infrastructures
Configuration Management
• Not directly related to performance or reliability, but very important
• Common tools:
  • Code management, versioning (Subversion, CVS)
  • Provisioning and configuration management (ROCKS, Kickstart, Puppet, Cfengine)
• All important for figuring out what was changed and what is currently configured
Regular Storage “Maintenance”
• Start with the bits on disk. Run ‘smartd’ to look for impending failures (see the sketch below)
• Use “patrol reads” or background consistency checks to find bad sectors
• Run filesystem checks when things are “suspicious” (xfs_repair, fsck…)
• Run higher-level consistency checks (like Charles’ ccc.py script) to ensure various views of your storage are consistent
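As an illustration, a one-line /etc/smartd.conf can cover the ‘smartd’ piece; the self-test schedule and mail address are assumptions:

  # monitor all detected disks, run a long self-test Saturdays at 03:00,
  # and mail on any failure indication
  DEVICESCAN -a -s L/../../6/03 -m admin@example.org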
High Level Storage Consistency
• Being run at MWT2 and AGLT2
• Allows finding consistency problems and “dark” data
dCache Monitoring/Management
• AGLT2 has monitoring/management specific to dCache (as an example)
• Other storage solutions may have similar types of monitoring
• We have developed some custom pages in addition to the standard dCache services web interface
  • Tracks usage and consistency
• We also have a series of scripts running in ‘cron’ doing routine maintenance/checks
dCache Allocation and Use
dCache Consistency Page
WAN Network Monitoring
• Within the Throughput group we have been working on network monitoring as complementary to throughput testing
• Two measurement/monitoring areas:
  • perfSONAR at Tier-1/Tier-2 sites: “network”-specific testing
  • Automated transfer testing: “end-to-end” testing using standard ATLAS tools
• May add a “transaction test” next (TBD)
Network Monitoring: perfSONAR
• As you are by now well aware, there is a broad-scale effort to standardize network monitoring under the perfSONAR framework
• Since the network is so fundamental to our work, we targeted deployment of a perfSONAR instance at all our primary facilities. We have ~20 sites running
• It has already proven very useful in USATLAS!
perfSONAR Examples (USATLAS)
perfSONAR in USATLAS
• The typical Tier-1/Tier-2 installation provides two systems (using the same KOI hardware at each site): a latency node and a bandwidth node
• Automated recurring tests are configured for both latency and bandwidth between all Tier-1/Tier-2 sites (“mesh” testing)
• We are acquiring a baseline and history of network performance between sites
• On-demand testing is also available (example below)
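On-demand bandwidth tests typically go through bwctl; a sketch against a remote bandwidth node, where the hostname is a placeholder:

  # 30-second throughput test toward the remote perfSONAR bandwidth node
  bwctl -c psb.remote-tier2.example.org -t 30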
Production System Testing
• While perfSONAR is becoming the tool of choice for monitoring network behavior between sites, we also need to track the “end-to-end” behavior of our complex, distributed systems
• We are utilizing regularly scheduled automated testing, sending specific data between sites to verify proper operation
• This is critical for problem isolation; comparing network and application results can pinpoint problem locations
Automated Data Transfer Tests
• As part of the USATLAS Throughput work, Hiro has developed an automated data transfer system which utilizes the standard ATLAS DDM system
• This allows us to monitor the throughput of the system on a regular basis
• It transfers a set of files once per day from the Tier-1 to each Tier-2, for two different destinations
• Recently it was extended to allow arbitrary source/destination pairs (including Tier-3s)
• http://www.usatlas.bnl.gov/dq2/throughput
Web Interface to Throughput Test