Introduction to the NPACI Rocks Clustering Toolkit: Building Manageable COTS Clusters


Presentation Transcript


  1. Introduction to the NPACI Rocks Clustering Toolkit: Building Manageable COTS Clusters Philip M. Papadopoulos, Mason J. Katz, Greg Bruno

  2. Who We Are • Philip Papadopoulos • Parallel message passing expert (PVM and Fast Messages) • Mason Katz • Network protocol expert (x-kernel, Scout and Fast Messages) • Greg Bruno • 10 years experience with NCR’s Teradata Systems • Builders of clusters which drive very large commercial databases • All three of us have worked together for the past 2 years building NT and Linux clusters

  3. Who is “NPACI Rocks” ? • Key people from the UCB Millennium Group • Prof. David Culler • Eric Fraser • Brent Chun • Matt Massie • Albert Goto • People from SDSC • Bruno, Katz, Papadopoulos (Distributed Computing Group) • Kenneth Yoshimoto (Scheduling) • [Keith Thompson, Bill Link (Grid)] • [ Storage Resource Broker (SRB) Group] • [ You ! ]

  4. Why We Do Clusters – Frankly, we love it • Building high-performance systems which have killer price/performance is a gas • NPACI is about building pervasive infrastructure. Supported, transferable cluster infrastructure was missing from our “portfolio”. • Enabling others to build their own clusters and “do scientific simulation” is a blast. • We wanted a management system that would allow us to rapidly experiment with new low-level system software (and recover when things didn’t go quite right) • “Protect ourselves from ourselves” 

  5. What We’ll Cover • Rocks philosophies • Hardware components • Software packages • Theory and practice • Lab

  6. What we thought we “Learned” • Clusters are phenomenal price/performance computational engines, but are hard to manage • Cluster management is a full-time job which gets linearly harder as one scales out. • “Heterogeneous” Nodes are a bummer (network, memory, disk, MHz, current kernel version).

  7. You Must Unlearn What You Have Learned

  8. Installation/Management • Need a strategy for managing cluster nodes • Pitfalls • Installing each node “by hand” • Difficult to keep software on nodes up to date • Management effort increases as node count increases • Disk imaging techniques (e.g., VA Disk Imager) • Difficult to handle heterogeneous nodes • Treats the OS as a single monolithic image • Specialized installation programs (e.g., IBM’s LUI, or RWCP’s multicast installer) – let the Linux packaging vendors do their job • Penultimate: RedHat Kickstart • Define the packages needed on nodes; kickstart gives a reasonable measure of control • Need to fully automate to scale out (Rocks)
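For concreteness, a kickstart file is just a plain-text answer file consumed by RedHat's installer. The fragment below is only an illustrative sketch of the format; the package set, partitioning and post-install commands are examples, not the actual Rocks ks.cfg:

    # illustrative RedHat kickstart fragment; values are examples only
    install
    lang en_US
    keyboard us
    rootpw --iscrypted $1$changeme
    clearpart --all
    part / --size 4096 --grow
    %packages
    @ Base
    openssh-server
    %post
    # site-specific configuration runs here after the packages are installed
    chkconfig sshd on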

  9. Scaling out • Evolve to management of “two” systems • The front end(s) • Login host • Users’ home areas, passwords, groups • Cluster configuration information • The compute nodes • Disposable OS image • Let software manage node heterogeneity • Parallel (re)installation • Cluster-wide configuration files derived through reports from a MySQL database (DHCP, hosts, PBS nodes, …)

  10. NPACI Rocks Toolkit – rocks.npaci.edu • Techniques and software for easy installation, management, monitoring and update of clusters • Installation • Bootable CD + floppy which contains all the packages and site configuration info to bring up an entire cluster • Management and update philosophies • Trivial to completely reinstall any (all) nodes • Nodes are 100% automatically configured • Use of DHCP, NIS for configuration • Use RedHat’s Kickstart to define the set of software that defines a node • All software is delivered in a RedHat Package (RPM) • Encapsulate configuration for a package (e.g., Myrinet) • Manage dependencies • Never try to figure out if node software is consistent • If you ever ask yourself this question, reinstall the node
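Because every piece of software arrives as an RPM, the usual rpm queries are the natural way to inspect a node before giving up and reinstalling it; for example (standard rpm options, package names illustrative):

    rpm -qa | sort > /tmp/packages.txt   # list every installed package
    rpm -V openssh-server                # verify a package's files against the RPM database
    rpm -qi kernel                       # show the version and description of a package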

  11. More Rocksisms • Leverage widely-used (standard) software wherever possible • Everything is in RedHat Packages (RPM) • RedHat’s “kickstart” installation tool • SSH, Telnet, Existing open source tools • Write only the software that we need to write • Focus on simplicity • Commodity components • For example: x86 compute servers, Ethernet, Myrinet • Minimal • For example: no additional diagnostic or proprietary networks • Rocks is a collection point of software for people building clusters • It will evolve to include cluster software and packaging from more than just SDSC and UCB • <[your-software.i386.rpm] [your-software.src.rpm] here>

  12. Hardware

  13. Many variations on a basic layout [Diagram: front-end node(s) on the public Ethernet; compute nodes connected through a Fast-Ethernet switching complex and an optional gigabit network switching complex; power distribution (network-addressable units as an option)]

  14. Frontend and Compute Nodes • Choices • Uni- or dual-processor Intel • Linux is, in reality, an Intel OS • Rackmount vs. desktop chassis • Rackmount “essential” for large installations • SCSI vs. IDE • Performance is a non-issue • Price and serviceability are the real considerations • Note: rackmount servers usually are SCSI • Integrating it yourself versus using a system integrator • Our Nodes • Dual PIIIs (733, 800 and 933 MHz [Compaq, IBM]) • 1.0+ GHz as we expand • ½ GB per node (1 GB would be better) • Hot-swap SCSI on these nodes • We integrate our hardware ourselves

  15. Networks • High-performance networks • Myrinet, Giganet, Servernet, Gigabit Ethernet, etc. • Ethernet only ⇒ Beowulf-class • Management Networks (Light Side) • Ethernet – 100 Mbit • Management network used to manage compute nodes and launch jobs • Nodes are in private IP (192.168.x.x) space; the front-end does NAT • Ethernet – 802.11b • Easy access to the cluster via laptops • Plus, wireless will change your life • Evil Management Networks (Dark Side) • A serial “console” network is not necessary • A KVM (keyboard/video/mouse) switching system adds too much complexity, cables, and cost
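A minimal sketch of the NAT piece, assuming an iptables-capable kernel with eth0 on the public network and eth1 facing the private 192.168.x.x space (the interface names, and whether a given Rocks release uses iptables or ipchains, are assumptions here):

    echo 1 > /proc/sys/net/ipv4/ip_forward                                  # let the front-end forward packets
    iptables -t nat -A POSTROUTING -o eth0 -s 192.168.0.0/16 -j MASQUERADE  # masquerade traffic from the private side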

  16. Power Distribution • Highly desirable to have network-addressable power distribution units • Can remotely power cycle compute nodes • Instrumented, which helps determine power needs • [Photo: power distribution unit showing its Ethernet port and power sockets]

  17. Other Helpful Hardware When All Else Fails • When a node appears to be sick • Issue a “reinstall” command over the network • If still dead, instruct the network addressable power distribution unit to power cycle the node (this reinstalls the OS) • If still dead, roll up the “crash cart” • Monitor and keyboard

  18. Leatherman: A Must-Have For Any Self-Respecting Clusters Person

  19. Current Configuration of the Meteor Cluster • Rocks v2.0 • 2 Frontends • 100 nodes • 50 GB RAM • Ethernet • For management • Myrinet • Servernet • Working through some bugs

  20. Software

  21. RedHat Supplied Software • 7.0 Base + Updates • RPM • RedHat Package Manager • Kickstart • Method for unattended server installation

  22. Community Software • Myricom’s General Messaging (GM) • MPICH • GM device • Ethernet device • Portable Batch System • Maui • PVM • Intel’s Math Kernel Library • Math functions tuned for Intel processors

  23. NPACI Rocks Software • Cluster-dist • A tool used to assemble the latest RedHat, community and Rocks packages into a distribution which is used by compute nodes during reinstallation • Shoot-node and eKV (Ethernet Keyboard and Video) • Initiate a compute node reinstallation • Monitor compute node reinstallations over Ethernet with telnet • Cluster-admin and cluster-ssl • Tools to create user accounts and user SSL certificates • Rexec (UC Berkeley) • Launch and control parallel jobs (SSL-based authentication) • Ganglia (UC Berkeley) • Cluster monitoring

  24. Software Details

  25. Cluster-dist • Integrate RedHat Packages from • RedHat (mirror) – base distribution + updates • Contrib directory • Locally produced packages • Packages from rocks.npaci.edu • Produces a single updated distribution that resides on the front-end • Is a RedHat distribution with patches and updates applied • Different kickstart files and different distributions can co-exist on a front-end to add flexibility in configuring nodes.

  26. Remote re-installation: Shoot-node and eKV • Rocks provides a simple method to remotely reinstall a node (once it has been installed the first time) • By default, hard power cycling will cause a node to reinstall itself • With no serial (or KVM) console, we are able to watch a node as it installs

  27. Remote re-installation: Shoot-node and eKV [Screenshot: remotely starting reinstallation on two nodes, 192.168.254.253 and 192.168.254.254]

  28. Starting Jobs • SSH-based MPI-Launch • Provides full integration with the Myrinet reservation capability of Usher/Patron • SSL-based Rexec • Better control of jobs on remote nodes • Sane signal propagation • Batch system: PBS + Maui • PBS provides queue definition and node monitoring • Maui has rich scheduling policies • Standing and future reservations • Query the number of nodes “available” now
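A job submitted through the batch system is an ordinary PBS script; the sketch below uses generic names (the queue, resource requests and the mpirun launch line are examples, and the SSH-based MPI-Launch path above starts the MPI processes differently):

    #!/bin/sh
    #PBS -N mytest
    #PBS -q default
    #PBS -l nodes=2,walltime=0:30:00
    cd $PBS_O_WORKDIR
    mpirun -np 4 -machinefile $PBS_NODEFILE ./my_mpi_app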

  29. PBS – Portable Batch System • Three standard components to PBS • MOM – Node health reporting daemon, Job Launch daemon on every node • Server – On front-end: queue definition, aggregation of node information • Scheduler – Policies for what job to run out of which queue at what time • We added a fourth • Configuration – Get cluster node configuration from our SQL database.

  30. PBS RPM Packaging • Repackaged PBS (Sane packaging + enhancements) • Added “chkconfig-compatible” start-up scripts • 4 packages • pbs (server and scheduler) (should be divided again) • pbs-mom • pbs-config-sql (Python script to generate database report) • pbs-common (files needed by all three packages) • A Rocks 2.0 base installation (automatically) defines a default queue with all nodes being available in the queue • http://pbs.mrj.com is a good starting point for PBS
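Being “chkconfig-compatible” means the daemons can be enabled and run with the stock RedHat service tooling; for example (the pbs-server script name appears on the next slide, and the pbs-mom script name is assumed to match its package):

    chkconfig --add pbs-server            # register the init script on the front-end
    chkconfig --level 345 pbs-server on   # start it in the normal multi-user runlevels
    /etc/rc.d/init.d/pbs-server start
    /etc/rc.d/init.d/pbs-mom status       # on a compute node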

  31. PBS Server defaults (and changing them) • Startup script: “/etc/rc.d/init.d/pbs-server start” • /usr/apps/pbs/pbs.default • “Sourced” every time PBS is started

    # $Id: pbs.default.in,v 1.5 2001/02/16 19:59:38 bruno Exp $
    #
    # A basic pbs setup that creates a queue called default and starts scheduling
    #
    # Create queues and set their attributes.
    #
    # Create and define queue default
    #
    # 1 node default, 1hr walltime
    create queue default
    set queue default queue_type = Execution
    set queue default resources_default.nodes = 1
    set queue default resources_default.walltime = 1:00:00
    set queue default enabled = True
    set queue default started = True

  32. pbs.default (cont’d)

    #
    # Set server attributes.
    #
    # Assume maui scheduler will be installed
    set server managers = maui@frontend-0
    set server operators = maui@frontend-0
    set server default_queue = default
    set server log_events = 511
    set server mail_from = adm
    set server scheduler_iteration = 600
    set server scheduling = false

  • PBS will ignore queue creation if a queue already exists.

  33. Modifying the default setup (simple queue creation) • Use qmgr to create a new queue

    # /usr/apps/pbs/bin/qmgr
    Max open servers: 4
    Qmgr: create queue single
    Qmgr: set queue single queue_type=execution
    Qmgr: set queue single enabled=true
    Qmgr: set queue single acl_hosts=compute-1-0
    Qmgr: set queue single started=true

  • Use qmgr to save the configuration

    /usr/apps/pbs/bin/qmgr -c "print server" > /usr/apps/pbs/pbs.default

  34. Maui Scheduler • We use Maui as our scheduler for PBS • mauischeduler.sourceforge.net • http://havi.supercluster.org/documentation/maui • Add the “single” queue definition so that Maui understands it. This is in /usr/spool/maui/maui.cfg

    SRNAME[0] single
    SRHOSTLIST[0] compute-1-0

  • Restart Maui: % /etc/rc.d/init.d/maui restart
  • Submit a job to PBS: % /usr/apps/pbs/bin/qsub -q single mytest.sh

  35. Monitoring your cluster • PBS has a GUI called xpbsmon. Gives a nice graphical view of the up/down state of nodes • SNMP status • Use the extensive SNMP MIB defined by the Linux community to find out many things about a node • Installed software • Uptime • Load • Ganglia (UCB) – IP multicast-based monitoring system
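For instance, uptime, load and installed software can be pulled from a node with the standard SNMP command-line tools (the flag syntax shown is the net-snmp style; the community string, hostname and MIB objects are assumptions for illustration):

    snmpwalk -v 1 -c public compute-0-0 system              # uptime and system description
    snmpget  -v 1 -c public compute-0-0 laLoad.1            # one-minute load average (UCD-SNMP-MIB)
    snmpwalk -v 1 -c public compute-0-0 hrSWInstalledName   # installed software (HOST-RESOURCES-MIB)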

  36. Ganglia - http://www.millennium.berkeley.edu/ganglia/ • Dendrite on each node • Multicasts state of the machine on significant changes • Load averages, disk consumption, memory, etc. • Beacons every minute, if no significant deltas • Axons • Collection daemons (at least one/cluster) • Ganglia client – Sort the measured variables to find a set of hosts that match a desired criteria • E.g. X MB free memory, load below Y • Can act as a “vexec” resource for Rexec.

  37. Ganglia – text output

    [phil@slic01 ~]$ /usr/sbin/ganglia load_one
    compute-1-5   0.07
    compute-0-9   0.08
    compute-1-3   0.14
    compute-2-0   0.15
    compute-2-8   0.18
    compute-2-5   0.27
    frontend-0    0.36
    compute-3-11  0.82
    compute-23    1.06
    compute-22    1.19
    compute-3-4   1.96
    compute-3-9   1.99
    compute-3-10  1.99
    compute-3-2   2.00
    compute-3-3   2.09
    compute-3-7   2.12
    compute-3-5   2.99
    compute-3-6   3.0

  38. “Hidden” Software

  39. Some Tools that assist in automation. • Users generally will not see these tools • Profile scripts run at user’s first login • Usher-patron (Myrinet port reservation) • Insert-ethers (Add nodes to a cluster) • Cluster-sql package • Reports to build service-specific config files • Cluster-admin • Node reinstallation • Creating accounts (NIS, auto.home map creation) • Cluster-ssl • Generate keys for SSL authentication (rexec)

  40. Usher/Patron • Tool to simplify using installed Myricom hardware • Eliminates the need for a central “database” to decide which Myrinet ports are currently in use • (Myricom driver installed with a separate source RPM) • Usher daemon runs on each compute node. Takes reservation requests for access to the limited set of Myrinet ports (RPC-based) • Reservations time out if not claimed • Patron – works with usher to request and claim ports • Integrated with MPI-Launch • Automatically creates the node file needed for MPICH-GM

  41. First Login Profile Scripts • On first login, all users, including root, are prompted to build an SSH public/private key pair • Makes sense because ssh is the only way to gain login access to the nodes • NIS is updated (passwd, auto.home, etc.) • Additionally, the first time root logs in, an SSL certificate authority is generated which is used to sign users’ SSL certificates • The SSL certificate and root’s public SSH key are then propagated to the compute node kickstart file
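In essence, the profile script does for the user what the following commands do by hand (a sketch only; the real script's key type, file names and prompts may differ):

    ssh-keygen -t rsa -f $HOME/.ssh/id_rsa                     # build the public/private key pair
    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys    # allow password-less ssh to the compute nodes
    chmod 600 $HOME/.ssh/authorized_keys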

  42. insert-ethers • Used to populate the “nodes” MySQL table • Parses a file (e.g., /var/log/messages) for DHCPDISCOVER messages • Extracts MAC addr and, if not in table, adds MAC addr and hostname to table • For every new entry: • Rebuilds /etc/hosts and /etc/dhcpd.conf • Reconfigures NIS • Restarts DHCP and PBS • Hostname is • <basename>-<cabinet>-<chassis> • Configurable to change hostname • E.g., when adding new cabinets
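The regenerated /etc/hosts simply reflects the <basename>-<cabinet>-<chassis> naming; an illustrative pair of entries (the addresses and domain suffix are examples, not actual insert-ethers output):

    192.168.254.253   compute-0-0.local   compute-0-0
    192.168.254.252   compute-0-1.local   compute-0-1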

  43. dhcp_options – One More Important MySQL Table • Created by the Frontend kickstart file (based on user input from Rocks configuration web page) • Used by makedhcp to construct the header in /etc/dhcpd.conf
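A dhcpd.conf built this way has a global header (from the dhcp_options table) followed by one host entry per row of the nodes table; an illustrative fragment, not the exact makedhcp output:

    subnet 192.168.0.0 netmask 255.255.0.0 {
        option routers 192.168.254.254;              # the front-end, which does NAT
        option domain-name-servers 192.168.254.254;
    }

    host compute-0-0 {
        hardware ethernet 00:50:8b:aa:bb:cc;         # MAC captured by insert-ethers
        fixed-address 192.168.254.253;
    }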

  44. Configuration Derived from Database [Diagram: automated node discovery; insert-ethers records Node 0 through Node N in the MySQL DB, then makehosts, makedhcp and pbs-config-sql generate /etc/hosts, /etc/dhcpd.conf and the PBS node list from the database]

  45. Futures • Attack the storage problem • Keep the global view of storage that NFS gives us, but address the scalability problem • Source high bandwidth from the cluster into the WAN • Apply our cluster bring-up automation to easily attach clusters to the grid • Continue to improve cluster monitoring • Configure a monitoring GUI (e.g., NetSaint) to extract data from Ganglia • Get node health (Fan Speed, Temp., Disk Error rate) into Ganglia • Technologies • Processors: IA-64 and Alpha • Networks: Infiniband • 2.4 kernel (Will rev our distribution at RedHat 7.1)

  46. Lab

  47. Front-end Node • Node seen by external world • Performs Network Address Translation (NAT) • NFS Server(s) for user home areas • Beware of scalability issues! • Compilers, libraries • Configuration for Nodes • DHCP Server, NIS Domain Controller, NTP Server, Web Server, MySQL Server • Installation Server for defining system on nodes • Method(s) to start jobs on compute nodes • Batch System (PBS + Maui) • Interactive launching of jobs
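Exporting the home areas to the private network, for example, is one line in /etc/exports on the front-end (the path and network below are illustrative), followed by exportfs -a or an NFS restart:

    /export/home   192.168.0.0/255.255.0.0(rw)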

  48. Installing a Front-end Machine • Build ks.cfg from https://rocks.npaci.edu/site.htm • Define your root password • NIS Domain • Public IP Address • Boot CD • Full ISO image for download. Burn your own! • Enter: “frontend” at the boot prompt. • Sit back. Time varies depending on speed of the CPU and CDROM of frontend • Entire distribution is being copied to /home/install/cluster-dist

  49. Building a Distribution with cluster-dist • Directory structure • Build mirror • From mirror host • Emulates mirroring from rocks • Build distro • cluster-dist dist

  50. Installing Compute Nodes • Log in as root to the frontend • Execute: tail -f /var/log/messages | insert-ethers • Back on the compute node • Boot CD • From a laptop • Examine the MySQL database through a browser
