Introduction to the NPACI Rocks Clustering Toolkit: Building Manageable COTS Clusters


Presentation Transcript


  1. Introduction to the NPACI Rocks Clustering Toolkit: Building Manageable COTS Clusters Philip M. Papadopoulos, Mason J. Katz, Greg Bruno

  2. Who We Are • Philip Papadopoulos • Parallel message passing expert (PVM and Fast Messages) • Mason Katz • Network protocol expert (x-kernel, Scout and Fast Messages) • Greg Bruno • 10 years experience with NCR’s Teradata Systems • Builders of clusters which drive very large commercial databases • All three of us have worked together for the past 2 years building NT and Linux clusters

  3. Who is “NPACI Rocks” ? • Key people from the UCB Millennium Group • Prof. David Culler • Eric Fraser • Brent Chun • Matt Massie • Albert Goto • People from SDSC • Bruno, Katz, Papadopoulos (Distributed Computing Group) • Kenneth Yoshimoto (Scheduling) • [Keith Thompson, Bill Link (Grid)] • [ Storage Resource Broker (SRB) Group] • [ You ! ]

  4. Why We Do Clusters – Frankly, we love it • Building high-performance systems which have killer price/performance is a gas • NPACI is about building pervasive infrastructure. Supported, transferable cluster infrastructure was missing from our “portfolio”. • Enabling others to build their own clusters and “do scientific simulation” is a blast. • We wanted a management system that would allow us to rapidly experiment with new low-level system software (and recover when things didn’t go quite right) • “Protect ourselves from ourselves” 

  5. What We’ll Cover • Rocks philosophies • Hardware components • Software packages • Theory and practice • Lab

  6. What we thought we “Learned” • Clusters are phenomenal price/performance computational engines, but are hard to manage • Cluster management is a full-time job which gets linearly harder as one scales out. • “Heterogeneous” Nodes are a bummer (network, memory, disk, MHz, current kernel version).

  7. You Must Unlearn What You Have Learned

  8. Installation/Management • Need a strategy for managing cluster nodes • Pitfalls • Installing each node “by hand” • Difficult to keep software on nodes up to date • Management effort increases as node count increases • Disk imaging techniques (e.g., VA Disk Imager) • Difficult to handle heterogeneous nodes • Treats the OS as a single monolithic image • Specialized installation programs (e.g., IBM’s LUI, or RWCP’s multicast installer) – let the Linux packaging vendors do their job • Penultimate: RedHat Kickstart • Define the packages needed on nodes; kickstart gives a reasonable measure of control • Need to fully automate to scale out (Rocks)
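For concreteness, a kickstart file is just a plain-text answer file consumed by RedHat's installer. The fragment below is only an illustrative sketch of the format; the package set, partitioning and post-install commands are examples, not the actual Rocks ks.cfg:

    # illustrative RedHat kickstart fragment; values are examples only
    install
    lang en_US
    keyboard us
    rootpw --iscrypted $1$changeme
    clearpart --all
    part / --size 4096 --grow
    %packages
    @ Base
    openssh-server
    %post
    # site-specific configuration runs here after the packages are installed
    chkconfig sshd on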

  9. Scaling out • Evolve to management of “two” systems • The front end(s) • Login host • Users’ home areas, passwords, groups • Cluster configuration information • The compute nodes • Disposable OS image • Let software manage node heterogeneity • Parallel (re)installation • Cluster-wide configuration files derived through reports from a MySQL database (DHCP, hosts, PBS nodes, …)

  10. NPACI Rocks Toolkit – rocks.npaci.edu • Techniques and software for easy installation, management, monitoring and update of clusters • Installation • Bootable CD + floppy which contains all the packages and site configuration info to bring up an entire cluster • Management and update philosophies • Trivial to completely reinstall any (all) nodes • Nodes are 100% automatically configured • Use of DHCP, NIS for configuration • Use RedHat’s Kickstart to define the set of software that defines a node • All software is delivered in a RedHat Package (RPM) • Encapsulate configuration for a package (e.g., Myrinet) • Manage dependencies • Never try to figure out if node software is consistent • If you ever ask yourself this question, reinstall the node
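Because every piece of software arrives as an RPM, the usual rpm queries are the natural way to inspect a node before giving up and reinstalling it; for example (standard rpm options, package names illustrative):

    rpm -qa | sort > /tmp/packages.txt   # list every installed package
    rpm -V openssh-server                # verify a package's files against the RPM database
    rpm -qi kernel                       # show the version and description of a package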

  11. More Rocksisms • Leverage widely-used (standard) software wherever possible • Everything is in RedHat Packages (RPM) • RedHat’s “kickstart” installation tool • SSH, Telnet, Existing open source tools • Write only the software that we need to write • Focus on simplicity • Commodity components • For example: x86 compute servers, Ethernet, Myrinet • Minimal • For example: no additional diagnostic or proprietary networks • Rocks is a collection point of software for people building clusters • It will evolve to include cluster software and packaging from more than just SDSC and UCB • <[your-software.i386.rpm] [your-software.src.rpm] here>

  12. Hardware

  13. Many variations on a basic layout [Diagram: front-end node(s) on the public Ethernet; compute nodes connected through a Fast-Ethernet switching complex and an optional gigabit network switching complex; power distribution (network-addressable units as an option)]

  14. Frontend and Compute Nodes • Choices • Uni- or dual-processor Intel • Linux is, in reality, an Intel OS • Rackmount vs. desktop chassis • Rackmount “essential” for large installations • SCSI vs. IDE • Performance is a non-issue • Price and serviceability are the real considerations • Note: rackmount servers usually are SCSI • Integrating it yourself versus using a system integrator • Our Nodes • Dual PIIIs (733, 800 and 933 MHz [Compaq, IBM]) • 1.0+ GHz as we expand • ½ GB per node (1 GB would be better) • Hot-swap SCSI on these nodes • We integrate our hardware ourselves

  15. Networks • High-performance networks • Myrinet, Giganet, Servernet, Gigabit Ethernet, etc. • Ethernet only ⇒ Beowulf-class • Management Networks (Light Side) • Ethernet – 100 Mbit • Management network used to manage compute nodes and launch jobs • Nodes are in private IP (192.168.x.x) space; the front-end does NAT • Ethernet – 802.11b • Easy access to the cluster via laptops • Plus, wireless will change your life • Evil Management Networks (Dark Side) • A serial “console” network is not necessary • A KVM (keyboard/video/mouse) switching system adds too much complexity, cables, and cost
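A minimal sketch of the NAT piece, assuming an iptables-capable kernel with eth0 on the public network and eth1 facing the private 192.168.x.x space (the interface names, and whether a given Rocks release uses iptables or ipchains, are assumptions here):

    echo 1 > /proc/sys/net/ipv4/ip_forward                                  # let the front-end forward packets
    iptables -t nat -A POSTROUTING -o eth0 -s 192.168.0.0/16 -j MASQUERADE  # masquerade traffic from the private side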

  16. Power Distribution • Highly desirable to have network-addressable power distribution units • Can remotely power cycle compute nodes • Instrumented, which helps determine power needs • [Photo: power distribution unit showing its Ethernet port and power sockets]

  17. Other Helpful Hardware When All Else Fails • When a node appears to be sick • Issue a “reinstall” command over the network • If still dead, instruct the network addressable power distribution unit to power cycle the node (this reinstalls the OS) • If still dead, roll up the “crash cart” • Monitor and keyboard

  18. Leatherman: A Must-Have For Any Self-Respecting Clusters Person

  19. Current Configuration of the Meteor Cluster • Rocks v2.0 • 2 Frontends • 100 nodes • 50 GB RAM • Ethernet • For management • Myrinet • Servernet • Working through some bugs

  20. Software

  21. RedHat Supplied Software • 7.0 Base + Updates • RPM • RedHat Package Manager • Kickstart • Method for unattended server installation

  22. Community Software • Myricom’s General Messaging (GM) • MPICH • GM device • Ethernet device • Portable Batch System • Maui • PVM • Intel’s Math Kernel Library • Math functions tuned for Intel processors

  23. NPACI Rocks Software • Cluster-dist • A tool used to assemble the latest RedHat, community and Rocks packages into a distribution which is used by compute nodes during reinstallation • Shoot-node and eKV (Ethernet Keyboard and Video) • Initiate a compute node reinstallation • Monitor compute node reinstallations over Ethernet with telnet • Cluster-admin and cluster-ssl • Tools to create user accounts and user SSL certificates • Rexec (UC Berkeley) • Launch and control parallel jobs (SSL-based authentication) • Ganglia (UC Berkeley) • Cluster monitoring

  24. Software Details

  25. Cluster-dist • Integrate RedHat Packages from • RedHat (mirror) – base distribution + updates • Contrib directory • Locally produced packages • Packages from rocks.npaci.edu • Produces a single updated distribution that resides on the front-end • Is a RedHat distribution with patches and updates applied • Different kickstart files and different distributions can co-exist on a front-end to add flexibility in configuring nodes.

  26. Remote re-installation: Shoot-node and eKV • Rocks provides a simple method to remotely reinstall a node (once it has been installed the first time) • By default, hard power cycling will cause a node to reinstall itself • With no serial (or KVM) console, we are able to watch a node as it installs

  27. Remote re-installation: Shoot-node and eKV [Screenshot: remotely starting reinstallation on two nodes, 192.168.254.253 and 192.168.254.254]

  28. Starting Jobs • SSH-based MPI-Launch • Provides full integration with the Myrinet reservation capability of Usher/Patron • SSL-based Rexec • Better control of jobs on remote nodes • Sane signal propagation • Batch system: PBS + Maui • PBS provides queue definition and node monitoring • Maui has rich scheduling policies • Standing and future reservations • Query the number of nodes “available” now
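A job submitted through the batch system is an ordinary PBS script; the sketch below uses generic names (the queue, resource requests and the mpirun launch line are examples, and the SSH-based MPI-Launch path above starts the MPI processes differently):

    #!/bin/sh
    #PBS -N mytest
    #PBS -q default
    #PBS -l nodes=2,walltime=0:30:00
    cd $PBS_O_WORKDIR
    mpirun -np 4 -machinefile $PBS_NODEFILE ./my_mpi_app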

  29. PBS – Portable Batch System • Three standard components to PBS • MOM – Node health reporting daemon, Job Launch daemon on every node • Server – On front-end: queue definition, aggregation of node information • Scheduler – Policies for what job to run out of which queue at what time • We added a fourth • Configuration – Get cluster node configuration from our SQL database.

  30. PBS RPM Packaging • Repackaged PBS (Sane packaging + enhancements) • Added “chkconfig-compatible” start-up scripts • 4 packages • pbs (server and scheduler) (should be divided again) • pbs-mom • pbs-config-sql (Python script to generate database report) • pbs-common (files needed by all three packages) • A Rocks 2.0 base installation (automatically) defines a default queue with all nodes being available in the queue • http://pbs.mrj.com is a good starting point for PBS
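Being “chkconfig-compatible” means the daemons can be enabled and run with the stock RedHat service tooling; for example (the pbs-server script name appears on the next slide, and the pbs-mom script name is assumed to match its package):

    chkconfig --add pbs-server            # register the init script on the front-end
    chkconfig --level 345 pbs-server on   # start it in the normal multi-user runlevels
    /etc/rc.d/init.d/pbs-server start
    /etc/rc.d/init.d/pbs-mom status       # on a compute node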

  31. PBS Server defaults (and changing them) • Startup script: “/etc/rc.d/init.d/pbs-server start” • /usr/apps/pbs/pbs.default • “Sourced” every time PBS is started

    # $Id: pbs.default.in,v 1.5 2001/02/16 19:59:38 bruno Exp $
    #
    # A basic pbs setup that creates a queue called default and starts scheduling
    #
    # Create queues and set their attributes.
    #
    # Create and define queue default
    #
    # 1 node default, 1hr walltime
    create queue default
    set queue default queue_type = Execution
    set queue default resources_default.nodes = 1
    set queue default resources_default.walltime = 1:00:00
    set queue default enabled = True
    set queue default started = True

  32. pbs.default (cont’d)

    #
    # Set server attributes.
    #
    # Assume maui scheduler will be installed
    set server managers = maui@frontend-0
    set server operators = maui@frontend-0
    set server default_queue = default
    set server log_events = 511
    set server mail_from = adm
    set server scheduler_iteration = 600
    set server scheduling = false

  • PBS will ignore queue creation if a queue already exists.

  33. Modifying the default setup (simple queue creation) • Use qmgr to create a new queue

    # /usr/apps/pbs/bin/qmgr
    Max open servers: 4
    Qmgr: create queue single
    Qmgr: set queue single queue_type=execution
    Qmgr: set queue single enabled=true
    Qmgr: set queue single acl_hosts=compute-1-0
    Qmgr: set queue single started=true

  • Use qmgr to save the configuration

    /usr/apps/pbs/bin/qmgr -c "print server" > /usr/apps/pbs/pbs.default

  34. Maui Scheduler • We use Maui as our scheduler for PBS • mauischeduler.sourceforge.net • http://havi.supercluster.org/documentation/maui • Add the “single” queue definition so that Maui understands it. This is in /usr/spool/maui/maui.cfg

    SRNAME[0] single
    SRHOSTLIST[0] compute-1-0

  • Restart Maui: % /etc/rc.d/init.d/maui restart
  • Submit a job to PBS: % /usr/apps/pbs/bin/qsub -q single mytest.sh

  35. Monitoring your cluster • PBS has a GUI called xpbsmon. Gives a nice graphical view of the up/down state of nodes • SNMP status • Use the extensive SNMP MIB defined by the Linux community to find out many things about a node • Installed software • Uptime • Load • Ganglia (UCB) – IP multicast-based monitoring system
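For instance, uptime, load and installed software can be pulled from a node with the standard SNMP command-line tools (the flag syntax shown is the net-snmp style; the community string, hostname and MIB objects are assumptions for illustration):

    snmpwalk -v 1 -c public compute-0-0 system              # uptime and system description
    snmpget  -v 1 -c public compute-0-0 laLoad.1            # one-minute load average (UCD-SNMP-MIB)
    snmpwalk -v 1 -c public compute-0-0 hrSWInstalledName   # installed software (HOST-RESOURCES-MIB)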

  36. Ganglia - http://www.millennium.berkeley.edu/ganglia/ • Dendrite on each node • Multicasts state of the machine on significant changes • Load averages, disk consumption, memory, etc. • Beacons every minute, if no significant deltas • Axons • Collection daemons (at least one/cluster) • Ganglia client – Sort the measured variables to find a set of hosts that match a desired criteria • E.g. X MB free memory, load below Y • Can act as a “vexec” resource for Rexec.

  37. Ganglia – text output

    [phil@slic01 ~]$ /usr/sbin/ganglia load_one
    compute-1-5   0.07
    compute-0-9   0.08
    compute-1-3   0.14
    compute-2-0   0.15
    compute-2-8   0.18
    compute-2-5   0.27
    frontend-0    0.36
    compute-3-11  0.82
    compute-23    1.06
    compute-22    1.19
    compute-3-4   1.96
    compute-3-9   1.99
    compute-3-10  1.99
    compute-3-2   2.00
    compute-3-3   2.09
    compute-3-7   2.12
    compute-3-5   2.99
    compute-3-6   3.0

  38. “Hidden” Software

  39. Some Tools that assist in automation. • Users generally will not see these tools • Profile scripts run at user’s first login • Usher-patron (Myrinet port reservation) • Insert-ethers (Add nodes to a cluster) • Cluster-sql package • Reports to build service-specific config files • Cluster-admin • Node reinstallation • Creating accounts (NIS, auto.home map creation) • Cluster-ssl • Generate keys for SSL authentication (rexec)

  40. Usher/Patron • Tool to simplify using installed Myricom hardware • Eliminates the need for a central “database” to decide which Myrinet ports are currently in use • (Myricom driver installed with a separate source RPM) • Usher daemon runs on each compute node. Takes reservation requests for access to the limited set of Myrinet ports (RPC-based) • Reservations time out if not claimed • Patron – works with usher to request and claim ports • Integrated with MPI-Launch • Automatically creates the node file needed for MPICH-GM

  41. First Login Profile Scripts • On first login, all users, including root, are prompted to build an SSH public/private key pair • Makes sense because ssh is the only way to gain login access to the nodes • NIS is updated (passwd, auto.home, etc.) • Additionally, the first time root logs in, an SSL certificate authority is generated which is used to sign users’ SSL certificates • The SSL certificate and root’s public SSH key are then propagated to the compute node kickstart file
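In essence, the profile script does for the user what the following commands do by hand (a sketch only; the real script's key type, file names and prompts may differ):

    ssh-keygen -t rsa -f $HOME/.ssh/id_rsa                     # build the public/private key pair
    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys    # allow password-less ssh to the compute nodes
    chmod 600 $HOME/.ssh/authorized_keys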

  42. insert-ethers • Used to populate the “nodes” MySQL table • Parses a file (e.g., /var/log/messages) for DHCPDISCOVER messages • Extracts MAC addr and, if not in table, adds MAC addr and hostname to table • For every new entry: • Rebuilds /etc/hosts and /etc/dhcpd.conf • Reconfigures NIS • Restarts DHCP and PBS • Hostname is • <basename>-<cabinet>-<chassis> • Configurable to change hostname • E.g., when adding new cabinets
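The regenerated /etc/hosts simply reflects the <basename>-<cabinet>-<chassis> naming; an illustrative pair of entries (the addresses and domain suffix are examples, not actual insert-ethers output):

    192.168.254.253   compute-0-0.local   compute-0-0
    192.168.254.252   compute-0-1.local   compute-0-1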

  43. dhcp_options – One More Important MySQL Table • Created by the Frontend kickstart file (based on user input from Rocks configuration web page) • Used by makedhcp to construct the header in /etc/dhcpd.conf
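A dhcpd.conf built this way has a global header (from the dhcp_options table) followed by one host entry per row of the nodes table; an illustrative fragment, not the exact makedhcp output:

    subnet 192.168.0.0 netmask 255.255.0.0 {
        option routers 192.168.254.254;              # the front-end, which does NAT
        option domain-name-servers 192.168.254.254;
    }

    host compute-0-0 {
        hardware ethernet 00:50:8b:aa:bb:cc;         # MAC captured by insert-ethers
        fixed-address 192.168.254.253;
    }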

  44. Configuration Derived from Database [Diagram: automated node discovery; insert-ethers records Node 0 through Node N in the MySQL DB, then makehosts, makedhcp and pbs-config-sql generate /etc/hosts, /etc/dhcpd.conf and the PBS node list from the database]

  45. Futures • Attack the storage problem • Keep the global view of storage that NFS gives us, but address the scalability problem • Source high bandwidth from the cluster into the WAN • Apply our cluster bring-up automation to easily attach clusters to the grid • Continue to improve cluster monitoring • Configure a monitoring GUI (e.g., NetSaint) to extract data from Ganglia • Get node health (Fan Speed, Temp., Disk Error rate) into Ganglia • Technologies • Processors: IA-64 and Alpha • Networks: Infiniband • 2.4 kernel (Will rev our distribution at RedHat 7.1)

  46. Lab

  47. Front-end Node • Node seen by external world • Performs Network Address Translation (NAT) • NFS Server(s) for user home areas • Beware of scalability issues! • Compilers, libraries • Configuration for Nodes • DHCP Server, NIS Domain Controller, NTP Server, Web Server, MySQL Server • Installation Server for defining system on nodes • Method(s) to start jobs on compute nodes • Batch System (PBS + Maui) • Interactive launching of jobs
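Exporting the home areas to the private network, for example, is one line in /etc/exports on the front-end (the path and network below are illustrative), followed by exportfs -a or an NFS restart:

    /export/home   192.168.0.0/255.255.0.0(rw)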

  48. Installing a Front-end Machine • Build ks.cfg from https://rocks.npaci.edu/site.htm • Define your root password • NIS Domain • Public IP Address • Boot CD • Full ISO image for download. Burn your own! • Enter: “frontend” at the boot prompt. • Sit back. Time varies depending on speed of the CPU and CDROM of frontend • Entire distribution is being copied to /home/install/cluster-dist

  49. Building a Distribution with cluster-dist • Directory structure • Build mirror • From mirror host • Emulates mirroring from rocks • Build distro • cluster-dist dist

  50. Installing Compute Nodes • Log in as root to the frontend • Execute: tail -f /var/log/messages | insert-ethers • Back on the compute node • Boot CD • From a laptop • Examine the MySQL database through a browser
