Introduction to Linux Clusters

Presentation Transcript


  1. Introduction to Linux Clusters Clare Din, SAS Computing, University of Pennsylvania, March 15, 2004

  2. Cluster Components • Hardware: Nodes, Disk array, Networking gear, Backup device, Admin front end, UPS, Rack units • Software: Operating system, MPI, Compilers, Scheduler

  3. Cluster Components • Hardware: Nodes (compute nodes, admin node, I/O node, login node), Disk array, Networking gear, Backup device, Admin front end • Software: Operating system, Compilers, Scheduler, MPI

  4. Cluster Components • Hardware: Disk array (RAID5, SCSI 320, 10k+ RPM, TB+ capacity, NFS-mounted from the I/O node), Networking gear, Backup device, Admin front end • Software: Operating system, Compilers, Scheduler, MPI

  5. Cluster Components • Hardware: Networking gear (Myrinet, gigE, 10/100; switches, cables, networking cards), Backup device, Admin front end, UPS, Rack units • Software: Operating system, Compilers, Scheduler, MPI

  6. Cluster Components • Hardware: Backup device (AIT3, DLT, or LTO; N-slot cartridge drive; SAN), Admin front end, UPS, Rack units • Software: Operating system, Compilers, Scheduler, MPI

  7. Cluster Components • Hardware: Admin front end (console with keyboard, monitor, and mouse; KVM switches; KVM cables), UPS, Rack units

  8. Cluster Components • Hardware: UPS (APC SmartUPS 3000, 3 per 42U rack), Rack units • Software: Operating system, Compilers, Scheduler, MPI

  9. Cluster Components • Hardware: Rack units (42U, standard or deep) • Software: Operating system, Compilers, Scheduler, MPI

  10. Cluster Components • Software: Operating system (Red Hat 9+ Linux, Debian Linux, SUSE Linux, Mandrake Linux, FreeBSD, and others), MPI, Compilers, Scheduler

  11. Cluster Components • Software: MPI (MPICH, LAM/MPI, MPI-GM, MPI Pro), Compilers, Scheduler

  12. Cluster Components • Software: Compilers (GNU, Portland Group, Intel), Scheduler

  13. Cluster Components • Software: Scheduler (OpenPBS, PBS Pro, Maui)

  14. Filesystem Requirements • Use a journalled filesystem: reboots happen more quickly after a crash • Slight performance hit for this feature • ext3 is a popular choice (the old ext2 was not journalled)
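
For instance, an existing ext2 data partition can be journalled in place and remounted as ext3. A minimal sketch, assuming the partition is /dev/sdb1 and mounts at /scratch (both placeholders) and that it is unmounted first:

    # add a journal to an existing ext2 filesystem, converting it to ext3
    tune2fs -j /dev/sdb1
    # remount it as ext3 (and change its /etc/fstab entry from ext2 to ext3)
    mount -t ext3 /dev/sdb1 /scratch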

  15. Space and Power Requirements • Space: a standard 42U rack is about 24”W x 80”H x 40”D • Blade units give you more than 1 node per 1U of space in a deeper rack • Manage cables inside the rack; consider overhead or raised-floor cabling for the external cables • Power: a 67-node Xeon cluster consumes 19,872 W and needs 5.65 tons of A/C to keep it cool • Ideally, each UPS plug should connect to its own circuit • Clusters (especially blades) run very hot, so make sure there is adequate A/C and ventilation
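
The cooling figure follows from the standard conversion of roughly 3,517 W per ton of refrigeration: 19,872 W ÷ 3,517 W/ton ≈ 5.65 tons of cooling.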

  16. Network Requirements • External network: one 10 Mbps line is adequate (all computation and message passing is within the cluster) • Internal network: gigE, Myrinet, or some combination of the two • Base your network gear selection on whether most of your jobs are CPU-bound or I/O-bound

  17. Network Choices Compared • Fast Ethernet (100BT) • 0.1 Gb/s (or 100 Mb/s) bandwidth • Essentially free • gigE • 0.4 Gb/s to 0.64 Gb/s bandwidth • ~$400 per node • Myrinet • 1.2 Gb/s to 2.0 Gb/s bandwidth • ~$1000 per node • Scales to thousands of nodes • Buy fiber instead of copper cables

  18. I/O Node • Globally accessible filesystem (RAID5 disk array) • Backup device

  19. I/O Node • Globally accessible filesystem (RAID5 disk array) • NFS share it • Put user home directories, apps, and scratch space directories on it so all compute nodes can access them • Enforce quotas on home directories • Backup device
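
A rough sketch of the NFS and quota side of this, assuming the I/O node is called ionode, exports live under /export, and compute nodes are named node01 through node64 (all placeholder names):

    # /etc/exports on the I/O node: share home, apps, and scratch with the nodes
    /export/home     node*(rw,sync,no_root_squash)
    /export/apps     node*(rw,sync)
    /export/scratch  node*(rw,sync)

    # matching /etc/fstab entry on each compute node
    ionode:/export/home  /home  nfs  rw,hard,intr  0 0

    # set a per-user quota on home directories (filesystem must be mounted with usrquota)
    edquota -u someuser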

  20. I/O Node • Globally accessible filesystem (RAID5 disk array) • Backup device • Make sure your device and software are compatible with your operating system • Plan a good backup strategy • Test the ETA of bringing back a single file or a filesystem from backups
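
As one hedged illustration of testing restore times, assuming a tape drive at /dev/nst0 and plain tar-based backups (real backup software will differ, and the file names are placeholders):

    # full backup of the shared home filesystem to tape
    tar -cvf /dev/nst0 /export/home
    # later: rewind and time how long a single-file restore takes
    mt -f /dev/nst0 rewind
    time tar -xvf /dev/nst0 export/home/someuser/thesis.tex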

  21. Admin Node • Only sysadmins log into this node • Runs cluster management software

  22. Admin Node • Only sysadmins log into this node • Accessible only from within the cluster • Runs cluster management software

  23. Admin Node • Only admins log into this node • Runs cluster management software • User and quota management • Node management • Rebuild dead nodes • Monitor CPU utilization and network traffic

  24. Compute Nodes • Buy the fastest CPUs and bus speed you can afford. • Memory size of each node depends on the application mix. • Lots of hard disk space is not so much a priority since the nodes will primarily use shared space on the I/O node.

  25. Compute Nodes • Buy the fastest CPUs and bus speed you can afford. • Don’t forget that some software companies license their software per node, so factor in software costs • Stick with a proven technology over future promise • Memory size of each node depends on the application mix.

  26. Compute Nodes • Buy the fastest CPUs and bus speed you can afford. • Memory size of each node depends on the application mix. • 2 GB+ for large calculations • < 2 GB for financial databases • Lots of hard disk space is not so much a priority since the nodes will primarily use shared space on the I/O node.

  27. Compute Nodes • Buy the fastest CPUs and bus speed you can afford. • Memory size of each node depends on the application mix. • Lots of hard disk space is not so much a priority since the nodes will primarily use shared space on the I/O node. • Disks are cheap nowadays... 40GB EIDE is standard per node

  28. Compute Nodes • Choose a CPU architecture you’re comfortable with • Intel: P4, Xeon, Itanium • AMD: Opteron, Athlon • Other: G4/G5 • Consider that some algorithms require 2^n nodes • 32-bit Linux is free or close-to-free, while 64-bit Red Hat Linux costs $1600 per node

  29. Login Node • Users login here • Only way to get into the cluster • Compile code • Job control

  30. Login Node • Users login here • ssh or ssh -X • Cluster designers recommend 1 login node per 64 compute nodes • Update /etc/profile.d so all users get the same environment when they log in • Only way to get into the cluster • Compile code • Job control
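
A minimal sketch of the /etc/profile.d idea; the compiler and MPI paths below are placeholders for wherever those packages are actually installed:

    # /etc/profile.d/cluster.sh -- sourced by each user's login shell
    # put the compilers and MPI tools on everyone's PATH
    PATH=$PATH:/usr/local/pgi/linux86/bin:/usr/local/mpich/bin
    export PATH
    # point users at the shared scratch space served by the I/O node
    export SCRATCH=/scratch/$USER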

  31. Login Node • Users login here • Only way to get into the cluster • Static IP address (vs. DHCP addresses on all other cluster nodes) • Turn on built-in firewall software • Compile code • Job control
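
For the built-in firewall, one hedged iptables sketch that admits only ssh from the outside world while leaving the internal cluster interface open (treating eth0 as external and eth1 as internal is an assumption):

    # keep loopback, established connections, and all internal-interface traffic
    iptables -A INPUT -i lo -j ACCEPT
    iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -i eth1 -j ACCEPT
    # allow only inbound ssh on the external interface, drop everything else
    iptables -A INPUT -i eth0 -p tcp --dport 22 -j ACCEPT
    iptables -A INPUT -i eth0 -j DROP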

  32. Login Node • Users login here • Only way to get into the cluster • Compile code • Licenses should be purchased for this node only • Don’t pay for more than you need • 2 licenses might be sufficient for code compilation for a department • Job control

  33. Login Node • Users login here • Only way to get into the cluster • Compile code • Job control (using a scheduler) • Choice of queues to access subset of resources • Submit, delete, terminate jobs • Check on job status
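
A minimal PBS session from the login node might look like this sketch; the queue name, resource requests, and file names are placeholders:

    # myjob.pbs -- ask for 4 nodes with 2 processors each for 2 hours
    #PBS -N myjob
    #PBS -l nodes=4:ppn=2,walltime=02:00:00
    #PBS -q workq
    cd $PBS_O_WORKDIR
    mpirun -np 8 -machinefile $PBS_NODEFILE ./a.out

    qsub myjob.pbs     # submit the job; prints a job id such as 123.loginnode
    qstat -a           # check on job status
    qdel 123           # delete or terminate a job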

  34. Spare Nodes • Offline nodes that are put into service when an existing node dies • Use for spare parts • Use for testing environment

  35. Cluster Install Software • Designed to make cluster installation easier (“cluster in a box” concept) • Decreases ETA of the install process using automated steps • Decreases chance of user error • Choices:

  36. Cluster Management Software • Run parallel commands via GUI • Or write Perl scripts for command-line control • Install new nodes, rebuild corrupted nodes • Check on status of hardware (nodes, network connections) • Ganglia • xpbsmon • Myrinet tests (gm_board_info)
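
Without a GUI, a plain shell loop over the nodes (shown in shell here rather than Perl; node names node01 through node64 are an assumption) handles a lot of routine checking:

    # run a quick health check on every compute node
    for n in $(seq -w 1 64); do
        echo "== node$n =="
        ssh node$n uptime
    done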

  37. Cluster Management Software • xpbsmon - shows jobs running that were submitted via the scheduler

  38. Cluster Consistency • Rsync or rdist the /etc/passwd, shadow, gshadow, and group files from the login node to the compute nodes • Also consider rsync’ing (automatically or manually) the /etc/profile.d files, PBS config files, /etc/fstab, etc.
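
A hedged sketch of that rsync step, pushed from the login node to each compute node (node names are placeholders; in practice this would run from cron or a management tool):

    # push the account and group files out to every compute node
    for f in /etc/passwd /etc/shadow /etc/gshadow /etc/group; do
        for n in $(seq -w 1 64); do
            rsync -a $f node$n:/etc/
        done
    done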

  39. Local and Remote Management • Local management • GUI desktop from console monitor • KVM switches to access each node • Remote management • Console switch • ssh in and see what’s on the console monitor screen from your remote desktop • Web-based tools • Ganglia ganglia.sourceforge.net • Netsaint www.netsaint.org • Big Brother www.bb4.com

  40. Ganglia • Tool for monitoring clusters of up to 2000 nodes • Used on over 500 clusters worldwide • For multiple OSes and CPU architectures • Example session: ssh -X coffee.chem.upenn.edu, then ssh coffeeadmin, then mozilla &, then open http://coffeeadmin/ganglia • The web page periodically auto-refreshes

  41. Ganglia

  42. Ganglia

  43. Ganglia load screenshot, with nodes color-coded by utilization: 0-24% (~0.00 load), 25-49% (0.00-1.00), 50-74% (1.00-1.99), 75-100% (1.50-2.00), and over 100% (>2.00 load)

  44. Scheduling Software (PBS) • Set up queues for different groups of users based on resource needs (i.e. not everyone needs Myrinet; some users only need 1 node) • The world does not end if one node goes down; the scheduler will run the job on another node • Make sure pbs_server and pbs_sched are running on the login node • Make sure pbs_mom is running on all compute nodes, but not on the login, admin, or I/O nodes
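
As a sketch of the queue setup (queue names and limits are illustrative only), qmgr on the node running pbs_server can create a restricted queue, and a quick ps check confirms the daemons are where they belong:

    # a queue for users who only ever need a single node
    qmgr -c "create queue serial queue_type=execution"
    qmgr -c "set queue serial resources_max.nodes=1"
    qmgr -c "set queue serial enabled=true"
    qmgr -c "set queue serial started=true"

    # confirm the daemons are running in the right places
    ps ax | grep pbs_server    # login node
    ps ax | grep pbs_mom       # each compute node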

  45. Scheduling Software • OpenPBS • PBS Pro • Others

  46. Scheduling Software • OpenPBS • Limit users by number of jobs • Good support via messageboards • *** FREE *** • PBS Pro • Others

  47. Scheduling Software • OpenPBS • PBS Pro • The “pro” version of OpenPBS • Limit by nodes, not just jobs per user • Must pay for support ($25 per CPU, or $3200 for a 128 CPU cluster) • Others

  48. Scheduling Software • OpenPBS • PBS Pro • Others • Load Sharing Facility (LSF) • CODINE • Maui

  49. MPI Software • MPICH (Argonne National Labs) • LAM/MPI (OSC/Univ. of Notre Dame) • MPI-GM (Myricom) • MPI Pro (MSTi Software) • Programmed by one of the original developers of MPICH • Claims to be 20% faster than MPICH • Costs $1200 plus support per year
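
Whichever implementation is chosen, the day-to-day workflow on the login node is similar; a minimal MPICH-style sketch (program name, process count, and machines file are placeholders):

    # compile an MPI program with the wrapper compiler
    mpicc -O2 -o hello hello.c
    # run it across 8 processors listed in a machines file
    mpirun -np 8 -machinefile machines ./hello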

  50. Compilers and Libraries • Compilers • gcc/g77 www.gnu.org/software • Portland Group www.pgroup.com • Intel www.developer.intel.com • Libraries • BLAS • ATLAS - portable BLAS www.math-atlas.sourceforge.net • LAPACK • SCALAPACK - MPI-based LAPACK • FFTW - Fast Fourier Transform www.fftw.org • many, many more
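
Linking against these libraries is usually just a matter of extra flags on the link line; a sketch assuming the GNU compilers and system-installed copies of the libraries (paths and library names vary by installation):

    # Fortran code calling LAPACK and BLAS routines
    g77 -O2 -o solver solver.f -llapack -lblas
    # C code built against FFTW (FFTW 2.x naming; FFTW 3 uses -lfftw3)
    gcc -O2 -o transform transform.c -lfftw -lm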
