Resource Management Issues in Large-Scale Cluster
Yutaka Ishikawa (ishikawa@is.s.u-tokyo.ac.jp), Computer Science Department/Information Technology Center, The University of Tokyo
http://www.il.is.s.u-tokyo.ac.jp/ http://www.itc.u-tokyo.ac.jp


Presentation Transcript


  1. Resource Management Issues in Large-Scale Cluster
     Yutaka Ishikawa, ishikawa@is.s.u-tokyo.ac.jp
     Computer Science Department/Information Technology Center, The University of Tokyo
     http://www.il.is.s.u-tokyo.ac.jp/
     http://www.itc.u-tokyo.ac.jp

  2. Outline
     • Jittering
     • Memory Affinity
     • Power Management
     • Bottleneck Resource Management

  3. Issues
     • Jittering Problem
       • The execution of a parallel application is disturbed by system processes that run independently on each node. This delays global operations such as allreduce.
       • [Figure: execution timelines of processes #0 to #3]
     • References:
       • Terry Jones, William Tuel, Brian Maskell, “Improving the Scalability of Parallel Jobs by adding Parallel Awareness to the Operating System,” SC2003.
       • Fabrizio Petrini, Darren J. Kerbyson, Scott Pakin, “The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q,” SC2003.
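The effect described on this slide can be illustrated with a small simulation. This is only a sketch under assumed numbers, not a model of any measured system: every node computes for a fixed time per step, each node is independently interrupted with a small probability, and the global operation completes only when the slowest node arrives. The function name and all parameter values are hypothetical.

```python
import random

def mean_step_time(num_nodes, steps=500, compute=1.0,
                   noise_prob=0.01, noise_cost=0.5, seed=0):
    """Average time of one compute + allreduce step under independent noise.

    All parameters are illustrative assumptions, not measurements.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        # The global operation finishes when the slowest node arrives.
        total += max(
            compute + (noise_cost if rng.random() < noise_prob else 0.0)
            for _ in range(num_nodes)
        )
    return total / steps

if __name__ == "__main__":
    for n in (1, 64, 1024):
        print(n, round(mean_step_time(n), 3))
```

With independent noise, the chance that at least one node is disturbed in a step approaches 1 as the node count grows, so the average step time drifts toward the disturbed case even though each node is rarely interrupted.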

  4. Jittering Problem
     • Our Approach
       • Clusters usually have two types of network
         • Network for computing, e.g., Myrinet, InfiniBand
         • Network for management, e.g., Gigabit Ethernet
       • The management network is used to deliver the global clock
         • The interval timer is turned off
         • A broadcast packet is sent from the global clock generator
       • Gang scheduling is employed for all system and application processes
     • [Figure: a global clock generator attached to the network for management, alongside the network for computing]
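Why a shared clock helps can be sketched with a toy model. It assumes time is divided into ticks, each node spends `busy` ticks out of every `period` on system work, and the application makes progress in a tick only when every node runs it. With a global clock all nodes do their system work in the same ticks; without one, each node starts at its own phase. The model, the function name, and the numbers are assumptions for illustration and do not represent the modified kernel.

```python
import random

def useful_fraction(num_nodes, period=100, busy=2, coordinated=True, seed=0):
    """Fraction of ticks in which all nodes run the application."""
    rng = random.Random(seed)
    # Phase = tick within the period at which a node starts its system work.
    phases = [0 if coordinated else rng.randrange(period)
              for _ in range(num_nodes)]
    useful = 0
    for tick in range(period):
        # The application progresses only if no node is doing system work.
        if all((tick - p) % period >= busy for p in phases):
            useful += 1
    return useful / period

if __name__ == "__main__":
    print("coordinated:  ", useful_fraction(16, coordinated=True))
    print("uncoordinated:", useful_fraction(16, coordinated=False))
```

In the coordinated case the application loses only the `busy` ticks once per period, regardless of node count; in the uncoordinated case the lost ticks can add up across nodes.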

  5. Jittering Problem
     • Preliminary Experience
       • The management network is used to deliver the global clock
       • The interval timer is turned off
       • On each arrival of the special broadcast packet, the tick counter is updated (the kernel code has been modified)
       • No cluster daemons, such as a batch scheduler or an information daemon, are running, but system daemons are running
     • CPU: AMD Opteron 275 2.2 GHz
     • Memory: 2GHz
     • Network: Myri-10G; BCM5721 Gigabit Ethernet
     • Number of hosts: 16
     • Kernel: Linux 2.6.18 x86_64, modified
     • MPI: mpich-mx 1.2.6
     • MX: version 1.2.0
     • Daemons: syslog, portmap, sshd, sysstat, netfs, nfslock, autofs, acpid, mx, ypbind, rpcgssd, rpcidmapd, network

  6. Preliminary Global Clock Experience: NAS Parallel Benchmark MG
     • [Plot: elapsed time (seconds) of 20 executions, sorted; + no global clock, × global clock]

  7. Preliminary Global Clock Experience: NAS Parallel Benchmark FT
     • [Plot: elapsed time (seconds); + no global clock, × global clock]

  8. Preliminary Global Clock Experience: NAS Parallel Benchmark CG
     • [Plot: elapsed time (seconds); + no global clock, × global clock]

  9. What kinds of heavy daemons run in a cluster?
     • Batch Job System
       • In the case of Torque
         • Every 1 second, the daemon takes 50 microseconds
         • Every 45 seconds, the daemon takes about 8 milliseconds
     • Monitoring System
       • Not yet measured
     • Simple Formulation: Worst-Case Overhead
       • $N$: number of nodes
       • $TI_i$: interval time of daemon $i$
       • $TR_i$: running time of daemon $i$
       • Overhead: $\sum_i \min(TI_i,\ TR_i \times N) / TI_i$
       • In the case of a 1000-node cluster: 0.000050 × 1000 / 1 + 0.008 × 1000 / 45 = 22.8 %
       • The worst case might never happen!
     • [Figure: timeline showing the running time TR within each interval TI]
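The worst-case formula on this slide can be checked with a few lines of code. This is a minimal sketch: the function name and the `(interval, running time)` tuple layout are assumptions for illustration, and the Torque figures are the ones quoted on the slide.

```python
def worst_case_overhead(num_nodes, daemons):
    """Sum over daemons of MIN(TI, TR * N) / TI.

    daemons: list of (interval_seconds, running_seconds) pairs.
    The MIN caps each term at 1, since a daemon cannot delay the
    application for longer than its own interval.
    """
    return sum(min(ti, tr * num_nodes) / ti for ti, tr in daemons)

# Torque figures from the slide: 50 microseconds every 1 second,
# and about 8 milliseconds every 45 seconds.
torque = [(1.0, 50e-6), (45.0, 8e-3)]

if __name__ == "__main__":
    print(f"{worst_case_overhead(1000, torque):.1%}")  # prints 22.8%
```

The formula assumes that no two nodes are ever disturbed at the same time, which is why it is an upper bound rather than an expected value.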

  10. Issues on NUMA
      • Memory Affinity in NUMA
        • CPU to memory
        • Network to memory
      • An example of network and memory
        • [Figure: Node 0 and Node 1, each with memory, dual-core CPUs, NFP3600 and NFP3050 chipsets, and NICs; paths are labeled Near and Far]

  11. Memory Location and Communication
      • [Figure: layout of memory (M), processors (P), chipsets (C), and NICs (N)]
      • Note: the result depends on the BIOS settings.
      • Communication performance depends on data location.
      • Data is also accessed by the CPU.
      • The location of data should be determined based on both the CPU and the network location.
      • Is a dynamic data migration mechanism needed?
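One way to read "determined based on both the CPU and the network location" is as a weighted cost over candidate memory locations. The sketch below is only an illustration of that idea: the cost model, the function name, and the distance table are assumptions, not something given in the slides.

```python
def best_memory_node(cpu_node, nic_node, cpu_bytes, nic_bytes, distance):
    """Pick the memory location with the lowest weighted access cost.

    distance[a][b] is a hypothetical relative cost for a device attached
    to location a to access memory at location b (1 = near, larger = far).
    """
    def cost(mem):
        return (cpu_bytes * distance[cpu_node][mem]
                + nic_bytes * distance[nic_node][mem])
    return min(range(len(distance)), key=cost)

if __name__ == "__main__":
    # Two memory locations; remote access is assumed to cost twice as much.
    distance = [[1, 2],
                [2, 1]]
    # CPU-heavy buffer: place it next to the CPU (location 0).
    print(best_memory_node(0, 1, cpu_bytes=100, nic_bytes=10, distance=distance))
    # Network-heavy buffer: place it next to the NIC (location 1).
    print(best_memory_node(0, 1, cpu_bytes=10, nic_bytes=100, distance=distance))
```

If the access mix changes over the life of a buffer, the best location changes with it, which is the situation the slide's question about dynamic migration points at.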

  12. Power Management
      • Power Consumption Issue
        • 100 Tflops cluster machine
          • 1666 nodes
          • At 80 % machine resource utilization, 332 nodes are idle
        • 66 kW of power is wasted by idle nodes
          • 55K$ (6.6 million yen)/year
          • This is an underestimate because the memory size is small and no network switches are included
        • 10.6 kW of power is wasted even though the power is turned off!!
          • 9K$ (1.1 million yen)/year
      • Power consumption in a single node??
        • Supermicro AS-2021-M-UR+V
        • Opteron 2347 x 2 (Barcelona 1.9 GHz, 60.8 Gflops)
        • 4 Gbyte memory
        • InfiniBand HCA x 2
        • Fedora Core 7
        • Digital ammeter: FLUKE105B
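The figures on this slide imply a per-node power draw and an electricity price that the slide does not state directly. The sketch below derives them from the slide's own numbers. The derived values are back-calculations, not data from the presentation, and a full year of continuous operation is assumed.

```python
HOURS_PER_YEAR = 24 * 365   # assumes the waste is continuous all year

idle_nodes = 332            # slide: idle nodes at 80 % utilization
idle_waste_kw = 66.0        # slide: power wasted by idle nodes
off_waste_kw = 10.6         # slide: power wasted by powered-off nodes
idle_cost_yen = 6_600_000   # slide: yearly cost of the idle waste

# Implied draw per node, in watts.
idle_w_per_node = idle_waste_kw * 1000 / idle_nodes   # roughly 199 W
off_w_per_node = off_waste_kw * 1000 / idle_nodes     # roughly 32 W

# Implied electricity price, back-calculated from the idle cost.
yen_per_kwh = idle_cost_yen / (idle_waste_kw * HOURS_PER_YEAR)

# Yearly cost of the powered-off waste at that same price.
off_cost_yen = off_waste_kw * HOURS_PER_YEAR * yen_per_kwh

if __name__ == "__main__":
    print(f"idle node: {idle_w_per_node:.0f} W, off node: {off_w_per_node:.0f} W")
    print(f"implied price: {yen_per_kwh:.1f} yen/kWh")
    print(f"powered-off waste: {off_cost_yen:,.0f} yen/year")
```

At the implied price the powered-off waste comes to about 1.06 million yen per year, consistent with the slide's rounded 1.1 million yen.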

  13. Power Management
      • Cooperating with the batch job system
        • Idle machines are turned off
        • When those machines are needed, they are turned on using the IPMI (Intelligent Platform Management Interface) protocol (BMC)
        • However, we still lose 300 mA for each idle machine that has been turned off
      • Quick shutdown/restart and synchronization mechanism
      • [Figure: timeline of node states under the batch job system (JOB1 running, JOB2 running, Idle, Turn OFF, Turn ON, In Service) as JOB3 is submitted, dispatched, and run]
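The cooperation between the batch job system and IPMI could look roughly like the sketch below. It is not the presenter's implementation: the helper names, host names, and user name are hypothetical, and the code only builds `ipmitool` command lines (using the `lanplus` interface and the `chassis power` subcommand) without executing them.

```python
def nodes_to_power_on(job_size, idle_on, powered_off):
    """Return the powered-off nodes that must be started for a job."""
    shortfall = max(0, job_size - len(idle_on))
    if shortfall > len(powered_off):
        raise ValueError("not enough nodes for this job")
    return powered_off[:shortfall]

def ipmi_power_command(bmc_host, user, action):
    """Build an ipmitool command line for a chassis power action."""
    if action not in ("on", "off", "status"):
        raise ValueError(f"unsupported action: {action}")
    # -E makes ipmitool read the password from the IPMI_PASSWORD
    # environment variable instead of the command line.
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E",
            "chassis", "power", action]

if __name__ == "__main__":
    # Hypothetical state: one idle node is on, three nodes are powered off.
    for node in nodes_to_power_on(3, ["n1"], ["n2", "n3", "n4"]):
        print(" ".join(ipmi_power_command(f"bmc-{node}", "admin", "on")))
```

A real scheduler would also have to wait for the started nodes to boot and rejoin the cluster before dispatching the job, which is where the quick restart and synchronization mechanism mentioned on the slide comes in.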

  14. Bottleneck Resource Management
      • What are bottleneck resources?
        • A cluster has many of some resources, while other resources are limited.
        • When the cluster accesses such a limited resource, overloading or congestion happens.
      • Examples
        • Internet
          • We have been focusing on bottleneck links in GridMPI
          • [Figure: 10 GB/sec × N inside the cluster, 10 GB/sec to the Internet]
        • Global File System
          • From the file system viewpoint, N file operations are performed independently, where N is the number of nodes
          • [Figure: 10 GB/sec to the file system, 10 GB/sec × N from the nodes]

  15. Summary
      • We have presented issues on large-scale clusters
        • Jittering
        • Memory affinity
        • Power management
        • Bottleneck resource management
