
FAST-OS: Petascale Single System Image






Presentation Transcript


1. FAST-OS: Petascale Single System Image
Jeffrey S. Vetter (PI), Nikhil Bhatia, Collin McCurdy, Phil Roth, Weikuan Yu
Future Technologies Group, Computer Science and Mathematics Division

2. PetaScale single system image
• What?
  • Make a collection of standard hardware look like one big machine, in as many ways as feasible
• Why?
  • Provide a single solution to address all forms of clustering
  • Simultaneously address availability, scalability, manageability, and usability
• Components
  • Virtual cluster using contemporary hypervisor technology
  • Performance and scalability of a parallel shared root file system
  • Paging behavior of applications, i.e., reducing TLB misses
  • Reducing “noise” from operating systems at scale

3. Virtual clusters using hypervisors
• Physical nodes run the hypervisor in privileged mode.
• Virtual machines run on top of the hypervisor.
• Virtual machines are compute nodes hosting user-level distributed-memory (e.g., MPI) tasks.
• Virtualization enables dynamic cluster management via live migration of compute nodes (see the migration sketch below).
• Capabilities for dynamic load balancing and cluster management ensure high availability.
[Diagram: MPI tasks (Task 0–63) run inside VMs; several VMs sit on a Xen hypervisor on each of Physical Nodes 0–15.]
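The slide names live migration of virtualized compute nodes but does not show an interface for it. Purely as a hedged illustration, the sketch below live-migrates a Xen guest between two hosts using the libvirt C API; the connection URIs, host names, and domain name are hypothetical, and this is not necessarily how the project's own tooling works.

```c
/* Hedged sketch: live migration of a Xen guest via the libvirt C API.
 * Hostnames, URIs, and the domain name are hypothetical examples.
 * Build (typical): gcc migrate.c -lvirt
 */
#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    /* Connect to the Xen hypervisors on the source and destination nodes. */
    virConnectPtr src = virConnectOpen("xen+ssh://node00/");
    virConnectPtr dst = virConnectOpen("xen+ssh://node01/");
    if (!src || !dst) { fprintf(stderr, "connection failed\n"); return 1; }

    /* Look up the virtual compute node to be moved. */
    virDomainPtr vm = virDomainLookupByName(src, "compute-vm-07");
    if (!vm) { fprintf(stderr, "no such domain\n"); return 1; }

    /* Live migration: the guest keeps running while its memory is copied. */
    virDomainPtr moved = virDomainMigrate(vm, dst, VIR_MIGRATE_LIVE,
                                          NULL /* keep name */,
                                          NULL /* default URI */,
                                          0    /* no bandwidth cap */);
    if (!moved)
        fprintf(stderr, "migration failed\n");

    virDomainFree(vm);
    if (moved) virDomainFree(moved);
    virConnectClose(src);
    virConnectClose(dst);
    return moved ? 0 : 1;
}
```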

4. Virtual cluster management
• Single system view of the cluster
• Cluster performance diagnosis
• Dynamic node addition and removal ensuring high availability
• Dynamic load balancing across the entire cluster
• Innovative load-balancing schemes based on distributed algorithms (a simple rebalancing sketch follows)
[Diagram: virtual applications run inside virtual machines on Xen, hosted on Intel x86 physical nodes.]
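The load-balancing schemes themselves are not described on the slide. As an illustrative assumption only, the sketch below shows a greedy rebalancing decision: given per-node load samples, it pairs the most and least loaded nodes and proposes moving one VM between them. The node names and threshold are hypothetical, and the project's actual distributed algorithms may differ entirely.

```c
/* Hedged sketch: a greedy VM rebalancing decision over per-node load samples.
 * This is an illustrative placeholder, not the project's actual scheme.
 */
#include <stdio.h>

struct node_load {
    const char *name;   /* physical node name */
    double      load;   /* e.g., 1-minute load average or CPU utilization */
};

/* Propose a migration when the spread between the busiest and idlest
 * nodes exceeds a threshold; returns 1 and fills src/dst indices. */
static int propose_migration(const struct node_load *n, int count,
                             double threshold, int *src, int *dst)
{
    int busiest = 0, idlest = 0;
    for (int i = 1; i < count; i++) {
        if (n[i].load > n[busiest].load) busiest = i;
        if (n[i].load < n[idlest].load)  idlest  = i;
    }
    if (n[busiest].load - n[idlest].load < threshold)
        return 0;                        /* cluster is balanced enough */
    *src = busiest;
    *dst = idlest;
    return 1;
}

int main(void)
{
    struct node_load nodes[] = {
        { "node00", 7.5 }, { "node01", 1.2 }, { "node02", 3.9 },
    };
    int src, dst;
    if (propose_migration(nodes, 3, 2.0, &src, &dst))
        printf("migrate one VM from %s to %s\n",
               nodes[src].name, nodes[dst].name);
    return 0;
}
```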

5. Virunga—virtual cluster manager
[Diagram: a Virunga daemon runs on every compute node; the Virunga client on the monitoring station (login node) consists of a performance data parser, a performance data presenter, and a system administration manager.] A hedged daemon sketch follows.
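The slide only names the Virunga components; their interfaces are not shown. Under the assumption that each per-node daemon periodically pushes metrics to the monitoring station over TCP, the sketch below reads the Linux load average and sends it as a text record. The station hostname, port, record format, and sampling interval are all hypothetical.

```c
/* Hedged sketch of a per-node monitoring daemon: periodically read
 * /proc/loadavg and push it to a monitoring station over TCP.
 * Hostname, port, and record format are hypothetical.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

static int connect_to_station(const char *host, const char *port)
{
    struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

int main(void)
{
    char host[64], line[128], msg[256];
    gethostname(host, sizeof host);

    for (;;) {
        FILE *f = fopen("/proc/loadavg", "r");
        if (f && fgets(line, sizeof line, f)) {
            /* Record format (hypothetical): "<hostname> <loadavg line>" */
            snprintf(msg, sizeof msg, "%s %s", host, line);
            int fd = connect_to_station("monitor-station", "9900");
            if (fd >= 0) {
                if (write(fd, msg, strlen(msg)) < 0)
                    perror("write");
                close(fd);
            }
        }
        if (f) fclose(f);
        sleep(5);   /* sampling interval (arbitrary) */
    }
    return 0;
}
```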

6. Paging behavior of DOE applications
Goal: Understand the behavior of applications in large-scale systems composed of commodity processors
• With performance counters and simulation (a counter-reading sketch follows):
  • Show that the HPCC benchmarks, meant to characterize the memory behavior of HPC applications, do not exhibit the same behavior in the presence of paging hardware as scientific applications of interest to the Office of Science
  • Offer insight into why that is the case
• Use memory system simulation to determine whether large-page performance will improve with the next generation of Opteron processors
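The slides do not say which counter interface was used. As a hedged sketch only, the code below reads data-TLB misses and total cycles around a kernel using PAPI preset events, one common way to collect such counts on Opteron-class processors; the study itself may have used native counters or another tool, and the measured loop is just a stand-in for an application kernel.

```c
/* Hedged sketch: count data-TLB misses and cycles around a kernel with PAPI.
 * Build (typical): gcc tlb.c -lpapi
 */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

#define N (1L << 24)

int main(void)
{
    int evset = PAPI_NULL;
    long long counts[2];
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TLB_DM);   /* data TLB misses */
    PAPI_add_event(evset, PAPI_TOT_CYC);  /* total cycles */

    PAPI_start(evset);
    for (long i = 0; i < N; i++)          /* stand-in for an application kernel */
        a[i] = 2.0 * i;
    PAPI_stop(evset, counts);

    printf("TLB misses: %lld  cycles: %lld  misses/cycle: %g\n",
           counts[0], counts[1], (double)counts[0] / counts[1]);
    free(a);
    return 0;
}
```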

7. Experimental results (from performance counters)
[Figure: log-scale plots (0.0001–1.0) of TLB misses per cycle and L2 TLB hits, with large vs. small pages, for the HPCC benchmarks (FFT, HPL, MPIFFT, DGEMM, PTRANS, STREAM, RANDOM, MPIRANDOM) and for applications (LMP, GTC, POP, CAM, GYRO, AMBER, HYCOM).]
• Performance: trends for large vs. small pages are nearly opposite
• TLBs: significantly different miss rates

8. Reuse distances (simulated)
[Figure: histograms of simulated reuse distance (1 to 65536) versus percentage of references, with large vs. small pages, for the HPCC benchmarks HPL, STREAM, RANDOM, and PTRANS and for the applications CAM, HYCOM, GTC, and POP.]
• Patterns are clearly significantly different between the benchmarks and the applications.
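Reuse distance here is the number of distinct items (e.g., pages) touched between two accesses to the same item. The slides do not describe the simulator; purely as a hedged illustration of the metric, the sketch below computes reuse distances for a small trace of page numbers with a simple LRU-stack walk.

```c
/* Hedged sketch: compute reuse distances over a trace of page numbers
 * using a simple LRU stack (O(N*M); real simulators use faster structures).
 */
#include <stdio.h>

#define MAX_DISTINCT 1024

int main(void)
{
    /* A toy access trace of page numbers; a real trace would come from
     * instrumentation or a simulator. */
    unsigned long trace[] = { 1, 2, 3, 1, 2, 4, 1, 3, 3, 2 };
    int n = sizeof trace / sizeof trace[0];

    unsigned long stack[MAX_DISTINCT];   /* stack[0] = most recently used */
    int depth = 0;

    for (int i = 0; i < n; i++) {
        unsigned long page = trace[i];
        int pos = -1;
        for (int j = 0; j < depth; j++)
            if (stack[j] == page) { pos = j; break; }

        if (pos >= 0) {
            printf("page %lu: reuse distance %d\n", page, pos);
            /* Move the found entry back to the top of the LRU stack. */
            for (int j = pos; j > 0; j--)
                stack[j] = stack[j - 1];
        } else {
            printf("page %lu: cold (infinite reuse distance)\n", page);
            if (depth < MAX_DISTINCT)
                depth++;
            /* Push a new entry, evicting the LRU entry if the stack is full. */
            for (int j = depth - 1; j > 0; j--)
                stack[j] = stack[j - 1];
        }
        stack[0] = page;
    }
    return 0;
}
```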

9. Paging conclusions
• HPCC benchmarks are not representative of the paging behavior of typical DOE applications.
  • Applications access many more arrays concurrently.
• Simulation results (not shown) indicate the following:
  • New paging hardware in next-generation Opteron processors will improve large-page performance (a large-page allocation sketch follows this list).
  • Performance approaches that measured with paging hardware turned off.
  • However, simulations also indicate that the mere presence of a TLB likely degrades performance, whether paging hardware is on or off.
• More research is required into the implications of paging, and of commodity processors in general, for the performance of scientific applications.
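Large pages matter here because they reduce TLB pressure. As a hedged aside not taken from the slides, the sketch below shows one way an application can request large (huge) pages on Linux via mmap with MAP_HUGETLB, falling back to ordinary pages when huge pages are not configured; the buffer size is arbitrary, and this mechanism is independent of the hardware changes the slide discusses.

```c
/* Hedged sketch: request a large-page (huge-page) backed buffer on Linux.
 * MAP_HUGETLB requires configured hugepages; fall back to normal pages.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define BUF_BYTES (64UL * 1024 * 1024)

int main(void)
{
    void *buf = mmap(NULL, BUF_BYTES, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        /* No huge pages available: fall back to ordinary small pages. */
        buf = mmap(NULL, BUF_BYTES, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }
        puts("using small pages");
    } else {
        puts("using large (huge) pages");
    }

    memset(buf, 0, BUF_BYTES);          /* touch the buffer */
    munmap(buf, BUF_BYTES);
    return 0;
}
```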

10. Parallel root file system
• Goals of study
  • Use a parallel file system to implement a shared root environment
  • Evaluate performance of the parallel root file system
  • Evaluate the benefits of high-speed interconnects
  • Understand root I/O access patterns and potential scaling limits
• Current status
  • RootFS implemented using NFS, PVFS-2, Lustre, and GFS
  • RootFS distributed using a ramdisk via Etherboot
  • Modified the mkinitrd program locally
  • Modified init scripts to mount the root file system at boot time (a hedged mount sketch follows)
  • Evaluation with parallel benchmarks (IOR, b_eff_io, NPB I/O)
  • Evaluation with emulated loads of I/O accesses for RootFS
  • Evaluation of high-speed interconnects for Lustre-based RootFS
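The slide mentions init scripts modified to mount the root file system at boot but does not show how. Purely as a hedged illustration, the sketch below mounts a Lustre file system from an init-time helper using the mount(2) system call; the MGS address, file system name, mount point, and read-only flag are hypothetical, and the actual scripts likely invoke mount(8) from the shell instead.

```c
/* Hedged sketch: mount a Lustre-backed root file system from an init-time
 * helper. The MGS address, fsname, and mount point are hypothetical.
 */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* "mgs-node@tcp0:/rootfs" names the Lustre management server and
     * file system; "/sysroot" is where the init script pivots to. */
    if (mount("mgs-node@tcp0:/rootfs", "/sysroot", "lustre",
              MS_RDONLY, "" /* fs-specific options */) != 0) {
        perror("mount lustre rootfs");
        return 1;
    }
    puts("root file system mounted read-only on /sysroot");
    return 0;
}
```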

11. Example results: parallel benchmarks (IOR read/write throughput)
[Figure: six panels of IOR bandwidth (MiB/s) versus transfer size (8–256 KiB) for 256 MB blocks on 1, 4, 9, 16, and 25 PEs—write and read throughput for NFS, PVFS2, and Lustre.]
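IOR measures aggregate file I/O bandwidth from many MPI processes. The sketch below is a hedged, minimal analogue in MPI-IO: each rank writes a contiguous block of a shared file in fixed-size transfers, and rank 0 reports aggregate MiB/s. The file name, block size, and transfer size are illustrative, not the benchmark's actual parameters.

```c
/* Hedged sketch: an IOR-like write test over MPI-IO.
 * Build (typical): mpicc ior_sketch.c -o ior_sketch
 */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define BLOCK_BYTES (256L * 1024 * 1024)   /* per-rank block */
#define XFER_BYTES  (256 * 1024)           /* per-write transfer */

int main(int argc, char **argv)
{
    int rank, nprocs;
    static char buf[XFER_BYTES];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    memset(buf, 'x', sizeof buf);

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    MPI_Offset base = (MPI_Offset)rank * BLOCK_BYTES;
    for (MPI_Offset off = 0; off < BLOCK_BYTES; off += XFER_BYTES)
        MPI_File_write_at(fh, base + off, buf, XFER_BYTES, MPI_BYTE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Barrier(MPI_COMM_WORLD);
    double secs = MPI_Wtime() - t0;

    if (rank == 0)
        printf("aggregate write bandwidth: %.1f MiB/s\n",
               (double)nprocs * BLOCK_BYTES / (1024.0 * 1024.0) / secs);

    MPI_Finalize();
    return 0;
}
```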

12. Performance with synthetic I/O accesses
[Figure: startup time versus number of processors (2–22) for images of different sizes (16K, 16M, 1G), and four panels of execution time (sec) and CPU utilization (%) versus number of nodes (2–20) comparing Lustre over TCP with Lustre over InfiniBand for synthetic workloads: tar jcf of linux-2.6.17.13, tar jxf of linux-2.6.17.13.tar.bz2, cvs checkout of mpich2, and diff of mpich2.]

13. Contacts
• Jeffrey S. Vetter, Principal Investigator
  Future Technologies Group, Computer Science and Mathematics Division
  (865) 356-1649, vetter@ornl.gov
• Nikhil Bhatia, (865) 241-1535, bhatia@ornl.gov
• Phil Roth, (865) 241-1543, rothpc@ornl.gov
• Collin McCurdy, (865) 241-6433, cmccurdy@ornl.gov
• Weikuan Yu, (865) 574-7990, wyu@ornl.gov
Vetter_petassi_SC07
