Parallel Scalable Operating Systems
Parallel Scalable Operating Systems Presented by Dr. Florin Isaila, Universidad Carlos III de Madrid, Visiting Scholar at Argonne National Lab
Contents • Preliminaries • Top500 • Scalability • Blue Gene • History • BG/L, BG/C, BG/P, BG/Q • Scalable OS for BG/L • Scalable file systems • BG/P at Argonne National Lab • Conclusions
Generalities • Published twice a year since 1993, in June and November • Ranking of the most powerful computing systems in the world • Ranking criterion: performance on the LINPACK benchmark • Driven by Jack Dongarra • Web site: www.top500.org
HPL: High-Performance Linpack • Solves a dense system of linear equations • Variant of LU factorization on matrices of size N • Measures a computer's floating-point rate of execution • Computation done in 64-bit floating-point arithmetic • Rpeak: theoretical system performance • Upper bound on the real performance (in MFLOPS) • Ex: Intel Itanium 2 at 1.5 GHz, 4 FP ops/cycle -> 6 GFLOPS • Nmax: obtained by varying N and choosing the size giving maximum performance • Rmax: maximum real performance, achieved for Nmax • N1/2: problem size needed to achieve ½ of Rmax
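As a quick illustration of how the slide's peak-performance figure is derived, here is a minimal sketch (my own helper, not Top500 code) that multiplies clock rate by floating-point operations per cycle:

```python
# Illustrative sketch (not IBM/Top500 code): Rpeak is just
# clock rate x FP ops per cycle x number of cores.

def rpeak_gflops(clock_ghz: float, flops_per_cycle: int, cores: int = 1) -> float:
    """Theoretical peak performance in GFLOPS."""
    return clock_ghz * flops_per_cycle * cores

# Slide example: Intel Itanium 2 at 1.5 GHz, 4 FP ops/cycle -> 6 GFLOPS
print(rpeak_gflops(1.5, 4))  # 6.0
```

Rmax, by contrast, can only be measured by actually running HPL; it is always below this bound.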
Amdahl's law • Suppose a fraction f of your application is not parallelizable • The remaining fraction 1-f is parallelizable on p processors: Speedup(p) = T1/Tp ≤ T1/(f·T1 + (1-f)·T1/p) = 1/(f + (1-f)/p) ≤ 1/f
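The formula above can be checked numerically; this short sketch shows how the serial fraction f caps speedup at 1/f no matter how many processors are added:

```python
# Amdahl's law as stated on the slide: the serial fraction f caps
# achievable speedup at 1/f regardless of the processor count p.

def speedup(f: float, p: int) -> float:
    """Upper bound on speedup: 1 / (f + (1 - f) / p)."""
    return 1.0 / (f + (1.0 - f) / p)

# With 5% serial work, even 1024 processors stay below the 1/f = 20x ceiling.
for p in (16, 256, 1024):
    print(p, round(speedup(0.05, p), 2))
```

This is why OS noise matters at scale: any per-node serial overhead effectively grows f.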
Load balance • Speedup ≤ Sequential Work / Max Work on any Processor • Work: data access, computation • Not just equal work: processors must also be busy at the same time • Ex: Speedup ≤ 1000/400 = 2.5
Communication and synchronization • Speedup ≤ Sequential Work / Max (Work + Synch Wait Time + Comm Cost) • Communication is expensive! • Measure: communication-to-computation ratio • Inherent communication: determined by the assignment of tasks to processes • Actual communication may be larger (artifactual) • One principle: assign tasks that access the same data to the same process • [Figure: timeline of three processes showing work, communication, synchronization points, and synchronization wait time]
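The two speedup bounds above can be combined in one sketch (the per-process numbers are made up for illustration):

```python
# Sketch of the slides' bound: speedup <= sequential work divided by the
# maximum per-processor cost (work + synchronization wait + comm cost).

def speedup_bound(seq_work: float, per_proc_costs: list[float]) -> float:
    """Upper bound on speedup given the total cost on each processor."""
    return seq_work / max(per_proc_costs)

# Load-balance example from the previous slide: 1000 units of sequential
# work, the busiest processor does 400 -> speedup <= 2.5.
print(speedup_bound(1000, [400, 300, 300]))  # 2.5

# Adding synch wait (50) and comm cost (50) to the busiest process
# only tightens the bound further.
print(speedup_bound(1000, [400 + 50 + 50, 300, 300]))  # 2.0
```

The bound makes the slide's principle concrete: co-locating tasks that share data reduces the comm-cost term on the busiest processor.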
Blue Gene partners • IBM • “Blue”: The corporate color of IBM • “Gene”: The intended use of the Blue Gene clusters – Computational biology, specifically, protein folding • Lawrence Livermore National Lab • Department of Energy • Academia
Family • BG/L • BG/C • BG/P • BG/Q
Blue Gene/L packaging hierarchy • Chip: 2 processors, 2.8/5.6 GF/s, 4 MB • Compute Card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB • Node Card: 32 chips (4x4x2), 16 compute cards, 0-2 I/O cards, 90/180 GF/s, 16 GB • Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB • System: 64 racks (64x32x32), 180/360 TF/s, 32 TB
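The hierarchy multiplies up consistently; this sanity check (my own arithmetic, not an IBM source) reproduces the slide's rounded figures, where each pair is coprocessor mode / both-CPUs-computing mode:

```python
# Sanity check of the packaging hierarchy, all figures in GF/s.
# Each pair is (coprocessor mode, both CPUs computing).

chip      = (2.8, 5.6)                        # 1 chip: 2 PowerPC 440 cores
card      = tuple(2 * x for x in chip)        # compute card: 2 chips
node_card = tuple(16 * x for x in card)       # node card: 16 compute cards
rack      = tuple(32 * x for x in node_card)  # rack: 32 node cards
system    = tuple(64 * x for x in rack)       # system: 64 racks

print(node_card)                         # ~ (89.6, 179.2), the slide's 90/180 GF/s
print(rack[0] / 1e3, rack[1] / 1e3)      # ~ 2.87 / 5.73 TF/s, the slide's 2.8/5.6
print(system[0] / 1e3, system[1] / 1e3)  # ~ 183.5 / 367 TF/s, the slide's 180/360
```

The exact system totals (183.5 / 367 TFLOPS) match the technical specifications on the next slide.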
Technical specifications • 64 cabinets containing 65,536 high-performance compute nodes (chips) • 1,024 I/O nodes • 32-bit PowerPC processors • 5 networks • Main memory size: 33 terabytes • Maximum performance of 183.5 TFLOPS when using one processor per node for computation and the other for communication, and 367 TFLOPS when using both for computation
Blue Gene / L • Networks: • 3D Torus • Collective Network • Global Barrier/Interrupt • Gigabit Ethernet (I/O & Connectivity) • Control (system boot, debug, monitoring)
Networks • Three-dimensional torus: compute nodes • Global tree: collective communication, I/O • Ethernet • Control network
Three-dimensional (3D) torus network in which the nodes (red balls) are connected to their six nearest-neighbor nodes in a 3D mesh.
Blue Gene / L • Processor: PowerPC 440 at 700 MHz • Low power allows dense packaging • External memory: 512 MB / 1 GB SDRAM per node • Slow embedded core at a clock speed of 700 MHz • 32 KB L1 cache • L2 is a small prefetch buffer • 4 MB embedded DRAM L3 cache
BG/L compute ASIC • Non-cache coherent L1 • Pre-fetch buffer L2 • Shared 4MB DRAM (L3) • Interface to external DRAM • 5 network interfaces • Torus, collective, global barrier, Ethernet, control
Blue Gene / L • Compute Nodes: • Dual processor, 1024 per Rack • I/O Nodes: • Dual processor, 16-128 per Rack
Blue Gene / L • Compute nodes: • Proprietary kernel (tailored to the processor design) • I/O nodes: • Embedded Linux • Front-end and service nodes: • SUSE SLES 9 Linux (familiar to users)
Blue Gene / L • Performance: • Peak performance per rack: 5.73 TFLOPS • Linpack performance per rack: 4.71 TFLOPS
Blue Gene / C • a.k.a. Cyclops64 • Massively parallel (first supercomputer on a chip) • Processors connected by a 96-port, 7-stage non-internally-blocking crossbar switch • Theoretical peak performance (chip): 80 GFLOPS
Blue Gene / C • Cellular architecture • 64-bit Cyclops64 chip: • 500 MHz • 80 processors (each with 2 thread units and an FP unit) • Software • Cyclops64 exposes much of the underlying hardware to the programmer, allowing the programmer to write very high-performance, finely tuned software.
Blue Gene / C • [Picture of BG/C] • Performance: • Board: 320 GFLOPS • Rack: 15.76 TFLOPS • System: 1.1 PFLOPS
Blue Gene / P • Similar architecture to BG/L, but: • Cache-coherent L1 cache • 4 cores per node • 10 Gbit Ethernet external I/O infrastructure • Scales up to 3 PFLOPS • More energy efficient • 167 TF/s by 2007, 1 PF/s by 2008
Blue Gene / Q • Continuation of Blue Gene/L and /P • Targeting 10 PF/s by 2010/2011 • Higher frequency at similar performance/watt • Similar number of nodes • Many more cores • More generally useful • Aggressive compiler • New network: scalable and cheap
Motivation for a scalable OS • Blue Gene/L is currently the world's fastest and most scalable supercomputer • Several system components contribute to that scalability • The operating systems for the different nodes of Blue Gene/L are among the components responsible for that scalability • The OS overhead on one node affects the scalability of the whole system • Goal: design a scalable solution for the OS
High level view of BG/L • Principle: the structure of the software should reflect the structure of the hardware.
BG/L Partitioning • Space-sharing • Divided along natural boundaries into partitions • Each partition can run only one job • Each node can be in one of these modes: • Coprocessor: one processor assists the other • Virtual node: two separate processors, each with its own memory space
OS • Compute nodes: dedicated OS • I/O nodes: dedicated OS • Service nodes: conventional off-the-shelf OS • Front-end nodes: program compilation, debugging, job submission • File servers: store data, not specific to BG/L
BG/L OS solution • Components: I/O nodes, service nodes, CNK • The compute and I/O nodes are organized into logical entities called processing sets, or psets: 1 I/O node + a collection of CNs • 8, 16, 64, or 128 CNs • Logical concept • Should reflect physical proximity => fast communication • Job: collection of N compute processes (on CNs) • Own private address space • Message passing • MPI: ranks 0 to N-1
BG/L OS solution: CNK • Compute nodes run only compute processes, and all the compute nodes of a particular partition can execute in two different modes: • Coprocessor mode • Virtual node mode • Compute Node Kernel (CNK): simple OS • Creates an address space • Loads code and initializes data • Transfers processor control to the loaded executable
CNK • Consumes 1 MB of memory • Creates either: • One address space of 511/1023 MB • Two address spaces of 255/511 MB • No virtual memory, no paging • The entire mapping fits into the TLB of the PowerPC • Load in push mode: one CN reads the executable from the file system and sends it to all the others • One image is loaded, and then the kernel stays out of the way!
CNK • No OS scheduling (one thread) • No memory management (no TLB overhead) • No local file services • User-level execution until: • The process requests a system call • Hardware interrupts: timer (requested by the application), abnormal events • Syscalls • Simple: handled locally (getting the time, setting an alarm) • Complex: forwarded to the I/O nodes • Unsupported (fork/mmap): error
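The three-way syscall policy above can be sketched as a simple dispatch table. This is a hypothetical illustration (names and sets invented, not the real CNK API or its actual syscall lists):

```python
# Hypothetical sketch of CNK's syscall dispatch policy: trivial calls are
# handled on the compute node, I/O-style calls are shipped to the pset's
# I/O node (CIOD), and unsupported calls fail. Syscall sets are examples
# only, not CNK's real classification.

LOCAL = {"gettimeofday", "alarm"}
FORWARDED = {"open", "read", "write", "close"}
UNSUPPORTED = {"fork", "mmap"}

def handle_syscall(name: str) -> str:
    if name in LOCAL:
        return "handled on compute node"
    if name in FORWARDED:
        return "forwarded to I/O node (CIOD)"
    if name in UNSUPPORTED:
        return "error: not supported by CNK"
    return "error: unknown syscall"

print(handle_syscall("gettimeofday"))  # handled on compute node
print(handle_syscall("write"))         # forwarded to I/O node (CIOD)
print(handle_syscall("fork"))          # error: not supported by CNK
```

Keeping this dispatch trivial is what lets CNK avoid scheduling, paging, and the OS noise they introduce.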
Benefits of the simple solution • Robustness: simple design, implementation, test, debugging • Scalability: no interference among compute nodes • Low system noise • Performance measurements
I/O node • Two roles in Blue Gene/L: • Acts as an effective master of its corresponding pset • Serves requests from the compute nodes in its pset • Mainly I/O operations on locally mounted file systems • Only one processor is used, due to the lack of memory coherency • Executes an embedded version of the Linux operating system: • Does not use any swap space • Has an in-memory root file system • Uses little memory • Lacks the majority of Linux daemons
I/O node • Complete TCP/IP stack • Supported file systems: NFS, GPFS, Lustre, PVFS • Main process: Control and I/O Daemon (CIOD) • Launching a job: • The job manager sends the request to the service node • The service node contacts the CIOD • The CIOD sends the executable to all processes in the pset
Service nodes • Run the Blue Gene/L control system • Tight integration with CNs and I/O nodes • CNs and I/O nodes: stateless, no persistent memory • Responsible for operating and monitoring the CNs and I/O nodes • Create system partitions and isolate them • Compute network routing for the torus, collective, and global interrupt networks • Load OS code onto the CNs and I/O nodes
Problems • Not fully POSIX compliant • Many applications need • Process/thread creation • Full server sockets • Shared memory segments • Memory mapped files
File systems for BG systems • Need for scalable file systems: NFS is not a solution • Most supercomputers and clusters in the Top500 use one of these parallel file systems: • GPFS • Lustre • PVFS2
[Figure: GPFS/PVFS/Lustre mounted on the I/O nodes, backed by file system servers]