Parallel Scalable Operating Systems
Parallel Scalable Operating Systems Presented by Dr. Florin Isaila, Universidad Carlos III de Madrid, Visiting Scholar at Argonne National Lab
Contents • Preliminaries • Top500 • Scalability • Blue Gene • History • BG/L, BG/C, BG/P, BG/Q • Scalable OS for BG/L • Scalable file systems • BG/P at Argonne National Lab • Conclusions
Generalities • Published twice a year since 1993, in June and November • Ranking of the most powerful computing systems in the world • Ranking criterion: performance on the LINPACK benchmark • Driven by Jack Dongarra • Web site: www.top500.org
HPL: High-Performance Linpack • Solves a dense system of linear equations • Variant of LU factorization on matrices of size N • Measures a computer's floating-point rate of execution • Computation done in 64-bit floating-point arithmetic • Rpeak: theoretical system performance • Upper bound on the real performance (in MFLOPS) • Ex: Intel Itanium 2 at 1.5 GHz, 4 FP ops/cycle -> 6 GFLOPS • Nmax: obtained by varying N and choosing the size giving maximum performance • Rmax: maximum real performance, achieved for Nmax • N1/2: problem size needed to achieve ½ of Rmax
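As a quick illustration of how the slide's peak-performance figure is derived, here is a minimal sketch (my own helper, not Top500 code) that multiplies clock rate by floating-point operations per cycle:

```python
# Illustrative sketch (not IBM/Top500 code): Rpeak is just
# clock rate x FP ops per cycle x number of cores.

def rpeak_gflops(clock_ghz: float, flops_per_cycle: int, cores: int = 1) -> float:
    """Theoretical peak performance in GFLOPS."""
    return clock_ghz * flops_per_cycle * cores

# Slide example: Intel Itanium 2 at 1.5 GHz, 4 FP ops/cycle -> 6 GFLOPS
print(rpeak_gflops(1.5, 4))  # 6.0
```

Rmax, by contrast, can only be measured by actually running HPL; it is always below this bound.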
Amdahl's law • Suppose a fraction f of your application is not parallelizable • The remaining fraction 1-f is parallelizable on p processors: Speedup(p) = T1/Tp ≤ T1/(f·T1 + (1-f)·T1/p) = 1/(f + (1-f)/p) ≤ 1/f
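The formula above can be checked numerically; this short sketch shows how the serial fraction f caps speedup at 1/f no matter how many processors are added:

```python
# Amdahl's law as stated on the slide: the serial fraction f caps
# achievable speedup at 1/f regardless of the processor count p.

def speedup(f: float, p: int) -> float:
    """Upper bound on speedup: 1 / (f + (1 - f) / p)."""
    return 1.0 / (f + (1.0 - f) / p)

# With 5% serial work, even 1024 processors stay below the 1/f = 20x ceiling.
for p in (16, 256, 1024):
    print(p, round(speedup(0.05, p), 2))
```

This is why OS noise matters at scale: any per-node serial overhead effectively grows f.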
Load balance • Speedup ≤ Sequential Work / Max Work on any Processor • Work: data access, computation • Not just equal work: processors must also be busy at the same time • Ex: Speedup ≤ 1000/400 = 2.5
Communication and synchronization • Speedup ≤ Sequential Work / Max (Work + Synch Wait Time + Comm Cost) • Communication is expensive! • Measure: communication-to-computation ratio • Inherent communication: determined by the assignment of tasks to processes • Actual communication may be larger (artifactual) • One principle: assign tasks that access the same data to the same process • [Figure: timeline of three processes showing work, communication, synchronization points, and synchronization wait time]
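The two speedup bounds above can be combined in one sketch (the per-process numbers are made up for illustration):

```python
# Sketch of the slides' bound: speedup <= sequential work divided by the
# maximum per-processor cost (work + synchronization wait + comm cost).

def speedup_bound(seq_work: float, per_proc_costs: list[float]) -> float:
    """Upper bound on speedup given the total cost on each processor."""
    return seq_work / max(per_proc_costs)

# Load-balance example from the previous slide: 1000 units of sequential
# work, the busiest processor does 400 -> speedup <= 2.5.
print(speedup_bound(1000, [400, 300, 300]))  # 2.5

# Adding synch wait (50) and comm cost (50) to the busiest process
# only tightens the bound further.
print(speedup_bound(1000, [400 + 50 + 50, 300, 300]))  # 2.0
```

The bound makes the slide's principle concrete: co-locating tasks that share data reduces the comm-cost term on the busiest processor.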
Blue Gene partners • IBM • “Blue”: The corporate color of IBM • “Gene”: The intended use of the Blue Gene clusters – Computational biology, specifically, protein folding • Lawrence Livermore National Lab • Department of Energy • Academia
Family • BG/L • BG/C • BG/P • BG/Q
Blue Gene/L packaging hierarchy • Chip: 2 processors, 2.8/5.6 GF/s, 4 MB • Compute Card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB • Node Card: 32 chips (4x4x2), 16 compute cards, 0-2 I/O cards, 90/180 GF/s, 16 GB • Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB • System: 64 racks (64x32x32), 180/360 TF/s, 32 TB
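The hierarchy multiplies up consistently; this sanity check (my own arithmetic, not an IBM source) reproduces the slide's rounded figures, where each pair is coprocessor mode / both-CPUs-computing mode:

```python
# Sanity check of the packaging hierarchy, all figures in GF/s.
# Each pair is (coprocessor mode, both CPUs computing).

chip      = (2.8, 5.6)                        # 1 chip: 2 PowerPC 440 cores
card      = tuple(2 * x for x in chip)        # compute card: 2 chips
node_card = tuple(16 * x for x in card)       # node card: 16 compute cards
rack      = tuple(32 * x for x in node_card)  # rack: 32 node cards
system    = tuple(64 * x for x in rack)       # system: 64 racks

print(node_card)                         # ~ (89.6, 179.2), the slide's 90/180 GF/s
print(rack[0] / 1e3, rack[1] / 1e3)      # ~ 2.87 / 5.73 TF/s, the slide's 2.8/5.6
print(system[0] / 1e3, system[1] / 1e3)  # ~ 183.5 / 367 TF/s, the slide's 180/360
```

The exact system totals (183.5 / 367 TFLOPS) match the technical specifications on the next slide.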
Technical specifications • 64 cabinets containing 65,536 high-performance compute nodes (chips) • 1,024 I/O nodes • 32-bit PowerPC processors • 5 networks • Main memory size: 33 terabytes • Maximum performance of 183.5 TFLOPS when using one processor per node for computation and the other for communication, and 367 TFLOPS when using both for computation
Blue Gene / L • Networks: • 3D Torus • Collective Network • Global Barrier/Interrupt • Gigabit Ethernet (I/O & Connectivity) • Control (system boot, debug, monitoring)
Networks • Three-dimensional torus: compute nodes • Global tree: collective communication, I/O • Ethernet • Control network
Three-dimensional (3D) torus network in which the nodes (red balls) are connected to their six nearest-neighbor nodes in a 3D mesh.
Blue Gene / L • Processor: PowerPC 440 at 700 MHz • Low power allows dense packaging • External memory: 512 MB / 1 GB SDRAM per node • Slow embedded core at a clock speed of 700 MHz • 32 KB L1 cache • L2 is a small prefetch buffer • 4 MB embedded DRAM L3 cache
BG/L compute ASIC • Non-cache coherent L1 • Pre-fetch buffer L2 • Shared 4MB DRAM (L3) • Interface to external DRAM • 5 network interfaces • Torus, collective, global barrier, Ethernet, control
Blue Gene / L • Compute Nodes: • Dual processor, 1024 per Rack • I/O Nodes: • Dual processor, 16-128 per Rack
Blue Gene / L • Compute nodes: • Proprietary kernel (tailored to the processor design) • I/O nodes: • Embedded Linux • Front-end and service nodes: • SUSE SLES 9 Linux (familiar to users)
Blue Gene / L • Performance: • Peak performance per rack: 5.73 TFLOPS • Linpack performance per rack: 4.71 TFLOPS
Blue Gene / C • a.k.a. Cyclops64 • Massively parallel (first supercomputer on a chip) • Processors connected by a 96-port, 7-stage non-internally-blocking crossbar switch • Theoretical peak performance (chip): 80 GFLOPS
Blue Gene / C • Cellular architecture • 64-bit Cyclops64 chip: • 500 MHz • 80 processors (each with 2 thread units and an FP unit) • Software • Cyclops64 exposes much of the underlying hardware to the programmer, allowing the programmer to write very high-performance, finely tuned software.
Blue Gene / C • [Picture of BG/C] • Performance: • Board: 320 GFLOPS • Rack: 15.76 TFLOPS • System: 1.1 PFLOPS
Blue Gene / P • Similar architecture to BG/L, but: • Cache-coherent L1 cache • 4 cores per node • 10 Gbit Ethernet external I/O infrastructure • Scales up to 3 PFLOPS • More energy efficient • 167 TF/s by 2007, 1 PF/s by 2008
Blue Gene / Q • Continuation of Blue Gene/L and /P • Targeting 10 PF/s by 2010/2011 • Higher frequency at similar performance/watt • Similar number of nodes • Many more cores • More generally useful • Aggressive compiler • New network: scalable and cheap
Motivation for a scalable OS • Blue Gene/L is currently the world's fastest and most scalable supercomputer • Several system components contribute to that scalability • The operating systems for the different nodes of Blue Gene/L are among the components responsible for that scalability • The OS overhead on one node affects the scalability of the whole system • Goal: design a scalable solution for the OS
High level view of BG/L • Principle: the structure of the software should reflect the structure of the hardware.
BG/L Partitioning • Space-sharing • Divided along natural boundaries into partitions • Each partition can run only one job • Each node can be in one of these modes: • Coprocessor: one processor assists the other • Virtual node: two separate processors, each with its own memory space
OS • Compute nodes: dedicated OS • I/O nodes: dedicated OS • Service nodes: conventional off-the-shelf OS • Front-end nodes: program compilation, debugging, job submission • File servers: store data, not specific to BG/L
BG/L OS solution • Components: I/O nodes, service nodes, CNK • The compute and I/O nodes are organized into logical entities called processing sets, or psets: 1 I/O node + a collection of CNs • 8, 16, 64, or 128 CNs • Logical concept • Should reflect physical proximity => fast communication • Job: collection of N compute processes (on CNs) • Own private address space • Message passing • MPI: ranks 0 to N-1
BG/L OS solution: CNK • Compute nodes run only compute processes, and all the compute nodes of a particular partition can execute in two different modes: • Coprocessor mode • Virtual node mode • Compute Node Kernel (CNK): simple OS • Creates an address space • Loads code and initializes data • Transfers processor control to the loaded executable
CNK • Consumes 1 MB of memory • Creates either: • One address space of 511/1023 MB • Two address spaces of 255/511 MB • No virtual memory, no paging • The entire mapping fits into the TLB of the PowerPC • Load in push mode: one CN reads the executable from the file system and sends it to all the others • One image is loaded, and then the kernel stays out of the way!
CNK • No OS scheduling (one thread) • No memory management (no TLB overhead) • No local file services • User-level execution until: • The process requests a system call • Hardware interrupts: timer (requested by the application), abnormal events • Syscalls • Simple: handled locally (getting the time, setting an alarm) • Complex: forwarded to the I/O nodes • Unsupported (fork/mmap): error
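The three-way syscall policy above can be sketched as a simple dispatch table. This is a hypothetical illustration (names and sets invented, not the real CNK API or its actual syscall lists):

```python
# Hypothetical sketch of CNK's syscall dispatch policy: trivial calls are
# handled on the compute node, I/O-style calls are shipped to the pset's
# I/O node (CIOD), and unsupported calls fail. Syscall sets are examples
# only, not CNK's real classification.

LOCAL = {"gettimeofday", "alarm"}
FORWARDED = {"open", "read", "write", "close"}
UNSUPPORTED = {"fork", "mmap"}

def handle_syscall(name: str) -> str:
    if name in LOCAL:
        return "handled on compute node"
    if name in FORWARDED:
        return "forwarded to I/O node (CIOD)"
    if name in UNSUPPORTED:
        return "error: not supported by CNK"
    return "error: unknown syscall"

print(handle_syscall("gettimeofday"))  # handled on compute node
print(handle_syscall("write"))         # forwarded to I/O node (CIOD)
print(handle_syscall("fork"))          # error: not supported by CNK
```

Keeping this dispatch trivial is what lets CNK avoid scheduling, paging, and the OS noise they introduce.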
Benefits of the simple solution • Robustness: simple design, implementation, test, debugging • Scalability: no interference among compute nodes • Low system noise • Performance measurements
I/O node • Two roles in Blue Gene/L: • Acts as an effective master of its corresponding pset • Serves requests from the compute nodes in its pset • Mainly I/O operations on locally mounted file systems • Only one processor is used, due to the lack of memory coherency • Executes an embedded version of the Linux operating system: • Does not use any swap space • Has an in-memory root file system • Uses little memory • Lacks the majority of Linux daemons
I/O node • Complete TCP/IP stack • Supported file systems: NFS, GPFS, Lustre, PVFS • Main process: Control and I/O Daemon (CIOD) • Launching a job: • The job manager sends the request to the service node • The service node contacts the CIOD • The CIOD sends the executable to all processes in the pset
Service nodes • Run the Blue Gene/L control system • Tight integration with CNs and I/O nodes • CNs and I/O nodes: stateless, no persistent memory • Responsible for operating and monitoring the CNs and I/O nodes • Create system partitions and isolate them • Compute network routing for the torus, collective, and global interrupt networks • Load OS code onto the CNs and I/O nodes
Problems • Not fully POSIX compliant • Many applications need • Process/thread creation • Full server sockets • Shared memory segments • Memory mapped files
File systems for BG systems • Need for scalable file systems: NFS is not a solution • Most supercomputers and clusters in the Top500 use one of these parallel file systems: • GPFS • Lustre • PVFS2
[Figure: GPFS/PVFS/Lustre mounted on the I/O nodes, backed by file system servers]