Computer Architecture II
Contents • Preliminaries • Top500 • Scalability • Blue Gene • History • BG/L, BG/C, BG/P, BG/Q • Scalable OS for BG/L • Scalable file systems • BG/P at Argonne National Lab • Conclusions
Top500: Generalities • Published since 1993, twice a year: June and November • Ranking of the most powerful computing systems in the world • Ranking criterion: performance on the LINPACK benchmark • Driven by Jack Dongarra • Website: www.top500.org
HPL: High-Performance Linpack • Solves a dense system of linear equations via a variant of LU factorization on matrices of size N • Measures a computer's floating-point execution rate • Computation is done in 64-bit floating-point arithmetic • Rpeak: theoretical peak system performance, an upper bound on real performance (in MFLOPS) • Ex: Intel Itanium 2 at 1.5 GHz, 4 FP operations per cycle -> 6 GFLOPS • Nmax: obtained by varying N and choosing the problem size that maximizes performance • Rmax: maximum real performance, achieved at Nmax • N1/2: problem size needed to achieve ½ of Rmax
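To make the Rpeak arithmetic concrete, here is a minimal sketch (plain C, not part of HPL) that computes theoretical peak performance from clock rate, floating-point operations per cycle, and core count; the Itanium 2 figures are the ones quoted above.

```c
/* Minimal sketch: Rpeak = clock rate x FP operations per cycle x cores.
 * The Itanium 2 numbers are the ones quoted on the slide. */
#include <stdio.h>

static double rpeak_gflops(double clock_ghz, int flops_per_cycle, int cores)
{
    return clock_ghz * flops_per_cycle * cores;
}

int main(void)
{
    /* Intel Itanium 2 at 1.5 GHz, 4 FP operations per cycle, one core */
    printf("Rpeak = %.1f GFLOPS\n", rpeak_gflops(1.5, 4, 1)); /* 6.0 */
    return 0;
}
```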
Amdahl's law • Suppose a fraction f of your application is not parallelizable • The remaining fraction 1−f is parallelizable on p processors • Speedup(p) = T1/Tp ≤ T1 / (f·T1 + (1−f)·T1/p) = 1 / (f + (1−f)/p) ≤ 1/f
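A minimal sketch of the bound above; the serial fraction f = 0.05 and the processor counts are illustrative values, not measurements.

```c
/* Amdahl's law: with serial fraction f, Speedup(p) <= 1/(f + (1-f)/p) <= 1/f.
 * f and the processor counts below are illustrative. */
#include <stdio.h>

static double amdahl_speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    double f = 0.05;                   /* 5% of the work is not parallelizable */
    int procs[] = { 16, 1024, 65536 };
    for (int i = 0; i < 3; i++)
        printf("p = %5d  speedup <= %6.2f  (limit 1/f = %.0f)\n",
               procs[i], amdahl_speedup(f, procs[i]), 1.0 / f);
    return 0;
}
```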
Load Balance • Speedup ≤ Sequential Work / Max Work on any Processor • Work: data access, computation • Not just equal work: processors must also be busy at the same time • Ex: sequential work = 1000, max work on any processor = 400 ⇒ Speedup ≤ 1000/400 = 2.5
Communication and synchronization • Speedup ≤ Sequential Work / Max (Work + Synch Wait Time + Comm Cost) • Communication is expensive! • Measure: communication-to-computation ratio • Inherent communication: determined by the assignment of tasks to processes • Actual communication may be larger (artifactual) • One principle: assign tasks that access the same data to the same process • (Figure: timeline of three processes showing work, communication, synchronization points, and synchronization wait time)
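The two bounds above (load balance, and communication plus synchronization) can be evaluated for a concrete case; the per-process work, wait, and communication numbers below are invented for illustration and reproduce the 1000/400 = 2.5 example.

```c
/* Evaluates the two bounds from the slides:
 *   load balance:  speedup <= sequential work / max work on any processor
 *   comm + synch:  speedup <= sequential work / max(work + synch wait + comm)
 * The per-process numbers are made up; total work = 1000, max work = 400. */
#include <stdio.h>

#define NPROCS 3

int main(void)
{
    double work[NPROCS]  = { 400.0, 350.0, 250.0 };
    double synch[NPROCS] = {  20.0,  60.0, 120.0 };
    double comm[NPROCS]  = {  30.0,  30.0,  30.0 };

    double seq = 0.0, max_work = 0.0, max_total = 0.0;
    for (int i = 0; i < NPROCS; i++) {
        seq += work[i];
        if (work[i] > max_work)
            max_work = work[i];
        double total = work[i] + synch[i] + comm[i];
        if (total > max_total)
            max_total = total;
    }
    printf("load-balance bound: %.2f\n", seq / max_work);  /* 1000/400 = 2.50 */
    printf("comm+synch bound:   %.2f\n", seq / max_total);
    return 0;
}
```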
Blue Gene partners • IBM • “Blue”: The corporate color of IBM • “Gene”: The intended use of the Blue Gene clusters – Computational biology, specifically, protein folding • Lawrence Livermore National Lab • Department of Energy • Academia
Family • BG/L • BG/C • BG/P • BG/Q
Blue Gene/L packaging hierarchy • Chip: 2 processors, 2.8/5.6 GF/s, 4 MB • Compute Card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB • Node Card: 16 compute cards, 0-2 I/O cards (32 chips, 4x4x2), 90/180 GF/s, 16 GB • Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB • System: 64 racks (64x32x32), 180/360 TF/s, 32 TB
Technical specifications • 64 cabinets containing 65,536 high-performance compute nodes (chips) • 1,024 I/O nodes • 32-bit PowerPC processors • 5 networks • 33 terabytes of main memory • Maximum performance of 183.5 TFLOPS when using one processor per node for computation and the other for communication, and 367 TFLOPS when using both for computation
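As a sanity check of the quoted peak figures, a back-of-the-envelope calculation assuming 2.8 GFLOPS per core (from the packaging hierarchy above) and 65,536 dual-core nodes:

```c
/* Back-of-the-envelope check: 65,536 nodes x 2.8 GFLOPS per core,
 * with one or both cores per node doing computation. */
#include <stdio.h>

int main(void)
{
    long   nodes = 65536;
    double gflops_per_core = 2.8;

    printf("one core per node:   %.1f TFLOPS\n",
           nodes * 1 * gflops_per_core / 1000.0);   /* ~183.5 */
    printf("both cores per node: %.1f TFLOPS\n",
           nodes * 2 * gflops_per_core / 1000.0);   /* ~367.0 */
    return 0;
}
```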
Blue Gene / L • Networks: • 3D Torus • Collective Network • Global Barrier/Interrupt • Gigabit Ethernet (I/O & Connectivity) • Control (system boot, debug, monitoring)
Networks • Three-dimensional torus: compute nodes • Global tree: collective communication and I/O • Ethernet • Control network
Three-dimensional (3D) torus network in which the nodes (red balls) are connected to their six nearest-neighbor nodes in a 3D mesh.
Blue Gene / L • Processor: PowerPC 440 at 700 MHz • Slow embedded core; low power allows dense packaging • External memory: 512 MB or 1 GB SDRAM per node • 32 KB L1 cache • L2 is a small prefetch buffer • 4 MB embedded DRAM L3 cache
BG/L compute ASIC • Non-cache-coherent L1 • Prefetch buffer (L2) • Shared 4 MB embedded DRAM (L3) • Interface to external DRAM • 5 network interfaces: torus, collective, global barrier, Ethernet, control
Blue Gene / L • Compute Nodes: • Dual processor, 1024 per Rack • I/O Nodes: • Dual processor, 16-128 per Rack
Blue Gene / L • Compute Nodes: • Proprietary kernel (tailored to the processor design) • I/O Nodes: • Embedded Linux • Front-end and service nodes: • SUSE SLES 9 Linux (familiar to users)
Blue Gene / L • Performance: • Peak performance per rack: 5.73 TFLOPS • Linpack performance per rack: 4.71 TFLOPS
Blue Gene / C • a.k.a. Cyclops64 • Massively parallel (the first supercomputer on a chip) • Processors connected by a 96-port, 7-stage non-internally-blocking crossbar switch • Theoretical peak performance (chip): 80 GFLOPS
Blue Gene / C • Cellular architecture • 64-bit Cyclops64 chip: • 500 MHz • 80 processors (each with 2 thread units and a FP unit) • Software • Cyclops64 exposes much of the underlying hardware to the programmer, allowing the programmer to write very high-performance, finely tuned software.
Blue Gene / C • (Figure: picture of a BG/C board) • Performance: • Board: 320 GFLOPS • Rack: 15.76 TFLOPS • System: 1.1 PFLOPS
Blue Gene / P • Similar architecture to BG/L, but: • Cache-coherent L1 cache • 4 cores per node • 10 Gbit Ethernet external I/O infrastructure • Scales up to 3 PFLOPS • More energy efficient • 167 TF/s by 2007, 1 PF/s by 2008
Blue Gene / Q • Continuation of Blue Gene/L and /P • Targeting 10 PF/s by 2010/2011 • Higher frequency at similar performance per watt • Similar number of nodes • Many more cores • More generally useful • Aggressive compiler • New network: scalable and cheap
Motivation for a scalable OS • Blue Gene/L is currently the world's fastest and most scalable supercomputer • Several system components contribute to that scalability • The operating systems for the different node types of Blue Gene/L are among the components responsible for it • The OS overhead on one node affects the scalability of the whole system • Goal: design a scalable OS solution
High level view of BG/L • Principle: the structure of the software should reflect the structure of the hardware.
BG/L Partitioning • Space-sharing • Divided along natural boundaries into partitions • Each partition can run only one job • Each node can be in one of these modes: • Coprocessor: one processor assists the other • Virtual node: two separate processors, each with its own memory space
OS • Compute nodes: dedicated OS • I/O nodes: dedicated OS • Service nodes: conventional off-the-shelf OS • Front-end nodes: program compilation, debugging, job submission • File servers: store data, not specific to BG/L
BG/L OS solution • Components: I/O nodes, service nodes, CNK • The compute and I/O nodes are organized into logical entities called processing sets, or psets: 1 I/O node + a collection of CNs • 8, 16, 64, or 128 CNs • A pset is a logical concept, but it should reflect physical proximity => fast communication • Job: collection of N compute processes (on CNs) • Each has its own private address space • Message passing • MPI: ranks 0 .. N−1
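A minimal MPI example of this process model: each compute process owns a private address space and is addressed by its rank 0..N−1. This is generic MPI, not BG/L-specific code.

```c
/* Generic MPI process model: N compute processes, ranks 0..N-1, each with
 * its own private address space. Compile with mpicc, run with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process: 0..N-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* N = number of compute processes */
    printf("compute process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```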
BG/L OS solution: CNK • Compute nodes run only compute processes, and all the compute nodes of a particular partition execute in one of two modes: • Coprocessor mode • Virtual node mode • Compute Node Kernel (CNK): a simple OS that • Creates an address space • Loads code and initializes data • Transfers processor control to the loaded executable
CNK • Consumes 1 MB of memory • Creates either • one address space of 511/1023 MB, or • two address spaces of 255/511 MB • No virtual memory, no paging • The entire mapping fits into the TLB of the PowerPC • Load in push mode: one CN reads the executable from the file system and sends it to all the others • One image is loaded, and the kernel then stays out of the way
CNK • No OS scheduling (one thread) • No memory management (no TLB overhead) • No local file services • User-level execution until: • the process requests a system call, or • a hardware interrupt occurs: timer (requested by the application), abnormal events • System calls: • Simple: handled locally (get the time, set an alarm) • Complex: forwarded to the I/O node • Unsupported (fork/mmap): return an error
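A hypothetical sketch of this three-way system-call policy; it is not CNK source code, and the names (classify, forward_to_ciod, the syscall numbers) are invented for illustration.

```c
/* Hypothetical sketch of the CNK system-call policy (not CNK source code):
 * simple calls are handled locally, I/O calls are shipped to the CIOD on the
 * pset's I/O node, anything else is rejected. All names are invented. */
#include <errno.h>
#include <stdio.h>
#include <time.h>

enum syscall_class { SC_LOCAL, SC_FORWARD, SC_UNSUPPORTED };
enum { NR_GET_TIME = 1, NR_WRITE = 2, NR_FORK = 3 };

static enum syscall_class classify(int nr)
{
    switch (nr) {
    case NR_GET_TIME: return SC_LOCAL;       /* cheap, answered on the CN   */
    case NR_WRITE:    return SC_FORWARD;     /* needs the I/O node's FS     */
    default:          return SC_UNSUPPORTED; /* fork, mmap, ...             */
    }
}

static long forward_to_ciod(int nr, long arg)
{
    /* Stand-in for shipping the request to the CIOD and waiting for a reply. */
    printf("forwarding syscall %d (arg=%ld) to the I/O node\n", nr, arg);
    return 0;
}

static long handle_syscall(int nr, long arg)
{
    switch (classify(nr)) {
    case SC_LOCAL:   return (long)time(NULL);   /* e.g. get the time         */
    case SC_FORWARD: return forward_to_ciod(nr, arg);
    default:         return -ENOSYS;            /* unsupported: report error */
    }
}

int main(void)
{
    printf("get_time -> %ld\n", handle_syscall(NR_GET_TIME, 0));
    printf("write    -> %ld\n", handle_syscall(NR_WRITE, 42));
    printf("fork     -> %ld\n", handle_syscall(NR_FORK, 0));
    return 0;
}
```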
Benefits of the simple solution • Robustness: simple design, implementation, test, debugging • Scalability: no interference among compute nodes • Low system noise • Performance measurements
I/O node • Two roles in Blue Gene/L: • Act as an effective master of its corresponding pset • Service requests from the compute nodes in its pset, mainly I/O operations on locally mounted file systems • Only one processor is used, due to the lack of memory coherency • Executes an embedded version of the Linux operating system: • does not use any swap space • has an in-memory root file system • uses little memory • lacks the majority of Linux daemons
I/O node • Complete TCP/IP stack • Supported file systems: NFS, GPFS, Lustre, PVFS • Main process: Control and I/O Daemon (CIOD) • Launching a job: • the job manager sends the request to the service node • the service node contacts the CIOD • the CIOD sends the executable to all processes in its pset
Service nodes • Run the Blue Gene/L control system • Tight integration with CNs and I/O nodes • CNs and I/O nodes are stateless, with no persistent memory • Responsible for operating and monitoring the CNs and I/O nodes • Create system partitions and isolate them • Compute network routing for the torus, collective, and global interrupt networks • Load OS code onto the CNs and I/O nodes
Problems • Not fully POSIX compliant • Many applications need • Process/thread creation • Full server sockets • Shared memory segments • Memory mapped files
File systems for BG systems • Need for scalable file systems: NFS is not a solution • Most supercomputers and clusters in the Top500 use one of these parallel file systems: • GPFS • Lustre • PVFS2
(Figure: GPFS/PVFS/Lustre mounted on the I/O nodes, connected to external file system servers.)
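To illustrate why a parallel file system matters, here is a minimal MPI-IO sketch in which every rank writes its own disjoint block of one shared file; on a Blue Gene system such requests would be forwarded through the I/O nodes to GPFS, Lustre, or PVFS. The file name "output.dat" is a placeholder.

```c
/* Minimal MPI-IO sketch: each rank writes a fixed-size, disjoint block of a
 * shared file, so no locking is needed. "output.dat" is a placeholder path. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char line[32];
    snprintf(line, sizeof line, "hello from rank %6d\n", rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* Each rank writes at a disjoint offset into the shared file. */
    MPI_Offset offset = (MPI_Offset)rank * (MPI_Offset)strlen(line);
    MPI_File_write_at(fh, offset, line, (int)strlen(line), MPI_CHAR,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```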