SR8000 Concept


Presentation Transcript


  1. SR8000 Concept Tim Lanfear Hitachi Europe GmbH. t-lanfear@hpcc.hitachi-eu.co.uk

  2. SR8000 Model Range

  3. SR8000 Appearance

  4. Compact Model

  5. Vector vs SMP vs MPP

  6. System Architecture (block diagram): processing nodes (PRN) and an I/O node (ION) connected by the cross-bar inter-node network; CPUs, PCI and main memory within each node; a system control network linking the service processor and console; Ethernet, ATM and HIPPI interfaces; RAID disk.

  7. Programming Models

  8. CPU Architecture (data-path diagram: main memory, memory switch, pre-fetch and pre-load into the cache, load into the floating-point registers, arithmetic unit) • 16 bytes/cycle memory bandwidth • 128 Kbyte L1 cache • Pre-fetch and pre-load instructions • 160 f.p. registers • 2 f.p. pipelines • 4 flops/cycle (two pipelines, each performing a multiply-add)

  9. Slide Window Registers (register-map diagram). Physical registers: sliding part 0 to 127, global part 128 to 159. Logical register numbers map onto the physical file through a window base (shown for Base=2: 0 to 15, 16 to 31, 32 to 125, 126-7; and for Base=4: 0 to 15, 16 to 31, 32 to 123, 124-7). • Some registers are available to all instructions, the rest to extended instructions only (diagram legend) • Fixed registers: 4, 8, 16 or 32 (16 illustrated) • Fixed + sliding = 128

  10. Instruction Set Extensions • Load and store with extended registers • Floating-point arithmetic with extended registers • Slide window control • Pre-fetch and pre-load • Thread start-up and finish • Predicate instructions

  11. SR8000 Programming: Instruction Level Parallelism (Pseudo-vector Processing: PVP)

  12. Pre-fetch and Pre-load (data-path diagram: main memory, memory switch, cache, floating-point registers, arithmetic unit) • Pre-fetch: load a cache line from memory into the cache • Pre-load: load one word from memory directly into a register • 16 streams

  13. Pre-fetch (timing diagram over iterations 1 to 6: iterations 1 and 5 issue a pre-fetch (PF) whose latency is overlapped with other work; every iteration then does an LD that hits in the cache and uses the data) • Pre-fetch 128 bytes (one cache line) to cache • Follow with an LD to move the data into a register

  14. Pre-load (timing diagram over iterations 1 to 6: every iteration issues a pre-load (PL) whose latency is overlapped with earlier iterations' work before the data is used) • Pre-load 8 bytes directly to a register • LD not required
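
A minimal Fortran sketch of the kind of stride-one loop these slides describe; the PF/PL scheduling is done by the compiler, so the comments only indicate where the operations conceptually fall (the subroutine and its arguments are illustrative, not from the slides).

*     Illustrative stride-one loop targeted by PVP.  The compiler
*     issues one pre-fetch (PF) per 128-byte cache line of B and C
*     (16 double words), or pre-loads (PL) single elements straight
*     into registers, several iterations ahead of their use.
      SUBROUTINE TRIAD(A, B, C, S, N)
      INTEGER N, I
      REAL*8 A(N), B(N), C(N), S
      DO I = 1, N
*        B(I) and C(I) were requested several iterations earlier,
*        so their memory latency is hidden behind useful work
         A(I) = B(I) + S * C(I)
      ENDDO
      END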

  15. Software Pipelining (schedule diagrams: without SWPL, iterations I=1, 2, 3 run back to back; with infinite resources their bodies overlap fully; with finite resources successive iterations start one initiation interval apart; a recurrence, a value carried from one iteration to the next, also limits the initiation interval). Resources: registers, f.p. units, instruction issue, memory bandwidth etc.
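
A hand-written two-stage sketch of the software-pipelining idea for a simple scale loop (the names and the schedule are illustrative; on the SR8000 the compiler constructs the schedule and chooses the initiation interval).

*     Two-stage software pipeline for A(I) = 2*B(I): the load for
*     iteration I+1 is started while iteration I computes and stores,
*     so memory latency overlaps arithmetic.
      SUBROUTINE SWPL(A, B, N)
      INTEGER N, I
      REAL*8 A(N), B(N), T, TNEXT
*     prologue: load the first element
      T = B(1)
      DO I = 1, N - 1
*        start the load for the next iteration
         TNEXT = B(I + 1)
*        compute and store the current iteration
         A(I) = 2.0D0 * T
         T = TNEXT
      ENDDO
*     epilogue: finish the last iteration
      A(N) = 2.0D0 * T
      END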

  16. Pseudo-vector Processing, example A(:) = A(:) + N (timing diagram comparing the two styles: the Vector version is a VLD / VADD / VST sequence over long vectors; the Pseudo-Vector version issues a pre-fetch (PF) per cache line, hides its latency, and then streams LD + ST pairs for successive elements).
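
The slide's example written out as Fortran; the added constant is called X here so it does not clash with the array length. Under PVP the compiler turns this loop into the pre-fetched, software-pipelined stream of loads and stores sketched on this slide (this is a sketch, not compiler output).

*     A(:) = A(:) + X as the loop the compiler pseudo-vectorizes:
*     one pre-fetch per 128-byte line of A, then overlapping
*     LD / ADD / ST of successive iterations.
      SUBROUTINE ADDX(A, N, X)
      INTEGER N, I
      REAL*8 A(N), X
      DO I = 1, N
         A(I) = A(I) + X
      ENDDO
      END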

  17. Effect of PVP. Dot product: S = SUM(A(1:N)*B(1:N))

  18. SR8000 Programming: Multi-thread Parallelism (Cooperative Microprocessors in a Single Address Space: COMPAS)

  19. COMPAS (diagram: nodes connected by a multi-dimensional crossbar network; within a node, the instruction processors (IPs) share the main memory). A process runs on one IP; at the COMPAS start instruction, threads are launched automatically on the other IPs, each doing its own pre-fetch, load, arithmetic, store and branch work until the COMPAS end instruction. IP: Instruction Processor. COMPAS: Co-operative Micro-Processors in single Address Space. Automatic parallel processing.

  20. Hardware Support (execution diagram: one IP runs the scalar part while the other IPs wait for start-up; the Start Parallel instruction launches the loop part on all IPs, and the End Parallel instruction brings them back together before the next scalar part). Hardware support: barrier synchronization mechanism in the storage controller. IP: Instruction Processor. SC: Storage Controller. MS: Main Storage.

  21. Loop Parallelisation

      i loop parallelisation:

         DO i = 1, N
            A(i) = B(i) + C(i)
         ENDDO

      becomes

         [fork]
         DO i = start, end
            A(i) = B(i) + C(i)
         ENDDO
         [join]

      j loop parallelisation:

         DO j = 1, M
            W(j) = C(j) + D(j)
            DO i = 1, N
               A(i,j) = B(i,j) + W(j)
            ENDDO
         ENDDO

      becomes

         [fork]
         DO j = start, end
            W(j) = C(j) + D(j)
            DO i = 1, N
               A(i,j) = B(i,j) + W(j)
            ENDDO
         ENDDO
         [join]

  22. Loop Parallelisation

      i loop parallelisation (the j loop carries a recurrence, so only the inner i loop is split):

         DO j = 2, M
            DO i = 1, N
               A(i,j) = A(i,j-1) + A(i,j)
            ENDDO
         ENDDO

      becomes

         [fork]
         DO j = 2, M
            DO i = start, end
               A(i,j) = A(i,j-1) + A(i,j)
            ENDDO
         ENDDO
         [join]

      i loop and j loop parallelisation (two independent loops in one fork/join region):

         DO i = 1, N
            A(i) = B(i) + C(i)
         ENDDO
         DO j = 1, M
            D(j) = E(j) * F(j)
         ENDDO

      becomes

         [fork]
         DO i = start, end
            A(i) = B(i) + C(i)
         ENDDO
         DO j = start, end
            D(j) = E(j) * F(j)
         ENDDO
         [join]

  23. Loop Parallelisation

         *poption parallel          (force parallelisation)
         *poption tlocal(a,b,i)     (thread-local variables)
         DO i = 1, N
            CALL sub(a, b, i)
         ENDDO

      is executed as

         [fork]
         DO i = 1, N
            CALL sub(a, b, i)
         ENDDO
         [join]
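
A self-contained sketch of how the directive might be applied, assuming the *poption parallel syntax exactly as shown on this slide; the surrounding program, array names and problem size are illustrative. Because the directive sits on a comment line, the program also compiles unchanged with other Fortran compilers (it then simply runs serially).

      PROGRAM LOOPDEMO
      INTEGER N
      PARAMETER (N = 100000)
      REAL*8 A(N), B(N), C(N)
      INTEGER I
      DO I = 1, N
         B(I) = DBLE(I)
         C(I) = 1.0D0 / DBLE(I)
      ENDDO
*     request COMPAS parallelisation of the following loop; the
*     iterations are independent, so they can be split across the IPs
*poption parallel
      DO I = 1, N
         A(I) = B(I) + C(I)
      ENDDO
      PRINT *, A(1), A(N)
      END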

  24. Section Parallelisation: execution of independent blocks of code in different threads (sections are always single-threaded).

         *poption parallel_sections
         *poption section
         CALL SUB1
         *poption section
         CALL SUB2
         *poption end_parallel_sections
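
A self-contained sketch using the section directives as they appear on the slide; SUB1 and SUB2 are illustrative stand-ins that write to disjoint variables, so the two sections really are independent.

      PROGRAM SECDEMO
      REAL*8 X, Y
      COMMON /RES/ X, Y
*     the two CALLs are independent, so each section can run in its
*     own thread; each section itself stays single-threaded
*poption parallel_sections
*poption section
      CALL SUB1
*poption section
      CALL SUB2
*poption end_parallel_sections
      PRINT *, X, Y
      END

      SUBROUTINE SUB1
      REAL*8 X, Y
      COMMON /RES/ X, Y
      X = 1.0D0
      END

      SUBROUTINE SUB2
      REAL*8 X, Y
      COMMON /RES/ X, Y
      Y = 2.0D0
      END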

  25. Effect of COMPAS. Dot product: S = SUM(A(1:N)*B(1:N))

  26. SR8000 Programming: Message Passing (MPI)

  27. Remote DMA Transfer (diagram comparing the two transfer paths between nodes over the crossbar network). Normal transfer: the sending program's data is memory-copied into an OS send buffer, passed across the network to an OS receive buffer on the other node, and memory-copied again into the receiving program, with the associated protocol processing, context switches and interrupt handling. Remote DMA transfer: data moves directly between the two programs' memories, with no buffering in the kernel and no OS system call.

  28. Inter-node MPI (diagram: one MPI process on each node, communicating over the cross-bar inter-node network). One MPI process per node; RDMA transfer is possible.

  29. Intra-node MPI (diagram: several MPI processes per node, communicating through shared memory within a node and over the cross-bar inter-node network between nodes). One MPI process per IP; RDMA transfer is not possible.

  30. MPI Ping-pong
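
A minimal MPI ping-pong in Fortran of the kind such a benchmark measures (the message size and repetition count are arbitrary choices, not figures from the slide); run it with two processes, e.g. one per node for the inter-node case or both on one node for the intra-node case.

      PROGRAM PINGPONG
      INCLUDE 'mpif.h'
      INTEGER NWORDS, NREPS
      PARAMETER (NWORDS = 16384, NREPS = 100)
      REAL*8 BUF(NWORDS), T0, T1
      INTEGER IERR, RANK, K, STAT(MPI_STATUS_SIZE)
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      DO K = 1, NWORDS
         BUF(K) = 0.0D0
      ENDDO
      T0 = MPI_WTIME()
      DO K = 1, NREPS
         IF (RANK .EQ. 0) THEN
*           rank 0 sends the message and waits for it to come back
            CALL MPI_SEND(BUF, NWORDS, MPI_DOUBLE_PRECISION, 1, 0,
     &                    MPI_COMM_WORLD, IERR)
            CALL MPI_RECV(BUF, NWORDS, MPI_DOUBLE_PRECISION, 1, 0,
     &                    MPI_COMM_WORLD, STAT, IERR)
         ELSE IF (RANK .EQ. 1) THEN
*           rank 1 echoes the message straight back
            CALL MPI_RECV(BUF, NWORDS, MPI_DOUBLE_PRECISION, 0, 0,
     &                    MPI_COMM_WORLD, STAT, IERR)
            CALL MPI_SEND(BUF, NWORDS, MPI_DOUBLE_PRECISION, 0, 0,
     &                    MPI_COMM_WORLD, IERR)
         ENDIF
      ENDDO
      T1 = MPI_WTIME()
      IF (RANK .EQ. 0) THEN
         PRINT *, 'round-trip time (s):', (T1 - T0) / NREPS
         PRINT *, 'one-way bandwidth (MB/s):',
     &            2.0D0 * 8.0D0 * NWORDS * NREPS / ((T1 - T0) * 1.0D6)
      ENDIF
      CALL MPI_FINALIZE(IERR)
      END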

  31. SR8000 Parallelism (diagram of the three levels across two nodes: message passing (MPI) between nodes, multi-thread parallelism (COMPAS) within a node, and instruction-level parallelism (PVP) within each processor).

  32. SR8000 Programming: Memory Architecture

  33. Memory Hierarchy (diagram): floating-point registers (128+32); L1 cache (128 Kbyte, 4-way) feeding the registers at 32 bytes/cycle; store buffer (16 entries); memory switch, shared with the other IPs, at 16 bytes/cycle; main memory (2 to 16 Gbyte, 512 banks).

  34. Address Translation (diagram): the virtual address is split into a virtual page number and a page offset; the page table in main memory translates the page number, and recently used page-table entries are cached in the TLB.

  35. Large TLB (diagram): with a large page size (16 Mbyte to 128 Mbyte) and a correspondingly large page table, the 256-entry TLB covers the whole address space; for example, 256 entries of 128 Mbyte pages map 32 Gbyte, more than the largest 16 Gbyte node memory.

  36. Memory Address Hashing (diagram: the address bits shown, 32 to 63, are combined by XOR in the storage controller and memory controller data paths, hashing addresses across the memory banks).
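
An illustrative sketch of XOR-based address hashing, not the SR8000's actual hash function: the bank index is the XOR of two bit fields of the word address, so power-of-two strides no longer land on a single bank.

*     Illustrative XOR bank hash (assumed 512 banks, word addresses);
*     without hashing, a stride of 512 words would hit bank 0 every
*     time, with hashing the accesses spread across the banks.
      INTEGER FUNCTION BANK(ADDR)
      INTEGER ADDR
      BANK = IEOR(IBITS(ADDR, 0, 9), IBITS(ADDR, 9, 9))
      END

      PROGRAM HASHDEMO
      INTEGER BANK, I
      DO I = 0, 4
         PRINT *, 'word address', I * 512, ' -> bank', BANK(I * 512)
      ENDDO
      END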

  37. Key Features of SR8000 • High-performance RISC CPU with PVP • High-performance node with COMPAS • High sustained memory bandwidth • High scalability with a fast network • Low energy and space requirements

  38. SR8000 Programming: Performance

  39. Top 500 – June 2000

  40. Linpack Performance (GFlops vs number of nodes): 10.88 GFlops on 1 node, 20.50 on 2 nodes, 40.76 on 4 nodes, 80.25 on 8 nodes, 159.51 on 16 nodes, 313.32 on 32 nodes, 577.49 on 60 nodes, 605.30 on 64 nodes, and 917.15 on 100 nodes (about 84% parallel efficiency relative to the single-node figure).

  41. NAS Parallel FT (GFlops vs number of nodes for Classes A, B and C). Chart values: 5.39 and 5.14 GFlops on 1 node; 8.37, 8.31 and 7.92 on 2 nodes; 15.10, 14.84 and 14.01 on 4 nodes; 28.78, 27.95 and 26.16 on 8 nodes.

  42. NAS Parallel CG
