SR8000 Concept


Presentation Transcript


  1. SR8000 Concept Tim Lanfear Hitachi Europe GmbH. t-lanfear@hpcc.hitachi-eu.co.uk

  2. SR8000 Model Range

  3. SR8000 Appearance

  4. Compact Model

  5. Vector vs SMP vs MPP

  6. System Architecture (block diagram): processing nodes (PRN) and an I/O node (ION) connected by the cross-bar inter-node network; CPUs, PCI and main memory within each node; a system control network linking the service processor and console; Ethernet, ATM and HIPPI interfaces; RAID disk.

  7. Programming Models

  8. CPU Architecture (data-path diagram: main memory, memory switch, pre-fetch and pre-load into the cache, load into the floating-point registers, arithmetic unit) • 16 bytes/cycle memory bandwidth • 128 Kbyte L1 cache • Pre-fetch and pre-load instructions • 160 f.p. registers • 2 f.p. pipelines • 4 flops/cycle (two pipelines, each performing a multiply-add)

  9. Slide Window Registers (register-map diagram). Physical registers: sliding part 0 to 127, global part 128 to 159. Logical register numbers map onto the physical file through a window base (shown for Base=2: 0 to 15, 16 to 31, 32 to 125, 126-7; and for Base=4: 0 to 15, 16 to 31, 32 to 123, 124-7). • Some registers are available to all instructions, the rest to extended instructions only (diagram legend) • Fixed registers: 4, 8, 16 or 32 (16 illustrated) • Fixed + sliding = 128

  10. Instruction Set Extensions • Load and store with extended registers • Floating-point arithmetic with extended registers • Slide window control • Pre-fetch and pre-load • Thread start-up and finish • Predicate instructions

  11. SR8000 Programming: Instruction Level Parallelism (Pseudo-vector Processing: PVP)

  12. Pre-fetch and Pre-load (data-path diagram: main memory, memory switch, cache, floating-point registers, arithmetic unit) • Pre-fetch: load a cache line from memory into the cache • Pre-load: load one word from memory directly into a register • 16 streams

  13. Pre-fetch (timing diagram over iterations 1 to 6: iterations 1 and 5 issue a pre-fetch (PF) whose latency is overlapped with other work; every iteration then does an LD that hits in the cache and uses the data) • Pre-fetch 128 bytes (one cache line) to cache • Follow with an LD to move the data into a register

  14. Pre-load (timing diagram over iterations 1 to 6: every iteration issues a pre-load (PL) whose latency is overlapped with earlier iterations' work before the data is used) • Pre-load 8 bytes directly to a register • LD not required
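
A minimal Fortran sketch of the kind of stride-one loop these slides describe; the PF/PL scheduling is done by the compiler, so the comments only indicate where the operations conceptually fall (the subroutine and its arguments are illustrative, not from the slides).

*     Illustrative stride-one loop targeted by PVP.  The compiler
*     issues one pre-fetch (PF) per 128-byte cache line of B and C
*     (16 double words), or pre-loads (PL) single elements straight
*     into registers, several iterations ahead of their use.
      SUBROUTINE TRIAD(A, B, C, S, N)
      INTEGER N, I
      REAL*8 A(N), B(N), C(N), S
      DO I = 1, N
*        B(I) and C(I) were requested several iterations earlier,
*        so their memory latency is hidden behind useful work
         A(I) = B(I) + S * C(I)
      ENDDO
      END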

  15. Software Pipelining (schedule diagrams: without SWPL, iterations I=1, 2, 3 run back to back; with infinite resources their bodies overlap fully; with finite resources successive iterations start one initiation interval apart; a recurrence, a value carried from one iteration to the next, also limits the initiation interval). Resources: registers, f.p. units, instruction issue, memory bandwidth etc.
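
A hand-written two-stage sketch of the software-pipelining idea for a simple scale loop (the names and the schedule are illustrative; on the SR8000 the compiler constructs the schedule and chooses the initiation interval).

*     Two-stage software pipeline for A(I) = 2*B(I): the load for
*     iteration I+1 is started while iteration I computes and stores,
*     so memory latency overlaps arithmetic.
      SUBROUTINE SWPL(A, B, N)
      INTEGER N, I
      REAL*8 A(N), B(N), T, TNEXT
*     prologue: load the first element
      T = B(1)
      DO I = 1, N - 1
*        start the load for the next iteration
         TNEXT = B(I + 1)
*        compute and store the current iteration
         A(I) = 2.0D0 * T
         T = TNEXT
      ENDDO
*     epilogue: finish the last iteration
      A(N) = 2.0D0 * T
      END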

  16. Pseudo-vector Processing, example A(:) = A(:) + N (timing diagram comparing the two styles: the Vector version is a VLD / VADD / VST sequence over long vectors; the Pseudo-Vector version issues a pre-fetch (PF) per cache line, hides its latency, and then streams LD + ST pairs for successive elements).
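
The slide's example written out as Fortran; the added constant is called X here so it does not clash with the array length. Under PVP the compiler turns this loop into the pre-fetched, software-pipelined stream of loads and stores sketched on this slide (this is a sketch, not compiler output).

*     A(:) = A(:) + X as the loop the compiler pseudo-vectorizes:
*     one pre-fetch per 128-byte line of A, then overlapping
*     LD / ADD / ST of successive iterations.
      SUBROUTINE ADDX(A, N, X)
      INTEGER N, I
      REAL*8 A(N), X
      DO I = 1, N
         A(I) = A(I) + X
      ENDDO
      END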

  17. Effect of PVP. Dot product: S = SUM(A(1:N)*B(1:N))

  18. SR8000 Programming: Multi-thread Parallelism (Cooperative Microprocessors in a Single Address Space: COMPAS)

  19. COMPAS (diagram: nodes connected by a multi-dimensional crossbar network; within a node, the instruction processors (IPs) share the main memory). A process runs on one IP; at the COMPAS start instruction, threads are launched automatically on the other IPs, each doing its own pre-fetch, load, arithmetic, store and branch work until the COMPAS end instruction. IP: Instruction Processor. COMPAS: Co-operative Micro-Processors in single Address Space. Automatic parallel processing.

  20. Hardware Support (execution diagram: one IP runs the scalar part while the other IPs wait for start-up; the Start Parallel instruction launches the loop part on all IPs, and the End Parallel instruction brings them back together before the next scalar part). Hardware support: barrier synchronization mechanism in the storage controller. IP: Instruction Processor. SC: Storage Controller. MS: Main Storage.

  21. Loop Parallelisation

      i loop parallelisation:

         DO i = 1, N
            A(i) = B(i) + C(i)
         ENDDO

      becomes

         [fork]
         DO i = start, end
            A(i) = B(i) + C(i)
         ENDDO
         [join]

      j loop parallelisation:

         DO j = 1, M
            W(j) = C(j) + D(j)
            DO i = 1, N
               A(i,j) = B(i,j) + W(j)
            ENDDO
         ENDDO

      becomes

         [fork]
         DO j = start, end
            W(j) = C(j) + D(j)
            DO i = 1, N
               A(i,j) = B(i,j) + W(j)
            ENDDO
         ENDDO
         [join]

  22. Loop Parallelisation

      i loop parallelisation (the j loop carries a recurrence, so only the inner i loop is split):

         DO j = 2, M
            DO i = 1, N
               A(i,j) = A(i,j-1) + A(i,j)
            ENDDO
         ENDDO

      becomes

         [fork]
         DO j = 2, M
            DO i = start, end
               A(i,j) = A(i,j-1) + A(i,j)
            ENDDO
         ENDDO
         [join]

      i loop and j loop parallelisation (two independent loops in one fork/join region):

         DO i = 1, N
            A(i) = B(i) + C(i)
         ENDDO
         DO j = 1, M
            D(j) = E(j) * F(j)
         ENDDO

      becomes

         [fork]
         DO i = start, end
            A(i) = B(i) + C(i)
         ENDDO
         DO j = start, end
            D(j) = E(j) * F(j)
         ENDDO
         [join]

  23. Loop Parallelisation

         *poption parallel          (force parallelisation)
         *poption tlocal(a,b,i)     (thread-local variables)
         DO i = 1, N
            CALL sub(a, b, i)
         ENDDO

      is executed as

         [fork]
         DO i = 1, N
            CALL sub(a, b, i)
         ENDDO
         [join]
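
A self-contained sketch of how the directive might be applied, assuming the *poption parallel syntax exactly as shown on this slide; the surrounding program, array names and problem size are illustrative. Because the directive sits on a comment line, the program also compiles unchanged with other Fortran compilers (it then simply runs serially).

      PROGRAM LOOPDEMO
      INTEGER N
      PARAMETER (N = 100000)
      REAL*8 A(N), B(N), C(N)
      INTEGER I
      DO I = 1, N
         B(I) = DBLE(I)
         C(I) = 1.0D0 / DBLE(I)
      ENDDO
*     request COMPAS parallelisation of the following loop; the
*     iterations are independent, so they can be split across the IPs
*poption parallel
      DO I = 1, N
         A(I) = B(I) + C(I)
      ENDDO
      PRINT *, A(1), A(N)
      END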

  24. Section Parallelisation: execution of independent blocks of code in different threads (sections are always single-threaded).

         *poption parallel_sections
         *poption section
         CALL SUB1
         *poption section
         CALL SUB2
         *poption end_parallel_sections
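
A self-contained sketch using the section directives as they appear on the slide; SUB1 and SUB2 are illustrative stand-ins that write to disjoint variables, so the two sections really are independent.

      PROGRAM SECDEMO
      REAL*8 X, Y
      COMMON /RES/ X, Y
*     the two CALLs are independent, so each section can run in its
*     own thread; each section itself stays single-threaded
*poption parallel_sections
*poption section
      CALL SUB1
*poption section
      CALL SUB2
*poption end_parallel_sections
      PRINT *, X, Y
      END

      SUBROUTINE SUB1
      REAL*8 X, Y
      COMMON /RES/ X, Y
      X = 1.0D0
      END

      SUBROUTINE SUB2
      REAL*8 X, Y
      COMMON /RES/ X, Y
      Y = 2.0D0
      END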

  25. Effect of COMPAS. Dot product: S = SUM(A(1:N)*B(1:N))

  26. SR8000 Programming: Message Passing (MPI)

  27. Remote DMA Transfer (diagram comparing the two transfer paths between nodes over the crossbar network). Normal transfer: the sending program's data is memory-copied into an OS send buffer, passed across the network to an OS receive buffer on the other node, and memory-copied again into the receiving program, with the associated protocol processing, context switches and interrupt handling. Remote DMA transfer: data moves directly between the two programs' memories, with no buffering in the kernel and no OS system call.

  28. Inter-node MPI (diagram: one MPI process on each node, communicating over the cross-bar inter-node network). One MPI process per node; RDMA transfer is possible.

  29. Intra-node MPI (diagram: several MPI processes per node, communicating through shared memory within a node and over the cross-bar inter-node network between nodes). One MPI process per IP; RDMA transfer is not possible.

  30. MPI Ping-pong
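
A minimal MPI ping-pong in Fortran of the kind such a benchmark measures (the message size and repetition count are arbitrary choices, not figures from the slide); run it with two processes, e.g. one per node for the inter-node case or both on one node for the intra-node case.

      PROGRAM PINGPONG
      INCLUDE 'mpif.h'
      INTEGER NWORDS, NREPS
      PARAMETER (NWORDS = 16384, NREPS = 100)
      REAL*8 BUF(NWORDS), T0, T1
      INTEGER IERR, RANK, K, STAT(MPI_STATUS_SIZE)
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      DO K = 1, NWORDS
         BUF(K) = 0.0D0
      ENDDO
      T0 = MPI_WTIME()
      DO K = 1, NREPS
         IF (RANK .EQ. 0) THEN
*           rank 0 sends the message and waits for it to come back
            CALL MPI_SEND(BUF, NWORDS, MPI_DOUBLE_PRECISION, 1, 0,
     &                    MPI_COMM_WORLD, IERR)
            CALL MPI_RECV(BUF, NWORDS, MPI_DOUBLE_PRECISION, 1, 0,
     &                    MPI_COMM_WORLD, STAT, IERR)
         ELSE IF (RANK .EQ. 1) THEN
*           rank 1 echoes the message straight back
            CALL MPI_RECV(BUF, NWORDS, MPI_DOUBLE_PRECISION, 0, 0,
     &                    MPI_COMM_WORLD, STAT, IERR)
            CALL MPI_SEND(BUF, NWORDS, MPI_DOUBLE_PRECISION, 0, 0,
     &                    MPI_COMM_WORLD, IERR)
         ENDIF
      ENDDO
      T1 = MPI_WTIME()
      IF (RANK .EQ. 0) THEN
         PRINT *, 'round-trip time (s):', (T1 - T0) / NREPS
         PRINT *, 'one-way bandwidth (MB/s):',
     &            2.0D0 * 8.0D0 * NWORDS * NREPS / ((T1 - T0) * 1.0D6)
      ENDIF
      CALL MPI_FINALIZE(IERR)
      END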

  31. SR8000 Parallelism (diagram of the three levels across two nodes: message passing (MPI) between nodes, multi-thread parallelism (COMPAS) within a node, and instruction-level parallelism (PVP) within each processor).

  32. SR8000 Programming: Memory Architecture

  33. Memory Hierarchy (diagram): floating-point registers (128+32); L1 cache (128 Kbyte, 4-way) feeding the registers at 32 bytes/cycle; store buffer (16 entries); memory switch, shared with the other IPs, at 16 bytes/cycle; main memory (2 to 16 Gbyte, 512 banks).

  34. Address Translation (diagram): the virtual address is split into a virtual page number and a page offset; the page table in main memory translates the page number, and recently used page-table entries are cached in the TLB.

  35. Large TLB (diagram): with a large page size (16 Mbyte to 128 Mbyte) and a correspondingly large page table, the 256-entry TLB covers the whole address space; for example, 256 entries of 128 Mbyte pages map 32 Gbyte, more than the largest 16 Gbyte node memory.

  36. Memory Address Hashing (diagram: the address bits shown, 32 to 63, are combined by XOR in the storage controller and memory controller data paths, hashing addresses across the memory banks).
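
An illustrative sketch of XOR-based address hashing, not the SR8000's actual hash function: the bank index is the XOR of two bit fields of the word address, so power-of-two strides no longer land on a single bank.

*     Illustrative XOR bank hash (assumed 512 banks, word addresses);
*     without hashing, a stride of 512 words would hit bank 0 every
*     time, with hashing the accesses spread across the banks.
      INTEGER FUNCTION BANK(ADDR)
      INTEGER ADDR
      BANK = IEOR(IBITS(ADDR, 0, 9), IBITS(ADDR, 9, 9))
      END

      PROGRAM HASHDEMO
      INTEGER BANK, I
      DO I = 0, 4
         PRINT *, 'word address', I * 512, ' -> bank', BANK(I * 512)
      ENDDO
      END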

  37. Key Features of SR8000 • High-performance RISC CPU with PVP • High-performance node with COMPAS • High sustained memory bandwidth • High scalability with a fast network • Low energy and space requirements

  38. SR8000 Programming: Performance

  39. Top 500 – June 2000

  40. Linpack Performance (GFlops vs number of nodes): 10.88 GFlops on 1 node, 20.50 on 2 nodes, 40.76 on 4 nodes, 80.25 on 8 nodes, 159.51 on 16 nodes, 313.32 on 32 nodes, 577.49 on 60 nodes, 605.30 on 64 nodes, and 917.15 on 100 nodes (about 84% parallel efficiency relative to the single-node figure).

  41. NAS Parallel FT (GFlops vs number of nodes for Classes A, B and C). Chart values: 5.39 and 5.14 GFlops on 1 node; 8.37, 8.31 and 7.92 on 2 nodes; 15.10, 14.84 and 14.01 on 4 nodes; 28.78, 27.95 and 26.16 on 8 nodes.

  42. NAS Parallel CG
