
The Cray X1 Multiprocessor and Roadmap


Presentation Transcript


  1. The Cray X1 Multiprocessor and Roadmap

  2. Cray's Computing Vision: Scalable High-Bandwidth Computing
  [Roadmap chart: two product lines spanning 2004-2006, the vector line (X1 → X1E → 'Black Widow' → 'Black Widow 2') and the Red Storm line (RS → 'Strider 2' → 'Strider 3' / 'Strider X'), converging through product integration toward 'Cascade' with sustained petaflops in 2010.]

  3. Cray X1
  Extreme scalability with high-bandwidth vector processors
  • Cray PVP (modernized the ISA):
    • Powerful vector processors
    • Very high memory bandwidth
    • Non-unit-stride computation
    • Special ISA features
  • T3E (improved via vectors):
    • Extreme scalability
    • Optimized communication
    • Memory hierarchy
    • Synchronization features

  4. Cray X1 Instruction Set Architecture
  New ISA:
  • Much larger register set (32 x 64 vector, 64 + 64 scalar)
  • All operations performed under mask
  • 64- and 32-bit memory and IEEE arithmetic
  • Integrated synchronization features
  Advantages of a vector ISA:
  • Compiler provides useful dependence information to hardware
  • Very high single-processor performance with low complexity:
    ops/sec = (cycles/sec) x (instrs/cycle) x (ops/instr)
  • Localized computation on the processor chip:
    • Large register state with very regular access patterns
    • Registers and functional units grouped into local clusters (pipes) → excellent fit with future IC technology
  • Latency tolerance and pipelining to memory/network
    • Very well suited for scalable systems
  • Easy extensibility to next-generation implementation
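A minimal C sketch of the kind of conditional loop the "all operations performed under mask" feature targets: the compare becomes a vector mask and the arithmetic executes only in enabled elements, so the loop vectorizes without a branch. The function name and parameters are hypothetical examples of mine, not code from the deck, and the source itself is ordinary portable C rather than anything Cray-specific.

```c
/* Hypothetical example: scale only the elements above a threshold.
 * On a masked vector ISA, the `if` compiles to a mask generated from
 * the compare, and the multiply/store are performed under that mask. */
void scale_above_threshold(double *restrict y, const double *restrict x,
                           double threshold, double alpha, int n)
{
    for (int i = 0; i < n; i++) {
        if (x[i] > threshold)       /* compare -> vector mask           */
            y[i] = alpha * x[i];    /* executed only where mask is set  */
    }
}
```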

  5. Not Your Father's Vector Machine
  • New instruction set
  • New system architecture
  • New processor microarchitecture
  • "Classic" vector machines were programmed differently:
    • Classic vector: optimize for loop length with little regard for locality
    • Scalable micro: optimize for locality with little regard for loop length
  • The Cray X1 is programmed like other parallel machines:
    • Rewards locality: register, cache, local memory, remote memory
    • Decoupled microarchitecture performs well on short loop nests
    • (however, does require vectorizable code)
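As a concrete illustration of "optimize for locality with little regard for loop length", here is a hypothetical C sketch of a cache-blocked matrix-vector product; the routine name and block size B are arbitrary choices of mine. The blocking keeps a slice of x resident in cache for reuse, and the resulting short inner loops are exactly the case the decoupled microarchitecture is said to handle well.

```c
#include <stddef.h>

/* Hypothetical cache-blocked y += A*x. The caller is assumed to have
 * zeroed y. B is an illustrative block size, not a tuned value. */
#define B 64

void blocked_matvec(double *restrict y, const double *restrict a,
                    const double *restrict x, int n)
{
    for (int jb = 0; jb < n; jb += B) {          /* block of x to reuse  */
        int jend = (jb + B < n) ? jb + B : n;
        for (int i = 0; i < n; i++) {            /* stream over the rows */
            double sum = 0.0;
            for (int j = jb; j < jend; j++)      /* short vector loop    */
                sum += a[(size_t)i * n + j] * x[j];
            y[i] += sum;
        }
    }
}
```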

  6. Cray X1 Node: 51 Gflops, 200 GB/s
  [Node block diagram: 16 processors (P) with caches ($) above 16 memory controller chips (M) and local memory (mem), plus I/O ports (IO).]
  • Four multistream processors (MSPs), each 12.8 Gflops
  • High-bandwidth local shared memory (128 Direct Rambus channels)
  • 32 network links and four I/O links per node
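The headline figures are consistent with simple arithmetic from the bullets above; the per-channel Direct Rambus rate used below is my assumption, not a number stated in the deck.

$$4\ \text{MSPs} \times 12.8\ \text{Gflops} = 51.2\ \text{Gflops} \approx 51\ \text{Gflops}$$

$$128\ \text{channels} \times 1.6\ \text{GB/s (assumed per channel)} \approx 205\ \text{GB/s} \approx 200\ \text{GB/s}$$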

  7. Interconnection Network: NUMA, Scalable up to 1024 Nodes
  • 16 parallel networks for bandwidth
  • Global shared memory across the machine

  8. Network Topology (16 CPUs)
  [Diagram: four nodes (node 0 to node 3), each with four processors (P) and memory controllers M0 to M15; corresponding M chips across the nodes are linked into 16 network sections (Section 0 to Section 15).]

  9. Network Topology (128 CPUs)
  [Diagram: eight routers (R) interconnecting the nodes; 16 links.]

  10. Network Topology (512 CPUs)

  11. Designed for Scalability (T3E Heritage)
  • Distributed shared memory (DSM) architecture
    • Low-latency load/store access to the entire machine (tens of TBs)
  • Decoupled vector memory architecture for latency tolerance
    • Thousands of outstanding references, flexible addressing
  • Very high performance network
    • High bandwidth, fine-grained transfers
    • Same router as Origin 3000, but 32 parallel copies of the network
  • Architectural features for scalability
    • Remote address translation
    • Global coherence protocol optimized for distributed memory
    • Fast synchronization
  • Parallel I/O scales with system size
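A minimal sketch of the one-sided, fine-grained communication style this DSM design supports. It is written against the present-day OpenSHMEM API; the X1-era shmem() library predates that standard, so treat the exact calls as illustrative rather than as the deck's own interface.

```c
#include <shmem.h>
#include <stdio.h>

/* PE 0 writes a value directly into PE 1's copy of `remote_val`
 * without PE 1 participating -- a one-sided, fine-grained transfer. */
static long remote_val = 0;   /* symmetric (remotely accessible) variable */

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();

    if (me == 0 && shmem_n_pes() > 1) {
        long v = 42;
        shmem_long_put(&remote_val, &v, 1, 1);  /* one-sided put to PE 1 */
    }
    shmem_barrier_all();                        /* completes prior puts  */

    if (me == 1)
        printf("PE 1 sees remote_val = %ld\n", remote_val);

    shmem_finalize();
    return 0;
}
```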

  12. Decoupled Microarchitecture
  • Decoupled access/execute and decoupled scalar/vector
    • Scalar unit runs ahead, doing addressing and control
    • Scalar and vector loads issued early
    • Store addresses computed and saved for later use
    • Operations queued and executed later when data arrives
  • Hardware dynamically unrolls loops
    • Scalar unit goes on to issue the next loop before the current loop has completed
    • Full scalar register renaming through 8-deep shadow registers
    • Vector loads renamed through load buffers
  • Special sync operations keep the pipeline full, even across barriers
  This is key to making the system handle short-VL code well.

  13. Address Translation
  • High translation bandwidth: scalar plus four vector translations per cycle per processor
  • Source translation using 256-entry TLBs with support for multiple page sizes
  • Remote (hierarchical) translation:
    • Allows each node to manage its own memory (eases memory management)
    • The TLB only needs to hold translations for one node, so translation scales with system size

  14. Cache Coherence
  Global coherence, but only memory from the local node (8-32 GB) is cached:
  • Supports SMP-style codes up to 4 MSPs (16 SSPs)
  • References outside this domain are converted to non-allocate
  • Scalable codes use explicit communication anyway
  • Keeps the directory entry and protocol simple
  Explicit cache allocation control:
  • Per-instruction hint for vector references
  • Per-page hint for scalar references
  • Use non-allocating references to avoid cache pollution
  Coherence directory stored on the M chips (rather than in DRAM):
  • Low latency and very high bandwidth to support vectors
  • Typical CC system: 1 directory update per processor per 100-200 ns
  • Cray X1: 3.2 directory updates per MSP per ns (a factor of several hundred)
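The "factor of several hundred" follows directly from the two rates quoted on the slide:

$$\frac{3.2\ \text{updates/ns}}{1\ \text{update per }100\ \text{to}\ 200\ \text{ns}} = \frac{3.2}{0.005\ \text{to}\ 0.01\ \text{updates/ns}} \approx 320\ \text{to}\ 640$$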

  15. System Software

  16. Cray X1 I/O Subsystem
  [Block diagram: the X1 node's I chips connect over SPC channels (2 x 1.2 GB/s) to I/O boards, whose IOCAs drive PCI-X buses (800 MB/s) and PCI-X cards; FC-AL Fibre Channel (200 MB/s) attaches RS200 storage running XFS and XVM with ADIC SNFS; CNS gateways provide Gigabit Ethernet or HIPPI networking, and IP runs over fiber to the CPES.]

  17. Cray X1 UNICOS/mp Single System Image
  • A UNIX kernel executes on each Application node (somewhat like the Chorus™ microkernel on UNICOS/mk), providing SSI, scaling and resiliency
  • A global resource manager (Psched) schedules applications across the Cray X1 nodes (the diagram shows a 256-CPU Cray X1)
  • Commands launch applications as on the T3E (mpprun)
  • System Service nodes provide file services, process management and other basic UNIX functionality (like the /mk servers); user commands execute on the System Service nodes

  18. Programming Models
  • Shared memory:
    • OpenMP, pthreads
    • Single node (51 Gflops, 8-32 GB)
    • Fortran and C
  • Distributed memory:
    • MPI
    • shmem(), UPC, Co-array Fortran
    • Hierarchical models (DM over OpenMP)
  • Both MSP and SSP execution modes supported for all programming models
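A minimal, hypothetical C sketch of the "hierarchical models (DM over OpenMP)" item: MPI ranks for the distributed-memory level with OpenMP threads for the shared-memory level within a node. It is plain MPI/OpenMP, not X1-specific code, and the array size N is an arbitrary illustrative choice.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    static double x[N];
    double local = 0.0, global = 0.0;

    /* Shared-memory parallelism within the node. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        x[i] = (double)(rank + i);
        local += x[i];
    }

    /* Distributed-memory reduction across nodes. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %f across %d ranks\n", global, nranks);

    MPI_Finalize();
    return 0;
}
```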

  19. Mechanical Design

  20. Node Board
  [Image: 17" x 22" PCB with field-replaceable memory daughter cards, spray-cooling caps over the 8-chip CPU MCM, the network interconnect, and an air-cooling manifold.]

  21. Cray X1 Node Module

  22. Cray X1 Chassis

  23. 64-Processor Cray X1 System, ~820 Gflops

  24. Cray BlackWidow: The Next-Generation HPC System from Cray Inc.

  25. BlackWidow Highlights
  • Designed as a follow-on to the Cray X1:
    • Upward-compatible ISA
    • FCS in 2006
  • Significant improvement (>> Moore's Law rate) in:
    • Single-thread scalar performance
    • Price/performance
  • Refine the X1 implementation:
    • Continue to support DSM with 4-way SMP nodes
    • Further improve sustained memory bandwidth
    • Improve scalability up and down
    • Improve reliability/fault tolerance
    • Enhance the instruction set
  • A few BlackWidow features:
    • Single-chip vector microprocessor
    • Hybrid electrical/optical interconnect
    • Configurable memory capacity, memory BW and network BW

  26. Cascade Project: Cray Inc., Stanford, Cal Tech/JPL, Notre Dame

  27. High Productivity Computing Systems
  Goals:
  • Provide a new generation of economically viable high-productivity computing systems for the national security and industrial user community (2007-2010)
  Impact:
  • Performance (efficiency): speed up critical national security applications by a factor of 10X to 40X
  • Productivity (time-to-solution)
  • Portability (transparency): insulate research and operational application software from the system
  • Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
  HPCS program focus areas:
  • Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling and biotechnology
  Fill the critical technology and capability gap from today (late-80s HPC technology) to the future (quantum/bio computing).

  28. HPCS Planned Program, Phases 1-3
  [Timeline chart, fiscal years 2002-2010: Phase 1 industry concept study, Phase 2 R&D, Phase 3 full-scale development (planned); tracks for requirements and metrics, application analysis, performance assessment, benchmarks, software tools, and research prototypes and pilot systems/platforms; milestones include the industry BAA/RFP, concept reviews, system design review, PDR and early prototype, and Phase 2/Phase 3 readiness reviews.]

  29. Technology Focus of Cascade
  • System design
    • Network topology and router design
    • Shared memory for low-latency/low-overhead communication
  • Processor design
    • Exploring clustered vectors, multithreading and streams
    • Enhancing temporal locality via deeper memory hierarchies
    • Improving latency tolerance and memory/communication concurrency
  • Lightweight threads in memory
    • For exploitation of spatial locality with little temporal locality
    • Thread → data instead of data → thread
    • Exploring use of PIMs
  • Programming environments and compilers
    • High-productivity languages
    • Compiler support for heavyweight/lightweight thread decomposition
    • Parallel programming performance analysis and debugging
  • Operating systems
    • Scalability to tens of thousands of processors
    • Robustness and fault tolerance
  Lets us explore technologies we otherwise couldn't: a three-year head start on the typical development cycle.

  30. Cray's Computing Vision: Scalable High-Bandwidth Computing
  [Roadmap chart repeated from slide 2.]

  31. Questions? File Name: BWOpsReview081103.ppt
