
apeNEXT

Presentation Transcript


  1. apeNEXT
  Piero Vicini (piero.vicini@roma1.infn.it), INFN Roma

  2. APE keywords
  • Parallel system
    • Massively parallel 3D array of computing nodes with periodic boundary conditions
  • Custom system
    • Processor: extensive use of VLSI
    • Native support for the complex data type and the "normal" operation a × b + c
    • Large register file
    • VLIW microcode
  • Node interconnections
    • Optimized for nearest-neighbor communication
  • Software tools
    • Apese, TAO, OS, machine simulator
  • Dense system
    • Reliable and safe HW solution
    • Custom mechanics for "wide" integration
  • Cheap system
    • 0.5 €/MFlops
    • Very low maintenance cost

  3. The APE family
  Our line of home-made computers

  4. APE ('88): 1 GFlops

  5. APE100 (1993): 100 GFlops
  Processing Board (PB, 8 nodes): ~400 MFlops

  6. APEmille – 1 TFlops
  • 2048 VLSI processing nodes (0.5 GFlops each)
  • SIMD, synchronous communications
  • Fully integrated "host computer": 64 cPCI-based PCs
  • System hierarchy: "Torre" (32 PB, 128 GFlops), "Processing Board" (PB, 8 nodes, 4 GFlops), computing node

  7. APEmille installations
  • Bielefeld: 130 GF (2 crates)
  • Zeuthen: 520 GF (8 crates)
  • Milan: 130 GF (2 crates)
  • Bari: 65 GF (1 crate)
  • Trento: 65 GF (1 crate)
  • Pisa: 325 GF (5 crates)
  • Rome 1: 650 GF (10 crates)
  • Rome 2: 130 GF (2 crates)
  • Orsay: 16 GF (1/4 crate)
  • Swansea: 65 GF (1 crate)
  • Grand total: ~1966 GF

  8. The apeNEXT architecture
  (Figure: node-numbering diagram of a Processing Board; X± links on cables, Y+/Z+ on the backplane, each J&T node with its DDR memory)
  • 3D mesh of computing nodes
  • Custom VLSI processor at 200 MHz (J&T)
  • 1.6 GFlops per node (complex "normal")
  • 256 MB (1 GB) memory per node
  • First-neighbor communication network, "loosely synchronous"
    • YZ internal, X on cables
    • r = 8/16 => 200 MB/s per channel
  • Scalable from 25 GFlops to 6 TFlops (see the neighbor-addressing sketch after this slide):
    • Processing Board: 4 x 2 x 2 nodes, ~26 GF
    • Crate (16 PB): 4 x 8 x 8 nodes, ~0.5 TF
    • Rack (32 PB): 8 x 8 x 8 nodes, ~1 TF
    • Large systems: (8·n) x 8 x 8
  • Linux PCs as host system
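
To make the "periodic boundary conditions" concrete, here is a minimal sketch of first-neighbor addressing on an 8x8x8 torus such as a rack. This is not actual apeNEXT code: the NX/NY/NZ sizes and the rank_of/neighbor helpers are illustrative choices, not the real node mapping.

```c
#include <stdio.h>

/* First-neighbor addressing on a 3D torus with periodic boundary
 * conditions (illustrative only; sizes and numbering are assumptions,
 * not the real apeNEXT mapping). Grid of NX x NY x NZ nodes. */
#define NX 8
#define NY 8
#define NZ 8

/* Linear rank of node (x, y, z). */
static int rank_of(int x, int y, int z)
{
    return (x * NY + y) * NZ + z;
}

/* Neighbor in direction (dx, dy, dz), wrapping around at the faces. */
static int neighbor(int x, int y, int z, int dx, int dy, int dz)
{
    int nx = (x + dx + NX) % NX;
    int ny = (y + dy + NY) % NY;
    int nz = (z + dz + NZ) % NZ;
    return rank_of(nx, ny, nz);
}

int main(void)
{
    /* Node (0,0,0): its X- neighbor wraps to x = NX-1. */
    printf("X+ neighbor of (0,0,0): %d\n", neighbor(0, 0, 0, +1, 0, 0));
    printf("X- neighbor of (0,0,0): %d\n", neighbor(0, 0, 0, -1, 0, 0));
    return 0;
}
```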

  9. Design methodology
  • VHDL incremental model of (almost) the whole system
    • Custom (VLSI and/or FPGA) components derived via VHDL synthesis tools
    • Stand-alone simulation of each component's VHDL model plus simulation of the "global" VHDL model
    • Powerful test-bed for test-vector generation: first-time-right silicon
  • Software design environment
    • Simplified but complete model of HW-host interaction
    • Test environment for development of the compilation chain and OS
    • Performance (architecture) evaluation at design time

  10. Assembling apeNEXT…
  (Figure: assembly hierarchy, from the J&T ASIC to the J&T module, the PB, the backplane and the rack)

  11. Overview of the J&T architecture
  • Peak floating-point performance of about 1.6 GFlops
    • IEEE-compliant double precision
  • Integer arithmetic performance of about 400 MIPS
  • Link bandwidth of about 200 MByte/s in each direction
    • Full duplex
    • 7 links: X+, X-, Y+, Y-, Z+, Z-, "7th" (I/O)
  • Support for current-generation DDR memory
    • Memory bandwidth of 3.2 GByte/s, i.e. 400 Mword/s (see the back-of-the-envelope check after this slide)
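
A quick consistency check of the figures above. This is my own arithmetic, assuming 64-bit double-precision words and the 8-bit-wide, 200 MHz links described on the following slides:

```c
#include <stdio.h>

int main(void)
{
    const double clock_hz   = 200e6;   /* J&T clock frequency          */
    const double mem_bw_Bps = 3.2e9;   /* quoted memory bandwidth      */
    const double word_bytes = 8.0;     /* 64-bit double-precision word */

    /* 3.2 GB/s over 8-byte words -> 400 Mword/s, as quoted. */
    printf("memory: %.0f Mword/s\n", mem_bw_Bps / word_bytes / 1e6);

    /* One 8-bit LVDS link at 200 MHz -> 1.6 Gb/s = 200 MB/s per direction. */
    printf("link:   %.1f Gb/s = %.0f MB/s\n",
           8.0 * clock_hz / 1e9, 8.0 * clock_hz / 8.0 / 1e6);
    return 0;
}
```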

  12. J&T: Top-Level Diagram

  13. The J&T arithmetic box
  • 4 multipliers, 4 adder/subtractors
  • Pipelined complex "normal" a*b+c: 8 flops per cycle
  • At 200 MHz (fully pipelined): 1.6 GFlops (see the flop count after this slide)
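
A minimal sketch in plain C (my own expansion, not the APE instruction set or TAO code) of how one complex "normal" decomposes onto 4 real multiplies and 4 real adds/subtracts, and why one result per cycle at 200 MHz gives the 1.6 GFlops peak:

```c
#include <stdio.h>

/* Expansion of the complex "normal" r = a*b + c into the real
 * operations the arithmetic box provides: 4 multiplies and
 * 4 adds/subtracts, i.e. 8 flops per result. */
typedef struct { double re, im; } cplx;

static cplx normal(cplx a, cplx b, cplx c)
{
    cplx r;
    r.re = a.re * b.re - a.im * b.im + c.re;  /* 2 mul, 1 sub, 1 add */
    r.im = a.re * b.im + a.im * b.re + c.im;  /* 2 mul, 2 add        */
    return r;
}

int main(void)
{
    /* One "normal" per cycle at 200 MHz -> 8 * 200e6 = 1.6 GFlops peak. */
    printf("peak = %.1f GFlops\n", 8.0 * 200e6 / 1e9);

    cplx a = {1, 2}, b = {3, -1}, c = {0.5, 0.5};
    cplx r = normal(a, b, c);
    printf("r = %g + %gi\n", r.re, r.im);
    return 0;
}
```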

  14. The J&T remote I/O
  • FIFO-based communication (see the conceptual sketch after this slide)
  • LVDS signalling
  • 1.6 Gb/s per link (8 bit @ 200 MHz)
  • 6 (+1) independent links
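
A conceptual model of what "FIFO-based" buys here: sender and receiver are decoupled by a small queue, so neighboring nodes only need to be loosely synchronous. This is an illustrative sketch only; the link_fifo type, FIFO_DEPTH and word size are arbitrary assumptions, not the hardware design.

```c
#include <stdio.h>

#define FIFO_DEPTH 16

/* Toy software model of one link FIFO. */
typedef struct {
    unsigned long long data[FIFO_DEPTH];
    int head, tail, count;
} link_fifo;

static int fifo_push(link_fifo *f, unsigned long long w)
{
    if (f->count == FIFO_DEPTH) return 0;   /* full: sender stalls   */
    f->data[f->tail] = w;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return 1;
}

static int fifo_pop(link_fifo *f, unsigned long long *w)
{
    if (f->count == 0) return 0;            /* empty: receiver waits */
    *w = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}

int main(void)
{
    link_fifo x_plus = {{0}, 0, 0, 0};
    unsigned long long w;

    fifo_push(&x_plus, 42);                 /* node A sends           */
    if (fifo_pop(&x_plus, &w))              /* neighbor B receives    */
        printf("received %llu\n", w);
    return 0;
}
```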

  15. J&T summary
  • CMOS 0.18 µm, 7 metal layers (ATMEL)
  • 200 MHz
  • Double-precision complex "normal" operation
  • 64-bit AGU
  • 8 KW program cache
  • 128-bit local memory channel
  • 6+1 LVDS links, 200 MB/s each
  • BGA package, 600 pins

  16. PB
  • Collaboration with NEURICAM spa
  • 16 nodes, 3D-interconnected
    • 4x2x2 topology: 26 GFlops, 4.6 GB memory
  • Light system:
    • J&T module connectors
    • Glue logic (clock tree, 10 MHz)
    • Global signal interconnection (FPGA)
    • DC-DC converters (48 V to 3.3/2.5 V)
  • Dominant technologies:
    • LVDS: 1728 (16*6*2*9) differential signals at 200 Mb/s; 144 routed via cables, 576 via backplane on 12 controlled-impedance layers
    • High-speed differential connectors:
      • Samtec QTS (J&T module)
      • ERNI ERMET-ZD (backplane)

  17. J&T module
  • J&T processor
  • 9 DDR-SDRAM chips, 256 Mbit (x16)
  • 6 LVDS links, up to 400 MB/s
  • Host fast I/O link (7th link)
  • I2C link (slow-control network)
  • Dual power supply 2.5 V + 1.8 V, 7-10 W estimated
  • Dominant technologies:
    • SSTL-II (memory interface)
    • LVDS (network interface + I/O)

  18. NEXT backplane
  • 16 PB slots + root slot
  • Size: 447 x 600 mm²
  • 4600 LVDS differential signals, point-to-point, up to 600 Mb/s
  • 16 controlled-impedance layers (32)
  • Press-fit only
  • ERNI/Tyco ERMET-ZD connectors
  • Providers: APW (primary), ERNI (2nd source)
  • Connector kit cost: 7 k€ (!)
  • PB insertion force: 80-150 kg (!)

  19. PB mechanics
  (Figure: top view of the apeNEXT PB, showing the frame, DC/DC converters, J&T modules, board-to-board connector and air-flow channels)
  • PB constraints:
    • Power consumption: up to 340 W
    • PB-backplane insertion force: 80-150 kg (!)
    • Fully populated PB weight: 4-5 kg
  • Detailed study of airflow
  • Custom design of card frame and insertion tool

  20. Rack mechanics
  • Problem:
    • PB weight: 4-5 kg
    • PB consumption: 340 W (est.)
    • 32 PB + 2 root boards
    • Power supply: <48 V x 150 A per crate
    • Integrated host PCs
    • Forced-air cooling
    • Robust, expandable/modular, CE, EMC...
  • Solution:
    • 42U rack (h: 2.10 m)
    • EMC-proof
    • Efficient cable routing
    • 19", 1U slots for 9 "host PCs" (rack mounted)
    • Hot-swap power supply cabinet (modular)
    • Custom design of "card cage" and "tie bar"
    • Custom design of cooling system

  21. (image-only slide, no text)

  22. Host I/O architecture
  (Figure: host I/O architecture, showing the I2C network for bootstrap & control and the 7th link at 200 MB/s)

  23. Host I/O interface
  (Figure: block diagram of the interface board: QDR memory bank and controller, FIFOs, 7th-link controllers, PCI master/target controllers and I2C controller in an Altera APEX II with a PLDA PCI interface)
  • PCI board, Altera APEX II based
  • Quad-data-rate (QDR) memory (x32)
  • 7th link: 1 (2) bidirectional channels
  • I2C: 4 independent ports
  • PCI interface: 64 bit, 66 MHz
    • PCI master mode for the 7th link
    • PCI target mode for I2C

  24. Status and expected schedule
  • J&T ready to test in September '03
    • We will receive between 300 and 600 chips
    • We need 256 processors to assemble a crate!
  • We expect them to work!
    • The same team designed 7 ASICs of similar complexity
    • Impressive, fully detailed simulations of multi-J&T systems
    • The more one simulates, the less one has to test!
  • PB, J&T module, backplane and mechanics have been built and tested
  • Within days/weeks the first working apeNEXT computer should be operating
  • Mass production will follow ASAP
    • Mass production should start by end of 2003
    • INFN requires 8-12 TFlops of computing power!

  25. Software
  • TAO compiler and linker: READY
    • All existing APE programs will run with no change
    • Physics code has already been run on the simulator
    • Kernels of physics codes are used to benchmark the efficiency of the FP unit (an illustrative kernel is sketched after this slide)
  • C compiler
    • gcc (2.93) and lcc have been retargeted
    • lcc works (almost): http://www.cs.princeton.edu/software/lcc/
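
For context, the FP-unit benchmarks mentioned above are dominated by kernels of this shape. Here is a minimal illustration in plain C (not TAO and not the actual benchmark code; mat_vec and the sample data are hypothetical) of a 3x3 complex matrix applied to a 3-vector, i.e. a chain of complex "normal" operations:

```c
#include <complex.h>
#include <stdio.h>

/* Illustrative lattice-QCD-style kernel: apply a 3x3 complex matrix to
 * a 3-component complex vector. Each output component is a chain of
 * complex a*b + c "normal" operations. */
static void mat_vec(const double complex m[3][3],
                    const double complex v[3],
                    double complex out[3])
{
    for (int i = 0; i < 3; i++) {
        double complex acc = 0.0;
        for (int j = 0; j < 3; j++)
            acc = m[i][j] * v[j] + acc;   /* one "normal" per term */
        out[i] = acc;
    }
}

int main(void)
{
    double complex m[3][3] = {
        {1.0 + 0.0*I, 0.0 + 1.0*I, 0.0},
        {0.0,         1.0,         0.0},
        {0.0,         0.0,         1.0 - 1.0*I},
    };
    double complex v[3] = {1.0, 2.0*I, 3.0};
    double complex out[3];

    mat_vec(m, v, out);
    for (int i = 0; i < 3; i++)
        printf("out[%d] = %g + %gi\n", i, creal(out[i]), cimag(out[i]));
    return 0;
}
```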

  26. Project costs
  • Total development cost: 1700 k€
    • 1050 k€ for VLSI development
    • 550 k€ non-VLSI
  • Manpower involved: ~20 man-years
  • Mass production cost: ~0.5 €/MFlops

  27. Future R&D activities
  • Computing-node architecture
    • Adaptable/reconfigurable computing node
    • Fat operators, short/custom FP data types, multiple-node integration
    • Evaluation/integration of commercial processors in APE systems
  • Interconnection architecture and technologies
    • Custom APE-like network
    • Interface to host, PC interconnection
  • Mechanical assemblies (performance/volume, reliability)
    • Racks, cables, power distribution, etc.
  • Software
    • Full support for standard languages (C): compiler, linker, ...
    • Distributed OS
    • APE system integration in a "GRID" environment

  28. Conclusions
  • J&T in fab, ready Summer '03 (300-600 chips)
  • Everything else is ready and tested!
  • If tests are OK, mass production starts in 4Q03
  • All components are over-dimensioned
    • Cooling, LVDS tested @ 400 Mb/s, power supply on boards, ...
    • This makes a technology step possible with no extra design and relatively low test effort
  • Installation plans
    • The INFN theoretical group requires 8-12 TFlops (10-15 cabinets) on delivery of a working machine
    • DESY is considering between 8 and 16 TFlops
    • Paris...

  29. APE in SciParC
  • APE is the de-facto European computing platform for large-volume LQCD applications. But...
  • "Interdisciplinarity" is on our pathway (i.e. APE is not only QCD):
    • Fluid dynamics (lattice Boltzmann, weather forecasting)
    • Complex systems (spin glasses, real glasses, protein folding)
    • Neural networks
    • Seismic migration
    • Plasma physics (astrophysics, thermonuclear engines)
    • ...
  • So, in our opinion, it is strategic to build a "general purpose" massively parallel computing platform dedicated to large-scale computational problems coming from different fields of research.
  • The APE group can (and wants to) contribute to the development of such future machines.
