
High Performance Computing and Networking Status & Trends


Presentation Transcript


  1. High Performance Computing and Networking Status & Trends. H. Leroy - 2005

  2. Thanks : • Franck Capello (Grid 5000) • Dominique Lavenier (R-Disk, Remix) • François Bodin And some SuperComputing Conference Tutorials ( http://supercomputing.org/ )

  3. Table of contents • Computer architecture (memory refresh) • HPCN Trends • Grid Computing • Grid 5000 • Parallel Programming Models

  4. Reminders. What is at stake with parallelism: • Provide substantial computing power • Implement increasingly complex numerical simulation models, for example with coupled models • Have scalable, low-cost machines

  5. Performance needs, for: • Nuclear engineering • Aerodynamics • Meteorology • Imaging (image processing and synthesis, virtual reality) • Oil exploration • Simulation before manufacturing • New problems in physics, biology, ...

  6. Grand Challenge (1992) : 3T

  7. The Teraflop objective. Source: J. Normand, CEA, 1995

  8. Michael Flynn's taxonomy (1972). MIMD: UMA, NUMA, COMA, CC-NUMA, NORMA

  9. SIMD

  10. MIMD: SMP (Shared-Memory Multiprocessors)

  11. MIMD: Distributed-Memory Multiprocessors

  12. MIMD (CC-NUMA)

  13. Memory Hierarchy. Process working set. • Spatial locality (consecutive elements of A and B are accessed): DO I=1,N ; A(I)=B(I) ; ENDDO • Temporal locality (S is reused on every iteration): DO I=1,N ; S=S+A(I) ; ENDDO • Caches (data & instruction caches), TLB
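
  A minimal C sketch of the same two access patterns (illustrative only, not from the slides; array names and sizes are arbitrary):

    /* locality.c - spatial vs. temporal locality; compile with: cc -O2 locality.c */
    #include <stdio.h>
    #define N 1000000
    static double a[N], b[N];

    int main(void) {
        /* Spatial locality: a[] and b[] are walked sequentially,
           so each cache line fetched is fully used. */
        for (int i = 0; i < N; i++)
            a[i] = b[i];

        /* Temporal locality: s stays in a register / L1 cache and is
           reused on every iteration, while a[] streams through the cache. */
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += a[i];

        printf("s = %f\n", s);
        return 0;
    }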

  14. Caches: L1 cache, L2 cache, and now also L3 cache!

  15. Top 500 November 2004

  16. Top 500, June 2005 (number of procs / Rmax / Rpeak, in GFlops):
  #1 LLNL, IBM eServer Blue Gene: 65536 procs, Rmax 136800, Rpeak 183500
  #2 IBM Research Center, IBM eServer Blue Gene: 40960 procs, Rmax 91290, Rpeak 114688
  #3 NASA, SGI Altix (Voltaire Infiniband): 10160 procs, Rmax 51870, Rpeak 60960
  #4 NEC Earth Simulator: 5120 procs, Rmax 35860, Rpeak 40960
  #5 Barcelona, Mare Nostrum, IBM JS20 cluster (PPC 970, Myrinet): 4800 procs, Rmax 27910, Rpeak 42144

  17. Top 500: Architecture Trends. Clusters in the Top 500:
  06/2001: 33/500 (6.6%)
  11/2001: 43/500 (8.6%)
  06/2002: 80/500 (16%)
  11/2002: 93/500 (18.6%)
  06/2003: 149/500 (29.8%)
  11/2003: 208/500 (41.6%)
  06/2004: 291/500 (58.2%)
  11/2004: 294/500 (58.8%)
  06/2005: 304/500 (60.8%)

  18. Top 500: Performance Trend. 11/2004: 70.72 TF; 06/2005: 136.80 TF

  19. Top 500 : Performance Trend

  20. Top 500 : Applications

  21. Applications: Astrophysics N-body simulation (extreme.indiana.edu/gc3/); Tokamak fluid dynamics (http://www.acl.lanl.gov/Grand-Chal/Tok/gallery.html)

  22. Applications: www.llnl.gov/CASC/iscr/biocomp/; Winter temperatures and CO2 dispersion (www.llnl.gov/CASC/climate)

  23. Applications Virtual prototyping www.irisa.fr/ProHPC

  24. Top 500 : Interconnect Trend

  25. Remember Beowulf Clusters ? • “Do-It-Yourself Supercomputers” (Science 1996) • Built around : - Pile Of PCs (POP) - Dedicated High Speed LAN - Free Unix : Linux - Free and COTS Parallel Programming and Performance Tools • COTS Hardware permits rapid development and technology tracking
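
  On such Beowulf clusters the usual free parallel programming tool is MPI (implementations such as MPICH or LAM/MPI). A minimal sketch of an MPI program, illustrative only and not taken from the slides:

    /* hello_mpi.c - minimal MPI program; e.g. mpicc hello_mpi.c && mpirun -np 4 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
        printf("Hello from node %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }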

  26. Example Beowulf clusters: theHIVE (NASA, http://newton.gsfc.nasa.gov/thehive/) [NASA01]; Avalon (LANL, http://cnls.lanl.gov/avalon/) [LANL01]; MAPS (GMU, http://maps.scs.gmu.edu/) [GMU01]; Crystal (Inria Rocquencourt)

  27. Reconfigurable Computers. "The microchip that rewires itself", Scientific American, June 1997 [Sci97]: - Computers that modify their hardware circuits as they operate are opening a new era in computer design. Because they can filter data rapidly, they excel at pattern recognition, image processing and encryption. - Reconfigurable computer architectures are based on FPGAs (Field Programmable Gate Arrays)

  28. Microprocessor and FPGA performance increases (conservative estimates for FPGAs). Performance = # of gates x clock rate. Source: [SRC02]
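
  As a rough worked example of this metric (the numbers here are hypothetical, not from the slide): an FPGA with 10^7 usable gates clocked at 100 MHz scores 10^7 x 10^8 = 10^15 gate operations per second, and the score doubles whenever either the gate count or the clock rate doubles; this is why growing gate counts let FPGAs close the gap on microprocessors despite much lower clock rates.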

  29. Reconfigurable ComputingClusters • Beowulf style clusters • COTS reconfigurable boards as accelerators at each node • Some parallel programming and execution model tool

  30. Delivered and Benchmarked • 48 nodes • 2U, back-to-back (net 1U/node) • 96 FPGAs • Annapolis Micro • Xilinx Virtex II • 34 Tera-Ops • In use today • All Commodity Parts

  31. WILDSTAR / WILDFIRE. Example 2: Extended JMS. http://www.gwu.edu/~hpc/lsf/ , http://ece.gmu.edu/lucite/

  32. Massively Parallel Reconfigurable Systems • Massively parallel systems with large numbers of reconfigurable processors and microprocessors • Everything can be configured; things to configure include: • Processing • Network • Everything can be reconfigured over and over at run time (Run-Time Reconfiguration) to suit the underlying applications • Can be programmed easily by application scientists, at least in the same way as conventional parallel computers

  33. P P FPGA FPGA P memory P memory FPGA memory FPGA memory Vision for Reconfigurable Supercomputers . . . Shared Memory and or NIC

  34. Current Reconfigurable Architecture: [diagram] a microprocessor system (P + memory nodes behind an interface and I/O) coupled to a reconfigurable system (FPGA + memory nodes behind their own interface and I/O)

  35. Cray XD1 System: direct connect fat tree; multiple chassis connected to the RapidArray fabric. Source: [Cray, MAPLD04]

  36. XD1 Chassis (OctigaBay 12K), front and rear views. Source: [Cray, MAPLD04]
  • Six two-way Opteron blades
  • Six FPGA modules
  • Six SATA hard drives
  • Four 133 MHz PCI-X slots
  • 0.5 Tb/s switch
  • 12 x 2 GB/s ports to fabric

  37. Application Acceleration FPGA. [diagram: application accelerator attached via the RAP to the RapidArray switch and the Opterons; six of these configurations per chassis.] Source: [Cray, MAPLD04]
  • High-bandwidth connection to fabric and Opteron
  • Fine-grained integration of FPGA logic and software
  • Well suited for: searching, sorting, signal processing, audio/video/image manipulation, encryption, error correction, coding/decoding, packet processing, random number generation

  38. Application Acceleration Co-Processor. [diagram: AMD Opteron linked over HyperTransport (3.2 GB/s links) to the RapidArray Processor (RAP); the Application Acceleration FPGA (Xilinx Virtex II Pro) with QDR SRAM attaches to the Cray RapidArray interconnect over 2 GB/s links.] Source: [Cray, MAPLD04]

  39. Application Acceleration Interface. [diagram: user logic connected to a RapidArray Transport Core (TX/RX to the RAP) and a QDR RAM Interface Core (ADDR(20:0), D(35:0), Q(35:0) ports to the QDR SRAM banks).] Source: [Cray, MAPLD04]
  • XC2VP30 running at 200 MHz
  • 4 QDR II RAMs with over 400 HSTL-I I/Os at 200 MHz DDR (400 MTransfers/s)
  • 16-bit RapidArray I/F at 400 MHz DDR (800 MTransfers/s)
  • QDR and RapidArray I/F take up <20% of the XC2VP30; the rest is available for user applications
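
  For a sense of scale (simple arithmetic on the figures above, not stated on the slide): the 16-bit RapidArray interface at 800 MTransfers/s moves 800 x 10^6 x 2 bytes, roughly 1.6 GB/s per direction, consistent with the 2 GB/s links on the previous slide; each 36-bit QDR II port (D(35:0) or Q(35:0)) at 400 MTransfers/s moves about 400 x 10^6 x 4.5 bytes, roughly 1.8 GB/s.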

  40. FPGA Linux API. Source: [Cray, MAPLD04]
  Administration commands: fpga_open (allocate and open fpga), fpga_close (close allocated fpga), fpga_load (load binary into fpga)
  Control commands: fpga_start (start fpga, release from reset), fpga_stop (stop fpga)
  Status commands: fpga_status (get status of fpga)
  Data commands: fpga_put (put data to fpga), fpga_get (get data from fpga)
  Interrupt/blocking commands: fpga_intwait (blocks process, waits for fpga interrupt)
  The programmer sees a get/put and message-passing programming model (see the usage sketch below).
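
  A hedged sketch of how an application might drive this API. The call names come from the slide; the argument lists, return conventions, device path and file names are assumptions for illustration, not Cray's documented signatures:

    /* xd1_fpga_sketch.c - hypothetical usage of the XD1 FPGA API.
       The prototypes below are assumed for illustration; consult the
       Cray XD1 documentation for the real ones. */
    #include <stdio.h>

    /* Assumed prototypes, modeled on the command list above. */
    int fpga_open(const char *device);                  /* returns an fpga handle */
    int fpga_load(int fd, const char *bitstream);       /* load binary into fpga */
    int fpga_start(int fd);                             /* release from reset */
    int fpga_put(int fd, const void *buf, int nbytes);  /* put data to fpga */
    int fpga_intwait(int fd);                           /* block until fpga interrupt */
    int fpga_get(int fd, void *buf, int nbytes);        /* get data from fpga */
    int fpga_stop(int fd);
    int fpga_close(int fd);

    int main(void) {
        int in[1024], out[1024];
        int fd = fpga_open("/dev/fpga0");            /* device path is hypothetical */
        if (fd < 0) { perror("fpga_open"); return 1; }

        fpga_load(fd, "accelerator.bin");            /* hypothetical bitstream file */
        fpga_start(fd);

        fpga_put(fd, in, sizeof in);                 /* push input data */
        fpga_intwait(fd);                            /* wait for the "done" interrupt */
        fpga_get(fd, out, sizeof out);               /* pull back results */

        fpga_stop(fd);
        fpga_close(fd);
        return 0;
    }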

  41. Cray XD1 interconnect performance: [charts] MPI bandwidth comparison and MPI latency comparison. Source: [Cray, MAPLD04]

  42. SGI Altix 3000 http://www.sgi.com/servers/altix

  43. SGI Altix 3000: distributed shared memory vs. message passing over a commodity bus. Source: [SGI, MAPLD04] FPGA products in development for the SGI Altix 3000 family and others: - Up to 256 Itanium 2 with 64-bit Linux, DSM - SGI NUMAlink™ GSM interconnect fabric (up to 256 devices) - Programming model to be determined

  44. The 3 single-paradigm architectures. Source: [SGI, MAPLD04]
  • Scalar: Intel Itanium, SGI MIPS, IBM Power, Sun SPARC, HP PA, AMD Opteron
  • Vector: Cray X1, NEC SX
  • App-specific: graphics (GPU), signals (DSP), programmable (FPGA), other ASICs

  45. Multi-paradigm computing (SGI UltraViolet, terascale to petascale data sets: bring function to data): scalar, vector, FPGA, DSP, graphics and reconfigurable nodes around a scalable shared memory (globally addressable, thousands of ports, flat & high bandwidth, flexible & configurable). Source: [SGI, MAPLD04]

  46. Source: [SGI, MAPLD04]
  • Performance: direct connection to NUMAlink4, 6.4 GB/s per connection
  • Fast system-level reprogramming of FPGA: FPGA load at memory speeds
  • Atomic memory operations: same set as the system CPUs
  • Hardware barriers: dynamic load balancing
  • Scalability: configurations to 8191 NUMA/FPGA nodes

  47. And what about I/O ?

  48. Number of base pairs of sequence in GenBank release 142 for selected organisms. Growth of GenBank in billions of base pairs from release 3 (April 1994) to the current release, 142: doubling every 12 months, versus Moore's law: doubling every 18 months.
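
  To make the gap concrete (straightforward arithmetic, not on the slide): over 6 years, data that doubles every 12 months grows by 2^6 = 64x, while transistor budgets doubling every 18 months grow by 2^4 = 16x, so per transistor the sequence databases become roughly 4x harder to process over that span.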

  49. Cluster interconnects (latency / measured bandwidth):
  • Gigabit Ethernet: 28 ... 70 μs, 100 MB/s
  • Myrinet: 4.7 μs, 500 MB/s
  • Quadrics: 1.1 μs, 950 MB/s
  • Infiniband: 3.7 μs, 1500 MB/s (11/2004)
  Infiniband: goal of 500 ns latency; already has RDMA and atomic operations.
  Parallel file systems: GFS, GPFS, Lustre, PVFS, QFS, ...
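
  Latency and bandwidth figures like these are typically obtained with an MPI ping-pong microbenchmark. A minimal sketch (illustrative only, not the benchmark behind the table); with small messages (e.g. nbytes = 1) the same loop estimates latency, with large ones it estimates bandwidth:

    /* pingpong.c - crude MPI ping-pong; run with 2 processes:
       mpicc pingpong.c && mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int rank, iters = 1000, nbytes = 1 << 20;     /* 1 MB messages */
        char *buf = malloc(nbytes);
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;

        if (rank == 0) {
            /* one-way time per message and the resulting bandwidth */
            double one_way = dt / (2.0 * iters);
            printf("one-way time %.2f us, bandwidth %.1f MB/s\n",
                   one_way * 1e6, nbytes / one_way / 1e6);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }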
