Is a Grid cost-effective?

Presentation Transcript


  1. Is a Grid cost-effective? Ralf Gruber, EPFL-SIC/FSTI-ISE-LIN, Lausanne SOS7

  2. HPC in Europe. TOP500: 176 systems in Europe, 12 of them above 1 Tflop/s on Linpack; the first is CEA-DAM at No. 7. Germany: 71, UK: 39, France: 22, Italy: 16, others: 28. Industry: 108, the first (Telecom I) at No. 96; BMW: 11, Daimler-Chrysler: 5, Car F: 6. Not one big machine, but many smaller ones. HPC companies: Quadrics; Scali, SCI-based clusters: No. 51; SCS: see Toni's presentation; Beowulf production: Paralline, Dalco, ...

  3. Swiss-Tx project. The Swiss-Tx machines (with the TNet switch): 1998: prototype Swiss-T0 with 16 Alpha 21164 processors; 1999: Swiss-T1 (Baby) with 16 Alpha 21264 processors; 2000: Swiss-T1 with 70 Alpha 21264 processors. Know-how transfer to industry: 2001: GeneProt protein-sequencing machine with 1420 Alpha 21264 processors, peak performance = 1780 Gflop/s. In June 2001 it would have been No. 12 in the TOP500 and 2nd in Europe, and it was the world's No. 1 industrial computer installation. It would be No. 48 (= C-Plant) in the TOP500 list of November 2002 and is still the No. 2 industrial computer installation.

  4. Is a grid cost-effective? NO! Reasons: for 25 years we have been able to use machines all over the world; those who needed good connections installed them (HEPNET, Swissprot, ...); using Java is against HPC.

  5. Parallel machines at EPFL and CSCS. EPFL-SIC: SGI Origin3800 (500 MHz), 128 processors; HP Alpha ES45/Quadrics (1.25 GHz), 100 processors. Institutes: PC clusters (CFD, Chemistry, Mathematics, Physics); IBM SP-2 (EFD). CSCS: NEC SX-5 (16 processors); IBM Regatta (256 processors, 1.3 GHz).

  6. Optimal grid scheduling: parameterisation of the single processor, the cluster and the application, leading to application-tailored Grid scheduling.

  7. Characteristic single-processor parameters Va and ra. Va = operations (Ops) / memory accesses (LS). Examples: SAXPY, y = y + a·x: Ops = 2, LS = 3 (2 loads + 1 store), so Va = 2/3. Matrix-matrix multiply and add: Va = n/2. ra = min(R∞, R∞·Va/Vm) = min(R∞, M∞·Va), with Vm = R∞/M∞. For SAXPY, ra = 2/3·M∞; for matrix-matrix multiply, ra = R∞.
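A minimal Python sketch of the two single-processor parameters, assuming Va is counted as floating-point operations per load/store and ra = min(R∞, M∞·Va) as stated above. The function names and the example machine (R∞ = 1500 Mflop/s, M∞ = 400 Mword/s, borrowed from the Pentium 4 1.5/R row of the MATMULT results) are illustrative choices, not part of the talk. For the matrix product it assumes the three operand matrices are each loaded once and the result stored once.

```python
def va(ops: float, loads_stores: float) -> float:
    """Va: floating-point operations per memory access (loads + stores)."""
    return ops / loads_stores


def ra(r_peak: float, m_peak: float, va_value: float) -> float:
    """ra = min(R_inf, M_inf * Va) in Mflop/s.

    r_peak: theoretical peak performance R_inf [Mflop/s]
    m_peak: theoretical peak memory bandwidth M_inf [Mword/s]
    """
    return min(r_peak, m_peak * va_value)


# SAXPY, y = y + a*x: 2 flops per element, 2 loads + 1 store.
va_saxpy = va(2, 3)

# n x n multiply-and-add (assumption: A, B, C loaded once, C stored once):
# 2n^3 flops over 4n^2 memory accesses gives Va = n/2.
n = 100
va_matmul = va(2 * n**3, 4 * n**2)

R_PEAK, M_PEAK = 1500.0, 400.0  # example machine (Pentium 4 1.5/R values)
ra_saxpy = ra(R_PEAK, M_PEAK, va_saxpy)    # memory-bound here: (2/3) * M_inf
ra_matmul = ra(R_PEAK, M_PEAK, va_matmul)  # compute-bound here: R_inf

print(f"SAXPY:  Va = {va_saxpy:.3f}, ra = {ra_saxpy:.1f} Mflop/s")
print(f"MATMUL: Va = {va_matmul:.1f}, ra = {ra_matmul:.1f} Mflop/s")
```

The min switches from the memory-bound branch to the peak branch once Va exceeds Vm = R∞/M∞, which is why a kernel with large Va can reach R∞ while SAXPY cannot.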

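  8. Results with MATMULT, Va = 1 (double precision). Vm = R∞ [Mflop/s] / M∞ [Mword/s], where R∞ is the theoretical peak performance and M∞ the theoretical peak memory bandwidth.
     | Machine          | P | R∞ [Mflop/s] | ra = M∞·Va [Mflop/s] | Vm | r [Mflop/s] | %  |
     |------------------|---|--------------|----------------------|----|-------------|----|
     | NEC SX-5         | 1 | 8000         | 8000                 | 1  |             |    |
     | Pentium 4 1.5/R  | 1 | 1500         | 400                  | 4  | 229         | 57 |
     | Alpha 21264      | 2 | 2000         | 333                  | 6  | 200         | 60 |
     | Pentium 4 1.7/S  | 1 | 1700         | 133                  | 12 | 92          | 69 |
     | AMD 1.2/S        | 1 | 2400         | 133                  | 18 | 57          | 43 |
     r: measured performance; %: 100·r/ra; /S: slow SDRAM memory; /R: fast Rambus (RDRAM) memory.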
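The ra and % columns follow directly from the peak values. This Python sketch recomputes them for the rows that report a measured r, assuming Va = 1 for MATMULT as stated; the row data are copied from the table, and the helper name `characterise` is illustrative.

```python
VA_MATMULT = 1.0  # flops per memory access for MATMULT (double precision)

# (machine, R_inf [Mflop/s], M_inf [Mword/s], measured r [Mflop/s])
ROWS = [
    ("Pentium 4 1.5/R", 1500.0, 400.0, 229.0),
    ("Alpha 21264", 2000.0, 333.0, 200.0),
    ("Pentium 4 1.7/S", 1700.0, 133.0, 92.0),
    ("AMD 1.2/S", 2400.0, 133.0, 57.0),
]


def characterise(r_peak: float, m_peak: float, r_measured: float):
    """Return (Vm, ra, efficiency in %) for one machine."""
    vm = r_peak / m_peak                   # Vm = R_inf / M_inf
    ra = min(r_peak, m_peak * VA_MATMULT)  # memory-bound whenever Vm > Va
    efficiency = 100.0 * r_measured / ra   # % = 100 * r / ra
    return vm, ra, efficiency


results = {name: characterise(rp, mp, r) for name, rp, mp, r in ROWS}
for name, (vm, ra, eff) in results.items():
    print(f"{name:16s} Vm = {vm:5.2f}  ra = {ra:6.1f}  r/ra = {eff:4.1f} %")
```

Because every listed machine except the NEC SX-5 has Vm > 1, ra equals M∞ for MATMULT on all of them; only the SX-5 (Vm = 1) reaches ra = R∞.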

  9. Tailoring clusters to applications: G > 1

  10. Tailoring clusters to applications: G = ga/gm. Application: ga = O/S. Machine: gm = ra/b. O: number of operations [flop]; S: number of words sent [words]; ra: theoretical peak performance of the application [Mflop/s]; b: peak network bandwidth per processor [Mword/s].
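Since computation takes O/ra and communication S/b, the ratio G = ga/gm is exactly T_comp/T_comm. The Python sketch below checks that identity numerically and, assuming computation and communication do not overlap, derives the share of time spent communicating as 1/(1 + G). The operation and word counts are made-up example values.

```python
def ga(ops_mflop: float, words_mword: float) -> float:
    """Application parameter ga = O / S (flops per word sent)."""
    return ops_mflop / words_mword


def gm(ra_mflops: float, b_mwords: float) -> float:
    """Machine parameter gm = ra / b (flops per word of network bandwidth)."""
    return ra_mflops / b_mwords


def time_split(ops_mflop, words_mword, ra_mflops, b_mwords):
    """Return (computation time, communication time) in seconds."""
    return ops_mflop / ra_mflops, words_mword / b_mwords


# Hypothetical example: 3300 Mflop of work and 10 Mword sent per processor
# on a node with ra = 1000 Mflop/s and b = 10 Mword/s.
O, S, RA, B = 3300.0, 10.0, 1000.0, 10.0
G = ga(O, S) / gm(RA, B)
t_comp, t_comm = time_split(O, S, RA, B)
comm_share = t_comm / (t_comp + t_comm)  # assumes no overlap

print(f"G = {G:.2f}, T_comp/T_comm = {t_comp / t_comm:.2f}")
print(f"communication share = {comm_share:.1%} (1/(1+G) = {1 / (1 + G):.1%})")
```

G > 1 therefore means computation dominates the run time, which is the tailoring criterion.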

  11. Cluster characterisation: gm = ra/b with b = C/(P·<d>), hence gm = P·ra[Mflop/s]·<d> / C[Mword/s]. Table: gm values for MATMULT (double precision).
     | Machine            | P    | P·ra [Mflop/s] | C [Mword/s] | <d>  | gm  |
     |--------------------|------|----------------|-------------|------|-----|
     | T1 (TNet)          | 32×2 | 21333          | 640         | 1.25 | 40  |
     | T1 (Fast Ethernet) | 32×2 | 21333          | 48          | 1    | 444 |
     | IELNX (P4+FE)      | 22   | 8800           | 34          | 1    | 250 |
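A sketch of the cluster formula in Python, reading C as the aggregate network bandwidth of the cluster in Mword/s and <d> as the average distance a message travels; both readings are assumptions, since the slide does not define them. With the table's inputs the formula gives about 42, 444 and 259, so the 40 and 250 in the gm column appear to be rounded.

```python
# (cluster, P * ra [Mflop/s], C [Mword/s], <d>)  -- values from the table
CLUSTERS = [
    ("T1 (TNet)", 21333.0, 640.0, 1.25),
    ("T1 (Fast Ethernet)", 21333.0, 48.0, 1.0),
    ("IELNX (P4+FE)", 8800.0, 34.0, 1.0),
]


def gm_cluster(p_times_ra: float, c_total: float, avg_distance: float) -> float:
    """gm = P * ra * <d> / C, i.e. ra / b with b = C / (P * <d>)."""
    return p_times_ra * avg_distance / c_total


gms = {name: gm_cluster(pra, c, d) for name, pra, c, d in CLUSTERS}
for name, value in gms.items():
    print(f"{name:20s} gm = {value:6.1f}")
```

On the same T1 nodes, TNet lowers gm by roughly a factor of ten compared with Fast Ethernet.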

  12. LAUTREC on Swiss-T1 + TNet. Swiss-T1 (TNet): ra = 1000 Mflop/s, b = 10 Mword/s, so gm = 100. Water molecules: ga = 5·P·(0.65·Norb + 4.24·log₂V) / (3·(P - 1)); with P = 8, Norb = 128, log₂V = 20: ga = 330, G = 3.3 (3.6 measured) -> 25% of the overall time is due to communication, 75% to computation.

  13. LAUTREC on Swiss-T1 + Fast Ethernet. Swiss-T1 (FE): ra = 2000 Mflop/s, b = 1.5 Mword/s, so gm = 1333. Water molecules: ga = 5·P·(0.65·Norb + 4.24·log₂V) / (3·(P - 1)); with P = 8, Norb = 128, log₂V = 20: ga = 330, G = 0.25 (0.25 measured) -> 20% of the overall time is due to computation, 80% to communication.
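The two LAUTREC cases can be reproduced with a few lines of Python. The sketch evaluates ga with the denominator read as 3(P - 1), takes gm = ra/b for each network, and, assuming computation and communication do not overlap, estimates the communication share as 1/(1 + G). With P = 8, Norb = 128 and log₂V = 20 the ga formula evaluates to 320, close to the 330 quoted on the slides; the sketch uses the quoted 330 when forming G, which gives roughly 23% communication on TNet and 80% on Fast Ethernet.

```python
def ga_lautrec(p: int, n_orb: int, log2_v: float) -> float:
    """ga for the water-molecule case: 5P(0.65 Norb + 4.24 log2 V) / (3(P-1))."""
    return 5 * p * (0.65 * n_orb + 4.24 * log2_v) / (3 * (p - 1))


def comm_share(g: float) -> float:
    """Fraction of run time spent communicating, assuming no overlap."""
    return 1.0 / (1.0 + g)


GA_QUOTED = 330.0  # value quoted on the slides
NETWORKS = {       # ra [Mflop/s], b [Mword/s] as given for Swiss-T1
    "TNet": (1000.0, 10.0),
    "Fast Ethernet": (2000.0, 1.5),
}

ga_formula = ga_lautrec(p=8, n_orb=128, log2_v=20)
print(f"ga from formula: {ga_formula:.1f}")

G = {}
for name, (ra, b) in NETWORKS.items():
    G[name] = GA_QUOTED / (ra / b)  # G = ga / gm
    print(f"{name:14s} gm = {ra / b:7.1f}  G = {G[name]:.2f}  "
          f"communication = {comm_share(G[name]):.0%}")
```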

  14. LAUTREC: effect of latency. TNet/Swiss-T1: MPI latency L = 13 µs, b = 80 MB/s, break-even message length beml = L·b ≈ 1000 B. Fast Ethernet: MPI latency L = 100 µs, b = 10 MB/s, break-even message length beml = L·b = 1000 B. Average message length in LAUTREC: aml = π·V/(16·P²). For the test case (V = 96³, P = 8): aml = 40 kB ≫ beml.
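A minimal sketch of the latency argument, assuming the usual linear model t = L + m/b for one message of m bytes and taking 1 kB = 1000 B. At m = beml the latency accounts for half of the transfer time; at LAUTREC's 40 kB average message it accounts for only a few percent on either network. The model and function names are illustrative.

```python
NETWORKS = {  # MPI latency [s], bandwidth [B/s]
    "TNet": (13e-6, 80e6),
    "Fast Ethernet": (100e-6, 10e6),
}
AML_BYTES = 40_000  # average LAUTREC message for V = 96**3, P = 8 (slide value)


def beml(latency_s: float, bandwidth: float) -> float:
    """Break-even message length: latency equals pure transfer time."""
    return latency_s * bandwidth


def message_time(n_bytes: float, latency_s: float, bandwidth: float) -> float:
    """Linear latency/bandwidth model: t = L + n / b."""
    return latency_s + n_bytes / bandwidth


for name, (lat, bw) in NETWORKS.items():
    t = message_time(AML_BYTES, lat, bw)
    print(f"{name:14s} beml = {beml(lat, bw):6.0f} B  "
          f"latency share at aml = {lat / t:.1%}")
```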

  15. Point-to-point applications: ga = operations (O) / sends (S). FE/FV: O is proportional to the number of volume nodes, the square of the number of variables per node, the number of non-zero matrix elements and the number of operations per matrix element. FE/FV: S is proportional to the number of surface nodes and the number of variables per node. FE/FV: hence ga is proportional to the number of nodes in one direction, the number of variables per node, the number of non-zero matrix elements and the number of operations per matrix element, and inversely proportional to the number of surfaces. ga (NS/FV/100³) ≈ 2000; ga (Poisson/FD/100³) ≈ 400. Reminder (Beowulf + Fast Ethernet): gm ≈ 250.
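A sketch of how these proportionalities combine, assuming a cubic subdomain with n nodes per direction and k variables per node, so that O ∝ n³·k²·nnz·ops and S ∝ surfaces·n²·k, which leaves ga ∝ n·k·nnz·ops/surfaces. The prefactor `const` and the example arguments are hypothetical; the second part only divides the quoted ga values by the Beowulf + Fast Ethernet gm ≈ 250 to obtain G.

```python
def ga_scaling(n_dir: int, n_vars: int, nnz: int, ops_per_elem: int,
               n_surfaces: int, const: float = 1.0) -> float:
    """Proportional model: ga ~ n * k * nnz * ops / surfaces.

    const is a hypothetical, problem-dependent prefactor.
    """
    ops = n_dir**3 * n_vars**2 * nnz * ops_per_elem  # O ~ volume work
    sends = n_surfaces * n_dir**2 * n_vars           # S ~ surface data
    return const * ops / sends


# Doubling the subdomain size per direction doubles ga.
ratio = ga_scaling(200, 5, 7, 2, 6) / ga_scaling(100, 5, 7, 2, 6)

GM_BEOWULF_FE = 250.0
SLIDE_GA = {"NS/FV/100^3": 2000.0, "Poisson/FD/100^3": 400.0}
G = {case: g / GM_BEOWULF_FE for case, g in SLIDE_GA.items()}
for case, value in G.items():
    dominant = "computation" if value > 1 else "communication"
    print(f"{case:18s} G = {value:.1f}  ({dominant} dominates)")
```

Both cases give G > 1 on a Fast Ethernet Beowulf, the Poisson solver only marginally.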

  16. Other quantities: memory usage, price per hour of CPU time, engineering salaries, energy consumption, maintenance/servicing/personnel costs, user convenience.

  17. Optimal Grid scheduling. Goal: add application-tailored Grid scheduling to the RMS: estimate machine and application parameters by counting; measure machine and application parameters (PAPI, ...); build up a database of these parameters; find and submit to the best-suited Grid resource (not always the optimum); update the database dynamically; collect statistics on decisions and decision failures.

  18. Optimal Grid scheduling. Establish and apply rules to find the best-suited resource by: matching machine and application (MPI or non-MPI); the best price/performance ratio based on the parameterisation; availability of the resources; engineering costs; energy consumption.
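The selection rules can be illustrated with a small Python sketch. Everything in it is hypothetical: the resource and job records, the prices, and the cost model (estimated run time O/ra + S/b, assuming no overlap, times an hourly price plus energy cost). MPI jobs are only matched to resources where G = ga/gm > 1, following the tailoring criterion; among the admissible, available resources the cheapest estimated run wins. A real RMS integration would take these parameters from the measured database rather than hard-coding them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:            # hypothetical database record
    name: str
    available: bool
    ra: float              # Mflop/s per processor
    b: float               # Mword/s per processor
    price_per_hour: float  # hypothetical price units
    energy_per_hour: float


@dataclass
class Job:                 # hypothetical application record
    name: str
    ops: float             # O per processor [Mflop]
    words: float           # S per processor [Mword]
    uses_mpi: bool


def estimated_cost(job: Job, res: Resource) -> float:
    hours = (job.ops / res.ra + job.words / res.b) / 3600.0  # no overlap assumed
    return hours * (res.price_per_hour + res.energy_per_hour)


def admissible(job: Job, res: Resource) -> bool:
    if not res.available:
        return False
    if job.uses_mpi and job.words > 0:
        g = (job.ops / job.words) / (res.ra / res.b)  # G = ga / gm
        return g > 1.0
    return True


def best_resource(job: Job, resources: List[Resource]) -> Optional[Resource]:
    candidates = [r for r in resources if admissible(job, r)]
    return min(candidates, key=lambda r: estimated_cost(job, r), default=None)


RESOURCES = [
    Resource("cluster-fe", True, 1000.0, 1.5, 10.0, 2.0),
    Resource("cluster-fastnet", True, 1000.0, 10.0, 20.0, 2.0),
    Resource("vector-box", False, 8000.0, 100.0, 200.0, 20.0),
]
solver = Job("mpi-solver", ops=3.6e6, words=9e3, uses_mpi=True)  # ga = 400
serial = Job("serial-code", ops=3.6e6, words=0.0, uses_mpi=False)

for job in (solver, serial):
    choice = best_resource(job, RESOURCES)
    if choice is None:
        print(job.name, "-> none")
    else:
        print(job.name, "->", choice.name,
              f"(cost {estimated_cost(job, choice):.1f})")
```

In this example the MPI solver is routed to the faster network because G < 1 on Fast Ethernet, while the serial code goes to the cheapest machine.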

  19. Optimal Grid scheduling. Collect statistics to: detect unavailable resources that are demanded too often; determine the real costs of an application; detect applications that should be parallelised/optimised to reduce costs; guide decisions on the next purchase; guide decisions on the attribution of R&D money.

  20. Is a grid cost-effective? Yes, it can be! Minimise overall costs by application-adapted job execution; purchase low-cost resources that are in demand but not available; parallelise cost-ineffective applications; reduce engineering and energy costs. Note: “cheap” resources do not need to be used 90% of the time. This results in more computing resources for the same price and a more rapid increase in application efficiency. Questions: Do computer manufacturers play the game? Do application owners play the game? Can we change users, decision makers and computing centres?

  21. Reference: R. Gruber, P. Volgers, A. de Vita, M. Stengel, T.-M. Tran, "Parameterisation to tailor commodity clusters to applications", Future Generation Computer Systems 19 (2003) 111-120. See also: http://sawww.epfl.ch/SIC/SA/publications/SCR02/scr13e.html
