Is a Grid Cost-Effective for High-Performance Computing?

Is a Grid cost-effective? Ralf Gruber, EPFL-SIC/FSTI-ISE-LIN, Lausanne SOS7

HPC in Europe
  TOP500: 176 systems in Europe, 12 with more than 1 Tflop/s Linpack
  First is CEA-DAM: No. 7
  Germany: 71, UK: 39, France: 22, Italy: 16, Others: 28
  Industry: 108, first (Telecom I) at No. 96
  BMW: 11, Daimler-Chrysler: 5, Car F: 6
  Not one big machine, but many smaller ones
  HPC companies:
    Quadrics
    Scali, SCI-based clusters: No. 51
    SCS: see Toni’s presentation
    Beowulf production: Paralline, Dalco, ...

Swiss-Tx project
  The Swiss-Tx machines (with TNet switch):
    1998: Prototype Swiss-T0 with 16 Alpha 21164
    1999: Swiss-T1 (Baby) with 16 Alpha 21264
    2000: Swiss-T1 with 70 Alpha 21264
  Know-how transfer to industry:
    2001: GeneProt protein sequencing machine with 1420 Alpha 21264
    Peak performance = 1780 Gflop/s
    In June 2001, it would have been No. 12 in the Top500 and 2nd in Europe, and it was world No. 1 among industrial computer installations
    It would be No. 48 (= C-Plant) in the Top500 list of November 2002 and is still No. 2 among industrial computer installations

Is a grid cost-effective?
  NO!
  Reasons:
    For 25 years, we have been able to use machines all over the world
    Those who needed good connections installed them (HEPNET, Swissprot, ...)
    Using Java works against HPC

Parallel machines at EPFL and CSCS
  EPFL-SIC:
    SGI Origin3800 (500 MHz), 128 processors
    HP Alpha ES45/Quadrics (1.25 GHz), 100 processors
  Institutes:
    PC clusters (CFD, Chemistry, Mathematics, Physics)
    IBM SP-2 (EFD)
  CSCS:
    NEC SX-5 (16 processors)
    IBM Regatta (256 processors, 1.3 GHz)

Optimal grid scheduling
  Parameterisation of
  . Single processor
  . Cluster
  . Application
  Application-tailored Grid scheduling

Characteristic single-processor parameters Va and ra
  Va = Operations (Ops) / Memory accesses (LS)
  ra = min(R∞, R∞ * Va / Vm) = min(R∞, M∞ * Va)
  Examples:
    SAXPY: y = y + a * x
      Ops = 2, LS = 3 (2 loads + 1 store), Va = 2/3
      -> ra = 2/3 * M∞
    Matrix*matrix multiply and add: Va = n / 2
      -> ra = R∞
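The two single-processor parameters can be evaluated with a few lines of Python. This is a minimal sketch, not code from the talk: the function names are invented for illustration, and the machine figures (2000 Mflop/s peak, 333 Mword/s memory bandwidth) are borrowed from the Alpha 21264 row of the MATMULT results. The matrix size n = 1000 is an arbitrary assumption.

```python
def operations_per_access(ops: float, memory_accesses: float) -> float:
    """Va: operations per memory access (loads plus stores)."""
    return ops / memory_accesses


def attainable_performance(r_peak: float, m_peak: float, va: float) -> float:
    """ra = min(R_inf, M_inf * Va), in Mflop/s.

    r_peak: peak performance R_inf [Mflop/s]
    m_peak: peak memory bandwidth M_inf [Mword/s]
    """
    return min(r_peak, m_peak * va)


# Illustrative machine figures (Alpha 21264 row of the MATMULT table).
R_PEAK = 2000.0  # Mflop/s
M_PEAK = 333.0   # Mword/s

# SAXPY: 2 operations, 2 loads + 1 store -> memory bound.
va_saxpy = operations_per_access(ops=2, memory_accesses=3)
ra_saxpy = attainable_performance(R_PEAK, M_PEAK, va_saxpy)

# Matrix multiply and add with Va = n / 2; n = 1000 is an assumed size.
va_matmul = 1000 / 2
ra_matmul = attainable_performance(R_PEAK, M_PEAK, va_matmul)

print(f"SAXPY:  Va = {va_saxpy:.3f}, ra = {ra_saxpy:.0f} Mflop/s")
print(f"MATMUL: Va = {va_matmul:.0f}, ra = {ra_matmul:.0f} Mflop/s")
```

With these inputs SAXPY is limited by memory bandwidth (ra = 2/3 of the memory rate), while the matrix multiply reaches the processor peak.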

Results with MATMULT
  Va = 1 (double precision)
  Vm = R∞ [Mflop/s] / M∞ [Mword/s]
  R∞ [Mflop/s] = theoretical peak performance
  M∞ [Mword/s] = theoretical peak memory bandwidth

  Machine            P   R∞     ra = M∞   Vm   r     %
  NEC SX-5           1   8000   8000      1
  Pentium 4 1.5/R    1   1500   400       4    229   57
  Alpha 21264        2   2000   333       6    200   60
  Pentium 4 1.7/S    1   1700   133       12   92    69
  AMD 1.2/S          1   2400   133       18   57    43

  r: measured performance
  %: 100 * r / ra
  /S: slow SDRAM memory
  /R: fast Rambus (RDRAM) memory
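The last column of the table follows directly from the definitions on this slide. The sketch below recomputes it from the listed values, using ra = M∞ because Va = 1. The variable and function names are illustrative, and the NEC SX-5 row is left out because no measured value is listed for it.

```python
# Rows: (machine, R_inf [Mflop/s], ra = M_inf [Mword/s], measured r [Mflop/s])
rows = [
    ("Pentium 4 1.5/R", 1500, 400, 229),
    ("Alpha 21264", 2000, 333, 200),
    ("Pentium 4 1.7/S", 1700, 133, 92),
    ("AMD 1.2/S", 2400, 133, 57),
]


def machine_balance(r_peak: float, m_peak: float) -> float:
    """Vm = R_inf / M_inf."""
    return r_peak / m_peak


def efficiency_percent(r_measured: float, ra: float) -> float:
    """Percentage of the attainable performance ra actually measured."""
    return 100.0 * r_measured / ra


results = {
    name: (machine_balance(r_peak, m_peak), efficiency_percent(r, m_peak))
    for name, r_peak, m_peak, r in rows
}

for name, (vm, pct) in results.items():
    print(f"{name:16s} Vm = {vm:5.2f}   r/ra = {pct:4.1f} %")
```

The recomputed percentages round to the figures in the table; the Vm column of the slide appears to be rounded more coarsely.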

Tailoring clusters to applications
  G > 1

Tailoring clusters to applications
  G = ga / gm
  Application: ga = O / S
  Machine: gm = ra / b
    O: number of operations [flop]
    S: number of words sent [words]
    ra: theoretical peak performance of the application [Mflop/s]
    b: peak network bandwidth per processor [Mword/s]

Cluster characterisation
  gm = ra / b
  b = C / (P * <d>)
  gm = P * ra [Mflop/s] * <d> / C [Mword/s]

  Table: gm values for MATMULT (double precision)

  Machine               P      P*ra [Mflop/s]   C [Mword/s]   <d>    gm
  T1 (TNet)             32*2   21333            640           1.25   40
  T1 (Fast Ethernet)    32*2   21333            48            1      444
  IELNX (P4+FE)         22     8800             34            1      250
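The gm column can be recomputed from the other columns with the formula on this slide. The following is a small sketch with illustrative names; the input numbers are copied from the table.

```python
def cluster_gm(total_ra: float, c: float, d: float) -> float:
    """gm = P * ra * <d> / C.

    total_ra: P * ra, aggregate attainable performance [Mflop/s]
    c:        network bandwidth C [Mword/s]
    d:        <d>, as listed in the table
    """
    return total_ra * d / c


# (P * ra, C, <d>) per cluster, as listed in the table.
clusters = {
    "T1 (TNet)": (21333.0, 640.0, 1.25),
    "T1 (Fast Ethernet)": (21333.0, 48.0, 1.0),
    "IELNX (P4+FE)": (8800.0, 34.0, 1.0),
}

gm_values = {name: cluster_gm(*params) for name, params in clusters.items()}

for name, value in gm_values.items():
    print(f"{name:20s} gm = {value:6.1f}")
```

The computed values are about 41.7, 444.4 and 258.8, so the tabulated 40 and 250 look like rounded figures.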

LAUTREC on Swiss-T1 + TNet
  Swiss-T1 (TNet): ra = 1000 Mflop/s, b = 10 Mword/s, gm = 100
  Water molecules: ga = 5*P*(0.65*Norb + 4.24*log2V) / (3*(P-1))
  P = 8, Norb = 128, log2V = 20: ga = 330
  G = 3.3 (3.6 measured)
  -> 25% of overall time is due to communication, 75% to computation

LAUTREC on Swiss-T1 + Fast Ethernet
  Swiss-T1 (FE): ra = 2000 Mflop/s, b = 1.5 Mword/s, gm = 1333
  Water molecules: ga = 5*P*(0.65*Norb + 4.24*log2V) / (3*(P-1))
  P = 8, Norb = 128, log2V = 20: ga = 330
  G = 0.25 (0.25 measured)
  -> 20% of overall time is due to computation, 80% to communication
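The two LAUTREC cases can be reproduced numerically. Since ga = O / S and gm = ra / b, the ratio G = ga / gm equals (O / ra) / (S / b), that is, computation time over communication time. The sketch below assumes that the two phases do not overlap, so the communication share is 1 / (1 + G); that assumption and the function names are not from the slides. Reading the formula as division by 3*(P-1), the stated inputs give ga = 320, close to the 330 quoted on the slides.

```python
def lautrec_ga(p: int, n_orb: int, log2_v: float) -> float:
    """ga for the water-molecule case, reading the divisor as 3*(P-1)."""
    return 5 * p * (0.65 * n_orb + 4.24 * log2_v) / (3 * (p - 1))


def time_fractions(ga: float, gm: float) -> tuple[float, float, float]:
    """Return (G, computation share, communication share).

    Assumes computation and communication do not overlap,
    so that t_comp / t_comm = G.
    """
    g = ga / gm
    comm = 1.0 / (1.0 + g)
    return g, 1.0 - comm, comm


ga_formula = lautrec_ga(p=8, n_orb=128, log2_v=20)
print(f"ga from the formula: {ga_formula:.0f}")

# Use the ga = 330 quoted on the slides for the two networks.
for label, gm in (("TNet", 100.0), ("Fast Ethernet", 1333.0)):
    g, comp, comm = time_fractions(330.0, gm)
    print(f"{label:14s} G = {g:4.2f}  computation {comp:4.0%}  communication {comm:4.0%}")
```

Under this model TNet gives roughly 77% computation and 23% communication, and Fast Ethernet roughly 20% and 80%, in line with the percentages on the slides.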

LAUTREC: effect of latency
  TNet/Swiss-T1: L = 13 μs MPI latency, b = 80 MB/s
    Break-even message length: beml = L*b = 1000 B
  Fast Ethernet: L = 100 μs MPI latency, b = 10 MB/s
    Break-even message length: beml = L*b = 1000 B
  Average message length in LAUTREC: aml = p*V / (16*P**2)
  For the test case (V = 96**3, P = 8): aml = 40 kB >> beml
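The break-even message length is the message size at which latency and transfer time are equal. The sketch below assumes the latencies are in microseconds, which is what makes L*b come out near 1000 B for both networks, and uses a simple cost model t = L + n / b that is an assumption of this example rather than something stated on the slide. The 40 kB message is taken as 40000 bytes.

```python
def break_even_message_length(latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """beml = L * b, in bytes."""
    return latency_s * bandwidth_bytes_per_s


def latency_share(n_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Fraction of the message time spent in latency, for t = L + n / b."""
    total = latency_s + n_bytes / bandwidth_bytes_per_s
    return latency_s / total


networks = {
    "TNet": (13e-6, 80e6),            # 13 microseconds, 80 MB/s
    "Fast Ethernet": (100e-6, 10e6),  # 100 microseconds, 10 MB/s
}

AML_BYTES = 40_000  # average LAUTREC message length for the test case

beml = {name: break_even_message_length(*p) for name, p in networks.items()}
share = {name: latency_share(AML_BYTES, *p) for name, p in networks.items()}

for name in networks:
    print(f"{name:14s} beml = {beml[name]:6.0f} B   latency share at 40 kB = {share[name]:.1%}")
```

For 40 kB messages the latency share is below 3% on both networks under this model, which is consistent with aml >> beml.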

Point-to-point applications
  ga = Operations (O) / Sends (S)
  FE/FV: O ~ number of volume nodes
         O ~ (number of variables per node) squared
         O ~ number of non-zero matrix elements
         O ~ number of operations per matrix element
  FE/FV: S ~ number of surface nodes
         S ~ number of variables per node
  FE/FV: ga ~ number of nodes in one direction
         ga ~ number of variables per node
         ga ~ number of non-zero matrix elements
         ga ~ number of operations per matrix element
         ga ~ 1 / number of surfaces
  ga (NS/FV/100**3) ≈ 2000
  ga (Poisson/FD/100**3) ≈ 400
  Reminder (Beowulf + Fast Ethernet): gm ≈ 250

Other quantities
  Memory usage
  Price per hour of CPU time
  Engineering salary
  Energy consumption
  Maintenance/servicing/personnel costs
  User convenience

Optimal Grid scheduling
  Goal: add application-tailored Grid scheduling to the RMS
  . Estimate machine and application parameters by counts
  . Measure machine and application parameters (PAPI, ...)
  . Build up a database of these parameters
  . Find the best-suited Grid resource (not always the optimum) and submit to it
  . Update the database dynamically
  . Perform statistics on decisions and decision failures

Optimal Grid scheduling
  Settle and apply rules to find the best-suited resource by:
  . Matching machine and application (MPI or not MPI)
  . Best price/performance ratio based on the parameterisation
  . Availability of the resources
  . Engineering costs
  . Energy consumption
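A rule set of this kind could be prototyped as follows. This is only a sketch under stated assumptions: the class and function names are hypothetical, the prices per CPU hour are invented, engineering and energy costs are left out, and the cost model (computation time inflated by 1 + 1/G for MPI jobs, G > 1 required for a match) is one possible reading of the parameterisation, not the scheduler described in the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    name: str
    ra: float                  # attainable Mflop/s per processor for the application
    gm: float                  # ra / b
    price_per_cpu_hour: float  # hypothetical price, arbitrary currency
    available: bool


@dataclass
class Application:
    name: str
    ga: float        # O / S (ignored for non-MPI jobs)
    mflop: float     # total work in Mflop
    uses_mpi: bool


def estimated_cost(app: Application, machine: Machine) -> float:
    """Price of one run; MPI jobs pay a communication overhead of 1/G."""
    cpu_hours = app.mflop / machine.ra / 3600.0
    if app.uses_mpi:
        cpu_hours *= 1.0 + machine.gm / app.ga  # 1 + 1/G
    return cpu_hours * machine.price_per_cpu_hour


def best_resource(app: Application, machines: list[Machine]) -> Optional[Machine]:
    """Cheapest available machine that matches the application (G > 1 for MPI)."""
    candidates = [
        m for m in machines
        if m.available and (not app.uses_mpi or app.ga / m.gm > 1.0)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda m: estimated_cost(app, m))


machines = [
    Machine("T1-TNet", ra=1000.0, gm=100.0, price_per_cpu_hour=1.0, available=True),
    Machine("T1-FE", ra=2000.0, gm=1333.0, price_per_cpu_hour=0.5, available=True),
]
lautrec = Application("LAUTREC", ga=330.0, mflop=1.0e6, uses_mpi=True)
serial_job = Application("serial job", ga=0.0, mflop=1.0e6, uses_mpi=False)

choice_mpi = best_resource(lautrec, machines)
choice_serial = best_resource(serial_job, machines)
print(choice_mpi.name, choice_serial.name)
```

With these made-up prices, the communication-heavy MPI job is sent to the TNet machine because Fast Ethernet gives G < 1, while the serial job goes to the cheaper Fast Ethernet machine.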

Optimal Grid scheduling
  Perform statistics to:
  . Detect unavailable resources that are requested too often
  . Detect the real costs of an application
  . Detect applications that should be parallelised/optimised to reduce costs
  . Guide decision making for the next purchase
  . Guide decisions on the allocation of R&D money

Is a grid cost-effective?
  Yes, it can be!
  Minimise overall costs by application-adapted job execution
    Purchase low-cost resources that are in demand but not available
    Parallelise cost-ineffective applications
    Reduce engineering and energy costs
  Note: “Cheap” resources do not have to be in use 90% of the time
  Results in:
    More computing resources for the same price
    A more rapid increase in application efficiencies
  Questions:
    Do computer manufacturers play the game?
    Do application owners play the game?
    Can we change users, decision makers and computing centres?

Reference
  R. Gruber, P. Volgers, A. de Vita, M. Stengel, T.-M. Tran, Parameterisation to tailor commodity clusters to applications, Future Generation Computer Systems 19 (2003) 111-120
  See also: http://sawww.epfl.ch/SIC/SA/publications/SCR02/scr13e.html
