

  1. Grid’5000* and a focus on a fault tolerant MPI experiment — Franck Cappello, INRIA Grid’5000, Email: fci@lri.fr — www.grid5000.fr — One of the 30+ ACI Grid projects (*5000 CPUs). CCGSC'06, Asheville

  2. Agenda Grid’5000 Some early results Experiment on fault tolerant MPI CCGSC'06, Asheville

  3. ACI GRID projects (Thierry Priol)
  • Peer-to-Peer: CGP2P (F. Cappello, LRI/CNRS)
  • Application Service Provider: ASP (F. Desprez, ENS Lyon/INRIA)
  • Algorithms: TAG (S. Genaud, LSIIT), ANCG (N. Emad, PRISM), DOC-G (V-D. Cung, UVSQ)
  • Compiler techniques: Métacompil (G-A. Silbert, ENMP)
  • Networks and communication: RESAM (C. Pham, ENS Lyon), ALTA (C. Pérez, IRISA/INRIA)
  • Visualisation: EPSN (O. Coulaud, INRIA)
  • Data management: PADOUE (A. Doucet, LIP6), MEDIAGRID (C. Collet, IMAG)
  • Tools: DARTS (S. Frénot, INSA-Lyon), Grid-TLSE (M. Dayde, ENSEEIHT)
  • Code coupling: RMI (C. Pérez, IRISA), CONCERTO (Y. Maheo, VALORIA), CARAML (G. Hains, LIFO)
  • Applications: COUMEHY (C. Messager, LTHE) - Climate; GenoGrid (D. Lavenier, IRISA) - Bioinformatics; GeoGrid (J-C. Paul, LORIA) - Oil reservoir; IDHA (F. Genova, CDAS) - Astronomy; Guirlande-fr (L. Romary, LORIA) - Language; GriPPS (C. Blanchet, IBCP) - Bioinformatics; HydroGrid (M. Kern, INRIA) - Environment; Medigrid (J. Montagnat, INSA-Lyon) - Medical
  • Grid Testbeds: CiGri-CIMENT (L. Desbat, UjF), Mecagrid (H. Guillard, INRIA), GLOP (V. Breton, IN2P3), GRID5000 (F. Cappello, INRIA)
  • Support for dissemination: ARGE (A. Schaff, LORIA), GRID2 (J-L. Pazat, IRISA/INSA), DataGRAAL (Y. Denneulin, IMAG)
  CCGSC'06, Asheville

  4. The Grid’5000 Project
  • 1) Build a nationwide experimental platform for large-scale Grid & P2P experiments:
  • 9 geographically distributed sites; every site hosts a cluster (from 256 CPUs to 1K CPUs)
  • All sites are connected by RENATER (the French Research and Education Network); RENATER hosts probes to trace network load conditions
  • Design and develop a system/middleware environment for safely testing and repeating experiments
  • 2) Use the platform for Grid experiments in real-life conditions:
  • Port and test applications, develop new algorithms
  • Address critical issues of Grid systems/middleware: programming, scalability, fault tolerance, scheduling
  • Address critical issues of Grid networking: high-performance transport protocols, QoS
  • Investigate original mechanisms: P2P resource discovery, Desktop Grids
  CCGSC'06, Asheville

  5. Grid’5000 principle: a highly reconfigurable experimental platform
  (Layered stack: Application / Programming Environments / Application Runtime / Grid or P2P Middleware / Operating System / Networking — alongside measurement tools and an experimental conditions injector.)
  Let users create, deploy and run their own software stack, including the software under test and its environment, plus measurement tools and experimental fault injectors.
  CCGSC'06, Asheville

  6. Grid’5000 map — planned CPUs per site (currently installed in parentheses): Lille: 500 (106), Nancy: 500 (94), Rennes: 518 (518), Orsay: 1000 (684), Lyon: 500 (252), Bordeaux: 500 (96), Grenoble: 500 (270), Toulouse: 500 (116), Sophia Antipolis: 500 (434). CCGSC'06, Asheville

  7. Hardware Configuration CCGSC'06, Asheville

  8. Grid’5000 network: RENATER connections, 10 Gbps dark fiber, dedicated lambda — fully isolated traffic! CCGSC'06, Asheville

  9. Grid’5000 as an Instrument
  • 4 main features:
  • High security for Grid’5000 and the Internet, despite the deep reconfiguration feature (a confined system)
  • A software infrastructure allowing users to access Grid’5000 from any Grid’5000 site and to have a home directory on every site (multiple access points, single system view, user-controlled data synchronization)
  • Reservation/scheduling tools allowing users to select node sets and schedule experiments
  • A user toolkit to reconfigure the nodes and monitor experiments
  CCGSC'06, Asheville

  10. OS reconfiguration techniques: reboot OR virtual machines
  Virtual machines: no need for reboot, but selecting a virtual machine technology is not so easy. Xen has some limitations: Xen 3 is in "initial support" status for Intel VT-x; Xen 2 does not support x86_64; many patches are not supported; high overhead on high-speed networks.
  Reboot: remote control with IPMI, RSA, etc.; disk repartitioning, if necessary; reboot or kernel switch (kexec).
  Currently we use reboot, but Xen will be used in the default environment. Let users select their experimental environment: fully dedicated, or shared within a virtual machine.
  CCGSC'06, Asheville

  11. Experimental condition injectors
  • Network traffic generator, based on a non-Gaussian long-memory model of the main TCP application throughputs measured on RENATER: Gamma + FARIMA, i.e. a Γ(a,b) marginal combined with a FARIMA(p, d, q) process. (Plot: throughput traces under normal traffic, DoS attack and flash crowd conditions, with link delays from D = 2 ms to D = 400 ms.)
  • Fault injector: FAIL (FAult Injection Language).
  CCGSC'06, Asheville
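
  For illustration only (this is not the actual Grid’5000 injector), the sketch below shows one way such a "Gamma + FARIMA" traffic series can be sampled: a long-memory FARIMA(0, d, 0) Gaussian process mapped to Gamma(a, b) marginals through a copula-style transform. All parameter names and values are assumptions.

```python
# Minimal sketch, assuming a Gamma marginal + FARIMA(0, d, 0) long-memory model.
import numpy as np
from scipy.stats import norm, gamma

def farima_0d0(n, d, rng):
    """Approximate FARIMA(0, d, 0) via a truncated MA(inf) expansion."""
    k = np.arange(1, n)
    psi = np.concatenate(([1.0], np.cumprod((k - 1 + d) / k)))  # psi_k = psi_{k-1}*(k-1+d)/k
    eps = rng.standard_normal(2 * n)
    return np.convolve(eps, psi, mode="full")[n:2 * n]           # drop burn-in samples

def gamma_farima_traffic(n, d=0.3, a=2.0, b=5.0, seed=0):
    """Long-memory series with Gamma(a, scale=b) marginals (illustrative parameters)."""
    rng = np.random.default_rng(seed)
    x = farima_0d0(n, d, rng)
    u = norm.cdf((x - x.mean()) / x.std())   # Gaussian ranks in (0, 1)
    return gamma.ppf(u, a, scale=b)          # map to the non-Gaussian marginal

if __name__ == "__main__":
    y = gamma_farima_traffic(10_000)
    print("mean synthetic throughput (arbitrary units):", round(y.mean(), 2))
```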

  12. Agenda • Grid’5000 • Early results: • Communities, • Platform usage, • Experiments • Experiment on Fault tolerant MPI CCGSC'06, Asheville

  13. Community: Grid’5000 users — 345 registered users coming from 45 laboratories, including: Univ. Nantes, Sophia, CS-VU.nl, FEW-VU.nl, Univ. Nice, ENSEEIHT, CICT, IRIT, CERFACS, ENSIACET, INP-Toulouse, SUPELEC, IBCP, IMAG, INRIA-Alpes, INSA-Lyon, Prism-Versailles, BRGM, INRIA, CEDRAT, IME/USP.br, INF/UFRGS.br, LORIA, UFRJ.br, LABRI, LIFL, ENS-Lyon, EC-Lyon, IRISA, RENATER, IN2P3, LIFC, LIP6, UHP-Nancy, France-telecom, LRI, IDRIS, AIST.jp, UCD.ie, LIPN-Paris XIII, U-Picardie, EADS, EPFL.ch, LAAS, ICPS-Strasbourg. CCGSC'06, Asheville

  14. Grid’5000 activity (Feb ’06) — activity > 70%; April: just before the SC’06 and Grid’06 deadlines. CCGSC'06, Asheville

  15. Experiment — Geophysics: Seismic Ray Tracing in a 3D mesh of the Earth
  Stéphane Genaud, Marc Grunberg, and Catherine Mongenet, IPGS ("Institut de Physique du Globe de Strasbourg").
  Building a seismic tomography model of the Earth's geology using seismic wave propagation characteristics in the Earth. Seismic waves are modeled from events detected by sensors. Ray tracing algorithm: waves are reconstructed from rays traced between the epicenter and the sensors.
  An MPI parallel program composed of 3 steps: 1) master-worker: ray tracing and mesh update by each process, with blocks of rays successively fetched from the master process; 2) all-to-all communications to exchange submesh information between the processes; 3) merging of the cell information of the submesh associated with each process.
  (Performance plot — reference: 32 CPUs.)
  CCGSC'06, Asheville
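
  A hedged mpi4py sketch of the three-step structure described above (master-worker block distribution, all-to-all submesh exchange, merge). The ray tracer is a placeholder, and all sizes and names are illustrative assumptions, not the IPGS code.

```python
# Sketch of the 3-phase MPI structure; run with e.g. `mpirun -n 4 python sketch.py`.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
NCELLS, BLOCK = 1000, 50          # assumed mesh size / ray-block size

def trace_rays(block):            # placeholder for the real ray tracer
    contrib = np.zeros(NCELLS)
    for ray in block:
        contrib[ray % NCELLS] += 1.0
    return contrib

submesh = np.zeros(NCELLS)
if rank == 0:                                      # step 1: master hands out ray blocks
    rays, active = list(range(10_000)), size - 1
    while active:
        worker = comm.recv(source=MPI.ANY_SOURCE)  # a worker asks for work
        if rays:
            comm.send(rays[:BLOCK], dest=worker); rays = rays[BLOCK:]
        else:
            comm.send(None, dest=worker); active -= 1
else:
    while True:
        comm.send(rank, dest=0)                    # request a block
        block = comm.recv(source=0)
        if block is None:
            break
        submesh += trace_rays(block)               # local mesh update

# step 2: all-to-all exchange of submesh pieces (each rank owns a slice of cells)
chunks = np.array_split(submesh, size)
received = comm.alltoall(chunks)

# step 3: merge the cell information for the cells owned by this rank
owned = np.sum(received, axis=0)
print(f"rank {rank}: merged {owned.shape[0]} cells")
```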

  16. Solving the Flow-Shop Scheduling Problem — "one of the hardest challenging problems in combinatorial optimization"
  • Schedule a set of jobs on a set of machines, minimizing the makespan.
  • Exhaustive enumeration of all combinations would take several years. The challenge is to reduce the number of explored solutions.
  • New Grid-enabled Branch-and-Bound algorithm (Talbi, Melab, et al.), combining grid computing, dynamic load balancing of an irregular tree, fault tolerance, termination detection of asynchronous processes, and global information sharing.
  • The new approach is based on a special encoding of the explored tree and of the work units, allowing dynamic distribution and checkpointing on the Grid and a strong reduction of the communication needed for sharing global information.
  • Problem: 50 jobs on 20 machines, optimally solved for the 1st time, with 1245 CPUs (peak).
  • Involved Grid’5000 sites (6): Bordeaux, Lille, Orsay, Rennes, Sophia-Antipolis and Toulouse.
  • The optimal solution required a wall-clock time of 1 month and 3 weeks.
  CCGSC'06, Asheville
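
  As a minimal illustration of the kind of search tree such an algorithm explores (not the Talbi/Melab grid-enabled method, and nowhere near the 50x20 instance), here is a sequential branch-and-bound sketch for a tiny permutation flow-shop instance with a deliberately crude lower bound; the grid version distributes and checkpoints subtrees of exactly this kind of search.

```python
# Toy branch-and-bound for the permutation flow-shop problem (minimize makespan).
P = [[5, 9, 8], [7, 3, 10], [6, 4, 5], [8, 2, 6]]   # P[job][machine], assumed data
JOBS, MACH = len(P), len(P[0])

def makespan(seq):
    """Completion time on the last machine for a (partial) job sequence."""
    c = [0] * MACH
    for j in seq:
        c[0] += P[j][0]
        for m in range(1, MACH):
            c[m] = max(c[m], c[m - 1]) + P[j][m]
    return c[-1]

def lower_bound(seq, remaining):
    """Crude bound: current makespan + remaining work on the last machine."""
    return makespan(seq) + sum(P[j][-1] for j in remaining)

best = [float("inf"), None]

def branch(seq, remaining):
    if not remaining:
        cmax = makespan(seq)
        if cmax < best[0]:
            best[0], best[1] = cmax, seq
        return
    for j in remaining:                                   # branch on the next job
        child = seq + (j,)
        if lower_bound(child, remaining - {j}) < best[0]: # prune dominated subtrees
            branch(child, remaining - {j})

branch((), frozenset(range(JOBS)))
print("optimal makespan:", best[0], "sequence:", best[1])
```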

  17. Fully Distributed Batch Scheduler
  • Motivation: evaluation of a fully distributed resource allocation service (batch scheduler).
  • Vigne: unstructured network, flooding (random walk optimized for scheduling).
  • Experiment: a bag of 944 homogeneous tasks on 944 CPUs; synthetic sequential code (Monte Carlo application).
  • Measure the mean execution time per task (computation time depends on the resource) and the overhead compared with an ideal execution (central coordinator). Objective: 1 task per CPU.
  • Tested configuration: 944 CPUs — Bordeaux (82), Orsay (344), Rennes Paraci (98), Rennes Parasol (62), Rennes Paravent (198), Sophia (160).
  • Result: mean 1972 s; total experiment duration: 12 h.
  CCGSC'06, Asheville
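
  A toy simulation in the spirit of this experiment (not Vigne's actual protocol): tasks are placed by bounded random walks on an unstructured overlay, and the resulting load can be compared with the ideal of one task per CPU. Node counts mirror the experiment; everything else (degree, TTL, function names) is an assumption.

```python
# Random-walk placement on an unstructured overlay, illustrative only.
import random

def build_overlay(n_nodes, degree=4, seed=1):
    """Random unstructured overlay: each node knows `degree` random neighbours."""
    rng = random.Random(seed)
    return {u: rng.sample([v for v in range(n_nodes) if v != u], degree)
            for u in range(n_nodes)}

def random_walk_schedule(overlay, load, start, ttl=16, rng=random):
    """Walk up to `ttl` hops and return the least-loaded node encountered."""
    current, best = start, start
    for _ in range(ttl):
        current = rng.choice(overlay[current])
        if load[current] < load[best]:
            best = current
    return best

if __name__ == "__main__":
    N, TASKS = 944, 944                      # mirrors the 944 tasks / 944 CPUs setup
    overlay = build_overlay(N)
    load = [0] * N
    for _ in range(TASKS):
        node = random_walk_schedule(overlay, load, start=random.randrange(N))
        load[node] += 1                      # one more task placed on that CPU
    print("max tasks on a single node:", max(load), "(ideal is 1)")
```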

  18. Large Scale experiment of DIET: a GridRPC environment
  1120 clients submitted more than 45,000 real GridRPC requests (dgemm matrix multiply) to GridRPC servers.
  Objectives: prove the scalability of DIET; test the functionalities of DIET at large scale.
  7 sites: Lyon, Orsay, Rennes, Lille, Sophia, Toulouse, Bordeaux — 8 clusters, 585 machines, 1170 CPUs.
  Raphaël Bolze. CCGSC'06, Asheville

  19. Challenge: use P2P mechanisms to build grid middleware
  • Goals: study of a JXTA "DHT". "Rendezvous" peers form the JXTA DHT. What is the performance of this DHT? What is its scalability?
  • Organization of a JXTA overlay (peerview protocol): each rendezvous peer has a local view of the other rendezvous peers; a loosely-consistent DHT between rendezvous peers; a mechanism for ensuring convergence of the local views — the key point for the efficiency of the discovery protocol.
  • Benchmark: time for a local view to converge, up to 580 nodes on 6 sites.
  CCGSC'06, Asheville
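
  A small simulation sketch of the benchmark idea, assuming a simple push-style gossip of local views (this is not the JXTA peerview protocol): it counts the rounds needed for every rendezvous peer's local view to contain all peers.

```python
# Convergence of loosely-consistent local views under a toy gossip scheme.
import random

def gossip_convergence(n_peers, fanout=3, seed=0):
    rng = random.Random(seed)
    views = {p: {p} for p in range(n_peers)}        # each peer starts knowing only itself
    rounds = 0
    while any(len(v) < n_peers for v in views.values()):
        rounds += 1
        for p in range(n_peers):
            # push the local view to a few peers already present in that view
            for q in rng.sample(sorted(views[p]), min(fanout, len(views[p]))):
                views[q] |= views[p]
        for p in range(n_peers):
            views[p].add(rng.randrange(n_peers))     # learn about a random contact
    return rounds

if __name__ == "__main__":
    for n in (60, 290, 580):                         # up to the 580 nodes of the experiment
        print(f"{n} peers -> converged after {gossip_convergence(n)} gossip rounds")
```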

  20. First Grid’5000 Winter School: 117 participants (we tried to limit it to 100), mixing Grid’5000 and non-Grid’5000 users.
  (Pie charts: participant breakdown by role — engineers, students, scientists, postdocs — by site — Toulouse, Bordeaux, Sophia, Grenoble, Rennes, Lille, Orsay, Nancy, Lyon — and by discipline — computer science, physics, mathematics, biology, chemistry.)
  Don't miss the Second Grid’5000 Winter School in Jan. 2007: http://ego-2006.renater.fr/
  • Topics and exercises: Reservation • Reconfiguration • MPI on the Cluster of Clusters • Virtual Grid based on Globus GT4
  CCGSC'06, Asheville

  21. Grid@work (October 10-14, 2005)
  • Series of conferences and tutorials, including the Grid PlugTest (N-Queens and Flowshop contests). The objective of this event was to bring together ProActive users, to present and discuss current and future features of the ProActive Grid platform, and to test the deployment and interoperability of ProActive Grid applications on various Grids.
  • The N-Queens Contest (4 teams): the aim was to find the number of solutions to the N-queens problem, with N as big as possible, in a limited amount of time. The Flowshop Contest (3 teams).
  • 1600 CPUs in total: 1200 provided by Grid’5000 + 50 by the other Grids (EGEE, DEISA, NorduGrid) + 350 CPUs on clusters.
  Don't miss Grid@work 2006, Nov. 26 to Dec. 1: http://www.etsi.org/plugtests/Upcoming/GRID2006/GRID2006.htm
  CCGSC'06, Asheville
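
  For context, the kernel the N-Queens contest parallelizes is a solution counter; below is a standard sequential bitmask backtracking sketch of that computation (contest entries distribute subtrees of this search over the Grid).

```python
# Count all N-Queens solutions with bitmask backtracking (sequential reference).
def count_nqueens(n):
    full = (1 << n) - 1
    def solve(cols, left_diags, right_diags):
        if cols == full:                      # a queen in every column: one solution
            return 1
        total = 0
        free = full & ~(cols | left_diags | right_diags)
        while free:
            bit = free & -free                # lowest free square in this row
            free ^= bit
            total += solve(cols | bit,
                           (left_diags | bit) << 1 & full,
                           (right_diags | bit) >> 1)
        return total
    return solve(0, 0, 0)

if __name__ == "__main__":
    for n in range(4, 13):
        print(n, count_nqueens(n))            # 4 -> 2, 8 -> 92, ...
```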

  22. Agenda Grid’5000 Early results Experiment on fault tolerant MPI [SC2006] CCGSC'06, Asheville

  23. Fault tolerant MPI
  High performance computing platforms are using more and more CPUs, network devices, disks, etc. Several projects (FT-MPI, LAM, LA-MPI, MPICH-V, etc.) are developing fault tolerant MPI implementations. The MPICH-V project, started in 2001, has made extensive comparisons between automatic fault tolerant protocols for MPI on large scale platforms. The MPICH-V team is now integrating its skills into MPICH2 and OpenMPI.
  There are still many research issues on automatic fault tolerant MPI: fault tolerant protocols (low overhead, fast recovery, tolerance to many failure scenarios); checkpointing (timing, fast and robust, distributed storage); complex platforms (Grids, Desktop Grids, MPPs).
  This experiment: coordinated checkpointing is one of the best fault tolerant protocols. Two strategies: blocking (LAM, etc.) and non-blocking (MPICH-V).
  CCGSC'06, Asheville

  24. Blocking vs. non-blocking coordinated checkpointing on the Grid — Blocking
  • All MPI processes communicate directly (no extra memory copy).
  • A process manager runs on every node, receiving the checkpoint order and computing the checkpoint.
  • During a checkpoint, every process sends markers to flush the network channels.
  • After receiving a marker message, a process stops sending MPI messages.
  • When all incoming channels are flushed, the checkpoint is executed.
  Main advantage: no overhead on computation/communication except during checkpoint. Main drawback: communications stop during checkpoint.
  CCGSC'06, Asheville
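
  A minimal sketch of the blocking protocol's control logic, as described above (illustrative class and message names, not MPICH2's ft-sock channel): on the checkpoint order each process pushes markers on its outgoing channels, stops sending application messages, and checkpoints once a marker has arrived on every incoming channel.

```python
# Toy simulation of blocking coordinated checkpointing (channel flushing).
from collections import deque

class Proc:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.inbox = {src: deque() for src in range(n) if src != pid}
        self.flushed = set()          # incoming channels whose marker has arrived
        self.blocked = False          # true between the checkpoint order and the checkpoint
        self.state = 0                # toy application state

    def checkpoint_order(self, procs):
        self.blocked = True           # stop sending application messages
        for dst in range(self.n):     # flush every outgoing channel with a marker
            if dst != self.pid:
                procs[dst].inbox[self.pid].append(("MARKER", None))

    def drain(self):
        for src, q in self.inbox.items():
            while q:
                kind, payload = q.popleft()
                if kind == "MARKER":
                    self.flushed.add(src)
                else:
                    self.state += payload      # deliver an in-flight application message
        if self.blocked and len(self.flushed) == self.n - 1:
            print(f"P{self.pid}: all channels flushed, checkpointing state={self.state}")
            self.blocked, self.flushed = False, set()

if __name__ == "__main__":
    N = 3
    procs = [Proc(p, N) for p in range(N)]
    procs[0].inbox[1].append(("APP", 5))       # an in-flight message before the order
    for p in procs:
        p.checkpoint_order(procs)
    for p in procs:
        p.drain()
```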

  25. Non-blocking (using the reconfiguration and scale of Grid’5000)
  • A communication daemon is associated with every MPI process.
  • All received messages are first copied by the communication daemon.
  • When a communication daemon receives a first marker, it checkpoints the MPI process immediately and sends marker messages on all channels.
  • The MPI process continues to run, sending messages as normal.
  • All messages received between the first and last markers are saved by the communication daemon and forwarded to the MPI process.
  • Only when all markers have been received is the daemon itself checkpointed.
  Main advantage: computations and communications continue during checkpoint. Main drawback: communication overhead, since all received messages are copied by the communication daemon.
  CCGSC'06, Asheville
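
  A companion sketch of the non-blocking daemon logic described above (again illustrative names only, not the MPICH-V implementation): the daemon checkpoints its MPI process on the first marker, keeps forwarding traffic, logs messages that arrive between the first and last markers, and checkpoints itself once all markers are in.

```python
# Toy simulation of the non-blocking (message-logging) coordinated checkpoint.
from collections import deque

class Daemon:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.marker_from = set()      # channels on which a marker was already seen
        self.in_snapshot = False
        self.logged = []              # in-flight messages saved during the snapshot
        self.proc_state = 0           # stands in for the MPI process image

    def start_snapshot(self, net):
        self.in_snapshot = True
        print(f"D{self.pid}: MPI process checkpointed, state={self.proc_state}")
        for dst in range(self.n):     # propagate markers on every outgoing channel
            if dst != self.pid:
                net.append((self.pid, dst, ("MARKER", None)))

    def receive(self, src, msg, net):
        kind, payload = msg
        if kind == "MARKER":
            if not self.in_snapshot:
                self.start_snapshot(net)                    # first marker: checkpoint now
            self.marker_from.add(src)
            if len(self.marker_from) == self.n - 1:         # last marker received
                print(f"D{self.pid}: daemon checkpointed, logged={self.logged}")
                self.in_snapshot, self.marker_from = False, set()
        else:
            if self.in_snapshot:
                self.logged.append((src, payload))          # message crossing the snapshot
            self.proc_state += payload                      # always forward to the MPI process

if __name__ == "__main__":
    N = 3
    daemons = [Daemon(p, N) for p in range(N)]
    net = deque()                                    # (src, dst, message) in flight
    daemons[0].start_snapshot(net)                   # checkpoint initiated at D0
    net.append((1, 2, ("APP", 7)))                   # application message in flight
    while net:
        src, dst, msg = net.popleft()
        daemons[dst].receive(src, msg, net)
```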

  26. Blocking vs. non-blocking coordinated checkpointing on the Grid — experimental setup
  All measurements in a homogeneous environment including: homogeneous clusters with AMD Opteron 248 (2.2 GHz / 1 MB L2 cache) dual-processors; the Berkeley Linux Checkpoint/Restart library; all nodes rebooted under Linux 2.6.13.5; MPICH2 (the fault-tolerant implementation consists of a new channel: ft-sock).
  Two experiments: 1) a single Myrinet cluster: Bordeaux with 64 CPUs; 2) 6 clusters: 96 CPUs (Bordeaux), 106 CPUs (Lille), 432 CPUs (Orsay), 128 CPUs (Rennes), 210 CPUs (Sophia) and 116 CPUs (Toulouse).
  (Figure: communication performance of RENATER 3 measured with NetPerf.)
  CCGSC'06, Asheville

  27. Blocking VS. Non Blocking coordinated checkpointing on the Grid Performance evaluation on a Myrinet Cluster (64 CPUs) CCGSC'06, Asheville

  28. Blocking VS. Non Blocking coordinated checkpointing on the Grid Performance evaluation on Grid’5000 Several experiments up to 529 CPUs over 6 sites. Ex: execution time of BT Class B at 400 CPUs, with blocking Ckpt. CCGSC'06, Asheville

  29. Conclusion: toward an international platform
  The distributed systems (P2P, Grid) and networking communities recognize the necessity of large scale experimental platforms (PlanetLab, Emulab, Grid’5000, DAS, I-Explosion, GENI, etc.).
  (Map: Grid’5000 alongside DAS — 1500 CPUs, Sept. 2006 — GENI (NSF), and Japanese platforms (Tsubame, I-Explosion) — 2600 CPUs.)
  CCGSC'06, Asheville

  30. QUESTIONS? CCGSC'06, Asheville
