This research article discusses the use of parallel discrete event simulation (DES) on high-performance clusters. It covers the synchronization problems in parallel DES and introduces the CSAM tool. The architecture of the simulator kernel and the communication network model are also explained. Results on both mono-processor and multi-processor clusters are presented.
Parallel Simulations on High-Performance Clusters C.D. Pham RESAM laboratory Univ. Lyon 1, France cpham@resam.univ-lyon1.fr
Outline • Background • Discrete Event Simulation (DES) • Parallel DES and the synchronization problems • The CSAM Tool • Architecture of the simulator kernel • The communication network model • Results • On a mono-processor cluster • On a multi-processor cluster
Simulation • To simulate is to reproduce the behavior of a physical system with a model • In practice, computers are used to numerically simulate a logical model • Simulations are used for performance evaluation and prediction of complex systems • fluid dynamics, chemical reactions (continuous) • communication network models: routing, congestion avoidance, mobile… (discrete) • Simulation is more flexible than analytical methods
Discrete Event Simulation (DES) • assumption that a system changes its state at discrete points in simulation time • [Figure: time axis in steps of t from 0 to 6t, with events a1 to a4 and d1 to d3 moving the system between states S1, S2 and S3]
DES concepts • fundamental concepts: • system state (variables) • state transitions (events) • simulation time: totally ordered set of values representing time in the system being modeled • the system state can only be modified upon reception of an event • modeling can be • event-oriented • process-oriented
Life cycle of a DES • a DES system can be viewed as a collection of simulated objects and a sequence of event computations • each event computation contains a time stamp indicating when that event occurs in the physical system • each event computation may: • modify state variables • schedule new events into the simulated future • events are stored in a local event list • events are processed in time stamp order • usually, the simulation terminates when no events remain
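The life cycle above can be sketched as a small sequential kernel. This is only an illustration, not code from CSAM: the `Simulator` class, the `arrival` and `departure` handlers and the `SEND_TIME` constant are all hypothetical names chosen for the example.

```python
import heapq
from itertools import count


class Simulator:
    """Minimal sequential DES kernel: one clock, one time-ordered event list."""

    def __init__(self):
        self.now = 0
        self._events = []     # heap of (timestamp, seq, handler, payload)
        self._seq = count()   # tie-breaker: equal timestamps stay in FIFO order
        self.trace = []       # record of processed events, for inspection

    def schedule(self, timestamp, handler, payload=None):
        # An event may only be scheduled at the current time or in the future.
        if timestamp < self.now:
            raise ValueError("cannot schedule an event in the past")
        heapq.heappush(self._events, (timestamp, next(self._seq), handler, payload))

    def run(self):
        # Always process the event with the smallest time stamp first;
        # stop when the event list is empty.
        while self._events:
            timestamp, _, handler, payload = heapq.heappop(self._events)
            self.now = timestamp
            self.trace.append((timestamp, handler.__name__, payload))
            handler(self, payload)


SEND_TIME = 5  # hypothetical processing delay before a packet leaves


def arrival(sim, packet):
    # An event computation may schedule new events into the simulated future.
    sim.schedule(sim.now + SEND_TIME, departure, packet)


def departure(sim, packet):
    pass  # a fuller model would update state variables here


sim = Simulator()
sim.schedule(12, arrival, "P2")   # deliberately scheduled out of order
sim.schedule(5, arrival, "P1")
sim.run()
```

Even though P2 is scheduled before P1, the heap returns events in time stamp order, so `sim.trace` lists P1's arrival and departure before P2's arrival.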
A simple DES model • two nodes, A and B, connected by a link • link model delay = 5 • send processing time = 5 • receive processing time = 1 • packet arrivals: P1 at 5, P2 at 12, P3 at 22 • local event list: • <e1,5> A receives packet P1 • <e2,10> A sends P1 to B • <e3,12> A receives packet P2 • <e4,15> B receives P1 from A • <e5,16> B sends ACK(P1) to A • <e6,17> A sends P2 to B • <e7,21> A receives ACK(P1) • <e8,23> B receives P2 from A • <e9,22> A receives packet P3
Why does it work? • events are processed in time stamp order • an event at time t can only generate future events with a timestamp greater than or equal to t (no event in the past) • generated events are inserted into the event list and sorted by timestamp • the event with the smallest timestamp is always processed first • causality constraints are therefore implicitly maintained
Why change? It's so simple! • models become larger and larger • the simulation time is overwhelming or the simulation is simply intractable • examples: • parallel programs with millions of lines of code, • mobile networks with millions of mobile hosts, • ATM networks with hundreds of complex switches, • multicast models with thousands of sources, • the ever-growing Internet, • and much more...
Some figures to convince... • ATM network models • Simulation at the cell level, • 200 switches, • 1000 traffic sources, 50 Mbits/s, • 155 Mbits/s links, • 1 simulation event per cell arrival. • More than 26 billion events to simulate 1 second! 30 hours if 1 event is processed in 1 µs • simulation time increases as link speed increases, • usually more than 1 event per cell arrival, • how scalable is traditional simulation?
Parallel simulation - principles • execution of a discrete event simulation on a parallel or distributed system with several physical processors • the simulation model is decomposed into several sub-models that can be executed in parallel • spatial partitioning, • temporal partitioning, • radically different from simple simulation replications.
Parallel simulation - pros & cons • pros • reduction of the simulation time, • increase of the model size, • cons • causality constraints are difficult to maintain, • special mechanisms are needed to synchronize the different processors, • both the model and the simulation kernel become more complex. • challenges • ease of use, transparency.
Parallel simulation - example • [Figure: logical processes (LP) exchanging packets and events in parallel]
A simple PDES model • the same model split across two logical processes, A and B, each with its own local event list • link model delay = 5 • send processing time = 5 • receive processing time = 1 • packet arrivals: P1 at 5, P2 at 12, P3 at 22 • local event list of A: <e1,5> A rec. packet P1, <e2,10> A sends P1 to B, <e3,12> A rec. packet P2, <e6,17> A sends P2 to B, <e9,22> A rec. packet P3 • local event list of B: <e4,15> B rec. P1 from A, <e5,16> B sends ACK(P1), <e8,23> B rec. P2 from A • <e7,21> A rec. ACK(P1) reaches A after A has already processed e9 at time 22: causality error (violation)
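The causality error in this example can be reproduced with a few lines of code. The sketch below is illustrative only: `LogicalProcess` is a hypothetical class that processes events in the order it is handed them and records any event whose time stamp is older than its local clock.

```python
class LogicalProcess:
    """Toy LP that processes events as they come, with no synchronization."""

    def __init__(self, name):
        self.name = name
        self.clock = 0          # local simulation time
        self.violations = []    # (label, event timestamp, local clock)

    def process(self, label, timestamp):
        # An event older than the local clock arrives "in the past":
        # the local causality constraint has been violated.
        if timestamp < self.clock:
            self.violations.append((label, timestamp, self.clock))
            return False
        self.clock = timestamp
        return True


lp_a = LogicalProcess("A")

# Events that A can generate or receive locally, taken from the example.
for label, ts in [("e1", 5), ("e2", 10), ("e3", 12), ("e6", 17), ("e9", 22)]:
    lp_a.process(label, ts)

# The ACK from B carries time stamp 21 but arrives once A is already at 22.
accepted = lp_a.process("e7", 21)
```

After the run, `accepted` is `False` and `lp_a.violations` holds the single straggler `("e7", 21, 22)`.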
Synchronization problems • fundamental concepts • each Logical Process (LP) can be at a different simulation time • local causality constraint: events in each LP must be executed in time stamp order • synchronization algorithms • Conservative: avoids local causality violations by waiting until it is safe to process an event • Optimistic: allows local causality violations but provides mechanisms to recover from them at runtime
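A minimal sketch of the conservative rule, assuming FIFO input channels that deliver messages in non-decreasing time stamp order. The `ConservativeLP` class and the channel names are hypothetical and are not taken from CSAM; the numbers reuse the A/B example.

```python
from collections import deque


class ConservativeLP:
    """LP that processes an event only when no input channel can still
    deliver an older one (assumes FIFO, time-ordered channels)."""

    def __init__(self, channel_names):
        self.queues = {c: deque() for c in channel_names}
        self.channel_clock = {c: 0 for c in channel_names}  # last time stamp seen
        self.clock = 0
        self.processed = []

    def receive(self, channel, timestamp, label):
        # Channels must deliver messages in non-decreasing time stamp order.
        assert timestamp >= self.channel_clock[channel]
        self.queues[channel].append((timestamp, label))
        self.channel_clock[channel] = timestamp

    def _channel_time(self, channel):
        # Earliest time stamp this channel could still hand to the LP.
        queue = self.queues[channel]
        return queue[0][0] if queue else self.channel_clock[channel]

    def step(self):
        """Process one event if it is safe; return False if the LP must wait."""
        channel = min(self.queues, key=self._channel_time)
        if not self.queues[channel]:
            return False  # an older message may still arrive on this channel
        timestamp, label = self.queues[channel].popleft()
        self.clock = timestamp
        self.processed.append((label, timestamp))
        return True


lp_a = ConservativeLP(["source", "from_B"])
for ts, label in [(5, "P1"), (12, "P2"), (22, "P3")]:
    lp_a.receive("source", ts, label)

blocked_at_start = not lp_a.step()   # nothing is known yet about from_B

lp_a.receive("from_B", 21, "ACK(P1)")
while lp_a.step():
    pass
```

Here A holds back P3 (time stamp 22) until it knows that nothing older can arrive from B, so ACK(P1) at 21 is processed first. The initial blocking also shows why conservative kernels need extra information such as null messages to keep making progress.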
CSAM (Pham, UCBL) • CSAM: Conservative Simulator for ATM network Model • Simulation at the cell-level • Conservative and/or sequential • C++ programming-style, predefined generic model of sources, switches, links… • New models can be easily created by deriving from base classes • Configuration file that describes the topology
CSAM - Kernel characteristics • Exploits the lookahead of communication links: transparent for the user • Virtual Input Channels • reduce overhead for event manipulation, • reduce overhead for null-message handling. • Cyclic event execution • Message aggregation • static aggregation size, • asymmetric aggregation size on CLUMPS, • sender-initiated, • receiver-initiated.
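As an illustration of sender-initiated aggregation with a static aggregation size, the sketch below buffers outgoing messages and hands them to the network in batches. It is a guess at the general idea, not CSAM's implementation: `AggregatingSender`, `transmit` and `aggregation_size` are hypothetical names.

```python
class AggregatingSender:
    """Buffers outgoing messages and transmits them in fixed-size batches."""

    def __init__(self, transmit, aggregation_size):
        self.transmit = transmit                  # callable taking a list of messages
        self.aggregation_size = aggregation_size  # static batch size
        self.buffer = []

    def send(self, timestamp, payload):
        self.buffer.append((timestamp, payload))
        # Sender-initiated: the sender decides when the batch is full.
        if len(self.buffer) >= self.aggregation_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered, e.g. at the end of an execution cycle.
        if self.buffer:
            self.transmit(list(self.buffer))
            self.buffer.clear()


batches = []  # stands in for the network: each entry is one physical message
sender = AggregatingSender(batches.append, aggregation_size=3)
for i in range(7):
    sender.send(10 + i, f"cell{i}")
sender.flush()
```

Seven cell messages are carried in three physical messages of sizes 3, 3 and 1. Larger batches amortize the per-message communication cost, at the price of delaying the information the receiver needs to advance.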
Test case: 78-switch ATM network Distance-Vector Routing with dynamic link cost functions Connection setup, admission control protocols
Why is it difficult? • Very small granularity: 1 message represents 1 cell transfer • high level of message synchronization • very small computation/communication ratio • Load imbalance between links • large number of control messages • partitioning and load balancing are difficult
CSAM - Some results... Routing protocol’s reconfiguration time
Parallel Simulation on High Performance Clusters • Myrinet-based cluster of 12 Pentium Pro at 200MHz, 64 MBytes, Linux • Myrinet-based cluster of 4 dual Pentium Pro 450MHz, 128 Mbytes, Linux • Myrinet board with LANai 4.1, 256KB • BIP, BIP-SMP, MPI/BIP, MPI/BIP-SMP communication libraries
Speedup on a Myrinet cluster • Pentium Pro 200MHz • More than 53 million events to simulate 0.31s
Speedup with CLUMPS Dual Pentium Pro 450MHz
Increasing the model size (CLUMPS) Dual Pentium Pro 450MHz, 4x2 int
Conclusions • Parallel simulation is very sensitive to latency • High-performance clusters are a good alternative to traditional massively parallel computers • CLUMPS architectures are very attractive, as the cost of the communication cards can be cut in half