
Modeling Billion-Node Torus Networks Using Massively Parallel Discrete-Event Simulation




  1. Modeling Billion-Node Torus Networks Using Massively Parallel Discrete-Event Simulation Ning Liu, Christopher Carothers liun2@cs.rpi.edu

  2. Outline • Background • Torus model, traffic model • BG/L • ROSS: Massively Parallel Simulator • Experimental results • Future work

  3. Background • CODES: Enabling Co-Design of Multilayer Exascale Storage Architectures • CODES GOAL: Develop a simulation framework for evaluating exascale storage design challenges. • Hardware Models • Storage Software Models • Storage System Architecture • Exascale I/O Workload Models • Simulation Framework - Integrate models and storage software into simulation framework

  4. Torus Network • The Blue Gene and Cray XT supercomputer families adopt a 3-D torus interconnect • The upcoming Blue Gene/Q will have a 5-D torus network • Tori provide low latency and high bandwidth at a moderate construction cost (see the coordinate sketch below)
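
As a concrete illustration of the 3-D torus topology, here is a minimal sketch in C; the dimension sizes and function names are hypothetical, not taken from the model itself. Each node id maps to (x, y, z) coordinates, and every node has wraparound neighbors in each dimension.

#include <stdio.h>

#define DIM_X 32
#define DIM_Y 32
#define DIM_Z 32

/* Convert a linear node id into (x, y, z) torus coordinates. */
static void node_to_coords(int id, int *x, int *y, int *z)
{
    *x = id % DIM_X;
    *y = (id / DIM_X) % DIM_Y;
    *z = id / (DIM_X * DIM_Y);
}

/* Neighbor in the +1 direction along the x dimension, with wraparound. */
static int plus_x_neighbor(int x, int y, int z)
{
    return ((x + 1) % DIM_X) + y * DIM_X + z * DIM_X * DIM_Y;
}

int main(void)
{
    int x, y, z;
    node_to_coords(1023, &x, &y, &z);
    printf("node 1023 -> (%d, %d, %d), +x neighbor = %d\n",
           x, y, z, plus_x_neighbor(x, y, z));
    return 0;
}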

  5. Torus Traffic and Routing • Markovian traffic models (sketched below) • Each node continuously generates packets • Destinations selected uniformly at random • Fixed packet size • Dynamic routing vs. static routing • Deadlock avoidance • BG/L eager/rendezvous protocols
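
A minimal sketch of the Markovian traffic model described above; the constants, struct, and helper names are assumptions for illustration, not the model's actual code. Each node draws an exponentially distributed inter-arrival gap, picks a uniformly random destination, and emits a fixed-size packet.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NUM_NODES   1024    /* hypothetical torus size                 */
#define PACKET_SIZE 256     /* fixed packet size in bytes (assumption) */
#define MEAN_GAP    0.1     /* mean inter-arrival time in ms (1/lambda) */

typedef struct {
    int    src, dest;
    int    size;
    double send_time;
} packet;

/* Exponential inter-arrival gap => Poisson (Markovian) packet arrivals. */
static double next_gap(void)
{
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* u in (0,1) */
    return -MEAN_GAP * log(u);
}

/* Generate the next packet from node `src` at simulation time `now`. */
static packet generate_packet(int src, double now)
{
    packet p = { src, rand() % NUM_NODES, PACKET_SIZE, now + next_gap() };
    return p;
}

int main(void)
{
    packet p = generate_packet(0, 0.0);
    printf("src=%d dest=%d size=%d send_time=%.4f ms\n",
           p.src, p.dest, p.size, p.send_time);
    return 0;
}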

  6. Discrete Event Model • Logical process (LP): one per torus node • Events (illustrative handler sketch below): • Packet_generate_event • Packet_send_event • Packet_arrival_event • Packet_process_event
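
A hedged, simplified sketch of how a single torus-node LP might dispatch the four event types listed above; the enum, struct fields, and handler name are an illustrative stand-in rather than the actual ROSS model code.

typedef enum {
    PACKET_GENERATE,
    PACKET_SEND,
    PACKET_ARRIVAL,
    PACKET_PROCESS
} event_type;

typedef struct {
    event_type type;
    int        src, dest;
    double     timestamp;
} torus_event;

typedef struct {
    long packets_sent;
    long packets_received;
} node_state;                /* per-LP state for one torus node */

/* Forward event handler for a single node LP. */
void node_event_handler(node_state *s, torus_event *ev)
{
    switch (ev->type) {
    case PACKET_GENERATE:
        /* create a new packet and schedule a PACKET_SEND on this node */
        s->packets_sent++;
        break;
    case PACKET_SEND:
        /* forward the packet one hop; schedule PACKET_ARRIVAL at the neighbor */
        break;
    case PACKET_ARRIVAL:
        /* packet reached a node; schedule PACKET_PROCESS locally */
        break;
    case PACKET_PROCESS:
        /* final destination reached: record statistics */
        s->packets_received++;
        break;
    }
}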

  7. Simulation Testbed: BG/L • 32-bit IBM PowerPC processors running at only 700 MHz • 1 GB of memory per node • 1,024 dual-processor nodes per rack • 16 racks and 32,768 processors, located at Rensselaer's Computational Center for Nanotechnology Innovations (CCNI) • Confusing? Yes: we simulate the BG/L torus on top of BG/L itself

  8. ROSS: Parallel Simulator • Supports serial, conservative, and optimistic simulation • Optimistic mode uses Jefferson's Time Warp event-scheduling mechanism • Rollbacks handled via reverse computation (sketched below)
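
The idea behind reverse computation, as a hedged sketch (not the actual ROSS torus model): each forward event handler is paired with a reverse handler that undoes its state changes when a Time Warp rollback occurs, so simple updates such as counters need no state-saving copies.

typedef struct {
    long arrived;      /* packets that have reached this node      */
    long in_flight;    /* packets currently buffered at this node  */
} rc_state;

/* Forward handler for a packet-arrival event. */
void arrival_forward(rc_state *s)
{
    s->arrived++;
    s->in_flight++;
}

/* Reverse handler: called on rollback; inverts the forward handler's
 * updates in the opposite order to restore the prior state. */
void arrival_reverse(rc_state *s)
{
    s->in_flight--;
    s->arrived--;
}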

  9. Validation Using Little's Law • Little's Law: the average number of customers in the system, L, equals the effective arrival rate, λ, times the average time a customer spends in the system, W; that is, L = λW
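
As a hedged numerical illustration (the figures are hypothetical, not measured results from the study): if each node injects packets at λ = 10 packets/ms and the average packet spends W = 0.5 ms in the network, Little's Law predicts L = λW = 5 packets in flight per node on average. The simulation is validated by checking that the packet population observed in the model agrees with this prediction.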

  10. Validation Using Little’s Law

  11. Latency Comparison: BG/L vs. Simulation • Using MPI_Send()/MPI_Recv() • Collected data from a 1,024-node torus in a 1 x 32 x 32 configuration (measurement sketch below)
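
A hedged sketch of the kind of MPI_Send()/MPI_Recv() ping-pong measurement behind such a latency comparison; the message size, repetition count, and rank pairing are assumptions, not the exact benchmark used in the study.

#include <mpi.h>
#include <stdio.h>

#define MSG_BYTES 32      /* assumed message size        */
#define REPS      1000    /* assumed number of round trips */

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_BYTES] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            /* rank 0 sends to rank 1 and waits for the echo */
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("one-way latency ~ %g us\n", elapsed / (2.0 * REPS) * 1e6);

    MPI_Finalize();
    return 0;
}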

  12. Performance Metrics • The performance study examines the impact of processor/core count on four primary metrics: • (i) committed event-rate, • (ii) percentage of remote events, • (iii) efficiency and • (iv) secondary rollbacks.

  13. Million-Node Torus Scalability • Packet injection rate: 10 packets/ms • Peak event rate of 4.78 billion events/sec

  14. Efficiency

  15. Remote Event Rate • Random destination selection creates a difficult scenario for parallel event scheduling: because destinations are chosen uniformly at random, most packets target nodes mapped to other processors, so a large fraction of events are remote

  16. Secondary Rollback Rate

  17. Billion-Node Torus Scalability • Consumes 2 TB of memory • Total number of generated packets is O(10^11) • Total number of scheduled events is O(10^13) • Packet injection rates of 200 pkt/ns and 400 pkt/ns • Higher rollback probability • Larger event population leads to increased queuing overhead

  18. Billion-Node Torus Scalability

  19. Billion-Node Torus Scalability

  20. Future work • Application workload models: Application I/O kernel models, I/O characterization models • I/O aggregator node models • I/O network models: network cards, switches, and topologies • I/O storage node models: storage software • I/O storage software: models and prototype system software • I/O controller models: RAID and enterprise storage devices • Disk models: HDDs and SSDs

  21. Future work • Increase the fidelity of the torus network model • Dynamic routing • Virtual channels • Different torus traffic models • Tree network model based on the Blue Gene families • MPI_Alltoall(), MPI_Bcast(), MPI_Reduce() • Complex I/O workload drivers, such as PHASTA

  22. Related Work • Heidelberger's use of the YAWNS protocol to model the Blue Gene/L torus network on a per-cycle basis appears to be one of the most accurate models created to date • Min and Ould Khaoua proposed a torus network model based on circuit switching

  23. Conclusions • Near-linear speedup for our torus model • Peak event rate on 32K cores of 4.78 billion events/sec • Demonstrated the ability to model million-node and billion-node torus networks on Blue Gene/L • Conducted comparison tests between the actual Blue Gene torus network and our model using MPI_Send()/MPI_Recv()

  24. Thank you for your attention! Questions?
