Bringing Up Anton: Taking Co-Design into Production Joseph A. Bank September 24, 2010 D. E. Shaw Research
Talk Outline • Brief history of Anton • Bringup challenges of Anton • The bringup lessons
A Brief History of Anton • Massively parallel special-purpose machine to accelerate Molecular Dynamics (MD) simulations • Custom-designed ASICs connected by a specialized toroidal network • First ASIC received Q1 2008 • 512-node machine operational Q4 2008 • 1-millisecond BPTI MD simulation Q2 2009 • 512-node machine achieves ~17,000 ns/day for 5DHFR (23,558 atoms) MD simulation
Bringup Challenges of Anton
• Application: MD itself is a bit hard to verify
    • Few simple metrics (energy drift, frms, folds/ms, …)
    • No one has simulated the time scales of Anton
• Algorithm Changes
    • Gaussian Split Ewald method for non-bonded far interactions
    • Neutral Territory method for non-bonded near interactions
• Architecture
    • Massively parallel heterogeneous system
    • 512+ nodes, 13 cores per node, 3 types of cores
    • Custom communication primitives
    • Fixed point instead of floating point
    • Resource optimized => I/D caches, SRAMs are all tightly constrained
• Software
    • From-scratch MD code base for Anton
    • Anton simulation preparation framework is complex
    • Dynamic code generation
    • Specialization to machine size, chemical system, etc.
Summary => Application/Architecture Co-design makes bringup uniquely challenging
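The "fixed point instead of floating point" item has a side effect worth illustrating: integer addition is associative, so a fixed-point sum does not depend on the order in which contributions arrive, whereas a floating-point sum can. The sketch below is only a generic illustration of that property. The 20-bit fractional format and the helper names (`to_fixed`, `sum_in_order`) are assumptions made for the example, not Anton's actual number format or code.

```python
SCALE = 1 << 20  # assumed fixed-point format: 20 fractional bits (hypothetical)

def to_fixed(x: float) -> int:
    """Quantize a float to a fixed-point integer (round to nearest)."""
    return round(x * SCALE)

def sum_in_order(values, order):
    """Accumulate values in the given index order."""
    total = 0
    for i in order:
        total += values[i]
    return total

contributions = [0.1, 0.2, 0.3]
forward = [0, 1, 2]
backward = [2, 1, 0]

# Floating point: the accumulation order changes the low-order bits.
float_forward = sum_in_order(contributions, forward)
float_backward = sum_in_order(contributions, backward)
float_matches = float_forward == float_backward

# Fixed point: integer addition gives the same bits in any order.
fixed = [to_fixed(c) for c in contributions]
fixed_matches = sum_in_order(fixed, forward) == sum_in_order(fixed, backward)

print("float sums identical:", float_matches)
print("fixed sums identical:", fixed_matches)
```

Order independence of this kind is one ingredient in making results independent of how work is distributed across nodes.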
Bringup Lesson Outline as Quips • “Do your homework” • “Where’s the chip?” • “Repeat yourself, over and over and over” • “Inspector gadget” • “Use your eyes” • “Target practice” • “Trust no one”
“Do Your Homework”: Preparing for Bringup • Desmond: Verification of algorithms, develop experience with MD simulation • Pyrite: Verification of fixed point calculation kernels • Detailed architectural simulator • Interface compatible with ASIC design (allowing co-simulation with RTL) • Enabled earliest possible development and testing of complete software stack (embedded code, prep time, etc) • During bringup the simulator could rerun simulations with much higher visibility of the architectural and software state.
“Where’s the chip?”: Dealing with Scarcity • Challenge: Anton’s primary designed mode of operation is “SRAM mode”, where all data fits in SRAM. This requires a configuration of at least 2x2x2 ASICs. During bringup, ASICs trickled in… • Solution: “DRAM mode” • We spent about 6 man-months of software development on a mode of operation that choreographs paging data into SRAM from DRAM and can perform large chemistry simulations on small Anton configurations (even single ASICs). • DRAM mode was used to test every ASIC individually and at each machine size we have built: 1, 2, 4, 8, 64, 128, 256, 512, 1024, 2048.
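The general idea of working through a data set that does not fit in fast memory can be sketched as a tiled pair loop: only a couple of "pages" are resident at a time, yet every pair is still visited exactly once. This is a generic out-of-core sketch, not a description of Anton's DRAM mode. `pair_energy` is a stand-in integer interaction and `page_size` is a hypothetical parameter.

```python
from itertools import combinations

def pair_energy(a: int, b: int) -> int:
    """Stand-in integer 'interaction' between two particles (hypothetical)."""
    return (a - b) ** 2

def total_in_memory(atoms):
    """Reference result: all atoms resident at once."""
    return sum(pair_energy(a, b) for a, b in combinations(atoms, 2))

def total_paged(atoms, page_size):
    """Same result, touching at most two pages of atoms at a time."""
    pages = [atoms[i:i + page_size] for i in range(0, len(atoms), page_size)]
    total = 0
    for i, page_a in enumerate(pages):        # "page in" the first tile
        total += total_in_memory(page_a)      # pairs inside the tile
        for page_b in pages[i + 1:]:          # "page in" a second tile
            total += sum(pair_energy(a, b) for a in page_a for b in page_b)
    return total

atoms = list(range(10))
print(total_paged(atoms, page_size=3) == total_in_memory(atoms))
```

Because the arithmetic is integer, the paged result matches the all-resident result exactly regardless of the page size.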
“Repeat Yourself”: Bit-wise Reproducibility • Anton and its embedded SW were designed to provide application level bit-wise reproducibility independent of HW configuration. • Detection: Rerun entire simulations and compare trajectories. • Primary means of detecting HW/SW bugs during bringup • Used with “golden” trajectories for suite of tests on every ASIC • Periodically used to check machine status • Isolation with Force comparison: Online checking of redundant force calculation • Generalized isolation with redundancy checker infrastructure: Online piecewise rerunning of simulation with arbitrary logging of lightweight checksums
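The "rerun and compare" detection step can be sketched as follows: checksum the exact bit pattern of each trajectory frame and report the first frame where a rerun departs from the golden run. This is a minimal illustration only. The frame layout (a list of 64-bit integer coordinates), the use of SHA-256, and the function names are assumptions for the example, not Anton's trajectory format or checker.

```python
import hashlib
import struct

def frame_checksum(frame):
    """Hash the exact bits of one frame (a list of integer coordinates)."""
    h = hashlib.sha256()
    for coord in frame:
        h.update(struct.pack("<q", coord))  # 64-bit little-endian signed
    return h.hexdigest()

def first_divergence(golden, candidate):
    """Return the index of the first differing frame, or None if identical."""
    for i, (g, c) in enumerate(zip(golden, candidate)):
        if frame_checksum(g) != frame_checksum(c):
            return i
    if len(golden) != len(candidate):
        return min(len(golden), len(candidate))
    return None

golden = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
rerun_ok = [list(f) for f in golden]
rerun_bad = [[1, 2, 3], [4, 5, 6 ^ 1], [7, 8, 9]]  # single flipped bit in frame 1

print(first_divergence(golden, rerun_ok))   # no divergence
print(first_divergence(golden, rerun_bad))  # first bad frame index
```

Storing only the per-frame checksums of a golden run keeps the comparison lightweight, and the first divergent frame bounds the window in which a fault occurred.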
“Inspector Gadget”: Anton’s Logic Analyzer • Anton ASICs include a built-in “logic analyzer” that can be configured to capture traces of various hardware signals without perturbing timing. • Extremely useful when it worked. • Only a limited number of signals could be traced in a single run, often requiring multiple runs • Traces can be “bumped” in favor of other DRAM traffic, so it was often not useful in DRAM mode simulations • Provided key performance tuning data • Lesson: HW visibility tools are a great investment.
“Use your eyes”: Visualization for Debugging and Optimization • Many of the most difficult bugs during bringup were initially tracked down by creating custom visualizations that provided key insights. • Favor quick and dirty over beautiful! • Example 1: Force mismatch blast patterns • Example 2: When ions attack • Example 3: Logic Analyzer for optimization/tuning
“Trust No One”: Paranoid Debugging • During Anton bringup, it was useful to be very paranoid. • Issues were found in both hardware and software at similar frequency and our initial guesses were often wrong. • Most engineers have little experience with this phase of a project; as a software developer it takes practice to learn to distrust the hardware. • Best example: SRAMs that would return bad results for some locations less than once an hour.
Conclusions • Application/Architecture Co-design made bringing up Anton extremely challenging • Most important lessons from Anton’s successful bringup • Preparation • Repeatability • Paranoia
Molecular Dynamics Simulation (MD) • 10^4 to 10^5 atoms in a simulation • Millisecond-scale simulations • Each time step is ~2 fs (2×10^-15 seconds) • Need 5×10^11 time steps • Presently at ~10^8 time steps/day on a cluster with Desmond (Bowers et al, SC2006) • Simulating 1 ms takes >10 years on a cluster • Needed an architectural jump forward: Anton (Shaw et al, ISCA 2007, CACM 2008, SC09)
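The arithmetic behind these figures can be checked directly. The sketch below uses the slide's numbers (2 fs time step, 1 ms target, ~10^8 steps/day on a cluster) plus the ~17,000 ns/day figure quoted for the 512-node machine. That last rate was reported for one specific 23,558-atom system, so applying it to a generic 1 ms run is only a rough estimate. Variable names are illustrative.

```python
# Work in integer femtoseconds to keep the arithmetic exact.
TIMESTEP_FS = 2                 # ~2 fs per time step
TARGET_FS = 10**12              # 1 ms = 10^12 fs

steps_needed = TARGET_FS // TIMESTEP_FS          # 5 x 10^11 steps

CLUSTER_STEPS_PER_DAY = 10**8                    # Desmond on a cluster
cluster_days = steps_needed // CLUSTER_STEPS_PER_DAY
cluster_years = cluster_days / 365

ANTON_NS_PER_DAY = 17_000                        # 512-node figure (rough)
TARGET_NS = 10**6                                # 1 ms = 10^6 ns
anton_days = TARGET_NS / ANTON_NS_PER_DAY

print(f"steps needed:  {steps_needed:.1e}")
print(f"cluster time:  {cluster_days} days (~{cluster_years:.1f} years)")
print(f"Anton time:    ~{anton_days:.0f} days")
```

The cluster estimate comes out at roughly 13.7 years, consistent with the ">10 years" bullet, while the Anton estimate is on the order of two months.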
Biomolecular Timescales (seconds) • [Figure: timescale axis comparing Simulation and Experiment; adapted from Suits (IBM), originally from Chan & Dill (1993)] • Figure annotations: Hours/days on workstation; A few months on Anton, longest MD simulation ever run; Long MD run with Desmond on Infiniband cluster (weeks to months); Less than a day on Anton
Compute Interactions on Neutral Territory • [Figure: Traditional Method vs. NT Method, with “Tower” and “Plate” regions labeled] • D. E. Shaw, “A Fast, Scalable Method for the Parallel Evaluation of Distance-Limited Pairwise Particle Interactions”, J. Comput. Chem., 2005
An Anton ASIC • Two computational subsystems connected by communication ring • Hardware datapaths compute over 25 billion interactions/sec • Software runs on 12 cores in the flexible subsystem • 6 links for the 3D Torus, each 42Gbps bandwidth, 50ns chip-chip latency • 1 Host Interface link for external I/O, 1Gbps. • 2 banks of DDR2-800 DRAM
Anton’s Flexible Subsystem • General Purpose cores are 32-bit Tensilica LX • Remote Access Unit handles multiple parallel DMA to/from 32KB of local SRAM • Geometry Cores are custom-designed, dual-slot VLIW, quad-word fixed-point SIMD • Kuskin et al, HPCA 2008
Anton New York Segment • Anton 512-node system in NY. 2 of 4 racks shown under construction. • Each rack contains 32 boards • Each board holds 4 Anton nodes • 512 nodes in an 8×8×8 3D torus; can be built out to 4096 nodes in a larger data center D. E. Shaw Research
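The node counts and the six torus links per ASIC fit together as follows. The sketch checks that 4 racks × 32 boards × 4 nodes gives 512 nodes, matching an 8 × 8 × 8 arrangement, and computes the six wraparound neighbors of a node. The coordinate scheme and function name are assumptions for illustration. The slides do not say how nodes are numbered or mapped onto boards.

```python
DIMS = (8, 8, 8)                      # 512-node 3D torus
RACKS, BOARDS_PER_RACK, NODES_PER_BOARD = 4, 32, 4

total_nodes = RACKS * BOARDS_PER_RACK * NODES_PER_BOARD
torus_nodes = DIMS[0] * DIMS[1] * DIMS[2]

def torus_neighbors(coord, dims=DIMS):
    """Return the 6 neighbors of a node, wrapping around each axis."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors

print(total_nodes, torus_nodes)
print(torus_neighbors((0, 0, 0)))
```

A corner node such as (0, 0, 0) has the same number of neighbors as any interior node, because each axis wraps from index 0 back to index 7.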