
Accelerating Multiprocessor Simulation


Presentation Transcript


  1. Accelerating Multiprocessor Simulation Kenneth C. Barr, 6.895 Final Project

  2. Motivation
  • Detailed simulation of large benchmarks can take days…
  • …and that's just a uniprocessor. Parallel simulations are more complex:
    • Cache coherence
    • Interconnect / bus timing models
    • N CPUs
  • Memory Address Record (MAR): a structure to speed up simulation of directory-based cache-coherent computers

  3–5. Directory-Based Cache Coherence: Review
  • Same state idea as snooping (e.g., MESI), but more scalable
  • Add a directory to hold state, and replace the bus with a network
  • Each cache line has state in the directory
  • On a load or store, contact the line's "home node" for its state:
    • Exclusive + owner
    • Shared + readers
    • Invalid
  • Example directory entries:
    EXCL  owner: CPU17
    SHRD  reader bit vector: 01101…01110
    INV   don't care
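  To make the per-line state concrete, here is a minimal C++ sketch of one directory entry. The field names, enum, and 64-bit sharer vector are illustrative assumptions, not the layout of any particular simulator:

      #include <cstdint>

      // One directory entry per cache line (hypothetical layout).
      enum class State { Invalid, Shared, Exclusive };

      struct DirEntry {
          State state = State::Invalid;
          int owner = -1;          // meaningful when state == Exclusive (e.g., CPU17)
          uint64_t sharers = 0;    // reader bit vector when state == Shared
      };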

  6. Directory Example
  • CPU1 prepares to modify the word at 0x8000:
    1. CPU1 contacts the home node with a "read-exclusive" request
    2. Home replies: "<data> is shared by CPU2, CPU3, and CPU4"
    3. Home transitions the line to Exclusive
    4. CPU1 sends invalidates to CPU2, CPU3, and CPU4
    5. The sharers invalidate their copies
    6. CPU1 waits for all three replies
    7. CPU1 puts the data in its cache (possibly evicting the LRU line)
  • Not hard to see that this is intense simulation effort: 11 hard-to-predict if-statements in the top-level function!
  • Let's skip it!
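  As a rough illustration of what a detailed simulator must do on each such request, here is a hedged sketch of the home-node side of the read-exclusive transaction, reusing the DirEntry sketch above. send_invalidate is a hypothetical stand-in for the simulator's network model, not UVSIM's actual API:

      // Stub: a real simulator would enqueue a network message here.
      void send_invalidate(int target_cpu, uint64_t addr) { /* ... */ }

      // Home-node handling of "read-exclusive" from `requester`.
      // Returns the number of invalidation acks the requester must await.
      int read_exclusive(DirEntry& e, int requester, uint64_t addr) {
          int invalidations = 0;
          if (e.state == State::Shared) {
              for (int cpu = 0; cpu < 64; ++cpu) {            // scan the bit vector
                  if (((e.sharers >> cpu) & 1) && cpu != requester) {
                      send_invalidate(cpu, addr);             // sharer drops its copy
                      ++invalidations;
                  }
              }
          }
          e.state = State::Exclusive;                         // line is now owned...
          e.owner = requester;
          e.sharers = uint64_t{1} << requester;               // ...by the requester alone
          return invalidations;
      }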

  7. Sampling Microarchitecture Simulation
  • MAR is used in conjunction with "warming":
    • Long periods of fast "functional warming"
    • Short periods of detailed simulation
  • 35–60 times faster than detailed simulation, with less than 1% error in CPI (Wunderlich et al., 2003)

  [Figure: timeline alternating functional warming, detailed warming, and detailed simulation]

  8. Proposal: Memory Address Record to Enable Fast Warmup
  • Quick updates to the MAR during warming
  • No detailed cache or directory model; all accesses go straight to the shared memory space (on the simulator's heap, so it is easy and fast to access)
  • Everything looks like a hit
  • For each access, record {CPU, read/write, time}
  • Playback from the MAR to enable detailed simulation

  9–10. Proposal: Memory Address Record to Enable Fast Warmup
  • For each access, record {CPU, read/write, time}
  • Example: at 3:07am, CPU1 issues "load r1, 0x8004", and the record for that line logs CPU1's read time

  [Figure: physical memory, with one mar_record per cache line]

      struct mar_record {
          int writer;                  // CPU that performed the latest write
          stime_t writetime;           // time of that write
          vector<stime_t> readers;     // last read time, indexed by CPU
      };
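  A minimal sketch of the update step during functional warming, assuming the mar_record above is kept in a hash table keyed by line address. The stime_t definition, CPU count, and 128-byte line size are illustrative assumptions:

      #include <cstdint>
      #include <unordered_map>
      #include <vector>

      using stime_t = uint64_t;                 // simulated-time type (assumed)
      constexpr int kNumCpus = 4;
      constexpr uint64_t kLineBytes = 128;      // line size (assumed)

      struct mar_record {
          int writer = -1;                      // CPU of the latest write; -1 if none
          stime_t writetime = 0;
          std::vector<stime_t> readers = std::vector<stime_t>(kNumCpus, 0);
      };

      std::unordered_map<uint64_t, mar_record> mar;   // keyed by line address

      // Called on every access during warming: everything "hits", and we
      // only overwrite the bookkeeping fields, so each update is cheap.
      void mar_update(int cpu, bool is_write, uint64_t addr, stime_t now) {
          mar_record& r = mar[addr / kLineBytes];
          if (is_write) {
              r.writer = cpu;                   // old values are simply overwritten
              r.writetime = now;
          } else {
              r.readers[cpu] = now;             // this CPU's most recent read
          }
      }

  Under these assumptions, the 3:07am load above would amount to mar_update(1, /*is_write=*/false, 0x8004, now).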

  11. Algorithm
  • Update: simply record <cpu, read/write, time>, overwriting old values
  • Playback: two stages
    1. Reconstruct the caches
    2. Use the caches to build the directory

  12. Algorithm: Reconstructing Caches
  • Uses N w-deep priority queues and N "latest writers"

      for each set {
          for each line in set {
              update the latest w writers
              throw out all prior reads
              insert the writers and the remaining (subsequent) reads into the queue
          }
          for each CPU {
              empty the priority queue (tag, dirty, valid bits) into the set
          }
      }
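  A minimal C++ sketch of this playback pass for one CPU and one set, building on the mar_record and MAR table sketched earlier. The priority-queue eviction follows the slide, but the exact survivorship rules are an assumption:

      #include <algorithm>
      #include <queue>
      #include <vector>

      struct CacheLine { uint64_t tag; bool valid; bool dirty; };

      struct Candidate {
          stime_t time;      // when this CPU last touched the line
          uint64_t line;     // line address (stands in for the tag)
          bool dirty;        // true if this CPU performed the latest write
          bool operator>(const Candidate& o) const { return time > o.time; }
      };

      // lines_in_set: every MAR line address that maps to this set.
      std::vector<CacheLine> rebuild_set(int cpu, int ways,
                                         const std::vector<uint64_t>& lines_in_set) {
          // Min-heap on time: the root is the oldest survivor, so popping
          // whenever the heap exceeds `ways` keeps the w most recent lines.
          std::priority_queue<Candidate, std::vector<Candidate>,
                              std::greater<Candidate>> pq;
          for (uint64_t line : lines_in_set) {
              const mar_record& r = mar.at(line);
              bool wrote_last = (r.writer == cpu);
              stime_t read_t = r.readers[cpu];
              // Reads older than the latest write were invalidated: drop them.
              if (!wrote_last && read_t <= r.writetime) continue;
              stime_t t = wrote_last ? std::max(r.writetime, read_t) : read_t;
              pq.push({t, line, wrote_last});
              if ((int)pq.size() > ways) pq.pop();   // evict the oldest candidate
          }
          std::vector<CacheLine> set;
          while (!pq.empty()) {
              set.push_back({pq.top().line, true, pq.top().dirty});
              pq.pop();
          }
          return set;
      }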

  13. Algorithm: Reconstructing the Directory

      for each CPU {
          for each set in the current cache {
              check the other caches ((N-1)·W places) to get the correct state
          }
      }

  • All lines start as I
  • Line present in one place (dirty) -> M
  • Line present in one place (clean) -> S (or E, but evictions may not let us distinguish)
  • Line present in many places (clean) -> S
  • Other cases should never arise
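  The state-assignment rules above translate almost directly into code. The following sketch is one hypothetical rendering; the Presence struct and DirState enum are illustrative:

      #include <vector>

      enum class DirState { I, S, M };   // E folded into S, per the slide

      struct Presence { bool present; bool dirty; };

      // presence[i] describes whether CPU i's rebuilt cache holds the line.
      DirState derive_state(const std::vector<Presence>& presence) {
          int holders = 0, dirty_holders = 0;
          for (const Presence& p : presence) {
              if (p.present) { ++holders; if (p.dirty) ++dirty_holders; }
          }
          if (holders == 0) return DirState::I;                 // untouched line
          if (holders == 1 && dirty_holders == 1) return DirState::M;
          // One or many clean copies: Shared. A dirty copy alongside other
          // holders should never arise after cache reconstruction.
          return DirState::S;
      }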

  14. Proof (or Intuition) of Correctness
  • We've got everything the cache has and more!
  • So how can the MAR be faster?
    • No timing models and no state machines until playback; and during playback we only care about the final state, not what happened in between
  • Default scheme time:
    • all accesses ∙ T(simulate cache + simulate directory + simulate network)
  • MAR time:
    • Update: all accesses ∙ T(hash-table update)
    • Playback: touched lines ∙ (writes ∙ sharers + reads ∙ CPUs)
  • In our favor:
    • The MAR "all accesses" step is fast
    • Touched lines are tiny compared to all accesses
    • Sharing should be minimal at the application level
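  To see why the trade favors the MAR, plug in some purely hypothetical numbers: 10^9 warm-up accesses, 10^6 touched lines, 4 CPUs, and roughly one sharer per written line. The default scheme pays for 10^9 full coherence simulations. The MAR pays for 10^9 cheap hash-table updates, plus a playback cost on the order of 10^6 ∙ (1 ∙ 1 + 1 ∙ 4) = 5 ∙ 10^6 steps, which is negligible next to the update stream. The win therefore rides almost entirely on the per-access constant: a hash-table update versus a full cache, directory, and network simulation.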

  15. Measured Sharing and Cache Lines

  [Figure: measured sharing and touched cache lines; chart not reproduced in the transcript]

  16. Testing
  • UVSIM: models the SGI Origin 3000 CPU (R10K) and most structures
  • Used 4 CPUs
  • Added MAR update, playback, and statistics

  17. Splash2 Benchmarks
  • Parallel scientific computing:
    • Fast Fourier Transform (FFT)
    • Lower-upper (LU) dense matrix factorization
    • Barnes-Hut n-body interaction
    • Ocean current simulation
  • The chosen subset has diverse behavior (data from Hennessy & Patterson, 2003)

  18. Remaining Work
  • Time it!
  • Note, and reduce, the space requirements
  • Simplify the formulas

  19. Future Work
  • Can we save space in the structure by making it more cache-like?
  • Other optimizations (e.g., don't stride; instead, take advantage of locality during replay)
  • What extra state needs to be stored for other schemes (e.g., MESI, MOESI, etc.)?
  • A per-processor structure to ease parallel-hosted simulation
  • Caveats:
    • If the application's threads get out of sync on a real machine, we are still modeling a correct execution, just not necessarily the one the real machine would see
    • Don't forget to sync/barrier when the application syncs

  20. Questions…

  21. Reconstructing Caches and Directory (Details)
  • Algorithm:
    • Reconstruct the caches:
      • "Stride" through the MAR, looking at all cache lines that map to a set
      • For each line:
        • Keep track of the most recent write (per CPU)
        • Throw out all reads older than the most recent write
        • If there are readers, then for each CPU insert the tag and read time into a priority queue (sorted by time)
      • For each CPU, for each cache set, copy the W most recent reads and writes
    • Reconstruct the directory from the caches:
      • For each CPU's cache, for each address, call an add_sharer()-style function
      • Note that the cache-building step leaves us in a consistent state (e.g., only one cache can be writing)
    • Post-process the directory for subtleties:
      • If there is only one reader, the line is E, not S
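  That last post-processing rule is small enough to sketch directly. The MESI enum and Entry layout below are hypothetical stand-ins for the simulator's directory structures:

      #include <cstdint>
      #include <unordered_map>

      enum class MesiState { I, S, E, M };

      struct Entry { MesiState state; int num_readers; };

      // Promote a clean line with exactly one reader from S to E,
      // as described on the slide above.
      void promote_single_readers(std::unordered_map<uint64_t, Entry>& dir) {
          for (auto& [line, e] : dir) {
              if (e.state == MesiState::S && e.num_readers == 1)
                  e.state = MesiState::E;
          }
      }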

  22. Conclusion
  • Pros:
    • Replaces most of the simulation work with a quick update and an O(N) replay
    • Supports multi-simulation of cache parameters and directory schemes (MSI vs. MESI)
  • Cons:
    • Memory (grows with the number of touched cache lines)

  23. Results (Planned)
  • Benchmarks: the Splash2 paper shows the fraction of time spent in load/store/lock/sync, etc.; make a graph of this
  • How far off are certain metrics if you don't warm up the directory?
  • Graphs showing the timing of this scheme vs. the default (on the four Splash2 apps)
  • Growth of the structure over time with these apps
