The PMaC group, led by Dr. Allan Snavely, aims to bring rigor to performance prediction for machine procurement, architectural tradeoffs, and application optimization. To bridge the gap between benchmarks and cycle-accurate simulation, the group develops tools such as MAPS, MetaSim, and the Pseudocode Cache Simulator for faster yet accurate performance prediction and tuning. Group members include Dr. Laura Carrington, Dr. Stuart Johnson, and Nicole Wolter. The overall goal is a robust framework for meaningful machine performance evaluation and modeling.
Towards More Meaningful Machine Comparisons • Dr. Allan Snavely, PMaC (Performance Modeling & Characterization) Group Leader • www.sdsc.edu/PMaC • SDSC
PMaC Mission • To bring scientific rigor to the art of performance prediction • for procurement • for architectural tradeoffs • for guiding applications to the best-suited machine • for performance tuning
PMaC Mission • To bridge the gap between benchmarks and cycle-accurate simulation • Benchmarks have dubious relevance to real applications, particularly on future machines • Cycle-accurate simulations take too long
Projects • MAPS (Memory Access Patterns) • memory subsystem & interconnect signatures • MetaSim • an on-the-fly simulator for playing “what if?” (4 orders of magnitude faster than cycle-accurate simulation) • Pseudocode Cache Simulator • Scientific Application Loop Set • Terascale Application Information • IDC HPC List
People • Dr. Allan Snavely, Group Leader • Dr. Laura Carrington, Xiaofeng Gao (MAPS) • Dr. Stuart Johnson (Pseudocode simulator) • Dr. Larry Carter (senior technical advisor) • Dr. Wayne Pfeiffer (Scientific Application Loop Set) • Nicole Wolter (Paraver/Dimemas) • Dr. Bob Leary (resident mathematician)
What’s wrong with benchmarks? • May anti-correlate with actual performance [1] • [1] John L. Gustafson and Rajat Todi, “Conventional Benchmarks as a Sample of the Performance Spectrum,” Ames Laboratory, USDOE
PMaC Methods • Performance modeling via separation of concerns • Machine signatures • Application profiles • Convolution methods
Example memory hierarchy (machine signature parameters): • L1: 8192 words, 128-way, 16-word blocks • TLB: 131072 words covered, 4 KB pages, 2-way • L2: 1048576 words, 4-way, 16-word blocks
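To make these parameters concrete, here is a minimal sketch (not PMaC code; the 8-byte word size and the helper name are assumptions for illustration) of how capacity, associativity, and block size translate into the set/way geometry a cache simulator would use:

    # Hypothetical helper: derive cache geometry from the parameters above.
    # Assumes 8-byte words; names are illustrative, not PMaC's actual code.
    def cache_geometry(capacity_words, associativity, block_words, word_bytes=8):
        capacity_bytes = capacity_words * word_bytes
        line_bytes = block_words * word_bytes
        num_lines = capacity_words // block_words
        num_sets = num_lines // associativity
        return {"capacity_bytes": capacity_bytes,
                "line_bytes": line_bytes,
                "sets": num_sets,
                "ways": associativity}

    # L1: 8192 words, 128-way, 16-word blocks  ->  512 lines, 4 sets
    print(cache_geometry(8192, 128, 16))
    # L2: 1048576 words, 4-way, 16-word blocks ->  65536 lines, 16384 sets
    print(cache_geometry(1048576, 4, 16))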
MAPS • Useful in its own right for more meaningful machine comparisons at a glance • Work going forward to port to Compaq TCS1, SX-5, T90, Sv1, MTA, Sun HPC 10K, Origin, others? • Provides input to MetaSim (next)
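As a rough illustration of how a MAPS-style signature might feed MetaSim (a sketch under assumptions: the bandwidth curve and working-set sizes below are invented, not measured data, and the interpolation scheme is only one plausible choice), a machine signature can be stored as achievable bandwidth versus working-set size and looked up for a given application working set:

    # Hypothetical MAPS-style signature: achievable load bandwidth (MB/s)
    # vs. working-set size (KB) for one access pattern (values made up).
    import bisect

    signature_sizes_kb = [16, 64, 256, 1024, 8192]
    signature_bw_mbs   = [4000, 3500, 1800, 900, 300]

    def lookup_bandwidth(working_set_kb):
        """Piecewise-linear interpolation of the signature curve."""
        i = bisect.bisect_left(signature_sizes_kb, working_set_kb)
        if i == 0:
            return signature_bw_mbs[0]
        if i == len(signature_sizes_kb):
            return signature_bw_mbs[-1]
        x0, x1 = signature_sizes_kb[i - 1], signature_sizes_kb[i]
        y0, y1 = signature_bw_mbs[i - 1], signature_bw_mbs[i]
        return y0 + (y1 - y0) * (working_set_kb - x0) / (x1 - x0)

    print(lookup_bandwidth(512))  # expected bandwidth for a 512 KB working set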
MetaSim • Takes 2 inputs • a program • a description of a machine • Consumes instrumented trace data “on-the-fly” • 100-fold slowdown (as opposed to 1M-fold!) • Performs an automated predictive convolution
MetaSim • Models caches and TLB • any number of levels • arbitrary sizes, line lengths, associativities • Does accounting at the basic-block level • Looks for memory access patterns
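A minimal sketch of basic-block-level accounting in this style (an assumed structure, not MetaSim itself; the class and trace below are illustrative): addresses from an instrumented trace are pushed through a configurable set-associative cache and hits/misses are tallied per basic block.

    from collections import OrderedDict, defaultdict

    class SetAssocCache:
        """Tiny LRU set-associative cache model with configurable geometry."""
        def __init__(self, num_sets, ways, line_bytes):
            self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
            self.sets = [OrderedDict() for _ in range(num_sets)]

        def access(self, addr):
            line = addr // self.line_bytes
            s = self.sets[line % self.num_sets]
            if line in s:                    # hit: refresh LRU order
                s.move_to_end(line)
                return True
            if len(s) >= self.ways:          # miss: evict least recently used
                s.popitem(last=False)
            s[line] = True
            return False

    cache = SetAssocCache(num_sets=4, ways=128, line_bytes=128)  # L1 from the example above
    hits, misses = defaultdict(int), defaultdict(int)

    # Trace entries are (basic_block_id, address); a made-up example trace.
    for bb, addr in [(7, 0x1000), (7, 0x1008), (7, 0x9000), (12, 0x1000)]:
        (hits if cache.access(addr) else misses)[bb] += 1

    print(dict(hits), dict(misses))   # per-BB hit/miss counts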
A (simplistic) Convolution
$$\mathrm{MFLOPS} = \sum_{i=1}^{n} \mathrm{Wt}_{BB_i} \times \mathrm{Rate}_{BB_i} \times \mathrm{Intensity}_{BB_i}$$
where • $\mathrm{Wt}_{BB_i}$ = % of total memory references in basic block $i$ • $\mathrm{Rate}_{BB_i}$ = sustained rate of memory references of basic block $i$ • $\mathrm{Intensity}_{BB_i}$ = ratio of floating-point ops to memory ops in basic block $i$
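For example, a toy evaluation of this convolution (a sketch; the per-basic-block numbers below are made up, not measured): predicted MFLOPS is the sum over basic blocks of weight times memory-reference rate times floating-point intensity.

    # Toy convolution: predicted MFLOPS = sum_i Wt_i * Rate_i * Intensity_i
    # Wt_i: fraction of all memory references in BB i (sums to 1.0)
    # Rate_i: sustained memory-reference rate of BB i (millions of refs/sec)
    # Intensity_i: floating-point ops per memory reference in BB i
    basic_blocks = [
        # (Wt,  Rate,  Intensity)   -- illustrative numbers only
        (0.60, 150.0, 1.5),
        (0.30,  40.0, 2.0),
        (0.10, 300.0, 0.5),
    ]
    mflops = sum(wt * rate * intensity for wt, rate, intensity in basic_blocks)
    print(f"predicted MFLOPS = {mflops:.1f}")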
How to determine the rate of memory access for a BB? • sum = sum + a(k)*b(colidx(k)) • Even if only 33% of memory references in a BB miss all the way to main memory (MM), they may slow down the whole BB to the speed of MM accesses • Why? (see the sketch below)
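A back-of-the-envelope sketch of the point above (the latency numbers are assumptions chosen only for illustration): even a modest fraction of main-memory references dominates the average cost per reference, and with dependent, indirect accesses like b(colidx(k)) the misses cannot be overlapped, so the block can serialize to main-memory speed entirely.

    # Assumed latencies (illustrative only): L1 hit ~2 ns, main memory ~100 ns.
    t_l1, t_mm = 2.0, 100.0
    frac_mm = 0.33                      # fraction of references that miss to MM

    avg_ns = (1 - frac_mm) * t_l1 + frac_mm * t_mm
    print(f"average cost/reference: {avg_ns:.1f} ns "
          f"({avg_ns / t_l1:.0f}x slower than all-L1)")

    # If each reference depends on the previous one (e.g., b(colidx(k))),
    # misses cannot be overlapped, so the whole block runs near MM latency.
    print(f"serialized, dependent case: ~{t_mm:.0f} ns per reference")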
Occam’s Razor • Only add complexity if required to explain observed phenomena • Observation: this approach is just as accurate as SMTSIM (Tullsen, Snavely, et al.) but 4 orders of magnitude faster!
Conventional Benchmarks as a Sample of the Performance Spectrum
Work going forward • Development of probes à la MAPS for floating-point and integer functional unit issue, logical operations, I/O • Increase the sophistication of the convolutions as required to fit observed facts • Big goal: a robust set of metrics and methods for performance modeling and characterization