
Scaling Parallel Applications



Presentation Transcript


  1. Mukesh Agrawal Scaling Parallel Applications

  2. Introduction • Parallel systems are ccNUMA • ...so is ccNUMA useful? • How much faster is it? • How can we make it faster? • How hard is it?

  3. ccNUMA (review) • Multiple processors • Private physical memories • Shared address space • Hardware support for cache coherence

  4. Scenario • Scientific computation problems (SPLASH-2) • Metric: parallel efficiency (speedup / number of processors) • Simulation study (simulate Stanford FLASH) • Experimental study (SGI Origin 2000, 128 proc)

  5. Efficiency and Size • What is the smallest problem instance to achieve 60% efficiency? • Why might this be a bad metric?

  6. Efficiency and Size • What is the smallest problem instance to achieve 60% efficiency? • Why might this be a bad metric? • Assumes more efficiency for larger instances • May not happen if data is laid out poorly (cache usage) • Why might larger instances run more efficiently?

  7. Efficiency and Size • What is the smallest problem instance to achieve 60% efficiency? • Why might this be a bad metric? • Assumes more efficiency for larger instances • May not happen if data is laid out poorly (cache usage) • Why might larger instances run more efficiently? • Better communication/computation ratio (nearest neighbor) • Less load imbalance (less waiting for others) • Cache capacity (many misses on uniprocessor) • Cache sharing (small problem may share lines)
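The 60% threshold on the slides above refers to parallel efficiency, i.e. speedup divided by processor count. A minimal sketch of the metric (the timings here are invented for illustration, not taken from the study):

```python
def efficiency(t_serial, t_parallel, n_procs):
    """Parallel efficiency: speedup divided by processor count."""
    speedup = t_serial / t_parallel
    return speedup / n_procs

# Hypothetical timings: a serial run of 96 s and a 16-processor
# run of 10 s give a speedup of 9.6, i.e. 60% efficiency.
eff = efficiency(96.0, 10.0, 16)
assert abs(eff - 0.6) < 1e-9
```

Asking for the smallest instance reaching 60% then means sweeping problem sizes and finding where this ratio first crosses 0.6.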

  8. Efficiency and Size (results) • Depends on the problem • Some reach good efficiency at reasonable sizes (Barnes-Hut) • Others never become efficient (Radix) • Experiments show real hardware requires larger instances than simulation predicts

  9. Efficiency and Structure • Can we get higher efficiency on small instances by modifying computation structure? • What might we try?

  10. Efficiency and Structure • Can we get higher efficiency on small instances by modifying computation structure? • What might we try? • Reduce communication! • Algorithmic changes • Cache management (keep remote data in cache) • Static partitioning

  11. Efficiency and Structure • Can we get higher efficiency on small instances by modifying computation structure? • What might we try? • Reduce communication! • Algorithmic changes • Cache management (keep remote data in cache) • Static partitioning • Most programs can scale after restructuring • Bonus: changes for ccNUMA often help with SVM (cluster) systems as well
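The static-partitioning idea above can be sketched in a few lines. This is an illustrative helper, not code from the talk: it splits an index range into one contiguous block per worker, which keeps each worker's data local (matching first-touch page placement on ccNUMA machines) without sacrificing load balance.

```python
def static_partition(n_elements, n_workers):
    """Split [0, n_elements) into contiguous per-worker blocks.

    Contiguous blocks keep each worker's data local, while block
    sizes differ by at most one element, so load balance is not
    compromised.
    """
    base, extra = divmod(n_elements, n_workers)
    bounds = []
    start = 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

# 10 elements over 4 workers -> block sizes 3, 3, 2, 2
assert static_partition(10, 4) == [(0, 3), (3, 6), (6, 8), (8, 10)]
```

This is the same assignment an OpenMP `schedule(static)` loop would produce; the point of doing it explicitly is that the mapping is stable across phases, so data touched by a worker stays on that worker's node.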

  12. Programming Guidelines • Partition statically; optimize for locality • Load balance should not be compromised • Separate partitions, avoid write sharing
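One concrete way to "separate partitions, avoid write sharing" is to place partition boundaries on cache-line multiples, so no two processors write into the same line (false sharing). A sketch under stated assumptions: the 128-byte line size and 8-byte elements are illustrative choices, not figures from the talk.

```python
CACHE_LINE_BYTES = 128   # assumed secondary-cache line size
ELEM_BYTES = 8           # double-precision element

def aligned_partition(n_elements, n_workers,
                      line_bytes=CACHE_LINE_BYTES, elem_bytes=ELEM_BYTES):
    """Round partition boundaries down to cache-line multiples so
    no two workers ever write into the same cache line."""
    per_line = line_bytes // elem_bytes
    bounds = []
    start = 0
    for w in range(n_workers):
        if w == n_workers - 1:
            end = n_elements          # last worker takes the remainder
        else:
            ideal = n_elements * (w + 1) // n_workers
            end = (ideal // per_line) * per_line
        bounds.append((start, end))
        start = end
    return bounds

# 1000 doubles over 3 workers, 16 doubles per 128-byte line:
# every boundary lands on a multiple of 16, so lines are never
# write-shared between workers.
parts = aligned_partition(1000, 3)
assert all(s % 16 == 0 for s, _ in parts)
```

The trade-off is a slightly uneven split (here blocks of 320, 336, and 344 elements), which is usually a good exchange for eliminating coherence traffic on boundary lines.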

  13. Conclusion • ccNUMA can deliver scalable performance for scientific computation • Restructuring the program is usually required • ccNUMA and SVM machines need similar program modifications • Simulation is good for qualitative questions, less so for quantitative ones
