Moving Complex Apps To Take Advantage of Complex Hardware
Moving Complex Apps To Take Advantage of Complex Hardware Salishan • 4/24/2014 • Ian Karlin
ASC Codes Last Many HW Generations • Tuning large complex applications for each hardware generation is impractical • Performance • Productivity • Code Base Size Solutions must be general, adaptable to the future and maintainable
Power Efficiency is Driving Computer Designs (Figure: power vs. frequency for Intel Ivy Bridge) What drives these designs? • Handheld mobile device weight and battery life • Exascale power goals • Power cost Lower power reduces performance and reliability
Reliability of Systems Will Decrease Chips operating near threshold voltage encounter • More transient errors • More hard errors Checkpoint restart is our current reliability mechanism
Advancing Capability at Reduced Power Requires More Complexity Complex power saving features • SIMD and SIMT • Multi-level memory systems • Heterogeneous systems (Diagram: heterogeneous node pairing a multi-core CPU and a GPU, each with in-package memory, plus DRAM and NVRAM) Exploiting these features is difficult
Currently, We Do Not Use These Features • No production GPU or Xeon Phi code • GPU and Xeon Phi optimizations are different • No production codes explicitly manage on-node data motion • Less than 10% of our FLOPs use SIMD units, even with the best compilers • Architecture dependent data layouts may hinder the compiler Mechanisms are needed to isolate architecture specific code
What Would Continuing on Today’s Path Look Like? • We add directives to existing codes where portable • Multi-level memory handled by OS, runtime or used as a cache • We continue to get little SIMD and probably a bit better SIMT parallelism Overall performance improvement is incremental at best
Can We Get Today’s Codes Where We Need To Be Tomorrow? • Are our algorithms well suited for future machines? • Can we rewrite our data structures to match future machines? We will address these questions in the next few slides
We Can Manage Locality and Reduce Data Motion • Loop fusion • Make each operator a single sweep over a mesh • Data structure reorganization • Reduce mallocs or use better malloc libraries (Figure: LULESH results on BG/Q) However, better implementations only get us 2-3x
We Can Reduce Serial Sections • Throughput optimized processors execute serial sections slowly • Design codes with limited serial sections • Better runtime support is needed to reduce serial overhead • OpenMP • Malloc Libraries Use latency optimized processor for what remains
We Can Vectorize Better • More parallelism exists in current algorithms than we exploit today • Code changes are required to express parallelism more clearly • SIMT or SIMD with HW Gather/Scatter are easier to exploit (Figure: LULESH results on Sandy Bridge) Bandwidth constraints will eventually limit us
However, There Are Fundamental Data Motion Requirements Many of today’s apps need 0.5-2 bytes for every FLOP performed.
Future Machines Cannot Move Data Fast Enough For Current Algorithms to Exploit All Resources (Figure: the resulting excess FLOPs)
High-order Algorithms Can Bridge the Gap (Figure: B to F requirement vs. algorithmic order) • More FLOPs per byte • Small dense operations • More accurate • Potentially more robust and better symmetry preservation
They Present New Questions • How do you use the FLOPs efficiently? • What does high-order accuracy mean when there is a shock? • Can you couple all the physics we need at high order? We are working to answer these, but whether we use new algorithms or our current ones there is a pervasive challenge…
Are Optimizations Portable Across Architectures? Mechanisms are needed to isolate non-portable optimizations
You Saw One Approach Earlier This Week • RAJA, Kokkos and Thrust allow portable abstractions in today’s codes There are other attractive research approaches for the future: Charm++ and Liszt
Ultimately We Need to Make Performance a First Class Citizen (Diagram: architectures, programming models (today: RAJA), and algorithms (today: high order))