
Locality Optimizations in cc-NUMA Architectures Using Hardware Counters and Dyninst



Presentation Transcript


  1. Locality Optimizations in cc-NUMA Architectures Using Hardware Counters and Dyninst
     Mustafa M. Tikir
     Jeffrey K. Hollingsworth

  2. Introduction
     • Cache-coherent SMPs are widely used
       • High performance computing
       • Large-scale applications
       • Client-server computing
     • cc-NUMA is the dominant architecture
       • Allows construction of large servers
     • Data locality is an important consideration
       • Faster access to local memory units

  3. Data Placement
     • Memory-intensive applications on cc-NUMA servers
       • May have significant non-local memory accesses
     • Possible optimizations to increase locality
       • First-touch placement of memory pages
         • Commonly used in modern systems
         • May not place pages local to the processors accessing them most
       • Dynamic page placement/migration
         • Based on page access frequencies at runtime

  4. Our Page Migration Approach
     • User-level dynamic page migration
       • Profiling and page migration during the same run
     • Application Profiling
       • Gathers data from hardware counters
         • Samples the interconnect transactions
         • Transaction Type + Physical Address + Processor ID
       • Identifies preferred locations of memory pages
         • Memory unit local to the processor that accesses the page most
     • Page Placement
       • Kernel moves memory pages to their preferred locations
         • At fixed time intervals
       • Pages are frozen for a while if recently migrated
         • Eliminates ping-ponging of memory pages
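The profiling and placement steps on this slide can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names, the `(page, processor)` sample format, the `proc_to_unit` map, and the `freeze_for` parameter are all assumptions made for the example. The real system migrates pages through the kernel rather than returning a plan.

```python
from collections import Counter, defaultdict

def preferred_locations(samples, proc_to_unit):
    """Map each sampled page to the memory unit whose processors touched it most.

    samples: iterable of (page, processor_id) pairs (hypothetical format).
    proc_to_unit: dict mapping processor_id -> local memory unit (assumed known).
    """
    counts = defaultdict(Counter)
    for page, proc in samples:
        counts[page][proc_to_unit[proc]] += 1
    return {page: c.most_common(1)[0][0] for page, c in counts.items()}

def plan_migrations(samples, proc_to_unit, current_unit, frozen_until, now, freeze_for):
    """Return {page: target unit} for one migration interval.

    Pages that were migrated recently stay frozen until their freeze expires,
    which avoids moving the same page back and forth (ping-ponging).
    """
    moves = {}
    for page, unit in preferred_locations(samples, proc_to_unit).items():
        if frozen_until.get(page, 0) > now:
            continue  # recently migrated: leave it where it is
        if current_unit.get(page) != unit:
            moves[page] = unit
            frozen_until[page] = now + freeze_for
    return moves
```

A caller would invoke `plan_migrations` once per fixed interval with the samples gathered since the previous call, then update `current_unit` after the moves are carried out.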

  5. Hardware/Software Components
     [Diagram: a Sun Fire 6800 with System Board 1 and System Board 2 on a shared address bus, each board holding a memory unit and Processors 1-4. Labeled components: Sun Fire Link hardware counters, transaction sampling, instrumentation software, and the application with Thread 1 ... Thread j. Labeled operations: explicit binding (processor_bind), virtual-to-physical mapping (meminfo), and page migration using the move-on-next-touch feature (madvise).]

  6. Instrumentation Code Insertion
     • Instrumentation using Dyninst
     • Entry point of main
       • Loads a shared library
       • Creates two helper threads
         • One for address transaction sampling
         • Other for actual migrations of the pages
     • Exit point(s) of thr_create
       • Calls processor_bind
         • Binds new threads to available processors
         • Helper threads are bound to dedicated processors
     • Entry point of exithandle
       • Termination detection
       • Clean-up of hardware counters

  7. Preliminary Experiment
     • Impractical to record all transactions
     • Interval sampling
       • Sampling at every Nth transaction
     • Continuous sampling
       • Sampling at the maximum speed of the instrumentation software
     • Are samples representative of transactions?
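Interval sampling as described here amounts to keeping every Nth item of a stream. The sketch below shows that idea on an in-memory sequence. The function name and the generator interface are assumptions for illustration, since the actual sampling is done against hardware counters.

```python
def interval_sample(transactions, n):
    """Yield every n-th transaction from an iterable (the n-th, 2n-th, ...)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for i, transaction in enumerate(transactions, start=1):
        if i % n == 0:
            yield transaction
```

For example, `list(interval_sample(range(1, 11), 3))` keeps the 3rd, 6th, and 9th items.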

  8. Representative Sampling Technique
     • Potential sampling error
       • How much do sampled transactions deviate from all transactions?
     • Distance between two sets
       • S_ALL and S_SAMPLE
       • Ratio of transactions requested by a processor, P
     [Figure labels: S_All, P_A, P_S, S_Sample]
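One way to make the "distance between two sets" concrete is to compute, for each processor, the fraction of transactions it requested in the full set and in the sampled set, then measure how far apart the two ratio vectors are. The transcript does not preserve the distance formula, so the Euclidean distance used below is an assumption, as are the function names and the `(processor_id, address)` tuple format.

```python
import math
from collections import Counter

def request_ratios(transactions, processors):
    """Fraction of transactions requested by each processor, in a fixed order.

    transactions: iterable of (processor_id, address) pairs (hypothetical format).
    """
    counts = Counter(proc for proc, _addr in transactions)
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in processors]

def sampling_error(all_tx, sampled_tx, processors):
    """Euclidean distance between per-processor ratios (assumed metric)."""
    r_all = request_ratios(all_tx, processors)
    r_sample = request_ratios(sampled_tx, processors)
    return math.sqrt(sum((a - s) ** 2 for a, s in zip(r_all, r_sample)))
```

A value of 0.0 means the sample reproduces each processor's share of requests exactly. Larger values indicate a less representative sample.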

  9. Sampling Error for CG
     • Interval sampling is more representative
       • The interval used also has an impact
     • Continuous sampling is less representative due to the difference between the rates at which
       • Transaction samples are taken
       • Processors request transactions

  10. Page Migration Experiments
      • Applications
        • OpenMP C implementation of NAS Parallel Benchmark suite
          • BT(B), CG(C), EP(C), FT(B), LU(C), MG(B), SP(C)
        • Optimized to support parallelized code
      • Platform
        • 24-processor Sun Fire 6800
        • 24 GB main memory
      • Execution
        • 12 threads
          • 2 threads on each system board
        • Page migration every 5 seconds
        • Interval sampling every 1K transactions

  11. Reduction in Non-Local Memory Accesses

  12. Performance Improvement

  13. SPECjbb2001 Results
      • Potential improvement?
        • Migration working at object granularity

  14. MG.B Address Space [0-512MB)

  15. MG.B with Page Migration

  16. Conclusions
      • Our dynamic page migration approach
        • Reduced non-local memory accesses by up to 90%
        • Improved execution times by up to 16%
      • Potentially more effective on larger cc-NUMA servers
        • Sun Fire 15K (latency ratio => 1:1.78)
      • User-level page migration approach
        • Relies on the OS kernel to provide the actual migration mechanism

  17. Questions???
