Locality Optimizations in cc-NUMA Architectures Using Hardware Counters and Dyninst

Mustafa M. Tikir and Jeffrey K. Hollingsworth

Introduction • Cache-coherent SMPs are widely used • High performance computing • Large-scale applications • Client-server computing • cc-NUMA is the dominant architecture • Allows construction of large servers • Data locality is an important consideration • Faster access to local memory units

Data Placement • Memory intensive applications on cc-NUMA servers • May have significant non-local memory accesses • Possible optimization to increase locality • First-touch placement of memory pages • Commonly used in modern systems • May not place pages local to the processors accessing them most • Dynamic page placement/migration • Page access frequencies at runtime

Our Page Migration Approach • User-level dynamic page migration • Profiling and page migration during the same run • Application Profiling • Gathers data from hardware counters • Samples the interconnect transactions • Transaction Type + Physical Address + Processor ID • Identifies preferred locations of memory pages • Memory unit local to the processor that accesses the page most • Page Placement • Kernel moves memory pages to their preferred locations • At fixed time intervals • Pages are frozen for a while if recently migrated • Eliminates ping-ponging of memory pages
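The profiling and placement steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the page size, the freeze length, the function names, and the `node_of_processor` map are all assumptions, and the sketch keys pages by address alone, ignoring the virtual-to-physical translation a real system needs.

```python
from collections import Counter, defaultdict

PAGE_SIZE = 8192       # assumed page size in bytes (hypothetical)
FREEZE_INTERVALS = 2   # assumed freeze length, in migration intervals


def preferred_nodes(samples, node_of_processor, page_size=PAGE_SIZE):
    """Map each sampled page to the memory unit whose processors access it most.

    Each sample is (transaction_type, address, processor_id), mirroring the
    three fields listed on the slide.
    """
    counts = defaultdict(Counter)
    for _txn_type, address, processor_id in samples:
        page = address // page_size
        counts[page][node_of_processor[processor_id]] += 1
    return {page: c.most_common(1)[0][0] for page, c in counts.items()}


def plan_migrations(preferred, current_node, frozen_until, interval):
    """Pick the pages to move in this interval, skipping frozen pages.

    A migrated page is frozen for the next FREEZE_INTERVALS intervals so it
    cannot ping-pong between memory units.
    """
    moves = {}
    for page, node in preferred.items():
        if interval < frozen_until.get(page, 0):
            continue  # recently migrated: leave it where it is
        if current_node.get(page) != node:
            moves[page] = node
            current_node[page] = node
            frozen_until[page] = interval + 1 + FREEZE_INTERVALS
    return moves
```

Calling `plan_migrations` once per fixed time interval with fresh samples mirrors the "at fixed time intervals" bullet; the returned dictionary stands in for the requests that would be handed to the kernel.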

Hardware/Software Components • [Figure: a Sun Fire 6800 with System Board 1 and System Board 2 on a shared address bus, each board holding Processors 1-4 and a memory unit with physical pages. Sun Fire Link hardware counters feed transaction sampling in the instrumentation software. Application threads (Thread 1 to Thread j) are explicitly bound to processors (processor_bind), virtual pages are mapped to physical pages (meminfo), and pages are migrated using the move-on-next-touch feature (madvise).]

Instrumentation Code Insertion • Instrumentation using Dyninst • Entry point of main • Loads a shared library • Creates two helper threads • One for address transaction sampling • Other for actual migrations of the pages • Exit point(s) of thr_create • Calls processor_bind • Binds new threads to available processors • Helper threads are bound to dedicated processors • Entry point of exithandle • Termination detection • Clean-up hardware counters

Preliminary Experiment • Impractical to record all transactions • Interval sampling • Sampling at every Nth transaction • Continuous sampling • Sampling at the maximum speed of the instrumentation software • Are samples representative of transactions?
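The two sampling modes can be contrasted with a small sketch. The function names are hypothetical, and the continuous-sampling model (a sampler that is busy for a fixed `service_time` after each sample) is an assumption made for illustration; the slides do not describe how the instrumentation software is paced.

```python
def interval_sample(transactions, n):
    """Interval sampling: keep every n-th transaction (the n-th, 2n-th, ...)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [t for i, t in enumerate(transactions, start=1) if i % n == 0]


def continuous_sample(timed_transactions, service_time):
    """Continuous sampling: take the next transaction whenever the sampler is free.

    timed_transactions is an iterable of (timestamp, transaction) pairs in
    time order. After taking a sample the sampler is assumed to be busy for
    service_time time units and misses everything that arrives meanwhile.
    """
    samples = []
    free_at = float("-inf")
    for timestamp, transaction in timed_transactions:
        if timestamp >= free_at:
            samples.append(transaction)
            free_at = timestamp + service_time
    return samples
```

Under this model, continuous sampling picks transactions according to when the sampler happens to be free, not according to their position in the stream, which is one way the sample can drift away from the full transaction mix.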

Representative Sampling Technique • Potential sampling error • How much do sampled transactions deviate from all transactions? • Distance between two sets • S_ALL and S_SAMPLE • Ratio of transactions requested by a processor, P • [Figure labels: S_All, P_A, P_S, S_Sample]
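One way to turn "distance between two sets" into a number is sketched below. The slide does not give the formula, so the choice of Euclidean distance between the per-processor request-ratio vectors is an assumption, as are the function names.

```python
import math
from collections import Counter


def request_ratios(transactions):
    """Fraction of transactions requested by each processor.

    Each transaction is (transaction_type, address, processor_id).
    """
    counts = Counter(processor_id for _type, _addr, processor_id in transactions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no transactions to summarize")
    return {proc: count / total for proc, count in counts.items()}


def sampling_error(all_transactions, sampled_transactions):
    """Assumed metric: Euclidean distance between the two ratio vectors."""
    ratios_all = request_ratios(all_transactions)
    ratios_sample = request_ratios(sampled_transactions)
    processors = set(ratios_all) | set(ratios_sample)
    return math.sqrt(sum(
        (ratios_all.get(p, 0.0) - ratios_sample.get(p, 0.0)) ** 2
        for p in processors
    ))
```

A value of 0 means the sample reproduces each processor's share of requests exactly; larger values mean some processors are over- or under-represented in the sample.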

Sampling Error for CG • Interval sampling is more representative • The interval used also has an impact • Continuous sampling is less representative because of the difference between two rates • The rate at which transaction samples are taken • The rate at which processors request transactions

Page Migration Experiments • Applications • OpenMP C implementation of the NAS Parallel Benchmark suite • BT(B), CG(C), EP(C), FT(B), LU(C), MG(B), SP(C) • Optimized to support parallelized code • Platform • 24-processor Sun Fire 6800 • 24 GB main memory • Execution • 12 threads • 2 threads on each system board • Page migration every 5 seconds • Interval sampling every 1K transactions

Reduction in Non-Local Memory Accesses

Performance Improvement

SPECjbb2001 Results • Potential improvement? • Migration working at object granularity

MG.B Address Space [0-512MB)

MG.B with Page Migration

Conclusions • Our dynamic page migration approach • Reduced non-local memory accesses by up to 90% • Improved execution times by up to 16% • Potentially more effective on larger cc-NUMA servers • Sun Fire 15K (latency ratio 1:1.78) • User-level page migration approach • Relies on the OS kernel to provide the actual migration mechanism
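A back-of-the-envelope model suggests why a larger latency ratio leaves more room for improvement. The model is an assumption and does not come from the slides: it treats average memory latency as a weighted mix of local and remote accesses and ignores caching and contention. Here $f$ is the fraction of non-local accesses, $t_L$ the local latency, and $r$ the remote-to-local latency ratio (reading the slide's 1:1.78 as local:remote, so $r = 1.78$).

```latex
\bar{t}(f) = (1 - f)\,t_L + f\,r\,t_L = t_L\bigl(1 + f\,(r - 1)\bigr)

\bar{t}(f_0) - \bar{t}(f_1) = t_L\,(r - 1)\,(f_0 - f_1)
```

The saving from lowering the non-local fraction from $f_0$ to $f_1$ scales with $r - 1$. With $r = 1.78$, each percentage point of non-local accesses removed lowers the modeled average latency by $0.0078\,t_L$.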

Questions?
