
COMP60621 Concurrent Programming for Numerical Applications

Presentation Transcript


  1. COMP60621 Concurrent Programming for Numerical Applications
  Lecture 6: Chronos – a Dell Multicore Computer
  Len Freeman, Graham Riley
  Centre for Novel Computing, School of Computer Science, University of Manchester

  2. Overview
  • Processor
    • AMD Opteron quad-core processor (‘Shanghai’)
    • Chronos has four processors (i.e. 16 cores)
  • Cache structure
    • L1 and L2 cache per core
    • L3 cache shared between the four cores
  • Memory
    • 6GB (6 x 1GB memory modules) per processor (24GB total)
  • Interconnect
    • AMD ‘Direct Connect Architecture’ (Coherent HyperTransport Technology)
    • No ‘front-side bus’, as found in some Intel platforms
  • Performance issues
  • Further information

  3. Processor: Quad-Core AMD Opteron
  Source: www.amd.com, Quad-Core AMD Opteron Product Brief

  4. Processor – AMD Opteron 8378
  • ‘Shanghai’, 64-bit
  • 2.4GHz clock speed
  • Separate 64KB level 1 data and instruction caches per core
    • 2-way set associative, LRU replacement, exclusive
  • 512KB level 2 cache per core (exclusive, i.e. data in L1 does not need to be in other caches)
    • Unified (code and data)
    • 16-way set associative, pseudo-LRU replacement
  • 6144KB (6MB) level 3 cache per processor (can be inclusive)
    • Shared by the 4 cores
    • Unified
    • 64-way set associative, pseudo-LRU replacement
  • Cache line sizes are 64B (the ‘unit of coherency’)
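A quick cross-check of these figures on chronos itself is possible from user space. A minimal C sketch using glibc's sysconf() cache queries (the _SC_LEVEL* names are glibc extensions; where they report 0, fall back to the /sys files listed on slide 12):

```c
/* cacheinfo.c - report the per-core cache parameters seen by glibc.
 * The _SC_LEVEL*_ constants are glibc-specific extensions and may
 * return 0 where the kernel does not expose the information. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("L1d size      : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d assoc     : %ld-way\n",   sysconf(_SC_LEVEL1_DCACHE_ASSOC));
    printf("L1d line size : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 size       : %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L2 assoc      : %ld-way\n",   sysconf(_SC_LEVEL2_CACHE_ASSOC));
    printf("L3 size       : %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    printf("L3 line size  : %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_LINESIZE));
    return 0;
}
```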

  5. AMD Opteron cache behaviour
  • L1 and L2 are exclusive caches
    • Data is never in both caches; L2 holds data evicted from L1
    • On an L2 hit, data is moved to L1 and removed from L2
    • L2 evicts data to L3
  • Access to an address that would lead to an L3 miss brings data straight into L1
    • Only after eviction from L1 and L2 does data come into L3 (L2 and L3 are ‘victim’ caches)
  • If the data is required in L1 again, L3 keeps a copy (inclusive behaviour) if the data is likely to be shared with other cores, but does not keep a copy if the data is unlikely to be shared (exclusive behaviour)
  • Cache behaviour on the Opteron is therefore ‘mostly exclusive’

  6. AMD Opteron latencies
  • Getting data into the registers:
    • L1 access: 3 cycles, then 1 cycle per load (~1.5ns)
    • L2 access: 9 cycles beyond L1 (~4ns)
    • L3 access: 29 cycles at best (~13ns)
    • Local memory (read access): ~140ns (not directly related to CPU cycles!)
      • An average benchmarked figure, obtained using e.g. lmbench
  • On chronos, 1 CPU cycle is just under 0.42ns
  • Memory access times are approximate…
    • They depend on how much work the memory system has to do to get the data, and on how ‘busy’ it is
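These latencies can be estimated, roughly, with a pointer-chasing microbenchmark: a chain of dependent loads over a working set sized to fit in L1, L2, L3 or only in memory. The sketch below is illustrative rather than definitive; it assumes clock_gettime() is available (link with -lrt on older glibc) and makes no attempt to control TLB misses or frequency scaling, so expect numbers in the right ballpark rather than an exact match with the figures above.

```c
/* latency.c - estimate the average latency of a dependent load for a
 * given working-set size (in KB), passed on the command line.
 * Build: gcc -O2 latency.c -o latency -lrt                           */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NCHASES 10000000L

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(int argc, char **argv)
{
    size_t kb = (argc > 1) ? (size_t)atol(argv[1]) : 32;
    size_t n  = kb * 1024 / sizeof(size_t);
    size_t *next = malloc(n * sizeof(size_t));
    if (!next) return 1;

    /* Build a single random cycle (Sattolo's algorithm) so the
     * hardware prefetcher cannot guess the next cache line. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    srand(12345);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    /* Chase the pointers: every load depends on the previous one. */
    size_t idx = 0;
    double t0 = now_ns();
    for (long k = 0; k < NCHASES; k++) idx = next[idx];
    double t1 = now_ns();

    printf("working set %zu KB: ~%.1f ns per load (final index %zu)\n",
           kb, (t1 - t0) / NCHASES, idx);   /* printing idx keeps the loop live */
    free(next);
    return 0;
}
```

Running it with working sets of, say, 32, 256, 4096 and 65536 KB should step through the L1, L2, L3 and local-memory regimes described above.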

  7. AMD Opteron 4P server architecture
  Source: www.amd.com, AMD 4P Server and Workstation Comparison

  8. AMD Quad-quad ccNUMA architecture
  • Each processor is directly connected to some of the system’s memory
    • Each processor has its own memory controller
    • Bandwidth: 12.8GB/s (aggregate over two channels)
  • Processors are connected to each other with bi-directional Coherent HyperTransport Technology (HT) links
    • The coherency unit is 64 bytes (i.e. the cache line size)
    • Up to 8.0GB/s per link (4GB/s in each direction)
    • 3 HT links per processor; usually 2 are used to connect to other processors and 1 is used for I/O (via a PCI bridge)
  • Separate memory and I/O paths
    • Compare with the front-side bus architecture used by, e.g., Intel
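This topology is also visible from software: NUMA-aware Linux kernels export one directory per memory node under /sys/devices/system/node/, and on chronos one would expect four nodes, each listing the four cores of one processor. A small sketch (it assumes that directory is present on the installed kernel):

```c
/* nodes.c - print which logical CPUs belong to which memory node, as
 * exported by the kernel under /sys/devices/system/node/. */
#include <stdio.h>

int main(void)
{
    for (int node = 0; node < 8; node++) {
        char path[64], cpus[256];
        snprintf(path, sizeof path,
                 "/sys/devices/system/node/node%d/cpulist", node);
        FILE *f = fopen(path, "r");
        if (!f) continue;                     /* node does not exist */
        if (fgets(cpus, sizeof cpus, f))
            printf("node %d: cpus %s", node, cpus);  /* cpulist ends in '\n' */
        fclose(f);
    }
    return 0;
}
```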

  9. Performance issues
  • Cores on the same processor can directly access some of the system’s memory (local memory) through the cache hierarchy
    • They can communicate with each other via the shared L3 cache
  • Cores on different processors access remote memory via the cHT (coherent HyperTransport) links, which maintain coherency of data in the L3 caches (and memory)
  • Access to remote memory may take 1 ‘hop’ (to memory on the two other processors one cHT link away) or 2 ‘hops’ (to memory on the fourth processor, two cHT links away)
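Whether an access is local, 1 hop or 2 hops away depends on which core the thread runs on and on which node the data's pages live, so experiments along these lines start by pinning threads to known cores. A minimal sketch using the Linux affinity interface (which logical CPU id sits on which physical processor is exactly the mapping slide 12 asks you to build):

```c
/* pin.c - pin the calling thread to one logical CPU so that you can
 * control which processor, and hence which memory node, does the work. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int cpu = (argc > 1) ? atoi(argv[1]) : 0;   /* logical CPU id, 0..15 on chronos */

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof mask, &mask) != 0) {  /* pid 0 = this thread */
        perror("sched_setaffinity");
        return 1;
    }
    printf("now restricted to logical CPU %d\n", cpu);
    /* ... allocate and touch data here, or on another pinned thread,
     * to compare local against 1-hop and 2-hop accesses ... */
    return 0;
}
```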

  10. AMD Opteron memory latencies
  • Local memory reads: 100% (base case)
  • Local memory writes: ~113%
  • 1-hop reads: ~108%
  • 2-hop reads: ~130%
  • 1-hop writes: ~128%
  • 2-hop writes: ~150%
  • Remember, data is placed in physical memory according to the ‘first touch’ policy: a page is allocated in the memory local to the processor whose thread first touches it!
  • These are benchmarked figures: 1 thread, idle machine
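The first-touch policy is why parallel codes normally initialise their data in parallel, using the same loop schedule as the later computation, so that each thread's pages end up in its own processor's memory. A hedged OpenMP sketch (it assumes the threads are also bound to cores, e.g. with gcc's GOMP_CPU_AFFINITY environment variable; placement is still first-touch without binding, but the touching thread may then migrate):

```c
/* firsttouch.c - parallel first-touch initialisation.
 * Build: gcc -O2 -fopenmp firsttouch.c -o firsttouch                  */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)            /* 16M doubles = 128MB, far larger than L3 */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    if (!a) return 1;

    /* First touch in parallel: each thread touches, and therefore
     * places, the pages of the chunk it will later compute on.  A
     * serial loop here would put the whole array on one node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Compute loop with the same static schedule, so each thread works
     * mostly on memory local to its own processor. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++) {
        a[i] += 1.0;
        sum  += a[i];
    }

    printf("sum = %.1f (expected %d)\n", sum, N);
    free(a);
    return 0;
}
```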

  11. Further information
  • See www.amd.com. Follow: Products and Technologies -> Server Products -> Server Processors:
    • Product Brief
    • Key Architectural Features
    • Direct Connect Architecture
    • HyperTransport Technology
    • Quad-Core AMD Opteron Processor 4P Server and Workstation Comparison
  • Another useful, though slightly old, document is:
    • Performance Guidelines for AMD Athlon and Opteron ccNUMA Multiprocessor Systems. Available at: www.amd.com.cn/CHCN/assets/content_type/white_papers_and_tech_docs/40555.pdf

  12. Information on chronos
  • Look in files such as:
    • /proc/cpuinfo
    • /proc/meminfo
    • /sys/devices/system/cpu/cpu0/cache/index0 to index3
  • From the information in /proc/cpuinfo you can create a map of the logical processor ids (in the range [0-15], one per core) to physical processor ids [0-3] and (physical) core ids [0-3]
  • You should do this! (One way of doing it is sketched below.)
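One possible way to build the map, sketched in C below; it assumes the usual x86 Linux field names in /proc/cpuinfo ('processor', 'physical id', 'core id') and that each CPU's entry is terminated by a blank line.

```c
/* cpumap.c - map logical processor ids to (physical id, core id)
 * by parsing /proc/cpuinfo. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) { perror("/proc/cpuinfo"); return 1; }

    char line[256];
    int logical = -1, physical = -1, core = -1;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "processor : %d", &logical)    == 1) continue;
        if (sscanf(line, "physical id : %d", &physical) == 1) continue;
        if (sscanf(line, "core id : %d", &core)         == 1) continue;
        if (line[0] == '\n')            /* blank line ends one CPU's entry */
            printf("logical %2d -> physical %d, core %d\n",
                   logical, physical, core);
    }
    fclose(f);
    return 0;
}
```

On chronos this should print sixteen lines, four logical processors per physical id, which is the map you need when pinning threads with, e.g., taskset or sched_setaffinity.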

  13. Results of vec.f on chronos
  [Plot: performance (Mflop/s) against log10 N (bytes), with the cache capacities L1 = 64KB, L2 = 512KB and L3 = 6MB marked]
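vec.f itself is not reproduced here; the sketch below is a plain C analogue, a repeated vector update (2 flops per element) swept over working-set sizes, intended to show the effect the plot illustrates: performance falling as N outgrows each level of cache.

```c
/* vecsweep.c - Mflop/s of a simple vector kernel as a function of the
 * working-set size, from well inside L1 to well beyond L3.
 * Build: gcc -O2 vecsweep.c -o vecsweep -lrt                          */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    printf("#   N (bytes)   Mflop/s\n");
    for (size_t bytes = 1 << 12; bytes <= (size_t)1 << 27; bytes <<= 1) {
        size_t n = bytes / sizeof(double);
        double *x = malloc(bytes);
        if (!x) return 1;
        for (size_t i = 0; i < n; i++) x[i] = 1.0;

        long reps = (long)(1e8 / n) + 1;      /* keep the timed interval measurable */
        double t0 = now();
        for (long r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                x[i] = 0.5 * x[i] + 1.0;      /* 2 flops per element */
        double t1 = now();

        printf("%12zu  %8.1f   # check %.3e\n",
               bytes, 2.0 * n * reps / (t1 - t0) / 1e6, x[n / 2]);
        free(x);
    }
    return 0;
}
```

Plotting the second column against log10 of the first should give a curve of the same general shape as the one shown on this slide.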
