ABACUS is a hardware-based profiler designed to characterize thread behavior in modern multi-core processors. It is controllable through the operating system and collects the runtime metrics needed for better thread placement, using runtime-configurable profiling units to measure properties such as memory reuse and instruction mix. It is designed to sit outside the processor and to be minimally invasive. In the proof-of-concept implementation, collecting a memory reuse profile took 48.5 seconds on average with ABACUS, compared with 1 hour 6 minutes with the Simics simulator. This presentation outlines the architecture, profiling units, and verification of ABACUS, with the aim of supporting operating-system thread placement on heterogeneous multi-core systems.
ABACUS: A Hardware-Based Software Profiler for Modern Processors Sergey Blagodurov • Sergey Zhuravlev • Alexandra Fedorova School of Computing Science Eric Matthews • Lesley Shannon School of Engineering Science Simon Fraser University, Vancouver, BC, Canada
Overview • Legendary Introduction to ABACUS • Delicious Profiling Units • Epic Conclusion 2
ABACUS 7
ABACUS ASPLOS rocks! 8
ABACUS 9
Performance comparison • Memory Reuse Profile • ABACUS avg runtime: 48.5 seconds • Simics avg runtime: 1 hour 6 minutes [Figure: memory reuse profile, ABACUS vs. Simics] 10
Conclusion • ABACUS is a generic profiler that can be easily integrated into modern processors • It can be used by the O/S to obtain runtime information about a thread’s behaviour to make better thread assignments 11
Motivation • Future systems will be multi-core and heterogeneous • How does the OS place threads on this architecture? • Characterize thread behaviour • Instruction Mix • Memory Reuse Profile • Effectiveness of pre-fetching • Memory bandwidth utilization 13
Motivation (cont'd) • How are these metrics collected? • Offline analysis • Code Instrumentation • Simulation (e.g., Simics) • Software-based instruction set simulator • Models systems with full OS support 14
Motivation (cont'd) • Why not use current hardware counters? • Architecture-specific • Not all desired metrics provided • Help detect symptoms, not causes • Limited in number and in concurrent use 15
Goal • Create a hardware profiler to collect thread characteristics at runtime • Imposed constraints • External to processor • Minimally invasive • Cycle accurate • OS controllable 16
ABACUS • hArdware-Based Analyzer for the Characterization of User Software • A collection of runtime configurable profiling units • Collects metrics useful for thread placement • Controllable through the O/S 17
Hardware Platform • Proof-of-concept System • LEON3 SPARC v8 Instruction Set Architecture • Single core, single threaded • Test System • OpenSPARC T1 (Niagara) soft processor • 1 to 4 hardware threads • Multi-core Multi-board support 18
ABACUS 20
External Interface • Bus slave and master modules • Processing required on processor signals • Designed such that only external interface changes with different processor/system 21
Portability • Previously integrated with a LEON3 (SPARC v8 ISA) based system • Differences: • AMBA Advanced High-performance Bus (AHB) vs Processor Local Bus (PLB) • Processor internals 22
Controller • Starts or stops profiling • Can limit profiling to a specific address range • DMA interface for retrieving collected data • Linux device driver support 23
Profiling Units • Operate on one or more processor signals: • Instruction • PC • Cache Reuse Distance • etc. • Store data in a collection of counters 24
Profiling Units (cont'd) • Focus on two dimensional metrics • Gives bigger picture / greater insight • Aim to be as architecture independent as possible 25
Profile Unit • Behaves like a traditional software profiler • Operates on Program Counter [Figure: Code Space; Range Overlap, Range Non-Overlap, Trace] 26
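A minimal software sketch of what a program-counter profile could look like, assuming non-overlapping address ranges with one counter each. The class name `ProfileUnit`, its fields, and the range layout are hypothetical and only illustrate the idea; they do not describe the ABACUS hardware, and the overlapping-range and trace options named in the figure are not modelled.

```python
from bisect import bisect_right


class ProfileUnit:
    """Hypothetical model: one counter per non-overlapping PC range."""

    def __init__(self, boundaries):
        # boundaries is a sorted list of addresses. Range i covers
        # [boundaries[i], boundaries[i + 1]); the last entry is an
        # exclusive upper bound for the profiled code space.
        self.boundaries = list(boundaries)
        self.counters = [0] * (len(self.boundaries) - 1)
        self.out_of_range = 0

    def observe(self, pc):
        # Find the range whose start address is the largest one <= pc.
        i = bisect_right(self.boundaries, pc) - 1
        if 0 <= i < len(self.counters):
            self.counters[i] += 1
        else:
            self.out_of_range += 1


unit = ProfileUnit([0x1000, 0x2000, 0x3000])
for pc in (0x1000, 0x1004, 0x2000, 0x4000):
    unit.observe(pc)
print(unit.counters, unit.out_of_range)
```

Binary search keeps the lookup cheap in software; a hardware unit would more plausibly compare the PC against all range registers in parallel.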
Memory Reuse Unit • Collects a measure of code or data reuse • Utilizes Least Recently Used (LRU) stack • Reuse distance is movement in the LRU stack or a miss • Useful for cache contention management 27
Memory Reuse Unit (cont'd) • Creates histogram of cache reuse pattern • Range: [0, set associativity - 1] or cache miss [Figure: 4-way set-associative reuse profile; axis: Reuse Distance] 28
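The reuse-distance histogram described on the two slides above can be sketched in software as follows. This is an illustrative model under stated assumptions, not the ABACUS implementation: the class name `ReuseProfiler`, the 32-byte line size, and the address-to-set mapping are all hypothetical. Each cache set keeps an LRU stack; a hit at depth d increments bin d, and anything not found in the stack counts as a miss.

```python
class ReuseProfiler:
    """Hypothetical per-set LRU stack model producing a reuse histogram."""

    def __init__(self, num_sets, ways, line_bytes=32):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes  # assumed line size
        # One LRU stack per set, most recently used tag first.
        self.stacks = [[] for _ in range(num_sets)]
        # Bins 0 .. ways - 1 hold hits by stack depth.
        self.histogram = [0] * ways
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        stack = self.stacks[line % self.num_sets]
        tag = line // self.num_sets
        if tag in stack:
            depth = stack.index(tag)  # reuse distance in [0, ways - 1]
            self.histogram[depth] += 1
            stack.pop(depth)
        else:
            self.misses += 1
            if len(stack) == self.ways:
                stack.pop()  # evict the least recently used tag
        stack.insert(0, tag)  # accessed line becomes most recently used


profiler = ReuseProfiler(num_sets=1, ways=4)
for addr in (0, 32, 0, 0):
    profiler.access(addr)
print(profiler.histogram, profiler.misses)
```

In this model a profile concentrated in the low bins suggests a thread that would tolerate losing cache ways, which is the kind of information cache contention management can use.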
Instruction Mix • Identify current instruction subset in use • Divide instructions into logical categories • Load/Store • Floating Point • Control Flow • Opcode-based table lookup 29
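An opcode-based table lookup of this kind might look like the sketch below. It assumes the SPARC v8 instruction formats (the `op` field in bits 31:30, `op2` in bits 24:22, `op3` in bits 24:19) and a deliberately simplified category set; the table contents and the function name `classify` are illustrative assumptions, not the tables used by ABACUS.

```python
from collections import Counter

# Format 2 (op = 0): op2 values that are branches (Bicc, FBfcc, CBccc).
BRANCH_OP2 = {2, 6, 7}
# Format 3 (op = 2): op3 values mapped to a category; the rest are "other".
OP3_TABLE = {
    0x34: "floating_point",  # FPop1
    0x35: "floating_point",  # FPop2
    0x38: "control_flow",    # JMPL
    0x39: "control_flow",    # RETT
    0x3A: "control_flow",    # Ticc
}


def classify(instr):
    """Map a 32-bit SPARC v8 instruction word to a coarse category."""
    op = (instr >> 30) & 0x3
    if op == 1:
        return "control_flow"  # CALL
    if op == 3:
        return "load_store"    # all memory instructions
    if op == 0:
        op2 = (instr >> 22) & 0x7
        return "control_flow" if op2 in BRANCH_OP2 else "other"
    op3 = (instr >> 19) & 0x3F
    return OP3_TABLE.get(op3, "other")


# Example words: call, ld [%g1], %g2, an FPop1 word, nop.
mix = Counter(classify(w) for w in (0x40000000, 0xC4004000, 0x81A00000, 0x01000000))
print(mix)
```

The counter per category plays the role of the unit's counters; a finer breakdown would only need more table entries.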
Latency Unit • Break down miss latency into constituent sources • Bus contention • DRAM latency • etc. • For each category create a histogram of latency in cycles 30
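As a rough illustration of the per-category latency histograms, the sketch below buckets each observed miss latency by its source. The bucket width, the number of buckets, and the source labels are assumptions chosen for the example; the slides do not state how ABACUS sizes its bins.

```python
from collections import defaultdict


class LatencyUnit:
    """Hypothetical model: one fixed-width latency histogram per source."""

    def __init__(self, bucket_cycles=16, num_buckets=8):
        self.bucket_cycles = bucket_cycles
        self.num_buckets = num_buckets
        self.histograms = defaultdict(lambda: [0] * num_buckets)

    def record(self, source, cycles):
        # Latencies beyond the last bucket are clamped into it.
        bucket = min(cycles // self.bucket_cycles, self.num_buckets - 1)
        self.histograms[source][bucket] += 1


unit = LatencyUnit()
unit.record("dram", 40)
unit.record("dram", 500)
unit.record("bus_contention", 5)
print(dict(unit.histograms))
```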
Stall Unit • Break down Cycles Per Instruction • Attribute cycles to their sources • Cache miss • Translation Lookaside Buffer (TLB) miss • Floating Point busy stalls • etc. 31
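The CPI breakdown can be shown with a short worked example. The function name `cpi_breakdown` and all cycle counts below are made up for illustration; the point is only that per-source stall cycles divided by the instruction count sum, together with the remaining base cycles, to the overall CPI.

```python
def cpi_breakdown(instructions, total_cycles, stall_cycles):
    """Split cycles per instruction into a base part and per-source stalls."""
    base_cycles = total_cycles - sum(stall_cycles.values())
    breakdown = {"base": base_cycles / instructions}
    for source, cycles in stall_cycles.items():
        breakdown[source] = cycles / instructions
    return breakdown


# Hypothetical counts: 1000 instructions retired in 2500 cycles.
parts = cpi_breakdown(
    instructions=1000,
    total_cycles=2500,
    stall_cycles={"cache_miss": 1000, "tlb_miss": 300, "fp_busy": 200},
)
print(parts, sum(parts.values()))
```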
Verification • Run a subset of the SPEC CPU2006 benchmarks • Those with memory usage within board specs • Collect metrics with ABACUS and Simics • Profile for a few billion instructions • Limited by Simics performance 32
Test Platform • Proof-of-concept System • Single core, single threaded • XUP V2Pro: 90% slice utilization 33
Simulation Platform • Simics System: • Differences: • SPARC v9 ISA (64-bit processor) • Local filesystem vs NFS 34
LEON3 Comparison [Figure: ABACUS vs. Simics] 35
LEON3 Comparison (cont'd) • DC Memory Reuse Profile [Figure: ABACUS vs. Simics] 36
Resource Usage • Default: 2-way LRU Instruction Cache, 2-way LRU Data Cache, 5 Instruction Types • Configurations shown: 32-bit counters; 40-bit counters; 32-bit counters with Profile Unit added 37
Conclusion • ABACUS is a generic profiler that can be easily integrated into modern processors • It can be used by the O/S to obtain runtime information about a thread’s behaviour to make better thread assignments 38
Future Plans • Move to multi-core/multi-threaded system • Memory reuse distance independent of existing cache implementation • Process tracking • Integrate results into OS scheduler 39