Cycle Accurate Performance Measurement

Cycle Accurate Performance Measurement • Richard Hough • Phillip Jones, Scott Friedman, Roger Chamberlain, Jason Fritts, John Lockwood, and Ron Cytron • rh3@wustl.edu • http://liquid.arl.wustl.edu/ • Funded by NSF Grant ITR-0313203

Outline • Introduction • Motivation • Background • Architecture • Usage • Results • Future Work • Related Work • Conclusion

Introduction – What Are We Doing? • Creating a module for capturing cycle-accurate profiles of hardware events during the runtime of programs on real systems • Diagram labels: Statistics Module, Program Runtime, Program Bottlenecks, Memory Accesses, ISA Decoding, Cache Hits

Background – FPX • Designed and implemented on the FPX platform • The FPX platform: • Is designed for developing pluggable network circuits • Contains a Virtex 2000E FPGA for design deployment • Has a smaller FPGA used as a network interface device • Can potentially operate at gigabit line rates

Background – LEON2 • Developed by Gaisler Research • SPARC V8 • Open-source VHDL • Widely used • European Space Agency, etc. • Second in popularity only to the MicroBlaze

Motivation – Why Not Use Software? • Software Profiling Is: • Inaccurate • Many data points estimated • Time slices not absolute • Profiling affects results • Inefficient • Unreasonable for real-system deployment • Ineffective • Difficult to separate OS overhead

Motivation – Why Not Use Simulation? • Simulation is: • Slow • Even a simple simulation could require 100X more time than running the program • Bound by the quality of the model • The model used may be inaccurate • Processors are often tweaked without updating the documentation [Larus]

Motivation – Why Use FPGAs? • ASICs are expensive • FPGAs provide a good blend of cost and accuracy • Software simulation of processors is incredibly slow • FPGAs allow for easy prototyping • Test new caching methods, tweak the ISA, etc.

Motivation – Why Put StatsMod In An FPGA? • The Statistics Module allows you to: • Pull event signals from anywhere • Evaluate both software and hardware optimizations • Tweak the architecture • Integrate hardware-accelerated modules into software solutions • Adjust the software algorithm • Gather repeatable and reliable results

Architecture – Naïve Solution • Interested in 10 events and 10 methods (address ranges) • Naïve solution implements a counter for each event/method pair • 100 counters! • Not scalable for large systems

Architecture – Our Solution • Better approach • Associate counters with events and methods at run time • Covers the same problem space, but uses less chip area
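The idea of binding a small pool of counters to events and address ranges at run time can be modeled in software. The Python sketch below is illustrative only: the names `Counter`, `StatsModModel`, `bind`, and `clock`, the event names, and the addresses are all hypothetical and are not taken from the actual hardware design.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Counter:
    """One counter bound at run time to an event and an address range."""
    event: str
    lo: int          # inclusive start of the address range
    hi: int          # inclusive end of the address range
    count: int = 0


class StatsModModel:
    """Hypothetical software model of a small pool of run-time-bound counters."""

    def __init__(self, num_counters: int) -> None:
        self.num_counters = num_counters
        self.counters: List[Counter] = []

    def bind(self, event: str, lo: int, hi: int) -> Counter:
        """Associate a free counter with an event and an address range."""
        if len(self.counters) >= self.num_counters:
            raise RuntimeError("no free counters left")
        counter = Counter(event, lo, hi)
        self.counters.append(counter)
        return counter

    def clock(self, pc: int, active_events: Set[str]) -> None:
        """Advance one clock cycle: increment every counter whose event fired
        while the program counter was inside its address range."""
        for c in self.counters:
            if c.event in active_events and c.lo <= pc <= c.hi:
                c.count += 1


# Usage: count cache misses in one address range and cycles in another.
model = StatsModModel(num_counters=4)
misses = model.bind("cache_miss", 0x4000, 0x40FF)
cycles = model.bind("cycle", 0x4100, 0x41FF)

trace = [
    (0x4000, {"cycle"}),
    (0x4004, {"cycle", "cache_miss"}),
    (0x4100, {"cycle"}),
    (0x4104, {"cycle", "cache_miss"}),
]
for pc, events in trace:
    model.clock(pc, events)

print(misses.count, cycles.count)  # prints "1 2"
```

Only the pairs selected for a given run consume a counter, which is why the pool can stay much smaller than the full set of event and range combinations.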

Architecture – An In Depth Look

Architecture – Scalability • Chart comparing against the Naïve Approach; labels: Address Range Registers, Counters, Events
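The scaling argument can be made concrete with a little arithmetic. This sketch assumes that the 100 counters of the naïve solution come from 10 events multiplied by 10 address ranges. The function names and the pool size of 8 are hypothetical and chosen only for illustration.

```python
def naive_counter_count(num_events: int, num_ranges: int) -> int:
    """Naive design: one dedicated counter per (event, address range) pair."""
    return num_events * num_ranges


def shared_counter_count(num_active_pairs: int) -> int:
    """Run-time association: one counter per pair measured in a given run."""
    return num_active_pairs


# Compare growth as the number of events and address ranges increases.
# The pool of 8 run-time-bound counters is an assumed, illustrative size.
for n in (10, 20, 40):
    print(n, naive_counter_count(n, n), shared_counter_count(8))
# prints:
# 10 100 8
# 20 400 8
# 40 1600 8
```

The naïve count grows with the product of events and ranges, while the run-time-bound pool grows only with the number of pairs measured at once.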

Usage

Results – What do we get? • The next few slides contain data from the Linpack benchmark running on the FPGA • Linpack is an FPU-intensive benchmark • While the following slides focus on runtime, it is important to remember that the graphs could in principle be of *any* event

Results 323,686,726 Clock Cycles

Results

Future Work – Where can we go? • As of a week ago, the StatsMod was successfully integrated into Linux 2.6.11 running on the LEON • Changes have been made to allow a clear separation between process IDs • OS, background tasks, threads • A device driver allows any program, including the program being profiled, to gather the statistics
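Separating statistics by process ID amounts to attributing each measured interval to whichever process was running at the time. A minimal sketch of that bookkeeping follows. The function name, the sample format, and the PID values are hypothetical, and this does not reflect the actual device driver interface.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cycles_per_pid(samples: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Sum cycle counts per process ID.

    Each sample is (pid, cycles): the process that was running and the
    number of clock cycles it ran before the next context switch.
    """
    totals: Dict[int, int] = defaultdict(int)
    for pid, cycles in samples:
        totals[pid] += cycles
    return dict(totals)


# Hypothetical trace: PID 0 stands in for the OS, PID 42 for the profiled
# program, and PID 7 for a background task.
trace = [(0, 1200), (42, 5000), (0, 300), (42, 7000), (7, 900)]
totals = cycles_per_pid(trace)
print(totals[42])  # prints 12000
```

With totals kept per PID, OS and background activity no longer inflate the numbers reported for the program being profiled.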

Future Work – Where can we go? • Programs could now potentially collect statistics on themselves and perform runtime introspection • Adjust operation to conserve power, memory accesses, etc. • Deeper integration could occur at the kernel level to affect scheduler decisions • Adds a new dimension for slicing resources • Network activity, device activity, page faults, etc.

Related Work • SnoopP • Developed by Lesley Shannon and Paul Chow at the University of Toronto • Collects timing characteristics of programs running on a MicroBlaze processor • Focuses on clock cycles only • Integrated into the EDK

Conclusion In closing, I would like to thank: • Phillip Jones for his hard work and support • Ron Cytron for his mentoring and persistence • Scott Friedman for his work on the web interface • The rest of the Liquid Architecture team • And WISA for the invitation to present

Questions?

Background – Liquid

Usage • Connect to a secure web server controlling the FPGA hardware • Upload the desired binary executable, associated map file, and desired programming bitfile • A Perl script parses the map file and provides a graphical interface for selecting the desired address ranges and events • Statistics are tabulated at the end of the program’s execution
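The map-file step can be illustrated with a short sketch. The original tool is a Perl script whose input format is not shown in the slides, so this Python version assumes a simplified, hypothetical format of one `<hex address> <symbol>` pair per line. The function name `parse_map` and the addresses in the example are made up for illustration.

```python
from typing import Dict, List, Tuple


def parse_map(text: str) -> Dict[str, Tuple[int, int]]:
    """Turn '<hex address> <symbol>' lines into inclusive address ranges.

    Each symbol's range ends one byte before the next symbol starts; the
    last symbol is dropped because its end address is unknown.
    """
    symbols: List[Tuple[int, str]] = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[1]))
    symbols.sort()

    ranges: Dict[str, Tuple[int, int]] = {}
    for (start, name), (next_start, _) in zip(symbols, symbols[1:]):
        ranges[name] = (start, next_start - 1)
    return ranges


# Made-up addresses for a few symbols; not taken from a real map file.
example = """
40001000 main
40001200 daxpy
40001480 dgefa
40001900 _end
"""
ranges = parse_map(example)
print(hex(ranges["daxpy"][0]), hex(ranges["daxpy"][1]))
# prints "0x40001200 0x4000147f"
```

Each resulting range could then be offered in the interface as a selectable address range to pair with an event.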
