
Shadow Profiling: Hiding Instrumentation Costs with Parallelism


Presentation Transcript


  1. Shadow Profiling: Hiding Instrumentation Costs with Parallelism Tipp Moseley, Alex Shye, Vijay Janapa Reddi, Dirk Grunwald (University of Colorado); Ramesh Peri (Intel Corporation)

  2. Motivation • An ideal profiler will… • Collect arbitrarily detailed and abundant information • Incur negligible overhead • A real profiler, e.g., one built on Pin, satisfies condition 1 • But the cost is high • 3X for basic-block (BBL) counting • 25X for loop profiling • 50X or higher for memory profiling • A real profiler, e.g., PMU sampling or code patching, satisfies condition 2 • But the detail is very coarse

  3. Motivation: Detail vs. Overhead • Low overhead, coarse detail: VTune, DCPI, OProfile, PAPI, pfmon, Pin Probes, … • High detail, high overhead: Pintools, Valgrind, ATOM, … • High detail AND low overhead: “Bursty Tracing” (sampled instrumentation), novel hardware, Shadow Profiling

  4. Goal • To create a profiler capable of collecting detailed, abundant information while incurring negligible overhead • Enable developers to focus on other things

  5. The Big Idea • Stems from fault-tolerance work on deterministic replication • Periodically fork(), then profile the “shadow” processes while the original runs ahead natively • (The slide’s timeline figure assumes an instrumentation overhead of 3X)

  6. Challenges • Threads • Shared Memory • Asynchronous Interrupts • System Calls • JIT overhead • Overhead vs. Number of CPUs • Maximum speedup is Number of CPUs • If profiler overhead is 50X, need at least 51 CPUs to run in real-time (probably many more) • Too many complications to ensure deterministic replication

  7. Goal (Revised) • To create a profiler capable of sampling detailed traces (bursts) with negligible overhead • Trade abundance for low overhead • Like SimPoints or SMARTS (but not as smart :)

  8. The Big Idea (revised) • Do not strive for full, deterministic replica • Instead, profile many short, mostly deterministic bursts • Profile a fixed number of instructions • “Fake it” for system calls • Must not allow shadow to side-effect system

  9. Design Overview

  10. Design Overview • Monitor uses Pin Probes (code patching) • Application runs natively • Monitor receives periodic timer signal and decides when to fork() • After fork(), child uses PIN_ExecuteAt() functionality to switch Pin from Probe to JIT mode. • Shadow process profiles as usual, except handling of special cases • Monitor logs special read() system calls and pipes result to shadow processes
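The monitor’s fork-and-discard pattern can be miniaturized as follows. This is a hypothetical Python sketch: the real system patches code with Pin Probes and has the child switch Pin from probe to JIT mode via PIN_ExecuteAt(), none of which is modeled here. The property shown is the one the design depends on: the shadow profiles a copy-on-write snapshot and must never side-effect the real run.

```python
import os

def profile_burst(workload, state):
    # Stand-in for the instrumented shadow; in the real system the child
    # switches Pin from probe mode to JIT mode and profiles a fixed
    # number of instructions.
    return workload(state)

def fork_shadow(workload, state):
    # Fork a shadow process: it works on a copy-on-write snapshot of the
    # application's address space and terminates when its burst is done.
    pid = os.fork()
    if pid == 0:
        profile_burst(workload, state)
        os._exit(0)          # the shadow never rejoins or continues the app
    return pid               # parent (monitor) resumes the native run

# Demo: the shadow's mutation stays in its own snapshot.
state = {"x": 1}
pid = fork_shadow(lambda s: s.__setitem__("x", 42), state)
_, status = os.waitpid(pid, 0)
shadow_exited_cleanly = (os.WEXITSTATUS(status) == 0)
parent_unchanged = (state["x"] == 1)
```

Because fork() gives the child its own copy-on-write view, the parent’s `state` is untouched no matter what the shadow does to its copy.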

  11. System Calls • For SPEC CPU2000, system calls occur around 35 times per second • Forking after each puts heavy pressure on copy-on-write (CoW) pages and the Pin JIT engine • 95% of dynamic system calls can be safely handled • Some system calls can be allowed to execute (49%) • getrusage, _llseek, times, time, brk, munmap, fstat64, close, stat64, umask, getcwd, uname, access, exit_group, …

  12. System Calls • Some can be replaced with success assumed (39%) • write, ftruncate, writev, unlink, rename, … • Some are handled specially, but execution may continue (1.8%) • mmap2, open(creat), mmap, mprotect, mremap, fcntl • read() is special (5.4%) • For reads from pipes/sockets, the data must be logged from the original app • For reads from files, the file must be closed and reopened after the fork() because the OS file pointer is not duplicated • ioctl() is special (4.8%) • Frequent in perlbmk • Behavior is device-dependent, safest action is to simply terminate the segment and re-fork()
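The per-call policy in the two slides above can be sketched as a simple classifier. The names and categories below mirror the slides; the real implementation intercepts calls inside the Pin VM by number, not by name, so treat this as an illustrative sketch only.

```python
# Hypothetical classifier mirroring the slide's categories.
ALLOW = {"getrusage", "_llseek", "times", "time", "brk", "munmap", "fstat64",
         "close", "stat64", "umask", "getcwd", "uname", "access", "exit_group"}
FAKE_SUCCESS = {"write", "ftruncate", "writev", "unlink", "rename"}
SPECIAL = {"mmap2", "open", "mmap", "mprotect", "mremap", "fcntl", "read"}

def classify(syscall):
    if syscall in ALLOW:
        return "allow"            # harmless: let the shadow execute it
    if syscall in FAKE_SUCCESS:
        return "fake-success"     # suppress the call, pretend it succeeded
    if syscall in SPECIAL:
        return "special"          # per-call handling, e.g. reopen a file
                                  # whose OS file pointer was not duplicated
    return "terminate-segment"    # e.g. ioctl: end the burst and re-fork()
```

The conservative default matters: anything whose replayed behavior is device- or kernel-state-dependent (ioctl being the slide’s example) simply ends the segment rather than risking divergence.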

  13. Other Issues • Shared Memory • Disallow writes to shared memory • Asynchronous Interrupts (Userspace signals) • Since we are only mostly deterministic, no longer an issue • When main program receives a signal, pass it along to live children • JIT Overhead • After each fork(), it is like Pinning a new program • Warmup is too slow • Use Persistent Code Caching [CGO’07]
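The signal-forwarding point above can be sketched concretely. Everything here (`shadows`, `forward_to_shadows`, the pipe-based handshake) is an illustrative construction, not the paper’s code; it shows the one behavior the slide names, i.e. the native process relaying a received signal to its live shadow children.

```python
import os, signal

shadows = []   # pids of live shadow processes, maintained by the monitor

def forward_to_shadows(sig):
    # When the native process receives a userspace signal, relay it to
    # every live shadow; since replay is only "mostly deterministic",
    # exact delivery timing does not need to be reproduced.
    for pid in list(shadows):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            shadows.remove(pid)   # shadow already exited

# Demo: a "shadow" that exits with a recognizable status on SIGUSR1.
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    os.close(r)
    signal.signal(signal.SIGUSR1, lambda s, f: os._exit(42))
    os.write(w, b"x")            # tell the parent the handler is installed
    while True:
        signal.pause()
shadows.append(pid)
os.close(w)
os.read(r, 1)                    # wait until the shadow is ready
forward_to_shadows(signal.SIGUSR1)
_, status = os.waitpid(pid, 0)
forwarded_exit = os.WEXITSTATUS(status)
```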

  14. Multithreaded Programs • Issue: fork() does not duplicate all threads, only the thread that called fork() • Solution: • Barrier all threads in the program and store their CPU state • Fork the process and clone new threads for those that were destroyed • Identical address space; only register state was really ‘lost’ • In each new thread, restore previous CPU state • Modified clone() handling in Pin VM • Continue execution, virtualize thread IDs for relevant system calls
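The barrier/checkpoint/re-clone scheme can be miniaturized as below. This is a Python sketch under loose assumptions: “CPU state” is stood in by a small dict, whereas the real system saves register state and modifies Pin’s clone() handling; the point demonstrated is that state deposited at the barrier survives the fork even though the worker threads themselves do not.

```python
import os, threading

NWORKERS = 3
barrier = threading.Barrier(NWORKERS + 1)   # workers + the forking thread
saved = {}                                   # tid -> checkpointed "CPU state"

def worker(tid):
    state = {"count": tid * 100}             # stand-in for register state
    saved[tid] = state                       # deposit state before the barrier
    barrier.wait()                           # rendezvous: everyone checkpointed
    barrier.wait()                           # hold still until the fork is done

def resume(state, out):
    # In the shadow child, a freshly cloned thread continues from saved state.
    out.append(state["count"] + 1)

workers = [threading.Thread(target=worker, args=(i,)) for i in range(NWORKERS)]
for t in workers:
    t.start()
barrier.wait()                               # every worker has checkpointed
pid = os.fork()                              # child keeps only this thread
if pid == 0:
    out = []
    clones = [threading.Thread(target=resume, args=(s, out))
              for s in saved.values()]       # re-clone the "lost" threads
    for t in clones:
        t.start()
    for t in clones:
        t.join()
    os._exit(0 if sorted(out) == [1, 101, 201] else 1)
barrier.wait()                               # release the native workers
for t in workers:
    t.join()
_, status = os.waitpid(pid, 0)
reclone_ok = (os.WEXITSTATUS(status) == 0)
```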

  15. Tuning Overhead • Load • Number of active shadow processes • Tested 0.125, 0.25, 0.5, 1.0, 2.0 • Sample Size • Number of instructions to profile • Longer samples for less overhead, more data • Shorter samples for more evenly dispersed data • Tested 1M, 10M, 100M

  16. Experiments • Value Profiling • Typical overhead ~100X • Accuracy measured by Difference in Invariance • Path Profiling • Typical overhead 50% - 10X • Accuracy measured by percent of hot paths detected (2% threshold) • All experiments use SPEC2000 INT Benchmarks with “ref” data set • Arithmetic mean of 3 runs presented
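One plausible reading of the Difference in Invariance metric, with hypothetical helper names: a site’s invariance is the fraction of its dynamic occurrences taken by its single most frequent value, and accuracy is the mean absolute difference in per-site invariance between the full and sampled profiles. The sketch below encodes that reading; the paper may weight or aggregate sites differently.

```python
def invariance(counts):
    # Invariance of one value site: fraction of dynamic occurrences
    # taken by the most frequent value observed there.
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0

def difference_in_invariance(full_profile, sampled_profile):
    # Mean absolute per-site invariance difference; profiles map
    # site -> {value: count}.
    sites = set(full_profile) | set(sampled_profile)
    return sum(abs(invariance(full_profile.get(s, {})) -
                   invariance(sampled_profile.get(s, {})))
               for s in sites) / len(sites)

# Demo: a site that is 90% invariant in the full profile but 80% in the
# sampled one contributes a difference of 0.1.
full = {"site": {1: 9, 2: 1}}
sampled = {"site": {1: 4, 2: 1}}
metric = difference_in_invariance(full, sampled)
```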

  17. Results - Value Profiling Overhead • Overhead versus native execution • Several configurations less than 1% • Path profiling exhibits similar trends

  18. Results - Value Profiling Accuracy • All configurations within 7% of perfect profile • Lower is better

  19. Results - Path Profiling Accuracy • Most configurations over 90% accurate • Higher is better • Some benchmarks (e.g., 176.gcc, 186.crafty, 187.parser) have millions of paths, but few are “hot”

  20. Results - Page Fault Increase • Proportional increase in page faults • Shadow/Native

  21. Results - Page Fault Rate • Difference in page faults per second experienced by native application

  22. Future Work • Improve stability for multithreaded programs • Investigate effects of different persistent code cache policies • Compare sampling policies • Random (current) • Phase/event-based • Static analysis • Study convergence • Apply technique • Profile-guided optimizations • Simulation techniques

  23. Conclusion • Shadow Profiling allows collection of bursts of detailed traces • Accuracy is over 90% • Incurs negligible overhead • Often less than 1% • With increasing numbers of cores, allows developers’ focus to shift from profiling to applying optimizations
