Prospector: A Toolchain To Help Parallel Programming

Minjang Kim and Hyesoon Kim (HPArch Lab), and Chi-Keung Luk (Intel)
This work will also be supported by Samsung.

Motivation (1/2)
• Parallel programming is hard
• What if there were a tool that helps with parallel programming?
  • We already have some tools, such as race detectors
  • However, few tools guide the parallelization itself
• A programmer wants to parallelize serial code
  • Where to parallelize?
  • How to parallelize?

Motivation (2/2)
• We propose Prospector
  • A set of dynamic program analyzers that help parallelize serial code
• Goals
  • Provide information for finding the right parallelization targets
  • Provide advice on writing correct and optimized parallel code

Overview of Prospector
• Parallelism Pattern Advisor
• Parallel Performance Analyzer
• Parallelizable Section Finder
• Parallel Speedup Predictor
• Architecture Advisor
• Loop-Centric Profiler
[Figure: Prospector overview. Input: source code or binary, e.g. Func1(){ Loop1; Loop2; Func2(); } and Func2() { Loop3 }, where Loop3 { Statements; Lock(); Statements; Unlock(); Statements; }. Example outputs: per-loop statistics (Loop1: Invocation, Iteration, Max Iter, Min Iter) and speedup plots (speedup vs. # of cores; CPU vs. GPU).]

Prospector: Loop-Centric Profiler
• Q: Which code sections would be good for parallelization?
  • The most frequently executed loops
• Legacy profilers report only hot functions and instructions
• We provide details of loop execution
  • Trip count → Sufficient work?
  • Number of invocations → Low fork/join overhead?
  • Statistics on the length of loop iterations → Balanced?
    • Min, Max, Stdev
[Example output for Loop1: Invocation: 8, Iteration: 5,000, Max Iter: 1,600, Min Iter: 40]
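A minimal sketch of the per-loop statistics listed on this slide, assuming the profiler has already measured the length of every iteration (in cycles, instructions, or any other unit). `LoopStats`, `record_invocation`, and `summary` are hypothetical names used for illustration; they are not Prospector's interface.

```python
import statistics
from dataclasses import dataclass, field


@dataclass
class LoopStats:
    """Counters a loop-centric profiler might keep for one loop (assumed layout)."""
    name: str
    trip_counts: list = field(default_factory=list)   # iterations per invocation
    iter_lengths: list = field(default_factory=list)  # measured length of each iteration

    def record_invocation(self, lengths):
        """Record one invocation of the loop, given the length of each iteration."""
        self.trip_counts.append(len(lengths))
        self.iter_lengths.extend(lengths)

    def summary(self):
        """Return the numbers a programmer would use to judge the loop."""
        return {
            "invocations": len(self.trip_counts),   # few invocations: low fork/join overhead
            "iterations": sum(self.trip_counts),    # many iterations: sufficient work
            "min_iter": min(self.iter_lengths),
            "max_iter": max(self.iter_lengths),
            # A large spread suggests unbalanced iterations.
            "stdev_iter": statistics.pstdev(self.iter_lengths),
        }


# Made-up measurements for two invocations of a loop.
stats = LoopStats("Loop1")
stats.record_invocation([40, 50, 60])
stats.record_invocation([1600, 45])
report = stats.summary()
```

A wide gap between `min_iter` and `max_iter`, as in these made-up numbers, is the kind of imbalance the "Balanced?" question is meant to expose.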

Prospector: Parallel Speedup Predictor (1/2)
• Q: What speedup can be expected?
• Analytical models (e.g., Amdahl's Law) are not practical for predicting speedup in the presence of locks
• Our approach
  • Dynamically predict speedup based on lightweight profiling
• Challenges
  • How to model architectural factors (e.g., caches, memory)?
[Figure: speedup vs. # of cores (2, 4, 8)]
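For reference, Amdahl's Law as a small function. The formula is the standard one; the function and parameter names are illustrative. It takes only a parallel fraction and a core count, so it has no term for time spent waiting on locks, which is the limitation this slide points at.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


# Predicted speedups for a program that is 90% parallelizable.
predictions = {n: amdahl_speedup(0.9, n) for n in (2, 4, 8)}
```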

Prospector: Parallel Speedup Predictor (2/2)
• Mechanisms
  • Programmers annotate the serial code
    • Describe the behavior of parallel execution and locks
  • Fast, lightweight profiling
    • Measure the time between annotations
  • Emulation
    • Obtain the estimated parallel execution time to compute speedup
  • Modeling architectural parameters
    • Sampling memory accesses
    • Using an analytical model for cache hit/miss prediction
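The emulation step can be sketched as a small event-driven simulation. Everything here is an assumption made for illustration: the segment encoding (`("work", t)` and `("lock", lock_id, t)`), the round-robin assignment of iterations to threads, and first-come-first-served lock acquisition. It ignores caches and memory, so it is not the tool's actual emulator.

```python
import heapq


def serial_time(iterations):
    """Total measured time of all segments, i.e. the serial execution time."""
    return sum(seg[-1] for segs in iterations for seg in segs)


def emulate(iterations, n_threads):
    """Estimate the parallel execution time of an annotated loop.

    iterations: one list of segments per loop iteration, where a segment is
    ("work", duration) or ("lock", lock_id, duration), as measured between
    annotations during a serial run.
    """
    # Assumed schedule: iteration i runs on thread i % n_threads.
    queues = [[] for _ in range(n_threads)]
    for i, segs in enumerate(iterations):
        queues[i % n_threads].extend(segs)

    next_seg = [0] * n_threads          # index of each thread's next segment
    lock_free_at = {}                   # lock_id -> time the lock is released
    ready = [(0.0, t) for t in range(n_threads)]
    heapq.heapify(ready)
    finish = 0.0

    # Always advance the thread with the earliest clock, so locks are granted
    # in the order in which threads request them.
    while ready:
        clock, t = heapq.heappop(ready)
        if next_seg[t] == len(queues[t]):
            finish = max(finish, clock)
            continue
        seg = queues[t][next_seg[t]]
        next_seg[t] += 1
        if seg[0] == "lock":
            _, lock_id, duration = seg
            start = max(clock, lock_free_at.get(lock_id, 0.0))  # wait if held
            clock = start + duration
            lock_free_at[lock_id] = clock
        else:
            clock += seg[1]
        heapq.heappush(ready, (clock, t))
    return finish


# Eight iterations: 3 time units of independent work, then 1 unit under lock "L".
loop = [[("work", 3.0), ("lock", "L", 1.0)] for _ in range(8)]
speedup_2 = serial_time(loop) / emulate(loop, 2)
```

The predicted speedup is the serial time divided by the emulated parallel time. In this sketch, contention on `"L"` is what keeps the two-thread estimate below 2.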

Prospector: Parallelizable Section Finder (1/3)
• Q: Is this code section parallelizable?
  • Data dependences determine parallelizability
  • Static compiler analysis is often ineffective because of pointers and complex control flow
• Our approach
  • Dynamic data-dependence profiling
  • Provides detailed dependence information for a given input
• Challenges
  • The overhead is too high; a smart algorithm is needed
[Figure: Func1(){ Loop1; Loop2; Func2(); } with one loop marked "Parallelizable!"]

Prospector: Parallelizable Section Finder (2/3)
• Mechanisms
  • A dynamic profiler based on instrumentation
    • Instrumentation can be at either the binary or the source level
  • At instrumentation time (or static time)
    • Analyzes control-flow graphs and loop structures
  • At runtime
    • We observe memory addresses (no points-to analysis)
    • These memory addresses are stored and analyzed to discover data dependences
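A minimal sketch of the runtime half of this mechanism: given a trace of observed memory accesses tagged with the loop iteration that issued them, report dependences that cross iterations. The trace format and the function name are assumptions for illustration. The sketch remembers only the last reader and last writer of each address, which is much simpler than a real profiler.

```python
def find_loop_carried_deps(trace):
    """Find cross-iteration dependences in a memory-access trace.

    trace: (iteration, op, address) tuples in execution order, op is "R" or "W".
    Returns a set of (kind, address, earlier_iteration, later_iteration).
    """
    last_write = {}  # address -> iteration of the most recent write
    last_read = {}   # address -> iteration of the most recent read
    deps = set()
    for iteration, op, addr in trace:
        writer = last_write.get(addr)
        if op == "R":
            if writer is not None and writer != iteration:
                deps.add(("RAW", addr, writer, iteration))  # read after write
            last_read[addr] = iteration
        else:
            if writer is not None and writer != iteration:
                deps.add(("WAW", addr, writer, iteration))  # write after write
            reader = last_read.get(addr)
            if reader is not None and reader != iteration:
                deps.add(("WAR", addr, reader, iteration))  # write after read
            last_write[addr] = iteration
    return deps


# Trace of `for i in 1..3: a[i] = a[i-1] + 1`, with a[k] at made-up address 100 + k.
carried = find_loop_carried_deps(
    [(i, op, 100 + k) for i in (1, 2, 3) for op, k in (("R", i - 1), ("W", i))]
)

# Trace of `for i in 0..2: b[i] = c[i] * 2`, with b at 200 + i and c at 300 + i.
independent = find_loop_carried_deps(
    [(i, op, base + i) for i in (0, 1, 2) for op, base in (("R", 300), ("W", 200))]
)
```

An empty result means no cross-iteration dependence was observed for that input. As the previous slide notes, the information holds for a given input, not for every possible run.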

Prospector: Parallelizable Section Finder (3/3)
• Mechanisms
  • Scalability
    • Current tools require too much memory and time to analyze data dependences
    • Prospector implements a new scalable algorithm for data-dependence profiling
  • Key ideas
    • Using compression and parallelization (MICRO '10)
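The slide says only that the algorithm uses compression. One way to picture the idea, as an illustrative assumption rather than the tool's actual representation, is to store regular address sequences as `(base, stride, count)` runs instead of individual addresses. The function names below are hypothetical.

```python
def compress_strides(addresses):
    """Greedily encode an address sequence as (base, stride, count) runs."""
    runs = []
    i, n = 0, len(addresses)
    while i < n:
        base = addresses[i]
        if i + 1 < n:
            stride = addresses[i + 1] - base
            count = 2
            # Extend the run while consecutive addresses keep the same stride.
            while i + count < n and addresses[i + count] - addresses[i + count - 1] == stride:
                count += 1
        else:
            stride, count = 0, 1  # a single trailing address
        runs.append((base, stride, count))
        i += count
    return runs


def decompress_strides(runs):
    """Expand (base, stride, count) runs back into the original addresses."""
    return [base + k * stride for base, stride, count in runs for k in range(count)]


# A loop walking a 4-byte array, followed by one unrelated access.
observed = [100, 104, 108, 112, 500]
runs = compress_strides(observed)
```

Long regular runs, such as a loop walking an array, collapse to a single tuple, which is where the memory saving would come from.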

Prospector: Parallelism Pattern Advisor
• Q: How can I transform the serial code?
• If dependences are easily removable
  • I.e., embarrassingly parallel loops with some reductions
  • Guide the parallelization strategy directly
    • E.g., "Use an OpenMP pragma here"
• If severe dependences exist
  • Can we give advice on avoiding these dependences?
  • General solutions are extremely hard
  • Instead, use data-dependence pattern analysis
    • E.g., pipeline parallelism, a certain form of locking
[Figure: Loop3 { Statements; Lock(); Statements; Unlock(); Statements; }]
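To illustrate the easy case, here is a hedged sketch of the reduction pattern: a loop whose only cross-iteration dependence is an accumulator becomes parallel once each worker keeps a private partial result that is combined at the end. This is the same transformation an OpenMP `reduction` clause expresses in C. The function name and chunking scheme are illustrative, and Python threads show the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(values, n_workers=4):
    """Sum `values` using one private partial sum per chunk."""
    chunk = max(1, -(-len(values) // n_workers))  # ceiling division, at least 1
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each chunk is reduced independently; no shared accumulator is touched.
        partials = list(pool.map(sum, chunks))
    # The only serial step left is combining the partial results.
    return sum(partials)


total = parallel_sum(list(range(1000)))
```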

Prospector: Parallel Architecture Advisor
• Q: Which parallel hardware would be better?
• Can we predict performance on different hardware?
  • E.g., speedups on multicore and GPGPU
• Challenges
  • Need to model more architectural factors
[Figure: speedup on CPU vs. GPU]

Prospector: Parallel Performance Analyzer
• Q: What is the reason for poor speedup?
• A few profilers already exist for this purpose
  • They analyze the degree of concurrency
  • They profile lock contention (wait time)
  • Their information is too low-level for understanding problems
• Alternative
  • Macroscopic profiling of parallelized programs
  • An alternative form of visualization
[Figure: Loop3 { Statements; Lock(); Statements; Unlock(); Statements; }]

Related Work
• State-of-the-art tools
  • Parallel Advisor from Intel Parallel Studio 2011
    • Speedup Predictor: cannot model architectures
    • Parallelizable Section Finder: scalability issues
  • vfAnalyst from Vector Fabrics
    • Parallelizable Section Finder: scalability issues

Current Status and Timeline
• June 2010
  • The initial Prospector idea was presented at HotPar '10
• Dec 2010
  • The scalable data-dependence profiling algorithm (for the Parallelizable Section Finder and Pattern Advisor) will be presented at MICRO '10
  • A beta version will be released as open source
    • Loop-centric profiler
    • Parallelizable Section Finder (i.e., data-dependence profiler)
    • Parallel speedup predictor
• Mar 2011
  • Parallel Speedup Predictor will be released
• Aug 2011
  • First Parallelism Pattern Advisor will be released

Conclusion
• We need a new type of tool to help parallel programming
• Prospector is a set of parallel programming advisors based on dynamic program analysis
  • Finds good parallelization targets
  • Analyzes serial code to understand its behavior
  • Predicts speedup
  • Provides advice on code changes

Thank you!
• Q&A
• References
  • Overall tool architecture
    • Minjang Kim, Hyesoon Kim, Chi-Keung Luk, "Prospector: Helping Parallel Programming by A Data-Dependence Profiler", 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar '10), June 2010.
  • Scalable data-dependence profiling
    • Minjang Kim, Hyesoon Kim, Chi-Keung Luk, "SD3: A Scalable Approach To Dynamic Data-Dependence Profiling", Proceedings of the 43rd IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010.
