
Presentation Transcript


  1. Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) David J. Lilja Department of Electrical and Computer Engineering University of Minnesota lilja@ece.umn.edu

  2. Acknowledgements • Graduate students (who did the real work) • Ying Chen • Resit Sendag • Joshua Yi • Faculty collaborator • Douglas Hawkins (School of Statistics) • Funders • National Science Foundation • IBM • HP/Compaq • Minnesota Supercomputing Institute

  3. Problem #1 • Speculative execution is becoming more popular • Branch prediction • Value prediction • Speculative multithreading • Potentially higher performance • What about impact on the memory system? • Pollute cache/memory hierarchy? • Leads to more misses?

  4. Problem #2 • Computer architecture research relies on simulation • Simulation is slow • Years to simulate SPEC CPU2000 benchmarks • Simulation can be wildly inaccurate • Did I really mean to build that system? • Results are difficult to reproduce • Need statistical rigor

  5. Outline (Part 1) • The Superthreaded Architecture • The Wrong Execution Cache (WEC) • Experimental Methodology • Performance of the WEC [Chen, Sendag, Lilja, IPDPS, 2003]

  6. Hard-to-Parallelize Applications • Early exit loops • Pointers and aliases • Complex branching behaviors • Small basic blocks • Small loop counts → Hard to parallelize with conventional techniques.
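
As a concrete illustration (not code from the talk), here is a hypothetical C loop that combines several of these properties: pointer chasing, an unknown trip count, a small body, and an early exit.

    /* Hypothetical example: searching a linked list.  The trip count is
       unknown at compile time, the loop body is tiny, the traversal is a
       pointer chase, and the loop can exit early, all of which make it
       hard to parallelize with conventional techniques. */
    struct node { int key; struct node *next; };

    int contains(const struct node *list, int key)
    {
        for (const struct node *p = list; p != NULL; p = p->next) {
            if (p->key == key)
                return 1;          /* early exit */
        }
        return 0;
    }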

  7. Introduce Maybe Dependences • Data dependence? Pointer aliasing? The answer may be Yes, No, or Maybe • Maybe allows aggressive compiler optimizations • When in doubt, parallelize • Run-time check to correct a wrong assumption.
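
For instance (a hypothetical sketch, not code from the talk), the compiler often cannot prove whether two pointers refer to the same memory, so the dependence between iterations is a "maybe": assume independence, parallelize, and rely on the run-time check to recover if the assumption was wrong.

    /* dst and src may or may not overlap ("maybe" pointer aliasing), so the
       compiler cannot prove the iterations independent.  Under the maybe
       model the loop is parallelized anyway and checked at run time. */
    void scale(int *dst, const int *src, int n, int k)
    {
        for (int i = 0; i < n; i++)
            dst[i] = k * src[i];   /* may touch a location written by another iteration if dst aliases src */
    }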

  8. Thread Pipelining Execution Model [diagram of Thread i, Thread i+1, and Thread i+2 linked by Fork and Sync operations] • Each thread unit passes through four stages: CONTINUATION (values needed to fork the next thread), TARGET STORE (forward addresses of maybe dependences), COMPUTATION (forward addresses and computed data as needed), and WRITE-BACK
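
The stage ordering can be summarized in a conceptual C sketch; the stub names below are hypothetical placeholders for illustration, not the actual superthreaded hardware or the SIMCA simulator's interfaces.

    #include <stdio.h>

    typedef struct { int id; } thread_unit;

    /* Hypothetical stubs standing in for the real stage logic. */
    static void compute_continuation_values(thread_unit *tu)    { printf("TU%d: continuation\n", tu->id); }
    static void fork_next_thread(thread_unit *tu)               { printf("TU%d: fork TU%d\n", tu->id, tu->id + 1); }
    static void forward_target_store_addresses(thread_unit *tu) { printf("TU%d: target store\n", tu->id); }
    static void do_computation(thread_unit *tu)                 { printf("TU%d: computation\n", tu->id); }
    static void sync_with_predecessor(thread_unit *tu)          { printf("TU%d: sync\n", tu->id); }
    static void write_back_stores(thread_unit *tu)              { printf("TU%d: write-back\n", tu->id); }

    /* One thread unit's pass through the four pipeline stages. */
    static void run_thread_unit(thread_unit *tu)
    {
        compute_continuation_values(tu);    /* CONTINUATION: values needed to fork the next thread */
        fork_next_thread(tu);
        forward_target_store_addresses(tu); /* TARGET STORE: addresses of maybe dependences */
        do_computation(tu);                 /* COMPUTATION: forward addresses and data as needed */
        sync_with_predecessor(tu);          /* keep WRITE-BACK in thread order */
        write_back_stores(tu);
    }

    int main(void)
    {
        thread_unit tu = { 0 };
        run_thread_unit(&tu);
        return 0;
    }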

  9. The Superthreaded Architecture [diagram] • Multiple superscalar cores, each with its own registers, PC, execution unit, and communication/dependence buffer • The cores share the instruction cache and the data cache

  10. Wrong Path Execution Within a Superscalar Core [diagram] • Legend: predicted path (speculative execution), correct path, point where the prediction result is found to be wrong, wrong path (wrong-path execution), instructions not ready to be executed

  11. Wrong Thread Execution [diagram of two parallel regions separated by a sequential region] • In the sequential region between two parallel regions, the successor threads are marked as wrong threads • A wrong thread kills itself • All the wrong threads from the previous parallel region are killed

  12. How Could Wrong Thread Execution Help Improve Performance? • Example loop, with the inner loop parallelized across thread units TU1-TU4: for (i=0; i<10; i++) { ... for (j=0; j<i; j++) { ... x = y[j]; ... } ... } • When i=4, the correct iterations j=0,1,2,3 load y[0], y[1], y[2], y[3], while the wrong threads run ahead and load y[4], y[5], ... • When i=5, the iterations j=0,...,4 need y[0] through y[4], and y[4] (and y[5]) have already been brought into the cache by the wrong threads from the i=4 pass

  13. Operation of the WEC [diagram contrasting correct execution and wrong execution]

  14. Processor Configurations for Simulations • SIMCA (the SIMulator for the Superthreaded Architecture) [table of simulated features and configurations]

  15. Parameters for Each Thread Unit

  16. Characteristics of the Parallelized SPEC2000 Benchmarks

  17. Performance of the Superthreaded Architecture for the Parallelized Portions of the Benchmarks Baseline configuration

  18. Performance of the wth-wp-wec Configuration on Top of the Parallel Execution

  19. Performance Improvements Due to the WEC

  20. Sensitivity to L1 Data Cache Size

  21. Sensitivity to WEC Size Compared to a Victim Cache

  22. Sensitivity to WEC Size Compared to Next-Line Prefetching (NLP)

  23. Additional Loads and Reduction of Misses (%)

  24. Conclusions for the WEC • Allow loads to continue executing even after they are known to be incorrectly issued • Do not let them change state • 45.5% average reduction in number of misses • 9.7% average improvement on top of parallel execution • 4% average improvement over victim cache • 5.6% average improvement over next-line prefetching • Cost • 14% additional loads • Minor hardware complexity
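
A minimal C sketch of this policy, under simplifying assumptions (both caches modeled as flat tag arrays with round-robin replacement; the sizes and lookup details are illustrative, not the evaluated configuration):

    #include <stdbool.h>
    #include <stdint.h>

    #define L1_LINES  1024
    #define WEC_LINES 8

    typedef struct { bool valid; uint64_t tag; } line_t;

    static line_t l1[L1_LINES];
    static line_t wec[WEC_LINES];

    static bool lookup(const line_t *c, int n, uint64_t tag)
    {
        for (int i = 0; i < n; i++)
            if (c[i].valid && c[i].tag == tag)
                return true;
        return false;
    }

    static void fill(line_t *c, int n, uint64_t tag)
    {
        static int next;                       /* crude round-robin replacement */
        c[next++ % n] = (line_t){ true, tag };
    }

    /* Loads known to be wrongly issued are still serviced, but their data is
       placed only in the WEC so the L1 state is unchanged; a later
       correct-path load can hit in the WEC instead of missing to memory. */
    void handle_load(uint64_t block_tag, bool wrong_execution)
    {
        if (lookup(l1, L1_LINES, block_tag))   return;   /* L1 hit  */
        if (lookup(wec, WEC_LINES, block_tag)) return;   /* WEC hit */
        /* Miss: fetch from the next level (not modeled here). */
        if (wrong_execution)
            fill(wec, WEC_LINES, block_tag);
        else
            fill(l1, L1_LINES, block_tag);
    }

    int main(void)
    {
        handle_load(0x40, true);     /* wrong-execution load fills only the WEC */
        handle_load(0x40, false);    /* later correct load hits in the WEC      */
        return 0;
    }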

  25. Typical Computer Architecture Study • Find an interesting problem/performance bottleneck • E.g. Memory delays • Invent a clever idea for solving it. • This is the hard part. • Implement the idea in a processor/system simulator • This is the part grad students usually like best • Run simulations on n “standard” benchmark programs • This is time-consuming and boring • Compare performance with and without your change • Execution time, clocks per instruction (CPI), etc.

  26. Problem #2 – Simulation in Computer Architecture Research • Simulators are an important tool for computer architecture research and design • Low cost • Faster than building a new system • Very flexible

  27. Performance Evaluation Techniques Used in ISCA Papers * Some papers used more than one evaluation technique.

  28. Simulation is Very Popular, But … • Current simulation methodology is not • Formal • Rigorous • Statistically-based • Never enough simulations • Design a new processor based on a few seconds of actual execution time • What are benchmark programs really exercising?

  29. An Example -- Sensitivity Analysis • Which parameters should be varied? Fixed? • What range of values should be used for each variable parameter? • What values should be used for the constant parameters? • Are there interactions between variable and fixed parameters? • What is the magnitude of those interactions?

  30. Let’s Introduce Some Statistical Rigor • Decreases the number of errors • Modeling • Implementation • Set up • Analysis • Helps find errors more quickly • Provides greater insight • Into the processor • Effects of an enhancement • Provides objective confidence in results • Provides statistical support for conclusions

  31. Outline (Part 2) • A statistical technique for • Examining the overall impact of an architectural change • Classifying benchmark programs • Ranking the importance of processor/simulation parameters • Reducing the total number of simulation runs [Yi, Lilja, Hawkins, HPCA, 2003]

  32. A Technique to Limit the Number of Simulations • Plackett and Burman designs (1946) • Multifactorial designs • Originally proposed for mechanical assemblies • Effects of main factors only • Logically minimal number of experiments to estimate the effects of m input parameters (factors) • Ignores interactions • Requires O(m) experiments • Instead of O(2^m) or O(v^m)

  33. Plackett and Burman Designs • PB designs exist only in sizes that are multiples of 4 • Requires X experiments for m parameters • X = next multiple of 4 greater than m • PB design matrix • Rows = configurations • Columns = parameters' values in each configuration • High/low = +1/-1 • First row = from the P&B paper • Subsequent rows = circular right shift of the preceding row • Last row = all -1
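
To make the construction concrete, here is a small C sketch that builds an 8-run PB design matrix (up to 7 parameters); the generator row used is the commonly tabulated 8-run row, and should be checked against Plackett and Burman's published tables before use.

    #include <stdio.h>

    #define RUNS 8              /* number of experiments (a multiple of 4)  */
    #define COLS (RUNS - 1)     /* up to RUNS - 1 parameters can be studied */

    int main(void)
    {
        /* Commonly tabulated generator (first row) for the 8-run design. */
        const int first[COLS] = { +1, +1, +1, -1, +1, -1, -1 };
        int design[RUNS][COLS];

        /* Rows 0 .. RUNS-2: successive circular right shifts of the generator. */
        for (int r = 0; r < RUNS - 1; r++)
            for (int c = 0; c < COLS; c++)
                design[r][c] = first[((c - r) % COLS + COLS) % COLS];

        /* Last row: every parameter at its low (-1) value. */
        for (int c = 0; c < COLS; c++)
            design[RUNS - 1][c] = -1;

        for (int r = 0; r < RUNS; r++) {
            for (int c = 0; c < COLS; c++)
                printf("%+2d ", design[r][c]);
            printf("\n");
        }
        return 0;
    }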

  34. PB Design Matrix


  40. PB Design • Only the magnitude of the effect is important • Sign is meaningless • In the example, most → least important effects: • [C, D, E] → F → G → A → B

  41. Case Study #1 • Determine the most significant parameters in a processor simulator.

  42. Determine the Most Significant Processor Parameters • Problem • So many parameters in a simulator • How to choose parameter values? • How to decide which parameters are most important? • Approach • Choose reasonable upper/lower bounds. • Rank parameters by impact on total execution time.

  43. Simulation Environment • SimpleScalar simulator • sim-outorder 3.0 • Selected SPEC CPU2000 Benchmarks • gzip, vpr, gcc, mesa, art, mcf, equake, parser, vortex, bzip2, twolf • MinneSPEC Reduced Input Sets • Compiled with gcc (PISA) at -O3

  44. Functional Unit Values

  45. Memory System Values, Part I

  46. Memory System Values, Part II

  47. Processor Core Values

  48. Determining the Most Significant Parameters 1. Run simulations to find response • With input parameters at high/low, on/off values

  49. Determining the Most Significant Parameters 2. Calculate the effect of each parameter • Across configurations

  50. Determining the Most Significant Parameters 3. For each benchmark, rank the parameters in descending order of effect (1 = most important, …)
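
A brief C sketch of steps 2 and 3, with the effect of a parameter computed as the sum of responses where it is set to +1 minus the sum where it is set to -1 (sufficient for ranking by magnitude); the zeroed arrays are placeholders for the actual +1/-1 design settings and the measured execution times, not real data.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define RUNS   8    /* configurations simulated    */
    #define PARAMS 7    /* parameters in the PB design */

    typedef struct { int param; double effect; } ranked_t;

    /* Sort in descending order of |effect|: only the magnitude matters. */
    static int by_magnitude(const void *a, const void *b)
    {
        double ma = fabs(((const ranked_t *)a)->effect);
        double mb = fabs(((const ranked_t *)b)->effect);
        return (ma < mb) - (ma > mb);
    }

    int main(void)
    {
        /* Placeholders: fill with the PB +1/-1 settings and the execution
           times measured for each configuration. */
        int design[RUNS][PARAMS] = {{ 0 }};
        double response[RUNS]    = { 0 };

        ranked_t rank[PARAMS];
        for (int c = 0; c < PARAMS; c++) {            /* step 2: effects */
            double e = 0.0;
            for (int r = 0; r < RUNS; r++)
                e += design[r][c] * response[r];
            rank[c] = (ranked_t){ c, e };
        }

        qsort(rank, PARAMS, sizeof rank[0], by_magnitude);   /* step 3: rank */

        for (int i = 0; i < PARAMS; i++)
            printf("rank %d: parameter %c (effect %+.2f)\n",
                   i + 1, 'A' + rank[i].param, rank[i].effect);
        return 0;
    }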
