Behavioral Application-Dependent Superscalar Core Modeling

Ricardo Andrés Velásquez. Advisor: Pierre Michaud. Co-advisor: André Seznec.

Introduction – Simulation
Jack Kilby's original integrated circuit (1958): 1 transistor. Intel Core i7 (2012): 731E6 transistors.
Some engineering fields allow us to build prototypes identical to the target design. Computer engineering, in contrast, makes extensive use of computer simulation to test the boundaries of a design.

Introduction – Microarchitecture simulation
Slow simulators: simulation complexity increases faster than computer performance.
Wider design space exploration: complexity no longer allows designers to rely on intuition, so they rely on simulators to compare designs.
Multi/many-core processors: complexity “doubles” with every generation. Research focuses on the uncore (shared cache, interconnection, main memory, etc.).

Introduction – Microarchitecture simulation
Various models target different objectives.
[Figure: accuracy (low to high) versus simulation speed, for single-core and multi/many-core targets, covering RTL models, cycle-accurate models (detailed simulation), core models, 1-IPC models, statistical models and empirical models.]

Contributions I
• BADCO: a modeling technique for approximate simulation of modern superscalar cores.
• Workload stratification: a methodology for selecting small and representative multiprogram workloads.

Simulation time – Detailed simulator
Even worse for multicore architectures!

Core models
[Figure: a benchmark drives a functional model (oracle) and a temporal model of the core pipeline (Fetch, Decode, Alloc., Exec., Commit) with ITLB, IL1, DTLB and DL1, connected to the uncore (L2, LLC, main memory, interconnection, etc.).]

Core models
[Figure: cores 0 to N-1, each with a benchmark, a functional model, a pipeline (Fetch, Decode, Alloc., Exec., Commit), ITLB, IL1, DTLB, DL1 and L2, shown alongside per-core models run by a model simulator. The uncore comprises the interconnection, the LLC and main memory.]
What if our design target is just the uncore?

Core models
An approximate model of a superscalar core that can be connected to a detailed uncore model.
Structural core models: emulate internal behavior; model first-order parameters (ROB length, width). Examples: Interval Simulation, In-N-Out, etc.
Behavioral core models: emulate external behavior; derived from detailed simulation. Examples: PDCM, ASPEN, etc.

Behavioral Core Models
[Figure: two real traces of uncore requests, for uncore A and uncore B, from identical instructions.]
Request timing changes in non-obvious ways. Current practices fail to model these timing changes. Behavioral core models try to reproduce the external behavior of the core.

Pairwise Dependent Cache Miss model (PDCM)
K. Lee, S. Evans, and S. Cho, ISPASS 2009.
• Trace of retired uops with uncore requests (ideal L2).
• Three kinds of requests: IL1 misses, DL1 load misses and DL1 store misses.
• Emulates the ROB to limit the number of parallel requests.
• Considers data dependencies between trace items.
• SimpleScalar + perfect branch prediction + no hardware prefetching.

PDCM – Simulation flow
Slow step, performed once for every benchmark and core configuration pair: a cycle-accurate simulation with zero penalty produces the PDCM trace.
Fast step, performed once for every uncore configuration: the trace simulator, coupled to the uncore simulator, produces the performance estimate.

PDCM – Model building
Data dependencies: register + memory.
[Figure: nine retired uops, each either a request uop or a non-request uop, with retirement times 1: RT=16, 2: RT=17, 3: RT=17, 4: RT=19, 5: RT=20, 6: RT=20, 7: RT=22, 8: RT=23, 9: RT=25. They are grouped into four trace items: {1,2,3} S=3 W=17; {4,5,6} S=3 W=3; {7,8} S=2 W=3; {9} S=1 W=2.]
RT = retirement time, S = number of uops, W = number of cycles.
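The grouping on this slide can be reproduced with a short sketch. It is an illustration, not the PDCM implementation: it assumes that uops 1, 4, 7 and 9 are the request uops (the slide's color coding is lost), that each request uop opens a new trace item, and that W is the retirement time of the item's last uop minus that of the previous item's last uop. The names `TraceItem` and `build_trace_items` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TraceItem:
    uops: list  # ids of the uops grouped in this item
    S: int      # number of uops
    W: int      # number of cycles


def build_trace_items(uops):
    """uops: list of (uop_id, retirement_time, is_request), in retirement order."""
    groups = []
    for uop_id, rt, is_request in uops:
        # Assumption: a request uop opens a new trace item.
        if is_request or not groups:
            groups.append([])
        groups[-1].append((uop_id, rt))

    items, prev_rt = [], 0
    for group in groups:
        last_rt = group[-1][1]
        # W = cycles elapsed since the previous item finished retiring.
        items.append(TraceItem([u for u, _ in group], len(group), last_rt - prev_rt))
        prev_rt = last_rt
    return items


# Retirement times from the slide; the request flags are an assumption.
SLIDE_UOPS = [
    (1, 16, True), (2, 17, False), (3, 17, False),
    (4, 19, True), (5, 20, False), (6, 20, False),
    (7, 22, True), (8, 23, False),
    (9, 25, True),
]

for item in build_trace_items(SLIDE_UOPS):
    print(item)
```

Under these assumptions the sketch yields the same S and W values as the four trace items shown on the slide.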

Tuning PDCM to Zesto
Zesto is a highly detailed cycle-level simulator (Loh et al., ISPASS’09).
[Figure: CPI error (%) for PDCM and for PDCM++, which successively adds +prefetch, +wrong_path, +write_backs, +TLB_misses and +delayed_hits.]
Average CPI error: 4.5% (SimpleScalar) vs. 7.8% (Zesto).
Considering additional requests increases accuracy.

PDCM limitations
Different sources of dependencies:
• Data dependencies (register and memory).
• Resource dependencies (queues: LDQ, STQ, etc.).
Resource dependencies impact performance:
• Long-latency accesses.
• Contention for resources.
• More requests in the wrong path.
Tracking all sources of dependencies is complex.

Behavioral application-dependent superscalar core model – BADCO
• New core model inspired by PDCM.
• Two cycle-accurate traces: null latency T0 (same as PDCM) and long latency TL (to infer dependencies).
• Emulates the ROB and level-1 MSHRs to limit the number of parallel requests.
• Differentiated processing for instruction requests and store requests.

BADCO Simulation Flow
Slow step, performed once for every benchmark and core configuration pair: a zero-penalty simulation and a long-penalty simulation produce the traces T0 and TL, from which model building produces the model graph.
Fast step, performed once for every uncore configuration: the BADCO machine runs the model graph, coupled to the uncore simulator.

Trace Generation
Two traces (Zesto) of retired μops.
T0: level-1 cache misses have zero penalty. μops are annotated with retirement time. Captures the fixed cost (W) of μops.
TL: level-1 cache misses have a long penalty (1000 cycles). μops are annotated with issue time (IT), completion time (CT) and uncore requests. Used to infer and expose dependencies and to capture requests.
[Figure: examples. T0: RT=16 followed by RT=20 gives W=4. TL: IT=9, CT=2009 followed by IT=2010, CT=2013 is dependent; IT=9, CT=2009 followed by IT=14, CT=19 is independent.]

Model Building
[Figure: step-by-step construction of the model graph from T0 and TL for seven μops, each either a request μop or a non-request μop.]
μops (RT from T0; IT and CT from TL): 1: RT=16, IT=9, CT=2009; 2: RT=17, IT=2010, CT=2013; 3: RT=17, IT=14, CT=19; 4: RT=19, IT=2011, CT=3012; 5: RT=20, IT=3014, CT=3016; 6: RT=20, IT=2012, CT=2019; 7: RT=23, IT=3013, CT=3021.
Resulting nodes (request nodes and non-request nodes): N1 (μops 1,3) W=16 S=2 D=0; N2 (μop 2) W=1 S=1 D=N1; N3 (μops 4,6) W=2 S=2 D=N1; N4 (μops 5,7) W=4 S=2 D=N3.
RT = retirement time, IT = issue time, CT = completion time, W = weight (cycles), S = size (μops).
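The example on this slide can be replayed with a simplified sketch. The rules below are inferred from the slide's numbers only and are not the actual BADCO algorithm: a μop's weight is its T0 retirement-time delta; its dependency D is the request node with the latest completion time earlier than the μop's TL issue time; a request μop always opens a new node; a non-request μop joins the most recent node with the same D, or opens a new node if none exists. Treating μops 1 and 4 as the request μops is also an assumption, and `build_nodes` is a hypothetical name.

```python
def build_nodes(uops):
    """uops: dicts with id, rt (from T0), it and ct (from TL), request (bool)."""
    nodes = []
    prev_rt = 0
    for u in uops:
        weight = u["rt"] - prev_rt  # fixed cost taken from the T0 trace
        prev_rt = u["rt"]

        # Inferred rule: depend on the request node that completed last
        # before this uop issued in the long-latency trace.
        dep = None
        for n in nodes:
            if n["request"] and n["ct"] < u["it"]:
                if dep is None or n["ct"] > dep["ct"]:
                    dep = n
        dep_name = dep["name"] if dep else "0"

        # Inferred rule: non-request uops merge into the most recent node
        # that has the same dependency.
        target = None
        if not u["request"]:
            for n in reversed(nodes):
                if n["D"] == dep_name:
                    target = n
                    break
        if target is None:
            target = {"name": f"N{len(nodes) + 1}", "W": 0, "S": 0, "D": dep_name,
                      "uops": [], "request": u["request"], "ct": u["ct"]}
            nodes.append(target)

        target["W"] += weight
        target["S"] += 1
        target["uops"].append(u["id"])
    return nodes


# Values from the slide; the request flags are an assumption.
SLIDE_TRACE = [
    {"id": 1, "rt": 16, "it": 9,    "ct": 2009, "request": True},
    {"id": 2, "rt": 17, "it": 2010, "ct": 2013, "request": False},
    {"id": 3, "rt": 17, "it": 14,   "ct": 19,   "request": False},
    {"id": 4, "rt": 19, "it": 2011, "ct": 3012, "request": True},
    {"id": 5, "rt": 20, "it": 3014, "ct": 3016, "request": False},
    {"id": 6, "rt": 20, "it": 2012, "ct": 2019, "request": False},
    {"id": 7, "rt": 23, "it": 3013, "ct": 3021, "request": False},
]

for node in build_nodes(SLIDE_TRACE):
    print(node["name"], node["W"], node["S"], node["D"], node["uops"])
```

With these inferred rules the sketch ends in the same four nodes as the final step of the slide; a real model builder would also have to handle the request types and resource effects discussed on the other slides.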

Model Simulation – BADCO Machine
[Figure: animation of the BADCO machine executing a model graph against the uncore. Nodes N1 to N10 each carry a weight W, a size S, a dependency D and a list of uncore requests (ITLB, IL1, DTLB1, DL1_LD, DL1_PF, DL1_ST, DL1_WB, DL1_HoM). The machine has Fetch, Exe. and Store stages, tracks the current cycle and the ROB occupancy, and can stall.]

Evaluation methodology
• Compare the single-core accuracy of PDCM and BADCO with respect to Zesto:
  • Quantitative accuracy (3 core configurations).
  • Relative accuracy (6 uncore configurations).
• Compare the simulation speed of PDCM and BADCO for a single thread.
• Measure the multi-core accuracy and simulation speed of BADCO with respect to Zesto.

Experimental Setup
22 SPEC2K6 benchmarks + 2 SPEC2K benchmarks (Vortex & Crafty).

Quantitative Accuracy – Big core

Quantitative Accuracy – Summary

Relative Accuracy
Design space exploration: speedup is more relevant than absolute performance. We would like to minimize the speedup error.

Relative Accuracy
Config.: 256KB L2, 16MB LLC and 2-byte bus.
Ref. config.: 256KB L2, 2MB LLC and 8-byte bus.

Relative Accuracy – Summary
Ref. config.: 256KB L2, 2MB LLC and 8-byte bus.

Simulation Speed
[Figure: speedup; labeled values 68x, 47x, 17x and 15x.]

Multicore simulation: estimated CPI vs. measured CPI

Simulation speed
[Figure: labeled values 14.8x, 25.2x, 38.9x and 68.1x.]

Behavioral core modeling summary
• Behavioral core models increase simulation speed by one to two orders of magnitude with respect to detailed simulation.
• PDCM has limitations, so we introduce BADCO, a new behavioral core model.
• BADCO models are built from two cycle-accurate simulations.
• BADCO is more accurate than PDCM and PDCM++.

Contributions II
• BADCO: a modeling technique for approximate simulation of modern superscalar cores.
• Workload stratification: a methodology for selecting small and representative multiprogram workloads.

Workload design
Select from the workload space a set of representative workloads.
Single-core: workload = 1 benchmark. Well-established methods (benchmark design).
Multi-core: workload = combination of benchmarks. No standard method for workload selection.

Multiprogram workload selection
The number W of possible multiprogram workloads is a function of B, the number of benchmarks, and K, the number of cores (example given for 29 SPEC-CPU benchmarks).
It is impossible to simulate all possible benchmark combinations.
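The slide's formula did not survive extraction. As a hedged illustration, the sketch below counts workloads under two assumptions that the transcript does not state: the K cores are identical, and a benchmark may run on more than one core. A workload is then a multiset of K benchmarks drawn from B, and the count is the binomial coefficient C(B + K - 1, K). The function name `workload_count` is hypothetical.

```python
import math


def workload_count(num_benchmarks, num_cores):
    """Number of multisets of size num_cores drawn from num_benchmarks.

    Assumes identical cores and that a benchmark may be replicated.
    """
    return math.comb(num_benchmarks + num_cores - 1, num_cores)


for cores in (2, 4, 8):
    print(cores, "cores:", workload_count(29, cores), "workloads")
```

Under these assumptions, 29 benchmarks give 435 workloads for 2 cores and 35960 for 4 cores, and the count grows rapidly with the number of cores.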

Current practices I
Survey of 75 papers, 2007 – 2012 (ISCA, MICRO and HPCA):
• 9/75 use random sampling, with an arbitrary sample size.
• 66/75 use class-based selection: benchmark classes are selected manually and workload types are defined. Practices for selecting workloads are diverse, and the sample size is arbitrary.

Current practices II
“Interesting sample”: high degree of subjectivity. The sample may be interesting, but it may not be representative of the population. Caution is needed before drawing general conclusions.

Representative sample?
Representativeness: the probability that a characteristic of the population is preserved in the sample, exactly or within a certain tolerance.
Example characteristics: global throughput; global speedup; global ranking of microarchitectures; which of two microarchitectures is better?
TARGET: define a question that you want to ask of the sample, then look for a way to answer that question.

Methodology
TARGET: a small representative sample that gives the correct ranking of two microarchitectures.
Case study: 5 shared-cache replacement policies (LRU, RANDOM, FIFO, DIP and DRRIP).
Use approximate simulation (BADCO): all benchmark combinations for 2 and 4 cores; 10000 workloads for 8 cores.
Study random sampling, with an analytical model to compute the sample size.
Study alternative sampling methods.

Random Sampling
All workloads have the same probability of being selected. This is a safe way to avoid biases if the sample is big enough, and it lends itself to analytical modeling.

Analytical Model
What we want from the random sample is to know whether or not microarchitecture Y is better than X.
tY(w) and tX(w): per-workload throughput of Y and X. TY and TX: average throughput.
We define the following random variable: d(w) is the per-workload throughput difference. D is the average throughput difference.

Analytical Model
By the central limit theorem, the sample throughput difference D can be approximated by a normal distribution. The degree of confidence that Y is better than X is equal to the probability that D is greater than zero. Assuming almost 100% confidence, and after some math, we obtain an expression in which W is the sample size and cv is the coefficient of variation of d(w).
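The slide's closed-form expression is missing from the transcript, so the sketch below only works through the reasoning it describes. It assumes D is the mean of W independent draws of d(w), so that D is approximately normal with mean μ and standard deviation σ/√W, and that cv = σ/μ with μ > 0. The confidence P(D > 0) is then Φ(√W / cv). The function names and the parameter `z` (the target number of standard deviations) are hypothetical.

```python
import math


def confidence(sample_size, cv):
    """P(D > 0) for D ~ Normal(mu, sigma^2 / W), with cv = sigma / mu > 0."""
    z = math.sqrt(sample_size) / cv
    # Standard normal CDF expressed with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sample_size_for(cv, z):
    """Smallest W such that sqrt(W) / cv >= z."""
    return math.ceil((z * cv) ** 2)


for cv in (1.0, 2.0, 10.0):
    w = sample_size_for(cv, z=3.0)
    print(f"cv={cv}: W={w}, confidence={confidence(w, cv):.4f}")
```

The required sample size grows with the square of cv, which is why estimating cv is enough to size the sample.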

Coefficient of variation
The coefficient of variation (CV) is a normalized measure of the dispersion of a probability distribution.
[Figure: example distributions with σ=0.5, μ=0.5; σ=1, μ=0.5; σ=5, μ=0.5.]
Estimating the CV allows us to compute the sample size.
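As a small illustration, the CV can be estimated from a set of per-workload differences d(w) as the ratio of the standard deviation to the mean. The sketch uses the population standard deviation; whether the original study used the population or the sample estimator is not stated, and the function name and the input values are made up for the example.

```python
import statistics


def coefficient_of_variation(d_values):
    """CV = sigma / mu, computed over the given per-workload differences."""
    mu = statistics.mean(d_values)
    sigma = statistics.pstdev(d_values)  # population standard deviation
    return sigma / mu


# Hypothetical per-workload throughput differences d(w).
d_values = [0.02, 0.05, -0.01, 0.04, 0.10, 0.00]
print(coefficient_of_variation(d_values))
```

For the three (σ, μ) pairs listed on the slide, σ/μ gives CV values of 1, 2 and 10.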

CV Estimation: 4 cores – WSU

Random sampling model validation
Experimental confidence vs. model confidence that “DRRIP outperforms DIP”, using WSU.

Can we do better?
Big samples. [Figure: 4 cores.]
Explore alternative sampling techniques:
• Balanced random sampling.
• Stratified sampling, based on benchmark classes or on per-workload throughput.

Balanced Random Sampling
Each benchmark occurs the same number of times in the whole workload population. In balanced random sampling, each benchmark also occurs the same number of times in the sample. The probability of selecting a workload depends on the workloads previously selected. No mathematical model.
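One simple way to obtain a balanced sample is sketched below; the slides do not say how the authors built theirs, so this is only an assumed procedure. It fills a pool with equally many copies of each benchmark, shuffles it, and cuts it into workloads of K benchmarks. The function name, the seed parameter and the benchmark names are hypothetical, and the sketch does not prevent the same workload from being drawn twice.

```python
import random
from collections import Counter


def balanced_random_sample(benchmarks, cores, num_workloads, seed=0):
    """Return num_workloads workloads in which every benchmark occurs equally often."""
    total_slots = cores * num_workloads
    if total_slots % len(benchmarks) != 0:
        raise ValueError("cores * num_workloads must be a multiple of the benchmark count")
    rng = random.Random(seed)
    # Equal number of copies of each benchmark, in random order.
    pool = list(benchmarks) * (total_slots // len(benchmarks))
    rng.shuffle(pool)
    return [tuple(sorted(pool[i:i + cores])) for i in range(0, total_slots, cores)]


sample = balanced_random_sample(["bzip2", "gcc", "mcf", "lbm"], cores=2, num_workloads=6)
print(sample)
print(Counter(b for workload in sample for b in workload))
```

Because each draw removes a benchmark from the pool, the probability of a workload depends on the workloads already selected, which matches the property noted on the slide.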

Stratified Random Sampling
Classical sampling method that exploits homogeneities.
• Divide the population into non-overlapping subsets (strata).
• Take random samples in each stratum.
• Sample throughput is a weighted average.
We study 2 variants: benchmark stratification and workload stratification.
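The weighted average can be sketched as the textbook stratified estimator: each stratum's sample mean is weighted by the stratum's share of the population. This is the standard form, assumed here rather than taken from the slides. The function name is hypothetical, strata are assumed non-empty, and the same number of workloads is drawn from every stratum for simplicity.

```python
import random


def stratified_mean(strata, per_stratum, seed=0):
    """strata: list of non-empty lists of per-workload values (one list per stratum)."""
    rng = random.Random(seed)
    population = sum(len(s) for s in strata)
    estimate = 0.0
    for stratum in strata:
        n = min(per_stratum, len(stratum))
        sample = rng.sample(stratum, n)  # random sample without replacement
        # Weight the stratum's sample mean by its share of the population.
        estimate += (len(stratum) / population) * (sum(sample) / n)
    return estimate


# Hypothetical per-workload throughput values, grouped into two strata.
strata = [[1.0, 1.1, 0.9, 1.0], [5.0, 5.2]]
print(stratified_mean(strata, per_stratum=2))
```

When the values inside each stratum are close to each other, a few workloads per stratum already give an estimate close to the population mean, which is the homogeneity the method exploits.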

Benchmark stratification
An attempt to formalize common practices (class-based selection).
• Divide benchmarks into classes, grouping benchmarks with similar behavior.
• Build a stratum for every combination of classes.
For example, three classes on a 4-core machine generate 15 strata (004, 013, 022, 112, …).
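The count of 15 can be checked by enumeration. The sketch below reads each stratum label as the number of benchmarks taken from each class (so 013 means none from the first class, one from the second and three from the third); that reading of the labels is an assumption, and `class_strata` is a hypothetical name.

```python
from itertools import product


def class_strata(num_classes, cores):
    """All ways to split `cores` slots among `num_classes` benchmark classes."""
    return [counts
            for counts in product(range(cores + 1), repeat=num_classes)
            if sum(counts) == cores]


strata = class_strata(num_classes=3, cores=4)
labels = ["".join(str(c) for c in counts) for counts in strata]
print(len(labels), labels)
```

With three classes and four cores the enumeration produces 15 labels, including 004, 013, 022 and 112.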

Workload stratification
• Estimate per-workload throughput with an approximate simulator (BADCO) on a large sample (>800 workloads).
• Measure the per-workload throughput difference d(w).
• Sort the workloads according to d(w).
• Use a clustering algorithm to group workloads into strata.
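The sort-then-cluster steps can be sketched as follows. The slides do not name the clustering algorithm, so this example substitutes a very simple one-dimensional stand-in: sort the workloads by d(w) and cut the sorted list at the largest gaps between neighbors. The function name, the workload names and the d(w) values are hypothetical.

```python
def stratify_by_gaps(d, num_strata):
    """d: dict mapping workload -> estimated d(w). Returns a list of strata."""
    ordered = sorted(d, key=d.get)
    # Rank the boundaries between neighbors by the size of the gap in d(w).
    boundaries = sorted(
        range(1, len(ordered)),
        key=lambda i: d[ordered[i]] - d[ordered[i - 1]],
        reverse=True,
    )
    cuts = sorted(boundaries[:num_strata - 1])

    strata, start = [], 0
    for cut in cuts + [len(ordered)]:
        strata.append(ordered[start:cut])
        start = cut
    return strata


# Hypothetical d(w) estimates, e.g. obtained from an approximate simulator.
d_estimates = {"w1": 0.10, "w2": 0.12, "w3": 0.90, "w4": 0.95, "w5": 2.00}
print(stratify_by_gaps(d_estimates, num_strata=3))
```

Workloads with similar d(w) end up in the same stratum, so each stratum is homogeneous in the quantity being estimated; a real study could replace the gap rule with any clustering method.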

Alternative Sampling Methods
DIP > LRU, IPCT, 4 cores, CV = 10.86.
