Explore the extensive MacSim tutorials presented at ISCA-39 in 2012, providing insights into architectural studies. This resource covers a range of topics, including thread fetch policies, branch predictors, cache performance, DRAM scheduling, and interconnection studies. Learn about software and hardware prefetchers, cache sharing, TLP-aware management, and the impact of instruction fetch on GPGPU performance. The tutorials also delve into power modeling and verification for CPUs and GPUs, making it a vital asset for researchers and practitioners in computer architecture.
MacSim Architecture Studies
MacSim Tutorial (ISCA-39, 2012)
Architecture Studies Using MacSim
• Front-end: thread fetch policies, branch predictor
• Memory system: software and hardware prefetchers, cache studies (sharing, inclusion), DRAM scheduling, interconnection studies
• Misc.: power model
Prefetcher Study
• Trace generator (Pin, GPUOcelot) feeds the MacSim frontend and memory system.
• Software prefetch instructions: PTX prefetch, prefetchu; x86 prefetcht0, prefetcht1, prefetchnta
• Hardware prefetch requests: hardware prefetcher (stream, stride, GHB, …)
• Many-thread Aware Prefetching Mechanisms [Lee et al., MICRO-43, 2010]
• When Prefetching Works, When It Doesn't, and Why [Lee et al., ACM TACO, 2012]
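To make the hardware-prefetcher side concrete, below is a minimal per-PC stride prefetcher sketch in C++. It is illustrative only, not MacSim's implementation; the class name StridePrefetcher, the Access() interface, and the prefetch degree are assumptions made for the example.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Minimal per-PC stride prefetcher sketch (illustrative, not MacSim's code).
// For each load PC it remembers the last address and last stride; when the
// same stride is observed twice in a row, it issues prefetches for the next
// few strided addresses.
class StridePrefetcher {
 public:
  explicit StridePrefetcher(int degree = 2) : degree_(degree) {}

  // Called on every demand access; returns candidate prefetch addresses.
  std::vector<uint64_t> Access(uint64_t pc, uint64_t addr) {
    std::vector<uint64_t> prefetches;
    Entry& e = table_[pc];
    int64_t stride =
        static_cast<int64_t>(addr) - static_cast<int64_t>(e.last_addr);
    if (e.valid && stride != 0 && stride == e.last_stride) {
      for (int i = 1; i <= degree_; ++i)
        prefetches.push_back(addr + i * stride);
    }
    e.last_stride = stride;
    e.last_addr = addr;
    e.valid = true;
    return prefetches;
  }

 private:
  struct Entry {
    uint64_t last_addr = 0;
    int64_t last_stride = 0;
    bool valid = false;
  };
  int degree_;                                // prefetch distance per trigger
  std::unordered_map<uint64_t, Entry> table_; // indexed by load PC
};
```

A stream or GHB prefetcher would replace the per-PC table with, respectively, per-region stream tracking or a global history buffer, but the trigger-and-issue flow stays the same.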
Cache and NoC Studies
• Baseline organizations (figure omitted): per-core private caches vs. a shared cache, each connected through the on-chip interconnection.
• Cache studies: sharing, inclusion property
• On-chip interconnection studies
• TLP-Aware Cache Management Policy [Lee and Kim, HPCA-18, 2012]
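One concrete aspect of the inclusion-property studies is what happens when an inclusive shared last-level cache (LLC) evicts a block: every private cache holding that block must be back-invalidated. The C++ sketch below shows only that mechanism; the types PrivateCache and InclusiveLLC and their methods are hypothetical names for the example, not MacSim's classes.

```cpp
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

// Sketch of inclusion enforcement: the private caches must always hold a
// subset of the shared LLC, so an LLC eviction triggers back-invalidation.
struct PrivateCache {
  std::unordered_set<uint64_t> blocks;  // block addresses currently cached
  void Invalidate(uint64_t block) { blocks.erase(block); }
};

class InclusiveLLC {
 public:
  explicit InclusiveLLC(std::vector<PrivateCache*> privates)
      : privates_(std::move(privates)) {}

  void Insert(uint64_t block) { blocks_.insert(block); }

  // Evicting a victim block from an inclusive LLC forces back-invalidation
  // in every private cache; a non-inclusive LLC would skip this loop.
  void Evict(uint64_t victim_block) {
    blocks_.erase(victim_block);
    for (PrivateCache* pc : privates_)
      pc->Invalidate(victim_block);
  }

 private:
  std::unordered_set<uint64_t> blocks_;
  std::vector<PrivateCache*> privates_;
};
```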
Heterogeneity-Aware NoC
• Heterogeneous link configuration
• Ring network connecting CPU cores, GPU cores, memory controllers (MC), and L3 slices; different topologies and node placements are explored (figure omitted).
• On-chip Interconnection for CPU-GPU Heterogeneous Architecture [Lee et al., under review]
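A heterogeneous ring can be modeled by giving each link its own latency and width, so segments near GPU nodes can be provisioned differently from CPU segments. The C++ sketch below computes minimal-hop latency on such a ring; the Link and RingNetwork names and the per-link parameters are assumptions for illustration, not MacSim's NoC model.

```cpp
#include <utility>
#include <vector>

// Sketch of a heterogeneous ring: each stop (CPU core, GPU core, memory
// controller, L3 slice) sits on a ring, and each link carries its own
// latency/width so links can be provisioned per node type.
struct Link {
  int latency_cycles;  // per-hop latency of this link
  int width_bytes;     // per-cycle bandwidth of this link
};

class RingNetwork {
 public:
  // links[i] connects node i to node (i + 1) % n.
  explicit RingNetwork(std::vector<Link> links) : links_(std::move(links)) {}

  // Latency between two ring stops, taking the shorter direction.
  int Latency(int src, int dst) const {
    int n = static_cast<int>(links_.size());
    int cw = (dst - src + n) % n;         // clockwise hop count
    bool clockwise = cw <= n - cw;
    int hops = clockwise ? cw : n - cw;
    int total = 0;
    int node = src;
    for (int i = 0; i < hops; ++i) {
      int link = clockwise ? node : (node - 1 + n) % n;
      total += links_[link].latency_cycles;
      node = clockwise ? (node + 1) % n : (node - 1 + n) % n;
    }
    return total;
  }

 private:
  std::vector<Link> links_;
};
```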
Instruction Fetch and DRAM Scheduling
• Trace generator (GPUOcelot) drives the frontend, execution, and DRAM models.
• Frontend fetch policies: RR, ICOUNT, FAIR, LRF, …
• DRAM scheduling policies: FCFS, FR-FCFS, FAIR, …
• Effect of Instruction Fetch and Memory Scheduling on GPU Performance [Lakshminarayana and Kim, LCA-GPGPU, 2010]
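As a concrete illustration of two of the listed fetch policies, the C++ sketch below selects the next thread/warp to fetch from using either round-robin (RR) or ICOUNT (fetch from the thread with the fewest in-flight instructions). The FetchArbiter class and its methods are invented for the example; in MacSim these policies are selected through configuration rather than this API.

```cpp
#include <cstddef>
#include <vector>

// Sketch of RR and ICOUNT fetch arbitration across hardware threads.
class FetchArbiter {
 public:
  explicit FetchArbiter(std::size_t num_threads)
      : inflight_(num_threads, 0), last_(0) {}

  void OnIssue(std::size_t tid)  { ++inflight_[tid]; }   // instruction enters pipeline
  void OnRetire(std::size_t tid) { --inflight_[tid]; }   // instruction leaves pipeline

  // RR: rotate through threads regardless of their backlog.
  std::size_t PickRoundRobin() {
    last_ = (last_ + 1) % inflight_.size();
    return last_;
  }

  // ICOUNT: favor the thread with the fewest in-flight instructions, which
  // tends to keep pipeline occupancy balanced across threads.
  std::size_t PickIcount() const {
    std::size_t best = 0;
    for (std::size_t t = 1; t < inflight_.size(); ++t)
      if (inflight_[t] < inflight_[best]) best = t;
    return best;
  }

 private:
  std::vector<int> inflight_;  // in-flight instruction count per thread
  std::size_t last_;           // last thread fetched by RR
};
```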
DRAM Scheduling in GPGPUs
• The DRAM controller keeps per-core request queues (W0–W3 for Core-0 and Core-1) in front of each DRAM bank, holding row-hit (RH) and row-miss (RM) requests (figure omitted).
• Potential of the requests from Core-0 = |W0|^α + |W1|^α + |W2|^α + |W3|^α, with α < 1.
• Reduction in potential if a row hit from a queue of length L is serviced next: L^α − (L − 1)^α
• Reduction in potential if a row miss from a queue of length L is serviced next: L^α − (L − 1/m)^α, where m = cost of servicing a row miss / cost of servicing a row hit
• Since Tolerance(Core-0) < Tolerance(Core-1), Core-0 is selected; servicing a row hit from W1 (of Core-0) yields the greatest reduction in potential, so row hits from W1 are serviced next.
• DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function [Lakshminarayana et al., IEEE CAL, 2011]
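The C++ sketch below evaluates the two reduction formulas above and picks the queue of the selected core whose service drops the potential the most. It is a minimal sketch of the idea under the stated definitions of α and m; the QueueState struct and function names are assumptions for the example, and the full policy is described in the IEEE CAL 2011 paper.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-queue state at the DRAM controller for one core.
struct QueueState {
  std::size_t length;    // number of pending requests |W_i|
  bool head_is_row_hit;  // does the oldest request hit the open row?
};

// Reduction in potential from servicing the head of one queue.
// alpha < 1; m = (cost of servicing a row miss) / (cost of servicing a row hit).
double PotentialReduction(const QueueState& q, double alpha, double m) {
  double L = static_cast<double>(q.length);
  if (q.head_is_row_hit)
    return std::pow(L, alpha) - std::pow(L - 1.0, alpha);      // L^a - (L-1)^a
  return std::pow(L, alpha) - std::pow(L - 1.0 / m, alpha);     // L^a - (L-1/m)^a
}

// Pick the queue whose service gives the largest drop in potential.
std::size_t PickQueue(const std::vector<QueueState>& queues,
                      double alpha, double m) {
  std::size_t best = 0;
  double best_drop = -1.0;
  for (std::size_t i = 0; i < queues.size(); ++i) {
    if (queues[i].length == 0) continue;   // skip empty queues
    double drop = PotentialReduction(queues[i], alpha, m);
    if (drop > best_drop) { best_drop = drop; best = i; }
  }
  return best;
}
```

Because α < 1, the potential function is concave, so draining requests from longer queues (and preferring cheap row hits) gives the largest per-service reduction.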
Power Research & Validation
• Verifying the simulator against a GTX 580
• Modeling x86 CPU power
• Modeling GPU power
• Still ongoing research
MacSim's Roadmap (2012–2013)
• OpenGL programs
• ARM architecture
• Mobile platform
• Power/energy model