
Presentation Transcript


  1. Paper Presentation A Helper Thread Based Dynamic Cache Partitioning Scheme for Multithreaded Applications 2013/06/10 Yun-Chung Yang Kandemir, M., Yemliha, T., and Kultursay, E., Pennsylvania State Univ., University Park, PA, USA. Design Automation Conference (DAC), 2011, 48th ACM/EDAC/IEEE, pp. 954–959

  2. Outline • Abstract • Related Work • Motivation • Difference between inter and intra application • Proposed Method • Experiment Result • Conclusion

  3. Abstract Focusing on the problem of how to partition the cache space given to a multithreaded application across its threads, we show that different threads of a multithreaded application can have different cache space requirements. We propose a fully automated, dynamic, intra-application cache partitioning scheme targeting emerging multicores with multilayer cache hierarchies, present a comprehensive experimental analysis of the proposed scheme, and show average improvements of 17.1% and 18.6% on the SPECOMP and PARSEC suites, respectively.

  4. Related Work • Resource management: processor cores [6]; shared cache [4, 5, 8, 11, 12, 17, 18, 20]; off-chip bandwidth [3, 10, 13] • Application granularity: intra-application shared cache [16] • This paper: intra-application partitioning that also addresses multiple cache layers

  5. Motivation • Run the facesim (PARSEC) and art (SPECOMP) applications. • Evaluate six schemes and record the Average Memory Access Time (AMAT): • No-partition • Uniform • Nonuniform • Nonuniform-L2 • Nonuniform-L3 • Dynamic • Dynamic outperforms the rest: dividing the application into fixed epochs and repartitioning at each epoch performs best.

  6. Difference between Inter & Intra App. • The objectives and implementations of the two kinds of cache partitioning differ. • Intra-application cache partitioning tries to minimize the latency of the slowest thread (a rough formalization follows). • Handled by the runtime system or a dynamic compiler • Inter-application cache partitioning tries to optimize workload throughput. • An OS problem
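As a rough formalization of the intra-application objective (the notation here is assumed, not taken from the slides): for an application whose m threads share q cache ways,

```latex
\min_{s(1),\dots,s(m)} \; \max_{1 \le k \le m} \mathrm{AMAT}_k\bigl(s(k)\bigr)
\quad \text{subject to} \quad \sum_{k=1}^{m} s(k) \le q
```

where s(k) is the number of ways given to thread k; the partition is chosen so that the slowest thread's memory latency is as small as possible.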

  7. The Proposed Method • A dynamic partitioning system built around a helper thread whose main responsibility is to partition the cache space allocated to the application so as to maximize its performance. • Components (labels from the system diagram): Performance Monitoring, Performance Modeling, System Interfacing

  8. Proposed Method (cont.) • Each OS epoch is composed of many application epochs, each of which is divided into five phases (sketched below): • Performance Monitoring • Performance Modeling • Resource Partitioning • System Interfacing • Application Execution
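A minimal sketch of the helper thread's per-epoch loop in C, assuming a hypothetical app_state_t and phase functions (all names here are illustrative placeholders, not the paper's API):

```c
#include <stdbool.h>

/* Hypothetical per-application state; fields are illustrative. */
typedef struct {
    bool done;                     /* set when the application finishes */
} app_state_t;

/* Phase stubs: a real system would read hardware counters, refit the
 * performance model, and talk to the OS partitioning interface. */
static void monitor_counters(app_state_t *a)   { (void)a; } /* Performance Monitoring */
static void update_model(app_state_t *a)       { (void)a; } /* Performance Modeling   */
static void repartition_ways(app_state_t *a)   { (void)a; } /* Resource Partitioning  */
static void apply_partition(app_state_t *a)    { (void)a; } /* System Interfacing     */
static void wait_for_epoch_end(app_state_t *a) { (void)a; } /* Application Execution  */

/* Helper thread: one loop iteration per application epoch. */
void helper_thread_main(app_state_t *app)
{
    while (!app->done) {
        monitor_counters(app);
        update_model(app);
        repartition_ways(app);
        apply_partition(app);
        wait_for_epoch_end(app);
    }
}
```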

  9. Performance Monitoring • Use Average Memory Access Time (AMAT) as the measure of a thread's cache performance. • AMAT: the ratio of total cycles spent on memory instructions to the total number of memory instructions • Depends on the cache partition size • Takes the different levels of the cache into account
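A minimal sketch of computing AMAT from two per-epoch counters; the counter names are assumptions, since the slides give only the ratio itself:

```c
/* AMAT = (cycles spent on memory instructions) / (number of memory
 * instructions). Counter names are hypothetical; real hardware exposes
 * equivalents through performance-monitoring events. */
double compute_amat(unsigned long long mem_cycles,
                    unsigned long long mem_instructions)
{
    if (mem_instructions == 0)
        return 0.0;               /* no memory accesses in this epoch */
    return (double)mem_cycles / (double)mem_instructions;
}
```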

  10. Performance Modeling • Need to predict the impact of increasing or decreasing the cache space given to a thread. • Each thread is expressed as a 3D surface • X and Y are the cache space allocations from L2 and L3, respectively • For thread i, sampled points d(sL2, sL3) are used to build its dynamic model • Purpose: predict the performance of a thread under allocations not yet observed
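One plausible realization of such a surface model, offered only as a sketch (the paper's exact model is not reproduced in this transcript): keep observed AMAT samples on an (L2 ways, L3 ways) grid and interpolate bilinearly for allocations that have not been tried.

```c
#define MAX_L2_WAYS 16            /* assumed way counts, for illustration */
#define MAX_L3_WAYS 16

/* Per-thread AMAT samples indexed by (L2 ways, L3 ways). */
typedef struct {
    double amat[MAX_L2_WAYS + 1][MAX_L3_WAYS + 1];
} thread_model_t;

/* Predict AMAT for an allocation (s_l2, s_l3) by bilinear interpolation
 * between the four nearest sampled grid points. Assumes those four
 * neighbors have already been observed. */
double predict_amat(const thread_model_t *m, double s_l2, double s_l3)
{
    int x0 = (int)s_l2, y0 = (int)s_l3;
    if (x0 >= MAX_L2_WAYS) x0 = MAX_L2_WAYS - 1;  /* clamp to grid interior */
    if (y0 >= MAX_L3_WAYS) y0 = MAX_L3_WAYS - 1;
    double fx = s_l2 - x0, fy = s_l3 - y0;
    return m->amat[x0][y0]         * (1 - fx) * (1 - fy)
         + m->amat[x0 + 1][y0]     * fx       * (1 - fy)
         + m->amat[x0][y0 + 1]     * (1 - fx) * fy
         + m->amat[x0 + 1][y0 + 1] * fx       * fy;
}
```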

  11. Cache Space Partitioning • For the ith L2 cache, qL2,i denotes the total number of cache ways allocated to this application. • These qL2,i ways are shared by mL2,i threads (indexed 0 to mL2,i − 1) • The number of ways allocated to the kth thread is denoted sL2,i(k)

  12. Cache Space Partitioning Algorithm • P[t] denotes the cache resources (numbers of ways in L2 & L3).
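The transcript does not reproduce the algorithm itself. As an illustrative sketch only, a greedy allocator consistent with the min-max objective of slide 6 could repeatedly award the next free way to the thread predicted to be slowest; predict_amat_1d is a stub standing in for the performance model of slide 10.

```c
/* Hypothetical model query: predicted AMAT of `thread` given `ways` ways.
 * Stubbed with a placeholder curve; a real system would consult the
 * per-thread performance model. */
static double predict_amat_1d(int thread, int ways)
{
    (void)thread;
    return 100.0 / (double)ways;  /* placeholder: more ways, lower AMAT */
}

/* Give every thread one way, then repeatedly hand the next free way to
 * the thread with the worst predicted AMAT (min-max objective). The
 * resulting allocation is written to s[0..num_threads-1]. */
static void partition_ways(int num_threads, int total_ways, int s[])
{
    for (int k = 0; k < num_threads; k++)
        s[k] = 1;
    for (int free_ways = total_ways - num_threads; free_ways > 0; free_ways--) {
        int worst = 0;
        for (int k = 1; k < num_threads; k++)
            if (predict_amat_1d(k, s[k]) > predict_amat_1d(worst, s[worst]))
                worst = k;
        s[worst]++;
    }
}
```

In a multilayer setting, a pass like this would presumably run once per cache level (L2 and L3) against the corresponding slice of the model.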

  13. System Interfacing • New partition information is delivered to the OS using a system call. • A new instruction is added to the ISA • COID = core ID, CLVL = cache level, CAID = cache ID, W = 64-bit wide way allocation
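A sketch of how the new instruction's operands might be packaged and handed to the OS; field widths other than W are not given in the transcript, and set_cache_partition is a hypothetical wrapper, not a real kernel interface:

```c
#include <stdint.h>

/* Operands of the partition-update instruction described on this slide. */
typedef struct {
    uint8_t  coid;   /* COID: core ID                              */
    uint8_t  clvl;   /* CLVL: cache level (e.g., 2 = L2, 3 = L3)   */
    uint8_t  caid;   /* CAID: cache instance ID at that level      */
    uint64_t w;      /* W: 64-bit wide way-allocation bit mask,    */
                     /*    presumably one bit per cache way        */
} partition_cmd_t;

/* Hypothetical system-call wrapper: the OS validates the request and
 * executes the new ISA instruction on the application's behalf. */
int set_cache_partition(const partition_cmd_t *cmd);
```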

  14. What we want to know • The experimental environment • Comparison with other schemes • Average Memory Access Time: the main target of the performance monitoring • Execution cycles

  15. Experiment Environment • SIMICS and GEMS are used to model the multicore architecture below. • Run the SPECOMP and PARSEC applications. • Use 120 million instructions as one application epoch.

  16. Experiment Environment (cont.) • Eight schemes were evaluated and their average memory access times recorded: • No-partition • Uniform – ways divided as evenly as possible across cores • Static Best – the best static partition, found through exhaustive search • Dynamic – the proposed method • Dynamic-L2 – partitions only L2 • Dynamic-L3 – partitions only L3 • L2+L3 – a separate performance model for each level • Ideal – the optimal strategy

  17. Improved Performance • Shows that the scheme balances the data access latencies of the different threads. • As execution proceeds, all threads converge to an AMAT of about 8 cycles.

  18. Conclusion • Intra-application cache partitioning for multithreaded applications • A dynamic model able to partition the cache at multiple layers • Average improvements of 17.1% on SPECOMP and 18.6% on PARSEC • My comments • Reminds me of the importance of software/hardware cooperation • Threading is a central issue in CMPs.
