Cache Miss-Aware Dynamic Stack Allocation

Cache Miss-Aware Dynamic Stack Allocation Authors: S. Jang et al. Conference: International Symposium on Circuits and Systems (ISCAS), 2007 Presenter: Tareq Hasan Khan ID: 11083577 ECE, U of S Literature review-4 (EE 800)

Outline • Introduction to Cache and Stack • Proposed Dynamic Stack Allocator • Cache Miss Predictor • Stack Pointer Manager • Results • Conclusion

Introduction • Cache • A small, high-speed on-chip memory • Bridges the speed gap between the microprocessor and main memory • Low-power embedded systems need to reduce cache misses without increasing cache associativity • Stack • A group of memory locations used for an application's local variables, temporary data, and the return addresses of function calls • Last In, First Out (LIFO) structure • About half (49%) of memory accesses are stack accesses

Dynamic Stack Allocator • Conventional stack allocation inserts and extracts data sequentially without considering cache misses • Proposed hardware: Dynamic Stack Allocator (DSA) • Cache Miss Predictor (CMP) • Computes a cache miss probability for each cache line using the history of cache misses • Stack Pointer Manager (SPM) • Selects the stack pointer location that has the lowest cache miss probability

Dynamic Stack Allocator

Cache Miss Predictor (CMP) • Cache Miss Controller (CMC) • Cache Miss (CM) buffer • Consists of “index” and “count” register pairs

Cache Miss Controller (CMC) • The cache controller detects cache misses by comparing the tags in the cache with the tag bits of the address requested by the processor. • When a cache miss is detected, the cache controller sends a cache miss signal to the CMP, together with the index of the missing line. • On a cache miss, the CMC saves the index in the CM buffer and increments the corresponding counter. • When the CM buffer is full, an entry is replaced according to the interval-based LRU policy.
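The miss-detection step above can be illustrated with a small software model. This is only a sketch, not the paper's hardware: it assumes an 8 KB direct-mapped cache with 16-byte lines, and the names `split_address`, `DirectMappedCache`, and `miss_counts` are hypothetical. The unbounded dictionary stands in for the CM buffer, which in the real design has a fixed number of index/count pairs.

```python
# Assumed geometry: 8 KB direct-mapped cache, 16-byte lines.
LINE_SIZE = 16
CACHE_SIZE = 8 * 1024
NUM_LINES = CACHE_SIZE // LINE_SIZE          # 512 lines

OFFSET_BITS = LINE_SIZE.bit_length() - 1     # 4 bits of byte offset
INDEX_BITS = NUM_LINES.bit_length() - 1      # 9 bits of line index


def split_address(addr):
    """Split a byte address into (tag, index)."""
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index


class DirectMappedCache:
    """Hypothetical model that tracks only the tag stored in each line."""

    def __init__(self):
        self.tags = [None] * NUM_LINES       # None marks an invalid line

    def access(self, addr):
        """Return the index of the missing line, or None on a hit."""
        tag, index = split_address(addr)
        if self.tags[index] == tag:
            return None                      # tags match: hit
        self.tags[index] = tag               # miss: fill the line
        return index


cache = DirectMappedCache()
miss_counts = {}                             # index -> count (unbounded stand-in for the CM buffer)
for addr in (0x0000, 0x0004, 0x2000, 0x0000):
    missed = cache.access(addr)
    if missed is not None:
        miss_counts[missed] = miss_counts.get(missed, 0) + 1

print(miss_counts)
```

Addresses 0x0000 and 0x2000 map to the same line (index 0) with different tags, so they keep evicting each other; this is the kind of conflict the miss history is meant to expose.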

Cache Miss (CM) buffer • Recent CM buffer (RCM buffer) • History CM buffer (HCM buffer)

Cache Miss (CM) buffer • Recent CM buffer (RCM buffer) • On a cache miss to cache line k, an associative lookup into the RCM buffer is performed using k. If there is an entry with index k, the counter for line k is incremented. • If no match occurs and the RCM buffer is not full, the index is recorded in an empty entry and its counter is incremented. • History CM buffer (HCM buffer) • When the RCM buffer is full, its contents are transferred into the HCM buffer according to the LRU policy: indices in the HCM buffer are replaced by indices from the RCM buffer that have larger counter values. • In the interval-based LRU policy, the comparison for replacement does not occur until the RCM buffer is full.
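The RCM/HCM update described above leaves some details open, so the following is one possible reading rather than the authors' exact policy. The sketch assumes that the transfer is triggered when a miss to a new index arrives while the RCM buffer is full, that an RCM entry displaces the smallest-count HCM entry only if its own count is larger, that an index already in the HCM buffer keeps the larger of the two counts, and that the RCM buffer is cleared afterwards. The class name `CMBuffers` and its methods are hypothetical.

```python
class CMBuffers:
    """Hypothetical model of the RCM and HCM buffers as index -> count maps."""

    def __init__(self, rcm_size, hcm_size):
        self.rcm_size = rcm_size
        self.hcm_size = hcm_size
        self.rcm = {}   # recent misses
        self.hcm = {}   # history of misses

    def record_miss(self, k):
        """Record a cache miss to line index k."""
        if k in self.rcm:
            self.rcm[k] += 1                 # associative lookup matched
            return
        if len(self.rcm) >= self.rcm_size:
            self._flush_rcm_to_hcm()         # comparison happens only when RCM is full
        self.rcm[k] = 1

    def _flush_rcm_to_hcm(self):
        """Assumed transfer step: keep the larger counts in the HCM buffer."""
        for index, count in self.rcm.items():
            if index in self.hcm:
                self.hcm[index] = max(self.hcm[index], count)
            elif len(self.hcm) < self.hcm_size:
                self.hcm[index] = count
            else:
                victim = min(self.hcm, key=self.hcm.get)
                if count > self.hcm[victim]:
                    del self.hcm[victim]
                    self.hcm[index] = count
        self.rcm.clear()

    def miss_count(self, index):
        """Combined count for an index across both buffers."""
        return self.rcm.get(index, 0) + self.hcm.get(index, 0)


buffers = CMBuffers(rcm_size=2, hcm_size=2)
for k in (7, 7, 7, 3, 9, 9, 5, 4):
    buffers.record_miss(k)

print(buffers.rcm, buffers.hcm)
```

In this run, index 3 (one miss) is displaced from the HCM buffer by index 9 (two misses), while index 5 (one miss) is dropped because it does not exceed the smallest remaining count.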

Stack Pointer Manager (SPM) When an application requires a stack, the SPM uses the contents of the RCM and HCM buffers to find the location with the lowest cache miss probability.

Stack Pointer Manager (SPM) • When a function is called, the SPM calculates the total cache miss probability within the searching window (R1, R2) of each sub-stack. • To calculate the total cache miss probability, the SPM searches the RCM and HCM buffers for indices that fall within the searching window. For each index found, the SPM adds the corresponding counter value to the total. • After the computation, the SPM compares the total for each sub-stack with those of the other sub-stacks. • The SPM then dynamically selects the sub-stack with the lowest cache miss probability as the stack for the application.
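The selection step above amounts to summing counter values over each window and taking the minimum. A minimal sketch follows, assuming each searching window is an inclusive range (R1, R2) of cache line indices, that raw miss counts serve as the probability estimate, and that ties go to the first sub-stack. The function names `window_score` and `select_substack` and the example data are hypothetical.

```python
def window_score(window, rcm, hcm):
    """Sum the miss counts of all buffered indices inside the window (R1, R2)."""
    r1, r2 = window
    total = 0
    for buf in (rcm, hcm):
        for index, count in buf.items():
            if r1 <= index <= r2:            # inclusive bounds are an assumption
                total += count
    return total


def select_substack(windows, rcm, hcm):
    """Return (best sub-stack number, list of scores); lowest score wins."""
    scores = [window_score(w, rcm, hcm) for w in windows]
    best = min(range(len(windows)), key=scores.__getitem__)
    return best, scores


# Hypothetical buffer contents (index -> count) and three sub-stack windows.
rcm = {4: 1, 130: 2}
hcm = {7: 3, 9: 2, 300: 1}
windows = [(0, 15), (128, 143), (256, 271)]

best, scores = select_substack(windows, rcm, hcm)
print(best, scores)
```

Here the third sub-stack has no recorded misses inside its window, so it would be chosen; index 300 lies outside every window and does not affect the result.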

Result • Implemented within the OpenRISC 1200 microprocessor with an 8 KB direct-mapped data cache and an 8 KB direct-mapped instruction cache, each with a 16-byte line size • Data traffic between cache and main memory is reported for different RCM and HCM buffer sizes, normalized to one for the conventional scheme • For FFT, traffic is 42% lower than with the conventional scheme. • In some cases traffic increases, e.g., DFT with the DSA configuration RCM(5) and HCM(8).

Result…cont. • Variation of data traffic with the number of sub-stacks. • In all cases, more sub-stacks give less traffic, but the improvement is not very significant.

Result…cont. • An ASIC implementation of the DSA was completed • The maximum speed was 87 MHz • The DSA measures 0.3 mm × 0.4 mm, about 1% of the total core area

Conclusion • Proposed hardware for cache miss-aware dynamic stack allocation to reduce cache misses • Based on the history of cache misses, the proposed scheme moves the stack pointer to a location expected to cause fewer cache misses. • Across various benchmarks, the DSA reduced traffic between cache and main memory by 4% to 42%.

Thanks
