
Cache Miss Analysis of Walsh-Hadamard Transform Algorithms



  1. Cache Miss Analysis of Walsh-Hadamard Transform Algorithms Mihai A. Furis Advisor: Jeremy Johnson Ph.D. Department of Computer Science

  2. Abstract Processor speed has been increasing at a much greater rate than memory speed, leading to the so-called processor-memory gap. In order to compensate for this gap in performance, modern computers rely heavily on a hierarchical memory organization with a small amount of fast memory called cache. The true cost of memory access is hidden, provided data can be obtained from cache. Substantial improvement in the runtime of a program can be obtained by making intelligent algorithmic choices that better utilize cache. In this work, we investigate cache performance using both empirical and analytic techniques. The goal is to better understand how algorithmic choices affect cache performance. Using cache simulators and hardware counters we compare cache performance for different memory access patterns, and use this data to model and analyze the cache behavior of a more complicated algorithm.

  3. Motivation

  4. Objective, Strategy and Results
  • Objective
  Determine the relation between memory access patterns (the algorithm) and the memory architecture. Develop a performance model to analyze and predict cache behavior for an algorithm. Improve and optimize algorithms by making intelligent choices based on the performance model.
  • Strategy
  1) Investigate strided access patterns on different memory configurations. 2) Extend the results to a more complicated algorithm (the WHT).
  • Experiments and Results
  1) Measured runtime performance and cache misses of a benchmark program. 2) Used a simulator to investigate different memory configurations. 3) Measured performance and cache misses of the WHT algorithm(s) and developed a parameterized model to predict cache behavior. 4) Analyzed the cache performance of the WHT.

  5. Outline
  • Part I
    • Review of cache design parameters
    • Tools for measuring cache performance: performance counters, memory traces and a simulator
    • Investigation of strided memory access patterns
  • Part II
    • Review of the family of WHT algorithms
    • A cache model for the WHT
    • Investigation of cache misses for the WHT

  6. Cache Structure and Organization [Diagram: the processor accesses an L1 instruction cache (with an ITLB) and an L1 data cache (with a DTLB); both are backed by a unified L2 cache, which is in turn backed by main memory.]

  7. Cache Design Parameters
  • Cache size: usually a power of two.
  • Block size: the smallest amount of data that can be transferred between memory and cache; it also provides prefetching. In a direct-mapped cache, the mapping between main memory and cache is done using the formula: (block address) MOD (number of blocks in the cache).
  • Associativity: provides a set of locations in the cache that can hold the blocks mapping to the same index; designs range from direct mapped through set associative to fully associative. Mapping formula: (block address) MOD (number of sets in the cache).
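  As a concrete illustration (a sketch in C with assumed parameter values, not code from the slides), the set index and tag of an address in a set-associative cache follow directly from the mapping formula above:

      #include <stdint.h>
      #include <stdio.h>

      /* Decompose an address for a set-associative cache.
       * block_size and num_sets are assumed to be powers of two. */
      static void locate(uint64_t addr, uint64_t block_size, uint64_t num_sets)
      {
          uint64_t block_addr = addr / block_size;     /* which memory block */
          uint64_t set_index  = block_addr % num_sets; /* (block address) MOD (number of sets) */
          uint64_t tag        = block_addr / num_sets; /* distinguishes blocks sharing a set */
          printf("addr 0x%llx -> set %llu, tag 0x%llx\n",
                 (unsigned long long)addr,
                 (unsigned long long)set_index,
                 (unsigned long long)tag);
      }

      int main(void)
      {
          /* Pentium III L1 data cache (slide 12): 16 KB, 4-way, 32-byte lines,
           * hence 16384 / (32 * 4) = 128 sets. */
          locate(0x1234abcdULL, 32, 128);
          return 0;
      }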

  8. The Three C Model
  • Compulsory misses: the first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses.
  • Capacity misses: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur as blocks are discarded and later retrieved.
  • Conflict misses: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur, because a block can be discarded and later retrieved when too many blocks map to its set. For example, in a direct-mapped cache with 128 blocks of 32 bytes, addresses 0 and 4096 have block addresses 0 and 128, both congruent to 0 mod 128, so alternating accesses to them evict each other repeatedly.

  9. The Stride Program

      for (csize = CACHE_MIN; csize <= CACHE_MAX; csize = csize * 2)
        for (stride = 1; stride <= csize / 2; stride = stride * 2) {
          sec = 0;                        /* initialize timer */
          limit = csize - stride + 1;     /* array span for this loop */
          steps = 0;
          do {                            /* repeat until 1 second is collected */
            sec0 = get_seconds();         /* start timer */
            for (i = SAMPLE * stride; i != 0; i = i - 1)   /* larger sample */
              for (index = 0; index < limit; index = index + stride)
                x[index] = x[index] + 1;  /* cache access */
            steps = steps + 1;
            sec = sec + (get_seconds() - sec0);
          } while (sec < 1.0);            /* until 1 second is collected */
        }                                 /* report time per access for this (csize, stride) pair */
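  The loop relies on a timer and a working array that the slide does not show. A minimal sketch of those pieces follows; the names x, SAMPLE, CACHE_MIN, CACHE_MAX and get_seconds are taken from the loop above, while the values and the timer implementation are assumptions:

      #include <sys/time.h>

      #define CACHE_MIN (1024)              /* smallest array size tried, in ints (assumed) */
      #define CACHE_MAX (4 * 1024 * 1024)   /* largest array size tried, in ints (assumed)  */
      #define SAMPLE    10                  /* repetitions per timing sample (assumed)      */

      static int x[CACHE_MAX];              /* the array walked at each stride */

      /* Wall-clock time in seconds, as used by the timing loop above. */
      static double get_seconds(void)
      {
          struct timeval tv;
          gettimeofday(&tv, NULL);
          return tv.tv_sec + tv.tv_usec * 1.0e-6;
      }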

  10. The Memory Access Pattern for the Stride Program [Diagram: memory locations 1, 2, 3, 4, 5, 6, 7, ... visited at different strides; a stride-s pass touches every s-th location.]

  11. Performance Analysis Tools
  • Performance counters: In recent years, microprocessors have been designed to include special hardware support for measuring and monitoring their performance. This support takes the form of a set of performance counters with a defined set of countable events. The performance monitoring interface used in this work is the Performance Counter Library (PCL); more information about PCL can be found at http://www.fz-juelich.de/zam/PCL/. The PCL interface allows us to initialize the counters with a specified set of events, record those events, and retrieve the results at the end.
  • Cache simulator: We used the Dinero cache simulator to simulate the execution of the Stride program for different cache sizes and strides on a virtual machine configured like our lab machine n1-10-78 (Pentium III).
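  A hedged sketch of how such a measurement looks with PCL follows; the event name, types, and call signatures are assumptions recalled from the PCL 2.x manual and should be checked against the documentation at the URL above:

      #include <stdio.h>
      #include <pcl.h>    /* Performance Counter Library; API assumed from the PCL 2.x manual */

      int main(void)
      {
          PCL_DESCR_TYPE  descr;
          int             events[1] = { PCL_L1DCACHE_MISS };  /* assumed event name */
          PCL_CNT_TYPE    icounts[1];
          PCL_FP_CNT_TYPE fcounts[1];

          PCLinit(&descr);
          if (PCLquery(descr, events, 1, PCL_MODE_USER) != PCL_SUCCESS) {
              fprintf(stderr, "event not supported on this CPU\n");
              return 1;
          }
          PCLstart(descr, events, 1, PCL_MODE_USER);
          /* ... code under measurement, e.g. one pass of the Stride loop ... */
          PCLstop(descr, icounts, fcounts, 1);
          printf("L1 data cache misses: %lld\n", (long long)icounts[0]);
          PCLexit(descr);
          return 0;
      }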

  12. Machine Configuration
  Identification: GenuineIntel, Pentium III (n1-10-78.mcs.drexel.edu)
  Hardware:
  • TLB
    Instruction: 4 KB pages, 4-way set associative, 32 entries; 4 MB pages, fully associative, 2 entries
    Data: 4 KB pages, 4-way set associative, 64 entries; 4 MB pages, 4-way set associative, 8 entries
  • L1 cache
    Instruction: 16 KB, 4-way set associative, 32-byte line size
    Data: 16 KB, 2-way or 4-way set associative, 32-byte line size
  • L2 unified cache: 512 KB, 4-way set associative, 32-byte line size

  13. The Stride program results

  14. The Stride program results using PCL

  15. The Stride program results using PCL

  16. The Stride program results using the Dinero cache simulator

  17. The Stride program results using the Dinero cache simulator

  18. The Walsh-Hadamard Transform The Walsh-Hadamard transform of a signal x of size N = 2^n is the matrix-vector product WHT_N * x.
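  Here the transform matrix is the n-fold tensor (Kronecker) product

  \[
  WHT_N \;=\; \underbrace{WHT_2 \otimes WHT_2 \otimes \cdots \otimes WHT_2}_{n\ \text{factors}},
  \qquad
  WHT_2 \;=\; \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.
  \]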

  19. WHT Factorizations
  • Recursive factorization
  • Iterative factorization
  • General factorization
  (the three formulas are sketched below)
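  These factorizations take the following standard forms from the WHT literature (a sketch in the usual notation, with N = 2^n and n = n_1 + ... + n_t; the slides' exact notation may differ):

  \[
  \begin{aligned}
  \text{Recursive:} \quad & WHT_{2^n} = \bigl( WHT_2 \otimes I_{2^{n-1}} \bigr)\bigl( I_2 \otimes WHT_{2^{n-1}} \bigr) \\
  \text{Iterative:} \quad & WHT_{2^n} = \prod_{i=1}^{n} \bigl( I_{2^{i-1}} \otimes WHT_2 \otimes I_{2^{n-i}} \bigr) \\
  \text{General:} \quad & WHT_{2^n} = \prod_{i=1}^{t} \bigl( I_{2^{n_1+\cdots+n_{i-1}}} \otimes WHT_{2^{n_i}} \otimes I_{2^{n_{i+1}+\cdots+n_t}} \bigr)
  \end{aligned}
  \]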

  20. WHT Algorithms The general factorization with N = N_1 * ... * N_t (N_i = 2^{n_i}) leads to the following triply nested loop, where x[b; s] denotes the N_i-element subvector (x[b], x[b+s], ..., x[b+(N_i-1)s]):

      R = N; S = 1;
      for i = t, ..., 1
          R = R / N_i
          for j = 0, ..., R-1
              for k = 0, ..., S-1
                  x[j*N_i*S + k; S] = WHT_{N_i} * x[j*N_i*S + k; S]
          S = S * N_i
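  For concreteness, here is the special case with every N_i = 2, i.e. the iterative radix-2 factorization, as a small self-contained C routine; this is a sketch of the idea, not the WHT package implementation:

      #include <stddef.h>

      /* In-place Walsh-Hadamard transform of x[0..n-1], n a power of two.
       * Each pass applies WHT_2 butterflies at stride s, realizing one
       * tensor factor I ⊗ WHT_2 ⊗ I of the iterative factorization. */
      void wht_radix2(double *x, size_t n)
      {
          for (size_t s = 1; s < n; s <<= 1)         /* butterfly stride */
              for (size_t j = 0; j < n; j += 2 * s)  /* start of each block */
                  for (size_t k = j; k < j + s; k++) {
                      double a = x[k];
                      double b = x[k + s];
                      x[k]     = a + b;              /* WHT_2 on the pair (a, b) */
                      x[k + s] = a - b;
                  }
      }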

  21. Space of WHT Algorithms (Partition Trees) [Figure: two example partition trees with root 7, one splitting 7 = 2 + 5 with 5 = 2 + 3 and 3 = 2 + 1, the other splitting 7 = 2 + 2 + 2 + 1; each tree corresponds to one WHT algorithm.]

  22. Vector Breakdown Strategies: The Interleaved Split [Figure: a 16-element vector, indices 0-15, partitioned by a tree with root 4 split as 3 + 1; in the interleaved split, each child transform works on a strided subvector, taking every 2^k-th element.]

  23. Vector Breakdown Strategies: The Cut Split [Figure: the same 16-element vector, indices 0-15, with a tree of root 4 split as 1 + 3; in the cut split, each child transform works on a contiguous segment of the vector.]

  24. Formula for calculating the number of cache misses

  25. Formula for calculating the number of cache misses

  26. Generating Random WHT Factorization Trees A random split of a node of size n is obtained from a random bit string of length n - 1: write n ones in a row and flip one bit for each gap between them, where a 1 marks a split point. For n = 8, the bit string 0 1 0 0 1 1 0 gives 1 1 | 1 1 1 | 1 | 1 1, i.e. the composition 2 | 3 | 1 | 2 = [2, 3, 1, 2].
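  A minimal sketch of this procedure in C (the function name random_composition and the use of rand() are illustrative choices, not from the slides):

      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>

      /* Draw a random composition of n by flipping one bit per gap between
       * n ones; a 1 marks a split. Writes the parts into out[] and returns
       * the number of parts (at most n). Fair bits make all 2^(n-1)
       * compositions equally likely. */
      static int random_composition(int n, int out[])
      {
          int parts = 0, run = 1;              /* current run of ones */
          for (int gap = 0; gap < n - 1; gap++) {
              if (rand() & 1) {                /* split at this gap */
                  out[parts++] = run;
                  run = 1;
              } else {
                  run++;                       /* extend the current part */
              }
          }
          out[parts++] = run;                  /* close the last part */
          return parts;
      }

      int main(void)
      {
          int parts[8];
          srand((unsigned)time(NULL));
          int t = random_composition(8, parts);
          for (int i = 0; i < t; i++)
              printf("%d%s", parts[i], i + 1 < t ? " + " : " = 8\n");
          return 0;
      }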

  27. The cache size influence on the number of cache misses

  28. The block size influence on the number of cache misses

  29. The associativity influence on the number of cache misses

  30. Conclusions and Future Work
  • Developed a model for counting the number of cache misses in a WHT algorithm.
  • Empirically investigated the number of cache misses for different WHT algorithms and cache parameters.
  Future Work
  • Theoretical understanding of the max, min, average, and distribution of cache misses for the WHT.
  • Refine the model to account for runtime.
  • Generalize to other algorithms.
