Intelligent Memory Allocation in Embedded Systems Using Heterogeneous Memory Units

Embedded Systems Seminar • Heterogeneous Memory Management for Embedded Systems, by O. Avissar, R. Barua and D. Stewart • Presented by Kumar Karthik

Heterogeneous Memory • Heterogeneous = different types of… • Embedded Systems come with a small amount of on-chip SRAM, a moderate amount of off-chip SRAM, a considerable amount of off-chip DRAM and large amounts of EEPROM (Flash memory)

Relative RAM Costs and Latencies • Latency: on-chip SRAM < off-chip SRAM < on-chip DRAM < off-chip DRAM • Cost: on-chip SRAM > off-chip SRAM > on-chip DRAM > off-chip DRAM

Caches in Embedded Chips • Caches are power hungry • Cache miss penalties make it hard to give real-time performance guarantees • Solution : do away with caches and create a non-overlapping address space for systems with heterogeneous memory units (DRAM, SRAM, EEPROM).

Memory Allocation in ES • Memory allocation for program data is done by the embedded system programmer, in software, as current compilers are not capable of doing it over heterogeneous memory units • Code is written in Assembly : tedious and non-portable • Solution : An intelligent compilation strategy that can achieve optimal memory allocation in ES.

Memory Allocation Example

The need for Profiling • Recall : RAM Latencies • Optimal if most frequently accessed code sections are stored in the memory unit with lowest latency. • Access frequencies of memory references need to be measured. • Solution : Profiling.

Intelligent Compilers • The intelligent compiler must be able to: • (1) Optimally allocate memory to program data • (2) Base memory allocation on frequency estimates collected through profiling • (3) Correlate memory accesses with the variables they access • Task 3 would normally demand inter-procedural pointer analysis, which is costly.

Profiling • Instead of pointer analysis, a more efficient run-time method is used: each accessed address is checked against a table of address ranges for the different variables. • This provides exact statistics for the profiled run, as opposed to the conservative estimates of pointer analysis
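
The address-range lookup described above can be sketched as follows. This is a minimal illustration, not the paper's profiler: the table contents, the variable names and the `(address, is_write)` trace format are all assumptions made up for the example.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical symbol table: (start_address, size_in_bytes, variable_name),
# sorted by start address. The addresses are invented for illustration.
RANGES = [
    (0x1000, 64, "coeffs"),
    (0x1040, 256, "samples"),
    (0x2000, 4, "acc"),
]
STARTS = [start for start, _, _ in RANGES]


def variable_at(address):
    """Return the variable whose address range contains `address`, or None."""
    idx = bisect_right(STARTS, address) - 1
    if idx < 0:
        return None
    start, size, name = RANGES[idx]
    return name if address < start + size else None


def profile(trace):
    """Count reads and writes per variable from (address, is_write) pairs."""
    reads, writes = Counter(), Counter()
    for address, is_write in trace:
        name = variable_at(address)
        if name is not None:
            (writes if is_write else reads)[name] += 1
    return reads, writes
```

The resulting counters play the role of the per-variable read and write counts used in the formulation that follows.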

Memory Access Times • Total access time (Sum) of all the memory accesses in the program needs to be minimized • The formulation is first defined for global variables and then extended for heap and stack variables.

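Formulation for global variables • Key terms • $T_{rj} N_r(v_i)$: total time taken for the $N_r(v_i)$ reads of variable $v_i$ when it is stored on memory unit $j$ • $T_{wj} N_w(v_i)$: total time taken for the $N_w(v_i)$ writes of variable $v_i$ when it is stored on memory unit $j$ • $I_j(v_i)$: the set of 0/1 integer variables; $I_j(v_i) = 1$ if variable $v_i$ is stored in memory unit $j$, and 0 otherwise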

Formulation for global variables • Total access time = $\sum_{j=1}^{U} \sum_{i=1}^{G} I_j(v_i) \left[ T_{rj} N_r(v_i) + T_{wj} N_w(v_i) \right]$ • $U$ = number of memory units • $G$ = number of variables • $T_{rj} N_r(v_i) + T_{wj} N_w(v_i)$ contributes to the inner sum only if variable $v_i$ is stored in memory unit $j$ (if not, $I_j(v_i) = 0$ and the whole term is 0).
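
As a worked instance of the sum above, the sketch below evaluates the total access time for one fixed allocation. The latencies, access counts and the names `t_read`, `t_write`, `fast_a` and `fast_b` are illustrative assumptions, not figures from the paper.

```python
def total_access_time(assignment, reads, writes, t_read, t_write):
    """Sum T_rj*N_r(v_i) + T_wj*N_w(v_i) over each variable's assigned unit.

    assignment[v] is the memory unit j holding variable v, i.e. the single j
    for which I_j(v) = 1; all other terms of the double sum are zero.
    """
    return sum(
        t_read[unit] * reads[var] + t_write[unit] * writes[var]
        for var, unit in assignment.items()
    )


# Illustrative numbers only (latencies in cycles, counts from a made-up profile).
t_read = {"sram": 1, "dram": 10}
t_write = {"sram": 1, "dram": 12}
reads = {"a": 100, "b": 5}
writes = {"a": 50, "b": 1}

# Hot variable "a" in fast memory versus cold variable "b" in fast memory.
fast_a = total_access_time({"a": "sram", "b": "dram"}, reads, writes, t_read, t_write)
fast_b = total_access_time({"a": "dram", "b": "sram"}, reads, writes, t_read, t_write)
```

With these numbers, placing the frequently accessed variable in SRAM gives 212 cycles against 1606 for the reverse placement, which is the effect the formulation is designed to capture.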

0/1 integer linear program solver • The 0/1 integer linear program solver searches over the possible 0/1 assignments of the $I_j(v_i)$ to find the lowest total memory access time, and returns this solution to the compiler • The solution is the optimal memory allocation. • MATLAB is used as the solver in this paper.

Constraints • The following constraints also hold : • The embedded processor allows at most one memory access per cycle. Overlapping memory latencies are not considered. • Every variable is allocated on only one memory unit • The sum of the sizes of all the variables allocated to a particular memory unit must not exceed the size of the unit.
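
To make the objective and the constraints concrete, here is a minimal brute-force sketch that enumerates every assignment of variables to units. It stands in for a real 0/1 ILP solver only on tiny inputs; the function name `best_allocation` and the `(capacity, read_latency, write_latency)` tuple layout are assumptions for this example.

```python
from itertools import product


def best_allocation(sizes, reads, writes, units):
    """Exhaustively find the cheapest feasible variable-to-unit assignment.

    units maps a unit name to (capacity_bytes, read_latency, write_latency).
    Returns (total_time, assignment), or None if no assignment fits.
    """
    names = list(sizes)
    best = None
    for choice in product(units, repeat=len(names)):
        # Each variable gets exactly one unit by construction of `choice`.
        used = {u: 0 for u in units}
        for var, unit in zip(names, choice):
            used[unit] += sizes[var]
        if any(used[u] > units[u][0] for u in units):
            continue  # capacity constraint violated
        cost = sum(
            units[unit][1] * reads[var] + units[unit][2] * writes[var]
            for var, unit in zip(names, choice)
        )
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, choice)))
    return best
```

The search is exponential in the number of variables, so this only illustrates what the solver optimizes; a practical implementation would hand the same objective and constraints to an ILP solver.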

Stack variables • Extending the formulation for local variables, procedure parameters and return variables (collectively known as stack variables). • Stacks are sequentially allocated abstractions, much like arrays. • Distributing stacks over heterogeneous memory units optimizes memory allocation.

Stack split example

Distributed Stacks • Multiple stack pointers are needed: in the example, 2 stack pointers have to be incremented on entry (one for each split of the stack) and 2 have to be decremented on leaving the procedure. • Maintaining 2 stack pointers induces overhead.

Distributed Stacks • The software overhead is tolerated for long-running procedures, and eliminated for short procedures by allocating each stack frame to one memory unit (one stack pointer per procedure) • Distributed stacks are implemented by the compiler for ease of use: the abstraction of the stack as a contiguous data structure is maintained for the programmer
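
The bookkeeping cost of a split stack can be illustrated with a toy model. This is only a sketch under assumptions: two units named "sram" and "dram", stack pointers modeled as byte offsets that are incremented on entry as in the slides, and a hypothetical `SplitStack` class that does not correspond to any code in the paper.

```python
class SplitStack:
    """Toy model of a stack split across two memory units."""

    def __init__(self):
        # One stack pointer per memory unit, modeled as a byte offset.
        self.sp = {"sram": 0, "dram": 0}
        self.frames = []

    def enter(self, sram_bytes, dram_bytes):
        """Procedure entry: bump only the pointers the frame actually uses."""
        frame = {"sram": sram_bytes, "dram": dram_bytes}
        updates = 0
        for unit, size in frame.items():
            if size:
                self.sp[unit] += size
                updates += 1
        self.frames.append(frame)
        return updates  # number of stack-pointer updates performed

    def leave(self):
        """Procedure exit: undo the matching pointer updates."""
        frame = self.frames.pop()
        for unit, size in frame.items():
            self.sp[unit] -= size
```

A frame that lives entirely in one unit costs a single pointer update on entry, which is why keeping short procedures in one unit removes the overhead.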

Comparison to globals • Stack variables have limited lifetimes compared to globals. They are ‘live’ only while a particular procedure is executing, and their storage can be reclaimed once the procedure exits. • Hence variables with non-overlapping lifetimes can share the same address space, and their total size can be larger than that of the memory unit they are stored in.

Formulation for Stack Frames • There are 2 ways of extending the method to handle stack variables. • 1st alternative: each procedure’s stack frame is stored in a single memory unit. • No multiple stack pointers are needed within a procedure • The stack is still distributed, as different stack frames may be allocated to different memory units

Stack-extended formulation • Total access time = time taken to access global variables + time taken to access stack variables • The $f_i$ refer to the functions in the program (each function has one stack frame).

Constraints • Each stack frame is stored in exactly one memory unit • The stack reaches its maximum size when a call-graph leaf node is reached. • A call-graph leaf node is a procedure that calls no other procedure, i.e. the deepest point of a chain of nested calls. The program's stack allocation fits into memory if every path to a leaf node on the call graph fits into memory.
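
The path-based constraint can be checked with a short recursive walk. The sketch below assumes an acyclic call graph (no recursion) given as a dictionary of callee lists; the function name `fits` and its parameters are hypothetical.

```python
def fits(call_graph, root, frame_size, frame_unit, capacity):
    """Check that every root-to-leaf call path fits in each memory unit.

    Assumes an acyclic call graph. capacity[unit] is the space available
    for stack data in that unit; frame_unit[f] is the unit holding f's frame.
    """
    def walk(func, used):
        unit = frame_unit[func]
        used = dict(used)  # copy so sibling paths do not see each other
        used[unit] = used.get(unit, 0) + frame_size[func]
        if used[unit] > capacity[unit]:
            return False
        # Usage only grows along a path, so checking at every node is enough.
        return all(walk(callee, used) for callee in call_graph.get(func, []))

    return walk(root, {})
```

Because sibling procedures are never live at the same time, two frames whose sizes add up to more than a unit's capacity can still both be assigned to it, provided they never appear on the same call path.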

Stack-extended formulation • 2nd alternative • Stack variables from the same procedure can be mapped to different memory units • Stack variables are thus treated like globals in the total access time • However, the memory requirements are relaxed as in the stack-frame case, based on the disjoint lifetimes of the stack variables

Heap-extended formulation • Heap data cannot be allocated statically as the allocation frequencies and block sizes are unknown at compile time. • Calls such as malloc( ) fall into this category • Allocation has to be estimated using a good heuristic. • Each static heap allocation site is treated as a variable v in the formulation

Heap-extended formulation • The number of references to each site is counted through profiling. • The variable size is bounded as a finite multiple of the total size of memory allocated at that site. • If a malloc( ) site allocates 20 bytes 8 times over in a program, 160 bytes is the size of v which is multiplied by a safety factor of 2 to give 320 bytes as the allocation size for this site.

Heap-extended formulation • This optimizes for the common case • Calls like malloc( ) are cloned for each memory level, and each clone maintains its own free list. • If the allocation size is exceeded at runtime (the maximum size is passed as a parameter for each call site), a memory block from slower and larger memory is returned.
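
The per-site size bound and the fall-back behaviour can be sketched together. This toy model tracks only byte budgets: free lists and real addresses are not modeled, and the names `SiteAllocator` and `site_budget` are assumptions introduced for the example.

```python
def site_budget(block_bytes, profiled_calls, safety_factor=2):
    """Profiled total size allocated at the site, times a safety factor."""
    return block_bytes * profiled_calls * safety_factor


class SiteAllocator:
    """Toy model of one malloc call site bound to a fast memory level."""

    def __init__(self, fast_unit, slow_unit, max_bytes):
        self.fast_unit = fast_unit
        self.slow_unit = slow_unit
        self.max_bytes = max_bytes  # budget reserved for this site in fast memory
        self.used = 0

    def malloc(self, nbytes):
        """Return the unit that serves the request (addresses are not modeled)."""
        if self.used + nbytes <= self.max_bytes:
            self.used += nbytes
            return self.fast_unit
        return self.slow_unit  # budget exceeded: fall back to slower memory
```

For the slide's example, `site_budget(20, 8)` gives 320 bytes, so a site bound to SRAM serves sixteen 20-byte requests from SRAM before later requests spill to the slower unit.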

Heap-extended formulation • Latency would be ≤ latency of slowest memory • If real-time guarantees are needed, all heap allocation must be assumed to go to the slowest memory.

Experiment • This compiler was implemented as an extension to the commonly used GCC cross-compiler, targeting the Motorola M-Core processor. • The benchmarks used represent code in typical applications. • The runtimes were normalized to the case that uses only the fastest memory type (SRAM); slower memories were then introduced in subsequent tests to measure their effect on runtime.

Results

Results • Using 20% SRAM and the rest DRAM still produces runtimes close to the all-SRAM case: cheaper, and without much of a performance loss. • This is consistent with the memory allocation being optimal (at least for the benchmark programs). FIB, which uses a linear recurrence to compute Fibonacci numbers, is an exception, as it has an equal number of accesses to all variables.

Experiment 2 • Enough DRAM and EEPROM were provided, while the SRAM size was varied for each of the benchmark programs. • This helps determine the minimum amount of SRAM needed to keep performance reasonably close to the 100% SRAM case

FIR Benchmark

Matrix multiplication benchmark

Fibonacci series benchmark

Byte to ASCII converter

Results • It is clear that the most frequently accessed code makes up between 10% and 20% of the entire program • This portion is successfully placed in SRAM through the profile-based optimizations.

Comparing Stack frames and stack variables

Results • The BMM benchmark is used as it has the largest number of functions/procedures (and hence the largest number of stack frames/variables). • Allocating individual stack variables to different units performs better in theory, because the finer granularity allows a more closely tailored allocation. The difference is apparent for the smaller SRAM sizes.

Applications • The approach in the paper can be used to determine an optimal trade-off between minimum SRAM size and meeting performance requirements.

Adapting to pre-emption • In context-switching environments, the data of all live programs must reside in memory at the same time. • The variables of all the live programs are combined, and the formulation is solved with each program's access counts multiplied by the relative frequency of its context. An optimal allocation is achieved in this case.
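
The weighting step can be sketched as below. The input layout, the `context.variable` naming scheme and the function name `combine_contexts` are assumptions; the slides do not specify how the combined problem is encoded.

```python
def combine_contexts(contexts):
    """Merge per-context access counts, scaled by each context's relative frequency.

    contexts maps a context name to (relative_frequency, {variable: accesses}).
    Variable names are prefixed with the context name to keep them distinct.
    """
    combined = {}
    for ctx, (freq, counts) in contexts.items():
        for var, n in counts.items():
            combined[f"{ctx}.{var}"] = freq * n
    return combined
```

The combined counts can then be fed to the same allocation formulation as a single set of variables competing for the shared memory units.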

Summary • Compiler method to distribute program data efficiently among heterogeneous memories. • Caching hardware is not used • Static allocation of memory units • Stack distribution • Optimal guarantee • Runtime depends on relative access frequencies.

Related work • Not much work on cache-less embedded chips with heterogeneous memory units • Memory allocation task is usually left to the programmer • Compiler method is better for larger, more complex programs • It is error free and is also portable over different systems with minor modifications to the compiler.

Related work • Panda et al. and Sjodin et al. have studied memory allocation in cached embedded chips. • Cached systems spend more effort on minimizing cache misses than on minimizing memory access times, and give no optimality guarantee. • Earlier studies only take into account 2 memory levels (SRAM and DRAM), while this formulation can be extended to N levels of memory.

Related work • Dynamic allocation strategies are also possible but not explored here. • Software caching (emulation of a cache in fast memory) is an option. • Methods to overcome software overhead need to be devised. • Inability to provide real-time guarantees should be addressed.