Hardware Support for Dynamic Memory Management

J. Morris Chang, Witawas Srisa-an, and Chia-Tien Dan Lo, Illinois Institute of Technology
Edward F. Gehringer, North Carolina State University

The Problem
• Object-oriented applications make frequent requests for dynamic memory.
• C++ programs make an order of magnitude more requests than C programs.
• Most objects are abandoned quickly.
• As a result, much time is spent on memory management.
• Up to 30% in C programs ...
• Garbage collection has been optimized, but it still takes time.

Hardware-Implemented Allocation
• Makes use of an allocation vector (A-vector) and a bit-flipper.

Address:                            0 1 2 3 4 5 6 7
A-vector before the allocation:     1 0 1 1 0 0 1 1
A-vector after the allocation:      1 0 1 1 1 1 1 1

(a) Combinational logic (the complete binary tree) determines that there is enough free memory to fill the request for two blocks.
(b) The address of the free block is 100 (binary).
(c) The bits at 100 and 101 (binary) are flipped.
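The three steps in the example can be mimicked in software. The sketch below is only an illustration: it replaces the combinational logic with a linear scan, and the function name h_malloc_sketch is hypothetical rather than part of the design.

```python
def h_malloc_sketch(a_vector, n_blocks):
    """Find the first run of n_blocks free bits and flip them to 1.

    Returns (address, new_a_vector), or (None, unchanged copy) if nothing fits.
    A linear scan stands in for the hardware's combinational search.
    """
    run_start, run_len = 0, 0
    for addr, bit in enumerate(a_vector):
        if bit == 0:
            if run_len == 0:
                run_start = addr
            run_len += 1
            if run_len == n_blocks:
                new_vec = list(a_vector)
                for i in range(run_start, run_start + n_blocks):
                    new_vec[i] = 1  # the bit-flipper marks these blocks allocated
                return run_start, new_vec
        else:
            run_len = 0
    return None, list(a_vector)


before = [1, 0, 1, 1, 0, 0, 1, 1]
addr, after = h_malloc_sketch(before, 2)
print(format(addr, "03b"), after)  # 100 [1, 0, 1, 1, 1, 1, 1, 1]
```

The single free block at address 1 is skipped because it cannot hold two blocks, so the request lands at address 100 (binary), as in the slide.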

The Complete Binary Tree
• A binary tree of bits is used to locate the first free region combinationally.

Level 0 (size 2^4):  1
Level 1 (size 2^3):  1 1
Level 2 (size 2^2):  1 0 1 0
Level 3 (size 2^1):  1 1 0 0 0 1 0 0
Level 4 (size 2^0):  1 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0   (A-vector, addresses 0 to 15)
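The bit values in this figure are consistent with each tree node being the OR of its two children, so that a 0 at a level of size 2^k means an aligned region of 2^k blocks is entirely free. The sketch below assumes that reading (the slides do not state it explicitly); build_cbt and first_free are hypothetical names, and the A-vector length is assumed to be a power of two.

```python
def build_cbt(a_vector):
    """Return the tree levels, root first; the last level is the A-vector."""
    levels = [list(a_vector)]
    while len(levels[0]) > 1:
        below = levels[0]
        # Assumed rule: a node is 1 if any block beneath it is allocated.
        above = [below[i] | below[i + 1] for i in range(0, len(below), 2)]
        levels.insert(0, above)
    return levels


def first_free(levels, size):
    """Address of the first free, aligned region of `size` blocks.

    `size` must be a power of two no larger than the A-vector length.
    """
    depth = len(levels) - 1 - (size.bit_length() - 1)
    for index, bit in enumerate(levels[depth]):
        if bit == 0:
            return index * size
    return None


a_vector = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
levels = build_cbt(a_vector)
for level, bits in enumerate(levels):
    print("Level", level, bits)
print(first_free(levels, 4))  # 4
```

In hardware all nodes are evaluated at once; the loop here only reproduces the result, not the timing.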

Keeping Track of Object Size
• Meanwhile, the size bit-vector (S-vector) records the boundaries between objects.
[Figure: the Complete Binary Tree, the allocation bit-vector (A-vector), the S-Unit (size encoder), and the size bit-vector (S-vector).]

Five Hardware-Implemented Instructions
• h_malloc
• mark
• h_free
• sweep
• h_realloc
• All are implemented in the Dynamic Memory Management Unit (DMMU).
• The DMMU manages the heap.

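The DMMU
• Each entry contains three bit-vectors: the A-vector, the S-vector, and the X-vector.
• The X-vector is used for reallocation and garbage collection.
[Figure: block diagram of the CPU, the DMMU (A-vector, S-vector, X-vector), and the O.S. kernel. Labeled signals: h_malloc / h_free / h_realloc, mark / sweep, gc_ack, object_size, object_pointer, and sbrk/brk.]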

The ALB
• Each entry in the DMMU tracks the allocation status of a region of memory.
• Compare with a TLB, which tracks the location of a region of virtual memory.
• So, these entries make up the Allocation Lookaside Buffer (ALB).
• Entries can be saved to and fetched from the A-, S-, and X-bitmaps.

Steps in Allocation
• Compare the requested size with largest_available_size in each ALB entry.
• Select an entry and pass the requested size to the CBT.
• The CBT locates the first available chunk.
• The chunk is allocated using the buddy system.
• Unused words at the end are returned to free memory.
• The address of the block is returned, and its status is changed to allocated.
• The S-vector is updated accordingly.
[Figure: h_malloc data path. The size and address pointer (A1) pass through the Complete Binary Tree (CBT) to the allocation bit-vector (A2); the S-Unit (size encoder) updates the size bit-vector (A3).]
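The allocation steps can be sketched end to end in software. Several details below are assumptions made for illustration, not taken from the slides: the AlbEntry class, the S-vector encoding (a 1 marks the last block of each object), largest_available_size measured as the longest free run, and an A-vector length that is a power of two.

```python
from dataclasses import dataclass


@dataclass
class AlbEntry:                      # hypothetical model of one ALB entry
    a_vec: list                      # 1 = block allocated
    s_vec: list                      # assumed encoding: 1 = last block of an object
    largest_available_size: int = 0  # assumed: longest run of free blocks


def largest_free_run(a_vec):
    best = run = 0
    for bit in a_vec:
        run = run + 1 if bit == 0 else 0
        best = max(best, run)
    return best


def h_malloc(alb, size):
    """Return (entry number, block address), or None if no entry can serve it."""
    chunk = 1
    while chunk < size:
        chunk *= 2                   # buddy system: search power-of-two chunks
    for entry_no, e in enumerate(alb):
        if size > e.largest_available_size:
            continue                 # this entry cannot hold the request
        for start in range(0, len(e.a_vec), chunk):
            if not any(e.a_vec[start:start + chunk]):
                # Mark only `size` blocks; the unused tail of the chunk
                # stays 0, i.e. it is returned to free memory.
                for i in range(start, start + size):
                    e.a_vec[i] = 1
                e.s_vec[start + size - 1] = 1   # record the object boundary
                e.largest_available_size = largest_free_run(e.a_vec)
                return entry_no, start
    return None


alb = [
    AlbEntry([1, 1, 1, 1, 0, 1, 1, 1], [0, 0, 0, 1, 0, 0, 0, 1], 1),
    AlbEntry([1, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 0, 0, 0], 4),
]
result = h_malloc(alb, 3)
print(result, alb[1].a_vec, alb[1].s_vec)
```

Here a request for three blocks skips the first entry, takes the aligned chunk of four at address 4 in the second entry, and leaves block 7 free.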

Steps in Deallocation
• Deallocation is very similar to allocation.
[Figure: h_free data path. The address pointer (D1) passes through the Complete Binary Tree (CBT) to the allocation bit-vector (D2); the S-Unit (size encoder) uses the size boundaries in the size bit-vector (D3).]

Steps in Marking
• Each live-object pointer is sent to the CBT, one after another.
• The page number of the object pointer selects a bit-vector.
• The signal generated by the CBT is latched in the X-vector.
[Figure: mark data path. Live-object pointers enter the Complete Binary Tree (CBT) as address pointers, and the result is latched in the auxiliary bit-vector (X-vector).]
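A minimal software analogue of marking is shown below. The block size, the number of blocks per bit-vector, and the dictionary of X-vectors keyed by page number are all assumptions chosen for illustration; the slides do not give these parameters.

```python
BLOCK_BYTES = 16   # assumed block size
PAGE_BLOCKS = 16   # assumed number of blocks covered by one bit-vector


def mark(x_vectors, pointer):
    """Set the X bit for the block that `pointer` refers to."""
    block = pointer // BLOCK_BYTES
    page, offset = divmod(block, PAGE_BLOCKS)   # page number selects the vector
    x_vectors.setdefault(page, [0] * PAGE_BLOCKS)[offset] = 1


x_vectors = {}
for live_pointer in (0x000, 0x040, 0x130):      # pointers arrive one at a time
    mark(x_vectors, live_pointer)
print(x_vectors)
```

With these assumed sizes, the first two pointers set bits 0 and 4 of page 0, and the third sets bit 3 of page 1.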

Steps in Sweeping
• The bit-sweeper receives the sweep signal.
• Size information from the S-vector and liveness status from the X-vector generate the new allocation status and largest_avail_size.
[Figure: sweep data path. The sweep signal (E1) drives the Bit-Sweeper/X-Unit, which reads the size bit-vector (E1) and the auxiliary bit-vector (X-vector), updates the allocation bit-vector through the S-Unit (E2), and raises GC_ack (E3).]
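How size and liveness information combine during a sweep can be sketched as follows. This is an illustration under stated assumptions, not the circuit itself: the S-vector is assumed to hold a 1 at the last block of each object, the X-vector a 1 at the first block of each live object, and the vectors are assumed to be well formed.

```python
def sweep(a_vec, s_vec, x_vec):
    """Return (new A-vector, new S-vector, largest free run) after a sweep."""
    new_a, new_s = list(a_vec), list(s_vec)
    i = 0
    while i < len(a_vec):
        if a_vec[i] == 0:
            i += 1
            continue
        end = i
        while s_vec[end] == 0:      # walk to the object's last block
            end += 1
        if x_vec[i] == 0:           # never marked: the object is garbage
            for j in range(i, end + 1):
                new_a[j] = 0
            new_s[end] = 0
        i = end + 1
    largest = run = 0
    for bit in new_a:               # recompute the largest available size
        run = run + 1 if bit == 0 else 0
        largest = max(largest, run)
    return new_a, new_s, largest


a_vec = [1, 0, 1, 1, 1, 1, 1, 0]    # objects at blocks 0, 2-3, and 4-6
s_vec = [1, 0, 0, 1, 0, 0, 1, 0]
x_vec = [1, 0, 0, 0, 1, 0, 0, 0]    # the objects at 0 and 4 were marked live
new_a, new_s, largest = sweep(a_vec, s_vec, x_vec)
print(new_a, new_s, largest)  # [1, 0, 0, 0, 1, 1, 1, 0] [1, 0, 0, 0, 0, 0, 1, 0] 3
```

The unmarked object at blocks 2 and 3 is reclaimed and merges with the free block at address 1, giving a largest free run of three blocks.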

Putting it All Together
[Figure: combined data path of the Complete Binary Tree (CBT), the allocation bit-vector (A vector), the S-Unit (size encoder), the size bit-vector (S vector), the Bit-Sweeper/X-Unit, the auxiliary bit-vector (X-vector), and the Reallocation Status unit (RS-Unit). Each signal is labeled with the step that uses it.]
A. Steps required for allocation
B. Steps required for deallocation
C. Steps required for reallocation
D. Steps required for marking
E. Steps required for sweeping

Memory Usage
• Most schemes encode size information in the objects themselves.
• This is more efficient with large objects.
• The bit-vector scheme is more efficient with small objects.
• If each object contains 8 bytes for size and 1 for marking, the bitmap scheme is more efficient when the average object size is < 384 bytes.
• Average object size for C++ and Java programs: ≈ 101 bytes.
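The 384-byte break-even point can be reproduced with a short calculation. The slides do not state the block size; the 16-byte block below is an assumption, chosen because together with three bits per block (one each in the A-, S-, and X-vectors) it yields exactly 384 bytes.

```python
HEADER_BYTES = 8 + 1   # from the slide: 8 bytes for size, 1 for marking
VECTORS = 3            # A-, S-, and X-vector each hold one bit per block
BLOCK_BYTES = 16       # assumption: not given in the slides


def header_overhead_bits(object_bytes):
    """Per-object cost when size and mark live in the object itself."""
    return HEADER_BYTES * 8


def bitmap_overhead_bits(object_bytes):
    """Per-object cost in the bit-vector scheme: 3 bits per block used."""
    blocks = -(-object_bytes // BLOCK_BYTES)   # ceiling division
    return VECTORS * blocks


break_even = HEADER_BYTES * 8 * BLOCK_BYTES // VECTORS
print(break_even)                                            # 384
print(bitmap_overhead_bits(101), header_overhead_bits(101))  # 21 72
```

Under these assumptions a 101-byte object costs 21 bits of bitmap against 72 bits of header. The calculation ignores the space lost to rounding objects up to whole blocks.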

Performance Gain
• The cost to account for is the ALB miss penalty.
• A bit-vector length of 512 bits (64 bytes) gives a 97% hit ratio.
• With three bit-vectors per entry, an ALB entry is therefore 192 bytes.
• A 64-bit, 100 MHz bus gives an 800 MB/s transfer rate.
• The miss penalty is therefore 96 cycles (192 × 400 / 800, where 400 is presumably the CPU clock in MHz).
• With an ALB hit, it takes 2 cycles to allocate memory.
• The average hardware malloc time is therefore 4.82 cycles.
• Software malloc varies from 51 to 900 cycles, with an average of 192.
• In an application that spends 30% of its time allocating, the speedup would be 41%.
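The chain of figures on this slide can be checked with a few lines of arithmetic. The 400 MHz CPU clock is an assumption inferred from the 192 × 400 / 800 expression, and the speedup is computed with Amdahl's law, which reproduces the quoted 41%.

```python
hit_ratio = 0.97
hit_cycles = 2
entry_bytes = 3 * 64                 # three 64-byte bit-vectors per ALB entry
bus_mb_per_s = 8 * 100               # 64-bit (8-byte) bus at 100 MHz
cpu_mhz = 400                        # assumption implied by the slide's formula

# Time to move one entry over the bus, expressed in CPU cycles.
miss_cycles = entry_bytes * cpu_mhz / bus_mb_per_s

# Weighted average; like the slide, a miss is charged 96 cycles in total.
avg_hw_malloc = hit_ratio * hit_cycles + (1 - hit_ratio) * miss_cycles

sw_malloc = 192                      # average software malloc, in cycles
alloc_fraction = 0.30                # share of run time spent allocating

# Amdahl's law: only the allocating fraction gets faster.
speedup = 1 / ((1 - alloc_fraction) + alloc_fraction * avg_hw_malloc / sw_malloc)

print(f"miss penalty  : {miss_cycles:.0f} cycles")
print(f"avg hw malloc : {avg_hw_malloc:.2f} cycles")
print(f"speedup       : {(speedup - 1) * 100:.0f}%")
```

Charging a miss 96 + 2 cycles instead would raise the average to about 4.88 cycles without changing the rounded speedup.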

Summary
• Object-oriented applications spend a lot of their time allocating memory.
• To allocate in hardware, we use a bit-vector-based approach.
• Allocation and deallocation are done combinationally, using a complete binary tree on top of the bit-vector.
• This yields a speedup of > 40% on memory-intensive programs.
