
Ran Liu (Fudan Univ., Shanghai Jiaotong Univ.), Haibo Chen (Shanghai Jiaotong Univ.)

SSMalloc: A Low-latency, Locality-conscious Memory Allocator with Stable Performance Scalability


Presentation Transcript


  1. SSMalloc: A Low-latency, Locality-conscious Memory Allocator with Stable Performance Scalability
     Ran Liu (Fudan Univ., Shanghai Jiaotong Univ.), Haibo Chen (Shanghai Jiaotong Univ.)

  2. Background
     • Many-core era
       • Computers with tens of cores are available
     • Many-thread applications
       • Server programs
       • Scientific computation programs
       • …
     • Many applications’ performance relies heavily on the memory allocator

  3. Allocator performance matters
     • Web server throughput with different memory allocators
     • *Taken from the Facebook website

  4. Is it a solved problem?
     [Figure: scale-up of glibc, SFMalloc (PACT11), jemalloc (BSDCan06), and Streamflow (ISMM06)]

  5. Is it a solved problem?
     [Figure: scale-up versus number of cores; annotations: unstable, kernel contention, user-level contention]

  6. The main problems in modern memory allocators
     • Unstable scalability
       • Critical path contention
       • Global data structure contention
       • Kernel contention
         • With 64 threads, SFMalloc spent a great amount of time in mmap calls.
     • Unstable locality
       • Kernel execution
       • Allocator data structure operations
       • Context switches
     • Unstable latency
       • Algorithm complexity
         • jemalloc uses RB trees (O(log N)) internally.
       • Hardware details (pipeline, branch prediction, cache)

  7. This paper
     [Figure: scale-up versus number of cores; annotation: stable]

  8. Design of SSMalloc

  9. Mechanisms for objects of different sizes
     • Small objects
       • Closely related to scalability
       • Handled in the private heap
     • Large objects
       • Forwarded to the OS via mmap
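The size-based split can be sketched in C. This is a minimal illustration, not SSMalloc's code: the 64 KB threshold is the one named on the small-object slide, while `choose_path`, `large_alloc`, and `large_free` are hypothetical names, and the sketch leaves out any header the real allocator attaches to large objects.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Threshold from the slides: objects up to 64 KB count as small. */
#define SMALL_OBJECT_MAX ((size_t)64 * 1024)

typedef enum { PATH_PRIVATE_HEAP, PATH_OS_MMAP } alloc_path;

/* Decide which path a request of `size` bytes takes. */
static alloc_path choose_path(size_t size) {
    return size <= SMALL_OBJECT_MAX ? PATH_PRIVATE_HEAP : PATH_OS_MMAP;
}

/* Hypothetical large-object path: one anonymous private mapping per object. */
static void *large_alloc(size_t size) {
    assert(choose_path(size) == PATH_OS_MMAP);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical large-object release: hand the mapping back to the OS. */
static void large_free(void *p, size_t size) {
    munmap(p, size);
}
```

Keeping the small path free of system calls is what ties it to scalability; only the large path in this sketch ever enters the kernel.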

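  10. Small object (<=64KB) management
     [Diagram: Thread 1, Thread 2, …, Thread N, each with its own private heap (Private Heap 1, Private Heap 2, …, Private Heap N); memory chunks move between the private heaps and the Global Pool, which sits on top of the OS]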

  11. Memory Chunk
     • Basic unit of memory management
     • Contains multiple objects of the same size class
     [Diagram: a chunk consists of a header followed by Obj 1 … Obj N; the header is split into Private RW, Shared R, and Shared W parts to avoid false sharing on allocator data structures]
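One way to picture the three header parts is a struct whose groups of fields each start on their own cache line. This is a sketch under assumptions: the 64-byte line size and every field name (`free_list`, `size_class`, `remote_free`, and so on) are hypothetical; the slides only name the three regions.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

typedef struct chunk_header {
    /* Private RW: read and written only by the owning thread. */
    alignas(CACHE_LINE) void *free_list;
    uint32_t free_count;

    /* Shared R: written once when the chunk is set up, then only read. */
    alignas(CACHE_LINE) uint32_t size_class;
    uint32_t object_size;

    /* Shared W: written by other threads, e.g. when they free an object. */
    alignas(CACHE_LINE) _Atomic(void *) remote_free;
} chunk_header;

/* Which cache line of the header a given field offset falls into. */
static size_t cache_line_index(size_t offset) {
    return offset / CACHE_LINE;
}
```

With this layout a remote thread writing `remote_free` never invalidates the line holding the owner's `free_list`, which is the false-sharing case the slide calls out.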

  12. Memory Chunk (Cont.)
     • Same size
       • Cross size-class reuse
       • Easy metadata locating
     • Unaligned size (65536 + 256 bytes)
       • Mitigates cache conflicts on the header
     [Diagram: a 256-byte header followed by a 65536-byte data area, mapped onto the cache]
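The effect of the unaligned chunk size can be checked with a little arithmetic. The sketch below assumes a cache with 64-byte lines and 512 sets (both hypothetical parameters, as are the function names) and counts how many distinct sets the headers of back-to-back chunks fall into.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64u    /* assumed cache-line size in bytes */
#define NUM_SETS  512u   /* assumed number of cache sets */

/* Set index of an address in a simple modulo-indexed cache. */
static unsigned cache_set(uintptr_t addr) {
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

/* Count the distinct sets hit by the first line of the first `n` chunk
   headers when chunks are laid out back to back with the given stride. */
static unsigned distinct_header_sets(size_t stride, unsigned n) {
    unsigned char seen[NUM_SETS] = {0};
    unsigned distinct = 0;
    for (unsigned k = 0; k < n; k++) {
        unsigned s = cache_set((uintptr_t)k * stride);
        if (!seen[s]) {
            seen[s] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Under these assumed parameters, `distinct_header_sets(65536, 64)` is 1, because every header lands in the same set, while `distinct_header_sets(65536 + 256, 64)` is 64: each extra 256 bytes shifts the next header by four cache lines, exactly the footprint of one header.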

  13. Private Heap
     [Diagram: full chunks, foreground chunks, background chunks (LIFO linked list), local free chunks]

  14. Private Heap (Cont.)
     [Diagram: full chunks, foreground chunks, background chunks (LIFO linked list), local free chunks, placed on an axis running from hot chunks to cold chunks]
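The chunk lists of a private heap can be sketched as plain, unsynchronized linked lists, since only the owning thread touches them. The slides name the lists but not the transitions between them, so the refill policy below (`refill_foreground`) is an assumption made for illustration, as are all type and function names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct chunk {
    struct chunk *next;
    int id;                 /* hypothetical label, for illustration only */
} chunk;

typedef struct {
    chunk *top;
} lifo_list;

/* Push onto the head: the most recently pushed chunk is reused first. */
static void lifo_push(lifo_list *l, chunk *c) {
    c->next = l->top;
    l->top = c;
}

/* Pop from the head; returns NULL when the list is empty. */
static chunk *lifo_pop(lifo_list *l) {
    chunk *c = l->top;
    if (c) {
        l->top = c->next;
        c->next = NULL;
    }
    return c;
}

typedef struct {
    chunk    *foreground;   /* chunk currently serving allocations */
    lifo_list background;   /* partially used chunks, LIFO order */
    lifo_list local_free;   /* completely free chunks kept locally */
    lifo_list full;         /* chunks with no free objects left */
} private_heap;

/* Assumed policy: park the exhausted foreground chunk on the full list,
   then prefer the most recent background chunk over a local free chunk. */
static chunk *refill_foreground(private_heap *h) {
    if (h->foreground)
        lifo_push(&h->full, h->foreground);
    chunk *c = lifo_pop(&h->background);
    if (!c)
        c = lifo_pop(&h->local_free);
    h->foreground = c;
    return c;
}
```

LIFO order means the chunk touched most recently is the next one reused, which is one plausible reading of the hot-to-cold ordering in the diagram.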

  15. Global Pool
     [Diagram: Private Heap A and Private Heap B return chunks to and take chunks from the global pool (global reuse, lock-free); new chunks are allocated from the raw memory pool (lock-free)]
     • The raw memory pool is enlarged exponentially to avoid mmap calls
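Both ideas on this slide can be sketched with C11 atomics. The lock-free reuse list below is a textbook Treiber stack, used here as a stand-in because the slides do not say which lock-free structure SSMalloc uses; it ignores the ABA problem that a production allocator has to handle. The doubling factor in `enlargements_needed` is likewise an assumption: the slides only say "exponentially". All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct gchunk {
    struct gchunk *next;
} gchunk;

typedef struct {
    _Atomic(gchunk *) top;
} global_pool;

/* Return a chunk to the global pool without taking a lock. */
static void pool_put(global_pool *p, gchunk *c) {
    gchunk *old = atomic_load_explicit(&p->top, memory_order_relaxed);
    do {
        c->next = old;      /* `old` is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(
                 &p->top, &old, c,
                 memory_order_release, memory_order_relaxed));
}

/* Take a chunk from the global pool; NULL when it is empty.
   Caveat: this simple version is exposed to the ABA problem. */
static gchunk *pool_get(global_pool *p) {
    gchunk *old = atomic_load_explicit(&p->top, memory_order_acquire);
    while (old && !atomic_compare_exchange_weak_explicit(
                      &p->top, &old, old->next,
                      memory_order_acquire, memory_order_acquire))
        ;
    return old;
}

/* Number of raw-pool enlargements needed to supply `need` chunks when
   the first enlargement adds `initial` chunks and each later one doubles. */
static unsigned enlargements_needed(size_t initial, size_t need) {
    unsigned calls = 0;
    size_t have = 0, step = initial;
    while (have < need) {
        have += step;
        step *= 2;
        calls++;
    }
    return calls;
}
```

With an initial enlargement of one chunk, supplying 1000 chunks takes 10 enlargements in this model instead of 1000 fixed-size ones, which is the point of growing the raw pool exponentially.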

  16. Global Pool (Cont.)
     • Interaction with the OS
     • SSMalloc (time-directed reclamation)
       • Reduces VM management calls
     • Many other allocators (space-directed reclamation)
       • Memory pages ping-pong between user space and the kernel
       • Excessive VM management calls
     [Figure: memory amount over time]

  17. How to free an object?
     • Problem: determine the size of a memory object
     • Textbook solution: per-object header
       • Easy to locate, bad locality
     • Modern allocators: centralized metadata
       • Hard to locate (bitmap, hash table, radix tree…), good locality

  18. How to free an object?
     • Problem: determine the size of a memory object
     • SSMalloc: unified header for small and large objects
       • Every object’s header is at the previous chunk boundary
       • Easy to locate (align to chunk boundary), good locality
     [Diagram: header placement for small objects and large objects]
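Locating the header from an object pointer can be sketched as rounding down to the previous chunk boundary. The chunk dimensions come from the Memory Chunk slides; the assumption that chunks are carved back to back from one region starting at a known `base`, and the name `header_of`, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEADER_SIZE 256u                        /* header size from the slides */
#define DATA_SIZE   65536u                      /* data area size from the slides */
#define CHUNK_SIZE  (HEADER_SIZE + DATA_SIZE)   /* 65792 bytes, not a power of two */

/* Assumption: chunks are laid out back to back starting at `base`,
   so chunk boundaries sit at base + k * CHUNK_SIZE. */
static void *header_of(const void *base, const void *obj) {
    uintptr_t off = (uintptr_t)obj - (uintptr_t)base;
    return (void *)((uintptr_t)base + (off - off % CHUNK_SIZE));
}
```

Because every chunk has the same size, no bitmap, hash table, or radix tree lookup is needed: one subtraction and one modulo reach the metadata, and it sits next to the data it describes.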

  19. Design summary
     • Scalability
       • Sync-free critical path
       • Local memory reuse
       • Lock-free global data structure
       • Avoidance of excessive VM management calls (mmap, munmap)
       • …
     • Latency
       • Wait-free algorithm within the private heap
       • Short critical path
       • Unified header
       • …
     • Locality
       • Locality-conscious memory chunk management
       • Allocator false-sharing avoidance
       • …

  20. Evaluation

  21. Evaluation
     • Platform
       • 8 × six-core (2.4 GHz) AMD x64 system (48 cores in total)
       • 128 GB memory
       • Linux 3.2.10
     • Other memory allocators
       • glibc
       • TCMalloc from google-perftools 1.7
       • jemalloc 2.1.2
       • Streamflow
       • SFMalloc

  22. Latency
     • Allocation-intensive serial programs

  23. Scalability
     • shbench performance

  24. Locality
     • Wordcount from Phoenix 2.0: cache misses

  25. Map-reduce performance
     • Wordcount from Phoenix 2.0

  26. Conclusion
     • Analyzed the performance problems of memory allocators
     • Explored the design space of memory allocators for many-thread applications on many-core systems
     • A prototype: SSMalloc
       • Low latency
       • Stable scalability
       • Good locality
     Thanks!

  27. Why not modify the kernel to improve mmap scalability?
     • Parallelizing the VM management operations involves huge kernel code refactoring
       • The memory manager itself
       • Device drivers
     • Applying a new memory allocator is much easier and more practical.
