
Memory Management for High-Performance Applications

Presentation Transcript


  1. Memory Management for High-Performance Applications • Emery Berger, University of Massachusetts, Amherst

  2. High-Performance Applications • Web servers, search engines, scientific codes • C or C++ (still…) • Run on one or a cluster of server boxes • Need support at every level of the stack: compiler, runtime system, operating system, hardware

  3. New Applications, Old Memory Managers • Applications and hardware have changed • Multiprocessors now commonplace • Object-oriented, multithreaded • Increased pressure on the memory manager (malloc, free) • But memory managers have not kept up • Inadequate support for modern applications

  4. Current Memory Managers Limit Scalability • As we add processors, the program slows down • Caused by heap contention • (Graph: Larson server benchmark on a 14-processor Sun)

  5. The Problem • Current memory managers are inadequate for high-performance applications on modern architectures • Limit scalability, application performance, and robustness

  6. This Talk • Building memory managers • Heap Layers framework [PLDI 2001] • Problems with current memory managers • Contention, false sharing, space • Solution: provably scalable memory manager • Hoard [ASPLOS-IX] • Extended memory manager for servers • Reap [OOPSLA 2002]

  7. Implementing Memory Managers • Memory managers must be • Space efficient • Very fast • Heavily-optimized code • Hand-unrolled loops • Macros • Monolithic functions • Hard to write, reuse, or extend

  8. Building Modular Memory Managers • Classes: overhead, rigid hierarchy • Mixins: no overhead, flexible hierarchy

  9. A Heap Layer • A mixin with malloc & free methods: template <class SuperHeap> class GreenHeapLayer : public SuperHeap {…};
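
To make the mixin idiom concrete, here is a minimal sketch in modern C++; MallocHeap and DebugHeapLayer are illustrative names, not classes from the published Heap Layers framework.

```cpp
// Minimal sketch of the heap-layer mixin idiom (illustrative names only).
#include <cstdio>
#include <cstdlib>
#include <cstddef>

// Bottom layer: obtains memory from the system allocator.
class MallocHeap {
public:
  void* malloc(std::size_t sz) { return std::malloc(sz); }
  void  free(void* p)          { std::free(p); }
};

// A heap layer: a mixin that adds behavior and delegates to its SuperHeap.
template <class SuperHeap>
class DebugHeapLayer : public SuperHeap {
public:
  void* malloc(std::size_t sz) {
    void* p = SuperHeap::malloc(sz);
    std::printf("malloc(%zu) -> %p\n", sz, p);
    return p;
  }
  void free(void* p) {
    std::printf("free(%p)\n", p);
    SuperHeap::free(p);
  }
};

// Layers compose by template instantiation, with no virtual-call overhead.
typedef DebugHeapLayer<MallocHeap> MyHeap;

int main() {
  MyHeap heap;
  void* p = heap.malloc(64);
  heap.free(p);
}
```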

  10. Example: Thread-Safe Heap Layer • LockedHeap: protects its superheap with a lock • Composing LockedHeap over a malloc heap yields a LockedMallocHeap
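
A hedged sketch of what such a locking layer might look like; the framework's own LockedHeap may differ in interface and lock type, and MallocHeap is the same illustrative bottom layer as in the previous sketch.

```cpp
// Sketch of a thread-safe heap layer built by composition (illustrative).
#include <cstdlib>
#include <cstddef>
#include <mutex>

class MallocHeap {
public:
  void* malloc(std::size_t sz) { return std::malloc(sz); }
  void  free(void* p)          { std::free(p); }
};

// Adds mutual exclusion, then delegates every call to the superheap.
template <class SuperHeap>
class LockedHeap : public SuperHeap {
public:
  void* malloc(std::size_t sz) {
    std::lock_guard<std::mutex> guard(lock_);
    return SuperHeap::malloc(sz);
  }
  void free(void* p) {
    std::lock_guard<std::mutex> guard(lock_);
    SuperHeap::free(p);
  }
private:
  std::mutex lock_;
};

// "LockedMallocHeap": a thread-safe heap obtained purely by composition.
typedef LockedHeap<MallocHeap> LockedMallocHeap;
```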

  11. Empirical Results • Heap Layers vs. originals: • KingsleyHeap vs. BSD allocator • LeaHeap vs. DLmalloc 2.7 • Competitive runtime and memory efficiency

  12. Overview • Building memory managers • Heap Layers framework • Problems with memory managers • Contention, space, false sharing • Solution: provably scalable allocator • Hoard • Extended memory manager for servers • Reap

  13. Problems with General-Purpose Memory Managers • Previous work for multiprocessors • Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92] • Impractical • Multiple heaps [Larson 98, Gloger 99] • Reduce contention but cause other problems: • P-fold or even unbounded increase in space • Allocator-induced false sharing (shown in this work)

  14. Multiple Heap Allocator: Pure Private Heaps • One heap per processor: • malloc gets memory from its local heap • free puts memory on its local heap • Used by STL, Cilk, and ad hoc allocators • (Trace: objects malloc'ed and freed on processors 0 and 1; each free lands on the freeing processor's heap)
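
As a rough, hedged illustration of this scheme: one free list per thread, and free() always pushes onto the caller's own list. It assumes a single object size for simplicity and is not a real allocator.

```cpp
// Sketch of "pure private heaps": per-thread free lists, free-to-self.
#include <cstdlib>
#include <cstddef>

struct Block { Block* next; };

class PurePrivateHeap {
public:
  static void* malloc(std::size_t sz) {
    if (freeList_) {                       // reuse from this thread's list
      Block* b = freeList_;
      freeList_ = b->next;
      return b;
    }
    return std::malloc(sz < sizeof(Block) ? sizeof(Block) : sz);
  }
  static void free(void* p) {
    // Freed memory lands on the freeing thread's own list, even if another
    // thread allocated it -- the root of the producer-consumer blowup on
    // the next slide.
    Block* b = static_cast<Block*>(p);
    b->next = freeList_;
    freeList_ = b;
  }
private:
  static thread_local Block* freeList_;
};
thread_local Block* PurePrivateHeap::freeList_ = nullptr;
```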

  15. Problem: Unbounded Memory Consumption • Producer-consumer pattern: • Processor 0 allocates • Processor 1 frees • Freed memory accumulates on processor 1's heap and is never reused by processor 0, so consumption grows without bound • Crash! • (Trace: processor 0 repeatedly calls malloc(1); processor 1 frees each object)

  16. Multiple Heap Allocator: Private Heaps with Ownership • free returns memory to the original (owner) heap • Bounded memory consumption • No crash! • Used by “Ptmalloc” (Linux), LKmalloc • (Trace: each object is freed back to the heap that allocated it)

  17. Problem: P-fold Memory Blowup • Occurs in practice • Round-robin producer-consumer: • processor i mod P allocates • processor (i+1) mod P frees • Footprint = 1 (2GB), but space = 3 (6GB) • Exceeds 32-bit address space: Crash! • (Trace: processors 0, 1, and 2 allocate and free in turn; each heap ends up holding a full copy of the footprint)

  18. Problem: Allocator-Induced False Sharing • False sharing: non-shared objects on the same cache line • Bane of parallel applications • Extensively studied • All of these allocators cause false sharing! • (Diagram: x1 = malloc(1) on processor 0 and x2 = malloc(1) on processor 1 land on the same cache line; both processors thrash)
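
A small, hedged demonstration (not from the talk) of how an allocator can induce false sharing: if the two freshly allocated integers land on the same cache line, the two threads' writes invalidate each other's line on every iteration and the loops slow down dramatically.

```cpp
// Demo of allocator-induced false sharing (illustrative, not from the talk).
#include <cstdio>
#include <thread>

int main() {
  int* x1 = new int(0);   // allocated back-to-back; a serial allocator is
  int* x2 = new int(0);   // likely to place these on one cache line

  auto spin = [](int* p) {
    for (long i = 0; i < 100000000L; ++i) *p += 1;
  };
  std::thread t1(spin, x1);
  std::thread t2(spin, x2);
  t1.join();
  t2.join();

  std::printf("x1=%d x2=%d, %td bytes apart\n", *x1, *x2,
              reinterpret_cast<char*>(x2) - reinterpret_cast<char*>(x1));
  delete x1;
  delete x2;
}
```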

  19. So What Do We Do Now? • Where do we put free memory? • on a central heap: heap contention • on our own heap (pure private heaps): unbounded memory consumption • on the original heap (private heaps with ownership): P-fold blowup • And how do we avoid false sharing?

  20. Overview • Building memory managers • Heap Layers framework • Problems with memory managers • Contention, space, false sharing • Solution: provably scalable allocator • Hoard • Extended memory manager for servers • Reap

  21. Hoard: Key Insights • Bound local memory consumption • Explicitly track utilization • Move free memory to a global heap • Provably bounds memory consumption • Manage memory in large chunks • Avoids false sharing • Reduces heap contention

  22. Overview of Hoard • Manage memory in page-sized heap blocks • Avoids false sharing • Allocate from a per-processor local heap • Avoids heap contention • When a heap block's utilization drops low, move it to the global heap • Avoids space blowup • (Diagram: per-processor heaps 0 … P-1 exchange heap blocks with a global heap)
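
The policy can be sketched roughly as follows. This is a heavily simplified illustration of the slide's description only: the real Hoard also manages size classes, fullness groups, and per-block ownership for frees, and all names here (HeapBlock, EMPTY_FRACTION, ...) are invented; sz is assumed to be much smaller than BLOCK_BYTES.

```cpp
// Simplified sketch of the Hoard policy: local heaps backed by a global heap.
#include <mutex>
#include <vector>
#include <cstdlib>
#include <cstddef>

constexpr std::size_t BLOCK_BYTES    = 4096;  // "page-sized" heap blocks
constexpr double      EMPTY_FRACTION = 0.25;  // emptiness threshold (invented)

struct HeapBlock {
  char*       base = static_cast<char*>(std::malloc(BLOCK_BYTES));
  std::size_t bump = 0;                       // bump allocation, no per-object reuse
  std::size_t used = 0;                       // live bytes in this block
  void* allocate(std::size_t sz) {
    if (bump + sz > BLOCK_BYTES) return nullptr;
    void* p = base + bump; bump += sz; used += sz; return p;
  }
  void release(std::size_t sz) { used -= sz; }
};

struct Heap {
  std::mutex lock;
  std::vector<HeapBlock*> blocks;
  std::size_t inUse = 0, held = 0;            // usage statistics per heap
};

Heap globalHeap;                              // shared fallback heap

void* hoard_malloc(Heap& local, std::size_t sz) {
  std::lock_guard<std::mutex> g(local.lock);
  for (HeapBlock* blk : local.blocks)
    if (void* p = blk->allocate(sz)) { local.inUse += sz; return p; }
  // No room locally: take a heap block from the global heap, or a new one.
  HeapBlock* blk = nullptr;
  {
    std::lock_guard<std::mutex> gg(globalHeap.lock);
    if (!globalHeap.blocks.empty()) {
      blk = globalHeap.blocks.back();
      globalHeap.blocks.pop_back();
    }
  }
  if (!blk) blk = new HeapBlock();
  local.blocks.push_back(blk);
  local.held  += BLOCK_BYTES;
  local.inUse += sz;
  return blk->allocate(sz);
}

void hoard_free(Heap& local, HeapBlock* blk, std::size_t sz) {
  std::lock_guard<std::mutex> g(local.lock);
  blk->release(sz);
  local.inUse -= sz;
  // Key invariant: if this heap is mostly empty, return a heap block to the
  // global heap so other processors can reuse it (bounds the blowup; the
  // real Hoard moves a mostly-empty block, not simply the last one).
  if (!local.blocks.empty() && local.inUse < EMPTY_FRACTION * local.held) {
    HeapBlock* victim = local.blocks.back();
    local.blocks.pop_back();
    local.held -= BLOCK_BYTES;
    std::lock_guard<std::mutex> gg(globalHeap.lock);
    globalHeap.blocks.push_back(victim);
  }
}
```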

  23. Summary of Analytical Results • Space consumption: near-optimal worst case • Hoard: O(n log M/m + P), with P « n • Optimal: O(n log M/m) [Robson 70]: ≈ bin-packing • Private heaps with ownership: O(P n log M/m) • Provably low synchronization • n = memory required • M = biggest object size • m = smallest object size • P = processors
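
Written out with the slide's own variables (n = memory required, M and m = largest and smallest object sizes, P = processors), the quoted space bounds are:

```latex
% Space bounds quoted on the slide, in standard asymptotic notation.
\begin{align*}
  \text{Hoard:}                        &\quad O\bigl(n \log(M/m) + P\bigr), \qquad P \ll n \\
  \text{Optimal [Robson 70]:}          &\quad O\bigl(n \log(M/m)\bigr) \\
  \text{Private heaps with ownership:} &\quad O\bigl(P \, n \log(M/m)\bigr)
\end{align*}
```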

  24. Empirical Results • Measure runtime on 14-processor Sun • Allocators • Solaris (system allocator) • Ptmalloc (GNU libc) • mtmalloc (Sun’s “MT-hot” allocator) • Micro-benchmarks • Threadtest: no sharing • Larson: sharing (server-style) • Cache-scratch: mostly reads & writes (tests for false sharing) • Real application experience similar

  25. Runtime Performance: threadtest • Many threads, no sharing • Hoard achieves linear speedup • speedup(x, P) = runtime(Solaris allocator on one processor) / runtime(x on P processors)

  26. Runtime Performance: Larson • Many threads, sharing (server-style) • Hoard achieves linear speedup

  27. Runtime Performance: false sharing • Many threads, mostly reads & writes of heap data • Hoard achieves linear speedup

  28. Hoard in the “Real World” • Open source code • www.hoard.org • 13,000 downloads • Solaris, Linux, Windows, IRIX, … • Widely used in industry • AOL, British Telecom, Novell, Philips • Reports: 2x-10x, “impressive” improvement in performance • Search server, telecom billing systems, scene rendering,real-time messaging middleware, text-to-speech engine, telephony, JVM • Scalable general-purpose memory manager

  29. Overview • Building memory managers • Heap Layers framework • Problems with memory managers • Contention, space, false sharing • Solution: provably scalable allocator • Hoard • Extended memory manager for servers • Reap

  30. Custom Memory Allocation • Programmers often replace malloc/free • Attempt to increase performance • Provide extra functionality (e.g., for servers) • Reduce space (rarely) • Empirical study of custom allocators • Lea allocator often as fast or faster • Custom allocation ineffective, except for regions. [OOPSLA 2002]

  31. Overview of Regions • Separate memory areas, deletion only en masse: regioncreate(r), regionmalloc(r, sz), regiondelete(r) • Fast • Pointer-bumping allocation • Deletion of whole chunks • Convenient • One call frees all memory • Risky • Accidental deletion • Too much space
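
A hedged sketch of the region interface named on the slide; the chunk size, growth strategy, and exact signatures are illustrative, and it assumes every request is smaller than one chunk.

```cpp
// Sketch of a region (arena) allocator: bump allocation, bulk deletion only.
#include <cstdlib>
#include <cstddef>
#include <vector>

struct Region {
  std::vector<char*> chunks;               // all freed together at the end
  std::size_t used = 0;                    // bytes bumped in the last chunk
  static constexpr std::size_t CHUNK = 64 * 1024;
};

Region* regioncreate() { return new Region(); }

void* regionmalloc(Region* r, std::size_t sz) {
  if (r->chunks.empty() || r->used + sz > Region::CHUNK) {
    r->chunks.push_back(static_cast<char*>(std::malloc(Region::CHUNK)));
    r->used = 0;
  }
  void* p = r->chunks.back() + r->used;    // pointer-bumping allocation
  r->used += sz;
  return p;
}

void regiondelete(Region* r) {             // one call frees all memory
  for (char* c : r->chunks) std::free(c);
  delete r;
}
```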

  32. Why Regions? • Apparently faster, more space-efficient • Servers need memory management support: • Avoid resource leaks • Tear down memory associated with terminated connections or transactions • Current approach (e.g., Apache): regions

  33. Drawbacks of Regions • Can't reclaim memory within regions • A problem for long-running computations, producer-consumer patterns, and off-the-shelf “malloc/free” programs • unbounded memory consumption • Current situation for Apache: • vulnerable to denial-of-service • limits runtime of connections • limits module programming

  34. Reap Hybrid Allocator • Reap = region + heap • Adds individual object deletion & a heap: reapcreate(r), reapmalloc(r, sz), reapfree(r, p), reapdelete(r) • Can reduce memory consumption • Fast • Adapts to usage (region or heap style) • Cheap deletion
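
A hedged illustration of the reap interface above, built on the Region sketch from the regions slide: reapfree() puts an object on a per-reap free list, and reapmalloc() consults it before bumping new memory. Passing the size to reapfree is an artifact of this simplification; the slide's reapfree(r, p) takes only the pointer.

```cpp
// Sketch of a reap: a region plus individual object deletion (illustrative).
#include <map>

struct Reap {
  Region* region;                           // region for bulk allocation
  std::multimap<std::size_t, void*> freed;  // size -> individually freed objects
};

Reap* reapcreate() { return new Reap{ regioncreate(), {} }; }

void* reapmalloc(Reap* r, std::size_t sz) {
  auto it = r->freed.lower_bound(sz);       // reuse a freed object if possible
  if (it != r->freed.end()) {
    void* p = it->second;
    r->freed.erase(it);
    return p;
  }
  return regionmalloc(r->region, sz);       // otherwise region-style bumping
}

void reapfree(Reap* r, void* p, std::size_t sz) {
  r->freed.emplace(sz, p);                  // individual object deletion
}

void reapdelete(Reap* r) {                  // still one call frees everything
  regiondelete(r->region);
  delete r;
}
```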

  35. Using Reap as Regions • Reap performance nearly matches regions

  36. Reap: Best of Both Worlds • Combining new/delete with regions is usually impossible: • Incompatible APIs • Hard to rewrite code • Use Reap: incorporate new/delete code into Apache • “mod_bc” (arbitrary-precision calculator) • Changed 20 lines (out of 8000) • Benchmark: compute the 1000th prime • With Reap: 240K • Without Reap: 7.4MB

  37. Open Questions • Grand Unified Memory Manager? • Hoard + Reap • Integration with garbage collection • Effective Custom Allocators? • Exploit sizes, lifetimes, locality and sharing • Challenges of newer architectures • NUMA, SMT/CMP, 64-bit, predication

  38. Current Work: Robust Performance • Currently: no VM-GC communication • Bad interactions under memory pressure • Our approach (with Eliot Moss, Scott Kaplan): Cooperative Robust Automatic Memory Management • (Diagram: the virtual memory manager signals memory pressure to the garbage collector/allocator, which returns empty pages, reducing paging impact; an LRU queue tracks pages)

  39. Current Work: Predictable VMM • Recent work on scheduling for QoS • E.g., proportional-share • Under memory pressure, the VMM is the scheduler • Paged-out processes may never recover • Intermittent processes may wait a long time • Scheduler-faithful virtual memory (with Scott Kaplan, Prashant Shenoy) • Based on page value rather than order

  40. Conclusion • Memory management for high-performance applications • Heap Layers framework [PLDI 2001] • Reusable components, no runtime cost • Hoard scalable memory manager [ASPLOS-IX] • High-performance, provably scalable & space-efficient • Reap hybrid memory manager [OOPSLA 2002] • Provides speed & robustness for server applications • Current work: robust memory management for multiprogramming

  41. The Obligatory URL Slide http://www.cs.umass.edu/~emery

  42. If You Can Read This, I Went Too Far

  43. Hoard: Under the Hood • (Diagram of Hoard's layered structure: select a heap based on size; malloc from the local heap and free to the owning heap block; get memory from, or return it to, the global heap)

  44. Custom Memory Allocation • Replace new/delete, bypassing the general-purpose allocator • Reduce runtime – often • Expand functionality – sometimes • Reduce space – rarely • Very common practice • Apache, gcc, lcc, STL, database servers… • Language-level support in C++ • “Use custom allocators”

  45. Drawbacks of Custom Allocators • Avoiding the memory manager means: • More code to maintain & debug • Can't use memory debuggers • Not modular or robust: • Mixing memory from custom and general-purpose allocators → crash! • Increased burden on programmers

  46. Overview • Introduction • Perceived benefits and drawbacks • Three main kinds of custom allocators • Comparison with general-purpose allocators • Advantages and drawbacks of regions • Reaps – generalization of regions & heaps

  47. (I) Per-Class Allocators • Recycle freed objects of a class on a free list • Example trace: a = new Class1; b = new Class1; c = new Class1; delete a; delete b; delete c; a = new Class1; b = new Class1; c = new Class1; • Fast • Linked-list operations • Simple • Identical semantics • C++ language support • Possibly space-inefficient
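
A hedged sketch of such a per-class allocator: Class1 overloads operator new/delete so that delete pushes the object's storage onto a class-wide free list and new pops it back off. The payload field and the never-shrinking list are illustrative.

```cpp
// Sketch of a per-class allocator: recycle freed objects from a free list.
#include <cstdlib>
#include <cstddef>

class Class1 {
public:
  void* operator new(std::size_t sz) {
    if (freeList_) {                        // recycle a previously freed object
      void* p = freeList_;
      freeList_ = *static_cast<void**>(freeList_);
      return p;
    }
    return std::malloc(sz);                 // otherwise take fresh memory
  }
  void operator delete(void* p) noexcept {
    *static_cast<void**>(p) = freeList_;    // push onto the class's free list;
    freeList_ = p;                          // memory is never given back, hence
  }                                         // "possibly space-inefficient"
private:
  static void* freeList_;
  double payload_[4];                       // illustrative object state
};
void* Class1::freeList_ = nullptr;

int main() {
  Class1* a = new Class1;
  delete a;
  Class1* b = new Class1;                   // reuses a's storage from the list
  delete b;
}
```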

  48. (II) Custom Patterns • Tailor-made to fit an application's allocation pattern • Example: 197.parser (natural-language parser) allocates from a fixed char[MEMORY_LIMIT] array with an end_of_array bump pointer: a = xalloc(8); b = xalloc(16); c = xalloc(8); xfree(b); xfree(c); d = xalloc(8); • Fast • Pointer-bumping allocation • Brittle • Fixed memory size • Requires stack-like lifetimes
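
A hedged sketch of this pattern: one fixed array, a bump pointer (end_of_array), and an xfree that only makes sense for stack-like lifetimes. MEMORY_LIMIT and the bookkeeping are invented for illustration, and this sketch frees in reverse order to keep xfree trivial.

```cpp
// Sketch of a 197.parser-style custom allocation pattern (illustrative).
#include <cstddef>

constexpr std::size_t MEMORY_LIMIT = 1 << 20;
static char arena[MEMORY_LIMIT];
static std::size_t end_of_array = 0;        // bump pointer into the arena

void* xalloc(std::size_t sz) {
  if (end_of_array + sz > MEMORY_LIMIT) return nullptr;  // fixed capacity
  void* p = arena + end_of_array;
  end_of_array += sz;                       // pointer-bumping allocation
  return p;
}

void xfree(void* p) {
  // Roll the bump pointer back to the freed object; only valid when p is
  // the most recently allocated live object (stack-like lifetimes).
  end_of_array = static_cast<std::size_t>(static_cast<char*>(p) - arena);
}

int main() {
  void* a = xalloc(8);
  void* b = xalloc(16);
  void* c = xalloc(8);
  xfree(c);                                 // freed in reverse order
  xfree(b);
  void* d = xalloc(8);                      // reuses b's (and c's) space
  (void)a; (void)d;
}
```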

  49. (III) Regions • Separate memory areas, deletion only en masse: regioncreate(r), regionmalloc(r, sz), regiondelete(r) • Fast • Pointer-bumping allocation • Deletion of whole chunks • Convenient • One call frees all memory • Risky • Accidental deletion • Too much space

  50. Overview • Introduction • Perceived benefits and drawbacks • Three main kinds of custom allocators • Comparison with general-purpose allocators • Advantages and drawbacks of regions • Reaps – generalization of regions & heaps
