
Modron


Presentation Transcript


  1. Modron Production world view of Garbage Collection in the J2ME and J2SE spaces Ryan_Sciampacone@ca.ibm.com

  2. What this talk is about • Description of the lines in the sand we draw between the various parts of the memory manager in J9 • Allocation, garbage collection, free list management • Comparison of the collectors in the J2ME and J2SE spaces • Limitations of the environment, and how they affect what you can do • Examination of some of the performance-related issues associated with the decisions that get made • For better or for worse (or of no consequence?)

  3. A quick history lesson J9 started as a clean-room Java VM implementation for the embedded (J2ME) space • Small, Hotswap debugging, JVMPI, TCK compliant • Garbage collection was a single-threaded generational solution Fast forward… J9 continues to be clean-room, but is also targeted for the desktop and server (J2SE) space • Keep things small, but features add size • Scalable collection strategy (CPU + Memory)

  4. The problem space The original problem space starts with small devices • Hand-helds, cell phones and air conditioners • Limited OS support, sometimes none • Threading packages are suspect • Hardware is simplistic As we move forward, desktops and servers enter the picture: • Desktop machines • Web browsers • Development environments (Eclipse) • Servers • WebSphere

  5. Breaking down the components The highest-level concept that has differing properties is the Heap. • The problem space drives how the heap is organized • Virtual Memory? • Contiguous or non-contiguous? • Based on how the heap is divided (physically or virtually through collection strategies), each area is associated with a Segment. [Diagram: the heap divided into segments]

  6. Breaking down the components A Memory Space describes the garbage collection strategy that is applied to the heap. • Mark/Sweep/Compact (single area – “flat”) • Generational semi-space copying collector • It is the topmost container for the segments that divide the heap • A typical heap has only a single Memory Space • J9 supports multiple Memory Spaces in a single heap • “lightweight processes” • The Memory Space's responsibility is to identify which collection strategies apply to the part of the heap it is responsible for

  7. Breaking down the components Memory Spaces are divided into Memory Subspaces that associate the different parts of the Memory Space with different collection strategies. • A Memory Subspace is responsible for handling allocation requests and failures, as well as garbage collection requests, made on different parts of the heap. [Diagram: a Memory Space split into New and Old subspaces, mapped onto heap segments]

  8. Breaking down the components A Memory Subspace that can allocate from the heap uses a Memory Pool, which handles adding/removing from the free list • A Pool is only responsible for the management of the free list • Not responsible for garbage collection • Not responsible for object initialization • It handles any synchronization issues [Diagram: a Memory Space whose New subspace uses a bump-pointer pool and whose Old subspace uses an address-ordered free-list pool]
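To make the division of labour concrete, here is a minimal sketch (not J9 code; names such as AddressOrderedListPool and FreeEntry are invented) of the kind of address-ordered free-list pool a Memory Pool wraps: it only adds to and removes from the free list and does its own locking, leaving collection and object initialization to other components.

```cpp
#include <cstddef>
#include <mutex>

struct FreeEntry {              // header stored inside each free chunk of heap
    std::size_t size;           // bytes in this free chunk
    FreeEntry  *next;           // next free chunk, in ascending address order
};

class AddressOrderedListPool {
    FreeEntry *head = nullptr;
    std::mutex lock;            // the pool handles its own synchronization
public:
    // First-fit allocate; returns nullptr on failure (the caller decides about GC).
    void *allocate(std::size_t size) {
        std::lock_guard<std::mutex> g(lock);
        FreeEntry **prev = &head;
        for (FreeEntry *e = head; e != nullptr; prev = &e->next, e = e->next) {
            if (e->size >= size) {
                *prev = e->next;       // unlink the whole entry (no splitting, for brevity)
                return e;              // the caller initializes the object
            }
        }
        return nullptr;                // allocation failure: not this class's problem
    }

    // Return a dead range to the pool, keeping the list in address order.
    void addFree(void *addr, std::size_t size) {
        std::lock_guard<std::mutex> g(lock);
        FreeEntry *entry = static_cast<FreeEntry *>(addr);
        entry->size = size;
        FreeEntry **prev = &head;
        while (*prev != nullptr && *prev < entry) prev = &(*prev)->next;
        entry->next = *prev;
        *prev = entry;
    }
};
```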

  9. Memory Space Breakdown (“Flat” Memory Space) • Memory Space: container class that groups regions of memory into a “space” • Memory Sub Space: focal point for allocation, failed allocate and collection • Memory Pool: handles allocation for a particular memory space

  10. Breaking down the components Memory Subspaces also have Collectors associated with them as part of the allocation failure handling process • Can call for a collect if the associated pool fails the allocation request • Collections of a particular subspace are responsible for the memory associated with it and all its child subspaces [Diagram: a Global GC covering both New and Old, and a generational Local GC covering New]
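A minimal sketch (hypothetical names, reusing the AddressOrderedListPool from the earlier sketch) of the responsibility split this slide describes: the pool only says yes or no, and the subspace owns the "allocation failed, ask my collector, then retry" policy.

```cpp
#include <cstddef>

class Collector {
public:
    virtual ~Collector() = default;
    virtual void collect() = 0;        // collects this subspace and its children
};

class MemorySubSpace {
    AddressOrderedListPool pool;       // free-list management only
    Collector *collector;              // e.g. a local (scavenge) or global collector
public:
    explicit MemorySubSpace(Collector *c) : collector(c) {}

    void *allocate(std::size_t size) {
        if (void *p = pool.allocate(size)) return p;   // fast path: pool satisfied the request
        collector->collect();                          // allocation failure: trigger a collection
        return pool.allocate(size);                    // retry; nullptr here means OutOfMemoryError
    }
};
```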

  11. Breaking down the components Expansion and contraction of the various Memory Sub Spaces is handled by Physical Arenas (PA) and Physical Sub Arenas (PSA). • PAs are associated with the Memory Space, and • PSAs are responsible for communicating directly with the heap (and their governing PA) when allocating or releasing memory [Diagram: a Memory Space with New and Old subspaces, each backed by a Physical Sub Arena under a single Physical Arena that manages the heap segments]

  12. Breaking down the components [Diagram: the full picture: Memory Space, Memory Sub Spaces with their GC, Memory Pools and Physical Sub Arenas, all under a Physical Arena that maps onto the heap segments]

  13. VM Facilities Whenever I hear this: We’d love to write a collector for J9! It is usually followed by these two questions: • How do I “stop the world” so that I can collect? • How do I walk the stacks to find all references, or can we even do that?

  14. VM Facilities – Stop The World J9 uses a co-operative suspend mechanism • Java threads are either actively mutating the heap or are external to the VM (e.g., JNI calls) • All threads are sent asynchronous messages to “suspend” through VM facilities • Threads external to the VM during a suspend request are locked from re-entering the VM • When all threads have been accounted for, the stop is successful Motivation: Not all embedded platforms have thread packages that work. Question: But why not use thread package facilities when available?

  15. VM Facilities – Scanning stacks Part of the root set walk is finding all references on stacks. If you are unable to tell where references are on the stack, you can pessimistically decide that anything that looks like a reference is one (conservative scanning). Co-operative suspend model: part of the agreement is to leave the stack in a well-understood state • This includes being able to find all references at all times • The collector can find all references through a stack walk

  16. The Heap Lock [Chart: SPECjbb2000 (http://www.spec.org/jbb2000/) throughput vs. number of warehouses]

  17. The Heap Lock Possibly the single most important item for scaling! • Misleading term; the focus is on possible contention in acquiring memory for allocation in the heap. Simplest example: the bump pointer. Guarantee a compacted heap, and it is easy to allocate from the single free entry • Inlining this allocate (JIT, VM) is also easy. [Diagram: heap with a single allocate pointer] (Always compacting the entire heap may be slow, but there are ways to mitigate this)
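A minimal sketch of bump-pointer allocation under the heap lock, assuming a fully compacted heap with one large free region; the names (BumpArea, heapLock) are illustrative, not J9's.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>

struct BumpArea {
    std::uint8_t *alloc;        // next free byte (the "allocate ptr" in the diagram)
    std::uint8_t *top;          // end of the free region
    std::mutex    heapLock;     // the heap lock this section is about

    void *allocate(std::size_t size) {
        std::lock_guard<std::mutex> g(heapLock);
        if (alloc + size > top) return nullptr;   // out of space: time to collect
        void *result = alloc;
        alloc += size;                            // "bump" the pointer
        return result;
    }
};
```

Because the body is just a compare and an add, a JIT can inline the common path and only call out of line when the region is exhausted.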

  18. The Heap Lock Problem with Bump Pointer: Does not handle contention well. • Many threads, many CPUs – much looping trying to bump the pointer • CPU is busy, but doing nothing • Bus lock contention attempting to atomically change the value Reduce contention on the lock by going to it less: “batch” allocate more heap than we need each time
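The lock-free variant looks harmless, but it is exactly where the bus-lock contention shows up: every allocating thread spins on a compare-and-swap of the same allocate pointer. A sketch (illustrative names only):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

std::atomic<std::uint8_t *> allocPtr;   // shared bump pointer
std::uint8_t *heapTop;                  // end of the free region

void *atomicBumpAllocate(std::size_t size) {
    std::uint8_t *old = allocPtr.load(std::memory_order_relaxed);
    do {
        if (old + size > heapTop) return nullptr;   // exhausted: caller triggers a collection
        // compare_exchange_weak reloads 'old' on failure, so the loop retries with the
        // latest value; with many CPUs this retry loop is the contention described above.
    } while (!allocPtr.compare_exchange_weak(old, old + size));
    return old;
}
```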

  19. The Heap Lock Thread Local Heaps (TLH) Allocate a region of memory from the free list each time • Region is local to a thread only (no contention to allocate out of) • Reduces the time spent on the lock per object [Diagram: per-thread local heaps carved out of the main heap] Works well for a true free-list system (fragmented heap).

  20. The Heap Lock A few key points to making TLHs work: • Need to guarantee some form of minimum size on the free list • Too small, and you’ve gone back to single lock/single object • Too large, and you unnecessarily fragment the heap (dark matter) • Variable? Might make sense, if the average object size changes • Vary the rate of consumption when allocating a TLH • Threads that allocate frequently grow their TLH consumption rate (hungry threads get fed more) • Quiet threads keep a low consumption rate The minimum free list size is the only guarantee on what you’ll get back!
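A minimal sketch of the TLH scheme described above (the sizes and growth policy are invented, and the shared pool is the hypothetical AddressOrderedListPool from the earlier sketch): the common path bumps a thread-private pointer with no locking, and only a refill takes the contended trip to the shared free list.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

struct ThreadLocalHeap {
    std::uint8_t *alloc = nullptr;          // thread-private bump pointer
    std::uint8_t *top   = nullptr;
    std::size_t   refillSize = 1024;        // consumption rate; grows for "hungry" threads

    void *allocate(AddressOrderedListPool &pool, std::size_t size) {
        if (alloc != nullptr && alloc + size <= top) {   // common case: no lock at all
            void *p = alloc;
            alloc += size;
            return p;
        }
        if (size >= refillSize)                 // too big for a TLH: go to the pool directly
            return pool.allocate(size);
        // Refill: one contended trip to the free list buys a whole batch of objects.
        void *chunk = pool.allocate(refillSize);
        if (chunk == nullptr) return nullptr;   // allocation failure is handled by the subspace
        alloc = static_cast<std::uint8_t *>(chunk);
        top   = alloc + refillSize;
        refillSize = std::min<std::size_t>(refillSize * 2, 128 * 1024); // feed hungry threads more
        void *p = alloc;
        alloc += size;
        return p;
    }
};
```

The unused tail of each TLH is the "dark matter" the slide mentions, which is why the refill size has to be bounded.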

  21. Object (Header) Overhead One of the most frequently asked questions from customers: What is your object header size? The motivation for the question is really two things: • Reduce memory footprint (embedded space) • Reduce the frequency of garbage collection (more efficient use of the heap) The better question is: What is the average overhead per object with respect to the heap and total memory?

  22. Object (Header) Overhead Heap factors: • Alignment? Anything more than pointer width is a chance for wastage • Monitor slot, hash code – do they exist? What is the cost of creating them? Non-heap factors: • Meta-level structures for completing collection • Mark maps • Card tables • Meta-level data for completing collection • Reference object description • Object allocation map This does not even include the cost associated with execution!

  23. The Collectors J9 has historically had a generational garbage collector • Two generations • New area collected by a semi-space copying collector • Old area collected by mark/sweep/compact (+ new area) • Always compacted, always stayed small • Non-contiguous address space for the heap J9 continues to be generational, and also offers a single-generation collector • Two configurations, tiny and standard (parallel) • New area is optional • Compaction is optional

  24. Global Collector There are 3 parts to the global collector: • Marking • Traces through root sets and objects to find all live objects in the system • Sweep • Finds all objects that are dead (or that have died previously) and forms the free list • Compact • Shuffles live objects in memory such that free entries form large contiguous chunks

  25. Global Collector - Marking For small devices, the less extra data you carry around, the better • There is already only a small amount of memory Keep in mind the heap can be a series of malloc()’d pieces of memory • Why not allocate one big chunk? Can’t, or don’t want to. To actually mark an object as live, set a bit in the object header • The class slot (which is aligned) is as good as any Trace through marked objects by keeping a list of what has been visited • This can be achieved by adding two slots to the Class type
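A sketch of the "mark bit in the class slot" trick: because class pointers are aligned, the low bit of the slot is free to act as the mark. The object layout here is hypothetical, not the real J9 object model; the two per-class slots (classLink/instanceLink) that chain marked instances for later scanning are only hinted at in the comments.

```cpp
#include <cstdint>

struct J9ClassSketch;                 // hypothetical class structure; in the real scheme it
                                      // would also carry the classLink and instanceLink slots

struct ObjectSketch {
    std::uintptr_t classSlot;         // aligned class pointer, so the low bit is spare
};

constexpr std::uintptr_t MARK_BIT = 0x1;

inline bool isMarked(const ObjectSketch *obj) {
    return (obj->classSlot & MARK_BIT) != 0;
}

inline bool markObject(ObjectSketch *obj) {   // returns true if this call set the bit
    if (isMarked(obj)) return false;
    obj->classSlot |= MARK_BIT;               // marking writes to the object header itself,
    return true;                              // which is why the whole heap gets written to
}

inline J9ClassSketch *classOf(const ObjectSketch *obj) {
    return reinterpret_cast<J9ClassSketch *>(obj->classSlot & ~MARK_BIT);
}
```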

  26. Global Collector - Marking: Instance linking [Diagram: Class A, Class M and Class X each carry classLink and instanceLink slots; marked instances (Object A1, Object A2, Object M1, Object X1) are chained from their class through these links and their classPointer slots]

  27. Global Collector - Marking Positives: • Overhead of 2 words per type in the system for tracing • +1 bit to declare the object as marked • Trace through an instance entirely in one shot Negatives: • The entire heap gets written to (caching) • Twice, once to mark and once to clear the “mark” bit • Not parallel friendly

  28. Global Collector - Marking On larger machines, the heap is contiguous memory • A mark map is used to track objects • Single bit per slot on the heap (3.125% overhead) [Diagram: the heap and its mark map]
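A sketch of the mark-map indexing (illustrative names; a 32-bit slot size is assumed here to match the 3.125% figure on the slide): the object's offset in the heap selects one bit in a side array, so marking no longer writes to the object itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct MarkMap {
    std::uint8_t *heapBase = nullptr;
    std::vector<std::uint32_t> bits;              // one bit per heap slot

    static constexpr std::size_t SLOT_BYTES = 4;  // 32-bit slots: 1 bit per 32 bits = 3.125%

    // Returns true if this call newly marked the object.
    bool setMarked(void *obj) {
        std::size_t slot = (static_cast<std::uint8_t *>(obj) - heapBase) / SLOT_BYTES;
        std::uint32_t bit   = 1u << (slot & 31);
        std::uint32_t &word = bits[slot >> 5];
        if (word & bit) return false;
        word |= bit;                              // a parallel marker would CAS this word instead
        return true;
    }
};
```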

  29. Global Collector - Marking • Tracing live objects has 2 stages • Gathering all root references • Stacks, JNI references, constant tables, classes • Recursive scanning of live objects until no new objects remain • A marking thread uses a WorkStack • Push objects that it has successfully marked • Pop objects whose fields should be scanned • Objects have all their fields scanned at the same time • Cache-friendly technique
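A single-threaded sketch of that trace loop (reusing the hypothetical MarkMap from the previous sketch; ObjRef and its refs field stand in for the VM's real object layout and field walks): gather the roots, then pop, scan all fields at once, and push whatever this thread newly marked.

```cpp
#include <vector>

struct ObjRef {
    void *addr;                     // heap address, used for the mark map
    std::vector<ObjRef *> refs;     // stand-in for the object's reference fields
};

void traceLiveObjects(const std::vector<ObjRef *> &roots, MarkMap &map) {
    std::vector<ObjRef *> workStack;                    // stand-in for the WorkPacket queues

    for (ObjRef *r : roots)                             // stage 1: gather and mark the roots
        if (r != nullptr && map.setMarked(r->addr))
            workStack.push_back(r);

    while (!workStack.empty()) {                        // stage 2: scan until no new objects remain
        ObjRef *obj = workStack.back();
        workStack.pop_back();
        for (ObjRef *field : obj->refs)                 // all fields scanned in one shot
            if (field != nullptr && map.setMarked(field->addr))
                workStack.push_back(field);             // push only what we newly marked
    }
}
```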

  30. Global Collector - Marking • The WorkStack uses an input/output system • Queue for items to process (objects whose fields are to be scanned) • Queue for objects it has found and successfully marked [Diagram: input queue, WorkStack, output queue; the next object to scan (header plus object fields) is taken from the input queue]

  31. Global Collector - Marking [Diagram: next step of the input queue / WorkStack / output queue animation]

  32. Global Collector - Marking • The queue used by the WorkStack is called a WorkPacket • There are many WorkPacket queues in the collector • The WorkPackets object manages the full and empty packets (to be processed, available for filling) [Diagram: a pool of full and empty WorkPacket objects managed by WorkPackets; the WorkStack holds one input packet and one output packet]

  33. Global Collector - Marking [Diagram: next step of the WorkPackets / WorkStack animation]

  34. Global Collector - Marking Of course… we only have finite resources. • You can run out of WorkPackets, so how do you handle overflow? • Take a full packet, and move its contents to an overflow “list” to be processed later • Overflow avoidance

  35. Global Collector - Marking • Reference array splitting • Do not scan all references at once • Defer scanning to the output WorkPacket • API to push two elements: Array and Index [Diagram: a reference array with its header, alongside the input queue, WorkStack and output queue]

  36. Global Collector - Marking [Diagram: next step of the animation; the array is re-pushed together with the index at which scanning should resume]
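A sketch of the splitting described on slide 35 (the chunk size, the (array, index) pairing and all names are illustrative; ObjRef and MarkMap come from the earlier sketches): scan a bounded slice of the array, then defer the remainder by re-pushing the array together with the index to resume from.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

struct RefArray {
    ObjRef    **elements;
    std::size_t length;
};

constexpr std::size_t SCAN_CHUNK = 512;   // references scanned per visit

void scanReferenceArraySlice(RefArray *array, std::size_t startIndex,
                             MarkMap &map,
                             std::vector<ObjRef *> &markStack,
                             std::deque<std::pair<RefArray *, std::size_t>> &deferred) {
    std::size_t end = std::min(startIndex + SCAN_CHUNK, array->length);
    for (std::size_t i = startIndex; i < end; ++i) {
        ObjRef *ref = array->elements[i];
        if (ref != nullptr && map.setMarked(ref->addr))
            markStack.push_back(ref);
    }
    if (end < array->length)
        deferred.push_back({array, end});  // defer the rest: push "array + index" for later
}
```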

  37. Global Collector - Marking Positives: • Parallel story for tracing • Also included a work sharing story • “Marking” is a more localized operation Negatives: • Destroyed part of the locality for tracing • But was it any better before? • We could improve this anyway • Memory overhead

  38. Global Collector - Sweep Find everything that is dead (or has died previously) and try to add it to the free list. For the small collector this is easy: • Walk the heap memory, unmarking live objects and coalescing unmarked objects into the free list • A very single-threaded approach For the large collector, this is almost as easy: • Walk the mark map, finding ranges of zeroes that might be free-list candidates, and process them This doesn’t appear to need any extra data structures… unless you parallelize it.
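A simplified, single-threaded sketch of the large-collector walk (reusing the hypothetical MarkMap): report each maximal run of zero bits as a potential free range. A real sweep also has to account for the extent of the last marked object before a run (its interior slots are zero in the map too) and splits the map into chunks to parallelize, which is where the extra data structures come in.

```cpp
#include <cstddef>
#include <cstdint>

// Calls report(startSlot, slotCount) for each maximal run of unmarked slots.
template <typename ReportFreeRange>
void sweepMarkMap(const MarkMap &map, std::size_t totalSlots, ReportFreeRange report) {
    std::size_t runStart = 0;
    bool inRun = false;
    for (std::size_t slot = 0; slot < totalSlots; ++slot) {
        bool marked = (map.bits[slot >> 5] >> (slot & 31)) & 1u;
        if (!marked && !inRun)      { inRun = true;  runStart = slot; }
        else if (marked && inRun)   { inRun = false; report(runStart, slot - runStart); }
    }
    if (inRun) report(runStart, totalSlots - runStart);   // trailing free range
}
```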

  39. Global Collection - Sweep [Diagram: a sweep chunk and its inner free list]

  40. Global Collection - Sweep • Each chunk records 3 things • Inner free list • Leading free entry • Projection • Chunks are then connected [Diagram: a sweep chunk showing its leading free entry and its projection into the next chunk]
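A hypothetical shape for the per-chunk record a parallel sweep keeps, matching the three items on the slide (FreeEntry is from the earlier free-list sketch); connecting the chunks afterwards is what resolves leading entries and projections against their neighbours.

```cpp
#include <cstddef>

struct SweepChunk {
    void       *base;                 // start address of this chunk of heap
    void       *end;                  // end address of this chunk
    FreeEntry  *innerFreeList;        // free entries lying wholly inside the chunk
    std::size_t leadingFreeBytes;     // free space at the start of the chunk (may join the
                                      // previous chunk's trailing free space or projection)
    std::size_t projectionBytes;      // bytes by which the last object or free entry of this
                                      // chunk projects into the following chunk(s)
    SweepChunk *next;                 // chunks are then connected in address order
};
```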

  41. Global Collection - Sweep Connecting chunks has a few gotchas • Chunks that appear completely empty • Projections can span several chunks • Chunks that appeared to be completely free are in fact consumed

  42. Local Collection “Objects die young” • Semi-space copying collector for the nursery • Typically a fraction of the total heap size • Objects are promoted to old space when a copy (age) threshold is reached • The age threshold is adaptive, based on the nursery population • Tenure quickly based on size (not available – not hard)
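A minimal sketch of the flip-versus-tenure decision (the threshold adaptation and the size-based fast tenure mentioned on the slide are omitted; names are illustrative):

```cpp
#include <cstdint>

enum class CopyTarget { Survivor, Tenure };

// Objects that have survived 'tenureAgeThreshold' scavenges are promoted to Old
// (Tenure) space; younger objects are flipped into the Survivor semi-space.
// In the real collector the threshold adapts to the nursery population.
CopyTarget chooseCopyTarget(std::uint32_t objectAge, std::uint32_t tenureAgeThreshold) {
    return (objectAge >= tenureAgeThreshold) ? CopyTarget::Tenure : CopyTarget::Survivor;
}
```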

  43.–45. Local Collection [Diagrams: successive animation steps of a scavenge across the Allocate, Survivor and Tenure spaces]

  46. Local Collection • Work is done in parallel • Must atomically install a forwarding pointer • Avoid contention when allocating memory for a flip or tenure • Can use a TLH-style system • Problem: can exceed the space available [Diagram: Allocate and Survivor spaces; wasted space?]
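A sketch of the forwarding-pointer race (hypothetical object layout and tag; not the real J9 header format): each GC thread copies the object into its own TLH-style destination and then tries to atomically install a forwarding pointer in the original; the loser abandons its copy, which is the wasted space in the diagram.

```cpp
#include <atomic>
#include <cstdint>

struct ForwardableObject {
    std::atomic<std::uintptr_t> header;         // class slot, or forwarding pointer + tag
    // ... object fields ...
};

constexpr std::uintptr_t FORWARDED_TAG = 0x1;   // low bit marks "already forwarded"

// Returns the address every thread must use for this object after the scavenge.
ForwardableObject *forwardObject(ForwardableObject *obj, ForwardableObject *myCopy) {
    std::uintptr_t oldHeader = obj->header.load();
    if ((oldHeader & FORWARDED_TAG) == 0) {
        std::uintptr_t forwarded =
            reinterpret_cast<std::uintptr_t>(myCopy) | FORWARDED_TAG;
        if (obj->header.compare_exchange_strong(oldHeader, forwarded))
            return myCopy;               // we won the race: our copy is the new object
        // CAS failed: oldHeader was reloaded and now holds the winner's forwarding pointer.
    }
    // Someone else forwarded it first: abandon myCopy (wasted space) and use theirs.
    return reinterpret_cast<ForwardableObject *>(oldHeader & ~FORWARDED_TAG);
}
```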

  47. Local Collection A remembered set is used to record objects in Old space that (potentially) contain references to New space • The set is grown by the mutator, never shrunk by it • The collector is responsible for shrinking it • Old objects appear only once in the remembered set
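A sketch of the generational write barrier that keeps such a set up to date (the address check, the per-object "remembered" flag and the names are all simplified placeholders): when a reference to a New-space object is stored into an Old-space object, the Old object is recorded once so a local collection can treat it as a root.

```cpp
#include <cstdint>
#include <vector>

struct GenerationalHeap {
    std::uint8_t *newBase = nullptr;           // New (nursery) address range
    std::uint8_t *newTop  = nullptr;
    std::vector<void *> rememberedSet;         // grown by mutators, pruned by the collector

    bool inNewSpace(void *p) const {
        std::uint8_t *a = static_cast<std::uint8_t *>(p);
        return a >= newBase && a < newTop;
    }

    // Conceptually called on every reference store: dstObject.field = value.
    // 'dstRemembered' stands in for a flag the real VM keeps per object; it is what
    // guarantees an old object appears only once in the set.
    void writeBarrier(void *dstObject, void *value, bool &dstRemembered) {
        if (!inNewSpace(dstObject) && value != nullptr && inNewSpace(value) && !dstRemembered) {
            dstRemembered = true;
            rememberedSet.push_back(dstObject);  // real VMs batch this (see slide 49's TLH-style trick)
        }
    }
};
```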

  48. Local Collection [Diagram: remembered set entries pointing at Old Space objects that reference New Space]

  49. Local Collection Relax contention on the Remembered Set with TLH-style allocation [Diagram: Old Space, New Space and the Remembered Set]

  50. Local Collection – The age debate How come you have more than one age group? Copying objects between spaces many times isn’t a waste of effort if the sum of the work is less than the cost of a garbage collect. • Cost to garbage collect • Cost to potentially compact • Cost in fragmentation of the old area (allocator) Remember: Not only does the scavenger offer short collection times, it acts as a form of incremental compactor (and that helps allocation times)
