Garbage Collection
Manual Memory Management
Before we go into how garbage collection works, it makes sense to look at why we even need it. Without garbage collection, the only automatic memory management we get is thanks to the call stack. Frames on the stack get created/destroyed automatically by most runtimes. However, data on the heap needs to be managed manually. This requires explicit allocation and freeing in various places in the code. This gets especially problematic when multithreading gets introduced, or when various parts of the code want to access the same piece of data and the ownership of said data is not clear.
More details on how manual memory management is done in C/C++ can be found here.
Hardware
A typical computer has the following layers of memory:
- CPU registers - closest to the ALU, access is within a single CPU cycle
- CPU Caches
- L1 - the fastest layer of cache (less than 2ns), lowest capacity
- L2 - slower than L1 (5ns), but with greater capacity
- L3 - slowest (15ns), but largest
- RAM (70ns)
- Disk (150_000ns)
L1 cache is typically split in half between instructions and data.
CPU Caches
CPU caching is a leaky abstraction. Normally, it should work without the developer even knowing about it. In most cases it does; however, without knowing some details of the cache implementation, we might end up with worse performance than we could have. Various ISAs implement special instructions that avoid going through the cache.
Data is transferred between RAM and CPU cache in Cache Lines. This is the minimal amount of data that is read or written from/to RAM. Its size is fixed and nowadays it’s usually something around 64 bytes.
It’s faster to access RAM sequentially, so fetching more data than requested shouldn’t cost that much. Still, 8 transfers are required (assuming 64-bit wide access per transfer). There are ways to speed this up from the executing code’s perspective. For example, the requested area of memory is read first, and then the rest needed to fill the cache line. User code can continue execution as soon as the critical part is read, and the rest can be read asynchronously.
The classic example of why Cache Line matters is presented below:
- Good Loop

```csharp
int n = 10_000, m = 10_000;
int[,] array = new int[n, m];
for (int i = 0; i < n; ++i)
{
    for (int j = 0; j < m; ++j)
    {
        array[i, j] = 10;
    }
}
```

- Bad Loop

```csharp
int n = 10_000, m = 10_000;
int[,] array = new int[n, m];
for (int i = 0; i < n; ++i)
{
    for (int j = 0; j < m; ++j)
    {
        array[j, i] = 10;
    }
}
```
The array is laid out in memory in such a way that a single row’s columns are placed one after another. Therefore, two neighbouring columns are much closer in memory than two neighbouring rows. Knowing about the cache line, we can understand that multiple columns can be read into the cache line via a single memory read access. Therefore, the cache line can be reused by iterating the array in the proper order. Switching the row index constantly is bad: we’ll get many more cache misses, and the cache line needs to be reloaded often.
Cores and Cache Coherence
L3 cache is shared between cores. Each core executes its own thread (actually 2 of them nowadays, due to SMT) concurrently. L1 and L2 levels are separate for each core (but SMT threads share them per core). The issue that comes up is cache coherence: when cache lines on different cores cover the same memory area, modifying data on one core requires synchronization with the other cores so that their cache lines stay up to date. A similar kind of problem arises as with cache line access, but the advice is the opposite this time: it’s desirable to work on different areas of memory on different cores. If one core has some area of memory in its cache line, it’d be best if no other core modifies that same area (otherwise we get so-called false sharing).
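To see the cost of this in practice, here is a minimal, hedged sketch (not a rigorous benchmark; class and field names are made up): two threads increment counters that sit on the same cache line, and then counters that are padded onto different cache lines.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class CacheLineContentionSketch
{
    // Two counters next to each other share one 64-byte cache line,
    // so writes from two cores keep invalidating each other's copy.
    static long[] shared = new long[2];

    // Indices 0 and 16 are 128 bytes apart, i.e. on different cache lines.
    static long[] padded = new long[32];

    static long Run(long[] counters, int indexA, int indexB)
    {
        var sw = Stopwatch.StartNew();
        Task.WaitAll(
            Task.Run(() => { for (int i = 0; i < 100_000_000; i++) counters[indexA]++; }),
            Task.Run(() => { for (int i = 0; i < 100_000_000; i++) counters[indexB]++; }));
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        Console.WriteLine($"same cache line:       {Run(shared, 0, 1)} ms");  // typically noticeably slower
        Console.WriteLine($"different cache lines: {Run(padded, 0, 16)} ms");
    }
}
```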
Prefetching
CPU Cache can be filled with data via prefetching. It can be invoked manually or via an automated process that discovers RAM access patterns. Prefetching is about loading some region of memory into the cache before it is actually needed. It’s not that simple though, because multiple processes are fighting for resources, and data prefetched by one process might get evicted by another process. Therefore, it’s crucial not to trigger prefetching too early.
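For reference, .NET exposes explicit prefetch hints through hardware intrinsics (on x86/x64, .NET Core 3.0+). The sketch below is illustrative only; the prefetch distance is a made-up tuning value and the code requires unsafe compilation:

```csharp
using System.Runtime.Intrinsics.X86;

static unsafe class PrefetchSketch
{
    static long Sum(int* data, int length)
    {
        const int distance = 16; // how far ahead to prefetch - illustrative value
        long sum = 0;
        for (int i = 0; i < length; i++)
        {
            // Hint the CPU to start pulling a future element into the cache.
            if (Sse.IsSupported && i + distance < length)
                Sse.Prefetch0(data + i + distance);

            sum += data[i];
        }
        return sum;
    }
}
```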
NUMA
Until now, we’ve assumed a simple scenario where access to RAM is uniform across all CPUs (or cores).
In today’s world of huge computing clusters, servers typically have multiple CPUs and CPUs have multiple cores. In such arrangements, it’s typical for memory regions to be designated for specific CPUs/cores. It’s much faster for a CPU to access “its” RAM region than to access a region that was not assigned to it. The CPUs are actually grouped so that multiple of them have one “local” RAM area. This architecture is called NUMA (Non-Uniform Memory Access). More and more consumer-grade workstations use NUMA as well.
Operating Systems and Runtimes (or programs in general) may be NUMA-aware and they will allocate memory “close” to the CPU group they’re running on.
CPUs may communicate with each other to access distant memory. This is much slower than accessing the local region of memory of a given CPU.
This video explains the concept quite well.
Operating System
Programs do not have access to the whole memory installed in a computer, or even to the real addresses of cells. Instead, the OS provides virtual memory. It’s transparent to the process; the process does not need to know about that abstraction. Thanks to that:
- different processes have different pools of memory cells
- memory can be constrained or extended above the RAM capacity (via a swap partition on Linux or a page file on Windows, for example).
Virtual Memory is a feature of CPUs (via the MMU, Memory Management Unit), provided in cooperation with the OS. Virtual Memory is mapped to physical memory in units of pages (typically 4 KB). Mapping every individual address would be bad for performance. The OS maintains page directories for each process. These allow mapping virtual addresses to physical ones.
If the pages were bigger, performance could be improved (because fewer virtual-to-physical address translations would have to occur). However, it’d be at the cost of higher memory use.
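As a small worked example (assuming 4 KB pages), the low 12 bits of a virtual address are the offset within the page, and the remaining bits select the page that has to be translated via the page directory/TLB:

```csharp
using System;

class PageMathSketch
{
    static void Main()
    {
        const ulong pageSize = 4096;                  // 4 KB = 2^12
        ulong virtualAddress = 0x7F3A_1234_5678UL;    // an arbitrary example address

        ulong pageNumber = virtualAddress / pageSize; // equivalent to >> 12
        ulong offset = virtualAddress % pageSize;     // equivalent to & 0xFFF

        Console.WriteLine($"page: 0x{pageNumber:X}, offset: 0x{offset:X}");
    }
}
```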
A page may also be shared between processes; it’s useful for shared libraries or other big blobs of data. Such pages are Shared Pages.
When allocating some memory, physical memory is not reserved directly. That happens only during the first access to the allocated memory (it’s called lazy allocation). For performance-critical scenarios, it might make sense to access the allocated memory with “fake” reads, just to make sure the memory is ready for the actual work. Also because of that, on Linux, a thread’s stack might not be a contiguous memory region.
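A hedged sketch of that idea (sometimes called pre-touching): the buffer size below is illustrative, and writes are used instead of reads because a write reliably forces a page to be backed by physical memory. The real page size can be queried via Environment.SystemPageSize.

```csharp
using System;

class PreTouchSketch
{
    static void Main()
    {
        int pageSize = Environment.SystemPageSize;   // typically 4096
        byte[] buffer = new byte[64 * 1024 * 1024];  // large buffer, pages may not be resident yet

        // Touch one byte per page so the page faults happen now,
        // not in the middle of the performance-critical work.
        for (int offset = 0; offset < buffer.Length; offset += pageSize)
            buffer[offset] = 1;

        // ... performance-critical work using buffer ...
    }
}
```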
The kernel allows modifying memory pages via:
- mmap - direct pages manipulation
- brk/sbrk - increases the heap size
Runtimes/allocators use those to manage memory.
Process Memory Categorization
- Virtual Memory - total size of the virtual address space reserved so far (VIRT in top)
- Resident Set Size Memory (RSS) - space of pages that currently reside in the physical memory. Some of them might be shared between processes (RES in top)
  - Private Resident Pages - all anonymous resident pages reserved for the process
  - Shared Resident Pages - both file-backed and anonymous resident pages of the process that are shared with other processes (SHR in top)
- Private Memory - all private pages of the process. It is the reserved memory, but part of it might not yet be actually allocated (become resident) due to lazy allocation (DATA in top)
- Swapped Memory - stored on the swap partition
When looking for an answer to “how much memory does this process consume?”, it’s best to look at the Private Resident Pages since they show the actual use of RAM by the process.
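From managed code, some of these numbers can be inspected via System.Diagnostics.Process; roughly, WorkingSet64 corresponds to the resident set (RES), VirtualMemorySize64 to the reserved virtual space (VIRT), and PrivateMemorySize64 to the private memory. Note that the working set also includes shared pages, so it overstates the process’s own RAM usage:

```csharp
using System;
using System.Diagnostics;

class ProcessMemorySketch
{
    static void Main()
    {
        using var process = Process.GetCurrentProcess();

        Console.WriteLine($"Virtual:  {process.VirtualMemorySize64 / (1024 * 1024)} MB"); // ~VIRT
        Console.WriteLine($"Resident: {process.WorkingSet64 / (1024 * 1024)} MB");        // ~RES
        Console.WriteLine($"Private:  {process.PrivateMemorySize64 / (1024 * 1024)} MB"); // private pages
    }
}
```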
A different perspective on memory mapping can be found in the Program Execution article.
Putting it all together, a typical memory read access would look as follows (without going too much into low-level details, and assuming CPU cache is not skipped):
1. The process requests a read of a specific address.
2. That address is a virtual address and gets translated to a physical one (using the page directory, MMU, and TLB (a kind of cache for address translations)). It is a part of some page that got reserved for this particular process.
3. The CPU is instructed to read the given physical address. It checks its cache. If the data is there, it may return it without going to RAM. Otherwise, point 4.
4. The CPU will read not only the requested address but rather the entire cache line (64 bytes), and store it in its cache. The requested bytes get returned to the process.
Automatic Memory Management
One of the first implementations of Garbage Collection was the one in LISP; it’s a pretty old concept.
There are some important terms that Garbage Collection introduces.
Mutator
This is the part that drives the program. It executes the code and mutates memory (since objects get allocated or destroyed, and references between objects get established). The name was coined by Edsger Dijkstra. The Mutator provides three operations to the application:
- New(amount) - allocates a specified amount of memory
- Write(address, value) - writes value at specified address
- Read(address) - reads value at specified address.
Mutator needs to interact with the other two parts of the system:
- Allocator
- Collector
The interaction happens during the three previously mentioned operations. E.g., during the New(amount) operation the Mutator would ask the Allocator to do the job.
The concrete implementation of the Mutator abstraction is an OS thread.
Allocator
The allocator provides two operations:
- Allocate(amount) - allocates some amount of memory on the heap.
- Deallocate(address) - frees the specified address, making it available for allocation again. In an automated memory management scenario this operation is not publicly available; it’s the role of the Collector to invoke it.
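To tie these terms together, here is a purely conceptual sketch (the interfaces and names are invented for illustration; no real runtime exposes memory management like this):

```csharp
using System;

// Conceptual sketch only - not how any real runtime is structured.
interface IAllocator
{
    IntPtr Allocate(int amount);       // Allocate(amount)
    void Deallocate(IntPtr address);   // invoked by the Collector, never by user code
}

interface ICollector
{
    void Collect();                    // finds dead objects and asks the Allocator to free them
}

// The Mutator is the running program: it allocates, reads and writes memory.
class Mutator
{
    private readonly IAllocator allocator;

    public Mutator(IAllocator allocator) => this.allocator = allocator;

    public IntPtr New(int amount) => allocator.Allocate(amount); // New(amount) delegates to the Allocator
    public void Write(IntPtr address, byte value) { /* store value at address */ }
    public byte Read(IntPtr address) { /* load value from address */ return 0; }
}
```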
Collector
It runs the actual garbage collection. Collectors are supposed to remove “dead” objects, the ones that are not going to be used anymore. Since this task is not really doable (the collector cannot predict the future), collectors instead rely on the reachability of objects. Once some object is unreachable (the code does not have any references to it), the collector can mark it for deletion. One could imagine a graph of objects, where edges between nodes represent references from one object to another. Nodes that are unreachable are the ones that can get removed.
The graph traversed by the collector needs to have roots. These are the starting points, and they are represented by:
- stack variables
- statically allocated objects
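In .NET, this reachability behaviour can be observed with a WeakReference, which does not keep its target alive (a hedged demo; forcing GC.Collect like this is only for illustration, and the second result is “typically” rather than “guaranteed” false):

```csharp
using System;

class ReachabilitySketch
{
    static void Main()
    {
        var strong = new object();               // reachable through a stack root
        var weak = new WeakReference(strong);    // weak references are not roots

        Console.WriteLine(weak.IsAlive);         // True - a root still points at the object

        strong = null;                           // drop the last strong reference
        GC.Collect();
        GC.WaitForPendingFinalizers();

        Console.WriteLine(weak.IsAlive);         // typically False - unreachable, so it was collected
    }
}
```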
Reference Counting
It’s a very popular way to manage memory. Basically, it boils down to keeping track of a count of references that every object has. E.g., an operation like obj2 = obj1 would increase the counter of the object that is now referenced by both obj1 and obj2. When the reference count goes down to 0 (e.g., both obj1 and obj2 going out of scope), it’d be a sign to the runtime that the object is unreachable, so it might be considered dead and ready for removal.
Reference Counting can be implemented at a library level; it does not have to be supplied by the runtime. An example of that is GObject for C. Also, some smart pointer implementations use reference counting as their backbone.
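A minimal, hedged sketch of what a library-level reference counter could look like (the type and its members are invented; real implementations also deal with destruction order, weak references, and more):

```csharp
using System;
using System.Threading;

sealed class RefCounted<T> where T : class
{
    private T resource;
    private int count = 1; // the creator holds the first reference

    public RefCounted(T resource) => this.resource = resource;

    public T Value => resource;

    public void AddRef() => Interlocked.Increment(ref count);

    public void Release()
    {
        // When the last reference is released, the resource is freed eagerly.
        if (Interlocked.Decrement(ref count) == 0)
        {
            (resource as IDisposable)?.Dispose();
            resource = null;
        }
    }
}
```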
Issues with reference counting:
- can be fooled by circular references
- might introduce significant overhead, especially if the reference counting happens on the main thread.
- difficult multithreading handling
On the plus side:
- memory is freed fast, as soon as the reference count goes to 0 (in typical implementations at least, because there are also deferred reference counters)
- possibility to implement without runtime support
Tracking Collector
Tracking Collectors work kind of separately from the main program thread. They monitor (track) the state of memory and act when they see fit. E.g., .NET uses a tracking collector, and in general it’s the most typical kind of GC.
Tracking Collector traverses the graph of objects from the mutator roots, which allows it to find true reachability. It’s not an easy task, since memory could grow rapidly, and also it’s being mutated all the time.
Tracking Garbage Collector works in two main phases:
- mark - GC determines which objects are needed and which aren’t
- collect - reclamation of memory
Mark Phase
GC traverses the object graph and marks all objects that it visits. At the end, it can delete those objects that were not marked - they weren’t reachable, so they can be removed.
This process is easier to do if the GC runs alone, with the user’s code paused and waiting for it to finish. This is problematic for the user code though, because the program will run slower. It’s called Stop the World. More advanced GCs run alongside the user code.
Collect Phase
This is the reclamation of memory from dead objects. There are a few ways to do that.
Sweep
In this approach, dead objects are just marked as free space. It’s easy and quick, but leads to fragmentation. The heap will have lots of small “holes” and eventually we’ll run out of big enough gaps to store new objects.
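A toy sketch of the mark phase combined with a sweep (the object model is invented; a real collector works on raw heap memory rather than a managed list of nodes):

```csharp
using System.Collections.Generic;

class GcObject
{
    public bool Marked;
    public List<GcObject> References = new List<GcObject>();
}

class MarkSweepSketch
{
    public void Collect(List<GcObject> heap, IEnumerable<GcObject> roots)
    {
        // Mark phase: everything reachable from the roots gets marked.
        foreach (var root in roots)
            Mark(root);

        // Sweep phase: unmarked objects are dead, their slots become free space;
        // survivors are unmarked again for the next cycle.
        heap.RemoveAll(obj => !obj.Marked);
        heap.ForEach(obj => obj.Marked = false);
    }

    private void Mark(GcObject obj)
    {
        if (obj == null || obj.Marked) return;
        obj.Marked = true;
        foreach (var child in obj.References)
            Mark(child);
    }
}
```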
Compact
After deleting objects, other objects get moved around to eliminate gaps. Obviously, it’s at the expense of performance.
One way of compacting is to just copy objects one by one into another area of memory, skipping those that were not marked as reachable. This requires quite a lot of space, because the old versions of objects are removed only after they get copied to new locations. Also, it puts a lot of pressure on memory due to lots of copy operations.
Another way is to move objects towards each other “in place”, filling the gaps between them. .NET uses such an approach.
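A toy sketch of such in-place (“sliding”) compaction over an array-modelled heap: live objects slide towards the start, and a forwarding table records where each one moved so that references can be fixed up afterwards (all names are illustrative):

```csharp
using System.Collections.Generic;

static class CompactionSketch
{
    // heap[i] == null represents a gap left by a dead object.
    public static Dictionary<int, int> Compact(object[] heap)
    {
        var forwarding = new Dictionary<int, int>(); // old slot -> new slot
        int next = 0;

        for (int i = 0; i < heap.Length; i++)
        {
            if (heap[i] == null) continue;  // skip gaps
            forwarding[i] = next;
            heap[next] = heap[i];
            if (next != i) heap[i] = null;  // the old slot becomes free space
            next++;
        }

        // The collector would now rewrite references using the forwarding table.
        return forwarding;
    }
}
```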
There are also hybrid approaches where these 2 or some different ways are combined. For example, some part of memory could be managed in one way, and the rest in another.
There are various types of tracking GCs. Some of them have been outlined below.
Conservative Garbage Collector
This type of GC is useful only when the runtime does not provide enough information for the GC to implement something like a Tracking GC. It basically scans the memory cells looking for any value that looks like a pointer. Such values usually stand out from “normal” values, because they look like addresses within the memory ranges in use. Once the GC learns of all such values, it assumes that these are live references to some other objects. It can then go ahead and free up those regions of memory that do not seem to be referenced by anyone. A Conservative GC cannot compact objects, because it doesn’t actually understand what is a pointer and what is not. Compaction would require existing pointers to be updated to new addresses.
This solution is very inefficient and definitely not ideal.
Precise Garbage Collector
Such a GC has full information about objects’ memory layout. It can efficiently traverse the object graph, marking the visited objects, as described above. .NET uses a Precise GC.