Memory Management
.NET uses Garbage Collection to free memory held by unused objects, that is, objects that are no longer reachable from the application roots.
Types of Data
Value Types
Benefits of value types:
- less memory - data is stored directly, with no pointer to follow (and without the header and Method Table pointer that every reference type instance carries)
- faster access to fields because of direct access
- avoids allocations (unless boxed)
- no inheritance, so method calls are non-virtual and there's no devirtualization overhead
- stored on the stack or register (or heap if boxed or part of something else on a heap)
Boxing
Every value type has a corresponding boxed form on the heap. E.g., boxing an int produces a heap object of type System.Int32.
When boxing, the runtime first creates a new heap instance (as usual, with header and Method Table pointer), and then copies the data from the value type into it.
Unboxing is cheaper; it's just a copy of data from the heap to the stack.
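A minimal sketch of boxing and unboxing in code:

int number = 42;           // value type - stored directly, no heap allocation
object boxed = number;     // boxing: new heap object (header + MT pointer) + data copy
int unboxed = (int)boxed;  // unboxing: type check + copy from heap back to stack/register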
Reference Types
They are composed of two parts:
- reference - an address to data. The reference itself is copied by value, just like a value type.
- data - memory region pointed to by reference.
A reference is like a “classic” pointer, with some runtime safety on top. Reference types are beneficial when particular data outlives a stack frame - that's when heap allocation is actually useful.
.NET runtime can infer that a given instance of a reference type does not “escape” the stack frame and may place it on the stack, like a value type instance! However, it will be laid out exactly as it would be on the heap - with Method Table pointer and header.
Reference Type Structure
Each object on the heap has:
- header - stores any metadata that might be attached to a given object, e.g., a lock on the object, a cached GetHashCode value, some GC info.
- Method Table pointer - an address of the Method Table entry describing the object's type. It contains information such as a listing of all methods, data for reflection, and more. A reference to any object actually points at the Method Table pointer!

┌──────────────┬──────────────┬──────────────┬──────────────┬──────────────┐
│    Header    │ Method Table │     Data     │     Data     │     Data     │
│              │   Pointer    │              │              │              │ ...
└──────────────┴──────────────┴──────────────┴──────────────┴──────────────┘

Therefore, we sometimes say that the Header lies at a negative index, because to get to it, you have to subtract from the reference's pointer.
- Data - the actual values (or references) that are part of the class. Even for empty classes (without fields) there will be one slot (8 B), because GC expects it.
The smallest object will take 24 bytes on a 64-bit system (8 + 8 + 8, as listed below) or 12 bytes on a 32-bit system (4 + 4 + 4):
- 8 bytes for a Header
- 8 bytes for the Method Table pointer
- 8 bytes for Data
ref
For value types, this keyword allows us to pass them by reference (the same way reference types behave by default).
For reference types, it acts like a pointer to a pointer: we can change the object that the passed reference variable points to!
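A sketch of both behaviors (MyObject is a hypothetical class used for illustration):

int counter = 0;
Increment(ref counter);      // counter == 1 - the method modified our local directly

var original = new MyObject();
Replace(ref original);       // original now points at a different instance

void Increment(ref int x) => x++;                        // value type passed by reference

void Replace(ref MyObject obj) => obj = new MyObject();  // "pointer to a pointer" behavior

class MyObject { public int Value; }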
null
It’s a reference that points at address 0. When we try to access that memory, the OS raises an access violation (the CLR catches it and rethrows it as a NullReferenceException). The whole first page of virtual memory is an illegal access space.
Data Locality
Structs are better in data locality. An array of structs has each item one next to the other. For classes, only references to items are next to each other, but accessing consecutive items requires:
- Moving to next reference
- Following the reference’s pointer to get to the actual item
When loading an array into a cacheline, in the struct case we load the actual data. In the class case, we load references to items into the cacheline; each item can be in a completely different place in memory. The program will be slower due to more cacheline loads (plus dereferencing).
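A sketch that illustrates the difference (illustrative types; actual numbers depend on hardware and data size):

// Structs: all X values live contiguously in one block of memory.
var points = new PointStruct[1_000_000];
long sum = 0;
for (int i = 0; i < points.Length; i++)
    sum += points[i].X;     // sequential reads, cacheline-friendly

// Classes: the array holds only references; each element is a separate
// heap object, so each access may jump to a different memory location.
var boxes = new PointClass[1_000_000];
for (int i = 0; i < boxes.Length; i++)
    boxes[i] = new PointClass();
for (int i = 0; i < boxes.Length; i++)
    sum += boxes[i].X;      // extra dereference per element, more cache misses

struct PointStruct { public int X, Y; }
class PointClass   { public int X, Y; }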
Garbage Collector
High Level Overview
When garbage collection gets triggered, the following occurs (assumes Workstation, non-concurrent):
- Suspend all managed threads.
- The thread that triggered the GC starts to execute GC code.
- GC decides which generation to condemn (the condemned generation + all generations below will be collected)
- Mark Phase
- Plan Phase
- Collection (Sweep or Compaction)
- Resume managed threads.
Regions
Memory is not stored in one huge bag. Instead, GC has a few regions and objects fall into them based on:
- their size
- their lifetime
Various GCs choose different parameters for memory partitioning, like object mutability, type, and others. .NET uses mostly just size and lifetime. Regarding size, the GC heap is split into 2 regions:
- SOH - Small Object Heap - objects smaller than 85,000 bytes
- LOH - Large Object Heap - objects of 85,000 bytes or more
The 85,000-byte threshold can be increased via configuration.
SOH uses compaction, because moving small objects around is not that hard/expensive. LOH uses sweep collection by default; however, we can invoke LOH compaction as well. The 85,000-byte threshold is computed from the shallow object size, so LOH mostly contains large arrays and strings. An object with an array field (even a huge one) stores just a reference to that array, so the array's size is not counted into the object's size by GC.
Typically, SOH contains many more objects than LOH, because 85,000 bytes is a rather big value. LOH gets collected only when Gen 2 is collected.
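A quick way to observe the threshold (GC.GetGeneration reports LOH objects as generation 2, since LOH is collected together with Gen 2):

var small = new byte[84_000];   // total object size below 85,000 bytes -> SOH
var large = new byte[85_000];   // at/above the threshold -> LOH

Console.WriteLine(GC.GetGeneration(small));  // 0 - freshly allocated on SOH
Console.WriteLine(GC.GetGeneration(large));  // 2 - LOH reports as Gen 2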
Generations
SOH has lots of objects normally, so it’s split further by lifetime of these objects, into generations.
- SOH
- Gen 0 - for ephemeral objects, like objects allocated in some method that are no longer referenced after that method’s execution is over. Gen 0 collection is the cheapest; objects basically land on top of the heap and get cleaned up at the next Gen 0 collection.
- Gen 1 - also for ephemeral objects. Objects that go untouched after at least one Gen 0 pass are Gen 1. This could be an object being assigned to some property of a class instance.
- Gen 2 - for long-lived objects that have stayed alive after a few Gen 1 passes. It could be a property of a static class, or some other objects that live through the whole lifetime of an app.
- LOH (Large Object Heap) - objects of 85,000 bytes or more.
- POH - Pinned Object Heap, a place for objects that must not be moved (so never compacted), for various reasons, but mostly because they are exposed to unmanaged code.
Moving up the generations, the cost of GC becomes higher (Gen 0 is the cheapest). Therefore, whenever possible, it’s good to keep our objects in the Gen0/1 range, or, even better, use value types.
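A sketch of promotion in action (explicit GC.Collect calls are for demonstration only):

var obj = new object();
Console.WriteLine(GC.GetGeneration(obj));   // 0 - fresh allocation

GC.Collect();                               // obj survives -> promoted
Console.WriteLine(GC.GetGeneration(obj));   // 1

GC.Collect();                               // survives again -> promoted
Console.WriteLine(GC.GetGeneration(obj));   // 2 - long-lived from now on

GC.KeepAlive(obj);                          // keep obj rooted through the calls above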
Layout
In .NET 7+, memory is split into Regions. In earlier versions, it was split into Segments. Each SOH generation has its own region. Also LOH, POH and NGCH have their own regions.
During runtime startup the following occurs:
- Runtime tries to allocate continuous area of memory (if possible).
- Runtime creates regions for each part of managed memory. POH and LOH regions are 8 times larger than a SOH generation region. (NGCH is created later, on demand, in another part of the address space.)
- One page of each region is committed.
In Computer Science, the following observations have been made:
- weak generational hypothesis - most young objects live short
- the longer the object lives, the more likely it is to continue living
Therefore, it makes sense to collect young objects more frequently than older objects. Objects change generations (upward) via promotion. Different generations can be handled in various ways. For example, there could be separate memory regions for different generations; any time a promotion occurs, an object would have to be copied to a different area of memory, which would be quite expensive. Another approach is to have one memory region, but with dynamic thresholds that separate generations. Each time promotion occurs, objects that are close to each other are promoted together (because they are placed in memory “chronologically”, one next to the other). The threshold is moved “to the right” while objects stay in place. This simple description assumes no compaction, just sweep.
.NET uses a kind of the second approach that I described. There are generations 0, 1, and 2. Each collection survival is a promotion. There are exceptions to that; sometimes promotion does not occur (it’s called a demotion then, although objects stay in the same generation, they never go to a younger generation).
To find memory leaks, it’s best to look at Gen 2 size. Objects that stay in memory for too long will eventually go to Gen 2 and its size will keep on increasing. An ever increasing size of Gen 2 is a sign of a potential problem.
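A minimal way to watch Gen 2 from inside the app; GC.GetGCMemoryInfo is available since .NET Core 3.0, the per-generation breakdown since .NET 5:

var info = GC.GetGCMemoryInfo();
Console.WriteLine($"Total heap size:   {info.HeapSizeBytes} B");
Console.WriteLine($"Gen 2 size:        {info.GenerationInfo[2].SizeAfterBytes} B");
Console.WriteLine($"Gen 2 collections: {GC.CollectionCount(2)}");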
Control Values
GC has various config values that allow it to decide on various options, like whether to use sweep or compacting collection, or even how often to trigger collections.
Allocation Budget
Allocation Budget is the amount of memory that may be allocated before a GC of a given generation occurs. Each generation has its own setting for it. For gen 1 and 2, promotion from lower gens is treated the same as allocation (there’s no way to allocate directly in gen 1 or 2, so promotion is the only way to “allocate” there).
Fragmentation is the total size occupied by “free objects”.
Allocation Budget changes between GCs. In general, the more objects survive a collection, the higher the new allocation budget. The reasoning is: if in a given generation not much had to be cleaned up, then we should probably wait longer until more unreachable objects accumulate.
GC Triggers
Garbage Collection may be triggered in various ways:
- when there’s not enough space in some generation for a new allocation. It’s related to the allocation budget as well.
- explicit call to GC.Collect(…).
- low memory notification - (only on Windows OS) - OS might inform apps that memory is low. Apps might ignore it or do something about it. .NET triggers collection.
- various internal events
- explicit call to GC.TryStartNoGCRegion - this method actually allows us to NOT run GCs in a given region of code execution. However, by default, when we run it, the runtime triggers a GC just before starting to execute the no-GC code region. That can be disabled though (see the sketch below).
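A minimal sketch of the last trigger above; RunLatencyCriticalWork is a placeholder:

// Ask the runtime to guarantee no GC while the critical section runs.
// The argument is the total allocation we promise not to exceed in that window.
if (GC.TryStartNoGCRegion(16 * 1024 * 1024))
{
    try
    {
        RunLatencyCriticalWork();   // placeholder for latency-sensitive code
    }
    finally
    {
        GC.EndNoGCRegion();         // re-enable normal collections
    }
}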
When GC on a given generation gets triggered in any of the ways above, the runtime might decide to also collect a higher generation. This might occur when:
- some higher gen has exceeded its allocation budget
- a long time has passed since some higher gen was last collected. For gen 1 it’s 10 seconds, for gen 2 it’s 100 seconds. This happens only in Workstation mode.
- low Card Table efficiency
- big fragmentation
- before OutOfMemoryException
Stop the World
When GC pauses all managed threads to do its job, it’s called “Stop the World” sometimes. It’s also called “Execution Engine (EE) Suspension”.
Managed threads are stopped, because GC can do its thing only when it’s safe to do so. A given thread may be in one of two states:
- cooperative - the runtime must cooperate with the thread to get it to a safe point for it to become preemptive
- preemptive - the thread is executing code that does not touch any GC-managed references (like non-managed code).
EE Suspension forces all threads into the preemptive state. When a thread reaches the preemptive state, its state is stored in the GC Info structure. Thanks to it, when resume occurs, each thread continues right where it stopped.
Safe points occur during calls to other methods. There is actually a special GC poll method that forces a safe point to occur. It’s used in various strategic places in runtime code, but it might also get placed in app code by the JIT, especially in loops with a long/unknown iteration count. The JIT might also emit code that is “fully interruptible” - in such code, not only method calls are safe, but every instruction is. That’s pretty expensive though, because every instruction needs to have GC Info.
Mark Phase
As described in the Garbage Collection article, a part of GC is the Mark Phase. This is when GC looks for objects that are still alive. The Mark Phase cannot just traverse the whole graph of objects, because that would invalidate the whole idea of generations. Younger generations should be collected more often than older generations, as we established before. Traversing the whole graph would take us through a mix of generations. Objects from younger generations might point to older-generation objects, and vice versa. If we marked only objects that are reachable from young generations, we would end up deleting objects that are still needed (but are pointed to only from older-generation objects).
So, in addition to traversing from young objects, GC keeps a RememberedSet. The RememberedSet contains references from older to younger generations. When collecting a younger generation, both the traversal of the young generation and the RememberedSet are checked.
For young-to-old references, a similar issue would occur. Traversing objects in gen 1 would not mark
objects that are referenced only from gen 0. Another RememberedSet would be needed. It’s
non-trivial to have it, so instead the following assumption has been agreed on:
Collection in gen N collects also all previous generations! So, for example, a Gen 2 collection would also collect 0 and 1.
The following collections are possible then:
- only 0
- 0 and 1
- 0, 1, 2, LOH (full GC)
RememberedSet is updated any time some older-generation object gets a reference to a younger-generation object. It’d be a huge overhead to check that condition on every write, so it’s heavily optimized and emitted only when needed by the JIT compiler. This check is done in the Write Barrier of the Mutator.
RememberedSet is implemented in a bit more convoluted way than just as a collection of data.
It’s implemented with the help of Card Tables. That’s because managing references
per every single older generation object would be an overkill in apps with lots of objects.
Instead, the memory of the older generation is divided into 256-byte-wide regions (on a 64-bit runtime), and every region might contain some objects (0 or more).
Anytime a reference is written into some object in older generation, its region (card) is marked
dirty. When collection of younger generation occurs, both traversals in young gen and dirty cards
are checked to find all reachable objects. One card is 1 bit (dirty/clean) in memory.
Management of Card Table happens during Write Barrier. There are a few implementations of Write Barrier and runtime will use any of them depending on various conditions.
With the Card Table technique, GC tracks groups of objects instead of tracking them individually. Still, there will be lots of cards in apps that use lots of memory. Therefore, the runtime does not check cards individually. Instead, it creates words (and even bundles) of cards. Each bit in a bundle represents 1024 cards. So, if a given bundle bit is clean, we can skip 1024 × 256 bytes of memory from reference scanning! It saves a lot of time, which is important, because young-generation collections must be fast.
When the Mark Phase occurs, GC already “knows” which gens will be collected; otherwise it’d have to go back to the Mark Phase to check some potential higher gen.
Implementation Details
Objects get marked (as reachable) by modifying the Method Table pointer. The space for the pointer is actually bigger than necessary and the two lowest bits are unused, so it’s safe to modify 1 bit of it. The modified bit will be reset during the Plan Phase (to prepare it for a potential next Mark Phase).
Roots
Object marking starts from the roots.
Local Variables Roots
Local variables are one of the most common roots. Some of them are short-lived (like variables within some short method), others long-lived (like variables created in Main(...)). Any time we create some object in some method, we create a new Local Variable Root for the Mark Phase.
Lifetime of local variables is controlled by GC Info struct. It contains a listing of all local variables that should stay alive at a given instruction.
Scope
Intuitively, we might think that a given method’s variables are “safe” from being collected while it’s executing. However, the runtime might collect objects pointed to by local variables as soon as they stop being used.
Example:
public MyObject DoSomething()
{
    var myObj = new MyObject();
    myObj.Value = 38;

    DoSomethingElse();
    // myObj is not needed anymore, it might get GC'ed at this point!

    var anotherObj = new MyObject();
    return anotherObj;
}

myObj might be collected at the “safe point” after its last use. That safe point is the call to DoSomethingElse(). Assuming the typical case of method calls being safe points, that’s where GC might occur.
Such behavior is called Eager Root Collection.
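When eager collection is a problem (e.g., unmanaged code still uses the object), GC.KeepAlive extends the root; a sketch with hypothetical names:

public void DoSomething()
{
    var resource = new MyNativeResource();   // hypothetical type with a finalizer
    PassPointerToNativeCode(resource);       // hypothetical interop call

    // ... unmanaged code may still be using the resource here ...

    GC.KeepAlive(resource);   // resource counts as live until this call, not until last use
}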
Cross-Generation References Roots
Another kind of root is references from older to younger generation(s). While marking, GC goes through the previously described RememberedSet and Card Tables.
GC Handle Roots
Strong and Pinned handles are another kind of root. They are allocated explicitly via GCHandle.Alloc(...).
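A sketch of allocating a Strong and a Pinned handle (don’t forget to free them):

using System.Runtime.InteropServices;

var data = new byte[256];

GCHandle strong = GCHandle.Alloc(data);                      // strong root: keeps data alive
GCHandle pinned = GCHandle.Alloc(data, GCHandleType.Pinned); // root + prevents relocation

IntPtr address = pinned.AddrOfPinnedObject();  // stable address, safe for unmanaged code

pinned.Free();
strong.Free();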
Plan Phase
It figures out whether GC should compact or if sweep is good enough. While doing so, it actually simulates the compaction scenario to see if it would yield big enough benefits. The results of that simulation are not wasted, both compaction and sweep approaches use the results to do their job.
Simplifying a bit, planner looks at managed heap(s) and marks groups of objects as either plugs or gaps:
- plug - a grouping of neighbouring objects that should be retained after GC
- gap - a grouping of neighbouring objects that should get collected
Then, it simulates the needed move operations to relocate plugs into free spots left by gaps (or other holes that are already there). It takes into account pinned objects, which have to stay in place.
A final result creates a pattern of PGPGPG…:

┌──────────────┬────────┬──────────────┬──────┬─────────────┬──────────┬───────┐
│    Gap 1     │ Plug 1 │    Gap 2     │Plug 2│    Gap 3    │  Plug 3  │ Gap 4 │
└──────────────┴────────┴──────────────┴──────┴─────────────┴──────────┴───────┘

Each plug has its own plan of where it’d be moved. The plans are stored within the gaps that precede a given plug! It’s efficient, because no additional memory gets allocated. Additionally, it optimizes CPU cache use, since the runtime operates on the same part of memory while working on plugs/gaps.
A plan for one plug takes 24 bytes of data. It consists of:
- gap size
- plug relocation offset
- left and right child offsets (for Binary Search Tree)
Plans are organized into a Binary Search Tree; that’s why each plan contains offsets to get to the left and right children of a given “plan node”. In practice, there are many plug trees - one global tree would be too big for efficient traversal. Each tree covers 4 kB of memory, and each such region is called a Brick. Bricks are put into the Brick Table with all plug trees (SOH-only).
Pinned Objects
Pinned Objects introduce complications in the compaction planning. Here are just some of the problematic scenarios:
- a pinned plug might come right before some “normal” plug - there’s no gap to store the plan in! As a compromise, the pinned plug gets extended to also cover the non-pinned plug.
- a pinned plug might come right after some “normal” plug - the plan of the pinned plug is actually stored onto the plug before it, overwriting some (24) bytes there. Overwritten data gets stored in a remote location - a pinned plug queue, to be restored before thread(s) get resumed.
It’s worth mentioning that pinned plugs’ plans contain a relocation offset of 0, for obvious reasons.
Demotion
We could imagine another case with pinned objects. After compaction, we could end up with a situation like this:

|                   GENERATION 0                   |
|                                                  |
+---+-------+----+----------------------+----------+
|###|///////|$$$$|                      |**********|
|###|///////|$$$$|                      |**********|
+---+-------+----+----------------------+----------+
|                                                  |
| #, /, $ - various compacted objects              |
| * - pinned object                                |

Generation 0 got collected, with compaction. Normally, the objects would now all get promoted to Generation 1. However, there’s a lot of empty space, which would make Generation 1 fragmented.
Therefore, in cases like that, a Demotion might occur.

|  GENERATION 1  |          GENERATION 0           |
|                |                                 |
+---+-------+----+----------------------+----------+
|###|///////|$$$$|                      |**********|
|###|///////|$$$$|                      |**********|
+---+-------+----+----------------------+----------+

LOH and POH
The described Brick Table approach is applied only to SOH. In LOH it’s simplified. First of all, LOH is almost never compacted. By design, LOH objects are huge. Moving them around is an expensive operation. However, when compaction does occur, the following facts are worth noting:
- each object becomes a separate plug - objects are big enough to be considered individually
- every object in LOH has 24 bytes of space prior to it - a space for plug plan
LOH compaction will occur when:
- GC is caused by the risk of OOM
- it’s called explicitly by user code
- not enough memory for allocation is available
- fragmentation is high
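Invoking LOH compaction explicitly from user code looks like this; it takes effect during the next blocking full collection, then the setting resets:

using System.Runtime;

GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();   // the next full, blocking GC compacts the LOH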
Sweep and Compact Phase
This phase is the heaviest part of the GC cycle.
During Sweep, gaps are turned into Free Objects. Any regions that become empty, get deleted. For POH and LOH, objects are scanned directly, without going through Brick Table.
During Compaction, objects are moved around, and references to them get updated.
Allocation
In non-managed environments (like C/C++), creating an object goes directly to the OS to get memory. In .NET, there’s the runtime in between. The runtime allocates a big chunk of memory from the OS before the app even needs it, which speeds things up when the memory is actually needed. This is one of the “tricks” that managed environments use to compete with non-managed ones.
.NET has two ways of allocating memory:
- bump pointer allocation (faster)
- free-list
When allocating new objects, it happens on:
- SOH - only Gen 0
- LOH
- POH
Bump Pointer
   Objects        Zeroed memory                  Non-reserved
┌────────────┬────────────────┬─────────────┬────────────────
│████████████│                │░░░░░░░░░░░░░│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓.
│████████████│                │░░░░░░░░░░░░░│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓.
│████████████│                │░░░░░░░░░░░░░│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓.
└────────────┴────────────────┴─────────────┴────────────────
└──────────Committed──────────┘
             ↑ Allocation pointer

When an object needs to be allocated, the allocation pointer gets moved “to the right”, and the space on the left is the newly allocated space. The runtime zeroes memory even before the allocation to speed it up; if some memory is not zeroed yet, it happens ad hoc. The whole committed space that is ready for allocations is called the allocation context. When we need to grow the allocation context, we grow it by the allocation quantum (8 kB on SOH).
Each managed thread has its own allocation context. Thanks to that, creating objects does not require synchronization (and is faster). These multiple allocation contexts might live in one region. When a sweeping collection runs, we end up with lots of holes of free space before the allocation pointer(s). Compaction would resolve that, but it’s expensive. Instead, .NET creates allocation contexts in those holes! Compaction still occurs, but less often.
Free-list
In the free-list approach, we keep track of free spaces in memory, organized into buckets. Each bucket represents a size threshold (“less than or equal to x bytes”), and each “hole” in memory is assigned to some bucket. When allocating an object, we find the first bucket with spaces big enough to fit it.
The free spaces managed by the free-list are actually a kind of object themselves: they have a Method Table pointer that points to the “free object” type entry.
As mentioned, allocations for new objects happen only on Gen 0, LOH and POH. However, we also need to manage promotions between generations. It’s very similar to allocation of new objects though, so Gen 1 and Gen 2 have their own free-list buckets as well. Gen 0 and Gen 1 of SOH have each just one bucket; Gen 2 has 12 of them; LOH has 7; POH has 19.
SOH uses the bump pointer technique first and falls back to the free-list if that fails. If there’s not enough memory, GC gets invoked. If there’s still not enough memory, an OutOfMemoryException gets thrown.
LOH and POH use only free-list (GC and exception also occur).
Pinning
Pinning an object means fixing its location and disallowing the runtime from moving it elsewhere. Pinned objects fragment the heap, because they cannot be compacted. Pinning is useful mostly for P/Invoke scenarios where some unmanaged code operates on pointers.
There is also the POH (since .NET 5). We can create arrays there (only with element types that contain no references, which simplifies GC by letting it skip traversal of the POH during the Mark Phase).
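Allocating directly on the POH (since .NET 5); the element type must contain no references:

// The array lands on the Pinned Object Heap and will never be moved by GC.
byte[] buffer = GC.AllocateArray<byte>(4096, pinned: true);

// Variant that skips zeroing when the old contents don't matter:
byte[] uninit = GC.AllocateUninitializedArray<byte>(4096, pinned: true);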
GC Modes
.NET garbage collector may operate in one of two modes.
Workstation Mode
Gen 0 and Gen 1 are collected on the foreground thread, pausing user code. It’s blocking, but relatively cheap (since it’s Gen 0/1). Collections happen more often, but are shorter (because the more often you GC, the less memory there is to collect). There’s just one managed heap.
It’s good for desktop apps where GCs should be short to avoid blocking the UI. However, in cases where an app relies heavily on parallelism, Server Mode might be a better fit.
Server Mode
It is ideal for server-side apps. Each CPU core gets its own heap, and each core has its own GC thread running in the background. It’s non-blocking, but uses more memory (30-40% more) and CPU. It can be enabled via csproj, as shown below. It can give a huge performance boost in scenarios where throughput is really important. Collections happen less often, so they can take more time, resulting in longer pauses. However, considering that GC runs in parallel on a few threads, a pause might actually be shorter than in Workstation mode.
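The relevant csproj properties (ConcurrentGarbageCollection controls the submode described in the next section):

<PropertyGroup>
  <ServerGarbageCollection>true</ServerGarbageCollection>
  <ConcurrentGarbageCollection>true</ConcurrentGarbageCollection>
</PropertyGroup>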
Submodes
Each of the two modes above can also be configured to work in one of the two submodes below. Therefore, we could say that there are 4 ways of GC operation in .NET.
- Non-concurrent - whole GC operation happens while app threads are paused.
- Concurrent - part of GC operations happen concurrently with app threads, in dedicated GC thread. Compaction is never concurrent though.
The Concurrent submode is sometimes called Background GC. So, for example, we could be using “Background Server Mode”.
While Background GC is executing, a Foreground GC might still occur! It takes precedence over Background GC, pausing it.
The simplest GC mode of operation is Non-concurrent Workstation mode.
Collection
There are two main ways of how memory gets collected:
- Sweep - objects that are considered dead are removed. The “holes” created by that are potential places for new objects to be allocated in the future. However, when lots of small holes stay in memory, it’s difficult to allocate anything there. This leads to fragmentation: even though we might have lots of free space, it’s impossible to fit objects into it. It’s fast though.
- Compacting - live objects are moved together, removing any holes in between them. It’s slower, because moving objects requires updating all references to them. However, it leads to better memory usage and less fragmentation.
Decision whether to use one or the other is made during the Plan Phase.
Ideally, Sweep + Compaction would result in the best memory utilization.
SOH may use Sweep or Compaction; GC has various heuristics to decide which one to use in a given case. We might also manually force either of them.
LOH uses Sweep by default, but it will also Compact when memory is low. Manually, both can be invoked. POH uses only Sweep, since, by design, Compaction would invalidate pointers to POH objects.
Algorithm
Collection is triggered on a given generation n. When a generation n GC is invoked, generation n and all generations below get collected.
GC first marks reachable objects, then it decides whether it will use Sweep or Compaction.
After collection of gen 0, it will be empty, since objects will either get removed or promoted to higher generation.
Higher generations get scanned using Card Tables to find references to lower generations.
Regions
With the Regions layout, each generation has its own “region” of memory. To promote a whole generation, a given region simply becomes a higher-generation region, and a new region gets created for the collected generation. That’s Sweep though; with Compaction, objects might be moved around, and a given region might stay in the same generation.
Segments
When Segments are used, the simple act of moving the generation boundary pointer promotes a given set of objects to a higher gen. Some objects might also be moved to fill holes in a higher generation (though that would happen only during Compaction).
Tips
Pooling Objects
When dealing with lots of potentially expensive object creations, it can be a better idea to reuse pooled objects instead. That way, GC has less work to do, since there are fewer objects to deallocate. Microsoft provides ObjectPool for this (and ArrayPool specifically for arrays).
Good candidates for pooling are:
- StringBuilder
- byte array
So, instead of creating an object every time, you create a pool of them (like a hundred, or a thousand, depending on your needs), and request/return objects from/to the pool during the lifetime of the app.
Object Pool is also useful if object creation is expensive, since you can create a pool of reusable objects on startup and just use them afterwards.
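A minimal ArrayPool sketch (from System.Buffers):

using System.Buffers;

byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);  // may return a larger array than asked
try
{
    // ... use buffer (only the first 4096 bytes we asked for) ...
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);  // return it instead of letting GC collect it
}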
Empty Collections
Whenever an empty collection is needed, do not create a new object; use Array.Empty<T> instead, which skips the unnecessary allocation.
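For example:

public int[] GetNumbersBad()  => new int[0];         // allocates a useless object every call
public int[] GetNumbersGood() => Array.Empty<int>(); // returns one cached, shared instance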
stackalloc
stackalloc lets us allocate a buffer on the stack instead of the heap:

Span<byte> array = stackalloc byte[500];

A few facts to keep in mind:
- can’t be done in async methods
- shouldn’t be used to allocate too much data, otherwise stack overflow will occur
- it’s typically faster than allocating objects on the heap
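A small sketch of the safe Span-based form (Random.Shared requires .NET 6+):

static void FillAndUse()
{
    // No 'unsafe' context needed: the stack memory is wrapped in a Span<byte>.
    Span<byte> buffer = stackalloc byte[128];
    Random.Shared.NextBytes(buffer);   // use it like any other buffer

    // buffer disappears when the method returns - no GC involved at all
}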