sharded_slab

Module implementation


Notes on sharded-slab’s implementation and design.

§Design

The sharded slab’s design is strongly inspired by the ideas presented by Leijen, Zorn, and de Moura in Mimalloc: Free List Sharding in Action. In this report, the authors present a novel design for a memory allocator based on a concept of free list sharding.

Memory allocators must keep track of what memory regions are not currently allocated (“free”) in order to provide them to future allocation requests. The term free list refers to a technique for performing this bookkeeping, where each free block stores a pointer to the next free block, forming a linked list. The memory allocator keeps a pointer to the most recently freed block, the head of the free list. To allocate more memory, the allocator pops from the free list by setting the head pointer to the next free block of the current head block, and returning the previous head. To deallocate a block, the block is pushed to the free list by setting its first word to the current head pointer, and the head pointer is set to point to the deallocated block. Most implementations of slab allocators backed by arrays or vectors use a similar technique, where pointers are replaced by indices into the backing array.
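
The index-based variant described above can be sketched roughly as follows. This is a minimal single-threaded illustration, not the crate's code: the names `SimpleSlab` and `Slot` are hypothetical, and a free slot stores the index of the next free slot in place of a pointer.

```rust
/// A slot either holds a value or is free, in which case it stores the
/// index of the next free slot (the "pointer" of the free list).
enum Slot<T> {
    Occupied(T),
    Free { next: Option<usize> },
}

/// Hypothetical single-threaded slab backed by a vector.
pub struct SimpleSlab<T> {
    slots: Vec<Slot<T>>,
    /// Index of the most recently freed slot: the head of the free list.
    head: Option<usize>,
}

impl<T> SimpleSlab<T> {
    pub fn new() -> Self {
        SimpleSlab { slots: Vec::new(), head: None }
    }

    /// Pops the head of the free list, or grows the vector if the list is empty.
    pub fn insert(&mut self, value: T) -> usize {
        match self.head {
            Some(idx) => {
                // The new head is the `next` index stored in the old head.
                if let Slot::Free { next } = &self.slots[idx] {
                    self.head = *next;
                }
                self.slots[idx] = Slot::Occupied(value);
                idx
            }
            None => {
                self.slots.push(Slot::Occupied(value));
                self.slots.len() - 1
            }
        }
    }

    /// Pushes the slot onto the free list and returns the value it held.
    pub fn remove(&mut self, idx: usize) -> Option<T> {
        match self.slots.get(idx) {
            Some(Slot::Occupied(_)) => {}
            _ => return None, // out of bounds or already free
        }
        // The freed slot records the current head, then becomes the head.
        let old = std::mem::replace(&mut self.slots[idx], Slot::Free { next: self.head });
        self.head = Some(idx);
        match old {
            Slot::Occupied(value) => Some(value),
            Slot::Free { .. } => None,
        }
    }
}
```

Because the most recently freed slot is always the head, a removal followed by an insertion reuses the same index.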

When allocations and deallocations can occur concurrently across threads, they must synchronize accesses to the free list, either by putting the entire allocator state inside of a lock, or by using atomic operations to treat the free list as a lock-free structure (such as a Treiber stack). In both cases, there is a significant performance cost: even when the free list is lock-free, it is likely that a noticeable amount of time will be spent in compare-and-swap loops. Ideally, the global synchronization point created by the single global free list could be avoided as much as possible.

The approach presented by Leijen, Zorn, and de Moura is to introduce sharding and thus increase the granularity of synchronization significantly. In mimalloc, the heap is sharded so that each thread has its own thread-local heap. Objects are always allocated from the local heap of the thread where the allocation is performed. Because allocations are always done from a thread’s local heap, they need not be synchronized.

However, since objects can move between threads before being deallocated, deallocations may still occur concurrently. Therefore, Leijen et al. introduce a concept of local and global free lists. When an object is deallocated on the same thread it was originally allocated on, it is placed on the local free list; if it is deallocated on another thread, it goes on the global free list for the heap of the thread from which it originated. To allocate, the local free list is used first; if it is empty, the entire global free list is popped onto the local free list. Since the local free list is only ever accessed by the thread it belongs to, it does not require synchronization at all, and because the global free list is popped from infrequently, the cost of synchronization has a reduced impact. A majority of allocations can occur without any synchronization at all, and deallocations only require synchronization when an object has left its parent thread (a relatively uncommon case).
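
The local/global split can be sketched as below. This is an illustrative sketch under stated assumptions, not mimalloc's or this crate's implementation: `FreeLists`, `NIL`, and the method names are hypothetical, indices stand in for pointers, and the owning thread is modeled by requiring `&mut self`, while other threads only need `&self`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sentinel index meaning "end of list" (an assumption of this sketch).
const NIL: usize = usize::MAX;

/// Free lists for one thread's heap, storing slot indices instead of pointers.
pub struct FreeLists {
    /// `next[i]` is the index that follows slot `i` on whichever list holds it.
    next: Vec<AtomicUsize>,
    /// Head of the local list; only the owning thread touches it.
    local_head: usize,
    /// Head of the global list; any thread may push onto it.
    remote_head: AtomicUsize,
}

impl FreeLists {
    /// Creates lists for `capacity` slots, all initially on the local list.
    pub fn new(capacity: usize) -> Self {
        let next = (0..capacity)
            .map(|i| AtomicUsize::new(if i + 1 < capacity { i + 1 } else { NIL }))
            .collect();
        FreeLists {
            next,
            local_head: if capacity > 0 { 0 } else { NIL },
            remote_head: AtomicUsize::new(NIL),
        }
    }

    /// Allocation, performed by the owning thread only.
    pub fn alloc(&mut self) -> Option<usize> {
        if self.local_head == NIL {
            // Local list is empty: take the entire global list with one swap.
            self.local_head = self.remote_head.swap(NIL, Ordering::Acquire);
        }
        let idx = self.local_head;
        if idx == NIL {
            return None;
        }
        self.local_head = self.next[idx].load(Ordering::Relaxed);
        Some(idx)
    }

    /// Deallocation on the owning thread: a plain push onto the local list.
    pub fn free_local(&mut self, idx: usize) {
        self.next[idx].store(self.local_head, Ordering::Relaxed);
        self.local_head = idx;
    }

    /// Deallocation from another thread: a compare-and-swap push.
    pub fn free_remote(&self, idx: usize) {
        let mut head = self.remote_head.load(Ordering::Relaxed);
        loop {
            self.next[idx].store(head, Ordering::Relaxed);
            match self.remote_head.compare_exchange_weak(
                head,
                idx,
                Ordering::Release,
                Ordering::Relaxed,
            ) {
                Ok(_) => return,
                Err(actual) => head = actual, // lost a race; retry with new head
            }
        }
    }
}
```

Only `free_remote` contains a compare-and-swap loop, and `alloc` touches the shared head only when the local list has run dry. Because the global list is only ever pushed to or taken whole, the sketch avoids the ABA hazard that a per-element lock-free pop would have to handle.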

§Implementation

A slab is represented as an array of MAX_THREADS shards. A shard consists of a vector of one or more pages plus associated metadata. Finally, a page consists of an array of slots along with head indices for the local and remote free lists.

┌─────────────┐
│ shard 1     │
│             │    ┌─────────────┐        ┌────────┐
│ pages───────┼───▶│ page 1      │        │        │
├─────────────┤    ├─────────────┤  ┌────▶│  next──┼─┐
│ shard 2     │    │ page 2      │  │     ├────────┤ │
├─────────────┤    │             │  │     │XXXXXXXX│ │
│ shard 3     │    │ local_head──┼──┘     ├────────┤ │
└─────────────┘    │ remote_head─┼──┐     │        │◀┘
      ...          ├─────────────┤  │     │  next──┼─┐
┌─────────────┐    │ page 3      │  │     ├────────┤ │
│ shard n     │    └─────────────┘  │     │XXXXXXXX│ │
└─────────────┘          ...        │     ├────────┤ │
                   ┌─────────────┐  │     │XXXXXXXX│ │
                   │ page n      │  │     ├────────┤ │
                   └─────────────┘  │     │        │◀┘
                                    └────▶│  next──┼───▶  ...
                                          ├────────┤
                                          │XXXXXXXX│
                                          └────────┘
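
As a rough sketch of the layout in the diagram, the nesting could be written with types like the following. The field names, the `NIL` sentinel, the use of `Vec` for every level, and the value chosen for `MAX_THREADS` are all assumptions made for illustration; the crate's real types carry more metadata.

```rust
use std::sync::atomic::AtomicUsize;

/// Hypothetical shard count for this sketch; the text only names the constant.
pub const MAX_THREADS: usize = 4;
/// Sentinel index meaning "no next slot" (an assumption of this sketch).
pub const NIL: usize = usize::MAX;

/// One slot: an optional value plus the index of the next free slot.
pub struct Slot<T> {
    pub value: Option<T>,
    pub next: usize,
}

/// A page: an array of slots and the heads of its two free lists.
pub struct Page<T> {
    pub slots: Vec<Slot<T>>,
    pub local_head: usize,
    pub remote_head: AtomicUsize,
}

/// A shard: the pages owned by one thread.
pub struct Shard<T> {
    pub pages: Vec<Page<T>>,
}

/// The slab: one shard per thread, up to `MAX_THREADS`.
pub struct Slab<T> {
    pub shards: Vec<Shard<T>>,
}

impl<T> Page<T> {
    /// Builds a page whose slots are all chained onto the local free list.
    pub fn new(size: usize) -> Self {
        let slots = (0..size)
            .map(|i| Slot { value: None, next: if i + 1 < size { i + 1 } else { NIL } })
            .collect();
        Page {
            slots,
            local_head: if size > 0 { 0 } else { NIL },
            remote_head: AtomicUsize::new(NIL),
        }
    }
}

impl<T> Slab<T> {
    /// Builds a slab in which every shard starts with a single page.
    pub fn new(initial_page_size: usize) -> Self {
        let shards = (0..MAX_THREADS)
            .map(|_| Shard { pages: vec![Page::new(initial_page_size)] })
            .collect();
        Slab { shards }
    }
}
```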

The size of the first page in a shard is always a power of two, and every subsequent page added after the first is twice as large as the page that precedes it.


pg.
┌───┐   ┌─┬─┐
│ 0 │───▶ │ │
├───┤   ├─┼─┼─┬─┐
│ 1 │───▶ │ │ │ │
├───┤   ├─┼─┼─┼─┼─┬─┬─┬─┐
│ 2 │───▶ │ │ │ │ │ │ │ │
├───┤   ├─┼─┼─┼─┼─┼─┼─┼─┼─┬─┬─┬─┬─┬─┬─┬─┐
│ 3 │───▶ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
└───┘   └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
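
The doubling rule gives each page a size and a starting offset that can be computed directly. The helpers below are a small sketch; the function names are hypothetical, and `initial_page_size` is assumed to be a power of two.

```rust
/// Number of slots in page `page_index`: each page doubles the previous one.
pub fn page_size(initial_page_size: usize, page_index: u32) -> usize {
    initial_page_size << page_index
}

/// Offset of the first slot of page `page_index`, i.e. the combined size of
/// all earlier pages: initial * (2^page_index - 1).
pub fn page_start(initial_page_size: usize, page_index: u32) -> usize {
    initial_page_size * ((1usize << page_index) - 1)
}
```

With an initial page size of 2, as in the diagram, the pages hold 2, 4, 8, and 16 slots and begin at offsets 0, 2, 6, and 14.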

When searching for a free slot, the smallest page is searched first, and if it is full, the search proceeds to the next page until either a free slot is found or all available pages have been searched. If all available pages have been searched and the maximum number of pages has not yet been reached, a new page is then allocated.
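
The search order can be sketched as follows. To keep the example short, each page tracks occupancy with a vector of flags and is scanned linearly, which is a simplification and not the free-list mechanism described earlier. The `Shard`, `Page`, and `claim_slot` names are hypothetical, and pages are assumed to be allocated lazily.

```rust
/// A page that tracks occupancy with one flag per slot (a simplification).
pub struct Page {
    used: Vec<bool>,
}

/// A shard holding pages of doubling size, up to `max_pages`.
pub struct Shard {
    pages: Vec<Page>,
    initial_page_size: usize,
    max_pages: usize,
}

impl Shard {
    /// `initial_page_size` is assumed to be a nonzero power of two.
    pub fn new(initial_page_size: usize, max_pages: usize) -> Self {
        Shard { pages: Vec::new(), initial_page_size, max_pages }
    }

    /// Claims a free slot and returns `(page index, slot index within page)`.
    pub fn claim_slot(&mut self) -> Option<(usize, usize)> {
        // Search the existing pages, smallest first.
        for (page_idx, page) in self.pages.iter_mut().enumerate() {
            if let Some(slot_idx) = page.used.iter().position(|used| !*used) {
                page.used[slot_idx] = true;
                return Some((page_idx, slot_idx));
            }
        }
        // Every existing page is full: add a page if the limit allows it.
        if self.pages.len() < self.max_pages {
            // Each new page is twice as large as the one before it.
            let size = self.initial_page_size << self.pages.len();
            let mut used = vec![false; size];
            used[0] = true;
            self.pages.push(Page { used });
            return Some((self.pages.len() - 1, 0));
        }
        None
    }
}
```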

Since every page is twice as large as the previous page, and all page sizes are powers of two, we can determine the page index that contains a given address by shifting the address down by the smallest page size and looking at how many twos places are necessary to represent that number, which tells us what power-of-two page size it fits inside of. We can determine the number of twos places by counting the number of leading zeros (unused twos places) in the number’s binary representation, and subtracting that count from the total number of bits in a word.

The formula for determining the page number that contains an offset is thus:

WIDTH - ((offset + INITIAL_PAGE_SIZE) >> INDEX_SHIFT).leading_zeros()

where WIDTH is the number of bits in a usize, and INDEX_SHIFT is

INITIAL_PAGE_SIZE.trailing_zeros() + 1;
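
To make the formula concrete, here is a small sketch that takes the initial page size as a parameter instead of a constant. The function name `page_index` and the example size of 32 used below are assumptions for illustration.

```rust
/// Number of bits in a `usize`.
pub const WIDTH: u32 = usize::BITS;

/// Index of the page containing `offset`, given the size of the first page.
/// `initial_page_size` is assumed to be a nonzero power of two.
pub fn page_index(offset: usize, initial_page_size: usize) -> u32 {
    debug_assert!(initial_page_size.is_power_of_two());
    let index_shift = initial_page_size.trailing_zeros() + 1;
    // Adding the initial page size makes page boundaries fall on powers of
    // two; the shift then leaves a number whose bit length is the page index.
    WIDTH - ((offset + initial_page_size) >> index_shift).leading_zeros()
}
```

With an initial page size of 32, the shift is 6. Offset 31 gives `63 >> 6 = 0`, which has no significant bits, so it lies in page 0. Offset 32 gives `64 >> 6 = 1`, one significant bit, so page 1. Offset 96 gives `128 >> 6 = 2`, two significant bits, so page 2.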