The kernel heap

Now we get to write our main kernel memory allocator - kmalloc().

This is the allocator that will be used liberally around the kernel for dynamically allocating many kinds of objects and buffers. As such it has to deal with allocation sizes ranging from tiny to very large.

The input sizes to kmalloc can be broadly split into two groups:
  1. Small objects. These are allocated and freed often, and there may be many of them. Many objects are the same or similar size, so aggressive reuse of memory is useful for keeping required physical memory down.
  2. Large objects, such as buffers. These are often allocated more sporadically, live longer, and requests are not often similarly sized.

Because the usage pattern of small and large objects is so different, it is useful to split kmalloc up into two sub-allocators - one tailored for small objects with lots of memory reuse, and another for large objects with less reuse. We’ll deal with the large object allocator first.

src/include/vmspace.h
#ifndef VMSPACE_H
#define VMSPACE_H

#include "stdint.h"
#include "adt/buddy.h"

“vmspace” - another buddy-based allocator

So for this large object allocator, we need to:
  1. Carve up part of the kernel virtual address space and hand it out to callers.
  2. Allocate physical memory for the bits of address space given to callers.

The first requirement sounds kind of similar to what we needed for our physical memory manager. In fact, it’s the same! So we can just reuse the “buddy” allocator we created for the PMM.

src/include/vmspace.h
typedef struct vmspace {
  uintptr_t start;
  uintptr_t size;
  buddy_t allocator;
  spinlock_t lock;
} vmspace_t;

int vmspace_init(vmspace_t *vms, uintptr_t addr, uintptr_t sz);
uintptr_t vmspace_alloc(vmspace_t *vms, unsigned sz, int alloc_phys);
void vmspace_free(vmspace_t *vms, unsigned sz, uintptr_t addr, int free_phys);

/* Allows accessing the singleton vmspace allocated for kernel heap use. */
extern vmspace_t kernel_vmspace;

#endif

Instead of pinning the allocator to just one specific range, let’s make a simple ADT for it so we can reuse it later if necessary.

src/vmspace.c
#include "assert.h"
#include "hal.h"
#include "vmspace.h"

int vmspace_init(vmspace_t *vms, uintptr_t addr, uintptr_t sz) {
  /* FIXME: Assert starts and finishes on a page boundary! */
  range_t r;
  r.start = addr;
  r.extent = sz;

  vms->start = addr;
  vms->size = sz;
  spinlock_init(&vms->lock);

  size_t overhead = round_to_page_size(buddy_calc_overhead(r));
  size_t npages = overhead >> get_page_shift();
  uintptr_t start = r.start + r.extent - overhead;

  int ok = map(start,
               alloc_pages(PAGE_REQ_NONE, npages), npages,
               PAGE_WRITE);
  assert(ok == 0 && "map() failed in vmspace_init!");

  r.extent -= overhead;

  buddy_init(&vms->allocator, (uint8_t*)start, r, /*start_freed=*/0);

  /* FIXME: Could we just pass '1' as the start_freed argument above? */
  buddy_free_range(&vms->allocator, r);

  return 0;
}

Let’s begin writing this simple wrapper around the buddy allocator.

We start by initialising the buddy allocator inside a vmspace object. That involves calculating the overhead of the buddy allocator’s bitmaps and reserving space for them at the end of the vmspace range. We also need to allocate physical pages as backing memory for that reserved region.
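
As a usage sketch, the kernel heap’s singleton vmspace would be initialised once at boot with whatever chunk of kernel virtual address space we have set aside for the heap - the addresses below are placeholder values for illustration, not the ones our kernel actually uses:

/* Sketch only: carve out a made-up 256MB region at 0xD0000000 for the
   kernel heap. */
vmspace_init(&kernel_vmspace, 0xD0000000, 0x10000000);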

src/vmspace.c
uintptr_t vmspace_alloc(vmspace_t *vms, unsigned sz, int alloc_phys) {
  /* FIXME: Assert sz is page aligned. */
  spinlock_acquire(&vms->lock);

  uint64_t addr = buddy_alloc(&vms->allocator, sz);

  if (alloc_phys && addr != ~0ULL) {
    size_t npages = sz >> get_page_shift();
    uintptr_t phys_pages = alloc_pages(PAGE_REQ_NONE, npages);
    assert(phys_pages != ~0UL && phys_pages != ~0ULL && "Out of memory!");
    int ok = map(addr, phys_pages, npages, alloc_phys);
    assert(ok == 0 && "vmspace_alloc: map failed!");
  }

  spinlock_release(&vms->lock);
  return addr;
}

void vmspace_free(vmspace_t *vms, unsigned sz, uintptr_t addr, int free_phys) {
  spinlock_acquire(&vms->lock);

  if (free_phys) {
    unsigned pgsz = get_page_size();
    for (unsigned i = 0; i < sz; i += pgsz) {
      uint64_t p = get_mapping(addr + i, NULL);
      assert(p != ~0ULL &&
             "vmspace_free asked to free_phys but mapping did not exist!");
      free_page(p);
      unmap(addr + i, 1);
    }
  }

  buddy_free(&vms->allocator, addr, sz);

  spinlock_release(&vms->lock);
}

Next we have allocation and release of memory. This just involves calling into the buddy allocator and then allocating (or freeing) the physical backing memory, if the caller requested it.
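
To see how the two calls fit together, here is a quick sketch of a caller allocating and then releasing one chunk (the size is illustrative; note that the alloc_phys argument doubles as the flags passed to map()):

/* Sketch: grab an 8KB region with writable physical backing... */
uintptr_t buf = vmspace_alloc(&kernel_vmspace, 0x2000, PAGE_WRITE);

/* ... use it ... */

/* ...then hand back both the physical memory and the address range. */
vmspace_free(&kernel_vmspace, 0x2000, buf, /*free_phys=*/1);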

Slab allocator

Now that we have a “large object” allocator, let’s focus on the “small object” allocator.

There are many types of allocator we can use for small objects - a fairly standard choice is dlmalloc, which is used in many hobby kernels.

The Solaris kernel developers created an allocator called the Slab allocator, which was later adopted by Linux and remained its main kernel allocator until the "SLUB" allocator (itself based on the slab design) was added.

The Slab scheme is fairly elegant and compact, so we’re going to implement it (or a variant of it) for our small object allocator.

Slab basics

A slab allocator sits on top of another memory allocator. It requests a large chunk of memory (a slab) at a time, then carves those slabs up itself.

Slabs belong to a cache. All allocation requests to a cache return the same size of memory. In order to allocate different sizes of object, different caches must be made (one for each size of object you want to return).

The original Solaris Slab allocator extended this concept. Not only was each cache dedicated to one size of object, it could also be dedicated to one type of object - you’d have one cache for your filesystem handles, another for your thread handles, and so on for other high-churn, small objects. The advantage of this is that the cache can return an object that has already been initialised - instead of memsetting or setting struct members to a default value after calling the allocator, the allocator can do it for you more efficiently!

So you create one or more caches, and a cache maintains a list of slabs it has gained from the underlying large memory allocator. It takes those slabs and carves them up into N objects of the same size. At the end of the slab, it allocates space for a simple bitmap which allows it to track which of the objects in this slab are allocated. If none are allocated, it can return the slab to the large memory allocator.

src/include/slab.h
#ifndef SLAB_H
#define SLAB_H

#include "vmspace.h"

/* 8KB slab sizes */
#define SLAB_SIZE 0x2000

typedef struct slab_cache {
  unsigned size;                /* The size of objects this cache returns, in bytes. */
  void *init;                   /* Optional initializer that can be applied to every
                                   returned object. */
  struct slab_footer *first;    /* First slab footer in a linked list - all
                                   of these slabs are full or partially full. */
  vmspace_t *vms;               /* Parent "large object" allocator. */

  spinlock_t lock;
} slab_cache_t;

/* Create a new slab cache. Use 'vms' as the parent "large object" allocator,
   and return objects of 'size' bytes. If 'init' is non-NULL, all objects will
   have the value located at 'init' upon allocation. */
int slab_cache_create(slab_cache_t *c, vmspace_t *vms, unsigned size, void *init);
int slab_cache_destroy(slab_cache_t *c);
void *slab_cache_alloc(slab_cache_t *c);
void slab_cache_free(slab_cache_t *c, void *obj);

#endif

Slab implementation

We’ll obviously build our slab allocator on top of the “vmspace” allocator we just defined. The API is simple, and involves creating, allocating from, freeing from and destroying caches.
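
Before diving in, here is a quick sketch of how a caller might drive this API (the cache of zero-initialised 64-byte objects is purely illustrative, and the names 'zeroes' and 'cache64' are made up for this example):

/* Sketch: a cache of 64-byte objects, zeroed on allocation. */
static uint8_t zeroes[64];        /* template: all zero bytes */
static slab_cache_t cache64;

slab_cache_create(&cache64, &kernel_vmspace, 64, zeroes);

void *obj = slab_cache_alloc(&cache64);   /* 64 bytes, already zeroed */
slab_cache_free(&cache64, obj);

slab_cache_destroy(&cache64);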

src/slab.c
#include "assert.h"
#include "hal.h"
#include "slab.h"
#include "string.h"

src/slab.c
typedef struct slab_footer {
  struct slab_footer *next;
} slab_footer_t;

/* Mask for finding the start of a slab. */
#define SLAB_ADDR_MASK (~(SLAB_SIZE-1))
/* Given an object size, calculate the size of the bitmap required. This will
   be an overestimate. */
#define BITMAP_SIZE(obj_sz) ((SLAB_SIZE / (obj_sz)) / 8 + 1)

/* Given an address inside a slab, find the slab's start address. */
#define START_FOR_PTR(x) ((uintptr_t)(x) & SLAB_ADDR_MASK)
/* Given an address inside a slab, find the slab's footer. */
#define FOOTER_FOR_PTR(x) ((void*)(START_FOR_PTR(x) +                  \
                                   SLAB_SIZE - sizeof(slab_footer_t)))
/* Given an address inside a slab, find the slab's bitmap start address. */
#define BITMAP_FOR_PTR(x, obj_sz) ((uint8_t*)FOOTER_FOR_PTR(x) -       \
                                   BITMAP_SIZE(obj_sz))
/* Return the bitmap entry index for 'obj'. */
#define BITMAP_IDX(obj, obj_sz) (((uintptr_t)(obj) -                   \
                                  START_FOR_PTR(obj)) / (obj_sz))

A slab is divided into three main parts - the main content, the allocation bitmap, and the footer, which consists of a pointer to the footer of the next slab (forming a linked list).

../../../doc/slab-layout.svg

To simplify our allocator, we can make the assumption that the parent allocator (vmspace) only returns chunks of memory that are naturally aligned. So if we ask for an 8KB chunk of memory, the resulting pointer will be 8KB-aligned.

We can use this to easily calculate the start of a slab from any pointer inside it. If you have a pointer to a slab footer or one of the objects inside, we can just perform a simple mask and get the start of the slab. This then lets us calculate the address of the bitmap and footer.
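
For example (with made-up addresses): SLAB_SIZE is 0x2000, so a pointer such as 0xC0103456 inside a slab that starts at 0xC0102000 masks down with ~0x1FFF to exactly 0xC0102000, and the footer therefore sits at 0xC0102000 + 0x2000 - sizeof(slab_footer_t), just below the end of the slab.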

src/slab.c
static unsigned num_objs_per_slab(unsigned obj_sz) {
  int n = SLAB_SIZE / obj_sz;
  int overhead = BITMAP_SIZE(obj_sz) + sizeof(slab_footer_t);
  return n - overhead / obj_sz - 1;
}

The number of objects in a slab is actually slightly less straightforward than might be expected.

We can calculate the number of objects that will fit in a slab (SLAB_SIZE / object_size), and that tells us how large our bitmap needs to be.

But now that we have a bitmap of nonzero size, fewer objects will fit in the remaining space. We could spend effort calculating the maximum possible number of objects, or we could take the easy way out and simply not bother. This means we potentially waste a little space, but we save some execution time (iterating to find the best packing is time consuming!).
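
To make that concrete with 64-byte objects: SLAB_SIZE / 64 gives n = 128, so BITMAP_SIZE(64) is 128 / 8 + 1 = 17 bytes and the overhead is 17 bytes plus the footer pointer. That overhead divided by 64 rounds down to zero, so num_objs_per_slab returns 128 - 0 - 1 = 127 - we sacrifice roughly one object’s worth of space rather than computing the exact packing.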

src/slab.c
static void mark(slab_cache_t *c, void *obj, bool used) {
  unsigned idx = BITMAP_IDX(obj, c->size);

  unsigned byte = idx >> 3;
  unsigned bit = idx & 7;
  uint8_t *ptr = BITMAP_FOR_PTR(obj, c->size) + byte;
  if (used)
    *ptr |= 1 << bit;
  else
    *ptr &= ~(1 << bit);
}

static bool all_unused(slab_cache_t *c, slab_footer_t *f) {
  uint8_t *p = BITMAP_FOR_PTR(f, c->size);

  for (unsigned i = 0; i < BITMAP_SIZE(c->size); ++i, ++p) {
    if (*p != 0) {
      unsigned nobjs = num_objs_per_slab(c->size);
      /* Something is nonzero. If this whole byte lies within the
         valid object range, a set bit definitely means an object
         is still in use. */
      if ((i+1) * 8 < nobjs)
        return false;

      /* Otherwise some of these bits fall past the last real object,
         so check only the bits that correspond to valid objects. */
      uint8_t val = *p;
      for (unsigned j = i * 8; j < nobjs; j += 1) {
        if (val & 1) return false;
        val >>= 1;
      }
    }
  }
  return true;
}

Now we get on to our bitmap manipulation helper functions. We have one to change the state of a single bit (mark) and another (all_unused) to check whether every object-tracking bit in a slab’s bitmap is clear.
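
As a quick illustration with 64-byte objects: the object at slab offset 640 has bitmap index 640 / 64 = 10, so mark() touches byte 10 >> 3 = 1 and bit 10 & 7 = 2 - marking it used sets bit 2 of the second bitmap byte.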

src/slab.c
/* Return the index of the lowest clear (zero) bit in 'byte'. */
static int lsb_clear(uint8_t byte) {
  int i = 0;
  while ((byte & 1) == 1) {
    ++i;
    byte >>= 1;
  }
  return i;
}

static void *find_empty_obj(slab_cache_t *c) {
  slab_footer_t *f = c->first;
  while (f) {
    uint8_t *p = BITMAP_FOR_PTR(f, c->size);

    for (unsigned i = 0; i < BITMAP_SIZE(c->size); ++i, ++p) {
      /* Are all bits set? If so, there's definitely nothing
         empty here. */
      if (*p != 0xFF) {
        unsigned idx = i * 8 + lsb_clear(*p);
        if (idx >= num_objs_per_slab(c->size))
          /* No free objects in this slab :( */
          break;
        return (void*)(START_FOR_PTR(f) + c->size * idx);
      }
    }

    f = f->next;
  }
  return NULL;
}

The final helper function attempts to find an empty (free) object. This just searches the bitmap of each slab in turn, looking for a clear bit.
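
For instance, a bitmap byte of 0x07 means the first three objects in that group of eight are taken; lsb_clear(0x07) returns 3, so index i * 8 + 3 is the object returned - provided that index is still below num_objs_per_slab, otherwise the clear bit is just padding and we move on to the next slab.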

src/slab.c
int slab_cache_create(slab_cache_t *c, vmspace_t *vms, unsigned size, void *init) {
  c->size = size;
  c->init = init;
  c->first = NULL;
  c->vms = vms;
  spinlock_init(&c->lock);
  return 0;
}

int slab_cache_destroy(slab_cache_t *c) {
  slab_footer_t *s = c->first;
  while (s) {
    slab_footer_t *s_ = s->next;
    vmspace_free(c->vms, SLAB_SIZE, START_FOR_PTR(s), /*free_phys=*/1);
    s = s_;
  }
  c->first = NULL;
  return 0;
}

Then we get to the API functions. Cache creation and destruction are obvious and uninteresting.

src/slab.c
void *slab_cache_alloc(slab_cache_t *c) {
  spinlock_acquire(&c->lock);

Ah, allocation. This is slightly more fun.

src/slab.c
  void *obj;
  if ((obj = find_empty_obj(c)) == NULL) {
    slab_footer_t *f = c->first;

    uintptr_t addr = vmspace_alloc(c->vms, SLAB_SIZE, /*alloc_phys=*/PAGE_WRITE);

    /* Initialise the used/free bitmap. */
    memset((uint8_t*)BITMAP_FOR_PTR(addr, c->size), 0, BITMAP_SIZE(c->size));

    c->first = FOOTER_FOR_PTR(addr);
    c->first->next = f;

    obj = (void*)START_FOR_PTR(c->first);
  }

We need to allocate a new object. There are two cases here - either all of our slabs are full, in which case we need to get a new slab, or there is at least one free object in at least one slab.

Creating a new slab simply involves calling our vmspace allocator and adding the result to the slab list, as well as initializing its allocation bitmap.

src/slab.c
  if (c->init)
    memcpy(obj, c->init, c->size);
  mark(c, obj, true);

  spinlock_release(&c->lock);
  return obj;
}

If the user specified an initial state for an object in the cache, copy it over now and mark the object as allocated.

src/slab.c
void slab_cache_free(slab_cache_t *c, void *obj) {
  spinlock_acquire(&c->lock);
  assert(c->first && "Trying to free from an empty cache!");

  slab_footer_t *f = FOOTER_FOR_PTR(obj);

  mark(c, obj, false);

  if (all_unused(c, f)) {
    slab_footer_t *f2 = c->first;
    if (f2 == f) {
      c->first = f->next;
    } else {
      while (f2->next != f)
        f2 = f2->next;
      f2->next = f->next;
    }
    vmspace_free(c->vms, SLAB_SIZE, START_FOR_PTR(f), /*free_phys=*/1);
  }
  spinlock_release(&c->lock);
}

Freeing is essentially the same in reverse. We need to mark the object as unused in its slab’s bitmap, then we see if we can free the slab completely.

If all objects in the slab are now unused, we can unlink it from the slab list and send the memory back to the vmspace allocator.

And that’s all there is to a (very simple, unoptimised) slab allocator!
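
With both halves in place, kmalloc() itself can be little more than a dispatcher: round small requests up to a size class served by a slab cache, and send anything bigger straight to the vmspace allocator. The sketch below only illustrates that split - the size classes, the kmalloc_caches array, the 4KB threshold and the assumption that the caches were created at boot are all placeholders, not the final interface:

/* Hypothetical glue code - names and thresholds are illustrative only.
   Assumes each kmalloc_caches[i] was slab_cache_create()'d at boot. */
#define KMALLOC_BIG 0x1000                 /* >= 4KB goes to vmspace */

static const unsigned cache_sizes[] = { 32, 64, 128, 256, 512, 1024 };
static slab_cache_t kmalloc_caches[6];

void *kmalloc(unsigned sz) {
  if (sz < KMALLOC_BIG) {
    /* Find the smallest cache that fits the request. */
    for (unsigned i = 0; i < 6; ++i)
      if (sz <= cache_sizes[i])
        return slab_cache_alloc(&kmalloc_caches[i]);
  }
  /* Large (or awkwardly sized) requests: whole pages from the vmspace
     allocator, backed by writable physical memory. */
  return (void*)vmspace_alloc(&kernel_vmspace, round_to_page_size(sz),
                              PAGE_WRITE);
}

A matching kfree() would need some way to recover the original allocation size (and hence which cache or how many pages to give back), which this sketch deliberately glosses over.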