Physical memory management

In a kernel, you have multiple memory managers: one for managing use of the physical address space, one for managing use of the virtual address space, and one for controlling the mappings between virtual and physical.

This chapter is going to focus on the first - physical memory management.

Physical memory management basics

Physical memory managers usually work at a coarser granularity than one byte. The usual granularity is the page size, which is a minimum of 4096 bytes (2^12).

So the input to the PMM is the set of pages that are free. Its API is simple - a user can request a free page, and can hand a page back when it is no longer used.

Actually it’s slightly more complex than that; certain components require physical memory in particular locations. An example is 16-bit i386 emulation (perhaps for a VBE driver), which requires memory below the 1MB boundary. Many DMA engines only support memory below 4GB. So we have three categories of physical memory - below 1MB, 1MB..4GB, and above 4GB.
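
To make this concrete, here is roughly the interface the rest of this chapter builds towards. The prototypes mirror the functions implemented in src/pmm.c below; the request constants are assumed to be simple indices (they are used as array indices later), and exactly where they are declared is not shown here.

/* Which region of physical memory the caller needs. Values are assumed. */
#define PAGE_REQ_UNDER1MB 0   /* must lie below the 1MB boundary */
#define PAGE_REQ_UNDER4GB 1   /* must lie below 4GB */
#define PAGE_REQ_NONE     2   /* no constraint */

uint64_t alloc_page(int req);               /* one page; returns ~0ULL on failure */
uint64_t alloc_pages(int req, size_t num);  /* 'num' physically contiguous pages */
int      free_page(uint64_t page);
int      free_pages(uint64_t pages, size_t num);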

Common algorithms

There are two common routes for implementing a PMM. I’m going to discuss both of them, then explain the reason we’re not going to use either :)

Note, for each of these, that the ideal is O(1) time for both allocate and free, and O(n) space.

Bitmaps

The idea is that you create a bitmap (array of 1-bit values) for the entirety of your usable physical space. One value represents that the page is free (available for use), the other represents that it is unavailable.

Time complexity

A bitmap search is worst-case linear with respect to its size - O(n). Taking the naive approach of searching from the beginning of the bitmap each time and allocating the first free page encountered, the average time is as bad as the worst case.
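
As an illustration (with invented names - this is not the allocator we end up using), a naive first-fit scan over such a bitmap might look like this:

#include <stdint.h>
#include <stddef.h>

#define MAX_PAGES (1u << 20)   /* e.g. 4GB of address space in 4KB pages */

/* One bit per page: 1 = free, 0 = in use. */
static uint32_t free_map[MAX_PAGES / 32];

/* O(n) scan: return the index of the first free page, or (size_t)-1. */
static size_t bitmap_alloc_page(void) {
  for (size_t i = 0; i < MAX_PAGES / 32; ++i) {
    if (free_map[i] == 0)
      continue;                        /* all 32 pages in this word are used */
    for (unsigned b = 0; b < 32; ++b) {
      if (free_map[i] & (1u << b)) {
        free_map[i] &= ~(1u << b);     /* mark it used */
        return i * 32 + b;
      }
    }
  }
  return (size_t)-1;
}

/* O(1): freeing is just setting the page's bit again. */
static void bitmap_free_page(size_t page) {
  free_map[page / 32] |= 1u << (page % 32);
}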

Allocate: O(n). Free: O(1)

Space complexity

A bitmap must be created that spans from the lowest possible address to the highest possible address. This means that there will potentially be lots of dead space, as memory maps rarely bunch up all available RAM together in the low end of the address space. There are often large holes for which we’re allocating bitmap space that is never used. In the worst case that we have available RAM at 0xFFFFF000, we’d need a bitmap covering the whole 4GB below it - 128KB at one bit per 4KB page. Not only is this large, it is uncorrelated with the amount of physical RAM in the system and can hurt cache performance. O(n)

Stack

The idea here is to keep a push-down stack of all known free pages. On allocate you pop a page off the stack; on free you push the page back onto it.
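
A minimal sketch (again with invented names) shows how little work each operation does:

#include <stdint.h>
#include <stddef.h>

#define STACK_CAPACITY (1u << 20)            /* illustrative upper bound on free pages */

static uint64_t free_stack[STACK_CAPACITY];  /* physical addresses of free pages */
static size_t   free_top;                    /* number of entries on the stack */

/* O(1): pop the most recently freed page, or ~0ULL if nothing is free. */
static uint64_t stack_alloc_page(void) {
  if (free_top == 0)
    return ~0ULL;
  return free_stack[--free_top];
}

/* O(1): push the page back for reuse. */
static void stack_free_page(uint64_t page) {
  free_stack[free_top++] = page;
}

In practice the stack is often threaded through the free pages themselves (each free page stores the address of the next), so no separate array is needed.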

Time complexity

A stack has constant time complexity in all operations.

Allocate: O(1). Free: O(1)

Space complexity

The stack only takes up as much space as is required to store its contents, so although it is O(n) the same as the bitmap, the n is smaller: it is related to the amount of available memory rather than the position of that memory in the address space.

Drawbacks

The stack based approach is clearly ideal in allocation, free and space complexity, so what are some negatives?

A stack suffers from fragmentation: pages come back out of the allocator in the reverse of the order in which they were freed, with no guarantee that two consecutive allocations are physically adjacent. Generally this doesn’t matter, but there are a couple of situations where it does:

  • Old DMA (direct memory access) devices may require a contiguous region of physical memory to communicate with the processor. Newer DMA controllers have “scatter-gather” semantics that allow them to take a noncontiguous list of pages to write to, but old ones may not.
  • Coalescing into larger pages. We’ve mentioned that the basic page size is 4KB, but many CPUs support larger page sizes which, if used, are more efficient (fewer TLB entries). If it is impossible to get contiguous memory spanning multiple pages from the physical allocator, it is impossible to use a larger page size.

It may also end up using more space - to store the fact that page n is free, the stack requires a full 32-bit entry whereas the bitmap requires just one bit.

src/pmm.c
#include "hal.h"
#include "assert.h"
#include "stdio.h"
#include "mmap.h"
#include "adt/buddy.h"
#include "string.h"

#ifdef DEBUG_pmm
# define dbg(args...) kprintf("pmm: " args)
#else
# define dbg(args...)
#endif

#define MIN(x, y) ( ((x) < (y)) ? (x) : (y) )
#define MAX(x, y) ( ((x) > (y)) ? (x) : (y) )

Buddy allocation

This is a more versatile allocator that can be used for multiple purposes, not just physical allocation (which has a subset of the requirements of a general-purpose allocator).

The idea is that the address space is represented by a binary tree. Each node in the tree can be split into two “buddies” - buddies are siblings in the tree.

The root tree node covers all of the available space. Its children cover half each, and their children half of that and so on.

The tree is kept as compact as possible at all times, so initially there is just the one root node. On an allocation request, the tree is subdivided just enough to reach the smallest node size that will still cover the request.

On a free request, the node indicated is marked free; then, working iteratively back up the tree towards the root, we check whether its buddy is also free. If so, both buddies are destroyed and their parent is marked free instead, collapsing the tree back up again.
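
For example, with a 64KB region and a 4KB minimum block size, a 4KB request splits the 64KB root into two 32KB buddies, one 32KB block into two 16KB buddies, one of those into two 8KB buddies, and one 8KB block into two 4KB buddies, one of which is handed out. When that 4KB block is later freed, the tree collapses back up through each level - provided its buddy at each level is still free.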

Time complexity

Allocation in the buddy world potentially requires log2(n) tree node traversals, and the same goes for freeing. The constant factor, however, is very small.

Allocate: O(log n). Free: O(log n)

Space complexity

The buddy allocator requires several bitmaps to operate - these are dependent upon the address space size it operates over, so it is similar in requirement to the bitmap allocator mentioned previously - O(n).

Implementation

Our physical memory manager will be based upon a buddy allocator. Although the stack allocator has better theoretical allocate and free performance (and memory usage), it only allows allocation/freeing of one page at a time, and does not have any way to request multiple adjacent pages, which we may want to do for a DMA buffer or to use larger-sized pages (such as 1MB or 2MB pages).

src/adt/buddy.c
#include "assert.h"
#include "hal.h"
#include "math.h"
#include "adt/buddy.h"

src/adt/buddy.c
#define BUDDY(x)     (x ^ 1)
#define INC_ORDER(x) (x << 1)
#define DEC_ORDER(x) (x >> 1)

Firstly, let’s define some helper macros that manipulate tree node indices.

The buddy of a node x is the adjacent node, so to find it we simply compute x xor 1.

Given a node in one depth level (order) of the tree, you can find its first child by multiplying it by two, and its parent by dividing by two.
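
For example, node 5’s buddy is node 4 (5 xor 1), its children in the next order down are nodes 10 and 11 (5 * 2 and 5 * 2 + 1), and its parent is node 2 (5 / 2, rounding down).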

src/adt/buddy.c
size_t buddy_calc_overhead(range_t r) {
  size_t accum = 0;
  for (unsigned i = MIN_BUDDY_SZ_LOG2; i <= MAX_BUDDY_SZ_LOG2; ++i)
    /* Add one here to be conservative in case the division by 8 had
       a remainder. */
    accum += (r.extent >> i) / 8 + 1;
  return accum;
}

Users of the buddy allocator will need to provide storage for the bitmaps it uses.

We provide a function to inform the user how large this storage needs to be. This depends on the range of addresses being covered, and equates to the sum of the number of bits needed to cover the address range when it is subdivided into blocks of each size.

We maintain bitmaps for block sizes ranging from 2^MIN_BUDDY_SZ_LOG2 to 2^MAX_BUDDY_SZ_LOG2.
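
As a worked example (the actual values of MIN_BUDDY_SZ_LOG2 and MAX_BUDDY_SZ_LOG2 are defined elsewhere, so these numbers are illustrative): if MIN_BUDDY_SZ_LOG2 were 12 (4KB blocks) and the range covered 4GB, the 4KB-order bitmap alone would need 2^32 / 2^12 = 2^20 bits, or 128KB; each larger block size needs half as many bits as the one below it, so the total comes to a little under 256KB.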

src/adt/buddy.c
int buddy_init(buddy_t *bd, uint8_t *overhead_storage,
               range_t r, int start_freed) {
  bd->start = r.start;
  bd->size  = r.extent;

  for (unsigned i = 0; i < NUM_BUDDY_BUCKETS; ++i) {
    unsigned nbits = bd->size >> (MIN_BUDDY_SZ_LOG2 + i);
    bitmap_init(&bd->orders[i], overhead_storage, nbits);
    overhead_storage += nbits / 8 + 1;
  }

  if (start_freed != 0)
    buddy_free_range(bd, r);

  return 0;
}

Next we can initialise a buddy allocator. We use the overhead_storage argument to store our bitmaps in, and allow the user to choose whether the memory should start as free or as allocated.

Note that I have defined a simple bitmap ADT, but am not going to show the implementation - that is left as an exercise to the reader!
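
For reference, here is a minimal sketch of what such a bitmap ADT could look like, matching the way it is used in this chapter - the structure layout and exact prototypes are assumptions rather than the project’s actual implementation:

#include <stdint.h>
#include <string.h>

typedef struct bitmap {
  uint8_t *data;   /* caller-provided backing storage */
  uint64_t max;    /* number of bits */
} bitmap_t;

void bitmap_init(bitmap_t *b, uint8_t *storage, uint64_t max) {
  b->data = storage;
  b->max  = max;
  memset(storage, 0, max / 8 + 1);   /* start with every bit clear */
}

void bitmap_set(bitmap_t *b, unsigned idx) {
  b->data[idx / 8] |= 1 << (idx % 8);
}

void bitmap_clear(bitmap_t *b, unsigned idx) {
  b->data[idx / 8] &= ~(1 << (idx % 8));
}

int bitmap_isset(bitmap_t *b, unsigned idx) {
  return (b->data[idx / 8] >> (idx % 8)) & 1;
}

/* Index of the first set bit, or -1 if none are set. A real version
   would scan a word at a time rather than bit by bit. */
int64_t bitmap_first_set(bitmap_t *b) {
  for (uint64_t i = 0; i < b->max; ++i)
    if (bitmap_isset(b, i))
      return (int64_t)i;
  return -1;
}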

src/adt/buddy.c
uint64_t buddy_alloc(buddy_t *bd, unsigned sz) {

Now we get to the meat and bones of the buddy allocator - allocation!

src/adt/buddy.c
  unsigned log_sz = log2_roundup(sz);
  if (log_sz > MAX_BUDDY_SZ_LOG2)
    panic("buddy_alloc had request that was too large to handle!");

  unsigned orig_log_sz = log_sz;

Firstly we find the smallest power of 2 that will hold the allocation request, and take the log base 2 of it.
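
log2_roundup is not defined in this chapter (it presumably comes from the math.h included above); a simple version might look like this sketch, which is not necessarily the project’s implementation:

/* Smallest n such that 2^n >= x; assumes x >= 1. */
unsigned log2_roundup(unsigned x) {
  unsigned n = 0;
  while ((1ULL << n) < x)
    ++n;
  return n;
}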

src/adt/buddy.c
  /* Search for a free block - we may have to increase the size of the
     block to find a free one. */
  int64_t idx = -1;
  while (log_sz <= MAX_BUDDY_SZ_LOG2) {
    idx = bitmap_first_set(&bd->orders[log_sz - MIN_BUDDY_SZ_LOG2]);
    if (idx != -1)
      /* Block found! */
      break;
    ++log_sz;
  }

  if (idx == -1)
    /* No free blocks :( */
    return ~0ULL;

Then we try to find a free block of this size. This involves searching the right bitmap for a set bit. If there are no set bits, we increase the size of the block we’re searching for.

src/adt/buddy.c
  /* We may have to split blocks to get back to a block of
     the minimum size. */
  for (; log_sz != orig_log_sz; --log_sz) {
    int order_idx = log_sz - MIN_BUDDY_SZ_LOG2;

    /* We're splitting a block, so deallocate it first... */
    bitmap_clear(&bd->orders[order_idx], idx);

    /* Then set both its children as free in the next order. */
    idx = INC_ORDER(idx);
    bitmap_set(&bd->orders[order_idx-1], idx);
    bitmap_set(&bd->orders[order_idx-1], idx+1);
  }

Now, if we could only find a free block larger than the size we wanted, we’ll have to split it down to the right size.

src/adt/buddy.c
  /* Mark the block as not free. */
  int order_idx = log_sz - MIN_BUDDY_SZ_LOG2;
  bitmap_clear(&bd->orders[order_idx], idx);

  uint64_t addr = bd->start + ((uint64_t)idx << log_sz);
  return addr;
}

/* Returns nonzero if 'addr' is aligned to a 2^lg2 byte boundary. */
static int aligned_for(uint64_t addr, uintptr_t lg2) {
  uintptr_t mask = ~( ~0ULL << lg2 );
  return (addr & mask) == 0;
}

By this point we have a free block. We now mark it as allocated, then calculate the address it actually equates to.
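
For example, if bd->start is 0x100000 and the allocation ended up with idx = 3 at the 4KB order (log_sz = 12), the returned address is 0x100000 + (3 << 12) = 0x103000.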

src/adt/buddy.c
void buddy_free(buddy_t *bd, uint64_t addr, unsigned sz) {
  uint64_t offs = addr - bd->start;
  unsigned log_sz = log2_roundup(sz);
  unsigned idx = offs >> log_sz;

  while (log_sz >= MIN_BUDDY_SZ_LOG2) {
    int order_idx = log_sz - MIN_BUDDY_SZ_LOG2;

    /* Mark this node free. */
    bitmap_set(&bd->orders[order_idx], idx);

    /* Can we coalesce up another level? */
    if (log_sz == MAX_BUDDY_SZ_LOG2)
      break;

    /* Is this node's buddy also free? */
    if (bitmap_isset(&bd->orders[order_idx], BUDDY(idx)) == 0)
      /* no :( */
      break;

    /* FIXME: Ensure max(this, buddy) wouldn't go over the max extent
       of the region. */

    /* Mark them both non free. */
    bitmap_clear(&bd->orders[order_idx], idx);
    bitmap_clear(&bd->orders[order_idx], BUDDY(idx));

    /* Move up an order. */
    idx = DEC_ORDER(idx);
    ++log_sz;
  }

}

Freeing is actually easier, as we never have to worry about splitting blocks.

Note that this function can only free addresses and sizes that are correctly aligned, so it’s only really safe to call this with addresses returned from buddy_alloc().

We simply mark the incoming block as free, then, while we are not at the top level of the tree, check whether its buddy is also free. If so, we mark them both as unavailable and move up the tree one level, where the parent will be marked free in turn.

src/adt/buddy.c
void buddy_free_range(buddy_t *bd, range_t range) {
  uintptr_t min_sz = 1 << MIN_BUDDY_SZ_LOG2;

Finally we have a helper function which will take a range of memory and mark it all as free.

To do this, we need to chunk it up into correctly aligned blocks of maximal size.

src/adt/buddy.c
  /* Ensure the range start address is at least aligned to
     MIN_BUDDY_SZ_LOG2. */
  if (aligned_for(range.start, MIN_BUDDY_SZ_LOG2) == 0) {
    if (range.extent < min_sz)
      return;

    uint64_t old_start = range.start;
    range.start &= ~0ULL << MIN_BUDDY_SZ_LOG2;
    range.start += min_sz;
    range.extent -= range.start - old_start;
  }

Firstly, we use a helper function to check whether the range’s start address is aligned to a multiple of the smallest block size. If it isn’t, we round the start up to the next aligned address and shrink the extent accordingly.

src/adt/buddy.c
  while (range.extent >= min_sz &&
         aligned_for(range.start, MIN_BUDDY_SZ_LOG2)) {

    for (unsigned i = MAX_BUDDY_SZ_LOG2; i >= MIN_BUDDY_SZ_LOG2; --i) {
      uintptr_t sz = 1 << i;
      uint64_t start = range.start - bd->start;

      if (sz > range.extent || aligned_for(start, i) == 0)
        continue;

      range.extent -= sz;
      range.start += sz;
      buddy_free(bd, start + bd->start, sz);
      break;
    }

  }
}

Now, we iteratively work through the range, at each step freeing the largest correctly aligned block that still fits.
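
For example, assuming a 4KB minimum block size and bd->start = 0, freeing the range 0x5000..0x12000 produces a 4KB block at 0x5000, an 8KB block at 0x6000, a 32KB block at 0x8000 and an 8KB block at 0x10000 - at each step the largest size that both fits in the remaining extent and is correctly aligned at the current start address.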

src/pmm.c
extern range_t  early_ranges[64];
extern unsigned early_nranges;
extern uint64_t early_max_extent;

static spinlock_t lock = SPINLOCK_RELEASED;
static buddy_t allocators[3];

/* Split the range *r at address 'loc': return the portion below 'loc'
   (possibly empty) and leave the remainder, at or above 'loc', in *r. */
static range_t split_range(range_t *r, uint64_t loc) {
  range_t ret;

  if (r->start >= loc) {
    ret.extent = 0;
  } else if (r->start + r->extent <= loc) {
    ret = *r;
    r->extent = 0;
  } else {
    ret.start = r->start;
    ret.extent = loc - r->start;

    r->start = loc;
    r->extent = r->extent - ret.extent;
  }

  return ret;
}

uint64_t alloc_page(int req) {
  return alloc_pages(req, 1);
}

uint64_t alloc_pages(int req, size_t num) {
  dbg("alloc_pages: get lock\n");
  spinlock_acquire(&lock);
  dbg("alloc_pages: got lock\n");
  uint64_t val = buddy_alloc(&allocators[req], num * get_page_size());
  dbg("alloc_pages: first attempt returned %x\n", val & 0xFFFFFFFF);
  /* An unconstrained request is served from the >4GB allocator first;
     if that fails, fall back to the <4GB allocator. */
  if (val == ~0ULL && req == PAGE_REQ_NONE)
    val = buddy_alloc(&allocators[PAGE_REQ_UNDER4GB], num * get_page_size());
  dbg("alloc_pages: returning %x\n", val & 0xFFFFFFFF);

  spinlock_release(&lock);
  return val;
}

int free_page(uint64_t page) {
  return free_pages(page, 1);
}

int free_pages(uint64_t pages, size_t num) {
  spinlock_acquire(&lock);

  int req = PAGE_REQ_NONE;
  if (pages < 0x100000)
    req = PAGE_REQ_UNDER1MB;
  else if (pages < 0x100000000ULL)
    req = PAGE_REQ_UNDER4GB;

  buddy_free(&allocators[req], pages, num * get_page_size());

  spinlock_release(&lock);
  return 0;
}

int init_physical_memory() {
  assert(pmm_init_stage == PMM_INIT_EARLY &&
         "init_physical_memory_early must be called first!");

  /* The address ranges covered by each of the three allocators. */
  range_t rs[3];
  rs[PAGE_REQ_UNDER1MB].start = 0x0;
  rs[PAGE_REQ_UNDER1MB].extent = MAX(MIN(early_max_extent, 0x100000), 0);

  rs[PAGE_REQ_UNDER4GB].start = 0x100000;
  rs[PAGE_REQ_UNDER4GB].extent = MAX(MIN(early_max_extent, 0x100000000ULL) -
                                     0x100000, 0);

  rs[PAGE_REQ_NONE].start = 0x100000000ULL;
  rs[PAGE_REQ_NONE].extent = (early_max_extent > 0x100000000ULL) ?
    early_max_extent - 0x100000000ULL : 0;

  size_t overheads[3];
  overheads[0] = buddy_calc_overhead(rs[PAGE_REQ_UNDER1MB]);
  overheads[1] = buddy_calc_overhead(rs[PAGE_REQ_UNDER4GB]);
  overheads[2] = buddy_calc_overhead(rs[PAGE_REQ_NONE]);

  size_t bitmap_sz = overheads[0] + overheads[1] + overheads[2];
  size_t bitmap_sz_pages =
    round_to_page_size(bitmap_sz) >> get_page_shift();

  /* Map enough fresh pages to hold all three allocators' bitmaps. */
  for (unsigned i = 0; i < bitmap_sz_pages; ++i)
    assert(map(MMAP_PMM_BITMAP + i * get_page_size(),
               early_alloc_page(), 1, PAGE_WRITE) == 0);

  int ok;
  ok  = buddy_init(&allocators[PAGE_REQ_UNDER1MB],
                   (uint8_t*) MMAP_PMM_BITMAP,
                   rs[PAGE_REQ_UNDER1MB], 0);
  ok |= buddy_init(&allocators[PAGE_REQ_UNDER4GB],
                   (uint8_t*) (MMAP_PMM_BITMAP + overheads[0]),
                   rs[PAGE_REQ_UNDER4GB], 0);
  ok |= buddy_init(&allocators[PAGE_REQ_NONE],
                   (uint8_t*) (MMAP_PMM_BITMAP + overheads[0] +
                               overheads[1]),
                   rs[PAGE_REQ_NONE], 0);
  if (ok != 0) {
    dbg("buddy_init failed!\n");
    return 1;
  }

  /* Hand every free range found during early boot to the appropriate
     allocator(s), splitting ranges that straddle the 1MB or 4GB boundaries. */
  for (unsigned i = 0; i < early_nranges; ++i) {
    if (early_ranges[i].extent == 0)
      continue;

    range_t r = split_range(&early_ranges[i], 0x100000);
    if (r.extent > 0)
      buddy_free_range(&allocators[PAGE_REQ_UNDER1MB], r);

    r = split_range(&early_ranges[i], 0x100000000ULL);
    if (r.extent > 0)
      buddy_free_range(&allocators[PAGE_REQ_UNDER4GB], r);

    if (early_ranges[i].extent > 0)
      buddy_free_range(&allocators[PAGE_REQ_NONE], early_ranges[i]);
  }

  pmm_init_stage = PMM_INIT_FULL;

  return 0;
}