Memory


Introduction

Not that this wasn't explained to me in undergrad, but I don't think I was paying attention. This post covers two, maybe three-ish (depending on how you count) memory regions.

1 | Stack

The stack is fast, and gets its name from the data structure which obeys Last In, First Out (LIFO) behavior. A stack is fixed and finitely sized, implying that we can run out of it, which gives rise to the infamous Stack Overflow.

When a program is executed, the operating system allocates an execution stack for it. When a program tries to write to memory outside the stack (or the other legal memory regions we'll discuss later in section 2), we get a Segmentation Fault.
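For instance, dereferencing an address that was never handed to us is the canonical way to trigger one. A minimal sketch (the exact signal and message vary by platform, but on a typical Unix-like system this dies with SIGSEGV):

int main() {
	int *p = (int *) 0x1;  /* an address in no region the OS gave us */
	*p = 42;               /* illegal write: Segmentation fault */
}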

Why a Stack?

Consider the following program:

#include <stdint.h>

int main() {
	uint8_t  i = 10;
	uint16_t j = 40000;
	uint8_t  k = 200;
	uint64_t l = 15;
}

If we place data haphazardly into the program's fixed memory buffer, with i going to memory address 0x0001, the two-byte variable j occupying 0x0005 and 0x0006, and so on:

even though the total amount of available space might be enough to contain all the data, this external fragmentation prevents us from storing all the program data, since there isn't a large enough contiguous chunk of memory for the 8-byte-wide variable l.

When this occurs, the program would have to request another chunk of memory from the OS which is a prohibitively expensive process as we'll see shortly. If we treat the fixed memory region as a stack instead of a random bag, appending data to the top, we can fit all the program's data into the single initially allocated chunk with room to spare:
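One way to convince yourself of this on your own machine is to print the addresses of the locals. The exact ordering and padding are compiler-dependent (so treat this as a sketch, not a guarantee), but all four variables land within a single small contiguous region:

#include <stdint.h>
#include <stdio.h>

int main() {
	uint8_t  i = 10;
	uint16_t j = 40000;
	uint8_t  k = 200;
	uint64_t l = 15;

	/* all four addresses fall within a few dozen bytes of each other */
	printf("i=%p j=%p k=%p l=%p\n",
	       (void *) &i, (void *) &j, (void *) &k, (void *) &l);
}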

Petitioning the OS for more memory is slow, hence we prefer to avoid the heap and slower memory devices when at all possible.

Programs are fenced into their own memory regions: even if there's enough contiguous space in one program's stack for another program's requested buffer, the OS will respect the sovereignty of each program's stack and instead allocate a separate region for the second program. But memory is finite, so when the OS exhausts main memory & Shortbus Johny comes along requesting a memory region for his program, modern systems will treat any available disk storage as an extension of main memory – albeit painfully slow memory.

Memory devices have varying sizes and speeds:

| name | registers | cache | main memory | disk storage |
| --- | --- | --- | --- | --- |
| typical size | < 1 KB | < 16 MB | < 64 GB | > 100 GB |
| implementation | custom memory with multiple ports, CMOS | on- or off-chip CMOS SRAM | CMOS DRAM | magnetic disk |
| access time (ns) | 0.25 - 0.5 | 0.5 - 25 | 80 - 250 | 5,000,000 |
| bandwidth (MB/s) | 20,000 - 100,000 | 5,000 - 10,000 | 1,000 - 5,000 | 20 - 150 |
| managed by | compiler | hardware | OS | OS |
| backed by | cache | main memory | disk | CD or tape |

1

So when Shortbus Johny's program initializes, the OS relegates its stack to the dark corners of virtual memory which, though effectively infinite in size (we'll see how this is possible in section 3), is penalized in performance.

Even just reading from main memory incurs a performance hit.

int c = a + b;

The preceding C code compiles to something like the following pseudo-assembly:

ld  r0, [a];     load a to register 0
ld  r1, [b];     load b to register 1
add r2, r0, r1;  add r0 and r1, storing the result in r2
str r2, [c];     store the result in c

Each of those ld/str instructions accesses memory which is "far away" from the CPU. To get around these comparatively costly memory accesses, most systems have dedicated hardware caches within the CPU which store copies of regions of memory, optimistically hoping (and predicting2) that data in those memory regions will be accessed again shortly and can thus be retrieved quickly from the cache. When the optimistic assumption is fulfilled, we call it a cache hit. When the desired value is not cached, the CPU still has to make the arduous journey across the memory bus to pull the data from memory (or worse, an external storage device). This is called a cache miss.

Programs can optimize for cache hits by laying their data out compactly, such that as much of it as possible can be retrieved from a single chunk of contiguous memory that fits in a CPU cache, rather than via round trips to disparate memory locations.3
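The classic demonstration is matrix traversal order. C stores 2D arrays row-major, so scanning the inner index walks memory sequentially and stays within cached lines, while swapping the loops strides across rows and misses far more often (a sketch; the actual speedup depends on your cache hierarchy):

#define N 1024
static int a[N][N];

long sum_rows(void) {
	long sum = 0;
	for (int i = 0; i < N; i++)
		for (int j = 0; j < N; j++)
			sum += a[i][j];  /* sequential, cache-friendly access */
	return sum;
}

long sum_cols(void) {
	long sum = 0;
	for (int j = 0; j < N; j++)
		for (int i = 0; i < N; i++)
			sum += a[i][j];  /* strides N*sizeof(int) bytes per step */
	return sum;
}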

The stack solves the need for spatial locality by enforcing the aforementioned LIFO access pattern. A program receives a fixed amount of contiguous memory from the OS, the size of which is specified by the compiler used.4 Stacks are usually on the order of, at most, a few megabytes, which is sufficient to hold hundreds of thousands of 64-bit integers:

$$1 \text{ MB} = 1{,}048{,}576 \text{ bytes} \times \frac{1 \text{ uint64\_t}}{8 \text{ bytes}} = 131{,}072 \text{ uint64\_t's}$$

If that's not enough, then I think you're running into a skill issue.

We did literally go to the moon without a heap.5
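If you want to know exactly how much stack you've been given before you hit that skill issue, POSIX systems expose the soft limit via getrlimit(2), the programmatic cousin of ulimit -s:

#include <stdio.h>
#include <sys/resource.h>

int main() {
	struct rlimit rl;
	/* RLIMIT_STACK: the maximum size of the process's stack, in bytes */
	if (getrlimit(RLIMIT_STACK, &rl) == 0)
		printf("stack limit: %llu bytes\n",
		       (unsigned long long) rl.rlim_cur);
}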

When a program is executed, the OS searches for a contiguous region of memory to serve as its stack, and returns the address of the stack origin. This effectively communicates the size of the stack to the program (26 bytes, in this example). A program's main function is the typical entry point to the stack.

// Program A
#include <stdint.h>

int main() {
	uint8_t  a = 12;
	uint16_t b = 100;
	uint16_t c = (uint16_t) a + b;
}

In modern systems, the value of the stack pointer is copied into two dedicated registers, since we know we'll be referencing them frequently. One (often called the frame or base pointer) remains fixed to track the stack origin; the other changes with the stack pointer as values are pushed onto the stack.

As each value in the main function (or in a function invoked by main) is encountered, it is pushed onto the stack at the current SP, and the Stack Pointer is advanced by the sizeof the value pushed. After our main function has executed, just before the program terminates, its execution stack and registers will look like this:

This process is significantly faster and more straightforward than heap allocation which is a multi-step process involving:

  1. paging the OS for memory
  2. waiting for the OS to find memory
  3. reconciling fragmentation

When a function is called, all of its local variables are pushed onto the stack. The CPU records the starting location of the sub-program's stack (called a stack frame), indicated by its own SP, so that the CPU can restore to this point after the function terminates.
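In the same spirit of pseudo-assembly as before, a typical function prologue/epilogue looks roughly like this (register names and the direction of stack growth vary by ISA):

push fp           ; save the caller's frame pointer
mov  fp, sp       ; this frame begins at the current stack pointer
sub  sp, sp, #N   ; reserve N bytes for locals
...               ; function body
mov  sp, fp       ; discard the locals
pop  fp           ; restore the caller's frame
ret               ; return to the caller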

When a function returns a value, a return link is first pushed onto the calling function's stack frame to earmark the location of the callee's return value. This is why many systems-level programming languages require that functions specify their return type: so that a return link of appropriate width can be allocated on the stack before the function is executed:

#include <stdint.h>

uint16_t compute_two() {
	uint8_t a = 1;
	uint8_t b = 1;
	return (uint16_t) a + (uint16_t) b;
}

int main() {
	uint8_t  a = 12;
	uint16_t b = 100;
	uint16_t c = (uint16_t) a + b;
	uint16_t two = compute_two();
	uint16_t d = 10;
}

When the program is executed, the return link of the compute_two function is pushed to the stack, followed by a new stack frame for the function.

Limitations of the Stack

The fixed nature of the stack prevents us from using dynamically sized data structures. Arrays on the stack must be fixed in size; if we were to extend an array, we'd run the risk of overwriting whatever data exists above it on the stack (again, hence the LIFO restriction on stack access). Additionally, any arrays (or large data structures) must be much smaller than the fixed size of the stack itself, lest we Stack Overflow.

This also provides some insight into the stack overflow exceptions caused by recursive functions. In the absence of a base case, or even with a base case that is asymptotically unreachable w.r.t. the stack size, we push more and more stack frames onto the stack until eventually we overflow.
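A sketch of the pathological case (compile without optimizations, since tail-call elimination can otherwise keep this alive indefinitely):

/* no base case: every call pushes a fresh frame until the
   stack is exhausted and the process dies with a stack overflow */
void recurse(void) {
	recurse();
}

int main() {
	recurse();
}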

In general, compact data organization minimizes fragmentation and maximizes the likelihood of cache hits. The stack is a one-time (and therefore fast) allocated region of memory, and its LIFO interface makes it easy to handle function calls and local data. Its main drawback is its fixed size, and the inability to handle dynamically sized data types.

For big and/or dynamic data, we turn to the heap.

2 | Heap

The OS functions as an intermediary between applications and hardware (which, for this post, mainly means storage devices). Consider the following program:

#include <stdio.h>

int main() {
	char a[] = "hello";
	char b[] = "world";
	printf("%s %s", a, b);
}

The strace command allows us to intercept the instructions executed by a given process to see how our code is translated into lower-level system calls to the operating system. The strace of a program like the one presented above shows that printf expands to a few functions which format the string arguments and ultimately call write(2):

ssize_t write(int fd, const void *buf, size_t nbyte);

which itself wraps a syscall to write to the standard output (fd=1):

syscall(__NR_write, 1, "hello world", 11)

A syscall is an OS-level API to do those things that programs are not able to do themselves directly.
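We can also skip printf's formatting machinery entirely and make the write(2) call ourselves:

#include <unistd.h>

int main() {
	/* one write straight to stdout (fd 1); 12 bytes, newline included */
	write(1, "hello world\n", 12);
}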

Syscalls can be expensive because they require us to:

  1. capture the state of our process, known as the execution context, which is represented in CPU registers,
  2. store it in memory,
  3. yield to the OS to perform the syscall,
  4. store the syscall's result in memory,
  5. restore the execution context of our process,
  6. and finally retrieve the result of the syscall.

Even in this oversimplification, we see that several memory accesses are required for something as simple as e.g. I/O.

Context switching like this, whether between threads, processes, or even just large and fragmented swaths of code, is resource-expensive & slow.

Requests for memory are system calls (brk, sbrk, mmap, to name a few common ones). Stacks are fast because they necessitate only a single syscall/pair of context switches upon process startup to initialize the program's allocated chunk.
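We can watch the program break move with one of those calls, sbrk (deprecated on some platforms in favor of mmap, but still illustrative):

#include <stdio.h>
#include <unistd.h>

int main() {
	void *before = sbrk(0);   /* current program break */
	sbrk(4096);               /* ask the OS to grow the heap by 4 KB */
	void *after = sbrk(0);
	printf("break moved from %p to %p\n", before, after);
}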

A process's memory layout resembles the following:

It's worth noting that a program's arguments and all of its environment variables sit just beyond the stack and are reachable by overrunning it (in C, at least), which throws some people for a loop.6

Dynamic Memory API

The API to request and free dynamic chunks of memory in/from the heap consists of malloc() and free().

malloc returns a chunk of memory (if one is available) at least as big as the amount requested. Heap space is "virtually" unlimited (see section 3): we can request as much as we want, whereas the stack is finite.

The heap solves the first limitation of the stack: the inability to store large amounts of dynamically-sized data. Each byte of memory is addressable via a numeric value, and it's typically safe to assume that a modern system's addressable space is expressible as a single fixed-size word integer known at compile time. The definition of a word is system-dependent. getconf(1) can be used to tell us information about our system:

$ getconf WORD_BIT
32

which means my computer has 32-bit = 4-byte = int-sized words, which is standard. This means that I can reference $2^{32}-1$ distinct addresses with a single word, where each cell itself can store a single word's worth of data.

With this in mind, rather than pushing data directly to the stack, we can instead opt to store data on the heap, and just store the starting address of that data on the stack:

Numeric values representing memory addresses are called pointers. Note that the pointer is just the memory address. It doesn't contain any information about the size or width of the data that begins at that address.
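In C, this is just malloc plus a local pointer variable: the pointer itself (one word) lives on the stack, while the pointee lives on the heap:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
	uint64_t *p = malloc(sizeof *p);  /* p on the stack, *p on the heap */
	*p = 42;
	printf("pointer on stack: %p -> value on heap: %llu\n",
	       (void *) p, (unsigned long long) *p);
	free(p);
}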

Free Lists

Our program retains some record of the locations in the heap that are available:

| available | occupied |
| --- | --- |
| [0x4000, 0x4004] | |
| | [0x4005, 0x4008] |
| [0x4009, 0x400F] | |

Storing data in the heap doesn't always require paging the OS for more memory chunks. If there's enough contiguous leftover space in the free list from previous allocations, then we can just insert the data & mark those regions of the heap as occupied.

E.g. a uint48_t can be defined as follows (packed, so that it actually occupies 6 bytes rather than being padded out to 8):

typedef struct __attribute__((packed)) {
	uint64_t value : 48;
} uint48_t;

and allocated on the current heap without requiring any additional chunks from the OS:

and update the freelist accordingly:

| available | occupied |
| --- | --- |
| [0x4000, 0x4004] | |
| | [0x4005, 0x4008] |
| | [0x4009, 0x400E] |
| [0x400F, 0x400F] | |

However, despite the fact that we have enough total free space in the heap for another uint48_t, we've once again run into the problem of fragmentation, since the 6 free bytes required are not contiguous. Unlike on the stack, where only the "topmost" datum can be popped (which enforces compactness), there are no such guarantees within the heap. So we run into many scenarios where

$$\sum \text{available} \geq \text{requested}$$

just not contiguously, so we must request another page of memory from the OS:

Even if we take pains to insert data in a stack-like manner to the heap, there are no guarantees about the order of removal of those regions, and so over the course of a program's lifespan as memory is allocated and freed, external fragmentation is nearly unavoidable, whereas it's impossible on the stack.

We can make use of these fragmentation gaps by exhausting the freelist before requesting an additional memory chunk from the OS. We can even go so far as to organize those free regions into buckets grouped by size for faster availability lookups, retrieving from our free list according to some strategy such as first-fit or best-fit to optimize for lookup time or fragmentation, respectively.

However, regardless of allocation strategy, fragmentation is still possible, since we're entirely at the mercy of the sizes being requested.

The API we commonly use to manage memory in this way consists of the aforementioned

void* malloc(size_t size);

which checks the heap's freelist for a free region of at least size bytes, updates the freelist accordingly, and returns a pointer to the starting address of that region.

The other most common memory management function is the inverse of malloc:

void free(void *ptr);

which marks the region as free and also attempts to coalesce any surrounding free blocks,

e.g. if we had a dynamically allocated variable

uint32_t* b = malloc(sizeof(uint32_t)); // returns a pointer to 0x4005
// do some stuff with b

our freelist would then resemble:

| available | occupied |
| --- | --- |
| [0x4000, 0x4004] | |
| | [0x4005, 0x4008] |
| | [0x4009, 0x400E] |
| [0x400F, 0x400F] | |

// having done some stuff with b
free(b);

The free list would move the block [0x4005, 0x4008] from the occupied column to the available column:

| available | occupied |
| --- | --- |
| [0x4000, 0x4004] | |
| [0x4005, 0x4008] | |
| | [0x4009, 0x400E] |
| [0x400F, 0x400F] | |

and coalesce it with the neighboring block [0x4000, 0x4004]:

| available | occupied |
| --- | --- |
| [0x4000, 0x4008] | |
| | [0x4009, 0x400E] |
| [0x400F, 0x400F] | |

What about dynamically-sized values? Like pushing to an ArrayList which is already at capacity? What's stopping us from maintaining the principle of compactness on the stack by popping everything above our ArrayList off the stack into memory, appending to the ArrayList, and then re-pushing everything back on?

We would need to temporarily allocate memory on the order of the size of the stack (expensive), write to the now-freed area of interest on the stack, and finally push everything back from the heap onto the stack, which is most certainly not an expeditious sequence of operations. This may still require a syscall as well, to allocate the temporary swap/staging area. All together, it's a $2 \cdot O(n)$ procedure in the worst case, where we have an ArrayList of size 1 with an almost-completely-full stack of $n-1$ elements on top of it, and wish to append to the ArrayList. We incur the cost of popping everything else off the stack to some swap space on the order of half-a-dozen megabytes (good luck finding that in your freelist, dork), and then re-writing $n-2$ bytes back to the stack.

Note that the heap doesn't inherently solve this problem: we may still run into the same issue when extending an array contiguously, which may run over into neighboring data. Linked lists get around this by alleviating the burden of finding contiguous memory, and the potential penalty of a syscall if no such available block exists. The drawback is that such a data structure no longer allows for constant-time indexing, and though linked list nodes may fall neatly into existing gaps in the heap, their distribution throughout memory (subject to the malloc freelist strategy) will likely hinder caching as well. Vectors blend the benefits of both by linking contiguous arrays.
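A sketch of the node shape that buys this flexibility: each node carries its payload plus the address of the next node, so nodes can be scattered wherever the freelist happens to have room:

#include <stdint.h>

struct node {
	uint64_t value;
	struct node *next;  /* pointer-chasing: no O(1) indexing, weaker locality */
};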

Limitations of the Heap

Performance wise:

  • dynamic allocations can often be serviced from the freelist, avoiding the OS entirely
  • but dynamic memory requests may still incur the penalty of a syscall if no sufficiently large free chunk exists

Runtime Risks

  • (people who are bad at) dynamic memory allocations are prone to memory leaks
  • null pointer errors
  • dangling pointers

And the heap can be dreadfully slow if used poorly.

3 | Virtual Memory

The most sophisticated solution to the problem of space, fragmentation, and security is virtual memory.

Most "pre-modern" CPUs could support at least 4 GB of RAM, derived from an architecture possessing 32-bit registers and 1-byte addresses.

$$\begin{aligned} 2^{32} \text{ addresses} &= 2^{32} \text{ bytes} \\ &= 4 \times 2^{10} \times 2^{10} \times 2^{10} \text{ bytes} \\ &= 4 \times 2^{10} \times 2^{10} \text{ kilobytes} \\ &= 4 \times 2^{10} \text{ megabytes} \\ &= 4 \text{ gigabytes} \end{aligned}$$

Modern 64-bit architectures imply an addressable space of $2^{64}$ bytes, which is roughly 16 million terabytes (the theoretical limit of a 64-bit architecture). For the following examples, we'll consider a 32-bit architecture with 2 GB of RAM, or $2^{31}$ bytes of addressable space.

For our virtual memory inquest, we want to understand what happens when a program (some unfathomably resource intensive application, like Google Chrome) tries to access data outside of its [0x00000000, 0x7FFFFFFF] bounds.

The program will crash because this memory address is invalid.

Previously, we discussed memory fragmentation internal to a single process, but the same pitfall is possible at the OS level, with multiple programs entering into and exiting out of existence:

Despite the fact that there is enough free RAM to launch IntelliJ once uTorrent has been terminated, that memory may not be contiguous! Writing programs which manage variables in split memory regions is prohibitively difficult. A global array of all program data (like the stack, for statically allocated applications) wouldn't be indexable without some sort of complicated indirection. Spoiler alert: that layer of complicated indirection is in fact virtual memory, but it's shoved down inside the OS so that clients of the OS don't have to worry about it.

Additionally, with respect to security concerns, simply sharing memory between applications is not an option because that would allow for the possibility of e.g. Chrome overwriting addresses needed by uTorrent and vice-versa, which could corrupt uTorrent's state and ruin my viewing experience of Die Hard 2.

Virtual Memory solves all of these issues by creating a map for each program's memory to a section of physical RAM.

This sort of layout raises the question: what happens when a program tries to access memory that does not map to a physical address due to hardware constraints? The map might indicate that that address has been offloaded to disk. The OS would then evict the least-recently-accessed data from physical memory, load the requested value from disk into memory, and update the virtual memory mapping. Disk space used as auxiliary memory in this manner is called swap.

When data is not in RAM and must be retrieved from disk, this is called a page fault. If it sounds sub-optimal, that's because it is. Recall that this is slow as fuck, since it consists of disk I/O + updating the map + simply accessing the desired memory.


Still, a 1000x slower operation is infinitely preferable to the alternative of crashing out. Intuitively, the number of page faults that might occur over any given period of time is inversely proportional to the amount of physical RAM we're working with:

$$|\text{RAM}| \propto \frac{1}{\#\text{ page faults}}$$

So, by leveraging disk space with clever mappings tracking which areas of memory have been relegated to the slower surplus storage, we've solved the issue of space.

What about fragmentation? Virtual memory maps the available physical chunks into virtually contiguous space, at the overhead cost of a mapping lookup on each access.

With respect to security, multiple programs are now able to write to the same virtual address provided that they're mapped to different physical locations:

Complete isolation is still not optimal though, since we'd like to be able to share memory regions between programs in some cases, just not as the rule. For example, we ought to be able to arbitrarily access /usr/local/bin or libraries like glibc from any program without faulting to disk, or worse, making a network call to clone the whole easily-shared resource between programs.

Virtual Memory Implementation

These intermediate maps between a process's memory addresses and physical storage locations are called Page Tables.

Page tables consist of 1-word entries for each word in the physical address space. In our 4 GB physical RAM example, we have $2^{32}$ addresses, one for each byte, which is $2^{30}$ words (with, e.g., 32-bit = 4-byte words), meaning each page table contains about one billion entries. That means we need 4 GB per table, per program...

This is obviously untenable, so rather than mapping each individual byte, we divide memory into chunks, or pages like so:

You can view your machine's actual page size like so:

$ getconf PAGESIZE 
16384 # bytes
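sysconf(3) is the programmatic equivalent, if you'd rather not shell out:

#include <stdio.h>
#include <unistd.h>

int main() {
	printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
}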

The tradeoff here is that we now swap whole pages of data in and out of main memory, rather than individual words at a time – but this is generally okay and even beneficial for caching.

Indexing into a physical address from a virtual address is as simple as adding a physical address offset to the virtual address:

Let's take a look at a more comprehensive example. Suppose we have a 32-bit virtual address space, and a 1 GB, or 30-bit, physical address space. If we assume a page size of 4 KB, we need 12 bits for the offset within a page:

To translate, the CPU queries the page table with bits 31:12 (the virtual page number) for the 18 bits of the physical page number, and appends bits 11:0 as the offset into that page.
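The bit-twiddling itself is mechanical. A sketch with a made-up virtual address and a made-up translation (the ppn here is hypothetical; in reality it comes from the page table walk):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                       /* 4 KB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

int main() {
	uint32_t vaddr  = 0xDEADBEEF;
	uint32_t vpn    = vaddr >> PAGE_SHIFT;  /* bits 31:12 index the page table */
	uint32_t offset = vaddr & PAGE_MASK;    /* bits 11:0 pass through */

	uint32_t ppn   = 0x2ABCD;               /* pretend the page table said so */
	uint32_t paddr = (ppn << PAGE_SHIFT) | offset;
	printf("vpn=0x%05x offset=0x%03x paddr=0x%08x\n", vpn, offset, paddr);
}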

Page Faults

Page faults occur when a page table entry maps to a disk location rather than a physical address (in main memory). For example, if virtual addresses in Page 4 have not been accessed recently and have therefore been offloaded to disk, then when the CPU tries to access a virtual address on that page, a page fault exception is raised by the CPU's Memory Management Unit. The OS then chooses which page in the physical address space to evict (e.g., the least-recently-used). If that page has been changed since it was last loaded from disk, it is said to be "dirty," and must be written back to disk. Otherwise, the OS can simply overwrite it directly with the data corresponding to the requested virtual page. The OS then updates the page table to reflect that the evicted page is now stored on disk, and the requested virtual page will now map to the physical address space of the now-evicted page.

Because faults are slow, the CPU typically executes some other task in the meantime until the OS signals that the requested virtual page is now in RAM.

Cost

We can now consider the cost of virtual memory. Especially in light of the previous infatuation with speed necessitated by the stack, this indirection seems awfully slow. For every memory access, we now have to (1) find the physical mapping in the page table in RAM and translate the address, then (2) perform the actual access. To mitigate the cost of doubling the number of memory accesses (even in the optimal case where the physical page hasn't been offloaded to disk), most CPUs have an on-board cache called a Translation Lookaside Buffer which caches translations from virtual to physical addresses.

This cache is very small, but very fast, since it only needs to store a few thousand translations at a time. As such, accessing it takes about 1 clock cycle in the average case. In the event of a TLB miss, a few dozen cycles are needed to retrieve the requested translation from the page table in memory. Such misses are infrequent for a given page (typically occurring on first access). Modern CPUs have multiple TLBs, just like they have multiple layers of normal caches.

In the above example, the TLB is empty, so the CPU falls through to the page table to retrieve the physical page number. The offset is copied from the last 12 bits of the virtual address as usual, and the TLB is updated with the contents of the page table entry. The next time the CPU tries to access data on that page, the physical page number will be loaded directly from the TLB, saving us the cost of a main memory access. In the event that the TLB is full and the requested virtual page number is not cached, just like any other scenario covered thus far, the oldest entry is evicted and replaced with the requested mapping.

More Abstraction

A cursory glance at ps -e | wc -l indicates that I currently have 431 programs running, each requiring its own 4 MB page table, which makes it seem as though a non-trivial amount of my physical memory (about 1.7 GB) is occupied by housekeeping overhead. But the MMU can't simply store page tables on disk, otherwise the CPU wouldn't have any references to them in main memory anymore, so where are they being stored?

We can repeat the previous process of abstraction again (and again), storing chunks of pages rather than pages of bytes. One 4 KB chunk can store 1,024 page table entries, and 1,024 such chunks can cover all $2^{20}$ possible page table entries. So we introduce a second level of page table: a mapping from a virtual address to the address where its page table lives. Each layer of tabular abstraction introduces another lookup per access, but saves space.
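A sketch of the two-level walk on a 32-bit address (10 + 10 + 12 bits; l1 is assumed to be a resident array of pointers to second-level tables):

#include <stdint.h>

#define L1_BITS     10
#define L2_BITS     10
#define OFFSET_BITS 12

uint32_t translate(uint32_t **l1, uint32_t vaddr) {
	uint32_t l1_idx = vaddr >> (L2_BITS + OFFSET_BITS);               /* top 10 bits */
	uint32_t l2_idx = (vaddr >> OFFSET_BITS) & ((1u << L2_BITS) - 1); /* next 10 */
	uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);              /* low 12 */
	uint32_t ppn    = l1[l1_idx][l2_idx];  /* two table accesses... */
	return (ppn << OFFSET_BITS) | offset;  /* ...then the real one */
}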

The Linux operating system uses up to 5 levels of page tables, allowing us to theoretically address 64 TB of physical RAM. The implementation is similar to the two-level case: the first 10 bits of the virtual address index into the 1st-level page table, which tells us where the 2nd-level page table lives, and so on, with the final 12 bits serving as the offset into the physical page.

This allows us to offload the deeper levels of page tables to disk, so long as we always keep the first-level page table in memory.

4 | Yacine's Gambit

The dream of dodging all this overhead is achieved in some areas of embedded systems programming, colloquially known as Yacine's Gambit: avoid dynamic memory as much as possible by greedily allocating a maximally-sized "arena" onto the stack on application startup, then implementing your own malloc API to dole memory out of this pool on the stack instead of the heap. Of course, as soon as we exhaust our arena, we'll be forced to crash (which is, as Professor Godmar would put it, "strictly verboten") or begin the arduous journey to the heap.

Implementing our own memory management API gives further insight into the lower levels of systems programming, though, and can be fun! (if you're the type of sadist who enjoys C programming)
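Before the full freelist implementation, here's the gambit in miniature: a bump allocator over a stack-resident pool (a sketch, not the API we'll build below; there is no free, and the whole arena dies with the enclosing frame):

#include <stddef.h>
#include <stdio.h>

#define ARENA_SIZE (1 << 20)  /* 1 MB, comfortably under typical stack limits */

struct arena {
	unsigned char *base;
	size_t used, cap;
};

/* hand out the next 16-byte-aligned slice, or NULL once exhausted */
static void *arena_alloc(struct arena *a, size_t size) {
	size = (size + 15) & ~(size_t) 15;
	if (a->used + size > a->cap)
		return NULL;  /* crash out, or begin the arduous journey to the heap */
	void *p = a->base + a->used;
	a->used += size;
	return p;
}

int main() {
	unsigned char backing[ARENA_SIZE];  /* the pool itself lives on the stack */
	struct arena a = { backing, 0, sizeof backing };

	int *xs = arena_alloc(&a, 100 * sizeof *xs);
	printf("handed out %p from the stack arena\n", (void *) xs);
}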

We'll define our memory management interface in a header file called mm.h as follows:

#include <stddef.h> /* size_t */

extern int mm_init (void);
extern void *mm_malloc (size_t size);
extern void mm_free (void *ptr);
extern void *mm_realloc(void *ptr, size_t size);
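Once implemented, client code drives it like any other allocator (a hypothetical driver; mem_init belongs to the memlib.c model at the end of this section, which backs our "heap" before mm_init carves it up):

#include "mm.h"
#include "memlib.h"

int main() {
	mem_init(0);  /* back the heap model with a plain malloc'd region */
	mm_init();

	int *xs = mm_malloc(100 * sizeof *xs);  /* served from our freelist */
	xs = mm_realloc(xs, 200 * sizeof *xs);
	mm_free(xs);

	mem_deinit();
}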

We'll implement those functions in mm.c, but first we need to model the chunks of memory we want to dole out with our management API. We'll implement a simple freelist consisting of free blocks and allocated blocks:

/* Unallocated memory block */
struct f_block {
    /* offset 0, at address 12 mod 16 */
    struct boundary_tag header;
    /* points to the neighboring blocks in its size class */
    struct list_elem elem;
};

/* Allocated memory block */
struct a_block {
    struct boundary_tag header; /* offset 0, at address 12 mod 16 */
    char payload[0];            /* offset 4, at address 0 mod 16 */
};

7

Each block contains a boundary tag indicating the size of the block and which class of block it is:

/* Used for headers and footers of blocks to indicate size and if inuse */
struct boundary_tag {
    int inuse:1;        // inuse bit
    int size:31;        // size of block, in words
};

We also initialize a specific boundary tag called FENCE to demarcate the region allocated to our memory management API:

/* FENCE is used for heap prologue/epilogue. */
const struct boundary_tag FENCE = {
    .inuse = 1,
    .size = 0
};

This implementation uses a segregated free list approach with n=9 size classes broken up into $2^{4}, 2^{5}, \ldots, 2^{12}$ word divisions, where a word is equivalent to the size of a boundary_tag (4 bytes).

We'll jot those, and some other working config variables down:

/* the number of size classes for blocks */
#define N_SIZE_CLASSES 9
/* Word and header/footer size (4 bytes) */
#define WSIZE sizeof(struct boundary_tag)
/* Minimum block size in words */
#define MIN_BLOCK_SIZE_WORDS 8
/* Extend heap by this amount of words */
#define CHUNKSIZE  (1<<10)

We'll also need some helper methods and definitions:

/* Takes two size_t arguments and returns the larger of the two. */
static inline size_t max(size_t x, size_t y) {
    return x > y ? x : y;
}

/* Payloads are kept aligned to 16-byte boundaries */
#define ALIGNMENT 16

/* Takes a size_t argument and returns the next size_t (greater than or equal
 *  to the provided size) that is evenly divisible by ALIGNMENT.
 */
static size_t align(size_t size) {
  return (size + ALIGNMENT - 1) & ~(ALIGNMENT - 1);
}

/* Global variables */

/* boolean representing if the heap has been initialized */
static bool heap_init = false;
/* array of free lists of different sizes */
static struct list free_lists[N_SIZE_CLASSES];
/* size in words of the respective size classes */
static size_t sizes[N_SIZE_CLASSES];

/* Function prototypes for internal helper routines */
static struct f_block* extend_heap(size_t words);
static void place(struct f_block* bp, size_t asize);
static struct f_block* find_fit(size_t asize);
static struct f_block* coalesce(struct f_block* bp);
static int get_size_class(size_t words);

annnnnd some last helper cruft for working with our blocks:

/* Given an f_block, obtain previous block's footer.
 * Works for left-most block also.
 */
static struct boundary_tag* prev_blk_footer(struct f_block* blk) {
    return &blk->header - 1;
}

/* Takes a pointer to a block and returns true if that block is free.
 * Returns false otherwise.
 */
static bool blk_free(void* blk) {
    struct f_block* freed = (struct f_block*) blk;
    return !freed->header.inuse;
}

/* Takes a pointer to a block and returns its size */
static size_t blk_size(void* ptr) {
    struct f_block* blk = (struct f_block *) ptr;
    return blk->header.size;
}

/* Given a block, obtain pointer to previous block in the heap.
 * Not meaningful for left-most block.
 */
static struct f_block* prev_blk(struct f_block* blk) {
    struct boundary_tag* prevfooter = prev_blk_footer(blk);
    assert(prevfooter->size != 0);
    return ((void *)blk - WSIZE * prevfooter->size);
}

/* Given a block, obtain pointer to next block in the heap.
 * Not meaningful for right-most block.
 */
static struct f_block* next_blk(struct f_block* blk) {
    assert(blk_size(blk) != 0);
    return ((void *) blk + WSIZE * blk->header.size);
}

/* Given a block, obtain its footer boundary tag */
static struct boundary_tag* get_footer(void* blk) {
    struct f_block* block = (struct f_block *) blk;
    return ((void *)block + WSIZE * block->header.size)
                   - sizeof(struct boundary_tag);
}

/* Mark a block as used and set its size (words). */
static void mark_block_used(void* blk, int size) {
    struct f_block* freed = (struct f_block*) blk;
    list_remove(&freed->elem);    /* remove the f_block from its list */

    struct a_block* allocd = (struct a_block* ) blk;
    allocd->header.inuse = true;
    allocd->header.size = size;
    * get_footer(allocd) = allocd->header;   /* copy header to footer */
}

/* Mark a block as free and set its size (words). */
static void mark_block_free(void* blk, int size) {
    struct f_block* freed = (struct f_block*) blk;

    freed->header.inuse = false;
    freed->header.size = size;
    * get_footer(freed) = freed->header; /* copy header to footer */

    int class = get_size_class((size_t) size);

    /* push the extended or coalesced f_block to its list */
    list_push_front(&free_lists[class], &freed->elem);
}

Finally diving into the implementation of our actual API, we'll begin with mm_init(), which needs to initialize our size classes and our boundary tag FENCE, and create our initial free block and "heap":

/* Initializes the N_SIZE_CLASSES free lists, then pushes a free block of size
 *  CHUNKSIZE to the heap and adds it to the appropriate free list.
 */
int mm_init(void) {
    assert (offsetof(struct a_block, payload) == 4);
    assert (sizeof(struct boundary_tag) == 4);

    /* Create the initial empty heap */
    struct boundary_tag * initial = mem_sbrk(4 * sizeof(struct boundary_tag));
    if (initial == (void *)-1)
        return -1;

    initial[2] = FENCE;                     /* Prologue footer */
    heap_init = true;
    initial[3] = FENCE;                     /* Epilogue header */

    /* initialize the free_lists array and the sizes array */
    for(int i = 0; i < N_SIZE_CLASSES; i++) {
        sizes[i] = (size_t) 1 << (i+4); /* min block size = 2^4 words */
        list_init(&free_lists[i]);
    }
    assert(sizes[0] >= MIN_BLOCK_SIZE_WORDS);

    /* Extend the empty heap with a free block of CHUNKSIZE bytes */
    if ((extend_heap(CHUNKSIZE)) == NULL)
        return -1;
    return 0;
}

For the mm_malloc() method, we need to find a block of at least the requested size. We'll pad the found block with enough space for the requisite boundary tags, and extend the heap as necessary if no sufficiently-sized blocks are available in the free list:

/* Allocates a block with at least `size` bytes of payload.
 * `size` is padded with enough room to also include header and footer
 *  boundary_tags.
 * First, the free lists are searched for a free block sufficiently large to
 *  hold the padded size request.
 * If none is found, the heap is extended, and in either case, a pointer to the
 *  payload is returned.
 * Returns NULL if the requested `size` is 0 or if the heap cannot be extended
 *  any further.
 */
void* mm_malloc(size_t size) {

    struct f_block* bp;

    if (!heap_init)
        mm_init();

    /* Ignore spurious requests */
    if (size == 0)
        return NULL;

    /* Adjust block size to include overhead and alignment reqs. */
    size += 2 * sizeof(struct boundary_tag);

    /* respect minimum size and work with adjusted block size in words */
    size_t awords = max(MIN_BLOCK_SIZE_WORDS, align(size)/WSIZE);

    /* Search the free list for a fit */
    if ((bp = find_fit(awords)) != NULL) {
        place(bp, awords);

        /* once placed, an f_block becomes an a_block */
        struct a_block* result = (struct a_block *) bp;
        return result->payload;
    }

    /* No fit found. Get more memory and place the block */

    /* Amount to extend heap if no fit */
    size_t extendwords = max(awords,CHUNKSIZE);
    if ((bp = extend_heap(extendwords)) == NULL)
        return NULL;

    place(bp, awords);
    /* once placed, an f_block becomes an a_block */
    struct a_block* result = (struct a_block *) bp;

    return result->payload;
}

Freeing a given block is relatively simple: we just mark it as free and attempt to coalesce it with its neighboring blocks, on the benevolent chance that they're also free:

/* Frees the block pointed to by `bp` and adds the resultant free block to the
 *  relevant free list based on its size.
 */
void mm_free(void* bp) {
    assert(heap_init);       // assert that mm_init was called
    if (bp == 0)
        return;

    /* Find block from user pointer */
    struct a_block* blk = bp - offsetof(struct a_block, payload);

    mark_block_free(blk, blk_size(blk));
    coalesce((struct f_block *) blk);
}

Coalescing looks harder than it is: we just check each case where the blocks to the left, right, both, or neither are free, and join them together:

/*
 * Attempts to combine the given free block `bp` with its adjacent blocks.
 * If a neighboring block is not allocated, it is removed from its respective
 *  free list and combined with `bp`.
 */
static struct f_block* coalesce(struct f_block* bp)  {
    bool prev_alloc = prev_blk_footer(bp)->inuse;   /* prev_blk allocated? */
    bool next_alloc = ! blk_free(next_blk(bp));     /* next_blk allocated? */
    size_t size = blk_size(bp);

    if (prev_alloc && next_alloc) {            /* Case 1 */
        // both are allocated, nothing to coalesce
        return bp;
    }

    else if (prev_alloc && !next_alloc) {      /* Case 2 */
        // combine this block and next block by extending it
        // remove current and next block from free_lists,
        // call mark_free on current
        list_remove(&bp->elem);
        list_remove(&next_blk(bp)->elem);
        mark_block_free(bp, size + blk_size(next_blk(bp)));
    }

    else if (!prev_alloc && next_alloc) {      /* Case 3 */
        // combine previous and this block by extending previous
        // remove prev and curr block from free_lists, call mark_free on prev
        list_remove(&bp->elem);
        bp = prev_blk(bp);
        list_remove(&bp->elem);
        mark_block_free(bp, size + blk_size(bp));
    }

    else {                                     /* Case 4 */
        // combine previous, this, and next block into one
        // remove prev, curr, and next from free_lists, call mark_free on prev
        list_remove(&prev_blk(bp)->elem);
        list_remove(&bp->elem);
        list_remove(&next_blk(bp)->elem);
        size_t n_size = blk_size(next_blk(bp));
        bp = prev_blk(bp);
        mark_block_free(bp, blk_size(bp) + size + n_size);
    }

    return bp;
}

realloc copies an existing block into a new block of the requested size:

/* A naive implementation of realloc which could be improved in terms of
 *  utilization by checking right neighbors for coalesce/re-use opportunities.
 */
void* mm_realloc(void* ptr, size_t size) {
    /* If size == 0 then this is just free, and we return NULL. */
    if (size == 0) {
        mm_free(ptr);
        return 0;
    }

    /* If oldptr is NULL, then this is just malloc. */
    if (ptr == NULL)
        return mm_malloc(size);

    void* newptr = mm_malloc(size);

    /* If realloc() fails the original block is left untouched  */
    if (!newptr)
        return 0;

    /* Copy the old data. */
    struct a_block* oldblock = ptr - offsetof(struct a_block, payload);
    size_t oldsize = blk_size(oldblock) * WSIZE;
    if (size < oldsize) oldsize = size;
    memcpy(newptr, ptr, oldsize);

    /* Free the old block. */
    mm_free(ptr);

    return newptr;
}

And finally, the remaining helper methods interspersed throughout the above implementation:

/* Extends heap with a free block of size `words`, attempts to coalesce the new
 *  block with the left-adjacent free block, then returns a pointer to the
 *  potentially-coalesced block.
 * Returns NULL if mem_sbrk() was not able to extend the heap.
 */
static struct f_block* extend_heap(size_t words) {
    void* bp = mem_sbrk(words * WSIZE);

    if ((intptr_t) bp == -1)
        return NULL;

    /* Initialize free block header/footer and the epilogue header.
     * Note that we overwrite the previous epilogue here. */
    struct f_block* blk = bp - sizeof(FENCE);
    mark_block_free(blk, words);
    next_blk(blk)->header = FENCE;

    return coalesce(blk);
}

/*
 * Places a block of `asize` words at start of free block bp and splits if
 *  the remainder would be at least minimum block size.
 */
static void place(struct f_block* bp, size_t asize) {
    size_t csize = blk_size(bp);

    if ((csize - asize) >= MIN_BLOCK_SIZE_WORDS) {
        mark_block_used(bp, asize);
        bp = next_blk(bp);
        mark_block_free(bp, csize-asize);
    } else {
        mark_block_used(bp, csize);
    }
}

/* Find a fit for a block with asize words.
 * Starts by checking the smallest size class that `asize` might be in, then
 *  checks bigger size classes if there are no free blocks big enough in that
 *  class.
 * Returns NULL if no free blocks can hold that size.
 */
static struct f_block* find_fit(size_t asize) {
    /* Segregated First fit search */
    int class = get_size_class(asize);
    while(class < N_SIZE_CLASSES) {
        for (struct list_elem* e = list_begin(&free_lists[class]); e != list_end(&free_lists[class]); e = list_next(e)) {
            struct f_block* bp = list_entry(e, struct f_block, elem);
            if (asize <= blk_size(bp))
                return bp;
        } // if list_end
        class++;
    } // if end of class list and still can't find a fit, return NULL

    return NULL; /* No fit */
}

/* Given the size of a block in words,
 *  returns the index of the size class it does or should belong to.
 */
static int get_size_class(size_t words) {
    int i = 0;
    while((i < N_SIZE_CLASSES) && (sizes[i] < words))
        i++;

    if (i >= N_SIZE_CLASSES)
        return N_SIZE_CLASSES - 1;

    return i;
}

To actually make use of this API within the confines of a stack rather than the heap-proper, we need some driver sitting between our API and our application. The following snippet carves out a chunk of the heap, but smarter people than I can hack the mmap syscall (or one of its esoteric variants) to target the stack itself:8

/*
 * memlib.c - a module that simulates the memory system.  Needed because it
 *            allows us to interleave calls from our malloc API
 *            with the system's malloc package in libc.
 */
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>

#include "memlib.h"

/* private variables */
static char *mem_start_brk;  /* points to first byte of heap */
static char *mem_brk;        /* points to last byte of heap */
static char *mem_max_addr;   /* largest legal heap address */
static int use_mmap;         /* Use mmap instead of malloc */
static void * mmap_addr = (void *)0x58000000;

/*
 * mem_init - initialize the memory system model
 */
void mem_init(int _use_mmap) {
    use_mmap = _use_mmap;

    /* allocate the storage we will use to model the available VM */
    if (use_mmap) {
        mem_start_brk = (char *)mmap(mmap_addr, MAX_HEAP, PROT_READ|PROT_WRITE,
                                 MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (mem_start_brk == MAP_FAILED) {
            perror("mem_init_vm: mmap error:");
            exit(1);
        }
        if (mem_start_brk != mmap_addr) {
            perror("mmap");
            fprintf(stderr,
                "mem_init_vm: could not obtain memory at address %p\n",
                mmap_addr);
            exit(1);
        }
    } else {
        if ((mem_start_brk = (char *)malloc(MAX_HEAP)) == NULL) {
            fprintf(stderr, "mem_init_vm: malloc error\n");
            exit(1);
        }
    }

    mem_max_addr = mem_start_brk + MAX_HEAP;  /* max legal heap address */
    mem_brk = mem_start_brk;                  /* heap is empty initially */
}

/*
 * mem_deinit - free the storage used by the memory system model
 */
void mem_deinit(void) {
    if (use_mmap) {
        if (munmap(mem_start_brk, MAX_HEAP))
            perror("munmap");
    } else {
        free(mem_start_brk);
    }
}

/*
 * mem_reset_brk - reset the simulated brk pointer to make an empty heap
 */
void mem_reset_brk() { mem_brk = mem_start_brk; }

/*
 * mem_sbrk - simple model of the sbrk function. Extends the heap
 *    by incr bytes and returns the start address of the new area. In
 *    this model, the heap cannot be shrunk.
 */
void *mem_sbrk(int incr) {
    char *old_brk = mem_brk;

    if ((incr < 0) || ((mem_brk + incr) > mem_max_addr)) {
        errno = ENOMEM;
        fprintf(stderr, "ERROR: mem_sbrk(%d) failed. Ran out of memory...\n", incr);
        return NULL;
    }
    mem_brk += incr;
    return (void *)old_brk;
}

/*
 * mem_heap_lo - return address of the first heap byte
 */
void *mem_heap_lo() { return (void *)mem_start_brk; }

/*
 * mem_heap_hi - return address of last heap byte
 */
void *mem_heap_hi() { return (void *)(mem_brk - 1); }

/*
 * mem_heapsize() - returns the heap size in bytes
 */
size_t mem_heapsize() { return (size_t)(mem_brk - mem_start_brk); }

/*
 * mem_pagesize() - returns the page size of the system
 */
size_t mem_pagesize() { return (size_t)getpagesize(); }

Additional Reading

Appendix

list.h
#ifndef __LIST_H
#define __LIST_H
/* This code is taken from the Pintos education OS.
 * For copyright information, see www.pintos-os.org */

/* Doubly linked list.

   This implementation of a doubly linked list does not require
   use of dynamically allocated memory.  Instead, each structure
   that is a potential list element must embed a struct list_elem
   member.  All of the list functions operate on these `struct
   list_elem's.  The list_entry macro allows conversion from a
   struct list_elem back to a structure object that contains it.

   For example, suppose there is a need for a list of `struct
   foo'.  `struct foo' should contain a `struct list_elem'
   member, like so:

      struct foo
        {
          struct list_elem elem;
          int bar;
          ...other members...
        };

   Then a list of `struct foo' can be declared and initialized
   like so:

      struct list foo_list;

      list_init (&foo_list);

   Iteration is a typical situation where it is necessary to
   convert from a struct list_elem back to its enclosing
   structure.  Here's an example using foo_list:

      struct list_elem *e;

      for (e = list_begin (&foo_list); e != list_end (&foo_list);
           e = list_next (e))
        {
          struct foo *f = list_entry (e, struct foo, elem);
          ...do something with f...
        }

   You can find real examples of list usage throughout the
   source; for example, malloc.c, palloc.c, and thread.c in the
   threads directory all use lists.

   The interface for this list is inspired by the list<> template
   in the C++ STL.  If you're familiar with list<>, you should
   find this easy to use.  However, it should be emphasized that
   these lists do *no* type checking and can't do much other
   correctness checking.  If you screw up, it will bite you.

   Glossary of list terms:

     - "front": The first element in a list.  Undefined in an
       empty list.  Returned by list_front().

     - "back": The last element in a list.  Undefined in an empty
       list.  Returned by list_back().

     - "tail": The element figuratively just after the last
       element of a list.  Well defined even in an empty list.
       Returned by list_end().  Used as the end sentinel for an
       iteration from front to back.

     - "beginning": In a non-empty list, the front.  In an empty
       list, the tail.  Returned by list_begin().  Used as the
       starting point for an iteration from front to back.

     - "head": The element figuratively just before the first
       element of a list.  Well defined even in an empty list.
       Returned by list_rend().  Used as the end sentinel for an
       iteration from back to front.

     - "reverse beginning": In a non-empty list, the back.  In an
       empty list, the head.  Returned by list_rbegin().  Used as
       the starting point for an iteration from back to front.

     - "interior element": An element that is not the head or
       tail, that is, a real list element.  An empty list does
       not have any interior elements.
*/

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* List element. */
struct list_elem {
    struct list_elem *prev;     /* Previous list element. */
    struct list_elem *next;     /* Next list element. */
  };

/* List. */
struct list {
    struct list_elem head;      /* List head. */
    struct list_elem tail;      /* List tail. */
  };

/* Converts pointer to list element LIST_ELEM into a pointer to
   the structure that LIST_ELEM is embedded inside.  Supply the
   name of the outer structure STRUCT and the member name MEMBER
   of the list element.  See the big comment at the top of the
   file for an example. */
#define list_entry(LIST_ELEM, STRUCT, MEMBER)           \
        ((STRUCT *) ((uint8_t *) &(LIST_ELEM)->next     \
                     - offsetof (STRUCT, MEMBER.next)))

void list_init (struct list *);

/* List traversal. */
struct list_elem *list_begin (struct list *);
struct list_elem *list_next (struct list_elem *);
struct list_elem *list_end (struct list *);

struct list_elem *list_rbegin (struct list *);
struct list_elem *list_prev (struct list_elem *);
struct list_elem *list_rend (struct list *);

struct list_elem *list_head (struct list *);
struct list_elem *list_tail (struct list *);

/* List insertion. */
void list_insert (struct list_elem *, struct list_elem *);
void list_splice (struct list_elem *before,
                  struct list_elem *first, struct list_elem *last);
void list_push_front (struct list *, struct list_elem *);
void list_push_back (struct list *, struct list_elem *);

/* List removal. */
struct list_elem *list_remove (struct list_elem *);
struct list_elem *list_pop_front (struct list *);
struct list_elem *list_pop_back (struct list *);

/* List elements. */
struct list_elem *list_front (struct list *);
struct list_elem *list_back (struct list *);

/* List properties. */
size_t list_size (struct list *);
bool list_empty (struct list *);

/* Miscellaneous. */
void list_reverse (struct list *);

/* Compares the value of two list elements A and B, given
   auxiliary data AUX.  Returns true if A is less than B, or
   false if A is greater than or equal to B. */
typedef bool list_less_func (const struct list_elem *a,
                             const struct list_elem *b,
                             void *aux);

/* Operations on lists with ordered elements. */
void list_sort (struct list *,
                list_less_func *, void *aux);
void list_insert_ordered (struct list *, struct list_elem *,
                          list_less_func *, void *aux);
void list_unique (struct list *, struct list *duplicates,
                  list_less_func *, void *aux);

/* Max and min. */
struct list_elem *list_max (struct list *, list_less_func *, void *aux);
struct list_elem *list_min (struct list *, list_less_func *, void *aux);

#endif /* list.h */
list.c
#include "list.h"
#include <assert.h>

/* Our doubly linked lists have two header elements: the "head"
   just before the first element and the "tail" just after the
   last element.  The `prev' link of the front header is null, as
   is the `next' link of the back header.  Their other two links
   point toward each other via the interior elements of the list.

   An empty list looks like this:

                      +------+     +------+
                  <---| head |<--->| tail |--->
                      +------+     +------+

   A list with two elements in it looks like this:

        +------+     +-------+     +-------+     +------+
    <---| head |<--->|   1   |<--->|   2   |<--->| tail |<--->
        +------+     +-------+     +-------+     +------+

   The symmetry of this arrangement eliminates lots of special
   cases in list processing.  For example, take a look at
   list_remove(): it takes only two pointer assignments and no
   conditionals.  That's a lot simpler than the code would be
   without header elements.

   (Because only one of the pointers in each header element is used,
   we could in fact combine them into a single header element
   without sacrificing this simplicity.  But using two separate
   elements allows us to do a little bit of checking on some
   operations, which can be valuable.) */

static bool is_sorted (struct list_elem *a, struct list_elem *b,
                       list_less_func *less, void *aux);

/* Returns true if ELEM is a head, false otherwise. */
static inline bool is_head (struct list_elem *elem) {
  return elem != NULL && elem->prev == NULL && elem->next != NULL;
}

/* Returns true if ELEM is an interior element,
   false otherwise. */
static inline bool is_interior (struct list_elem *elem) {
  return elem != NULL && elem->prev != NULL && elem->next != NULL;
}

/* Returns true if ELEM is a tail, false otherwise. */
static inline bool is_tail (struct list_elem *elem) {
  return elem != NULL && elem->prev != NULL && elem->next == NULL;
}

/* Initializes LIST as an empty list. */
void list_init (struct list *list) {
  assert (list != NULL);
  list->head.prev = NULL;
  list->head.next = &list->tail;
  list->tail.prev = &list->head;
  list->tail.next = NULL;
}

/* Returns the beginning of LIST.  */
struct list_elem* list_begin(struct list *list) {
  assert (list != NULL);
  return list->head.next;
}

/* Returns the element after ELEM in its list.  If ELEM is the
   last element in its list, returns the list tail.  Results are
   undefined if ELEM is itself a list tail. */
struct list_elem* list_next(struct list_elem *elem) {
  assert (is_head (elem) || is_interior (elem));
  return elem->next;
}

/* Returns LIST's tail.

   list_end() is often used in iterating through a list from
   front to back.  See the big comment at the top of list.h for
   an example. */
struct list_elem *
list_end (struct list *list)
{
  assert (list != NULL);
  return &list->tail;
}

/* Returns the LIST's reverse beginning, for iterating through
   LIST in reverse order, from back to front. */
struct list_elem* list_rbegin (struct list *list) {
  assert (list != NULL);
  return list->tail.prev;
}

/* Returns the element before ELEM in its list.  If ELEM is the
   first element in its list, returns the list head.  Results are
   undefined if ELEM is itself a list head. */
struct list_elem* list_prev (struct list_elem *elem) {
  assert (is_interior (elem) || is_tail (elem));
  return elem->prev;
}

/* Returns LIST's head.

   list_rend() is often used in iterating through a list in
   reverse order, from back to front.  Here's typical usage,
   following the example from the top of list.h:

      for (e = list_rbegin (&foo_list); e != list_rend (&foo_list);
           e = list_prev (e))
        {
          struct foo *f = list_entry (e, struct foo, elem);
          ...do something with f...
        }
*/
struct list_elem* list_rend (struct list *list) {
  assert (list != NULL);
  return &list->head;
}

/* Returns LIST's head.

   list_head() can be used for an alternate style of iterating
   through a list, e.g.:

      e = list_head (&list);
      while ((e = list_next (e)) != list_end (&list))
        {
          ...
        }
*/
struct list_elem* list_head (struct list *list) {
  assert (list != NULL);
  return &list->head;
}

/* Returns LIST's tail. */
struct list_elem* list_tail (struct list *list) {
  assert (list != NULL);
  return &list->tail;
}

/* Inserts ELEM just before BEFORE, which may be either an
   interior element or a tail.  The latter case is equivalent to
   list_push_back(). */
void list_insert (struct list_elem *before, struct list_elem *elem) {
  assert (is_interior (before) || is_tail (before));
  assert (elem != NULL);

  elem->prev = before->prev;
  elem->next = before;
  before->prev->next = elem;
  before->prev = elem;
}

/* Removes elements FIRST though LAST (exclusive) from their
   current list, then inserts them just before BEFORE, which may
   be either an interior element or a tail. */
void list_splice (struct list_elem *before,
             struct list_elem *first, struct list_elem *last) {
  assert (is_interior (before) || is_tail (before));
  if (first == last)
    return;
  last = list_prev (last);

  assert (is_interior (first));
  assert (is_interior (last));

  /* Cleanly remove FIRST...LAST from its current list. */
  first->prev->next = last->next;
  last->next->prev = first->prev;

  /* Splice FIRST...LAST into new list. */
  first->prev = before->prev;
  last->next = before;
  before->prev->next = first;
  before->prev = last;
}
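
/* A hedged sketch of list_splice() (not from the original
   sources): moving every element of a hypothetical list `src` to
   the back of a hypothetical list `dst`.  Because the range is
   exclusive at LAST, an empty `src` gives FIRST == LAST and the
   call is a harmless no-op:

      list_splice (list_end (&dst), list_begin (&src), list_end (&src));

   Afterward `src` is empty and `dst` holds the elements of both. */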

/* Inserts ELEM at the beginning of LIST, so that it becomes the
   front in LIST. */
void list_push_front(struct list* list, struct list_elem* elem) {
  list_insert (list_begin (list), elem);
}

/* Inserts ELEM at the end of LIST, so that it becomes the
   back in LIST. */
void list_push_back(struct list* list, struct list_elem* elem) {
  list_insert (list_end (list), elem);
}

/* Removes ELEM from its list and returns the element that
   followed it.  Undefined behavior if ELEM is not in a list.

   It's not safe to treat ELEM as an element in a list after
   removing it.  In particular, using list_next() or list_prev()
   on ELEM after removal yields undefined behavior.  This means
   that a naive loop to remove the elements in a list will fail:

   ** DON'T DO THIS **
   for (e = list_begin (&list); e != list_end (&list); e = list_next (e))
     {
       ...do something with e...
       list_remove (e);
     }
   ** DON'T DO THIS **

   Here is one correct way to iterate and remove elements from a
   list:

   for (e = list_begin (&list); e != list_end (&list); e = list_remove (e))
     {
       ...do something with e...
     }

   If you need to free() elements of the list then you need to be
   more conservative.  Here's an alternate strategy that works
   even in that case:

   while (!list_empty (&list))
     {
       struct list_elem *e = list_pop_front (&list);
       ...do something with e...
     }
*/
struct list_elem* list_remove(struct list_elem* elem) {
  assert (is_interior (elem));
  elem->prev->next = elem->next;
  elem->next->prev = elem->prev;
  return elem->next;
}

/* Removes the front element from LIST and returns it.
   Undefined behavior if LIST is empty before removal. */
struct list_elem* list_pop_front(struct list* list) {
  struct list_elem *front = list_front (list);
  list_remove (front);
  return front;
}

/* Removes the back element from LIST and returns it.
   Undefined behavior if LIST is empty before removal. */
struct list_elem* list_pop_back(struct list* list) {
  struct list_elem *back = list_back (list);
  list_remove (back);
  return back;
}

/* Returns the front element in LIST.
   Undefined behavior if LIST is empty. */
struct list_elem* list_front(struct list* list) {
  assert (!list_empty (list));
  return list->head.next;
}

/* Returns the back element in LIST.
   Undefined behavior if LIST is empty. */
struct list_elem* list_back(struct list* list) {
  assert (!list_empty (list));
  return list->tail.prev;
}

/* Returns the number of elements in LIST.
   Runs in O(n) time in the number of elements. */
size_t list_size (struct list* list) {
  struct list_elem *e;
  size_t cnt = 0;

  for (e = list_begin (list); e != list_end (list); e = list_next (e))
    cnt++;
  return cnt;
}

/* Returns true if LIST is empty, false otherwise. */
bool list_empty(struct list* list) {
  return list_begin (list) == list_end (list);
}

/* Swaps the `struct list_elem *'s that A and B point to. */
static void swap(struct list_elem** a, struct list_elem** b) {
  struct list_elem *t = *a;
  *a = *b;
  *b = t;
}

/* Reverses the order of LIST. */
void list_reverse (struct list* list) {
  if (!list_empty (list)) {
      struct list_elem *e;

      /* Swap each interior element's prev and next pointers.
         After each swap, e->prev holds the old e->next, so
         following e->prev still advances front to back. */
      for (e = list_begin (list); e != list_end (list); e = e->prev)
        swap (&e->prev, &e->next);

      /* The head and tail sentinels were skipped by the loop:
         repoint them at the new front and back, then fix the
         boundary elements' links back to the sentinels. */
      swap (&list->head.next, &list->tail.prev);
      swap (&list->head.next->prev, &list->tail.prev->next);
    }
}

/* is_sorted() is referenced only from assertions, so mark it
   unused to suppress compiler warnings when asserts are compiled
   out (e.g., under NDEBUG). */
static bool is_sorted (struct list_elem* a, struct list_elem* b,
                       list_less_func* less, void* aux) __attribute__((__unused__));

/* Returns true only if the list elements A through B (exclusive)
   are in order according to LESS given auxiliary data AUX. */
static bool is_sorted (struct list_elem* a, struct list_elem* b,
                       list_less_func *less, void *aux) {
  if (a != b)
    while ((a = list_next (a)) != b)
      if (less (a, list_prev (a), aux))
        return false;
  return true;
}

/* Finds a run, starting at A and ending not after B, of list
   elements that are in nondecreasing order according to LESS
   given auxiliary data AUX.  Returns the (exclusive) end of the
   run.
   A through B (exclusive) must form a non-empty range. */
static struct list_elem* find_end_of_run (struct list_elem* a, struct list_elem* b,
                                          list_less_func* less, void* aux) {
  assert (a != NULL);
  assert (b != NULL);
  assert (less != NULL);
  assert (a != b);

  do {
      a = list_next (a);
    }
  while (a != b && !less (a, list_prev (a), aux));
  return a;
}

/* Merges A0 through A1B0 (exclusive) with A1B0 through B1
   (exclusive) to form a combined range also ending at B1
   (exclusive).  Both input ranges must be nonempty and sorted in
   nondecreasing order according to LESS given auxiliary data
   AUX.  The output range will be sorted the same way. */
static void inplace_merge (struct list_elem* a0, struct list_elem* a1b0,
                           struct list_elem* b1,
                           list_less_func* less, void* aux) {
  assert (a0 != NULL);
  assert (a1b0 != NULL);
  assert (b1 != NULL);
  assert (less != NULL);
  assert (is_sorted (a0, a1b0, less, aux));
  assert (is_sorted (a1b0, b1, less, aux));

  while (a0 != a1b0 && a1b0 != b1)
    if (!less (a1b0, a0, aux))
      a0 = list_next (a0);
    else {
        a1b0 = list_next (a1b0);
        list_splice (a0, list_prev (a1b0), a1b0);
      }
}

/* Sorts LIST according to LESS given auxiliary data AUX, using a
   natural iterative merge sort that runs in O(n lg n) time and
   O(1) space in the number of elements in LIST. */
void list_sort (struct list* list, list_less_func* less, void* aux) {
  size_t output_run_cnt;        /* Number of runs output in current pass. */

  assert (list != NULL);
  assert (less != NULL);

  /* Pass over the list repeatedly, merging adjacent runs of
     nondecreasing elements, until only one run is left. */
  do {
      struct list_elem *a0;     /* Start of first run. */
      struct list_elem *a1b0;   /* End of first run, start of second. */
      struct list_elem *b1;     /* End of second run. */

      output_run_cnt = 0;
      for (a0 = list_begin (list); a0 != list_end (list); a0 = b1)
        {
          /* Each iteration produces one output run. */
          output_run_cnt++;

          /* Locate two adjacent runs of nondecreasing elements
             A0...A1B0 and A1B0...B1. */
          a1b0 = find_end_of_run (a0, list_end (list), less, aux);
          if (a1b0 == list_end (list))
            break;
          b1 = find_end_of_run (a1b0, list_end (list), less, aux);

          /* Merge the runs. */
          inplace_merge (a0, a1b0, b1, less, aux);
        }
    }
  while (output_run_cnt > 1);

  assert (is_sorted (list_begin (list), list_end (list), less, aux));
}
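
/* A usage sketch for list_sort() (illustrative only).  The caller
   supplies the list_less_func; `struct foo` and foo_less() are
   hypothetical:

      static bool foo_less (const struct list_elem *a,
                            const struct list_elem *b, void *aux)
      {
        return list_entry (a, struct foo, elem)->value
             < list_entry (b, struct foo, elem)->value;
      }

      list_sort (&foo_list, foo_less, NULL);

   Because the merge only moves an element from the second run when
   it is strictly less, the sort is stable: elements that compare
   equal keep their relative order. */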

/* Inserts ELEM in the proper position in LIST, which must be
   sorted according to LESS given auxiliary data AUX.
   Runs in O(n) average case in the number of elements in LIST. */
void list_insert_ordered (struct list *list, struct list_elem *elem,
                          list_less_func *less, void *aux) {
  struct list_elem *e;

  assert (list != NULL);
  assert (elem != NULL);
  assert (less != NULL);

  for (e = list_begin (list); e != list_end (list); e = list_next (e))
    if (less (elem, e, aux))
      break;
  list_insert (e, elem);
}
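
/* Sketch (illustrative): list_insert_ordered() keeps a list sorted
   as elements arrive, e.g. with the hypothetical foo_less() from
   the list_sort() example above:

      list_insert_ordered (&foo_list, &f->elem, foo_less, NULL);
*/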

/* Iterates through LIST and removes all but the first in each
   set of adjacent elements that are equal according to LESS
   given auxiliary data AUX.  If DUPLICATES is non-null, then the
   elements from LIST are appended to DUPLICATES. */
void list_unique (struct list *list, struct list *duplicates,
                  list_less_func *less, void *aux) {
  struct list_elem *elem, *next;

  assert (list != NULL);
  assert (less != NULL);
  if (list_empty (list))
    return;

  elem = list_begin (list);
  while ((next = list_next (elem)) != list_end (list))
    if (!less (elem, next, aux) && !less (next, elem, aux))
      {
        list_remove (next);
        if (duplicates != NULL)
          list_push_back (duplicates, next);
      }
    else
      elem = next;
}
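
/* Sketch (illustrative): list_unique() removes only *adjacent*
   duplicates, so it is typically paired with list_sort() using the
   same less function.  Passing NULL for DUPLICATES discards the
   removed elements rather than collecting them:

      list_sort (&foo_list, foo_less, NULL);
      list_unique (&foo_list, NULL, foo_less, NULL);
*/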

/* Returns the element in LIST with the largest value according
   to LESS given auxiliary data AUX.  If there is more than one
   maximum, returns the one that appears earlier in the list.  If
   the list is empty, returns its tail. */
struct list_elem* list_max (struct list *list, list_less_func *less, void *aux) {
  struct list_elem *max = list_begin (list);
  if (max != list_end (list))
    {
      struct list_elem *e;

      for (e = list_next (max); e != list_end (list); e = list_next (e))
        if (less (max, e, aux))
          max = e;
    }
  return max;
}

/* Returns the element in LIST with the smallest value according
   to LESS given auxiliary data AUX.  If there is more than one
   minimum, returns the one that appears earlier in the list.  If
   the list is empty, returns its tail. */
struct list_elem* list_min (struct list *list, list_less_func *less, void *aux) {
  struct list_elem *min = list_begin (list);
  if (min != list_end (list))
    {
      struct list_elem *e;

      for (e = list_next (min); e != list_end (list); e = list_next (e))
        if (less (e, min, aux))
          min = e;
    }
  return min;
}

Footnotes


  1. Adapted from Fig. 1.11 of Silberschatz, A., Galvin, P. B., & Gagne, G. (2013), Operating System Concepts.

  2. Value speculation

  3. Adding Nested Loops Makes this Algorithm 120x FASTER? Sick video which demonstrates how memory locality is one of the most effective optimization techniques.

  4. https://stackoverflow.com/a/1826072

  5. The Apollo Guidance Computer

  6. https://twitter.com/i2talics/status/1784049202739200219

  7. This implementation assumes the availability of a linked list API provided by list.h, which can be found in the appendix.

  8. https://stackoverflow.com/a/69677311