CPU Schedulers
Covenants, without the sword, are but words and of no strength to secure a man at all.
– Thomas Hobbes1
Continuing in the series on systems architecture, this post aims to explain what CPU scheduling is, how it works, and why we care.
Scheduling
Scheduling refers to the technique(s) employed by an operating system to determine which process gets to use the CPU at any given time based on criteria including:
- CPU utilization,
- throughput,
- turnaround time,
- waiting time,
- response time.
There are a number of scheduling algorithms we might devise to allocate CPU resources to a collection of processes that need to be executed, the simplest being a naïve First Come First Served approach wherein compute is granted to the first process that requests the CPU.
CPU Allocation
Before specifying how this might work in practice, we'll first define what it means to allocate the CPU to a process.
When a program is executed, it becomes a process at which point the OS reserves a memory block for it.
Within this block of memory, there's a segment called the TEXT section which contains the executable code of the program.
Recall that the CPU tracks a special address register called the Program Counter which contains the address of the next instruction to be executed. When we say that a CPU is allocated to a specific process, we mean that the PC register is pointing to an address within that process's TEXT section. When the CPU is reallocated to another process, the OS updates the PC to point to the executable code in the TEXT section of a different process. Allocation and reallocation is a comparatively costly operation which requires a context switch2
Syscalls can be expensive because they require us to:
- capture the state of our process known as the execution context which is represented in CPU registers,
- store them in memory,
- yield to the OS to perform the syscall,
- store the syscall's result in memory,
- restore the execution context of our process,
- and finally retrieve the result of the syscall.
Even in this oversimplification, we see that several memory accesses are required for something as simple as e.g. I/O.
Context switching like this, between threads or processes, or even just large and fragmented swaths of code is resource expensive & slow.
It's also worth noting that a process is wholly modeled by the OS as a record called a Process Control Block which contains all the information about the process's state, PC, registers, and other management data which proves useful for scheduling and necessary for context switching:
pub struct ProcessControlBlock {
    pid: u16,
    state: ProcessState,          // new, ready, running, waiting, etc.
    program_counter: u32,         // address of the next instruction to execute
    registers: [u8; 4],           // saved general-purpose register contents
    instruction: u8,
    flags: [bool; 3],             // saved status/condition flags
    stack_pointer: u32,
    index_registers: [u32; 2],
    memory_limits: [u32; 2],      // bounds of the memory block reserved for the process
    io_devices: Vec<IO_Device>,   // IO devices allocated to the process
    open_files: Vec<File>,
    parent: Arc<ProcessControlBlock>,
    children: Vec<Arc<ProcessControlBlock>>,
}
First Come First Served
Returning to consideration of the simplest scheduling algorithm, we can implement this as a queue containing the PCBs representing the processes to be allocated. When a process requests the CPU, the OS pushes its PCB to the tail of the queue, and whenever the CPU is free, it is then immediately allocated to whatever process's PCB is at the head of the queue.
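To make that concrete, here's a minimal sketch of the bookkeeping, using a toy, pared-down stand-in for the PCB rather than anything a real kernel would use:
use std::collections::VecDeque;

// Illustrative, pared-down stand-in for the full ProcessControlBlock above.
#[derive(Debug)]
struct Pcb {
    pid: u16,
}

fn main() {
    let mut ready_queue: VecDeque<Pcb> = VecDeque::new();

    // a process requests the CPU: its PCB goes to the tail of the queue
    for pid in 1..=3 {
        ready_queue.push_back(Pcb { pid });
    }

    // the CPU frees up: allocate it to whatever PCB is at the head
    while let Some(pcb) = ready_queue.pop_front() {
        println!("allocating the CPU to pid {}", pcb.pid);
    }
}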
I'm glossing over a lot of assumptions about CPU/OS behavior here, namely that the CPU is only free to be reallocated after a process terminates – which is rarely the case since many processes never terminate, e.g. servers or background tasks like the OS itself.3 If we were to truly allocate the CPU to a process that never terminates, we'd block all other processes in the queue indefinitely. Additionally, relying on termination of a process before reallocating the CPU ignores the fact that not all processes are equally important.
Bursting
Programs in execution do not use the CPU continuously. If a terminating task takes two minutes to complete execution, that does not imply that it kept the CPU busy for a full 120,000ms.
During the execution of any arbitrary, non-trivial process, there will be periods of CPU idle time when the process will be awaiting something else that is independent of the CPU.
For example, consider this program which buffers text from one file to another in 64-byte chunks so as not to consume a ton of memory:
use std::fs::File;
use std::io::{Read, Result, Write};

fn main() -> Result<()> {
    let mut source = File::open("src.txt")?;
    let mut destination = File::create("out.txt")?;
    let mut buf = [0u8; 64];
    loop {
        // read the next chunk from src.txt
        let bytes_read = source.read(&mut buf)?;
        if bytes_read == 0 {
            break; // EOF
        }
        // write only the bytes we actually read to out.txt
        destination.write_all(&buf[..bytes_read])?;
    }
    Ok(())
}
The assembly corresponding to the loop which is doing the copying would look something like this:
copy_loop:
; read from src.txt
mov rax, 0 ; sys_read
mov rdi, [fd_source] ; file descriptor of src.txt
lea rsi, [buf] ; address of buffer
mov rdx, 64 ; number of bytes to read
syscall ; perform sys_read
test rax, rax ; check if EOF or error
js handle_error ; if rax < 0, handle error
jz exit_loop ; if rax == 0, EOF
mov [bytes_read], rax ; store number of bytes read
; write the read_bytes to out.txt
mov rax, 1 ; sys_write
mov rdi, [fd_destination] ; file descriptor of out.txt
lea rsi, [buf] ; address of buffer
mov rdx, [bytes_read] ; number of bytes to write
syscall ; perform sys_write
test rax, rax ; check for errors
js handle_error ; if rax < 0, handle error
jmp copy_loop ; repeat loop
exit_loop:
If we were to profile this machine code, we'd note that –despite the appearance of sequential/synchronous execution that loops at the jmp call back to the top of the labeled section– the sys_read and sys_write system calls take far more clock cycles than the innocuous instructions targeting local registers. This is because reading from and writing to files requires the program to invoke the OS, since file manipulation oftentimes requires hardware manipulation to retrieve the IO targets from disk. And IO is slow as hell. Therefore, we cannot execute the subsequent test instructions immediately after the syscalls they follow, since those syscalls are dependent on hardware which is external to the CPU itself.
As a result, the program does not use the CPU continuously, and instead is peppered with numerous windows of idle time during which the CPU could be completing erstwhile segments of other processes. Context switching processes during this idle time is not a premature optimization either since, like Bohr's electron model, this negative space is not the exception, but instead the norm for IO-bound applications and the systems governing those processes which account for like 95% of general programs and 100% of the programs relevant to my claim.
The shaded regions where a process is actually using the CPU are called CPU bursts, and the gaps between them are called IO bursts. We can make some relevant, general observations about execution behavior which will inform our scheduler design.
- Every process begins and ends with a CPU burst,
- CPU bursts are always followed by an IO burst and vice versa, with the exception of the final CPU burst which terminates the program.
While the duration of CPU bursts is process-dependent, the general distribution resembles the following histogram with a much higher density surrounding short bursts than long bursts:
To account for all this down time, our ideal scheduler should take advantage of IO burst windows and context switch to another process:
It's worth noting that, implicitly between each burst we incur a small-though-non-zero context switch penalty which is necessary in order to give the illusion of concurrency for even a single-core CPU:
Scheduling Criteria
At this point, it makes sense to revisit and define our scheduling criteria.
- CPU utilization: a measure of the % of time the CPU is actively executing processes as opposed to idling,
- Throughput: processes completed per time unit,
- Turnaround time = completion time - arrival time,
- Waiting time = turnaround time - burst time,
- Response time = time of first execution - arrival time.
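To make these formulas concrete, here's a small sketch computing the latter three metrics for a single hypothetical process; all of the timestamps are made-up illustrative values:
fn main() {
    // all times in ms, measured from a common origin; the values are arbitrary
    let arrival = 0u32;          // entered the ready queue
    let first_execution = 4u32;  // first received the CPU
    let completion = 30u32;      // terminated
    let total_burst = 12u32;     // total time actually spent executing on the CPU

    let turnaround = completion - arrival;    // turnaround time = completion - arrival
    let waiting = turnaround - total_burst;   // waiting time = turnaround - burst time
    let response = first_execution - arrival; // response time = first execution - arrival

    println!("turnaround: {turnaround}ms, waiting: {waiting}ms, response: {response}ms");
}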
We'll work our way through each criterion, tweaking our scheduler's design along the way. First, an intelligent scheduler should (all else equal) allocate a process only for the duration of a single CPU burst of that process, rather than till termination (which may never even occur).
Lifecycle of a Process
Achieving burst-granular scheduling adds a layer of complexity since we now must account for the fact that processes can be in different states.
Every process begins in the new state which describes the modal existence of a launched-but-not-yet-executing process since its executable code is still being loaded into memory.
Once loaded and sufficiently resourced, a process enters the ready state, meaning it's ready for execution if/when the CPU is allocated to it.
When a CPU is allocated to the process, it is in the running state.
At some point, the process will likely make a system call (such as an IO request), at which point it will/should voluntarily yield the CPU to the OS so the scheduler can reallocate the CPU to a process in the ready state.
Bear in mind as well that IO is not limited to things like filesystem access or memory allocation. It can also include simply waiting for some external event such as user input or a network response (which we know is just a fancy filesystem access). In any event, the process should enter the waiting state until the IO request is completed such that it may then re-enter the ready state to be scheduled again.
Note that a process rarely/never immediately transitions from waiting straight to running since, in the limit, there are always other processes competing for CPU time, and so the gatekeeping scheduler will always consider the queue(s) of ready processes before the one that just had the CPU. All together, the lifecycle can be diagrammed like so:
Most application processes spend the majority of their lifecycles in the event loop and are designed to run indefinitely (barring SIGTERM from the user). General purpose operating systems oftentimes model these broad categories with more granularity (e.g. more stages for new/terminating states to indicate a process initializing/initialized, and terminating/terminated). Note also that it is common to represent all but the running state as additional queues:
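A minimal sketch of these states as an enum (mirroring the ProcessState field on the PCB sketch above), with a helper encoding which transitions the lifecycle allows; the five variants follow the classical textbook model rather than any particular OS:
#[derive(Debug, Clone, Copy, PartialEq)]
enum ProcessState {
    New,        // executable still being loaded into memory
    Ready,      // loaded, resourced, and waiting for the CPU
    Running,    // currently allocated the CPU
    Waiting,    // blocked on IO or some other external event
    Terminated, // final CPU burst completed
}

fn can_transition(from: ProcessState, to: ProcessState) -> bool {
    use ProcessState::*;
    matches!(
        (from, to),
        (New, Ready)             // loaded and sufficiently resourced
            | (Ready, Running)   // dispatched onto the CPU
            | (Running, Waiting) // issued an IO request and yielded
            | (Running, Ready)   // preempted (more on this later)
            | (Waiting, Ready)   // IO request completed
            | (Running, Terminated)
    )
}

fn main() {
    use ProcessState::*;
    assert!(can_transition(Waiting, Ready));
    assert!(!can_transition(Waiting, Running)); // must re-enter the ready queue first
    println!("transitions check out");
}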
Dispatcher
The dispatcher is responsible for allocating a CPU to the process at the head of the ready queue whenever able. It's the element of the OS which actually handles context switches in response to IO bursts or CPU interruptions, managing the record of the running process such that its CPU state can be captured in the relevant PCB before relegating it to the waiting queue. It's also responsible for restoring the CPU state from a queued PCB when that process is next allocated the CPU.
Together, the scheduler manages ingress to the ready queue, and the dispatcher manages running processes and waiting queue(s). Accordingly, the dispatcher adds some constant overhead to CPU bursts when performing context switches which, since they occur in proportion to our system's throughput, are not negligible. We denote this penalty as dispatch latency, defined as the time it takes to start/stop a running process.
Since the dispatcher is invoked for every context switch, it's highly optimized at both the hardware and software layers.
FCFS
Under a naïve First Come First Served policy, the scheduler employs a single FIFO queue for the ready state swimlane. As soon as a process makes an IO request, a multi-way pop & shuffle is executed by the dispatcher, moving the running process to the waiting queue, and popping the head of the ready queue into the CPU:
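Here's a sketch of that shuffle, again with a pared-down PCB and with the actual register save/restore reduced to comments, since a real dispatcher does this in the kernel rather than in userland Rust:
use std::collections::VecDeque;

#[derive(Debug)]
struct Pcb {
    pid: u16,
}

struct Dispatcher {
    running: Option<Pcb>,
    ready: VecDeque<Pcb>,
    waiting: VecDeque<Pcb>,
}

impl Dispatcher {
    // The running process made an IO request: multi-way pop & shuffle.
    fn on_io_request(&mut self) {
        if let Some(pcb) = self.running.take() {
            // (capture registers/PC into the PCB here)
            self.waiting.push_back(pcb);
        }
        // (restore registers/PC from the new PCB here)
        self.running = self.ready.pop_front();
    }

    // An IO request finished: the waiting process becomes ready again.
    fn on_io_complete(&mut self, pid: u16) {
        if let Some(i) = self.waiting.iter().position(|p| p.pid == pid) {
            let pcb = self.waiting.remove(i).unwrap();
            self.ready.push_back(pcb);
        }
    }
}

fn main() {
    let mut d = Dispatcher {
        running: Some(Pcb { pid: 1 }),
        ready: [Pcb { pid: 2 }, Pcb { pid: 3 }].into_iter().collect(),
        waiting: VecDeque::new(),
    };
    d.on_io_request();   // pid 1 -> waiting, pid 2 -> running
    d.on_io_complete(1); // pid 1 -> ready
    println!("running: {:?}, ready: {:?}, waiting: {:?}", d.running, d.ready, d.waiting);
}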
While this FIFO system does take CPU utilization into account, it's still not the most optimal scheduling technique we can devise. To understand why, let's revisit the CPU burst density graph:
Observing the tail behavior for longer CPU bursts, we can note that, while they're less frequent, they still occur with non-zero probability/frequency.4 Processes characterized by a high number of short bursts are said to be IO bound since their execution duration is primarily determined by the speed of the IO subsystems they depend on. In other words, their overall performance would improve in proportion to improvements in those subsystems.
CPU bound processes, on the other hand, are those that are characterized by long CPU bursts. Unlike IO bound processes, these would realize comparatively less overall performance gains if IO subsystems were improved.
The existence of CPU bound processes causes an adverse effect on our current FCFS design. Consider the not-too-contrived example comprised of even just a single CPU bound process in a ready queue with several IO bound processes.
Initially, the CPU might be executing a mundane IO bound process.
Eventually, that process will require some IO response, and the OS will move it to the waiting queue and allocate the CPU to the next process in the ready queue:
That process, too, will eventually be moved to the waiting queue and the CPU bound process will begin execution:
This process may now hog the CPU for an indeterminate amount of time, while the other processes' IO requests complete and the ready queue begins to get backed up:
Thus, for any system with a mixed distribution of IO- and CPU-bound processes, at some point the CPU bound process will acquire the CPU, and all the IO bound processes will await (and possibly complete) their IO requests, so the ready queue will be stuck waiting for the CPU, leaving IO devices idle. Once the CPU burst of the CPU bound process is completed, it will be moved to the (possibly idling) waiting queue, all IO bound processes will accumulate behind it, leaving the CPU idle as well until the CPU bound process's IO request is completed, at which point we repeat from the top.
It is both possible and likely that either the CPU or the IO devices will be severely underutilized when subjected to the presence of even a single CPU bound process absent any safeguards against this convoy effect.
Turns out that this is usually slower than if we just let IO bound processes "cut the line" and go ahead of CPU bound processes. This observation gives rise to the next scheduling technique:
Earliest Finish Time First
All else equal, this is actually the most optimal algorithm in terms of average wait time for a set of jobs (see Appendix)
The long and short is that scheduling priority is given to the process with the shortest next CPU burst (ties can be broken arbitrarily, I do not care).
This can be implemented as a priority queue where the priority is proportionate to the duration of the next CPU burst s.t. a lower priority value corresponds to a "higher" connotational priority. This allows for brief IO bound processes to overtake CPU bound processes in the ready queue to avoid aforementioned traffic jams. However, this approach conversely penalizes CPU bound processes which (sans adjustment of our design) may never be allocated the CPU so long as an IO bound process is in a competing queue.
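A minimal sketch of such a priority queue using the standard library's BinaryHeap (a max-heap) wrapped in Reverse so the shortest predicted next burst pops first; the burst durations and pids are made up:
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn main() {
    // (predicted next CPU burst in ms, pid); Reverse turns the max-heap into a
    // min-heap so the shortest predicted burst sits at the top.
    let mut ready: BinaryHeap<Reverse<(u32, u16)>> = BinaryHeap::new();

    ready.push(Reverse((120, 1))); // CPU bound-ish process
    ready.push(Reverse((8, 2)));   // IO bound process overtakes it despite arriving later
    ready.push(Reverse((15, 3)));

    while let Some(Reverse((burst, pid))) = ready.pop() {
        println!("allocating CPU to pid {pid} (predicted burst {burst}ms)");
    }
}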
N.B. that EFT is optimal for a fixed collection of processes which does not change after the optimal schedule is determined. This means that we can rely on it to make the correct scheduling decision at each discrete time step, but taken together in the presence of a constantly changing ready queue, vanilla EFT is insufficient.
Thus we're faced with two issues, the first being that it's not .... possible?? to know the duration of a process's next CPU burst in order to even determine its finish time with certainty. The second is the issue of starvation of CPU bound processes.
Issue the First: Predicting the Future
It is impossible to know with certainty the duration of the next CPU burst. Even analyzing the machine code following the PC to determine when the next system call will occur is unreliable (isomorphic to the halting problem) since the number/duration of instruction primitives preceding the boundary calls could depend on external factors like user input.
The best we can do is attempt to predict burst length (and prioritize accordingly) based on past CPU bursts from that process, assuming that the next burst will likely be similar in profile to the exponentially weighted average:

$$\tau_{n+1} = \alpha t_n + (1 - \alpha)\tau_n$$

where:
- $\tau_{n+1}$ is the next burst we're predicting,
- $t_n$ is the actual duration of the most recent burst,
- $\tau_n$ is the corresponding predicted duration of the most recent burst, and
- $\alpha$ is a discount hyperparameter.

Since $\tau_{n+1}$ is recursively defined, we can expand it all the way back to $\tau_0$ (usually initialized to some constant like 100ms), thus capturing the entire history of the process with a single term:

$$\tau_{n+1} = \alpha t_n + (1-\alpha)\alpha t_{n-1} + (1-\alpha)^2 \alpha t_{n-2} + \dots + (1-\alpha)^n \alpha t_0 + (1-\alpha)^{n+1}\tau_0$$

Here, $\alpha$ determines how much weight to assign to historical predictions and observations versus the most recently observed value: $\alpha = 0$ ignores the most recent burst and only considers the history of the process, while $\alpha = 1$ ignores the historical bursts and fully assumes that the next burst will be identical to the most recent burst. A value like $\alpha = 0.5$ gives equal weight to both, with older bursts being increasingly discounted by the exponential term.
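Expressed in code, the update is a one-liner; here's a minimal sketch where the discount $\alpha$, the initial guess $\tau_0$, and the observed burst durations are all arbitrary illustrative values:
fn predict_next(tau_prev: f64, t_observed: f64, alpha: f64) -> f64 {
    // tau_{n+1} = alpha * t_n + (1 - alpha) * tau_n
    alpha * t_observed + (1.0 - alpha) * tau_prev
}

fn main() {
    let alpha = 0.5; // equal weight to the most recent burst and the running history
    let mut tau = 100.0; // tau_0: initial guess of 100ms
    let observed = [6.0, 4.0, 6.0, 4.0, 13.0, 13.0, 13.0]; // made-up burst history (ms)

    for t in observed {
        tau = predict_next(tau, t, alpha);
        println!("observed {t:.0}ms, next predicted burst {tau:.2}ms");
    }
}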
Issue the Second: Finish your compute, there are processes starving in the L6 queue
As mentioned previously, EFT is optimal only for fixed sets of processes to be scheduled, but this is never the case in practice, begging the question: what should an ideal scheduler do if a CPU bound process has been allocated, and then an IO bound process with a predicted next burst shorter than the remainder of the predicted next burst of the CPU bound process enters the ready queue?
Should the CPU bound process be interrupted to yield to the IO bound process with an earlier finish time? For most modern, general purpose schedulers, the answer is yes. This is namely due to the non-zero likelihood of infinitely long CPU bursts. Without preemption, all other processes would freeze from the user's perspective, ruining the illusion of concurrency which is achievable via preemption on even a single core'd CPU.
Additionally, absent preemption, malware could easily monopolize the CPU if the OS relied on processes to voluntarily yield the CPU. Hence we need some sort of Hobbesian sword in order to enforce cooperation and prevent process starvation.
Round Robin
In order to account for this criterion of minimizing starvation, we can use a preemptive FIFO queue with the addition of a CPU hardware-level "shot clock" which will uniformly preempt any long-running process whose time on the CPU exceeds some threshold (say 100ms), and force a context switch to the next available process in the ready queue, placing the evicted process at the tail of the ready queue:
This adds a mechanism for fairly distributing CPU time and solves the issue of starvation. For a ready queue with $n$ processes and a shot clock threshold of $q$, each process will receive $1/n$ of the CPU time in chunks of at most $q$ time units, meaning that no process will wait more than $(n-1)q$ time units before it gets to drive the boat. Under this system, wait time optimality is dictated by our threshold $q$. If the scheduler chooses a $q$ too large, our system reduces to FCFS as we'll never utilize the shot clock. Conversely, a stringently small $q$ can lead to too many unnecessary context switches which result in pure overhead penalty since, despite consuming clock cycles, they produce no valuable compute. We can measure the optimality of $q$ in terms of turnaround time, which is the time a process takes from start to finish, including all the time spent in the ready queue, the time spent running, and the time spent awaiting IO completion.
Relatedly, throughput of a scheduler is defined in terms of the processes completed (or, more specifically, CPU bursts completed) per time unit. It's important to account for throughput as well when measuring the overall efficacy of our scheduler since, for some $q$ shorter than the time it takes to execute a context switch, we could achieve 100% CPU utilization but zero throughput since we'd be evicting-on-arrival every process. Utilization as a metric alone is useless if the work being done on the CPU is not related to process execution.
Thus, we want to choose $q$ to be just longer than the average next CPU burst in order to force the minimum possible number of context switches, and only preempt long-running processes that may otherwise continue to hog compute.
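Here's a toy simulation of the shot clock, assuming (unrealistically) that we know each process's total remaining CPU demand up front; the quantum and burst lengths are arbitrary:
use std::collections::VecDeque;

fn main() {
    // (pid, remaining CPU time needed in ms) -- a toy stand-in for a PCB
    let mut ready: VecDeque<(u16, u32)> = VecDeque::from(vec![(1, 24), (2, 3), (3, 3)]);
    let quantum = 4; // shot clock threshold, in ms
    let mut clock = 0;

    while let Some((pid, remaining)) = ready.pop_front() {
        // run for at most one quantum before the shot clock preempts us
        let slice = remaining.min(quantum);
        clock += slice;
        if remaining > slice {
            // evicted: back to the tail of the ready queue
            ready.push_back((pid, remaining - slice));
        } else {
            println!("pid {pid} finished at t={clock}ms");
        }
        // (each iteration would also incur a small context switch penalty)
    }
}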
To quantify this goldilocks zone, we introduce yet another criterion of response time, which captures the delta from the time of a process's first request to its first response.
For some "optimal" time quantum, round robin is able to give the impression of concurrency across some number of processes beneath a certain threshold which is dictated by the CPU's clock speed. I.e., for any CPU, there is some number of processes at which point the addition of more processes will decrease responsiveness. In order to address this without reducing $q$ and paying undue penalties in the way of context switches, we can either:
- upgrade our CPU so we can context switch faster, allowing us to decrease $q$ without compromising turnaround time;
- utilize multiple processors to decrease the load per CPU.
Both of these are copout answers. Given the constraints of fixed hardware, we must revisit our OS/scheduler's design if we want to milk out any more throughput. The next step would be to add more granular priorities to our processes, perhaps weighted towards user window focus, applications running in the "foreground," and other critical OS processes.
Priority Scheduling
As observed previously, vanilla priority scheduling is still subject to starvation. We can combine round robin with priority scheduling, but this will still falter under a heavy load of high priority processes which may prevent lower priority processes from ever being allocated.
One fix we might consider is to adjust the priority of a process based on the amount of time that it has spent in the ready queue in order to guarantee that even the lowest of priority jobs will eventually mature into a high enough priority (even if only until its next IO burst).
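A minimal sketch of such aging, where a process's effective priority improves with the time it has spent waiting; the aging rate and priority scale are arbitrary:
// Lower value = higher priority, as in the EFT discussion above.
#[derive(Debug)]
struct ReadyProcess {
    pid: u16,
    base_priority: u32,
    waited_ms: u32,
}

impl ReadyProcess {
    // Every 100ms of waiting "ages" the process one priority level upward.
    fn effective_priority(&self) -> u32 {
        self.base_priority.saturating_sub(self.waited_ms / 100)
    }
}

fn main() {
    let mut ready = vec![
        ReadyProcess { pid: 1, base_priority: 10, waited_ms: 0 },
        ReadyProcess { pid: 2, base_priority: 50, waited_ms: 4_500 }, // starving CPU bound job
    ];
    // pick the process with the best (lowest) effective priority
    ready.sort_by_key(|p| p.effective_priority());
    println!("next up: pid {}", ready[0].pid); // pid 2's age now beats pid 1's base priority
}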
However, sorted insertion into just a single queue can become costly for large systems with many processes, as a binary search must be executed every time a process enters the ready queue. Thus, we have motivation to introduce different queues for different tiers of priority.
Multilevel Queue Scheduling
Suppose we have an $n$-bit priority, distributed across priority classes such that the class is determined by the kind of process (which we denote with the leading 1-bit in the priority), and a granularity specified by the predicted next CPU burst duration.
Under this system, the scheduler will traverse the queues in order of priority, selecting an available job from the first non-empty queue. To account for new possibilities of higher order starvation, we can also implement a round robin mechanism on top of this multilevel queue, rotating the "head" of the scheduler across the queues (even if they still have ready processes available for selection) and lingering for a number of periods proportional to that queue's priority class.
We can also reprioritize a process each time it enters the ready state, for example – bringing an application from the background to the foreground should increase its priority. For non-interactive processes, we can infer the priority class based on the shot clock for that class (since, while the on-board shot clock exists as a singular timer, the scheduler can maintain different values for each queue). For example, we can initialize each process to the highest priority class, and then gradually deprioritize it as needed if it behaves like a CPU bound process relative to that queue's period. Otherwise, if its CPU burst concludes before it would get evicted, it can return to that same ready queue.
As mentioned, within each ready queue, we can order processes by predicted next burst time. With priority aging, this scheduling approach still guarantees that low-initial-priority CPU bound processes will be executed eventually, but this sorting does indeed penalize them.
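Here's a compressed sketch of the feedback mechanism described above: new processes enter the highest priority queue, and anything that exhausts that queue's quantum gets demoted a level; the number of levels and the quanta are arbitrary:
use std::collections::VecDeque;

// (pid, remaining burst in ms)
type Job = (u16, u32);

struct MultilevelFeedbackQueue {
    // index 0 is the highest priority; each level has its own quantum
    levels: Vec<VecDeque<Job>>,
    quanta: Vec<u32>,
}

impl MultilevelFeedbackQueue {
    fn new(quanta: Vec<u32>) -> Self {
        Self { levels: quanta.iter().map(|_| VecDeque::new()).collect(), quanta }
    }

    // New (or newly foregrounded) processes enter the top queue.
    fn admit(&mut self, job: Job) {
        self.levels[0].push_back(job);
    }

    // Run the first job from the highest non-empty queue for one quantum.
    fn dispatch(&mut self) -> bool {
        for level in 0..self.levels.len() {
            if let Some((pid, remaining)) = self.levels[level].pop_front() {
                let quantum = self.quanta[level];
                if remaining > quantum {
                    // behaved like a CPU bound process at this level: demote it
                    let next = (level + 1).min(self.levels.len() - 1);
                    self.levels[next].push_back((pid, remaining - quantum));
                } else {
                    println!("pid {pid} completed its burst at level {level}");
                }
                return true;
            }
        }
        false // all queues empty
    }
}

fn main() {
    let mut mlfq = MultilevelFeedbackQueue::new(vec![8, 16, 32]);
    mlfq.admit((1, 5));   // IO bound: finishes within the top quantum
    mlfq.admit((2, 100)); // CPU bound: gets demoted level by level
    while mlfq.dispatch() {}
}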
In practice
It's worth noting as well that IO bound processes are not necessarily interactive applications. Recall that IO refers to any interaction with an external device. The majority of processes are IO bound to some extent. Additionally, CPUs are fast. Even with hundreds of processes running concurrently under light load, the majority of a modern computer's CPUs will be idle:
This implies that the higher priority ready queues are empty most of the time, and quickly scheduled when needed – responding in milliseconds. Several multi-core processors typify modern personal computers, so many of the chokepoints/drawbacks of the aforementioned scheduler designs are obsolete outside of e.g. benchmarking scenarios. Nevertheless, modern CPUs, OSes, and their schedulers have their own modern problems. Nowadays, schedulers manage individual threads, rather than whole processes, which allows for the "illusion" of concurrency on a single-core'd CPU internal to a single process itself.
Sophisticated processes will divide compute tasks by thread, with CPU bound-like tasks being delegated to their own thread(s), segregated from the IO/interactive functionality which can be prioritized accordingly. This allows for CPU intensive applications to remain responsive by prioritizing UI thread(s) while still allowing for multi-queue feedback scheduling to appropriately handle the CPU intensive portions "in the background."
We also neglected to enhance our representation of the waiting queue for this post, but it –of course– is never just a singular FIFO queue. It makes far more sense to organize waiting queues by the IO device/nature of the event the processes within are awaiting since it doesn't make sense to cause a waiting queue pileup if a lackadaisical disk IO task is in front of a GPU task.
Appendix I: Optimality of EFT for Interval Scheduling
Suppose that $G$ is the set of jobs computed by the Earliest Finish Time algorithm, with $|G| = k$. We can sort the jobs in ascending order of finish time such that the jobs are ordered

$$g_1, g_2, \dots, g_k$$

As a result of this ordering, we know that for every $i < j$, the finishing times are related by the inequality $f(g_i) \le f(g_j)$, where $f$ assigns to each job its finish time.

Now, suppose that EFT has not produced an optimal schedule. This implies that there exists some other set of jobs $O$ with $m > k$ jobs. Since $O$ is a solution to the problem, the jobs it contains must be mutually compatible. We can sort $O$ the same way we did $G$ in order to better compare the properties of the two schedules. Let

$$o_1, o_2, \dots, o_m$$

be the jobs in this order, allowing us to infer that if $i < j$, then $f(o_i) \le f(o_j)$.

The key idea now is to compare the jobs at the same indices in both $G$ and $O$. By construction, they must differ at some index (having different starting and/or ending times), otherwise we would have $G = O$. Let $d$ be the first index where they differ, i.e. $g_i = o_i$ for every index $i < d$.

Note that it is possible that $d = 1$, that is, the schedules differ from the very first job. Next, we replace $o_d$ with $g_d$ and show that $O$ still contains a compatible set of jobs such that the smallest index where $O$ differs from $G$ "bubbles up" by at least one index. We have three cases to consider:
Case 1: $d = 1$

Suppose $o_1$ finished before $g_1$, like so:

We can make some productive observations. Since $g_1$ is the first job scheduled by EFT, its finish time must be the smallest among all jobs in the input. Therefore, we can be sure that:

$$f(g_1) \le f(o_1)$$

In other words, the situation illustrated in Figure 1(a) above is not possible. Moreover, since the jobs in $O$ are mutually compatible, we have:

$$f(o_1) \le s(o_2)$$

where $s$ is the function assigning the start time to a job in the schedule. Chaining these inequalities together, we get the scenario in Figure 1(b) where:

$$f(g_1) \le f(o_1) \le s(o_2)$$

Therefore, if we replace $o_1$ with $g_1$, then the jobs in

$$\{g_1, o_2, o_3, \dots, o_m\}$$

remain mutually compatible, as illustrated in Figure 1(c).
Case 2: $2 \le d \le k$

Next, suppose that the smallest index $d$ where $O$ differs from $G$ is somewhere inclusively between the 2nd index and the last index: $2 \le d \le k$. Recall that this implies that $g_i = o_i$ for every index $i < d$.

We can make a similar argument as before in two parts (since we now must consider the jobs preceding $o_d$). First, we can assert that

$$f(g_{d-1}) = f(o_{d-1})$$

since they're the same job. Moreover, since the jobs in $O$ are mutually compatible, we also have:

$$f(o_{d-1}) \le s(o_d)$$

Combining these inequalities, we get:

$$f(g_{d-1}) \le s(o_d)$$

Therefore, $o_d$ is compatible with $g_{d-1}$ and would have been in the list of jobs available to the EFT algorithm when it selected $g_d$. Since EFT selects the available job with the smallest finishing time, we can conclude that:

$$f(g_d) \le f(o_d)$$

For jobs following $o_d$ in $O$, we know that they all must be compatible with $o_d$, i.e. they start no earlier than $f(o_d)$. Since we have just shown that $f(g_d) \le f(o_d)$, we can conclude that $g_d$ is also compatible with all jobs after $o_d$. In other words, if we replace $o_d$ with $g_d$ in $O$, the set of jobs in $O$ will remain mutually compatible.
We can iterate this exchange argument for every index where $G$ and $O$ differ. We must, however, make this argument index-by-index, starting at the smallest index and working our way forwards in order to guarantee the base assertion that $g_{d-1} = o_{d-1}$, leaving us with $o_i = g_i$ for all indices except possibly those greater than $k$.

Case 3: $d > k$

As long as the index of the differing job is less than or equal to $k$, we can exchange the job in $O$ with the job in $G$. Therefore, we can ensure that this sequence of jobs is mutually compatible (noting the change at index $k$):

$$g_1, g_2, \dots, g_k, o_{k+1}, \dots, o_m$$

We still have not precluded the possibility that $m > k$, implying that $O$ has higher throughput than $G$ ~somehow, despite being unable to pin down how. Fortunately, it is easy to address this (im)possibility. If $O$ is indeed structured as above, then $o_{k+1}$ is compatible with $g_k$. Therefore, after EFT selected $g_k$, it would not yet have processed all of the jobs, which contradicts our assumption that EFT resulted in $G$ upon termination. Therefore, $G$ must also have $m$ jobs.