Some Are More Equal Than Others
A decade ago, a desktop scheduler had to deal with a single processor. Very few people were selling desktops with more than one CPU. Be Inc. and one of the Apple clone makers sold dual-processor systems, but everyone else viewed SMP as a technology for workstation and server class systems.
These days, it’s hard to find a single-core system. Even the laptop on which I’m typing this article has two cores, and desktops are starting to have 4–8 without costing the earth. This change places some extra demands on a scheduler.
The simplest way of addressing this extra complexity is to split your processes equally among the available processors. But that’s unlikely to yield good results: you’ll often end up with one CPU running at 100% while the others sit idle. To fix this disparity, you need a mechanism for moving tasks between CPUs.
Here comes another set of conflicting requirements. For good CPU usage, you want to perform this type of load balancing quite often. The problem is that load balancing is very expensive, because it typically requires synchronization between the CPUs. Migrating a task from one CPU to another is also quite expensive. If the CPUs don’t share a common cache, the newly migrated process will generate a large number of cache misses when it starts running on the new CPU.
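The tension between keeping CPUs busy and keeping migrations rare can be captured in a few lines. The following is an illustrative sketch, not any real kernel's algorithm; the threshold value is invented:

```python
# Illustrative sketch: periodically pull tasks from the busiest CPU's run
# queue to the idlest one, but only when the imbalance is large enough to
# justify the cost of the migration (locking, cache misses).

def balance(run_queues, threshold=2):
    """run_queues: dict mapping CPU id -> list of task names."""
    busiest = max(run_queues, key=lambda c: len(run_queues[c]))
    idlest = min(run_queues, key=lambda c: len(run_queues[c]))
    # Do nothing unless the queues differ by more than the threshold:
    # a small imbalance is cheaper to tolerate than to fix.
    while len(run_queues[busiest]) - len(run_queues[idlest]) > threshold:
        run_queues[idlest].append(run_queues[busiest].pop())

queues = {0: ["a", "b", "c", "d", "e"], 1: []}
balance(queues)
print(queues)  # tasks migrate until the imbalance drops to the threshold
```

Raising the threshold or running the balancer less often trades CPU utilization for fewer expensive migrations, which is exactly the conflict described above.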
Relationships between processes are also important. If two processes are in a producer-consumer relationship, it’s more efficient to have both running on the same processor, or on processors with a shared cache, since this plan allows them to communicate without having to go via main memory. For example, this is important when scheduling processes running on modern Intel chips, on which each pair of cores has a shared cache. In a four-core system, migrating a process from one core to another on the same die is a much cheaper operation than moving it to a core on a different die.
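A topology-aware scheduler can encode this as a cost model. Here is a toy sketch for the four-core, two-die system just described; the cost numbers are invented for illustration:

```python
# Hypothetical cost model: cores on the same die share a cache, so moving
# a task between them is far cheaper than moving it across dies.

DIE_OF = {0: 0, 1: 0, 2: 1, 3: 1}  # core -> die

def migration_cost(src, dst):
    if src == dst:
        return 0
    if DIE_OF[src] == DIE_OF[dst]:
        return 1   # shared cache: the task's working set stays mostly warm
    return 10      # different die: the cache must be refilled via main memory

def cheapest_target(src, candidates):
    # Prefer the destination that keeps the task close to its cached state.
    return min(candidates, key=lambda dst: migration_cost(src, dst))

print(cheapest_target(0, [1, 2, 3]))  # core 1 shares a die with core 0
```

The same structure extends naturally to the producer-consumer case: placing both processes on cores where `migration_cost` between them is low keeps their communication out of main memory.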
A similar problem occurs with AMD chips; each has its own on-die memory controller. When a process is running, it typically is allocated memory from the block controlled by the processor on which it’s running. If the process is migrated, all requests to main memory have to go an extra hop along the HyperTransport network, making cache misses slightly more expensive. All of these concerns should be taken into account (ideally) when making migration decisions.
The situation gets even more interesting when the cores are not all the same. The UltraSPARC T1 ("Niagara" to its friends) is a perfect example. Each chip has eight cores, each of which has four contexts. Since the contexts on a single core share execution units, you get very different performances if you schedule four processes to run on separate cores, versus scheduling them all on four contexts of the same core. An added issue comes from the fact that each chip has a single floating-point unit. A dual-socket system running two floating-point-heavy processes runs a lot better if each is running on a separate chip.
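The T1 case suggests a "spread before you stack" placement policy: fill separate cores first, and only double up on a core's contexts when every core already has work. A minimal sketch, using the T1's 8-core, 4-context geometry:

```python
# Breadth-first placement: contexts on the same core share execution
# units, so separate cores are preferred over spare contexts on a busy
# core. Geometry (8 cores x 4 contexts) matches the UltraSPARC T1.

def place_tasks(n_tasks, n_cores=8, contexts_per_core=4):
    """Return a list of (core, context) assignments, breadth-first."""
    slots = [(ctx, core) for ctx in range(contexts_per_core)
             for core in range(n_cores)]
    return [(core, ctx) for ctx, core in slots[:n_tasks]]

# Four tasks land on four different cores, not four contexts of one core.
print(place_tasks(4))  # -> [(0, 0), (1, 0), (2, 0), (3, 0)]
```

A real scheduler would refine this with per-resource knowledge — for instance, keeping the two floating-point-heavy processes from the example on separate chips, since each chip has only one FPU.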
A similar problem in somewhat cheaper hardware has recently gained some attention in the OpenBSD world. Cryptographic coprocessors (typically on the PCI bus) often are much faster for encryption, but they have a fairly heavy setup cost with each use. For simple tasks, it’s faster to perform the encryption operation in software, rather than set up and use the coprocessor. Finding a good heuristic for when the dedicated chip should be used is not a trivial problem.
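The simplest heuristic is a break-even calculation: the coprocessor pays a fixed setup cost per operation but processes each byte faster, so it only wins once the buffer is large enough to amortize the setup. A toy sketch, with all the constants invented:

```python
# Toy break-even heuristic for hardware crypto offload. All timing
# constants are made up for illustration.

SETUP_US = 50.0          # fixed cost to set up the PCI device
HW_US_PER_BYTE = 0.01    # coprocessor per-byte cost
SW_US_PER_BYTE = 0.05    # software per-byte cost

def use_coprocessor(nbytes):
    hw = SETUP_US + HW_US_PER_BYTE * nbytes
    sw = SW_US_PER_BYTE * nbytes
    return hw < sw

print(use_coprocessor(100))    # small buffer: software wins
print(use_coprocessor(10000))  # large buffer: hardware wins
```

The hard part in practice is that the real costs vary with bus contention, algorithm, and key setup, so the crossover point is not a single fixed number — which is why finding a good heuristic is non-trivial.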
So far, most of the requirements for a scheduler have focused on making it fair and efficient, but in some cases it’s beneficial to make it unfair. Some processes are considered more important than others, and they should be given more CPU time. Playing video is a typical example; it can take a significant amount of CPU time—more than a fair algorithm would want to give—but the user on a desktop or laptop machine probably considers this sacrifice worthwhile, in order to avoid dropping frames. Similarly, certain important accounting tasks on a server system might be prioritized. At the opposite end of the spectrum are things like SETI@home, which should run only when the CPU has nothing more important to do.
In the traditional UNIX scheduler, the priority of a running process decays as it runs. Processes that consume their full allocation gradually lose their priority so that others can run. This happy arrangement is skewed by the nice() system call, however, which allows the initial priority value to be increased or decreased.
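The flavor of this scheme can be sketched in a few lines. This is a deliberate simplification, not the exact 4.3BSD formula: recent CPU usage raises the effective priority value (lower values run sooner), that usage decays over time, and nice() shifts the starting point:

```python
# Simplified decay-based priorities, in the spirit of the traditional
# UNIX scheduler. Lower effective value = scheduled sooner. The constants
# and formula are illustrative, not the real BSD arithmetic.

def effective_priority(base, nice, recent_cpu):
    # nice > 0 makes a process "nicer" (less favored); nice < 0 favors it.
    return base + nice + recent_cpu // 4

def decay(recent_cpu):
    # Each scheduling period, forget half of the accumulated usage.
    return recent_cpu // 2

cpu = 100
print(effective_priority(20, 0, cpu))  # heavy CPU user: deprioritized
cpu = decay(cpu)
print(effective_priority(20, 0, cpu))  # drifts back toward the base
print(effective_priority(20, 10, 0))   # idle but nice'd: still penalized
```

The decay is what makes the arrangement self-correcting: a process that hogs the CPU is pushed down, but only temporarily.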
It’s important that a scheduler support a mechanism such as nice(), or users won’t be able to perform this kind of tweaking.
So far I’ve talked about the needs of a scheduler, but not really about how it’s implemented. The recently removed Linux scheduler was known as the "O(1) scheduler," to differentiate it from the older one, which ran in O(n) time. In the worst case, the time taken to make a scheduler decision was proportional to the number of processes, which was a problem for Linux systems with very large numbers of processes.
Traditional UNIX schedulers use run queues to determine which process should run next. These are linked lists of data structures representing processes. Every time a process is de-scheduled, it’s placed at the end of the list, and the scheduler iterates over the list to find one that’s ready to run.
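The O(n) behavior falls directly out of that structure. A minimal sketch of the scan (a deque stands in for the kernel's linked list):

```python
# Sketch of a single run queue: scan from the front for the first
# runnable process; rotate blocked ones to the back. In the worst case
# the scan touches every entry, which is why this design is O(n).

from collections import deque

def pick_next(queue, runnable):
    """queue: deque of pids; runnable: set of pids ready to run."""
    for _ in range(len(queue)):
        pid = queue.popleft()
        if pid in runnable:
            return pid        # found one; it leaves the queue to run
        queue.append(pid)     # blocked (e.g. on I/O): rotate to the back
    return None               # nothing is ready to run

rq = deque([1, 2, 3])
print(pick_next(rq, runnable={3}))  # scans past 1 and 2 before finding 3
```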
A slightly better design maintains separate lists. The Linux O(1) scheduler used a priority array—an array of run queues indexed by priority. Since there is a fixed number of priorities, this array can be inspected in constant time. The scheduler simply iterates over the array until it finds a priority with a non-empty queue, and then runs the tasks in that queue in a round-robin fashion.
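A stripped-down sketch of the idea (the real O(1) scheduler also kept a bitmap of non-empty levels so it could skip the scan entirely, plus active and expired arrays; none of that is shown here):

```python
# Sketch of a priority array: one queue per priority level, scanned from
# highest priority (index 0) down. The scan is bounded by the fixed
# number of levels, not by the number of processes, hence O(1).

from collections import deque

NUM_PRIOS = 8  # fixed and small here; real kernels use 100+ levels

class PriorityArray:
    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIOS)]

    def enqueue(self, prio, pid):
        self.queues[prio].append(pid)

    def pick_next(self):
        for q in self.queues:       # at most NUM_PRIOS iterations
            if q:
                pid = q.popleft()
                q.append(pid)       # round-robin within the level
                return pid
        return None

pa = PriorityArray()
pa.enqueue(3, "editor")
pa.enqueue(3, "compiler")
pa.enqueue(7, "batch-job")
print(pa.pick_next())  # "editor": highest non-empty priority level
print(pa.pick_next())  # "compiler": round-robin at the same level
```

"batch-job" at priority 7 never runs while level 3 has work — which is the unfairness-by-design discussed earlier, and the reason the real scheduler moved finished tasks to an expired array.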
The newer Completely Fair Scheduler uses a kind of binary tree known as a red-black tree to store processes. Technically, it’s not O(1)—tree operations take O(log n) time—so its worst-case performance is worse than the older scheduler’s. In the average case, however, it’s believed to perform better.
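The "fair" part of the design is that the tree is keyed on how much virtual runtime each task has received, and the scheduler always runs the leftmost (least-served) task. A sketch of that idea, using a heap as a stand-in for the red-black tree (both give O(log n) insertion and cheap access to the smallest key); the weighting of runtime by priority in the real scheduler is omitted:

```python
# Sketch of CFS-style selection: always run the task that has received
# the least virtual runtime so far. A heap stands in for the kernel's
# red-black tree.

import heapq

def run_one_tick(timeline, slice_ns):
    # timeline: heap of (vruntime, name); smallest key = least-served.
    vruntime, name = heapq.heappop(timeline)
    # The real scheduler scales this charge by the task's priority;
    # here every task is charged uniformly.
    heapq.heappush(timeline, (vruntime + slice_ns, name))
    return name

tl = [(0, "a"), (0, "b")]
heapq.heapify(tl)
order = [run_one_tick(tl, 1) for _ in range(4)]
print(order)  # the two tasks alternate, each getting half the ticks
```

Because a task's key only ever grows, no task can stay leftmost forever, which is what makes the algorithm fair without any explicit decay step.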
There are a lot of conflicting requirements on a scheduler, so any change is likely to make some people unhappy. This is why systems like Xen, Solaris, and FreeBSD allow the user to select between a number of different CPU schedulers.