4.4 Thread Scheduling
Traditionally, the FreeBSD scheduler had an ill-defined set of hooks spread through the kernel. In FreeBSD 5.0, these hooks were regularized and a well-defined API was created so that different schedulers could be developed. Since FreeBSD 5.0, the kernel has had two schedulers available:
The historic 4.4BSD scheduler found in the file /sys/kern/sched_4bsd.c. This scheduler is described at the beginning of this section.
The new ULE scheduler first introduced in FreeBSD 5.0 and found in the file /sys/kern/sched_ule.c [Roberson, 2003]. The name is not an acronym. If the underscore in its file name is removed, the rationale for its name becomes apparent. This scheduler is described at the end of this section.
Because a busy system makes thousands of scheduling decisions per second, the speed with which scheduling decisions are made is critical to the performance of the system as a whole. Other UNIX systems have added a dynamic scheduler switch that must be traversed for every scheduling decision. To avoid this overhead, FreeBSD requires that the scheduler be selected at the time the kernel is built. Thus, all calls into the scheduling code are resolved at compile time rather than going through the overhead of an indirect function call for every scheduling decision. By default, kernels up through FreeBSD 5.1 use the 4.4BSD scheduler. Beginning with FreeBSD 5.2, the ULE scheduler is used by default.
The 4.4BSD Scheduler
All threads that are runnable are assigned a scheduling priority that determines in which run queue they are placed. In selecting a new thread to run, the system scans the run queues from highest to lowest priority and chooses the first thread on the first nonempty queue. If multiple threads reside on a queue, the system runs them round robin; that is, it runs them in the order that they are found on the queue, with equal amounts of time allowed. If a thread blocks, it is not put back onto any run queue. If a thread uses up the time quantum (or time slice) allowed it, it is placed at the end of the queue from which it came, and the thread at the front of the queue is selected to run.
The shorter the time quantum, the better the interactive response. However, longer time quanta provide higher system throughput because the system will have less overhead from doing context switches and processor caches will be flushed less often. The time quantum used by FreeBSD is 0.1 second. This value was empirically found to be the longest quantum that could be used without loss of the desired response for interactive jobs such as editors. Perhaps surprisingly, the time quantum has remained unchanged over the past 20 years. Although the time quantum was originally selected on centralized timesharing systems with many users, it is still correct for decentralized workstations today. While workstation users expect a response time faster than that anticipated by the timesharing users of 20 years ago, the shorter run queues on the typical workstation makes a shorter quantum unnecessary.
Time-Share Thread Scheduling
The FreeBSD time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements (e.g., being blocked awaiting an event) and the amount of resources consumed by the thread (e.g., CPU time). Threads are moved between run queues based on changes in their scheduling priority (hence the word feedback in the name multilevel feedback queue). When a thread other than the currently running thread attains a higher priority (by having that priority either assigned or given when it is awakened), the system switches to that thread immediately if the current thread is in user mode. Otherwise, the system switches to the higher-priority thread as soon as the current thread exits the kernel. The system tailors this short-term scheduling algorithm to favor interactive jobs by raising the scheduling priority of threads that are blocked waiting for I/O for 1 or more seconds and by lowering the priority of threads that accumulate significant amounts of CPU time.
Short-term thread scheduling is broken up into two parts. The next section describes when and how a thread's scheduling priority is altered; the section after that describes the management of the run queues and the interaction between thread scheduling and context switching.
Calculations of Thread Priority
A thread's scheduling priority is determined directly by two values associated with the thread structure: kg_estcpu and kg_nice. The value of kg_estcpu provides an estimate of the recent CPU utilization of the thread. The value of kg_nice is a user-settable weighting factor that ranges numerically between -20 and 20. The normal value for kg_nice is 0. Negative values increase a thread's priority, whereas positive values decrease its priority.
A thread's user-mode scheduling priority is calculated after every four clock ticks (typically 40 milliseconds) that it has been found running by this equation:
Values less than PRI_MIN_TIMESHARE (160) are set to PRI_MIN_TIMESHARE (see Table 4.2); values greater than PRI_MAX_TIMESHARE (223) are set to PRI_MAX_TIMESHARE. This calculation causes the priority to decrease linearly based on recent CPU utilization. The user-controllable kg_nice parameter acts as a limited weighting factor. Negative values retard the effect of heavy CPU utilization by offsetting the additive term containing kg_estcpu. Otherwise, if we ignore the second term, kg_nice simply shifts the priority by a constant factor.
The CPU utilization, kg_estcpu, is incremented each time that the system clock ticks and the thread is found to be executing. In addition, kg_estcpu is adjusted once per second via a digital decay filter. The decay causes about 90 percent of the CPU usage accumulated in a 1-second interval to be forgotten over a period of time that is dependent on the system load average. To be exact, kg_estcpu is adjusted according to
where the load is a sampled average of the sum of the lengths of the run queue and of the short-term sleep queue over the previous 1-minute interval of system operation.
To understand the effect of the decay filter, we can consider the case where a single compute-bound thread monopolizes the CPU. The thread's CPU utilization will accumulate clock ticks at a rate dependent on the clock frequency. The load average will be effectively 1, resulting in a decay of
If we assume that the thread accumulates Ti clock ticks over time interval i and that kg_nice is zero, then the CPU utilization for each time interval will count into the current value of kg estcpu according to
Thus, after five decay calculations, only 13 percent of T0 remains present in the current CPU utilization value for the thread. Since the decay filter is applied once per second, we can also say that about 90 percent of the CPU utilization is forgotten after 5 seconds.
Threads that are runnable have their priority adjusted periodically as just described. However, the system ignores threads blocked awaiting an event: These threads cannot accumulate CPU usage, so an estimate of their filtered CPU usage can be calculated in one step. This optimization can significantly reduce a system's scheduling overhead when many blocked threads are present. The system recomputes a thread's priority when that thread is awakened and has been sleeping for longer than 1 second. The system maintains a value, kg_slptime, that is an estimate of the time a thread has spent blocked waiting for an event. The value of kg_slptime is set to 0 when a thread calls sleep() and is incremented once per second while the thread remains in a SLEEPING or STOPPED state. When the thread is awakened, the system computes the value of kg_estcpu according to
and then recalculates the scheduling priority using Eq. 4.1. This analysis ignores the influence of kg_nice; also, the load used is the current load average rather than the load average at the time that the thread blocked.
The priority calculations used in the short-term scheduling algorithm are spread out in several areas of the system. Two routines, schedcpu() and roundrobin(), run periodically. Schedcpu() recomputes thread priorities once per second, using Eq. 4.2, and updates the value of kg_slptime for threads blocked by a call to sleep(). The roundrobin() routine runs 10 times per second and causes the system to reschedule the threads in the highest-priority (nonempty) queue in a round-robin fashion, which allows each thread a 100-millisecond time quantum.
The CPU usage estimates are updated in the system clock-processing module, hardclock(), which executes 100 times per second. Each time that a thread accumulates four ticks in its CPU usage estimate, kg_estcpu, the system recalculates the priority of the thread. This recalculation uses Eq. 4.1 and is done by the resetpriority() routine. The decision to recalculate after four ticks is related to the management of the run queues described in the next section. In addition to issuing the call from hardclock(), each time setrunnable() places a thread on a run queue, it also calls resetpriority() to recompute the thread's scheduling priority. This call from wakeup() to setrunnable() operates on a thread other than the currently running thread. So setrunnable() invokes updatepri() to recalculate the CPU usage estimate according to Eq. 4.3 before calling resetpriority(). The relationship of these functions is shown in Figure 4.5.
Figure 4.5 Procedural interface to priority calculation.
Thread Run Queues and Context Switching
The kernel has a single set of run queues to manage all the thread scheduling classes shown in Table 4.2. The scheduling-priority calculations described in the previous section are used to order the set of timesharing threads into the priority ranges between 160 and 223. The real-time threads and the idle threads priorities are set by the applications themselves but are constrained by the kernel to be within the ranges 128 to 159 and 224 to 255, respectively. The number of queues used to hold the collection of all runnable threads in the system affects the cost of managing the queues. If only a single (ordered) queue is maintained, then selecting the next runnable thread becomes simple but other operations become expensive. Using 256 different queues can significantly increase the cost of identifying the next thread to run. The system uses 64 run queues, selecting a run queue for a thread by dividing the thread's priority by 4. To save time, the threads on each queue are not further sorted by their priorities.
The run queues contain all the runnable threads in main memory except the currently running thread. Figure 4.6 (on page 104) shows how each queue is organized as a doubly linked list of thread structures. The head of each run queue is kept in an array. Associated with this array is a bit vector, rq_status, that is used in identifying the nonempty run queues. Two routines, runq_add() and runq_remove (), are used to place a thread at the tail of a run queue, and to take a thread off the head of a run queue. The heart of the scheduling algorithm is the runq_choose() routine. The runq_choose() routine is responsible for selecting a new thread to run; it operates as follows:
Ensure that our caller acquired the sched_lock.
Locate a nonempty run queue by finding the location of the first nonzero bit in the rq_status bit vector. If rq_status is zero, there are no threads to run, so select the idle loop thread.
Given a nonempty run queue, remove the first thread on the queue.
If this run queue is now empty as a result of removing the thread, reset the appropriate bit in rq_status.
Return the selected thread.
Figure 4.6 Queueing structure for runnable threads.
The context-switch code is broken into two parts. The machine-independent code resides in mi_switch(); the machine-dependent part resides in cpu_switch(). On most architectures, cpu_switch() is coded in assembly language for efficiency.
Given the mi_switch() routine and the thread-priority calculations, the only missing piece in the scheduling facility is how the system forces an involuntary context switch. Remember that voluntary context switches occur when a thread calls the sleep() routine. Sleep() can be invoked by only a runnable thread, so sleep() needs only to place the thread on a sleep queue and to invoke mi_switch() to schedule the next thread to run. Often an interrupt thread will not want to sleep() itself but will be delivering data that will cause the kernel to want to run a different thread than the one that was running before the interrupt. Thus, the kernel needs a mechanism to request that an involuntary context switch be done at the conclusion of the interrupt.
This mechanism is handled by setting the currently running thread's TDF_NEEDRESCHED flag and then posting an asynchronous system trap (AST). An AST is a trap that is delivered to a thread the next time that that thread is preparing to return from an interrupt, a trap, or a system call. Some architectures support ASTs directly in hardware; other systems emulate ASTs by checking an AST flag at the end of every system call, trap, and interrupt. When the hardware AST trap occurs or the AST flag is set, the mi_switch() routine is called, instead of the current thread resuming execution. Rescheduling requests are made by the swi_sched(), resetpriority(), setrunnable(), wake up(), roundrobin(), and schedcpu() routines.
With the advent of multiprocessor support FreeBSD can preempt threads executing in kernel mode. However, such preemption is generally not done, so the worst-case real-time response to events is defined by the longest path through the top half of the kernel. Since the system guarantees no upper bounds on the duration of a system call, FreeBSD is decidedly not a real-time system. Attempts to retrofit BSD with real-time thread scheduling have addressed this problem in different ways [Ferrin & Langridge, 1980; Sanderson et al., 1986].
The ULE Scheduler
The ULE scheduler was developed as part of the overhaul of FreeBSD to support SMP. A new scheduler was undertaken for several reasons:
To address the need for processor affinity in SMP systems
To provide better support for symmetric multithreading (SMT)processors with multiple, on chip, CPU cores
To improve the performance of the scheduling algorithm so that it is no longer dependent on the number of threads in the system
The goal of a multiprocessor system is to apply the power of multiple CPUs to a problem, or set of problems, to achieve a result in less time than it would run on a single-processor system. If a system has the same number of runnable threads as it does CPUs, then achieving this goal is easy. Each runnable thread gets a CPU to itself and runs to completion. Typically, there are many runnable threads competing for a few processors. One job of the scheduler is to ensure that the CPUs are always busy and are not wasting their cycles. When a thread completes its work, or is blocked waiting for resources, it is removed from the processor on which it was running. While a thread is running on a processor, it brings its working setthe instructions it is executing and the data on which it is operatinginto the memory cache of the CPU. Migrating a thread has a cost. When a thread is moved from one processor to another, its in-cache working set is lost and must be removed from the processor on which it was running and then loaded into the new CPU to which it has been migrated. The performance of an SMP system with a naive scheduler that does not take this cost into account can fall beneath that of a single-processor system. The term processor affinity describes a scheduler that only migrates threads when necessary to give an idle processor something to do.
Many microprocessors now provide support for symmetric multithreading where the processor is built out of multiple CPU cores, each of which can execute a thread. The CPU cores in an SMT processor share all the processor's resources, such as memory caches and access to main memory, so they are more tightly synchronized than the processors in an SMP system. From a thread's perspective, it does not know that there are other threads running on the same processor because the processor is handling them independently. The one piece of code in the system that needs to be aware of the multiple cores is the scheduling algorithm. The SMT case is a slightly different version of the processor affinity problem presented by an SMP system. Each CPU core can be seen as a processor with its own set of threads. In an SMP system composed of CPUs that support SMT, the scheduler treats each core on a processor as a less powerful resource but one to which it is cheaper to migrate threads.
The original FreeBSD scheduler maintains a global list of threads that it traverses once per second to recalculate their priorities. The use of a single list for all threads means that the performance of the scheduler is dependent on the number of tasks in the system, and as the number of tasks grow, more CPU time must be spent in the scheduler maintaining the list. A design goal of the ULE scheduler was to avoid the need to consider all the runnable threads in the system to make a scheduling decision.
The ULE scheduler creates a set of three queues for each CPU in the system. Having per-processor queues makes it possible to implement processor affinity in an SMP system.
One queue is the idle queue, where all idle threads are stored. The other two queues are designated current and next. Threads are picked to run, in priority order, from the current queue until it is empty, at which point the current and next queues are swapped and scheduling is started again. Threads in the idle queue are run only when the other two queues are empty. Real-time and interrupt threads are always inserted into the current queue so that they will have the least possible scheduling latency. Interactive threads are also inserted into the current queue to keep the interactive response of the system acceptable. A thread is considered to be interactive if the ratio of its voluntary sleep time versus its run time is below a certain threshold. The interactivity threshold is defined in the ULE code and is not configurable. ULE uses two equations to compute the interactivity score of a thread. For threads whose sleep time exceeds their run time Eq 4.4 is used:
When a thread's run time exceeds its sleep time, Eq. 4.5 is used instead.
The scaling factor is the maximum interactivity score divided by two. Threads that score below the interactivity threshold are considered to be interactive; all others are noninteractive. The sched_interact_update() routine is called at several points in a threads existencefor example when the thread is awakened by a wakeup() callto update the thread's run time and sleep time. The sleep and runtime values are only allowed to grow to a certain limit. When the sum of the run time and sleep time pass the limit, they are reduced to bring them back into range. An interactive thread whose sleep history was not remembered at all would not remain interactive, resulting in a poor user experience. Remembering an interactive thread's sleep time for too long would allow the thread to more than its fair share of the CPU. The amount of history that is kept and the interactivity threshold are the two values that most strongly influence a user's interactive experience on the system.
Noninteractive threads are put into the next queue and are scheduled to run when the queues are switched. Switching the queues guarantees that a thread gets to run at least once every two queue switches regardless of priority, which ensures fair sharing of the processor.
There are two mechanisms used to migrate threads among multiple processors. When a CPU has no work to do in any of its queues, it marks a bit in a bit-mask shared by all processors that says that it is idle. Whenever an active CPU is about to add work to its own run queue, it first checks to see if it has excess work and if another processor in the system is idle. If an idle processor is found, then the thread is migrated to the idle processor using an interprocessor interrupt (IPI). Making a migration decision by inspecting a shared bitmask is much faster than scanning the run queues of all the other processors. Seeking out idle processors when adding a new task works well because it spreads the load when it is presented to the system.
The second form of migration, called push migration, is done by the system on a periodic basis and more aggressively offloads work to other processors in the system. Twice per second the sched_balance() routine picks the most-loaded and least-loaded processors in the system and equalizes their run queues. The balancing is done only between two processors because it was thought that two processor systems would be the most common and to prevent the balancing algorithm from being too complex and adversely affecting the performance of the scheduler. Push migration ensures fairness among the runnable threads. For example, with three runnable threads on a two-processor system, it would be unfair for one thread to get a processor to itself while the other two had to share the second processor. By pushing a thread from the processor with two threads to the processor with one thread, no single thread would get to run alone indefinitely.
Handling the SMT case is a derivative form of load balancing among full-fledged CPUs and is handled by processor groups. Each CPU core in an SMT processor is given its own kseq structure, and these structures are grouped under a kseq group structure. An example of a single processor with two cores is shown in Figure 4.7 (on page 108). In an SMP system with multiple SMT capable processors there would be one processor group per CPU. When the scheduler is deciding to which processor or core to migrate a thread, it will try to pick a core on the same processor before picking one on another processor because that is the lowest-cost migration path.
Figure 4.7 Processor with two cores.